Case study – Liberating 60 years of Swiss cantonal election data from paper archives

Challenge

In a country like Switzerland, where bureaucracy has been practiced with devotion at every level of government for centuries, systematically collected data of all kinds lies dormant. This data can be of unique value to science, because comparable information has been gathered so meticulously in very few regions of the world.

We psychologists have shown little interest in such data so far. For economists and political scientists, however, it is worth its weight in gold. The only catch is that the vast majority of it was collected before the computer age and is therefore stored only on paper.

How do you build a digital database of impeccable quality from data that sits in large volumes in archives and, on top of that, looks different in every canton of Switzerland?
Exactly this challenge was taken on by Prof. Dr. Mark Schelker and Dr. Lukas Schmid, whose goal was to digitize the results of the cantonal parliamentary elections of the last 60 years. Cloud solutions was able to support the researchers in designing and implementing an optimal technical solution.

Approaches

Automatic text recognition (OCR) is, of course, fairly advanced these days. For the challenge of the cantonal election data, however, OCR was out of the question for several reasons:

  • When thick books are scanned, content near the binding is often distorted, faded, or even slightly cut off. OCR software cannot cope with this.
  • Tables with many separator lines are also a problem for OCR.
  • Older typefaces have a poorer recognition rate.

Post-correcting poor OCR output would have been one option. However, this quickly becomes more laborious than entering the data directly from a simple scan, and it very likely lets misrecognized text slip into the data matrix as errors.
That left only manual data entry. Traditionally, one would probably use Excel for this, which brings several problems:

  • Work of this kind is error-prone due to its repetitive nature, and Excel offers no support for avoiding error sources such as row shifts, mistyped entries, wrong assignments, etc.
  • Manually merging many individual Excel files is a further source of errors.
  • With many Excel files distributed across several typists, there is no ongoing overview of the progress and quality of data entry.

Implemented solution

The solution, developed jointly with the client and programmed by CS, aimed to combine the respective strengths of technology and people in order to maximize data quality. The system had the following features:

  • Clearly structured, software-guided data entry.
  • Avoidance of redundant entry by splitting the work into several entry levels (canton, district election year, candidates).
  • Certain pre-entered, verified data that could be offered directly for selection.
  • Data validation at the moment of input.
  • Built-in quality checks (comparison of perfect, pre-entered records with the entered data).
  • Careful instruction and support of the typists.
  • Additional manual spot checks by the research team.

In this way, around 190,000 candidates were recorded by 30 typists at the end of 2014 and the beginning of 2015, in highly satisfactory quality, spread over 60 years, 4,000 electoral districts, and 15,000 lists.
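As a rough illustration of the built-in quality checks mentioned above, the idea can be sketched as follows; the record structure, names, and vote counts are invented for this example, not taken from the actual system:

```python
# Sketch of the quality-check idea: records that were pre-entered correctly
# ("gold" records) are silently mixed into the work queue, and what the
# typist enters is compared against them.
gold_records = {
    # (candidate, list, election year) -> correct vote count (invented data)
    ("Muster", "Liste A", 1962): 1234,
    ("Beispiel", "Liste B", 1978): 567,
}

def quality_check(candidate, lst, year, entered_votes):
    """Return None for ordinary records, True/False for gold records."""
    key = (candidate, lst, year)
    if key not in gold_records:
        return None  # not a control record: nothing to compare against
    return gold_records[key] == entered_votes
```

A typist whose gold records fail too often can then be re-instructed, which ties in with the careful instruction and support of the typists described above.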

 

The relevance of non-response rates in employee attitude surveys

HR departments and HR consulting companies employ surveys as a methodological means of investigating how an organization can improve as an employer, increase performance, and become more profitable. Often the focus lies on systematically analyzing the staff’s perception of working conditions, job attitudes, health, and other performance-related indicators. In order to understand which aspects need to be improved and to have a sound basis for decisions, high response rates are necessary. Unfortunately, response rates have proven in many instances to be relatively low for surveys addressing the entire organization.

In one of our previous posts we focused on methods which could help to improve the response rate in employee surveys. Now let us focus on how central occupational performance indicators, such as job attitudes, might influence response rates in employee questionnaires. An interesting study by Fauth and his colleagues (2013) focuses on the relationship between employee work attitudes (e.g. job satisfaction or commitment) and non-response rates and how they influence each other.

How non-response rates depict employees’ job attitudes

Whilst previous research on this topic has mainly focused on the relationship between individual employees’ work attitudes and their non-response behavior in surveys, Fauth and his colleagues took a different approach. They were interested in the effects of group-level work attitudes on response rates. Although co-workers and work group members influence the attitudes and perspectives of other employees, the relationship between the job satisfaction of an entire team or unit within an organization and its survey response behavior had previously been neglected (Cropanzano & Mitchell, 2005). From a practical perspective, such knowledge is crucial, as survey feedback processes in companies almost always operate on aggregated levels (e.g. team, business unit) and not on the individual level. Thus, Fauth et al. (2013) addressed this need for group-level analysis of non-response rates in organizational surveys. They hypothesized that aggregated job satisfaction is positively related to survey response rates at the work group level.

As social exchange theory (Cropanzano & Mitchell, 2005) underlines, individuals are willing to invest more effort and energy when they are content. Applied to the work sphere, this means that satisfied employees are willing to invest their energy in additional non-work-related tasks, such as completing employee surveys. This form of Organizational Citizenship Behavior (OCB; Rogelberg et al., 2003) explains the previously detected positive relationship between job satisfaction and response rates in employee attitude surveys at the individual level (Klein et al., 1994).

In order to test whether job satisfaction is also positively related to survey response rates at the work group level, Fauth et al. (2013) conducted large-scale follow-up employee surveys in four distinct companies in 2002, 2004, and twice in 2006. The 1,120 participating employees were grouped into 46 work groups of approximately 24 employees each. Aggregated job satisfaction was assessed via a multi-item measure, the Job Descriptive Index. The results show that work groups with greater combined job satisfaction had significantly higher response rates. Furthermore, the study showed that, independent of this satisfaction effect, smaller teams and teams with more heterogeneity in tenure and gender had higher response rates. Intriguingly, no difference in response rate was found between blue-collar and white-collar workers.

This points to an interesting avenue in organizational survey research: not only employees’ answers to survey questions are relevant for organizations when assessing a group’s perception of its employment situation, but also their response rates. Specifically, higher response rates could indicate greater general job satisfaction and serve as an interesting indirect indicator of a working unit’s overall attitude towards its job and its organization.

 

Handy data cleaning tool – CSV fingerprints

Recently I stumbled upon a handy little tool that may be interesting for everyone working with data in tables. An important but often tedious task is the cleaning of your dataset before you can actually start running statistical analyses. During this cleaning or mastering process you may find artifacts like the following:

  • Entries with unexpected data types: When test takers were expected to describe something in prose but a few entered a number instead.
  • Empty cells where no missing values are allowed: Maybe a mistake when entering paper pencil data manually.
  • A sudden shift of cell values to the right, causing many values to fall into the wrong column: This happens when separator characters appear in the data itself.

If you’ve ever worked with larger sets of data, you surely know these or similar problems and know from experience how hard they can be to spot.

CSV Fingerprints gives you a quick first visual impression of your data and can therefore save you a lot of time. Victor Powell, the author of this handy tool, explains CSV Fingerprints in more detail on his blog. There is also a full-screen version of the tool available.

Tip: Don’t copy & paste data directly from Excel; always copy the CSV from a text editor.
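For everyone curious how such a fingerprint works in principle, here is a minimal sketch of the idea (this is not Powell’s implementation): every cell is mapped to a one-character type code, so shifted columns, stray numbers, and unexpected empty cells stand out at a glance.

```python
import csv
import io

def fingerprint(csv_text):
    """Map each CSV cell to a type code: '.' empty, '#' numeric, 'a' text."""
    def code(cell):
        if cell.strip() == "":
            return "."      # empty cell
        try:
            float(cell)
            return "#"      # numeric cell
        except ValueError:
            return "a"      # text cell
    rows = csv.reader(io.StringIO(csv_text))
    return "\n".join("".join(code(c) for c in row) for row in rows)

print(fingerprint("name,age\nAlice,34\nBob,\n7,cat"))
```

In the output, each CSV row becomes one line of type codes; a row whose pattern deviates from its neighbors (here the empty age and the shifted last row) is a candidate for exactly the artifacts listed above.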

 

How the internet changes us and our science

In recent years, web-based scientific research has been expanding and reinventing itself constantly. Publications and research articles in the Journal of Personality and Social Psychology conducted via web-based tools increased by about 543% from 2008 to 2009 (Denissen, Neumann, & van Zalk, 2010).

With near-universal internet access in most of the developed world (e.g. 90% of Sweden’s population has daily access to the internet, as the Internet World Stats reports from 2001 to 2009 show), the newest technology not only affects us on a daily basis, but also shapes our social interactions and the way in which we conduct research. In addition to offline psychological data collection via questionnaires and experiments, web-based research through online surveys, apps, and special web applications can facilitate and amplify scientific data collection.

Making use of these new technological opportunities, research in psychology and other social sciences has therefore become more virtual and online-based. We collect data about ourselves and the world around us online, answer questionnaires on our phones while traveling home, or participate in diary studies before going to bed.

Online web-based data collection offers many advantages to scientific research. Most importantly:

  1. Data can be collected more easily and economically.
  2. Entered data can be validated in real time and the user can be prompted for correction.
  3. Data anonymity can be guaranteed if researchers assure the anonymous and separate storage of participants’ answers and their ID codes.
  4. Researchers can reach a more representative sample much more easily, especially when distributing their surveys via various social media platforms.
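Point 2 above can be made concrete with a small sketch; the field, the plausible range, and the messages are invented for illustration:

```python
# Sketch of real-time input validation: check an answer the moment it is
# entered and prompt the participant for correction (invented field/rules).
def validate_age(raw):
    """Return (ok, message). Accepts integer ages in a plausible range."""
    try:
        age = int(raw)
    except ValueError:
        return False, "Please enter your age as a whole number."
    if not 16 <= age <= 99:
        return False, "Please enter an age between 16 and 99."
    return True, ""

print(validate_age("34"))   # accepted
print(validate_age("abc"))  # rejected with a correction prompt
```

Because the check runs while the participant is still present, the error can be corrected immediately instead of surfacing later during data cleaning.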

In their brilliant article “How the internet is changing the implementation of traditional research methods, people’s daily lives, and the way in which developmental scientists conduct research”, Denissen, Neumann and van Zalk (2010) explain the opportunities and challenges that the new generation of online research provides. They explain why web-based research has risen to such popularity in the past decade and what is needed to conduct it.

The authors do not avoid the challenges of these new possibilities either: challenges that range from the secure storage of participants’ data and secure data transmission to online communication and the need for extensive testing and debugging of online tools.

Hand in hand with these opportunities comes a change in how we interact with other people in our offline world. As many researchers in the field of cyberpsychology underline, the frequent use of technology and the internet shapes our interpersonal communication and interactions. The massive wealth of data individuals leave on the internet, particularly on social media platforms such as Facebook or Google+, is used to investigate personality factors and their impact on various outcomes. The existence of this data enables scientists to investigate all kinds of hypotheses, ranging from how personality affects consumer behavior to how the use of social media is associated with depression and loneliness.

For those interested in more information on the advantages and pitfalls of online data collection, we highly recommend reading Denissen, Neumann and van Zalk’s (2010) article.

Book recommendation: Longitudinal data analysis using structural equation models

In the wake of our recent posts about longitudinal studies, we’d like to recommend a recently published book by John J. McArdle and John R. Nesselroade.


Longitudinal studies are on the rise, no doubt. Properly conducting longitudinal studies and then analyzing the data can be a complex undertaking. John McArdle and John Nesselroade focus on five basic questions that can be tackled with structural equation models when analyzing longitudinal data:

  • Direct identification of intraindividual changes.
  • Direct identification of interindividual differences in intraindividual changes.
  • Examining interrelationships in intraindividual changes.
  • Analyses of causes (determinants) of intraindividual changes.
  • Analyses of causes (determinants) of interindividual differences in intraindividual changes.

I find it especially noteworthy that the authors put an emphasis on factorial invariance over time and on latent change scores. In my view, this makes the book a must-read for anyone who wants to become a longitudinal data wizard.

Need another argument? Afraid of cumbersome mathematical language? Here is what the authors say about it: “We focus on the big picture approach rather than the algebraic details.”

 

Cause and effect: Optimizing the designs of longitudinal studies

A rising number of longitudinal studies have been conducted and published in industrial and organizational psychology recently. Although this is a pleasing development, it needs to be considered that most of the published studies are still cross-sectional in nature and thus are far less suited for establishing causal relationships. A longitudinal study can potentially provide insights into the direction of effects and the size of effects over time.

Despite their advantages, designing longitudinal studies requires careful consideration and poses tricky theoretical and methodological questions. As Taris and Kompier put it in their editorial to volume 28 of the journal Work & Stress: “…they are no panacea and could yield disappointing and even misleading findings…”. The authors focus on two crucial challenges in longitudinal designs that have a strong impact on detecting the true effects among a set of constructs.

Choosing the right time lags in longitudinal designs

Failing to choose the right time lag between two consecutive study waves leads to biased estimates of effects (see also Cole & Maxwell, 2003). If the study interval is much shorter than the true interval, the cause does not have sufficient time to affect the outcome. Conversely, if the study interval is too long, the true effects may already have vanished. Thus, the estimated size of an effect is strongly linked to the length of the interval between two consecutive measurement waves.


The chosen interval should correspond as closely as possible to the true underlying interval. This requires thorough a priori knowledge of, or reasoning about, the possible underlying causal mechanisms and time lags before conducting a study. What to do when deducing or estimating an appropriate time lag is not possible? Taris and Kompier (2014) suggest “that researchers include multiple waves in their design, with relatively short time intervals between these waves. Exactly how short will depend on the nature of the variables under study. This way they would maximize the chances of including the right interval between the study waves”. To improve longitudinal research further, the authors propose that researchers report their reasoning for choosing a particular time lag. This would explicitly make temporal considerations what they are: a central part of the theoretical foundation of a longitudinal study.
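The effect of a mismatched interval can be illustrated with a small simulation. This is a toy model with invented parameters, not data from the article: X drives Y with a true lag of one time step, and sampling at a longer interval makes the estimated effect all but disappear.

```python
import numpy as np

# Toy model: X influences Y with a true causal lag of exactly one time step.
rng = np.random.default_rng(0)
n_people, n_steps = 500, 40
x = rng.normal(size=(n_people, n_steps))
y = np.zeros((n_people, n_steps))
for t in range(1, n_steps):
    y[:, t] = 0.5 * x[:, t - 1] + rng.normal(size=n_people)

def lagged_corr(lag):
    """Correlation between X at time t and Y at time t + lag."""
    return np.corrcoef(x[:, :-lag].ravel(), y[:, lag:].ravel())[0, 1]

print(lagged_corr(1))  # interval matches the causal lag: a clear effect
print(lagged_corr(4))  # interval too long: the effect has largely vanished
```

The same cause-effect relationship looks strong or negligible depending purely on the spacing of the measurement waves, which is exactly the bias Taris and Kompier warn about.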

Considering reciprocal effects in longitudinal designs

Building on one of their former articles, Taris and Kompier (2014) opt for full panel designs, meaning that the presumed independent variable as well as the presumed outcome are measured at all waves. Such a design allows testing for reciprocal effects. Not considering existing reciprocal effects in longitudinal analyses may again lead to biased estimates of effects.

 

A helpful checklist for conducting and publishing longitudinal research

Longitudinal research has increased greatly over the past 20 years, driven by the development of new theories and methodologies. Nevertheless, the social sciences are still dominated by cross-sectional designs or by deficient longitudinal research, because many researchers lack guidelines for conducting longitudinal research that is adequate for interpreting the duration of and change in constructs and variables.

To create a more systematic approach to longitudinal research, Ployhart and Ward (2011) have created a quick start guide on how to conduct high quality longitudinal research.

The following information refers to three stages: the theoretical development of the study design, the analysis of longitudinal results, and relevant tips for publishing the research. The most relevant information provided by the authors is summarized below in the form of a checklist that can help you improve your research ideas and design:

Why is longitudinal research important?

It not only helps to investigate the relationship between two variables over time, but also allows researchers to disentangle the direction of effects. It also helps to investigate how a variable changes over time and how long that change lasts. For instance, one might investigate how the job satisfaction of new hires changes over time and whether certain features of the job (e.g., feedback from the supervisor) predict the form of change. Such questions can only be analyzed through longitudinal investigation with repeated measurements of the construct. In order to study change, at least three waves of data are necessary for a well-conducted longitudinal study (Ployhart & Vandenberg, 2010).

What sample size is needed to conduct longitudinal research?

Since the estimation of power is a complex issue in longitudinal research, the authors give a rather general answer to this question: “the answer to this is easy—as large as you can get!” However, they also offer a useful rule of thumb. Statistical power depends, among other things, on the number of subjects and on the number of repeated measures: “If one must choose between adding subjects versus measurement occasions, our recommendation is to first identify the minimum number of repeated measurements required to adequately test the hypothesized form of change and then maximize the number of subjects.”

When to administer measures?

When studying change over time, the timing of measurement is crucial (Mitchell & James, 2001). The measurement spacing should adequately capture the expected form of change: spacing will be different for linear change than for non-linear (e.g., exponential or logarithmic) change. Such thinking is still contrary to common practice. Most study designs rely on evenly spaced measurement occasions and pay little attention to the type of change under study. However, it is important that measurement waves occur frequently enough and cover the theoretically important temporal parts of the change. This requires careful theoretical reasoning beforehand. Otherwise, the statistical models will over- or underestimate the true nature of the changes under study.

Be it a longitudinal study or a diary study, the software of cloud solutions can handle any timing and frequency of measurement occasions. The flexibility of our online solutions stems from an “event flow engine” that is based on neural networks.

What to do about missing data?

The statistical analysis of longitudinal research can become complex. One particular challenge in longitudinal data is the treatment of missing data. Since longitudinal studies often suffer from high dropout rates, missing data is a very common phenomenon. Here you find recommendations to reduce missing data before and during data collection. When conducting surveys in organizations, one way to enhance the response rate is to make sure that the company allows its employees to complete the survey during working hours. A specific technique to reduce the burden on individual participants while still measuring frequently over a longer time is planned missingness.

When it comes to handling missing data in statistical analyses, the most important question is whether the data are missing at random or not. If the data are missing at random, there is not much to worry about: full information maximum likelihood estimation will provide unbiased estimates despite the missing data points. If the data are not missing at random, more sophisticated analytical techniques may be required. Ployhart and Ward (2011) recommend Little and Rubin (2002) for further reading on this issue.
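A tiny simulation (with invented numbers) illustrates why the distinction matters: when values go missing completely at random, the observed mean stays unbiased, but when high scorers drop out systematically, it does not.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(loc=50, scale=10, size=100_000)  # true population mean: 50

# Missing completely at random: every value has the same dropout chance.
mcar = y[rng.random(y.size) > 0.3]

# NOT missing at random: high scorers (y >= 55) mostly drop out.
mnar = y[(y < 55) | (rng.random(y.size) > 0.6)]

print(mcar.mean())  # stays close to the true mean of 50
print(mnar.mean())  # biased clearly downward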

Which analytical method to use?

Simply put, there are three statistical frameworks that can be used to model longitudinal data.

  • Repeated measures General Linear Model: useful when the focus of interest lies on mean changes within persons over time and missing data is unproblematic.
  • Random coefficient modeling: useful when one is interested in between-person differences in change over time, especially when the growth models are simple and the predictors of change are static.
  • Structural equation modeling: also useful for between-person differences in change over time, especially with more complex growth models, including time-varying predictors, dynamic relationships, or mediated change.

The following table from Ployhart and Ward (2011) gives a more detailed insight into the application of the three methods:

Use the following method… …when these conditions are present:

Repeated measures general linear model:
  • Focus on group mean change
  • Identify categorical predictors of change (e.g. training vs. control group)
  • Assumptions about residuals are reasonably met
  • Two waves of repeated data
  • Variables are highly reliable
  • Little to no missing data

Random coefficient modeling:
  • Focus on individual differences in change over time
  • Identify continuous or categorical predictors of change
  • Residuals are correlated, heterogeneous, etc.
  • Three or more waves of data
  • Variables are highly reliable
  • Model simple mediated or dynamic models
  • Missing data are random

Structural equation modeling:
  • Focus on individual differences in change over time
  • Identify continuous or categorical predictors of change
  • Residuals are correlated, heterogeneous, etc.
  • Three or more waves of data
  • Want to remove unreliability
  • Model complex mediated or dynamic models
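The core idea behind the two individual-differences approaches can be sketched with a short simulation (invented parameters, deliberately simplified): each person has their own linear trajectory, and with at least three waves the individual slopes, and hence between-person differences in change, can be recovered.

```python
import numpy as np

rng = np.random.default_rng(1)
# Each of 200 people has their own growth slope drawn around a mean of 0.8,
# measured at 4 evenly spaced waves with some measurement noise.
n_people, waves = 200, 4
time = np.arange(waves)
true_slopes = rng.normal(loc=0.8, scale=0.3, size=n_people)
y = 2.0 + true_slopes[:, None] * time + rng.normal(scale=0.5, size=(n_people, waves))

# Per-person OLS slope: with three or more waves each trajectory is identified.
est_slopes = np.polyfit(time, y.T, deg=1)[0]
print(est_slopes.mean())  # recovers the average growth rate
print(est_slopes.std())   # reflects between-person differences in change
```

Random coefficient and structural equation models estimate exactly this kind of slope distribution in one step, with proper handling of measurement error and missing data, rather than fitting each person separately as this sketch does.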

How to make a relevant theoretical contribution worth publishing?

When publishing longitudinal research, you should always describe why your longitudinal design explains the constructs and their relationships better than equivalent cross-sectional designs. You should also underline the strengths of your study design compared to previous ones. Try to go through the following questions when justifying why your research is worth publishing:

  • Have you developed hypotheses from a cross-sectional or from a longitudinal theory?
  • Have you explained why change occurs in your constructs?
  • Have you described why you measured the variables at various times and how this constitutes a sufficient sampling rate?
  • Have you considered threats to internal validity?
  • Have you explained how you reduced missing data?
  • Have you explained why you chose this analytical method?

cloud solutions wishes you success with your longitudinal research!

 

How do I get a high response rate in my study?

Response rates in questionnaire studies – findings from a meta-analysis of over 2,000 surveys.

Research in organizations relies heavily on questionnaire studies. This carries the risk that a substantial share of the targeted population does not respond. Low response rates cause problems when generalizing results to the population under study (insufficient external validity). Small samples due to too few respondents additionally increase the risk of low statistical power and limit the kinds of statistical techniques that can be applied. Some researchers assume that the popularity of questionnaire studies in recent years has aggravated these risks.

For an optimal design of studies in organizations, two questions therefore stand out:

  • Have response rates in questionnaire studies declined in recent years?
  • If so, which techniques for increasing response rates are particularly effective today?

Answers to these questions are provided by a meta-analysis by Frederik Anseel, Filip Lievens, Eveline Schollaert, and Beata Choragwicka, published in the Journal of Business and Psychology. The authors analyzed over 2,000 questionnaire studies published between 1995 and 2008 in scientific journals of industrial and organizational psychology, management, and marketing. It is also one of the first studies ever to examine the effect of online questionnaire studies on response rates in an organizational setting.

The study shows the following:

  • The average response rate across the analyzed studies is 52%, with a standard deviation of 24%.
  • Response rates declined slightly between 1995 and 2008 (0.6% per year). This effect was, however, compensated by the increased use of techniques to raise response rates.
  • Across all groups of respondents, the following techniques are effective in increasing response rates: sending advance information before the study starts, personalization (e.g. addressing participants by name), demonstrating the relevance of the topic (increasing salience), using anonymous identification numbers, university or other reputable sponsorship, and distributing the questionnaires in person.
  • Running a study online does not make equal sense for every population. The study shows that an online survey is an effective means of increasing response among “non-managers” (employees without a leadership role). For other groups (e.g. top management), online administration can even lead to lower response rates than a paper survey.
  • Financial incentives are not an effective means of increasing response rates.

In summary, the authors give the following tips:

[Table: response-rate guidelines]

The online solutions from Bright Answer support the application of the techniques mentioned above. The software offers automated sending of personally addressed advance information, study invitations, and reminders; the use of anonymous access codes is also possible.
The software also offers an additional kind of incentive for study participants. At the end of the study, participants can view automatically generated, individual feedback and compare themselves with the average of the other participants or, if available, with other benchmarks. Experience shows that the prospect of individual feedback enables high response rates. Especially in longitudinal studies with many measurement waves (5 or more), this incentive can prevent many dropouts.
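As a rough sketch of how personalization and anonymous identification codes can be combined (the addresses, code format, and helper names are invented for this example, not taken from the actual product):

```python
import secrets

# One-time anonymous access codes allow a personally addressed invitation
# while keeping the stored answers unlinkable to a name.
participants = ["a.muster@example.org", "b.beispiel@example.org"]
codes = {email: secrets.token_urlsafe(8) for email in participants}

# The email-to-code table stays with the study administration only;
# the response database stores nothing but the anonymous code.
for email, code in codes.items():
    print(f"To {email}: your personal access code is {code}")
```

The invitation is personal (which raises response rates), yet the researchers analyzing the responses never see which code belongs to which person.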