The future of the PHP PaaS is here: Our journey to Platform.sh

In our team we’re very confident in our ability to produce high-quality software. For the past decade or so we have constantly improved our skills, our tool stack, our working methods, and consequently our products. We can’t say the same about our sysadmin skills. Not by a long shot.

Sysadmins – because even developers need heroes

For many software developers, sysadmin is a mystery at best, a nightmare more often than not. When you frequently find yourself solving server problems by copy-pasting a series of commands you don’t fully understand, you know you’re not up to the task. This type of server maintenance is definitely not what anyone’s customer deserves, but neither do companies of our size have the resources to hire good system administrators.

The DevOps illusion

Fully managed server solutions aren’t cheap either and don’t provide a lot of flexibility. For a long time we solved the hosting problem by operating a small number of flexibly scalable VPS instances from ServerGrove, who provide unique support, so we could always fall back on their knowledge when needed.

But we arrived at a point where we wanted to be able to spin up individual server instances for each project and each environment and in order for this to be replicable and robust it needed to be automatic. We wanted to be sure that our servers had exactly the dependencies we needed and that we could spin up as many instances as we needed whenever we needed them.

At the same time virtual machines and server provisioning systems started to gain popularity among developers. Everyone and their cat started talking about technologies like Vagrant, Puppet, Ansible, or Docker. There was the promise of a better world where devs would be able to have repeatable server instances and whole infrastructures up in no time and without problems. That, of course, turned out to be an illusion. Server provisioning and containerization are incredibly powerful technologies contributing hugely to the quality of web services and software, but they’re not a replacement for a sysadmin. Quite the contrary actually: in order to build quality containers with a stable and robust provisioning infrastructure, you need, you guessed it, a good system administrator.

PaaS to the rescue?

So, with the sobering realization that Docker and Ansible weren’t going to solve our problem, our attention was drawn to another relatively new phenomenon: the PaaS – platforms which promise to do a lot of the sysadmin for you by providing preconfigured, managed container systems for deploying applications. This was exactly what we and many others needed. So we started looking into these services, specifically those targeting the modern PHP ecosystem, like PhpFog, Pagoda Box, Fortrabbit, etc.

We tested, observed, and evaluated. Several times we thought we’d found a satisfying solution with one of the providers, but something always ruined the fun: instability, lack of flexibility, no writable folders, still in beta, too expensive, you name it. We found a quantum of solace in the fact that others, including prominent members of the PHP community like Phil Sturgeon, felt the same pain. We concluded that it was too early for the PaaS and went into observation mode. Then Platform.sh came along.

PaaS 2.0

Checking them out was more or less routine, along the lines of “Oh, yet another new PHP PaaS product, let’s go see how THEY screwed up.” The promises on the website looked similar to what other providers say, but somewhat more assertive. Who doesn’t like to hear this?

High-availability PHP cloud hosting platform that is fast and simple to use. Stop reinventing DevOps and wasting time on tedious admin chores.

At first I was taken aback by the strong Symfony/Drupal orientation, but after reading some of the documentation it all just sounded too good to give up on already. It seemed like many of the problems of the competition had been solved. I started to get the feeling that Platform.sh might be just what we had been looking for and decided to give it a serious try. The result: minds blown. We realized that Platform.sh had taken the PHP PaaS idea to a whole new level, hopefully spearheading a new generation of PaaS.

A few months later we’re using Platform.sh for all our new projects and are migrating older projects over there too. Phil Sturgeon is right: once you’ve tried a hover-car, you just don’t want to drive a normal car anymore.

What we love about our new deployment and hosting solution

So let me introduce a few of the things we’re most thrilled about when working with Platform.sh.

Literally 0 sysadmin

We’re completely freed of any kind of sysadmin work, but we still have all the control we need over our servers. As with most PaaS solutions, everything is configured in a file that belongs to your project. With Platform.sh this file is called .platform.app.yaml. Here’s an example:

name: example-app
type: php:7.0
build:
    flavor: composer
timezone: Europe/Zurich
# Services this app connects to (defined in .platform/services.yaml)
relationships:
    database: "mysql:mysql"
    redis: "rediscache:redis"
# Persistent disk size in MB
disk: 2048
# Writable directories backed by shared storage
mounts:
    "/temp": "shared:files/temp"
    "/sessions": "shared:files/sessions"
# Build-time dependencies (Ruby gems, global npm packages)
dependencies:
    ruby:
      sass: "3.4.17"
    nodejs:
      gulp: "3.9.0"
      bower: "1.7.1"
# PHP extensions to enable
runtime:
    extensions:
        - redis
# build runs while the application image is assembled, deploy runs on each deployment
hooks:
    build: |
        set -e
        cd
        vendor/bin/phing deploy-db-migrations
        npm install
        bower update
        gulp
    deploy: |
        set -e
        cd
        vendor/bin/phinx migrate --configuration phinx-platform.php
# Scheduled tasks
crons:
    offsite-backup:
      spec: "0 2 * * *"
      cmd: "cd /app/httpdocs ; php index.php cron offsite-backup"
# Front controller plus the static files that may be served directly
web:
  document_root: "/httpdocs"
  passthru: "/index.php"
  whitelist:
      # CSS and Javascript.
      - \.css$
      - \.js$

      # image/* types.
      - \.gif$
      - \.jpe?g$
      - \.png$

The guys at Platform.sh take care of running high-quality containers for all recent versions of PHP as well as HHVM. We just indicate which PHP extensions, Ruby gems, or npm packages we need, and that’s it. As you can see, we can also do a lot of other stuff like mounting writable folders, running scripts during build or deployment, setting up cron jobs, or whitelisting files for public access. No need to think about sysadmin at any point.

Plus, of course, all of this is under version control, so you’ll know the exact server state at every revision of your software. How cool is that?
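
Because the file lives in the repository, ordinary git commands tell you what the server configuration looked like at any point in time. A quick sketch (the tag name is just a hypothetical example):

git log --oneline -- .platform.app.yaml   # every change to the server configuration
git show v1.4.0:.platform.app.yaml        # the exact configuration at a given tag (hypothetical)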

Push to deploy with 0 downtime deployment

On Platform.sh the master branch of your git repository is the live site. Whenever you have an update to your application, run git push platform master and the platform will attempt to build and deploy your project. If anything goes wrong during the build, the app will not be deployed. During deployment all requests to your app are buffered, which means 0 downtime deployments in any case. If the app deploys successfully, the buffered requests are resumed against the updated app; if not, they are resumed against the status quo.
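
For a concrete picture, the day-to-day workflow is little more than the following sketch; the remote name platform matches the command above, and the project git URL placeholder stands for whatever Platform.sh shows for your project:

git remote add platform <your-platform-project-git-url>   # one-time setup
git add -A
git commit -m "Describe your change"
git push platform master   # triggers a build; only a successful build gets deployed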

Git branch = fully independent app instance

This is one of the most awesome features. You can push any branch of your app to Platform.sh and you’ll instantly get a completely independent instance of your whole app with all its containers (PHP, DB, Redis, …).

Imagine you have an older PHP 5.5 app and you want to run it on PHP 7.0 to see what happens. With Platform.sh this is mind-blowingly easy. All you need to do is this (a short command-line sketch follows the list):

  • make a dev branch of your repository, e.g. php-7-dev.
  • change type: php:5.5 to type: php:7.0 in your .platform.app.yaml.
  • commit and push the branch to platform.
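
On the command line, those three steps boil down to something like this (assuming the Platform.sh remote is called platform, as above):

git checkout -b php-7-dev
# edit .platform.app.yaml and change "type: php:5.5" to "type: php:7.0"
git commit -am "Try the app on PHP 7.0"
git push platform php-7-dev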

There you go: you’ll have an instance of your app running on PHP 7.0, with its own web URL and read-only shell access.

What’s more, you can branch and merge directly in the Platform.sh GUI if you wish to.

No add-ons needed

If you know your way around Heroku, you’re familiar with add-ons. Heroku’s approach is distributed, using a marketplace of services, while Platform.sh combines all required elements within a single, testable, and consistent environment. This means Platform.sh provides a growing number of services like MySQL, PostgreSQL, Redis, MongoDB, Elasticsearch, and more out of the box, ready and optimally integrated, running in separate containers. “Batteries included”, as the guys at Platform.sh like to call it.
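
To give an idea of what this looks like in practice, here is a sketch of a .platform/services.yaml that would match the relationships block in the example above; the concrete versions and disk size are assumptions for illustration only:

# .platform/services.yaml
mysql:
    type: mysql:5.5
    disk: 1024

rediscache:
    type: redis:2.8

Each key becomes its own container, and the application reads the connection details for database and redis from the environment at runtime.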

Obviously, when you create a new branch (= instance of your app) all service containers you’re using will be cloned as well.

What about Heroku?

Heroku is one of the pioneers, if not the mother, of mainstream PaaS, and they surely know what they’re doing. They only added dedicated support for PHP in 2014, but they were still the only serious alternative to Platform.sh for us. We’re convinced we could have arrived at a satisfying solution using Heroku as well. For us, though, Platform.sh wins. It provides storage and data management by means of its “batteries included” approach on a level that Heroku can’t match. While technical skills are required to set up Platform.sh, we think their approach is simpler, as well as more consistent and robust.

Conclusion

In addition to all the described advantages of using a good PaaS like Platform.sh, this migration has forced us to significantly advance and fully automate our build and deployment. We’re thrilled with the monumental improvement in our deployment and hosting quality, the productivity boost, and the peace of mind this new approach is giving us.

Case study – Liberating 60 years of Swiss cantonal election data from paper archives

The challenge

In a country like Switzerland, where bureaucracy has been practised with devotion at every level of government for centuries, systematically collected data of all kinds lies fallow – data that can be of unique value to science, because in very few regions of the world has comparable information been gathered so meticulously.

We psychologists have shown little interest in such data so far, but for economists and political scientists it is worth its weight in gold. The catch is that the vast majority of it was collected before the computer age and is therefore stored only on paper.

How do you build a flawless digital data base from data that sits in large volumes in archives and, on top of that, looks different in every Swiss canton?
This is exactly the challenge taken on by Prof. Dr. Mark Schelker and Dr. Lukas Schmid, whose goal was to digitize the results of the cantonal parliamentary elections of the last 60 years. cloud solutions was able to support the researchers in designing and implementing an optimal technical solution.

Approaches

Optical character recognition (OCR) is, of course, fairly advanced these days. For the challenge of the cantonal election data, however, OCR was not an option, for several reasons:

  • When scanning thick books, the content near the binding is often distorted, faded, or even slightly cut off. OCR software cannot cope with that.
  • Tables with many separator lines are also a problem for OCR.
  • Older typefaces have a lower recognition rate.

Manually correcting poor OCR output would have been one option. However, this quickly becomes more laborious than entering the data directly from a simple scan, and it makes it very likely that misrecognized text ends up as errors in the data matrix.
That left manual data entry. Traditionally one would probably use Excel for this, which brings several problems of its own:

  • Work of this kind is error-prone due to its repetitive nature, and Excel offers no support for avoiding the various sources of error such as shifted rows, wrong entries, wrong assignments, and so on.
  • Manually merging many individual Excel files is a further source of errors.
  • With many Excel files distributed across several data-entry workers, there is no ongoing overview of the progress of data entry or the quality of the data.

The implemented solution

The solution, conceived together with the client and programmed by CS, aimed to combine the respective strengths of technology and humans in order to maximize the quality of the data. The resulting system had the following characteristics:

  • Clearly structured, software-guided data entry.
  • Avoidance of redundant entry by splitting the work into several entry levels (canton, district election year, candidates).
  • Certain pre-entered data that could be offered, already correct, for selection.
  • Data validation at input.
  • Built-in quality checks (comparison of perfect, pre-entered records with the entered data).
  • Careful instruction and support of the data-entry workers.
  • Additional manual spot checks by the research team.

In this way, some 190,000 candidates – spread across 60 years, 4,000 electoral districts, and 15,000 lists – were entered by 30 data-entry workers at the end of 2014 and the beginning of 2015, in highly satisfactory quality.

 

The relevance of non-response rates in employee attitude surveys

The HR department of an organization, or a consulting company in the field of HR, employs surveys as a methodological means of investigating how the organization can improve as an employer, increase performance, and become more profitable. Often the focus lies on systematically analyzing the staff’s perception of working conditions, job attitudes, health, and other performance-related indicators. In order to best understand which aspects need to be improved and to have a sound basis for decisions, high response rates are necessary. Unfortunately, response rates have proven in many instances to be relatively low for surveys addressing the entire organization.

In one of our previous posts we focused on methods which could help to improve the response rate in employee surveys. Now let us look at how central occupational performance indicators, such as job attitudes, might influence response rates in employee questionnaires. An interesting study by Fauth and his colleagues (2013) examines the relationship between employee work attitudes (e.g. job satisfaction or commitment) and non-response rates, and how they influence each other.

How non-response rates reflect employees’ job attitudes

Whilst previous research on this topic has mainly focused on the relationship between the individual work attitudes of employees and their non-responsive behavior in surveys, Fauth and his colleagues took a different approach. They were interested in the effects of group-level work attitudes on response rates. Although co-workers and work group members influence the attitudes and perspectives of other employees, the relationship between the job satisfaction of an entire working team or unit within an organization and its survey response behavior has previously been neglected (Cropanzano & Mitchell, 2005). From a practical perspective, such knowledge is crucial, as survey feedback processes in companies almost always operate on aggregated levels (e.g. team, business unit) and not on the individual level. Thus, Fauth et al. (2013) addressed this need for group-level analysis of non-response rates in organizational surveys. They hypothesize that aggregated job satisfaction is positively related to survey response rates at the work group level.

As social exchange theory (Cropanzano & Mitchell, 2005) underlines, individuals are willing to invest more effort and energy when they are content. Applied to the work sphere, this means that satisfied employees are willing to invest their energy in additional non-work-related tasks, such as completing employee surveys. This form of Organizational Citizenship Behavior (OCB; Rogelberg et al., 2003) explains the previously detected positive relationship between work satisfaction and response rate in employee attitude surveys at the individual level (Klein et al., 1994).

In order to test whether employee happiness is also positively related to survey response rates at the work group level, Fauth et al. (2013) conducted two large-scale follow-up employee surveys in four distinct companies in 2002, 2004, and twice in 2006. The participating 1,120 employees were gathered into 46 groups with approximately 24 employees per group. Their aggregated job satisfaction was assessed via a multi-item measure, the Job Descriptive Index, and the results show that work groups with greater combined job satisfaction had significantly higher response rates. Furthermore, the study also showed that, independent of this effect, smaller teams and teams with more heterogeneity in tenure and gender had higher response rates. Intriguingly, no difference in response rate was found between blue-collar and white-collar employees.

This points to an interesting avenue in organizational survey research: not only employees’ answers to survey questions are relevant for organizations when assessing a group’s perception of its employment situation, but also their response rates. Specifically, higher response rates could indicate greater overall work satisfaction and serve as an interesting indirect indicator of a working unit’s general attitude towards its job and its organization.

 

Security and speed with CloudFlare & co.

Security and speed: two of the most important properties of any website, and two areas with a high degree of specialization. To be able to offer the state of the art at all times, we have been using the service CloudFlare for a while now. CloudFlare allows us to run websites more securely and faster without having to provide the necessary infrastructure ourselves. There is a whole range of similar services, for example Incapsula, Myracloud, MaxCDN, CloudFront, and quite a few more.

Fundamentally, such services operate between a hosted website and the end user visiting it. At this point the services can offer various features, which essentially all fall into the two categories of security and speed.

Some of these features are described in more detail below. The descriptions primarily use CloudFlare as an example, but the overlap with similar providers is large.

Content Delivery Network (CDN)

A content delivery network (CDN) is a network of servers that delivers content to the end user in an optimized way. With the help of a CDN’s globally distributed servers, a website is delivered to its visitors as quickly as possible – that is, over the shortest possible path – which can reduce response times enormously.

Figure: Illustration of a CDN

The CDN’s servers keep the website’s static assets (such as JavaScript, CSS, and images) in stock and, when the site is requested, deliver these assets to the visitor directly from one of the servers. Depending on where the end user is located geographically, the nearest CDN server delivers the corresponding data.

Dynamic content continues to be served directly by the actual origin server, while all static content reaches the user via the CDN. According to CloudFlare, this makes websites load about twice as fast for users on average.

Figure: The data centers of the CloudFlare CDN

Web Content Optimization (WCO)

As just described, the advantages of a CDN come from using infrastructure to bring a website closer to the end user. Web content optimization (WCO), by contrast, is not concerned with how the data is delivered but with optimizing the data to be delivered itself. Via different approaches, both CDN and WCO thus lead to a faster website, and the two complement each other nicely.
Web content optimization is achieved, among other things, through the following measures:

  • Bundling of JavaScript files: Several JavaScript files are automatically bundled so that all of them are transferred in a single request. This saves the overhead of the multiple requests that would otherwise be needed to transfer each file separately.
  • Asynchronous loading: By loading resources such as CSS or JavaScript files asynchronously, a page effectively loads faster and is not unnecessarily delayed by, say, the synchronous loading of a large script.
  • Compression: Compressing the data to be transferred is also applied here. With a compression rate of, say, 30%, the amount of data – and thus the transfer time for those assets – shrinks by roughly the same 30%.
  • Cache headers: The cache header settings are automatically optimized so that the browser cache of a site visitor is put to good use and unnecessary repeat requests are avoided.


Security

As mentioned at the beginning, a service like CloudFlare operates between the hosted website and the site’s users. In addition to the speed optimizations discussed above, effective security measures can also be applied at this point to better protect websites from threats on the net. These are briefly introduced below:

  • Protection against DoS attacks: This is where protection against denial-of-service (DoS) attacks takes place. If such an attack is detected by the infrastructure, the corresponding countermeasures kick in and the attack never reaches the web server of the underlying website.
  • Web Application Firewall (WAF): A web application firewall can ward off further threats to a website. For example, automatic protection is available for the following typical attacks:
    • SQL injection
    • Comment spam
    • Cross-site scripting (XSS)
    • Cross-site request forgery (CSRF)

Figure: CloudFlare’s website analytics with information on detected threats

Fundamentally, every website is exposed to these potential threats on the internet. By using CloudFlare or a comparable service, many threats can be fended off before they even reach the actual website.

Individual websites also benefit from the fact that these security services are applied across a large number of other websites using the same service. On this basis, threat defense is not limited to a single website but can cover all of them: if an attack on one particular website is detected, the attacker can automatically be blocked from all websites.

Conclusion

So far we are very happy with CloudFlare, and using it has proven itself well in practice. Our own servers get to rest more often, and we benefit from pooled knowledge and shared high-performance infrastructure. Like many other cloud services, however, using CloudFlare & co. brings new problems with it: outages of CloudFlare itself can have far-reaching consequences for the availability of thousands of sites. For this reason it is important to have a working fallback solution at all times and not to become 100% dependent on the service.

Overall, we have so far drawn significant benefit from the advantages described, and this service complements our infrastructure superbly.

Handy data cleaning tool – CSV fingerprints

Recently I stumbled upon a handy little tool that may be interesting for everyone working with data in tables. An important but often tedious task is the cleaning of your dataset before you can actually start running statistical analyses. During this cleaning or mastering process you may find artifacts like the following:

  • Entries with unexpected data types: When test takers were expected to describe something in prose but a few entered a number instead.
  • Empty cells where no missing values are allowed: Maybe a mistake when entering paper pencil data manually.
  • A sudden shift of cell values to the right, causing a lot of values to fall into the wrong column: This happens when data separator characters are used in the data itself.

If you’ve ever worked with larger sets of data, you surely know these or similar problems and know from experience how hard it can be to spot them.

CSV Fingerprints gives you a very quick first visual overview of your data and can therefore save you a lot of time. Victor Powell, the author of this handy tool, explains CSV Fingerprints in more detail on his blog. There is also a full-screen version of the tool available.

Tip: Don’t copy and paste data directly from Excel; always copy the CSV from a text editor.

 

How the internet changes us and our science

In recent years, web-based scientific research has been expanding and reinventing itself constantly. The number of publications and research articles in the Journal of Personality and Social Psychology conducted via web-based tools increased in relative terms by about 543% from 2008 to 2009 (Denissen, Neumann, & van Zalk, 2010).

With near-universal internet access in most of the developed world (e.g. 90% of Sweden’s population has daily access to the internet, as the Internet World Stats report for 2001 to 2009 shows), the newest technology not only affects us on a daily basis, but also shapes our daily social interactions and the way in which we conduct research. In addition to offline psychological data collection via questionnaires and experiments, for instance, web-based research through online surveys, apps, and special web applications is able to facilitate and amplify our scientific data collection.

Making use of these new technological opportunities, research in psychology and other human sciences has therefore become more virtual and online-based. We collect data about ourselves and the world around us online, answer questionnaires on our phones while traveling home, or participate in diary studies before going to bed.

Online web-based data collection offers many advantages to scientific research. Most importantly:

  1. Data can be collected more easily and economically.
  2. Entered data can be validated in real time and the user can be prompted for correction.
  3. Data anonymity can be guaranteed if researchers assure the anonymous and separate storage of participants’ answers and their ID codes.
  4. Researchers can reach a more representative sample much more easily, especially when distributing their surveys via various social media platforms.

In their brilliant article on “How the internet is changing the implementation of traditional research methods, people’s daily lives, and the way in which developmental scientists conduct research”, Denissen, Neumann, and van Zalk (2010) explain the opportunities and challenges that the new generation of online research brings. They explain why web-based research has risen to such popularity in the past decade and what is needed to conduct it.

The authors do not shy away from the challenges of these new possibilities either – challenges that range from the secure storage of participants’ data and secure data transmission to online communication and the need for extensive testing and debugging of online tools.

Hand in hand with these opportunities comes a change in how we interact with other people in our offline world. The frequent use of technology and the internet shapes our interpersonal communication and interactions, as many researchers in the field of cyberpsychology underline. The massive wealth of data individuals leave on the internet, particularly on social media platforms such as Facebook or Google+, is used to investigate personality factors and their impact on various outcomes. The existence of this data enables scientists to investigate all kinds of hypotheses, ranging from how personality affects consumer behavior to how the use of social media is associated with depression and loneliness.

For those interested in more information on the advantages and pitfalls of online data collection, we highly recommend reading Denissen, Neumann, and van Zalk’s (2010) article.

Book recommendation: Longitudinal data analysis using structural equation models

In the wake of our recent posts about longitudinal studies, we’d like to recommend a recently published book by John J. McArdle and John R. Nesselroade.


Longitudinal studies are on the rise, no doubt. Properly conducting longitudinal studies and then analyzing the data can be a complex undertaking. John McArdle and John Nesselroade focus on five basic questions that can be tackled with structural equation models when analyzing longitudinal data:

  • Direct identification of intraindividual changes.
  • Direct identification of interindividual differences in intraindividual changes.
  • Examining interrelationships in intraindividual changes.
  • Analyses of causes (determinants) of intraindividual changes.
  • Analyses of causes (determinants) of interindividual differences in intraindividual changes.

I find it especially noteworthy that the authors put an emphasis on factorial invariance over time and on latent change scores. In my view, this makes the book a must-read for anyone who wants to become a longitudinal data wizard.

Need another argument? Afraid of cumbersome mathematical language? Here is what the authors say about it: “We focus on the big picture approach rather than the algebraic details.”

 

Cause and effect: Optimizing the designs of longitudinal studies

A rising number of longitudinal studies have been conducted and published in industrial and organizational psychology recently. Although this is a pleasing development, it needs to be considered that most of the published studies are still cross-sectional in nature and thus are far less suited for establishing causal relationships. A longitudinal study can potentially provide insights into the direction of effects and the size of effects over time.

Despite their advantages, designing longitudinal studies requires careful consideration and poses tricky theoretical and methodological questions. As Taris and Kompier put it in their editorial to volume 28 of the journal Work & Stress: “…they are no panacea and could yield disappointing and even misleading findings…”. The authors focus on two crucial challenges in longitudinal designs that have a strong impact on detecting the true effects among a set of constructs.

Choosing the right time lags in longitudinal designs

Failing to choose the right time lag between two consecutive study waves leads to biased estimates of effects (see also Cole & Maxwell, 2003). If the study interval is much shorter than the true interval, the cause does not have sufficient time to affect the outcome. Conversely, if the study interval is too long, the true effects may already have vanished. Thus, the estimated size of an effect is strongly linked to the length of the interval between two consecutive measurement waves.


The chosen interval should correspond as closely as possible to the true underlying interval. This requires thorough a priori knowledge or reasoning about the possible underlying causal mechanism and time lags before conducting a study. What should you do when deducing or estimating an appropriate time lag is not possible? Taris and Kompier (2014) suggest “that researchers include multiple waves in their design, with relatively short time intervals between these waves. Exactly how short will depend on the nature of the variables under study. This way they would maximize the chances of including the right interval between the study waves”. To improve longitudinal research further, the authors propose that researchers report their reasoning for choosing a particular time lag. This would explicitly make temporal considerations what they are: a central part of the theoretical foundation of a longitudinal study.

Considering reciprocal effects in longitudinal designs

Building on one of their earlier articles, Taris and Kompier (2014) opt for full panel designs, meaning that the presumed independent variable as well as the presumed outcome are measured at all waves. Such a design allows testing for reciprocal effects. Ignoring existing reciprocal effects in longitudinal analyses may again lead to biased estimates of effects.

 

A helpful checklist for conducting and publishing Longitudinal Research

Longitudinal research has increased substantially in the past 20 years thanks to the development of new theories and methodologies. Nevertheless, studies in the social sciences are still dominated by cross-sectional research designs or deficient longitudinal research, because many researchers lack guidelines for conducting adequate longitudinal research that can interpret duration and change in constructs and variables.

To make longitudinal research more systematic, Ployhart and Ward (2011) have created a quick-start guide on how to conduct high-quality longitudinal research.

The following information refers to three stages: the theoretical development of the study design, the analysis of longitudinal results, and relevant tips for publishing the research. The most relevant information provided by the authors is shared below in the form of a checklist that can help you improve your research ideas and design:

Why is longitudinal research important?

It not only helps to investigate the relationship between two variables over time, but also allows researchers to disentangle the direction of effects. It also helps to investigate how a variable changes over time and how long that change lasts. For instance, one might investigate how the job satisfaction of new hires changes over time and whether certain features of the job (e.g., feedback by the supervisor) predict the form of change. Such questions can only be analyzed through longitudinal investigation with repeated measurements of the construct. In order to study change, at least three waves of data are necessary for a well-conducted longitudinal study (Ployhart & Vandenberg, 2010).

What sample size is needed to conduct longitudinal research?

Since the estimation of power is a complex issue in longitudinal research, the authors give a rather general answer to this question: “the answer to this is easy—as large as you can get!” However, they also offer a useful rule of thumb. Statistical power depends, among other things, on the number of subjects and on the number of repeated measures: “If one must choose between adding subjects versus measurement occasions, our recommendation is to first identify the minimum number of repeated measurements required to adequately test the hypothesized form of change and then maximize the number of subjects.”

When to administer measures?

When studying change over time, the timing of measurement is crucial (Mitchell & James, 2001). The measurement spacing should adequately capture the expected form of change: spacing will differ for linear change as compared to non-linear (e.g., exponential or logarithmic) change. Such thinking is still contrary to common practice. Most study designs use evenly spaced measurement occasions and pay little attention to the type of change under study. However, it is important that measurement waves occur frequently enough and cover the theoretically important temporal parts of the change. This requires careful theoretical reasoning beforehand. Otherwise, the statistical models will over- or underestimate the true nature of the changes under study.

Be it a longitudinal study or a diary study, the software of cloud solutions can handle any type of timing and frequency between measurement occasions. The flexibility of our online solutions stems from an “event flow engine” that is based on neural networks.

What to do about missing data?

The statistical analysis of longitudinal research can become complex, and one particular challenge is the treatment of missing data. Since longitudinal studies often suffer from high dropout rates, missing data is a very common phenomenon. Here you find recommendations for reducing missing data before and during data collection. When conducting surveys in organizations, one way to enhance the response rate is to make sure that the company allows its workers to complete the survey during working hours. A specific technique to reduce the burden on individual participants while still measuring frequently over a longer period is planned missingness.

When it comes to handling missing data in statistical analyses, the most important question is whether the data are missing at random or not. If the data are missing at random, there is not much to worry about: full information maximum likelihood estimation will provide unbiased estimates despite the missing data points. If the data are not missing at random, more sophisticated analytical techniques may be required. Ployhart and Ward (2011) recommend Little and Rubin (2002) for further reading on this issue.

Which analytical method to use?

Simply put, there are three statistical frameworks that can be used to model longitudinal data.

  • Repeated measures general linear model: Useful when the focus of interest lies on mean changes within persons over time and missing data is unproblematic.
  • Random coefficient modeling: Useful when one is interested in between-person differences in change over time. Especially useful when the growth models are simple and the predictors of change are static.
  • Structural equation modeling: Useful when one is interested in between-person differences in change over time. Especially useful with more complex growth models, including time-varying predictors, dynamic relationships, or mediated change.

The following table from Ployhart and Ward (2011) gives a more detailed insight into the application of the three methods:

Use the following method when the corresponding conditions are present:

Repeated measures general linear model
  • Focus on group mean change
  • Identify categorical predictors of change (e.g. training vs. control group)
  • Assumptions about residuals are reasonably met
  • Two waves of repeated data
  • Variables are highly reliable
  • Little to no missing data

Random coefficient modeling
  • Focus on individual differences in change over time
  • Identify continuous or categorical predictors of change
  • Residuals are correlated, heterogeneous, etc.
  • Three or more waves of data
  • Variables are highly reliable
  • Model simple mediated or dynamic models
  • Missing data are random

Structural equation modeling
  • Focus on individual differences in change over time
  • Identify continuous or categorical predictors of change
  • Residuals are correlated, heterogeneous, etc.
  • Three or more waves of data
  • Want to remove unreliability
  • Model complex mediated or dynamic models

How to make a relevant theoretical contribution worth publishing?

When publishing longitudinal research, you should always describe why your longitudinal design explains the constructs and their relationships better than an equivalent cross-sectional design would. You should then underline how your study design improves on previous ones. Try to go through the following questions when justifying why your research is worth publishing:

  • Have you developed hypotheses from a cross-sectional or from a longitudinal theory?
  • Have you explained why change occurs in your constructs?
  • Have you described why you measured the variables at various times and how this constitutes a sufficient sampling rate?
  • Have you considered threats to internal validity?
  • Have you explained how you reduced missing data?
  • Have you explained why you chose this analytical method?

cloud solutions wishes you success with your longitudinal research!

 

How do I achieve a high response rate in my study?

Response rates in questionnaire studies – findings from a meta-analysis of over 2,000 surveys.

Research in organizations relies heavily on questionnaire studies. This carries the risk that a substantial share of the targeted population does not respond. Low response rates cause problems when generalizing results to the population under study (insufficient external validity). Small samples due to too few respondents additionally increase the risk of low statistical power and limit the kinds of statistical techniques that can be applied. Some researchers assume that the popularity of questionnaire studies in recent years has aggravated these risks.

For an optimal design of studies in organizations, two questions therefore arise above all:

  • Have response rates in questionnaire studies declined in recent years?
  • If so, which techniques for increasing response rates are particularly effective today?

Answers to these questions are provided by a meta-analysis by Frederik Anseel, Filip Lievens, Eveline Schollaert, and Beata Choragwicka, published in the Journal of Business and Psychology. The authors analyzed over 2,000 questionnaire studies published between 1995 and 2008 in scientific journals of work and organizational psychology, management, and marketing. This study is also one of the first ever to examine the effect of online questionnaire studies on response rates in an organizational setting.

The study shows the following:

  • The average response rate in the analyzed studies is 52%, with a standard deviation of 24%.
  • Response rates declined slightly between 1995 and 2008 (by 0.6% per year). This effect was compensated, however, by the increased use of techniques for boosting response.
  • Across all groups of respondents, the following techniques are effective at increasing response: sending advance information before the study starts, personalization (e.g. addressing participants personally), demonstrating the relevance of the topic (increasing salience), using anonymous identification codes, university or otherwise reputable sponsorship, and distributing the questionnaires in person.
  • Conducting a study online does not make equal sense for every population. The study shows that an online survey is an effective means of increasing response among non-managers (employees without a leadership role). For other groups (e.g. top management), running the survey online can even lead to lower response than a paper questionnaire.
  • Financial incentives are not an effective means of increasing response.

In summary, the authors give the following tips:

Table: Response-rate guidelines summarized by Anseel et al.

The online solutions from Bright Answer support the use of the techniques mentioned above. The software offers automated sending of personally addressed advance information, study invitations, and reminders, and the use of anonymous access codes is also possible.
The software also offers an additional kind of incentive for study participants. At the end of a study, participants can view automatically generated, individual feedback and compare themselves with the average of the other participants or, where available, with other benchmarks. Experience shows that the prospect of individual feedback enables high response rates. Especially in longitudinal studies with many measurement waves (5 or more), this incentive can prevent many dropouts.