The Language and Cognition Lab's Quick-and-Dirty Overview of How to Clean up your Data

by Ben Bergen

WARNING:
This outline is meant to be used only as a preliminary, orientational resource for students and other researchers working on questions in quantitative linguistics. Mastery of its contents alone does not suffice to perform professional-grade data cleaning, so please consult other resources before proceeding with work to be presented publicly.

Here's a sketch of how to clean up your data once you've collected it. I'm assuming that you have one or more independent variables and one continuous dependent variable, usually something like reaction time. The procedures below can be modified for other setups. Most of what follows can be done in E-DataAid, or other similar programs. Usually, you'll perform these steps in the order listed below, but specific cases might require changes.

Incorrect responses

Set a threshold for the minimum accuracy that you're willing to accept for participants and items. Often these range from 80% to 90%. Remove from analysis any items or participants that fall below this threshold.
Report the number of items and participants removed.
Calculate how many incorrect responses there were in each condition per participant, and conduct an error analysis by participants.
Do the same by items.
Report the percentage of incorrect responses per condition and the significance of any effects in the error analyses.
Filter out all remaining incorrect responses for subsequent analysis.
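
As a rough illustration, here is a minimal Python sketch of the steps above. It assumes (this is not part of the procedure itself) that each trial is a dict with the hypothetical keys "participant", "item", "condition", "correct" and "rt". The function names are made up for this example, the 80% default is just one value from the usual range, and the significance tests for the error analyses are left out.

```python
from collections import defaultdict

def accuracy_by(trials, unit):
    """Return {unit value: proportion correct}; unit is "participant" or "item"."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t in trials:
        total[t[unit]] += 1
        correct[t[unit]] += 1 if t["correct"] else 0
    return {u: correct[u] / total[u] for u in total}

def apply_accuracy_threshold(trials, threshold=0.8):
    """Drop participants and items whose accuracy falls below the threshold.

    Assumption: both accuracies are computed on the full data set before
    anything is removed. Returns the kept trials plus the two sets of
    removed units, so that their sizes can be reported.
    """
    bad_participants = {p for p, acc in accuracy_by(trials, "participant").items()
                        if acc < threshold}
    bad_items = {i for i, acc in accuracy_by(trials, "item").items()
                 if acc < threshold}
    kept = [t for t in trials
            if t["participant"] not in bad_participants
            and t["item"] not in bad_items]
    return kept, bad_participants, bad_items

def error_counts(trials, unit):
    """Count incorrect responses per (unit, condition) cell, for the error analysis."""
    counts = defaultdict(int)
    for t in trials:
        counts[(t[unit], t["condition"])] += 0 if t["correct"] else 1
    return dict(counts)

def correct_only(trials):
    """Keep only correct responses for the subsequent analyses."""
    return [t for t in trials if t["correct"]]
```

Calling error_counts(trials, "participant") gives the counts for the by-participants error analysis, and error_counts(trials, "item") gives the counts for the by-items one.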

Outlying participants

Calculate the mean value and standard deviation of your DV for each participant.
Establish a threshold for inclusion. This is often between 2 and 3 standard deviations.
Exclude all participants who have mean values of the DV more than this threshold away from the overall mean.
Report the threshold used and the number of participants excluded.

Outlying items

Calculate the mean value and standard deviation of your DV for each item.
Establish a threshold for inclusion. This is often between 2 and 3 standard deviations.
Exclude all items that have mean values of the DV more than this threshold away from the overall mean.
Report the threshold used and the number of items excluded.
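
The participant and item procedures have the same shape, so one helper can sketch both. This is a minimal Python sketch, not a definitive implementation. It assumes each trial is a dict with hypothetical keys "participant", "item" and "rt". It also assumes that the "overall mean" is the mean of the per-participant (or per-item) means and that the standard deviation is taken over those same means. If you define them differently, adjust accordingly.

```python
from collections import defaultdict
from statistics import mean, stdev

def unit_means(trials, unit, dv="rt"):
    """Return {unit value: mean of the DV}; unit is "participant" or "item"."""
    values = defaultdict(list)
    for t in trials:
        values[t[unit]].append(t[dv])
    return {u: mean(v) for u, v in values.items()}

def outlying_units(trials, unit, n_sd=2.5, dv="rt"):
    """Return the units whose mean DV lies more than n_sd SDs from the overall mean.

    Assumption: the overall mean and the SD are both computed over the unit
    means. stdev() needs at least two units.
    """
    means = unit_means(trials, unit, dv)
    overall_mean = mean(means.values())
    cutoff = n_sd * stdev(means.values())
    return {u for u, m in means.items() if abs(m - overall_mean) > cutoff}

def drop_units(trials, unit, outliers):
    """Remove every trial belonging to an outlying participant or item."""
    return [t for t in trials if t[unit] not in outliers]
```

To follow the order given here, call outlying_units(trials, "participant") first, drop those participants with drop_units, and then call outlying_units on the remaining trials with "item". The sizes of the two returned sets are the numbers to report.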

Outlying trials

Recalculate the mean value and standard deviation of your DV for each participant, now that outlying participants and items have been removed.
Establish a threshold for inclusion. This is often between 2 and 3 standard deviations. Alternatively, it can be a firm minimum or maximum value for the DV, e.g. no shorter than 100 ms or no longer than 2000 ms.
Find all trials whose DV value is more than this threshold away from that participant's mean.
Either exclude these trials from analysis, or "Winsorize" them, that is, replace each outlying value with the value that lies exactly at the threshold distance from the participant's mean.
Report the threshold used, the total number of outlying trials for the whole experiment, and whether outliers were excluded or replaced.
You may repeat this whole outlying trial procedure by items as well, if you like.
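
The outlying trial procedure above can be sketched in Python roughly as follows. As before, this is only an illustration under stated assumptions: each trial is a dict with hypothetical keys "participant" and "rt", the function names are invented for this example, and each participant has at least two trials so that a standard deviation can be computed.

```python
from collections import defaultdict
from statistics import mean, stdev

def participant_bounds(trials, n_sd=2.5, dv="rt"):
    """Return {participant: (low, high)}, i.e. that participant's mean -/+ n_sd SDs."""
    values = defaultdict(list)
    for t in trials:
        values[t["participant"]].append(t[dv])
    bounds = {}
    for p, v in values.items():
        m, sd = mean(v), stdev(v)
        bounds[p] = (m - n_sd * sd, m + n_sd * sd)
    return bounds

def clean_trials(trials, n_sd=2.5, dv="rt", winsorize=False):
    """Exclude outlying trials, or replace them with the nearest bound.

    Returns the cleaned trials and the number of outlying trials found,
    which is the total to report. The input trials are not modified.
    """
    bounds = participant_bounds(trials, n_sd, dv)
    cleaned, n_outliers = [], 0
    for t in trials:
        low, high = bounds[t["participant"]]
        if low <= t[dv] <= high:
            cleaned.append(t)
            continue
        n_outliers += 1
        if winsorize:
            # Replace the outlying value with the bound it exceeded.
            cleaned.append({**t, dv: min(max(t[dv], low), high)})
    return cleaned, n_outliers

def fixed_cutoff(trials, low=100, high=2000, dv="rt"):
    """Alternative: keep only trials inside firm minimum and maximum values."""
    return [t for t in trials if low <= t[dv] <= high]
```

To repeat the procedure by items, the same idea applies with trials grouped by "item" instead of "participant".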