Module 1: Data Wrangling

You Can’t Publish What You Can’t Clean

Author

A. C. Del Re

The “Raw Data” Reality

Much psychology survey data comes from platforms such as Qualtrics or SurveyMonkey. The export is usually “wide” (one row per participant, one column per item), messy, and full of text labels where you want numbers.


Table 1: The Reality: A Standard Qualtrics CSV

ResponseId  Gender      Age     Q1_Anxiety      Q2_Calm            Q3_Worry
R_123       Male        19      Strongly Agree  Strongly Disagree  5
R_124       Female      21      Agree           Disagree           4
R_125       Non-binary  twenty  Neutral         Neutral            3
R_126       Male        22      Disagree        Agree              2
R_127       Female      19      Strongly Agree  Strongly Disagree  5


Why is this hard to analyze?

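  1. Metadata rows: Qualtrics exports include 2-3 header rows before the data (not shown here).
  2. String Likert scales: “Strongly Agree” instead of 5.
  3. Reverse-scored items: On an anxiety scale, a positively worded item such as “I feel calm” (Q2_Calm) needs to be flipped so that higher scores always mean more anxiety.
  4. Typos and free-text entries: Notice the “twenty” in the Age column, which forces the whole column to be read as text.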

1. Simulating the Data in R

To practice cleaning, let’s create this exact dataset in R:
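
The original code chunk is not reproduced here, so the following is a minimal sketch of one way to build Table 1 by hand in base R. The object name `raw_data` is an assumption for illustration; the values are copied from the table. Note that Age has to be stored as text because of the “twenty” entry.

```r
# Recreate Table 1 as a data frame (object name `raw_data` is hypothetical).
raw_data <- data.frame(
  ResponseId = c("R_123", "R_124", "R_125", "R_126", "R_127"),
  Gender     = c("Male", "Female", "Non-binary", "Male", "Female"),
  # Age is character, not numeric, because one participant typed "twenty".
  Age        = c("19", "21", "twenty", "22", "19"),
  Q1_Anxiety = c("Strongly Agree", "Agree", "Neutral", "Disagree", "Strongly Agree"),
  Q2_Calm    = c("Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Disagree"),
  Q3_Worry   = c(5, 4, 3, 2, 5),
  stringsAsFactors = FALSE
)

# Inspect column types: Age and the two Likert items are text.
str(raw_data)
```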

2. Cleaning with Tidyverse

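We use the tidyverse, a collection of R packages (including dplyr and tidyr) that share a common grammar. Chained together with the pipe, each cleaning step reads like a sentence.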
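
As an illustration of that style, here is a hedged sketch that repairs the Age column with dplyr. The object names `raw_data` and `clean_data` are assumptions, and hard-coding the single known entry “twenty” is a shortcut that only suits a toy dataset; the new column name `Age_Num` follows the name used later in this module.

```r
library(dplyr)

# Small stand-in for the raw export (hypothetical object name).
raw_data <- data.frame(
  ResponseId = c("R_123", "R_124", "R_125", "R_126", "R_127"),
  Age        = c("19", "21", "twenty", "22", "19"),
  stringsAsFactors = FALSE
)

# Read as: take the raw data, then fix the text entry, then convert to numeric.
clean_data <- raw_data %>%
  mutate(
    Age_Num = as.numeric(if_else(Age == "twenty", "20", Age))
  )

clean_data$Age_Num
```

In a real dataset it is usually safer to inspect every non-numeric entry first rather than assume there is only one.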

3. The Reverse Conversion

Psychometrics requires converting text (“Strongly Agree”) to numbers (5), and sometimes flipping them (1 becomes 5).
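
One way to do the text-to-number step is a named lookup vector, sketched below. The names `likert_key`, `responses`, and `scored` are hypothetical. Only “Strongly Agree” = 5 is stated in this module; the other four values assume the conventional 1 to 5 agreement coding.

```r
library(dplyr)

# Lookup table from label to score (assumed conventional coding).
likert_key <- c(
  "Strongly Disagree" = 1,
  "Disagree"          = 2,
  "Neutral"           = 3,
  "Agree"             = 4,
  "Strongly Agree"    = 5
)

responses <- data.frame(
  Q1_Anxiety = c("Strongly Agree", "Agree", "Neutral", "Disagree", "Strongly Agree"),
  Q2_Calm    = c("Strongly Disagree", "Disagree", "Neutral", "Agree", "Strongly Disagree"),
  stringsAsFactors = FALSE
)

# Index the key by each text response; unname() drops the label names.
scored <- responses %>%
  mutate(across(c(Q1_Anxiety, Q2_Calm), ~ unname(likert_key[.x])))

scored
```

Any label that is missing from the key (for example a misspelled option) becomes NA, which makes unexpected responses easy to spot.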

The Rule of 6

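To reverse-code a 5-point scale: New_Score = 6 - Old_Score (1 becomes 5, 2 becomes 4, … 5 becomes 1). More generally, for a scale running from 1 to k, New_Score = (k + 1) - Old_Score, which is why the constant here is 6.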
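
A minimal sketch of the rule applied to the calm item, assuming the responses have already been converted to the numbers 1 to 5. The object names and the `_R` suffix for the reversed column are illustrative choices, not names from the original module.

```r
library(dplyr)

# Numeric item scores (hypothetical object; values match Table 1 after scoring).
scored <- data.frame(
  Q1_Anxiety = c(5, 4, 3, 2, 5),
  Q2_Calm    = c(1, 2, 3, 4, 1)
)

# Rule of 6: subtract the old score from 6 so that high calm becomes low anxiety.
reversed <- scored %>%
  mutate(Q2_Calm_R = 6 - Q2_Calm)

reversed
```

Keeping the reversed item in a new column, rather than overwriting the original, makes it harder to reverse the same item twice by accident.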

4. Labeling

For the flextable output in Module 2 to look good, we attach human-readable labels to the variables, so that “Age_Num” prints as “Participant Age”.
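
The labeling code from the original page is not reproduced here. As a sketch under that caveat, base R can store a label as a `"label"` attribute on a column, a convention that several labeling packages also follow. Whether a given table function picks this attribute up automatically depends on the package, so treat this as an assumption about the workflow rather than the module's own method; `clean_data` is a hypothetical object name.

```r
# Toy data frame with the cleaned age variable.
clean_data <- data.frame(Age_Num = c(19, 21, 20, 22, 19))

# Attach a human-readable label as an attribute on the column.
attr(clean_data$Age_Num, "label") <- "Participant Age"

# Retrieve the label later, for example when building table headers.
attr(clean_data$Age_Num, "label")
```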


