SPSS Dissertation Guide

How to Clean Data in Stata

How to Clean Data in Stata: Step-by-Step Guide for Students Clean data is the foundation of accurate statistical analysis. Before you run descriptive statistics, correlation, regression, ANOVA, chi-square tests, panel data analysis, or any other Stata procedure, your dataset must…

Written by Pius Updated June 2, 2026 19 min read
How to Clean Data in Stata

How to Clean Data in Stata: Step-by-Step Guide for Students

Clean data is the foundation of accurate statistical analysis. Before you run descriptive statistics, correlation, regression, ANOVA, chi-square tests, panel data analysis, or any other Stata procedure, your dataset must be checked and prepared correctly. If your data contains missing values, duplicate responses, incorrect variable types, inconsistent codes, outliers, or poorly labeled variables, your results can become misleading.

For dissertation and thesis students, this can affect the quality of Chapter 4, the accuracy of hypothesis testing, and the credibility of the final interpretation. Stata is helpful because it allows you to clean data using commands, save the process in a do-file, and repeat the same cleaning steps if your supervisor asks for revisions.

This guide explains how to clean data in Stata step by step. It is written for students working with questionnaire data, survey responses, secondary datasets, panel data, Excel files, CSV files, and research datasets that need to be prepared for dissertation analysis.

What Does Data Cleaning in Stata Mean?

Data cleaning in Stata means preparing raw data so it can be analyzed accurately. It involves checking the structure of the dataset, correcting coding errors, handling missing values, removing or reviewing duplicate records, converting variable types, labeling variables, recoding responses, checking outliers, and saving a clean version for analysis.

In dissertation research, data cleaning helps ensure that the results you report are based on reliable data. For example, if a missing response is coded as 999 and you fail to recode it as missing, Stata may treat it as a real number. This can distort the mean, standard deviation, correlation, or regression output.

Data cleaning does not mean changing your data to get a preferred result. It means correcting genuine data quality issues so your analysis remains accurate, transparent, and defensible. Students with complex or messy datasets can also use professional data cleaning services before moving to statistical analysis.

Data Cleaning TaskWhy It Matters in Dissertation AnalysisCommon Stata Commands
Import dataOpens Excel, CSV, or Stata files for analysisimport excel, import delimited, use
Inspect variablesHelps you understand the dataset structuredescribe, codebook, summarize
Check missing valuesIdentifies incomplete responses and coding errorsmisstable summarize, tab, missing
Check duplicatesPrevents repeated cases from affecting resultsduplicates report, duplicates list
Convert variable typesAllows statistical tests to run correctlydestring, encode
Recode variablesPrepares variables for analysisrecode, replace, generate
Label variablesMakes output easier to understandlabel variable, label define, label values
Check outliersFinds impossible or unusual valuessummarize, detail, graph box, list
Merge datasetsCombines related files correctlymerge, append, isid
Save clean dataCreates an analysis-ready datasetsave, export excel, export delimited

Why Data Cleaning Matters Before Stata Analysis

Data cleaning affects every result you produce in Stata. A dataset with errors can produce incorrect frequencies, misleading percentages, weak regression models, unreliable p-values, and inaccurate conclusions. This is especially risky in dissertation work because the final analysis must match the research questions, objectives, and hypotheses.

For example, duplicate survey responses can inflate your sample size. Incorrectly coded Likert-scale items can affect reliability analysis and scale scores. String variables that should be numeric can prevent Stata from running tests. Missing values coded as 99 or 999 can distort descriptive statistics. Outliers may also affect regression coefficients, especially when they represent data entry errors rather than valid extreme values.

Clean data helps you:

  • Run accurate descriptive statistics
  • Prepare reliable correlation and regression models
  • Analyze questionnaire and survey responses correctly
  • Test hypotheses with fewer data-related errors
  • Report Chapter 4 findings clearly
  • Avoid mistakes in tables and interpretation
  • Explain your data preparation process in the methodology chapter

Students who need support with Stata commands, coursework, or interpretation can also use Stata assignment help when cleaning and analyzing research data becomes difficult.

Before You Start: Protect the Raw Dataset

Before you clean your data in Stata, save the original file separately. Never overwrite the raw dataset. If you make a mistake while cleaning, the raw file allows you to restart the process without losing the original data.

A safe Stata data cleaning workflow should include:

File or FolderPurpose
Raw data fileStores the original unchanged dataset
Working data fileStores the version used during cleaning
Clean data fileStores the final analysis-ready dataset
Do-fileRecords all Stata commands used during cleaning
Log fileSaves Stata output and cleaning checks
Notes fileExplains major decisions such as recoding and exclusions

A simple setup may look like this:

clear all
set more off
capture log close

log using "data_cleaning_log.smcl", replace

use "raw_data.dta", clear

describe
codebook
summarize

save "working_data.dta", replace

The clear all command clears Stata’s memory. The set more off command prevents Stata from pausing long output. The log using command saves your output. The use command opens the dataset. The describe, codebook, and summarize commands help you inspect the data before making changes.

Using a do-file is important because it creates a repeatable record of your work. This is useful when your supervisor asks how you handled missing values, outliers, duplicates, or variable coding.

Step 1: Import Your Data into Stata

Students often collect data using Excel, Google Forms, Qualtrics, SurveyMonkey, KoboToolbox, or other online survey platforms. The data may be downloaded as an Excel file, CSV file, or Stata file.

For Excel data, use:

import excel "survey_data.xlsx", firstrow clear

For CSV data, use:

import delimited "survey_data.csv", clear

For an existing Stata file, use:

use "survey_data.dta", clear

The firstrow option tells Stata to use the first row of the Excel sheet as variable names. The clear option removes any dataset currently open in Stata.

After importing the data, check whether the file imported correctly. Common import problems include blank rows, long variable names, numbers stored as text, dates imported incorrectly, merged Excel cells, special characters, and missing responses. Fixing these issues early prevents problems later when you run statistical tests.

Step 2: Inspect the Dataset Structure

Before changing anything, inspect the dataset carefully. This helps you understand the number of observations, number of variables, variable types, missing values, and possible coding errors.

Use these commands:

describe
codebook
summarize
list in 1/10
browse
misstable summarize

The describe command shows variable names, storage types, formats, and labels. The codebook command gives more detailed information about variable values, ranges, unique values, and missing cases. The summarize command gives descriptive statistics for numeric variables. The list in 1/10 command displays the first ten observations. The browse command opens the data viewer. The misstable summarize command identifies missing values.

Question to AskWhy It Matters
How many observations are in the dataset?Confirms the sample size
How many variables are available?Confirms whether all questionnaire items imported
Which variables are numeric?Shows what can be analyzed statistically
Which variables are strings?Helps identify variables that may need conversion
Are there missing values?Helps plan missing data handling
Are variable names clear?Makes Stata commands easier to write
Are value labels available?Makes output easier to interpret
Are ID variables unique?Helps detect duplicates and merge issues

This first inspection also helps you decide whether the dataset is ready for cleaning or whether the file needs correction before analysis.

Step 3: Rename Variables Clearly

Clear variable names make Stata analysis easier. Many survey platforms export variables as Q1, Q2, Q3, or long question text. These names can make commands difficult to write and results harder to interpret.

You can rename variables like this:

rename Q1 age
rename Q2 gender
rename Q3 education_level
rename Q4 job_satisfaction

Use short, clear, lowercase variable names. Use underscores instead of spaces. Avoid symbols, punctuation marks, and vague names where possible.

Poor Variable NameBetter Variable Name
Q1age
Q2gender
RespondentAgerespondent_age
SatisfactionScoresatisfaction_score
Income Levelincome_level
Var0001employment_status

Good variable names help when running descriptive statistics, reliability analysis, correlation, regression, factor analysis, and hypothesis testing. They also make your do-file easier to understand when you return to it later.

Step 4: Convert String Variables to Numeric Variables

Some variables may look numeric but are stored as text. This often happens when data comes from Excel or CSV files. Stata cannot run many statistical tests on numeric values stored as strings.

Check variable types using:

describe

If age, income, test scores, or scale values are stored as strings, convert them using:

destring age, replace
destring income, replace ignore(",")
destring score, generate(score_num)

The replace option converts the original variable. The generate() option creates a new numeric version while keeping the original variable. The ignore(",") option tells Stata to ignore commas during conversion.

Do not convert every string variable. Names, phone numbers, participant codes, student IDs, and IDs with leading zeros should usually remain strings. For example, an ID such as 00125 may lose its leading zeros if converted to numeric.

For categorical text variables such as gender, education level, or marital status, use encode:

encode gender, generate(gender_num)

This creates a numeric version of the variable with value labels. That version is easier to use in statistical analysis while still keeping readable categories.

Step 5: Check Missing Values in Stata

Missing values are common in student research datasets. A respondent may skip a question, stop halfway through a survey, leave demographic information blank, or provide an invalid response.

Use these commands to check missing values:

misstable summarize
misstable patterns
tab gender, missing
summarize age
codebook

You should also check for placeholder missing values. Some datasets use 99, 999, 9999, -99, NA, N/A, “Unknown,” or blank strings to represent missing responses. Stata may not automatically treat these as missing.

Examples:

replace age = . if age == 999
replace income = . if income == -99
replace satisfaction = . if satisfaction == 99

In Stata, . represents a missing numeric value.

Do not delete all missing values automatically. First, check how much data is missing, which variables are affected, and whether the missing values affect your main research questions. If the missing data problem is large, professional dissertation data analysis help can help students choose an appropriate approach before running final analysis.

Missing Data IssuePossible Action
Few missing demographic valuesReport missingness
Missing outcome variableDecide based on the analysis plan
Placeholder code such as 999Recode to Stata missing
Many missing survey responsesReview response quality
Missing Likert-scale itemsApply justified scale scoring rules
Missing ID variableInvestigate before analysis
Blank string valuesStandardize or recode carefully

Missing data should be handled consistently and documented in your methodology or data preparation section.

Step 6: Check for Duplicate Responses

Duplicate records can affect your sample size and results. In survey research, duplicates may occur when one participant submits a form more than once. In secondary data, duplicates may result from repeated records, merging errors, or data entry mistakes.

Use:

duplicates report
duplicates list

To check duplicates using a specific ID variable:

duplicates report respondent_id
duplicates list respondent_id

To tag duplicates:

duplicates tag respondent_id, generate(dup_id)
list respondent_id if dup_id > 0

Do not delete duplicates blindly. First, decide whether repeated records are actual errors. In panel data, repeated IDs may be expected because the same student, patient, firm, country, or organization appears across different time periods.

To remove exact duplicate rows, use:

duplicates drop, force

Only use this command after confirming that the duplicate rows are genuine errors. If duplicates are removed, record how many were removed and why.

Step 7: Recode Variables for Analysis

Recoding is common in dissertation analysis. You may need to group responses, create binary variables, combine small categories, reverse-code questionnaire items, or prepare variables for regression.

Example of recoding satisfaction scores:

recode satisfaction (1/2=0 "Low") (3=1 "Moderate") (4/5=2 "High"), generate(satisfaction_group)

Example of creating a binary variable:

generate employed = .
replace employed = 1 if employment_status == "Employed"
replace employed = 0 if employment_status == "Unemployed"

Example of creating age groups:

generate age_group = .
replace age_group = 1 if age < 25
replace age_group = 2 if age >= 25 & age < 35
replace age_group = 3 if age >= 35

Recoding decisions should match your research questions and analysis plan. For example, if your dissertation compares employed and unemployed respondents, your employment variable must be coded clearly before analysis. If you combine categories, explain the reason in your methodology.

Step 8: Clean Likert-Scale Data in Stata

Many student dissertations use Likert-scale questions. These may measure satisfaction, motivation, anxiety, leadership style, service quality, academic performance, employee engagement, behavioral intention, or patient experience.

A common five-point Likert scale may look like this:

ResponseCode
Strongly disagree1
Disagree2
Neutral3
Agree4
Strongly agree5

If Likert responses are stored as text, standardize the text first:

replace q1 = trim(lower(q1))
replace q1 = "strongly disagree" if q1 == "strongly disagree "

If Likert responses are numeric but use placeholder missing codes, recode them:

replace q1 = . if q1 == 99
replace q2 = . if q2 == 99
replace q3 = . if q3 == 99

To create a scale score, use:

egen satisfaction_scale = rowmean(q1 q2 q3 q4 q5)

Before creating a scale score, check whether all items are coded in the same direction. Negatively worded items must often be reverse-coded.

For a 1–5 scale, reverse coding can be done like this:

generate q3_reverse = 6 - q3

If you forget to reverse-code negatively worded items, your scale score may be incorrect. This can affect reliability analysis, correlation, regression, and interpretation.

Step 9: Add Variable Labels and Value Labels

Labels make Stata output easier to understand. This is important when preparing dissertation tables, interpreting results, and explaining variables in Chapter 4.

Add variable labels:

label variable age "Respondent age"
label variable gender "Respondent gender"
label variable satisfaction_score "Overall satisfaction score"

Add value labels:

label define gender_lbl 1 "Male" 2 "Female"
label values gender gender_lbl

Variable labels describe what a variable means. Value labels describe what numeric codes mean. For example, without value labels, gender may appear as 1 and 2. With value labels, Stata can show Male and Female.

Clear labels are especially helpful when exporting tables, sharing output with a supervisor, or interpreting findings in a dissertation results chapter.

Step 10: Check Outliers and Impossible Values

Outliers are values that are unusually high, unusually low, or impossible. Some outliers are valid, while others are caused by data entry errors.

Use these commands:

summarize age, detail
summarize income, detail
graph box income
tab age

Check impossible values using list:

list respondent_id age if age < 18 | age > 100
list respondent_id income if income < 0
list respondent_id satisfaction if satisfaction < 1 | satisfaction > 5
VariablePossible Error
AgeNegative values or unrealistic ages
IncomeNegative income where not possible
Likert scaleValues outside the expected 1–5 range
Test scoresScores above the maximum possible score
PercentagesValues above 100
DatesFuture dates where not expected

Do not remove outliers simply because they are extreme. First, investigate whether the values are valid or incorrect. A high income value may be real. A satisfaction score of 9 on a 1–5 scale is likely an error. If outliers are removed, corrected, or winsorized, explain the decision clearly.

Step 11: Clean Date Variables in Stata

Date variables often import incorrectly from Excel or CSV files. A date may appear as text, a number, or a format that Stata does not recognize. This matters when analyzing time-series data, panel data, pre-post intervention data, longitudinal data, or repeated observations.

For daily dates:

generate date_clean = date(date_string, "DMY")
format date_clean %td

For monthly dates:

generate month_clean = monthly(month_string, "YM")
format month_clean %tm

After cleaning dates, check whether they look correct:

list date_string date_clean in 1/10
summarize date_clean

If the converted dates look wrong, check the original format. A date written as day-month-year requires a different format from one written as month-day-year. Correct date formatting is important before running trend analysis, panel regression, time-based comparisons, or pre-post analysis.

Step 12: Merge Datasets Safely in Stata

Some dissertation projects use more than one dataset. For example, one file may contain survey responses and another file may contain demographic information. Another project may combine country-level data, firm-level data, financial data, or panel data from different sources.

Before merging, check whether the ID variable is unique:

isid respondent_id
duplicates report respondent_id

Then merge the datasets:

merge 1:1 respondent_id using "demographics.dta"
tab _merge

The _merge variable shows whether observations matched between the two datasets.

_merge ValueMeaning
1Observation only in the master dataset
2Observation only in the using dataset
3Observation matched in both datasets

Always check _merge after merging. A poor merge can duplicate records, drop cases, mismatch participants, or create incorrect analysis results. If your dissertation involves regression after cleaning, you can continue with how to run regression analysis in Stata after validating the final dataset.

Step 13: Validate the Cleaned Dataset

After cleaning, validate the dataset before running analysis. Validation confirms that the dataset is ready for descriptive statistics, correlation, regression, t-tests, ANOVA, chi-square tests, panel analysis, or other Stata procedures.

Use:

count
describe
summarize
codebook
misstable summarize
duplicates report

Then check whether the dataset matches your research questions and analysis plan.

Validation CheckQuestion to Ask
Sample sizeDoes the number of observations match expectations?
Variable countAre all required variables present?
Variable typesAre numeric variables stored correctly?
Missing valuesAre missing values properly coded?
DuplicatesHave duplicate records been reviewed?
LabelsAre variable and value labels clear?
Likert itemsAre all scale items coded consistently?
OutliersHave unusual values been investigated?
Merged dataWere merge results checked?
Analysis readinessCan the dataset answer the research questions?

Validation helps prevent errors before they appear in your final output. It also makes your dissertation analysis easier to explain and defend.

Step 14: Save the Clean Dataset

Once the dataset is clean and validated, save it separately from the raw file.

save "clean_dissertation_data.dta", replace

To export the cleaned data to Excel:

export excel using "clean_dissertation_data.xlsx", firstrow(variables) replace

To export as CSV:

export delimited using "clean_dissertation_data.csv", replace

Use .dta if you will continue working in Stata. Use .xlsx if you need to share the dataset with a supervisor, classmate, or client who prefers Excel. Use .csv if the data will be used in another statistical tool.

A clean dataset should be saved once and used consistently for all analysis. This prevents confusion when reporting results.

Stata Data Cleaning Checklist for Students

StepCompleted?Notes
Raw dataset saved separately
Working dataset created
Do-file created
Log file created
Data imported correctly
Dataset structure inspected
Variable names cleaned
String variables checked
Numeric variables confirmed
Missing values reviewed
Placeholder missing codes recoded
Duplicate records checked
Duplicate decisions documented
Likert-scale items reviewed
Reverse-coded items corrected
Variables recoded where needed
Variable labels added
Value labels added
Outliers checked
Date variables cleaned
Merge results validated
Clean dataset saved
Cleaning decisions documented

Common Stata Data Cleaning Mistakes to Avoid

Overwriting the Raw Dataset

Never clean the only copy of your dataset. Keep the raw file unchanged and save a separate working file.

Ignoring Missing Values

Missing values can affect descriptive statistics, regression models, hypothesis tests, and scale scores. Always check and document them.

Treating Placeholder Codes as Real Values

Codes such as 999, -99, or 99 may represent missing values. If Stata treats them as real values, your results may be wrong.

Dropping Duplicates Without Checking Them

Some repeated records are valid, especially in panel data, repeated-measures data, transaction data, or longitudinal datasets. Check the meaning of each row before deleting duplicates.

Forgetting to Reverse-Code Likert Items

Negatively worded questionnaire items can distort scale scores if they are not reverse-coded correctly.

Running Analysis Before Checking Variable Types

If numbers are stored as strings, some Stata commands may fail or exclude variables from analysis.

Not Labeling Variables

Unlabeled output is harder to interpret. Labels help you understand tables, regression output, and exported results.

Not Checking Merge Results

Always review _merge after combining datasets. A faulty merge can affect your sample size and analysis results.

Removing Outliers Without Justification

Outliers should be investigated first. Removing them without explanation can weaken the credibility of your analysis.

Not Saving a Final Clean Dataset

Save one final clean dataset and use it consistently for all analysis, tables, and reporting.

When Should Students Get Help Cleaning Data in Stata?

You may need Stata data cleaning help if your dataset has many missing values, duplicate responses, unclear variable names, inconsistent Likert-scale responses, badly formatted dates, merge problems, coding errors, or variables that do not work in Stata commands.

Professional support is also useful when your dissertation requires regression, panel data analysis, survey analysis, descriptive statistics, correlation analysis, hypothesis testing, or Chapter 4 results interpretation. Students who are unsure whether Stata is the right software can also compare options using SPSS vs Stata.

For complex projects, Stata assignment help, dissertation data analysis help, and data cleaning services can help students move from messy data to clean, analysis-ready results.

FAQs About Cleaning Data in Stata

What is data cleaning in Stata?

Data cleaning in Stata is the process of preparing raw data for statistical analysis. It includes checking missing values, duplicates, variable types, labels, outliers, coding errors, inconsistent responses, and merge problems. Clean data helps students produce accurate dissertation, thesis, and assignment results.

How do I check missing values in Stata?

You can check missing values in Stata using misstable summarize, misstable patterns, codebook, and tab variable, missing. You should also check for placeholder codes such as 999, -99, or 99, because these may need to be recoded as missing values before analysis.

How do I remove duplicates in Stata?

Use duplicates report to identify duplicates and duplicates list to inspect them. If the duplicates are true errors, you can remove exact duplicate rows using duplicates drop, force. However, repeated records may be valid in panel data, repeated-measures data, or longitudinal datasets.

How do I convert string variables to numeric in Stata?

Use the destring command when a numeric variable is stored as text. For example, destring age, replace converts age from string to numeric. For categorical text variables such as gender or education level, use encode gender, generate(gender_num).

How do I clean Likert-scale data in Stata?

Clean Likert-scale data by checking coding consistency, recoding missing values, correcting invalid responses, and reverse-coding negatively worded items. After that, you can create scale scores using commands such as egen scale_score = rowmean(q1 q2 q3 q4 q5).

Should I remove outliers before analysis?

Do not remove outliers automatically. First, check whether the values are valid or caused by data entry errors. If you remove, recode, or adjust outliers, explain the decision clearly in your methodology or data preparation section.

How do I save cleaned data in Stata?

Use save "clean_data.dta", replace to save your cleaned dataset in Stata format. You can also export it to Excel using export excel using "clean_data.xlsx", firstrow(variables) replace or to CSV using export delimited.

Can I clean questionnaire data in Stata?

Yes. Stata can clean questionnaire data by checking missing responses, recoding Likert-scale items, labeling variables, removing duplicate responses, creating scale scores, and preparing variables for descriptive statistics, reliability analysis, correlation, regression, or hypothesis testing.

Conclusion

Cleaning data in Stata is a key step before dissertation analysis. It helps you prepare accurate, consistent, and analysis-ready data for descriptive statistics, correlation, regression, t-tests, ANOVA, chi-square tests, panel analysis, and Chapter 4 reporting.

A strong Stata cleaning workflow starts with a raw data backup, uses a do-file, checks variable types, handles missing values, reviews duplicates, recodes variables, labels data, checks outliers, validates the cleaned dataset, and saves a final clean version for analysis.

For students, clean data is not just a technical task. It protects the quality of your dissertation findings and makes your results easier to explain, defend, and report.