Data Quality & Cleaning

The patient satisfaction dataset arrives with errors, blank values, duplicate records, and out-of-range survey scores. Work through each quality issue systematically — from profiling to fixing.

~50 minutes

Level: Intermediate

1 dataset, 4 quality passes

Scenario

Your boss flagged issues with the patient satisfaction results dataset before it can be used for reporting. You'll work through four layers of data quality: correcting a broken date value using Power Query's error-handling tools, filling in missing gender values, removing duplicate records, and identifying and correcting survey responses that fall outside the valid 1–10 range.

Capstone Instructions

Steps	Directions
Section 4 — Handling Errors in the birth_date Column
Load patient_satisfaction_results.csv into a new query in the Power Query editor.
Enable Column distribution and Column profile in the data preview.	View›Column distribution and Column profile
Locate the birth_date column. Notice that no distribution or profile data is shown — investigate why.	Look for the error indicator in the column header
Select Keep Errors on this column to isolate the problematic value.	Click birth_date dropdown›Keep Errors
Review the error row, then delete the "Kept Errors" step to undo it.	Click the ✕ next to the step in Applied Steps
Replace the error value with the correct date: 1973-05-13.	Home›Replace Errors
Filter birth_date to only include dates between 1960-01-01 and 1980-12-31.	Use the column header dropdown filter

Checkpoint 4After filtering to 1960–1980, there are 236 rows. How many unique birth date values does the filtered column contain?

Steps	Directions
Section 5 — Assessing Column Quality & Missing Data
Enable the Column quality feature and identify which columns contain empty values.	View›Column quality
Undo the "Filtered Rows" step on birth_date to restore all rows.	Click ✕ on the Filtered Rows step
Replace all empty values in patient_gender with the text "unspecified".	Select column›Transform›Replace Values
For question_1 through question_10: any column with fewer than 3% empty cells — replace those empties with the column's average, rounded up to the nearest whole number.	Check Column statistics for each column's average and empty %. Use Replace Values per column.
For remaining question columns: any column still with fewer than 5% empty cells — remove those rows entirely.	Filter column›Remove blank rows

Checkpoint 5Among respondents with "unspecified" gender, what proportion answered Neutral or Slightly Satisfied (score of 5 or 6) for Question 5? Round to the nearest whole percentage.

Steps	Directions
Section 6 — Removing Duplicate Records
If patient_gender is still filtered, clear the filter. Do not delete the "Filtered Rows" step — other filters are combined there.	Use the column dropdown to clear filter only
Select the patient_id column and observe the distinct vs. unique counts in the Column profile — are they the same?
Sort the dataset by patient_id ascending and visually scan for duplicated values.	Click column header›Sort Ascending
Select Keep Duplicates to isolate all duplicated rows and assess the problem.	Home›Keep Rows›Keep Duplicates
Note the distinct and unique counts for patient_id while duplicates are kept visible.

Checkpoint 6How many duplicated patient IDs were there in the dataset before correction?

Steps (continued)	Directions
Undo the "Kept Duplicates" step, then apply Remove Duplicates to clean the data.	✕ Kept Duplicates step, then Home›Remove Rows›Remove Duplicates

Steps	Directions
Section 7 — Outlier Detection in Survey Scores
Review the Column profile for question_1 through question_10 to identify any columns with values outside the valid range of 1–10.	Check min/max in Column statistics for each column
Any value greater than 10 with duplicated digits is a data entry typo — reduce it to a single digit (e.g., 66 → 6).	Transform›Replace Values
Rename the applied step just created to Corrected entry typo.	Right-click the step in Applied Steps › Rename
Replace any remaining negative values (unhappy respondents who wrote their own score) with 1 — the lowest valid score.	Replace Values per affected column
Rename the last two applied steps to clearly identify them as outlier corrections.	Right-click each step › Rename

Checkpoint 7After all corrections, how many survey respondents were extremely dissatisfied — scoring 1 on both Question 7 and Question 10?

Rename your steps as you go. Power Query auto-names steps generically (e.g. "Replaced Value1", "Replaced Value2"). When you have multiple corrections in the same query, rename each step immediately so future you — or a teammate — understands what each one does without having to re-read the formula.

Capstone Part 1 — Loading Data

Capstone Part 3 — Transformations