★ Capstone Project
Part 2 of 3
Data Quality & Cleaning
The patient satisfaction dataset arrives with errors, blank values, duplicate records, and out-of-range survey scores. Work through each quality issue systematically — from profiling to fixing.
Scenario
Your boss flagged issues with the patient satisfaction results dataset before it can be used for reporting. You'll work through four layers of data quality: correcting a broken date value using Power Query's error-handling tools, filling in missing gender values, removing duplicate records, and identifying and correcting survey responses that fall outside the valid 1–10 range.
Capstone Instructions
| Steps | Directions |
|---|---|
| Section 4 — Handling Errors in the birth_date Column | |
| Load patient_satisfaction_results.csv into a new query in the Power Query editor. | |
| Enable Column distribution and Column profile in the data preview. | and |
| Locate the birth_date column. Notice that no distribution or profile data is shown — investigate why. | Look for the error indicator in the column header |
| Select Keep Errors on this column to isolate the problematic value. | |
| Review the error row, then delete the "Kept Errors" step to undo it. | Click the ✕ next to the step in Applied Steps |
| Replace the error value with the correct date: 1973-05-13. | |
| Filter birth_date to only include dates between 1960-01-01 and 1980-12-31. | Use the column header dropdown filter |
Checkpoint 4After filtering to 1960–1980, there are 236 rows. How many unique birth date values does the filtered column contain?
| Steps | Directions |
|---|---|
| Section 5 — Assessing Column Quality & Missing Data | |
| Enable the Column quality feature and identify which columns contain empty values. | |
| Undo the "Filtered Rows" step on birth_date to restore all rows. | Click ✕ on the Filtered Rows step |
| Replace all empty values in patient_gender with the text "unspecified". | |
| For question_1 through question_10: any column with fewer than 3% empty cells — replace those empties with the column's average, rounded up to the nearest whole number. | Check Column statistics for each column's average and empty %. Use Replace Values per column. |
| For remaining question columns: any column still with fewer than 5% empty cells — remove those rows entirely. | |
Checkpoint 5Among respondents with "unspecified" gender, what proportion answered Neutral or Slightly Satisfied (score of 5 or 6) for Question 5? Round to the nearest whole percentage.
| Steps | Directions |
|---|---|
| Section 6 — Removing Duplicate Records | |
| If patient_gender is still filtered, clear the filter. Do not delete the "Filtered Rows" step — other filters are combined there. | Use the column dropdown to clear filter only |
| Select the patient_id column and observe the distinct vs. unique counts in the Column profile — are they the same? | |
| Sort the dataset by patient_id ascending and visually scan for duplicated values. | |
| Select Keep Duplicates to isolate all duplicated rows and assess the problem. | |
| Note the distinct and unique counts for patient_id while duplicates are kept visible. | |
Checkpoint 6How many duplicated patient IDs were there in the dataset before correction?
| Steps (continued) | Directions |
|---|---|
| Undo the "Kept Duplicates" step, then apply Remove Duplicates to clean the data. | , then |
| Steps | Directions |
|---|---|
| Section 7 — Outlier Detection in Survey Scores | |
| Review the Column profile for question_1 through question_10 to identify any columns with values outside the valid range of 1–10. | Check min/max in Column statistics for each column |
| Any value greater than 10 with duplicated digits is a data entry typo — reduce it to a single digit (e.g., 66 → 6). | |
| Rename the applied step just created to Corrected entry typo. | Right-click the step in Applied Steps › Rename |
| Replace any remaining negative values (unhappy respondents who wrote their own score) with 1 — the lowest valid score. | per affected column |
| Rename the last two applied steps to clearly identify them as outlier corrections. | Right-click each step › Rename |
Checkpoint 7After all corrections, how many survey respondents were extremely dissatisfied — scoring 1 on both Question 7 and Question 10?
Rename your steps as you go. Power Query auto-names steps generically (e.g. "Replaced Value1", "Replaced Value2"). When you have multiple corrections in the same query, rename each step immediately so future you — or a teammate — understands what each one does without having to re-read the formula.