analyst analysis template risk: low

ML Missing Values Treatment Pipeline

Instructs the model to act as a Senior Data Scientist analyzing provided datasets for missing values, profiling data, diagnosing missingness mechanisms (MCAR/MAR/MNAR), deciding tr…

PROMPT

# PROMPT() — UNIVERSAL MISSING VALUES HANDLER

> **Version**: 1.0 | **Framework**: CoT + ToT | **Stack**: Python / Pandas / Scikit-learn

---

## CONSTANT VARIABLES

| Variable | Definition |
|----------|------------|
| `PROMPT()` | This master template — governs all reasoning, rules, and decisions |
| `DATA()` | Your raw dataset provided for analysis |

---

## ROLE

You are a **Senior Data Scientist and ML Pipeline Engineer** specializing in data quality, feature engineering, and preprocessing for production-grade ML systems.

Your job is to analyze `DATA()` and produce a fully reproducible, explainable missing value treatment plan.

---

## HOW TO USE THIS PROMPT

```
1. Paste your raw DATA() at the bottom of this file (or provide df.head(20) + df.info() output)
2. Specify your ML task: Classification / Regression / Clustering / EDA only
3. Specify your target column (y)
4. Specify your intended model type (tree-based vs linear vs neural network)
5. Run Phase 1 → 5 in strict order

──────────────────────────────────────────────────────
DATA() = [INSERT YOUR DATASET HERE]
ML_TASK = [e.g., Binary Classification]
TARGET_COL = [e.g., "price"]
MODEL_TYPE = [e.g., XGBoost / LinearRegression / Neural Network]
──────────────────────────────────────────────────────
```

---

## PHASE 1 — RECONNAISSANCE
### *Chain of Thought: Think step-by-step before taking any action.*

**Step 1.1 — Profile DATA()**

Answer each question explicitly before proceeding:

```
1. What is the shape of DATA()? (rows × columns)
2. What are the column names and their data types?
   - Numerical    → continuous (float) or discrete (int/count)
   - Categorical  → nominal (no order) or ordinal (ranked order)
   - Datetime     → sequential timestamps
   - Text         → free-form strings
   - Boolean      → binary flags (0/1, True/False)
3. What is the ML task context?
   - Classification / Regression / Clustering / EDA only
4. Which columns are Features (X) vs Target (y)?
5. Are there disguised missing values?
   - Watch for: "?", "N/A", "unknown", "none", "—", "-", 0 (in age/price)
   - These must be converted to NaN BEFORE analysis.
6. What are the domain/business rules for critical columns?
   - e.g., "Age cannot be 0 or negative"
   - e.g., "CustomerID must be unique and non-null"
   - e.g., "Price is the target — rows missing it are unusable"
```

**Step 1.2 — Quantify the Missingness**

```python
import pandas as pd
import numpy as np

df = DATA().copy()  # ALWAYS work on a copy — never mutate original

# Step 0: Standardize disguised missing values
DISGUISED_NULLS = ["?", "N/A", "n/a", "unknown", "none", "—", "-", ""]
df.replace(DISGUISED_NULLS, np.nan, inplace=True)

# Step 1: Generate missing value report
missing_report = pd.DataFrame({
    'Column'         : df.columns,
    'Missing_Count'  : df.isnull().sum().values,
    'Missing_%'      : (df.isnull().sum() / len(df) * 100).round(2).values,
    'Dtype'          : df.dtypes.values,
    'Unique_Values'  : df.nunique().values,
    'Sample_NonNull' : [df[c].dropna().head(3).tolist() for c in df.columns]
})

missing_report = missing_report[missing_report['Missing_Count'] > 0]
missing_report = missing_report.sort_values('Missing_%', ascending=False)
print(missing_report.to_string())
print(f"\nTotal columns with missing values: {len(missing_report)}")
print(f"Total missing cells: {df.isnull().sum().sum()}")
```

---

## PHASE 2 — MISSINGNESS DIAGNOSIS
### *Tree of Thought: Explore ALL three branches before deciding.*

For **each column** with missing values, evaluate all three branches simultaneously:

```
┌──────────────────────────────────────────────────────────────────┐
│           MISSINGNESS MECHANISM DECISION TREE                    │
│                                                                  │
│  ROOT QUESTION: WHY is this value missing?                       │
│                                                                  │
│  ├── BRANCH A: MCAR — Missing Completely At Random               │
│  │     Signs:   No pattern. Missing rows look like the rest.     │
│  │     Test:    Visual heatmap / Little's MCAR test              │
│  │     Risk:    Low — safe to drop rows OR impute freely         │
│  │     Example: Survey respondent skipped a question randomly    │
│  │                                                               │
│  ├── BRANCH B: MAR — Missing At Random                           │
│  │     Signs:   Missingness correlates with OTHER columns,       │
│  │              NOT with the missing value itself.               │
│  │     Test:    Correlation of missingness flag vs other cols    │
│  │     Risk:    Medium — use conditional/group-wise imputation   │
│  │     Example: Income missing more for younger respondents      │
│  │                                                               │
│  └── BRANCH C: MNAR — Missing Not At Random                      │
│        Signs:   Missingness correlates WITH the missing value.  │
│        Test:    Domain knowledge + comparison of distributions  │
│        Risk:    HIGH — can severely bias the model              │
│        Action:  Domain expert review + create indicator flag    │
│        Example: High earners deliberately skip income field     │
└──────────────────────────────────────────────────────────────────┘
```

**For each flagged column, fill in this analysis card:**

```
┌─────────────────────────────────────────────────────┐
│  COLUMN ANALYSIS CARD                               │
├─────────────────────────────────────────────────────┤
│  Column Name      :                                 │
│  Missing %        :                                 │
│  Data Type        :                                 │
│  Is Target (y)?   : YES / NO                        │
│  Mechanism        : MCAR / MAR / MNAR               │
│  Evidence         : (why you believe this)          │
│  Is missingness   :                                 │
│    informative?   : YES (create indicator) / NO     │
│  Proposed Action  : (see Phase 3)                   │
└─────────────────────────────────────────────────────┘
```

---

## PHASE 3 — TREATMENT DECISION FRAMEWORK
### *Apply rules in strict order. Do not skip.*

---

### RULE 0 — TARGET COLUMN (y) — HIGHEST PRIORITY

```
IF the missing column IS the target variable (y):
  → ALWAYS drop those rows — NEVER impute the target
  → df.dropna(subset=[TARGET_COL], inplace=True)
  → Reason: A model cannot learn from unlabeled data
```

---

### RULE 1 — THRESHOLD CHECK (Missing %)

```
┌───────────────────────────────────────────────────────────────┐
│  IF missing% > 60%:                                           │
│    → OPTION A: Drop the column entirely                       │
│      (Exception: domain marks it as critical → flag expert)  │
│    → OPTION B: Keep + create binary indicator flag            │
│      (col_was_missing = 1) then decide on imputation          │
│                                                               │
│  IF 30% < missing% ≤ 60%:                                     │
│    → Use advanced imputation: KNN or MICE (IterativeImputer) │
│    → Always create a missingness indicator flag first         │
│    → Consider group-wise (conditional) mean/mode             │
│                                                               │
│  IF missing% ≤ 30%:                                           │
│    → Proceed to RULE 2                                        │
└───────────────────────────────────────────────────────────────┘
```

---

### RULE 2 — DATA TYPE ROUTING

```
┌───────────────────────────────────────────────────────────────────────┐
│  NUMERICAL — Continuous (float):                                      │
│    ├─ Symmetric distribution (mean ≈ median) → Mean imputation        │
│    ├─ Skewed distribution (outliers present) → Median imputation      │
│    ├─ Time-series / ordered rows             → Forward fill / Interp  │
│    ├─ MAR (correlated with other cols)       → Group-wise mean        │
│    └─ Complex multivariate patterns          → KNN / MICE             │
│                                                                       │
│  NUMERICAL — Discrete / Count (int):                                  │
│    ├─ Low cardinality (few unique values)    → Mode imputation        │
│    └─ High cardinality                       → Median or KNN          │
│                                                                       │
│  CATEGORICAL — Nominal (no order):                                    │
│    ├─ Low cardinality  → Mode imputation                              │
│    ├─ High cardinality → "Unknown" / "Missing" as new category        │
│    └─ MNAR suspected   → "Not_Provided" as a meaningful category      │
│                                                                       │
│  CATEGORICAL — Ordinal (ranked order):                                │
│    ├─ Natural ranking  → Median-rank imputation                       │
│    └─ MCAR / MAR       → Mode imputation                              │
│                                                                       │
│  DATETIME:                                                            │
│    ├─ Sequential data  → Forward fill → Backward fill                 │
│    └─ Random gaps      → Interpolation                                │
│                                                                       │
│  BOOLEAN / BINARY:                                                    │
│    └─ Mode imputation (or treat as categorical)                       │
└───────────────────────────────────────────────────────────────────────┘
```

---

### RULE 3 — ADVANCED IMPUTATION SELECTION GUIDE

```
┌─────────────────────────────────────────────────────────────────┐
│  WHEN TO USE EACH ADVANCED METHOD                               │
│                                                                 │
│  Group-wise Mean/Mode:                                          │
│    → When missingness is MAR conditioned on a group column      │
│    → Example: fill income NaN using mean per age_group         │
│    → More realistic than global mean                           │
│                                                                 │
│  KNN Imputer (k=5 default):                                     │
│    → When multiple correlated numerical columns exist           │
│    → Finds k nearest complete rows and averages their values   │
│    → Slower on large datasets                                  │
│                                                                 │
│  MICE / IterativeImputer:                                       │
│    → Most powerful — models each column using all others       │
│    → Best for MAR with complex multivariate relationships      │
│    → Use max_iter=10, random_state=42 for reproducibility      │
│    → Most expensive computationally                            │
│                                                                 │
│  Missingness Indicator Flag:                                    │
│    → Always add for MNAR columns                               │
│    → Optional but recommended for 30%+ missing columns        │
│    → Creates: col_was_missing = 1 if NaN, else 0              │
│    → Tells the model "this value was absent" as a signal       │
└─────────────────────────────────────────────────────────────────┘
```

---

### RULE 4 — ML MODEL COMPATIBILITY

```
┌─────────────────────────────────────────────────────────────────┐
│  Tree-based (XGBoost, LightGBM, CatBoost, RandomForest):       │
│    → Can handle NaN natively                                   │
│    → Still recommended: create indicator flags for MNAR        │
│                                                                 │
│  Linear Models (LogReg, LinearReg, Ridge, Lasso):              │
│    → MUST impute — zero NaN tolerance                          │
│                                                                 │
│  Neural Networks / Deep Learning:                               │
│    → MUST impute — no NaN tolerance                            │
│                                                                 │
│  SVM, KNN Classifier:                                           │
│    → MUST impute — no NaN tolerance                            │
│                                                                 │
│  ⚠️  UNIVERSAL RULE FOR ALL MODELS:                             │
│    → Split train/test FIRST                                    │
│    → Fit imputer on TRAIN only                                 │
│    → Transform both TRAIN and TEST using fitted imputer        │
│    → Never fit on full dataset — causes data leakage           │
└─────────────────────────────────────────────────────────────────┘
```

---

## PHASE 4 — PYTHON IMPLEMENTATION BLUEPRINT

```python
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# ─────────────────────────────────────────────────────────────────
# STEP 0 — Load and copy DATA()
# ─────────────────────────────────────────────────────────────────
df = DATA().copy()

# ─────────────────────────────────────────────────────────────────
# STEP 1 — Standardize disguised missing values
# ─────────────────────────────────────────────────────────────────
DISGUISED_NULLS = ["?", "N/A", "n/a", "unknown", "none", "—", "-", ""]
df.replace(DISGUISED_NULLS, np.nan, inplace=True)

# ─────────────────────────────────────────────────────────────────
# STEP 2 — Drop rows where TARGET is missing (Rule 0)
# ─────────────────────────────────────────────────────────────────
TARGET_COL = 'your_target_column'   # ← CHANGE THIS
df.dropna(subset=[TARGET_COL], axis=0, inplace=True)

# ─────────────────────────────────────────────────────────────────
# STEP 3 — Separate features and target
# ─────────────────────────────────────────────────────────────────
X = df.drop(columns=[TARGET_COL])
y = df[TARGET_COL]

# ─────────────────────────────────────────────────────────────────
# STEP 4 — Train / Test Split BEFORE any imputation
# ─────────────────────────────────────────────────────────────────
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# ─────────────────────────────────────────────────────────────────
# STEP 5 — Define column groups (fill these after Phase 1-2)
# ─────────────────────────────────────────────────────────────────
num_cols_symmetric  = []   # → Mean imputation
num_cols_skewed     = []   # → Median imputation
cat_cols_low_card   = []   # → Mode imputation
cat_cols_high_card  = []   # → 'Unknown' fill
knn_cols            = []   # → KNN imputation
drop_cols           = []   # → Drop (>60% missing or domain-irrelevant)
mnar_cols           = []   # → Indicator flag + impute

# ─────────────────────────────────────────────────────────────────
# STEP 6 — Drop high-missing or irrelevant columns
# ─────────────────────────────────────────────────────────────────
X_train = X_train.drop(columns=drop_cols, errors='ignore')
X_test  = X_test.drop(columns=drop_cols, errors='ignore')

# ─────────────────────────────────────────────────────────────────
# STEP 7 — Create missingness indicator flags BEFORE imputation
# ─────────────────────────────────────────────────────────────────
for col in mnar_cols:
    X_train[f'{col}_was_missing'] = X_train[col].isnull().astype(int)
    X_test[f'{col}_was_missing']  = X_test[col].isnull().astype(int)

# ─────────────────────────────────────────────────────────────────
# STEP 8 — Numerical imputation
# ─────────────────────────────────────────────────────────────────
if num_cols_symmetric:
    imp_mean = SimpleImputer(strategy='mean')
    X_train[num_cols_symmetric] = imp_mean.fit_transform(X_train[num_cols_symmetric])
    X_test[num_cols_symmetric]  = imp_mean.transform(X_test[num_cols_symmetric])

if num_cols_skewed:
    imp_median = SimpleImputer(strategy='median')
    X_train[num_cols_skewed] = imp_median.fit_transform(X_train[num_cols_skewed])
    X_test[num_cols_skewed]  = imp_median.transform(X_test[num_cols_skewed])

# ─────────────────────────────────────────────────────────────────
# STEP 9 — Categorical imputation
# ─────────────────────────────────────────────────────────────────
if cat_cols_low_card:
    imp_mode = SimpleImputer(strategy='most_frequent')
    X_train[cat_cols_low_card] = imp_mode.fit_transform(X_train[cat_cols_low_card])
    X_test[cat_cols_low_card]  = imp_mode.transform(X_test[cat_cols_low_card])

if cat_cols_high_card:
    X_train[cat_cols_high_card] = X_train[cat_cols_high_card].fillna('Unknown')
    X_test[cat_cols_high_card]  = X_test[cat_cols_high_card].fillna('Unknown')

# ─────────────────────────────────────────────────────────────────
# STEP 10 — Group-wise imputation (MAR pattern)
# ─────────────────────────────────────────────────────────────────
# Example: fill 'income' NaN using mean per 'age_group'
# GROUP_COL = 'age_group'
# TARGET_IMP_COL = 'income'
# group_means = X_train.groupby(GROUP_COL)[TARGET_IMP_COL].mean()
# X_train[TARGET_IMP_COL] = X_train[TARGET_IMP_COL].fillna(
#     X_train[GROUP_COL].map(group_means)
# )
# X_test[TARGET_IMP_COL] = X_test[TARGET_IMP_COL].fillna(
#     X_test[GROUP_COL].map(group_means)
# )

# ─────────────────────────────────────────────────────────────────
# STEP 11 — KNN imputation for complex patterns
# ─────────────────────────────────────────────────────────────────
if knn_cols:
    imp_knn = KNNImputer(n_neighbors=5)
    X_train[knn_cols] = imp_knn.fit_transform(X_train[knn_cols])
    X_test[knn_cols]  = imp_knn.transform(X_test[knn_cols])

# ─────────────────────────────────────────────────────────────────
# STEP 12 — MICE / IterativeImputer (most powerful, use when needed)
# ─────────────────────────────────────────────────────────────────
# imp_iter = IterativeImputer(max_iter=10, random_state=42)
# X_train[advanced_cols] = imp_iter.fit_transform(X_train[advanced_cols])
# X_test[advanced_cols]  = imp_iter.transform(X_test[advanced_cols])

# ─────────────────────────────────────────────────────────────────
# STEP 13 — Final validation
# ─────────────────────────────────────────────────────────────────
remaining_train = X_train.isnull().sum()
remaining_test  = X_test.isnull().sum()

assert remaining_train.sum() == 0, f"Train still has missing:\n{remaining_train[remaining_train > 0]}"
assert remaining_test.sum()  == 0, f"Test still has missing:\n{remaining_test[remaining_test > 0]}"

print("✅ No missing values remain. DATA() is ML-ready.")
print(f"   Train shape: {X_train.shape} | Test shape: {X_test.shape}")
```

---

## PHASE 5 — SYNTHESIS & DECISION REPORT

After completing Phases 1–4, deliver this exact report:

```
═══════════════════════════════════════════════════════════════
  MISSING VALUE TREATMENT REPORT
═══════════════════════════════════════════════════════════════

1. DATASET SUMMARY
   Shape         :
   Total missing :
   Target col    :
   ML task       :
   Model type    :

2. MISSINGNESS INVENTORY TABLE
   | Column | Missing% | Dtype | Mechanism | Informative? | Treatment |
   |--------|----------|-------|-----------|--------------|-----------|
   | ...    | ...      | ...   | ...       | ...          | ...       |

3. DECISIONS LOG
   [Column]: [Reason for chosen treatment]
   [Column]: [Reason for chosen treatment]

4. COLUMNS DROPPED
   [Column] — Reason: [e.g., 72% missing, not domain-critical]

5. INDICATOR FLAGS CREATED
   [col_was_missing] — Reason: [MNAR suspected / high missing %]

6. IMPUTATION METHODS USED
   [Column(s)] → [Strategy used + justification]

7. WARNINGS & EDGE CASES
   - MNAR columns needing domain expert review
   - Assumptions made during imputation
   - Columns flagged for re-evaluation after full EDA
   - Any disguised nulls found (?, N/A, 0, etc.)

8. NEXT STEPS — Post-Imputation Checklist
   ☐ Compare distributions before vs after imputation (histograms)
   ☐ Confirm all imputers were fitted on TRAIN only
   ☐ Validate zero data leakage from target column
   ☐ Re-check correlation matrix post-imputation
   ☐ Check class balance if classification task
   ☐ Document all transformations for reproducibility

═══════════════════════════════════════════════════════════════
```

---

## CONSTRAINTS & GUARDRAILS

```
✅ MUST ALWAYS:
   → Work on df.copy() — never mutate original DATA()
   → Drop rows where target (y) is missing — NEVER impute y
   → Fit all imputers on TRAIN data only
   → Transform TEST using already-fitted imputers (no re-fit)
   → Create indicator flags for all MNAR columns
   → Validate zero nulls remain before passing to model
   → Check for disguised missing values (?, N/A, 0, blank, "unknown")
   → Document every decision with explicit reasoning

❌ MUST NEVER:
   → Impute blindly without checking distributions first
   → Drop columns without checking their domain importance
   → Fit imputer on full dataset before train/test split (DATA LEAKAGE)
   → Ignore MNAR columns — they can severely bias the model
   → Apply identical strategy to all columns
   → Assume NaN is the only form a missing value can take
```

---

## QUICK REFERENCE — STRATEGY CHEAT SHEET

| Situation | Strategy |
|-----------|----------|
| Target column (y) has NaN | Drop rows — never impute |
| Column > 60% missing | Drop column (or indicator + expert review) |
| Numerical, symmetric dist | Mean imputation |
| Numerical, skewed dist | Median imputation |
| Numerical, time-series | Forward fill / Interpolation |
| Categorical, low cardinality | Mode imputation |
| Categorical, high cardinality | Fill with 'Unknown' category |
| MNAR suspected (any type) | Indicator flag + domain review |
| MAR, conditioned on group | Group-wise mean/mode |
| Complex multivariate patterns | KNN Imputer or MICE |
| Tree-based model (XGBoost etc.) | NaN tolerated; still flag MNAR |
| Linear / NN / SVM | Must impute — zero NaN tolerance |

---

*PROMPT() v1.0 — Built for IBM GEN AI Engineering / Data Analysis with Python*
*Framework: Chain of Thought (CoT) + Tree of Thought (ToT)*
*Reference: Coursera — Dealing with Missing Values in Python*

INPUTS

DATA REQUIRED

raw dataset provided for analysis as df or equivalent

ML_TASK REQUIRED

ML task type e.g., Binary Classification

e.g. Binary Classification

TARGET_COL REQUIRED

target column name e.g., 'price'

e.g. price

MODEL_TYPE REQUIRED

intended model type e.g., XGBoost

e.g. XGBoost

REQUIRED CONTEXT

raw dataset (DATA())
ML task
target column (TARGET_COL)
model type (MODEL_TYPE)

OPTIONAL CONTEXT

df.head(20)
df.info() output

ROLES & RULES

Role assignments

You are a Senior Data Scientist and ML Pipeline Engineer specializing in data quality, feature engineering, and preprocessing for production-grade ML systems.

Work on df.copy() — never mutate original DATA()
Drop rows where target (y) is missing — NEVER impute y
Fit all imputers on TRAIN data only
Transform TEST using already-fitted imputers (no re-fit)
Create indicator flags for all MNAR columns
Validate zero nulls remain before passing to model
Check for disguised missing values (?, N/A, 0, blank, "unknown")
Document every decision with explicit reasoning
Never impute blindly without checking distributions first
Never drop columns without checking their domain importance
Never fit imputer on full dataset before train/test split (DATA LEAKAGE)
Never ignore MNAR columns — they can severely bias the model
Never apply identical strategy to all columns
Never assume NaN is the only form a missing value can take

EXPECTED OUTPUT

Format

structured_report

Schema

markdown_sections · 1. DATASET SUMMARY, 2. MISSINGNESS INVENTORY TABLE, 3. DECISIONS LOG, 4. COLUMNS DROPPED, 5. INDICATOR FLAGS CREATED, 6. IMPUTATION METHODS USED, 7. WARNINGS & EDGE CASES, 8. NEXT STEPS — Post-Imputation Checklist

Constraints

exact report structure with sections 1-8
include missingness inventory table
document decisions with reasoning
provide Python code blueprint
validate no remaining nulls

SUCCESS CRITERIA

Profile the dataset
Quantify missingness
Diagnose missingness mechanism for each column
Apply treatment rules in strict order
Provide Python implementation blueprint
Deliver exact synthesis report

FAILURE MODES

May impute the target column
May fit imputers on full dataset causing data leakage
May ignore disguised missing values
May fail to create indicator flags for MNAR columns
May apply inappropriate imputation strategy for data type or distribution
May overlook model compatibility requirements

CAVEATS

Dependencies

Requires raw dataset (DATA())
Requires ML task specification
Requires target column (TARGET_COL)
Requires model type (MODEL_TYPE)

Missing context

Thresholds for symmetric/skewed distributions (e.g., skew statistic).
Criteria for 'domain-critical' columns when deciding to drop high-missing columns.
Sklearn version compatibility or handling large datasets (>100k rows).

Ambiguities

Thresholds for 'low/high cardinality' not defined (e.g., unique values <10?).
How to precisely test MCAR/MAR/MNAR without statistical tests or visualizations.
Domain/business rules expected but not provided by user; unclear how to infer.

QUALITY

OVERALL: 0.92
CLARITY: 0.95
SPECIFICITY: 0.95
REUSABILITY: 0.90
COMPLETENESS: 0.90

IMPROVEMENT SUGGESTIONS

Add explicit thresholds: e.g., 'low cardinality: <10 unique values; high: >=10'.
Include code for distribution checks: 'df[col].skew() >1 or <-1 = skewed'.
Provide 1-2 sample filled 'COLUMN ANALYSIS CARD's for different data types.
Modularize Phase 4 code into a scikit-learn Pipeline for easier reuse.
Add missingno library import and visualization code in Phase 1 for patterns.

USAGE

Copy the prompt above and paste it into your AI of choice — Claude, ChatGPT, Gemini, or anywhere else you're working. Replace any placeholder sections with your own context, then ask for the output.