analyst analysis user risk: low

CSV Data Audit and Cleaning Pipeline

Instructs the model to act as a Senior Data Science Architect, perform a technical audit on an uploaded CSV file including schema analysis and business impacts, propose imputation/…

PROMPT

I want you to act as a Senior Data Science Architect and Lead Business Analyst. I am uploading a CSV file that contains raw data. Your goal is to perform a deep technical audit and provide a production-ready cleaning pipeline that aligns with business objectives.

Please follow this 4-step execution flow:


1. Technical Audit & Business Context: Analyze the schema. Identify inconsistencies, missing values, and Data Smells. Briefly explain how these data issues might impact business decision-making (e.g., inconsistent dates may lead to incorrect monthly trend analysis).

2. Statistical Strategy: Propose a rigorous strategy for Imputation (Median vs. Mean), Encoding (One-Hot vs. Label), and Scaling (Standard vs. Robust) based on the audit.

3. The Implementation Block: Write a modular, PEP8-compliant Python script using pandas and scikit-learn. Include a Pipeline object so the code is ready for a Streamlit dashboard or an automated batch job.

4. Post-Processing Validation: Provide assertion checks to verify data integrity (e.g., checking for nulls or memory optimization via downcasting).

Constraints:

Prioritize memory efficiency (use appropriate dtypes like int8 or float32).

Ensure zero data leakage if a target variable is present.

Provide the output in structured Markdown with professional code comments.

I have uploaded the file. Please begin the audit.

REQUIRED CONTEXT

CSV file with raw data

ROLES & RULES

Role assignments

Act as a Senior Data Science Architect and Lead Business Analyst.

Rules

Prioritize memory efficiency (use appropriate dtypes like int8 or float32).
Ensure zero data leakage if a target variable is present.
Provide the output in structured Markdown with professional code comments.
Write a modular, PEP8-compliant Python script using pandas and scikit-learn.
Include a Pipeline object so the code is ready for a Streamlit dashboard or an automated batch job.

EXPECTED OUTPUT

Format

markdown

Schema

markdown_sections · Technical Audit & Business Context, Statistical Strategy, The Implementation Block, Post-Processing Validation

Constraints

structured Markdown
professional code comments
modular PEP8-compliant Python script
include Pipeline object
assertion checks
prioritize memory efficiency with dtypes like int8/float32
zero data leakage
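
As a rough illustration of what the Pipeline and zero-leakage constraints ask for, here is a minimal sketch, not the output the prompt is guaranteed to produce. The toy columns (age, city, churned) and the helper name build_preprocessor are hypothetical stand-ins for whatever the uploaded CSV contains. Median imputation, robust scaling, and one-hot encoding are just one of the combinations the prompt asks the model to choose between.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler


def build_preprocessor(numeric_cols, categorical_cols):
    """Return an unfitted ColumnTransformer for the given column groups."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", RobustScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    # sparse_threshold=0.0 forces a dense array, which keeps the demo simple.
    return ColumnTransformer(
        [("num", numeric, numeric_cols), ("cat", categorical, categorical_cols)],
        sparse_threshold=0.0,
    )


# Hypothetical toy data standing in for the uploaded CSV.
df = pd.DataFrame({
    "age": [25, np.nan, 47, 31, 52, 38],
    "city": pd.Series(["Oslo", "Rome", np.nan, "Oslo", "Rome", "Oslo"], dtype=object),
    "churned": [0, 1, 0, 0, 1, 1],
})

X = df.drop(columns="churned")
y = df["churned"]

# Split BEFORE fitting so no statistic is computed from held-out rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=2, random_state=0
)

preprocessor = build_preprocessor(["age"], ["city"])
X_train_clean = preprocessor.fit_transform(X_train)  # learns medians, IQRs, categories
X_test_clean = preprocessor.transform(X_test)        # reuses them, never refits
```

The point of the sketch is the ordering: the split happens first, fit_transform sees only the training rows, and the test rows go through transform alone.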

SUCCESS CRITERIA

Analyze the schema. Identify inconsistencies, missing values, and Data Smells.
Briefly explain how these data issues might impact business decision-making.
Propose a rigorous strategy for Imputation (Median vs. Mean), Encoding (One-Hot vs. Label), and Scaling (Standard vs. Robust) based on the audit.
Write a modular, PEP8-compliant Python script using pandas and scikit-learn.
Provide assertion checks to verify data integrity (e.g., checking for nulls or memory optimization via downcasting).
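
The validation criterion could be met by something like the sketch below. The function names downcast_numeric and validate, and the units/price columns, are hypothetical. The three assertions are only examples of integrity rules, not a complete set.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric columns shrunk to the smallest fitting dtype."""
    out = df.copy()
    for col in out.select_dtypes(include=np.integer).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=np.floating).columns:
        # Note: float32 keeps roughly 7 significant digits, so check that
        # this precision is acceptable for the column before downcasting.
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


def validate(before: pd.DataFrame, after: pd.DataFrame) -> None:
    """Raise AssertionError if cleaning broke a basic integrity rule."""
    assert len(after) == len(before), "row count changed"
    assert not after.isna().any().any(), "nulls remain after cleaning"
    assert (
        after.memory_usage(deep=True).sum()
        <= before.memory_usage(deep=True).sum()
    ), "memory usage grew"


# Hypothetical already-cleaned frame used only to exercise the checks.
raw = pd.DataFrame({"units": [1, 2, 3], "price": [9.5, 10.25, 11.0]})
small = downcast_numeric(raw)
validate(raw, small)
```

Bare assert statements are skipped when Python runs with the -O flag, so a batch job that depends on these checks may prefer raising explicit exceptions.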

FAILURE MODES

May fail without access to the uploaded CSV file.
Might not align cleaning with specific business objectives due to lack of details.
Could overlook data smells if audit is not deep enough.
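
One way to reduce the risk of overlooked data smells is to run a quick programmatic audit yourself and compare it with the model's findings. The sketch below is an assumption about what such a check might look like. The function name audit_frame, the smell labels, and the sample columns are all hypothetical, and the heuristics cover only a few common cases.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def audit_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise dtype, missing share, cardinality, and simple smells per column."""
    rows = []
    for col in df.columns:
        series = df[col]
        smells = []
        if series.nunique(dropna=True) <= 1:
            smells.append("constant")
        # Treat anything that is neither numeric nor datetime as text-like.
        if not is_numeric_dtype(series) and not is_datetime64_any_dtype(series):
            text = series.dropna().astype(str)
            if (text != text.str.strip()).any():
                smells.append("stray whitespace")
            if text.str.lower().nunique() < text.nunique():
                smells.append("inconsistent casing")
            if len(text) and pd.to_numeric(text, errors="coerce").notna().all():
                smells.append("numbers stored as text")
        rows.append({
            "column": col,
            "dtype": str(series.dtype),
            "missing_share": series.isna().mean(),
            "n_unique": series.nunique(dropna=True),
            "smells": ", ".join(smells),
        })
    return pd.DataFrame(rows).set_index("column")


# Hypothetical sample with three deliberately planted problems.
sample = pd.DataFrame({
    "country": ["US", "us", " US", None],
    "amount": ["10", "20", "30", "40"],
    "source": ["web", "web", "web", "web"],
})
report = audit_frame(sample)
print(report)
```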

CAVEATS

Dependencies

Requires uploaded CSV file.

Missing context

Actual CSV file or sample data/schema.
Specific business objectives or domain context.
Identification of target variable if applicable.

Ambiguities

Business objectives mentioned but not specified or provided.
Assumes CSV file upload without schema, sample data, or file details.
Unclear if target variable is present or what it is.

QUALITY

OVERALL: 0.85
CLARITY: 0.95
SPECIFICITY: 0.90
REUSABILITY: 0.75
COMPLETENESS: 0.80

IMPROVEMENT SUGGESTIONS

Use placeholders like {file_path} and {business_objectives} for reusability.
Provide a sample CSV schema or data snippet as an example.
Explicitly define the target variable, or add a placeholder for it.
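
The placeholder suggestion could be applied with a small template like the sketch below. The {file_path} and {business_objectives} placeholders come from the suggestion above. The {target_variable} placeholder, the wording of the template, and the function name render_prompt are assumptions added for illustration, as are the example values.

```python
# Hypothetical context block to prepend to the prompt; the wording is illustrative.
PROMPT_TEMPLATE = (
    "I am uploading a CSV file located at {file_path}. "
    "Business objectives: {business_objectives}. "
    "Target variable: {target_variable}."
)


def render_prompt(file_path, business_objectives, target_variable=None):
    """Fill the template, stating explicitly when no target variable exists."""
    return PROMPT_TEMPLATE.format(
        file_path=file_path,
        business_objectives=business_objectives,
        target_variable=target_variable or "none (cleaning only)",
    )


rendered = render_prompt(
    file_path="sales.csv",
    business_objectives="reduce customer churn",
    target_variable="churned",
)
print(rendered)
```

Stating the target variable up front, even as "none", addresses the ambiguity about whether the leakage constraint applies.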

USAGE

Copy the prompt above and paste it into your AI of choice — Claude, ChatGPT, Gemini, or anywhere else you're working. Replace any placeholder sections with your own context, then ask for the output.