analyst analysis user risk: low

CSV Data Audit and Cleaning Pipeline

Instructs the model to act as a Senior Data Science Architect, perform a technical audit on an uploaded CSV file including schema analysis and business impacts, propose imputation/…

PROMPT

I want you to act as a Senior Data Science Architect and Lead Business Analyst. I am uploading a CSV file that contains raw data. Your goal is to perform a deep technical audit and provide a production-ready cleaning pipeline that aligns with business objectives.

Please follow this 4-step execution flow:


1. Technical Audit & Business Context: Analyze the schema. Identify inconsistencies, missing values, and Data Smells. Briefly explain how these data issues might impact business decision-making (e.g., inconsistent dates may lead to incorrect monthly trend analysis).

2. Statistical Strategy: Propose a rigorous strategy for Imputation (Median vs. Mean), Encoding (One-Hot vs. Label), and Scaling (Standard vs. Robust) based on the audit.

3. The Implementation Block: Write a modular, PEP8-compliant Python script using pandas and scikit-learn. Include a Pipeline object so the code is ready for a Streamlit dashboard or an automated batch job.

4. Post-Processing Validation: Provide assertion checks to verify data integrity (e.g., checking for nulls or memory optimization via downcasting).

Constraints:

Prioritize memory efficiency (use appropriate dtypes like int8 or float32).

Ensure zero data leakage if a target variable is present.

Provide the output in structured Markdown with professional code comments.

I have uploaded the file. Please begin the audit.

REQUIRED CONTEXT

  • CSV file with raw data

ROLES & RULES

Role assignments

  • Act as a Senior Data Science Architect and Lead Business Analyst.

Rules

  1. Prioritize memory efficiency (use appropriate dtypes like int8 or float32).
  2. Ensure zero data leakage if a target variable is present.
  3. Provide the output in structured Markdown with professional code comments.
  4. Write a modular, PEP8-compliant Python script using pandas and scikit-learn.
  5. Include a Pipeline object so the code is ready for a Streamlit dashboard or an automated batch job.

EXPECTED OUTPUT

Format
markdown
Schema
markdown_sections · Technical Audit & Business Context, Statistical Strategy, The Implementation Block, Post-Processing Validation
Constraints
  • structured Markdown
  • professional code comments
  • modular PEP8-compliant Python script
  • include Pipeline object
  • assertion checks
  • prioritize memory efficiency with dtypes like int8/float32
  • zero data leakage
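
The "include Pipeline object" and "zero data leakage" constraints can be sketched as follows. This is a minimal illustration, not the output the prompt is guaranteed to produce: the column names (`amount`, `region`), the helper `build_preprocessor`, and the choice of median imputation, RobustScaler, and one-hot encoding are all assumptions made for the example. Leakage is avoided by fitting on the training rows only and reusing the fitted statistics on unseen rows.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler


def build_preprocessor(numeric_cols, categorical_cols):
    """Return an unfitted preprocessor (hypothetical helper for this sketch)."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),   # median resists outliers
        ("scale", RobustScaler()),                      # centers on median, scales by IQR
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),  # unseen labels -> all zeros
    ])
    return ColumnTransformer(
        [("num", numeric, numeric_cols), ("cat", categorical, categorical_cols)],
        sparse_threshold=0.0,  # always return a dense array
    )


# Illustrative data only; column names are assumptions.
train = pd.DataFrame({
    "amount": [10.0, np.nan, 30.0, 1000.0],
    "region": pd.Series(["north", "south", np.nan, "north"], dtype=object),
})
unseen = pd.DataFrame({
    "amount": [np.nan],
    "region": pd.Series(["west"], dtype=object),
})

preprocessor = build_preprocessor(["amount"], ["region"])
X_train = preprocessor.fit_transform(train)  # statistics learned from train only
X_unseen = preprocessor.transform(unseen)    # reuses train statistics, no refit
```

If a target column is present, it would be split off before calling `fit_transform`, so that no target information reaches the imputers or scalers.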

SUCCESS CRITERIA

  • Analyze the schema. Identify inconsistencies, missing values, and Data Smells.
  • Briefly explain how these data issues might impact business decision-making.
  • Propose a rigorous strategy for Imputation (Median vs. Mean), Encoding (One-Hot vs. Label), and Scaling (Standard vs. Robust) based on the audit.
  • Write a modular, PEP8-compliant Python script using pandas and scikit-learn.
  • Provide assertion checks to verify data integrity (e.g., checking for nulls or memory optimization via downcasting).
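
As a rough idea of what the assertion checks and downcasting in the last criterion might look like, here is a small sketch. The helper names `downcast_numeric` and `validate` are hypothetical, and the checks shown (same shape, no nulls, no memory growth) are one plausible set rather than a complete validation suite.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df):
    """Return a copy with numeric columns downcast to smaller dtypes where possible."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        # float32 is the smallest float dtype pandas will downcast to
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


def validate(before, after):
    """Raise AssertionError if the cleaned frame breaks basic integrity checks."""
    assert after.shape == before.shape, "row or column count changed"
    assert not after.isna().any().any(), "nulls remain after cleaning"
    assert (
        after.memory_usage(deep=True).sum() <= before.memory_usage(deep=True).sum()
    ), "memory usage grew after downcasting"
```

Note that downcasting floats to float32 reduces precision, so it is worth confirming that the affected columns tolerate it before applying this to monetary or identifier-like values.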

FAILURE MODES

  • May fail without access to the uploaded CSV file.
  • Might not align cleaning with specific business objectives due to lack of details.
  • Could overlook data smells if audit is not deep enough.
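
To make the audit step less dependent on the model's thoroughness, a quick local profile of the file can be run first and pasted alongside the prompt. The sketch below is one possible approach under stated assumptions: `audit_schema` and `looks_numeric` are hypothetical helpers, and the two "smell" flags (constant columns, numbers stored as text) are only examples of what a deeper audit would cover.

```python
import pandas as pd


def looks_numeric(series):
    """True if a non-numeric column holds only values that parse as numbers."""
    if pd.api.types.is_numeric_dtype(series):
        return False
    if pd.api.types.is_datetime64_any_dtype(series):
        return False
    non_null = series.dropna()
    if non_null.empty:
        return False
    return bool(pd.to_numeric(non_null, errors="coerce").notna().all())


def audit_schema(df):
    """Summarise dtype, missing share, and cardinality for each column."""
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": df.isna().mean().mul(100).round(2),
        "n_unique": df.nunique(dropna=True),
    })
    # Example smell flags; a real audit would add more (mixed date formats, etc.)
    report["constant"] = report["n_unique"] <= 1
    report["numeric_as_text"] = [looks_numeric(df[col]) for col in df.columns]
    return report
```

Calling `audit_schema(pd.read_csv(path))` on the uploaded file gives a per-column table that can be compared against the model's own audit.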

CAVEATS

Dependencies
  • Requires uploaded CSV file.
Missing context
  • Actual CSV file or sample data/schema.
  • Specific business objectives or domain context.
  • Identification of target variable if applicable.
Ambiguities
  • Business objectives mentioned but not specified or provided.
  • Assumes CSV file upload without schema, sample data, or file details.
  • Unclear if target variable is present or what it is.

QUALITY

OVERALL
0.85
CLARITY
0.95
SPECIFICITY
0.90
REUSABILITY
0.75
COMPLETENESS
0.80

IMPROVEMENT SUGGESTIONS

  • Use placeholders like {file_path} and {business_objectives} for reusability.
  • Provide sample CSV schema or data snippet as example.
  • Explicitly name the target variable, or add a placeholder for it, and state how it should be handled.
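
One way to apply these suggestions is to keep the prompt as a template and fill it per dataset. The sketch below uses only the standard library. The placeholder `{target_column}`, the function `render_prompt`, and the shortened template wording are assumptions for illustration; `{file_path}` and `{business_objectives}` come from the suggestions above.

```python
# Shortened, illustrative template; the full prompt text would go here.
PROMPT_TEMPLATE = (
    "I want you to act as a Senior Data Science Architect and Lead Business "
    "Analyst. The raw CSV file is {file_path}. "
    "Business objectives: {business_objectives}. "
    "Target variable: {target_column}. "
    "Please begin the audit."
)


def render_prompt(file_path, business_objectives, target_column="none"):
    """Fill the template; target_column defaults to 'none' when no target exists."""
    return PROMPT_TEMPLATE.format(
        file_path=file_path,
        business_objectives=business_objectives,
        target_column=target_column,
    )


prompt = render_prompt(
    file_path="sales.csv",
    business_objectives="forecast monthly revenue",
    target_column="revenue",
)
```

Stating the target column explicitly also removes the ambiguity about whether leakage checks apply.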

USAGE

Copy the prompt above and paste it into your AI of choice — Claude, ChatGPT, Gemini, or anywhere else you're working. Replace any placeholder sections with your own context, then ask for the output.
