CSV Data Audit and Cleaning Pipeline
Instructs the model to act as a Senior Data Science Architect, perform a technical audit on an uploaded CSV file including schema analysis and business impacts, propose imputation/…
PROMPT
I want you to act as a Senior Data Science Architect and Lead Business Analyst. I am uploading a CSV file that contains raw data. Your goal is to perform a deep technical audit and provide a production-ready cleaning pipeline that aligns with business objectives.

Please follow this 4-step execution flow:

1. Technical Audit & Business Context: Analyze the schema. Identify inconsistencies, missing values, and Data Smells. Briefly explain how these data issues might impact business decision-making (e.g., inconsistent dates may lead to incorrect monthly trend analysis).
2. Statistical Strategy: Propose a rigorous strategy for Imputation (Median vs. Mean), Encoding (One-Hot vs. Label), and Scaling (Standard vs. Robust) based on the audit.
3. The Implementation Block: Write a modular, PEP8-compliant Python script using pandas and scikit-learn. Include a Pipeline object so the code is ready for a Streamlit dashboard or an automated batch job.
4. Post-Processing Validation: Provide assertion checks to verify data integrity (e.g., checking for nulls or memory optimization via downcasting).

Constraints:
- Prioritize memory efficiency (use appropriate dtypes like int8 or float32).
- Ensure zero data leakage if a target variable is present.
- Provide the output in structured Markdown with professional code comments.

I have uploaded the file. Please begin the audit.
REQUIRED CONTEXT
- CSV file with raw data
ROLES & RULES
Role assignments
- Act as a Senior Data Science Architect and Lead Business Analyst.
- Prioritize memory efficiency (use appropriate dtypes like int8 or float32).
- Ensure zero data leakage if a target variable is present.
- Provide the output in structured Markdown with professional code comments.
- Write a modular, PEP8-compliant Python script using pandas and scikit-learn.
- Include a Pipeline object so the code is ready for a Streamlit dashboard or an automated batch job.
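As a rough illustration of what the Pipeline and zero-leakage rules ask for, here is a minimal sketch. The column names (`amount`, `region`, `churned`) and the toy data are hypothetical stand-ins for the uploaded CSV. The choice of median imputation, robust scaling, and one-hot encoding is only one of the options the prompt lists, not a recommendation for any particular dataset.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler


def build_preprocessor(numeric_cols, categorical_cols):
    """Return an unfitted ColumnTransformer for the given column groups."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", RobustScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    # sparse_threshold=0 forces a dense array, which keeps the checks simple.
    return ColumnTransformer(
        [("num", numeric, numeric_cols), ("cat", categorical, categorical_cols)],
        sparse_threshold=0,
    )


# Hypothetical toy data; a real run would start from pd.read_csv(...).
df = pd.DataFrame({
    "amount": [10.0, np.nan, 12.5, 900.0, 11.0, 9.5],
    "region": ["north", "south", np.nan, "north", "east", "south"],
    "churned": [0, 1, 0, 1, 0, 1],
})

X = df.drop(columns="churned")
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=2, random_state=0
)

preprocessor = build_preprocessor(["amount"], ["region"])
# Fit on the training split only, so no test-set statistics leak into
# the imputer, scaler, or encoder.
train_features = preprocessor.fit_transform(X_train)
test_features = preprocessor.transform(X_test)
```

Splitting before fitting is what keeps the medians, scaling statistics, and category vocabulary from seeing held-out rows.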
EXPECTED OUTPUT
- Format: markdown
- Schema: markdown_sections · Technical Audit & Business Context, Statistical Strategy, The Implementation Block, Post-Processing Validation
- Constraints
  - structured Markdown
  - professional code comments
  - modular PEP8-compliant Python script
  - include Pipeline object
  - assertion checks
  - prioritize memory efficiency with dtypes like int8/float32
  - zero data leakage
SUCCESS CRITERIA
- Analyze the schema. Identify inconsistencies, missing values, and Data Smells.
- Briefly explain how these data issues might impact business decision-making.
- Propose a rigorous strategy for Imputation (Median vs. Mean), Encoding (One-Hot vs. Label), and Scaling (Standard vs. Robust) based on the audit.
- Write a modular, PEP8-compliant Python script using pandas and scikit-learn.
- Provide assertion checks to verify data integrity (e.g., checking for nulls or memory optimization via downcasting).
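The post-processing validation criterion could be met in many ways. The sketch below shows one possible shape, assuming a cleaned frame with only numeric columns. The helper names `downcast_numeric` and `validate` and the toy columns are hypothetical. Note that downcasting floats to float32 can lose precision, so it may not suit every column.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df):
    """Return a copy with numeric columns downcast to smaller dtypes."""
    out = df.copy()
    for col in out.select_dtypes(include=[np.integer]).columns:
        out[col] = pd.to_numeric(out[col], downcast="integer")
    for col in out.select_dtypes(include=[np.floating]).columns:
        out[col] = pd.to_numeric(out[col], downcast="float")
    return out


def validate(cleaned, original):
    """Raise AssertionError if basic integrity checks fail."""
    assert not cleaned.isna().any().any(), "cleaned frame still contains nulls"
    before = original.memory_usage(deep=True).sum()
    after = cleaned.memory_usage(deep=True).sum()
    assert after <= before, "downcasting increased memory usage"


# Hypothetical toy frame standing in for the cleaned CSV.
raw = pd.DataFrame({"quantity": [1, 2, 3], "price": [1.5, 2.5, 3.5]})
slim = downcast_numeric(raw)
validate(slim, raw)
```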
FAILURE MODES
- May fail without access to the uploaded CSV file.
- Might not align cleaning with specific business objectives due to lack of details.
- Could overlook data smells if audit is not deep enough.
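To make the "data smells" risk more concrete, here is a small sketch of a per-column audit that can be run before prompting, so the model's findings can be cross-checked. The function names and the sample columns are hypothetical, and the checks cover only a few common smells (missing values, constant columns, unparseable dates).

```python
import pandas as pd


def audit_frame(df):
    """Summarize dtype, missing share, and cardinality for each column."""
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": df.isna().mean().mul(100).round(1),
        "n_unique": df.nunique(dropna=True),
    })
    # A column with at most one distinct value carries no signal.
    report["constant"] = report["n_unique"] <= 1
    return report


def count_unparseable_dates(series):
    """Count non-null entries that pandas cannot parse as dates."""
    parsed = pd.to_datetime(series, errors="coerce")
    return int((parsed.isna() & series.notna()).sum())


# Hypothetical sample standing in for the uploaded CSV.
sample = pd.DataFrame({
    "order_date": ["2024-01-05", "2024-02-10", "not a date", None],
    "amount": [10.0, None, 12.5, 11.0],
    "currency": ["EUR", "EUR", "EUR", "EUR"],
})
report = audit_frame(sample)
bad_dates = count_unparseable_dates(sample["order_date"])
```

How `pd.to_datetime` treats columns that mix several date formats varies between pandas versions, so the count is best read as a cue for manual inspection rather than a definitive figure.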
CAVEATS
- Dependencies
  - Requires uploaded CSV file.
- Missing context
  - Actual CSV file or sample data/schema.
  - Specific business objectives or domain context.
  - Identification of target variable if applicable.
- Ambiguities
  - Business objectives mentioned but not specified or provided.
  - Assumes CSV file upload without schema, sample data, or file details.
  - Unclear if target variable is present or what it is.
QUALITY
- OVERALL: 0.85
- CLARITY: 0.95
- SPECIFICITY: 0.90
- REUSABILITY: 0.75
- COMPLETENESS: 0.80
IMPROVEMENT SUGGESTIONS
- Use placeholders like {file_path} and {business_objectives} for reusability.
- Provide sample CSV schema or data snippet as example.
- Explicitly define how the target variable should be handled, or add a placeholder for it.
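The placeholder suggestion could be applied along these lines. This is a sketch only: the template text is an abbreviated paraphrase of the prompt's opening, `{target_variable}` is a hypothetical placeholder name that the suggestions do not specify, and the example values are invented.

```python
# Abbreviated, paraphrased opening of the prompt with placeholders added.
PROMPT_TEMPLATE = (
    "I want you to act as a Senior Data Science Architect and Lead "
    "Business Analyst. I am uploading a CSV file ({file_path}) that "
    "contains raw data. Business objectives: {business_objectives}. "
    "Target variable: {target_variable}."
)


def render_prompt(file_path, business_objectives, target_variable="none"):
    """Fill the placeholders; target_variable defaults to 'none'."""
    return PROMPT_TEMPLATE.format(
        file_path=file_path,
        business_objectives=business_objectives,
        target_variable=target_variable,
    )


# Invented example values for illustration.
prompt = render_prompt(
    file_path="sales_2024.csv",
    business_objectives="reduce customer churn",
    target_variable="churned",
)
```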
USAGE
Copy the prompt above and paste it into your AI of choice — Claude, ChatGPT, Gemini, or anywhere else you're working. Replace any placeholder sections with your own context, then ask for the output.