Prompt Engineering for Data Analytics Agents

Practical prompt engineering techniques for building data analytics agents that produce accurate, well-structured answers grounded in retrieved data context.

technical · 10 min read · By Klivvr Engineering

Prompt engineering is often dismissed as a soft skill, something you iterate on until the output looks right. In data analytics agents, this attitude is dangerous. A poorly engineered prompt can produce answers that are fluent, confident, and completely wrong. When users rely on your agent to make business decisions based on their data, the precision of your prompts is as critical as the correctness of your code.

Data Whispal Agent processes thousands of analytics queries daily. Each query requires the agent to interpret ambiguous natural language, map it to the correct data context, perform reasoning over numerical values, and produce an answer that is both accurate and clearly sourced. The prompts that drive this behavior evolved through hundreds of iterations, systematic evaluation, and a few painful production incidents. This article shares the techniques and patterns that emerged.

The Anatomy of a Data Analytics Prompt

Every prompt in our system has four components: the role definition, the behavioral constraints, the context injection, and the output format specification. Omitting any of these leads to degraded performance in measurable ways.

The role definition tells the model who it is and what domain it operates in. This is not cosmetic. Models conditioned as domain experts produce measurably more accurate outputs on domain-specific tasks than models given generic instructions.

SYSTEM_PROMPT = """You are Data Whispal Agent, a senior data analyst assistant.
Your expertise includes:
- Interpreting business metrics (revenue, churn, ARPU, CAC, LTV)
- Statistical reasoning (trends, correlations, distributions)
- Data quality assessment (missing values, outliers, inconsistencies)
- Clear, actionable communication of analytical findings
 
You work at a company where data-driven decisions are critical.
Treat every query as if the answer will directly influence a business decision."""

Behavioral constraints prevent the most common failure modes. For analytics agents, the top three failure modes are hallucination (inventing data points), over-extrapolation (drawing conclusions beyond what the data supports), and source confusion (attributing data to the wrong dataset).

CONSTRAINTS = """IMPORTANT RULES:
1. ONLY use information from the provided context. Never invent numbers.
2. If the context lacks sufficient data, say "The available data does
   not contain enough information to answer this question" and explain
   what data would be needed.
3. When citing numbers, always include:
   - The exact value from the context
   - The time period it covers
   - The source dataset name from metadata
4. Distinguish between facts (directly stated in data) and inferences
   (conclusions you draw from the data). Label inferences explicitly.
5. If data quality issues exist (nulls, outliers, small sample sizes),
   flag them before presenting conclusions."""

The context injection section is where retrieved documents are inserted. Format matters enormously here. We experimented with plain text, XML-tagged sections, and structured markdown tables. Structured markdown with clear delimiters produced the best results because it helps the model parse boundaries between different source documents.

def format_context(documents) -> str:
    sections = []
    for i, doc in enumerate(documents, 1):
        source = doc.metadata.get("source", "Unknown")
        domain = doc.metadata.get("domain", "General")
        date_range = doc.metadata.get("date_range", "Not specified")
 
        section = (
            f"### Source {i}: {source}\n"
            f"**Domain:** {domain} | **Period:** {date_range}\n\n"
            f"{doc.page_content}"
        )
        sections.append(section)
 
    return "\n\n---\n\n".join(sections)
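To make the boundary format concrete, here is the formatter applied to a single stand-in document. `SimpleNamespace` mimics the two attributes the formatter reads from a LangChain `Document` (`metadata` and `page_content`); the function is repeated so the snippet runs standalone.

```python
from types import SimpleNamespace

# Repeated from the listing above so this snippet is self-contained.
def format_context(documents) -> str:
    sections = []
    for i, doc in enumerate(documents, 1):
        source = doc.metadata.get("source", "Unknown")
        domain = doc.metadata.get("domain", "General")
        date_range = doc.metadata.get("date_range", "Not specified")
        sections.append(
            f"### Source {i}: {source}\n"
            f"**Domain:** {domain} | **Period:** {date_range}\n\n"
            f"{doc.page_content}"
        )
    return "\n\n---\n\n".join(sections)

# Stand-in for a retrieved LangChain Document with the same two attributes.
doc = SimpleNamespace(
    metadata={"source": "user_metrics", "domain": "Growth",
              "date_range": "2025-03 to 2025-04"},
    page_content="March 2025: 12,450 active users",
)
formatted = format_context([doc])
```

The `---` separator between sections gives the model an unambiguous signal that one source document has ended and another has begun, which is exactly where source confusion tends to creep in.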

Output format specification controls the structure of the response. For analytics queries, we require a specific format that separates the answer from the evidence and caveats:

OUTPUT_FORMAT = """Structure your response as follows:
 
**Answer:** A direct, concise answer to the question (1-3 sentences).
 
**Supporting Data:**
- List the specific data points from the context that support your answer
- Include source attribution for each data point
 
**Caveats:** (include only if relevant)
- Data quality issues, small sample sizes, or missing context
- Assumptions you made in your analysis
 
**Confidence:** Rate your confidence as HIGH (data directly answers the
question), MEDIUM (requires some inference), or LOW (significant gaps
in available data)."""

Few-Shot Examples for Numerical Reasoning

Language models are notoriously weak at arithmetic and numerical reasoning. When your analytics agent needs to compute percentage changes, averages, or trends from raw data, few-shot examples are essential. They teach the model the expected reasoning pattern and dramatically reduce calculation errors.

We maintain a library of few-shot examples organized by query type. Here is how we inject them for trend analysis queries:

TREND_ANALYSIS_EXAMPLES = [
    {
        "question": "What was the month-over-month change in active users?",
        "context": (
            "March 2025: 12,450 active users\n"
            "April 2025: 13,180 active users"
        ),
        "answer": (
            "**Answer:** Active users increased by 5.86% month-over-month, "
            "from 12,450 in March to 13,180 in April 2025.\n\n"
            "**Supporting Data:**\n"
            "- March 2025: 12,450 active users (Source: user_metrics)\n"
            "- April 2025: 13,180 active users (Source: user_metrics)\n"
            "- Calculation: (13,180 - 12,450) / 12,450 = 5.86%\n\n"
            "**Confidence:** HIGH"
        ),
    },
    {
        "question": "Is revenue trending up or down this quarter?",
        "context": (
            "July 2025: $1.2M revenue\n"
            "August 2025: $1.15M revenue\n"
            "September 2025: $1.08M revenue"
        ),
        "answer": (
            "**Answer:** Revenue is trending downward this quarter, "
            "declining 10% from $1.2M in July to $1.08M in September.\n\n"
            "**Supporting Data:**\n"
            "- July: $1.2M, August: $1.15M (-4.2%), September: $1.08M (-6.1%)\n"
            "- Total decline: ($1.2M - $1.08M) / $1.2M = 10%\n"
            "- The decline is accelerating (larger drop in September)\n\n"
            "**Caveats:**\n"
            "- Only 3 data points; seasonal factors not accounted for\n\n"
            "**Confidence:** MEDIUM"
        ),
    },
]
 
EXAMPLE_LIBRARY = {"trend_analysis": TREND_ANALYSIS_EXAMPLES}

def build_few_shot_prompt(query_type: str, context: str, question: str) -> str:
    examples = EXAMPLE_LIBRARY.get(query_type, [])
    example_text = ""
    for ex in examples[:2]:  # Limit to 2 examples to save tokens
        example_text += (
            f"\nExample Question: {ex['question']}\n"
            f"Example Context:\n{ex['context']}\n"
            f"Example Answer:\n{ex['answer']}\n"
            f"{'=' * 40}\n"
        )
 
    return (
        f"{SYSTEM_PROMPT}\n\n{CONSTRAINTS}\n\n"
        f"Here are examples of how to answer:\n{example_text}\n\n"
        f"Now answer this question:\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n{OUTPUT_FORMAT}"
    )

The explicit "Calculation" line in the examples teaches the model to show its work. This serves two purposes: it improves numerical accuracy by forcing step-by-step reasoning, and it gives users a way to verify the math independently.

Query Classification and Prompt Routing

Not all analytics queries are alike. A factual lookup ("What was last month's revenue?") requires different prompting than a comparative analysis ("How does Q2 compare to Q1?") or an exploratory question ("What are the key trends in our user data?"). Using the same prompt for all query types is a common mistake that sacrifices quality on every type.

We classify incoming queries into categories and route each to a specialized prompt template:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field
from enum import Enum
 
class QueryType(str, Enum):
    FACTUAL_LOOKUP = "factual_lookup"
    TREND_ANALYSIS = "trend_analysis"
    COMPARISON = "comparison"
    AGGREGATION = "aggregation"
    EXPLORATORY = "exploratory"
    DATA_QUALITY = "data_quality"
 
class QueryClassification(BaseModel):
    query_type: QueryType = Field(
        description="The type of analytics query"
    )
    requires_calculation: bool = Field(
        description="Whether the answer requires mathematical computation"
    )
    time_sensitive: bool = Field(
        description="Whether the query references a specific time period"
    )
 
CLASSIFIER_PROMPT = ChatPromptTemplate.from_messages([
    ("system", (
        "Classify the following analytics query. Respond with JSON.\n"
        "Query types:\n"
        "- factual_lookup: Direct data retrieval\n"
        "- trend_analysis: Time-series patterns\n"
        "- comparison: Comparing metrics across dimensions\n"
        "- aggregation: Sum, average, count operations\n"
        "- exploratory: Open-ended data exploration\n"
        "- data_quality: Questions about data completeness or accuracy"
    )),
    ("human", "{query}"),
])
 
async def classify_query(query: str) -> QueryClassification:
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    structured_llm = llm.with_structured_output(QueryClassification)
    chain = CLASSIFIER_PROMPT | structured_llm
    return await chain.ainvoke({"query": query})

Each query type has its own prompt template optimized for that task. Trend analysis prompts emphasize chronological ordering and rate-of-change calculations. Comparison prompts emphasize structured side-by-side presentation. Exploratory prompts emphasize breadth and suggest follow-up questions.

PROMPT_TEMPLATES = {
    QueryType.TREND_ANALYSIS: (
        "{base_prompt}\n\n"
        "ADDITIONAL INSTRUCTIONS FOR TREND ANALYSIS:\n"
        "- Sort data chronologically before analyzing\n"
        "- Calculate period-over-period changes as percentages\n"
        "- Identify inflection points where trends reverse\n"
        "- Note the overall direction: increasing, decreasing, or stable\n"
    ),
    QueryType.COMPARISON: (
        "{base_prompt}\n\n"
        "ADDITIONAL INSTRUCTIONS FOR COMPARISON:\n"
        "- Present data in a structured comparison format\n"
        "- Calculate absolute and relative differences\n"
        "- Identify the most significant differences\n"
        "- Note any dimensions where data is not comparable\n"
    ),
    QueryType.DATA_QUALITY: (
        "{base_prompt}\n\n"
        "ADDITIONAL INSTRUCTIONS FOR DATA QUALITY:\n"
        "- Check for missing values and their patterns\n"
        "- Identify statistical outliers\n"
        "- Assess data freshness and coverage\n"
        "- Recommend specific remediation steps\n"
    ),
}
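With the templates in place, routing reduces to a dictionary lookup with a fallback for query types that need no extra instructions. The snippet below is a minimal sketch; the enum and template dictionary are redeclared (abridged) so it runs standalone.

```python
from enum import Enum

# Abridged redeclarations mirroring the QueryType enum and
# PROMPT_TEMPLATES dictionary defined earlier in the article.
class QueryType(str, Enum):
    TREND_ANALYSIS = "trend_analysis"
    FACTUAL_LOOKUP = "factual_lookup"

PROMPT_TEMPLATES = {
    QueryType.TREND_ANALYSIS: (
        "{base_prompt}\n\n"
        "ADDITIONAL INSTRUCTIONS FOR TREND ANALYSIS:\n"
        "- Sort data chronologically before analyzing\n"
    ),
}

def build_routed_prompt(query_type: QueryType, base_prompt: str) -> str:
    """Select the specialized template for the classified query type,
    falling back to the base prompt alone when no specialization exists."""
    template = PROMPT_TEMPLATES.get(query_type, "{base_prompt}")
    return template.format(base_prompt=base_prompt)

routed = build_routed_prompt(QueryType.TREND_ANALYSis if False else QueryType.TREND_ANALYSIS,
                             "You are an analyst.")
```

In production the classification's `requires_calculation` flag would also decide whether to attach the few-shot examples from the previous section, but the core routing is this simple.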

Chain-of-Thought for Complex Analytics

Some analytics queries require multi-step reasoning that a single prompt cannot reliably produce. For these, we use explicit chain-of-thought decomposition, where the agent breaks the problem into steps and executes them sequentially.

DECOMPOSITION_PROMPT = """Given the following complex analytics question,
break it into a sequence of simpler sub-questions that can each be
answered from the data. Each sub-question should build on the previous answers.
 
Question: {question}
 
Respond with a numbered list of sub-questions. Each should be answerable
from a single data lookup or simple calculation."""
 
async def decompose_and_answer(question: str, retriever, llm):
    # Step 1: Decompose the complex question
    decomp_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    sub_questions_raw = await decomp_llm.ainvoke(
        DECOMPOSITION_PROMPT.format(question=question)
    )
    sub_questions = parse_numbered_list(sub_questions_raw.content)
 
    # Step 2: Answer each sub-question with retrieval
    intermediate_answers = []
    for sq in sub_questions:
        docs = await retriever.ainvoke(sq)
        context = format_context(docs)
        answer = await llm.ainvoke(
            f"Context:\n{context}\n\nQuestion: {sq}"
        )
        intermediate_answers.append({
            "question": sq,
            "answer": answer.content,
        })
 
    # Step 3: Synthesize final answer
    synthesis_context = "\n".join(
        f"Q: {ia['question']}\nA: {ia['answer']}"
        for ia in intermediate_answers
    )
    final_answer = await llm.ainvoke(
        f"Based on these intermediate findings:\n\n"
        f"{synthesis_context}\n\n"
        f"Provide a comprehensive answer to the original question: "
        f"{question}\n\n{OUTPUT_FORMAT}"
    )
    return final_answer.content
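The `parse_numbered_list` helper is left undefined in the listing above. A minimal regex-based version, assuming the decomposition model follows the numbered format it was asked for, could look like this:

```python
import re

def parse_numbered_list(text: str) -> list[str]:
    """Extract items from a numbered list like '1. First\\n2. Second'.

    Lines that don't start with '<digits>.' or '<digits>)' are ignored,
    which also skips any preamble the model adds before the list.
    """
    items = []
    for line in text.splitlines():
        match = re.match(r"\s*\d+[.)]\s+(.*\S)", line)
        if match:
            items.append(match.group(1))
    return items
```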

This approach trades latency for accuracy. A question like "What is the correlation between marketing spend and customer acquisition across regions, and which region has the best ROI?" requires multiple lookups and calculations that a single prompt would likely botch. Decomposition makes each step simple enough for reliable execution.

Evaluation-Driven Prompt Iteration

The most important prompt engineering technique is not a prompting trick at all. It is systematic evaluation. Without a test suite, prompt changes are guesswork. You fix one query and unknowingly break ten others.

We maintain an evaluation dataset of 250 question-context-answer triples, categorized by query type and difficulty. Every prompt change is evaluated against this dataset before deployment:

from dataclasses import dataclass
 
@dataclass
class EvalCase:
    question: str
    context: str
    expected_answer: str
    query_type: QueryType
    difficulty: str  # "easy", "medium", "hard"
    required_facts: list[str]  # Facts that must appear in the answer
 
async def evaluate_prompt(
    prompt_template: str,
    eval_cases: list[EvalCase],
    llm,
) -> dict:
    results = {
        "total": len(eval_cases),
        "factual_accuracy": 0,
        "format_compliance": 0,
        "hallucination_count": 0,
    }
 
    for case in eval_cases:
        prompt = prompt_template.format(
            context=case.context, question=case.question
        )
        response = await llm.ainvoke(prompt)
        answer = response.content
 
        # Check factual accuracy (guard against empty required_facts lists)
        facts_found = sum(
            1 for fact in case.required_facts if fact.lower() in answer.lower()
        )
        results["factual_accuracy"] += facts_found / max(len(case.required_facts), 1)
 
        # Check format compliance
        has_answer = "**Answer:**" in answer
        has_data = "**Supporting Data:**" in answer
        results["format_compliance"] += int(has_answer and has_data)
 
        # Check for hallucination (numbers not in context)
        response_numbers = extract_numbers(answer)
        context_numbers = extract_numbers(case.context)
        hallucinated = response_numbers - context_numbers
        if hallucinated:
            results["hallucination_count"] += 1
 
    results["factual_accuracy"] /= len(eval_cases)
    results["format_compliance"] /= len(eval_cases)
 
    return results
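The `extract_numbers` helper used by the hallucination check can be a simple regex pass. The sketch below normalizes thousands separators and trailing zeros so that formatting differences between the answer and the context don't count as hallucinations.

```python
import re

def extract_numbers(text: str) -> set[str]:
    """Pull numeric tokens from text, normalized for set comparison.

    Strips thousands separators and trailing zeros so '12,450' vs
    '12450' and '5.90' vs '5.9' compare as equal.
    """
    tokens = re.findall(r"\d[\d,]*(?:\.\d+)?", text)
    normalized = set()
    for token in tokens:
        value = token.replace(",", "")
        if "." in value:
            value = value.rstrip("0").rstrip(".")
        normalized.add(value)
    return normalized
```

Set difference then gives the numbers present in the answer but absent from the context, which is the hallucination signal the evaluation loop counts.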

We track these metrics over time and require that any prompt change maintains or improves factual accuracy and does not increase hallucination count. This discipline turns prompt engineering from art into engineering.
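The gate itself can be a few lines in CI. This sketch, with hypothetical baseline numbers, rejects a candidate prompt that regresses on either metric:

```python
def passes_regression_gate(baseline: dict, candidate: dict) -> bool:
    """Accept a prompt change only if factual accuracy does not drop
    and the hallucination count does not rise versus the baseline."""
    return (
        candidate["factual_accuracy"] >= baseline["factual_accuracy"]
        and candidate["hallucination_count"] <= baseline["hallucination_count"]
    )

# Hypothetical metric values for illustration only.
baseline = {"factual_accuracy": 0.91, "hallucination_count": 3}
candidate = {"factual_accuracy": 0.93, "hallucination_count": 2}
```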

Conclusion

Prompt engineering for data analytics agents is a precision discipline. The techniques that matter most are not clever tricks but systematic practices: structured prompt architecture with clear role definitions and constraints, few-shot examples that teach numerical reasoning patterns, query classification for specialized prompt routing, chain-of-thought decomposition for complex questions, and evaluation-driven iteration with quantitative metrics. Every prompt in Data Whispal Agent exists because it improved a measurable outcome on our evaluation suite. That is the standard your prompts should meet.
