Embedding Strategies for Domain-Specific Data

How to select, fine-tune, and optimize embedding models for domain-specific data in RAG systems, with practical Python examples for financial and business analytics domains.

Technical · 10 min read · By Klivvr Engineering

Embeddings are the foundation of every RAG system. They determine which documents are considered "similar" to a query, and by extension, which context the language model sees when generating an answer. A general-purpose embedding model trained on web text does a reasonable job on general-purpose questions, but it struggles with domain-specific terminology, abbreviations, and semantic relationships that differ from everyday language.

In the business analytics domain where Data Whispal Agent operates, "churn" does not mean butter-making, "pipeline" has nothing to do with plumbing, and "ARR" is not a pirate exclamation. When a user asks "What is our net retention rate?", the embedding model needs to retrieve documents about revenue expansion and contraction within existing accounts, not documents about employee retention or data retention policies. Getting this right requires deliberate embedding strategy, not just plugging in the default model.

This article covers how we evaluate embedding models for domain fitness, when and how to fine-tune them, and the practical techniques we use to improve retrieval quality through embedding optimization.

Evaluating Embedding Models for Your Domain

Before optimizing, you need to measure. Embedding quality is measured by how well the model ranks relevant documents above irrelevant ones for your specific queries. The standard approach is to build a retrieval evaluation dataset and measure recall and precision at various cutoff points.

from sentence_transformers import SentenceTransformer
import numpy as np
from dataclasses import dataclass
 
@dataclass
class RetrievalEvalCase:
    query: str
    relevant_doc_ids: list[str]
    all_doc_ids: list[str]
 
def evaluate_embedding_model(
    model_name: str,
    eval_cases: list[RetrievalEvalCase],
    documents: dict[str, str],
    k_values: tuple[int, ...] = (1, 5, 10, 20),  # immutable default avoids the mutable-default pitfall
) -> dict:
    model = SentenceTransformer(model_name)
 
    # Encode all documents once
    doc_ids = list(documents.keys())
    doc_texts = [documents[did] for did in doc_ids]
    doc_embeddings = model.encode(doc_texts, normalize_embeddings=True)
 
    results = {f"recall@{k}": 0.0 for k in k_values}
    results["mrr"] = 0.0  # Mean Reciprocal Rank
 
    for case in eval_cases:
        query_embedding = model.encode(
            case.query, normalize_embeddings=True
        )
        similarities = np.dot(doc_embeddings, query_embedding)
        ranked_indices = np.argsort(similarities)[::-1]
        ranked_ids = [doc_ids[i] for i in ranked_indices]
 
        # Calculate recall@k
        for k in k_values:
            retrieved = set(ranked_ids[:k])
            relevant = set(case.relevant_doc_ids)
            recall = len(retrieved & relevant) / len(relevant)
            results[f"recall@{k}"] += recall
 
        # Calculate MRR
        for rank, doc_id in enumerate(ranked_ids, 1):
            if doc_id in case.relevant_doc_ids:
                results["mrr"] += 1.0 / rank
                break
 
    # Average across all cases
    n = len(eval_cases)
    for key in results:
        results[key] /= n
 
    return results
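To make the two metrics concrete, here is a hand-worked example on a toy ranking (the document IDs and labels are made up for illustration):

```python
# Toy ranking for one query: the retriever returned these IDs in order.
ranked_ids = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2", "d4"}  # hypothetical ground-truth labels

# recall@k: fraction of relevant docs that appear in the top k.
def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    return len(set(ranked[:k]) & relevant) / len(relevant)

# Reciprocal rank: 1 / rank of the first relevant hit.
first_hit = next(i for i, d in enumerate(ranked_ids, 1) if d in relevant)

print(recall_at_k(ranked_ids, relevant, 1))  # 0.0 (top-1 is d7, irrelevant)
print(recall_at_k(ranked_ids, relevant, 3))  # 0.5 (d2 found, d4 not yet)
print(recall_at_k(ranked_ids, relevant, 5))  # 1.0 (both found)
print(1.0 / first_hit)                       # 0.5 (first hit at rank 2)
```

Averaging these per-query values across the evaluation set gives the aggregate numbers the function above reports.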

We evaluated five embedding models on a dataset of 200 analytics queries with manually labeled relevant documents:

# Note: the two OpenAI models are served via API rather than loaded with
# SentenceTransformer, so evaluating them requires an API-backed encode
# wrapper with the same interface as model.encode.
MODELS_TO_EVALUATE = [
    "text-embedding-3-small",       # OpenAI, 1536 dimensions
    "text-embedding-3-large",       # OpenAI, 3072 dimensions
    "BAAI/bge-large-en-v1.5",       # Open source, 1024 dimensions
    "sentence-transformers/all-MiniLM-L6-v2",  # Lightweight, 384 dims
    "intfloat/e5-large-v2",         # Open source, 1024 dimensions
]
 
# Run evaluation
for model_name in MODELS_TO_EVALUATE:
    results = evaluate_embedding_model(
        model_name, eval_cases, documents
    )
    print(f"\n{model_name}:")
    for metric, value in results.items():
        print(f"  {metric}: {value:.3f}")

The results were instructive. OpenAI's text-embedding-3-small had the best recall@5 on general queries, but BAAI/bge-large-en-v1.5 outperformed it on queries involving domain-specific financial terminology. The difference was most pronounced for abbreviation-heavy queries like "What's the MRR breakdown by cohort?" where general-purpose models did not associate "MRR" (Monthly Recurring Revenue) with financial context.

Fine-Tuning Embeddings for Domain Vocabulary

When off-the-shelf models underperform on your domain, fine-tuning bridges the gap. The goal is to teach the model that domain-specific terms have particular semantic relationships. Fine-tuning does not require massive datasets. Even a few hundred high-quality training pairs can produce measurable improvements.

The training data for embedding fine-tuning consists of positive pairs (query, relevant document) and optionally negative pairs (query, irrelevant document). We generate training data from three sources: user query logs paired with the documents users actually found useful, synthetic pairs generated by an LLM from our document corpus, and manually curated pairs for edge cases.

from sentence_transformers import (
    SentenceTransformer,
    InputExample,
    losses,
)
from torch.utils.data import DataLoader
 
def prepare_training_data(
    positive_pairs: list[tuple[str, str]],
    hard_negatives: list[tuple[str, str, str]] | None = None,
) -> list[InputExample]:
    """Prepare training examples for embedding fine-tuning.
 
    Args:
        positive_pairs: List of (query, relevant_document) tuples
        hard_negatives: Optional list of (query, positive, negative) tuples
    """
    examples = []
 
    if hard_negatives:
        # Triplet format: anchor, positive, negative
        for query, positive, negative in hard_negatives:
            examples.append(InputExample(
                texts=[query, positive, negative]
            ))
    else:
        # Pair format with cosine similarity labels
        for query, document in positive_pairs:
            examples.append(InputExample(
                texts=[query, document], label=1.0
            ))
 
    return examples
 
def fine_tune_embedding_model(
    base_model: str,
    training_examples: list[InputExample],
    output_path: str,
    epochs: int = 3,
    batch_size: int = 16,
):
    model = SentenceTransformer(base_model)
 
    train_dataloader = DataLoader(
        training_examples, shuffle=True, batch_size=batch_size
    )
 
    # Use MultipleNegativesRankingLoss for contrastive learning
    # This loss treats other examples in the batch as negatives
    train_loss = losses.MultipleNegativesRankingLoss(model)
 
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=epochs,
        warmup_steps=int(len(train_dataloader) * 0.1),
        output_path=output_path,
        show_progress_bar=True,
    )
 
    return model

The MultipleNegativesRankingLoss is particularly effective because it automatically treats other examples in the batch as negatives, creating a rich contrastive signal without requiring explicit negative mining. For a batch of 16 examples, each positive pair is contrasted against 15 negatives.
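The in-batch mechanism can be sketched with plain NumPy (toy vectors standing in for real embeddings, not an actual training step): each query is scored against every document in the batch, and the loss is a softmax cross-entropy with the matching document, the diagonal of the score matrix, as the target.

```python
import numpy as np

# Toy batch of 4 normalized query and document embeddings (dimension 8).
rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 8))
docs = queries + 0.1 * rng.normal(size=(4, 8))  # each doc matches its query
queries /= np.linalg.norm(queries, axis=1, keepdims=True)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

# Score every query against every doc: row i, column j = sim(q_i, d_j).
scores = queries @ docs.T

# Softmax cross-entropy with the diagonal (the true pair) as the target;
# the other columns in each row act as in-batch negatives.
log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
loss = -np.mean(np.diag(log_probs))
print(round(float(loss), 4))
```

The library's loss additionally multiplies the similarities by a scale factor (20 by default) before the softmax, which sharpens the distribution; the sketch omits it for clarity.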

Synthetic Training Data Generation

Manually curating hundreds of training pairs is tedious. We use an LLM to generate synthetic training data from our document corpus, then filter it through quality checks:

import numpy as np
from langchain_openai import ChatOpenAI
 
async def generate_synthetic_pairs(
    documents: list[str],
    pairs_per_document: int = 3,
) -> list[tuple[str, str]]:
    """Generate synthetic query-document pairs for embedding training."""
 
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)
 
    prompt_template = """Given the following data document, generate
{n} realistic analytics questions that this document would be relevant for.
The questions should:
- Use natural business language (not technical jargon)
- Vary in complexity (simple lookups to trend analysis)
- Include domain-specific abbreviations where appropriate
 
Document:
{document}
 
Generate exactly {n} questions, one per line. Only output the questions."""
 
    pairs = []
    for doc in documents:
        response = await llm.ainvoke(
            prompt_template.format(document=doc, n=pairs_per_document)
        )
        questions = [
            q.strip() for q in response.content.strip().split("\n")
            if q.strip()
        ]
        for question in questions[:pairs_per_document]:
            pairs.append((question, doc))
 
    return pairs
 
def filter_quality(
    pairs: list[tuple[str, str]],
    embedding_model,
    min_similarity: float = 0.3,
    max_similarity: float = 0.85,
) -> list[tuple[str, str]]:
    """Filter pairs that are too easy or too hard for training."""
    filtered = []
    for query, doc in pairs:
        embeddings = embedding_model.encode(
            [query, doc], normalize_embeddings=True
        )
        similarity = np.dot(embeddings[0], embeddings[1])
 
        # Too high similarity means the pair is trivial
        # Too low similarity means it might be a bad pair
        if min_similarity <= similarity <= max_similarity:
            filtered.append((query, doc))
 
    return filtered

The similarity filtering step is crucial. Training pairs where the query is nearly identical to the document teach the model nothing. Pairs where the query has almost no relationship to the document introduce noise. The sweet spot is pairs that are semantically related but lexically distinct, which is exactly what forces the model to learn domain-specific associations.
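The effect of the band is easy to see with toy candidates (the similarity scores here are made up; in practice they come from the embedding model as in filter_quality above):

```python
# Hypothetical (query, document, similarity) triples with made-up scores.
candidates = [
    ("What is our MRR?", "MRR stands for Monthly Recurring Revenue...", 0.92),   # trivial: near-duplicate
    ("How is net retention trending?", "Expansion and contraction within existing accounts...", 0.61),
    ("What's for lunch?", "Q3 revenue by region...", 0.08),                      # noise: unrelated pair
]

MIN_SIM, MAX_SIM = 0.3, 0.85
kept = [(q, d) for q, d, sim in candidates if MIN_SIM <= sim <= MAX_SIM]
print(len(kept))  # only the related-but-lexically-distinct pair survives
```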

Chunking Strategies That Complement Embeddings

Embedding quality and chunking strategy are deeply intertwined. The best embedding model in the world cannot retrieve a relevant passage if the chunking strategy placed the key information in the middle of an irrelevant chunk, diluting its embedding.

We use three chunking strategies depending on the data type:

from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter,
)
 
# Strategy 1: Semantic chunking for prose documents
def chunk_prose(text: str, chunk_size: int = 800) -> list[str]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=200,
        separators=["\n\n", "\n", ". ", " ", ""],  # "" fallback guarantees max chunk size
    )
    return splitter.split_text(text)
 
# Strategy 2: Structure-aware chunking for markdown/reports
def chunk_structured(text: str) -> list[dict]:
    headers_to_split_on = [
        ("#", "header1"),
        ("##", "header2"),
        ("###", "header3"),
    ]
    splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on,
    )
    splits = splitter.split_text(text)
    # Further split large sections
    recursive_splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        chunk_overlap=100,
    )
    final_chunks = []
    for split in splits:
        if len(split.page_content) > 800:
            sub_chunks = recursive_splitter.split_text(split.page_content)
            for sc in sub_chunks:
                final_chunks.append({
                    "content": sc,
                    "metadata": split.metadata,
                })
        else:
            final_chunks.append({
                "content": split.page_content,
                "metadata": split.metadata,
            })
    return final_chunks
 
# Strategy 3: Row-level chunking for tabular data
def chunk_tabular(
    rows: list[dict],
    group_by: str | None = None,
    rows_per_chunk: int = 10,
) -> list[str]:
    """Convert tabular data into embeddable text chunks."""
    if rows and group_by and group_by in rows[0]:
        # Group rows by a key column for coherent chunks
        groups = {}
        for row in rows:
            key = row[group_by]
            groups.setdefault(key, []).append(row)
 
        chunks = []
        for key, group_rows in groups.items():
            chunk_text = f"{group_by}: {key}\n"
            for row in group_rows:
                chunk_text += " | ".join(
                    f"{k}: {v}" for k, v in row.items() if k != group_by
                )
                chunk_text += "\n"
            chunks.append(chunk_text.strip())
        return chunks
    else:
        # Simple sequential chunking
        chunks = []
        for i in range(0, len(rows), rows_per_chunk):
            batch = rows[i:i + rows_per_chunk]
            chunk_text = "\n".join(
                " | ".join(f"{k}: {v}" for k, v in row.items())
                for row in batch
            )
            chunks.append(chunk_text)
        return chunks

The tabular chunking strategy deserves special attention. Business data is often tabular, and naive text chunking destroys the structure that makes it meaningful. Grouping rows by a key column (like region, product, or time period) produces chunks that are semantically coherent, meaning the embedding captures the concept "Q3 revenue by region" rather than a random slice of a CSV file.
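A quick illustration of the grouped path with three sample rows (a self-contained sketch mirroring the grouping logic in chunk_tabular above):

```python
rows = [
    {"region": "EMEA", "month": "Jul", "revenue": 120},
    {"region": "EMEA", "month": "Aug", "revenue": 135},
    {"region": "APAC", "month": "Jul", "revenue": 80},
]

# Group rows by region so each chunk is one semantically coherent unit.
groups: dict[str, list[dict]] = {}
for row in rows:
    groups.setdefault(row["region"], []).append(row)

chunks = []
for region, group_rows in groups.items():
    lines = [f"region: {region}"] + [
        " | ".join(f"{k}: {v}" for k, v in r.items() if k != "region")
        for r in group_rows
    ]
    chunks.append("\n".join(lines))

print(chunks[0])
# region: EMEA
# month: Jul | revenue: 120
# month: Aug | revenue: 135
```

Each chunk now reads as "everything about EMEA" rather than an arbitrary window of rows, which is what its embedding will capture.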

Query-Time Embedding Optimization

Embedding optimization does not stop at indexing time. The query embedding can also be improved to better match the indexed representations. We use two techniques: query expansion and instruction-prefixed embeddings.

Query expansion rewrites the user query into multiple variants and averages their embeddings. This broadens the semantic coverage of the query vector:

async def expand_and_embed(
    query: str,
    embedding_model,
    expansion_llm,
    n_expansions: int = 3,
) -> np.ndarray:
    """Generate query expansions and average their embeddings."""
 
    expansion_prompt = f"""Generate {n_expansions} alternative phrasings of
this analytics question. Keep the same meaning but vary the vocabulary.
 
Original: {query}
 
Alternatives (one per line):"""
 
    response = await expansion_llm.ainvoke(expansion_prompt)
    alternatives = [
        line.strip().lstrip("0123456789.-) ")
        for line in response.content.strip().split("\n")
        if line.strip()
    ][:n_expansions]
 
    all_queries = [query] + alternatives
    embeddings = embedding_model.encode(
        all_queries, normalize_embeddings=True
    )
 
    # Average and re-normalize
    avg_embedding = np.mean(embeddings, axis=0)
    avg_embedding = avg_embedding / np.linalg.norm(avg_embedding)
 
    return avg_embedding
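A quick numeric check of the averaging step: the mean of two unit vectors, once re-normalized, is itself unit length and equally similar to both inputs (toy 2-D vectors for readability):

```python
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])

# Average the two "query variant" embeddings, then re-normalize.
avg = (a + b) / 2
avg /= np.linalg.norm(avg)

print(np.linalg.norm(avg))  # 1.0 — still a valid unit vector
print(avg @ a, avg @ b)     # equal similarity to both variants (~0.707)
```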

Instruction-prefixed embeddings leverage a feature of modern embedding models where prepending a task-specific instruction to the input improves retrieval quality. Models like E5 and BGE are trained with this pattern:

def embed_with_instruction(
    texts: list[str],
    model: SentenceTransformer,
    is_query: bool = True,
) -> np.ndarray:
    """Add task-specific instruction prefixes before embedding."""
    if is_query:
        prefix = "Represent this analytics question for retrieving relevant data: "
    else:
        prefix = "Represent this data document for retrieval: "
 
    prefixed = [prefix + text for text in texts]
    return model.encode(prefixed, normalize_embeddings=True)

This asymmetric prefixing tells the model to embed queries and documents differently, optimizing for the cross-modal similarity that retrieval requires rather than the within-modal similarity that would group all queries together.

Monitoring Embedding Drift

Embedding models are static, but your data is not. As new terminology enters your domain, the model's ability to embed it meaningfully may degrade. We monitor for this with a weekly evaluation job:

import logging

logger = logging.getLogger(__name__)

async def check_embedding_health(
    model_name: str,
    eval_cases: list[RetrievalEvalCase],
    documents: dict[str, str],
    alert_threshold: float = 0.05,
    historical_metrics: dict | None = None,
):
    current_metrics = evaluate_embedding_model(
        model_name, eval_cases, documents
    )
 
    if historical_metrics:
        for metric, current_value in current_metrics.items():
            previous = historical_metrics.get(metric, current_value)
            delta = previous - current_value
            if delta > alert_threshold:
                logger.warning(
                    "Embedding quality regression detected",
                    extra={
                        "metric": metric,
                        "current": round(current_value, 4),
                        "previous": round(previous, 4),
                        "delta": round(delta, 4),
                    },
                )
 
    return current_metrics

When recall drops below our threshold, it triggers a review of recent queries that produced poor retrieval results, which feeds back into our fine-tuning dataset for the next training cycle.
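That feedback loop can be sketched as a simple harvest step (a hypothetical helper, assuming retrieval logs that record each query's top-k results alongside its labeled relevant documents):

```python
from dataclasses import dataclass

@dataclass
class RetrievalLog:
    query: str
    top_k_ids: list[str]
    relevant_ids: list[str]

def harvest_hard_cases(
    logs: list[RetrievalLog],
    min_recall: float = 0.5,
) -> list[tuple[str, str]]:
    """Collect (query, missed_doc_id) pairs where retrieval underperformed."""
    pairs = []
    for log in logs:
        hits = set(log.top_k_ids) & set(log.relevant_ids)
        recall = len(hits) / len(log.relevant_ids) if log.relevant_ids else 1.0
        if recall < min_recall:
            for doc_id in log.relevant_ids:
                if doc_id not in hits:
                    pairs.append((log.query, doc_id))
    return pairs

logs = [
    RetrievalLog("MRR by cohort?", ["d1", "d3"], ["d2", "d5"]),  # recall 0
    RetrievalLog("churn rate?", ["d4"], ["d4"]),                 # recall 1
]
print(harvest_hard_cases(logs))  # [('MRR by cohort?', 'd2'), ('MRR by cohort?', 'd5')]
```

The harvested (query, missed document) pairs become positive training pairs, and the documents that crowded them out of the top k are natural hard negatives.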

Conclusion

Embedding strategy is the most underinvested area in most RAG systems. Teams spend weeks on prompt engineering and chain architecture but accept the default embedding model without evaluation. For Data Whispal Agent, investing in domain-specific embedding optimization, including systematic model evaluation, targeted fine-tuning with synthetic data, structure-aware chunking, and query-time enhancements, improved retrieval recall@5 by 23% compared to using OpenAI's default embedding model out of the box. That improvement flows directly into answer quality, because every additional relevant chunk the model sees is another opportunity to produce an accurate, well-grounded response.
