Embedding Strategies for Domain-Specific Data
How to select, fine-tune, and optimize embedding models for domain-specific data in RAG systems, with practical Python examples for financial and business analytics domains.
Embeddings are the foundation of every RAG system. They determine which documents are considered "similar" to a query, and by extension, which context the language model sees when generating an answer. A general-purpose embedding model trained on web text does a reasonable job on general-purpose questions, but it struggles with domain-specific terminology, abbreviations, and semantic relationships that differ from everyday language.
In the business analytics domain where Data Whispal Agent operates, "churn" does not mean butter-making, "pipeline" has nothing to do with plumbing, and "ARR" is not a pirate exclamation. When a user asks "What is our net retention rate?", the embedding model needs to retrieve documents about revenue expansion and contraction within existing accounts, not documents about employee retention or data retention policies. Getting this right requires deliberate embedding strategy, not just plugging in the default model.
This article covers how we evaluate embedding models for domain fitness, when and how to fine-tune them, and the practical techniques we use to improve retrieval quality through embedding optimization.
Evaluating Embedding Models for Your Domain
Before optimizing, you need to measure. Embedding quality is measured by how well the model ranks relevant documents above irrelevant ones for your specific queries. The standard approach is to build a retrieval evaluation dataset and measure recall and precision at various cutoff points.
from sentence_transformers import SentenceTransformer
import numpy as np
from dataclasses import dataclass


@dataclass
class RetrievalEvalCase:
    query: str
    relevant_doc_ids: list[str]
    all_doc_ids: list[str]


def evaluate_embedding_model(
    model_name: str,
    eval_cases: list[RetrievalEvalCase],
    documents: dict[str, str],
    k_values: list[int] = [1, 5, 10, 20],
) -> dict:
    model = SentenceTransformer(model_name)
    # Encode all documents once
    doc_ids = list(documents.keys())
    doc_texts = [documents[did] for did in doc_ids]
    doc_embeddings = model.encode(doc_texts, normalize_embeddings=True)
    results = {f"recall@{k}": 0.0 for k in k_values}
    results["mrr"] = 0.0  # Mean Reciprocal Rank
    for case in eval_cases:
        query_embedding = model.encode(
            case.query, normalize_embeddings=True
        )
        similarities = np.dot(doc_embeddings, query_embedding)
        ranked_indices = np.argsort(similarities)[::-1]
        ranked_ids = [doc_ids[i] for i in ranked_indices]
        # Calculate recall@k
        for k in k_values:
            retrieved = set(ranked_ids[:k])
            relevant = set(case.relevant_doc_ids)
            recall = len(retrieved & relevant) / len(relevant)
            results[f"recall@{k}"] += recall
        # Calculate MRR: reciprocal rank of the first relevant document
        for rank, doc_id in enumerate(ranked_ids, 1):
            if doc_id in case.relevant_doc_ids:
                results["mrr"] += 1.0 / rank
                break
    # Average across all cases
    n = len(eval_cases)
    for key in results:
        results[key] /= n
    return results

We evaluated five embedding models on a dataset of 200 analytics queries with manually labeled relevant documents:
MODELS_TO_EVALUATE = [
    "text-embedding-3-small",  # OpenAI, 1536 dimensions
    "text-embedding-3-large",  # OpenAI, 3072 dimensions
    "BAAI/bge-large-en-v1.5",  # Open source, 1024 dimensions
    "sentence-transformers/all-MiniLM-L6-v2",  # Lightweight, 384 dims
    "intfloat/e5-large-v2",  # Open source, 1024 dimensions
]

# Run evaluation. Note: the OpenAI models are served over the OpenAI API
# rather than loaded via SentenceTransformer, so in practice the harness
# wraps each model behind a common encode() interface.
for model_name in MODELS_TO_EVALUATE:
    results = evaluate_embedding_model(
        model_name, eval_cases, documents
    )
    print(f"\n{model_name}:")
    for metric, value in results.items():
        print(f"  {metric}: {value:.3f}")

The results were instructive. OpenAI's text-embedding-3-small had the best recall@5 on general queries, but BAAI/bge-large-en-v1.5 outperformed it on queries involving domain-specific financial terminology. The difference was most pronounced for abbreviation-heavy queries like "What's the MRR breakdown by cohort?" where general-purpose models did not associate "MRR" (Monthly Recurring Revenue) with financial context.
Fine-Tuning Embeddings for Domain Vocabulary
When off-the-shelf models underperform on your domain, fine-tuning bridges the gap. The goal is to teach the model that domain-specific terms have particular semantic relationships. Fine-tuning does not require massive datasets. Even a few hundred high-quality training pairs can produce measurable improvements.
The training data for embedding fine-tuning consists of positive pairs (query, relevant document) and optionally negative pairs (query, irrelevant document). We generate training data from three sources: user query logs paired with the documents users actually found useful, synthetic pairs generated by an LLM from our document corpus, and manually curated pairs for edge cases.
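One way to source the optional negative pairs is hard-negative mining: for each logged query, take the highest-ranked documents that were not labeled relevant, since those are exactly the documents currently fooling the retriever. A minimal standalone sketch (the helper name, toy documents, and ranking below are invented for illustration, not part of our pipeline):

```python
def mine_hard_negatives(
    query: str,
    ranked_doc_ids: list[str],
    relevant_ids: set[str],
    documents: dict[str, str],
    n_negatives: int = 1,
) -> list[tuple[str, str, str]]:
    """Build (query, positive, negative) triplets from a ranked list.

    Hard negatives are the highest-ranked documents that are NOT
    labeled relevant -- close enough to confuse the current model.
    """
    positives = [documents[did] for did in relevant_ids if did in documents]
    negatives = [
        documents[did] for did in ranked_doc_ids if did not in relevant_ids
    ][:n_negatives]
    return [(query, pos, neg) for pos in positives for neg in negatives]

# Toy example: doc "b" outranks the labeled-relevant "a",
# so it becomes the hard negative for this query.
docs = {
    "a": "ARR grew 40% year over year",
    "b": "Employee retention improved",
    "c": "Office snacks budget",
}
triplets = mine_hard_negatives(
    "What is our net retention rate?",
    ranked_doc_ids=["b", "a", "c"],
    relevant_ids={"a"},
    documents=docs,
)
```

Triplets produced this way plug directly into the triplet format accepted below.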
from sentence_transformers import (
    SentenceTransformer,
    InputExample,
    losses,
)
from torch.utils.data import DataLoader


def prepare_training_data(
    positive_pairs: list[tuple[str, str]],
    hard_negatives: list[tuple[str, str, str]] | None = None,
) -> list[InputExample]:
    """Prepare training examples for embedding fine-tuning.

    Args:
        positive_pairs: List of (query, relevant_document) tuples
        hard_negatives: Optional list of (query, positive, negative) tuples
    """
    examples = []
    if hard_negatives:
        # Triplet format: anchor, positive, negative
        for query, positive, negative in hard_negatives:
            examples.append(InputExample(
                texts=[query, positive, negative]
            ))
    else:
        # Pair format: (query, relevant document). No label is needed;
        # the ranking loss treats other in-batch examples as negatives.
        for query, document in positive_pairs:
            examples.append(InputExample(texts=[query, document]))
    return examples


def fine_tune_embedding_model(
    base_model: str,
    training_examples: list[InputExample],
    output_path: str,
    epochs: int = 3,
    batch_size: int = 16,
):
    model = SentenceTransformer(base_model)
    train_dataloader = DataLoader(
        training_examples, shuffle=True, batch_size=batch_size
    )
    # Use MultipleNegativesRankingLoss for contrastive learning.
    # This loss treats other examples in the batch as negatives.
    train_loss = losses.MultipleNegativesRankingLoss(model)
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=epochs,
        warmup_steps=int(len(train_dataloader) * 0.1),
        output_path=output_path,
        show_progress_bar=True,
    )
    return model

The MultipleNegativesRankingLoss is particularly effective because it automatically treats other examples in the batch as negatives, creating a rich contrastive signal without requiring explicit negative mining. For a batch of 16 examples, each positive pair is contrasted against 15 negatives.
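To make the in-batch mechanism concrete, the loss can be sketched in a few lines of pure Python: each query is scored against every document in the batch, and a softmax cross-entropy rewards ranking its own positive first. This is a simplified illustration with toy similarity scores, not the library's actual implementation:

```python
import math

def in_batch_ranking_loss(
    sim_matrix: list[list[float]], scale: float = 20.0
) -> float:
    """Cross-entropy over in-batch candidates.

    sim_matrix[i][j] is the cosine similarity between query i and
    document j; document i is query i's true positive, and every other
    document in the batch acts as a negative.
    """
    total = 0.0
    for i, row in enumerate(sim_matrix):
        logits = [scale * s for s in row]
        log_denom = math.log(sum(math.exp(x) for x in logits))
        total += log_denom - logits[i]  # -log softmax prob of the positive
    return total / len(sim_matrix)

# A batch where each query is closest to its own document (diagonal
# dominates) scores a near-zero loss; an undifferentiated batch does not.
good = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.1, 0.95]]
bad = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]
```

When every row is uniform, the loss settles at log(batch_size), which is why larger batches give the loss more headroom and a richer contrastive signal.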
Synthetic Training Data Generation
Manually curating hundreds of training pairs is tedious. We use an LLM to generate synthetic training data from our document corpus, then filter it through quality checks:
import numpy as np
from langchain_openai import ChatOpenAI


async def generate_synthetic_pairs(
    documents: list[str],
    pairs_per_document: int = 3,
) -> list[tuple[str, str]]:
    """Generate synthetic query-document pairs for embedding training."""
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.7)
    prompt_template = """Given the following data document, generate
{n} realistic analytics questions that this document would be relevant for.

The questions should:
- Use natural business language (not technical jargon)
- Vary in complexity (simple lookups to trend analysis)
- Include domain-specific abbreviations where appropriate

Document:
{document}

Generate exactly {n} questions, one per line. Only output the questions."""
    pairs = []
    for doc in documents:
        response = await llm.ainvoke(
            prompt_template.format(document=doc, n=pairs_per_document)
        )
        questions = [
            q.strip() for q in response.content.strip().split("\n")
            if q.strip()
        ]
        for question in questions[:pairs_per_document]:
            pairs.append((question, doc))
    return pairs


def filter_quality(
    pairs: list[tuple[str, str]],
    embedding_model,
    min_similarity: float = 0.3,
    max_similarity: float = 0.85,
) -> list[tuple[str, str]]:
    """Filter pairs that are too easy or too hard for training."""
    filtered = []
    for query, doc in pairs:
        embeddings = embedding_model.encode(
            [query, doc], normalize_embeddings=True
        )
        similarity = np.dot(embeddings[0], embeddings[1])
        # Too high similarity means the pair is trivial;
        # too low similarity means it might be a bad pair.
        if min_similarity <= similarity <= max_similarity:
            filtered.append((query, doc))
    return filtered

The similarity filtering step is crucial. Training pairs where the query is nearly identical to the document teach the model nothing. Pairs where the query has almost no relationship to the document introduce noise. The sweet spot is pairs that are semantically related but lexically distinct, which is exactly what forces the model to learn domain-specific associations.
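The band logic itself is easy to sanity-check with hand-picked vectors. In a self-contained sketch (toy 2-D vectors stand in for real embeddings), a near-duplicate pair lands above the band, an unrelated pair lands below it, and only the related-but-distinct pair survives:

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def in_training_band(
    sim: float, min_sim: float = 0.3, max_sim: float = 0.85
) -> bool:
    # Keep pairs that are related but not near-duplicates
    return min_sim <= sim <= max_sim

trivial = cosine([1.0, 0.0], [0.99, 0.01])  # near-duplicate, ~1.0
useful = cosine([1.0, 0.2], [0.6, 0.8])     # related but distinct, ~0.75
noise = cosine([1.0, 0.0], [0.0, 1.0])      # orthogonal, 0.0
```

Only the middle pair passes the filter, which mirrors how the band keeps training signal while dropping trivial and noisy pairs.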
Chunking Strategies That Complement Embeddings
Embedding quality and chunking strategy are deeply intertwined. The best embedding model in the world cannot retrieve a relevant passage if the chunking strategy placed the key information in the middle of an irrelevant chunk, diluting its embedding.
We use three chunking strategies depending on the data type:
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter,
)


# Strategy 1: Semantic chunking for prose documents
def chunk_prose(text: str, chunk_size: int = 800) -> list[str]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=200,
        separators=["\n\n", "\n", ". ", " "],
    )
    return splitter.split_text(text)


# Strategy 2: Structure-aware chunking for markdown/reports
def chunk_structured(text: str) -> list[dict]:
    headers_to_split_on = [
        ("#", "header1"),
        ("##", "header2"),
        ("###", "header3"),
    ]
    splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=headers_to_split_on,
    )
    splits = splitter.split_text(text)
    # Further split large sections
    recursive_splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        chunk_overlap=100,
    )
    final_chunks = []
    for split in splits:
        if len(split.page_content) > 800:
            sub_chunks = recursive_splitter.split_text(split.page_content)
            for sc in sub_chunks:
                final_chunks.append({
                    "content": sc,
                    "metadata": split.metadata,
                })
        else:
            final_chunks.append({
                "content": split.page_content,
                "metadata": split.metadata,
            })
    return final_chunks


# Strategy 3: Row-level chunking for tabular data
def chunk_tabular(
    rows: list[dict],
    group_by: str | None = None,
    rows_per_chunk: int = 10,
) -> list[str]:
    """Convert tabular data into embeddable text chunks."""
    if group_by and group_by in rows[0]:
        # Group rows by a key column for coherent chunks
        groups = {}
        for row in rows:
            key = row[group_by]
            groups.setdefault(key, []).append(row)
        chunks = []
        for key, group_rows in groups.items():
            chunk_text = f"{group_by}: {key}\n"
            for row in group_rows:
                chunk_text += " | ".join(
                    f"{k}: {v}" for k, v in row.items() if k != group_by
                )
                chunk_text += "\n"
            chunks.append(chunk_text.strip())
        return chunks
    else:
        # Simple sequential chunking
        chunks = []
        for i in range(0, len(rows), rows_per_chunk):
            batch = rows[i:i + rows_per_chunk]
            chunk_text = "\n".join(
                " | ".join(f"{k}: {v}" for k, v in row.items())
                for row in batch
            )
            chunks.append(chunk_text)
        return chunks

The tabular chunking strategy deserves special attention. Business data is often tabular, and naive text chunking destroys the structure that makes it meaningful. Grouping rows by a key column (like region, product, or time period) produces chunks that are semantically coherent, meaning the embedding captures the concept "Q3 revenue by region" rather than a random slice of a CSV file.
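To see what the grouped strategy produces end to end, here is a standalone sketch of the same grouping idea using itertools.groupby, with a few invented sample rows (the region names and figures are illustrative):

```python
from itertools import groupby

rows = [
    {"region": "EMEA", "quarter": "Q3", "revenue": 1200},
    {"region": "EMEA", "quarter": "Q4", "revenue": 1350},
    {"region": "APAC", "quarter": "Q3", "revenue": 800},
]

# Group rows by region so each chunk embeds as one coherent concept
chunks = []
sorted_rows = sorted(rows, key=lambda r: r["region"])
for region, group in groupby(sorted_rows, key=lambda r: r["region"]):
    lines = [f"region: {region}"]
    for row in group:
        lines.append(
            " | ".join(f"{k}: {v}" for k, v in row.items() if k != "region")
        )
    chunks.append("\n".join(lines))
```

Each resulting chunk reads like "region: EMEA" followed by that region's rows, so its embedding represents a single business concept rather than an arbitrary slice of the table.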
Query-Time Embedding Optimization
Embedding optimization does not stop at indexing time. The query embedding can also be improved to better match the indexed representations. We use two techniques: query expansion and instruction-prefixed embeddings.
Query expansion rewrites the user query into multiple variants and averages their embeddings. This broadens the semantic coverage of the query vector:
async def expand_and_embed(
    query: str,
    embedding_model,
    expansion_llm,
    n_expansions: int = 3,
) -> np.ndarray:
    """Generate query expansions and average their embeddings."""
    expansion_prompt = f"""Generate {n_expansions} alternative phrasings of
this analytics question. Keep the same meaning but vary the vocabulary.

Original: {query}

Alternatives (one per line):"""
    response = await expansion_llm.ainvoke(expansion_prompt)
    alternatives = [
        line.strip().lstrip("0123456789.-) ")
        for line in response.content.strip().split("\n")
        if line.strip()
    ][:n_expansions]
    all_queries = [query] + alternatives
    embeddings = embedding_model.encode(
        all_queries, normalize_embeddings=True
    )
    # Average and re-normalize
    avg_embedding = np.mean(embeddings, axis=0)
    avg_embedding = avg_embedding / np.linalg.norm(avg_embedding)
    return avg_embedding

Instruction-prefixed embeddings leverage a feature of modern embedding models where prepending a task-specific instruction to the input improves retrieval quality. Models like E5 and BGE are trained with this pattern:
def embed_with_instruction(
    texts: list[str],
    model: SentenceTransformer,
    is_query: bool = True,
) -> np.ndarray:
    """Add task-specific instruction prefixes before embedding."""
    # Illustrative prefixes; match them to your model's training format
    # (e.g. E5 expects the literal "query: " and "passage: " prefixes,
    # while BGE applies an instruction to the query side only).
    if is_query:
        prefix = "Represent this analytics question for retrieving relevant data: "
    else:
        prefix = "Represent this data document for retrieval: "
    prefixed = [prefix + text for text in texts]
    return model.encode(prefixed, normalize_embeddings=True)

This asymmetric prefixing tells the model to embed queries and documents differently, optimizing for the cross-modal similarity that retrieval requires rather than the within-modal similarity that would group all queries together.
Monitoring Embedding Drift
Embedding models are static, but your data is not. As new terminology enters your domain, the model's ability to embed it meaningfully may degrade. We monitor for this with a weekly evaluation job:
import logging

logger = logging.getLogger(__name__)


def check_embedding_health(
    model_name: str,
    eval_cases: list[RetrievalEvalCase],
    documents: dict[str, str],
    alert_threshold: float = 0.05,
    historical_metrics: dict | None = None,
):
    current_metrics = evaluate_embedding_model(
        model_name, eval_cases, documents
    )
    if historical_metrics:
        for metric, current_value in current_metrics.items():
            previous = historical_metrics.get(metric, current_value)
            delta = previous - current_value
            if delta > alert_threshold:
                logger.warning(
                    "Embedding quality regression detected",
                    extra={
                        "metric": metric,
                        "current": round(current_value, 4),
                        "previous": round(previous, 4),
                        "delta": round(delta, 4),
                    },
                )
    return current_metrics

When recall drops below our threshold, it triggers a review of recent queries that produced poor retrieval results, which feeds back into our fine-tuning dataset for the next training cycle.
Conclusion
Embedding strategy is the most underinvested area in most RAG systems. Teams spend weeks on prompt engineering and chain architecture but accept the default embedding model without evaluation. For Data Whispal Agent, investing in domain-specific embedding optimization, including systematic model evaluation, targeted fine-tuning with synthetic data, structure-aware chunking, and query-time enhancements, improved retrieval recall@5 by 23% compared to using OpenAI's default embedding model out of the box. That improvement flows directly into answer quality, because every additional relevant chunk the model sees is another opportunity to produce an accurate, well-grounded response.