robgibbon
on 20 April 2026
Many of us are familiar with the retrieval augmented generation (RAG) pattern for building agentic AI applications – like digital concierges, frontline support chatbots, and agents that can help with basic self-service troubleshooting.
At a high level, the flow for RAG is fairly clear – the user’s prompt is augmented with some relevant contextual information from a knowledge base, and the large language model (LLM) provides the user with a response on the basis of the information provided, instead of from the “baked in” information that it was originally trained on.
In this article, we’ll roll up our sleeves and dive a little deeper to try to get a better grasp of how typical production-grade RAG systems actually work. To understand what’s really going on in the information retrieval process, we need to dig into hybrid search and reranking.
Embeddings and vector search
Before we get to hybrid search and reranking, let’s establish some baseline RAG understanding. Vector databases essentially provide a geometry-based search index that helps to find relevant content – or knowledge – in our knowledge base. The way it works is this:
- The underlying source data is encoded into embeddings using a specialized, GPU-accelerated AI model. These embeddings are represented as vectors – lists of numbers where each number is a coordinate in a high-dimensional space.
- These embeddings are stored in a database table, and typically a special database index is then precomputed using a search engine specialized in vector searches, to help speed things up.
- Then, at runtime, the “distance” between two concepts can be calculated using one of various mathematical metrics – for example, cosine similarity or Euclidean distance (also known as L2 distance).
- When the search runs, the results return the closest matching vectors, mapped back to records in the underlying source data. Those records could be text chunks or, if using a multi-modal language model, images, audio recordings, PDF documents, and so on. We’ll stick with text chunks in this article, to keep it simple.
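The distance metrics mentioned above are simple to compute. Here is a minimal sketch in plain Python – purely illustrative, since real vector databases run optimized, indexed implementations over vectors with hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    # Dot product of the two vectors, divided by the product of their magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(x * x for x in b))
    return dot / (mag_a * mag_b)

def euclidean_distance(a, b):
    # Straight-line (L2) distance between two points in the vector space.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy 3-dimensional embeddings (real models produce far larger vectors)
query = [0.9, 0.1, 0.0]
sunny = [0.8, 0.2, 0.1]   # semantically close to the query
rainy = [0.1, 0.9, 0.3]   # semantically distant

print(cosine_similarity(query, sunny) > cosine_similarity(query, rainy))  # True
```

The closer the cosine similarity is to 1 (or the smaller the Euclidean distance), the nearer the two concepts sit in the vector space.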
The results of the vector search will be text chunks from the raw source data, which will be sent to the LLM along with the user’s prompt. The encoded vector embeddings help to find the right information in the knowledge base, but LLMs can’t interpret those vector embeddings directly. In short:
- Search: Your query is converted into a vector to find matching pieces of data in the database.
- Retrieval: The database pulls the original text associated with those matching vectors.
- Augmentation: That text is inserted into a prompt template (e.g., “Using this info: [Source Text], answer the question: [User Prompt]”).
- Generation: This combined text prompt is sent to the LLM.
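The four steps above can be sketched end to end. Everything here is a stand-in: `embed`, `VECTOR_STORE`, `retrieve` and `call_llm` are hypothetical placeholders for a real embedding model, vector database and LLM client:

```python
# Hypothetical stand-ins for a real embedding model, vector database and LLM.
def embed(text):
    return [0.1, -0.2, 0.3]  # a real model returns hundreds of dimensions

VECTOR_STORE = {
    (0.1, -0.2, 0.3): "Ubuntu LTS releases are supported for five years.",
}

def retrieve(query_vector, top_k=1):
    # A real database runs an approximate nearest-neighbour search here.
    return list(VECTOR_STORE.values())[:top_k]

def call_llm(prompt):
    return f"(LLM answer based on: {prompt[:40]}...)"

# 1. Search: convert the user's prompt into a vector
user_prompt = "How long is an LTS supported?"
query_vector = embed(user_prompt)

# 2. Retrieval: pull the original text for the nearest matching vectors
chunks = retrieve(query_vector)

# 3. Augmentation: insert the retrieved text into a prompt template
prompt = f"Using this info: {' '.join(chunks)}, answer the question: {user_prompt}"

# 4. Generation: send the combined text prompt to the LLM
print(call_llm(prompt))
```

Note that only raw text reaches the LLM in step 4 – the vectors are used exclusively for finding the right chunks.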
It’s important to note that first step: when you run the search, the user’s prompt also has to be converted into a vector at runtime, so that the vector search engine can compare it with the vectors in the database and find the nearest matches.
In order to get meaningful results, you need to use the same embedding model as you used when you created the vector index in the database. This is because each model creates its own unique “map” of meaning (often called a vector space). Using a different embedding model is a silent killer – the application will run without errors, but the information retrieved will be completely irrelevant.
So that’s vector search. But, to make a RAG system production-ready, you typically need to move beyond “naive” vector search to a multi-stage retrieval process.
Hybrid search
Hybrid search runs both vector search and full text search algorithms in parallel and merges their results.
- Vector search (dense) is good at finding “meaning.” If you search for “sunny weather,” it might find “clear blue skies” even though the words don’t match.
- Full-text search (sparse) is good at finding “exact matches.” It uses a text matching algorithm like best match 25 (BM25) to find specific product IDs, acronyms, or rare technical terms that vector models sometimes can’t differentiate.
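To make the sparse side concrete, here is a simplified BM25 implementation in Python. It follows the standard formula (term frequency dampened by `k1`, document length normalized by `b`), but a production engine also handles tokenization, stemming and index structures that this sketch omits:

```python
import math

def bm25_scores(query_terms, corpus, k1=1.5, b=0.75):
    """Score each document in `corpus` (lists of tokens) against the query."""
    n = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n
    scores = []
    for doc in corpus:
        score = 0.0
        for term in query_terms:
            df = sum(1 for d in corpus if term in d)          # document frequency
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)   # inverse document frequency
            tf = doc.count(term)                              # term frequency in this doc
            score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

docs = [
    "error code E1234 on boot".split(),
    "sunny weather with clear skies".split(),
]
print(bm25_scores(["E1234"], docs))  # the first document scores highest
```

This is exactly the case where sparse search shines: the rare token `E1234` is matched literally, whereas an embedding model might blur it together with other product codes.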
These two methods use different scoring systems (0.0 to 1.0 for vectors vs. unbounded scores for BM25). Thus they’re combined using an algorithm like reciprocal rank fusion (RRF). RRF looks at the position of a document in both lists rather than the raw score, giving a higher final rank to documents that appear near the top of both.
Calculating the combined score
Two approaches are typically used to calculate the combined ranking score.
Reciprocal rank fusion (RRF)
RRF ignores the raw scores entirely and only looks at the rank (1st, 2nd, 3rd, and so on) of each document in each list. It uses a smoothing constant – typically set to `60` – that prevents a single very high rank in one search from completely drowning out a moderate rank in the other.
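The formula itself is tiny – each document scores `1 / (k + rank)` per list, summed across lists. A minimal sketch:

```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of document IDs using reciprocal rank fusion."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each appearance contributes 1 / (k + rank); k=60 is the usual smoothing.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_results  = ["doc_a", "doc_b", "doc_c"]  # ranked by vector search
keyword_results = ["doc_b", "doc_d", "doc_a"]  # ranked by keyword search
print(rrf_fuse([vector_results, keyword_results]))
```

Note how `doc_b` wins: it isn’t first in either list, but it appears near the top of both, which is exactly the behavior described above.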
Relative score fusion (weighted average)
This method keeps the raw scores but normalizes them first.
- Both the BM25 and vector scores are scaled to a 0–1 range.
- A weighting parameter is applied to decide which search method you trust more.
RRF doesn’t need much tweaking and is pretty robust, while weighted average can give more surgical control if you know your keyword search is consistently more reliable than your vector search (or vice-versa).
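A sketch of relative score fusion, using min-max normalization and a single `alpha` weighting parameter (an assumption for illustration – implementations vary in how they normalize and name the weight):

```python
def min_max_normalize(scores):
    """Scale a dict of raw scores into a 0-1 range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def weighted_fusion(bm25_raw, vector_raw, alpha=0.5):
    """alpha weights the vector score; (1 - alpha) weights BM25."""
    bm25_n = min_max_normalize(bm25_raw)
    vec_n = min_max_normalize(vector_raw)
    docs = set(bm25_n) | set(vec_n)
    fused = {d: alpha * vec_n.get(d, 0.0) + (1 - alpha) * bm25_n.get(d, 0.0)
             for d in docs}
    return sorted(fused, key=fused.get, reverse=True)

bm25 = {"doc_a": 12.4, "doc_b": 3.1}       # unbounded BM25 scores
vectors = {"doc_a": 0.71, "doc_b": 0.93}   # cosine similarities
print(weighted_fusion(bm25, vectors, alpha=0.7))  # trusts vector search more
```

Shifting `alpha` flips which document wins, which is the “surgical control” trade-off: more tuning power than RRF, but another knob to get wrong.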
Hybrid search in the database
In modern RAG stacks, the fusion of full text keyword search (BM25) and vector search is increasingly moving into the database layer to reduce latency and “glue code” – middleware logic which might otherwise increase the overall management complexity of the solution.
- For PostgreSQL (with the pgvector extension and built-in full-text search), you can write a single SQL query or stored procedure that runs two subqueries – one for the vector distance and one for the tsvector keyword match – and then applies the RRF formula to the results.
- Some databases like OpenSearch provide built-in functions that handle the RRF calculation for you automatically.
PostgreSQL hybrid search with RRF
Let’s see what hybrid search might look like in PostgreSQL. This example assumes you have a table called `documents` with a `content` column (for text search) and an `embedding` column (for vector search).
```sql
WITH
-- 1. Semantic search: rank by vector similarity
semantic_search AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> '[0.1, -0.2, ...]'::vector) AS rank
    FROM documents
    ORDER BY embedding <=> '[0.1, -0.2, ...]'::vector
    LIMIT 50
),
-- 2. Keyword search: rank by text relevance (ts_rank_cd as a BM25 stand-in)
keyword_search AS (
    SELECT id, ROW_NUMBER() OVER (ORDER BY ts_rank_cd(to_tsvector('english', content), query) DESC) AS rank
    FROM documents, plainto_tsquery('english', 'your search terms') query
    WHERE to_tsvector('english', content) @@ query
    ORDER BY rank
    LIMIT 50
),
-- 3. Fusion: combine ranks using the RRF formula: 1 / (k + rank), with k = 60
combined_results AS (
    SELECT id, 1.0 / (60 + s.rank) AS score FROM semantic_search s
    UNION ALL
    SELECT id, 1.0 / (60 + k.rank) AS score FROM keyword_search k
)
-- 4. Final output: sum the scores for items found in both/either list
SELECT d.id, d.content, SUM(c.score) AS final_rrf_score
FROM combined_results c
JOIN documents d ON c.id = d.id
GROUP BY d.id, d.content
ORDER BY final_rrf_score DESC
LIMIT 10;
```
The benefits of running this in a stored procedure inside the database are:
- Performance: By doing the fusion in the database, you only return the final 10 rows to your application instead of two large lists of 50.
- Simplicity: You don’t have to write logic in your application middleware to “match” IDs from two different arrays.
You might be wondering about some of the unusual operators and datatypes used in the above code listing. The `<=>` operator is pgvector’s cosine distance operator, used to calculate the distance between two vectors. The `'[0.1, -0.2, ...]'::vector` literal shows how a vector is represented. The `@@` operator is used by the full-text search engine to check for a match between the search query and a field.
Reranking
A reranker is a second-stage AI model that improves the accuracy of the contextual information that gets sent to the LLM along with the user’s prompt. While vector search is fast but fuzzy, a reranker is slower but more accurate.
For reranking, you don’t need to encode the data into embeddings. Instead, you send the raw text of the query and the raw text of the result chunks directly to the reranker model.
The workflow is:
- Recall: you perform a hybrid search and grab a relatively large candidate result set (e.g., the top 50–100 chunks).
- Rerank: you send the user prompt together with those top candidate chunks to a reranking model. Unlike vector models, a reranker looks at the query and the document together to see how well they actually fit.
- Final Selection: you take only the top 5–10 results from the reranker and send those to the LLM.
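The workflow above can be sketched as follows. Note that `cross_encoder_score` is a hypothetical stand-in for a real reranking model – in production this would be a call to a GPU-backed inference endpoint, not a word-overlap heuristic:

```python
def cross_encoder_score(query, chunk):
    # Hypothetical stand-in: a real cross-encoder reads the query and the
    # chunk together and returns a relevance score between 0 and 1.
    overlap = set(query.lower().split()) & set(chunk.lower().split())
    return len(overlap) / max(len(query.split()), 1)

def rerank(query, candidates, top_k=2):
    # Score every candidate against the query, then keep only the best few.
    scored = sorted(candidates,
                    key=lambda c: cross_encoder_score(query, c),
                    reverse=True)
    return scored[:top_k]

# 1. Recall: a large candidate set from hybrid search (50-100 in production)
candidates = [
    "Reranking improves retrieval accuracy",
    "Bananas are rich in potassium",
    "Hybrid search combines vector and keyword retrieval",
]

# 2-3. Rerank, then keep only the top results to send to the LLM
print(rerank("how does reranking improve accuracy", candidates, top_k=1))
```

The expensive model only ever sees the small candidate set, which is what keeps this second stage affordable.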
How reranking works
A reranking model is typically a cross-encoder. Unlike the embedding models (bi-encoders) used for the initial search, a cross-encoder does not look at the query and document separately.
- It takes a single combined input string: `[CLS] Query [SEP] Document Chunk [SEP]`.
- Because it sees both together, it can perform “cross-attention” – literally weighing every word in the query against every word in the document at the same time.
- It produces a single relevance score (usually a fractional number between 0 and 1). This is why it’s so accurate but also slow; you can’t pre-calculate these scores because they depend on the specific query being asked.
Reranking in application middleware
Reranking is almost always handled in the application layer or via a dedicated inference endpoint. Database stored procedures are great at maths (RRF), but they aren’t ideal for running heavy deep-learning models like the cross-encoder models typically used for reranking.
- Your application middleware calls the database to get the top 50 “candidates” (via hybrid search), then sends those 50 raw text chunks to a reranking model.
- The reranker returns the final “best” 5–10 chunks, which your application then feeds to the LLM.
Wrap-up
So we’ve examined in more detail how modern, production-grade RAG architectures use hybrid search and reranking to get the best results (and user experience):
- Vector search engines need the data to be encoded into vectors using an embedding model. The search index is then built over those vectors.
- Vector search results can be combined with keyword search results to deliver more accuracy. You’ll use an algorithm like RRF to calculate a combined result ranking. That’s called hybrid search.
- You feed the results of the hybrid search into a reranking model in order to get the best results to the top of the result list, helping to make sure that the LLM responds based on the most relevant available information.
Hybrid search helps make sure that you don’t miss anything, and reranking helps to make sure the best stuff is at the very top.
The complete production RAG flow typically looks like this:
| Task | Location | Why? |
| --- | --- | --- |
| Calculate embeddings | Middleware / API | Requires a GPU-heavy embedding model. |
| Store embeddings | Database | Embeddings need to be stored for later search and retrieval. |
| Vector/keyword searches | Database | Needs direct access to the database indexes. |
| Hybrid search | Database stored procedure | Faster to rank at the source than fetching two huge lists to the app. |
| Reranking | Middleware / API | Requires a GPU-heavy cross-encoder model. |
| LLM generation | Middleware / API | The final step that uses the retrieved context. |
Learn more about Canonical’s OpenSearch and PostgreSQL solutions, or get in touch.
Further reading
Explore our “Guide to RAG” to build and deploy a RAG workflow on public clouds with open source tools.