Comparing Retrieval Methods

playgrdstar
2 min read · Apr 6, 2024


Langchain is great for getting LLM systems up and running quickly in a few lines of code.

But sometimes one just wants to take a peek under the hood, or customize Langchain's classes or functions. I've tried to rewrite some of the retrievers provided in Langchain from scratch, using only its simpler building blocks.

The HuggingFace space is here, and the Github repository with a notebook version is available here.

A quick overview of the key functions is below.

multi_query_retrieval

This function uses a language model to generate alternative versions of a query, then retrieves documents for each version.

# Assume 'llm' is a language model and 'retriever' is an object with a method to retrieve documents.
generated_queries = mq_llm_chain.invoke(query)['text'].split("\n")
# Now 'generated_queries' contains 3 alternative questions to the original 'query'.

all_retrieved_docs = []
for q in [query] + generated_queries:
    # Retrieve documents for each version of the query.
    retrieved_docs = retriever.get_relevant_documents(q)
    all_retrieved_docs.extend(retrieved_docs)

# Remove duplicates from the retrieved documents, preserving order.
unique_retrieved_docs = [doc for i, doc in enumerate(all_retrieved_docs) if doc not in all_retrieved_docs[:i]]

# Finally, extract the text from the unique documents.
return get_text(unique_retrieved_docs)
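The de-duplication step can be seen in isolation with a minimal sketch; plain strings stand in for Langchain Document objects here, purely for illustration.

```python
# Duplicates arise because the same document can match several query variants.
all_retrieved_docs = ["doc A", "doc B", "doc A", "doc C", "doc B"]

# Keep the first occurrence of each document, preserving retrieval order.
unique_retrieved_docs = [
    doc for i, doc in enumerate(all_retrieved_docs)
    if doc not in all_retrieved_docs[:i]
]

print(unique_retrieved_docs)  # ['doc A', 'doc B', 'doc C']
```

Note this is O(n²) in the number of retrieved documents, which is fine at typical retrieval sizes.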

compressed_retrieval

This function retrieves documents for a query and then compresses them, depending on the type of extractor chosen.

retrieved_docs = retriever.get_relevant_documents(query)
if extractor_type == 'chain':
    # Initialize an extractor based on a language model chain.
    extractor = LLMChainExtractor.from_llm(llm)
elif extractor_type == 'filter':
    # Initialize an extractor that filters documents through a language model.
    extractor = LLMChainFilter.from_llm(llm)
elif extractor_type == 'embeddings':
    # Initialize an extractor that uses embeddings to find similar documents.
    if embedding_model is None:
        raise ValueError("Embeddings model must be provided for embeddings extractor.")
    extractor = EmbeddingsFilter(embeddings=embedding_model, similarity_threshold=0.5)
else:
    raise ValueError("Invalid extractor_type. Options are 'chain', 'filter', or 'embeddings'.")

# Compress the documents down to the most relevant content.
compressed_docs = extractor.compress_documents(retrieved_docs, query)

# Extract the text from the compressed documents.
return get_text(compressed_docs)
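As a rough illustration of what the embeddings extractor does, the sketch below filters documents by cosine similarity against a query embedding. The vectors and labels are made up for the example; the real EmbeddingsFilter embeds the texts itself before comparing them.

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 2-d embeddings for the query and three documents (made-up values).
query_embedding = [1.0, 0.0]
doc_embeddings = {
    "relevant doc": [0.9, 0.1],
    "borderline doc": [0.5, 0.5],
    "off-topic doc": [0.0, 1.0],
}

# Keep only documents whose similarity clears the threshold,
# mirroring EmbeddingsFilter(similarity_threshold=0.5).
filtered = [
    doc for doc, emb in doc_embeddings.items()
    if cosine_similarity(query_embedding, emb) >= 0.5
]
print(filtered)  # ['relevant doc', 'borderline doc']
```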

ensemble_retrieval

This function combines the results from multiple document retrievers and ranks the documents using the Reciprocal Rank Fusion method.

# Retrieve documents from each retriever for the given query.
retrieved_docs_by_retriever = [retriever.get_relevant_documents(query) for retriever in retrievers_list]

# Calculate RRF scores for each document.
# 'weights' holds one weight per retriever; 'c' is the RRF constant (commonly 60).
rrf_score = defaultdict(float)
for doc_list, weight in zip(retrieved_docs_by_retriever, weights):
    for rank, doc in enumerate(doc_list, start=1):
        rrf_score[doc.page_content] += weight / (rank + c)

# Sort the unique documents by their RRF score, highest first.
sorted_docs = sorted(
    unique_by_key(chain.from_iterable(retrieved_docs_by_retriever), lambda doc: doc.page_content),
    key=lambda doc: rrf_score[doc.page_content],
    reverse=True
)

# Extract the text from the sorted documents.
return get_text(sorted_docs)
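The RRF scoring itself can be run with plain lists. In the sketch below, two hypothetical retrievers return ranked document ids (most relevant first), the weights and ids are invented for the example, and `c = 60` is the constant commonly used for reciprocal rank fusion.

```python
from collections import defaultdict
from itertools import chain

c = 60            # RRF constant; 60 is the value commonly used.
weights = [0.6, 0.4]

# Ranked results from two hypothetical retrievers (most relevant first).
retrieved_docs_by_retriever = [
    ["doc A", "doc B", "doc C"],
    ["doc B", "doc A", "doc D"],
]

# Each document accumulates weight / (rank + c) from every list it appears in.
rrf_score = defaultdict(float)
for doc_list, weight in zip(retrieved_docs_by_retriever, weights):
    for rank, doc in enumerate(doc_list, start=1):
        rrf_score[doc] += weight / (rank + c)

# De-duplicate while preserving first-seen order, then sort by score.
seen, unique_docs = set(), []
for doc in chain.from_iterable(retrieved_docs_by_retriever):
    if doc not in seen:
        seen.add(doc)
        unique_docs.append(doc)

sorted_docs = sorted(unique_docs, key=lambda d: rrf_score[d], reverse=True)
print(sorted_docs)  # ['doc A', 'doc B', 'doc C', 'doc D']
```

Because the first retriever carries more weight, "doc A" (rank 1 in the heavier list) edges out "doc B" even though both appear in both lists.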

long_context_reorder_retrieval

This function reorders retrieved documents so that the most relevant ones sit at the beginning and end of the results list, with the least relevant in the middle, where long-context models tend to pay the least attention.

retrieved_docs = retriever.get_relevant_documents(query)
# Reverse the list so the least relevant documents come first.
retrieved_docs.reverse()

reordered_results = []
for i, doc in enumerate(retrieved_docs):
    # Alternate placing documents at the beginning and the end of the results list.
    if i % 2 == 1:
        reordered_results.append(doc)
    else:
        reordered_results.insert(0, doc)

# Extract the text from the reordered documents.
return get_text(reordered_results)
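The reordering is easiest to see with integers standing in for documents, where 1 is the most relevant and 5 the least:

```python
# Integers stand in for documents; 1 is most relevant, 5 least.
retrieved_docs = [1, 2, 3, 4, 5]
retrieved_docs.reverse()  # least relevant now first: [5, 4, 3, 2, 1]

reordered_results = []
for i, doc in enumerate(retrieved_docs):
    # Odd indices go to the back, even indices to the front.
    if i % 2 == 1:
        reordered_results.append(doc)
    else:
        reordered_results.insert(0, doc)

print(reordered_results)  # [1, 3, 5, 4, 2]
```

The top-ranked documents (1 and 2) end up at the edges of the list, and the weakest match (5) lands in the middle.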
