RAG in a nutshell
Jan 2024
RAG == "Asking Informed Questions"
In essence, it involves asking Large Language Models (LLMs) "informed questions": we include more context, or contextual knowledge, in the questions we ask the LLMs. For example, if the user question is "How many days did it take team X to complete project Y?", the contextual knowledge might be something like: "Project Y was delivered in four months." When this context is added to the question as prior knowledge for the LLM, the LLM will, ideally, be able to conclude that "Team X spent roughly 120 days working on project Y." Contextual knowledge typically comes from local or private data that the LLM has likely not encountered during its pre-training phase. This approach is known as Retrieval Augmented Generation, or RAG. Generally, RAG is best suited for fact-based scenarios and use cases.
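As a minimal sketch, an "informed question" is simply the user's question with the retrieved context prepended before it is sent to the LLM. The template wording below is illustrative, not prescriptive:

# A minimal sketch of an "informed question": the retrieved context is
# prepended to the user's question before it is sent to the LLM.
question = "How many days did it take team X to complete project Y?"
context = "Project Y was delivered in four months."  # retrieved from local/private data

informed_question = (
    f"Given the following context:\n\t{context}\n\n"
    f"Answer this question:\n\t{question}"
)
print(informed_question)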
There are several methods to implement RAG, ranging from highly sophisticated to more straightforward approaches; it is still an experimental space. At its core, however, it is about retrieving the contextual knowledge, a process akin to a database search. In this scenario, the database is vector-based (instead of rows and columns), and the search is based on similarity, often using the cosine similarity metric. The user's question is sent to the vector database to search for the most similar, i.e. most relevant, documents. The returned search results are treated as the context to be provided to the LLM. Two factors directly influence search performance and accuracy when creating a vector-based database: the characteristics of the local data and the embedding model used to generate vectors for that data. Therefore, it is crucial to prepare the data properly and select an appropriate embedding model.
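To make the similarity search concrete, here is a minimal sketch of cosine similarity between two embedding vectors, using NumPy and made-up toy values (real embeddings have hundreds or thousands of dimensions):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" for a query and a document (illustrative values)
query_vec = np.array([0.2, 0.8, 0.1])
doc_vec = np.array([0.25, 0.75, 0.05])

print(cosine_similarity(query_vec, doc_vec))  # close to 1.0 -> highly similar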
In simplified form, the sequence of RAG processes might look something like this:
sequenceDiagram
participant User as User
participant RAG as Vector store
participant LLM_Prompt as LLM Prompt manager
participant LLM as LLM
participant Response as Response handler
User->>RAG: Provide input
RAG->>LLM_Prompt: Provide relevant context
LLM_Prompt->>LLM: Provide prompt (incorporated with context)
LLM->>Response: Generate response
Response->>User: Display LLM response
The following basic code example demonstrates this concept and the sequence of processes.
Jan 12, 2024 | a minimal example for a toy RAG with Mixtral
Creating the RAG system
GitHub - iamaziz/mini_RAG_LLM: A minimal example for in-memory RAG using ChromaDB and an Ollama LLM
from typing import List

from langchain_community.vectorstores import Chroma
from langchain.docstore.document import Document
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings

# Ollama model used for both embeddings and generation
BASE_LLM = "mixtral"

def build_rag(docs: List[str]):
    """Embed the raw text documents and index them in an in-memory Chroma store."""
    docs = [Document(page_content=doc) for doc in docs]
    return Chroma.from_documents(documents=docs, embedding=OllamaEmbeddings(model=BASE_LLM))

def search_rag(rag, query: str, k=1, **kwargs):
    """Return the content of the document most similar to the query."""
    result = rag.similarity_search_with_score(query, k=k, **kwargs)
    return result[0][0].page_content  # list of (Document, score) tuples; take the top document

def create_prompt(context: str, question: str):
    """Augment the user's question with the retrieved context."""
    return f"Given the following context: \n\t{context} \n\nAnswer this question: \n\t{question}"

def get_llm(name: str, **kwargs):
    return Ollama(model=name, **kwargs)

def ask_llm(prompt: str):
    llm = get_llm(BASE_LLM)
    return llm.invoke(prompt)
Given the following hypothetical local data ("documents"):
# -- example usage
# local documents for RAG
docs = [
"Aziz Alto has lived in NYC for 10 years.",
"aziz alto is an imaginery LLM engineer in the movive 'The Matrix'.", # intentional typo
"New York City's subway system is the oldest in the world.",
]
Using RAG with the sample local data
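A minimal usage sketch, wiring together the functions defined above (the question itself is illustrative; any question answerable from the sample documents would do):

# -- example usage (continued)
rag = build_rag(docs)                      # embed and index the local documents
question = "Where has Aziz Alto lived, and for how long?"
context = search_rag(rag, question)        # retrieve the most similar document
prompt = create_prompt(context, question)  # augment the question with the context
answer = ask_llm(prompt)                   # ask Mixtral via Ollama
print(answer)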