RAG in a nutshell

January 2024

RAG == “Asking Informed Questions”

In essence, it involves asking Large Language Models (LLMs) “informed questions”. This means we include more context, or contextual knowledge, in the questions we ask the LLMs. For example, if the user question is “How many days did it take team X to complete project Y?”, the contextual knowledge might be something like: “Project Y was delivered in four months.” When this context is added to the question as prior knowledge, the LLM will - ideally - be able to conclude that “Team X spent about 120 days working on project Y.” Contextual knowledge typically comes from local or private data that the LLM has likely not encountered during its pre-training phase. This approach is known as Retrieval Augmented Generation, or RAG. RAG is generally best suited for fact-based scenarios and use cases.
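To make this concrete, the small sketch below shows what an “informed question” looks like once the context is spliced into the prompt; the prompt wording is only illustrative, not a fixed recipe.

# the user question and the retrieved contextual knowledge
question = "How many days did it take team X to complete project Y?"
context = "Project Y was delivered in four months."

# an "informed question" is simply the question prefixed with the context
informed_question = f"Given the following context:\n\t{context}\n\nAnswer this question:\n\t{question}"
print(informed_question)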

There are several methods to implement RAG, ranging from highly sophisticated to more straightforward approaches, and the space is still experimental. At its core, however, RAG is about retrieving the contextual knowledge, a process akin to a database search. In this scenario, the database is vector-based (instead of rows and columns), and the search is based on similarity, often using the cosine similarity metric. The user's question is sent to the vector database to find the most similar, i.e. “relevant”, documents. The returned search results are treated as the context to be provided to the LLM. Two factors directly influence search performance and accuracy when creating a vector-based database: the characteristics of the local data and the embedding model used to generate vectors for that data. Therefore, it is crucial to prepare the data properly and select an appropriate embedding model.
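To illustrate the similarity search at the heart of this step, here is a minimal, self-contained sketch using plain NumPy, with hard-coded toy vectors standing in for a real embedding model; the document names and vector values are made up for illustration only.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # cosine of the angle between the two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# hypothetical document embeddings and a query embedding (toy 3-d vectors)
doc_vectors = {
    "doc_a": np.array([0.9, 0.1, 0.0]),
    "doc_b": np.array([0.1, 0.8, 0.3]),
}
query_vector = np.array([0.8, 0.2, 0.1])

# rank documents by similarity to the query; the top hit becomes the context
ranked = sorted(doc_vectors.items(),
                key=lambda kv: cosine_similarity(query_vector, kv[1]),
                reverse=True)
print(ranked[0][0])  # -> doc_a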

Mini RAG Example

Conceptually, a simplified sequence of RAG processes might look something like this:

sequenceDiagram
    participant User as User
    participant RAG as Vector store
    participant LLM_Prompt as LLM Prompt manager
    participant LLM as LLM
    participant Response as Response handler
    User->>RAG: Provide input
    RAG->>LLM_Prompt: Provide relevant context
    LLM_Prompt->>LLM: Provide prompt (incorporated with context)
    LLM->>Response: Generate response
    Response->>User: Display LLM response


The following basic code example demonstrates this concept and sequence of processes.

Jan 12, 2024 | a minimal example for a toy RAG with Mixtral LLM


Creating the RAG system

GitHub - iamaziz/mini_RAG_LLM: A minimal example for in-memory RAG using ChromaDB and an Ollama LLM

from typing import List

from langchain_community.vectorstores import Chroma
from langchain.docstore.document import Document
from langchain_community.llms import Ollama
from langchain_community.embeddings import OllamaEmbeddings

BASE_LLM = "mixtral"

def build_rag(docs: List[str]):
    """Embed the raw texts and store them in an in-memory Chroma vector store."""
    docs = [Document(page_content=doc) for doc in docs]
    return Chroma.from_documents(documents=docs, embedding=OllamaEmbeddings(model=BASE_LLM))

def search_rag(rag, query: str, k=1, **kwargs):
    """Return the content of the document most similar to the query."""
    result = rag.similarity_search_with_score(query, k=k, **kwargs)
    return result[0][0].page_content

def create_prompt(context: str, question: str):
    """Augment the user question with the retrieved context."""
    return f"Given the following context: \n\t{context} \n\nAnswer this question: \n\t{question}"

def get_llm(name: str, **kwargs):
    """Instantiate a local Ollama LLM."""
    return Ollama(model=name, **kwargs)

def ask_llm(prompt: str):
    """Send the augmented prompt to the LLM and return its response."""
    llm = get_llm(BASE_LLM)
    return llm.invoke(prompt)

Local data

Given the following hypothetical local data (“documents”):

# -- example usage

# local documents for RAG
docs = [
    "Aziz Alto has lived in NYC for 10 years.",
    "aziz alto is an imaginery LLM engineer in the movive 'The Matrix'.", # intentional typo
    "New York City's subway system is the oldest in the world.",
]

Build and use RAG

Using RAG with the sample local data
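The following is a minimal sketch of how the pieces defined above might be wired together end to end; the query string is just an example, and the exact wording of the LLM's answer will vary between runs.

# build the in-memory vector store from the local documents
rag = build_rag(docs)

# retrieve the most relevant document for the user question
query = "How long has Aziz Alto lived in NYC?"
context = search_rag(rag, query)

# augment the question with the retrieved context and ask the LLM
prompt = create_prompt(context, query)
answer = ask_llm(prompt)
print(answer)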