Implementing RAG with Chroma and Llama 2 | Generative AI Series

Introduction

In artificial intelligence (AI), retrieval augmented generation (RAG) is a technique that retrieves data from external sources to improve the responses of large language models (LLMs).

Sometimes an LLM's training data isn't enough. RAG fills the model's knowledge gaps with retrieved context, which helps reduce hallucinations.

In this guide, you'll use Chroma, an open-source vector database, to improve the quality of the Llama 2 model's responses.

Prerequisites

Before you begin:

- Deploy a new Ubuntu 22.04 A100 Vultr Cloud GPU Server with at least:
  - 80 GB GPU RAM
  - 12 vCPUs
  - 120 GB Memory
- Establish an SSH connection to the server.
- Create a non-root user with sudo rights and switch to the account.
- Create a Hugging Face account.
- Create a Hugging Face user access token.

Install the Python Modules

The sample Python source code in this guide requires several Python libraries. Install them using the following command.

console

$ pip install sentence_transformers huggingface-hub transformers chromadb ipywidgets pandas
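
To confirm the installation before continuing, a short check like the following can help. This is a minimal sketch that uses only the Python standard library. The list holds the import names of the packages installed above, and the missing_modules helper is a hypothetical name used for illustration.

```python
import importlib.util

# Import names of the packages installed with pip above.
REQUIRED = [
    "sentence_transformers",
    "huggingface_hub",
    "transformers",
    "chromadb",
    "ipywidgets",
    "pandas",
]

def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are installed.")
```

If the script reports missing modules, rerun the pip command for those packages.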

Review Word Embeddings in Natural Language Processing

In Natural Language Processing (NLP), an embedding is a numerical representation of a piece of text, such as a word or a sentence. Embeddings of texts with similar meanings tend to lie close to each other in a vector space. Models operate on numbers rather than words, so converting text to embeddings lets them compare meaning numerically. To see how embeddings work, follow the steps below:

Create a new embeddings.py file using a text editor like nano.
console
```
$ nano embeddings.py
```

Enter the following information into the embeddings.py file.

python

from sentence_transformers import SentenceTransformer
import numpy as np

def text_embedding(text):
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    return model.encode(text, normalize_embeddings = True)

def vector_similarity(vec1, vec2):
    return np.dot(np.squeeze(np.array(vec1)),np.squeeze(np.array(vec2)))

phrase1    = "Apple is a fruit"
embedding1 = text_embedding(phrase1)
print(len(embedding1))

phrase2    = "Apple iPhone is expensive"
embedding2 = text_embedding(phrase2)
print(len(embedding2))

phrase3    = "Mango is a fruit"
embedding3 = text_embedding(phrase3) 
print(len(embedding3))

phrase4    = "There is a new Apple iPhone"
embedding4 = text_embedding(phrase4)
print(len(embedding4))

print(vector_similarity(embedding1,embedding3))
print(vector_similarity(embedding1,embedding4))

print(vector_similarity(embedding2,embedding3))
print(vector_similarity(embedding2,embedding4))

Save and close the file.
Run the embeddings.py file.
console
```
$ python3 embeddings.py
```
Output:
```
384
384
384
384
0.6773864
0.3809798
0.15007737
0.64330834
```
From the above output:
- Each embedding has the same length (384 values), irrespective of the input text.
- The similarity scores show that phrase1 and phrase3 are semantically close (0.68), while phrase1 and phrase4 are not (0.38). Similarly, phrase2 and phrase4 are close in the vector space (0.64), while phrase2 and phrase3 are not (0.15).
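
The similarity score in embeddings.py is a plain dot product. Because the embeddings are normalized to unit length, that dot product equals the cosine similarity of the two vectors. The following sketch illustrates the relationship using only the standard library. The three-dimensional vectors are made up for illustration and are not real model output.

```python
import math

def dot(vec1, vec2):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(vec1, vec2))

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine_similarity(vec1, vec2):
    """Cosine of the angle between two vectors."""
    return dot(vec1, vec2) / (math.sqrt(dot(vec1, vec1)) * math.sqrt(dot(vec2, vec2)))

# Toy vectors, invented for illustration only.
fruit_a = [0.9, 0.1, 0.2]
fruit_b = [0.8, 0.2, 0.1]
phone   = [0.1, 0.9, 0.3]

# For unit-length vectors, the dot product and the cosine similarity agree.
print(math.isclose(dot(normalize(fruit_a), normalize(fruit_b)),
                   cosine_similarity(fruit_a, fruit_b)))

# Vectors that point in similar directions score higher.
print(cosine_similarity(fruit_a, fruit_b) > cosine_similarity(fruit_a, phone))
```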

Implement the Chroma Vector Database

Chroma is an open-source vector database that allows you to store and query embeddings using semantic search. Unlike relational database management systems like MySQL or PostgreSQL, Chroma uses collections instead of data tables to organize data.

The following is the basic process of performing a semantic search in a Chroma database:

Convert text to embeddings.
Store the embeddings in the Chroma database as vectors.
Perform a semantic search.
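
The three steps can be sketched without any external library. The example below is a toy in-memory store, not how Chroma is implemented. The toy_embedding function is a hypothetical stand-in that counts words instead of calling a real embedding model, so it matches shared words rather than meaning. The ToyCollection class and its method names are likewise assumptions for illustration.

```python
import math
from collections import Counter

def toy_embedding(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def similarity(vec1, vec2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(vec1[word] * vec2[word] for word in vec1)
    norm1 = math.sqrt(sum(v * v for v in vec1.values()))
    norm2 = math.sqrt(sum(v * v for v in vec2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

class ToyCollection:
    """Minimal in-memory store: keeps each document next to its vector."""

    def __init__(self):
        self.items = []

    def add(self, documents):
        # Steps 1 and 2: convert text to a vector and store it.
        for doc in documents:
            self.items.append((doc, toy_embedding(doc)))

    def query(self, text, n_results=1):
        # Step 3: rank stored vectors by similarity to the query vector.
        query_vec = toy_embedding(text)
        ranked = sorted(self.items,
                        key=lambda item: similarity(query_vec, item[1]),
                        reverse=True)
        return [doc for doc, _ in ranked[:n_results]]

collection = ToyCollection()
collection.add([
    "the cat sat on the mat",
    "stock prices fell sharply today",
    "a dog chased the cat",
])
print(collection.query("where is the cat"))
```

A real vector database replaces the word counts with model embeddings and the linear scan with an index, but the add and query flow has the same shape.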

To understand how you can implement the above process in a real-life example, follow the steps below:

Create a new chroma.py file.
console
```
$ nano chroma.py
```

Enter the following information into the chroma.py file.

python

import chromadb

phrases = [
    "Amanda baked cookies and will bring Jerry some tomorrow.",
    "Olivia and Olivier are voting for liberals in this election.",
    "Sam is confused, because he overheard Rick complaining about him as a roommate. Naomi thinks Sam should talk to Rick. Sam is not sure what to do.",
    "John's cookies were only half-baked but he still carries them for Mary."
]

ids = [
    "001",
    "002",
    "003",
    "004"
]

metadatas = [
    {"source": "pdf-1"}, 
    {"source": "doc-1"}, 
    {"source": "pdf-2"},
    {"source": "txt-1"}
]

chroma_client = chromadb.Client()

collection = chroma_client.create_collection(name = "embeddings_demo")

collection.add(
    documents = phrases,
    metadatas = metadatas,
    ids       = ids
)

collection.peek()

results = collection.query(
    query_texts = ["Mary got half-baked cake from John"],
    n_results   = 1
)

print(results['documents'])

results = collection.query(
    query_texts = ["cookies"],
    where       = {"source": "pdf-1"},
    n_results   = 1
)

print(results['documents'])

Save and close the file.

Run the chroma.py file.

console

$ python3 chroma.py

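In the following output, the first line is the phrase closest to the Mary got half-baked cake from John search phrase. The second line is the closest match for the cookies search phrase among the documents whose source metadata is pdf-1.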

[["John's cookies were only half-baked but he still carries them for Mary."]]
[['Amanda baked cookies and will bring Jerry some tomorrow.']]
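
Each printed result is a list inside a list because a query accepts a list of search phrases and returns one inner list per phrase. The sketch below shows one way to unpack that shape. It uses a hand-written dictionary that mimics a result for two search phrases, so it runs without a database. The top_documents helper is a hypothetical name, and real results carry additional fields.

```python
# Hand-written stand-in for a query result with two search phrases.
mock_results = {
    "ids": [["004"], ["001"]],
    "documents": [
        ["John's cookies were only half-baked but he still carries them for Mary."],
        ["Amanda baked cookies and will bring Jerry some tomorrow."],
    ],
}

def top_documents(results):
    """Return the best match for each search phrase, or None when there is no match."""
    return [docs[0] if docs else None for docs in results["documents"]]

for query_number, document in enumerate(top_documents(mock_results), start=1):
    print(f"Query {query_number}: {document}")
```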

Perform Semantic Search Using a Dataset

You populated the Chroma database using a Python list in the previous step. You can now populate the database with data from external sources to address advanced use cases.

In this section, you'll use a sample CSV file of Oscar winners and nominees to populate the Chroma database.

Download the oscars.csv file using the Linux wget command.

console

$ wget https://docs.vultr.com/public/doc-assets/new/implementing-rag-with-chroma-and-llama-2-generative-ai-series/oscars.csv

Create a new search.py file.
console
```
$ nano search.py
```

Enter the following information into the search.py file.

python

import pandas as pd
import chromadb
from sentence_transformers import SentenceTransformer
import numpy as np
import os

df = pd.read_csv('oscars.csv')
#print(df)

df = df.loc[df['year_ceremony'] == 2023]
#print(df)

df = df.dropna(subset=['film'])
df.loc[:, 'category'] = df['category'].str.lower()
#print(df)

df.loc[:, 'text'] = df['name'] + ' got nominated under the category, ' + df['category'] + ', for the film ' + df['film'] + ' to win the award'

df.loc[df['winner'] == False, 'text'] = df['name'] + ' got nominated under the category, ' + df['category'] + ', for the film ' + df['film'] + ' but did not win'
#print(df)

client = chromadb.Client()
collection = client.get_or_create_collection("oscars-2023")

docs = df["text"].tolist() 

ids = [str(x) for x in df.index.tolist()]

collection.add(
    documents = docs,
    ids       = ids
)

results = collection.query(
    query_texts = ["RRR"],
    n_results   = 1
)

print(results['documents'])

Save and close the file.

Run the search.py file.

console

$ python3 search.py

Output:

[['Music by M.M. Keeravaani; Lyric by Chandrabose got nominated under the category, music (original song), for the film RRR to win the award']]
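
The search works on full sentences rather than raw CSV columns, because search.py turns each row into a sentence before storing it. The following sketch shows the same sentence template on plain dictionaries, without pandas. The rows are placeholders invented for illustration and are not taken from oscars.csv. The nomination_sentence helper is a hypothetical name.

```python
def nomination_sentence(name, category, film, winner):
    """Build the sentence stored in the database for one nomination."""
    outcome = "to win the award" if winner else "but did not win"
    return (f"{name} got nominated under the category, {category.lower()}, "
            f"for the film {film} {outcome}")

# Placeholder rows, not real data.
rows = [
    {"name": "Person A", "category": "DIRECTING", "film": "Film X", "winner": True},
    {"name": "Person B", "category": "DIRECTING", "film": "Film Y", "winner": False},
]

for row in rows:
    print(nomination_sentence(**row))
```

Writing the outcome into the sentence itself lets a later question about who won be answered from the retrieved text alone.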

Implement RAG with Chroma and LLM

In this section, you'll combine RAG and a vector database to improve the responses of a Llama model. Follow the steps below to launch a Docker container, expose a Llama model through an API, and access the API using Python code:

Initialize the container variables. Replace $HF_TOKEN with your Hugging Face access token.
console
```
$ model=meta-llama/Llama-2-7b-chat-hf
volume=$PWD/data
token=$HF_TOKEN
```

Run the following command to download and start the Hugging Face text generation inference container.

console

$ sudo docker run -d  \
--name hf-tgi  \
--runtime=nvidia  \
--gpus all  \
-e HUGGING_FACE_HUB_TOKEN=$token  \
-p 8080:80  \
-v $volume:/data  \
ghcr.io/huggingface/text-generation-inference:1.1.0  \
--model-id $model  \
--max-input-length 2048  \
--max-total-tokens 4096

Wait for the image to download and install.
Monitor the Docker logs as the container loads.
console
```
$ sudo docker logs -f hf-tgi
```

Wait until you see the following output that confirms the API is listening for incoming connections.

2023-12-03T21:32:56.546160Z  INFO text_generation_router: router/src/main.rs:247: Connected
2023-12-03T21:32:56.546167Z  WARN text_generation_router: router/src/main.rs:252: Invalid hostname, defaulting to 0.0.0.0
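
Once the log shows that the router is connected, the API can be checked with a direct HTTP request. The following is a minimal sketch that uses only the standard library. It assumes the text generation inference container exposes a /generate route that accepts a JSON body with inputs and parameters fields and returns a generated_text field. The helper names are hypothetical.

```python
import json
import urllib.request

# Port 8080 on the host is mapped to the container in the docker run command.
TGI_URL = "http://127.0.0.1:8080"

def build_generate_request(prompt, max_new_tokens=20):
    """Build (but do not send) a POST request for the assumed /generate route."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=f"{TGI_URL}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, max_new_tokens=20):
    """Send the prompt to the running container and return the generated text."""
    request = build_generate_request(prompt, max_new_tokens)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())["generated_text"]

# Inspect the request without contacting the server.
request = build_generate_request("What is retrieval augmented generation?")
print(request.full_url)
print(request.data.decode("utf-8"))
```

With the container running, calling generate with a short prompt should return a completion. A connection error usually means the model is still loading.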

Open a new rag.py file.
console
```
$ nano rag.py
```

Enter the following information into the rag.py file.

python

import pandas as pd
import chromadb
from chromadb.utils import embedding_functions
from sentence_transformers import SentenceTransformer
from huggingface_hub import InferenceClient
import numpy as np
import os

def text_embedding(text):
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    return model.encode(text)

def generate_context(query):
    vector = text_embedding(query).tolist()

    results = collection.query(    
        query_embeddings = vector,
        n_results = 15,
        include = ["documents"]
    )

    res = " \n".join(str(item) for item in results['documents'][0])
    return res

def chat_completion(system_prompt, user_prompt, length = 1000):

    final_prompt = f"""<s>[INST]<<SYS>>
    {system_prompt}
    <</SYS>>

    {user_prompt} [/INST]"""
    return client.text_generation(prompt = final_prompt,max_new_tokens = length).strip()

df = pd.read_csv('oscars.csv')
df = df.loc[df['year_ceremony'] == 2023]
df = df.dropna(subset=['film'])
df.loc[:, 'category'] = df['category'].str.lower()
df.loc[:, 'text'] = df['name'] + ' got nominated under the category, ' + df['category'] + ', for the film ' + df['film'] + ' to win the award'
df.loc[df['winner'] == False, 'text'] = df['name'] + ' got nominated under the category, ' + df['category'] + ', for the film ' + df['film'] + ' but did not win'               

client = chromadb.Client()
collection = client.get_or_create_collection("oscars-2023")

docs = df["text"].tolist() 
ids  = [str(x) for x in df.index.tolist()]

collection.add(
    documents = docs,
    ids       = ids
)

URI    = 'http://127.0.0.1:8080'
client = InferenceClient(model = URI)

#query="What did Ke Huy Quan work on?"
#query="Which movie won the best music award?"
query = "Did Lady Gaga win an award at Oscars 2023?"
#query="Who is the music director of RRR?"
context = generate_context(query)

system_prompt = """ \
You are a helpful AI assistant that can answer questions on Oscar 2023 awards. Answer based on the context provided. If you cannot find the correct answer, say I don't know. Be concise and just include the response.
"""

user_prompt = f"""
Based on the context:
{context}
Answer the below query:
{query}
"""

resp = chat_completion(system_prompt, user_prompt)
print(resp)

Save and close the file.

Run the rag.py file.

console

$ python3 rag.py

Output:

Lady Gaga was nominated but did not win at the Oscars 2023.
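
The answer depends on how the retrieved sentences and the question are packed into the Llama 2 chat template. The sketch below isolates that step so the prompt can be inspected without a running model. The helper names and the placeholder documents are assumptions for illustration. Unlike the indented f-string in rag.py, this version does not carry leading spaces into the prompt.

```python
def build_user_prompt(documents, query):
    """Join the retrieved documents into a context block followed by the question."""
    context = "\n".join(documents)
    return f"Based on the context:\n{context}\nAnswer the below query:\n{query}"

def build_llama2_prompt(system_prompt, user_prompt):
    """Wrap a system message and a user message in the Llama 2 chat template."""
    return (
        f"<s>[INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n"
        f"{user_prompt.strip()} [/INST]"
    )

# Placeholder documents standing in for the sentences returned by the vector search.
documents = ["Placeholder sentence one.", "Placeholder sentence two."]

prompt = build_llama2_prompt(
    "Answer based on the context provided.",
    build_user_prompt(documents, "Placeholder question?"),
)
print(prompt)
```

Printing the prompt before sending it makes it easier to see whether the retrieved context actually contains the information the question needs.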

Conclusion

In this guide, you've implemented a Chroma vector database to store vector data and used semantic search to improve the quality of Llama 2 responses. You've also learned how embeddings work in NLP.
