Implementing RAG with Chroma and Llama 2 | Generative AI Series
Introduction
In artificial intelligence (AI), retrieval augmented generation (RAG) is the technique of retrieving data from external sources to improve the responses of large language models (LLMs).
Sometimes, an LLM's training data isn't enough, and this is where RAG comes into play: it fills a model's knowledge gaps and reduces hallucinations.
In this guide, you'll use Chroma, an open-source vector database, to improve the quality of the Llama 2 model's responses.
Prerequisites
Before you begin:
Deploy a new Ubuntu 22.04 A100 Vultr Cloud GPU Server with at least:
- 80 GB GPU RAM
- 12 vCPUs
- 120 GB Memory
Create a non-root user with sudo rights and switch to the account.
Install the Python Modules
The sample Python code in this guide requires some Python libraries. Install the libraries using the following command.
$ pip install sentence_transformers huggingface-hub transformers chromadb ipywidgets pandas
Review Word Embeddings in Natural Language Processing
In Natural Language Processing (NLP), an embedding is a numerical representation of a word or a longer piece of text. Embeddings of text with similar meanings tend to be close to each other in a vector space. Computers work with numbers rather than words, so converting text to embeddings lets NLP models compare meaning mathematically. To see how embeddings work, follow the steps below:
Create a new embeddings.py file using a text editor like nano.
$ nano embeddings.py
Enter the following information into the embeddings.py file.

    from sentence_transformers import SentenceTransformer
    import numpy as np

    def text_embedding(text):
        model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
        return model.encode(text, normalize_embeddings = True)

    def vector_similarity(vec1, vec2):
        return np.dot(np.squeeze(np.array(vec1)), np.squeeze(np.array(vec2)))

    phrase1 = "Apple is a fruit"
    embedding1 = text_embedding(phrase1)
    print(len(embedding1))

    phrase2 = "Apple iPhone is expensive"
    embedding2 = text_embedding(phrase2)
    print(len(embedding2))

    phrase3 = "Mango is a fruit"
    embedding3 = text_embedding(phrase3)
    print(len(embedding3))

    phrase4 = "There is a new Apple iPhone"
    embedding4 = text_embedding(phrase4)
    print(len(embedding4))

    print(vector_similarity(embedding1, embedding3))
    print(vector_similarity(embedding1, embedding4))
    print(vector_similarity(embedding2, embedding3))
    print(vector_similarity(embedding2, embedding4))

Save and close the file.
Run the embeddings.py file.
$ python3 embeddings.py
Output:
384
384
384
384
0.6773864
0.3809798
0.15007737
0.64330834
From the above output:
- The length of each embedding is the same, irrespective of the input text.
- The similarity results show that phrase1 and phrase3 are semantically closer to each other (0.677) than phrase1 and phrase4 (0.381). Similarly, phrase2 and phrase4 (0.643) are closer in the vector space than phrase2 and phrase3 (0.150).
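Because embeddings.py passes normalize_embeddings = True, each embedding has unit length, so the dot product computed by vector_similarity is the same as the cosine similarity. The following is a minimal sketch of that relationship in plain Python. The 3-dimensional vectors are made up for illustration and are not real model output.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical toy vectors, not real embeddings.
v1 = [1.0, 2.0, 2.0]
v2 = [2.0, 1.0, 2.0]

cos = cosine_similarity(v1, v2)
dot_of_unit = dot(normalize(v1), normalize(v2))

# Both values agree up to floating-point rounding.
print(cos, dot_of_unit)
```

A score near 1 means the vectors point in almost the same direction, and a score near 0 means they are unrelated.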
Implement the Chroma Vector Database
Chroma is an open-source vector database that allows you to store and query embeddings using semantic search. Unlike relational database management systems like MySQL or PostgreSQL, Chroma uses collections instead of tables to organize data.
The following is the basic process of performing a semantic search in a Chroma database:
- Convert text to embeddings.
- Store the embeddings in the Chroma database as vectors.
- Perform a semantic search.
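The three steps can be sketched without any database at all. The example below is an illustration only: ToyCollection is a hypothetical in-memory class, not the Chroma API, and toy_embedding is a word-count stand-in for a real embedding model, so it only matches exact words rather than meaning.

```python
import math
from collections import Counter

def toy_embedding(text, vocabulary):
    """Stand-in for an embedding model: a unit-length word-count vector."""
    counts = Counter(text.lower().split())
    vec = [float(counts[word]) for word in vocabulary]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

class ToyCollection:
    """Hypothetical in-memory vector store (not the Chroma API)."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary
        self.rows = []  # each row is (id, document, vector)

    def add(self, ids, documents):
        # Steps 1 and 2: convert text to vectors and store them.
        for doc_id, doc in zip(ids, documents):
            self.rows.append((doc_id, doc, toy_embedding(doc, self.vocabulary)))

    def query(self, text, n_results=1):
        # Step 3: rank stored vectors by similarity to the query vector.
        q = toy_embedding(text, self.vocabulary)
        scored = [(sum(a * b for a, b in zip(q, vec)), doc) for _, doc, vec in self.rows]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[:n_results]]

docs = ["mango is a fruit", "the phone is expensive", "apple is a fruit"]
vocabulary = sorted({word for doc in docs for word in doc.split()})

collection = ToyCollection(vocabulary)
collection.add(ids=["1", "2", "3"], documents=docs)

top = collection.query("expensive phone")
print(top)
```

A real vector database replaces the word counts with model embeddings and the linear scan with an index, but the add and query flow stays the same.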
To understand how you can implement the above process in a real-life example, follow the steps below:
Create a new chroma.py file.
$ nano chroma.py
Enter the following information into the chroma.py file.

    import chromadb

    phrases = [
        "Amanda baked cookies and will bring Jerry some tomorrow.",
        "Olivia and Olivier are voting for liberals in this election.",
        "Sam is confused, because he overheard Rick complaining about him as a roommate. Naomi thinks Sam should talk to Rick. Sam is not sure what to do.",
        "John's cookies were only half-baked but he still carries them for Mary."
    ]

    ids = ["001", "002", "003", "004"]

    metadatas = [
        {"source": "pdf-1"},
        {"source": "doc-1"},
        {"source": "pdf-2"},
        {"source": "txt-1"}
    ]

    chroma_client = chromadb.Client()
    collection = chroma_client.create_collection(name = "embeddings_demo")

    collection.add(
        documents = phrases,
        metadatas = metadatas,
        ids = ids
    )

    collection.peek()

    results = collection.query(
        query_texts = ["Mary got half-baked cake from John"],
        n_results = 1
    )
    print(results['documents'])

    results = collection.query(
        query_texts = ["cookies"],
        where = {"source": "pdf-1"},
        n_results = 1
    )
    print(results['documents'])

Save and close the file.
Run the chroma.py file.
$ python3 chroma.py
The first line of the output is the phrase most closely related to the Mary got half-baked cake from John search phrase. The second line is the closest match for cookies among the documents whose source metadata is pdf-1.
[["John's cookies were only half-baked but he still carries them for Mary."]]
[['Amanda baked cookies and will bring Jerry some tomorrow.']]
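The double brackets in the output appear because a query result holds one inner list per query text. The sketch below shows one way to unpack such a result. The dictionary is written by hand to mirror that nested shape, with values copied from the chroma.py example, and first_query_hits is a hypothetical helper name.

```python
# Hand-written dictionary mirroring the nested shape of a query result.
# The values are copied from the chroma.py example (fourth phrase, id, and metadata).
results = {
    "ids": [["004"]],
    "documents": [["John's cookies were only half-baked but he still carries them for Mary."]],
    "metadatas": [[{"source": "txt-1"}]],
}

def first_query_hits(results):
    """Pair up the id, document, and metadata returned for the first query text."""
    return list(zip(results["ids"][0], results["documents"][0], results["metadatas"][0]))

hits = first_query_hits(results)
for doc_id, document, metadata in hits:
    print(doc_id, metadata["source"], document)
```

Indexing with [0] selects the hits for the first query text. A query with several texts would need a loop over the outer list.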
Perform Semantic Search Using a Dataset
You populated the Chroma database using a Python list in the previous step. You can now populate the database with data from external sources to address advanced use cases.
In this section, you'll use a sample CSV file of Oscar winners and nominees to populate the Chroma database.
Download the oscars.csv file using the Linux wget command.
$ wget https://docs.vultr.com/public/doc-assets/new/implementing-rag-with-chroma-and-llama-2-generative-ai-series/oscars.csv
Create a new search.py file.
$ nano search.py
Enter the following information into the search.py file.

    import pandas as pd
    import chromadb

    # Load the dataset and keep only the 2023 ceremony.
    df = pd.read_csv('oscars.csv')
    df = df.loc[df['year_ceremony'] == 2023]

    # Drop rows without a film title and lowercase the category names.
    df = df.dropna(subset=['film'])
    df.loc[:, 'category'] = df['category'].str.lower()

    # Build one sentence per nominee, then overwrite it for rows that did not win.
    df.loc[:, 'text'] = df['name'] + ' got nominated under the category, ' + df['category'] + ', for the film ' + df['film'] + ' to win the award'
    df.loc[df['winner'] == False, 'text'] = df['name'] + ' got nominated under the category, ' + df['category'] + ', for the film ' + df['film'] + ' but did not win'

    client = chromadb.Client()
    collection = client.get_or_create_collection("oscars-2023")

    # Use the sentences as documents and the DataFrame index as IDs.
    docs = df["text"].tolist()
    ids = [str(x) for x in df.index.tolist()]

    collection.add(
        documents = docs,
        ids = ids
    )

    results = collection.query(
        query_texts = ["RRR"],
        n_results = 1
    )
    print(results['documents'])

Save and close the file.
Run the search.py file.
$ python3 search.py
Output:
[['Music by M.M. Keeravaani; Lyric by Chandrabose got nominated under the category, music (original song), for the film RRR to win the award']]
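The search works on sentences rather than raw columns, which is why search.py turns every CSV row into a short natural-language statement before storing it. The sketch below isolates that templating step in plain Python. row_to_text is a hypothetical helper, and the two rows are placeholders that do not come from oscars.csv. Only the column names match the script.

```python
def row_to_text(row):
    """Build the sentence that search.py stores for one nominee row."""
    base = (
        f"{row['name']} got nominated under the category, "
        f"{row['category'].lower()}, for the film {row['film']}"
    )
    suffix = " to win the award" if row["winner"] else " but did not win"
    return base + suffix

# Placeholder rows for illustration only; they are not taken from oscars.csv.
rows = [
    {"name": "Person A", "category": "BEST EXAMPLE", "film": "Film X", "winner": True},
    {"name": "Person B", "category": "BEST EXAMPLE", "film": "Film Y", "winner": False},
]

texts = [row_to_text(row) for row in rows]
for text in texts:
    print(text)
```

Putting the win or loss into the sentence itself means the language model can later read the outcome directly from the retrieved text.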
Implement RAG with Chroma and LLM
In this section, you'll combine RAG with a vector database to improve the responses of a Llama model. Follow the steps below to launch a Docker container, expose a Llama model through an API, and access the API using Python code:
Initialize the container variables. Replace $HF_TOKEN with the correct Hugging Face access token.
$ model=meta-llama/Llama-2-7b-chat-hf
$ volume=$PWD/data
$ token=$HF_TOKEN
Run the following command to download and start the Hugging Face text generation inference container.
$ sudo docker run -d \
    --name hf-tgi \
    --runtime=nvidia \
    --gpus all \
    -e HUGGING_FACE_HUB_TOKEN=$token \
    -p 8080:80 \
    -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:1.1.0 \
    --model-id $model \
    --max-input-length 2048 \
    --max-total-tokens 4096
Wait for the image to download and install.
Monitor the Docker logs as the container loads.
$ sudo docker logs -f hf-tgi
Wait until you see the following output that confirms the API is listening for incoming connections.
2023-12-03T21:32:56.546160Z INFO text_generation_router: router/src/main.rs:247: Connected
2023-12-03T21:32:56.546167Z WARN text_generation_router: router/src/main.rs:252: Invalid hostname, defaulting to 0.0.0.0
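Once the container is listening, you may want to confirm that the API answers before writing the full script. The sketch below uses only the Python standard library. It assumes the container exposes the text generation inference REST route /generate on the host port mapped by -p 8080:80, and that the response JSON carries a generated_text field. The function names are hypothetical. Only the request is built when the block runs, so calling generate() requires the running container.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # host port mapped by -p 8080:80

def build_generate_request(prompt, max_new_tokens=20, base_url=BASE_URL):
    """Prepare a POST request for the /generate route (assumed REST shape)."""
    payload = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt, max_new_tokens=20):
    """Send the request and return the generated text; needs the running container."""
    request = build_generate_request(prompt, max_new_tokens)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())["generated_text"]

# Build (but do not send) a request to inspect what would go over the wire.
request = build_generate_request("What is a vector database?")
print(request.full_url)
print(request.data.decode("utf-8"))
```

To send a real prompt, call generate("What is a vector database?") on the server where the container runs.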
Open a new rag.py file.
$ nano rag.py
Enter the following information into the rag.py file.

    import pandas as pd
    import chromadb
    from sentence_transformers import SentenceTransformer
    from huggingface_hub import InferenceClient

    def text_embedding(text):
        model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
        return model.encode(text)

    def generate_context(query):
        # Retrieve the 15 documents closest to the query and join them into one string.
        vector = text_embedding(query).tolist()
        results = collection.query(
            query_embeddings = vector,
            n_results = 15,
            include = ["documents"]
        )
        res = " \n".join(str(item) for item in results['documents'][0])
        return res

    def chat_completion(system_prompt, user_prompt, length = 1000):
        # Wrap both prompts in the Llama 2 chat template.
        final_prompt = f"""<s>[INST]<<SYS>>
    {system_prompt}
    <</SYS>>

    {user_prompt} [/INST]"""
        return llm_client.text_generation(prompt = final_prompt, max_new_tokens = length).strip()

    # Prepare the same sentences as in search.py.
    df = pd.read_csv('oscars.csv')
    df = df.loc[df['year_ceremony'] == 2023]
    df = df.dropna(subset=['film'])
    df.loc[:, 'category'] = df['category'].str.lower()
    df.loc[:, 'text'] = df['name'] + ' got nominated under the category, ' + df['category'] + ', for the film ' + df['film'] + ' to win the award'
    df.loc[df['winner'] == False, 'text'] = df['name'] + ' got nominated under the category, ' + df['category'] + ', for the film ' + df['film'] + ' but did not win'

    chroma_client = chromadb.Client()
    collection = chroma_client.get_or_create_collection("oscars-2023")

    docs = df["text"].tolist()
    ids = [str(x) for x in df.index.tolist()]

    collection.add(
        documents = docs,
        ids = ids
    )

    # Connect to the text generation inference container.
    URI = 'http://127.0.0.1:8080'
    llm_client = InferenceClient(model = URI)

    #query = "What did Ke Huy Quan work on?"
    #query = "Which movie won the best music award?"
    query = "Did Lady Gaga win an award at Oscars 2023?"
    #query = "Who is the music director of RRR?"

    context = generate_context(query)

    system_prompt = """\
    You are a helpful AI assistant that can answer questions on Oscar 2023 awards. Answer based on the context provided. If you cannot find the correct answer, say I don't know.
    Be concise and just include the response.
    """

    user_prompt = f"""
    Based on the context:

    {context}

    Answer the below query:

    {query}
    """

    resp = chat_completion(system_prompt, user_prompt)
    print(resp)

Save and close the file.
Run the rag.py file.
$ python3 rag.py
Output:
Lady Gaga was nominated but did not win at the Oscars 2023.
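The model can answer correctly because the retrieved sentences are placed inside the prompt before the question. The sketch below shows that assembly step on its own, without the database or the model. The function names are hypothetical, and the retrieved sentences are placeholders standing in for the documents that the collection query returns.

```python
def build_user_prompt(context_docs, query):
    """Join the retrieved documents and append the question."""
    context = "\n".join(context_docs)
    return f"Based on the context:\n\n{context}\n\nAnswer the below query:\n\n{query}"

def build_llama2_prompt(system_prompt, user_prompt):
    """Wrap both prompts in the Llama 2 chat markers used by rag.py."""
    return f"<s>[INST]<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user_prompt} [/INST]"

# Placeholder sentences standing in for retrieved documents.
retrieved = [
    "Person A got nominated under the category, best example, for the film Film X to win the award",
    "Person B got nominated under the category, best example, for the film Film Y but did not win",
]
query = "Did Person B win an award?"
system_prompt = "Answer based on the context provided. If you cannot find the answer, say I don't know."

prompt = build_llama2_prompt(system_prompt, build_user_prompt(retrieved, query))
print(prompt)
```

Changing the number of retrieved documents changes how much context the model sees, and the total must stay within the input length limit set when starting the container.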
Conclusion
In this guide, you've implemented a Chroma vector database to store vector data and used semantic search to improve the quality of LLM responses. You've also learned how embeddings work in NLP.