Implementing RAG with Chroma and Llama 2 | Generative AI Series

Updated on December 19, 2023

Introduction

In artificial intelligence (AI), retrieval augmented generation (RAG) is the technique of retrieving data from external data sources to improve the responses of large language models (LLMs).

Sometimes, an LLM's training data isn't enough. This is where RAG comes into play: it fills the model's knowledge gaps with retrieved context and reduces hallucinations.

In this guide, you'll use Chroma, an open-source vector database, to improve the quality of the Llama 2 model's responses.

Prerequisites

Before you begin:

Install the Python Modules

The sample Python code in this guide requires several Python libraries. Install them using the following command.

console
$ pip install sentence_transformers huggingface-hub transformers chromadb ipywidgets pandas

Review Word Embeddings in Natural Language Processing

In Natural Language Processing (NLP), an embedding is a numerical representation of a piece of text, such as a word or a sentence. Embeddings of texts with similar meanings tend to be close to each other in a vector space. Computers process numbers more readily than words, so converting text to embeddings lets NLP systems compare meaning mathematically. To see how embeddings work, follow the steps below:

  1. Create a new embeddings.py file using a text editor like nano.

    console
    $ nano embeddings.py
    
  2. Enter the following information into the embeddings.py file.

    python
    from sentence_transformers import SentenceTransformer
    import numpy as np
    
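    # Load the model once and reuse it for every phrase.
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    
    def text_embedding(text):
        # Normalized embeddings have unit length, so a dot product equals cosine similarity.
        return model.encode(text, normalize_embeddings = True)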
    
    def vector_similarity(vec1, vec2):
        return np.dot(np.squeeze(np.array(vec1)),np.squeeze(np.array(vec2)))
    
    phrase1    = "Apple is a fruit"
    embedding1 = text_embedding(phrase1)
    print(len(embedding1))
    
    phrase2    = "Apple iPhone is expensive"
    embedding2 = text_embedding(phrase2)
    print(len(embedding2))
    
    phrase3    = "Mango is a fruit"
    embedding3 = text_embedding(phrase3) 
    print(len(embedding3))
    
    phrase4    = "There is a new Apple iPhone"
    embedding4 = text_embedding(phrase4)
    print(len(embedding4))
    
    print(vector_similarity(embedding1,embedding3))
    print(vector_similarity(embedding1,embedding4))
    
    print(vector_similarity(embedding2,embedding3))
    print(vector_similarity(embedding2,embedding4))
    
  3. Save and close the file.

  4. Run the embeddings.py file.

    console
    $ python3 embeddings.py
    

    Output:

    384
    384
    384
    384
    0.6773864
    0.3809798
    0.15007737
    0.64330834

    From the above output:

    • The length of each embedding is the same, irrespective of the input text.
    • The similarity scores show that phrase1 and phrase3 are semantically closer to each other than phrase1 and phrase4. Similarly, phrase2 and phrase4 are closer in the vector space than phrase2 and phrase3.
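
Because embeddings.py requests normalized embeddings, each vector has unit length, and the dot product in vector_similarity is the same as the cosine similarity. The following minimal sketch illustrates that relationship using only the Python standard library. The two 3-dimensional vectors are hypothetical toy values, not real model output, and the helper names are chosen for illustration.

```python
import math

def dot(vec1, vec2):
    """Sum of element-wise products."""
    return sum(a * b for a, b in zip(vec1, vec2))

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]

def cosine_similarity(vec1, vec2):
    """Dot product divided by the product of the vector lengths."""
    return dot(vec1, vec2) / (math.sqrt(dot(vec1, vec1)) * math.sqrt(dot(vec2, vec2)))

# Hypothetical toy vectors standing in for embeddings.
a = [1.0, 2.0, 2.0]
b = [2.0, 1.0, 2.0]

# Both values agree up to floating point rounding.
print(cosine_similarity(a, b))
print(dot(normalize(a), normalize(b)))
```

A score near 1 means the vectors point in almost the same direction, while a score near 0 means they are unrelated.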

Implement the Chroma Vector Database

Chroma is an open-source vector database that allows you to store and query embeddings using semantic search. Unlike relational database management systems like MySQL or PostgreSQL, Chroma uses collections instead of data tables to organize data.

The following is the basic process of a semantic search in a Chroma database:

  • Convert text to embeddings.
  • Store the embeddings in the Chroma database as vectors.
  • Perform a semantic search.
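
The three steps above can be sketched with a tiny in-memory store. This is an illustration under stated assumptions, not how Chroma is implemented: toy_embedding is a hypothetical word-count stand-in for a real embedding model, ToyVectorStore is a made-up class, and the two documents are invented sample sentences.

```python
import math
from collections import Counter

def toy_embedding(text, vocabulary):
    """Step 1: convert text to a unit-length word-count vector (stand-in for a real model)."""
    counts = Counter(text.lower().split())
    vec = [float(counts[word]) for word in vocabulary]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

class ToyVectorStore:
    """Hypothetical in-memory store that keeps each document next to its vector."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary
        self.items = []  # list of (id, document, vector) tuples

    def add(self, ids, documents):
        """Step 2: store the embeddings as vectors."""
        for doc_id, doc in zip(ids, documents):
            self.items.append((doc_id, doc, toy_embedding(doc, self.vocabulary)))

    def query(self, text, n_results=1):
        """Step 3: rank stored vectors by dot product with the query vector."""
        query_vec = toy_embedding(text, self.vocabulary)
        ranked = sorted(
            self.items,
            key=lambda item: sum(a * b for a, b in zip(query_vec, item[2])),
            reverse=True,
        )
        return [doc for _, doc, _ in ranked[:n_results]]

# Invented sample documents.
documents = ["the cat sat on the mat", "stocks fell sharply on monday"]
vocabulary = sorted({word for doc in documents for word in doc.lower().split()})

store = ToyVectorStore(vocabulary)
store.add(["001", "002"], documents)
print(store.query("where is the cat"))
```

A real embedding model captures meaning rather than exact word overlap, which is why Chroma can match phrases that share no words with the query.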

To understand how you can implement the above process in a real-life example, follow the steps below:

  1. Create a new chroma.py file.

    console
    $ nano chroma.py
    
  2. Enter the following information into the chroma.py file.

    python
    import chromadb
    
    phrases = [
        "Amanda baked cookies and will bring Jerry some tomorrow.",
        "Olivia and Olivier are voting for liberals in this election.",
        "Sam is confused, because he overheard Rick complaining about him as a roommate. Naomi thinks Sam should talk to Rick. Sam is not sure what to do.",
        "John's cookies were only half-baked but he still carries them for Mary."
    ]
    
    ids = [
        "001",
        "002",
        "003",
        "004"
    ]
    
    metadatas = [
        {"source": "pdf-1"}, 
        {"source": "doc-1"}, 
        {"source": "pdf-2"},
        {"source": "txt-1"}
    ]
    
    chroma_client = chromadb.Client()
    
    collection = chroma_client.create_collection(name = "embeddings_demo")
    
    collection.add(
        documents = phrases,
        metadatas = metadatas,
        ids       = ids
    )
    
    collection.peek()
    
    results = collection.query(
        query_texts = ["Mary got half-baked cake from John"],
        n_results   = 1
    )
    
    print(results['documents'])
    
    results = collection.query(
        query_texts = ["cookies"],
        where       = {"source": "pdf-1"},
        n_results   = 1
    )
    
    print(results['documents'])
    
  3. Save and close the file.

  4. Run the chroma.py file.

    console
    $ python3 chroma.py
    

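    The first line of the following output is the phrase most closely related to the Mary got half-baked cake from John search phrase. The second line is the closest match for the cookies search phrase among the phrases whose source metadata is pdf-1.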

    [["John's cookies were only half-baked but he still carries them for Mary."]]
    [['Amanda baked cookies and will bring Jerry some tomorrow.']]
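
The where argument in the second query restricts the search to records whose metadata matches the given key and value. The sketch below shows the idea in plain Python, reusing two phrases and their metadata from chroma.py. The function names and the word-overlap score are hypothetical stand-ins, and the filter-then-rank order is an assumption for illustration rather than a description of Chroma's internals.

```python
import string

def matches(metadata, where):
    """True when every key in the filter has the same value in the metadata."""
    return all(metadata.get(key) == value for key, value in where.items())

def tokens(text):
    """Lowercase words with surrounding punctuation removed."""
    return {word.strip(string.punctuation) for word in text.lower().split()}

def toy_score(query, document):
    """Stand-in for embedding similarity: number of shared words."""
    return len(tokens(query) & tokens(document))

def filtered_query(records, query, where=None, n_results=1):
    """Keep only records that pass the metadata filter, then rank them."""
    candidates = [r for r in records if where is None or matches(r["metadata"], where)]
    ranked = sorted(candidates, key=lambda r: toy_score(query, r["document"]), reverse=True)
    return [r["document"] for r in ranked[:n_results]]

records = [
    {"document": "Amanda baked cookies and will bring Jerry some tomorrow.",
     "metadata": {"source": "pdf-1"}},
    {"document": "John's cookies were only half-baked but he still carries them for Mary.",
     "metadata": {"source": "txt-1"}},
]

print(filtered_query(records, "cookies", where={"source": "pdf-1"}))
print(filtered_query(records, "cookies", where={"source": "txt-1"}))
```

Both phrases mention cookies, so the metadata filter alone decides which one each query returns.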

Perform Semantic Search Using a Dataset

You populated the Chroma database using a Python list in the previous step. You can now populate the database with data from external sources to address advanced use cases.

In this section, you'll use a sample CSV file of Oscars winners and nominees to populate the Chroma database.

  1. Download the oscars.csv file using the Linux wget command.

    console
    $ wget https://docs.vultr.com/public/doc-assets/new/implementing-rag-with-chroma-and-llama-2-generative-ai-series/oscars.csv
    
  2. Create a new search.py file.

    console
    $ nano search.py
    
  3. Enter the following information into the search.py file.

    python
    import pandas as pd
    import chromadb
    
    df = pd.read_csv('oscars.csv')
    #print(df)
    
    df = df.loc[df['year_ceremony'] == 2023]
    #print(df)
    
    df = df.dropna(subset=['film'])
    df.loc[:, 'category'] = df['category'].str.lower()
    #print(df)
    
    df.loc[:, 'text'] = df['name'] + ' got nominated under the category, ' + df['category'] + ', for the film ' + df['film'] + ' to win the award'
    
    df.loc[df['winner'] == False, 'text'] = df['name'] + ' got nominated under the category, ' + df['category'] + ', for the film ' + df['film'] + ' but did not win'
    
    client = chromadb.Client()
    collection = client.get_or_create_collection("oscars-2023")
    
    docs = df["text"].tolist() 
    
    ids = [str(x) for x in df.index.tolist()]
    
    collection.add(
        documents = docs,
        ids       = ids
    )
    
    results = collection.query(
        query_texts = ["RRR"],
        n_results   = 1
    )
    
    print(results['documents'])
    
  4. Save and close the file.

  5. Run the search.py file.

    console
    $ python3 search.py
    

    Output:

    [['Music by M.M. Keeravaani; Lyric by Chandrabose got nominated under the category, music (original song), for the film RRR to win the award']]
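
search.py turns each CSV row into a plain sentence before storing it, because the embedding model works on natural-language text rather than on table columns. The sketch below reproduces that template for individual rows without pandas. The function name row_to_text is hypothetical, the uppercase category value in the first row is an assumption about how the CSV stores it, and the second row is a made-up placeholder.

```python
def row_to_text(row):
    """Build the sentence that search.py stores for one nomination."""
    prefix = (
        f"{row['name']} got nominated under the category, "
        f"{row['category'].lower()}, for the film {row['film']}"
    )
    if row["winner"]:
        return prefix + " to win the award"
    return prefix + " but did not win"

rows = [
    # Values taken from the search.py output; the category casing is assumed.
    {"name": "Music by M.M. Keeravaani; Lyric by Chandrabose",
     "category": "MUSIC (Original Song)", "film": "RRR", "winner": True},
    # Hypothetical placeholder row, not real data.
    {"name": "Example Nominee",
     "category": "EXAMPLE CATEGORY", "film": "Example Film", "winner": False},
]

for row in rows:
    print(row_to_text(row))
```

Putting the win or loss into the sentence itself lets the model answer questions about winners from the retrieved text alone.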

Implement RAG with Chroma and LLM

In this section, you'll combine RAG and the Chroma vector database to improve the responses of a Llama 2 model. Follow the steps below to launch a Docker container, expose a Llama 2 model through an API, and access the API using Python code:

  1. Initialize the container variables. Replace $HF_TOKEN with the correct Hugging Face access token.

    console
    $ model=meta-llama/Llama-2-7b-chat-hf
    volume=$PWD/data
    token=$HF_TOKEN
    
  2. Run the following command to download and start the Hugging Face text generation inference container.

    console
    $ sudo docker run -d  \
    --name hf-tgi  \
    --runtime=nvidia  \
    --gpus all  \
    -e HUGGING_FACE_HUB_TOKEN=$token  \
    -p 8080:80  \
    -v $volume:/data  \
    ghcr.io/huggingface/text-generation-inference:1.1.0  \
    --model-id $model  \
    --max-input-length 2048  \
    --max-total-tokens 4096
    
  3. Wait for the image to download and install.

  4. Monitor the Docker logs as the container loads.

    console
    $ sudo docker logs -f hf-tgi
    
  5. Wait until you see the following output that confirms the API is listening for incoming connections.

    2023-12-03T21:32:56.546160Z  INFO text_generation_router: router/src/main.rs:247: Connected
    2023-12-03T21:32:56.546167Z  WARN text_generation_router: router/src/main.rs:252: Invalid hostname, defaulting to 0.0.0.0

  6. Open a new rag.py file.

    console
    $ nano rag.py
    
  7. Enter the following information into the rag.py file.

    python
    import pandas as pd
    import chromadb
    from sentence_transformers import SentenceTransformer
    from huggingface_hub import InferenceClient
    
    # Load the embedding model once instead of on every call.
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
    
    def text_embedding(text):
        return model.encode(text)
    
    def generate_context(query):
        vector = text_embedding(query).tolist()
    
        results = collection.query(    
            query_embeddings = vector,
            n_results = 15,
            include = ["documents"]
        )
    
        res = " \n".join(str(item) for item in results['documents'][0])
        return res
    
    def chat_completion(system_prompt, user_prompt, length = 1000):
    
        final_prompt = f"""<s>[INST]<<SYS>>
        {system_prompt}
        <</SYS>>
    
        {user_prompt} [/INST]"""
        return client.text_generation(prompt = final_prompt,max_new_tokens = length).strip()
    
    df = pd.read_csv('oscars.csv')
    df = df.loc[df['year_ceremony'] == 2023]
    df = df.dropna(subset=['film'])
    df.loc[:, 'category'] = df['category'].str.lower()
    df.loc[:, 'text'] = df['name'] + ' got nominated under the category, ' + df['category'] + ', for the film ' + df['film'] + ' to win the award'
    df.loc[df['winner'] == False, 'text'] = df['name'] + ' got nominated under the category, ' + df['category'] + ', for the film ' + df['film'] + ' but did not win'               
    
    client = chromadb.Client()
    collection = client.get_or_create_collection("oscars-2023")
    
    docs = df["text"].tolist() 
    ids  = [str(x) for x in df.index.tolist()]
    
    collection.add(
        documents = docs,
        ids       = ids
    )
    
    URI    = 'http://127.0.0.1:8080'
    client = InferenceClient(model = URI)
    
    #query="What did Ke Huy Quan work on?"
    #query="Which movie won the best music award?"
    query = "Did Lady Gaga win an award at Oscars 2023?"
    #query="Who is the music director of RRR?"
    context = generate_context(query)
    
    system_prompt = """ \
    You are a helpful AI assistant that can answer questions on Oscar 2023 awards. Answer based on the context provided. If you cannot find the correct answer, say I don't know. Be concise and just include the response.
    """
    
    user_prompt = f"""
    Based on the context:
    {context}
    Answer the below query:
    {query}
    """
    
    resp = chat_completion(system_prompt, user_prompt)
    print(resp)
    
  8. Save and close the file.

  9. Run the rag.py file.

    console
    $ python3 rag.py
    

    Output:

    Lady Gaga was nominated but did not win at the Oscars 2023.
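
The core of rag.py is assembling the retrieved documents, the system prompt, and the query into a single Llama 2 chat prompt. The following sketch isolates that step so you can inspect the prompt without a running model. The function names build_context and build_prompt are hypothetical, the shortened system prompt is for illustration, and the template mirrors the one in rag.py without the extra indentation that the triple-quoted string adds.

```python
def build_context(documents):
    """Join the retrieved documents into one block of text, one per line."""
    return "\n".join(str(doc) for doc in documents)

def build_prompt(system_prompt, context, query):
    """Wrap the system prompt and the user prompt in Llama 2 chat markers."""
    user_prompt = (
        f"Based on the context:\n{context}\n"
        f"Answer the below query:\n{query}"
    )
    return (
        f"<s>[INST] <<SYS>>\n{system_prompt.strip()}\n<</SYS>>\n\n"
        f"{user_prompt} [/INST]"
    )

# One document taken from the search.py output.
docs = [
    "Music by M.M. Keeravaani; Lyric by Chandrabose got nominated under the "
    "category, music (original song), for the film RRR to win the award"
]

prompt = build_prompt(
    "Answer based on the context provided.",
    build_context(docs),
    "Which movie won the best music award?",
)
print(prompt)
```

Printing the prompt before sending it is a quick way to check that the retrieved context actually contains the facts needed to answer the query.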

Conclusion

In this guide, you've used a Chroma vector database to store embeddings, run semantic searches, and supply retrieved context that improves the quality of Llama 2 responses. You've also learned how embeddings work in NLP.