Building LLM Chatbots with RAG on Vultr Cloud GPU

Updated on July 25, 2024

Introduction

Retrieval Augmented Generation (RAG) combines retrieval-based and generation-based approaches to produce responses. The technique uses embeddings to capture the semantic meaning of the text in a corpus, which allows it to answer questions more accurately and in a personalized manner.

Ollama allows you to run Large Language Models (LLMs) such as Llama 2, Mistral, and CodeLlama on your local machine and build chatbots using Langchain integrations and interfaces like Streamlit and Gradio.

In this guide, you build a chatbot with RAG capabilities by using Langchain to split the text, ChromaDB to store embeddings, and Chainlit as a chat interface to generate responses using the Mistral model.

Prerequisites

Before you begin:

  • Deploy a Vultr Cloud GPU server and access it using SSH as a non-root user with sudo privileges.

Preparing The Environment

In this section, you install the Ollama CLI tool, the Mistral LLM, and the packages required to split a PDF document with Langchain, create and store the embeddings, and launch the Chainlit interface.

  1. Download the Ollama CLI.

    console
    $ curl -fsSL https://ollama.com/install.sh | sh
    
  2. Download the Mistral model.

    console
    $ ollama pull mistral
    
  3. Create a requirements.txt file.

    console
    $ sudo nano requirements.txt
    
  4. Copy and paste the list of packages below.

    console
    langchain
    langserve[all]
    langchain-cli
    langsmith
    Pillow
    tesseract
    lxml
    docx2txt
    pdf2image
    xlrd
    pandas
    python-dateutil>=2.8.2
    gpt4all
    chromadb
    pypdf
    langchain_community
    langchainhub
    chainlit
    

    Save and close the file.

  5. Install all the packages.

    console
    $ sudo pip3 install -r requirements.txt
    
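
    With the packages installed, you can optionally run a short smoke test to confirm that the pulled Mistral model responds through Langchain's Ollama integration. The snippet below is an illustrative sketch, not part of the guide's files; it assumes the Ollama service started by the installer is running on its default port (11434).

    python
    # Optional smoke test (illustrative): confirm the local Mistral model
    # responds through Langchain's Ollama integration.
    from langchain_community.llms import Ollama

    llm = Ollama(model="mistral")
    print(llm.invoke("Reply with the single word: ready"))
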
  6. Using the terminal of your local machine, upload a PDF file to the server.

    console
    $ scp -rp /path/to/local/dir username@server_ip:/path/to/remote/dir
    

    Make sure to replace:

    • /path/to/local/dir with the actual path where your PDF file is located.
    • username with the actual username of your GPU server.
    • server_ip with the actual Vultr server IP.
    • /path/to/remote/dir with the actual path where you want the file to be copied on the server.

    Upon executing this command, you are prompted for a password. Enter the password of the server you provisioned.

  7. Verify that the file is present on the server.

    console
    $ ls
    

Creating The Embeddings

In this section, you split the document into chunks using a text splitter, create embeddings from those chunks using GPT4AllEmbeddings, and store the embeddings in a Chroma database.

  1. Create a load.py file.

    console
    $ nano load.py
    
  2. Import all the necessary modules in the load.py file.

    python
    from langchain_community.vectorstores import Chroma
    from langchain_community.embeddings import GPT4AllEmbeddings
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.document_loaders import PyPDFLoader
    

    The above Langchain modules allow you to load the PDF using PyPDFLoader, split its text using RecursiveCharacterTextSplitter, create embeddings using GPT4AllEmbeddings, and store those embeddings in ChromaDB.

  3. Load the PDF document in the load.py file.

    python
    vdb_dir = 'vdb'
    loader = PyPDFLoader('./document.pdf')
    documents = loader.load()
    

    Make sure to replace document.pdf with the name of your actual PDF document.

  4. Split the document to create embeddings in the load.py file.

    python
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    texts = text_splitter.split_documents(documents)
    

    The above lines split the PDF document into smaller, manageable chunks using a text splitter that produces chunks of 1000 characters each with no overlap.
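
    To see what the splitter produced, you can optionally add a quick inspection such as the following. This is an illustrative sketch; the exact chunk count and contents depend on your PDF.

    python
    # Optional inspection (illustrative): check how many chunks were created
    # and preview the first one.
    print(len(texts))                     # number of chunks
    print(texts[0].page_content[:200])    # first 200 characters of the first chunk
    print(texts[0].metadata)              # source file and page number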

  5. Create the embeddings and load them into the database in the load.py file.

    python
    print("Save to database %s." % vdb_dir)
    vectordb = Chroma.from_documents(documents=texts, embedding=GPT4AllEmbeddings(), persist_directory=vdb_dir)
    vectordb.persist()
    print("Done")
    

    The above lines create embeddings from the chunks using GPT4AllEmbeddings() and store them in the Chroma database, which is persisted to the vdb directory.
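
    Optionally, before closing the file, you can append a quick sanity check that runs a similarity search against the freshly built database. This is an illustrative sketch; replace the query string with a question about your own PDF.

    python
    # Optional sanity check (illustrative): retrieve the two chunks most
    # similar to a sample query from the newly built database.
    results = vectordb.similarity_search("your question about the document", k=2)
    for doc in results:
        print(doc.metadata, doc.page_content[:100])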

    Save and close the file.

  6. Execute the file.

    console
    $ python3 load.py
    

    Output.

    output
    Save to database vdb.
    bert_load_from_file: gguf version     = 2
    bert_load_from_file: gguf alignment   = 32
    bert_load_from_file: gguf data offset = 695552
    bert_load_from_file: model name           = BERT
    bert_load_from_file: model architecture   = bert
    bert_load_from_file: model file type      = 1
    bert_load_from_file: bert tokenizer vocab = 30522
    Done
    

Creating The Interface

In this section, you load the Mistral model, define the retrieval chain and the chatbot, define the async functions that add the chat functionality, and launch the interface.

  1. Create an app.py file.

    console
    $ nano app.py
    
  2. Import all the necessary modules in the app.py file.

    python
    from langchain import hub
    from langchain_community.embeddings import GPT4AllEmbeddings
    from langchain_community.vectorstores import Chroma
    from langchain_community.llms import Ollama
    from langchain.callbacks.manager import CallbackManager
    from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
    import chainlit as cl
    from langchain.chains import RetrievalQA,RetrievalQAWithSourcesChain
    

    The above modules from Langchain and Chainlit provide the functionality to launch an interface, build a retrieval chain, and define the chatbot configuration.

  3. Pull the rlm/rag-prompt-mistral RAG prompt from the Langchain hub in the app.py file.

    python
    QA_CHAIN_PROMPT = hub.pull("rlm/rag-prompt-mistral")
    
  4. Load the model in the app.py file.

    python
    def load_llm():
        llm = Ollama(
            model="mistral",
            verbose=True,
            callback_manager=CallbackManager([StreamingStdOutCallbackHandler()]),
        )
        return llm
    
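
    The load_llm() function above wraps the local Mistral model in Langchain's Ollama class and attaches a streaming callback so that generated tokens print to the terminal as they are produced. As an optional, illustrative check (not part of the guide's files), you can call it directly:

    python
    # Optional check (illustrative): the returned LLM streams its answer to
    # stdout because of the StreamingStdOutCallbackHandler attached above.
    llm = load_llm()
    llm.invoke("Say hello in one short sentence.")
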
  5. Define the retrieval chain in the app.py file.

    python
    def retrieval_qa_chain(llm, vectorstore):
        qa_chain = RetrievalQA.from_chain_type(
            llm,
            retriever=vectorstore.as_retriever(),
            chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
            return_source_documents=True,
        )
        return qa_chain
    

    The above retrieval_qa_chain function creates a question-answering chain that returns the source documents along with the answer to the user prompt.

  6. Define the question-answering bot in the app.py file.

    python
    def qa_bot():
        llm = load_llm()
        DB_PATH = "vdb/"
        vectorstore = Chroma(persist_directory=DB_PATH, embedding_function=GPT4AllEmbeddings())
    
        qa = retrieval_qa_chain(llm, vectorstore)
        return qa
    

    The above qa_bot function uses the retrieval_qa_chain function to set up the question-answering bot, loading the persisted Chroma database from the vdb directory.
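
    If you want to verify the chain before adding the Chainlit handlers, you can temporarily append a block like the following to app.py and run it with python3 app.py. This is an illustrative sketch; it assumes load.py has already populated the vdb directory, and depending on your Langchain version you may need chain.invoke({"query": ...}) instead of calling the chain directly.

    python
    # Optional standalone test (illustrative, not part of the final app.py):
    # query the chain directly before wiring up the Chainlit interface.
    if __name__ == "__main__":
        chain = qa_bot()
        result = chain({"query": "What is this document about?"})
        print(result["result"])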

  7. Define the initial question-answering system in the app.py file.

    python
    @cl.on_chat_start
    async def start():
        chain = qa_bot()
        msg = cl.Message(content="Bot is initializing")
        await msg.send()
        msg.content = "Hi, how may I help you?"
        await msg.update()
        cl.user_session.set("chain", chain)
    

    The above start() function initializes the question-answering chain when a chat session starts, notifies the user that the bot is ready, and stores the chain in the user session so it can be reused for later messages.

  8. Define message handling during the chat session in the app.py file.

    python
    @cl.on_message
    async def main(message):
        chain = cl.user_session.get("chain")
        cb = cl.AsyncLangchainCallbackHandler(
            stream_final_answer=True,
            answer_prefix_tokens=["FINAL", "ANSWER"]
        )
        cb.answer_reached = True

        res = await chain.acall(message.content, callbacks=[cb])
        print(f"response: {res}")
        answer = res["result"]
        answer = answer.replace(".", ".\n")
        sources = res["source_documents"]

        if sources:
            answer += "\nSources: " + str(sources)
        else:
            answer += "\nNo Sources found"

        await cl.Message(content=answer).send()
    

    The main() function above handles incoming messages during a chat session, processes them using the question-answering chain, formats the response, and sends it back to the user along with the source.

    Save and close the file.

  9. Allow incoming connections to port 8000.

    console
    $ sudo ufw allow 8000
    
  10. Launch the Chainlit interface.

    console
    $ chainlit run app.py -w
    

    The chatbot can be accessed at http://SERVER_IP:8000.

    The interface should look like the one below once initiated.

    Chatbot Interface

    The interface provides many options: you can prompt the bot with specific questions about the document you uploaded, and you can also upload a new document.

    While generating a response, the interface shows the whole retrieval process, including the specific passages from which the response is drawn and how they are processed.

Conclusion

In this guide, you built a chatbot that uses Ollama to run the Mistral 7B model on your Vultr Cloud GPU server, integrated RAG functionality using Langchain and embeddings, and created a chat interface to query the model about the specific document you provided.