Building a simple RAG application using OpenAI, LangChain, and Streamlit
In March 2024, I embarked on a thrilling journey as I commenced my Master of Artificial Intelligence program. Among the many intriguing subjects, Programming with Python presented a delightful blend of simplicity and challenge.
The beauty of this course lay in its emphasis on project-based learning, allowing for a hands-on experience in mastering the subject. For my capstone project, I chose to build a Question and Answering (Q&A) interface targeting Indonesian citizens curious about Australia’s Working Holiday Program.
The problem at hand
Over recent years, the allure of Australia's Working Holiday Visa has grown significantly among Indonesian youth. Escalating search trends on Google and high engagement rates on social media platforms like TikTok and Instagram, especially post-Covid-19, make this interest palpable.
However, a significant hurdle for Indonesian applicants is the language barrier they face when trying to comprehend the official information from the Australian Department of Home Affairs. This gap has led many to rely heavily on information propagated by social media influencers and content creators, which is often misleading or inaccurate because it lacks professional or authoritative sources.
Introducing the RAG Q&A Application
To surmount these challenges, I proposed the construction of a Q&A interface. This platform enables users to input questions about the Working Holiday Visa in Indonesian and receive reliable answers sourced from authoritative resources such as official government websites. The key features of this application include:
- A simple Q&A interface developed using Streamlit that accepts questions in both Indonesian and English and offers answers in Indonesian.
- A knowledge base created by scraping and processing relevant pages from the official Australian Department of Home Affairs website and its Indonesian counterpart.
- An implementation of Retrieval-Augmented Generation (RAG), which retrieves pertinent information from the knowledge base and generates accurate, grounded answers.
Step-by-Step Guide to Creating a Q&A Interface
In this section, I will share an in-depth guide to the steps taken to implement the project, focusing on data preparation, importing dependencies, programming, and setting up a Streamlit interface.
Data Preparation
The first step was to assemble the text data from the official websites and store it in a .txt file named scraped_data.txt.
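As a rough illustration, the collection step could look like the sketch below. It assumes the requests and beautifulsoup4 packages, which are not part of the app itself; the URLs are two of the source pages listed at the end of this post.
# A minimal scraping sketch (assumes the requests and beautifulsoup4 packages)
import requests
from bs4 import BeautifulSoup

# Two of the source pages; the full list appears at the end of this post
urls = [
    "https://immi.homeaffairs.gov.au/what-we-do/whm-program/",
    "https://indonesia.embassy.gov.au/jakt/visa462.html",
]

with open("scraped_data.txt", "w", encoding="utf-8") as f:
    for url in urls:
        html = requests.get(url, timeout=30).text
        # Keep only the visible text of each page
        text = BeautifulSoup(html, "html.parser").get_text(separator="\n")
        f.write(text + "\n\n")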
Importing Dependencies
In a new file called streamlit_app.py, I imported the following dependencies (having first installed the corresponding packages):
from pathlib import Path

import streamlit as st
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAI, OpenAIEmbeddings
Programming
The programming process begins by setting the system_prompt variable. This instructs the LLM to generate outputs in Bahasa Indonesia, limiting them to 100 words to conserve tokens.
# Define the function to generate a response using RAG
def generate_response(openai_api_key, query_text):
    # Prepare the system prompt for Bahasa Indonesia responses
    system_prompt = "Always generate output in Bahasa Indonesia! Output must be no more than 100 words!\n"
    query_text = system_prompt + query_text  # Prepend the system prompt to the user's question
Within the function, the pathlib library is used to load the .txt file into the application.
    # Define the document path
    doc_path = Path(__file__).parent / "scraped_data.txt"
    # Read the document
    with open(doc_path, 'r', encoding='utf-8') as file:
        documents = [file.read()]
The content of the documents list is divided into smaller chunks, or "texts", using the RecursiveCharacterTextSplitter from the LangChain library. The chunk_size parameter limits each chunk to 2,500 characters, while chunk_overlap maintains a 250-character overlap between chunks to preserve context.
    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=2500, chunk_overlap=250)
    texts = text_splitter.create_documents(documents)
After splitting the text into chunks, an instance of OpenAIEmbeddings is created using the provided openai_api_key. Embeddings are vector representations of the chunked text, used for similarity search and retrieval. A Chroma vector store (db) is then created from the texts and embeddings to perform similarity searches over the document chunks.
    # Select embeddings
    embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
    # Create a vector store from documents
    db = Chroma.from_documents(texts, embeddings)
Next, a retriever interface (retriever) is created from the vector store (db); it fetches the document chunks most relevant to a query. A RetrievalQA instance then combines a language model (OpenAI) with the retriever to generate answers based on the retrieved context. The chain_type='stuff' parameter selects the "stuff" documents chain, which takes a question and the retrieved context and "stuffs" the context into the prompt for the language model to generate an answer.
    # Create retriever interface
    retriever = db.as_retriever()
    # Create QA chain
    qa = RetrievalQA.from_chain_type(llm=OpenAI(openai_api_key=openai_api_key), chain_type='stuff', retriever=retriever)
    return qa.run(query_text)
The function concludes with return qa.run(query_text). When it is called, the following steps occur:
- The retriever fetches the most relevant document chunks from the vector store based on the query_text.
- The "stuff" documents chain forms a prompt for the language model from the retrieved context and the query_text.
- The OpenAI language model generates an answer based on the prompt.
- The generated answer is returned.
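To illustrate, a call to the function might look like this; the API key and question below are placeholders, not real values.
# Hypothetical usage of generate_response (placeholder key and question)
answer = generate_response(
    openai_api_key="sk-...",  # placeholder; never hardcode a real key
    query_text="Apa saja syarat utama Working Holiday Visa?",  # "What are the main requirements for the Working Holiday Visa?"
)
print(answer)  # Expect an answer in Bahasa Indonesia of at most ~100 words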
Setting Up the Streamlit Interface
After completing the programming part, I created a simple page title, text input, and logic to process the query and return the answer on the interface. I also included links to original sources of the answers for reference.
# Setup the page title to be displayed on Streamlit
st.set_page_config(page_title='🇮🇩 Working Holiday Visa 101 🇦🇺')
st.title('🇮🇩 Working Holiday Visa 101 🇦🇺')
# Setup the input textfield to take questions from user
query_text = st.text_input('Question / Pertanyaan:', placeholder='Please provide your question here / Masukkan pertanyaan Anda di sini.')
# Process the query input by the user
if query_text:
    with st.spinner('Sedang mencari jawaban...'):  # "Searching for an answer..."
        response = generate_response(openai_api_key, query_text)
    # Fallback message: "Sorry, I cannot answer that question; please ask something else."
    st.info(response if response else "Mohon maaf saya tidak bisa menjawab pertanyaan tersebut, mohon menanyakan hal yang lain.")

# Display the sources from which the answers are generated
st.write('\n\n\n')  # Add more line spaces
st.markdown("**The answers are generated from the following sources as of April 25, 2024. / Jawaban dihasilkan dari sumber-sumber berikut, per tanggal 25 April 2024.**")
st.markdown("* <https://immi.homeaffairs.gov.au/what-we-do/whm-program/>")
st.markdown("* <https://immi.homeaffairs.gov.au/visas/getting-a-visa/visa-listing/work-holiday-462>")
st.markdown("* <https://immi.homeaffairs.gov.au/visas/getting-a-visa/visa-listing/work-holiday-462/first-work-holiday-462>")
st.markdown("* <https://immi.homeaffairs.gov.au/visas/getting-a-visa/visa-listing/work-holiday-462/second-work-holiday-462>")
st.markdown("* <https://immi.homeaffairs.gov.au/visas/getting-a-visa/visa-listing/work-holiday-462/third-work-and-holiday-462>")
st.markdown("* <https://indonesia.embassy.gov.au/jakt/visa462.html>")
Deploying the Application
To keep the OpenAI API key secure, I initially planned to store it using the .env approach. However, Streamlit Sharing doesn't support .env files and instead requires secrets to be stored in a .toml file.
# Access the OpenAI API key from secrets.toml
openai_api_key = st.secrets["openai"]["api_key"]
The .toml syntax looks like the example below (see Streamlit's secrets management documentation for details).
[openai]
api_key = "sk-CvPEhChFaL7Mu**********************************"
After merging and pushing the code to the Git repository, I connected the repository to the Streamlit Share dashboard. In the settings menu, I added the OpenAI API key to the secrets section using the .toml syntax created earlier. The app is now up and running and, since I set its visibility to public, accessible to everyone.
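One deployment detail worth noting: Streamlit builds the app from a dependency manifest in the repository. A minimal requirements.txt for this project might look like the following; the exact package set is an assumption, inferred from the imports above.
streamlit
langchain
langchain-openai
langchain-community
chromadb
Before deploying, the app can also be smoke-tested locally with streamlit run streamlit_app.py.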
Conclusion
In conclusion, this project provided a comprehensive, hands-on experience of working with Python. It offered a unique chance to apply theoretical knowledge to practical, real-world problems. While the journey was challenging, the outcome was immensely rewarding, further cementing my passion for programming and artificial intelligence.
Reference
https://blog.streamlit.io/langchain-tutorial-4-build-an-ask-the-doc-app/