Exploring Retrieval-Augmented Generation (RAG): Enhancing LLMs with External Knowledge

RAG, MLOps

Oct 8, 2024 | 9 min read

SRE-Intern

Discover Retrieval-Augmented Generation (RAG), a technique that enhances Large Language Models (LLMs) by integrating external knowledge bases to provide more accurate, up-to-date, and contextually relevant responses. RAG combines information retrieval with generative AI to address issues like hallucinations and outdated training data, ensuring LLMs stay relevant and current. Learn how RAG works and the benefits it offers.

Co-author: Saurabh-Hirani

Introduction

RAG applications, vector databases, embeddings, context windows.

If you haven't been living under a rock, you would've heard these terms in the context of AI applications.

  1. What does it take to build such an application and run it against ChatGPT APIs or self-hosted models?

  2. What fundamental concepts do I need to understand to make sense of conversations which use all these buzzwords?

  3. Why is melody chocolaty?

We will answer 2 of the above 3 questions in this post.

Our goal is to provide a first-person perspective on how we explored the tools and technologies, along with insights into how Generative AI is likely to impact our work.

In this introductory post, we will cover the following topics:

  • The basics of Large Language Models (LLMs)

  • An introduction to Retrieval-Augmented Generation (RAG)

  • The foundational knowledge needed to build a RAG-based application

This information will serve as the base for future posts, where we will walk through building an application and enhancing it with RAG.

At first, there was Google

One day, you were searching for answers on Google, and the next, a switch flipped—you were using ChatGPT. The transition felt fast and natural. Before we dive into how ChatGPT provides better answers, let’s first explore why it does so.

When using Google, the typical process looks like this:

  1. Search for the best expression of your query.

  2. Review the top results to see if they provide the answer.

  3. If not, turn to forums like Stack Overflow or community mailing lists for discussions.

The key word here is "search". While Google uses sophisticated algorithms to find relevant pages, the process is still indirect, requiring humans to sift through, parse, and filter information.

ChatGPT simplifies this by removing the need to forage through multiple pages, presenting a direct answer to the best of its abilities. While it may not always offer the depth of a lively forum discussion, it often provides a solid starting point that you can refine and build upon.

The convenience of having a conversation without being redirected to multiple pages has driven the widespread adoption of ChatGPT.

How does ChatGPT work?

ChatGPT accomplishes its tasks by utilising LLMs from OpenAI.

This is where the buzzwords start appearing, so let’s break them down:

  • ChatGPT - Chat Generative Pre-Trained Transformer.

  • LLMs - Large Language Models.

  • OpenAI - The organization that developed key AI models such as GPT (for text generation) and DALL-E (for generating images from text descriptions).

Now, let's go one level deeper and understand what Chat Generative Pre-Trained Transformer means.

  • Generative - Refers to the ability of this model to generate text.

  • Pre-trained - The model is pre-trained on a vast amount of data.

  • Transformer - The neural network architecture behind the model, introduced in the 2017 research paper "Attention Is All You Need."

Yes, it's buzzwords leading to more buzzwords, but we're getting there. One more level of understanding before tying it all together:

  • Model - In machine learning, a model is a mathematical representation, learned from data, that is designed to accomplish specific tasks, e.g. classifying images, making weather predictions, answering queries, etc.

  • Neural network - A type of machine learning model inspired by the structure of the human brain.

If you've made it this far, let's take a moment to pause and revisit the opening statement:

ChatGPT accomplishes its tasks by utilising LLMs (the GPT series of models) from OpenAI.

This can be expanded as:

ChatGPT leverages large language models (LLMs) to accomplish this, using a pre-trained model built on the Transformer neural network architecture.

However, we still haven't fully unpacked LLMs.

  • Large - Refers to the vast number of parameters in a model. For example, in the task of classifying whether an image contains a cat or a dog, parameters are the weights that the model learns during training, which help it distinguish features like the shape of the tail, the length of the whiskers, and other patterns. The larger and more fine-tuned the parameters, the better the model can capture subtle details, increasing its accuracy in making predictions.

  • Language Models - These models can understand, process and generate natural language.

Fine-tuning occurs through a process called backpropagation, which works as follows (a short code sketch follows this list):

  • Take the input (e.g. an image of an animal).

  • Use the model’s learned parameters (weights) to make predictions based on input features (e.g., the shape of the tail might have a weight of 0.9, and the size of the ears might have a weight of 0.6).

  • Perform mathematical operations using these weights and the input features (this process is more complex and involves multiple layers of neurons in a neural network).

  • Compare the predicted output with the actual answer to calculate the error (how far off the model's prediction was).

  • Adjust the parameter values (backpropagation) to minimize the error, making the model more accurate over time.
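To make these steps concrete, here is a minimal, illustrative sketch in Python: a single "neuron" with two made-up weights learns to tell cats from dogs using two invented input features. Real LLMs repeat the same loop across billions of parameters and many layers, but the mechanics of predict, compare, and adjust are the same.

    # Toy version of the loop above: two made-up features (tail shape, ear size)
    # and two weights that get adjusted until the predictions match the answers.
    import numpy as np

    # Made-up training data: [tail_shape, ear_size] -> 0 = cat, 1 = dog
    X = np.array([[0.9, 0.2], [0.8, 0.3], [0.2, 0.9], [0.1, 0.8]])
    y = np.array([0.0, 0.0, 1.0, 1.0])

    weights = np.array([0.9, 0.6])   # initial parameter values
    bias = 0.0
    learning_rate = 0.5

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for epoch in range(1000):
        # 1. Use the current weights to make predictions from the input features.
        predictions = sigmoid(X @ weights + bias)
        # 2. Compare predictions with the actual answers to get the error.
        error = predictions - y
        # 3. Adjust the weights in the direction that reduces the error
        #    (this gradient step is what backpropagation computes layer by layer).
        weights -= learning_rate * (X.T @ error) / len(y)
        bias -= learning_rate * error.mean()

    print(np.round(sigmoid(X @ weights + bias), 2))  # approaches [0, 0, 1, 1]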

The fully expanded statement is:

GPT, built on the Transformer neural network architecture, leverages large language models fine-tuned with vast numbers of parameters to generate accurate responses.

What does ChatGPT do with the input?

Image classification provides a simple yes/no answer, but a philosophical question like, "Why is melody chocolaty?" requires a more nuanced approach. It requires the model to generate a thoughtful response rather than just classify something.

A thoughtful response implies the act of thinking.

"Think" is a loaded term here. We've already established that these models are trained on large datasets and generate responses based on the knowledge they’ve gained through them.

What does ChatGPT do between you asking it a question and it giving the answer?

It doesn’t think. It “predicts” the next sequence of words to further the conversation. This brings us to the next question:

How does it predict?

For a detailed explanation of how prediction works, without getting into complex math, check out this great post by Miguel.

An overview of that approach is another buzzword-loaded statement, which we will deconstruct:

ChatGPT takes a user prompt, which is preprocessed and passed to GPT. GPT converts it into embeddings, which capture the semantic relationships between words and phrases. Based on a context window of a certain token size, it then uses these embeddings to predict the next word or sequence of words and generate a coherent response.

  • User prompt - The input or query you provide to ChatGPT e.g. “Why is the sky blue?”

  • Semantic relationship - The meaning and connections between words and phrases e.g. “dog” and “puppy” have a closer semantic relationship as they are related in meaning.

  • Context window - The portion of the input (user prompt or conversation history) that ChatGPT considers when generating its response. It is measured in tokens. A larger context window allows the model to consider more context when generating a response. Smaller context windows may limit the model's understanding, leading to issues like "hallucination," where the generated answers may be syntactically correct but lack proper context, resulting in seemingly random responses.

  • Token - A piece of text processed by the model. It can be a word, part of a word, punctuation mark, etc. Tokens are converted into embeddings for further processing.

  • Embeddings - Numerical vectors that represent data (such as words or phrases) in a continuous vector space, capturing the meaning and relationships between them. Different models convert different types of input into embeddings: a text model converts text into embeddings, an audio model processes audio files into embeddings, and other data types (like images) are handled by their respective models.

  • Predict: Prediction involves taking a series of embeddings and performing mathematical operations to identify the most similar embeddings. These embeddings are then converted back into words or phrases, which form the "magic" behind generating a coherent answer to a prompt.
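To make the idea of tokens from the list above concrete, the snippet below uses the tiktoken library (the open-source tokenizer used by OpenAI's models) to split a prompt into token ids and back. The exact split and count depend on the tokenizer, so treat the output as indicative rather than fixed.

    # Rough illustration of tokens, assuming the tiktoken library is installed
    # (pip install tiktoken).
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    prompt = "Why is the sky blue?"
    token_ids = enc.encode(prompt)                # text -> list of integer token ids
    print(token_ids)
    print([enc.decode([t]) for t in token_ids])   # ids -> the text pieces they stand for
    print(len(token_ids), "tokens")               # this count is what context windows measure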

Let’s dive one level deeper to understand embeddings:

  • Vectors - Mathematical definition: while scalars are single-dimensional values that represent only a magnitude (like temperature), vectors have both magnitude and direction, e.g. velocity, force, etc.

  • Vectors in the context of embeddings - Embeddings are numerical vectors that represent objects (sentences, words, images, etc.) in a multi-dimensional space. Each dimension of this space captures different features or relationships that the model has learned from training data.

  • Dimensions - They represent learned features of an embedding derived from the training data. For example, the word “king” can be represented by a 300-dimensional vector such as [0.25, 0.12, ..., 0.07]. Some dimensions correspond to human-understandable concepts like gender or royalty, while others capture more abstract relationships. For instance, the embedding for "king" might be close to "queen" in this space, indicating a strong semantic connection.

Basically, we are breaking down the prompt into vectors that can be correlated with other vectors to generate appropriate context for a response.
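A toy example makes this correlation concrete. The vectors below are made up purely for illustration (real embeddings come from a trained model and have hundreds or thousands of dimensions), but the comparison step, cosine similarity, is the standard way of measuring how related two embeddings are.

    # Made-up 4-dimensional "embeddings"; only the relative similarities matter here.
    import numpy as np

    embeddings = {
        "king":  np.array([0.90, 0.80, 0.10, 0.05]),
        "queen": np.array([0.88, 0.82, 0.12, 0.06]),
        "apple": np.array([0.05, 0.10, 0.90, 0.85]),
    }

    def cosine_similarity(a, b):
        # 1.0 means the vectors point in the same direction, 0.0 means unrelated.
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # close to 1.0
    print(cosine_similarity(embeddings["king"], embeddings["apple"]))  # much lower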

This is why statements like "XYZ model increases context window" or "ABC model released with N billion tokens" now make more sense. Training these models requires vast computational power, and research teams are continually pushing the boundaries of how much context and how many tokens a model can handle. This expansion allows the model to gain a deeper understanding of its task.

The fully expanded statement is:

ChatGPT takes a user prompt and passes it on to the LLM (GPT), which converts it into numerical vectors in a continuous space, capturing the meanings and relationships between words and phrases. Based on the size of the context it can handle, the model uses these numerical vectors in a high-dimensional space to predict the next word or sequence of words, generating a coherent response.

LLM Limits

Every system has its limits, and LLM-based systems like ChatGPT are no exception. Once we understand these limitations, we can augment their knowledge to build better context for responses. Let’s explore some of these limitations:

  • LLMs are pre-trained models and cannot answer queries involving data they were not trained on or data that has not been explicitly provided as input during inference. For example, GPT-4 can only provide responses based on the data it was trained on, i.e. data that could be scraped off the web till September 2023, so it cannot answer questions like, "Who won the ICC T20 World Cup in 2024?"

  • Training models is computationally expensive. Organizations that train these models use generic datasets scraped from the internet. However, many organizations that want to use these models need responses based on specific, private information, such as internal company sales records or proprietary knowledge bases, which is not publicly available. Moreover, the sheer complexity of building an LLM from scratch requires research and costly hardware. Only the likes of OpenAI, with billions of dollars in funding, can afford the massive hardware infrastructure required.

  • The worst problem with LLMs is that they can hallucinate as mentioned earlier. Let’s look at an example showcasing this situation:

Fig: ChatGPT Hallucinating. (Source:https://x.com/goodside/status/1609972546954317824)

These limitations might lead one to believe the following:

  1. LLM Models cannot provide answers about current events.

  2. Internal data repositories cannot be used for training existing LLMs.

This is where new buzzwords make their way in:

End users can overcome these limitations by using prompt engineering to inject relevant context or programmatically leveraging RAG (Retrieval-Augmented Generation) applications to populate and search vector databases with relevant embeddings, ensuring more accurate and contextually appropriate responses.

  • Prompt engineering: The process of carefully designing and structuring the input (prompt) provided to an AI model. It helps “inject” the right context so that the model can generate more relevant responses. The act of generating these responses based on new, unseen input data is called inference.

    While prompt engineering is useful, it relies on a human to think and set the right context, making it more of an art than a science. However, if the process of injecting context requires looking up external information (such as searching the internet), it may not be scalable. This is where a programmatic approach is needed—retrieving the right data and automatically building the appropriate context for generating accurate responses.

  • RAG: Retrieval-Augmented Generation - This approach combines retrieval-based methods with generative AI models. Instead of relying solely on pre-trained knowledge, RAG retrieves up-to-date information from external data sources, such as knowledge bases or APIs. The aim is to retrieve the right context and augment the prompt so the model can generate better responses.

  • Vector databases: These are specialized databases that are designed to store and search data in the form of embeddings. Vector databases excel at similarity-based searches, which add context by identifying data based on meaning rather than exact matches. This makes them more effective than traditional databases, which are built for keyword-based or structured data searches (e.g., tables, rows, etc.). Traditional databases cannot perform similarity-based lookups on embeddings.

    Example: If you want your LLM to provide guidance on the "pros and cons of using Infrastructure as Code," a vector database search for "IaC analysis" could retrieve relevant information for questions like "How does Terraform compare to CloudFormation?" even if those exact words weren't in the query. This is because the vector database understands the meaning behind the query, rather than just matching specific keywords.
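To show the mechanics without committing to any particular product, here is a minimal in-memory stand-in for a vector database. The embed() function below is a placeholder that produces fake vectors just so the example runs on its own; with a real embedding model the similarity scores would reflect meaning, and a real vector database (Pinecone, Qdrant, ChromaDB, Milvus, etc.) would add indexing, persistence, and filtering on top.

    import hashlib
    import numpy as np

    def embed(text: str) -> np.ndarray:
        # Placeholder embedding: a deterministic random vector derived from the text.
        # With a real model, semantically similar texts get similar vectors; here the
        # vectors are meaningless and only demonstrate the store/search mechanics.
        seed = int(hashlib.sha256(text.encode()).hexdigest(), 16) % (2**32)
        v = np.random.default_rng(seed).normal(size=384)
        return v / np.linalg.norm(v)

    class TinyVectorStore:
        def __init__(self):
            self.texts: list[str] = []
            self.vectors: list[np.ndarray] = []

        def add(self, text: str) -> None:
            # Store the text alongside its embedding.
            self.texts.append(text)
            self.vectors.append(embed(text))

        def search(self, query: str, k: int = 2) -> list[tuple[float, str]]:
            # Rank stored texts by cosine similarity to the query embedding.
            q = embed(query)
            scores = [float(q @ v) for v in self.vectors]  # unit vectors, so dot = cosine
            return sorted(zip(scores, self.texts), reverse=True)[:k]

    store = TinyVectorStore()
    store.add("Terraform and CloudFormation are Infrastructure as Code tools.")
    store.add("Our cafeteria menu changes every Friday.")
    print(store.search("pros and cons of using Infrastructure as Code"))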

RAG architecture

Let’s bring everything together by addressing a key question from the previous section:

  • “Who won the ICC T20 World Cup in 2024?”

Assume the LLM used to answer this question has knowledge of cricket and its history up to 2023, but is not aware of events after December 2023.

To effectively answer this question, we need to do two things:

  1. Write more precise prompts to clearly convey our intent.

  2. Supplement the LLM with updated information, allowing it to enhance its knowledge with recent facts.

We briefly touched on Retrieval-Augmented Generation (RAG) earlier. Now, let’s break down how to build a RAG application to answer this query.

RAG applications follow these key stages:

  • Retrieval: Gather relevant information based on the user's query.

  • Augmentation: Use the retrieved data to craft an improved prompt.

  • Generation: Use the refined prompt to generate the final response from the LLM.

This process acts like a feedback loop, continuously retrieving and feeding the LLM the right information to help it better understand the problem domain.
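Put together, a bare-bones version of these three stages might look like the sketch below. The helper names are hypothetical: search_news() stands in for whatever retrieval source you use (a vector database, a news API, a web scraper), and call_llm() for whichever LLM API or self-hosted model generates the answer.

    def search_news(query: str) -> list[str]:
        # Retrieval: fetch documents relevant to the query (canned data for illustration).
        return ["India lifts the ICC Men's World Cup Trophy 2024 after 14 years"]

    def call_llm(prompt: str) -> str:
        # Placeholder for a real LLM call (OpenAI API, self-hosted model, ...).
        return "(LLM-generated answer based on the prompt)"

    def answer(question: str) -> str:
        # 1. Retrieval: gather relevant, up-to-date information.
        documents = search_news(question)
        # 2. Augmentation: build an enriched prompt around the retrieved context.
        context = "\n".join(documents)
        prompt = (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}\n"
        )
        # 3. Generation: let the LLM produce the final response.
        return call_llm(prompt)

    print(answer("Who won the ICC T20 World Cup in 2024?"))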

To make this work, we need a specific capability in RAG applications that enables them to function as AI agents.

An AI agent is software that performs tasks on behalf of a user, automating processes, making decisions, and interacting intelligently with its environment. In this case, AI agents automate the process of querying the LLM with the necessary data and refining prompts through a RAG pipeline. LLM agents are AI agents that specifically utilize LLMs.

Let’s design an application to determine what existing knowledge the LLM can use and what additional data we need to provide. If a human were to solve this query, they would need to answer the following:

  • Is this really a sports-related query?

    • The LLM can handle this.

  • If it's about sports, which sport is it referring to?

    • The LLM can answer this too.

  • Do we need to fetch results from a specific data source for sports news?

    • The LLM agent will decide based on its interaction with the LLM.

  • What kind of data source should be used to retrieve the news?

    • The LLM agent will determine whether to scrape public news or sports websites, prepare and feed this data to the LLM, or query an already enriched vector database for relevant information.

  • Do we only need to fetch data based on specific parameters, or should we perform a semantic search across news headlines?

    • The LLM agent will make this decision based on the complexity of the query.

Let us look at the above stages in the context of this query:

Fig: RAG working for Sports Chatbot

Retrieval stage

To handle these queries, we need to design several components:

  • Tools: Functions used to perform tasks beyond the LLM's internal capabilities, like fetching data from APIs, performing calculations, or accessing a database for updates.

  • Router: This function determines the correct course of action, such as choosing between system prompts or calling a tool to fetch additional information. It queries the LLM to identify if a tool call is needed.

The following steps occur during the Retrieval Stage:

  1. Determine if the user query is a sports-related query. If yes, identify which sport the query pertains to.

  2. For example, with the query "Who won the ICC Men's World Cup 2024?", the LLM can infer the event "International Cricket Council Men's World Cup" and the year "2024". By querying specialized prompts, the LLM identifies the sport as "Cricket."

  3. After identifying the sport and timeframe, the RAG application issues function calls to retrieve relevant cricket news from 2024 based on the keywords “ICC Men's World Cup 2024.”

  4. If the retrieved headline is "India lifts the ICC Men's World Cup Trophy 2024 after 14 years," multiple specialized LLM calls are made to extract the key information—e.g., the country "India."
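A sketch of how the router and tools from this stage could fit together is shown below. All names here are hypothetical placeholders: call_llm() stands in for a real LLM call, and fetch_cricket_news() for an API call or a vector database lookup.

    def call_llm(prompt: str) -> str:
        # Placeholder: a real implementation would query GPT or a self-hosted model.
        # A canned decision is returned so the sketch runs on its own.
        return "sport=cricket, needs_external_data=yes"

    def fetch_cricket_news(keywords: str) -> list[str]:
        # Tool: placeholder for an API call or a vector database lookup.
        return ["India lifts the ICC Men's World Cup Trophy 2024 after 14 years"]

    TOOLS = {"cricket_news": fetch_cricket_news}

    def route(user_query: str) -> list[str]:
        # Router: ask the LLM to classify the query and decide if a tool call is needed.
        decision = call_llm(
            "Classify this query: which sport is it about, and is data beyond your "
            f"training cutoff needed to answer it?\nQuery: {user_query}"
        )
        if "needs_external_data=yes" in decision and "cricket" in decision:
            return TOOLS["cricket_news"](user_query)  # tool call chosen by the router
        return []  # the LLM can answer from its own training data

    print(route("Who won the ICC Men's World Cup 2024?"))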

Augmentation Stage

  • Once the data is fetched, the next step is to craft a more specific and enriched prompt for querying the LLM. This refined prompt provides the LLM with the necessary context to generate a more accurate response.

  • The final prompt may look like this (Note: the format can vary depending on the framework or LLM used to build the application):
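One framework-agnostic illustration (this wording is ours, not taken from any specific library):

    System: You are a sports assistant. Answer the user's question using only the context provided below.
    Context: "India lifts the ICC Men's World Cup Trophy 2024 after 14 years" (retrieved 2024 sports headline)
    Question: Who won the ICC Men's World Cup 2024?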



Generation Phase

This is the final stage, where the AI agent generates the final output for the end user.
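As an illustration, the generation call can be as simple as the sketch below, which assumes the OpenAI Python SDK; any other LLM API or self-hosted model would work the same way, and the model name is only an example.

    # Minimal sketch of the generation step, assuming the OpenAI Python SDK
    # is installed and OPENAI_API_KEY is set in the environment.
    from openai import OpenAI

    client = OpenAI()

    augmented_prompt = (
        "Context: India lifts the ICC Men's World Cup Trophy 2024 after 14 years.\n"
        "Question: Who won the ICC Men's World Cup 2024?\n"
        "Answer using only the context above."
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name; use whichever model you have access to
        messages=[{"role": "user", "content": augmented_prompt}],
    )
    print(response.choices[0].message.content)  # expected: an answer naming India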



Are we playing the role of Google?

In the past, we relied on Google for search tasks, manually filtering through results to find relevant pages. Similarly, LLM applications need a level of human programmatic intervention when they hit their context limitations. LLM agents still need to search external databases and enrich themselves with the right information. In that sense, the "search" component hasn’t disappeared; it has just been integrated into the process and made more generic, allowing AI agents to handle it in the background.

With Google, we could specify what to search for directly. However, in the case of RAG applications, we have to decide where to search, how much information to gather, and what to extract from the search results.

This process is more of an art than a science. We are learning and refining our approach as we go. Here are a few guidelines that have helped us so far, which we will explore in greater detail in an upcoming post.

Technology Landscape

With the increased adoption of AI and the emergence of tools for building AI-based applications, there are many products and frameworks in AI's technology landscape. This section lists a few of those categories and some entries in each category. While not exhaustive, this overview provides insight into what comprises RAG frameworks, vector databases, and end-user products.

  1. End-user LLM Models

    1. OpenAI’s GPT

    2. Anthropic’s Claude

    3. Google’s Gemini/Gemma

  2. End-user Text-to-image/video generation products

    1. OpenAI’s DALL-E

    2. Midjourney

    3. Stability AI’s Stable Diffusion

  3. End-user Code generation products

    1. GitHub Copilot

    2. Amazon CodeWhisperer

  4. Model and Dataset Registries and Hosting Solutions

    1. HuggingFace (Model Registry, Data Registry and Model Inferencing via Transformers)

    2. Stanford's Alpaca

    3. Databricks' Dolly

  5. Inferencing Servers

    1. Ollama Inference Serving

    2. Nvidia Triton

    3. TensorFlow Serving/TFX

    4. Ray Serve

  6. RAG frameworks

    1. LangChain

    2. LlamaIndex

    3. Microsoft Autogen

  7. Vector databases

    1. Pinecone

    2. Qdrant

    3. ChromaDB

    4. Milvus

Oversimplified statements

Given our understanding so far, let’s make a few oversimplified statements to drive the point home:

  • ChatGPT responses work on automated prediction.

  • RAG is automated prompt engineering with the ability to fetch data from external sources.

While there are definitely nuances to both statements, and they are not entirely accurate, they offer a simplified way to grasp the core ideas. Just like one might visualize Google searches as “looking up an inverted index of terms and returning pages that match them,” these statements give a very high-level overview of these technologies.


Next steps

In this post, we explored the building blocks of a Generative AI application. In the next part of this series, we will dive deeper into the internals, workings, and limitations of an actual RAG application. Subscribe and stay tuned for more insights on the One2N blog!
