Retrieval Augmented Generation (RAG) Architecture based on AWS

Nov 30, 2023

In today’s fast-paced world, the evolution of GenAI is taking place at an unprecedented rate. We are witnessing a continuous influx of novel GenAI applications, with virtually every industry wholeheartedly embracing this transformative technology. In light of this dynamic landscape, let’s understand one important type of architectural design for GenAI applications: the RAG architecture.

What is RAG architecture?

RAG, short for Retrieval Augmented Generation, is an architecture used in natural language processing (NLP) and information retrieval tasks. It comprises two components that work together: the Retrieval Part and the Generation Part. The Retrieval Part extracts relevant information from large data repositories, while the Generation Part uses this retrieved knowledge to generate responses tailored to user inputs.

With this understanding, let us build the architecture using the approach below:

1 — We will first understand the basic building blocks of the RAG architecture.

2 — We will then map the architecture to AWS services.

Basic RAG architecture and its components

The components of the architecture are numbered and explained below:

1 — Users log in to the application.

2 — The application code resides on a web server and features a chat interface. A user initiates a request, often in the form of a prompt, and this request is transformed into a vector representation by an embedding model.

3 — This is the data storage layer. It contains various forms of unstructured data, including files, images, videos, documents, tables, and CSV files. These files are indexed and then transformed into vectors using an embedding model.

4 — This is the retrieval model. It uses the user prompt to search for the most relevant files/data and returns this information.

5 — This is the generation model. It uses the relevant files/data from the retrieval model along with the user prompt to generate the response, which is then sent back to the chat interface.
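The flow through components 2 to 5 can be sketched in a few lines of plain Python. This is only a toy illustration under stated assumptions: the `embed` function below is a simple word-count vector standing in for a real embedding model, `DOCS` is a made-up document set, and the final call to a generation model is left out (the sketch stops at building the augmented prompt).

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(prompt: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Retrieval step: rank documents by similarity to the prompt vector."""
    query_vec = embed(prompt)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:top_k]


def build_augmented_prompt(prompt: str, context: list[str]) -> str:
    """Augmentation step: combine retrieved context with the user prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Use only the context below to answer.\n\nContext:\n{joined}\n\nQuestion: {prompt}"


# Hypothetical documents standing in for the data storage layer.
DOCS = [
    "Employees accrue 25 days of paid vacation per year.",
    "The cafeteria is open from 8 am to 3 pm on weekdays.",
    "Expense reports must be submitted within 30 days.",
]
```

Calling `retrieve("How many vacation days do employees get?", DOCS)` picks the vacation policy sentence, and `build_augmented_prompt` produces the text that would be handed to the generation model.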

What are the advantages of a RAG architecture?

1 — Reduces hallucination and misinformation by grounding generation in retrieved content. Because RAG can cite the sources used to generate a response, it also improves trust in that response.

2 — A cost-effective alternative to fine-tuning a model: RAG can leverage an existing model and augment it with a retrieval mechanism, saving the company both time and resources.

3 — Ability to generate domain-specific responses: since RAG lets users tap into domain-specific knowledge bases, the responses reflect that domain knowledge.

4 — Building a RAG architecture in your own cloud account helps keep the information used within that account, and the responses generated are specific to the information you provide.

Because of these advantages, RAG applications are quite common, ranging from intelligent search engines and chatbots within a company to virtual assistants in an automobile.

Having understood the components, advantages, and applications of the RAG architecture, we can now look at one approach to building it on AWS.

Let us look at the different services one can use to build a RAG architecture with AWS.

Amazon Kendra: This is an enterprise search service powered by machine learning. It allows developers to add search capabilities to their applications so that users can discover information stored across the vast amount of data spread throughout the company. One can use this service for the retrieval model.
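As a rough sketch of how Kendra could serve as the retrieval model, the function below calls the Kendra `retrieve` operation through a boto3 client and keeps the fields a RAG prompt typically needs. The function name, the returned dictionary keys, and the index ID are assumptions made for illustration; the client is passed in (for example `boto3.client("kendra")`) so the sketch itself makes no AWS calls on import.

```python
def retrieve_passages(kendra_client, index_id: str, query: str, top_k: int = 3) -> list[dict]:
    """Ask Kendra for the passages most relevant to the query.

    kendra_client is expected to be boto3.client("kendra"); index_id is the
    ID of an existing Kendra index (hypothetical in this sketch).
    """
    response = kendra_client.retrieve(
        IndexId=index_id,
        QueryText=query,
        PageSize=top_k,  # number of passages to return
    )
    # Keep only the fields needed to build context and cite sources.
    return [
        {
            "title": item.get("DocumentTitle", ""),
            "uri": item.get("DocumentURI", ""),
            "text": item.get("Content", ""),
        }
        for item in response.get("ResultItems", [])
    ]
```

Keeping the document title and URI alongside the passage text is what makes it possible to cite sources in the generated answer.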

Amazon Bedrock, Anthropic Claude:
Bedrock is a serverless service that offers a choice of high-performing foundation models (FMs) through a single API for building generative AI applications, simplifying development while maintaining privacy and security.

Anthropic Claude is a state-of-the-art foundation model provided by the company Anthropic. With Claude, one can securely process large inputs of up to 100,000 tokens, or approximately 75,000 words. Claude can be used across a wide range of tasks such as text summarization, question answering, creative content generation, coding, and instruction writing. One can use Claude through Bedrock as the generation model.
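A minimal sketch of calling Claude through Bedrock might look like the following. It assumes the `anthropic.claude-v2` model ID and the text-completion request format (`prompt` with `Human:`/`Assistant:` turns and `max_tokens_to_sample`) that Bedrock exposed for Claude 2; newer Claude models use a different message-based request body, so treat the payload shape as an assumption to verify against the current Bedrock documentation. The function name and prompt wording are illustrative, and the client is passed in (for example `boto3.client("bedrock-runtime")`).

```python
import json


def ask_claude(bedrock_runtime, question: str, context: str,
               model_id: str = "anthropic.claude-v2", max_tokens: int = 300) -> str:
    """Send the question plus retrieved context to Claude via Bedrock."""
    # Claude 2 text-completion format: a Human turn followed by an open Assistant turn.
    prompt = (
        f"\n\nHuman: Answer using only this context:\n{context}"
        f"\n\nQuestion: {question}\n\nAssistant:"
    )
    body = json.dumps({
        "prompt": prompt,
        "max_tokens_to_sample": max_tokens,
        "temperature": 0.0,  # favor deterministic, grounded answers
    })
    response = bedrock_runtime.invoke_model(
        modelId=model_id,
        body=body,
        contentType="application/json",
        accept="application/json",
    )
    # The response body is a stream; read and decode it as JSON.
    payload = json.loads(response["body"].read())
    return payload["completion"].strip()
```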

Amazon S3: This is a cloud object storage service provided by AWS. One can use this service as the storage layer.
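To illustrate how a new document could enter the storage layer and become searchable, the sketch below uploads a file to S3 and then starts a Kendra data source sync job, which is the step at which Kendra indexes the S3 content. The function name and all identifiers (bucket, key, index ID, data source ID) are hypothetical; the clients are passed in (for example `boto3.client("s3")` and `boto3.client("kendra")`), and the S3 data source is assumed to be configured already.

```python
def upload_and_sync(s3_client, kendra_client, local_path: str, bucket: str, key: str,
                    index_id: str, data_source_id: str) -> str:
    """Upload a document to S3, then ask Kendra to re-index the S3 data source.

    Returns the execution ID of the sync job so its progress can be checked later.
    """
    # Put the file into the bucket that backs the Kendra data source.
    s3_client.upload_file(local_path, bucket, key)

    # Trigger indexing; the sync runs asynchronously on the Kendra side.
    response = kendra_client.start_data_source_sync_job(
        Id=data_source_id,
        IndexId=index_id,
    )
    return response["ExecutionId"]
```

Because the sync job is asynchronous, a newly uploaded file only becomes retrievable once the job has finished.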

AWS Amplify: This is a serverless service that lets users quickly deploy the frontend of web applications. One can use this service to host the web app frontend.

LangChain: This is a framework for building applications powered by large language models. One can use this framework to connect to the LLM.

DynamoDB: It is a fully managed serverless NoSQL database offered by AWS. One can use this service to store the conversation history.

The steps in the architecture are numbered and explained below:

1 — Users send a prompt to a web application deployed using Amplify. The source code of the application is located in a repository such as GitHub. The web application has a chat interface that allows the user to send a prompt.

2 — The user prompt is sent to Amazon Kendra, which converts it into an embedding vector using an embedding model.

3 — This is the storage layer. It contains unstructured data such as files, images, videos, documents, tables, and CSV files.

4 — The files stored on S3 are indexed and represented as vectors using an embedding model from the Kendra service every time the user runs a sync job. Kendra then performs a semantic search over the indexed content using the user prompt and returns the knowledge (relevant context), that is, the most relevant data or documents.

5 — A Lambda function acts as the LangChain orchestrator. It passes the knowledge (relevant context) from Kendra to the LLM (e.g. Anthropic Claude). The LLM uses the relevant context along with the user prompt to generate a response, which is sent back to the chat user interface hosted with AWS Amplify.

6 — A copy of the response is also stored in the DynamoDB database as conversation history. This helps the LLM build context when generating responses to follow-up prompts.
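Steps 2 to 6 can be sketched as a single Lambda handler. The article uses LangChain as the orchestrator; to stay self-contained, this sketch uses plain Python callables instead, so it shows the shape of the orchestration rather than the LangChain API. Everything named here is an assumption: the event is taken to be an API Gateway style request with a JSON `body` containing `session_id` and `prompt`, and `retrieve`, `generate`, and `save_turn` are hypothetical functions that would wrap Kendra, Bedrock, and DynamoDB respectively. Saving the user prompt alongside the response is also an assumption of this sketch.

```python
import json
from typing import Callable


def make_handler(retrieve: Callable[[str], list[str]],
                 generate: Callable[[str], str],
                 save_turn: Callable[[str, str, str], None]):
    """Build a Lambda handler from injected retrieval, generation, and storage functions."""

    def handler(event, context):
        # Parse the request body sent by the chat interface (assumed shape).
        body = json.loads(event.get("body") or "{}")
        session_id = body.get("session_id", "anonymous")
        user_prompt = body.get("prompt", "").strip()
        if not user_prompt:
            return {"statusCode": 400, "body": json.dumps({"error": "prompt is required"})}

        # Retrieval: fetch the relevant context for the prompt.
        passages = retrieve(user_prompt)

        # Augmentation: combine the context with the user prompt.
        context_block = "\n".join(passages)
        augmented = f"Context:\n{context_block}\n\nQuestion: {user_prompt}"

        # Generation: ask the LLM for an answer grounded in the context.
        answer = generate(augmented)

        # History: store both turns so follow-up prompts have context.
        save_turn(session_id, "user", user_prompt)
        save_turn(session_id, "assistant", answer)

        return {"statusCode": 200, "body": json.dumps({"answer": answer})}

    return handler
```

Injecting the three functions keeps the handler easy to test locally with stand-ins before wiring in the real AWS clients.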

Conclusion:

This article serves as a guide to understanding the RAG architecture, exploring its applications and benefits, and providing insight into its implementation on AWS. It is worth noting that AWS continually evolves and introduces new services that can further enrich your generative AI experience, so the architecture may vary over time and with specific use cases.

References used to build the architecture:

1 — Kendra to S3:
https://docs.aws.amazon.com/kendra/latest/dg/data-source-s3.html

2 — Lambda to Bedrock:
https://repost.aws/articles/ARixmsXALpSWuxI02zHgv1YA/bedrock-unveiled-a-quick-lambda-example

3 — Deploying web apps using Amplify:
https://aws.amazon.com/getting-started/guides/deploy-webapp-amplify/#:~:text=Amplify%20is%20a%20framework%20with,stack%20applications%2C%20powered%20by%20AWS


Written by Shabarish PILKUN RAVI