An AWS architecture for processing real-time and batch data and serving dashboards

Shabarish PILKUN RAVI
6 min read · Jul 16, 2022


Context:

Let us assume that you have a website that sells articles. You want to study customer behavior by analyzing click-stream data and presenting the results on both a real-time dashboard and a batch-processing dashboard. In this case, the architecture explained below can help you.

Information about the website:

  • Users access the website via a mobile application or a web browser (both are possible input sources for understanding customer behavior).
  • Different microservices will connect to the data lake, so its endpoints must be compatible with the various data services.

Data Lake Definition:

A data lake is a central repository where one can store both structured and unstructured data at any scale. One can use this repository to perform different types of analytics, such as dashboards, big data processing, real-time analytics, and machine learning.
[source: What is a data lake? (amazon.com)]

The System Architecture Diagram:

The system architecture diagram is as presented below:

System Architecture diagram

The architecture shown above can be explained in 3 parts:

  • Part 1: Ingestion
  • Part 2: Transform
  • Part 3: Load

Part 1: Ingestion

The data from the mobile application and the website is collected in real time using Kinesis Data Firehose, and this raw data is stored in an S3 bucket.

Reasons to use Kinesis Data Firehose:

  • Amazon Kinesis Data Firehose is a fully managed service for delivering real-time streaming data to destinations such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon OpenSearch Service, or any custom HTTP or HTTPS endpoint.
  • A record is the data of interest that your data producer sends to a Kinesis Data Firehose delivery stream. A record can be as large as 1,000 KB.
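To make the producer side concrete, here is a minimal sketch of serializing one click-stream event; the field names and the delivery stream name are assumptions for illustration, and the actual boto3 call is shown only in a comment:

```python
import json
import datetime

def build_click_event(user_id: str, page: str, action: str) -> bytes:
    """Serialize one click-stream event as newline-delimited JSON,
    a common format for records sent to a Firehose delivery stream."""
    event = {
        "user_id": user_id,
        "page": page,
        "action": action,
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    # Firehose concatenates records, so a trailing newline keeps
    # events separable once they land in S3.
    return (json.dumps(event) + "\n").encode("utf-8")

# With boto3, the producer would then call (names are placeholders):
# boto3.client("firehose").put_record(
#     DeliveryStreamName="clickstream-delivery-stream",
#     Record={"Data": build_click_event("u-42", "/articles/1", "click")},
# )
```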

Reasons to use S3 to store the Raw Data:

  • S3 is an object storage service that offers high scalability, availability, performance, and security. One can store huge amounts of data securely on S3; thus S3 will serve as our data lake.

Part 2: Transform

In the transform part we use the Amazon Kinesis Data Analytics and Kinesis Data Firehose services. The Kinesis Data Analytics service is used to run SQL queries on the streaming data coming from the ingestion Kinesis Data Firehose stream.

We then enable transformations in Kinesis Data Firehose to further apply format conversions, transformations, dynamic partitioning, and compression to the incoming data from Kinesis Data Analytics. This transformed, compressed, partitioned data is stored in the clean bucket.
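Firehose record transformation is typically implemented with a Lambda function that receives base64-encoded records and must return each one with a recordId, a result status, and the re-encoded data. A minimal sketch (the lowercasing of the action field is just an illustrative clean-up step):

```python
import base64
import json

def lambda_handler(event, context):
    """Firehose data-transformation handler: decode each record,
    apply a small transformation, and re-encode it."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["action"] = payload.get("action", "").lower()  # illustrative clean-up
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(
                json.dumps(payload).encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```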

Why choose Amazon Kinesis Data Analytics:

  • Kinesis Data Analytics allows you to process and analyze streaming data using standard SQL.
  • The service enables you to quickly author and run powerful SQL code against streaming sources to perform time-series analytics, feed real-time dashboards, and create real-time metrics.
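As a sketch of such time-series analytics, a Kinesis Data Analytics application counting clicks per page over ten-second tumbling windows might look like the following. The application is written in streaming SQL, not Python; it is held in a string here only so the snippet is self-contained, and the stream and column names are assumptions:

```python
# Stream and column names below are placeholders; SOURCE_SQL_STREAM_001
# is the default name Kinesis Data Analytics gives the input stream.
KDA_QUERY = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    page        VARCHAR(128),
    click_count INTEGER
);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS
    INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM "page", COUNT(*) AS click_count
    FROM "SOURCE_SQL_STREAM_001"
    -- ten-second tumbling window for the real-time metric
    GROUP BY "page",
             STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '10' SECOND);
"""
```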

This clean data is stored in an S3 bucket containing clean data.

Please note: One could also use AWS Glue to perform the transform step if we are not transforming data in real time.

Reasons to use AWS glue for transformation:

  • AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores and data streams.
  • AWS Glue is a popular choice for ingesting data as a batch process. The Glue crawler enables data in many different formats to be processed. Processing power is provided by a Spark cluster, and Python or Scala provide programming flexibility.

Part 3: Load

The load part of the pipeline can be further split into 2 parts, real-time processing and batch processing.

Real-Time Processing

Every time new data is available after transformation in the Kinesis Data Firehose section, a Lambda function can be triggered to put the new data into the DynamoDB database.

Using DynamoDB Streams, every time new data is available we can trigger an update so that the new data is reflected through the content delivery network CloudFront, which delivers it to a dashboard built with a framework such as Angular, Django, Flask, React, etc. One can also use QuickSight as a dashboard service.
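A Lambda function subscribed to the table's stream receives records in DynamoDB's attribute-value format. A minimal sketch of unwrapping new items into plain dicts for the dashboard (field names are assumptions, and only the S and N types are handled):

```python
def deserialize_image(image: dict) -> dict:
    """Convert a DynamoDB Streams NewImage (attribute-value format,
    e.g. {"page": {"S": "/articles/1"}}) into a plain Python dict.
    Only the S and N attribute types are handled in this sketch."""
    out = {}
    for key, typed_value in image.items():
        (dynamo_type, value), = typed_value.items()
        if dynamo_type == "N":
            # DynamoDB numbers arrive as strings
            out[key] = float(value) if "." in value else int(value)
        else:  # "S" and anything else kept as-is
            out[key] = value
    return out

def stream_handler(event, context):
    """Collect newly inserted items so they can be pushed onward
    to the real-time dashboard."""
    return [
        deserialize_image(record["dynamodb"]["NewImage"])
        for record in event["Records"]
        if record["eventName"] == "INSERT"
    ]
```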

Why choose Dynamodb:

  • Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability.
  • With DynamoDB, you can create database tables that can store and retrieve any amount of data and serve any level of request traffic. You can scale up or scale down your tables’ throughput capacity without downtime or performance degradation.

Why choose Lambda:

  • Lambda is a serverless compute service that lets you run code without provisioning or managing servers.
  • One can build data-processing triggers for AWS services such as Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB.

Batch Processing

Once the data is transformed in the Transform step, it is stored in the clean bucket as shown in the figure. We will also create a Glue crawler on the clean bucket to build a data catalog and tables; these Glue catalog tables can then be queried from various endpoints.
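The crawler picks up partitions automatically when the clean bucket uses Hive-style key prefixes. A small sketch of generating such a key (the bucket prefix and file naming are assumptions):

```python
import datetime

def clean_bucket_key(event_time: datetime.datetime, event_id: str) -> str:
    """Build a Hive-style partitioned S3 key (year=/month=/day=) so a
    Glue crawler registers year, month, and day as table partitions."""
    return (
        f"clickstream/year={event_time:%Y}/month={event_time:%m}/"
        f"day={event_time:%d}/{event_id}.json.gz"
    )
```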

Once the data is cataloged, one can load it to various endpoints. In this architecture I have listed three endpoints, used to run OLAP processes, create and deploy machine learning models, or simply run SQL queries on the dataset. One can also visualize the data with AWS QuickSight.

Amazon Redshift:

  • Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. Amazon Redshift offers fast query performance using the same SQL-based tools and business intelligence applications that you use today. This enables you to use your data to acquire new insights for your business and customers.

Amazon Athena:

  • Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon Simple Storage Service (Amazon S3) using standard SQL.
  • Athena is serverless, so there is no infrastructure to set up or manage, and you pay only for the queries you run.

Amazon Sagemaker:

  • Amazon SageMaker is a fully managed machine learning service.
  • Data scientists and developers can quickly and easily build and train machine learning models, and then directly deploy them into a production-ready hosted environment. It also enables users to follow MLOps practices.
  • It provides an integrated Jupyter authoring notebook instance for easy access to your data sources for exploration and analysis, so you don’t have to manage servers.

Amazon Quicksight:

  • It is a serverless AWS data visualization service.
  • As a fully managed cloud-based service, Amazon QuickSight provides enterprise-grade security, global availability, and built-in redundancy. It also provides the user-management tools that you need to scale from 10 users to 10,000, all with no infrastructure to deploy or manage.

Conclusion:

In this article, I have presented an architecture for a data lake system that can support different users (data engineers, product owners, stakeholders, data scientists, BI engineers, and executive managers) for different use cases.

The services used in this architecture are highly scalable and fully managed, so we don't need to worry about setting up or managing infrastructure.

A single architecture was presented to support both a real-time use case and a batch-processing use case; however, one could also separate it into two independent architectures, each supporting one data-processing type.

Other Articles:

If you are interested in learning about other AWS services, please look at my other articles:

1 — Cloud Computing: An Overview. In this section, I will try to explain… | by Shabarish PILKUN RAVI | Towards AWS

2 — Amazon Web Services [AWS]: An Overview | by Shabarish PILKUN RAVI | Towards AWS

3 — Amazon Simple Storage Service S3, an Introduction | by Shabarish PILKUN RAVI | Towards AWS

4 — Route 53 — The highly available and scalable cloud Domain Name System (DNS) web service. | by Shabarish PILKUN RAVI | Towards AWS

If you are interested in deep learning and computer vision, refer to the articles below:

1 — OpenCV Background Subtraction and Music in Background of Video | by Shabarish PILKUN RAVI | Medium

2 — Artificial Neural Networks Simplified: From Perceptrons to BackPropagation | by Shabarish PILKUN RAVI | The Startup | Medium

3 — Traditional Recurrent Neural Networks — Reinforcement Learning Part 1/3 | by Shabarish PILKUN RAVI | Towards AI
