Designing a Data Lake Solution on AWS

Uviekugbere Theophilus
5 min read · Mar 3, 2023


In today’s digital age, companies generate vast amounts of data, and it is becoming increasingly important to store, manage, and analyze that data efficiently. A data lake is a solution that can store all types of data (structured, semi-structured, or unstructured) in its native format, enabling organizations to analyze the data and gain insights into their business. In this article, we will discuss how to build a data lake solution on AWS using various AWS services.

Data Sources

The first step in building a data lake solution is to identify the data sources. In our scenario, the data sources are:

  1. IoT sensors that send real-time data
  2. A database with historical transactional records
  3. Supplemental data from third-party entities for enriching internally generated data

Ingesting Data from IoT Sensors

The first data source we will look at is IoT sensors. AWS offers a variety of services that help organizations leverage IoT devices to their full potential, enabling them to securely connect, manage, and analyze data from IoT devices at scale. AWS IoT Core is a managed cloud service that enables devices to securely connect and communicate with cloud applications and other devices. It provides rules-based routing, message brokering, and device management features, making it easier to manage and analyze IoT data.
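To illustrate the rules-based routing mentioned above, here is a minimal sketch (using boto3) of an AWS IoT topic rule that forwards incoming sensor messages to a Kinesis Data Firehose delivery stream. The topic filter, rule name, stream name, and IAM role ARN are all illustrative assumptions, and the delivery stream itself is created in the next step.

```python
import boto3

iot = boto3.client("iot", region_name="us-east-1")

# Hypothetical names; replace with your own resources.
iot.create_topic_rule(
    ruleName="SensorToFirehoseRule",
    topicRulePayload={
        # Select every message published on sensors/<device-id>/telemetry
        "sql": "SELECT * FROM 'sensors/+/telemetry'",
        "awsIotSqlVersion": "2016-03-23",
        "actions": [
            {
                "firehose": {
                    # IAM role allowing IoT to write to Firehose (assumed to exist)
                    "roleArn": "arn:aws:iam::123456789012:role/iot-to-firehose-role",
                    "deliveryStreamName": "iot-sensor-stream",
                    "separator": "\n",  # newline-delimited records in S3
                }
            }
        ],
        "ruleDisabled": False,
    },
)
```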

AWS also offers services like AWS IoT Analytics, which lets you run advanced analytics on IoT data, and AWS IoT Greengrass, which extends AWS to edge devices, allowing them to act locally on the data they generate. AWS IoT SiteWise enables you to collect, store, and organize industrial equipment data from the edge to the cloud, and AWS IoT Things Graph lets you easily connect devices and services to build IoT applications. Furthermore, AWS provides IoT services for device management and security, such as AWS IoT Device Defender, which continuously monitors IoT device behavior to detect anomalies, and AWS IoT Device Management, which allows you to onboard, organize, monitor, and remotely manage IoT devices at scale.

We will use an Amazon Kinesis Data Firehose delivery stream to ingest the real-time data from the IoT sensors into an S3 bucket. The data is stored in its native format, which means it can be accessed easily without any additional processing.
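Below is a minimal sketch of creating such a delivery stream with boto3, assuming a hypothetical raw-data bucket and an IAM role that Firehose can assume to write to it (both names are illustrative).

```python
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

# Hypothetical bucket and role; replace with your own resources.
firehose.create_delivery_stream(
    DeliveryStreamName="iot-sensor-stream",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-to-s3-role",
        "BucketARN": "arn:aws:s3:::my-datalake-raw",
        "Prefix": "iot/raw/",  # landing prefix for raw sensor data
        "BufferingHints": {"IntervalInSeconds": 60, "SizeInMBs": 5},
        "CompressionFormat": "GZIP",
    },
)
```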

Migrating Data from the Database

The second data source is a database with historical transactional records. We will use AWS Database Migration Service (DMS) to migrate the data from the existing database to the data lake. DMS is a managed service that helps you migrate databases to AWS easily and securely. Once the data is migrated to S3, it is stored in its native format.
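As a rough sketch of how DMS lands data in the lake, the boto3 calls below create an S3 target endpoint and a full-load replication task. The source endpoint, replication instance, role, and bucket names are assumptions for illustration only.

```python
import json
import boto3

dms = boto3.client("dms", region_name="us-east-1")

# Target endpoint that writes migrated records to the raw S3 bucket (names are illustrative).
target = dms.create_endpoint(
    EndpointIdentifier="datalake-s3-target",
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "BucketName": "my-datalake-raw",
        "BucketFolder": "transactions",
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-to-s3-role",
    },
)

# Full-load task copying every table in the source schema (source endpoint and
# replication instance are assumed to exist already).
dms.create_replication_task(
    ReplicationTaskIdentifier="historical-full-load",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn=target["Endpoint"]["EndpointArn"],
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    MigrationType="full-load",
    TableMappings=json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "all-tables",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }]
    }),
)
```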

Ingesting Data from Third-Party Entities

The third data source is supplemental data from third-party entities for enriching internally generated data. We will use AWS Transfer Family (formerly AWS Transfer for SFTP) to transmit the data to the S3 bucket. AWS Transfer Family is a fully managed service that enables secure and easy transfer of files over SFTP directly into S3.
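A minimal sketch of standing up an SFTP endpoint with AWS Transfer Family via boto3 is shown below; the user name, IAM role, home directory, and public key are hypothetical placeholders.

```python
import boto3

transfer = boto3.client("transfer", region_name="us-east-1")

# SFTP server with service-managed users (no external identity provider).
server = transfer.create_server(
    Protocols=["SFTP"],
    IdentityProviderType="SERVICE_MANAGED",
)

# Partner user whose uploads land directly in the raw bucket (names are illustrative).
transfer.create_user(
    ServerId=server["ServerId"],
    UserName="partner-upload",
    Role="arn:aws:iam::123456789012:role/transfer-to-s3-role",
    HomeDirectory="/my-datalake-raw/third-party",
    SshPublicKeyBody="ssh-rsa AAAA... partner-key",  # partner's public key
)
```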

Data Cleansing and Cataloging

Once the data is ingested into the data lake, the next step is to cleanse and catalog it. We will use AWS Glue for this purpose. AWS Glue is a fully managed ETL service that makes it easy to prepare and move data between data stores. We will use an AWS Glue crawler, which reads the data from S3, infers the schema, and creates a table definition. The table definition is then saved to a database in the AWS Glue Data Catalog.
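For example, a crawler over the raw bucket can be created and run with boto3 as sketched below; the database, crawler, role, and S3 path names are assumptions.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Database that will hold the table definitions (illustrative name).
glue.create_database(DatabaseInput={"Name": "datalake_raw"})

# Crawler that scans the raw bucket, infers the schema, and registers tables.
glue.create_crawler(
    Name="raw-data-crawler",
    Role="arn:aws:iam::123456789012:role/glue-crawler-role",
    DatabaseName="datalake_raw",
    Targets={"S3Targets": [{"Path": "s3://my-datalake-raw/"}]},
)

glue.start_crawler(Name="raw-data-crawler")
```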

Source: https://aws.amazon.com/blogs/big-data/orchestrate-an-etl-pipeline-using-aws-glue-workflows-triggers-and-crawlers-with-custom-classifiers/

Storing the Data

An AWS Glue ETL job then reads the raw data using the table definitions in the Glue database, cleanses it, and writes it to an S3 bucket designated as “cleansed data.” This bucket stores the data after it has been cleansed and cataloged.
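A stripped-down AWS Glue (PySpark) job script along these lines might read the cataloged raw table, drop unusable records, and write Parquet to the cleansed bucket; the database, table, column, and bucket names are placeholders.

```python
# Sketch of a Glue ETL job script (runs inside AWS Glue, not locally).
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the raw table registered by the crawler (names are illustrative).
raw = glue_context.create_dynamic_frame.from_catalog(
    database="datalake_raw",
    table_name="iot_raw",
)

# Simple cleansing step: drop rows missing a device id or a reading.
df = raw.toDF().dropna(subset=["device_id", "reading"])
cleansed = DynamicFrame.fromDF(df, glue_context, "cleansed")

# Write the cleansed data to the "cleansed data" bucket as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=cleansed,
    connection_type="s3",
    connection_options={"path": "s3://my-datalake-cleansed/iot/"},
    format="parquet",
)
```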

Processing the Data

Next, we will process the data using Amazon EMR. Amazon EMR is a managed big data platform, based on open-source frameworks such as Apache Hadoop and Apache Spark, that can process large amounts of data. We will use Amazon EMR to process the data and write it to another S3 bucket, designated as “curated data.” This bucket stores the data after it has been processed.
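For illustration, a Spark step can be submitted to an existing EMR cluster with boto3 as in the sketch below; the cluster id, script location, and bucket paths are all assumed.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Submit a Spark job to an existing cluster (id and paths are illustrative).
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[
        {
            "Name": "curate-cleansed-data",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "s3://my-datalake-scripts/curate.py",
                    "--input", "s3://my-datalake-cleansed/",
                    "--output", "s3://my-datalake-curated/",
                ],
            },
        }
    ],
)
```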

Querying the Data

Finally, we will use Amazon Athena to query the data in the S3 bucket designated as “curated data.” Amazon Athena is a serverless, interactive query service that enables analysis of data in S3 using SQL. With Athena, you can analyze data quickly and easily without having to manage any infrastructure.
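Here is a minimal sketch of running a query against the curated data with boto3; the database, table, column names, and result location are hypothetical. Athena runs queries asynchronously, so the sketch polls for completion before fetching results.

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start a query against the curated table (names are illustrative).
query = athena.start_query_execution(
    QueryString=(
        "SELECT device_id, avg(reading) AS avg_reading "
        "FROM sensor_curated GROUP BY device_id"
    ),
    QueryExecutionContext={"Database": "datalake_curated"},
    ResultConfiguration={"OutputLocation": "s3://my-datalake-athena-results/"},
)

# Poll until the query finishes, then fetch the result rows.
execution_id = query["QueryExecutionId"]
while True:
    status = athena.get_query_execution(QueryExecutionId=execution_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=execution_id)["ResultSet"]["Rows"]
```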

Data Visualization

To visualize the analysis, we will use Kibana, an open-source data visualization platform that enables you to create visualizations, dashboards, and reports. We will integrate Kibana with the results of our Amazon Athena queries to display the analysis visually.

In conclusion, a data lake solution on AWS can help organizations effectively manage and analyze vast amounts of data from multiple sources. In this project, we designed a data lake solution on AWS that ingests data from three different sources (IoT sensors, a database with historical records, and third-party data) using an Amazon Kinesis Data Firehose delivery stream, AWS DMS, and AWS Transfer Family (SFTP), respectively. We then used AWS Glue to cleanse and transform the data, and created a table using an AWS Glue crawler, which saved the table definition to a Glue database.

The cleansed data was then stored in an S3 bucket designated as the “cleansed data” bucket and processed by Amazon EMR, a managed Apache Hadoop-based platform, before being moved to another S3 bucket designated as the “curated data” bucket. We then used Amazon Athena to query the data in the “curated data” bucket and Kibana to visualize the results.

Overall, this project demonstrates how AWS services can be used to build a robust and scalable data lake solution that effectively ingests, cleanses, catalogs, and analyzes large volumes of data from diverse sources. With AWS services such as Amazon S3, AWS Glue, Amazon EMR, and Amazon Athena, organizations can easily and efficiently derive valuable insights from their data and make data-driven decisions.


Uviekugbere Theophilus

I am a Cloud/DevOps engineer with sound knowledge and technical experience in automation, build, and deployment processes for operational excellence.