Building a solid cloud dataflow with AWS

When our clients asked us to move their companies' entire data flow to Amazon Web Services (AWS) and to replace all of their infrastructure with cloud-based technologies, it was a wonderful challenge for OptimalBit LLC: build a solid, flexible architecture capable of handling a large flow of data (more than 20 GB daily) and generate reports that turn this data into useful information in the shortest possible time.

Before starting work, it's worth asking yourself some questions:

  • Why did our clients want to change their current architecture?
  • What are the advantages of migrating to cloud services (specifically AWS)?
  • Can we build a solution that is robust and flexible enough for this volume and variety of data while also ensuring its integrity?
  • How can this new solution impact the efficiency of the process?

After thorough research, we arrived at the following answers:

Why change?

  1. Decrease costs (the deal breaker). Like any company that wants to keep growing, you always have to look for ways to increase profits, and that can only be achieved by:
  • Increasing revenues
  • Decreasing costs

Our task was “Decreasing costs” (we had already taken care of “Increasing revenues”. How? Here: https://www.optimalbitsoftware.com/ 😉).

  2. Upgrade the architecture and processes.
  3. Minimize the time between the arrival of data from any source and the generation of the corresponding reports.

Why AWS?

AWS explains that even better than we could 😄:

https://aws.amazon.com/application-hosting/benefits/?nc1=h_ls

https://aws.amazon.com/application-hosting/

Our Solution

The proposed solution is based on our clients' needs, but it is general enough to apply to almost any company. The first problem we faced was the large number of sources through which data could arrive: for example, data from Google, Bing, and Facebook came through emails or APIs, while for other companies it was easier to simply give us access to their FTP/SFTP server.

For this reason, we decided to integrate with the sources using AWS Lambda (https://aws.amazon.com/lambda/), which allowed us to use any of the programming languages supported by Lambda (Node.js, Python, Ruby, Java, Go, C#, PowerShell) to handle the data depending on the source. We worked on the idea that the Lambda's only job was to receive the incoming data and save it into AWS S3 (Simple Storage Service, https://aws.amazon.com/s3/); any further data processing would be done later with specialized tools. This way, the data lake keeps the raw data obtained from the sources at all times, and the cost of Lambda executions stays as low as possible, since they run with the minimum possible resources.
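As an illustration, here is a minimal Python sketch of that ingestion idea, assuming an API-style source that invokes the function with a JSON payload; the bucket name and key layout are hypothetical:

    import json
    from datetime import datetime, timezone

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-company-data-lake"  # hypothetical bucket name

    def handler(event, context):
        # Store the raw payload untouched; all cleaning happens later in Glue.
        key = "raw/source=example-api/{}.json".format(
            datetime.now(timezone.utc).strftime("%Y/%m/%d/%H%M%S%f")
        )
        s3.put_object(
            Bucket=BUCKET,
            Key=key,
            Body=json.dumps(event).encode("utf-8"),
        )
        return {"statusCode": 200, "body": json.dumps({"stored": key})}

Keeping the function this small is what keeps execution costs down: it does no parsing or validation, just a single write to the data lake.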

Once the data was stored in AWS S3, we were ready to clean and process it and generate reports. For this task, Amazon offers a simple yet powerful service, AWS Glue (https://aws.amazon.com/glue/). The data processing was divided into the following steps:

First, we classify the data using a crawler; this creates a Data Catalog table with the structure of the data being analyzed and creates the partitions that will later be used in the ETL (extract, transform, load). The data is then processed, applying the transformations defined in the ETL and persisting the information relevant to the reports in AWS Redshift (https://aws.amazon.com/redshift/) and AWS RDS (Relational Database Service, https://aws.amazon.com/rds/). AWS Glue allowed us to use the benefits of PySpark (https://spark.apache.org/docs/latest/api/python/index.html) for data management. The most frequent transformations that we faced (sketched in code after this list) were:

    • Change the date format.
    • Delete signs like “$” or “%”.
    • Cast to another data type.
    • Extract information from a field to create one or more new fields.
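Here is a minimal PySpark sketch of those four transformations. It runs locally for illustration; in a real Glue job the DataFrame would come from the Data Catalog instead, and the column names (spend, pct, event_date, full_name) are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

    df = spark.createDataFrame(
        [("$1,240.50", "12.5%", "03/15/2021", "Doe, John")],
        ["spend", "pct", "event_date", "full_name"],
    )

    df = (
        df
        # Change the date format: parse MM/dd/yyyy strings into a date column.
        .withColumn("event_date", F.to_date("event_date", "MM/dd/yyyy"))
        # Delete signs like "$" or "%", then cast to a numeric type.
        .withColumn("spend", F.regexp_replace("spend", "[$,]", "").cast("double"))
        .withColumn("pct", F.regexp_replace("pct", "%", "").cast("double"))
        # Extract information from one field to create new fields.
        .withColumn("last_name", F.split("full_name", ", ").getItem(0))
        .withColumn("first_name", F.split("full_name", ", ").getItem(1))
    )

    df.show()

The cleaned DataFrame is then what gets persisted to Redshift or RDS in the final step of the job.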


This entire process was monitored using AWS CloudWatch (https://aws.amazon.com/cloudwatch/) and Glue Workflows. In addition, for notifications we built a system based on AWS SNS (https://aws.amazon.com/sns/) and an AWS Lambda function that connects to Slack, giving our customers the ability to know in real time when one of the processes had a problem. Once the data was stored in the database, it was time to create wonderful reports showing how much the company is growing; many new metrics were obtained and plotted to help plan the company's next steps.
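A minimal sketch of that SNS-to-Slack bridge could look like the following, assuming a standard Slack incoming webhook on the receiving end; the environment variable name is a placeholder:

    import json
    import os
    import urllib.request

    SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]  # hypothetical config

    def handler(event, context):
        # Each SNS record carries one message published to the alerts topic.
        for record in event["Records"]:
            message = record["Sns"]["Message"]
            payload = json.dumps({"text": f":warning: Pipeline alert: {message}"})
            req = urllib.request.Request(
                SLACK_WEBHOOK_URL,
                data=payload.encode("utf-8"),
                headers={"Content-Type": "application/json"},
            )
            urllib.request.urlopen(req)

Routing alerts through an SNS topic rather than calling Slack directly from each job means any new consumer (email, PagerDuty, another Lambda) can subscribe later without touching the pipeline itself.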

If you belong to a company that wants to grow and improve its business, do not hesitate: check out our services at https://www.optimalbitsoftware.com/ or contact us directly.

📧 contact@optimalbitsoftware.com

📱 (302) 786-5532
