Matthieu Lienart
Cloud Engineer, Ostermundigen

#knowledgesharing #level300

Near Real Time Data Ingestion into SageMaker Feature Store

The Objective

This blog post is the first part of a three-part series about testing a fully automated MLOps pipeline for machine learning predictions on near real-time time series data in AWS. This part focuses on the data ingestion pipeline into Amazon SageMaker Feature Store.

The entire demo code is publicly available in the project repository on GitHub.

SageMaker Feature Store

For this demo, we have chosen to use the Amazon SageMaker Feature Store as the final repository of the data ingestion pipeline. As per the documentation:

“Amazon SageMaker Feature Store is a fully managed, purpose-built repository to store, share, and manage features for machine learning (ML) models. Features are inputs to ML models used during training and inference.”

The Demo

The demo ingests blockchain transactions from the blockchain.com API (see documentation here). Based on the data it ingests, the pipeline computes and stores 3 simple metrics in Amazon SageMaker Feature Store:

  • The total number of transactions
  • The total amount of transaction fees
  • The average amount of transaction fees

These metrics are computed per minute. A one-minute window might not be the best period for analyzing blockchain transactions, but it lets us gather a lot of data points quickly, so the demo does not have to run for long, which keeps AWS costs down.
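To make the three metrics concrete, here is a purely illustrative Python sketch of the per-minute aggregation using pandas. In the demo itself this computation happens in Amazon Kinesis Data Analytics, and the column names used below are assumptions, not the demo's actual schema.

```python
# Purely illustrative: per-minute aggregation of raw transactions with pandas.
# Column names ("timestamp", "fee") are assumptions, not the demo's actual schema.
import pandas as pd

def aggregate_per_minute(transactions: pd.DataFrame) -> pd.DataFrame:
    """Compute the three demo metrics on a 1-minute window."""
    df = transactions.copy()
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="s")
    return df.resample("1min", on="timestamp").agg(
        tx_count=("fee", "size"),       # total number of transactions
        total_fees=("fee", "sum"),      # total amount of transaction fees
        avg_fees=("fee", "mean"),       # average amount of transaction fees
    )

# Example usage with two fake transactions
sample = pd.DataFrame({"timestamp": [1700000000, 1700000030], "fee": [250, 150]})
print(aggregate_per_minute(sample))
```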

This demo is developed using the AWS CDK and is available here.

The Architecture

The project is built around a self-mutating CDK pipeline which deploys the different stacks. Only the data ingestion pipeline components are shown here (the MLOps part of the architecture will be detailed in future posts).

The pipeline works as follows:

1. An AWS Fargate container polls the data source API every second to ingest the last 100 transactions and publishes them on the data ingestion event bus of AWS EventBridge (see the EventBridge sketch after this list).

2. An AWS EventBridge Rule routes the ingested data to the Amazon Kinesis Data Firehose delivery stream.

3. An AWS Lambda Function is used in combination with Amazon DynamoDB to keep track of recently ingested transactions and filter out transactions already ingested.

4. The raw data are delivered by Amazon Kinesis Data Firehose to Amazon Kinesis Data Analytics and to an Amazon S3 Bucket for archival.

5. Amazon Kinesis Data Analytics aggregates the data in near real time, computing the following 3 metrics per minute:

a. total number of transactions
b. total amount of transaction fees
c. average amount of transaction fees

6. An AWS Lambda Function writes the aggregated data into Amazon SageMaker Feature Store, which is used as the centralized data store for machine learning training and predictions (see the Feature Store sketch after this list).

7. An AWS Glue Job periodically aggregates the small files in the Amazon SageMaker Feature Store S3 Bucket to improve performance when reading data.
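As an illustration of step 1, publishing the polled transactions to a custom EventBridge bus with boto3 could look roughly like the following sketch; the bus name, source, and detail-type are hypothetical placeholders, not the demo's actual values.

```python
# Hedged sketch of step 1: publish polled transactions to a custom EventBridge bus.
# Bus name, source and detail-type below are hypothetical placeholders.
import json
import boto3

events = boto3.client("events")

def publish_transactions(transactions: list[dict]) -> None:
    # PutEvents accepts at most 10 entries per call, so publish in batches.
    for i in range(0, len(transactions), 10):
        batch = transactions[i : i + 10]
        events.put_events(
            Entries=[
                {
                    "EventBusName": "data-ingestion-bus",  # hypothetical
                    "Source": "demo.blockchain.poller",    # hypothetical
                    "DetailType": "transaction",           # hypothetical
                    "Detail": json.dumps(tx),
                }
                for tx in batch
            ]
        )
```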
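Similarly, as an illustration of step 6, writing one aggregated record into a feature group boils down to a single put_record call; the feature group name and feature names below are assumptions, not the demo's schema.

```python
# Hedged sketch of step 6: write one aggregated record into SageMaker Feature Store.
# The feature group name and feature names are assumptions, not the demo's schema.
import boto3

featurestore = boto3.client("sagemaker-featurestore-runtime")

def put_aggregated_record(event_time: str, tx_count: int,
                          total_fees: float, avg_fees: float) -> None:
    featurestore.put_record(
        FeatureGroupName="blockchain-metrics",  # hypothetical
        Record=[
            {"FeatureName": "event_time", "ValueAsString": event_time},
            {"FeatureName": "tx_count", "ValueAsString": str(tx_count)},
            {"FeatureName": "total_fees", "ValueAsString": str(total_fees)},
            {"FeatureName": "avg_fees", "ValueAsString": str(avg_fees)},
        ],
    )

put_aggregated_record("2024-01-01T00:01:00Z", 412, 10250.0, 24.9)
```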

In addition to deploying the data ingestion pipeline, the infrastructure stack also deploys the data scientist environment using Amazon SageMaker Studio. It creates an Amazon SageMaker Studio Domain and a user in it with the appropriate permissions. With this, the data scientist has access to an IDE to run Jupyter Notebooks to perform analytics on the data, run experiments, and test training a model.
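For reference, provisioning such a Studio domain and user profile with the CDK can be sketched with the low-level CfnDomain and CfnUserProfile constructs, roughly as below; the VPC lookup, resource names, and the execution role policy are assumptions, not taken from the demo's stack.

```python
# Minimal CDK (Python) sketch of a SageMaker Studio domain plus one user profile.
# Names, VPC lookup and the execution role policy are assumptions, not the demo's stack.
from aws_cdk import Stack, aws_ec2 as ec2, aws_iam as iam, aws_sagemaker as sagemaker
from constructs import Construct

class DataScientistEnvStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        vpc = ec2.Vpc.from_lookup(self, "Vpc", vpc_name="demo-vpc")  # hypothetical VPC

        execution_role = iam.Role(
            self, "StudioExecutionRole",
            assumed_by=iam.ServicePrincipal("sagemaker.amazonaws.com"),
            managed_policies=[
                iam.ManagedPolicy.from_aws_managed_policy_name("AmazonSageMakerFullAccess"),
            ],
        )

        domain = sagemaker.CfnDomain(
            self, "StudioDomain",
            auth_mode="IAM",
            domain_name="demo-domain",  # hypothetical
            vpc_id=vpc.vpc_id,
            subnet_ids=vpc.select_subnets(
                subnet_type=ec2.SubnetType.PRIVATE_WITH_EGRESS
            ).subnet_ids,
            default_user_settings=sagemaker.CfnDomain.UserSettingsProperty(
                execution_role=execution_role.role_arn,
            ),
        )

        sagemaker.CfnUserProfile(
            self, "DataScientist",
            domain_id=domain.attr_domain_id,
            user_profile_name="data-scientist",  # hypothetical
            user_settings=sagemaker.CfnUserProfile.UserSettingsProperty(
                execution_role=execution_role.role_arn,
            ),
        )
```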

How can you look at the data being ingested?

Monitoring the Pipeline

The demo comes with a CloudWatch dashboard for you to see the data flowing through the different components. The first widget displays the number of bytes:

  • Ingested by the AWS Fargate Container
  • Ingested by AWS EventBridge (There is unfortunately no metric per AWS EventBridge bus. This metric shows the total amount of data ingested by EventBridge in the account)
  • Ingested by Amazon Kinesis Firehose
  • Delivered by Amazon Kinesis Firehose to Amazon S3
  • Ingested by Amazon Kinesis Analytics

The second widget displays the amount of aggregated data delivered by Amazon Kinesis Analytics to Amazon SageMaker Feature Store.
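If you prefer the SDK over the console, pulling one of these metrics (for example the bytes ingested by Kinesis Data Firehose) could look like the following boto3 sketch; the delivery stream name is a placeholder.

```python
# Hedged sketch: fetch the bytes ingested by a Kinesis Data Firehose delivery stream
# over the last hour. The delivery stream name is a placeholder, not the demo's name.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")

now = datetime.now(timezone.utc)
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/Firehose",
    MetricName="IncomingBytes",
    Dimensions=[{"Name": "DeliveryStreamName", "Value": "demo-delivery-stream"}],  # placeholder
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,               # 5-minute buckets
    Statistics=["Sum"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```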

Querying the Data using Amazon Athena

The data in the Offline Store of Amazon SageMaker Feature Store is registered in the AWS Glue Data Catalog and can therefore be queried with Amazon Athena. Here is a query example (if you deploy the demo, you will have to adapt the feature store table name).
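As a hedged sketch, running such a query programmatically with boto3 could look like this; the database, table, and S3 output location are placeholders that you would replace with the values created in your account.

```python
# Hedged sketch: query the Feature Store offline data with Amazon Athena via boto3.
# Database, table and S3 output location are placeholders for your own values.
import time
import boto3

athena = boto3.client("athena")

QUERY = """
SELECT event_time, tx_count, total_fees, avg_fees   -- assumed column names
FROM "sagemaker_featurestore"."blockchain_metrics_1234567890"  -- placeholder table
ORDER BY event_time DESC
LIMIT 10
"""

execution = athena.start_query_execution(
    QueryString=QUERY,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},  # placeholder
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then print the rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```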

Querying the Data using Amazon SageMaker Studio Notebook

In the repository folder /resources/sagemaker/tests/, we provide a Jupyter notebook, read_feature_store.ipynb, which reads the latest data point from the Online Store of Amazon SageMaker Feature Store. From the Amazon SageMaker Studio domain, you can use the provisioned user and launch a Studio application. Once in the Jupyter or Code Editor environment, you can upload that notebook and run it.
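Outside the notebook, reading a record from the Online Store boils down to a single get_record call; the feature group name and record identifier value below are assumptions, not the demo's actual values.

```python
# Hedged sketch: read one record from the SageMaker Feature Store online store.
# Feature group name and record identifier value are assumptions, not the demo's values.
import boto3

featurestore = boto3.client("sagemaker-featurestore-runtime")

response = featurestore.get_record(
    FeatureGroupName="blockchain-metrics",                  # hypothetical
    RecordIdentifierValueAsString="2024-01-01T00:01:00Z",   # hypothetical identifier
)
for feature in response.get("Record", []):
    print(feature["FeatureName"], "=", feature["ValueAsString"])
```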

You will observe a difference of roughly six minutes between the timestamp of the latest data available in the Online Store and the Offline Store of Amazon SageMaker Feature Store. This is expected: records are available in the Online Store immediately after ingestion, while they are buffered and written to the Offline Store in batches, so the Offline Store always lags slightly behind.

The Challenges

The main challenge we faced when developing this architecture with the CDK was the cleanup of the SageMaker domain. When creating a SageMaker domain, AWS creates an Amazon EFS file system with mount targets in the VPC and security groups attached to them. When a user starts a SageMaker Studio App, compute resources are deployed to host the Code Editor/Jupyter IDE session and the Jupyter kernel session. None of those resources are deleted automatically when deleting the domain. This means that a Custom Resource must be developed in the CDK Stack to clean up the domain before it gets deleted. The main issue is that deleting a SageMaker Studio App can take longer than the 15-minute maximum runtime of the Custom Resource Lambda Function. Implementing a Step Functions state machine to periodically check the SageMaker Studio App status and wait for the deletion does not help either, because the CloudFormation WaitCondition does not support deletes and thus does not wait to receive the signal back from the Custom Resource before continuing with the deletion.
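For reference, the deletion calls themselves are simple; the difficulty is only the waiting. A hedged boto3 sketch of the kind of cleanup the Custom Resource has to perform (deleting all Studio apps of a domain) could look like this; error handling and the polling until deletion completes are omitted.

```python
# Hedged sketch: delete all SageMaker Studio apps in a domain before domain deletion.
# This shows only the deletion calls; the hard part (waiting past the 15-minute
# Lambda limit) is not solved here.
import boto3

sm = boto3.client("sagemaker")

def delete_domain_apps(domain_id: str) -> None:
    paginator = sm.get_paginator("list_apps")
    for page in paginator.paginate(DomainIdEquals=domain_id):
        for app in page["Apps"]:
            if app["Status"] in ("Deleted", "Deleting"):
                continue
            sm.delete_app(
                DomainId=domain_id,
                AppType=app["AppType"],
                AppName=app["AppName"],
                # Apps in a shared space need SpaceName; user-owned apps need UserProfileName.
                **(
                    {"SpaceName": app["SpaceName"]}
                    if "SpaceName" in app
                    else {"UserProfileName": app["UserProfileName"]}
                ),
            )
```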

Two issues have been opened in the CloudFormation GitHub repository to track these limitations.

The Costs

As mentioned previously, we aggregate ingested data on a one-minute interval to be able to quickly train a model and see results. Still, we recommend running the ingestion pipeline and demo for several days to have enough data. Should you want to run the full demo (with the MLOps automation part) for some time, be aware that the average cost in the Ireland (eu-west-1) region is roughly $850 per month.

Improvements

Amazon Kinesis Data Analytics is now legacy and should be replaced by Amazon Managed Service for Apache Flink. However, that service does not integrate directly with Amazon Kinesis Data Firehose. The data ingestion would need to be rearchitected to ingest the data into an Amazon Kinesis Data Stream, which would feed both Amazon Kinesis Data Firehose (to dump the raw data into the corresponding S3 bucket) and Amazon Managed Service for Apache Flink.