Implements Amazon SageMaker asynchronous inference patterns for long-running workloads and large payloads using S3-based I/O.
This skill provides comprehensive guidance and production-ready patterns for implementing SageMaker Asynchronous Inference, a critical architecture for workloads that exceed standard real-time limits. It covers the entire implementation lifecycle, including CDK infrastructure setup with scale-to-zero capabilities, TypeScript client implementation for S3-based polling, and robust Lambda integration. This is an essential tool for developers building generative AI applications, high-resolution image processing, or data-intensive models where processing times exceed 60 seconds or payloads are larger than 6MB.
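The S3-based polling pattern described above can be sketched as a small TypeScript helper. This is a minimal illustration, not the skill's actual implementation: `checkOutput` is a hypothetical callback standing in for an S3 `HeadObject`/`GetObject` check on the async endpoint's output location, and all option names are assumptions.

```typescript
// Resolves to the result body once the output object appears in S3,
// or null while it does not yet exist (hypothetical contract).
type CheckOutput = (outputS3Uri: string) => Promise<string | null>;

// Poll the S3 output location with exponential backoff until the result
// object appears or the overall deadline passes.
async function pollForResult(
  outputS3Uri: string,
  checkOutput: CheckOutput,
  opts: { initialDelayMs?: number; maxDelayMs?: number; timeoutMs?: number } = {},
): Promise<string> {
  const {
    initialDelayMs = 1_000,
    maxDelayMs = 30_000,
    timeoutMs = 15 * 60 * 1_000, // async inference runs can take up to ~15 minutes
  } = opts;
  const deadline = Date.now() + timeoutMs;
  let delay = initialDelayMs;
  while (Date.now() < deadline) {
    const result = await checkOutput(outputS3Uri);
    if (result !== null) return result;
    await new Promise((resolve) => setTimeout(resolve, delay));
    delay = Math.min(delay * 2, maxDelayMs); // exponential backoff, capped
  }
  throw new Error(`Timed out waiting for ${outputS3Uri}`);
}
```

In a real client, `checkOutput` would wrap the AWS SDK's S3 calls against the `OutputLocation` returned by the async invocation; injecting it as a callback keeps the retry logic testable without AWS credentials.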
Key Features
- Asynchronous polling and SNS notification patterns for result retrieval
- Scale-to-zero auto-scaling configuration for maximum cost efficiency
- Robust error handling and exponential backoff retry logic
- S3-integrated I/O handling for payloads exceeding 6MB
- End-to-end CDK infrastructure templates for endpoint deployment
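The scale-to-zero configuration can be expressed with the AWS CLI as follows. This is an illustrative fragment, not the skill's CDK template: the endpoint name `my-async-endpoint`, variant `AllTraffic`, capacity limits, and target value are placeholders to adapt.

```shell
# Register the async endpoint's variant as a scalable target with
# minimum capacity 0 so idle periods cost nothing.
aws application-autoscaling register-scalable-target \
  --service-namespace sagemaker \
  --resource-id endpoint/my-async-endpoint/variant/AllTraffic \
  --scalable-dimension sagemaker:variant:DesiredInstanceCount \
  --min-capacity 0 \
  --max-capacity 2

# Scale on queue depth rather than request rate: target-track the
# ApproximateBacklogSizePerInstance metric that async endpoints emit.
aws application-autoscaling put-scaling-policy \
  --service-namespace sagemaker \
  --resource-id endpoint/my-async-endpoint/variant/AllTraffic \
  --scalable-dimension sagemaker:variant:DesiredInstanceCount \
  --policy-name backlog-target-tracking \
  --policy-type TargetTrackingScaling \
  --target-tracking-scaling-policy-configuration '{
    "TargetValue": 5.0,
    "CustomizedMetricSpecification": {
      "MetricName": "ApproximateBacklogSizePerInstance",
      "Namespace": "AWS/SageMaker",
      "Dimensions": [{"Name": "EndpointName", "Value": "my-async-endpoint"}],
      "Statistic": "Average"
    }
  }'
```

Tracking backlog size per instance is what allows the endpoint to wake from zero instances when requests queue up and drain back down when the queue empties.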
Use Cases
- Processing high-resolution media or large documents that exceed standard API timeout limits
- Executing long-running generative AI tasks with response times between 1 and 15 minutes
- Building cost-optimized ML pipelines that scale down to zero instances during idle periods