# Guidance for AI-Driven Robotic Simulation and Training on AWS
This Guidance showcases an imitation-learning system for robots that combines foundation models with ML and mathematical algorithms, accelerated by AWS Trainium or GPU instances and managed with cloud-native tooling. It also lets developers train robotic agents with reinforcement learning in NVIDIA Isaac Sim on Amazon EKS, using LLM-generated reward functions from Amazon Bedrock, and then automatically deploy the trained models to physical robots through AWS IoT services.
## Table of Contents
### Required
1. [Overview](#overview)
- [Cost](#cost)
2. [Prerequisites](#prerequisites)
- [Operating System](#operating-system)
3. [Deployment Steps](#deployment-steps)
4. [Deployment Validation](#deployment-validation)
5. [Running the Guidance](#running-the-guidance)
6. [Next Steps](#next-steps)
7. [Cleanup](#cleanup)
8. [Notices](#notices)
***Optional***
9. [FAQ, known issues, additional considerations, and limitations](#faq-known-issues-additional-considerations-and-limitations-optional)
10. [Revisions](#revisions)
11. [Authors](#authors-optional)
## Overview
This guidance demonstrates how to build an AI-assisted robotic learning system that combines foundation models from Amazon Bedrock with reinforcement learning capabilities accelerated by AWS Trainium. The system enables robots to learn complex manipulation tasks through imitation learning and reinforcement learning, with automatic deployment to physical robots via AWS IoT services.
**Why was this Guidance built?**
Traditional robotic training requires extensive manual programming and domain expertise. This guidance solves the challenge of creating adaptive robotic systems that can learn from demonstrations and improve through reinforcement learning, leveraging AWS's AI/ML services for scalable robot training.
**What problem does this Guidance solve?**
- Reduces the complexity of training robotic manipulation tasks
- Enables continuous learning and improvement of robot policies
- Provides scalable infrastructure for robot training using cloud-native technologies
- Integrates foundation models for intelligent reward function generation
- Automates the deployment pipeline from simulation to physical robots
**Architecture Flow:**
1. **Data Collection**: UR5 robot performs T-bar pushing tasks in NVIDIA Isaac Sim
2. **Policy Training**: ACT (Action Chunking Transformer) policy learns from demonstration data
3. **Reinforcement Learning**: Policy is fine-tuned using reward functions generated by Amazon Bedrock
4. **Infrastructure**: AWS Trainium/GPU instances accelerate training, managed through Amazon EKS
5. **Deployment**: Trained models are deployed to physical robots via AWS IoT services
### Cost
_You are responsible for the cost of the AWS services used while running this Guidance. As of December 2024, the cost for running this Guidance with the default settings in the US East (N. Virginia) Region is approximately $801.20 per month for processing 100 training episodes with continuous robot learning._
_We recommend creating a [Budget](https://docs.aws.amazon.com/cost-management/latest/userguide/budgets-managing-costs.html) through [AWS Cost Explorer](https://aws.amazon.com/aws-cost-management/aws-cost-explorer/) to help manage costs. Prices are subject to change. For full details, refer to the pricing webpage for each AWS service used in this Guidance._
### Sample Cost Table
The following table provides a sample cost breakdown for deploying this Guidance with the default parameters in the US East (N. Virginia) Region for one month.
| AWS service | Dimensions | Cost [USD] |
| ----------- | ------------ | ------------ |
| Amazon EC2 (g4dn.xlarge) | 1 instance running 24/7 for Isaac Sim | $ 367.20 |
| Amazon EKS Cluster | 1 cluster for container orchestration | $ 73.00 |
| Amazon EC2 (trn1.2xlarge) | 2 instances for Trainium training (8 hours/day) | $ 256.00 |
| Amazon Bedrock (Claude 3) | 10,000 requests for reward function generation | $ 45.00 |
| Amazon S3 | 500 GB storage for training data and models | $ 11.50 |
| AWS Secrets Manager | 1 secret for password management | $ 0.40 |
| Amazon VPC | NAT Gateway and data transfer | $ 45.60 |
| AWS IoT Core | 50,000 messages for robot deployment | $ 2.50 |
| **Total** | | **$ 801.20** |
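As a quick arithmetic check, the total can be reproduced by summing the line items. The sketch below copies the eight monthly figures from the table above; the function name `sum_costs` is only illustrative and is not part of the repository.

```bash
# Line items copied from the sample cost table (USD per month).
monthly_costs="367.20 73.00 256.00 45.00 11.50 0.40 45.60 2.50"

# Sum space-separated amounts with awk, because shell arithmetic is integer-only.
sum_costs() {
  printf '%s\n' $1 | awk '{ total += $1 } END { printf "%.2f\n", total }'
}

sum_costs "$monthly_costs"
```

If you change instance counts or running hours, update the list and rerun the sum to get a revised estimate.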
## Prerequisites
### Operating System
These deployment instructions are optimized to work best on **Ubuntu 24.04 LTS with NVIDIA GPU support**. Deployment on other operating systems may require additional steps.
**Required packages:**
- Node.js 18+ and npm (for CDK deployment)
- AWS CLI v2
- Docker and Docker Compose
- Python 3.10+
- NVIDIA Docker runtime (for GPU support)
- ROS 2 Jazzy
**Installation commands:**
```bash
# Install Node.js and npm
curl -fsSL https://deb.nodesource.com/setup_18.x | sudo -E bash -
sudo apt-get install -y nodejs
# Install AWS CLI v2 (unzip is needed to extract the installer)
sudo apt-get install -y unzip
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
# Install Docker
sudo apt-get update
sudo apt-get install -y docker.io docker-compose
sudo usermod -aG docker $USER
# Install the NVIDIA Container Toolkit (replaces the deprecated nvidia-docker2 package)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
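After installation, it can help to confirm that the required commands are actually on `PATH` before continuing. The following is a minimal sketch; the helper name `check_tools` is hypothetical and not part of the repository.

```bash
# Report which of the given commands are missing from PATH.
# Returns 0 when every command is found, 1 otherwise.
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "ok: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}
```

For the packages listed above, a call such as `check_tools node npm aws docker python3` prints one line per tool and exits non-zero if any of them is absent.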
### Third-party tools
**Required third-party tools:**
- **NVIDIA Isaac Sim 4.5.0**: Physics simulation environment for robot training
- **LeRobot**: Robotics learning framework for policy training
- **PyTorch 2.3.1**: Deep learning framework with CUDA 12.1 support
- **OpenCV**: Computer vision library for image processing
- **ROS 2 Jazzy**: Robot Operating System for robot control
- **MoveIt**: Motion planning framework for robotic arms
### AWS account requirements
**Required AWS account setup:**
- **Amazon Bedrock access**: Enable Claude 3 model access in your AWS account
- **EC2 instance limits**: Ensure sufficient quota for g4dn.xlarge and trn1.2xlarge instances
- **EKS service**: Enable Amazon EKS service in your target region
- **IAM permissions**: Administrator access or specific permissions for:
  - EC2 instance management
  - EKS cluster creation
  - Bedrock model invocation
  - S3 bucket operations
  - Secrets Manager access
  - IoT Core messaging
- **VPC**: Default VPC or custom VPC with internet gateway
- **Key Pair**: EC2 key pair for SSH access to instances
### AWS CDK bootstrap
This Guidance uses AWS CDK for infrastructure deployment. If you are using AWS CDK for the first time, please perform the following bootstrapping:
```bash
# Install AWS CDK globally
npm install -g aws-cdk
# Bootstrap your AWS account for CDK
cdk bootstrap aws://ACCOUNT-NUMBER/REGION
# Example:
cdk bootstrap aws://123456789012/us-east-1
```
**Note**: Replace `ACCOUNT-NUMBER` with your AWS account ID and `REGION` with your target AWS region.
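If you prefer not to type the account ID by hand, the bootstrap target can be assembled from your current credentials. This is a sketch under the assumption that the AWS CLI is already configured; the helper name `cdk_bootstrap_target` is hypothetical.

```bash
# Print the aws://ACCOUNT-NUMBER/REGION string for `cdk bootstrap`.
# The region comes from the first argument, or from the CLI's configured default.
cdk_bootstrap_target() {
  local account_id region
  account_id=$(aws sts get-caller-identity --query Account --output text) || return 1
  region="${1:-$(aws configure get region)}"
  [ -n "$region" ] || { echo "no region given or configured" >&2; return 1; }
  printf 'aws://%s/%s\n' "$account_id" "$region"
}
```

It could then be used as `cdk bootstrap "$(cdk_bootstrap_target us-east-1)"`.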
### Service limits
**Critical service limits that may require increases:**
- **EC2 Instance Limits**:
  - g4dn.xlarge instances: Default limit may be 0-5 per region
  - trn1.2xlarge instances: Default limit may be 0-2 per region
  - [Request limit increase](https://console.aws.amazon.com/servicequotas/home/services/ec2/quotas)
- **EKS Cluster Limit**: Default 100 clusters per region (usually sufficient)
- **Bedrock Model Access**:
  - Claude 3 models require explicit access request
  - [Request model access](https://console.aws.amazon.com/bedrock/home#/modelaccess)
- **S3 Storage**: Default limits are typically sufficient for this guidance
### Supported Regions
**Recommended regions** (all required services available):
- **us-east-1** (N. Virginia) - Recommended for best service availability
- **us-west-2** (Oregon)
- **eu-west-1** (Ireland)
**Service availability considerations:**
- **AWS Trainium (trn1 instances)**: Limited to specific regions
- **Amazon Bedrock**: Claude 3 models available in select regions
- **NVIDIA Isaac Sim**: Requires GPU instances with RTX-capable GPUs (for example, the g4dn family used by this Guidance)
**Note**: Verify Trainium instance availability in your target region before deployment.
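One way to perform that check from the command line is the EC2 `describe-instance-type-offerings` API. The sketch below wraps it in a small helper; the function name `instance_type_offered` is hypothetical, and it assumes the CLI prints nothing when the query matches no offerings.

```bash
# Succeeds (exit 0) when INSTANCE_TYPE is offered in REGION, 1 when it is not,
# and 2 when the AWS CLI call itself fails.
instance_type_offered() {
  local region="$1" instance_type="$2" offered
  offered=$(aws ec2 describe-instance-type-offerings \
    --region "$region" \
    --location-type region \
    --filters "Name=instance-type,Values=$instance_type" \
    --query 'InstanceTypeOfferings[].InstanceType' \
    --output text) || return 2
  [ -n "$offered" ]
}
```

For example, `instance_type_offered us-east-1 trn1.2xlarge && echo available` reports whether Trainium capacity is listed for that region. It does not check your account quota.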
## Deployment Steps
1. Clone the repository:
```bash
# Make sure git-lfs is installed (https://git-lfs.com)
git lfs install
git clone https://github.com/aws-solutions-library-samples/guidance-for-ai-driven-robotic-simulation-and-training-on-aws.git
```
2. Navigate to the repository folder:
```bash
cd guidance-for-ai-driven-robotic-simulation-and-training-on-aws
```
3. Navigate to the CDK deployment directory:
```bash
cd deployment/cdk-nodejs
```
4. Install CDK dependencies:
```bash
npm install
```
5. Configure AWS credentials:
```bash
aws configure
```
6. Bootstrap CDK (if this is the first time you are using CDK in this account/region):
```bash
cdk bootstrap
```
7. Deploy the infrastructure stack by following the steps in the [CDK deployment README](./deployment/cdk-nodejs/README.md).
8. Capture the deployed resources:
```bash
# Get EC2 instance ID
aws cloudformation describe-stacks --stack-name RoboticsStack --query 'Stacks[0].Outputs[?OutputKey==`InstanceId`].OutputValue' --output text
# Get instance public IP
aws cloudformation describe-stacks --stack-name RoboticsStack --query 'Stacks[0].Outputs[?OutputKey==`InstancePublicIp`].OutputValue' --output text
# Get S3 bucket name
aws cloudformation describe-stacks --stack-name RoboticsStack --query 'Stacks[0].Outputs[?OutputKey==`S3BucketName`].OutputValue' --output text
```
9. Connect to the EC2 instance via SSH, using the public IP captured in the previous step:
```bash
ssh -i your-key.pem ubuntu@<instance-public-ip>
```
10. Wait for the Isaac Sim installation to complete (check status):
```bash
tail -f /var/log/user-data.log
# Wait until you see "Phase 2 completed"
```
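The three `describe-stacks` calls in step 8 differ only in the output key, so they can be folded into one helper and the results stored in variables for later steps. This is a sketch, not part of the repository: the helper name `stack_output` is hypothetical, while the stack name `RoboticsStack` and the output keys are the ones used above.

```bash
STACK_NAME="RoboticsStack"

# Print the value of one CloudFormation stack output, selected by its key.
stack_output() {
  aws cloudformation describe-stacks \
    --stack-name "$STACK_NAME" \
    --query "Stacks[0].Outputs[?OutputKey==\`$1\`].OutputValue" \
    --output text
}

# Example usage (requires a deployed stack and valid credentials):
#   INSTANCE_IP=$(stack_output InstancePublicIp)
#   ssh -i your-key.pem "ubuntu@$INSTANCE_IP"
```

Keeping the IP in a variable also avoids copy-and-paste mistakes when opening the DCV URL later.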
## Deployment Validation
**Validate successful deployment:**
1. **CloudFormation Stack Status:**
```bash
aws cloudformation describe-stacks --stack-name RoboticsStack --query 'Stacks[0].StackStatus'
```
Expected output: `"CREATE_COMPLETE"`
2. **EC2 Instance Status:**
```bash
aws ec2 describe-instances --filters "Name=tag:Name,Values=*robotics*" --query 'Reservations[*].Instances[*].[InstanceId,State.Name,PublicIpAddress]' --output table
```
Expected: Instance should be in `running` state
3. **Isaac Sim Installation:**
```bash
# SSH to instance and check
ls -la /home/ubuntu/isaacsim/installation_complete
```
Expected: File should exist
4. **DCV Server Status:**
```bash
# On the EC2 instance
sudo systemctl status dcvserver
dcv list-sessions
```
Expected: DCV server running with active session
5. **Access DCV Web Interface:**
- Open a browser to `https://<instance-public-ip>:8443`
- Log in with username `ubuntu` and the password from Secrets Manager
6. **S3 Bucket Creation:**
```bash
aws s3 ls | grep robotics
```
Expected: Bucket with robotics prefix should be listed
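The stack status check in step 1 can also be scripted so that it returns a usable exit code. The sketch below is an assumption-labeled helper (`stack_is_ready` is a hypothetical name); it also accepts `UPDATE_COMPLETE`, which CloudFormation reports after a successful redeployment of an existing stack.

```bash
# Exit 0 if the stack is in a completed state, 1 if it is in any other state,
# and 2 if the status could not be retrieved.
stack_is_ready() {
  local status
  status=$(aws cloudformation describe-stacks --stack-name "$1" \
    --query 'Stacks[0].StackStatus' --output text) || return 2
  case "$status" in
    CREATE_COMPLETE|UPDATE_COMPLETE) echo "$1: $status"; return 0 ;;
    *) echo "$1: $status (not ready)"; return 1 ;;
  esac
}
```

For example, `stack_is_ready RoboticsStack || echo "deployment not finished"` can gate the remaining validation steps.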
## Running the Guidance
**Step 1: Access the Isaac Sim Environment**
1. Connect to the EC2 instance via DCV at `https://<instance-public-ip>:8443`
2. Log in with username `ubuntu` and retrieve the password from AWS Secrets Manager:
```bash
aws secretsmanager get-secret-value --secret-id <secret-id> --query SecretString --output text
```
**Step 2: Start Simulation and Data Collection**
[Follow the commands](./source/ur5_nova/Scripts/commands.md)
**Iterative Training Approach:**
1. **Initial Training**: Run RL fine-tune for 30 minutes
2. **Evaluation**: Test with data collection to measure performance improvement
3. **Iteration**: If performance is insufficient, continue RL training for another 30 minutes
4. **Repeat**: Continue until desired accuracy threshold is achieved
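The train, evaluate, repeat cycle above can be expressed as a small driver loop. The sketch below is illustrative only: `train_round` and `eval_success` are hypothetical shell functions that you would define yourself around the actual training and evaluation commands from the linked commands file, and the integer-percent success rate and the round cap are assumptions made for this example.

```bash
# iterate_training TARGET_PERCENT MAX_ROUNDS
# Relies on two user-defined functions (hypothetical names):
#   train_round   runs one 30-minute RL fine-tuning session
#   eval_success  prints the measured success rate as an integer percent
# Returns 0 once the target is reached, 2 if MAX_ROUNDS is exhausted first.
iterate_training() {
  local target="$1" max_rounds="$2" round=1 rate
  while [ "$round" -le "$max_rounds" ]; do
    train_round || return 1
    rate=$(eval_success) || return 1
    echo "round $round: success rate ${rate}%"
    if [ "$rate" -ge "$target" ]; then
      return 0
    fi
    round=$((round + 1))
  done
  return 2
}
```

Capping the number of rounds keeps a policy that has stopped improving from accumulating instance hours indefinitely.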
**Monitoring Progress:**
- Training metrics are logged to console
- Model checkpoints saved automatically
- Success rate and accuracy displayed in real-time
- Nova Pro provides intelligent observations of robot behavior
## Next Steps
**Customization and Enhancement Options:**
1. **Modify Training Parameters:**
- Adjust learning rate, batch size, and training epochs in `RL_Finetune.py`
- Customize reward functions for different manipulation tasks
- Modify success thresholds and accuracy floors
2. **Extend to Different Robot Tasks:**
- Replace T-bar pushing with other manipulation tasks (pick-and-place, assembly)
- Modify the Isaac Sim scene files in `source/ur5_nova/configuration/`
- Update observation and action spaces for new tasks
3. **Scale Training Infrastructure:**
- Deploy multiple Trainium instances for distributed training
- Use Amazon SageMaker for managed training workflows
- Implement model versioning with Amazon SageMaker Model Registry
4. **Integrate Additional Foundation Models:**
- Use different Bedrock models for reward function generation
- Implement multi-modal learning with vision-language models
- Add natural language instruction following capabilities
5. **Deploy to Physical Robots:**
- Configure AWS IoT Core for robot fleet management
- Implement over-the-air model updates
- Add real-world sensor integration and calibration
6. **Production Optimization:**
- Implement model quantization for edge deployment
- Add monitoring and alerting with Amazon CloudWatch
- Set up automated retraining pipelines with Amazon EventBridge
## Cleanup
**To completely remove all resources created by this Guidance:**
1. **Stop running processes on EC2 instance:**
```bash
# SSH to the instance and stop any running training
pkill -f python3
sudo systemctl stop dcvserver
```
2. **Empty S3 bucket contents:**
```bash
# Get bucket name from CloudFormation output
BUCKET_NAME=$(aws cloudformation describe-stacks --stack-name RoboticsStack --query 'Stacks[0].Outputs[?OutputKey==`S3BucketName`].OutputValue' --output text)
# Empty the bucket
aws s3 rm s3://$BUCKET_NAME --recursive
```
3. **Delete CloudFormation stacks:**
```bash
# Delete all CDK stacks
cdk destroy --all
# Confirm deletion when prompted
```
4. **Verify resource deletion:**
```bash
# Check that stacks are deleted
aws cloudformation list-stacks --stack-status-filter DELETE_COMPLETE
# Verify EC2 instances are terminated
aws ec2 describe-instances --filters "Name=tag:Name,Values=*robotics*" --query 'Reservations[*].Instances[*].[InstanceId,State.Name]' --output table
```
5. **Manual cleanup:**
- **EKS clusters**: If EKS stack deletion fails, manually delete from console
- **VPC resources**: Delete any remaining ENIs or security groups
- **IAM roles**: Remove any custom IAM roles if they weren't deleted
- **Secrets Manager**: Delete the password secret if it remains
**Note**: The cleanup process may take 10-15 minutes to complete. Monitor the CloudFormation console to ensure all stacks are successfully deleted.
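When emptying the bucket, it is worth guarding against an empty bucket name, which can happen if the stack output lookup returns nothing. The following sketch adds that check; the helper name `empty_bucket` is hypothetical, and treating the literal `None` as empty reflects how the AWS CLI prints a null result in text output.

```bash
# Delete every object in the named bucket, refusing to run on an empty name.
empty_bucket() {
  local bucket="$1"
  if [ -z "$bucket" ] || [ "$bucket" = "None" ]; then
    echo "refusing to run: bucket name is empty" >&2
    return 1
  fi
  aws s3 rm "s3://$bucket" --recursive
}
```

With the variable from step 2, the call would be `empty_bucket "$BUCKET_NAME"`.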
## FAQ, known issues, additional considerations, and limitations (optional)
**Known issues**
1. **Isaac Sim Installation Timeout:**
- **Issue**: Isaac Sim download may timeout on slower connections
- **Resolution**: SSH to instance and manually restart download:
```bash
cd /home/ubuntu/isaacsim
wget --continue https://download.isaacsim.omniverse.nvidia.com/isaac-sim-standalone-5.1.0-linux-x86_64.zip
```
2. **DCV Connection Issues:**
- **Issue**: Cannot connect to DCV web interface
- **Resolution**: Check security group allows port 8443 and restart DCV:
```bash
sudo systemctl restart dcvserver
```
3. **CUDA Out of Memory:**
- **Issue**: Training fails with CUDA OOM errors
- **Resolution**: Reduce batch size in training scripts or use smaller model
4. **Trainium Instance Unavailability:**
- **Issue**: trn1 instances not available in region
- **Resolution**: Use alternative regions or switch to GPU instances (p3, g4dn)
5. **Bedrock Model Access:**
- **Issue**: Access denied to Claude 3 models
- **Resolution**: Request model access in Bedrock console before deployment
**Additional considerations**
- **This Guidance creates EC2 instances that are billed per hour while running, including during idle time**
- **Trainium instances (trn1.2xlarge) are premium instances with higher hourly costs**
- **The guidance creates a VPC with NAT Gateway that incurs hourly charges**
- **Isaac Sim requires significant disk space (>50GB) and may increase EBS costs**
- **DCV server creates a remote desktop session accessible over the internet - ensure strong passwords**
- **Training data is stored in S3 and may accumulate over time; monitor storage costs**
- **Amazon Bedrock on-demand inference is billed by input and output tokens - monitor usage during reward function generation**
- **The system generates significant network traffic during training which may incur data transfer charges**
**Security Considerations:**
- DCV web interface is exposed to the internet on port 8443
- Ensure strong passwords are set via Secrets Manager
- Consider restricting access to specific IP ranges in security groups
- Training data may contain sensitive information - ensure proper S3 bucket policies
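To restrict DCV access to a single address, the security group rule for port 8443 can be narrowed with the EC2 CLI. This is a sketch under assumptions: it presumes the deployed security group currently allows `0.0.0.0/0` on port 8443, and the helper name `restrict_dcv_access` and the example group ID are hypothetical.

```bash
# restrict_dcv_access SECURITY_GROUP_ID IPV4_ADDRESS
# Allows port 8443 from one address, then removes the open-to-all rule.
restrict_dcv_access() {
  local sg_id="$1" ip="$2"
  aws ec2 authorize-security-group-ingress --group-id "$sg_id" \
    --protocol tcp --port 8443 --cidr "$ip/32" || return 1
  # Ignore the error if no open rule exists (assumption: it was 0.0.0.0/0).
  aws ec2 revoke-security-group-ingress --group-id "$sg_id" \
    --protocol tcp --port 8443 --cidr 0.0.0.0/0 || true
}
```

A possible invocation is `restrict_dcv_access sg-0123456789abcdef0 "$(curl -s https://checkip.amazonaws.com)"`, which uses your current public IP. The new rule is added before the old one is removed so that an active session is not cut off.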
*For any feedback, questions, or suggestions, please use the issues tab under this repo.*
## Revisions
Document all notable changes to this project.
Consider formatting this section based on Keep a Changelog, and adhering to Semantic Versioning.
## Notices
*Customers are responsible for making their own independent assessment of the information in this Guidance. This Guidance: (a) is for informational purposes only, (b) represents AWS current product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or services are provided "as is" without warranties, representations, or conditions of any kind, whether express or implied. AWS responsibilities and liabilities to its customers are controlled by AWS agreements, and this Guidance is not part of, nor does it modify, any agreement between AWS and its customers.*