Registering GPU Instance w/ AWS Elastic Container Service (ECS)
Prerequisites
Existing cluster
Some familiarity with AWS ECS, i.e. setting up clusters, services, task definitions, launch templates, Auto Scaling groups, and capacity providers
Steps
- Set up the task definition as normal, but specify a GPU resource requirement:
{
  "containerDefinitions": [
    {
      ...
      "resourceRequirements": [
        {
          "type": "GPU",
          "value": "1"
        }
      ]
    }
  ],
  ...
}
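If you prefer the CLI over the console, the fragment above can be fleshed out into a complete file and registered, roughly as sketched below. The family name `gpu-demo`, the container name `gpu-worker`, the image `YOUR-IMAGE`, and the 4096 MiB memory value are placeholders I made up for illustration, so swap in your own.

```shell
# Write a minimal GPU task definition to a file.
# Placeholders (not from this guide): family "gpu-demo", container name
# "gpu-worker", image "YOUR-IMAGE", and 4096 MiB of memory.
write_gpu_task_def() {
  local out_file="$1"
  cat > "$out_file" <<'EOF'
{
  "family": "gpu-demo",
  "requiresCompatibilities": ["EC2"],
  "containerDefinitions": [
    {
      "name": "gpu-worker",
      "image": "YOUR-IMAGE",
      "memory": 4096,
      "essential": true,
      "resourceRequirements": [
        { "type": "GPU", "value": "1" }
      ]
    }
  ]
}
EOF
}

# Register the task definition stored in the given file.
register_gpu_task_def() {
  local def_file="$1"
  aws ecs register-task-definition --cli-input-json "file://${def_file}"
}

# Usage:
#   write_gpu_task_def taskdef.json
#   register_gpu_task_def taskdef.json
```

Keeping the JSON in a file makes it easy to diff and re-register when you change the GPU count later.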
Set up a launch template with a GPU-specific AMI and instance type
Under Instance type, choose a compatible GPU instance type from those listed in this guide.
Under Application and OS Images (Amazon Machine Image), find an ECS-compatible AMI. In my case, for us-east-1, it was ami-01ff5874b57a57613, which I found with the following command:
aws ssm get-parameters --names /aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended --region us-east-1
# Should output something similar to
{
  "Parameters": [
    {
      "Name": "/aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended",
      "Type": "String",
      "Value": "{\"ecs_agent_version\":\"1.80.0\",\"ecs_runtime_version\":\"Docker version 20.10.25\",\"image_id\":\"ami-01ff5874b57a57613\",\"image_name\":\"amzn2-ami-ecs-gpu-hvm-2.0.20240109-x86_64-ebs\",\"image_version\":\"2.0.20240109\",\"os\":\"Amazon Linux 2\",\"schema_version\":1,\"source_image_name\":\"amzn2-ami-minimal-hvm-2.0.20240109.0-x86_64-ebs\"}",
      "Version": 128,
      "LastModifiedDate": "2024-01-17T09:04:08.076000-07:00",
      "ARN": "arn:aws:ssm:us-east-1::parameter/aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended",
      "DataType": "text"
    }
  ],
  "InvalidParameters": []
}
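If you only want the AMI ID rather than the whole JSON blob, a small wrapper like the one below should do it. It relies on the `/image_id` sub-parameter and the CLI's `--query` and `--output text` options; the function name `lookup_gpu_ami` is just something I picked for this sketch.

```shell
# Print only the AMI ID of the recommended ECS GPU-optimized Amazon Linux 2 image.
# Appending /image_id to the parameter name asks SSM for that single field.
lookup_gpu_ami() {
  local region="$1"
  aws ssm get-parameters \
    --names /aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended/image_id \
    --region "$region" \
    --query 'Parameters[0].Value' \
    --output text
}

# Usage:
#   AMI_ID="$(lookup_gpu_ami us-east-1)"
#   echo "$AMI_ID"
```

The recommended AMI changes over time, so looking it up per region is safer than copying the ID from this post.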
Type the AMI ID from the above step into the AMI search box. It should appear under Community AMIs (no idea why, since it is made by AWS). Select it.
Under Storage (volumes), you need to add an extra storage volume because the selected AMI requires a snapshot. In my case, I did not have that snapshot, so I had to specify a secondary volume. It essentially doesn't get used, so you can set its size to something small like 30 GB. Also, be sure to increase the size of Volume 1: it defaults to 30 GB, and given that you want a GPU, I am assuming you will be downloading some rather large models and/or datasets.
Under Advanced details, find User data. Here you will enable GPU support, install the CUDA drivers, and update the Python version.
- To enable GPU support and register the instance with ECS, add the following:
echo ECS_CLUSTER=YOUR-CLUSTER-NAME >> /etc/ecs/ecs.config;
echo ECS_ENABLE_GPU_SUPPORT=true >> /etc/ecs/ecs.config
- Install the CUDA drivers:
wget https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
cp cuda-rhel8.repo /etc/yum.repos.d/
yum clean all
yum install -y cuda-12-2
# Adding CUDA to PATH for all future sessions
echo 'export PATH=/usr/local/cuda-12.2/bin:$PATH' >> /etc/profile.d/cuda.sh
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH' >> /etc/profile.d/cuda.sh
- Your final User data script should look like:
#!/bin/bash
echo ECS_CLUSTER=YOUR-CLUSTER-NAME >> /etc/ecs/ecs.config
echo ECS_ENABLE_GPU_SUPPORT=true >> /etc/ecs/ecs.config
wget https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
cp cuda-rhel8.repo /etc/yum.repos.d/
yum clean all
yum install -y cuda-12-2
# Adding CUDA to PATH for all future sessions
echo 'export PATH=/usr/local/cuda-12.2/bin:$PATH' >> /etc/profile.d/cuda.sh
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH' >> /etc/profile.d/cuda.sh
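Once an instance has booted, it is worth checking that the user data actually ran. Here is a small sketch of a check you could run over SSH. The function name `check_ecs_gpu_config` is my own, and the default path is simply the file the script above writes to.

```shell
# Return 0 if the ECS agent config names the expected cluster and enables
# GPU support; return non-zero otherwise.
check_ecs_gpu_config() {
  local cluster="$1" file="${2:-/etc/ecs/ecs.config}"
  # -x matches whole lines only, -F treats the pattern as a literal string.
  grep -qxF "ECS_CLUSTER=${cluster}" "$file" &&
    grep -qxF "ECS_ENABLE_GPU_SUPPORT=true" "$file"
}

# Usage on the instance:
#   check_ecs_gpu_config YOUR-CLUSTER-NAME && echo "ecs.config looks right"
#   nvidia-smi                                  # should list the GPU(s)
#   curl -s http://localhost:51678/v1/metadata  # ECS agent introspection endpoint
```

If the config check passes but the instance never shows up in the cluster, the IAM role is usually the next thing to look at.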
Ensure your IAM ecsInstanceRole has the following policies attached: AmazonEC2ContainerServiceforEC2Role and EC2ModifyInstanceAttribute
AmazonEC2ContainerServiceforEC2Role:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeTags",
        "ecs:CreateCluster",
        "ecs:DeregisterContainerInstance",
        "ecs:DiscoverPollEndpoint",
        "ecs:Poll",
        "ecs:RegisterContainerInstance",
        "ecs:StartTelemetrySession",
        "ecs:UpdateContainerInstancesState",
        "ecs:Submit*",
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": "ecs:TagResource",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "ecs:CreateAction": [
            "CreateCluster",
            "RegisterContainerInstance"
          ]
        }
      }
    }
  ]
}
EC2ModifyInstanceAttribute:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": "ec2:ModifyInstanceAttribute",
      "Resource": "*"
    }
  ]
}
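The same permissions can be wired up from the CLI. The sketch below assumes you add EC2ModifyInstanceAttribute as an inline policy and have saved its JSON to a local file. I created mine in the console, so treat this as one possible route rather than exactly what I did. The function name is made up for the example.

```shell
# Attach the AWS managed policy and add the ModifyInstanceAttribute policy
# inline to the instance role.
# Assumption: the inline policy JSON has been saved to a local file.
attach_ecs_gpu_policies() {
  local role="$1" inline_policy_file="$2"
  aws iam attach-role-policy \
    --role-name "$role" \
    --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role &&
  aws iam put-role-policy \
    --role-name "$role" \
    --policy-name EC2ModifyInstanceAttribute \
    --policy-document "file://${inline_policy_file}"
}

# Usage:
#   attach_ecs_gpu_policies ecsInstanceRole modify-instance-attribute.json
```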
Set up an Auto Scaling group that is tied to the above launch template
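For reference, creating the group from the CLI might look something like this. The group name, launch template name, subnet IDs, and the 0/1/1 sizes are all placeholders for illustration. GPU instances are expensive, so I would start with a small maximum.

```shell
# Create an Auto Scaling group from a launch template.
# Arguments: group name, launch template name, comma-separated subnet IDs.
# The min/max/desired sizes below are illustrative defaults only.
create_gpu_asg() {
  local asg_name="$1" template_name="$2" subnets="$3"
  aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name "$asg_name" \
    --launch-template "LaunchTemplateName=${template_name},Version=\$Latest" \
    --min-size 0 \
    --max-size 1 \
    --desired-capacity 1 \
    --vpc-zone-identifier "$subnets"
}

# Usage:
#   create_gpu_asg YOUR-ASG-NAME YOUR-LAUNCH-TEMPLATE subnet-aaaa,subnet-bbbb
```

`$Latest` is escaped so the shell passes it through literally instead of expanding it as a variable.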
Deploy a service from the task definition
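A CLI version of the deploy step could look roughly like this. The service name and the desired count of 1 are placeholders. I am also assuming the EC2 launch type here, so if your cluster uses a capacity provider strategy, drop `--launch-type` and pass the strategy instead.

```shell
# Create an ECS service that runs one copy of the GPU task on EC2 instances.
# Arguments: cluster name, service name, task definition (family or family:revision).
deploy_gpu_service() {
  local cluster="$1" service="$2" task_def="$3"
  aws ecs create-service \
    --cluster "$cluster" \
    --service-name "$service" \
    --task-definition "$task_def" \
    --desired-count 1 \
    --launch-type EC2
}

# Usage:
#   deploy_gpu_service YOUR-CLUSTER-NAME YOUR-SERVICE-NAME YOUR-TASK-FAMILY
```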
Done!