Registering a GPU Instance with AWS Elastic Container Service (ECS)

Prerequisites

  • An existing ECS cluster

  • Some familiarity with AWS ECS, i.e. setting up clusters, services, task definitions, launch templates, Auto Scaling groups, and capacity providers

Steps

  1. Set up the task definition as normal, but specify a GPU resource requirement on the container:
{
  "containerDefinitions": [
    {
      ...
      "resourceRequirements": [
        {
          "type": "GPU",
          "value": "1"
        }
      ]
    }
  ],
  ...
}
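For reference, here is a rough sketch of a complete task definition carrying that GPU requirement, written to a file and then registered from the CLI. The family name gpu-smoke-test, the container name gpu-check, the nvidia/cuda image tag, and the memory value are all placeholders made up for illustration, not values from this setup. Registration needs AWS credentials, so it is wrapped in a function rather than run directly.

```shell
# Write a minimal GPU task definition to a file.
# All names and values inside the JSON are illustrative placeholders.
write_gpu_task_def() {
  out="${1:-gpu-task-def.json}"
  cat > "$out" <<'EOF'
{
  "family": "gpu-smoke-test",
  "requiresCompatibilities": ["EC2"],
  "containerDefinitions": [
    {
      "name": "gpu-check",
      "image": "nvidia/cuda:12.2.0-base-ubuntu22.04",
      "memory": 512,
      "essential": true,
      "command": ["nvidia-smi"],
      "resourceRequirements": [
        { "type": "GPU", "value": "1" }
      ]
    }
  ]
}
EOF
}

# Register the file with ECS (requires a configured AWS CLI).
register_gpu_task_def() {
  aws ecs register-task-definition \
    --cli-input-json "file://${1:-gpu-task-def.json}"
}
```

Typical use would be `write_gpu_task_def my-task.json && register_gpu_task_def my-task.json`.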
  2. Set up a launch template with a GPU-specific AMI and instance type

    1. Under Instance type, choose a compatible GPU instance type from those listed in this guide

    2. Under Application and OS Images (Amazon Machine Image), find an ECS-compatible GPU AMI. In my case, for us-east-1, it was ami-01ff5874b57a57613. The currently recommended AMI can be looked up with:

     aws ssm get-parameters --names /aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended --region us-east-1
     # Should output something similar to
     {
         "Parameters": [
             {
                 "Name": "/aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended",
                 "Type": "String",
                 "Value": "{\"ecs_agent_version\":\"1.80.0\",\"ecs_runtime_version\":\"Docker version 20.10.25\",\"image_id\":\"ami-01ff5874b57a57613\",\"image_name\":\"amzn2-ami-ecs-gpu-hvm-2.0.20240109-x86_64-ebs\",\"image_version\":\"2.0.20240109\",\"os\":\"Amazon Linux 2\",\"schema_version\":1,\"source_image_name\":\"amzn2-ami-minimal-hvm-2.0.20240109.0-x86_64-ebs\"}",
                 "Version": 128,
                 "LastModifiedDate": "2024-01-17T09:04:08.076000-07:00",
                 "ARN": "arn:aws:ssm:us-east-1::parameter/aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended",
                 "DataType": "text"
             }
         ],
         "InvalidParameters": []
     }
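If you only need the AMI ID, you can ask SSM for just that field instead of picking it out of the escaped JSON. A small sketch, assuming the image_id sub-parameter is exposed under the same path (as AWS documents for the ECS-optimized AMI parameters) and that the AWS CLI is configured. The function name gpu_ami_id is my own.

```shell
# Print the recommended ECS GPU-optimized AMI ID for a region.
# Assumes the "image_id" sub-parameter exists under the GPU parameter path.
gpu_ami_id() {
  region="${1:-us-east-1}"
  aws ssm get-parameters \
    --names /aws/service/ecs/optimized-ami/amazon-linux-2/gpu/recommended/image_id \
    --region "$region" \
    --query 'Parameters[0].Value' \
    --output text
}
```

Calling `gpu_ami_id us-east-1` should print a single AMI ID that can be pasted straight into the launch template.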
    
    3. Paste the AMI ID from the previous step into the AMI search box. It should appear under Community AMIs (I am not sure why, since it is published by AWS). Select it.

    4. Under Storage (volumes), you need to add an extra storage volume because the specified AMI requires a snapshot. In my case, I did not have that snapshot, so I had to specify a secondary volume. It essentially doesn't get used, so you can set its size to something small like 30 GiB. Also, be sure to increase Volume 1: it defaults to 30 GiB, and given that you want a GPU, I am assuming you will be downloading some rather large models and/or datasets.

    5. Under Advanced details, find User data. Here you will enable GPU support and install the CUDA drivers.

      1. To enable GPU support and register the instance with ECS, add the following:
        echo ECS_CLUSTER=YOUR-CLUSTER-NAME >> /etc/ecs/ecs.config;
        echo ECS_ENABLE_GPU_SUPPORT=true >> /etc/ecs/ecs.config
      2. Install the CUDA drivers:
        wget https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
        cp cuda-rhel8.repo /etc/yum.repos.d/
        yum clean all
        yum install -y cuda-12-2

        # Adding CUDA to PATH for all future sessions
        echo 'export PATH=/usr/local/cuda-12.2/bin:$PATH' >> /etc/profile.d/cuda.sh
        echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH' >> /etc/profile.d/cuda.sh
      3. Your final User data script should look like this (windmill-test is my cluster name; use your own):
        #!/bin/bash 
        echo ECS_CLUSTER=windmill-test >> /etc/ecs/ecs.config;
        echo ECS_ENABLE_GPU_SUPPORT=true >> /etc/ecs/ecs.config

        wget https://developer.download.nvidia.com/compute/cuda/repos/rhel8/x86_64/cuda-rhel8.repo
        cp cuda-rhel8.repo /etc/yum.repos.d/
        yum clean all
        yum install -y cuda-12-2

        # Adding CUDA to PATH for all future sessions
        echo 'export PATH=/usr/local/cuda-12.2/bin:$PATH' >> /etc/profile.d/cuda.sh
        echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.2/lib64:$LD_LIBRARY_PATH' >> /etc/profile.d/cuda.sh
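Once an instance has booted from the launch template, a quick way to confirm the user data ran is to look for the two settings in /etc/ecs/ecs.config. A minimal sketch, assuming you can get a shell on the instance. The function name check_ecs_gpu_config is hypothetical, and it only inspects the config file, so it does not prove the drivers themselves work.

```shell
# Return 0 if the ECS agent config names a cluster and enables GPU support.
check_ecs_gpu_config() {
  config="${1:-/etc/ecs/ecs.config}"
  grep -q '^ECS_CLUSTER=..*' "$config" &&
    grep -q '^ECS_ENABLE_GPU_SUPPORT=true' "$config"
}
```

For example, `check_ecs_gpu_config && echo "ECS GPU config present"`. Running `nvidia-smi` on the instance is a reasonable follow-up check for the driver itself.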
  3. Ensure your ecsInstanceRole IAM role has the following policies attached: AmazonEC2ContainerServiceforEC2Role and EC2ModifyInstanceAttribute. Their policy documents are, in that order:

     {
         "Version": "2012-10-17",
         "Statement": [
             {
                 "Effect": "Allow",
                 "Action": [
                     "ec2:DescribeTags",
                     "ecs:CreateCluster",
                     "ecs:DeregisterContainerInstance",
                     "ecs:DiscoverPollEndpoint",
                     "ecs:Poll",
                     "ecs:RegisterContainerInstance",
                     "ecs:StartTelemetrySession",
                     "ecs:UpdateContainerInstancesState",
                     "ecs:Submit*",
                     "ecr:GetAuthorizationToken",
                     "ecr:BatchCheckLayerAvailability",
                     "ecr:GetDownloadUrlForLayer",
                     "ecr:BatchGetImage",
                     "logs:CreateLogStream",
                     "logs:PutLogEvents"
                 ],
                 "Resource": "*"
             },
             {
                 "Effect": "Allow",
                 "Action": "ecs:TagResource",
                 "Resource": "*",
                 "Condition": {
                     "StringEquals": {
                         "ecs:CreateAction": [
                             "CreateCluster",
                             "RegisterContainerInstance"
                         ]
                     }
                 }
             }
         ]
     }

     {
         "Version": "2012-10-17",
         "Statement": [
             {
                 "Sid": "VisualEditor0",
                 "Effect": "Allow",
                 "Action": "ec2:ModifyInstanceAttribute",
                 "Resource": "*"
             }
         ]
     }
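If you prefer the CLI over the console, the two permissions above can be attached roughly like this. This is a sketch under assumptions: the role already exists, the first policy is the AWS managed one at its standard ARN, and the second is added as an inline policy named EC2ModifyInstanceAttribute (a managed policy would work too, but an inline one avoids needing your account ID). The function name is my own.

```shell
# Attach the ECS instance permissions to an existing IAM role.
attach_ecs_instance_policies() {
  role="${1:-ecsInstanceRole}"

  # AWS managed policy for ECS container instances.
  aws iam attach-role-policy \
    --role-name "$role" \
    --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role

  # Inline policy granting ec2:ModifyInstanceAttribute.
  aws iam put-role-policy \
    --role-name "$role" \
    --policy-name EC2ModifyInstanceAttribute \
    --policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Action":"ec2:ModifyInstanceAttribute","Resource":"*"}]}'
}
```

Run it as `attach_ecs_instance_policies ecsInstanceRole` with credentials that are allowed to modify IAM roles.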
    
  4. Set up an Auto Scaling group that uses the above launch template

  5. Deploy a service from the task definition

  6. Done!
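To double-check that the instance actually registered its GPU with the cluster, you can query the container instances from the CLI. A rough sketch, assuming a configured AWS CLI and assuming GPU instances report a resource named GPU in registeredResources, which is how I understand the describe-container-instances output. The function name list_cluster_gpus is hypothetical.

```shell
# Print the GPU IDs registered by the container instances in a cluster.
list_cluster_gpus() {
  cluster="$1"
  arns="$(aws ecs list-container-instances --cluster "$cluster" \
    --query 'containerInstanceArns' --output text)"
  if [ -z "$arns" ] || [ "$arns" = "None" ]; then
    echo "no container instances registered in $cluster" >&2
    return 1
  fi
  # $arns is intentionally unquoted so each ARN becomes its own argument.
  aws ecs describe-container-instances --cluster "$cluster" \
    --container-instances $arns \
    --query "containerInstances[].registeredResources[?name=='GPU'].stringSetValue" \
    --output text
}
```

For example, `list_cluster_gpus YOUR-CLUSTER-NAME`. Empty output would suggest the instance joined the cluster without GPU support enabled.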