Enabling GPU access with Compose

Estimated reading time: 6 minutes

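Compose services can define GPU device reservations if the Docker host contains such devices and the Docker daemon is configured accordingly. Before you start, make sure the prerequisites are installed if you have not installed them already.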

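The examples in the following sections focus specifically on giving service containers access to GPU devices through Docker Compose. You can use either the docker-compose or the docker compose command.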

Use of the service runtime property from the Compose v2.3 format (legacy)

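Docker Compose v1.27.0+ switched to the Compose Specification schema, which combines all properties from the 2.x and 3.x versions. This re-enabled the use of the runtime service property to give service containers GPU access. However, it does not allow control over specific properties of the GPU devices.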

services:
  test:
    image: nvidia/cuda:10.2-base
    command: nvidia-smi
    runtime: nvidia

Enabling GPU access to service containers

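Docker Compose v1.28.0+ allows GPU reservations to be defined using the device structure from the Compose Specification. This gives finer-grained control over a GPU reservation, because custom values can be set for the following device properties:

  • capabilities - a list of strings (for example, capabilities: [gpu]). You must set this field in the Compose file. Otherwise, an error is returned on service deployment.
  • count - an int, or the value all, giving the number of GPU devices to reserve (provided the host has that many GPUs).
  • device_ids - a list of strings representing GPU device IDs from the host. You can find the device IDs in the output of nvidia-smi on the host.
  • driver - a string (for example, driver: 'nvidia').
  • options - key-value pairs representing driver-specific options.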

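Note

You must set the capabilities field. Otherwise, an error is returned on service deployment.

count and device_ids are mutually exclusive. You can define only one of the two fields at a time.

For more information on these properties, see the deploy section in the Compose Specification.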
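To make the rules above concrete, here is a minimal sketch of a reservation that sets count to all. The service name and image are reused from the other examples in this document. Whether reserving every GPU suits your host is an assumption you should check for yourself.

```yaml
services:
  test:
    image: nvidia/cuda:10.2-base
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              # "all" asks for every GPU on the host; an integer asks for that many.
              count: all
              # device_ids is deliberately absent: it cannot be combined with count.
              # capabilities is mandatory; leaving it out fails at deployment.
              capabilities: [gpu]
```

Replacing count: all with a device_ids list (and removing count) would be the other valid form of the same reservation.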

Example of a Compose file for running a service with access to 1 GPU device:

services:
  test:
    image: nvidia/cuda:10.2-base
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu, utility]

Run it with Docker Compose:

$ docker-compose up
Creating network "gpu_default" with the default driver
Creating gpu_test_1 ... done
Attaching to gpu_test_1    
test_1  | +-----------------------------------------------------------------------------+
test_1  | | NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.1     |
test_1  | |-------------------------------+----------------------+----------------------+
test_1  | | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
test_1  | | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
test_1  | |                               |                      |               MIG M. |
test_1  | |===============================+======================+======================|
test_1  | |   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
test_1  | | N/A   23C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
test_1  | |                               |                      |                  N/A |
test_1  | +-------------------------------+----------------------+----------------------+
test_1  |                                                                                
test_1  | +-----------------------------------------------------------------------------+
test_1  | | Processes:                                                                  |
test_1  | |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
test_1  | |        ID   ID                                                   Usage      |
test_1  | |=============================================================================|
test_1  | |  No running processes found                                                 |
test_1  | +-----------------------------------------------------------------------------+
gpu_test_1 exited with code 0

If neither count nor device_ids is set, all GPUs available on the host are used by default.

services:
  test:
    image: tensorflow/tensorflow:latest-gpu
    command: python -c "import tensorflow as tf;print(tf.test.gpu_device_name())"
    deploy:
      resources:
        reservations:
          devices:
          - capabilities: [gpu]

$ docker-compose up
Creating network "gpu_default" with the default driver
Creating gpu_test_1 ... done
Attaching to gpu_test_1
test_1  | I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
.....
test_1  | I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/device:GPU:0 with 13970 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
test_1  | /device:GPU:0
gpu_test_1 exited with code 0

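On machines hosting multiple GPUs, the device_ids field can be set to target specific GPU devices, and count can be used to limit the number of GPU devices assigned to a service container. If count exceeds the number of GPUs available on the host, the deployment fails with an error.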

$ nvidia-smi   
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1B.0 Off |                    0 |
| N/A   72C    P8    12W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla T4            On   | 00000000:00:1C.0 Off |                    0 |
| N/A   67C    P8    11W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla T4            On   | 00000000:00:1D.0 Off |                    0 |
| N/A   74C    P8    12W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   62C    P8    11W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

To enable access only to the GPU-0 and GPU-3 devices:

services:
  test:
    image: tensorflow/tensorflow:latest-gpu
    command: python -c "import tensorflow as tf;tf.test.gpu_device_name()"
    deploy:
      resources:
        reservations:
          devices:
          - driver: nvidia
            device_ids: ['0', '3']
            capabilities: [gpu]

$ docker-compose up
...
Created TensorFlow device (/device:GPU:0 with 13970 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1b.0, compute capability: 7.5)
...
Created TensorFlow device (/device:GPU:1 with 13970 MB memory) -> physical GPU (device: 1, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
...
gpu_test_1 exited with code 0
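
For comparison, the sketch below puts the two selection styles side by side on the same four-GPU host. The service names pinned and limited are hypothetical and chosen only for illustration. This document does not say which devices a count-based reservation receives, or whether reservations are exclusive between services, so do not assume the two services see disjoint GPUs.

```yaml
services:
  # Hypothetical service that targets specific devices by ID.
  pinned:
    image: nvidia/cuda:10.2-base
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['1', '2']
              capabilities: [gpu]
  # Hypothetical service that is limited by number of devices instead.
  limited:
    image: nvidia/cuda:10.2-base
    command: nvidia-smi
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              # Must not exceed the number of GPUs on the host (4 here),
              # otherwise the deployment fails with an error.
              count: 2
              capabilities: [gpu]
```

Each device entry uses either device_ids or count, never both, which matches the mutual-exclusion rule stated earlier.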