CLI overview

An environment doctor for CUDA stacks

cuda-doctor answers a higher-level question than low-level NVIDIA tooling: can this machine build and run real CUDA workloads correctly, and if not, can it fix the environment safely?

What cuda-doctor is

cuda-doctor is a CLI for CUDA environments that combines four jobs: diagnose, repair, build, and validate.

It is not a replacement for `nvidia-smi`, `cuda-gdb`, or `compute-sanitizer`. Those tools expose low-level state or debugging workflows. cuda-doctor sits one layer above them and focuses on whether a machine can build and run real CUDA workloads correctly.

Validation-first

cuda-doctor should never call an environment healthy just because packages exist or `nvidia-smi` returns data. Real GPU execution is the gate.

Why this project exists

Modern CUDA stacks fail in ways that look successful from the surface. Driver installs can appear healthy while runtime launches fail. Toolchains can exist but miss support for new architectures such as `sm_120`. PyTorch wheels can import successfully while targeting the wrong runtime for the local GPU.
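
The PyTorch case can be illustrated with a small check. This is a minimal sketch, not cuda-doctor's implementation: `parse_arch` and `build_covers_device` are hypothetical helper names, and the compatibility rule is a simplification (a compiled `sm_XY` target is treated as usable on a device with the same major version and an equal or higher minor version, and a `compute_XY` PTX target as usable on any equal or newer device). Suffixed targets such as `sm_90a` are ignored.

```python
import re

_ARCH = re.compile(r"^(sm|compute)_(\d+)$")

def parse_arch(tag):
    """Split 'sm_86' into ('sm', 8, 6); return None for tags this sketch does not model."""
    match = _ARCH.match(tag)
    if match is None or len(match.group(2)) < 2:
        return None  # e.g. suffixed targets such as 'sm_90a'
    kind, digits = match.group(1), match.group(2)
    return kind, int(digits[:-1]), int(digits[-1])

def build_covers_device(arch_list, capability):
    """Return True if any target in arch_list can plausibly run on the device.

    Simplified rule (an assumption of this sketch):
      sm_XY      -> same major version, device minor >= Y
      compute_XY -> device capability >= (X, Y), via PTX JIT
    """
    major, minor = capability
    for tag in arch_list:
        parsed = parse_arch(tag)
        if parsed is None:
            continue
        kind, a_major, a_minor = parsed
        if kind == "sm" and a_major == major and a_minor <= minor:
            return True
        if kind == "compute" and (a_major, a_minor) <= (major, minor):
            return True
    return False

# With PyTorch installed, the inputs could come from:
#   torch.cuda.get_arch_list()            e.g. ['sm_80', 'sm_86', 'sm_90']
#   torch.cuda.get_device_capability(0)   e.g. (12, 0)
```

A wheel built only for `sm_80` through `sm_90` imports without error, yet this check reports that it does not cover a `(12, 0)` device.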

Blackwell readiness

Catch missing `sm_120` support before a build or kernel launch wastes time.
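
One way such a check could work is to ask the compiler which targets it knows about before any build starts. The sketch below assumes a toolkit recent enough to provide `nvcc --list-gpu-code`, which prints one `sm_XX` name per line; the helper names `supported_sm_targets` and `nvcc_supports` are hypothetical and are not part of cuda-doctor.

```python
import shutil
import subprocess

def supported_sm_targets(listing):
    """Collect the sm_XX names from the text printed by `nvcc --list-gpu-code`."""
    return {
        line.strip()
        for line in listing.splitlines()
        if line.strip().startswith("sm_")
    }

def nvcc_supports(target, nvcc="nvcc"):
    """Return True or False, or None when nvcc is missing or the query fails."""
    path = shutil.which(nvcc)
    if path is None:
        return None  # no compiler on PATH, so nothing can be concluded
    try:
        result = subprocess.run(
            [path, "--list-gpu-code"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None  # older toolkits may not accept this flag
    return target in supported_sm_targets(result.stdout)
```

Calling `nvcc_supports("sm_120")` separates three outcomes: the target is supported, the target is missing, or the toolchain could not be queried at all.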

Driver and runtime drift

Spot cases where reporting tools work but the intended runtime stack cannot execute correctly.
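
A simple form of drift detection compares the CUDA version the driver supports with the version of the runtime in use. The CUDA API reports both as integers encoded as `1000 * major + 10 * minor` (for example through `cudaDriverGetVersion` and `cudaRuntimeGetVersion`). The sketch below is deliberately conservative and is an assumption about how such a check might look: it ignores minor-version compatibility and forward-compatibility packages, and the function names are hypothetical.

```python
def decode_cuda_version(value):
    """Turn an encoded CUDA version such as 12040 into (12, 4)."""
    return value // 1000, (value % 1000) // 10

def runtime_within_driver(driver_version, runtime_version):
    """Conservative check: the runtime must not be newer than the driver supports."""
    return decode_cuda_version(runtime_version) <= decode_cuda_version(driver_version)
```

For example, a driver reporting `12020` paired with a runtime reporting `12040` fails this check even though `nvidia-smi` would still print a normal device table.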

Fake-success installs

Refuse to call the environment fixed until validation proves memory transfer and kernel execution are real.
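
The gating rule can be sketched as a verdict function in which only execution evidence can produce a passing result. The `Evidence` fields and the verdict strings below are hypothetical names chosen for illustration; cuda-doctor's actual report format is not described here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Evidence:
    packages_present: bool     # driver and toolkit packages are installed
    nvidia_smi_ok: bool        # nvidia-smi returned device data
    memcpy_roundtrip_ok: bool  # host -> device -> host copy returned identical bytes
    kernel_output_ok: bool     # a launched kernel produced the expected values

def verdict(evidence):
    """Only real GPU execution can yield 'validated'."""
    if evidence.memcpy_roundtrip_ok and evidence.kernel_output_ok:
        return "validated"
    if evidence.packages_present and evidence.nvidia_smi_ok:
        # Looks installed from the outside, but nothing has proven execution.
        return "not-validated"
    return "broken"
```

Under this rule an install where packages exist and `nvidia-smi` responds still comes back as `not-validated` until both the memory transfer and the kernel check pass.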

Primary user journey

  1. Install `cuda-doctor`.
  2. Run `cuda-doctor doctor` to diagnose the machine.
  3. Read what is broken, risky, or incomplete.
  4. Run `cuda-doctor doctor auto` to apply compatible repairs.
  5. Run `cuda-doctor validate` to prove a real GPU workload works.
  6. Run `cuda-doctor build` inside a project to compile with the correct architecture and toolchain settings.
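
The steps above could be chained in a small wrapper script, for example in CI. This is a rough sketch under one stated assumption: that each command exits with status 0 on success and non-zero on failure, which is not confirmed here. The `journey` and `run_cli` names are hypothetical, and step 3 (reading the diagnosis) remains a human task that the script does not replace.

```python
import subprocess

def run_cli(args):
    """Run one command and return its exit status."""
    return subprocess.run(args).returncode

def journey(run=run_cli):
    """Diagnose, repair, validate, then build; stop at the first failing gate."""
    # Diagnosis is informational here: its output is meant to be read.
    run(["cuda-doctor", "doctor"])
    if run(["cuda-doctor", "doctor", "auto"]) != 0:
        return "repair-failed"
    # Validation is the gate: never build on an unproven environment.
    if run(["cuda-doctor", "validate"]) != 0:
        return "validation-failed"
    if run(["cuda-doctor", "build"]) != 0:
        return "build-failed"
    return "ok"
```

Passing the runner in as a parameter keeps the ordering logic testable without a GPU or an installed `cuda-doctor` binary.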

Repository shape

Repository map

src/            native routing, diagnosis, repair, build, validation
include/        headers mirroring native modules
kernels/        CUDA smoke tests and benchmark kernels
cuda_doctor/    Python CLI wrapper, config handling, rich output
tests/          C++ unit tests and Python CLI tests
docker/         reproducible CUDA environments
scripts/        bootstrap and setup automation
CMakeLists.txt  native build graph
pyproject.toml  Python package and CLI entry point

  • Native core that can inspect the system and interact with CUDA directly.
  • Repair engine that reconciles broken or outdated environments.
  • Build helper that shields users from hand-authoring architecture flags.
  • Machine-readable reports for automation and CI.