2025-06-12_separat_slurm_kluster_felsokning
This guide provides a practical, end-to-end procedure to diagnose and fix Slurm job dispatch problems in the containerized MLOps environment. It focuses on the typical failure modes we’ve observed when running the FastAPI app and the Slurm cluster as separate services with a shared bind-mounted volume (./persistent-data → /data).
Key takeaways:
- Shared volume ownership matters. If `/data` is owned by a different UID/GID than the process writing to it, you'll see `Permission denied` at runtime.
- Ensure `sbatch` exists and the Slurm controller is UP before submitting jobs.
- Keep code sync predictable: set `PYTHONPATH`, `cd` to the code root, and run `python -m app.ml_train`.
- Docker socket access is required if the app dispatches jobs into the Slurm master container.
1) Symptoms → Root Causes → Fixes
A. Permission denied when syncing code to /data/app_code_for_slurm
Example:
cp: cannot create directory '/data/app_code_for_slurm/app': Permission denied
cp: cannot create regular file '/data/app_code_for_slurm/requirements.txt': Permission denied
Root cause:
The bind-mounted host folder `./persistent-data` is owned by the wrong user (often `root`) after rebuilds, so the FastAPI container (user `appuser`) cannot write to it.
Fix:
- On host: `sudo chown -R $(id -u):$(id -g) persistent-data`
- Containers now also fix ownership on startup (a sketch follows this list):
  - FastAPI entrypoint: chowns `/data` to the `appuser` UID/GID and sets `u+rwX,g+rwX`.
  - Slurm entrypoint: chowns `/data` to the `slurm` UID/GID (mapped to the host UID/GID at build time) and sets `u+rwX,g+rwX`.
- Recreate services: `docker compose up --build -d` (dev) or `docker compose -f prod-docker-compose.yaml up --build -d` (prod).
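For reference, a minimal sketch of the ownership normalization such an entrypoint can perform at startup; the real logic lives in `fastapi_entrypoint.sh` and `slurm-image/entrypoint.sh`, and the default values here are only illustrative:

TARGET_UID="${HOST_UID:-1000}"                  # build/run-time UID (assumed default 1000)
TARGET_GID="${HOST_GID:-1000}"                  # build/run-time GID (assumed default 1000)
chown -R "${TARGET_UID}:${TARGET_GID}" /data    # normalize ownership of the bind mount
chmod -R u+rwX,g+rwX /data                      # rw for owner/group, +x only on directories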
Verification inside FastAPI container:
id
ls -ld /data /data/app_code_for_slurm
touch /data/app_code_for_slurm/.writetest
B. Slurm controller not UP
Example UI status:
Slurm not UP (code=.. ) out=.. err=..
Root cause:
`slurmctld` not fully started, `munged` missing or failed, or container startup-order issues.
Fix:
- In the `slurm-master` container, `scontrol ping` must contain `UP` (see the quick check after this list).
- Check the logs: `/var/log/slurm/slurmctld.log` and the Munge status.
- Ensure these mounts and settings are present (Compose):
  - the `/etc/slurm` config is mounted
  - `/sys/fs/cgroup:/sys/fs/cgroup:rw`
  - `privileged: true`
  - `slurm-master` starts before the workers
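A quick health check from the host, assuming the controller container is reachable as `slurm-master` (adjust the name to your Compose project):

docker exec slurm-master scontrol ping                   # output should contain "UP"
docker exec slurm-master bash -c 'munge -n | unmunge'    # round-trips a Munge credential; expect STATUS: Success
docker exec slurm-master tail -n 50 /var/log/slurm/slurmctld.log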
C. sbatch not found
Root cause:
Slurm packages not installed correctly in the image.
Fix:
- Verify: `docker exec -it slurm-master which sbatch`
- Rebuild `slurm-image` if it is missing. Ensure `slurm-wlm` and `slurm-client` are installed (see the install sketch below).
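A minimal sketch of the install step the `slurm-image` build needs (Debian/Ubuntu package names, assuming an Ubuntu base image; adjust to your base):

apt-get update
apt-get install -y --no-install-recommends slurm-wlm slurm-client munge   # sbatch/squeue ship with these packages
rm -rf /var/lib/apt/lists/*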
D. Python import error: ModuleNotFoundError: No module named 'app'
Root cause:
Code sync did not create `/data/app_code_for_slurm/app/`, or `PYTHONPATH`/CWD is incorrect.
Fix:
- The sbatch script (generated by the app) sets `PYTHONPATH=/data/app_code_for_slurm`, `cd`s to the code root, and runs `python3 -m app.ml_train` (sketched below).
- Verify that the job logs in `/data/logs/ml_train_<jobid>.out` show directory listings and Python introspection (we print `sys.path` before training starts).
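For orientation, a sketch of what the generated sbatch script boils down to; the `#SBATCH` directives are illustrative, since the real script is generated by the app with job-specific values:

#!/bin/bash
#SBATCH --job-name=ml_train
#SBATCH --output=/data/logs/ml_train_%j.out
#SBATCH --error=/data/logs/ml_train_%j.err

export PYTHONPATH=/data/app_code_for_slurm    # make the synced code importable
cd /data/app_code_for_slurm                   # run from the code root

echo "PWD=$(pwd)"; ls -la                     # diagnostics written to the job log
python3 -c "import sys; print('sys.path =', sys.path)"

python3 -m app.ml_train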
E. Docker exec failures from app container
Root cause:
Docker socket not mounted, or app user not in the socket group.
Fix:
- Mount `/var/run/docker.sock` into the FastAPI container.
- The entrypoint maps the socket GID inside the container and adds `appuser` to that group (sketched below).
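A minimal sketch of that GID mapping, assuming the entrypoint runs as root before dropping to `appuser`; the group name `dockersock` is only illustrative:

SOCK_GID=$(stat -c '%g' /var/run/docker.sock)           # GID owning the mounted socket
SOCK_GROUP=$(getent group "$SOCK_GID" | cut -d: -f1)    # reuse an existing group with that GID if present
if [ -z "$SOCK_GROUP" ]; then
  SOCK_GROUP=dockersock
  groupadd -g "$SOCK_GID" "$SOCK_GROUP"
fi
usermod -aG "$SOCK_GROUP" appuser                       # let appuser talk to the Docker daemon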
2) Ubuntu-specific gotchas (22.04 / 24.04)
- Bind mounts are created by the first container that writes to them. If that container runs as `root`, host files become `root`-owned, breaking subsequent writes by non-root app containers.
- Fix this by ensuring each service that writes to `/data` normalizes ownership on startup (we now do this in both the FastAPI and Slurm entrypoints).
- If you rebuild images with different `HOST_UID`/`HOST_GID`, previously written files may still be owned by the old IDs. Run a host-side `chown -R` to reset (a quick check for stale ownership follows below).
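Before re-running the chown, this host-side check lists anything under `persistent-data` that is not owned by your current user:

find persistent-data ! -uid "$(id -u)" -ls    # files still owned by old UIDs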
3) Validation Checklist
On host:
ls -ld persistent-data
sudo chown -R $(id -u):$(id -g) persistent-data
FastAPI container:
id
python -V
ls -ld /data /data/app_code_for_slurm /data/logs
python - <<'PY'
import os, sys
print('PYTHONPATH=', os.getenv('PYTHONPATH'))
print('sys.path=', sys.path)
print('can_write?', os.access('/data', os.W_OK))
PY
Slurm master:
scontrol ping # Should contain "UP"
which sbatch # Should be present
squeue --noheader # Should show jobs shortly after submit
Job logs (on host or in Slurm containers):
tail -n +1 persistent-data/logs/ml_train_*.out
tail -n +1 persistent-data/logs/ml_train_*.err
4) Compose and .env consistency
- Prod: `DATABASE_URL` is read directly from `.env` and should point to the `lnu-pinakes-postgres` service.
- Dev: `docker-compose.yml` constructs `DATABASE_URL` from components. Ensure `.env` defines:
POSTGRES_DB=...
POSTGRES_USER=...
POSTGRES_PASSWORD=...
DATABASE_HOST=...
DATABASE_PORT=...
Avoid inline comments on env lines (Docker Compose will treat them literally).
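A hypothetical `.env` fragment illustrating the point (values are placeholders only; note that no line carries a trailing `# comment`):

POSTGRES_DB=pinakes
POSTGRES_USER=pinakes
POSTGRES_PASSWORD=change-me
DATABASE_HOST=lnu-pinakes-postgres
DATABASE_PORT=5432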
5) Known good behaviors in this repo
- FastAPI entrypoint (`fastapi_entrypoint.sh`) now:
  - Maps the Docker socket GID to a local group and adds `appuser` to it for dispatch.
  - Ensures `/data` ownership and permissions favor `appuser`.
- Slurm entrypoint (`slurm-image/entrypoint.sh`) now:
  - Ensures `/data` ownership and permissions favor the `slurm` user (host-mapped UID/GID).
  - Initializes the Munge and Slurm spool directories with correct ownership.
- Sbatch script (generated by the app):
  - Exports `PYTHONPATH=/data/app_code_for_slurm`, `cd`s to that path, and runs `python3 -m app.ml_train`.
  - Prints environment and path diagnostics to the job logs to simplify debugging.
6) Quick decision tree
- UI shows `Fel vid dispatch: Slurm not UP ...` → check `scontrol ping`, Munge, and the logs in `/var/log/slurm`.
- UI shows `sbatch not found ...` → rebuild `slurm-image` with the Slurm packages.
- UI shows `Permission denied ... app_code_for_slurm` → fix `/data` ownership (host `chown -R`; the container entrypoints will maintain it).
- Jobs start but fail with `ModuleNotFoundError: app` → verify the synced code exists under `/data/app_code_for_slurm/app` and check the job logs for the path dumps.
- UI stuck on `Dispatch process failed` without details → check `/data/logs/fastapi.log` for the expanded error (now logged and written to the DB).
7) Commands reference
Rebuild prod:
HOST_UID=$(id -u) HOST_GID=$(id -g) docker compose -f prod-docker-compose.yaml up --build -d
Rebuild dev:
HOST_UID=$(id -u) HOST_GID=$(id -g) docker compose up --build -d
Reset bind mount ownership (host):
sudo chown -R $(id -u):$(id -g) persistent-data
If issues persist after these steps, capture:
- The exact UI error banner text (it now includes the error cause)
- Excerpts from `persistent-data/logs/fastapi.log`
- Output of `scontrol ping`, `which sbatch`, and a relevant `ml_train_*.err`

Report these together; the error banner plus the log snippets will identify the next fix precisely.
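A quick way to gather all of that in one file before reporting (container and path names as used in this guide):

{
  docker exec slurm-master scontrol ping
  docker exec slurm-master which sbatch
  tail -n 100 persistent-data/logs/fastapi.log
  tail -n 50 persistent-data/logs/ml_train_*.err
} > slurm_diagnostics.txt 2>&1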