2025-06-12_separat_slurm_kluster_felsokning (separate Slurm cluster troubleshooting)

This guide provides a practical, end-to-end procedure for diagnosing and fixing Slurm job dispatch problems in the containerized MLOps environment. It focuses on the typical failure modes we’ve observed when running the FastAPI app and the Slurm cluster as separate services sharing a bind-mounted volume (host ./persistent-data, mounted as /data in the containers).

Key takeaways:

  • Shared volume ownership matters. If /data is owned by a different UID/GID than the process writing to it, you’ll see Permission denied at runtime.

  • Ensure sbatch exists and the Slurm controller is UP before submitting jobs.

  • Keep code sync predictable: set PYTHONPATH, cd to the code root, and run python3 -m app.ml_train.

  • Docker socket access is required if the app dispatches jobs into the Slurm master container.


1) Symptoms → Root Causes → Fixes

A. Permission denied when syncing code to /data/app_code_for_slurm

Example:

cp: cannot create directory '/data/app_code_for_slurm/app': Permission denied
cp: cannot create regular file '/data/app_code_for_slurm/requirements.txt': Permission denied

Root cause:

  • The bind-mounted host folder ./persistent-data is owned by the wrong user (often root) after rebuilds. The FastAPI container (user appuser) cannot write.

Fix:

  • On host: sudo chown -R $(id -u):$(id -g) persistent-data

  • Containers now also fix ownership on startup (a minimal sketch follows after this list):

    • FastAPI entrypoint: chowns /data to appuser UID/GID and sets u+rwX,g+rwX.

    • Slurm entrypoint: chowns /data to slurm UID/GID (mapped to host UID/GID at build) and sets u+rwX,g+rwX.

  • Recreate services: docker compose up --build -d (dev) or docker compose -f prod-docker-compose.yaml up --build -d (prod).
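
A minimal sketch of the normalization both entrypoints perform, run as root early in the script (the exact variable names in the repo's entrypoints may differ):

# Normalize ownership of the shared volume for the runtime user
DATA_DIR=/data
TARGET_UID=$(id -u appuser)   # the Slurm entrypoint uses the slurm user instead
TARGET_GID=$(id -g appuser)
chown -R "${TARGET_UID}:${TARGET_GID}" "$DATA_DIR"
chmod -R u+rwX,g+rwX "$DATA_DIR"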

Verification inside FastAPI container:

id
ls -ld /data /data/app_code_for_slurm
touch /data/app_code_for_slurm/.writetest
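
If the touch succeeds, remove the marker again:

rm /data/app_code_for_slurm/.writetest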

B. Slurm controller not UP

Example UI status:

Slurm not UP (code=.. ) out=.. err=..

Root cause:

  • slurmctld has not fully started, munged is missing or has failed, or the containers started in the wrong order.

Fix:

  • In the slurm-master container, the output of scontrol ping must contain UP.

  • Check the logs (/var/log/slurm/slurmctld.log) and the Munge status; a ready-made command sequence follows after this list.

  • Ensure these mounts and settings are present (Compose):

    • /etc/slurm config is mounted

    • /sys/fs/cgroup:/sys/fs/cgroup:rw

    • privileged: true

    • slurm-master starts before workers
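
A quick check sequence from the host (assuming the controller container is named slurm-master, as elsewhere in this guide):

docker exec slurm-master scontrol ping                  # should report the primary controller is UP
docker exec slurm-master bash -c 'munge -n | unmunge'   # round-trips a credential; succeeds only if munged is running
docker exec slurm-master tail -n 50 /var/log/slurm/slurmctld.log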

C. sbatch not found

Root cause:

  • Slurm packages not installed correctly in the image.

Fix:

  • Verify: docker exec -it slurm-master which sbatch

  • Rebuild slurm-image if sbatch is missing, and ensure slurm-wlm and slurm-client are installed (verification commands follow below).
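
Verification, assuming a Debian/Ubuntu base image for slurm-image:

docker exec slurm-master which sbatch
docker exec slurm-master sbatch --version
docker exec slurm-master bash -c 'dpkg -l | grep -E "slurm-wlm|slurm-client"'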

D. Python import error: ModuleNotFoundError: No module named 'app'

Root cause:

  • Code sync did not create /data/app_code_for_slurm/app/, or PYTHONPATH/CWD incorrect.

Fix:

  • The sbatch script (generated by the app) sets PYTHONPATH=/data/app_code_for_slurm, cds to the code root, and runs python3 -m app.ml_train (key lines are sketched after this list).

  • Verify job logs in /data/logs/ml_train_<jobid>.out show directory listings and Python introspection (we print sys.path before training starts).
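
For reference, the key lines of such a generated script look roughly like this (the #SBATCH directives are illustrative; the app writes the real file):

#!/bin/bash
#SBATCH --job-name=ml_train
#SBATCH --output=/data/logs/ml_train_%j.out
#SBATCH --error=/data/logs/ml_train_%j.err
# %j above expands to the numeric Slurm job id
export PYTHONPATH=/data/app_code_for_slurm
cd /data/app_code_for_slurm || exit 1
python3 -m app.ml_train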

E. Docker exec failures from the app container

Root cause:

  • Docker socket not mounted, or app user not in the socket group.

Fix:

  • Mount /var/run/docker.sock into the FastAPI container.

  • The entrypoint maps the socket GID inside the container and adds appuser to that group (sketched below).
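
A sketch of that GID mapping, assuming the group name dockersock is free to use (the repo's entrypoint may name it differently):

SOCK_GID=$(stat -c '%g' /var/run/docker.sock)                        # GID that owns the mounted socket
getent group "$SOCK_GID" >/dev/null || groupadd -g "$SOCK_GID" dockersock
usermod -aG "$(getent group "$SOCK_GID" | cut -d: -f1)" appuser      # add appuser to whichever group holds that GID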


2) Ubuntu-specific gotchas (22.04 / 24.04)

  • Files in a bind-mount are created with the UID/GID of the process that writes them. If the first writer runs as root, the host files become root-owned, breaking subsequent writes by non-root app containers.

  • Fix by ensuring each service that writes to /data normalizes ownership on startup (we now do this in both FastAPI and Slurm entrypoints).

  • If you rebuild images with a different HOST_UID/HOST_GID, previously written files may still be owned by the old IDs. Run a host-side chown -R to reset (a check for stale owners follows below).
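
To spot files still owned by stale IDs before resetting, something like:

find persistent-data \( ! -uid "$(id -u)" -o ! -gid "$(id -g)" \) -ls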


3) Validation Checklist

On host:

ls -ld persistent-data
sudo chown -R $(id -u):$(id -g) persistent-data

FastAPI container:

id
python -V
ls -ld /data /data/app_code_for_slurm /data/logs
python - <<'PY'
import os, sys
print('PYTHONPATH=', os.getenv('PYTHONPATH'))
print('sys.path=', sys.path)
print('can_write?', os.access('/data', os.W_OK))
PY

Slurm master:

scontrol ping               # Should contain "UP"
which sbatch                # Should be present
squeue --noheader           # Should show jobs shortly after submit

Job logs (on host or in Slurm containers):

tail -n +1 persistent-data/logs/ml_train_*.out
tail -n +1 persistent-data/logs/ml_train_*.err

4) Compose and .env consistency

  • Prod: DATABASE_URL is read directly from .env and should point to the lnu-pinakes-postgres service.

  • Dev: docker-compose.yml constructs DATABASE_URL from components. Ensure .env defines:

POSTGRES_DB=...
POSTGRES_USER=...
POSTGRES_PASSWORD=...
DATABASE_HOST=...
DATABASE_PORT=...

Avoid inline comments on env lines (Docker Compose will treat them literally).
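
For orientation, a component-built URL typically ends up in this shape (the postgresql:// scheme and the interpolation are an assumption here; check the compose file for the authoritative form):

DATABASE_URL=postgresql://${POSTGRES_USER}:${POSTGRES_PASSWORD}@${DATABASE_HOST}:${DATABASE_PORT}/${POSTGRES_DB}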


5) Known good behaviors in this repo

  • FastAPI entrypoint (fastapi_entrypoint.sh) now:

    • Maps Docker socket GID to a local group and adds appuser to it for dispatch.

    • Ensures /data ownership and permissions favor appuser.

  • Slurm entrypoint (slurm-image/entrypoint.sh) now:

    • Ensures /data ownership and permissions favor the slurm user (host-mapped UID/GID).

    • Initializes Munge and Slurm spool directories with correct ownership.

  • sbatch script (generated by the app):

    • Exports PYTHONPATH=/data/app_code_for_slurm, cds to that path, and runs python3 -m app.ml_train.

    • Prints environment and path diagnostics to job logs to simplify debugging.


6) Quick decision tree

  1. UI shows: Fel vid dispatch: Slurm not UP ... (“Error during dispatch”) → Check scontrol ping, Munge, and the logs in /var/log/slurm.

  2. UI shows: sbatch not found ... → Rebuild slurm-image with Slurm packages.

  3. UI shows: Permission denied ... app_code_for_slurm → Fix /data ownership (host chown -R; container entrypoints will maintain it).

  4. Jobs start but fail with ModuleNotFoundError: app → Verify the code sync created /data/app_code_for_slurm/app/ and check the job logs for the path dumps.

  5. UI stuck on Dispatch process failed without details → Check /data/logs/fastapi.log for the expanded error (now logged and written to the DB).


7) Commands reference

Rebuild prod:

HOST_UID=$(id -u) HOST_GID=$(id -g) docker compose -f prod-docker-compose.yaml up --build -d

Rebuild dev:

HOST_UID=$(id -u) HOST_GID=$(id -g) docker compose up --build -d

Reset bind mount ownership (host):

sudo chown -R $(id -u):$(id -g) persistent-data

If issues persist after these steps, capture:

  • The exact UI error banner text (now includes the error cause)

  • Excerpts from persistent-data/logs/fastapi.log (on the host)

  • Output of scontrol ping, which sbatch, and a relevant ml_train_*.err

and report them. The error banner plus the log snippets will pinpoint the next fix; the helper below gathers everything in one pass.
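
A small host-side helper to collect the above in one file (paths as used throughout this guide; adjust the container name if yours differs):

{
  echo '=== scontrol ping ==='
  docker exec slurm-master scontrol ping
  echo '=== which sbatch ==='
  docker exec slurm-master which sbatch
  echo '=== fastapi.log (last 100 lines) ==='
  tail -n 100 persistent-data/logs/fastapi.log
  echo '=== newest ml_train .err (last 50 lines) ==='
  ls -t persistent-data/logs/ml_train_*.err 2>/dev/null | head -n 1 | xargs -r tail -n 50
} > slurm_debug_report.txt 2>&1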