app.slurm_job_trigger module

Dispatch ML training jobs to a Slurm cluster from the web app or CLI.

This module prepares and submits training jobs to a containerized Slurm master. It synchronizes the application code to a shared /data volume, generates a deterministic sbatch script, and executes sbatch inside the Slurm master container via the Docker Engine. It also updates the training status in the database to reflect submission progress or failure.
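
As a rough illustration of the status bookkeeping described above, assuming the documented SessionLocal factory and TrainingStatus singleton row (the query shape is an assumption, not the module's actual code):

>>> # Illustrative sketch of the status update; assumes the singleton row exists
>>> from app.database import SessionLocal
>>> from app.models import TrainingStatus
>>> session = SessionLocal()
>>> status = session.query(TrainingStatus).first()  # singleton status row
>>> status.is_training = True
>>> session.commit()
>>> session.close()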

See Also

app.ml_train

Standalone training job executed by the sbatch script.

app.main

FastAPI endpoints that initiate training dispatch.

app.database.SessionLocal

Session factory used to update status.

app.models.TrainingStatus

Singleton status row updated on dispatch.

Notes

  • Primary role: synchronize code to shared volume, write the sbatch script, and trigger job submission within the Slurm master container.

  • Key dependencies: a running Docker daemon, a Slurm master container whose name matches one of SLURM_MASTER_CONTAINER_NAMES, a writable filesystem mounted at /data shared by Slurm nodes, and a valid DATABASE_URL environment variable for the job.

  • Invariants: the shared path SHARED_DATA_PATH (/data) must exist and be writable; the sbatch script name is fixed by SBATCH_SCRIPT_NAME. A minimal pre-flight check is sketched after this list.
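
A pre-flight check mirroring these dependencies and invariants might look like the following; the assertions are a hypothetical sketch, not part of the module's API:

>>> import os
>>> assert os.environ.get("DATABASE_URL"), "DATABASE_URL must be set"
>>> assert os.path.isdir("/data") and os.access("/data", os.W_OK), "/data must exist and be writable"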

Examples

>>> from app.slurm_job_trigger import create_and_dispatch_training_job
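>>> # Requires a running Docker daemon, a Slurm master container, and DATABASE_URL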
>>> create_and_dispatch_training_job()
app.slurm_job_trigger.clear_training_flag_on_failure(reason: str = 'Failed to dispatch job') → None

Reset the training flag and record a failure reason.

Sets TrainingStatus.is_training to False and stores the provided reason in TrainingStatus.current_horizon for traceability. Intended to be called whenever dispatching fails so that the UI reflects a stopped state with a short explanation.

Parameters:
reason : str, optional

Human-readable failure reason persisted to the database, by default "Failed to dispatch job". Must be non-empty.

Raises:
AssertionError

If reason is an empty string.
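
Examples

A minimal usage sketch; the reason string here is illustrative:

>>> from app.slurm_job_trigger import clear_training_flag_on_failure
>>> clear_training_flag_on_failure("Slurm master container not found")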

app.slurm_job_trigger.create_and_dispatch_training_job() → None

Prepare code, write sbatch script, and dispatch the training job to Slurm.

High-level orchestration that ensures the shared /data volume has a fresh copy of the application code, writes a minimal sbatch script with required environment variables, and triggers submission on the Slurm master container. Updates app.models.TrainingStatus accordingly, or resets the training flag with a human-readable reason on failure.

Returns:
None

Notes

  • The function relies on the DATABASE_URL environment variable. If it is missing, no submission is attempted and the status is cleared.

  • Operational errors from Docker, filesystem, or database are logged and cause a safe status reset without raising exceptions to the caller.

Examples

>>> from app.slurm_job_trigger import create_and_dispatch_training_job
>>> create_and_dispatch_training_job()
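
Conceptually, the full dispatch flow can be pictured as the sketch below; the /app source path, the SBATCH options, and the script contents are illustrative assumptions, not the module's actual internals:

>>> # DATABASE_URL must be set in the environment (see Notes above)
>>> import os, shutil
>>> from app.slurm_job_trigger import clear_training_flag_on_failure, trigger_slurm_job
>>> _ = shutil.copytree("/app", "/data/app", dirs_exist_ok=True)  # sync code to shared volume
>>> script = "/data/run_ml_training_job.sbatch"
>>> with open(script, "w") as f:
...     _ = f.write("#!/bin/bash\n"
...                 "#SBATCH --job-name=ml_training\n"
...                 f"export DATABASE_URL={os.environ['DATABASE_URL']}\n"
...                 "python -m app.ml_train\n")
>>> if not trigger_slurm_job(script):
...     clear_training_flag_on_failure()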
app.slurm_job_trigger.trigger_slurm_job(script_path: str) → bool

Submit an sbatch job to the Slurm master container.

Finds the Slurm master container, executes sbatch with the provided script path inside that container, and parses the output to confirm submission. Logs the full outcome, including STDOUT/STDERR, for diagnosis.

Parameters:
script_path : str

Absolute path to the .sbatch script inside the shared volume (typically under /data) that Slurm should execute.

Returns:
bool

True if the job was successfully submitted (detected by the presence of "Submitted batch job" in STDOUT); False otherwise, including when the Slurm master container cannot be found.

Raises:
AssertionError

If script_path is an empty string.

Examples

>>> # Requires a running Slurm master container and shared volume
>>> from app.slurm_job_trigger import trigger_slurm_job
>>> trigger_slurm_job("/data/run_ml_training_job.sbatch")
True
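
For intuition, the submission step might look roughly like the following sketch built on the Docker SDK for Python; importing SLURM_MASTER_CONTAINER_NAMES and the name-matching rule are assumptions, not the module's actual code:

>>> # Illustrative sketch only, not the module's implementation
>>> import docker
>>> from app.slurm_job_trigger import SLURM_MASTER_CONTAINER_NAMES
>>> client = docker.from_env()
>>> master = next((c for c in client.containers.list()
...                if c.name in SLURM_MASTER_CONTAINER_NAMES), None)
>>> if master is not None:
...     exit_code, output = master.exec_run(["sbatch", "/data/run_ml_training_job.sbatch"])
...     submitted = b"Submitted batch job" in output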