app.slurm_job_trigger module#
Dispatch ML training jobs to a Slurm cluster from the web app or CLI.
This module prepares and submits training jobs to a containerized Slurm master.
It synchronizes the application code to a shared /data volume, generates a
deterministic sbatch script, and executes sbatch inside the Slurm
master container via the Docker Engine. It also updates the training status in
the database to reflect submission progress or failure.
See Also#
app.ml_train : Standalone training job executed by the sbatch script.
app.main : FastAPI endpoints that initiate training dispatch.
app.database.SessionLocal : Session factory used to update status.
app.models.TrainingStatus : Singleton status row updated on dispatch.
Notes#
Primary role: synchronize code to the shared volume, write the sbatch script, and trigger job submission within the Slurm master container.
Key dependencies: a running Docker daemon, a Slurm master container whose name matches one of SLURM_MASTER_CONTAINER_NAMES, a writable filesystem mounted at /data and shared by the Slurm nodes, and a valid DATABASE_URL environment variable for the job.
Invariants: the shared path SHARED_DATA_PATH (/data) must exist and be writable; the sbatch script name is fixed by SBATCH_SCRIPT_NAME.
Examples#
>>> from app.slurm_job_trigger import create_and_dispatch_training_job
>>> create_and_dispatch_training_job()
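The first step of the dispatch described above, copying the application code into the shared volume, can be sketched as follows. This is a minimal illustration only: the helper name `sync_code_to_shared_volume` and the `app/` destination layout are assumptions, not the module's actual implementation.

```python
import shutil
from pathlib import Path


def sync_code_to_shared_volume(src: str, shared_root: str) -> Path:
    """Copy the application code into the shared volume (hypothetical helper).

    Mirrors the module's first step: give the Slurm nodes a fresh copy
    of the code under the shared /data mount.
    """
    dest = Path(shared_root) / "app"
    # dirs_exist_ok=True refreshes an existing copy in place.
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest
```

In the real module the destination is fixed by SHARED_DATA_PATH; here it is passed in so the sketch can be exercised against any directory.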
- app.slurm_job_trigger.clear_training_flag_on_failure(reason: str = 'Failed to dispatch job') → None[source]#
Reset the training flag and record a failure reason.
Sets TrainingStatus.is_training to False and stores the provided reason in TrainingStatus.current_horizon for traceability. Intended to be called whenever dispatching fails so that the UI reflects a stopped state with a short explanation.
- Parameters:
  - reason : str, optional
    Human-readable failure reason persisted to the database. Must be non-empty; defaults to "Failed to dispatch job".
- Raises:
  AssertionError
    If reason is an empty string.
Notes
Any sqlalchemy.exc.SQLAlchemyError is caught and logged; the function does not raise on database errors.
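The documented behavior can be illustrated with a small sketch. The dataclass below is a stand-in for the real SQLAlchemy-backed app.models.TrainingStatus row, and the helper name `clear_training_flag` is hypothetical; only the flag reset, the reason stored in current_horizon, and the non-empty assertion come from the docstring above.

```python
from dataclasses import dataclass


@dataclass
class TrainingStatus:
    """Stand-in for app.models.TrainingStatus (illustrative only)."""
    is_training: bool = True
    current_horizon: str = ""


def clear_training_flag(status: TrainingStatus,
                        reason: str = "Failed to dispatch job") -> None:
    # The documented contract: an empty reason is rejected.
    assert reason, "reason must be a non-empty string"
    status.is_training = False
    # The failure reason is stored in current_horizon for traceability.
    status.current_horizon = reason
```

The real function additionally opens a database session and swallows SQLAlchemyError, which this in-memory sketch omits.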
- app.slurm_job_trigger.create_and_dispatch_training_job() → None[source]#
Prepare code, write the sbatch script, and dispatch the training job to Slurm.
High-level orchestration that ensures the shared /data volume has a fresh copy of the application code, writes a minimal sbatch script with the required environment variables, and triggers submission on the Slurm master container. Updates app.models.TrainingStatus accordingly, or resets the training flag with a human-readable reason on failure.
- Returns:
  None
Notes
The function relies on the DATABASE_URL environment variable. If it is missing, no submission is attempted and the status is cleared. Operational errors from Docker, the filesystem, or the database are logged and cause a safe status reset without raising exceptions to the caller.
Examples
>>> from app.slurm_job_trigger import create_and_dispatch_training_job
>>> create_and_dispatch_training_job()
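The "minimal sbatch script" step can be sketched as below. The exact #SBATCH directives, job name, log path, and entry point used by the module are not documented here, so they are assumptions; only the DATABASE_URL export and the script file name (seen in the trigger_slurm_job example) are taken from this page.

```python
from pathlib import Path


def write_sbatch_script(shared_root: str, database_url: str) -> Path:
    """Write a minimal sbatch script (illustrative format only)."""
    script = "\n".join([
        "#!/bin/bash",
        "#SBATCH --job-name=ml_training",          # assumed job name
        "#SBATCH --output=/data/ml_training_%j.log",  # assumed log path
        # The Slurm job needs DATABASE_URL to update TrainingStatus itself.
        f"export DATABASE_URL='{database_url}'",
        "python /data/app/ml_train.py",            # assumed entry point
        "",
    ])
    path = Path(shared_root) / "run_ml_training_job.sbatch"
    path.write_text(script)
    return path
```

Writing the script into the shared volume is what lets the Slurm master, which only shares the /data mount with the web app, execute it.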
- app.slurm_job_trigger.trigger_slurm_job(script_path: str) → bool[source]#
Submit an sbatch job to the Slurm master container.
Finds the Slurm master container, executes sbatch with the provided script path inside that container, and parses the output to confirm submission. Logs the full outcome, including STDOUT/STDERR, for diagnosis.
- Parameters:
  - script_path : str
    Absolute path to the .sbatch script inside the shared volume (typically under /data) that Slurm should execute.
- Returns:
  - bool
    True if the job was successfully submitted (detected by the presence of "Submitted batch job" in STDOUT); False otherwise, or if the Slurm master container could not be found.
- Raises:
  AssertionError
    If script_path is an empty string.
Examples
>>> # Requires a running Slurm master container and shared volume
>>> from app.slurm_job_trigger import trigger_slurm_job
>>> trigger_slurm_job("/data/run_ml_training_job.sbatch")
True
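The success check documented above, detecting "Submitted batch job" in STDOUT, can be sketched without a running cluster. Both helper names below are hypothetical; the sentinel string is the one stated in the Returns section, and sbatch's actual success line has the form "Submitted batch job <id>".

```python
from typing import Optional


def submission_succeeded(stdout: str) -> bool:
    """Return True when sbatch output confirms submission.

    Mirrors the documented check: look for the sentinel
    'Submitted batch job' in the captured STDOUT.
    """
    return "Submitted batch job" in stdout


def submitted_job_id(stdout: str) -> Optional[int]:
    """Extract the numeric job id from sbatch output, or None on failure."""
    for line in stdout.splitlines():
        if line.startswith("Submitted batch job"):
            try:
                return int(line.split()[-1])
            except ValueError:
                return None
    return None
```

Keeping the parsing separate from the Docker exec call makes this part of the dispatch path easy to unit-test, whereas the container lookup and exec require a live daemon.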