app.slurm_job_trigger module#
Dispatch ML training jobs to a Slurm cluster from the web app or CLI.
This module prepares and submits training jobs to a containerized Slurm master.
It synchronizes the application code to a shared /data volume, generates a
deterministic sbatch script, and executes sbatch inside the Slurm
master container via the Docker Engine. It also updates the training status in
the database to reflect submission progress or failure.
See Also#
app.ml_train : Standalone training job executed by the sbatch script.
app.main : FastAPI endpoints that initiate training dispatch.
app.database.SessionLocal : Session factory used to update status.
app.models.TrainingStatus : Singleton status row updated on dispatch.
Notes#
Primary role: synchronize code to the shared volume, write the sbatch script, and trigger job submission within the Slurm master container.
Key dependencies: a running Docker daemon, a Slurm master container whose name matches one of SLURM_MASTER_CONTAINER_NAMES, a writable filesystem mounted at /data and shared by the Slurm nodes, and a valid DATABASE_URL environment variable for the job.
Invariants: the shared path SHARED_DATA_PATH (/data) must exist and be writable; the sbatch script name is fixed by SBATCH_SCRIPT_NAME.
Examples#
>>> from app.slurm_job_trigger import create_and_dispatch_training_job
>>> create_and_dispatch_training_job()
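The first step of the dispatch described above, copying the application code into the shared volume, can be sketched as follows. This is a minimal illustration only: the helper name `sync_code_to_shared_volume` and the `app/` destination layout are assumptions, not the module's actual implementation.

```python
import shutil
from pathlib import Path


def sync_code_to_shared_volume(src: str, shared_root: str) -> Path:
    """Copy the application code into the shared volume (hypothetical helper).

    Mirrors the module's first step: give the Slurm nodes a fresh copy
    of the code under the shared /data mount.
    """
    dest = Path(shared_root) / "app"
    # dirs_exist_ok=True refreshes an existing copy in place.
    shutil.copytree(src, dest, dirs_exist_ok=True)
    return dest
```

In the real module the destination is fixed by SHARED_DATA_PATH; here it is passed in so the sketch can be exercised against any directory.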
- app.slurm_job_trigger.clear_training_flag_on_failure(reason: str = 'Failed to dispatch job') → None[source]#
Reset the training flag and record a failure reason.
Sets TrainingStatus.is_training to False and stores the provided reason in TrainingStatus.current_horizon for traceability. Intended to be called whenever dispatching fails so that the UI reflects a stopped state with a short explanation.
- Parameters:
  - reason : str, optional
    Human-readable failure reason persisted to the database. Must be non-empty; defaults to "Failed to dispatch job".
- Raises:
  AssertionError
    If reason is an empty string.
Notes
Any sqlalchemy.exc.SQLAlchemyError is caught and logged; the function does not raise on database errors.
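The documented behavior can be illustrated with a small sketch. The dataclass below is a stand-in for the real SQLAlchemy-backed app.models.TrainingStatus row, and the helper name `clear_training_flag` is hypothetical; only the flag reset, the reason stored in current_horizon, and the non-empty assertion come from the docstring above.

```python
from dataclasses import dataclass


@dataclass
class TrainingStatus:
    """Stand-in for app.models.TrainingStatus (illustrative only)."""
    is_training: bool = True
    current_horizon: str = ""


def clear_training_flag(status: TrainingStatus,
                        reason: str = "Failed to dispatch job") -> None:
    # The documented contract: an empty reason is rejected.
    assert reason, "reason must be a non-empty string"
    status.is_training = False
    # The failure reason is stored in current_horizon for traceability.
    status.current_horizon = reason
```

The real function additionally opens a database session and swallows SQLAlchemyError, which this in-memory sketch omits.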
- app.slurm_job_trigger.create_and_dispatch_training_job() → None[source]#
Prepare code, write the sbatch script, and dispatch the training job to Slurm.
High-level orchestration that ensures the shared /data volume has a fresh copy of the application code, writes a minimal sbatch script with the required environment variables, and triggers submission on the Slurm master container. Updates app.models.TrainingStatus accordingly, or resets the training flag with a human-readable reason on failure.
- Returns:
  None
Notes
The function relies on the DATABASE_URL environment variable. If it is missing, no submission is attempted and the status is cleared. Operational errors from Docker, the filesystem, or the database are logged and cause a safe status reset without raising exceptions to the caller.
Examples
>>> from app.slurm_job_trigger import create_and_dispatch_training_job
>>> create_and_dispatch_training_job()
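The "minimal sbatch script" step can be sketched as below. The exact #SBATCH directives, job name, log path, and entry point used by the module are not documented here, so they are assumptions; only the DATABASE_URL export and the script file name (seen in the trigger_slurm_job example) are taken from this page.

```python
from pathlib import Path


def write_sbatch_script(shared_root: str, database_url: str) -> Path:
    """Write a minimal sbatch script (illustrative format only)."""
    script = "\n".join([
        "#!/bin/bash",
        "#SBATCH --job-name=ml_training",          # assumed job name
        "#SBATCH --output=/data/ml_training_%j.log",  # assumed log path
        # The Slurm job needs DATABASE_URL to update TrainingStatus itself.
        f"export DATABASE_URL='{database_url}'",
        "python /data/app/ml_train.py",            # assumed entry point
        "",
    ])
    path = Path(shared_root) / "run_ml_training_job.sbatch"
    path.write_text(script)
    return path
```

Writing the script into the shared volume is what lets the Slurm master, which only shares the /data mount with the web app, execute it.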
- app.slurm_job_trigger.trigger_slurm_job(script_path: str) → bool[source]#
Submit an sbatch job to the Slurm master container.
Finds the Slurm master container, executes sbatch with the provided script path inside that container, and parses the output to confirm submission. Logs the full outcome, including STDOUT/STDERR, for diagnosis.
- Parameters:
  - script_path : str
    Absolute path to the .sbatch script inside the shared volume (typically under /data) that Slurm should execute.
- Returns:
  - bool
    True if the job was successfully submitted (detected by the presence of "Submitted batch job" in STDOUT); False otherwise, or if the Slurm master container could not be found.
- Raises:
  AssertionError
    If script_path is an empty string.
Examples
>>> # Requires a running Slurm master container and shared volume
>>> from app.slurm_job_trigger import trigger_slurm_job
>>> trigger_slurm_job("/data/run_ml_training_job.sbatch")
True
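The success check documented above, detecting "Submitted batch job" in STDOUT, can be sketched without a running cluster. Both helper names below are hypothetical; the sentinel string is the one stated in the Returns section, and sbatch's actual success line has the form "Submitted batch job <id>".

```python
from typing import Optional


def submission_succeeded(stdout: str) -> bool:
    """Return True when sbatch output confirms submission.

    Mirrors the documented check: look for the sentinel
    'Submitted batch job' in the captured STDOUT.
    """
    return "Submitted batch job" in stdout


def submitted_job_id(stdout: str) -> Optional[int]:
    """Extract the numeric job id from sbatch output, or None on failure."""
    for line in stdout.splitlines():
        if line.startswith("Submitted batch job"):
            try:
                return int(line.split()[-1])
            except ValueError:
                return None
    return None
```

Keeping the parsing separate from the Docker exec call makes this part of the dispatch path easy to unit-test, whereas the container lookup and exec require a live daemon.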