Running ATS model#
This section provides the scripts used for executing ATS models on both a local PC and high-performance computing (HPC) systems (e.g., NERSC).
Using Docker#
A Docker image can be used to quickly test-run ATS models. The following example shows how to run an ATS model using one of the example input files.
cd ats-workflow
docker run -it --rm -v $(pwd):/home/amanzi_user/work pshuai/ats:v1.5 /bin/bash -c "cd model/1-spinup_steadystate && ats --xml_file=../inputs/CoalCreek_spinup_steadystate.xml"
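If the image is not yet available locally, docker run will pull it automatically on first use; it can also be pulled explicitly beforehand (using the same image tag as in the example above):
# pre-fetch the ATS image used in the example
docker pull pshuai/ats:v1.5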
Note
Docker is useful for testing and debugging. However, it is not recommended for running large simulations. For production runs, see the following sections.
Single Job on Local PC#
Follow the ATS installation guide to install ATS on your local PC. Once ATS is installed, you can run it using the following commands.
# using single core
ats --xml_file=input.xml
# using multiple cores
mpirun -n 4 ats --xml_file=input.xml
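For longer runs on a local machine, the output can be redirected to a log file and the simulation left running in the background. This is a generic shell pattern, not specific to ATS; the log file name below is only an example:
# redirect stdout/stderr to a log file and run in the background
nohup mpirun -n 4 ats --xml_file=input.xml > ats_run.log 2>&1 &
# follow the log as the run progresses
tail -f ats_run.log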
Single Job on High-performance Computing (HPC)#
A shell script is usually used for submitting jobs on HPC. Here is a sample job script for running ATS on Cori at NERSC.
Note
Depending on the HPC system and node architecture, the job script may differ slightly. Refer to the machine’s documentation (e.g., NERSC).
#!/bin/bash -l
#SBATCH -A PROJECT_REPO
#SBATCH -N 2
#SBATCH -t 14:00:00
#SBATCH -L SCRATCH
#SBATCH -J JOB_NAME
#SBATCH --qos regular
#SBATCH -C haswell
cd $SLURM_SUBMIT_DIR
srun -n 64 ats --xml_file=./input.xml
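Save the script (e.g., as run_ats.sh, a hypothetical filename) and submit it with sbatch; the job status can then be checked with standard Slurm commands:
# submit the job script
sbatch run_ats.sh
# check the status of your queued and running jobs
squeue -u $USER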
Batch Job#
This section provides scripts for launching batch jobs on HPC, which is useful for ensemble runs.
Using for loop#
Launch jobs using a single sbatch script: write a for loop and launch each job with srun. The example script below submits 40 small jobs through a single submission using a for loop. It requests a total of 160 nodes (each job uses 4 nodes) and a time limit of 22 hours.
Important
Use & after each srun command, and wait after the entire for loop.
Note
Add sleep to provide additional time between each job submission. #SBATCH --no-kill is an option to prevent batch job failure if one of the nodes it has been allocated fails. The user assumes responsibility for fault tolerance should a node fail (don’t use this option if you are not sure). See the Slurm documentation for more information.
#!/bin/bash -l
#SBATCH -A PROJECT_REPO
#SBATCH -N 160
#SBATCH -t 22:00:00
#SBATCH -L SCRATCH
#SBATCH -J JOB_NAME
#SBATCH --qos regular
#SBATCH --mail-type ALL
#SBATCH --mail-user pin.shuai@pnnl.gov
#SBATCH -C haswell
#SBATCH --no-kill
module use -a /global/project/projectdirs/m3421/ats-new/modulefiles
module load ats/ecoon-land_cover/cori-haswell/intel-6.0.5-mpich-7.7.10/opt
for i in {1..40}
do
cd /path/to/batch_job/ens$i
srun -N 4 -n 128 -e job%J.err -o job%J.out ats --xml_file=input.xml &
sleep 5
done
sleep 1
wait
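Because each srun launches a separate job step within the same allocation, the progress of individual ensemble members can be checked with sacct once the batch job is running (JOBID below is a placeholder for the actual job ID):
# list the state and elapsed time of each srun step in the allocation
sacct -j JOBID --format=JobID,JobName,State,Elapsed,NNodes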
Pros:
Simple workflow
Finish the ensemble jobs all at once.
Cons:
Long queue time at the beginning
May not work well for relatively large ensemble runs (e.g., >200) because of potential node failures.
Using job arrays#
A job array provides a convenient way to schedule many similar jobs quickly and easily. The example script below submits 30 jobs; each job uses 1 node and has the same time limit and QOS. See the Slurm documentation on job arrays.
#!/bin/bash -l
#SBATCH -A PROJECT_REPO
#SBATCH -N 1
#SBATCH -t 24:00:00
#SBATCH -L SCRATCH
#SBATCH --ntasks-per-node 32
#SBATCH --array=1-30
#SBATCH -J JOB_NAME
#SBATCH --qos regular
#SBATCH --mail-type ALL
#SBATCH --mail-user pin.shuai@pnnl.gov
#SBATCH --output=array_%A_%a.out
module use -a /global/project/projectdirs/m3421/ats-new/modulefiles
module load ats/ecoon-land_cover/cori-haswell/intel-6.0.5-mpich-7.7.10/opt
# Print the task id.
echo "My SLURM_ARRAY_TASK_ID: " $SLURM_ARRAY_TASK_ID
inode=${SLURM_ARRAY_TASK_ID}
cd /path/to/batch_job/ens$inode
srun -n 32 ats --xml_file=input.xml
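One convenience of job arrays is that individual tasks can be cancelled or resubmitted without touching the rest of the ensemble; for example (JOBID and the script name job_array.sh are placeholders):
# cancel a single array task (e.g., ensemble member 5)
scancel JOBID_5
# resubmit only selected (e.g., failed) ensemble members
sbatch --array=3,7,12 job_array.sh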
Pros:
Submit jobs quickly and easily.
The queue time may be short if running smaller jobs (e.g., with a short time limit).
Cons:
Long queue time overall if running large ensembles, because the Slurm scheduler prioritizes fewer jobs requesting many nodes ahead of many jobs requesting fewer nodes (array tasks are considered individual jobs).
Using MATK#
Use Python scripts to combine input file generation, job submission, sensitivity analysis, and model evaluation through the Model Analysis ToolKit (MATK). MATK facilitates model analysis within the Python computational environment. See the MATK documentation for more information.
#!/bin/bash -l
#SBATCH --account=PROJECT_REPO
#SBATCH -N 66
#SBATCH --tasks-per-node=32
#SBATCH -t 06:00:00
#SBATCH --job-name=sensitivity_study
#SBATCH --qos regular
#SBATCH --mail-type ALL
#SBATCH --mail-user pin.shuai@pnnl.gov
#SBATCH -C haswell
#SBATCH -o ./sensitivity_study.out
#SBATCH -e ./sensitivity_study.err
module use -a /global/project/projectdirs/m3421/ats-new/modulefiles
module load python
module load matk
module load ats/ecoon-land_cover/cori-haswell/intel-6.0.5-mpich-7.7.10/opt
python run_sensitivity.py
Note
run_sensitivity.py is a Python script written by the user to generate model parameters for the ensemble, schedule the forward runs, and post-process the model results.
Pros:
Manage jobs more efficiently (e.g., keep track of finished and unfinished jobs for resubmission)
Can leverage more functionality from MATK (e.g., model calibration using PEST)
Cons:
Need to install MATK and learn how to use it
The total queue time may be longer than a big for loop submission.
Note
For more workflow tools, see the NERSC documentation.