Recovering from a Failed Assimilation

Sometimes the assimilate.csh script will error out and must be rerun.

For example, when running the Kilo-CAM ensemble for the first time, the integration took much, much longer than it should’ve and it hit the 12:00:00 job wallclock limit. I resubmitted the job, but there were extraneous CESM logs present due to the failed integration, thus when assimilate.csh was run it exited with the following message:

ERROR: Too many cesm.log files (3) for the 1 restart sets.
Clean out the cesm.log files from failed cycles.

So let’s do just that:

$ cd /lustre/scratch/x_johnsobk/FHIST_BGC.f09_d025.090.e1000/run
$ ls cesm.log*
cesm.log.17201900.201203-174150  cesm.log.17246904.201206-104359.gz  cesm.log.17325936.201207-142703
$ mv cesm.log.17201900.201203-174150 ~/

The script comments say that:

The (resulting) assimilate.csh script is called by CESM with two arguments:
1) the CASEROOT, and
2) the assimilation cycle number in this CESM job

To continue past a trival exit such as this one, the task is to rerun assimilate.csh. This script must be run from a batch job in order for the environmental variables to be set properly.

$ sbatch run_assimilate.csh

All run_assimilate.csh does is submit a batch job to SLURM in which assimilate.csh is run and passed two variables, CASEROOT and DATA_ASSIMILATION_CYCLES.

#!/bin/csh

#SBATCH --job-name=run_assimilate
#SBATCH --ntasks=320
#SBATCH --ntasks-per-node=32
#SBATCH --time=04:00:00
#SBATCH -A k1421
#SBATCH -p workq
#SBATCH -e run_assimilate.%j.err
#SBATCH -o run_assimilate.%j.out

./assimilate.csh /lustre/project/k1421/cases/FHIST_BGC.f09_d025.090.e1000 1

exit 0