List of Attempts
It took very, very many attempts to get a CAM ensemble performing well on Shaheen.
Trial 102
While we’re waiting, why don’t we try to build with just the serial NetCDF
library in cray-netcdf-hdf5parallel
to see if that’s faster.
Removed this line from config_machines.xml
:
<env name="PNETCDF_PATH">/opt/cray/pe/netcdf-hdf5parallel/4.6.3.2/INTEL/19.0</env>
Trial 101
Tried various configurations of the linker flag in the Makefile:
312 SLIBS += -L$(LIB_PNETCDF) -lnetcdff_parallel
312 SLIBS += -L$(LIB_PNETCDF) -lnetcdff_parallel -lnetcdf_parallel
312 SLIBS += -L$(LIB_PNETCDF) -lpnetcdf
These all error out.
Error
-- Configuring incomplete, errors occurred!
See also "/lustre/scratch/x_johnsobk/FHIST_BGC.f09_d025.101.e0003/bld/intel/mpt/nodebug/nothreads/pio/pio1/CMakeFiles/CMakeOutput.log".
gmake: Leaving directory '/lustre/scratch/x_johnsobk/FHIST_BGC.f09_d025.101.e0003/bld/intel/mpt/nodebug/nothreads/pio/pio1'
cat: Filepath: No such file or directory
cat: Srcfiles: No such file or directory
Building PIO with netcdf support
CMake Error at /sw/xc40cle7/cmake/3.13.4/sles15_gcc7.4.1/install/share/cmake-3.13/Modules/FindPackageHandleStandardArgs.cmake:137 (message):
Could NOT find PnetCDF_Fortran (missing: PnetCDF_Fortran_LIBRARY
PnetCDF_Fortran_INCLUDE_DIR)
Trial 100
Trial 099 showed us that we can’t simply remove NetCDF from the build. Let’s try
to get the build to compile with netcdf-hdf5parallel
. We tried this earlier
and couldn’t get the linking flag to work. Brian Dobbins suggested editing the
makefile.
This attempt requires us to make several changes to config_machines.xml
:
<modules mpilib="!mpi-serial">
<command name="load">cray-netcdf-hdf5parallel/4.6.3.2</command>
</modules>
<environment_variables>
<env name="PNETCDF_PATH">/opt/cray/pe/netcdf-hdf5parallel/4.6.3.2/INTEL/19.0</env>
</environment_variables>
Without changing the makefile, we get the following error.
Error
-- Configuring incomplete, errors occurred!
See also "/lustre/scratch/x_johnsobk/FHIST_BGC.f09_d025.100.e0003/bld/intel/mpt/nodebug/nothreads/pio/pio1/CMakeFiles/CMakeOutput.log".
gmake: Leaving directory '/lustre/scratch/x_johnsobk/FHIST_BGC.f09_d025.100.e0003/bld/intel/mpt/nodebug/nothreads/pio/pio1'
Thus we attempt to edit /lustre/project/k1421/cesm2_1_3/cime/scripts/Tools/Makefile
.
312 SLIBS += -L$(LIB_PNETCDF) -lpnetcdf
to reflect the name of the static library:
312 SLIBS += -L$(LIB_PNETCDF) -lnetcdff_parallel
Even attempted to make a spoof directory:
<env name="PNETCDF_PATH">/lustre/project/k1421/netcdf-hdf5parallel/4.6.3.2/INTEL/19.0</env>
with symbolic links:
$ ln -s /opt/cray/pe/netcdf-hdf5parallel/4.6.3.2/INTEL/19.0/lib/libnetcdff_parallel.a lpnetcdff.a
$ ln -s /opt/cray/pe/netcdf-hdf5parallel/4.6.3.2/INTEL/19.0/lib/libnetcdf_parallel.a lpnetcdf.a
None of this works.
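One plausible reason the spoof failed: `-lpnetcdf` makes the linker search each `-L` directory for a file named exactly `libpnetcdf.a` or `libpnetcdf.so`, so the links above (`lpnetcdff.a`, `lpnetcdf.a`) lack the required `lib` prefix. A self-contained sketch of the naming rule, with throwaway paths standing in for the Cray libraries:

```shell
# ld resolves -lpnetcdf by looking for libpnetcdf.a / libpnetcdf.so in each
# -L directory, so a spoof link must carry the exact "lib<name>" filename.
spoof="$(mktemp -d)/lib"
mkdir -p "$spoof"
touch "$spoof/libnetcdf_parallel.a"   # stand-in for the real Cray library
ln -sf "$spoof/libnetcdf_parallel.a" "$spoof/libpnetcdf.a"
ls "$spoof"
```

With links named this way, `-L$spoof -lpnetcdf` would at least find a candidate file (whether its contents satisfy the link is a separate question).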
Trial 099
Commenting out NETCDF_PATH
from config_machines.xml
to see if that
affects the initialization time. We are trying to test the idea that only
allowing the build to use parallel netCDF might work better.
Also removing these lines from config_compilers.xml
:
<SLIBS>
<append> -L$(NETCDF_PATH) -lnetcdff -Wl,--as-needed,-L$(NETCDF_PATH)/lib -lnetcdff -lnetcdf </append>
</SLIBS>
<NETCDF_PATH>$ENV{NETCDF_PATH}</NETCDF_PATH>
This errors out.
Error
NETCDF not found: Define NETCDF_PATH or NETCDF_C_PATH and NETCDF_FORTRAN_PATH
Trial 098
Changing MAX_TASKS_PER_NODE
back to 32 since we haven’t figured out how
it’s used in setup_advanced_Rean
and why it’s doubling the requested nodes.
Building a 3 member ensemble for determining a baseline for Init Time.
An example timing file from ensemble member 0003 is here:
$ vim /lustre/project/k1421/cases/FHIST_BGC.f09_d025.098.e0003/timing/cesm_timing_0003.FHIST_BGC.f09_d025.098.e0003.17867122.201223-212516
Init Time : 1375.139 seconds
Run Time : 266.362 seconds 1065.447 seconds/day
The Init Time of 1375.139 seconds
is our benchmark for improvement.
Trial 095
Building with 500 ensemble members since we will not be able to get through the experiment, even with 4 million core hours, given the current level of performance efficiency.
$ cd /lustre/project/k1421/cases/FHIST_BGC.f09_d025.093.e500
Also made sure that DEBUG=FALSE
in setup_advanced_Rean
:
# DEBUG = TRUE implies turning on run and compile time debugging.
# INFO_DBUG level of debug output, 0=minimum, 1=normal, 2=more, 3=too much.
./xmlchange DEBUG=FALSE
./xmlchange INFO_DBUG=0
Trial 094
Rebuilding with 1000 members in case the efforts to fix the broken Trial 090 take much longer than expected.
./xmlchange DOUT_S=TRUE
Trial 093
$ cd /lustre/project/k1421/cases/FHIST_BGC.f09_d025.093.e500
$ ./xmlquery JOB_WALLCLOCK_TIME
Results in group case.run
JOB_WALLCLOCK_TIME: 12:00:00
Results in group case.st_archive
JOB_WALLCLOCK_TIME: 1:00
$ ./xmlchange --subgroup case.run JOB_WALLCLOCK_TIME=12:00:00
Trial 092
To avoid the error in Trial 091, we changed:
&filter_nml
...
single_file_in = .false.
perturb_from_single_instance = .false.
...
/
and made another attempt.
Trial 091
Since 090 takes so long to build, this case runs with 100 members to see if the copied restart files for ensemble members 81-100 can be used without issue before we run with 1000 members.
Result
The case runs to completion; however, the assimilation fails, perhaps because
we changed single_file_in = .true.
to get the perturbation working.
From the log file /lustre/scratch/x_johnsobk/FHIST_BGC.f09_d025.091.e100/run/da.log.17183899.201202-061559
:
Error
ERROR FROM:
PE 512: direct_netcdf_mod:
routine: direct_netcdf_mod:
message: If using single_file_in/single_file_out = .true.
message: ... you must have a member dimension in your input/output file.
Rank 512 [Wed Dec 2 08:08:45 2020] [c6-1c2s11n0] application called MPI_Abort(comm=0x84000004, 99) - process 512
Trial 090
Since we don’t know how long it’ll take to get through the run we change the
wallclock time to 12:00:00
.
$ cd /lustre/project/k1421/cases/FHIST_BGC.f09_d025.090.e1000
$ ./xmlquery JOB_WALLCLOCK_TIME
Results in group case.run
JOB_WALLCLOCK_TIME: 12:00:00
Results in group case.st_archive
JOB_WALLCLOCK_TIME: 1:00
$ ./xmlchange --subgroup case.run JOB_WALLCLOCK_TIME=12:00:00
After the error in Trial 091, we change the filter namelist since we aren't using perturb_from_single_instance:
&filter_nml
...
single_file_in = .false.
perturb_from_single_instance = .false.
...
/
Trial 089
Testing how to use .i.
files in a hybrid run.
I’m not sure how to use .i. files in a hybrid run rather than .r. files.
Here is the relevant page in the CIME documentation.
To test whether we can do this, change rpointer atm contents to .i.
rather
than .r.
and see if it works in line 1257 of setup_advanced_Rean
.
$ ncdump -h cam_initial_0001.nc
shows that the initial file is an 'i'
file rather than an 'r'
file.
This, and the change from 'r'
to 'i'
, also seems to suggest that it’s
an initial file: /lustre/scratch/x_johnsobk/FHIST_BGC.f09_d025.088.e03/run/cam_initial_0001.nc
.
In setup_advanced_Rean
Line 1283, we’re linking the 'i'
files here.
We should be able to explain the purpose of the slew of rpointer files:
@ inst=1
while (\$inst <= $num_instances)
set inst_string = \`printf _%04d \$inst\`
${LINK} -f ${case}.cam\${inst_string}.i.\${restart_time}.nc cam_initial\${inst_string}.nc
@ inst ++
end
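A bash rendering of the csh loop above may make its purpose clearer: each instance's CAM '.i.' file for the restart time is linked to the generic cam_initial name. The case name, restart time, and instance count below are hypothetical, and ${LINK} is assumed to be ln -s.

```shell
# Link each instance's CAM initial ('.i.') file to the cam_initial_NNNN.nc
# name that the run scripts expect (all values here are hypothetical).
cd "$(mktemp -d)"
num_instances=3
case_name="FHIST_BGC.f09_d025.089.e3"
restart_time="1979-01-01-00000"
for inst in $(seq 1 "$num_instances"); do
  inst_string=$(printf '_%04d' "$inst")
  ln -sf "${case_name}.cam${inst_string}.i.${restart_time}.nc" \
         "cam_initial${inst_string}.nc"
done
```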
Trial 088
Edit DART_config.template
to set:
./xmlchange DATA_ASSIMILATION_ATM=TRUE
...
if ($?CIMEROOT) ./xmlchange DATA_ASSIMILATION_SCRIPT=${CASEROOT}/assimilate.csh
Do adaptive inflation and run an assimilation cycle.
inf_flavor = 5, 0
Is it worth considering cray-mpich
versus mpt
?
In config_machines.xml
try adding:
NCAR_LIBS_PNETCDF=-Wl,-Bstatic -lpnetcdf -Wl,-Bdynamic
Add -Wl,-Bstatic -lpnetcdf -Wl,-Bdynamic
.
Trial 087
Change to 1 node per instance.
Compare this with Trial 086. It looks nearly the same for total cost.
setenv timewall 2:30
total pes active : 96
mpi tasks per node : 32
pe count for cost estimate : 96
Overall Metrics:
Model Cost: 51504.09 pe-hrs/simulated_year
Model Throughput: 0.04 simulated_years/day
Init Time : 1837.831 seconds
Run Time : 1322.879 seconds 5291.516 seconds/day
Final Time : 0.081 seconds
Trial 086
Changing ESP
to 32
.
total pes active : 384
mpi tasks per node : 32
pe count for cost estimate : 384
Overall Metrics:
Model Cost: 50881.78 pe-hrs/simulated_year
Model Throughput: 0.18 simulated_years/day
Init Time : 1364.705 seconds
Run Time : 326.724 seconds 1306.895 seconds/day
Final Time : 0.017 seconds
Actual Ocn Init Wait Time : 0.000 seconds
Estimated Ocn Init Run Time : 0.006 seconds
Estimated Run Time Correction : 0.006 seconds
(This correction has been applied to the ocean and total run times)
Trial 085
Same as Trial 084, except changing:
<arg name="binding" > --cpu_bind=cores</arg>
Trial 084
Same configuration as in Trial 083 except we restripe the forcing files in accordance with Bilel Hadri’s recommendations.
Trial 083
Same configuration as Trial 082 except all of the restart files are no longer striped.
The first submission takes 00:32:41.
The second submission with ./case.submit --skip-preview-namelist -M begin,end
takes 00:32:04.
Trial 082
Setting config_machines.xml
:
<modules mpilib="mpi-serial">
<command name="load">cray-hdf5/1.10.5.2</command>
<command name="load">cray-netcdf/4.6.3.2</command>
</modules>
<modules mpilib="!mpi-serial">
<command name="load">cray-hdf5-parallel/1.10.5.2</command>
<command name="load">cray-parallel-netcdf/1.11.1.1</command>
</modules>
<env name="NETCDF_PATH">/opt/cray/pe/netcdf/4.6.3.2/INTEL/19.0</env>
<env name="PNETCDF_PATH">/opt/cray/pe/parallel-netcdf/1.11.1.1/INTEL/19.0</env>
Setting config_compilers.xml
:
<SLIBS>
<append> -L$(NETCDF_PATH) -lnetcdff -Wl,--as-needed,-L$(NETCDF_PATH)/lib -lnetcdff -lnetcdf </append>
</SLIBS>
<MPICC> cc </MPICC>
<MPICXX> CC </MPICXX>
<MPIFC> ftn </MPIFC>
<PIO_FILESYSTEM_HINTS>lustre</PIO_FILESYSTEM_HINTS>
<NETCDF_PATH>$ENV{NETCDF_PATH}</NETCDF_PATH>
Trial 081
Setting config_machines.xml
:
<command name="load">cray-netcdf/4.6.3.2</command>
<command name="load">cray-netcdf-hdf5parallel/4.6.3.2</command>
Setting config_compilers.xml
:
<SLIBS>
<append> -L$(NETCDF_PATH) -lnetcdff -Wl,--as-needed,-L$(NETCDF_PATH)/lib -lnetcdff -lnetcdf </append>
</SLIBS>
<MPICC> cc </MPICC>
<MPICXX> CC </MPICXX>
<MPIFC> ftn </MPIFC>
<PIO_FILESYSTEM_HINTS>lustre</PIO_FILESYSTEM_HINTS>
<NETCDF_PATH>$ENV{NETCDF_PATH}</NETCDF_PATH>
Trial 079
Let’s try using the SLIBS
tag instead of the parallel NetCDF tag.
Setting config_compilers.xml
:
<SLIBS>
<append>-L$(PNETCDF_PATH_KAUST)/lib -lnetcdff_parallel</append>
</SLIBS>
<MPICC> cc </MPICC>
<MPICXX> CC </MPICXX>
<MPIFC> ftn </MPIFC>
<PIO_FILESYSTEM_HINTS>lustre</PIO_FILESYSTEM_HINTS>
<NETCDF_PATH>$ENV{NETCDF_PATH_KAUST}</NETCDF_PATH>
This errors out:
Error
65: PNETCDF not enabled in the build
--prefix -> /opt/cray/pe/netcdf-hdf5parallel/4.6.3.2/INTEL/19.0
--includedir -> /opt/cray/pe/netcdf/4.6.3.2/include
--libdir -> /opt/cray/pe/netcdf-hdf5parallel/4.6.3.2/INTEL/19.0/lib
Note
Typos in the CIME documentation:
https://esmci.github.io/cime/versions/master/html/users_guide/porting-cime.html
Should be cime/tools/load_balancing_tool
(no trailing "s").
Trial 078
Made these changes to DART_config
:
# ./xmlchange DATA_ASSIMILATION_ATM=TRUE
# if ($?CIMEROOT) ./xmlchange DATA_ASSIMILATION_SCRIPT=${CASEROOT}/assimilate.csh
When we want to run the assimilation, we need to undo this change.
Changed PNETCDF_PATH_KAUST
within config_machines.xml
to:
<env name="PNETCDF_PATH_KAUST">/opt/cray/pe/netcdf-hdf5parallel/4.6.3.2/INTEL/19.0</env>
Looking at the pio.bldlogs.
Working version
/lustre/scratch/x_johnsobk/FHIST_BGC.f09_d025.075.e03/bld/pio.bldlog.201027-211925.gz
In this working build, these lines are printed:
-- Found NetCDF_Fortran: /opt/cray/pe/netcdf/4.6.3.2/INTEL/19.0/lib/libnetcdff.a
-- Found PnetCDF_Fortran: /opt/cray/pe/parallel-netcdf/1.11.1.1/INTEL/19.0/lib/libpnetcdf.a
Non-working version
/lustre/scratch/x_johnsobk/FHIST_BGC.f09_d025.078.e03/bld/pio.bldlog.201104-002417
In this non-working build, this is the final line before a build error:
Found NetCDF_Fortran: /opt/cray/pe/netcdf/4.6.3.2/INTEL/19.0/lib/libnetcdff.a
-- Configuring incomplete, errors occurred!
Could the issue be that the library is not called libpnetcdf
and instead is called
libnetcdf_parallel
but -lpnetcdf
is hard-coded into the makefile?
Trial 077
This is the same as Trial 075, we’re just rebuilding to sidestep this NetCDF Issue.
$ cd /lustre/project/k1421/cesm_store/inputdata/atm/cam/tracer_cnst/
$ mv tracer_cnst_halons_WACCM6_3Dmonthly_L70_1975-2014_c180216.nc old_tracer_cnst_halons_WACCM6_3Dmonthly_L70_1975-2014_c180216.nc
$ nccopy -k cdf5 old_tracer_cnst_halons_WACCM6_3Dmonthly_L70_1975-2014_c180216.nc tracer_cnst_halons_WACCM6_3Dmonthly_L70_1975-2014_c180216.nc
Trial 076
Setting config_compilers.xml
:
<SLIBS>
<append>-L$(NETCDF_PATH_KAUST)/lib -lnetcdff</append>
</SLIBS>
<!--<append>-L$(NETCDF_PATH_KAUST) -lnetcdff, -L$(PNETCDF_PATH_KAUST) -lpnetcdf</append>-->
Trial 075
Comment out the append line in config_compilers.xml
:
<!--<append>-L$(NETCDF_PATH_KAUST) -lnetcdff, -L$(PNETCDF_PATH_KAUST) -lpnetcdf</append>-->
223: MOSART decomp info proc = 95 begr = 192376 endr = 194400 numr = 2025
srun: Job step aborted: Waiting up to 302 seconds for job step to finish.
0: slurmstepd: error: *** STEP 16076077.0 ON nid01376 CANCELLED AT 2020-10-28T19:25:55 DUE TO TIME LIMIT ***
srun: got SIGCONT
The second run completed the atmospheric portion.
The last printed statement in all three rof files is:
001
hist_htapes_build Successfully initialized MOSART history files
------------------------------------------------------------
(Rtmini) done
Snow capping will flow out in frozen river runoff
002
hist_htapes_build Successfully initialized MOSART history files
------------------------------------------------------------
(Rtmini) done
Snow capping will flow out in frozen river runoff
003
hist_htapes_build Successfully initialized MOSART history files
------------------------------------------------------------
(Rtmini) done
Snow capping will flow out in frozen river runoff
Rerunning Trial 075 to see if it hangs at the same spot. Started at 12:42.
Trial 074
Setting config_compilers.xml
:
<append>-L$(NETCDF_PATH_KAUST) -lnetcdff, -L$(PNETCDF_PATH_KAUST) -lpnetcdf</append>
Trial 073
Setting config_compilers.xml
:
-L$(NETCDF_DIR) -lnetcdff -Wl,--as-needed,-L$(NETCDF_DIR)/lib -lnetcdff -lnetcdf
<append>-L$(NETCDF_PATH_KAUST) -lnetcdff -l, -L$(PNETCDF_PATH_KAUST) -lpnetcdf -l</append>
Error
/usr/bin/ld: cannot find -l,
Trial 072
Setting config_compilers.xml
:
<append>-L$(NETCDF_PATH_KAUST)/lib -lnetcdff, -L$(PNETCDF_PATH_KAUST)/lib -lpnetcdf</append>
Error
/usr/bin/ld: cannot find -lnetcdff
Trial 071
Setting config_compilers.xml
:
<append>-L$(NETCDF_PATH_KAUST)/lib -lnetcdff -lnetcdf, -L$(PNETCDF_PATH_KAUST)/lib -lpnetcdf</append>
Error
/usr/bin/ld: cannot find -lnetcdf,
Trial 070
Setting config_compilers.xml
:
<append>-L$(NETCDF_PATH_KAUST) -lnetcdff -Wl, -L$(PNETCDF_PATH_KAUST) -lpnetcdf -Wl</append>
This builds, but we still get an error.
Error
1: NetCDF: Attempt to use feature that was not turned on when netCDF was built.
Checking the cesm buildlog.
-lpnetcdf -Wl -mkl=cluster -L/opt/cray/pe/parallel-netcdf/1.11.1.1/INTEL/19.0/lib -lpnetcdf -mkl
ifort: command line warning #10157: ignoring option '-W'; argument is of wrong type
ifort: command line warning #10121: overriding '-mkl=cluster' with '-mkl'
Trial 069
Since the linker flag has to match the name of the shared object file,
-lpnetcdf
should work but not -lpnetcdff
.
<SLIBS>
<append> -I$(NETCDF_PATH_KAUST)/include, -I$(PNETCDF_PATH_KAUST)/include, -L$(NETCDF_PATH_KAUST)/lib -lnetcdff -lnetcdf, -L$(PNETCDF_PATH_KAUST)/lib -lpnetcdf</append>
</SLIBS>
Trial 068
Attempting the same configuration as in Trial 067, except comment out
INC_PNETCDF
and LIB_PNETCDF
. The cesm buildlog reads:
ifort: command line warning #10121: overriding '-mkl=cluster' with '-mkl'
/usr/bin/ld: cannot find -lpnetcdff
/usr/bin/ld: cannot find -lnetcdf,
/usr/bin/ld: cannot find -lpnetcdff
/usr/bin/sha1sum: /lustre/scratch/x_johnsobk/FHIST_BGC.f09_d025.068.e03/bld/cesm.exe: No such file or directory
What are the names of the shared files?
/opt/cray/pe/netcdf/4.6.3.2/INTEL/19.0/lib♡ ls *so
libbzip2_intel.so libmisc_intel.so libnetcdf_c++4_intel.so libnetcdff_intel.so libnetcdf_intel.so
libbzip2.so libmisc.so libnetcdf_c++4.so libnetcdff.so libnetcdf.so
/opt/cray/pe/parallel-netcdf/1.11.1.1/INTEL/19.0/lib♡ ls
libpnetcdf.a libpnetcdf_intel.so libpnetcdf_intel.so.3.0.1 libpnetcdf.so.3 pkgconfig
libpnetcdf_intel.a libpnetcdf_intel.so.3 libpnetcdf.so libpnetcdf.so.3.0.1
Trial 067
Setting config_machines.xml
:
<env name="PNETCDF_PATH_KAUST">/opt/cray/pe/parallel-netcdf/1.11.1.1/INTEL/19.0</env>
<env name="INC_PNETCDF_KAUST">/opt/cray/pe/parallel-netcdf/1.11.1.1/INTEL/19.0/include</env>
<env name="LIB_PNETCDF_KAUST">/opt/cray/pe/parallel-netcdf/1.11.1.1/INTEL/19.0/lib</env>
<SLIBS>
<append> -L$(NETCDF_PATH_KAUST) -lnetcdff -Wl, -L$(PNETCDF_PATH_KAUST) -lpnetcdff -Wl, --as-needed, - L$(NETCDF_PATH_KAUST)/lib -lnetcdff -lnetcdf, -L$(PNETCDF_PATH_KAUST)/lib -plnetcdff -lpnetcdf</append>
</SLIBS>
<MPICC> cc </MPICC>
<MPICXX> CC </MPICXX>
<MPIFC> ftn </MPIFC>
<PIO_FILESYSTEM_HINTS>lustre</PIO_FILESYSTEM_HINTS>
<NETCDF_PATH>$ENV{NETCDF_PATH_KAUST}</NETCDF_PATH>
<PNETCDF_PATH>$ENV{PNETCDF_PATH_KAUST}</PNETCDF_PATH>
<INC_PNETCDF>$ENV{INC_PNETCDF_KAUST}</INC_PNETCDF>
<LIB_PNETCDF>$ENV{LIB_PNETCDF_KAUST}</LIB_PNETCDF>
The PIO_CONFIG_ARGS
sets the PNETCDF_PATH
argument:
<append> -L$(NETCDF_PATH_KAUST) -lnetcdff -Wl, -L$(PNETCDF_PATH_KAUST) -lpnetcdff -Wl, -L$(NETCDF_PATH_KAUST)/lib -lnetcdff -lnetcdf, -L$(PNETCDF_PATH_KAUST)/lib -lpnetcdff -lpnetcdf</append>
This errors out when running create_newcase
…
Schemas validity error : Element 'INC_PNETCDF': This element is not expected.
Trial 066
This fails, too. What should we attempt next?
<append> -L$(NETCDF_DIR) -lnetcdff -Wl,-L$(PNETCDF_DIR)/lib -lpnetcdff -lpnetcdf,--as-needed,-L$(NETCDF_DIR)/lib - lnetcdff -lnetcdf</append>
I guess the next thing to try is to toggle through different compiler flags. It might be faster to iterate by building DART rather than building CESM.
What does the --as-needed
flag accomplish?
Trial 065
Adding back in the link to -lpnecdff
in config_compilers.xml
:
<SLIBS>
<append> -L$(NETCDF_PATH) -lnetcdff -Wl, --as-needed, -L$(PNETCDF_PATH)/lib -lpnetcdff</append>
</SLIBS>
This fails to build. Remember:
The NETCDF_PATH has a library:
/opt/cray/pe/netcdf/4.6.3.2/INTEL/19.0/lib
The PNETCDF_PATH also has a library:
/opt/cray/pe/parallel-netcdf/1.11.1.1/INTEL/19.0/lib
What to do next? I think the issue is it’s linking netcdf rather than pnetcdf.
Spitballing ideas
The directory we’re linking to is wrong.
-L goes to the lib directory.
-I goes to the include directory.
Trial 064
Removed this from config_compilers.xml
:
<SLIBS>
<append> -L$(NETCDF_PATH) -lnetcdff -Wl, --as-needed, -L$(NETCDF_PATH)/lib -lnetcdff, -L$(PNETCDF_PATH)/ lib -lnetcdff</append>
</SLIBS>
Error
1: pio_support::pio_die:: myrank= -1 : ERROR: ionf_mod.F90: 235 :
1: NetCDF: Attempt to use feature that was not turned on when netCDF was built.
Trial 063
How do we tell CESM which MPI library to use? See the MPI Libs documentation:
http://www.cesm.ucar.edu/models/cesm1.2/cesm/doc_cesm1_2_1/modelnl/machines.html
(note the !mpi-serial module selector). Querying the current setting:
/lustre/project/k1421/cases/FHIST_BGC.f09_d025.062.e03♡ ./xmlquery MPILIB
MPILIB: mpt
This DiscussCESM post <https://bb.cgd.ucar.edu/cesm/threads/viability-of-running-cesm-on-40-cores.4997/page-2#post-37159> advises: you should build and install netcdf and pnetcdf separately and link both.
Tried building with this one:
<append> -L$(NETCDF_PATH) -lnetcdff -Wl, --as-needed, -L$(NETCDF_PATH)/lib -lnetcdff, -L$(PNETCDF_PATH)/lib -lnetcdff</append>
Trial 062
If that doesn’t work, try linking to PNETCDF
path in this line of
config_compilers.xml
:
<append> -L$(NETCDF_PATH) -lnetcdff -Wl, --as-needed, -L$(NETCDF_PATH)/lib -lnetcdff</append>
Okay that fixes the PNETCDF not enabled in the build
issue.
Although now we have a different error.
Error
1: pio_support::pio_die:: myrank= -1 : ERROR: ionf_mod.F90: 235 :
1: NetCDF: Attempt to use feature that was not turned on when netCDF was built.
$ nc-config --all
...
--has-pnetcdf -> no
...
You have to define an environment variable that contains the path to the
PNETCDF
library in config_machines.xml
, and then reference that environment variable when assigning a value to the
PNETCDF_PATH
key in config_compilers.xml
. However, I actually don't
understand how the linker is made aware of that value, because the linker is
only given the serial netcdf path.
Trial 061
I’m not sure why we’re getting this error.
When trying to set:
$ ./xmlchange PIO_TYPENAME=pnetcdf
Did not find pnetcdf in valid values for PIO_TYPENAME: ['netcdf']
Examining config_pio.xml
but I’m not sure how to interpret the results.
$ vim /lustre/project/k1421/cesm2_1_3/cime/config/cesm/machines/config_pio.xml
In config_compilers.xml
:
<PNETCDF_PATH>$ENV{PARALLEL_NETCDF_DIR}</PNETCDF_PATH>
The error is from Line 238 of ${CIMEROOT}/scripts/lib/CIME/XML/entry_id.py
in the function:
get_valid_value_string in "Did not find {} in valid values for {}: {}"
What does the entry for CNL look like?
<compiler OS="CNL">
<CMAKE_OPTS>
<base> -DCMAKE_SYSTEM_NAME=Catamount</base>
</CMAKE_OPTS>
<CPPDEFS>
<append> -DLINUX </append>
<append MODEL="gptl"> -DHAVE_NANOTIME -DBIT64 -DHAVE_VPRINTF -DHAVE_BACKTRACE -DHAVE_SLASHPROC -DHAVE_COMM_F2C - DHAVE_TIMES -DHAVE_GETTIMEOFDAY </append>
</CPPDEFS>
<MPICC> cc </MPICC>
<MPICXX> CC </MPICXX>
<MPIFC> ftn </MPIFC>
<NETCDF_PATH>$ENV{NETCDF_DIR}</NETCDF_PATH>
<PIO_FILESYSTEM_HINTS>lustre</PIO_FILESYSTEM_HINTS>
<PNETCDF_PATH>$ENV{PARALLEL_NETCDF_DIR}</PNETCDF_PATH>
<SCC> cc </SCC>
<SCXX> CC </SCXX>
<SFC> ftn </SFC>
</compiler>
Edited the shaheen entry in config_compilers.xml
:
<PIO_FILESYSTEM_HINTS>lustre</PIO_FILESYSTEM_HINTS>
<PNETCDF_PATH>$ENV{PNETCDF_PATH}</PNETCDF_PATH>
<SCC> cc </SCC>
And added this to config_machines.xml
:
<env name="PNETCDF_PATH">/opt/cray/pe/parallel-netcdf/1.11.1.1/INTEL/19.0</env>
Trial 060
Using /Users/johnsonb/scratch/cheyenne/buildnml
which is the same as the
reanalysis, except it has had these lines inserted from cesm2_1_3:
docn_mode = case.get_value("DOCN_MODE")
if docn_mode and 'aqua' in docn_mode:
config['aqua_planet_sst_type'] = docn_mode
else:
config['aqua_planet_sst_type'] = 'none'
This crashes, with a warning.
Warning
PNETCDF not enabled in the build.
Is this warning present in Trial 059 as well?
$ vim /lustre/scratch/x_johnsobk/FHIST_BGC.f09_d025.059.e80/run/cesm.log.15958125.201019-232044.gz
Pattern not found: PNETCDF not enabled in the build
According to this DiscussCESM post <https://bb.cgd.ucar.edu/cesm/threads/using-pnetcdf-for-cesm.3084/>, the message indicates that PNETCDF was not linked with the application.
Trial 059
Keeping OMP_STACKSIZE
and using the old buildnml script.
<env name="OMP_STACKSIZE">256M</env>
This runs properly.
Trial 058
Changing OMP_STACKSIZE
.
<env name="OMP_STACKSIZE">256M</env>
Also using Kevin Raeder’s buildnml
script modification.
Trial 057
Changing OMP_STACKSIZE
.
<env name="OMP_STACKSIZE">1024M</env>
Doesn’t seem to affect performance.
Trial 056
Changing OMP_STACKSIZE
.
<env name="OMP_STACKSIZE">128M</env>
Doesn’t seem to affect performance.
Trial 055
This run timed out with a super user message:
Error
run.FHIST_BGC.f09_d025.055.e80 Ended, Run time 01:00:26
It also sat in the queue for a really long time so maybe there is traffic on the interconnects.
Trial 054
Toggling settings in config_machines.xml
:
<env name="MPI_COMM_MAX">16383</env>
<env name="MPI_GROUP_MAX">1024</env>
Trial 049
Added echo statements to assimilate.csh
.
To see what a bash script to submit a SLURM job looks like:
vim /lustre/project/k1421/scripts_logs/run_check_input_data_SMS_Lm13.f10_f10_musgs.I1850Clm50SpG.shaheen_intel.20200612_215255_kuwhyp
#!/bin/bash
#
#SBATCH --job-name=run_check_input_data_SMS_Lm13.f10_f10_musgs.I1850Clm50SpG.shaheen_intel.20200612_215255_kuwhyp
#SBATCH --output=run_check_input_data_SMS_Lm13.f10_f10_musgs.I1850Clm50SpG.shaheen_intel.20200612_215255_kuwhyp.txt
#SBATCH --partition=workq
#SBATCH --ntasks=1
#SBATCH --time=23:59:00
#SBATCH --mem-per-cpu=100
python run_check_input_data.py SMS_Lm13.f10_f10_musgs.I1850Clm50SpG.shaheen_intel.20200612_215255_kuwhyp
Examining assimilate.csh
, it’s crashing when trying to execute this line:
setenv NODENAMES $SLURM_NODELIST
nid00[136-147]: No match.
Cross-referencing this with the da.log and the assimilate script:
vim /lustre/scratch/x_johnsobk/FHIST_BGC.f09_d025.049.e3/run/da.log.15905662.201015-053628
vim /lustre/project/k1421/DART/models/cam-fv/shell_scripts/cesm2_1/assimilate.csh.template
inf_flavor(1) = 2, using namelist values.
[ Thu Oct 15 18:38:59 2020] [c0-0c0s9n0] Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(537):
MPID_Init(246).......: channel initialization failed
cray-netcdf/4.6.3.2(9):ERROR:150: Module 'cray-netcdf/4.6.3.2' conflicts with the currently loaded module(s) 'cray-netcdf-hdf5parallel/4.6.3.2'
ncks: Command not found
Can't load parallel netcdf and nco at the same time.
MPI_COMM_MAX=16383
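The "No match." failure above arises because csh performs filename globbing on the unquoted brackets in nid00[136-147]. A possible fix (an untested sketch) is to quote the value in assimilate.csh:

```csh
# Quote the value so csh takes the brackets literally instead of globbing.
setenv NODENAMES "$SLURM_NODELIST"
```

If the script later needs individual host names, `scontrol show hostnames "$SLURM_NODELIST"` expands the compressed list one name per line.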
Trial 048
After differencing the SourceMod of dyn_comp.F90
we’re using and the one in
cesm2_1_3, it might be worth trying to see if swapping out the SourceMod gets
us to a different point in the compiling before crashing.
Removing /lustre/project/k1421/SourceMods/cesm2_1_3/SourceMods/src.cam/dyn_comp.F90
.
This works! The run doesn’t complete, since assimilate.csh
crashes, but
this is clear progress. Next, we examine what the error log is printing.
Trial 047
Note
This is an important debugging trial because we fixed the PE issue in Trial 046 and can move on to determining why the build mysteriously hangs without producing any meaningful error messages.
We focus on determining which task happens right after GSMap indices not
increasing
. Change the debug settings, with ./xmlchange DEBUG=TRUE
and
./xmlchange INFO_DBUG=1
.
Last working startup case
The last working startup (not hybrid) case is Trial 018:
/lustre/project/k1421/cases/FHIST_BGC.f09_d025.018.e3
Its buildlog continues past GSMap indices not increasing
:
MCT::m_Router::initp_: GSMap indices not increasing...Will correct
0: calcsize j,iq,jac, lsfrm,lstoo 1 1 1 26 21
Opened file FHIST_BGC.f09_d025.018.e3.cam_0001.r.1979-01-01-21600.nc to write
Attempting to clone Kevin’s CAM Reanalysis on Cheyenne
Error
CAM Could not be built cat /glade/scratch/johnsonb/FHIST_BGC.f09_d025.001.e3/bld/atm.bldlog.201012-214748
Attempting to clone Kevin’s CAM Reanalysis on Shaheen
Working on: /glade/work/johnsonb/DART_Shaheen/models/cam-fv/shell_scripts/cesm2_1/setup_advanced_Rean
Stage directory: /glade/scratch/johnsonb/archive/f.e21.FHIST_BGC.f09_025.CAM6assim.011/rest/2019-08-05-00000
gunzip f.e21.FHIST_BGC.f09_025.CAM6assim.011.cam_00[45678]?.r.2019-08-05-00000.nc.gz
Rebuilding with cesm2.1.resl5.6 instead of cesm2_1_3
Two candidate source files:
./components/cam/src/chemistry/modal_aero/modal_aero_gasaerexch.F90
./components/cam/src/chemistry/utils/modal_aero_calcsize.F90
This is the code that prints the relevant statement in the buildlog:
1016 write( 6, '(a,3i3,2i4)' ) 'calcsize j,iq,jac, lsfrm,lstoo', &
1017 j,iq,jac, lsfrm,lstoo
This file never gets called: ./components/cam/src/chemistry/utils/modal_aero_calcsize.F90
Thus it’s actually hanging in: ./components/cam/src/dynamics/fv/cd_core.F90
We can attempt a case with a different resolution: FHIST_BGC.f19_f19_mg17.001.e3
.
However, using the build scripts doesn’t work because there isn’t a way to get the SST flux properly into the coupler.
source activate py27
cd /lustre/project/k1421/cesm2_1_3/cime/scripts
./create_newcase --case /lustre/project/k1421/cases/FHIST_BGC.f19_f19_mg17.002.e3 --machine shaheen --res f19_f19_mg17 --project k1421 --queue workq --walltime 1:00:00 --pecount 32x1 --ninst 3 --compset HIST_CAM60_CLM50%BGC-CROP_CICE%PRES_DOCN%DOM_MOSART_SGLC_SWAV --multi-driver --run-unsupported
cd /lustre/project/k1421/cases/FHIST_BGC.f19_f19_mg17.002.e3
./case.setup
./case.build
cd /lustre/project/k1421/cases/FHIST_BGC.f19_f19_mg17.002.e3
./case.submit -M begin,end
This gives us a “working” start up run with a f19_f19_mg17 grid. It’s useful because it provides two clues: where the cesm.log.* fails and where the atm_00??.log.* fails.
Examining the CESM Log
In a job submission that runs to completion, the CESM log continues past the
GSMap indices not increasing...Will correct
line:
/lustre/scratch/x_johnsobk/archive/FHIST_BGC.f19_f19_mg17.002.e3/logs/cesm.log.15883098.201014-065036.gz
transitions from the MCT::m_Router to the calcsize printouts immediately after it.
3226 0: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
3227 0: MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
3228 0: MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
3229 0: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
3230 64: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
3231 64: MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
3232 64: MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
3233 64: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
3234 32: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
3235 32: MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
3236 32: MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
3237 32: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
3238 0: calcsize j,iq,jac, lsfrm,lstoo 1 1 1 26 21
3239 0: calcsize j,iq,jac, lsfrm,lstoo 1 1 2 26 21
3240 0: calcsize j,iq,jac, lsfrm,lstoo 1 2 1 22 15
3241 0: calcsize j,iq,jac, lsfrm,lstoo 1 2 2 22 15
3242 0: calcsize j,iq,jac, lsfrm,lstoo 1 3 1 24 17
The MCT::m_Router
lines are printed from the subroutine initp_(inGSMap, inRGSMap, mycomm, Rout, name)
in m_Router.F90
:
336 if(myPid == 0) call warn(myname_,'GSMap indices not increasing...Will correct')
337 call GlobalSegMap_OPoints(inGSMap,myPid,gpoints)
Note well, in mct_mod.F90
, m_router
undergoes association renaming:
use m_Router ,only: mct_router => Router
Additionally, the calcsize
lines are from modal_aero_calcsize.F90
:
1016 write( 6, '(a,3i3,2i4)' ) 'calcsize j,iq,jac, lsfrm,lstoo', &
1017 j,iq,jac, lsfrm,lstoo
Thus the log from a non-working run, /lustre/scratch/x_johnsobk/FHIST_BGC.f09_d025.047.e3/run/atm_0001.log.15874648.201013-170022
,
ends here:
4619 FV subcycling - nv, n2, nsplit, dt = 2 1 4
4620 225.000000000000
Line 4620 is printed from ./components/cam/src/dynamics/fv/dyn_comp.F90
:
1443 write(iulog,*) 'FV subcycling - nv, n2, nsplit, dt = ', nv, n2, nsplit, dt
The working log, /lustre/scratch/x_johnsobk/archive/FHIST_BGC.f19_f19_mg17.002.e3/logs/atm_0001.log.15883098.201014-065036.gz
,
continues:
4993 FV subcycling - nv, n2, nsplit, dt = 1 1 4
4994 450.000000000000
4995 Divergence damping: use 4th order damping
Line 4995 is printed from ./components/cam/src/dynamics/fv/cd_core.F90
:
545 if (masterproc) write(iulog,*) 'Divergence damping: use 4th order damping'
So the key is to determine what happens in between Line 1443 of dyn_comp.F90
and the invocation of cd_core
which is called only once, on line 1862:
1862 call cd_core(grid, nx, u, v, pt,
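The log-to-source tracing used throughout this trial can be sketched with grep; the miniature tree below is fabricated so the example is self-contained (on Shaheen one would grep under components/cam/src):

```shell
# Find which source file emits a given log line by grepping for its text.
srcdir="$(mktemp -d)"
mkdir -p "$srcdir/dynamics/fv"
printf "write(iulog,*) 'Divergence damping: use 4th order damping'\n" \
  > "$srcdir/dynamics/fv/cd_core.F90"
printf "write(iulog,*) 'FV subcycling - nv, n2, nsplit, dt = '\n" \
  > "$srcdir/dynamics/fv/dyn_comp.F90"
grep -rln "Divergence damping" "$srcdir"
```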
Trial 046
Edited config_machines.xml
to:
<!-- <MAX_TASKS_PER_NODE>64</MAX_TASKS_PER_NODE> -->
<MAX_TASKS_PER_NODE>32</MAX_TASKS_PER_NODE>
Important
This fixes the PE crash and we now hang at a different point.
Trial 045
Forgot to build Trial 044 with OMP_STACKSIZE=256MB
.
Edited config_machines.xml
to set OMP_STACKSIZE=256MB
. Rebuilt the case.
Trial 044
./xmlchange ROOTPE_ESP=0,NTHRDS_ESP=$nthreads,NTASKS_ESP=1
Trial 043
./xmlchange NTASKS_PER_INST_ESP=1
For some reason that still results in:
NTASKS_PER_INST: ['ATM:128', 'LND:128', 'ICE:128', 'OCN:128', 'ROF:128', 'GLC:128', 'WAV:128', 'ESP:32']
Trial 042
./xmlchange OMP_STACKSIZE=128MB
Trial 041
Attempting to change the stack limit to see if it noticeably affects performance.
./xmlchange OMP_STACKSIZE=1024MB
Trial 040
128: Reading zbgc_nml
0: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
0: MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
0: MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
0: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
128: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
256: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
128: MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
256: MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
256: MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
128: MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
256: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
128: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
361: forrtl: severe (174): SIGSEGV, segmentation fault occurred
128: MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
256: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
srun: error: nid04049: tasks 37-38,40-42,44,48,50-51,54-55,57,61-63: Exited with exit code 174
srun: Terminating job step 15810768.0
0: slurmstepd: error: *** STEP 15810768.0 ON nid04048 CANCELLED AT 2020-10-08T21:09:34 ***
Trial 039
Seems like we should just try this attempt again, in case the error was caused by running the script three times simultaneously.
Trial 038
Changed set do_clm_interp = "true".
This works! However, we should go through the CESM logs to see if it's hanging anywhere.
Trial 037
Omitted modules.csh.
Questions:
PIO_TYPENAME = 'pnetcdf'; do we need to change it to netcdf?
set do_clm_interp = "false"; do we need to change it to true?
Another issue:
/usr/bin/cp: option '--v' is ambiguous; possibilities: '--verbose' '--version'
Try '/usr/bin/cp --help' for more information.
Edited DART_config to omit update_dart_namelists.
Also copied a file from GLADE:
Fetching /glade/p/cesmdata/cseg/inputdata/atm/cam/tracer_cnst/tracer_cnst_halons_WACCM6_3Dmonthly_L70_1975-2014_c180216.nc to tracer_cnst_halons_WACCM6_3Dmonthly_L70_1975-2014_c180216.nc
lustre/project/k1421/cesm_store/inputdata/atm/cam/tracer_cnst
This gives us an error.
Error
Did you mean to set use_init_interp = .true. in user_nl_clm?
94: (Setting use_init_interp = .true. is needed when doing a
94: transient run using an initial conditions file from a non-transient run,
94: or a non-transient run using an initial conditions file from a transient run,
94: or when running a resolution or configuration that differs from the initial conditions.
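The error message itself suggests the fix. A sketch, assuming that diagnosis is right (run from the case root; a namelist change doesn't require a rebuild):

```shell
# Sketch: follow CLM's own suggestion and interpolate the initial conditions.
echo "use_init_interp = .true." >> user_nl_clm
./preview_namelists   # regenerate and sanity-check lnd_in before submitting
```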
Trial 036
Changed NTASKS for ESP to 1 and set PIO_TYPENAME=netcdf.
But we still have the same SIGTERM failure. GRRR.
At least it’s good to report that the Invalid PIO rearranger message occurs even in a working run:
$ vim /lustre/scratch/x_johnsobk/archive/FHIST_BGC.f09_g17.002.e3/logs/cesm.log.15594937.200910-181624.gz
0: Invalid PIO rearranger comm max pend req (comp2io), 0
0: Resetting PIO rearranger comm max pend req (comp2io) to 64
0: PIO rearranger options:
0: comm type =
Working on contrasting these two runs, since one works and the other doesn’t:
$ cd /lustre/project/k1421/cases/FHIST_BGC.f09_g17.002.e3
$ ./xmlquery --partial PE
RUN_TYPE: startup
$ cd /lustre/project/k1421/cases/FHIST_BGC.f09_d025.036.e3
$ ./xmlquery --partial PE
RUN_TYPE: hybrid
The outputs of these are identical except for RUN_TYPE.
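Rather than eyeballing the two listings, they can be contrasted wholesale with a diff. A sketch (bash process substitution; case paths from above):

```shell
# Sketch: diff all PE-related settings of the working vs. failing case.
diff \
  <(cd /lustre/project/k1421/cases/FHIST_BGC.f09_g17.002.e3  && ./xmlquery --partial PE) \
  <(cd /lustre/project/k1421/cases/FHIST_BGC.f09_d025.036.e3 && ./xmlquery --partial PE)
```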
There are two plausible paths:
1. Go through Kevin’s script here, /lustre/project/k1421/DART_CASES/setup_advanced_Rean.original, and see if we’re missing anything significant. It seems like the stagedir is just the full path where the restart files are. I think that might be it.
In line 1295 of setup_advanced_Rean.original:
1295 ${LINK} -f ${stagedir}/${refcase}.clm2\${inst_string}.r.${init_time}.nc .
./xmlchange RUN_REFDIR=$stagedir
2. Or get Kevin’s setup script working, which might actually only entail changing the user_nl text. This might actually be pretty fast.
360 set user_grid = "${user_grid} --gridfile /glade/work/raeder/Models/CAM_init/SST"
361 set user_grid = "${user_grid}/config_grids+fv1+2deg_oi0.25_gland20.xml"
362 setenv sst_dataset \
363 "/glade/work/raeder/Models/CAM_init/SST/avhrr-only-v2.20110101_cat_20111231_gregorian_c190703.nc"
We should only need to change these, right?
340 set use_tasks_per_node = 36
...
959 set cesm_data_dir = "/glade/p/cesmdata/cseg/inputdata/atm"
960 set cesm_chem_dir = "/gpfs/fs1/p/acom/acom-climate/cmip6inputs/emissions_ssp119"
961 set chem_root = "${cesm_chem_dir}/emissions-cmip6-ScenarioMIP_IAMC-IMAGE-ssp119-1-1"
962 set chem_dates = "175001-210012_0.9x1.25_c20181024"
Trial 035
Set tasks_per_node=16
but we still get the segmentation fault. So this might
be a PE layout issue. We should really be trying to track down what the correct
layout was for the last working ensemble.
Warning
Warning: missing non-idmap ROF2OCN_LIQ_RMAPNAME for ocn_grid, d.25x.25 and rof_grid r05
Warning: missing non-idmap ROF2OCN_ICE_RMAPNAME for ocn_grid, d.25x.25 and rof_grid r05
Looking back to /lustre/project/k1421/cases/FHIST_BGC.f09_g17.002.e3
we see:
NTASKS_PER_INST: ['ATM:128', 'LND:128', 'ICE:128', 'OCN:128', 'ROF:128', 'GLC:128', 'WAV:128', 'ESP:1']
ROOTPE: ['CPL:0', 'ATM:0', 'LND:0', 'ICE:0', 'OCN:0', 'ROF:0', 'GLC:0', 'WAV:0', 'ESP:0']
Results in group mach_pes_last
COSTPES_PER_NODE: 32
COST_PES: 384
MAX_MPITASKS_PER_NODE: 32
MAX_TASKS_PER_NODE: 64
TOTALPES: 384
I checked to see that run_domain is the same in the working case and the non-working case.
Results in group run_domain
ATM2LND_FMAPTYPE: X
ATM2LND_SMAPTYPE: X
ATM2OCN_FMAPTYPE: X
ATM2OCN_SMAPTYPE: X
ATM2OCN_VMAPTYPE: X
ATM2WAV_SMAPTYPE: Y
GLC2ICE_RMAPTYPE: Y
GLC2LND_FMAPTYPE: Y
GLC2LND_SMAPTYPE: Y
GLC2OCN_ICE_RMAPTYPE: Y
GLC2OCN_LIQ_RMAPTYPE: Y
ICE2WAV_SMAPTYPE: Y
LND2ATM_FMAPTYPE: Y
LND2ATM_SMAPTYPE: Y
LND2GLC_FMAPTYPE: X
LND2GLC_SMAPTYPE: X
LND2ROF_FMAPTYPE: X
OCN2ATM_FMAPTYPE: Y
OCN2ATM_SMAPTYPE: Y
OCN2WAV_SMAPTYPE: Y
ROF2LND_FMAPTYPE: Y
ROF2OCN_FMAPTYPE: Y
ROF2OCN_ICE_RMAPTYPE: Y
ROF2OCN_LIQ_RMAPTYPE: Y
WAV2OCN_SMAPTYPE: X
Results in group mach_pes
NTASKS_PER_INST: ['ATM:128', 'LND:128', 'ICE:128', 'OCN:128', 'ROF:128', 'GLC:128', 'WAV:128', 'ESP:1']
ROOTPE: ['CPL:0', 'ATM:0', 'LND:0', 'ICE:0', 'OCN:0', 'ROF:0', 'GLC:0', 'WAV:0', 'ESP:0']
Results in group mach_pes_last
COSTPES_PER_NODE: 32
COST_PES: 384
MAX_MPITASKS_PER_NODE: 32
MAX_TASKS_PER_NODE: 64
TOTALPES: 384
PIO_NETCDF_FORMAT: ['CPL:64bit_offset', 'ATM:64bit_offset', 'LND:64bit_offset', 'ICE:64bit_offset', 'OCN:64bit_offset', 'ROF:64bit_offset', 'GLC:64bit_offset', 'WAV:64bit_offset', 'ESP:64bit_offset']
Trial 034
In Trial 034, after changing PIO_TYPENAME to netcdf, we don’t get the problem encountered in Trial 031. However, we still get an error.
Error
96: MCT::m_Router::initp_: RGSMap indices not increasing...Will correct
96: MCT::m_Router::initp_: GSMap indices not increasing...Will correct
52: forrtl: severe (174): SIGSEGV, segmentation fault occurred
Trial 033
We should build Trial 033 as a hybrid run and comment out PIO_TYPENAME.
$ cd /lustre/project/k1421/cases/FHIST_BGC.f09_d025.033.e3/
$ cp /lustre/project/k1421/DART/models/cam-fv/shell_scripts/cesm2_1/assimilate.csh.template assimilate.csh
$ ./xmlchange DATA_ASSIMILATION_SCRIPT=./assimilate.csh
$ ./case.submit -M begin,end
The two candidate settings for toggling are RUN_TYPE and PIO_TYPENAME.
Check the PIO buildlog for Trials 033 and 032.
Building pio with output to file /lustre/scratch/x_johnsobk/FHIST_BGC.f09_d025.033.e3/bld/pio.bldlog.201007-210614
Here’s what the config_machines.xml modules look like:
<modules mpilib="mpi-serial">
<command name="load">cray-hdf5/1.10.5.2</command>
<command name="load">cray-netcdf/4.6.3.2</command>
</modules>
<modules mpilib="!mpi-serial">
<command name="load">cray-netcdf-hdf5parallel/4.6.3.2</command>
<command name="load">cray-hdf5-parallel/1.10.5.2</command>
<command name="load">cray-parallel-netcdf/1.11.1.1</command>
</modules>
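To confirm which NetCDF/HDF5 flavors the environment actually ends up with, the loaded modules can be inspected after loading. A sketch, assuming a standard Cray module environment (module list prints to stderr, hence the redirect):

```shell
# Sketch: verify the NetCDF/HDF5 modules actually loaded for the build.
module load cray-netcdf-hdf5parallel/4.6.3.2 cray-hdf5-parallel/1.10.5.2
module list 2>&1 | grep -iE 'netcdf|hdf5'
```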
Maybe we shouldn’t use the restart files for ice_ic?
Trial 032
Attempting to use a continue run instead of a hybrid run.
Error
Did not find continue in valid values for RUN_TYPE: ['startup', 'hybrid', 'branch']
RUN_TYPE = 'continue'
Aha, continue was permitted in CESM1 but isn’t permitted anymore, so this doesn’t work.
We can try a startup run next. Alternatively, could this be a PIO_TYPENAME issue?
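The valid values can be checked before changing anything. A sketch, assuming xmlquery's --full flag is available (run from the case root):

```shell
# Sketch: inspect RUN_TYPE's valid values, then fall back to startup.
./xmlquery RUN_TYPE --full    # shows valid_values: startup, hybrid, branch
./xmlchange RUN_TYPE=startup
```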
Trial 031
Attempted to change only NTASKS_PER_INST_ATM.
Okay, this is progress, as it results in an error.
Error
130: aborting in ice-pio_ropen with invalid file
130: ERROR: aborting in ice-pio_ropen with invalid file
Is the abort trap signal related to this post?
The variable is ice_ic, set in user_nl_cice. In Kevin’s directory, /glade/work/raeder/Exp/f.e21.FHIST_BGC.f09_025.CAM6assim.011, it is set as:
ice_ic = 'Rean_spinup_2010.cice_0001.r.2011-01-01-00000.nc'
Hmm… are these files at different resolutions? Do they both need to be on the atmospheric grid, and not the oceanic grid?
This could be a number of different things: is the initial condition specified incorrectly, or is the grid specified incorrectly? What if we just omit ice_ic and see what happens?
This wasn’t a problem before because we were doing a startup run instead of a hybrid run.
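To test the "omit ice_ic" idea, the entry can simply be dropped from user_nl_cice so CICE falls back to its built-in initialization ('default' is the out-of-the-box ice_ic value). A sketch (run from the case root; assumes GNU sed):

```shell
# Sketch: remove the explicit ice_ic so CICE uses its default initialization.
sed -i "/ice_ic/d" user_nl_cice
./preview_namelists   # confirm ice_in no longer points at Kevin's restart file
```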
Trial 030
Setting use_tasks_per_node = 16, which results in the following for our case on Shaheen:
$ cd ${CASEROOT}
$ ./xmlquery NTASKS
NTASKS: ['CPL:108', 'ATM:108', 'LND:108', 'ICE:108', 'OCN:108', 'ROF:108', 'GLC:108', 'WAV:108', 'ESP:36']
Comparing it to Kevin’s configuration for the reanalysis on Cheyenne:
$ cd /glade/work/raeder/Exp/f.e21.FHIST_BGC.f09_025.CAM6assim.011
$ ./xmlquery NTASKS_PER_INST
NTASKS_PER_INST: ['ATM:108', 'LND:108', 'ICE:108', 'OCN:108', 'ROF:108', 'GLC:108', 'WAV:108', 'ESP:36']
So this seems to be the key thing we’re missing.
Trial 029
Trying different values for nodes per instance: tasks_per_node = 32 and nodes_per_instance = 4.
$ cd /lustre/project/k1421/cases/FHIST_BGC.f09_d025.031.e3/
$ cp /lustre/project/k1421/DART/models/cam-fv/shell_scripts/cesm2_1/assimilate.csh.template assimilate.csh
$ ./xmlchange DATA_ASSIMILATION_SCRIPT=/lustre/project/k1421/cases/FHIST_BGC.f09_d025.031.e3/assimilate.csh
$ ./case.submit -M begin,end