diff --git a/doc/esmvalcore/config.rst b/doc/esmvalcore/config.rst
index 9695430621..d668276780 100644
--- a/doc/esmvalcore/config.rst
+++ b/doc/esmvalcore/config.rst
@@ -4,29 +4,163 @@
Configuration files
*******************
+Overview
+========
+
There are several configuration files in ESMValTool:
- - config-user.yml
- - config-developer.yml
- - config-references.yml
- - config-logging.yml
+* ``config-user.yml``: sets user-specific options such as the desired
+  graphical output format and the root paths to data;
+* ``config-developer.yml``: defines the standardized file-naming conventions
+  and directory structures used to locate data;
+* ``config-references.yml``: stores information on diagnostic authors and
+  scientific journal references;
+* ``config-logging.yml``: stores the logging configuration.
User configuration file
=======================
-See Section
+The ``config-user.yml`` is one of the two files the user needs to provide as
+input arguments to the ``esmvaltool`` executable at run time, the second being
+the :ref:`recipe`.
+
+The ``config-user.yml`` configuration file contains all the global-level
+information needed by ESMValTool. It can be reused as many times as the user
+needs before changing any of the options stored in it. This file is
+essentially the gateway between the user and the machine-specific instructions
+to ``esmvaltool``. The following shows the default settings from the
+``config-user.yml`` file, with an explanation in the commented lines above each
+option:
+
+.. code-block:: yaml
+
+ # Diagnostics create plots? [true]/false
+ # turning it off will turn off graphical output from diagnostic
+ write_plots: true
+
+ # Diagnostics write NetCDF files? [true]/false
+ # turning it off will turn off netCDF output from diagnostic
+ write_netcdf: true
+
+ # Set the console log level debug, [info], warning, error
+ # for much more information printed to screen set log_level: debug
+ log_level: info
+ # verbosity is deprecated and will be removed in the future
+ # verbosity: 1
+
+ # Exit on warning? true/[false]
+ exit_on_warning: false
+
+ # Plot file format? [ps]/pdf/png/eps/epsi
+ output_file_type: pdf
+
+ # Destination directory where all output will be written
+ # including log files and performance stats
+ output_dir: ./esmvaltool_output
+
+ # Auxiliary data directory (used for some additional datasets)
+ # this is where e.g. files can be downloaded to by a download
+ # script embedded in the diagnostic
+ auxiliary_data_dir: ./auxiliary_data
+
+ # Use netCDF compression true/[false]
+ compress_netcdf: false
+
+ # Save intermediary cubes in the preprocessor true/[false]
+ # set to true will save the output cube from each preprocessing step
+ # these files are numbered according to the preprocessing order
+ save_intermediary_cubes: false
+
+ # Remove the preproc dir if all fine
+ # if this option is set to "true", ALL preprocessor files will be removed
+ # CAUTION when using: if you need those files, set it to false
+ remove_preproc_dir: true
+
+ # Run at most this many tasks in parallel null/[1]/2/3/4/..
+ # Set to null to use the number of available CPUs.
+ # Make sure your system has enough memory for the specified number of tasks.
+ max_parallel_tasks: 1
+
+ # Path to custom config-developer file, to customise project configurations.
+ # See config-developer.yml for an example. Set to null to use the default.
+ config_developer_file: null
+
+ # Get profiling information for diagnostics
+ # Only available for Python diagnostics
+ profile_diagnostic: false
+
+ # Rootpaths to the data from different projects (lists are also possible)
+ rootpath:
+   CMIP5: [~/cmip5_inputpath1, ~/cmip5_inputpath2]
+   OBS: ~/obs_inputpath
+   default: ~/default_inputpath
+
+ # Directory structure for input data: [default]/BADC/DKRZ/ETHZ/etc
+ # See config-developer.yml for definitions.
+ drs:
+   CMIP5: default
+Most of these settings are fairly self-explanatory, e.g.:
+
+.. code-block:: yaml
+
+ # Diagnostics create plots? [true]/false
+ write_plots: true
+ # Diagnostics write NetCDF files? [true]/false
+ write_netcdf: true
+
+The ``write_plots`` setting is used to inform ESMValTool diagnostics about your
+preference for creating figures. Similarly, the ``write_netcdf`` setting is a
+boolean which turns on or off the writing of netCDF files by the diagnostic
+scripts.
+
+.. code-block:: yaml
+
+ # Auxiliary data directory (used for some additional datasets)
+ auxiliary_data_dir: ~/auxiliary_data
+
+The ``auxiliary_data_dir`` setting is the path where any required additional
+auxiliary data files are placed. This is necessary because certain Python
+toolkits, such as cartopy, will attempt to download data files at run time,
+typically geographic data files such as coastlines or land surface maps. This
+can fail if the machine does not have access to the wider internet. This
+location allows the user to specify where to find such files if they cannot be
+downloaded at run time.
+
+.. warning::
+
+ This setting is not for model or observational datasets, rather it is for
+ data files used in plotting such as coastline descriptions and so on.
+
+A detailed explanation of the data-finding sections of the
+``config-user.yml`` (``rootpath`` and ``drs``) is presented in the
+:ref:`data-retrieval` section. These settings relate directly to the data
+finding capabilities of ESMValTool, so it is important that the user
+understands them.
+
+.. note::
+
+ You choose your ``config-user.yml`` file at run time, so you can keep several
+ of them available for different purposes, e.g. one for a formalised run and
+ another for debugging.
+
+
+.. _config-developer:
Developer configuration file
============================
This configuration file describes the file system structure for several
-key projects (CMIP5, CMIP6) on several key machines (BADC, CP4CDS, DKRZ, ETHZ,
-SMHI, BSC).
+key projects (CMIP5, CMIP6, OBS) on several key machines (BADC, CP4CDS, DKRZ,
+ETHZ, SMHI, BSC). CMIP data is stored as part of the Earth System Grid
+Federation (ESGF) and the standards for file naming and paths to files are set
+out by CMOR and DRS. For a detailed description of these standards and their
+adoption in ESMValTool, we refer the user to the :ref:`CMOR-DRS` section, where
+we relate these standards to the data retrieval mechanism of the ESMValTool.
-The data directory structure of the CMIP5 project is set up differently
-at each site. The following code snipper is an example of several paths
-descriptions for the CMIP5 at various sites:
+The data directory structure of the CMIP projects is set up differently
+at each site. The following code snippet is an example of several path
+descriptions for CMIP5 data at various sites:
.. code-block:: yaml
@@ -46,23 +180,29 @@ As an example, the CMIP5 file path on BADC would be:
[institute]/[dataset ]/[exp]/[frequency]/[modeling_realm]/[mip]/[ensemble]/latest/[short_name]
-When loading these files, ESMValTool replaces the placeholders with the true
-values. The resulting real path would look something like this:
+When loading these files, ESMValTool replaces the placeholders ``[item]`` with
+actual values supplied by the user in ``config-user.yml`` and
+``recipe.yml``. The resulting real path would look something like this:
-.. code-block:: yaml
+.. code-block::
MOHC/HadGEM2-CC/rcp85/mon/ocean/Omon/r1i1p1/latest/tos
+Again, for a more in-depth description of this process, as part of the data
+retrieval mechanism, please see :ref:`CMOR-DRS`.
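+
+The placeholder substitution described above can be sketched in a few lines of
+Python. This is only an illustration under stated assumptions, not the
+ESMValTool implementation: the ``fill_placeholders`` helper and the ``facets``
+dictionary are hypothetical names, and the template is the BADC CMIP5 path
+shown earlier.
+
+.. code-block:: python
+
+ import re
+
+ # Directory template copied from the BADC CMIP5 example above.
+ TEMPLATE = ("[institute]/[dataset]/[exp]/[frequency]/[modeling_realm]/"
+             "[mip]/[ensemble]/latest/[short_name]")
+
+
+ def fill_placeholders(template, facets):
+     """Replace every ``[key]`` in ``template`` with ``facets[key]``."""
+     def lookup(match):
+         # A missing key raises KeyError instead of producing a wrong path.
+         return str(facets[match.group(1).strip()])
+     return re.sub(r"\[([^\]]+)\]", lookup, template)
+
+
+ # Hypothetical values, as they might be collected from a recipe.
+ facets = {
+     "institute": "MOHC", "dataset": "HadGEM2-CC", "exp": "rcp85",
+     "frequency": "mon", "modeling_realm": "ocean", "mip": "Omon",
+     "ensemble": "r1i1p1", "short_name": "tos",
+ }
+ print(fill_placeholders(TEMPLATE, facets))
+
+With these values the sketch reproduces the example path shown above.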
+
+.. _config-ref:
References configuration file
=============================
-The ``config-references.yml`` file is the full list of ESMValTool authors,
-references and projects. Each author, project and reference in the documentation
-section of a recipe needs to be in this file in the relevant section.
+The ``config-references.yml`` file contains the list of ESMValTool authors,
+references and projects. Each author, project and reference referred to in the
+documentation section of a recipe needs to be in this file in the relevant
+section.
-For instance, the recipe ``recipe_ocean_example.yml`` file contains the following
-documentation section:
+For instance, the recipe ``recipe_ocean_example.yml`` file contains the
+following documentation section:
.. code-block:: yaml
@@ -80,9 +220,10 @@ documentation section:
- ukesm
-All four items here are named people, references and projects listed in the
+These four items are named people, references and projects listed in the
``config-references.yml`` file.
+
Logging configuration file
==========================
diff --git a/doc/esmvalcore/datafinder.rst b/doc/esmvalcore/datafinder.rst
index 6759018c27..206b3d1766 100644
--- a/doc/esmvalcore/datafinder.rst
+++ b/doc/esmvalcore/datafinder.rst
@@ -1,7 +1,259 @@
-.. _datafinder:
+.. _findingdata:
-***********
-Data finder
-***********
+************
+Finding data
+************
-Documentation of the _data_finder.py module (incl. _download.py?)
+Overview
+========
+Data discovery and retrieval is the first step in any evaluation process.
+ESMValTool uses a `semi-automated` data finding mechanism with inputs from both
+the user configuration file and the recipe file: the user provides the tool
+with a set of parameters describing the data needed and, once these parameters
+have been provided, the tool automatically finds the right data. Below we
+detail the data finding and retrieval process and the input the user needs to
+specify, giving examples of how to use the data finding routine under
+different scenarios.
+
+.. _CMOR-DRS:
+
+CMIP data - CMOR Data Reference Syntax (DRS) and the ESGF
+=========================================================
+CMIP data is widely available via the Earth System Grid Federation
+(`ESGF `_) and is accessible to users either
+via download from the ESGF portal or through the ESGF data nodes hosted
+by large computing facilities (like CEDA-Jasmin, DKRZ, etc). This data
+adheres to, among other standards, the DRS and Controlled Vocabulary
+standard for naming files and structured paths; the `DRS
+`_
+ensures that files and paths to them are named according to a
+standardized convention. Examples of this convention, also used by
+ESMValTool for file discovery and data retrieval, include:
+
+* CMIP6 file: ``[variable_short_name]_[mip]_[dataset_name]_[experiment]_[ensemble]_[grid]_[start-date]-[end-date].nc``
+* CMIP5 file: ``[variable_short_name]_[mip]_[dataset_name]_[experiment]_[ensemble]_[start-date]-[end-date].nc``
+* OBS file: ``[project]_[dataset_name]_[type]_[version]_[mip]_[short_name]_[start-date]-[end-date].nc``
+
+Similar standards exist for the standard paths (input directories); for the
+ESGF data nodes, these paths differ slightly, for example:
+
+* CMIP6 path for BADC: ``ROOT-BADC/[institute]/[dataset_name]/[experiment]/[ensemble]/[mip]/[variable_short_name]/[grid]``
+* CMIP6 path for ETHZ: ``ROOT-ETHZ/[experiment]/[mip]/[variable_short_name]/[dataset_name]/[ensemble]/[grid]``
+
+From the ESMValTool user perspective the number of data input parameters is
+optimized to allow for ease of use. We detail this procedure in the next
+section.
+
+.. _data-retrieval:
+
+Data retrieval
+==============
+Data retrieval in ESMValTool has two main aspects from the user's point of
+view:
+
+* data can be found by the tool, subject to availability on disk;
+* it is the user's responsibility to set the correct data retrieval parameters;
+
+The first point is self-explanatory: if the user runs the tool on a machine
+that has access to a data repository or multiple data repositories, then
+ESMValTool will look for and find the available data requested by the user.
+
+The second point underlines the fact that the user has full control over the
+type and amount of data needed for the analyses. Setting the data retrieval
+parameters is explained below.
+
+Setting the correct root paths
+------------------------------
+The first step towards providing ESMValTool the correct set of parameters for
+data retrieval is setting the root paths to the data. This is done in the user
+configuration file ``config-user.yml``. The two sections where the user will
+set the paths are ``rootpath`` and ``drs``. ``rootpath`` contains pointers to
+``CMIP``, ``OBS``, ``default`` and ``RAWOBS`` root paths; ``drs`` sets the type
+of directory structure the root paths are structured by. It is important to
+first discuss the ``drs`` parameter: as we've seen in the previous section, the
+DRS as a standard is used for both file naming conventions and for directory
+structures.
+
+.. _config-user-drs:
+
+Explaining ``config-user/drs: CMIP5:`` or ``config-user/drs: CMIP6:``
+---------------------------------------------------------------------
+Whereas ESMValTool will **always** use the CMOR standard for file naming (see
+above), by setting the ``drs`` parameter the user tells the tool what type of
+root paths they need the data from, e.g.:
+
+ .. code-block:: yaml
+
+   drs:
+     CMIP6: BADC
+
+will tell the tool that the user needs data from a repository structured
+according to the BADC DRS structure, i.e.:
+
+``ROOT/[institute]/[dataset_name]/[experiment]/[ensemble]/[mip]/[variable_short_name]/[grid]``
+
+Setting the ``ROOT`` parameter is explained below. This is a
+strictly-structured repository tree and if there is any irregularity (e.g.
+there is no ``[mip]`` directory) the data will not be found! ``BADC`` can be
+replaced with ``DKRZ`` or ``ETHZ`` depending on the existing ``ROOT`` directory
+structure.
+
+The snippet
+
+ .. code-block:: yaml
+
+   drs:
+     CMIP6: default
+
+is another option, used to retrieve data from a ``ROOT`` directory that has no
+DRS-like structure; ``default`` indicates that the data lies in a directory
+that contains all the files without any structure.
+
+.. note::
+ When using ``CMIP6: default`` or ``CMIP5: default`` it is important to
+ remember that all the needed files must be in the same top-level directory
+ set by ``default`` (see below how to set ``default``).
+
+.. _config-user-rootpath:
+
+Explaining ``config-user/rootpath:``
+------------------------------------
+
+``rootpath`` identifies the root directory for different data types (``ROOT`` as we used it above):
+
+* ``CMIP`` e.g. ``CMIP5`` or ``CMIP6``: this is the `root` path(s) to where the
+ CMIP files are stored; it can be a single path or a list of paths; it can
+ point to an ESGF node or it can point to a user private repository. Example
+ for a CMIP5 root path pointing to the ESGF node on CEDA-Jasmin (formerly
+ known as BADC):
+
+ .. code-block:: yaml
+
+ CMIP5: /badc/cmip5/data/cmip5/output1
+
+ Example for a CMIP6 root path pointing to the ESGF node on CEDA-Jasmin:
+
+ .. code-block:: yaml
+
+ CMIP6: /badc/cmip6/data/CMIP6/CMIP
+
+ Example for a mix of CMIP6 root path pointing to the ESGF node on CEDA-Jasmin
+ and a user-specific data repository for extra data:
+
+ .. code-block:: yaml
+
+ CMIP6: [/badc/cmip6/data/CMIP6/CMIP, /home/users/johndoe/cmip_data]
+
+* ``OBS``: this is the `root` path(s) to where the observational datasets are
+ stored; again, this could be a single path or a list of paths, just like for
+ CMIP data. Example for the OBS path for a large cache of observation datasets
+ on CEDA-Jasmin:
+
+ .. code-block:: yaml
+
+ OBS: /group_workspaces/jasmin4/esmeval/obsdata-v2
+
+* ``default``: this is the `root` path(s) to where files are stored without any
+ DRS-like directory structure; in a nutshell, this is a single directory that
+ should contain all the files needed by the run, without any sub-directory
+ structure.
+
+* ``RAWOBS``: this is the `root` path(s) to where the raw observational data
+ files are stored; this is used by ``cmorize_obs``.
+
+Dataset definitions in ``recipe``
+---------------------------------
+Once the correct paths have been established, ESMValTool collects the
+information on the specific datasets that are needed for the analysis. This
+information, together with the CMOR convention for naming files (see CMOR-DRS_)
+will allow the tool to search for and find the right files. The specific
+datasets are listed in any recipe, under the ``datasets`` and/or
+``additional_datasets`` sections, e.g.
+
+.. code-block:: yaml
+
+ datasets:
+ - {dataset: HadGEM2-CC, project: CMIP5, exp: historical, ensemble: r1i1p1, start_year: 2001, end_year: 2004}
+ - {dataset: UKESM1-0-LL, project: CMIP6, exp: historical, ensemble: r1i1p1f2, grid: gn, start_year: 2004, end_year: 2014}
+
+``_data_finder`` will use this information to find data for **all** the variables specified in ``diagnostics/variables``.
+
+Recap and example
+=================
+Let us look at a practical example for a recap of the information above:
+suppose you are using a ``config-user.yml`` that has the following entries for
+data finding:
+
+.. code-block:: yaml
+
+ rootpath:  # running on CEDA-Jasmin
+   CMIP6: /badc/cmip6/data/CMIP6/CMIP
+ drs:
+   CMIP6: BADC  # since you are on CEDA-Jasmin
+
+and the dataset you need is specified in your ``recipe.yml`` as:
+
+.. code-block:: yaml
+
+ - {dataset: UKESM1-0-LL, project: CMIP6, mip: Amon, exp: historical, grid: gn, ensemble: r1i1p1f2, start_year: 2004, end_year: 2014}
+
+for a variable, e.g.:
+
+.. code-block:: yaml
+
+ diagnostics:
+   some_diagnostic:
+     description: some_description
+     variables:
+       ta:
+         preprocessor: some_preprocessor
+
+The tool will then use the root path ``/badc/cmip6/data/CMIP6/CMIP`` and the
+dataset information and will assemble the full DRS path using information from
+CMOR-DRS_ and establish the path to the files as:
+
+.. code-block::
+
+ /badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r1i1p1f2/Amon
+
+then look for variable ``ta`` and specifically the latest version of the data
+file:
+
+.. code-block::
+
+ /badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r1i1p1f2/Amon/ta/gn/latest/
+
+and finally, using the file naming definition from CMOR-DRS_ find the file:
+
+.. code-block::
+
+ /badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r1i1p1f2/Amon/ta/gn/latest/ta_Amon_UKESM1-0-LL_historical_r1i1p1f2_gn_195001-201412.nc
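+
+The file found here covers 1950 to 2014, while the recipe asks for 2004 to
+2014. One plausible way to decide whether a file is relevant is to compare the
+date range in its name with the requested years. The sketch below assumes such
+an overlap test and uses hypothetical helper names (``file_years`` and
+``overlaps``); it should not be read as the actual logic of ``_data_finder``.
+
+.. code-block:: python
+
+ def file_years(filename):
+     """Return (first_year, last_year) parsed from a CMOR-style file name."""
+     stem = filename.rsplit(".", 1)[0]
+     # The date range is the last underscore-separated field of the name.
+     start, end = stem.rsplit("_", 1)[1].split("-")
+     return int(start[:4]), int(end[:4])
+
+
+ def overlaps(filename, start_year, end_year):
+     """Check whether the file holds data for any of the requested years."""
+     first, last = file_years(filename)
+     return first <= end_year and last >= start_year
+
+
+ name = "ta_Amon_UKESM1-0-LL_historical_r1i1p1f2_gn_195001-201412.nc"
+ print(overlaps(name, 2004, 2014))  # file spans 1950-2014, so this is True
+ print(overlaps(name, 2015, 2020))  # no common years, so this is False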
+
+.. _observations:
+
+Observational data
+==================
+Observational data is retrieved in the same manner as CMIP data, for example
+using the ``OBS`` root path set to:
+
+ .. code-block:: yaml
+
+ OBS: /group_workspaces/jasmin4/esmeval/obsdata-v2
+
+and the dataset:
+
+ .. code-block:: yaml
+
+ - {dataset: ERA-Interim, project: OBS, type: reanaly, version: 1, start_year: 2014, end_year: 2015, tier: 3}
+
+in ``recipe.yml`` in ``datasets`` or ``additional_datasets``, the rules set in
+CMOR-DRS_ are used again and the file will be automatically found:
+
+.. code-block::
+
+ /group_workspaces/jasmin4/esmeval/obsdata-v2/Tier3/ERA-Interim/OBS_ERA-Interim_reanaly_1_Amon_ta_201401-201412.nc
+
+Since observational data are organized in Tiers depending on their level of
+public availability, the ``default`` directory must be structured accordingly
+with sub-directories ``TierX`` (``Tier1``, ``Tier2`` or ``Tier3``), even when
+``drs: default``.
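+
+As a rough illustration of how the ``TierX`` directory and the OBS file naming
+convention combine, the sketch below builds a glob-style pattern and checks
+the example file against it. The ``obs_pattern`` helper and the ``facets``
+dictionary are hypothetical, and taking ``mip`` and ``short_name`` from the
+variable definition is an assumption made for this example.
+
+.. code-block:: python
+
+ import fnmatch
+ import posixpath
+
+ # OBS file naming convention, with the date range left as a wildcard.
+ OBS_FILE = "{project}_{dataset}_{type}_{version}_{mip}_{short_name}_*.nc"
+
+
+ def obs_pattern(rootpath, facets):
+     """Build a glob-style pattern for an OBS file below ``rootpath``."""
+     tier_dir = "Tier{tier}".format(**facets)
+     directory = posixpath.join(rootpath, tier_dir, facets["dataset"])
+     return posixpath.join(directory, OBS_FILE.format(**facets))
+
+
+ # Dataset entry from the recipe plus (assumed) variable information.
+ facets = {"project": "OBS", "dataset": "ERA-Interim", "type": "reanaly",
+           "version": 1, "tier": 3, "mip": "Amon", "short_name": "ta"}
+ pattern = obs_pattern("/group_workspaces/jasmin4/esmeval/obsdata-v2", facets)
+ candidate = ("/group_workspaces/jasmin4/esmeval/obsdata-v2/Tier3/ERA-Interim/"
+              "OBS_ERA-Interim_reanaly_1_Amon_ta_201401-201412.nc")
+ print(fnmatch.fnmatchcase(candidate, pattern))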
diff --git a/doc/esmvalcore/index.rst b/doc/esmvalcore/index.rst
index eedbac7983..826bac1c94 100644
--- a/doc/esmvalcore/index.rst
+++ b/doc/esmvalcore/index.rst
@@ -11,3 +11,4 @@ ESMValTool Core
Recipe
Preprocessor
Fixing data
+ Utilities
diff --git a/doc/esmvalcore/preprocessor.rst b/doc/esmvalcore/preprocessor.rst
index 62ff95b86c..82bb54833a 100644
--- a/doc/esmvalcore/preprocessor.rst
+++ b/doc/esmvalcore/preprocessor.rst
@@ -3,41 +3,75 @@
************
Preprocessor
************
-The ESMValTool preprocessor can be used to perform a broad range of operations
-on the input data before diagnostics or metrics are applied. The
-preprocessor performs these operations in a centralized, documented and
-efficient way, thus reducing the data processing load on the diagnostics side.
-
-Each of the preprocessor operations is written in a dedicated python module and
-all of them receive and return an Iris cube, working sequentially on the data
-with no interactions between them. The order
-in which the preprocessor operations is applied is set by default in order to
-minimize the loss of information due to, for example, temporal and spatial
-subsetting or multi-model averaging. Nevertheless, the user is free to change
-such order to address specific scientific requirements, but keeping in mind
-that some operations must be necessarily performed in a specific order. This is
-the case, for instance, for multi-model statistics, which required the model to
-be on a common grid and therefore has to be called after the regridding module.
In this section, each of the preprocessor modules is described in detail
following the default order in which they are applied:
-* `Variable derivation`_.
-* `CMOR check and dataset-specific fixes`_.
-* `Vertical interpolation`_.
-* `Land/Sea/Ice masking`_.
-* `Horizontal regridding`_.
-* `Masking of missing values`_.
-* `Multi-model statistics`_.
-* `Time operations`_.
-* `Area operations`_.
-* `Volume operations`_.
-* `Unit conversion`_.
+* :ref:`Variable derivation`
+* :ref:`CMOR check and dataset-specific fixes`
+* :ref:`Vertical interpolation`
+* :ref:`Land/Sea/Ice masking`
+* :ref:`Horizontal regridding`
+* :ref:`Masking of missing values`
+* :ref:`Multi-model statistics`
+* :ref:`Time operations`
+* :ref:`Area operations`
+* :ref:`Volume operations`
+* :ref:`Unit conversion`
+
+Overview
+========
+
+..
+ ESMValTool is a modular ``Python 3.6+`` software package possessing capabilities
+ of executing a large number of diagnostic routines that can be written in a
+ number of programming languages (Python, NCL, R, Julia). The modular nature
+ benefits the users and developers in different key areas: a new feature
+ developed specifically for version 2.0 is the preprocessing core or the
+ preprocessor (esmvalcore) that executes the bulk of standardized data
+ operations and is highly optimized for maximum performance in data-intensive
+ tasks. The main objective of the preprocessor is to integrate as many
+ standardizable data analysis functions as possible so that the diagnostics can
+ focus on the specific scientific tasks they carry. The preprocessor is linked
+ to the diagnostics library and the diagnostic execution is seamlessly performed
+ after the preprocessor has completed its steps. The benefits of having a
+ preprocessing unit separate from the diagnostics library include:
+
+ * ease of integration of new preprocessing routines;
+ * ease of maintenance (including unit and integration testing) of existing
+ routines;
+ * a straightforward manner of importing and using the preprocessing routines as
+ part of the overall usage of the software and, as a special case, the use
+ during diagnostic execution;
+ * shifting the effort for the scientific diagnostic developer from implementing
+ both standard and diagnostic-specific functionalities to allowing them to
+ dedicate most of the effort to developing scientifically-relevant diagnostics
+ and metrics;
+ * a more strict code review process, given the smaller code base than for
+ diagnostics.
+
+The ESMValTool preprocessor can be used to perform a broad range of operations
+on the input data before diagnostics or metrics are applied. The preprocessor
+performs these operations in a centralized, documented and efficient way, thus
+reducing the data processing load on the diagnostics side.
+
+Each of the preprocessor operations is written in a dedicated Python module and
+all of them receive and return an Iris `cube
+`_ , working
+sequentially on the data with no interactions between them. The order in which
+the preprocessor operations are applied is set by default to minimize
+the loss of information due to, for example, temporal and spatial subsetting or
+multi-model averaging. Nevertheless, the user is free to change this order to
+address specific scientific requirements, keeping in mind that some
+operations must necessarily be performed in a specific order. This is the case,
+for instance, for multi-model statistics, which require the models to be on a
+common grid and therefore have to be called after the regridding module.
+
+.. _Variable derivation:
Variable derivation
===================
-
The variable derivation module allows to derive variables which are not in the
CMIP standard data request using standard variables as input. The typical use
case of this operation is the evaluation of a variable which is only available
@@ -52,10 +86,10 @@ comparison with the observations.
To contribute a new derived variable, it is also necessary to define a name for
it and to provide the corresponding CMOR table. This is to guarantee the proper
metadata definition is attached to the derived data. Such custom CMOR tables
-are collected as part of the `ESMValTool core package
+are collected as part of the `ESMValCore package
`_. By default, the variable
-derivation will be applied only if not already available in the input data, but
-the derivation can be forced by setting the appropriate flag.
+derivation will be applied only if the variable is not already available in the
+input data, but the derivation can be forced by setting the appropriate flag.
.. code-block:: yaml
@@ -65,27 +99,106 @@ the derivation can be forced by setting the appropriate flag.
force_derivation: false
The required arguments for this module are two boolean switches:
-* derive: activate variable derivation
-* force_derivation: force variable derivation even if the variable is
-directly available in the input data.
+
+* ``derive``: activate variable derivation
+* ``force_derivation``: force variable derivation even if the variable is
+ directly available in the input data.
See also :func:`esmvalcore.preprocessor.derive`.
-CMOR check and dataset-specific fixes
+.. _CMOR check and dataset-specific fixes:
+
+CMORization and dataset-specific fixes
======================================
.. warning::
- Documentation of _reformat.py, check.py and fix.py to be added
+ Section to be added.
+
+
+.. _Vertical interpolation:
Vertical interpolation
======================
-.. warning::
- Documentation of _regrid.py (part 1) to be added
+Vertical level selection is an important aspect of data preprocessing since it
+allows the scientist to compute metrics specific to certain levels, whether
+given as air pressure or depth (e.g. the Quasi-Biennial Oscillation (QBO) u30
+is computed at 30 hPa). Dataset native vertical grids may not come with the
+desired set of levels, so an interpolation operation will be needed to regrid
+the data vertically. ESMValTool can perform this vertical interpolation via the
+``extract_levels`` preprocessor. Level extraction may be done in a number of
+ways.
+
+Level extraction can be done at specific values passed to ``extract_levels`` as
+``levels:`` with its value a list of levels (note that the units are
+CMOR-standard, Pascals (Pa)):
+
+.. code-block:: yaml
+
+ preprocessors:
+   preproc_select_levels_from_list:
+     extract_levels:
+       levels: [100000., 50000., 3000., 1000.]
+       scheme: linear
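+
+To give an intuition for what a ``linear`` scheme does along a single vertical
+profile, here is a one-dimensional sketch in plain Python. It is not how
+ESMValTool or Iris implement the interpolation; the ``interpolate_linear``
+function and the sample temperature values are invented for illustration.
+
+.. code-block:: python
+
+ def interpolate_linear(levels, values, target):
+     """Linearly interpolate ``values`` given on ``levels`` to ``target``.
+
+     Returns None when ``target`` lies outside the available levels,
+     i.e. this sketch never extrapolates.
+     """
+     pairs = sorted(zip(levels, values))
+     for (p0, v0), (p1, v1) in zip(pairs, pairs[1:]):
+         if p0 <= target <= p1:
+             weight = (target - p0) / (p1 - p0)
+             return v0 + weight * (v1 - v0)
+     return None
+
+
+ # Made-up air temperatures (K) on three native pressure levels (Pa).
+ native_levels = [100000., 85000., 50000.]
+ temperatures = [288., 280., 250.]
+ print(interpolate_linear(native_levels, temperatures, 92500.))
+ print(interpolate_linear(native_levels, temperatures, 1000.))
+
+The second request lies above the highest native level, which is the kind of
+situation the extrapolation settings of the interpolation scheme address.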
+
+It is also possible to extract the CMIP-specific, CMOR levels as they appear in
+the CMOR table, e.g. ``plev10`` or ``plev17`` or ``plev19`` etc:
+
+.. code-block:: yaml
+
+ preprocessors:
+   preproc_select_levels_from_cmip_table:
+     extract_levels:
+       levels: {cmor_table: CMIP6, coordinate: plev10}
+       scheme: nearest
+
+It is also useful to extract the levels of a particular dataset without
+having to inspect that dataset to find out its specific levels. The example
+below offers two alternatives to extract the levels and vertically regrid onto
+the vertical levels of ``ERA-Interim``:
+
+.. code-block:: yaml
+
+ preprocessors:
+   preproc_select_levels_from_dataset:
+     extract_levels:
+       levels: ERA-Interim
+       # This also works, but allows specifying the pressure coordinate name
+       # levels: {dataset: ERA-Interim, coordinate: air_pressure}
+       scheme: linear_horizontal_extrapolate_vertical
+* See also :func:`esmvalcore.preprocessor.extract_levels`.
+* See also :func:`esmvalcore.preprocessor.get_cmor_levels`.
+
+.. note::
-Land/Sea/Ice masking
-====================
+ For both vertical and horizontal regridding one can control the
+ extrapolation mode when defining the interpolation scheme. Controlling the
+ extrapolation mode allows us to avoid situations where extrapolating values
+ makes little physical sense (e.g. extrapolating beyond the last data point).
+ The extrapolation mode is controlled by the ``extrapolation_mode``
+ keyword. For the interpolation schemes available in Iris, the
+ ``extrapolation_mode`` keyword must be one of:
+
+ * ``extrapolate``: the extrapolation points will be calculated by
+ extending the gradient of the closest two points;
+ * ``error``: a ``ValueError`` exception will be raised, notifying an
+ attempt to extrapolate;
+ * ``nan``: the extrapolation points will be set to NaN;
+ * ``mask``: the extrapolation points will always be masked, even if the
+ source data is not a ``MaskedArray``; or
+ * ``nanmask``: if the source data is a MaskedArray the extrapolation
+ points will be masked, otherwise they will be set to NaN.
+
+
+.. _masking:
+
+Masking
+=======
+
+Introduction to masking
+-----------------------
Certain metrics and diagnostics need to be computed and performed on specific
domains on the globe. The ESMValTool preprocessor supports filtering
@@ -99,13 +212,18 @@ are used: although these are not model-specific, they represent a good
approximation since they have a much higher resolution than most of the models
and they are regularly updated with changing geographical features.
+.. _land/sea/ice masking:
+
+Land-sea masking
+----------------
+
In ESMValTool, land-sea-ice masking can be done in two places: in the
preprocessor, to apply a mask on the data before any subsequent preprocessing
step and before running the diagnostic, or in the diagnostic scripts
themselves. We present both these implementations below.
To mask out a certain domain (e.g., sea) in the preprocessor,
-`mask_landsea` can be used:
+``mask_landsea`` can be used:
.. code-block:: yaml
@@ -114,12 +232,11 @@ To mask out a certain domain (e.g., sea) in the preprocessor,
mask_landsea:
mask_out: sea
-and requires only one argument:
-* mask_out: either land or sea.
+and requires only one argument: ``mask_out``: either ``land`` or ``sea``.
-The preprocessor automatically retrieves the corresponding mask (`fx: stfof` in
-this case) and applies it so that sea-covered grid cells are set to
-missing. Conversely, it retrieves the `fx: sftlf` mask when land need to be
+The preprocessor automatically retrieves the corresponding mask (``fx: sftof``
+in this case) and applies it so that sea-covered grid cells are set to
+missing. Conversely, it retrieves the ``fx: sftlf`` mask when land needs to be
masked out, respectively. If the corresponding fx file is not found (which is
the case for some models and almost all observational datasets), the
preprocessor attempts to mask the data using Natural Earth mask files (that are
@@ -127,9 +244,14 @@ vectorized rasters). As mentioned above, the spatial resolution of the the
Natural Earth masks are much higher than any typical global model (10m for
land and 50m for ocean masks).
+See also :func:`esmvalcore.preprocessor.mask_landsea`.
+
+Ice masking
+-----------
+
Note that for masking out ice sheets, the preprocessor uses a different
function, to ensure that both land and sea or ice can be masked out without
-losing generality. To mask ice out, `mask_landseaice` can be used:
+losing generality. To mask ice out, ``mask_landseaice`` can be used:
.. code-block:: yaml
@@ -138,149 +260,363 @@ losing generality. To mask ice out, `mask_landseaice` can be used:
mask_landseaice:
mask_out: ice
-and requires only one argument:
-* mask_out: either landsea or ice.
+and requires only one argument: ``mask_out``: either ``landsea`` or ``ice``.
-As in the case of `mask_landsea`, the preprocessor automatically retrieves the
-`fx: sftgif` mask.
+As in the case of ``mask_landsea``, the preprocessor automatically retrieves
+the ``fx_files: [sftgif]`` mask.
-Another option is to just read the fx masks as any other CMOR variable and use
-it within a diagnostic script. This can be done in the variable dictionary by
-specifiying the desired fx variables (masks):
+See also :func:`esmvalcore.preprocessor.mask_landseaice`.
-.. warning::
- Code snippet, text and link to function to be added (after #1037 and #1075
- are closed).
+Mask files
+----------
+
+At the core of the land/sea/ice masking in the preprocessor are the mask files
+(either fx files or Natural Earth files). The fx files (but not the Natural
+Earth files) can also be retrieved and used in the diagnostic phase, by
+specifying the ``fx_files:`` key for the variable in the diagnostic section of
+the recipe and populating it with a list of the desired files, e.g.:
+
+.. code-block:: yaml
+
+    variables:
+      ta:
+        preprocessor: my_masking_preprocessor
+        fx_files: [sftlf, sftof, sftgif, areacello, areacella]
+
+Such a recipe will automatically retrieve all the fx files listed in
+``fx_files: [sftlf, sftof, sftgif, areacello, areacella]`` for each of the
+variables that need them; in the diagnostic phase, these mask files are then
+available for the developer to use as needed. The ``fx_files`` attribute of
+the large nested ``variable`` dictionary that gets passed to the diagnostic is
+itself a dictionary, and its members can be accessed in the diagnostic through
+a simple loop over the items of the ``config`` diagnostic variable, e.g.:
+
+.. code-block:: python
+
+    # config is the diagnostic configuration dictionary described above
+    for filename, attributes in config['input_data'].items():
+        sftlf_file = attributes['fx_files']['sftlf']
+        areacello_file = attributes['fx_files']['areacello']
+
+.. _masking of missing values:
+
+Missing values masks
+--------------------
+
+Missing (masked) values can be a nuisance especially when dealing with
+multimodel ensembles and having to compute multimodel statistics; different
+numbers of missing data from dataset to dataset may introduce biases and
+artificially assign more weight to the datasets that have less missing
+data. This is handled in ESMValTool via the missing values masks: two types of
+such masks are available, one for the multimodel case and another for the
+single model case.
+
+The multimodel missing values mask (``mask_fillvalues``) is a preprocessor step
+that usually comes after all the single-model steps (regridding, area selection
+etc) have been performed; in a nutshell, it combines missing values masks from
+individual models into a multimodel missing values mask; the individual model
+masks are built according to common criteria: the user chooses a time window in
+which missing data points are counted, and if the number of missing data points
+relative to the number of total data points in a window is less than a chosen
+fractional threshold, the window is discarded i.e. all the points in the window
+are masked (set to missing).
+
+.. code-block:: yaml
+
+    preprocessors:
+      missing_values_preprocessor:
+        mask_fillvalues:
+          threshold_fraction: 0.95
+          min_value: 19.0
+          time_window: 10.0
+
+In the example above, the fractional threshold for missing data vs. total data
+is set to 95% and the time window is set to 10.0 (in units of the time
+coordinate). Optionally, a minimum value threshold can be applied; in this case
+it is set to 19.0 (in units of the variable).
+
+See also :func:`esmvalcore.preprocessor.mask_fillvalues`.
+
+.. note::
+
+    It is possible to use ``mask_fillvalues`` to create a combined multimodel
+    mask (all the masks from all the analyzed models combined into a single
+    mask); for that purpose setting the ``threshold_fraction`` to 0 will not
+    discard any time windows, essentially keeping the original model masks and
+    combining them into a single mask; here is an example:
+
+    .. code-block:: yaml
+
+        preprocessors:
+          missing_values_preprocessor:
+            mask_fillvalues:
+              threshold_fraction: 0.0     # keep all missing values
+              min_value: -1e20            # small enough not to alter the data
+              #  time_window: 10.0        # this will not matter anymore
+
+Minimum, maximum and interval masking
+-------------------------------------
+
+Thresholding on minimum and maximum accepted data values can also be performed:
+masks are constructed based on the results of thresholding; inside and outside
+interval thresholding and masking can also be performed. These functions are
+``mask_above_threshold``, ``mask_below_threshold``, ``mask_inside_range``, and
+``mask_outside_range``.
+
+These functions always take a cube as first argument and either ``threshold``
+for threshold masking or the pair ``minimum``, ``maximum`` for interval masking.
+
+See also :func:`esmvalcore.preprocessor.mask_above_threshold` and related
+functions.
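+
+As a minimal sketch of how these functions might appear in a recipe, using
+only the argument names given above (the preprocessor names and the numerical
+values are purely illustrative assumptions, not recommended settings):
+
+.. code-block:: yaml
+
+    preprocessors:
+      threshold_preprocessor:     # hypothetical name
+        mask_above_threshold:
+          threshold: 280.0        # illustrative value
+      range_preprocessor:         # hypothetical name
+        mask_outside_range:
+          minimum: 250.0          # illustrative value
+          maximum: 300.0          # illustrative value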
+
+
+.. _Horizontal regridding:
Horizontal regridding
=====================
-.. warning::
- Documentation of _regrid.py (part 2) to be added
+Regridding is necessary when various datasets are available on a variety of
+`lat-lon` grids and they need to be brought together on a common grid (for
+various statistical operations e.g. multimodel statistics or for e.g. direct
+inter-comparison or comparison with observational datasets). Regridding is
+conceptually a very similar process to interpolation (in fact, the regridder
+engine uses interpolation and extrapolation, with various schemes). The primary
+difference is that interpolation is based on sample data points, while
+regridding is based on the horizontal grid of another cube (the reference
+grid).
-Masking of missing values
-=========================
-.. warning::
- Documentation of _mask.py (part 2) to be added
+The underlying regridding mechanism in ESMValTool uses the `cube.regrid()
+`_
+from Iris.
+The use of the horizontal regridding functionality is flexible depending on
+what type of reference grid and what interpolation scheme is preferred. Below
+we show a few examples.
-Multi-model statistics
-======================
+Regridding on a reference dataset grid
+--------------------------------------
-.. warning::
- Documentation of _multimodel.py to be added.
+The example below shows how to regrid on the reference dataset ``ERA-Interim``
+(observational data, but CMIP, obs4mips, or ana4mips datasets can be used just
+as well); in this case the ``scheme`` is ``linear``.
+
+.. code-block:: yaml
-Information on maximum memory required: In the most general case, we can set
-upper limits on the maximum memory the analysis will require:
+    preprocessors:
+      regrid_preprocessor:
+        regrid:
+          target_grid: ERA-Interim
+          scheme: linear
-Ms = (R + N) x F_eff - F_eff - when no multimodel analysis is performed;
-Mm = (2R + N) x F_eff - 2F_eff - when multimodel analysis is performed;
+Regridding on an ``MxN`` grid specification
+-------------------------------------------
-where
+The example below shows how to regrid on a reference grid with a cell
+specification of ``2.5x2.5`` degrees. This is similar to regridding on
+reference datasets, but in the previous case the reference dataset grid cell
+specifications are not necessarily known a priori. Regridding on an ``MxN``
+cell specification is oftentimes used when operating on localized data.
-Ms: maximum memory for non-multimodel module
-Mm: maximum memory for multimodel module
-R: computational efficiency of module (typically 2-3)
-N: number of datasets
-F_eff: average size of data per dataset where F_eff = e x f x F
-where e is the factor that describes how lazy the data is (e = 1 for fully
-realized data) and f describes how much the data was shrunk by the immediately
-previous module e.g. time extraction, area selection or level extraction; note
-that for fix_data f relates only to the time extraction, if data is exact in
-time (no time selection) f = 1 for fix_data.
+.. code-block:: yaml
-So for cases when we deal with a lot of datasets (R + N = N), data is fully
-realized, assuming an average size of 1.5GB for 10 years of 3D netCDF data, N
-datasets will require:
+    preprocessors:
+      regrid_preprocessor:
+        regrid:
+          target_grid: 2.5x2.5
+          scheme: nearest
-Ms = 1.5 x (N - 1) GB
-Mm = 1.5 x (N - 2) GB
+In this case the ``NearestNeighbour`` interpolation scheme is used (see below
+for scheme definitions).
+When using an ``MxN`` type of grid it is possible to offset the grid cell
+centrepoints using the ``lat_offset`` and ``lon_offset`` arguments:
-Time operations
-===============
+* ``lat_offset``: offsets the grid centers of the latitude coordinate w.r.t. the
+  pole by half a grid step;
+* ``lon_offset``: offsets the grid centers of the longitude coordinate
+  w.r.t. Greenwich meridian by half a grid step.
-The time operations module contains a broad set of functions to subset data and apply
-statistical operators along the temporal coordinate of the input data:
-
-| `1. extract_time`_: extract a specified time range from a cube.
-| `2. extract_season`_: extract only the times that occur within a specific
- season.
-| `3. extract_month`_: extract only the times that occur within a specific
- month.
-| `4. time_average`_: take the weighted average over the entire time dimension.
-| `5. seasonal_mean`_: produce a mean for each season (DJF, MAM, JJA, SON)
-| `6. annual_mean`_: produce an annual or decadal mean.
-| `7. regrid_time`_: align the time axis of each dataset to have common time
- points and calendars.
-
-1. extract_time
----------------
+.. code-block:: yaml
+
+    preprocessors:
+      regrid_preprocessor:
+        regrid:
+          target_grid: 2.5x2.5
+          lon_offset: True
+          lat_offset: True
+          scheme: nearest
+
+Regridding (interpolation, extrapolation) schemes
+-------------------------------------------------
+
+The schemes used for the interpolation and extrapolation operations needed by
+the horizontal regridding functionality directly map to their corresponding
+implementations in Iris:
+
+* ``linear``: `Linear(extrapolation_mode='mask') `_.
+* ``linear_extrapolate``: `Linear(extrapolation_mode='extrapolate') `_.
+* ``nearest``: `Nearest(extrapolation_mode='mask') `_.
+* ``area_weighted``: `AreaWeighted() `_.
+* ``unstructured_nearest``: `UnstructuredNearest() `_.
+
+See also :func:`esmvalcore.preprocessor.regrid`.
+
+.. note::
+
+    For both vertical and horizontal regridding one can control the
+    extrapolation mode when defining the interpolation scheme. Controlling the
+    extrapolation mode allows us to avoid situations where extrapolating values
+    makes little physical sense (e.g. extrapolating beyond the last data
+    point). The extrapolation mode is controlled by the ``extrapolation_mode``
+    keyword. For the interpolation schemes available in Iris, the
+    ``extrapolation_mode`` keyword must be one of:
+
+    * ``extrapolate`` – the extrapolation points will be calculated by
+      extending the gradient of the closest two points;
+    * ``error`` – a ``ValueError`` exception will be raised, notifying an
+      attempt to extrapolate;
+    * ``nan`` – the extrapolation points will be set to NaN;
+    * ``mask`` – the extrapolation points will always be masked, even if
+      the source data is not a ``MaskedArray``; or
+    * ``nanmask`` – if the source data is a ``MaskedArray`` the extrapolation
+      points will be masked, otherwise they will be set to NaN.
+
+.. note::
+
+    The regridding mechanism is (at the moment) done with fully realized data in
+    memory, so depending on how fine the target grid is, it may use a rather
+    large amount of memory. Empirically, target grids of up to ``0.5x0.5``
+    degrees should not produce any memory-related issues, but be advised that
+    for resolutions of ``< 0.5`` degrees the regridding becomes very slow and
+    will use a lot of memory.
+
+
+.. _multi-model statistics:
+
+Multi-model statistics
+======================
+Computing multi-model statistics is an integral part of model analysis and
+evaluation: individual models display a variety of biases depending on model
+set-up, initial conditions, forcings and implementation; comparing model data
+to observational data, these biases have a significantly lower statistical
+impact when using a multi-model ensemble. ESMValTool has the capability of
+computing a number of multi-model statistical measures: using the preprocessor
+module ``multi_model_statistics`` will enable the user to ask for either a
+multi-model ``mean`` and/or ``median`` with a set of argument parameters passed
+to ``multi_model_statistics``.
+
+Multimodel statistics in ESMValTool are computed along the time axis, and as
+such, can be computed across a common overlap in time (by specifying ``span:
+overlap`` argument) or across the full length in time of each model (by
+specifying ``span: full`` argument).
+
+Restrictive computation is also available by excluding any set of models that
+the user will not want to include in the statistics (by setting ``exclude:
+[excluded models list]`` argument). The implementation has a few restrictions
+that apply to the input data: model datasets must have consistent shapes, and
+from a statistical point of view, this is needed since weights are not yet
+implemented; also higher dimensional data is not supported (i.e. anything with
+dimensionality higher than four: time, vertical axis, two horizontal axes).
+
+.. code-block:: yaml
+
+    preprocessors:
+      multimodel_preprocessor:
+        multi_model_statistics:
+          span: overlap
+          statistics: [mean, median]
+          exclude: [NCEP]
+
+See also :func:`esmvalcore.preprocessor.multi_model_statistics`.
+
+.. note::
+
+    The multimodel array operations, although performed in
+    per-time/per-horizontal level loops to save memory, can nevertheless be
+    rather memory-intensive (since they are not yet performed lazily). The
+    section on :ref:`Memory use` details the memory intake for different run
+    scenarios, but as a rule of thumb, for the multimodel preprocessor, the
+    expected maximum memory intake can be approximated as the number of
+    datasets multiplied by the average size in memory for one dataset.
+
+.. _time operations:
+
+Time manipulation
+=================
+The ``_time.py`` module contains the following preprocessor functions:
+
+* ``extract_time``: Extract a time range from an Iris ``cube``.
+* ``extract_season``: Extract only the times that occur within a specific
+  season.
+* ``extract_month``: Extract only the times that occur within a specific month.
+* ``time_average``: Take the weighted average over the time dimension.
+* ``seasonal_mean``: Produce a mean for each season (DJF, MAM, JJA, SON).
+* ``annual_mean``: Produce an annual or decadal mean.
+* ``regrid_time``: Align the time axis of each dataset to have common time
+  points and calendars.
+
+``extract_time``
+----------------
This function subsets a dataset between two points in times. It removes all
times in the dataset before the first time and after the last time point.
The required arguments are relatively self explanatory:
-* start_year
-* start_month
-* start_day
-* end_year
-* end_month
-* end_day
+* ``start_year``
+* ``start_month``
+* ``start_day``
+* ``end_year``
+* ``end_month``
+* ``end_day``
-These start and end points are set using the datasets native calendar. All six
-arguments should be given as integers, named month strings (e.g., March) will
-not be accepted. Note that start_year and end_year can be omitted, as they are
-filled in automatically from the dataset definition if not specified
-here (end_year will be the value in the dataset definition + 1).
+These start and end points are set using the dataset's native calendar.
+All six arguments should be given as integers; named month strings (e.g.
+March) will not be accepted.
See also :func:`esmvalcore.preprocessor.extract_time`.
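+
+As a minimal sketch, a recipe preprocessor using these six arguments might
+look like the following; the preprocessor name and the dates are illustrative
+assumptions only:
+
+.. code-block:: yaml
+
+    preprocessors:
+      extract_time_preprocessor:   # hypothetical name
+        extract_time:
+          start_year: 1950         # illustrative dates, given as integers
+          start_month: 1
+          start_day: 1
+          end_year: 1999
+          end_month: 12
+          end_day: 31
+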
-2. extract_season
------------------
+``extract_season``
+------------------
Extract only the times that occur within a specific season.
-This function only has one argument:
-
-* season: DJF, MAM, JJA, or SON
+This function only has one argument: ``season``. This is the named season to
+extract, i.e. DJF, MAM, JJA or SON.
Note that this function does not change the time resolution. If your original
data is in monthly time resolution, then this function will return three
monthly datapoints per year.
-To calculate a seasonal average, this function needs to be combined with the
-seasonal_mean function, below.
+If you want the seasonal average, then this function needs to be combined with
+the seasonal_mean function, below.
See also :func:`esmvalcore.preprocessor.extract_season`.
-3. extract_month
-----------------
+``extract_month``
+-----------------
The function extracts the times that occur within a specific month.
-This function only has one argument:
-
-* month: [1-12]
-
-Note that named month strings will not be accepted.
+This function only has one argument: ``month``. This value should be an integer
+between 1 and 12 as the named month string will not be accepted.
See also :func:`esmvalcore.preprocessor.extract_month`.
-4. time_average
----------------
+.. _time_average:
+
+``time_average``
+----------------
This function takes the weighted average over the time dimension. This
function requires no arguments and removes the time dimension of the cube.
See also :func:`esmvalcore.preprocessor.time_average`.
-5. seasonal_mean
-----------------
+``seasonal_mean``
+-----------------
This function produces a seasonal mean for each season (DJF, MAM, JJA, SON).
Note that this function will not check for missing time points. For instance,
-if the DJF field is selected, but the input datasets starts on January 1st,
+if you are looking at the DJF field, but your dataset starts on January 1st,
the first DJF field will only contain data from January and February.
We recommend using the extract_time to start the dataset from the following
@@ -288,86 +624,83 @@ December and remove such biased initial datapoints.
See also :func:`esmvalcore.preprocessor.seasonal_mean`.
-6. annual_mean
---------------
+``annual_mean``
+---------------
-This function produces an annual or a decadal mean. It takes a single boolean
-switch as argument:
-* decadal: set this to true to calculate decadal averages instead of annual
-averages.
+This function produces an annual or a decadal mean. The only argument is the
+``decadal`` boolean switch. When this switch is set to true, this function
+will output the decadal averages.
See also :func:`esmvalcore.preprocessor.annual_mean`.
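+
+For illustration, switching on decadal averaging in a recipe might be sketched
+as follows (the preprocessor name is a hypothetical choice):
+
+.. code-block:: yaml
+
+    preprocessors:
+      decadal_mean_preprocessor:   # hypothetical name
+        annual_mean:
+          decadal: true            # output decadal instead of annual averages
+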
-7. regrid_time
---------------
+``regrid_time``
+---------------
-This function aligns the time points of each component dataset to allow the
-subtraction of two Iris cubes from different datasets. The operation makes the
+This function aligns the time points of each component dataset so that the Iris
+cubes from different datasets can be subtracted. The operation makes the
datasets time points common and sets common calendars; it also resets the time
-bounds and auxiliary coordinates to reflect the artificially shifted time
-points. The current implementation works for monthly, daily, 6 hourly, 3
-hourly and hourly data. It takes a string representing the data frequency as
-an input argument:
-* frequency: mon, day, 1hr, 3hr, or 6hr
-
+bounds and auxiliary coordinates to reflect the artificially shifted time
+points. The current implementation works for monthly and daily data; the
+``frequency`` is set automatically from the variable CMOR table unless a custom
+``frequency`` is set manually by the user in the recipe.
+
See also :func:`esmvalcore.preprocessor.regrid_time`.
-Area operations
-===============
+.. _area operations:
-.. warning::
- Need to be adapted after renaming action in #1123
+Area manipulation
+=================
+The ``_area.py`` module contains the following preprocessor functions:
-The area manipulation module contains the following preprocessor functions:
+* ``extract_region``: Extract a region from a cube based on ``lat/lon``
+  corners.
+* ``zonal_means``: Calculate the zonal or meridional means.
+* ``area_statistics``: Calculate the average value over a region.
+* ``extract_named_regions``: Extract a specific region from the region
+  coordinate.
-| `1. extract_region`_: extract a region from a cube based on lat/lon corners.
-| `2. zonal_means`_: calculate the zonal or meridional means.
-| `3. area_statistics`_: calculate the average value over a region.
-| `4. extract_named_regions`_: extract a region from a cube given its name.
-1. extract_region
------------------
+``extract_region``
+------------------
This function masks data outside a rectagular region requested. The boundairies
of the region are provided as latitude and longitude coordinates in the
arguments:
-* start_longitude
-* end_longitude
-* start_latitude
-* end_latitude
+* ``start_longitude``
+* ``end_longitude``
+* ``start_latitude``
+* ``end_latitude``
Note that this function can only be used to extract a rectangular region.
See also :func:`esmvalcore.preprocessor.extract_region`.
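+
+A rectangular region could be requested along the following lines; this is a
+sketch only, and the preprocessor name and the corner coordinates are
+illustrative assumptions:
+
+.. code-block:: yaml
+
+    preprocessors:
+      extract_region_preprocessor:   # hypothetical name
+        extract_region:
+          start_longitude: -80.      # illustrative corner coordinates
+          end_longitude: 30.
+          start_latitude: -80.
+          end_latitude: 80.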
-2. zonal_means
---------------
+
+``zonal_means``
+---------------
The function calculates the zonal or meridional means. While this function is
-named `zonal_mean`, it can be used to apply several different operations in
-an zonal or meridional direction.
-This function takes two arguments:
+named ``zonal_means``, it can be used to apply several different operations in
+a zonal or meridional direction. This function takes two arguments:
-* coordinate: Which direction to apply the operation: latitude or longitude.
-* mean_type: Which operation to apply: mean, std_dev, variance, median, min or
-* max.
+* ``coordinate``: Which direction to apply the operation: latitude or longitude.
+* ``mean_type``: Which operation to apply: mean, std_dev, variance, median, min
+  or max.
See also :func:`esmvalcore.preprocessor.zonal_means`.
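+
+As a sketch using the two arguments listed above (the preprocessor name is a
+hypothetical choice):
+
+.. code-block:: yaml
+
+    preprocessors:
+      zonal_means_preprocessor:   # hypothetical name
+        zonal_means:
+          coordinate: longitude   # direction in which to apply the operation
+          mean_type: mean         # operation to apply
+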
-3. area_statistics
-------------------
-This function calculates the average value over a region - weighted by the
-cell areas of the region.
+``area_statistics``
+-------------------
-This function takes one argument:
-* operator: the name of the operation to apply.
+This function calculates the average value over a region, weighted by the cell
+areas of the region. It takes one argument, ``operator``: the name of the
+operation to apply.
This function can be used to apply several different operations in the
-horizonal plane: mean, standard deviation, median variance, minimum and
-maximum.
+horizontal plane: mean, standard deviation, median, variance, minimum and
+maximum.
Note that this function is applied over the entire dataset. If only a specific
region, depth layer or time period is required, then those regions need to be
@@ -375,107 +708,113 @@ removed using other preprocessor operations in advance.
See also :func:`esmvalcore.preprocessor.area_statistics`.
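+
+A minimal sketch of an area-weighted mean, assuming the operator is spelled
+``mean`` in the recipe (the preprocessor name is hypothetical):
+
+.. code-block:: yaml
+
+    preprocessors:
+      area_mean_preprocessor:   # hypothetical name
+        area_statistics:
+          operator: mean        # assumed spelling of the mean operation
+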
-4. extract_named_regions
-------------------------
-This function extract a specific named region from the data.
-This function takes onw argument:
+``extract_named_regions``
+-------------------------
-* regions: either a string or a list of strings of named regions.
-
-Note that the dataset must have a `region` cooordinate which includes a list of
-strings as values. This function then matches the named regions against the
-requested string.
+This function extracts a specific named region from the data. This function
+takes the following argument: ``regions``, which is either a string or a list
+of strings of named regions. Note that the dataset must have a ``region``
+coordinate which includes a list of strings as values. This function then
+matches the named regions against the requested string.
See also :func:`esmvalcore.preprocessor.extract_named_regions`.
-Volume operations
-=================
+.. _volume operations:
-The volume operations module contains the following preprocessor functions:
+Volume manipulation
+===================
+The ``_volume.py`` module contains the following preprocessor functions:
-| `1. extract_volume`_: extract a specific depth range from a cube.
-| `2. volume_statistics`_: calculate the volume-weighted average.
-| `3. depth_integration`_: integrate over the depth dimension.
-| `4. extract_transect`_: extract data along a line of constant latitude or
- longitude.
-| `5. extract_trajectory`_: extract data along a specified trajectory.
+* ``extract_volume``: Extract a specific depth range from a cube.
+* ``volume_statistics``: Calculate the volume-weighted average.
+* ``depth_integration``: Integrate over the depth dimension.
+* ``extract_transect``: Extract data along a line of constant latitude or
+  longitude.
+* ``extract_trajectory``: Extract data along a specified trajectory.
-1. extract_volume
------------------
-This function extracts a specific range in the z-direction from a cube.
-This function takes two arguments:
+``extract_volume``
+------------------
-* z_min: minimum in the z direction
-* z_max: maximum in the z direction
+Extract a specific range in the `z`-direction from a cube. This function
+takes two arguments, a minimum and a maximum (``z_min`` and ``z_max``,
+respectively) in the `z`-direction.
-Note that this requires the requested z-coordinate range to be the same sign as
-the Iris cube, i.e. if the cube has z-coordinate as negative, then z_min and
-z_max need to be negative numbers.
+Note that this requires the requested `z`-coordinate range to be the same sign
+as the Iris cube, i.e. if the cube has a negative `z`-coordinate, then
+``z_min`` and ``z_max`` need to be negative numbers.
See also :func:`esmvalcore.preprocessor.extract_volume`.
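+
+As a sketch, extracting a depth range might look like the following; the
+preprocessor name and the depth values are illustrative assumptions, and the
+values assume a cube whose `z`-coordinate is positive:
+
+.. code-block:: yaml
+
+    preprocessors:
+      extract_volume_preprocessor:   # hypothetical name
+        extract_volume:
+          z_min: 0.                  # illustrative values, same sign as the
+          z_max: 100.                # z-coordinate of the cube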
-2. volume_statistics
---------------------
+
+``volume_statistics``
+---------------------
This function calculates the volume-weighted average across three dimensions,
but maintains the time dimension.
-This function takes one argument:
-* operator: operation to apply over the volume (at the moment only mean is implemented)
+This function takes the argument: ``operator``, which defines the operation to
+apply over the volume.
-No depth coordinate is required as this is determined by Iris. This
-function works best when the fx files provide the cell volume.
+No depth coordinate is required as this is determined by Iris. This function
+works best when the ``fx_files`` provide the cell volume.
See also :func:`esmvalcore.preprocessor.volume_statistics`.
-3. depth_integration
---------------------
+``depth_integration``
+---------------------
-This function integrates over the depth dimension. It performs a weighted sum
-along the z-coordinate, and removes the z direction of the output cube. It takes no arguments.
+This function integrates over the depth dimension. This function does a
+weighted sum along the `z`-coordinate, and removes the `z` direction of the
+output cube. This preprocessor takes no arguments.
See also :func:`esmvalcore.preprocessor.depth_integration`.
-4. extract_transect
--------------------
+
+``extract_transect``
+--------------------
This function extracts data along a line of constant latitude or longitude.
-This function takes two arguments, although only one is strictly required:
-* latitude
-* longitude
+This function takes two arguments, although only one is strictly required.
+The two arguments are ``latitude`` and ``longitude``. One of these arguments
+needs to be set to a float, and the other can then be either ignored or set to
+a minimum or maximum value.
-One of these arguments needs to be set to a float, and the other can then be
-either ignored or set to a minimum or maximum value. For example, if latitude
-is set to 0 and longitude is left blank, the function would produce a cube
-along the equator. If latitude is set to to 0 and longitude to `[40., 100.]` it
-will produce a transect of the equator in the Indian Ocean.
+For example, if we set latitude to 0 N and leave longitude blank, it would
+produce a cube along the Equator. On the other hand, if we set latitude to 0
+and then set longitude to ``[40., 100.]`` this will produce a transect of the
+Equator in the Indian Ocean.
See also :func:`esmvalcore.preprocessor.extract_transect`.
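+
+The Indian Ocean example described above could be sketched in a recipe as
+follows (the preprocessor name is a hypothetical choice):
+
+.. code-block:: yaml
+
+    preprocessors:
+      equator_transect_preprocessor:   # hypothetical name
+        extract_transect:
+          latitude: 0.                 # transect along the Equator
+          longitude: [40., 100.]       # minimum and maximum longitude
+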
-5. extract_trajectory
----------------------
-This function extracts data along a specified trajectory. It requires three
-arguments:
-* latitude_points: list of latitude coordinates
-* longitude_points: list of longiute coordinates
-* number_points: if two points are provided, the `number_points` argument is
-used to set the number of places to extract between the two end points.
+``extract_trajectory``
+----------------------
-If more than two points are provided, then extract_trajectory will produce a
-cube which has extrapolated the data of the cube to those points, and
-`number_points` is not needed. Note that this function uses the expensive
-interpolate method, but it may be necessary for irregular grids.
+This function extracts data along a specified trajectory.
+The three arguments are: ``latitudes``, ``longitudes`` and the number of points
+needed for extrapolation, ``number_points``.
+
+If two points are provided, the ``number_points`` argument is used to set
+the number of places to extract between the two end points.
+
+If more than two points are provided, then ``extract_trajectory`` will produce
+a cube which has extrapolated the data of the cube to those points, and
+``number_points`` is not needed.
+
+Note that this function uses the expensive ``interpolate`` method from
+``iris.analysis.trajectory``, but it may be necessary for irregular grids.
See also :func:`esmvalcore.preprocessor.extract_trajectory`.
+.. _unit conversion:
Unit conversion
===============
+
Converting units is also supported. This is particularly useful in
cases where different datasets might have different units, for example
when comparing CMIP5 and CMIP6 variables where the units have changed
@@ -487,8 +826,49 @@ will guarantee homogeneous input for the diagnostics.
.. note::
Conversion is only supported between compatible units! In other
- words, converting temperature units from `degC` to `Kelvin` works
+ words, converting temperature units from ``degC`` to ``Kelvin`` works
fine, changing precipitation units from a rate based unit to an
amount based unit is not supported at the moment.
See also :func:`esmvalcore.preprocessor.convert_units`.
+
+
+.. _Memory use:
+
+Information on maximum memory required
+======================================
+In the most general case, we can set upper limits on the maximum memory the
+analysis will require:
+
+
+``Ms = (R + N) x F_eff - F_eff`` - when no multimodel analysis is performed;
+
+``Mm = (2R + N) x F_eff - 2F_eff`` - when multimodel analysis is performed;
+
+where
+
+* ``Ms``: maximum memory for non-multimodel module
+* ``Mm``: maximum memory for multimodel module
+* ``R``: computational efficiency of module; ``R`` is typically 2-3
+* ``N``: number of datasets
+* ``F_eff``: average size of data per dataset where ``F_eff = e x f x F``
+  where ``e`` is the factor that describes how lazy the data is (``e = 1`` for
+  fully realized data) and ``f`` describes how much the data was shrunk by the
+  immediately previous module, e.g. time extraction, area selection or level
+  extraction; note that for ``fix_data`` ``f`` relates only to the time
+  extraction; if data is exact in time (no time selection), ``f = 1`` for
+  ``fix_data``.
+
+So for cases when we deal with a lot of datasets (``R + N`` is approximately
+``N``), data is fully realized, and assuming an average size of 1.5GB for 10
+years of 3D netCDF data, ``N`` datasets will require:
+
+
+``Ms = 1.5 x (N - 1)`` GB
+
+``Mm = 1.5 x (N - 2)`` GB
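+
+As a worked illustration of the two estimates above, assuming a purely
+hypothetical run with ``N = 20`` fully realized datasets of 1.5GB each:
+
+.. math::
+
+    M_s = 1.5 \times (20 - 1) = 28.5 \ \mathrm{GB}
+
+    M_m = 1.5 \times (20 - 2) = 27 \ \mathrm{GB}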
+
+As a rule of thumb, the maximum required memory at a certain time for
+multimodel analysis could be estimated by multiplying the number of datasets by
+the average file size of all the datasets; this memory intake is high but also
+assumes that all data is fully realized in memory; this aspect will gradually
+change and the amount of realized data will decrease with the increase of
+``dask`` use.
diff --git a/doc/esmvalcore/recipe.rst b/doc/esmvalcore/recipe.rst
index dbecb0776c..2bfa58e161 100644
--- a/doc/esmvalcore/recipe.rst
+++ b/doc/esmvalcore/recipe.rst
@@ -1,30 +1,41 @@
.. _recipe:
-*****************
-ESMValTool recipe
-*****************
+******
+Recipe
+******
-Recipes are the instructions telling ESMValTool about the user who wrote the
-recipe, the datasets which need to be run, the preprocessors that need to be
-applied, and the diagnostics which need to be run over the preprocessed data.
-This information is provided to ESMValTool in the recipe sections:
-`Documentation`_, `Datasets`_, `Preprocessors`_ and `Diagnostics`_,
-respectively.
+Overview
+========
+
+After ``config-user.yml``, the ``recipe.yml`` is the second file the user needs
+to pass to ``esmvaltool`` as a command line option at each run.
+Recipes contain the data and data analysis information and instructions needed
+to run the diagnostic(s), as well as specific diagnostic-related instructions.
+
+Broadly, recipes contain a general section summarizing the provenance and
+functionality of the diagnostics, the datasets which need to be run, the
+preprocessors that need to be applied, and the diagnostics which need to be run
+over the preprocessed data. This information is provided to ESMValTool in four
+main recipe sections: Documentation_, Datasets_, Preprocessors_ and
+Diagnostics_, respectively.
+.. _Documentation:
-Documentation
-=============
+Recipe section: ``documentation``
+=================================
The documentation section includes:
-- The recipe's author's user name
-- A description of the recipe
-- The user name of the maintainer
-- A list of scientific references
-- the project or projects associated with the recipe.
+- The recipe's author's user name (``authors``, matching the definitions in the
+ :ref:`config-ref`)
+- A description of the recipe (``description``, written in Markdown format)
+- A list of scientific references (``references``, matching the definitions in
+ the :ref:`config-ref`)
+- the project or projects associated with the recipe (``projects``, matching
+ the definitions in the :ref:`config-ref`)
-For example, please see the documentation section from the recipe:
-recipe_ocean_amoc.yml.
+For example, the documentation section of ``recipes/recipe_ocean_amoc.yml`` is
+the following:
.. code-block:: yaml
@@ -47,27 +58,33 @@ recipe_ocean_amoc.yml.
projects:
- ukesm
-Note that the authors, projects, and references will need to be included in the
-``config-references.yml`` file. The author name uses the format:
-`surname_name`. For instance, Mickey Mouse would be: `mouse_mickey`.
-Also note that this username is unlikely to be the same as the github
-user name.
+.. note::
+   All authors, projects, and references mentioned in the documentation
+   section of the recipe need to be included in the ``config-references.yml``
+   file. The author name uses the format: ``surname_name``. For instance, John
+   Doe would be: ``doe_john``. This information can be omitted by new users
+   whose name is not yet included in ``config-references.yml``.
+.. _Datasets:
-Datasets
-========
+Recipe section: ``datasets``
+============================
-The datasets section includes:
+The ``datasets`` section includes dictionaries that, via key-value pairs, define standardized
+data specifications:
-- Dataset name
-- Project (CMIP5 or 6, observations...)
-- Activity (CMIP6 only, sometimes it can be deduced automatically)
-- Experiment (historical/ RCP8.5 etc...)
-- Ensemble member
-- The time range
-- The model grid, gn or gr, (CMIP6 only).
-- Dataset alias. If not specified, a unique alias will be created
+- dataset name (key ``dataset``, value e.g. ``MPI-ESM-LR`` or ``UKESM1-0-LL``)
+- project (key ``project``, value ``CMIP5`` or ``CMIP6`` for CMIP data,
+ ``OBS`` for observational data, ``ana4mips`` for ana4mips data,
+ ``obs4mips`` for obs4mips data, ``EMAC`` for EMAC data)
+- experiment (key ``exp``, value e.g. ``historical``, ``amip``, ``piControl``,
+ ``RCP8.5``)
+- mip (for CMIP data, key ``mip``, value e.g. ``Amon``, ``Omon``, ``LImon``)
+- ensemble member (key ``ensemble``, value e.g. ``r1i1p1``, ``r1i1p1f1``)
+- time range (e.g. key-value ``start_year: 1982``, ``end_year: 1990``)
+- model grid (native grid ``grid: gn`` or regridded grid ``grid: gr``, for
+ CMIP6 data only).
For example, a datasets section could be:
@@ -80,24 +97,29 @@ For example, a datasets section could be:
Note that this section is not required, as datasets can also be provided in the
-`Diagnostics`_ section.
+Diagnostics_ section.
+.. _Preprocessors:
-Preprocessors
-=============
+Recipe section: ``preprocessors``
+=================================
The preprocessor section of the recipe includes one or more preprocesors, each
-of which may call one or several preprocessor functions.
+of which may call the execution of one or several preprocessor functions.
Each preprocessor section includes:
-- A preprocessor name.
-- A list of preprocesor functions to apply
-- Any Arguments given to the preprocessor functions.
-- The order that the preprocesor functions are applied can also be specified using the ``custom_order`` preprocesor function.
+- A preprocessor name (any name, under ``preprocessors``);
+- A list of preprocessor steps to be executed (choose from the API);
+- Any arguments given to the preprocessor steps (optional);
+- The order in which the preprocessor steps are applied can also be specified
+  using the ``custom_order`` preprocessor function.
-The following preprocessor is an example of a preprocessor that contains
-multiple preprocessor functions:
+The following snippet is an example of a preprocessor named ``prep_map`` that
+contains multiple preprocessing steps (:ref:`Horizontal regridding` with two
+arguments, :ref:`Time operations` with no arguments (i.e., calculating the
+average over the time dimension) and :ref:`Multi-model statistics` with two
+arguments):
.. code-block:: yaml
@@ -111,36 +133,44 @@ multiple preprocessor functions:
span: overlap
statistics: [mean ]
-If only the default preprocessor is needed, then this section can be omitted.
+.. note::
+
+   If only the default preprocessor is needed, no ``preprocessors`` section is
+   required: the workflow will apply a ``default`` preprocessor consisting of
+   only basic operations like loading data, applying CMOR checks and fixes
+   (:ref:`CMOR check and dataset-specific fixes`) and saving the data to disk.
+.. _Diagnostics:
-Diagnostics
-===========
+Recipe section: ``diagnostics``
+===============================
The diagnostics section includes one or more diagnostics. Each diagnostics will
-have:
+include:
-- A list of which variables to load
-- A description of the variables (optional)
-- Which preprocessor to apply to each variable
-- The script to run
-- The diagnostics can also include an optional ``additional_datasets`` section.
+- a list of which variables to load;
+- a description of the variables (optional);
+- the preprocessor to be applied to each variable;
+- the script to be run;
+- an optional ``additional_datasets`` section.
The ``additional_datasets`` can add datasets beyond those listed in the the
-`Datasets`_ section. This is useful if specific datasets need to be linked with
-a specific diagnostics. The addition datasets can be used to add variable
-specific datasets. This is also a good way to add observational datasets can be
-added to the diagnostic.
-
-The following example, taken from recipe_ocean_example.yml, shows a diagnostic
-named `diag_map`, which loads the temperature at the ocean surface between
-the years 2001 and 2003 and then passes it to the prep_map preprocessor.
-The result of this process is then passed to the ocean diagnostic map scipt,
-``ocean/diagnostic_maps.py``.
+Datasets_ section. This is useful if specific datasets need to be used only by
+a specific diagnostic. The ``additional_datasets`` can also be used to add
+variable-specific datasets. This is also a good way to add observational
+datasets, which are usually variable-specific.
+
+Running a simple diagnostic
+---------------------------
+The following example, taken from ``recipe_ocean_example.yml``, shows a
+diagnostic named ``diag_map``, which loads the temperature at the ocean surface
+between the years 2001 and 2003 and then passes it to the ``prep_map``
+preprocessor. The result of this process is then passed to the ocean diagnostic
+map script, ``ocean/diagnostic_maps.py``.
.. code-block:: yaml
- diagnostics:
+ diagnostics:
diag_map:
description: Global Ocean Surface regridded temperature map
@@ -157,32 +187,108 @@ To define a variable/dataset combination, the keys in the diagnostic section
are combined with the keys from datasets section. If two versions of the same
key are provided, then the key in the datasets section will take precedence
over the keys in variables section. For many recipes it makes more sense to
-define the ``start_year`` and ``end_year`` items in the variable section, because the
-diagnostic script assumes that all the data has the same time range.
+define the ``start_year`` and ``end_year`` items in the variable section,
+because the diagnostic script assumes that all the data has the same time
+range.
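+
+The precedence rule described above can be pictured as a dictionary merge.
+The sketch below is only an illustration and not the actual ESMValTool
+implementation; the helper ``combine_keys`` is hypothetical and the values are
+made up for the example.
+

```python
def combine_keys(variable, dataset):
    """Merge variable and dataset settings (hypothetical helper).

    Keys from the datasets section take precedence over keys from the
    variables section, as described in the text.
    """
    combined = dict(variable)
    combined.update(dataset)  # dataset keys override variable keys
    return combined


# illustrative values only
variable = {'preprocessor': 'prep_map', 'start_year': 2001, 'end_year': 2003}
dataset = {'dataset': 'UKESM1-0-LL', 'project': 'CMIP6', 'start_year': 1982}

combined = combine_keys(variable, dataset)
print(combined['start_year'], combined['end_year'])
```

+
+Here ``start_year`` is taken from the dataset entry, while ``end_year`` and
+``preprocessor`` come from the variable entry.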
Note that the path to the script provided in the `script` option should be
-either:
+either the absolute path to the script, or the path relative to the
+``esmvaltool/diag_scripts`` directory.
-1. the absolute path to the script.
-2. the path relative to the ``esmvaltool/diag_scripts`` directory.
+Passing arguments to a diagnostic
+---------------------------------
+The ``diagnostics`` section may include any number of arguments for use by the
+diagnostic script. These arguments are stored at run time in a dictionary that
+is made available to the diagnostic script via the interface link, independent
+of the language the diagnostic script is written in. Here is an example of
+such a group of arguments:
-As mentioned above, the datasets are provided in the `Diagnostics`_ section
-in this section. However, they could also be included in the `Datasets`_
-section.
+.. code-block:: yaml
+ scripts:
+ autoassess_strato_test_1: &autoassess_strato_test_1_settings
+ script: autoassess/autoassess_area_base.py
+ title: "Autoassess Stratosphere Diagnostic Metric MPI-MPI"
+ area: stratosphere
+ control_model: MPI-ESM-LR
+ exp_model: MPI-ESM-MR
+ obs_models: [ERA-Interim] # list to hold models that are NOT for metrics but for obs operations
+ additional_metrics: [ERA-Interim, inmcm4] # list to hold additional datasets for metrics
+
+In this example, apart from specifying the diagnostic script ``script:
+autoassess/autoassess_area_base.py``, we pass a suite of parameters to be used
+by the script (``area``, ``control_model`` etc). These parameters are stored in
+key-value pairs in the diagnostic configuration file, an interface file that
+can be used by importing the ``run_diagnostic`` utility:
+
+.. code-block:: python
+
+    from esmvaltool.diag_scripts.shared import run_diagnostic
+
+    # write the diagnostic code here e.g.
+    def run_some_diagnostic(my_area, my_control_model, my_exp_model):
+        """Diagnostic to be run."""
+        diag = None
+        if my_area == 'stratosphere':
+            # placeholder for the real computation: the settings hold
+            # dataset names (strings), so only a label is built here
+            diag = '{} / {}'.format(my_control_model, my_exp_model)
+        return diag
+
+    def main(cfg):
+        """Main diagnostic run function."""
+        # settings from the recipe are available as dictionary entries
+        my_area = cfg['area']
+        my_control_model = cfg['control_model']
+        my_exp_model = cfg['exp_model']
+        run_some_diagnostic(my_area, my_control_model, my_exp_model)
+
+    if __name__ == '__main__':
+
+        with run_diagnostic() as config:
+            main(config)
+
+This way a lot of the optional arguments necessary to a diagnostic are at the
+user's control via the recipe.
+
+Running your own diagnostic
+---------------------------
+If the user wants to test a newly-developed ``my_first_diagnostic.py`` which
+is not yet part of the ESMValTool diagnostics library, they can do so by
+passing the absolute path to the diagnostic:
-Brief introduction to YAML
-==========================
+.. code-block:: yaml
-While .yaml is a relatively common format, maybe users may not have
-encountered this language before. The key information about this format is:
+ diagnostics:
+
+ myFirstDiag:
+ description: John Doe wrote a funny diagnostic
+ variables:
+ tos: # Temperature at the ocean surface
+ preprocessor: prep_map
+ start_year: 2001
+ end_year: 2003
+ scripts:
+ JoeDiagFunny:
+ script: /home/users/john_doe/esmvaltool_testing/my_first_diagnostic.py
+
+This way the user may test a new diagnostic thoroughly before committing to the
+GitHub repository and including it in the ESMValTool diagnostics library.
+
+Re-using parameters from one ``script`` to another
+--------------------------------------------------
+Thanks to ``yaml`` anchors and merge keys, it is possible to reuse entire
+diagnostics sections in other diagnostics. Here is an example:
+
+.. code-block:: yaml
-- Yaml is a human friendly markup language.
-- Yaml is commonly used for configuration files.
-- the syntax is relatively straightforward
-- Indentation matters a lot (like python)!
-- yaml is case sensitive
-- A yml tutorial is available here: https://learnxinyminutes.com/docs/yaml/
-- A yml quick reference card is available here: https://yaml.org/refcard.html
-- ESMValTool uses the yamllint linter tool: http://www.yamllint.com
+ scripts:
+ cycle: &cycle_settings
+ script: perfmetrics/main.ncl
+ plot_type: cycle
+ time_avg: monthlyclim
+ grading: &grading_settings
+ <<: *cycle_settings
+ plot_type: cycle_latlon
+ calc_grading: true
+ normalization: [centered_median, none]
+
+In this example the anchor ``&cycle_settings`` is used to pass the ``cycle:``
+parameters to ``grading:`` via the merge key ``<<: *cycle_settings``; keys set
+explicitly under ``grading:``, such as ``plot_type``, override the merged ones.
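+
+The effect of the merge in the example above can be mimicked with plain Python
+dictionaries. This is a sketch of the resulting settings only, written by hand
+rather than produced by a ``yaml`` parser, to show which values ``grading``
+ends up with.
+

```python
# settings of the ``cycle`` script, as in the recipe snippet
cycle = {
    'script': 'perfmetrics/main.ncl',
    'plot_type': 'cycle',
    'time_avg': 'monthlyclim',
}

# ``grading`` starts from the ``cycle`` settings; keys given explicitly
# replace the inherited ones
grading = {
    **cycle,
    'plot_type': 'cycle_latlon',
    'calc_grading': True,
    'normalization': ['centered_median', 'none'],
}

print(grading['script'], grading['plot_type'], grading['time_avg'])
```

+
+``grading`` inherits ``script`` and ``time_avg`` from ``cycle`` and replaces
+``plot_type``, while the ``cycle`` settings themselves stay unchanged.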
diff --git a/doc/esmvalcore/utils.rst b/doc/esmvalcore/utils.rst
new file mode 100644
index 0000000000..15d4bdec02
--- /dev/null
+++ b/doc/esmvalcore/utils.rst
@@ -0,0 +1,26 @@
+.. _utils:
+
+*********
+Utilities
+*********
+
+This section provides extra information on topics that are not part of the
+ESMValTool code base but are used by ESMValTool directly or indirectly.
+
+Brief introduction to YAML
+==========================
+
+While ``.yaml`` or ``.yml`` is a relatively common format, users may not have
+encountered this language before. The key information about this format is:
+
+- yaml is a human-friendly markup language;
+- yaml is commonly used for configuration files (gradually replacing the
+  venerable ``.ini``);
+- the syntax is relatively straightforward;
+- indentation matters a lot (like ``Python``);
+- yaml is case sensitive.
+
+More information can be found in the `yaml tutorial
+<https://learnxinyminutes.com/docs/yaml/>`_ and `yaml quick reference card
+<https://yaml.org/refcard.html>`_. ESMValTool uses the `yamllint
+<http://www.yamllint.com>`_ linter tool to check recipe syntax.