From 09b356c24f7cb892454dfaf3879bbc994115923d Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Tue, 25 Jun 2019 12:55:08 +0100 Subject: [PATCH 01/49] shipshaped preprocessor section --- doc/sphinx/source/esmvalcore/preprocessor.inc | 185 ++++++++++++++---- 1 file changed, 146 insertions(+), 39 deletions(-) diff --git a/doc/sphinx/source/esmvalcore/preprocessor.inc b/doc/sphinx/source/esmvalcore/preprocessor.inc index 40323cb53b..da1b361238 100644 --- a/doc/sphinx/source/esmvalcore/preprocessor.inc +++ b/doc/sphinx/source/esmvalcore/preprocessor.inc @@ -3,7 +3,48 @@ ************ Preprocessor ************ -The ESMValTool preprocessor can be used to perform all types of climate data pre-processing needed before indices or diagnostics can be calculated. It is a base component for many other diagnostics and metrics shown on this portal. It can be applied to tailor the climate model data to the need of the user for its own calculations. + +Overview +======== + +ESMValTool is a modular Python 3.6+ software package possessing capabilities +of executing a large number of diagnostic routines +that can be written in a number of programming languages (Python, NCL, R, Julia). +The modular nature benefits the users and developers in different key areas: +a new feature developed specifically for version 2.0 is the preprocessing core or +the preprocessor (esmvalcore) that executes the bulk of standardized data operations +and is highly optimized for maximum performance in data-intensive tasks. The main +objective of the preprocessor is to integrate as many standardizable data analysis +functions as possible so that the diagnostics can focus on the specific scientific +tasks they carry out. The preprocessor is linked to the diagnostics library and the +diagnostic execution is seamlessly performed after the preprocessor has completed +its steps.
The benefits of having a preprocessing unit separate from the diagnostics +library include: + +* ease of integration of new preprocessing routines; +* ease of maintenance (including unit and integration testing) of existing routines; +* a straightforward manner of importing and using the preprocessing routines as part + of the overall usage of the software and, as a special case, the use during diagnostic execution; +* shifting the effort for the scientific diagnostic developer from implementing both standard + and diagnostic-specific functionalities to allowing them to dedicate most of the effort to + developing scientifically-relevant diagnostics and metrics; +* a more strict code review process, given the smaller code base than for diagnostics. + +The ESMValTool preprocessor can be used to perform a broad range of operations +on the input data before diagnostics or metrics are applied. The +preprocessor performs these operations in a centralized, documented and +efficient way, thus reducing the data processing load on the diagnostics side. + +Each of the preprocessor operations is written in a dedicated Python module and +all of them receive and return an Iris cube, working sequentially on the data +with no interactions between them. The order +in which the preprocessor operations are applied is set by default in order to +minimize the loss of information due to, for example, temporal and spatial +subsetting or multi-model averaging. Nevertheless, the user is free to change +such order to address specific scientific requirements, but keeping in mind +that some operations must necessarily be performed in a specific order. This is +the case, for instance, for multi-model statistics, which require the models to +be on a common grid and therefore have to be called after the regridding module.
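The ordering constraint on multi-model statistics can be sketched in a recipe
fragment. This is only an illustration under stated assumptions: the preprocessor
name `preproc_common_grid` is hypothetical, and the argument names `target_grid`,
`scheme` and `statistics` are assumed here rather than defined in this section.

.. code-block:: yaml

    preprocessors:
      preproc_common_grid:        # hypothetical preprocessor name
        regrid:                   # first bring every dataset onto a common grid
          target_grid: 1x1        # assumed argument, illustrative value
          scheme: linear          # assumed argument, illustrative value
        multi_model_statistics:   # relies on the common grid produced by regrid
          span: overlap
          statistics: [mean]      # assumed argument name

Listing both steps in one preprocessor keeps the dependency visible: the
statistics step only makes sense once all datasets share the same grid.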
Features of the ESMValTool Climate data pre-processor are: @@ -17,7 +58,38 @@ Features of the ESMValTool Climate data pre-processor are: Variable derivation =================== -Documentation of _derive.py +The variable derivation module allows to derive variables which are not in the +CMIP standard data request using standard variables as input. The typical use +case of this operation is the evaluation of a variable which is only available +in an observational dataset but not in the models. In this case a derivation +function is provided by the ESMValTool in order to calculate the variable and +perform the comparison. For example, several observational datasets deliver +total column ozone as observed variable (`toz`), but CMIP models only provide +the ozone 3D field. In this case, a derivation function is provided to +vertically integrate the ozone and obtain total column ozone for direct +comparison with the observations. + +To contribute a new derived variable, it is also necessary to define a name for +it and to provide the corresponding CMOR table. This is to guarantee the proper +metadata definition is attached to the derived data. Such custom CMOR tables +are collected as part of the `ESMValTool core package +`_. By default, the variable +derivation will be applied only if not already available in the input data, but +the derivation can be forced by setting the appropriate flag. + +.. code-block:: yaml + + variables: + toz: + derive: true + force_derivation: false + +The required arguments for this module are two boolean switches: +* derive: activate variable derivation +* force_derivation: force variable derivation even if the variable is +directly available in the input data. + +See also :func:`esmvalcore.preprocessor.derive`. Time manipulation @@ -296,53 +368,74 @@ Masking ======= Documentation of _mask.py (part 1) -1. 
Introduction to masking ---------------------------- - -Certain metrics and diagnostics need to be computed and performed on restricted regions of the Globe; ESMValTool supports subsetting the input data on land mass, oceans and seas, ice. This is achived by masking the model data and keeping only the values associated with grid points that correspond to e.g. land mass -or oceans and seas; masking is done either by using standard mask files that have the same grid resolution as the model data (these files are usually produced -at the same time with the model data and are called fx files) or, in the absence of these files, by using Natural Earth masks. Natural Earth masks, even if they are not model-specific, represent a good approximation since their grid resolution is almost always much higher than the model data, and they are constantly updated with changing -geographical features. +Introduction to masking +----------------------- -2. Land-sea masking -------------------- +Certain metrics and diagnostics need to be computed on specific +domains of the globe. The ESMValTool preprocessor supports filtering +the input data on continents, oceans/seas and ice. This is achieved by masking +the model data and keeping only the values associated with grid points that +correspond to, e.g., land, ocean or ice surfaces, as specified by the +user. Where possible, the masking is realized using the standard mask files +provided together with the model data as part of the CMIP data request (the +so-called fx variable). In the absence of these files, the Natural Earth masks +are used: although these are not model-specific, they represent a good +approximation since they have a much higher resolution than most of the models +and they are regularly updated with changing geographical features.
+ +Land-sea masking +---------------- -In ESMValTool v2 land-seas-ice masking can be done in two places: in the preprocessor, to apply a mask on the data before any subsequent preprocessing step, and before -running the diagnostic, or in the disgnostic phase. We present both these implementations below. +In ESMValTool, land-sea-ice masking can be done in two places: in the +preprocessor, to apply a mask on the data before any subsequent preprocessing +step and before running the diagnostic, or in the diagnostic scripts +themselves. We present both these implementations below. -To mask out seas in the preprocessor step, simply add `mask_landsea:` as a preprocessor step in the `preprocessor` of your choice section of the recipe, example: +To mask out a certain domain (e.g., sea) in the preprocessor, +`mask_landsea` can be used: -.. code-block:: bash +.. code-block:: yaml preprocessors: - my_masking_preprocessor: + preproc_mask: mask_landsea: mask_out: sea -The tool will retrieve the corresponding `fx: stfof` type of mask for each of the used variables and apply the mask so that only the land mass points are -kept in the data after applying the mask; conversely, it will retrieve the `fx: sftlf` files when land needs to be masked out. -`mask_out` accepts: land or sea as values. If the corresponding fx file is not found (some models are missing these -type of files; observational data is missing them altogether), then the tool attempts to mask using Natural Earth mask files (that are vectorized rasters). -Note that the resolutions for the Natural Earth masks are much higher than any usual CMIP model: 10m for land and 50m for ocean masks. +and requires only one argument: +* mask_out: either land or sea. -3. Ice masking ---------------- +The preprocessor automatically retrieves the corresponding mask (`fx: sftof` in +this case) and applies it so that sea-covered grid cells are set to +missing.
Conversely, it retrieves the `fx: sftlf` mask when land needs to be +masked out. If the corresponding fx file is not found (which is +the case for some models and almost all observational datasets), the +preprocessor attempts to mask the data using Natural Earth mask files (that are +vectorized rasters). As mentioned above, the spatial resolution of the +Natural Earth masks is much higher than that of any typical global model (10m for +land and 50m for ocean masks). -Note that for masking out ice the preprocessor is using a different function, this so that both land and sea or ice can be masked out without -losing generality. To mask ice out one needs to add the preprocessing step much as above: +Ice masking +----------- -.. code-block:: bash +Note that for masking out ice sheets, the preprocessor uses a different +function, to ensure that both land and sea or ice can be masked out without +losing generality. To mask ice out, `mask_landseaice` can be used: - preprocessors: - my_masking_preprocessor: - mask_landseaice: - mask_out: ice +.. code-block:: yaml -To keep only the ice, one needs to mask out landsea, so use that as value for mask_out. As in the case of mask_landsea, the tool will automatically -retrieve the `fx: sftgif` file corresponding the the used variable and extract the ice mask from it. + preprocessors: + preproc_mask: + mask_landseaice: + mask_out: ice -4. mask files --------------- +and requires only one argument: +* mask_out: either landsea or ice. + +As in the case of `mask_landsea`, the preprocessor automatically retrieves the +`fx: sftgif` mask. + +Mask files +---------- At the core of the land/sea/ice masking in the preprocessor are the mask files (whether it be fx type or Natural Earth type of files); these files (bar Natural Earth) can be retrived and used in the diagnostic phase as well or solely.
By specifying the `fx_files:` key in the variable in diagnostic in the recipe, and populating it @@ -366,8 +459,8 @@ the 'config' diagnostic variable items e.g.: sftlf_file = attributes['fx_files']['sftlf'] areacello_file = attributes['fx_files']['areacello'] -5. Missing values masks ------------------------ +Missing values masks +-------------------- Missing (masked) values can be a nuisance especially when dealing with multimodel ensembles and having to compute multimodel statistics; different numbers of missing data from dataset to datest may introduce biases and artifically @@ -396,8 +489,8 @@ to 19.0 (in units of the variable units). A similar preprocessor step exists for the single-dataset: mask_window_threshold (with the same arguments as mask_fillvalues). -6. Min, max and interval masking --------------------------------- +Min, max and interval masking +----------------------------- Thresholding on minimum and maximum accepted data values can also be performed: masks are constructed based on the results of thresholding; inside and outside interval thresholding and masking can also be performed. These functions @@ -413,7 +506,21 @@ Documentation of _mask.py (part 2) Multi-model statistics ====================== -Documentation of_multimodel.py +Computing multi-model statistics is an integral part of model analysis and evaluation: individual +models display a variety of biases depending on model set-up, initial conditions, forcings and +implementation; comparing model data to observational data, these biases have a significantly lower +statistical impact when using a multi-model ensemble. ESMValTool has the capability of computing a +number of multi-model statistical measures: using the preprocessor module `multi_model_statistics` +will enable the user to ask for either a multi-model `mean` and/or `median` with a set of argument +parameters passed to `multi_model_statistics`.
+Multimodel statistics in ESMValTool are computed along the time axis, and as such, +can be computed across a common overlap in time (by specifying `span: overlap` argument) or across the full length +in time of each model (by specifying `span: full` argument). +Restrictive computation is also available by excluding any set of models that the user +will not want to include in the statistics (by setting `exclude: [excluded models list]` argument). +The implementation has a few restrictions that apply to the input data: model datasets must have consistent shapes, +and from a statistical point of view, this is needed since weights are not yet implemented; also higher dimensional +data is not supported (i.e. anything with dimensionality higher than four: time, vertical axis, two horizontal axes). Time-area statistics ==================== From 2a1a83b7a39826271912479e90f577d7c2cc5710 Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Tue, 25 Jun 2019 13:52:57 +0100 Subject: [PATCH 02/49] more shipshaping preprocessor inc --- doc/sphinx/source/esmvalcore/preprocessor.inc | 185 ++++++++++-------- 1 file changed, 105 insertions(+), 80 deletions(-) diff --git a/doc/sphinx/source/esmvalcore/preprocessor.inc b/doc/sphinx/source/esmvalcore/preprocessor.inc index da1b361238..01a87a90cb 100644 --- a/doc/sphinx/source/esmvalcore/preprocessor.inc +++ b/doc/sphinx/source/esmvalcore/preprocessor.inc @@ -53,7 +53,7 @@ Features of the ESMValTool Climate data pre-processor are: * Aggregation of data * Provenance tracking of the calculations * Model statistics -* Multi-model mean +* Multimodel statistics * and many more Variable derivation @@ -94,29 +94,29 @@ See also :func:`esmvalcore.preprocessor.derive`. Time manipulation ================= -The _time.py module contains the following preprocessor functions: +The `_time.py` module contains the following preprocessor functions: -* extract_time: Extract a time range from a cube.
-* extract_season: Extract only the times that occur within a specific season. -* extract_month: Extract only the times that occur within a specific month. -* time_average: Take the weighted average over the time dimension. -* seasonal_mean: Produces a mean for each season (DJF, MAM, JJA, SON) -* annual_mean: Produces an annual or decadal mean. -* regrid_time: Aligns the time axis of each dataset to have common time points and calendars. +* `extract_time`: Extract a time range from a cube. +* `extract_season`: Extract only the times that occur within a specific season. +* `extract_month`: Extract only the times that occur within a specific month. +* `time_average`: Take the weighted average over the time dimension. +* `seasonal_mean`: Produces a mean for each season (DJF, MAM, JJA, SON) +* `annual_mean`: Produces an annual or decadal mean. +* `regrid_time`: Aligns the time axis of each dataset to have common time points and calendars. -1. extract_time ---------------- +`extract_time` +-------------- This function subsets a dataset between two points in times. It removes all times in the dataset before the first time and after the last time point. The required arguments are relatively self explanatory: -* start_year -* start_month -* start_day -* end_year -* end_month -* end_day +* `start_year` +* `start_month` +* `start_day` +* `end_year` +* `end_month` +* `end_day` These start and end points are set using the datasets native calendar. All six arguments should be given as integers - the named month string @@ -125,8 +125,8 @@ will not be accepted. See also :func:`esmvalcore.preprocessor.extract_time`. -2. extract_season ------------------ +`extract_season` +---------------- Extract only the times that occur within a specific season. @@ -143,8 +143,8 @@ the seasonal_mean function, below. See also :func:`esmvalcore.preprocessor.extract_season`. -3. 
extract_month ----------------- +`extract_month` +--------------- The function extracts the times that occur within a specific month. This function only has one argument: `month`. This value should be an integer @@ -153,8 +153,8 @@ between 1 and 12 as the named month string will not be accepted. See also :func:`esmvalcore.preprocessor.extract_month`. -4. time_average ---------------- +`time_average` +-------------- This functions takes the weighted average over the time dimension. This function requires no arguments and removes the time dimension of the cube. @@ -162,8 +162,8 @@ function requires no arguments and removes the time dimension of the cube. See also :func:`esmvalcore.preprocessor.time_average`. -5. seasonal_mean ----------------- +`seasonal_mean` +--------------- This function produces a seasonal mean for each season (DJF, MAM, JJA, SON). Note that this function will not check for missing time points. For instance, @@ -176,8 +176,8 @@ December and remove such biased initial datapoints. See also :func:`esmvalcore.preprocessor.seasonal_mean`. -6. annual_mean --------------- +`annual_mean` +------------- This function produces an annual or a decadal mean. The only argument is the decadal boolean switch. When this switch is set to true, this function @@ -186,8 +186,8 @@ will output the decadal averages. See also :func:`esmvalcore.preprocessor.annual_mean`. -7. regrid_time --------------- +`regrid_time` +------------- This function aligns the time points of each component dataset so that the dataset iris cubes can be subtracted. The operation makes the datasets time points common and @@ -199,53 +199,53 @@ unless a custom frequency is set manually by the user in recipe. Area manipulation ================= -The _area.py module contains the following preprocessor functions: +The `_area.py` module contains the following preprocessor functions: -* extract_region: Extract a region from a cube based on lat/lon corners. 
-* zonal_means: Calculates the zonal or meridional means. -* area_statistics: Calculates the average value over a region. -* extract_named_regions: Extract a specific region from in the region cooordinate. +* `extract_region`: Extract a region from a cube based on lat/lon corners. +* `zonal_means`: Calculates the zonal or meridional means. +* `area_statistics`: Calculates the average value over a region. +* `extract_named_regions`: Extract a specific region from the region coordinate. -1. extract_region ----------------- +`extract_region` +---------------- This function masks data outside a rectagular region requested. The boundairies of the region are provided as latitude and longitude coordinates in the arguments: -* start_longitude -* end_longitude -* start_latitude -* end_latitude +* `start_longitude` +* `end_longitude` +* `start_latitude` +* `end_latitude` Note that this function can only be used to extract a rectangular region. See also :func:`esmvalcore.preprocessor.extract_region`. -2. zonal_means --------------- +`zonal_means` +------------- The function calculates the zonal or meridional means. While this function is named `zonal_mean`, it can be used to apply several different operations in an zonal or meridional direction. This function takes two arguments: -* coordinate: Which direction to apply the operation: latitude or longitude -* mean_type: Which operation to apply: mean, std_dev, variance, median, min or max +* `coordinate`: Which direction to apply the operation: latitude or longitude +* `mean_type`: Which operation to apply: mean, std_dev, variance, median, min or max See also :func:`esmvalcore.preprocessor.zonal_means`. -3. area_statistics +`area_statistics` ----------------- This function calculates the average value over a region - weighted by the cell areas of the region. This function takes the argument, -operator: the name of the operation to apply. +`operator`: the name of the operation to apply.
This function can be used to apply several different operations in the horizonal plane: mean, standard deviation, median @@ -258,8 +258,8 @@ removed using other preprocessor operations in advance. See also :func:`esmvalcore.preprocessor.area_statistics`. -4. extract_named_regions ------------------------- +`extract_named_regions` +----------------------- This function extract a specific named region from the data. This function takes the following argument: `regions` which is either a string or a list @@ -272,31 +272,31 @@ See also :func:`esmvalcore.preprocessor.extract_named_regions`. Volume manipulation =================== -The _volume.py module contains the following preprocessor functions: +The `_volume.py` module contains the following preprocessor functions: -* extract_volume: Extract a specific depth range from a cube. -* volume_statistics: Calculate the volume-weighted average. -* depth_integration: Integrate over the depth dimension. -* extract_transect: Extract data along a line of constant latitude or longitude. -* extract_trajectory: Extract data along a specified trajectory. +* `extract_volume`: Extract a specific depth range from a cube. +* `volume_statistics`: Calculate the volume-weighted average. +* `depth_integration`: Integrate over the depth dimension. +* `extract_transect`: Extract data along a line of constant latitude or longitude. +* `extract_trajectory`: Extract data along a specified trajectory. -1. extract_volume ------------------ +`extract_volume` +---------------- -Extract a specific range in the z-direction from a cube. This function +Extract a specific range in the `z`-direction from a cube. This function takes two arguments, a minimum and a maximum (`z_min` and `z_max`, -respectively) in the z direction. +respectively) in the `z`-direction. -Note that this requires the requested z-coordinate range to be the +Note that this requires the requested `z`-coordinate range to be the same sign as the iris cube. 
ie, if the cube has z-coordinate as negative, then z_min and z_max need to be negative numbers. See also :func:`esmvalcore.preprocessor.extract_volume`. -2. volume_statistics ------------------ +`volume_statistics` +------------------- This function calculates the volume-weighted average across three dimensions, but maintains the time dimension. The following arguments are required: @@ -305,23 +305,23 @@ This function takes the argument: operator, which defines the operation to apply over the volume. No depth coordinate is required as this is determined by iris. This -function works best when the fx_files provide the cell volume. +function works best when the `fx_files` provide the cell volume. See also :func:`esmvalcore.preprocessor.volume_statistics`. -3. depth_integration --------------------- +`depth_integration` +------------------- This function integrate over the depth dimension. This function does a -weighted sum along the z-coordinate, and removes the z direction of the output +weighted sum along the `z`-coordinate, and removes the `z` direction of the output cube. This preprocessor takes no arguments. See also :func:`esmvalcore.preprocessor.depth_integration`. -4. extract_transect -------------------- +`extract_transect` +------------------ This function extract data along a line of constant latitude or longitude. This function takes two arguments, although only one is strictly required. @@ -336,8 +336,8 @@ in the indian ocean. See also :func:`esmvalcore.preprocessor.extract_transect`. -5. extract_trajectory ---------------------- +`extract_trajectory` +-------------------- This function extract data along a specified trajectory. The three areguments are: latitudes and longitudes are the coordinates of the @@ -402,7 +402,7 @@ To mask out a certain domain (e.g., sea) in the preprocessor, mask_out: sea and requires only one argument: -* mask_out: either land or sea. +* `mask_out`: either `land` or `sea`. 
The preprocessor automatically retrieves the corresponding mask (`fx: sftof` in this case) and applies it so that sea-covered grid cells are set to @@ -414,6 +414,8 @@ vectorized rasters). As mentioned above, the spatial resolution of the Natural Earth masks is much higher than that of any typical global model (10m for land and 50m for ocean masks). +See also :func:`esmvalcore.preprocessor.mask_landsea`. + Ice masking ----------- @@ -429,16 +431,20 @@ losing generality. To mask ice out, `mask_landseaice` can be used: mask_out: ice and requires only one argument: -* mask_out: either landsea or ice. +* `mask_out`: either `landsea` or `ice`. As in the case of `mask_landsea`, the preprocessor automatically retrieves the `fx: sftgif` mask. +See also :func:`esmvalcore.preprocessor.mask_landseaice`. + Mask files ---------- -At the core of the land/sea/ice masking in the preprocessor are the mask files (whether it be fx type or Natural Earth type of files); these files (bar Natural Earth) -can be retrived and used in the diagnostic phase as well or solely. By specifying the `fx_files:` key in the variable in diagnostic in the recipe, and populating it +At the core of the land/sea/ice masking in the preprocessor are the mask files +(whether it be fx type or Natural Earth type of files); these files (bar Natural Earth) +can be retrieved and used in the diagnostic phase as well or solely. By specifying the +`fx_files:` key in the variable in diagnostic in the recipe, and populating it with a list of desired files e.g.: .. code-block:: bash @@ -448,9 +454,11 @@ with a list of desired files e.g.: preprocessor: my_masking_preprocessor fx_files: [sftlf, sftof, sftgif, areacello, areacella] -Such a recipe will automatically retrieve all the `[sftlf, sftof, sftgif, areacello, areacella]`-type fx files for each of the variables that are needed for -and then, in the diagnostic phase, these mask files will be available for the developer to use them as they need to.
They `fx_files` attribute of the big `variable` -nested dictionary that gets passed to the diagnostic is, in turn, a dictionary on its own, and members of it can be accessed in the diagnostic through a simple loop over +Such a recipe will automatically retrieve all the `[sftlf, sftof, sftgif, areacello, areacella]`-type +fx files for each of the variables that are needed for and then, in the diagnostic phase, +these mask files will be available for the developer to use them as they need to. The `fx_files` +attribute of the big `variable` nested dictionary that gets passed to the diagnostic is, in turn, +a dictionary on its own, and members of it can be accessed in the diagnostic through a simple loop over the 'config' diagnostic variable items e.g.: .. code-block:: bash @@ -487,19 +495,36 @@ In the example above, the fractional threshold for missing data vs. total data i 10.0 (units of the time coordinate units). Optionally, a minimum value threshold can be applied, in this case it is set to 19.0 (in units of the variable units). -A similar preprocessor step exists for the single-dataset: mask_window_threshold (with the same arguments as mask_fillvalues). +See also :func:`esmvalcore.preprocessor.mask_fillvalues`. -Min, max and interval masking ------------------------------ +Minimum, maximum and interval masking +------------------------------------- Thresholding on minimum and maximum accepted data values can also be performed: masks are constructed based on the results of thresholding; inside and outside interval thresholding and masking can also be performed. These functions -are mask_above_threshold, mask_below_threshold, mask_inside_range, and mask_outside_range. +are `mask_above_threshold`, `mask_below_threshold`, `mask_inside_range`, and `mask_outside_range`. + +Thes functions always take a `cube` as first argument and either `threshold` for threshold masking or the pair +`minimum`, `maximum` for interval masking. 
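As a hedged sketch of how the threshold and interval masks might appear in a recipe:
the preprocessor names `preproc_threshold` and `preproc_range` are hypothetical and
the numeric values are purely illustrative. The `cube` argument is assumed to be
supplied by the preprocessor itself, so only `threshold` or the `minimum`, `maximum`
pair is written out.

.. code-block:: yaml

    preprocessors:
      preproc_threshold:          # hypothetical preprocessor name
        mask_above_threshold:
          threshold: 280.0        # illustrative value
      preproc_range:              # hypothetical preprocessor name
        mask_inside_range:
          minimum: 0.0            # illustrative lower bound
          maximum: 10.0           # illustrative upper bound

`mask_below_threshold` and `mask_outside_range` would take the same arguments as
their counterparts above.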
+ +See also :func:`esmvalcore.preprocessor.mask_above_threshold` and related functions. Horizontal regridding ===================== Documentation of _regrid.py (part 2) +Regridding is necessary when various datasets are available on a variety of `lat-lon` grids and they need +to be brought together on a common grid (for various statistical operations e.g. multimodel statistics or +for e.g. direct inter-comparison or comparison with observational datasets). Regridding is conceptually a +very similar process to interpolation (in fact, the regridder engine uses interpolation and extrapolation, +with various schemes). The primary difference is that interpolation is based on sample data points, while +regridding is based on the horizontal grid of another cube (the reference grid). + +The underlying regridding mechanism in ESMValTool uses `cube.regrid()` method from iris, so we point the reader +to its documentation: `https://scitools.org.uk/iris/docs/latest/iris/iris/cube.html#iris.cube.Cube.regrid`_ + +See also :func:`esmvalcore.preprocessor.regrid` + Masking of missing values ========================= Documentation of _mask.py (part 2) From 38983b23c59efbe6a067fc4054eace785e656318 Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Tue, 25 Jun 2019 14:10:47 +0100 Subject: [PATCH 03/49] breaking some lines --- doc/sphinx/source/esmvalcore/preprocessor.inc | 61 +++++++++++-------- 1 file changed, 35 insertions(+), 26 deletions(-) diff --git a/doc/sphinx/source/esmvalcore/preprocessor.inc b/doc/sphinx/source/esmvalcore/preprocessor.inc index 01a87a90cb..38e6257798 100644 --- a/doc/sphinx/source/esmvalcore/preprocessor.inc +++ b/doc/sphinx/source/esmvalcore/preprocessor.inc @@ -24,7 +24,8 @@ library include: * ease of integration of new preprocessing routines; * ease of maintenance (including unit and integration testing) of existing routines; * a straightforward manner of importing and using the preprocessing routines as part - of the overall usage of the software and, 
as a special case, the use during diagnostic execution; + of the overall usage of the software and, as a special case, the use during diagnostic + execution; * shifting the effort for the scientific diagnostic developer from implementing both standard and diagnostic-specific functionalities to allowing them to dedicate most of the effort to developing scientifically-relevant diagnostics and metrics; @@ -102,7 +103,8 @@ The `_time.py` module contains the following preprocessor functions: * `time_average`: Take the weighted average over the time dimension. * `seasonal_mean`: Produces a mean for each season (DJF, MAM, JJA, SON) * `annual_mean`: Produces an annual or decadal mean. -* `regrid_time`: Aligns the time axis of each dataset to have common time points and calendars. +* `regrid_time`: Aligns the time axis of each dataset to have common time points + and calendars. `extract_time` -------------- @@ -470,17 +472,19 @@ the 'config' diagnostic variable items e.g.: Missing values masks -------------------- -Missing (masked) values can be a nuisance especially when dealing with multimodel ensembles and having to compute -multimodel statistics; different numbers of missing data from dataset to datest may introduce biases and artifically -assign more weight to the datasets that have less missing data. This is handled in ESMValTool via the missing values -masks: two types of such masks are available: one for the multimodel case and another for the single model case. +Missing (masked) values can be a nuisance especially when dealing with multimodel ensembles +and having to compute multimodel statistics; different numbers of missing data from dataset +to dataset may introduce biases and artificially assign more weight to the datasets that have +less missing data. This is handled in ESMValTool via the missing values masks: two types of +such masks are available: one for the multimodel case and another for the single model case.
-The multimodel missing values mask (mask_fillvalues) is a preprocessor step that usually comes after all the single-model -steps (regridding, area selection etc) have been performed; in a nutshell, it combines missing values masks from individual -models into a multimodel missing values mask; the individual model masks are built according to common criteria: -the user chooses a time window in which missing data points are counted, and if the number of missing data points relative -to the number of total data points in a window is less than a chosen fractional theshold, the window is discarded i.e. -all the points in the window are masked (set to missing). +The multimodel missing values mask (mask_fillvalues) is a preprocessor step that usually comes +after all the single-model steps (regridding, area selection etc) have been performed; in a +nutshell, it combines missing values masks from individual models into a multimodel missing +values mask; the individual model masks are built according to common criteria: the user chooses +a time window in which missing data points are counted, and if the number of missing data points +relative to the number of total data points in a window is less than a chosen fractional theshold, +the window is discarded i.e. all the points in the window are masked (set to missing). .. code-block:: bash @@ -491,8 +495,9 @@ all the points in the window are masked (set to missing). min_value: 19.0 time_window: 10.0 -In the example above, the fractional threshold for missing data vs. total data is set to 95% and the time window is set to -10.0 (units of the time coordinate units). Optionally, a minimum value threshold can be applied, in this case it is set +In the example above, the fractional threshold for missing data vs. total data is set to 95% and +the time window is set to 10.0 (units of the time coordinate units). Optionally, a minimum value +threshold can be applied, in this case it is set to 19.0 (in units of the variable units). 
See also :func:`esmvalcore.preprocessor.mask_fillvalues`. @@ -500,12 +505,13 @@ See also :func:`esmvalcore.preprocessor.mask_fillvalues`. Minimum, maximum and interval masking ------------------------------------- -Thresholding on minimum and maximum accepted data values can also be performed: masks are constructed based on the -results of thresholding; inside and outside interval thresholding and masking can also be performed. These functions -are `mask_above_threshold`, `mask_below_threshold`, `mask_inside_range`, and `mask_outside_range`. +Thresholding on minimum and maximum accepted data values can also be performed: masks are +constructed based on the results of thresholding; inside and outside interval thresholding +and masking can also be performed. These functions are `mask_above_threshold`, +`mask_below_threshold`, `mask_inside_range`, and `mask_outside_range`. -Thes functions always take a `cube` as first argument and either `threshold` for threshold masking or the pair -`minimum`, `maximum` for interval masking. +Thes functions always take a `cube` as first argument and either `threshold` for threshold +masking or the pair `minimum`, `maximum` for interval masking. See also :func:`esmvalcore.preprocessor.mask_above_threshold` and related functions. @@ -539,13 +545,14 @@ number of multi-model statistical measures: using the preprocessor module `multi will enable the user to ask for either a multi-model `mean` and/or `median` with a set of argument parameters passed to `multi_model_statistics`. Multimodel statistics in ESMValTool are computed along the time axis, and as such, -can be computed across a common overlap in time (by specifying `span: overlap` argument) or across the full length -in time of each model (by specifying `span: full` argument). +can be computed across a common overlap in time (by specifying `span: overlap` argument) or across +the full length in time of each model (by specifying `span: full` argument). 
Restrictive compuation is also available by excluding any set of models that the user will not want to include in the statistics (by setting `exclude: [excluded models list]` argument). -The implementation has a few restrictions that apply to the input data: model datasets must have consistent shapes, -and from a statistical point of view, this is needed since weights are not yet implemented; also higher dimesnional -data is not supported (ie anything with dimensionality higher than four: time, vertical axis, two horizontal axes). +The implementation has a few restrictions that apply to the input data: model datasets must have +consistent shapes, and from a statistical point of view, this is needed since weights are not yet +implemented; also higher dimesnional data is not supported (ie anything with dimensionality higher +than four: time, vertical axis, two horizontal axes). Time-area statistics ==================== @@ -568,9 +575,11 @@ N: number of datasets F_eff: average size of data per dataset where F_eff = e x f x F where e is the factor that describes how lazy the data is (e = 1 for fully realized data) and f describes how much the data was shrunk by the immediately previous module eg -time extraction, area selection or level extraction; note that for fix_data f relates only to the time extraction, if data is exact in time (no time selection) f = 1 for fix_data +time extraction, area selection or level extraction; note that for fix_data f relates only +to the time extraction, if data is exact in time (no time selection) f = 1 for fix_data -so for cases when we deal with a lot of datasets (R + N = N), data is fully realized, assuming an average size of 1.5GB for 10 years of 3D netCDF data, N datasets will require +so for cases when we deal with a lot of datasets (R + N = N), data is fully realized, assuming +an average size of 1.5GB for 10 years of 3D netCDF data, N datasets will require Ms = 1.5 x (N - 1) GB From b2a8ca24bbe75e6ecf9ada69d5b4e67193dd0f2b Mon 
Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Tue, 25 Jun 2019 14:50:34 +0100 Subject: [PATCH 04/49] polishing the turd --- doc/sphinx/source/esmvalcore/preprocessor.inc | 113 ++++++++++++++---- 1 file changed, 88 insertions(+), 25 deletions(-) diff --git a/doc/sphinx/source/esmvalcore/preprocessor.inc b/doc/sphinx/source/esmvalcore/preprocessor.inc index 38e6257798..8ca2ff74cf 100644 --- a/doc/sphinx/source/esmvalcore/preprocessor.inc +++ b/doc/sphinx/source/esmvalcore/preprocessor.inc @@ -368,7 +368,6 @@ Documentation of _regrid.py (part 1) Masking ======= -Documentation of _mask.py (part 1) Introduction to masking ----------------------- @@ -517,7 +516,6 @@ See also :func:`esmvalcore.preprocessor.mask_above_threshold` and related functi Horizontal regridding ===================== -Documentation of _regrid.py (part 2) Regridding is necessary when various datasets are available on a variety of `lat-lon` grids and they need to be brought together on a common grid (for various statistical operations e.g. multimodel statistics or @@ -529,11 +527,75 @@ regridding is based on the horizontal grid of another cube (the reference grid). The underlying regridding mechanism in ESMValTool uses `cube.regrid()` method from iris, so we point the reader to its documentation: `https://scitools.org.uk/iris/docs/latest/iris/iris/cube.html#iris.cube.Cube.regrid`_ -See also :func:`esmvalcore.preprocessor.regrid` +The use of the horizontal regridding functionality is flexible depending on what type of reference grid +and what interpolation scheme is preferred. Below we show a few examples. + +Regridding on a reference dataset grid +-------------------------------------- + +The example below shows how to regrid on the reference dataset `ERA-Interim` (observational data, but just +as well CMIP, obs4mips, or ana4mips datasets can be used); in this case the `scheme` is `linear`. + +.. 
code-block:: bash + + preprocessors: + regrid_preprocessor: + regrid: + target_grid: ERA-Interim + scheme: linear + +Regridding on an `MxN` grid specification +----------------------------------------- + +The example below shows how to regrid on a reference grid with a cell specification of `2.5x2.5` degrees. +This is similar to regridding on reference datasets, but in the previous case the reference dataset grid +cell specifications are not necessarily known a priori. Reegridding on an `MxN` cell specification is +oftentimes used when operating on localized data. + +.. code-block:: bash + + preprocessors: + regrid_preprocessor: + regrid: + target_grid: 2.5x2.5 + scheme: nearest + +In this case the NearestNeighbour interpolation scheme is used. -Masking of missing values -========================= -Documentation of _mask.py (part 2) +When using a `MxN` type of grid it is possible to offset the grid cell centrepoints +using the `lat_offset` and `lon_offset` arguments: + +* `lat_offset`: offsets the grid centers of the latitude coordinate w.r.t. the + pole by half a grid step; +* `lon_offset`: offsets the grid centers of the longitude coordinate w.r.t. Greenwich + meridian by half a grid step. + +.. code-block:: bash + + preprocessors: + regrid_preprocessor: + regrid: + target_grid: 2.5x2.5 + lon_offset: True + lat_offset: True + scheme: nearest + +Regridding (interpolation, extrapolation) schemes +------------------------------------------------- + +The schemes used for the interpolation and extrapolation operations needed by the +horizontal regridding functionality directly map to their corresponding implementaions +in iris: + +* `linear`: `Linear(extrapolation_mode='mask')`, +* `linear_extrapolate`: `Linear(extrapolation_mode='extrapolate')`, +* `nearest`: `Nearest(extrapolation_mode='mask')`, +* `area_weighted`: `AreaWeighted()`, +* `unstructured_nearest`: `UnstructuredNearest()`, + +TODO: can we get some explanations which one's best for what?? 
+ +See also :func:`esmvalcore.preprocessor.regrid` Multi-model statistics ====================== @@ -554,36 +616,37 @@ consistent shapes, and from a statistical point of view, this is needed since we implemented; also higher dimesnional data is not supported (ie anything with dimensionality higher than four: time, vertical axis, two horizontal axes). -Time-area statistics -==================== -Documentation of _area_pp.py and _volume_pp.py - Information on maximum memory required ====================================== In the most general case, we can set upper limits on the maximum memory the anlysis will require: -Ms = (R + N) x F_eff - F_eff - when no multimodel analysis is performed; -Mm = (2R + N) x F_eff - 2F_eff - when multimodel analysis is performed; +`Ms = (R + N) x F_eff - F_eff` - when no multimodel analysis is performed; +`Mm = (2R + N) x F_eff - 2F_eff` - when multimodel analysis is performed; where -Ms: maximum memory for non-multimodel module -Mm: maximum memory for multimodel module -R: computational efficiency of module; R is typically 2-3 -N: number of datasets -F_eff: average size of data per dataset where F_eff = e x f x F -where e is the factor that describes how lazy the data is (e = 1 for fully realized data) -and f describes how much the data was shrunk by the immediately previous module eg -time extraction, area selection or level extraction; note that for fix_data f relates only -to the time extraction, if data is exact in time (no time selection) f = 1 for fix_data +* `Ms`: maximum memory for non-multimodel module +* `Mm`: maximum memory for multimodel module +* `R`: computational efficiency of module; `R` is typically 2-3 +* `N`: number of datasets +* `F_eff`: average size of data per dataset where `F_eff = e x f x F` + where `e` is the factor that describes how lazy the data is (`e = 1` for fully realized data) + and `f` describes how much the data was shrunk by the immediately previous module eg + time extraction, area selection or 
level extraction; note that for fix_data f relates only + to the time extraction, if data is exact in time (no time selection) `f = 1` for fix_data + +so for cases when we deal with a lot of datasets `(R + N = N)`, data is fully realized, assuming +an average size of 1.5GB for 10 years of `3D` netCDF data, `N` datasets will require -so for cases when we deal with a lot of datasets (R + N = N), data is fully realized, assuming -an average size of 1.5GB for 10 years of 3D netCDF data, N datasets will require +`Ms = 1.5 x (N - 1)` GB +`Mm = 1.5 x (N - 2)` GB -Ms = 1.5 x (N - 1) GB -Mm = 1.5 x (N - 2) GB +As a thumb rule, the maximum required memory at a certain time, when meeding multimodel analysis +could be estimated by multiplying the number of datasets by the average file size of all the datasets; +this memory intake is high but also assumes that all data is fully realized in memory; this aspect +will gradually change and the amount of realized data will decrease with the increase of `dask` use. Unit conversion From edb454e91c89f76432965a6e2a8462709a22b20f Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Tue, 25 Jun 2019 16:54:35 +0100 Subject: [PATCH 05/49] super polishing the turd, almost done with preprocessor inc --- doc/sphinx/source/esmvalcore/preprocessor.inc | 371 ++++++++++-------- 1 file changed, 200 insertions(+), 171 deletions(-) diff --git a/doc/sphinx/source/esmvalcore/preprocessor.inc b/doc/sphinx/source/esmvalcore/preprocessor.inc index 8ca2ff74cf..1e87de3cc1 100644 --- a/doc/sphinx/source/esmvalcore/preprocessor.inc +++ b/doc/sphinx/source/esmvalcore/preprocessor.inc @@ -7,7 +7,7 @@ Preprocessor Overview ======== -ESMValTool is a modular Python 3.6+ software package possesing capabilities +ESMValTool is a modular ``Python 3.6+`` software package possesing capabilities of executing a large number of diagnostic routines that can be written in a number of programming languages (Python, NCL, R, Julia). 
The modular nature benefits the users and developers in different key areas: @@ -86,39 +86,40 @@ the derivation can be forced by setting the appropriate flag. force_derivation: false The required arguments for this module are two boolean switches: -* derive: activate variable derivation -* force_derivation: force variable derivation even if the variable is -directly available in the input data. + +* ``derive``: activate variable derivation +* ``force_derivation``: force variable derivation even if the variable is + directly available in the input data. See also :func:`esmvalcore.preprocessor.derive`. Time manipulation ================= -The `_time.py` module contains the following preprocessor functions: - -* `extract_time`: Extract a time range from a cube. -* `extract_season`: Extract only the times that occur within a specific season. -* `extract_month`: Extract only the times that occur within a specific month. -* `time_average`: Take the weighted average over the time dimension. -* `seasonal_mean`: Produces a mean for each season (DJF, MAM, JJA, SON) -* `annual_mean`: Produces an annual or decadal mean. -* `regrid_time`: Aligns the time axis of each dataset to have common time points +The ``_time.py`` module contains the following preprocessor functions: + +* ``extract_time``: Extract a time range from an Iris ``cube``. +* ``extract_season``: Extract only the times that occur within a specific season. +* ``extract_month``: Extract only the times that occur within a specific month. +* ``time_average``: Take the weighted average over the time dimension. +* ``seasonal_mean``: Produces a mean for each season (DJF, MAM, JJA, SON) +* ``annual_mean``: Produces an annual or decadal mean. +* ``regrid_time``: Aligns the time axis of each dataset to have common time points and calendars. -`extract_time` --------------- +``extract_time`` +---------------- This function subsets a dataset between two points in times. 
It removes all times in the dataset before the first time and after the last time point. The required arguments are relatively self explanatory: -* `start_year` -* `start_month` -* `start_day` -* `end_year` -* `end_month` -* `end_day` +* ``start_year`` +* ``start_month`` +* ``start_day`` +* ``end_year`` +* ``end_month`` +* ``end_day`` These start and end points are set using the datasets native calendar. All six arguments should be given as integers - the named month string @@ -127,12 +128,12 @@ will not be accepted. See also :func:`esmvalcore.preprocessor.extract_time`. -`extract_season` ----------------- +``extract_season`` +------------------ Extract only the times that occur within a specific season. -This function only has one argument: `season`. This is the named season to +This function only has one argument: ``season``. This is the named season to extract. ie: DJF, MAM, JJA, SON. Note that this function does not change the time resolution. If your original @@ -145,18 +146,18 @@ the seasonal_mean function, below. See also :func:`esmvalcore.preprocessor.extract_season`. -`extract_month` ---------------- +``extract_month`` +----------------- The function extracts the times that occur within a specific month. -This function only has one argument: `month`. This value should be an integer +This function only has one argument: ``month``. This value should be an integer between 1 and 12 as the named month string will not be accepted. See also :func:`esmvalcore.preprocessor.extract_month`. -`time_average` --------------- +``time_average`` +---------------- This functions takes the weighted average over the time dimension. This function requires no arguments and removes the time dimension of the cube. @@ -164,8 +165,8 @@ function requires no arguments and removes the time dimension of the cube. See also :func:`esmvalcore.preprocessor.time_average`. 
-`seasonal_mean` ---------------- +``seasonal_mean`` +----------------- This function produces a seasonal mean for each season (DJF, MAM, JJA, SON). Note that this function will not check for missing time points. For instance, @@ -178,8 +179,8 @@ December and remove such biased initial datapoints. See also :func:`esmvalcore.preprocessor.seasonal_mean`. -`annual_mean` -------------- +``annual_mean`` +--------------- This function produces an annual or a decadal mean. The only argument is the decadal boolean switch. When this switch is set to true, this function @@ -188,70 +189,67 @@ will output the decadal averages. See also :func:`esmvalcore.preprocessor.annual_mean`. -`regrid_time` -------------- +``regrid_time`` +--------------- This function aligns the time points of each component dataset so that the dataset -iris cubes can be subtracted. The operation makes the datasets time points common and +Iris cubes can be subtracted. The operation makes the datasets time points common and sets common calendars; it also resets the time bounds and auxiliary coordinates to reflect the artifically shifted time points. Current implementation for monthly -and daily data; the frequency is set automatically from the variable CMOR table -unless a custom frequency is set manually by the user in recipe. +and daily data; the ``frequency`` is set automatically from the variable CMOR table +unless a custom ``frequency`` is set manually by the user in recipe. +See also :func:`esmvalcore.preprocessor.regrid_time`. Area manipulation ================= -The `_area.py` module contains the following preprocessor functions: +The ``_area.py`` module contains the following preprocessor functions: -* `extract_region`: Extract a region from a cube based on lat/lon corners. -* `zonal_means`: Calculates the zonal or meridional means. -* `area_statistics`: Calculates the average value over a region. -* `extract_named_regions`: Extract a specific region from in the region cooordinate. 
+* ``extract_region``: Extract a region from a cube based on ``lat/lon`` corners. +* ``zonal_means``: Calculates the zonal or meridional means. +* ``area_statistics``: Calculates the average value over a region. +* ``extract_named_regions``: Extract a specific region from in the region cooordinate. -`extract_region` ----------------- +``extract_region`` +------------------ This function masks data outside a rectagular region requested. The boundairies of the region are provided as latitude and longitude coordinates in the arguments: -* `start_longitude` -* `end_longitude` -* `start_latitude` -* `end_latitude` +* ``start_longitude`` +* ``end_longitude`` +* ``start_latitude`` +* ``end_latitude`` Note that this function can only be used to extract a rectangular region. See also :func:`esmvalcore.preprocessor.extract_region`. -`zonal_means` -------------- +``zonal_means`` +--------------- The function calculates the zonal or meridional means. While this function is -named `zonal_mean`, it can be used to apply several different operations in -an zonal or meridional direction. -This function takes two arguments: +named ``zonal_mean``, it can be used to apply several different operations in +an zonal or meridional direction. This function takes two arguments: -* `coordinate`: Which direction to apply the operation: latitude or longitude -* `mean_type`: Which operation to apply: mean, std_dev, variance, median, min or max +* ``coordinate``: Which direction to apply the operation: latitude or longitude +* ``mean_type``: Which operation to apply: mean, std_dev, variance, median, min or max See also :func:`esmvalcore.preprocessor.zonal_means`. -`area_statistics` ------------------ +``area_statistics`` +------------------- This function calculates the average value over a region - weighted by the -cell areas of the region. - -This function takes the argument, -`operator`: the name of the operation to apply. +cell areas of the region. 
This function takes the argument, +``operator``: the name of the operation to apply. -This function can be used to apply several -different operations in the horizonal plane: mean, standard deviation, median -variance, minimum and maximum. +This function can be used to apply several different operations in the horizonal +plane: mean, standard deviation, median variance, minimum and maximum. Note that this function is applied over the entire dataset. If only a specific region, depth layer or time period is required, then those regions need to be @@ -260,12 +258,12 @@ removed using other preprocessor operations in advance. See also :func:`esmvalcore.preprocessor.area_statistics`. -`extract_named_regions` ------------------------ +``extract_named_regions`` +------------------------- This function extract a specific named region from the data. This function -takes the following argument: `regions` which is either a string or a list -of strings of named regions. Note that the dataset must have a `region` +takes the following argument: ``regions`` which is either a string or a list +of strings of named regions. Note that the dataset must have a ``region`` cooordinate which includes a list of strings as values. This function then matches the named regions against the requested string. @@ -274,46 +272,46 @@ See also :func:`esmvalcore.preprocessor.extract_named_regions`. Volume manipulation =================== -The `_volume.py` module contains the following preprocessor functions: +The ``_volume.py`` module contains the following preprocessor functions: -* `extract_volume`: Extract a specific depth range from a cube. -* `volume_statistics`: Calculate the volume-weighted average. -* `depth_integration`: Integrate over the depth dimension. -* `extract_transect`: Extract data along a line of constant latitude or longitude. -* `extract_trajectory`: Extract data along a specified trajectory. +* ``extract_volume``: Extract a specific depth range from a cube. 
+* ``volume_statistics``: Calculate the volume-weighted average. +* ``depth_integration``: Integrate over the depth dimension. +* ``extract_transect``: Extract data along a line of constant latitude or longitude. +* ``extract_trajectory``: Extract data along a specified trajectory. -`extract_volume` ----------------- +``extract_volume`` +------------------ Extract a specific range in the `z`-direction from a cube. This function -takes two arguments, a minimum and a maximum (`z_min` and `z_max`, +takes two arguments, a minimum and a maximum (``z_min`` and ``z_max``, respectively) in the `z`-direction. Note that this requires the requested `z`-coordinate range to be the -same sign as the iris cube. ie, if the cube has z-coordinate as -negative, then z_min and z_max need to be negative numbers. +same sign as the Iris cube. ie, if the cube has `z`-coordinate as +negative, then ``z_min`` and ``z_max`` need to be negative numbers. See also :func:`esmvalcore.preprocessor.extract_volume`. -`volume_statistics` -------------------- +``volume_statistics`` +--------------------- This function calculates the volume-weighted average across three dimensions, -but maintains the time dimension. The following arguments are required: +but maintains the time dimension. -This function takes the argument: operator, which defines the +This function takes the argument: ``operator``, which defines the operation to apply over the volume. -No depth coordinate is required as this is determined by iris. This -function works best when the `fx_files` provide the cell volume. +No depth coordinate is required as this is determined by Iris. This +function works best when the ``fx_files`` provide the cell volume. See also :func:`esmvalcore.preprocessor.volume_statistics`. -`depth_integration` -------------------- +``depth_integration`` +--------------------- This function integrate over the depth dimension. 
This function does a weighted sum along the `z`-coordinate, and removes the `z` direction of the output @@ -322,45 +320,46 @@ cube. This preprocessor takes no arguments. See also :func:`esmvalcore.preprocessor.depth_integration`. -`extract_transect` ------------------- +``extract_transect`` +-------------------- This function extract data along a line of constant latitude or longitude. This function takes two arguments, although only one is strictly required. -The two arguments are `latitude` and `longitude`. One of these arguments +The two arguments are ``latitude`` and ``longitude``. One of these arguments needs to be set to a float, and the other can then be either ignored or set to a minimum or maximum value. -Ie: If we set latitude to 0 N and leave longitude blank, it would produce a -cube along the equator. On the other hand, if we set latitude to 0 and then -set longitude to `[40., 100.]` this will produce a transect of the equator -in the indian ocean. + +**Example**: If we set latitude to 0 N and leave longitude blank, it would produce a +cube along the Equator. On the other hand, if we set latitude to 0 and then +set longitude to ``[40., 100.]`` this will produce a transect of the Equator +in the Indian Ocean. See also :func:`esmvalcore.preprocessor.extract_transect`. -`extract_trajectory` --------------------- +``extract_trajectory`` +---------------------- This function extract data along a specified trajectory. -The three areguments are: latitudes and longitudes are the coordinates of the -trajectory. +The three arguments are: ``latitudes``, ``longitudes`` and the number of points needed for +extrapolation ``number_points``. -If two points are provided, the `number_points` argument is used to set a +If two points are provided, the ``number_points`` argument is used to set the number of places to extract between the two end points. 
If more than two points are provided, then -extract_trajectory will produce a cube which has extrapolated the data -of the cube to those points, and `number_points` is not needed. +``extract_trajectory`` will produce a cube which has extrapolated the data +of the cube to those points, and ``number_points`` is not needed. -Note that this function uses the expensive interpolate method, but it may be -necceasiry for irregular grids. +Note that this function uses the expensive ``interpolate`` method from ``iris.analysis.trajectory``, +but it may be necessary for irregular grids. See also :func:`esmvalcore.preprocessor.extract_trajectory`. CMORization and dataset-specific fixes ====================================== -Documentation of _reformat.py, check.py and fix.py +Javier Vertical interpolation ====================== @@ -393,7 +392,7 @@ step and before running the diagnostic, or in the diagnostic scripts themselves. We present both these implementations below. To mask out a certain domain (e.g., sea) in the preprocessor, -`mask_landsea` can be used: +``mask_landsea`` can be used: .. code-block:: yaml @@ -402,12 +401,11 @@ To mask out a certain domain (e.g., sea) in the preprocessor, mask_landsea: mask_out: sea -and requires only one argument: -* `mask_out`: either `land` or `sea`. +and requires only one argument: ``mask_out``: either ``land`` or ``sea``. -The preprocessor automatically retrieves the corresponding mask (`fx: stfof` in +The preprocessor automatically retrieves the corresponding mask (``fx: sftof`` in this case) and applies it so that sea-covered grid cells are set to -missing. Conversely, it retrieves the `fx: sftlf` mask when land need to be +missing. Conversely, it retrieves the ``fx: sftlf`` mask when land needs to be masked out, respectively. 
If the corresponding fx file is not found (which is the case for some models and almost all observational datasets), the preprocessor attempts to mask the data using Natural Earth mask files (that are @@ -422,7 +420,7 @@ Ice masking Note that for masking out ice sheets, the preprocessor uses a different function, to ensure that both land and sea or ice can be masked out without -losing generality. To mask ice out, `mask_landseaice` can be used: +losing generality. To mask ice out, ``mask_landseaice`` can be used: .. code-block:: yaml @@ -431,11 +429,10 @@ losing generality. To mask ice out, `mask_landseaice` can be used: mask_landseaice: mask_out: ice -and requires only one argument: -* `mask_out`: either `landsea` or `ice`. +and requires only one argument: ``mask_out``: either ``landsea`` or ``ice``. -As in the case of `mask_landsea`, the preprocessor automatically retrieves the -`fx: sftgif` mask. +As in the case of ``mask_landsea``, the preprocessor automatically retrieves the +``fx_files: [sftgif]`` mask. See also :func:`esmvalcore.preprocessor.mask_landseaice`. @@ -445,22 +442,22 @@ Mask files At the core of the land/sea/ice masking in the preprocessor are the mask files (whether it be fx type or Natural Earth type of files); these files (bar Natural Earth) can be retrived and used in the diagnostic phase as well or solely. By specifying the -`fx_files:` key in the variable in diagnostic in the recipe, and populating it +``fx_files:`` key in the variable in diagnostic in the recipe, and populating it with a list of desired files e.g.: -.. code-block:: bash +.. 
code-block:: yaml variables: ta: preprocessor: my_masking_preprocessor fx_files: [sftlf, sftof, sftgif, areacello, areacella] -Such a recipe will automatically retrieve all the `[sftlf, sftof, sftgif, areacello, areacella]`-type +Such a recipe will automatically retrieve all the ``fx_files: [sftlf, sftof, sftgif, areacello, areacella]``-type fx files for each of the variables that are needed for and then, in the diagnostic phase, these mask files will be available for the developer to use them as they need to. The `fx_files` attribute of the big `variable` nested dictionary that gets passed to the diagnostic is, in turn, a dictionary on its own, and members of it can be accessed in the diagnostic through a simple loop over -the 'config' diagnostic variable items e.g.: +the ``config`` diagnostic variable items e.g.: .. code-block:: bash @@ -477,7 +474,7 @@ to datest may introduce biases and artifically assign more weight to the dataset less missing data. This is handled in ESMValTool via the missing values masks: two types of such masks are available: one for the multimodel case and another for the single model case. -The multimodel missing values mask (mask_fillvalues) is a preprocessor step that usually comes +The multimodel missing values mask (``mask_fillvalues``) is a preprocessor step that usually comes after all the single-model steps (regridding, area selection etc) have been performed; in a nutshell, it combines missing values masks from individual models into a multimodel missing values mask; the individual model masks are built according to common criteria: the user chooses @@ -485,7 +482,7 @@ a time window in which missing data points are counted, and if the number of mis relative to the number of total data points in a window is less than a chosen fractional theshold, the window is discarded i.e. all the points in the window are masked (set to missing). -.. code-block:: bash +.. 
code-block:: yaml preprocessors: missing_values_preprocessor: @@ -506,11 +503,11 @@ Minimum, maximum and interval masking Thresholding on minimum and maximum accepted data values can also be performed: masks are constructed based on the results of thresholding; inside and outside interval thresholding -and masking can also be performed. These functions are `mask_above_threshold`, -`mask_below_threshold`, `mask_inside_range`, and `mask_outside_range`. +and masking can also be performed. These functions are ``mask_above_threshold``, +``mask_below_threshold``, ``mask_inside_range``, and ``mask_outside_range``. -Thes functions always take a `cube` as first argument and either `threshold` for threshold -masking or the pair `minimum`, `maximum` for interval masking. +These functions always take a cube as first argument and either ``threshold`` for threshold +masking or the pair ``minimum``, ``maximum`` for interval masking. See also :func:`esmvalcore.preprocessor.mask_above_threshold` and related functions. @@ -524,8 +521,9 @@ very similar process to interpolation (in fact, the regridder engine uses interp with various schemes). The primary difference is that interpolation is based on sample data points, while regridding is based on the horizontal grid of another cube (the reference grid). -The underlying regridding mechanism in ESMValTool uses `cube.regrid()` method from iris, so we point the reader -to its documentation: `https://scitools.org.uk/iris/docs/latest/iris/iris/cube.html#iris.cube.Cube.regrid`_ +The underlying regridding mechanism in ESMValTool uses the ``cube.regrid()`` method from Iris, +so we point the reader to its documentation: +`cube.regrid() <https://scitools.org.uk/iris/docs/latest/iris/iris/cube.html#iris.cube.Cube.regrid>`_. The use of the horizontal regridding functionality is flexible depending on what type of reference grid and what interpolation scheme is preferred. Below we show a few examples. @@ -533,10 +531,10 @@ and what interpolation scheme is preferred. Below we show a few examples. 
Regridding on a reference dataset grid -------------------------------------- -The example below shows how to regrid on the reference dataset `ERA-Interim` (observational data, but just +The example below shows how to regrid on the reference dataset ``ERA-Interim`` (observational data, but just as well CMIP, obs4mips, or ana4mips datasets can be used); in this case the `scheme` is `linear`. -.. code-block:: bash +.. code-block:: yaml preprocessors: regrid_preprocessor: @@ -544,15 +542,15 @@ as well CMIP, obs4mips, or ana4mips datasets can be used); in this case the `sch target_grid: ERA-Interim scheme: linear -Regridding on an `MxN` grid specification ------------------------------------------ +Regridding on an ``MxN`` grid specification +------------------------------------------- -The example below shows how to regrid on a reference grid with a cell specification of `2.5x2.5` degrees. +The example below shows how to regrid on a reference grid with a cell specification of ``2.5x2.5`` degrees. This is similar to regridding on reference datasets, but in the previous case the reference dataset grid -cell specifications are not necessarily known a priori. Reegridding on an `MxN` cell specification is +cell specifications are not necessarily known a priori. Regridding on an ``MxN`` cell specification is oftentimes used when operating on localized data. -.. code-block:: bash +.. code-block:: yaml preprocessors: regrid_preprocessor: @@ -560,17 +558,17 @@ oftentimes used when operating on localized data. target_grid: 2.5x2.5 scheme: nearest -In this case the NearestNeighbour interpolation scheme is used. +In this case the ``NearestNeighbour`` interpolation scheme is used (see below for scheme definitions). 
-When using a `MxN` type of grid it is possible to offset the grid cell centrepoints -using the `lat_offset` and `lon_offset` arguments: +When using an ``MxN`` type of grid it is possible to offset the grid cell centrepoints +using the ``lat_offset`` and ``lon_offset`` arguments: -* `lat_offset`: offsets the grid centers of the latitude coordinate w.r.t. the +* ``lat_offset``: offsets the grid centers of the latitude coordinate w.r.t. the pole by half a grid step; -* `lon_offset`: offsets the grid centers of the longitude coordinate w.r.t. Greenwich +* ``lon_offset``: offsets the grid centers of the longitude coordinate w.r.t. Greenwich meridian by half a grid step. -.. code-block:: bash +.. code-block:: yaml preprocessors: regrid_preprocessor: @@ -585,68 +583,99 @@ Regridding (interpolation, extrapolation) schemes The schemes used for the interpolation and extrapolation operations needed by the horizontal regridding functionality directly map to their corresponding implementaions -in iris: +in Iris: -* `linear`: `Linear(extrapolation_mode='mask')`, -* `linear_extrapolate`: `Linear(extrapolation_mode='extrapolate')`, -* `nearest`: `Nearest(extrapolation_mode='mask')`, -* `area_weighted`: `AreaWeighted()`, -* `unstructured_nearest`: `UnstructuredNearest()`, - -TODO: can we get some explanations which one's best for what?? +* ``linear``: `Linear(extrapolation_mode='mask') `_. +* ``linear_extrapolate``: `Linear(extrapolation_mode='extrapolate') `_. +* ``nearest``: `Nearest(extrapolation_mode='mask') `_. +* ``area_weighted``: `AreaWeighted() `_. +* ``unstructured_nearest``: `UnstructuredNearest() `_. See also :func:`esmvalcore.preprocessor.regrid` +.. note:: + **Advanced User and Developer** + + For both vertical and horizontal regridding one can control the extrapolation mode when defining + the interpolation scheme. Controlling the extrapolation mode allows us to avoid situations + where extrapolating values makes little physical sense (e.g. 
extrapolating beyond the last data point). + The extrapolation mode is controlled by the ``extrapolation_mode`` keyword. For the interpolation + schemes available in Iris, the ``extrapolation_mode`` keyword must be one of: + + * ``extrapolate`` – the extrapolation points will be calculated by extending the gradient + of the closest two points, + * ``error`` – a ``ValueError`` exception will be raised, notifying an attempt to extrapolate, + * ``nan`` – the extrapolation points will be set to NaN, + * ``mask`` – the extrapolation points will always be masked, even if the source data is not + a ``MaskedArray``, or + * ``nanmask`` – if the source data is a ``MaskedArray`` the extrapolation points will be masked. + Otherwise they will be set to NaN. + Multi-model statistics ====================== Computing multi-model statistics is an integral part of model analysis and evaluation: individual models display a variety of biases depedning on model set-up, initial conditions, forcings and implementation; comparing model data to observational data, these biases have a significanly lower statistical impact when using a multi-model ensemble. ESMValTool has the capability of computing a -number of multi-model statistical measures: using the preprocessor module `multi_model_statistics` -will enable the user to ask for either a multi-model `mean` and/or `median` with a set of argument -parameters passed to `multi_model_statistics`. +number of multi-model statistical measures: using the preprocessor module ``multi_model_statistics`` +will enable the user to ask for either a multi-model ``mean`` and/or ``median`` with a set of argument +parameters passed to ``multi_model_statistics``. + Multimodel statistics in ESMValTool are computed along the time axis, and as such, -can be computed across a common overlap in time (by specifying `span: overlap` argument) or across -the full length in time of each model (by specifying `span: full` argument). 
+can be computed across a common overlap in time (by specifying ``span: overlap`` argument) or across +the full length in time of each model (by specifying ``span: full`` argument). + Restrictive compuation is also available by excluding any set of models that the user -will not want to include in the statistics (by setting `exclude: [excluded models list]` argument). +will not want to include in the statistics (by setting ``exclude: [excluded models list]`` argument). The implementation has a few restrictions that apply to the input data: model datasets must have consistent shapes, and from a statistical point of view, this is needed since weights are not yet implemented; also higher dimesnional data is not supported (ie anything with dimensionality higher than four: time, vertical axis, two horizontal axes). +.. code-block:: yaml + + preprocessors: + multimodel_preprocessor: + multi_model_statistics: + span: overlap + statistics: [mean, median] + exclude: [NCEP] + +see also :func:`esmvalcore.preprocessor.multi_model_statistics`. 
+ Information on maximum memory required ====================================== In the most general case, we can set upper limits on the maximum memory the anlysis will require: -`Ms = (R + N) x F_eff - F_eff` - when no multimodel analysis is performed; -`Mm = (2R + N) x F_eff - 2F_eff` - when multimodel analysis is performed; +``Ms = (R + N) x F_eff - F_eff`` - when no multimodel analysis is performed; + +``Mm = (2R + N) x F_eff - 2F_eff`` - when multimodel analysis is performed; where -* `Ms`: maximum memory for non-multimodel module -* `Mm`: maximum memory for multimodel module -* `R`: computational efficiency of module; `R` is typically 2-3 -* `N`: number of datasets -* `F_eff`: average size of data per dataset where `F_eff = e x f x F` - where `e` is the factor that describes how lazy the data is (`e = 1` for fully realized data) - and `f` describes how much the data was shrunk by the immediately previous module eg - time extraction, area selection or level extraction; note that for fix_data f relates only - to the time extraction, if data is exact in time (no time selection) `f = 1` for fix_data +* ``Ms``: maximum memory for non-multimodel module +* ``Mm``: maximum memory for multimodel module +* ``R``: computational efficiency of module; ``R`` is typically 2-3 +* ``N``: number of datasets +* ``F_eff``: average size of data per dataset where ``F_eff = e x f x F`` + where ``e`` is the factor that describes how lazy the data is (``e = 1`` for fully realized data) + and ``f`` describes how much the data was shrunk by the immediately previous module e.g. 
+ time extraction, area selection or level extraction; note that for fix_data ``f`` relates only + to the time extraction, if data is exact in time (no time selection) ``f = 1`` for fix_data + +so for cases when we deal with a lot of datasets ``R + N \approx N``, data is fully realized, assuming +an average size of 1.5GB for 10 years of `3D` netCDF data, ``N`` datasets will require -so for cases when we deal with a lot of datasets `(R + N = N)`, data is fully realized, assuming -an average size of 1.5GB for 10 years of `3D` netCDF data, `N` datasets will require +``Ms = 1.5 x (N - 1)`` GB -`Ms = 1.5 x (N - 1)` GB -`Mm = 1.5 x (N - 2)` GB +``Mm = 1.5 x (N - 2)`` GB As a thumb rule, the maximum required memory at a certain time, when meeding multimodel analysis could be estimated by multiplying the number of datasets by the average file size of all the datasets; this memory intake is high but also assumes that all data is fully realized in memory; this aspect -will gradually change and the amount of realized data will decrease with the increase of `dask` use. +will gradually change and the amount of realized data will decrease with the increase of ``dask`` use. Unit conversion @@ -663,7 +692,7 @@ will guarantee homogeneous input for the diagnostics. .. note:: Conversion is only supported between compatible units! In other - words, converting temperature units from `degC` to `Kelvin` works + words, converting temperature units from ``degC`` to ``Kelvin`` works fine, changing precipitation units from a rate based unit to an amount based unit is not supported at the moment. 
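The two memory limits quoted in the patch above can be checked with a short calculation. The following is a minimal sketch only: the function name ``estimate_max_memory_gb`` and its arguments are hypothetical and are not part of ``esmvalcore``, and the default ``r=2.0`` is simply taken from the stated typical range of 2-3.

.. code-block:: python

    def estimate_max_memory_gb(n_datasets, file_size_gb, r=2.0, e=1.0, f=1.0,
                               multimodel=False):
        """Upper limit on memory use, following the Ms and Mm formulas.

        n_datasets   -- N, the number of datasets
        file_size_gb -- F, the average size of one dataset in GB
        r            -- R, computational efficiency of the module (assumed 2)
        e            -- laziness factor (1 for fully realized data)
        f            -- shrink factor left by the previous module
        """
        f_eff = e * f * file_size_gb
        if multimodel:
            # Mm = (2R + N) x F_eff - 2F_eff
            return (2 * r + n_datasets) * f_eff - 2 * f_eff
        # Ms = (R + N) x F_eff - F_eff
        return (r + n_datasets) * f_eff - f_eff


    # 20 fully realized datasets of 1.5 GB each
    single_model_gb = estimate_max_memory_gb(20, 1.5)
    multi_model_gb = estimate_max_memory_gb(20, 1.5, multimodel=True)

With these inputs the single-model limit evaluates to 31.5 GB and the multimodel limit to 33.0 GB, so the cost is dominated by ``N x F_eff`` once ``N`` is much larger than ``R``.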
From 1364ccdf5c883ea0391d334fcd11c3005e76c528 Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Wed, 26 Jun 2019 12:18:12 +0100 Subject: [PATCH 06/49] added vertical regridding dox and finishing my bit for preprocessor.inc --- doc/sphinx/source/esmvalcore/preprocessor.inc | 70 ++++++++++++++++++- 1 file changed, 67 insertions(+), 3 deletions(-) diff --git a/doc/sphinx/source/esmvalcore/preprocessor.inc b/doc/sphinx/source/esmvalcore/preprocessor.inc index 1e87de3cc1..633724b9de 100644 --- a/doc/sphinx/source/esmvalcore/preprocessor.inc +++ b/doc/sphinx/source/esmvalcore/preprocessor.inc @@ -37,8 +37,9 @@ preprocessor performs these operations in a centralized, documented and efficient way, thus reducing the data processing load on the diagnostics side. Each of the preprocessor operations is written in a dedicated python module and -all of them receive and return an Iris cube, working sequentially on the data -with no interactions between them. The order +all of them receive and return an Iris +`cube `_ , +working sequentially on the data with no interactions between them. The order in which the preprocessor operations is applied is set by default in order to minimize the loss of information due to, for example, temporal and spatial subsetting or multi-model averaging. Nevertheless, the user is free to change @@ -363,7 +364,70 @@ Javier Vertical interpolation ====================== -Documentation of _regrid.py (part 1) +Vertical level selection is an important aspect of data preprocessing since it allows the +scientist to perform a number of metrics specific to certain levels (whether it be air pressure +or depth, e.g. the Quasi-Biennial-Oscillation (QBO) u30 is computed at 30 hPa). Dataset native +vertical grids may not come with the desired set of levels, so an interpolation operation will be +needed to regrid the data vertically. ESMValTool can perform this vertical interpolation via the +``extract_levels`` preprocessor. 
Level extraction may be done in a number of ways: + +Level extraction can be done at specific values passed to ``extract_levels`` as ``levels:`` with +its value a list of levels (note that the units are CMOR-standard, Pascals (Pa)): + +.. code-block:: yaml + + preprocessors: + preproc_select_levels_from_list: + extract_levels: + levels: [100000., 50000., 3000., 1000.] + scheme: linear + +It is also possible to extract the CMIP-specific, CMOR levels as they appear in the CMOR table, +e.g. ``plev10`` or ``plev17`` or ``plev19`` etc: + +.. code-block:: yaml + + preprocessors: + preproc_select_levels_from_cmip_table: + extract_levels: + levels: {cmor_table: CMIP6, coordinate: plev10} + scheme: nearest + +Of good use is also the level extraction with values specific to a certain dataset, without +the user actually polling the dataset of interest to find out the specific levels: e.g. in the +example below we offer two alternatives to extract the levels and vertically regrid onto the +vertical levels of ``ERA-Interim``: + +.. code-block:: yaml + + preprocessors: + preproc_select_levels_from_dataset: + extract_levels: + levels: ERA-Interim + # This also works, but allows specifying the pressure coordinate name + # levels: {dataset: ERA-Interim, coordinate: air_pressure} + scheme: linear_horizontal_extrapolate_vertical + +* See also :func:`esmvalcore.preprocessor.extract_levels`. +* See also :func:`esmvalcore.preprocessor.get_cmor_levels`. + +.. note:: + **Advanced User and Developer** + + For both vertical and horizontal regridding one can control the extrapolation mode when defining + the interpolation scheme. Controlling the extrapolation mode allows us to avoid situations + where extrapolating values makes little physical sense (e.g. extrapolating beyond the last data point). + The extrapolation mode is controlled by the `extrapolation_mode` keyword. 
For the interpolation + schemes available in Iris, the ``extrapolation_mode`` keyword must be one of: + + * ``extrapolate`` – the extrapolation points will be calculated by extending the gradient + of the closest two points, + * ``error`` – a ``ValueError`` exception will be raised, notifying an attempt to extrapolate, + * ``nan`` – the extrapolation points will be set to NaN, + * ``mask`` – the extrapolation points will always be masked, even if the source data is not + a ``MaskedArray``, or + * ``nanmask`` – if the source data is a ``MaskedArray`` the extrapolation points will be masked. + Otherwise they will be set to NaN. Masking ======= From e2e4b3868e758387e0e624851f1a10c50b8bd8a4 Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Wed, 26 Jun 2019 14:18:09 +0100 Subject: [PATCH 07/49] started working on the data finder --- doc/sphinx/source/esmvalcore/datafinder.inc | 105 +++++++++++++++++++- 1 file changed, 104 insertions(+), 1 deletion(-) diff --git a/doc/sphinx/source/esmvalcore/datafinder.inc b/doc/sphinx/source/esmvalcore/datafinder.inc index 333a828cd0..0b4e0565cd 100644 --- a/doc/sphinx/source/esmvalcore/datafinder.inc +++ b/doc/sphinx/source/esmvalcore/datafinder.inc @@ -4,4 +4,107 @@ Data finder *********** -Documentation of the _data_finder.py module (incl. _download.py?) \ No newline at end of file +Overview +======== +Data discovery and retrieval is the first step in any evaluation process; ESMValTool +uses a `semi-automated` data finding mechanism performed by the ``_data_finder.py`` module +with inputs from both the user configuration file and the recipe file. The reason why the data +finder module is `semi`-automated is that the user will have to provide the tool with a set +of parameters related to the data needed; the reason why it is semi-`automated` is that once +these parameters have been provided, the tool will automatically find the right data. 
We will +detail below the data finding and retrieval process and the inputs the user needs to specify, +giving examples on how to use the data finding routine under different scenarios. + +CMIP data: Data Reference Syntax (DRS) and the ESGF +=================================================== +CMIP data is widely available via the Earth System Grid Federation (`ESGF `_) +and is accessible to users either via dowload from the ESGF portal or through the ESGF data nodes hosted +by large computing facilities (like CEDA-Jasmin, DKRZ etc). This data adheres to, among other standards, +the DRS and Controlled Vocabulary standard for naming files and structured paths; the `DRS `_ +ensures that files and paths to them are named according to a standardized convention. An example of this +convention, and also used by ESMValTool for file discovery and data retrieval can be seen here: + +* CMIP6 file: ``[variable_short_name]_[mip]_[dataset_name]_[experiment]_[ensemble]_[grid]_[start-date]-[end-date].nc``; +* CMIP5 file: ``[variable_short_name]_[mip]_[dataset_name]_[experiment]_[ensemble]_[start-date]-[end-date].nc``; + +and similar standards exist for the standard paths (input directories); for the ESGF data nodes, +these paths differ slightly, an example is given below: + +* CMIP6 path for BADC: ``ROOT-BADC/[institute]/[dataset_name]/[experiment]/[ensemble]/[mip]/ + [variable_short_name]/[grid]``; +* CMIP6 pth for ETHZ: ``ROOT-ETHZ/[experiment]/[mip]/[variable_short_name]/[dataset_name]/[ensemble]/[grid]``; + +From the ESMValTool user perspective the number of data input parameters is optimized to allow for ease of use. +We detail this procedure in the next section. 
+ +ESMValTool data retrieval +========================= +Data retrieval in ESMValTool has two main aspects from the user's point of view: + +* data can be found by the tool, subject to availability on disk; +* it is the user's responsibility to set the correct data retrieval parameters; + +The first point is self-explanatory: if the user runs the tool on a machine that has access to a data +repository or multiple data repositories, then ESMValTool will look for and find the available data requested +by the user. + +The second point underlines the fact that the user has full control over what type and the amount of data they +need for their analyses. Setting the data retrieval parameters is explained below: + +Setting the correct paths +------------------------- +The first step towards providing ESMValTool the correct set of parameters for data retrieval is setting +the root paths to the data. This is done in the user configuration file ``config-user.yml``. +The two sections where the user will set the paths are ``rootpath`` and ``drs``. ``rootpath`` contains pointers +to ``CMIP``, ``OBS``, ``default`` and ``RAWOBS`` root paths: + +* ``CMIP`` e.g. ``CMIP5`` or ``CMIP6``: this is the `root` path(s) to where the CMIP files are stored; + it can be a single path or a list of paths; it can point to an ESGF node or it can point to a user + private repository; + + Example for a CMIP5 root path pointing to the ESGF node on CEDA-Jasmin (formerly known as BADC): + + .. code-block:: yaml + + CMIP5: /badc/cmip5/data/cmip5/output1 + + Example for a CMIP6 root path pointing to the ESGF node on CEDA-Jasmin (formerly known as BADC): + + .. code-block:: yaml + + CMIP6: /badc/cmip6/data/CMIP6/CMIP + + Example for a mix of CMIP6 root path pointing to the ESGF node on CEDA-Jasmin (formerly known as BADC) + and a user-specific data repository for extra data: + + .. 
code-block:: yaml + + CMIP6: [/badc/cmip6/data/CMIP6/CMIP, /home/users/joepesci/cmip_data] + +* ``OBS``: this is the `root` path(s) to where the observational datasets are stored; again, this could + be a single path or a list of paths, just like for CMIP data. + + Example for the OBS path for a large cache of observation datasets on CEDA-Jasmin: + + .. code-block:: yaml + + OBS: /group_workspaces/jasmin4/esmeval/obsdata-v2 + + + + + + + + + + + + + + + + + + + From 949be8badc81eb07ea10122ed9a7119110ecc472 Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Wed, 26 Jun 2019 14:49:42 +0100 Subject: [PATCH 08/49] plodding along with data finder --- doc/sphinx/source/esmvalcore/datafinder.inc | 54 +++++++++++++++++++-- 1 file changed, 49 insertions(+), 5 deletions(-) diff --git a/doc/sphinx/source/esmvalcore/datafinder.inc b/doc/sphinx/source/esmvalcore/datafinder.inc index 0b4e0565cd..bdbf0db7ac 100644 --- a/doc/sphinx/source/esmvalcore/datafinder.inc +++ b/doc/sphinx/source/esmvalcore/datafinder.inc @@ -15,8 +15,8 @@ these parameters have been provided, the tool will automatically find the right detail below the data finding and retrieval process and the inputs the user needs to specify, giving examples on how to use the data finding routine under different scenarios. -CMIP data: Data Reference Syntax (DRS) and the ESGF -=================================================== +CMIP data: CMOR Data Reference Syntax (DRS) and the ESGF +======================================================== CMIP data is widely available via the Earth System Grid Federation (`ESGF `_) and is accessible to users either via dowload from the ESGF portal or through the ESGF data nodes hosted by large computing facilities (like CEDA-Jasmin, DKRZ etc). This data adheres to, among other standards, @@ -51,12 +51,51 @@ by the user. The second point underlines the fact that the user has full control over what type and the amount of data they need for their analyses. 
Setting the data retrieval parameters is explained below: -Setting the correct paths -------------------------- +Setting the correct root paths +------------------------------ The first step towards providing ESMValTool the correct set of parameters for data retrieval is setting the root paths to the data. This is done in the user configuration file ``config-user.yml``. The two sections where the user will set the paths are ``rootpath`` and ``drs``. ``rootpath`` contains pointers -to ``CMIP``, ``OBS``, ``default`` and ``RAWOBS`` root paths: +to ``CMIP``, ``OBS``, ``default`` and ``RAWOBS`` root paths; ``drs`` sets the type of directory structure +the root paths are structured by. It is important to first discuss the ``drs`` parameter: as we've seen in +the previous section, the DRS as a standard is used for both file naming conventions and for directory structures. + +Explaining ``drs: CMIP5:`` or ``drs: CMIP6:`` +--------------------------------------------- +Whreas ESMValTool will **always** use the CMOR standard for file naming (please refer above), by setting the ``drs`` +parameter the user tells the tool what type of root paths they need the data from, e.g.: + + .. code-block:: yaml + + drs: + CMIP6: BADC + +will tell the tool that the user needs data from a repository structured according to the BADC DRS structure `ie` + +``ROOT/[institute]/[dataset_name]/[experiment]/[ensemble]/[mip]/[variable_short_name]/[grid]``; + +setting the ``ROOT`` parameter is explained below. This is a strictly-structured repository tree and if +there are any sort of irregularities (e.g. there is no ``[mip]`` directory) the data will not be found! +``BADC`` can be replaced with ``DKRZ`` or ``ETHZ`` depending on the existing ``ROOT`` directory structure. + +The snippet + + .. 
code-block:: yaml + + drs: + CMIP6: default + +is another way to retrieve data from a ``ROOT`` directory that has no DRS-like structure; ``default`` is +a directory that contains all the needed data files (a bucket full of everything). + +.. note:: + When using ``CMIP6: default`` or ``CMIP5: default`` it is important to remember that all the needed files + must be in the same top-level directory set by ``default`` (see below how to set ``default``). + +Explaining ``rootpath:`` +------------------------ + +``rootpath`` identifies the root directory for different data types (``ROOT`` as we used it above): * ``CMIP`` e.g. ``CMIP5`` or ``CMIP6``: this is the `root` path(s) to where the CMIP files are stored; it can be a single path or a list of paths; it can point to an ESGF node or it can point to a user @@ -90,7 +129,12 @@ to ``CMIP``, ``OBS``, ``default`` and ``RAWOBS`` root paths: OBS: /group_workspaces/jasmin4/esmeval/obsdata-v2 +* ``default``: this is the `root` path(s) to where files are stored without any DRS-like directory + structure; in a nutshell, this is a single directory that should contain all the files needed by the + run, without any sub-directory structure. +* ``RAWOBS``: this is the `root` path(s) to where the raw observational data files are stored; this is + used by ``cmorize_obs``. 
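The way a ``drs`` directory template and a ``rootpath`` entry combine into a search directory can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code in ``_data_finder.py``: the helper ``fill_drs_template``, the constant ``BADC_CMIP6_TEMPLATE`` and the facet dictionary are hypothetical names, and the facet values (including the institute ``MOHC``) are example values only.

.. code-block:: python

    import re

    # BADC-style CMIP6 directory structure, relative to the root path
    BADC_CMIP6_TEMPLATE = (
        "[institute]/[dataset_name]/[experiment]/[ensemble]/"
        "[mip]/[variable_short_name]/[grid]"
    )


    def fill_drs_template(root, template, facets):
        """Replace every [key] in the template with facets[key].

        A missing facet raises KeyError, mirroring the rule that an
        irregular directory tree means the data will not be found.
        """
        def lookup(match):
            return facets[match.group(1)]

        relative = re.sub(r"\[(\w+)\]", lookup, template)
        return root.rstrip("/") + "/" + relative


    facets = {
        "institute": "MOHC",
        "dataset_name": "UKESM1-0-LL",
        "experiment": "historical",
        "ensemble": "r1i1p1f2",
        "mip": "Amon",
        "variable_short_name": "ta",
        "grid": "gn",
    }
    cmip6_dir = fill_drs_template(
        "/badc/cmip6/data/CMIP6/CMIP", BADC_CMIP6_TEMPLATE, facets
    )

A ``default`` structure would correspond to an empty template in this sketch, so that every file is expected directly under the root directory.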
From 49ab99008b0be0259b3981323205d366c52bfc70 Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Wed, 26 Jun 2019 16:05:44 +0100 Subject: [PATCH 09/49] almost done with datafinder --- doc/sphinx/source/esmvalcore/datafinder.inc | 69 +++++++++++++++++++-- 1 file changed, 63 insertions(+), 6 deletions(-) diff --git a/doc/sphinx/source/esmvalcore/datafinder.inc b/doc/sphinx/source/esmvalcore/datafinder.inc index bdbf0db7ac..1b6362d46b 100644 --- a/doc/sphinx/source/esmvalcore/datafinder.inc +++ b/doc/sphinx/source/esmvalcore/datafinder.inc @@ -15,6 +15,8 @@ these parameters have been provided, the tool will automatically find the right detail below the data finding and retrieval process and the inputs the user needs to specify, giving examples on how to use the data finding routine under different scenarios. +.. _CMOR-DRS: + CMIP data: CMOR Data Reference Syntax (DRS) and the ESGF ======================================================== CMIP data is widely available via the Earth System Grid Federation (`ESGF `_) @@ -37,8 +39,8 @@ these paths differ slightly, an example is given below: From the ESMValTool user perspective the number of data input parameters is optimized to allow for ease of use. We detail this procedure in the next section. -ESMValTool data retrieval -========================= +Data retrieval +============== Data retrieval in ESMValTool has two main aspects from the user's point of view: * data can be found by the tool, subject to availability on disk; @@ -60,8 +62,8 @@ to ``CMIP``, ``OBS``, ``default`` and ``RAWOBS`` root paths; ``drs`` sets the ty the root paths are structured by. It is important to first discuss the ``drs`` parameter: as we've seen in the previous section, the DRS as a standard is used for both file naming conventions and for directory structures. 
-Explaining ``drs: CMIP5:`` or ``drs: CMIP6:`` ---------------------------------------------- +Explaining ``config-user/drs: CMIP5:`` or ``config-user/drs: CMIP6:`` +--------------------------------------------------------------------- Whreas ESMValTool will **always** use the CMOR standard for file naming (please refer above), by setting the ``drs`` parameter the user tells the tool what type of root paths they need the data from, e.g.: @@ -92,8 +94,8 @@ a directory that contains all the needed data files (a bucket full of everything When using ``CMIP6: default`` or ``CMIP5: default`` it is important to remember that all the needed files must be in the same top-level directory set by ``default`` (see below how to set ``default``). -Explaining ``rootpath:`` ------------------------- +Explaining ``config-user/rootpath:`` +------------------------------------ ``rootpath`` identifies the root directory for different data types (``ROOT`` as we used it above): @@ -136,9 +138,64 @@ Explaining ``rootpath:`` * ``RAWOBS``: this is the `root` path(s) to where the raw observational data files are stored; this is used by ``cmorize_obs``. +Dataset definitions in ``recipe`` +--------------------------------- +Once the correct paths have been established, it is now time to collect the information on the specific +datasets that are needed for the analysis. This information, together with the CMOR convention for +naming files (see CMOR-DRS_) will allow ``_data_finder`` to search and find the right files. The specific +datasets are listed in any recipe, under either the ``datasets`` and/or ``additional_datasets`` sections, e.g. + +.. 
code-block:: yaml + + datasets: + - {dataset: HadGEM2-CC, project: CMIP5, exp: historical, ensemble: r1i1p1, start_year: 2001, end_year: 2004} + - {dataset: UKESM1-0-LL, project: CMIP6, exp: historical, ensemble: r1i1p1f2, grid: gn, start_year: 2004, end_year: 2014} + +``_data_finder`` will use this information to find data for **all** the variables specified in ``diagnostics/variables``. + +Recap and example +----------------- +Let's look at a practical example for a recap of the information above: suppose you are using a ``config-user.yml`` +that has the following entries for data finding: + +.. code-block:: yaml + + rootpath: # running on CEDA-Jasmin + CMIP6: /badc/cmip6/data/CMIP6/CMIP + drs: + CMIP6: BADC # since you are on CEDA-Jasmin + +and the dataset you need is specified in your ``recipe.yml`` as: + +.. code-block:: yaml + + - {dataset: UKESM1-0-LL, project: CMIP6, mip: Amon, exp: historical, grid: gn, ensemble: r1i1p1f2, start_year: 2004, end_year: 2014} + +for a variable e.g. + +.. 
code-block:: yaml + + diagnostics: + some_diagnostic: + description: some_description + variables: + ta: + preprocessor: some_preprocessor + +``_data_finder`` will use the root path ``/badc/cmip6/data/CMIP6/CMIP`` and the dataset information and will +assemble the full DRS path using information from CMOR-DRS_ and establish the path to the files as + +``/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r1i1p1f2/Amon`` + +then look for variable ``ta`` and specifically the latest version of the data file: +``/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r1i1p1f2/Amon/ta/gn/latest/`` +and finally, using the file naming definition from CMOR-DRS_ find the file: +``/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r1i1p1f2/Amon/`` +``ta/gn/latest/`` +``ta_Amon_UKESM1-0-LL_historical_r1i1p1f2_gn_195001-201412.nc`` From 133eab3138fbced7343da31f6ad25a69202d57b7 Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Wed, 26 Jun 2019 16:33:43 +0100 Subject: [PATCH 10/49] finished data finder --- doc/sphinx/source/esmvalcore/datafinder.inc | 28 +++++++++++++++------ 1 file changed, 20 insertions(+), 8 deletions(-) diff --git a/doc/sphinx/source/esmvalcore/datafinder.inc b/doc/sphinx/source/esmvalcore/datafinder.inc index 1b6362d46b..8a25908571 100644 --- a/doc/sphinx/source/esmvalcore/datafinder.inc +++ b/doc/sphinx/source/esmvalcore/datafinder.inc @@ -26,15 +26,16 @@ the DRS and Controlled Vocabulary standard for naming files and structured paths ensures that files and paths to them are named according to a standardized convention. 
An example of this convention, and also used by ESMValTool for file discovery and data retrieval can be seen here: -* CMIP6 file: ``[variable_short_name]_[mip]_[dataset_name]_[experiment]_[ensemble]_[grid]_[start-date]-[end-date].nc``; -* CMIP5 file: ``[variable_short_name]_[mip]_[dataset_name]_[experiment]_[ensemble]_[start-date]-[end-date].nc``; +* CMIP6 file: ``[variable_short_name]_[mip]_[dataset_name]_[experiment]_[ensemble]_[grid]_[start-date]-[end-date].nc`` +* CMIP5 file: ``[variable_short_name]_[mip]_[dataset_name]_[experiment]_[ensemble]_[start-date]-[end-date].nc`` +* OBS file: ``[project]_[dataset_name]_[type]_[version]_[mip]_[short_name]_[start-date]-[end-date].nc`` and similar standards exist for the standard paths (input directories); for the ESGF data nodes, these paths differ slightly, an example is given below: * CMIP6 path for BADC: ``ROOT-BADC/[institute]/[dataset_name]/[experiment]/[ensemble]/[mip]/ [variable_short_name]/[grid]``; -* CMIP6 pth for ETHZ: ``ROOT-ETHZ/[experiment]/[mip]/[variable_short_name]/[dataset_name]/[ensemble]/[grid]``; +* CMIP6 path for ETHZ: ``ROOT-ETHZ/[experiment]/[mip]/[variable_short_name]/[dataset_name]/[ensemble]/[grid]`` From the ESMValTool user perspective the number of data input parameters is optimized to allow for ease of use. We detail this procedure in the next section. @@ -154,7 +155,7 @@ datasets are listed in any recipe, under either the ``datasets`` and/or ``additi ``_data_finder`` will use this information to find data for **all** the variables specified in ``diagnostics/variables``. 
Recap and example ------------------ +================= Let's look at a practical example for a recap of the information above: suppose you are using a ``config-user.yml`` that has the following entries for data finding: @@ -197,15 +198,26 @@ and finally, using the file naming definition from CMOR-DRS_ find the file: ``ta/gn/latest/`` ``ta_Amon_UKESM1-0-LL_historical_r1i1p1f2_gn_195001-201412.nc`` +Observational data +================== +Observational data is retrieved in the same manner as CMIP data, for example using the ``OBS`` root path set to + .. code-block:: yaml + OBS: /group_workspaces/jasmin4/esmeval/obsdata-v2 +and the dataset + .. code-block:: yaml + - {dataset: ERA-Interim, project: OBS, type: reanaly, version: 1, start_year: 2014, end_year: 2015, tier: 3} +in ``recipe.yml`` in ``datasets`` or ``additional_datasets``, the rules set in CMOR-DRS_ are used again +and the file will be automatically found: +``/group_workspaces/jasmin4/esmeval/obsdata-v2/`` +``Tier3/ERA-Interim/`` +``OBS_ERA-Interim_reanaly_1_Amon_ta_201401-201412.nc`` - - - - +Note that for observational data for ``drs: default`` the ``default`` directory must contain a sub-directory +``TierX`` (``Tier1``, ``Tier2`` or ``Tier3``). From ce0d3e158db6cfb7f37c0d11a23a051433846b30 Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Thu, 27 Jun 2019 11:29:05 +0100 Subject: [PATCH 11/49] added note on creating a simple multimodel mask cheers to Ben for pointing it out --- doc/sphinx/source/esmvalcore/preprocessor.inc | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/doc/sphinx/source/esmvalcore/preprocessor.inc b/doc/sphinx/source/esmvalcore/preprocessor.inc index 633724b9de..594c168e05 100644 --- a/doc/sphinx/source/esmvalcore/preprocessor.inc +++ b/doc/sphinx/source/esmvalcore/preprocessor.inc @@ -562,6 +562,24 @@ to 19.0 (in units of the variable units). See also :func:`esmvalcore.preprocessor.mask_fillvalues`. +.. 
note:: + **Pro Tip: creating a multimodel mask using ``mask_fillvalues``** + + It is possible to use ``mask_fillvalues`` to create a combined multimodel + mask (all the masks from all the analyzed models combined into a single mask); + for that purpose setting the ``threshold_fraction`` to 0 will not discard any + time windows, essentially keeping the original model masks and combining them + into a single mask; here is an example: + + .. code-block:: yaml + + preprocessors: + missing_values_preprocessor: + mask_fillvalues: + threshold_fraction: 0.0 # keep all missing values + min_value: -1e20 # small enough not to alter the data + time_window: 10.0 # this will not matter anymore + Minimum, maximum and interval masking ------------------------------------- From e9c0472d04a8081b613bb8f5bb081b54375026d0 Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Thu, 27 Jun 2019 12:03:14 +0100 Subject: [PATCH 12/49] added note on regridding memory --- doc/sphinx/source/esmvalcore/preprocessor.inc | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/doc/sphinx/source/esmvalcore/preprocessor.inc b/doc/sphinx/source/esmvalcore/preprocessor.inc index 594c168e05..539b13b68a 100644 --- a/doc/sphinx/source/esmvalcore/preprocessor.inc +++ b/doc/sphinx/source/esmvalcore/preprocessor.inc @@ -693,6 +693,14 @@ See also :func:`esmvalcore.preprocessor.regrid` * ``nanmask`` – if the source data is a MaskedArray the extrapolation points will be masked. Otherwise they will be set to NaN. +.. note:: + **Memory limits for horizontal regridding** + + The rigridding mechanism is (at the moment) done with fully realized data in memory, so depending + on how fine the target grid is, it may use a rather large amount of memory. Empirically target grids + of up to ``0.5x0.5`` degrees should not produce any memory-related issues, but be advised that + for resolutions of ``< 0.5`` degrees the regridding becomes very slow and will use a lot of memory. 
+ Multi-model statistics ====================== Computing multi-model statistics is an integral part of model analysis and evaluation: individual From 2d993de82ec046d2b4d1cdd1cac123c27a61d288 Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Thu, 27 Jun 2019 12:15:54 +0100 Subject: [PATCH 13/49] added note on multimodel memory --- doc/sphinx/source/esmvalcore/preprocessor.inc | 14 +++++++++++++- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/doc/sphinx/source/esmvalcore/preprocessor.inc b/doc/sphinx/source/esmvalcore/preprocessor.inc index 539b13b68a..aed6d576d7 100644 --- a/doc/sphinx/source/esmvalcore/preprocessor.inc +++ b/doc/sphinx/source/esmvalcore/preprocessor.inc @@ -715,7 +715,7 @@ Multimodel statistics in ESMValTool are computed along the time axis, and as suc can be computed across a common overlap in time (by specifying ``span: overlap`` argument) or across the full length in time of each model (by specifying ``span: full`` argument). -Restrictive compuation is also available by excluding any set of models that the user +Restrictive computation is also available by excluding any set of models that the user will not want to include in the statistics (by setting ``exclude: [excluded models list]`` argument). The implementation has a few restrictions that apply to the input data: model datasets must have consistent shapes, and from a statistical point of view, this is needed since weights are not yet @@ -733,6 +733,18 @@ than four: time, vertical axis, two horizontal axes). see also :func:`esmvalcore.preprocessor.multi_model_statistics`. +.. note:: + + **Memory limits for multimodel statistics** + + Note that the multimodel array operations, albeit performed in per-time/per-horizontal level + loops to save memory, could, however, be rather memory-intensive (since they are not performed + lazily as yet). 
Section _MemoryUse details the memory intake for different run scenarios, but + as a thumb rule, for the multimodel preprocessor, the expected maximum memory intake could be + approximated as the number of datasets multiplied by the average size in memory for one dataset. + +.. _MemoryUse: + Information on maximum memory required ====================================== In the most general case, we can set upper limits on the maximum memory the anlysis will require: From 410a28294035ea40250051b6fcb5be721c568b92 Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Thu, 27 Jun 2019 12:20:43 +0100 Subject: [PATCH 14/49] added another memory note and comment a thing or two --- doc/sphinx/source/esmvalcore/preprocessor.inc | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/doc/sphinx/source/esmvalcore/preprocessor.inc b/doc/sphinx/source/esmvalcore/preprocessor.inc index aed6d576d7..f85dd3d616 100644 --- a/doc/sphinx/source/esmvalcore/preprocessor.inc +++ b/doc/sphinx/source/esmvalcore/preprocessor.inc @@ -576,9 +576,9 @@ See also :func:`esmvalcore.preprocessor.mask_fillvalues`. preprocessors: missing_values_preprocessor: mask_fillvalues: - threshold_fraction: 0.0 # keep all missing values - min_value: -1e20 # small enough not to alter the data - time_window: 10.0 # this will not matter anymore + threshold_fraction: 0.0 # keep all missing values + min_value: -1e20 # small enough not to alter the data + # time_window: 10.0 # this will not matter anymore Minimum, maximum and interval masking ------------------------------------- @@ -739,7 +739,7 @@ see also :func:`esmvalcore.preprocessor.multi_model_statistics`. Note that the multimodel array operations, albeit performed in per-time/per-horizontal level loops to save memory, could, however, be rather memory-intensive (since they are not performed - lazily as yet). Section _MemoryUse details the memory intake for different run scenarios, but + lazily as yet). 
Section MemoryUse_ details the memory intake for different run scenarios, but as a thumb rule, for the multimodel preprocessor, the expected maximum memory intake could be approximated as the number of datasets multiplied by the average size in memory for one dataset. From 4501f9a54c99e459d11673b15780b6b71366aba3 Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Thu, 27 Jun 2019 12:59:35 +0100 Subject: [PATCH 15/49] added hooks for section referencing --- doc/sphinx/source/esmvalcore/datafinder.inc | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/doc/sphinx/source/esmvalcore/datafinder.inc b/doc/sphinx/source/esmvalcore/datafinder.inc index 8a25908571..18761713d0 100644 --- a/doc/sphinx/source/esmvalcore/datafinder.inc +++ b/doc/sphinx/source/esmvalcore/datafinder.inc @@ -63,6 +63,8 @@ to ``CMIP``, ``OBS``, ``default`` and ``RAWOBS`` root paths; ``drs`` sets the ty the root paths are structured by. It is important to first discuss the ``drs`` parameter: as we've seen in the previous section, the DRS as a standard is used for both file naming conventions and for directory structures. +.. _config-user-drs: + Explaining ``config-user/drs: CMIP5:`` or ``config-user/drs: CMIP6:`` --------------------------------------------------------------------- Whreas ESMValTool will **always** use the CMOR standard for file naming (please refer above), by setting the ``drs`` @@ -95,6 +97,8 @@ a directory that contains all the needed data files (a bucket full of everything When using ``CMIP6: default`` or ``CMIP5: default`` it is important to remember that all the needed files must be in the same top-level directory set by ``default`` (see below how to set ``default``). +.. 
_config-user-rootpath: + Explaining ``config-user/rootpath:`` ------------------------------------ From 7f0ce6fdb12c84accafcf486ad539e7c26af765c Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Thu, 27 Jun 2019 12:59:59 +0100 Subject: [PATCH 16/49] work on config section --- doc/sphinx/source/esmvalcore/config.inc | 136 +++++++++++++++++++++++- 1 file changed, 131 insertions(+), 5 deletions(-) diff --git a/doc/sphinx/source/esmvalcore/config.inc b/doc/sphinx/source/esmvalcore/config.inc index 9695430621..527f38a499 100644 --- a/doc/sphinx/source/esmvalcore/config.inc +++ b/doc/sphinx/source/esmvalcore/config.inc @@ -4,18 +4,144 @@ Configuration files ******************* +Overview +======== + There are several configuration files in ESMValTool: - - config-user.yml - - config-developer.yml - - config-references.yml - - config-logging.yml +* ``config-user.yml``: sets a number of user-specific options like desired + graphical output format, root paths to data etc. +* ``config-developer.yml``: sets a number of standardized file-naming and paths to data + formatting; +* ``config-references.yml``: stores information on diagnostic authors and scientific + journals references; +* ``config-logging.yml``: stores information on logging (duh!). User configuration file ======================= -See Section +The ``config-user.yml`` is one of the two files the user needs to provide to the +``esmvaltool`` executable at run time, the second being ``recipe.yml`` (named accordingly +as per the specific diagnostic/recipe used). + +The ``config-user.yml`` configuration file contains all the global level +information needed by ESMValTool. ``config-user.yml`` can be reused as many times the +user needs to before changing any of the options stored in it. This file is essentially +the gateway between the user and the machine-specific instructions to ``esmvaltool``. 
+The following shows the default settings from the ``config-user.yml`` file with explanations +in a commented line above each option: + +.. code-block:: yaml + + # Diagnostics create plots? [true]/false + # turning it off will turn off graphical output from diagnostic + write_plots: true + + # Diagnositcs write NetCDF files? [true]/false + # turning it off will turn off netCDF output from diagnostic + write_netcdf: true + + # Set the console log level debug, [info], warning, error + # for much more information printed to screen set log_level: debug + log_level: info + # verbosity is deprecated and will be removed in the future + # verbosity: 1 + + # Exit on warning? true/[false] + exit_on_warning: false + + # Plot file format? [ps]/pdf/png/eps/epsi + output_file_type: pdf + + # Destination directory where all output will be written + # including log files and performance stats + output_dir: ./esmvaltool_output + + # Auxiliary data directory (used for some additional datasets) + # this is where e.g. files can be downloaded to by a download + # script embedded in the diagnostic + auxiliary_data_dir: ./auxiliary_data + + # Use netCDF compression true/[false] + compress_netcdf: false + + # Save intermediary cubes in the preprocessor true/[false] + # set to true will save the output cube from each preprocessing step + # these files are numbered according to the preprocessing order + save_intermediary_cubes: false + + # Remove the preproc dir if all fine + # this option true will remove ALL preprocessor files + # CAUTION when using: if you need those files, set it to false + remove_preproc_dir: true + + # Run at most this many tasks in parallel null/[1]/2/3/4/.. + # Set to null to use the number of available CPUs. + # Make sure your system has enough memory for the specified number of tasks. + max_parallel_tasks: 1 + + # Path to custom config-developer file, to customise project configurations. + # See config-developer.yml for an example. 
Set to None to use the default + config_developer_file: null + + # Get profiling information for diagnostics + # Only available for Python diagnostics + profile_diagnostic: false + + # Rootpaths to the data from different projects (lists are also possible) + rootpath: + CMIP5: [~/cmip5_inputpath1, ~/cmip5_inputpath2] + OBS: ~/obs_inputpath + default: ~/default_inputpath + + # Directory structure for input data: [default]/BADC/DKRZ/ETHZ/etc + # See config-developer.yml for definitions. + drs: + CMIP5: default + +Most of these settings are fairly self-explanatory, ie: + +.. code-block:: yaml + + # Diagnostics create plots? [true]/false + write_plots: true + # Diagnositcs write NetCDF files? [true]/false + write_netcdf: true + +The ``write_plots`` setting is used to inform ESMValTool about your preference +for saving figures. Similarly, the ``write_netcdf`` setting is a boolean which +turns on or off the writing of netCDF files. + +.. code-block:: yaml + + # Auxiliary data directory (used for some additional datasets) + auxiliary_data_dir: ~/auxiliary_data + +The ``auxiliary_data_dir`` setting is the path to place any required +additional auxiliary data files. This method was necessary because certain +Python toolkits such as cartopy will attempt to download data files at run +time, typically geographic data files such as coastlines or land surface maps. +This can fail if the machine does not have access to the wider internet. This +location allows us to tell cartopy (and other similar tools) where to find the +files if they can not be downloaded at runtime. To reiterate, this setting is +not for model or observational datasets, rather it is for data files used in +plotting such as coastline descriptions and so on. + +.. note:: + + **Pro Tip: working with multiple config-user files.** + + You choose your config.yml file at run time, so you could have several + available with different purposes. One for formalised run, one for debugging, etc. + +.. 
note:: + + **Note on data finding sections of the config-user file.** + A detailed explanation of the data finding-related sections of the ``config-user.yml`` + (``rootpath`` and ``drs``) is presented in config-user-rootpath_ and config-user-drs_ + in the Data Finder section; these sections relate directly to the data finding capabilities + of ESMValTool and are very important to be understood by the user. Developer configuration file ============================ From 1ca7568046bfe3b1aac50d711e7cf5381c93d4f8 Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Thu, 27 Jun 2019 13:40:50 +0100 Subject: [PATCH 17/49] tidying up config --- doc/sphinx/source/esmvalcore/config.inc | 21 ++++++++++++++------- 1 file changed, 14 insertions(+), 7 deletions(-) diff --git a/doc/sphinx/source/esmvalcore/config.inc b/doc/sphinx/source/esmvalcore/config.inc index 527f38a499..ffd52d7f8e 100644 --- a/doc/sphinx/source/esmvalcore/config.inc +++ b/doc/sphinx/source/esmvalcore/config.inc @@ -148,10 +148,14 @@ Developer configuration file This configuration file describes the file system structure for several key projects (CMIP5, CMIP6) on several key machines (BADC, CP4CDS, DKRZ, ETHZ, -SMHI, BSC). - -The data directory structure of the CMIP5 project is set up differently -at each site. The following code snipper is an example of several paths +SMHI, BSC) - CMIP data is stored as part of the Earth System Grid Federation (ESGF) +and the standards for file naming and paths to files are set out by CMOR and DRS. +For a detailed description of these standards and how the impact on the use in ESMValTool +we refer the user to CMOR-DRS_ section where we relate these standards to the data retrieval +mechanism built-in ESMValTool. + +The data directory structure of the CMIP5/6 projects is set up differently +at each site. The following code snippet is an example of several paths descriptions for the CMIP5 at various sites: .. 
code-block:: yaml @@ -172,13 +176,16 @@ As an example, the CMIP5 file path on BADC would be: [institute]/[dataset ]/[exp]/[frequency]/[modeling_realm]/[mip]/[ensemble]/latest/[short_name] -When loading these files, ESMValTool replaces the placeholders with the true -values. The resulting real path would look something like this: +When loading these files, ESMValTool replaces the placeholders ``[item]`` with actual +values supplied for by the information form ``config-user.yml`` and ``recipe.yml``. +The resulting real path would look something like this: -.. code-block:: yaml +.. code-block:: bash MOHC/HadGEM2-CC/rcp85/mon/ocean/Omon/r1i1p1/latest/tos +Again, for a more in-depth description this process, as part of the data retrieval mechanism, +please see CMOR-DRS_. References configuration file ============================= From 88c54d7df3023f49109cf7f5edd58acd041da921 Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Thu, 27 Jun 2019 14:39:42 +0100 Subject: [PATCH 18/49] shipshaping config --- doc/sphinx/source/esmvalcore/config.inc | 4 ++++ 1 file changed, 4 insertions(+) diff --git a/doc/sphinx/source/esmvalcore/config.inc b/doc/sphinx/source/esmvalcore/config.inc index ffd52d7f8e..563095a2ef 100644 --- a/doc/sphinx/source/esmvalcore/config.inc +++ b/doc/sphinx/source/esmvalcore/config.inc @@ -143,6 +143,8 @@ plotting such as coastline descriptions and so on. in the Data Finder section; these sections relate directly to the data finding capabilities of ESMValTool and are very important to be understood by the user. +.. _config-developer: + Developer configuration file ============================ @@ -187,6 +189,8 @@ The resulting real path would look something like this: Again, for a more in-depth description this process, as part of the data retrieval mechanism, please see CMOR-DRS_. +.. 
_config-ref: + References configuration file ============================= From d058fd98e5bda1558b2c40de947667a09ce64d3f Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Thu, 27 Jun 2019 14:40:02 +0100 Subject: [PATCH 19/49] shipshaping preprocess --- doc/sphinx/source/esmvalcore/preprocessor.inc | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/doc/sphinx/source/esmvalcore/preprocessor.inc b/doc/sphinx/source/esmvalcore/preprocessor.inc index f85dd3d616..dd14f34bba 100644 --- a/doc/sphinx/source/esmvalcore/preprocessor.inc +++ b/doc/sphinx/source/esmvalcore/preprocessor.inc @@ -357,10 +357,13 @@ but it may be necceasiry for irregular grids. See also :func:`esmvalcore.preprocessor.extract_trajectory`. +.. _cmor-checks-fixes: CMORization and dataset-specific fixes ====================================== -Javier +.. warning:: + + Section to be added by Javier ``CMORMAN`` Vegas-Regidor Vertical interpolation ====================== From ef37a0a7d414c2bc8d82c9b839da0d859853b2a0 Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Thu, 27 Jun 2019 14:40:47 +0100 Subject: [PATCH 20/49] adding recipe dox mostly taken from esmvaltool by Lee but changing it and improving it quite a bity --- doc/sphinx/source/esmvalcore/recipe.inc | 201 ++++++++++++++++++++++++ 1 file changed, 201 insertions(+) diff --git a/doc/sphinx/source/esmvalcore/recipe.inc b/doc/sphinx/source/esmvalcore/recipe.inc index 72402f9865..8f8afc817d 100644 --- a/doc/sphinx/source/esmvalcore/recipe.inc +++ b/doc/sphinx/source/esmvalcore/recipe.inc @@ -3,3 +3,204 @@ ****** Recipe ****** + +.. _recipe: + +Overview +======== + +After ``config-user.yml``, the ``recipe.yml`` file the second the user inputted file +to ESMValTool at each run time point. Rceipes contain the data and data analysis +information and instructions needed to run the diagnostic(s), as well as specific +diagnostic-related instructions. 
+ +Broadly, recipes contain information on the user who wrote the +recipe file, the datasets which need to be run, the preprocessors that need to be +applied, and the diagnostics which need to be run over the preprocessed data. +This information is provided to ESMValTool in for main recipe sections: +`Documentation`_, `Datasets`_, `Preprocessors`_ and `Diagnostics`_, +respectively. + + +Recipe section: ``documentation`` +================================= + +The documentation section includes: + +- The recipe's author's user name (``authors``, as they appaer in ``config-references.yml`` config-ref_) +- A description of the recipe (``description``, written in MarkDown format) +- The user name of the maintainer (``maintainer``, as they appaer in ``config-references.yml`` config-ref_) +- A list of scientific references (``references`` , as they appaer in ``config-references.yml`` config-ref_) +- the project or projects associated with the recipe (``projects``, as they appaer in ``config-references.yml`` config-ref_) + +For example, please see the documentation section from the recipe: +``recipes/recipe_ocean_amoc.yml``: + +.. code-block:: yaml + + documentation: + description: | + Recipe to produce time series figures of the derived variable, the + Atlantic meriodinal overturning circulation (AMOC). + This recipe also produces transect figures of the stream functions for + the years 2001-2004. + + authors: + - demo_le + + maintainer: + - demo_le + + references: + - demora2018gmd + + projects: + - ukesm + +.. note:: + + **Information from config-references.yml** + + Note that the authors, projects, and references will need to be included in the + ``config-references.yml`` file. The author name uses the format: + ``surname_name``. For instance, Joe Pesci would be: ``authors: - pesci_joe``. 
+ +Recipe section: ``datasets`` +============================ + +The ``datasets`` section includes dictionaries that, via key-value pairs, define standardized +data specifications: + +- dataset name (key ``dataset``, value e.g. ``MPI-ESM-LR`` or ``UKESM1-0-LL``) +- project (key ``project``, value ``CMIP5`` or ``CMIP6`` for CMIP data, + ``OBS`` for observational data, ``ana4mips`` for ana4mips data, + ``obs4mips`` for obs4mips data, ``EMAC`` for EMAC data) +- experiment (key ``exp``, value e.g. ``historical``, ``amip``, ``piControl``, ``RCP8.5``) +- mip (for CMIP data, key ``mip``, value e.g. ``Amon``, ``Omon``, ``LImon``) +- ensemble member (key ``ensemble``, value e.g. ``r1i1p1``, ``r1i1p1f1``) +- time range (e.g. key-value ``start_year: 1982``, ``end_year: 1990``) +- model grid (native grid ``grid: gn`` or regridded grid ``grid: gr``, for CMIP6 data only). + +For example, a datasets section could be: + +.. code-block:: yaml + + datasets: + - {dataset: CanESM2, project: CMIP5, exp: historical, ensemble: r1i1p1, start_year: 2001, end_year: 2004} + - {dataset: UKESM1-0-LL, project: CMIP6, exp: historical, ensemble: r1i1p1f2, start_year: 2001, end_year: 2004, grid: gn} + + +Note that this section is not required, as datasets can also be provided in the +`Diagnostics`_ section. + + +Recipe section: ``preprocessors`` +================================= + +The preprocessor section of the recipe includes one or more preprocesors, each +of which may execute one or several preprocessor steps. + +Each preprocessor section includes: + +- A preprocessor name (any name, one yaml tab (2 spaces) under ``preprocessors``); +- A list of preprocesor steps to be executed (choose from the API, each one yaml tab (2 spaces) under preprocessor name) +- Any or none arguments given to the preprocessor steps (one yaml tab (2 spaces) under preprocessor step name) +- The order that the preprocesor steps are applied can also be specified using the ``custom_order`` preprocesor function. 
+ +The following preprocessor is an example of a preprocessor that contains +multiple preprocessing steps: + +.. code-block:: yaml + + preprocessors: + prep_map: + regrid: + target_grid: 1x1 + scheme: linear + time_average: + multi_model_statistics: + span: overlap + statistics: [mean ] + +.. note:: + + What if no preprocessor is needed? + + In this case no ``preprocessors`` section is needed; + the workflow will apply a ``default`` preprocessor consisting of only + basic operations like: loading data, applying CMOR checks and fixes (cmor-checks-fixes_) + and saving the data to disk (if needed). + + +Diagnostics +=========== + +The diagnostics section includes one or more diagnostics. Each diagnostics will +have: + +- A list of which variables to load +- A description of the variables (optional) +- Which preprocessor to apply to each variable +- The script to run +- The diagnostics can also include an optional ``additional_datasets`` section. + +The ``additional_datasets`` can add datasets beyond those listed in the the +`Datasets`_ section. This is useful if specific datasets need to be linked with +a specific diagnostics. The addition datasets can be used to add variable +specific datasets. This is also a good way to add observational datasets can be +added to the diagnostic. + +The following example, taken from recipe_ocean_example.yml, shows a diagnostic +named `diag_map`, which loads the temperature at the ocean surface between +the years 2001 and 2003 and then passes it to the prep_map preprocessor. +The result of this process is then passed to the ocean diagnostic map scipt, +``ocean/diagnostic_maps.py``. + +.. 
code-block:: yaml + + diagnostics: + + diag_map: + description: Global Ocean Surface regridded temperature map + variables: + tos: # Temperature at the ocean surface + preprocessor: prep_map + start_year: 2001 + end_year: 2003 + scripts: + Global_Ocean_Surface_regrid_map: + script: ocean/diagnostic_maps.py + +To define a variable/dataset combination, the keys in the diagnostic section +are combined with the keys from datasets section. If two versions of the same +key are provided, then the key in the datasets section will take precedence +over the keys in variables section. For many recipes it makes more sense to +define the ``start_year`` and ``end_year`` items in the variable section, because the +diagnostic script assumes that all the data has the same time range. + +Note that the path to the script provided in the `script` option should be +either: + +1. the absolute path to the script. +2. the path relative to the ``esmvaltool/diag_scripts`` directory. + + +As mentioned above, the datasets are provided in the `Diagnostics`_ section +in this section. However, they could also be included in the `Datasets`_ +section. + + +Brief introduction to YAML +========================== + +While .yaml is a relatively common format, maybe users may not have +encountered this language before. The key information about this format is: + +- Yaml is a human friendly markup language. +- Yaml is commonly used for configuration files. +- the syntax is relatively straightforward +- Indentation matters a lot (like python)! 
+- yaml is case sensitive +- A yml tutorial is available here: https://learnxinyminutes.com/docs/yaml/ +- A yml quick reference card is available here: https://yaml.org/refcard.html +- ESMValTool uses the yamllint linter tool: http://www.yamllint.com From 6568eb925f5d3e415c0efa700112b6c57b105a62 Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Thu, 27 Jun 2019 15:05:14 +0100 Subject: [PATCH 21/49] shipshaping recipe --- doc/sphinx/source/esmvalcore/recipe.inc | 23 +++++++++++++++-------- 1 file changed, 15 insertions(+), 8 deletions(-) diff --git a/doc/sphinx/source/esmvalcore/recipe.inc b/doc/sphinx/source/esmvalcore/recipe.inc index 8f8afc817d..152e58ff94 100644 --- a/doc/sphinx/source/esmvalcore/recipe.inc +++ b/doc/sphinx/source/esmvalcore/recipe.inc @@ -4,8 +4,6 @@ Recipe ****** -.. _recipe: - Overview ======== @@ -18,9 +16,10 @@ Broadly, recipes contain information on the user who wrote the recipe file, the datasets which need to be run, the preprocessors that need to be applied, and the diagnostics which need to be run over the preprocessed data. This information is provided to ESMValTool in for main recipe sections: -`Documentation`_, `Datasets`_, `Preprocessors`_ and `Diagnostics`_, +Documentation_, Datasets_, Preprocessors_ and Diagnostics_, respectively. +.. _Documentation: Recipe section: ``documentation`` ================================= @@ -64,6 +63,11 @@ For example, please see the documentation section from the recipe: Note that the authors, projects, and references will need to be included in the ``config-references.yml`` file. The author name uses the format: ``surname_name``. For instance, Joe Pesci would be: ``authors: - pesci_joe``. + Also note that Joe Pesci does not appreciate you calling him ``funny``. + For a first-time user that does not yet have their name added to ``config-references.yml`` + a run of an already-made recipe or running with no author name is possible. + +.. 
_Datasets: Recipe section: ``datasets`` ============================ @@ -91,8 +95,9 @@ For example, a datasets section could be: Note that this section is not required, as datasets can also be provided in the -`Diagnostics`_ section. +Diagnostics_ section. +.. _Preprocessors: Recipe section: ``preprocessors`` ================================= @@ -107,8 +112,9 @@ Each preprocessor section includes: - Any or none arguments given to the preprocessor steps (one yaml tab (2 spaces) under preprocessor step name) - The order that the preprocesor steps are applied can also be specified using the ``custom_order`` preprocesor function. -The following preprocessor is an example of a preprocessor that contains -multiple preprocessing steps: +The following snippet is an example of a preprocessor named ``prep_map`` that contains +multiple preprocessing steps (regrid_ with two arguments, time_average_ with no arguments +and multi_model_statistics_ with two arguments): .. code-block:: yaml @@ -131,9 +137,10 @@ multiple preprocessing steps: basic operations like: loading data, applying CMOR checks and fixes (cmor-checks-fixes_) and saving the data to disk (if needed). +.. _Diagnostics: -Diagnostics -=========== +Recipe section: ``diagnostics`` +=============================== The diagnostics section includes one or more diagnostics. Each diagnostics will have: From 32756c2206309f1cf686d654b674296dc896a942 Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Thu, 27 Jun 2019 15:05:31 +0100 Subject: [PATCH 22/49] added reference hooks --- doc/sphinx/source/esmvalcore/preprocessor.inc | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/doc/sphinx/source/esmvalcore/preprocessor.inc b/doc/sphinx/source/esmvalcore/preprocessor.inc index dd14f34bba..c438c39156 100644 --- a/doc/sphinx/source/esmvalcore/preprocessor.inc +++ b/doc/sphinx/source/esmvalcore/preprocessor.inc @@ -156,6 +156,7 @@ between 1 and 12 as the named month string will not be accepted. 
See also :func:`esmvalcore.preprocessor.extract_month`. +.. _time_average: ``time_average`` ---------------- @@ -596,6 +597,8 @@ masking or the pair ``minimum`, ``maximum`` for interval masking. See also :func:`esmvalcore.preprocessor.mask_above_threshold` and related functions. +.. _regrid: + Horizontal regridding ===================== @@ -704,6 +707,8 @@ See also :func:`esmvalcore.preprocessor.regrid` of up to ``0.5x0.5`` degrees should not produce any memory-related issues, but be advised that for resolutions of ``< 0.5`` degrees the regridding becomes very slow and will use a lot of memory. +.. _multi_model_statistics: + Multi-model statistics ====================== Computing multi-model statistics is an integral part of model analysis and evaluation: individual From 1102d7d81b25263f8b5b17ac9cd47a17b0de26ff Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Thu, 27 Jun 2019 15:51:24 +0100 Subject: [PATCH 23/49] adding stuffs --- doc/sphinx/source/esmvalcore/recipe.inc | 89 ++++++++++++++++++++++--- 1 file changed, 79 insertions(+), 10 deletions(-) diff --git a/doc/sphinx/source/esmvalcore/recipe.inc b/doc/sphinx/source/esmvalcore/recipe.inc index 152e58ff94..2137c15e12 100644 --- a/doc/sphinx/source/esmvalcore/recipe.inc +++ b/doc/sphinx/source/esmvalcore/recipe.inc @@ -152,14 +152,15 @@ have: - The diagnostics can also include an optional ``additional_datasets`` section. The ``additional_datasets`` can add datasets beyond those listed in the the -`Datasets`_ section. This is useful if specific datasets need to be linked with -a specific diagnostics. The addition datasets can be used to add variable -specific datasets. This is also a good way to add observational datasets can be -added to the diagnostic. +Datasets_ section. This is useful if specific datasets need to be linked with +a specific diagnostic. The ``additional_datasets`` can be used to add variable +specific datasets. This is also a good way to add observational datasets. 
-The following example, taken from recipe_ocean_example.yml, shows a diagnostic +Running a simple diagnostic +--------------------------- +The following example, taken from ``recipe_ocean_example.yml``, shows a diagnostic named `diag_map`, which loads the temperature at the ocean surface between -the years 2001 and 2003 and then passes it to the prep_map preprocessor. +the years 2001 and 2003 and then passes it to the ``prep_map`` preprocessor. The result of this process is then passed to the ocean diagnostic map scipt, ``ocean/diagnostic_maps.py``. @@ -188,14 +189,82 @@ diagnostic script assumes that all the data has the same time range. Note that the path to the script provided in the `script` option should be either: -1. the absolute path to the script. -2. the path relative to the ``esmvaltool/diag_scripts`` directory. + - the absolute path to the script. + - the path relative to the ``esmvaltool/diag_scripts`` directory. -As mentioned above, the datasets are provided in the `Diagnostics`_ section -in this section. However, they could also be included in the `Datasets`_ +As mentioned above, the datasets are provided in the Diagnostics_ section +in this section. However, they could also be included in the Datasets_ section. +Passing arguments to diagnostic +------------------------------- +The ``diagnostics`` section may include a lot of arguments that can be used by the +diagnostic script; these arguments are stored at runtime in a dictionary that is then +made available to the diagnostic script via the interface link (no matter if the diagnostic +is run in Python, NCL etc). Here is an example of such groups of arguments: + +.. 
code-block:: yaml + + scripts: + autoassess_strato_test_1: &autoassess_strato_test_1_settings + script: autoassess/autoassess_area_base.py + title: "Autoassess Stratosphere Diagnostic Metric MPI-MPI" + area: stratosphere + control_model: MPI-ESM-LR + exp_model: MPI-ESM-MR + obs_models: [ERA-Interim] # list to hold models that are NOT for metrics but for obs operations + additional_metrics: [ERA-Interim, inmcm4] # list to hold additional datasets for metrics + +In this example, apart from the pointer to the diagnostic script ``script: autoassess/autoassess_area_base.py``, +we pass a suite of parameters to be used by the script (``area``, ``control_model`` etc). These parameters are +stored in key-value pairs in the diagnostic configuration file, an interface file that can be used by importing +the ``run_diagnostic`` utility: + +.. code-block:: python + + from esmvaltool.diag_scripts.shared import run_diagnostic + + # write the diagnostic code here e.g. + def run_some_diagnostic(my_area, my_control_model, my_exp_model): + """Diagnostic to be run.""" + if my_area == 'stratosphere': + diag = my_control_model / my_exp_model + return diag + + def main(cfg): + """Main diagnostic run function.""" + my_area = cfg['area'] + my_control_model = cfg['control_model'] + my_exp_model = cfg['exp_model'] + run_some_diagnostic(my_area, my_control_model, my_exp_model) + + if __name__ == '__main__': + + with run_diagnostic() as config: + main(config) + +Running your own diagnostic +--------------------------- +If the user decides to test a e.g. ``my_first_diagnostic.py`` diagnostic they have just written +and, of course, this diagnostic is not in the ESMValTool diagnostics library, they can do it by +passing the absolute path to the diagnostic: + +.. 
code-block:: yaml + + diagnostics: + + myFirstDiag: + description: Joe Pesci wrote a funny diagnostic + variables: + tos: # Temperature at the ocean surface + preprocessor: prep_map + start_year: 2001 + end_year: 2003 + scripts: + JoeDiagFunny: + script: /home/users/joepesci/esmvaltool_testing/my_first_diagnostic.py + Brief introduction to YAML ========================== From 77e584c7dcbad122c761a47be7cbf920fe41a71f Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Thu, 27 Jun 2019 15:51:47 +0100 Subject: [PATCH 24/49] modified hook --- doc/sphinx/source/esmvalcore/preprocessor.inc | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/sphinx/source/esmvalcore/preprocessor.inc b/doc/sphinx/source/esmvalcore/preprocessor.inc index c438c39156..e0c0fc7427 100644 --- a/doc/sphinx/source/esmvalcore/preprocessor.inc +++ b/doc/sphinx/source/esmvalcore/preprocessor.inc @@ -1,4 +1,4 @@ -:: _preprocessor: +.. _preprocessor: ************ Preprocessor From 53fad7e8262ff631d3126f4b76e36260afda4a63 Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Thu, 27 Jun 2019 15:56:44 +0100 Subject: [PATCH 25/49] added utils section --- doc/sphinx/source/esmvalcore/index.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/doc/sphinx/source/esmvalcore/index.rst b/doc/sphinx/source/esmvalcore/index.rst index f7c63389d0..4ee45b4ff8 100644 --- a/doc/sphinx/source/esmvalcore/index.rst +++ b/doc/sphinx/source/esmvalcore/index.rst @@ -6,3 +6,4 @@ ESMValTool Core .. include:: datafinder.inc .. include:: recipe.inc .. include:: preprocessor.inc +.. 
include:: utils.inc From d856c8357f44e78dcd0095632f22c51fce9e328e Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Thu, 27 Jun 2019 15:56:57 +0100 Subject: [PATCH 26/49] added utils section --- doc/sphinx/source/esmvalcore/utils.inc | 21 +++++++++++++++++++++ 1 file changed, 21 insertions(+) create mode 100644 doc/sphinx/source/esmvalcore/utils.inc diff --git a/doc/sphinx/source/esmvalcore/utils.inc b/doc/sphinx/source/esmvalcore/utils.inc new file mode 100644 index 0000000000..9589271571 --- /dev/null +++ b/doc/sphinx/source/esmvalcore/utils.inc @@ -0,0 +1,21 @@ +.. _utils: + +********* +Utilities +********* + + +Brief introduction to YAML +========================== + +While .yaml is a relatively common format, maybe users may not have +encountered this language before. The key information about this format is: + +- Yaml is a human friendly markup language. +- Yaml is commonly used for configuration files. +- the syntax is relatively straightforward +- Indentation matters a lot (like python)! 
+- yaml is case sensitive +- A yml tutorial is available here: https://learnxinyminutes.com/docs/yaml/ +- A yml quick reference card is available here: https://yaml.org/refcard.html +- ESMValTool uses the yamllint linter tool: http://www.yamllint.com From aa30598b350d3efb8374f9b53d04d02bf62452db Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Thu, 27 Jun 2019 15:57:13 +0100 Subject: [PATCH 27/49] adding more stuffs --- doc/sphinx/source/esmvalcore/recipe.inc | 18 ++++-------------- 1 file changed, 4 insertions(+), 14 deletions(-) diff --git a/doc/sphinx/source/esmvalcore/recipe.inc b/doc/sphinx/source/esmvalcore/recipe.inc index 2137c15e12..161354b564 100644 --- a/doc/sphinx/source/esmvalcore/recipe.inc +++ b/doc/sphinx/source/esmvalcore/recipe.inc @@ -265,18 +265,8 @@ passing the absolute path to the diagnostic: JoeDiagFunny: script: /home/users/joepesci/esmvaltool_testing/my_first_diagnostic.py +This way a lot of the optional arguments necessary to a diagnostic are at the user's +control via the recipe. -Brief introduction to YAML -========================== - -While .yaml is a relatively common format, maybe users may not have -encountered this language before. The key information about this format is: - -- Yaml is a human friendly markup language. -- Yaml is commonly used for configuration files. -- the syntax is relatively straightforward -- Indentation matters a lot (like python)! 
-- yaml is case sensitive -- A yml tutorial is available here: https://learnxinyminutes.com/docs/yaml/ -- A yml quick reference card is available here: https://yaml.org/refcard.html -- ESMValTool uses the yamllint linter tool: http://www.yamllint.com +Re-using parameters from one ``script`` to another +-------------------------------------------------- From 796f5bacb3fed700cc29237fb17b874a5d54c97b Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Thu, 27 Jun 2019 16:13:03 +0100 Subject: [PATCH 28/49] finishing up recipe chapter --- doc/sphinx/source/esmvalcore/recipe.inc | 25 +++++++++++++++++++++++-- 1 file changed, 23 insertions(+), 2 deletions(-) diff --git a/doc/sphinx/source/esmvalcore/recipe.inc b/doc/sphinx/source/esmvalcore/recipe.inc index 161354b564..77342d83af 100644 --- a/doc/sphinx/source/esmvalcore/recipe.inc +++ b/doc/sphinx/source/esmvalcore/recipe.inc @@ -244,6 +244,9 @@ the ``run_diagnostic`` utility: with run_diagnostic() as config: main(config) +This way a lot of the optional arguments necessary to a diagnostic are at the user's +control via the recipe. + Running your own diagnostic --------------------------- If the user decides to test a e.g. ``my_first_diagnostic.py`` diagnostic they have just written @@ -265,8 +268,26 @@ passing the absolute path to the diagnostic: JoeDiagFunny: script: /home/users/joepesci/esmvaltool_testing/my_first_diagnostic.py -This way a lot of the optional arguments necessary to a diagnostic are at the user's -control via the recipe. +This way the user may test their diagnostic thoroughly before committing to git and including +their new diagnostic in the ESMValTool diagnostics library. Re-using parameters from one ``script`` to another -------------------------------------------------- +Due to ``yaml`` features it is possible to recycle entire diagnostics sections for use with other +diagnostics. Here is an example: + +.. 
code-block:: yaml + + scripts: + cycle: &cycle_settings + script: perfmetrics/main.ncl + plot_type: cycle + time_avg: monthlyclim + grading: &grading_settings + <<: *cycle_settings + plot_type: cycle_latlon + calc_grading: true + normalization: [centered_median, none] + +In this example the hook ``&cycle_settings`` can be used to pass the ``cycle:`` parameters to +``grading:`` via the shortcut ``<<: *cycle_settings``. From 9b5f68466105f2e1be39857e0526522ccbfb426d Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Thu, 27 Jun 2019 16:21:54 +0100 Subject: [PATCH 29/49] reformatted it a bit --- doc/sphinx/source/esmvalcore/utils.inc | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/doc/sphinx/source/esmvalcore/utils.inc b/doc/sphinx/source/esmvalcore/utils.inc index 9589271571..d8e075ceb2 100644 --- a/doc/sphinx/source/esmvalcore/utils.inc +++ b/doc/sphinx/source/esmvalcore/utils.inc @@ -4,18 +4,20 @@ Utilities ********* +This section provides extra information on topics that are not part of ESMValTool +code base but are used by ESMValTool directly or indirectly. Brief introduction to YAML ========================== -While .yaml is a relatively common format, maybe users may not have +While ``.yaml`` or ``.yml`` is a relatively common format, maybe users may not have encountered this language before. The key information about this format is: -- Yaml is a human friendly markup language. -- Yaml is commonly used for configuration files. +- yaml is a human friendly markup language. +- yaml is commonly used for configuration files (gradually replacing the venerable ``.ini``) - the syntax is relatively straightforward -- Indentation matters a lot (like python)! +- indentation matters a lot (like ``Python``)! 
- yaml is case sensitive -- A yml tutorial is available here: https://learnxinyminutes.com/docs/yaml/ -- A yml quick reference card is available here: https://yaml.org/refcard.html -- ESMValTool uses the yamllint linter tool: http://www.yamllint.com +- a yaml tutorial is available `here <https://learnxinyminutes.com/docs/yaml/>`_ +- a yaml quick reference card is available `here <https://yaml.org/refcard.html>`_ +- ESMValTool uses the ``yamllint`` linter `tool <http://www.yamllint.com>`_ From a3d0622658c3b042b367e57c5d665148f075499e Mon Sep 17 00:00:00 2001 From: Mattia Righi Date: Tue, 23 Jul 2019 12:59:48 +0200 Subject: [PATCH 30/49] Rename to rst to avoid merge conflicts --- doc/sphinx/source/esmvalcore/{config.inc => config.rst} | 0 doc/sphinx/source/esmvalcore/{datafinder.inc => datafinder.rst} | 0 .../source/esmvalcore/{preprocessor.inc => preprocessor.rst} | 0 doc/sphinx/source/esmvalcore/{recipe.inc => recipe.rst} | 0 doc/sphinx/source/esmvalcore/{utils.inc => utils.rst} | 0 5 files changed, 0 insertions(+), 0 deletions(-) rename doc/sphinx/source/esmvalcore/{config.inc => config.rst} (100%) rename doc/sphinx/source/esmvalcore/{datafinder.inc => datafinder.rst} (100%) rename doc/sphinx/source/esmvalcore/{preprocessor.inc => preprocessor.rst} (100%) rename doc/sphinx/source/esmvalcore/{recipe.inc => recipe.rst} (100%) rename doc/sphinx/source/esmvalcore/{utils.inc => utils.rst} (100%) diff --git a/doc/sphinx/source/esmvalcore/config.inc b/doc/sphinx/source/esmvalcore/config.rst similarity index 100% rename from doc/sphinx/source/esmvalcore/config.inc rename to doc/sphinx/source/esmvalcore/config.rst diff --git a/doc/sphinx/source/esmvalcore/datafinder.inc b/doc/sphinx/source/esmvalcore/datafinder.rst similarity index 100% rename from doc/sphinx/source/esmvalcore/datafinder.inc rename to doc/sphinx/source/esmvalcore/datafinder.rst diff --git a/doc/sphinx/source/esmvalcore/preprocessor.inc b/doc/sphinx/source/esmvalcore/preprocessor.rst similarity index 100% rename from doc/sphinx/source/esmvalcore/preprocessor.inc rename to 
doc/sphinx/source/esmvalcore/preprocessor.rst diff --git a/doc/sphinx/source/esmvalcore/recipe.inc b/doc/sphinx/source/esmvalcore/recipe.rst similarity index 100% rename from doc/sphinx/source/esmvalcore/recipe.inc rename to doc/sphinx/source/esmvalcore/recipe.rst diff --git a/doc/sphinx/source/esmvalcore/utils.inc b/doc/sphinx/source/esmvalcore/utils.rst similarity index 100% rename from doc/sphinx/source/esmvalcore/utils.inc rename to doc/sphinx/source/esmvalcore/utils.rst From aedbc491d297ed880da13159729fd3acf60d02fb Mon Sep 17 00:00:00 2001 From: Mattia Righi Date: Tue, 23 Jul 2019 13:08:07 +0200 Subject: [PATCH 31/49] Update index --- doc/esmvalcore/index.rst | 1 + 1 file changed, 1 insertion(+) diff --git a/doc/esmvalcore/index.rst b/doc/esmvalcore/index.rst index eedbac7983..826bac1c94 100644 --- a/doc/esmvalcore/index.rst +++ b/doc/esmvalcore/index.rst @@ -11,3 +11,4 @@ ESMValTool Core Recipe Preprocessor Fixing data + Utilities From 63abb33240c04669a0d20d3aa28adaeac411e3f4 Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Mon, 29 Jul 2019 12:41:03 +0100 Subject: [PATCH 32/49] readded table of contents --- doc/esmvalcore/preprocessor.rst | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/doc/esmvalcore/preprocessor.rst b/doc/esmvalcore/preprocessor.rst index e0c0fc7427..83aee130d1 100644 --- a/doc/esmvalcore/preprocessor.rst +++ b/doc/esmvalcore/preprocessor.rst @@ -4,6 +4,21 @@ Preprocessor ************ +In this section, each of the preprocessor modules is described in detail +following the default order in which they are applied: + +* `Variable derivation`_. +* `CMOR check and dataset-specific fixes`_. +* `Vertical interpolation`_. +* `Land/Sea/Ice masking`_. +* `Horizontal regridding`_. +* `Masking of missing values`_. +* `Multi-model statistics`_. +* `Time operations`_. +* `Area operations`_. +* `Volume operations`_. +* `Unit conversion`_. 
+ Overview ======== From 46361d5403b309512e37cea2ad51b2efa2d18f25 Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Mon, 29 Jul 2019 13:05:09 +0100 Subject: [PATCH 33/49] plugged in Mattia comments --- doc/esmvalcore/config.rst | 31 +++++++++++++++++-------------- 1 file changed, 17 insertions(+), 14 deletions(-) diff --git a/doc/esmvalcore/config.rst b/doc/esmvalcore/config.rst index 563095a2ef..4fed748d2c 100644 --- a/doc/esmvalcore/config.rst +++ b/doc/esmvalcore/config.rst @@ -15,14 +15,13 @@ There are several configuration files in ESMValTool: formatting; * ``config-references.yml``: stores information on diagnostic authors and scientific journals references; -* ``config-logging.yml``: stores information on logging (duh!). +* ``config-logging.yml``: stores information on logging. User configuration file ======================= The ``config-user.yml`` is one of the two files the user needs to provide to the -``esmvaltool`` executable at run time, the second being ``recipe.yml`` (named accordingly -as per the specific diagnostic/recipe used). +``esmvaltool`` executable at run time, the second being the recipe_. The ``config-user.yml`` configuration file contains all the global level information needed by ESMValTool. ``config-user.yml`` can be reused as many times the @@ -99,7 +98,7 @@ in a commented line above each option: drs: CMIP5: default -Most of these settings are fairly self-explanatory, ie: +Most of these settings are fairly self-explanatory, e.g.: .. code-block:: yaml @@ -108,9 +107,9 @@ Most of these settings are fairly self-explanatory, ie: # Diagnositcs write NetCDF files? [true]/false write_netcdf: true -The ``write_plots`` setting is used to inform ESMValTool about your preference -for saving figures. Similarly, the ``write_netcdf`` setting is a boolean which -turns on or off the writing of netCDF files. +The ``write_plots`` setting is used to inform ESMValTool diagnostics about your preference +for creating figures. 
Similarly, the ``write_netcdf`` setting is a boolean which +turns on or off the writing of netCDF files by the diagnostic scripts. .. code-block:: yaml @@ -118,14 +117,18 @@ turns on or off the writing of netCDF files. auxiliary_data_dir: ~/auxiliary_data The ``auxiliary_data_dir`` setting is the path to place any required -additional auxiliary data files. This method was necessary because certain +additional auxiliary data files. This is necessary because certain Python toolkits such as cartopy will attempt to download data files at run time, typically geographic data files such as coastlines or land surface maps. This can fail if the machine does not have access to the wider internet. This location allows us to tell cartopy (and other similar tools) where to find the -files if they can not be downloaded at runtime. To reiterate, this setting is -not for model or observational datasets, rather it is for data files used in -plotting such as coastline descriptions and so on. +files if they can not be downloaded at runtime. + +.. warning:: + + This setting is not for model or observational datasets, + rather it is for data files used in + plotting such as coastline descriptions and so on. .. note:: @@ -152,11 +155,11 @@ This configuration file describes the file system structure for several key projects (CMIP5, CMIP6) on several key machines (BADC, CP4CDS, DKRZ, ETHZ, SMHI, BSC) - CMIP data is stored as part of the Earth System Grid Federation (ESGF) and the standards for file naming and paths to files are set out by CMOR and DRS. -For a detailed description of these standards and how the impact on the use in ESMValTool +For a detailed description of these standards and their adoption in ESMValTool, we refer the user to CMOR-DRS_ section where we relate these standards to the data retrieval mechanism built-in ESMValTool. 
-The data directory structure of the CMIP5/6 projects is set up differently +The data directory structure of the CMIP projects is set up differently at each site. The following code snippet is an example of several paths descriptions for the CMIP5 at various sites: @@ -179,7 +182,7 @@ As an example, the CMIP5 file path on BADC would be: [institute]/[dataset ]/[exp]/[frequency]/[modeling_realm]/[mip]/[ensemble]/latest/[short_name] When loading these files, ESMValTool replaces the placeholders ``[item]`` with actual -values supplied for by the information form ``config-user.yml`` and ``recipe.yml``. +values supplied for by the user in ``config-user.yml`` and ``recipe.yml``. The resulting real path would look something like this: .. code-block:: bash From 327fc29ad47c0f84f32cbec1bfa19156575ae4d7 Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Mon, 29 Jul 2019 13:17:15 +0100 Subject: [PATCH 34/49] plugging in Mattia comments --- doc/esmvalcore/datafinder.rst | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/doc/esmvalcore/datafinder.rst b/doc/esmvalcore/datafinder.rst index 18761713d0..46d5b30ead 100644 --- a/doc/esmvalcore/datafinder.rst +++ b/doc/esmvalcore/datafinder.rst @@ -23,15 +23,15 @@ CMIP data is widely available via the Earth System Grid Federation (`ESGF `_ -ensures that files and paths to them are named according to a standardized convention. An example of this -convention, and also used by ESMValTool for file discovery and data retrieval can be seen here: +ensures that files and paths to them are named according to a standardized convention. 
Examples of this +convention, also used by ESMValTool for file discovery and data retrieval, include: * CMIP6 file: ``[variable_short_name]_[mip]_[dataset_name]_[experiment]_[ensemble]_[grid]_[start-date]-[end-date].nc`` * CMIP5 file: ``[variable_short_name]_[mip]_[dataset_name]_[experiment]_[ensemble]_[start-date]-[end-date].nc`` * OBS file: ``[project]_[dataset_name]_[type]_[version]_[mip]_[short_name]_[start-date]-[end-date].nc`` and similar standards exist for the standard paths (input directories); for the ESGF data nodes, -these paths differ slightly, an example is given below: +these paths differ slightly, for example: * CMIP6 path for BADC: ``ROOT-BADC/[institute]/[dataset_name]/[experiment]/[ensemble]/[mip]/ [variable_short_name]/[grid]``; @@ -45,14 +45,14 @@ Data retrieval Data retrieval in ESMValTool has two main aspects from the user's point of view: * data can be found by the tool, subject to availability on disk; -* it is the user's responsibility to set the corect data retrieval parameters; +* it is the user's responsibility to set the correct data retrieval parameters; The first point is self-explanatory: if the user runs the tool on a machine that has access to a data repository or multiple data repositories, then ESMValTool will look for and find the avaialble data requested by the user. -The second point underlines the fact that the user has full control over what type and the amount of data they -need for their analyses. Setting the data retrieval parameters is explained below: +The second point underlines the fact that the user has full control over what type and the amount of data is +needed for the analyses. 
Setting the data retrieval parameters is explained below: Setting the correct root paths ------------------------------ @@ -75,7 +75,7 @@ parameter the user tells the tool what type of root paths they need the data fro drs: CMIP6: BADC -will tell the tool that the user needs data from a repository structured according to the BADC DRS structure `ie` +will tell the tool that the user needs data from a repository structured according to the BADC DRS structure ie ``ROOT/[institute]/[dataset_name]/[experiment]/[ensemble]/[mip]/[variable_short_name]/[grid]``; @@ -114,7 +114,7 @@ Explaining ``config-user/rootpath:`` CMIP5: /badc/cmip5/data/cmip5/output1 - Example for a CMIP6 root path pointing to the ESGF node on CEDA-Jasmin (formerly known as BADC): + Example for a CMIP6 root path pointing to the ESGF node on CEDA-Jasmin: .. code-block:: yaml From bb8116b83386c50a77692bdbd8a1fb8d378332e2 Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Mon, 29 Jul 2019 13:19:40 +0100 Subject: [PATCH 35/49] removed repetition --- doc/esmvalcore/datafinder.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/esmvalcore/datafinder.rst b/doc/esmvalcore/datafinder.rst index 46d5b30ead..4fe310a6e7 100644 --- a/doc/esmvalcore/datafinder.rst +++ b/doc/esmvalcore/datafinder.rst @@ -120,7 +120,7 @@ Explaining ``config-user/rootpath:`` CMIP6: /badc/cmip6/data/CMIP6/CMIP - Example for a mix of CMIP6 root path pointing to the ESGF node on CEDA-Jasmin (formerly known as BADC) + Example for a mix of CMIP6 root path pointing to the ESGF node on CEDA-Jasmin and a user-specific data repository for extra data: .. 
code-block:: yaml From e38db31c8f1ad02e7ead9fb48e0a200ae919b427 Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Mon, 29 Jul 2019 13:50:44 +0100 Subject: [PATCH 36/49] plugging in Mattia comments --- doc/esmvalcore/recipe.rst | 37 ++++++++++++++++++------------------- 1 file changed, 18 insertions(+), 19 deletions(-) diff --git a/doc/esmvalcore/recipe.rst b/doc/esmvalcore/recipe.rst index 77342d83af..6d9bd7c893 100644 --- a/doc/esmvalcore/recipe.rst +++ b/doc/esmvalcore/recipe.rst @@ -7,15 +7,15 @@ Recipe Overview ======== -After ``config-user.yml``, the ``recipe.yml`` file the second the user inputted file -to ESMValTool at each run time point. Rceipes contain the data and data analysis -information and instructions needed to run the diagnostic(s), as well as specific -diagnostic-related instructions. +After ``config-user.yml``, the ``recipe.yml`` is the second file the user needs +to pass to ``esmvaltool`` as command line option, at each run time point. +Recipes contain the data and data analysis information and instructions needed +to run the diagnostic(s), as well as specific diagnostic-related instructions. -Broadly, recipes contain information on the user who wrote the -recipe file, the datasets which need to be run, the preprocessors that need to be +Broadly, recipes contain a general section summarizing the provenance and functionality of the +diagnostics, the datasets which need to be run, the preprocessors that need to be applied, and the diagnostics which need to be run over the preprocessed data. -This information is provided to ESMValTool in for main recipe sections: +This information is provided to ESMValTool in four main recipe sections: Documentation_, Datasets_, Preprocessors_ and Diagnostics_, respectively. 
@@ -28,7 +28,6 @@ The documentation section includes: - The recipe's author's user name (``authors``, as they appaer in ``config-references.yml`` config-ref_) - A description of the recipe (``description``, written in MarkDown format) -- The user name of the maintainer (``maintainer``, as they appaer in ``config-references.yml`` config-ref_) - A list of scientific references (``references`` , as they appaer in ``config-references.yml`` config-ref_) - the project or projects associated with the recipe (``projects``, as they appaer in ``config-references.yml`` config-ref_) @@ -62,8 +61,7 @@ For example, please see the documentation section from the recipe: Note that the authors, projects, and references will need to be included in the ``config-references.yml`` file. The author name uses the format: - ``surname_name``. For instance, Joe Pesci would be: ``authors: - pesci_joe``. - Also note that Joe Pesci does not appreciate you calling him ``funny``. + ``surname_name``. For instance, John Doe would be: ``authors: - doe_john``. For a first-time user that does not yet have their name added to ``config-references.yml`` a run of an already-made recipe or running with no author name is possible. @@ -103,13 +101,13 @@ Recipe section: ``preprocessors`` ================================= The preprocessor section of the recipe includes one or more preprocesors, each -of which may execute one or several preprocessor steps. +of which may call the execution of one or several preprocessor functions. 
Each preprocessor section includes: -- A preprocessor name (any name, one yaml tab (2 spaces) under ``preprocessors``); -- A list of preprocesor steps to be executed (choose from the API, each one yaml tab (2 spaces) under preprocessor name) -- Any or none arguments given to the preprocessor steps (one yaml tab (2 spaces) under preprocessor step name) +- A preprocessor name (any name, under ``preprocessors``); +- A list of preprocesor steps to be executed (choose from the API); +- Any or none arguments given to the preprocessor steps; - The order that the preprocesor steps are applied can also be specified using the ``custom_order`` preprocesor function. The following snippet is an example of a preprocessor named ``prep_map`` that contains @@ -152,9 +150,10 @@ have: - The diagnostics can also include an optional ``additional_datasets`` section. The ``additional_datasets`` can add datasets beyond those listed in the the -Datasets_ section. This is useful if specific datasets need to be linked with -a specific diagnostic. The ``additional_datasets`` can be used to add variable -specific datasets. This is also a good way to add observational datasets. +Datasets_ section. This is useful if specific datasets need to be used only by +a specific diagnostic. The ``additional_datasets`` can also be used to add variable +specific datasets. This is also a good way to add observational +datasets, which are usually variable specific. Running a simple diagnostic --------------------------- @@ -216,7 +215,7 @@ is run in Python, NCL etc). 
Here is an example of such groups of arguments: obs_models: [ERA-Interim] # list to hold models that are NOT for metrics but for obs operations additional_metrics: [ERA-Interim, inmcm4] # list to hold additional datasets for metrics -In this example, apart from the pointer to the diagnostic script ``script: autoassess/autoassess_area_base.py``, +In this example, apart from specifying the diagnostic script ``script: autoassess/autoassess_area_base.py``, we pass a suite of parameters to be used by the script (``area``, ``control_model`` etc). These parameters are stored in key-value pairs in the diagnostic configuration file, an interface file that can be used by importing the ``run_diagnostic`` utility: @@ -258,7 +257,7 @@ passing the absolute path to the diagnostic: diagnostics: myFirstDiag: - description: Joe Pesci wrote a funny diagnostic + description: John Doe wrote a funny diagnostic variables: tos: # Temperature at the ocean surface preprocessor: prep_map From c074cfc5399ed201a06eb9b1cec4a4a425b83f10 Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Mon, 29 Jul 2019 14:09:09 +0100 Subject: [PATCH 37/49] fixed bad refernces --- doc/esmvalcore/config.rst | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/doc/esmvalcore/config.rst b/doc/esmvalcore/config.rst index 4fed748d2c..b078bc229f 100644 --- a/doc/esmvalcore/config.rst +++ b/doc/esmvalcore/config.rst @@ -21,7 +21,7 @@ User configuration file ======================= The ``config-user.yml`` is one of the two files the user needs to provide to the -``esmvaltool`` executable at run time, the second being the recipe_. +``esmvaltool`` executable at run time, the second being the :ref:`recipe`. The ``config-user.yml`` configuration file contains all the global level information needed by ESMValTool. ``config-user.yml`` can be reused as many times the @@ -142,7 +142,7 @@ files if they can not be downloaded at runtime. 
**Note on data finding sections of the config-user file.** A detailed explanation of the data finding-related sections of the ``config-user.yml`` - (``rootpath`` and ``drs``) is presented in config-user-rootpath_ and config-user-drs_ + (``rootpath`` and ``drs``) is presented in :ref:`config-user-rootpath` and :ref:`config-user-drs` in the Data Finder section; these sections relate directly to the data finding capabilities of ESMValTool and are very important to be understood by the user. @@ -156,7 +156,7 @@ key projects (CMIP5, CMIP6) on several key machines (BADC, CP4CDS, DKRZ, ETHZ, SMHI, BSC) - CMIP data is stored as part of the Earth System Grid Federation (ESGF) and the standards for file naming and paths to files are set out by CMOR and DRS. For a detailed description of these standards and their adoption in ESMValTool, -we refer the user to CMOR-DRS_ section where we relate these standards to the data retrieval +we refer the user to :ref:`CMOR-DRS` section where we relate these standards to the data retrieval mechanism built-in ESMValTool. The data directory structure of the CMIP projects is set up differently @@ -190,7 +190,7 @@ The resulting real path would look something like this: MOHC/HadGEM2-CC/rcp85/mon/ocean/Omon/r1i1p1/latest/tos Again, for a more in-depth description this process, as part of the data retrieval mechanism, -please see CMOR-DRS_. +please see :ref:`CMOR-DRS`. .. 
_config-ref: From 3f1650e4936210cf9dec563dcd9d544e9c1fd18f Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Mon, 29 Jul 2019 14:32:18 +0100 Subject: [PATCH 38/49] fixed references --- doc/esmvalcore/preprocessor.rst | 59 ++++++++++++++++++++++----------- 1 file changed, 40 insertions(+), 19 deletions(-) diff --git a/doc/esmvalcore/preprocessor.rst b/doc/esmvalcore/preprocessor.rst index 83aee130d1..a6bc8b7926 100644 --- a/doc/esmvalcore/preprocessor.rst +++ b/doc/esmvalcore/preprocessor.rst @@ -7,17 +7,17 @@ Preprocessor In this section, each of the preprocessor modules is described in detail following the default order in which they are applied: -* `Variable derivation`_. -* `CMOR check and dataset-specific fixes`_. -* `Vertical interpolation`_. -* `Land/Sea/Ice masking`_. -* `Horizontal regridding`_. -* `Masking of missing values`_. -* `Multi-model statistics`_. -* `Time operations`_. -* `Area operations`_. -* `Volume operations`_. -* `Unit conversion`_. +* :ref:`Variable derivation` +* :ref:`CMOR check and dataset-specific fixes` +* :ref:`Vertical interpolation` +* :ref:`Land/Sea/Ice masking` +* :ref:`Horizontal regridding` +* :ref:`Masking of missing values` +* :ref:`Multi-model statistics` +* :ref:`Time operations` +* :ref:`Area operations` +* :ref:`Volume operations` +* :ref:`Unit conversion` Overview ======== @@ -73,6 +73,8 @@ Features of the ESMValTool Climate data pre-processor are: * Multimodel statistics * and many more +.. _Variable derivation: + Variable derivation =================== The variable derivation module allows to derive variables which are not in the @@ -110,6 +112,17 @@ The required arguments for this module are two boolean switches: See also :func:`esmvalcore.preprocessor.derive`. +.. _CMOR check and dataset-specific fixes: + +CMORization and dataset-specific fixes +====================================== +.. warning:: + + Section to be added by Javier ``CMORMAN`` Vegas-Regidor + + +.. 
_time operations: + Time manipulation ================= The ``_time.py`` module contains the following preprocessor functions: @@ -218,6 +231,9 @@ unless a custom ``frequency`` is set manually by the user in recipe. See also :func:`esmvalcore.preprocessor.regrid_time`. + +.. _area operations: + Area manipulation ================= The ``_area.py`` module contains the following preprocessor functions: @@ -287,6 +303,8 @@ matches the named regions against the requested string. See also :func:`esmvalcore.preprocessor.extract_named_regions`. +.. _volume operations: + Volume manipulation =================== The ``_volume.py`` module contains the following preprocessor functions: @@ -373,13 +391,8 @@ but it may be necceasiry for irregular grids. See also :func:`esmvalcore.preprocessor.extract_trajectory`. -.. _cmor-checks-fixes: - -CMORization and dataset-specific fixes -====================================== -.. warning:: - Section to be added by Javier ``CMORMAN`` Vegas-Regidor +.. _Vertical interpolation: Vertical interpolation ====================== @@ -466,6 +479,9 @@ are used: although these are not model-specific, they represent a good approximation since they have a much higher resolution than most of the models and they are regularly updated with changing geographical features. + +.. _land/sea/ice masking: + Land-sea masking ---------------- @@ -548,6 +564,9 @@ the ``config`` diagnostic variable items e.g.: sftlf_file = attributes['fx_files']['sftlf'] areacello_file = attributes['fx_files']['areacello'] + +.. _masking of missing values: + Missing values masks -------------------- @@ -612,7 +631,7 @@ masking or the pair ``minimum`, ``maximum`` for interval masking. See also :func:`esmvalcore.preprocessor.mask_above_threshold` and related functions. -.. _regrid: +.. 
_Horizontal regridding: Horizontal regridding ===================== @@ -722,7 +741,7 @@ See also :func:`esmvalcore.preprocessor.regrid` of up to ``0.5x0.5`` degrees should not produce any memory-related issues, but be advised that for resolutions of ``< 0.5`` degrees the regridding becomes very slow and will use a lot of memory. -.. _multi_model_statistics: +.. _multi-model statistics: Multi-model statistics ====================== @@ -803,6 +822,8 @@ this memory intake is high but also assumes that all data is fully realized in m will gradually change and the amount of realized data will decrease with the increase of ``dask`` use. +.. _unit conversion: + Unit conversion =============== From 9b7333488364de22a7255e3ba3c82d2cde1406b6 Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Mon, 29 Jul 2019 14:32:28 +0100 Subject: [PATCH 39/49] fixed references --- doc/esmvalcore/recipe.rst | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/doc/esmvalcore/recipe.rst b/doc/esmvalcore/recipe.rst index 6d9bd7c893..163faf45b5 100644 --- a/doc/esmvalcore/recipe.rst +++ b/doc/esmvalcore/recipe.rst @@ -26,10 +26,10 @@ Recipe section: ``documentation`` The documentation section includes: -- The recipe's author's user name (``authors``, as they appaer in ``config-references.yml`` config-ref_) +- The recipe's author's user name (``authors``, as they appaer in ``config-references.yml`` :ref:`config-ref`) - A description of the recipe (``description``, written in MarkDown format) -- A list of scientific references (``references`` , as they appaer in ``config-references.yml`` config-ref_) -- the project or projects associated with the recipe (``projects``, as they appaer in ``config-references.yml`` config-ref_) +- A list of scientific references (``references`` , as they appaer in ``config-references.yml`` :ref:`config-ref`) +- the project or projects associated with the recipe (``projects``, as they appaer in ``config-references.yml`` :ref:`config-ref`) For 
example, please see the documentation section from the recipe: ``recipes/recipe_ocean_amoc.yml``: @@ -111,8 +111,8 @@ Each preprocessor section includes: - The order that the preprocesor steps are applied can also be specified using the ``custom_order`` preprocesor function. The following snippet is an example of a preprocessor named ``prep_map`` that contains -multiple preprocessing steps (regrid_ with two arguments, time_average_ with no arguments -and multi_model_statistics_ with two arguments): +multiple preprocessing steps (:ref:`Horizontal regridding` with two arguments, :ref:`Time operations` with no arguments +and :ref:`Multi-model statistics` with two arguments): .. code-block:: yaml @@ -132,7 +132,7 @@ and multi_model_statistics_ with two arguments): In this case no ``preprocessors`` section is needed; the workflow will apply a ``default`` preprocessor consisting of only - basic operations like: loading data, applying CMOR checks and fixes (cmor-checks-fixes_) + basic operations like: loading data, applying CMOR checks and fixes (:ref:`CMOR check and dataset-specific fixes`) and saving the data to disk (if needed). .. _Diagnostics: From e3d908a29e1029674c729b97983a81ae5c8516d9 Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Mon, 29 Jul 2019 14:45:14 +0100 Subject: [PATCH 40/49] fixed references to external --- doc/esmvalcore/utils.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/doc/esmvalcore/utils.rst b/doc/esmvalcore/utils.rst index d8e075ceb2..ac7e33f2c5 100644 --- a/doc/esmvalcore/utils.rst +++ b/doc/esmvalcore/utils.rst @@ -18,6 +18,6 @@ encountered this language before. The key information about this format is: - the syntax is relatively straightforward - indentation matters a lot (like ``Python``)! 
- yaml is case sensitive -- a yaml tutorial is available `here `_ -- a yaml quick reference card is available `here `_ -- ESMValTool uses the ``yamllint`` linter `tool `_ +- have a look at this `yaml tutorial `_ +- have a look at this `yaml quick reference card `_ +- ESMValTool uses the `yamllint `_ linter tool. From c46872d736fd02610964b39f3832f8f3dfd72206 Mon Sep 17 00:00:00 2001 From: Valeriu Predoi Date: Mon, 29 Jul 2019 15:05:19 +0100 Subject: [PATCH 41/49] removed reference to private module --- doc/esmvalcore/datafinder.rst | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/doc/esmvalcore/datafinder.rst b/doc/esmvalcore/datafinder.rst index 4fe310a6e7..525931cf02 100644 --- a/doc/esmvalcore/datafinder.rst +++ b/doc/esmvalcore/datafinder.rst @@ -7,8 +7,8 @@ Data finder Overview ======== Data discovery and retrieval is the first step in any evaluation process; ESMValTool -uses a `semi-automated` data finding mechanism performed by the ``_data_finder.py`` module -with inputs from both the user configuration file and the recipe file. The reason why the data +uses a `semi-automated` data finding mechanism with inputs from both the user configuration file +and the recipe file. The reason why the data finder module is `semi`-automated is that the user will have to provide the tool with a set of parameters related to the data needed; the reason why it is semi-`automated` is that once these parameters have been provided, the tool will automatically find the right data. 
We will From f30eb2d32e640d28ca9a4aaa76c43d5dbc2e6e3b Mon Sep 17 00:00:00 2001 From: Mattia Righi Date: Tue, 30 Jul 2019 15:25:20 +0200 Subject: [PATCH 42/49] Minor corrections and column limit to 79 --- doc/esmvalcore/config.rst | 103 ++++++++++--------- doc/esmvalcore/datafinder.rst | 188 +++++++++++++++++++--------------- doc/esmvalcore/recipe.rst | 156 ++++++++++++++-------------- 3 files changed, 236 insertions(+), 211 deletions(-) diff --git a/doc/esmvalcore/config.rst b/doc/esmvalcore/config.rst index b078bc229f..859a3c5dc4 100644 --- a/doc/esmvalcore/config.rst +++ b/doc/esmvalcore/config.rst @@ -10,25 +10,27 @@ Overview There are several configuration files in ESMValTool: * ``config-user.yml``: sets a number of user-specific options like desired - graphical output format, root paths to data etc. -* ``config-developer.yml``: sets a number of standardized file-naming and paths to data - formatting; -* ``config-references.yml``: stores information on diagnostic authors and scientific - journals references; + graphical output format, root paths to data, etc.; +* ``config-developer.yml``: sets a number of standardized file-naming and paths + to data formatting; +* ``config-references.yml``: stores information on diagnostic authors and + scientific journals references; * ``config-logging.yml``: stores information on logging. User configuration file ======================= -The ``config-user.yml`` is one of the two files the user needs to provide to the -``esmvaltool`` executable at run time, the second being the :ref:`recipe`. +The ``config-user.yml`` is one of the two files the user needs to provide as +input arguments to the ``esmvaltool`` executable at run time, the second being +the :ref:`recipe`. The ``config-user.yml`` configuration file contains all the global level -information needed by ESMValTool. ``config-user.yml`` can be reused as many times the -user needs to before changing any of the options stored in it. 
This file is essentially -the gateway between the user and the machine-specific instructions to ``esmvaltool``. -The following shows the default settings from the ``config-user.yml`` file with explanations -in a commented line above each option: +information needed by ESMValTool. It can be reused as many times the user needs +to before changing any of the options stored in it. This file is essentially +the gateway between the user and the machine-specific instructions to +``esmvaltool``. The following shows the default settings from the +``config-user.yml`` file with explanations in a commented line above each +option: .. code-block:: yaml @@ -107,9 +109,10 @@ Most of these settings are fairly self-explanatory, e.g.: # Diagnositcs write NetCDF files? [true]/false write_netcdf: true -The ``write_plots`` setting is used to inform ESMValTool diagnostics about your preference -for creating figures. Similarly, the ``write_netcdf`` setting is a boolean which -turns on or off the writing of netCDF files by the diagnostic scripts. +The ``write_plots`` setting is used to inform ESMValTool diagnostics about your +preference for creating figures. Similarly, the ``write_netcdf`` setting is a +boolean which turns on or off the writing of netCDF files by the diagnostic +scripts. .. code-block:: yaml @@ -118,33 +121,29 @@ turns on or off the writing of netCDF files by the diagnostic scripts. The ``auxiliary_data_dir`` setting is the path to place any required additional auxiliary data files. This is necessary because certain -Python toolkits such as cartopy will attempt to download data files at run +Python toolkits, such as cartopy, will attempt to download data files at run time, typically geographic data files such as coastlines or land surface maps. This can fail if the machine does not have access to the wider internet. This -location allows us to tell cartopy (and other similar tools) where to find the -files if they can not be downloaded at runtime. 
+location allows the user to specify where to find such files if they can not be +downloaded at runtime. .. warning:: - This setting is not for model or observational datasets, - rather it is for data files used in - plotting such as coastline descriptions and so on. + This setting is not for model or observational datasets, rather it is for + data files used in plotting such as coastline descriptions and so on. -.. note:: - - **Pro Tip: working with multiple config-user files.** - - You choose your config.yml file at run time, so you could have several - available with different purposes. One for formalised run, one for debugging, etc. +A detailed explanation of the data finding-related sections of the +``config-user.yml`` (``rootpath`` and ``drs``) is presented in the +:ref:`data-retrieval` section. This section relates directly to the data +finding capabilities of ESMValTool and are very important to be understood by +the user. .. note:: - **Note on data finding sections of the config-user file.** + You choose your config.yml file at run time, so you could have several of + them available with different purposes. One for formalised run, one for + debugging, etc. - A detailed explanation of the data finding-related sections of the ``config-user.yml`` - (``rootpath`` and ``drs``) is presented in :ref:`config-user-rootpath` and :ref:`config-user-drs` - in the Data Finder section; these sections relate directly to the data finding capabilities - of ESMValTool and are very important to be understood by the user. .. _config-developer: @@ -152,16 +151,16 @@ Developer configuration file ============================ This configuration file describes the file system structure for several -key projects (CMIP5, CMIP6) on several key machines (BADC, CP4CDS, DKRZ, ETHZ, -SMHI, BSC) - CMIP data is stored as part of the Earth System Grid Federation (ESGF) -and the standards for file naming and paths to files are set out by CMOR and DRS. 
-For a detailed description of these standards and their adoption in ESMValTool, -we refer the user to :ref:`CMOR-DRS` section where we relate these standards to the data retrieval -mechanism built-in ESMValTool. +key projects (CMIP5, CMIP6, OBS) on several key machines (BADC, CP4CDS, DKRZ, +ETHZ, SMHI, BSC). CMIP data is stored as part of the Earth System Grid +Federation (ESGF) and the standards for file naming and paths to files are set +out by CMOR and DRS. For a detailed description of these standards and their +adoption in ESMValTool, we refer the user to :ref:`CMOR-DRS` section where we +relate these standards to the data retrieval mechanism of the ESMValTool. The data directory structure of the CMIP projects is set up differently at each site. The following code snippet is an example of several paths -descriptions for the CMIP5 at various sites: +descriptions for the CMIP5 adopted at various sites: .. code-block:: yaml @@ -181,28 +180,29 @@ As an example, the CMIP5 file path on BADC would be: [institute]/[dataset ]/[exp]/[frequency]/[modeling_realm]/[mip]/[ensemble]/latest/[short_name] -When loading these files, ESMValTool replaces the placeholders ``[item]`` with actual -values supplied for by the user in ``config-user.yml`` and ``recipe.yml``. -The resulting real path would look something like this: +When loading these files, ESMValTool replaces the placeholders ``[item]`` with +actual values supplied for by the user in ``config-user.yml`` and +``recipe.yml``. The resulting real path would look something like this: -.. code-block:: bash +.. code-block:: MOHC/HadGEM2-CC/rcp85/mon/ocean/Omon/r1i1p1/latest/tos -Again, for a more in-depth description this process, as part of the data retrieval mechanism, -please see :ref:`CMOR-DRS`. +Again, for a more in-depth description this process, as part of the data +retrieval mechanism, please see :ref:`CMOR-DRS`. .. 
_config-ref: References configuration file ============================= -The ``config-references.yml`` file is the full list of ESMValTool authors, -references and projects. Each author, project and reference in the documentation -section of a recipe needs to be in this file in the relevant section. +The ``config-references.yml`` file contains the list of ESMValTool authors, +references and projects. Each author, project and reference referred to in the +documentation section of a recipe needs to be in this file in the relevant +section. -For instance, the recipe ``recipe_ocean_example.yml`` file contains the following -documentation section: +For instance, the recipe ``recipe_ocean_example.yml`` file contains the +following documentation section: .. code-block:: yaml @@ -220,9 +220,10 @@ documentation section: - ukesm -All four items here are named people, references and projects listed in the +These four items here are named people, references and projects listed in the ``config-references.yml`` file. + Logging configuration file ========================== diff --git a/doc/esmvalcore/datafinder.rst b/doc/esmvalcore/datafinder.rst index 525931cf02..9a49a88c30 100644 --- a/doc/esmvalcore/datafinder.rst +++ b/doc/esmvalcore/datafinder.rst @@ -1,88 +1,104 @@ -:: _inputdata: +.. _findingdata: -*********** -Data finder -*********** +************ +Finding data +************ Overview ======== -Data discovery and retrieval is the first step in any evaluation process; ESMValTool -uses a `semi-automated` data finding mechanism with inputs from both the user configuration file -and the recipe file. The reason why the data -finder module is `semi`-automated is that the user will have to provide the tool with a set -of parameters related to the data needed; the reason why it is semi-`automated` is that once -these parameters have been provided, the tool will automatically find the right data. 
We will -detail below the data finding and retrieval process and the inputs the user needs to specify, -giving examples on how to use the data finding routine under different scenarios. +Data discovery and retrieval is the first step in any evaluation process; +ESMValTool uses a `semi-automated` data finding mechanism with inputs from both +the user configuration file and the recipe file. The reason why the data finder +module is `semi`-automated is that the user will have to provide the tool with +a set of parameters related to the data needed; the reason why it is +semi-`automated` is that once these parameters have been provided, the tool +will automatically find the right data. We will detail below the data finding +and retrieval process and the inputs the user needs to specify, giving examples +on how to use the data finding routine under different scenarios. .. _CMOR-DRS: -CMIP data: CMOR Data Reference Syntax (DRS) and the ESGF -======================================================== -CMIP data is widely available via the Earth System Grid Federation (`ESGF `_) -and is accessible to users either via dowload from the ESGF portal or through the ESGF data nodes hosted -by large computing facilities (like CEDA-Jasmin, DKRZ etc). This data adheres to, among other standards, -the DRS and Controlled Vocabulary standard for naming files and structured paths; the `DRS `_ -ensures that files and paths to them are named according to a standardized convention. Examples of this -convention, also used by ESMValTool for file discovery and data retrieval, include: +CMIP data - CMOR Data Reference Syntax (DRS) and the ESGF +========================================================= +CMIP data is widely available via the Earth System Grid Federation +(`ESGF `_) and is accessible to users either +via dowload from the ESGF portal or through the ESGF data nodes hosted +by large computing facilities (like CEDA-Jasmin, DKRZ, etc). 
This data +adheres to, among other standards, the DRS and Controlled Vocabulary +standard for naming files and structured paths; the `DRS +`_ +ensures that files and paths to them are named according to a +standardized convention. Examples of this convention, also used by +ESMValTool for file discovery and data retrieval, include: * CMIP6 file: ``[variable_short_name]_[mip]_[dataset_name]_[experiment]_[ensemble]_[grid]_[start-date]-[end-date].nc`` * CMIP5 file: ``[variable_short_name]_[mip]_[dataset_name]_[experiment]_[ensemble]_[start-date]-[end-date].nc`` * OBS file: ``[project]_[dataset_name]_[type]_[version]_[mip]_[short_name]_[start-date]-[end-date].nc`` -and similar standards exist for the standard paths (input directories); for the ESGF data nodes, -these paths differ slightly, for example: +Similar standards exist for the standard paths (input directories); for the +ESGF data nodes, these paths differ slightly, for example: * CMIP6 path for BADC: ``ROOT-BADC/[institute]/[dataset_name]/[experiment]/[ensemble]/[mip]/ [variable_short_name]/[grid]``; * CMIP6 path for ETHZ: ``ROOT-ETHZ/[experiment]/[mip]/[variable_short_name]/[dataset_name]/[ensemble]/[grid]`` -From the ESMValTool user perspective the number of data input parameters is optimized to allow for ease of use. -We detail this procedure in the next section. +From the ESMValTool user perspective the number of data input parameters is +optimized to allow for ease of use. We detail this procedure in the next +section. + +.. 
_data-retrieval: Data retrieval ============== -Data retrieval in ESMValTool has two main aspects from the user's point of view: +Data retrieval in ESMValTool has two main aspects from the user's point of +view: * data can be found by the tool, subject to availability on disk; * it is the user's responsibility to set the correct data retrieval parameters; -The first point is self-explanatory: if the user runs the tool on a machine that has access to a data -repository or multiple data repositories, then ESMValTool will look for and find the avaialble data requested -by the user. +The first point is self-explanatory: if the user runs the tool on a machine +that has access to a data repository or multiple data repositories, then +ESMValTool will look for and find the avaialble data requested by the user. -The second point underlines the fact that the user has full control over what type and the amount of data is -needed for the analyses. Setting the data retrieval parameters is explained below: +The second point underlines the fact that the user has full control over what +type and the amount of data is needed for the analyses. Setting the data +retrieval parameters is explained below. Setting the correct root paths ------------------------------ -The first step towards providing ESMValTool the correct set of parameters for data retrieval is setting -the root paths to the data. This is done in the user configuration file ``config-user.yml``. -The two sections where the user will set the paths are ``rootpath`` and ``drs``. ``rootpath`` contains pointers -to ``CMIP``, ``OBS``, ``default`` and ``RAWOBS`` root paths; ``drs`` sets the type of directory structure -the root paths are structured by. It is important to first discuss the ``drs`` parameter: as we've seen in -the previous section, the DRS as a standard is used for both file naming conventions and for directory structures. 
+The first step towards providing ESMValTool the correct set of parameters for +data retrieval is setting the root paths to the data. This is done in the user +configuration file ``config-user.yml``. The two sections where the user will +set the paths are ``rootpath`` and ``drs``. ``rootpath`` contains pointers to +``CMIP``, ``OBS``, ``default`` and ``RAWOBS`` root paths; ``drs`` sets the type +of directory structure the root paths are structured by. It is important to +first discuss the ``drs`` parameter: as we've seen in the previous section, the +DRS as a standard is used for both file naming conventions and for directory +structures. .. _config-user-drs: Explaining ``config-user/drs: CMIP5:`` or ``config-user/drs: CMIP6:`` --------------------------------------------------------------------- -Whreas ESMValTool will **always** use the CMOR standard for file naming (please refer above), by setting the ``drs`` -parameter the user tells the tool what type of root paths they need the data from, e.g.: +Whreas ESMValTool will **always** use the CMOR standard for file naming (please +refer above), by setting the ``drs`` parameter the user tells the tool what +type of root paths they need the data from, e.g.: .. code-block:: yaml drs: CMIP6: BADC -will tell the tool that the user needs data from a repository structured according to the BADC DRS structure ie +will tell the tool that the user needs data from a repository structured +according to the BADC DRS structure, i.e.: ``ROOT/[institute]/[dataset_name]/[experiment]/[ensemble]/[mip]/[variable_short_name]/[grid]``; -setting the ``ROOT`` parameter is explained below. This is a strictly-structured repository tree and if -there are any sort of irregularities (e.g. there is no ``[mip]`` directory) the data will not be found! -``BADC`` can be replaced with ``DKRZ`` or ``ETHZ`` depending on the existing ``ROOT`` directory structure. - +setting the ``ROOT`` parameter is explained below. 
This is a +strictly-structured repository tree and if there are any sort of irregularities +(e.g. there is no ``[mip]`` directory) the data will not be found! ``BADC`` can +be replaced with ``DKRZ`` or ``ETHZ`` depending on the existing ``ROOT`` +directory structure. The snippet .. code-block:: yaml @@ -90,12 +106,14 @@ The snippet drs: CMIP6: default -is another way to retrieve data from a ``ROOT`` directory that has no DRS-like structure; ``default`` is -a directory that contains all the needed data files (a bucket full of everything). +is another way to retrieve data from a ``ROOT`` directory that has no DRS-like +structure; ``default`` indicates that the data lies in a directory that +contains all the files without any structire. .. note:: - When using ``CMIP6: default`` or ``CMIP5: default`` it is important to remember that all the needed files - must be in the same top-level directory set by ``default`` (see below how to set ``default``). + When using ``CMIP6: default`` or ``CMIP5: default`` it is important to + remember that all the needed files must be in the same top-level directory + set by ``default`` (see below how to set ``default``). .. _config-user-rootpath: @@ -104,11 +122,11 @@ Explaining ``config-user/rootpath:`` ``rootpath`` identifies the root directory for different data types (``ROOT`` as we used it above): -* ``CMIP`` e.g. ``CMIP5`` or ``CMIP6``: this is the `root` path(s) to where the CMIP files are stored; - it can be a single path or a list of paths; it can point to an ESGF node or it can point to a user - private repository; - - Example for a CMIP5 root path pointing to the ESGF node on CEDA-Jasmin (formerly known as BADC): +* ``CMIP`` e.g. ``CMIP5`` or ``CMIP6``: this is the `root` path(s) to where the + CMIP files are stored; it can be a single path or a list of paths; it can + point to an ESGF node or it can point to a user private repository. 
Example + for a CMIP5 root path pointing to the ESGF node on CEDA-Jasmin (formerly + known as BADC): .. code-block:: yaml @@ -127,28 +145,31 @@ Explaining ``config-user/rootpath:`` CMIP6: [/badc/cmip6/data/CMIP6/CMIP, /home/users/joepesci/cmip_data] -* ``OBS``: this is the `root` path(s) to where the observational datasets are stored; again, this could - be a single path or a list of paths, just like for CMIP data. - - Example for the OBS path for a large cache of observation datasets on CEDA-Jasmin: +* ``OBS``: this is the `root` path(s) to where the observational datasets are + stored; again, this could be a single path or a list of paths, just like for + CMIP data. Example for the OBS path for a large cache of observation datasets + on CEDA-Jasmin: .. code-block:: yaml OBS: /group_workspaces/jasmin4/esmeval/obsdata-v2 -* ``default``: this is the `root` path(s) to where files are stored without any DRS-like directory - structure; in a nutshell, this is a single directory that should contain all the files needed by the - run, without any sub-directory structure. +* ``default``: this is the `root` path(s) to where files are stored without any + DRS-like directory structure; in a nutshell, this is a single directory that + should contain all the files needed by the run, without any sub-directory + structure. -* ``RAWOBS``: this is the `root` path(s) to where the raw observational data files are stored; this is - used by ``cmorize_obs``. +* ``RAWOBS``: this is the `root` path(s) to where the raw observational data + files are stored; this is used by ``cmorize_obs``. Dataset definitions in ``recipe`` --------------------------------- -Once the correct paths have been established, it is now time to collect the information on the specific -datasets that are needed for the analysis. This information, together with the CMOR convention for -naming files (see CMOR-DRS_) will allow ``_data_finder`` to search and find the right files. 
The specific -datasets are listed in any recipe, under either the ``datasets`` and/or ``additional_datasets`` sections, e.g. +Once the correct paths have been established, ESMValTool collects the +information on the specific datasets that are needed for the analysis. This +information, together with the CMOR convention for naming files (see CMOR-DRS_) +will allow the tool to search and find the right files. The specific +datasets are listed in any recipe, under either the ``datasets`` and/or +``additional_datasets`` sections, e.g. .. code-block:: yaml @@ -160,8 +181,9 @@ datasets are listed in any recipe, under either the ``datasets`` and/or ``additi Recap and example ================= -Let's look at a practical example for a recap of the information above: suppose you are using a ``config-user.yml`` -that has the following entries for data finding: +Let us look at a practical example for a recap of the information above: +suppose you are using a ``config-user.yml`` that has the following entries for +data finding: .. code-block:: yaml @@ -176,7 +198,7 @@ and the dataset you need is specified in your ``recipe.yml`` as: - {dataset: UKESM1-0-LL, project: CMIP6, mip: Amon, exp: historical, grid: gn, ensemble: r1i1p1f2, start_year: 2004, end_year: 2014} -for a variable e.g. +for a variable, e.g.: .. code-block:: yaml @@ -187,41 +209,43 @@ for a variable e.g. 
ta: preprocessor: some_preprocessor -``_data_finder`` will use the root path ``/badc/cmip6/data/CMIP6/CMIP`` and the dataset information and will -assemble the full DRS path using information from CMOR-DRS_ and establish the path to the files as +The tool will then use the root path ``/badc/cmip6/data/CMIP6/CMIP`` and the +dataset information and will assemble the full DRS path using information from +CMOR-DRS_ and establish the path to the files as: ``/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r1i1p1f2/Amon`` -then look for variable ``ta`` and specifically the latest version of the data file: +then look for variable ``ta`` and specifically the latest version of the data +file: ``/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r1i1p1f2/Amon/ta/gn/latest/`` and finally, using the file naming definition from CMOR-DRS_ find the file: -``/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r1i1p1f2/Amon/`` -``ta/gn/latest/`` -``ta_Amon_UKESM1-0-LL_historical_r1i1p1f2_gn_195001-201412.nc`` +``/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r1i1p1f2/Amon/ta/gn/latest/ta_Amon_UKESM1-0-LL_historical_r1i1p1f2_gn_195001-201412.nc`` + +.. _observations: Observational data ================== -Observational data is retrieved in the same manner as CMIP data, for example using the ``OBS`` root path set to +Observational data is retrieved in the same manner as CMIP data, for example +using the ``OBS`` root path set to: .. code-block:: yaml OBS: /group_workspaces/jasmin4/esmeval/obsdata-v2 -and the dataset +and the dataset: .. 
code-block:: yaml - {dataset: ERA-Interim, project: OBS, type: reanaly, version: 1, start_year: 2014, end_year: 2015, tier: 3} -in ``recipe.yml`` in ``datasets`` or ``additional_datasets``, the rules set in CMOR-DRS_ are used again -and the file will be automatically found: +in ``recipe.yml`` in ``datasets`` or ``additional_datasets``, the rules set in +CMOR-DRS_ are used again and the file will be automatically found: -``/group_workspaces/jasmin4/esmeval/obsdata-v2/`` -``Tier3/ERA-Interim/`` -``OBS_ERA-Interim_reanaly_1_Amon_ta_201401-201412.nc`` +``/group_workspaces/jasmin4/esmeval/obsdata-v2/Tier3/ERA-Interim/OBS_ERA-Interim_reanaly_1_Amon_ta_201401-201412.nc`` -Note that for observational data for ``drs: default`` the ``default`` directory must contain a sub-directory +Note that for observational data for ``drs: default`` the ``default`` directory +must contain a sub-directory: ``TierX`` (``Tier1``, ``Tier2`` or ``Tier3``). diff --git a/doc/esmvalcore/recipe.rst b/doc/esmvalcore/recipe.rst index 79d0d76bd1..ea856cd05f 100644 --- a/doc/esmvalcore/recipe.rst +++ b/doc/esmvalcore/recipe.rst @@ -12,12 +12,12 @@ to pass to ``esmvaltool`` as command line option, at each run time point. Recipes contain the data and data analysis information and instructions needed to run the diagnostic(s), as well as specific diagnostic-related instructions. -Broadly, recipes contain a general section summarizing the provenance and functionality of the -diagnostics, the datasets which need to be run, the preprocessors that need to be -applied, and the diagnostics which need to be run over the preprocessed data. -This information is provided to ESMValTool in four main recipe sections: -Documentation_, Datasets_, Preprocessors_ and Diagnostics_, -respectively. 
+Broadly, recipes contain a general section summarizing the provenance and +functionality of the diagnostics, the datasets which need to be run, the +preprocessors that need to be applied, and the diagnostics which need to be run +over the preprocessed data. This information is provided to ESMValTool in four +main recipe sections: Documentation_, Datasets_, Preprocessors_ and +Diagnostics_, respectively. .. _Documentation: @@ -26,13 +26,16 @@ Recipe section: ``documentation`` The documentation section includes: -- The recipe's author's user name (``authors``, as they appaer in ``config-references.yml`` :ref:`config-ref`) +- The recipe's author's user name (``authors``, matching the definitions in the + :ref:`config-ref`) - A description of the recipe (``description``, written in MarkDown format) -- A list of scientific references (``references`` , as they appaer in ``config-references.yml`` :ref:`config-ref`) -- the project or projects associated with the recipe (``projects``, as they appaer in ``config-references.yml`` :ref:`config-ref`) +- A list of scientific references (``references``, matching the definitions in + the :ref:`config-ref`) +- the project or projects associated with the recipe (``projects``, matching + the definitions in the :ref:`config-ref`) -For example, please see the documentation section from the recipe: -``recipes/recipe_ocean_amoc.yml``: +For example, the documentation section of ``recipes/recipe_ocean_amoc.yml`` is +the following: .. code-block:: yaml @@ -57,13 +60,11 @@ For example, please see the documentation section from the recipe: .. note:: - **Information from config-references.yml** - - Note that the authors, projects, and references will need to be included in the - ``config-references.yml`` file. The author name uses the format: - ``surname_name``. For instance, John Doe would be: ``authors: - doe_john``. 
- For a first-time user that does not yet have their name added to ``config-references.yml`` - a run of an already-made recipe or running with no author name is possible. + Note that all authors, projects, and references mentioned in the documentation + section of the recipe need to be included in the ``config-references.yml`` + file. The author name uses the format: ``surname_name``. For instance, John + Doe would be: ``doe_john``. This information can be omitted by new users + whose name is not yet included in ``config-references.yml``. .. _Datasets: @@ -77,11 +78,13 @@ data specifications: - project (key ``project``, value ``CMIP5`` or ``CMIP6`` for CMIP data, ``OBS`` for observational data, ``ana4mips`` for ana4mips data, ``obs4mips`` for obs4mips data, ``EMAC`` for EMAC data) -- experiment (key ``exp``, value e.g. ``historical``, ``amip``, ``piControl``, ``RCP8.5``) +- experiment (key ``exp``, value e.g. ``historical``, ``amip``, ``piControl``, + ``RCP8.5``) - mip (for CMIP data, key ``mip``, value e.g. ``Amon``, ``Omon``, ``LImon``) - ensemble member (key ``ensemble``, value e.g. ``r1i1p1``, ``r1i1p1f1``) - time range (e.g. key-value ``start_year: 1982``, ``end_year: 1990``) -- model grid (native grid ``grid: gn`` or regridded grid ``grid: gr``, for CMIP6 data only). +- model grid (native grid ``grid: gn`` or regridded grid ``grid: gr``, for + CMIP6 data only). For example, a datasets section could be: @@ -109,11 +112,13 @@ Each preprocessor section includes: - A preprocessor name (any name, under ``preprocessors``); - A list of preprocesor steps to be executed (choose from the API); - Any or none arguments given to the preprocessor steps; -- The order that the preprocesor steps are applied can also be specified using the ``custom_order`` preprocesor function. +- The order in which the preprocessor steps are applied can also be specified using + the ``custom_order`` preprocessor function.
-The following snippet is an example of a preprocessor named ``prep_map`` that contains -multiple preprocessing steps (:ref:`Horizontal regridding` with two arguments, :ref:`Time operations` with no arguments -and :ref:`Multi-model statistics` with two arguments): +The following snippet is an example of a preprocessor named ``prep_map`` that +contains multiple preprocessing steps (:ref:`Horizontal regridding` with two +arguments, :ref:`Time operations` with no arguments and :ref:`Multi-model +statistics` with two arguments): .. code-block:: yaml @@ -129,12 +134,10 @@ and :ref:`Multi-model statistics` with two arguments): .. note:: - What if no preprocessor is needed? - - In this case no ``preprocessors`` section is needed; - the workflow will apply a ``default`` preprocessor consisting of only - basic operations like: loading data, applying CMOR checks and fixes (:ref:`CMOR check and dataset-specific fixes`) - and saving the data to disk (if needed). + If no preprocessor is needed, the ``preprocessors`` section can be omitted: the workflow will apply + a ``default`` preprocessor consisting of only basic operations like: loading + data, applying CMOR checks and fixes (:ref:`CMOR check and dataset-specific + fixes`) and saving the data to disk. .. _Diagnostics: @@ -142,31 +145,31 @@ Recipe section: ``diagnostics`` =============================== The diagnostics section includes one or more diagnostics. Each diagnostics will -have: +include: -- A list of which variables to load -- A description of the variables (optional) -- Which preprocessor to apply to each variable -- The script to run -- The diagnostics can also include an optional ``additional_datasets`` section. +- a list of which variables to load; +- a description of the variables (optional); +- the preprocessor to be applied to each variable; +- the script to be run; +- an optional ``additional_datasets`` section. The ``additional_datasets`` can add datasets beyond those listed in the the Datasets_ section.
This is useful if specific datasets need to be used only by -a specific diagnostic. The ``additional_datasets`` can also be used to add variable -specific datasets. This is also a good way to add observational -datasets, which are usually variable specific. +a specific diagnostic. The ``additional_datasets`` can also be used to add +variable-specific datasets. This is also a good way to add observational +datasets, which are usually variable-specific. Running a simple diagnostic --------------------------- -The following example, taken from ``recipe_ocean_example.yml``, shows a diagnostic -named `diag_map`, which loads the temperature at the ocean surface between -the years 2001 and 2003 and then passes it to the ``prep_map`` preprocessor. -The result of this process is then passed to the ocean diagnostic map scipt, -``ocean/diagnostic_maps.py``. +The following example, taken from ``recipe_ocean_example.yml``, shows a +diagnostic named `diag_map`, which loads the temperature at the ocean surface +between the years 2001 and 2003 and then passes it to the ``prep_map`` +preprocessor. The result of this process is then passed to the ocean diagnostic +map script, ``ocean/diagnostic_maps.py``. .. code-block:: yaml - diagnostics: + diagnostics: diag_map: description: Global Ocean Surface regridded temperature map @@ -183,26 +186,22 @@ To define a variable/dataset combination, the keys in the diagnostic section are combined with the keys from datasets section. If two versions of the same key are provided, then the key in the datasets section will take precedence over the keys in variables section. For many recipes it makes more sense to -define the ``start_year`` and ``end_year`` items in the variable section, because the -diagnostic script assumes that all the data has the same time range. +define the ``start_year`` and ``end_year`` items in the variable section, +because the diagnostic script assumes that all the data has the same time +range.
Note that the path to the script provided in the `script` option should be -either: - - - the absolute path to the script. - - the path relative to the ``esmvaltool/diag_scripts`` directory. - +either the absolute path to the script, or the path relative to the +``esmvaltool/diag_scripts`` directory. -As mentioned above, the datasets are provided in the Diagnostics_ section -in this section. However, they could also be included in the Datasets_ -section. -Passing arguments to diagnostic -------------------------------- -The ``diagnostics`` section may include a lot of arguments that can be used by the -diagnostic script; these arguments are stored at runtime in a dictionary that is then -made available to the diagnostic script via the interface link (no matter if the diagnostic -is run in Python, NCL etc). Here is an example of such groups of arguments: +Passing arguments to a diagnostic +--------------------------------- +The ``diagnostics`` section may include a lot of arguments that can be used by +the diagnostic script; these arguments are stored at runtime in a dictionary +that is then made available to the diagnostic script via the interface link, +independent of the language the diagnostic script is written in. Here is an +example of such groups of arguments: .. code-block:: yaml @@ -216,10 +215,11 @@ is run in Python, NCL etc). Here is an example of such groups of arguments: obs_models: [ERA-Interim] # list to hold models that are NOT for metrics but for obs operations additional_metrics: [ERA-Interim, inmcm4] # list to hold additional datasets for metrics -In this example, apart from specifying the diagnostic script ``script: autoassess/autoassess_area_base.py``, -we pass a suite of parameters to be used by the script (``area``, ``control_model`` etc). 
These parameters are -stored in key-value pairs in the diagnostic configuration file, an interface file that can be used by importing -the ``run_diagnostic`` utility: +In this example, apart from specifying the diagnostic script ``script: +autoassess/autoassess_area_base.py``, we pass a suite of parameters to be used +by the script (``area``, ``control_model`` etc). These parameters are stored in +key-value pairs in the diagnostic configuration file, an interface file that +can be used by importing the ``run_diagnostic`` utility: .. code-block:: python @@ -244,18 +244,18 @@ the ``run_diagnostic`` utility: with run_diagnostic() as config: main(config) -This way a lot of the optional arguments necessary to a diagnostic are at the user's -control via the recipe. +This way a lot of the optional arguments necessary to a diagnostic are at the +user's control via the recipe. Running your own diagnostic --------------------------- -If the user decides to test a e.g. ``my_first_diagnostic.py`` diagnostic they have just written -and, of course, this diagnostic is not in the ESMValTool diagnostics library, they can do it by -passing the absolute path to the diagnostic: +If the user wants to test a newly-developed ``my_first_diagnostic.py`` which +is not yet part of the ESMValTool diagnostics library, they can do so by passing +the absolute path to the diagnostic: .. code-block:: yaml - diagnostics: + diagnostics: myFirstDiag: description: John Doe wrote a funny diagnostic @@ -266,15 +266,15 @@ passing the absolute path to the diagnostic: end_year: 2003 scripts: JoeDiagFunny: - script: /home/users/joepesci/esmvaltool_testing/my_first_diagnostic.py + script: /home/users/john_doe/esmvaltool_testing/my_first_diagnostic.py -This way the user may test their diagnostic thoroughly before committing to git and including -their new diagnostic in the ESMValTool diagnostics library.
+This way the user may test a new diagnostic thoroughly before committing to the +GitHub repository and including it in the ESMValTool diagnostics library. Re-using parameters from one ``script`` to another -------------------------------------------------- -Due to ``yaml`` features it is possible to recycle entire diagnostics sections for use with other -diagnostics. Here is an example: +Due to ``yaml`` features it is possible to recycle entire diagnostics sections +for use with other diagnostics. Here is an example: .. code-block:: yaml @@ -289,5 +289,5 @@ diagnostics. Here is an example: calc_grading: true normalization: [centered_median, none] -In this example the hook ``&cycle_settings`` can be used to pass the ``cycle:`` parameters to -``grading:`` via the shortcut ``<<: *cycle_settings``. +In this example the hook ``&cycle_settings`` can be used to pass the ``cycle:`` +parameters to ``grading:`` via the shortcut ``<<: *cycle_settings``. From ffa8a958a33271e11eeb5d3cc51c6546a22f9aab Mon Sep 17 00:00:00 2001 From: Mattia Righi Date: Tue, 30 Jul 2019 16:45:32 +0200 Subject: [PATCH 43/49] Restore original section order and set max colummn width --- doc/esmvalcore/preprocessor.rst | 1005 ++++++++++++++++--------------- 1 file changed, 516 insertions(+), 489 deletions(-) diff --git a/doc/esmvalcore/preprocessor.rst b/doc/esmvalcore/preprocessor.rst index a6bc8b7926..e35a8ab1e8 100644 --- a/doc/esmvalcore/preprocessor.rst +++ b/doc/esmvalcore/preprocessor.rst @@ -22,56 +22,51 @@ following the default order in which they are applied: Overview ======== -ESMValTool is a modular ``Python 3.6+`` software package possesing capabilities -of executing a large number of diagnostic routines -that can be written in a number of programming languages (Python, NCL, R, Julia). 
-The modular nature benefits the users and developers in different key areas: -a new feature developed specifically for version 2.0 is the preprocessing core or -the preprocessor (esmvalcore) that executes the bulk of standardized data operations -and is highly optimized for maximum performance in data-intensive tasks. The main -objective of the preprocessor is to integrate as many standardizable data analysis -functions as possible so that the diagnostics can focus on the specific scientific -tasks they carry. The preprocessor is linked to the diagnostics library and the -diagnostic execution is seamlessly performed after the preprocessor has completed the -its steps. The benefit of having a preprocessing unit separate from the diagnostics -library include: - -* ease of integration of new preprocessing routines; -* ease of maintenance (including unit and integration testing) of existing routines; -* a straightforward manner of importing and using the preprocessing routines as part - of the overall usage of the software and, as a special case, the use during diagnostic - execution; -* shifting the effort for the scientific diagnostic developer from implementing both standard - and diagnostic-specific functionalities to allowing them to dedicate most of the effort to - developing scientifically-relevant diagnostics and metrics; -* a more strict code review process, given the smaller code base than for diagnostics. +.. + ESMValTool is a modular ``Python 3.6+`` software package possesing capabilities + of executing a large number of diagnostic routines that can be written in a + number of programming languages (Python, NCL, R, Julia). The modular nature + benefits the users and developers in different key areas: a new feature + developed specifically for version 2.0 is the preprocessing core or the + preprocessor (esmvalcore) that executes the bulk of standardized data + operations and is highly optimized for maximum performance in data-intensive + tasks. 
The main objective of the preprocessor is to integrate as many + standardizable data analysis functions as possible so that the diagnostics can + focus on the specific scientific tasks they carry. The preprocessor is linked + to the diagnostics library and the diagnostic execution is seamlessly performed + after the preprocessor has completed the its steps. The benefit of having a + preprocessing unit separate from the diagnostics library include: + + * ease of integration of new preprocessing routines; + * ease of maintenance (including unit and integration testing) of existing + routines; + * a straightforward manner of importing and using the preprocessing routines as + part of the overall usage of the software and, as a special case, the use + during diagnostic execution; + * shifting the effort for the scientific diagnostic developer from implementing + both standard and diagnostic-specific functionalities to allowing them to + dedicate most of the effort to developing scientifically-relevant diagnostics + and metrics; + * a more strict code review process, given the smaller code base than for + diagnostics. The ESMValTool preprocessor can be used to perform a broad range of operations -on the input data before diagnostics or metrics are applied. The -preprocessor performs these operations in a centralized, documented and -efficient way, thus reducing the data processing load on the diagnostics side. +on the input data before diagnostics or metrics are applied. The preprocessor +performs these operations in a centralized, documented and efficient way, thus +reducing the data processing load on the diagnostics side. Each of the preprocessor operations is written in a dedicated python module and -all of them receive and return an Iris -`cube `_ , -working sequentially on the data with no interactions between them. 
The order -in which the preprocessor operations is applied is set by default in order to -minimize the loss of information due to, for example, temporal and spatial -subsetting or multi-model averaging. Nevertheless, the user is free to change -such order to address specific scientific requirements, but keeping in mind -that some operations must be necessarily performed in a specific order. This is -the case, for instance, for multi-model statistics, which required the model to -be on a common grid and therefore has to be called after the regridding module. - -Features of the ESMValTool Climate data pre-processor are: - -* Regridding -* Geographical area selection -* Aggregation of data -* Provenance tracking of the calculations -* Model statistics -* Multimodel statistics -* and many more +all of them receive and return an Iris `cube +`_ , working +sequentially on the data with no interactions between them. The order in which +the preprocessor operations is applied is set by default in order to minimize +the loss of information due to, for example, temporal and spatial subsetting or +multi-model averaging. Nevertheless, the user is free to change such order to +address specific scientific requirements, but keeping in mind that some +operations must be necessarily performed in a specific order. This is the case, +for instance, for multi-model statistics, which required the model to be on a +common grid and therefore has to be called after the regridding module. + .. _Variable derivation: @@ -118,293 +113,25 @@ CMORization and dataset-specific fixes ====================================== .. warning:: - Section to be added by Javier ``CMORMAN`` Vegas-Regidor - - -.. _time operations: - -Time manipulation -================= -The ``_time.py`` module contains the following preprocessor functions: - -* ``extract_time``: Extract a time range from an Iris ``cube``. -* ``extract_season``: Extract only the times that occur within a specific season. 
-* ``extract_month``: Extract only the times that occur within a specific month. -* ``time_average``: Take the weighted average over the time dimension. -* ``seasonal_mean``: Produces a mean for each season (DJF, MAM, JJA, SON) -* ``annual_mean``: Produces an annual or decadal mean. -* ``regrid_time``: Aligns the time axis of each dataset to have common time points - and calendars. - -``extract_time`` ----------------- - -This function subsets a dataset between two points in times. It removes all -times in the dataset before the first time and after the last time point. -The required arguments are relatively self explanatory: - -* ``start_year`` -* ``start_month`` -* ``start_day`` -* ``end_year`` -* ``end_month`` -* ``end_day`` - -These start and end points are set using the datasets native calendar. -All six arguments should be given as integers - the named month string -will not be accepted. - -See also :func:`esmvalcore.preprocessor.extract_time`. - - -``extract_season`` ------------------- - -Extract only the times that occur within a specific season. - -This function only has one argument: ``season``. This is the named season to -extract. ie: DJF, MAM, JJA, SON. - -Note that this function does not change the time resolution. If your original -data is in monthly time resolution, then this function will return three -monthly datapoints per year. - -If you want the seasonal average, then this function needs to be combined with -the seasonal_mean function, below. - -See also :func:`esmvalcore.preprocessor.extract_season`. - - -``extract_month`` ------------------ - -The function extracts the times that occur within a specific month. -This function only has one argument: ``month``. This value should be an integer -between 1 and 12 as the named month string will not be accepted. - -See also :func:`esmvalcore.preprocessor.extract_month`. - -.. _time_average: - -``time_average`` ----------------- - -This functions takes the weighted average over the time dimension. 
This -function requires no arguments and removes the time dimension of the cube. - -See also :func:`esmvalcore.preprocessor.time_average`. - - -``seasonal_mean`` ------------------ - -This function produces a seasonal mean for each season (DJF, MAM, JJA, SON). -Note that this function will not check for missing time points. For instance, -if you are looking at the DJF field, but your datasets starts on January 1st, -the first DJF field will only contain data from January and February. - -We recommend using the extract_time to start the dataset from the following -December and remove such biased initial datapoints. - -See also :func:`esmvalcore.preprocessor.seasonal_mean`. - - -``annual_mean`` ---------------- - -This function produces an annual or a decadal mean. The only argument is the -decadal boolean switch. When this switch is set to true, this function -will output the decadal averages. - -See also :func:`esmvalcore.preprocessor.annual_mean`. - - -``regrid_time`` ---------------- - -This function aligns the time points of each component dataset so that the dataset -Iris cubes can be subtracted. The operation makes the datasets time points common and -sets common calendars; it also resets the time bounds and auxiliary coordinates to -reflect the artifically shifted time points. Current implementation for monthly -and daily data; the ``frequency`` is set automatically from the variable CMOR table -unless a custom ``frequency`` is set manually by the user in recipe. - -See also :func:`esmvalcore.preprocessor.regrid_time`. - - -.. _area operations: - -Area manipulation -================= -The ``_area.py`` module contains the following preprocessor functions: - -* ``extract_region``: Extract a region from a cube based on ``lat/lon`` corners. -* ``zonal_means``: Calculates the zonal or meridional means. -* ``area_statistics``: Calculates the average value over a region. -* ``extract_named_regions``: Extract a specific region from in the region cooordinate. 
- - -``extract_region`` ------------------- - -This function masks data outside a rectagular region requested. The boundairies -of the region are provided as latitude and longitude coordinates in the -arguments: - -* ``start_longitude`` -* ``end_longitude`` -* ``start_latitude`` -* ``end_latitude`` - -Note that this function can only be used to extract a rectangular region. - -See also :func:`esmvalcore.preprocessor.extract_region`. - - -``zonal_means`` ---------------- - -The function calculates the zonal or meridional means. While this function is -named ``zonal_mean``, it can be used to apply several different operations in -an zonal or meridional direction. This function takes two arguments: - -* ``coordinate``: Which direction to apply the operation: latitude or longitude -* ``mean_type``: Which operation to apply: mean, std_dev, variance, median, min or max - -See also :func:`esmvalcore.preprocessor.zonal_means`. - - -``area_statistics`` -------------------- - -This function calculates the average value over a region - weighted by the -cell areas of the region. This function takes the argument, -``operator``: the name of the operation to apply. - -This function can be used to apply several different operations in the horizonal -plane: mean, standard deviation, median variance, minimum and maximum. - -Note that this function is applied over the entire dataset. If only a specific -region, depth layer or time period is required, then those regions need to be -removed using other preprocessor operations in advance. - -See also :func:`esmvalcore.preprocessor.area_statistics`. - - -``extract_named_regions`` -------------------------- - -This function extract a specific named region from the data. This function -takes the following argument: ``regions`` which is either a string or a list -of strings of named regions. Note that the dataset must have a ``region`` -cooordinate which includes a list of strings as values. 
This function then -matches the named regions against the requested string. - -See also :func:`esmvalcore.preprocessor.extract_named_regions`. - - -.. _volume operations: - -Volume manipulation -=================== -The ``_volume.py`` module contains the following preprocessor functions: - -* ``extract_volume``: Extract a specific depth range from a cube. -* ``volume_statistics``: Calculate the volume-weighted average. -* ``depth_integration``: Integrate over the depth dimension. -* ``extract_transect``: Extract data along a line of constant latitude or longitude. -* ``extract_trajectory``: Extract data along a specified trajectory. - - -``extract_volume`` ------------------- - -Extract a specific range in the `z`-direction from a cube. This function -takes two arguments, a minimum and a maximum (``z_min`` and ``z_max``, -respectively) in the `z`-direction. - -Note that this requires the requested `z`-coordinate range to be the -same sign as the Iris cube. ie, if the cube has `z`-coordinate as -negative, then ``z_min`` and ``z_max`` need to be negative numbers. - -See also :func:`esmvalcore.preprocessor.extract_volume`. - - -``volume_statistics`` ---------------------- - -This function calculates the volume-weighted average across three dimensions, -but maintains the time dimension. - -This function takes the argument: ``operator``, which defines the -operation to apply over the volume. - -No depth coordinate is required as this is determined by Iris. This -function works best when the ``fx_files`` provide the cell volume. - -See also :func:`esmvalcore.preprocessor.volume_statistics`. - - -``depth_integration`` ---------------------- - -This function integrate over the depth dimension. This function does a -weighted sum along the `z`-coordinate, and removes the `z` direction of the output -cube. This preprocessor takes no arguments. - -See also :func:`esmvalcore.preprocessor.depth_integration`. 
- - -``extract_transect`` --------------------- - -This function extract data along a line of constant latitude or longitude. -This function takes two arguments, although only one is strictly required. -The two arguments are ``latitude`` and ``longitude``. One of these arguments -needs to be set to a float, and the other can then be either ignored or set to -a minimum or maximum value. - -**Example**: If we set latitude to 0 N and leave longitude blank, it would produce a -cube along the Equator. On the other hand, if we set latitude to 0 and then -set longitude to ``[40., 100.]`` this will produce a transect of the Equator -in the Indian Ocean. - -See also :func:`esmvalcore.preprocessor.extract_transect`. - - -``extract_trajectory`` ----------------------- - -This function extract data along a specified trajectory. -The three areguments are: ``latitudes``, ``longitudes`` and number of point needed for -extrapolation ``number_points``. - -If two points are provided, the ``number_points`` argument is used to set a -the number of places to extract between the two end points. - -If more than two points are provided, then -``extract_trajectory`` will produce a cube which has extrapolated the data -of the cube to those points, and ``number_points`` is not needed. - -Note that this function uses the expensive ``interpolate`` method from ``Iris.analysis.trajectory``, -but it may be necceasiry for irregular grids. - -See also :func:`esmvalcore.preprocessor.extract_trajectory`. + Section to be added. .. _Vertical interpolation: Vertical interpolation ====================== -Vertical level selection is an important aspect of data preprocessing since it allows the -scientist to perform a number of metrics specific to certain levels (whether it be air pressure -or depth, e.g. the Quasi-Biennial-Oscillation (QBO) u30 is computed at 30 hPa). 
Dataset native -vertical grids may not come with the desired set of levels, so an interpolation operation will be -needed to regrid the data vertically. ESMValTool can perform this vertical interpolation via the -``extract_levels`` preprocessor. Level extraction may be done in a number of ways: - -Level extraction can be done at specific values passed to ``extract_levels`` as ``levels:`` with -its value a list of levels (note that the units are CMOR-standard, Pascals (Pa)): +Vertical level selection is an important aspect of data preprocessing since it +allows the scientist to perform a number of metrics specific to certain levels +(whether it be air pressure or depth, e.g. the Quasi-Biennial-Oscillation (QBO) +u30 is computed at 30 hPa). Dataset native vertical grids may not come with the +desired set of levels, so an interpolation operation will be needed to regrid +the data vertically. ESMValTool can perform this vertical interpolation via the +``extract_levels`` preprocessor. Level extraction may be done in a number of +ways. + +Level extraction can be done at specific values passed to ``extract_levels`` as +``levels:`` with its value a list of levels (note that the units are +CMOR-standard, Pascals (Pa)): .. code-block:: yaml @@ -414,8 +141,8 @@ its value a list of levels (note that the units are CMOR-standard, Pascals (Pa)) levels: [100000., 50000., 3000., 1000.] scheme: linear -It is also possible to extract the CMIP-specific, CMOR levels as they appear in the CMOR table, -e.g. ``plev10`` or ``plev17`` or ``plev19`` etc: +It is also possible to extract the CMIP-specific, CMOR levels as they appear in +the CMOR table, e.g. ``plev10`` or ``plev17`` or ``plev19`` etc: .. code-block:: yaml @@ -425,10 +152,11 @@ e.g. 
``plev10`` or ``plev17`` or ``plev19`` etc: levels: {cmor_table: CMIP6, coordinate: plev10} scheme: nearest -Of good use is also the level extraction with values specific to a certain dataset, without -the user actually polling the dataset of interest to find out the specific levels: e.g. in the -example below we offer two alternatives to extract the levels and vertically regrid onto the -vertical levels of ``ERA-Interim``: +Of good use is also the level extraction with values specific to a certain +dataset, without the user actually polling the dataset of interest to find out +the specific levels: e.g. in the example below we offer two alternatives to +extract the levels and vertically regrid onto the vertical levels of +``ERA-Interim``: .. code-block:: yaml @@ -444,22 +172,27 @@ vertical levels of ``ERA-Interim``: * See also :func:`esmvalcore.preprocessor.get_cmor_levels`. .. note:: - **Advanced User and Developer** - - For both vertical and horizontal regridding one can control the extrapolation mode when defining - the interpolation scheme. Controlling the extrapolation mode allows us to avoid situations - where extrapolating values makes little physical sense (e.g. extrapolating beyond the last data point). - The extrapolation mode is controlled by the `extrapolation_mode` keyword. For the available interpolation - schemes available in Iris, the extrapolation_mode keyword must be one of: - - * ``extrapolate`` – the extrapolation points will be calculated by extending the gradient - of the closest two points, - * ``error`` – a ``ValueError`` exception will be raised, notifying an attempt to extrapolate, - * ``nan`` – the extrapolation points will be be set to NaN, - * ``mask`` – the extrapolation points will always be masked, even if the source data is not - a ``MaskedArray``, or - * ``nanmask`` – if the source data is a MaskedArray the extrapolation points will be masked. - Otherwise they will be set to NaN. 
+ + For both vertical and horizontal regridding one can control the + extrapolation mode when defining the interpolation scheme. Controlling the + extrapolation mode allows us to avoid situations where extrapolating values + makes little physical sense (e.g. extrapolating beyond the last data point). + The extrapolation mode is controlled by the ``extrapolation_mode`` + keyword. For the interpolation schemes available in Iris, the + ``extrapolation_mode`` keyword must be one of: + + * ``extrapolate``: the extrapolation points will be calculated by + extending the gradient of the closest two points; + * ``error``: a ``ValueError`` exception will be raised, notifying an + attempt to extrapolate; + * ``nan``: the extrapolation points will be set to NaN; + * ``mask``: the extrapolation points will always be masked, even if the + source data is not a ``MaskedArray``; or + * ``nanmask``: if the source data is a ``MaskedArray`` the extrapolation + points will be masked, otherwise they will be set to NaN. + + +.. _masking: Masking ======= @@ -479,7 +212,6 @@ are used: although these are not model-specific, they represent a good approximation since they have a much higher resolution than most of the models and they are regularly updated with changing geographical features. - .. _land/sea/ice masking: Land-sea masking @@ -502,8 +234,8 @@ To mask out a certain domain (e.g., sea) in the preprocessor, and requires only one argument: ``mask_out``: either ``land`` or ``sea``. -The preprocessor automatically retrieves the corresponding mask (``fx: stfof`` in -this case) and applies it so that sea-covered grid cells are set to +The preprocessor automatically retrieves the corresponding mask (``fx: sftof`` +in this case) and applies it so that sea-covered grid cells are set to missing. Conversely, it retrieves the ``fx: sftlf`` mask when land need to be masked out, respectively. 
If the corresponding fx file is not found (which is the case for some models and almost all observational datasets), the @@ -530,8 +262,8 @@ losing generality. To mask ice out, ``mask_landseaice`` can be used: and requires only one argument: ``mask_out``: either ``landsea`` or ``ice``. -As in the case of ``mask_landsea``, the preprocessor automatically retrieves the -``fx_files: [sftgif]`` mask. +As in the case of ``mask_landsea``, the preprocessor automatically retrieves +the ``fx_files: [sftgif]`` mask. See also :func:`esmvalcore.preprocessor.mask_landseaice`. @@ -539,10 +271,10 @@ Mask files ---------- At the core of the land/sea/ice masking in the preprocessor are the mask files -(whether it be fx type or Natural Earth type of files); these files (bar Natural Earth) -can be retrived and used in the diagnostic phase as well or solely. By specifying the -``fx_files:`` key in the variable in diagnostic in the recipe, and populating it -with a list of desired files e.g.: +(whether it be fx type or Natural Earth type of files); these files (bar +Natural Earth) can be retrived and used in the diagnostic phase as well or +solely. By specifying the ``fx_files:`` key in the variable in diagnostic in +the recipe, and populating it with a list of desired files e.g.: .. code-block:: yaml @@ -551,38 +283,43 @@ with a list of desired files e.g.: preprocessor: my_masking_preprocessor fx_files: [sftlf, sftof, sftgif, areacello, areacella] -Such a recipe will automatically retrieve all the ``fx_files: [sftlf, sftof, sftgif, areacello, areacella]``-type -fx files for each of the variables that are needed for and then, in the diagnostic phase, -these mask files will be available for the developer to use them as they need to. 
The `fx_files` -attribute of the big `variable` nested dictionary that gets passed to the diagnostic is, in turn, -a dictionary on its own, and members of it can be accessed in the diagnostic through a simple loop over -the ``config`` diagnostic variable items e.g.: +Such a recipe will automatically retrieve all the ``fx_files: [sftlf, sftof, +sftgif, areacello, areacella]``-type fx files for each of the variables that +are needed for and then, in the diagnostic phase, these mask files will be +available for the developer to use them as they need to. The `fx_files` +attribute of the big `variable` nested dictionary that gets passed to the +diagnostic is, in turn, a dictionary on its own, and members of it can be +accessed in the diagnostic through a simple loop over the ``config`` diagnostic +variable items e.g.: -.. code-block:: bash +.. code-block:: for filename, attributes in config['input_data'].items(): sftlf_file = attributes['fx_files']['sftlf'] areacello_file = attributes['fx_files']['areacello'] - .. _masking of missing values: Missing values masks -------------------- -Missing (masked) values can be a nuisance especially when dealing with multimodel ensembles -and having to compute multimodel statistics; different numbers of missing data from dataset -to datest may introduce biases and artifically assign more weight to the datasets that have -less missing data. This is handled in ESMValTool via the missing values masks: two types of -such masks are available: one for the multimodel case and another for the single model case. 
- -The multimodel missing values mask (``mask_fillvalues``) is a preprocessor step that usually comes -after all the single-model steps (regridding, area selection etc) have been performed; in a -nutshell, it combines missing values masks from individual models into a multimodel missing -values mask; the individual model masks are built according to common criteria: the user chooses -a time window in which missing data points are counted, and if the number of missing data points -relative to the number of total data points in a window is less than a chosen fractional theshold, -the window is discarded i.e. all the points in the window are masked (set to missing). +Missing (masked) values can be a nuisance especially when dealing with +multimodel ensembles and having to compute multimodel statistics; different +numbers of missing data from dataset to dataset may introduce biases and +artificially assign more weight to the datasets that have fewer missing +data. This is handled in ESMValTool via the missing values masks: two types of +such masks are available: one for the multimodel case and another for the +single model case. + +The multimodel missing values mask (``mask_fillvalues``) is a preprocessor step +that usually comes after all the single-model steps (regridding, area selection +etc) have been performed; in a nutshell, it combines missing values masks from +individual models into a multimodel missing values mask; the individual model +masks are built according to common criteria: the user chooses a time window in +which missing data points are counted, and if the number of missing data points +relative to the number of total data points in a window is less than a chosen +fractional threshold, the window is discarded i.e. all the points in the window +are masked (set to missing). .. code-block:: yaml @@ -593,21 +330,20 @@ the window is discarded i.e. 
all the points in the window are masked (set to mis min_value: 19.0 time_window: 10.0 -In the example above, the fractional threshold for missing data vs. total data is set to 95% and -the time window is set to 10.0 (units of the time coordinate units). Optionally, a minimum value -threshold can be applied, in this case it is set -to 19.0 (in units of the variable units). +In the example above, the fractional threshold for missing data vs. total data +is set to 95% and the time window is set to 10.0 (units of the time coordinate +units). Optionally, a minimum value threshold can be applied, in this case it +is set to 19.0 (in units of the variable units). See also :func:`esmvalcore.preprocessor.mask_fillvalues`. .. note:: - **Pro Tip: creating a multimodel mask using ``mask_fillvalues``** It is possible to use ``mask_fillvalues`` to create a combined multimodel - mask (all the masks from all the analyzed models combined into a single mask); - for that purpose setting the ``threshold_fraction`` to 0 will not discard any - time windows, essentially keeping the original model masks and combining them - into a single mask; here is an example: + mask (all the masks from all the analyzed models combined into a single + mask); for that purpose setting the ``threshold_fraction`` to 0 will not + discard any time windows, essentially keeping the original model masks and + combining them into a single mask; here is an example: .. code-block:: yaml @@ -621,40 +357,46 @@ See also :func:`esmvalcore.preprocessor.mask_fillvalues`. Minimum, maximum and interval masking ------------------------------------- -Thresholding on minimum and maximum accepted data values can also be performed: masks are -constructed based on the results of thresholding; inside and outside interval thresholding -and masking can also be performed. These functions are ``mask_above_threshold``, -``mask_below_threshold``, ``mask_inside_range``, and ``mask_outside_range``. 
+Thresholding on minimum and maximum accepted data values can also be performed: +masks are constructed based on the results of thresholding; inside and outside +interval thresholding and masking can also be performed. These functions are +``mask_above_threshold``, ``mask_below_threshold``, ``mask_inside_range``, and +``mask_outside_range``. + +These functions always take a cube as the first argument and either ``threshold`` +for threshold masking or the pair ``minimum``, ``maximum`` for interval masking. -Thes functions always take a cube as first argument and either ``threshold`` for threshold -masking or the pair ``minimum`, ``maximum`` for interval masking. +See also :func:`esmvalcore.preprocessor.mask_above_threshold` and related +functions. -See also :func:`esmvalcore.preprocessor.mask_above_threshold` and related functions. + .. _Horizontal regridding: Horizontal regridding ===================== -Regridding is necessary when various datasets are available on a variety of `lat-lon` grids and they need -to be brought together on a common grid (for various statistical operations e.g. multimodel statistics or -for e.g. direct inter-comparison or comparison with observational datasets). Regridding is conceptually a -very similar process to interpolation (in fact, the regridder engine uses interpolation and extrapolation, -with various schemes). The primary difference is that interpolation is based on sample data points, while -regridding is based on the horizontal grid of another cube (the reference grid). +Regridding is necessary when various datasets are available on a variety of +`lat-lon` grids and they need to be brought together on a common grid (for +various statistical operations e.g. multimodel statistics or for e.g. direct +inter-comparison or comparison with observational datasets). Regridding is +conceptually a very similar process to interpolation (in fact, the regridder +engine uses interpolation and extrapolation, with various schemes). 
The primary +difference is that interpolation is based on sample data points, while +regridding is based on the horizontal grid of another cube (the reference +grid). -The underlying regridding mechanism in ESMValTool uses ``cube.regrid()`` method from Iris, -so we point the reader to its documentation: -`cube.regrid() `_. +The underlying regridding mechanism in ESMValTool uses ``cube.regrid()`` method +from Iris, so we point the reader to its documentation: `cube.regrid() `_. -The use of the horizontal regridding functionality is flexible depending on what type of reference grid -and what interpolation scheme is preferred. Below we show a few examples. +The use of the horizontal regridding functionality is flexible depending on +what type of reference grid and what interpolation scheme is preferred. Below +we show a few examples. Regridding on a reference dataset grid -------------------------------------- -The example below shows how to regrid on the reference dataset ``ERA-Interim`` (observational data, but just -as well CMIP, obs4mips, or ana4mips datasets can be used); in this case the `scheme` is `linear`. +The example below shows how to regrid on the reference dataset ``ERA-Interim`` +(observational data, but just as well CMIP, obs4mips, or ana4mips datasets can be used); in this case the `scheme` is `linear`. .. code-block:: yaml @@ -667,10 +409,11 @@ as well CMIP, obs4mips, or ana4mips datasets can be used); in this case the `sch Regridding on an ``MxN`` grid specification ------------------------------------------- -The example below shows how to regrid on a reference grid with a cell specification of ``2.5x2.5`` degrees. -This is similar to regridding on reference datasets, but in the previous case the reference dataset grid -cell specifications are not necessarily known a priori. Reegridding on an ``MxN`` cell specification is -oftentimes used when operating on localized data. 
+The example below shows how to regrid on a reference grid with a cell +specification of ``2.5x2.5`` degrees. This is similar to regridding on +reference datasets, but in the previous case the reference dataset grid cell +specifications are not necessarily known a priori. Regridding on an ``MxN`` +cell specification is oftentimes used when operating on localized data. .. code-block:: yaml @@ -680,15 +423,16 @@ oftentimes used when operating on localized data. target_grid: 2.5x2.5 scheme: nearest -In this case the ``NearestNeighbour`` interpolation scheme is used (see below for scheme definitions). +In this case the ``NearestNeighbour`` interpolation scheme is used (see below +for scheme definitions). -When using a ``MxN`` type of grid it is possible to offset the grid cell centrepoints -using the `lat_offset` and ``lon_offset`` arguments: +When using a ``MxN`` type of grid it is possible to offset the grid cell +centrepoints using the ``lat_offset`` and ``lon_offset`` arguments: * ``lat_offset``: offsets the grid centers of the latitude coordinate w.r.t. the pole by half a grid step; -* ``lon_offset``: offsets the grid centers of the longitude coordinate w.r.t. Greenwich - meridian by half a grid step. +* ``lon_offset``: offsets the grid centers of the longitude coordinate + w.r.t. the Greenwich meridian by half a grid step. .. code-block:: yaml @@ -703,9 +447,9 @@ using the `lat_offset` and ``lon_offset`` arguments: Regridding (interpolation, extrapolation) schemes ------------------------------------------------- -The schemes used for the interpolation and extrapolation operations needed by the -horizontal regridding functionality directly map to their corresponding implementaions -in Iris: +The schemes used for the interpolation and extrapolation operations needed by +the horizontal regridding functionality directly map to their corresponding +implementations in Iris: * ``linear``: `Linear(extrapolation_mode='mask') `_. 
* ``linear_extrapolate``: `Linear(extrapolation_mode='extrapolate') `_. @@ -716,53 +460,60 @@ in Iris: See also :func:`esmvalcore.preprocessor.regrid` .. note:: - **Advanced User and Developer** - - For both vertical and horizontal regridding one can control the extrapolation mode when defining - the interpolation scheme. Controlling the extrapolation mode allows us to avoid situations - where extrapolating values makes little physical sense (e.g. extrapolating beyond the last data point). - The extrapolation mode is controlled by the `extrapolation_mode` keyword. For the available interpolation - schemes available in Iris, the extrapolation_mode keyword must be one of: - - * ``extrapolate`` – the extrapolation points will be calculated by extending the gradient - of the closest two points, - * ``error`` – a ``ValueError`` exception will be raised, notifying an attempt to extrapolate, - * ``nan`` – the extrapolation points will be be set to NaN, - * ``mask`` – the extrapolation points will always be masked, even if the source data is not - a ``MaskedArray``, or - * ``nanmask`` – if the source data is a MaskedArray the extrapolation points will be masked. - Otherwise they will be set to NaN. + + For both vertical and horizontal regridding one can control the + extrapolation mode when defining the interpolation scheme. Controlling the + extrapolation mode allows us to avoid situations where extrapolating values + makes little physical sense (e.g. extrapolating beyond the last data + point). The extrapolation mode is controlled by the `extrapolation_mode` + keyword. 
For the available interpolation schemes available in Iris, the + extrapolation_mode keyword must be one of: + + * ``extrapolate`` – the extrapolation points will be calculated by + extending the gradient of the closest two points; + * ``error`` – a ``ValueError`` exception will be raised, notifying an + attempt to extrapolate; + * ``nan`` – the extrapolation points will be be set to NaN; + * ``mask`` – the extrapolation points will always be masked, even if + the source data is not a ``MaskedArray``; or + * ``nanmask`` – if the source data is a MaskedArray the extrapolation + points will be masked, otherwise they will be set to NaN. .. note:: - **Memory limits for horizontal regridding** - The rigridding mechanism is (at the moment) done with fully realized data in memory, so depending - on how fine the target grid is, it may use a rather large amount of memory. Empirically target grids - of up to ``0.5x0.5`` degrees should not produce any memory-related issues, but be advised that - for resolutions of ``< 0.5`` degrees the regridding becomes very slow and will use a lot of memory. + The regridding mechanism is (at the moment) done with fully realized data in + memory, so depending on how fine the target grid is, it may use a rather + large amount of memory. Empirically target grids of up to ``0.5x0.5`` + degrees should not produce any memory-related issues, but be advised that + for resolutions of ``< 0.5`` degrees the regridding becomes very slow and + will use a lot of memory. + .. _multi-model statistics: Multi-model statistics ====================== -Computing multi-model statistics is an integral part of model analysis and evaluation: individual -models display a variety of biases depedning on model set-up, initial conditions, forcings and -implementation; comparing model data to observational data, these biases have a significanly lower -statistical impact when using a multi-model ensemble. 
ESMValTool has the capability of computing a -number of multi-model statistical measures: using the preprocessor module ``multi_model_statistics`` -will enable the user to ask for either a multi-model ``mean`` and/or ``median`` with a set of argument -parameters passed to ``multi_model_statistics``. - -Multimodel statistics in ESMValTool are computed along the time axis, and as such, -can be computed across a common overlap in time (by specifying ``span: overlap`` argument) or across -the full length in time of each model (by specifying ``span: full`` argument). - -Restrictive computation is also available by excluding any set of models that the user -will not want to include in the statistics (by setting ``exclude: [excluded models list]`` argument). -The implementation has a few restrictions that apply to the input data: model datasets must have -consistent shapes, and from a statistical point of view, this is needed since weights are not yet -implemented; also higher dimesnional data is not supported (ie anything with dimensionality higher -than four: time, vertical axis, two horizontal axes). +Computing multi-model statistics is an integral part of model analysis and +evaluation: individual models display a variety of biases depending on model +set-up, initial conditions, forcings and implementation; comparing model data +to observational data, these biases have a significantly lower statistical +impact when using a multi-model ensemble. ESMValTool has the capability of +computing a number of multi-model statistical measures: using the preprocessor +module ``multi_model_statistics`` will enable the user to ask for either a +multi-model ``mean`` and/or ``median`` with a set of argument parameters passed to ``multi_model_statistics``. 
+ +Multimodel statistics in ESMValTool are computed along the time axis, and as +such, can be computed across a common overlap in time (by specifying ``span: +overlap`` argument) or across the full length in time of each model (by +specifying ``span: full`` argument). + +Restrictive computation is also available by excluding any set of models that +the user will not want to include in the statistics (by setting ``exclude: +[excluded models list]`` argument). The implementation has a few restrictions +that apply to the input data: model datasets must have consistent shapes, and +from a statistical point of view, this is needed since weights are not yet +implemented; also higher dimesnional data is not supported (ie anything with +dimensionality higher than four: time, vertical axis, two horizontal axes). .. code-block:: yaml @@ -777,50 +528,285 @@ see also :func:`esmvalcore.preprocessor.multi_model_statistics`. .. note:: - **Memory limits for multimodel statistics** + Note that the multimodel array operations, albeit performed in + per-time/per-horizontal level loops to save memory, could, however, be + rather memory-intensive (since they are not performed lazily as + yet). Section MemoryUse_ details the memory intake for different run + scenarios, but as a thumb rule, for the multimodel preprocessor, the + expected maximum memory intake could be approximated as the number of + datasets multiplied by the average size in memory for one dataset. - Note that the multimodel array operations, albeit performed in per-time/per-horizontal level - loops to save memory, could, however, be rather memory-intensive (since they are not performed - lazily as yet). Section MemoryUse_ details the memory intake for different run scenarios, but - as a thumb rule, for the multimodel preprocessor, the expected maximum memory intake could be - approximated as the number of datasets multiplied by the average size in memory for one dataset. -.. _MemoryUse: +.. 
_time operations: -Information on maximum memory required -====================================== -In the most general case, we can set upper limits on the maximum memory the anlysis will require: +Time manipulation +================= +The ``_time.py`` module contains the following preprocessor functions: +* ``extract_time``: Extract a time range from an Iris ``cube``. +* ``extract_season``: Extract only the times that occur within a specific + season. +* ``extract_month``: Extract only the times that occur within a specific month. +* ``time_average``: Take the weighted average over the time dimension. +* ``seasonal_mean``: Produces a mean for each season (DJF, MAM, JJA, SON) +* ``annual_mean``: Produces an annual or decadal mean. +* ``regrid_time``: Aligns the time axis of each dataset to have common time + points and calendars. -``Ms = (R + N) x F_eff - F_eff`` - when no multimodel analysis is performed; +``extract_time`` +---------------- -``Mm = (2R + N) x F_eff - 2F_eff`` - when multimodel analysis is performed; +This function subsets a dataset between two points in times. It removes all +times in the dataset before the first time and after the last time point. +The required arguments are relatively self explanatory: -where +* ``start_year`` +* ``start_month`` +* ``start_day`` +* ``end_year`` +* ``end_month`` +* ``end_day`` -* ``Ms``: maximum memory for non-multimodel module -* ``Mm``: maximum memory for multimodel module -* ``R``: computational efficiency of module; `R` is typically 2-3 -* ``N``: number of datasets -* ``F_eff``: average size of data per dataset where ``F_eff = e x f x F`` - where ``e`` is the factor that describes how lazy the data is (``e = 1`` for fully realized data) - and ``f`` describes how much the data was shrunk by the immediately previous module e.g. 
- time extraction, area selection or level extraction; note that for fix_data ``f`` relates only - to the time extraction, if data is exact in time (no time selection) ``f = 1`` for fix_data +These start and end points are set using the dataset's native calendar. +All six arguments should be given as integers - the named month string +will not be accepted. + +See also :func:`esmvalcore.preprocessor.extract_time`. -so for cases when we deal with a lot of datasets ``R + N \approx N``, data is fully realized, assuming -an average size of 1.5GB for 10 years of `3D` netCDF data, ``N`` datasets will require +``extract_season`` +------------------ +Extract only the times that occur within a specific season. -``Ms = 1.5 x (N - 1)`` GB +This function only has one argument: ``season``. This is the named season to +extract, i.e. one of DJF, MAM, JJA, SON. -``Mm = 1.5 x (N - 2)`` GB +Note that this function does not change the time resolution. If your original +data is in monthly time resolution, then this function will return three +monthly datapoints per year. + +If you want the seasonal average, then this function needs to be combined with +the ``seasonal_mean`` function, below. + +See also :func:`esmvalcore.preprocessor.extract_season`. + +``extract_month`` +----------------- + +The function extracts the times that occur within a specific month. +This function only has one argument: ``month``. This value should be an integer +between 1 and 12 as the named month string will not be accepted. + +See also :func:`esmvalcore.preprocessor.extract_month`. + +.. _time_average: + +``time_average`` +---------------- + +This function takes the weighted average over the time dimension. This +function requires no arguments and removes the time dimension of the cube. + +See also :func:`esmvalcore.preprocessor.time_average`. + +``seasonal_mean`` +----------------- + +This function produces a seasonal mean for each season (DJF, MAM, JJA, SON). 
+Note that this function will not check for missing time points. For instance, +if you are looking at the DJF field, but your datasets starts on January 1st, +the first DJF field will only contain data from January and February. + +We recommend using the extract_time to start the dataset from the following +December and remove such biased initial datapoints. + +See also :func:`esmvalcore.preprocessor.seasonal_mean`. + +``annual_mean`` +--------------- + +This function produces an annual or a decadal mean. The only argument is the +decadal boolean switch. When this switch is set to true, this function +will output the decadal averages. + +See also :func:`esmvalcore.preprocessor.annual_mean`. + +``regrid_time`` +--------------- + +This function aligns the time points of each component dataset so that the +dataset Iris cubes can be subtracted. The operation makes the datasets time +points common and sets common calendars; it also resets the time bounds and +auxiliary coordinates to reflect the artifically shifted time points. Current +implementation for monthly and daily data; the ``frequency`` is set +automatically from the variable CMOR table unless a custom ``frequency`` is set +manually by the user in recipe. + +See also :func:`esmvalcore.preprocessor.regrid_time`. -As a thumb rule, the maximum required memory at a certain time, when meeding multimodel analysis -could be estimated by multiplying the number of datasets by the average file size of all the datasets; -this memory intake is high but also assumes that all data is fully realized in memory; this aspect -will gradually change and the amount of realized data will decrease with the increase of ``dask`` use. +.. _area operations: + +Area manipulation +================= +The ``_area.py`` module contains the following preprocessor functions: + +* ``extract_region``: Extract a region from a cube based on ``lat/lon`` + corners. +* ``zonal_means``: Calculates the zonal or meridional means. 
+* ``area_statistics``: Calculates the average value over a region. +* ``extract_named_regions``: Extract a specific region from the region + coordinate. + + +``extract_region`` +------------------ + +This function masks data outside the requested rectangular region. The boundaries +of the region are provided as latitude and longitude coordinates in the +arguments: + +* ``start_longitude`` +* ``end_longitude`` +* ``start_latitude`` +* ``end_latitude`` + +Note that this function can only be used to extract a rectangular region. + +See also :func:`esmvalcore.preprocessor.extract_region`. + + +``zonal_means`` +--------------- + +The function calculates the zonal or meridional means. While this function is +named ``zonal_means``, it can be used to apply several different operations in +a zonal or meridional direction. This function takes two arguments: + +* ``coordinate``: Which direction to apply the operation: latitude or longitude +* ``mean_type``: Which operation to apply: mean, std_dev, variance, median, min + or max + +See also :func:`esmvalcore.preprocessor.zonal_means`. + + +``area_statistics`` +------------------- + +This function calculates the average value over a region - weighted by the cell +areas of the region. This function takes the argument, ``operator``: the name +of the operation to apply. + +This function can be used to apply several different operations in the +horizontal plane: mean, standard deviation, median, variance, minimum and maximum. + +Note that this function is applied over the entire dataset. If only a specific +region, depth layer or time period is required, then the unwanted data need to be +removed using other preprocessor operations in advance. + +See also :func:`esmvalcore.preprocessor.area_statistics`. + + +``extract_named_regions`` +------------------------- + +This function extracts a specific named region from the data. 
This function +takes the following argument: ``regions`` which is either a string or a list +of strings of named regions. Note that the dataset must have a ``region`` +cooordinate which includes a list of strings as values. This function then +matches the named regions against the requested string. + +See also :func:`esmvalcore.preprocessor.extract_named_regions`. + + +.. _volume operations: + +Volume manipulation +=================== +The ``_volume.py`` module contains the following preprocessor functions: + +* ``extract_volume``: Extract a specific depth range from a cube. +* ``volume_statistics``: Calculate the volume-weighted average. +* ``depth_integration``: Integrate over the depth dimension. +* ``extract_transect``: Extract data along a line of constant latitude or + longitude. +* ``extract_trajectory``: Extract data along a specified trajectory. + + +``extract_volume`` +------------------ + +Extract a specific range in the `z`-direction from a cube. This function +takes two arguments, a minimum and a maximum (``z_min`` and ``z_max``, +respectively) in the `z`-direction. + +Note that this requires the requested `z`-coordinate range to be the same sign +as the Iris cube. ie, if the cube has `z`-coordinate as negative, then +``z_min`` and ``z_max`` need to be negative numbers. + +See also :func:`esmvalcore.preprocessor.extract_volume`. + + +``volume_statistics`` +--------------------- + +This function calculates the volume-weighted average across three dimensions, +but maintains the time dimension. + +This function takes the argument: ``operator``, which defines the operation to +apply over the volume. + +No depth coordinate is required as this is determined by Iris. This function +works best when the ``fx_files`` provide the cell volume. + +See also :func:`esmvalcore.preprocessor.volume_statistics`. + + +``depth_integration`` +--------------------- + +This function integrate over the depth dimension. 
This function does a weighted +sum along the `z`-coordinate, and removes the `z` direction of the output +cube. This preprocessor takes no arguments. + +See also :func:`esmvalcore.preprocessor.depth_integration`. + + +``extract_transect`` +-------------------- + +This function extract data along a line of constant latitude or longitude. +This function takes two arguments, although only one is strictly required. +The two arguments are ``latitude`` and ``longitude``. One of these arguments +needs to be set to a float, and the other can then be either ignored or set to +a minimum or maximum value. + +For example, if we set latitude to 0 N and leave longitude blank, it would +produce a cube along the Equator. On the other hand, if we set latitude to 0 +and then set longitude to ``[40., 100.]`` this will produce a transect of the +Equator in the Indian Ocean. + +See also :func:`esmvalcore.preprocessor.extract_transect`. + + +``extract_trajectory`` +---------------------- + +This function extract data along a specified trajectory. +The three areguments are: ``latitudes``, ``longitudes`` and number of point +needed for extrapolation ``number_points``. + +If two points are provided, the ``number_points`` argument is used to set a +the number of places to extract between the two end points. + +If more than two points are provided, then ``extract_trajectory`` will produce +a cube which has extrapolated the data of the cube to those points, and +``number_points`` is not needed. + +Note that this function uses the expensive ``interpolate`` method from +``Iris.analysis.trajectory``, but it may be necceasiry for irregular grids. + +See also :func:`esmvalcore.preprocessor.extract_trajectory`. .. _unit conversion: @@ -843,3 +829,44 @@ will guarantee homogeneous input for the diagnostics. amount based unit is not supported at the moment. See also :func:`esmvalcore.preprocessor.convert_units`. + + +.. 
_MemoryUse: + +Information on maximum memory required +====================================== +In the most general case, we can set upper limits on the maximum memory the +anlysis will require: + + +``Ms = (R + N) x F_eff - F_eff`` - when no multimodel analysis is performed; + +``Mm = (2R + N) x F_eff - 2F_eff`` - when multimodel analysis is performed; + +where + +* ``Ms``: maximum memory for non-multimodel module +* ``Mm``: maximum memory for multimodel module +* ``R``: computational efficiency of module; `R` is typically 2-3 +* ``N``: number of datasets +* ``F_eff``: average size of data per dataset where ``F_eff = e x f x F`` + where ``e`` is the factor that describes how lazy the data is (``e = 1`` for + fully realized data) and ``f`` describes how much the data was shrunk by the + immediately previous module, e.g. time extraction, area selection or level + extraction; note that for fix_data ``f`` relates only to the time extraction, + if data is exact in time (no time selection) ``f = 1`` for fix_data so for + cases when we deal with a lot of datasets ``R + N \approx N``, data is fully + realized, assuming an average size of 1.5GB for 10 years of `3D` netCDF data, + ``N`` datasets will require: + + +``Ms = 1.5 x (N - 1)`` GB + +``Mm = 1.5 x (N - 2)`` GB + +As a thumb rule, the maximum required memory at a certain time, when meeding +multimodel analysis could be estimated by multiplying the number of datasets by +the average file size of all the datasets; this memory intake is high but also +assumes that all data is fully realized in memory; this aspect will gradually +change and the amount of realized data will decrease with the increase of +``dask`` use. 
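The two upper bounds in the memory section above can be evaluated directly. The following is a minimal sketch, not part of esmvalcore: the function names and the default ``r=2.0`` are assumptions chosen for illustration (the text only states that ``R`` is typically 2-3), and the result is in whatever unit ``F`` is given in.

```python
def effective_size(file_size, laziness=1.0, shrink=1.0):
    """Return F_eff = e x f x F.

    laziness (e): 1.0 means the data is fully realized in memory.
    shrink (f): 1.0 means the previous step did not reduce the data.
    """
    return laziness * shrink * file_size


def max_memory_single(n_datasets, f_eff, r=2.0):
    """Upper bound Ms = (R + N) x F_eff - F_eff (no multimodel step)."""
    return (r + n_datasets) * f_eff - f_eff


def max_memory_multimodel(n_datasets, f_eff, r=2.0):
    """Upper bound Mm = (2R + N) x F_eff - 2F_eff (multimodel step)."""
    return (2 * r + n_datasets) * f_eff - 2 * f_eff


if __name__ == "__main__":
    # Hypothetical run: 20 datasets of 1.5 GB each, fully realized,
    # with no reduction by the preceding step.
    f_eff = effective_size(1.5, laziness=1.0, shrink=1.0)
    print("Ms [GB]:", max_memory_single(20, f_eff))
    print("Mm [GB]:", max_memory_multimodel(20, f_eff))
```

Under these assumed inputs the bounds come out at 31.5 GB and 33 GB, slightly above the large-``N`` approximations ``1.5 x (N - 1)`` and ``1.5 x (N - 2)`` because ``R`` is kept in the sum rather than neglected.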
From 97f3d6c8501cd2171c82178f8c07a552ca9f0431 Mon Sep 17 00:00:00 2001 From: Mattia Righi Date: Tue, 30 Jul 2019 16:52:25 +0200 Subject: [PATCH 44/49] Minor cleaning --- doc/esmvalcore/utils.rst | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/doc/esmvalcore/utils.rst b/doc/esmvalcore/utils.rst index ac7e33f2c5..15d4bdec02 100644 --- a/doc/esmvalcore/utils.rst +++ b/doc/esmvalcore/utils.rst @@ -4,20 +4,23 @@ Utilities ********* -This section provides extra information on topics that are not part of ESMValTool -code base but are used by ESMValTool directly or indirectly. +This section provides extra information on topics that are not part of +ESMValTool code base but are used by ESMValTool directly or indirectly. Brief introduction to YAML ========================== -While ``.yaml`` or ``.yml`` is a relatively common format, maybe users may not have +While ``.yaml`` or ``.yml`` is a relatively common format, users may not have encountered this language before. The key information about this format is: -- yaml is a human friendly markup language. -- yaml is commonly used for configuration files (gradually replacing the venerable ``.ini``) -- the syntax is relatively straightforward +- yaml is a human friendly markup language; +- yaml is commonly used for configuration files (gradually replacing the + venerable ``.ini``); +- the syntax is relatively straightforward; - indentation matters a lot (like ``Python``)! -- yaml is case sensitive -- have a look at this `yaml tutorial `_ -- have a look at this `yaml quick reference card `_ -- ESMValTool uses the `yamllint `_ linter tool. +- yaml is case sensitive; + +More information can be found in the `yaml tutorial +`_ and `yaml quick reference card +`_. ESMValTool uses the `yamllint +`_ linter tool to check recipe syntax. 
From 3ad661d895f5809bc781c3df8895b53965c35e52 Mon Sep 17 00:00:00 2001 From: Mattia Righi Date: Fri, 2 Aug 2019 11:13:51 +0200 Subject: [PATCH 45/49] Update doc/esmvalcore/config.rst Co-Authored-By: Lee de Mora --- doc/esmvalcore/config.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/esmvalcore/config.rst b/doc/esmvalcore/config.rst index 859a3c5dc4..442317bb84 100644 --- a/doc/esmvalcore/config.rst +++ b/doc/esmvalcore/config.rst @@ -141,7 +141,7 @@ the user. .. note:: You choose your config.yml file at run time, so you could have several of - them available with different purposes. One for formalised run, one for + them available with different purposes. One for a formalised run, another for debugging, etc. From 48d042595c1de2814f9842bb4e67e9b767f47efa Mon Sep 17 00:00:00 2001 From: Mattia Righi Date: Fri, 2 Aug 2019 11:14:53 +0200 Subject: [PATCH 46/49] Update doc/esmvalcore/config.rst Co-Authored-By: Lee de Mora --- doc/esmvalcore/config.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/esmvalcore/config.rst b/doc/esmvalcore/config.rst index 442317bb84..c3cd23069d 100644 --- a/doc/esmvalcore/config.rst +++ b/doc/esmvalcore/config.rst @@ -140,7 +140,7 @@ the user. .. note:: - You choose your config.yml file at run time, so you could have several of + You choose your ``config-user.yml`` file at run time, so you could have several of them available with different purposes. One for a formalised run, another for debugging, etc. 
From c89aabd376caae7d8d816a12d580a6a28928e3bd Mon Sep 17 00:00:00 2001 From: Mattia Righi Date: Fri, 2 Aug 2019 11:20:26 +0200 Subject: [PATCH 47/49] Update doc/esmvalcore/config.rst Co-Authored-By: Lee de Mora --- doc/esmvalcore/config.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/doc/esmvalcore/config.rst b/doc/esmvalcore/config.rst index c3cd23069d..5c114b0a3e 100644 --- a/doc/esmvalcore/config.rst +++ b/doc/esmvalcore/config.rst @@ -188,7 +188,7 @@ actual values supplied for by the user in ``config-user.yml`` and MOHC/HadGEM2-CC/rcp85/mon/ocean/Omon/r1i1p1/latest/tos -Again, for a more in-depth description this process, as part of the data +Again, for a more in-depth description of this process, as part of the data retrieval mechanism, please see :ref:`CMOR-DRS`. .. _config-ref: From d292a1f92d62b4a31c36cc8687314d80153e474f Mon Sep 17 00:00:00 2001 From: Mattia Righi Date: Fri, 2 Aug 2019 11:42:06 +0200 Subject: [PATCH 48/49] Implement reviewers suggestions --- doc/esmvalcore/config.rst | 2 +- doc/esmvalcore/datafinder.rst | 24 +++++++------- doc/esmvalcore/preprocessor.rst | 59 +++++++++++++++++---------------- doc/esmvalcore/recipe.rst | 5 +-- 4 files changed, 46 insertions(+), 44 deletions(-) diff --git a/doc/esmvalcore/config.rst b/doc/esmvalcore/config.rst index c3cd23069d..d8903bd61d 100644 --- a/doc/esmvalcore/config.rst +++ b/doc/esmvalcore/config.rst @@ -72,7 +72,7 @@ option: save_intermediary_cubes: false # Remove the preproc dir if all fine - # this option true will remove ALL preprocessor files + # if this option is set to "true", ALL preprocessor files will be removed # CAUTION when using: if you need those files, set it to false remove_preproc_dir: true diff --git a/doc/esmvalcore/datafinder.rst b/doc/esmvalcore/datafinder.rst index 9a49a88c30..a1b58efa44 100644 --- a/doc/esmvalcore/datafinder.rst +++ b/doc/esmvalcore/datafinder.rst @@ -8,13 +8,12 @@ Overview ======== Data discovery and retrieval is the first step in any 
evaluation process; ESMValTool uses a `semi-automated` data finding mechanism with inputs from both -the user configuration file and the recipe file. The reason why the data finder -module is `semi`-automated is that the user will have to provide the tool with -a set of parameters related to the data needed; the reason why it is -semi-`automated` is that once these parameters have been provided, the tool -will automatically find the right data. We will detail below the data finding -and retrieval process and the inputs the user needs to specify, giving examples -on how to use the data finding routine under different scenarios. +the user configuration file and the recipe file: this means that the user will +have to provide the tool with a set of parameters related to the data needed +and once these parameters have been provided, the tool will automatically find +the right data. We will detail below the data finding and retrieval process and +the input the user needs to specify, giving examples on how to use the data +finding routine under different scenarios. .. _CMOR-DRS: @@ -108,7 +107,7 @@ The snippet is another way to retrieve data from a ``ROOT`` directory that has no DRS-like structure; ``default`` indicates that the data lies in a directory that -contains all the files without any structire. +contains all the files without any structure. .. note:: When using ``CMIP6: default`` or ``CMIP5: default`` it is important to @@ -143,7 +142,7 @@ Explaining ``config-user/rootpath:`` .. 
code-block:: yaml - CMIP6: [/badc/cmip6/data/CMIP6/CMIP, /home/users/joepesci/cmip_data] + CMIP6: [/badc/cmip6/data/CMIP6/CMIP, /home/users/johndoe/cmip_data] * ``OBS``: this is the `root` path(s) to where the observational datasets are stored; again, this could be a single path or a list of paths, just like for @@ -246,6 +245,7 @@ CMOR-DRS_ are used again and the file will be automatically found: ``/group_workspaces/jasmin4/esmeval/obsdata-v2/Tier3/ERA-Interim/OBS_ERA-Interim_reanaly_1_Amon_ta_201401-201412.nc`` -Note that for observational data for ``drs: default`` the ``default`` directory -must contain a sub-directory: -``TierX`` (``Tier1``, ``Tier2`` or ``Tier3``). +Since observational data are organized in Tiers depending on their level of +public availability, the ``default`` directory must be structured accordingly +with sub-directories ``TierX`` (``Tier1``, ``Tier2`` or ``Tier3``), even when +``drs: default``. diff --git a/doc/esmvalcore/preprocessor.rst b/doc/esmvalcore/preprocessor.rst index e35a8ab1e8..745377daa8 100644 --- a/doc/esmvalcore/preprocessor.rst +++ b/doc/esmvalcore/preprocessor.rst @@ -34,7 +34,7 @@ Overview standardizable data analysis functions as possible so that the diagnostics can focus on the specific scientific tasks they carry. The preprocessor is linked to the diagnostics library and the diagnostic execution is seamlessly performed - after the preprocessor has completed the its steps. The benefit of having a + after the preprocessor has completed its steps. The benefit of having a preprocessing unit separate from the diagnostics library include: * ease of integration of new preprocessing routines; @@ -59,7 +59,7 @@ Each of the preprocessor operations is written in a dedicated python module and all of them receive and return an Iris `cube `_ , working sequentially on the data with no interactions between them. 
The order in which -the preprocessor operations is applied is set by default in order to minimize +the preprocessor operations is applied is set by default to minimize the loss of information due to, for example, temporal and spatial subsetting or multi-model averaging. Nevertheless, the user is free to change such order to address specific scientific requirements, but keeping in mind that some @@ -88,8 +88,8 @@ it and to provide the corresponding CMOR table. This is to guarantee the proper metadata definition is attached to the derived data. Such custom CMOR tables are collected as part of the `ESMValTool core package `_. By default, the variable -derivation will be applied only if not already available in the input data, but -the derivation can be forced by setting the appropriate flag. +derivation will be applied only if the variable is not already available in the +input data, but the derivation can be forced by setting the appropriate flag. .. code-block:: yaml @@ -236,7 +236,7 @@ and requires only one argument: ``mask_out``: either ``land`` or ``sea``. The preprocessor automatically retrieves the corresponding mask (``fx: stfof`` in this case) and applies it so that sea-covered grid cells are set to -missing. Conversely, it retrieves the ``fx: sftlf`` mask when land need to be +missing. Conversely, it retrieves the ``fx: sftlf`` mask when land needs to be masked out, respectively. If the corresponding fx file is not found (which is the case for some models and almost all observational datasets), the preprocessor attempts to mask the data using Natural Earth mask files (that are @@ -272,9 +272,9 @@ Mask files At the core of the land/sea/ice masking in the preprocessor are the mask files (whether it be fx type or Natural Earth type of files); these files (bar -Natural Earth) can be retrived and used in the diagnostic phase as well or -solely. 
By specifying the ``fx_files:`` key in the variable in diagnostic in -the recipe, and populating it with a list of desired files e.g.: +Natural Earth) can be retrived and used in the diagnostic phase as well. By +specifying the ``fx_files:`` key in the variable in diagnostic in the recipe, +and populating it with a list of desired files e.g.: .. code-block:: yaml @@ -284,7 +284,7 @@ the recipe, and populating it with a list of desired files e.g.: fx_files: [sftlf, sftof, sftgif, areacello, areacella] Such a recipe will automatically retrieve all the ``fx_files: [sftlf, sftof, -sftgif, areacello, areacella]``-type fx files for each of the variables that +sftgif, areacello, areacella]``-type fx files for each of the variables they are needed for and then, in the diagnostic phase, these mask files will be available for the developer to use them as they need to. The `fx_files` attribute of the big `variable` nested dictionary that gets passed to the @@ -305,10 +305,10 @@ Missing values masks Missing (masked) values can be a nuisance especially when dealing with multimodel ensembles and having to compute multimodel statistics; different -numbers of missing data from dataset to datest may introduce biases and +numbers of missing data from dataset to dataset may introduce biases and artifically assign more weight to the datasets that have less missing data. This is handled in ESMValTool via the missing values masks: two types of -such masks are available: one for the multimodel case and another for the +such masks are available, one for the multimodel case and another for the single model case. 
The multimodel missing values mask (``mask_fillvalues``) is a preprocessor step @@ -494,13 +494,14 @@ See also :func:`esmvalcore.preprocessor.regrid` Multi-model statistics ====================== Computing multi-model statistics is an integral part of model analysis and -evaluation: individual models display a variety of biases depedning on model +evaluation: individual models display a variety of biases depending on model set-up, initial conditions, forcings and implementation; comparing model data to observational data, these biases have a significanly lower statistical impact when using a multi-model ensemble. ESMValTool has the capability of computing a number of multi-model statistical measures: using the preprocessor module ``multi_model_statistics`` will enable the user to ask for either a -multi-model ``mean`` and/or ``median`` with a set of argument parameters passed to ``multi_model_statistics``. +multi-model ``mean`` and/or ``median`` with a set of argument parameters passed +to ``multi_model_statistics``. Multimodel statistics in ESMValTool are computed along the time axis, and as such, can be computed across a common overlap in time (by specifying ``span: @@ -512,7 +513,7 @@ the user will not want to include in the statistics (by setting ``exclude: [excluded models list]`` argument). The implementation has a few restrictions that apply to the input data: model datasets must have consistent shapes, and from a statistical point of view, this is needed since weights are not yet -implemented; also higher dimesnional data is not supported (ie anything with +implemented; also higher dimensional data is not supported (i.e. anything with dimensionality higher than four: time, vertical axis, two horizontal axes). .. code-block:: yaml @@ -604,7 +605,7 @@ See also :func:`esmvalcore.preprocessor.extract_month`. ``time_average`` ---------------- -This functions takes the weighted average over the time dimension. 
This +This function takes the weighted average over the time dimension. This function requires no arguments and removes the time dimension of the cube. See also :func:`esmvalcore.preprocessor.time_average`. @@ -634,13 +635,13 @@ See also :func:`esmvalcore.preprocessor.annual_mean`. ``regrid_time`` --------------- -This function aligns the time points of each component dataset so that the -dataset Iris cubes can be subtracted. The operation makes the datasets time -points common and sets common calendars; it also resets the time bounds and -auxiliary coordinates to reflect the artifically shifted time points. Current -implementation for monthly and daily data; the ``frequency`` is set -automatically from the variable CMOR table unless a custom ``frequency`` is set -manually by the user in recipe. +This function aligns the time points of each component dataset so that the Iris +cubes from different datasets can be subtracted. The operation makes the +datasets time points common and sets common calendars; it also resets the time +bounds and auxiliary coordinates to reflect the artifically shifted time +points. Current implementation for monthly and daily data; the ``frequency`` is +set automatically from the variable CMOR table unless a custom ``frequency`` is +set manually by the user in recipe. See also :func:`esmvalcore.preprocessor.regrid_time`. @@ -710,7 +711,7 @@ See also :func:`esmvalcore.preprocessor.area_statistics`. ``extract_named_regions`` ------------------------- -This function extract a specific named region from the data. This function +This function extracts a specific named region from the data. This function takes the following argument: ``regions`` which is either a string or a list of strings of named regions. Note that the dataset must have a ``region`` cooordinate which includes a list of strings as values. This function then @@ -765,9 +766,9 @@ See also :func:`esmvalcore.preprocessor.volume_statistics`. 
``depth_integration`` --------------------- -This function integrate over the depth dimension. This function does a weighted -sum along the `z`-coordinate, and removes the `z` direction of the output -cube. This preprocessor takes no arguments. +This function integrates over the depth dimension. This function does a +weighted sum along the `z`-coordinate, and removes the `z` direction of the +output cube. This preprocessor takes no arguments. See also :func:`esmvalcore.preprocessor.depth_integration`. @@ -775,7 +776,7 @@ See also :func:`esmvalcore.preprocessor.depth_integration`. ``extract_transect`` -------------------- -This function extract data along a line of constant latitude or longitude. +This function extracts data along a line of constant latitude or longitude. This function takes two arguments, although only one is strictly required. The two arguments are ``latitude`` and ``longitude``. One of these arguments needs to be set to a float, and the other can then be either ignored or set to @@ -804,7 +805,7 @@ a cube which has extrapolated the data of the cube to those points, and ``number_points`` is not needed. Note that this function uses the expensive ``interpolate`` method from -``Iris.analysis.trajectory``, but it may be necceasiry for irregular grids. +``Iris.analysis.trajectory``, but it may be neccesary for irregular grids. See also :func:`esmvalcore.preprocessor.extract_trajectory`. 
@@ -864,7 +865,7 @@ where ``Mm = 1.5 x (N - 2)`` GB -As a thumb rule, the maximum required memory at a certain time, when meeding +As a rule of thumb, the maximum required memory at a certain time for multimodel analysis could be estimated by multiplying the number of datasets by the average file size of all the datasets; this memory intake is high but also assumes that all data is fully realized in memory; this aspect will gradually diff --git a/doc/esmvalcore/recipe.rst b/doc/esmvalcore/recipe.rst index ea856cd05f..2bfa58e161 100644 --- a/doc/esmvalcore/recipe.rst +++ b/doc/esmvalcore/recipe.rst @@ -117,8 +117,9 @@ Each preprocessor section includes: The following snippet is an example of a preprocessor named ``prep_map`` that contains multiple preprocessing steps (:ref:`Horizontal regridding` with two -arguments, :ref:`Time operations` with no arguments and :ref:`Multi-model -statistics` with two arguments): +arguments, :ref:`Time operations` with no arguments (i.e., calcualting the +average over the time dimension) and :ref:`Multi-model statistics` with two +arguments): .. code-block:: yaml From 66afd8c9b25d1f9b5e984bb065fb55d4ed703835 Mon Sep 17 00:00:00 2001 From: Mattia Righi Date: Fri, 2 Aug 2019 12:02:31 +0200 Subject: [PATCH 49/49] Fix links --- doc/esmvalcore/datafinder.rst | 16 ++++++++++++---- doc/esmvalcore/preprocessor.rst | 19 ++++++++++--------- 2 files changed, 22 insertions(+), 13 deletions(-) diff --git a/doc/esmvalcore/datafinder.rst b/doc/esmvalcore/datafinder.rst index a1b58efa44..206b3d1766 100644 --- a/doc/esmvalcore/datafinder.rst +++ b/doc/esmvalcore/datafinder.rst @@ -212,16 +212,22 @@ The tool will then use the root path ``/badc/cmip6/data/CMIP6/CMIP`` and the dataset information and will assemble the full DRS path using information from CMOR-DRS_ and establish the path to the files as: -``/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r1i1p1f2/Amon`` +.. 
code-block:: + + /badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r1i1p1f2/Amon then look for variable ``ta`` and specifically the latest version of the data file: -``/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r1i1p1f2/Amon/ta/gn/latest/`` +.. code-block:: + + /badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r1i1p1f2/Amon/ta/gn/latest/ and finally, using the file naming definition from CMOR-DRS_ find the file: -``/badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r1i1p1f2/Amon/ta/gn/latest/ta_Amon_UKESM1-0-LL_historical_r1i1p1f2_gn_195001-201412.nc`` +.. code-block:: + + /badc/cmip6/data/CMIP6/CMIP/MOHC/UKESM1-0-LL/historical/r1i1p1f2/Amon/ta/gn/latest/ta_Amon_UKESM1-0-LL_historical_r1i1p1f2_gn_195001-201412.nc .. _observations: @@ -243,7 +249,9 @@ and the dataset: in ``recipe.yml`` in ``datasets`` or ``additional_datasets``, the rules set in CMOR-DRS_ are used again and the file will be automatically found: -``/group_workspaces/jasmin4/esmeval/obsdata-v2/Tier3/ERA-Interim/OBS_ERA-Interim_reanaly_1_Amon_ta_201401-201412.nc`` +.. code-block:: + + /group_workspaces/jasmin4/esmeval/obsdata-v2/Tier3/ERA-Interim/OBS_ERA-Interim_reanaly_1_Amon_ta_201401-201412.nc Since observational data are organized in Tiers depending on their level of public availability, the ``default`` directory must be structured accordingly diff --git a/doc/esmvalcore/preprocessor.rst b/doc/esmvalcore/preprocessor.rst index 745377daa8..82bb54833a 100644 --- a/doc/esmvalcore/preprocessor.rst +++ b/doc/esmvalcore/preprocessor.rst @@ -86,7 +86,7 @@ comparison with the observations. To contribute a new derived variable, it is also necessary to define a name for it and to provide the corresponding CMOR table. This is to guarantee the proper metadata definition is attached to the derived data. Such custom CMOR tables -are collected as part of the `ESMValTool core package +are collected as part of the `ESMValCore package `_. 
By default, the variable derivation will be applied only if the variable is not already available in the input data, but the derivation can be forced by setting the appropriate flag. @@ -385,8 +385,9 @@ difference is that interpolation is based on sample data points, while regridding is based on the horizontal grid of another cube (the reference grid). -The underlying regridding mechanism in ESMValTool uses ``cube.regrid()`` method -from Iris, so we point the reader to its documentation: `cube.regrid() `_. +The underlying regridding mechanism in ESMValTool uses the `cube.regrid() +`_ +from Iris. The use of the horizontal regridding functionality is flexible depending on what type of reference grid and what interpolation scheme is preferred. Below @@ -532,11 +533,11 @@ see also :func:`esmvalcore.preprocessor.multi_model_statistics`. Note that the multimodel array operations, albeit performed in per-time/per-horizontal level loops to save memory, could, however, be rather memory-intensive (since they are not performed lazily as - yet). Section MemoryUse_ details the memory intake for different run - scenarios, but as a thumb rule, for the multimodel preprocessor, the - expected maximum memory intake could be approximated as the number of - datasets multiplied by the average size in memory for one dataset. - + yet). The Section on :ref:`Memory use` details the memory intake + for different run scenarios, but as a thumb rule, for the multimodel + preprocessor, the expected maximum memory intake could be approximated as + the number of datasets multiplied by the average size in memory for one + dataset. .. _time operations: @@ -832,7 +833,7 @@ will guarantee homogeneous input for the diagnostics. See also :func:`esmvalcore.preprocessor.convert_units`. -.. _MemoryUse: +.. _Memory use: Information on maximum memory required ======================================