########################################################
Wednesday A: Adding QC to JEDI
########################################################

Introduction
----------------

This session covers the JEDI Unified Forward Operator (UFO) code and its quality control (QC) filters. The tutorial has two main parts. First, you will review how to clone, build and customize a bundle. Second, you will add a new QC filter to JEDI.

This activity has no prerequisites besides access to either a JupyterLab or SSH session.

Filters
----------------

Filters are an essential component of a data assimilation workflow. Filters can change quality control flags (i.e., to reject or retain observations) and observation error variances (e.g., one might wish to increase observation error variances to decrease the observation weight in the analysis instead of rejecting observations altogether).

In JEDI, filters are customizable and generic. This means that you can use the same code (written in C++) to accomplish different tasks (specified by you in a YAML file). This tutorial introduces the YAML file format, explains how to specify a QC filter in YAML, and explains the backend code that actually performs the filtering. We will implement a relatively simple filter in C++.

Step 1: Access your AWS instance
--------------------------------

Connect to your assigned compute node. You will use the same method as yesterday.

Step 2: Clone and customize the fv3-bundle
------------------------------------------

JEDI bundles are a convenient way for users to build all of the JEDI components that are needed for a particular application.

Open a new terminal and navigate to the ``~/jedi`` directory. We are going to clone a new copy of the ``fv3-bundle``, which you worked with in previous sessions. We are cloning a new copy of this bundle because we want to modify the JEDI source code to add a new filter to the Unified Forward Operator (UFO) code.
Whenever you add a new feature, it is a very good idea to start with a fresh copy of the source code, so that you can be sure you have not made accidental changes to unrelated files. To clone the fv3 bundle, run this command:

.. code:: bash

    git clone https://github.com/JCSDA-academy/fv3-bundle.git fv3-bundle-day3

This command gets the release branch of the bundle and places it in the ``fv3-bundle-day3`` directory. Change into this directory and look around. When a bundle is first cloned, it is very small, having only a ``CMakeLists.txt`` file and a few supplemental configuration and documentation files.

Any file named ``CMakeLists.txt`` represents a set of instructions for the build system (i.e. CMake and ecbuild). This particular ``CMakeLists.txt`` file describes the components that are incorporated into this JEDI software bundle. The "Getting Started" activity provided a brief introduction to this file. If you modify JEDI, you will likely edit this file often, so some review is helpful.

Open the ``CMakeLists.txt`` file in a text editor and examine it. It can be divided into three parts.

- The first part (lines 1-35) is a preamble that you normally never have to change. The code here tells ecbuild that we are declaring a bundle that requires a certain minimum version of the ``cmake`` program, as well as C, C++ and Fortran compilers.
- The second part (lines 36-69) has several ``ecbuild_bundle`` declarations that tell ecbuild the components of the bundle and where to find them. By default, the bundle depends on the ``develop`` branch of several repositories. You can easily change the branches that your code targets. This is quite helpful when you are adding a new feature or are testing existing code across multiple systems.
- The final part (lines 70-73) is again mostly fixed. It calls a macro function in ecbuild to execute the bundle's instructions.

Step 3: Enter the JEDI container, run ecbuild and make a new feature branch
---------------------------------------------------------------------------

This again follows the instructions from yesterday. You should already have downloaded the Singularity container. Enter the container using:

.. code:: bash

    singularity shell -e jedi-gnu-openmpi-dev_latest.sif

Once inside the container, create a build directory at ``~/jedi/fv3-bundle-day3/build`` and change into it (``mkdir ~/jedi/fv3-bundle-day3/build && cd ~/jedi/fv3-bundle-day3/build``).

Next, we invoke ecbuild to download the repositories' source code and "configure" the build. Ecbuild will find all of the required JEDI dependencies, set initial build flags, and generate a series of GNU Makefiles. In short, ecbuild sets up the build system. To invoke ecbuild, run:

.. code:: bash

    ecbuild --build=RelWithDebInfo ..

The ``--build=RelWithDebInfo`` option to ecbuild specifies that we want an optimized build of the fv3-bundle, but with debugging symbols incorporated into the executables and libraries. We will be adding a new filter, and it is helpful to have debugging information available in case something goes wrong. The trailing ``..`` is the path to the bundle's source directory, which is one level above the build directory.

Once ecbuild completes, verify that it reports that configuration has succeeded.

Ecbuild has cloned the stable branches of several repositories. However, in this tutorial we want to make modifications to the UFO code. In JEDI, we aim to follow the "git flow" paradigm when developing, and we will discuss this in depth in a later lecture on Friday. In short, the ``develop`` branch contains the development version of each repository. This version of the code should always build and test successfully. Whenever you want to add a new feature to the code, you should do your work in another branch of the repository. Once the work is done, you can issue a "Pull Request" to have other JEDI users review your code and merge your changes into the ``develop`` branch.
Every month or so, we aim to release a stable, consistent snapshot of the JEDI repositories. We mark that snapshot of the development branch with a git "tag" (in effect, an immutable branch).

Open the top-level CMakeLists.txt file (``~/jedi/fv3-bundle-day3/CMakeLists.txt``). Change line 59 from

.. code:: bash

    ecbuild_bundle( PROJECT ufo GIT "https://github.com/jcsda-academy/ufo.git" TAG 1.0.0 )

to

.. code:: bash

    ecbuild_bundle( PROJECT ufo GIT "https://github.com/jcsda-academy/ufo.git" BRANCH feature/new_qc_filter_example )

Then, enter the source code's ``ufo`` subdirectory (``cd ~/jedi/fv3-bundle-day3/ufo``). NOTE: There is also a ``ufo`` directory under the build directory, at ``~/jedi/fv3-bundle-day3/build/ufo``. That is not the directory that you want. Run this git command to create a new branch from the ``1.0.0`` tag:

.. code:: bash

    git checkout -b feature/new_qc_filter_example

Ordinarily, you would branch from ``develop`` when making new feature branches. The meanings of the ``master``, ``develop``, and tagged branches will be discussed in the git-flow lectures and in later practical exercises.

Finally, return to the build directory (``~/jedi/fv3-bundle-day3/build``) and re-run ecbuild:

.. code:: bash

    ecbuild ..

Step 4: Compiling and testing
-----------------------------

Now we can build the code and run the unit tests.

.. code:: bash

    make -j4
    ctest

This will take some time. GNU Make needs to compile many source code files into executables and libraries. CTest must download a fresh copy of the testing data files for JEDI to run. CTest will then validate the compiled programs and ensure that no tests fail. While this process is happening, read ahead.

Step 5: Review of YAML files
----------------------------

Programmers and computers typically store data as complex "objects" (`structures and classes`_). In a computer's memory, these objects may have very complicated storage involving pointers, references, dictionaries, and similar constructs.
However, when we need to store these complex structures on a disk or send them across a network, we have to translate them into a series of bytes (that is, we `serialize`_ an object into `a byte stream`_). There are lots of ways of doing this. However, JEDI wanted to employ a consistent, well-documented format that is easy for people to edit and for machines to read. So, we chose the YAML Ain't Markup Language (YAML) format to store the configuration data for the JEDI project. `YAML`_ was developed in 2001 and has been implemented for use with `several`_ programming languages.

Let's take a look at a YAML file for a brief overview.

.. code:: yaml

    ---
    # Comments are indicated with the '#' symbol.
    name: "Your name here" # A string
    a-boolean-value: true
    an-integer-value: 3
    pi: 3.14159
    list-of-some-jedi-components:
    - saber
    - oops
    - ioda
    - ufo
    dictionary-of-places-to-explore-in-a-staycation:
    - local-park:
        scenic: true
        features:
        - "Running trails"
        - Trees
        - "Duck pond"
    - aquarium:
        types-of-animals:
        - jellyfish
        - turtles
        - fish
        free: false
        mask: true
        # TODO: Explore this area and add more details.

The file starts with three dashes. These dashes indicate the start of a new YAML document. YAML supports multiple documents in one file, and compliant parsers will recognize each set of dashes as the beginning of a new one. Comments begin with a hash sign (``#``), which must be preceded by whitespace if it follows other content on the line, and extend to the end of the line.

Next, we see the construct that makes up most of a typical YAML document: a key-value pair. ``name`` is a key that points to a string value: "Your name here". YAML allows for several types of values: strings, integers, floating-point numbers, boolean values and dates are all acceptable. Strings can optionally be enclosed in either single or double quotes. You can also add arrays / lists. Each element in a list is denoted by an opening dash.

YAML elements can also be nested. This lets you emulate a group / folder structure.
Nesting is accomplished by indenting with spaces (tabs are not allowed). See `this link`_ for more examples.

Step 6: How do we invoke filters using a YAML configuration file?
-----------------------------------------------------------------

Example YAML code for filters can be found in the UFO repository in ``ufo/test/testinput``. The ``qc*.yaml`` files provide many examples of how to use QC filters. Let's look at the DifferenceCheck filter to see how a relatively basic filter works.

See ``qc_differencecheck.yaml``:

.. code:: yaml

    window begin: 2018-01-01T00:00:00Z
    window end: 2019-01-01T00:00:00Z

    observations:
    - obs space:
        name: test data
        obsdatain:
          obsfile: Data/ufo/testinput_tier_1/filters_testdata.nc4
        simulated variables: [variable1]
      obs filters:
      - filter: Difference Check   # test minvalue with one var (compare var3-var4 with min)
        value: var3@MetaData       # var3@MetaData = 1, 1, 1, 1, 1, 0, 0, 0, 0, 0
        reference: var4@MetaData   # var4@MetaData = 0, 0, 0, 0, 0, 1, 2, 3, 4, 5
        minvalue: 0.0
      passedBenchmark: 5
    - obs space:
        name: test data
        obsdatain:
          obsfile: Data/ufo/testinput_tier_1/filters_testdata.nc4
        simulated variables: [variable1, variable2, variable3]
      obs filters:
      - filter: Difference Check   # test same minvalue with three vars (compare var3-var4 with min)
        value: var3@MetaData       # var3@MetaData = 1, 1, 1, 1, 1, 0, 0, 0, 0, 0
        reference: var4@MetaData   # var4@MetaData = 0, 0, 0, 0, 0, 1, 2, 3, 4, 5
        minvalue: 1.0
      passedBenchmark: 15
    - obs space:
        name: test data
        obsdatain:
          obsfile: Data/ufo/testinput_tier_1/filters_testdata.nc4
        simulated variables: [variable1]
      obs filters:
      - filter: Difference Check   # test maxvalue (compare var3-var4 with max)
        filter variables:
        - name: variable1
        value: var3@MetaData       # var3@MetaData = 1, 1, 1, 1, 1, 0, 0, 0, 0, 0
        reference: var4@MetaData   # var4@MetaData = 0, 0, 0, 0, 0, 1, 2, 3, 4, 5
        maxvalue: -3.0
      passedBenchmark: 3
    - obs space:
        name: test data
        obsdatain:
          obsfile: Data/ufo/testinput_tier_1/filters_testdata.nc4
        simulated variables: [variable1]
      obs filters:
      - filter: Difference Check      # test min and maxvalue (compare var3-var4 with min and max)
        filter variables:
        - name: variable1
        value: variable2@ObsValue     # variable2@ObsValue = 10, 12, 14, 16, 18, 20, 22, 24, 26, 28
        reference: variable1@ObsValue # variable1@ObsValue = 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
        minvalue: 2.0
        maxvalue: 6.0
      passedBenchmark: 5
    - obs space:
        name: test data
        obsdatain:
          obsfile: Data/ufo/testinput_tier_1/filters_testdata.nc4
        simulated variables: [variable1]
      obs filters:
      - filter: Difference Check      # test threshold (compare abs(variable2 - variable1) with threshold)
        filter variables:
        - name: variable1
        value: variable2@ObsValue     # variable2@ObsValue = 10, 12, 14, 16, 18, 20, 22, 24, 26, 28
        reference: variable1@ObsValue # variable1@ObsValue = 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
        threshold: 3
      passedBenchmark: 4
    - obs space:
        name: test data
        obsdatain:
          obsfile: Data/ufo/testinput_tier_1/filters_testdata.nc4
        simulated variables: [variable1]
      obs filters:
      - filter: Difference Check      # test min and maxvalue (equal), equivalent to previous test
        filter variables:
        - name: variable1
        value: variable2@ObsValue     # variable2@ObsValue = 10, 12, 14, 16, 18, 20, 22, 24, 26, 28
        reference: variable1@ObsValue # variable1@ObsValue = 10, 11, 12, 13, 14, 15, 16, 17, 18, 19
        minvalue: -3
        maxvalue: 3
      passedBenchmark: 4

UFO accesses observation data via functions and subroutines in the ObsSpace (Observation Space) class. The YAML file above specifies several instances of ObsSpace. Each space has a name, a path to the input data and a list of variables to be simulated. Paired with the ObsSpaces are the filters (ObsFilters) that act on each space. When specifying a filter, you must provide its name and any other configuration information that it requires.

The YAML file above invokes the Difference Check filter. Its options are described on the ReadTheDocs documentation site.
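
The ``passedBenchmark`` values in the test file can be checked by hand. The block below is a minimal sketch of the Difference Check rule on plain vectors, not the UFO implementation: ``countPassed`` and ``countPassedThreshold`` are hypothetical helper names, unset bounds are modelled with ``std::optional`` (so C++17 is assumed), and missing data and multiple filter variables are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper (not part of UFO): counts the locations where
// (value - reference) lies inside the configured bounds.
// A bound that is not set is simply not tested.
std::size_t countPassed(const std::vector<float>& value,
                        const std::vector<float>& reference,
                        std::optional<float> minvalue,
                        std::optional<float> maxvalue) {
  assert(value.size() == reference.size());
  std::size_t passed = 0;
  for (std::size_t i = 0; i < value.size(); ++i) {
    const float diff = value[i] - reference[i];
    const bool tooLow = minvalue.has_value() && diff < *minvalue;
    const bool tooHigh = maxvalue.has_value() && diff > *maxvalue;
    if (!tooLow && !tooHigh) ++passed;
  }
  return passed;
}

// "threshold: t" is shorthand for minvalue = -t and maxvalue = t.
std::size_t countPassedThreshold(const std::vector<float>& value,
                                 const std::vector<float>& reference,
                                 float threshold) {
  return countPassed(value, reference, -threshold, threshold);
}
```

Called with the ``var3@MetaData`` and ``var4@MetaData`` values listed in the comments, ``countPassed`` gives 5 for a ``minvalue`` of 0.0 and 3 for a ``maxvalue`` of -3.0. For ``variable2@ObsValue`` against ``variable1@ObsValue``, the range from 2.0 to 6.0 gives 5 and a threshold of 3 gives 4. The second test expects 15 because the same five passing locations are counted once for each of the three simulated variables.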
For reference, a segment of the documentation is reproduced here:

+-----------+---------------------------------------------------------+
| Parameter | Description                                             |
+===========+=========================================================+
| value     | The variable that we are comparing                      |
+-----------+---------------------------------------------------------+
| reference | The variable that we are comparing against              |
+-----------+---------------------------------------------------------+
| minvalue  | The minimum difference of (value - reference) for a     |
|           | valid datum                                             |
+-----------+---------------------------------------------------------+
| maxvalue  | The maximum difference of (value - reference) for a     |
|           | valid datum                                             |
+-----------+---------------------------------------------------------+
| threshold | A shortcut for expressing minvalue = -threshold,        |
|           | maxvalue = threshold                                    |
+-----------+---------------------------------------------------------+

Since we are using this YAML file in a test, we also encode the expected number of passed locations using the ``passedBenchmark`` option.

Step 7: How do we implement a filter?
-------------------------------------

The C++ code for all filters can be found in the UFO repository in ``src/ufo/filters`` `[link]`_. The header file for the DifferenceCheck filter is available `[here]`_, and the full source code is in the accompanying ``DifferenceCheck.cc`` file in the same directory.

Defining the filter - the header file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Annotated excerpt from ``DifferenceCheck.h``:

.. code:: cpp

    namespace ufo {

    /// DifferenceCheck filter
    class DifferenceCheck : public FilterBase,
                            private util::ObjectCounter<DifferenceCheck> {
     public:
      static const std::string classname() {return "ufo::DifferenceCheck";}

      /// !!! This constructor function initializes an instance of the
      /// !!! filter based on options specified in the YAML configuration file.
      DifferenceCheck(ioda::ObsSpace &, const eckit::Configuration &,
                      std::shared_ptr<ioda::ObsDataVector<int> >,
                      std::shared_ptr<ioda::ObsDataVector<float> >);
      ~DifferenceCheck();

     private:
      void print(std::ostream &) const override;

      /// !!! This is the function that does all of the work in the filter. !!!
      void applyFilter(const std::vector<bool> &, const Variables &,
                       std::vector<std::vector<bool>> &) const override;
      int qcFlag() const override {return QCflags::diffref;}

      const Variable ref_;
      const Variable val_;
    };

    }  // namespace ufo

Implementing the filter - the source code file
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Excerpt from ``DifferenceCheck.cc``:

.. code:: cpp

    // -----------------------------------------------------------------------------
    /** !!! This is the constructor function.
     *
     * When we instantiate a new DifferenceCheck object, we read in configuration
     * from the YAML files (stored in the _config_ variable).
     *
     * We look for two keys:
     * - reference: the name of the variable used for the reference.
     * - value: the name of the variable that contains our data.
     **/
    DifferenceCheck::DifferenceCheck(ioda::ObsSpace & obsdb,
                                     const eckit::Configuration & config,
                                     std::shared_ptr<ioda::ObsDataVector<int> > flags,
                                     std::shared_ptr<ioda::ObsDataVector<float> > obserr)
      : FilterBase(obsdb, config, flags, obserr),
        ref_(config_.getString("reference")),
        val_(config_.getString("value"))
    {
      oops::Log::trace() << "DifferenceCheck constructor starting" << std::endl;
      /// Here we tell OOPS and IODA that our filter requires these variables to work.
      /// I.e. these variables have to be available in memory.
      allvars_ += ref_;
      allvars_ += val_;
    }

    // -----------------------------------------------------------------------------
    /** !!! This function does the actual work.
     *
     * We read in three keys from the YAML configuration: minvalue, maxvalue, and threshold.
     *
     * When applying this filter, we loop over all possible locations.
     * For each location, we check the difference between the two variables (reference and value).
     * If the difference is outside of the bounds specified by minvalue, maxvalue and threshold,
     * then we flag that location. This flag gets passed back to the calling function, which then
     * sets the appropriate QC flag.
     **/
    void DifferenceCheck::applyFilter(const std::vector<bool> & apply,
                                      const Variables & filtervars,
                                      std::vector<std::vector<bool>> & flagged) const {
      oops::Log::trace() << "DifferenceCheck priorFilter" << std::endl;

      const float missing = util::missingValue(missing);
      const size_t nlocs = obsdb_.nlocs();

      // min/max value setup
      float vmin = config_.getFloat("minvalue", missing);
      float vmax = config_.getFloat("maxvalue", missing);

      // check for threshold and if exists, set vmin and vmax appropriately
      const float thresh = config_.getFloat("threshold", missing);
      if (thresh != missing) {
        vmin = -thresh;
        vmax = thresh;
      }

      // Get reference values and values to compare (as floats)
      std::vector<float> ref, val;
      data_.get(ref_, ref);
      data_.get(val_, val);
      ASSERT(ref.size() == val.size());

      // Loop over all obs
      for (size_t jobs = 0; jobs < nlocs; ++jobs) {
        if (apply[jobs]) {
          // check to see if one of the reference or value is missing
          if (val[jobs] == missing || ref[jobs] == missing) {
            for (size_t jv = 0; jv < filtervars.nvars(); ++jv) {
              flagged[jv][jobs] = true;
            }
          } else {
            // Check if difference is within min/max value range and set flag
            float diff = val[jobs] - ref[jobs];
            for (size_t jv = 0; jv < filtervars.nvars(); ++jv) {
              if (vmin != missing && diff < vmin) flagged[jv][jobs] = true;
              if (vmax != missing && diff > vmax) flagged[jv][jobs] = true;
            }
          }
        }
      }
    }

    }  // namespace ufo

Step 8: Try to add a new filter
-------------------------------

We are going to re-implement a **simplified** version of the Bounds Check filter. This filter checks that observation data are within certain user-specified bounds.

Step 8a: The backend logic
~~~~~~~~~~~~~~~~~~~~~~~~~~

In your JEDI bundle, navigate into the ``ufo/src/ufo/filters`` directory.
Copy the ``DifferenceCheck.cc`` and ``DifferenceCheck.h`` files to ``PracticalBoundsCheck.cc`` and ``PracticalBoundsCheck.h``, respectively. Open these files in your editor of choice.

In ``PracticalBoundsCheck.h``:

- Rename all references of ``DifferenceCheck`` to ``PracticalBoundsCheck``. Search for all possible capitalizations. Don't forget the capitalized text on lines 8, 9 and 55!
- Change the line ``int qcFlag() const override {return QCflags::diffref;}`` to return a different flag: ``QCflags::bounds``. This QC flag is conveniently already defined in ``ufo/filters/QCflags.h``.
- Remove the lines defining ``const Variable ref_;`` and ``const Variable val_;``.

In ``PracticalBoundsCheck.cc``:

- Rename all references of ``DifferenceCheck`` to ``PracticalBoundsCheck``.
- In ``PracticalBoundsCheck::PracticalBoundsCheck(...)``, remove references to ``ref_`` and ``val_``.
- In ``PracticalBoundsCheck::applyFilter(...)``, replace the function body with something like this:

  .. code:: cpp

      const float missing = util::missingValue(missing);

      ufo::Variables testvars;
      testvars += ufo::Variables(filtervars, "ObsValue");
      const float vmin = config_.getFloat("minvalue", missing);
      const float vmax = config_.getFloat("maxvalue", missing);

      // Sanity checks
      if (filtervars.nvars() == 0) {
        oops::Log::error() << "No variables will be filtered out in filter "
                           << config_ << std::endl;
        ABORT("No variables specified to be filtered out in filter");
      }

      // Loop over all variables to filter
      for (size_t jv = 0; jv < testvars.nvars(); ++jv) {
        // get test data for this variable
        std::vector<float> testdata;
        data_.get(testvars.variable(jv), testdata);
        // apply the filter
        for (size_t jobs = 0; jobs < obsdb_.nlocs(); ++jobs) {
          if (apply[jobs]) {
            ASSERT(testdata[jobs] != missing);
            if (vmin != missing && testdata[jobs] < vmin) flagged[jv][jobs] = true;
            if (vmax != missing && testdata[jobs] > vmax) flagged[jv][jobs] = true;
          }
        }
      }

- Feel free to customize the function further.
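
If you would like to experiment with the bounds logic before rebuilding UFO, it can be tried on plain vectors. The block below is a rough standalone sketch under several assumptions: ``flagOutOfBounds`` and ``kMissing`` are hypothetical names, ``kMissing`` is only a stand-in for the value that ``util::missingValue`` returns in UFO, and the observation data are passed in directly instead of being read through ``data_.get``.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Stand-in sentinel for "bound not set / no data" (an assumption for this
// sketch; UFO obtains its own missing value from util::missingValue).
constexpr float kMissing = std::numeric_limits<float>::lowest();

// Hypothetical helper: data[jv][jobs] holds variable jv at location jobs and
// apply[jobs] selects the locations that the filter should examine.
// Returns flagged[jv][jobs] == true wherever a value lies outside the bounds.
// A bound equal to kMissing is skipped, as in the tutorial code.
std::vector<std::vector<bool>> flagOutOfBounds(
    const std::vector<std::vector<float>>& data,
    const std::vector<bool>& apply,
    float vmin = kMissing,
    float vmax = kMissing) {
  std::vector<std::vector<bool>> flagged;
  for (const std::vector<float>& var : data) {
    assert(var.size() == apply.size());
    std::vector<bool> flags(var.size(), false);
    for (std::size_t jobs = 0; jobs < var.size(); ++jobs) {
      if (!apply[jobs]) continue;  // location excluded from this filter
      assert(var[jobs] != kMissing);
      if (vmin != kMissing && var[jobs] < vmin) flags[jobs] = true;
      if (vmax != kMissing && var[jobs] > vmax) flags[jobs] = true;
    }
    flagged.push_back(flags);
  }
  return flagged;
}
```

For one variable with values 250, 300, 180 and 999, where the last location is excluded through ``apply``, bounds of 200 and 280 flag only the second and third locations. The excluded location is never flagged, whatever its value.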

Step 8b: Add your new filter to the build system
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

- Edit ``ufo/src/ufo/filters/CMakeLists.txt`` and add ``PracticalBoundsCheck.cc`` and ``PracticalBoundsCheck.h`` to the list of source files.
- UFO needs to be told that another filter is available. The list of known filters is located in ``ufo/src/ufo/instantiateObsFilterFactory.h``. To add the new filter, first add ``#include "ufo/filters/PracticalBoundsCheck.h"`` to the top of ``instantiateObsFilterFactory.h``. At the end of ``instantiateObsFilterFactory.h``, follow the pattern and add:

  .. code:: cpp

      static oops::FilterMaker<MODEL, oops::ObsFilter<MODEL, ufo::PracticalBoundsCheck> >
               practicalBoundsCheckMaker("Practical Bounds Check");

- Change back to the build directory (``cd ~/jedi/fv3-bundle-day3/build``) and re-run ``make ufo``.
- The filter is added!

Step 8c: Add in the YAML that describes this filter to a test
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is covered in a later tutorial on unit testing.

.. _structures and classes: http://www.cplusplus.com/doc/tutorial/structures/
.. _serialize: https://en.wikipedia.org/wiki/Serialization
.. _a byte stream: https://en.wikipedia.org/wiki/Bitstream
.. _YAML: https://yaml.org/about.html
.. _several: https://yaml.org/
.. _this link: https://docs.ansible.com/ansible/latest/reference_appendices/YAMLSyntax.html
.. _[link]: https://github.com/JCSDA/ufo/tree/develop/src/ufo/filters
.. _[here]: https://github.com/JCSDA/ufo/blob/develop/src/ufo/filters/DifferenceCheck.h