Argo Quality Control Comparison Project

The goal of this project is to compare and contrast the real-time Argo data quality control (QC) efforts of oceanographic forecasting institutions, with the delayed-mode QC of the Argo collaboration as a benchmark. It is hoped that in-depth comparisons of the various QC strategies will provide insight into best practice for real-time Argo QC, to the benefit of all Argo users.

Data

The data used in the study is provided by Jim Cummings (US Navy). Institutions upload data to Jim's FTP server for processing. So far, real-time QC data has been received from four institutions: the UK Met Office (UKMO); Australia's Bureau of Meteorology (BoM); the US Navy's Fleet Numerical Meteorology and Oceanography Center (FNMOC); and Canada's Marine Environmental Data Service (MEDS). Jim processes the data into yearly float-specific netCDF files that contain the original float data, all the real-time QC data from the four institutions, and the delayed-mode Argo QC data pertaining to that float.

The available data covers the years 2006-2009. There are some temporal gaps and anomalous results in some of the provided data streams; the institutions are being contacted to correct these. An effort will also be made to invite other institutions to participate in the project.

The QC philosophy of each center looks like this:

BMRC UKMO
Coriolis
FNMOC
MEDS

Similar to Argo recommended real-time QC, but
- slightly different gradient and spike tests
- bathymetry depth test
- climatology comparison (5 sigma)

Fully in-house developed QC structure. Highly tested results. Bayesian background test rather than a fixed N*sigma test.

As recommended by the Argo real-time QC procedures, plus comparison with climatology and neighbor measurements.

  Quite sophisticated, with more detailed level-specific variable range tests, doxy test, and bathymetry tests. Only QC with a freezing point test.

More details of each center's QC process can be found here. Official Argo real-time QC is described here: http://www.coriolis.eu.org/coriolis/cdc/argo/argo-dm-user-manual.pdf

Progress

The initial comparison work was performed by Alastair Gemmell with Justin Buck, Keith Haines and Jon Blower. Real-time QC data from 2007-2008 was compared with Argo delayed-mode QC results. All data from each center was considered. The profiles were divided into those which the delayed mode QC processed as having 0%, 1-50%, 50-99%, and 100% good data, and the individual institution QC strategies were assessed by the proportions of each group they allowed into analysis. Temperature and salinity data were considered. The links below contain slides from Alastair Gemmell detailing the results of the analysis.

The BoM QC (marked as BMRC in the slides) was found to have the lowest acceptance of bad data profiles (<50% good data) and also the highest acceptance of good data profiles (>50% good data). FNMOC, UKMO and MEDS ranked about equally, with FNMOC showing the greatest discrimination between good and bad data of the three institutions.

These results are based on the total data provided from each institution, which results in a non-level playing field for comparison. The number of profiles provided by the UKMO was almost an order of magnitude smaller than those from the other three institutions. Some of the UKMO results are thus rendered suspect due to poor statistics, especially those of the profile groups with high proportions of bad data. The data are also non-homogeneous in floats analysed, which could introduce float- and location-dependent factors into the relative performance of each institution.

Marc Stringer extended the analysis to include data from 2004-2010. Focusing only on delayed-mode profiles that show either 100% good or 100% bad data, the real-time QC data is assessed according to whether or not it agrees with the delayed-mode QC. Biases are calculated, as is the Equitable Threat Score (ETS). The data is analysed on a profile-by-profile and level-by-level basis, and also on a level-by-level basis for only those profiles that are included in all of the institutions' QC results. Another institution (Coriolis) is included in the level-by-level analysis.
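As a rough illustration of the scores mentioned above, the sketch below computes a frequency bias and an Equitable Threat Score from a 2x2 contingency table, using the standard forecast-verification definitions. It assumes that the "event" is a rejection by the delayed-mode QC and that "bias" means frequency bias; neither is confirmed by Marc's analysis, and the function names and example flags are hypothetical.

```python
def contingency(rt_rejects, dm_rejects):
    """Count hits, misses, false alarms and correct negatives.

    Both arguments are equal-length sequences of booleans, True where the
    real-time (rt) or delayed-mode (dm) QC rejected the profile or level.
    Treating a DM rejection as the "event" is an assumption of this sketch.
    """
    hits = misses = false_alarms = correct_negatives = 0
    for rt, dm in zip(rt_rejects, dm_rejects):
        if rt and dm:
            hits += 1
        elif dm:
            misses += 1
        elif rt:
            false_alarms += 1
        else:
            correct_negatives += 1
    return hits, misses, false_alarms, correct_negatives


def frequency_bias(hits, misses, false_alarms):
    """Ratio of real-time rejections to delayed-mode rejections (1 = unbiased)."""
    return (hits + false_alarms) / (hits + misses)


def equitable_threat_score(hits, misses, false_alarms, correct_negatives):
    """ETS: threat score corrected for the hits expected by chance."""
    total = hits + misses + false_alarms + correct_negatives
    hits_random = (hits + misses) * (hits + false_alarms) / total
    return (hits - hits_random) / (hits + misses + false_alarms - hits_random)


# Made-up flags for six profiles, for illustration only.
rt = [True, True, False, False, True, False]
dm = [True, False, True, False, True, False]
table = contingency(rt, dm)
bias = frequency_bias(*table[:3])
ets = equitable_threat_score(*table)
```

An ETS of 1 would indicate perfect agreement with the delayed-mode QC, and 0 indicates no skill beyond chance. Both functions divide by counts that can be zero (for example when the delayed-mode QC rejects nothing), so real use would need a guard for that case.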

The advantage shown by the BoM QC strategy in the initial study is evident again here in the profile-by-profile all-data comparison, with the BoM results having a lower bias and a higher ETS. FNMOC and MEDS are comparable, and UKMO is significantly behind. When the data is looked at on a level-by-level basis, however, this advantage disappears: the BoM results are very similar to those of Coriolis, FNMOC and MEDS. When only the common profiles are analysed on a level-by-level basis, the BoM skill is reduced again, and it is out-performed by the Coriolis results. All of the institutions perform worse when taken on a level-by-level basis, and worse again when only common data is used, but the BoM drop is far larger.

The common lowering of skill when only common profiles are used could possibly be the result of tailoring QC requirements to give the best results in the specific areas (geographical, dynamical, temporal) of most interest to a particular institution. The lowering of skill when data is considered on a level-by-level basis could be an indication that profile-based selection criteria are the most successful in real-time. The link below shows Marc's results in detail:

http://www.met.rdg.ac.uk/~marc/eResearch/argo2.html

Alastair Gemmell has begun identifying metrics with which to compare the different QC strategies. The averages of the metrics "recall" (R) and "precision" (P) for the total Argo data from each institution are shown in Figure 1. The data runs from 2006 to 2009. "Recall" is the ratio of the number of bad data removed to the total number of bad data. It is a measure of the success of the QC. "Precision" is the ratio of the number of bad data removed to the total number of data removed. It is a measure of the accuracy of the QC. The two metrics are shown in percentage form.

  • Figure 1: The average recall and precision of Argo QC from four institutions 2006-9. (Alastair Gemmell):

Alastair_bar.png
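The two definitions above can be written out as a short sketch. The function name and the counts in the example are hypothetical and are not taken from Alastair's data; they are chosen only so that the QC removes 20% of the bad data and an equal number of good data.

```python
def recall_precision(bad_removed, bad_total, removed_total):
    """Return (recall, precision) in percent.

    recall    = bad data removed / total bad data
    precision = bad data removed / total data removed
    """
    recall = 100.0 * bad_removed / bad_total
    precision = 100.0 * bad_removed / removed_total
    return recall, precision


# Hypothetical counts: 20 bad data in total, 8 data removed, 4 of them bad.
recall, precision = recall_precision(bad_removed=4, bad_total=20, removed_total=8)
```

With these counts the recall is 20% and the precision is 50%. Note that neither number depends on how many good data there were to begin with.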

The BoM QC removes 20% of the bad data and an equal number of good data. FNMOC removes 10% more of the total bad data at the expense of removing about 12 times as much good data as bad. This seems like a bad trade, but when good data outweighs bad by 50-60 times and bad data entering assimilation can have grave effects on an analysis, perhaps a 12-for-1 trade of good data for bad is desirable. These metrics carry no information about the relative proportions of good and bad data, something that should be taken into account; this is discussed further below. The desirable ratio of good to bad data removed might also depend on the geographical location of the data removed and on other factors that make one profile more important than another; this will require further investigation. The UKMO QC has good R but very poor P, indicating that the QC is too broad; this is actually thought to be an issue with the data provided and is discussed further below. The MEDS QC has the lowest R and a low P as well.

Figures 2 and 3 show time series of the same data. The first thing to note is the odd behavior of the UKMO data. There is no data for 2006, and in late 2008 the QC begins to reject almost all the data, giving a recall of 100% and a very low precision. This is the reason for the poor results from the UKMO and is believed to be most likely due to a data transfer error. Contact with the UKMO will be made to correct this.

The QC of FNMOC is clearly the most successful at removing bad data in the first two years of the time period studied. The FNMOC recall rate drops significantly in 2008. The BoM (BMRC) QC is generally second, and its recall improves significantly halfway through 2009 after a poor period from late 2008 to early 2009.

  • Figure 2: The recall of Argo QC from four institutions from 2006 to the end of 2009. (Alastair Gemmell):
    Alastair_recall.png
  • Figure 3: The precision of Argo QC from four institutions from 2006 to the end of 2009. (Alastair Gemmell):
    Alastair_prec.png

The advantage for the BoM QC seems to be a result of high precision, which is much higher than that of FNMOC or MEDS, though comparable to that of the UKMO during the periods when the UKMO data is OK. This leads to the question of whether the good results for the BoM are justified. In the problem of data QC for the purpose of data assimilation, the removal of bad data is of paramount importance. The assimilation of inaccurate and spurious data into ocean models can lead to the degradation of the initial state and thus of the predictive abilities of the system. In an observing system of high quality, such as Argo, there are many more good data points than bad (bad data is ~1.5-2.0% in the data Marc used). In this case the removal of bad data is more important than preserving the maximum number of good data. It is a matter of finding an acceptable balance, as it always is in the task of improving signal-to-noise, but in this case the scales are tipped towards the sacrifice of good data points to remove more of the bad. In this light, perhaps the QC of the other institutions, especially FNMOC, is better than it seemed in the initial analysis.
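To make the trade-off concrete, the sketch below works out what fraction of the data remaining after QC is still bad, given the initial bad fraction, the recall and the precision. The function name is hypothetical, and the two sets of input values are illustrative only: they are loosely based on the figures quoted above (about 2% bad data, 20% recall with equal good and bad removed, and 30% recall with 12 good removed per bad) and are not results of the study.

```python
def residual_bad_fraction(bad_fraction, recall, precision):
    """Fraction of the data left after QC that is still bad.

    All quantities are fractions of the original data set (0..1).
    """
    bad_removed = recall * bad_fraction
    # precision = bad_removed / (bad_removed + good_removed)
    good_removed = bad_removed * (1.0 - precision) / precision
    bad_left = bad_fraction - bad_removed
    good_left = (1.0 - bad_fraction) - good_removed
    return bad_left / (bad_left + good_left)


# Illustrative inputs only, not measured values.
high_precision = residual_bad_fraction(0.02, recall=0.20, precision=0.5)
high_recall = residual_bad_fraction(0.02, recall=0.30, precision=1.0 / 13.0)
```

Under these assumed inputs the higher-recall QC leaves a slightly cleaner data set (about 1.5% bad against about 1.6%), but it discards roughly 7% of the good data to do so, against well under 1% for the high-precision QC.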

As previously mentioned, the Recall/Precision metrics do not contain information about the relative abundance of good and bad data. This information should be incorporated into the metric in order to assess the relative value of removing good and bad data, i.e., how much good data is it worth removing to get rid of more bad data? This would also enable a fairer comparison of QCs run over different data sets: two QCs might have similar Precisions, but if the data sets they were run over had very different relative amounts of good and bad data, the quality of the resultant data sets would also be very different. Instead of using Precision, a ratio of the number of good data passed to the total number of good data would be more informative. This ratio can then be combined with the ratio of the number of bad data passed to the total number of bad data (the complement of Recall, i.e., 1 - Recall) to form a single metric that includes all the information desired: a "Figure of Merit" (FoM) defined by

fom_eq.gif

where R is the fraction of good data passing the QC and I is the fraction of bad data passing the QC. (Note that this R is not the recall R used above.) This metric has the behavior:

  • FoM = 1 when R = 1 and I = 0
  • FoM = 0 when R = 0 and I = 1
  • FoM = 0.5 when R = I, i.e., when the QC has not changed the relative frequency of good and bad data.
So a score above 0.5 means the QC is having a positive effect on the quality of the data, a score of 0.5 means the QC has no more skill than removing profiles at random, and a score below 0.5 means the QC is having a negative impact. The plot below (Figure 4) shows the FoM behavior for a range of R and I results.

  • Figure 4: Behavior of the FoM metric with various R and I:
    fom.png
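The two inputs to the FoM can be computed directly from counts, as in the sketch below. The FoM expression itself is given in fom_eq.gif and is not reproduced here; the function name and the counts are hypothetical. The example shows a QC that removes 10% of the data at random, which gives R = I, the no-skill case described above.

```python
def pass_fractions(good_passed, good_total, bad_passed, bad_total):
    """Return (R, I): fractions of good and of bad data that pass the QC."""
    r = good_passed / good_total
    i = bad_passed / bad_total
    return r, i


# Hypothetical data set: 980 good and 20 bad data, with 10% of each
# removed at random, so the QC shows no skill.
r, i = pass_fractions(good_passed=882, good_total=980, bad_passed=18, bad_total=20)
recall = 1.0 - i  # I is the complement of the recall defined earlier
```

Because R and I are each normalised by their own totals, they do not change if the initial proportion of good to bad data changes, which is the property that Precision lacks.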

I have calculated FoMs for Marc's results for those profiles that are included in both the delayed-mode and the data centres' QC and in which the delayed-mode QC either accepted or rejected all the levels in a profile. The results are in Figure 5. Only the BoM QC is found to be significantly improving the quality of this data set; the other three centers have results only slightly above that of rejecting a random selection of profiles. The Recall and Precision results are different to those found by Alastair; I am not sure which data set was used in the summer school. One interesting thing to note is that the BoM data set has ~2.4% bad data according to the delayed-mode QC, while FNMOC, MEDS and UKMO have ~13%, ~6.5% and ~7.1% bad data respectively. The FoM metric takes this into account (i.e., an FoM score is independent of the quality of the initial data set), but it is interesting to highlight the differences in the data over which each QC was run.

  • Figure 5: The FoM, Recall and Precision results of profile-based QC for the four centers:
fom_recall_prec2.png

Figure 6 shows the FoM, Recall and Precision results for the four centers plus Coriolis when the QC of every level in each profile is assessed. On a level-by-level basis, the FoMs of all institutions except FNMOC fall. The FNMOC results remain approximately the same. The BoM QC is now only slightly preferred, possibly indicating that the advantage this institution has lies in profile-based selection criteria. The UKMO FoM has fallen below 0.5: the QC is flagging so many good levels as bad that the loss outweighs the benefit of removing the correctly flagged bad levels.

  • Figure 6: The FoM, Recall and Precision results of level-based QC for the four centers plus Coriolis:
fom_recall_prec_levels1.png

Figure 7 shows the results for the set of levels from profiles that are common to the BoM, FNMOC, Coriolis and MEDS QC. In this data set, the BoM QC has dropped behind both the Coriolis and the FNMOC QC, and is about equal with that of MEDS. No QC does well in this data set, though that of Coriolis has a clear advantage.

  • Figure 7: The FoM, Recall and Precision results of level-based QC for only the common profiles in the data sets of BoM, Coriolis, FNMOC and MEDS:

fom_recall_prec_levels_cross.png
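Selecting the common profiles amounts to a set intersection over the profiles each center has QC'd. The sketch below assumes each profile can be identified by a (float ID, cycle number) pair; that key, the function name and the example values are all hypothetical and are not taken from the processed files.

```python
def common_profiles(profiles_by_center):
    """Return the set of profile keys present in every center's QC results."""
    sets = [set(keys) for keys in profiles_by_center.values()]
    return set.intersection(*sets) if sets else set()


# Made-up profile keys, for illustration only.
example = {
    "BoM": [(1900001, 1), (1900001, 2), (1900002, 1)],
    "Coriolis": [(1900001, 2), (1900002, 1), (1900003, 5)],
    "FNMOC": [(1900001, 1), (1900001, 2), (1900002, 1)],
    "MEDS": [(1900001, 2), (1900002, 1)],
}
shared = common_profiles(example)
```

Restricting every center to the shared set removes float- and location-dependent differences between the data sets, at the cost of a smaller sample.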

Argo Delayed Mode Availability

A quick assessment of the rate at which Argo data is delayed-mode (DM) QC'd and made available on the GODAE data server. Proportion of total profiles that have been DM QC'd:

1/6/2011 11%
1/1/2011 26%
1/6/2010 48%
1/1/2010 67%
1/6/2009 69%
1/1/2009 74%

So, if we go up to 2010 inclusive, the amount of data will drop off. This is not really a problem, as the data set we will be using will be defined by the Argo DM data, i.e., only profiles from the centers that have undergone DM QC will be considered (at least at the beginning; Keith mentioned the possibility of extending the analysis up to real-time without the use of DM QC data). We should check the performance of each QC in time-lines à la Alastair's plots anyway.

Available Data as of 19/9/2011

Institution | Year Range | Continuity | Summary/Detailed Flags? | QC documentation?
UKMO | 4/10/2006 - 22/7/2008; 17/12/2008 - today | Daily files | Yes / Yes | No
MEDS | 21/6/2004 - 16/9/2011 | Monday, Wednesday, Friday | ? / Yes | No
CRLS | One day in 2007; very patchy 2008; 23/7/2009 - today | Daily files | Yes / Yes | No
BoM | 1/1/2005 - 19/6/2011 | Daily files | Yes / Yes | Yes

The FNMOC raw data is not on the server, but is continuous from 2006-2010 at least.

So we have 2006-present for three centers (FNMOC, MEDS and BoM). For UKMO we can only trust data from 2007 to mid-2008, and for Coriolis only from late 2009.

Jim's data processing has files for 2004-2010. These files contain the total QC accept/reject and QC flags for T (background and clim), S (background and clim), level, date and time, and position.

If we want to use Jim's processed data, which is a good idea at least for the beginning, then we will have to ask him to re-process the data once we get the centers to provide more. We have the raw data for each of the centers, including their individual QC flags, so that is available for more in-depth analysis later. (I'm not sure about the MEDS raw data, as it is in ASCII format and I haven't had time to write code to read it.)

Looking at the UKMO data uploaded to Jim's server, it has changed format twice. The timing of these changes seems to coincide with the problems with the data, as seen in Figure 3. On 27/2/2007 the files change from "OCEANSOUND_LIST_{date}.GZ" to "OceanSound_longlist_{date}.gz". I am not sure if this changes the internal variables, as I'm not sure what format the files are in or how to read them, but it is approximately the time the UKMO data starts to be OK. Then on 17/12/2008 they change to netCDF files of the form "profile.obs.{date}.nc", and at this point the data goes bad again. It seems possible that Jim's processing code did not change with the format changes. This needs to be investigated.
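One way to start the investigation is to sort the uploaded UKMO files by naming scheme and check which scheme each problem period falls under. The sketch below matches only the three filename patterns quoted above. The layout of the {date} part is not known here, so it is captured as an opaque string; the labels, function name and example filenames are hypothetical.

```python
import re

# The three UKMO naming schemes quoted above. The {date} part is captured
# as an opaque string because its exact layout is an unknown here.
UKMO_FORMATS = [
    ("list", re.compile(r"^OCEANSOUND_LIST_(?P<date>.+)\.GZ$")),
    ("longlist", re.compile(r"^OceanSound_longlist_(?P<date>.+)\.gz$")),
    ("netcdf", re.compile(r"^profile\.obs\.(?P<date>.+)\.nc$")),
]


def classify_ukmo_file(filename):
    """Return (format label, date string), or (None, None) if unrecognised."""
    for label, pattern in UKMO_FORMATS:
        match = pattern.match(filename)
        if match:
            return label, match.group("date")
    return None, None


# Hypothetical filenames; the date strings are placeholders only.
first = classify_ukmo_file("OCEANSOUND_LIST_20061004.GZ")
second = classify_ukmo_file("OceanSound_longlist_20070227.gz")
third = classify_ukmo_file("profile.obs.20081217.nc")
other = classify_ukmo_file("readme.txt")
```

Counting files per label by month would show directly whether the periods of odd UKMO results line up with the format changes.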

There are two gaps in the UKMO raw data, one a few weeks (18/12/2007 to 4/1/2008), and one of five months (22/7/2008 to 17/12/2008). These will need to be filled in.
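Gaps like these can be found automatically from the list of dates for which files exist. The sketch below is one possible approach, not the method used to find the gaps above; the function name is hypothetical, and the example dates are placed around the shorter gap purely for illustration.

```python
from datetime import date, timedelta


def find_gaps(dates, max_step=timedelta(days=1)):
    """Return (last date before gap, first date after gap) pairs.

    A gap is any jump between consecutive available dates larger than
    max_step. For a Monday/Wednesday/Friday stream, max_step would be
    timedelta(days=3) to allow for the weekend.
    """
    ordered = sorted(set(dates))
    gaps = []
    for previous, following in zip(ordered, ordered[1:]):
        if following - previous > max_step:
            gaps.append((previous, following))
    return gaps


# Illustrative dates only, not a listing of the real files.
available = [
    date(2007, 12, 17),
    date(2007, 12, 18),
    date(2008, 1, 4),
    date(2008, 1, 5),
]
gaps = find_gaps(available)
```

The dates would in practice be parsed from the filenames on the server, which requires knowing the date layout used by each center.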

Wishlist:

If 2006 (or later) to 2010 is enough data, then we need:

  • UKMO data to be uploaded from 18/12/2007 to 4/1/2008 and from 22/7/2008 to 17/12/2008.
  • The other problems need to be resolved with UKMO data, either by re-uploading the odd data or by resolution of problems with the analysis code.
  • If we want to add Coriolis, we need 1/1/2006-22/7/2009 to be uploaded, and hopefully some profile QC information included.
  • Uploading of ECMWF data if they get onboard.
  • Jim to reprocess all the new data.
  • Some form of documentation describing the QC selection criteria for each of the centers.

The good thing is that we have complete data from 3 centers, so we can start analysing that first and add the other centers as the data becomes available.

UPDATE 5/1/2012

We have decided to switch to using the raw data files found in

http://www.usgodae.org/pub/incoming/godae_qc/

rather than relying on Jim Cummings' processed data. There are several reasons for the change: we are not certain exactly how the translations from the institutional flags to Jim's flags are performed; we cannot discover the source of the data gaps in the UKMO QC data, or fill those gaps, ourselves; we were unable until the last few days to contact Jim to obtain clarification of these issues; and we decided to move from the simple 'pass, fail, not assessed' profile QC framework to the percentage-of-levels-passed framework of the Argo real-time QC manual Table 2a, which would require a complete re-processing of Jim's data anyway.
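A percentage-of-levels-passed framework can be sketched as below. The grade boundaries follow the A-F profile quality scheme of the Argo reference table 2a as understood here and should be checked against the manual. How each center's level flags map to pass or fail, and which levels (such as missing values) are excluded from the count, are left open; the function name and the blank grade for an empty profile are assumptions of this sketch.

```python
def profile_grade(level_passed):
    """Grade a profile from per-level pass/fail results.

    level_passed is a sequence of booleans, True where a level passed QC.
    An empty sequence is treated as "no QC performed" (blank grade).
    """
    total = len(level_passed)
    if total == 0:
        return " "
    good = sum(1 for passed in level_passed if passed)
    if good == total:
        return "A"  # 100% of levels good
    percent = 100.0 * good / total
    if percent >= 75.0:
        return "B"
    if percent >= 50.0:
        return "C"
    if percent >= 25.0:
        return "D"
    if percent > 0.0:
        return "E"
    return "F"  # no good levels


# A profile with three good levels out of four (75% good).
grade = profile_grade([True, True, True, False])
```

Here the grade is "B". Working from per-level results like this keeps the profile summary and the level-by-level comparison consistent with each other.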

The UKMO data gaps have now been filled.

Documentation of some kind has been obtained for all centers. The FNMOC documentation is basic, and extrapolated from a document describing general QC tactics by Jim Cummings. We have now contacted Jim and requested further information.

Raw data for all centers except FNMOC have been obtained and are being processed. Jim has been asked to provide access to the FNMOC data. ECMWF do not look like they will take part, due to problems with getting the data in a usable format.

-- RobinWedd - 10 Aug 2011

Topic attachments
Attachment | Size | Date | Who | Comment
Alastair_bar.png | 18.9 K | 02 Sep 2011 - 10:58 | RobinWedd |
Alastair_prec.png | 71.8 K | 11 Aug 2011 - 15:39 | RobinWedd | Plot of the precision of Argo QC from four institutions. (Alastair Gemmell)
Alastair_recall.png | 67.2 K | 11 Aug 2011 - 15:41 | RobinWedd | Plot of the recall of Argo QC from four institutions. (Alastair Gemmell)
buck-euroargo-17062010_p1-21.pdf | 5524.6 K | 11 Aug 2011 - 16:04 | RobinWedd | Slides by Alastair Gemmell showing initial comparisons with 2007-2008 data
buck-euroargo-17062010_p22-41.pdf | 4756.9 K | 11 Aug 2011 - 16:08 | RobinWedd | Slides pt 2
fom.png | 41.2 K | 06 Sep 2011 - 14:10 | RobinWedd | Behaviour of the FoM metric with various R and I
fom_eq.gif | 0.4 K | 06 Sep 2011 - 14:00 | RobinWedd |
fom_recall_prec2.png | 26.0 K | 09 Sep 2011 - 10:37 | RobinWedd |
fom_recall_prec_levels.png | 28.1 K | 07 Sep 2011 - 14:46 | RobinWedd |
fom_recall_prec_levels_cross.png | 26.1 K | 07 Sep 2011 - 14:46 | RobinWedd |
test.gif | 0.4 K | 06 Sep 2011 - 13:59 | RobinWedd |
Topic revision: r26 - 06 Jan 2012 - 14:57:21 - RobinWedd
 