Argo Quality Control Comparison Project
The goal of this project is to compare and contrast the real-time Argo data quality control (QC) efforts of oceanographic forecasting institutions, using the delayed-mode QC of the Argo collaboration as a benchmark. It is hoped that in-depth comparisons of the various QC strategies will provide insight into best practice for real-time Argo QC, to the benefit of all Argo users.
Data
The data used in the study is provided by Jim Cummings (US Navy). Institutions upload data to Jim's FTP server for processing. Currently real-time QC data has been received from four institutions: the UK Met Office (UKMO); Australia's Bureau of Meteorology (BoM); the US Navy's Fleet Numerical Meteorology and Oceanography Center (FNMOC); and Canada's Marine Environmental Data Service (MEDS). Jim processes the data into yearly float-specific netCDF files that contain the original float data, all the real-time QC data from the four institutions, and the delayed-mode Argo QC data pertaining to that float.
The available data covers the years 2006-2009. There are some temporal gaps and anomalous results in some of the provided data streams; the institutions are being contacted to correct these. An effort will also be made to invite other institutions to participate in the project.
The QC philosophy of each center looks like this:
| BMRC | UKMO | Coriolis | FNMOC | MEDS |
| Similar to the Argo recommended real-time QC, but with slightly different gradient and spike tests, a bathymetry depth test, and a climatology comparison (5 sigma). | Fully in-house developed QC structure. Highly tested results. Bayesian background test rather than a fixed N*sigma test. | As recommended by the Argo real-time QC, plus comparison with climatology and neighbor measurements. | | Quite sophisticated, with more detailed level-specific variable range tests, doxy test, and bathymetry tests. Only QC with a freezing point test. |
More details of each center's QC process can be found here. The official Argo real-time QC is described here:
http://www.coriolis.eu.org/coriolis/cdc/argo/argo-dm-user-manual.pdf
Progress
The initial comparison work was performed by Alastair Gemmell with Justin Buck, Keith Haines and Jon Blower. Real-time QC data from 2007-2008 was compared with Argo delayed-mode QC results. All data from each center was considered. The profiles were divided into those which the delayed mode QC processed as having 0%, 1-50%, 50-99%, and 100% good data, and the individual institution QC strategies were assessed by the proportions of each group they allowed into analysis. Temperature and salinity data were considered. The links below contain slides from Alastair Gemmell detailing the results of the analysis.
The BoM QC (marked as BMRC in the slides) was found to have the lowest acceptance of bad data profiles (<50% good data) and also the highest acceptance of good data profiles (>50% good data). FNMOC, UKMO and MEDS ranked about equally, with FNMOC showing the greatest discrimination between good and bad data of the three institutions.
These results are based on the total data provided from each institution, which results in a non-level playing field for comparison. The number of profiles provided by the UKMO was almost an order of magnitude smaller than those from the other three institutions. Some of the UKMO results are thus rendered suspect due to poor statistics, especially those of the profile groups with high proportions of bad data. The data are also non-homogeneous in floats analysed, which could introduce float- and location-dependent factors into the relative performance of each institution.
Marc Stringer extended the analysis to include data from 2004-2010. Focusing only on delayed-mode profiles that show either 100% good or 100% bad data, the real-time QC data is assessed according to whether or not it agrees with the delayed-mode QC. Biases are calculated, as is the Equitable Threat Score (ETS). The data is analysed on a profile-by-profile and level-by-level basis, and also on a level-by-level basis for only those profiles that are included in all of the institutions' QC results. Another institution (Coriolis) is included in the level-by-level analysis.
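As a rough illustration of the scores mentioned above, the sketch below computes a frequency bias and the Equitable Threat Score from a 2x2 contingency table using the standard forecast-verification formulas. The mapping onto QC is an assumption made here, not something confirmed by Marc's analysis: the "event" is taken to be data that the delayed-mode QC marks as bad, and the "forecast" is a real-time QC rejection. The function name and the example counts are hypothetical.

```python
import math


def bias_and_ets(hits, misses, false_alarms, correct_negatives):
    """Return (frequency bias, Equitable Threat Score) for a 2x2 table.

    Assumed mapping (hypothetical, for illustration only):
      hits              = bad data rejected by the real-time QC
      misses            = bad data accepted by the real-time QC
      false_alarms      = good data rejected by the real-time QC
      correct_negatives = good data accepted by the real-time QC
    """
    total = hits + misses + false_alarms + correct_negatives
    if total == 0:
        raise ValueError("contingency table is empty")

    observed_bad = hits + misses
    rejected = hits + false_alarms

    # Frequency bias: number of rejections relative to number of bad data.
    bias = rejected / observed_bad if observed_bad else math.nan

    # Hits expected from rejecting the same number of data at random.
    hits_random = observed_bad * rejected / total
    denominator = hits + misses + false_alarms - hits_random
    ets = (hits - hits_random) / denominator if denominator else math.nan
    return bias, ets


# Made-up counts, not taken from the project data.
bias, ets = bias_and_ets(hits=20, misses=80, false_alarms=20,
                         correct_negatives=4880)
print(f"bias = {bias:.2f}, ETS = {ets:.3f}")
```

A bias of 1 means the real-time QC rejects as many data as the delayed-mode QC marks bad; an ETS of 1 is a perfect match and an ETS of 0 is no better than random rejection.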
The advantage shown by the BoM QC strategy in the initial study is evident again here in the profile-by-profile all-data comparison, with the BoM results having a lower bias and a higher ETS. FNMOC and MEDS are comparable, and UKMO is significantly behind. When the data is looked at on a level-by-level basis, however, this advantage disappears. The BoM results are very similar to those of Coriolis, FNMOC and MEDS. When only the common profiles are analysed on a level-by-level basis, the skill of the BoM strategy is reduced again, and it is out-performed by the Coriolis results. All of the institutions perform worse when taken on a level-by-level basis, and worse again when only common data is used, but the BoM drop is far larger.
The common lowering of skill when only common profiles are used could possibly be the result of tailoring QC requirements to give the best results in the specific areas (geographical, dynamical, temporal) of most interest to a particular institution. The lowering of skill when data is considered on a level-by-level basis could be an indication that profile-based selection criteria are the most successful in real-time. The link below shows Marc's results in detail:
http://www.met.rdg.ac.uk/~marc/eResearch/argo2.html
Alastair Gemmell has begun identifying metrics with which to compare the different QC strategies. The averages of the metrics "recall" (R) and "precision" (P) for each institution are shown in Figure 1. The data runs from 2006 to 2009. "Recall" is the ratio of the number of bad data removed to the total number of bad data. It is a measure of the success of the QC. "Precision" is the ratio of the number of bad data removed to the total number of data removed. It is a measure of the accuracy of the QC. The two metrics are shown in percentage form. Figure 1 shows the percentage R and P for the total Argo data from each institution.
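The two definitions above can be written out as a minimal sketch. The function name and the counts in the example are hypothetical and are not taken from Alastair's analysis; "bad" means data rejected by the delayed-mode QC and "removed" means data rejected by the real-time QC.

```python
import math


def recall_precision_percent(bad_removed, bad_total, good_removed):
    """Return (recall, precision) in percent.

    recall    = bad data removed / total bad data
    precision = bad data removed / total data removed
    """
    removed_total = bad_removed + good_removed
    recall = 100.0 * bad_removed / bad_total if bad_total else math.nan
    precision = (100.0 * bad_removed / removed_total
                 if removed_total else math.nan)
    return recall, precision


# Made-up counts: 100 bad data in total, of which 20 are removed,
# together with 20 good data.
r, p = recall_precision_percent(bad_removed=20, bad_total=100,
                                good_removed=20)
print(f"recall = {r:.0f}%, precision = {p:.0f}%")
```

Note that neither number depends on how many good data there were to begin with, which is the limitation discussed later on this page.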
- Figure 1: The average recall and precision of Argo QC from four institutions 2006-9. (Alastair Gemmell):
The BoM QC removes 20% of the bad data and an equal number of good data. FNMOC removes 10% more of the total bad data at the expense of removing about 12 times as much good data as bad. This seems like a bad trade, but when good data outweighs bad by 50-60 times and bad data entering assimilation can have grave effects on an analysis, perhaps a 12-for-1 trade of good data for bad is desirable. These metrics carry no information about the relative proportions of good and bad data, something that should be taken into account. This is discussed further below. The desirable ratio of good/bad data removed might also depend on the geographical location of the data removed and other factors that make one profile more important than another. This will require further investigation to discern. The UKMO QC has good R but very poor P, indicating that the QC is too broad; this is actually thought to be an issue with the data provided and is discussed further below. The MEDS QC has the lowest R and low P as well.
Figures 2 and 3 show time series of the same data. The first thing to note is the odd behavior of the UKMO data. There is no data for 2006, and in late 2008 the QC begins to reject almost all the data, giving a recall of 100% and a very low precision. This is the reason for the poor results from the UKMO and is believed to be most likely due to a data transfer error. Contact with the UKMO will be made to correct this.
The QC of FNMOC is clearly the most successful at removing bad data in the first two years of the time period studied. The FNMOC recall rate drops significantly in 2008. The BoM (BMRC) QC is generally second, and its recall improves significantly halfway through 2009 after a poor period in late 2008 to early 2009.
- Figure 2: The Recall of Argo QC from four institutions from 2006 to the end of 2009. (Alastair Gemmell):
- Figure 3: The Precision of Argo QC from four institutions from 2006 to the end of 2009. (Alastair Gemmell):
The advantage for the BoM QC seems to be a result of high precision, which is much higher than FNMOC or MEDS, though comparable to the UKMO during the periods when the UKMO data is OK. This leads to the question of whether the good results for the BoM are justified. In the problem of data QC for the purpose of data assimilation, the removal of bad data is of paramount importance. The assimilation of inaccurate and spurious data into ocean models can lead to the degradation of the initial state and thus of the predictive abilities of the system. In an observation system of high quality, such as Argo, there are many more good data points than bad (bad data is ~1.5-2.0% in the data Marc used). In this case the removal of bad data is more important than preserving the maximum number of good data. It is a matter of finding an acceptable balance, as it always is in the task of improving signal-to-noise, but in this case the scales are tipped towards the sacrifice of good data points to remove more of the bad. In this light, perhaps the QC of the other institutions, especially FNMOC, is better than it seemed in the initial analysis.
As previously mentioned, the Recall/Precision metrics do not contain information about the relative abundance of good and bad data. This information should be incorporated into the metric in order to assess the relative value of removing good and bad data, ie, how much good data is it worth removing to get rid of more bad data? This would also enable a fairer comparison of QCs run over different data sets: two QCs might have similar Precisions, but if the data sets they were run over had very different relative amounts of good and bad data, the quality of the resultant data sets would also be very different. Instead of using Precision, a ratio of the number of good data passed to the total number of good data would be more informative. This ratio can then be combined with the ratio of the number of bad data passed to the total number of bad data (the complement of Recall, ie, 1 - Recall) to form a single metric that includes all the information desired: a "Figure of Merit" (FoM) defined by
where R is the fractional number of good data passing the QC (not the recall R used above) and I is the fractional number of bad data passing the QC. This metric has the behavior:
- FoM = 1 when R = 1 and I = 0
- FoM = 0 when R = 0 and I = 1
- FoM = 0.5 when R = I, ie, when the QC has not changed the relative frequency of good and bad data.
So a score above 0.5 means the QC is having a positive effect on the quality of the data, a score of 0.5 means the QC has no more skill than removing profiles at random, and a score of below 0.5 means the QC is having a negative impact. The plot below (Figure 4) shows the FoM behavior for a range of R and I results.
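The defining equation for the FoM is not reproduced on this page, and the sketch below does not attempt to recover it. It only shows how R and I, as defined above, could be computed from counts, together with one hypothetical function, `fom_candidate`, that happens to satisfy the three listed properties. Other functional forms satisfy them too, so this should be read as an illustration of the behavior rather than as the definition actually used. All names and counts are assumptions.

```python
def pass_fractions(good_passed, good_total, bad_passed, bad_total):
    """Return (R, I): fractions of good and of bad data passing the QC."""
    if good_total <= 0 or bad_total <= 0:
        raise ValueError("need at least one good and one bad datum")
    return good_passed / good_total, bad_passed / bad_total


def fom_candidate(r, i):
    """Hypothetical stand-in for the FoM (NOT the page's own definition).

    Chosen only because it gives 1 for (R=1, I=0), 0 for (R=0, I=1)
    and 0.5 whenever R equals I.
    """
    return 0.5 * (1.0 + r - i)


# Made-up counts: 950 of 1000 good data pass, 30 of 50 bad data pass.
r, i = pass_fractions(good_passed=950, good_total=1000,
                      bad_passed=30, bad_total=50)
print(f"R = {r:.2f}, I = {i:.2f}, candidate FoM = {fom_candidate(r, i):.3f}")
```

Because R and I are each normalised by their own totals, a score built from them does not change if the initial ratio of good to bad data changes, which is the property that motivated the new metric.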
- Figure 4: Behavior of the FoM metric with various R and I:
I have calculated FoMs for Marc's results for those profiles that are included in both the delayed-mode and the data centres' QC and in which the delayed-mode QC either accepted or rejected all the levels in a profile. The results are in Figure 5. Only the BoM QC is found to be significantly improving the quality of this data set; the other three centers have results only slightly above that of rejecting a random selection of profiles. The Recall and Precision results are different to those found by Alastair; I am not sure which data set was used in the summer school. One interesting thing to note is that the BoM data set has ~2.4% bad data according to the delayed-mode QC, while FNMOC, MEDS and UKMO have ~13%, ~6.5% and ~7.1% bad data respectively. The FoM metric takes this into account (ie, an FoM score is independent of the quality of the initial data set), but it is interesting to highlight the differences in the data over which each QC was run.
- Figure 5: The FoM, Recall and Precision results of profile-based QC for the four centers:
Figure 6 shows the FoM, Recall and Precision results for the four centers plus Coriolis when the QC of every level in each profile is assessed. On a level-by-level basis, the FoMs of all institutions except FNMOC fall. The FNMOC results remain approximately the same. The BoM QC is now only slightly preferred, possibly indicating that the advantage this institution has lies in profile-based selection criteria. The UKMO FoM has fallen below 0.5: the QC is flagging so many good levels as bad that the loss outweighs the benefit of removing the correctly flagged bad levels.
- Figure 6: The FoM, Recall and Precision results of level-based QC for the four centers plus Coriolis:
Figure 7 shows the results for the set of levels from profiles that are common to the BoM, FNMOC, Coriolis and MEDS QC. In this data set, the BoM QC has dropped behind both the Coriolis and the FNMOC QC, and is about equal with that of MEDS. No QC does well in this data set, though that of Coriolis has a clear advantage.
- Figure 7: The FoM, Recall and Precision results of level-based QC for only the common profiles in the data sets of BoM, Coriolis, FNMOC and MEDS:
Argo Delayed Mode Availability
Just a quick assessment of the rate at which Argo data is delayed-mode (DM) QC'd and available on the GODAE data server. Proportion of total profiles that have been QC'd:
| 1/6/2011 | 11% |
| 1/1/2011 | 26% |
| 1/6/2010 | 48% |
| 1/1/2010 | 67% |
| 1/6/2009 | 69% |
| 1/1/2009 | 74% |
So, if we go up to 2010 inclusive, the amount of DM data will drop off. This is not really a problem, as the data set we will be using will be defined by the Argo DM data, ie, only profiles from the centers that have undergone DM QC will be considered (at least at the beginning; Keith mentioned the possibility of extending the analysis up to real-time without the use of DM QC data). We should check the performance of each QC in time-lines à la Alastair's plots anyway.
Available Data as of 19/9/2011
| Institution | UKMO | MEDS | CRLS | BoM |
| Year Range | 4/10/2006 - 22/7/2008; 17/12/2008 - today | 21/6/2004 - 16/9/2011 | One day in 2007; very patchy 2008; 23/7/2009 - today | 1/1/2005 - 19/6/2011 |
| Continuity | Daily files | Monday, Wednesday, Friday | Daily files | Daily files |
| Summary/Detailed Flags? | Yes / Yes | ? / Yes | Yes / Yes | Yes / Yes |
| QC documentation? | No | No | No | Yes |
The FNMOC raw data is not on the server, but is continuous from 2006-2010 at least.
So we have 2006-present for three centers (FNMOC, MEDS and BoM). For UKMO we can only trust data from 2007 to mid-2008, and for Coriolis only from late 2009.
Jim's data processing has files for 2004-2010. These files contain the total QC accept/reject and QC flags for T (background and clim), S (background and clim), level, date and time, and position.
If we want to use Jim's processed data, which is a good idea at least for the beginning, then we will have to ask him to re-process the data once we get the centers to provide more. We have the raw data for each of the centers that includes their individual QC flags, so that is available for more in-depth later analysis. (I'm not sure about the MEDS raw data as it is in ASCII format and I haven't had time to write some code to read it.)
Looking at the UKMO data uploaded to Jim's server, it has changed format twice. The timing of these changes seems to coincide with the problems with the data, as seen in Figure 3. On 27/2/2007 the files change from "OCEANSOUND_LIST_{date}.GZ" to "OceanSound_longlist_{date}.gz". I am not sure if this changes the internal variables, as I'm not sure what format the files are in or how to read them, but it is approximately the time the UKMO data starts to be OK. Then on 17/12/2008 they change to netCDF files of the form "profile.obs.{date}.nc" and at this point the data goes bad again. It seems possible that Jim's processing code might not have changed with the format changes. This needs to be investigated.
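To check whether the data problems line up with the format changes, the uploaded files could first be sorted by naming scheme. The sketch below does this with regular expressions built from the three file-name patterns quoted above. The format labels, the function name and the form of the {date} token are assumptions: the token is captured as an opaque string because its layout is not stated here.

```python
import re

# Patterns taken from the three file-name forms quoted above. The {date}
# token is matched loosely since its exact layout is not known here.
UKMO_FORMATS = [
    ("list", re.compile(r"^OCEANSOUND_LIST_(?P<date>.+)\.GZ$")),
    ("longlist", re.compile(r"^OceanSound_longlist_(?P<date>.+)\.gz$")),
    ("netcdf", re.compile(r"^profile\.obs\.(?P<date>.+)\.nc$")),
]


def classify_ukmo_file(name):
    """Return (format_label, date_token), or None if no pattern matches."""
    for label, pattern in UKMO_FORMATS:
        match = pattern.match(name)
        if match:
            return label, match.group("date")
    return None


# Hypothetical file names, with a made-up date token.
for name in ["OCEANSOUND_LIST_20070101.GZ",
             "OceanSound_longlist_20070301.gz",
             "profile.obs.20081217.nc",
             "README.txt"]:
    print(name, "->", classify_ukmo_file(name))
```

Grouping the recall and precision time series by the returned label would show directly whether each change in quality coincides with a change in file format.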
There are two gaps in the UKMO raw data, one a few weeks (18/12/2007 to 4/1/2008), and one of five months (22/7/2008 to 17/12/2008). These will need to be filled in.
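Gaps like these can be found automatically once the file dates are known. The sketch below is one way to do it, assuming the dates have already been parsed from the file names; the function name and the example dates are hypothetical. It reports each gap as the last date present before the gap and the first date present after it.

```python
from datetime import date, timedelta


def find_gaps(dates, max_step=timedelta(days=1)):
    """Return (last_before, first_after) pairs where files are missing.

    A gap is reported whenever two consecutive dates are further apart
    than max_step.
    """
    ordered = sorted(set(dates))
    gaps = []
    for previous, current in zip(ordered, ordered[1:]):
        if current - previous > max_step:
            gaps.append((previous, current))
    return gaps


# Made-up daily series with a hole between 18/12/2007 and 4/1/2008.
example = [date(2007, 12, 16), date(2007, 12, 17), date(2007, 12, 18),
           date(2008, 1, 4), date(2008, 1, 5)]
for start, end in find_gaps(example):
    print(f"gap between {start:%d/%m/%Y} and {end:%d/%m/%Y}")
```

For a center that uploads on Monday, Wednesday and Friday only, `max_step` would need to be three days so that the normal Friday-to-Monday step is not reported as a gap.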
Wishlist:
If 2006 (or later) to 2010 is enough data, then we need:
- UKMO data to be uploaded from 18/12/2007 to 4/1/2008 and from 22/7/2008 to 17/12/2008.
- The other problems need to be resolved with UKMO data, either by re-uploading the odd data or by resolution of problems with the analysis code.
- If we want to add Coriolis, we need 1/1/2006-22/7/2009 to be uploaded, and hopefully some profile QC information included.
- Uploading of ECMWF data if they get on board.
- Jim to reprocess all the new data.
- Some form of documentation describing the QC selection criteria for each of the centers.
The good thing is that we have complete data from 3 centers, so we can start analysing that first and add the other centers as the data becomes available.
UPDATE 5/1/2012
We have decided to switch to using the raw data files found in
http://www.usgodae.org/pub/incoming/godae_qc/
rather than relying on Jim Cummings' processed data. There are several reasons for the change: we are not certain exactly how the translations from the institutional flags to Jim's flags are performed; we can neither discover the source of the gaps in the UKMO QC data nor fill them ourselves; we were unable until the last few days to contact Jim to obtain clarification of these issues; and we decided to move from the simple 'pass, fail, not assessed' profile QC framework to the percentage-of-levels-passed framework of the Argo real-time QC manual Table 2a, which would require a complete re-processing of Jim's data anyway.
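As a sketch of what the percentage-of-levels-passed framework involves, the function below assigns a profile flag from the number of good levels. The letter thresholds are written from memory of Argo reference table 2a (A for 100% good levels, then bands at 75%, 50% and 25%, and F for no good levels) and should be checked against the manual before use. The function name is hypothetical, and deciding which level flags count as "good" is left to the caller.

```python
def profile_qc_flag(n_good, n_total):
    """Return a profile quality letter from the share of good levels.

    Thresholds follow Argo reference table 2a as recalled here;
    verify against the manual. A blank is returned when there are no
    assessed levels, which is an assumption about how to treat that case.
    """
    if n_total <= 0:
        return " "
    if not 0 <= n_good <= n_total:
        raise ValueError("n_good must lie between 0 and n_total")

    percent_good = 100.0 * n_good / n_total
    if n_good == n_total:
        return "A"
    if percent_good >= 75.0:
        return "B"
    if percent_good >= 50.0:
        return "C"
    if percent_good >= 25.0:
        return "D"
    if n_good > 0:
        return "E"
    return "F"


# Example: flags for a 10-level profile with varying numbers of good levels.
print([profile_qc_flag(n, 10) for n in (10, 9, 5, 3, 1, 0)])
```

Each center's level flags would first have to be mapped onto a common good/bad definition before the percentages are comparable between centers.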
The UKMO data gaps have now been filled.
Documentation of some kind has been obtained for all centers. The FNMOC documentation is basic, and extrapolated from a document describing general QC tactics by Jim Cummings. We have now contacted Jim and requested further information.
Raw data for all centers except for FNMOC have been obtained and are being processed. Jim has been asked to provide access to the FNMOC data. ECMWF do not look like they will take part due to problems with getting the data in a useable format.
--
RobinWedd - 10 Aug 2011