Journal Entry

I got a lot of great questions through our 'Ask the Team' tab from a chemistry teacher who, after reading the March 10 entry on data quality control titled 'Amazing discovery or data error?', wanted more information. Her questions were directed to Mary and Kristin, so I passed them along. Mary did such a great job responding that all I am going to do is add a couple of pictures here and there.

Question: Not all things in science come out the way they are supposed to for lots of reasons like human error, instrument error, equipment error. How often would errors in data happen?

Mary: As you said, data errors can happen because of equipment failure and/or operator error. On some cruises, we have very few problems; on others, we have a lot. We seem to have more equipment failures when weather is bad. This cruise has an additional problem: very cold air that can freeze water in the pump tubes and either cause temporarily bad readings (until the ice melts) or damage the sensors. I have seen cold air freeze collected water in Niskin bottles.

Rosette hanging in the air. The rosette hangs for a couple of minutes on its way into the water and again when it comes out. The cold winds can freeze sensor components and produce bad data.

The CTD that collects the continuous data can have sensor issues from electrical connections gone bad, jellyfish or other organic matter fouling the sensors or plugging the pump tubes, and/or failure of the pumps that pull water past the sensors. Bottle lanyards can get hung up on one of the other wires and fail to release until partway up the cast, or bottles can leak. Some of these problems are tricky to diagnose because they only happen under water, and we can only look at the instruments on deck.

Samplers can sample out of the wrong Niskin bottle, or enter the wrong volume-calibrated flask number in their data files. We try to avoid a lot of these problems by prevention and cross-checks: we have a 'sample cop' who keeps track of who has sampled which Niskin bottle, and samples are collected in numerical order (deepest bottle first) with a pre-defined sample order (gas samples go first). Chemicals we use in analyses can go bad, or there can be air leaks in an analytical system that cause it to spit out wrong values.

Niskin bottles on the rosette. The Niskin bottles are very close together on the rosette. There can be lanyard problems, or samplers can sample from the wrong bottle. That is why we have a sample cop who makes sure everybody is at the right bottle.

Question: What do you do with this outlier data? Do you leave it in and state what may have caused the error?

Mary: Yes, we leave suspicious data in our files, but add a quality code of 'questionable' or 'bad', always logging comments about why we coded a sample that way. The codes and comments are sent with the data to the data repository at CCHDO. The codes we use can be found on page 18 at:

http://cchdo.ucsd.edu/Data_Evaluation_reference.pdf

Scientists who use the data years from now will then know to be wary of a 'questionable' value; or they might choose to use the values because, in hindsight, they represent real features (such as the 'Meddies' data mentioned at the end of Juan's data error blog).
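For readers who like to see an idea as code, here is a minimal Python sketch of the "flag it, don't delete it" approach Mary describes. It is not the team's actual software: the `Sample` class, the text labels in `VALID_FLAGS`, the function names, and the oxygen numbers are all made up for illustration, and the real quality codes are the ones defined in the CCHDO reference linked above.

```python
from dataclasses import dataclass

# Hypothetical text labels for illustration only; the real codes are
# defined in the CCHDO reference document.
VALID_FLAGS = {"acceptable", "questionable", "bad"}


@dataclass
class Sample:
    bottle: int
    oxygen: float              # made-up value, arbitrary units
    flag: str = "acceptable"
    comment: str = ""


def flag_sample(sample, flag, comment):
    """Attach a quality flag and comment without changing the measured value."""
    if flag not in VALID_FLAGS:
        raise ValueError(f"unknown flag: {flag}")
    if flag != "acceptable" and not comment.strip():
        raise ValueError("a 'questionable' or 'bad' flag needs a logged comment")
    sample.flag = flag
    sample.comment = comment
    return sample


def trusted_values(samples):
    """What a cautious later user might select: only unflagged values."""
    return [s.oxygen for s in samples if s.flag == "acceptable"]


# Three invented samples from one cast; the second looks suspicious.
cast = [Sample(1, 210.5), Sample(2, 95.0), Sample(3, 208.9)]
flag_sample(cast[1], "questionable", "far below neighbours; possible bottle leak")
```

The suspicious sample stays in `cast` with its value of 95.0 untouched. Only `trusted_values` skips it, so a later user who decides the feature was real can still get the number back.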

Here's an example of cross-checks for dissolved oxygen data: the sampling flask numbers are called out to the sample cop by the sampler, and the samples are put in orderly rows into slotted boxes, with the 'start' and 'end' corners marked. When the analyst runs the samples, s/he checks the flask number in-hand against the sampling log; if they don't match, a comment is added to the data file and the actual flask number is used in the analysis log.

Box with numbered bottles for the oxygen sampling. A mistake can happen when the right bottle order is not followed.

We double-check these flask numbers at the data processing end. If we see a suspicious data value, we look for a change in flask order in the box since the previous cast, or data entry errors. If we see an error that is obvious (the flask number in the data file doesn't match what's in the box), we 'comment out' the original data in our files, and edit in a new data line with the correct number, which is then re-uploaded into our database. Any change like this would always be accompanied by a logged comment.
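The flask cross-check can also be sketched in a few lines of Python. This is an illustration under assumptions, not the scripts used on board: the function name `cross_check_flasks`, the list-of-tuples layout, and the flask numbers are all invented for the example.

```python
def cross_check_flasks(sampling_log, flasks_in_box):
    """Compare logged flask numbers with the flasks actually found in the box.

    sampling_log:  list of (niskin_bottle, flask_number) in sampling order.
    flasks_in_box: flask numbers read off the box, in the same order.
    Returns (analysis_log, comments).
    """
    if len(sampling_log) != len(flasks_in_box):
        raise ValueError("log and box hold different numbers of flasks")

    analysis_log = []
    comments = []
    for (bottle, logged), actual in zip(sampling_log, flasks_in_box):
        if logged != actual:
            # Every mismatch is recorded as a comment rather than silently fixed.
            comments.append(
                f"bottle {bottle}: log says flask {logged}, box holds flask {actual}"
            )
        # The flask actually in hand is the one that goes into the analysis log.
        analysis_log.append((bottle, actual))
    return analysis_log, comments


# Invented example: flasks 102 and 103 were swapped in the box.
log = [(1, 101), (2, 102), (3, 103)]
box = [101, 103, 102]
analysis_log, comments = cross_check_flasks(log, box)
```

In this made-up case the two swapped flasks each produce one comment, and the analysis log carries the flask numbers that were really in the box, which mirrors the rule that the actual flask number is used and the discrepancy is always written down.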

Question: I take it that the two people that work with that overwhelming amount of data and numbers must love dealing with it. Was it a difficult job to begin with? What kind of training did you need for this type of job?

Most of the data analysts and data processors have a Bachelor's or Master's degree in a physical science (chemistry, physics, biology). Others are Master's or PhD students working toward their degrees, some using the data in their thesis work. The data are also reviewed by one or both of the chief scientists on board (both PhDs with years of experience), who compare historical data (if any) from the area to our data to look for real changes or systematic problems.

Mary: I have a bachelor's degree in chemistry from UC San Diego. I wanted to be a criminalist (like CSI), but ran into a seagoing chemist job at Scripps a year or so after graduating. I got so seasick I nearly lost my job - but there was a huge amount of back-logged CTD data to be processed, so I was asked if I'd like to shift over to that area. It turned out I had a knack for it, and enjoyed it... the chemistry degree gave me the background to do it, but 95% of what I do has been learned on the job. I had little or no computer experience when I started - now I upload operating systems and all of our software to empty computers, write scripts (non-compiled programs) to look for potential data problems, modify and recompile programs as needs change... and go to sea all the time. (If you can get past the first few days of seasickness, it almost always gets better and you get your 'sea legs'.) I also love puzzles (jigsaw puzzles, sudoku, etc.), something CSI and data processing have in common. And I can sit in one place (at my computer) for the 12+ hours a day I work at sea, doing somewhat the same thing over and over (7 days a week for the whole 66 days - we don't get days off or holidays out here). I know the data we produce will be used for decades by many scientists around the world, to record and predict what's going on with our environment.