We think the hawks are back! Last year we had a small family of red-tailed hawks that lived in a nearby park. A couple times a week they would come over onto the campus and go hunting. We'd see them perched halfway up a tree, or up on the gutters just below the roof line of the building. The female was pretty big, the male was about the size of a large crow, and the juveniles (there were one or two; we were never sure which) had grown to the size of the male by the end of the summer. A couple of people think they saw the female return in the last few days. I've been keeping an eye out for her.
Today I'm doing two things at once: testing the changes I described in my last journal, and doing some database work.
The optical platters we use cost about $300 each, so we try to make sure things are working pretty well before we actually start writing data. I've been "burning aluminum" all day, and I'm pretty confident that things are working correctly. Suzanne (another DADS developer) and I have been working on this project since about Halloween, and we're both relieved to see it coming close to the end. And we're anxious to get on to the next phase of work for SM-97.
I mentioned "burning aluminum" above. I call it that because when we write to an optical disk, a laser in the drive blows little pits in a very thin sheet of aluminum trapped between two layers of transparent plastic. When we write, a high-powered laser burns out the pits. When we read, a lower-powered laser looks to see what those pits look like. Once you've burned the pits into the aluminum, it's permanent. You can't erase it, and we expect the disks to be good for at least 20 years, and maybe as much as 100 years.
CD-ROMs (and music CDs) work basically the same way, but the pits are "stamped" with a pressing machine rather than blown out with a laser. The low-power laser in your CD player works the same way as the one in our optical disk drives.
To test a new version of the programs that add data to the archive, we run a standard set of test data through the system. There are about 700 files, for a total of 305 megabytes of data. That's about half of a typical CD-ROM, or about one twentieth of our big disks. It takes a couple hours to run all the data through, and that leaves me time to work on my other problem.
The database work I'm doing involves figuring out how much space we should reserve when we are making tapes to send to astronomers.
I'm trying to figure out just how big HST datasets are. A dataset is a collection of files that together hold all the data for an image or spectrum. For WFPC-II (Wide-Field and Planetary Camera Two - the camera that takes most HST pictures) this is a pretty constant number: about 25 megabytes in 10 files. It's pretty constant because the camera takes the same sort of pictures all the time. Each picture is four "chips" in an 800x800 array. (A typical PC screen has 1024x768 pixels -- a single WFPC-II chip is just slightly smaller, but square.) There are a total of about 40 bytes of information about each pixel, including calibrated and uncalibrated values, quality information, and other stuff. Since the size of the picture doesn't change, the size of the dataset doesn't change either.
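If you want to see how those numbers hang together, here's a quick back-of-the-envelope sketch in Python. It only uses my round numbers from above (an 800x800 picture at roughly 40 bytes per pixel, spread across about 10 files); the real file layout is messier than this.

    # Back-of-the-envelope estimate of a WFPC-II dataset size, using the
    # round numbers from the paragraph above; not the real file layout.
    pixels_per_picture = 800 * 800      # the 800x800 array described above
    bytes_per_pixel = 40                # calibrated + uncalibrated + quality + misc

    dataset_bytes = pixels_per_picture * bytes_per_pixel
    print(dataset_bytes / 1e6)          # about 25.6 million bytes -- close to the ~25 MB figure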
For the spectrographs, the size of the dataset can vary a lot. This is because a single dataset can contain multiple spectra. In the case of the Goddard High Resolution Spectrograph, it can vary from just 38 kilobytes to over 300 megabytes!
But what I want is a "pretty good" estimate of the size of each kind of dataset, which I can use to plan how much space I'll need to retrieve a particular set of data. To get a statistical look at the data, I have this nice complicated query that gets the minimum, maximum, and average size of "Z-CAL" datasets. "Z-CAL" datasets are CALibrated science data for the GHRS. (Each instrument has a letter associated with it: U is for WFPC-II, X is for FOC, Z is for GHRS.) Once I have all that data, I can also compute the "standard deviation", which is a kind of average of how far the individual sizes are from the average size. That gives me an idea of how much variation there is in size.
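I can't paste the real query here, but the flavor of the calculation looks something like this little Python sketch. The sizes are made-up stand-ins (in kilobytes), not actual archive numbers; in real life the list of sizes comes out of a query against the archive catalog.

    # A sketch of the statistics I pull out of the catalog for "Z-CAL" datasets.
    # These sizes (in kilobytes) are invented for illustration only.
    import statistics

    zcal_sizes_kb = [38, 120, 450, 900, 2400, 15000, 80000, 310000]

    print("min:", min(zcal_sizes_kb))
    print("max:", max(zcal_sizes_kb))
    print("average:", statistics.mean(zcal_sizes_kb))
    # The standard deviation measures how spread out the sizes are around the average.
    print("standard deviation:", statistics.stdev(zcal_sizes_kb))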
Here's another example: If ten people take a test, and they all score between forty and sixty points, with an average of fifty points, that's a pretty low standard deviation. If another group of ten take the test, and half of them score about 20, while the other half score about 80, the average would still be 50, but the standard deviation would be pretty big.
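Here's that example worked out with made-up scores, just to show the numbers. Both groups average 50, but the spread is very different.

    import statistics

    group_a = [42, 45, 48, 48, 50, 50, 51, 53, 55, 58]   # everyone between 40 and 60
    group_b = [19, 20, 20, 21, 22, 78, 79, 80, 80, 81]   # half near 20, half near 80

    print(statistics.mean(group_a), statistics.stdev(group_a))   # average 50, spread about 4.7
    print(statistics.mean(group_b), statistics.stdev(group_b))   # average 50, spread about 31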
When you see a large standard deviation like that, you have to decide if you're seeing different "populations". For example, if you have a test aimed at eighth graders, and you get five people who score about 20 and five who score about 80, that large deviation makes you wonder whether the five who scored 20s were actually second graders!
In my case, I've discovered there are two types of GHRS observations: short, small observations with one or a few spectra, and large observations that have many spectra. The observing "mode" I see for the large ones is "RAPID", and I'll have to get one of the astronomer types to explain that operating mode to me.
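Once I suspect two populations like that, the natural next step is to split the statistics up by observing mode and look at each group separately. Here's a rough sketch of that idea; the sizes are invented, and aside from "RAPID" (which I actually see in the catalog) the mode name is just a placeholder.

    # Split made-up (size in megabytes, observing mode) pairs into groups and
    # compute statistics per group -- one way to check for separate populations.
    import statistics
    from collections import defaultdict

    observations = [
        (0.04, "OTHER"), (0.3, "OTHER"), (1.2, "OTHER"),
        (45.0, "RAPID"), (120.0, "RAPID"), (310.0, "RAPID"),
    ]

    by_mode = defaultdict(list)
    for size_mb, mode in observations:
        by_mode[mode].append(size_mb)

    for mode, sizes in by_mode.items():
        print(mode, "average:", statistics.mean(sizes),
              "standard deviation:", statistics.stdev(sizes))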
That's the kind of math I do pretty regularly: Statistical analysis of the contents of the archive. I rarely need to do any calculus, though I know enough to understand how the mathematical "tools" I use work. But I do a lot of algebra, and use programs that have statistical functions.
Well, my big test is finished, and while most things are working, there are a couple of problems I need to work on. I'm going to take a break, get something to drink, and see if I can spot that hawk before I tackle those problems.