Hugh D. Wilson
Department of Biology, Texas A&M University, College Station, Texas, USA.
Stephan L. Hatch
Department of Rangeland Ecology and Management, Texas A&M University, College Station, Texas, USA.
This paper describes the initial result of the confluence of these activity streams: the Herbarium Specimen Browser (http://www.csdl.tamu.edu/FLORA/tracy/main1.html). Section 2 provides some background about our working group and the botanical collections known as herbaria. Section 3 explains implementation details behind the Specimen Browser. Section 4 describes some desirable properties the Specimen Browser possesses, which are reflective of principles that other Web designers may wish to consider. Section 5 concludes with speculations about future work.
Our working group is fortunate in that even before the current collaboration, several of the participants from biology had begun developing Web materials on their own, and were therefore proficient in the Web technologies of the time (HTML markup and the structuring of Web information spaces) as well as in the use of commercial database programs. Consequently, they have been able to maintain information structured according to botanical needs and to maintain and develop quite a bit of the group's Web infrastructure, leaving the computer scientists to develop the "advanced" Web features.
Since its formation the group has had as a long-term goal the transfer of the information in the S.M. Tracy Herbarium into electronic form. The herbarium, one of approximately 2600 in the world, is a collection of plant specimens which have been dried, pressed, and glued to cardstock sheets. Each specimen sheet has a label containing information on the collector, the location of collection, an accession number (a number uniquely identifying the specimen within the collection), and an identification of the specimen via a Latin scientific name, along with an indication of the taxonomist responsible for associating that name with that species. The process of assigning scientific names to plant species is one fraught with dispute and subject to continual evolution; as a result, many specimen sheets have annotations reflecting re-identification by later investigators.
The specimens in herbaria are vital to the practice of systematic botany, the branch of the field dealing with taxonomy. Herbarium specimens form the foundation of plant nomenclature, in that all scientific names (and the procedures for assigning them) are ultimately linked to specific type specimens. Herbarium collections are also important in the construction of floristic manuals, or floras. A flora is a compilation, for a given region, of all of the plants in a given branch of a taxonomy of all plants (e.g., all grasses, or all flowering plants), together with their distributions within that region and other information. A flora is deemed to possess greater veracity when the distributions in it are documented by herbarium specimens, not simply field observations. The more than 1 million specimens housed in Texas herbaria provide a base of hard data that can be used for these floristic summaries and any study dealing with Texas plants.
Our Herbarium Specimen Browser uses, as a source database, the results of an initial data-gathering pass over the Tracy Herbarium's collection, still in process at this time. At present only specimens collected from the state of Texas are being recorded. For each of those, the following items are being recorded: accession number and source herbarium, collector's name, a collector-specific ID number for the specimen, date of collection, county (within Texas) of collection, and scientific name (along with some special codes relating that name to a global taxonomy). Future data-gathering passes will involve specimens not from Texas, data in annotations, and images of the plants themselves.
Our systems make use of MG's full-text retrieval facilities to, in some sense, emulate the query functions of a relational database. "Documents" are formed from a table's individual records. Each field is prefixed with a unique string. For fields that consist of strings lacking spaces, the prefix plus field contents thus forms a "word" which the full-text retrieval system can search for. As a result, one can retrieve the "records" containing desired field values by retrieving documents containing desired "words". If a field contains free text including spaces, one can search for a desired word within the field, then check the returned document for presence of the prefix in the proper place to eliminate false matches.
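The prefixing idea can be illustrated with a minimal sketch. It does not use MG itself: a small in-memory inverted index stands in for the full-text retrieval system, and the prefix strings (`FAM_`, `GEN_`, `CTY_`) and function names are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical field prefixes; the actual strings used are not given in the text.
PREFIXES = {"family": "FAM_", "genus": "GEN_", "county": "CTY_"}

def record_to_document(record):
    """Turn one table record into a list of prefixed 'words'."""
    return [PREFIXES[field] + value for field, value in record.items()]

def build_index(records):
    """Map each prefixed word to the set of document numbers containing it."""
    index = defaultdict(set)
    for doc_num, record in enumerate(records):
        for word in record_to_document(record):
            index[word].add(doc_num)
    return index

def select(index, **criteria):
    """Emulate 'WHERE field = value AND ...' by intersecting word lookups."""
    hits = None
    for field, value in criteria.items():
        docs = index.get(PREFIXES[field] + value, set())
        hits = docs if hits is None else hits & docs
    return sorted(hits or [])

records = [
    {"family": "Araceae", "genus": "Arisaema", "county": "Brazos"},
    {"family": "Araceae", "genus": "Arisaema", "county": "Robertson"},
    {"family": "Poaceae", "genus": "Andropogon", "county": "Brazos"},
]
index = build_index(records)
```

Here `select(index, family="Araceae", county="Brazos")` returns the document numbers of the matching "records". Because each word carries its field prefix, a search for a county name cannot accidentally match the same string appearing in another field.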
For applications such as ours where updates are infrequent, MG is much more convenient to use than a standard database system. Since the collections are read-only, much of the overhead caused by transaction management and concurrency facilities is eliminated. Also, the retrieval system is optimized heavily with regard to query speed by front-loading much computation into the collection construction phase.
An illustration of the Specimen Browser in action is seen in Figure 1. The top frame, which is static, contains a title and some information about the current database being viewed: the herbaria from which the specimens are drawn, the number of specimens, and the number of taxonomic families, genera, and species that those specimens are part of. The frame on the left contains a number of controls which are used to change what is displayed in the frame on the right. Initially, that frame, generated by a CGI program, consists of a view of the database at the family level, listing each family represented by specimens, and for each family, the number of genera, species, and specimens contained in it.
The family names in this display are HTML anchor sites. Selecting one of them causes an "expansion" of the display to show a listing of the genera (represented by specimens) contained in that family. One can then select one of the genera to see a list of species in the genus. Selecting an already expanded item causes its "contraction". Figure 2 shows the results of expanding the family Araceae, and within that family, the genus Arisaema. Selecting "Araceae" again would cause all sub-items under it to disappear.
The first is through the use of two of the controls in the control bar - the "Show All Counties" button and the list box with Texas county names. Selecting a county name from the list box causes an immediate update of the contents of the right frame; any expanded items are still shown as expanded, but the display reflects the existence of specimens from the selected county, rather than the whole state. Selecting an already selected item deselects it, and more than one item can be selected at a time. Figure 3 shows the result of selecting Brazos and Robertson counties from the list, starting from the state of the previous figure. (Texas A&M University is located in Brazos county; Robertson county is directly to the north.) Note that the Araceae and the genus Arisaema are still expanded, but fewer items are displayed, reflecting the lack of specimens of those other items from these two counties. Also, note that the totals indicating numbers of genera, species, and specimens have changed to reflect the restriction to these counties as well. As stated before, all of this information is updated immediately on selection or deselection of a county from the list. (One can deselect all list items, returning to the "all counties" display, by pressing the "Show All Counties" button above the list.)
Another method for county filtering is graphical. Pressing the "select from map" button causes a map of Texas to appear in the right frame, with the currently-selected counties colored in. Figure 4 shows the result of such action (the two colored counties are Brazos and Robertson counties). Clicking a county on the map will cause the corresponding entry in the list to be selected or deselected appropriately, as well as updating the map; clicking a name in the list will cause the map to be updated in an analogous way. In this manner, one can build up a region of inquiry; when finished, pressing the "show taxon tree" button will redisplay the list of items, updated appropriately with respect to the new list of selected counties.
Some remarks on how this filtering feature is implemented efficiently are appropriate. Simply displaying which families, genera, or species are located in a given set of counties is very straightforward - a simple boolean search will suffice. However, running totals of specimens and the various taxonomic categories are also displayed. These totals are updated automatically when the filter specification is changed, and are not amenable to precomputation - since there are 254 counties in Texas, this would require computing 2^254 totals (one for each possible subset of counties) for every item!
When the collection is generated, the record documents are sorted in "taxonomic" order (that is, alphabetically by family, genus, and species name). During this generation phase, a set of files is generated listing the range of document numbers that each family, genus, and species falls in; since the sorting is taxonomic, the range of a genus is contained in the range of its family, and that of a species within its genus. As mentioned above, MG has several options for retrieving search results, one of which is to return the numbers of all matching documents. To generate the list of items, a boolean search is performed for documents containing one of the desired counties, but retrieving the document numbers only. These returned numbers are examined with respect to the various lists, counting the number of hits per taxonomic grouping; this is always done for all families, but only for genera and species when they have been expanded in the viewer. As a result there is no need to perform expensive string comparisons on every returned document; instead, the system manages to perform something like the SQL "select - group by" statement with the full-text retrieval system. (When no counties are selected, the database is not even queried - the category ranges are examined directly.)
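The counting step can be sketched as follows. This is an illustration under stated assumptions rather than the system's actual code: the range table `FAMILY_STARTS`, the list of hits, and the function names are hypothetical, and a binary search is just one reasonable way to locate the range containing a document number.

```python
from bisect import bisect_right

# Hypothetical range table built at collection-generation time. Documents are
# sorted taxonomically, so each family owns one contiguous block of numbers.
# Entries are (first_document_number, name), in ascending order.
FAMILY_STARTS = [(0, "Acanthaceae"), (4, "Araceae"), (9, "Poaceae")]

def count_by_group(doc_numbers, starts):
    """Count hits per group using only integer comparisons on document numbers."""
    firsts = [first for first, _ in starts]
    counts = {name: 0 for _, name in starts}
    for doc in doc_numbers:
        group = bisect_right(firsts, doc) - 1  # last range starting at or before doc
        counts[starts[group][1]] += 1
    return counts

def count_all(starts, total_docs):
    """With no county filter, totals come straight from the range sizes."""
    bounds = [first for first, _ in starts] + [total_docs]
    return {name: bounds[i + 1] - bounds[i] for i, (_, name) in enumerate(starts)}

# Document numbers as they might be returned by a county query (made-up values).
hits = [1, 4, 5, 8, 12]
counts = count_by_group(hits, FAMILY_STARTS)
```

The same function applies unchanged to genus and species range tables, which is why those need only be consulted for items the viewer has expanded.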
Figure 5 shows the results of following the "specimens" link associated with the family Araceae in Figure 3. Each specimen is given one line in this summary, showing its herbarium and accession number, scientific name, collector, and county of collection (here, either Brazos or Robertson). If the link for Arisaema dracontium were followed after switching to "full data" mode, the result would be as in Figure 6. Here all information for each specimen is listed.
More discussion of the mapping system is warranted here. Most of the Web tools the working group has developed have involved some sort of clickable map feature. To begin with, a bitmap image of the map in question is acquired and manually edited to remove irregularities. The bitmap is then processed by a program which identifies connected regions, i.e., the counties. At that point, another manual process is undertaken whereby each region is identified, i.e., a mapping between the names of counties and the internally assigned numbers of the regions is determined. After this process is completed, a file is generated which encodes the map's regions using a run length scheme: the file is a list of (region,length) pairs, essentially listing which pixel belongs to which region in a left-right, top-bottom raster scan.
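The run-length step can be sketched briefly. The example assumes the region-identification pass has already produced a grid of region numbers, one per pixel; the function name and the tiny 4 x 3 grid are hypothetical. It also assumes that a run may continue across the end of a scan line, which the text does not state either way.

```python
def encode_runs(labels):
    """Run-length encode a 2-D grid of region numbers in raster-scan order.

    Returns a list of (region, length) pairs covering every pixel,
    left to right and top to bottom.
    """
    runs = []
    for row in labels:
        for region in row:
            if runs and runs[-1][0] == region:
                runs[-1][1] += 1          # extend the current run
            else:
                runs.append([region, 1])  # start a new run
    return [tuple(run) for run in runs]

# A made-up 4 x 3 "map" with three regions numbered 0, 1, and 2.
grid = [
    [0, 0, 1, 1],
    [0, 2, 2, 1],
    [2, 2, 2, 1],
]
runs = encode_runs(grid)
```

For maps made of large uniformly labeled areas, such as counties, the list of runs is much shorter than the list of pixels.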
This encoded file is used both to generate the maps and to provide clickable images. The encoding scheme allows for quick and easy construction of an image in memory, which can then be passed to a public domain GIF creation function. Also, the scheme allows for easy mapping from a mouse click to a region number: simply multiply y by the width and add x to get a pixel offset, then start adding up run lengths from the file. When one exceeds the pixel offset, one has identified the selected region. There is no need to worry about generating complex bounding polygons or applying winding rules. We believe this technique has a great deal of applicability to "irregular" image maps of all kinds. Also, it appears to be much faster than using a general GIS system to generate the maps.
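Both uses of the encoded file can be sketched in a few lines. The run list `RUNS`, the image width, and the function names are hypothetical illustration values; the lookup follows the procedure described above (compute the pixel offset, then accumulate run lengths until the offset is exceeded).

```python
def region_at(runs, width, x, y):
    """Map a mouse click at (x, y) to a region number."""
    if not 0 <= x < width or y < 0:
        raise ValueError("click falls outside the encoded image")
    offset = y * width + x
    covered = 0
    for region, length in runs:
        covered += length
        if covered > offset:
            return region
    raise ValueError("click falls outside the encoded image")

def decode(runs, width):
    """Rebuild the in-memory image as rows of region numbers."""
    flat = [region for region, length in runs for _ in range(length)]
    return [flat[i:i + width] for i in range(0, len(flat), width)]

# Made-up encoding of a 4-pixel-wide, 3-pixel-high map with regions 0, 1, 2.
RUNS = [(0, 2), (1, 2), (0, 1), (2, 2), (1, 1), (2, 3), (1, 1)]
WIDTH = 4
```

In this sketch a click at x = 1, y = 1 gives a pixel offset of 5, and `region_at(RUNS, WIDTH, 1, 1)` walks the runs until the running total passes 5. To produce a displayed map, each region number in the decoded rows would be replaced by that region's color before the image is handed to a GIF writer.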
To achieve greater efficiency in the construction of the maps, another MG collection is generated from the specimen database, this time with the records sorted by county. Document numbers are retrieved via the query mechanism in the same way as described above for the list, but the groups formed are county clusters rather than taxonomic categories. Certain specimen records are specially tagged as representatives of their species to make species-density mapping easier, by ensuring that only one "representative" of each species exists per county.
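One way the representative tagging might work is sketched below. This is an assumption made for illustration, not a description of the actual system: the first record seen for each (county, species) pair is tagged, so that counting tagged records in a county yields the number of distinct species there. The record layout and function names are hypothetical.

```python
def tag_representatives(records):
    """Tag the first record of each (county, species) pair as its representative.

    records: list of (county, species) tuples, sorted by county.
    Returns a parallel list of booleans.
    """
    seen = set()
    tags = []
    for county, species in records:
        key = (county, species)
        tags.append(key not in seen)
        seen.add(key)
    return tags

def species_density(records, tags):
    """Count tagged records per county, i.e. distinct species per county."""
    density = {}
    for (county, _), is_representative in zip(records, tags):
        density[county] = density.get(county, 0) + int(is_representative)
    return density

# Made-up specimen records, already sorted by county.
records = [
    ("Brazos", "Arisaema dracontium"),
    ("Brazos", "Arisaema dracontium"),
    ("Brazos", "Arisaema triphyllum"),
    ("Robertson", "Arisaema dracontium"),
]
tags = tag_representatives(records)
density = species_density(records, tags)
```

With the tags stored in the collection, a species-density map needs only a search for tagged records followed by a count per county cluster, with no per-record comparison of species names at query time.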
Overviews and filtering. Much of our previous work in mapping geographic distributions and our current work on the Herbarium Specimen Browser is motivated by the desire to give biologists meaningful overviews of large quantities of data. In this sense our work has a certain affinity with other digital library projects such as the Visible Human project [North96]. The idea is to provide a general overview which allows the discernment of global patterns, coupled with the ability to quickly investigate details if desired. In the phraseology of graphic theorist Jacques Bertin [Bertin81], the ideal is to produce a system which allows one to see, or perceive immediately, not simply to read, or perceive over time.
The entire idea of the expanding, contracting, and filterable list (similar to Nelson's notion of stretchtext [Nelson93]) came about as an attempt to realize this principle. The initial family-level overview allows one to see how specimens are distributed through the collection by family. Interesting families can then be expanded if desired and the resultant subtotals displayed. If one wishes to restrict one's attention to a specific geographic area, one can do so while still maintaining the context of one's attention to particular taxonomic items.
Computers remember perfectly and forget perfectly; humans do not. This is a property that should be exploited in achieving the above principle: viewers looking for generalities should not be forced to rely on their own memories. This motivated the implementation of the "list" versus "full data" options for viewing specimen sets. The list option allows one to look for certain general patterns, such as preponderances of collectors, without having to page through large amounts of other data. The full data option, however, allows one to see everything that is recorded about small sets, rather than forcing one to visit each specimen in turn via the list and remember the details.
It is important not only to make patterns visible, but also to avoid generating false ones. A staple of botanical literature is the dot map, a map of a region, divided into subregions, with dots or other symbols in the subregions indicating the presence of something. This is an adequate system for indicating individual facts but does not indicate general patterns well, as the dots in regions generally have the same visual salience as the regional borders. Our mapping system thus completely shades regions to allow these patterns to be detected easily (as opposed to some other botanical information systems which simply replicate familiar dot maps). However, we had to take care to avoid generating false patterns. Initial experiments with our maps used red and green to symbolize the high and low ends of ranges, with a blending to indicate the middle. Unfortunately, this created a color which had greater visual salience than either endpoint, creating false impressions. This effect disappeared when we switched to a single-color scheme. (Bertin's work [Bertin81] [Bertin83] contains many useful guidelines for map designers regarding what can and cannot be signified by color, value, shape, etc., and how those variables should relate to the actual data to avoid false patterns. Similar insights can be gleaned from the work of Tufte [Tufte83].)
Incidentally, our overview system has had immediate payoffs, in that it has allowed us to easily detect certain kinds of errors in the specimen data entry process. For example, to the trained eye, misspelled scientific names practically leap out of the list.
Regularity. The displays in our system are very rich in links. This gives the impression of an extensive information field which viewers can explore in an unrestrained manner. However, we avoid disorientation by having links lead to destinations (or trigger actions) in a uniform manner. In addition, link sources are uniform as well - simple rules indicate if a link should be present and are never violated. (They are thus instantiations of what DeRose calls annotation, as opposed to associative, links [DeRose89].) This is not to say that all designers should attempt to impose uniformity on their information spaces, but it demonstrates that Web systems are suitable for constructing tools to explore detailed information spaces with regular contours.
One aspect of the information "space" that is amenable to processing but is currently not utilized is the time dimension. Time-series display of, say, the activities of a given collector represented in an herbarium might be interesting, but it is not clear how to do this in a straightforward yet effective way.
It will also be interesting to see what other kinds of searches we can perform using MG. Searching for records on multi-word fields (like collector name) is easily done. However, complex searches on date ranges, for example (such as finding all specimens collected between a pair of dates), are difficult to perform efficiently using our current date representations; in the future we will be investigating alternate representations more suited to the searches a full-text retrieval system can perform.
[Bertin83] Jacques Bertin. Semiology of Graphics. Madison, WI: University of Wisconsin Press, 1983.
[DeRose89] Steven J. DeRose. Expanding the notion of links. Proceedings of the Second ACM Conference on Hypertext (Hypertext '89), Pittsburgh, PA, November 1989, 249-257.
[Nelson93] Theodor Holm Nelson. Literary Machines. Sausalito, CA: Mindful Press, 1993.
[North96] Chris North, Ben Shneiderman, and Catherine Plaisant. User controlled overviews of an image library: a case study of the Visible Human. Proceedings of the First ACM International Conference on Digital Libraries (DL '96), Bethesda, MD, March 1996, 74-82.
[Tufte83] Edward R. Tufte. The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press, 1983.
[Witten94] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. New York, NY: Van Nostrand Reinhold, 1994.