Checklist - Vascular Plants of Texas
Query the Full Text Index - Information
A "Boolean Query" is needed when you query on more than one term and need to combine the terms using & (boolean AND) and | (boolean OR). You can also precede a term with ! (boolean NOT). For example, to search for a "sunflower" species that has associated photos, use the query term "sunflower&images". This query returns a listing of only those species documents that include both strings. The full document can be viewed by clicking on the name.
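The way the three operators combine can be sketched in a few lines of Python. This is only an illustration and not how MG itself is implemented: the function names, the sample documents, and the simple precedence rule (! binds tightest, then &, then |, with no parentheses) are all assumptions made for the example.

```python
import re

def build_index(docs):
    """Map each lowercase word to the set of document names containing it."""
    index = {}
    for name, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index.setdefault(word, set()).add(name)
    return index

def boolean_query(query, index, all_names):
    """Evaluate a query built from & (AND), | (OR) and ! (NOT)."""
    result = set()
    for or_part in query.split("|"):
        matched = set(all_names)            # start from every document
        for term in or_part.split("&"):
            term = term.strip().lower()
            if term.startswith("!"):
                # NOT: drop documents that contain the term
                matched -= index.get(term[1:].strip(), set())
            else:
                # AND: keep only documents that contain the term
                matched &= index.get(term, set())
        result |= matched                   # OR: merge the alternatives
    return result

# Hypothetical species documents, invented for this sketch.
docs = {
    "Helianthus annuus": "common sunflower images available",
    "Helianthus maximiliani": "Maximilian sunflower",
    "Quercus virginiana": "live oak images available",
}
index = build_index(docs)
hits = boolean_query("sunflower&images", index, docs.keys())
```

With these sample documents, "sunflower&images" matches only the one entry that contains both words, while "sunflower|oak" would match all three.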
A ranked query orders the top r documents that are relevant to the query according to a similarity measure or some other measure. The MG system uses a cosine measure, which considers the following factors:
1. Query terms which exist in the collection;
2. Frequency of term in document;
3. Frequency of term in query;
4. Number of documents in collection;
5. Number of documents containing particular term;
Some other factors are also considered.
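To show how these factors can interact, here is a minimal Python sketch of a cosine ranking. The exact weighting MG uses is not given here, so the logarithmic term and document-frequency weights below are a common textbook choice taken as an assumption, and the function name rank and its parameters are hypothetical.

```python
import math
from collections import Counter

def rank(query, docs, r=10):
    """Return the top r (name, score) pairs ordered by cosine similarity."""
    n_docs = len(docs)                                   # factor 4
    doc_tf = {name: Counter(text.lower().split())        # factor 2
              for name, text in docs.items()}
    df = Counter()                                       # factor 5
    for tf in doc_tf.values():
        df.update(tf.keys())

    def weights(tf):
        # Terms that do not occur in the collection are ignored (factor 1).
        return {t: (1 + math.log(f)) * math.log(1 + n_docs / df[t])
                for t, f in tf.items() if t in df}

    q_w = weights(Counter(query.lower().split()))        # factor 3
    q_norm = math.sqrt(sum(w * w for w in q_w.values()))
    scores = []
    for name, tf in doc_tf.items():
        d_w = weights(tf)
        dot = sum(w * d_w.get(t, 0.0) for t, w in q_w.items())
        if dot > 0:
            d_norm = math.sqrt(sum(w * w for w in d_w.values()))
            scores.append((name, dot / (q_norm * d_norm)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:r]

# Hypothetical documents, invented for this sketch.
example_docs = {
    "a": "sunflower sunflower yellow",
    "b": "oak tree",
    "c": "sunflower oak",
}
ranked = rank("sunflower", example_docs)
```

Documents that share no term with the query are left out of the result, and dividing by the two vector lengths keeps long documents from winning simply because they contain more words.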
This type of selective query will be more useful when each species entry carries more information. It usually returns a larger subset of species documents, because any document containing at least one of the strings in a multiple-string query is returned from the index.
Like the "Boolean Query", this option returns the full text from the checklist full text index rather than a clickable listing of species.
are located or adjustments are needed, the spreadsheet file can be edited
(see edit tracking).
The new version is then processed to 'refresh' or update HTML pages and
the full text index. Future development of this view of the Texas
flora, in terms of composition and nomenclature, will probably involve
reference - during processing - to the combined Texas
herbarium specimen database. Texas herbaria contributing
to this resource provide firm statements with regard to 'accepted' names
in that names present in individual herbarium datasets represent names
in use by curators of Texas collections. Thus, reference to the combined
specimen data set could allow a 'consensus' check in at least some of those
cases where taxonomic opinion comes into play. This checklist system
could be dynamic, i.e., refreshed each time the herbarium specimen database
is updated, and this would reflect curatorial decisions of Texas botanists.
If one wants to find particular information stored in a computer text file, there are a few alternative courses of action. One can operate directly on the text files with utilities such as UNIX grep, or process the text files into some form of database. Grep is generally limited to identifying lines by matching regular expressions. If the collection of files that grep operates on becomes large, repeated passes over the entire text on each query become expensive. However, its usage is simple, as no auxiliary files need to be created.
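The cost of the grep approach is easy to see in a small Python sketch of the same idea. This is not grep itself: the function grep_lines and the in-memory sample files are hypothetical and serve only to show that every query rereads every line.

```python
import re

def grep_lines(pattern, files):
    """Yield (file name, line number, line) for each line matching pattern."""
    regex = re.compile(pattern)
    for name, text in files.items():
        # Every query walks over every line of every file again.
        for line_number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                yield name, line_number, line

# Hypothetical file contents, held in memory to keep the sketch self-contained.
sample_files = {
    "a.txt": "Helianthus annuus\nQuercus alba",
    "b.txt": "helianthus sp.",
}
matches = list(grep_lines(r"Helianthus", sample_files))
```

Nothing is built ahead of time, which is why the approach is simple to use and why its running time grows with the size of the whole collection.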
A database consists of some data and indexes into that data. By having indexes one can query a large database quickly. Standard databases divide the data into records of fields, so the granularity of search is a field. In a full-text system such as MG there are no fields (or there is an arbitrarily sized list of word fields per document); instead every word is indexed. With this method the system can accept free-form information and still be fast on searches. The next question is the overhead of this database. In MG most of the files produced are in compressed form. The two notable compressed files are the given data and the index, called an "inverted file". Compressing these files makes it possible for the database to be smaller than the source data.
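A minimal Python sketch of an inverted file may help here. It maps every word to the list of document numbers in which the word occurs, and then stores the differences (gaps) between consecutive document numbers, which is one common first step toward compressing such lists. The function names and sample documents are assumptions for illustration, and the sketch does not reproduce MG's actual file formats or coding schemes.

```python
import re

def build_inverted_file(documents):
    """Map every word to the sorted list of document numbers containing it."""
    postings = {}
    for doc_num, text in enumerate(documents, start=1):
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            postings.setdefault(word, []).append(doc_num)
    return postings  # each list is sorted because doc_num only increases

def to_gaps(doc_nums):
    """Replace document numbers by the differences between neighbours."""
    return [n - prev for prev, n in zip([0] + doc_nums, doc_nums)]

def from_gaps(gaps):
    """Recover the document numbers by summing the gaps."""
    doc_nums, total = [], 0
    for gap in gaps:
        total += gap
        doc_nums.append(total)
    return doc_nums

# Hypothetical documents, invented for this sketch.
documents = [
    "Helianthus annuus sunflower",
    "Quercus virginiana oak",
    "Helianthus maximiliani sunflower",
]
inverted = build_inverted_file(documents)
sunflower_gaps = to_gaps(inverted["sunflower"])
```

Looking up a word is then a single dictionary access instead of a pass over the text, and the gaps tend to be small numbers that a compact integer code can store in few bits.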
The most common use for MG has been as a
search database on UNIX mail files. However, any set of text data
can be used; one just needs to determine what constitutes a document (see
MG has also been used on large collections such as Comact (Commonwealth
Acts of Australia) which is around 132 megabytes and also on sizes up to
around 2 gigabytes for TREC (a mixture of collections such as the Wall
Street Journal and Associated Press).
This document is modified from an original prepared by Haiyan Wang and Jingchen Xu, students in Dr. John Leggett's CPSC670 (Fall 1999) who selected this system as a project.