Bibliography - Vascular Plants of Texas
Query the Full Text Index - Information

Query

   You can query the DFT reference collection on any word it contains, such as a plant name, an author name, or a date.  The system returns all records (documents) that include text matching your query string.

Boolean Query (not working right now)

   A Boolean query is needed when you search on more than one term.  Combine the terms with & (Boolean AND) and | (Boolean OR), and precede a term with ! (Boolean NOT).  For example, to find "sunflower" documents that also include pictures, use the query "sunflower&images".

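   The three operators correspond to set intersection, union, and difference over the documents that contain each term.  The following is a minimal Python sketch of that idea, not MG's implementation; the sample records and the postings() helper are invented for illustration.

```python
# Toy collection: document id -> text. These records are hypothetical.
docs = {
    1: "sunflower images Helianthus annuus",
    2: "sunflower checklist",
    3: "oak images Quercus",
}

def postings(term):
    """Return the set of document ids whose text contains the term as a word."""
    term = term.lower()
    return {doc_id for doc_id, text in docs.items()
            if term in text.lower().split()}

all_ids = set(docs)

# sunflower&images : documents containing both terms
both = postings("sunflower") & postings("images")

# sunflower|images : documents containing either term
either = postings("sunflower") | postings("images")

# sunflower&!images : documents containing "sunflower" but not "images"
without_images = postings("sunflower") & (all_ids - postings("images"))
```

   Here NOT is computed against the set of all document ids, which is why "!" is most useful when combined with another term.
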
Ranked Query  (not working right now)

    A ranked query returns the top r documents most relevant to the query, ordered by a similarity measure or some other measure.  The MG system uses a cosine measure, which considers the following factors:

1.  Which query terms exist in the collection;
2.  Frequency of the term in the document;
3.  Frequency of the term in the query;
4.  Number of documents in the collection;
5.  Number of documents containing a particular term.

    Some other factors are also considered.
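
    The factors listed above can be combined as in the following Python sketch.  It uses a common textbook TF-IDF weighting with cosine normalization; the exact weights MG uses are not given here, so the formulas, the rank() function, and the sample records are assumptions for illustration only.

```python
import math
from collections import Counter

def rank(query, docs, r=2):
    """Return the top r (doc_id, score) pairs by cosine similarity.

    Assumed weighting (not necessarily MG's): (1 + ln tf) * ln(1 + N / df).
    """
    n = len(docs)  # number of documents in the collection
    tokenized = {d: Counter(text.lower().split()) for d, text in docs.items()}
    # df[t] = number of documents containing term t
    df = Counter(term for counts in tokenized.values() for term in counts)

    def weights(counts):
        # Terms that do not exist in the collection are ignored.
        return {t: (1 + math.log(f)) * math.log(1 + n / df[t])
                for t, f in counts.items() if t in df}

    q = weights(Counter(query.lower().split()))
    q_norm = math.sqrt(sum(w * w for w in q.values()))
    scores = []
    for d, counts in tokenized.items():
        w = weights(counts)
        dot = sum(q[t] * w[t] for t in q if t in w)
        if dot > 0:
            d_norm = math.sqrt(sum(x * x for x in w.values()))
            scores.append((d, dot / (q_norm * d_norm)))
    # Highest similarity first, truncated to the top r documents.
    return sorted(scores, key=lambda s: s[1], reverse=True)[:r]

# Hypothetical records, for illustration only.
sample_docs = {
    1: "sunflower sunflower images",
    2: "oak images",
    3: "oak checklist",
}
top = rank("sunflower", sample_docs, r=2)
```

    Documents that share no term with the query get no score and are left out of the result.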

    In the current collection, each document is a full reference citation, i.e., author(s), date(s), title, and publication information.

About the Collection

    The collection is a 'merged' product derived from word processing files associated with ongoing Texas floristic projects and extracts from online data resources.  Initial contributors include George Diggs of Austin College (G. M. Diggs, Jr., B. L. Lipscomb, and R. J. O'Kennon.  Shinners & Mahler's Illustrated Flora of North Central Texas), Stanley Jones (S. D. Jones, J. K. Wipff, and P. M. Montgomery.  1997.  Vascular Plants of Texas: A Comprehensive Checklist Including Synonymy, Bibliography, and Index.  University of Texas Press), and Monique Reed (Reed, M. D.  1997.  Manual of the dicot flora of Brazos and surrounding counties.  Master's Thesis.  Texas A&M University, College Station).  Each document represents a reference citation.

What is the format of the collection?

   The collection is based on a spreadsheet file that lists over 3,000 references ordered by author and date.  Dr. Wilson processes this file, together with other data sources, to produce a large text file that includes HTML formatting and uses "#" as the delimiter between documents.
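
   As a rough illustration of how such a delimited file could be broken back into documents, here is a short Python sketch.  The exact layout of the file is not described above, so this sketch assumes that each "#" delimiter sits alone on its own line; the split_documents() function and the sample text are hypothetical.

```python
def split_documents(text):
    """Split a collection file into documents.

    Assumption: a line consisting only of '#' separates two documents.
    """
    documents, current = [], []
    for line in text.splitlines():
        if line.strip() == "#":
            # Delimiter reached: close the current document, if any.
            if current:
                documents.append("\n".join(current))
            current = []
        else:
            current.append(line)
    if current:
        # Keep a final document that has no trailing delimiter.
        documents.append("\n".join(current))
    return documents

# Hypothetical sample with two citations.
sample = (
    "<p>Author A. 1997. Title one.</p>\n"
    "#\n"
    "<p>Author B. 1998. Title two.</p>\n"
    "#\n"
)
records = split_documents(sample)
```

   Matching the delimiter only when it stands alone on a line avoids splitting on "#" characters that occur inside the HTML, for example in character entities.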

How is the collection updated?

   Dr. Wilson re-processes the base bibliography file whenever the spreadsheet changes, i.e., when references are added, removed, or altered.  All updates, corrections, and additions are made by Dr. Wilson or student workers at the TAMU herbarium.  All file production is done on Dr. Wilson's PC using a program he wrote (bibpages.ipf).  The resulting large text file is sent via FTP to the CSDL server and reindexed by an automated system that Dr. Wilson activates from a web page.  Automated re-indexing was established by students (Haiyan Wang and Jingchen Xu) as a project in Dr. John Leggett's course (CPSC670) in the Fall of 1999.  The next version of this system, now under development, will remove the TAMU Herbarium processing step.  Content will be corrected, adjusted, and expanded online by those involved with the DFT project, and the output, both the HTML pages and the index file, will be updated with each change in the base content.
 

The MG System

   The MG (Managing Gigabytes) system is a collection of programs that together make up a full-text retrieval system.  A full-text retrieval system lets you build a database from a set of documents and then query it to retrieve the relevant documents.  It is "full-text" in the sense that every word in the text is indexed, and a query operates only on this index to do the searching.  For example, one could build a database from the book "Alice in Wonderland", with each paragraph of the book treated as a document.  Having built the "Alice" database, one could run a query such as "cat alice grin" and retrieve any paragraph that matches the query.  The matching can be Boolean, meaning the retrieved paragraphs satisfy a Boolean expression of the query terms, e.g. "cat alice grin"; or it can be ranked, meaning the documents most relevant to the query are returned in order of relevance, using some standard heuristic measure.

   If you want to find particular information stored in computer text files, there are a few alternatives.  You can operate directly on the text files with utilities such as UNIX grep, or you can process the text files into some form of database.  Grep is generally limited to identifying lines by matching regular expressions.  If the collection of files grep operates on becomes large, repeated passes over the entire text for each query become expensive.  However, grep is simple to use because no auxiliary files need to be created.

   A database consists of some data and indexes into that data.  Indexes make it possible to query a large database quickly.  Standard databases divide the data into records made up of fields, so the granularity of search is a field.  In a full-text system such as MG there are no fields (or, equivalently, each document has an arbitrarily sized list of word fields); instead, every word is indexed.  This approach accepts free-form information and still gives fast searches.  The next question is how much overhead the database adds.  In MG most of the files produced are in compressed form; the two notable ones are the source data itself and the index, called an "inverted file".  With these files compressed, the database can be smaller than the source data.
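
   To make the idea of an inverted file concrete, the following Python sketch builds an uncompressed, in-memory one: a mapping from each word to the set of documents that contain it.  It is an illustration only; MG's actual file formats and compression are not shown, and the function names and sample paragraphs are hypothetical.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map every word to the set of document ids in which it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def lookup(index, word):
    """Return the ids of documents containing the word (empty set if none)."""
    return index.get(word.lower(), set())

# Hypothetical paragraphs standing in for documents.
paragraphs = {
    1: "the cat began to grin at alice",
    2: "alice was tired",
    3: "the queen shouted",
}
index = build_inverted_file(paragraphs)

# A query touches only the index entries for its terms, not the full text.
matches = lookup(index, "cat") & lookup(index, "alice") & lookup(index, "grin")
```

   Unlike a grep-style scan, answering the query reads only the entries for the query terms, so the cost does not grow with a pass over the whole text.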

   The most common use for MG has been as a search database for UNIX mail files.  However, any set of text data can be used; one just needs to decide what constitutes a document (see mgintro++(1)).  MG has also been used on large collections such as Comact (Commonwealth Acts of Australia), which is around 132 megabytes, and on collections of up to around 2 gigabytes for TREC (a mixture of collections such as the Wall Street Journal and Associated Press).


This document is modified from an original prepared by Haiyan Wang and Jingchen Xu, students in Dr. John Leggett's CPSC670 course in Fall 1999 who selected this system as a project.