Examples of Use

The textIR suite comes with a set of applications for building document term indexes, building term or document thesauri, and querying the indexes and thesauri. Some common usages are outlined below.

Using TREC data

textIR can be used to index TREC document sets marked up in SGML and to query the index using TREC topic files. textIR relies on UNIX file redirection to process the text, so all text is either redirected or piped into the index construction program.

Building an index from a TREC file

Assume the documents are in the file trec_documents.txt. If the file is compressed using gzip, use the command gunzip -dc trec_documents.txt.gz | dindex. If the file is not compressed, use dindex < trec_documents.txt. If the TREC documents are divided into many files, the program xargs can be used to direct all files into dindex.
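The multi-file case mentioned above can be sketched as follows. This is only one way to do it, and it rests on assumptions that are not stated in the text: the function name index_dir is hypothetical, the document files are assumed to end in .txt and to sit under a single directory, and dindex is assumed to accept the concatenation of all files on standard input.

```shell
# index_dir DIR [COMMAND]   (hypothetical helper, not part of textIR)
# Concatenates every *.txt file under DIR, in sorted order, and pipes the
# result into COMMAND, which defaults to dindex.
index_dir() {
    dir=$1
    indexer=${2:-dindex}
    # -print0 / -0 keep file names with spaces intact; sort -z fixes the order.
    find "$dir" -type f -name '*.txt' -print0 | sort -z | xargs -0 cat | "$indexer"
}

# Example call (assumes the files live in ./trec_parts):
#   index_dir ./trec_parts
```

Passing the indexing command as an optional second argument makes it possible to dry-run the pipeline with a harmless command such as cat before running the real index build.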

Querying using a TREC topic file

Assuming you are using the topic file topics.txt, if the file is compressed using gzip, use the command gunzip -dc topics.txt.gz | dquery > results.txt to have the resulting ranked document list saved in the file results.txt. If the file is not compressed, use dquery < topics.txt > results.txt.

To query a system with a PLSA thesaurus, issue the command dquery --thesaurus=plsa --expansion-mix=0.6 --expansion-terms=100 < query.file. Here --expansion-terms is the number of terms to extract from the thesaurus, and --expansion-mix sets the balance between the original query terms and the new thesaurus terms: --expansion-mix=1 uses no original query terms, and --expansion-mix=0 uses no thesaurus terms. Note: to use this option you must have built a PLSA thesaurus (see below).
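To see how the expansion mix affects the ranking, the same query file can be run at several mix values. The sketch below uses only the flags named above; the function name sweep_mix, the chosen mix values, and the results_mix_*.txt file names are illustrative assumptions, and a PLSA thesaurus is assumed to have been built already.

```shell
# sweep_mix QUERY_FILE [COMMAND]   (hypothetical helper, not part of textIR)
# Runs one query pass per expansion-mix value and saves each ranked list in
# its own file. COMMAND defaults to dquery.
sweep_mix() {
    queries=$1
    query_cmd=${2:-dquery}
    # 0 = original query terms only, 1 = thesaurus terms only.
    for mix in 0 0.3 0.6 1; do
        "$query_cmd" --thesaurus=plsa --expansion-mix="$mix" \
            --expansion-terms=100 < "$queries" > "results_mix_${mix}.txt"
    done
}

# Example call:
#   sweep_mix topics.txt
```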

For further options, run dquery --help.

Using CSV data

textIR can be used to index and query data from comma-separated value (CSV) files. This is useful for data that has already been converted to tokens in a table. textIR relies on UNIX file redirection to process the text, so all text is either redirected or piped into the index construction program.

Building an index from a CSV file

Assume the data is in the file csv_table.csv. If the file is compressed using gzip, use the command gunzip -dc csv_table.csv.gz | dindexCsv. If the file is not compressed, use dindexCsv < csv_table.csv. If the CSV documents are divided into many files, the program xargs can be used to direct all files into dindexCsv.
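When the CSV data is both split across many files and gzip-compressed, the two steps above can be combined. The following is a sketch under assumptions: the function name index_csv_gz_dir is hypothetical, the parts are assumed to be named *.csv.gz, and the files are assumed to carry no header rows (a header in each part would be fed to the indexer as if it were data).

```shell
# index_csv_gz_dir DIR [COMMAND]   (hypothetical helper, not part of textIR)
# Decompresses every *.csv.gz file under DIR, in sorted order, and pipes the
# combined rows into COMMAND, which defaults to dindexCsv.
index_csv_gz_dir() {
    dir=$1
    indexer=${2:-dindexCsv}
    find "$dir" -type f -name '*.csv.gz' -print0 | sort -z \
        | xargs -0 gunzip -dc | "$indexer"
}

# Example call (assumes the parts live in ./csv_parts):
#   index_csv_gz_dir ./csv_parts
```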

Querying using a CSV file

Assume the queries are in the file table.csv, where each line of the file is a separate query. If the file is compressed using gzip, use the command gunzip -dc table.csv.gz | dquery > results.txt to save the resulting ranked document list in the file results.txt. If the file is not compressed, use dquery < table.csv > results.txt.

For further options, run dquery --help.

Building a thesaurus

To build a thesaurus, first construct the document index of the text (see above). Once the index is constructed, the thesaurus program analyses the term frequencies in the index to build a thesaurus that can be used for term expansion.

Co-occurrence thesaurus

A term co-occurrence thesaurus simply contains, for each pair of terms, the number of documents in which the two terms appear together. To build such a thesaurus, issue the command dcot once the document index has been constructed.

LSA thesaurus

A latent semantic analysis (LSA) thesaurus contains term relationships constructed using LSA. To build an LSA thesaurus, issue the command dlsa -t once the document index has been constructed. To change the number of topics in the LSA computation, use the flag -e. For example, dlsa -t -e 300 will compute LSA with 300 topics and construct the associated thesaurus. To generate an index instead of a thesaurus, use the command dlsa -i -e 300.

PLSA thesaurus

A probabilistic latent semantic analysis (PLSA) thesaurus contains term relationships constructed using PLSA. To build a PLSA thesaurus, issue the command dplsa once the document index has been constructed. To change the number of topics in the PLSA computation, use the flag -e. For example, dplsa -e 300 will compute PLSA with 300 topics and construct the associated thesaurus. The number of iterations used to compute the PLSA topics can be controlled using the -i flag. For example: dplsa -e200 -i50 will compute 200 topics using 50 iterations.
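Putting the pieces together, a complete run from raw documents to expanded-query results might look like the sketch below. It uses only the commands and flags described in this document. The function name build_plsa_run, the file names in the example call, and the particular topic and iteration counts are illustrative assumptions, as is the assumption that all three programs are run from the same working directory so that they find the same index.

```shell
# build_plsa_run DOCS QUERIES RESULTS   (hypothetical helper, not part of textIR)
# 1. index the documents, 2. build a PLSA thesaurus, 3. query with expansion.
# Each step runs only if the previous one succeeded.
build_plsa_run() {
    docs=$1
    queries=$2
    results=$3
    dindex < "$docs" &&
    dplsa -e 300 -i 50 &&
    dquery --thesaurus=plsa --expansion-mix=0.6 --expansion-terms=100 \
        < "$queries" > "$results"
}

# Example call:
#   build_plsa_run trec_documents.txt topics.txt results.txt
```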