Projects
Approximate clustering of very large scale data
Many machine learning, data mining and statistical analysis tasks are used to identify properties of data, or to build models of sampled data. Unfortunately, these analytical methods are computationally expensive: their resource use grows with both the dimensionality and the sample count of the data being analysed. Recent advances in data acquisition (such as genetic sequencing) have allowed us to capture very large amounts of data (of the order of terabytes) for analysis and modelling.
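One common way to trade a little accuracy for a large reduction in computation is mini-batch clustering, which updates cluster centroids from small random samples instead of full passes over the data. The sketch below is illustrative only; the data is synthetic and the initialisation and parameters are assumptions, not a description of this project's method:

```python
import numpy as np

def minibatch_kmeans(X, k, batch_size=100, n_iter=200, seed=0):
    """Approximate k-means: update centroids from small random batches
    rather than full passes over the data, trading accuracy for speed."""
    rng = np.random.default_rng(seed)
    # farthest-point initialisation so the centroids start well spread
    centroids = np.empty((k, X.shape[1]))
    centroids[0] = X[rng.integers(len(X))]
    for i in range(1, k):
        d = ((X[:, None, :] - centroids[None, :i, :]) ** 2).sum(-1).min(1)
        centroids[i] = X[d.argmax()]
    counts = np.zeros(k)
    for _ in range(n_iter):
        batch = X[rng.choice(len(X), batch_size, replace=False)]
        # assign each batch point to its nearest centroid
        labels = ((batch[:, None, :] - centroids[None, :, :]) ** 2).sum(-1).argmin(1)
        for lbl, x in zip(labels, batch):
            counts[lbl] += 1
            # per-centroid running mean over every point ever assigned to it
            centroids[lbl] += (x - centroids[lbl]) / counts[lbl]
    return centroids

# two well-separated clusters of synthetic points
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (500, 2)), rng.normal(5.0, 0.1, (500, 2))])
centroids = minibatch_kmeans(X, k=2)
```

Each batch touches only `batch_size` points, so the cost per iteration is independent of the total sample count, which is what makes this family of approximations attractive at terabyte scale.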
Automatic thesaurus construction using non-linear term relationships
Text-based information retrieval systems retrieve documents based on the set of key terms provided to them. The documents returned are ranked according to the count of each query term; therefore, if the query terms do not exist in a document, that document is not found. Latent semantic analysis (LSA) is a method of computing hidden topics within documents using linear algebra. By obtaining the relationships between each hidden topic and each term, we are able to determine which terms are similar by comparing each term's topic weights.
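The LSA step can be sketched with a truncated SVD of a term-document matrix: terms that never co-occur in a document can still end up close in the reduced topic space. The vocabulary and counts below are made up for illustration:

```python
import numpy as np

# tiny term-document count matrix (rows = terms, cols = documents);
# the vocabulary and counts are invented for this sketch
terms = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [2, 1, 0, 0],   # car
    [1, 2, 0, 0],   # automobile
    [1, 1, 0, 0],   # engine
    [0, 0, 2, 1],   # flower
    [0, 0, 1, 2],   # petal
], dtype=float)

# truncated SVD: keep only the k strongest "hidden topics"
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
term_topics = U[:, :k] * s[:k]   # each row: a term's weight on each topic

def similarity(i, j):
    """Cosine similarity between two terms in topic space."""
    a, b = term_topics[i], term_topics[j]
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "car" and "automobile" share topics; "car" and "flower" do not
sim_car_auto = similarity(0, 1)
sim_car_flower = similarity(0, 3)
```

Because similarity is computed in topic space rather than raw term space, a query for "car" can surface a document that only says "automobile", which is exactly the limitation of keyword matching that LSA addresses.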
Discovering document model deficiencies for information retrieval
Text-based information retrieval systems are built using document models. To analyse the retrieval precision of a model, a set of queries is provided to the model and the results are compared to the desired results. This type of analysis allows us to compare the precision of different retrieval models, but it does not provide us with any feedback on where the models could be improved. Currently, no analysis methods for text retrieval systems are able to show where deficiencies lie within the document model.
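The kind of evaluation described above can be sketched as a precision-at-k comparison. The document identifiers, rankings and relevance judgements below are hypothetical; note that the scores rank the two models but say nothing about *why* one falls short, which is the gap this project targets:

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

# hypothetical relevance judgements and rankings from two models
relevant = {"d1", "d4", "d7"}
model_a = ["d1", "d4", "d2", "d7", "d9"]
model_b = ["d3", "d1", "d5", "d6", "d4"]

p_a = precision_at_k(model_a, relevant, 5)   # 3 of 5 relevant -> 0.6
p_b = precision_at_k(model_b, relevant, 5)   # 2 of 5 relevant -> 0.4
```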
Efficient and effective use of Language Models for Information Retrieval
Language models for text based information retrieval have become a de facto standard due to their simplicity and effectiveness. Recently, several language modelling techniques have been developed that assume a hidden distribution of topics within the set of documents. Such methods include Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA), where the former uses a multinomial distribution of topics, while the latter uses a Dirichlet prior. By using this notion of hidden topics, we are able to compute relationships from term to topic and hence term to term.
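The term-to-topic-to-term computation mentioned above can be sketched directly: given a term-topic matrix P(w|z) and a topic prior P(z), Bayes' rule gives P(z|w), and marginalising over topics gives a term-term relation P(w'|w). The vocabulary and probabilities below are made up for illustration:

```python
import numpy as np

# invented term-topic matrix: P(term | topic), each column sums to 1
terms = ["ball", "goal", "team", "stock", "market"]
p_w_given_z = np.array([
    [0.40, 0.00],   # ball
    [0.30, 0.00],   # goal
    [0.20, 0.10],   # team
    [0.05, 0.50],   # stock
    [0.05, 0.40],   # market
])
p_z = np.array([0.5, 0.5])                     # topic prior

# Bayes: P(z | w) = P(w | z) P(z) / P(w)
p_w = p_w_given_z @ p_z
p_z_given_w = (p_w_given_z * p_z) / p_w[:, None]

# marginalise over topics: P(w' | w) = sum_z P(w' | z) P(z | w)
p_w2_given_w = p_z_given_w @ p_w_given_z.T
```

In this toy example "ball" and "goal" load on the same topic, so P(goal|ball) comes out much larger than P(stock|ball), even though no documents are consulted at query time.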
Improving the quality of Web information retrieval using multi-resolution link analysis
The World Wide Web is the most important information source in modern society. The information within the Web is accessed through Web search engines such as Google and Yahoo. These search engines compute a global popularity rank, based on the links between all of the pages on the Web, and use it when computing their results; the rank used by Google is known as PageRank. Relying on a single global popularity rank biases the search results towards globally popular Web pages.
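The global rank in question can be sketched as the standard PageRank power iteration over a link graph. The four-page graph and damping factor below are assumptions for illustration only:

```python
import numpy as np

# tiny invented link graph: links[i] = pages that page i links to
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n = 4
d = 0.85                                  # conventional damping factor

# column-stochastic transition matrix: M[dst, src] = 1 / outdegree(src)
M = np.zeros((n, n))
for src, outs in links.items():
    for dst in outs:
        M[dst, src] = 1 / len(outs)

# power iteration on r = (1 - d)/n + d * M r
r = np.full(n, 1 / n)
for _ in range(100):
    r = (1 - d) / n + d * (M @ r)
```

Here page 2 is linked to by every other page, so it ends up with the highest score regardless of which query is being answered; that query-independence is precisely the source of the global-popularity bias described above.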
Relevance-based document models for Information Retrieval
Document models are used for information retrieval in order to compute the probability of a query being related to the document. The majority of document models are functions of the terms that appear within the document. This implies that a query is only relevant to a document if the query terms exist within the document, which is far from the truth.
In our project, we have created a new form of document model, called a relevance-based document model, which is built from the relevance of each query to the document rather than from the words that appear within the document.
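The term-based assumption being criticised can be made concrete with the standard query-likelihood scoring that most document models reduce to; this sketch shows the baseline, not the relevance-based model itself, and the toy vocabulary and smoothing parameter are invented:

```python
import math
from collections import Counter

def query_likelihood(query, doc_terms, collection_terms, mu=2000.0):
    """Log query likelihood with Dirichlet smoothing. With mu = 0 it
    degenerates to the pure term-count model, where any query term
    missing from the document drives the score to -infinity."""
    doc = Counter(doc_terms)
    coll = Counter(collection_terms)
    score = 0.0
    for t in query:
        p = (doc[t] + mu * coll[t] / len(collection_terms)) / (len(doc_terms) + mu)
        if p == 0.0:
            return float("-inf")
        score += math.log(p)
    return score

# invented toy data: the document never mentions "automobile"
collection = ["car", "car", "engine", "automobile", "wheel", "road"]
document = ["car", "engine"]
query = ["car", "automobile"]

unsmoothed = query_likelihood(query, document, collection, mu=0.0)  # -inf
smoothed = query_likelihood(query, document, collection, mu=10.0)   # finite
```

Even with smoothing, the score is still a function only of term occurrences; a relevance-based model instead scores the document by how relevant past queries were to it.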
Web page prefetching using temporal link analysis
Web searching should be as simple as providing the search engine with a query and having it return a link to the desired Web page. Unfortunately, current Web search engines use text-based queries and therefore require the user to provide keywords. Converting the user's information need into a few key words is not a simple process. As a result, Web search sessions typically involve the user visiting many Web pages that do not satisfy their information need, interleaved with repeated visits to the search engine.