IR2003S_Homework

IR Homework Page

Homework #2: Classic Retrieval Models

Homework #3: Query Expansion and Term Reweighting

Homework #4: HMM/N-gram-based and PLSI Retrieval Models

The query-document relevance information (AssessmentTrainSet.txt) for a set of 16 queries on a collection of 2,265 documents is provided. An IR model was tested on this query set, and the corresponding ranking results were saved in a file (ResultsTrainSet.txt). Please evaluate the overall model performance using the following two measures.

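1. Interpolated Recall-Precision Curve: computed for each query, then averaged over all queries for the overall performance.

2. (Non-interpolated) Mean Average Precision: the mean over all queries of the non-interpolated average precision, where "non-interpolated average precision" is the "average precision at seen relevant documents" measure introduced in the textbook.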
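As a rough illustration, both measures might be computed along the following lines. This is only a sketch under stated assumptions: the function names, the 11 evenly spaced recall levels, and the in-memory input (a ranked list of document IDs plus a set of relevant IDs per query) are hypothetical and do not reflect the actual layout of AssessmentTrainSet.txt or ResultsTrainSet.txt. Interpolated precision is taken here as the highest precision observed at any recall at or above the given level, which is one common convention; the textbook's definition may differ in detail.

```python
def precision_recall_points(ranking, relevant):
    """Return (recall, precision) at each rank where a relevant document appears."""
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points


def interpolated_precision(ranking, relevant, levels=11):
    """Interpolated precision at evenly spaced recall levels 0.0 .. 1.0."""
    points = precision_recall_points(ranking, relevant)
    curve = []
    for j in range(levels):
        level = j / (levels - 1)
        # Assumed convention: best precision at any recall >= this level.
        candidates = [p for r, p in points if r >= level - 1e-12]
        curve.append(max(candidates, default=0.0))
    return curve


def average_curve(rankings, qrels, levels=11):
    """Overall performance: average the per-query curves level by level."""
    curves = [interpolated_precision(rankings.get(q, []), rel, levels)
              for q, rel in qrels.items()]
    if not curves:
        return [0.0] * levels
    return [sum(c[j] for c in curves) / len(curves) for j in range(levels)]


def average_precision(ranking, relevant):
    """Average of the precision values at each seen relevant document.

    Assumption: divides by the total number of relevant documents, which
    matches averaging over seen relevant documents when the ranking
    covers the whole collection.
    """
    points = precision_recall_points(ranking, relevant)
    return sum(p for _, p in points) / len(relevant) if relevant else 0.0


def mean_average_precision(rankings, qrels):
    """Mean of the per-query average precision values."""
    aps = [average_precision(rankings.get(q, []), rel) for q, rel in qrels.items()]
    return sum(aps) / len(aps) if aps else 0.0
```

Here `rankings` and `qrels` are dictionaries keyed by query ID; how they are filled from the two provided files depends on the file format.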

Homework #2: Classic Retrieval Models

A set of 16 text queries and a collection of 2,265 text documents are provided, in which each word is represented as a number, except that the number "-1" is a delimiter. Implement an information retrieval system based on the Vector (Space) Model with different term weighting schemes. The query-document relevance information is in "AssessmentTrainSet.txt". You should evaluate your system with the two measures described in HW#1.
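A minimal sketch of one possible vector space ranker is given below. It assumes a log-scaled TF weight, a plain log(N/df) IDF, and cosine similarity; this is just one of the many term weighting schemes the assignment asks you to compare. All function names are hypothetical, and `read_terms` assumes whitespace-separated integers with "-1" tokens simply dropped, which may not match what the delimiter actually separates in the provided files.

```python
import math
from collections import Counter


def read_terms(text):
    """Parse whitespace-separated word numbers, dropping the -1 delimiter (assumed)."""
    return [int(tok) for tok in text.split() if tok != "-1"]


def build_idf(docs):
    """IDF per term from a dict of doc_id -> list of word numbers."""
    n = len(docs)
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))
    return {t: math.log(n / d) for t, d in df.items()}


def tfidf_vector(terms, idf):
    """Sparse TF-IDF vector; terms unseen in the collection are ignored."""
    tf = Counter(terms)
    return {t: (1.0 + math.log(c)) * idf[t] for t, c in tf.items() if t in idf}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query_terms, docs, idf):
    """Return doc IDs sorted by decreasing cosine similarity to the query."""
    q = tfidf_vector(query_terms, idf)
    scores = {d: cosine(q, tfidf_vector(terms, idf)) for d, terms in docs.items()}
    return sorted(scores, key=lambda d: (-scores[d], d))
```

The ranked list returned by `rank` can then be scored with the two HW#1 measures. For a real run, the document vectors would be computed once and cached rather than rebuilt for every query.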

Homework #3: Query Expansion and Term Reweighting

Add query expansion and term reweighting to the retrieval system you built in HW#2. Either (automatic) relevance feedback or local analysis may be used as the strategy, but local analysis is preferred.
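One way local analysis could look is sketched below, assuming an unnormalized association measure: for each query term, the terms that co-occur with it most strongly in the top-ranked documents of the initial retrieval are added to the query with a reduced weight. The function name, the `per_term` and `expansion_weight` parameters, and the weighting itself are hypothetical choices for illustration, not a prescribed method.

```python
from collections import Counter


def expand_query(query_terms, top_docs, per_term=2, expansion_weight=0.5):
    """Return a dict term -> weight for the expanded, reweighted query.

    top_docs is a list of term lists for the top-ranked documents of the
    initial retrieval (the local document set).
    """
    freqs = [Counter(doc) for doc in top_docs]
    weights = {t: 1.0 for t in query_terms}  # original terms keep full weight
    for u in dict.fromkeys(query_terms):     # unique terms, original order
        assoc = Counter()
        for f in freqs:
            if u in f:
                for v, fv in f.items():
                    if v != u:
                        # Association: sum over local docs of f(u, d) * f(v, d).
                        assoc[v] += f[u] * fv
        best = sorted(assoc.items(), key=lambda kv: (-kv[1], kv[0]))[:per_term]
        for v, _ in best:
            weights.setdefault(v, expansion_weight)
    return weights
```

The resulting weights can be multiplied into the query vector before the cosine ranking is run a second time.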

Homework #4: HMM/N-gram-based and PLSI Retrieval Models

Implement either the HMM/N-gram-based retrieval model or the PLSI retrieval model. A set of query exemplars associated with topic information (a list of 819 queries and query files) is provided. In addition, the topic information for the document collection and a word-based unigram language model estimated from a general corpus are provided.
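For the HMM/N-gram-based option, a common unigram formulation scores a document by the likelihood of the query under a mixture of the document's own unigram model and the general-corpus unigram model. The sketch below assumes that formulation with a fixed mixture weight `lam`; in practice the weight would be trained, for example on the query exemplars. The function name, the `floor` guard against log(0), and the dict-based background model are hypothetical.

```python
import math
from collections import Counter


def query_log_likelihood(query_terms, doc_terms, background, lam=0.5, floor=1e-12):
    """log P(Q|D) with P(w|D) linearly interpolated with a background unigram model.

    background maps a word to its probability under the general-corpus
    unigram language model.
    """
    tf = Counter(doc_terms)
    n = len(doc_terms)
    score = 0.0
    for w in query_terms:
        p_doc = tf[w] / n if n else 0.0
        p_bg = background.get(w, 0.0)
        # Mixture of the document model and the background model.
        score += math.log(max(lam * p_doc + (1.0 - lam) * p_bg, floor))
    return score
```

Documents are then ranked by decreasing score for each query.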

Homework #5: A Web-based IR System

An archive file, Word_Information.rar, is provided. Each file in it represents a spoken document, and each line of a document contains the following information about one word:

POS_IN_DOC WD_NAME WD_ID Begin_Time-1 End_Time Acoustic_Score Confidence_Score1 Confidence_Score2

You can skip the word "SIL" (which means a silence segment).

1. Use overlapping character bigrams as index terms to build your own retrieval system.
2. Implement the inverted file structure for document indexing.
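The two steps above can be sketched as follows. The sketch assumes that the fields on each line are whitespace-separated with the word name in the second field, and that bigrams are taken over the concatenated characters of a document, so they may span word boundaries. The function names and the simple matching-count score are hypothetical.

```python
from collections import Counter, defaultdict


def parse_word_line(line):
    """Return the WD_NAME field, assuming whitespace-separated columns."""
    return line.split()[1]


def char_bigrams(words, skip=("SIL",)):
    """Overlapping character bigrams over the concatenated words, skipping silence."""
    text = "".join(w for w in words if w not in skip)
    return [text[i:i + 2] for i in range(len(text) - 1)]


def build_inverted_index(docs):
    """Map each bigram to a postings dict of doc_id -> term frequency."""
    index = defaultdict(dict)
    for doc_id, words in docs.items():
        for gram in char_bigrams(words):
            postings = index[gram]
            postings[doc_id] = postings.get(doc_id, 0) + 1
    return index


def search(index, query_text):
    """Rank documents by the summed frequency of the query's bigrams."""
    scores = Counter()
    for gram in char_bigrams([query_text]):
        for doc_id, tf in index.get(gram, {}).items():
            scores[doc_id] += tf
    return sorted(scores, key=lambda d: (-scores[d], d))
```

Only the postings lists of the query's bigrams are visited at query time, which is the point of the inverted file structure. A TF-IDF or language-model score could replace the raw count.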