IR Homework Page
Homework #1: Evaluation Measures
Homework #2: Classic Retrieval Models
Homework #3: Query Expansion and Term Reweighting
Homework #4: HMM/N-gram-based and PLSI Retrieval Models
Homework #1: Evaluation Measures
The query-document relevance information (AssessmentTrainSet.txt) for a set of 16 queries on a collection of 2,265 documents is provided. An IR model was tested on this query set, and the corresponding ranking results were saved in a file (ResultsTrainSet.txt). Please evaluate the overall model performance using the following two measures.
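1. Interpolated Recall-Precision Curve: compute the curve for each query, then average the per-query curves over all queries to report the overall performance.
2. (Non-interpolated) Mean Average Precision: the mean, over all queries, of the non-interpolated average precision, where "non-interpolated average precision" is the "average precision at seen relevant documents" introduced in the textbook.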
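The two measures above can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the formats of AssessmentTrainSet.txt and ResultsTrainSet.txt are not described here, so the code works on in-memory structures (a ranked list of document IDs and a set of relevant document IDs per query), and all function names are hypothetical. Interpolation uses the common rule that the precision at a recall level is the maximum precision observed at any equal or higher recall, over 11 standard recall levels. Average precision divides by the total number of relevant documents, which coincides with averaging over the seen relevant documents whenever the ranking contains every relevant document.

```python
from typing import Dict, List, Sequence, Set, Tuple


def precision_recall_points(ranking: Sequence[str], relevant: Set[str]) -> List[Tuple[float, float]]:
    """(recall, precision) measured at each rank where a relevant document is seen."""
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points


def interpolated_curve(ranking: Sequence[str], relevant: Set[str], levels: int = 11) -> List[float]:
    """Interpolated precision at evenly spaced recall levels from 0.0 to 1.0."""
    points = precision_recall_points(ranking, relevant)
    curve = []
    for i in range(levels):
        level = i / (levels - 1)
        # Small tolerance guards against floating-point rounding in recall.
        candidates = [p for r, p in points if r >= level - 1e-9]
        curve.append(max(candidates, default=0.0))
    return curve


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Non-interpolated average precision (assumption: divide by all relevant docs)."""
    if not relevant:
        return 0.0
    points = precision_recall_points(ranking, relevant)
    return sum(p for _, p in points) / len(relevant)


def evaluate(rankings: Dict[str, Sequence[str]], judgments: Dict[str, Set[str]]):
    """Return (mean interpolated curve, mean average precision) over all queries."""
    curves = [interpolated_curve(rankings[q], judgments[q]) for q in rankings]
    mean_curve = [sum(column) / len(curves) for column in zip(*curves)]
    mean_ap = sum(average_precision(rankings[q], judgments[q]) for q in rankings) / len(rankings)
    return mean_curve, mean_ap
```

Calling `evaluate(rankings, judgments)` with one entry per query returns the averaged 11-point curve and the MAP value; parsing the two provided files into these dictionaries is left to the reader.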
Homework #2: Classic Retrieval Models
A set of text queries (16 queries) and a collection of text documents (2,265 documents) are provided, in which each word is represented as a number, except that the number "-1" is a delimiter. Implement an information retrieval system based on the Vector (Space) Model with different term weighting schemes. The query-document relevance information is in "AssessmentTrainSet.txt". You should evaluate your system with the two measures described in HW#1.
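As a starting point, the Vector Space Model with one possible term weighting scheme might look like the sketch below. It assumes that documents and queries have already been read into lists of integer word IDs, uses maximum-normalized term frequency times log inverse document frequency as the weight, and ranks by cosine similarity. The function names and the choice of weighting are illustrative assumptions, not the required scheme; other weightings can be swapped into `tfidf_vector`.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def strip_delimiters(tokens: List[int]) -> List[int]:
    """Drop the -1 delimiter so that only word IDs remain."""
    return [t for t in tokens if t != -1]


def idf_table(docs: Dict[str, List[int]]) -> Dict[int, float]:
    """Inverse document frequency: log(N / df) for every word ID in the collection."""
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    return {term: math.log(len(docs) / count) for term, count in df.items()}


def tfidf_vector(tokens: List[int], idf: Dict[int, float]) -> Dict[int, float]:
    """Sparse vector with weight (tf / max tf) * idf; unseen word IDs are ignored."""
    tf = Counter(tokens)
    if not tf:
        return {}
    max_tf = max(tf.values())
    return {t: (c / max_tf) * idf[t] for t, c in tf.items() if t in idf}


def cosine(u: Dict[int, float], v: Dict[int, float]) -> float:
    """Cosine similarity between two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(query: List[int], docs: Dict[str, List[int]]) -> List[Tuple[str, float]]:
    """Return (doc_id, score) pairs sorted from most to least similar."""
    clean = {doc_id: strip_delimiters(tokens) for doc_id, tokens in docs.items()}
    idf = idf_table(clean)
    q_vec = tfidf_vector(strip_delimiters(query), idf)
    scored = [(doc_id, cosine(q_vec, tfidf_vector(tokens, idf))) for doc_id, tokens in clean.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

The ranked document IDs from `rank_documents` can be passed directly to the evaluation measures of HW#1.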
Homework #3: Query Expansion and Term Reweighting
You should add query expansion and term reweighting to the retrieval system that you built in HW#2. Either (automatic) relevance feedback or local analysis can be adopted as the strategy, but local analysis is preferred.
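To make the idea concrete, here is a minimal sketch of the automatic feedback option in the Rocchio style: the top-ranked documents from a first retrieval pass are assumed to be relevant, their centroid is added to the query, and only the strongest new terms are kept. It illustrates the simpler of the two strategies, not the preferred local analysis. The sparse-vector representation (term to weight), the function name, and the parameter values `alpha`, `beta`, and `max_new_terms` are all assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Hashable, List


def expand_query(
    query_vec: Dict[Hashable, float],
    top_doc_vecs: List[Dict[Hashable, float]],
    alpha: float = 1.0,
    beta: float = 0.75,
    max_new_terms: int = 10,
) -> Dict[Hashable, float]:
    """Reweight query terms and add new ones from the top-ranked documents."""
    new_query = defaultdict(float)
    for term, weight in query_vec.items():
        new_query[term] += alpha * weight
    if not top_doc_vecs:
        return dict(new_query)

    # Centroid of the documents assumed relevant (the top of the first ranking).
    centroid = defaultdict(float)
    for vec in top_doc_vecs:
        for term, weight in vec.items():
            centroid[term] += weight / len(top_doc_vecs)

    # Query expansion: keep only the strongest terms not already in the query.
    candidates = [t for t in centroid if t not in query_vec]
    candidates.sort(key=lambda t: centroid[t], reverse=True)
    new_terms = set(candidates[:max_new_terms])

    # Term reweighting: original terms and selected new terms get centroid mass.
    for term, weight in centroid.items():
        if term in query_vec or term in new_terms:
            new_query[term] += beta * weight
    return dict(new_query)
```

The expanded vector is then used for a second retrieval pass in place of the original query vector.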
Homework #4: HMM/N-gram-based and PLSI Retrieval Models
You have to implement either the HMM/N-gram-based retrieval model or the PLSI retrieval model. A set of query exemplars associated with topic information (a list of 819 queries and query files) is provided. In addition, the topic information of the document collection and a word-based unigram language model estimated from a general corpus are provided as well.
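For the HMM/N-gram-based option, one common unigram formulation scores a document by the likelihood of the query under a mixture of the document's own unigram model and the general-corpus unigram model. The sketch below assumes that formulation with a fixed mixture weight; in practice the weight would typically be estimated from the query exemplars, which is not shown. The function name, the `doc_weight` value, and the probability floor for words missing from the background model are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, Hashable, Sequence


def query_log_likelihood(
    query: Sequence[Hashable],
    doc: Sequence[Hashable],
    background: Dict[Hashable, float],
    doc_weight: float = 0.5,
    floor: float = 1e-10,
) -> float:
    """log P(query | doc) under a two-component unigram mixture.

    Each query word w contributes
        log(doc_weight * P(w | doc) + (1 - doc_weight) * P(w | background)).
    Requires 0 <= doc_weight < 1 so that the argument of the log stays positive.
    """
    if not 0.0 <= doc_weight < 1.0:
        raise ValueError("doc_weight must be in [0, 1)")
    counts = Counter(doc)
    doc_len = len(doc)
    score = 0.0
    for word in query:
        p_doc = counts[word] / doc_len if doc_len else 0.0
        p_bg = background.get(word, floor)  # floor is an assumed fallback value
        score += math.log(doc_weight * p_doc + (1.0 - doc_weight) * p_bg)
    return score
```

Documents are ranked by this score in descending order; because the scores are log-probabilities, they are negative and only their relative order matters.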
Homework #5: A Web-based IR System
A compressed archive, Word_Information.rar, is provided. Each file in it stands for a spoken document, and each line of a document contains the following information about a word:
POS_IN_DOC WD_NAME WD_ID Begin_Time-1 End_Time Acoustic_Score Confidence_Score1 Confidence_Score2
You can skip the word "SIL" (which means a silence segment).
1. Use overlapping character bigrams as index terms to build your own retrieval system.
2. Implement the inverted file structure for doc indexing.
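The two steps above can be sketched as follows. The sketch assumes that the fields on each line are separated by whitespace with WD_NAME in the second position, that "SIL" entries are dropped before indexing, and that bigrams are taken over the concatenated characters of the remaining words (so a bigram may span a word boundary). The function names and the simple match-count scoring in `search` are hypothetical choices, and unpacking the archive and reading the files are not shown.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def word_from_line(line: str) -> str:
    """Extract WD_NAME, assumed to be the second whitespace-separated field."""
    return line.split()[1]


def char_bigrams(words: Sequence[str]) -> List[str]:
    """Overlapping character bigrams over the words, skipping silence segments."""
    chars = "".join(w for w in words if w != "SIL")
    return [chars[i:i + 2] for i in range(len(chars) - 1)]


def build_inverted_file(docs: Dict[str, Sequence[str]]) -> Dict[str, Dict[str, int]]:
    """Map each bigram to its postings: {doc_id: occurrence count}."""
    index: Dict[str, Dict[str, int]] = defaultdict(dict)
    for doc_id, words in docs.items():
        for bigram in char_bigrams(words):
            index[bigram][doc_id] = index[bigram].get(doc_id, 0) + 1
    return dict(index)


def search(index: Dict[str, Dict[str, int]], query_words: Sequence[str]) -> List[Tuple[str, int]]:
    """Rank documents by the total count of query bigrams they contain."""
    scores: Dict[str, int] = defaultdict(int)
    for bigram in set(char_bigrams(query_words)):
        for doc_id, count in index.get(bigram, {}).items():
            scores[doc_id] += count
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
```

Only the postings of the query's bigrams are visited at search time, which is the main benefit of the inverted file over scanning every document; the raw counts stored in the postings can later be replaced by the term weights from HW#2.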