Web Search and Text Analysis

Subject COMP90042 (2015)

Note: This is an archived Handbook entry from 2015.

Credit Points: 12.5
Level: 9 (Graduate/Postgraduate)
Dates & Locations:

This subject has the following teaching availabilities in 2015:

Semester 1, Parkville - Taught on campus.
Pre-teaching Period Start not applicable
Teaching Period 02-Mar-2015 to 31-May-2015
Assessment Period End 26-Jun-2015
Last date to Self-Enrol 13-Mar-2015
Census Date 31-Mar-2015
Last date to Withdraw without fail 08-May-2015

Time Commitment: Contact Hours: 36 hours, comprising one 2-hour lecture and one 1-hour workshop per week
Total Time Commitment:

200 hours


One of the following:

Study Period Commencement:
Credit Points:
Semester 1, Semester 2
Semester 1, Semester 2


Recommended Background Knowledge:


Non Allowed Subjects:

433-460 Human Language Technology
433-467 Text and Document Management
433-660 Human Language Technology
433-667 Text and Document Management
433-476 Text and Document Management

Core Participation Requirements:

For the purposes of considering requests for Reasonable Adjustments under the Disability Standards for Education (Cwth 2005), and Student Support and Engagement Policy, academic requirements for this subject are articulated in the Subject Overview, Learning Outcomes, Assessment and Generic Skills sections of this entry.

It is University policy to take all reasonable steps to minimise the impact of disability upon academic study, and reasonable adjustments will be made to enhance a student's participation in the University's programs. Students who feel their disability may impact on meeting the requirements of this subject are encouraged to discuss this matter with a Faculty Student Adviser and Student Equity and Disability Support: http://services.unimelb.edu.au/disability


Assoc Prof Steven Bird


email: sbird@unimelb.edu.au

Subject Overview:


The aim of this subject is for students to develop an understanding of the main algorithms used in natural language processing and text retrieval, for use in a diverse range of applications including search engines, cross-language information retrieval, machine translation, text mining, question answering, summarisation, and grammar correction. Topics to be covered include text normalisation, sentence boundary detection, part-of-speech tagging, n-gram language modelling, and text classification. The programming language used is Python.


Topics covered will include:

  • Document classification, including gender detection, topic detection and language identification
  • Weighted finite state transducers, hidden Markov models
  • N-gram language modelling, including statistical estimation
  • Sentence segmentation and alignment, the IBM models, expectation maximisation
  • Search algorithms including beam search, A* search
  • Term indexing, vector space model, term weighting.
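
As a rough illustration of the last topic above, the following Python sketch builds a small inverted index, weights terms by TF-IDF, and compares documents by cosine similarity in the vector space model. It is not taken from the subject materials: the function names, the whitespace tokenisation, the unsmoothed log(N/df) weighting and the three sample documents are all assumptions chosen for brevity.

```python
import math
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to {doc_id: term frequency} (a simple inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        # Assumption: lowercasing plus whitespace splitting stands in for
        # proper text normalisation and tokenisation.
        for term, tf in Counter(text.lower().split()).items():
            index[term][doc_id] = tf
    return index


def tfidf_vectors(index, n_docs):
    """Return one sparse TF-IDF vector (a dict) per document."""
    vectors = [dict() for _ in range(n_docs)]
    for term, postings in index.items():
        # Assumption: plain log(N / df) weighting with no smoothing.
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            vectors[doc_id][term] = tf * idf
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical sample documents, for illustration only.
docs = [
    "search engines rank documents",
    "machine translation aligns sentences",
    "search engines index documents",
]
index = build_index(docs)
vectors = tfidf_vectors(index, len(docs))
sim_02 = cosine(vectors[0], vectors[2])  # documents sharing three terms
sim_01 = cosine(vectors[0], vectors[1])  # documents sharing no terms
```

In this toy collection the first and third documents share three terms and so score above zero, while the first and second share none and score exactly zero. A production retrieval system would typically add smoothing, length normalisation and a more careful tokeniser.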

Learning Outcomes:


On completion of this subject the student is expected to:

  1. Articulate issues relevant to the efficient implementation of language processing systems and text retrieval systems
  2. Apply natural language processing and information retrieval methodologies to textual data
  3. Develop and evaluate computational models of language, based on results from the research literature

Assessment:

  • Project assignments will be done during the semester, requiring approximately 50-55 hours of work in total (40%). There are two projects, due around week 6 and week 12
  • A research-oriented workshop presentation, requiring approximately 13-15 hours of work (10%)
  • One 2-hour end-of-semester examination (50%).

Hurdle requirement: To pass the subject, students must obtain at least:

  • 50% overall
  • 25/50 in the continuous assessment
  • 25/50 in the end-of-semester written examination.

Intended Learning Outcomes (ILOs) 1 and 2 are addressed in the lectures, workshops, and exam; ILOs 3 and 4 are addressed in the project work and oral presentation.

Prescribed Texts: None
Breadth Options:

This subject is not available as a breadth subject.

Fees Information: Subject EFTSL, Level, Discipline & Census Date
Generic Skills:

On completing this subject, students should have the following skills:

  • Formulate and implement algorithmic solutions to computational problems, with reference to the research literature
  • Apply a systems approach to complex problems, and design for operational efficiency
  • Design, implement and test programs for small and medium size problems in the Python programming language.



The subject comprises a weekly 2-hour lecture followed by a 1-hour laboratory exercise. Weekly readings from the research literature and weekly laboratory exercises are assigned, along with a significant amount of project work.


At the beginning of the semester, the coordinator will post a list of readings from the research literature and research monographs which will form the basis of the intellectual content of the subject. An indicative monograph is Statistical Machine Translation, by Philipp Koehn (2010).


A growing sector of the IT industry is concerned with leveraging the information that is locked up in semi-structured text data on the web. Large scale analysis and exploitation of this information depends on graduates with a solid grounding in natural language processing and text retrieval algorithms, and experience with implementing systems that are informed by the research literature.

Related Course(s): Master of Information Technology
Master of Philosophy - Engineering
Master of Science (Computer Science)
Master of Software Systems Engineering
Ph.D.- Engineering
Related Majors/Minors/Specialisations: B-ENG Software Engineering stream
Computer Science
MIT Computing Specialisation
MIT Distributed Computing Specialisation
Master of Engineering (Software)
