Empirical Methods in NLP

This year's seminar is about how to design an experiment, both in general and with specific application to NLP; how to test a hypothesis; and, more generally, how to evaluate an NLP system. We will try to use examples from anaphora resolution, but we will also read experimental work from other areas of NLP.

The most novel feature of this year's seminar is that this is going to be an AUDIENCE PARTICIPATION SEMINAR, meaning that the participants (you) are all expected to present some material; Massimo will give only a few presentations. So have a look at the topics identified below and decide what you'd like to read.

This term, the seminar will meet in the Colloquium Room in Computer Science (5A.540, next to Massimo's office), Tuesdays, 11-12:45.

This page: http://cswww.essex.ac.uk/LAC/LAC_empirical_seminar_syllabus.html

**Primary Text:**
- Paul Cohen, *Empirical Methods in AI*, MIT Press, 1995

**Supplementary Readings I: Statistics**
- Woods, Fletcher, and Hughes, *Statistics in Language Studies*. Cambridge.
- R. Kirk, *Experimental Design*, Brooks / Cole

**Experimental design (in psychology): a first introduction**
- October 11th, Sonja
  - Readings:
    - Sonja's handout
    - Cohen, ch. 3
    - Kirk, chapter 1

**Hypothesis testing: a first introduction**
- October 18th / 25th, Ron (October 25th: also Roman Tesar, text classification using n-grams)
  - Readings:
    - Ron's handout and his Perl scripts: consecutive-toss-computation.perl and consecutive-toss-simulation.perl
    - Kirk, chapter 4

**Experimental design II: Latin Square design**
- November 1st: Udo's JASIST paper
  - Readings:
    - Udo's JASIST paper
    - Introductory discussion of Latin Square design: Kirk, chapter 1

- November 8th: Richard's TREC paper

**Hypothesis testing II: the t-test and its applications**
- November 15th: t-test (Mijail)
  - Dietterich, 1998
  - For a more basic intro, see Cohen ch. 4 / Woods and Hughes ch. 8

- November 22nd: use of the t-test to compare the performance of anaphora resolution systems:
  - Soon et al. 2001, Computational Linguistics
  - For examples with other NLP applications, see e.g. Manning and Schuetze's chapter 5 on collocations
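
As a warm-up for the November 22nd session, here is a minimal sketch of a paired t-test for two systems scored on the same test items. It only computes the t statistic and the degrees of freedom; the p-value still has to be looked up in a t table. The function name and the scores are hypothetical, invented for illustration, and are not taken from any of the readings.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired t-test on matched scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same test items")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired scores")
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("all differences are identical, so t is undefined")
    # t = mean difference divided by its standard error
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical F-scores of two systems on the same five test folds.
system_a = [0.62, 0.58, 0.65, 0.60, 0.63]
system_b = [0.59, 0.57, 0.61, 0.60, 0.60]
t, df = paired_t_statistic(system_a, system_b)
print(f"t = {t:.3f}, df = {df}")
```

The test is paired because both systems are evaluated on the same folds, so only the per-fold differences matter. Whether the differences are independent enough for the t distribution to apply is one of the issues to raise when reading Dietterich 1998.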

**Evaluation in NLP & Anaphora Resolution (November 29th)**
- Evaluation in MUC
  - The main reading will be the CL 1993 paper by Chinchor et al. on Evaluation in MUC3
  - You may also want to have a look at the two overview papers in the Proceedings of MUC7

- Evaluation in Anaphora Resolution:
  - For a general introduction to some of the issues, see ch. 8 of Ruslan Mitkov's Anaphora Resolution book
  - For evaluation of the coreference task in MUC5 (and following), see Vilain et al. 1995

- Evaluation in Information Extraction:
  - Yeh et al., Evaluation in Biocreative
  - Sundheim, Evaluation in TIPSTER / MUC

**Hypothesis testing III: Computer-intensive methods (December 6th)**
- General motivation (for which kinds of population parameters can't you use the t-test?)
  - Readings: Cohen, ch. 5 (handouts still available from Massimo)

- Further readings:
  - The standard reading is Noreen, E. W. (1989), Computer intensive methods for testing hypotheses, John Wiley and Sons (but our library doesn't have it)
  - Alternatives: Manly, B. F., Randomization and Monte-Carlo methods in biology, Chapman and Hall, 1991
  - Edgington, E. S., Statistical inference: the distribution-free approach, McGraw-Hill, 1969.
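
To make the motivation concrete, here is a minimal sketch of an approximate randomization test for paired scores. Unlike the t-test, it accepts any summary statistic (the median in the usage example). The function name, its parameters, and the scores are hypothetical, chosen only for illustration; this is not code from Cohen or Noreen.

```python
import random
from statistics import mean, median


def approximate_randomization(scores_a, scores_b, statistic, trials=10000, seed=0):
    """Estimate a two-sided p-value for |statistic(A) - statistic(B)|."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same test items")
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = abs(statistic(scores_a) - statistic(scores_b))
    at_least_as_extreme = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the system labels are exchangeable,
            # so each pair is swapped with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        if abs(statistic(shuffled_a) - statistic(shuffled_b)) >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (at_least_as_extreme + 1) / (trials + 1)


# Hypothetical per-document scores for two systems on the same documents.
system_a = [0.71, 0.64, 0.80, 0.55, 0.69, 0.74, 0.62, 0.77]
system_b = [0.66, 0.65, 0.72, 0.50, 0.70, 0.69, 0.60, 0.71]
p = approximate_randomization(system_a, system_b, statistic=median)
print(f"estimated p-value for the difference in medians: {p:.4f}")
```

The sampling distribution of the statistic is built by reshuffling the data itself, so no distributional assumption about the statistic is needed. The price is computation time, plus a p-value that is only an estimate whose precision depends on `trials`.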

**Experimental design, IV: Power calculations (December 13th)**
- December 13th: Nancy
  - Main readings: Cohen, ch. 4
  - Further reading: R. Kirk, ch. 1 (t-test)

- January 10th: Riccardo Russo
  - Further reading suggested by Riccardo:
    - Howell 2003 - Statistical Methods for Psychology
    - Russo 2003 - Statistics for Behavioral Sciences

- January 17th: An example of power calculations - dual models vs connectionist explanations of morphology (Sonja)

**Hypothesis Testing IV: ANOVA**
- January 24th, Theoretical introduction (Ron)
  - Readings: Cohen ch. 7? Woods-Hughes ch. 12? Kirk chapter 5?

- ANOVA in psychology:
  - The Poesio et al. 2001 paper on underspecification in Anaphora?
  - The Spivey et al. paper?

- ANOVA in NLP: examples
  - Maria Wolters and Donna Byron, Prosody and the resolution of pronominal anaphora, ACL/COLING 2000
  - Simone Teufel and Marc Moens, Summarizing Scientific Articles, Computational Linguistics, 2002

**Experimental design IV: examples of good practice in experimental design in AR & NLP**
- (Ron?) Frank Keller and Mirella Lapata, Using the Web to obtain Frequencies for Unseen Bigrams, Computational Linguistics v. 29, n. 3, 2003
- (Olivia?) Dan Gildea and Dan Jurafsky, Automatic labelling of semantic roles, Computational Linguistics, 2002
- (Mijail) One of the papers by Veronique Hoste from Walter Daelemans' group, for instance Comparing Learning Approaches to coreference resolution
- (Mijail) Kehler et al, The non-utility of predicate-argument frequencies for pronoun interpretation, NAACL 2004
- Also possible: Lapata CL 2002

**Hypothesis testing V: Chi-square**
- Basic intro: Olivia?
  - Readings: Woods and Hughes ch. 9?

- Applications to anaphora / NLP
  - Readings: Poesio to appear??

- An alternative to Chi-square: log-likelihood (Dunning CL 1993)
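
To see how the two statistics relate, here is a minimal sketch that computes both Pearson's chi-square and the log-likelihood ratio statistic G² from the same contingency table. The function names and the counts are hypothetical, invented for illustration; the sketch assumes that every row and column total is non-zero.

```python
import math


def expected_counts(table):
    """Expected cell counts under independence of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def pearson_chi_square(table):
    """Sum over cells of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))


def log_likelihood_g2(table):
    """G^2 = 2 * sum over cells of observed * ln(observed / expected)."""
    expected = expected_counts(table)
    # Cells with an observed count of zero contribute nothing (0 * ln 0 := 0).
    return 2 * sum(o * math.log(o / e)
                   for obs_row, exp_row in zip(table, expected)
                   for o, e in zip(obs_row, exp_row)
                   if o > 0)


# Hypothetical bigram counts: rows = first word present / absent,
# columns = second word present / absent.
counts = [[12, 30],
          [45, 99913]]
print(f"chi-square = {pearson_chi_square(counts):.1f}")
print(f"G^2        = {log_likelihood_g2(counts):.1f}")
```

Both statistics are referred to the same chi-square distribution (one degree of freedom for a 2x2 table), but they can differ widely when some expected counts are very small, which is the situation Dunning is concerned with.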

**Experimental design, III: Sample design**
- General intro
  - Readings: from R. Kirk

- Corpora used in AR / NLP

**Hypothesis Testing VI: Other distributions**
- Binomial & the sign test
- Poisson

**Additional forms of performance assessment**
- Learning curves (Richard? Mijail?)
  - Readings: Cohen ch. 6
  - Maybe also go back to Cohen ch. 2 (advanced visualization)

- Analysis of a decision tree
  - Readings: Shiberg et al

- Feature selection

**More Hypothesis Testing:**
- Linear regression
  - Readings: Woods-Hughes ch. 13?

- Logistic regression
- Magnitude estimation

**Improving the performance of ML systems**

**Courses**

**Projects**

**Other useful Web links:**