THE GNOME ANNOTATION SCHEME MANUAL

Massimo Poesio

(With help and suggestions from Hua Cheng, Kees van Deemter, Debbie DeJongh, Barbara Di Eugenio, Ben Donaldson, Marisa Flecha-Garcia, Michael Green, Renate Henschel, Rodger Kibble, Shane Montague, Carol Rennie, Rosemary Stevenson, Donia Scott, and Claire Thomson.)

Fourth Version

Last Modified: April 12th, 2000


OUTLINE


  1. Introduction
  2. Layout Annotation
  3. Annotation of Sentences and Utterance Units
  4. NP Annotation
  5. The ANTE Element
  6. Discourse Structure Annotation
  7. Appendix
    1. DTD
    2. gnome-mode.el
  8. References


1. INTRODUCTION


This manual specifies the annotation scheme under development for the GNOME project. The primary goal of the annotation is to investigate the factors affecting the choice of NP form in natural language generation. Secondary goals are to study pronominalization and definiteness. The scheme specifies information to be associated with each NP for this purpose (using the <ne> element), including syntactic information about number and gender, semantic information, and discourse information. In the case of NPs which are judged to have antecedents in the text, the antecedent and some of its properties are also annotated. As in the case of the scheme developed for the MUC initiative, the GNOME scheme includes a separate specification of the elements to be marked; and because in our texts metalinguistic elements such as paragraphs and sentences can also serve as antecedents of anaphoric expressions, those layout elements which are referred to by anaphoric expressions are also marked. XML is used as the markup language.


2. LAYOUT ANNOTATION


(This section is still very preliminary.)

The files contain some information about the annotation itself and the layout of the text. Each file is marked up as an element of type <gnomedoc>; a <gnomedoc> contains (i) information about the annotation, contained in an element <annoinfo>, (ii) a <head>, and (iii) a <body>.

The <annoinfo> element provides information about the file itself (file name, annotation level, annotator, and date of annotation) and the source of the text. The contents of the <annoinfo> element should not be marked.

The <body> contains the text proper. For the moment, we mark up the following information about the layout of the text:

At the moment, we use the special element <layspecial> for all those layout elements which are not otherwise classified. The contents of <layspecial> elements should NOT be annotated; nor should the contents of <annonote> elements, which are used by the annotators to make notes about the annotation.


3. ANNOTATION OF SENTENCES AND UTTERANCE UNITS


Before annotating NPs using the <ne> and <ante> elements discussed in the next two sections, we need to identify sentences and what we will call UNITS, which can be thought of as generalized clauses.

This part of the annotation should proceed as follows:

  1. First mark up all sentences;
  2. Then mark up all units. Note that not all units are contained in sentences.


3.1 SENTENCES

Each sentence should be annotated using the element <s>. We consider as a `sentence' each textual sequence in a paragraph (element <p>), a title (element <title>), or a list item (element <item>) that can be classified as being either declarative, interrogative, imperative, or exclamative. In practice, we identify most sentences by looking at whether they end with a full stop, question mark, or exclamation mark, as in the following examples; note that the punctuation marks are INCLUDED in the <s> element.

<s>This leaflet is a summary of the important information about Menorest.</s>
<s>Keep this leaflet in a safe place.</s>
<s>What are Menorest patches? </s>

One exception to this rule is that the last string of text in a paragraph containing a verbal complex (see below) should always be marked as a sentence, even though sometimes the punctuation mark is missing from such sentences in our texts: e.g., the following bit of text from the Dermovate text should be annotated as a sentence:

<s>IT IS IMPORTANT TO READ THIS CAREFULLY BEFORE STARTING TREATMENT</s>

Also, sometimes colons (`:') break up text into sentences of different types, e.g., a declarative and an interrogative. When this is the case, as in the following example, two sentences should be annotated.

<s>I told him:</s> <s>What do you want from me?</s>

SENTENCE ATTRIBUTES

The <s> element has two attributes:

  1. ID
  2. STYPE

STYPE: type of sentence

This attribute should be used to indicate whether the sentence is declarative, interrogative, etc. Its possible values are:

Decl
for declarative sentences: e.g.,
<s stype=decl>This leaflet is a summary of the important information about Menorest.</s>
Int
for interrogative sentences:
<s stype=int>What are Menorest patches? </s>
Imp
for imperative sentences:
<s stype=imp>Keep this leaflet in a safe place.</s>
Excl
for exclamative sentences, i.e., sentences ended by an exclamation point.
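
As a rough illustration of how far punctuation alone can take the STYPE decision, here is a minimal Python sketch. The function name guess_stype is hypothetical and is not part of the GNOME tools; it only encodes the punctuation cues mentioned above, and returns None whenever the annotator still has to choose between decl and imp.

```python
# Hypothetical helper for illustration only; not part of the GNOME scheme.
STYPE_VALUES = {"decl", "int", "imp", "excl"}


def guess_stype(sentence_text):
    """Return a first-pass STYPE guess from the final punctuation mark.

    Returns None when punctuation cannot decide the value.
    """
    text = sentence_text.strip()
    if text.endswith("?"):
        return "int"
    if text.endswith("!"):
        return "excl"
    # A full stop (or missing punctuation) is compatible with both
    # declarative and imperative sentences, so no guess is made.
    return None


print(guess_stype("What are Menorest patches? "))
print(guess_stype("Keep this leaflet in a safe place."))
```

A guess of this kind is only a starting point; the final value is always the annotator's judgment.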

 


3.2 UNITS

The <unit> element is used to mark subdivisions of sentences that may play a role in discourse either because they update the local focus, or because they are involved in rhetorical relations. A lot of units are clauses, but a few of them (e.g., titles, verbless parentheticals) aren't.


MARKING UP UNITS

Most units are clauses, but some non-clausal elements should be marked up as units, as well. The general principles to follow in marking up units are as follows:

(We consider here as a `clause' each sentence constituent that includes a verbal complex and all of its arguments.) The annotation of units should proceed top-down, as follows:

  1. Check first if the piece of text under consideration (sentence, layout element, or other unit) contains coordinated units, and mark them up; then repeat the whole procedure for each coordinated unit. Coordinators include and, or, but, although, yet, as well as `correlative pairs' (Quirk and Greenbaum, 1973): either ... or, neither ... nor, and both ... and, and some comparative constructions. while is ambiguous: it sometimes serves as a coordinator (as in John is smart, while Bill is rich), and other times as a subordinator (as in John slept while Bill was setting up the table). Among punctuation marks, the semicolon `;' usually indicates coordination; the comma by itself (i.e., without other coordinators such as and) sometimes does too, but less often, at least in our corpus. Examples of coordinated units are:
    <s>[Fashions in diamond jewelry were similar throughout Europe,] [although botanical imagery remained strongest in Paris].</s>
    <s> [The Getty Museum's microscope still works,] [and the case is fitted with a drawer filled with the necessary attachments]. </s>
    <s> [The more I listen to it,] [the more I like it]. </s>
    <s> [Purple, white and green were the colours of the suffragette movement;] [women would wear a brooch like this [to show solidarity or affiliation with the movement]]. </s>

    Note that the final full stop (`.') in a sentence should NOT be included in the unit.

    Only parts of the text that count as units should be divided this way: e.g., in John and Mary left, we only have one unit. (See the rules for determining units below.) Although appositions and parentheticals are better seen as cases of coordination than of subordination, we will treat them as sub-units here because they may occur in the middle of the text. Punctuation marks should be included with the first unit, whereas the conjunction or disjunction should be included with the second unit.

  2. Otherwise, identify all sub-units of the present unit, mark them up, and then apply the whole procedure again to each of them. All units included in the present unit are considered sub-units for the present purposes, unless they are united by a coordinator such as and, or, or but; so complement clauses, parentheticals, etc. are all considered sub-units. (Examples of each of these types of sub-units are given below.)

The following rules should be used to determine the units of text and to deal with problematic cases. (In the following examples, the part of text that should be marked as a unit is indicated with square brackets.)

  1. A complete clause (i.e., a sequence of text including a verbal complex, all its obligatory arguments - subject, object, etc. - and all its postverbal adjuncts) is always a unit, irrespective of whether the clause is main or subordinate, and of whether its verb is finite or non-finite:
    [They were founded in 1903 by Josef Hoffmann and Koloman Moser].
    [The front lowers [to form a writing surface], [revealing drawers and pigeonholes]].
    Notice how both the infinitival [to form a writing surface] and the parenthetical [revealing drawers and pigeonholes] are considered sub-units of the main unit. Conversely, notice that neither the subjects They and The front nor the adjuncts in 1903 and by Josef Hoffmann and Koloman Moser constitute units.

  2. Subordinate clauses should be marked just like main clauses. Examples of subordinate clauses are restrictive and non-restrictive clausal NP modifiers, i.e., clauses that specify further information about an object described by a noun phrase. These always count as units, whether their verb is finite or non-finite. Finite restrictive and non-restrictive clausal modifiers often include a complementizer such as that, as in the following examples:
    [This cream has been prescribed [to treat the skin problem [that you showed to your doctor]]].
    [Each coffer also has a lid [that opens in two sections]].

    (NB: These clausal NP modifiers should always be embedded inside the <ne> element for the NP they modify; the instructions for <ne> elements are given below.)

    There are three types of non-finite clausal NP modifier units: -ing participles, -ed participles, and infinitival clauses:

    [A panel of marquetry [showing the cockerel of France [standing triumphant over both the eagle of the Holy Roman Empire and the lion of Spain and the Spanish Netherlands]] decorates the central door].
    [A large table [decorated in the same manner] would have been placed in front [for working with those papers]].
    [The painted black-and-gold decoration is commonly known as vernis Martin, [named after the Martin brothers, [Parisians who excelled at the craft of imitating Japanese lacquer]]].
    [Early alchemists were driven by the desire [to transform and alter material properties]]

    These modifiers also count as units when they are preposed rather than postposed:

    [[Intended to hold jewels or small precious items,] the interiors of this pair of coffers are lined with tortoiseshell and brass or pewter, with secret compartments in the base ].

    However, only clausal NP modifiers should be marked up; e.g., verbal adjectives in pre-modifier position (as in The painted black-and-gold decoration seen above) should not be marked as separate units.

  3. Appositive clauses are a second type of subordinate clause; they specify the identity of an object, rather than modifying a predicate. They should be marked up as units, as well:
    [The fact [that he wrote a letter to her] suggests [that he knew her]].
    [The belief [that no-one is infallible] is well founded].
    [but also by men's desire [to show off their wealth] ].
  4. Complement clauses are another type of subordinate clause that should be marked as units, again whether finite or non-finite. Complement clauses include, among others, (generally finite) clauses which serve as complements of verbs and specify propositions that are conveyed or believed or reported:
    [a recent study of the arms suggests [that it may have been made for another member of the family rather than Jean himself]].
    [An inscription on the Getty Museum's drawing for one of these wall lights explains [that it should hang above the fireplace]].
    [the bureaucrats argued [that they could not purchase objects [that were neither "advantageous" nor useful]]].

    Finite clauses can also occur as complements of adjectives:

    [It may be unsurprising [that Bjorg, [as a Scandinavian,] should choose silver as her jewellery material]].

    Non-finite clauses that occur as complements of verbs or adjectives such as slow should be marked as units as well:

    [Dubois was commissioned through a Warsaw dealer [ to construct the cabinet for the Polish aristocrat ]]
    [This piece of mid-eighteenth-century furniture was meant [ to be used like a modern filing cabinet]]
    [Bob is slow [to react]]

    Verbal structures can sometimes be very complex, making it difficult to decide whether a verbal element should be treated as a complement. This problem will be addressed as follows. (These guidelines follow Quirk and Greenbaum, chapters 3 and 12.) As already seen in one of the examples above, if the verbal complex consists of the auxiliaries have or be followed by a participle (e.g., to indicate tense or the active/passive distinction) we do not consider the participle a complement, so it should not be treated as a separate unit. We are going to treat modal auxiliaries in the same way: these include can, could, may, might, shall, should, will, would, must, ought to, used to, need, dare and their negative forms. All of these auxiliaries should be treated like have and be; i.e., the main verb should not be marked as a separate unit.

    [This table's marquetry of ivory and horn would have followed the house's blue-and-white color scheme ]

    On the other hand, separate units should be marked when the main verb itself is a modal that takes infinitival clauses or gerunds as complements, like want or like; or a verb like begin or end that takes infinitival complements. We are also going to annotate two units when the verb is a complement of an adjective like necessary or helpful.

    [The upright secretaire began [to be a fashionable form] around the mid-1700s ]
    [I would like [to be able to travel] ]
    [which seems [to argue against any single place of manufacture] ]
    [it may be helpful [to tick the box on the inside cover of the wallet] ]

    We will treat the infinitivals in a verbal construction with get, let, make, and have like those in a construction with may and be - i.e., as not introducing separate units:

    [I let him do his homework].
    [Get her to do it].

    In case of multiple complements embedded into each other, each should be marked as a separate clause according to the rules above:

    [since it will probably be necessary [to stop [using the dressing]]] .
  5. Finite and non-finite clauses in subject position should also be treated as units (of type subject, see below).
    [Use your fingers [because [cutting with scissors] might damage the patch inside]]
    [[That John was rude to her parents] annoyed Sue immensely]

    A special case of this situation is the so-called pseudo-cleft sentence, in which a free relative (a clause with complementizer what) occurs as the subject of a clause. The wh-clause should be marked as an embedded unit:

    [[What you need most] is a good rest].
  6. Other sentence constituents are explicitly marked with subordinators such as after, as if, because, before, if, once, since, until, when, where, while, etc. These should always be marked as units when they contain a verb.
    [[If your periods have stopped [or become very irregular],] you can start [using Estracombi TTS] at any time] .

    Phrases that are marked using subordinators such as if and because should be marked explicitly even when they do not include a verbal complex:

    [Use your fingers [because [cutting with scissors] might damage the patch inside]]
    [He was plagued by nightmares that night,] [and [because of them,] he could not run away.]
    [They refused [to pay the higher rent] [when an increase was announced;]] [as a result, they were evicted from their house]
    [[Regardless of his attitude,] he will be a good student.]

    However, elements of text preceded by subordinators (especially temporal subordinators) that subcategorize for both NPs and clauses should only be marked as units when they contain a verb. For example, although before John arrived and before getting home count as units, before 4pm doesn't. Prepositions that subcategorize for both NPs and clauses include after, before, since, until.

  7. The purpose or goal of an action is sometimes expressed by means of infinitival clauses that syntactically function as adjuncts. These should be marked as units, as well:
    [The center of the narrow body swells [to allow for the pendulum's swing]].
  8. Cleft sentences are constructions of the form It was John that gave me that letter, which consist of two clausal elements, one of which is typically similar in form to a relative clause (that gave me that letter). Both clauses should be marked as units, and the second should be embedded in the first:
    [It was John [that gave me the letter]] .
  9. Titles should be counted as units even when they do not include a verb:
    <title>[Patient Information Leaflet]</title>
  10. Parentheticals - which we define as any expressions between parentheses, dashes, and commas - can play several different functions in a text:

    Every element of text surrounded by parentheses, commas, and dashes should be marked as a unit if it satisfies the following two conditions: (i) it includes a NP or a verb, and (ii) the commas are not simply used to coordinate two nouns or two NPs. So, all of the text constituents in bold in the following examples should be marked up as units:

    [We also started up a subsidiary [(affiliated company)] in November]
    [That leaves a new generation of world masters [ - Greece's Theo Angelopoulos, Taiwan's Hou Hsiao-hsien, Iran's Abbas Kiarostami - ] [that is largely unknown to Americans]]
    [First, they can carry 10 or even 100 times as much information] [ - and hold it much more robustly ]
    [Football, [his only interest in life,] has brought him many friends]
    [The painted black-and-gold decoration is commonly known as vernis Martin, [named after the Martin brothers, [Parisians who excelled at the craft of [imitating Japanese lacquer]]]].

    But the elements in the following examples should not:

    [The suffragette brooch and the costume buttons, on the other hand, .... ]
    [Finally, it was felt [that the Exhibition should also include some other products of the Byzantine goldsmith's art .... ]]
    [it invariably adorns the necks, hair, wrists and expressive fingers of the female figures]
    [The necklaces [produced by Byzantine jewellers] included chains [studded with stones], gold chains in a variety of types with pendants in the form of crosses, amulets, or amygdalia, and with disks with open-work or other decoration on the catch, and necklaces [made of solid disks with coins [issued by emperors]]]
  11. Although coordinated VPs are not, strictly speaking, clauses, we want to mark them up to study how they affect CB shift. Coordinated VPs will therefore count as units when each verb has (explicit) arguments expressed by NPs or PPs; the coordinated VP should be marked as an embedded unit within the unit containing the main verb.
    [The center of the narrow body swells [to allow for the pendulum's swing], [and has a viewing hole to observe the movement]].
  12. For the same reason, we want to treat preposed PPs as separate units, even though post-verbal PPs are not going to be:
    [[With the development of heraldry in the later Middle Ages in Europe as a means of identification], all [who were entitled to bear arms] wore signet-rings [engraved with their armorial bearings]].
  13. Lists are a way of coordinating elements of a text. Paragraphs, sentences, clauses, verb phrases, and noun phrases can all occur as list items, especially in the texts in the pharmaceutical domain. When the list item is a paragraph (i.e., a sequence of sentences) it should be treated just like all other paragraphs: i.e., each of its constituent sentences should be marked up, and then their unit constituents. Similarly when the list item is a single sentence, as in the following example:
    <list>
    <item>
    <s> Have you ever had treatment for a breast lump, or any serious disease of your womb? </s>
    </item>
    <item> ...
    </list>
    Lists can also be used to coordinate non-sentential elements, including clauses, VPs, NPs - everything that can be coordinated. List items should always be marked up as units even when they are not clauses.

    <list>
    <item> [a. Tear open the pouch along the 2 edges ] </item>
    <item> [b. Take out the patch ] </item>
    ...
    </list>
    Note that in some cases non-clausal list items can contain clausal elements, or be coordinated with them.
    <list>
    <item> [4 Estraderm TTS 50 Patches [each containing 4mg of oestradiol].] [Your body will absorb about 50 micrograms of oestradiol each day.] </item>
    <item> [4 Estragest Patches.] </item>
    ...
    </list>
    (Note: the special value UTYPE=listitem should be used for list items which are not clausal.)
  14. Verbal gappings also count as (coordinated) units:
    [John likes dogs,] [and Mary cats]

The following elements of text should not be annotated as units:

  1. Non-clausal post-modifiers of NPs or adjectives, such as NP complements, prepositional phrases, etc:
    [Their workshop probably also supplied the bronze Chinese figures on each side of and above the clock].
    [Each vase is decorated with inlaid decoration]

    The one exception is preposed PPs, which should be treated as separate units with UTYPE="preposed-pp" (see above).

  2. Non-clausal comparatives:
    [it was considered more as an art material than as a low-value throw-away material].
  3. Although coordinated verb phrases with arguments should be marked up as units as discussed above, coordinated verbs should not be:
    [The oestradiol and norethisterone acetate are plant derived and synthetically produced]
    [Drawings and engravings were sources [frequently used by foreign patrons and craftsmen [to order and copy the latest fashions in French interior design]]]

    However, each infinitival phrase should be treated as a separate VP:

    [These luxury items, [which from the early years of the empire remain outside the philosophical and theological thought of Byzantium,] continued [to be appreciated, ][to be hidden away in times of danger,] [and to be displayed ostentatiously in times of happiness and prosperity]]
  4. Quoted parts of text should not be marked as units when they only serve as uninterpreted strings, even if they are clausal:
    [The inscription 'CHNETOC BASHLHKOC CPATHARHC' .... ]
    [Above the shield the letters E.I.D. Gre have been interpreted as 'Est Sigillum Iohannis De Gre', [meaning 'this is the seal of Jean de Grailly']].

    However, quoted speech should be marked up:

    [["[Where are the women of today...] [take out the hand of the coquettish woman,][ and see [how it is gilded...]] [Tell me [how many of the poor's needs, and even more, your hand could satisfy]]. [As I said you come into the church with hands and neck all gilded..]."] comments Chrysostomos].
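
The square-bracket notation used in the examples above can be checked mechanically. The following Python sketch (parse_units is a hypothetical name, not part of the GNOME tools) reads a bracketed example, builds the nesting of units and sub-units, and reports unbalanced brackets. It assumes that square brackets are used only as unit delimiters.

```python
def parse_units(text):
    """Parse square-bracket unit notation into nested records.

    Returns the list of top-level units. Each unit is a dict holding the
    bracket-free text it spans ("text") and its sub-units ("children").
    Raises ValueError when the brackets are unbalanced.
    """
    root = {"text": "", "children": []}
    stack = [root]
    for ch in text:
        if ch == "[":
            unit = {"text": "", "children": []}
            stack[-1]["children"].append(unit)
            stack.append(unit)
        elif ch == "]":
            if len(stack) == 1:
                raise ValueError("closing bracket without matching opening bracket")
            stack.pop()
        else:
            # Every currently open unit spans this character.
            for unit in stack[1:]:
                unit["text"] += ch
    if len(stack) != 1:
        raise ValueError("opening bracket without matching closing bracket")
    return root["children"]


example = "[The front lowers [to form a writing surface], [revealing drawers and pigeonholes]]."
for unit in parse_units(example):
    print(unit["text"], "| sub-units:", len(unit["children"]))
```

Text outside every bracket, such as the sentence-final full stop, is not attributed to any unit, which matches the rule that the final full stop is not included in the unit.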


UNIT COORDINATION

The element <unit-coordination> should be used to group together units in a few cases where simple embedding is not sufficient to preserve the structure of the text. Two such cases are particularly common. In the pharmaceutical domain, it's very common for a conditional statement to contain two antecedents and/or two consequents, as in the following example:

<unit utype="main"> <unit-coordination> <unit utype="adjunct"> If you have any questions </unit> <unit utype="adjunct"> or are not sure about anything, </unit> </unit-coordination> ask your doctor or your pharmacist </unit>

(Note how the <unit-coordination> element itself is not annotated for utype, but each of the embedded units gets the value it would have if it were alone.) In the museum domain, quotations often consist of many units, all of which should be marked up as complements:

<unit utype="main"> <unit-coordination> "<unit utype="complement"> Where are the women of today... </unit> <unit utype="complement"> take out the hand of the coquettish woman, and see how it is gilded... </unit> <unit utype="complement"> Tell me how many of the poor's needs, and even more, your hand could satisfy. </unit> <unit utype="complement"> As I said you come into the church with hands and neck all gilded... </unit> " </unit-coordination> comments Chrysostomos </unit>

Unit coordination should only be used when necessary; in particular, it should NOT be used to group together two main units, since the info about the grouping is already given by the <s> element.
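
The restriction on grouping main units lends itself to an automatic check. The sketch below uses Python's standard xml.etree.ElementTree module; the function name misused_coordinations is hypothetical, and the sketch assumes that the annotated fragment is well-formed XML with quoted attribute values. Attribute names are compared case-insensitively because both utype and UTYPE appear in the examples of this manual.

```python
import xml.etree.ElementTree as ET


def misused_coordinations(xml_text):
    """Return the <unit-coordination> elements that directly group a main unit."""
    root = ET.fromstring(xml_text)
    flagged = []
    for coord in root.iter("unit-coordination"):
        for unit in coord.findall("unit"):
            # Normalise attribute names so that UTYPE and utype are treated alike.
            attrs = {name.lower(): value for name, value in unit.attrib.items()}
            if attrs.get("utype") == "main":
                flagged.append(coord)
                break
    return flagged


conditional = (
    '<s><unit utype="main"> <unit-coordination> '
    '<unit utype="adjunct"> If you have any questions </unit> '
    '<unit utype="adjunct"> or are not sure about anything, </unit> '
    '</unit-coordination> ask your doctor or your pharmacist </unit></s>'
)
print(len(misused_coordinations(conditional)))
```

Only the direct children of each <unit-coordination> are inspected, so a main unit that merely contains a coordination of adjuncts, as in the conditional example, is not flagged.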


UNIT ATTRIBUTES

Units have the following attributes:

  1. ID
  2. VERBED
  3. SUBJECT
  4. UTYPE
  5. FINITE

The ID attribute is as in the case of <ne>, except that it's not obligatory. All of the other attributes can also take an `unsure' value.

The recommended procedure for marking up unit attributes is as follows. First of all, decide on a value for the VERBED attribute. If the value of VERBED is verbed-no, then some of the other values are going to be determined: FINITE will be no-finite, and SUBJECT will be no-subject. Also, this unit cannot be a main clause. Then proceed to decide a value for the UTYPE attribute; then to FINITE, and finally on to SUBJECT.
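
The dependencies between VERBED and the other attributes can be expressed as a small consistency check. The following Python sketch is only an illustration: check_unit_attributes is a hypothetical name, the attributes of a unit are assumed to be available as a dictionary with lower-case keys, and an attribute that is absent is treated as not yet annotated rather than as an error.

```python
def check_unit_attributes(attrs):
    """Return a list of problems found in one unit's attribute dictionary.

    Only the constraints that follow from VERBED="verbed-no" are checked.
    """
    problems = []
    if attrs.get("verbed") == "verbed-no":
        if attrs.get("finite") not in (None, "no-finite"):
            problems.append("verbed-no requires FINITE=no-finite")
        if attrs.get("subject") not in (None, "no-subject"):
            problems.append("verbed-no requires SUBJECT=no-subject")
        if attrs.get("utype") == "main":
            problems.append("a verbless unit cannot be a main clause")
    return problems


print(check_unit_attributes({"verbed": "verbed-no", "utype": "main"}))
```

The check is deliberately one-directional: a unit with VERBED="verbed-yes" is never flagged, since its remaining attributes still have to be decided by the annotator.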

In what follows, we will use an SGML notation to indicate units rather than the square brackets used in the previous section, to show which units get the attribute being discussed.

VERBED: Does the unit have a verb?

Attribute values

This is a boolean attribute, with values verbed-yes and verbed-no, used to indicate if the unit contains a verb.

verbed-yes

This value should be used if the unit contains verbal elements. Note that forms of to be and non-finite verbal forms such as gerunds or participles count as verbs, as well:

<unit VERBED="verbed-yes" > that in medieval times the wedding ceremony took place at the church door, not at the entry to the chancel </unit>
<unit VERBED="verbed-yes" > This is a type of brooch </unit>
<unit VERBED="verbed-yes" > in adopting these in their creations </unit>
verbed-no

This value should be used if the unit does not contain verbal elements.

<unit VERBED="verbed-yes" > it was made by Anne-Marie Shillitoe, <unit VERBED="verbed-no" > an Edinburgh jeweller </unit> </unit>

Possible Difficulties

Keep in mind that sometimes present participles are used as adjectives, or even nouns; in these cases, verbed-no should be used.

UTYPE: Unit Type.

This attribute is used to specify whether the unit is a main clause, a relative clause, etc.

main

This value should be used if the unit is a clause and, furthermore, it's the only clause in a simple sentence, the superordinate clause in a complex sentence, or one of the conjuncts in a coordinated sentence.

<s> <unit UTYPE="main"> They were founded in 1903 by Josef Hoffmann and Koloman Moser </unit>. </s>
<s> <unit UTYPE="main"> The Getty Museum's microscope still works,</unit>
<unit UTYPE="main"> and the case is fitted with a drawer filled with the necessary attachments</unit>
. </s>
<unit UTYPE="main"> Each coffer also has a lid
<unit UTYPE="relative"> that opens in two sections. </unit> </unit>

(A list of coordinating elements is at the beginning of this section.)

relative

Relative clauses have the function of restricting the range of the predicate expressed by the head noun of a noun phrase, as in the man who shot Liberty Valance. As such, they are always embedded in a noun phrase. They can be distinguished from other sentence constituents that modify nouns because they contain a verb, unlike PP modifiers such as with a golden arm in the man with a golden arm. Complementizers include that and relative pronouns such as who, which, and where.

<unit UTYPE="main"> Each coffer also has a lid
<unit UTYPE="relative"> that opens in two sections. </unit> </unit>
Three pairs of lions clamber up the section from the point <unit UTYPE="relative"> where the sheath and bow are joined </unit>

whereby can also sometimes be used to introduce relative clauses:

<unit UTYPE="main"> and indeed ignored the despairing opposition of the Church to the practice <unit UTYPE="relative"> whereby according to Gregory of Nyssa, <unit UTYPE="paren-app"> "The ring worn on the hand, <unit UTYPE="parenthetical"> through the engraving on its bezel, </unit> implies a repetition of the icon." </unit> </unit> </unit>

One point to remember is that relative clauses do not always come with a relative pronoun or other complementizer; so-called reduced relatives only have a non-finite verb, typically a past participle or infinitival:

<unit UTYPE="main"> A large table <unit UTYPE="relative"> decorated in the same manner </unit> would have been placed in front <unit UTYPE="adjunct"> for working with those papers</unit> </unit>

Some parentheticals can also express so-called non-restrictive relative clauses. We will classify such units as types of parentheticals, as discussed below.

<unit UTYPE="main"> This cabinet celebrates the Treaty of Nijmegen, <unit UTYPE="paren-rel"> which concluded the war. </unit> </unit>
such-as

Units preceded by such as, with or without a verb, should be marked as UTYPE=such-as:

<unit UTYPE="main"> For jewellery was also decorated with religious scenes or individual figures, <unit UTYPE="such-as"> such as the Virgin, Christ, saints and angels </unit> </unit>

This value should also be used when the unit is a parenthetical clause (see below).

appositive

Appositive clauses are also post-modifiers of NPs, which syntactically look a lot like relative clauses, but have a different semantic function: to specify the value of the function denoted by the rest of the noun phrase, rather than modifying the head noun. Thus, in the following example, the clause that he wrote a letter to her is not a restriction on the head noun fact, but specifies which fact we are talking about:

<unit UTYPE="main"> The fact <unit UTYPE="appositive"> that he wrote a letter to her </unit> suggests <unit UTYPE="complement"> that he knew her </unit> </unit>
<unit UTYPE="main"> The belief <unit UTYPE="appositive"> that no-one is infallible </unit> is well founded. </unit>
<unit UTYPE="main"> The desire <unit UTYPE="appositive"> to become rich </unit> is a powerful motivation. </unit>

The term apposition is commonly used to indicate a certain type of parenthetical that plays an appositive function. We classify these units as types of parentheticals, as discussed shortly.

parenthetical, paren-app, paren-rel, paren-main

All units included between commas, parentheses, and dashes should be marked using one of these four values, except when

We are going to divide parentheticals into four classes: those that serve as relative clauses (marked as paren-rel), those that serve as appositions (marked as paren-app), those that serve as main clauses (marked as paren-main), and all the rest, marked simply as parenthetical: this last class includes parentheticals used to give comments, provide translations, refer to page numbers, etc.

The first two categories of parentheticals can be recognized as follows. Parentheticals that serve as relative clauses include a predicate expressed by a VERB and have the same form as relative clauses: they are either full sentences initiated by a complementizer such as who, or reduced relative clauses containing a participle. Parentheticals that serve as appositions are NOMINAL predicates, and are expressed by complete noun phrases.

<unit UTYPE="main"> The painted black-and-gold decoration is commonly known as vernis Martin, <unit UTYPE="paren-rel"> named after the Martin brothers, <unit UTYPE="paren-app"> Parisians <unit UTYPE="relative"> who excelled at the craft of imitating Japanese lacquer </unit> </unit> </unit> </unit>

Note that for a parenthetical to qualify as a paren-app it is not sufficient that it consists of a full noun phrase: the two noun phrases must denote the same object. The following examples include a number of parenthetical noun phrases whose relation with the noun phrase they modify is not identity:

<unit UTYPE="main"> 589, <unit UTYPE="paren-app"> a 15th-century French ring, </unit> was found in Selby, <unit UTYPE="parenthetical"> Yorkshire </unit> <unit UTYPE="parenthetical"> (England) </unit> </unit>
<unit UTYPE="main"> We also started a subsidiary <unit UTYPE="parenthetical"> (affiliated company) </unit> in November. </unit>

In particular, watch out for parenthetical elements that provide translations in English of words or sentences in another language - these should be classified as UTYPE=parenthetical rather than as UTYPE=paren-app.

... <unit UTYPE="relative"> on which is the incorrectly spelled inscription and wish: YGIA MAKEDONIO <unit UTYPE="parenthetical"> (Health to Macedonios) </unit> </unit>

Be careful with units that semantically act as relative clauses or appositive units: if they occur between commas or dashes, UTYPE=paren-rel and UTYPE=paren-app should always be used.

<unit UTYPE="main" > <unit UTYPE="preposed-pp" > From the beginning to the end of the Byzantine empire, </unit> jewellery was highly valued by high-ranking officials at the Byzantine court, <unit UTYPE="paren-rel" > whose signet-rings were decorated with ingenious, complicated monograms, </unit> and even more so by wealthy Byzantine ladies, <unit UTYPE="paren-rel" > who never ceased <unit UTYPE="complement" > to adore it </unit> </unit> </unit> .

The value UTYPE=paren-main should be used for clauses that occur in parts of text that would otherwise count as full sentences, but are enclosed between parentheses:

<s> <unit UTYPE="main" > John was really surprised </unit> . </s>
( <s> <unit UTYPE="paren-main"> He had no idea <unit UTYPE="complement"> that dogs could run that fast </unit> </unit> . </s> )

If in the end you're still unsure about which type of parenthetical a unit is, use UTYPE=parenthetical.
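The decision procedure for the four parenthetical values can be summarized in a few lines of Python. This is only a sketch: the function name and its four boolean arguments are our own illustrative assumptions (they stand for judgments made by the annotator, not for anything computed automatically), and the fallback mirrors the advice to use UTYPE=parenthetical when unsure.

```python
def classify_parenthetical(is_bracketed_sentence, has_relative_clause_form,
                           is_full_np, same_referent_as_host):
    """Suggest a UTYPE value for a unit set off by commas, parentheses or dashes.

    All four arguments are hypothetical yes/no judgments supplied by the annotator.
    """
    if is_bracketed_sentence:
        # A clause that would otherwise be a full sentence, enclosed in parentheses.
        return "paren-main"
    if has_relative_clause_form:
        # Verbal predicate in the form of a full or reduced relative clause.
        return "paren-rel"
    if is_full_np and same_referent_as_host:
        # Nominal predicate denoting the same object as the NP it modifies.
        return "paren-app"
    # Comments, translations, page references, and all doubtful cases.
    return "parenthetical"


# "a 15th-century French ring" modifying "589"
print(classify_parenthetical(False, False, True, True))
# "(England)" modifying "Yorkshire": a full NP, but not the same object
print(classify_parenthetical(False, False, True, False))
```

Note that the identity test is what separates the two examples: both are full noun phrases, but only the first denotes the same object as the phrase it modifies.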

subject

Clauses in subject position should be marked as units, as well, and classified as UTYPE=subject:

<unit UTYPE="main" > <unit UTYPE="subject"> That John was rude to her parents </unit> annoyed Sue immensely </unit>
complement

This value should be used for units that act as complements of verbs or adjectives, i.e., as arguments of that verb other than subject:

<unit UTYPE="main" > a recent study of the arms suggests <unit UTYPE="complement"> that it may have been made for another member of the family rather than Jean himself </unit> </unit>
<unit UTYPE="main" > It may be unsurprising <unit UTYPE="complement"> that Bjorg, <unit UTYPE="parenthetical"> as a Scandinavian, </unit> should choose silver as her jewellery material </unit> </unit>

Infinitival clauses with to can also serve as complements of verbs such as want:

<unit UTYPE="main"> John wants <unit UTYPE="complement"> to buy a car </unit> </unit>
<unit UTYPE="main"> Dubois was commissioned through a Warsaw dealer <unit UTYPE="complement"> to construct the cabinet for the Polish aristocrat </unit> </unit>

whereas phrases with a present participle as head can serve as complements of verbs such as start:

<unit UTYPE="main"> John started <unit UTYPE="complement"> building a car </unit> last year </unit>.

Complements should be distinguished from ADJUNCTS, for which the special value UTYPE=adjunct should be used. Specifications of time, place and purpose are generally adjuncts rather than complements; see below for examples. Be particularly careful when annotating infinitival clauses with to, which can serve both as complements of certain verbs (as in the examples above) and as purpose clauses (see below).

adjunct

This value should be used for all subordinate clauses that are not relatives, appositives, subjects, complements, clefts or parentheticals. This class includes all clauses introduced by the connectives after, as, because, before, for, if, immediately, last, like, once, since, though, till, unless, until, when, whenever, where, whereas, while, whilst. (See (Quirk and Greenbaum, 1973), chapter 11.)

<unit UTYPE="main"> Use your fingers <unit UTYPE="adjunct"> because cutting with scissors might damage the patch inside </unit> </unit>
<unit UTYPE="adjunct"> If the treated skin has become infected </unit>
<unit UTYPE="adjunct"> while you are using it. </unit>

Infinitival clauses with to expressing the purpose of a certain action should also be marked as UTYPE=adjunct; be careful in distinguishing these from infinitival clauses that express a complement.

<unit UTYPE="main"> John went to London <unit UTYPE="adjunct"> to buy a car </unit> </unit>
.... <unit UTYPE="relative"> often worn both by men and by Byzantine ladies <unit UTYPE="adjunct"> to fasten their cloaks at the shoulder </unit> </unit>

Be careful also to distinguish adjuncts occurring at the beginning of a unit (as in if clauses) from units that should be marked as UTYPE=preposed-pp. Use the following heuristic: if the unit has a verb, it should be marked as UTYPE=adjunct. If it doesn't have a verb, it should be marked as an adjunct if it contains a connective such as because, if, though, till, unless, whereas; otherwise it should be marked as UTYPE=preposed-pp.
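The heuristic just given can be written down as a small Python function. This is a minimal sketch: the function name is hypothetical, the presence of a verb is assumed to be supplied by the annotator rather than detected, and the connective list contains only the six connectives named in the heuristic.

```python
# Connectives named in the heuristic above (not the full list given for adjuncts).
CONNECTIVES = {"because", "if", "though", "till", "unless", "whereas"}


def classify_initial_unit(text, has_verb):
    """Suggest UTYPE=adjunct or UTYPE=preposed-pp for a unit at the start of a clause."""
    if has_verb:
        return "adjunct"
    # Verbless unit: look for one of the connectives among its words.
    words = {word.strip(",.;:").lower() for word in text.split()}
    if words & CONNECTIVES:
        return "adjunct"
    return "preposed-pp"


print(classify_initial_unit("If the treated skin has become infected", has_verb=True))
print(classify_initial_unit("In the 1930s,", has_verb=False))
```

A verbless unit such as "If in doubt," would come out as an adjunct, because it contains a connective even though it has no verb.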

coord-vp

Although coordinated VPs are not, strictly speaking, clauses, we want to look at how they affect CB shift; any element after the first should therefore be marked as UTYPE=coord-vp.

<unit UTYPE="main"> A microscope of this same model belonged to Louis XV, <unit UTYPE="paren-app"> King of France, </unit> <unit UTYPE="coord-vp"> and was part of his observatory at the Chateau de La Muette. </unit> </unit>
preposed-pp

In the same way, we are going to mark up preposed PPs even though strictly speaking they are not clauses, and mark them as UTYPE=preposed-pp:

<unit UTYPE="main"> <unit UTYPE="preposed-pp"> In the 1930s,</unit> plastic was still relatively new. </unit>

As mentioned above, be careful when dealing with constituents at the beginning of a clause, as these could be either preposed-PPs or adjuncts; use the heuristic discussed under UTYPE=adjunct.

listitem

This value should be used for all and only those units that are items in a list, but are not clausal: e.g., for nominal items, or PP items.

<list>
<item> <unit UTYPE="listitem"> 4 Estraderm TTS 50 Patches <unit UTYPE="relative"> each containing 4mg of oestradiol </unit> </unit> </item>
<item> <unit UTYPE="listitem"> 4 Estragest Patches </unit> </item>
...
</list>

By contrast, units that occur in lists that simply coordinate sentence constituents should be classified as if a coordination was present: e.g., as complements, main, etc.

<list>
<item> <unit UTYPE="main"> a. Tear open the pouch along the 2 edges </unit> </item>
<item> <unit UTYPE="main"> b. Take out the patch </unit> </item>
...
</list>

 

cleft

This value should be used for clauses that occur in cleft position:

<unit UTYPE="main"> It was John <unit UTYPE="cleft"> that gave me the letter </unit> </unit>

In the case of pseudo-clefts, the free relative should be marked as UTYPE=subject:

<unit UTYPE="main"> <unit UTYPE="subject"> What you need most </unit> is a good rest </unit>
title

This value should be used for titles which are not sentences and are not even full clauses - e.g., for titles that are just NPs.

<title> <unit UTYPE="title"> Other special warnings </unit> </title>
<title> <unit UTYPE="title"> Putting on a patch </unit> </title>

When the title is a full clause, it should be assigned the value UTYPE=main.

disc-marker

This value should be used to mark all units that consist of a discourse marker only, such as yes, okay, right. These units are common in dialogues, but can also be found in some texts in the pharmaceutical domain, formulated according to a question-answer scheme:

<s> <unit UTYPE="disc-marker" verbed="verbed-no"> Okay, </unit> <unit UTYPE="disc-marker" verbed="verbed-no"> thankyou </unit> </s>
<title> <s> <unit UTYPE="main" verbed="verbed-yes"> Can you wear a patch <unit UTYPE="adjunct" verbed="verbed-yes"> while bathing or taking exercise </unit> </unit> ? </s> </title>
<s> <unit UTYPE="disc-marker" verbed="verbed-no"> Yes </unit> . </s>

 

FINITE: Is the UNIT finite?

This is a boolean attribute whose value should be specified for all and only units with value verbed=verbed-yes, and takes as values finite-yes and finite-no. The purpose of this attribute is to allow us to study Kameyama's (1998) hypothesis that it's only tensed UNITs that result in CB transitions. Units with value verbed=verbed-no should get the value no-finite.

finite-no

Nonfinite clauses include clauses whose verbs are infinitivals with and without to, -ing participles, and -ed participles. All of these clauses should get the value finite=finite-no, whether they occur as complements, or reduced relative clauses.

<unit UTYPE="main" FINITE="finite-yes"> A large table <unit UTYPE="relative" FINITE="finite-no"> decorated in the same manner </unit> would have been placed in front <unit UTYPE="adjunct" FINITE="finite-no"> for working with those papers</unit> </unit>
<unit FINITE="finite-yes"> the ebeniste <unit FINITE="finite-no"> most frequently employed by the marchands-merciers to mount porcelain plaques on furniture </unit> </unit>
finite-yes

This value should be used for UNITs of all types (main, subordinate, etc.) with a finite verb (whether present, past or future). Note that the main clause of sentences in imperative mood such as instructions counts as finite (Quirk and Greenbaum, 1973, p. 38).

<unit FINITE="finite-yes"> They were founded in 1903 by Josef Hoffmann and Koloman Moser </unit>
<unit FINITE="finite-yes"> Some of the slides are from the 1800s </unit>
<unit FINITE="finite-yes"> The cream does not contain any lanolin, parabens or colouring agents </unit>
<unit FINITE="finite-yes"> Gently spread a thin layer of Nerisone onto the affected area of the skin </unit>
Possible difficulties

It is possible for a sentence to include both finite and non-finite clauses, and clauses do not `inherit' the finiteness of their superordinate clauses---e.g., it is possible to have a non-finite clause subordinate to a finite one:

<unit FINITE="finite-yes"> Please tell your doctor <unit FINITE="finite-yes"> if you have any doubts or worries <unit FINITE="finite-no" > about using Nerisone, </unit> either <unit FINITE="finite-yes"> before you start </unit> or <unit FINITE="finite-yes"> while you are using it. </unit> </unit> </unit>

It's also possible for a finite clause to be embedded into a non-finite one:

<unit FINITE="finite-no"> indicating <unit FINITE="finite-yes"> that the instrument was in continual use for over a century </unit> </unit>
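The fact that finiteness is not inherited can be checked mechanically on a marked-up fragment. The following Python sketch reads the example above with the standard library's xml.etree.ElementTree module and lists the FINITE value of each unit together with its depth of embedding; the function name is our own, and the attribute is assumed to be spelled FINITE in upper case, as in the example.

```python
import xml.etree.ElementTree as ET

EXAMPLE = (
    '<unit FINITE="finite-no"> indicating '
    '<unit FINITE="finite-yes"> that the instrument was in continual use '
    'for over a century </unit> </unit>'
)


def finite_by_depth(xml_text):
    """Return (depth, FINITE value) for every <unit>, in document order."""
    root = ET.fromstring(xml_text)
    result = []

    def walk(elem, depth):
        if elem.tag == "unit":
            result.append((depth, elem.get("FINITE")))
            depth += 1  # units nested inside this one are one level deeper
        for child in elem:
            walk(child, depth)

    walk(root, 0)
    return result


print(finite_by_depth(EXAMPLE))  # prints [(0, 'finite-no'), (1, 'finite-yes')]
```

Each unit carries its own value: the outer participial clause is non-finite, while the that-clause embedded in it is finite.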

As said above, if the verbal complex consists of an auxiliary such as have or be followed by a participle, the participle should not be treated as a separate unit, and the whole clause should be classified as "finite-yes".

<unit FINITE="finite-yes" > while you are using it. </unit>
<unit FINITE="finite-yes" > If the treated skin has become infected </unit>

If, however, the main verb is a modal or an adjective which takes infinitival clauses or gerunds as complements, such as want or like, two clauses should be annotated, one with FINITE="finite-yes", the other with FINITE="finite-no".

<unit FINITE="finite-yes"> since it will probably be necessary
<unit FINITE="finite-no"> to stop <unit FINITE="finite-no" > using the dressing </unit> </unit> </unit>
<unit FINITE="finite-yes" > The upright secretaire began <unit FINITE="finite-no"> to be a fashionable form </unit> around the mid-1700s </unit>

SUBJECT: Does the unit have a subject?

This attribute should be used to indicate if the unit has a subject and, if so, which type: `full' or `empty' (as in it rains).

no-subject

This value should be used if the unit does not have a subject at all: e.g., for units with VERBED="verbed-no", for reduced relative clauses (relative clauses without a relative pronoun or a complementizer), for infinitival sentences, and for the main clause of imperative sentences.

<s> <unit VERBED="verbed-yes" SUBJECT="full-subject"> Drawings and engravings were sources <unit VERBED="verbed-yes" SUBJECT="no-subject"> frequently used by foreign patrons and craftsmen <unit VERBED="verbed-yes" SUBJECT="no-subject"> to order and copy the latest fashions in French interior design </unit> </unit> </unit> </s>

(Note: we are only interested in fully realized subjects; clauses that would be analyzed in certain syntactic theories as having phonologically null subjects should therefore be assigned a value of no-subject rather than full-subject or empty-subject.)

full-subject

This value should be used if the unit has a verb (i.e., VERBED=verbed-yes) and the verb has a subject which does play a semantic role in the sentence:

<unit VERBED="verbed-yes" SUBJECT="full-subject"> Indeed, the term `jewelry' encompasses an extraordinary range of accessories <unit VERBED="verbed-yes" subject="full-subject" > which people have used <unit VERBED="verbed-yes" subject="no-subject"> to decorate themselves </unit> </unit> </unit>

Relative clauses are one possible difficulty here; see the discussion below.

empty-subject

This value should be used if the unit has a subject, but this subject does not denote anything. There are two basic cases in which this may happen: with clauses whose subject is expletive `it', as in it rains; and with clauses whose subject is `there', as in There is a person looking for you.

<unit VERBED="verbed-yes" subject="empty-subject">There was no other time <unit subject="full-subject"> when the simultaneous use of this cipher and of those armorial bearings would have been as correct as in those ten years </unit> </unit>
<unit verbed="verbed-yes" subject="empty-subject" >but it is important <unit verbed="verbed-yes" subject="no-subject">to remember <unit verbed="verbed-yes" utype="complement" finite="finite-yes" subject="full-subject"> that jewelry doesn't have to be expensive or elaborately crafted </unit> </unit> </unit>
relpro

This is a special value used only for relative clauses with an explicit complementizer, and where the relative pronoun acts as a subject:

<unit subject="relpro" > which was used <unit subject="no-subject"> to fasten the straps of a dress at the neckline </unit> </unit>

Note that reduced relative clauses (i.e., relative clauses without a complementizer) should be assigned the value no-subject, whereas relative clauses where the relative pronoun does not occur in subject position (as in the following example, where the subject is they) should be classified according to the type of subject that actually occurs (full or empty):

<unit subject="full-subject" > which Beatle they liked best </unit>
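The four SUBJECT values can be summarized as a short decision function. This is a sketch under stated assumptions: the function name and its arguments are hypothetical, and each argument encodes a judgment made by the annotator (whether the unit has a verb, whether an overt subject is present and of which kind, and whether a relative pronoun acts as the subject).

```python
def classify_subject(has_verb, overt_subject=None, relpro_is_subject=False):
    """Suggest a SUBJECT value for a unit.

    overt_subject is None (no realized subject), "full" (a subject with a
    semantic role) or "empty" (expletive `it' or `there').
    """
    if overt_subject not in (None, "full", "empty"):
        raise ValueError("overt_subject must be None, 'full' or 'empty'")
    if not has_verb:
        return "no-subject"
    if relpro_is_subject:
        # Relative clause whose relative pronoun acts as the subject.
        return "relpro"
    if overt_subject is None:
        # Reduced relatives, infinitival clauses, imperatives: only fully
        # realized subjects count.
        return "no-subject"
    return "empty-subject" if overt_subject == "empty" else "full-subject"


print(classify_subject(True, relpro_is_subject=True))   # which was used ...
print(classify_subject(True))                           # to fasten the straps ...
print(classify_subject(True, overt_subject="full"))     # which Beatle they liked best
print(classify_subject(True, overt_subject="empty"))    # There was no other time ...
```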


4. NP ANNOTATION


The textual elements in which we are mostly interested are Noun Phrases (NPs) and those parts of a text that provide antecedents for anaphoric expressions. We consider as NPs all phrases whose syntactic head is a noun, such as a man, the man, every man, most people, and all phrases that can occur in the same syntactic positions as phrases headed by nouns -- e.g., proper nouns like John or pronouns like she. All NPs are to be tagged with a <ne> (Nominal Expression) tag, and assigned an ID as well as the other attributes specified below:

<ne ID="ne1" .... (other attributes) ...> a tiara</ne>

The annotation of NPs should be done after both layout and units have been marked up, and should then proceed as follows:

  1. First mark up all NEs and specify values for the attributes CAT, GEN, GF, NUM and PER.
  2. Then annotate anaphoric relations and DEIX.
  3. Then annotate the attributes ANI, COUNT, LFTYPE and ONTO, taking information about anaphoric relations into account.
  4. Finally, mark GENERIC, STRUCTURE and LOEB.
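For annotators who want to check which pass an attribute belongs to, the order above can be kept as a small table. The Python sketch below is only illustrative: the names PASSES and pass_of are our own, and attributes not mentioned in the four steps (such as ID) are simply reported as not scheduled.

```python
# The four annotation passes listed above, in order.
PASSES = [
    ("mark up NEs", ["CAT", "GEN", "GF", "NUM", "PER"]),
    ("anaphoric relations", ["DEIX"]),
    ("attributes that depend on anaphoric relations", ["ANI", "COUNT", "LFTYPE", "ONTO"]),
    ("final pass", ["GENERIC", "STRUCTURE", "LOEB"]),
]


def pass_of(attribute):
    """Return the 1-based pass in which an attribute is annotated, or None."""
    name = attribute.upper()
    for number, (_label, attributes) in enumerate(PASSES, start=1):
        if name in attributes:
            return number
    return None


print(pass_of("CAT"))
print(pass_of("ani"))
print(pass_of("ID"))
```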


NPS TO BE MARKED AS NES

Identifying the noun phrases to mark with an <ne> tag is not always easy. The criteria for identifying NPs could be based either on task-oriented grounds -- i.e., we may mark only the NPs that would give us information about the task of generating nominal expressions (e.g., only those NPs that realize knowledge base entities) -- or on syntactic grounds -- i.e., mark as NEs all syntactic constituents that could occur in NP position. From a syntactic point of view, noun phrases typically occur as complements of verbs or prepositions:

[John] likes [dogs]
[Each vase] is decorated with [inlaid decoration]

but NPs may also occur inside other NPs:

[Their workshop] probably also supplied [the bronze Chinese figures above [the clock]].

From the point of view of their denotation, NPs generally fall into one of two classes:

As we are interested in getting an idea of the distribution of various types of NPs, and we do not yet have a semantic specification of the types of entities we may have to realize, we mostly based our decisions as to what to annotate on syntactic grounds, with a couple of exceptions discussed below. Also, NPs in all syntactic contexts should be annotated, including:

In the rest of this section we specify the types of NPs to be marked, and how they should be marked; the parentheses indicate where the <ne> and </ne> tags should be placed. Some of the examples are borrowed from (Passonneau, 1996) and from (Quirk and Greenbaum, 1973).

Noun Phrases with a Head Noun

These are the `prototypical' noun phrases. They include: (some of the examples refer to types of noun phrases discussed later; the noun phrases being exemplified are in bold font)

Noun phrases can be fairly complex and include a number of modifiers such as adjectives, relative clauses, and prepositional phrases. In these cases, the NP tags should include the whole NP, including pre- and post- modifiers; embedded NPs should then be marked in turn. Among the constituents of noun phrases we can find:

Note that NPs may include more than one pre- and post-modifier:

(The posthumous inventory of ((the French king Louis XIV's) possessions in (1720))) describes (the table) in (considerable detail).

Nounless phrases functioning as noun phrases

Some phrases are classified as `noun phrases' not because they have a head noun, but because they can occupy the same syntactic positions as `true' noun phrases. We will mark most of these with an NE tag as well, except for a few cases discussed at the very end. One class of NPs we want to mark are pronouns, including:

Other phrases classified as noun phrases without including a noun (and which we want to mark) include:

Empty Noun Phrases

A NE element should also be marked where `empty' NPs occur. These are obligatory arguments of verbs which are not, however, realized. Unrealized arguments are especially common in instructions: an example is the argument of cover in the following example.

Knead in (enough remaining flour to make (a moderately stiff dough) that is smooth and elastic).
Cover (_), let rest 10 min.

Coordinated NPs

Phrases consisting of two or more coordinated NPs, such as John and Mary, John or Mary can occur in the same syntactic positions as atomic noun phrases:

((John) and (Mary)) went to (the movies)
I will hire ((John) or (Mary))
(John) is ((an officer) and (a gentleman))

These phrases should be marked as <ne> as well, and their constituents should be marked separately. In case of more than two conjuncts, only one coordinated NP should be marked; also, keep in mind that coordination is sometimes expressed by complex constructions such as both John and Mary.

((construction), (art) and (design))
(Both (John) and (Mary)) should go
(Either (John) or (Mary)) should go

Note that the constituent resulting from coordination does not have all the properties of NPs: e.g., it's not clear what the gender of John and Mary should be, or the number of either Sue or Mary. Also, be sure to make a distinction between coordinated NPs and NPs with a conjoined head noun: in the first case three <ne> elements should be marked (each coordinated NP independently as well as their coordination), whereas in the second only one:

((your doctor) or (your pharmacist)) should be able to tell (you) (this)
(your doctor or pharmacist) should be able to tell (you) (this)
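The difference between the two bracketings can be made explicit by listing the spans that each would turn into <ne> elements. The Python sketch below does this for the parenthesis notation used in this section; the function name is our own, and the sketch assumes that every parenthesis in the input is an NE bracket (it would be confused by literal parentheses in the text).

```python
def ne_spans(bracketed):
    """Return the text of every bracketed span, innermost spans first."""
    spans = []
    open_positions = []
    for position, char in enumerate(bracketed):
        if char == "(":
            open_positions.append(position)
        elif char == ")":
            if not open_positions:
                raise ValueError("unbalanced ')' at position %d" % position)
            start = open_positions.pop()
            inner = bracketed[start + 1:position]
            # Drop the brackets of embedded NEs from the reported text.
            spans.append(inner.replace("(", "").replace(")", ""))
    if open_positions:
        raise ValueError("unbalanced '(' at position %d" % open_positions[-1])
    return spans


coordinated = "((your doctor) or (your pharmacist)) should be able to tell (you) (this)"
conjoined_head = "(your doctor or pharmacist) should be able to tell (you) (this)"
print(ne_spans(coordinated))
print(ne_spans(conjoined_head))
```

The first sentence yields three spans for the subject (the two conjuncts and their coordination), the second only one; in both cases you and this add two more. The same function can be used to catch unbalanced brackets in hand-written examples.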

Parentheticals and Appositive Units

Some of the most difficult decisions in NP annotation have to do with parentheticals, such as appositions. Quirk and Greenbaum discuss in some detail parentheticals associated with noun phrases:

In order to uniformly mark both contiguous and non-contiguous appositions, we will follow the proposal adopted in the MUC scheme (Hirschman and Chinchor, 1997), and tag separately any NP contained in a paren-app or parenthetical unit. In addition, if the parenthetical is contiguous to the NP, it will be included in the <ne> element for the NP as a whole. Both restrictive and non-restrictive appositions should be tagged.

(The best jeweller, (Anne-Marie Sillitoe))
(Norman Jones, at that time (a student),) ....
We should send (one of (the engines at (Avon)), say (engine E1,)) to Bath to pick up the tanker car

Each NP embedded in the parenthetical should then be marked as normal:

(the fact that (he) wouldn't betray (his friends)) is very much to his credit

However, we will not mark up as a separate NP the proper names that serve an appositive function but are not part of a parenthetical:

(the famous critic Paul Jones)
(the king of France Louis XIV)

Appositions not contiguous to the NP they modify will not be included in that NP's tags:

(An unusual present) awaited him, (a book on ethics)

Finally, parentheticals that come before the NP they modify should not be included in the <ne> element that marks that NP:

(A sign of economic prosperity), (it) invariably adorns (the necks, hair, wrists and expressive fingers of (the female figures depicted in ((monumental mosaics) and (other works of art)))).

Noun Phrases not to be annotated

The following NPs should not be marked:


NE ATTRIBUTES

These are the attributes specified for NEs:

  1. ID
  2. ANI
  3. CAT
  4. COUNT
  5. DEIX
  6. DEN
  7. GEN
  8. GENERIC
  9. GF
  10. LFTYPE
  11. LOEB
  12. NUM
  13. ONTO
  14. PER
  15. REFERENCE
  16. STRUCTURE

These attributes and the values they may take are discussed below. Each of these attributes can take the value "unsure".


ID: Identity Specification

Each NE is to be assigned a distinct ID. This is done automatically by the annotation software when the NE is created using the New NE option from the GNOME menu. (See below.) Don't worry if this doesn't come out right - ID numbers can be assigned automatically.

It is important to use <ne id="ne60"> Nerisone</ne> correctly.
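To illustrate what automatic assignment of IDs amounts to, the following Python sketch renumbers all <ne> elements of a fragment in document order. It is not the procedure used by the annotation software, which is not described here: the function name, the ne prefix followed by a running number (modelled on the example above), and the upper-case attribute name ID are all assumptions.

```python
import xml.etree.ElementTree as ET


def assign_ne_ids(root, prefix="ne"):
    """Give every <ne> element a distinct ID, numbered in document order."""
    for number, ne in enumerate(root.iter("ne"), start=1):
        ne.set("ID", "%s%d" % (prefix, number))
    return root


fragment = ET.fromstring(
    "<s>It is important to use <ne>Nerisone</ne> on "
    "<ne>the skin of <ne>the child</ne></ne>.</s>"
)
assign_ne_ids(fragment)
print([ne.get("ID") for ne in fragment.iter("ne")])  # prints ['ne1', 'ne2', 'ne3']
```

Embedded NEs are numbered after the NE that contains them, since the containing element opens first.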


ANI: Animacy

(This attribute should only be annotated after information about antecedent relations has been annotated.)

This attribute should be used to specify whether an object is animate or not.

Attribute values

animate

This value should be used for living beings such as humans and animals, and more generally for any entity capable of performing an action. Only objects with onto=concrete may be marked as animate. In the following example, only the NP Archibald Knox should be marked as animate; all the others should be marked as inanimate.

(Archibald Knox) contributed (the largest number of (designs) for (the Cymric scheme)).
inanimate

This value should be used for all objects with onto=concrete that are not living beings and / or could not be replaced by living beings in the particular text.

Coloured stones were sometimes added.

This value should also be used for all NEs with onto=abstract or one of its subvalues, event and time.

undersp-animate

This value should be used for all NEs that cannot be classified as either animate or inanimate, including coord-NPs in which one conjunct is animate and the other is inanimate.

Possible difficulties

Institutions and other social groups should be classified as animate whenever they fulfill an agent role; otherwise, as inanimate.

the work of the Austrian Secession movement
The Wiener Werkstaette were responsible for ...

In general, NPs which could be seen as either animate or inanimate should be tagged as ani=animate if they can be substituted by a person in the clause where they actually occur.

For anaphoric expressions (pronouns, complementizers, etc.) use the ani value of the antecedent. Mark coord-nps as animate if all conjuncts are, otherwise as undersp-animate.
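The rule for coord-nps can be stated in two lines of Python. The sketch follows the wording above literally, so a coordination whose conjuncts are not all animate comes out as undersp-animate; the function name is hypothetical, and the conjunct values are assumed to have been assigned already.

```python
def coord_np_animacy(conjunct_values):
    """Suggest an ANI value for a coord-np from the ANI values of its conjuncts."""
    if conjunct_values and all(value == "animate" for value in conjunct_values):
        return "animate"
    return "undersp-animate"


print(coord_np_animacy(["animate", "animate"]))    # e.g. John and Mary
print(coord_np_animacy(["animate", "inanimate"]))
```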


CAT: NP Type

(This attribute should be specified during the first pass, while NEs are marked up.)

The CAT attribute specifies the type of the NE - proper name, definite NP, etc. The values for this attribute are as follows.

Noun Phrases with a Head Noun

a-np

This value should be used for all NPs with the determiner a:

(a tiara)
(an orange)
another-np

To be used for NPs with the determiner another or other:

(Other special warnings)
(other German and Flemish craftsmen)
q-np

This value should be used for NPs with a head noun and one of the following quantifiers: a few, all, any, both, each, either, every, few, less, little, many, more, most, much, neither, no, several, some

Have (you) ever had (treatment for ((a breast lump) or (any serious disease of ((your) womb))))?
Once (some improvement) is seen in ((your) condition)
(Both students) passed ((their) exam)
(Every student)
(The first three students)
num-np

This value should be used for NPs with ordinal and cardinal numeral determiners not preceded by other determiners:

three years
first car

(The value numwd-np was formerly used for these NPs.)

Note that numerals can also occur after other determiners, as in the first car or all four years; in these cases, the leftmost determiner should be used to decide on the value of cat.

meas-np

This value is to be used for all measure NPs. These typically have a pseudo-partitive structure including a NP specifying the measure and a second NP (which should also be marked as a NE) specifying what is being measured, as in:

(4 mg of (Estradiol))
(about fifty micrograms of (Nerisone))
(a lot of (students))

(The value numfig-np was formerly used for these NPs.)

this-np

For NPs with the determiners this, these.

this amount
these cars

NB: this and these can also be used by themselves as pronouns; in this case, the value this-pro should be used (see below).

that-np

NPs with the determiners that and those:

that night
those cars

Note that that and those can be used as pronouns just like this and these; in these cases, the value that-pro should be used. Furthermore, that can be used as a complementizer, i.e., as a particle that introduces a relative clause or a complement:

The fact that he wrote a letter to her suggests that he knew her.
Each coffer also has a lid that opens in two sections

These uses of that should not be marked up as NEs and should not be assigned a value.

such-np

For NPs with such as a determiner.

such amount
such cars
wh-np

This value should be used for all NPs consisting of a wh-determiner (what, which, how + many/few) followed by a head noun:

which car
how many students

Wh-NPs consisting of an interrogative pronoun only should be classified as wh-pro (see below).

poss-np

This value should be used for all NPs with an NP marked with genitive case, such as a possessive pronoun (see below) or John's.

((John's) car)
((your) doctor)
((the Getty museum's) microscope)
bare-np

This value should be used for all NPs with a head noun and possibly premodified by adjectives or other nouns, but no determiner, quantifier or numeral (in these cases the values q-np, num-np, etc. should be used).

(actors)
(intrinsically worthless material)

This value should also be used for NPs used to specify somebody's job or appellation, as in:

(Mary, (Queen of Scots))
(Louis XIV, (King of France))

One category of NPs which can sometimes be confused with bare-nps are proper nouns. In our domain there are two cases that are particularly difficult to classify as either: names of products (such as Nerisone) and of chemical compounds (such as dextrose, progestogen, or Interferon beta-1b). The convention adopted here is to always classify chemicals as bare-np, even in the case of chemicals with fairly specific names such as Interferon beta-1b; and to always use pn for product names such as Nerisone, except when used with determiners, as in how much Nerisone to use. On the other hand, proper names of sicknesses such as osteoporosis should always be classified as pn.

Another category that is sometimes difficult to tell apart from bare-nps is what we here call gerunds. These can be distinguished by keeping in mind that only present participles of verbs should be classified as gerunds, as in the following example:

<ne cat="pn"> osteoporosis (<ne cat="gerund"> thinning of bones </ne>) </ne>

Every other determinerless NP should be classified as bare-np or pn, as appropriate.

The-NPs and Proper Names

We treat these two classes of NPs together as a lot of NPs in our corpus could be classified either way. (The problematic cases are NPs such as the United States of America, which have a definite article but behave like proper names.)

pn

This value should be used for proper names without the determiner the, including for example names of persons, institutions, geographical places, and products. Dates, as well, should be classified as pn.

Louis XIV
London
1732
Nerisone

Notice that whereas dates like 1732 are to be classified as pn, shorthands of measures such as 25 C should be classified as num-np.

As mentioned above, names of chemicals and other substances such as gold, water, progestogen should be classified as bare-np rather than pn, except when they are names of products. On the other hand, NPs denoting illnesses or cures should be classified as pn:

flu
mumps
Hormone Replacement Therapy

Finally, the value pn should also be used when the NP contains appositive information, as in French artist Gilles Jonnemann.

the-pn

This value should be used for proper names with the determiner the, such as:

the Beatles
the Faubourg-St-Antoine

The value the-np should be used for all other NPs with the determiner the. It is usually easy to recognize definite NPs used as proper names, but in order to determine whether a NE should be classified as the-np or the-pn in the difficult cases, the following test may be used: substitute the definite article with the indefinite article (or no determiner in case of plural NPs) and use the new NP in subject position in a sentence:

the Beatles are in town / *Beatles are in town
the Faubourg-St-Antoine is in Paris / *a Faubourg-St-Antoine is in Paris

If the new sentence doesn't sound right, as in the examples marked with *, the NP is probably a proper name and should be marked as the-pn. This test only fails to work with NPs which are semantically functional, with which the indefinite article cannot be used even though they do not count as proper names:

the first man to reach the moon / *a first man to reach the moon

However, it is generally easy to recognize these NPs in that they have a real head noun, such as man.

the-np

This value should be used for all definite NPs with the determiner the that pass the test just discussed, i.e., whose definite article can be replaced by an indefinite article / zero article, or that are semantically functional:

the car sped by / a car sped by
the dogs barked / dogs barked
the first man to reach the moon / *a first man to reach the moon
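For NPs with a head noun, the determiner-based values above can be approximated by a lookup on the leftmost determiner. The Python sketch below is a first guess only: the function name is our own, the numeral list is a small illustrative sample, and the cases that need the annotator's judgment (the-np versus the-pn, and the values bare-np, pn, meas-np, poss-np and wh-np) are deliberately left out, so the function returns None when no determiner rule applies.

```python
# Quantifiers listed under q-np.
QUANTIFIERS = {
    "a few", "all", "any", "both", "each", "either", "every", "few", "less",
    "little", "many", "more", "most", "much", "neither", "no", "several", "some",
}

DETERMINER_CATS = {
    "a": "a-np", "an": "a-np",
    "another": "another-np", "other": "another-np",
    "this": "this-np", "these": "this-np",
    "that": "that-np", "those": "that-np",
    "such": "such-np",
    "the": "the-np",  # or the-pn: apply the substitution test described above
}

# Illustrative sample of cardinal and ordinal numerals, not a complete list.
NUMERALS = {"one", "two", "three", "four", "five", "first", "second", "third"}


def cat_from_leftmost_determiner(np_text):
    """Suggest a CAT value from the leftmost determiner of an NP, or None."""
    words = np_text.lower().split()
    if not words:
        return None
    # Check two-word quantifiers first, so that "a few" is not read as "a".
    if " ".join(words[:2]) in QUANTIFIERS or words[0] in QUANTIFIERS:
        return "q-np"
    if words[0] in DETERMINER_CATS:
        return DETERMINER_CATS[words[0]]
    if words[0] in NUMERALS:
        return "num-np"
    return None


for np in ["a tiara", "a few students", "three years", "the first car", "all four years"]:
    print(np, "->", cat_from_leftmost_determiner(np))
```

The last two inputs show the leftmost-determiner convention: a numeral that follows another determiner does not decide the value of cat.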

Pronouns

pers-pro:

This value should be used for the pronouns I, you, he, she, it, we, they, me, him, her, us, them

poss-pro:

For the pronouns my, your, his, her, its, our, their, whose

refl-pro

For the pronouns myself, yourself, himself, herself, itself, ourselves, yourselves, themselves

rec-pro

For the reciprocal pronouns each other, one another

q-pro

For the pronouns anybody, anyone, anything, everybody, everyone, everything, nobody, no one, nothing, somebody, someone, something (also with else), and for any, all, both, some used as pronouns instead of as determiners (as in I'll take both).
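The pronoun lists given so far can be kept as a table and queried by word form. The Python sketch below returns all candidate values rather than a single one, because some forms belong to more than one list (her is both a personal and a possessive pronoun) and the choice depends on the context; the names used are our own, and both spellings noone and no one are accepted.

```python
PRONOUN_CATS = {
    "pers-pro": {"i", "you", "he", "she", "it", "we", "they",
                 "me", "him", "her", "us", "them"},
    "poss-pro": {"my", "your", "his", "her", "its", "our", "their", "whose"},
    "refl-pro": {"myself", "yourself", "himself", "herself", "itself",
                 "ourselves", "yourselves", "themselves"},
    "rec-pro": {"each other", "one another"},
    "q-pro": {"anybody", "anyone", "anything", "everybody", "everyone",
              "everything", "nobody", "noone", "no one", "nothing",
              "somebody", "someone", "something",
              # only when used as pronouns, not as determiners
              "any", "all", "both", "some"},
}


def candidate_pronoun_cats(form):
    """Return the CAT values whose pronoun list contains the given form."""
    key = " ".join(form.lower().split())
    return sorted(cat for cat, forms in PRONOUN_CATS.items() if key in forms)


print(candidate_pronoun_cats("her"))         # prints ['pers-pro', 'poss-pro']
print(candidate_pronoun_cats("Each other"))  # prints ['rec-pro']
```

An empty list means that the form is not in any of these five lists; it may still be a wh-pro, this-pro, that-pro or num-ana, which are discussed next.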

wh-pro
free-rel

wh-pro should be used for all interrogative pronouns except for possessive whose: who, whom, where, what, when, why, which.

One special case of wh-NPs are so-called free relatives, as in:

(What you need most) is (a good rest)
(What to do when starting to use Estracombi)

The syntactic status of these constructs is not very clear; for the moment they should be assigned the special value free-rel.

this-pro

For this, these used as pronouns (i.e., without nouns); otherwise use this-np

I'll buy this
I'll buy these

that-pro

For that, those used as pronouns (i.e., without nouns); otherwise use that-np

I'll buy that
I'll buy those

num-ana

To be used for NPs with numerical determiners such as one, ones, two used without a head noun.

I want one
I'll buy two

null-ana

This value should be used for NEs which mark an expected, but omitted argument of the governing verb. This can be tested as follows: Can a pronoun be inserted without changing the meaning?

Do not use <ne CAT="null-ana"></ne> on a child under one year of age.

Other NPs

gerund

This value should be used for all uses of the present participle form without determiner as subject, object or within a prepositional phrase:

(The color) was scraped away before (firing)
((Andre-Charles Boulle's) name) is synonymous with (the practice of (veneering (furniture)))

complementizer

This value should be used for relative pronouns in subject position, and for wh-NPs occurring in complementizer position in relative clauses:

(the man (that) shot (Liberty Valance))
(this) is (the man (whose company) was bought by (Microsoft))

coord-np

This value should be used for coordinated NPs:

Either John or Mary
construction and design


COUNT

(This attribute should only be annotated after information about antecedent relations has been annotated.)

This boolean attribute should be used to specify whether an NP denotes a countable object (an example of countable objects are rings, as one can see from the fact that three rings is possible) or a non-countable one (such as gold or information).

Attribute values

count-yes

This value should be used for countable nouns such as car, town, year or jewel. As the name says, a noun is countable if it can be counted, so a simple test for countability is whether it's possible to replace the NP with a num-np with the same head noun:

(I) bought (a car) / (I) bought (three cars) [COUNTABLE]
(I) bought (gold) / *(I) bought (three golds) [NON COUNTABLE]

However, we also want to classify as count-yes proper names or pronouns that refer to objects of this type, such as London, 1764, the Koh-I-Noor.

count-no

For non-countable nouns: e.g., names of chemicals such as water, gold, oestradiol and most physical substances such as wax, cream, gas, whether viewed functionally (poison) or chemically. Also for substances produced by or contained in the body such as blood, sweat (but not tears, which can be counted: three tears).

undersp-count

For cases in which it's not clear from the text how a NP that can be interpreted as count-yes or count-no (e.g., Dermovate, that is both a product / medicine and a cream) should be interpreted.

no-count

Use this value for nes with CAT=coord-np.

Possible Difficulties

Since the test whether a singular NP is count-yes or count-no is to try to replace it with its plural form, most plural NPs are count-yes. However, it is important to remember that most non-count nouns can also be pluralized, possibly with a change in meaning: e.g.,

The lambs were eating quietly. (Quirk and Greenbaum)
There is lamb on the menu. (Quirk and Greenbaum)

A first exception to this test is meas-nps: most of their properties derive from those of the object that gets classified, and they should therefore be classified as count-no.

(4mgs of (oestradiol))

A second exception, already mentioned, are headless NPs such as pronouns and proper names, that should be marked according to the type of object they denote: count-yes if they denote a countable object, as in it referring to a dog or a car, count-no if they refer to a non-countable object, as in that referring to ivory. Complementizers, as well, should be classified according to the type of object denoted by the NP they modify. Louis XIV should be marked as count-yes because person is countable. One difficulty here is that proper names of products like Estracombi are used in two different ways in these texts: to refer to a product, in which case they should be marked as count-yes, as in:

What you need to know about Estracombi.

and to refer to the actual physical object of which the product consists, which could be non-count as in the case of Dermovate, which is a cream:

Gently rub the correct amount of Dermovate into the skin until it has all disappeared.

When the context does not disambiguate between the two readings, use undersp-count.

num-nps are count-yes by definition (you can only count something which is countable). Quantified NPs are generally count-yes, but not always (e.g., some water should be classified as count-no). q-pros like someone or stand-alone any should be annotated in the same way (someone always as countable, any depending on the type of antecedent).

The decision about the value for COUNT depends in part on the semantic properties of the object, in part on the basis of the lexical items that are available to describe it. One case in which this becomes an issue are events. Objects with ONTO=event are generally count-yes in a semantic sense: i.e., it is possible to count events. However, not all lexical items that can be used to refer to events are countable. Nouns like war and discovery are; so are nominalizations like treatment. All of these can be referred to by means of noun phrases with articles or quantifiers:

(the wars between (France) and (Spain))
(the discovery of (penicillin))
I've had (three treatments) so far.

Gerunds such as bleeding, however, are always count-no, even though they are also nominalizations of events. Of the other abstract objects, temporal expressions are generally count-yes (e.g., three years), whereas ideas and concepts such as the law are count-no. In the case of free relatives, as always, the decision should be based on the type of object that the free relative describes:

(what (you) need to know about (Nerisone)) [COUNT-NO]
(what's in ((your) medicine)) [COUNT-NO]
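A few of the rules above fix the value of COUNT from the value of CAT alone. They could be collected into a lookup table, as in the following minimal sketch; the names COUNT_BY_CAT and default_count are hypothetical, and every CAT value not listed is deliberately left to the annotator's judgment.

```python
# Values of COUNT that the text above ties directly to a value of CAT.
# COUNT_BY_CAT and default_count are illustrative names, not part of the scheme.
COUNT_BY_CAT = {
    "coord-np": "no-count",   # coordinated NPs always get no-count
    "num-np": "count-yes",    # only countable things can be counted
    "meas-np": "count-no",    # first exception to the pluralization test
    "gerund": "count-no",     # gerunds are always count-no
}


def default_count(cat):
    """Return the COUNT value implied by CAT, or None if the annotator must decide."""
    return COUNT_BY_CAT.get(cat)
```

A return value of None covers all the cases discussed under Possible Difficulties, such as pronouns, proper names and product names.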


DEIX: Deictical NE?

(This attribute should be annotated while marking up anaphoric information.)

This attribute should be used to identify those NEs that refer to objects located in the visual or immediate situation in which the text is being read. We are especially interested in studying the use of these references in the museum domain, whose texts describe an object by providing a picture and text that refers to parts of that picture: e.g., the description of a XVII-century French cabinet may include references such as:

this cabinet
The fleurs-de-lis on the top two drawers indicate that the cabinet was made for Louis XIV.
The bronze medallion above the central door was cast from a medal struck in 1661 which shows the king at the age of twenty-one.

In the patient information leaflets domain, you should imagine the reader having the package in front of him/herself; therefore references to the medicine or the package should be treated as DEIX=deix-yes.

Note that deictic references are not only done with that-NPs or the-NPs; they can also be done with indefinite NPs, especially when the part of the picture in question is not visible:

A phrase from Virgil engraved beneath the dial- Solem audet dicere falsum (It dares the Sun to tell a lie)-alludes to the accuracy of this type of clock and its ability to demonstrate the irregularity of the sun's orbit.

Such NEs should be given DEIX=deix-yes. NEs that do not refer to objects in the immediate situation should be given the value DEIX=deix-no. We are also interested in identifying references to parts of the document itself: e.g., this section, the picture, the next page. The value DEIX=meta should be used for these NEs.

Attribute values

deix-yes

As said above, the primary use of this value is to mark up all references to objects contained in pictures associated with the texts we mark up:

this table

This value should also be used for references to the product and its packaging in the patient information leaflet domain, and to the speaker and hearer of the text; thus, I and you should also always be tagged DEIX=deix-yes (but not when the text contains some reported speech in which a person mentioned in the text refers to her/himself.)

Objects that should definitely not be marked with deix-yes include references to people such as the maker of a particular jewel or the client that ordered it; these should all be marked as deix-no (see below).

meta

This value should be used for all NEs that refer to parts of the document such as section, page, paragraph, picture, title, etc.

It is important to keep in mind that although a picture is part of the layout, the object described by it is not: thus, if we have a text with a picture of a case, the NE this picture should be marked as DEIX=meta, whereas this case should be marked as DEIX=deix-yes.

deix-no

This value should be used for all NEs which do not refer to either an object in a picture included in the text, or to part of the document. This includes pretty much all NEs in the pharmaceutical domain except for the second person pronouns that refer to the reader of the leaflet (that should be marked as deix-yes) and for the references to the document (that should be marked as meta). In the museum domain, this value should be used for all references to people, places, and to properties (see below). E.g., in the following example, all NEs should be marked as deix-no:

In (the Dutch Wars of (1672 - 1678)), (France) fought simultaneously against (the Dutch, Spanish, and Imperial armies), defeating ((them) all).

Possible difficulties

The main difficulty is when the value deix-yes should be used. As said above, in the pharmaceutical domain this value is used only for second person pronouns and for references to the product being sold; but in the museum domain, it is sometimes difficult to decide what counts as `an object in the picture'. For one thing, some parts of an object may be hidden in the picture (e.g., the back); these parts should nevertheless be counted as being part of the situation described by the picture. A second difficulty concerns deciding what counts as `an object in the picture'. Only clearly identifiable, CONCRETE objects should be marked as deix-yes. Any references to properties of an object such as its color, its height, its weight, etc. should be marked as deix-no. (More in general, any reference to abstract objects should be treated as deix-no.) Also, any references to the materials that compose the object (ivory, horn, gold, etc.) should be marked as deix-no.

On the other hand, references to the objects which are represented by the objects in the picture should be marked as deix-yes. E.g., if a particular object has the shape of a griffin, references to the griffin should be treated as deix-yes. Similarly for inscriptions found on objects, characters represented on them, etc.

All NEs marked as GF=predicate, such as NPs in appositions or in copular clauses, should also be marked as deix-no.
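The constraint between GF and DEIX stated above lends itself to an automatic consistency check. The sketch below is only an illustration: the function name deix_problems is hypothetical, and it assumes that the only legal values are the three listed under Attribute values.

```python
# The three DEIX values listed under "Attribute values" above.
DEIX_VALUES = {"deix-yes", "deix-no", "meta"}


def deix_problems(gf, deix):
    """Return a list of human-readable problems for one NE (empty if none)."""
    problems = []
    if deix not in DEIX_VALUES:
        problems.append("unknown DEIX value: " + str(deix))
    # NEs in appositions or copular clauses must be deix-no.
    if gf == "predicate" and deix != "deix-no":
        problems.append("GF=predicate requires DEIX=deix-no")
    return problems
```

For example, deix_problems("predicate", "deix-yes") reports one problem, whereas deix_problems("subj", "deix-yes") reports none.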


GEN: GENder

(This attribute should be annotated while marking up NEs.)

This attribute is used to specify the syntactic gender of the NP, if any. The test to be used to decide on the value for GEN is whether a subsequent pronoun co-referring with the NE would be masculine, feminine, or neuter, or whether more than one of these could be used.

Values

fem

This value should be used for feminine noun phrases:

<ne ID="ne1" CAT="pn" GEN="fem">Mary</ne> .....
<ne ID="ne2" CAT="pers-pro" GEN="fem">she</ne> ....
<ne ID="ne3" CAT="poss-np" GEN="fem">my aunt</ne> ...

masc

This value should be used for masculine noun phrases:

<ne ID="ne1" CAT="pn" GEN="masc">John</ne> .....
<ne ID="ne2" CAT="pers-pro" GEN="masc">he</ne> ....
<ne ID="ne3" CAT="poss-np" GEN="masc">my uncle</ne> ...

neut

For neuter noun phrases:

<ne ID="ne1" CAT="pn" GEN="neut">1789</ne> .....
<ne ID="ne2" CAT="pers-pro" GEN="neut">it</ne> ....
<ne ID="ne3" CAT="poss-np" GEN="neut">my desk</ne> ...
<ne ID="ne4" CAT="q-np" GEN="neut">any doubts or worries</ne> ...

undersp-gen

A number of NPs could be the antecedent of either masculine or feminine pronouns (e.g., the doctor), whereas other ones can serve as the antecedent for masculine, feminine or neuter pronouns (e.g., the baby or also some country names such as England). Quirk and Greenbaum classify the first type of NPs as having DUAL gender, and the second one as having COMMON gender. We will use the value GEN=undersp-gen for all of these NPs. This value should also be used for coordinated NPs in which the conjuncts have different genders, as in John and Mary or he or she.

<ne ID="ne1" CAT="pn" GEN="undersp-gen">England</ne> .....
<ne ID="ne2" CAT="coord-np" GEN="undersp-gen">John and Mary</ne> ....

The value GEN=undersp-gen should also be used for first and second person NPs.

Possible difficulties

Because only singular pronouns are marked with gender in English, the pronominalization test suggested above doesn't work with plurals. In order to decide the gender of a plural NP, the annotator should consider which pronoun would be used if the head noun were singular instead:

<ne ID="ne1" CAT="the-np" GEN="masc">the men</ne> .....
<ne ID="ne2" CAT="q-np" GEN="fem">some women</ne> ....
<ne ID="ne3" CAT="poss-np" GEN="neut">John's houses</ne> ...

In the case of coordinate NPs, what was said above in the description of undersp-gen applies: if all conjuncts have the same value of GEN use that value, else use undersp-gen. The same applies to NPs with coordinated plural heads: in this case, use the singular test to decide on the gender of each conjoined noun, then decide on the gender for the NP as a whole:

<ne ID="ne1" CAT="q-np" GEN="neut">any doubts or worries</ne> ...
<ne ID="ne2" CAT="q-np" GEN="undersp-gen">most men and women</ne> ...
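The rule for coordinations (use the shared value of GEN if all conjuncts agree, otherwise undersp-gen) can be written down directly. This is a minimal sketch; the function name coord_gen is hypothetical.

```python
def coord_gen(conjunct_gens):
    """Combine the GEN values of the conjuncts of a coordinated NP or head."""
    values = set(conjunct_gens)
    if not values:
        raise ValueError("a coordination needs at least one conjunct")
    # All conjuncts agree: keep their value; otherwise the gender is underspecified.
    return values.pop() if len(values) == 1 else "undersp-gen"
```

With the examples above, coord_gen(["neut", "neut"]) gives neut for any doubts or worries, and coord_gen(["masc", "fem"]) gives undersp-gen for most men and women.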


GENERIC

(This attribute should only be annotated after anaphoric information has been annotated.)

This attribute is used to specify whether an NP refers generically or not.

Values

generic-no

This is the value to be used for NEs that denote "unique physical entities, located at a particular place in space or time" (Lyons, 1977, p. 14) - i.e., the objects denoted by NPs with lftype=term that would be represented in a semantic network as instances of a type or generic concept - and for groups of these objects (i.e., quantifiers over such entities, and coordinations). In our domains, these include proper names (NPs with cat=pn or the-pn) referring to people, spatial locations, and particular points in time such as years and centuries:

I like <ne cat="pn" lftype="term" generic="generic-no"> John </ne>
<ne cat="pn" lftype="term" generic="generic-no"> Staffordshire </ne>
<ne cat="the-pn" lftype="term" generic="generic-no"> the 4th century </ne>

Demonstrative and that-NPs marked deix=deix-yes (e.g., all references to objects in pictures in the museum domain) or deix=meta should also be marked as generic-no:

I want <ne cat="that-np" generic="generic-no" > that book </ne >

First and second person pronouns are virtually always generic-no; in particular, in the medicine domain, all references to you should be classified as generic-no. Third person pronouns tend to refer to particular objects, and definite NPs are often used in this way. In particular, most (although not all) references to body parts in the pharmaceutical domain are generic-no:

Where is <ne cat="the-np" generic="generic-no" > the pen </ne > that <ne cat="pro" per="per1" generic="generic-no" > I </ne > bought? (Quirk and Greenbaum)

Finally, indefinite NPs such as a-nps, bare-nps, and num-nps may also refer to particular objects:

<ne cat="poss-np" generic="generic-no" > His wife </ne > lived modestly in <ne cat="a-np" generic="generic-no" > a five-room apartment in <ne cat="the-pn" generic="generic-no" > the Faubourg Saint-Antoine </ne > </ne >
<ne cat="num-np" generic="generic-no" > two rings from <ne cat="pn" generic="generic-no" > Rocester </ne > </ne >
<ne cat="coord-np" generic="generic-no" > <ne cat="a-np" generic="generic-no" > A lion</ne> and <ne cat="num-np" generic="generic-no" > two tigers </ne> </ne> are sleeping in the cage. (Quirk and Greenbaum)

In the last example, "we have in mind specific specimens of the class `tiger'" (Quirk and Greenbaum, p. 88), and likewise of the class `lion'. Note also that the coord-np in this example is marked generic-no, as both of its conjuncts are.

generic-yes

This value should be used for all NPs that refer to types of objects. For example, in

Tigers are dangerous animals

the NP `tigers' does not refer to any specific tiger; the sentence is used to specify a property of the class as a whole. Indeed, as shown by Carlson, there are properties that can only be attributed to the class as a whole:

Dodos are extinct

NEs used in this way should be classified as generic-yes. This includes, first of all, all NPs classified as lftype=pred; all of these should be classified as generic-yes. As the examples above show, bare-nps (plural and singular) are often used for generic references, especially when in subject position. In our domain, most singular bare-nps referring to chemicals are also generic:

I like music / wine / bread
Estracombi TTS patches contain oestradiol and norethisterone acetate
Are you taking Barbiturates?

bare-nps referring to abstract objects are also mostly generic:

change of life
scenes from mythology

Gerunds also always denote types (of events).

She succeeded in creating beauty

Other types of indefinite NPs, and even definite NPs, can be used for generic references as well, although as we saw above, they are often used for non-generic references:

The tiger / a tiger is a dangerous animal.
The German / A German is a good musician

In fact, even pronouns can sometimes be used in a generic sense, as shown by so-called paycheck pronouns:

The man who gives his paycheck to his wife is wiser than the man who gives it to his mistress

undersp-generic

This value should be used for coord-nps if one conjunct is generic-yes and the other generic-no, or for NPs that could be either generic or not (like references to products when it's not clear whether the text is referring to the medicine in general or the particular object in the patient's hands).

Possible difficulties

It is sometimes difficult to decide whether a particular NE denotes generically or not. As said above, in some cases the value of cat can be useful. Thus, proper names are almost always classified as generic-no; this includes, for example, all references to diseases (epilepsy, thrombosis, etc.). The one exception to this rule is names of products: Estracombi could refer either to the product in general, or to the particular medicine you have in your hands. In the patient information leaflets domain it's generally safer to treat these NPs as generic-yes (i.e., as referring to the product in general).

Conversely, gerunds should always be classified as generic-yes, and bare nps almost always; this includes references to substances such as oestradiol in the patient information leaflets domain.

Of the other NP types, pronouns should be classified like their antecedents (see the ante information); for quantifiers, it depends on what kind of objects they are quantifying over. For example, any-nps in questions about the patient tend to be generic-no, since they are referring to properties of a particular individual:

Have you ever had treatment for any serious disease of your womb?

Whereas quantifiers in generic sentences should be classified as generic-yes:

Many women had their ears pierced

Looking at the verb helps in some cases: thus, NPs in object position of stative verbs usually refer to types:

I like water / music / cars / long walks
This bread tastes of onion

whereas the same NPs when arguments of telic verbs tend to be used non-generically:

I drank water
I saw cars going by
While in Scotland, I took long walks

In other cases, however, the decision as to whether an entity is a type or a token is genuinely difficult. First of all, there are NEs that refer to entities whose ontological status is not clear - basically, it is up to the KB designer to decide how they should be treated. One example are abstract objects such as `society', `hope' or `the law'. As said above, it is usually safe to classify as generic-yes all abstract bare-nps. NPs referring to properties of a specific individual, such as its height or the shape of the jewel should be classified as generic if the object in question is generic, otherwise as non-generic.

Finally, there is a case that is very relevant to one of our domains: proper names that refer to objects that have `copies', such as book names or names of products:

I bought War and Peace
A guide to using Nerisone

The decision in these cases should be based on an educated guess as to whether the entity in question is likely to be realized in the KB as a type or an instance; in most cases, they should be treated as generic.

In the case of one-anaphora, we have a pronoun that refers to a particular (although non-specific) instance of a type, but is anaphoric on the type. The solution suggested for these cases is to classify the NE as non-generic, but indicate anaphoric dependence on the type by means of the ANTE element. (Discussed below.)
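Several of the guidelines in this section determine the value of GENERIC from attributes that have already been annotated. The sketch below collects only those cases and returns None wherever the text leaves the decision to the annotator. The function name default_generic and its parameters are hypothetical; in particular, returning undersp-generic for every coordination whose conjuncts disagree is an assumption that generalizes the generic-yes / generic-no case described above.

```python
def default_generic(cat, lftype=None, deix=None, conjunct_values=None):
    """Return the GENERIC value implied by other attributes, or None."""
    if cat == "coord-np":
        values = set(conjunct_values or [])
        if len(values) == 1:
            return values.pop()          # all conjuncts agree
        # Assumption: any disagreement between conjuncts is underspecified.
        return "undersp-generic" if values else None
    if cat == "gerund" or lftype == "pred":
        return "generic-yes"             # gerunds and predicates denote types
    if cat in {"this-np", "that-np"} and deix in {"deix-yes", "meta"}:
        return "generic-no"              # deictic and document references
    return None                          # left to the annotator
```

Proper names, pronouns and quantifiers fall through to None, since their value depends on the antecedent or on the objects being quantified over.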


GF: Grammatical Function of a NP

(This attribute should be annotated while marking up NEs.)

This attribute is used to mark the grammatical function of an NP in the clause in which it occurs. For this reason, it should only be used for NEs which occur in UNITs marked with VERBED=yes. NEs that occur in parenthetical units which in turn occur inside <ne> elements may be marked as predicate or np-mod; look at the discussion of these values. All other NEs (occurring in parentheticals not included in NEs, in titles, in listitems, etc.) should be given the value no-gf.

(Note: the values of GF are mainly taken from the FrameNet annotation scheme. Some of the differences are that in FrameNet, comp and np-compl are conflated into one class; comp is defined in a slightly different way; and predicate is missing.)

Attribute values

subj

This value should be used for NEs that occur in subject position in units with VERBED=yes.

And (Morris) followed very quickly after.

NEs should only be marked with this value if they occur in units marked as SUBJECT=full-subject. Check that these values are consistent: something is wrong if you find yourself annotating a NE with GF=subj in a unit marked as SUBJECT=no-subject or SUBJECT=empty-subject, or, vice versa, if a unit marked as SUBJECT=full-subject contains no NE marked as subj.
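The consistency requirement between GF=subj and the SUBJECT attribute of the unit can be checked mechanically. The following is a minimal sketch under the assumption that the GF values of the NEs belonging to a unit have already been collected into a list; the function name subject_gf_problems is hypothetical.

```python
def subject_gf_problems(subject, gf_values):
    """Check one unit: subject is its SUBJECT value, gf_values the GF of its NEs."""
    has_subj = "subj" in gf_values
    problems = []
    if has_subj and subject != "full-subject":
        problems.append("GF=subj found in a unit with SUBJECT=" + str(subject))
    if subject == "full-subject" and not has_subj:
        problems.append("SUBJECT=full-subject but no NE is marked GF=subj")
    return problems
```

An empty list means the unit is consistent; both directions of the check described above are covered.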

obj - the direct object

This value should be used for NPs in direct object position of transitive verbs (such as buy) and ditransitive verbs (such as give):

(A new group of (customers)) stimulated (the jewelry trade).
(The posthumous inventory of ((the French king (Louis XIV))'s possessions in (1720))) describes (the table) in (considerable detail)

This value should also be used with NPs occurring as non-subject arguments of a transitive phrasal verb (Quirk and Greenbaum, chapter 12). Phrasal verbs are verbs which consist of a verb plus a particle; they can be intransitive (as in sitting down) or transitive (like set up in John set up a new unit). The post-verbal argument of transitive phrasal verbs should be marked with GF=obj rather than GF=adjunct (the following examples are from Quirk and Greenbaum):

(We) will set up (a new unit).
Drink up (your milk) quickly.
put on (a patch)
cover up ((your) patch)

One test to recognize phrasal verbs is that in most cases the particle can either precede or follow the argument (except when the argument is a pronoun):

(They) turned on (the light) / (They) turned (the light) on.
(They) called off (the strike) / (They) called (the strike) off.
(He) looked (it) up / *(He) looked up (it).

For the purposes of this annotation, we are only going to consider as phrasal the verbs that pass this test; in all other cases, mark the NP as GF=adjunct.

On the other hand, when the main verb is to be, the values GF=predicate or GF=there-obj should be used; see next.

predicate

Predicate is the grammatical function assigned to the complement in copula constructions, i.e., constructions in which the verb be is the head (as opposed to acting as an auxiliary), except when the subject is expletive there, in which case the value GF=there-obj should be used (see below).

(This) is (a production watch).
(the Palais-Royal) was (the residence of ((the king's cousin)))

This value should also be used for NEs occurring in units marked as paren-app:

(Anne-Marie Shillitoe, (an Edinburgh jeweller))
... (inflammation around (the mouth) (perioral dermatitis)).

there-obj

This value should be used for the post-copular NP in there constructions, as in the following examples:

There is (a man) in (the garden)
There is (considerable evidence) for (the valuable, frequently gem-studded belts) ...

comp

This value should be assigned to NPs and PPs governed by a ditransitive verb such as give, grant or allow, which are not direct objects.

(The duke) gave (the teapot) (to my aunt)
(The king) granted (him) (the royal privilege of (lodging in (the Palais du Louvre)))

adjunct

This value should be used for all other NEs which occur as part of prepositional phrases inside of a unit (if they occur inside of a NE, use GF=np-mod or GF=np-compl instead). These prepositional phrases may express the spatial or temporal location of the eventuality described by the verb, or instruments used to make some object, or the material:

In (the courts of Europe), (lavish quantities for (formal diamond jewelry)) continued to be worn.
(This jewel) is made of (wood)

Notice that PPs are not tagged in the scheme: although the PP is in fact the adjunct, it is the embedded NP, not the PP, that is tagged as GF=adjunct.

gen

This value should be used for possessive NPs functioning as determiner:

((its) mount)
((the artist's) collection)

NPs occurring after an of particle should be classified as np-compl or np-mod even when they express possession:

(the ring of (Jean de Grailly))

np-compl

For NPs occurring as post-nominal complement of an NP. Some complements can be recognized because they are indicated by the particle of:

(the use of (acrylics))
(the design of (this chandelier))
(the straps of (a dress))
(the ring of (Jean de Grailly))
((Purple), (white) and (green)) were (the colours of (the suffragette movement))

Notice that these NEs do not all have the same semantic function: in the second example above design clearly calls for an argument, whereas in the third, fourth and fifth example the particle of is used to express possession; however, we are going to mark all of these cases as np-compl. One distinction that we are going to make, however, is that between these cases and partitive and quasi-partitive constructions, in which of is used to indicate the argument of a determiner rather than a noun. In these latter cases, np-part should be used (see next).

The more difficult cases are those in which the complement is not indicated by the particle of. Some examples are:

(the answer to ((your) question))
(the solution to (these problems))

np-part

This value should be assigned to NPs that specify the domain of quantification of a quantifier. These NPs also occur as arguments of the particle of, but the of-construction is used to specify an argument of the determiner rather than a noun:

(Two of (them)) are (buttons)
(Titanium) is (one of (the refractory metals))

In some cases, particularly noun phrases that refer to a certain amount of a given substance, the determiner / quantifier involves noun-like elements (as in a lot), making it difficult to decide whether np-part or np-compl should be used:

(A lot of (effort)) went into (the making of (these early plastics))
(Three pounds of (bread))

Although the decision may be rather difficult in general, in some of these cases it is possible to decide by looking at whether the head noun of the embedding noun phrase indicates a measure; in these cases, np-part should be used.

np-mod

This value should be used for NEs occurring in a PP which modifies a noun, but cannot be classified as either np-compl or np-part.

<ne> the man with <ne GF="np-mod"> a hat on <ne GF="np-mod"> his head </ne> </ne> </ne>

This value should also be used for NEs included in a unit marked as utype=parenthetical which occurs inside a NE:

<ne> a clock of <ne GF="np-compl"> <ne GF="np-compl"> the same design </ne> and <ne GF="np-compl"> similar marquetry </ne> </ne> <unit UTYPE="parenthetical"> now in <ne GF="np-mod"> the Ecole Nationale Superieure des Beaux-Arts </ne> </unit> </ne>

adj-mod

For NEs that occur as arguments of adjectives:

Are you allergic to (any component)?
(the hanging oak branches) are also typical of ((Carlin's) work)
if (you) are heavy with (water)

no-gf

This value should be used for NPs which occur in a unit with VERBED=no that is not a parenthetical or paren-app included in a NE (for which np-mod or np-compl should be used). These units include, among others, titles, listitems, and parentheticals inside other units.

An example

<ne CAT="coord-np" GF="subj"> <ne CAT="bare-np" GF="subj">purple </ne>, <ne CAT="bare-np" GF="subj">white </ne> and <ne CAT="bare-np" GF="subj">green </ne> </ne> were <ne CAT="the-np" GF="predicate"> the colors of <ne CAT="the-np" GF="np-compl"> the suffragette movement </ne> </ne>

Possible difficulties

NPs which are part of a coordinated NP inherit their GF value from the grammatical function of the overall coordination.
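The inheritance rule for coordinated NPs can be verified on annotated files. The sketch below uses Python's standard xml.etree.ElementTree module and assumes, as in the example above, that the attributes are spelled CAT and GF in upper case and that the conjuncts are the direct <ne> children of the coord-np; the function name and the sample text are illustrative only.

```python
import xml.etree.ElementTree as ET


def coord_gf_mismatches(root):
    """Return (coordination GF, conjunct GF) pairs that disagree."""
    mismatches = []
    for ne in root.iter("ne"):
        if ne.get("CAT") != "coord-np":
            continue
        # findall("ne") only looks at direct children, i.e. the conjuncts.
        for conjunct in ne.findall("ne"):
            if conjunct.get("GF") != ne.get("GF"):
                mismatches.append((ne.get("GF"), conjunct.get("GF")))
    return mismatches


# Illustrative sample with one deliberately wrong conjunct (GF="obj").
SAMPLE = """<unit>
<ne CAT="coord-np" GF="subj"><ne CAT="bare-np" GF="subj">purple</ne>,
<ne CAT="bare-np" GF="subj">white</ne> and
<ne CAT="bare-np" GF="obj">green</ne></ne> were the colors
</unit>"""

print(coord_gf_mismatches(ET.fromstring(SAMPLE)))
```

NEs nested more deeply inside a conjunct (for example an np-compl) are not compared, since they have their own grammatical function.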


LFTYPE: Logical Form Type of a NP

(This attribute should only be annotated after information about antecedent relations has been annotated.)

NPs are used to realize different types of LF constituents, such as terms, quantifiers, and predicates. This attribute is used to indicate which type of LF object is realized by a given NE.

Values

quant

This value should be used to mark quantifiers. In our scheme, all quantifiers are marked up as either CAT=q-np or CAT=q-pro:

At (all times) (he) tries to create (a sense of beauty)
(Most of ((Carlin's) works)) were (small, portable and extremely elegant items)
(Every department) has (different procedures) for (hiring)

Wh-NPs should also be given a quant value:

(how many books) did (you) buy?
(which jewel) do (you) prefer?

Finally, any-nps (phrases with any as a determiner) should also be given a quant value:

Ask your doctor to explain any of the medical terms

On the other hand, a-nps, another-nps, num-nps, meas-nps, this- and that-nps, poss-nps, bare-nps, pns, the-nps, pronouns other than q-pro, and gerunds should never be classified as quant. (We assume here a DRT-style treatment of definites and indefinites.)

coord

This is the value that should be assigned to all coord-nps.

pred

Any remaining NP should be classified as either pred or term. pred is used to mark NPs that denote properties. A typical example of such NPs are indefinite and definite NPs in copular clauses:

John is an astronomer

This sentence attributes to John the property of being an astronomer, i.e., its logical interpretation is of the form astronomer(john). Other examples of NPs that denote predicates include the arguments of verbs of change such as become:

(The upright secretaire) began to be (a fashionable form) around (the mid-1700s)
(Martin Carlin) emigrated to (France) to become (an ebeniste).
(The egg) becomes transformed into (a beautiful as well as precious object).
.... colored to (a dull but lustrous grey)

A last case of NPs that denote predicates are NPs occurring in appositions (= parentheticals marked as paren-app):

(Anne-Marie Shillitoe, (an Edinburgh jeweller))
... (inflammation around (the mouth) (perioral dermatitis)).

All of these NPs should have been marked as GF=predicate. They should all be annotated as LFTYPE=pred, except when the value of CAT is pn, in which case the value term should be used.

term

This value should be used to mark up all NPs that realize terms in the logical form of a sentence. These include every type of NP not mentioned so far (proper names, that- and this- NPs, gerunds, etc.) as well as definite and indefinite NPs not used to realize a predicate (i.e., not annotated as pred), including possessive NPs. Proper names are the prototypical example of an NP that should be annotated in this way:

I like <ne cat="pn" lftype="term"> John </ne>

Use term for all anaphoric expressions as well, except for q-pros, even when the antecedent is a quantifier:

<ne cat="q-np" lftype="quant"> Many women </ne> had <ne cat="poss-np" lftype="term"> <ne cat="poss-pro" lftype="term">their</ne> ears </ne> pierced

For the moment, complementizers should be marked up as terms as well.

Possible difficulties

This attribute should be marked by following the steps below:

  1. First of all, see if this NE could be marked as quant.
  2. Next, see if it could be marked up as coord.
  3. If neither of these two classifications applies, check the linguistic context and/or the value of GF. If GF=predicate and the NP is not a pn, use pred; else, use term.
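
The steps above can be sketched as a small function. This is only an illustration and not part of the GNOME tools: the name choose_lftype, the use of plain strings for attribute values, and the GF value "subj" used in the example are assumptions. The quantifier test covers only the cases that can be decided from CAT (q-np and q-pro); wh-NPs and any-nps are passed in as a flag set by the annotator.

```python
def choose_lftype(cat, gf, is_wh_or_any=False):
    """Return an LFTYPE value by following the three steps above."""
    # Step 1: quantifiers (q-np, q-pro), plus wh-NPs and any-nps.
    if cat in ("q-np", "q-pro") or is_wh_or_any:
        return "quant"
    # Step 2: coordinated NPs.
    if cat == "coord-np":
        return "coord"
    # Step 3: predicates, unless the NP is a proper name.
    if gf == "predicate" and cat != "pn":
        return "pred"
    return "term"


# "John is an astronomer": the a-np is a predicate, the pn is a term.
astronomer = choose_lftype("a-np", "predicate")
john = choose_lftype("pn", "subj")
```

Checking the linguistic context (the other half of step 3) still has to be done by the annotator; the sketch only covers the part that follows from CAT and GF.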


LOEB: Functionality of a NP

(This attribute should only be annotated after anaphoric information has been annotated.)

Loebner (1987) claims that what licenses the use of definites is not givenness (with respect to discourse information or to the hearer's information) but whether or not the NP denotes a function. This may happen either on semantic or on pragmatic grounds. The head noun itself may be functional (as in the end or the beginning), or it may be sortal but modified by a modifier such as first or last that turns it into a function (as in the first dog). It may also be the case that the noun complex behaves like a function because of commonsense reasoning: e.g., in the man that Mary married, the predicate man that Mary married can be assumed to be functional. Finally, a predicate may be functional in the discourse: e.g., if a single dog has been mentioned in the discourse, then dog is functional.

The LOEB attribute is used to specify whether an NP is functional or not in the sense just mentioned. When you are annotating an NP, you should ask yourself: is the writer presupposing that there is only one object of this type? Try the possible values in the order given: i.e., first try to decide whether the NP is a propername; if not, whether it's a disc-function; etc.

Important note: LOEB is unlike other semantic categories, in that its value for an anaphoric expression is NOT determined by the value of its antecedent.

Attribute values

propername

Loebner classifies as proper names, in addition to all the NEs classified as CAT=pn (see above), all NPs that are really proper names with a determiner and that we classified as CAT=the-pn, such as the United States of America, as well as proper names with a restrictive prenominal apposition, such as the year 1984 or the number zero. These NEs should get the value LOEB=propername.

disc-function

This value should be used for all pronouns that have not been marked as ambiguous (i.e., their ANTE element only includes one ANCHOR element) and all definite NPs which have only one possible antecedent in the discourse: e.g., for the ring or this medicine in a text in which only one ring or medicine has been mentioned.

We will also assign the value LOEB=disc-function to the indexicals I, you, we.

All following values of LOEB depend on the type of head noun.

sem-function

An NP may be semantically functional, according to Loebner, either because the predicate associated in the lexicon to the head noun is semantically functional, or because it is made into a function by a modifier. These NPs should be given a value of LOEB=sem-function. The following classes of NPs are semantically functional according to Loebner:

the rumour that the president resigned
the question whether the definite article is a numeral
the claim that Nixon was innocent
the dream to become rich
the distance between A and B

In addition, the head noun can be made into a function by a modifier. The modifiers having this effect include:

pragm-function

Some NPs have a head noun denoting a predicate that is not in general functional, but denote uniquely on commonsense reasoning grounds. An example is the woman that John went out with last night: under normal assumptions about going out, it can be assumed that John went out with a single woman. These NPs should get a value LOEB=pragm-function.

relation

The value LOEB=relation should be used for NEs whose head noun denotes a one-to-many or a many-to-many relation, instead of a function. Examples include sister, uncle, teacher, arm, leg, ....

undersp-loeb

This value should be used for predicates such as result or effect that seem to allow both a functional and a non-functional reading: e.g., the result of the war or a result of the war. Loebner does not discuss these cases in the paper, but what seems to be happening is that we have roles of an event which may take a set as value: so we can talk about the murderer but also about one of the murderers, the victim or the victims, etc.

sort

The value LOEB=sort should be used for those NEs whose head noun is not a relation and is not pragmatically functional.

no-loeb

This value should be used for all coord-NPs.

Possible difficulties

Quantified NPs and bare-NPs are generally sortal (but not always!).


NUM: NUMber

(This attribute should be annotated while marking up NEs.)

This attribute is used to specify the syntactic number of the NP.

Attribute values

sing

This value is assigned to singular noun phrases:

<NE ID="ne1" CAT="pn" NUM="sing">John</NE> .....
<NE ID="ne2" CAT="pers-pro" NUM="sing">he</NE> ....
<NE ID="ne3" CAT="pers-pro" NUM="sing">you</NE> .....
<NE ID="ne4" CAT="a-np" NUM="sing">a jewel</NE> ...
<NE ID="ne5" CAT="the-np" NUM="sing">the jewel</NE> ...

By convention, all instances of the second person pronouns you and your should be marked sing.

plur

This value is assigned to plural noun phrases:

<NE ID="1" CAT="coord-np" NUM="plur">John and Mary</NE> ..... <NE ID="2" CAT="pers-pro" NUM="plur">they</NE> ....
<NE ID="3" CAT="bare-np" NUM="plur">dogs</NE> ... <NE ID="4" CAT="the-np" NUM="plur">the jewels</NE> ...

undersp-num

Some native speakers find both singular and plural anaphoric reference acceptable in the case of NPs such as measles or data. When this is the case, the value NUM=undersp-num should be used.

Possible difficulties

We are mainly interested in information about the number of NPs to investigate the importance of number as a factor in affecting pronominalization; as in the case of GEN, therefore, the test to be used to decide on the value for NUM is whether a subsequent pronoun co-referring with the NE would be singular or plural. Thus, for example, unmarked plurals such as people and `pluralia tantum' nouns such as customs should all be classified as NUM=plur. On the other hand, mass nouns such as water should be classified as NUM=sing because the singular pronoun is used to refer to them:

I prefer to drink <NE ID="ne1" CAT="bare-np" NUM="sing">coffee</NE>. <NE ID="ne2" CAT="pers-pro" NUM="sing">It</NE> wakes me up more.

Conjoined NPs, plural quantifiers, and measure-NPs should also be marked as NUM=plur since plural pronouns are used to refer back to them:

<NE ID="ne1" CAT="coord-np" NUM="plur">John and Mary </NE> came over last night. <NE ID="ne2" CAT="pers-pro" NUM="plur">They</NE> stayed for dinner.
<NE ID="ne3" CAT="q-np" NUM="plur">Some friends from work </NE> came over last night. <NE ID="ne4" CAT="pers-pro" NUM="plur">They</NE> stayed for dinner.
<NE ID="ne5" CAT="meas-np" NUM="plur"> 4 mg of <NE ID="ne6" CAT="bare-np" NUM="sing">oestradiol</NE> </NE>

On the other hand, disjunctive NPs are often NUM=sing:

You should ask <NE ID="ne1" CAT="coord-np" NUM="sing"> your doctor or your pharmacist </NE>. <NE ID="ne2" CAT="pers-pro" NUM="sing">He</NE> will tell you what to do.

This is particularly the case when it's the head noun that is disjoined:

You should ask <NE ID="ne1" CAT="poss-np" NUM="sing"> your doctor or pharmacist </NE>. <NE ID="ne2" CAT="pers-pro" NUM="sing">He</NE> will tell you what to do.

NEs with CAT=gerund should be marked as NUM=sing since singular anaphoric pronouns have to be used to refer to them:

<NE ID="ne1" CAT="gerund" NUM="sing"> Using <NE ID="ne2" CAT="pn" NUM="sing">Nerisone </NE> </NE> is not always easy.
Doing <NE ID="ne3" CAT="this-pro" NUM="sing">this</NE> sometimes causes <NE ID="ne4" CAT="bare-np" NUM="plur">side effects</NE>.

NEs with CAT=free-rel should also be marked as NUM=sing:

<NE ID="ne3" CAT="this-pro" NUM="sing">This</NE> is <NE ID="ne1" CAT="free-rel" NUM="sing"> what you should do </NE>

Otherwise, use the value NUM=unsure-num.


ONTO: Ontological status (abstract or concrete)

(This attribute should only be annotated after information about antecedent relations has been annotated.)

This attribute is used to specify whether a NE refers to an abstract or a concrete object. We also want to identify some sub classes of abstract and concrete objects.

Concrete objects are physical objects that you can touch: people, houses, trees, money. For some of these we are going to use special values: person, substance, medicine. We will also treat as concrete everything that has a spatiotemporal location, such as towns. Everything else should be classified abstract: this includes events such as the second world war, time periods such as this century, objects such as the law or art, etc.

Values

person

This value should be used for all NPs that refer to people:

(You) should tell immediately to ((your) doctor)
Does (any of ((your) friends)) know (this)?
((This table's) unusual materials and coloring) allow (scholars) to link (it) to (a written source) and (a particular building)

substance

This value should be used for any NP that refers to a substance such as water or gold, even when the substance is just mentioned as being part of a medicine, as with oestradiol.

(Estracombi TTS patches) contain ((oestradiol) and (norethisterone acetate))
((This table's) marquetry of ((ivory) and (horn)))

medicine

The third subtype of concrete objects that we want to distinguish are medicines:

(Estracombi TTS patches) contain ((oestradiol) and (norethisterone acetate))
Read (this leaflet) carefully before (you) start using ((your) medicine)

concrete

This value should be used for anything else that can be touched; in particular, for all objects being described in the museum domain.

((This table's) unusual materials and coloring) allow (scholars) to link (it) to (a written source) and (a particular building)

In the pharmaceutical domain, symptoms of diseases should be treated as concrete when they can be touched (hot sweat, red marks) otherwise as events (cough). Temporal expressions and events should be classified as abstract.

space

This value should be used for geographical entities such as towns and countries. It should also be used for references to objects that definitely occupy a space such as farm, factory.

time

Everything that doesn't pass the tests above should be classified as abstract; but there are a few types of abstract objects that we'd like to identify. One example is temporal expressions: the century, 1988, etc. should be counted as temporal expressions and marked up as time.

event

Events are a second type of abstract object that we'd like to classify specially. We count as an `event' every NP which is not concrete according to the tests above, does not refer directly to a span of time, and yet has a duration. Examples are wars, diseases and their non-concrete symptoms, etc.:

In (the Dutch wars (of 1672 - 1678))
(the menopause)

All gerunds should also be classified as events. However, we want to single out a particular type of events, diseases: for these, the value disease should be used (see below).

disease

Use this value for all references to diseases such as breast cancer and epilepsy.

abstract

This value should be used for all other non-concrete objects: references to concepts such as art, law, alchemy, ideas, etc. It should also be used for properties of objects such as height, weight, etc.

<ne onto="concrete"> One stand </ne> was adapted in <ne onto="time"> the late 1700s or early 1800s century </ne> to make <ne onto="concrete"> it </ne> <ne onto="abstract"> the same height as <ne onto="concrete"> the other </ne> </ne>

undersp-onto

This value should be assigned to all NPs that could be classified as either abstract or concrete. One such case is that of NPs with coordinated head nouns, one of which is abstract while the other is concrete:

((This table's) unusual materials and coloring) allow (scholars) to link (it) to (a written source) and (a particular building)

no-onto

This value should be assigned to all coord-NPs.

Possible Difficulties

In the case of pronouns, complementizers, and other headless anaphoric expressions, use the ONTO value of the antecedent. NPs with a head should generally be classified according to the type of object denoted by the head, except for meas-nps - as already said when discussing how to annotate count, in the case of an NP such as (4 mgs of (oestradiol)) it is the substance actually measured that should be looked at when deciding how to classify; so this example should be classified as substance, like oestradiol, rather than abstract because of milligrams. Similarly in the case of free relatives:

(what (you) need to know about (Nerisone)) [ABSTRACT]
(what's in ((your) medicine)) [SUBSTANCE]

Collective entities should be classified on the basis of their parts: so, a group of people should be classified as person because it has people as `parts'. By the same token, this pair of coffers should be classified as concrete since it has two concrete objects as parts.

In the same way, generic references to types should be marked according to the value that would be given to their instances: e.g., the generic reference to dinosaurs in the following sentence should be marked as concrete rather than abstract. NPs such as a type of ... should be classified according to the type of object: so a type of oestrogen should be classified as substance, whereas a type of art should be classified as abstract.

undersp-onto should be used when you're not sure whether the object referred to by a particular NP is abstract or concrete; other values of onto should be used when the uncertainty is more restricted. The values concrete and abstract should be used as underspecified values for the cases in which all you know is whether an object can be touched or not, but you're not sure which subclass of concrete or abstract object should be used. In addition, substance should be used as underspecified value for substance and medicine, and event should be used to leave things underspecified between event and disease. That is, if you know a particular object can be touched, but you're not sure whether it should be classified as a substance or as a medicine, use substance.
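
The underspecification rules in the previous paragraph amount to choosing the most specific value that covers all the candidates you are hesitating between. A minimal sketch follows, assuming the ONTO values form the tree encoded in PARENT. That tree is an assumption read off the descriptions above (in particular, placing space under concrete and time under abstract); the names PARENT, ancestors and underspecified_onto are hypothetical, and no-onto is left out because it is reserved for coord-NPs.

```python
# Parent of each ONTO value in the assumed underspecification hierarchy.
PARENT = {
    "medicine": "substance",
    "substance": "concrete",
    "person": "concrete",
    "space": "concrete",
    "disease": "event",
    "event": "abstract",
    "time": "abstract",
    "concrete": "undersp-onto",
    "abstract": "undersp-onto",
}


def ancestors(value):
    """Return value followed by its ancestors, most specific first."""
    chain = [value]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def underspecified_onto(candidates):
    """Return the most specific value covering every candidate (non-empty list)."""
    candidates = list(candidates)
    common = ancestors(candidates[0])
    for value in candidates[1:]:
        others = set(ancestors(value))
        common = [v for v in common if v in others]
    return common[0]


# Unsure between substance and medicine: fall back to substance.
choice = underspecified_onto(["substance", "medicine"])
```

Under the same assumptions, hesitating between person and time gives undersp-onto, since the two values share no more specific class.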

References to abstract events such as life should be treated as abstract.


PER: PERson

This attribute should be used to specify the syntactic person of the NP.

Attribute values

Possible values are:

per1

This value should be used for 1st person personal, possessive and reflexive pronouns (singular and plural): I, we, my, mine, our, ours, myself, ourselves.

per2

This value should be used for 2nd person personal, possessive and reflexive pronouns (singular and plural): you, your, yours, yourself, yourselves.

per3

This value should be used for all other pronouns and all other non-coordinate NPs.

Possible difficulties

This attribute is relatively easy to mark; the only problems arise in the case of coordination - e.g., for you and me, or me and him. In these cases, one should see which pronoun could be used to refer back to the NP, and use that pronoun's per value: e.g., we is used to refer back to you and me, which should therefore be marked as per1.
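
The pronoun test for coordinations can be sketched as follows. The function name coordination_person is hypothetical, and the sketch assumes the usual English pattern: a coordination containing a first person conjunct is referred back to with we, one containing a second person conjunct (but no first person) with you, and anything else with they. Only the first of these cases is stated explicitly above.

```python
def coordination_person(conjunct_persons):
    """Return the PER value of a coordination from its conjuncts' PER values."""
    persons = set(conjunct_persons)
    if "per1" in persons:   # referred back to with "we"
        return "per1"
    if "per2" in persons:   # referred back to with "you"
        return "per2"
    return "per3"           # referred back to with "they"


# "you and me" -> "we" -> per1
you_and_me = coordination_person(["per2", "per1"])
```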


REFERENCE

(This attribute should only be annotated after anaphoric information has been annotated.)

This attribute should be used to indicate whether the NP denotes `directly' or `indirectly'. The NPs taken to denote directly include proper names and deictic NPs; otherwise, an NP could be either quantified or bound by a quantifier.

Attribute values

direct

This value should be used for all NPs classified as CAT=pn or the-pn, all NPs marked as DEIX=deix-yes, and all NPs such as pronouns that are anaphoric on these.

quantified

This value should be used for all NPs with LFTYPE=quant. It should also be used for all indefinite NPs (CAT=a-np,another-np,bare-np,num-np,meas-np) not marked as DEIX=deix-yes, and all NPs which are anaphoric on one of these (i.e., if there is an ANTE relation indicating this anaphoric relation). Note that if the antecedent is marked LFTYPE=quant, bound should be used instead (see next).

bound

Sentences that contain quantifiers can also contain NPs that logically behave like variables bound by these quantifiers. For example, the pronoun their in the following example is bound by the q-np Few of Carlin's wealthy clientele:

(Few of Carlin's wealthy clientele) would have put their money in this area

This value should be used for all NPs that are anaphoric on an NP annotated as LFTYPE=quant, like their in this example.

no-reference

This value should be used for all NPs with LFTYPE=pred or LFTYPE=coord.
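
The value descriptions above can be combined into one decision function. This is a rough sketch under stated assumptions, not the procedure used by gnome-mode: the name choose_reference, the dictionary representation of the antecedent, and in particular the order in which the tests are applied are all hypothetical. The function returns None when none of the rules above decides the case.

```python
INDEFINITE_CATS = {"a-np", "another-np", "bare-np", "num-np", "meas-np"}


def choose_reference(cat, lftype, deix, antecedent=None):
    """Return a REFERENCE value, or None when the rules do not decide.

    antecedent is a dict holding the antecedent's "lftype" and "reference"
    values, or None when the NP has no ANTE relation.
    """
    if lftype in ("pred", "coord"):
        return "no-reference"
    if lftype == "quant":
        return "quantified"
    if cat in ("pn", "the-pn") or deix == "deix-yes":
        return "direct"
    if antecedent is not None:
        # Anaphoric on a quantifier: behaves like a bound variable.
        if antecedent["lftype"] == "quant":
            return "bound"
        # Otherwise inherit direct or quantified from the antecedent.
        if antecedent["reference"] in ("direct", "quantified"):
            return antecedent["reference"]
    if cat in INDEFINITE_CATS:
        return "quantified"
    return None


# "their" anaphoric on a q-np is bound.
their = choose_reference(
    "poss-pro", "term", "deix-no",
    antecedent={"lftype": "quant", "reference": "quantified"},
)
```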

Possible difficulties

Direct references to diseases such as leukemia should generally be marked as direct since they are annotated as CAT=pn, whereas references to substances should be marked as quantified, whether they are generic or not. More generally, bare-nps are usually quantified, whether generic or not. References to non-countable abstract objects such as law should be marked as direct, whereas references to countable ones should be marked as quantified.


STRUCTURE

(This attribute should only be annotated after anaphoric information has been annotated.)

This attribute should be used to indicate whether the semantic correlate of the (countable) NP is a set or an atom. It only applies for NPs with COUNT=count-yes and LFTYPE=term or quant; otherwise, use the value no-structure.

Attribute values

atom

This value should be used for all NPs marked with COUNT=count-yes and that denote an individual object, such as a single person (e.g., John), a specific town (Rocester), a physical object or one of its parts, or a specific disease (e.g., diabetes). Basically, all NPs marked as COUNT=count-yes and LFTYPE=term or quant should be marked as atom if they are singular (NUM=sing), except for singular NPs referring to collections (as in a group of people or a list of items).

set

This value should be used for all countable NPs referring to collections. This includes all plural terms (cars, two rings from Rocester) and pretty much all NPs marked as LFTYPE=quant (except for singular uses of any, as in Do you have any book you could give me?, which should be classified as atom). All singular terms referring to collections, such as a group of people, a combination of styles or John's record collection should also be classified as set.

undersp-structure

This value should be used for all NPs that could possibly be interpreted as referring to either an atomic individual or a set, but it's not possible to decide. One example is the collections just discussed: whereas John's record collection is fairly clearly a set, the Burrell collection could be used to refer either to a museum, as in Yesterday I visited the Burrell collection (in which case it should be classified as atom) or to the set of paintings contained in the museum, as in The Burrell collection consists of more than three thousand paintings (in which case set should be used).

no-structure

This value should be used for all NPs with a value of COUNT of count-no (i.e., mass nouns) or no-count (i.e., coord-nps).

Possible difficulties

As usual, the classification to be assigned to a pronoun depends on its antecedent: so a pronoun could be classified as either atom, set, undersp-structure, or no-structure. Proper names should be classified according to their denotation. Abstract objects are generally classified as COUNT=count-no, so in most cases they should be classified as no-structure; for abstract objects classified as count-yes, just decide on the basis of NUM (i.e., use atom if the NP is singular, set if it's plural).
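
For NPs with a head (pronouns take the value of their antecedent), the NUM-based rule for STRUCTURE can be sketched as below. The name choose_structure and the is_collection flag are hypothetical; the flag stands for the annotator's judgement that a singular NP refers to a collection. The sketch does not attempt the separate default that most LFTYPE=quant NPs are sets, and it returns None when NUM is neither sing nor plur, leaving the decision to the annotator.

```python
def choose_structure(count, lftype, num, is_collection=False):
    """Return a STRUCTURE value for an NP with a head, or None if undecided."""
    # Only countable terms and quantifiers have a structure.
    if count != "count-yes" or lftype not in ("term", "quant"):
        return "no-structure"
    # Collections are sets even when syntactically singular.
    if is_collection:
        return "set"
    if num == "sing":
        return "atom"
    if num == "plur":
        return "set"
    return None


# "a group of people": singular, but refers to a collection.
group = choose_structure("count-yes", "term", "sing", is_collection=True)
```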


MARKING UP SEMANTIC ATTRIBUTES

The semantic attributes (ANI, COUNT, ONTO, GENERIC, LOEB, REFERENCE and STRUCTURE) are best marked up all together, first thinking about the type of object the NP is referring to, then using the appropriate C-c C-s macro, and finally correcting any feature that is not quite right for the particular NP in consideration. In this section we will briefly discuss how the main types of objects should be generally annotated.

people

References to specific singular individuals should always be marked as ANI=animate, COUNT=count-yes, ONTO=person, GENERIC=generic-no, STRUCTURE=atom, irrespective of the particular type of NP used: proper name, pronoun, a-np, the-np, etc. The value of CAT affects the value of REFERENCE and LOEB: if CAT=pn or CAT=the-pn then REFERENCE=direct and LOEB=propername; otherwise REFERENCE should be quantified or bound, except when the NP is a pronoun and its antecedent is marked REFERENCE=direct, in which case REFERENCE=direct should be marked again. The value of LOEB should generally be disc-function for pronouns, this-nps and the-nps referring anaphorically; in other cases, look at the head noun and use sem-function, pragm-function, relation or sort, as specified above. The only difference for plural references is that STRUCTURE=set should be used.

So, the only real complication when marking up people is to decide whether the NP under consideration is a generic reference or not. In the case of people, proper names are always generic-no; other than that, it depends on whether the sentence is generic or not.

physical objects

In the museum domain, many NPs refer to physical objects such as the jewels under display. The C-c C-s o key sets up appropriate defaults for these objects: ONTO=concrete, ANI=inanimate, COUNT=count-yes. It also sets the defaults LFTYPE=term, STRUCTURE=atom, GENERIC=generic-no and REFERENCE=quantified that should be checked (e.g., every object with DEIX=deix-yes should have REFERENCE=direct; plurals should have STRUCTURE=set; etc.)

substances

One of the trickiest decisions to make in the patient information leaflets domain is what to do with NPs referring to substances, such as oestradiol. The C-c C-s s macro sets up defaults that should apply in most cases: DEIX=deix-no, ONTO=substance, COUNT=count-no, GENERIC=generic-yes, STRUCTURE=no-structure (because these are non-countable), REFERENCE=quantified, LOEB=sort. The two things you should think about are whether this particular reference is generic or not (if it is not, you should change the value of GENERIC) and whether the object denoted is unique in the discourse or not (if it is, you should change LOEB to disc-function).

diseases

Diseases are also very complex from a semantic point of view; hence, for example, the decision to have a special value ONTO=disease for them. Mark them up as direct references to objects; the C-c C-s d macro does this, and sets COUNT=count-no, STRUCTURE=no-structure, REFERENCE=direct, GENERIC=generic-no. These values should be changed for quantified or generic references.

coord-nps

The function C-c C-s c sets up most semantic values for coord-nps: DEIX=deix-no, LFTYPE=coord, COUNT=no-count, ONTO=no-onto, STRUCTURE=no-structure, REFERENCE=no-reference, LOEB=no-loeb. The two values that have to be set by annotators are ANI and GENERIC: these depend on the values set for the conjuncts (the common value if the values are the same, otherwise undersp-ani and undersp-generic).
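
The defaults listed in this section can be collected in a table, which makes it easy to see what each macro sets and what is left to the annotator. The sketch below is only a restatement of the values given above in Python, not the contents of gnome-mode.el; the dictionary keys ("physical-object", etc.) and the function names annotate and combine_conjuncts are hypothetical labels.

```python
# Defaults set by each macro, as listed in this section.
DEFAULTS = {
    "physical-object": {  # C-c C-s o
        "ONTO": "concrete", "ANI": "inanimate", "COUNT": "count-yes",
        "LFTYPE": "term", "STRUCTURE": "atom",
        "GENERIC": "generic-no", "REFERENCE": "quantified",
    },
    "substance": {  # C-c C-s s
        "DEIX": "deix-no", "ONTO": "substance", "COUNT": "count-no",
        "GENERIC": "generic-yes", "STRUCTURE": "no-structure",
        "REFERENCE": "quantified", "LOEB": "sort",
    },
    "disease": {  # C-c C-s d
        "COUNT": "count-no", "STRUCTURE": "no-structure",
        "REFERENCE": "direct", "GENERIC": "generic-no",
    },
    "coord-np": {  # C-c C-s c
        "DEIX": "deix-no", "LFTYPE": "coord", "COUNT": "no-count",
        "ONTO": "no-onto", "STRUCTURE": "no-structure",
        "REFERENCE": "no-reference", "LOEB": "no-loeb",
    },
}


def annotate(kind, **overrides):
    """Start from the defaults for kind, then apply the annotator's corrections."""
    attrs = dict(DEFAULTS[kind])
    attrs.update(overrides)
    return attrs


def combine_conjuncts(values, underspecified):
    """Common value if all conjuncts agree, else the underspecified value."""
    unique = set(values)
    return unique.pop() if len(unique) == 1 else underspecified


# A plural physical object: keep the defaults but correct STRUCTURE.
jewels = annotate("physical-object", STRUCTURE="set")
# ANI of a coord-np whose conjuncts disagree.
ani = combine_conjuncts(["animate", "inanimate"], "undersp-ani")
```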


5. THE ANTE ELEMENT


This section contains instructions for annotating anaphoric relations between text elements using the <ante> element. We use the term anaphoric relation to indicate, first of all, the relation between two text elements that denote the same object; the subsequent mention of an entity already introduced is often marked by means of a particular type of noun phrase (NP) called an anaphoric expression. A typical example of an anaphoric expression is a pronoun such as he in the text

John arrived. He looked tired.

In the preferred reading of this text, the pronoun he refers to the individual 'John' which is denoted by the expression John that occurred earlier on in the text.

Although expressions of all syntactic categories can be anaphorically related, we are only interested in anaphoric relations between noun phrases - indeed, between the subclass of all noun phrases that we mark up using the <ne> tag. For this reason, the annotation of anaphoric relations should come after all the <ne> marks have been annotated as discussed in Section 4 of this manual.

Not all noun phrases marked with the <ne> tag are involved in anaphoric relations: for example, whereas

John likes Bill

introduces two potential antecedents for anaphoric expressions, as can be shown by the fact that a follow-up like

He is crazy

is ambiguous in that he can refer either to John or to Bill, the sentence

John is a policeman,

which from a syntactic point of view also contains two NPs, nevertheless only introduces one possible antecedent, as can be seen by the fact that in this case, the continuation He is crazy is not ambiguous. As a rule, NPs marked as lftype=PRED do not enter into anaphoric relations. The assumption underlying our annotation scheme for anaphoric information is that processing text involves building a discourse model containing discourse entities, and that anaphoric relations between textual elements such as noun phrases express semantic relations (like identity) between these discourse entities (Webber, 1978; Heim, 1982; Kamp, 1981). From now on we will say that a noun phrase that can enter into anaphoric relations `introduces a discourse entity'.

In all examples of anaphoric relations seen so far, the underlying semantic relation between discourse entities is identity; all of these cases should be marked specifying rel=IDENT for the rel attribute of the <ante> element. We are also interested, however, in a few cases of anaphoric relations which involve semantic relations other than identity. So-called bridging references (Clark, 1977) are anaphoric expressions that denote objects only related to the denotation of their antecedent by (shared) generic knowledge. An example is the indicators in:

John has bought a new car. The indicators use the latest laser technology.

We are able to interpret the description the indicators because we know that indicators are a part of cars, and a car was mentioned in the first sentence. These relations should be marked, and the value rel=POSS should be used. Another case of anaphoric relation that doesn't involve identity but that we want to mark is element-set, as in:

The Italian team didn't play well yesterday until the centre-forward was replaced in the 30th minute.

We are also interested in marking cases in which an expression in the text refers to an object that has not been mentioned before, but is 'accessible' because it is part of the visible situation: these expressions are called deictics or also indexicals. An example of indexical expression in a real life conversation is the salt in an utterance of the sentence pass me the salt, please in a context in which the salt hasn't been mentioned before. This information should be marked up at the same time as the rest of the anaphoric information; but instead of marking information about deixis by means of <ante> relations, we will use the deix attribute of the <ne> element; see Section 4.


5.1 MARKING UP ANTECEDENTS

The annotation of anaphoric information should proceed as follows:

  1. First mark up all identity relations;
  2. Then mark up information about deixis;
  3. Finally, mark up the information about bridges.

The main problem to face in doing this type of annotation, especially during the last step of marking information about bridging relations, is that almost any two elements of a text are somehow related; so it is very important to try to annotate only what is absolutely necessary. The general criteria for deciding when to annotate are as follows. First of all, as said above, only anaphoric relations between <ne> elements should be marked. And in doing this, the following principles should be used:

The first of these rules says that you should try to identify all and only the expressions in a text that `establish a link' of some sort with previous units (see Section 3.2 for a discussion of units) by being related to some object in that unit via one of the relations discussed below. The trick here is to make sure that you always establish at least one link between distinct units, without however going overboard. This is not always as easy as in the examples above; you'll find that a lot of <ne>s are somehow related to a previously introduced entity. The way we limit this is, first of all, to annotate only the limited number of relations that we have introduced. Second, you should avoid introducing an <ante> element to mark a relation other than identity between <ne>s in the same clause. E.g., in the following example, you shouldn't use an <ante> element to mark a relation between `John' and `a car', even though the relation is one of those we want to mark (poss), because this relation does not establish a link with a previous unit:

(John) has (a car)

Finally, we limit the amount of work to do by marking up at most one ident relation and one bridging relation. E.g., in the following example, one is related both to John (by a poss relation) and to three TVs (by an element relation); only one should be marked, that to the closest antecedent (the three TVs).

(John) has (three TVs). (He) keeps (one) in the kitchen.

The third rule says that the only case in which you should mark more than one antecedent relation for an <ne> element is when the closest antecedent is not IDENT; in this case, mark both the bridge relation to the closest element and the ident relation further away.

Do make sure, however, that you always mark at least one antecedent! In particular, bear in mind that it's not only definite expressions (the-nps, pronouns, that-nps) that can participate in anaphoric relations. E.g., in the following example, the link with the previous unit is established by a drawer.

[((The table's) top) may be raised to form (an angled reading or writing stand)], [while (a drawer) is fitted for writing equipment].
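
The selection rules above (mark the relation to the closest antecedent, and add a more distant ident relation only when the closest relation is a bridge) can be sketched as follows. The representation of candidates as (distance, rel, antecedent_id) tuples, the function name relations_to_mark and the IDs in the example are assumptions made for illustration.

```python
def relations_to_mark(candidates):
    """Pick the relations to annotate for one anaphoric expression.

    candidates is a list of (distance, rel, antecedent_id) tuples, where a
    smaller distance means a closer antecedent.
    """
    if not candidates:
        return []
    ordered = sorted(candidates, key=lambda c: c[0])
    closest = ordered[0]
    marked = [closest]
    if closest[1] != "ident":
        # Closest link is a bridge: also keep the nearest ident link, if any.
        for cand in ordered[1:]:
            if cand[1] == "ident":
                marked.append(cand)
                break
    return marked


# "(John) has (three TVs). (He) keeps (one) in the kitchen."
# "one" is related to three TVs (element) and to John (poss).
one = relations_to_mark([(2, "poss", "ne1"), (1, "element", "ne2")])
```

In the TV example only the element relation to three TVs survives, as required above. Deciding which candidate relations exist in the first place remains the annotator's job.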

Possible difficulties

In the case of possessive relations, always mark a poss-inv relation between the possessor and the object owned rather than a poss relation.

Do NOT mark:

In case more than one NE appears to introduce an antecedent for a given anaphoric expression and you are not sure which is the antecedent, mark all of them using separate anchors (use the `New Anchor' command for this).


5.2 ATTRIBUTES OF THE ANTE AND ANCHOR ELEMENTS

The <ante> element has two attributes: current, which specifies the ID of the <ne> which represents the anaphoric expression; and rel, which specifies the relation between the discourse entity and the antecedent one. Each <anchor> element has one attribute, antecedent, which should be set to the ID of the antecedent <ne>. The possible values of rel are:

ident

This should be used for the case when both NEs refer to the same object.

(This) is (a type of (brooch) (that) was popular around (the 1960s). (It) might not be instantly recognisable as (`jewellery'), ...

element

This should be used when the NP denotes an element of a set of objects.

(The sixteen panels) are each divided into (three horizontal zones), (the middle) containing (a letter);

This value should also be used when the current NP denotes an instance of a type, whether the antecedent type is expressed by a plural expression, as in:

In (many cases), (the mottoes on (the panels)) are in Greek: (that on (602 (left), from (Corbridge)) ....

or by a singular one, as in:

From (the second century AD) onwards, (the Christian betrothal ring) was usually made of (gold), and (the delicate polygonal example excavated at the Roman fort at Corbridge) ....

subset

This relation should be used when the anaphoric expression denotes a subset of a set of objects.

poss

This relation should be used when the anaphoric expression denotes an object which is somehow `owned' by another object, either because it's part of that object (as in the door ... the house) or because the antecedent owns the object in some sense.

Each of these relations (except for identity) has an inverse relation, which should be used as appropriate so as to always mark anaphoric relations to previously occurring <ne> elements.


7. APPENDIX


7.1 DTD

7.2 gnome-mode: a GNU-Emacs mode for annotating

The file gnome-mode.el contains the definition of a GNU Emacs minor mode, gnome-mode, for producing the annotation specified in this manual. The mode extends PSGML, a general-purpose SGML mode for GNU Emacs; you need to have PSGML before you can use gnome-mode.el.

Using gnome-mode

Once you have installed gnome-mode (see below), you can get Emacs to use it for annotation in three different ways, depending on the status of the annotation. If you want to start annotating a file that you have not annotated before, you should start Emacs and then type:

M-x gnome-start-annotation

The function will ask you for the name of a file to edit, and will create a new Emacs buffer which contains what follows:


<!-- -*-mode: gnome -*- -->

<!DOCTYPE GNOMEDOC SYSTEM "gnome.dtd" [ ]>

<gnomedoc>

<body>

[content of your file here]

</body>

</gnomedoc>


This function also puts the buffer in gnome-mode.

When you want to edit a file that has already been annotated using gnome-mode, do not use gnome-start-annotation again; simply visit the file using C-x C-f. Emacs is clever enough to start gnome-mode automatically. (NB: Emacs recognizes this because of the line

<!-- -*-mode: gnome -*- -->

at the beginning of the file: don't mess with it!)

Finally, you can also start gnome-mode for a file not created with gnome-start-annotation by typing:

M-x gnome-mode

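However, you should only do this when the file already has the structure that gnome-start-annotation would have produced (the mode line, the DOCTYPE declaration, and the <gnomedoc> and <body> elements shown above); otherwise you may end up with a file with syntax errors.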

If you want to stop annotating, you can save the file you were annotating and quit Emacs; the next time you visit the file, Emacs should put you in gnome-mode automatically. Before quitting, you should validate the annotation by typing C-c C-p and then look for errors by using C-c C-o.

S and UNIT annotation

When you want to mark a sentence, you should select the region of text you want to annotate, then choose the `New Sentence' button from the GNOME menu. This will automatically put <s> tags around the text and increment the sentence counter. If you want instead to mark a unit, select the region as before, but choose the `New Unit' button from the GNOME menu.

If you or Emacs make a mistake (e.g., you do not select the region, or select the wrong region, etc.), just undo what you did as you normally would in Emacs by typing C-x u.

NP annotation

If you put brackets (np .. ) around all NP markables in your text, you can use the function gnome:next-np (C-c C-d) to move from markable to markable and create <ne> elements. This function moves you to the next (np .. ) bracket, creates an <ne> element around it, and increments the NP counter. Otherwise, you can select by hand the region you want to mark and then choose the Tag-Region function from the Markup menu.
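
As an illustration only, a bracketed markable and the kind of element that gnome:next-np creates around it might look as follows. The id attribute name, the ID value, and the removal of the (np .. ) brackets in the result are assumptions, not a description of what the function is guaranteed to produce.

```xml
<!-- Before: the markable is bracketed by hand in the text -->
<!-- while (np a drawer) is fitted for writing equipment -->

<!-- After: an <ne> element with the next free ID (hypothetical value) -->
while <ne id="ne12">a drawer</ne> is fitted for writing equipment
```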

When you resume editing a file that you only partially annotated, you should set the automatic counter for NPs to the last ID you assigned to an NP in the previous round by selecting `Set NP Counter' from the GNOME menu.

Editing attributes

Once you have created an <ne> element, you can edit its attributes by typing C-c C-a (edit attributes). This will open a second window in which, for each attribute, you can see whether its value has been set and what the possible values are. You can move from attribute to attribute using TAB; once you are on an attribute, you can specify the value you want by clicking with the mouse on one of the values, then typing C-c C-v. If you want to change a value, type C-c C-k and then choose another value. When you're done, type C-c C-c.

If you prefer, you can also specify the value of an attribute via menus. Click on the Markup menu item and choose Insert Attribute. You'll get a new menu with one item for each attribute, and submenus to choose their value.

Installing gnome-mode

To use gnome-mode, you should first check that you have psgml (see below). This is not always installed, and is not the default sgml mode. In order to get psgml, you have to add this to your .emacs file (on HCRC machines):

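;; Make the PSGML directory visible to Emacs (path valid on HCRC machines).
(setq load-path
      (cons (expand-file-name "~poesio/projects/crt/annotation/psgml-1.0.1")
            load-path))
;; Load PSGML on demand; the final t marks sgml-mode as an interactive command.
(autoload 'sgml-mode "psgml" "Major mode to edit SGML files." t)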

Then you should put gnome-mode.el in the directory where you are going to do the editing, and add the following lines to your .emacs:

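;; The function name must be quoted; the final t makes each command available via M-x.
(autoload 'gnome-mode "gnome-mode.el" nil t)

(autoload 'gnome-start-annotation "gnome-mode.el" nil t)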

Finally, you should make sure you have gnome.dtd in the directory where you are editing, since psgml accesses it to find the names of the elements and their attributes.

