uk.ac.essex.malexa.nlp.dp.GuiTAR
Class DiscourseModelImplementer

java.lang.Object
  extended by uk.ac.essex.malexa.nlp.dp.GuiTAR.DiscourseModelImplementer
All Implemented Interfaces:
DiscourseModel

public class DiscourseModelImplementer
extends Object
implements DiscourseModel

This class provides an implementation for GuiTAR's Discourse Model interface.

Version:
1.1
Author:
Mijail A. Kabadjov

Field Summary
private  Document domDocument
           
private  DiscourseModel flatDM
           
private  Map mapAnaphorToAntecedent
           
private  Map mapEquivalenceClasses
           
private  Map mapIdToNode
           
private  Map mapNeIdToVectors
           
private  Map mapNodeToSegment
           
private  Segment rootSegment
           
private  Vector vecEquivalenceClasses
           
 
Constructor Summary
DiscourseModelImplementer(Document documentCorpus, Document documentParsed, String anteTag, String catAttributeName, Vector typeOfNP)
          Constructs a discourse model for a specific type of NP with reference to an annotation.
DiscourseModelImplementer(Document document, String anteTag)
          Constructs a discourse model for all types of NP with reference to an annotation.
DiscourseModelImplementer(Document document, String anteTag, String catAttributeName, Vector typeOfNP)
          Constructs a discourse model for a specific type of NP with reference to an annotation.
DiscourseModelImplementer(String inputFileName)
          Constructs a discourse model incrementally (GuiTAR's main processing).
 
Method Summary
private  void constructEquivalenceClasses(Document documentCorpus, Document documentParsed, String anteTag, String catAttributeName, Vector typeOfNP)
          Builds the Discourse Model out of the annotation provided in the XML file.
private  void constructEquivalenceClasses(Document document, String anteTag)
          Builds the Discourse Model out of the annotation provided in the XML file.
private  void constructEquivalenceClasses(Document document, String anteTag, String catAttributeName, Vector typeOfNP)
          Builds the Discourse Model out of the annotation provided in the XML file.
private  Node createAnteNode(Node current, Node antecedent, String relation)
          Creates the ante node to be attached to the DOM tree.
private  Cf createCf(Node theNode, Utterance utt)
          Creates an appropriate Cf object for a given syntactic phrase held in a DOM Node.
 void createDiscourseEntity(Cf cf)
          Creates a new Discourse Entity for the given Cf.
private  Utterance createUtterance(Node theNode)
          Creates an appropriate Utterance object for a given Cf held in a DOM Node.
private  Vector findSegmentNodes(Node utteranceNode)
          Finds the DOM nodes corresponding to the path of segments, starting from the utterance node given as a parameter and continuing to the root Segment of the document that is being processed.
static Node findUtteranceNode(Node node)
          Finds the DOM node corresponding to the sentence in which a node holding a Cf is located.
 Map getAnaphorToAntecedentMap()
          Returns the data structure (Map) that holds the anaphor-to-antecedent mappings.
 Object getAntecedent(Object anaphorId)
          Retrieves the refId of the antecedent of anaphorId.
 Vector getCfs(Utterance utt)
          Returns the forward-looking centers within the given utterance.
 DiscourseEntity getDiscourseEntity(Cf cf)
          Retrieves the Discourse Entity of which cf is a realization.
 int getDistance(Object refId1, Object refId2, String tagName)
          Returns the distance between the referential expression identified by refId1, and the one identified by refId2.
 Document getDOMDocument()
          Returns the pointer to the DOM Document of this DiscourseModel.
 Set getEquivalenceClass(int classIndex)
          Retrieves the equivalence class corresponding to classIndex.
 Set getEquivalenceClass(Object refId)
          Retrieves the equivalence class to which refId belongs.
 int getEquivalenceClassIndex(Object refId)
          Retrieves the index of the equivalence class to which refId belongs.
 String getEquivalenceClassString(int index)
          Returns all the members of the corresponding equivalence class separated by commas.
 String getEquivalenceClassString(Object refId)
          Returns all the members of the corresponding equivalence class separated by commas.
 DiscourseModel getFlatDM()
          Returns a flat Discourse Model built from the annotation.
 Utterance getNextUtterance(Utterance uttRef)
          Returns the utterance following the reference utterance in this Discourse Model.
 int getNumberOfAnaphoricReferences()
          Returns the number of (anaphoric) referential expressions in the discourse model.
 int getNumberOfEntities()
          Returns the number of entities contained in this discourse model.
 Utterance getPrevUtterance(Utterance uttRef)
          Returns the utterance preceding the reference utterance in this Discourse Model.
 Segment getRootSegment()
          Returns the root segment of the physical Discourse Model.
 Set getSetOfAnaphors()
          Returns the set of anaphors stored in this discourse model.
 int getTimesMentioned(DiscourseEntity de)
          Returns the number of mentions of a given Discourse Entity, that is, the number of Cfs in its equivalence class.
 Vector getVectors(Object refId)
          Returns a vector of vectors in which this referential expression features.
private  void initialiseDataStructures()
          Initialises the data structures of the Discourse Model.
 boolean isAnaphoric(Object refId)
          Checks whether the referential expression provided is anaphoric.
private  boolean isAnaphorOfType(Document doc, String anaphorNeId, Vector typeOfNP, String catAttributeName)
          This method checks whether a given neId is of a certain type or set of types.
 void printEquivalenceClassesStatistics(DiscourseModel dm)
          Computes P/R per equivalence class on the basis of class intersection and with reference to this discourse model.
private  void processCf(Node theNode)
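          Processes a Cf held in a DOM Node: updates the physical Discourse Model, resolves anaphoric links (if any), and updates the logical Discourse Model and DOM tree.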
 void processFile(String inputFileName)
          Processes a file in XML-in XML-out fashion, identifying and annotating anaphoric links.
 void setAnaphoricLink(Object anaphorId, Object anteId)
          Inserts a new anaphoric link into the discourse model.
 void setAnaphoricLink(Object anaphorId, Object anteId, String relation)
          Inserts a new anaphoric link into the discourse model.
 void setFlatDM(DiscourseModel dm)
          Sets the pointer to a flat Discourse Model built from the annotation.
 String toString()
          Converts the Discourse Model contained in this object into a String.
private  void updateLogicalDiscourseModel(Cf anaphor, Cf antecedent)
          Updates the logical part of the ongoing DiscourseModel.
private  void updateNeToVectorsMap(Vector vec)
          Updates the mapping of ne-to-(vector of vectors of nes) with a new vector of keys.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

vecEquivalenceClasses

private Vector vecEquivalenceClasses

mapEquivalenceClasses

private Map mapEquivalenceClasses

mapAnaphorToAntecedent

private Map mapAnaphorToAntecedent

mapIdToNode

private Map mapIdToNode

mapNodeToSegment

private Map mapNodeToSegment

mapNeIdToVectors

private Map mapNeIdToVectors

rootSegment

private Segment rootSegment

domDocument

private Document domDocument

flatDM

private DiscourseModel flatDM
Constructor Detail

DiscourseModelImplementer

public DiscourseModelImplementer(Document document,
                                 String anteTag,
                                 String catAttributeName,
                                 Vector typeOfNP)
Constructs a discourse model for a specific type of NP with reference to an annotation.

Parameters:
document - The DOM document containing the annotation
anteTag - The tag name of the nodes containing the annotation (e.g. ante)
catAttributeName - The attribute name which stores the type of NP (e.g. cat, AAcat)
typeOfNP - The type(s) of NP to be considered (the-np, pers-pro, etc.)

DiscourseModelImplementer

public DiscourseModelImplementer(Document documentCorpus,
                                 Document documentParsed,
                                 String anteTag,
                                 String catAttributeName,
                                 Vector typeOfNP)
Constructs a discourse model for a specific type of NP with reference to an annotation. An annotation generated automatically is aligned with a corpus one.

Parameters:
documentCorpus - The DOM document containing the Corpus annotation
documentParsed - The DOM document containing the automatically generated annotation
anteTag - The tag name of the nodes containing the annotation (e.g. ante)
catAttributeName - The attribute name which stores the type of NP (e.g. cat, AAcat)
typeOfNP - The type(s) of NP to be considered (the-np, pers-pro, etc.)

DiscourseModelImplementer

public DiscourseModelImplementer(Document document,
                                 String anteTag)
Constructs a discourse model for all types of NP with reference to an annotation.

Parameters:
document - The DOM document containing the annotation
anteTag - The tag name of the nodes containing the annotation (e.g. ante)

DiscourseModelImplementer

public DiscourseModelImplementer(String inputFileName)
Constructs a discourse model incrementally (GuiTAR's main processing).

Method Detail

initialiseDataStructures

private void initialiseDataStructures()
Initialises the data structures of the Discourse Model.


isAnaphoric

public boolean isAnaphoric(Object refId)
Checks whether the referential expression provided is anaphoric.

Specified by:
isAnaphoric in interface DiscourseModel
Parameters:
refId - The id of a referential expression
Returns:
boolean True if the NP is anaphoric, false otherwise

getEquivalenceClass

public Set getEquivalenceClass(Object refId)
Retrieves the equivalence class to which refId belongs. If there is none, it returns null.

Specified by:
getEquivalenceClass in interface DiscourseModel
Parameters:
refId - The id of a referential expression
Returns:
Set The set of refIds, members of the equivalence class being retrieved

getEquivalenceClass

public Set getEquivalenceClass(int classIndex)
Retrieves the equivalence class corresponding to classIndex. If an invalid classIndex has been provided it returns null.

Specified by:
getEquivalenceClass in interface DiscourseModel
Parameters:
classIndex - The index of the equivalence class to be returned
Returns:
Set The set of refIds, members of the equivalence class being retrieved

getDiscourseEntity

public DiscourseEntity getDiscourseEntity(Cf cf)
Retrieves the Discourse Entity of which cf is a realization. If there is no such Discourse Entity, it returns null.

Specified by:
getDiscourseEntity in interface DiscourseModel
Parameters:
cf - The Cf for which a DE is to be retrieved
Returns:
DiscourseEntity The Discourse Entity corresponding to cf

getTimesMentioned

public int getTimesMentioned(DiscourseEntity de)
Returns the number of mentions of a given Discourse Entity, that is, the number of Cfs in its equivalence class.

Specified by:
getTimesMentioned in interface DiscourseModel
Parameters:
de - The Discourse Entity
Returns:
int The number of mentions

getEquivalenceClassIndex

public int getEquivalenceClassIndex(Object refId)
Retrieves the index of the equivalence class to which refId belongs. If there is no matching class, it returns -1.

Specified by:
getEquivalenceClassIndex in interface DiscourseModel
Parameters:
refId - The id of a referential expression
Returns:
int The index of the equivalence class to which refId belongs
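
The null and -1 conventions of the equivalence-class lookups above can be illustrated with a minimal, self-contained sketch. It is not GuiTAR's code: the class name EquivalenceClassLookupSketch, the helper addClass and the use of generics are assumptions made for illustration. Only the pairing of a Vector of classes with a Map of indices is suggested by the field names vecEquivalenceClasses and mapEquivalenceClasses.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.Vector;

// Hypothetical stand-in for illustration; not the GuiTAR class.
class EquivalenceClassLookupSketch {
    // The position of a class in this vector serves as its class index.
    private final Vector<Set<Object>> classes = new Vector<>();
    // Maps every refId to the index of the class it belongs to.
    private final Map<Object, Integer> indexOf = new HashMap<>();

    // Hypothetical helper: registers a new class and returns its index.
    int addClass(List<?> refIds) {
        int index = classes.size();
        classes.add(new LinkedHashSet<Object>(refIds));
        for (Object id : refIds) {
            indexOf.put(id, index);
        }
        return index;
    }

    // Returns -1 when refId belongs to no class.
    int getEquivalenceClassIndex(Object refId) {
        Integer index = indexOf.get(refId);
        return index == null ? -1 : index;
    }

    // Returns null for an invalid class index.
    Set<Object> getEquivalenceClass(int classIndex) {
        if (classIndex < 0 || classIndex >= classes.size()) {
            return null;
        }
        return classes.get(classIndex);
    }

    // Returns null when refId belongs to no class.
    Set<Object> getEquivalenceClass(Object refId) {
        return getEquivalenceClass(getEquivalenceClassIndex(refId));
    }

    public static void main(String[] args) {
        EquivalenceClassLookupSketch dm = new EquivalenceClassLookupSketch();
        dm.addClass(Arrays.asList("ne1", "ne4"));
        System.out.println(dm.getEquivalenceClass("ne4"));
        System.out.println(dm.getEquivalenceClassIndex("ne9"));
    }
}
```

Callers of the two-step lookup should test for null or -1 before using the result, since an unknown refId is not reported through an exception in this sketch.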

getAntecedent

public Object getAntecedent(Object anaphorId)
Retrieves the refId of the antecedent of anaphorId.

Specified by:
getAntecedent in interface DiscourseModel
Parameters:
anaphorId - The id of a referential expression used anaphorically
Returns:
Object The id of the expression which is coreferential with the one identified by anaphorId

getCfs

public Vector getCfs(Utterance utt)
Returns the forward-looking centers within the given utterance.

Specified by:
getCfs in interface DiscourseModel
Parameters:
utt - The Utterance
Returns:
Vector The list of Cfs within the utterance

getNextUtterance

public Utterance getNextUtterance(Utterance uttRef)
Returns the utterance following the reference utterance in this Discourse Model.

Specified by:
getNextUtterance in interface DiscourseModel
Parameters:
uttRef - The reference utterance
Returns:
Utterance The next utterance

getPrevUtterance

public Utterance getPrevUtterance(Utterance uttRef)
Returns the utterance preceding the reference utterance in this Discourse Model.

Specified by:
getPrevUtterance in interface DiscourseModel
Parameters:
uttRef - The reference utterance
Returns:
Utterance The previous utterance

getDistance

public int getDistance(Object refId1,
                       Object refId2,
                       String tagName)
Returns the distance between the referential expression identified by refId1 and the one identified by refId2. The distance can be measured as the number of intermediate words (tagName=W), the number of intermediate NEs (tagName=ne), or the number of intermediate utterances.

Specified by:
getDistance in interface DiscourseModel
Parameters:
refId1 - The id of a referential expression
refId2 - The id of a referential expression
tagName - The tag name of the nodes to be accounted for between refId1 and refId2
Returns:
int The distance between the anaphor and the antecedent
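
One way to read "number of intermediate elements" is to count the elements with the given tag name that occur, in document order, between the two identified expressions. The sketch below takes that reading. It is an illustration under assumptions, not GuiTAR's implementation: the class DistanceSketch, the id attribute name, the -1 result for unknown ids and the decision to skip elements nested inside the earlier expression are all hypothetical choices.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-in for illustration; not the GuiTAR class.
class DistanceSketch {
    // Parses an XML string into a DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // True if node lies somewhere below ancestor in the tree.
    static boolean isInside(Node node, Node ancestor) {
        for (Node p = node.getParentNode(); p != null; p = p.getParentNode()) {
            if (p.isSameNode(ancestor)) {
                return true;
            }
        }
        return false;
    }

    // Counts tagName elements between the two expressions in document order.
    // Assumption: ids live in an "id" attribute; -1 signals an unknown id.
    static int getDistance(Document doc, String refId1, String refId2, String tagName) {
        NodeList all = doc.getElementsByTagName("*"); // all elements, pre-order
        int first = -1;
        int second = -1;
        for (int i = 0; i < all.getLength(); i++) {
            String id = ((Element) all.item(i)).getAttribute("id");
            if (id.equals(refId1)) first = i;
            if (id.equals(refId2)) second = i;
        }
        if (first < 0 || second < 0) {
            return -1;
        }
        int lo = Math.min(first, second);
        int hi = Math.max(first, second);
        Node earlier = all.item(lo);
        int count = 0;
        for (int i = lo + 1; i < hi; i++) {
            Node n = all.item(i);
            // Skip elements that belong to the earlier expression itself.
            if (n.getNodeName().equals(tagName) && !isInside(n, earlier)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        Document doc = parse("<s><ne id=\"ne1\"><W>John</W></ne><W>saw</W>"
                + "<ne id=\"ne2\"><W>him</W></ne></s>");
        System.out.println(getDistance(doc, "ne1", "ne2", "W"));
    }
}
```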

getNumberOfEntities

public int getNumberOfEntities()
Returns the number of entities contained in this discourse model.

Specified by:
getNumberOfEntities in interface DiscourseModel
Returns:
int the number of entities in the discourse model

getNumberOfAnaphoricReferences

public int getNumberOfAnaphoricReferences()
Returns the number of (anaphoric) referential expressions in the discourse model.

Specified by:
getNumberOfAnaphoricReferences in interface DiscourseModel
Returns:
int the number of anaphoric references

getSetOfAnaphors

public Set getSetOfAnaphors()
Returns the set of anaphors stored in this discourse model.

Specified by:
getSetOfAnaphors in interface DiscourseModel
Returns:
Set The set of anaphors

getVectors

public Vector getVectors(Object refId)
Returns a vector of vectors in which this referential expression features. (Every vector of those vectors matches a specific corpus NE)

Specified by:
getVectors in interface DiscourseModel
Parameters:
refId - The id of a referential expression
Returns:
Vector The vector of vectors in which this referential expression features

getAnaphorToAntecedentMap

public Map getAnaphorToAntecedentMap()
Returns the data structure (Map) that holds the anaphor-to-antecedent mappings.

Specified by:
getAnaphorToAntecedentMap in interface DiscourseModel
Returns:
Map The anaphor-to-antecedent mappings

getEquivalenceClassString

public String getEquivalenceClassString(Object refId)
Returns all the members of the corresponding equivalence class separated by commas.

Specified by:
getEquivalenceClassString in interface DiscourseModel
Returns:
String Co-referential expressions separated by commas

getEquivalenceClassString

public String getEquivalenceClassString(int index)
Returns all the members of the corresponding equivalence class separated by commas.

Specified by:
getEquivalenceClassString in interface DiscourseModel
Parameters:
index - The index of the equivalence class to be retrieved

printEquivalenceClassesStatistics

public void printEquivalenceClassesStatistics(DiscourseModel dm)
Computes P/R per equivalence class on the basis of class intersection and with reference to this discourse model. Equivalence classes which did not match any of the corpus classes are printed off at the end.

Specified by:
printEquivalenceClassesStatistics in interface DiscourseModel
Parameters:
dm - The Discourse Model with which this discourse model will be intersected
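
The documentation does not spell out the formula, so the following is only one plausible reading of "P/R on the basis of class intersection": precision is the share of a response class that also appears in the matching key class, and recall is the share of the key class that was recovered. The class ClassOverlapSketch and the method precisionRecall are hypothetical names used for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for illustration; not the GuiTAR class.
class ClassOverlapSketch {
    // Returns {precision, recall} for one response class against one key class.
    // Assumed definitions: P = |R ∩ K| / |R|, R = |R ∩ K| / |K|.
    static double[] precisionRecall(Set<?> response, Set<?> key) {
        Set<Object> common = new HashSet<Object>(response);
        common.retainAll(key); // set intersection
        double precision = response.isEmpty() ? 0.0 : (double) common.size() / response.size();
        double recall = key.isEmpty() ? 0.0 : (double) common.size() / key.size();
        return new double[] {precision, recall};
    }

    public static void main(String[] args) {
        Set<String> response = new HashSet<>(Arrays.asList("ne1", "ne2", "ne3"));
        Set<String> key = new HashSet<>(Arrays.asList("ne2", "ne3", "ne4", "ne5"));
        double[] pr = precisionRecall(response, key);
        System.out.println("P=" + pr[0] + " R=" + pr[1]);
    }
}
```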

toString

public String toString()
Converts the Discourse Model contained in this object into a String.

Specified by:
toString in interface DiscourseModel
Returns:
String A String representation of this Discourse Model

getRootSegment

public Segment getRootSegment()
Returns the root segment of the physical Discourse Model.

Specified by:
getRootSegment in interface DiscourseModel
Returns:
Segment The root segment

getFlatDM

public DiscourseModel getFlatDM()
Returns a flat Discourse Model built from the annotation. (Used by the gold standard algorithm)

Specified by:
getFlatDM in interface DiscourseModel
Returns:
DiscourseModel The flat DM

getDOMDocument

public Document getDOMDocument()
Returns the pointer to the DOM Document of this DiscourseModel. (Used by the gold standard algorithm)

Specified by:
getDOMDocument in interface DiscourseModel
Returns:
Document The DOM Document

setAnaphoricLink

public void setAnaphoricLink(Object anaphorId,
                             Object anteId,
                             String relation)
Inserts a new anaphoric link into the discourse model. In the flat version of the discourse model, which is constructed out of the annotation, anaphorId and anteId are usually Strings (e.g. ne237), whereas in the full version of the discourse model, which is to be constructed incrementally, they are object references of type Cf.

Specified by:
setAnaphoricLink in interface DiscourseModel
Parameters:
anaphorId - The id of the anaphor
anteId - The id of the antecedent of the anaphor
relation - The type of relation that holds between the anaphor and the antecedent (ident, poss-inv, etc.)

setAnaphoricLink

public void setAnaphoricLink(Object anaphorId,
                             Object anteId)
Inserts a new anaphoric link into the discourse model. This is a default version which assumes an "ident" relationship between the anaphor and the antecedent.

Specified by:
setAnaphoricLink in interface DiscourseModel
Parameters:
anaphorId - The id of the anaphor
anteId - The id of the antecedent of the anaphor
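
The relationship between the two overloads can be sketched as a simple delegation in which the two-argument version supplies "ident" as the relation. This is an illustration only: the class AnaphoricLinkSketch, the relation map and the getRelation accessor are hypothetical, and the sketch leaves out any equivalence-class bookkeeping. Because the keys are plain Objects, the same code accepts String ids or object references.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for illustration; not the GuiTAR class.
class AnaphoricLinkSketch {
    static final String DEFAULT_RELATION = "ident";

    private final Map<Object, Object> anaphorToAntecedent = new HashMap<>();
    // Hypothetical: the source does not say where the relation is stored.
    private final Map<Object, String> anaphorToRelation = new HashMap<>();

    // Full version: records the antecedent and the relation.
    void setAnaphoricLink(Object anaphorId, Object anteId, String relation) {
        anaphorToAntecedent.put(anaphorId, anteId);
        anaphorToRelation.put(anaphorId, relation);
    }

    // Default version: assumes an "ident" relation.
    void setAnaphoricLink(Object anaphorId, Object anteId) {
        setAnaphoricLink(anaphorId, anteId, DEFAULT_RELATION);
    }

    boolean isAnaphoric(Object refId) {
        return anaphorToAntecedent.containsKey(refId);
    }

    Object getAntecedent(Object anaphorId) {
        return anaphorToAntecedent.get(anaphorId);
    }

    // Hypothetical accessor, added only to make the default visible.
    String getRelation(Object anaphorId) {
        return anaphorToRelation.get(anaphorId);
    }

    int getNumberOfAnaphoricReferences() {
        return anaphorToAntecedent.size();
    }

    public static void main(String[] args) {
        AnaphoricLinkSketch dm = new AnaphoricLinkSketch();
        dm.setAnaphoricLink("ne237", "ne12");
        System.out.println(dm.getAntecedent("ne237") + " via " + dm.getRelation("ne237"));
    }
}
```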

createDiscourseEntity

public void createDiscourseEntity(Cf cf)
Creates a new Discourse Entity for the given Cf.

Parameters:
cf - The Cf

setFlatDM

public void setFlatDM(DiscourseModel dm)
Sets the pointer to a flat Discourse Model built from the annotation. (Used by the gold standard algorithm)

Specified by:
setFlatDM in interface DiscourseModel
Parameters:
dm - The flat DM

constructEquivalenceClasses

private void constructEquivalenceClasses(Document document,
                                         String anteTag,
                                         String catAttributeName,
                                         Vector typeOfNP)
Builds the Discourse Model out of the annotation provided in the XML file. This version of the method is tailored for a specific type of NP, which is given as a parameter.

Parameters:
document - The DOM document containing the annotation
anteTag - The tag name of the nodes containing the annotation (e.g. ante)
catAttributeName - The attribute name which stores the type of NP (e.g. cat, AAcat)
typeOfNP - The type(s) of NP to be considered (the-np, pers-pro, etc.)

constructEquivalenceClasses

private void constructEquivalenceClasses(Document documentCorpus,
                                         Document documentParsed,
                                         String anteTag,
                                         String catAttributeName,
                                         Vector typeOfNP)
Builds the Discourse Model out of the annotation provided in the XML file. This version of the method is tailored for a specific type of NP, which is given as a parameter. Additionally a translation from the neIds in the corpus file to the neIds in the parsed file is performed.

Parameters:
documentCorpus - The DOM document containing the Corpus annotation
documentParsed - The DOM document containing the automatically generated annotation
anteTag - The tag name of the nodes containing the annotation (e.g. ante)
catAttributeName - The attribute name which stores the type of NP (e.g. cat, AAcat)
typeOfNP - The type(s) of NP to be considered (the-np, pers-pro, etc.)

updateNeToVectorsMap

private void updateNeToVectorsMap(Vector vec)
Updates the mapping of ne-to-(vector of vectors of nes) with a new vector of keys. Every member of vec will be mapped to vec. The intuition is that a given ne from the corpus may have been mapped to more than one set (vector) of nes from the parsed file.

Parameters:
vec - The new vector of keys (i.e. neIds)
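
The "every member of vec is mapped to vec" behaviour can be sketched as follows. The class NeToVectorsSketch and its generic types are assumptions for illustration; the map simply accumulates, for each neId, every vector in which that neId occurs.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Vector;

// Hypothetical stand-in for illustration; not the GuiTAR class.
class NeToVectorsSketch {
    // neId -> all vectors of neIds in which that neId features.
    private final Map<Object, Vector<Vector<Object>>> neToVectors = new HashMap<>();

    // Maps every member of vec to vec itself.
    void updateNeToVectorsMap(Vector<Object> vec) {
        for (Object neId : vec) {
            Vector<Vector<Object>> vectors = neToVectors.get(neId);
            if (vectors == null) {
                vectors = new Vector<>();
                neToVectors.put(neId, vectors);
            }
            vectors.add(vec);
        }
    }

    // Returns null if the neId has not been seen in any vector.
    Vector<Vector<Object>> getVectors(Object refId) {
        return neToVectors.get(refId);
    }

    public static void main(String[] args) {
        NeToVectorsSketch sketch = new NeToVectorsSketch();
        sketch.updateNeToVectorsMap(new Vector<Object>(Arrays.asList("ne1", "ne2")));
        sketch.updateNeToVectorsMap(new Vector<Object>(Arrays.asList("ne2", "ne3")));
        System.out.println(sketch.getVectors("ne2"));
    }
}
```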

constructEquivalenceClasses

private void constructEquivalenceClasses(Document document,
                                         String anteTag)
Builds the Discourse Model out of the annotation provided in the XML file. This version of the method is generic in the sense that it takes into account anaphors of all types of NP.

Parameters:
document - The DOM document containing the annotation
anteTag - The tag name of the nodes containing the annotation (e.g. ante)

isAnaphorOfType

private boolean isAnaphorOfType(Document doc,
                                String anaphorNeId,
                                Vector typeOfNP,
                                String catAttributeName)
Checks whether a given neId is of a certain type or set of types. Returns true if the neId is of the desired type, and false otherwise. For efficiency, an ne-to-node map is constructed the first time this method is called.

Parameters:
anaphorNeId - The id of the ne whose type will be checked
typeOfNP - The type(s) of NP to be considered (the-np, pers-pro, etc.)
catAttributeName - The attribute name which stores the type of NP (e.g. cat, AAcat)
Returns:
boolean True if the type corresponding to anaphorNeId is one of typeOfNP, false otherwise
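
The lazily built ne-to-node map mentioned above can be sketched like this. The sketch is hedged throughout: the class TypeCheckSketch, the ne element name used for lookup and the id attribute name are assumptions, and the real method may build or key its map differently.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-in for illustration; not the GuiTAR class.
class TypeCheckSketch {
    // Built on the first call only, then reused.
    private Map<String, Element> idToNode;

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    boolean isAnaphorOfType(Document doc, String anaphorNeId,
                            List<String> typeOfNP, String catAttributeName) {
        if (idToNode == null) {
            idToNode = new HashMap<>();
            // Assumption: expressions are <ne> elements carrying an "id" attribute.
            NodeList nes = doc.getElementsByTagName("ne");
            for (int i = 0; i < nes.getLength(); i++) {
                Element ne = (Element) nes.item(i);
                idToNode.put(ne.getAttribute("id"), ne);
            }
        }
        Element ne = idToNode.get(anaphorNeId);
        return ne != null && typeOfNP.contains(ne.getAttribute(catAttributeName));
    }

    public static void main(String[] args) {
        Document doc = parse("<doc><ne id=\"ne1\" cat=\"the-np\"/>"
                + "<ne id=\"ne2\" cat=\"pers-pro\"/></doc>");
        TypeCheckSketch sketch = new TypeCheckSketch();
        System.out.println(sketch.isAnaphorOfType(doc, "ne2", Arrays.asList("pers-pro"), "cat"));
    }
}
```

Caching the map in a field means the sketch is tied to the first document it sees; a version serving several documents would need one map per document.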

processFile

public void processFile(String inputFileName)
Processes a file in XML-in XML-out fashion, identifying and annotating anaphoric links. This is the main method for incrementally constructing a Discourse Model. Essentially it does the following:
1. Opens an XML file compliant with MAS-XML.
2. Extracts all the NPs (NEs) in pre-order and processes them sequentially, one at a time, with processCf().
3. Stores resolved anaphors in the Discourse Model and annotates them directly in the DOM tree.
4. Finally, writes the DOM tree back to an XML file.
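
The XML-in XML-out flow described above can be sketched with the standard JAXP APIs. This is a simplified illustration, not GuiTAR's code: it works on strings rather than files to stay self-contained, the class XmlInXmlOutSketch is hypothetical, the ne element name is assumed, and the per-NP work is passed in as a callback that stands in for processCf().

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical stand-in for illustration; not the GuiTAR class.
class XmlInXmlOutSketch {
    static String process(String xmlIn, Consumer<Element> processCf) {
        try {
            // 1. Parse the input into a DOM tree.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xmlIn)));
            // 2. Collect the <ne> elements; getElementsByTagName yields pre-order.
            NodeList nes = doc.getElementsByTagName("ne");
            // Snapshot first: the NodeList is live and the callback may edit the tree.
            List<Element> snapshot = new ArrayList<>();
            for (int i = 0; i < nes.getLength(); i++) {
                snapshot.add((Element) nes.item(i));
            }
            // 3. Process one element at a time; annotation happens in the callback.
            for (Element ne : snapshot) {
                processCf.accept(ne);
            }
            // 4. Serialise the (possibly annotated) tree back to XML.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String in = "<s><ne id=\"ne1\"><ne id=\"ne2\"/></ne><ne id=\"ne3\"/></s>";
        // The "seen" attribute is a made-up marker for this demo.
        System.out.println(process(in, ne -> ne.setAttribute("seen", "yes")));
    }
}
```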


processCf

private void processCf(Node theNode)
Does the following three things:
1. Updates the physical Discourse Model by processing a Cf held in a DOM Node and creating a corresponding Cf object for it.
2. Resolves anaphoric links (if any).
3. Updates the logical Discourse Model and the DOM tree accordingly.

Parameters:
theNode - The DOM node holding a Cf to be processed

createCf

private Cf createCf(Node theNode,
                    Utterance utt)
Creates an appropriate Cf object for a given syntactic phrase held in a DOM Node.

Parameters:
theNode - The DOM node that holds the syntactic phrase
Returns:
Cf The Cf object appropriate for the syntactic phrase

createUtterance

private Utterance createUtterance(Node theNode)
Creates an appropriate Utterance object for a given Cf held in a DOM Node. It also updates the physical Discourse Model (segments).

Parameters:
theNode - The DOM node that holds the Cf
Returns:
Utterance The Utterance of the Cf held in theNode

updateLogicalDiscourseModel

private void updateLogicalDiscourseModel(Cf anaphor,
                                         Cf antecedent)
Updates the logical part of the ongoing DiscourseModel. This involves updating three data structures:
1. The Cf-to-EntityId map.
2. The Vector of Entities (an EntityId is the index at which the Entity resides in this Vector).
3. The Cf_anaphor-to-Cf_antecedent map.

Parameters:
anaphor - The Cf holding the anaphor
antecedent - The Cf holding the antecedent
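
The three structures listed above can be sketched as follows. This is a minimal illustration under assumptions: the class LogicalModelSketch is hypothetical, plain Objects stand in for Cf instances, an entity is represented simply as the set of its mentions, and the anaphor is assumed not to belong to any entity yet.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.Vector;

// Hypothetical stand-in for illustration; not the GuiTAR class.
class LogicalModelSketch {
    // 1. Cf -> EntityId
    final Map<Object, Integer> cfToEntityId = new HashMap<>();
    // 2. Entities; an EntityId is the index of the entity in this vector.
    final Vector<Set<Object>> entities = new Vector<>();
    // 3. Anaphor Cf -> antecedent Cf
    final Map<Object, Object> anaphorToAntecedent = new HashMap<>();

    // Creates a new single-mention entity and returns its id.
    int createDiscourseEntity(Object cf) {
        Set<Object> mentions = new LinkedHashSet<>();
        mentions.add(cf);
        entities.add(mentions);
        int id = entities.size() - 1;
        cfToEntityId.put(cf, id);
        return id;
    }

    // Adds the anaphor to the antecedent's entity and records the link.
    void updateLogicalDiscourseModel(Object anaphor, Object antecedent) {
        Integer id = cfToEntityId.get(antecedent);
        if (id == null) {
            id = createDiscourseEntity(antecedent);
        }
        entities.get(id).add(anaphor);
        cfToEntityId.put(anaphor, id);
        anaphorToAntecedent.put(anaphor, antecedent);
    }

    // Number of mentions of the entity with the given id.
    int getTimesMentioned(int entityId) {
        return entities.get(entityId).size();
    }

    public static void main(String[] args) {
        LogicalModelSketch dm = new LogicalModelSketch();
        int john = dm.createDiscourseEntity("John");
        dm.updateLogicalDiscourseModel("he", "John");
        dm.updateLogicalDiscourseModel("him", "he");
        System.out.println(dm.getTimesMentioned(john));
    }
}
```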

createAnteNode

private Node createAnteNode(Node current,
                            Node antecedent,
                            String relation)
Creates the ante node to be attached to the DOM tree.

Parameters:
current - The node of the current Cf
antecedent - The node of the antecedent
relation - The relation between current Cf and antecedent
Returns:
Node The DOM node that holds the anaphoric information

findUtteranceNode

public static Node findUtteranceNode(Node node)
Finds the DOM node corresponding to the sentence in which a node holding a Cf is located. Starts from node and goes up the tree until the sentence is found. If no sentence is found, the parent of the topmost UNIT element is returned.

Parameters:
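node - The DOM node holding a Cf
Returns:
Node The DOM node of the enclosing sentence, or the parent of the topmost UNIT element if no sentence is found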
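
The upward walk described above can be sketched as a loop over parent nodes. The element names are assumptions: "s" for a sentence and "unit" for a UNIT element are hypothetical spellings, the class UtteranceLookupSketch is made up for illustration, and returning null when neither is found is a choice of this sketch rather than documented behaviour.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical stand-in for illustration; not the GuiTAR class.
class UtteranceLookupSketch {
    static final String SENTENCE_TAG = "s"; // assumed element name
    static final String UNIT_TAG = "unit";  // assumed element name

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    static Node findUtteranceNode(Node node) {
        Node topmostUnit = null;
        // Climb from the starting node towards the document root.
        for (Node cur = node; cur != null; cur = cur.getParentNode()) {
            if (SENTENCE_TAG.equals(cur.getNodeName())) {
                return cur; // enclosing sentence found
            }
            if (UNIT_TAG.equals(cur.getNodeName())) {
                topmostUnit = cur; // remember the highest unit seen so far
            }
        }
        // Fallback: parent of the topmost unit, or null if there was none.
        return topmostUnit == null ? null : topmostUnit.getParentNode();
    }

    public static void main(String[] args) {
        Document doc = parse("<doc><s><ne id=\"ne1\"/></s></doc>");
        Node ne = doc.getElementsByTagName("ne").item(0);
        System.out.println(findUtteranceNode(ne).getNodeName());
    }
}
```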

findSegmentNodes

private Vector findSegmentNodes(Node utteranceNode)
Finds the DOM nodes corresponding to the path of segments, starting from the utterance node given as a parameter and continuing to the root Segment of the document that is being processed.

Parameters:
utteranceNode - The utterance node
Returns:
Vector The segment nodes above the utterance node