uk.ac.essex.malexa.nlp.dp.GuiTAR.txtToXML
Class XMLTokeniser

java.lang.Object
  extended by uk.ac.essex.malexa.nlp.dp.GuiTAR.txtToXML.XMLTokeniser

public class XMLTokeniser
extends Object

A class that encapsulates the functionality to convert a file processed by the TxtToXMLPipeline into a fully annotated XML file (i.e. each word has a separate XML tag).

Version:
1.0
Author:
Mijail A. Kabadjov

Field Summary
private  File iFile
           
private  int neID
           
private  int neVeID
           
private  Node rootDOMTreeSource
           
private  Node rootDOMTreeTarget
           
private  int veID
           
private  String XML_TEMPLATE_FILE
          PARAMETERS DEFINED AS CONSTANTS
 
Constructor Summary
XMLTokeniser(String inFileName)
          The constructor of the class.
 
Method Summary
static void main(String[] args)
           
private  String preTokenise(String str)
          Inserts an extra blank space before each of the characters {", ', ., ?, !}.
 void processFile()
          Takes the reference to the source XML document, loaded from a file processed by the TxtToXMLPipeline, then retrieves all the sentences from it and passes them on, one at a time, to the method processSentence().
private  Vector processLine(String line)
          NEW Version.
private  Vector processLine(String line, String dummy)
          OLD Version - In order to use it delete second parameter.
private  Node processSentence(Node sentence, Document document, int utteranceId)
          A wrapper method that calls the recursive method tagSentence().
private  Node tagSentence(Node node, Document document)
          Receives a partially annotated sentence (only ne's and ve's have been marked) whose text has previously been tokenised by ltchunk (i.e. word_POS).
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

XML_TEMPLATE_FILE

private final String XML_TEMPLATE_FILE
PARAMETERS DEFINED AS CONSTANTS

See Also:
Constant Field Values

rootDOMTreeSource

private Node rootDOMTreeSource

rootDOMTreeTarget

private Node rootDOMTreeTarget

iFile

private File iFile

neVeID

private int neVeID

neID

private int neID

veID

private int veID

Constructor Detail

XMLTokeniser

public XMLTokeniser(String inFileName)
The constructor of the class. Loads two XML files: one partially annotated by the TxtToXMLPipeline, and a template from which the resulting XML document will be instantiated.
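
As an illustration of the loading step, the following is a minimal sketch of reading an XML file into a DOM tree with the standard JAXP API. The class name LoadDomSketch and the method loadRoot are hypothetical and do not appear in XMLTokeniser; the actual constructor body is not shown in this document.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper, not part of GuiTAR: shows one way to obtain a root
// node such as rootDOMTreeSource or rootDOMTreeTarget from a file.
final class LoadDomSketch {

    // Parses the given XML file and returns its document element.
    static Node loadRoot(File xmlFile) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(xmlFile);
            return doc.getDocumentElement();
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers need no checked exceptions.
            throw new IllegalStateException("Could not load " + xmlFile, e);
        }
    }
}
```

Under this assumption the constructor would call such a helper twice, once for the file named by inFileName and once for the template file.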

Method Detail

processFile

public void processFile()
Takes the reference to the source XML document, loaded from a file processed by the TxtToXMLPipeline, retrieves all the sentences from it and passes them, one at a time, to the method processSentence(). It also passes the reference to the target XML document, loaded from the file Template.xml, to that method.
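
The sentence loop described above can be sketched roughly as follows. This is an assumption-laden illustration, not the original code: the element name "s" for sentences, the "id" attribute, and the names ProcessFileSketch, parse and copySentences are all hypothetical, and a plain deep copy stands in for the call to processSentence().

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch of the loop in processFile(); not the GuiTAR source.
final class ProcessFileSketch {

    // Assumed element name for a sentence in the source document.
    static final String SENTENCE_TAG = "s";

    // Convenience parser so the sketch can be tried without files on disk.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Visits every sentence in source, one at a time, and appends a copy
    // to the root of target. Returns the number of sentences handled.
    static int copySentences(Document source, Document target) {
        NodeList sentences = source.getElementsByTagName(SENTENCE_TAG);
        int utteranceId = 0;
        for (int i = 0; i < sentences.getLength(); i++) {
            utteranceId++;
            // Stand-in for processSentence(sentence, target, utteranceId):
            // here the sentence is simply deep-copied into the target document.
            Element copy = (Element) target.importNode(sentences.item(i), true);
            copy.setAttribute("id", "u" + utteranceId);
            target.getDocumentElement().appendChild(copy);
        }
        return utteranceId;
    }
}
```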


processSentence

private Node processSentence(Node sentence,
                             Document document,
                             int utteranceId)
A wrapper method that calls the recursive method tagSentence().

Parameters:
sentence - The node to be post-tokenised (partially annotated sentence)
document - The reference to the target document where the new node will be appended
utteranceId - The global index of the utterance (sentence)
Returns:
Node The node representing a fully annotated sentence

tagSentence

private Node tagSentence(Node node,
                         Document document)
Receives a partially annotated sentence (only ne's and ve's have been marked) whose text has previously been tokenised by ltchunk (i.e. word_POS). Returns a fully annotated sentence (i.e. every word enclosed in its own XML tag).

Parameters:
node - The node to be post-tokenised
document - The reference to the target document where the new node will be appended
Returns:
Node The node with all its children appended correspondingly
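
To make the recursion concrete, here is a minimal sketch of how such a method could walk a partially annotated sentence. It is not the original implementation: the word element name "W", the "pos" attribute, whitespace as the token separator, and the names TagSentenceSketch and parse are assumptions made for illustration only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch of a recursive tagger; not the GuiTAR source.
final class TagSentenceSketch {

    // Convenience parser so the sketch can be tried without files on disk.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Rebuilds node inside document: nested elements (e.g. ne, ve) are
    // recreated and recursed into, and every word_POS item found in a text
    // node becomes its own <W pos="..."> element (assumed names).
    static Node tagSentence(Node node, Document document) {
        Element result = document.createElement(node.getNodeName());
        NamedNodeMap attrs = node.getAttributes();
        for (int a = 0; attrs != null && a < attrs.getLength(); a++) {
            result.setAttribute(attrs.item(a).getNodeName(), attrs.item(a).getNodeValue());
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                for (String item : child.getNodeValue().trim().split("\\s+")) {
                    if (item.isEmpty()) continue; // whitespace-only text node
                    int cut = item.lastIndexOf('_');
                    Element word = document.createElement("W");
                    word.setAttribute("pos", cut > 0 ? item.substring(cut + 1) : "");
                    word.setTextContent(cut > 0 ? item.substring(0, cut) : item);
                    result.appendChild(word);
                }
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                result.appendChild(tagSentence(child, document));
            }
        }
        return result;
    }
}
```

The returned node belongs to the target document but is not yet attached to it; in this sketch, appending it is left to the caller.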

processLine

private Vector processLine(String line)
NEW Version. Processes a line of text that has been processed by ltchunk (POS tags are appended to words with an underscore). Returns a vector containing two vectors: v1, the tokens read, and v2, the POS tags corresponding to the tokens in v1.

Parameters:
line - The line to be post-tokenised (after ltchunk)
Returns:
Vector the Vector containing v1, v2
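
The return structure described above can be sketched as follows. This is only an illustration under stated assumptions: items are taken to be separated by whitespace, the tag is taken to follow the last underscore, and the class name ProcessLineSketch is hypothetical. The original method returns a raw Vector; generics are used here for readability.

```java
import java.util.Vector;

// Hypothetical sketch of splitting word_POS items; not the GuiTAR source.
final class ProcessLineSketch {

    // Returns a vector holding v1 (tokens) at index 0 and v2 (POS tags)
    // at index 1, with v2.get(i) being the tag of v1.get(i).
    static Vector<Vector<String>> processLine(String line) {
        Vector<String> tokens = new Vector<>();
        Vector<String> tags = new Vector<>();
        for (String item : line.trim().split("\\s+")) {
            if (item.isEmpty()) continue; // blank input line
            // Split at the last underscore so words containing '_' stay whole.
            int cut = item.lastIndexOf('_');
            if (cut > 0) {
                tokens.add(item.substring(0, cut));
                tags.add(item.substring(cut + 1));
            } else {
                // No usable tag found: keep the item and record an empty tag.
                tokens.add(item);
                tags.add("");
            }
        }
        Vector<Vector<String>> result = new Vector<>();
        result.add(tokens);
        result.add(tags);
        return result;
    }
}
```

For the input "The_DT cat_NN" this sketch yields the tokens [The, cat] and the tags [DT, NN].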

preTokenise

private String preTokenise(String str)
Inserts an extra blank space before each of the characters {", ', ., ?, !}. (The commas in the list only separate the characters.)

Parameters:
str - The String of characters to be processed
Returns:
String The String received as a parameter with additional blank spaces before the aforementioned characters
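
A minimal sketch of the behaviour described, assuming a space is inserted before every occurrence of the listed characters regardless of context. The class name PreTokeniseSketch is hypothetical, and the original method body is not shown in this document.

```java
// Hypothetical sketch of the pre-tokenisation step; not the GuiTAR source.
final class PreTokeniseSketch {

    // Characters that receive a preceding blank: " ' . ? !
    private static final String MARKS = "\"'.?!";

    // Returns str with one blank space inserted before each listed character.
    static String preTokenise(String str) {
        StringBuilder out = new StringBuilder(str.length() + 8);
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (MARKS.indexOf(c) >= 0) {
                out.append(' ');
            }
            out.append(c);
        }
        return out.toString();
    }
}
```

With this sketch, "Is it done?" becomes "Is it done ?".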

processLine

private Vector processLine(String line,
                           String dummy)
OLD Version. To use it, delete the second parameter. Processes a line of text that has been processed by ltchunk (POS tags are appended to words with an underscore). Returns a vector containing two vectors: v1, the tokens read, and v2, the POS tags corresponding to the tokens in v1.

Parameters:
line - The line to be post-tokenised (after ltchunk)
dummy - Unused; it only distinguishes this old version from the new processLine(String)
Returns:
Vector the Vector containing v1, v2

main

public static void main(String[] args)