Combining handcrafted and corpus-acquired Lexical Knowledge into a Morphosyntactic Tagger*

Giorgos Orphanos and Christos Tsalidis

Computer Engineering & Informatics Department

and Computer Technology Institute

University of Patras

26500 Rion, Patras, Greece

{georfan, tsalidis}


  1. Introduction
  2. The aim of this paper is to present a morphosyntactic tagger for Modern Greek (M. Greek); a highly-inflectional language. We will focus on: (a) The tagger itself, as a tool that integrates a computational lexicon and a disambiguator/guesser, and (b) The underlying infrastructure, comprising a formalism for the morphological definition of Greek words, a set of tools for the parsing/compilation of word-definitions into a computational lexicon, a text database and how it is used for the induction of decision trees that play the role of disambiguating/guessing devices.

    Towards the hard task of lexical acquisition, many attractive automatic or semi-automatic approaches have been proposed, provided that machine readable dictionaries, (annotated) corpora or existing lexical databases are available [Zernik, 1991; Daelemans, 1995; Boguraev and Pustejowsky, 1996]. On the other hand, one can find the well established, yet very expensive, manual approach supported by primitive or advanced lexicographic environments [Domenig and Ten Hacken, 1992; Wilks, 1996]. For the development of the M. Greek lexicon, we exploited an available spelling lexicon [Vagelatos et al., 1995]. First, we extended the formalism on which it was based beyond the description of inflectional morphology and stress, in order to be able to attach morphosyntactic information to word-forms [Tsalidis and Orphanos, 1995] and then we used it as an outline for the development of the morphosyntactic lexicon. After a year of hand-coding, the lexicon currently contains 880.474 word-forms (~70.000 lemmas) with full morphosyntactic information and covers ~97,5% of the words found in Greek running texts.

    Towards the resolution of morphosyntactic ambiguity and the unknown word guessing, we first recorded the ambiguity schemes (e.g., Verb-Noun, Article-Pronoun, Adjective-Adverb, etc.) present in M. Greek by extracting from the lexicon all forms with identical orthography but different morphosyntactic attributes [Gakis et al., 1998]. Subsequently, we set up a corpus, studied the idiosyncratic properties of each ambiguity scheme as well as the features of the unknown words and adopted the data-driven approach for ambiguity resolution proposed by [Daelemans et al., 1996]. By inducing a decision tree for each ambiguity scheme and a decision tree for unknown word classification, we achieved a promising overall performance of ~93,5%.

  3. Modern Greek Tagger
    1. Natural language issues

Our study targeted M. Greek, a rather difficult to describe computationally, yet attractive, natural language. In short, the inherent characteristics of M. Greek are:

The complexity of the M. Greek inflectional system disallows the application of automatic means (e.g., a stemmer/lemmatizer) for the extraction/bootstrapping of lexical knowledge. As a consequence, the construction of the M. Greek lexicon required lots of manual labor; i.e. explicit definition of lemma clusters containing inflected forms accompanied by their morphosyntactic attributes. On the other hand, the M. Greek inflectional system exhibits a narrower range of morphosyntactic ambiguity, compared to the corresponding ambiguity revealing in low inflected languages (e.g., in English almost every Noun can be a Verb), and is mainly attributed to the frequent occurrence of functional words, such as Articles and Pronouns (similar to the French le, la), almost all of which are ambiguous [Gakis et al., 1998].


    1. Overall architecture

Figure 1. Architecture of the Tagger

Figure 1 illustrates the functional components of the tagger and the order of processing. Raw text passes through the Tokenizer, where it is converted into a stream of tokens. Non-word tokens (e.g., punctuation marks, numbers, dates, etc.) are resolved by the Tokenizer and receive a tag corresponding to their category. Word tokens are looked-up in the Lexicon and those found receive one or more tags. Words with more than one tag and those not found in the Lexicon pass through the Disambiguator/Guesser, where the contextually appropriate tag is decided/guessed with the help of decision trees.

  1. Lexicon
    1. Word definition formalism

The morphological definition of Greek words is based on an extended version of the Greek Word Description Language (GWDL) [Vagelatos et al., 1995], a formalism used for the development of a spelling lexicon for M. Greek. GWDL comprised a set of rules that encode the declinable part of inflected words. It utilized 290 rules. Using GWDL, only the stem of each word followed by a rule (or a set of rules) representing possible endings, was stored in the lexicon. It must be mentioned here, that we had to deal with a lexicon of nearly 90.000 stems where for almost each stem, a set of rules was required.

Extended GWDL (EGWDL) extends the previous formalism in 4 points:

  1. The morphemes used for the formation of Greek words can incorporate sets of morphosyntactic or semantic attributes.
  2. A lemma can be defined as a cluster of word-form definitions.
  3. Two-dimensional information structures can accompany a lemma.
  4. Syntactic or semantic relations between lemmas can be defined as typed links that are resolved during lexicon construction

Using EGWDL we can declare:

%attributes = (NOUN "Ουσιαστικό" | ARTICLE "Άρθρο" | … ).

%infos = ( ANTONYM: 1 "Αντώνυμο" |

MEANING: 1 "Έννοια" | …


@MSCSING = (MASCULINE | SINGULAR) "αρσενικό, ενικός".

#MSCas = (ας @(NOMINATIVE) | α @GAV) @MSCSING "Ενικός αρσενικών σε -ας".

!a1=(1). -- Stress in final position.

!a2=(2). -- Stress in penultimate position.

!a6=(3). -- Stress in antepenultimate position.

$S1 = ( ο-τε-ρ #OSe !a6 @(MASCULINE) | -- ο-τε-ρ = Infix,

-- #OSe = inflectional rule

-- !a6 = stress rule

-- @(MASCULINE) = attribute

ο-τε-ρ #OSp !a6 @(MASCULINE) |

ο-τε-ρ #THIi !a6 |

ο-τε-ρ #ESWN !a6 @(FEMININE) |

ο-τε-ρ #OUDOe !a6 |

ο-τε-ρ #OUDOp !a6

) @EPSIGR -- @EPSIGR = attributes applied

-- to all morphemes

"συγκριτικός σε -ότερος -η -ο (αρχαιότερος)". -- description

Example of a coded lemma:

< -- start of lemma definition

[άμεσος] -- label of lemma

α-με-σ ($E2 | $E5 | $Eoud | $S1 | $Y1). -- words with stem "α-με-σ"

\ANTONYM{\W(έμμεσος)} -- info with link to lemma "έμμεσος"

\MEANING{"'άμεσος' σημαίνει ευθύς"} -- another info


The above definition produces the word-forms of the adjective "άμεσος" (= direct) with full morphosyntactic attributes: άμεσος, άμεσου, άμεσο, άμεσοι, άμεσων, άμεσους, άμεση, άμεσης, άμεσες, άμεσα, αμεσότερος, αμεσότερου, αμεσότερο, αμεσότεροι, αμεσότερων, αμεσότερους, αμεσότερη, αμεσότερης, αμεσότερες, αμεσότερα, αμεσότατος, αμεσότατου, αμεσότατο, αμεσότατοι, αμεσότατων, αμεσότατους, αμεσότατη, αμεσότατης, αμεσότατες, αμεσότατα. Also, supplies the meaning of "άμεσος" and declares that "άμεσος" is an antonym of "έμμεσος" (= indirect).


    1. Lexicon structure
    2. Our morphological lexicon contains ~70.000 lemmas. The possible word-forms produced from these lemmas have been calculated to exceed 870.000 (see Section 3.4, Table 1). The primary indexing/storage mechanism used to access the words of the lexicon is the Compressed Trie [Maly, 1976]. This data structure is the most appropriate for our purposes, since it enables efficient search and occupies less disk space. The Compressed Trie is used as an index to the database records of the lemmas. This structure is relatively small (about 700Kb) compared to the size of data needed to represent the entire lexicon. Thus, we can load a big part of it (or the whole, if the computer has enough memory) into main memory. The Compressed Trie contains the unique prefixes of all word-forms produced with the help of EGWDL. A path in the Trie, starting from the root and ending to a leaf, represents a word prefix; the leaf points to those lemma records in the database that contain the suffixes of word-forms with this prefix.

      The number of infixes and inflections encoded via EGWDL is very small in comparison to the number of words. For the ~90.000 prefixes of the Lexicon database, there are about 400 distinct infixes and 200 distinct inflections, which means that these infixes and inflections are used very frequently in order to cover the 90.000 prefixes. Consequently, it is more efficient to keep them in main memory stored in a Symbol Table. The actual lemma records are stored in a disk-file. We applied many packing techniques in order to compress data for each record. Using the Compressed Trie as an index to access the lemma records, we achieved approximately one disk access per search for data located on the disk.


    3. Supporting tools
    4. The development of the lexicon was a long and tedious process in which many difficulties were encountered. The main difficulty was to successfully attach the appropriate rules to each stem, so as to derive all possible word-forms accompanied by their morphosyntactic attributes, while avoiding the production of unacceptable forms, as well as avoiding attachment of incorrect attributes, redundancy or overlapping.

      In order to simplify this process, an environment was built consisting of the following tools:

      1. An editing, syntax checking/correction environment, consisting of a professional editor and a checker of the EGWDL formalism. This checker assisted in two ways: First it helped validate the syntax rules for each entry (whether or not the rules were syntactically correct). Second, having the ability to produce the inflected forms of each entry, the linguists were able to check the "correctness"/acceptability of each entry and also the completeness of the EGWDL. This way, additional rules were provided so as to result in a description language that would be complete.

      2. Mkdict (make dictionary), is a program which was developed in order to construct the dictionaries. It takes as input the lexicon files (which contain word definitions according to EGWDL) and the set of rules (declared according to EGWDL), and constructs the Compressed Trie index and the lexical database.


    5. Lexicon contents and POS ambiguity

Table 1 illustrates the distribution of lexicon contents over the POSs. For example, the lexicon contains 153.929 distinct word-forms that are Nouns; this number includes Nouns with POS ambiguity (e.g. Nouns that are also Verbs or Adjectives).


# of word-forms














Total unique orthographic forms


Word-forms with POS ambiguity


Table 1. Lexicon Contents

As shown in Table 1, word-forms with POS ambiguity sum up to 0,77% (= 6.773 / 873.701) over the lexicon contents. Does it worth the price to bother with such a low percentage? The truth is that this percentage does not reflect the POS ambiguity occurring in M. Greek texts; counting the number of word-forms with POS ambiguity in a corpus (see Table 4 in Section 5), their percentage came up to 20,85%. That is because a high rate of functional words (Articles, Pronouns, Prepositions) with POS ambiguity is present.

  1. Disambiguator/Guesser
  2. The Disambiguator/Guesser is a ‘forest’ of decision trees, one tree for each ambiguity scheme present in M. Greek and one tree for unknown word guessing. When a word with two or three tags appears, the Disambiguator identifies its ambiguity scheme and selects the corresponding decision tree, which is traversed according to the values of morphosyntactic features extracted from contextual tags. This traversal returns the contextually appropriate POS. The ambiguity is resolved by eliminating the tag(s) with different POS than the one returned by the decision tree. The POS of an unknown word is guessed by traversing the decision tree for unknown words, which examines contextual features, the ending and the capitalization of the word and returns one of the open-class POSs.

    In the following paragraphs we describe the methodology used to induce decision trees from a tagged corpus.


    1. Tagged corpus
    2. For the study and resolution of lexical ambiguity in M. Greek, we set up a corpus of 137.765 tokens (7.624 sentences), collecting sentences from student writings, literature, newspapers, and technical, financial and sports magazines. In order to have a representative corpus, we took care to cover adequately all POS ambiguity schemes present in M. Greek [Gakis et al., 1998], but without showing preference to any scheme.

      Subsequently, we tokenized the corpus and inserted it into a relational database and used the lexicon to assign a morphosyntactic tag to each word-token. The advantage of the database is that it offers a high-level data manipulation language (e.g., SQL) that can be used to rapidly answer simple or complicated questions. Table 1 shows a sample sentence retrieved from the database (the column English Translation exists for reasons of illustration; the symbolic names of feature-values encoded in the tags are given in the Appendix):

      Sentence ID

      Token ID


      English Translation

      Automatic Tag (assigned by the lexicon)

      Manual Tag











      Vrb(_B_SglActPstSjv + _B_SglActFutInd) + Nnn(FemPlrNomAccVoc)






      Prn(_C_MscNtrSngGen) + Clt + Art(MscNtrSngGen)












      "ου" Cap Nnn + Vrb + Adj + Pcp + Adv




















      Table 2. An example-sentence from the tagged corpus

      To those words with POS ambiguity (e.g., tokens 2 and 3) we manually assigned an extra POS label which definitely resolves the ambiguity. To unknown words (e.g., token 5), which by default received a disjunct of open-class POS labels, we manually assigned their real POS and declared explicitly their inflectional ending.

    3. Training sets

At a second phase, for all words relative to a specific ambiguity scheme or for all unknown words we collected from the tagged corpus their automatically and manually assigned tags along with the automatically assigned tags of their neighboring tokens. This way, we created a training set for each ambiguity scheme and a training set for unknown words. Table 2 shows a 10-examples fragment from the training set for the ambiguity scheme Verb-Noun:

Example ID




Manual Tagi



Vrb(_B_SglPntActImv) + Nnn(FemSglNomAccVoc)

Prn(_C_FemSglGen) + Clt + Art(FemSglGen)




Vrb(_C_SglPntFcsActIndSjv) + Nnn(FemSglNomAccVoc)




Nnn + Vrb + Adj + Pcp + Adv

Vrb(_B_SglFutPstActIndSjv) + Nnn(FemPlrNomAccVoc)




Prn(_A_SglGenAcc) + Pps

Vrb(_B_SglFutPstActIndSjv) + Nnn(NtrSglPlrNomGenAccVoc)





Vrb(_B_SglFutPstActIndSjv) + Nnn(FemPlrNomAccVoc)

Prn(_C_MscNtrSglGen) + Clt + Art(MscNtrSglGen)




Vrb(_B_SglPntFcsFutPstActIndSjv) + Nnn(MscSglNom)

Prn(_A_SglGenAcc) + Pps




Vrb(_B_SglFutPstActIndSjv) + Nnn(FemPlrNomAccVoc)





Vrb(_B_SglFutPstActIndSjv) + Nnn(NtrSglPlrNomGenAccVoc)





Vrb(_C_SglPntFcsActIndSjv) + Nnn(FemSglNomAccVoc)

Art(MscSglAcc + NtrSglNomAcc)



Pcl + Adv

Vrb(_B_SglPntFcsFutPstActIndSjv) + Nnn(MscSglNom)



Table 3. A fragment from the training set Verb-Noun

For reasons of space, Table 2 shows the tags of only the previous (column Tagi–1) and next (column Tagi+1) tokens in the neighborhood of an ambiguous word, whereas more contextual tags actually comprise a training example. Also, a training example includes the manually assigned tag (column Manual Tagi) along with the automatically assigned tag (column Tagi) of the ambiguous word. One can notice that some contextual tags may be missing (e.g., Tagi–1 of Example 7; the ambiguous word is the first in the sentence), or some contextual tags may exhibit POS ambiguity (e.g., Tagi+1 of Example 1), an incident implying that the learner must learn from ambiguous examples, since this is the case in real texts.

If we consider that a tag must encode 1 to 5 morphosyntactic features, each feature taking one or a disjunction of 2 to 11 values, then the total number of different tags counts up to hundreds. This fact makes prohibitive to feed the tree-induction algorithm with training patterns that have the form ([Tagi–2],[Tagi–1],[Tagi],[Tagi+1],[Manual Tagi]), which is the case for similar systems that learn POS disambiguation (e.g., [Daelemans et al. 1996]). On the other hand, it would be inefficient (yielding information loss) to generate a simplified tag-set in order to reduce its size. The 'what-the-training-patterns-should-look-like' bottleneck was surpassed by assuming a set of functions that extract from a tag the value(s) of specific features, e.g.:

Gender("Art(MscSglAcc + NtrSglNomAcc)") = Msc + Ntr

With the help of these functions, the training examples shown in Table 3 are interpreted to patterns that look like:

(POS([Tagi–1]),Gender([Tagi]),POS([Tagi+1]),Gender([Tagi+1]),[Manual Tagi])

that is, a sequence of feature-values extracted from the previous/current/next tags along with the manually assigned POS label. This interpretation provides the capability to create flexible training sets, by including or excluding features, which, combined with the availability of a separate training set for each ambiguity scheme, offers two major advantages:

  1. The features comprising the patterns of a training set can be selected according to linguistic criteria that reflect the idiosyncratic properties of the corresponding ambiguity scheme. This is a way to introduce linguistic bias to the learner and address effectively the needs of the particular problem.
  2. The training procedure for a specific ambiguity scheme can be fine-tuned by changing the set of features that form its training patterns, without affecting the learning of the other ambiguity schemes.


    1. Decision tree induction
    2. The algorithm used to transform a training set into a decision tree belongs to the TDIDT (Top Down Induction of Decision Trees) family [Quinlan 1986]. Each training pattern, as already shown in the previous paragraph, is a vector of the form (v1,v2,…vn,c) where vi is a value (or a set of values) of a feature Fi and c is the manually assigned class label. Each tree-node constructed by the training algorithm includes a class label, representing the decision being made by the node, and a feature to be tested over input patterns. The algorithm is initialized with a root-node containing the most frequent class label of the entire training set and the feature Fbest, the best (most predictive) of the features F1,F2,…Fn encoded in the training patterns. Based on the divide and conquer principle, the algorithm partitions the training set according to the values of Fbest and creates under the root one child-node for each partition. The class label assigned to a child-node is the most frequent class label of its corresponding training-subset. Subsequently, the algorithm excludes Fbest from the training patterns and visits each child-node, where it selects a new best feature, re-partitions the training patterns of the child-node and continues recursively until all (or almost all) patterns in a partition have the same class label c or no more features are left.

      The best feature computed during each step by the training algorithm is the feature with the highest Gain Ratio, an information-based quantity introduced by Quinlan [Quinlan 1986]. Intuitively, the best feature is the one that has the maximum contribution to our knowledge about the correct (most probable/frequent) class label pertaining a particular set of patterns, or, equivalently, the feature that its values present the minimum Entropy (uncertainty) when used to predict the class label of a particular set of patterns.


    3. Decision tree traversal

Each tree node, as already mentioned, contains a class label that represents the ‘decision’ being made by the specific node. Moreover, when a node is not a leaf, it also contains a list of values corresponding to a particular feature tested by the node. Each value is the origin of a sub-tree hanging under the non-terminal node. For the needs of the POS disambiguation/guessing problem, tree nodes contain POS labels and test morphosyntactic features.

The disambiguating/guessing procedure is accomplished as follows: First, assume an ambiguous word. From the tags assigned by the lexicon, we identify its ambiguity scheme and select the corresponding decision tree. Also, its tags along with the tags assigned to neighboring tokens are set to form a testing-pattern. Second, assume an unknown word. Tags of neighboring tokens along with the unknown word itself and its capitalization feature are set to form a testing-pattern. The corresponding tree is the one for unknown word guessing. What we have at hand for both situations is a decision tree and a testing-pattern. The tree is traversed from the root to the leaves. Being at a specific non-terminal node, each feature-value contained in that node is looked-up in the testing-pattern and if found the traversal continues through the sub-tree hanging under it. If none of the values is found or the current node is a leaf, the traversal is finished and the node’s POS label is returned.

  1. Experimental Evaluation

To evaluate our method, the datasets described in Section 4.2 were partitioned according to the 10-fold cross-validation method and over the derived datasets we performed two experiments:

  1. We resolved the ambiguity by keeping the tag with the POS that occurs more frequently for the specific ambiguity scheme. To unknown words, we assigned the most frequent POS (i.e., Noun).
  2. We used the decision trees.

Table 3 concentrates the results of our experiments. In detail: Column (1) shows in what percentage the ambiguity schemes and the unknown words occur in the corpus. The total problematic word-tokens in the corpus are 23,38%. Column (2) shows in what percentage each ambiguity scheme contributes to the total POS ambiguity. Column (3) shows the error rates of experiment (a). Column (4) shows the error rates of experiment (b).

To compute the total POS disambiguation error rates of the two methods (24,1% and 5,48% respectively) we used the contribution percentages shown in column (2).




% occurrence in the corpus


% contribution to POS ambiguity



most frequent POS



decision trees

POS Ambiguity Schemes







































































Total POS Ambiguity




Unknown Words










Table 4. Statistics and Evaluation Measurements

  1. Discussion and Future Work

We have presented a POS tagger for M. Greek, the construction of which combined a linguistic approach for the development of a high-coverage lexicon and a statistical approach for the induction of disambiguating/guessing devices. As a general remark, we argue that the linguistic approach is effective when the knowledge or the behavior of a language can be defined explicitly (by means of lexicons, syntactic grammars, etc.), whereas empirical (corpus-based statistical) learning should apply when exceptions, complex interaction or ambiguity arise. In addition, there is always the opportunity to bias empirical learning with linguistically motivated parameters, so as to meet the needs of the specific language problem.

Based on the above statements, we constructed a tagger that achieves ~93,5% unambiguous tagging of M. Greek running texts. Our results outperform the ~25% error rate obtained by the n-gram tagger for M. Greek presented in [Dermatas and Kokkinakis, 1995] and are comparable to the results of taggers built for other languages.

In the future, our work aims to cover the following topics:


Boguraev, B., and Pustejowsky, J. (eds.) 1996. "Corpus Processing for Lexical Acquisition", Cambridge: MIT Press.

Daelemans, W. 1995. "Memory-Based Lexical Acquisition and Processing", In: P. Steffens (ed.) Machine Translation and the Lexicon, Spriger Lecture Notes in Artificial Intelligence 898, pp. 85-98.

Daelemans, W., Zavrel, J., Berck, P., and Gillis, S. 1996. "MBT: A memory-based part of speech tagger generator", In E. Ejerhed and I. Dagan (eds.), Proceedings of 4th Workshop on Very Large Corpora, ACL SIGDAT, pp. 14-27.

Dermatas E. and Kokkinakis G. 1995. "Automatic Stochastic Tagging of Natural Language Texts", Computational Linguistics, 21:2, pp. 137-163.

Domenig, M., and Ten Hacken, P. 1992. "Word Manager: A System for Morphological Dictionaries". Olms, Hildesheim.

Gakis, P., Orphanos, G., Michalakou, A., and Christodoulakis, D. 1998. "An Attempt to Record Lexical Ambiguity in Modern Greek" (in Greek), In Proceedings of the 19th Annual Meeting of the Department of Linguistics, Faculty of Philosophy, Aristotle University of Thessaloniki, Greece.

Mackridge P. 1987. "The Modern Greek Language. A Descriptive Analysis of Standard Modern Greek". Clarenton Press, Oxford.

Maly, K. 1976. "Compressed Tries", Communications of the ACM, Volume 19, Number 7.

Philippaki-Warburton, I. 1985. "Word Order in Modern Greek". In Transactions of the Philological Society, pp. 113-143.

Quinlan, J. R. 1986. "Induction of Decision Trees". Machine Learning, 1, pp. 81-106.

Tsalidis, C, and Orphanos, G. 1995. "Word Description Languages", In Proceedings of the 1st Workshop in Natural Language Processing, Athens, Greece, pp. 239-253.

Vagelatos, A., Triantopoulou, T., Tsalidis, C., and Christodoulakis, D. 1995. "Utilization of a Lexicon for Spelling Correction in Modern Greek", In Proceedings 10th Annual Symposium on Applied Computing (SAC95), Special Track on Artificial Intelligence, Nashville, Tennessee.

Wilks, Y., Slator, B., and Guthrie, L. 1996. "Electric Words. Dictionaries, Computers and Meanings", Cambridge Mass.: MIT Press.

Zernik, U. (ed.) 1991. "Lexical acquisition: exploiting on-line resources to build a lexicon", Hillsdale, N.J.: Erlbaum.


Morphosyntactic Feature Values/Shortcuts

Part-Of-Speech = {Article/Art, Noun/Nnn, Adjective/Adj, Pronoun/Prn, Verb/Vrb, Participle/Pcp, Adverb/Adv, Conjunction/Cnj, Preposition/Pps, Particle/Pcl, Clitic (Clitical Pronoun)/Clt}

Number = {Singular/Sng, Plural/Plu}

Gender = {Masculine/Msc, Feminine/Fem, Neuter/Ntr}

Case = {Nominative/Nom, Genitive/Gen, Dative/Dat, Accusative/Acc, Vocative/Voc}

Person = {First/_A_, Second/_B_, Third/_C_}

Tense = {Present/Pnt, Future/Fut, Future Perfect/Fpt, Future Continus/Fcs, Past/Pst, Present Perfect/Pnp, Past Perfect/Psp}

Voice = {Active/Act, Passive/Psv}

Mood = {Indicative/Ind, Imperative/Imv, Subjanctive/Sjv}