PTOLEMAIOS

See also the PTOLEMAIOS project website at Saarland University.

The main focus of my current research is on the learnability of grammars and on actual grammar learning experiments. This includes questions of formal representation, the computational properties of grammar formalisms, the formulation of linguistic constraints, algorithms for analysis and learning, (probabilistic) learning models based on corpus data, bootstrapping techniques, etc.

The key idea of what I call the PTOLEMAIOS project (for "Parallel-Text-based Optimization for Language learning: Exploiting Multilingual Alignment for the Induction Of Syntactic grammars") is the following: sentence-aligned parallel corpora contain significant implicit information about the syntactic structure of the sentences, and thus about the grammars of the languages involved. With appropriate learning techniques, one should be able to access this information and induce grammars from parallel corpora (compare the successful pilot study in Kuhn 2004a,b).
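To make this concrete, here is a minimal sketch in Python (my own simplification for illustration, not the actual PTOLEMAIOS learning algorithm; all function names are invented) of how word alignments in a sentence-aligned parallel corpus can filter syntactic constituent hypotheses: a source span whose aligned target words form a contiguous block that no outside source word aligns into is a more plausible constituent candidate.

    # Illustrative sketch: alignment links are (source_index, target_index)
    # pairs for one sentence pair.

    def aligned_targets(links, i, j):
        """Target positions aligned to the source span [i, j)."""
        return {t for (s, t) in links if i <= s < j}

    def alignment_compatible(links, i, j):
        """Keep the source span [i, j) as a constituent hypothesis if its
        aligned target positions form a contiguous block that no source
        word outside the span aligns into."""
        targets = aligned_targets(links, i, j)
        if not targets:
            return False
        lo, hi = min(targets), max(targets) + 1
        outside = {t for (s, t) in links if not (i <= s < j)}
        return not any(lo <= t < hi for t in outside)

    # Toy three-word sentence pair; span (0, 2) is filtered out because
    # its target image {0, 2} is interrupted by a word aligned from outside.
    links = [(0, 0), (1, 2), (2, 1)]
    spans = [(i, j) for i in range(3) for j in range(i + 1, 4)]
    print([span for span in spans if alignment_compatible(links, *span)])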

The motivation behind the project is twofold: (1) understanding learnability properties of grammar models is crucial for a better theoretical grasp of the language faculty; and (2) being able to train grammars from corpus data is an important step for multilingual Natural Language Processing applications.


The PTOLEMAIOS project will combine several lines of research that I have pursued in earlier work: (i) syntactic and semantic modeling of linguistic phenomena, often in a crosslinguistic context, (ii) studies of the formal properties of grammar formalisms, (iii) implementation of processing algorithms for grammar formalisms and of infrastructure tools for the maintenance of linguistic resources, and (iv) corpus-based training of linguistic models.

Linguistic modeling

I have worked for many years on theoretical and practical issues of broad-coverage grammar writing. This work has been situated in the context of the Parallel Grammar Development project ParGram, based on the Lexical-Functional Grammar (LFG) formalism. I have been involved in the motivation and development of various formal devices for linguistic modeling, such as parametrized categories (Kuhn, 1998a), Optimality-Theory-style constraint ranking (Kuhn and Rohrer, 1997; Frank et al., 1998, 2001; King et al., 2000, 2004), and a feature declaration mechanism (Butt et al., 2003). The guiding principle of the ParGram project is to exploit linguistic insights into cross-linguistic generalizations in order to write "parallel" broad-coverage grammars for multiple languages (Butt et al., 2003; King et al., forthcoming). Parallelism has numerous advantages: applications built for one grammar can easily be ported to the others, the grammars can be used in machine translation or in the analysis of parallel corpora, and the development effort for new grammars or subgrammars is reduced. Most of these advantages carry over to the PTOLEMAIOS approach, in which the grammars induced for different languages produce very similar representations.

Besides linguistic modeling work in the closer context of the ParGram project, I have made theoretical contributions in syntax and semantics, applying various formalisms such as LFG, HPSG, DRT, and Glue Language Semantics (Kuhn, 1994; Kuhn and Heid, 1994; Kuhn, 1996a,b,c,d; Dogil et al., 1997; Berman et al., 1998; Kuhn, 1999a, 2001d, in preparation; Denis et al., 2003).

Formal properties of grammar formalisms

My second line of research has addressed grammar formalisms, originally growing out of ParGram-related work (Kuhn, 1999b, 2001c), but ultimately representing a separate focus, in particular in my work on Optimality-Theoretic (OT) Syntax. My OT work (Kuhn, 2000a,c, 2002c, 2003b; in particular my dissertation, Kuhn 2001b, and the CSLI Publications book, Kuhn 2003c) develops the framework originally proposed by Joan Bresnan, which builds on candidate representations from LFG (OT-LFG). The central questions I have addressed in OT concern the formalization of the candidate generation function and of the violable constraints, the "direction" of optimization (i.e., whether we compare alternative realizations of the same meaning or alternative analyses of the same string), and the decidability of the question whether a given candidate is optimal according to an OT grammar. The PTOLEMAIOS project builds directly on a number of insights from this formalization work.
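For readers unfamiliar with OT, the core evaluation step can be sketched in a few lines of Python (this is textbook OT evaluation, not the OT-LFG formalization discussed above; the constraints and candidates are invented placeholders). Violation profiles are compared lexicographically along the constraint ranking, so a single violation of a high-ranked constraint outweighs any number of violations of lower-ranked ones.

    def optimal(candidates, ranked_constraints):
        """Return the candidates whose violation profile is
        lexicographically minimal under the constraint ranking."""
        def profile(cand):
            return tuple(c(cand) for c in ranked_constraints)
        best = min(map(profile, candidates))
        return [c for c in candidates if profile(c) == best]

    # Hypothetical phonology example: ONSET penalizes vowel-initial
    # syllables, DEP penalizes epenthetic segments (marked "?").
    onset = lambda cand: sum(1 for syl in cand.split(".") if syl[0] in "aeiou")
    dep = lambda cand: cand.count("?")

    # Under the ranking ONSET >> DEP, the epenthesized candidate wins.
    print(optimal(["a.ta", "?a.ta"], [onset, dep]))   # -> ['?a.ta']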

Computational processing and tool building

I have developed prototype systems for processing tasks related to grammar formalisms, as well as infrastructure tools in the context of linguistic engineering. For example, Kuhn (1998b) discusses tools for testing a grammar against an annotated testsuite; Zinsmeister et al. (2002) present a system converting LFG representations into a dependency treebank format; Kuhn (2000b, 2001a) presents a chart-based algorithm for processing OT-syntactic grammars; Kuhn (2003a) addresses a finite-state approximation of a feature-based morphological grammar for word formation in German; Kuhn and Mateo-Toledo (2004) discuss various experiments with NLP tools applied in corpus construction for the endangered Mayan language Q'anjob'al; and Palmer et al. (2004) report on experiments using various linguistic resources and NLP tools for the implementation of Carlota Smith's discourse-semantic theory.

Corpus-based learning

My fourth and last line of research has attempted to exploit text corpora in order to acquire linguistic knowledge, using a variety of techniques and typically exploiting higher-level linguistic representations or background knowledge. In (Kuhn et al., 1998), we focus specifically on acquiring lexical subcategorization information for verbs, using an existing large-coverage grammar for hypothesis testing. In (Riezler et al., 2000), we trained a log-linear (or Maximum Entropy) model for disambiguating the output of the large-scale German LFG grammar from the ParGram project. Part of (Kuhn and Mateo-Toledo, 2004) is a set of experiments in training a Maximum Entropy part-of-speech tagger for Q'anjob'al, for which only very limited resources exist; the Maximum Entropy approach is particularly suited for combining many different features in learning, so that the most effective use can be made of the small amount of training data. I also use a Maximum Entropy model in ongoing work on learning various linguistic models, such as a coreference resolution model for anaphoric expressions that exploits a deep syntactic grammar and insights from Segmented Discourse Representation Theory, in joint research with Nicholas Asher (Asher et al., 2004). Corpus-based learning based on an OT grammar architecture was also one of the topics I addressed in my postdoctoral research project Optimization Inside and Outside Grammar: A Formal Linguistic Approach to Corpus-based Learning at Stanford University in 2001/02 (compare Kuhn, 2002a,b).
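As background on why the log-linear form is so well suited to combining heterogeneous features, here is a generic Python sketch (standard Maximum Entropy classification; the feature functions and weights are invented for illustration and are not those of the cited work): each feature function may inspect arbitrary properties of the input, and the model simply sums the weighted feature values before normalizing.

    import math

    def maxent_prob(x, y, labels, features, weights):
        """Conditional probability p(y | x) under a log-linear model:
        p(y | x) is proportional to exp(sum_i w_i * f_i(x, y))."""
        def score(label):
            return sum(w * f(x, label) for f, w in zip(features, weights))
        z = sum(math.exp(score(label)) for label in labels)  # normalization
        return math.exp(score(y)) / z

    # Toy tagging-style example with two hypothetical features; in practice
    # the weights would be estimated from training data.
    features = [
        lambda x, y: 1.0 if x["suffix"] == "ing" and y == "VERB" else 0.0,
        lambda x, y: 1.0 if x["capitalized"] and y == "NOUN" else 0.0,
    ]
    weights = [1.2, 0.8]
    x = {"suffix": "ing", "capitalized": False}
    print(maxent_prob(x, "VERB", ["VERB", "NOUN"], features, weights))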

Bibliography

Jonas Kuhn, January 2005