PTOLEMAIOS
See also the PTOLEMAIOS project website at Saarland University.
The main focus of my current research is on
learnability of grammars and actual grammar learning experiments.
This includes questions about formal representation,
computational properties of grammar formalisms, formulation of
linguistic constraints, algorithms for analysis and learning,
(probabilistic) learning models based on corpus data,
bootstrapping techniques, etc.
The key idea of what I call the PTOLEMAIOS project (for
"Parallel-Text-based Optimization for
Language learning--Exploiting
Multilingual Alignment for the Induction
Of Syntactic grammars") is
the following: sentence-aligned parallel corpora contain significant
implicit information about syntactic structure of the sentences, and
thus about the grammars of the languages involved. With appropriate
learning techniques, one should be able to access this information and
induce grammars from parallel corpora (compare the successful pilot
study in Kuhn 2004a/b).
The motivation behind the project is twofold: (1) understanding
learnability properties of grammar models is crucial for a better
theoretical grasp of the language faculty; and (2) being able to train
grammars from corpus data is an important step for multilingual
Natural Language Processing applications.
The PTOLEMAIOS project will combine different lines that I
have pursued in earlier research work: (i) syntactic and semantic
modeling of linguistic phenomena, often in a crosslinguistic
context, (ii) studies of formal properties of grammar
formalisms, (iii) implementation of processing algorithms for
grammar formalisms and infrastructure tools for maintenance
of linguistic resources, and (iv) corpus-based training of linguistic
models.
Linguistic modeling
I have worked for many years on theoretical
and practical issues of broad-coverage grammar writing.
This work has been situated in the context of the Parallel
Grammar Development project ParGram, based on the Lexical-Functional
Grammar (LFG) formalism. I have been involved
in the motivation
and development of various formal devices for linguistic
modeling such as parametrized categories (Kuhn, 1998a),
Optimality-Theory-style constraint ranking (Kuhn and Rohrer,
1997; Frank et al., 1998, 2001; King et al., 2000, 2004), and a
feature declaration mechanism (Butt et al., 2003). The guiding
principle of the ParGram project is to exploit linguistic insights
into cross-linguistic generalizations in order to write "parallel"
broad-coverage grammars for multiple languages (Butt et al.,
2003; King et al., forthcoming). There are numerous advantages
to parallelism, such as easy porting of applications for one
grammar to the others, applicability in machine translation or
analysis of parallel corpora, but also reduction of development
effort for new grammars or subgrammars. Most of these advantages
carry over to the PTOLEMAIOS approach, in which the
grammars induced for different languages produce very similar
representations.
Besides linguistic modeling work in the closer context of the
ParGram project, I have made theoretical contributions in syntax
and semantics, applying various formalisms such as LFG,
HPSG, DRT, and Glue Language Semantics (Kuhn, 1994;
Kuhn and Heid, 1994; Kuhn, 1996a,b,c,d; Dogil et al., 1997;
Berman et al., 1998; Kuhn, 1999a, 2001d, in preparation; Denis
et al., 2003).
Formal properties of grammar formalisms
My second
line of research has addressed grammar formalisms, originally
growing out of ParGram-related work (Kuhn, 1999b, 2001c),
but ultimately representing a separate focus, in particular in
the work on Optimality-Theoretic (OT) Syntax. My OT work
(Kuhn, 2000a,c, 2002c, 2003b and in particular my dissertation Kuhn
2001b, and the CSLI Publications book Kuhn 2003c) develops the framework
originally proposed by Joan Bresnan,
which builds on candidate representations from LFG (OT-LFG).
The central questions I have addressed in OT concern the
formalization of the candidate generation function and of the violable
constraints, the "direction" of optimization (i.e., whether
we compare alternative realizations of the same meaning or alternative
analyses of the same string), and decidability of the
question whether a given candidate is optimal according to an
OT grammar. The PTOLEMAIOS project builds
directly on a number of insights from the OT formalization
work.
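The notion of a candidate being optimal under a ranked set of violable constraints can be illustrated with a small sketch. It assumes strict ranking, so that violation profiles are compared lexicographically, and it assumes the violation counts of each candidate are already given. The function name `optimal_candidates`, the candidate labels, and the constraint names `C1` and `C2` are hypothetical placeholders, not taken from the OT-LFG formalization or any of the cited systems.

```python
from typing import Dict, List, Tuple


def optimal_candidates(
    violations: Dict[str, Dict[str, int]],
    ranking: List[str],
) -> List[str]:
    """Return the candidates whose violation profile is minimal.

    violations maps each candidate label to its violation counts per
    constraint; ranking lists constraint names from highest to lowest rank.
    Under strict ranking, profiles are compared lexicographically.
    """

    def profile(candidate: str) -> Tuple[int, ...]:
        # Missing constraints count as zero violations.
        return tuple(violations[candidate].get(c, 0) for c in ranking)

    best = min(profile(candidate) for candidate in violations)
    return [candidate for candidate in violations if profile(candidate) == best]


# Toy candidate set (hypothetical): "a" satisfies C1 but violates C2 three
# times; "b" violates C1 once and satisfies C2.
toy = {
    "a": {"C1": 0, "C2": 3},
    "b": {"C1": 1, "C2": 0},
}

print(optimal_candidates(toy, ["C1", "C2"]))  # ['a']
print(optimal_candidates(toy, ["C2", "C1"]))  # ['b']
```

Reversing the ranking changes the winner, which is the sense in which a ranking defines a grammar. The sketch enumerates a finite candidate set; it does not address the harder case in which the candidate generation function yields unboundedly many candidates, which is where the decidability question arises.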
Computational processing and tool building
I have developed
prototype systems for processing tasks related to grammar
formalisms and infrastructure tools in the context of linguistic
engineering. For example, Kuhn (1998b) discusses tools
for testing a grammar against an annotated testsuite; in (Zinsmeister
et al., 2002), a system converting LFG representations
into a dependency treebank format is discussed; Kuhn
(2000b, 2001a) presents a chart-based algorithm for processing
OT-syntactic grammars; Kuhn (2003a) addresses a finite-state
approximation of a feature-based morphological grammar
for word formation in German; in (Kuhn and Mateo-Toledo,
2004), various experiments with NLP tools applied in corpus
construction for the endangered Mayan language Q'anjob'al
are discussed; Palmer et al. (2004) report on experiments using
various linguistic resources and NLP tools for the implementation
of Carlota Smith's discourse-semantic theory.
Corpus-based learning
My fourth and last line of research
has attempted to exploit text corpora in order to acquire linguistic
knowledge, using a variety of techniques and typically
exploiting higher-level linguistic representations or background
knowledge. In (Kuhn et al., 1998), we focus specifically on
acquiring lexical subcategorization information for verbs, using
an existing large-coverage grammar for hypothesis testing.
In (Riezler et al., 2000), we trained a log-linear (or Maximum
Entropy) model for disambiguating the output of the large-scale German
LFG grammar from the
ParGram project. Kuhn and Mateo-Toledo (2004) include experiments
in training a Maximum Entropy part-of-speech tagger
for Q'anjob'al, for which only very limited resources exist;
the Maximum Entropy approach is particularly suited for combining
many different features in learning, so the most effective
use can be made of the small set of learning data. I also use
a Maximum Entropy model in ongoing work on learning various
linguistic models, such as a coreference resolution model
for anaphoric expressions exploiting a deep syntactic grammar
and insights from Segmented Discourse Representation Theory
(joint research with Nicholas Asher; Asher et al., 2004).
Corpus-based learning with an OT grammar architecture
was also one of the topics I addressed in my postdoctoral research
project "Optimization Inside and Outside Grammar: a Formal
Linguistic Approach to Corpus-based Learning" at Stanford University
in 2001/02 (compare Kuhn, 2002a,b).
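How a log-linear (Maximum Entropy) model ranks competing analyses of one sentence can be sketched as follows. This is a minimal illustration, assuming each candidate analysis is described by a dictionary of feature values and that the feature weights have already been trained; the function name `parse_probabilities` and the feature names are hypothetical and do not come from the cited systems.

```python
import math
from typing import Dict, List


def parse_probabilities(
    candidates: List[Dict[str, float]],
    weights: Dict[str, float],
) -> List[float]:
    """Conditional log-linear distribution over candidate analyses.

    Each candidate is scored by the weighted sum of its feature values;
    the scores are exponentiated and normalized over the candidate set.
    """
    scores = [
        sum(weights.get(name, 0.0) * value for name, value in feats.items())
        for feats in candidates
    ]
    # Subtract the maximum score before exponentiating for numerical stability.
    max_score = max(scores)
    exp_scores = [math.exp(score - max_score) for score in scores]
    total = sum(exp_scores)
    return [e / total for e in exp_scores]


# Two hypothetical analyses of an ambiguous sentence, each with one
# active feature; the weights are made up for illustration.
analyses = [{"pp_attach_verb": 1.0}, {"pp_attach_noun": 1.0}]
feature_weights = {"pp_attach_verb": 0.5, "pp_attach_noun": -0.5}

probabilities = parse_probabilities(analyses, feature_weights)
best_index = max(range(len(analyses)), key=lambda i: probabilities[i])
print(probabilities, best_index)
```

Because every feature simply contributes a weighted term to the score, arbitrary and overlapping features can be combined without independence assumptions, which is what makes the approach attractive when training data is scarce.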
Bibliography
- Berman, Judith, Stefanie Dipper, Christian Fortmann, and Jonas
Kuhn. 1998. Argument clauses and correlative 'es' in German:
Deriving discourse properties in a unification analysis. In M. Butt
and T. H. King (eds.), Proceedings of the LFG98 Conference,
Brisbane, Australia, CSLI Proceedings Online.
- Butt, Miriam, Martin Forst, Tracy H. King, and Jonas Kuhn. 2003.
The feature space in parallel grammar writing. In Proceedings of
ESSLLI03-Workshop on Ideas and Strategies in Multilingual
Grammar Development, August 2003, Vienna.
- Denis, Pascal, Jonas Kuhn, and Stephen Wechsler. 2003. V-PP goal
motion complexes in English: an HPSG account. In ACL-SIGSEM
workshop: The Linguistic Dimensions of Prepositions and their
Use in Computational Linguistics Formalisms and Applications,
September 2003, Toulouse.
- Dogil, Grzegorz, Jonas Kuhn, Jörg Mayer, Gregor Möhler, and
Stefan Rapp. 1997. Prosody and discourse structure: issues and
experiments. In Proceedings of the ESCA Workshop on Intonation:
Theory, Models and Applications, pp. 99--102, Athens, Greece.
- Frank, Anette, Tracy H. King, Jonas Kuhn, and John Maxwell. 1998.
Optimality Theory style constraint ranking in large-scale LFG
grammars. In M. Butt and T. H. King (eds.), Proceedings of the
Third LFG Conference, CSLI Proceedings Online.
- Frank, Anette, Tracy H. King, Jonas Kuhn, and John Maxwell. 2001.
Optimality Theory style constraint ranking in large-scale LFG
grammars. In Peter Sells (ed.), Formal and Empirical Issues in
Optimality-theoretic Syntax, pp. 367--397. Stanford: CSLI
Publications.
- King, Tracy Holloway, Stefanie Dipper, Anette Frank, Jonas Kuhn,
and John Maxwell. 2000. Ambiguity management in grammar
writing. In Erhard Hinrichs, Detmar Meurers, and Shuly Wintner
(eds.), Proceedings of the Workshop on Linguistic Theory and
Grammar Implementation, ESSLLI-2000, Birmingham, UK.
- King, Tracy Holloway, Stefanie Dipper, Anette Frank, Jonas Kuhn,
and John Maxwell. 2004. Ambiguity management in grammar
writing. Research on Language and Computation 2:259--280.
- King, Tracy H., Martin Forst, Jonas Kuhn, and Miriam Butt.
forthcoming. The feature space in parallel grammar writing.
Research on Language and Computation. Accepted for publication
in the special issue on Shared Representation in Multilingual
Grammar Engineering.
- Kuhn, Jonas. 1994. Die Behandlung von Funktionsverbgefügen in
einem HPSG-basierten Übersetzungsansatz. Technical report,
Institut für maschinelle Sprachverarbeitung, Universität Stuttgart.
Studienarbeit [undergraduate research thesis], appeared as
Verbmobil Report 66, Dezember 1994.
- Kuhn, Jonas. 1996a. Context effects on interpretation and intonation.
In Dafydd Gibbon (ed.), Natural Language Processing and Speech
Technology. Results of the 3rd KONVENS Conference, pp.
186--198. Berlin: de Gruyter.
- Kuhn, Jonas. 1996b. Domain restriction in quantification is
independent of focus marking -- a DRT account of the use of
adverbial quantifiers in partial answers to questions. In
Proceedings of the Conference on Formal Grammar, Prague.
- Kuhn, Jonas. 1996c. On intonation and interpretation in context -- is
there a unitary explanation for focus and deaccenting?
Diplomarbeit [Master's thesis], Institut für maschinelle
Sprachverarbeitung, Universität Stuttgart.
- Kuhn, Jonas. 1996d. An underspecified HPSG representation for
information structure. In Proceedings of COLING-96, pp. 670--675,
Copenhagen.
- Kuhn, Jonas. 1998a. Some recent extensions of the LFG formalism
and their application in broad-coverage grammars. In Workshop
Applications of Constraint-Based Programming to Computational
Linguistics. Blaubeuren.
- Kuhn, Jonas. 1998b. Towards data-intensive testing of a
broad-coverage LFG grammar. In Bernhard Schröder, Winfried
Lenders, Wolfgang Hess, and Thomas Portele (eds.), Computers,
Linguistics, and Phonetics between Language and Speech,
Proceedings of the 4th Conference on Natural Language
Processing -- KONVENS-98, pp. 43--56, Bonn. Peter Lang.
- Kuhn, Jonas. 1999a. The syntax and semantics of split NPs in LFG.
In F. Corblin, C. Dobrovie-Sorin, and J.-M. Marandin (eds.),
Empirical Issues in Formal Syntax and Semantics 2, Selected
Papers from the Colloque de Syntaxe et Sémantique à Paris (CSSP
1997), pp. 145--166, The Hague. Thesus.
- Kuhn, Jonas. 1999b. Towards a simple architecture for the
structure-function mapping. In M. Butt and T. H. King (eds.),
Proceedings of the LFG99 Conference, Manchester, UK, CSLI
Proceedings Online.
- Kuhn, Jonas. 2000a. Faithfulness violations and bidirectional
optimization. In M. Butt and T. H. King (eds.), Proceedings of the
LFG 2000 Conference, Berkeley, CA, CSLI Proceedings Online, pp.
161--181.
- Kuhn, Jonas. 2000b. Processing Optimality-theoretic syntax by
interleaved chart parsing and generation. In Proceedings of the
38th Annual Meeting of the Association for Computational
Linguistics (ACL-2000), pp. 360--367, Hong Kong.
- Kuhn, Jonas. 2000c. Resolving some apparent formal problems of
OT syntax. In Proceedings of NELS 30, October 1999, Rutgers, NJ,
Amherst, MA. Graduate Linguistics Students Association.
- Kuhn, Jonas. 2001a. Computational optimality-theoretic syntax -- a
chart-based approach to parsing and generation. In Christian
Rohrer, Antje Roßdeutscher, and Hans Kamp (eds.), Linguistic
Form and its Computation, pp. 353--385, Stanford. CSLI
Publications.
- Kuhn, Jonas. 2001b. Formal and Computational Aspects of
Optimality-theoretic Syntax. PhD thesis, Institut für maschinelle
Sprachverarbeitung, Universität Stuttgart.
- Kuhn, Jonas. 2001c. Generation and parsing in Optimality Theoretic
syntax -- issues in the formalization of OT-LFG. In Peter Sells
(ed.), Formal and Empirical Issues in Optimality-theoretic Syntax,
pp. 313--366. Stanford: CSLI Publications.
- Kuhn, Jonas. 2001d. Resource sensitivity in the syntax-semantics
interface and the German split NP construction. In Detmar Meurers
and Tibor Kiss (eds.), Constraint-Based Approaches to Germanic
Syntax, pp. 177--215. Stanford: CSLI Publications.
- Kuhn, Jonas. 2002a. Corpus-based learning in stochastic
OT-LFG--experiments with a bidirectional bootstrapping
approach. In M. Butt and T. H. King (eds.), Proceedings of the
LFG 2002 Conference, Athens, Greece, CSLI Proceedings Online,
pp. 239--257.
- Kuhn, Jonas. 2002b. Extended constraint ranking models for
frequency-sensitive accounts of syntax. Slides for a presentation at
the Workshop Quantitative Investigations in Theoretical Linguistics
(QITL), 3-5 October 2002, Osnabrück, Germany.
- Kuhn, Jonas. 2002c. OT syntax -- decidability of generation-based
optimization. In Proceedings of the 40th Annual Meeting of the
Association for Computational Linguistics (ACL02), pp. 48--55,
Philadelphia.
- Kuhn, Jonas. 2003a. Compounding and derivational morphology in a
finite-state setting. In Proceedings of the 41st Annual Meeting of
the Association for Computational Linguistics (ACL03), Sapporo,
Japan, pp. 192--199.
- Kuhn, Jonas. 2003b. Generalized tree descriptions for LFG. In
Proceedings of the LFG 2003 Conference, Saratoga Springs, NY,
USA.
- Kuhn, Jonas. 2003c. Optimality-Theoretic Syntax--A Declarative
Approach. Stanford, CA: CSLI Publications.
- Kuhn, Jonas. 2004a. Experiments in parallel-text based grammar
induction. In Proceedings of the 42nd Annual Meeting of the
Association for Computational Linguistics (ACL 2004).
forthcoming.
- Kuhn, Jonas. 2004b. Exploiting parallel corpora for monolingual
grammar induction--a pilot study. In Proceedings of the Workshop
on the Amazing Utility of Parallel and Comparable Corpora, LREC
2004.
- Kuhn, Jonas. in preparation. Constraint-based theories of grammar.
To appear in G. Ramchand and C. Reiss (eds.), Handbook on
Interfaces. Oxford: Oxford University Press.
- Kuhn, Jonas, Judith Eckle, and Christian Rohrer. 1998. Lexicon
acquisition with and for symbolic NLP-systems -- a bootstrapping
approach. In Proceedings of the First International Conference on
Language Resources and Evaluation (LREC98), pp. 89--95,
Granada, Spain.
- Kuhn, Jonas, and Ulrich Heid. 1994. Treating structural differences
in an HPSG-based approach to interlingual machine translation. In
Peter Bosch and Christopher Habel (eds.), Kognitive Grundlagen
für interlinguabasierte Übersetzung, Working Papers des Instituts
für Logik und Linguistik, Paper Nr. 3, IBM Deutschland
Informationssysteme GmbH, Heidelberg, pp. 11--36.
- Kuhn, Jonas, and Balam Mateo-Toledo. 2004. Applying
computational linguistic techniques in a documentary project for
Q'anjob'al (Mayan, Guatemala). In Proceedings of the
International Conference on Language Resources and Evaluation
(LREC 2004), Lisbon.
- Kuhn, Jonas, and Christian Rohrer. 1997. Approaching ambiguity in
real-life sentences -- the application of an Optimality
Theory-inspired constraint ranking in a large-scale LFG grammar.
In Proceedings of DGfS-CL, Heidelberg.
- Palmer, Alexis, Jonas Kuhn, and Carlota Smith. 2004. Utilization of
multiple language resources for robust grammar-based tense and
aspect classification. In Proceedings of the International
Conference on Language Resources and Evaluation (LREC 2004),
Lisbon.
- Riezler, Stefan, Detlef Prescher, Jonas Kuhn, and Mark Johnson.
2000. Lexicalized stochastic modeling of constraint-based
grammars using log-linear measures and EM training. In
Proceedings of the 38th Annual Meeting of the Association for
Computational Linguistics (ACL00), Hong Kong, pp. 480--487.
- Zinsmeister, Heike, Jonas Kuhn, and Stefanie Dipper. 2002. Utilizing
LFG parses for treebank annotation. In LFG 2002, Athens.