Supporting Universal Dependencies in Tree Editor TrEd

Authors

  • Jan Štěpánek, Charles University, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

Keywords:

NLP, treebank, Universal Dependencies

Abstract

The paper presents the tree editor TrEd and related tools that can be used to create, modify, browse, and search treebanks, i.e. large language corpora annotated with syntactic and/or semantic structure information. The annotation might include not only phrase structure or dependencies, but also coreference, discourse analysis, and even inter-sentence relations. The project started in 2000 and has been in continuous use since then at various institutions all over the world. Most of the tools are written in Perl, which makes them available on all major operating systems. For searching the treebanks, a query language was developed that describes sets of tree nodes and the relations between them. It also supports aggregation to produce quantitative outputs. There are two different implementations: one translates the queries into SQL statements, the other searches the data directly in the editor. Originally, TrEd supported only the PML data format used for the Prague Dependency Treebank. To process data in a different format, one first needed to convert the data into the PML format (and possibly convert the modified data back to the initial format). Later, a versatile extension system was added to TrEd, which made it possible to support other data formats directly. We will show how this works using Universal Dependencies (UD) as an example. UD is a framework for grammar annotation across different human languages. The described extension allows TrEd (and some other tools) to open files in the original UD format natively, building the internal representation on the fly, and to serialise them back after editing.
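
To illustrate what opening a UD file natively and serialising it back involves, the following is a minimal sketch in Python. It is not the TrEd extension itself, which is written in Perl and builds a PML-based internal representation. The names Node, parse_sentence, serialise, and SAMPLE are hypothetical, and the sketch assumes the standard tab-separated, ten-column CoNLL-U layout with the head reference in the seventh column.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One tree node; `cols` holds the ten CoNLL-U columns verbatim."""
    cols: list
    children: list = field(default_factory=list)


def parse_sentence(block):
    """Build a dependency tree from one CoNLL-U sentence block.

    Returns (comments, rows, root). `rows` keeps every data line in file
    order, so multiword-token and empty-node lines survive a round trip
    even though only plain integer IDs are linked into the tree.
    """
    comments, rows = [], []
    for line in block.splitlines():
        if line.startswith("#"):
            comments.append(line)
        elif line.strip():
            rows.append(line.split("\t"))

    # Artificial technical root with ID 0 (hypothetical placeholder form).
    root = Node(["0", "<root>"] + ["_"] * 8)
    nodes = {"0": root}
    for cols in rows:
        if cols[0].isdigit():
            nodes[cols[0]] = Node(cols)

    # Column 7 (index 6) is HEAD: attach each word to its parent.
    for cols in rows:
        if cols[0].isdigit():
            nodes[cols[6]].children.append(nodes[cols[0]])
    return comments, rows, root


def serialise(comments, rows):
    """Write the sentence back in CoNLL-U form, ending with a blank line."""
    lines = comments + ["\t".join(cols) for cols in rows]
    return "\n".join(lines) + "\n\n"


SAMPLE = (
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

comments, rows, root = parse_sentence(SAMPLE)
```

Because the Node objects share their column lists with rows, editing a node's columns in place (for example, changing its relation label) is reflected when serialise(comments, rows) is called afterwards.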

Published

11/17/2024

Section

Full Paper (10-36 pages + References)
