
This page pertains to UD version 2.

Introduction

Japanese corpora annotated according to the Universal Dependency annotation scheme will be obtained by conversion from multiple linguistic resources.

As a first step, we construct conversion rules using the ‘NTT Japanese Phrase Structure Treebank’ (Tanaka and Nagata 2013) for the Mainichi Shimbun newspaper.

We also construct conversion rules for the ‘Balanced Corpus of Contemporary Written Japanese’ (BCCWJ) (Maekawa et al. 2014) with third-party annotations such as morpheme information and phrase-based dependencies. Currently, the same annotation has been applied to UD Japanese GSD and UD Japanese PUD.

Basic Policy

The Japanese language is written without spaces or other clear divisions to show word boundaries. We tend to define morphemic units smaller than the word unit in order to maintain unit uniformity. Therefore, when we define the morpheme unit as the Universal Dependency word unit, we have to annotate the compound word construction, as defined in the morphological layer of Japanese linguistics.

The Universal Dependency scheme is not well suited to Japanese dependency annotation. This is because the dependency label set used by Universal Dependency mixes several different layers, such as morphological, syntactic and semantic dependency. To address the split between the morphological and syntactic levels, we define a Japanese base phrase unit, the bunsetsu (文節), for syntactic dependency annotation. The morphological level, including multi-word expressions, is encapsulated within the bunsetsu. We can therefore concentrate on the annotation of purely syntactic phenomena.

The discrepancy between syntactic phrases and phonetic (accent) phrases is another issue in word-based dependency annotation. Since we focus on written corpora rather than speech corpora, we omit this issue from the Universal Dependency annotation schema.
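One way to picture how morphology stays encapsulated inside the bunsetsu is to project bunsetsu-level arcs down to words. The following is a minimal sketch under stated assumptions, not the conversion rules used for the corpora: the function name `to_word_heads`, the dictionary keys, and the rule that every non-head word attaches to the head word of its own bunsetsu are all hypothetical choices made for illustration.

```python
def to_word_heads(bunsetsu_list):
    """Project bunsetsu-level dependencies onto words (illustrative only).

    Each bunsetsu is a dict with hypothetical keys:
      "words"     : list of word forms
      "head_word" : index (within "words") of the bunsetsu's head word
      "head"      : index of the governing bunsetsu, or -1 for the root
    Returns (word, head_position) pairs with 1-based positions; 0 marks the root.
    """
    # Position of the first word of each bunsetsu in the flat word sequence.
    offsets = []
    pos = 0
    for b in bunsetsu_list:
        offsets.append(pos)
        pos += len(b["words"])

    result = []
    for i, b in enumerate(bunsetsu_list):
        own_head = offsets[i] + b["head_word"] + 1
        for j, word in enumerate(b["words"]):
            if j != b["head_word"]:
                # Assumption: bunsetsu-internal words attach to the local head.
                head = own_head
            elif b["head"] == -1:
                head = 0
            else:
                parent = bunsetsu_list[b["head"]]
                head = offsets[b["head"]] + parent["head_word"] + 1
            result.append((word, head))
    return result


example = [
    {"words": ["太郎", "が"], "head_word": 0, "head": 1},
    {"words": ["走っ", "た"], "head_word": 0, "head": -1},
]
word_heads = to_word_heads(example)
```

In this sketch only the head words carry the syntactic arcs between bunsetsu; everything else is resolved locally, which mirrors the separation of morphological and syntactic levels described above.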

The understanding of parts-of-speech (PoSs) in Japanese corpora can be split into two philosophies: lexicon-based (語彙主義) and usage-based (用法主義). The lexicon-based approach assigns, as the label of a word, all the categories that the word can possibly take. For example, the label ‘名詞-普通名詞-サ変形状詞可能’ means that the word can be a Noun, Verbal Noun or Adjective. The labels are maintained in a large-scale PoS-tagged lexicon and used in semi-Markov model-based morphological analysers. Usage-based labelling is determined by the contextual information in a sentence. We use usage-based PoS tags from UniDic-based lexicons, corpora and morphological analysers to align with the Universal PoS tags.
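The contrast between the two philosophies can be sketched as follows. The candidate set for ‘名詞-普通名詞-サ変形状詞可能’ follows the description above; everything else (the second label's candidate set, the function name `usage_label`, and the two context flags) is an assumption made for illustration and does not reproduce UniDic's actual disambiguation rules.

```python
# Lexicon-based view: one label lists every category the word may take.
LEXICON_CANDIDATES = {
    "名詞-普通名詞-サ変形状詞可能": ("noun", "verbal_noun", "adjective"),
    "名詞-普通名詞-一般": ("noun",),  # assumed: plain common noun only
}


def usage_label(lexicon_label, followed_by_suru=False, followed_by_na=False):
    """Usage-based view: pick one category using (toy) contextual cues.

    The cues are hypothetical stand-ins for real context:
      followed_by_suru -> the word is used with する (verbal use)
      followed_by_na   -> the word is used with な (adjectival use)
    """
    candidates = LEXICON_CANDIDATES[lexicon_label]
    if followed_by_suru and "verbal_noun" in candidates:
        return "verbal_noun"
    if followed_by_na and "adjective" in candidates:
        return "adjective"
    return "noun"


verbal = usage_label("名詞-普通名詞-サ変形状詞可能", followed_by_suru=True)
plain = usage_label("名詞-普通名詞-一般", followed_by_suru=True)
```

The lexicon-based label stays constant for a word, while the usage-based label can differ from one occurrence to the next.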

We also separate certain issues — such as coordination structures, surface case frame, and scope of negation — from the bunsetsu-based dependency annotation.

Coordination structures cannot be expressed straightforwardly as dependency structures. Thus, we lose some information related to nested coordination, non-constituent conjunct coordination, and coordination between different syntactic categories when we project the coordination structure to the dependency structures. Therefore, we keep the coordinate structure information in a different layer of annotation from the bunsetsu-based dependency annotation. We also keep the surface case frame structures and the scope of negation in different layers.

The Universal Dependency label set includes syntactic roles such as ‘nsubj’, ‘dobj’ and ‘iobj’. These annotations are not provided by bunsetsu-based dependency annotation, and will instead be derived from predicate-argument relation annotations in future development.

The Universal Dependency label set discriminates whether the target is a clause or not. Unfortunately, the definition of ‘clause’ here is vague. We defined some heuristic rules to identify clauses: for example, the difference between acl (adjectival clause) and amod (adjectival modifier) is defined by whether the adjective has any overt case or not.
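The acl/amod heuristic can be read as a simple test on the dependents of the adjective. The sketch below is one possible reading, not the project's rule set: the classes `Dependent` and `Adjective`, the field `case_particle`, and the interpretation of ‘overt case’ as ‘at least one dependent marked by a case particle’ are assumptions introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Dependent:
    form: str
    case_particle: Optional[str] = None  # e.g. "が"; None if no overt case


@dataclass
class Adjective:
    form: str
    dependents: List[Dependent] = field(default_factory=list)


def adnominal_relation(adj):
    """Return 'acl' if the adjective governs an overtly case-marked
    dependent (assumed reading of the heuristic), otherwise 'amod'."""
    has_overt_case = any(d.case_particle for d in adj.dependents)
    return "acl" if has_overt_case else "amod"


# 背が高い人: the adjective 高い has the case-marked dependent 背 + が.
clausal = adnominal_relation(Adjective("高い", [Dependent("背", "が")]))
# 高い山: the adjective 高い has no dependents of its own.
simple = adnominal_relation(Adjective("高い"))
```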

Background

Here, we describe Japanese basic language resources, PoS-tagged lexicon/corpus, morphological analysers, syntactic dependency annotations, semantic dependency annotations (or case frame annotations), and syntactic phrase structure tree annotations.

Corpora with Annotations

Word Units

Overview of Word Units

Written Japanese sentences are not split into words or morphemes by the use of spaces or any other technique. Thus, we have several word unit standards that can be found in corpus annotation schema or the outputs of morphological analysers. They are described below.

IPADIC

This word unit standard (a lexicon annotated with morphological information) derives from the morphological analyser ChaSen. The morphological analyser MeCab, developed in 2001-2004, was developed independently of the lexicon; however, its default lexicon is IPADIC. NAIST-jdic is the successor of IPADIC. It resolves the licence issues in IPADIC and inherits the word unit definitions and PoS tagset of IPADIC.

NINJAL UniDic

NINJAL proposed several word unit standards for Japanese corpus linguistics, such as the minimum word unit, α word unit, β word unit, M word unit and so on (小椋ほか 2010a) (小椋ほか 2010b).
Since 2002, NINJAL has maintained UniDic, a lexicon annotated with morphological information, and proposes three word unit standards: Short Unit Word (SUW), Middle Unit Word (MUW) and Long Unit Word (LUW).

The UniDic has been maintained diachronically, and NINJAL has published versions of UniDic for several eras.

JUMANdic

This word unit standard derives from the morphological analyser JUMAN. Its units are longer than the SUW in UniDic, and several compound words are treated as single word units. See also ‘Morphological Analyser, JUMAN’ and the manual.

Morphological Analysers

Bunsetsu Unit (Base Phrase)

Overview of the Bunsetsu Unit

Japanese dependency structures tend to be annotated in bunsetsu units, to separate compound word construction issues (morphology) from syntactic dependency. However, bunsetsu-based dependency annotation leaves the attachment of NPs to compound verbs unresolved.

We have two bunsetsu unit standards: Kyoto Corpus Standard and NINJAL Standard.

Kyoto Corpus Standard

The Kyoto Corpus Standard is based on the pattern ‘Prefix + Content Word + Suffix or Function word’; that is, it is a bunsetsu standard based on JUMAN word units. Some functional multi-word expressions, such as ‘しようとする’, ‘Vざるをえない’ and ‘Vつつある’, are treated as one bunsetsu. (Manual)
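The ‘Prefix + Content Word + Suffix or Function word’ pattern can be sketched as a small chunker. This is a deliberately simplified illustration, not the Kyoto Corpus procedure: the coarse `kind` labels are assumed to be given, the function name `chunk_bunsetsu` is hypothetical, and compound content words and functional multi-word expressions are not handled.

```python
def chunk_bunsetsu(tokens):
    """Group (surface, kind) pairs into bunsetsu strings.

    kind is one of "prefix", "content", "suffix", "function".
    A new bunsetsu starts at a prefix, or at a content word that does
    not directly follow a prefix; suffixes and function words attach
    to the bunsetsu already open.
    """
    chunks = []
    prev_kind = None
    for surface, kind in tokens:
        starts_new = (
            prev_kind is None
            or (kind == "prefix" and prev_kind != "prefix")
            or (kind == "content" and prev_kind != "prefix")
        )
        if starts_new:
            chunks.append([])
        chunks[-1].append(surface)
        prev_kind = kind
    return ["".join(chunk) for chunk in chunks]


tokens = [
    ("お", "prefix"), ("茶", "content"), ("を", "function"),
    ("飲み", "content"), ("ます", "function"),
]
bunsetsu = chunk_bunsetsu(tokens)
```

Real bunsetsu segmentation depends on the underlying word unit standard, so the same sentence can be chunked differently under JUMAN units and UniDic units.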

NINJAL Standard

The NINJAL standard is based on the UniDic Long Unit Words definition. The main rule is ‘Content word + Functional Word’ in UniDic LUW. It also defines several functional multi-word expressions as one bunsetsu, such as ‘という’, ‘といった’, ‘かもしれない’, or ‘ことができる’. (小椋ほか 2010a) (小椋ほか 2010b)

PoS Tagging

IPADIC PoS Tagset

IPADIC and NAIST-jdic share the same PoS tagset. Currently, the lexicon based on this PoS tagset is not maintained.

JUMAN PoS Tagset

The JUMAN PoS Tagset is based on the Masuoka-Takubo PoS tagset (Masuoka and Takubo 1992).

UniDic PoS Tagset

UniDic defines two layers of PoS tagsets, one for Short Unit Words and the other for Long Unit Words. The PoS tagset for Short Unit Words is a ‘lexicon-based label’ (語彙主義) tagset in which PoS labels imply all possible usages in a context. In contrast, BCCWJ annotates the ‘usage’ of the PoS as additional PoS information. The PoS tagset for Long Unit Words uses ‘usage-based labels’ (用法主義) disambiguated by contextual information. (小椋ほか 2010a) (小椋ほか 2010b) Note that the term ‘usage-based’ here does not mean the same as in Langacker’s Usage-Based model.

Issues with Universal Dependency PoS Tagset

The Universal Dependency PoS tagset does not specify whether it is a lexicon-based or a usage-based PoS tagset. The Universal Dependency annotation for Japanese with BCCWJ uses the Short Unit Word as the word unit and the ‘usage’ of the SUW as the PoS.
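Aligning usage-based SUW tags with Universal PoS tags amounts to a lookup from one tagset to the other. The table below is a small illustrative sample, not the mapping used for the released corpora: the prefix-matching strategy, the function name `to_upos`, and the fallback to `X` are assumptions, and real conversions need context-sensitive exceptions that this sketch ignores.

```python
# Ordered from most to least specific, because matching is by prefix.
UPOS_BY_PREFIX = [
    ("名詞-固有名詞", "PROPN"),
    ("名詞", "NOUN"),
    ("動詞", "VERB"),
    ("形容詞", "ADJ"),
    ("助詞-格助詞", "ADP"),
    ("助動詞", "AUX"),
]


def to_upos(usage_tag):
    """Map a usage-based UniDic-style tag to a Universal PoS tag.

    Returns "X" when no prefix in the illustrative table matches.
    """
    for prefix, upos in UPOS_BY_PREFIX:
        if usage_tag.startswith(prefix):
            return upos
    return "X"


tags = ["名詞-固有名詞-人名-一般", "助詞-格助詞", "動詞-一般", "助動詞"]
upos_tags = [to_upos(tag) for tag in tags]
```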

General Description

Japanese syntactic dependency has the following properties.

We have several annotation schemata for dependency annotation. They are labelled but contain very limited syntactic information. Some of the information expressed by syntactic labels in UD is available in Japanese resources only through case frame or semantic role annotation (see next section).

Dependency Parsers:

Kyoto Corpus Schema

The Kyoto Corpus Schema is bunsetsu-based. The dependency tree is strictly head-final and projective. The schema defines four labels: D for normal dependency, P for coordination structure, I for dependency in non-constituent conjunct coordination (部分並列), and A for apposition.
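The two structural constraints, strictly head-final and projective, can be checked mechanically. The sketch below assumes a tree is given as a list `heads`, where `heads[i]` is the index of the head of bunsetsu `i` and `-1` marks the root; this encoding and the function names are choices made for illustration, and projectivity is tested here as ‘no two arcs cross’.

```python
def is_head_final(heads):
    """True if every non-root bunsetsu depends on a later bunsetsu."""
    return all(h == -1 or h > i for i, h in enumerate(heads))


def is_projective(heads):
    """True if no two dependency arcs cross."""
    arcs = [(min(i, h), max(i, h)) for i, h in enumerate(heads) if h != -1]
    for a, (l1, r1) in enumerate(arcs):
        for l2, r2 in arcs[a + 1:]:
            if l1 < l2 < r1 < r2 or l2 < l1 < r2 < r1:
                return False
    return True


nested = [1, 3, 3, -1]    # 0 -> 1, 1 -> 3, 2 -> 3, root = 3
crossing = [2, 3, 3, -1]  # arc 0 -> 2 crosses arc 1 -> 3
```

Under the head-final constraint the root is necessarily the last bunsetsu, since no later unit is available to serve as its head.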

CSJ Schema

CSJ is a speech corpus, and its annotation is also bunsetsu-based. The dependency structure is based on the Kyoto Corpus Schema and adds some labels: A2 for generic apposition (総称), R for anastrophe (倒置), B+ for resolution of discrepancy between bunsetsu unit and Accent Phrase Unit, F for filler (フィラー), C for conjunctive, E for interjection or exclamation, Y for greetings or apostrophe (呼びかけ), N for no head in an ungrammatical sentence, X for non-projective arc, and D for disfluency (言いよどみ). The label K marks ancient Japanese (古典), which is exempted from annotation. The label S marks ungrammatical case postposition assignment (格表示誤り).

BCCWJ-DepPara Schema

The BCCWJ-DepPara schema has two parts. The first is bunsetsu-based dependency using four labels: D for normal dependency, F for filler, no head, or face mark, Z for sentence boundary in nested sentences, and B for resolution of discrepancy between bunsetsu units. The second is nested coordination structure and apposition annotation, as in ‘Coordination Annotation for the Penn Treebank’.

Word Dependency in CSJ

Uchimoto (2008) proposed a word-based dependency annotation schema for CSJ. This is an extension of the CSJ schema. They annotated the internal dependency structure of the bunsetsu to resolve the discrepancy between accent phrases (maximal right-branching subtrees within a bunsetsu) and bunsetsu units. The annotation is related to the definition of the Middle Unit Word.

Bunsetsu-Based Syntactic Dependency Parsers

Semantic Dependency Annotation Schema

The dependency label set in the syntactic dependency annotated corpora is limited. We use case frame annotation or semantic role annotation, in which predicate-argument structures are annotated.

Phrase Structure Treebank

CCG Resources Derived from Multiple Dependency Corpora

Japanese phrase structure resources are limited. One study (Uematsu 2013) compiled CCG resources from several dependency corpora, including bunsetsu-based dependency from the Kyoto Text Corpus, predicate-argument structures from the NAIST Text Corpus, and the functions of the particle ‘と’ from a Japanese particle corpus. They proposed a method to integrate these resources into binary phrase structure trees with argument relations and to convert the trees into CCG resources. The CCG theory is based on Bekki (2010).

‘NTT Japanese Phrase Structure Treebank’

Tanaka and Nagata (2013) proposed a method to construct a phrase structure by retagging the examples in the work of Uematsu (2013). They also provide an n-ary version of a treebank, introducing phrase and functional tags as follows:

Other treebanks

Contributors

References

Corpora

Lexicon and PoS Tagsets

Dependency Annotation Schemata

Treebanking

Other Annotations