The UCCA Corpus Version 1.1 29/12/2015 =============== This bundle contains 369 passages annotated according to the foundational layer of UCCA. The passages are given as xmls in a format which is described below. The total number of tokens in this corpus is 158771. It also contains the annotation guidelines that were given to the annotators, a metadata file and a toy example XML. The dataset is a part of the UCCA project developed in the NLP lab of the Hebrew University by Omri Abend and Ari Rappoport. The users of this dataset are kindly requested to cite the following publication: "UCCA: A Novel Framework for Semantic Representation" / Omri Abend and Ari Rappoport, ACL 2013 Example passages can be graphically viewed through our web application (URL: vm-05.cs.huji.ac.il). Please refer to our website (URL: homepages.inf.ed.ac.uk/oabend/) or email (oabend@inf.ed.ac.uk) for regular updates on the UCCA project and available resources. Files included -------------- 1. The passages files in an XML format. file names are of the form "ucca_passageXXX.xml" where XXX is the passage ID. Please see the UCCA resource webpage for a software package for reading and using these files. 2. toy.xml: a toy example for explaining the UCCA xml format. 3. metadata: a file that contains some metadata for the passages. Specifically it contains the source of the text used (i.e., the Wikipedia article it was taken from), and the index of the annotator that did the final proof-reading (it can be 2,3 or 6). 4. guidelines.pdf: the annotation guidelines that were given to the annotators are summarized in this file named "UCCA in a nutshell". Concise definitions are available through the UCCA website as well. 5. short_defs.pdf: a brief summary of the categories used by UCCA's foundational layer. XML format: ----------- The xml format allows easy extension with further layers. The top level of each xml is composed of the layers annotated over the passage. Each layer has a unique ID and a set of nodes that it introduces. Each node specifies its outbound edges. The ID of a node is formatted as ".". Layer 0 is a special layer which specifies the tokens of the passage and their linear order. Its nodes are therefore the tokens themselves. Each node may either be of type "Word" or of type "Punctuation". The attribute "paragraph" specifies the number of the paragraph the terminal belongs to, while "paragraph_position" specifies the position of the terminal inside that paragraph. The attribute "text" specifies the written form of the terminal. Layer 1 is the foundational layer of UCCA. Although non-terminal nodes of UCCA are generally untyped (their type is effectively determined by their inbound and outbound edges), the xml format does separate the nodes into three coarse-grained types: (1) FN (regular node) (2) PNCT (a node whose only descendant is a punctuation terminal) (3) LKG (a linkage node). We note that the node type does not provide any additional information, as it can be deterministically derived from the identity of its edges. It is therefore only used for easier readability. Each node specifies its outbound edges through its "edge" elements. The ID of the node to which the edge is directed is specified by the attribute "toID". The type of the edge may be either of the following: (1) any of the 13 categories of the foundational layer (abbreviated as A,P,S,D,C,E,N,R,T,H,L,F,G; see paper). (2) LR (Link Relation) or LA (Link Argument) for edges between a linkage node and its Linker or Parallel Scenes, respectively. (3) Terminal for an edge to a word terminal. (4) U for an edge to a punctuation terminal. A node in layer 1 may also be a leaf that represents an implicit unit. In this case, the node would have an attribute "implicit" with the value "true". Layer 2 is left empty as a place holder where future layers (e.g., coreference, linkage type, information structure) can be represented. UCCA is designed to allow an open-ended set of layers to be annotated on top of a given passage. Toy example: ------------ The file toy.xml contains the annotation of a simple sentence "After Graduation, Mary moved to New York City". The terminals can be seen under the element . Consider now the nodes of the foundational layer (those under the element ). Consider the node whose ID is "1.1". It has 5 children, one is a Linker (and therefore the edge leading to it bears the type L), two are Parallel Scenes (the edge leading to them bear the type H), 2 are punctuation marks (the edges leading to them bear the type U). Note that edges leading to terminals (i.e., to nodes in layer0) bear the type 'Terminal'. Consider node "1.13". This node is of type LKG, which means it represents a linkage relation. It has three children, a Linker (i.e., the linkage relation; the edge has the tag 'LR'), and two linkage arguments (bear the type 'LA'). Licensing: ---------- The texts are taken from the English Wikipedia (http://en.wikipedia.org). The specific articles they were taken from are listed in the metadata file. The Wikipedia texts, as well as the UCCA annotation is distributed under the "Attribution-ShareAlike 3.0 Unported" license (http://creativecommons.org/licenses/by-sa/3.0/). Please follow the link for exact details. ACKNOWLEDGEMENTS: ----------------- We would like to thank Tomer Eshet for his partnering in developing the UCCA web application, and Amit Beka for his help with UCCA's development set and software tools. We would also like to thank our four annotators for hard and thorough work.