brat standoff format

Annotations created in brat are stored on disk in a standoff format: annotations are stored separately from the annotated document text, which is never modified by the tool.

For each text document in the system, there is a corresponding annotation file. The two are associated by a file naming convention: their base names (file name without suffix) are the same. For example, the file DOC-1000.ann contains annotations for the file DOC-1000.txt.
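The naming convention can be illustrated with a short Python sketch. The helper name annotation_path is hypothetical and not part of brat; it only swaps the suffix.

```python
from pathlib import Path

def annotation_path(text_path):
    """Return the .ann path that shares its base name with the given .txt path."""
    return Path(text_path).with_suffix(".ann")

print(annotation_path("DOC-1000.txt"))  # DOC-1000.ann
```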

Within the document, individual annotations are connected to specific spans of text through character offsets. For example, in a document beginning "Japan was today struck by ..." the text "Japan" is identified by the offset range 0..5. (All offsets are indexed from 0 and include the character at the start offset but exclude the character at the end offset.)
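These offset semantics happen to match Python string slicing, which makes them easy to check. The sketch below assumes the document has already been decoded into a Python string, so that indices count characters rather than bytes.

```python
text = "Japan was today struck by ..."

# The start offset is included, the end offset is excluded.
start, end = 0, 5
span = text[start:end]
print(span)  # Japan
```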

The specific standoff flavor used by brat is similar to the BioNLP Shared Task standoff format, and is described in detail below.

Text files (.txt)

Text files are expected to have the suffix .txt and contain the text of the original documents input into the system.

Sony formed a joint venture with Ericsson, a mobile phone company based in Sweden.

Sony announced today that ...

The document texts are stored in plain text files encoded using UTF-8 (an extension of ASCII, so plain ASCII texts also work). Document texts may contain newlines, which will be shown as line breaks by brat. However, it is not necessary for the documents to contain any newlines: brat can perform its own sentence segmentation for display using a reliable algorithm. (Whether or not newlines are included in the original text documents, the text files themselves are not modified.)

Annotation files (.ann)

Annotations are stored in files with the .ann suffix. The various annotation types that may be contained in these files are discussed in the following.

General annotation structure

All annotations follow the same basic structure: Each line contains one annotation, and each annotation is given an ID that appears first on the line, separated from the rest of the annotation by a single TAB character. The rest of the structure varies by annotation type.

Examples of annotations for an entity (T1), an event trigger (T2), an event (E1) and a relation (R1) are shown below.

T1	Organization 0 4	Sony
T2	MERGE-ORG 14 27	joint venture
T3	Organization 33 41	Ericsson
E1	MERGE-ORG:T2 Org1:T1 Org2:T3
T4	Country 75 81	Sweden
R1	Origin Arg1:T3 Arg2:T4

Detailed descriptions of these annotations are given below.

Text-bound annotations

Text-bound annotations are an important category of annotation related to both entity and event annotations. A text-bound annotation identifies a specific span of text and assigns it a type.

T1	Organization 0 4	Sony
T2	MERGE-ORG 14 27	joint venture

All text-bound annotations follow the same structure. As in all annotations, the ID occurs first and is delimited from the rest of the line with a TAB character. The primary annotation is given as a SPACE-separated triple (type, start-offset, end-offset). The start-offset is the index of the first character of the annotated span in the text (".txt" file), i.e. the number of characters in the document preceding it. The end-offset is the index of the first character after the annotated span. Thus, the character in the end-offset position is not included in the annotated span. For reference, the text spanned by the annotation is included, separated by a TAB character.

As of v1.3, brat also supports discontinuous text-bound annotations, where the annotation involves more than one continuous span of characters. The standoff representation for these annotations is a straightforward extension of the single-span case. For example, one possible annotation for "North and South America" would be represented as follows:

T1	Location 0 5;16 23	North America
T2	Location 10 23	South America

The (start-offset, end-offset) pairs forming a discontinuous annotation are separated by semicolons, and the texts of these spans are joined by single space characters to form the reference text of the annotation.
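As a minimal sketch of how a text-bound line could be read, the following Python code splits the line on TAB, then splits the offsets on semicolons and spaces, and finally rebuilds the reference text from the document. The function names parse_text_bound and covered_text are hypothetical and are not part of brat.

```python
def parse_text_bound(line):
    """Split a text-bound annotation line into ID, type, spans and reference text."""
    ann_id, header, reference = line.rstrip("\n").split("\t")
    ann_type, offsets = header.split(" ", 1)
    # Each "start end" pair is one continuous span; pairs are separated by ";".
    spans = [tuple(int(n) for n in pair.split(" ")) for pair in offsets.split(";")]
    return ann_id, ann_type, spans, reference

def covered_text(text, spans):
    """Join the texts of all spans with single spaces, as in the reference text."""
    return " ".join(text[start:end] for start, end in spans)

text = "North and South America"
ann_id, ann_type, spans, reference = parse_text_bound(
    "T1\tLocation 0 5;16 23\tNorth America"
)
print(spans)                      # [(0, 5), (16, 23)]
print(covered_text(text, spans))  # North America
```

Comparing covered_text against the stored reference text is one simple way to detect offsets that have drifted out of sync with the document.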

Annotation ID conventions

All annotation IDs consist of a single upper-case character identifying the annotation type and a number. The initial ID characters relate to annotation types as follows:

T: text-bound annotation
R: relation
E: event
A: attribute
M: modification (alias for attribute, for backward compatibility)
N: normalization [new in v1.3]
#: note

Additionally, an asterisk ("*") can be used as a placeholder for an ID in special cases.
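The ID conventions above can be sketched as a simple lookup on the first character of the ID. The table name ID_KINDS and the function annotation_kind are illustrative assumptions, not brat identifiers.

```python
# First character of the ID mapped to the annotation category listed above.
ID_KINDS = {
    "T": "text-bound annotation",
    "R": "relation",
    "E": "event",
    "A": "attribute",
    "M": "modification",
    "N": "normalization",
    "#": "note",
}

def annotation_kind(line):
    """Classify an annotation line by the ID that precedes the first TAB."""
    ann_id = line.split("\t", 1)[0]
    if ann_id == "*":
        return "placeholder"
    return ID_KINDS.get(ann_id[:1], "unknown")

print(annotation_kind("T1\tOrganization 0 4\tSony"))  # text-bound annotation
print(annotation_kind("*\tEquiv T1 T2 T3"))           # placeholder
```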

Entity annotations

Each entity annotation has a unique ID and is defined by type (e.g. Person or Organization) and the span of characters containing the entity mention (represented as a "start end" offset pair).

T1	Organization 0 4	Sony
T3	Organization 33 41	Ericsson
T4	Country 75 81	Sweden

Each line contains one text-bound annotation identifying the entity mention in text.

Event annotations

Each event annotation has a unique ID and is defined by type (e.g. MERGE-ORG), event trigger (the text stating the event) and arguments.

T2	MERGE-ORG 14 27	joint venture
E1	MERGE-ORG:T2 Org1:T1 Org2:T3

The event triggers, annotations marking the word or words stating each event, are text-bound annotations and their format is identical to that for entities. (The IDs of triggers occupy the same space as the IDs of entities, and these must not overlap.)

As for all annotations, the event ID occurs first, separated by a TAB character. The event trigger is specified as TYPE:ID and identifies the event type and its trigger through the ID. By convention, the event type is specified both in the trigger annotation and the event annotation. The event trigger is separated from the event arguments by SPACE. The event arguments are a SPACE-separated set of ROLE:ID pairs, where ROLE is one of the event- and task-specific argument roles (e.g. Theme, Cause, Site) and the ID identifies the entity or event filling that role. Note that several events can share the same trigger and that while the event trigger should be specified first, the event arguments can appear in any order.
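The event line structure described above could be read along the following lines. This is only a sketch: parse_event is a hypothetical helper, and arguments are kept as a list of pairs so that their order is preserved.

```python
def parse_event(line):
    """Split an event line into ID, type, trigger ID and (role, ID) argument pairs."""
    event_id, body = line.rstrip("\n").split("\t")
    # The first SPACE-separated field is TYPE:ID for the trigger, the rest are ROLE:ID.
    trigger, *args = body.split()
    event_type, trigger_id = trigger.split(":")
    arguments = [tuple(arg.split(":")) for arg in args]
    return event_id, event_type, trigger_id, arguments

print(parse_event("E1\tMERGE-ORG:T2 Org1:T1 Org2:T3"))
# ('E1', 'MERGE-ORG', 'T2', [('Org1', 'T1'), ('Org2', 'T3')])
```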

Relation annotations

Binary relations have a unique ID and are defined by their type (e.g. Origin, Part-of) and their arguments.

R1	Origin Arg1:T3 Arg2:T4

The format is similar to that applied for events, with the exception that the annotation does not identify a specific piece of text expressing the relation ("trigger"): the ID is separated by a TAB character, and the relation type and arguments by SPACE.

Relation arguments are commonly identified simply as Arg1 and Arg2, but the system can be configured to use any labels (e.g. Anaphor and Antecedent) in the standoff representation.
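A relation line can be handled in much the same way as an event line, minus the trigger. In the sketch below, parse_relation is a hypothetical name, and the arguments are stored in a dictionary keyed by whatever labels the configuration uses.

```python
def parse_relation(line):
    """Split a binary relation line into ID, type and a label-to-ID mapping."""
    rel_id, body = line.rstrip("\n").split("\t")
    rel_type, *args = body.split()
    arguments = dict(arg.split(":") for arg in args)
    return rel_id, rel_type, arguments

print(parse_relation("R1\tOrigin Arg1:T3 Arg2:T4"))
# ('R1', 'Origin', {'Arg1': 'T3', 'Arg2': 'T4'})
```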

Equivalence relations

The system also supports a special syntax for equivalence relations. Equivalence relations are symmetric and transitive relations that define sets of annotations to be equivalent in some sense (e.g. referring to the same real-world entity). Such relations can be represented in a compact way as a SPACE-separated list of the IDs of the equivalent annotations.

T1	Organization 0 43	International Business Machines Corporation
T2	Organization 45 48	IBM
T3	Organization 52 60	Big Blue
*	Equiv T1 T2 T3

For backward compatibility with existing standoff formats, brat also supports the special "empty" ID value "*" for equivalence relation annotations.
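Because equivalence is symmetric and transitive, one compact Equiv line stands for every pairwise relation among its members. The following sketch makes that expansion explicit; parse_equiv is a hypothetical helper, not a brat function.

```python
from itertools import combinations

def parse_equiv(line):
    """Split an equivalence line into its ID, type and list of member IDs."""
    ann_id, body = line.rstrip("\n").split("\t")
    rel_type, *members = body.split()
    return ann_id, rel_type, members

ann_id, rel_type, members = parse_equiv("*\tEquiv T1 T2 T3")
# Every unordered pair of members is implied to be equivalent.
pairs = list(combinations(members, 2))
print(pairs)  # [('T1', 'T2'), ('T1', 'T3'), ('T2', 'T3')]
```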

Attribute and modification annotations

Attribute annotations are binary or multi-valued "flags" that specify further aspects of other annotations. Attributes have a unique ID and are defined by reference to the ID of the annotation that the attribute marks and the attribute value.

A1	Negation E1
A2	Confidence E2 L1

As for other annotations, the ID is separated by TAB and other fields by SPACE.

Binary attributes such as A1 in the above example need only specify the attribute name and the ID of the marked annotation: the value true is implied for the binary attribute. The absence of a binary attribute annotation is interpreted as the attribute having the value false.

Multi-valued attributes also specify the attribute value, separated by SPACE. The values of multi-valued attributes are fully configurable.

For backward compatibility with existing standoff formats, brat also recognizes the ID prefix "M" for attributes.
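The difference between binary and multi-valued attributes can be sketched as follows. The helper parse_attribute is a hypothetical name; it returns True as the value when no explicit value is present, mirroring the implied value of a binary attribute.

```python
def parse_attribute(line):
    """Split an attribute line into ID, name, target ID and value."""
    attr_id, body = line.rstrip("\n").split("\t")
    fields = body.split()
    name, target = fields[0], fields[1]
    # A binary attribute carries no value field; its presence implies True.
    value = fields[2] if len(fields) > 2 else True
    return attr_id, name, target, value

print(parse_attribute("A1\tNegation E1"))       # ('A1', 'Negation', 'E1', True)
print(parse_attribute("A2\tConfidence E2 L1"))  # ('A2', 'Confidence', 'E2', 'L1')
```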

Normalization annotations

Normalization annotations are supported as of v1.3. Each normalization annotation has a unique ID and is defined by reference to the ID of the annotation that the normalization attaches to and a RID:EID pair identifying the external resource (RID) and the entry within that resource (EID). Additionally, each normalization annotation has the type Reference (no other values for the type are currently defined) and a human-readable string value for the entry referred to.

The following example shows a normalization annotation that attaches to the text-bound annotation "T1" (not shown) and associates it with the Wikipedia entry with the Wikipedia ID "534366" ("Barack Obama").

N1	Reference T1 Wikipedia:534366	Barack Obama

(Note that the association of the RID values such as "Wikipedia" or "GO" with the relevant external resources is not represented in the standoff but controlled by the tools.conf configuration file.)

As for text-bound annotations, the ID and the text are separated by TAB characters, and other fields (here, "Reference", "T1" and "Wikipedia:534366") by SPACE.
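A normalization line could be split along these lines. The sketch assumes an ID of the form N1 in the first TAB-separated field, following the ID conventions; parse_normalization and the dictionary keys are hypothetical names.

```python
def parse_normalization(line):
    """Split a normalization line into its ID, type, target, RID, EID and text."""
    norm_id, body, text = line.rstrip("\n").split("\t")
    norm_type, target, reference = body.split(" ")
    # RID names the external resource, EID the entry within that resource.
    rid, eid = reference.split(":", 1)
    return {
        "id": norm_id,
        "type": norm_type,
        "target": target,
        "rid": rid,
        "eid": eid,
        "text": text,
    }

norm = parse_normalization("N1\tReference T1 Wikipedia:534366\tBarack Obama")
print(norm["rid"], norm["eid"], norm["text"])  # Wikipedia 534366 Barack Obama
```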

Note annotations

Note annotations provide a way to associate freeform text with either the document or a specific annotation. Note lines begin with the number (or "hash") sign #.

#1	AnnotatorNotes T1	this annotation is suspect

Notes with an "ID" starting with # followed by a TAB character attach to specific annotations. For these notes, the second TAB-separated field contains a note type and the ID of the annotation that the note is attached to, and the third TAB-separated field contains the text of the note.

The note type can be freely assigned and any number of notes can be attached to a single annotation. (However, currently only a single note of type AnnotatorNotes can be edited from the brat UI.)
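A note line that attaches to a specific annotation has three TAB-separated fields and could be read as in the sketch below. The ID "#1" and the helper name parse_note are assumptions made for illustration.

```python
def parse_note(line):
    """Split a note line into ID, note type, target annotation ID and note text."""
    # Split on at most two TABs so that the freeform text is kept in one piece.
    note_id, header, note_text = line.rstrip("\n").split("\t", 2)
    note_type, target = header.split(" ")
    return note_id, note_type, target, note_text

print(parse_note("#1\tAnnotatorNotes T1\tthis annotation is suspect"))
# ('#1', 'AnnotatorNotes', 'T1', 'this annotation is suspect')
```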