brat normalization

Note: normalization features are new in v1.3 (Crunchy Frog). Please make sure you're running v1.3 or newer before attempting to set up normalization.

Introduction

Popup showing additional information from Wikipedia for a normalized annotation.

The identification of the real-world entities that are referred to in text is an important part of analysing the meaning of text. Many specific task formulations that involve the association of text with entries in external resources such as ontologies and entity databases are studied, with tasks termed variously as normalization, grounding, entity linking and wikification.

As of v1.3, brat implements a number of features for supporting the visualization and manual creation of such annotations:

Normalization annotation primitive associating brat annotations with entries in external resources
Visualization functionality showing information from ontological and database resources
Tools for creating, accessing and managing normalization databases
Comprehensive annotation support for normalization, including fast approximate string search for large databases

This page briefly introduces these features and how to set brat up for the visualization and creation of normalization annotations.

Requirements

The brat normalization system uses SimString for approximate string matching. Before trying to set up brat for normalization, please install SimString (and optionally its python bindings) following the instructions on the SimString homepage. If python bindings are not installed, make sure the simstring executable path is set up correctly in the configuration file.

Quick start

If you want to set up normalization in brat using your own data, skip ahead to the database setup section. To quickly test the brat normalization functionality on some example data, follow these steps:

In the brat root directory, run the following command:

    python tools/norm_db_init.py example-data/normalisation/Wiki.txt

This will create a normalization DB using a small sample of data from the English Wikipedia (generated from a Freebase Wikipedia data dump).

Next, open the tools.conf file using your favorite text editor and add the following line to the [normalization] section (as one line):

    Wiki    <URL>:http://en.wikipedia.org, <URLBASE>:http://en.wikipedia.org/?curid=%s

This tells brat that a DB with the name "Wiki" is set up for normalization and provides links to the homepage and online term lookup that will be shown in the brat UI.

Then open the annotation.conf file and enable normalization for each entity and event type where it is applicable, like this:

    Person  <NORM>:Wiki
    Family  Arg1:Person, Arg2:Person, <NORM>:Wiki

That's all! Now, when creating annotations, you will be able to search (a small sample of) Wikipedia from within brat and attach normalization annotations to other annotations, as shown in the next section.

Normalization annotation

After setting up a normalization database and editing tools.conf to configure brat to use it, brat is ready to create normalization annotations. Note that you should reload any open brat windows for the configuration change to take effect.

When creating any annotation marking a text span such as an entity annotation, the "Edit annotation" dialog should now include an additional section for normalization:

The List button opens a list of all normalizations that have already been added to this span, and allows adding, editing and deleting normalizations. The parenthesised number on the button is the number of existing normalizations. To allow a new normalization to be added quickly, there is also a Quick Add button, which skips the normalization list dialog.

It is only possible to add normalizations that are configured as appropriate for the span type. If there are any normalizations already present that do not match the span type, they will be marked in the normalization list dialog with a red exclamation mark. Each normalization shows the normalization database, the ID of the normalization in that database, and the canonical name. If the normalization database has a direct URL pattern configured, the magnifier icon will open the corresponding web page. The Edit and Add buttons will open the normalization search dialog. Note that the changes to the normalization list are not saved unless both the normalization list dialog and the span dialog are confirmed with the OK button.

To create a normalization annotation, it's often necessary to search for the unique identifier (ID) of the intended entry, as IDs are typically just meaningless sequences of characters. The normalization search dialog provides access to two alternative searches, after selecting the desired normalization database:

entering a query in the query text input field and clicking the Search button (or equivalently, hitting Enter) will use the normalization DB search functionality within brat (shown below)
clicking on the magnifying glass icon next to the normalization database selector opens a separate page with the resource URL (as configured in tools.conf by the <URL> setting), providing fast access to "native" search functionality that can be used to determine the ID, which can then be copied into the brat dialog.

The search dialog can be used to query the brat normalization DB using approximate string matching, which allows for variation and typos (e.g. "Barak Obama"). The correct entry can be selected and its ID associated with the annotation simply by double-clicking on it.
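To illustrate why a misspelled query such as "Barak Obama" can still find the intended entry, the following is a minimal sketch of approximate matching using character trigrams and set-based cosine similarity. It uses only the Python standard library and is not SimString or brat code; the function names and the 0.6 threshold are assumptions chosen for this illustration.

```python
import math


def trigrams(text, n=3):
    """Return the set of lowercase character n-grams in text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def cosine(a, b):
    """Set-based cosine similarity between two n-gram sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def approximate_search(query, names, threshold=0.6):
    """Return (name, score) pairs scoring at least threshold, best first."""
    q = trigrams(query)
    scored = [(name, cosine(q, trigrams(name))) for name in names]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Hypothetical list of names; a real DB would hold many more.
names = ["Barack Obama", "Google", "A Clockwork Orange"]
results = approximate_search("Barak Obama", names)
print(results)
```

An exact lookup for "Barak Obama" would find nothing in this list, whereas the trigram comparison still ranks "Barack Obama" above the threshold because most of its trigrams are shared with the query.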

If you're going to be working on normalization annotation on a brat server administered by someone else, that's all you need to know! If you need to set up a normalization database and configure brat for normalization yourself, read on.

Database setup

Before brat can be used for normalization to a specific resource, it is necessary to create a brat normalization DB for that resource. This can be easily done using tools distributed with brat.

Given a file with the normalization data in the text-based brat normalization data format (examples are found in example-data/normalisation/), a new normalization DB can be set up simply by running the following command in the brat root directory:

    python tools/norm_db_init.py FILE

where FILE is the name of the file containing the normalization data. This will create a brat normalization DB, by default with the same name as FILE (without filename suffix), storing the DB in the brat work/ directory.

To see the options for this tool, run

    python tools/norm_db_init.py -h

For information on testing a normalization DB using command-line tools, see normalization DB tools.

After creating a normalization DB, it's necessary to edit the tools.conf file to configure normalization.

Configuration

Normalization configuration is part of the tools.conf file, where the relevant settings are contained in the [normalization] section. The full syntax of each line in this section is as follows:

    DBNAME     DB:DBPATH, <URL>:HOMEURL, <URLBASE>:ENTRYURL, <UNICODE>:TRUEorFALSE

Here, DB, <URL>, <URLBASE> and <UNICODE> are literal strings (they should appear as written here), while "DBNAME", "DBPATH", "HOMEURL", "ENTRYURL" and "TRUEorFALSE" should be replaced with specific values appropriate for the database being configured:

DBNAME: sets the database name (e.g. "Wiki", "GO"). The name can be otherwise freely selected, but should not contain characters other than alphanumeric ("a"-"z", "A"-"Z", "0"-"9"), hyphen ("-") and underscore ("_"). This name will be used both in the brat UI and in the annotation file to identify the DB.
DB (optional): provides the file system path to the normalization DB data on the server, relative to the brat server root. If the DB parameter isn't set, the system assumes the database can be found in the default location under the given DBNAME.
URL: sets the URL for the home page of the normalization resource (e.g. "http://en.wikipedia.org/wiki/"). Used both to identify the resource more specifically than DBNAME and to provide a link in the annotation UI for accessing the resource.
URLBASE (optional): sets a URL template (e.g. "http://en.wikipedia.org/?curid=%s") that can be filled in to generate a direct link in the annotation UI to an entry in the normalization resource. The value must contain the characters %s as a placeholder that will be replaced with the ID of the entry.
UNICODE (optional): specifies whether the database is created as Unicode or binary. Note that the SimString bindings library only handles binary databases. The accepted values are: true, t, 1, yes, y, on, false, f, 0, no, n, off.
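As a rough illustration of how the pieces of a [normalization] line fit together, the sketch below splits a line into its DBNAME and settings, checks the DBNAME character restrictions, interprets the <UNICODE> values listed above, and fills in the <URLBASE> template. This is not brat's own configuration parser; the function name and the simple comma-splitting (which assumes values contain no commas) are assumptions made for the example.

```python
import re

TRUE_VALUES = {"true", "t", "1", "yes", "y", "on"}
FALSE_VALUES = {"false", "f", "0", "no", "n", "off"}


def parse_norm_config_line(line):
    """Parse one [normalization] line into (dbname, settings dict)."""
    dbname, rest = line.strip().split(None, 1)
    if not re.fullmatch(r"[A-Za-z0-9_-]+", dbname):
        raise ValueError("invalid DBNAME: %r" % dbname)
    settings = {}
    for part in rest.split(","):
        # Split on the first colon only, since URLs contain colons.
        key, value = part.strip().split(":", 1)
        settings[key] = value.strip()
    if "<UNICODE>" in settings:
        flag = settings["<UNICODE>"].lower()
        if flag not in TRUE_VALUES | FALSE_VALUES:
            raise ValueError("invalid <UNICODE> value: %r" % flag)
        settings["<UNICODE>"] = flag in TRUE_VALUES
    return dbname, settings


line = ("Wiki    <URL>:http://en.wikipedia.org, "
        "<URLBASE>:http://en.wikipedia.org/?curid=%s")
dbname, settings = parse_norm_config_line(line)

# Replace the %s placeholder with an entry ID to get a direct link.
entry_url = settings["<URLBASE>"].replace("%s", "534366")
print(dbname, entry_url)
```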

Note that individual collections can have their own tools.conf files, so different annotation projects can have different normalization settings.

To use a normalization database in annotation, the entity and event specifications in annotation.conf additionally need a parameter <NORM>:DB1|DB2|..., specifying which databases are relevant to each span type.

After creating a normalization DB and configuring brat to use it, brat is ready for normalization annotation. For information on how to format any dataset for use for brat normalization, see the next section.

Normalization DB file format

To make it easier to set up brat for normalization annotation using new resources, brat defines a simple text-based file format that can be used to import data into the brat normalization system.

Each line in the input file should have the following format:

    ID <TAB> TYPE1:LABEL1:STRING1 <TAB> TYPE2:LABEL2:STRING2 [...]

Where the ID is the unique ID used in normalization, and the TYPE:LABEL:STRING triplets provide various information associated with the ID. (<TAB> is the literal tab character "\t".)

Each TYPE must be one of the following:

"name": STRING is name or alias
"attr": STRING is non-name attribute
"info": STRING is non-searchable additional information

Each LABEL provides a human-readable label used in the normalization UI for the STRING. LABEL values are not used for querying.

For example, for normalization to Wikipedia, the input could contain lines such as the following:

    843          name:Title:A Clockwork Orange      attr:Category:book
    1659954      name:Title:A Clockwork Orange      attr:Category:film

These lines specify that "A Clockwork Orange" is the title of both a book and a film, allowing normalization annotation to differentiate between the two.
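When converting an existing resource, lines in this format can be generated programmatically. The following is a minimal sketch, assuming the source data is already available as Python tuples; the function name and the in-memory record layout are hypothetical and not part of brat.

```python
ALLOWED_TYPES = {"name", "attr", "info"}


def format_entry(entry_id, fields):
    """Format one entry: the ID, then tab-separated TYPE:LABEL:STRING triplets."""
    parts = [str(entry_id)]
    for ftype, label, string in fields:
        if ftype not in ALLOWED_TYPES:
            raise ValueError("unknown TYPE: %r" % ftype)
        if "\t" in string or "\n" in string:
            # Tabs and newlines would break the line-based, tab-separated layout.
            raise ValueError("STRING must not contain tabs or newlines")
        parts.append("%s:%s:%s" % (ftype, label, string))
    return "\t".join(parts)


entries = [
    ("843", [("name", "Title", "A Clockwork Orange"),
             ("attr", "Category", "book")]),
    ("1659954", [("name", "Title", "A Clockwork Orange"),
                 ("attr", "Category", "film")]),
]
lines = [format_entry(entry_id, fields) for entry_id, fields in entries]
print("\n".join(lines))
```

The resulting lines can be written to a text file and passed to the DB creation tool.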

Each entry must have at least one name, but all other information is optional. There is no need for any of the fields (or their values) to be unique; for example, the following is a valid entry:

    534366       name:Name:Barack Obama      name:Name:Obama      attr:Category:person      attr:Category:US president

Given a file in this format, a normalization DB can be created as described in database setup. A number of example files in this format are found in example-data/normalisation/ directory in the brat installation.
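Before creating a DB from a new file, it can help to check that every line follows the format. The sketch below is one possible checker based only on the rules stated in this section (known TYPE values, three-part triplets, at least one name per entry). It is not a tool shipped with brat, and treating a colon inside STRING as part of STRING is an assumption.

```python
ALLOWED_TYPES = {"name", "attr", "info"}


def parse_entry(line):
    """Parse one line into (entry_id, [(type, label, string), ...])."""
    fields = line.rstrip("\n").split("\t")
    entry_id, triplets = fields[0], []
    if not entry_id:
        raise ValueError("missing ID")
    for field in fields[1:]:
        # Assumption: split on the first two colons only, so that a colon
        # inside STRING stays part of STRING.
        parts = field.split(":", 2)
        if len(parts) != 3:
            raise ValueError("expected TYPE:LABEL:STRING, got %r" % field)
        if parts[0] not in ALLOWED_TYPES:
            raise ValueError("unknown TYPE %r" % parts[0])
        triplets.append(tuple(parts))
    if not any(ftype == "name" for ftype, _, _ in triplets):
        raise ValueError("entry has no name")
    return entry_id, triplets


def check_lines(lines):
    """Return a list of (line_number, message) for lines that fail to parse."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            parse_entry(line)
        except ValueError as error:
            problems.append((number, str(error)))
    return problems


sample = [
    "534366\tname:Name:Barack Obama\tname:Name:Obama\tattr:Category:person",
    "99\tattr:Category:book",   # no name triplet
    "100\tname:Google",         # missing LABEL
]
problems = check_lines(sample)
for number, message in problems:
    print("line %d: %s" % (number, message))
```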

Normalization DB tools

brat also provides some command-line tools for working with normalization DBs. These are found along with many other tools in the tools/ directory of the brat installation.

The script norm_db_lookup.py can be used to retrieve the ID and other information stored in a normalization DB for a given entry name. This tool can be helpful for troubleshooting normalization DB setup. An example session (user input is the initial command and the text following each >>> prompt):

    python tools/norm_db_lookup.py Wiki
    >>> Google
     1092923	Name:Google		Info:Google Inc. is an American multinational Internet and software corporation [...]
    >>> Barak Obama
     (no record found for 'Barak Obama')
    >>> Barack Obama
     534366	Name:Barack Obama		Info: Barack Hussein Obama II is the 44th and current President of the United States.
    >>> [CTRL-D]

(Note that this tool does not perform approximate string matching.)

Standoff format for normalization

The brat normalization support involves a new category of annotation in the standoff format used to store brat annotations. (If you don't work directly with the brat standoff format, you can skip these technical details.)

First, brat v1.3 is fully backward compatible with previous versions, and any standoff file created in previous versions of brat is also valid for v1.3. The text file format and the standoff format for annotation primitives defined in previous versions of brat (text-bound annotations, relations, etc.) are also unchanged in v1.3. To understand the basic brat standoff format, see the standoff format description.

To support normalization, v1.3 introduces a new category of annotation, normalization. Each normalization annotation has a unique ID and is defined by reference to the ID of the annotation that the normalization attaches to and a RID:EID pair identifying the external resource (RID) and the entry within that resource (EID). Additionally, each normalization annotation has the type Reference (no other values for the type are currently defined) and a human-readable string value for the entry referred to.

The following example shows a normalization annotation that is attached to the text-bound annotation "T1" (not shown) and associates it with the Wikipedia entry with the Wikipedia ID "534366" ("Barack Obama").

    N1	Reference T1 Wikipedia:534366	Barack Obama

As for text-bound annotations, the ID and the text are separated by TAB characters, and other fields (here, "Reference", "T1" and "Wikipedia:534366") by SPACE.

The IDs of normalization annotations follow the general ID conventions in brat and consist of the upper-case character "N" and a number.
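For readers who process standoff files with their own scripts, the field layout described above can be read as in the following minimal sketch. It is not brat's own parser; the function and tuple names are hypothetical, and the ID "N1" in the sample line is an illustrative value following the "N" plus number convention.

```python
from collections import namedtuple

Normalization = namedtuple("Normalization", "id type target rid eid text")


def parse_normalization(line):
    """Parse one normalization annotation line from a standoff file."""
    # The ID, the middle fields and the text are separated by TAB characters.
    ann_id, body, text = line.rstrip("\n").split("\t")
    if not (ann_id.startswith("N") and ann_id[1:].isdigit()):
        raise ValueError("not a normalization ID: %r" % ann_id)
    # The middle fields are separated by SPACE.
    ann_type, target, reference = body.split(" ")
    # RID:EID pair; split on the first colon only.
    rid, eid = reference.split(":", 1)
    return Normalization(ann_id, ann_type, target, rid, eid, text)


norm = parse_normalization("N1\tReference T1 Wikipedia:534366\tBarack Obama")
print(norm)
```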

Troubleshooting

Normalization configuration fails to take effect

It is necessary to reload the collection for changes to configuration to take effect. This can be done either by navigating to a different collection and back, or simply by reloading the brat page in the browser.

If the issue persists after reloading the collection, it may be that brat is reading a different tools.conf file than the one that was edited to configure normalization. For example, if the collection contains a tools.conf, that configuration will be used in favor of ones in containing collections and the brat root directory.
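To check which tools.conf applies to a given collection, one option is to walk upward from the collection directory until a tools.conf is found. The sketch below mirrors the precedence described above but is not brat code: the function name is hypothetical, the directory layout in the demonstration is made up, and the exact lookup performed by the server may differ in detail.

```python
import os
import shutil
import tempfile


def find_tools_conf(collection_dir, root_dir):
    """Return the nearest tools.conf at or above collection_dir, up to root_dir."""
    current = os.path.abspath(collection_dir)
    root = os.path.abspath(root_dir)
    while True:
        candidate = os.path.join(current, "tools.conf")
        if os.path.isfile(candidate):
            return candidate
        parent = os.path.dirname(current)
        if current == root or parent == current:
            return None  # reached the brat root (or filesystem root)
        current = parent


# Demonstration with a temporary, hypothetical directory layout.
root = tempfile.mkdtemp()
proj_a = os.path.join(root, "data", "projA")
proj_b = os.path.join(root, "data", "projB")
os.makedirs(proj_a)
os.makedirs(proj_b)
for path in (os.path.join(root, "tools.conf"),
             os.path.join(proj_a, "tools.conf")):
    with open(path, "w") as handle:
        handle.write("[normalization]\n")

found_a = find_tools_conf(proj_a, root)  # the collection's own file
found_b = find_tools_conf(proj_b, root)  # falls back to the root file
print(found_a)
print(found_b)
shutil.rmtree(root)
```

If the path printed for a collection is not the file that was edited, the normalization settings need to go into the file that is actually found first.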

Search fails to return expected results

If brat is showing a specific error message when a search is performed, this message should indicate what the specific issue is.

If search fails to return expected results without giving any specific error message, first make sure that the data that was used to create the DB is correctly formatted. Note that search requires a match against a name field in the data; it's not enough for the query string to match e.g. just part of an info field.

The norm_db_lookup.py tool provided with brat can be used to check DB contents from the command line.