Note: normalization features are new in v1.3 (Crunchy Frog). Please make sure you're running v1.3 or newer before attempting to set up normalization.
The identification of the real-world entities that are referred to in text is an important part of analysing the meaning of text. Many specific task formulations that involve the association of text with entries in external resources such as ontologies and entity databases are studied, with tasks termed variously as normalization, grounding, entity linking and wikification.
As of v1.3, brat implements a number of features for supporting the visualization and manual creation of such annotations:
- Normalization annotation primitive associating brat annotations with entries in external resources
- Visualization functionality showing information from ontological and database resources
- Tools for creating, accessing and managing normalization databases
- Comprehensive annotation support for normalization, including fast approximate string search for large databases
This page briefly introduces these features and how to set brat up for the visualization and creation of normalization annotations.
The brat normalization system uses SimString for approximate string matching. Before trying to set up brat for normalization, please install SimString and its Python bindings following the instructions on the SimString homepage.
If you want to set up normalization in brat using your own data, skip this section and go straight to database setup. To quickly test the brat normalization functionality on some example data, just follow these easy steps:
In the brat root directory, run the following command:
python tools/norm_db_init.py example-data/normalisation/Wiki.txt
Next, open the tools.conf file using your favorite text editor and add the following line to the [normalization] section (as one line):
Wiki <URL>:http://en.wikipedia.org, <URLBASE>:http://en.wikipedia.org/?curid=%s
This tells brat that a DB with the name "Wiki" is set up for normalization and provides links to the homepage and online term lookup that will be shown in the brat UI.
That's all! Now, when creating annotations, you will be able to search (a small sample of) Wikipedia from within brat and attach normalization annotations to other annotations, as shown in the next section.
After setting up a normalization database and editing tools.conf to configure brat to use it, brat is ready to create normalization annotations. Note that you should reload any open brat windows for the configuration change to take effect.
When creating any annotation marking a text span such as an entity annotation, the "Edit annotation" dialog should now include an additional section for normalization:
To create a normalization annotation, it's often necessary to search for the unique identifier (ID) of the intended entry, as IDs are typically just meaningless sequences of numbers. The brat UI provides access to two alternative searches:
- clicking on the box with "Click here to search" opens the built-in normalization DB search functionality within brat (shown below)
- clicking on the magnifying glass logo opens a separate page with the resource URL (as configured in tools.conf by the <URL> setting), providing fast access to "native" search functionality that can be used to determine the ID, which can then be copied into the brat dialog.
The search dialog can be used to query the brat normalization DB using approximate string matching, which allows for variation and typos (e.g. "Barak Obama"). The correct entry can be selected and its ID associated with the annotation simply by double-clicking on it.
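To give a rough idea of how approximate string matching can tolerate a typo such as "Barak Obama", the following minimal Python sketch scores candidate names by cosine similarity over sets of character trigrams. It is an illustration only: it is not SimString's code or API, and the function names, the padding scheme and the 0.6 threshold are assumptions made for this example.

```python
import math


def char_ngrams(text, n=3):
    """Return the set of character n-grams of text (lower-cased, space-padded)."""
    padded = " " * (n - 1) + text.lower() + " " * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def cosine(a, b):
    """Cosine similarity between the n-gram sets of two strings."""
    x, y = char_ngrams(a), char_ngrams(b)
    if not x or not y:
        return 0.0
    return len(x & y) / math.sqrt(len(x) * len(y))


def search(query, names, threshold=0.6):
    """Return (score, name) pairs at or above the threshold, best match first."""
    scored = [(cosine(query, name), name) for name in names]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


# The misspelled query still retrieves the intended entry.
print(search("Barak Obama", ["Barack Obama", "Google", "A Clockwork Orange"]))
```

Because the misspelled and correct names share most of their trigrams, the correct entry scores well above unrelated names, which is the behavior the search dialog relies on.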
After filling in the ID, the brat annotation dialog is updated to also show a direct link to the entry in the resource (as configured in tools.conf by the <URLBASE> setting).
If you're going to be working on normalization annotation on a brat server administered by someone else, that's all you need to know! If you need to set up a normalization database and configure brat for normalization yourself, read on.
Before brat can be used for normalization to a specific resource, it is necessary to create a brat normalization DB for that resource. This can be easily done using tools distributed with brat.
Given a file with the normalization data in the text-based brat normalization data format (examples are found in example-data/normalisation/), a new normalization DB can be set up simply by running the following command in the brat root directory:
python tools/norm_db_init.py FILE
Here, FILE is the name of the file containing the normalization data. This will create a brat normalization DB, by default with the same name as FILE (without filename suffix), and store it in brat's default location for normalization DBs.
To see the options for this tool, run
python tools/norm_db_init.py -h
For information on testing a normalization DB using command-line tools, see normalization DB tools.
After creating a normalization DB, it's necessary to edit the tools.conf file to configure normalization. Normalization configuration is part of the tools.conf file, where the relevant settings are contained in the [normalization] section. The full syntax of each line in this section is as follows:
DBNAME DB:DBPATH, <URL>:HOMEURL, <URLBASE>:ENTRYURL
Here, "DB", "<URL>" and "<URLBASE>" are literal strings (they should appear as written here), while "DBNAME", "DBPATH", "HOMEURL" and "ENTRYURL" should be replaced with specific values appropriate for the database being configured:
DBNAME: sets the database name (e.g. "Wiki", "GO"). The name can be otherwise freely selected, but should not contain characters other than alphanumeric ("a"-"z", "A"-"Z", "0"-"9"), hyphen ("-") and underscore ("_"). This name will be used both in the brat UI and in the annotation file to identify the DB.
DBPATH (optional): provides the file system path to the normalization DB data on the server, relative to the brat server root. If DBPATH isn't set, the system assumes the DB can be found in the default location under the given DBNAME.
HOMEURL: sets the URL for the home page of the normalization resource (e.g. "http://en.wikipedia.org/wiki/"). Used both to identify the resource more specifically than DBNAME and to provide a link in the annotation UI for accessing the resource.
ENTRYURL (optional): sets a URL template (e.g. "http://en.wikipedia.org/?curid=%s") that can be filled in to generate a direct link in the annotation UI to an entry in the normalization resource. The value should contain the characters "%s" as a placeholder that will be replaced with the ID of the entry.
Note that individual collections can have their own tools.conf files, so different annotation projects can have different normalization settings.
After creating a normalization DB and configuring brat to use it, brat is ready for normalization annotation. For information on how to format any dataset for use for brat normalization, see the next section.
Normalization DB file format
To make it easier to set up brat for normalization annotation using new resources, brat defines a simple text-based file format that can be used to import data into the brat normalization system.
Each line in the input file should have the following format:
ID <TAB> TYPE1:LABEL1:STRING1 <TAB> TYPE2:LABEL2:STRING2 [...]
Here, ID is the unique ID used in normalization, and the TYPE:LABEL:STRING triples provide various information associated with the ID. (<TAB> is the literal tab character "\t".)
TYPE must be one of the following:
- name: STRING is a name or alias
- attr: STRING is a non-name attribute
- info: STRING is non-searchable additional information
LABEL provides a human-readable label used in the normalization UI for the STRING. LABEL values are not used for querying.
For example, for normalization to Wikipedia, the input could contain lines such as the following:
843	name:Title:A Clockwork Orange	attr:Category:book
1659954	name:Title:A Clockwork Orange	attr:Category:film
This specifies that "A Clockwork Orange" is the title of both a book and a film, allowing normalization annotation to differentiate between the two.
Each entry must have at least one name, but all other information is optional. There is no need for any of the fields (or their values) to be unique; for example, the following is a valid entry:
534366	name:Name:Barack Obama	name:Name:Obama	attr:Category:person	attr:Category:US president
Given a file in this format, a normalization DB can be created as described in database setup. A number of example files in this format are found in the brat example-data/normalisation/ directory.
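When preparing a new resource, it can help to validate the data file before importing it. The sketch below parses and formats lines in the format described above, assuming the TYPE values are "name", "attr" and "info" and that a STRING may itself contain ":" characters. The function names are hypothetical; this is not the parser used by tools/norm_db_init.py.

```python
# Assumed set of TYPE values for the normalization data format.
VALID_TYPES = {"name", "attr", "info"}


def parse_norm_line(line):
    """Parse 'ID<TAB>TYPE:LABEL:STRING<TAB>...' into (id, list of triples)."""
    fields = line.rstrip("\n").split("\t")
    entry_id, triples = fields[0], []
    for field in fields[1:]:
        # Split on the first two ':' only, so STRING may contain colons.
        parts = field.split(":", 2)
        if len(parts) != 3:
            raise ValueError("expected TYPE:LABEL:STRING, got %r" % field)
        type_, label, string = parts
        if type_ not in VALID_TYPES:
            raise ValueError("unknown TYPE %r in entry %s" % (type_, entry_id))
        triples.append((type_, label, string))
    # Each entry must have at least one name.
    if not any(type_ == "name" for type_, _, _ in triples):
        raise ValueError("entry %s has no name" % entry_id)
    return entry_id, triples


def format_norm_line(entry_id, triples):
    """Inverse of parse_norm_line: build one tab-separated data line."""
    return "\t".join([str(entry_id)] + [":".join(t) for t in triples])


line = ("534366\tname:Name:Barack Obama\tname:Name:Obama"
        "\tattr:Category:person\tattr:Category:US president")
entry_id, triples = parse_norm_line(line)
names = [string for type_, _, string in triples if type_ == "name"]
print(entry_id, names)
```

Applying parse_norm_line to every line of a data file and reporting the line number on failure is one simple way to find formatting problems before the DB is created.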
Normalization DB tools
brat also provides some command-line tools for working with normalization DBs. These are found along with many other tools in the tools/ directory of the brat installation.
norm_db_lookup.py can be used to retrieve the ID and other information stored in a normalization DB for a given entry name. This tool can be helpful for troubleshooting normalization DB setup. An example session (user input is the initial command and the text after each ">>>" prompt):
python tools/norm_db_lookup.py Wiki
>>> Google
1092923
Name:Google
Info:Google Inc. is an American multinational Internet and software corporation [...]
>>> Barak Obama
(no record found for 'Barak Obama')
>>> Barack Obama
534366
Name:Barack Obama
Info: Barack Hussein Obama II is the 44th and current President of the United States.
>>> [CTRL-D]
(Note that this tool does not perform approximate string matching.)
Standoff format for normalization
The brat normalization support involves a new category of annotation in the standoff format used to store brat annotations. (If you don't work directly with the brat standoff format, you can skip these technical details.)
First, brat v1.3 is fully backward compatible with previous versions, and any standoff file created in previous versions of brat is also valid for v1.3. The text file format and the standoff format for annotation primitives defined in previous versions of brat (text-bound annotations, relations, etc.) are also unchanged in v1.3. To understand the basic brat standoff format, see the standoff format description.
To support normalization, v1.3 introduces a new category of annotation, normalization. Each normalization annotation has a unique ID and is defined by reference to the ID of the annotation that the normalization attaches to and a RID:EID pair identifying the external resource (RID) and the entry within that resource (EID). Additionally, each normalization annotation has the type Reference (no other values for the type are currently defined) and a human-readable string value for the entry referred to.
The following example shows a normalization annotation that is attached to the text-bound annotation "T1" (not shown) and associates it with the Wikipedia entry with the Wikipedia ID "534366" ("Barack Obama"):
N1	Reference T1 Wikipedia:534366	Barack Obama
As for text-bound annotations, the ID and the text are separated by TAB characters, and other fields (here, "Reference", "T1" and "Wikipedia:534366") by SPACE.
The IDs of normalization annotations follow the general ID conventions in brat and consist of the upper-case character "N" and a number.
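For readers who process standoff files with their own scripts, the following Python sketch parses a single normalization annotation line according to the description above: TAB-separated ID, body and text, with the body split by SPACE into type, target and RID:EID. The Normalization tuple and the parse_normalization function are hypothetical names introduced for this example, and the sample line is constructed from the description here rather than taken from a real annotation file.

```python
import re
from collections import namedtuple

# Hypothetical container for one normalization annotation.
Normalization = namedtuple("Normalization", "id type target rid eid text")


def parse_normalization(line):
    """Parse 'N<num><TAB>TYPE TARGET RID:EID<TAB>TEXT' into a Normalization."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        raise ValueError("expected 3 TAB-separated fields, got %d" % len(fields))
    ann_id, body, text = fields
    if not re.fullmatch(r"N[0-9]+", ann_id):
        raise ValueError("not a normalization ID: %r" % ann_id)
    parts = body.split(" ")
    if len(parts) != 3:
        raise ValueError("expected 'TYPE TARGET RID:EID', got %r" % body)
    type_, target, ref = parts
    # Split on the first ':' so the entry ID may itself contain colons.
    rid, sep, eid = ref.partition(":")
    if not sep:
        raise ValueError("expected RID:EID, got %r" % ref)
    return Normalization(ann_id, type_, target, rid, eid, text)


ann = parse_normalization("N1\tReference T1 Wikipedia:534366\tBarack Obama")
print(ann.target, ann.rid, ann.eid, ann.text)
```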
Normalization configuration fails to take effect
It is necessary to reload the collection for changes to configuration to take effect. This can be done either by navigating to a different collection and back, or simply by reloading the brat page in the browser.
If the issue persists after reloading the collection, it may be that brat is reading a different tools.conf file than the one that was edited to configure normalization. For example, if the collection contains a tools.conf file, its configuration will be used in favor of ones in containing collections and the brat root directory.
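When it is unclear which tools.conf applies to a collection, a small script can help locate it. The sketch below walks from a collection directory up towards a given root and returns the first tools.conf it finds, mirroring the precedence described above. It is a troubleshooting aid written under that assumption, not brat's actual lookup code; find_tools_conf and its parameters are hypothetical.

```python
import os


def find_tools_conf(collection_dir, root_dir, filename="tools.conf"):
    """Return the nearest tools.conf at or above collection_dir, or None.

    The search stops at root_dir (or at the file system root if
    collection_dir is not located under root_dir).
    """
    current = os.path.abspath(collection_dir)
    root = os.path.abspath(root_dir)
    while True:
        candidate = os.path.join(current, filename)
        if os.path.isfile(candidate):
            return candidate
        if current == root:
            return None
        parent = os.path.dirname(current)
        if parent == current:  # reached the file system root
            return None
        current = parent
```

Calling find_tools_conf with the path of the collection and the path of the brat root directory shows which file is the closest match, which can then be compared against the file that was actually edited.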
Search fails to return expected results
If brat is showing a specific error message when a search is performed, this message should indicate what the specific issue is.
If search fails to return expected results without giving any specific error message, first make sure that the data that was used to create the DB is correctly formatted. Note that search requires a match against a name field in the data; it's not enough for the query string to match e.g. just part of a non-name field.
The norm_db_lookup.py tool provided with brat can be used to check DB contents from the command line.