pybrat

pybrat is a simple package for reading brat-formatted text annotations.

https://github.com/jvasilakes/pybrat

Installation

pybrat is written for Python 3 and has no external dependencies. Install it with

python setup.py develop

Installing in develop mode means your installed version should update whenever you pull changes from GitHub.

Uninstallation

python setup.py develop --uninstall

Usage

The pybrat.BratAnnotations class automatically links events to their associated text spans and attributes. Parse a .ann file and iterate through the contents with

>>> from pybrat import BratAnnotations
>>> anns = BratAnnotations.from_file("/path/to/file.ann")
>>> for ann in anns:
...     print(ann)
E1      PROCESS_OF:T8 PathologicFunction:T5 AgeGroup:T6

By default, iterating over a BratAnnotations instance yields its highest-level annotations. To iterate over a specific type of annotation, use the .spans, .attributes, or .events properties, e.g.,

>>> for span in anns.spans:
...     print(span)
T8      PROCESS_OF 86 88        in
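For orientation, the lines printed above follow brat's standoff layout: an ID, a tab, and a type-specific body. The following is a minimal sketch of splitting a span line and an event line by hand, assuming tab-separated fields and contiguous (single-range) spans. `split_span_line` and `split_event_line` are hypothetical helpers written for illustration; they are not pybrat functions and do not show how pybrat parses files internally.

```python
def split_span_line(line):
    """Split a text-bound line: ID <TAB> TYPE START END <TAB> TEXT."""
    ann_id, body, text = line.rstrip("\n").split("\t", maxsplit=2)
    ann_type, start, end = body.split(" ")
    return {"id": ann_id, "type": ann_type,
            "start": int(start), "end": int(end), "text": text}


def split_event_line(line):
    """Split an event line: ID <TAB> TYPE:TRIGGER ROLE:ID ROLE:ID ..."""
    ann_id, body = line.rstrip("\n").split("\t", maxsplit=1)
    trigger, *arguments = body.split()
    event_type, trigger_id = trigger.split(":")
    # Each remaining field is a ROLE:ID pair pointing at another annotation.
    roles = [tuple(arg.split(":")) for arg in arguments]
    return {"id": ann_id, "type": event_type,
            "trigger": trigger_id, "arguments": roles}


span = split_span_line("T8\tPROCESS_OF 86 88\tin")
event = split_event_line("E1\tPROCESS_OF:T8 PathologicFunction:T5 AgeGroup:T6")
print(span)
print(event)
```

In practice BratAnnotations.from_file does this work for you and also links each event to its spans and attributes.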

You can output brat-formatted annotations by calling str(ann) or print(ann). This works for individual annotations, as shown above, and for a whole BratAnnotations instance.

>>> anns = BratAnnotations.from_file("/path/to/file.ann")
>>> print(anns)
T6   AgeGroup 89 97  children
T5   PathologicFunction 73 79        reflux
T8   PROCESS_OF 86 88        in
E1   PROCESS_OF:T8 PathologicFunction:T5 AgeGroup:T6
...

New! You can specify raw text and/or sentence-segmented text, which allows you to easily cross-reference annotations with their associated text spans.

JSONL formatted sentences

$> cat sentences.jsonl

{"sent_index": 1,
 "start_char": 0,
 "end_char": 23,
 "_text": "The cat sat on the mat."}

>>> from pybrat import BratAnnotations, BratText
>>> anns = BratAnnotations.from_file("path/to/file.ann")
>>> anntxt = BratText.from_files(text="path/to/file.txt", sentences="path/to/file.jsonl")
>>> print(anns.events[0])
E1      SIT:T2 Animal:T1 Location:T3
>>> event_sentences = anntxt.sentences(annotations=[anns.events[0]])
>>> print(event_sentences)
[{"sent_index": 1,
  "start_char": 0,
  "end_char": 23,
  "_text": "The cat sat on the mat."}]
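If you do not yet have a sentences file, the sketch below shows one way to produce records in the layout shown above using only the standard library. It makes several assumptions that are not confirmed by pybrat: a naive regex sentence splitter, a 1-based sent_index (as in the example record), an exclusive end_char, and one JSON object per line in the output file. `sentence_records` and `write_sentences_jsonl` are hypothetical names.

```python
import json
import os
import re
import tempfile


def sentence_records(text):
    """Yield one record per sentence with character offsets into text."""
    # Naive segmentation: a run of non-terminators followed by ., ! or ?
    for i, match in enumerate(re.finditer(r"[^.!?]+[.!?]", text), start=1):
        raw = match.group()
        stripped = raw.lstrip()
        # Move the start offset past any leading whitespace.
        start = match.start() + (len(raw) - len(stripped))
        yield {"sent_index": i, "start_char": start,
               "end_char": match.end(), "_text": stripped}


def write_sentences_jsonl(text, path):
    """Write one JSON object per line, as the JSON Lines format expects."""
    with open(path, "w", encoding="utf-8") as outf:
        for record in sentence_records(text):
            outf.write(json.dumps(record) + "\n")


text = "The cat sat on the mat. It purred."
path = os.path.join(tempfile.mkdtemp(), "sentences.jsonl")
write_sentences_jsonl(text, path)

with open(path, encoding="utf-8") as inf:
    records = [json.loads(line) for line in inf]
print(records)
```

For real documents you would likely swap the regex for a proper sentence segmenter; only the record layout matters here.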

API

class pybrat.Annotation(_id: str, _type: str, _source_file: Optional[str] = None)

Bases: object

The base class for brat annotations. Use Span, Event, or Attribute instead of this class.

copy()

Performs a deep copy of this annotation.

property id
short_repr()
to_brat_str()
property type
update(key, value)

class pybrat.Attribute(_id, value, reference=None, _type='Attribute', _source_file=None)

Bases: Annotation

A brat attribute. Can be attached to Spans or Events.

Parameters:
  • _id (str) – the unique numerical identifier of this attribute with the ‘A’ prefix. E.g., ‘A5’.

  • value (Any) – the value of this attribute.

  • reference (Annotation) – the corresponding Span or Event instance.

  • _type (str) – (Optional) a string giving the type of this attribute. Default is ‘Attribute’.

  • _source_file (str) – (Optional) the name of the .ann file which contains this attribute.

asdict()
property end_index

The ending character index of this Attribute’s reference.

property indices
property span
property start_index

The starting character index of this Attribute’s reference.

to_brat_str(output_references=False, seen=None)

Format this Attribute instance as a brat string.

Parameters:

output_references (bool) – If True, also includes the brat string of the reference of this Attribute. Default False.

class pybrat.BratAnnotations(spans=None, events=None, attributes=None, _source_file=None)

Bases: object

The main class for working with brat annotations.

You can read annotations from a file.

>>> import pybrat
>>> anns = pybrat.BratAnnotations.from_file("path/to/file.ann")

You can also create a set of annotations from Event instances.

>>> import pybrat
>>> event1 = pybrat.Event("E1", *e1spans)
>>> event2 = pybrat.Event("E2", *e2spans)
>>> anns = pybrat.BratAnnotations.from_events([event1, event2])

add_annotation(annotation: Annotation)
property attributes
property events
classmethod from_events(events_iter)

Create a BratAnnotations instance from a collection of Events. Assumes that the Event instances in events_iter contain all Spans and Attributes.

Parameters:

events_iter (List[Event]) – An iterable over Event instances.

classmethod from_file(fpath)

Read brat annotations from the specified file.

Parameters:

fpath (str) – The path to the .ann file.

Returns:

a new BratAnnotations instance.

get_attributes_by_type(attr_type)
get_events_by_type(event_type)
get_highest_level_annotations(type=None)

brat annotations can consist of spans only, spans + events, or spans + events + attributes. This method returns the highest-level annotations available in this file.

In order from highest to lowest level: Event, Attribute, Span.

Parameters:

type (str) – (Optional) return annotations with the specified type.
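The selection order described above can be sketched in a few lines. This is an illustration of the documented ordering only, not pybrat's implementation: annotations are stood in for by plain dicts, `highest_level_annotations` and `ann_type` are hypothetical names, and applying the type filter after the level has been chosen is an assumption.

```python
def highest_level_annotations(spans, events, attributes, ann_type=None):
    """Return the first non-empty group in the order Event, Attribute, Span."""
    chosen = []
    for group in (events, attributes, spans):
        if group:
            chosen = list(group)
            break
    # Assumed behavior: filter by type only after the level is picked.
    if ann_type is not None:
        chosen = [ann for ann in chosen if ann["type"] == ann_type]
    return chosen


spans = [{"id": "T5", "type": "PathologicFunction"},
         {"id": "T6", "type": "AgeGroup"}]
events = [{"id": "E1", "type": "PROCESS_OF"}]

with_events = highest_level_annotations(spans, events, attributes=[])
spans_only = highest_level_annotations(spans, events=[], attributes=[],
                                       ann_type="AgeGroup")
print([ann["id"] for ann in with_events])
print([ann["id"] for ann in spans_only])
```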

get_spans_by_type(span_type)
save_brat(outdir, filename=None)

Save these brat annotations to a brat-formatted file.

Parameters:
  • outdir (str) – The directory in which to save the file.

  • filename (str) – (Optional) The filename to use. If not specified, attempts to use the Annotation._source_file.

property spans

class pybrat.BratText(text=None, sentences=None, tokenizer=None)

Bases: object

A simple class for organizing the text that corresponds to a file of brat annotations.

Specify plain text, split sentences, or both.

>>> bt = BratText(text=plain_text, sentences=list_of_sents)
>>> bt.text(start_char=0, end_char=12)  # Plain text at character indices 0 through 12
>>> bt.tokens(start_char=0, end_char=12)  # Tokens spanning character indices 0 through 12
>>> bt.sentences(start_char=0, end_char=12)  # Sentences spanning character indices 0 through 12

sentences can also be a JSON Lines file with the following format:

{"sent_index": int,  # the number of this sentence in the document
 "start_char": int,  # the character offset of the start of the sentence
 "end_char": int,    # the character offset of the end of the sentence
 "_text": str        # the sentence text
}
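To make the role of start_char and end_char concrete, here is a small sketch of looking up the sentence records that overlap a character range. It assumes half-open ranges (end_char exclusive) and treats any overlap as a match; `sentences_overlapping` is a hypothetical helper, and pybrat's own lookup may differ in its boundary handling.

```python
def sentences_overlapping(sentences, start_char, end_char):
    """Return records whose [start_char, end_char) range overlaps the query."""
    return [sent for sent in sentences
            if sent["start_char"] < end_char and start_char < sent["end_char"]]


sentences = [
    {"sent_index": 1, "start_char": 0, "end_char": 23,
     "_text": "The cat sat on the mat."},
    {"sent_index": 2, "start_char": 24, "end_char": 34,
     "_text": "It purred."},
]

first_only = sentences_overlapping(sentences, 0, 12)
both = sentences_overlapping(sentences, 20, 26)
print([sent["sent_index"] for sent in first_only])
print([sent["sent_index"] for sent in both])
```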

You can also access the text using Annotation instances:

>>> anns = BratAnnotations.from_file("path/to/file1.ann")
>>> bt = BratText.from_files(text="path/to/file1.txt",
...                          sentences="path/to/file1.jsonl")
>>> # get the text of the first span
>>> bt.text(annotations=[anns.spans[0]])
>>> # tokens from the first three spans
>>> bt.tokens(annotations=anns.spans[0:3])
>>> # Sentences containing all events
>>> bt.sentences(annotations=anns.events[:])

classmethod from_files(text=None, sentences=None, tokenizer=None)
save(outdir, filename=None)

Save this BratText instance to a plain text file.

Parameters:
  • outdir (str) – The directory in which to save the file.

  • filename (str) – (Optional) The filename to use. If not specified, attempts to use the Annotation._source_file.

sentences(annotations: List[Annotation] = [], start_char: Optional[int] = None, end_char: Optional[int] = None)
text(annotations: List[Annotation] = [], start_char: Optional[int] = None, end_char: Optional[int] = None)
tokens(annotations: List[Annotation] = [], start_char: Optional[int] = None, end_char: Optional[int] = None)

class pybrat.CharacterIndex(sorted_spans)

Bases: object

Parameters:

sorted_spans (list(tuple)) – a list of tuples, each tuple containing the (start, end) indices of a text span.

property end_index
property start_index
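The reference above does not describe start_index and end_index for this class. One plausible reading, given that the input is a sorted list of (start, end) tuples, is that they are the start of the first tuple and the end of the last. The sketch below illustrates that reading with a hypothetical stand-in class, `SimpleCharacterIndex`; it is an assumption, not pybrat's code.

```python
class SimpleCharacterIndex:
    """Hypothetical stand-in holding sorted (start, end) character ranges."""

    def __init__(self, sorted_spans):
        self.spans = list(sorted_spans)

    @property
    def start_index(self):
        # Assumed: the start of the first (lowest) range.
        return self.spans[0][0]

    @property
    def end_index(self):
        # Assumed: the end of the last (highest) range.
        return self.spans[-1][1]


# A discontinuous span covering two separate character ranges.
idx = SimpleCharacterIndex([(86, 88), (95, 99)])
print(idx.start_index, idx.end_index)
```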
class pybrat.Event(_id, *spans, attributes=None, _type='Event', _source_file=None)

Bases: Annotation

A brat event, composed of one or more ordered Span instances. pybrat does not enforce any specific Event structure.

Parameters:
  • _id (str) – the unique numerical identifier of this event with the E prefix. E.g., ‘E10’.

  • spans (Span) – one or more Span instances.

  • attributes (dict) – a dictionary of Attribute instances keyed by attribute type.

  • _type (str) – (Optional) A type for this Event. Default is ‘Event’.

  • _source_file (str) – (Optional) the name of the .ann file which contains this event.

asdict()
property end_index

The highest character index of this Event’s spans.

property indices
property start_index

The lowest character index of this Event’s spans.

to_brat_str(output_references=False, seen=None)

Format this Event instance as a brat string.

Parameters:

output_references (bool) – If True, also includes the brat string of the Spans and Attributes of this Event. Default False.

class pybrat.RegexTokenizer(split_pattern='\\s')

Bases: object

A very simple tokenizer that splits on whitespace by default.

>>> import pybrat
>>> tokenizer = pybrat.RegexTokenizer()
>>> text = "The cat in the hat"
>>> tokens, token_char_ranges = tokenizer(text)
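To show what a (tokens, character ranges) pair could look like, here is a minimal standalone sketch of a regex-splitting tokenizer built on the standard re module. `regex_tokenize` is a hypothetical function, and the half-open (start, end) range format is an assumption; RegexTokenizer itself may return ranges in a different shape.

```python
import re


def regex_tokenize(text, split_pattern=r"\s"):
    """Split text on split_pattern, returning tokens and their (start, end) ranges."""
    tokens, ranges = [], []
    start = 0
    for match in re.finditer(split_pattern, text):
        # Keep only non-empty stretches between separators.
        if match.start() > start:
            tokens.append(text[start:match.start()])
            ranges.append((start, match.start()))
        start = match.end()
    # Emit the trailing token, if any.
    if start < len(text):
        tokens.append(text[start:])
        ranges.append((start, len(text)))
    return tokens, ranges


text = "The cat in the hat"
tokens, token_char_ranges = regex_tokenize(text)
print(tokens)
print(token_char_ranges)
```

With ranges in this form, text[start:end] recovers each token, which is what makes it possible to line tokens up with brat span offsets.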
class pybrat.Span(_id: str, indices: CharacterIndex, text: str, _type: str = 'Span', _source_file: Optional[str] = None, attributes=None)

Bases: Annotation

A brat span, i.e., a span of text.

Parameters:
  • _id (str) – the unique numerical identifier of this span with the ‘T’ prefix. E.g., ‘T3’.

  • indices (CharacterIndex) – the CharacterIndex for this span.

  • text (str) – the actual span text

  • _type (str) – (Optional) a string giving the type of this span, e.g., for NER. Default is ‘Span’.

  • _source_file (str) – (Optional), the name of the .ann file which contains this span.

asdict()
property end_index
property start_index
to_brat_str(output_references=False, seen=None)

Format this Span instance as a brat string.

Parameters:

output_references (bool) – If True, also includes the brat strings of the Attributes of this Span. Default False.

pybrat.parse_brat_attribute(line)
pybrat.parse_brat_event(line)
pybrat.parse_brat_span(line)