pybrat
pybrat is a simple package for reading brat-formatted text annotations.
Installation
pybrat is written for Python 3 and has no external dependencies. Install it with
python setup.py develop
Using develop means your installed version should update whenever you pull changes from GitHub.
Uninstallation
python setup.py develop --uninstall
Usage
The pybrat.BratAnnotations class automatically links events to their associated text spans and attributes. Parse a .ann file and iterate through the contents with
>>> from pybrat import BratAnnotations
>>> anns = BratAnnotations.from_file("/path/to/file.ann")
>>> for ann in anns:
...     print(ann)
E1 PROCESS_OF:T8 PathologicFunction:T5 AgeGroup:T6
By default the __iter__ method will iterate through the highest level annotations. You can iterate through specific types of annotations with the .spans, .attributes, and .events properties. E.g.
>>> for span in anns.spans:
...     print(span)
T8 PROCESS_OF 86 88 in
You can output brat formatted annotations simply by calling str(ann) or print(ann). This works for individual annotations, as shown above, or for a BratAnnotations instance.
>>> anns = BratAnnotations.from_file("/path/to/file.ann")
>>> print(anns)
T6 AgeGroup 89 97 children
T5 PathologicFunction 73 79 reflux
T8 PROCESS_OF 86 88 in
E1 PROCESS_OF:T8 PathologicFunction:T5 AgeGroup:T6
...
New! You can specify raw text and/or sentence-segmented text, which allows you to easily cross-reference annotations with their associated text spans.
JSONL formatted sentences
$> cat sentences.jsonl
{"sent_index": 1, "start_char": 0, "end_char": 23, "_text": "The cat sat on the mat."}
>>> from pybrat import BratAnnotations, BratText
>>> anns = BratAnnotations.from_file("path/to/file.ann")
>>> anntxt = BratText.from_files(text="path/to/file.txt", sentences="path/to/file.jsonl")
>>> print(anns.events[0])
E1 SIT:T2 Animal:T1 Location:T3
>>> event_sentences = anntxt.sentences(annotations=[anns.events[0]])
>>> print(event_sentences)
[{"sent_index": 1,
  "start_char": 0,
  "end_char": 23,
  "_text": "The cat sat on the mat."}]
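To make the cross-referencing idea concrete, here is a minimal standalone sketch of looking up sentences by character range. It does not use pybrat and is not its implementation: load_sentences and sentences_overlapping are hypothetical helpers, the second sentence is invented for illustration, and the offsets are assumed to be half-open (end_char exclusive).

```python
import io
import json


def load_sentences(lines):
    """Read JSON Lines input: one JSON object per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


def sentences_overlapping(sentences, start_char, end_char):
    """Return sentences whose character range overlaps [start_char, end_char)."""
    return [
        s for s in sentences
        if s["start_char"] < end_char and start_char < s["end_char"]
    ]


# In-memory stand-in for a sentences.jsonl file (assumed data).
jsonl = io.StringIO(
    '{"sent_index": 1, "start_char": 0, "end_char": 23, "_text": "The cat sat on the mat."}\n'
    '{"sent_index": 2, "start_char": 24, "end_char": 39, "_text": "It fell asleep."}\n'
)
sentences = load_sentences(jsonl)

# Characters 4 through 7 cover "cat" in the first sentence.
hits = sentences_overlapping(sentences, 4, 7)
print([s["sent_index"] for s in hits])  # [1]
```

An annotation's start and end indices can be passed to the same lookup to find the sentences it falls in.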
API
- class pybrat.Annotation(_id: str, _type: str, _source_file: Optional[str] = None)
Bases: object
The base class for brat annotations. Use Span, Event, or Attribute instead of this class.
- property id
- property type
- class pybrat.Attribute(_id, value, reference=None, _type='Attribute', _source_file=None)
Bases: Annotation
A brat attribute. Can be attached to Spans or Events.
- Parameters:
_id (str) – the unique numerical identifier of this attribute with the ‘A’ prefix. E.g., ‘A5’.
value (Any) – the value of this attribute.
reference (Annotation) – the corresponding Span or Event instance.
_type (str) – (Optional) a string giving the type of this attribute. Default is ‘Attribute’.
_source_file (str) – (Optional) the name of the .ann file which contains this attribute.
- property end_index
The ending character index of this Attribute’s reference.
- property indices
- property span
- property start_index
The starting character index of this Attribute’s reference.
- class pybrat.BratAnnotations(spans=None, events=None, attributes=None, _source_file=None)
Bases: object
The main class for working with brat annotations.
You can read annotations from a file.
>>> import pybrat
>>> anns = pybrat.BratAnnotations.from_file("path/to/file.ann")
You can also create a set of annotations from Event instances.
>>> import pybrat
>>> event1 = pybrat.Event("E1", *e1spans)
>>> event2 = pybrat.Event("E2", *e2spans)
>>> anns = pybrat.BratAnnotations.from_events([event1, event2])
- add_annotation(annotation: Annotation)
- property attributes
- property events
- classmethod from_events(events_iter)
Create a BratAnnotations instance from a collection of Events. Assumes that the Event instances in events_iter contain all Spans and Attributes.
- Parameters:
events_iter (List[Event]) – An iterable over Event instances.
- classmethod from_file(fpath)
Read brat annotations from the specified file.
- Parameters:
fpath (str) – The path to the ann file.
- Returns:
a new BratAnnotations instance.
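As a rough illustration of what reading a .ann file involves, the sketch below parses single span (T) and event (E) lines using only the standard library. It is not pybrat's implementation: parse_ann_line is a hypothetical helper, it returns plain dicts rather than Span or Event instances, and it skips attributes and discontinuous spans.

```python
import re


def parse_ann_line(line):
    """Parse one brat standoff line (span or event) into a plain dict."""
    # The ID is separated from the rest by a tab in brat files;
    # splitting on any whitespace also accepts space-separated copies.
    ann_id, rest = re.split(r"\s+", line.strip(), maxsplit=1)

    if ann_id.startswith("T"):
        # Span: "<type> <start> <end>" followed by the covered text.
        match = re.match(r"(\S+) (\d+) (\d+)\s+(.*)", rest)
        if match is None:
            raise ValueError(f"unsupported span line: {line!r}")
        ann_type, start, end, text = match.groups()
        return {"id": ann_id, "type": ann_type,
                "start": int(start), "end": int(end), "text": text}

    if ann_id.startswith("E"):
        # Event: "<type>:<trigger id>" then "<role>:<span id>" arguments.
        pairs = [item.split(":", 1) for item in rest.split()]
        ann_type, trigger = pairs[0]
        return {"id": ann_id, "type": ann_type, "trigger": trigger,
                "arguments": {role: ref for role, ref in pairs[1:]}}

    raise ValueError(f"unsupported annotation line: {line!r}")


span = parse_ann_line("T8\tPROCESS_OF 86 88\tin")
event = parse_ann_line("E1\tPROCESS_OF:T8 PathologicFunction:T5 AgeGroup:T6")
print(span["type"], span["start"], span["end"])  # PROCESS_OF 86 88
print(event["arguments"])  # {'PathologicFunction': 'T5', 'AgeGroup': 'T6'}
```

Linking an event to its spans then amounts to looking up the trigger and argument IDs among the parsed T lines.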
- get_highest_level_annotations(type=None)
brat annotations can include only spans, spans + events, or spans + events + attributes. This method returns the highest-level annotations available in this file.
In order from highest to lowest level: Event, Attribute, Span.
- Parameters:
type (str) – (Optional) return annotations with the specified type.
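The selection rule described here can be sketched as a simple fallback over the three collections. This is an assumption about the behaviour based only on the description above, not pybrat's code; highest_level is a hypothetical function and annotations are represented as plain dicts.

```python
def highest_level(spans, events, attributes, ann_type=None):
    """Return the first non-empty level in Event > Attribute > Span order.

    If ann_type is given, keep only annotations of that type.
    """
    for level in (events, attributes, spans):
        if level:
            if ann_type is None:
                return list(level)
            return [ann for ann in level if ann["type"] == ann_type]
    return []


spans = [{"id": "T1", "type": "Animal"}, {"id": "T2", "type": "SIT"}]
events = [{"id": "E1", "type": "SIT"}]

print(highest_level(spans, events, []))            # [{'id': 'E1', 'type': 'SIT'}]
print(highest_level(spans, [], [], "Animal"))      # [{'id': 'T1', 'type': 'Animal'}]
```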
- save_brat(outdir, filename=None)
Save these brat annotations to a brat-formatted file.
- Parameters:
outdir (str) – The directory in which to save the file.
filename (str) – (Optional) The filename to use. If not specified, attempts to use the Annotation._source_file.
- property spans
- class pybrat.BratText(text=None, sentences=None, tokenizer=None)
Bases: object
A simple class for organizing the text that corresponds to a file of brat annotations.
Specify plain text, split sentences, or both.
>>> bt = BratText(text=plain_text, sentences=list_of_sents)
>>> bt.text(start_char=0, end_char=12)  # Plain text at character indices 0 through 12
>>> bt.tokens(start_char=0, end_char=12)  # Tokens spanning character indices 0 through 12
>>> bt.sentences(start_char=0, end_char=12)  # Sentences spanning character indices 0 through 12
sentences can also be a JSON Lines file with the following format:
{"sent_index": int,   # the number of this sentence in the document
 "start_char": int,   # the character offset of the start of the sentence
 "end_char": int,     # the character offset of the end of the sentence
 "_text": str         # the sentence text
}
You can also access the text using Annotation instances.
>>> anns = BratAnnotations.from_file("path/to/file1.ann")
>>> bt = BratText.from_files(text="path/to/file1.txt",
...                          sentences="path/to/file1.jsonl")
>>> # get the text of the first span
>>> bt.text(annotations=[anns.spans[0]])
>>> # tokens from the first three spans
>>> bt.tokens(annotations=anns.spans[0:3])
>>> # Sentences containing all events
>>> bt.sentences(annotations=anns.events[:])
- save(outdir, filename=None)
Save this BratText instance to a plain text file.
- Parameters:
outdir (str) – The directory in which to save the file.
filename (str) – (Optional) The filename to use. If not specified, attempts to use the Annotation._source_file.
- sentences(annotations: List[Annotation] = [], start_char: Optional[int] = None, end_char: Optional[int] = None)
- text(annotations: List[Annotation] = [], start_char: Optional[int] = None, end_char: Optional[int] = None)
- tokens(annotations: List[Annotation] = [], start_char: Optional[int] = None, end_char: Optional[int] = None)
- class pybrat.CharacterIndex(sorted_spans)
Bases: object
- Parameters:
sorted_spans (list(tuple)) – a list of tuples, each tuple containing the (start, end) indices of a text span.
- property end_index
- property start_index
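One plausible reading of these two properties is sketched below: the start is the lowest start offset and the end is the highest end offset across the (start, end) tuples, which matters when a span is discontinuous. SpanIndex is a hypothetical stand-in written for illustration, not the CharacterIndex class itself.

```python
from typing import List, Tuple


class SpanIndex:
    """Hypothetical stand-in for an index built from (start, end) tuples."""

    def __init__(self, sorted_spans: List[Tuple[int, int]]):
        # Sort defensively so the first tuple always has the lowest start.
        self.spans = sorted(sorted_spans)

    @property
    def start_index(self) -> int:
        # Lowest starting character offset.
        return self.spans[0][0]

    @property
    def end_index(self) -> int:
        # Highest ending character offset.
        return max(end for _, end in self.spans)


single = SpanIndex([(86, 88)])
split = SpanIndex([(0, 5), (10, 15)])  # a discontinuous span in two pieces
print(single.start_index, single.end_index)  # 86 88
print(split.start_index, split.end_index)    # 0 15
```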
- class pybrat.Event(_id, *spans, attributes=None, _type='Event', _source_file=None)
Bases: Annotation
A brat event, composed of one or more ordered Span instances. pybrat does not enforce any specific Event structure.
- Parameters:
_id (str) – the unique numerical identifier of this event with the E prefix. E.g., ‘E10’.
spans (Span) – one or more Span instances.
attributes (dict) – a dictionary of Attribute instances keyed by attribute type.
_type (str) – (Optional) A type for this Event. Default is ‘Event’.
_source_file (str) – (Optional) the name of the .ann file which contains this event.
- property end_index
The highest character index of this Event’s spans.
- property indices
- property start_index
The lowest character index of this Event’s spans.
- class pybrat.RegexTokenizer(split_pattern='\\s')
Bases: object
A very simple tokenizer that splits on whitespace by default.
>>> import pybrat
>>> tokenizer = pybrat.RegexTokenizer()
>>> text = "The cat in the hat"
>>> tokens, token_char_ranges = tokenizer(text)
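For readers who want to see how a regex-split tokenizer can return both tokens and their character ranges, here is a minimal sketch built on the standard re module. It is an independent illustration, not the RegexTokenizer source: regex_tokenize is a hypothetical function, and the half-open (start, end) ranges are an assumption.

```python
import re


def regex_tokenize(text, split_pattern=r"\s"):
    """Split text on split_pattern; return tokens and their (start, end) ranges."""
    tokens, ranges = [], []
    start = 0
    for sep in re.finditer(split_pattern, text):
        # Emit the text between the previous separator and this one, if any.
        if sep.start() > start:
            tokens.append(text[start:sep.start()])
            ranges.append((start, sep.start()))
        start = sep.end()
    # Emit whatever follows the last separator.
    if start < len(text):
        tokens.append(text[start:])
        ranges.append((start, len(text)))
    return tokens, ranges


tokens, token_char_ranges = regex_tokenize("The cat in the hat")
print(tokens)             # ['The', 'cat', 'in', 'the', 'hat']
print(token_char_ranges)  # [(0, 3), (4, 7), (8, 10), (11, 14), (15, 18)]
```

Keeping the ranges alongside the tokens is what lets annotation offsets be mapped back onto tokens.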
- class pybrat.Span(_id: str, indices: CharacterIndex, text: str, _type: str = 'Span', _source_file: Optional[str] = None, attributes=None)
Bases: Annotation
A brat span, i.e., a span of text.
- Parameters:
_id (str) – the unique numerical identifier of this span with the ‘T’ prefix. E.g., ‘T3’.
indices (CharacterIndex) – the CharacterIndex for this span.
text (str) – the actual span text
_type (str) – (Optional) a string giving the type of this span, e.g., for NER. Default is ‘Span’.
_source_file (str) – (Optional), the name of the .ann file which contains this span.
- property end_index
- property start_index