LocText corpus

Name: LocText
Description: Protein subcellular localizations for human, yeast, and Arabidopsis.
Annotations: entities: proteins, subcellular localizations (loc), organisms; relations: protein ↔ loc, protein ↔ organism
Documents: 100 abstracts
Links / Download: anndoc, BioC, PubAnnotation

The LocText corpus of protein subcellular localization

Brief summary

The LocText corpus consists of 100 abstracts that have been manually annotated for proteins, subcellular localizations, organisms, and relations among them. The focus of the corpus is on the annotation of proteins and their subcellular localizations. It was developed by Tatyana Goldberg (goldberg@rostlab.org), Shrikant Vinchurkar (shrikantpvinchurkar@gmail.com), Juan Miguel Cejuela (juanmi@tagtog.net), Lars Juhl Jensen (lars.juhl.jensen@cpr.ku.dk), and Burkhard Rost (assistant@rostlab.org).

Document collection

To obtain abstracts related to the subcellular localization of proteins, we extracted PubMed identifiers of articles cited by UniProtKB protein subcellular localization annotations. From this set of abstracts we randomly selected 100 with the following species breakdown:

  • 50 abstracts pertaining to Homo sapiens (human) proteins
  • 25 abstracts pertaining to Saccharomyces cerevisiae (budding yeast) proteins
  • 25 abstracts pertaining to Arabidopsis thaliana proteins
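
The selection step can be sketched as a per-species random draw. This is only an illustration under stated assumptions: the function name `sample_abstracts`, the fixed seed, and the placeholder identifier pools are hypothetical, and the source does not say which tool or random generator was actually used.

```python
import random

# Species quotas taken from the breakdown above.
QUOTAS = {
    "Homo sapiens": 50,
    "Saccharomyces cerevisiae": 25,
    "Arabidopsis thaliana": 25,
}

def sample_abstracts(pmids_by_species, quotas=QUOTAS, seed=0):
    """Randomly pick quotas[species] identifiers for each species."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    selected = {}
    for species, quota in quotas.items():
        # De-duplicate and sort so the draw does not depend on input order.
        candidates = sorted(set(pmids_by_species[species]))
        selected[species] = rng.sample(candidates, quota)
    return selected

# Hypothetical candidate pools; these identifiers are placeholders, not real PMIDs.
pools = {
    "Homo sapiens": [f"H{i}" for i in range(200)],
    "Saccharomyces cerevisiae": [f"Y{i}" for i in range(80)],
    "Arabidopsis thaliana": [f"A{i}" for i in range(60)],
}
picked = sample_abstracts(pools)
total = sum(len(ids) for ids in picked.values())
```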

Document annotation

We manually annotated three types of entities in the text, with most focus on the first two:

  • Proteins: generalizes and combines proteins with their corresponding genes and mRNAs; normalized to UniProtKB identifiers
  • Subcellular localizations: normalized to Gene Ontology (GO) cellular component terms
  • Organisms: normalized to NCBI Taxonomic identifiers

We also annotated two types of relationships with the main focus being on the first one:

  • Protein–subcellular localization: link proteins to the parts of the cell where they reside
  • Protein–organism: primarily annotated to support normalization of proteins

Annotation guidelines

Common for all entity classes

  • Do annotate synonyms or variations (e.g. plural forms) of entity names
  • Do annotate both the long and the short forms separately when they appear together, e.g. amyloid precursor protein (APP)


Proteins

Annotate proteins/genes only when they can be found in UniProtKB for the organism given in the abstract:

  • Do include localization words when they are part of protein names, e.g. peroxisome proliferator-activated receptor alpha
  • Do include words like receptor, protein, or gene if they are part of the actual name, e.g. amyloid precursor protein
  • Do not include such words when they are not part of the name, e.g. the p53 protein
  • Do annotate protein mentions such as ‘ste2-3’ or ‘cation exchangers 1, 2, and 3’ as a single entity. For normalization (i.e. mapping to UniProtKB identifiers), all members of the entity must be considered separately (e.g. ste2 and ste3).
  • Do not annotate names of protein classes, families, or complexes, e.g. growth hormone
  • Do not attempt to distinguish between proteins, genes, and mRNAs
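
The rule about mentions such as ‘ste2-3’ implies that a single annotated span expands to several names before normalization. A minimal sketch of that expansion follows, assuming only the simple "stem plus numeric range" pattern; the helper `expand_mention` is hypothetical, it does not handle enumerations like ‘cation exchangers 1, 2, and 3’, and the regular expression is a heuristic that could misfire on protein names that legitimately end in a hyphenated number.

```python
import re

# A name stem followed by a numeric range, e.g. "ste2-3" -> stem "ste", range 2..3.
RANGE_MENTION = re.compile(r"^(?P<stem>.*?)(?P<start>\d+)-(?P<end>\d+)$")

def expand_mention(mention):
    """Return the individual names covered by one annotated mention."""
    m = RANGE_MENTION.match(mention)
    if m is None:
        return [mention]  # not a range: normalize the mention as it is
    start, end = int(m["start"]), int(m["end"])
    if end <= start:
        return [mention]  # not an ascending range: leave untouched
    return [f"{m['stem']}{n}" for n in range(start, end + 1)]

members = expand_mention("ste2-3")
```

Each returned name would then be looked up in UniProtKB separately, as the guideline requires.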

Subcellular localizations

Annotate only subcellular localizations from the UniProtKB controlled vocabulary:

  • Do also annotate the following additional terms:
      • Cell periphery
      • Chromocenters
      • Microtubule
      • Plasma membrane
      • Tonoplast
      • Transmembrane
  • Do not annotate localization words as subcellular localizations when they are part of a protein name


Organisms

Annotate terms as organisms only if they can be found in NCBI Taxonomy with the rank species, genus, or subfamily:

  • Do annotate human as it is a species
  • Do annotate rat as it is a genus
  • Do annotate murines as it is a subfamily
  • Do not annotate K562 cells as it is a cell line
  • Do not annotate mammals as it has the rank class
  • Do not annotate human-indicative words such as woman or French-Canadians
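
The rank rule above amounts to a membership check once the taxonomic rank of a mention is known. The sketch below assumes the rank has already been looked up; the function `is_annotatable_organism` is hypothetical, and the ranks in `examples` are the ones stated in the guideline examples rather than the result of a live NCBI Taxonomy query.

```python
# Ranks accepted by the guideline.
ANNOTATABLE_RANKS = {"species", "genus", "subfamily"}

def is_annotatable_organism(rank):
    """rank: NCBI Taxonomy rank of the mention, or None if it is not a taxon."""
    return rank in ANNOTATABLE_RANKS

# Ranks as given in the guideline examples above.
examples = {
    "human": "species",
    "rat": "genus",
    "murines": "subfamily",
    "mammals": "class",
    "K562 cells": None,  # a cell line, not a taxon
}
decisions = {term: is_annotatable_organism(rank) for term, rank in examples.items()}
```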


Relationships

Annotate relationships if and only if they are clearly implied by the text:

  • Do annotate a relationship for each occurrence of organism or localization to the closest protein (closest is defined by the number of separating words), provided the relationship is meaningful
  • Do annotate a relationship to the protein on the right side when proteins occur at the same distance on both sides of an organism or localization entity
  • Do annotate a relationship to all variations or synonyms of an entity name
  • Do not annotate a relationship just because two entities are mentioned together in the same sentence.
  • Do not annotate relationships for marker proteins, e.g. green fluorescent protein
  • Do not annotate relationships when yeast appears as the term yeast two-hybrid
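
The two distance rules above (closest protein by number of separating words, with ties resolved to the right) can be sketched as follows. This is an illustration only: `closest_protein` is a hypothetical helper, each entity is reduced to a single token position, and the requirement that the relationship be meaningful is a human judgment that the code does not model.

```python
def closest_protein(target_index, protein_indices):
    """Pick the protein nearest to an organism/localization token.

    Distance is the number of words strictly between the two tokens;
    when a left and a right protein are equally far, the right one wins.
    """
    if not protein_indices:
        return None

    def key(p):
        separating = abs(p - target_index) - 1
        on_left = p < target_index  # False (right side) sorts first on ties
        return (separating, on_left)

    return min(protein_indices, key=key)

# Made-up sentence: protein tokens at positions 0 and 4, localization at 2.
tokens = ["ProtA", "and", "endosomes", "contain", "ProtB"]
chosen = closest_protein(2, [0, 4])
```

Here both proteins are separated from the localization term by one word, so the protein on the right is selected.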

Corpus annotation comparisons

Inter-annotator agreement (IAA)

Annotators TG, SV and JMC developed the annotation guidelines listed above using 46 of 100 PubMed abstracts. The remaining 54 abstracts were annotated independently by TG and SV based on the guidelines and used for estimating the inter-annotator agreement (IAA). We present the IAA by F1 score for entities and relationships separately, given by the following formula:

F1 = 2 · X_AB · X_BA / (X_AB + X_BA), where X_ij is the fraction of annotations by annotator i that match those of annotator j.

IAA for entities

We calculated IAA for protein and subcellular localization entities, as they were the main focus of LocText. We consider two annotations of the same entity type to match if they have the exact same start and end string offsets.
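
As a sketch of how this score can be computed, the code below represents each annotation as a tuple of document, entity type, start offset, and end offset, so that exact-offset matching becomes set membership. The function names and the sample annotations are hypothetical, and returning 0.0 for empty inputs is an assumption of this sketch, not something the source specifies.

```python
def match_fraction(source, target):
    """Fraction of annotations in `source` that have an exact match in `target`."""
    if not source:
        return 0.0  # assumption: no annotations means no agreement
    return len(source & target) / len(source)

def iaa_f1(a, b):
    """F1 = 2 * X_AB * X_BA / (X_AB + X_BA) for two annotators' annotation sets."""
    x_ab = match_fraction(a, b)
    x_ba = match_fraction(b, a)
    if x_ab + x_ba == 0:
        return 0.0
    return 2 * x_ab * x_ba / (x_ab + x_ba)

# Made-up annotations: (doc_id, entity_type, start_offset, end_offset).
ann_a = {
    ("doc1", "protein", 0, 6),
    ("doc1", "loc", 20, 29),
    ("doc1", "protein", 40, 44),
}
ann_b = {
    ("doc1", "protein", 0, 6),
    ("doc1", "loc", 20, 28),  # end offset differs, so this is not a match
}
score = iaa_f1(ann_a, ann_b)
```

In this toy case one of three annotations by the first annotator and one of two by the second are matched, so the fractions are 1/3 and 1/2.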

The IAA between the two annotators, measured by F1 score, was 96% for protein entities and 88% for subcellular localization entities. The combined F1 score for both entity types was 94%.

IAA for relationships

A protein–subcellular localization relationship annotated by two annotators is considered a match if the annotations of both the protein and the subcellular localization match between the annotators, according to the definition above.
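
Under this definition a relationship can be keyed by the exact spans of its two entities, so that a mismatch in either span prevents a match. The sketch below illustrates this; the names `relation_key` and `matching_relations` and all span values are hypothetical.

```python
def relation_key(protein_span, localization_span):
    """Identify a relationship by the exact spans of both of its entities."""
    return (protein_span, localization_span)

def matching_relations(relations_a, relations_b):
    """Relationships annotated identically by both annotators."""
    return set(relations_a) & set(relations_b)

# Made-up spans: (doc_id, start_offset, end_offset).
protein = ("doc1", 0, 6)
loc_exact = ("doc1", 20, 29)
loc_shifted = ("doc1", 20, 28)  # same term, different end offset

rels_a = {relation_key(protein, loc_exact)}
rels_b = {relation_key(protein, loc_shifted)}
shared = matching_relations(rels_a, rels_b)
```

Because the localization spans differ by one character, the two relationships do not match even though the protein spans agree.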

The IAA between the two annotators for protein–subcellular localization relationships was an F1 score of 80%.

Comparison to UniProtKB annotations

We normalized protein and subcellular localization information extracted from UniProtKB and Medline abstracts (LocText) to UniProtKB and GO identifiers, respectively. The comparison of localization information extracted from the two sources was performed for each UniProtKB protein separately. We classified the LocText subcellular localization annotation of the protein as follows:

  • Novel, if the path from at least one GO term of the LocText annotation to the root (GO:0005575) in the GO graph does not intersect any GO term from the UniProtKB annotation
  • More detailed, if it is not ‘novel’ and the path from at least one GO term of the LocText annotation to the root intersects the UniProtKB GO terms
  • Existing, in all other cases
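
One possible reading of these three classes is sketched below. It rests on assumptions that the source does not spell out: for the ‘novel’ test the path includes the LocText term itself, and for the ‘more detailed’ test only strict ancestors count, so that identical terms fall under ‘existing’. The `parents` mapping is a toy graph with made-up edges, not the real GO hierarchy, and the function names are hypothetical.

```python
def ancestors(term, parents):
    """All terms on any path from `term` up to the root, excluding `term`."""
    seen, stack = set(), [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def classify(loctext_terms, uniprot_terms, parents):
    """Classify a protein's LocText localization against its UniProtKB terms."""
    loctext_terms, uniprot_terms = set(loctext_terms), set(uniprot_terms)
    strict = {t: ancestors(t, parents) for t in loctext_terms}
    # Novel: some LocText term whose path (term included) misses every UniProtKB term.
    if any(not ((strict[t] | {t}) & uniprot_terms) for t in loctext_terms):
        return "novel"
    # More detailed: some LocText term lies strictly below a UniProtKB term.
    if any(strict[t] & uniprot_terms for t in loctext_terms):
        return "more detailed"
    return "existing"

# Toy child -> parents mapping; the edges are illustrative, not taken from GO.
parents = {
    "early endosome": ["endosome"],
    "endosome": ["organelle"],
    "vacuolar membrane": ["membrane"],
    "organelle": ["root"],
    "membrane": ["root"],
}
result = classify({"endosome"}, {"vacuolar membrane"}, parents)
```

With a real ontology, the `parents` mapping would be built from the GO graph instead of being written by hand.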

Example abstract

To illustrate our annotation strategy, we here present our annotation, normalization and comparison to UniProtKB annotations of the Arabidopsis thaliana protein RabF2a (UniProtKB ID: RAF2A_ARATH) in the abstract with PMID: 18088316.


We manually annotated the abstract and identified RabF2a as an Arabidopsis protein that localizes in endosomes. We therefore annotated a relationship between the terms RabF2a and Arabidopsis, as well as between the terms RabF2a and endosomes.

View of the annotated abstract (PMID: 18088316). Names of proteins, localization terms, and organisms are highlighted in green, magenta, and yellow, respectively. The two annotated relationships of the protein RabF2a, to the organism Arabidopsis and to the localization term endosomes, are presented as frames.


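We normalize the protein and localization entities extracted from the abstract as follows: the protein RabF2a to the UniProtKB entry RAF2A_ARATH, and the localization term endosomes to GO:0005768 (endosome).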

We also extract subcellular localization from the UniProtKB entry RAF2A_ARATH (current entry version 119): Vacuole membrane; Lipid-anchor. Note=Prevacuolar compartment. We normalize it to: GO:0005774

Comparison to UniProtKB

We compared localization information extracted from a Medline abstract and UniProtKB based on the GO terms. The relationship between two terms (whether they are equal or not) was determined in the GO graph. Mapping GO:0005768 (localization from abstract) and GO:0005774 (localization from UniProtKB) in the GO graph revealed that the localization annotation extracted from the abstract is novel compared to the annotation in UniProtKB.

Visualization of the relationship between GO terms GO:0005768 (endosome) and GO:0005774 (vacuolar membrane) in the GO graph. Because the path of GO:0005768 to the root (cellular component) does not intersect with GO:0005774, the abstract annotation of the endosomal subcellular localization is considered novel with respect to the UniProtKB annotation. The figure was made using QuickGO.


License

The corpus is distributed under the Creative Commons Attribution 4.0 (CC-BY 4.0) license. This means that you are allowed to share and adapt the corpus for any purpose, as long as you give credit to the original work.
