TEITOK

TEITOK is a web-based platform for viewing, creating, and editing corpora with both rich textual mark-up and linguistic annotation.


What TEITOK can do for you

TEITOK is a web-based platform for viewing, creating, and editing corpora with both rich textual mark-up and linguistic annotation, initially developed at the Centro de Linguística da Universidade de Lisboa, later at CELGA-ILTEC, and currently maintained at the ÚFAL institute of Charles University, Prague.

The system has a modular design, with numerous modules that let it serve a wide range of corpus types.

Below are examples of these modules and of the types of corpora TEITOK can handle. New modules are added frequently, and it is also possible to add custom modules.

Historical Corpora
For historical corpora, TEITOK can align the transcription with the facsimile image, store multiple orthographic realizations so that several editions of a text are combined in a single XML file, and generate a searchable document map showing where in the world particular phenomena are more frequent.

TEITOK is freely available to anybody who wishes to create richly annotated textual corpora, and runs on any Linux-based web server.

Features

Manuscript-based corpora

  • Align your manuscript with your transcript
  • Display each manuscript line with its transcription
  • Transcribe directly from the manuscript
  • Search directly for manuscript fragments
  • Keep multiple editions within the same environment

Stand-off Annotations

  • Add stand-off annotations to any corpus file
  • Edit using an efficient interface
  • Annotate over discontinuous regions
  • Incorporate annotations into the CQP corpus
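
As an illustration of annotating over discontinuous regions, the sketch below keeps an annotation outside the text and lets it point at token identifiers that need not be adjacent. The class name, the token ids, and the sample sentence are hypothetical and are not taken from TEITOK itself.

```python
from dataclasses import dataclass


@dataclass
class StandoffAnnotation:
    """An annotation stored apart from the text it describes (hypothetical)."""
    label: str
    token_ids: list  # ids of the covered tokens; they need not be adjacent


def covered_text(annotation, tokens):
    """Join the text of the referenced tokens in document order."""
    # Position of every token id in the document (dicts keep insertion order).
    order = {tok_id: i for i, tok_id in enumerate(tokens)}
    ids = sorted(annotation.token_ids, key=order.__getitem__)
    return " ".join(tokens[tok_id] for tok_id in ids)


# Hypothetical tokenized sentence, keyed by token id.
tokens = {"w-1": "She", "w-2": "looked", "w-3": "the", "w-4": "word", "w-5": "up"}

# One annotation covering two tokens separated by other material.
phrasal_verb = StandoffAnnotation("phrasal-verb", ["w-5", "w-2"])
text = covered_text(phrasal_verb, tokens)
print(phrasal_verb.label, "->", text)
```

Because the annotation only stores identifiers, the underlying corpus file does not have to change when an annotation is added or removed.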

Audio-based corpora

  • Align your audio with your transcription
  • Transcribe directly from the audio file
  • Scroll the transcription vertically while the waveform scrolls horizontally
  • Search directly for audio segments

Dependency Grammar

  • Keep dependency relations inside any corpus type
  • Visualize dependency trees for any sentence
  • Edit trees easily
  • Search using dependency relations

Geolocation Coordinates

  • Map documents onto the world map
  • Documents are clustered into counted groups
  • Access the documents from the map
  • Compare corpus queries on the world map
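
One simple way to group documents into counted clusters is to bin their coordinates into grid cells. The sketch below assumes this grid approach purely for illustration; it is not a description of how TEITOK clusters documents, and the function name, cell size, and approximate coordinates are all hypothetical.

```python
from collections import Counter


def cluster_counts(points, cell_degrees=5.0):
    """Count documents per grid cell of cell_degrees x cell_degrees."""
    counts = Counter()
    for lat, lon in points:
        # Floor division assigns each coordinate to its grid cell index.
        cell = (int(lat // cell_degrees), int(lon // cell_degrees))
        counts[cell] += 1
    return counts


# Approximate (latitude, longitude) pairs for three hypothetical documents.
documents = [
    (40.21, -8.43),  # near Coimbra
    (41.15, -8.61),  # near Porto
    (50.08, 14.44),  # near Prague
]

counts = cluster_counts(documents)
for cell, n in counts.items():
    print(cell, n)
```

A map view would then draw one marker per cell, labelled with its count, and refine the grid as the user zooms in.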

Edit from CQP Query

  • Search for words often incorrectly annotated
  • Click on any token in a KWIC list to edit it
  • Edit all results in a systematic way
  • Edit each result individually in a list
  • Pre-modify each result by a regular expression
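
The idea of pre-modifying results by a regular expression can be sketched as follows: a pattern is applied to the current value of every hit, and the changed values are offered as proposals for review. The function name and the sample forms are hypothetical; this is a minimal sketch, not TEITOK's implementation.

```python
import re


def propose_edits(values, pattern, replacement):
    """Return (old, new) pairs for every value the pattern would change."""
    compiled = re.compile(pattern)
    proposals = []
    for old in values:
        new = compiled.sub(replacement, old)
        if new != old:  # keep only values that actually change
            proposals.append((old, new))
    return proposals


# Hypothetical word forms taken from a list of query results.
hits = ["haue", "giue", "have", "loue"]

# Propose a normalized form by rewriting a word-final "ue" as "ve".
proposals = propose_edits(hits, r"ue$", "ve")
for old, new in proposals:
    print(old, "->", new)
```

Keeping the proposals separate from the stored values leaves room to correct individual results by hand before anything is saved.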

Search

The rich XML format used in TEITOK is hard to search through directly. For easier access, all corpora are therefore indexed using the Corpus Workbench (CWB), which allows texts to be searched efficiently with the rich query language that CWB provides. Words are indexed in CWB with their various orthographic forms, providing many ways to search through the data.
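
To make the indexing step more concrete, the sketch below flattens a small tokenized XML fragment into the one-token-per-line, tab-separated "vertical" text that CWB takes as input for encoding. The element name `tok`, the attribute names `nform` and `lemma`, and the sample text are assumptions made for illustration; the actual export in TEITOK may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the element text is the original spelling, and the
# attributes hold a normalized form and a lemma (names are assumptions).
SAMPLE = """<text id="letter1">
<s><tok nform="have" lemma="have">haue</tok> <tok nform="you" lemma="you">yow</tok></s>
</text>"""


def to_vertical(xml_string, attributes=("nform", "lemma")):
    """Return one tab-separated line per token, wrapped in <s> tags."""
    root = ET.fromstring(xml_string)
    lines = []
    for sentence in root.iter("s"):
        lines.append("<s>")
        for tok in sentence.iter("tok"):
            written = (tok.text or "").strip()
            # Fall back to the written form when an attribute is missing.
            columns = [written] + [tok.get(name, written) for name in attributes]
            lines.append("\t".join(columns))
        lines.append("</s>")
    return lines


vertical = to_vertical(SAMPLE)
print("\n".join(vertical))
```

With columns like these encoded as positional attributes, a CQP query such as `[nform="have"]` would match the token regardless of its original spelling, assuming the normalized column is named `nform`.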

The types of corpora that TEITOK is meant for are very labour-intensive to build. For ancient texts, hardly any of the data will be available in digital format, so the sources have to be scanned. In many cases OCR will not work, and the texts are often very hard to read even for human readers. The data will also display a lot of orthographic variation, so much of the linguistic annotation, including normalization, will have to be done by hand. As a result, most corpora created with TEITOK will have a limited size, and searching for linguistic properties in any one of them will not yield many results. TEITOK therefore offers the option to index the corpus in a central database, which can be searched via this site. Each search result displays only the direct context of the word and links directly to that word in the original text on the site of the project it originated from. This makes it possible to search through multiple corpora at the same time while giving access to the full original data in a way that prominently features the original project.

Programming languages
  • PHP 52%
  • C++ 29%
  • JavaScript 12%
  • Perl 5%
  • HTML 1%
License
Not specified