TEITOK is a web-based platform for viewing, creating, and editing corpora with both rich textual mark-up and linguistic annotation, initially developed at the Centro de Linguística da Universidade de Lisboa, later at CELGA-ILTEC, and currently maintained at the ÚFAL institute of Charles University, Prague.

The system has a modular design, with numerous modules that let it serve a wide range of corpus types.

Below are examples of some of those modules and of the types of corpora TEITOK can deal with. More modules are added frequently, and it is possible to add custom modules as well.

Historical Corpora
For historical corpora, TEITOK can align the transcription with the facsimile image, store multiple orthographic realizations so that several editions of a text are combined in a single XML file, and build a searchable document map showing where in the world particular phenomena are more frequent.

TEITOK is freely available to anybody who wishes to create richly annotated textual corpora, and runs on any Linux-based web server.

Features

Manuscript-based corpora

Align your manuscript with your transcript
Display each manuscript line with its transcription
Transcribe directly from the manuscript
Search directly for manuscript fragments
Keep multiple editions within the same environment

Stand-off Annotations

Add stand-off annotations to any corpus file
Edit using an efficient interface
Annotate over discontinuous regions
Incorporate annotations into the CQP corpus

Audio-based corpora

Align your audio with your transcription
Transcribe directly from the audio file
Scroll the transcription vertically alongside a horizontally scrolling waveform
Search directly for audio segments

Dependency Grammar

Keep dependency relations inside any corpus type
Visualize dependency trees for any sentence
Edit trees easily
Search using dependency relations

Geolocation Coordinates

Map documents onto the world map
Documents are clustered into counted groups
Access the documents from the map
Compare corpus queries on the world map

Edit from CQP Query

Search for words that are often incorrectly annotated
Click on any token in a KWIC list to edit it
Edit all results in a systematic way
Edit each result individually in a list
Pre-modify each result by a regular expression

Search

The rich XML format used in TEITOK is hard to search through. For easier access, all corpora are therefore indexed using the Corpus Workbench (CWB), allowing texts to be searched efficiently with the rich query language that CWB provides. Words are indexed in CWB with their various orthographic forms, providing many ways to search through the data.
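As a rough illustration of what indexing words under several orthographic forms can look like, the sketch below flattens a small tokenized XML fragment into the tab-separated, one-token-per-line "vertical" text that CWB's encoder reads. The `<tok>` element, the `nform` and `lemma` attribute names, and the `to_vertical` helper are assumptions made for this example; they are not taken from TEITOK's actual export code.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: the element and attribute names below
# (<tok>, nform, lemma) are assumptions for illustration only.
SAMPLE = """<text id="doc1">
<s id="s-1"><tok id="w-1" nform="said" lemma="say">sayd</tok> <tok id="w-2" lemma="he">he</tok></s>
</text>"""


def to_vertical(xml_string):
    """Flatten tokenized XML into CWB-style vertical text.

    Each token becomes one line with three tab-separated columns:
    the written form, the normalized form, and the lemma.
    Sentence boundaries are kept as XML tags on their own lines.
    """
    root = ET.fromstring(xml_string)
    lines = []
    for sent in root.iter("s"):
        lines.append(f'<s id="{sent.get("id", "")}">')
        for tok in sent.iter("tok"):
            # The written form is the text content of the token.
            form = "".join(tok.itertext()).strip()
            # Fall back to the written form when no normalization is given.
            nform = tok.get("nform", form)
            # Use "_" as a placeholder when no lemma is given.
            lemma = tok.get("lemma", "_")
            lines.append("\t".join([form, nform, lemma]))
        lines.append("</s>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_vertical(SAMPLE))
```

With the columns declared as positional attributes at encoding time, a CQP query such as `[nform="said"]` would match the token regardless of its original spelling, while `[word="sayd"]` would match only the written form. The attribute names in those queries follow the assumptions above.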

The types of corpora that TEITOK is meant for are very labour-intensive to build: for ancient texts, hardly any of the data is available in digital format, so the sources have to be scanned. In many cases OCR will not work, and even human readers often find the texts very hard to read. The data also displays a lot of orthographic variation, so much of the linguistic annotation, including normalization, has to be done by hand. As a result, most corpora created with TEITOK are limited in size, and searching them for linguistic properties does not yield many results. TEITOK therefore offers the option to index the corpus in a central database, which can be searched via this site. Each search result displays only the direct context of the word and links directly to that word in the original text on the site of the project it originated from. This way, it is possible to search through multiple corpora at the same time and get access to the full original data in a way that prominently features the original project.
