9 Linguistic research

Johannes Benzing was a German Turkic specialist and Diplomat in the era of National Socialism and in the Federal Republic of Germany. Benzing worked as a Linguist in Pers Z S, the signals intelligence agency of the German Foreign Office. He was the youngest senior official (German:Beamter) and headed the section from October 1939 until September 1944.

The Bijankhan corpus is a tagged corpus that is suitable for natural language processing (NLP) research on the Persian language. This collection is gathered from daily news and common texts. In this collection all documents are categorized into different subjects such as political, cultural, etc.; in about 4300 different subject categories. The corpus contains about 2.6 million manually tagged words with a tag set that contains 550 Persian part-of-speech tags.

CorCenCC or the National Corpus of Contemporary Welsh is a language resource for Welsh speakers, Welsh learners, Welsh language researchers, and anyone who is interested in the Welsh language. CorCenCC is a freely accessible collection of multiple language samples, gathered from real-life communication, and presented in the searchable online CorCenCC text corpus. The corpus is accompanied by an online teaching and learning toolkit – Y Tiwtiadur – which draws directly on the data from the corpus to provide resources for Welsh language learning at all ages and levels.

The Hamshahri Corpus is a sizable Persian corpus based on the Iranian newspaper Hamshahri, one of the first online Persian-language newspapers in Iran. It was initially collected and compiled by Ehsan Darrudi at DBRG Group of University of Tehran. Later, a team headed by Ale Ahmad built on this corpus and created the first Persian text collection suitable for information retrieval evaluation tasks.

LIVAC is an uncommon language corpus dynamically maintained since 1995. Different from other existing corpora, LIVAC has adopted a rigorous and regular as well as "Windows" approach in processing and filtering massive media texts from representative Chinese speech communities such as Hong Kong, Macau, Taipei, Singapore, Shanghai, Beijing, as well as Guangzhou, and Shenzhen. The contents are thus deliberately repetitive in most cases, represented by textual samples drawn from editorials, local and international news, cross-Formosan Straits news, as well as news on finance, sports and entertainment. By 2020, 2.7 billion characters of news media texts have been filtered so far, of which 680 million characters have been processed and analyzed and have yielded an expanding Pan-Chinese dictionary of 2.3 million words from the Pan-Chinese printed media. Through rigorous analysis based on computational linguistic methodology, LIVAC has at the same time accumulated a large amount of accurate and meaningful statistical data on the Chinese language and their speech communities in the Pan-Chinese region, and the results show considerable and important variations.

The Quranic Arabic Corpus is an annotated linguistic resource consisting of 77,430 words of Quranic Arabic. The project aims to provide morphological and syntactic annotations for researchers wanting to study the language of the Quran.

Sketch Engine is a corpus manager and text analysis software developed by Lexical Computing Limited since 2003. Its purpose is to enable people studying language behaviour to search large text collections according to complex and linguistically motivated queries. Sketch Engine gained its name after one of the key features, word sketches: one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour. Currently, it supports and provides corpora in 90+ languages.

Tatoeba is a free collaborative online database of example sentences geared towards foreign language learners. Its name comes from the Japanese term "tatoeba" (例えば), meaning "for example". Unlike other online dictionaries, which focus on words, Tatoeba focuses on translation of complete sentences. In addition, the structure of the database and interface emphasize one-to-many relationships. Not only can a sentence have multiple translations within a single language, but its translations into all languages are readily visible, as are indirect translations that involve a chain of stepwise links from one language to another.

Wikitongues is an American non-profit organization registered in the state of New York. It aims to sustain and promote all the languages in the world. It was founded by Frederico Andrade, Daniel Bogre Udell and Lindie Botes in 2014.