32 Natural language processing

Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The result is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Abdul Majid Bhurgri Institute of Language Engineering is an autonomous body under the administrative control of the Culture, Tourism and Antiquities Department, Government of Sindh. It was established to bring the Sindhi language on par with national and international languages in all computational processes and in natural language processing.

The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as language detection, tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing and coreference resolution. These tasks are usually required to build more advanced text processing services.

Arabic Ontology

Arabic Ontology is a linguistic ontology for the Arabic language, which can be used as an Arabic wordnet with ontologically clean content. It is also used as a tree of the concepts (meanings) of Arabic terms. It is a formal representation of the concepts that Arabic terms convey. Its content is ontologically well-founded and is benchmarked against scientific advances and rigorous knowledge sources rather than against speakers' naïve beliefs, as is typical of wordnets. The Ontology tree can be explored online.

Attensity provides social analytics and engagement applications for social customer relationship management. Attensity's text analytics software applications extract facts, relationships and sentiment from unstructured data, which comprise approximately 85% of the information companies store electronically.

Bottlenose.com, also known as Bottlenose, is an enterprise trend intelligence company that analyzes big data and business data to detect trends for brands. It helps Fortune 500 enterprises discover and track emerging trends that affect their brands. The company uses natural language processing, sentiment analysis, statistical algorithms, data mining and machine learning heuristics to determine trends, and has a search engine that gathers information from social networks. KPMG Capital has invested a "substantial amount" in the company.

Conversica is a US-based cloud software technology company. Conversica offers a suite of Intelligent Virtual Assistants (IVAs) for business, with a focus on customer experience functions. Powered by artificial intelligence, the IVAs interact with leads and customers in a human-like way, much as an entry-level employee would. The IVA software interacts over multiple channels, including email and SMS text messages, and in multiple languages. Conversica is a pioneer in providing AI-driven lead engagement software for marketing and sales organizations. Conversica is headquartered in Silicon Valley.

DALL-E is an artificial intelligence program that creates images from textual descriptions, revealed by OpenAI on January 5, 2021. It uses a 12-billion parameter version of the GPT-3 Transformer model to interpret natural language inputs and generate corresponding images. It can create images of realistic objects as well as objects that do not exist in reality. Its name is a portmanteau of WALL-E and Dalí.

Eclipse Deeplearning4j is a programming library written in Java for the Java virtual machine (JVM). It is a framework with wide support for deep learning algorithms. Deeplearning4j includes implementations of the restricted Boltzmann machine, deep belief net, deep autoencoder, stacked denoising autoencoder and recursive neural tensor network, word2vec, doc2vec, and GloVe. These algorithms all include distributed parallel versions that integrate with Apache Hadoop and Spark.

Documenting Hate is a project of ProPublica, in collaboration with a number of journalistic, academic, and computing organizations, for systematic tracking of hate crimes and bias incidents. It uses an online form to facilitate reporting of incidents by the general public. Since August 2017, it has also used machine learning and natural language processing techniques to monitor and collect news stories about hate crimes and bias incidents. As of October 2017, over 100 news organizations had joined the project.

In natural language processing, entity linking, also referred to as named-entity linking (NEL), named-entity disambiguation (NED), named-entity recognition and disambiguation (NERD) or named-entity normalization (NEN), is the task of assigning a unique identity to entities mentioned in text. For example, given the sentence "Paris is the capital of France", the idea is to determine that "Paris" refers to the city of Paris and not to Paris Hilton or any other entity that could be referred to as "Paris". Entity linking differs from named-entity recognition (NER) in that NER identifies the occurrence of a named entity in text but does not identify which specific entity it is.
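The "Paris" example can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the hand-written `KNOWLEDGE_BASE`, the entity names, the keyword sets, and the idea of scoring candidates by word overlap with the sentence. Real entity linkers rely on much larger knowledge bases and on learned models.

```python
import re

# Hypothetical toy knowledge base: mention -> candidate entities, each with
# a few context keywords. Invented for this example only.
KNOWLEDGE_BASE = {
    "paris": {
        "Paris (city)": {"capital", "france", "city", "seine"},
        "Paris Hilton": {"hilton", "celebrity", "hotel", "heiress"},
    },
}

def link_entity(mention, sentence):
    """Return the candidate entity whose keywords best overlap the sentence."""
    candidates = KNOWLEDGE_BASE.get(mention.lower())
    if not candidates:
        return None  # mention is unknown to the toy knowledge base
    # Context = lowercase words of the sentence, excluding the mention itself.
    context = set(re.findall(r"[a-z]+", sentence.lower())) - {mention.lower()}
    # Pick the candidate sharing the most keywords with the context.
    return max(candidates, key=lambda entity: len(candidates[entity] & context))

print(link_entity("Paris", "Paris is the capital of France"))
```

An NER system would stop at marking "Paris" as a named entity. The linking step is the choice between the candidate entries.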

A Gorn address is a method of identifying and addressing any node within a tree data structure. This notation is often used for identifying nodes in a parse tree defined by phrase structure rules.
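As a minimal sketch, a Gorn address can be treated as a sequence of child indices followed from the root, with the empty sequence denoting the root itself. The `(label, children)` tuple representation, the function names, and the zero-based indexing are assumptions of this example. Some presentations number children from 1.

```python
def node_at(tree, address):
    """Follow a Gorn address (a sequence of child indices) from the root."""
    node = tree
    for index in address:
        _, children = node
        node = children[index]
    return node

def all_addresses(tree, prefix=()):
    """Yield (address, label) pairs for every node, in pre-order."""
    label, children = tree
    yield prefix, label
    for i, child in enumerate(children):
        yield from all_addresses(child, prefix + (i,))

# A small parse tree for "the dog barks"; each node is (label, children).
parse_tree = ("S", [
    ("NP", [("Det", [("the", [])]), ("N", [("dog", [])])]),
    ("VP", [("V", [("barks", [])])]),
])

for address, label in all_addresses(parse_tree):
    print(address, label)
```

Here the address `(0, 1)` names the `N` node: the second child of the first child of the root.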

Generative Pre-trained Transformer 2 (GPT-2) is an open-source artificial intelligence created by OpenAI in February 2019. GPT-2 translates text, answers questions, summarizes passages, and generates text output on a level that, while sometimes indistinguishable from that of humans, can become repetitive or nonsensical when generating long passages. It is a general-purpose learner; it was not specifically trained to do any of these tasks, and its ability to perform them is an extension of its general ability to accurately synthesize the next item in an arbitrary sequence. GPT-2 was created as a "direct scale-up" of OpenAI's 2018 GPT model, with a ten-fold increase in both its parameter count and the size of its training dataset.

A grammar checker, in computing terms, is a program, or part of a program, that attempts to verify written text for grammatical correctness. Grammar checkers are most often implemented as a feature of a larger program, such as a word processor, but are also available as a stand-alone application that can be activated from within programs that work with editable text.

Grammar induction is the process in machine learning of learning a formal grammar from a set of observations, thus constructing a model which accounts for the characteristics of the observed objects. More generally, grammatical inference is that branch of machine learning where the instance space consists of discrete combinatorial objects such as strings, trees and graphs.

iGlue is an experimental database with detailed search options, containing entities and an information-editing tool. It organizes interrelated images, videos, individuals, institutions, objects, websites and geographical locations into cohesive data structures.

Jive, also known as the Jive Filter, is a novelty computer program that converts plain English to a comic dialect known as "jive", a parody of African American Vernacular English. Some versions of the filter were adapted to parody other forms of English speech, such as valspeak, cockney, geordie, Pig Latin, and even the Swedish Chef. The last form is sometimes known as the "Encheferator" or "Encheferizer". This family of programs became quite popular in the late 1980s.

Just This Once is a 1993 romance novel written in the style of Jacqueline Susann by a Macintosh IIcx computer named "Hal" in collaboration with its programmer, Scott French. French reportedly spent $40,000 and 8 years developing an artificial intelligence program to analyze Susann's works and attempt to create a novel that Susann might have written. A legal dispute between the estate of Jacqueline Susann and the publisher resulted in a settlement to split the profits, and the book was referenced in several legal journal articles about copyright laws. The book had two small print runs totaling 35,000 copies, receiving mixed reviews.

LIVAC is an uncommon language corpus dynamically maintained since 1995. Unlike other existing corpora, LIVAC has adopted a rigorous, regular, "Windows" approach in processing and filtering massive media texts from representative Chinese speech communities such as Hong Kong, Macau, Taipei, Singapore, Shanghai, Beijing, Guangzhou, and Shenzhen. The contents are thus deliberately repetitive in most cases, represented by textual samples drawn from editorials, local and international news, cross-Formosan Straits news, as well as news on finance, sports and entertainment. By 2020, 2.7 billion characters of news media texts had been filtered, of which 680 million characters had been processed and analyzed, yielding an expanding Pan-Chinese dictionary of 2.3 million words from the Pan-Chinese printed media. Through rigorous analysis based on computational linguistic methodology, LIVAC has at the same time accumulated a large amount of accurate and meaningful statistical data on the Chinese language and its speech communities in the Pan-Chinese region, and the results show considerable and important variations.

MeaningCloud is a Software as a Service product that enables users to embed text analytics and semantic processing in any application or system. It was previously branded as Textalytics.

METEOR is a metric for the evaluation of machine translation output. The metric is based on the harmonic mean of unigram precision and recall, with recall weighted higher than precision. It also has several features that are not found in other metrics, such as stemming and synonymy matching, along with the standard exact word matching. The metric was designed to fix some of the problems found in the more popular BLEU metric, and also produce good correlation with human judgement at the sentence or segment level. This differs from the BLEU metric in that BLEU seeks correlation at the corpus level.
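The recall-weighted harmonic mean can be sketched as follows. This is a simplified illustration, not the full metric: it uses exact word matching only and omits stemming, synonymy matching, and any word-order penalty. The function name and the default weight of 9 (giving 10PR / (R + 9P), the weighting usually quoted for the original METEOR formulation) are assumptions of this example.

```python
from collections import Counter

def weighted_f_mean(candidate, reference, recall_weight=9.0):
    """Harmonic mean of unigram precision and recall, recall weighted higher.

    Exact-match unigrams only. With recall_weight=9 this reduces to
    10*P*R / (R + 9*P). The weight is an assumption of this sketch.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Clipped unigram matches: each reference word can be matched only once.
    matches = sum((Counter(cand) & Counter(ref)).values())
    if matches == 0:
        return 0.0
    precision = matches / len(cand)
    recall = matches / len(ref)
    return ((1 + recall_weight) * precision * recall
            / (recall + recall_weight * precision))

# A short candidate has perfect precision but low recall, so it scores low.
print(weighted_f_mean("the cat", "the cat sat on the mat"))
# Swapping the roles gives perfect recall and a much higher score.
print(weighted_f_mean("the cat sat on the mat", "the cat"))
```

Because recall dominates, a candidate that covers the whole reference but adds extra words is penalized less than one that is precise but incomplete.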

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.
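A minimal sketch of n-gram extraction, assuming the sample has already been split into items. The function name `ngrams` is chosen for this example. The same function works for words, letters, or any other item type.

```python
def ngrams(items, n):
    """Return all contiguous runs of n items, in order of appearance."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L contains L - n + 1 n-grams (none if L < n).
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

# Word bigrams (also called shingles when the items are words).
print(ngrams("to be or not to be".split(), 2))

# Letter trigrams: a string is simply a sequence of characters.
print(ngrams("text", 3))
```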

Niki.ai

Niki is an artificial intelligence company headquartered in Bangalore, Karnataka. It was founded in May 2015 by IIT Kharagpur graduates Sachin Jaiswal, Keshav Prawasi, Shishir Modi and Nitin Babel.

Rhetorical structure theory (RST) is a theory of text organization that describes relations that hold between parts of text. It was originally developed by William Mann and Sandra Thompson of the University of Southern California's Information Sciences Institute (ISI) and defined in a 1988 paper. The theory was developed as part of studies of computer-based text generation. Natural language researchers later began using RST in text summarization and other applications. It explains coherence by postulating a hierarchical, connected structure of texts. In 2000, Daniel Marcu, also of ISI, demonstrated that practical discourse parsing and text summarization also could be achieved using RST.

Sentence embedding is the collective name for a set of techniques in natural language processing (NLP) where sentences are mapped to vectors of real numbers.
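One simple baseline, shown here only as an illustration, is to average the vectors of the words in a sentence. The toy word vectors and the function name below are invented for this sketch. Practical systems typically use learned encoders rather than plain averaging.

```python
# Hypothetical two-dimensional word vectors, made up for this example.
TOY_WORD_VECTORS = {
    "dogs": [1.0, 0.0],
    "bark": [0.0, 1.0],
    "cats": [1.0, 0.5],
}

def embed_sentence(sentence, word_vectors=TOY_WORD_VECTORS):
    """Map a sentence to the mean of its known word vectors."""
    known = [word_vectors[w] for w in sentence.lower().split()
             if w in word_vectors]
    if not known:
        return None  # no word of the sentence has a vector
    dim = len(known[0])
    # Component-wise average over the word vectors.
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]

print(embed_sentence("Dogs bark"))
```

Averaging discards word order, which is one reason it serves as a baseline rather than a complete solution.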

In software, a spell checker is a software feature that checks for misspellings in a text. Spell-checking features are often embedded in software or services, such as a word processor, email client, electronic dictionary, or search engine.
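At its simplest, checking for misspellings can be sketched as a lookup against a list of known words. The tiny word list and the function name below are assumptions for illustration. Production spell checkers use far larger dictionaries and additional logic.

```python
import re

# Hypothetical miniature dictionary, for illustration only.
KNOWN_WORDS = {"this", "is", "a", "simple", "text"}

def find_misspellings(text, known_words=KNOWN_WORDS):
    """Return the words in the text that are not in the known-word list."""
    words = re.findall(r"[A-Za-z']+", text)
    return [w.lower() for w in words if w.lower() not in known_words]

print(find_misspellings("Ths is a smple text."))
```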

Tatoeba is a free collaborative online database of example sentences geared towards foreign language learners. Its name comes from the Japanese term "tatoeba" (例えば), meaning "for example". Unlike other online dictionaries, which focus on words, Tatoeba focuses on translation of complete sentences. In addition, the structure of the database and interface emphasize one-to-many relationships. Not only can a sentence have multiple translations within a single language, but its translations into all languages are readily visible, as are indirect translations that involve a chain of stepwise links from one language to another.

Text Nailing (TN) is an information extraction method of semi-automatically extracting structured information from unstructured documents. The method allows a human to interactively review small blobs of text out of a large collection of documents, to identify potentially informative expressions. The identified expressions can be used then to enhance computational methods that rely on text as well as advanced natural language processing (NLP) techniques. TN combines two concepts: 1) human-interaction with narrative text to identify highly prevalent non-negated expressions, and 2) conversion of all expressions and notes into non-negated alphabetical-only representations to create homogeneous representations.

The Text REtrieval Conference (TREC) is an ongoing series of workshops focusing on a list of different information retrieval (IR) research areas, or tracks. It is co-sponsored by the National Institute of Standards and Technology (NIST) and the Intelligence Advanced Research Projects Activity, and began in 1992 as part of the TIPSTER Text program. Its purpose is to support and encourage research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies and to increase the speed of lab-to-product transfer of technology.

TipTop Technologies offers a real-time web, social search engine with a unique platform for semantic analysis of natural language. TipTop Search provides results capturing individual and group sentiment, opinions, and experiences from content of various sorts including real-time messages from Twitter or consumer product reviews on Amazon.com. TipTop Technologies and ITC Infotech have worked together to develop a semantic engine and search interface for both enterprise and consumer applications. TipTop's products are part of the "emerging Web 3.0 applications which use semantic technologies to augment the underlying Web system's functionalities."

Voice computing is the discipline that develops hardware or software to process voice inputs.

In natural language processing (NLP), word embedding is a term used for the representation of words for text analysis, typically in the form of a real-valued vector that encodes the meaning of the word such that words that are closer in the vector space are expected to be similar in meaning. Word embeddings can be obtained using a set of language modeling and feature learning techniques where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually, it involves a mathematical embedding from a space with many dimensions per word to a continuous vector space with a much lower dimension.
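The idea that closeness in the vector space reflects similarity in meaning can be illustrated with cosine similarity, a common closeness measure. The three-dimensional vectors below are invented for this sketch and do not come from any trained model. Real embeddings usually have tens to hundreds of dimensions.

```python
import math

# Hypothetical toy embeddings, hand-picked so related words lie close together.
TOY_VECTORS = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.90, 0.15],
    "apple": [0.10, 0.20, 0.95],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors=TOY_VECTORS):
    """Return the other vocabulary word whose vector is closest to `word`."""
    others = (w for w in vectors if w != word)
    return max(others,
               key=lambda w: cosine_similarity(vectors[word], vectors[w]))

print(most_similar("king"))
```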