Knowledge extraction
Knowledge extraction is the creation of knowledge from structured (relational databases, XML) and unstructured (text, documents, images) sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to information extraction (NLP) and ETL (data warehouse), the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema.
Information extraction
Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing, such as automatic annotation and content extraction from images/audio/video/documents, can also be seen as information extraction. Due to the difficulty of the problem, current approaches to IE (as of 2010) focus on narrowly restricted domains.
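The contract of an IE system can be illustrated in miniature: unstructured text goes in, a structured record comes out. The sketch below uses a single hand-written regular expression over one sentence pattern; the field names and the pattern itself are illustrative assumptions, not a real IE pipeline, which would use full NLP machinery.

```python
import re

def extract_acquisition(sentence):
    """Toy information extraction: pull a (buyer, quantity, company)
    record out of one fixed sentence pattern. Illustrates only the
    input/output contract of IE, not a realistic approach."""
    pattern = r"(\w+) bought ([\d,]+) shares of ([\w .]+?)\.?$"
    m = re.match(pattern, sentence)
    if not m:
        return None  # sentence does not fit the known pattern
    buyer, qty, company = m.groups()
    return {"buyer": buyer,
            "quantity": int(qty.replace(",", "")),
            "company": company}

print(extract_acquisition("Jim bought 300 shares of Acme Corp."))
```

The point is the shape of the output: free text becomes a record with typed fields that downstream systems can query.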
Entity linking
In natural language processing, entity linking, also referred to as named-entity linking (NEL), named-entity disambiguation (NED), named-entity recognition and disambiguation (NERD) or named-entity normalization (NEN), is the task of assigning a unique identity to entities (such as famous individuals, locations, or companies) mentioned in text. For example, given the sentence "Paris is the capital of France", the idea is to determine that "Paris" refers to the city of Paris and not to Paris Hilton or any other entity that could be referred to as "Paris".
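The "Paris" example can be sketched as a lookup-and-disambiguate step: a knowledge base maps the surface form to candidate entities, and the candidate whose description best matches the mention's context wins. The knowledge-base entries, identifiers, and word-overlap scoring below are all simplifying assumptions; real linkers use much richer features.

```python
# Minimal entity-linking sketch. KB contents and IDs are made up for
# illustration; scoring is plain word overlap with the sentence.
KB = {
    "Paris": [
        {"id": "Paris_(city)",
         "description": "capital and largest city of France"},
        {"id": "Paris_Hilton",
         "description": "American media personality and businesswoman"},
    ],
}

def link(mention, sentence):
    context = set(sentence.lower().split())
    def score(cand):
        # count words shared between context and candidate description
        return len(context & set(cand["description"].lower().split()))
    return max(KB[mention], key=score)

print(link("Paris", "Paris is the capital of France")["id"])  # -> Paris_(city)
```

Given "Paris is the capital of France", the city candidate shares "capital", "of", and "France" with the sentence, so it outscores the person and is selected.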
Coreference
In linguistics, coreference, sometimes written co-reference, occurs when two or more expressions refer to the same person or thing; they have the same referent. For example, in "Bill said Alice would arrive soon, and she did", the words "Alice" and "she" refer to the same person. Co-reference is often non-trivial to determine. For example, in "Bill said he would come", the word "he" may or may not refer to Bill.
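One of the oldest heuristics for pronoun coreference is "most recent compatible antecedent": scan backwards from the pronoun for a mention that agrees in gender. The gender lists and recency rule below are deliberate simplifications, which is exactly why cases like "Bill said he would come" stay ambiguous for real systems.

```python
# Naive coreference sketch: resolve a pronoun to the most recent
# preceding name of compatible gender. Name lists are toy assumptions.
FEMALE_NAMES = {"Alice", "Mary"}
MALE_NAMES = {"Bill", "Jim"}

def resolve(pronoun, preceding_tokens):
    wanted = FEMALE_NAMES if pronoun.lower() in {"she", "her"} else MALE_NAMES
    for tok in reversed(preceding_tokens):  # scan right to left
        if tok in wanted:
            return tok
    return None

tokens = "Bill said Alice would arrive soon and".split()
print(resolve("she", tokens))  # -> Alice
```

The heuristic gets the example from the text right, but it would also confidently (and possibly wrongly) resolve "he" to "Bill" in the ambiguous sentence, illustrating why coreference is non-trivial.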
Personal pronoun
Personal pronouns are pronouns that are associated primarily with a particular grammatical person – first person (as I), second person (as you), or third person (as he, she, it, they). Personal pronouns may also take different forms depending on number (usually singular or plural), grammatical or natural gender, case, and formality. The term "personal" is used here purely to signify the grammatical sense; personal pronouns are not limited to people and can also refer to animals and objects (as the English personal pronoun it usually does).
He (pronoun)
In Modern English, he is a singular, masculine, third-person pronoun. In Standard Modern English, he has four shapes representing five distinct word forms:
he: the nominative (subjective) form
him: the accusative (objective) form (also called the oblique case)
his: the dependent and independent genitive (possessive) forms
himself: the reflexive form
Old English had a single third-person pronoun (from the Proto-Germanic demonstrative base *khi-, from PIE *ko- "this"), which had a plural and three genders in the singular.
Named-entity recognition
Named-entity recognition (NER) (also known as (named) entity identification, entity chunking, and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc. Most research on NER/NEE systems has been structured as taking an unannotated block of text, such as this one: "Jim bought 300 shares of Acme Corp."
Binding (linguistics)
In linguistics, binding is the phenomenon in which anaphoric elements such as pronouns are grammatically associated with their antecedents. For instance in the English sentence "Mary saw herself", the anaphor "herself" is bound by its antecedent "Mary". Binding can be licensed or blocked in certain contexts or syntactic configurations, e.g. the pronoun "her" cannot be bound by "Mary" in the English sentence "Mary saw her". While all languages have binding, restrictions on it vary even among closely related languages.
Natural language processing
Natural language processing (NLP) is an interdisciplinary subfield of linguistics and computer science. It is primarily concerned with processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic (i.e. statistical and, most recently, neural network-based) machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them.
Reflexive pronoun
A reflexive pronoun is a pronoun that refers to another noun or pronoun (its antecedent) within the same sentence. In the English language specifically, a reflexive pronoun will end in -self or -selves, and refer to a previously named noun or pronoun (myself, yourself, ourselves, themselves, etc.). English intensive pronouns, used for emphasis, take the same form. In generative grammar, a reflexive pronoun is an anaphor that must be bound by its antecedent (see binding).
Gender neutrality in languages with gendered third-person pronouns
A third-person pronoun is a pronoun that refers to an entity other than the speaker or listener. Some languages with gender-specific pronouns have them as part of a grammatical gender system, a system of agreement where most or all nouns have a value for this grammatical category. A few languages with gender-specific pronouns, such as English, Afrikaans, Defaka, Khmu, Malayalam, Tamil, and Yazgulyam, lack grammatical gender; in such languages, gender usually adheres to "natural gender", which is often based on biological sex.
Statistical machine translation
Statistical machine translation (SMT) was a machine translation approach that superseded the previous rule-based approach, which required an explicit description of each and every linguistic rule; this was costly and often did not generalize to other languages. Since 2003, the statistical approach itself has been gradually superseded by the deep learning-based neural network approach. The first ideas of statistical machine translation were introduced by Warren Weaver in 1949, including the idea of applying Claude Shannon's information theory.
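The information-theoretic framing Weaver anticipated became the noisy-channel model: choose the target sentence e maximizing p(f|e)·p(e), the product of a translation model and a language model. The probability tables below are made-up toy numbers, not estimates from a parallel corpus, but they show how the language model filters out disfluent candidates the translation model alone would accept.

```python
# Noisy-channel SMT in miniature for a hypothetical source "la maison".
# Both tables are invented toy probabilities for illustration only.
translation_model = {   # p(f | e): adequacy of each English candidate
    "the house": 0.9,
    "the home": 0.7,
    "house the": 0.9,   # word salad, but adequate word-for-word
}
language_model = {      # p(e): fluency of each English candidate
    "the house": 0.05,
    "the home": 0.02,
    "house the": 0.0001,
}

def decode(candidates):
    # pick e maximizing p(f|e) * p(e)
    return max(candidates, key=lambda e: translation_model[e] * language_model[e])

print(decode(list(translation_model)))  # -> the house
```

Note that "house the" scores as well as "the house" under the translation model alone; the language-model factor is what rejects it, which is the core division of labor in SMT.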
Text mining
Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources." Written resources may include websites, books, emails, reviews, and articles. High-quality information is typically obtained by devising patterns and trends by means such as statistical pattern learning. According to Hotho et al.
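A minimal instance of deriving patterns from written resources is counting which content words dominate a corpus after stopword removal. The stopword list and the two-document corpus below are illustrative assumptions; real text-mining pipelines use far larger corpora and weighting schemes such as tf-idf rather than raw counts.

```python
from collections import Counter
import re

# Term-frequency sketch of text mining: surface the most frequent
# content words across a tiny, made-up corpus.
STOPWORDS = {"the", "a", "of", "is", "and", "to", "in", "for", "from"}

def top_terms(docs, k=3):
    counts = Counter()
    for doc in docs:
        for w in re.findall(r"[a-z]+", doc.lower()):
            if w not in STOPWORDS:
                counts[w] += 1
    return [w for w, _ in counts.most_common(k)]

docs = [
    "Text mining derives information from text.",
    "Mining text for patterns uses statistical learning.",
]
print(top_terms(docs))
```

Even this crude count correctly surfaces "text" and "mining" as the corpus's dominant terms, a toy version of the statistical pattern learning mentioned above.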
Neural machine translation
Neural machine translation (NMT) is an approach to machine translation that uses an artificial neural network to predict the likelihood of a sequence of words, typically modeling entire sentences in a single integrated model. NMT models require only a fraction of the memory needed by traditional statistical machine translation (SMT) models. Furthermore, unlike conventional translation systems, all parts of the neural translation model are trained jointly (end-to-end) to maximize translation performance.
Translation
Translation is the communication of the meaning of a source-language text by means of an equivalent target-language text. The English language draws a terminological distinction (which does not exist in every language) between translating (a written text) and interpreting (oral or signed communication between users of different languages); under this distinction, translation can begin only after the appearance of writing within a language community.
Interlinear gloss
In linguistics and pedagogy, an interlinear gloss is a gloss (series of brief explanations, such as definitions or pronunciations) placed between lines, such as between a line of original text and its translation into another language. When glossed, each line of the original text acquires one or more corresponding lines of transcription known as an interlinear text or interlinear glossed text (IGT), interlinear for short. Such glosses help the reader follow the relationship between the source text and its translation, and the structure of the original language.
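The defining mechanical property of interlinear text is vertical alignment: each source word sits directly above its gloss. That alignment can be sketched as a small formatter that pads each word/gloss pair to a common column width. The German example and its gloss labels are illustrative, not drawn from the text above.

```python
# Interlinear gloss formatter sketch: align a line of source words with
# a line of glosses, one column per word. Example data is illustrative.
def interlinear(words, glosses):
    # each column is as wide as the longer of (word, gloss)
    widths = [max(len(w), len(g)) for w, g in zip(words, glosses)]
    line1 = "  ".join(w.ljust(n) for w, n in zip(words, widths))
    line2 = "  ".join(g.ljust(n) for g, n in zip(glosses, widths))
    return line1.rstrip() + "\n" + line2.rstrip()

# "ich gehe" (German, "I go") with toy morpheme glosses
print(interlinear(["ich", "gehe"], ["1SG", "go.1SG.PRS"]))
```

Padding to the per-column maximum is what keeps each gloss under its word even when gloss strings are much longer than the source forms.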
Inalienable possession
In linguistics, inalienable possession is a type of possession in which a noun is obligatorily possessed by its possessor. Nouns or nominal affixes in an inalienable possession relationship cannot exist independently or be "alienated" from their possessor. Inalienable nouns include body parts (such as leg, which is necessarily "someone's leg" even if it is severed from the body), kinship terms (such as mother), and part-whole relations (such as top).
Thou
The word thou (/ðaʊ/) is a second-person singular pronoun in English. It is now largely archaic, having been replaced in most contexts by the word you, although it remains in use in parts of Northern England and in Scots (/ðuː/). Thou is the nominative form; the oblique/objective form is thee (functioning as both accusative and dative); the possessive is thy (adjective) or thine (as an adjective before a vowel or as a possessive pronoun); and the reflexive is thyself.
Grammatical gender
In linguistics, a grammatical gender system is a specific form of a noun class system, where nouns are assigned to gender categories that are often not related to the real-world qualities of the entities denoted by those nouns. In languages with grammatical gender, most or all nouns inherently carry one value of the grammatical category called gender; the values present in a given language (of which there are usually two or three) are called the genders of that language.