Publication

Lausanne Historical Censuses Dataset HTR 35k

Related concepts (31)

Knowledge extraction

Knowledge extraction is the creation of knowledge from structured (relational databases, XML) and unstructured (text, documents, images) sources. The resulting knowledge needs to be in a machine-readable and machine-interpretable format and must represent knowledge in a manner that facilitates inferencing. Although it is methodically similar to information extraction (NLP) and ETL (data warehouse), the main criterion is that the extraction result goes beyond the creation of structured information or the transformation into a relational schema.

Primary source

In the study of history as an academic discipline, a primary source (also called an original source) is an artifact, document, diary, manuscript, autobiography, recording, or any other source of information that was created at the time under study. It serves as an original source of information about the topic. Similar definitions can be used in library science and other areas of scholarship, although different fields have somewhat different definitions.

Germanic peoples

The Germanic peoples were historical groups of people that once occupied Northwestern and Central Europe and Scandinavia during antiquity and into the early Middle Ages. Since the 19th century, they have traditionally been defined by the use of ancient and early medieval Germanic languages and are thus equated at least approximately with Germanic-speaking peoples, although different academic disciplines have their own definitions of what makes someone or something "Germanic".

JPEG

JPEG (/ˈdʒeɪpɛɡ/, short for Joint Photographic Experts Group) is a commonly used method of lossy compression for digital images, particularly for those images produced by digital photography. The degree of compression can be adjusted, allowing a selectable tradeoff between storage size and image quality. JPEG typically achieves 10:1 compression with little perceptible loss in image quality. Since its introduction in 1992, JPEG has been the most widely used image compression standard in the world, and the most widely used digital image format, with several billion JPEG images produced every day as of 2015.

Information extraction

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing, like automatic annotation and content extraction out of images, audio, video, and documents, could be seen as information extraction. Due to the difficulty of the problem, current approaches to IE (as of 2010) focus on narrowly restricted domains.

Automatic summarization

Automatic summarization is the process of shortening a set of data computationally, to create a subset (a summary) that represents the most important or relevant information within the original content. Artificial intelligence algorithms are commonly developed and employed to achieve this, specialized for different types of data. Text summarization is usually implemented by natural language processing methods, designed to locate the most informative sentences in a given document.
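As a rough illustration of locating the most informative sentences, the following minimal sketch scores each sentence by the average corpus frequency of its words and keeps the top-scoring ones in their original order. The function name `summarize`, the regex-based sentence splitter, and the frequency heuristic are assumptions made for this example; practical systems typically also remove stop words and use far richer models.

```python
import re
from collections import Counter


def summarize(text: str, max_sentences: int = 1) -> list[str]:
    """Return the highest-scoring sentences, in document order."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Word frequencies over the whole text (no stop-word removal).
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence: str) -> float:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore document order
    return [sentences[i] for i in keep]
```

Averaging rather than summing the word frequencies is one possible design choice; it keeps the score from simply rewarding sentence length.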

Terminology extraction

Terminology extraction (also known as term extraction, glossary extraction, term recognition, or terminology mining) is a subtask of information extraction. The goal of terminology extraction is to automatically extract relevant terms from a given corpus. In the semantic web era, a growing number of communities and networked enterprises started to access and interoperate through the internet. Modeling these communities and their information needs is important for several web applications, like topic-driven web crawlers, web services, recommender systems, etc.

Germanic languages

The Germanic languages are a branch of the Indo-European language family spoken natively by a population of about 515 million people mainly in Europe, North America, Oceania and Southern Africa. The most widely spoken Germanic language, English, is also the world's most widely spoken language with an estimated 2 billion speakers. All Germanic languages are derived from Proto-Germanic, spoken in Iron Age Scandinavia.

JPEG 2000

JPEG 2000 (JP2) is an image compression standard and coding system. It was developed from 1997 to 2000 by a Joint Photographic Experts Group committee chaired by Touradj Ebrahimi (later the JPEG president), with the intention of superseding their original JPEG standard (created in 1992), which is based on a discrete cosine transform (DCT), with a newly designed, wavelet-based method. The standardized filename extension is .jp2 for ISO/IEC 15444-1 conforming files and .jpx for the extended part-2 specifications, published as ISO/IEC 15444-2.

West Germanic languages

The West Germanic languages constitute the largest of the three branches of the Germanic family of languages (the others being the North Germanic and the extinct East Germanic languages). The West Germanic branch is classically subdivided into three branches: Ingvaeonic, which includes English and Frisian, Istvaeonic, which encompasses Dutch and its close relatives, and Irminonic, which includes German and its close relatives and variants. English is by far the most-spoken West Germanic language, with more than 1 billion speakers worldwide.

Historical source

A historical source is an original source that contains important historical information. Such sources inform us about history at the most basic level and are used as clues in order to study history. Historical sources can include coins, artefacts, monuments, literary sources, documents, archaeological sites, features, oral transmissions, stone inscriptions, paintings, recorded sounds, images and oral history. Even ancient relics and ruins, broadly speaking, are historical sources.

Historical method

Historical method is the collection of techniques and guidelines that historians use to research and write histories of the past. Secondary sources, primary sources and material evidence such as that derived from archaeology may all be drawn on, and the historian's skill lies in identifying these sources, evaluating their relative authority, and combining their testimony appropriately in order to construct an accurate and reliable picture of past events and environments.

Proto-Germanic language

Proto-Germanic (abbreviated PGmc; also called Common Germanic) is the reconstructed proto-language of the Germanic branch of the Indo-European languages. Proto-Germanic eventually developed from pre-Proto-Germanic into three Germanic branches during the fifth century BC to fifth century AD: West Germanic, East Germanic and North Germanic, which however remained in contact over a considerable time, especially the Ingvaeonic languages (including English), which arose from West Germanic dialects and remained in continued contact with North Germanic.

Lossless JPEG

Lossless JPEG is a 1993 addition to the JPEG standard by the Joint Photographic Experts Group to enable lossless compression. However, the term may also be used to refer to all lossless compression schemes developed by the group, including JPEG 2000 and JPEG-LS. Lossless JPEG uses a completely different technique from the lossy JPEG standard: a predictive scheme based on the three nearest (causal) neighbors (upper, left, and upper-left), with entropy coding applied to the prediction error.
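The predictive scheme can be sketched as follows. This minimal example uses one of the predictors the standard defines, left + upper - upper-left, and stores only the prediction error for each pixel. The function names are hypothetical, out-of-image neighbors are treated as 0 for simplicity (the standard handles borders differently), and the entropy coding stage is omitted.

```python
def residuals(image: list[list[int]]) -> list[list[int]]:
    """Prediction errors using the predictor left + upper - upper_left."""
    def px(r: int, c: int) -> int:
        # Simplifying assumption: neighbors outside the image count as 0.
        return image[r][c] if r >= 0 and c >= 0 else 0

    return [
        [image[r][c] - (px(r, c - 1) + px(r - 1, c) - px(r - 1, c - 1))
         for c in range(len(image[0]))]
        for r in range(len(image))
    ]


def reconstruct(res: list[list[int]]) -> list[list[int]]:
    """Invert residuals() by replaying the same predictor in raster order."""
    out = [[0] * len(res[0]) for _ in res]

    def px(r: int, c: int) -> int:
        return out[r][c] if r >= 0 and c >= 0 else 0

    for r in range(len(res)):
        for c in range(len(res[0])):
            # Only already-decoded (causal) neighbors are needed here.
            out[r][c] = res[r][c] + px(r, c - 1) + px(r - 1, c) - px(r - 1, c - 1)
    return out
```

On smooth image regions the residuals cluster near zero, which is what makes the subsequent entropy coding effective, and `reconstruct(residuals(image))` returns the original pixels exactly.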

Handwriting recognition

Handwriting recognition (HWR), also known as handwritten text recognition (HTR), is the ability of a computer to receive and interpret intelligible handwritten input from sources such as paper documents, photographs, touch-screens and other devices. The image of the written text may be sensed "off line" from a piece of paper by optical scanning (optical character recognition) or intelligent word recognition. Alternatively, the movements of the pen tip may be sensed "on line", for example by a pen-based computer screen surface, a generally easier task as there are more clues available.

Memory segmentation

Memory segmentation is an operating system memory management technique of dividing a computer's primary memory into segments or sections. In a computer system using segmentation, a reference to a memory location includes a value that identifies a segment and an offset (memory location) within that segment. Segments or sections are also used in object files of compiled programs when they are linked together into a program image and when the image is loaded into memory.
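The segment-plus-offset reference can be illustrated with a small sketch of address translation. The `Segment` type, the dictionary-based segment table, and the `translate` function are hypothetical names for this example and do not model any particular processor or operating system.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    base: int   # starting address of the segment in primary memory
    limit: int  # size of the segment in bytes


def translate(table: dict[int, Segment], segment_id: int, offset: int) -> int:
    """Map a (segment, offset) reference to a linear address."""
    seg = table[segment_id]
    # An offset outside the segment is a protection violation.
    if not 0 <= offset < seg.limit:
        raise ValueError(f"offset {offset:#x} outside segment {segment_id}")
    return seg.base + offset


# Example table: segment 0 for code, segment 1 for data (illustrative values).
table = {0: Segment(base=0x1000, limit=0x400),
         1: Segment(base=0x8000, limit=0x100)}
address = translate(table, 1, 0x10)  # 0x8000 + 0x10
```

The limit check is what lets segmentation double as a protection mechanism: a reference cannot reach outside the segment it names.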

Code segment

In computing, a code segment, also known as a text segment or simply as text, is a portion of an object file or the corresponding section of the program's virtual address space that contains executable instructions. The term "segment" comes from the memory segment, which is a historical approach to memory management that has been succeeded by paging. When a program is stored in an object file, the code segment is a part of this file; when the loader places a program into memory so that it may be executed, various memory regions are allocated (in particular, as pages), corresponding to both the segments in the object files and to segments only needed at run time.

Census

A census is the procedure of systematically acquiring, recording and calculating population information about the members of a given population. This term is used mostly in connection with national population and housing censuses; other common censuses include censuses of agriculture, traditional culture, business, supplies, and traffic censuses. The United Nations (UN) defines the essential features of population and housing censuses as "individual enumeration, universality within a defined territory, simultaneity and defined periodicity", and recommends that population censuses be taken at least every ten years.

Compression artifact

A compression artifact (or artefact) is a noticeable distortion of media (including images, audio, and video) caused by the application of lossy compression. Lossy data compression involves discarding some of the media's data so that it becomes small enough to be stored within the desired disk space or transmitted (streamed) within the available bandwidth (known as the data rate or bit rate). If the compressor cannot store enough data in the compressed version, the result is a loss of quality, or introduction of artifacts.

Image compression

Image compression is a type of data compression applied to digital images, to reduce their cost for storage or transmission. Algorithms may take advantage of visual perception and the statistical properties of image data to provide superior results compared with generic data compression methods which are used for other digital data. Image compression may be lossy or lossless. Lossless compression is preferred for archival purposes and often for medical imaging, technical drawings, clip art, or comics.
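The difference between lossless and lossy compression can be shown with a short sketch. Here Python's generic `zlib` module stands in for a lossless image codec, and uniform quantization of pixel values stands in for the lossy step; both are simplifying assumptions for illustration, not how any specific image format works, and the helper names are hypothetical.

```python
import zlib


def lossless_roundtrip(data: bytes) -> bytes:
    """Compress and decompress; the output is bit-identical to the input."""
    return zlib.decompress(zlib.compress(data))


def quantize(data: bytes, step: int) -> bytes:
    """Lossy step: snap each 8-bit value to the middle of its bucket."""
    return bytes(min(255, (b // step) * step + step // 2) for b in data)


pixels = bytes([200, 203, 17, 17, 17, 17, 250, 251])
restored = lossless_roundtrip(pixels)  # identical to pixels
coarse = quantize(pixels, 16)          # nearby values collapse together
```

After quantization, distinct values such as 200 and 203 become indistinguishable, so the original can no longer be recovered; in exchange, the coarser data is more repetitive and therefore compresses better.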