Parsing: Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, whether in natural language, computer languages, or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part (of speech). The term has slightly different meanings in different branches of linguistics and computer science. Traditional sentence parsing is often performed as a method of understanding the exact meaning of a sentence or word, sometimes with the aid of devices such as sentence diagrams.
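For illustration, a minimal Python sketch using the standard-library ast module shows a string of symbols being analyzed according to a formal grammar (here, Python's own):

    import ast

    # Parse an expression string according to Python's formal grammar.
    tree = ast.parse("1 + 2 * 3", mode="eval")

    # The resulting structure reflects the grammar's operator precedence:
    # the multiplication is nested inside the addition.
    print(ast.dump(tree))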
Reserved word: In a computer language, a reserved word (also known as a reserved identifier) is a word that cannot be used as an identifier, such as the name of a variable, function, or label – it is "reserved from use". This is a syntactic definition, and a reserved word may have no user-defined meaning. A closely related and often conflated notion is a keyword, which is a word with special meaning in a particular context. This is a semantic definition. By contrast, names in a standard library but not built into a language are not considered reserved words or keywords.
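A short Python sketch (standard library only) illustrating the distinction: reserved words are rejected as identifiers by the parser, while standard-library names such as len are merely shadowable:

    import keyword

    # Python's reserved words cannot be used as identifiers.
    print(keyword.iskeyword("class"))   # True

    # Attempting to bind a reserved word as a variable name is a syntax error.
    try:
        exec("class = 3")
    except SyntaxError as err:
        print("rejected:", err.msg)

    # By contrast, a built-in library name is not reserved and can be shadowed.
    len = 3
    print(len)   # 3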
Yacc: Yacc (Yet Another Compiler-Compiler) is a computer program for the Unix operating system developed by Stephen C. Johnson. It is a look-ahead left-to-right rightmost-derivation (LALR) parser generator, generating an LALR parser (the part of a compiler that tries to make syntactic sense of the source code) based on a formal grammar written in a notation similar to Backus–Naur form (BNF). Yacc is supplied as a standard utility on BSD and AT&T Unix. GNU-based Linux distributions include Bison, a forward-compatible Yacc replacement.
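A hedged sketch of the Yacc style of grammar specification, written here with the third-party PLY (Python Lex-Yacc) package rather than Yacc itself; it assumes PLY is installed and is illustrative only:

    import ply.lex as lex
    import ply.yacc as yacc

    # Token definitions, in the spirit of a Lex/Yacc specification.
    tokens = ("NUMBER", "PLUS")
    t_PLUS = r"\+"
    t_ignore = " "

    def t_NUMBER(t):
        r"\d+"
        t.value = int(t.value)
        return t

    def t_error(t):
        t.lexer.skip(1)

    # Grammar rules written much like BNF productions.
    def p_expr_plus(p):
        "expr : expr PLUS NUMBER"
        p[0] = p[1] + p[3]

    def p_expr_number(p):
        "expr : NUMBER"
        p[0] = p[1]

    def p_error(p):
        raise SyntaxError("parse error")

    lexer = lex.lex()
    parser = yacc.yacc()
    print(parser.parse("1 + 2 + 3", lexer=lexer))   # 6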
Abstract syntax tree: In computer science, an abstract syntax tree (AST), or just syntax tree, is a tree representation of the abstract syntactic structure of text (often source code) written in a formal language. Each node of the tree denotes a construct occurring in the text. The syntax is "abstract" in the sense that it does not represent every detail appearing in the real syntax, but rather just the structural or content-related details. For instance, grouping parentheses are implicit in the tree structure, so these do not have to be represented as separate nodes.
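A small Python example with the standard-library ast module showing that grouping parentheses leave no node of their own, only the nesting they imply:

    import ast

    # The parentheses appear nowhere in the dump; the grouping survives only
    # as the nesting of the addition inside the multiplication.
    print(ast.dump(ast.parse("(1 + 2) * 3", mode="eval")))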
Regular expression: A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. Regular expression techniques are developed in theoretical computer science and formal language theory. The concept of regular expressions began in the 1950s, when the American mathematician Stephen Cole Kleene formalized the concept of a regular language.
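A brief Python sketch using the standard-library re module to illustrate the three uses mentioned above (find, find and replace, and input validation); the sample text is invented:

    import re

    text = "Orders: A-101, B-202, C-303"

    # "Find": collect every substring matching letter, dash, digits.
    print(re.findall(r"[A-Z]-\d+", text))              # ['A-101', 'B-202', 'C-303']

    # "Find and replace": mask the digits while keeping the prefix.
    print(re.sub(r"\d+", "***", text))

    # Input validation: require that the whole string matches the pattern.
    print(bool(re.fullmatch(r"[A-Z]-\d+", "B-202")))    # True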
Syntax (programming languages): In computer science, the syntax of a computer language is the set of rules that defines the combinations of symbols that are considered to be correctly structured statements or expressions in that language. This applies both to programming languages, where the document represents source code, and to markup languages, where the document represents data. The syntax of a language defines its surface form. Text-based computer languages are based on sequences of characters, while visual programming languages are based on the spatial layout and connections between symbols (which may be textual or graphical).
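A minimal Python illustration: a string made of valid symbols can still fail the language's syntax rules, which compile() enforces here:

    # "if x 1:" uses only legal characters and tokens, yet the combination
    # is not a correctly structured statement under Python's syntax.
    try:
        compile("if x 1:", "<example>", "exec")
    except SyntaxError as err:
        print("not syntactically valid:", err.msg)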
Compiler-compiler: In computer science, a compiler-compiler or compiler generator is a programming tool that creates a parser, interpreter, or compiler from some form of formal description of a programming language and machine. The most common type of compiler-compiler is more precisely called a parser generator; it handles only syntactic analysis. The input of a parser generator is a grammar file, typically written in Backus–Naur form (BNF) or extended Backus–Naur form (EBNF), that defines the syntax of a target programming language.
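As a rough illustration of the relationship between a grammar and the parser built from it, the following hand-written Python sketch parses the toy grammar expr ::= term { "+" term }, term ::= NUMBER; it is illustrative only and not the output of any particular generator:

    # Grammar (BNF-like):  expr ::= term { "+" term }    term ::= NUMBER
    def parse_expr(tokens, pos=0):
        value, pos = parse_term(tokens, pos)
        while pos < len(tokens) and tokens[pos] == "+":
            rhs, pos = parse_term(tokens, pos + 1)
            value += rhs
        return value, pos

    def parse_term(tokens, pos):
        return int(tokens[pos]), pos + 1

    print(parse_expr("4 + 5 + 6".split())[0])   # 15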
String literal: A string literal or anonymous string is a literal for a string value in the source code of a computer program. Modern programming languages commonly use a quoted sequence of characters, formally "bracketed delimiters", as in x = "foo", where "foo" is a string literal with value foo. Methods such as escape sequences can be used to avoid the problem of delimiter collision (issues with brackets) and allow the delimiters to be embedded in a string. There are many alternate notations for specifying string literals, especially in complicated cases.
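A small Python sketch of delimiter collision and the notations that work around it (escape sequences, alternate delimiters, triple quoting):

    # A quote inside a quoted literal collides with the delimiter unless it
    # is escaped or a different delimiter is chosen.
    a = "She said \"hi\""     # escape sequence for the embedded delimiter
    b = 'She said "hi"'       # alternate delimiter avoids the collision
    c = """She said "hi\""""  # triple-quoted form, another alternate notation
    print(a == b == c)        # True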
Haskell: Haskell (/ˈhæskəl/) is a general-purpose, statically typed, purely functional programming language with type inference and lazy evaluation. Designed for teaching, research, and industrial applications, Haskell has pioneered a number of programming language features such as type classes, which enable type-safe operator overloading, and monadic input/output (IO). It is named after logician Haskell Curry. Haskell's main implementation is the Glasgow Haskell Compiler (GHC).
LR parser: In computer science, LR parsers are a type of bottom-up parser that analyse deterministic context-free languages in linear time. There are several variants of LR parsers: SLR parsers, LALR parsers, canonical LR(1) parsers, minimal LR(1) parsers, and GLR parsers. LR parsers can be generated by a parser generator from a formal grammar defining the syntax of the language to be parsed. They are widely used for the processing of computer languages.
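A loose, hand-rolled Python sketch of the bottom-up shift-reduce style that LR parsers automate, for the toy grammar E -> E "+" n | n; a real LR parser would drive these choices from generated tables:

    def shift_reduce(tokens):
        """Recognize the toy grammar E -> E '+' n | n, bottom-up."""
        stack, remaining = [], list(tokens)
        while remaining or stack != ["E"]:
            if stack[-3:] == ["E", "+", "n"]:
                stack[-3:] = ["E"]              # reduce by E -> E + n
            elif stack[-1:] == ["n"]:
                stack[-1:] = ["E"]              # reduce by E -> n
            elif remaining:
                stack.append(remaining.pop(0))  # shift the next input token
            else:
                raise SyntaxError("input does not match the grammar")
            print("stack:", stack, "input:", remaining)
        return stack

    shift_reduce(["n", "+", "n", "+", "n"])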
C Sharp (programming language): C# (pronounced "see sharp") is a general-purpose, high-level programming language supporting multiple paradigms. C# encompasses static typing, strong typing, lexical scoping, and imperative, declarative, functional, generic, object-oriented (class-based), and component-oriented programming disciplines. The C# programming language was designed by Anders Hejlsberg from Microsoft in 2000 and was later approved as an international standard by Ecma (ECMA-334) in 2002 and ISO/IEC (ISO/IEC 23270) in 2003. Microsoft introduced C# along with the .NET Framework.
Interpreter (computing): In computer science, an interpreter is a computer program that directly executes instructions written in a programming or scripting language, without requiring them previously to have been compiled into a machine language program. An interpreter generally uses one of the following strategies for program execution: (1) parse the source code and perform its behavior directly; (2) translate the source code into some efficient intermediate representation or object code and immediately execute that; or (3) explicitly execute stored precompiled bytecode made by a compiler, matched with the interpreter's virtual machine.
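A minimal Python sketch of the first strategy, parsing the source into a tree and performing its behavior directly; the interpret function and its tiny expression language are invented for illustration:

    import ast

    def interpret(source):
        """Evaluate '+' and '*' integer expressions directly from the parse tree."""
        def walk(node):
            if isinstance(node, ast.Expression):
                return walk(node.body)
            if isinstance(node, ast.Constant):
                return node.value
            if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Add):
                return walk(node.left) + walk(node.right)
            if isinstance(node, ast.BinOp) and isinstance(node.op, ast.Mult):
                return walk(node.left) * walk(node.right)
            raise ValueError("unsupported construct")
        return walk(ast.parse(source, mode="eval"))

    print(interpret("2 + 3 * 4"))   # 14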
Lex (software): Lex is a computer program that generates lexical analyzers ("scanners" or "lexers"). Lex is commonly used with the yacc parser generator. Lex, originally written by Mike Lesk and Eric Schmidt and described in 1975, is the standard lexical analyzer generator on many Unix systems, and an equivalent tool is specified as part of the POSIX standard. Lex reads an input stream specifying the lexical analyzer and writes source code which implements the lexical analyzer in the C programming language.
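A rough Python analogue of what a Lex specification expresses: token kinds defined by regular expressions, from which a scanner is built (here with the standard-library re module rather than generated C code):

    import re

    # Each token kind is defined by a regular expression, in the spirit of
    # the rules section of a Lex input file.
    token_spec = [
        ("NUMBER", r"\d+"),
        ("IDENT",  r"[A-Za-z_]\w*"),
        ("OP",     r"[+\-*/=]"),
        ("SKIP",   r"\s+"),
    ]
    master = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in token_spec))

    def tokenize(text):
        for m in master.finditer(text):
            if m.lastgroup != "SKIP":
                yield (m.lastgroup, m.group())

    print(list(tokenize("total = count + 42")))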
Newline: A newline (frequently called line ending, end of line (EOL), next line (NEL) or line break) is a control character or sequence of control characters in character encoding specifications such as ASCII, EBCDIC, Unicode, etc. This character, or a sequence of characters, is used to signify the end of a line of text and the start of a new one. In the mid-1800s, long before the advent of teleprinters and teletype machines, Morse code operators or telegraphists invented and used Morse code prosigns to encode white space text formatting in formal written text messages.
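A short Python illustration of newline encodings as control characters (LF alone versus CR followed by LF):

    # The newline may be a single control character or a two-character sequence.
    text = "first\nsecond\r\nthird"
    print([hex(ord(ch)) for ch in "\r\n"])   # ['0xd', '0xa']  (CR, LF)
    print(text.splitlines())                 # ['first', 'second', 'third']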
Formal grammar: In formal language theory, a grammar (when the context is not given, often called a formal grammar for clarity) describes how to form strings from a language's alphabet that are valid according to the language's syntax. A grammar does not describe the meaning of the strings or what can be done with them in whatever context—only their form. A formal grammar is defined as a set of production rules for such strings in a formal language. Formal language theory, the discipline that studies formal grammars and languages, is a branch of applied mathematics.
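A small Python sketch representing production rules as data and deriving strings from them; the toy grammar is invented for illustration:

    import random

    # Production rules of a tiny grammar; symbols with no rule are terminals.
    rules = {
        "S":  [["NP", "VP"]],
        "NP": [["the", "N"]],
        "N":  [["dog"], ["cat"]],
        "VP": [["barks"], ["meows"]],
    }

    def derive(symbol):
        if symbol not in rules:              # terminal symbol: emit it as-is
            return [symbol]
        expansion = random.choice(rules[symbol])
        return [word for part in expansion for word in derive(part)]

    print(" ".join(derive("S")))   # e.g. "the cat barks"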
String (computer science): In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed (after creation). A string is generally considered as a data type and is often implemented as an array data structure of bytes (or words) that stores a sequence of elements, typically characters, using some character encoding. String may also denote more general arrays or other sequence (or list) data types and structures.
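A brief Python illustration of the mutability point: the built-in str type is immutable, while a bytearray holding the same characters can be mutated element by element:

    s = "parse"
    b = bytearray(s, "ascii")
    b[0] = ord("P")               # mutation is allowed on the byte array
    print(s, b.decode("ascii"))   # parse Parse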
Whitespace character: In computer programming, whitespace is any character or series of characters that represent horizontal or vertical space in typography. When rendered, a whitespace character does not correspond to a visible mark, but typically does occupy an area on a page. For example, the most common whitespace character, the space (ASCII 32), represents a blank space punctuation character in text, used as a word divider in Western scripts. With many keyboard layouts, a whitespace character may be entered by pressing the space bar.
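A tiny Python illustration of the space character's ASCII code and its role as a word divider:

    text = "one two\tthree\nfour"
    print(ord(" "))       # 32, the ASCII code of the space character
    print(text.split())   # whitespace divides words: ['one', 'two', 'three', 'four']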
Identifier (computer languages): In computer programming languages, an identifier is a lexical token (also called a symbol, but not to be confused with the symbol primitive data type) that names the language's entities. Some of the kinds of entities an identifier might denote include variables, data types, labels, subroutines, and modules. Which character sequences constitute identifiers depends on the lexical grammar of the language.
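A short Python sketch showing how one language's lexical grammar decides which character sequences are usable identifiers (Python permits certain non-ASCII letters and excludes its keywords):

    import keyword

    for name in ("total_2", "2total", "für", "class"):
        ok = name.isidentifier() and not keyword.iskeyword(name)
        print(f"{name!r}: {'valid identifier' if ok else 'not usable as an identifier'}")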
Lexical grammar: In computer science, a lexical grammar or lexical structure is a formal grammar defining the syntax of tokens. The program is written using characters that are defined by the lexical structure of the language used. The character set is equivalent to the alphabet used by any written language. The lexical grammar lays down the rules governing how a character sequence is divided up into subsequences of characters, each part of which represents an individual token. This is frequently defined in terms of regular expressions.
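A small Python example using the standard-library tokenize module to show a character sequence being divided into individual tokens by the language's lexical grammar:

    import io, tokenize

    # Python's own lexical grammar splits this line into its tokens.
    source = "answer = 6 * 7\n"
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        print(tokenize.tok_name[tok.type], repr(tok.string))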
Comment (computer programming): In computer programming, a comment is a programmer-readable explanation or annotation in the source code of a computer program. Comments are added with the purpose of making the source code easier for humans to understand, and are generally ignored by compilers and interpreters. The syntax of comments in various programming languages varies considerably. Comments are sometimes also processed in various ways to generate documentation external to the source code itself by documentation generators, or used for integration with source code management systems and other kinds of external programming tools.
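A minimal Python illustration that a comment changes nothing about what the code does; the two snippets below are invented for the comparison:

    # The interpreter ignores the comment entirely: both snippets behave identically.
    code_with_comment    = "total = 2 + 2  # human-readable note"
    code_without_comment = "total = 2 + 2"
    ns1, ns2 = {}, {}
    exec(code_with_comment, ns1)
    exec(code_without_comment, ns2)
    print(ns1["total"] == ns2["total"])   # True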