TRE (computing)
Encyclopedia
TRE is an open-source
Open-source software
Open-source software is computer software that is available in source code form: the source code and certain other rights normally reserved for copyright holders are provided under a software license that permits users to study, change, improve and at times also to distribute the software.Open...

 library for texts search, which works like regular expression
Regular expression
In computing, a regular expression provides a concise and flexible means for "matching" strings of text, such as particular characters, words, or patterns of characters. Abbreviations for "regular expression" include "regex" and "regexp"...

 engine with ability of fuzzy string searching. It is developed by Ville Laurikari under 2-clause BSD-like license.

Library is written in C
C (programming language)
C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....

 and provides functions which allow using regular expressions for searching over input text lines. Main difference from other regular expression engines is that TRE can match text fragments in approximate way - i.e. supposing that text could have some number of typos
Typographical error
A typographical error is a mistake made in, originally, the manual type-setting of printed material, or more recently, the typing process. The term includes errors due to mechanical failure or slips of the hand or finger, but usually excludes errors of ignorance, such as spelling errors...

.

Approximate matching

TRE uses extended regular expression syntax with addition of directions for matching preceding fragment in approximate way. Each of such directions specifies how much typos are allowed for this fragment.

Approximate matching is performed in a way similar to Levenshtein distance
Levenshtein distance
In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences...

, which means that there are three types of typos 'recognized':
  • insertion of an extra character (regullar experession);
  • missing of a character from pattern (reglar expession);
  • replacement of some character (regolar exprezsion).

TRE allows specifying of cost for each of three typos type independently.

Command-line utility

Project comes with command-line utility (version of agrep
Agrep
agrep is a proprietary fuzzy string searching program, developed by Udi Manber and Sun Wu between 1988 and 1991, for use with the Unix operating system...

) built automatically with library. It could be used for processing text files or for testing abilities of TRE.

Standard conformance

Though approximate matching requires some syntax extension, when this feature is not used, TRE works like most of other regexp matching engines. This means that
  • regular expressions written for strict matching could be easily migrated for using with library;
  • programmers, familiar with POSIX-style regular expressions need not much study for using TRE.

Predictable time and memory consumption

Author states that time spent for matching grows linearly with increasing of input text length, while memory requirements are almost constant (tens of kilobytes). It is important, especially for possible uses in embedded systems which have comparatively few resources.

There is no information about benchmarking against other regular expression engines.

Other

Other features, common for most regex engines could be checked in regex engines comparison tables
Comparison of regular expression engines
-Libraries:-Languages:-Language features:NOTE: An application using a library for regular expression support does not necessarily offer the full set of features of the library, e.g...

 or in list of TRE features on its web-page.

Usage example

Approximate matching directions are specified in curly brackets and should be distinguishable from repetitive quantifiers (possibly with inserting a space after opening bracket):
  • (regular){~1}\s+(expression){~2} would match variants of phrase "regular expression" in which "regular" have no more than one typo and "expression" no more than two; as in ordinary regular expressions "\s+" means one or more space characters - i.e. rogular ekspression would pass test;
  • (expression){ 5i + 3d + 2s < 11} would match word "expression" if total cost of typos is less than 11, while insertion cost is set to 5, deletion to 3 and substitution of character to 2 - i.e. ekspresson gives cost of 10.

C and C++

Being written in C
C (programming language)
C is a general-purpose computer programming language developed between 1969 and 1973 by Dennis Ritchie at the Bell Telephone Laboratories for use with the Unix operating system....

, TRE could be ported (i.e. rebuilt) for any platform which have GNU C Compiler
GNU Compiler Collection
The GNU Compiler Collection is a compiler system produced by the GNU Project supporting various programming languages. GCC is a key component of the GNU toolchain...

 installed. Due to simplicity of sources (20-30 files placed in one folder) it is probable that porting with more specific compilers also is easy.

Other languages

For projects, written in languages, other than C/C++ TRE could be used in two ways:
  • if compiler of the language provides binary compatibility with C object files (like Gfortran
    GFortran
    gfortran is the name of the GNU Fortran compiler, which is part of the GNU Compiler Collection . gfortran has replaced the g77 compiler, which stopped development before GCC version 4.0. It includes support for the Fortran 95 language and is compatible with most language extensions supported by...

    ), library could be linked to project (possibly with few function wrappings);
  • in other cases it is necessary to create some interface to TRE based library - currently there are bindings for Perl, Python
    Python (programming language)
    Python is a general-purpose, high-level programming language whose design philosophy emphasizes code readability. Python claims to "[combine] remarkable power with very clear syntax", and its standard library is large and comprehensive...

     and Haskell
    Haskell (programming language)
    Haskell is a standardized, general-purpose purely functional programming language, with non-strict semantics and strong static typing. It is named after logician Haskell Curry. In Haskell, "a function is a first-class citizen" of the programming language. As a functional programming language, the...

    . However if the project should be cross-platform
    Cross-platform
    In computing, cross-platform, or multi-platform, is an attribute conferred to computer software or computing methods and concepts that are implemented and inter-operate on multiple computer platforms...

    , there would be necessary separate interface for each of target platforms.

Disadvantages

Since other regular expression engines usually do not provide approximate matching ability, there is almost no concurrent implementation with which TRE could be compared. However there are few things which programmers may wish to be implemented in future releases:
  • replacement mechanism for substitution matched text fragments (like in sed
    Sed
    sed is a Unix utility that parses text and implements a programming language which can apply transformations to such text. It reads input line by line , applying the operation which has been specified via the command line , and then outputs the line. It was developed from 1973 to 1974 as a Unix...

     string processor and many modern implementations of regexps, including built into Perl or Java
    Java (programming language)
    Java is a programming language originally developed by James Gosling at Sun Microsystems and released in 1995 as a core component of Sun Microsystems' Java platform. The language derives much of its syntax from C and C++ but has a simpler object model and fewer low-level facilities...

    );
  • opportunity to use another approximate matching algorithm (than Levenshtein's
    Levenshtein distance
    In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences...

    ) for better typo value assessment (for example Soundex
    Soundex
    Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. The algorithm mainly encodes consonants; a vowel will not be encoded unless...

    ), or at least this algorithm to be improved to allow typos of "swap" type (see Damerau–Levenshtein distance).

External links

The source of this article is wikipedia, the free encyclopedia.  The text of this article is licensed under the GFDL.
 
x
OK