
MAREC
    
    
The MAtrixware REsearch Collection (MAREC) is a standardised patent data corpus available for research purposes. It can be described as a corpus that represents patent documents in several languages so that specific research questions can be answered. It consists of 19 million patent documents in different languages, normalised to a highly specific XML schema.
MAREC is intended as raw material for research in areas such as information retrieval, natural language processing or machine translation, which require large amounts of complex documents. The collection contains documents in 19 languages, the majority being English, German and French, and about half of the documents include full text.
In MAREC, the documents from different countries and sources are normalised to a common XML format with a uniform patent numbering scheme and citation format. The standardised fields include dates, countries, languages, references, person names, and companies, as well as subject classifications such as International Patent Classification (IPC) codes.
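As a rough illustration of what working with such normalised records might look like, the following Python sketch reads a few of the standardised fields from a single document. The element and attribute names, the sample values, and the helper extract_fields are all invented for this example and do not reproduce the actual MAREC schema; only the kinds of fields (numbering, language, date, IPC codes, companies, citations) come from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: element names, attribute names and values are
# invented for illustration and do NOT reproduce the real MAREC schema.
SAMPLE = """<patent-document country="EP" doc-number="0000001" kind="A1"
                 lang="EN" date="20000101">
  <classifications>
    <ipc>G06F 17/30</ipc>
  </classifications>
  <applicants>
    <applicant>Example Corp</applicant>
  </applicants>
  <citations>
    <citation country="US" doc-number="1234567" kind="A"/>
  </citations>
</patent-document>"""

ID_PARTS = ("country", "doc-number", "kind")


def uniform_id(element):
    """Join the (assumed) numbering attributes into one identifier string."""
    return "-".join(element.get(part, "") for part in ID_PARTS)


def extract_fields(xml_text):
    """Collect a handful of standardised fields from one document."""
    root = ET.fromstring(xml_text)
    return {
        "id": uniform_id(root),
        "language": root.get("lang"),
        "date": root.get("date"),
        "ipc": [e.text.strip() for e in root.iter("ipc") if e.text],
        "applicants": [e.text.strip() for e in root.iter("applicant") if e.text],
        "citations": [uniform_id(c) for c in root.iter("citation")],
    }


record = extract_fields(SAMPLE)
print(record)
```

Because the numbering scheme and citation format are uniform, the same helper can format both the document's own identifier and the documents it cites, which is the practical benefit of normalising sources from different countries.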
MAREC is a comparable corpus: many documents are available in similar versions in other languages. A comparable corpus consists of texts that share similar topics, for example news text from the same time period in different countries, whereas a parallel corpus is a collection of documents with aligned translations from the source to the target language. Since the different language versions of a patent document refer to the same “invention” or “concept of idea”, each text is a translation of the invention, but it does not have to be a direct translation of the text itself: parts of the text may have been removed or added for clarification.
The 19,386,697 XML files measure a total of 621 GB and are hosted by the Information Retrieval Facility. Access and support are free of charge for research purposes.
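The two figures above imply an average document size on the order of 32 kB. The short Python sketch below does the division; it is only a back-of-the-envelope estimate, and since the source does not say whether "GB" means 10^9 or 2^30 bytes, both readings are computed. The function name is an illustrative choice.

```python
FILE_COUNT = 19_386_697   # number of XML files, from the text
TOTAL_GB = 621            # total collection size, from the text


def average_file_size_bytes(total_gb, file_count, bytes_per_gb=10**9):
    """Mean file size in bytes for a given interpretation of 'GB'."""
    return total_gb * bytes_per_gb / file_count


# Assumption: decimal gigabytes (10^9 bytes).
avg_decimal = average_file_size_bytes(TOTAL_GB, FILE_COUNT)
# Assumption: binary gigabytes (2^30 bytes).
avg_binary = average_file_size_bytes(TOTAL_GB, FILE_COUNT, bytes_per_gb=2**30)

print(f"decimal GB: about {avg_decimal / 1000:.1f} kB per file")
print(f"binary GB:  about {avg_binary / 1024:.1f} KiB per file")
```

This is an average only; since about half of the documents include full text, individual file sizes are likely to vary widely around it.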


