Website Parse Template
Website Parse Template (WPT) is an XML-based open format that describes the HTML structure of website pages. The WPT format allows web crawlers to generate Semantic Web RDF (Resource Description Framework) descriptions for web pages. WPT is compatible with the existing Semantic Web concepts defined by the W3C (RDF and OWL) and with the UNL (Universal Networking Language) specifications.

WPT Syntax

A Website Parse Template consists of the following sections:
  • Ontology, where the publisher defines the concepts and relations that are used in the website.
  • Templates, where the publisher provides templates for groups of web pages that are similar in content category and structure. For each template the publisher gives the XPath or TagIDs of the HTML elements and links them to concepts of the website ontology.
  • URLs, where the publisher provides URL patterns that select a group of web pages and link it to a parse template. In the URLs section the publisher can also separate out a part of the URL as a concept and link it to the website ontology.


A Website Parse Template begins with an opening <icdl> tag and ends with a closing </icdl> tag. A single Website Parse Template refers to a single host, while a single host may have several Website Parse Templates describing its HTML structure. The host must be specified at the beginning of the template, in the <icdl> tag:



. . . . . . . . . . . . . . . . . . .
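
As a minimal sketch, the snippet below reads the host from a hypothetical WPT skeleton using Python's standard library. Only the <icdl> root element and the requirement to name the host come from the description above; the attribute name `host`, the example host and the helper `read_host` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical WPT skeleton. The attribute name "host" and its value are
# assumptions; the text only says the host is specified in the <icdl> tag.
WPT_SKELETON = """\
<icdl host="music.example.com">
  <!-- ontology, template and urls sections would go here -->
</icdl>
"""

def read_host(wpt_text):
    """Return the host declared on the <icdl> root element."""
    root = ET.fromstring(wpt_text)
    if root.tag != "icdl":
        raise ValueError("a WPT document must be wrapped in <icdl>")
    host = root.get("host")
    if not host:
        raise ValueError("the <icdl> tag must specify a host")
    return host

print(read_host(WPT_SKELETON))
```

A crawler built along these lines could use the returned host to decide which of several templates applies to a page it has fetched.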


WPT ontology

The Ontology section enumerates and defines all concepts used in the website. The listed concepts must be enclosed within <ontology> and </ontology> tags. The ontology name (any meaningful string) must be specified, together with the supported language ("icdl:ontology", "owl" or "unl:uws") that is used to specify the concepts.

Example 1. Concepts used in Yahoo! Music for the "artist" object




















Each concept definition starts with a <concept> tag and ends with a </concept> tag. The <inherit> tag expresses inheritance relations, and the <has> tag expresses attribute relations between two concepts. Every defined concept has a default attribute, the object identifier (id), which web crawlers use to correlate the attributes of the same object that appear on different pages of the same website.
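
The relations described above can be sketched as follows. The element names <ontology>, <concept>, <inherit> and <has> and the language value "icdl:ontology" come from the text; the attribute names `name`, `language` and `concept`, the sample concepts and the helper `load_concepts` are assumptions for illustration, not confirmed WPT syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical ontology section; attribute names and concepts are assumed.
ONTOLOGY_XML = """\
<ontology name="example.music" language="icdl:ontology">
  <concept name="artist">
    <inherit concept="person"/>
    <has concept="fullname"/>
    <has concept="biography"/>
  </concept>
  <concept name="fullname"/>
</ontology>
"""

def load_concepts(xml_text):
    """Map each concept name to its parent concepts and attribute concepts."""
    root = ET.fromstring(xml_text)
    concepts = {}
    for node in root.findall("concept"):
        concepts[node.get("name")] = {
            "inherits": [i.get("concept") for i in node.findall("inherit")],
            # Every concept carries the default "id" attribute.
            "has": ["id"] + [h.get("concept") for h in node.findall("has")],
        }
    return concepts

concepts = load_concepts(ONTOLOGY_XML)
print(concepts["artist"])
```

Adding the default "id" to every concept mirrors the rule that each concept has an object identifier, whether or not the publisher lists it.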

Website Parse Template provides several predefined concepts that are common to all kinds of websites:

“Menu” - navigation bar/menu

“Logo” - design element/logo

“Content” - element that contains the main textual content of the page

“Advertisement” - advertisement/banner

“External Link” - element that contains external links





WPT templates

The Templates section contains a number of templates, each of which refers to a single group of similarly structured web pages. XPath references or TagIDs of HTML elements are used to link structured content to the defined concepts. A template description starts with an opening <template> tag and ends with a closing </template> tag. The <template> tag must specify the template name and the language used for the template description. Any string can be chosen as the template name, but the language must be one of the supported language types, e.g. "icdl:template", "rdf" or "unl:expression".
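
A rough sketch of how a crawler might apply such a template is shown below. The <template> tag and the "icdl:template" value come from the text; the <element> tag, its `xpath` and `concept` attributes, the sample page and the helper `extract` are hypothetical. Python's ElementTree supports only a subset of XPath and needs well-formed markup, so a real crawler would likely use a full HTML parser instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical template block; <element> and its attributes are assumed.
TEMPLATE_XML = """\
<template name="artist_page" language="icdl:template">
  <element xpath=".//h1[@class='artist-name']" concept="fullname"/>
  <element xpath=".//div[@id='bio']" concept="biography"/>
</template>
"""

# A tiny well-formed page standing in for real HTML.
PAGE = """\
<html><body>
  <h1 class="artist-name">Jane Doe</h1>
  <div id="bio">Singer and songwriter.</div>
</body></html>
"""

def extract(template_xml, page_xml):
    """Apply each XPath in the template and key the found text by concept."""
    template = ET.fromstring(template_xml)
    page = ET.fromstring(page_xml)
    result = {}
    for rule in template.findall("element"):
        node = page.find(rule.get("xpath"))
        if node is not None:
            result[rule.get("concept")] = (node.text or "").strip()
    return result

print(extract(TEMPLATE_XML, PAGE))
```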

Example 2. Simple template for a single artist page on Yahoo! Music






A web page may contain structured repeatable content () included in one main HTML element (); these are specified as follows:

Example 3. Repeatable content representation
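
The idea of repeatable content inside one main element can be sketched as below. The tag names <container> and <item>, their attributes, the sample page and the helper `extract_repeated` are all hypothetical placeholders and should not be read as WPT syntax; the sketch only shows one XPath selecting the main element and a relative XPath selecting each repeated entry.

```python
import xml.etree.ElementTree as ET

# Hypothetical names throughout: <container>, <item> and their attributes
# are assumptions for illustration only.
RULE_XML = """\
<container xpath=".//ul[@id='albums']" concept="discography">
  <item xpath="li" concept="album"/>
</container>
"""

PAGE = """\
<html><body>
  <ul id="albums">
    <li>First Album</li>
    <li>Second Album</li>
  </ul>
</body></html>
"""

def extract_repeated(rule_xml, page_xml):
    """Return the text of every repeated item found inside the main element."""
    rule = ET.fromstring(rule_xml)
    item_rule = rule.find("item")
    container = ET.fromstring(page_xml).find(rule.get("xpath"))
    if container is None:
        return {}
    items = [(n.text or "").strip()
             for n in container.findall(item_rule.get("xpath"))]
    return {rule.get("concept"): {item_rule.get("concept"): items}}

print(extract_repeated(RULE_XML, PAGE))
```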




If a specified complex HTML element is already described by another template, a reference tag can be used to point to that template block. This makes it possible to create hierarchical relations between WPT templates, so that web crawlers can use the specified reference(s) to identify the same object on different pages of a given website.
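
One possible way to resolve such references is sketched below. Only <template> and its name come from the text; the <templates> wrapper, the <ref> and <element> tags, their attributes and the helper `resolve` are assumptions made so the example is self-contained.

```python
import xml.etree.ElementTree as ET

# Hypothetical syntax: <templates>, <ref> and <element> are assumed names.
TEMPLATES_XML = """\
<templates>
  <template name="artist_header" language="icdl:template">
    <element xpath=".//h1" concept="fullname"/>
  </template>
  <template name="artist_page" language="icdl:template">
    <ref template="artist_header"/>
    <element xpath=".//div[@id='bio']" concept="biography"/>
  </template>
</templates>
"""

def resolve(name, templates, seen=()):
    """Flatten a template into (xpath, concept) pairs, following references."""
    if name in seen:
        raise ValueError(f"circular template reference: {name}")
    rules = []
    for child in templates[name]:
        if child.tag == "ref":
            rules += resolve(child.get("template"), templates, seen + (name,))
        elif child.tag == "element":
            rules.append((child.get("xpath"), child.get("concept")))
    return rules

templates = {t.get("name"): t
             for t in ET.fromstring(TEMPLATES_XML).findall("template")}
print(resolve("artist_page", templates))
```

Tracking the names already visited guards against two templates that refer to each other.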

Example 4. Hierarchical relations between WPT templates





URLs section

This section defines the URLs or URL patterns that correspond to the groups of similarly structured web pages described in the Templates section. Like the Templates section, the URLs section may consist of several blocks, each of which starts with a <urls> tag and ends with a </urls> tag.

Example 5. URLs/URL patterns







Any string can be chosen as the name of a URLs block, but the template must be given as the name of a template described in the previous section. The URL pattern provided in Example 5 also includes the real URL that it represents. Regular expression (RegExp) syntax is used to describe URL patterns. The concepts needed for a URL pattern definition (such as "id" and "fullname") must be defined beforehand in the Ontology section.
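
A minimal sketch of matching a URL against such a block follows. The <urls> tag and the "id" and "fullname" concepts come from the text; the <url> tag, the `name`, `template`, `pattern` and `concepts` attributes, the pattern itself, the example URL and the helper `match_url` are assumptions for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical <urls> block. Each capture group in the pattern is paired,
# in order, with a concept name listed in the "concepts" attribute.
URLS_XML = r"""<urls name="artist_urls" template="artist_page">
  <url pattern="^http://music\.example\.com/artist/([\w-]+)/(\d+)$"
       concepts="fullname id"/>
</urls>
"""

def match_url(urls_xml, url):
    """Return (template name, extracted concepts) for the first matching pattern."""
    block = ET.fromstring(urls_xml)
    for rule in block.findall("url"):
        found = re.match(rule.get("pattern"), url)
        if found:
            names = rule.get("concepts").split()
            return block.get("template"), dict(zip(names, found.groups()))
    return None

print(match_url(URLS_XML, "http://music.example.com/artist/jane-doe/12345"))
```

Listing the concept names in a separate attribute avoids embedding angle brackets for named groups inside an XML attribute value.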

See also

  • ICDL Crawling
  • Open Market For Internet Content Accessibility
  • Semantic Web
  • World Wide Web Consortium (W3C)
  • RDF
  • OWL
  • Regular expression
  • Universal Networking Language


The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 