Hamshahri Corpus
The Hamshahri Corpus is a sizable Persian corpus based on the Iranian newspaper Hamshahri, one of the first online Persian newspapers in Iran. It was initially collected and compiled by Ehsan Darrudi at the DBRG Group (http://ece.ut.ac.ir/dbrg/) of the University of Tehran.

This corpus was created by crawling the online news articles from Hamshahri's website and processing the HTML pages to create a standard text corpus for modern information retrieval experiments.
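
The HTML-to-text step described above can be illustrated with a minimal sketch. It is not the pipeline the corpus authors used, which the article does not describe; it only assumes that turning a crawled page into corpus text means keeping the visible text and discarding markup, scripts and styles. The class name TextExtractor, the function html_to_text and the sample page are all invented for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Normalise runs of whitespace so layout in the HTML does not leak through.
        words = data.split()
        if self._skip_depth == 0 and words:
            self._chunks.append(" ".join(words))

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string as one whitespace-normalised line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Hypothetical page, not taken from the Hamshahri website.
sample = (
    "<html><head><style>p { color: red; }</style></head>"
    "<body><h1>Headline</h1><p>Body   text &amp; more.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(html_to_text(sample))
```

A real crawl would also need to separate the article body from navigation and advertisements, which depends on the site's page templates and is not covered here.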

Version 1.0

The collection contains more than 160,000 articles covering subject categories such as politics, city news, economics, reports, editorials, literature, sciences, society, foreign news and sports. Document size varies from short news items (under 1 KB) to rather long articles (e.g. 140 KB), with an average of 1.8 KB.

The corpus is available for download in several formats (http://ece.ut.ac.ir/dbrg/Hamshahri/):
  • Tagged Text: 560 MB
  • In SQL Server 2000 Tables: 712 MB

Version 2.0

The second release of the Hamshahri Corpus was published on October 20, 2008. It offers several new features and improvements:
  • More News: 323,616 Text Stories in 3206 XML files (a file for each day)
  • Increased Time Span: 1996-06-22 to 2007-05-13 (Solar Hijri, or Hejri Shamsi: 1375/04/02 to 1386/02/23)
  • Bigger in Size: 1.42 GB uncompressed
  • Standard Container: Unicode XML
  • Included Images: images have been extracted from the news stories and preserved (available in an additional package), which makes the corpus suitable for image retrieval tasks.
  • Categorized News: the news stories have been categorized semi-automatically (appropriate for Text Categorization and Classification tasks).
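
As a quick sanity check on the figures listed above, the following sketch relates the story count, the file count and the Gregorian time span. It uses only numbers stated in the list; the variable names are illustrative, and it assumes the span is inclusive of both end dates.

```python
from datetime import date

# Figures as stated in the Version 2.0 feature list.
stories = 323_616
files = 3206
first_day = date(1996, 6, 22)
last_day = date(2007, 5, 13)

# Assumption: both end dates are included in the span.
days_in_span = (last_day - first_day).days + 1
stories_per_file = stories / files
coverage = files / days_in_span

print(f"{days_in_span} calendar days in the span")
print(f"{stories_per_file:.1f} stories per daily file on average")
print(f"{coverage:.0%} of calendar days have a file")
```

The number of files is smaller than the number of calendar days in the span, so not every day has a file. The article does not say why.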


The corpus is available for download in XML format (http://ece.ut.ac.ir/DBRG/Hamshahri/ham2/).
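
A minimal sketch of reading one daily file and counting stories per category follows. The article does not document the XML schema, so the element names used here (DAY, DOC, CAT, TEXT) and the sample content are assumptions made purely for illustration; they should be checked against the files in the actual distribution before use.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical daily file. The element names DAY, DOC, CAT and TEXT are
# assumptions, not the documented schema of the corpus.
SAMPLE_DAY = """
<DAY>
  <DOC><CAT>sport</CAT><TEXT>first story</TEXT></DOC>
  <DOC><CAT>politics</CAT><TEXT>second story</TEXT></DOC>
  <DOC><CAT>sport</CAT><TEXT>third story</TEXT></DOC>
</DAY>
"""


def stories_per_category(xml_text):
    """Count story elements by category label in one daily XML document."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for doc in root.iter("DOC"):
        # Stories without a category element are counted as "unknown".
        category = doc.findtext("CAT", default="unknown").strip()
        counts[category] += 1
    return counts


counts = stories_per_category(SAMPLE_DAY)
for category, n in sorted(counts.items()):
    print(category, n)
```

For files on disk, ET.parse(path).getroot() would replace ET.fromstring, and the per-file counters can be summed across all daily files to obtain corpus-wide category counts.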

See also

  • Bijankhan Corpus
  • Persian Today Corpus
  • Text corpus
  • Information Retrieval

The source of this article is Wikipedia, the free encyclopedia. The text of this article is licensed under the GFDL.
 