Notes on Information extraction

Saturday, October 11, 2014

Parsers and tokenizers (Python)

Here I am collecting simple functions (no third-party libraries) for parsing and tokenizing common text formats.
From Wikipedia:

Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining. Tokenization is useful both in linguistics (where it is a form of text segmentation), and in computer science, where it forms part of lexical analysis.
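As an illustration of the definition above, here is a minimal sketch of a tokenizer that uses only the standard library. The name `tokenize` and the regular expression are my own choices for this example; it assumes that a token is either a run of word characters or a single punctuation symbol, which is a simplification (for instance, it splits "It's" into three tokens).

```python
import re

# A token is either a run of word characters (letters, digits, underscore;
# Unicode-aware in Python 3) or a single non-space, non-word symbol.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens, dropping whitespace."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    print(tokenize("Hello, world! It's 2014."))
    # ['Hello', ',', 'world', '!', 'It', "'", 's', '2014', '.']
```

The resulting list can be passed on to a parser or a text-mining step, as the quotation describes.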

Our article on Sentiment analysis

Recently a colleague and I built a prototype tool for sentiment analysis of web reviews in Russian.

It was quite successful: it took second place in the ROMIP Sentiment Analysis Track in 2012.

Here is a link to the article and presentation about the prototype that we prepared for the Dialogue-2013 conference.

A common approach to the Named-entity recognition task

In the explanation that follows, I use examples from the task of extracting geographic mentions from text (e.g., mentions of cities, countries, districts, and streets).

We may distinguish three steps in NER:

select objects (hypotheses) in the text; an object is a word or group of words that is probably a mention of an entity
represent each object as a set of features that can be used for classification
classify each object using those features
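The three steps above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the method used in a real system: hypotheses are runs of capitalized words, the features are three hand-picked booleans, and the "classifier" is a hard-coded rule. The names `KNOWN_PLACES`, `LOCATION_CUES`, `select_candidates`, `extract_features`, `classify`, and `find_places` are all hypothetical.

```python
import re

# Hypothetical toy gazetteer and context cues; a real system would use
# much larger resources and a trained classifier.
KNOWN_PLACES = {"moscow", "paris", "russia"}
LOCATION_CUES = {"in", "from", "to", "near"}
SENTENCE_END = {".", "!", "?"}


def select_candidates(text):
    """Step 1: select hypotheses, here runs of capitalized words."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    spans = []
    i = 0
    while i < len(tokens):
        if tokens[i][0].isupper():
            j = i
            while j < len(tokens) and tokens[j][0].isupper():
                j += 1
            spans.append((i, j))  # half-open token span [i, j)
            i = j
        else:
            i += 1
    return tokens, spans


def extract_features(tokens, start, end):
    """Step 2: represent a hypothesis as a set of features."""
    phrase = " ".join(tokens[start:end]).lower()
    previous = tokens[start - 1] if start > 0 else ""
    return {
        "in_gazetteer": phrase in KNOWN_PLACES,
        "after_cue": previous.lower() in LOCATION_CUES,
        "sentence_start": start == 0 or previous in SENTENCE_END,
    }


def classify(features):
    """Step 3: classify; a hand-written rule stands in for a model."""
    if features["in_gazetteer"]:
        return True
    return features["after_cue"] and not features["sentence_start"]


def find_places(text):
    """Run the three steps and return the accepted mentions."""
    tokens, spans = select_candidates(text)
    return [
        " ".join(tokens[start:end])
        for start, end in spans
        if classify(extract_features(tokens, start, end))
    ]


if __name__ == "__main__":
    print(find_places("We flew from Moscow to Paris with Anna."))
    # ['Moscow', 'Paris']
```

Here "We" is rejected because it only looks capitalized due to its sentence-initial position, and "Anna" is rejected because it is neither in the gazetteer nor preceded by a location cue.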

Information Extraction from HTML

When it comes to information extraction, most of the time we deal with the Internet and pages stored as HTML. There are two cases:

working with a specific website (e.g., eBay). This is called web scraping, and it is a straightforward task.
crawling the Web and searching for a certain type of data (mentions of people, places, organisations, telephone numbers, emails, etc.).

In the second case the page structure is unknown, so one has to rely on pure text mining. However, HTML markup makes this harder. In this post, I am going to discuss the problem and its solution.
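To give a feel for the problem, here is a minimal sketch of the naive first step: reducing a page to plain text with the standard-library `html.parser` module. It is an assumption-laden illustration, not the solution discussed in the post; the class `TextExtractor` and the function `html_to_text` are hypothetical names, and the only markup handled specially is the content of `script` and `style` elements, which is dropped.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text chunks, ignoring script and style content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        chunk = data.strip()
        if chunk and self._skip_depth == 0:
            self.parts.append(chunk)


def html_to_text(html):
    """Return the text chunks of an HTML string joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


if __name__ == "__main__":
    page = (
        "<html><head><style>p {color: red}</style></head>"
        "<body><p>Call <b>555-1234</b> now</p>"
        "<script>var x = 1;</script></body></html>"
    )
    print(html_to_text(page))
    # Call 555-1234 now
```

Even this small sketch shows why HTML complicates text mining: joining chunks with spaces splits a word whose parts are wrapped in different tags, while joining without spaces glues together text from adjacent table cells or list items.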