Notes on Information extraction: February 2013

When it comes to information extraction, most of the time we deal with the Internet and pages stored as HTML. There are two cases:

Working with a specific website (e.g., eBay). This is called web scraping, and it is a fairly straightforward task.
Crawling the Web and searching for a certain type of data (mentions of people, places, organisations, telephone numbers, emails, etc.).

In the second case the page structure is unknown, so one has to fall back on pure text mining. However, with HTML things get harder, since the text is interleaved with markup. In this post, I am going to discuss the problem and its solution.
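
To make the second case concrete, here is a minimal sketch of what "pure text mining" over a page of unknown structure might look like: strip the markup, then search the remaining text for one type of data (emails here). It uses only the Python standard library. The names TextExtractor, extract_emails, EMAIL_RE and the sample page are my own illustrative assumptions, and the email pattern is deliberately simplistic, so treat this as a starting point rather than a robust extractor.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


# Assumption: a simple pattern that covers common addresses only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(html):
    """Return email-like strings found in the visible text of an HTML page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return EMAIL_RE.findall(text)


sample = (
    "<html><head><style>p { color: red; }</style></head>"
    "<body><p>Contact <b>Ann</b> at ann@example.org.</p>"
    "<script>var x = 'hidden@example.com';</script></body></html>"
)
print(extract_emails(sample))  # ['ann@example.org']
```

Note that the address inside the script block is ignored because only visible text is searched. Whether that is the right choice depends on the task; some crawlers may want to look inside scripts and attributes as well.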


Saturday, February 9, 2013

Information Extraction from HTML