When it comes to information extraction, most of the time we deal with the Internet and pages stored as HTML. There are two cases:
- working with a specific website (e.g., eBay). This is called web scraping, and it is a fairly straightforward task because the page layout is known in advance.
- crawling the Web and searching for a certain type of data (mentions of people, places, organisations, telephone numbers, emails, etc.).
In the second case the page structure is unknown, so one has to fall back on pure text mining. However, with HTML things get harder, because the text is interleaved with markup. In this post, I am going to discuss the problem and its solution.
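
As a rough illustration of the second case, here is a minimal sketch that is not necessarily the approach this post arrives at. It reduces a page of unknown structure to plain text using only Python's standard library, then runs a simple pattern over that text. The names `TextExtractor`, `html_to_text`, `find_emails` and the deliberately loose `EMAIL_RE` pattern are assumptions made for this example.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the page text with all markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with spaces keeps text from adjacent elements apart.
    return " ".join(parser.chunks)


# Illustrative pattern only; real-world email matching is messier.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_emails(html):
    """Find email-like strings in the text of an HTML page."""
    return EMAIL_RE.findall(html_to_text(html))
```

Joining the text nodes with spaces is a design choice with a cost: it stops words from neighbouring elements from running together, but it also splits any token that inline markup happens to interrupt. That is one reason plain text mining over HTML is harder than it first looks.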