Notes on Information extraction: Information Extraction from HTML

When it comes to information extraction, most of the time we deal with the Web and pages stored as HTML. There are two cases:

Working with a specific website (e.g., eBay). This is called web scraping, and it is a fairly straightforward task.
Crawling the Web in search of a certain type of data (mentions of people, places, organisations, telephone numbers, emails, etc.).

In the second case the page structure is unknown, so one has to fall back on pure text mining. However, with HTML things get harder. In this post, I am going to discuss the problem and a solution to it.

Problem description

If the structure of the page is not taken into account, only text features can be used to extract entities (they are discussed below). But web pages pose two problems that reduce the effectiveness of text features:

Multilingualism, abbreviations (New York - NYC, San Francisco - SF), slang and misspellings. Dictionaries are never complete.
A lot of short and unrelated pieces of text.

HTML is powerful in its ability to present information in a structured form. Usually the structures are lists, tables and cards with various combinations of fields. HTML structures may be very complex, and their forms are limited only by the designer’s imagination.

In fact, text in these structures is usually short, so once it is removed from the HTML it becomes a heap of tattered snippets. Due to the widespread presence of HTML structures, on average 70% of the text nodes extracted from an HTML page are 1 to 3 words long. If menus are excluded, 50% of the text chunks are still short. (I obtained these values on a training set of 10K pages in different languages and from different hosts.)
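The share of short text nodes can be measured with a few lines of code. Below is a minimal sketch that uses only the Python standard library; the names TextNodeCollector and short_node_share are illustrative, and it counts raw text nodes (without joining inline tags), so it will not reproduce the figures above exactly.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collects non-empty text nodes, skipping script and style content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.nodes = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.nodes.append(text)


def short_node_share(html, max_words=3):
    """Fraction of text nodes that contain at most max_words words."""
    collector = TextNodeCollector()
    collector.feed(html)
    collector.close()
    if not collector.nodes:
        return 0.0
    short = sum(1 for text in collector.nodes if len(text.split()) <= max_words)
    return short / len(collector.nodes)


sample = (
    "<ul><li>Home</li><li>About us</li></ul>"
    "<p>This is a long paragraph with many words.</p>"
)
print(short_node_share(sample))
```

In the sample, the two menu items are short and the paragraph is not, so two of the three text nodes count as short.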

For illustration, let’s take a look at the text nodes on eBay pages (the web service is described below). Short text means weak context. Weak context restricts the effectiveness of IE techniques.

Almost all pages have some solid blocks of continuous text. But concentrating only on them means losing a large amount of data because, according to the statistics above, data in HTML tends to be presented in structured forms.

For these reasons, using HTML markup as features becomes a necessary step to increase the effectiveness of information extraction on the Web.

Using HTML in IE tasks

From my experience, there are three ways of using HTML to improve IE effectiveness:

1. Look at tags

Today HTML aims to describe the structure of a document rather than its design. The HTML5 standard has added purely semantic tags like “time”, “meter”, “article”, etc. Some resources use microformats, which rely on the HTML tag “span” with very descriptive values of the attribute “class”. Sometimes the page design visually highlights important elements of a page, such as telephone numbers, so span elements with unique class values may also occur. In addition, the anchor tag (“a”) often wraps entities, so the values of its “href” and “class” attributes may be telling. Inline tags (e.g. "a", "span", "em", etc.) are also useful for determining entity boundaries.

All in all, it is a useful idea to consider the values of the “class” attribute and to look at how they correlate with the occurrence of your target entities in the nodes with those values.
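One rough way to look at that correlation is to count, for every class value, how often the text directly inside an element with that class matches a detector for the target entity. The sketch below is an illustration under stated assumptions: the telephone pattern is a guess that fits UK-style numbers such as "0871 718 5000", and the names ClassStats and class_hit_rates are hypothetical. Text is attributed only to its nearest enclosing element.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ClassStats(HTMLParser):
    """Counts text nodes per class value and how many match the pattern."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = pattern
        self.stack = []          # (tag, class values) of the open elements
        self.total = Counter()
        self.hits = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        self.stack.append((tag, classes))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.stack:
            return
        matched = bool(self.pattern.search(text))
        for cls in self.stack[-1][1]:
            self.total[cls] += 1
            if matched:
                self.hits[cls] += 1


def class_hit_rates(html, pattern):
    """Maps each class value to the share of its text nodes that match."""
    stats = ClassStats(pattern)
    stats.feed(html)
    stats.close()
    return {cls: stats.hits[cls] / stats.total[cls] for cls in stats.total}


PHONE = re.compile(r"\b0\d{3} \d{3} \d{4}\b")  # assumed phone format
snippet = (
    '<div><span class="tel">0871 718 5000</span>'
    '<span class="tel">0207 365 0777</span>'
    '<span class="name">Aer Lingus</span></div>'
)
print(class_hit_rates(snippet, PHONE))
```

Class values with a hit rate close to 1 across many pages are good candidate features for the target entity type.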

2. Use HTML structure to remove clutter from a page

Usually HTML pages have a lot of information that doesn't relate to their main message. In fact, page content can be split into the article part and the template part. The elements of the template are menus, design elements, advertising, the footer, etc.

There are a lot of papers and tools dedicated to the extraction of meaningful content from HTML pages. I recommend reading the review by Tomaz Kovacic of algorithms that extract article text from a page. Their common feature is that they all try to find one main block of text. That may be useful for improving page readability, but it often cuts too much.

I prefer a different approach. The idea is to take another page from the same site and then subtract its tree from the tree of the target page. Subtraction means that we overlay one tree on the other and remove all nodes whose subtrees match the corresponding nodes of the other tree. To be more flexible, we remove nodes whose subtrees match by at least 90% (to tolerate slight variations of the template across the site's pages).
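A minimal sketch of such a subtraction is shown below. It is only an illustration under simplifying assumptions, not the implementation behind the web service: both pages are assumed to be well-formed XML so that xml.etree.ElementTree can parse them, children are overlaid strictly by position, and the match score is simply the share of nodes whose tag and text coincide. The names signature, similarity and subtract are hypothetical.

```python
import xml.etree.ElementTree as ET


def signature(node):
    """Flattens a subtree into a list of (tag, text) pairs in document order."""
    return [(el.tag, (el.text or "").strip()) for el in node.iter()]


def similarity(a, b):
    """Share of positions at which the two flattened subtrees coincide."""
    sa, sb = signature(a), signature(b)
    matched = sum(1 for x, y in zip(sa, sb) if x == y)
    return matched / max(len(sa), len(sb))


def subtract(target, other, threshold=0.9):
    """Removes from target the subtrees that also occur in other (in place)."""
    for child, counterpart in list(zip(list(target), list(other))):
        if child.tag != counterpart.tag:
            continue  # nothing to overlay, keep the node
        if similarity(child, counterpart) >= threshold:
            target.remove(child)
        else:
            subtract(child, counterpart, threshold)
    return target


target = ET.fromstring(
    "<body><div><a>Home</a><a>Blogs</a></div>"
    "<div><p>Unique article text</p></div>"
    "<div><p>Footer</p></div></body>"
)
other = ET.fromstring(
    "<body><div><a>Home</a><a>Blogs</a></div>"
    "<div><p>Top 100 list</p></div>"
    "<div><p>Footer</p></div></body>"
)
print(ET.tostring(subtract(target, other), encoding="unicode"))
```

The shared menu and footer are removed, while the block that differs between the two pages survives. Real pages would first need a tolerant HTML parser and a more robust alignment of children than plain positions.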

You can evaluate the effectiveness of the proposed method by using the web service. As an illustration, let’s take the page http://technorati.com/ and subtract the page http://technorati.com/blogs/top100/ from it. The result. Let’s compare it with the Readability service. You can see that Readability has lost a lot of meaningful content. However, this is not surprising, since it uses less information and extracts one main article from the page.

3. Look for regular HTML structures on a page and group the corresponding elements of those structures

I describe the idea of object enumeration in the part on text features below. The idea of regular structures is a generalization of that method. It is very useful in the case of HTML pages since it works where standard methods fail completely.

To be clear, let’s start with an example. Consider the page http://www.cheapflights.co.uk/travel-tips/telephone-cheat-sheet/. It has a few tables with companies and their telephone numbers.

The rows of those tables form a repeated, consecutive HTML template:

<tr> <td>...</td> <td>...</td> <td>...</td> </tr>
<tr> <td>...</td> <td>...</td> <td>...</td> </tr>
<tr> <td>...</td> <td>...</td> <td>...</td> </tr>
...

As I said earlier, HTML tends to represent information in a structured form. If you see a regular and consecutive HTML template on a page, it is very likely that it is an enumeration of some structured data. In our example, this data is a record “company - telephone number - directions”.

Now two very important consequences:

Since it is an enumeration, the corresponding fields of the records should share the same or a very similar sense. Let’s join them into groups of objects with similar sense.
If some of the values of a group are objects of a particular type (e.g., a company or a person's name), then the other values of that group should be of the same type.

In the example, the groups of corresponding fields are:
Group 1: “Company / Business”, “Aer Lingus”, “bmi”, “bmi baby”, “British Airways”, etc.
Group 2: “Phone Number / Website”, “0871 718 5000”, “0871 500 0737”, “0207 365 0777”, etc.
Group 3: “Steps to take to talk to a human”, “Press 2 then 1”, “Put straight through”, “Put straight through”, etc.

If I have a basic vocabulary of companies (e.g., from DBpedia), I can find some of the values from the first group in this vocabulary. So, I can figure out the type of the objects in the first group.

Likewise, if I write a simple regular expression for telephone numbers, it will match many values from the second group and thereby identify its type as well.
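This typing step can be sketched as follows. The vocabulary, the telephone pattern and the 30% threshold are all assumptions made for the illustration (the post does not fix any of them), and infer_group_type is a hypothetical name. A group gets the type whose detector matches the largest share of its values, and that type is then attributed to the whole group.

```python
import re

# Assumed stand-ins: a tiny company vocabulary and a UK-style phone pattern.
COMPANIES = {"Aer Lingus", "British Airways"}
PHONE = re.compile(r"^0\d{3} \d{3} \d{4}$")

DETECTORS = {
    "company": lambda value: value in COMPANIES,
    "phone": lambda value: bool(PHONE.match(value)),
}


def infer_group_type(values, detectors, min_share=0.3):
    """Returns the type matched by the largest share of values, or None."""
    if not values:
        return None
    best_type, best_share = None, 0.0
    for type_name, detect in detectors.items():
        share = sum(1 for value in values if detect(value)) / len(values)
        if share > best_share:
            best_type, best_share = type_name, share
    return best_type if best_share >= min_share else None


group1 = ["Company / Business", "Aer Lingus", "bmi", "bmi baby", "British Airways"]
group2 = ["Phone Number / Website", "0871 718 5000", "0871 500 0737", "0207 365 0777"]
group3 = ["Steps to take to talk to a human", "Press 2 then 1", "Put straight through"]

for group in (group1, group2, group3):
    print(infer_group_type(group, DETECTORS))
```

Only two of the five values in the first group are in the toy vocabulary, yet that is enough to label "bmi" and "bmi baby" as companies too. The header cells end up in the groups as well and would need separate handling.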

This approach requires that the following two conditions be satisfied:

Group only templates that directly follow one another in the HTML tree. The example page has 2 tables; it would be wrong to join records from different tables.
Skip simple groups. A complex structure guarantees a specific sense of its fields; in contrast, monadic templates (with only one field) form groups with a very broad sense. For example, a sequence of paragraphs wrapped with the tag “p” doesn’t form an enumeration of objects with similar sense.
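The grouping and both conditions can be illustrated with a deliberately naive sketch. This is not the algorithm from the research paper mentioned below; it is a simplified stand-in that assumes well-formed markup, treats two sibling nodes as instances of the same template only when their child tags are identical, and counts child elements as fields. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def shape(node):
    """Structural template of a node: its tag plus the tags of its children."""
    return (node.tag, tuple(child.tag for child in node))


def field_texts(node):
    """Text of every child element (field) of a record node."""
    return ["".join(child.itertext()).strip() for child in node]


def similar_sense_groups(parent, min_fields=2, min_records=2):
    """Groups the fields of consecutive children that share one shape."""
    groups, run, run_shape = [], [], None
    for child in list(parent) + [None]:          # None flushes the last run
        child_shape = shape(child) if child is not None else None
        if child_shape != run_shape:
            if len(run) >= min_records and len(run_shape[1]) >= min_fields:
                groups.extend(list(column) for column in zip(*run))
            run, run_shape = [], child_shape
        if child is not None:
            run.append(field_texts(child))
    return groups


def page_groups(root):
    """Collects groups under every node, so separate tables are never mixed."""
    groups = []
    for node in root.iter():
        groups.extend(similar_sense_groups(node))
    return groups


page = ET.fromstring(
    "<body><table>"
    "<tr><td>Aer Lingus</td><td>0871 718 5000</td></tr>"
    "<tr><td>bmi</td><td>0871 500 0737</td></tr>"
    "</table>"
    "<div><p>First paragraph.</p><p>Second paragraph.</p></div></body>"
)
print(page_groups(page))
```

The table rows yield one group of company names and one group of telephone numbers, while the run of plain paragraphs is skipped because each of them has fewer than two fields.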

This approach can be tested with the web service. The result for our example.

The result for a page from eBay.com, where the template for product descriptions is very complex.

I will not discuss the algorithm here since it is part of my research paper. Later I will post a link to my article and the source code.

A few more words about HTML structures

In general, there is a great variety of different structures. However, their frequencies are heavily skewed. Some of them are very popular, e.g. simple tables, a block with an image and its label, forum messages, etc. A great number of websites are also built with popular CMSs such as Drupal, Joomla and WordPress, so their site components form prominent clusters of frequent structures on the Web.

This skew makes it possible to build a library of common HTML structures with descriptions of their possible meaning. Of course, trying to make a comprehensive library is utopian. But the ability to distinguish, for example, forum messages and comments from other text is very helpful, since they are usually very messy and need to be treated differently.

Additional section

The web service for this article

The web service is available at http://htmlparsertest.appspot.com/.

It takes a URI and loads the HTML page from it. It then offers 3 options:

print tree

It prints a simplified tree, in which all inline HTML tags (e.g., span, a, em, b, etc.) are removed, and all subtrees without text elements are removed as well. This operation is necessary for further text processing since it joins all HTML tree elements that are supposed to form one text block.

Example:

 In HTML tree:
   <p>
         Hello,
         <a>
               Sandra
         </a>
         !
   </p>
 
 In simplified tree:
 <p>
      Hello, Sandra!
 </p>
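The joining of inline tags can be sketched with the standard html.parser module. The set of inline tags and the names BlockTextExtractor and text_blocks are assumptions made for this illustration, and the sketch returns only the joined text blocks rather than a full simplified tree.

```python
from html.parser import HTMLParser

# Assumed list of inline tags; the service may use a different one.
INLINE_TAGS = {"a", "span", "em", "b", "i", "strong"}


class BlockTextExtractor(HTMLParser):
    """Joins text across inline tags and cuts it at every other tag."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._buffer = []

    def flush(self):
        # Normalize whitespace and keep the block only if it has text.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag not in INLINE_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag not in INLINE_TAGS:
            self.flush()

    def handle_data(self, data):
        self._buffer.append(data)


def text_blocks(html):
    """Returns the text blocks of a page with inline markup dissolved."""
    parser = BlockTextExtractor()
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.blocks


print(text_blocks("<div><p>Hello, <a>Sandra</a>!</p><p>  </p></div>"))
```

The anchor is dissolved into its paragraph, and the empty paragraph produces no block at all.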

print text nodes

Same as in the previous option, but prints only text nodes without HTML.

print similar sense groups

It prints text nodes of regular HTML structures joined into groups of similar sense.

2 modifications are available:

subtract the HTML tree loaded from another URI.

It is used to remove clutter and the static HTML template from the page. The approach was discussed above.

extract entities similar to those proposed by the user.

This option filters the groups of similar sense texts and removes all groups that have no elements similar to the proposed ones.

Web-scraping approach to information extraction tasks

For example, the task is to extract product descriptions and prices from eBay.

Solution:

for programmers:

- write a regular expression to extract a certain type of HTML node. This solution may run fast; however, it is slow to write and hard to maintain.

- use any HTML parsing library, get the page DOM, then extract the desired nodes with XPath. Such libraries include libxml (for C), html5lib, lxml, BeautifulSoup (for Python) and many others. A great library that simplifies almost all parts of this process is Scrapy (for Python).

for non-programmers:

- for one-off tasks, browser plug-ins and web services are good. TheWebMiner has a user-friendly interface and needs no scripting: just select text on the page and press the button. Outwit hub allows you to gather images, tables, links, etc. I also want to mention IMacros. It is a Firefox add-on that allows you to automate repetitive work, so one can build a macro that copies and pastes repeatedly and then follows the link to the next page.

- for data extraction as an automated daily task, Yahoo! Pipes may serve well. A video about how to feed a WordPress site with this tool.
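For the programmers' route with a DOM and XPath, here is a minimal sketch. To stay free of third-party dependencies it uses the limited XPath subset of xml.etree.ElementTree as a stand-in for lxml, so the markup must be well-formed. The listing, its class names "item" and "price", and the function scrape_items are invented for the illustration and do not describe eBay's real pages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product listing.
LISTING = (
    "<html><body>"
    '<div class="item"><h3>Vintage camera</h3><span class="price">$120.00</span></div>'
    '<div class="item"><h3>Tripod</h3><span class="price">$35.50</span></div>'
    '<div class="ad"><h3>Sponsored</h3></div>'
    "</body></html>"
)


def scrape_items(markup):
    """Returns (title, price) pairs for every product block in the markup."""
    root = ET.fromstring(markup)
    items = []
    # Select every div whose class attribute is exactly "item".
    for node in root.findall(".//div[@class='item']"):
        title = node.findtext("h3")
        price = node.findtext("span[@class='price']")
        items.append((title, price))
    return items


for title, price in scrape_items(LISTING):
    print(title, price)
```

With real, messy HTML the same idea applies, but the page should first go through a tolerant parser such as the libraries listed above.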

Text features for Information extraction:

vocabulary and word features.

The words "New York", "London", "Moscow" usually denote cities; "Jeffrey", "Tomas" and "Timothy" unambiguously refer to men's names. Having a vocabulary of common keys is the first thing to do when it comes to named-entity recognition tasks. However, there are also many ambiguous names such as "Nancy" (a city and a person's name), so context should support the word meaning.

Word features may help too. For example, a title-cased word with the suffix -ova often refers to a Russian surname. Also, regular expressions can be used effectively to describe well-formatted types of data such as telephone numbers and emails.
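Both kinds of word features are easy to express as regular expressions. The patterns below are simplified assumptions for illustration (real email addresses and surnames are more varied), and word_features is a hypothetical helper.

```python
import re

# Simplified email pattern: local part, "@", then a dotted domain.
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
# Title-cased word that ends with the suffix "ova".
OVA_SURNAME = re.compile(r"\b[A-Z][a-z]+ova\b")


def word_features(text):
    """Finds emails and surname candidates with the suffix -ova in text."""
    return {
        "emails": EMAIL.findall(text),
        "ova_surnames": OVA_SURNAME.findall(text),
    }


print(word_features("Contact Maria Sharapova at maria@example.com."))
```

Such matches are only candidates: a title-cased word ending in -ova may just as well be a place name, so context should confirm it.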

context features (surrounding of the word).

Context makes it possible to disambiguate the meaning of a word and to make a decision when the vocabulary is incomplete. Two consecutive title-cased words, one of which is from the vocabulary of personal names, are most of the time a mention of a person. A sequence of title-cased words ending with "Ltd." refers with high probability to a company name; likewise, a title-cased word followed by the keyword "City" is usually a city.

A very powerful context feature is enumeration. If there is a sequence of words connected by commas, conjunctions or signs such as the vertical bar "|", then we can assume they are objects of the same type. If one of them is found in the vocabulary of objects, then the others may be attributed to the same type.
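The enumeration feature can be sketched in a few lines. The separators handled here (commas, vertical bars and the conjunctions "and" and "or") and the toy vocabulary are assumptions for the illustration, and both function names are hypothetical.

```python
import re

CITIES = {"London", "Moscow"}  # toy vocabulary

# Separators: a comma, a vertical bar, or the words "and" / "or".
SEPARATOR = re.compile(r"\s*(?:,|\||\band\b|\bor\b)\s*")


def split_enumeration(text):
    """Splits an enumeration into its items, dropping empty pieces."""
    return [part.strip() for part in SEPARATOR.split(text) if part.strip()]


def propagate_type(text, vocabulary, type_name):
    """Labels every item when at least one item is in the vocabulary."""
    items = split_enumeration(text)
    if len(items) > 1 and any(item in vocabulary for item in items):
        return {item: type_name for item in items}
    return {}


print(propagate_type("London, Moscow | Tbilisi and Yerevan", CITIES, "city"))
```

"Tbilisi" and "Yerevan" are not in the vocabulary, but they are labelled as cities because they share an enumeration with "London" and "Moscow".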

Textual features can be effectively described by regular expressions and context-free grammars. There is a plethora of linguistic tool sets, like NLTK, Gate, Nooj, NLP tools, etc., which can be used for this task. Graham Wilcock wrote a good book about this stuff.

It is also a well-developed field in machine learning. Today there is a lot of software for named-entity recognition. Usually it relies on conditional random fields or hidden Markov models. I have to mention DBpedia Spotlight as a very promising product, especially because of the powerful DBpedia ontology it uses.

Saturday, February 9, 2013