Sunday, March 10, 2013

Common approach to the Named-entity recognition task

In the further explanation I shall provide examples from the task of the extraction of geography mentions from texts (e.g. mentions of cities, countries, districts, streets, etc.).

We may distinguish three steps in the NER:

  1. select objects (hypotheses) in the text; object is a word or group of words that probably is a mention of the entity
  2. represent each object as a set of features that could be used for classification
  3. classification itself

Each step in details

Step 1. Objects selection

Object (in this article) - is a short section in the text; it probably denotes an entity of the type that we extract (geography in our case). In the other words, we make a hypothesis about some part of the text that it is an entity, and we gonna check it in the classification step.

Those objects that are really entities we call positive objects; those objects that are not entities we call negative objects. So, we have two classes of objects, and we don’t know the object class at the first step. It is a common classification task.

It is assumed that there is a function called ExtractObject. The function takes as an input the text and returns the positions of objects. Its primary task not to lose any positive objects; its secondary task is to reduce the amount of negative objects that are passed to classification. It could be described as a filter or a weak classifier that extracts entities with arbitrarily small precision, but its recall is close to 1.0.

The ExtractObject function is selected based on the appropriateness to the exact task. In the very common case, it returns positions of all possible n-grams in the text. This function will be appropriate since all entities in the text will be among extracted objects. But it will also be not efficient since too many negative objects will be in the output. A lot of negative objects increase the load to subsequent steps that is undesirable.

The efficient function chooses objects according to the necessary features of an entity. For example, in the task of geography extraction these features are the presence of the object n-gram in the thesaurus of geographical names and the fact that the n-gram in the text is in the title or upper case. The last feature is acceptable only for well-formatted text.

Output of the efficient ExtractObject for geo-extraction:
During the 19th century, <object>Massachusetts State</object> was famous for the intellectual activity of its writers and educators.
View the profiles of professionals named <object>Georgia</object> Davidson on LinkedIn.

Non-efficent ExtractObject:
<object>The</object> election was for a new, extended term of six years.
The <object>election</object> was for a new, extended term of six years.
The election <object>was</object> for a new, extended term of six years.
The election was <object>for</object> a new, extended term of six years.
...
<object>The election</object> was for a new, extended term of six years.
The <object>election was</object> for a new, extended term of six years.
...

The efficient function allows to extract objects with the ratio positive to negative not less than 1:10.

Step 2. Features extraction

We are going to represent the object as a set of features. The function for extraction of features should meet two requirements:

  1. fully fix the meaning of the object in the text. If the features doesn’t allow to define the exact meaning of the object, then it might be impossible to resolve ambiguity. Example:
    The State University of <object>New York</object> is a system of public institutions.
  2. Here the features should describe the context of the object, since “New York” is a part of another entity here and is not a geography mention.

  3. it must be useful for the classification. Thus, from one hand, it should be possible to build a dividing surface between classes (provide required precision); from another hand, it should be possible to generalize common features of the different objects of the same class (achieve required recall and get compact model).

The features are labels. A label is a binary feature represented as a string and might be possessed by an object. If the object has the label then the feature value is true, otherwise - false. Example of features of the word “tables”:
“lemma:table”, “text_case:lower”, “pos:noun”, “number:plural”, etc.

It is important that features are binary. So, we have different features for different values of the same word parameter. For example, we don’t have two values for the word’s number. In contrast, we have two features: “number:plural”, “number:singular”, and both of them might be true. Two reasons are at the basis of this convention:

  1. features often have many values or not accurately known. For example, prior to part of speech tagging we don’t know the exact word’s part of speech, and for the word “table” it could be noun, or verb, or adjective. The same thing is for grammatical case. Another example is the position in the sentence. If the sentence is short then the word may be described as being in the sentence beginning and end.
  2. most of the time it is impossible to define the distance between different values of a parameter. As an illustration, In Russian language there are six cases. There is no way to define which of them are closer to each other than to others. So, it is better to make different features and not stray machine learning tools; since, most of them tries to distinguish the distance between objects according to the Gaussian or similar kernels. The same is for lemmas. One can define the distance by semantic similarity, but it is a very huge task, and there is no evidence that the Triangle inequality will work there.

To define completely a word in the text, we should describe features of the following types:

  1. lemma: “lemma:table”;
  2. morphological features to describe word’s form in the text: “pos:noun”, “number:plural”.
  3. These features are essential in inflectional languages, like Russian, where the word’s form reflects its syntactical connections in the sentence.

  4. capitalisation of the word: “text_case:lower”. In well-formated texts entities are often capitalised.
  5. position in the sentence and in special zones in the text. Special zones might be links, headers, paragraphs, footers, menu items for html pages, etc.

To take into account the word’s position we concat the features with the word’s coordinate relatively to the object. For example, let’s assume the following sentence:
Daily coverage of <object>New York's</object> restaurants, nightlife, shopping, fashion, politics, and culture.

Then the words’ lemmas could be described in the following way:
"-3:lemma:daily", "-2:lemma:coverage", "-1:lemma:of", "0:lemma:new_york", "1:lemma:restaurant", "2:lemma:,", "3:lemma:nightlife", etc.

The same for other features.

Also other information useful for classification might be described in the features. This information is word relative frequency in texts, the belonging of the word to a thesaurus of a special type or a certain cluster (e.g. obtained by context distribution clustering), the text features, etc.

The number of features is unlimited since the amount of lemmas in unknown.

Since the text size might be big, it is not effective to describe all words around the object. Thus, we should limit the size of the context to be considered. To meet also the first requirement to the features extraction function, we should make an assumption about the sufficient context size. Sufficient context size is an amount of words before and after the object in the text that should be described in features in order to fix the meaning of an object to the extent that the object class is unambiguously fixed.

The assumption:

The meaning of the object is completely fixed if the object words and the object context up to 3 words before and after the object are defined. So, the sufficient context size is 3.

I am not trying to proof this assumption here. I’ve got it empirically, and it is quite obvious. In truth it is a very rare case, when the object class will be still undefinable if the 7-gram is fixed. The example of that case:
I love you <object>Georgia<object>! You are the best of all, I’ve ever seen.

If we leave +/- 3 words context (“I love you <object>Georgia<object> ! You are”) we still cannot distinguish whether the object is a woman’s name, or a state, or a country. However, again these cases are rare.

So, according to the assumption, it is enough to describe words with their positions and forms inside the object and in the context up to 3 words.

An example for the full set of object’s features

The object: View the profiles of professionals named <object>Georgia</object> Davidson on LinkedIn.

Sufficient context: of professionals named <object>Georgia</object> Davidson on LinkedIn

Features:

of professionals named Georgia Davidson on LinkedIn
-3 -2 -1 0 1 2 3
-3:lem:of -2:lem:professional -1:lem:name 0:georgia 1:lem:davidson 2:on 3:linkedin
-3:pos:preposition -2:pos:noun -1:pos:verb 0:pos:noun 1:pos:noun 2:pos:preposition 3:pos:noun
-3:case:lower -2:case:lower -1:case:lower 0:case:title 1:case:title 2:case:lower 3:case:title
-2:number:plural -1:voice:passive 0:number:singular 1:number:singular 3:number:singular
0:voc:human_name 0:voc:human_surname 0:voc:organisation
0:voc:province
0:voc:country

So, the full set of features is:
-3:lem:of, -2:lem:professional, -1:lem:name,
0:georgia, 1:lem:davidson, 2:on, 3:linkedin,
-3:pos:preposition, -2:pos:noun, -1:pos:verb,
0:pos:noun, 1:pos:noun, 2:pos:preposition,
3:pos:noun, -3:case:lower, -2:case:lower,
-1:case:lower, 0:case:title, 1:case:title,
2:case:lower, 3:case:title, -2:number:plural,
-1:voice:passive, 0:number:singular, 1:number:singular,
3:number:singular, 0:voc:human_name, 0:voc:human_surname,
0:voc:organisation, 0:voc:province, 0:voc:country

Notes:

  1. To imitate the bag of words approach we can also duplicate all lemma-features without their positions.
  2. Objects could also be described with only syntactic features. In this case we leverage syntactic tree and describe in the object features its connections to the other sentence members, e.g. nouns, verbs, adjectives, adverbs etc. This approach is very effective since the meaning of the word can be defined by the verbs and nouns with which it can have syntactic relations. Also, we don’t rely on words closeness in the sentence. From the other hand, syntax costs a lot, may have mistakes, and not grasps such features as words order inversion in a sentence.
  3. many other types of features could be added. The final set of features depends on the exact type of entities what we extract.

Step 3. Classification

The next step after the features extraction is to define the class of the object (negative or positive). It is a common task of two-class classification. In the case when we extract objects of several types simultaneously, (e.g. person names, organisations, geography) we train n+1-class classificator (n is the number of entity types) or n two-class classifiers (each is one-against-all).

If the features will be used directly as classification features, then the features filtration should be applied; since, the number of features will be comparable to the number of objects in the training set.

The learning approach is chosen by the engineer. It might be any approach; however, as a matter of fact that we use binary features, decision tree-like and logic classifiers are more preferable than those like kNN, or SVN, or others that try to calculate the distance between objects. So, random forest, or matrixnet, or boosting on logic predicates will be the best choice.

I shall explain our approach to the classification task in the following post.

No comments:

Post a Comment