Wikilinks agent

Wikipedia

Final project by

Eran Rosenthal     Eliezer Kochva


Introduction

Wikipedia is a free internet encyclopedia that anyone can edit. One of its major advantages over classic encyclopedias is the network of inter-related links (wikilinks) between articles, which allows readers to easily access information relevant to the subject they are reading about.

The links are added manually by the editors of Wikipedia. This requires some technical skill, and editors may sometimes miss possible informative links. Our aim is to implement an artificial intelligence agent that learns the patterns and best practices of linking from featured articles and can automatically annotate important terms in raw text as links. Such an agent could be used to add links to newly created or poorly linked articles.

Approach and Method

We implemented several different approaches to linkifying as separate agents. All the agents share a common interface and common infrastructure for parsing, API queries to Wikipedia, and other common tasks. A brief description of the agents:

Naïve Approach - FoolAgent

As a baseline, the most basic agent annotates every term in the article for which an article with the same name exists. This approach is simple and finds almost all possible links. However, it leads to overlinking: an excessive number of links that makes it difficult to identify the links likely to significantly aid the reader's understanding, and may cause cognitive load.
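The baseline behavior can be sketched in a few lines. This is a hypothetical illustration, not the project's actual code; `fool_annotate` and `existing_titles` are made-up names, and real title matching would also have to handle multi-word titles and case normalization.

```python
import re

def fool_annotate(text, existing_titles):
    """Wrap every term that matches a known article title in [[...]]."""
    def link(match):
        word = match.group(0)
        return f"[[{word}]]" if word in existing_titles else word
    return re.sub(r"\b\w+\b", link, text)

titles = {"wiki", "encyclopedia"}
print(fool_annotate("A wiki is an encyclopedia anyone can edit", titles))
# → A [[wiki]] is an [[encyclopedia]] anyone can edit
```

Every known title is linked regardless of usefulness, which is exactly the overlinking problem described above.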

Heuristic approach - HeuristicAgent

A heuristic agent selects the terms to link based on heuristic features derived from the raw text. These features provide basic information about the context of a term within the raw text but carry no information about the term's meaning. The agent incorporates a learning model - decision tree, random forest or SVM - and requires training to evaluate the importance, or weights, of the features.
During the training phase, the agent receives a batch of featured articles, extracts features for all the terms in the articles, and uses the existing links as the true labels of the terms - link or non-link. Once the learning model is trained we can move on to the linking phase - the agent is given a new raw text, extracts features for each term, and uses the trained model to classify whether the term should be linked.
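The two phases can be sketched with sklearn (the learning library the project used). The feature extractor and the training data below are toy stand-ins for illustration only; the real agent uses the feature set described in the next section.

```python
from sklearn.tree import DecisionTreeClassifier

def extract_features(term, text):
    # Toy stand-ins for the report's features: term length and occurrences.
    return [len(term), text.count(term)]

# Training phase: terms from a featured article with their true labels
# (1 = linked in the featured article, 0 = not linked).
train_text = "Jimmy Wales launched Wikipedia with Larry Sanger"
train_terms = ["Jimmy Wales", "launched", "Larry Sanger", "with"]
labels = [1, 0, 1, 0]
X = [extract_features(t, train_text) for t in train_terms]
clf = DecisionTreeClassifier().fit(X, labels)

# Linking phase: classify terms of a new raw text with the trained model.
new_text = "Larry Sanger coined the name"
for term in ["Larry Sanger", "coined"]:
    print(term, clf.predict([extract_features(term, new_text)])[0])
```

On this toy data the tree simply learns that long terms are links, which is enough to show the train/classify split; the real agents combine many context features.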

Heuristic approach+Online access - OnlineAgent

The online agent extends the heuristic agent with features that, in addition to the context of the term, provide some indication of its meaning. The online agent is allowed to load the article text of candidate terms. This ability costs both network and time resources, but gives the agent additional information that can sometimes help with the link decision. We used a mutual-information criterion to select the terms to further expand with online data.
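Two of the online features can be sketched as follows, assuming the candidate's article text has already been fetched (the fetch itself is the network cost discussed above). The function names are illustrative, not the project's actual API.

```python
def linkback(candidate_text, source_title):
    """Does the candidate's article link back to the article being edited?"""
    return f"[[{source_title}]]" in candidate_text

def mutual_words(candidate_text, raw_text):
    """Number of distinct words appearing in both texts."""
    return len(set(candidate_text.lower().split())
               & set(raw_text.lower().split()))

print(linkback("He wrote for [[Wikipedia]] daily", "Wikipedia"))   # → True
print(mutual_words("free online encyclopedia",
                   "an online encyclopedia anyone can edit"))      # → 2
```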

Features

According to the Wikipedia style guideline: "Everyday words understood by most readers in context, names of major geographic features and dates are usually avoided, while events, people and topics that may help the readers are suggested"[^].
As there is no formal definition of these rules (they are interpreted subjectively by each article writer), and as implementing them requires understanding the "meaning" of terms, we decided to use some simple features that we believed could help a machine deduce the "importance" of a term:

  1. Features derived from the word only:
    1. Title length - The (character) length of the term. Many of the most common English words are short.
    2. Title word length - The number of words in the term. Single-word terms versus two-word terms may be indicative of names.
    3. Word type - Whether the term is a number and, if not, whether its first letter is capitalized.
  2. Features derived from the article context:
    1. Preword - Whether words indicative of links appear before the term. We use the finding that some words appear with high frequency before links compared to their overall frequency.
    2. Position - The average position of the term within its sentence. Usually the beginning of a sentence is richer in links than its end. (Graph 1)
    3. Number of occurrences - The number of occurrences of the term in the raw text.
    4. Entropy - Defined as -p*log(p), where p is the frequency of the word.
    5. Mutual information - For multi-word terms, defined using the frequency of the combined words compared to that of the individual components.
    6. Included - Whether the candidate link is a substring of another candidate link.
  3. Online features:
    1. Category - Whether the candidate article and the raw-text article belong to the same category.
    2. Linkback - Whether the candidate term's article links back to the new article.
    3. Mutual words - The number of words that appear in both the candidate article and the raw text.
    4. Priority words - Similar to the above, but counting only words that are important in the raw text, where "importance" is defined by mutual information.


  (Graph 1)
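The entropy and mutual-information features above can be computed directly from word frequencies. A minimal sketch, with hypothetical function names (the real features are computed over the tokenized raw text):

```python
import math

def entropy(word, words):
    # -p*log(p), where p is the word's relative frequency in the text.
    p = words.count(word) / len(words)
    return -p * math.log(p)

def mutual_information(pair, words):
    # Compare the bigram's frequency to that of its individual components;
    # a high value means the two words co-occur more than chance predicts.
    bigrams = list(zip(words, words[1:]))
    p_xy = bigrams.count(pair) / len(bigrams)
    p_x = words.count(pair[0]) / len(words)
    p_y = words.count(pair[1]) / len(words)
    return math.log(p_xy / (p_x * p_y)) if p_xy > 0 else float("-inf")

words = "new york is in new york state".split()
print(entropy("new", words))
print(mutual_information(("new", "york"), words))
```

Here "new york" always occurs as a pair, so its mutual information is positive, marking it as a likely multi-word term.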

Assumptions

We made some assumptions and reductions to the problem:
  1. Only "direct" links - We assume that the terms to be linked already appear in the raw text. This assumption does not always hold: the link target and the link text in Wikipedia may differ, e.g. a link to Barack Obama with the text "the current president" may look like [[Barack Obama|the current president]] (known as a "piped link"). Inferring such links is a hard problem, as the agent would have to "understand" the meaning of terms and use a large knowledge base; since maintaining and efficiently indexing such a KB would be required, this is out of the scope of this project. However, the assumption is not very strict, as most of the time the term itself does appear in the raw text.
  2. Featured articles use only correct links - We assume featured articles use links correctly, i.e. there is no overlinking or underlinking and there are no redundant links.
  3. Candidate title assumptions - As article titles in Wikipedia are usually one or two words, or names (in which every word starts with a capital letter), we checked only such terms as link candidates. This reduction is due to the fact that validating whether a term exists as an article in Wikipedia is time-consuming: extending it to terms of three or more words would require checking whether any sequence of three or more words in the raw text exists as an article, which greatly increases the number of API requests. (55% of all article titles are 1-2 words long, and another 17% are names longer than 2 words.)
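The candidate reduction of assumption 3 can be sketched as follows: collect all 1-2 word terms, plus longer runs of capitalized words as name candidates. The helper name and regex are illustrative only.

```python
import re

def candidates(text):
    words = text.split()
    cands = set(words)                                      # single words
    cands |= {" ".join(p) for p in zip(words, words[1:])}   # two-word terms
    # Names: maximal runs of three or more capitalized words.
    cands |= set(re.findall(r"(?:[A-Z]\w*\s+){2,}[A-Z]\w*", text))
    return cands

print(candidates("John Ronald Reuel Tolkien wrote books"))
```

Only these candidates are then checked against Wikipedia, keeping the number of API requests manageable.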

Results

Agents results

First we present a small sample of the agents' results when run on the raw text of the article about Wikipedia in the English Wikipedia.

FoolAgent

Wikipedia ( or ) is a free-access, free content [[Internet encyclopedia]], supported and hosted by the non-profit Wikimedia Foundation. Those who can access the site and follow its rules can edit most of its articles. Wikipedia is ranked among the ten most [[popular websites]] and constitutes the Internet's largest and most popular general [[reference work]].

[[Jimmy Wales]] and [[Larry Sanger]] launched Wikipedia on [[January 15]], 2001. Sanger coined its name, a portmanteau of wiki and encyclopedia. Initially only in English, Wikipedia quickly became multilingual as it developed similar versions in other languages, which differ in content and in editing practices. The English Wikipedia is now one of more than 200 Wikipedias and is the largest with articles. [[As of]] February 2014, it had 18 billion page views and nearly 500 million unique visitors each month. Globally, Wikipedia had more than 19 million accounts, out of which there were about 69,000 active editors as of November 2014.

HeuristicAgent-id3

Wikipedia ( or ) is a free-access, free content Internet encyclopedia, supported and hosted by the non-profit Wikimedia Foundation. Those who can access the site and follow its rules can edit most of its articles. Wikipedia is ranked among the ten most popular websites and constitutes the Internet's largest and most popular general reference work.

Jimmy Wales and Larry Sanger launched Wikipedia on January 15, 2001. Sanger coined its name, a portmanteau of wiki and encyclopedia. Initially only in English, Wikipedia quickly became multilingual as it developed similar versions in other languages, which differ in content and in editing practices. The English Wikipedia is now one of more than 200 Wikipedias and is the largest with articles. As of February 2014, it had 18 billion page views and nearly 500 million unique visitors each month. Globally, Wikipedia had more than 19 million accounts, out of which there were about 69,000 active editors as of November 2014.

HeuristicAgent-RandomForest

Wikipedia ( or ) is a free-access, free content Internet encyclopedia, supported and hosted by the non-profit Wikimedia Foundation. Those who can access the site and follow its rules can edit most of its articles. Wikipedia is ranked among the ten most [[popular websites]] and constitutes the Internet's largest and most popular general reference work.

Jimmy Wales and Larry Sanger launched Wikipedia on January 15, 2001. Sanger coined its name, a portmanteau of wiki and encyclopedia. Initially only in English, Wikipedia quickly became multilingual as it developed similar versions in other languages, which differ in content and in editing practices. [[The English]] Wikipedia is now one of more than 200 Wikipedias and is the largest with articles. As of February 2014, it had 18 billion page views and nearly 500 million unique visitors each month. Globally, Wikipedia had more than 19 million accounts, out of which there were about 69,000 active editors as of November 2014.

HeuristicAgent-SVM

Wikipedia ( or ) is a free-access, free content Internet encyclopedia, supported and hosted by the non-profit Wikimedia Foundation. Those who can access the site and follow its rules can edit most of its articles. Wikipedia is ranked among the ten most popular websites and constitutes the Internet's largest and most popular general reference work.

Jimmy Wales and Larry Sanger launched Wikipedia on January 15, 2001. Sanger coined its name, a portmanteau of wiki and encyclopedia. Initially only in English, Wikipedia quickly became multilingual as it developed similar versions in other languages, which differ in content and in editing practices. The English Wikipedia is now one of more than 200 Wikipedias and is the largest with articles. As of February 2014, it had 18 billion page views and nearly 500 million unique visitors each month. Globally, Wikipedia had more than 19 million accounts, out of which there were about 69,000 active editors as of November 2014.

OnlineAgent-id3

Wikipedia ( or ) is a free-access, free content Internet encyclopedia, supported and hosted by the non-profit Wikimedia Foundation. Those who can access the site and follow its rules can edit most of its articles. Wikipedia is ranked among the ten most popular websites and constitutes the Internet's largest and most popular general reference work.

Jimmy Wales and Larry Sanger launched Wikipedia on January 15, 2001. Sanger coined its name, a portmanteau of wiki and encyclopedia. Initially only in English, Wikipedia quickly became multilingual as it developed similar versions in other languages, which differ in content and in editing practices. The English Wikipedia is now one of more than 200 Wikipedias and is the largest with articles. As of February 2014, it had 18 billion page views and nearly 500 million unique visitors each month. Globally, Wikipedia had more than 19 million accounts, out of which there were about 69,000 active editors as of November 2014.

OnlineAgent-RandomForest

Wikipedia ( or ) is a free-access, free content Internet encyclopedia, supported and hosted by the non-profit Wikimedia Foundation. Those who can access the site and follow its rules can edit most of its articles. Wikipedia is ranked among the ten most popular websites and constitutes the Internet's largest and most popular general reference work.

Jimmy Wales and Larry Sanger launched Wikipedia on January 15, 2001. Sanger coined its name, a portmanteau of wiki and encyclopedia. Initially only in English, Wikipedia quickly became multilingual as it developed similar versions in other languages, which differ in content and in editing practices. The English Wikipedia is now one of more than 200 Wikipedias and is the largest with articles. As of February 2014, it had 18 billion page views and nearly 500 million unique visitors each month. Globally, Wikipedia had more than 19 million accounts, out of which there were about 69,000 active editors as of November 2014.

OnlineAgent-SVM

Wikipedia ( or ) is a free-access, free content Internet encyclopedia, supported and hosted by the non-profit Wikimedia Foundation. Those who can access the site and follow its rules can edit most of its articles. Wikipedia is ranked among the ten most popular websites and constitutes the Internet's largest and most popular general reference work.

Jimmy Wales and Larry Sanger launched Wikipedia on January 15, 2001. Sanger coined its name, a portmanteau of wiki and encyclopedia. Initially only in English, Wikipedia quickly became multilingual as it developed similar versions in other languages, which differ in content and in editing practices. The English Wikipedia is now one of more than 200 Wikipedias and is the largest with articles. As of February 2014, it had 18 billion page views and nearly 500 million unique visitors each month. Globally, Wikipedia had more than 19 million accounts, out of which there were about 69,000 active editors as of November 2014.

As the examples above show, the FoolAgent links everything it can, including dates and everyday phrases (e.g. [[January 15]], [[As of]]), while the trained agents are far more selective. HeuristicAgent-RandomForest keeps the useful [[popular websites]] link but also emits the spurious [[The English]], and the remaining trained agents add no links at all on this sample.

Conclusions

In this project we implemented several agents that can assist Wikipedia editors (and readers) with suggestions for wikilinks that would be helpful to readers. The problem of selecting "important" links is subjective (it depends on the reader) and hard to formalize. We tried to solve it with simple means: simple features, no knowledge base, and no real understanding of the meaning of terms. Our agents improve on the naïve approach of linking everything, and some (e.g. HeuristicAgent-SVM) provide surprisingly good results even with such a simple approach.

A learning method usually tries to be right on most of its predictions, so an "almost correct" agent would reject every link (as most of the text is not linked). We had to bias the learner toward links (a larger class weight) so the agent would prefer more links. We note that the choice of link weight is strongly tied to how such a tool would be run: if the agent works fully automatically and adds links to Wikipedia articles, we would prefer fewer links and less bias toward links, while for a semi-automatic tool we would prefer more bias.
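With sklearn classifiers this bias is expressed through the `class_weight` parameter. The weight values below are illustrative, not the ones used in the project:

```python
from sklearn.svm import SVC

# Weighting class 1 ("link") more heavily trades precision for recall:
# suitable for a semi-automatic tool, where suggesting too few links is
# worse than suggesting a few extra ones a human can reject.
clf = SVC(class_weight={0: 1, 1: 5})
```

For a fully automatic tool the link weight would be lowered, shifting the agent back toward fewer, higher-precision links.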

Using different levels of information (context for the HeuristicAgent, online data for the OnlineAgent) gives us the ability to trade off agent performance against runtime.

Additional Information

References

We used various open source libraries:

  1. sklearn
  2. matplotlib
  3. Pywikibot