The Wikipedia project is a free internet encyclopedia that anyone can edit. One of its major advantages over classic encyclopedias is its inter-related links (wikilinks) between articles, which allow readers to easily access material relevant to the subject they are reading about.
The links are added manually by Wikipedia's editors. This requires some technical skill, and editors may miss possible informative links. Our aim is to implement an artificial intelligence agent that learns the patterns and best practices of linking from featured articles and can automatically annotate important terms in raw text as links. Such an agent may be used to add links to newly created or poorly linked articles.
Approach and Method
We implemented a few different approaches to linkifying as different agents. All the agents share a common interface and common infrastructure for parsing, API queries to Wikipedia, and other common tasks. A brief description of the agents:
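The shared interface described above can be sketched as an abstract base class. This is an illustration, not the project's actual code; the class and method names are assumed.

```python
from abc import ABC, abstractmethod

class LinkAgent(ABC):
    """Sketch of the interface shared by the linkifying agents (names illustrative)."""

    @abstractmethod
    def train(self, featured_articles):
        """Learn linking patterns from featured articles (a no-op for baselines)."""

    @abstractmethod
    def annotate(self, raw_text: str) -> str:
        """Return raw_text with the selected terms wrapped as [[wikilinks]]."""
```

Each concrete agent (FoolAgent, HeuristicAgent, OnlineAgent) would then implement `train` and `annotate` on top of the shared parsing and Wikipedia-API infrastructure.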
Naïve Approach - FoolAgent
As a baseline, the most basic agent for annotating raw text with links simply annotates every term in the article that has an existing article with the same name. This approach is simple and finds almost all possible links. However, it may lead to overlinking: an excessive number of links that makes it difficult to identify the links likely to significantly aid the reader's understanding, and may cause cognitive load.
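The baseline strategy can be sketched in a few lines. Here `article_exists` is a hypothetical stand-in for a Wikipedia API lookup; the example below substitutes a toy set of existing titles.

```python
def fool_agent(terms, article_exists):
    """Link every term whose title exists as a Wikipedia article."""
    return [f"[[{t}]]" if article_exists(t) else t for t in terms]

# Toy "existing titles" set in place of a live API call:
titles = {"Python", "Encyclopedia"}
print(fool_agent(["Python", "is", "an", "Encyclopedia"], titles.__contains__))
# → ['[[Python]]', 'is', 'an', '[[Encyclopedia]]']
```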
Heuristic Approach - HeuristicAgent
A heuristic agent selects the terms to link based on heuristic features derived from the raw text. These features provide basic information about the context of the term within the raw text but do not capture the term's meaning. The agent incorporates a learning model - decision tree, random forest, or SVM - and requires training to evaluate the importance, or weights, of the features.
During the training phase, the agent receives a batch of featured articles, extracts features for all the terms in the articles, and uses the existing links as the true labels of the terms: link or non-link. Once the learning model is trained, we can move on to the linking phase: the agent is given new raw text, extracts features for each term, and uses the trained model to classify whether the term should be linked.
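The two phases can be sketched with scikit-learn, with `DecisionTreeClassifier` standing in for ID3. The features and training terms below are toy stand-ins for the real feature set and featured-article data, not the project's actual pipeline.

```python
from sklearn.tree import DecisionTreeClassifier

def extract_features(term, raw_text):
    # Toy versions of the report's features: title length, title word
    # length, capitalisation, and number of occurrences in the raw text.
    return [len(term), len(term.split()),
            int(term[0].isupper()), raw_text.count(term)]

# Training phase: existing links in featured articles supply the labels.
text = "Barack Obama wrote about Wikipedia in November. The article was well linked."
train = [("Barack Obama", 1), ("Wikipedia", 1), ("the", 0), ("November", 0)]
X = [extract_features(term, text) for term, _ in train]
y = [label for _, label in train]
model = DecisionTreeClassifier(random_state=0).fit(X, y)

# Linking phase: classify whether a term in new raw text should be linked.
def should_link(term, raw_text):
    return bool(model.predict([extract_features(term, raw_text)])[0])
```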
Heuristic Approach + Online Access - OnlineAgent
The online agent extends the heuristic agent with features that, in addition to the context of the term, provide some indication of its meaning. The online agent is allowed to load the article text of candidate terms. This ability costs both network and time resources, but gives the agent additional information that sometimes helps in the linking decision. We used a mutual information criterion to select the terms we would like to further expand with online data.
According to the Wikipedia style guideline: "Everyday words understood by most readers in context, names of major geographic features and dates are usually avoided, while events, people and topics that may help the readers are suggested"[^].
As there is no formal definition of these rules (they are interpreted subjectively by the article writer), and as implementing them requires understanding the "meaning" of terms, we decided to use some simple features that we believed could help a machine deduce the "importance" of a term:
- Features derived from the word only:
  - Title length - The length of the term, in characters. Some of the most common words in English are short.
  - Title word length - The number of words in the term. Single-word terms, compared to two-word terms, may be indicative of names.
  - Word type - Whether the term is a number, and if not, whether its first letter is capitalized.
- Features derived from the article context:
  - Preword - Whether words indicative of links appear before the term. We use the finding that some words appear with high frequency before links compared to their overall frequency.
  - Position - The average position of the word in the sentence. Usually the beginning of a sentence is richer in links than its end. (Graph 1)
  - Number of occurrences - The number of occurrences of the term in the raw text.
  - Entropy - Defined as -p*log(p), where p is the frequency of the word.
  - Mutual information - For multi-word terms, defined using the frequency of the combined words compared to that of the individual components.
  - Included - Whether a candidate link is a substring of another candidate link.
- Online features:
  - Category - Whether the candidate article shares a category with the article being linkified.
  - Linkback - Whether the candidate term's article links back to the article being linkified.
  - Mutual words - The number of words that appear in both the candidate article and the raw text.
  - Priority words - Similar to the above, but counting only words that are important in the raw text, where "importance" is defined by mutual information.
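Two of the context features above can be sketched concretely: the entropy feature, computed from the term's frequency in the raw text, and pointwise mutual information for two-word terms. This is an illustrative interpretation of the definitions above, not the project's exact formulas.

```python
import math
from collections import Counter

def entropy_feature(term, tokens):
    # -p*log(p), where p is the term's relative frequency in the raw text.
    p = tokens.count(term) / len(tokens)
    return -p * math.log(p) if p > 0 else 0.0

def mutual_information(bigram, tokens):
    # Pointwise MI of a two-word term: compares the frequency of the pair
    # with the frequencies of its components; a high value suggests the
    # two words form a collocation worth treating as a single candidate.
    w1, w2 = bigram.split()
    n = len(tokens)
    counts = Counter(tokens)
    pair = sum(1 for a, b in zip(tokens, tokens[1:]) if (a, b) == (w1, w2))
    if pair == 0:
        return 0.0
    return math.log((pair / n) / ((counts[w1] / n) * (counts[w2] / n)))
```

For example, on the token stream `"new york is a city and new york is big"`, the pair "new york" always occurs as a unit, so its mutual information is positive.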
Assumptions
We have made some assumptions and reductions to the problem:
- Only "direct" links - We assume the terms that may be linked already appear in the raw text. This assumption may not always hold: in Wikipedia the link target and the link text are not necessarily equal, e.g. a link to Barack Obama with the text "the current president" may look like [[Barack Obama|the current president]] (known as a "piped link"). Inferring such links is a hard problem, as the agent must "understand" the meaning of terms, which requires a large knowledge base. Since maintaining and efficiently indexing such a KB would be required, this is out of the scope of this project. However, the assumption is not very restrictive, as most of the time the term itself does appear in the raw text.
- Featured articles use only correct links - We assume featured articles use links correctly, i.e. there is no overlinking or underlinking and there are no redundant links.
- Candidate title assumptions - As article titles in Wikipedia are usually one or two words, or names (where every word in the title starts with a capital letter), we checked only such terms as link candidates. This reduction is due to the fact that validating whether a term exists as an article in Wikipedia is time-consuming. Extending it to terms of three or more words would require checking whether any sequence of three or more words in the raw text exists as an article - which greatly increases the number of API requests. (55% of all article titles are 1-2 words long, and another 17% are names longer than two words.)
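The candidate restriction above can be sketched as follows: collect single words, two-word phrases, and maximal runs of more than two capitalized words. This is an illustrative sketch of the reduction, with assumed tokenization details.

```python
import re

def candidate_terms(raw_text):
    # Candidates restricted as in the report: single words, two-word
    # phrases, and capitalised name sequences longer than two words.
    words = re.findall(r"[A-Za-z][A-Za-z'\-]*", raw_text)
    cands = set(words)
    cands.update(f"{a} {b}" for a, b in zip(words, words[1:]))
    run = []
    for w in words + [""]:          # empty sentinel flushes the final run
        if w[:1].isupper():
            run.append(w)
        else:
            if len(run) > 2:        # names longer than two words
                cands.add(" ".join(run))
            run = []
    return cands
```

Each candidate would then be checked against Wikipedia to see whether an article with that title exists, which is why keeping the candidate set small matters.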
First we present a small sample of the agents' results when run on the raw text of the English Wikipedia article.
- FoolAgent - as expected, it overlinks.
- HeuristicAgent - behaves differently with different learning methods. As most of the raw text is not linked, we had to decide whether to let the agent simply work from the given annotations or to bias it toward more links. For SVM and RandomForest we added a bias toward more links, while for ID3 we did not adjust the link weight. As a result, RandomForest and SVM produce many more links. SVM seems to work better, though it gives some irrelevant or less relevant links (e.g. November or The English).
- OnlineAgent - here we bias toward terms with higher mutual information, which sometimes gives slightly better results; for example, ID3 now links to English Wikipedia. RandomForest behaves quite differently: as the number of features grows, the selection of features within the trees becomes important (the online features are computed only for the top link candidates). For SVM we get very similar (but not identical) results.
In this project we implemented different agents which can assist Wikipedia editors (and readers) with suggestions for wikilinks that would be helpful to readers. The problem of selecting "important" links is subjective (it depends on the reader) and hard to formalize. We tried to solve it with simple means: simple features, no knowledge base, and no real understanding of the meaning of terms. Our agents improve on the naïve approach of linking everything, and some of them (HeuristicAgent-SVM) provide surprisingly good results even with such a simple approach.
Usually a learning method tries to be right on most of its predictions, so an "almost correct" agent will reject every link (as most of the text is not linked). We had to give some bias to links (a bigger weight) so the agent would prefer more links. We note that the choice of link weight in the learning method is strongly tied to how such a tool would be run: if the agent works fully automatically and adds links to Wikipedia articles on its own, we would prefer fewer links and a smaller bias, while for a semi-automatic tool we would prefer a larger bias.
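In scikit-learn, this bias can be expressed through per-class weights. The weights below are hypothetical, chosen only to illustrate the fully automatic versus semi-automatic trade-off; the report does not state the exact values used.

```python
from sklearn.svm import SVC

# Class 1 = "link", class 0 = "no link". A heavier weight on the link
# class counteracts the imbalance (most terms are not linked), so an
# unweighted model would tend to predict "no link" for everything.
semi_automatic = SVC(class_weight={0: 1, 1: 5})   # suggest links generously
fully_automatic = SVC(class_weight={0: 1, 1: 2})  # link more conservatively
```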
The use of different levels of information (context for HeuristicAgent, online data for OnlineAgent) gives us the ability to trade off agent performance against runtime.
We used various open source libraries: