Release of the Wikipedia collection reader v.0.4
By Fabien Poulard on 03/04/2010, 13:39 - Sciences & Recherche
Wikipedia is an incredible source of information, data and more generally of language acts (uses of language). It is a unique resource for researchers in natural language processing (NLP).
The MediaWiki UIMA Loader is a UIMA component, a collection reader to be more specific, that can use Wikipedia to build corpora. Version 0.4 is the first publicly announced release of this component.
For those who do not want to wait any longer:
- The jar package of the component (and its dependencies: mwdumper and wikimodel.wem)
- The sources of the component
Presentation of the component
The MediaWiki UIMA Loader is a collection reader designed to load data from a MediaWiki site, especially Wikipedia and its derivative projects ...
The component is distributed under an Apache 2 license, so you can use it for academic or commercial purposes. In both cases, if the component is useful to you, do not hesitate to tell me what you think, whether you need new features, or whether you have found bugs.
Unlike several other projects, the component does not connect directly to the Wikipedia website. Nor does it need a local mirror of the website in a local MediaWiki database. It works directly on the XML dumps, which has the following advantages:
- No repeated access to the servers of the MediaWiki projects, which preserves their bandwidth and computing capacity;
- No need to deploy a database server and import the dump data into it (if you still want to do so, and read French, here is a tutorial I wrote about this process; I can translate it if you ask);
- It limits the disk space needed to store the information by using the compressed dumps directly;
- It limits the latency due to network and SQL server requests.
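To give an idea of what working directly on a compressed dump looks like, here is a minimal Python sketch (not the Java code of the component) that streams pages out of a bz2-compressed XML document without a database and without decompressing the file to disk. The sample XML is a heavily simplified, hypothetical stand-in for a real dump: real MediaWiki exports declare an XML namespace and contain many more elements, which is why the sketch only compares local tag names. The names `iter_pages` and `local_name` are made up for the example.

```python
import bz2
import io
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified stand-in for a MediaWiki XML dump.
SAMPLE_DUMP = b"""<mediawiki>
  <page>
    <title>Alpha</title>
    <revision><id>1</id><text>first</text></revision>
    <revision><id>2</id><text>second</text></revision>
  </page>
  <page>
    <title>Talk:Alpha</title>
    <revision><id>3</id><text>discussion</text></revision>
  </page>
</mediawiki>"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree adds to qualified tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, revision ids) for each <page>, one page at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = None
        revision_ids = []
        for child in elem:
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "revision":
                for sub in child:
                    if local_name(sub.tag) == "id":
                        revision_ids.append(int(sub.text))
        yield title, revision_ids
        elem.clear()  # drop the page's subtree once it has been consumed


# Compress the sample in memory, then read it back as a stream, the way
# one would read an on-disk file with bz2.open(path, "rb").
compressed = bz2.compress(SAMPLE_DUMP)
with bz2.open(io.BytesIO(compressed), "rb") as stream:
    pages = list(iter_pages(stream))
```

Because the parser hands over one page at a time and each subtree is cleared afterwards, only a small part of the dump needs to be held in memory at any moment.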
The features of this 0.4 version are the following:
- Loading from an XML dump, compressed or not;
- Several filtering options (see below) for the pages and revisions to be loaded into the UIMA chain;
- Interpretation of the wiki syntax and annotation of the Titles, Sections, Paragraphs and Links (see this post for more information; you will still need to read French).
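As a rough illustration of what annotating wiki syntax means, the following Python sketch locates section headings and internal links in a wikitext string and records them as (type, begin, end, label) tuples, in the spirit of UIMA annotations, which carry begin and end offsets into the document text. This is not how the component works internally (it ships with wikimodel as a dependency, presumably for this purpose). The regular expressions, the `annotate` function and the sample text are assumptions made for the example and only cover the simplest cases.

```python
import re

# Naive patterns, assumed for illustration only: real wiki syntax has many
# more cases (templates, nested links, nowiki blocks, ...).
SECTION_RE = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$", re.MULTILINE)
LINK_RE = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|([^\]]*))?\]\]")


def annotate(wikitext):
    """Return (type, begin, end, label) tuples sorted by begin offset."""
    annotations = []
    for match in SECTION_RE.finditer(wikitext):
        annotations.append(("Section", match.start(), match.end(), match.group(2)))
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        annotations.append(("Link", match.start(), match.end(), target))
    return sorted(annotations, key=lambda item: item[1])


sample = "== History ==\nSee [[Paris|the capital]] and [[France]].\n"
for kind, begin, end, label in annotate(sample):
    print(kind, begin, end, label, repr(sample[begin:end]))
```

Keeping offsets rather than extracted strings means the original text stays untouched and several layers of annotations can coexist over it.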
Setup
Before setting up and using the component, it is necessary to have a UIMA environment set up. If this is not the case, visit this tutorial.
The simplest way is to download the component jar from the uima-fr download space, as well as the dependencies: mwdumper and wikimodel.wem.
If you would like to build the jar yourself, you need to download the component sources, also from the uima-fr download space, and compile them with Maven:
$ tar -xzvf mediawiki-uima-loader-0.4.1.tar.gz
...
$ cd mediawiki-uima-loader-0.4.1
$ mvn package
...
The jar package should be created in the target/ directory, and the dependencies will have been downloaded into your local Maven repository.
Use
You can use the component in any UIMA toolchain, the same way you use any other collection reader. The following procedure uses the cpeGui, but it should be similar for other tools of the same kind.
The cpeGui cannot load an XML descriptor from inside a jar, so the descriptor has to be extracted first. If you compiled the component yourself, the descriptor is in the desc directory. Otherwise, you just have to extract it from the jar:
$ jar xf mediawiki-uima-loader-0.4.1.jar wikipedia-cr.xml
You must add the component jar and its dependencies to the UIMA_CLASSPATH before launching the cpeGui from the command line. For example, assuming that the component jar is in the current directory and the dependencies are in the local Maven repository:
$ export UIMA_CLASSPATH=$UIMA_CLASSPATH:~/.m2/repository/org/wikimedia/mwdumper/1.16/mwdumper-1.16.jar:~/.m2/repository/org/wikimodel/org.wikimodel.wem/2.0.7-SNAPSHOT/org.wikimodel.wem-2.0.7-SNAPSHOT.jar:mediawiki-uima-loader-0.4.1.jar
$ cpeGui
In the panel dedicated to the Collection Reader, click on Browse and select the component descriptor extracted from the jar (wikipedia-cr.xml). The panel then updates to show the configuration fields of the component.
The only mandatory parameter is the Input Xml Dump field. In it, you must specify the path to the XML dump of Wikipedia (or any other MediaWiki dump) that you would like to load. For example: ~/frwiki-20100111-pages-meta-history.xml.bz2. The component can read a dump whether it is compressed or not.
The other configuration fields control the filtering applied while loading the data:
- Latest Revision Only: if you check this box, only the latest revision of each page is loaded; otherwise all of them are loaded (all the revisions of one page per CAS);
- Ignore Talks: if you check this box, all the talk pages are ignored; otherwise they are loaded;
- Config Namespaces Filter: this field lets you specify the namespaces to consider or to ignore when the data is loaded. If it is left empty, all the namespaces are loaded. For Wikipedia the available namespaces are:
- -2: media resources;
- -1: special pages;
- 0: main namespace, where the articles are;
- 1: talk pages for articles;
- 2: user pages;
- 3: talk pages for user pages;
- 4: Wikipedia project;
- 5: talk pages for the Wikipedia project;
- 6: files;
- 7: talk pages for files;
- 8: MediaWiki;
- 9: talk pages for MediaWiki;
- 10: templates;
- 11: talk pages for templates;
- 12: help;
- 13: talk pages for help;
- 14: categories;
- 15: talk pages for categories;
- 100: portal;
- 101: talk pages for the portal;
- 102: projects;
- 103: talk pages for the projects;
- 104: references;
- 105: talk pages for the references;
For example, to consider only the talk pages: 1,3,5,7,9,11,13,15,101,103,105; or to consider all namespaces except categories: !14;
- Config Title Match: this field lets you filter the pages using a regular expression. Only the pages whose title matches the regular expression are loaded. For example: A.* to load all pages starting with an A;
- Config List Filter and Config Exact List Filter: in these fields you specify the path to a file containing one page name per line. Only these pages are loaded. With the Exact field, a listed name must match the page title itself; with the other field, it may also match the corresponding talk page;
- Config Revision List Filter: this field lets you specify the path to a file containing the revision numbers to load (one per line);
- Config Before Timestamp Filter and Config After Timestamp Filter: these fields let you specify the time window in which the data must have been created. Give the beginning and end dates in the format yyyy-MM-dd'T'HH:mm:ss'Z'.
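The filter values described above can be made more concrete with a small Python sketch. It is not the component's implementation, and several points are assumptions rather than documented behaviour: that a namespace entry prefixed with ! excludes that namespace while plain entries form an allow-list, that the regular expression must match the whole title, and that the timestamp bounds are inclusive. All function names are hypothetical. The strptime pattern is the Python counterpart of yyyy-MM-dd'T'HH:mm:ss'Z'.

```python
import re
from datetime import datetime, timezone

# Python counterpart of the yyyy-MM-dd'T'HH:mm:ss'Z' pattern.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_timestamp(value):
    """Parse a timestamp such as 2010-01-11T12:00:00Z as UTC."""
    return datetime.strptime(value, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)


def parse_namespace_filter(spec):
    """Split '1,3,5' or '!14' into (included, excluded) sets of ids."""
    included, excluded = set(), set()
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if item.startswith("!"):
            excluded.add(int(item[1:]))
        else:
            included.add(int(item))
    return included, excluded


def namespace_allowed(namespace, spec):
    """Assumed semantics: an empty filter accepts every namespace."""
    included, excluded = parse_namespace_filter(spec)
    if namespace in excluded:
        return False
    return not included or namespace in included


def title_allowed(title, pattern):
    """Assumed semantics: the pattern must match the entire title."""
    return re.fullmatch(pattern, title) is not None


def timestamp_allowed(timestamp, after=None, before=None):
    """Assumed semantics: inclusive bounds, either of which may be omitted."""
    moment = parse_timestamp(timestamp)
    if after is not None and moment < parse_timestamp(after):
        return False
    if before is not None and moment > parse_timestamp(before):
        return False
    return True


talk_only = "1,3,5,7,9,11,13,15,101,103,105"
print(namespace_allowed(1, talk_only), namespace_allowed(0, talk_only))
print(namespace_allowed(14, "!14"), title_allowed("Apple", "A.*"))
```

Checking a value by hand this way before launching a long run can save a wasted pass over a multi-gigabyte dump.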
Once the component is configured, you can add the other components and set up the toolchain as you usually do before launching the execution.
Beware: if you export the content of a compressed dump, with the XmiWriter component for example, the amount of data may be 20 to 100 times larger than the original compressed dump. Hence, count on about 200 GB for the English version of Wikipedia when considering only the latest revision of the pages.
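The expansion factor translates into a quick back-of-the-envelope estimate. The short Python sketch below applies the 20 to 100 range to a compressed dump size; the 5 GB input is a hypothetical value chosen for the example, not the size of any actual dump, and `estimate_export_size` is a made-up helper name.

```python
GB = 10**9  # decimal gigabyte


def estimate_export_size(compressed_bytes, low_factor=20, high_factor=100):
    """Return (low, high) estimates in bytes for the exported data."""
    return compressed_bytes * low_factor, compressed_bytes * high_factor


# Hypothetical 5 GB compressed dump; substitute the size of your own file,
# for example the value returned by os.path.getsize(path).
low, high = estimate_export_size(5 * GB)
print(f"Expect between {low / GB:.0f} GB and {high / GB:.0f} GB of output")
```

Running such a check against the free space of the target disk before starting the export avoids discovering the problem hours into the run.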

