This document describes how to deploy WikiXMLDB on your own computer. The WikiXMLDB demo deployed by the Sedna team on Amazon EC2 runs on a virtual machine with restricted resources. To get better performance and customize the application freely, you can run it on your own computer.
Deploying WikiXMLDB takes a few simple steps:
1. Get Wikipedia snapshot
Open http://download.wikimedia.org/enwiki/ in your browser and navigate to the most recent directory named by date. The file you need is pages-articles.xml.bz2, which contains the Wikipedia articles in MediaWiki format. Download this file and unpack it to obtain a regular XML file.
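Because the snapshot is very large, it helps to unpack it as a stream rather than read it into memory. The following is a minimal sketch using only the Python standard library; the function name unpack_bz2 and the file names in the usage comment are assumptions for illustration, not part of WikiXMLDB.

```python
import bz2
import shutil
from pathlib import Path


def unpack_bz2(src: str, dst: str, chunk_size: int = 1024 * 1024) -> int:
    """Stream-decompress the .bz2 file src into dst; return the bytes written."""
    # bz2.open decompresses lazily, so only chunk_size bytes are held at a time.
    with bz2.open(src, "rb") as compressed, open(dst, "wb") as out:
        shutil.copyfileobj(compressed, out, chunk_size)
    return Path(dst).stat().st_size


# Example call (file names are assumed; use the name of your download):
# unpack_bz2("pages-articles.xml.bz2", "pages-articles.xml")
```

The command-line tool bunzip2 gives the same result if it is available on your system.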
2. Parse Wikipedia to XML representation
The next step is to parse the MediaWiki representation of Wikipedia into a more structured XML representation. Open the download page of the Texterra project and download the latest version of the parser. The parser is Python-based, so you need Python to run it. Further documentation on running the parser is available in the README file that comes with it.
3. Load Wikipedia into Sedna
To load Wikipedia into the Sedna XML database, you need Sedna installed on your computer. If you don't have it yet, go to the Sedna home page and download it. Follow the Quick Start guide to install and run Sedna.
Now you are ready to load the XML representation of Wikipedia into Sedna. First, make sure you have at least 250 GB of free space on the disk that holds the Sedna database files (by default they reside in the sedna/data directory of your Sedna installation path). Then create a database named 'wiki', start it, and use se_term to bulk load the file (we assume it is named 'wiki.xml') as a document named 'wiki'.
The bulk load takes some time and ends with the 'Bulk load succeeded' message from se_term.
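The load can be sketched as a small Python script that first checks the free disk space and then issues the Sedna commands. This is only a sketch under assumptions: the sequence se_gov, se_cdb, se_sm, se_term follows the usual Sedna Quick Start flow, passing the LOAD statement through the -query option of se_term is assumed to work for bulk loads, and the helper names has_free_space, load_commands and run_load are hypothetical. The script defaults to a dry run that only prints the commands.

```python
import shlex
import shutil
import subprocess

REQUIRED_GB = 250  # free space recommended for the Sedna data directory


def has_free_space(path: str, required_gb: float = REQUIRED_GB) -> bool:
    """Return True if the file system holding path has required_gb free."""
    free_gb = shutil.disk_usage(path).free / 1024 ** 3
    return free_gb >= required_gb


def load_commands(xml_file: str = "wiki.xml", db: str = "wiki", doc: str = "wiki"):
    """Build the assumed Sedna command sequence for loading xml_file."""
    load_stmt = f'LOAD "{xml_file}" "{doc}"'
    return [
        ["se_gov"],                            # start the Sedna governor
        ["se_cdb", db],                        # create the database
        ["se_sm", db],                         # start the database
        ["se_term", "-query", load_stmt, db],  # bulk load (assumed usage)
    ]


def run_load(data_dir: str, dry_run: bool = True) -> None:
    """Check disk space, then print (dry run) or execute the commands."""
    if not has_free_space(data_dir):
        raise RuntimeError(f"less than {REQUIRED_GB} GB free under {data_dir}")
    for cmd in load_commands():
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)


# Example call (the data directory path is an assumption):
# run_load("sedna/data", dry_run=True)
```

Keeping the dry run as the default makes it easy to compare the printed commands with the Quick Start guide before anything is executed.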
After a successful bulk load, you should create several indexes to speed up queries over Wikipedia. Create an empty file named 'indexes.xquery' and put the following lines in it:
(: retrieve article by title :)
declare default element namespace "http://www.mediawiki.org/xml/export-0.5/";
CREATE INDEX 'article-by-title'
ON doc("wiki")/mediawiki/page BY title
AS xs:string&

(: retrieve articles which refer to the given article :)
declare default element namespace "http://www.mediawiki.org/xml/export-0.5/";
CREATE INDEX 'article-by-link'
ON doc("wiki")/mediawiki/page BY .//link/@label
AS xs:string&

(: retrieve articles which belong to the given category :)
declare default element namespace "http://www.mediawiki.org/xml/export-0.5/";
CREATE INDEX 'article-by-cat'
ON doc("wiki")/mediawiki/page BY ./revision/catlinks/catlink/@href
AS xs:string
Then run the following command:
se_term -file indexes.xquery wiki
This command causes Sedna to create these three indexes.
Testing your installation
After the deployment steps are completed, you can run the following command to check that everything works correctly:
se_term -query "index-scan('article-by-title','Data mining','EQ')" wiki
This command should return an XML representation of the 'Data mining' article.
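The same check can be repeated for other titles or for the other two indexes. The sketch below builds the se_term command line for an index-scan lookup. The helper names xquery_string and index_scan_command are hypothetical. Article titles may contain apostrophes or ampersands, so the key is escaped as an XQuery string literal before it is placed in the query.

```python
def xquery_string(value: str) -> str:
    """Quote value as an XQuery string literal in single quotes."""
    # '&' starts an entity reference in XQuery literals, so escape it first;
    # an embedded apostrophe is written by doubling it.
    escaped = value.replace("&", "&amp;").replace("'", "''")
    return "'" + escaped + "'"


def index_scan_command(index: str, key: str, db: str = "wiki") -> list:
    """Build the se_term argument list for an exact-match index lookup."""
    query = f"index-scan({xquery_string(index)},{xquery_string(key)},'EQ')"
    return ["se_term", "-query", query, db]


# Example call (requires a running Sedna database named 'wiki'):
# import subprocess
# subprocess.run(index_scan_command("article-by-title", "Data mining"), check=True)
```

Passing the arguments as a list avoids shell quoting problems with titles that contain spaces or quotes.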