Deployment

This document describes how to deploy WikiXMLDB on your own computer. The WikiXMLDB demo deployed by the Sedna team on Amazon EC2 runs on a virtual machine with restricted resources. To get better performance and to customize the application without limits, you can run it on your own computer.

Overview

Deploying WikiXMLDB consists of four simple steps:

  • Get a Wikipedia snapshot in MediaWiki format
  • Parse the MediaWiki format into an XML representation
  • Load the XML representation of Wikipedia into the Sedna XML Database
  • Create indexes for efficient querying

1. Get Wikipedia snapshot

Open the link http://download.wikimedia.org/enwiki/ in your browser and navigate to the most recent directory (the directories are named by date). The file you need is pages-articles.xml.bz2, which contains the Wikipedia articles in MediaWiki format. Download this file to your file system and unpack it to obtain a regular XML file.
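
The dump is large, so it may help to unpack it as a stream instead of reading it into memory at once. The following is a minimal sketch that uses only the Python standard library; the function name unpack_bz2 and the chunk size are illustrative choices, not part of WikiXMLDB.

```python
import bz2
import shutil


def unpack_bz2(src_path, dst_path, chunk_size=1024 * 1024):
    """Stream-decompress a .bz2 file into dst_path, one chunk at a time.

    Hypothetical helper for illustration: the archive is never held
    in memory as a whole, which matters for a multi-gigabyte dump.
    """
    with bz2.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst, chunk_size)
    return dst_path
```

For example, unpack_bz2("pages-articles.xml.bz2", "pages-articles.xml") would write the unpacked XML next to the archive, assuming the downloaded file carries that name. The command-line tool bunzip2 does the same job if you have it installed.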

2. Parse Wikipedia to XML representation

The next step is to parse the MediaWiki representation of Wikipedia into a more structured XML representation. Open the download page of the Texterra project and download the latest version of the parser. The parser is written in Python, so you need Python to run it. Additional documentation on how to run the parser is available in the README file that comes with it.

3. Load Wikipedia into Sedna

To load Wikipedia into the Sedna XML database you need Sedna installed on your computer. If you do not have it yet, go to the Sedna home page and download it. Follow the Quick Start guide to install and run Sedna.

Now you are ready to load the XML representation of Wikipedia into Sedna. First, make sure that you have 250 GB of free space on the disk that holds the Sedna database files (by default, database files reside in the sedna/data directory of your Sedna installation path). To load Wikipedia into Sedna (we assume that the file is named 'wiki.xml'), perform the following steps:
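
The free-space requirement can be checked before you start. Here is a small sketch using the Python standard library; the helper name has_enough_space is hypothetical, and reading "250 GB" as decimal gigabytes is an assumption.

```python
import shutil

# 250 GB as stated in the text, interpreted as decimal gigabytes (assumption).
REQUIRED_BYTES = 250 * 10**9


def has_enough_space(path, required_bytes=REQUIRED_BYTES):
    """Return True if the file system that holds `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes
```

Pass the directory that holds your database files, for example has_enough_space("sedna/data") if your installation uses the default location relative to the current directory.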

  • Start Sedna with the command: se_gov
  • Create a database named 'wiki' with the command: se_cdb wiki
  • Run the database: se_sm wiki
  • Start the Sedna Terminal with the command: se_term wiki
  • Turn on logless mode for faster loading: \set LOG_LESS_MODE
  • Run the bulk load: LOAD 'wiki.xml' 'wiki'&
  • Restart the database after the bulk load has finished to force a checkpoint. Use the se_stop command to stop Sedna.

The bulk load takes some time. When it finishes, se_term prints the message 'Bulk load succeeded'.
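
If you repeat the load often, the steps above can be collected into a script. The sketch below rests on several assumptions. It passes the LOAD statement through the -query option of se_term instead of typing it into the interactive terminal. It leaves out logless mode, which is a terminal setting that this sketch does not try to script. It restarts Sedna by running se_gov and se_sm again after se_stop. The function names are hypothetical. By default the script only prints the commands and runs nothing.

```python
import shlex
import subprocess


def load_commands(xml_file="wiki.xml", doc_name="wiki", db="wiki"):
    """Return the command lines for the load steps, in order.

    Assumption: LOAD is issued non-interactively via `se_term -query`;
    logless mode (\\set LOG_LESS_MODE) is not included here.
    """
    load_query = "LOAD '{}' '{}'".format(xml_file, doc_name)
    return [
        ["se_gov"],                               # start Sedna
        ["se_cdb", db],                           # create the database
        ["se_sm", db],                            # run the database
        ["se_term", "-query", load_query, db],    # bulk load
        ["se_stop"],                              # stop Sedna ...
        ["se_gov"],                               # ... and start it again
        ["se_sm", db],                            # to force a checkpoint
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)
```

Calling run_all(load_commands()) prints the command sequence for review; pass dry_run=False only after you have checked that it matches your installation.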

4. Create indexes

After a successful bulk load you should create a number of indexes to speed up queries over Wikipedia. Create an empty file named 'indexes.xquery' and put the following lines in it:

(: retrieve article by title :)
declare default element namespace "http://www.mediawiki.org/xml/export-0.5/";
CREATE INDEX 'article-by-title' ON doc("wiki")/mediawiki/page BY title AS xs:string&

(: retrieve articles which refer to the given article :)
declare default element namespace "http://www.mediawiki.org/xml/export-0.5/";
CREATE INDEX 'article-by-link' ON doc("wiki")/mediawiki/page BY .//link/@label AS xs:string&

(: retrieve articles which belong to the given category :)
declare default element namespace "http://www.mediawiki.org/xml/export-0.5/";
CREATE INDEX 'article-by-cat' ON doc("wiki")/mediawiki/page BY ./revision/catlinks/catlink/@href AS xs:string

Then run the following command:

se_term -file indexes.xquery wiki

This command causes Sedna to create these three indexes.

Testing your installation

After the deployment steps are completed, you can run the following command to check that everything works correctly:

se_term -query "index-scan('article-by-title','Data mining','EQ')" wiki

This command should return the XML representation of the 'Data mining' article.
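
To look up other titles from a script, the same query can be assembled programmatically. The following is a sketch with hypothetical helper names; it assumes that se_term is on your PATH and that the database and index names are the ones used in this document. Apostrophes in a title are doubled, which is how XQuery escapes them inside a string literal delimited by apostrophes.

```python
import subprocess


def index_scan_query(index_name, key, mode="EQ"):
    """Build an index-scan(...) call, escaping apostrophes in its arguments."""
    def quote(value):
        # XQuery escapes ' inside a '...' literal by doubling it.
        return "'" + value.replace("'", "''") + "'"

    return "index-scan({},{},{})".format(quote(index_name), quote(key), quote(mode))


def lookup(key, index_name="article-by-title", db="wiki"):
    """Run the query through se_term and return its output as text."""
    result = subprocess.run(
        ["se_term", "-query", index_scan_query(index_name, key), db],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

With these helpers, lookup("Data mining") issues the same query as the command above.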