XML representation of Wikipedia articles

The key structural elements are:

  • doc("wiki") - the document which contains the Wikipedia content
  • doc("wiki")/mediawiki/page - corresponds to an article in Wikipedia
  • page/title - the title of the article
  • page/revision/text - a container for content of the article
  • page/revision/text/section - a section in the content
  • page/revision/text/section/@title - the title of the section
  • page/revision/text//a - a link from the article to another article; the target is given in the 'href' attribute
  • page/revision/text//p - a paragraph inside the content
  • page/revision/text//img - a reference to an image, specified via the 'src' attribute
  • page/revision/text/catlinks - the list of categories which the article belongs to
  • page/revision/text/catlinks/catlink/@href - a single category which the article belongs to
  • page/revision/text//template - a template which contains metadata. The type of metadata is defined in the 'head' attribute. In particular, infoboxes are represented as 'template' elements.
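
To make the paths above concrete, here is a minimal sketch that walks the same structure outside of Sedna, using Python's standard xml.etree.ElementTree module. The SAMPLE document and the summarize helper are hypothetical and exist only for illustration; the sample omits the namespace declaration and is far smaller than a real article.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document that follows the paths listed above.
SAMPLE = """
<mediawiki>
  <page>
    <title>Igor Kurchatov</title>
    <revision>
      <text>
        <section depth="1" title="Life and career">
          <p>Born in <a href="Ufa">Ufa</a>, he studied <a href="physics">physics</a>.</p>
        </section>
        <catlinks>
          <catlink href="Category:Russian physicists"/>
          <catlink href="Category:Nuclear physicists"/>
        </catlinks>
      </text>
    </revision>
  </page>
</mediawiki>
"""


def summarize(page):
    """Collect the main structural elements of one page element."""
    text = page.find("revision/text")
    return {
        # page/title
        "title": page.findtext("title"),
        # page/revision/text/section/@title
        "sections": [s.get("title") for s in text.findall("section")],
        # page/revision/text//a (links at any depth, in document order)
        "links": [a.get("href") for a in text.iter("a")],
        # page/revision/text/catlinks/catlink/@href
        "categories": [c.get("href") for c in text.findall("catlinks/catlink")],
    }


root = ET.fromstring(SAMPLE)
summary = summarize(root.find("page"))
print(summary)
```

The same navigation is expressed in Sedna with XQuery path expressions over doc("wiki"); the sketch only mirrors the element hierarchy.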

Article example in XML

<page xmlns="">
    <title>Igor Kurchatov</title>
    <revision>
       <comment>/* References */</comment>
       <text xml:space="preserve">
          <section depth="1" title="Life and career">
            <p><a href="image:kurchatov.jpg">thumb|right|Igor Kurchatov</a> <b>Igor Vasilyevich Kurchatov</b>
            (<a href="January 12">January 12</a>, <a href="1903">1903</a> &ndash; <a href="February 7">February 7</a>,
            <a href="1960">1960</a>) was a <a href="Soviet Union">Soviet</a>/<a href="Russia">Russian</a> physicist.
            He was the leader of the <a href="Soviet atomic bomb project">Soviet atomic bomb project</a>.
            Kurchatov was born in <i>Simsky zavod</i>, <a href="Ufa">Ufa</a> <a href="Guberniya">Guberniya</a>
            (now city of <i>Sim</i>, <a href="Chelyabinsk Oblast">Chelyabinsk Oblast</a>). After completing
            <a href="Simferopol gymnasium 1"> Simferopol gymnasium 1</a> he studied <a href="physics">physics</a>
            at Crimea State University and ship building at the <a href="Saint Petersburg Polytechnical University">
            Polytechnical Institute</a> in <a href="Petrograd">Petrograd</a>. In <a href="1925">1925</a> he moved to the
            <a href="Ioffe Physico-Technical Institute">Physico-Technical Institute</a>, where he worked (under <a href="Abram
            Fedorovich Ioffe">Abram Fedorovich Ioffe</a>) on various problems connected with <a href="radioactivity">
            radioactivity</a>. In <a href="1932">1932</a> he received funding for his own nuclear science research team,
            which built the Soviet Union's first <a href="cyclotron">cyclotron</a> (<a href="September 21">September 21</a>,
            <a href="1939">1939</a>).
            Igor Kurchatov and his apprentice <a href="Georgy Flyorov">Georgy Flyorov</a> discovered the basic ideas of the
            uranium chain reaction and the nuclear reactor concept in the 1930's. In 1942 Kurchatov declared: "At breaking up
            of kernels in a kilogram of uranium, the energy released must be equal to the explosion of 20,000 tons of trotyl."
            This announcement was practically verified during the atomic bombing of Hiroshima.</p>
          </section>
          <section depth="1" title="External links">
              <link href="">Kurchatov institute</link>
              <link href="">
                 Biography of Igor Kurchatov (in Russian)
              </link>
          </section>
          <catlinks>
             <catlink href="Category:Russian physicists"/>
             <catlink href="Category:Soviet physicists"/>
             <catlink href="Category:Nuclear physicists"/>
          </catlinks>
       </text>
    </revision>
</page>

Sedna Indexes

For efficient query execution, a number of predefined indexes have been created in Sedna. Each index supports a common use case in content processing.

Retrieve an article by its title


declare default element namespace "";
CREATE INDEX 'article-by-title' ON doc("wiki")/mediawiki/page BY title AS xs:string

Query example

(: Return the article whose title is 'Internet' :)
declare default element namespace "";
index-scan('article-by-title','Internet','EQ')

Retrieve articles which refer to the article with the specified title (what links here)


declare default element namespace "";
CREATE INDEX 'article-by-link' ON doc("wiki")/mediawiki/page BY .//link/@label AS xs:string

Query example

(: Return the titles of articles which have references to the 'Anarchism' article :)
declare default element namespace "";
index-scan('article-by-link','Anarchism','EQ')/title

Retrieve articles which belong to the given category


declare default element namespace "";
CREATE INDEX 'article-by-cat' ON doc("wiki")/mediawiki/page BY ./revision/catlinks/catlink/@href AS xs:string

Query example

(: Return the titles of the articles in the 'Russian mathematicians' category :)
declare default element namespace "";
declare ordering unordered;
index-scan('article-by-cat','Category:Russian mathematicians','EQ')/title
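
The category index keys each page by every one of its catlink values, so a page with several categories can be found under each of them. The following sketch illustrates that idea only; it is not how Sedna implements indexes. The names PAGES, build_index and scan_eq, as well as the 'Example page A' entry, are hypothetical. The Kurchatov categories are taken from the article example above.

```python
from collections import defaultdict

# Page title -> category links. 'Example page A' is a made-up entry.
PAGES = {
    "Igor Kurchatov": [
        "Category:Russian physicists",
        "Category:Soviet physicists",
        "Category:Nuclear physicists",
    ],
    "Example page A": [
        "Category:Russian physicists",
    ],
}


def build_index(pages):
    """Map every category value to the titles of the pages that carry it."""
    index = defaultdict(list)
    for title, categories in pages.items():
        for category in categories:
            index[category].append(title)
    return index


def scan_eq(index, key):
    """Equality lookup, loosely analogous to index-scan(..., 'EQ')."""
    return sorted(index.get(key, []))


index = build_index(PAGES)
print(scan_eq(index, "Category:Russian physicists"))
print(scan_eq(index, "Category:Nuclear physicists"))
```

A lookup costs one dictionary access instead of a scan over all pages, which is the reason for defining an index for each common access pattern.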

Full-text search index


declare default element namespace "";
CREATE FULL-TEXT INDEX 'fti' ON doc("wiki")/mediawiki/page TYPE "xml"

Query example

(: Retrieve all links from the articles which have 'sedna' in their titles :)
declare default element namespace "";
declare ordering unordered;
ftindex-scan('fti','(title contains sedna)')/revision/text//a/@href

Predefined queries

For your convenience we have also prepared a number of predefined queries. You can customize them in any way (click on [edit] to the right of the predefined queries) or write your own query from scratch.