Tags

employment
llc
xrp
redhat
ripple
interfaces
ncurses
ruby
refs
filesystems
retro gaming
raspberry pi
sinatra
3d printing
nethack
gcc
compiler
fedora
virtfs
project
gaming
vim
grep
sed
aikido
philosophy
splix
android
lvm
storage
bitcoin
projects
sig315
miq
db
polisher
meditation
hopex
conferences
omega
simulator
bundler_ext
rubygems
book review
google code in
isitfedoraruby
svn
gsoc
design patters
jsonrpc
rjr
aeolus
ohiolinuxfest
rome
europe
travel
brno
gtk
python
puppet
conference
fudcon
html5
snap
tips
ssh
linux
hardware
libvirt
virtualization
engineering expo
cloud
redmine
plugins
rpm
yum
rake
screencasting
jruby
fosscon
pidgin
gnome-shell
distros
notacon
presentation
rails
deltacloud
apache
qmf
passenger
syrlug
hackerspace
music
massive attack
backups
crypto
vnc
xsd
rxsd
x3d
mercurial
webdev
ovirt
qpid
haikus
poetry
legaleese
jquery
selenium
testing
xpath
git
sshfs
svg
ldap
autotools
pygtk
xmlrpc
slackware

Sep 14 2008 xpath

A Brief Tutorial on XPath

Since its been over a month since my last post, I feel one is long overdue. Today’s entry is about XPath, a language that allows for the traversal / retrieval of nodes in an XML document (note that at the time of writing this XPath 2.0; which adds additional type logic, operators, among other things; is now an official standard, but I will not dive into the additional features in this entry).

For this document, lets assume you are working on, and want to traverse the following HTML document (yes I realize HTML is not a strict subset of XML but for the purpose of the article, and in actual practice, XPath can be used to traverse html documents (and is often employed to do so, see my write up on integrating the Selenium web interface test suite into my current professional project oVirt))

<html>
  <head>
   <title>Hello World</title>
  </head>
  <body>
   <div id="intro" class="content">Would you like to know something about me?</div>
   <div class="content">
     <p>My favorite foods</p>
     <ul>
       <li>Pizza</li>
       <li>Burritos</li>
       <li>Pasta</li>
     </ul>
   </div>
   <br/>
 </body>
</html>

Similar to Unix-like operating systems “/” indicates root, but unlike standard filesystems “/” does not correspond to any particular node. To start with something easy, to get the title of the web page, one would simply pass the following XPath into the search method of the library they are using

  /html/head/title

which would return the “Hello World” string. We could use a similar, full path lookup to retrieve the contents of the introduction div, but rather a simpler method would be to employ the “self or descendent” directive “//” in conjunction with the attribute directive “@” to find the div with the “intro” id attribute like so

  //div[@id='intro']

This essentially tells XPath to search for a div with the “intro” id starting at the root element. Note there are many ways to accomplish the same thing, for example we would have started at the body element and ignore any divs elsewhere in the document (once again, that wouldn’t appear in standard html but bear with me for this example) like so

  /html/body//div[@id="intro"]

If multiple elements match, for example when looking up list elements that appear in any unordered list like so,

  //ul/li

A list of elements will be returned, or alternatively you can specify an index to retrieve a specified element (very important note indices are 1 based, eg index ‘0’ is invalid) like so

  //ul/li[2]

If you do not have a definitive element type in mind, you can match ‘all elements’ using the ‘*’ operator. For example to return both the paragraph and unordered list under the second div in the body, you would use the following statement

  /html/body/div[2]/*

XPath also provides many functions useful in retrieving and transforming information stored in the document (technically all the aformentioned locators / specifiers are simply shorthand abbreviations for various XPath functions), for example to retrieve the number of list elements in the document, one could use the count function

  count(//ul/li)

Or to return div containing only text data (eg not other nodes) one would

  //div/text()

which would resolve to the first / ‘intro’ div in the example.
I have barely touched upon the power of XPath in this guide, and have not even discussed additions which XPath 2.0 adds to the language, but hopefully this will give you a starting point which to expand and dive into the universe of XML traversal / lookup.