Processing geo information in Wikipedia articles
This tutorial shows how to look up and geocode places in a Wikipedia article and visualize them with the help of Processing. It gives a basic overview of the related topics and provides multiple Processing examples.
Yahoo Placemaker is a "geoparsing web service" that enriches content with geographic metadata by extracting places from unstructured texts. It finds, identifies, and disambiguates place names in textual content and returns each place with its geo-location and additional information, such as its type (e.g. country or town). Different names for the same place can be recognized (e.g. "New York" and "NYC"), as well as multilingual references (e.g. "München" and "Munich").
A gazetteer is a geographical dictionary or directory, a reference for information about places and place names.
To use Placemaker, you either need an App ID to query the Placemaker API directly, or you can go through YQL. YQL is a query language for retrieving and manipulating data from various web services, which makes it convenient for mashups. It simplifies access to diverse APIs by unifying and connecting their interfaces. YQL is influenced by SQL but diverges from it, as it provides specialized methods to query, filter, and join data across web services.
The YQL console is very helpful for testing YQL queries live. Here you can experiment with all the web services, execute queries, and directly retrieve the results.
You can enter YQL into the text field (1); after executing the query you will see the original results as XML or JSON (2). On the right side (3) you find all enabled web services that can be integrated. For many purposes the data category may be of interest; there you'll find methods to access data from the web, such as HTML pages or RSS feeds.
Using Placemaker via YQL
Let's take a look at a simple place extraction query. The YQL below uses the Placemaker API to analyze the given text.
This query returns an XML result with all recognized places. In this example, "Berlin" is the only place found in that text snippet.
The returned place consists of the geo position as well as further data elements. The woeId is a unique identifier for that place, and type indicates which kind of physical place this data object is about. A further element holds the canonical English name.
Extracting places from web pages
With YQL you can not only access various web services but also query data from different sources. The following returns the complete HTML body of a given page.
As we want to extract places from Wikipedia articles, we need to combine the content of the Wikipedia web page with Placemaker:
This combined query returns all places found, together with their geo-positions. A small excerpt of the result looks like this:
So, now that we have got access to the places mentioned in a Wikipedia article, let's use them in a Processing sketch.
Reading XML in Processing
To be able to read and use the geo-positions, we need to request the YQL and parse the resulting XML. In Processing this is very easy to do:
Simply create a new XMLElement with the URL of the web service to use. After that, parse the returned XML to read the result values of the called API method. See the paragraph on XML reading in the RSS tutorial for further information.
Call YQL and count place matches
Let's create a new XMLElement, provide a URL in the constructor, and then count and print the number of places found.
The restUrl used in the example is the URL of the REST service with a YQL query, where the parameters are URL-encoded. Simply copy it from the YQL console (from the "REST query" text field).
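As a rough illustration of what such a URL-encoded REST URL looks like, the following plain-Java sketch (Java being the language Processing builds on) assembles one from a YQL string. The endpoint address, the format parameter, and the geo.placemaker table with its documentContent and documentType keys are assumptions about the public YQL service, not values taken from this tutorial. The class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper that builds a REST URL for a YQL query.
class RestUrlBuilder {
    // Assumed public YQL endpoint; compare with the "REST query" field of the YQL console.
    static final String ENDPOINT = "http://query.yahooapis.com/v1/public/yql";

    // Assumed shape of a Placemaker query (table and key names are assumptions).
    // The text must not contain double quotes, as it is embedded in a quoted literal.
    static String placemakerQuery(String text) {
        return "select * from geo.placemaker where documentContent = \""
            + text + "\" and documentType = \"text/plain\"";
    }

    // URL-encodes the YQL statement and appends it as the q parameter.
    static String build(String yql) {
        try {
            return ENDPOINT + "?q=" + URLEncoder.encode(yql, "UTF-8") + "&format=xml";
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 should always be available", e);
        }
    }
}
```

In a sketch, the string returned by build() would play the role of restUrl and be handed to the XMLElement constructor.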
Now that we have the xmlResponse, we walk through the XML structure to the elements we want to use.
In this example we get the place of every match. The path parameter specifies which elements to return as an array. The hierarchical XPath expression results/matches/match/place selects all place elements.
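To make the path selection concrete, here is a minimal sketch in plain Java that uses the standard javax.xml XPath API instead of Processing's XMLElement. The sample XML is a heavily simplified, hypothetical stand-in for a real response: it has no namespaces, only the name element, and an assumed query root element. The class name is made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper that counts the place elements in a response.
class PlaceSelector {
    // Simplified sample response (assumed structure, not real service output).
    static final String SAMPLE_XML =
        "<query><results><matches>"
        + "<match><place><name>Berlin</name></place></match>"
        + "<match><place><name>Boston</name></place></match>"
        + "</matches></results></query>";

    static int countPlaces(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            // Same hierarchy as results/matches/match/place, anchored at the root element.
            NodeList places = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("/query/results/matches/match/place", doc, XPathConstants.NODESET);
            return places.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
    }
}
```

A real response may declare XML namespaces, in which case the XPath expression would need a namespace context. The sample avoids them to keep the path readable.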
Read name and geo position
We can iterate over all the places and access their names and geo-positions. See the section on the Place XML structure for a more detailed field description.
The elements are accessed via the getChild(index) method in the following example. Note that the latitude and longitude values need to be converted to float before they can be used. This example prints the names and positions of all places in the XML.
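The same steps (iterate over the places, read the name, convert latitude and longitude to float) can be sketched in plain Java with the standard DOM API, which looks elements up by tag name rather than by child index. The sample XML and its element names (name, centroid, latitude, longitude) are assumptions about the response layout, the coordinates are approximate, and PlaceReader is a hypothetical name.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical reader that extracts name and position of every place.
class PlaceReader {
    // Simplified sample response (assumed structure, approximate coordinates).
    static final String SAMPLE_XML =
        "<query><results><matches>"
        + "<match><place><name>Berlin</name><centroid>"
        + "<latitude>52.52</latitude><longitude>13.40</longitude></centroid></place></match>"
        + "<match><place><name>Boston</name><centroid>"
        + "<latitude>42.36</latitude><longitude>-71.06</longitude></centroid></place></match>"
        + "</matches></results></query>";

    static class Place {
        String name;
        float lat;
        float lon;
    }

    static List<Place> read(String xml) {
        List<Place> result = new ArrayList<>();
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList places = doc.getElementsByTagName("place");
            for (int i = 0; i < places.getLength(); i++) {
                Element element = (Element) places.item(i);
                Place place = new Place();
                place.name = text(element, "name");
                // The XML holds text, so the coordinates are parsed to float here.
                place.lat = Float.parseFloat(text(element, "latitude"));
                place.lon = Float.parseFloat(text(element, "longitude"));
                result.add(place);
            }
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the XML", e);
        }
        return result;
    }

    // Returns the trimmed text of the first descendant with the given tag name.
    static String text(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent().trim();
    }
}
```

Printing the result is then a simple loop over read(...), for example System.out.println(p.name + ": " + p.lat + ", " + p.lon) for each place p.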
To draw the geo-positions onto the Processing canvas, they have to be projected onto a planar surface. There are various map projections, each with its own uses, advantages, and disadvantages. You can find a neat Projection reference with an overview of map projections and their applications over at Radical Cartography.
Different map projections: Mercator, Goode Homolosine, Gall-Peters (left to right)
Draw positions in Processing
The equirectangular projection is a very simple map projection, used mostly for thematic maps and geovisualizations. The geo-coordinates can be mapped directly onto the Cartesian coordinate system: longitude scales linearly to x, and latitude scales linearly to y.
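The linear mapping can be written down in a few lines. The sketch below is plain Java with hypothetical class and method names; it assumes a canvas whose left edge is longitude -180, whose right edge is longitude 180, and whose top edge is latitude 90.

```java
// Hypothetical helper for an equirectangular projection onto a canvas.
class Equirectangular {
    // Longitude -180..180 maps linearly to x in 0..width.
    static float lonToX(float lon, float width) {
        return (lon + 180f) / 360f * width;
    }

    // Latitude 90..-90 maps linearly to y in 0..height (screen y grows downward).
    static float latToY(float lat, float height) {
        return (90f - lat) / 180f * height;
    }
}
```

Inside a Processing sketch the same mapping can be expressed with the built-in map() function, for instance map(lon, -180, 180, 0, width) and map(lat, 90, -90, 0, height).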
Now you can visually represent every place by using whatever Processing drawing methods you like.
Background world map
Displaying the positions solely on a blank canvas may not be sufficient to understand the geospatial relations. One of the simpler approaches is to display a map in the background. In Processing just load a map image, and draw it before drawing the place markers.
Showing all places mentioned in the Wikipedia article on Walter Gropius.
(You can find the image used above at Wikimedia Commons: Equirectangular-projection.jpg)
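The drawing order (map first, markers second) can be sketched outside Processing as well. The following plain-Java example uses BufferedImage and Graphics2D as stand-ins for the Processing canvas; in a sketch one would typically use loadImage() and image() for the map and ellipse() for the markers. The class name, the marker color, and the marker size are arbitrary choices for illustration, and the map is assumed to be an equirectangular image covering the whole world.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Hypothetical renderer that draws place markers on top of a world map image.
class MapRenderer {
    // latLons holds one {latitude, longitude} pair per place.
    static BufferedImage render(BufferedImage map, float[][] latLons) {
        BufferedImage canvas = new BufferedImage(
            map.getWidth(), map.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = canvas.createGraphics();

        // 1. Draw the background map first, so the markers end up on top of it.
        g.drawImage(map, 0, 0, null);

        // 2. Draw one marker per place, using the equirectangular mapping.
        g.setColor(Color.RED);
        for (float[] p : latLons) {
            int x = Math.round((p[1] + 180f) / 360f * canvas.getWidth());
            int y = Math.round((90f - p[0]) / 180f * canvas.getHeight());
            g.fillOval(x - 3, y - 3, 7, 7);
        }
        g.dispose();
        return canvas;
    }
}
```

Swapping the two steps would paint the map over the markers and hide them, which is why the map has to be drawn first.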
Keep in mind, though, that the visual markers in the above example do not incorporate the place type, i.e. the marker for Australia and the one for Boston are the same. More sophisticated visualizations should take this into account.
Find below a small class for the place markers. It could be used as a basis for your own geo visualizations.
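The following is one possible shape for such a class, written as a hedged sketch in plain Java rather than as the tutorial's original code. All names are hypothetical, the type labels "Country" and "Town" are assumed spellings of the place types mentioned earlier, and the diameters are arbitrary. It also hints at how the place type could influence the marker, as suggested above.

```java
// Hypothetical marker class holding one place and its screen representation.
class PlaceMarker {
    final String name;
    final String type; // assumed labels such as "Country" or "Town"
    final float lat;
    final float lon;

    PlaceMarker(String name, String type, float lat, float lon) {
        this.name = name;
        this.type = type;
        this.lat = lat;
        this.lon = lon;
    }

    // Equirectangular mapping of the longitude to a canvas x position.
    float x(float width) {
        return (lon + 180f) / 360f * width;
    }

    // Equirectangular mapping of the latitude to a canvas y position.
    float y(float height) {
        return (90f - lat) / 180f * height;
    }

    // Larger markers for countries than for other places (arbitrary sizes).
    float diameter() {
        return "Country".equals(type) ? 20f : 6f;
    }
}
```

In a Processing sketch, a marker m could then be drawn with ellipse(m.x(width), m.y(height), m.diameter(), m.diameter()).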