Scrapers and Irregularities

The Semantic Web, which still has a long way to go, is hungry for structured data from web pages. But many pages show minor irregularities because they were crafted manually or semi-automatically. And when they are perfectly regular, they probably originate directly from databases and could thus be tapped more easily without the detour through human-readable HTML. So I am always keen to learn about scrapers that can extract structured XML data from “almost” structured HTML.

(My secondary motive is that many such almost-structured HTML pages contain lists and hierarchical collections of terms that lend themselves to being copied into visualizer applications, where they can be rearranged and amended with cross-referencing links. I think that with a few heuristics, many such lists could be made available for copying and pasting not just as text but as structures.)

A new scraper service is feed43 (via OLDaily). This service has a great user interface and is easy to understand: basically, it extracts variables, denoted by {%}, from between the surrounding constant “bearing text” and the skipped text denoted by the {*} macro.

However, it turns out not to be very robust when the small irregularities mentioned above occur. So I sent the following feedback (which has not yet been answered).

I think it is not optimal that bearing strings such as “<li>” are allowed to match inside the skip macro. This way, slight irregularities spoil the item structure. If bearing strings could not be matched there, only the offending items would be skipped.

Consider the following input:

<html>
<ul> 
<li>Firstname = John, Initials = F., Lastname = Kennedy</li> 
<li>Firstname = Lyndon, Initials = B., Lastname = Johnson</li> 
<li>Firstname = Richard, Initials = M., Lastname = Nixon</li> 
<li>Firstname = Gerald, Initials = R., Lastname = Ford</li> 
<li>Firstname = Jimmy, Lastname = Carter</li> 
<li>Firstname = George, Initials = H. W., Lastname = Bush</li> 
</ul>

and this repeatable pattern:

<li>Firstname = {%},{*}, Lastname = {%}</li>

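and you will get “Jimmy Bush” as a result, in addition to Carter being skipped: the Carter item has no comma before “Lastname”, so {*} runs on across the next “<li>” until it finds “, Lastname = ” in the Bush item.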
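
The effect can be reproduced outside feed43 with a rough sketch in Python. It rests on an assumption about how the macros behave, not on knowledge of feed43's actual implementation: {%} is read as a non-greedy capturing group and {*} as a non-greedy skip. The helper pattern_to_regex and its bearing parameter are hypothetical names made up for this illustration. When bearing is given, neither macro is allowed to run across that string, which is the behaviour suggested in the feedback above.

```python
import re
from typing import Optional

HTML = """<html>
<ul>
<li>Firstname = John, Initials = F., Lastname = Kennedy</li>
<li>Firstname = Lyndon, Initials = B., Lastname = Johnson</li>
<li>Firstname = Richard, Initials = M., Lastname = Nixon</li>
<li>Firstname = Gerald, Initials = R., Lastname = Ford</li>
<li>Firstname = Jimmy, Lastname = Carter</li>
<li>Firstname = George, Initials = H. W., Lastname = Bush</li>
</ul>
"""

PATTERN = "<li>Firstname = {%},{*}, Lastname = {%}</li>"


def pattern_to_regex(pattern: str, bearing: Optional[str] = None) -> "re.Pattern[str]":
    """Translate a {%}/{*} pattern into a regex (assumed semantics, see lead-in).

    If `bearing` is given, neither macro may match across that string.
    """
    if bearing is None:
        gap = ".*?"  # anything, as little as possible
    else:
        # anything, as little as possible, but never stepping onto `bearing`
        gap = "(?:(?!" + re.escape(bearing) + ").)*?"

    parts = re.split(r"(\{%\}|\{\*\})", pattern)
    pieces = []
    for part in parts:
        if part == "{%}":
            pieces.append("(" + gap + ")")   # extracted variable
        elif part == "{*}":
            pieces.append(gap)               # skipped text
        else:
            pieces.append(re.escape(part))   # constant bearing text
    return re.compile("".join(pieces), re.DOTALL)


# Unrestricted skip: the Carter item bleeds into the Bush item.
naive = pattern_to_regex(PATTERN).findall(HTML)
print(naive[-1])   # ('Jimmy', 'Bush')

# Skip may not cross "<li>": only the irregular item is dropped.
strict = pattern_to_regex(PATTERN, bearing="<li>").findall(HTML)
print(strict[-1])  # ('George', 'Bush')
```

With the restriction in place, the Carter item simply fails to match and the scan resumes at the next “<li>”, so the remaining five items come out intact.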
