Scraping 101: Extracting Anchor Text with Regexp

There are many ways to skin a cat, but when it comes to scraping websites, I like parsing content with regexp. One of the biggest problems I’ve bumped into when parsing HTML is pairing each opening tag with its own closing tag.

For example, take this pattern for pulling the anchor text out of a link:

(<a [^>]+>)(.*)</a>

Ok let’s try that in English:

  1. (<a [^>]+>) matches the opening tag, e.g. <a href=”….”>.
  2. (.*) *should* match anchor text (I’ll elaborate on that).
  3. </a> matches the closing A tag.

<a href="http://www.searchengineland.com" rel="notpaid">search engine land</a>

will correctly extract the anchor text “search engine land.” BUT because (.*) is greedy, meaning it grabs as much text as it can and stops only at the last </a> on the line,

<a href="http://www.searchengineland.com" rel="notpaid">search engine land</a> is cool because vanessa fox posts there.</a>

will incorrectly extract:

search engine land</a> is cool because vanessa fox posts there.

as anchor text. Hmm..
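
To see the greedy behavior for yourself, here is a minimal sketch in Python. Python's re module is my pick for illustration (the pattern itself isn't tied to any one language), and the names GREEDY, simple, tricky, and anchor_text_greedy are made up for this example.

```python
import re

# The greedy pattern from above: (.*) grabs as much text as it can.
GREEDY = re.compile(r'(<a [^>]+>)(.*)</a>')

# Sample inputs from the post, written with straight quotes.
simple = '<a href="http://www.searchengineland.com" rel="notpaid">search engine land</a>'
tricky = simple + ' is cool because vanessa fox posts there.</a>'


def anchor_text_greedy(html):
    """Return group 2 (the supposed anchor text) of the first match, or None."""
    match = GREEDY.search(html)
    return match.group(2) if match else None


print(anchor_text_greedy(simple))  # search engine land
print(anchor_text_greedy(tricky))  # search engine land</a> is cool because vanessa fox posts there.
```

The second call runs past the first closing tag because (.*) only gives back characters until the final </a> can match.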

So how do you fix this? Instead of .*, use the non-greedy .*?, which stops at the first </a> it finds. The other quantifiers have non-greedy forms too: +?, ??, and {m,n}? (I haven’t tested those three, but I assume they work).

(<a [^>]+>)(.*?)</a> will extract anchor text from web pages.
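
Here is a rough sketch of the non-greedy version in Python (again, the language is my choice for illustration). The names LAZY, html, and anchor_texts are hypothetical, and the example.com link is a made-up second link added to show that each match stops at its own closing tag.

```python
import re

# Non-greedy pattern: (.*?) stops at the first </a> that lets the match succeed.
LAZY = re.compile(r'(<a [^>]+>)(.*?)</a>')

html = (
    '<a href="http://www.searchengineland.com" rel="notpaid">search engine land</a>'
    ' is cool because vanessa fox posts there.</a> '
    '<a href="http://example.com/">another link</a>'
)


def anchor_texts(page):
    """Return the anchor text (group 2) of every link the pattern finds."""
    return [match.group(2) for match in LAZY.finditer(page)]


print(anchor_texts(html))  # ['search engine land', 'another link']
```

Two caveats with this sketch: by default the dot does not match newlines, so anchor text that spans several lines is skipped unless you compile with re.DOTALL, and uppercase <A ...> tags are missed unless you add re.IGNORECASE.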

3 Responses to “Scraping 101: Extracting Anchor Text with Regexp”

  1. Interesting post but how do you propose we capture all those HTML pages to see who is linking to whom with what anchor text?

  2. “how do you propose we capture all those HTML pages to see who is linking to whom with what anchor text?”

    1. Using the Yahoo API, pull backlinks to the home page only.
    2. Pull sitewide backlinks.
    3. Run a site: command, and pull the first 1,000 URLs.
    4. For each URL, pull backlinks to that URL.
    5. Hamlet Batista also suggested using keywords to pull even more backlinks.
    6. Rerun, filtering out multiple URLs from the same domain. (Not filtering is useful for finding sites that link sitewide; filtering is useful for discovering a greater number of domains.)
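
The same-domain filtering in the last step could look roughly like the sketch below. It assumes you already have the backlink URLs in a list (the API calls themselves are not shown), and it treats the hostname minus a leading "www." as the "domain", which is a simplification. The function name and sample URLs are made up.

```python
from urllib.parse import urlsplit


def one_url_per_domain(urls):
    """Keep only the first URL seen for each host (leading 'www.' ignored)."""
    seen = set()
    kept = []
    for url in urls:
        host = (urlsplit(url).hostname or '').lower()
        if host.startswith('www.'):
            host = host[4:]
        if host and host not in seen:
            seen.add(host)
            kept.append(url)
    return kept


# Hypothetical backlink list for illustration.
backlinks = [
    'http://www.example.com/a',
    'http://example.com/b',
    'http://blog.example.org/post',
    'http://blog.example.org/other',
]
print(one_url_per_domain(backlinks))
# ['http://www.example.com/a', 'http://blog.example.org/post']
```

Skipping this filter and counting URLs per host instead would surface the sites that link sitewide.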

    Obviously, even with all those API runs, this method will only dig up a subset of a site’s backlinks, and you’d have to dig through the result set to weed out noise (links from MySpace search pages, nofollowed blog comment links, links from scraper sites, etc.).

    If a single page links multiple times to a domain with different anchor text, that also brings up a problem most tools out there don’t deal with.
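
One possible way to handle that case is to collect every anchor text per linked host instead of keeping only the first one. The sketch below is an assumption-heavy illustration: the LINK pattern is my own extension of the non-greedy regexp from the post (it also captures a double-quoted href), and the function name and sample page are made up.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit

# Extension of the post's pattern: group 1 is the href, group 2 the anchor text.
LINK = re.compile(r'<a [^>]*href="([^"]*)"[^>]*>(.*?)</a>')


def anchors_by_host(page):
    """Map each linked hostname to the list of anchor texts pointing at it."""
    grouped = defaultdict(list)
    for href, text in LINK.findall(page):
        host = urlsplit(href).hostname
        if host:  # skip relative links, which have no hostname
            grouped[host].append(text)
    return dict(grouped)


# Hypothetical page that links twice to the same host with different text.
page = (
    '<a href="http://www.searchengineland.com/">search engine land</a> and '
    '<a href="http://www.searchengineland.com/about">SEL</a>'
)
print(anchors_by_host(page))
# {'www.searchengineland.com': ['search engine land', 'SEL']}
```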

  3. […] Scraping 101: Extracting Anchor Text with Regexp, Half’s SEO Notebook […]
