Scraping 101: Extracting Anchor Text with Regexp
There are many ways to skin a cat, but when it comes to scraping websites, I like parsing content with regexp. One of the biggest problems I bumped into when parsing HTML is matching opening and closing tags.
For example:
(<a [^>]+>)(.*)</a>
Ok let’s try that in English:
- (<a [^>]+>) matches the opening A tag, e.g. <a href="…">.
- (.*) *should* match anchor text (I’ll elaborate on that).
- </a> matches the closing A tag.
Run against
<a href="http://www.searchengineland.com" rel="notpaid">search engine land</a>
the pattern will correctly extract the anchor text “search engine land.” BUT because (.*) is greedy, it keeps matching up to the last </a> it can find, so
<a href="http://www.searchengineland.com" rel="notpaid">search engine land</a> is cool because vanessa fox posts there.</a>
will incorrectly extract:
search engine land</a> is cool because vanessa fox posts there.
as anchor text. Hmm..
So how do you fix this? Instead of .*, use the non-greedy .*?, which stops at the first </a> it reaches. The other quantifiers have non-greedy forms too: +?, ??, and {m,n}? (I haven’t tested the last three, but I assume they work the same way).
(<a [^>]+>)(.*?)</a> will extract anchor text from web pages.
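To see the difference side by side, here is a minimal sketch using Python's standard re module. Python is my choice for illustration (the post doesn't name a language), and the variable names are made up; the sample markup is the one from the example above.

```python
import re

# Sample markup from the example above, including the stray trailing </a>.
html = ('<a href="http://www.searchengineland.com" rel="notpaid">'
        'search engine land</a> is cool because vanessa fox posts there.</a>')

greedy = re.compile(r'(<a [^>]+>)(.*)</a>')   # (.*) runs to the LAST </a>
lazy = re.compile(r'(<a [^>]+>)(.*?)</a>')    # (.*?) stops at the FIRST </a>

# Group 2 is the captured anchor text in both patterns.
greedy_text = greedy.search(html).group(2)
lazy_text = lazy.search(html).group(2)

print(greedy_text)  # search engine land</a> is cool because vanessa fox posts there.
print(lazy_text)    # search engine land
```

One caveat: by default . does not match a newline, so anchor text that wraps across lines needs the re.DOTALL flag, and uppercase <A> tags need re.IGNORECASE.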

Interesting post but how do you propose we capture all those HTML pages to see who is linking to whom with what anchor text?
Michael Martinez said this on February 11th, 2008 at 4:18 pm
“how do you propose we capture all those HTML pages to see who is linking to whom with what anchor text?”
1. Using the Yahoo API, pull backlinks to the home page only.
2. Pull sitewide backlinks.
3. Run a site: command, and pull the first 1,000 URLs.
4. For each URL, pull backlinks to that URL.
5. Hamlet Batista also suggested using keywords to pull even more backlinks.
6. Rerun, filtering out multiple URLs from the same domain. (Not filtering is useful for finding sites that link sitewide; filtering is useful for discovering a greater number of domains.)
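The filtering in the final rerun step could look roughly like this in Python. This is only a sketch: one_per_domain is a hypothetical helper, the URLs are placeholders, and it compares hostnames (minus a leading "www.") rather than true registered domains, so blog.example.org and example.org would count as two different sites.

```python
from urllib.parse import urlparse

def one_per_domain(urls):
    """Keep the first URL seen for each hostname, preserving input order."""
    seen = set()
    kept = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        # Treat www.example.com and example.com as the same site.
        if host.startswith('www.'):
            host = host[4:]
        if host not in seen:
            seen.add(host)
            kept.append(url)
    return kept

# Placeholder backlink URLs, standing in for whatever the API runs return.
backlinks = [
    'http://www.example.com/page-1',
    'http://example.com/page-2',
    'http://blog.example.org/post',
]
filtered = one_per_domain(backlinks)
print(filtered)
```

Skipping the call and keeping the full list gives the unfiltered view, which is the one that exposes sitewide links.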
Obviously, even with all those API runs, this method will only dig up a subset of a site’s backlinks, and you’d have to dig through the result set to weed out noise (links from MySpace search pages, nofollowed blog comment links, links from scraper sites, etc.).
If a single page links multiple times to a domain with different anchor text, that also brings up a problem most tools out there don’t deal with.
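One way to at least keep every anchor text instead of only the first is to collect all matches on the page and group them by the linked host. A rough Python sketch, assuming double-quoted absolute href values; anchors_by_domain and the sample page are hypothetical:

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Captures the href value and the (non-greedy) anchor text of each link.
LINK = re.compile(r'<a [^>]*href="([^"]+)"[^>]*>(.*?)</a>',
                  re.IGNORECASE | re.DOTALL)

def anchors_by_domain(html):
    """Map each linked hostname to the list of anchor texts pointing at it."""
    grouped = defaultdict(list)
    for href, text in LINK.findall(html):
        host = urlparse(href).netloc.lower()
        grouped[host].append(text.strip())
    return dict(grouped)

# Made-up page that links twice to the same host with different anchor text.
page = ('<p><a href="http://www.example.com/">Example</a> and '
        '<a href="http://www.example.com/about" rel="nofollow">about Example</a></p>')
result = anchors_by_domain(page)
print(result)  # {'www.example.com': ['Example', 'about Example']}
```

Relative links end up under an empty hostname here, so a real run would need to resolve them against the page URL first.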
Halfdeck said this on February 11th, 2008 at 6:25 pm