Scraping 101: Extracting Anchor Text with Regexp

There are many ways to skin a cat, but when it comes to scraping websites, I like parsing content with regexp. One of the biggest problems I bumped into when parsing HTML is matching opening and closing tags.

For example:

(<a[^>]+>)(.*)</a>

Ok let’s try that in English:

  1. (<a[^>]+>) matches the opening A tag.
  2. (.*) *should* match anchor text (I’ll elaborate on that).
  3. </a> matches the closing A tag.

<a href="...">search engine land</a>

will correctly extract the anchor text “search engine land.” BUT because (.*) is greedy,

<a href="...">search engine land</a> is cool because <a href="...">vanessa fox</a> posts there.

will incorrectly extract:

search engine land</a> is cool because <a href="...">vanessa fox

as anchor text. Hmm..
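
To see the greedy behavior for yourself, here is a minimal sketch in Python (my pick for illustration, any regex flavor with greedy quantifiers should behave the same way). It assumes the pattern is (<a[^>]+>)(.*)</a>, and the href="#" values are placeholders, not real URLs.

```python
import re

# Greedy pattern: group 1 is the opening tag, group 2 the anchor text.
greedy = re.compile(r'(<a[^>]+>)(.*)</a>')

# Placeholder markup; the "#" hrefs are made up for the example.
one_link = '<a href="#">search engine land</a>'
two_links = ('<a href="#">search engine land</a> is cool because '
             '<a href="#">vanessa fox</a> posts there.')

single = greedy.search(one_link).group(2)
double = greedy.search(two_links).group(2)

print(single)  # search engine land
print(double)  # search engine land</a> is cool because <a href="#">vanessa fox
```

With two links on the line, (.*) runs all the way to the last closing tag and swallows everything in between.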

So how do you fix this? Instead of using .*, use .*? or another non-greedy quantifier such as +?, ??, or {m,n}? (I haven’t tested the last three, but I assume they work).
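
A quick way to check the non-greedy quantifiers is to run each one next to its greedy twin. This is only a sketch using Python's re module; the sample strings are made up, and other regex engines may differ in what they support.

```python
import re

text = "<b>one</b><b>two</b>"

# * versus *? : the lazy version stops at the first closing tag.
star_greedy = re.search(r"<b>.*</b>", text).group(0)
star_lazy = re.search(r"<b>.*?</b>", text).group(0)

# +? : like *?, but requires at least one character.
plus_lazy = re.search(r"<b>.+?</b>", text).group(0)

# ?? : lazy "optional", prefers matching zero times.
opt_greedy = re.search(r"ab?", "abb").group(0)
opt_lazy = re.search(r"ab??", "abb").group(0)

# {m,n}? : lazy counted repeat, prefers the minimum count m.
range_greedy = re.search(r"a{1,3}", "aaa").group(0)
range_lazy = re.search(r"a{1,3}?", "aaa").group(0)

print(star_greedy)   # <b>one</b><b>two</b>
print(star_lazy)     # <b>one</b>
print(plus_lazy)     # <b>one</b>
print(opt_greedy, opt_lazy)      # ab a
print(range_greedy, range_lazy)  # aaa a
```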

(<a[^>]+>)(.*?)</a> will extract anchor text from web pages.
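
Here is a rough sketch of the non-greedy version pulling every anchor text out of a chunk of HTML, again in Python. The pattern is assumed to be (<a[^>]+>)(.*?)</a>. The helper name anchor_texts and the two flags (IGNORECASE for uppercase tags, DOTALL so anchor text can span lines) are my own additions for the example.

```python
import re

# Non-greedy pattern: group 1 is the opening tag, group 2 the anchor text.
# The flags are illustrative extras, not part of the pattern itself.
ANCHOR_RE = re.compile(r'(<a[^>]+>)(.*?)</a>', re.IGNORECASE | re.DOTALL)

def anchor_texts(html):
    """Return the text between each opening and closing A tag (hypothetical helper)."""
    # findall returns (opening_tag, anchor_text) tuples because there are two groups.
    return [text for _tag, text in ANCHOR_RE.findall(html)]

# Placeholder markup; the "#" hrefs are made up.
sample = ('<a href="#">search engine land</a> is cool because '
          '<a href="#">vanessa fox</a> posts there.')

print(anchor_texts(sample))  # ['search engine land', 'vanessa fox']
```

One caveat: <a[^>]+> will also match other tags that start with "a", such as <abbr title="...">, so on messy pages you may want a tighter pattern like <a\s[^>]*>.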
