Scraping 101: Extracting Anchor Text with Regexp
There are many ways to skin a cat, but when it comes to scraping websites, I like parsing content with regexp. One of the biggest problems I bumped into when parsing HTML is matching opening and closing tags.
For example:
(<a[^>]*>)(.*)</a>
OK, let’s try that in English:
- (<a[^>]*>) matches the opening A tag, e.g. <a href="...">.
- (.*) *should* match the anchor text (I’ll elaborate on that).
- </a> matches the closing A tag.
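<a href="...">search engine land</a>
will correctly extract the anchor text “search engine land.” BUT because (.*) is greedy, it grabs as much as it can and only stops at the last closing A tag on the line, so
<a href="...">search engine land</a> is cool because <a href="...">vanessa fox posts there.</a>
will incorrectly extract:
search engine land</a> is cool because <a href="...">vanessa fox posts there.
as anchor text. Hmm..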
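To see the greedy behavior in action, here is a minimal Python sketch. It assumes the pattern is (<a[^>]*>)(.*)</a>; the sample markup, the example.com URLs, and the helper name greedy_anchor_text are made up for illustration.

```python
import re

# Assumed form of the greedy pattern discussed above.
GREEDY = re.compile(r'(<a[^>]*>)(.*)</a>')

# Placeholder markup; the example.com URLs are invented for this sketch.
one_link = '<a href="https://example.com/">search engine land</a>'
two_links = (
    '<a href="https://example.com/">search engine land</a> is cool because '
    '<a href="https://example.com/vf">vanessa fox posts there.</a>'
)


def greedy_anchor_text(html):
    """Return group 2 of the first match, or None if nothing matches."""
    match = GREEDY.search(html)
    return match.group(2) if match else None


# One link: group 2 is just the anchor text.
print(greedy_anchor_text(one_link))

# Two links: group 2 runs from the first opening tag to the LAST </a>,
# so it swallows the markup in between.
print(greedy_anchor_text(two_links))
```

The first call prints the clean anchor text. The second prints everything up to the final closing tag, including the stray </a> and the second opening tag.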
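So how do you fix this? Instead of the greedy .*, use the non-greedy .*?, which stops at the first closing A tag rather than the last. The other quantifiers have non-greedy forms too: +?, ??, and {m,n}? (I haven’t tested those three, I assume they work the same way).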
(<a[^>]*>)(.*?)</a> will extract anchor text from web pages.
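As a rough sketch of how the non-greedy version could be used in Python, assuming the pattern is (<a[^>]*>)(.*?)</a>: the function name extract_anchor_texts, the sample page, and the example.com URLs are made up, and the IGNORECASE and DOTALL flags are my own additions so that uppercase tags and anchor text spanning several lines still match.

```python
import re

# Assumed non-greedy pattern. The flags are additions for this sketch:
# IGNORECASE also matches <A HREF=...>, DOTALL lets . cross line breaks.
LAZY = re.compile(r'(<a[^>]*>)(.*?)</a>', re.IGNORECASE | re.DOTALL)


def extract_anchor_texts(html):
    """Return the anchor text (group 2) of every match in the page."""
    return [match.group(2) for match in LAZY.finditer(html)]


# Placeholder markup; the example.com URLs are invented for this sketch.
page = (
    '<a href="https://example.com/">search engine land</a> is cool because '
    '<a href="https://example.com/vf">vanessa fox posts there.</a>'
)

print(extract_anchor_texts(page))
# ['search engine land', 'vanessa fox posts there.']
```

Because .*? stops at the first </a> it meets, each link is matched on its own. Any tags nested inside a link (bold, images, and so on) still come back as part of the anchor text, so you may need a second pass to strip them.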