Comments on: Scraping 101: Extracting Anchor Text with Regexp

By: Don

Don — Mon, 15 Sep 2008 14:10:04 +0000

The TreeBuilder library is pretty mature code and was written on the assumption that the HTML will be bad, so it works quite nicely: it should handle things like nested anchors fine and doesn’t break on missing closing tags. Perl coders are usually really good at parsing, since that’s the language’s strongest feature. I doubt it is perfect, but it seems to handle most HTML docs OK. I just got a Y! API key, looped through the 1000 results, and pulled the anchor text and image alt text, along with the tags’ attributes, off all the sites.
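A quick way to check this on your own markup is to feed TreeBuilder a deliberately sloppy fragment and print what it recovers. The snippet below is only a sketch: `anchor_texts` is a hypothetical helper name, the HTML fragment is made up for illustration, and the text recovered for the unclosed link depends on where TreeBuilder decides to close it.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: return [href, text] pairs for every link in $html.
sub anchor_texts {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    # as_text concatenates every text node below the element, so inline
    # tags nested inside the link do not hide the anchor text.
    my @pairs = map  { [ $_->attr('href'), $_->as_text ] }
                grep { defined $_->attr('href') }
                $tree->look_down('_tag' => 'a');
    $tree->delete;    # release the tree once we have plain strings
    return @pairs;
}

# Made-up fragment: inline tags nested inside a link, followed by a
# link whose closing </a> is missing.
my $html = <<'HTML';
<p><a href="/one"><b>Bold</b> and <i>italic</i></a></p>
<p><a href="/two">never closed</p>
HTML

printf "%s => %s\n", @$_ for anchor_texts($html);
```

If the second line of output looks wrong for your pages, that is the kind of case worth testing before trusting the parser on a large crawl.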

By: Halfdeck

Halfdeck — Mon, 15 Sep 2008 10:45:43 +0000

Interesting, Don. One issue I see with your code, though, is that it probably relies on valid HTML to work. Also, if you have markup nested inside an a href, the anchor text may not get parsed correctly. But thanks for posting the code. Though I implement scrapers in Java, if I have some free time I’ll definitely check it out.

By: Don

Don — Fri, 12 Sep 2008 19:30:32 +0000

Perl has some nice modules available for fetching and parsing HTML documents. If you like regexp and haven’t learned Perl yet, you should definitely check it out. It has built-in regexp support, which makes parsing things like HTML docs quick. Here’s some basic code that extracts nofollow links and their corresponding anchor text, or alt text if it’s an image. Note that the code is messy :)

use strict;
use warnings;
use HTML::TreeBuilder;
use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new;
$ua->agent("Mozilla/5.0");
$ua->timeout(3000000);    # value is in seconds

my $req = HTTP::Request->new(GET => "http://url.com");
my $res = $ua->request($req);

if ($res->is_success) {
    # Build a tree from the page, however messy the markup is
    my $tree = HTML::TreeBuilder->new_from_content($res->decoded_content);

    # Walk every <a> element in the document
    for my $link ($tree->look_down('_tag' => 'a')) {
        next unless $link->attr('href');

        # Only report links whose rel attribute contains "nofollow"
        my $rel = $link->attr('rel');
        next unless $rel && $rel =~ /nofollow/i;

        print "Nofollow-> " . $link->attr('href') . "\n";

        if ($link->as_text) {
            print "This is a text link\n";
            print "Anchor text-> " . $link->as_text . "\n";
        }

        # An image inside the link contributes its alt text instead
        my $image = $link->look_down('_tag' => 'img');
        if ($image && $image->attr('alt')) {
            print "This is an image link\n";
            print "Alt text-> " . $image->attr('alt') . "\n";
        }
    }

    $tree->delete;    # free the parse tree
}
else {
    warn "Request failed: " . $res->status_line . "\n";
}

By: Halfdeck

Halfdeck — Mon, 11 Feb 2008 23:25:38 +0000

“how do you propose we capture all those HTML pages to see who is linking to whom with what anchor text?”

1. Using the Yahoo API, pull backlinks to the home page only.
2. Pull sitewide backlinks.
3. Run a site: command, and pull the first 1,000 URLs.
4. For each URL, pull backlinks to that URL.
5. Hamlet Batista also suggested using keywords to pull even more backlinks.
6. Rerun, filtering out multiple URLs from the same domain. (Leaving them unfiltered is useful for finding sites that link sitewide; filtering is useful for discovering a greater number of domains.)
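
The per-domain filter in the last step can be sketched in a few lines of Perl. This is only an illustration: `host_of` and `one_per_host` are hypothetical helper names, the URLs are made up, and the host is taken with a simple regexp rather than a full URL parser, so any subdomain other than a leading www. counts as a separate domain.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the host out of an http(s) URL and
# normalise it (lowercased, leading "www." dropped).
sub host_of {
    my ($url) = @_;
    return unless $url =~ m{^https?://([^/:?#]+)}i;
    my $host = lc $1;
    $host =~ s/^www\.//;
    return $host;
}

# Keep only the first URL seen for each host, preserving input order.
sub one_per_host {
    my @urls = @_;
    my %seen;
    return grep {
        my $host = host_of($_);
        defined $host && !$seen{$host}++;
    } @urls;
}

# Made-up backlink URLs, standing in for the API results.
my @backlinks = (
    'http://www.example.com/page1',
    'http://example.com/page2',
    'http://blog.example.org/post',
    'https://Example.ORG/about',
);

print "$_\n" for one_per_host(@backlinks);
```

With the sample list, the second URL is dropped as a repeat of example.com and the other three are kept. Skipping the `one_per_host` call gives the unfiltered view, which is the one to use when hunting for sitewide links.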

Obviously, even with all those API runs, this method will only dig up a subset of a site’s backlinks, and you’d have to dig through the result set to weed out noise (links from MySpace search pages, nofollowed blog comment links, links from scraper sites, etc.).

If a single page links multiple times to a domain with different anchor text, that also brings up a problem most tools out there don’t deal with.
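
One way a tool could cope with that case is to group anchor text by linking page and target domain instead of keeping a single anchor per page. The sketch below assumes the scraper already yields records of source page, target URL and anchor text; the record layout, the sample data and the `anchors_by_page` name are all made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: group anchor texts by source page and target host,
# keeping each distinct anchor once, in first-seen order.
sub anchors_by_page {
    my @records = @_;
    my (%anchors, %seen);
    for my $rec (@records) {
        my ($page, $url, $text) = @$rec;
        # Skip anything that is not an http(s) URL
        my ($host) = $url =~ m{^https?://([^/:?#]+)}i or next;
        $host = lc $host;
        next if $seen{$page}{$host}{$text}++;
        push @{ $anchors{$page}{$host} }, $text;
    }
    return \%anchors;
}

# Each record is [source page, target URL, anchor text]; made-up data.
my @links = (
    [ 'http://a.example/post', 'http://target.example/',      'cheap widgets' ],
    [ 'http://a.example/post', 'http://target.example/about', 'Widget Co' ],
    [ 'http://a.example/post', 'http://target.example/',      'cheap widgets' ],
    [ 'http://b.example/list', 'http://target.example/',      'widgets' ],
);

my $anchors = anchors_by_page(@links);
for my $page (sort keys %$anchors) {
    for my $host (sort keys %{ $anchors->{$page} }) {
        my @texts = @{ $anchors->{$page}{$host} };
        next unless @texts > 1;    # only pages with several distinct anchors
        print "$page -> $host: ", join(' | ', @texts), "\n";
    }
}
```

For the sample data this reports only the a.example page, which links to target.example with two different anchors.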

By: Michael Martinez

Michael Martinez — Mon, 11 Feb 2008 21:18:15 +0000

Interesting post, but how do you propose we capture all those HTML pages to see who is linking to whom with what anchor text?