Web Scraping: A Basic Know-How

A web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. Other terms for web crawlers are ants, automatic indexers, bots, web spiders and web robots. The process is called “web crawling”, and most search engines use it to keep their data up to date: the crawler creates a copy of every page it visits, and the search engine later processes and indexes the downloaded pages.
Crawling helps with:

faster search
automating maintenance tasks on a website
gathering specific types of information from websites

The bot starts with seeds, a list of URLs to visit. Once the crawler is on one of the listed URLs, it identifies the hyperlinks in that page and adds them to the “crawl frontier”, the set of URLs still to be visited. These are later visited according to a predefined set of policies.
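
The seed and frontier loop can be sketched in a few lines of Perl. This is only an illustration under stated assumptions: the %link_graph hash is a made-up, in-memory stand-in for real pages (so nothing is fetched over the network), the crawl sub and its $max_pages limit are hypothetical names, and the only "policy" is first in, first out.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical in-memory "web": each URL maps to the links found on that page.
my %link_graph = (
    'http://example.com/'  => ['http://example.com/a', 'http://example.com/b'],
    'http://example.com/a' => ['http://example.com/b', 'http://example.com/c'],
    'http://example.com/b' => ['http://example.com/'],
    'http://example.com/c' => [],
);

# Crawl breadth-first from the seeds, visiting at most $max_pages pages.
sub crawl {
    my ($seeds, $max_pages) = @_;
    my @frontier = @$seeds;                  # URLs waiting to be visited
    my %seen     = map { $_ => 1 } @$seeds;  # URLs already queued or visited
    my @visited;

    while (@frontier && @visited < $max_pages) {
        my $url = shift @frontier;           # policy: first in, first out
        push @visited, $url;
        for my $link (@{ $link_graph{$url} || [] }) {
            next if $seen{$link}++;          # skip URLs we already know about
            push @frontier, $link;
        }
    }
    return @visited;
}

my @order = crawl(['http://example.com/'], 10);
print "$_\n" for @order;
```

In a real crawler the lookup in %link_graph would be replaced by fetching the page and extracting its links, and the policy could take politeness delays, depth limits or URL priorities into account.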

Web crawlers can be developed in many languages: Perl, Python, Java, ASP, PHP, etc. Among these, we chose Perl to develop a web crawler. Let's see what happened next.

Why Perl?

Perl is well suited for web scraping because of its powerful regular expressions and the availability of CPAN modules.
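
As a small illustration of the regular expression side, the sketch below pulls href values out of an HTML string. The extract_links sub and the sample HTML are made up for this example, and a pattern like this is brittle on real-world markup, so treat it as a demonstration of Perl's regex syntax rather than a robust link extractor.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the href values of all <a> tags in an HTML string.
sub extract_links {
    my ($html) = @_;
    my @links;
    # /g: find every match, /i: ignore case, /s: let . match newlines.
    # (["']) captures the opening quote and \1 requires the same closing quote.
    while ($html =~ /<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1/gis) {
        push @links, $2;
    }
    return @links;
}

my $html = <<'HTML';
<p><a href="/page/1">One</a> and <A class="x" HREF='http://example.com/two'>Two</A></p>
HTML

print "$_\n" for extract_links($html);
```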

In this session, we will deal with:

Mechanize (Perl module)
Process spawning
Anonymous scraping

Mechanize module: WWW::Mechanize is one of the main modules used for stateful programmatic web browsing and for automating interaction with websites. Mechanize supports performing a sequence of page fetches, including following links and submitting forms. Each fetched page is parsed and its links and forms are extracted. A link or a form can be selected, form fields can be filled in and the next page can be fetched. Mech also stores a history of the URLs you’ve visited, which can be queried and revisited. Useful functions are described below.

For more info: http://search.cpan.org/~petdance/WWW-Mechanize-1.62/

Sample Script for Web Scraping

[perl]
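#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $url = 'http://chato.cl/research/crawling_thesis';
my $m   = WWW::Mechanize->new();   # create the browser object
$m->get($url);                     # fetch the page
my $c = $m->content;               # source code of the fetched page
print $c;                          # display the source code
exit;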
[/perl]

Useful functions of the Mechanize module
my $mech = WWW::Mechanize->new(); # Create a new Mechanize object
$mech->agent_alias('Linux Mozilla'); # Set the User-Agent so the bot identifies itself like Firefox on Linux
$mech->get('http://www.google.com'); # Download the content at the link (the URL needs a scheme such as http://)
$mech->content; # Returns the content of the page just fetched
$mech->submit_form # Fill in and submit a form
$mech->find_link(text => 'Next') # Find the link with text 'Next' (follow_link takes the same options and follows it); there are many other options, such as regular expressions, class, etc.
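
To show how these calls fit together, here is a minimal sketch that walks through paginated results by following 'Next' links. The collect_titles sub, the $max_pages limit and the example command line are assumptions made for this illustration, and the link text 'Next' will differ from site to site.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# Visit a start page and keep following "Next" links, collecting page titles.
# $max_pages guards against endless pagination loops.
sub collect_titles {
    my ($start_url, $max_pages) = @_;
    my $mech = WWW::Mechanize->new();
    $mech->agent_alias('Linux Mozilla');

    my @titles;
    $mech->get($start_url);
    while (1) {
        push @titles, $mech->title // '(no title)';
        last if @titles >= $max_pages;
        # find_link returns a link object, or undef when there is no match.
        my $next = $mech->find_link(text => 'Next') or last;
        $mech->get($next->url_abs);
    }
    return @titles;
}

# Run only when a start URL is given, e.g.: perl titles.pl http://example.com/list
if (@ARGV) {
    print "$_\n" for collect_titles($ARGV[0], 5);
}
```

Looking the link up with find_link before fetching it lets the loop end quietly on the last page instead of raising an error when no 'Next' link exists.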

Process spawning:
Most bots have a main process and a number of child processes. The main process creates child processes as required, while the child processes scrape the target locations simultaneously.

Why process spawning?
Process spawning is used for simultaneous scraping at different levels of a website (i.e. different pages, sections, etc.).
Its advantages include a marked increase in scraping speed and easier management of server load.
Suppose the target is an e-commerce portal with a million sections (such as review pages), some of which are missing. The child process assigned to a missing section simply dies without affecting the overall crawl, while the main process continues with a new child and a new section.
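
The main/child arrangement can be sketched with Perl's built-in fork and wait. Everything named here is hypothetical: scrape_section only pretends to scrape and dies for a section called 'missing', and run_sections and reap are illustrative helpers, not part of any module.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical work for one child: "scrape" a section, or die if it is missing.
sub scrape_section {
    my ($section) = @_;
    die "section $section is missing\n" if $section eq 'missing';
    return "scraped $section";
}

# Block until one child exits and record its exit status.
sub reap {
    my ($pid_to_section, $status) = @_;
    my $pid = wait();
    if ($pid == -1) {                      # no children left to wait for
        %$pid_to_section = ();
        return;
    }
    my $section = delete $pid_to_section->{$pid};
    $status->{$section} = $? >> 8 if defined $section;
}

# Fork one child per section, keeping at most $max_children running at once.
# Returns a hash of section => child exit status (0 means success).
sub run_sections {
    my ($sections, $max_children) = @_;
    my (%pid_to_section, %status);

    for my $section (@$sections) {
        # Wait for a free slot before spawning another child.
        reap(\%pid_to_section, \%status)
            while keys(%pid_to_section) >= $max_children;

        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                   # child process
            my $ok = eval { scrape_section($section); 1 };
            exit($ok ? 0 : 1);             # a failure ends only this child
        }
        $pid_to_section{$pid} = $section;  # main process tracks its children
    }
    reap(\%pid_to_section, \%status) while %pid_to_section;
    return %status;
}

my %result = run_sections([qw(reviews missing products)], 2);
printf "%-10s exit status %d\n", $_, $result{$_} for sort keys %result;
```

The child for the 'missing' section exits with a non-zero status, and the main process simply records that and moves on to the next section. Capping the number of children is one simple way to keep the load on the target server under control.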

Anonymous scraping with Tor

Tor is free software and an open network that helps you defend against traffic analysis, a form of network surveillance that threatens personal freedom, privacy, confidential business activities and relationships.
Tor is a network of virtual tunnels that allows people and groups to improve their privacy and security on the Internet. It also enables software developers to create new communication tools with built-in privacy features. Tor provides the foundation for a range of applications that allow organizations and individuals to share information over public networks without compromising their privacy. For a scraper, routing requests through Tor hides the bot's own IP address from the target site.
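
One way to send Mechanize traffic through Tor is to point it at a local HTTP proxy that forwards to the Tor client. The sketch below assumes such a proxy is listening on 127.0.0.1:8118; that address and port are assumptions, so use whatever your own proxy is configured with. The proxy method itself is inherited from LWP::UserAgent.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# Assumed address of a local HTTP proxy that forwards to Tor; adjust as needed.
my $proxy_url = 'http://127.0.0.1:8118';

my $mech = WWW::Mechanize->new();
# proxy() comes from LWP::UserAgent: send http and https requests via the proxy.
$mech->proxy(['http', 'https'], $proxy_url);

# Fetch a page only when a URL is given, e.g.: perl tor_fetch.pl http://example.com/
if (@ARGV) {
    $mech->get($ARGV[0]);
    print $mech->content;
}
```

If the proxy or the Tor client is not running, the get call fails instead of silently falling back to a direct connection, which is usually what you want when anonymity matters.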

For more info, see:
http://www.torproject.org/docs/tor-doc-unix.html.en#polipo
