This article is a part of the on-going Web Scraping Series. If you are not familiar with Web Scraping please check with the first article . This session mainly deals with Dynamic Content Scraping. Nowadays most of the web portals are dynamic by making Ajax calls instead of old static web pages. Scraping on dynamic environment is both interesting and challenging one.
The first part of the discussion concentrated mainly on static page scraping with Perl mechanize module. Even though mechanize provides extension for dynamic scraping, it is not very good.
So this session deals with making use of selenium testing tool for Web Scraping.
Selenium IDE is a Firefox add-on that records clicks, typing, and other actions to make a test, which you can play back in the browser.
Selenium Remote Control (RC) is a Java based Command line server for handling request from client.
Pros and Cons
Selenium based Web Scraping on small throughout is easy task.
It consumes lots of memory resource, for each request it will launch a new browser instance.
Working of selenium
Selenium RC comes in two parts.
- A server which automatically launches and kills browsers, and acts as a HTTP proxy for web requests from them.
- Client libraries for your favourite computer language.
The RC server also bundles Selenium Core, and automatically loads it into the browser.
Here is a simplified architectural representation:
For Detailed diagram http://seleniumhq.org/about/how.html
How to Setup a Selenium Server
Download Selenium RC server to directory to /usr/local/selenium
#java -jar selenium-server.jar #starting selenium server .By default it is listen to 4444
An example Client Program
As said in the above section, it is possible to create selenium client by recording user activities or else the programmers can create it using their own language. Python, Perl and Ruby, Java has supporting modules for it.
#Sample Perl Code
use Time::HiRes qw(sleep);
use Test::More “no_plan”;
my $sel = Test::WWW::Selenium->new( host => “192.168.1.20”,
port => 4444,
browser => “*firefox”,
browser_url => “http://www.godaddy.com/” );
my $data=$sel->get_html_source(); # here you get source of the current page
For more info please have a look at cpan http://search.cpan.org/search?query=selenium&mode=all
As scraper you can extract required data from this source:
For scraping data from multiple pages
Open selenium IDE and record the events that you are interested and analyse the code generated and try to implement your own way,
As a last word, let me add that selenium is not completely a scraping tool, it is instead, a testing tool.
For more about selenium have look at http://seleniumhq.org/