Dynamic Web Scraping Using Selenium

November 24th, 2010

This article is part of the ongoing Web Scraping series. If you are not familiar with web scraping, please read the first article. This session deals mainly with scraping dynamic content. Most web portals today are dynamic: they load content through Ajax calls instead of serving old-style static pages. Scraping in a dynamic environment is both interesting and challenging.

The first part of the discussion concentrated mainly on static page scraping with the Perl Mechanize module. Even though Mechanize provides extensions for dynamic scraping, they are not very good.

So this session deals with using the Selenium testing tool for web scraping.

Prerequisites

Selenium IDE is a Firefox add-on that records clicks, typing, and other actions to make a test, which you can play back in the browser.

Selenium Remote Control (RC) is a Java-based command-line server that handles requests from clients.

Pros and Cons

It supports all dynamic content such as Ajax and JavaScript, it is easy to implement, and Selenium clients can be written in whichever language we prefer. Here I have used Perl; you can also use Python, Java, etc.

Selenium-based web scraping is an easy task at small throughput.

It consumes a lot of memory, because it launches a new browser instance for each request.

How Selenium Works

Selenium Remote Control (RC) is a test tool that allows you to write automated web application UI tests in any programming language against any HTTP website using any mainstream JavaScript-enabled browser.

Selenium RC comes in two parts.

  1. A server which automatically launches and kills browsers, and acts as an HTTP proxy for web requests from them.
  2. Client libraries for your favourite computer language.

The RC server also bundles Selenium Core, and automatically loads it into the browser.

For a detailed architectural diagram, see http://seleniumhq.org/about/how.html

How to Set Up a Selenium Server

Download the Selenium RC server to the directory /usr/local/selenium

#cd /usr/local/selenium

#unzip selenium-remote-control-1.0-beta-2-dist.zip

#cd selenium-remote-control-1.0-beta-2

#cd selenium-server-1.0-beta-2

#java -jar selenium-server.jar #start the Selenium server; by default it listens on port 4444

An example Client Program

As said in the sections above, a Selenium client can be created by recording user activity, or programmers can write one in their own language. Python, Perl, Ruby and Java all have supporting modules for it.

#!/usr/bin/perl
# Sample Perl code
use strict;
use warnings;
use Time::HiRes qw(sleep);
use Test::WWW::Selenium;
use Test::More "no_plan";
use Test::Exception;

# Connect to the Selenium RC server and start a Firefox session
my $sel = Test::WWW::Selenium->new(
    host        => "192.168.1.20",
    port        => 4444,
    browser     => "*firefox",
    browser_url => "http://www.godaddy.com/",
);
$sel->open_ok("/domains/search.aspx?ci=8969");
$sel->click_ok("domain_search_button");
$sel->wait_for_page_to_load_ok("30000");    # timeout in milliseconds
my $data = $sel->get_html_source();         # here you get the source of the current page

For more info, please have a look at CPAN: http://search.cpan.org/search?query=selenium&mode=all

As a scraper, you can then extract the required data from this source.
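The extraction step could look something like the sketch below. It is only an illustration: the HTML fragment, the class names and the extract_prices helper are made up for the example, and on a real page the string would come from get_html_source(). A regular expression is enough for small, regular markup like this; for messy pages an HTML parser module from CPAN is usually the safer choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical page source; in the client above this would be the string
# returned by get_html_source().
my $data = <<'HTML';
<ul id="results">
  <li class="domain">example.com <span class="price">$9.99</span></li>
  <li class="domain">example.net <span class="price">$11.49</span></li>
</ul>
HTML

# Walk over every matching <li> with a global match and collect
# domain => price pairs in a hash.
sub extract_prices {
    my ($html) = @_;
    my %price_for;
    while ( $html =~ m{<li class="domain">\s*(\S+)\s*<span class="price">\$([\d.]+)</span>}g ) {
        $price_for{$1} = $2;
    }
    return \%price_for;
}

my $prices = extract_prices($data);
printf "%s costs %s\n", $_, $prices->{$_} for sort keys %$prices;
```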

For scraping data from multiple pages

Open Selenium IDE, record the events that you are interested in, analyse the generated code and then implement it in your own way.

As a last word, let me add that Selenium is not really a scraping tool; it is a testing tool.

For more about Selenium, have a look at http://seleniumhq.org/


Shameem Khalid linux, perl

Web Scraping : A basic know-how.

August 2nd, 2010

A Web crawler is a computer program that browses the World Wide Web in a methodical, automated and orderly manner. Other terms for Web crawlers are ants, automatic indexers, bots, web spiders, web robots, etc. The process is termed “web crawling”, and most search engines use it as a means of providing up-to-date data: the crawler creates a copy of all the pages it has visited, and the search engine later processes and indexes these downloaded pages.
This helps with:

  • faster search
  • automating maintenance tasks on a web site
  • gathering specific types of information from websites

The bot starts with seeds, which are a list of URLs to visit. Once the “crawler” is on one of the listed URLs, the hyperlinks in that page are identified and added to the “crawl frontier” which is the set of URLs that are to be visited. These are later visited according to a pre-defined set of policies.
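To make the seed and frontier idea concrete, here is a small sketch in Perl. It does not touch the network: the %links_on hash is a made-up stand-in for "fetch the page and pull out its hyperlinks", and the only policies it applies are first in, first out ordering, never queueing a URL twice, and a page limit. A real crawler would have more policies than that.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical in-memory "web": each URL maps to the links found on that page.
# A real crawler would fetch the page and extract its hyperlinks instead.
my %links_on = (
    'http://example.com/'  => [ 'http://example.com/a', 'http://example.com/b' ],
    'http://example.com/a' => [ 'http://example.com/b', 'http://example.com/c' ],
    'http://example.com/b' => [ 'http://example.com/' ],
    'http://example.com/c' => [],
);

# Crawl breadth-first from the seeds, visiting at most $max_pages pages.
sub crawl {
    my ( $seeds, $max_pages ) = @_;
    my @frontier = @$seeds;                      # URLs waiting to be visited
    my %seen     = map { $_ => 1 } @frontier;    # URLs already queued
    my @visited;

    while ( @frontier && @visited < $max_pages ) {
        my $url = shift @frontier;               # policy: first in, first out
        push @visited, $url;
        for my $link ( @{ $links_on{$url} || [] } ) {
            next if $seen{$link}++;              # skip URLs queued earlier
            push @frontier, $link;
        }
    }
    return @visited;
}

my @order = crawl( ['http://example.com/'], 10 );
print "$_\n" for @order;
```

Swapping the shift for a pop would turn the same loop into a depth-first crawl, which is one simple way a different policy changes the visiting order.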

Web crawlers can be developed in any language: Perl, Python, Java, ASP, PHP, etc. Among these, we chose Perl to develop a web crawler. Let's see what happened next.

Why Perl?

Perl is well suited for web scraping because of its powerful regular expressions and the availability of CPAN modules.

In this session, we will deal with:

  • Mechanize (Perl module)
  • Process spawning
  • Anonymous scraping

Mechanize module: Mechanize is one of the main modules used for stateful programmatic web browsing and for automating interaction with websites. Mechanize supports performing a sequence of page fetches, including following links and submitting forms. Each fetched page is parsed and its links and forms are extracted. A link or a form can be selected, form fields can be filled in and the next page can be fetched. Mech also stores a history of the URLs you’ve visited, which can be queried and revisited. Useful functions are described at the bottom.

For more info: http://search.cpan.org/~petdance/WWW-Mechanize-1.62/

Sample Script

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $url = 'http://chato.cl/research/crawling_thesis';
my $m   = WWW::Mechanize->new();
$m->get($url);
my $c = $m->content;    # source code of the page at the above link
print $c;               # display it
exit;

Useful functions of the Mechanize module
my $mech = WWW::Mechanize->new();                   # Create a new Mechanize object
$mech->agent_alias('Linux Mozilla');                # Identify as a browser such as Firefox on Linux
$mech->get('http://www.google.com/');               # Download the content at the link
$mech->content;                                     # Holds the content of the fetched page
$mech->submit_form( fields => { q => 'perl' } );    # Form submission ('q' is just an example field name)
$mech->find_link( text => 'Next' );                 # Find the link with the text 'Next' (follow_link follows it); there are many other options, like regular expressions, class, etc.

Process spawning:
Most bots have a main process and a number of child processes. The main process creates child processes as required, while the child processes scrape the target locations simultaneously.

Why process spawning?
Process spawning is used simply for simultaneous scraping at different levels of a web site (i.e. at different pages, sections, etc.).
It has a number of advantages, such as a large boost in scraping speed and easier management of server load.
Consider a target such as an e-commerce portal with a million sections (review pages, for example), some of which are missing. When a child process hits a missing page or section, it simply dies without affecting the overall crawl, while the main process continues with a new child and a new section.
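A rough sketch of this main/child arrangement, using only Perl's built-in fork and wait, is shown below. The section names and the scrape_section placeholder are invented for the example, and it forks one child per section all at once; a real bot would cap the number of concurrent children and start a new one as each finishes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical work list: one site section per child process.
my @sections = qw(reviews products missing-page offers);

# Placeholder for the real scraping code; it fails for the missing section.
sub scrape_section {
    my ($section) = @_;
    die "section '$section' not found\n" if $section eq 'missing-page';
    return 1;
}

# Fork one child per section and return a map of section => exit status.
sub scrape_all {
    my (@todo) = @_;
    my %section_of;                        # child pid => section name
    for my $section (@todo) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ( $pid == 0 ) {                 # child: scrape one section, then exit
            my $ok = eval { scrape_section($section) };
            exit( $ok ? 0 : 1 );
        }
        $section_of{$pid} = $section;      # parent: remember the child
    }

    my %status;
    while ( ( my $pid = wait() ) != -1 ) { # reap children as they finish
        next unless exists $section_of{$pid};
        $status{ $section_of{$pid} } = $? >> 8;
    }
    return \%status;
}

my $status = scrape_all(@sections);
for my $section ( sort keys %$status ) {
    printf "%-12s %s\n", $section, $status->{$section} == 0 ? 'done' : 'failed';
}
```

The child working on the missing section exits with a non-zero status, and the main process simply records that and carries on with the others.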

Anonymous scraping with Tor

Tor is free software and an open network that helps you defend against a form of network surveillance known as traffic analysis. This surveillance threatens personal freedom, privacy, confidential business activities and relationships.
Tor is a network of virtual tunnels that allows people and groups to improve their privacy and security on the Internet. It also enables software developers to create new communication tools with built-in privacy features. Tor provides the foundation for a range of applications that allow organizations and individuals to share information over public networks without compromising their privacy.

For more info, please go through
http://www.torproject.org/docs/tor-doc-unix.html.en#polipo


Shameem Khalid Articles, linux, perl

Issues with ImageMagick/mod_perl/HTML::Mason

February 11th, 2010

Today, I was working with a web-based mod_perl application running on an old RHEL 2.1 (Panama) server, with Apache v1.3.33 and mod_perl v1.29.

The application had an option to upload images for the products displayed on the site, but it lacked image resizing functionality. Some of the users had uploaded huge images that went beyond the site borders. I was about to integrate the resizing function using Perl’s Image::Magick module. The existing module was outdated, so I went for an upgrade. Obviously, it was not a quick and easy upgrade.

For the code snippet

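$image = Image::Magick->new;
$x = $image->Read($imgname);    # $x holds the error message, empty if the read succeeded
if ( $x eq '' ) {
    $image->Resize( geometry => '800x800' );
    $image->Write($imgname);    # overwrite the uploaded file with the resized image
}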

The first error I met was

Can't load '/usr/lib/perl5/site_perl/5.6.1/i386-linux/auto/Image/Magick/Magick.so'

I ran the ldd command on Magick.so and found that the libMagick.so link was broken:

ldd /usr/lib/perl5/site_perl/5.6.1/i386-linux/auto/Image/Magick/Magick.so
libMagick.so.6 => not found

So I downloaded ImageMagick-6.2.3-6.tar.gz and reinstalled ImageMagick.

For a successful compile, I had to comment out the following lines in /usr/src/ImageMagick-6.2.3/magick/annotate.c

//      if (LocaleCompare(encoding,"Latin-1") == 0)
//        encoding_type=ft_encoding_latin_1;

because (I guess) the machine lacked that font.

After that, the app threw another error:

Wrong JPEG library version: library is 62, caller expects 70

This was fixed by relinking libjpeg.so.62 to libjpeg.so.7.0.0. I know it's dirty, but that did the trick.

I hope mod_perl/Mason players will find this useful.


Shameem Khalid linux, perl