Friday, February 19, 2010

Heritrix and HTMLUnit

Hi folks, in this post I will explain how to build Heritrix from its source code and how to integrate HTMLUnit into Heritrix.

The first questions that come to mind are: what is Heritrix, and why do we need to integrate HTMLUnit into it?

Well, Heritrix is an open-source web crawler. Heritrix does not include a page-level DOM model or a JavaScript interpreter. Therefore, if you want to crawl the web to look for malicious scripts or obfuscated JS, you need a JS interpreter. This is where HTMLUnit comes into play. HTMLUnit is a headless browser that includes a JS interpreter.

Steps to Build Heritrix:
  1. Download the latest version of the JDK RPM from the Sun website and install it.
  2. Add the JAVA_HOME and PATH environment variables to your .bashrc file (~/.bashrc):
                 export JAVA_HOME=/usr/java/jdk1.6.x.x
                 export PATH=$JAVA_HOME/bin:$PATH
     Now the JDK is ready to be used by Heritrix and Maven.
  3. We need Maven 1.0.2 to build Heritrix. Note: we need the Heritrix source so that we can modify it in the future, so do not use the Heritrix binaries available on the Internet; build Heritrix from source. Also note the Maven version, it is very important. Do not try the latest version of Maven; it may not work.
  4. Download the Maven 1.0.2 binary and extract it somewhere on disk. Now set the MAVEN_HOME
     environment variable for it as described below.
     Edit the /etc/profile file and insert the following lines before the unset i and unset pathmunge commands at the end of the file:
            export MAVEN_HOME=/path_of_Maven_directory
            pathmunge $MAVEN_HOME/bin before

     Now log out and log in again so that the environment variable changes made above take effect.

  5. Run the maven -v command to check that Maven is working properly.
  6. Run the maven jar command. This will create the /root/.maven/repository directory.
  7. Now go into the Heritrix directory and run the command maven dist.
  8. This will create a target subdirectory with many other subdirectories inside it.
     The target/distribution directory holds the built version of Heritrix.
     If the build fails because of a missing dependency JAR file, download that file from the Internet and store it in either the /root/.maven/cache or the /root/.maven/repository/.../jar/ directory.
  9. Heritrix is now built successfully. Extract the built version and test Heritrix.
 10. Launch Heritrix using the command:
           $HERITRIX_HOME/bin/heritrix --admin=LOGIN:PASSWORD
     where $HERITRIX_HOME is the location of your untarred heritrix.?.?.?.tar.gz.
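
Before running the build commands above, it can save time to confirm that the variables from steps 2 and 4 actually point at real installations. The snippet below is only a sketch: check_build_env is a made-up helper name, and it assumes that both installs have a bin sub-directory, as the PATH lines above imply.

```shell
# Sketch only: check_build_env is a hypothetical helper, not part of Heritrix or Maven.
# It reports whether JAVA_HOME and MAVEN_HOME are set and contain a bin directory.
check_build_env() {
    status=0
    for var in JAVA_HOME MAVEN_HOME; do
        # Read the variable whose name is stored in $var (empty if unset).
        eval "dir=\${$var:-}"
        if [ -z "$dir" ]; then
            echo "$var is not set"
            status=1
        elif [ ! -d "$dir/bin" ]; then
            echo "$var=$dir has no bin directory"
            status=1
        else
            echo "$var=$dir looks usable"
        fi
    done
    return $status
}
```

Calling check_build_env in a fresh login shell prints one line per variable and returns non-zero if either one is missing, which is a quick hint that the /etc/profile or .bashrc edits were not picked up.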

Integrating HTMLUnit into Heritrix:

This is a little bit tricky. If you have reached this point, you already have Heritrix, the Sun JDK and Maven.

Follow the steps given below:
Step 1: Download HTMLUnit (I used HTMLUnit 2.5). We don't need the source code of HTMLUnit, so download the binary distribution. We only need its JAR files.

Step 2: Copy all the JAR files from HTMLUnit into the lib sub-directory of the Heritrix folder. Do not replace files that are already there; if you replace them, you will also need to modify the files that reference them. Only add the files that are not there yet.
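
One way to do the copy in Step 2 without overwriting anything is a small loop that skips files that already exist. This is a minimal sketch: copy_missing_jars is a hypothetical helper name, and the two directory arguments are whatever paths you extracted HTMLUnit and Heritrix to.

```shell
# Sketch only: copy_missing_jars is a hypothetical helper.
# Usage: copy_missing_jars <htmlunit-lib-dir> <heritrix-lib-dir>
copy_missing_jars() {
    src=$1
    dest=$2
    for jar in "$src"/*.jar; do
        [ -e "$jar" ] || continue   # the glob matched nothing
        name=$(basename "$jar")
        if [ -e "$dest/$name" ]; then
            echo "kept existing $name"
        else
            cp "$jar" "$dest/$name"
            echo "added $name"
        fi
    done
}
```

The "added" lines give you the list of JARs that need entries in the following steps. Note that the check is by file name only, so a different version of a library that Heritrix already ships would still be copied; those cases need a manual look.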

Step 3: Edit the project.xml file in the Heritrix directory, because we want to tell Heritrix where the HTMLUnit classes can be found. Add a dependency tag for each JAR file of HTMLUnit.

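A sample dependency tag is given below (shown for htmlunit-2.5.jar):
         <dependency>
             <id>htmlunit</id>
             <version>2.5</version>
             <url> </url>
             <properties>
                 <war.bundle>true</war.bundle>
                 <description>Used to handle JS obfuscation. It is a headless browser.</description>
                 <license>Apache 2.0</license>
             </properties>
         </dependency>
Add this dependency tag for all JAR files of HTMLUnit.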
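
Writing one tag per JAR by hand gets tedious, so here is a rough sketch that prints a stub for a given JAR file. The helper names split_jar_name and print_dependency_stub are made up for illustration. It assumes the file is named id-version.jar and that the stub follows the Maven 1 dependency layout (id, version, properties); the url, description and license still have to be filled in by hand.

```shell
# Sketch only: both helpers are hypothetical names.
# split_jar_name prints "<id> <version>" for a file named <id>-<version>.jar.
split_jar_name() {
    base=$(basename "$1" .jar)
    # Everything before the first "-<digit>" is treated as the id.
    id=$(printf '%s\n' "$base" | sed 's/-[0-9].*$//')
    if [ "$id" = "$base" ]; then
        return 1   # no version found in the file name
    fi
    version=${base#"$id"-}
    printf '%s %s\n' "$id" "$version"
}

# print_dependency_stub prints a minimal dependency tag for one JAR file.
print_dependency_stub() {
    parts=$(split_jar_name "$1") || { echo "skipping $1: no version in file name" >&2; return 1; }
    id=${parts% *}
    version=${parts#* }
    cat <<EOF
<dependency>
    <id>$id</id>
    <version>$version</version>
    <properties>
        <war.bundle>true</war.bundle>
    </properties>
</dependency>
EOF
}
```

For example, print_dependency_stub lib/htmlunit-2.5.jar prints a stub with id htmlunit and version 2.5. JARs without a version in their name are skipped with a message and need a hand-written entry.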

Step 4: Edit the project.properties file in the Heritrix directory to instruct Maven not to download those dependency files from the Internet, but to look in the local directory instead. The syntax for this can easily be found in the project.properties file itself; simply follow the existing entries.

For example:
            maven.jar.htmlunit = ${basedir}/lib/htmlunit-2.5.jar

Add an entry for each JAR file of HTMLUnit (that is, one for each dependency entry made in Step 3).

Step 5: Done. Now build Heritrix again.
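
The entries can also be generated rather than typed. The sketch below prints one line in the same form as the example above for a given JAR file. print_jar_override is a hypothetical helper name, and it assumes the JAR sits in the lib directory and is named id-version.jar.

```shell
# Sketch only: print_jar_override is a hypothetical helper.
# It prints a project.properties line pointing Maven at a local JAR in lib/.
print_jar_override() {
    file=$(basename "$1")
    base=${file%.jar}
    # Everything before the first "-<digit>" is treated as the dependency id.
    id=$(printf '%s\n' "$base" | sed 's/-[0-9].*$//')
    # ${basedir} is printed literally; Maven expands it, not the shell.
    printf 'maven.jar.%s = ${basedir}/lib/%s\n' "$id" "$file"
}
```

For example, print_jar_override lib/htmlunit-2.5.jar prints the same line as the example above. As far as I know, Maven 1 only honours these entries when maven.jar.override is switched on; the existing entries in project.properties suggest it already is, but it is worth checking.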

1 comment:

  1. Ok. So htmlunit-2.5.jar will be included in the build. But what instructs Heritrix to use htmlunit for fetching instead of HttpMethod? Without this, it would be completely useless to build Heritrix with htmlunit dependency.