Hi folks, in this post I will explain how to build Heritrix from its source code and how to integrate HTMLUnit into Heritrix.
The first questions that come to mind are: what is Heritrix, and why do we need to integrate HTMLUnit into it?
Well, Heritrix is an open-source web crawler. It does not include a page-level DOM model or a JavaScript interpreter. Therefore, if you want to crawl the web looking for malicious scripts or obfuscated JS, you need a JS interpreter. That is where HTMLUnit comes into play: HTMLUnit is a headless browser that includes a JS interpreter.
Steps to Build Heritrix:
1. Download the latest version of the JDK RPM from the Sun website and install it.
2. Set the JAVA_HOME and PATH environment variables in your .bashrc file (~/.bashrc):
export JAVA_HOME=/usr/java/jdk1.6.x.x
export PATH=$JAVA_HOME/bin:$PATH
Now the JDK is ready to be used by Heritrix and Maven.
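A quick way to confirm that the exports took effect is a small check like the one below. This is only a sketch: check_jdk is a name I made up for illustration, and the only thing it assumes is the usual JDK layout with bin/java and bin/javac under JAVA_HOME.

```shell
# check_jdk DIR: succeed only if DIR looks like a JDK install,
# i.e. it contains executable bin/java and bin/javac.
# (check_jdk is a hypothetical helper, not part of the JDK or Heritrix.)
check_jdk() {
    jdk_dir=${1:-}
    if [ -z "$jdk_dir" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    for tool in java javac; do
        if [ ! -x "$jdk_dir/bin/$tool" ]; then
            echo "missing $jdk_dir/bin/$tool" >&2
            return 1
        fi
    done
    echo "JDK found in $jdk_dir"
}

# Usage, after opening a new shell so that ~/.bashrc is read again:
#   check_jdk "$JAVA_HOME" && java -version
```

If the check fails, the most likely cause is a JAVA_HOME path that does not match the directory the RPM actually installed into.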
3. We need Maven 1.0.2 to build Heritrix.
Note: We need the Heritrix source so that we can modify it later, so do not use the Heritrix binaries available on the Internet; build Heritrix from source. Also note the Maven version, which is very important: do not try the latest version of Maven, as it may not work.
4. Download the binary distribution of Maven 1.0.2 and extract it somewhere on disk. Then set the MAVEN_HOME environment variable for it as described below.
Edit the /etc/profile file and insert the following lines before the unset i and unset pathmunge commands at the end of the file:
export MAVEN_HOME=/path_of_Maven_directory
pathmunge $MAVEN_HOME/bin before
Now log out and log in again so that the environment variable changes above take effect.
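The pathmunge call used above is not a standalone command. It is a helper function that Red Hat style /etc/profile files define near the top, which is why the two lines have to go before the unset pathmunge line. As a rough sketch of what it does (my own re-implementation under the made-up name pathmunge_sketch, not the exact code from /etc/profile), it adds a directory to PATH only if it is not already there:

```shell
# pathmunge_sketch DIR [after]
# Add DIR to PATH unless it is already present. By default DIR goes to
# the front; with "after" as the second argument it goes to the end.
# (Illustrative re-implementation; the real function lives in /etc/profile.)
pathmunge_sketch() {
    case ":$PATH:" in
        *":$1:"*)
            return 0 ;;            # already on PATH, nothing to do
    esac
    if [ "${2:-}" = "after" ]; then
        PATH=$PATH:$1
    else
        PATH=$1:$PATH
    fi
}

# Example with the placeholder Maven location from the step above:
#   MAVEN_HOME=/path_of_Maven_directory
#   pathmunge_sketch "$MAVEN_HOME/bin" before
```

In this sketch any second argument other than after, including before, puts the directory at the front of PATH, so the Maven 1.0.2 binary is found ahead of any other maven on the system.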
5. Run the maven -v command to check that Maven is running properly.
6. Run the maven jar command. This will create the /root/.maven/repository directory.
7. Now go into the heritrix directory and run the command:
maven dist
8. This will create a target subdirectory, with many other subdirectories inside it. The target/distribution directory holds the Heritrix build.
If the build fails because of a missing dependency JAR file, download that file from the Internet and store it in either the /root/.maven/cache or the /root/.maven/repository/.../jar/ directory.
9. Heritrix is now built successfully. Extract the build and test Heritrix.
10. Launch Heritrix using the command:
$HERITRIX_HOME/bin/heritrix --admin=LOGIN:PASSWORD
where $HERITRIX_HOME is the location of your untarred heritrix.?.?.?.tar.gz.
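The "extract the build" part of step 9 can be sketched as a small shell function. The name unpack_heritrix, the heritrix*.tar.gz file pattern and the destination directory are my assumptions for illustration; check what your own target/distribution directory actually contains before relying on it.

```shell
# unpack_heritrix DIST_DIR DEST_DIR
# Extract the first heritrix*.tar.gz found in DIST_DIR into DEST_DIR.
# (Hypothetical helper; the tarball name pattern is an assumption.)
unpack_heritrix() {
    dist_dir=$1
    dest_dir=$2
    tarball=
    for candidate in "$dist_dir"/heritrix*.tar.gz; do
        case $candidate in
            *-src.tar.gz) continue ;;   # skip a source tarball, if one is built
        esac
        if [ -f "$candidate" ]; then
            tarball=$candidate
            break
        fi
    done
    if [ -z "$tarball" ]; then
        echo "no heritrix*.tar.gz in $dist_dir" >&2
        return 1
    fi
    mkdir -p "$dest_dir" && tar -xzf "$tarball" -C "$dest_dir"
}

# Usage from the heritrix source directory (destination is a placeholder):
#   unpack_heritrix target/distribution "$HOME/heritrix-build"
```

After that, HERITRIX_HOME would point at the heritrix directory that appears inside the destination.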
Integrating HTMLUnit into Heritrix:
This is a little bit tricky. If you have reached this point, you already have Heritrix, the Sun JDK and Maven.
Follow the steps given below:
Step 1: Download HTMLUnit (I used HTMLUnit 2.5). We don't need the HTMLUnit source code, so download the binary distribution. We only need its JAR files.
Step 2: Copy all the HTMLUnit JAR files into the lib sub-directory of the heritrix folder. Do not replace files that are already there; if you replace them, you will need to modify the project.properties file. Only add the files that are not there yet.
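The "add only what is missing" rule from Step 2 can be sketched as a loop that never overwrites an existing file. The function name copy_new_jars is hypothetical, and both directories are whatever paths you pass in.

```shell
# copy_new_jars SRC_DIR DEST_DIR
# Copy every *.jar from SRC_DIR into DEST_DIR, skipping any file name
# that already exists there, and report what happened to each file.
# (Hypothetical helper name; both directories are supplied by the caller.)
copy_new_jars() {
    src_dir=$1
    dest_dir=$2
    for jar in "$src_dir"/*.jar; do
        [ -f "$jar" ] || continue          # glob matched nothing
        name=${jar##*/}
        if [ -e "$dest_dir/$name" ]; then
            echo "skipped $name"
        else
            cp "$jar" "$dest_dir/" && echo "copied $name"
        fi
    done
}

# Usage (both paths are placeholders):
#   copy_new_jars /path/to/htmlunit-2.5/lib /path/to/heritrix/lib
```

Note that this compares exact file names only. A different version of a library that Heritrix already ships has a different file name and would still be copied, so it is worth reviewing the "copied" list by hand.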
Step 3: Edit the project.xml file in the heritrix directory, because we want to tell Heritrix where the HTMLUnit classes can be found. Add a <dependency> tag for each HTMLUnit JAR file.
A sample dependency tag is given below:
<dependency>
  <id>htmlunit</id>
  <version>2.5</version>
  <url>http://htmlunit.sourceforge.net/</url>
  <properties>
    <war.bundle>true</war.bundle>
    <ear.bundle>true</ear.bundle>
    <ear.bundle.dir>APP-INF/lib</ear.bundle.dir>
    <description>
      Used to handle JS obfuscation. It is a headless browser.
    </description>
    <license>Apache 2.0
      http://www.apache.org/licenses/LICENSE-2.0</license>
  </properties>
</dependency>
Add a dependency tag like this one for every HTMLUnit JAR file.
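Writing one tag per JAR by hand gets tedious, so here is a rough sketch that prints a skeleton tag for each JAR in a directory. The function name print_dependency_tags is made up, and it assumes the JARs are named NAME-VERSION.jar with a version that starts with a digit. It only fills in id and version; the url and properties parts from the sample still have to be added by hand.

```shell
# print_dependency_tags LIB_DIR
# For each NAME-VERSION.jar in LIB_DIR, print a minimal <dependency>
# stanza with <id> and <version> taken from the file name.
# (Hypothetical helper; it assumes the version is everything after the
# last "-" and starts with a digit, and skips files that do not fit.)
print_dependency_tags() {
    for jar in "$1"/*.jar; do
        [ -f "$jar" ] || continue
        base=${jar##*/}
        base=${base%.jar}
        version=${base##*-}
        id=${base%-*}
        case $version in
            [0-9]*) ;;
            *) version= ;;
        esac
        if [ -z "$version" ] || [ "$id" = "$base" ]; then
            echo "skipping $base.jar: no version in file name" >&2
            continue
        fi
        printf '<dependency>\n'
        printf '  <id>%s</id>\n' "$id"
        printf '  <version>%s</version>\n' "$version"
        printf '</dependency>\n'
    done
}

# Usage (the path is a placeholder for the unpacked HTMLUnit JARs):
#   print_dependency_tags /path/to/htmlunit-2.5/lib
```

Run it against the directory holding only the HTMLUnit JARs, not the whole heritrix lib directory, otherwise you would get tags for JARs that project.xml already lists.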
Step 4: Edit the project.properties file in the heritrix directory to tell Maven not to download these dependency files from the Internet, but to look in the local directory instead. The syntax for this can easily be found in the project.properties file itself; simply reuse it.
For example:
maven.jar.htmlunit = ${basedir}/lib/htmlunit-2.5.jar
Add an entry for each HTMLUnit JAR file (that is, for each dependency entry added in Step 3).
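The per-JAR entries follow the same pattern as the htmlunit example, so they can be generated too. The sketch below uses a made-up function name, print_jar_overrides, and assumes NAME-VERSION.jar file names; the NAME part must match the id used in the corresponding dependency tag in project.xml. As far as I know, Maven 1 only honors maven.jar.* entries when maven.jar.override = on is also set, so check that project.properties already contains that line.

```shell
# print_jar_overrides LIB_DIR
# For each NAME-VERSION.jar in LIB_DIR, print a project.properties line
# of the form: maven.jar.NAME = ${basedir}/lib/NAME-VERSION.jar
# (Hypothetical helper; NAME is the file name up to the last "-".)
print_jar_overrides() {
    for jar in "$1"/*.jar; do
        [ -f "$jar" ] || continue
        file=${jar##*/}
        base=${file%.jar}
        id=${base%-*}
        # Single quotes keep ${basedir} literal so that Maven expands it.
        printf 'maven.jar.%s = ${basedir}/lib/%s\n' "$id" "$file"
    done
}

# Usage (the path is a placeholder for the unpacked HTMLUnit JARs):
#   print_jar_overrides /path/to/htmlunit-2.5/lib
```

Review the output before pasting it in, and drop any line for a JAR that Heritrix already ships and that project.properties already covers.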
Step 5: Done. Now build Heritrix again with maven dist.