Nutch-Hbase

Unpack HBase and edit conf/hbase-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///DIRECTORY/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/DIRECTORY/zookeeper</value>
  </property>
</configuration>
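
To avoid hand-editing the DIRECTORY placeholder, the same file can be generated from a chosen data directory. This is a minimal sketch: write_hbase_site is a hypothetical helper (not part of HBase), and it assumes the data directory is given as an absolute path.

```shell
#!/bin/sh
# Hypothetical helper: write the standalone hbase-site.xml shown above,
# substituting an absolute data directory for the DIRECTORY placeholder.
write_hbase_site() {
  data_dir=$1   # absolute path, e.g. /data/nutch (assumption)
  out_file=$2   # e.g. conf/hbase-site.xml inside the unpacked HBase directory
  cat > "$out_file" <<EOF
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file://${data_dir}/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>${data_dir}/zookeeper</value>
  </property>
</configuration>
EOF
}
```

Example call: write_hbase_site /data/nutch conf/hbase-site.xml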

Start HBase.
Verify that HBase is working (use the HBase shell).

hbase(main):003:0> create 'test', 'cf'
0 row(s) in 1.2200 seconds
hbase(main):003:0> list 'test'
..
1 row(s) in 0.0550 seconds
hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.0560 seconds
hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0370 seconds
hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0450 seconds
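
The puts above only write data. To confirm the rows can be read back, and to clean up afterwards, something like the following may help. It is a sketch under assumptions: HBASE_HOME is a placeholder for wherever HBase was unpacked, hbase_check_commands is a hypothetical helper, and piping commands into the HBase shell on stdin is assumed to work with the installed version.

```shell
#!/bin/sh
# Assumed location of the unpacked HBase directory; adjust as needed.
HBASE_HOME=${HBASE_HOME:-/opt/hbase}

# Hypothetical helper: print HBase shell commands that read back the rows
# written above and then remove the throwaway 'test' table.
hbase_check_commands() {
  cat <<'EOF'
scan 'test'
get 'test', 'row1'
disable 'test'
drop 'test'
EOF
}

# Start HBase, then feed the commands to the HBase shell (uncomment to run):
#   "$HBASE_HOME/bin/start-hbase.sh"
#   hbase_check_commands | "$HBASE_HOME/bin/hbase" shell
```

A table has to be disabled before it can be dropped, which is why disable comes before drop.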

Install Nutch and edit nutch-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
<property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
</property>

<property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
</property>

</configuration>

Enable the Gora-HBase dependency in ivy/ivy.xml

    <!-- Uncomment this to use HBase as Gora backend. -->
    <dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />

Set HBase as the default data-store in gora.properties

 gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
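
If gora.properties already contains an active gora.datastore.default line for another store, simply appending a second one leaves two conflicting entries. A minimal sketch of an idempotent update follows; set_gora_default_store is a hypothetical helper, not part of Nutch or Gora, and the path passed to it is an assumption about where the file lives in your checkout.

```shell
#!/bin/sh
# Hypothetical helper: make HBaseStore the default Gora data store in the
# given properties file, replacing an active entry or appending a new one.
set_gora_default_store() {
  props=$1
  line='gora.datastore.default=org.apache.gora.hbase.store.HBaseStore'
  if grep -q '^[[:space:]]*gora\.datastore\.default=' "$props" 2>/dev/null; then
    # An uncommented entry exists: rewrite it in place via a temp file.
    tmp=$(mktemp)
    sed "s|^[[:space:]]*gora\.datastore\.default=.*|$line|" "$props" > "$tmp"
    cat "$tmp" > "$props"
    rm -f "$tmp"
  else
    # No active entry (commented-out lines are ignored): append one.
    printf '%s\n' "$line" >> "$props"
  fi
}
```

Example call: set_gora_default_store conf/gora.properties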

Build Nutch

ant runtime

Unpack Solr and copy the conf files from Nutch to Solr (tip from

cp <NUTCH_DIR>/conf/* <SOLR_DIR>/example/solr/collection1/conf/
Santy-Raghavans-MacBook-Pro:local santy$ bin/nutch solrindex http://localhost:8983/solr
Usage: SolrIndexerJob <solr url> (<batchId> | -all | -reindex) [-crawlId <id>]
Santy-Raghavans-MacBook-Pro:local santy$ bin/nutch solrindex http://localhost:8983/solr -all
SolrIndexerJob: starting
[Deprecated] Xalan: org.apache.xml.serializer.XMLEntities
Adding 1 documents
SolrIndexerJob: java.lang.RuntimeException: job failed: name=solr-index, jobid=job_local_0001
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
    at org.apache.nutch.indexer.solr.SolrIndexerJob.run(SolrIndexerJob.java:46)
    at org.apache.nutch.indexer.solr.SolrIndexerJob.indexSolr(SolrIndexerJob.java:54)
    at org.apache.nutch.indexer.solr.SolrIndexerJob.run(SolrIndexerJob.java:75)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.indexer.solr.SolrIndexerJob.main(SolrIndexerJob.java:84)

As an alternative, the Nutch conf directory contains a solr4-schema.xml file. I copied this over to the Solr config. Solr then complained about a missing _version_ field in the schema file. I added it and Solr stopped complaining, but Nutch still reported the same error as above (SolrIndexerJob).
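
A quick way to see whether a schema file already declares the _version_ field is sketched below. has_version_field is a hypothetical helper and SOLR_DIR is a placeholder. The field definition in the comment is the form used in Solr 4's example schema, and it assumes the schema defines a long field type. The check is line-based, so it will miss a field tag split across lines and will also match one inside a comment.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given schema file declares a
# <field ... name="_version_" ...> element on a single line.
has_version_field() {
  grep -q '<field[^>]*name="_version_"' "$1"
}

# Usage (SOLR_DIR is a placeholder for the unpacked Solr directory):
#   has_version_field "$SOLR_DIR/example/solr/collection1/conf/schema.xml" \
#     || echo 'add: <field name="_version_" type="long" indexed="true" stored="true"/>'
```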

It turns out that the first time we run Nutch's solrindex, we need to run it with -reindex; running it with -all gives the java.lang.RuntimeException listed above.

Santy-Raghavans-MacBook-Pro:local santy$ bin/nutch solrindex http://localhost:8983/solr -all
SolrIndexerJob: starting
[Deprecated] Xalan: org.apache.xml.serializer.XMLEntities
Adding 1 documents
SolrIndexerJob: done.
Santy-Raghavans-MacBook-Pro:local santy$
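
The first-run observation above can be captured in a small wrapper. This is only a sketch: first_solr_index is a hypothetical function, and the arguments in the example call are the ones used in this walkthrough.

```shell
#!/bin/sh
# Hypothetical wrapper: run the very first Solr indexing pass with -reindex,
# as the note above suggests, instead of -all.
first_solr_index() {
  nutch_bin=$1   # path to the nutch launcher, e.g. bin/nutch
  solr_url=$2    # e.g. http://localhost:8983/solr
  "$nutch_bin" solrindex "$solr_url" -reindex
}
```

Example call: first_solr_index bin/nutch http://localhost:8983/solr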