Introduction

This is the main entry point for Gora documentation. Here are some pointers for further info:

You can find an abstract overview of how to configure Gora here.

Gora Modules

Gora source code is organized in a modular architecture. The gora-core module is the main module and contains the core of the code; all other modules depend on it. Each datastore backend in Gora resides in its own module. The documentation for a specific module can be found in that module's documentation directory.

It is wise to start by going over the documentation for the gora-core module and then the specific datastore module(s) you want to use. The following modules are currently implemented in Gora.

We currently have modules under development for Oracle NoSQL and Apache Lucene.

Gora Testing

Gora currently has two testing mechanisms:

* JUnit Tests: These are included for every module which provides a DataStore within Gora.
* Integration Tests: A custom testing suite called GoraCI (Continuous Ingestion) which stress tests Gora functionality at scale.

JUnit Tests

Unit tests in Gora are implemented using the popular JUnit framework. Each module which implements the DataStore interface similarly implements the DataStoreTestBase API, which provides test utilities for DataStores. The DataStoreTestBase class delegates actual test execution to DataStoreTestUtil.

The tests begin in a fairly trivial fashion, testing functionality like datastore schema creation, schema deletion, etc., and get progressively more complex as they exercise more advanced features of the Gora API. In addition to the unit tests contained within this class, the best place to look for API functionality is the examples directories under the various Gora modules. Most modules contain a /src/examples/ directory under which some example classes can be found. Specifically, there are some classes that are used for tests under gora-core/src/examples/.

GoraCI Integration Testing Suite

Background

Since Gora 0.5, the GoraCI suite has been part of the mainstream Gora codebase.

Credit for GoraCI goes to Keith Turner (Gora PMC member) for his foresight in developing it; we have since extended it from gora-accumulo to the entire suite of Gora modules.

Apache Accumulo has a test suite that verifies that data is not lost at scale. This test suite is called continuous ingest.
Essentially the test runs many ingest clients that continually create linked lists containing 25 million nodes. At some point the clients are stopped and a map reduce job is run to ensure no linked list has a hole. A hole indicates data was lost.

The nodes in the linked list are random. This causes each linked list to spread across the table. Therefore if one part of a table loses data, then it will be detected by references in another part of the table.

This project is a version of the test suite written using Apache Gora [1]. Goraci has been tested against Accumulo and HBase.

The Anatomy of GoraCI tests

Below is a rough sketch of how data is written. For specific details, look at the Generator code.

  1. Write out 1 million nodes
  2. Flush the client
  3. Write out 1 million nodes that reference the previous million
  4. If this is the 25th set of 1 million nodes, then update the 1st set of 1 million to point to the last
  5. goto 1

The key is that nodes only reference flushed nodes. Therefore a node should never reference a missing node, even if the ingest client is killed at any point in time.
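
The write order described above can be modelled in a few lines. The following is a minimal in-memory sketch in Python, not the actual Generator code: the dict `store` stands in for the datastore table, `generate_loop` and its parameters are hypothetical names, the set sizes are scaled down, and pairing node j of one set with node j of the previous set is an assumption made for simplicity.

```python
import random


def generate_loop(store, batch_size, batches_per_loop, rng):
    """Write one closed set of linked nodes into `store` (key -> referenced key).

    Illustrative model only: `store` is a plain dict standing in for the
    datastore table, and a "flush" is the moment a batch is copied into it.
    """
    # Draw all keys up front so they are random but guaranteed unique.
    keys = rng.sample(range(2**32), batch_size * batches_per_loop)
    batches = [keys[i:i + batch_size] for i in range(0, len(keys), batch_size)]

    previous = None
    for batch in batches:
        pending = {}
        for j, key in enumerate(batch):
            target = previous[j] if previous is not None else None
            # Invariant from the text: only reference nodes already flushed.
            assert target is None or target in store
            pending[key] = target
        store.update(pending)  # flush: the whole batch becomes visible
        previous = batch

    # Final step: update the first set to point to the last (flushed) set.
    for j, key in enumerate(batches[0]):
        store[key] = previous[j]
    return batches


store = {}
batches = generate_loop(store, batch_size=4, batches_per_loop=25,
                        rng=random.Random(0))
```

With `batch_size=4` and `batches_per_loop=25` this writes 100 nodes. Because a batch is only referenced after it has been flushed, stopping the loop at any point leaves no reference to a node that was never written.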

When running this test suite with Accumulo, a script called the Agitator runs in parallel and randomly and continuously kills server processes.
Doing this uncovered many data loss bugs in Accumulo. This test suite can also help find bugs that impact uptime and stability when run for days or weeks.

This test suite consists of the following

When generating data, it's best to have each map task generate a multiple of 25 million nodes. The reason for this is that a circular linked list is completed every 25M nodes. Not generating a multiple of 25M will result in some nodes in the linked list not having references. The loss of an unreferenced node cannot be detected.
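
The arithmetic behind this advice is easy to check before launching a job. The helper names below are hypothetical and only illustrate the multiple-of-25-million rule; they are not part of goraci.

```python
LOOP_SIZE = 25_000_000  # nodes per circular linked list, per the text above


def nodes_outside_complete_loops(nodes_per_mapper):
    """Nodes a map task would write after its last complete circular list."""
    return nodes_per_mapper % LOOP_SIZE


def round_up_to_loop(nodes_per_mapper):
    """Smallest multiple of LOOP_SIZE that is >= nodes_per_mapper."""
    return -(-nodes_per_mapper // LOOP_SIZE) * LOOP_SIZE
```

For example, 100,000,000 nodes per map task is exactly four complete lists, whereas 30,000,000 would leave 5,000,000 nodes in a list that is never closed; rounding up to 50,000,000 avoids that.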

Building GoraCI

As GoraCI is packaged with the Gora master branch source it is automatically built every time you execute

mvn install

The Maven pom file has some profiles that attempt to make it easier to run GoraCI against different Gora backends by copying the jars you need into lib. Before packaging, it's important to edit gora.properties and set it correctly for your datastore. To run against Accumulo, do the following.

vim src/main/resources/gora.properties  # set Accumulo properties
mvn package -Paccumulo-1.4

To run against HBase, do the following.

vim src/main/resources/gora.properties  # set HBase properties
mvn package -Phbase-0.92

To run against Cassandra, do the following.

vim src/main/resources/gora.properties  # set Cassandra properties
mvn package -Pcassandra-1.1.2

For other datastores mentioned in gora.properties, you will need to copy the appropriate deps into lib. Feel free to update the pom with other profiles, open a ticket or just send us a pull request.

Java Class Description

Below is a description of the Java programs

goraci.sh is a helper script that you can use to run the above programs. It assumes all needed jars are in the lib dir. It does not need the package name, so you can just run goraci.sh Generator. Below is an example.

$ ./goraci.sh Generator
Usage : Generator <num mappers> <num nodes>

For Gora to work, it needs a gora.properties file and a gora-$datastore-mapping.xml mapping file on the classpath. The contents of both are datastore specific; more details can be found here [2]. You can edit the ones in src/main/resources and build the goraci-${version}-SNAPSHOT.jar with those. Alternatively, remove those and put them on the classpath through some other means.

Gora and Hadoop

Gora uses Apache Avro, which uses a JSON library (Jackson) that Hadoop ships an older version of. The two libraries jackson-core and jackson-mapper need to be updated in $HADOOP_HOME/lib and $HADOOP_HOME/share/hadoop/lib/. Currently these are updated to jackson-core-asl-1.4.2.jar and jackson-mapper-asl-1.4.2.jar. For details see HADOOP-6945.

GoraCI and HBase

To improve performance running read jobs such as the Verify step, enable scanner caching on the command line. For example:

$ ./goraci.sh Verify -Dhbase.client.scanner.caching=1000 \
   -Dmapred.map.tasks.speculative.execution=false verify_dir 1000

Depending on how your Hadoop and HBase setup is deployed, you may need to modify the goraci.sh script. Here is one suggestion that may help when your Hadoop and HBase configuration lives somewhere other than under the Hadoop and HBase home directories.

diff --git a/org.apache.gora.goraci.sh b/org.apache.gora.goraci.sh
index db1562a..31c3c94 100755
--- a/org.apache.gora.goraci.sh
+++ b/org.apache.gora.goraci.sh
@@ -95,6 +95,4 @@ done
 #run it
 export HADOOP_CLASSPATH="$CLASSPATH"
 LIBJARS=`echo $HADOOP_CLASSPATH | tr : ,`
-hadoop jar "$GORACI_HOME/lib/org.apache.gora.goraci-0.0.1-SNAPSHOT.jar" $CLASS -libjars "$LIBJARS" "$@"
-
-
+CLASSPATH="${HBASE_CONF_DIR}" hadoop --config "${HADOOP_CONF_DIR}" jar "$GORACI_HOME/lib/org.apache.gora.goraci-0.0.1-SNAPSHOT.jar" $CLASS -files "${HBASE_CONF_DIR}/hbase-site.xml" -libjars "$LIBJARS" "$@"

You will need to define HBASE_CONF_DIR and HADOOP_CONF_DIR before you run your goraci jobs. For example:

$ export HADOOP_CONF_DIR=/home/you/hadoop-conf
$ export HBASE_CONF_DIR=/home/you/hbase-conf
$ PATH=/home/you/hadoop-1.0.2/bin:$PATH ./goraci.sh Generator 1000 1000000

Concurrency

It's possible to run verification at the same time as generation. To do this, supply the -c option to Generator and Verify. This will cause Generator to create a secondary table which holds information about what verification can safely verify. Running Verify with the -c option will make it run slower because more information must be brought back to the client side for filtering purposes. The Loop program also supports the -c option, which will cause it to run verification concurrently with generation.

If verification is run at the same time as generation without the -c option, then it will inevitably fail. This is because verification mappers read different parts of the table at different times, giving an inconsistent view of the table. One mapper may read a part of the table before a node is written; when that node is later referenced by a node that another mapper reads, it will appear to be missing. The -c option basically filters out newer information using data written to the secondary table.
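
The failure mode described above can be reproduced with a toy model. This is a sketch under stated assumptions, not goraci code: the table is a dict mapping each node key to the key it references, `split_read_view` and `missing_references` are hypothetical helpers, and two "mappers" read the key ranges below and above 50 at different times.

```python
def split_read_view(table_early, table_late, in_first_part):
    """Combine two partial reads of a table taken at different times.

    Keys for which in_first_part(key) is true come from the early snapshot;
    all other keys come from the late snapshot.
    """
    view = {k: v for k, v in table_early.items() if in_first_part(k)}
    view.update({k: v for k, v in table_late.items() if not in_first_part(k)})
    return view


def missing_references(view):
    """Keys that some node in the view references but that the view lacks."""
    return sorted({ref for ref in view.values()
                   if ref is not None and ref not in view})


# Snapshot while generation is running: nodes 10 and 60 exist.
early = {10: None, 60: 10}
# Later, the generator has flushed node 20 and then node 70, which references it.
late = {10: None, 60: 10, 20: 60, 70: 20}

# The mapper for keys < 50 read early; the mapper for keys >= 50 read late.
inconsistent = split_read_view(early, late, lambda key: key < 50)
```

Here `missing_references(inconsistent)` reports node 20 as missing even though `missing_references(late)` is empty, so the apparent hole comes from the timing of the reads and not from lost data.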

Conclusions

This test suite does not do everything that the Accumulo test suite does; mainly, it does not collect statistics and generate reports. The reports are useful for assessing performance.

The example below runs a test of the test: ingest one linked list, delete a node in it, and ensure the verification map reduce job notices that the node is missing. Not all output is shown, just the important parts.

$ ./goraci.sh Generator  1 25000000
$ ./goraci.sh Print -s 2000000000000000 -l 1
  2000001f65dbd238:30350f9ae6f6e8f7:000004265852:ef09f9dd-75b1-4c16-9f14-0fa84f3029b6
$ ./goraci.sh Print -s 30350f9ae6f6e8f7 -l 1
  30350f9ae6f6e8f7:4867fe03de6ea6c8:000003265852:ef09f9dd-75b1-4c16-9f14-0fa84f3029b6
$ ./goraci.sh Delete 30350f9ae6f6e8f7
  Delete returned true
$ ./goraci.sh Verify gci_verify_1 2 
  11/12/20 17:12:31 INFO mapred.JobClient:   org.apache.gora.goraci.Verify$Counts
  11/12/20 17:12:31 INFO mapred.JobClient:     UNDEFINED=1
  11/12/20 17:12:31 INFO mapred.JobClient:     REFERENCED=24999998
  11/12/20 17:12:31 INFO mapred.JobClient:     UNREFERENCED=1
$ hadoop fs -cat gci_verify_1/part\*
  30350f9ae6f6e8f7   2000001f65dbd238

The map reduce job found the one undefined node and gave the node that referenced it.
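
The classification that Verify reports can be sketched on a small in-memory table. This is an illustrative model, not the map reduce job itself: `verify` is a hypothetical function, the table is a dict from node key to referenced key, and the counter names are taken from the job output above.

```python
from collections import defaultdict


def verify(table):
    """Classify every key that is defined in, or referenced from, `table`.

    Returns (counts, undefined), where `undefined` maps each missing key to
    the sorted list of keys that reference it.
    """
    referrers = defaultdict(list)
    for key, ref in table.items():
        if ref is not None:
            referrers[ref].append(key)

    counts = {"UNDEFINED": 0, "REFERENCED": 0, "UNREFERENCED": 0}
    undefined = {}
    for key in table.keys() | referrers.keys():
        refs = referrers.get(key, [])
        if key not in table:
            # Referenced but never found: this is a hole, i.e. lost data.
            counts["UNDEFINED"] += 1
            undefined[key] = sorted(refs)
        elif refs:
            counts["REFERENCED"] += 1
        else:
            counts["UNREFERENCED"] += 1
    return counts, undefined


# A ring of 10 nodes where node i references node i - 1, then one deletion.
ring = {i: (i - 1) % 10 for i in range(10)}
del ring[3]
counts, undefined = verify(ring)
```

Deleting one node from the ring yields UNDEFINED=1, UNREFERENCED=1 and REFERENCED=8, the same pattern as the 25 million node run above (1, 1 and 24999998). The `undefined` mapping plays the role of the part file: it pairs the missing node 3 with node 4, which references it.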

Below are some timing statistics for running Goraci on a 10 node cluster.

Store           | Task                   | Time    | Undef  | Unref | Ref        
----------------+------------------------+---------+--------+-------+------------
accumulo-1.4.0  | Generator 10 100000000 | 40m 16s |    N/A |   N/A |        N/A     
accumulo-1.4.0  | Verify /tmp/goraci1 40 |  6m  7s |      0 |     0 | 1000000000  
hbase-0.92.1    | Generator 10 100000000 |  2h 44m |    N/A |   N/A |        N/A     
hbase-0.92.1    | Verify /tmp/goraci2 40 |  6m 34s |      0 |     0 | 1000000000

HBase and Accumulo are configured differently out-of-the-box. We used the Accumulo 3G, native configuration examples in the conf/examples directory.

To provide a comparable memory footprint, we increased the HBase JVM heap to "-Xmx4000m", and turned on compression for the ci table:

create 'ci', {NAME=>'meta', COMPRESSION=>'GZ'}

We also turned down the replication of write-ahead logs to be comparable to Accumulo:

<property>
  <name>hbase.regionserver.hlog.replication</name>
  <value>2</value>
</property>

For the Accumulo run, we set the split threshold to 512M:

shell> config -t ci -s table.split.threshold=512M

This was done so that Accumulo would end up with 64 tablets, which is the number of regions HBase had. The number of tablets/regions determines how much parallelism there is in the map phase of the verify step.

Sometimes, when this test suite is run against HBase, data is lost. This issue is being tracked under HBASE-5754.