This is the main documentation for the DataStores contained within the gora-core module, which (as its name implies) holds most of the core functionality for the Gora project. Every module in Gora depends on gora-core, so most of the generic documentation about the project is gathered here, as well as the documentation for AvroStore, DataFileAvroStore and MemStore. In addition, gora-core holds all of the core MapReduce, GoraSparkEngine, Persistency, Query, DataStoreBase and Utility functionality.
AvroStore can be used for binary-compatible Avro serialization. It supports both Binary and JSON encodings.
Property Key | Property Value | Required | Description |
---|---|---|---|
gora.datastore.default= | org.apache.gora.avro.store.AvroStore | Yes | Implementation of the persistent Java storage class |
gora.avrostore.input.path= | *hdfs://uri/path/to/hdfs/input/path* or *file:///uri/path/to/local/input/path* | Yes | This value should point to the input directory on HDFS (if running Gora in a distributed Hadoop environment) or to an input directory on the local file system (if running Gora locally). |
gora.avrostore.output.path= | *hdfs://uri/path/to/hdfs/output/path* or *file:///uri/path/to/local/output/path* | Yes | This value should point to the output directory on HDFS (if running Gora in a distributed Hadoop environment) or to an output directory on the local file system (if running Gora locally). |
gora.avrostore.codec.type= | BINARY or JSON | No | Specifies the Avro encoder/decoder type to use. Can take the value BINARY or JSON; defaults to BINARY if one is not supplied. |
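Putting these properties together, a minimal `gora.properties` configuring AvroStore might look like the following sketch (the local paths are placeholders, not real locations):

```properties
gora.datastore.default=org.apache.gora.avro.store.AvroStore
gora.avrostore.input.path=file:///tmp/gora/input
gora.avrostore.output.path=file:///tmp/gora/output
gora.avrostore.codec.type=BINARY
```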
In the stores covered within the gora-core module, no physical mappings are required.
DataFileAvroStore is a file-based store which extends `AvroStore` to use Avro's `DataFileWriter` and `DataFileReader` classes as a backend.
This datastore supports MapReduce.
DataFileAvroStore is configured exactly the same as AvroStore above, with the following exception:
Property Key | Property Value | Required | Description |
---|---|---|---|
gora.datastore.default= | org.apache.gora.avro.store.DataFileAvroStore | Yes | Implementation of the persistent Java storage class |
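As a sketch, the `gora.properties` for DataFileAvroStore carries over the AvroStore settings unchanged, apart from the store class (the paths below are placeholders):

```properties
gora.datastore.default=org.apache.gora.avro.store.DataFileAvroStore
gora.avrostore.input.path=file:///tmp/gora/input
gora.avrostore.output.path=file:///tmp/gora/output
```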
In the stores covered within the gora-core module, no physical mappings are required.
MemStore is an in-memory data store. Essentially, this store is backed by a `ConcurrentSkipListMap`, onto which store operations are mapped directly.
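As an illustrative sketch (not Gora's actual implementation; the class and method names below are hypothetical), the mapping from store operations onto a `ConcurrentSkipListMap` looks roughly like this:

```java
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch of an in-memory store delegating its operations
// to a ConcurrentSkipListMap, in the spirit of MemStore.
public class InMemoryStoreSketch<K extends Comparable<K>, V> {
    private final ConcurrentSkipListMap<K, V> map = new ConcurrentSkipListMap<>();

    public void put(K key, V value) { map.put(key, value); }            // put -> Map.put
    public V get(K key)             { return map.get(key); }            // get -> Map.get
    public boolean delete(K key)    { return map.remove(key) != null; } // delete -> Map.remove

    public static void main(String[] args) {
        InMemoryStoreSketch<Long, String> store = new InMemoryStoreSketch<>();
        store.put(1L, "row1");
        System.out.println(store.get(1L));    // prints "row1"
        System.out.println(store.delete(1L)); // prints "true"
    }
}
```

`ConcurrentSkipListMap` keeps keys sorted, which is what makes range queries over a key interval cheap for an in-memory store.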
MemStore is configured exactly the same as AvroStore above, with the following exception:
Property Key | Property Value | Required | Description |
---|---|---|---|
gora.datastore.default= | org.apache.gora.memory.store.MemStore | Yes | Implementation of the Java class used to hold data in memory |
In the stores covered within the gora-core module, no physical mappings are required.
GoraSparkEngine is the Spark backend of Gora. Assume that the input and output data stores are:

```java
DataStore<K1, V1> inStore;
DataStore<K2, V2> outStore;
```
The first step in using GoraSparkEngine is to initialize it:

```java
GoraSparkEngine<K1, V1> goraSparkEngine = new GoraSparkEngine<>(K1.class, V1.class);
```
Construct a `JavaSparkContext`, registering the input data store's value class as a Kryo class:

```java
SparkConf sparkConf = new SparkConf().setAppName("Gora Spark Integration Application").setMaster("local");
Class[] c = new Class[1];
c[0] = inStore.getPersistentClass();
sparkConf.registerKryoClasses(c);
JavaSparkContext sc = new JavaSparkContext(sparkConf);
```
A `JavaPairRDD` can then be retrieved from the input data store:

```java
JavaPairRDD<Long, Pageview> goraRDD = goraSparkEngine.initialize(sc, inStore);
```
After that, all Spark functionality can be applied. For example, a count can be run as follows:

```java
long count = goraRDD.count();
```
Map and reduce functions can be run on a `JavaPairRDD` as well. Assume that this is the variable after map/reduce has been applied:

```java
JavaPairRDD<String, MetricDatum> mapReducedGoraRdd;
```
The result can be written back to the output store as follows:

```java
Configuration sparkHadoopConf = goraSparkEngine.generateOutputConf(outStore);
mapReducedGoraRdd.saveAsNewAPIHadoopDataset(sparkHadoopConf);
```