Overview

This is the main documentation for the DataStores contained within the gora-core module, which (as its name implies) holds most of the core functionality for the Gora project.

Every module in Gora depends on gora-core, so most of the generic documentation about the project is gathered here, along with the documentation for AvroStore, DataFileAvroStore and MemStore. In addition, gora-core holds all of the core MapReduce, GoraSparkEngine, Persistency, Query, DataStoreBase and Utility functionality.

AvroStore

Description

AvroStore can be used for binary-compatible Avro serializations. It supports Binary and JSON serializations.

gora.properties

Property Key: gora.datastore.default
Property Value: org.apache.gora.avro.store.AvroStore
Required: Yes
Description: Implementation of the persistent Java storage class.

Property Key: gora.avrostore.input.path
Property Value: hdfs://uri/path/to/hdfs/input/path or file:///uri/path/to/local/input/path
Required: Yes
Description: This value should point to the input directory on HDFS (if running Gora in a distributed Hadoop environment) or to an input directory on the local file system (if running Gora locally).

Property Key: gora.avrostore.output.path
Property Value: hdfs://uri/path/to/hdfs/output/path or file:///uri/path/to/local/output/path
Required: Yes
Description: This value should point to the output directory on HDFS (if running Gora in a distributed Hadoop environment) or to an output directory on the local file system (if running Gora locally).

Property Key: gora.avrostore.codec.type
Property Value: BINARY or JSON
Required: No
Description: The Avro encoder/decoder type to use. Can take the value BINARY or JSON; defaults to BINARY if none is supplied.
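Putting these together, a gora.properties for AvroStore might look like the following (the local paths are placeholders to adapt to your environment):

gora.datastore.default=org.apache.gora.avro.store.AvroStore
gora.avrostore.input.path=file:///tmp/gora/input
gora.avrostore.output.path=file:///tmp/gora/output
gora.avrostore.codec.type=BINARY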

AvroStore XML mappings

In the stores covered within the gora-core module, no physical mappings are required.
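As a minimal usage sketch (not a definitive recipe), assuming the gora.properties above and a Gora-compiled persistent class such as Pageview from gora-tutorial:

import org.apache.hadoop.conf.Configuration;
import org.apache.gora.store.DataStore;
import org.apache.gora.store.DataStoreFactory;

// Obtain the store declared by gora.datastore.default in gora.properties.
DataStore<Long, Pageview> store =
    DataStoreFactory.getDataStore(Long.class, Pageview.class, new Configuration());

store.put(1L, new Pageview()); // assumed no-arg constructor on the generated class
store.flush();                 // flush pending records to gora.avrostore.output.path
store.close();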

DataFileAvroStore

Description

DataFileAvroStore is a file-based store which extends AvroStore to use Avro's DataFileWriter and DataFileReader as its backend. This datastore supports MapReduce.

gora.properties

DataFileAvroStore would be configured exactly the same as AvroStore above, with the following exception:

Property Key: gora.datastore.default
Property Value: org.apache.gora.avro.store.DataFileAvroStore
Required: Yes
Description: Implementation of the persistent Java storage class.
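In gora.properties terms, only the store class changes from the AvroStore example above (the avrostore path properties are reused, since the store is otherwise configured identically):

gora.datastore.default=org.apache.gora.avro.store.DataFileAvroStore
gora.avrostore.input.path=file:///tmp/gora/input
gora.avrostore.output.path=file:///tmp/gora/output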

DataFileAvroStore XML mappings

In the stores covered within the gora-core module, no physical mappings are required.

MemStore

Description

Essentially this store is a ConcurrentSkipListMap: datastore operations (get, put, delete) run directly against the underlying map, which makes MemStore well suited to tests and prototyping, as sketched below.
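A minimal sketch of typical use, again assuming the Pageview class from gora-tutorial; the class-based DataStoreFactory overload is used here so that MemStore is selected explicitly rather than via gora.properties:

import org.apache.hadoop.conf.Configuration;
import org.apache.gora.memory.store.MemStore;
import org.apache.gora.store.DataStore;
import org.apache.gora.store.DataStoreFactory;

DataStore<Long, Pageview> store =
    DataStoreFactory.getDataStore(MemStore.class, Long.class, Pageview.class, new Configuration());

store.put(1L, new Pageview());   // backed by ConcurrentSkipListMap.put
Pageview cached = store.get(1L); // backed by ConcurrentSkipListMap.get
store.delete(1L);                // backed by ConcurrentSkipListMap.remove
store.close();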

gora.properties

MemStore would be configured exactly the same as AvroStore above, with the following exception:

Property Key: gora.datastore.default
Property Value: org.apache.gora.memory.store.MemStore
Required: Yes
Description: Implementation of the Java class used to hold data in memory.

MemStore XML mappings

In the stores covered within the gora-core module, no physical mappings are required.

GoraSparkEngine

Description

GoraSparkEngine is the Spark backend of Gora. Assume that the input and output data stores are:

DataStore<K1, V1> inStore;
DataStore<K2, V2> outStore;

The first step in using GoraSparkEngine is to initialize it:

GoraSparkEngine<K1, V1> goraSparkEngine = new GoraSparkEngine<>(K1.class, V1.class);

Construct a JavaSparkContext and register the input data store's value class as a Kryo class:

SparkConf sparkConf = new SparkConf().setAppName("Gora Spark Integration Application").setMaster("local");
Class[] c = new Class[1];
c[0] = inStore.getPersistentClass();
sparkConf.registerKryoClasses(c);
JavaSparkContext sc = new JavaSparkContext(sparkConf);

A JavaPairRDD can then be retrieved from the input data store:

JavaPairRDD<Long, Pageview> goraRDD = goraSparkEngine.initialize(sc, inStore);

After that, all Spark functionality can be applied. For example, a count can be run as follows:

long count = goraRDD.count();

Map and reduce functions can be run on a JavaPairRDD as well. Assume that this is the resulting variable after a map/reduce step has been applied (a sketch of such a step follows the declaration):

JavaPairRDD<String, MetricDatum> mapReducedGoraRdd;
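For illustration only, such a step could look like the following sketch, which counts page views per URL; the Pageview and MetricDatum accessors are assumed from the gora-tutorial generated classes:

import scala.Tuple2;

JavaPairRDD<String, Long> counts = goraRDD
    .mapToPair(pair -> new Tuple2<>(pair._2().getUrl().toString(), 1L)) // key each record by URL
    .reduceByKey((a, b) -> a + b);                                      // count views per URL

mapReducedGoraRdd = counts.mapToPair(t -> {
  MetricDatum datum = new MetricDatum();
  datum.setMetricDimension(t._1()); // assumed setters on the generated class
  datum.setMetric(t._2());
  return new Tuple2<>(t._1(), datum);
});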

The result can then be written back to the output store as follows:

Configuration sparkHadoopConf = goraSparkEngine.generateOutputConf(outStore);
mapReducedGoraRdd.saveAsNewAPIHadoopDataset(sparkHadoopConf);