Working with Avro
All data stored in Kiji cells is serialized and deserialized using Apache Avro. Each logical unit of data in Avro has a type; the type of a datum is called a schema. A schema may be a simple primitive such as an integer or string, or it may be a composition of other schemas, such as an array or record.
Avro data can be serialized and deserialized by several programming languages into types appropriate for that language. In Java, for example, data with an Avro INT schema is manifested as a java.lang.Integer object, and a MAP schema is manifested as a java.util.Map. The full mapping from Avro schemas to Java types can be found in the Avro documentation.
Using Avro with KijiRowData
When implementing a gatherer's gather() method or a producer or bulk importer's produce() method, use the KijiRowData object to read data from the current Kiji table row. Avro serialization is taken care of for you; a call to getValue() or getMostRecentValue() automatically returns the type specified in the table layout. For example, to read an Avro string value from the most recent version of the info:name column, call KijiRowData.getMostRecentValue("info", "name"). The value is returned as a java.lang.CharSequence. If you are reading a cell with a complex compound schema, KijiSchema returns the corresponding Avro-generated Java object type.
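For instance, a gather() method might read that value as follows (a minimal sketch, reusing the gatherer signature shown later in this section; the info:name column is assumed to be declared with an Avro "string" schema):
@Override
protected void gather(KijiRowData input, GathererContext context)
    throws IOException, InterruptedException {
  // KijiRowData handles Avro deserialization: an Avro "string" column
  // is returned as a java.lang.CharSequence.
  final CharSequence name = input.getMostRecentValue("info", "name");
  // ... use the name to produce output ...
}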
To write typed data into a Kiji cell from your producer or bulk importer's produce() method, use the context passed into KijiProducer's produce() method. The put() method is overloaded to accept a variety of Java types, including primitives and Avro types. Serialization is handled for you, so you can pass a complex Avro object directly to put(). For example, to write a custom Address Avro record type:
// Write an Address record to the info:address column of the row
// identified by the entity ID for "Abraham Lincoln".
final EntityId user = table.getEntityId("Abraham Lincoln");
final Address addr = new Address();
addr.setAddr1("1600 Pennsylvania Avenue");
addr.setCity("Washington");
addr.setState("DC");
addr.setZip("20500");
context.put(user, "info", "address", addr);
Note that the type of the value passed to put() must be compatible with the schema registered for the column in the Kiji table layout.
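Primitive values work the same way. For instance, assuming a hypothetical info:age column declared with an Avro "int" schema:
// Hypothetical column info:age with an Avro "int" schema; the
// overloaded put() accepts the Java primitive directly.
context.put(user, "info", "age", 42);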
Using Avro in MapReduce
You may find it useful to read and write Avro data between your mappers and reducers. Jobs run by Kiji can use Avro data for MapReduce keys and values. To use Avro data as your gatherer, mapper, or reducer's output key, use the org.apache.avro.mapred.AvroKey class. You must also specify the writer schema for your key by implementing the org.kiji.mapreduce.AvroKeyWriter interface. For example, to output an Integer key from a gatherer:
public class MyAvroGatherer
    extends KijiGatherer&lt;AvroKey&lt;Integer&gt;, Text&gt;
    implements AvroKeyWriter {
  // ...

  @Override
  protected void gather(KijiRowData input, GathererContext context)
      throws IOException, InterruptedException {
    // ...
    context.write(new AvroKey&lt;Integer&gt;(5), new Text("myvalue"));
  }

  @Override
  public Schema getAvroKeyWriterSchema(Configuration conf) throws IOException {
    // Avro's integer primitive is Schema.Type.INT.
    return Schema.create(Schema.Type.INT);
  }
}
Likewise, an org.apache.avro.mapred.AvroValue may be used for Avro data as the output value; implement the AvroValueWriter interface to specify the writer schema. To use Avro data as your bulk importer, mapper, or reducer's input key or value, wrap it in an AvroKey (or AvroValue for values) and implement AvroKeyReader (or AvroValueReader) to specify the reader schema.
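For example, a reducer consuming the integer keys produced by the gatherer above might declare its reader schema like this (a minimal sketch; the KijiReducer type parameters and the assumption that getAvroKeyReaderSchema() mirrors the signature of the writer method shown above are illustrative):
public class MyAvroReducer
    extends KijiReducer&lt;AvroKey&lt;Integer&gt;, Text, Text, Text&gt;
    implements AvroKeyReader {
  // ...

  @Override
  public Schema getAvroKeyReaderSchema(Configuration conf) throws IOException {
    // Must match the writer schema declared by the upstream gatherer.
    return Schema.create(Schema.Type.INT);
  }
}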