Motivation

A KijiPivoter scans the rows from an input Kiji table and writes cells into an output Kiji table. A pivoter is subject to no restriction regarding the table, row or column being written to. As the name suggests, a Kiji pivoter's best use-case is to perform a pivot, eg. generate a reverse look-up table, but are not limited to that. A Kiji pivoter is a map-only job : if a reduce operation is needed, use a gatherer combined with a reducer.

Classes Overview

The main classes around Kiji pivoters are:

  • org.kiji.mapreduce.pivot.KijiPivoter is the abstract base class Kiji pivoters must extend.
  • org.kiji.mapreduce.KijiTableContext allows Kiji pivoters to emit cells into the configured output table while accessing KeyValue stores.
  • org.kiji.mapreduce.pivot.KijiPivotJobBuilder is a programmatic builder and launcher for Kiji pivoter jobs. Pivoter jobs can also be launched using the command-line tool kiji pivot.

Using the API

A KijiPivoter must extend the abstract base class KijiPivoter and define the following methods:

  • KijiDataRequest getDataRequest() : let the pivoter specify which data to request from the configured input Kiji table.
  • void produce(KijiRowData row, KijiTableContext context) : this method is invoked by the KijiMR framework to process one single input row from the configured input Kiji table. The row content is available through the row parameter; the pivoter may use the content from this row to emit cells to write to the configured output Kiji table using the context parameter.

As mentioned above, the KijiTableContext let the pivoter task write to the configured output Kiji table as follows:

  • EntityId getEntityId(Object... components) lets the pivoter create entity IDs to identify the rows of the configured output table to write to.
  • void put(EntityId entityId, String family, String qualifier, [long timestamp], T value) writes a single cell to the row with the specified entity ID, into the column specified by family:qualifier. The timestamp parameter is optional; when omitted, the current time (HConstants.LATEST_TIMESTAMP) is used. The cell value must be compatible with the layout of the column it is written to.

Optionally, a pivoter may implement setup() and cleanup() to initialize and finalize resources that can be reused across the many invocations of produce():

  • setup() is invoked exactly once per task before processing any input row;
  • cleanup() is invoked exactly once per task after it processed all its rows.

A pivoter may also request external KeyValueStore by implementing the getRequiredStore() method. KeyValue store readers are usually opened from the setup() method using context.getStore(storeName). For more details, you may check the KeyValue Stores section in this guide.

Example:

/**
 * Example of a trivial pivoter.
 *
 * Reads an input table keyed on user login names and containing a column "info:email",
 * and writes to an output table keyed on email addresses with a column "info:login".
 */
public class PivoterExample extends KijiPivoter {
  /** {@inheritDoc} */
  @Override
  public KijiDataRequest getDataRequest() {
    // Request all columns in family "info" from the input Kiji table:
    return KijiDataRequest.create("info");
  }

  /** {@inheritDoc} */
  @Override
  public void produce(KijiRowData row, KijiTableContext context)
      throws IOException {
    final String login = row.getEntityId().getComponentByIndex(0);
    final String email = row.getMostRecentValue("info", "email");

    final EntityId eid = context.getEntityId(email);
    context.put(eid, "info", "login", login);
  }
}

The example pivoter can be run from the command-line as follows:

kiji pivot \
    --pivoter='package.PivoterExample' \
    --input="format=kiji table=kiji://.env/default/input_table" \
    --output="format=kiji table=kiji://.env/default/output_table nsplits=5" \
    --lib=/path/to/libdir/

The pivoter command-line argument specifies the fully-qualified name of the KijiPivoter class to run. The input argument specifies the input Kiji table to read from; the output argument specifies the output Kiji table to write to. In this example, both the input and the output table belong to the same Kiji instance kiji://.env/default. Optionally, the lib argument specifies the path of a directory that contains jar files necessary for the KijiPivoter class.