KijiExpress is a set of tools for defining MapReduce data processing jobs quickly and expressively, particularly for data stored in Kiji tables.

KijiExpress jobs are written in the Scala programming language, which gives you access to Java libraries and tools but is more concise and easier to write than Java. KijiExpress also includes the Scalding library, a Twitter-sponsored open-source library for authoring flows of analytics-focused MapReduce jobs, which lets you build complex MapReduce data pipelines. Finally, KijiExpress is integrated with Avro to give you access to complex records and data types in your data transformation pipelines.

The core functionality of KijiExpress currently provides developers with tools for manipulating data in pipelines. These tools offer considerably more flexibility than writing MapReduce jobs directly in Java.
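As a rough illustration of the concise pipeline style that Scala makes possible, the sketch below counts words using only the Scala standard library. It does not use Scalding or KijiExpress classes, and the names WordCountSketch and wordCounts are hypothetical; it only mirrors the map, group, and reduce shape of a MapReduce job.

```scala
// Plain Scala collections only: no Scalding or KijiExpress classes are used here.
object WordCountSketch {
  /** Counts word occurrences across lines, ignoring case and empty tokens. */
  def wordCounts(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split("\\s+")) // "map" side: emit one record per word
      .filter(_.nonEmpty)                   // drop empty tokens left by leading whitespace
      .groupBy(identity)                    // "shuffle": bring equal words together
      .map { case (word, occurrences) => word -> occurrences.size } // "reduce" side
}
```

A real Scalding or KijiExpress job would express the same steps against distributed data rather than an in-memory Seq.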

Using this Document

The first section of this document describes how to set up your environment.

The sections that follow cover data concepts in Kiji, an introduction to Scala and Scalding, and an example KijiExpress job that demonstrates and explains the Scalding concepts in action.

If you are already familiar with Scala and Scalding, you can skip directly to the final sections, which cover functionality specific to KijiExpress: KijiExpress sources and data flow operations in KijiExpress. The last section explains how to run KijiExpress jobs.


One benefit of KijiExpress is that a model implemented in KijiExpress can be deployed in a production enterprise software environment without being redeveloped in Java.

A separate project, the KijiModeling project, is the framework for writing and deploying models using KijiExpress. The project is currently under development.

Combined with KijiScoring, models developed in KijiModeling can return results within the time frames demanded by user interactions on the web. The model lifecycle allows you to manage all of the phases of generating and maintaining models and data to support an application that uses machine learning to produce better results.

KijiModeling Roadmap Overview

More documentation on KijiModeling will be available as it stabilizes.

Note that not all KijiModeling functionality is implemented at this time. This roadmap outlines the full planned functionality.

KijiModeling is an evolving project that aims to provide a development environment for generating, training, and verifying predictive models, and to support data operations that make real-time application of a model as efficient as possible. The planned functionality of KijiModeling will roll out over time. These plans include support for a model lifecycle with the following phases:

  • Preparation. The model lifecycle includes a prepare phase for executing offline computations whose results can be used when extracting features in real time. In production, the prepare phase might be executed in batch on a periodic basis, so that its results are up to date when features are extracted.

  • Data Extraction. The extract phase transforms the raw, entity-centric data stored for an entity into the features for that entity that other model lifecycle phases use.

  • Scoring. The score phase defines how an entity's data is combined with a trained model to produce a score.

  • Training. The train phase executes offline procedures that generate data required to score using the model. Training accepts features extracted for a subset of entities (a training set) and uses them with the model and entity data to produce entity-specific results.

  • Evaluation. Once you have defined how a model is trained and used to score, you need to evaluate these procedures for accuracy. Often, this involves scoring a subset of entities different from the training set and comparing the scores produced against some known ground truth.
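To make the lifecycle phases above more concrete, here is a minimal sketch in plain Scala. It is not the KijiModeling API: every name in it (LifecycleSketch, Entity, extract, train, score, evaluate) is hypothetical, the "model" is assumed to be a single threshold, and the prepare phase is omitted for brevity.

```scala
// Hypothetical names throughout; this is not the KijiModeling API.
object LifecycleSketch {
  // Entity-centric raw data: one record per entity, holding its purchase amounts.
  final case class Entity(id: String, purchases: Seq[Double])

  // Extract: turn an entity's raw data into a single feature (mean purchase amount).
  def extract(e: Entity): Double =
    if (e.purchases.isEmpty) 0.0 else e.purchases.sum / e.purchases.size

  // Train: offline step over a training set. The assumed "model" is a threshold
  // equal to the mean feature value across the training entities.
  def train(trainingSet: Seq[Entity]): Double = {
    val features = trainingSet.map(extract)
    if (features.isEmpty) 0.0 else features.sum / features.size
  }

  // Score: combine one entity's extracted feature with the trained model.
  def score(model: Double, e: Entity): Boolean = extract(e) > model

  // Evaluate: fraction of held-out entities whose score matches the ground truth.
  def evaluate(model: Double, heldOut: Seq[(Entity, Boolean)]): Double =
    if (heldOut.isEmpty) 0.0
    else heldOut.count { case (e, truth) => score(model, e) == truth }.toDouble / heldOut.size
}
```

In this sketch, train runs once over the training set, score can then be called per entity at request time, and evaluate checks the scores on entities that were not used for training.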

These high-level modeling stages rely on data transformations and flow, from importing data into Kiji tables from HDFS or other sources, to reading data from Kiji tables, to performing operations on the data, to writing data out to Kiji tables.