Install Kiji BentoBox

If you don't have a working environment yet, install the standalone Kiji BentoBox in three quick steps!

Start a Kiji Cluster

  • If you plan to use a BentoBox, run the following commands to set BentoBox-related environment variables and start the Bento cluster:
cd <path/to/bento>
source bin/kiji-env.sh
bento start

After the BentoBox starts, it displays a list of useful ports for cluster webapps and services. The MapReduce JobTracker webapp (http://localhost:50030) in particular will be useful for this tutorial.

  • If you are running Kiji without a BentoBox, there are a few things you'll need to do to make sure your environment behaves the same way as a BentoBox:

Starting Kiji in Non-BentoBox Systems

  1. Make sure HDFS is installed and started.
  2. Make sure MapReduce is installed, that HADOOP_HOME is set to your MR distribution, and that MapReduce is started.
  3. Make sure HBase is installed, that HBASE_HOME is set to your HBase distribution, and that HBase is started.
  4. Export KIJI_HOME to the root of your Kiji distribution.
  5. Export PATH=${PATH}:${KIJI_HOME}/bin.
  6. Export EXPRESS_HOME to the root of your kiji-express distribution.
  7. Export PATH=${PATH}:${EXPRESS_HOME}/bin.

Whenever the tutorial refers to the BentoBox, you'll need to perform the equivalent action on your own Kiji cluster.
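The numbered steps above leave several environment variables to set by hand, and a missing one tends to surface later as a confusing error. The following is a minimal sketch of a sanity check, not part of Kiji itself: the function name check_kiji_env is hypothetical, and it only tests that the four variables named in the steps are non-empty, not that they point at working installations.

```shell
# Hypothetical helper (not shipped with Kiji): report which of the variables
# from the numbered steps are still unset or empty.
# Prints each missing name on its own line; returns non-zero if any are missing.
# Written for plain POSIX sh, so the loop variables are global.
check_kiji_env() {
  missing=0
  for name in HADOOP_HOME HBASE_HOME KIJI_HOME EXPRESS_HOME; do
    # Look up the value of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "$name"
      missing=1
    fi
  done
  return "$missing"
}
```

After exporting the variables, running check_kiji_env should print nothing and succeed; any names it prints still need to be exported.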

Set Tutorial-Specific Environment Variables

  • Define an environment variable named KIJI that holds a Kiji URI to the Kiji instance we'll use during this tutorial:
export KIJI=kiji://.env/kiji_express_music

The code for this tutorial is located in the ${KIJI_HOME}/examples/express-music/ directory. Commands in this tutorial will depend on this location.

  • Set a variable for the tutorial location:
export MUSIC_EXPRESS_HOME=${KIJI_HOME}/examples/express-music

Install Kiji

  • Install your Kiji instance:
kiji install --kiji=${KIJI}

Create Tables

The file music-schema.ddl defines table layouts that are used in this tutorial:

music-schema.ddl

  • Create the Kiji music tables whose layouts are described in music-schema.ddl:
${KIJI_HOME}/schema-shell/bin/kiji-schema-shell --kiji=${KIJI} --file=${MUSIC_EXPRESS_HOME}/music-schema.ddl

This command uses kiji-schema-shell to create the tables using the KijiSchema DDL, which makes specifying table layouts easy. See the KijiSchema DDL Shell reference for more information on the KijiSchema DDL.

  • Verify the Kiji music tables were correctly created:
kiji ls ${KIJI}

You should see the newly-created songs and users tables:

kiji://localhost:2181/kiji_express_music/songs
kiji://localhost:2181/kiji_express_music/users
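If you would rather script this check than read the listing by eye, something like the sketch below may help. The function name has_tables is hypothetical, and it assumes that kiji ls prints one table URI per line with the table name as the last path component, as in the listing above.

```shell
# Hypothetical helper: read a table listing on stdin and succeed only if every
# table name passed as an argument ends some line of that listing.
has_tables() {
  listing="$(cat)"
  for table in "$@"; do
    # Match the name as the final path component of a URI.
    printf '%s\n' "$listing" | grep -q "/${table}\$" || return 1
  done
  return 0
}

# Intended use (requires a running Kiji instance, so it is left commented out):
#   kiji ls ${KIJI} | has_tables songs users && echo "tables present"
```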

Upload Data to HDFS

HDFS stands for Hadoop Distributed File System. If you are running the BentoBox, HDFS runs on your machine on top of your native filesystem. This tutorial demonstrates loading data from HDFS into Kiji tables, which is a typical first step when creating KijiExpress applications.

  • Upload the data set to HDFS:
hadoop fs -mkdir express-tutorial
hadoop fs -copyFromLocal ${MUSIC_EXPRESS_HOME}/example_data/*.json express-tutorial/
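Before uploading, it can be useful to confirm that the *.json glob actually matches something locally, for example if MUSIC_EXPRESS_HOME was mistyped. The sketch below is one way to do that; list_json_files is a hypothetical name, and the function only inspects the local directory without touching HDFS.

```shell
# Hypothetical helper: print the local .json files in the given directory,
# one per line, and return non-zero if there are none.
list_json_files() {
  dir="$1"
  found=1
  for f in "$dir"/*.json; do
    # When the glob matches nothing, the shell leaves the pattern as-is; skip it.
    [ -e "$f" ] || continue
    printf '%s\n' "$f"
    found=0
  done
  return "$found"
}

# Intended use before the upload above (left commented out):
#   list_json_files "${MUSIC_EXPRESS_HOME}/example_data" || echo "no JSON files found"
```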

You're now ready for the next step, Importing Data.

Kiji Administration Quick Reference

Here are some of the Kiji commands introduced on this page and a few more useful ones:

  • Start a BentoBox Cluster:
cd <path/to/bento>
source bin/kiji-env.sh
bento start
  • Stop your BentoBox Cluster:
bento stop
  • Install a Kiji instance:
kiji install --kiji=<URI/of/instance>

The URI takes the form:

kiji://.env/<instance name>
  • Running compiled KijiExpress jobs

To run a KijiExpress job, you invoke a command of the following form:

express.py job \
    --user-jar=path/to/jar/containing/job \
    --job-name=org.MyKijiApp.MyJob \
    [--libjars=<list of JAR files, separated by colon>] \
    [--mode=local|hdfs] \
    [job-specific options]

The --mode=hdfs flag indicates that KijiExpress should run the job against the Hadoop cluster rather than in Cascading's local environment. The --libjars flag lists additional JAR files needed to run the command.
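Because --libjars expects a single colon-separated list, building it by hand gets tedious with more than a couple of JARs. Here is a small sketch of a helper that joins its arguments with colons; join_libjars is a hypothetical name, and the JAR paths and job class in the commented usage are placeholders rather than files that ship with the tutorial.

```shell
# Hypothetical helper: join all arguments with ':' to form a --libjars value.
join_libjars() {
  # "$*" joins the arguments using the first character of IFS; the subshell
  # keeps the IFS change from leaking into the caller.
  ( IFS=':'; printf '%s\n' "$*" )
}

# Intended use (placeholder paths and class name, left commented out):
#   express.py job \
#       --user-jar=path/to/jar/containing/job \
#       --job-name=org.MyKijiApp.MyJob \
#       --libjars="$(join_libjars lib/*.jar)" \
#       --mode=hdfs
```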

  • Launching the KijiExpress shell

KijiExpress includes an interactive shell that can be used to execute KijiExpress flows. To launch the shell, you invoke a command of the following form:

express.py shell \
    [--libjars=<list of JAR files, separated by colon>] \
    [--mode=local|hdfs]

If the mode flag is set to hdfs (--mode=hdfs), the shell will run jobs using Scalding's Hadoop mode. Use this option for normal usage against a Hadoop cluster.

If the --libjars flag is nonempty, the specified JAR files will be placed on the classpath. This is helpful if you are using external libraries or have compiled Avro classes.

To execute multi-line statements in the shell, use paste mode. This can be used to execute existing KijiExpress code. To start paste mode, enter the following command into a running KijiExpress shell:

:paste

Type Ctrl+D to end paste mode and execute the entered code.