Running kiji help will list all the available tools. The usage format for the tools is:

$ kiji <tool> [FLAG]...

Instance administration

Table administration

Data inspection/modification

Miscellaneous

Targeting a Kiji instance or a Kiji table with Kiji URIs

Most KijiSchema command-line tools accept a parameter specifying an HBase cluster, a Kiji instance or a Kiji table. These elements can be specified in a unified way through Kiji URIs.

Kiji URIs are formatted hierarchically:

    kiji://<HBase cluster>/<Kiji instance>/<Kiji table>/<columns>
  • HBase cluster: the address of a ZooKeeper quorum used by the HBase instance KijiSchema has been installed on. The default is .env, which tells KijiSchema to use the HBase instance identified by the HBase configuration files available on the classpath.
  • Kiji instance: the name of the Kiji instance within the HBase cluster.
  • Kiji table: the name of the Kiji table within the Kiji instance.
  • columns: a set of columns or column families within the Kiji table, separated by commas.

The HBase cluster address is a required component of all Kiji URIs. Any further component is optional. For instance:

  • kiji://localhost:2181 references the HBase cluster whose ZooKeeper quorum is composed of the server listening on localhost:2181.
  • kiji://localhost:2181/kiji_instance_1 designates the Kiji instance named kiji_instance_1 and living in the HBase cluster kiji://localhost:2181.
  • kiji://localhost:2181/kiji_instance_1/the_table designates the Kiji table named the_table and living in the Kiji instance kiji://localhost:2181/kiji_instance_1.
  • kiji://localhost:2181/kiji_instance_1/the_table/family1,family:column2 references the column family family1 and the column family:column2 within the Kiji table kiji://localhost:2181/kiji_instance_1/the_table.

The default value for Kiji URIs is kiji://.env/default, which references the Kiji instance named default and installed on the HBase instance identified by the HBase configuration available on the classpath.
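When a script addresses several instances or tables, it can help to assemble URIs from their parts. The following is a minimal sketch: the function name kiji_uri is hypothetical and not part of KijiSchema, and it only joins the components described above without validating them.

```shell
#!/bin/sh
# Hypothetical helper (not part of KijiSchema): assemble a Kiji URI from
# its hierarchical components. Only the cluster is required; every further
# component (instance, table, columns) is optional.
# Usage: kiji_uri <cluster> [instance [table [columns]]]
kiji_uri() {
  if [ "$#" -lt 1 ]; then
    echo "usage: kiji_uri <cluster> [instance [table [columns]]]" >&2
    return 1
  fi
  uri="kiji://$1"
  shift
  # Append each remaining component, separated by a slash.
  for component in "$@"; do
    uri="$uri/$component"
  done
  printf '%s\n' "$uri"
}

kiji_uri .env default
kiji_uri localhost:2181 kiji_instance_1 the_table family1,family:column2
```

The two calls print kiji://.env/default and kiji://localhost:2181/kiji_instance_1/the_table/family1,family:column2 respectively.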

Scripting using the command-line interface

All Kiji command-line tools accept a --interactive flag that controls whether user interactions are allowed. By default, this flag is set to true, which enables user interactions such as confirmations for dangerous operations.

When scripting Kiji commands, you may disable user interactions with --interactive=false.
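One way to make sure no scripted call ever waits on a prompt is to route every invocation through a small wrapper. This is only a sketch: kiji_batch and the KIJI_BIN variable are hypothetical names introduced for illustration, and the wrapper assumes the flag may be appended after the other flags.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of KijiSchema): run any kiji tool with
# user interactions disabled, so a script never blocks on a confirmation.
# KIJI_BIN is an assumed variable naming the launcher; it defaults to "kiji".
# Usage: kiji_batch <tool> [FLAG]...
kiji_batch() {
  if [ "$#" -lt 1 ]; then
    echo "usage: kiji_batch <tool> [FLAG]..." >&2
    return 1
  fi
  tool="$1"
  shift
  "${KIJI_BIN:-kiji}" "$tool" "$@" --interactive=false
}

# Preview the arguments that would be passed, using echo in place of kiji:
KIJI_BIN=echo kiji_batch metadata --kiji=kiji://.env/default --restore=mybackup
```

Substituting echo for the launcher, as in the last line, shows the argument list without touching any Kiji instance.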

Installation: install

The kiji install command creates the initial metadata tables kiji.<instance-name>.meta, kiji.<instance-name>.status, kiji.<instance-name>.schema_id and kiji.<instance-name>.schema_hash required by the KijiSchema system. This should be run once during initial setup of a KijiSchema instance.

The HBase cluster and the Kiji instance name may be specified with a Kiji URI:

kiji install --kiji=kiji://hbase_cluster/kiji_instance

Removal: uninstall

The kiji uninstall command removes an installed KijiSchema instance, and deletes all the user tables it contains. The HBase cluster and the name of the Kiji instance to remove are specified with a Kiji URI:

kiji uninstall --kiji=kiji://hbase_cluster/kiji_instance

Metadata backups: metadata

The kiji metadata command allows you to back up and restore metadata information in KijiSchema. This metadata contains table layout information as well as the schema definitions.

Creating a backup

You can back up the metadata for a specific Kiji instance with:

kiji metadata --kiji=kiji://hbase_cluster/kiji_instance --backup=mybackup

Restoring from a backup

Similarly, you can restore the metadata for a specific Kiji instance with:

kiji metadata --kiji=kiji://hbase_cluster/kiji_instance --restore=mybackup

After asking for confirmation:

Are you sure you want to restore metadata from backup?
This will delete your current metatable.
Please answer yes or no.

Restoration begins:

Restoring Metadata from backup.
Restore complete.

To restore only a subset of the table and schema information, use the following flags:

  • --tables - restores all tables from the metadata backup into the specified Kiji instance.
  • --schemas - restores all schema table entries from the metadata backup into the specified Kiji instance.

Creating Tables: create-table

The kiji create-table command creates a new Kiji table. This is stored in an underlying HBase table with the name kiji.<instance-name>.table.<table-name>.

This command has two mandatory arguments:

  • --table=<table-uri> - Kiji URI of the table to create. It is an error for this table to already exist.
  • --layout=<path/to/layout.json> - Path to a JSON file containing the table layout specification, as described in Managing Data.

The following arguments are optional:

  • --num-regions=<int> - The number of initial regions to create in the table. This may only be specified if the table uses row key hashing. It may not be used in conjunction with --split-key-file.

  • --split-key-file=<filename> - Path to a file containing the row keys to use as initial boundaries between regions. This may only be specified if the table uses row key hashing. It may not be used in conjunction with --num-regions.

Deleting tables, rows, and cells: delete

The kiji delete command will delete a KijiSchema table, row, or cell, and drop all values which were in them. This command has one mandatory argument:

  • --target=<kiji-uri> - URI of the target to delete or to delete from. The target may be an entire Kiji instance, a Kiji table or a set of columns within a Kiji table.

And several optional arguments:

  • --entity-id=<entity-id> - Specifies the entity ID of a single row to delete or to delete from. Requires the target Kiji URI to designate a Kiji table.

    The default is to not target a specific row, i.e. to delete the entire Kiji table specified with --target=....

  • --timestamp=<timestamp-spec> - Timestamp specification:

    • '<timestamp>' to delete cells with exactly this timestamp, expressed in milliseconds since the Epoch;
    • 'latest' to delete the most recent cell only;
    • 'upto:<timestamp>' to delete all cells with a timestamp older than the specified timestamp expressed in milliseconds since the Epoch;
    • 'all' to delete all cells, irrespective of their timestamp.

    The default is --timestamp=all.
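Because timestamps are expressed in milliseconds since the Epoch, an 'upto:<timestamp>' specification is easiest to compute rather than type. The sketch below is illustrative only: upto_days_ago is a hypothetical helper, and it assumes a date command that supports the widely available +%s format.

```shell
#!/bin/sh
# Hypothetical helper (not part of KijiSchema): build an 'upto:<timestamp>'
# specification with a cutoff a given number of days in the past.
# Usage: upto_days_ago <days> [now-in-seconds]
# The second argument defaults to the current time; it exists mainly so the
# arithmetic can be checked against a fixed instant.
upto_days_ago() {
  days="$1"
  now_s="${2:-$(date +%s)}"
  # 86400 seconds per day; Kiji timestamps are in milliseconds.
  cutoff_ms=$(( (now_s - days * 86400) * 1000 ))
  printf 'upto:%s\n' "$cutoff_ms"
}

upto_days_ago 30 1700000000   # prints upto:1697408000000
```

The result could then be passed to kiji delete as --timestamp="$(upto_days_ago 30)", alongside the --target and --entity-id arguments described above.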

Managing layouts: layout

The kiji layout command displays or modifies the layout associated with a table.

This command takes two parameters:

  • --table=<table-uri> - URI of the Kiji table to examine the layout of.
  • --do=<action> - Action to perform on the layout: dump (the default), set or history.

You may dump the current layout of a table with:

$ kiji layout --table=kiji://.env/default/users
{
  name: "users",
  description: "The user table",
  keys_format : {encoding : "RAW"},
  locality_groups : [],
  layout_id : "3",
}

You may update the layout of a table with:

$ kiji layout --table=kiji://.env/default/users --do=set --layout=/path/to/layout.json

The file /path/to/layout.json is a JSON descriptor of the updated table layout.

Optionally, you may use the --dry-run argument to print messages stating whether or not the update would succeed (i.e., whether or not the layout is valid) and what locality groups would be updated by the new layout.

Finally, you may dump the layout history of a table with:

$ kiji layout \
    --table=kiji://.env/default/users \
    --do=history \
    --max-version=5 \
    --write-to=/path/to/table-layout-history/layout

This dumps the 5 latest revisions of the table layout in 5 JSON files /path/to/table-layout-history/layout-<timestamp>.json.

Flushing tables: flush-table

The kiji flush-table command instructs HBase to flush the contents of a table to HDFS. When HBase receives new data, it is recorded in a write-ahead log (WAL), but this WAL is not merged with existing table files until the table is flushed or compacted. This happens more frequently as more data is written to a table, and you can force data to be written to table files with this command. If a table is not frequently updated, flushing the data with this command may improve recovery time in the event that HBase experiences a failure.

You must use one or both of the following arguments to specify what to flush:

  • --table=<table-uri> - URI of the Kiji table to flush.
  • --meta - If set, flushes KijiSchema metadata tables.

You should only flush tables during a period of relative inactivity. Flushing while a large number of operations are ongoing may adversely affect performance. The flush operation is also asynchronous; the command may return before the actual flush operation is complete.

Listing Information: ls

The kiji ls command is a basic tool used to explore a KijiSchema repository. It can list instances, tables, or even columns in the specified Kiji URI argument. Note that this tool takes Kiji URIs as unflagged arguments. If no URI argument is specified, then the tool assumes the default URI: kiji://.env/default.

You may list the Kiji instances existing in an HBase cluster by specifying the URI of an HBase cluster:

$ kiji ls kiji://localhost:2181
kiji://localhost:2181/kiji_instance1/
kiji://localhost:2181/kiji_instance2/

You may list the Kiji tables within a Kiji instance by specifying the URI of a Kiji instance:

$ kiji ls kiji://localhost:2181/kiji_instance1
kiji://localhost:2181/kiji_instance1/table1
kiji://localhost:2181/kiji_instance1/table2
kiji://localhost:2181/kiji_instance1/table3

You may list the columns of a table by specifying the URI of a Kiji table:

$ kiji ls kiji://localhost:2181/kiji_instance1/table1
kiji://localhost:2181/kiji_instance1/table1/info:name
kiji://localhost:2181/kiji_instance1/table1/info:email
…

Finally, you may even iteratively list multiple URIs by providing them as multiple arguments:

$ kiji ls kiji://localhost:2181 kiji://localhost:2181/kiji_instance1 kiji://localhost:2181/kiji_instance1/table1
kiji://localhost:2181/kiji_instance1/
kiji://localhost:2181/kiji_instance2/
kiji://localhost:2181/kiji_instance1/table1
kiji://localhost:2181/kiji_instance1/table2
kiji://localhost:2181/kiji_instance1/table3
kiji://localhost:2181/kiji_instance1/table1/info:name
kiji://localhost:2181/kiji_instance1/table1/info:email
…

The URI arguments can be specified as relative paths and all such paths are relative to kiji://.env/.

Getting a row: get

The kiji get command prints the record identified by the --entity-id flag from the table named by the Kiji URI argument (with or without columns specified). Each cell from the record appears on two lines: the first line contains the record’s entity ID, the cell timestamp expressed in milliseconds since the UNIX epoch, and the cell column name (family:qualifier) specified by the Kiji URI argument; the second line contains the string representation of the cell data itself.

$ kiji get kiji://localhost:2181/kiji_instance1/table1 --entity-id="'Olga Jefferson'"
Looking up entity: 'Olga Jefferson' from kiji table: : kiji://localhost:2181/kiji_instance1/table1/info:name,info:email
entity-id='Olga Jefferson' [1305851507300] info:name
                                 Olga Jefferson
entity-id='Olga Jefferson' [1305851507301] info:email
                                 Olga.Jefferson@hotmail.com

The URI is specified in the same way as for kiji ls. The --entity-id flag is specified as follows:

  • --entity-id=<string> - Specifies the entity ID of a single row to look up:

    • Either as a Kiji row key, with --entity-id=kiji=...:

      Old deprecated Kiji row keys are specified as naked UTF-8 strings;

      New Kiji row keys are specified in JSON, as in: --entity-id=kiji="['component1', 2, 'component3']".

    • or as HBase encoded row keys specified as bytes:

      • by default as UTF-8 strings, or prefixed as in 'utf8:encoded\x0astring';
      • in hexadecimal as in 'hbase:hex:deadfeed';
      • as a URL with 'url:this%20URL%00'.

You will typically want to further constrain the data printed to the terminal with the following options.

  • --max-versions=<int> - Restrict the number of versions of each cell to display.

    The default is 1, i.e. displays the latest version of each cell.

  • --timestamp=<long>..<long> - Excludes cell versions whose timestamp is outside the specified time range min..max. Timestamps are expressed in milliseconds since the Epoch. If the lower bound is unspecified, it defaults to 0 and if the upper bound is unspecified, it defaults to Long.MAX_VALUE.

    The default is 0.., i.e. from Epoch to Long.MAX_VALUE.
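Time ranges are often known in seconds (for example from date +%s) rather than milliseconds. As a rough sketch, a hypothetical helper such as timestamp_flag below could format the flag, leaving a bound unspecified when it is given as an empty string.

```shell
#!/bin/sh
# Hypothetical helper (not part of KijiSchema): format a --timestamp flag
# from bounds given in seconds since the Epoch. An empty bound is left
# unspecified, so the tool applies its own default for that side.
# Usage: timestamp_flag <min-seconds-or-empty> <max-seconds-or-empty>
timestamp_flag() {
  min_ms=""
  max_ms=""
  # Convert seconds to milliseconds only for the bounds that were given.
  if [ -n "${1:-}" ]; then min_ms=$(( $1 * 1000 )); fi
  if [ -n "${2:-}" ]; then max_ms=$(( $2 * 1000 )); fi
  printf '%s%s..%s\n' '--timestamp=' "$min_ms" "$max_ms"
}

timestamp_flag 1305851507 1305851508   # prints --timestamp=1305851507000..1305851508000
timestamp_flag 1305851507 ""           # prints --timestamp=1305851507000..
```

The output can be placed directly on a kiji get command line, for instance with "$(timestamp_flag 1305851507 '')" as one of the flags.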

Scanning a table: scan

The kiji scan command, unlike kiji get, scans multiple records in the table specified by the Kiji URI argument (with or without columns specified). Each record appears as a set of cells separated from other records by blank lines. The cells appear in the same format as with kiji get.

$ kiji scan kiji://localhost:2181/kiji_instance1/table1/info:name,info:email
Scanning kiji table: kiji://localhost:2181/kiji_instance1/table1/
entity-id='Olga Jefferson' [1305851507300] info:name
                                 Olga Jefferson
entity-id='Olga Jefferson' [1305851507301] info:email
                                 Olga.Jefferson@hotmail.com

entity-id='Sidney Tijuana' [1305851507425] info:name
                                 Sidney Tijuana
entity-id='Sidney Tijuana' [1305851507427] info:email
                                 Sidney.Tijuana@hotmail.com
…

The scanned records may be further constrained by using the following options:

  • --start-row=<row-key> and --limit-row=<row-key> - Restrict the range of rows to scan through. The start row is included in the scan while the limit row is excluded. Start and limit rows are expressed in the same way as --entity-id for kiji get. For example as HBase encoded rows: --start-row='hex:0088deadbeef' or --limit-row='utf8:the row key in UTF8'.

    The default is to scan through all the rows in the table.

  • --max-rows=<int> - Limits the total number of rows whose content is displayed.

    The default is 0 and sets no limit.

The following additional options also apply to kiji scan.

  • --max-versions=<int> - Restrict the number of versions of each cell to display.

    The default is 1, i.e. displays the latest version of each cell.

  • --timestamp=<long>..<long> - Excludes cell versions whose timestamp is outside the specified time range min..max. Timestamps are expressed in milliseconds since the Epoch. If the lower bound is unspecified, it defaults to 0 and if the upper bound is unspecified, it defaults to Long.MAX_VALUE.

    The default is 0.., i.e. from Epoch to Long.MAX_VALUE.
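The two-line cell format printed by kiji scan and kiji get is convenient to read but awkward to process further. The filter below is a sketch that flattens it to one tab-separated line per cell. The name cells_to_tsv is hypothetical, and the parsing assumes the output looks exactly like the samples above: a line starting with entity-id=, a bracketed numeric timestamp, then the column name, followed by a single indented value line.

```shell
#!/bin/sh
# Hypothetical filter (not part of KijiSchema): turn the two-line cell format
# into one tab-separated line per cell: entity ID, timestamp, column, value.
# Lines that do not fit the assumed layout (banner, blank lines) are skipped.
cells_to_tsv() {
  awk '
    # Header line: entity-id=<id> [<timestamp>] <family:qualifier>
    /^entity-id=/ && match($0, / \[[0-9]+\] /) {
      entity  = substr($0, 11, RSTART - 11)          # text after "entity-id="
      ts      = substr($0, RSTART + 2, RLENGTH - 4)  # digits inside brackets
      column  = substr($0, RSTART + RLENGTH)         # rest of the line
      pending = 1
      next
    }
    # The line right after a header holds the indented cell value.
    pending {
      sub(/^[ \t]+/, "")
      printf "%s\t%s\t%s\t%s\n", entity, ts, column, $0
      pending = 0
    }
  '
}

cells_to_tsv <<'EOF'
entity-id='Olga Jefferson' [1305851507300] info:name
                                 Olga Jefferson
EOF
```

In practice the function would sit at the end of a pipeline, as in kiji scan <table-uri> | cells_to_tsv. Values that span several lines, or entity IDs that themselves contain a bracketed number, are not handled by this sketch.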

Incrementing counters: increment

The kiji increment command may be used to increment (or decrement) a KijiSchema counter.

The following arguments are required:

  • --cell=<column-uri> - Kiji URI specifying a single column to increment.
  • --entity-id=<entity> - Entity ID of the target row.
  • --value=<amount> - The value to increment by.

See kiji get for how to specify entity IDs.

Setting Individual Cells: put

To aid in the insertion of small data sets, debugging, and testing, the kiji put command may be used to insert individual values in a Kiji table.

The following arguments are required:

  • --target=<column-uri> - Kiji URI specifying a single column to write.
  • --entity-id=<entity> - Target row id (an unhashed, human-readable string).
  • --value=<JSON value> - The value to insert. The value is specified as a JSON string according to the Avro JSON encoding specification.

See kiji get for how to specify entity IDs.

The following arguments are optional:

  • --schema=<Avro schema> - By default, KijiSchema will use the reader schema attached to a column in its layout to decode the JSON and encode the binary data for insertion in the table. This argument allows you to use an alternate writer schema.
  • --timestamp=<long> - Specifies a timestamp (in milliseconds since the Epoch) other than “now”.
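For a column whose schema is a plain Avro string, the Avro JSON encoding is a JSON string literal, so the value passed to --value must carry its own double quotes in addition to any shell quoting. The helper below is a minimal sketch of producing such a literal. json_string is a hypothetical name, and the escaping covers only backslashes and double quotes.

```shell
#!/bin/sh
# Hypothetical helper (not part of KijiSchema): wrap a plain string in JSON
# double quotes, escaping backslashes and double quotes. Control characters
# such as newlines are not handled by this sketch.
json_string() {
  escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '"%s"\n' "$escaped"
}

json_string 'Olga "OJ" Jefferson'   # prints "Olga \"OJ\" Jefferson"
```

Assuming a string-typed column, the result could be supplied as --value="$(json_string 'Olga Jefferson')" together with the --target and --entity-id arguments. Other Avro types, such as records or unions, need their own JSON representation.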

Running an Application Jar with KijiSchema: jar

If your application requires KijiSchema and its dependencies, you can use the kiji jar command to launch your program’s main method with KijiSchema present on the classpath.

This command requires two unlabeled arguments: the jar filename, and the main class to run:

$ kiji jar myapp.jar com.pkg.MyApp [args...]

Generating Sample Data: synthesize-user-data

In the interest of enabling quick experimentation with KijiSchema, the kiji synthesize-user-data tool will generate a number of semi-random rows for you.

The tool creates a set of rows which contain columns info:id, info:name, and info:email; these are pseudo-randomly generated first and last names, with plausible email addresses with gmail, hotmail, etc. accounts based on the generated names. These columns can be used with mappers and reducers.

To use this tool, first create a table with the layout in ${KIJI_HOME}/examples/synthdata-layout.xml. Then invoke bin/kiji synthesize-user-data --table=<table-uri>. This will generate 100 rows of data. You can create a different number of records by specifying --num-users=<int>.

You can specify a different list of names with the --name-dict=<filename> argument.