Managing Data
Every table in Kiji has an associated layout. The layout contains a baseline schema definition that can be used to access most of the information in the table without further special knowledge. It describes the set of columns that can exist in any given row and specifies a minimal “reader” schema for each column. This gives team members working with a large data set a common data dictionary and enables validation of cell data against a reference schema. The layout also describes some additional properties of the table, which are discussed in this section as well.
Every data element in Kiji is stored in a Kiji cell, which is uniquely identified by an entity ID, column name, and timestamp. The column name itself has two components: a family name and a qualifier. As in HBase, these are written family:qualifier. Data written to a Kiji cell is serialized to a byte array according to an Avro schema. The writer schema used for the particular write operation is stored alongside the cell data, so the input data can be deserialized exactly by subsequent read requests. This writer schema must, however, be compatible with the expected reader schema specified in the layout for the cell.
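As a rough illustration of how a cell is addressed, the sketch below models the three coordinates and splits a family:qualifier column name. The CellCoordinates type, the parse_column_name helper, the string entity ID, and the sample timestamp are all hypothetical and are not part of the KijiSchema API.

```python
from typing import NamedTuple, Tuple


class CellCoordinates(NamedTuple):
    """Hypothetical model of what identifies one cell (not a KijiSchema class)."""
    entity_id: str   # assumption: entity ID shown as a plain string
    family: str
    qualifier: str
    timestamp: int   # assumption: an integer timestamp, value is illustrative


def parse_column_name(column: str) -> Tuple[str, str]:
    """Split a 'family:qualifier' name at the first colon."""
    family, sep, qualifier = column.partition(":")
    if not sep or not family or not qualifier:
        raise ValueError(f"expected family:qualifier, got {column!r}")
    return family, qualifier


family, qualifier = parse_column_name("info:email")
cell = CellCoordinates("user-123", family, qualifier, timestamp=1_400_000_000_000)
```

Splitting at the first colon only is a design choice for this sketch: it leaves any later colons inside the qualifier.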
There are two ways to edit the layout of a Kiji table:
- Using the KijiSchema DDL shell.
- Editing the JSON layout description directly.
Most users should use the KijiSchema DDL shell. It has a user-friendly language for creating, modifying, and describing tables. To learn more about this mechanism, skip ahead to the DDL Shell reference.
Viewing and editing the JSON layout description is a low-level task, typically performed by system administrators, or for debugging purposes only. Its format is described below.
A JSON layout descriptor is a specification for the locality groups, columns, and data types that comprise a table, written as a JSON document whose elements are described in the following subsections. We will refer to the following example layout file throughout this section:
{
  name: "users",
  description: "A bunch of made-up users",
  version: "layout-1.1",
  keys_format: {encoding: "HASH"},
  locality_groups: [ {
    name: "default",
    description: "The default locality group",
    in_memory: false,
    max_versions: 1,
    ttl_seconds: 7776000,
    compression_type: "NONE",
    families: [ {
      name: "info",
      description: "A bunch of fields",
      columns: [ {
        name: "id",
        description: "user id hash",
        column_schema: {type: "INLINE", value: '"string"'}
      }, {
        name: "name",
        description: "The person's name",
        column_schema: {type: "INLINE", value: '"string"'}
      }, {
        name: "email",
        description: "The person's email",
        column_schema: {type: "INLINE", value: '"string"'}
      } ]
    }, {
      name: "searches",
      description: "The recent search queries the user has made",
      map_schema: {type: "CLASS", value: "com.search.avro.Search"}
    } ]
  } ]
}
The schema of the table layout descriptor is available in the KijiSchema source tree at Layout.avdl.
Overall structure of a table layout
At the top level, a table contains:
- the table name and description;
- a description of the row keys encoding;
- the table locality groups.
Each locality group has:
- a primary name, unique within the table, a description and optionally some name aliases;
- whether the data is to be stored in memory or on disk;
- data retention lifetime;
- maximum number of versions to keep;
- type of compression;
- column families stored in this locality group.
Each column family has:
- a primary name, unique within the table, a description and optionally some name aliases;
- for map-type families, the Avro schema of the cell values;
- for group-type families, the collection of columns in the group.
Each column in a group-type family has:
- a primary name, unique within the family, a description and optionally some name aliases;
- an Avro schema.
All names must start with a letter and may only use letters, numbers, and underscores.
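The naming rule above can be checked with a short regular expression. This is a minimal sketch that assumes "letters" means ASCII letters; the is_valid_name helper is hypothetical, and the validation KijiSchema actually performs may differ.

```python
import re

# A letter first, then any mix of letters, digits, and underscores.
# Assumption: only ASCII letters count as letters.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_valid_name(name: str) -> bool:
    """Return True if the whole name follows the rule stated above."""
    return NAME_PATTERN.fullmatch(name) is not None


print(is_valid_name("info"))       # True
print(is_valid_name("user_id2"))   # True
print(is_valid_name("2nd_group"))  # False: starts with a digit
print(is_valid_name("user-id"))    # False: hyphen is not allowed
```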
Group-type and map-type families
In KijiSchema, there are two kinds of column families:
- Group-type families define a fixed set of named columns. In the example layout above, there is a single group-type family named info, containing the columns info:id, info:name, and info:email, each with its own Avro schema. Each row may contain any subset of these three columns, but may not contain any additional columns.
- Map-type families define a family where the cell column qualifiers are not explicitly defined. In the example layout above, the map-type family named searches may be used to store every search performed by a user, each with its own column qualifier; cell columns would have names of the form searches:<search-terms>. The contents of each cell might be a compound Avro record containing the list of products returned by the query specified in the cell’s qualifier, as well as a boolean for each indicating whether the user actually clicked that search result. A map-type family specifies a single Avro schema for all its columns. Note that the column qualifiers used in a map-type family must be valid UTF-8 strings (rather than arbitrary byte arrays).
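To make the difference between the two kinds of family concrete, the following sketch checks whether a column name is acceptable under a simplified in-memory copy of the example layout. The FAMILIES dictionary and is_column_allowed helper are hypothetical; they only mirror the rules described above and do not call KijiSchema.

```python
# Simplified, hand-written view of the two families in the example layout.
FAMILIES = {
    "info": {"kind": "group", "columns": {"id", "name", "email"}},
    "searches": {"kind": "map"},
}


def is_column_allowed(family: str, qualifier: str) -> bool:
    """Apply the group-type / map-type rules to one family:qualifier pair."""
    spec = FAMILIES.get(family)
    if spec is None:
        return False  # unknown family
    if spec["kind"] == "group":
        # Group-type: only the columns declared in the layout may exist.
        return qualifier in spec["columns"]
    # Map-type: any qualifier is accepted as long as it encodes as UTF-8.
    try:
        qualifier.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


print(is_column_allowed("info", "email"))          # True
print(is_column_allowed("info", "phone"))          # False: not declared
print(is_column_allowed("searches", "red shoes"))  # True
```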
Kiji cell schema
Within a table, Kiji cells are encoded according to their declared schema. There are three types of Kiji cell schemas, specified using the type field:
- INLINE: the value field contains the JSON representation of an Avro schema. In the earlier example, all three columns contain a single Avro "string" field, but a column could also contain an array, record or other complex Avro data type.
- CLASS: the value field contains the fully-qualified name of a Java class mapped by an Avro data type, like an implementation of SpecificRecord. The user must ensure that the class is available on the classpath of any Kiji tools accessing the cell.
- COUNTER: cells are encoded as long integers and support atomic increment or decrement operations.
The schema of cells in a group-type family is specified by the column_schema field (see columns info:id, info:name, info:email in the example layout). The schema of cells in a map-type family is specified by the map_schema field (see map-type family searches in the example layout).
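The sketch below shows one way a tool might interpret a column_schema or map_schema entry depending on its type. It also illustrates why the INLINE value in the example is written as '"string"': the value is itself a piece of JSON that has to be parsed. The interpret_cell_schema helper and its return labels are hypothetical, not KijiSchema API.

```python
import json


def interpret_cell_schema(cell_schema: dict):
    """Return a (kind, detail) pair for a cell schema entry."""
    schema_type = cell_schema["type"]
    if schema_type == "INLINE":
        # The value holds the JSON text of an Avro schema; parse it.
        return ("avro-json", json.loads(cell_schema["value"]))
    if schema_type == "CLASS":
        # The value is a fully-qualified Java class name, kept as-is.
        return ("java-class", cell_schema["value"])
    if schema_type == "COUNTER":
        # Counters carry no schema value; cells are long integers.
        return ("counter", None)
    raise ValueError(f"unknown cell schema type: {schema_type!r}")


print(interpret_cell_schema({"type": "INLINE", "value": '"string"'}))
# ('avro-json', 'string')
print(interpret_cell_schema({"type": "CLASS", "value": "com.search.avro.Search"}))
# ('java-class', 'com.search.avro.Search')
```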
Locality groups
All families within a locality group are stored together in HBase. It is usually a good idea to put families that are often read and written together into the same locality group.
Locality groups control the physical properties of the underlying storage:
- in_memory: when this boolean flag is set, Kiji configures HBase to keep as much of this locality group in memory as possible.
- ttl_seconds: the cells’ time to live; cells older than this number of seconds may be automatically discarded.
- max_versions: the maximum number of timestamped versions of cells to retain; as new versions of a cell are written, older versions are deleted so as not to exceed this limit.
- compression_type: one of NONE, GZ, LZO or SNAPPY.
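The combined effect of ttl_seconds and max_versions on one cell can be sketched as a small simulation. This is only an illustration of the two rules as described above: the retained_versions helper is hypothetical, timestamps are taken in seconds for simplicity, and the point at which the underlying storage actually discards data is not modelled.

```python
def retained_versions(timestamps, now, ttl_seconds, max_versions):
    """Return the version timestamps that survive, newest first.

    timestamps and now are in seconds (an assumption made for this sketch).
    """
    # Rule 1: drop versions older than the time to live.
    fresh = [t for t in timestamps if now - t <= ttl_seconds]
    # Rule 2: keep only the newest max_versions of what is left.
    fresh.sort(reverse=True)
    return fresh[:max_versions]


now = 10_000_000
versions = [now - 8_000_000, now - 100, now - 50]

# The example layout uses ttl_seconds = 7776000 (90 days) and max_versions = 1.
print(retained_versions(versions, now, ttl_seconds=7_776_000, max_versions=1))
# [9999950]
```

With a larger max_versions, the version written 100 seconds ago would also be kept, while the one older than the TTL would still be dropped.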
Names
Locality groups, families, and columns are identified by their primary names.
Updating layouts
Table layouts may be updated by specifying a table layout update descriptor. A table layout update descriptor entirely specifies the new layout, and sets the reference_layout field to specify the original layout to update. Locality groups, families, and columns can be updated by redefining the locality group, family, or column with the same name. To rename locality groups, families, or columns, set the renamed_from field in the new definition to the original name. Table names may not be changed.
For example, to rename the default locality group to new_name, one may update the table layout with a locality group descriptor as follows:
locality_groups: [ {
  name: "new_name",
  renamed_from: "default",
  ...
} ]
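The matching of updated definitions to existing ones can be sketched as follows: each entry in the update either keeps its name or points back to its original through renamed_from. The match_updated_groups helper is hypothetical and works on plain dictionaries; it is not how KijiSchema itself applies an update, and treating a dangling renamed_from as an error is an assumption of this sketch.

```python
def match_updated_groups(current_groups, updated_groups):
    """Map each updated group name to the existing group it replaces.

    A value of None means the group is new in the updated layout.
    """
    current_names = {group["name"] for group in current_groups}
    mapping = {}
    for group in updated_groups:
        if "renamed_from" in group:
            original = group["renamed_from"]
            if original not in current_names:
                # Assumption: a rename must refer to an existing group.
                raise ValueError(f"cannot rename unknown group {original!r}")
            mapping[group["name"]] = original
        elif group["name"] in current_names:
            mapping[group["name"]] = group["name"]  # redefined in place
        else:
            mapping[group["name"]] = None  # newly added
    return mapping


current = [{"name": "default"}]
update = [{"name": "new_name", "renamed_from": "default"}]
print(match_updated_groups(current, update))  # {'new_name': 'default'}
```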
Layout record descriptor
For reference, the Avro descriptor for table layout records is defined in src/main/avro/Layout.avdl within the kiji-schema git project.