Cassandra Data Modeling in Data Xtractor: The Other Features
We cover here some missing features and details not properly addressed in the previous two articles, on migrating from a relational database to Apache Cassandra using Data Xtractor: static fields, secondary indexes, NULL values in the partition or cluster key fields etc.
Table of Contents
This extends the coverage of relational database migration to Apache Cassandra with Data Xtractor, already introduced through two practical examples in:
- How to Migrate a Relational Database to Cassandra (with Data Xtractor) – HotelReservations sample from O’Reilly’s book Cassandra: The Definitive Guide, re-published by Jeff Carpenter and Eben Hewitt in 2016.
- RDB to Cassandra, with Chebotko Diagrams – DigitalLibrary sample from the breakthrough whitepaper published in 2015 by Artem Chebotko, Andrey Kashlev and Shiyong Lu.
STATIC columns are columns whose values change only with the value of the partition key. They may be relevant only when the result returns one-to-many relationship values, in which case static columns are those non-key columns on the same side with the partition key.
The query below returns a Room-to-Amenities one-to-many relationship. When hotel_id and room_id are used as partition keys, the rate could be used as STATIC column, because it also belongs to the Room table, on the left side of the relationship. For each group of such column values (hotel_id, room_id and rate), you may get one or more amenities, with different values. But rate’s value will not change within the same partition.
A Chebotko diagram has a special notation for columns defined as STATIC. We could eventually detect such columns automatically in Data Xtractor. The feature is not currently implemented, but it could be in the near future.
A query with the same fields may use a different column with a partition key role, when filtering by an exact match. This could be done by creating a secondary index in that column.
Consider the following two queries, that expose the exact same fields, except the second query uses authors as partition key, instead of venue_name:
Instead of a CREATE TABLE Artifacts_by_author, we could have issued a CREATE INDEX statement:
CREATE INDEX "Artifacts_by_author" ON "Artifacts_by_venue" (authors);
A Chebotko diagram has a special notation for columns on which you create a secondary index. The main advantage is you avoid data duplication, by generating another Cassandra table with the exact same data, but stored differently. Problem is data is stored clustered by its partition key, and searching by the secondary index may hit multiple partitions, when getting everything related to that secondary index value. This is one important reason why secondary indexes are to be avoided, and keep denormalization as the preferred alternative.
NULL in Key Values
Similar to the previous two queries, suppose you want to add another filter, by “keywords” this time, instead of “authors” or “venue_name” (click to enlarge):
Problem is the keywords field accepts NULL values, and this query will fail if you do not filter out these NULLs when you try to import data in a Cassandra table. In Data Xtractor, generation of INSERT INTO or CSV files is done by temporarily disabling the WHERE clause. So you cannot simply add a “WHERE keywords IS NOT MULL” to the query, because it will be ignored on export.
The solution is to replace the tables in our query with a subquery, in which you may perform whatever you want, even the calculation of aggregates. You can use the top query only for the WHERE and ORDER BY clauses defining the partitioning and clustering keys on the subquery.
Here is our new subquery, which will always filter out all NULL value keywords, among other things:
When a query is based on another query from the same diagram, exported as a Cassandra table, the top query will be automatically declared as a materialized view on export when:
- The query is solely based on one and only one identifiable Cassandra table, and nothing else.
- Top Where and Sort fields include all columns identified as PK (partition keys or cluster keys) in the base query.
- Top Where and Sort fields optionally include one and only one other non-static column of the base query.
Consider reservations_by_confirmation2 query here below. The query is internally based only on reservations_by_hotel_date, and include all its PK columns (hotel_id, start_date, room_id) as sort fields. Plus one other non-PK column (confirm_number), which has been now identified as a partition key (used in a WHERE clause):
reservations_by_confirmation2 will be declared as a materialized view, when the model is exported as a DLL script (remark the base query must be in this diagram as well):
DROP TABLE IF EXISTS reservations_by_hotel_date; DROP MATERIALIZED VIEW IF EXISTS reservations_by_confirmation2; -- Q7. Lookup a reservation by hotel, date, and guest name. CREATE TABLE IF NOT EXISTS reservations_by_hotel_date ( hotel_id text, start_date date, room_id text, end_date date, confirm_number int, guest_id text, PRIMARY KEY ((hotel_id, start_date), room_id) ) WITH CLUSTERING ORDER BY (room_id ASC); -- Q71. Lookup a reservation by confirmation number. Implemented as a materialized view. CREATE MATERIALIZED VIEW IF NOT EXISTS reservations_by_confirmation2 AS SELECT confirm_number, hotel_id, start_date, room_id, end_date, guest_id FROM reservations_by_hotel_date WHERE confirm_number IS NOT NULL AND hotel_id IS NOT NULL AND start_date IS NOT NULL AND room_id IS NOT NULL PRIMARY KEY (confirm_number, hotel_id, start_date, room_id);
You do not need to specify the “IS NOT NULL” expressions in your query, they will be automatically added. Materialized views do not import data (i.e. do not generate INSERT INTO statements). Query a materialized view the way you query a regular view in SQL: it will always return data through its underlying table.
Materialized views can also include no additional non-PK column. They can simply expose the PK fields of the underlying query in a different order, or interchange partition keys with cluster keys.
In principle, you are allowed to add in OQL a Cassandra table description using “WITH COMMENT”. However, some cloud providers – including DataStax Astra, that we used in our demos – disable such feature, and your CREATE TABLE statements fail:
It would be more trouble for you to correct your statements now, and it’s not worth it. This is why we export and display a table description only like a SQL-like inline comment, before the CREATE TABLE, and not embedded within the statement itself.
Data Xtractor automatically generates generic data types on export. Specific Cassandra data types – including the collection types (sets, maps, lists) – can be presently customized only in the exported DDL script file.
UDTs (User Defined Types) can be declared and customized only in the generated DDL script. Also, you must eventually adapt generated INSERT INTO statements to support nested JSON data. The feature was too complex to consider for this iteration in Data Xtractor.
Cassandra Query Language by Examples: Puzzles with Answers is a new Kindle e-book I recently published.
Learn data modeling with Chebotko diagrams and proper key choices.
Learn how to properly use the partition, clustering or primary keys, in WHERE and ORDER BY clauses.
Learn when and what keys you may skip in a filter expression or in an explicit query sort order.
Learn when to create a new denormalized table, secondary index or materialized view.
In generated OQL, we’ll always declare all primary keys in the last line of a CREATE TABLE body, with PRIMARY KEY. There will be no PRIMARY KEY suffix after a single-key PK. This makes it easier to always locate all primary keys in the same place.
The partition key will always be surrounded by its own parenthesis, whether it’s single or composite. This makes it easier to always locate the partition key(s) inside a ((…)) block.
What follows after will be an optional list of cluster keys. WITH CLUSTERING clause appears only when we have at least one cluster key in DESC order.
CREATE TABLE IF NOT EXISTS "Ratings_by_artifact" ( artifact_id int, num_ratings int, sum_ratings int, PRIMARY KEY ((artifact_id)) ); CREATE TABLE IF NOT EXISTS "Experts_by_artifact" ( artifact_id int, expertise text, user_id int, user_name text, email text, PRIMARY KEY ((artifact_id, expertise), user_id) ); CREATE TABLE IF NOT EXISTS "Artifacts_by_venue" ( venue_name text, venue_year int, artifact_id int, avg_rating float, PRIMARY KEY ((venue_name), venue_year, artifact_id) ) WITH CLUSTERING ORDER BY (venue_year DESC, artifact_id ASC);
A quick review on the allowed or mandatory searches on a Cassandra table. Quoted Mapping Rules come from the “Big Data Modeling …” whitepaper of Artem Chebotko.
(a) Tables with one single or compound partition key, and no cluster keys, must use an exact match on all partition keys. This will hit only one partition and one cluster node, and one single row (or none) is returned.
For instance, Cassandra tables like hotels or guests from HotelReservations have one single primary key, on the single partition keys hotel_id or guest_id, and they must pass a unique identifier value in the WHERE clause in a SELECT query. A table with primary key from a compound partition key (on multiple columns) and no cluster keys must provide an exact match in each partition key:
SELECT * FROM hotels WHERE hotel_id = '2d76c2a1-f312-4934-83ae-a59c0574805f'; SELECT * FROM guests WHERE guest_id = 'c262832f-6f09-4a4a-8e92-447304fcebd8'; SELECT * FROM rooms WHERE hotel_id = 'b93a7112-f4dd-41cc-b529-d5d79114078e' AND room_id = 'kdefgt';
Mapping Rule 2: Equality Search Attributes
Equality search attributes, which are used in a query predicate, map to the prefix columns of a table primary key. Such columns must include all partition key columns and, optionally, one or more clustering key columns. Violation of this rule may result in inability to support query requirements.
(b) Tables with both partitioning and clustering keys must use exact matches on each partition key (like before) and optional exact or range matches on some of the cluster keys, starting from the left. There could be no table with cluster keys, and no partition keys. The exact matches on the partition keys make sure we hit only one partition and one cluster node (like before). But presence of the clustering keys is usually a sign more than one row could be returned.
Following searches with exact matches are all correct. When used on all keys, they will return one single row (or none). Remark the second and third queries use also exact matches on cluster keys. However, you may not use a query with matches on hotel_id and room_number, skipping start_date. Last query uses a tuple notation, that groups clustering columns using the same match operator:
SELECT * FROM available_rooms_by_hotel_date WHERE hotel_id = '2d76c2a1-f312-4934-83ae-a59c0574805f'; SELECT * FROM available_rooms_by_hotel_date WHERE hotel_id = '2d76c2a1-f312-4934-83ae-a59c0574805f' AND start_date = '2019-02-11'; SELECT * FROM available_rooms_by_hotel_date WHERE hotel_id = '2d76c2a1-f312-4934-83ae-a59c0574805f' AND start_date = '2019-02-11' AND room_number = 15; SELECT * FROM available_rooms_by_hotel_date WHERE hotel_id = '2d76c2a1-f312-4934-83ae-a59c0574805f' AND (start_date, room_number) = ('2019-02-11', 15);
(c) Allowed range match operators on a cluster key are: <, <=, >, >= and IN. Once you use a range match on a single cluster key, you must skip the following cluster keys, in the order they are defined (i.e. room_number can no longer be used in search). This doesn’t apply to exact matches: last queries uses an exact match for start_date, that allows using room_number as well (with an exact or range match). You may not use a cluster key with a range match operator, and skip previous cluster keys, as this would return sets which are not contiguous, and this is not allowed. Remark also the last query, using tuples for the clustering columns: this is not equivalent with its similar query without tuples! Tuples can also be used on the cluster keys for the IN operator.
SELECT * FROM available_rooms_by_hotel_date WHERE hotel_id = '2d76c2a1-f312-4934-83ae-a59c0574805f' AND start_date > '2010-01-01'; SELECT * FROM available_rooms_by_hotel_date WHERE hotel_id = '2d76c2a1-f312-4934-83ae-a59c0574805f' AND start_date > '2010-01-01' AND start_date < '2020-01-01'; SELECT * FROM available_rooms_by_hotel_date WHERE hotel_id = '2d76c2a1-f312-4934-83ae-a59c0574805f' AND start_date IN ('2019-02-11', '2019-07-22'); SELECT * FROM available_rooms_by_hotel_date WHERE hotel_id = '2d76c2a1-f312-4934-83ae-a59c0574805f' AND (start_date, room_number) IN (('2019-02-11', 12), ('2019-07-22', 11)); SELECT * FROM available_rooms_by_hotel_date WHERE hotel_id = '2d76c2a1-f312-4934-83ae-a59c0574805f' AND start_date = '2019-02-11' AND room_number > 9; SELECT * FROM available_rooms_by_hotel_date WHERE hotel_id = '2d76c2a1-f312-4934-83ae-a59c0574805f' AND (start_date, room_number) > ('2019-02-11', 9);
Mapping Rule 3: Inequality Search Attributes
An inequality search attribute, which is used in a query predicate, maps to a table clustering key column. In the primary key definition, a column that participates in inequality search must follow columns that participate in equality search. Violation of this rule may result in inability to support query requirements.
(d) Sorting (by ORDER BY) can be done only on the cluster keys, in either the exact order defined in the CREATE TABLE (which is by default, when ORDER BY is omitted), or its exact opposite order (something like “ORDER BY start_date ASC, room_number DESC” is not allowed). You cannot skip cluster keys, from the left side: this means you cannot skip start_date and use room_number alone in ORDER BY (but you may sort by start_date and not room_number):
SELECT * FROM available_rooms_by_hotel_date WHERE hotel_id = '2d76c2a1-f312-4934-83ae-a59c0574805f' ORDER BY start_date, room_number; SELECT * FROM available_rooms_by_hotel_date WHERE hotel_id = '2d76c2a1-f312-4934-83ae-a59c0574805f' ORDER BY start_date DESC, room_number DESC; SELECT * FROM available_rooms_by_hotel_date WHERE hotel_id = '2d76c2a1-f312-4934-83ae-a59c0574805f' ORDER BY start_date DESC;
Mapping Rule 4: Ordering Attributes
Ordering attributes, which are specified in a query, map to clustering key columns with ascending or descending clustering order as prescribed by the query. Violation of this rule may result in inability to support query requirements.
(e) Any primary key value must uniquely identify a row. When this is not the case, add one or more cluster keys.
In our available_rooms_by_hotel_date table, hotel_id as partition key returns entries for all rooms in the hotel. Adding a start_date as cluster key does not ensure uniqueness, but adding room_number as well does (assuming you may have just one entry for the same hotel room, with the same date):
CREATE TABLE available_rooms_by_hotel_date ( hotel_id text, start_date date, room_number int, room_id text, is_available int, PRIMARY KEY ((hotel_id), start_date, room_number));
Mapping Rule 5: Key Attributes
Key attribute types map to primary key columns. A table that stores entities or relationships as rows must include key attributes that uniquely identify these entities or relationships as part of the table primary key to uniquely identify table rows. Violation of this rule may lead to data loss.
(e) You cannot normally use a WHERE clause with a non-key column (i.e. a search by room_id ‘CH-445’ returns an error). This is because such search may hit multiple partitions and incur a huge performance cost. However, you can still do it with either ALLOW FILTERING option, or defining a secondary index on that column.
Last query here would fail without the secondary index, or the ALLOW FILTERING option (as in the first query):
SELECT * FROM available_rooms_by_hotel_date WHERE room_id = 'CH-445' ALLOW FILTERING; CREATE INDEX index_room_id ON available_rooms_by_hotel_date (room_id); SELECT * FROM available_rooms_by_hotel_date WHERE room_id = 'CH-445';
(f) You may use the TOKEN function on a partition key column, with relational non-equal match operators, to eventually compare string values by their lexicographic order. The queries below are equivalent, but only the first query allows such operators on a partition key. Remark you may also use IN on the last column of a partition key:
SELECT * FROM available_rooms_by_hotel_date WHERE token(hotel_id) >= token('2d76c2a1-f312-4934-83ae-a59c0574805f') AND token(hotel_id) <= token('2d76c2a1-f312-4934-83ae-a59c0574805f'); SELECT * FROM available_rooms_by_hotel_date WHERE hotel_id = '2d76c2a1-f312-4934-83ae-a59c0574805f'; SELECT * FROM available_rooms_by_hotel_date WHERE hotel_id IN ('2d76c2a1-f312-4934-83ae-a59c0574805f');