Step-by-step, complete, practical implementation of the example used by Artem Chebotko in his breakthrough whitepaper on data modeling for Apache Cassandra. The focus here is on an immediate, easy migration from a relational database (and relational model) to denormalized queries and automatically generated CQL scripts.

1. Introduction

In 2015, Artem Chebotko (a Solutions Architect at DataStax), together with Andrey Kashlev (creator of the Kashlev Data Modeler) and Shiyong Lu, published the whitepaper A Big Data Modeling Methodology for Apache Cassandra, a breakthrough for data modeling with Apache Cassandra. The document quickly walks through the migration of an ER model (in Chen notation) to a set of Cassandra tables, exposed as queries in a Chebotko diagram.

This article goes even further, with the end-to-end practical implementation of their data model:

  1. We start with a DDL script that creates and populates an actual small PostgreSQL database for the same model.
  2. We connect to this database from Data Xtractor (which you can download and try for free here) and reverse-engineer its relational database model.
  3. We denormalize the model with visually designed queries, using the same use cases from the whitepaper. We create an Application Workflow diagram and a Chebotko Diagram.
  4. We automatically generate scripts and data from our queries and models, acting as an ETL (Extract, Transform and Load) module.
  5. We import and test the exported scripts in a free DataStax Astra cloud Cassandra keyspace.

Check also How to Migrate a Relational Database to Cassandra (with Data Xtractor).

2. Explore Relational Model

CREATE TABLE "Users" (
  user_id int NOT NULL,
  name VARCHAR(50) NOT NULL,
  email VARCHAR(50) DEFAULT NULL,
  expertise VARCHAR(255) NOT NULL,
  CONSTRAINT "PK_Users" PRIMARY KEY (user_id)
);
CREATE TABLE "Venues" (
  venue_name VARCHAR(5) NOT NULL,
  venue_year int NOT NULL,
  country VARCHAR(50) DEFAULT NULL,
  homepage VARCHAR(100) DEFAULT NULL,
  topics VARCHAR(255) DEFAULT NULL,
  CONSTRAINT "PK_Venues" PRIMARY KEY (venue_name, venue_year)
);
...
INSERT INTO "Users" (user_id, name, email, expertise) VALUES (1, 'Mark Jones', '[email protected]', 'Music, Internet, IoT');
INSERT INTO "Users" (user_id, name, email, expertise) VALUES (2, 'Mary Gray', '[email protected]', 'C#, .NET, Python');
INSERT INTO "Users" (user_id, name, email, expertise) VALUES (3, 'Cristi S.', '[email protected]', 'Architecture');
INSERT INTO "Users" (user_id, name, email, expertise) VALUES (4, 'Toby Williams', '[email protected]', 'Painting');
INSERT INTO "Users" (user_id, name, email, expertise) VALUES (5, 'Ashavin Rashid', '[email protected]', 'knows nothing');

INSERT INTO "Venues" (venue_name, venue_year, country, homepage, topics) VALUES ('ICWS', 2008, 'Germany', 'myhomepage.com', 'Anything about Cassandra');
INSERT INTO "Venues" (venue_name, venue_year, country, homepage, topics) VALUES ('CWD', 2001, 'US', 'cassandra.org', 'Development, architecture');
INSERT INTO "Venues" (venue_name, venue_year, country, homepage, topics) VALUES ('SCC', 1997, 'Canada', 'somethinghere.ca', 'Python, Java, C++');
INSERT INTO "Venues" (venue_name, venue_year, country, homepage, topics) VALUES ('SCC', 2012, 'Canada', 'somethingelse.ca', NULL);
INSERT INTO "Venues" (venue_name, venue_year, country, homepage, topics) VALUES ('AXX', 2015, 'US', NULL, NULL);
...

The previous DDL script creates and populates a small PostgreSQL database, through pgAdmin or another free database tool. Create a new DigitalLibrary database, then copy, paste and run the script in the Query Tool.

Download and install Data Xtractor on Windows (a free 7-day trial starts automatically). Connect to the previous PostgreSQL database. Add a Structural Model from the standard demo models, and customize it a bit. The following relational database model follows closely the data model described through ER diagrams in the whitepaper:

3. Create Application Workflow

An Application Workflow diagram describes query-based use cases required by the application. You may create this in Data Xtractor with Notes, in an empty diagram. For the links, select them two by two, and use the Connect Shape contextual menu command:

4. Design Denormalized Queries

Here is the detailed query-first design of the future Cassandra tables, in Data Xtractor (click to enlarge on any query diagram).

As general rules:

  1. Right-click the table you have to start with, and Add to a New Built Query (or drag-and-drop it on an existing query). Drag-and-drop outside relationship items if you need to walk through joins to related elements.
  2. Uncheck the shape items you don’t want. Hide the checkboxes when done (to avoid clutter).
  3. Rename the projection items by double-clicking the column headers. Always turn pagination off for these queries, with no Skip or Keep.
  4. Add one or more WHERE clauses for fields you will use as partition keys. Add fields you will use as clustering keys to ORDER BY.
  5. Add other required elements, such as GROUP BY and aggregate functions, or Formula expressions.
  6. Periodically click the Save All toolbar button, to avoid losing your work.

4.1. Artifacts_by_venue

Q1. Find artifacts published in a venue with a given name after a given year. Order results by year (DESC).

Since the exact match is on “venue_name”, this will be your partition key, and we add a WHERE equality clause in Data Xtractor (WHERE venue_name = ‘SCC’). Additional filtering and ordering are required on “venue_year”, so this will be your clustering key, sorted in descending order in Data Xtractor. For the filter, we added WHERE venue_year > 2000. (Remember these queries can be used for tests in Cassandra, after import. They also automatically determine the partition and clustering key icons in the Chebotko diagram.)

The same venue can have multiple artifacts, so we need to add “artifact_id” as a clustering key as well, to get a unique primary key.

We also need the average rating, so we add a GROUP BY clause and aggregate the multiple Reviews.rating values for each venue artifact. In Formula, we called two contextual functions: we first truncated the value to an integer, and made sure to return zero if there are no ratings for some artifacts. This was mandatory, because we used LEFT JOINs here.
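Behind the visual design, the SQL of this aggregation could look roughly like the following sketch (the exact query generated by Data Xtractor may differ, and the “Artifacts” and “Reviews” tables with their join columns are elided from the DDL excerpt above, so their names here are assumptions):

```sql
-- Sketch of the relational query behind Artifacts_by_venue.
-- The LEFT JOIN keeps artifacts with no reviews; COALESCE turns their
-- missing average into 0; TRUNC cuts it down to the integer part.
SELECT v.venue_name,
       v.venue_year,
       a.artifact_id,
       TRUNC(COALESCE(AVG(r.rating), 0)) AS avg_rating,
       a.artifact_title
FROM "Venues" v
JOIN "Artifacts" a
  ON a.venue_name = v.venue_name AND a.venue_year = v.venue_year
LEFT JOIN "Reviews" r
  ON r.artifact_id = a.artifact_id
GROUP BY v.venue_name, v.venue_year, a.artifact_id, a.artifact_title;
```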

You can switch on Hide FROM Clause and take a look at the query, the way it will be exported and used later on in Cassandra:

SELECT venue_name, venue_year, artifact_id, avg_rating, artifact_title, authors, keywords
FROM "Artifacts_by_venue"
WHERE venue_name = 'SCC' AND venue_year > 2000
ORDER BY venue_year DESC, artifact_id

4.2. Artifacts_by_author

Q2. Find artifacts published by a given author. Order results by year (DESC).

This is somewhat similar to the previous query, but the partition key (“authors”) is already in the Artifacts table. The same fields are used as clustering keys.

Authors, keywords, and expertise would be Cassandra collections. But we have no special representation for them in Data Xtractor yet; they must eventually be customized later, in the DDL script, before running in Cassandra. What we simply do here is select the whole value of the “authors” field, not just one list item.
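As a sketch of that later customization (an assumption, not what Data Xtractor generates), a text column could be hand-edited into a CQL collection before running the script:

```sql
-- Hypothetical hand-edit of the generated DDL: "keywords" as a CQL set.
-- Note that collection columns cannot be part of the primary key, so
-- "authors" must stay a plain text column in this table.
CREATE TABLE IF NOT EXISTS "Artifacts_by_author" (
  authors text,
  venue_year int,
  artifact_id int,
  avg_rating float,
  artifact_title text,
  keywords set<text>,
  venue_name text,
  PRIMARY KEY (authors, venue_year, artifact_id)
) WITH CLUSTERING ORDER BY (venue_year DESC, artifact_id ASC);
```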

4.3. Users_by_artifact

Q3. Find users who liked a given artifact.

No need for aggregations here, and no need for outer joins, as we start our selection from LikesArtifacts, whose primary key columns can never hold NULL values. Note that the “artifact_id” value required by all the queries from the second row of the Chebotko diagram has already been returned by the queries from the top row.

The same artifact can be liked by more than one user, so we need to add “user_id” as a clustering key, to get unique primary key values.

4.4. Experts_by_artifact

Q4. Find users who liked a given artifact and who have expertise in a certain area.

The difference from the previous query is the addition of another partition key (expertise), to get a composite partition key. Every field that requires an exact match in your use case definition will be considered a partition key. When sorting may also be needed on the same field, it will rather be used as a clustering key.
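In CQL, a composite partition key is expressed with an extra pair of parentheses. A sketch for this table (the non-key column names are assumptions based on the query fields) could be:

```sql
-- Sketch: artifact_id and expertise together form the composite
-- partition key (note the inner parentheses); user_id is the
-- clustering key that makes each row unique.
CREATE TABLE IF NOT EXISTS "Experts_by_artifact" (
  artifact_id int,
  expertise text,
  user_id int,
  user_name text,
  email text,
  PRIMARY KEY ((artifact_id, expertise), user_id)
);
```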

4.5. Ratings_by_artifact

Q5. Find an average rating of a given artifact.

Again, this is an aggregate query, returning the SQL COUNT and SUM of the same rating field. In CQL, they may later be replaced with Cassandra “counter” columns, but our ETL will export the calculated aggregate values.
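A sketch of the counter-based alternative mentioned above (an assumption; table and column names are hypothetical, and counters can only be modified through UPDATE, never set with INSERT):

```sql
-- Hypothetical counter version of Ratings_by_artifact: counter columns
-- can only be incremented or decremented.
CREATE TABLE IF NOT EXISTS "Ratings_by_artifact_counters" (
  artifact_id int PRIMARY KEY,
  num_ratings counter,
  sum_ratings counter
);

-- Recording a new rating of 8 for artifact 104:
UPDATE "Ratings_by_artifact_counters"
SET num_ratings = num_ratings + 1,
    sum_ratings = sum_ratings + 8
WHERE artifact_id = 104;
```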

4.6. Venues_by_user

Q6. Find venues that a given user liked.

This is the first denormalized query from the bottom row of the diagram. Note that all these queries filter “by_user”, because we got the user_id from one of the queries from the previous row.

Because of the one-to-many cardinality of the rows returned, we must add additional fields as clustering keys, to get unique primary key values.
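As a sketch (the non-key columns are assumptions, taken from the Venues table), the uniqueness requirement translates to:

```sql
-- Sketch: user_id is the partition key; venue_name and venue_year are
-- added as clustering keys, so each venue liked by a user gets its
-- own row.
CREATE TABLE IF NOT EXISTS "Venues_by_user" (
  user_id int,
  venue_name text,
  venue_year int,
  country text,
  homepage text,
  PRIMARY KEY (user_id, venue_name, venue_year)
);
```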

4.7. Artifacts_by_user

Q7. Find artifacts published after a certain year that a given user liked. Order results by year (DESC).

“venue_year” is always used for the descending order, but it may also serve as a range filter, so it becomes a clustering key.

4.8. Reviews_by_user

Q8. Find reviews posted by a given user with a rating >=x. Order results by rating (DESC).

Just like before, another field (rating) may be used here both for sorting and as a range filter.
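A sketch of the corresponding CQL read (the user id, the rating threshold, and the column names other than user_id and rating are illustrative assumptions):

```sql
-- Sketch: rating is the first clustering key (sorted DESC), so it can
-- serve both the range filter and the ordering in the same query;
-- artifact_id (assumed column) would be a further clustering key
-- added for uniqueness.
SELECT user_id, rating, artifact_id
FROM "Reviews_by_user"
WHERE user_id = 3 AND rating >= 5
ORDER BY rating DESC;
```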

4.9. Artifacts

Q9. Find information about an artifact with a given id.

This last query may look like a simple lookup table, but it also collects the aggregate rating from Reviews, for each artifact. There is however no need for clustering keys, because for each artifact_id value (the partition key) there is only one row returned.

5. Create Chebotko Diagram

Save all the previous queries, then select all their nodes in the hierarchy on the left, and Add to New Model – Structural Model, to create a new diagram with all the shapes expanded. Show Shape Descriptions, and generic types (for the physical model). Note that the types are generic and cannot be customized on the diagram. You can however change the generated DDL file later.

Select the shapes two by two, right-click and use Connect Shapes, to trace a link between them. The Chebotko Diagram follows closely the paths from the Application Workflow diagram.

Partition keys appear with an equal sign icon, clustering keys with up or down arrows. We chose not to display K or C, because this works for any Data Xtractor designed query with fields used in WHERE or ORDER BY. The beauty is, when a Cassandra CQL file is generated from this model, the actual mapping to Cassandra specifics will be performed.

6. Export ETL Scripts

ETL means Extract, Transform and Load. The Extract phase is presented below, with three main types of auto-generated scripts. Transform may require some additional steps on the exported data. And Load will be the final step in this tutorial, with an import into a Cassandra keyspace.

6.1. CREATE TABLE Statements

With the Chebotko Diagram open, call the model export command. The only current choice is for Apache Cassandra. This creates a CQL file with DROP and CREATE TABLE statements. Just comment out the first part if you don’t want to drop and recreate everything.

DROP TABLE IF EXISTS "Artifacts_by_venue";
DROP TABLE IF EXISTS "Artifacts_by_author";
DROP TABLE IF EXISTS "Users_by_artifact";
DROP TABLE IF EXISTS "Experts_by_artifact";
DROP TABLE IF EXISTS "Ratings_by_artifact";
DROP TABLE IF EXISTS "Venues_by_user";
DROP TABLE IF EXISTS "Artifacts_by_user";
DROP TABLE IF EXISTS "Reviews_by_user";
DROP TABLE IF EXISTS "Artifacts";

-- Q1. Find artifacts published in a venue with a given name after a given year. Order results by year (DESC).
CREATE TABLE IF NOT EXISTS "Artifacts_by_venue" (
  venue_name text,
  venue_year int,
  artifact_id int,
  avg_rating float,
  artifact_title text,
  authors text,
  keywords text,
  PRIMARY KEY (venue_name, venue_year, artifact_id)
) WITH CLUSTERING ORDER BY (venue_year DESC, artifact_id ASC);

-- Q2. Find artifacts published by a given author. Order results by year (DESC).
CREATE TABLE IF NOT EXISTS "Artifacts_by_author" (
  authors text,
  venue_year int,
  artifact_id int,
  avg_rating float,
  artifact_title text,
  keywords text,
  venue_name text,
  PRIMARY KEY (authors, venue_year, artifact_id)
) WITH CLUSTERING ORDER BY (venue_year DESC, artifact_id ASC);
...

6.2. INSERT INTO Statements

An alternative would be to generate CSV files, but let’s use one single CQL file for this small test case.

Close all your open documents, select all your Cassandra query nodes, right-click and Export Results as INSERT INTO. Skip all WHERE clauses, as you want to export all the data: this means all your designed queries will be temporarily executed with no WHERE clause. The export is slow and interactive, as data is copied from each result grid on screen.

INSERT INTO "Artifacts" (artifact_id, avg_rating, artifact_title, authors, keywords, venue_name, venue_year) VALUES (101, 10, 'Migrating Databases', 'Jeffrey Jones', 'databases, RDBMS', 'ICWS', 2008);
INSERT INTO "Artifacts" (artifact_id, avg_rating, artifact_title, authors, keywords, venue_name, venue_year) VALUES (102, 9.5, 'Cassandra Data Modeling', 'Jeff Carter', 'modeling', 'CWD', 2001);
INSERT INTO "Artifacts" (artifact_id, avg_rating, artifact_title, authors, keywords, venue_name, venue_year) VALUES (105, 6, 'Chebotko Diagrams with Data Xtractor', 'Cristian S.', NULL, 'SCC', 2012);
INSERT INTO "Artifacts" (artifact_id, avg_rating, artifact_title, authors, keywords, venue_name, venue_year) VALUES (108, 0, 'Another artifact', NULL, NULL, 'ICWS', 2008);
INSERT INTO "Artifacts" (artifact_id, avg_rating, artifact_title, authors, keywords, venue_name, venue_year) VALUES (107, 0, 'About anything...', NULL, NULL, 'ICWS', 2008);
INSERT INTO "Artifacts" (artifact_id, avg_rating, artifact_title, authors, keywords, venue_name, venue_year) VALUES (104, 0, 'Cassandra with Data Xtractor', 'Cristian S.', 'Cassandra, Data Xtractor', 'SCC', 1997);
INSERT INTO "Artifacts" (artifact_id, avg_rating, artifact_title, authors, keywords, venue_name, venue_year) VALUES (103, 0, 'Introduction to Cassandra', 'Artem C., John W.', 'Cassandra, DataStax', 'SCC', 1997);
INSERT INTO "Artifacts" (artifact_id, avg_rating, artifact_title, authors, keywords, venue_name, venue_year) VALUES (106, 0, 'More Chebotko Diagrams', 'Cristian S.', NULL, 'AXX', 2015);
INSERT INTO "Artifacts" (artifact_id, avg_rating, artifact_title, authors, keywords, venue_name, venue_year) VALUES (110, 0, 'NoSQL', NULL, NULL, 'CWD', 2001);
INSERT INTO "Artifacts" (artifact_id, avg_rating, artifact_title, authors, keywords, venue_name, venue_year) VALUES (109, 0, 'OQL Development', 'Jerry K, Richard W.', NULL, 'CWD', 2001);

INSERT INTO "Artifacts_by_author" (authors, venue_year, artifact_id, avg_rating, artifact_title, keywords, venue_name) VALUES ('Cristian S.', 2015, 106, 0, 'More Chebotko Diagrams', NULL, 'AXX');
INSERT INTO "Artifacts_by_author" (authors, venue_year, artifact_id, avg_rating, artifact_title, keywords, venue_name) VALUES ('Cristian S.', 2012, 105, 6, 'Chebotko Diagrams with Data Xtractor', NULL, 'SCC');
INSERT INTO "Artifacts_by_author" (authors, venue_year, artifact_id, avg_rating, artifact_title, keywords, venue_name) VALUES ('Jeffrey Jones', 2008, 101, 10, 'Migrating Databases', 'databases, RDBMS', 'ICWS');
INSERT INTO "Artifacts_by_author" (authors, venue_year, artifact_id, avg_rating, artifact_title, keywords, venue_name) VALUES ('John Doe', 2008, 107, 0, 'About anything...', NULL, 'ICWS');
INSERT INTO "Artifacts_by_author" (authors, venue_year, artifact_id, avg_rating, artifact_title, keywords, venue_name) VALUES ('Beth Williams', 2008, 108, 0, 'Another artifact', NULL, 'ICWS');
INSERT INTO "Artifacts_by_author" (authors, venue_year, artifact_id, avg_rating, artifact_title, keywords, venue_name) VALUES ('Jeff Carter', 2001, 102, 9.5, 'Cassandra Data Modeling', 'modeling', 'CWD');
INSERT INTO "Artifacts_by_author" (authors, venue_year, artifact_id, avg_rating, artifact_title, keywords, venue_name) VALUES ('Jerry K, Richard W.', 2001, 109, 0, 'OQL Development', NULL, 'CWD');
INSERT INTO "Artifacts_by_author" (authors, venue_year, artifact_id, avg_rating, artifact_title, keywords, venue_name) VALUES ('BB King', 2001, 110, 0, 'NoSQL', NULL, 'CWD');
INSERT INTO "Artifacts_by_author" (authors, venue_year, artifact_id, avg_rating, artifact_title, keywords, venue_name) VALUES ('Artem C., John W.', 1997, 103, 0, 'Introduction to Cassandra', 'Cassandra, DataStax', 'SCC');
INSERT INTO "Artifacts_by_author" (authors, venue_year, artifact_id, avg_rating, artifact_title, keywords, venue_name) VALUES ('Cristian S.', 1997, 104, 0, 'Cassandra with Data Xtractor', 'Cassandra, Data Xtractor', 'SCC');
...

6.3. SELECT Statements

You may also export the SQL of your queries into a file, but this would rather dump the SQL generated for the PostgreSQL tables. There is another way to generate SQL for the tables created from these queries: reopen the Chebotko Diagram, and call the top menu command Objects-Export-Model with SELECT Queries.

The same syntax can be generated per query, in the query designer, when you select Hide FROM Clause.

-- Q1. Find artifacts published in a venue with a given name after a given year. Order results by year (DESC).

SELECT venue_name, venue_year, artifact_id, avg_rating, artifact_title, authors, keywords
FROM "Artifacts_by_venue"
WHERE venue_name = 'SCC'
   	AND venue_year > 2000
ORDER BY venue_year DESC, artifact_id;

-- ===========================================================
-- Q2. Find artifacts published by a given author. Order results by year (DESC).

SELECT authors, venue_year, artifact_id, avg_rating, artifact_title, keywords, venue_name
FROM "Artifacts_by_author"
WHERE authors = 'Jeffrey Jones'
ORDER BY venue_year DESC, artifact_id;

-- ===========================================================
-- Q3. Find users who liked a given artifact.

SELECT artifact_id, user_id, user_name, email, expertise
FROM "Users_by_artifact"
WHERE artifact_id = 104
ORDER BY user_id;
...

Cassandra Query Language by Examples: Puzzles with Answers is a new Kindle e-book I recently published.

Learn how to properly design tables in CQL, and how to query Cassandra tables without paying a price in performance.

When to create a new denormalized table, secondary index or materialized view.

How to avoid traps, gotcha situations, hidden issues you may not know about.

Or try the same puzzles LIVE, on Udemy!

7. Import into Cassandra, with DataStax Studio

Get a free 10GB cloud Cassandra keyspace on DataStax Astra (no credit card required). Once a database with the keyspace is created, click on your database name, then launch Developer Studio with a new DigitalLibrary notebook.

Copy, paste and execute every script exported in the previous step, each in a separate notebook cell:

  • CREATE TABLE Statements – this will create Cassandra tables from the previous queries, with partition keys, clustering keys, and regular columns. Check them all in the Schema tab. You may of course customize the file before running it, but this may break the proper execution of the next two steps.
  • INSERT INTO Statements – this will quickly populate all your tables with some denormalized test data.
  • SELECT Statements – you may now select the text of each query and run them one by one, checking the results.

Categories: Apache Cassandra

Cristian Scutaru

I designed and implemented the Data Xtractor suite, with Model Xtractor, Query Xtractor, and Visual Xtractor as separate modules. I am a software architect and developer with over 30 years professional experience. I’ve been working with relational databases for almost three decades and I was constantly unhappy with the relative limitation of those tools used to connect directly to a platform, and instantly extract and display data in flexible ways.