How to Migrate a Relational Database to Cassandra (with Data Xtractor)
This is an end-to-end practical guide about denormalizing and migrating a relational database to Apache Cassandra, using Data Xtractor. It involves connecting to a PostgreSQL database from Data Xtractor, denormalizing tables through visual queries, exposing these queries in Chebotko diagrams, generating CQL scripts and data for Cassandra, and actually importing and querying data in a free cloud Cassandra instance.
Table of Contents
Introduction: Cassandra Data Modeling Tools
After so many years, the Data Modeling section from the excellent Cassandra: The Definitive Guide, by Jeff Carpenter and Eben Hewitt, is still one of the few detailed examples on how to model a database in Apache Cassandra, on denormalized tables from a typical relational database. Jeff led a team of developer at DataStax.
Last chapter from this section recommends a few data modeling tools for Cassandra, that I tried when a client was looking for an easier migration of his large relational database to Amazon Keyspaces.
Kashlev Data Modeler (KDM) is an interesting online designer, requiring an initial ERD in Chan notation. In 2015, Andrey Kashlev co-authored with Artem Chebotko the breakthrough whitepaper A Big Data Modeling Methodology for Cassandra, describing this query-first approach The tool can auto-generate multiple possible patterns based on the cardinality of typical RDBMS relationships, but does not actually display connected Cebotko diagrams.
Second tool I tried is the polyglot Hackolade, with support for dozens of NoSQL databases. The tool can model already imported Cassandra tables, and is not focused on migration.
DataStax Dev Center has been replaced by the online Dev Studio for DataStax Astra, with limited graphical features for Cassandra data modeling. It’s been also suggested that tools like Jaspersoft, Talend or Pentaho could be better suited for ETL, to serve such migrations. Still, none of them supports Cebotko diagrams or the data flow presented by Jeff. And the BI tools are rather huge, more generic and way too complex for the task.
I’ve been honestly surprised when I tried just out of curiosity my good old Data Xtractor, to see how far I can advance with them for the task in sight. And I’ve seen I was closer to get a functional model than most other tools presented before. For the past few days I just added the missing support and came up with a new release that almost fully supports Jeff’s data migration scenario.
While still in early stages and experimental, with some missing features and not yet fully tested, it is by far the only tool I know that can create Chebotko diagrams and easily export denormalized queries and INSERT INTO statements into CQL. I already used it for two large client migrations, from SQL Server and PostgreSQL (one for production and one for further benchmarking). This is a smaller but fully functional end-to-end migration example.
All intermediate files have been included in full in this ZIP archive.
Create a PostgreSQL Relational Database
Run pgAdmin and create a new HotelReservations database. This could be on a local or remote PostgreSQL, your choice. Then load and execute the DDL script HotelReservations-PostgreSQL.sql included in the ZIP archive, partially exposed here below:
CREATE TABLE "Hotel" ( "HotelID" CHAR(36) NOT NULL, "Name" VARCHAR(45) NOT NULL, "Phone" VARCHAR(15) NOT NULL, "Address" VARCHAR(255) NOT NULL, CONSTRAINT "PK_Hotel" PRIMARY KEY ("HotelID") ); CREATE TABLE "PointOfInterest" ( "POIID" CHAR(36) NOT NULL, "Name" VARCHAR(45) NOT NULL, "Description" VARCHAR(255) DEFAULT NULL, CONSTRAINT "PK_POI" PRIMARY KEY ("POIID") ); ... INSERT INTO "Hotel" ("HotelID", "Name", "Phone", "Address") VALUES ('2d76c2a1-f312-4934-83ae-a59c0574805f', 'Motel 6', '(604) 555-2365', '606-3727 Ullamcorper. Street, Roseville NH 11523'); INSERT INTO "Hotel" ("HotelID", "Name", "Phone", "Address") VALUES ('5061bde8-f6f4-4767-9b92-cff82d9ab52b', 'Super 8', '(502) 555-2193', '5037 Diam Rd, Daly City OH 90255'); INSERT INTO "Hotel" ("HotelID", "Name", "Phone", "Address") VALUES ('b93a7112-f4dd-41cc-b529-d5d79114078e', 'Holiday Inn', '(401) 555-3398', '969-1762 Tincidunt Rd, Boise CT 35282'); INSERT INTO "PointOfInterest" ("POIID", "Name", "Description") VALUES ('a388b6bf-70d2-4037-bb95-62fe5dc33851', 'Lake Houron', 'Peaceful oasis with ducks and other birds'); INSERT INTO "PointOfInterest" ("POIID", "Name", "Description") VALUES ('47e8cce9-a973-4fbd-b2ff-0a85e4f613d9', 'Navy Museum', 'Best for a rainy day...'); INSERT INTO "PointOfInterest" ("POIID", "Name", "Description") VALUES ('f776cdd8-6451-4dab-8713-7b3b2039945e', 'Statue of Liberty', 'That''s a very nice statue'); ...
This creates a relational database with a data model close to the one presented by Jeff. I just fixed some minor inconsistencies and avoided some problematic data types. The script also creates some test data, through INSERT INTO statements.
Connect to the Relational Database
Download and install a free trial version of Data Xtractor, then connect to our PostgreSQL relational database and import its metadata. In the New Database Connection popup, with PostgreSQL database type selected, you must fill in the Host Name/IP Address and Port, the User Name and Password, and the Database (HotelReservations must be found within the list, if all other parameters are correct). Test your connection eventually (you must get 9 existing tables as response). Click OK and wait for the importing of metadata to finish.
Click on Models-Demo Models-add demo models, and Add all of them. Open the Tables/View Structural Model. Data and Model Xtractor may automatically hide (to eliminate clutter) physical intersection tables, and represent many-to-many relationships without such implementation details. Find Show All Intersection Tables and Hide All Relationship Items switches, to show a typical RDBMS diagram, similar to Jeff’s. Expand all shapes, and switch to a standard Crow’s Foot notation if you wish:
Application Workflow Diagrams
Now let the fun begin and start denormalizing the database, through view-like queries. We’ll follow closely Jeff’s queries and recommendations.
We’ll start with a conceptual draft of our final result, outlining the operations we need in an Application Workflow diagram. In Data Xtractor, create a new empty model, then add separate Notes to it. Each note will contain the description of an actual Cassandra query. Beside Q1..Q9, we’ll also have a “Book a room” different operation, which we can display like a transparent note. Last, select pair of notes, with the source first, then choose “Connect Shapes” from the contextual menu: a symbolic connector will appear between the two blocks:
Expand now the Queries folder and click the “add query” hyperlink 9 times in a row. Don’t worry about the empty queries created on screen, in fact close them all now. Click on the query nodes one by one and rename them with the same names used by Jeff. Also, in the Objects tab, fill in the Description field with “Q1. …”, “Q2. …” etc use cases provided by Jeff. (Remark that our descriptions are part of the shape, not link labels!)
You can now move all these new empty queries under a Cassandra Queries folder, then select them all and Add to New Model – Relationship Model. Rearranged a bit, with no grid and with Show Shape Descriptions, with two added Notes as container keyspaces, this is what you could get on your screen (click to enlarge):
Denormalize the Relational Database
Next step is to fill-in each query with the fields we need selected. Double-click on hotels_by_poi to open the query in design mode. Set No Pagination.
From under the Tables-public node, drag and drop the PointOfInterest table node into the empty design area. Then drag and drop HotelToPOI relationship item in the empty area. From the HotelToPOI shape, drag and drop the Hotel relationship outside. You must have now on screen three shapes interconnected, like in the image below.
Uncheck the fields we don’t need: keep only Name from PointOfInterest, and all fields from Hotel. Rename them according to Jeff, by double-clicks in grid’s column headers. Remove the first constant “value” column (added by default to avoid an empty query). (From time to time, do not forget to click on Save All big toolbar button, to keep all your changes!)
Let’s run this query:
We are not done, because we cannot run the query like this in Cassandra, we must define a partition key and optional cluster keys. There are two main key-related rules when designing queries for Apache Cassandra:
- Any search should select one single top element, stored in the same partition or node, with an exact match on the partition key.
- When returning multiple related items, all these items must be stored together within the same location, and eventually sorted by the defined cluster keys.
For our query, we will “suggest” existence of the partition key with a WHERE clause, and of the cluster key with an ORDER BY, as it follows. Switch back to SQL mode, and display Where and Sort query designers, with the More button. When executed, the query will now return only two entries, and this is how it will be used in Cassandra. Hide the checkboxes from the design area, and look at the automatically generated SQL:
Follow the exact same pattern when you design the rest of the Q2..Q9 queries:
- hotels (Q2) – this is a lookup table, selecting one single Hotel table entry by hotel_id. No Sort, and you may use ‘2d76c2a1-f312-4934-83ae-a59c0574805f’ for Motel 6.
- pois_by_hotel (Q3) – from HotelToPOI expand the PointOfInterset relationship. Use the hotel_id of Motel 6 as WHERE clause, and ORDER BY poi_name. This will return Lake Houron and Navy Museum.
- available_rooms_by_hotel_date (Q4) – get RoomAvailability table and expand its Room relationship. Use the same hotel_id as before, and sort result by start_date and room_number.
- amenities_by_room (Q5) – start with a Room, and expand its RoomToAmenity relationship, then the Amenity relationship. Keep (denormalized!) room’s rate, with the amenity_name and its description. Filter by both hotel_id and room_id (the room_id alone is not an unique identifier in our model!). Use for a change the id of Holiday Inn, and ‘kdefgt’ as room id. This should return two related amenities (Wi-Fi and pool). Sort by amenity_name.
- reservations_by_confirmation (Q6) – you need only the Reservation table, with all fields. As one confirmation_number can be the same for multiple hotels, sort by hotel_id, and filter with ‘16380824’ as number.
- reservations_by_hotel_date (Q7) – this is a variation of Q6, using the exact same table, fields and data. However, the denormalization is necessary because we search and sort by different keys. You may filter the reservations for Motel 6 on 2019-03-10, sorted by room_id. Remark both filters require fixed values, to hit the same partition. If you intend to use a range for the start_day, it must be rather used as cluster key (not partition key), with ORDER BY (not WHERE).
- reservations_by_guest (Q8) – use all fields from Reservation, like before, but you also need LastName from the related Guest table. You’ll search now by the guest name (like Bernard, who has four reservations), and sort by hotel and guest ids.
- guests (Q9) – another lookup table (like hotels), selecting one single Guest entry, by guest_id. No Sort, and you may use ‘c262832f-6f09-4a4a-8e92-447304fcebd8’ for Louis Bernard.
Create a Chebotko Diagram
Artem Chebotko is still a Solution Architect at DataStax at this moment, and his interesting notation allows us to better visualize and understand denormalized queries prepared for Cassandra.
Select all our queries nodes, then click on Add to New Model – Structural Model. All our queries appear as expanded shapes, with no links between, in a new model diagram. Equal sign icons (=) identify query fields with a WHERE filter, and these translate directly into Cassandra partition keys. Up/down arrow icons identify fields with ORDER BY in our queries, and they translate directly into Cassandra cluster keys. From now on, this is a generic notation for all our queries in Data Xtractor, regardless of database type, and this is why we avoided the very specific Cassandra notation with K and C. But the fields are easily identified, and this is in fact a basic Chebotko diagram.
To connect the queries, select them two at a time, starting with the original shape, and click on Connect Shapes, from the contextual menu. Any query shapes or notes can be connected this way. The link is rather symbolic and always allowed. However, the partition key values of a target shape should be usually provided by the source shape. When no incoming connectors, the partition key values are assumed to be known and entered by the user. For instance, poi_name of Q1 is assumed to be known and entered by the user. Bot the hotel_id of Q2 is collected from the previous linked query.
With Show Shape Descriptions, Show Data Types, Show Generic Types, and a bit of repositioning, we get to this version of a Chebotko diagram, very close to Jeff’s representation. Remark the “Book a Room” transparent node, also connected to query shapes, which is a different data entry operation:
Unlike our model, original Chebotko diagrams show Qx as connector labels. Qx are rather displayed here as query descriptions, and there are no entry links with no source shape. In Chebotko’s notation, Qx were rather use cases, not queries. And it was possible to have more than one use case implemented with the same query. To do this here, simply add multiple Qx descriptions to the same shape.
Generate ETL Queries and Data
All this looks nice – you may say – but how do we get now our queries and data migrated to Cassandra? Are these diagrams just visually appealing and that’s it? No, we added other new ETL (Extract, Transform and Load) features to get it close to an automated process, and make it worth investing in our tool.
(a) From our last Chebotko Diagram, you may call the Objects-Export-Model as DDL Script menu command. Apache Cassandra is the only available option at this time. This will generate a DDL (Data Definition Language) script (HotelReservations-CREATE.oql in our archive), in OQL (Cassandra Query Language), with CREATE TABLE statements auto-generated from our queries. First section, with DROP TABLE statements, can be removed or commented out all together, if you don’t want to override an existing keyspace.
While our Chebotko diagrams do not allow (yet) indexes, materialized views, collections and other specific physical types, you can always customize this OQL file before running it in Cassandra:
DROP TABLE IF EXISTS hotels_by_poi; DROP TABLE IF EXISTS pois_by_hotel; DROP TABLE IF EXISTS reservations_by_confirmation; DROP TABLE IF EXISTS reservations_by_hotel_date; DROP TABLE IF EXISTS reservations_by_guest; DROP TABLE IF EXISTS guests; DROP TABLE IF EXISTS available_rooms_by_hotel_date; DROP TABLE IF EXISTS amenities_by_room; DROP TABLE IF EXISTS hotels; CREATE TABLE IF NOT EXISTS hotels_by_poi ( poi_name text, hotel_id text, name text, phone text, address text, PRIMARY KEY (poi_name, hotel_id) ) WITH CLUSTERING ORDER BY (hotel_id ASC) WITH COMMENT = 'Q1. Find hotels near a given point of interest.'; CREATE TABLE IF NOT EXISTS pois_by_hotel ( hotel_id text, poi_name text, description text, PRIMARY KEY (hotel_id, poi_name) ) WITH CLUSTERING ORDER BY (poi_name ASC) WITH COMMENT = 'Q3. Find points of interest near a given hotel.'; ...
(b) You may also select all query nodes, and use the Export Results as INSERT INTO contextual menu command. This will add generic SQL statements (supported by OQL, but not necessarily Cassandra specific) to populate our database. You may want to skip the WHERE clauses. This is a slow command, because the queries will be opened, executed and closed one by one on screen, and data is actually copied from the result grid as it is. If you used pagination, only the first grid page will be returned. A partial view of our HotelReservations-INSERT.sql file:
INSERT INTO amenities_by_room (hotel_id, room_id, rate, amenity_name, description) VALUES ('2d76c2a1-f312-4934-83ae-a59c0574805f', 'CH-234', 150.00, 'Dining Room', 'Continental breakfast in the morning'); INSERT INTO amenities_by_room (hotel_id, room_id, rate, amenity_name, description) VALUES ('b93a7112-f4dd-41cc-b529-d5d79114078e', 'bghsdw', 150.00, 'Dining Room', 'Continental breakfast in the morning'); INSERT INTO amenities_by_room (hotel_id, room_id, rate, amenity_name, description) VALUES ('5061bde8-f6f4-4767-9b92-cff82d9ab52b', '12-WWQ', 150.00, 'Dining Room', 'Continental breakfast in the morning'); INSERT INTO amenities_by_room (hotel_id, room_id, rate, amenity_name, description) VALUES ('5061bde8-f6f4-4767-9b92-cff82d9ab52b', '12-CFG', 130.00, 'Dining Room', 'Continental breakfast in the morning'); INSERT INTO amenities_by_room (hotel_id, room_id, rate, amenity_name, description) VALUES ('5061bde8-f6f4-4767-9b92-cff82d9ab52b', '12-WWQ', 150.00, 'Internet Access', 'Wi-Fi Internet access for your computer'); INSERT INTO amenities_by_room (hotel_id, room_id, rate, amenity_name, description) VALUES ('2d76c2a1-f312-4934-83ae-a59c0574805f', 'CH-012', 150.00, 'Internet Access', 'Wi-Fi Internet access for your computer'); INSERT INTO amenities_by_room (hotel_id, room_id, rate, amenity_name, description) VALUES ('b93a7112-f4dd-41cc-b529-d5d79114078e', 'kdefgt', 190.00, 'Internet Access', 'Wi-Fi Internet access for your computer'); INSERT INTO amenities_by_room (hotel_id, room_id, rate, amenity_name, description) VALUES ('2d76c2a1-f312-4934-83ae-a59c0574805f', 'CH-234', 150.00, 'Parking', 'Free of CHARge self-parking'); INSERT INTO amenities_by_room (hotel_id, room_id, rate, amenity_name, description) VALUES ('b93a7112-f4dd-41cc-b529-d5d79114078e', 'kdefgt', 190.00, 'Swimming Pool', 'Outdoor seasonal pool'); INSERT INTO amenities_by_room (hotel_id, room_id, rate, amenity_name, description) VALUES ('5061bde8-f6f4-4767-9b92-cff82d9ab52b', '12-WWQ', 150.00, 'Swimming Pool', 'Outdoor seasonal pool'); INSERT INTO available_rooms_by_hotel_date (hotel_id, start_date, room_number, is_available) VALUES ('5061bde8-f6f4-4767-9b92-cff82d9ab52b', '2019-02-03', 14, 1); INSERT INTO available_rooms_by_hotel_date (hotel_id, start_date, room_number, is_available) VALUES ('2d76c2a1-f312-4934-83ae-a59c0574805f', '2019-02-11', 15, 0); INSERT INTO available_rooms_by_hotel_date (hotel_id, start_date, room_number, is_available) VALUES ('2d76c2a1-f312-4934-83ae-a59c0574805f', '2019-03-10', 11, 1); INSERT INTO available_rooms_by_hotel_date (hotel_id, start_date, room_number, is_available) VALUES ('b93a7112-f4dd-41cc-b529-d5d79114078e', '2019-03-10', 22, 1); INSERT INTO available_rooms_by_hotel_date (hotel_id, start_date, room_number, is_available) VALUES ('5061bde8-f6f4-4767-9b92-cff82d9ab52b', '2019-03-21', 10, 1); INSERT INTO available_rooms_by_hotel_date (hotel_id, start_date, room_number, is_available) VALUES ('b93a7112-f4dd-41cc-b529-d5d79114078e', '2019-05-02', 22, 1); INSERT INTO available_rooms_by_hotel_date (hotel_id, start_date, room_number, is_available) VALUES ('2d76c2a1-f312-4934-83ae-a59c0574805f', '2019-07-22', 11, 0); INSERT INTO available_rooms_by_hotel_date (hotel_id, start_date, room_number, is_available) VALUES ('2d76c2a1-f312-4934-83ae-a59c0574805f', '2019-08-10', 11, 1); INSERT INTO available_rooms_by_hotel_date (hotel_id, start_date, room_number, is_available) VALUES ('5061bde8-f6f4-4767-9b92-cff82d9ab52b', '2019-11-10', 10, 0); INSERT INTO available_rooms_by_hotel_date (hotel_id, start_date, room_number, is_available) VALUES ('b93a7112-f4dd-41cc-b529-d5d79114078e', '2019-12-09', 31, 0); ...
(c) The Save SQL contextual menu command, on all selected query nodes, can save the SQL of the previously executed queries. As an alternative, here is a different version, with their WHERE clauses included (as partial view of our HotelReservations-SQL.sql file):
-- amenities_by_room -- Q5. Find the rate and amenities for a room. SELECT r1."HotelID" AS hotel_id, r1."RoomID" AS room_id, r1."Rate" AS rate, a."Name" AS amenity_name, a."Description" AS description FROM public."Amenity" AS a INNER JOIN public."RoomToAmenity" AS r ON a."AmenityID" = r."AmenityID" INNER JOIN public."Room" AS r1 ON r."RoomID" = r1."RoomID" AND r."HotelID" = r1."HotelID" WHERE (r1."HotelID" = 'b93a7112-f4dd-41cc-b529-d5d79114078e' AND r1."RoomID" = 'kdefgt') ORDER BY a."Name"; -- =========================================================== -- available_rooms_by_hotel_date -- Q4. Find an available room in a given date range. SELECT r."HotelID" AS hotel_id, r."StartDate" AS start_date, r1."RoomNumber" AS room_number, r1."RoomID" AS room_id, r."Available" AS is_available FROM public."RoomAvailability" AS r INNER JOIN public."Room" AS r1 ON r."RoomID" = r1."RoomID" AND r."HotelID" = r1."HotelID" WHERE r."HotelID" = '2d76c2a1-f312-4934-83ae-a59c0574805f' ORDER BY r."StartDate", r1."RoomNumber"; ...
(d) Of course, you always have as alternative the generation of CSV files, one per query, in a folder you may specify. CSV files may load faster in Cassandra, with the COPY command, for large amounts of data.
You may select all query nodes, and use instead the Export Results as CSV Files contextual menu command. You must skip the WHERE clause as well, and pagination is always temporarily disabled by default. Processing is similar and rather slow: each query is opened, executed and closed on screen, as data is actually copied from the result grid.
(e) Finally, it would be good to have a different version for our use case queries, in OQL and using directly the tables generated for Cassandra in the DDL. You can do this as well in Data Xtractor, using the Objects-Export-Model with SELECT Queries top menu command, on the Chebotko diagram. A partial view of our HotelReservations-SELECT.cql generated file:
-- Q1. Find hotels near a given point of interest. SELECT poi_name, hotel_id, name, phone, address FROM hotels_by_poi WHERE poi_name = 'Lake Houron' ORDER BY hotel_id; -- =========================================================== -- Q3. Find points of interest near a given hotel. SELECT hotel_id, poi_name, description FROM pois_by_hotel WHERE hotel_id = '2d76c2a1-f312-4934-83ae-a59c0574805f' ORDER BY poi_name; ...
Import in Cassandra
Our final step, in this practical migration from a relational database to Apache Cassandra, is to load some files generated in the ETL step into a Cassandra database.
I strongly recommend to get a free forever 10GB cloud Cassandra keyspace on DataStax Astra (no credit card required). It’s just for one database and one keyspace, one Cassandra node with no replication, but it is great to test small jobs like ours. Once created, click on your database name, then launch Developer Studio with a new HotelReservations notebook.
(a) Copy, paste and run the previously exported CREATE TABLE statements in the first notebook cell. Check the Schema tab: you must see all created tables, with their columns, including partitioning and clustering keys.
(b) Add another notebook cell with the INSERT INTO statements, and run them. (An alternative would be to load data from the CSV files, but small test data is loaded faster from our single generated file.)
(c) Add a final cell with our exported SELECT statements (2-3 queries may require small and obvious syntax corrections in the WHERE clause). It wouldn’t make sense to execute them all together, but you can select the text of one single query and run it at a time. This will test your installation and make sure the queries return what’s expected.
Here is the one simple view of my DataStax cloud, with the Schema tab, and an executed query with its actual results:
Cassandra Query Language by Examples
Cassandra Query Language by Examples: Puzzles with Answers is a new Kindle e-book I recently published.
What you will learn or better understand:
- How to properly design tables in CQL. Data modeling with Chebotko diagrams and proper key choices.
- How to properly query Cassandra tables without paying a price in performance.
- How to properly use the partition, clustering or primary keys, in WHERE and ORDER BY clauses.
- When and what keys you may skip in a filter expression or in an explicit query sort order.
- When to create a new denormalized table, secondary index or materialized view.
- Major differences between SQL and CQL, with plenty of examples.
- How to avoid traps, gotcha situations, hidden issues you may not know about.
- What the mysterious TTL actually does. When whole rows may suddenly disappear or not.