Getting started with Amazon Redshift … and Data Xtractor

Published by Cristian Scutaru

End-to-end demo with the “Getting started” TICKIT sample used by AWS. Create and connect to an Amazon Redshift cluster. Create a demo schema and populate the tables with data, using Redshift’s Query Editor. Connect from Data Xtractor, reverse engineer the data model, and design rich queries without even typing SQL.

Create an Amazon Redshift Cluster

We’ll follow the general guidelines of the well-documented Getting started with Amazon Redshift introduction, but we will skip some complex role-related operations for the moment. We’ll also focus solely on free resources you can create with a free trial account…

We’ll move quickly through the creation of a (free) AWS account and a (free) Amazon Redshift cluster. For a free trial account, just sign up with an email address you haven’t used with AWS before, and provide your credit card information. It’s perfectly safe: you’ll never pay anything, provided you limit yourself to the free trial services offered for 12 months.

With Amazon Redshift, you can create free clusters for 3 months. Just make sure you select the “Free trial” option and the limited configuration it requires (one dc2.large node is more than enough for what we need):

As an alternative, you can also create all these under a pay-as-you-go account. Assuming you’ll clean up all your new AWS resources within two hours, this will cost you less than $50, with a minimal cluster configuration and one node.

Here I created a redshift-cluster-1 cluster (the default suggested name), with a new tickit database at the default port 5439, and the master user name awsuser (also the default suggestion) with a specific password (remember it!) to connect to the tickit database. Just use the defaults for the rest and click on “Create cluster”. In just a few minutes, the cluster is created and you can go to the Query Editor screen.

Create TICKIT Tables in Amazon Redshift

Tickit is a small data warehouse database, with fact tables and dimension tables. Take a quick look at its table descriptions.

Copy all the “create table” statements from this page and paste them into the query definition entry box of Redshift’s Query Editor. To run them, you’ll have to select each statement one by one and click the Run button (the Query Editor cannot yet run multi-statement scripts).
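For reference, here is roughly what one of those statements looks like, for the sales fact table (a sketch from memory of the AWS sample; the column types on the page you copy from take precedence):

-- Sales fact table from the TICKIT sample, with a distribution key and a sort key
create table sales(
    salesid    integer not null,
    listid     integer not null distkey,   -- distribution key
    sellerid   integer not null,
    buyerid    integer not null,
    eventid    integer not null,
    dateid     smallint not null sortkey,  -- sort key
    qtysold    smallint not null,
    pricepaid  decimal(8,2),
    commission decimal(8,2),
    saletime   timestamp);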

Now you’ll have to do the same with the following statements, which declare primary keys and foreign keys:

alter table users add primary key (userid);
alter table venue add primary key (venueid);
alter table category add primary key (catid);
alter table date add primary key (dateid);
alter table event add primary key (eventid);
alter table listing add primary key (listid);
alter table sales add primary key (salesid);

alter table event add foreign key (venueid) references venue (venueid);
alter table event add foreign key (catid) references category (catid);
alter table event add foreign key (dateid) references date (dateid);

alter table listing add foreign key (sellerid) references users (userid);
alter table listing add foreign key (eventid) references event (eventid);
alter table listing add foreign key (dateid) references date (dateid);

alter table sales add foreign key (listid) references listing (listid);
alter table sales add foreign key (sellerid) references users (userid);
alter table sales add foreign key (buyerid) references users (userid);
alter table sales add foreign key (eventid) references event (eventid);
alter table sales add foreign key (dateid) references date (dateid);

This is a separate step because Amazon Redshift doesn’t need PKs and FKs: it will not complain if you create them, but it does not enforce any referential integrity. Redshift relies instead on distribution keys and sort keys, which you will find already included in the previous “create table” statements.

It’s still a good idea to keep or add primary and foreign keys, because reverse engineered data models will carry this additional information, and query builders will know where to place joins. You’ll see later on why this is almost a requirement for Data Xtractor, and how much it helps…
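If you want to confirm that the constraints were actually registered, a catalog query along these lines should list them (a sketch; Redshift exposes the PostgreSQL pg_constraint catalog, with contype ‘p’ for primary keys and ‘f’ for foreign keys):

-- List the declared (but unenforced) primary and foreign key constraints
select c.relname as table_name, con.conname as constraint_name, con.contype
from pg_constraint con
join pg_class c on c.oid = con.conrelid
where con.contype in ('p', 'f')
order by c.relname, con.contype;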

Load TICKIT Data into Amazon Redshift

One of the best ways to load data into our tables is to go through S3. Just create a new tickit123 bucket in your Amazon S3 service, it’s free (you may have to replace “123” with some other random combination, because bucket names must be globally unique, and I used tickit123 myself). Then download this ZIP file onto your computer, extract the files, and upload them all into your new S3 bucket.

To COPY data from S3 into Amazon Redshift, we need a custom IAM role. Go to your IAM management console and create a new S3ToRedshift role, with the AmazonS3ReadOnlyAccess and AmazonRedshiftFullAccess permissions attached. The second permission may be broader than necessary, but we’ll remove everything from this demo at the end, so we’re fine for now.

You’ll also have to edit the Trust Relationship and replace it with:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "redshift.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

This is because we’ll run the COPY commands as the awsuser Redshift database user, and the Redshift service must be trusted to assume the IAM role on its behalf.

When done, go back to your Query Editor, and paste the following COPY commands (don’t run them yet, they will fail!):

copy users from 's3://tickit123/allusers_pipe.txt'
iam_role 'arn:aws:iam::478247124651:role/S3ToRedshift'
delimiter '|' region 'us-west-2';

copy venue from 's3://tickit123/venue_pipe.txt'
iam_role 'arn:aws:iam::478247124651:role/S3ToRedshift'
delimiter '|' region 'us-west-2';

copy category from 's3://tickit123/category_pipe.txt'
iam_role 'arn:aws:iam::478247124651:role/S3ToRedshift'
delimiter '|' region 'us-west-2';

copy date from 's3://tickit123/date2008_pipe.txt'
iam_role 'arn:aws:iam::478247124651:role/S3ToRedshift'
delimiter '|' region 'us-west-2';

copy event from 's3://tickit123/allevents_pipe.txt'
iam_role 'arn:aws:iam::478247124651:role/S3ToRedshift'
delimiter '|' region 'us-west-2';

copy listing from 's3://tickit123/listings_pipe.txt'
iam_role 'arn:aws:iam::478247124651:role/S3ToRedshift'
delimiter '|' region 'us-west-2';

copy sales from 's3://tickit123/sales_tab.txt'
iam_role 'arn:aws:iam::478247124651:role/S3ToRedshift'
delimiter '\t' timeformat 'MM/DD/YYYY HH:MI:SS' region 'us-west-2';

Before running them one by one, as usual, you’ll have to replace “tickit123” with your own bucket name, the ARN of the iam_role with your own, and the region code with your own (I created my cluster in Oregon).
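Once each COPY succeeds, a quick row count tells you the data landed; and if a COPY fails, the actual cause is usually recorded in the stl_load_errors system table (a sketch):

-- Quick sanity check after loading
select count(*) from sales;

-- If a COPY command fails, look up the real reason here
select query, filename, line_number, colname, err_reason
from stl_load_errors
order by starttime desc
limit 5;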

Connect to Amazon Redshift from Data Xtractor

Your cluster is private by default, so we’ll just open it to public access for now. Navigate to your VPC console, click on Security Groups, and locate the “default” group you used for your Redshift cluster. Add the following “Redshift” inbound rule, allowing access from anywhere (a better alternative is to allow just your own IP address, but this is just a quick demo):

It’s time now to connect to our Redshift cluster from Data Xtractor. First download and install Data Xtractor on your Windows computer (ignore the warnings, the product is 100% safe and has been installed hundreds of thousands of times). You’ll automatically get a 7-day free trial for any database type, including Amazon Redshift (other databases are and remain free).

Launch Data Xtractor and connect to a new Amazon Redshift database. The host name must be your endpoint (copied from your Redshift cluster properties), without the trailing port number and database name (which go into the separate fields below, port 5439 by default). For instance, my endpoint here is redshift-cluster-1.cdvk3xfvcbjb.us-west-2.redshift.amazonaws.com:5439/tickit, and I passed just redshift-cluster-1.cdvk3xfvcbjb.us-west-2.redshift.amazonaws.com.

Enter your user name and password, and you should now be able to expand the Database combo box and select the tickit database. Don’t worry about any other values, as they are dynamically generated:

A click on Test should come back with a green message, and 7 tables found in the database. Click on OK and wait for the reverse engineering process to complete. Expand the Models\Demo Models folder, and add all existing demo models: Data Xtractor will automatically create some ER diagrams with your tables, using different options and layouts.

When done, double-click on the Structural Model (and re-arrange the shapes a bit, if necessary). Here is a view that I like:

The Relationship Model shows all shapes collapsed, and you may optionally switch to the standard Crow’s Foot notation:

Note that without our added PK and FK constraint definitions, none of these relationships would have been auto-discovered!

Query Amazon Redshift Data from Data Xtractor

Your tables are now created, with Redshift distribution and sort keys, and also with our added primary and foreign keys. Tables have been populated with data, through Amazon S3. It’s time to query them.

The end of this page has a bunch of queries that you should copy and run, one by one, in your online Query Editor, like before. I’ll focus here on running the exact same queries in Data Xtractor.

(1) The first query definition can simply be copied into a New Text Query:

-- Get definition for the sales table.
SELECT *    
FROM pg_table_def    
WHERE tablename = 'sales';

When you run it, you get this:
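One caveat, in case the result comes back empty: pg_table_def only lists tables from the schemas in your search_path, so make sure the public schema is included (a quick sketch):

-- pg_table_def only returns tables from schemas in the search_path
show search_path;
set search_path to '$user', public;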

(2) The second query returns a single field, with one aggregate value (210):

-- Find total sales on a given calendar date.
SELECT sum(qtysold) 
FROM   sales, date 
WHERE  sales.dateid = date.dateid 
AND    caldate = '2008-01-05';

This time we’ll visually “design” the query, without typing any SQL. Right-click on the sales table and select “Add to New Built Query”. Drag the date relationship item out of the sales shape and drop it on the diagram: this will automatically recognize the relationship and create an INNER JOIN, instead of the blunt WHERE clause join:
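The statement behind the built query should be roughly equivalent to this (a sketch; the exact aliases Data Xtractor generates may differ):

-- Same aggregate, with an explicit INNER JOIN instead of the WHERE-clause join
SELECT sum(s.qtysold)
FROM sales s
INNER JOIN date d ON s.dateid = d.dateid
WHERE d.caldate = '2008-01-05';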

(3) The third query has an inner subquery (which in Data Xtractor we’ll have to create first, separately):

-- Find top 10 buyers by quantity.
SELECT firstname, lastname, total_quantity 
FROM   (SELECT buyerid, sum(qtysold) total_quantity
        FROM  sales
        GROUP BY buyerid
        ORDER BY total_quantity desc limit 10) Q, users
WHERE Q.buyerid = userid
ORDER BY Q.total_quantity desc;

Design and save the subquery as a “TotalPerBuyer” built query:

Then reuse this last query as a subquery in a new built query, shown here in full screen:
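With the saved TotalPerBuyer query inlined, the final built query should resolve to something roughly like this (a sketch, equivalent to the original statement above):

-- Top 10 buyers by quantity, joining the TotalPerBuyer subquery to users
SELECT u.firstname, u.lastname, q.total_quantity
FROM (SELECT buyerid, sum(qtysold) AS total_quantity
      FROM sales
      GROUP BY buyerid
      ORDER BY total_quantity DESC
      LIMIT 10) q
INNER JOIN users u ON q.buyerid = u.userid
ORDER BY q.total_quantity DESC;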

(4) Finally, the last query has two nested subqueries and uses window functions:

-- Find events in the 99.9 percentile in terms of all time gross sales.
SELECT eventname, total_price 
FROM  (SELECT eventid, total_price, ntile(1000) over(order by total_price desc) as percentile 
       FROM (SELECT eventid, sum(pricepaid) total_price
             FROM   sales
             GROUP BY eventid)) Q, event E
       WHERE Q.eventid = E.eventid
       AND percentile = 1
ORDER BY total_price desc;

You can design it entirely in Data Xtractor, like before, as we fully support Redshift window functions. But I’ll leave this as an exercise. For now, just copy and paste its definition into a New Text Query, and add some Inline Chart for a better visualization:
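If you do want to build it step by step, it may help to first rewrite it with named derived tables and an explicit join (a sketch, equivalent to the query above):

-- Same query, with the two subqueries as named derived tables and an explicit join
SELECT e.eventname, q.total_price
FROM (SELECT t.eventid, t.total_price,
             ntile(1000) OVER (ORDER BY t.total_price DESC) AS percentile
      FROM (SELECT eventid, sum(pricepaid) AS total_price
            FROM sales
            GROUP BY eventid) t) q
INNER JOIN event e ON q.eventid = e.eventid
WHERE q.percentile = 1
ORDER BY q.total_price DESC;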

Clean Up Your AWS Resources!

It’s always VERY important to remove your cloud resources after such an exercise, to avoid incurring costs:

  1. Start by deleting your Redshift cluster. You may be asked to take a final snapshot, but decline it (snapshots are saved in S3, and Amazon has announced they will incur some cost in the future, if they don’t already). This action will delete the cluster, along with the database and all the tables within.
  2. Go to S3 now and delete the bucket you created for the demo. This will also remove all files loaded in the bucket.
  3. Go to your S3ToRedshift role and remove it as well.

Cristian Scutaru

I designed and implemented the Data Xtractor suite, with Model Xtractor, Query Xtractor, and Visual Xtractor as separate modules. I am a software architect and developer with over 30 years of professional experience. I’ve been working with relational databases for almost three decades, and I was constantly unhappy with the limitations of the tools used to connect directly to a platform and instantly extract and display data in flexible ways.