How to process tables in your model, and ideas for speeding processing up
Part three of a three-part series of blogs

Processing tables involves loading data into your model, and can take a long time. This blog explains how processing works in SSAS Tabular and the different options available to you, shows how to process tables or models, and gives some ideas on how to make processing run more quickly.

  1. The theory of processing tables in SSAS (Tabular)
  2. The practice of processing in SSAS Tabular
  3. Some ideas for speeding up processing (this blog)

This blog is part of our online SSAS Tabular tutorial; we also offer lots of other Analysis Services training resources.

Posted by Andy Brown on 19 February 2016


Some ideas for speeding up processing

Processing is usually the bottleneck in a reporting system, since it involves loading masses of data from a company's transactional database into its data warehouse, for use the next day as the basis for management reports.

One way to minimise the disruption that this process (forgive the pun) causes is to use SQL Server Agent to schedule it to run at 3 o'clock in the morning, although this is tough on anyone working nights!

This page looks at some ways in which you can make processing take less time.

Step 1 - Partition your tables

I've looked at this in a separate blog, but I've listed it here for completeness.  Partitioning allows you to break a table up into different parts, and choose which to process:

Partitioned table

For the tblPos table, for example, you could choose just to import the 2016 data, on the grounds that other years have been closed off and so can't have changed.

 

Step 2 - Omit high-cardinality columns where possible

The temptation when importing tables is to include all of the columns, then just hide the ones you don't want from client view:

Hidden Transaction id column

The TransactionId column is hidden from client view, so how much harm can it be doing?

 

For large datasets, this is a mistake!  The TransactionId column shown above is the most expensive one in the entire MAM database, since it contains 13,904 different values:

The TransactionId column

Because it's the primary key for the transactions table, each transaction id value is necessarily unique.

 

Remember that SSAS uses column and not row storage, so the column dictionary for the transaction id will be (at least in MAM terms) huge. All this for a column which no user of your model will ever want to see or use!

The solution is to avoid importing expensive, high-cardinality columns like this:

Untick the transaction id

Unticking the transaction id will mean that this expensive column won't be imported.
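
If you're not sure which columns are the expensive ones, you can count distinct values with a short DAX query (run it in DAX Studio, or in an MDX query window in Management Studio). Here's a sketch, which assumes the transactions table is called tblTransaction:

    EVALUATE
    -- the bigger the count, the bigger the column dictionary
    ROW ( "Distinct ids", DISTINCTCOUNT ( tblTransaction[TransactionId] ) )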

 

Step 3 - Pre-sort your data

Take the animal table as an example.  Sorting by the animal id or name won't gain much:

Animals sorted

The first two columns have unique values for every row, so it doesn't matter what order these appear in.

 

Sorting by the Legs column, by contrast, will have an effect (since this has the fewest unique values):

Sorted animals by legs

The Legs column has the fewest distinct values, so it makes sense to sort by this column first.

 

I have to be honest, I don't know exactly how much difference this will make, but the logic seems good: the xVelocity engine can compress each column using run-length encoding, so the long runs of repeated values that sorting creates should compress better.

Step 4 - Avoid calculated columns

Consider the following simple calculation, which works out the value for transactions as the price of the item multiplied by the quantity of items bought:

Price and value of items

The value of any item is the price times the quantity.

 

I've added 3 measures at the bottom of the table:

Measure               Formula                      Results
Cardinality price     =DISTINCTCOUNT([Price])      384
Cardinality quantity  =DISTINCTCOUNT([Quantity])   4
Cardinality amount    =DISTINCTCOUNT([Amount])     1045

The maximum possible number of distinct amounts would be 1,536 (that is, 384 prices multiplied by 4 quantities), but because of repeats there are only 1,045 unique amounts.  What this means is that the calculated column's dictionary (1,045 entries) takes up nearly three times as much space as the dictionaries for the other two columns combined (384 + 4 = 388 entries).
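
If you do leave the Amount column out, a measure can return the same totals at query time instead. Here's a minimal sketch, again assuming the table is called tblTransaction:

    Total Amount :=
    -- multiply price by quantity for each row, then add up the results
    SUMX (
        tblTransaction,
        tblTransaction[Price] * tblTransaction[Quantity]
    )

Because SUMX evaluates the multiplication on the fly, nothing extra is stored in the model.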

Be aware, however, that by removing the calculated column any query which aggregates transaction amounts will run that much more slowly, since it will have to recalculate them. There is often a trade-off between processing time and query run time!

Step 5 - Avoid expensive queries

Because of the way that the xVelocity engine stores data (by columns, not rows) some queries will be more costly to run than others. Here's an example of an expensive query:

An expensive query

This query will be relatively expensive to run, since it has to bring in data from four different columns.

 

Here's a less expensive one:

A less expensive query

This query will be relatively cheap to run, since aggregation over a column runs quickly, and there are only two columns to bring in.
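
To give a flavour of the cheap variety, the query below just aggregates a single column (a sketch only - the screenshots show the actual queries used, and tblTransaction is an assumed name):

    EVALUATE
    -- summing a single column is quick, since xVelocity can scan it in isolation
    ROW ( "Total quantity", SUM ( tblTransaction[Quantity] ) )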

 

In practice the volume of data is so small for the MAM database that relative timings will be meaningless!

Step 6 - Don't store date/time columns

For the Make-a-Mammal tblPos (point-of-sale) table, we capture the point-of-sale date/time. Typical entries might look like this:

Date/time entries

Five typical date/time entries from the Pos table.

 

If you have DateTime columns like this, you should split them into two parts (date and time). For the above dates, this would give: 

Original value       Date        Time
23/11/2015 13:35:23  23/11/2015  13:35:23
23/11/2015 15:22:08  23/11/2015  15:22:08
24/11/2015 13:35:23  24/11/2015  13:35:23
24/11/2015 15:22:08  24/11/2015  15:22:08
25/11/2015 13:35:23  25/11/2015  13:35:23

This would yield the following column dictionaries:

Column  Dictionary entries                   Stored values (one index per row)
Date    23/11/2015, 24/11/2015, 25/11/2015   0,0,1,1,2
Time    13:35:23, 15:22:08                   0,1,0,1,0

The maximum possible number of distinct dates is now 365 or 366 for each year covered by the model, and the maximum possible number of distinct times is 24 x 60 x 60 = 86,400 (one for each second in the day).
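
The split is best done in the source query that feeds the table, but if you can't change the source you could use two calculated columns instead (bearing in mind the caveat from step 4). Here's a sketch, assuming the original column is called PosDateTime:

Column  Formula
Date    =INT([PosDateTime])
Time    =[PosDateTime] - INT([PosDateTime])

This works because SSAS stores a date/time as a number of days: INT strips off the fractional part to leave the date, and subtracting that back out leaves the time as a fraction of a day (remember to format the Time column as a time rather than a decimal).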

Step 7 - Consider splitting id columns

Typically when processing a model the most expensive columns are unique id ones. You can reduce the amount of data stored by splitting these - here's a suggestion:

Split id column

I've split the transaction id into a thousands component and a units component.

 

This would give the following column dictionaries:

Column     Formula                                              Distinct values
Thousands  =INT([TransactionId]/1000)                           113
Units      =[TransactionId] - 1000 * INT([TransactionId]/1000)  1000

Of course, you'll then have to reconstruct these when creating DAX expressions: it's the age-old trade-off between processing time and query run-time ...
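
For example, to get back to the original id you would just reverse the arithmetic, using the column names from the table above:

    =1000 * [Thousands] + [Units]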

If you want to measure how long it takes to run queries, DAX Studio includes an excellent Server Timings facility.
