562 attributed reviews in the last 3 years
Refreshingly small course sizes
Outstandingly good courseware
Whizzy online classrooms
Wise Owls only (no freelancers)
Almost no cancellations
We have genuine integrity
We invoice after training
Review 30+ years of Wise Owl
View our top 100 clients
How to process tables in your model, and ideas for speeding processing up
Part three of a three-part series of blogs
Processing tables involves loading data into your model, and can take a long time. This blog explains how processing works in SSAS Tabular and the different options available to you, shows how to process tables or models, and gives some ideas on how to make processing run more quickly.
Processing is usually the bottleneck in any company, since it involves loading masses of data from a company's transactional database into its data warehouse, for use the next day as a basis for management reports.
One way to minimise the disruption that this process (forgive the pun) causes is to use SQL Server Agent to schedule it to run at 3 o'clock in the morning, although this is tough on anyone working nights!
This page looks at some ways in which you can make processing take less time.
I've looked at this in a separate blog, but I've listed it here for completeness. Partitioning allows you to break a table up into different parts, and choose which to process:
For the tblPos table, for example, you could choose just to import the 2016 data, on the grounds that other years have been closed off and so can't have changed.
The temptation when importing tables is to include all of the columns, then just hide the ones you want from client view:
The TransactionId column is hidden from client view, so how much harm can it be doing?
For large datasets, this is a mistake! The TransactionId column shown above is the most expensive one in the entire MAM database, since it contains 13,904 different values:
Because it's the primary key for the transactions table, each transaction id value is necessarily unique.
Remember that SSAS uses column and not row storage, so the column dictionary for the transaction id number will be (at least in MAM terms) huge. All this for a column which no user of your model will ever want to see or use!
The solution is to avoid importing expensive columns with a high level of granularity:
Unticking the transaction id will mean that this expensive column won't be imported.
Take the animal table as an example. Sorting by the animal id or name won't gain much:
The first two columns have unique values for every row, so it doesn't matter what order these appear in.
Sorting by the Legs column, by contrast, will have an effect (since this has fewest unique values):
The Legs column has the fewest distinct values, so it makes sense to sort by this column first.
I have to be honest, I don't know exactly how much difference this will make, but the logic seems good!
Consider the following simple calculation, which works out the value for transactions as the price of the item multiplied by the quantity of items bought:
The value of any item is the price times the quantity.
I've added 3 measures at the bottom of the table:
The maximum possible combination of amounts would be 1,536 (that is, 384 prices multiplied by 4 quantities), but because of repeats there are only 1,045 unique amounts. What this means is that the calculated column takes up nearly three times as much storage as the other two columns combined.
Be aware, however, that by removing the calculated column any query which aggregates transaction amounts will run that much more slowly, since it will have to recalculate them. There is often a trade-off between processing time and query run time!
Because of the way that the xVelocity engine stores data (by columns, not rows) some queries will be more costly to run than others. Here's an example of an expensive query:
This query will be relatively expensive to run, since it has to bring in data from four different columns.
Here's a less expensive one:
This query will be relatively cheap to run, since aggregation over a column runs quickly, and there are only two columns to bring in.
In practise the volume of data is so small for the MAM database that relative timings will be meaningless!
For the make-a-mammal tblPos (point-of-sale) table, we capture the point-of-sale date/time. Typical entries might look like this:
Five typical date/time entries from the Pos table.
If you have DateTime columns like this, you should split them into two parts (date and time). For the above dates, this would give:
This would yield the following column dictionaries:
|Date||23/11/2015, 24/11/2015, 25/11/2015||0,0,1,1,2|
The maximum possible number of dates which can now be stored is 365 or 366 for each year in the model, and the maximum number of times is 24 x 60 x 60 = 86,400 seconds.
Typically when processing a model the most expensive columns are unique id ones. You can reduce the amount of data stored by splitting these - here's a suggestion:
I've split the transaction id into the thousands component and the units component).
This would give the following column dictionaries:
|Units||=[TransactionId]-1000 * int([TransactionId]/1000)||1000|
Of course, you'll then have to reconstruct these when creating DAX expressions: it's the age-old trade-off between processing time and query run-time ...
If you want to measure how long it takes to run queries, DAX Studio includes an excellent Server Timings facility.
|Parts of this blog|
25 Aytoun Street