Written by Andy Brown
Processing itself in SSAS Tabular is easy; it's understanding what's going on which is hard! This page explains the underlying concepts of processing, and the choices you'll need to make.
To understand how processing works in SSAS tabular models, and how to tweak it, it's vital to understand how the underlying engine stores data.
Most databases (such as SQL Server) use a row-based storage algorithm:
Typically each row in a table is stored as a record, and is accessed by its primary key (unique identifier).
By contrast, the SSAS database engine stores data by columns:
In tabular models, columns are stored separately (because whole tables are processed in one go, this is an efficient way to store and access data).
Duplicate column values are only ever stored once. Thus the SpeciesId and Legs columns above would be stored as:
Column | Dictionary | Values |
---|---|---|
SpeciesId | 1,2,3,4 | 0,0,0,1,1,2,2,3 |
Legs | 0,2,4 | 2,2,2,1,1,0,2,2 |
For example, there are only 3 distinct values for Legs: 0, 2 and 4. Here's how SSAS would store the legs for each of the 8 animals:
Animal | Legs | Position in list | Rank used by SSAS |
---|---|---|---|
Sheep | 4 | 3 | 2 |
Cat | 4 | 3 | 2 |
Tiger | 4 | 3 | 2 |
Penguin | 2 | 2 | 1 |
Owl | 2 | 2 | 1 |
Snake | 0 | 1 | 0 |
Crocodile | 4 | 3 | 2 |
Frog | 4 | 3 | 2 |
Putting the last column together gives the values string 2,2,2,1,1,0,2,2 shown in the first table.
This shows that the lower the cardinality of a column (i.e. the fewer distinct values it contains, and hence the more duplication there is), the more efficiently its data will be stored - a point I'll revisit later in this blog.
When you process any model, two things happen:
Stage | What's involved |
---|---|
Data (loading data for each table) | The xVelocity engine takes each column in each table and builds a dictionary from it, as explained above. |
Recalc (tidying up loose ends) | The engine then populates any calculated columns, hierarchies, relationships and indexes for the database. |
It's important to bear these two stages in mind when you're considering what your processing strategy should be.
When you process a table or database, SSAS by default runs the processing inside a transaction, loading the new set of data while leaving the old set untouched.
This means that if processing fails at any point the existing model remains usable, but it also means that the xVelocity engine needs a little over twice as much memory to process a table or database as that table or database occupies.
Bearing this in mind, when you choose to process you'll be presented with a choice of options as follows (partitions are explained in this separate blog, and are included in the table below for completeness):
Mode | Level | What it does |
---|---|---|
Process Data | Table or partition | Reads in data and builds dictionaries, but doesn't process any calculated columns, indexes, relationships or hierarchies. Typically a Process Recalc is required after all tables' dictionaries have been updated using Process Data. |
Process Recalc | Database | Updates all of the calculated columns, indexes, relationships and hierarchies for a model. |
Process Full | Table, partition or database | Runs a Process Data on the table or partition, followed by a Process Recalc. If you run Process Full for an entire database, SSAS will rebuild all of the tables in the database (running a Process Data on each one), then run a single Process Recalc. |
Process Defrag | Table or database | Tidies up an individual table, or all tables (this is particularly useful when you've processed individual partitions for a table, and dictionaries might no longer be efficiently stored). |
Given the above, the most efficient approach is usually to:

1. Run a Process Data on each table in turn (loading the new data and rebuilding each table's dictionaries); then
2. Run a single Process Recalc for the whole database (rebuilding the calculated columns, relationships, hierarchies and indexes).
However, this approach does mean that calculated columns, relationships, hierarchies and indexes won't be available between the start of step 1 and the end of step 2.
One way to get round the above is to run steps 1 and 2 as part of a single SQL transaction (equivalent to doing a Process Full on the entire database), but this will use much more memory.
The previous page shows how processing works; this page shows how to do it! You can process an individual table, all tables, or else all of the tables using a particular connection (the options are described under separate headings below).
To process a single table, select the table in design or data view and then choose this menu option:
Choose this option to process a table.
A typical requirement is to process all of the tables from a particular SQL Server database. To do this, first choose a connection:
Choose to look at your existing connections.
Now choose a connection, and choose to process it:
Choose to process all the tables using this connection.
If you want to load the data for an entire model:
Choose this menu option to process all of the tables in a model.
When you deploy a model SSAS will process its tables, but by default it will do so only for tables and partitions whose data has changed. To change this, first show the properties of your project:
Right-click on your project and choose to show its properties.
You can now select how SSAS will copy data when you deploy your model:
In the project's properties window, change the default processing option (if you choose Do Not Process, SSAS will copy metadata when you deploy a project but won't load any data).
"Metadata" are the definitions of your tables and relationships, specifying which tables, columns and rows are included in your model, but not including the data itself.
You can also choose to load data from SSMS:
Right-click on the model you've deployed to Analysis Services, and choose to process it in its entirety.
You can then choose how you're going to process it, and choose OK:
Choose how you want to process the data. Process Clear wipes out all existing data, without reloading it (essentially deleting the contents of the model).
You can also process individual tables from SSMS:
Right-click on the table you want to process, and choose the option shown.
You can now tick which tables you want to process, and choose how:
You've now got complete control of what you process, and how you do it!
That's pretty much the whole story on processing (with the possible exception of partitions, covered in a separate blog). The final thing I want to look at is some ideas for speeding up processing.
Processing is often the biggest bottleneck in a reporting system, since it involves loading large amounts of data from a company's transactional database into its data warehouse, ready to be used the next day as the basis for management reports.
One way to minimise the disruption that this process (forgive the pun) causes is to use SQL Server Agent to schedule it to run at 3 o'clock in the morning, although this is tough on anyone working nights!
This page looks at some ways in which you can make processing take less time.
I've looked at this in a separate blog, but I've listed it here for completeness. Partitioning allows you to break a table up into different parts, and choose which to process:
For the tblPos table, for example, you could choose just to import the 2016 data, on the grounds that other years have been closed off and so can't have changed.
The temptation when importing tables is to include all of the columns, then just hide the ones you don't need from client view:
The TransactionId column is hidden from client view, so how much harm can it be doing?
For large datasets, this is a mistake! The TransactionId column shown above is the most expensive one in the entire MAM database, since it contains 13,904 different values:
Because it's the primary key for the transactions table, each transaction id value is necessarily unique.
Remember that SSAS uses column and not row storage, so the column dictionary for the transaction id number will be (at least in MAM terms) huge. All this for a column which no user of your model will ever want to see or use!
The solution is to avoid importing expensive, high-cardinality columns like this in the first place:
Unticking the transaction id will mean that this expensive column won't be imported.
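If you're not sure which columns are the expensive ones in your own model, a quick way to check is to count each column's distinct values. The measure below is only a sketch - it assumes the transactions table is called tblTransaction, so substitute your own table and column names:

```
// Counts the number of distinct transaction ids (illustrative names)
Transaction id cardinality :=
DISTINCTCOUNT ( tblTransaction[TransactionId] )
```

The bigger the number this returns, the more storage the column's dictionary will need.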
Another way to improve compression is to sort the data you import, starting with the columns that have the fewest distinct values. Take the animal table as an example. Sorting by the animal id or name won't gain much:
The first two columns have unique values for every row, so it doesn't matter what order these appear in.
Sorting by the Legs column, by contrast, will have an effect (since this has fewest unique values):
The Legs column has the fewest distinct values, so it makes sense to sort by this column first.
I have to be honest, I don't know exactly how much difference this will make, but the logic seems good!
Consider the following simple calculation, which works out the value for transactions as the price of the item multiplied by the quantity of items bought:
The value of any item is the price times the quantity.
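In DAX terms the calculated column is just the product of the two underlying columns - something along these lines, using the Price and Quantity column names from the example:

```
// Formula for the Amount calculated column
=[Price] * [Quantity]
```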
I've added 3 measures at the bottom of the table:
Measure | Formula | Results |
---|---|---|
Cardinality price | =DISTINCTCOUNT([Price]) | 384 |
Cardinality quantity | =DISTINCTCOUNT([Quantity]) | 4 |
Cardinality amount | =DISTINCTCOUNT([Amount]) | 1045 |
The maximum possible combination of amounts would be 1,536 (that is, 384 prices multiplied by 4 quantities), but because of repeats there are only 1,045 unique amounts. What this means is that the calculated column takes up nearly three times as much storage as the other two columns combined.
Be aware, however, that by removing the calculated column any query which aggregates transaction amounts will run that much more slowly, since it will have to recalculate them. There is often a trade-off between processing time and query run time!
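One way to tilt that trade-off towards faster processing is to drop the calculated column and define the total as a measure instead, so nothing extra is stored and the multiplication happens at query time. This is only a sketch, assuming the table is called tblTransaction:

```
// Measure: multiplies price by quantity row by row at query time,
// so no Amount column has to be stored in the model
Total amount :=
SUMX (
    tblTransaction,
    tblTransaction[Price] * tblTransaction[Quantity]
)
```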
Because of the way that the xVelocity engine stores data (by columns, not rows) some queries will be more costly to run than others. Here's an example of an expensive query:
This query will be relatively expensive to run, since it has to bring in data from four different columns.
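As a sketch (the table and column names are illustrative, not necessarily those in the MAM model), an expensive query of this kind might group by three columns and aggregate a fourth:

```
// Four separate column dictionaries have to be scanned here
EVALUATE
SUMMARIZECOLUMNS (
    tblAnimal[AnimalName],
    tblPos[PosDate],
    tblTransaction[Quantity],
    "Total price", SUM ( tblTransaction[Price] )
)
```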
Here's a less expensive one:
This query will be relatively cheap to run, since aggregation over a column runs quickly, and there are only two columns to bring in.
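By contrast, a cheap query along the same lines (again with illustrative names) only needs to touch two columns - one to group by and one to aggregate:

```
// Only two column dictionaries involved: Legs and Quantity
EVALUATE
SUMMARIZECOLUMNS (
    tblAnimal[Legs],
    "Total quantity", SUM ( tblTransaction[Quantity] )
)
```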
In practice the volume of data in the MAM database is so small that relative timings will be meaningless!
For the Make-a-Mammal (MAM) tblPos (point-of-sale) table, we capture the point-of-sale date/time. Typical entries might look like this:
Five typical date/time entries from the Pos table.
If you have DateTime columns like this, you should split them into two parts (date and time). For the above dates, this would give:
Original value | Date | Time |
---|---|---|
23/11/2015 13:35:23 | 23/11/2015 | 13:35:23 |
23/11/2015 15:22:08 | 23/11/2015 | 15:22:08 |
24/11/2015 13:35:23 | 24/11/2015 | 13:35:23 |
24/11/2015 15:22:08 | 24/11/2015 | 15:22:08 |
25/11/2015 13:35:23 | 25/11/2015 | 13:35:23 |
This would yield the following column dictionaries:
Column | Dictionary | Values |
---|---|---|
Date | 23/11/2015, 24/11/2015, 25/11/2015 | 0,0,1,1,2 |
Time | 13:35:23, 15:22:08 | 0,1,0,1,0 |
The maximum possible number of distinct dates is now 365 or 366 for each year in the model, and the maximum possible number of distinct times is 24 x 60 x 60 = 86,400 (one for every second in the day).
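The split is best done in the source query, so that the combined date/time column never reaches the model at all; but as a sketch, calculated-column formulas along these lines would produce the two halves, assuming the imported column is called PosDateTime:

```
// Date part: the whole number of days (format the new column as a date)
=INT ( [PosDateTime] )

// Time part: the fraction of a day left over (format the new column as a time)
=[PosDateTime] - INT ( [PosDateTime] )
```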
Typically when processing a model the most expensive columns are unique id ones. You can reduce the amount of data stored by splitting these - here's a suggestion:
I've split the transaction id into a thousands component and a units component.
This would give the following two columns:
Column | Formula | Distinct values |
---|---|---|
Thousands | =int([TransactionId]/1000) | 113 |
Units | =[TransactionId]-1000 * int([TransactionId]/1000) | 1000 |
Of course, you'll then have to reconstruct these when creating DAX expressions: it's the age-old trade-off between processing time and query run-time ...
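For example, a calculated column or measure expression that needs the original id could rebuild it from the two components shown above:

```
// Reconstruct the original transaction id from its two components
=[Thousands] * 1000 + [Units]
```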
If you want to measure how long it takes to run queries, DAX Studio includes an excellent Server Timings facility.