Written by Andy Brown
Processing itself in SSAS Tabular is easy; it's understanding what's going on which is hard! This page explains the underlying concepts of processing, and the choices you'll need to make.
To understand how processing works in SSAS tabular models, and how to tweak it, it's vital to understand how the underlying engine stores data.
Most databases (such as SQL Server) use a row-based storage algorithm:
Typically each row in a table is stored as a record, and is accessed by its primary key (unique identifier).
By contrast, the SSAS database engine stores data by columns:
In tabular models, columns are stored separately (because whole tables are processed in one go, this is an efficient way to store and access data).
Duplicate column values are only ever stored once. Thus the SpeciesId and Legs columns above would be stored as:
Column | Dictionary | Values |
---|---|---|
SpeciesId | 1,2,3,4 | 0,0,0,1,1,2,2,3 |
Legs | 0,2,4 | 2,2,2,1,1,0,2,2 |
For example, there are only 3 distinct values for Legs: 0, 2 and 4. Here's how SSAS would store the legs for each of the 8 animals:
Animal | Legs | Position in list | Rank used by SSAS |
---|---|---|---|
Sheep | 4 | 3 | 2 |
Cat | 4 | 3 | 2 |
Tiger | 4 | 3 | 2 |
Penguin | 2 | 2 | 1 |
Owl | 2 | 2 | 1 |
Snake | 0 | 1 | 0 |
Crocodile | 4 | 3 | 2 |
Frog | 4 | 3 | 2 |
Putting the last column together gives the values string 2,2,2,1,1,0,2,2 shown in the first table.
This shows that the lower the cardinality of a column (i.e. the fewer distinct values it contains, and hence the more duplication there is), the more efficiently its data will be stored - a point I'll revisit later in this blog.
When you process any model, two things happen:
Stage | What's involved |
---|---|
Data (loading data for each table) | The xVelocity engine takes each column in each table and builds a dictionary from it, as explained above. |
Recalc (tidying up loose ends) | The engine then populates any calculated columns, hierarchies, relationships and indexes for the database. |
It's important to bear these two stages in mind when you're considering what your processing strategy should be.
When you process a table or database, SSAS by default runs the processing inside a transaction, loading the new set of data while leaving the old set untouched.
This means that if processing fails at any point the existing model remains usable, but it also means that the xVelocity engine needs a little over twice as much memory to process a table or database as that table or database occupies.
Bearing this in mind, when you choose to process you'll be presented with a choice of options as follows (partitions are explained in this separate blog, and are included in the table below for completeness):
Mode | Level | What it does |
---|---|---|
Process Data | Table or partition | Reads in data and builds dictionaries, but doesn't process any calculated columns, indexes, relationships or hierarchies. Typically a Process Recalc is required after all tables' dictionaries have been updated using Process Data. |
Process Recalc | Database | Updates all of the calculated columns, indexes, relationships and hierarchies for a model. |
Process Full | Table, partition or database | Runs a Process Data on the table or partition, followed by a Process Recalc. If you run Process Full for an entire database, SSAS will rebuild all of the tables in the database (running a Process Data on each one), then run a single Process Recalc. |
Process Defrag | Table or database | Tidies up an individual table, or all tables (this is particularly useful when you've processed individual partitions for a table, and dictionaries might no longer be efficiently stored). |
Given the above, the most efficient approach is usually to:

1. Run a Process Data on each table in turn (loading the new data and rebuilding each table's dictionaries); then
2. Run a single Process Recalc for the whole database (rebuilding the calculated columns, relationships, hierarchies and indexes).
However, this approach does mean that calculated columns, relationships, hierarchies and indexes won't be available between the start of step 1 and the end of step 2.
One way to get round the above is to run steps 1 and 2 as part of a single SQL transaction (equivalent to doing a Process Full on the entire database), but this will use much more memory.
The previous page shows how processing works; this page shows how to do it! You can process an individual table, all tables, or else all of the tables using a particular connection (the options are described under separate headings below).
To process a single table, select the table in design or data view and then choose this menu option:
Choose this option to process a table.
A typical requirement is to process all of the tables from a particular SQL Server database. To do this, first choose a connection:
Choose to look at your existing connections.
Now choose a connection, and choose to process it:
Choose to process all the tables using this connection.
If you want to load the data for an entire model:
Choose this menu option to process all of the tables in a model.
When you deploy a model SSAS will process its tables, but by default it will do so only for tables and partitions whose data has changed. To change this, first show the properties of your project:
Right-click on your project and choose to show its properties.
You can now select how SSAS will copy data when you deploy your model:
In the project's properties window, change the default processing option (if you choose Do Not Process, SSAS will copy metadata when you deploy a project but won't load any data).
"Metadata" are the definitions of your tables and relationships, specifying which tables, columns and rows are included in your model, but not including the data itself.
You can also choose to load data from SSMS:
Right-click on the model you've deployed to Analysis Services, and choose to process it in its entirety.
You can then choose how you're going to process it, and choose OK:
Choose how you want to process the data. Process Clear wipes out all existing data, without reloading it (essentially deleting the contents of the model).
You can also process individual tables from SSMS:
Right-click on the table you want to process, and choose the option shown.
You can now tick which tables you want to process, and choose how:
You've now got complete control of what you process, and how you do it!
That's pretty much the whole story on processing (with the possible exception of partitions, covered in a separate blog). The final thing I want to look at is some ideas for speeding up processing.
Processing is often the biggest bottleneck in a reporting system, since it involves loading large amounts of data from a company's transactional database into its data warehouse, ready to be used the next day as the basis for management reports.
One way to minimise the disruption that this process (forgive the pun) causes is to use SQL Server Agent to schedule it to run at 3 o'clock in the morning, although this is tough on anyone working nights!
This page looks at some ways in which you can make processing take less time.
I've looked at this in a separate blog, but I've listed it here for completeness. Partitioning allows you to break a table up into different parts, and choose which to process:
For the tblPos table, for example, you could choose just to import the 2016 data, on the grounds that other years have been closed off and so can't have changed.
The temptation when importing tables is to include all of the columns, then just hide the ones you don't need from client view:
The TransactionId column is hidden from client view, so how much harm can it be doing?
For large datasets, this is a mistake! The TransactionId column shown above is the most expensive one in the entire MAM database, since it contains 13,904 different values:
Because it's the primary key for the transactions table, each transaction id value is necessarily unique.
Remember that SSAS uses column and not row storage, so the column dictionary for the transaction id number will be (at least in MAM terms) huge. All this for a column which no user of your model will ever want to see or use!
The solution is to avoid importing expensive, high-cardinality columns like this in the first place:
Unticking the transaction id will mean that this expensive column won't be imported.
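If you're not sure which columns are the expensive ones in your own model, a quick way to check is to count each column's distinct values. The measure below is only a sketch - it assumes the transactions table is called tblTransaction, so substitute your own table and column names:

```
// Counts the number of distinct transaction ids (illustrative names)
Transaction id cardinality :=
DISTINCTCOUNT ( tblTransaction[TransactionId] )
```

The bigger the number this returns, the more storage the column's dictionary will need.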
Another way to improve compression is to sort the data you import, starting with the columns that have the fewest distinct values. Take the animal table as an example. Sorting by the animal id or name won't gain much:
The first two columns have unique values for every row, so it doesn't matter what order these appear in.
Sorting by the Legs column, by contrast, will have an effect (since this has fewest unique values):
The Legs column has the fewest distinct values, so it makes sense to sort by this column first.
I have to be honest, I don't know exactly how much difference this will make, but the logic seems good!
Consider the following simple calculation, which works out the value for transactions as the price of the item multiplied by the quantity of items bought:
The value of any item is the price times the quantity.
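In DAX terms the calculated column is just the product of the two underlying columns - something along these lines, using the Price and Quantity column names from the example:

```
// Formula for the Amount calculated column
=[Price] * [Quantity]
```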
I've added 3 measures at the bottom of the table:
Measure | Formula | Results |
---|---|---|
Cardinality price | =DISTINCTCOUNT([Price]) | 384 |
Cardinality quantity | =DISTINCTCOUNT([Quantity]) | 4 |
Cardinality amount | =DISTINCTCOUNT([Amount]) | 1045 |
The maximum possible combination of amounts would be 1,536 (that is, 384 prices multiplied by 4 quantities), but because of repeats there are only 1,045 unique amounts. What this means is that the calculated column takes up nearly three times as much storage as the other two columns combined.
Be aware, however, that by removing the calculated column any query which aggregates transaction amounts will run that much more slowly, since it will have to recalculate them. There is often a trade-off between processing time and query run time!
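One way to tilt that trade-off towards faster processing is to drop the calculated column and define the total as a measure instead, so nothing extra is stored and the multiplication happens at query time. This is only a sketch, assuming the table is called tblTransaction:

```
// Measure: multiplies price by quantity row by row at query time,
// so no Amount column has to be stored in the model
Total amount :=
SUMX (
    tblTransaction,
    tblTransaction[Price] * tblTransaction[Quantity]
)
```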
Because of the way that the xVelocity engine stores data (by columns, not rows) some queries will be more costly to run than others. Here's an example of an expensive query:
This query will be relatively expensive to run, since it has to bring in data from four different columns.
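As a sketch (the table and column names are illustrative, not necessarily those in the MAM model), an expensive query of this kind might group by three columns and aggregate a fourth:

```
// Four separate column dictionaries have to be scanned here
EVALUATE
SUMMARIZECOLUMNS (
    tblAnimal[AnimalName],
    tblPos[PosDate],
    tblTransaction[Quantity],
    "Total price", SUM ( tblTransaction[Price] )
)
```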
Here's a less expensive one:
This query will be relatively cheap to run, since aggregation over a column runs quickly, and there are only two columns to bring in.
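By contrast, a cheap query along the same lines (again with illustrative names) only needs to touch two columns - one to group by and one to aggregate:

```
// Only two column dictionaries involved: Legs and Quantity
EVALUATE
SUMMARIZECOLUMNS (
    tblAnimal[Legs],
    "Total quantity", SUM ( tblTransaction[Quantity] )
)
```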
In practice the volume of data in the MAM database is so small that relative timings will be meaningless!
For the Make-a-Mammal (MAM) tblPos (point-of-sale) table, we capture the point-of-sale date/time. Typical entries might look like this:
Five typical date/time entries from the Pos table.
If you have DateTime columns like this, you should split them into two parts (date and time). For the above dates, this would give:
Original value | Date | Time |
---|---|---|
23/11/2015 13:35:23 | 23/11/2015 | 13:35:23 |
23/11/2015 15:22:08 | 23/11/2015 | 15:22:08 |
24/11/2015 13:35:23 | 24/11/2015 | 13:35:23 |
24/11/2015 15:22:08 | 24/11/2015 | 15:22:08 |
25/11/2015 13:35:23 | 25/11/2015 | 13:35:23 |
This would yield the following column dictionaries:
Column | Dictionary | Values |
---|---|---|
Date | 23/11/2015, 24/11/2015, 25/11/2015 | 0,0,1,1,2 |
Time | 13:35:23, 15:22:08 | 0,1,0,1,0 |
The maximum possible number of distinct dates is now 365 or 366 for each year in the model, and the maximum possible number of distinct times is 24 x 60 x 60 = 86,400 (one for every second in the day).
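The split is best done in the source query, so that the combined date/time column never reaches the model at all; but as a sketch, calculated-column formulas along these lines would produce the two halves, assuming the imported column is called PosDateTime:

```
// Date part: the whole number of days (format the new column as a date)
=INT ( [PosDateTime] )

// Time part: the fraction of a day left over (format the new column as a time)
=[PosDateTime] - INT ( [PosDateTime] )
```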
Typically when processing a model the most expensive columns are unique id ones. You can reduce the amount of data stored by splitting these - here's a suggestion:
I've split the transaction id into a thousands component and a units component.
This would give the following two columns:
Column | Formula | Distinct values |
---|---|---|
Thousands | =int([TransactionId]/1000) | 113 |
Units | =[TransactionId]-1000 * int([TransactionId]/1000) | 1000 |
Of course, you'll then have to reconstruct these when creating DAX expressions: it's the age-old trade-off between processing time and query run-time ...
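For example, a calculated column or measure expression that needs the original id could rebuild it from the two components shown above:

```
// Reconstruct the original transaction id from its two components
=[Thousands] * 1000 + [Units]
```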
If you want to measure how long it takes to run queries, DAX Studio includes an excellent Server Timings facility.