Recently there was a forum post about how PowerPivot parallelizes data refresh . . and I thought this might be an interesting topic for a “Peek Inside” blog posting. The first question is: why are we doing this? What is the purpose of parallelizing data refresh? Why is it so important? Well . . there are two reasons. The first is that we want to get the maximum throughput from all of the compute resources we have in the farm. We paid a lot for the servers and we want to keep them busy. The second is that, as users deploy more and more workbooks, the number of jobs will increase as well. We expect the automatic refresh capabilities of the PowerPivot system to be a popular feature. Information workers like to keep their workbooks up-to-date, and data refresh is a powerful new feature of PowerPivot. In a large farm with tens of thousands of Excel workbooks, there might be thousands of workbooks with embedded PowerPivot data (a 10:1 ratio). And of those there might be hundreds that need nightly data refresh (again, using a 10:1 ratio). If we did the data refresh one by one and each one took 10 minutes, even 100 workbooks would take almost 17 hours to refresh. Obviously we need to run many of them at the same time to fit within a reasonable nightly window.
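To see why parallelism matters, here is the back-of-the-envelope math from the paragraph above as a quick sketch (the workbook counts are just the illustrative 10:1 ratios from the text, not measurements):

```python
# Back-of-the-envelope math for the nightly refresh window.
# All numbers are illustrative assumptions, not product limits.
total_workbooks = 10_000        # Excel workbooks in a large farm
powerpivot_ratio = 10           # ~1 in 10 embeds PowerPivot data
refresh_ratio = 10              # ~1 in 10 of those refreshes nightly
minutes_per_refresh = 10

jobs = total_workbooks // powerpivot_ratio // refresh_ratio  # 100 jobs

for slots in (1, 4, 8):
    hours = jobs * minutes_per_refresh / slots / 60
    print(f"{slots} concurrent slot(s): {hours:.1f} hours")
```

Running one job at a time gives roughly a 17-hour window; four concurrent slots brings it down to about four hours.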
Since the steps we take to accomplish this ‘parallelization’ aren’t covered much in the regular BOL (there is a bit, but mostly around the settings dialog boxes), I thought it would make a good blog posting.
First, let’s start off with the basic PowerPivot architecture (I will let other postings cover the details; here is a quick overview):
- PowerPivot for SharePoint has two shared services: one service, called the PowerPivot System service, is responsible for all back-end PowerPivot work (except the Engine itself); and a second shared service that is (you guessed it) the SSAS Engine itself. The Engine service is the regular SSAS Windows service wrapped as a SharePoint shared service. It runs in Vertipaq mode and is ready for loading, querying, and processing the embedded PowerPivot datasets that are part of your Excel workbooks. When you install PowerPivot, you get both services installed on the SharePoint app server. Across the SharePoint farm, you have “x” pairs of services (i.e., one pair on each app server where you have installed PowerPivot for SharePoint). This might be all app servers in the farm, one app server, or anywhere in between. As an administrator, you get to decide which app servers will provide PowerPivot services.
- Associated with the PowerPivot System service is a service application database; there is one service application database per PowerPivot System service application. Here is how you find out which database was created:
The service application database holds:
- The ‘instance map’, which tells the system which app server has a given workbook’s data loaded or cached
- The data refresh schedules that have been created by end-users. It also contains the ‘run queue’ for those data refresh jobs that are now runnable or currently running.
- Lastly, the service application database contains data refresh history, i.e. when jobs ran and their status (success, failure, or informational messages).
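To make the instance map’s job concrete, here is a tiny illustration of the kind of lookup it enables (the dictionary shape, URLs, and server names are invented for illustration; the real schema is internal to PowerPivot):

```python
# Hypothetical, simplified picture of the instance map: a mapping from
# a workbook to the app server where its data is loaded or cached.
# This is NOT the actual schema, just an illustration of the lookup.
instance_map = {
    "http://portal/Shared Documents/Sales.xlsx": "APPSRV01",
    "http://portal/Shared Documents/Inventory.xlsx": "APPSRV02",
}

def find_engine_for(workbook_url):
    """Return the app server already holding this workbook's data, if any."""
    return instance_map.get(workbook_url)

print(find_engine_for("http://portal/Shared Documents/Sales.xlsx"))  # APPSRV01
```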
To configure data refresh jobs, there are two Engine service instance properties. To see these settings, run Central Administration and click on the SSAS Engine service instance. In this case, here is a server in my office at Microsoft. I went to Central Admin “Services on Server” for the machine DWICKERT-RTM and clicked on the “SQL Server Analysis Services” service at the bottom:
And here are the data refresh settings for the SSAS Engine service instance on DWICKERT-RTM:
The first checkbox tells the system that DWICKERT-RTM can be used for querying, i.e. loading read-only databases on the machine. If the checkbox were not enabled (unchecked), this machine would be dedicated to data refresh. The second checkbox is the corresponding setting for data refresh. If neither is enabled, the instance will not be used at all: it is installed and configured, but it won’t be used. By default both options are enabled, i.e. a server can be used for both querying and data refresh, but if you want to dedicate the server to one activity or the other, this is where you set the role.
If you have enabled data refresh on this machine, you then get to decide how many workbooks can be refreshed concurrently. We call each of these data refresh units of work a slot. In the case above we have enabled 4 slots on DWICKERT-RTM. The default setting is the amount of memory divided by 4GB (so a 16GB machine should result in a default of 4 concurrent jobs), although there is a current bug in the system where the setting is always set to 1 regardless of how much memory the server has. The maximum value is the number of CPUs. To get the most out of your machine resources, I strongly recommend that you set the maximum concurrency if the machine is dedicated to data refresh; if you are running on a quad-core machine, the maximum number is 4. The way the system is designed is a bit complicated. The UI allows you to enter any value you wish in this dialog box (e.g. you could enter 100 instead of 4), but when the system actually goes to run the data refresh jobs, it will generate errors if the maximum concurrent jobs is larger than the number of cores on the machine. This is not the best behavior; we would have liked to stop you from entering a wrong number in the dialog box up front, but given that it is difficult to monitor remote machines, this was the most practical approach.
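The slot rules above can be sketched in a few lines (a minimal illustration of the described defaults; the function name is my own, not part of the product):

```python
# Sketch of the slot rules described above: the default slot count is
# RAM divided by 4GB, and the usable maximum is the number of CPU cores.
# (Not product code; just the arithmetic from the text.)
def default_slots(ram_gb: int, cpu_cores: int) -> int:
    """Default concurrent data refresh slots for a machine."""
    return min(max(ram_gb // 4, 1), cpu_cores)

print(default_slots(16, 4))  # a 16GB quad-core machine defaults to 4 slots
```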
So . . . now we have things running, we are all done, right? Well, as it turns out, no. The one remaining thing to talk about is how a data refresh job gets ‘kicked off’ in the first place. The “kicker” is the PowerPivot Data Refresh Timer job (again in Central Admin, under the PowerPivot Management Dashboard; here it is:)
As you can see, the timer job runs once a minute. Each minute, the timer job calls into the PowerPivot System service (on each machine where data refresh is enabled), giving it a ‘kick’ to take a look at its scheduled jobs. The PowerPivot System service checks whether any scheduled job is now runnable and, if so, places the job in the run queue. On each tick it also checks whether there are any runnable jobs waiting and whether any ‘slots’ are available on this machine. If so, the PowerPivot System service starts the refresh process.
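Conceptually, each one-minute ‘kick’ boils down to a small loop like this (a hedged Python sketch; the function name, job fields, and parameters are my own illustration, not the actual service internals):

```python
# Illustrative sketch of the per-minute timer 'kick': promote newly
# runnable scheduled jobs into the run queue, then start queued jobs
# while free slots remain. Names and shapes are assumptions.
import datetime

def on_timer_tick(schedules, run_queue, busy_slots, max_slots, start_refresh):
    now = datetime.datetime.now()
    # 1. Move newly runnable scheduled jobs into the run queue.
    for job in schedules:
        if job.next_run <= now and job not in run_queue:
            run_queue.append(job)
    # 2. Start queued jobs while free slots remain on this machine.
    while run_queue and busy_slots < max_slots:
        job = run_queue.pop(0)
        start_refresh(job)   # actual processing runs on a separate thread
        busy_slots += 1
    return busy_slots
```

The real service persists the run queue in the service application database rather than in memory, but the slot-gating decision is the same shape.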
All SharePoint interactions (reading the workbook from the SharePoint content database and saving the file back to the content database) are done on the calling thread of the RPC call from the timer job into the PowerPivot System service. But when the job is ready to do the actual SSAS processing (where the Engine goes out and refreshes the cube data from the original data sources), that work is requested in parallel on separate threads within the PowerPivot System service (one thread per ‘slot’). Remember that the actual processing is done by the local Engine service. In PowerPivot for SharePoint, the PowerPivot System service and the local SSAS Engine instance are always installed and operate as a pair; the PowerPivot System service acts as a ‘gatekeeper’ for the local Engine service. We never have a situation where the data refresh is coordinated by one PowerPivot System service but executed on a remote Engine service. The local Engine is all the PowerPivot System service knows (or cares) about.
So, like many things in PowerPivot, while data refresh seems simple and straightforward, there is actually a fair amount of technology underneath.
I hope you enjoyed the geeky tour.