PowerPivotGeek?

Who is this mystery man?
Click on the icon to find out. Who is powerpivotgeek?

A Peek Inside: Allocation. What is it? Why is it important?

Recently I have been getting some questions about an earlier post that I did around how the PowerPivot inactivity unloading is done. see http://powerpivotgeek.com/2010/03/04/a-peek-inside-unloading-powerpivot-data/ for more details. Surely there must be a faster mechanism for unloading data – and there is – but to fully discuss that aspect of the system we need to take a step back and talk about how databases get loaded in the first place. I don’t mean the actual loading process (that is another interesting post that I’ll do later on), what I mean is how PowerPivot decides to load something to begin with . . . and where it decides to load it.

PowerPivot provides on-demand loading of embedded databases. What this means is that when a client application makes a connection to a workbook (by placing “server=http://sharepoint_server/site/subsite/doclib/workbook.xlsx;” or “data source=http://…/workbook.xlsx;” on the connection string), then the system looks to see if the workbook’s embedded data has already been loaded. If so, the connection request is routed to that PowerPivot app server. If the embedded data has not yet be located then the system has to go through what we call Allocation. Allocation is the process of deciding which app server to use to load the data and to load it into that machine’s AS Engine instance (we only allow one AS instance per machine – called <server>\GeminiBI in CTP3 and <server>\PowerPivot in RTM).

PowerPivot ships with two algorithms:

  1. Round-robin – where the system just picks the next server in the PowerPivot System Service instance list (using the SP object model – remember PowerPivot is a shared service in SharePoint).
  2. Health-based – where the system polls all of the PowerPivot System Service instances to determine their health and uses the healthiest server as the location for the ‘to-be-loaded database’.

image

Notice that round-robin is the default. Round robin is particular good if need to minimize the number of messages flowing through your system as it does not poll any of the servers – it just picks the next one and uses it. Since it does not check to see if that system is the ‘best’ from a memory or CPU perspective, round robin is best used when there is no memory pressure on the servers. It is NOT a good algorithm under heavy load. The heavy load use case was the design point for health-based allocation.

Health-based uses the current CPU and memory usage of the AS instance on each machine to determine the healthiest. But for our purposes the other thing that it does, if it cannot find a server that has memory free (CPU is only used as a tie-breaker), then health-based looks to see if there are any inactive databases that can be unloaded to make memory available for the target database (the embedded data in the workbook). An ‘inactive’ database is those that have no currently opened connections to it. Health-based scans the list of inactive databases in oldest-to-latest last accessed order; until the target amount of memory (1.5X the size of the workbook) is found. Thus multiple databases may be unloaded. The PSS sends XMLA DETACH commands to each database, it loads the target workbook, and then the connection request is forwarded to the right server.

So . . that is what happens. But who is actually doing this work? That is the “geek’ie” part of this post. Rather than using the PSS and having to create a “master” PSS (which would give us a single failure point), we elected to do this work in a distributed manner. Here are the components:

image

 

The allocation algorithms run in the WCF proxy code. When a request comes in (a remote procedure call to a method in the PSS), the proxy looks to see if it knows where the target database is located. If the proxy has seen the database before, then it forwards the request to the PSS for that respective endpoint. If the proxy has not yet seen the target database it picks one of the PSS web services at random and invokes a method on it to see if the database is located in the PowerPivot service application database (the so-called instance map, which is a record of what databases are loaded and cached (detached) on each AS engine. There is a 10-minute background thread in the PSS that keeps the local AS engine in-synch with the instance map. The randomly chosen PSS returns the instance map information. If the target database is in the instance map then the proxy forwards the request to that PSS endpoint. And everything is good.

If the target database is not in the instance map information, then the proxy looks to see what allocation algorithm to use. If it is round robin, then the proxy just picks the next server in the PSS instance list. Since there could be many PSS proxies in the farm (e.g. many WFEs, many web applications that might have different IIS application pools, ECS transports, and other backend services that access PowerPivot via msolap, ADOMD.NET, AMO which in turn use the Channel Transport), the actual allocation of servers is likely to appear random or heavily favoring one server. We purposely did not synchronize the picking of the ‘next’ server because that would have placed a hot spot on that particular resource. Round-robin is supposed to be a fast, minimal algorithm – so having it become random over time is considered OK.

If the allocation algorithm is health-based, then the proxy polls each PSS gathering information about CPU, memory and the status and size of memory taken by each database on the servers. The proxy then decides which is the ‘best’ server and what databases to unload (detach) to be able to load the database on that server. The proxy then calls the server to perform the load and forwards the request to that particular machine’s PSS endpoint. 

At any point, in the entire algorithm the proxy is prepared to handle that the ultimate target endpoint that it is given, for example, from its own cache, or from the instance map information, might be wrong and the RPC call may return that the database is not located where the proxy believes it to be. If this happens, the proxy reties. It keeps retrying the number of PSS instances are on the farm. If the proxy is still chasing its tail through all of the pointers and caches in the system, then it gives up and returns an “unable to load database” error to the caller.

So, in summary, i wanted to go over this not to explain allocation itself, but more to make the point that rather than say one simple thing — “When in doubt, turn on health-based allocation – and let it proactively unload databases.” While not directly tied to inactivity, it is the single biggest ‘tuning’ option that you have to clearing resources in a high memory use configuration.

Enjoy.

  • Share/Bookmark

1 comment to A Peek Inside: Allocation. What is it? Why is it important?

Leave a Reply

 

 

 

You can use these HTML tags

<a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>