Saturday, February 19, 2011

Cube Management for Complex Business/Data Requirements

SQL Server Analysis Services provides OLAP functionality. To make the most of its features and capabilities when building a powerful BI solution, we need to understand basic cube design & processing concepts.

It looks quite easy to design OLAP cubes using Business Intelligence Development Studio, because it comes with a cube design wizard that helps create cube objects. Once the cube design is complete, the next step is to deploy it to an Analysis Server. If everything goes fine, the cube can then be browsed using different client applications.

In real project scenarios, it is not as easy as that. Designing a BI solution with Analysis Services for a complex data requirement demands a more detailed understanding of the business problem and of how Analysis Services can be used to provide a better BI solution. There is plenty of online documentation, along with blogs and books, on designing and managing cubes. Some of them are mentioned below:
  1. Books Online: SQL Server Analysis Services (SSAS):  http://msdn.microsoft.com/en-us/library/ms175609(v=SQL.90).aspx
  2. Basic cube concept videos: http://www.learnmicrosoftbi.com/
  3. Cube Concepts: http://training.inet.com/OLAP/Cubes.htm
Here, I would like to explain basic practices for designing complex cubes. Let's generalize the problem statement by considering the points below:
  • Underlying data warehouse/mart has around 20 dimensions and 6 facts.
  • Out of the 20 dimensions, 6 are incremental in nature and grow by 50,000-200,000 records per day.
  • Each fact table grows by 1,000,000 to 3,000,000 records per day, depending on traffic in the system.
Business Requirements: Some of the requirements include:
  • Business users want to do analysis on 4 main dimensions (for example: Time, Product, Entity, Location) on a frequent basis with very high query performance (i.e. queries should return results within a second).
  • The maximum acceptable latency between the logging system (i.e. OLTP) and the cube is 3-4 hours.
  • Data availability should be 24*7, regardless of cube processing failures.
Challenges: To handle such requirements, the major challenges come with cube management:
  • Managing such huge dimensions & facts.
  • Providing sub-second query performance for the mentioned business requirements.
  • Keeping the cube available 24*7, irrespective of processing failures.
  • Processing the cube every hour to keep latency within 3-4 hours.
Solution: Cube design requires an exact problem statement, so here I'll discuss how to manage such complex cubes and not the design part.

Cube Dimension Management: Processing large cube dimensions is expensive, so managing such huge dimensions requires proper attribute & hierarchy design.
  1. Use Dimension Views instead of Tables as the source: Create views on the dimension tables and use them while designing the data source view (DSV) for large dimensions. This gives the processing model the flexibility to change the view definition at processing time so that it only reads new records. Dimension views also decouple the dimension tables from the DSV, with the added advantage that a change in a table's metadata has no direct impact on the DSV.
  2. Use the ProcessAdd processing type: Use ProcessAdd to process large dimensions, because it doesn't look for any data updates/deletes and therefore processes faster than any other processing option. Using dimension views gives us the option to alter the view definition and make sure only new records are read while processing.
  3. Alter the view definition before processing: Every time before dimension processing, the views should be altered. There are different ways to do this automatically. One method is to maintain a master table at the date & hour level (if the cube is being processed hourly) with a process_status flag. Depending on the process_status value, alter the view definition, and if the dimension processes successfully, update the process_status flag for that date & hour to 'processed'. A sketch of this hourly refresh is shown after this list.
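Below is a minimal sketch of the hourly dimension refresh described in steps 1-3, written in C# (the kind of code that could sit inside an SSIS script task) using AMO and ADO.NET. The server names, the SalesCube database, the Product dimension, the vw_DimProduct view, and the dbo.ProcessMaster table with its columns are hypothetical placeholders introduced only for illustration.

    // Hourly dimension refresh sketch: alter the dimension view to expose only new
    // rows, ProcessAdd the dimension, then mark the current date & hour as processed.
    // All server/database/object names below are hypothetical placeholders.
    using System;
    using System.Data.SqlClient;
    using Microsoft.AnalysisServices;

    class DimensionRefresh
    {
        static void Main()
        {
            using (var conn = new SqlConnection(
                "Data Source=DWSERVER;Initial Catalog=DataMart;Integrated Security=SSPI;"))
            {
                conn.Open();

                // 1. Repoint the dimension view at rows loaded after the last successful run.
                new SqlCommand(@"
                    ALTER VIEW dbo.vw_DimProduct AS
                    SELECT ProductKey, ProductName, Category
                    FROM dbo.DimProduct
                    WHERE LoadDateTime > (SELECT MAX(ProcessDateTime)
                                          FROM dbo.ProcessMaster
                                          WHERE process_status = 'processed');", conn)
                    .ExecuteNonQuery();

                // 2. ProcessAdd the dimension: only the rows exposed by the view are read.
                var ssas = new Server();
                ssas.Connect("Data Source=SSASSERVER");
                Database db = ssas.Databases.FindByName("SalesCube");
                db.Dimensions.FindByName("Product").Process(ProcessType.ProcessAdd);
                ssas.Disconnect();

                // 3. Flag the current date & hour as processed in the master table.
                new SqlCommand(@"
                    UPDATE dbo.ProcessMaster
                    SET process_status = 'processed'
                    WHERE ProcessDate = CAST(GETDATE() AS date)
                      AND ProcessHour = DATEPART(HOUR, GETDATE());", conn)
                    .ExecuteNonQuery();
            }
        }
    }
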
Fact/Partition Management: There can be many ways to manage large facts, but all of them require a very good design. One of them is discussed below:
  1. Create daily partitions: As mentioned above, each fact table grows by 1m-3m records per day and the maximum acceptable latency is 3-4 hours. To achieve this, Measure Group partitions should be created on a daily basis, so that when partitions are processed hourly, only the current day's partition gets processed. This reduces overall processing time to a great extent and also improves MDX query performance.
  2. Merge older partitions: If we create partitions daily, the number of partitions in a Measure Group will keep growing and will require a strategy to manage them. Analysis Services provides the ability to merge existing partitions into a single partition. One strategy could be to keep the last 2 weeks of data in daily partitions, the last 3 months in monthly partitions, and records older than 3 months in yearly partitions. This strategy can be maintained through an AMO script written in an SSIS script task, which can in turn be triggered by SQL Server Agent on a daily basis.
  3. Add the Slice property: Analysis Services 2005 & 2008 internally assign attribute ranges to each partition so that MDX queries only hit the required partitions and avoid unnecessary partition scans; this is referred to as "auto slice". Refer to this link for more details: Slice property. Also see this discussion: Why MDX Queries use multiple partitions to return the result? As per the SSAS Performance Guide, auto slice works for MOLAP partitions, and data slices are only applicable to ROLAP partitions. However, there are many cases where people have observed that auto slice doesn't work for MOLAP partitions either; I have personally faced this issue 2-3 times, so setting the Slice property explicitly is the safer option. For more detail on partition slices, please read this article: Get most out of partition slices.
  4. Use ProcessFull & ProcessAdd to process partitions: Use ProcessFull when a new partition gets created, and use ProcessAdd after that to add records hourly to the existing partition (ProcessAdd is only applicable to existing partitions). ProcessAdd only processes new records and therefore makes processing much faster. A sketch covering partition creation, processing, and merging is shown after this list.
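The following is a minimal sketch of the daily-partition strategy from steps 1-4, again in C# with AMO. The SalesCube database, Sales cube, Fact Sales measure group, dbo.vw_FactSales_Current view, slice member, and partition names are hypothetical placeholders; a real script would derive them from configuration and from the merge windows described above.

    // Daily partition maintenance sketch: create today's partition with an explicit
    // slice, ProcessFull it once, ProcessAdd it on later hourly runs, and merge an
    // out-of-window daily partition into its monthly partition.
    // All server/database/object names below are hypothetical placeholders.
    using System;
    using Microsoft.AnalysisServices;

    class PartitionMaintenance
    {
        static void Main()
        {
            var ssas = new Server();
            ssas.Connect("Data Source=SSASSERVER");

            Database db = ssas.Databases.FindByName("SalesCube");
            MeasureGroup mg = db.Cubes.FindByName("Sales")
                                .MeasureGroups.FindByName("Fact Sales");

            // 1. Daily run: create today's partition, bound to a view that exposes
            //    today's rows, and set the Slice so queries hit only this partition.
            string today = DateTime.Today.ToString("yyyyMMdd");
            Partition daily = mg.Partitions.Add("Fact Sales " + today);
            daily.StorageMode = StorageMode.Molap;
            daily.Source = new QueryBinding(db.DataSources[0].ID,
                "SELECT * FROM dbo.vw_FactSales_Current");
            daily.Slice = "[Time].[Date].&[" + today + "]";
            mg.Update(UpdateOptions.ExpandFull);      // push the new partition definition to the server
            daily.Process(ProcessType.ProcessFull);   // first (full) load of the new partition

            // 2. Hourly runs: the ETL alters dbo.vw_FactSales_Current to return only
            //    not-yet-processed rows, then appends them to the existing partition.
            daily.Process(ProcessType.ProcessAdd);

            // 3. Roll-up: once a daily partition falls outside the 2-week window,
            //    merge it into its monthly partition (monthly -> yearly works the same way).
            Partition monthly  = mg.Partitions.FindByName("Fact Sales 201101");
            Partition oldDaily = mg.Partitions.FindByName("Fact Sales 20110131");
            if (monthly != null && oldDaily != null)
                monthly.Merge(new[] { oldDaily });    // the server drops the source partition

            ssas.Disconnect();
        }
    }
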
Design Aggregations: Aggregations should be designed as per the requirements, so that the most-used queries are answered from aggregations rather than partition scans, which gives better query response. Example: for the requirement above, we know that analysis will be done primarily on 4 dimensions, so we can design aggregations on particular hierarchy levels of these dimensions and queries will use the aggregations instead of partitions. A sketch of attaching an aggregation design to the measure group's partitions is shown below.
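Below is a minimal sketch, assuming an aggregation design named "Frequent Analysis" has already been built in BIDS (via the Aggregation Design Wizard or usage-based optimization) for the hypothetical Fact Sales measure group. It attaches that design to every partition and rebuilds indexes so the aggregations get materialized; all object names are placeholders.

    // Attach an existing aggregation design to all partitions and build the
    // aggregations. Object names are hypothetical placeholders.
    using Microsoft.AnalysisServices;

    class AssignAggregations
    {
        static void Main()
        {
            var ssas = new Server();
            ssas.Connect("Data Source=SSASSERVER");

            MeasureGroup mg = ssas.Databases.FindByName("SalesCube")
                                  .Cubes.FindByName("Sales")
                                  .MeasureGroups.FindByName("Fact Sales");

            AggregationDesign design = mg.AggregationDesigns.FindByName("Frequent Analysis");

            foreach (Partition p in mg.Partitions)
            {
                p.AggregationDesignID = design.ID;       // link the partition to the design
                p.Update();
                p.Process(ProcessType.ProcessIndexes);   // build aggregations & indexes only
            }

            ssas.Disconnect();
        }
    }
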

Server Architecture for 24*7 data availability: Data availability is a crucial part of cube management. The MSDN best-practice approach to load balancing & cube availability is briefly explained in Scale-Out Querying with Analysis Services.


Readers: Kindly share your comments & suggestions. 

3 comments:

  1. Nice article. How big is your cube? How do you manage old data in the cube? Do you remove the old partitions?

    Thanks -Raj

  2. Thanks Raj for your comment,

    1. How big is your cube ?
    Cube size is 25 GB and it's growing by 300-500 MB on a daily basis.

    2.How do you manage old data in the cube?
    We maintain monthly partitions for the last 3 months of data and daily partitions for the last 2 weeks of data. Data older than 3 months goes to an archive partition.

    3. Do you remove the old partitions?
    No, I don't remove partitions. I simply merge partitions into a target partition. For example: daily partitions get merged into the current month's partition, and so on...

    Thanks
    Surya

  3. If you want to partition your cube or tabular project, check out the SSAS Partition Manager project on Codeplex which will dynamically add partitions with minimal configuration on your part. See https://ssaspartitionmanager.codeplex.com/
