Designing Data Models
Dimensional Approach
Principles
In quick succession, we presented multiple dimensional models derived from the same source schema. The dimensional models we created are all different, yet none of them are wrong and none of them are the sole right model because they are able to answer different business questions. Now it’s time to synthesize the examples and formalize the approach to dimensional data modeling. Our approach will keep two principles in mind.
Principles
The first principle is simplification. All of the models we built are simpler than our source model. For instance, we have seen several models consisting of just one dataset. Production schemas will be much bigger, yet you can still expect that your source data model with hundreds of tables will translate to a dimensional model with tens of datasets.
The beauty of single-dataset models is not only simplicity but also flexibility and usability for business users. Within such a model, everything is allowed; any metric you build can be sliced and filtered by anything in the dataset. In addition, its simplicity doesn’t limit its analytical power. Even with our aggregated example, we saw a huge list of business questions that the model could answer.
Being closer to the business and business users is the second principle, and it’s the reason why we ended up with different data models. In dimensional data modeling, we always start with the business questions in mind. Loading data as-is is never a good idea. Such an approach would fail the business users and their ability to work with the model because we wouldn’t be accounting for the questions they want to answer with the data.
Think About the Business
Analyzing data is hard work, and that work has to be done somewhere by someone. As the creator of the data model, you decide where to put the work—either closer to or farther from business users. You can decide what part you automate to promote governance and what part you keep totally flexible.
Note how we stopped talking about User (a database table or a column) and started talking about Opportunity Owner or Account Owner (the name of the relationship between two tables). That is another move toward the business. The name of the relationship is now more important to us than the name of the column.
In the same way, as data types emerge, what is important is the business meaning of the data types. That’s why, instead of columns, we talk about attributes and facts. (UserId, even if its data type is an integer, would still be considered an attribute because its business meaning is not numeric.)
Tip: Want to know more about terminology? We recommend our course Understanding Logical Data Model.
Business Semantics
When building dimensional data models, we always start with the business semantics. What do we want to measure? This question identifies our facts: Amount, Number of Opportunities, etc.
If our first question is “what,” then the next one is “how.” How do we want to measure it? This question identifies our attributes: anything we want to group the facts by (e.g., Amount by Owner Name) and any way we want to filter them.
Attributes and Facts
Working with attributes is typically straightforward. You either include them or exclude them from your model. Working with facts is subtler because what you want to measure is typically a calculation that can include arithmetic operations (Profit = Revenue - Expenses), aggregations (count of opportunities), or filtering (Total Amount of opportunities with Status = Open).
You need to decide whether to have a fact called Profit (and calculate it behind the scenes) or to have facts Revenue and Expenses and calculate the profit when needed. To decide which approach is the right one, you need to first answer the “how” question and identify the attributes.
Business Matrix
The tool to capture the “what” and “how” answers is called a business matrix, which cross-references facts in rows with attributes in columns. A checked box in the matrix means that we want to be able to group (aggregate) the fact by the attribute—and also that it is possible to do so with respect to the source data. It is always important to make sure the business requirements are compatible with the data.
When building more complex models, the business matrix can help you identify the right datasets as the groupings of facts that are sliceable by the same attributes and the groupings of attributes slicing the same facts. While the decision about how to group attributes and facts to datasets is not a trivial one and has both analytical and performance consequences, the business-matrix approach can give you a good first iteration of your data model.
Number of Closed Opportunities
Here is the business matrix for our first example. All cells are checked, which tells us that the whole model can fit into one dataset. We may not necessarily always want to fit it into one dataset but now we know it is possible.
Opportunity Amount | Count of Opportunities | |
---|---|---|
Opportunity Id | ||
Opportunity Name | ||
Opportunity Stage | ||
Opportunity Created Date | ||
Opportunity Close Date | ||
Opportunity Owner Name | ||
Opportunity Owner Title | ||
Opportunity Owner Department |
Each checked box defines a business question that can be answered: Opportunity Amount by Stage, by Department, or Count of Opportunities by Owner.
More checked boxes in the same column allow for more complex aggregations, such as Count of Opportunities by Owner and Stage. With that, we can find how many opportunities we have for each combination of Owners and Stages.
More checked boxes in the same row allow for more complex arithmetic operations. Opportunity Amount and Count of Opportunities by Owner can calculate Average Opportunity Amount by Owner.
You can also combine rows and columns. Four checked boxes for Opportunity Stage-Amount, Stage-Count, Owner-Amount, and Owner-Count mean we can calculate Average Opportunity Amount by Owner and Stage.
Accounts with Open Opportunities
The moment we replace Opportunity Owner with Account Owner and introduce a new fact—Count of Account—then the matrix gets more complicated:
Opportunity Amount | Count of Opportunities | Count of Accounts | |
---|---|---|---|
Opportunity Id | |||
Opportunity Name | |||
Opportunity Stage | |||
Opportunity Created Date | |||
Opportunity Close Date | |||
Opportunity Owner Name | |||
Opportunity Owner Title | |||
Opportunity Owner Department |
Both opportunity-related facts can still be sliced by anything, but the Count of Accounts is not sliceable by Opportunity Stage, Close Date, nor any other Opportunity attribute. The business matrix gives us a quick overview of what business questions are possible to answer.
Note: If you decide to define the Count of Accounts as a precalculated fact, you would not be able to use opportunities to filter your result. However, if you keep Account Id in your model, GoodData will allow for calculations like the count of accounts that have an open opportunity (Opportunity Stage) in the current quarter (Opportunity Close Date).