Covariance and Correlation and R-Squared

This article explains how to use MAQL to calculate the relationship between two variables in terms of covariance, Pearson correlation, and the R-squared coefficient of determination.

To learn about statistical functions in MAQL, see our Documentation.

Background

Covariance

In probability theory and statistics, covariance is a measure of the degree to which two random variables (X, Y) change together. If the variables tend to show similar behavior, the covariance is positive. In the opposite case, when the greater values of one variable mainly correspond to the smaller values of the other, the covariance is negative.

Covariance is calculated as:

Covariance Formula

Pearson Correlation (r)

In statistics, correlation is the degree of association between two random variables (X, Y). It is expressed by a correlation coefficient that varies between -1 and 1.

Correlation is calculated as:

Pearson Correlation Formula

where sx is the standard deviation of X.

R-Squared (Coefficient of Determination)

In statistics, R-Squared (the square of the Pearson Correlation Coefficient) indicates how well data points fit a line or curve. It is mainly used in models that predict future outcomes or test hypotheses on the basis of other related information.

R-Squared is calculated as:

R-Squared Formula

Scenario

Suppose that an insurance company wants to find the relationship between the number of claims and the value of those claims, organized by Geo Codes.

Sample Data

Geo Codes Number of Claims (X) Value of Claims in $K (Y)
11011 108 392.5
11012 19 46.2
11013 13 15.7
11014 124 422.2

Our data set contains Number and Value facts by Geo Code as depicted by the following model:

Data Model

Metric 1 - Avg Claim Number (Mean X)

To find the average of variable X, or Avg Claim Number, you can use MAQL to define the following metric:

SELECT AVG(Number) BY ALL OTHER

The BY ALL OTHER clause is used to avoid any slicing of the amount by anything that may be present in the report.

Metric 2 - Avg Claim Value (Mean Y)

You can define a similar metric to find the average of variable Y, or in this case Avg Claim Value:

SELECT AVG(Value) BY ALL OTHER

The BY ALL OTHER clause is used to avoid any slicing of the amount by anything that may be present in the report.

Metric 3 - Claim Number Diff

You can now use the metric created in Metric 1 to calculate the difference between any given number of claims and the overall average claim number. Note that this metric is sliced by Geo Code:

SELECT SUM(Number - Avg Claim Number) BY GeoCode

Metric 4 - Claim Value Diff

Likewise, you can use Metric 2 to calculate the difference between any given claim value and the overall average claim value:

SELECT SUM(Value - Avg Claim Value) BY GeoCode

Metric 5 - Covariance

Next, you can calculate the covariance by multiplying the two difference metrics that were defined in Metrics 3 and 4, aggregating the result, and dividing by the number of records (n):

SELECT (SELECT SUM((SELECT Claim Number Diff*Claim Value Diff BY GeoCode)))/ (SELECT COUNT((GeoCode)))

To determine the number of records, you have counted the number of unique Geo Codes.

Metric 6 - Pearson Correlation (r)

Now that you have calculated the covariance, you can calculate the correlation by dividing the value by the standard deviation of X and Y, or in this case, Number and Value facts:

SELECT Covariance(Number, Value) /((SELECT STDEV(Number))*(SELECT STDEV(Value)))

Metric 7 - R-Squared

R-Squared is the square of the Pearson correlation, which you already calculated in Metric 6:

SELECT POWER(Pearson Correlation (r), 2)

Note that in this case, using the POWER function with exponent 2 has the same effect as squaring the Pearson Correlation (r) metric, or multiplying it by itself.

By nesting MAQL metrics, you can create complex statistical expressions with little of the complexity you’d expect from a data query language.