GSoC’21: Blog 2

Rishabh Sanjay
Jul 7, 2021

Hey everyone!!

So it’s been a while since I last wrote. Today I’ll be updating you all on weeks 3 and 4 of my GSoC journey with ArviZ@NumFOCUS.

Week 3:

So I had implemented dot plots in Matplotlib for ArviZ during weeks 1 and 2. But since ArviZ is a backend-agnostic library, it also supports Bokeh as a backend for plots. So my major task in week 3 was to implement dot plots in Bokeh. The high-level function is shared between the two backends, so I finished this in one week. You can refer to this PR for more details.

Here are some examples of how it can be used:

import arviz as az
import numpy as np

N = 500
values = np.random.normal(loc=3, scale=4, size=N)
az.plot_dot(values=values, backend='bokeh')
az.plot_dot(values=values, point_interval=True, backend='bokeh')

Week 4:

This week, I shifted to implementing ECDF plots. This plot was completely new to me, as I had not looked into it during the Community Bonding period, so after a lot of brainstorming with my mentors during week 3, I started implementing it.

The purpose of an ECDF plot is to compare a given sample against a distribution, or to compare two samples. Apart from plain ECDF plots, ECDF-difference plots have also become popular recently, so I will implement them as a subplot. For the comparison, we can also compute the PIT of the sample and compare it with the Uniform(0,1) distribution: if the sample follows the given distribution, then its PIT follows the Uniform(0,1) distribution. So we will implement ECDF and ECDF-difference plots for the PIT values too.
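As a quick illustration of what an ECDF actually computes (the `ecdf` helper below is my own sketch, not ArviZ's internal implementation): the empirical CDF at a point t is just the fraction of sample values less than or equal to t.

```python
import numpy as np

def ecdf(sample, points):
    """Empirical CDF of `sample` evaluated at `points`:
    the fraction of sample values <= each point."""
    sorted_sample = np.sort(np.asarray(sample))
    return np.searchsorted(sorted_sample, points, side="right") / len(sorted_sample)

rng = np.random.default_rng(0)
y = rng.normal(loc=0, scale=1, size=500)
grid = np.linspace(-3, 3, 7)
print(ecdf(y, grid))  # rises from near 0 to near 1
```

An ECDF-difference plot would then simply show `ecdf(y, grid)` minus the theoretical CDF at `grid`, which makes small deviations easier to see.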

What are PIT values?

We can transform sample values to a Uniform distribution via the Probability Integral Transform (PIT). Suppose we are given a sample y1, y2, …, yn ~ g(y), where g is unknown, and we want to know whether g = p, where p is a known pdf with a tractable cdf P. The PIT of the yi's w.r.t. p is given as:

u_i = P(y_i)

So if g = p, the transformed ui's follow the continuous Uniform(0,1) distribution (the proof of this is well explained here). After this, our job is to use ECDF or ECDF-difference plots with simultaneous confidence bands to check whether the ui's follow a uniform distribution or not. If instead we are given two samples to compare, the yi's and the xi's, i.e. x1, x2, …, xs ~ p, we follow the same procedure with one change: since p is unknown here, its cdf is replaced by the empirical CDF of the xi's when computing the PIT.
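Here is a small self-contained sketch of the PIT idea, using the error function for the normal cdf rather than any ArviZ helper (the `norm_cdf` function is my own, written just for illustration):

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(x, loc=0.0, scale=1.0):
    # cdf of Normal(loc, scale), written via the error function
    z = (np.asarray(x, dtype=float) - loc) / (scale * sqrt(2))
    return 0.5 * (1.0 + np.vectorize(erf)(z))

rng = np.random.default_rng(1)
y = rng.normal(loc=0, scale=1, size=2000)

# PIT of the sample w.r.t. p = Normal(0, 1): u_i = P(y_i)
u = norm_cdf(y)

# If the sample really follows p, the u_i are ~ Uniform(0, 1),
# so their mean should be near 1/2 and their variance near 1/12
print(u.mean(), u.var())
```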

Simultaneous Confidence Bands

To understand this, I would request you all to refer to this wonderful paper, which explains it in great detail. I will be implementing the simulation-based algorithm for finding the confidence bands; since I found that this method takes a lot of time, I will implement the optimization-based algorithm in the future.

Results

Given a sample y ~ Normal(0,1) and the distribution p = Normal(0,1):

Figure 1 is the ECDF plot for the sample y. Figure 2 is the ECDF-difference plot for the sample y. Figure 3 is the ECDF plot for the PIT of the sample y, and Figure 4 is the ECDF-difference plot for the PIT of the sample y.

For all 4 figures, we can see that the ECDF plots lie inside the (1 − alpha)-level simultaneous confidence bands; I used a confidence level of 0.95. It is also evident from Figure 3 that the ECDF plot for the PIT indeed follows the standard uniform distribution. Therefore the sample follows the given distribution p. The ECDF plots can thus be used to check whether the distribution modelled on the sample does a good job or not.

Now let's observe the plots when the sample and distribution p are different.

Given a sample y ~ Normal(0,1) and the distribution p = Normal(0,2):

Figure 1 is the ECDF plot for the sample y. Figure 2 is the ECDF-difference plot for the sample y. Figure 3 is the ECDF plot for the PIT of the sample y, and Figure 4 is the ECDF-difference plot for the PIT of the sample y.

For all 4 figures, we can see that none of the ECDF plots lies inside the (1 − alpha)-level simultaneous confidence bands. Hence the plots can be used to compare a sample and a distribution.

For the sample-sample case too, the plots are similar to the sample-distribution case. For a better understanding of the plots and the code, you can refer to this gist, which contains the code for the plots. You can play with different distributions to see how the plots behave.
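For completeness, here is a minimal sketch of the sample-vs-sample case (the `ecdf_at` helper is my own illustration, not the gist's code): since p is unknown, the PIT evaluates the ECDF of the reference sample in place of the cdf of p.

```python
import numpy as np

def ecdf_at(sample, points):
    # fraction of `sample` values <= each point
    s = np.sort(np.asarray(sample))
    return np.searchsorted(s, points, side="right") / len(s)

rng = np.random.default_rng(2)
y = rng.normal(0, 1, size=1000)  # sample to check
x = rng.normal(0, 1, size=1000)  # reference sample standing in for p

# two-sample PIT: evaluate the ECDF of x at the y values
u = ecdf_at(x, y)
print(u.mean())  # near 0.5 when both samples share a distribution
```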

Let’s meet again in 2 weeks 😄

Rishabh Sanjay

Maths and Computing at IIT Kanpur, Deep Learning | NLP | Bayesian Statistics enthusiast