GSoC’21 Blog-1
ArviZ is a Python package for exploratory analysis of Bayesian models.
The Community Bonding period involved brainstorming ideas and revising the proposal after going through it multiple times. The Coding Phase started on June 7, 2021. This post summarizes the first two weeks of the Coding Phase of my GSoC’21 experience with ArviZ@NumFOCUS.
In the first week, I worked mainly on implementing Wilkinson’s Algorithm for dot plots and creating a general lower-level function that Matplotlib and Bokeh backends would use.
In the second week, I worked mainly on designing the function API, bringing everything together, and adding new features.
Dot Plots
Dot Plots represent individual observations in a batch of data using symbols, especially circular dots. They can be considered as scatter plots with horizontal stacks. To learn more about them you can refer to this paper.
Wilkinson’s Algorithm
Each stack of dots can be thought of as a bin in a histogram. To find the height and position of these bins, we follow this simple algorithm.
Here X denotes the finite set of data points, namely X_1, X_2, …, X_n.
h: bin’s width
v: offset term (the average of the first and last dots in the stack if they differ, otherwise 0)
Algorithm:
- Start with the smallest value, X_j = X_1. The first stack begins here.
- Count the number of data points between X_j and X_j + h. Let us call it n_j.
- Place n_j dots above X_j and offset to the right by v if the n_j data values differ.
- Move right to the next data point not included in the current stack.
- Repeat steps 2–4 until no data points are left.
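The steps above can be sketched in a few lines of NumPy. This is only a minimal illustration and not the code from the PR: the function name `wilkinson_stacks` and its arguments are hypothetical, and placing each stack at the midpoint of its first and last values is my reading of the offset term v.

```python
import numpy as np


def wilkinson_stacks(values, binwidth):
    """Group sorted values into dot stacks (hypothetical helper, not the ArviZ API).

    Returns the x-location of each stack and the number of dots in it.
    """
    x = np.sort(np.asarray(values, dtype=float))
    locations, counts = [], []
    i = 0
    while i < len(x):
        # Every point in [x[i], x[i] + binwidth) joins the current stack.
        j = np.searchsorted(x, x[i] + binwidth, side="left")
        j = max(j, i + 1)  # always consume at least the starting point
        stack = x[i:j]
        # Assumption: the stack sits at the midpoint of its first and last values,
        # which equals the first value when all values in the stack are identical.
        locations.append((stack[0] + stack[-1]) / 2)
        counts.append(len(stack))
        i = j  # move to the next point not included in this stack
    return np.array(locations), np.array(counts)


stack_locs, stack_counts = wilkinson_stacks([1.0, 1.2, 1.4, 3.0, 3.1, 5.0], binwidth=0.5)
```

With a bin width of 0.5, the six values fall into three stacks of three, two, and one dots. The counts always sum to the number of data points, since every point lands in exactly one stack.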
Quantile Dot Plots
A quantile dot plot represents a probability distribution by taking a uniform sample of quantile values and plotting them in a dot plot.
It was introduced to represent a continuous probability distribution as a set of discrete outcomes. If we generate a random sample of finite size (say 20) and plot it using a dot plot, the result need not look like the underlying probability distribution every time unless the sample is very large.
Instead of generating random samples, we generate evenly spaced quantiles in probability space (i.e. between 0 and 1), with the number of quantiles equal to the number of dots we want. For each of these probabilities, we find the corresponding value x such that P(X < x) equals that probability under the underlying distribution.
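As a rough sketch of this idea, the quantile values can be computed from a sample with `np.quantile`. The helper name `quantile_dots` is hypothetical, and the choice of probabilities at the midpoints of equal-width intervals is an assumption made for illustration; other even spacings are possible.

```python
import numpy as np


def quantile_dots(values, nquantiles):
    """Return `nquantiles` evenly spaced quantile values of a sample (hypothetical helper)."""
    # Assumption: probabilities sit at the midpoints of nquantiles equal-width
    # intervals of (0, 1), e.g. 0.125, 0.375, 0.625, 0.875 for nquantiles=4.
    probs = np.linspace(1 / (2 * nquantiles), 1 - 1 / (2 * nquantiles), nquantiles)
    # For each probability p, find the value x with (empirical) P(X < x) = p.
    return np.quantile(np.asarray(values, dtype=float), probs)


rng = np.random.default_rng(0)
sample = rng.normal(loc=3, scale=4, size=5000)
dots = quantile_dots(sample, nquantiles=20)  # 20 values to feed into a dot plot
```

The 20 resulting values are deterministic for a given sample, so the dot plot built from them has the same shape each time instead of varying like a fresh random draw of 20 points.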
Implementation
Below is the implementation of Wilkinson’s Algorithm in Python.
This is just the algorithm. For the entire function API and the lower- and higher-level functions, please refer to this PR, which is currently under review. I will also add tests and documentation before it can be merged.
This function can be used in the same way as the other existing ArviZ plot functions.
import arviz as az
import numpy as np

N = 500
values = np.random.normal(loc = 3, scale = 4, size = N)
az.plot_dot(values = values)
The above script would result in the following plot:
We can also add a point interval to the plot. By default, an HDI and the 25–75 quartile range are plotted.
az.plot_dot(values = values, point_interval = True)
By specifying the quantiles argument, we can plot a quantile dot plot:
az.plot_dot(values = values, point_interval = True, quantiles = 50)
Also, if you want the x-axis to be the frequency axis, you can specify the rotated argument.
az.plot_dot(values = values, point_interval = True, rotated = True)
There are many more arguments which you can use to play with the plot. Also, I might add some more.
If you are interested in Bayesian inference and want to visualize your results, I really suggest checking out ArviZ. It is a great backend-agnostic library for the visualization of Bayesian models.
The first two weeks have been a very fun experience for me and have also helped me learn new things. I will be posting my work and findings every two weeks.
Have a nice day!