A Gaussian Mixture Model for Clustering Traffic Congestion - Bayesian Modeling with Turing.jl

May, 2020

Introduction

This blog presents a cluster analysis using a Gaussian mixture model to estimate two components of the travel speed distribution. These two components can be referred to as the slow-moving component (the congested regime) and the high-speed component (the free-flow regime).

In [1]:
# load the libraries needed
using Turing, DataFrames, CSV
using MCMCChains, Plots, StatsPlots
In [3]:
gr(leg = false, bg = :lightyellow)
Out[3]:
Plots.GRBackend()

The traffic speed data used in the analysis were collected on a freeway by microwave detectors at 5-minute resolution. Let's load the data and review the DataFrame.

In [6]:
d_0 = CSV.read("../df.csv", DataFrame)   # read the detector data into a DataFrame
first(d_0, 6)
Out[6]:

6 rows × 2 columns

      speed    occupancy
      Float64  Float64
  1   65.5842   6.13947
  2   65.5556   8.85119
  3   60.1167  10.6327
  4   38.2     19.7278
  5   28.2722  23.2722
  6   28.8508  21.7625

Here we plot histograms of travel speed and occupancy to see whether the multimodal shape can be observed visually.

In [9]:
p1 = histogram(d_0[!, :speed], title = "Speed distribution", xlabel = "Speed (mph)")
p2 = histogram(d_0[!, :occupancy], title = "Occupancy distribution", xlabel = "Occupancy (%)")
p = plot(p1, p2, ylabel = "Data density", size = (950,500))
Out[9]:
Looking at the figure above, we can see that travel speed has multimodal characteristics. Thus, a Gaussian mixture is more appropriate for fitting this data, approximating the distribution more closely than a unimodal model would. Of course, we can compare the unimodal against the multimodal distribution statistically to verify this claim.

The Gaussian mixture model combines two or more Gaussian densities, with a weight for each component's contribution to the overall distribution. In this blog, a two-component Gaussian model is considered. For practical applications, it is a good idea to fit several numbers of components and select the model that best describes the data at hand; several metrics can be used for this, such as information criteria or Bayes factors (see the sketch after Equation 1 below).

\begin{equation} \ y_{i} \sim \sum\limits_{k=1}^{2} \pi_{k}\mathcal{N}(y_{i}|\mu_{k},\sigma_{k}^2) \label{eq1} \tag{1} \\ \pi_{1} \sim \mathrm{Beta}(1, 1) \\ \pi_{1} + \pi_{2} = 1 \\ \mu_{k} \sim \mathcal{N}(0, 1) \\ \sigma_{k} \sim \mathrm{HalfNormal}(1) \end{equation}
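
To make Equation 1 concrete before estimating it, here is a minimal Distributions.jl sketch with made-up component parameters, together with the kind of BIC comparison against a unimodal fit mentioned above (all numeric values are illustrative, not estimates from the data):

In [ ]:
using Distributions

# Hypothetical two-component mixture in the form of Equation 1; the means,
# standard deviations, and weights below are made up for illustration.
mix = MixtureModel(Normal, [(-0.75, 1.0), (0.65, 0.17)], [0.45, 0.55])
pdf(mix, 0.0)          # mixture density at y = 0

# One way to compare component counts: BIC = -2*loglik + k*log(n).
bic(ll, k, n) = -2ll + k * log(n)
y = rand(mix, 1_000)                     # stand-in data for the sketch
uni = fit_mle(Normal, y)                 # unimodal baseline, 2 parameters
bic_uni = bic(loglikelihood(uni, y), 2, length(y))
bic_mix = bic(loglikelihood(mix, y), 5, length(y))   # 2 means + 2 stds + 1 weight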

Equation 1 above can be written as follows in Turing for parameter estimation:

In [5]:
@model MarginalizedGMM(x, K) = begin
    N = length(x)
    # Vector of component standard deviations (1,...,K), constrained positive
    σ ~ filldist(truncated(Normal(0, 1), 0, Inf), K)
    # Vector of component means (1,...,K)
    μ ~ filldist(Normal(0, 1), K)
    # Draw the mixture weights (two components, so one free weight).
    π1 ~ Beta(1, 1)
    π2 = 1 - π1
    # Likelihood: the marginalized mixture is identical for every
    # observation, so build it once outside the loop.
    mix = MixtureModel(Normal, [(μ[k], σ[k]) for k = 1:K], [π1, π2])
    for i in 1:N
        x[i] ~ mix
    end
end;
In [ ]:
using Statistics   # for mean and std

# standardize speed to zero mean and unit standard deviation
dat1 = d_0[:, :speed]    # d_0[:, col] copies, so the DataFrame is untouched
dat1 .-= mean(dat1)
dat1 /= std(dat1);

# standardize occupancy the same way
dat0 = d_0[:, :occupancy]
dat0 .-= mean(dat0)
dat0 /= std(dat0);
In [ ]:
# instantiate the two-component model on standardized speed and sample with
# NUTS (100 adaptation steps, 0.95 target acceptance), drawing 200 samples
gmm = MarginalizedGMM(dat1, 2)
chn = sample(gmm, NUTS(100, 0.95), 200);
In [8]:
plot(chn)
Out[8]:
In [9]:
describe(chn)
Out[9]:
2-element Array{ChainDataFrame,1}

Summary Statistics
  parameters     mean     std  naive_se     mcse       ess   r_hat
  ──────────  ───────  ──────  ────────  ───────  ────────  ──────
        μ[1]  -0.7435  0.0299    0.0030  missing   30.2028  0.9932
        μ[2]   0.6422  0.0049    0.0005  missing   82.1035  0.9900
          π1   0.4631  0.0115    0.0011  missing   18.7107  1.0198
        σ[1]   1.0454  0.0172    0.0017  missing  177.2767  0.9905
        σ[2]   0.1684  0.0048    0.0005  missing   75.0814  0.9971

Quantiles
  parameters     2.5%    25.0%    50.0%    75.0%    97.5%
  ──────────  ───────  ───────  ───────  ───────  ───────
        μ[1]  -0.7944  -0.7637  -0.7446  -0.7256  -0.6863
        μ[2]   0.6336   0.6388   0.6419   0.6455   0.6520
          π1   0.4341   0.4569   0.4630   0.4724   0.4812
        σ[1]   1.0169   1.0328   1.0440   1.0561   1.0795
        σ[2]   0.1600   0.1651   0.1682   0.1711   0.1779
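
The estimates above are on the standardized scale. Since the model was fit to standardized speeds, the component means can be mapped back to mph by inverting the standardization; a quick sketch using the posterior means from the summary table:

In [ ]:
# Invert the standardization applied to dat1; -0.7435 and 0.6422 are the
# posterior means of μ[1] and μ[2] from the summary table above.
m, s = mean(d_0[!, :speed]), std(d_0[!, :speed])
speed_congested = -0.7435 * s + m   # slow-moving (congested) regime, in mph
speed_freeflow  =  0.6422 * s + m   # high-speed (free-flow) regime, in mph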
In [10]:
# extract the posterior draws, grouped by parameter name
p = get_params(chn);
In [11]:
# evaluate each weighted component density on a grid, for 50 posterior draws
x_plot = -4:0.1:1.5
post_pdfs_1 = [pdf(Normal(p.μ[1][i], p.σ[1][i]), x) * p.π1[i] for x = x_plot, i = 1:50]
post_pdfs_2 = [pdf(Normal(p.μ[2][i], p.σ[2][i]), x) * (1 - p.π1[i]) for x = x_plot, i = 1:50];
In [12]:
# overlay the weighted component densities (gray) and their sum, the
# mixture density (green), on the standardized speed histogram
histogram(dat1, xlabel = "Speed (std)", ylabel = "Data density", normed = true, legend = false)
plot!(x_plot, post_pdfs_1, lw = 3, alpha = 0.5, color = :gray, legend = false)
plot!(x_plot, post_pdfs_2, lw = 3, alpha = 0.5, color = :gray, legend = false)
plot!(x_plot, post_pdfs_1 + post_pdfs_2, lw = 3, alpha = 0.5, color = :green, legend = false)
Out[12]:

Conclusion

You will notice that I chose a Beta distribution as the prior for the mixture weights. I found the model ran faster with this prior than with a Dirichlet distribution; a sketch of the Dirichlet variant is shown below for reference. I plan to expand this model by incorporating the occupancy variable.
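
For reference, the Dirichlet-weight variant I compared against looks roughly like this (a sketch only; `DirichletGMM` is a hypothetical name, and the rest of the model body is unchanged from `MarginalizedGMM`):

In [ ]:
# Sketch of the Dirichlet-weight variant mentioned above: a symmetric
# Dirichlet prior replaces the Beta prior on the weights.
@model DirichletGMM(x, K) = begin
    σ ~ filldist(truncated(Normal(0, 1), 0, Inf), K)
    μ ~ filldist(Normal(0, 1), K)
    w ~ Dirichlet(K, 1.0)   # symmetric Dirichlet over the K weights
    mix = MixtureModel(Normal, [(μ[k], σ[k]) for k = 1:K], w)
    for i in 1:length(x)
        x[i] ~ mix
    end
end;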
