A Gaussian Mixture Model for Clustering Traffic Congestion - Bayesian Modeling with Turing.jl

May, 2020

Introduction

This blog presents a cluster analysis using a Gaussian mixture model to estimate two components of the travel speed distribution. These two components can be referred to as the slow-moving component (the congested regime) and the high-speed component (the free-flow regime).

In [1]:
# load the libraries needed
using Turing, DataFrames, CSV
using MCMCChains, Plots, StatsPlots
In [3]:
gr(leg = false, bg = :lightyellow)
Out[3]:
Plots.GRBackend()

The traffic speed data used in the analysis were collected on a freeway by microwave detectors at 5-minute resolution. Let's load the data and review the DataFrame.

In [6]:
d_0 = CSV.read("../df.csv", DataFrame)   # read the detector data into a DataFrame
first(d_0, 6)
Out[6]:

6 rows × 2 columns

      speed    occupancy
      Float64  Float64
  1   65.5842   6.13947
  2   65.5556   8.85119
  3   60.1167  10.6327
  4   38.2     19.7278
  5   28.2722  23.2722
  6   28.8508  21.7625

Here we plot histograms of travel speed and occupancy to see whether the multimodal shape can be observed visually.

In [9]:
p1 = histogram(d_0[!, :speed], title = "Speed distribution", xlabel = "Speed (mph)")
p2 = histogram(d_0[!, :occupancy], title = "Occupancy distribution", xlabel = "Occupancy (%)")
p = plot(p1, p2, ylabel = "Data density", size = (950,500))
Out[9]:
Looking at the figure above, we can see that travel speed has multimodal characteristics. Thus, a Gaussian mixture is more appropriate for fitting this data, approximating the distribution more closely than a unimodal model would. Of course, we can compare the unimodal against the multimodal distribution statistically to verify this claim.

The Gaussian mixture model combines two or more Gaussian densities, with a weight for each component's contribution to the overall distribution. In this blog, a two-component Gaussian model is considered. For practical applications, it is a good idea to fit several numbers of components and select the model that best describes the data at hand; several metrics can be used for this, such as information criteria or Bayes factors (see the sketch after Equation 1 below).

\begin{equation} \ y_{i} \sim \sum\limits_{k=1}^{2} \pi_{k}\mathcal{N}(y_{i}|\mu_{k},\sigma_{k}^2) \label{eq1} \tag{1} \\ \pi_{1} \sim \mathrm{Beta}(1, 1) \\ \pi_{1} + \pi_{2} = 1 \\ \mu_{k} \sim \mathcal{N}(0, 1) \\ \sigma_{k} \sim \mathrm{HalfNormal}(1) \end{equation}
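
To make Equation 1 concrete before estimating it, here is a minimal Distributions.jl sketch with made-up component parameters, together with the kind of BIC comparison against a unimodal fit mentioned above (all numeric values are illustrative, not estimates from the data):

In [ ]:
using Distributions

# Hypothetical two-component mixture in the form of Equation 1; the means,
# standard deviations, and weights below are made up for illustration.
mix = MixtureModel(Normal, [(-0.75, 1.0), (0.65, 0.17)], [0.45, 0.55])
pdf(mix, 0.0)          # mixture density at y = 0

# One way to compare component counts: BIC = -2*loglik + k*log(n).
bic(ll, k, n) = -2ll + k * log(n)
y = rand(mix, 1_000)                     # stand-in data for the sketch
uni = fit_mle(Normal, y)                 # unimodal baseline, 2 parameters
bic_uni = bic(loglikelihood(uni, y), 2, length(y))
bic_mix = bic(loglikelihood(mix, y), 5, length(y))   # 2 means + 2 stds + 1 weight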

Equation 1 above can be written as follows in Turing for parameter estimation:

In [5]:
@model MarginalizedGMM(x, K) = begin
    N = length(x)
    # Vector of component standard deviations (1,...,K), constrained positive
    σ ~ filldist(truncated(Normal(0, 1), 0, Inf), K)
    # Vector of component means (1,...,K)
    μ ~ filldist(Normal(0, 1), K)
    # Draw the mixture weights (two components, so one free weight).
    π1 ~ Beta(1, 1)
    π2 = 1 - π1
    # Likelihood: the marginalized mixture is identical for every
    # observation, so build it once outside the loop.
    mix = MixtureModel(Normal, [(μ[k], σ[k]) for k = 1:K], [π1, π2])
    for i in 1:N
        x[i] ~ mix
    end
end;
In [ ]:
using Statistics   # for mean and std

# standardize speed to zero mean and unit standard deviation
dat1 = d_0[:, :speed]    # d_0[:, col] copies, so the DataFrame is untouched
dat1 .-= mean(dat1)
dat1 /= std(dat1);

# standardize occupancy the same way
dat0 = d_0[:, :occupancy]
dat0 .-= mean(dat0)
dat0 /= std(dat0);
In [ ]:
# instantiate the two-component model on standardized speed and sample with
# NUTS (100 adaptation steps, 0.95 target acceptance), drawing 200 samples
gmm = MarginalizedGMM(dat1, 2)
chn = sample(gmm, NUTS(100, 0.95), 200);
In [8]:
plot(chn)
Out[8]:
In [9]:
describe(chn)
Out[9]:
2-element Array{ChainDataFrame,1}

Summary Statistics
  parameters     mean     std  naive_se     mcse       ess   r_hat
  ──────────  ───────  ──────  ────────  ───────  ────────  ──────
        μ[1]  -0.7435  0.0299    0.0030  missing   30.2028  0.9932
        μ[2]   0.6422  0.0049    0.0005  missing   82.1035  0.9900
          π1   0.4631  0.0115    0.0011  missing   18.7107  1.0198
        σ[1]   1.0454  0.0172    0.0017  missing  177.2767  0.9905
        σ[2]   0.1684  0.0048    0.0005  missing   75.0814  0.9971

Quantiles
  parameters     2.5%    25.0%    50.0%    75.0%    97.5%
  ──────────  ───────  ───────  ───────  ───────  ───────
        μ[1]  -0.7944  -0.7637  -0.7446  -0.7256  -0.6863
        μ[2]   0.6336   0.6388   0.6419   0.6455   0.6520
          π1   0.4341   0.4569   0.4630   0.4724   0.4812
        σ[1]   1.0169   1.0328   1.0440   1.0561   1.0795
        σ[2]   0.1600   0.1651   0.1682   0.1711   0.1779
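
The estimates above are on the standardized scale. Since the model was fit to standardized speeds, the component means can be mapped back to mph by inverting the standardization; a quick sketch using the posterior means from the summary table:

In [ ]:
# Invert the standardization applied to dat1; -0.7435 and 0.6422 are the
# posterior means of μ[1] and μ[2] from the summary table above.
m, s = mean(d_0[!, :speed]), std(d_0[!, :speed])
speed_congested = -0.7435 * s + m   # slow-moving (congested) regime, in mph
speed_freeflow  =  0.6422 * s + m   # high-speed (free-flow) regime, in mph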
In [10]:
# extract the posterior draws, grouped by parameter name
p = get_params(chn);
In [11]:
# evaluate each weighted component density on a grid, for 50 posterior draws
x_plot = -4:0.1:1.5
post_pdfs_1 = [pdf(Normal(p.μ[1][i], p.σ[1][i]), x) * p.π1[i] for x = x_plot, i = 1:50]
post_pdfs_2 = [pdf(Normal(p.μ[2][i], p.σ[2][i]), x) * (1 - p.π1[i]) for x = x_plot, i = 1:50];
In [12]:
# overlay the weighted component densities (gray) and their sum, the
# mixture density (green), on the standardized speed histogram
histogram(dat1, xlabel = "Speed (std)", ylabel = "Data density", normed = true, legend = false)
plot!(x_plot, post_pdfs_1, lw = 3, alpha = 0.5, color = :gray, legend = false)
plot!(x_plot, post_pdfs_2, lw = 3, alpha = 0.5, color = :gray, legend = false)
plot!(x_plot, post_pdfs_1 + post_pdfs_2, lw = 3, alpha = 0.5, color = :green, legend = false)
Out[12]:

Conclusion

You will notice that I chose a Beta distribution as the prior for the mixture weights. I found the model ran faster with this prior than with a Dirichlet distribution; a sketch of the Dirichlet variant is shown below for reference. I plan to expand this model by incorporating the occupancy variable.
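
For reference, the Dirichlet-weight variant I compared against looks roughly like this (a sketch only; `DirichletGMM` is a hypothetical name, and the rest of the model body is unchanged from `MarginalizedGMM`):

In [ ]:
# Sketch of the Dirichlet-weight variant mentioned above: a symmetric
# Dirichlet prior replaces the Beta prior on the weights.
@model DirichletGMM(x, K) = begin
    σ ~ filldist(truncated(Normal(0, 1), 0, Inf), K)
    μ ~ filldist(Normal(0, 1), K)
    w ~ Dirichlet(K, 1.0)   # symmetric Dirichlet over the K weights
    mix = MixtureModel(Normal, [(μ[k], σ[k]) for k = 1:K], w)
    for i in 1:length(x)
        x[i] ~ mix
    end
end;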
