Bayesian Mixture of Gaussian Model With Varying Mixing Proportions

Mixture models belong to a family of semi-parametric models that are flexible in fitting data with heterogeneous characteristics. Here I use a Gaussian mixture model to cluster traffic congestion using data collected from a freeway, incorporating a regression model with an explanatory variable in the component proportions. In particular, the mixture-component weights are modeled as a function of traffic occupancy to increase the flexibility of the model. This is also referred to as a mixture model with varying mixing proportions.

In [1]:
# load the required libraries
using Turing, DataFrames, CSV
using MCMCChains, Plots, StatsPlots
using Distributions
using StatsFuns: logistic
In [2]:
gr(leg = false, bg = :papayawhip)
Out[2]:
Plots.GRBackend()

The traffic speed data used in this analysis were collected from a freeway using microwave detectors at a 5-minute resolution. Let's load the data and review the data frame.

In [3]:
d_0 = CSV.read("../df.csv",copycols=true)
first(d_0, 6)
Out[3]:

6 rows × 2 columns

     speed      occupancy
     Float64    Float64
1    65.5842     6.13947
2    65.5556     8.85119
3    60.1167    10.6327
4    38.2       19.7278
5    28.2722    23.2722
6    28.8508    21.7625

Here we plot histograms of the travel speed and occupancy to see whether a multimodal shape can be observed visually.

In [10]:
p1 = histogram(d_0[!, :speed], title = "Speed distribution", xlabel = "Speed (mph)")
p2 = histogram(d_0[!, :occupancy], title = "Occupancy distribution", xlabel = "Occupancy (%)")
p = plot(p1, p2, ylabel = "Data density", size = (950,500))
Out[10]:

Here we assume that the number of mixture components is known. Since we are clustering the free-flow and congested regimes, two mixture components are considered. The mixing proportions can then be fitted with either a logit or a probit link; I fit them with a logistic regression on traffic occupancy. Other variables, such as weather conditions and incidents, could also be incorporated into the model.

\begin{equation}
y_{i} \sim \sum\limits_{k=1}^{2} \pi_{k}\,\mathcal{N}(y_{i} \mid x_{i}, \mu_{k}, \delta_{k}^{2}) \label{eq1} \tag{1}
\end{equation}

\begin{align}
\pi_{1} &= \frac{e^{\beta_{0} + \beta_{1} x}}{1 + e^{\beta_{0} + \beta_{1} x}}, \qquad \pi_{2} = 1 - \pi_{1} \\
\beta_{0} &\sim \mathcal{N}(0, 1) \\
\beta_{1} &\sim \mathcal{N}(0, 1) \\
\mu_{k} &\sim \mathcal{N}(0, 1) \\
\delta_{k} &\sim \mathrm{HalfNormal}(1)
\end{align}

Equation 1 above can be written in Turing as follows for parameter estimation.

In [5]:
@model MarginalizedGMM(x, y) = begin
    K = 2
    N = length(y)
    # vector of component standard deviations (1, ..., K)
    σ ~ filldist(truncated(Normal(0, 1), 0, Inf), K)
    # vector of component means (1, ..., K)
    μ ~ filldist(Normal(0, 1), K)

    # logistic-regression coefficients for the mixing proportions
    intercept ~ Normal(0, 1)
    slope ~ Normal(0, 1)

    # likelihood: the component weights vary with occupancy x[i];
    # π is the weight of component 1, 1 - π the weight of component 2
    for i in 1:N
        π = logistic(intercept + slope * x[i])
        y[i] ~ MixtureModel(Normal, [(μ[k], σ[k]) for k in 1:K], [π, 1 - π])
    end
end;
In [6]:
# center and scale a variable to zero mean and unit variance
function standardize(x)
    return (x .- mean(x)) ./ std(x)
end
Out[6]:
standardize (generic function with 1 method)
In [7]:
# standardizing the data
dat1 = standardize(d_0[!, :speed]) 
dat0 = standardize(d_0[!, :occupancy]);
In [8]:
gmm = MarginalizedGMM(dat0, dat1)
chn = sample(gmm, NUTS(100, 0.95), 200);
┌ Info: Found initial step size
│   ϵ = 0.025
└ @ Turing.Inference C:\Users\kidando_ey\.julia\packages\Turing\d4vqQ\src\inference\hmc.jl:626
Sampling: 100%|█████████████████████████████████████████| Time: 0:03:38
In [9]:
plot(chn)
Out[9]:
In [12]:
using PrettyTables
# round numeric columns to three digits; leave the parameter-name column as-is
formatters = (v, i, j) -> (j > 1) ? round(v, digits = 3) : v
function prettystats(chains)
  chains |>
    x -> summarystats(x) |>
    x -> DataFrame(x) |>
    x -> pretty_table(x, backend = :html, formatters = formatters)
end
;
In [13]:
prettystats(chn)
parameters    mean     std      naive_se   mcse      ess      r_hat
String        Float64  Float64  Float64    Missing   Float64  Float64
intercept      1.727   0.166    0.017      missing   22.704   1.083
slope         -5.128   0.300    0.030      missing   20.252   1.062
μ[1]           0.501   0.009    0.001      missing   68.421   1.006
μ[2]          -1.408   0.039    0.004      missing   10.988   1.108
σ[1]           0.331   0.008    0.001      missing   33.763   1.033
σ[2]           0.892   0.022    0.002      missing   64.571   1.019
In [14]:
p = get_params(chn);
In [15]:
oc_plot = minimum(dat0):0.1:maximum(dat0);
# weight curves from the first posterior draw; in the model,
# π = logistic(intercept + slope * x) is the weight of component 1 (free flow)
Component_1 = [logistic(p.intercept[1] .+ p.slope[1] * x) for x = oc_plot, i = 1:1]
Component_2 = 1 .- [logistic(p.intercept[1] .+ p.slope[1] * x) for x = oc_plot, i = 1:1]

# let's draw 90 posterior samples of the weight curves
Component_1_sample = [logistic(p.intercept[i] .+ p.slope[i] * x) for x = oc_plot, i = 1:90]
Component_2_sample = 1 .- [logistic(p.intercept[i] .+ p.slope[i] * x) for x = oc_plot, i = 1:90];
In [16]:
# map standardized values back to the original scale
function unstandardize(x, orig)
    return x .* std(orig, dims=1) .+ mean(orig, dims=1)
end
;

Let’s examine the component membership implied for each occupancy value in the data set. In the figure below, occupancy from 0 to roughly 12% belongs to component 1 (free flow), while values above 12% belong to component 2 (congested). Congestion at this site therefore begins once occupancy exceeds about 12%.
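We can make that threshold precise: the two components are equally weighted where the logit crosses zero, i.e. where intercept + slope · x = 0. A minimal sketch of that calculation (not run above; it assumes the `p` and `d_0` objects already defined, with `p.intercept` and `p.slope` behaving as plain vectors of posterior draws):

# sketch: occupancy at which both components are equally likely
x_star_std = -mean(p.intercept) / mean(p.slope)   # crossover in standardized units
# map back to the original occupancy scale
occ = d_0[!, :occupancy]
x_star = x_star_std * std(occ) + mean(occ)
println("Estimated congestion threshold ≈ ", round(x_star, digits = 1), " % occupancy")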

In [17]:
plot(unstandardize(oc_plot, d_0[!, :occupancy]),
                    Component_2,lw = 3, alpha = 0.5,
    color = :gray, label="Component 2",legend = true
                        )
plot!(unstandardize(oc_plot, d_0[!, :occupancy]),
                    Component_1,
                        lw = 3, alpha = 0.5, color = :red, label="Component 1",legend = true)

plot!(unstandardize(oc_plot, d_0[!, :occupancy]),
    Component_2_sample,

    lw = 3, alpha = 0.05, color = :gray, label="")
plot!(unstandardize(oc_plot, d_0[!, :occupancy]),
    Component_1_sample,
    lw = 3, alpha = 0.05, color = :red, xlabel = "Occupancy",
    ylabel = "Component responsibility", label="")
Out[17]:
In [18]:
sp_plot = -4:0.1:1.5
# component 1 (free flow) carries weight π = logistic(...); component 2 carries 1 - π
post_pdfs_1  = [pdf(Normal(p.μ[1][i], p.σ[1][i]), x) * logistic(p.intercept[i] .+ p.slope[i] * y) for x = sp_plot, y = oc_plot, i = 1:90]
post_pdfs_2  = [pdf(Normal(p.μ[2][i], p.σ[2][i]), x) * (1 .- logistic(p.intercept[i] .+ p.slope[i] * y)) for x = sp_plot, y = oc_plot, i = 1:90];
In [19]:
histogram(d_0[!, :speed], xlabel = "Speed (mph)", ylabel = "Data density", normed=true, legend = false)
# average each draw's component density over the occupancy grid and rescale
# from standardized units to mph so the curves match the normed histogram
plot!(unstandardize(sp_plot, d_0[!, :speed]), dropdims(mean(post_pdfs_1, dims=2), dims = 2) ./ std(d_0[!, :speed]),
    lw = 3, alpha = 0.5, color = :gray, legend = false)
plot!(unstandardize(sp_plot, d_0[!, :speed]), dropdims(mean(post_pdfs_2, dims = 2), dims = 2) ./ std(d_0[!, :speed]),
    lw = 3, alpha = 0.5, color = :gray, legend = false)
Out[19]:

The plot above overlays the posterior draws of the two component densities on the observed speed histogram, showing how the mixture approximates the travel speed distribution.

Conclusion

We can also calibrate the expected values (the component means) of the Gaussian mixture with explanatory variables. The regression then takes the following form:

\begin{equation}
y_{i} \sim \sum\limits_{k=1}^{2} \pi_{k}\,\mathcal{N}(y_{i} \mid x_{i}, \mu_{k}, \delta_{k}^{2}) \label{eq2} \tag{2}
\end{equation}

\begin{align}
\mu_{k} &= \alpha_{0k} + \alpha_{1k} x \\
\pi_{1} &= \frac{e^{\beta_{0} + \beta_{1} x}}{1 + e^{\beta_{0} + \beta_{1} x}}, \qquad \pi_{2} = 1 - \pi_{1} \\
\alpha_{0k} &\sim \mathcal{N}(0, 1) \\
\alpha_{1k} &\sim \mathcal{N}(0, 1) \\
\beta_{0} &\sim \mathcal{N}(0, 1) \\
\beta_{1} &\sim \mathcal{N}(0, 1) \\
\delta_{k} &\sim \mathrm{HalfNormal}(1)
\end{align}
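In Turing, Equation 2 amounts to a small change to the model above. A minimal sketch (not fitted here), reusing the same priors and the `logistic` link:

@model RegressionGMM(x, y) = begin
    K = 2
    N = length(y)
    # component standard deviations
    δ ~ filldist(truncated(Normal(0, 1), 0, Inf), K)
    # per-component regression coefficients for the means
    α0 ~ filldist(Normal(0, 1), K)
    α1 ~ filldist(Normal(0, 1), K)
    # coefficients for the mixing proportions
    β0 ~ Normal(0, 1)
    β1 ~ Normal(0, 1)
    for i in 1:N
        π = logistic(β0 + β1 * x[i])
        μ = α0 .+ α1 .* x[i]   # component means now vary with x
        y[i] ~ MixtureModel(Normal, [(μ[k], δ[k]) for k in 1:K], [π, 1 - π])
    end
end;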

Also, if more than two components are considered in the model, the appropriate link is the multinomial (softmax) generalization of the logistic, which is normally used for binary problems. When fitting a real problem, it is also important to ask whether the mixing proportions actually depend on the covariates. Model-comparison criteria such as WAIC or LOO can be used to answer this question.
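As a sketch of that multinomial extension (not run here; it assumes `softmax` from StatsFuns and pins the first component's logit to zero for identifiability):

using StatsFuns: softmax

@model SoftmaxGMM(x, y, K) = begin
    N = length(y)
    σ ~ filldist(truncated(Normal(0, 1), 0, Inf), K)
    μ ~ filldist(Normal(0, 1), K)
    # logits for components 2..K; component 1 serves as the reference
    intercepts ~ filldist(Normal(0, 1), K - 1)
    slopes ~ filldist(Normal(0, 1), K - 1)
    for i in 1:N
        π = softmax(vcat(0.0, intercepts .+ slopes .* x[i]))
        y[i] ~ MixtureModel(Normal, [(μ[k], σ[k]) for k in 1:K], π)
    end
end;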
