Bayesian Mixture of Gaussian Model With Varying Mixing Proportions

Mixture models belong to a family of semi-parametric models that are flexible in fitting data with heterogeneous characteristics. Here I use a Gaussian mixture model to cluster traffic congestion using data collected from a freeway, incorporating a regression model with an explanatory variable in the component proportions. In particular, the mixture-component weights are modeled as a function of traffic occupancy to increase the flexibility of the model. This is also referred to as a mixture model with varying mixing proportions.

In [1]:
# load the required libraries
using Turing, DataFrames, CSV
using MCMCChains, Plots, StatsPlots
using Distributions
using StatsFuns: logistic
In [2]:
gr(leg = false, bg = :papayawhip)
Out[2]:
Plots.GRBackend()

The traffic speed data used in this analysis were collected from a freeway using microwave detectors at a 5-minute resolution. Let's load the data and review the data frame.

In [3]:
d_0 = CSV.read("../df.csv",copycols=true)
first(d_0, 6)
Out[3]:

6 rows × 2 columns

     speed      occupancy
     Float64    Float64
1    65.5842     6.13947
2    65.5556     8.85119
3    60.1167    10.6327
4    38.2       19.7278
5    28.2722    23.2722
6    28.8508    21.7625

Here we plot histograms of the travel speed and occupancy to see whether a multimodal shape can be observed visually.

In [10]:
p1 = histogram(d_0[!, :speed], title = "Speed distribution", xlabel = "Speed (mph)")
p2 = histogram(d_0[!, :occupancy], title = "Occupancy distribution", xlabel = "Occupancy (%)")
p = plot(p1, p2, ylabel = "Data density", size = (950,500))
Out[10]:

Here we assume that the number of mixture components is known. Since we are clustering the free-flow and congested regimes, two mixture components are considered. The mixing proportions can then be fitted with either a logit or a probit link; I fit them with a logistic regression on traffic occupancy. Other variables, such as weather conditions and incidents, could also be incorporated into the model.

\begin{equation}
y_{i} \sim \sum\limits_{k=1}^{2} \pi_{k}\,\mathcal{N}(y_{i} \mid x_{i}, \mu_{k}, \delta_{k}^{2}) \label{eq1} \tag{1}
\end{equation}

\begin{align}
\pi_{1} &= \frac{e^{\beta_{0} + \beta_{1} x}}{1 + e^{\beta_{0} + \beta_{1} x}}, \qquad \pi_{2} = 1 - \pi_{1} \\
\beta_{0} &\sim \mathcal{N}(0, 1) \\
\beta_{1} &\sim \mathcal{N}(0, 1) \\
\mu_{k} &\sim \mathcal{N}(0, 1) \\
\delta_{k} &\sim \mathrm{HalfNormal}(1)
\end{align}

Equation 1 above can be written in Turing as follows for parameter estimation.

In [5]:
@model MarginalizedGMM(x, y) = begin
    K = 2
    N = length(y)
    # vector of component standard deviations (1, ..., K)
    σ ~ filldist(truncated(Normal(0, 1), 0, Inf), K)
    # vector of component means (1, ..., K)
    μ ~ filldist(Normal(0, 1), K)

    # logistic-regression coefficients for the mixing proportions
    intercept ~ Normal(0, 1)
    slope ~ Normal(0, 1)

    # likelihood: the component weights vary with occupancy x[i];
    # π is the weight of component 1, 1 - π the weight of component 2
    for i in 1:N
        π = logistic(intercept + slope * x[i])
        y[i] ~ MixtureModel(Normal, [(μ[k], σ[k]) for k in 1:K], [π, 1 - π])
    end
end;
In [6]:
# center and scale a variable to zero mean and unit variance
function standardize(x)
    return (x .- mean(x)) ./ std(x)
end
Out[6]:
standardize (generic function with 1 method)
In [7]:
# standardizing the data
dat1 = standardize(d_0[!, :speed]) 
dat0 = standardize(d_0[!, :occupancy]);
In [8]:
gmm = MarginalizedGMM(dat0, dat1)
chn = sample(gmm, NUTS(100, 0.95), 200);
┌ Info: Found initial step size
│   ϵ = 0.025
└ @ Turing.Inference C:\Users\kidando_ey\.julia\packages\Turing\d4vqQ\src\inference\hmc.jl:626
Sampling: 100%|█████████████████████████████████████████| Time: 0:03:38
In [9]:
plot(chn)
Out[9]:
In [12]:
using PrettyTables
# round numeric columns to three digits; leave the parameter-name column as-is
formatters = (v, i, j) -> (j > 1) ? round(v, digits = 3) : v
function prettystats(chains)
  chains |>
    x -> summarystats(x) |>
    x -> DataFrame(x) |>
    x -> pretty_table(x, backend = :html, formatters = formatters)
end
;
In [13]:
prettystats(chn)
parameters    mean     std      naive_se   mcse      ess      r_hat
String        Float64  Float64  Float64    Missing   Float64  Float64
intercept      1.727   0.166    0.017      missing   22.704   1.083
slope         -5.128   0.300    0.030      missing   20.252   1.062
μ[1]           0.501   0.009    0.001      missing   68.421   1.006
μ[2]          -1.408   0.039    0.004      missing   10.988   1.108
σ[1]           0.331   0.008    0.001      missing   33.763   1.033
σ[2]           0.892   0.022    0.002      missing   64.571   1.019
In [14]:
p = get_params(chn);
In [15]:
oc_plot = minimum(dat0):0.1:maximum(dat0);
# weight curves from the first posterior draw; in the model,
# π = logistic(intercept + slope * x) is the weight of component 1 (free flow)
Component_1 = [logistic(p.intercept[1] .+ p.slope[1] * x) for x = oc_plot, i = 1:1]
Component_2 = 1 .- [logistic(p.intercept[1] .+ p.slope[1] * x) for x = oc_plot, i = 1:1]

# let's draw 90 posterior samples of the weight curves
Component_1_sample = [logistic(p.intercept[i] .+ p.slope[i] * x) for x = oc_plot, i = 1:90]
Component_2_sample = 1 .- [logistic(p.intercept[i] .+ p.slope[i] * x) for x = oc_plot, i = 1:90];
In [16]:
# map standardized values back to the original scale
function unstandardize(x, orig)
    return x .* std(orig, dims=1) .+ mean(orig, dims=1)
end
;

Let’s examine the component membership implied for each occupancy value in the data set. In the figure below, occupancy from 0 to roughly 12% belongs to component 1 (free flow), while values above 12% belong to component 2 (congested). Congestion at this site therefore begins once occupancy exceeds about 12%.
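We can make that threshold precise: the two components are equally weighted where the logit crosses zero, i.e. where intercept + slope · x = 0. A minimal sketch of that calculation (not run above; it assumes the `p` and `d_0` objects already defined, with `p.intercept` and `p.slope` behaving as plain vectors of posterior draws):

# sketch: occupancy at which both components are equally likely
x_star_std = -mean(p.intercept) / mean(p.slope)   # crossover in standardized units
# map back to the original occupancy scale
occ = d_0[!, :occupancy]
x_star = x_star_std * std(occ) + mean(occ)
println("Estimated congestion threshold ≈ ", round(x_star, digits = 1), " % occupancy")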

In [17]:
plot(unstandardize(oc_plot, d_0[!, :occupancy]),
                    Component_2,lw = 3, alpha = 0.5,
    color = :gray, label="Component 2",legend = true
                        )
plot!(unstandardize(oc_plot, d_0[!, :occupancy]),
                    Component_1,
                        lw = 3, alpha = 0.5, color = :red, label="Component 1",legend = true)

plot!(unstandardize(oc_plot, d_0[!, :occupancy]),
    Component_2_sample,

    lw = 3, alpha = 0.05, color = :gray, label="")
plot!(unstandardize(oc_plot, d_0[!, :occupancy]),
    Component_1_sample,
    lw = 3, alpha = 0.05, color = :red, xlabel = "Occupancy",
    ylabel = "Component responsibility", label="")
Out[17]:
In [18]:
sp_plot = -4:0.1:1.5
# component 1 (free flow) carries weight π = logistic(...); component 2 carries 1 - π
post_pdfs_1  = [pdf(Normal(p.μ[1][i], p.σ[1][i]), x) * logistic(p.intercept[i] .+ p.slope[i] * y) for x = sp_plot, y = oc_plot, i = 1:90]
post_pdfs_2  = [pdf(Normal(p.μ[2][i], p.σ[2][i]), x) * (1 .- logistic(p.intercept[i] .+ p.slope[i] * y)) for x = sp_plot, y = oc_plot, i = 1:90];
In [19]:
histogram(d_0[!, :speed], xlabel = "Speed (mph)", ylabel = "Data density", normed=true, legend = false)
# average each draw's component density over the occupancy grid and rescale
# from standardized units to mph so the curves match the normed histogram
plot!(unstandardize(sp_plot, d_0[!, :speed]), dropdims(mean(post_pdfs_1, dims=2), dims = 2) ./ std(d_0[!, :speed]),
    lw = 3, alpha = 0.5, color = :gray, legend = false)
plot!(unstandardize(sp_plot, d_0[!, :speed]), dropdims(mean(post_pdfs_2, dims = 2), dims = 2) ./ std(d_0[!, :speed]),
    lw = 3, alpha = 0.5, color = :gray, legend = false)
Out[19]:

The plot above overlays the posterior draws of the two component densities on the observed speed histogram, showing how the mixture approximates the travel speed distribution.

Conclusion

We can also calibrate the expected values (the component means) of the Gaussian mixture with explanatory variables. The regression then takes the following form:

\begin{equation}
y_{i} \sim \sum\limits_{k=1}^{2} \pi_{k}\,\mathcal{N}(y_{i} \mid x_{i}, \mu_{k}, \delta_{k}^{2}) \label{eq2} \tag{2}
\end{equation}

\begin{align}
\mu_{k} &= \alpha_{0k} + \alpha_{1k} x \\
\pi_{1} &= \frac{e^{\beta_{0} + \beta_{1} x}}{1 + e^{\beta_{0} + \beta_{1} x}}, \qquad \pi_{2} = 1 - \pi_{1} \\
\alpha_{0k} &\sim \mathcal{N}(0, 1) \\
\alpha_{1k} &\sim \mathcal{N}(0, 1) \\
\beta_{0} &\sim \mathcal{N}(0, 1) \\
\beta_{1} &\sim \mathcal{N}(0, 1) \\
\delta_{k} &\sim \mathrm{HalfNormal}(1)
\end{align}
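In Turing, Equation 2 amounts to a small change to the model above. A minimal sketch (not fitted here), reusing the same priors and the `logistic` link:

@model RegressionGMM(x, y) = begin
    K = 2
    N = length(y)
    # component standard deviations
    δ ~ filldist(truncated(Normal(0, 1), 0, Inf), K)
    # per-component regression coefficients for the means
    α0 ~ filldist(Normal(0, 1), K)
    α1 ~ filldist(Normal(0, 1), K)
    # coefficients for the mixing proportions
    β0 ~ Normal(0, 1)
    β1 ~ Normal(0, 1)
    for i in 1:N
        π = logistic(β0 + β1 * x[i])
        μ = α0 .+ α1 .* x[i]   # component means now vary with x
        y[i] ~ MixtureModel(Normal, [(μ[k], δ[k]) for k in 1:K], [π, 1 - π])
    end
end;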

Also, if more than two components are considered in the model, the appropriate link is the multinomial (softmax) generalization of the logistic, which is normally used for binary problems. When fitting a real problem, it is also important to ask whether the mixing proportions actually depend on the covariates. Model-comparison criteria such as WAIC or LOO can be used to answer this question.
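As a sketch of that multinomial extension (not run here; it assumes `softmax` from StatsFuns and pins the first component's logit to zero for identifiability):

using StatsFuns: softmax

@model SoftmaxGMM(x, y, K) = begin
    N = length(y)
    σ ~ filldist(truncated(Normal(0, 1), 0, Inf), K)
    μ ~ filldist(Normal(0, 1), K)
    # logits for components 2..K; component 1 serves as the reference
    intercepts ~ filldist(Normal(0, 1), K - 1)
    slopes ~ filldist(Normal(0, 1), K - 1)
    for i in 1:N
        π = softmax(vcat(0.0, intercepts .+ slopes .* x[i]))
        y[i] ~ MixtureModel(Normal, [(μ[k], σ[k]) for k in 1:K], π)
    end
end;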
