Mixture models belong to a family of semi-parametric model that are flexible in fitting data with heterogeneous characteristics. Thus, I am using the Gaussian mixture model to demonstrate its uses in clustering traffic congestion using data collected from a freeway. I will be incorporating a regression model with the explanatory variable in the components’ proportions. In particular, the mixture-component weights are fitted with traffic occupancy to increase flexibility of the model. This model can also be referred to as the mixture models with varying mixing proportions.
# loading the library needed
using Turing, DataFrames, CSV
using MCMCChains, Plots, StatsPlots
using Distributions
using StatsFuns: logistic
gr(leg = false, bg = :papayawhip)
The traffic speed data used in the analysis were collected from a freeway using the micro wave detectors at 5-mins resolution. Let’s load the data and review the data frame
d_0 = CSV.read("../df.csv",copycols=true)
first(d_0, 6)
Here we are ploting the histogram of the travel speed and occupancy to see if the multimodal shape can be visually observed.
p1 = histogram(d_0[!, :speed], title = "Speed distribution", xlabel = "Speed (mph)")
p2 = histogram(d_0[!, :occupancy], title = "Occupancy distribution", xlabel = "Occupancy (%)")
p = plot(p1, p2, ylabel = "Data density", size = (950,500))
Here we assume that the number of mixture components is known. Since we are clustering free-flow and congested regime, two mixture components will be considered. Based on this, the mixing proportions can be fitted with either logit or probit. I am going to fit the proportions using the logistic regression with traffic occupancy variable. Other variables such as weather conditions, incidents, etc., can be incorporated in the model.
\begin{equation} \ y_{i} \sim \sum\limits_{k=1}^2 \pi_{k}\mathcal{N}(y_{i}|x_{i},\mu_{k},\delta_{k}^2) \label{eq1} \tag{1} \\ \pi_{k} = \frac{e^{\beta_{0} + \beta_{1} x}}{1 + e^{\beta_{0} + \beta_{1} x}} \\ \beta_{0} \sim \mathcal{N}(0, 1) \\ \beta_{1} \sim \mathcal{N}(0, 1) \\ \mu_{k} \sim \mathcal{N}(0, 1) \\ \delta_{k} \sim \mathcal{HalfNormal}(1) \end{equation}The above Equation 1 can be written as follows in Turing for parameters estimation
@model MarginalizedGMM(x, y) = begin
K = 2
N = length(y)
# Vector of std (1,..,K)
σ ~ filldist(truncated(Normal(0, 1), 0, Inf), K)
# Vector of mu (1,..,K)
μ ~ filldist(Normal(0, 1), K)
intercept ~ Normal(0, 1)
slope ~ Normal(0, 1)
#likelihood
for i in 1:N
Ï€ = logistic(intercept + slope * x[i])
y[i] ~ MixtureModel(Normal, [(μ[i], σ[i]) for i = 1:K], [π, 1-π])
end
end;
function standardize(x)
return (x .- mean(x)) ./ std(x)
end
# standardizing the data
dat1 = standardize(d_0[!, :speed])
dat0 = standardize(d_0[!, :occupancy]);
gmm = MarginalizedGMM(dat0, dat1)
chn = sample(gmm, NUTS(100, 0.95), 200);
plot(chn)
formatters = (v,i,j) -> (j > 1) ? round(v, digits=3) : v
using PrettyTables
function prettystats(chains)
chains |>
x -> summarystats(x) |>
x -> DataFrame(x) |>
x -> pretty_table(x, backend = :html, formatters = formatters)
end
;
prettystats(chn)
p = get_params(chn);
oc_plot = minimum(dat0):0.1:maximum(dat0);
Component_2 = [logistic(p.intercept[1] .+ p.slope[1] * x) for x = oc_plot, i = 1:1]
Component_1 = 1 .- [logistic(p.intercept[1] .+ p.slope[1] * x) for x = oc_plot, i = 1:1]
#Lets sample 90 estimates from the model
Component_2_sample = [logistic(p.intercept[i] .+ p.slope[i] * x) for x = oc_plot, i = 1:90]
Component_1_sample = 1 .- [logistic(p.intercept[i] .+ p.slope[i] * x) for x = oc_plot, i = 1:90];
function unstandardize(x, orig)
return x .* std(orig, dims=1) .+ mean(orig, dims=1)
end
;
Let’s examine the membership of each occupancy recorded in the data set. In the figure below, we can see that traffic occupancy from 0 to nearly 12 % belongs to component 1 while above 12 % belong to component 2. Therefore, congestion on these data starts when occupancy recorded on the site is above 12 %.
plot(unstandardize(oc_plot, d_0[!, :occupancy]),
Component_2,lw = 3, alpha = 0.5,
color = :gray, label="Component 2",legend = true
)
plot!(unstandardize(oc_plot, d_0[!, :occupancy]),
Component_1,
lw = 3, alpha = 0.5, color = :red, label="Component 1",legend = true)
plot!(unstandardize(oc_plot, d_0[!, :occupancy]),
Component_2_sample,
lw = 3, alpha = 0.05, color = :gray, label="")
plot!(unstandardize(oc_plot, d_0[!, :occupancy]),
Component_1_sample,
lw = 3, alpha = 0.05, color = :red, xlabel = "Occupancy",
ylabel = "Component responsibility", label="")