Mixture models belong to a family of semi-parametric models that are flexible for fitting data with heterogeneous characteristics. Here I use a Gaussian mixture model to cluster traffic congestion regimes using data collected from a freeway. I incorporate a regression model with an explanatory variable in the component proportions: the mixture-component weights are modeled as a function of traffic occupancy, which increases the flexibility of the model. This model is also referred to as a mixture model with varying mixing proportions.
# loading the library needed
using Turing, DataFrames, CSV
using MCMCChains, Plots, StatsPlots
using Distributions
using StatsFuns: logistic
gr(leg = false, bg = :papayawhip)
The traffic speed data used in this analysis were collected from a freeway using microwave detectors at a 5-minute resolution. Let's load the data and review the data frame.
d_0 = CSV.read("../df.csv", DataFrame)
first(d_0, 6)
Here we plot histograms of the travel speed and occupancy to see whether a multimodal shape can be observed visually.
p1 = histogram(d_0[!, :speed], title = "Speed distribution", xlabel = "Speed (mph)")
p2 = histogram(d_0[!, :occupancy], title = "Occupancy distribution", xlabel = "Occupancy (%)")
p = plot(p1, p2, ylabel = "Data density", size = (950,500))
Here we assume that the number of mixture components is known. Since we are clustering the free-flow and congested regimes, two mixture components are considered. The mixing proportions can then be modeled with either a logit or a probit link. I fit the proportions with a logistic regression on the traffic occupancy variable; other covariates such as weather conditions, incidents, etc., could be incorporated as well.
\begin{equation}
y_{i} \sim \sum\limits_{k=1}^{2} \pi_{k}(x_{i})\,\mathcal{N}(y_{i} \mid \mu_{k}, \sigma_{k}^{2})
\label{eq1} \tag{1}
\end{equation}

\begin{align}
\pi_{1}(x_{i}) &= \frac{e^{\beta_{0} + \beta_{1} x_{i}}}{1 + e^{\beta_{0} + \beta_{1} x_{i}}}, \qquad \pi_{2}(x_{i}) = 1 - \pi_{1}(x_{i}) \\
\beta_{0} &\sim \mathcal{N}(0, 1) \\
\beta_{1} &\sim \mathcal{N}(0, 1) \\
\mu_{k} &\sim \mathcal{N}(0, 1) \\
\sigma_{k} &\sim \mathcal{HalfNormal}(1)
\end{align}

Equation 1 above can be written as follows in Turing for parameter estimation.
@model function MarginalizedGMM(x, y)
    K = 2
    N = length(y)
    # vector of component standard deviations (1,…,K)
    σ ~ filldist(truncated(Normal(0, 1), 0, Inf), K)
    # vector of component means (1,…,K)
    μ ~ filldist(Normal(0, 1), K)
    # logistic-regression coefficients for the mixing proportion
    intercept ~ Normal(0, 1)
    slope ~ Normal(0, 1)
    # likelihood: the component indicator is marginalized out
    for i in 1:N
        π = logistic(intercept + slope * x[i])
        y[i] ~ MixtureModel(Normal, [(μ[k], σ[k]) for k in 1:K], [π, 1 - π])
    end
end;
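Before fitting the model to real data, it can help to sanity-check the generative process by simulating from it directly. The sketch below draws synthetic observations under the model; the coefficient and component values are made up for illustration only.

```julia
using Distributions
using StatsFuns: logistic

# assumed ground-truth parameters for a quick synthetic check
β0, β1 = 0.0, 2.0
μ_true = [-1.0, 1.0]
σ_true = [0.4, 0.4]

x_sim = randn(500)
y_sim = map(x_sim) do x
    π1 = logistic(β0 + β1 * x)          # weight of component 1
    k = rand() < π1 ? 1 : 2             # pick a component
    rand(Normal(μ_true[k], σ_true[k]))  # draw an observation from it
end
```

Fitting the Turing model to `(x_sim, y_sim)` and checking that the posterior recovers these values is a useful test before moving to the freeway data.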
function standardize(x)
    return (x .- mean(x)) ./ std(x)
end
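A quick check of what `standardize` does: the transformed vector has mean zero and unit standard deviation. The definition is repeated here (in one-line form) so the snippet is self-contained, and the data are placeholder values.

```julia
using Statistics

# same transform as standardize above, condensed to one line
standardize(x) = (x .- mean(x)) ./ std(x)

v = [10.0, 20.0, 30.0, 40.0]   # placeholder data
z = standardize(v)
# z now has mean ≈ 0 and std ≈ 1
```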
# standardizing the data
dat1 = standardize(d_0[!, :speed])
dat0 = standardize(d_0[!, :occupancy]);
gmm = MarginalizedGMM(dat0, dat1)
chn = sample(gmm, NUTS(100, 0.95), 200);
plot(chn)
using PrettyTables
# round every numeric column to 3 decimal places
formatters = (v, i, j) -> (j > 1) ? round(v, digits = 3) : v
function prettystats(chains)
    chains |>
        x -> summarystats(x) |>
        x -> DataFrame(x) |>
        x -> pretty_table(x, backend = :html, formatters = formatters)
end;
prettystats(chn)
p = get_params(chn);
oc_plot = minimum(dat0):0.1:maximum(dat0);
Component_2 = [logistic(p.intercept[1] .+ p.slope[1] * x) for x = oc_plot, i = 1:1]
Component_1 = 1 .- [logistic(p.intercept[1] .+ p.slope[1] * x) for x = oc_plot, i = 1:1]
# Let's draw 90 posterior samples from the chain
Component_2_sample = [logistic(p.intercept[i] .+ p.slope[i] * x) for x = oc_plot, i = 1:90]
Component_1_sample = 1 .- [logistic(p.intercept[i] .+ p.slope[i] * x) for x = oc_plot, i = 1:90];
function unstandardize(x, orig)
return x .* std(orig, dims=1) .+ mean(orig, dims=1)
end
;
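As a sanity check, `unstandardize` should invert `standardize` exactly. The sketch below verifies the round trip on a small placeholder vector, using condensed one-line versions of both helpers so the snippet stands alone.

```julia
using Statistics

# vector versions of the two helpers defined above
standardize(x)         = (x .- mean(x)) ./ std(x)
unstandardize(z, orig) = z .* std(orig) .+ mean(orig)

speeds = [55.0, 62.0, 30.0, 18.0, 65.0]   # placeholder speed data
roundtrip = unstandardize(standardize(speeds), speeds)
# roundtrip ≈ speeds: the transforms are inverses of each other
```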
Let’s examine the component membership of each occupancy value recorded in the data set. In the figure below, traffic occupancy from 0 to roughly 12% belongs to component 1, while values above 12% belong to component 2. Congestion at this site therefore begins when the recorded occupancy exceeds about 12%.
plot(unstandardize(oc_plot, d_0[!, :occupancy]),
Component_2,lw = 3, alpha = 0.5,
color = :gray, label="Component 2",legend = true
)
plot!(unstandardize(oc_plot, d_0[!, :occupancy]),
Component_1,
lw = 3, alpha = 0.5, color = :red, label="Component 1",legend = true)
plot!(unstandardize(oc_plot, d_0[!, :occupancy]),
Component_2_sample,
lw = 3, alpha = 0.05, color = :gray, label="")
plot!(unstandardize(oc_plot, d_0[!, :occupancy]),
Component_1_sample,
lw = 3, alpha = 0.05, color = :red, xlabel = "Occupancy",
ylabel = "Component responsibility", label="")
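The ~12% threshold can also be read directly off the fitted logistic rather than from the plot: the regimes switch where π = 0.5, i.e. where intercept + slope·x = 0 on the standardized scale. A sketch, assuming hypothetical posterior-mean coefficients `b0` and `b1` and a placeholder occupancy column standing in for `d_0[!, :occupancy]`:

```julia
using Statistics

# hypothetical posterior means on the standardized occupancy scale
b0, b1 = -1.8, 3.2
# decision boundary in standardized units: logistic(b0 + b1 * x_star) = 0.5
x_star = -b0 / b1
# map back to raw occupancy (%) using the original column's moments
occ = [2.0, 5.0, 8.0, 11.0, 14.0, 20.0, 30.0]   # placeholder occupancy data
threshold = x_star * std(occ) + mean(occ)
```

With the real posterior means and the actual occupancy column, `threshold` would give the congestion-onset occupancy directly.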
sp_plot = -4:0.1:1.5
post_pdfs_1 = [pdf(Normal(p.μ[1][i], p.σ[1][i]), x) * (1 .- logistic(p.intercept[i] .+ p.slope[i] * y)) for x = sp_plot, y = oc_plot,i = 1:90]
post_pdfs_2 = [pdf(Normal(p.μ[2][i], p.σ[2][i]), x) * logistic(p.intercept[i] .+ p.slope[i] * y) for x = sp_plot, y = oc_plot, i = 1:90];
histogram(d_0[!, :speed], xlabel = "Speed (mph)", ylabel = "Data density", normed=true, legend = false)
plot!(unstandardize(sp_plot, d_0[!, :speed]), dropdims(sum(post_pdfs_1, dims=2), dims = 2),
lw = 3, alpha = 0.5, color = :gray, legend = false)
plot!(unstandardize(sp_plot, d_0[!, :speed]), dropdims(sum(post_pdfs_2, dims = 2), dims = 2),
lw = 3, alpha = 0.5, color = :gray, legend = false)
#plot!(sp_plot, post_pdfs_1 + post_pdfs_2, lw = 3, alpha = 0.5, color = :green, legend = false)
The plot above shows the posterior distributions of the predicted samples approximating the travel speed distribution.
We can also calibrate the expected values of the Gaussian mixture with explanatory variables, in which case each component mean becomes a regression of the form $\mu_{k} = \beta_{0,k} + \beta_{1,k} x$.
Also, if more than two components are considered in the model, the appropriate link is the multinomial (softmax) rather than the logistic function, which handles only the binary case. When fitting a real problem, it is important to ask whether the mixing proportions actually depend on the covariates at all; model-comparison criteria such as WAIC, LOO, etc., can be used to answer this question.
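To make the K > 2 case concrete, the binary logistic link generalizes to a softmax over K linear predictors, one per component. A minimal sketch, with illustrative (made-up) coefficients:

```julia
# softmax turns K linear predictors into K mixing proportions summing to 1
function softmax_weights(η)
    e = exp.(η .- maximum(η))   # subtract the max for numerical stability
    return e ./ sum(e)
end

# one linear predictor per component; here K = 3, coefficients made up
β0 = [0.0, -1.0, 0.5]
β1 = [0.0, 2.0, -1.5]
x = 0.8
π = softmax_weights(β0 .+ β1 .* x)
# π has 3 entries and sums to 1
```

Fixing one component's coefficients to zero (as done for component 1 here) makes it the reference category, which recovers the logistic link exactly when K = 2.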