library(lares) # devtools::install_github("laresbernardo/lares")

clusters <- clusterKmeans(df, k = 5) # df is any data.frame; smart one-hot encoding is applied if categorical variables are present

clusters$correlations # Same as corr_cross(clusters$df, contains = "cluster")

Hope it helps 😉

There’s no penalty in the K-means model for centroids being close together, and there’s no attempt in the algorithm to keep them apart. If there were, it’d show up as a penalty term or prior in the model. The K-means++ variant addresses this at initialization instead: it seeds the centroids far from each other (each new centroid is drawn with probability proportional to its squared distance from the nearest centroid chosen so far), which helps avoid converging to a bad local optimum.
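If it helps to see the seeding spelled out, here’s a rough base-R sketch of the K-means++ idea; X (a numeric matrix) and k are placeholder names, and this is just an illustration, not a production implementation:

kmeanspp_init <- function(X, k) {
  n <- nrow(X)
  centers <- X[sample(n, 1), , drop = FALSE]  # first centroid: uniform at random
  while (nrow(centers) < k) {
    # squared distance from each point to its nearest already-chosen centroid
    d2 <- apply(X, 1, function(x) min(colSums((t(centers) - x)^2)))
    # draw the next centroid with probability proportional to that distance
    centers <- rbind(centers, X[sample(n, 1, prob = d2), , drop = FALSE])
  }
  centers
}
# then hand the seeds to base R's kmeans, e.g. kmeans(X, centers = kmeanspp_init(X, 5))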

Theoretically, simple K-means as presented here can be viewed as a Gaussian (normal) mixture model with unit covariance. As such, you can use expectation maximization (EM) to fit it. The expectation (E) step computes the expectations, which here reduce to the probability that each point belongs to each cluster; the maximization (M) step then uses those probabilities as weights to re-estimate the parameters (what an ML person would call “weighted” training).
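A minimal base-R sketch of one such EM iteration for the unit-covariance mixture, assuming placeholder names X (n-by-p numeric data matrix), mu (k-by-p matrix of current means) and pi_k (mixing weights):

em_step <- function(X, mu, pi_k) {
  # E step: responsibility of cluster j for point i,
  # proportional to pi_k[j] * exp(-||x_i - mu_j||^2 / 2) under unit covariance
  d2   <- sapply(seq_len(nrow(mu)), function(j) rowSums(sweep(X, 2, mu[j, ])^2))
  resp <- sweep(exp(-d2 / 2), 2, pi_k, `*`)
  resp <- resp / rowSums(resp)
  # M step: weighted ("soft") re-estimation of means and mixing weights
  list(mu = t(resp) %*% X / colSums(resp), pi_k = colMeans(resp))
}

If you round each row of the responsibilities to 0/1 (hard assignment), the M step collapses to the ordinary K-means centroid update.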

The algorithm that’s most often called “K-means” is the one you present. It’s a greedy heuristic for hill climbing to a local maximum. Finding the true maximum-likelihood cluster assignment is NP-hard due to the combinatorics, and it is almost never attempted, since N and K are usually far too large for an exhaustive search over the exponentially many assignments. So the answer depends on the initialization: if you rerun K-means with new initial centers, you’ll get a different result on most non-trivial data sets. The EM algorithm will also get stuck in a local maximum, just like the simpler greedy algorithm.
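Easy to see for yourself with base R’s kmeans() (iris here is just a stand-in for your data):

set.seed(1); kmeans(iris[, 1:4], centers = 5)$tot.withinss
set.seed(2); kmeans(iris[, 1:4], centers = 5)$tot.withinss  # typically a different local optimum
kmeans(iris[, 1:4], centers = 5, nstart = 25)$tot.withinss  # usual mitigation: many restarts, keep the best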

Supervised means that one attribute of your data is the target variable; with clustering, all attributes are treated equally, therefore it is *un*supervised!
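For example (with iris as a stand-in data set): a supervised model would predict the Species column, while for clustering you simply leave it out and treat the remaining columns equally:

kmeans(iris[, -5], centers = 3) # no target variable, just the four feature columns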

*k* is a so-called *hyperparameter*, which specifies the number of clusters.
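Since it’s a hyperparameter, you have to choose it yourself. One common heuristic is the “elbow” plot: run K-means over a range of k values and look for where the within-cluster sum of squares stops dropping sharply (base R sketch, iris as stand-in data):

wss <- sapply(1:10, function(k) kmeans(iris[, 1:4], centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b", xlab = "k", ylab = "total within-cluster SS")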

Hope that helps!
