Saturday, February 25, 2012

Clusters with "missing" values

Hello all and a happy new year!

I used Microsoft Clustering to group my data. Even though I have already cleaned the data and it contains no null values, I get one cluster with "missing" values in every attribute. (I set CLUSTER_COUNT=3 and I'm using the Scalable K-Means algorithm.)

Does "missing" mean that the algorithm cannot assign that particular tuple to any other group, so it considers it as missing?

Thank you in advance.

"missing" is an implicit state of every attribute, and it is considered whether or not the data actually contains missing values. For example, a "Gender" attribute could have the states "Male", "Female" and "missing". An "Age" attribute would have as possible states the age value, or "missing". When clusters are initialized, each cluster's centroid is set to a probability for each state being in that cluster, e.g. 45% Male, 50% Female, and 5% missing. In general, the initial probabilities are determined by perturbing the distributions of the values seen in the data.
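To make the idea concrete, here is a rough Python sketch of that kind of initialization. It is not the actual Microsoft Clustering implementation; the function names, the `missing_floor` value and the perturbation `scale` are assumptions made up for illustration. It only shows how a "missing" state can carry some probability in every centroid even when the data has no nulls.

```python
import random

def marginal_with_missing(values, missing_floor=0.01):
    """Distribution over the observed states plus an implicit 'missing' state.

    missing_floor is a hypothetical parameter: it gives 'missing' a small
    nonzero probability even when the data contains no nulls.
    """
    counts = {"missing": 0}
    for v in values:
        key = "missing" if v is None else v
        counts[key] = counts.get(key, 0) + 1
    total = len(values)
    dist = {state: c / total for state, c in counts.items()}
    dist["missing"] = max(dist["missing"], missing_floor)
    norm = sum(dist.values())
    return {state: p / norm for state, p in dist.items()}

def perturbed_centroid(dist, rng, scale=0.2):
    """Randomly perturb a distribution and renormalize it to sum to 1."""
    noisy = {s: p * (1 + rng.uniform(-scale, scale)) for s, p in dist.items()}
    norm = sum(noisy.values())
    return {s: p / norm for s, p in noisy.items()}

# Hypothetical cleaned data: no nulls at all.
genders = ["Male"] * 48 + ["Female"] * 52
base = marginal_with_missing(genders)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
centroids = [perturbed_centroid(base, rng) for _ in range(3)]  # CLUSTER_COUNT=3

for i, centroid in enumerate(centroids, start=1):
    print(i, {state: round(p, 3) for state, p in centroid.items()})
```

Every centroid starts with a small "missing" probability, so seeing "missing" listed in a cluster does not by itself mean the data has nulls.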

That being said, you say that you are getting a cluster with missing values in every attribute. What is the support of this cluster relative to the others? It may be that you are really only finding two clusters, plus a third cluster that is negligible. Could you post the results of SELECT FLATTENED * FROM <model name>.CONTENT?

|||

Thank you for your answer, it was very helpful!

The support of this cluster is very small compared to the other clusters (~0.01-0.02%).
Can I consider those data as outliers and discard them?

|||Likely. You could simply name that cluster as such, i.e. label it as outliers.
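If you want to check this outside the model viewer, a small Python sketch along these lines computes each cluster's share of the total support and relabels the negligible one. The support counts and the 0.1% threshold are made-up values for illustration, not numbers from this thread's model; in practice you would take the per-cluster support from the model content.

```python
def relative_support(supports):
    """Return each cluster's share of the total support (values sum to 1)."""
    total = sum(supports.values())
    return {name: s / total for name, s in supports.items()}

def label_negligible(supports, threshold=0.001):
    """Rename clusters whose share falls below threshold (hypothetical cutoff)."""
    shares = relative_support(supports)
    return {
        name: ("Outliers" if share < threshold else name)
        for name, share in shares.items()
    }

# Hypothetical per-cluster support counts.
supports = {"Cluster 1": 61250, "Cluster 2": 38735, "Cluster 3": 15}

shares = relative_support(supports)
labels = label_negligible(supports)

for name in supports:
    print(f"{name}: {shares[name]:.4%} -> {labels[name]}")
```

With these sample counts the third cluster holds 0.015% of the cases, in the same range as reported above, so it gets the "Outliers" label while the other two keep their names.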
