Friday, February 24, 2012

Clustering, can I force a cluster?

I've been using some of the data mining capability in SQL 2005, and have been very impressed. But I've mostly been using decision trees/neural nets, and now have a problem that would seem to require clustering, but I'm not sure how to approach it...

Have list of parts by serial number and part number, and in a data cube have crunched a bunch of data to give me each parts' relative reliability.

So what I can do so far is to ask the cube "for a certain part number, which specific serial numbers are our 'bad actors'", ie have an unacceptably high failure rate compared to the average, by standard deviation.

The next logical question would be... what factors do these particular "bad" parts have in common? Environmental exposure, usage rates, a particular repair facility, etc etc. I have that type of raw data in other tables, no problem.

So what it seems like I would want to do is force a clustering algorithm to choose a certain segment of parts as a cluster and then use DM to determine what characteristics those parts share. Then we could see if there was any corrective action we could take, or whether we should just chuck those parts in the bin (if the DM didn't find any reasonable relationships between the parts). But I don't know enough about the clustering/association type DM to figure it out.

How does the group think I could approach this?

Geof

It may be difficult to identify a natural cluster containing only parts with a high failure rate.

I think you could recast this question as "what attributes influence a part's failure rate?"

As you have the data easily available you should be able to build a single table with facility codes, usage rates, environmental factors and failure rate. Or the tables of usage, environmental factors etc could be nested tables keyed to the parts table.

With this data structure, you could build a Na?ve Bayes model with the failure rate as the predictable: the visualizers will help you to understand the significance of different attributes.

To simplify the analysis, you could create and calculate a new boolean column such as HighFailureRate which flagged failures over a certain threshold..

You might even like to use the Data Mining Add-ins for Office. If you can build your unified view in Excel, then the Analyze Key Influencers feature of the Table Analysis tools will make this task even easier for you. All you will need to do is to format your data as a table in Excel; select Analyze Key Influencers from the Analyze tab of the ribbon; and finally select the Failure Rate (or HighFailureRate flag) column to analyze.

hth

|||

Interesting, I had assumed Clustering (or a variation) would be the best method.

But I see your logic there. I've used boolean columns like you've described to good effect in some of the other models.

I've previously also extracted the Model Content data to a SQL DB and then used some recursive queries to find the 'whys' for predictions, so I think I can probably adapt that technique to this scenario.

Thanks for your thoughts! I've head in this direction and see where it takes me.

(Unfortunately, I can't use Office 2007 yet until it's approved by our security/configuration folks, so the DM add-ins aren't an option yet. They sound great though!)

Geof

|||

Another thing to try is since you can identify the items with high failure rates, simply cluster those. You won't find out factors that discriminate between failure and non failure, but will determine patterns that show different types of failures. For example, I was able to find different classes of malignant tumors on a cancer dataset this way.

Other options are to build predictive models as Donald suggests and then build cluster models on all items that are predicted to be failures (whether they are or not) and predicted not to be failures (whether they are or not) - then you can see the distribution of failures across the clusters in each section to see if there are identifiable classes of items that are misclassified.

Another is to simply run clustering across all attributes (possibly even excluding the failure attribute) and use the cluster label as an input to a predictive algorithm such as decision trees. This is essentially a way to combine many attributes into a single attribute for further mining.

HTH

-Jamie

No comments:

Post a Comment