Wednesday, May 9, 2007

SUPERVISED VERSUS UNSUPERVISED METHODS

Data mining methods may be categorized as either supervised or unsupervised.
In unsupervised methods, no target variable is identified as such. Instead, the data mining algorithm searches for patterns and structure among all the variables. The most common unsupervised data mining method is clustering.
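
To make the idea of clustering concrete, here is a minimal sketch of one clustering technique, k-means, on one-dimensional data. The ages, the starting centers, and the function name kmeans_1d are made up for illustration. Note that no target variable appears anywhere: the algorithm only groups similar values together.

```python
def kmeans_1d(values, centers, iterations=10):
    """Very small k-means for 1-D data (illustrative sketch only)."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical customer ages; there is no target variable here.
ages = [22, 25, 27, 24, 61, 65, 58, 63]
centers, clusters = kmeans_1d(ages, centers=[20.0, 70.0])
print(centers)
print(clusters)
```

With these made-up ages the algorithm separates a younger group from an older group, although nobody told it that such groups exist.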

Most data mining methods are supervised methods, however, meaning that (1) there is a particular prespecified target variable, and (2) the algorithm is given many examples where the value of the target variable is provided, so that the algorithm may learn which values of the target variable are associated with which values of the predictor variables.

Most supervised data mining methods apply the following methodology for building and evaluating a model.

  1. First, the algorithm is provided with a training set of data, which includes the preclassified values of the target variable in addition to the predictor variables. For example, if we are interested in classifying income bracket based on age, gender, and occupation, our classification algorithm would need a large pool of records, containing complete (as complete as possible) information about every field, including the target field, income bracket. In other words, the records in the training set need to be preclassified. A provisional data mining model is then constructed using the training samples provided in the training data set. However, the training set is necessarily incomplete; that is, it does not include the “new” or future data that the data modelers are really interested in classifying. Therefore, the algorithm needs to guard against “memorizing” the training set and blindly applying all patterns found in the training set to the future data. For example, it may happen that all customers named “David” in a training set are in the high-income bracket. We would presumably not want our final model, to be applied to new data, to include the pattern “If the customer’s first name is David, the customer has a high income.” Such a pattern is a spurious artifact of the training set and needs to be verified before deployment.
  2. The next step in supervised data mining methodology is to examine how the provisional data mining model performs on a test set of data. In the test set, a holdout data set, the values of the target variable are hidden temporarily from the provisional model, which then performs classification according to the patterns and structure it learned from the training set. The efficacy of the classifications is then evaluated by comparing them against the true values of the target variable.
  3. The provisional data mining model is then adjusted to minimize the error rate on the test set.
  4. The adjusted data mining model is then applied to a validation data set, another holdout data set, where the values of the target variable are again hidden temporarily from the model. The adjusted model is itself then adjusted, to minimize the error rate on the validation set. Estimates of model performance for future, unseen data can then be computed by observing various evaluative measures applied to the validation set.
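
The steps above can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions: the records are synthetic (income bracket depends on age alone, with 10% noise), the "model" is a single age threshold, and all function names are invented for this example. It follows the naming used in this post: the test set is used to adjust the provisional model, and the validation set is used to estimate the error. For brevity, the second adjustment on the validation set is left out.

```python
import random


def make_records(n, rng):
    """Synthetic (age, income_bracket) records; the rule is an assumption."""
    records = []
    for _ in range(n):
        age = rng.randint(18, 70)
        label = "high" if age >= 40 else "low"
        if rng.random() < 0.1:  # 10% of labels are flipped as noise
            label = "low" if label == "high" else "high"
        records.append((age, label))
    return records


def predict(threshold, age):
    """The whole model is one number: ages at or above it are 'high'."""
    return "high" if age >= threshold else "low"


def error_rate(threshold, records):
    """Fraction of records whose hidden target differs from the prediction."""
    wrong = sum(1 for age, label in records if predict(threshold, age) != label)
    return wrong / len(records)


def fit(train):
    """Step 1: build a provisional model from the preclassified training set."""
    candidates = sorted({age for age, _ in train})
    return min(candidates, key=lambda t: error_rate(t, train))


rng = random.Random(0)
data = make_records(600, rng)
rng.shuffle(data)
train, test, validation = data[:300], data[300:450], data[450:]

# Step 1: provisional model from the training set.
provisional = fit(train)

# Steps 2 and 3: evaluate on the test set, then adjust the threshold
# (here, by trying nearby values) to minimize the test-set error rate.
nearby = range(provisional - 5, provisional + 6)
adjusted = min(nearby, key=lambda t: error_rate(t, test))

# Step 4: estimate performance on data the model has not been tuned on.
validation_error = error_rate(adjusted, validation)

print("provisional threshold:", provisional)
print("adjusted threshold:", adjusted)
print("validation error rate:", validation_error)
```

The validation error is usually a more honest estimate than the training or test error, because the model was never fitted or adjusted on those records.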

Methodology for supervised modeling.

Sunday, May 6, 2007

What is Data Mining?

According to the Gartner Group, data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques.
Other definitions:
  • Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner. (David Hand, Heikki Mannila, and Padhraic Smyth, Principles of Data Mining, MIT Press, Cambridge, MA, 2001.)
  • Data mining is an interdisciplinary field bringing together techniques from machine learning, pattern recognition, statistics, databases, and visualization to address the issue of information extraction from large data bases. (Peter Cabena, Pablo Hadjinian, Rolf Stadler, Jaap Verhees, and Alessandro Zanasi, Discovering Data Mining: From Concept to Implementation, Prentice Hall, Upper Saddle River, NJ, 1998.)

DM is a process!

CRISP-DM: The Six Phases of Data Mining
From this diagram, we can see that DM is a complex, iterative, and often costly process. You can't just purchase some data mining software, install it, sit back, and watch it solve all your problems. It's impossible! Data mining is not magic.
Without skilled human supervision, blind use of data mining software will only provide you with the wrong answer to the wrong question applied to the wrong type of data. The wrong analysis is worse than no analysis, since it leads to policy recommendations that will probably turn out to be expensive failures.