Data mining methods may be categorized as either supervised or unsupervised.
In unsupervised methods, no target variable is identified as such. Instead, the data mining algorithm searches for patterns and structure among all the variables. The most common unsupervised data mining method is clustering.
Most data mining methods are supervised methods, however, meaning that (1) there is a particular prespecified target variable, and (2) the algorithm is given many examples where the value of the target variable is provided, so that the algorithm may learn which values of the target variable are associated with which values of the predictor variables.
Most supervised data mining methods apply the following methodology for building and evaluating a model.
- First, the algorithm is provided with a training set of data, which includes the preclassified values of the target variable in addition to the predictor variables. For example, if we are interested in classifying income bracket based on age, gender, and occupation, our classification algorithm would need a large pool of records containing information that is as complete as possible about every field, including the target field, income bracket. In other words, the records in the training set need to be preclassified. A provisional data mining model is then constructed using the training samples provided in the training data set. However, the training set is necessarily incomplete; that is, it does not include the “new” or future data that the data modelers are really interested in classifying. Therefore, the algorithm needs to guard against “memorizing” the training set and blindly applying all patterns found in the training set to the future data. For example, it may happen that all customers named “David” in a training set are in the high-income bracket. We would presumably not want our final model, which will be applied to new data, to include the pattern “If the customer’s first name is David, the customer has a high income.” Such a pattern is a spurious artifact of the training set and needs to be verified before deployment.
- The next step in supervised data mining methodology is to examine how the provisional data mining model performs on a test set of data. In the test set, a holdout data set, the values of the target variable are hidden temporarily from the provisional model, which then performs classification according to the patterns and structure it learned from the training set. The efficacy of the classifications is then evaluated by comparing them against the true values of the target variable.
- The provisional data mining model is then adjusted to minimize the error rate on the test set.
- The adjusted data mining model is then applied to a validation data set, another holdout data set, where the values of the target variable are again hidden temporarily from the model. The adjusted model is itself then adjusted, to minimize the error rate on the validation set. Estimates of model performance for future, unseen data can then be computed by observing various evaluative measures applied to the validation set.
Methodology for supervised modeling.
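
The training, test, and validation steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any particular data mining package: the records are synthetic (age as the single predictor, income bracket as the target, with some label noise), the 60/20/20 split is an arbitrary choice, and a simple k-nearest-neighbor vote stands in for the "provisional model." All function names (`make_records`, `split_records`, `knn_predict`, `error_rate`) are hypothetical. Here the adjustment step consists of choosing the value of k with the lowest error rate on the test set; with k = 1 the model comes close to memorizing the training set.

```python
import random
from collections import Counter

def make_records(n, seed=0):
    """Synthetic (age, income bracket) records; about 20% of labels are flipped as noise."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        age = rng.randint(20, 70)
        bracket = "high" if age > 45 else "low"
        if rng.random() < 0.2:
            bracket = "low" if bracket == "high" else "high"
        records.append((age, bracket))
    return records

def split_records(records, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle a copy of the records and partition it into training, test, and validation sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]
    return train, test, validation

def knn_predict(train, age, k):
    """Predict the bracket by majority vote among the k training records closest in age."""
    nearest = sorted(train, key=lambda rec: abs(rec[0] - age))[:k]
    votes = Counter(bracket for _, bracket in nearest)
    return votes.most_common(1)[0][0]

def error_rate(train, holdout, k):
    """Hide the target values in the holdout set, classify, then compare with the true values."""
    wrong = sum(1 for age, bracket in holdout
                if knn_predict(train, age, k) != bracket)
    return wrong / len(holdout)

records = make_records(600)
train, test, validation = split_records(records)

# Adjust the provisional model: keep the k with the lowest error rate on the test set.
candidate_ks = [1, 5, 15, 31]
test_errors = {k: error_rate(train, test, k) for k in candidate_ks}
best_k = min(candidate_ks, key=test_errors.get)

# Estimate performance on future, unseen data using the validation set.
validation_error = error_rate(train, validation, best_k)
print("test error rates by k:", test_errors)
print("chosen k:", best_k, "validation error rate:", validation_error)
```

Because the validation set plays no part in choosing k in this sketch, its error rate is a less optimistic estimate of performance on new data than the test-set error rate that was minimized during adjustment.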