-
Why Data Mining?
- -More intense competition at the global scale
- -Recognition of the value in data sources
- -Availability of quality data on customers, vendors, transactions, Web, etc.
- -Consolidation and integration of data repositories into data warehouses
- -The exponential increase in data processing and storage capabilities; and decrease in cost
- -Movement toward conversion of information resources into nonphysical form
-
Definition of Data Mining
- -The nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases (Fayyad et al., 1996)
- -Keywords in this definition: Process, nontrivial, valid, novel, potentially useful, understandable.
- -Data mining: a misnomer?
- -Other names: knowledge extraction, pattern analysis, knowledge discovery, information harvesting, pattern searching, data dredging,…
-
Data Mining at the Intersection of Many Disciplines
-
Data Mining Characteristics/Objectives
- -Source of data for DM is often a consolidated data warehouse (not always!)
- -DM environment is usually a client-server or a Web-based information systems architecture
- -Data is the most critical ingredient for DM which may include soft/unstructured data
- -The miner is often an end user
- -Striking it rich requires creative thinking
- -Data mining tools’ capabilities and ease of use are essential (Web, Parallel processing, etc.)
-
Data in Data Mining
- Data: a collection of facts usually obtained as the result of experiences, observations, or experiments
- -Data may consist of numbers, words, images, …
- -Data: lowest level of abstraction (from which information and knowledge are derived)
- Examples
- -Categorical: Male, Female
- -Ordinal: Freshman, Sophomore, Junior, Senior
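A minimal Python sketch of the distinction in the examples above: categorical (nominal) values are labels with no inherent order, whereas ordinal values can be ranked. The rank mapping and the helper name is_more_senior are illustrative assumptions, not part of the original material.

```python
# Nominal categories: labels only, no meaningful order.
GENDER_CATEGORIES = {"Male", "Female"}

# Ordinal categories: the order carries meaning, so each label maps to a rank.
CLASS_STANDING_RANK = {"Freshman": 1, "Sophomore": 2, "Junior": 3, "Senior": 4}

def is_more_senior(a: str, b: str) -> bool:
    """Return True if class standing `a` ranks above class standing `b`."""
    return CLASS_STANDING_RANK[a] > CLASS_STANDING_RANK[b]

print(is_more_senior("Junior", "Sophomore"))  # True
```

A comparison such as "greater than" is meaningful for the ordinal variable but not for the nominal one, which supports only equality tests.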
-
-
Customer Relationship Management
- -Maximize return on marketing campaigns
- -Improve customer retention (churn analysis)
- -Maximize customer value (cross-, up-selling)
- -Identify and treat most valued customers
-
Banking and Other Financial
- -Automate the loan application process
- -Detecting fraudulent transactions
- -Maximize customer value (cross-, up-selling)
- -Optimizing cash reserves with forecasting
-
Retailing and Logistics
- -Optimize inventory levels at different locations
- -Improve the store layout and sales promotions
- -Optimize logistics by predicting seasonal effects
- -Minimize losses due to limited shelf life
-
Manufacturing and Maintenance
- -Predict/prevent machinery failures
- -Identify anomalies in production systems to optimize the use of manufacturing capacity
- -Discover novel patterns to improve product quality
-
Brokerage and Securities Trading
- -Predict changes on certain bond prices
- -Forecast the direction of stock fluctuations
- -Assess the effect of events on market movements
- -Identify and prevent fraudulent activities in trading
-
Insurance
- -Forecast claim costs for better business planning
- -Determine optimal rate plans
- -Optimize marketing to specific customers
- -Identify and prevent fraudulent claim activities
-
Data Mining Process
- -A manifestation of best practices
- -A systematic way to conduct DM projects
- -Different groups have different versions
- -Most common standard processes:
- --CRISP-DM (Cross-Industry Standard Process for Data Mining)
- --SEMMA (Sample, Explore, Modify, Model, and Assess)
- --KDD (Knowledge Discovery in Databases)
-
Data Mining Process: CRISP-DM
- Step 1: Business Understanding
- Step 2: Data Understanding
- Step 3: Data Preparation (!)
- Step 4: Model Building
- Step 5: Testing and Evaluation
- Step 6: Deployment
- -The process is highly repetitive and experimental (DM: art versus science?)
-
Data Preparation – A Critical DM Task
-
Data Mining Process: SEMMA
-
Data Mining Methods: Classification
- -Most frequently used DM method
- -Part of the machine-learning family
- -Employs supervised learning
- -Learn from past data, classify new data
- -The output variable is categorical (nominal or ordinal) in nature
- -Classification versus regression?
- -Classification versus clustering?
-
Assessment Methods for Classification
- -Predictive accuracy
- --Hit rate
- -Speed
- --Model building; predicting
- -Robustness
- -Scalability
- -Interpretability
- --Transparency, explainability
-
Accuracy of Classification Models
- -In classification problems, the primary source for accuracy estimation is the confusion matrix
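As a hedged illustration of the point above, the following Python sketch builds a two-class confusion matrix and derives the hit rate (accuracy) from it. The function names and the toy label lists are assumptions made for this example only.

```python
from collections import Counter

def confusion_matrix(actual, predicted, positive):
    """Count true/false positives and negatives for a two-class problem."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["TP" if a == positive else "FP"] += 1
        else:
            counts["FN" if a == positive else "TN"] += 1
    return counts

def accuracy(cm):
    """Hit rate: correctly classified cases divided by all cases."""
    total = cm["TP"] + cm["FP"] + cm["FN"] + cm["TN"]
    return (cm["TP"] + cm["TN"]) / total

# Toy data, invented for illustration.
actual    = ["yes", "yes", "no", "no",  "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]
cm = confusion_matrix(actual, predicted, positive="yes")
print(dict(cm), accuracy(cm))
```

Here four of the six cases fall on the diagonal of the matrix (two true positives, two true negatives), so the hit rate is 4/6.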
-
Estimation Methodologies for Classification
- -Simple split (or holdout or test sample estimation)
- --Split the data into two mutually exclusive sets: training (~70%) and testing (~30%)
- --For ANN, the data is split into three subsets (training [~60%], validation [~20%], testing [~20%])
- -k-Fold Cross Validation (rotation estimation)
- --Split the data into k mutually exclusive subsets
- --Use each subset in turn as the testing set while using the remaining subsets for training
- --Repeat the experiment k times
- --Aggregate the test results for a true estimate of prediction accuracy
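The two estimation schemes above can be sketched in plain Python as follows. This is a minimal illustration using only the standard library; the function names, the fixed seed, and the round-robin way of forming folds are assumptions, not a prescribed procedure.

```python
import random

def holdout_split(rows, train_fraction=0.7, seed=0):
    """Shuffle, then split rows into mutually exclusive training/testing sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def k_fold_splits(rows, k, seed=0):
    """Yield (training, testing) pairs; each fold is the testing set exactly once."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k mutually exclusive subsets
    for i in range(k):
        testing = folds[i]
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield training, testing

rows = list(range(10))
train, test = holdout_split(rows)
print(len(train), len(test))  # 7 3
for training, testing in k_fold_splits(rows, k=5):
    print(sorted(testing))
```

In practice a model would be trained on each training set and scored on the matching testing set, and the k scores would then be averaged.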
-
Classification Techniques
- -Decision tree analysis
- -Statistical analysis
- -Neural networks
- -Support vector machines
- -Case-based reasoning
- -Bayesian classifiers
- -Genetic algorithms
- -Rough sets
-
Decision Trees
- -Employs the divide and conquer method
- -Recursively divides a training set until each division consists of examples from one class
- -1. Create a root node and assign all of the training data to it
- -2. Select the best splitting attribute
- -3. Add a branch to the root node for each value of the split. Split the data into mutually exclusive subsets along the lines of the specific split
- -4. Repeat steps 2 and 3 for every leaf node until the stopping criterion is reached
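The four steps above can be sketched as a short recursive procedure. This is a simplified illustration rather than any specific published algorithm: it assumes categorical attributes, one branch per attribute value, weighted Gini impurity as the "best split" measure, and stopping when a node is pure or no attributes remain. All names and the toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure node)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def partition(rows, attr):
    """Split rows into mutually exclusive subsets, one per value of attr."""
    subsets = {}
    for row in rows:
        subsets.setdefault(row[attr], []).append(row)
    return subsets

def build_tree(rows, attrs, target):
    labels = [row[target] for row in rows]
    # Stopping criteria (assumed): the node is pure or no attributes remain.
    if len(set(labels)) == 1 or not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Step 2: choose the attribute with the lowest weighted impurity.
    def weighted_gini(attr):
        return sum(len(s) / len(rows) * gini([r[target] for r in s])
                   for s in partition(rows, attr).values())
    best = min(attrs, key=weighted_gini)
    # Step 3: one branch per value; step 4: recurse on each subset.
    remaining = [a for a in attrs if a != best]
    return {best: {value: build_tree(subset, remaining, target)
                   for value, subset in partition(rows, best).items()}}

def classify(tree, row):
    """Follow branches until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][row[attr]]
    return tree

# Toy training set, invented for illustration.
rows = [
    {"outlook": "sunny",    "windy": "no",  "play": "no"},
    {"outlook": "sunny",    "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "no",  "play": "yes"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
    {"outlook": "rainy",    "windy": "no",  "play": "yes"},
    {"outlook": "rainy",    "windy": "yes", "play": "no"},
]
tree = build_tree(rows, ["outlook", "windy"], "play")
print(tree)
print(classify(tree, {"outlook": "rainy", "windy": "no"}))
```

The sketch does no pruning and raises a KeyError for attribute values it never saw during training; production algorithms handle both cases.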
-
DT algorithms mainly differ on
- -Splitting criteria
- --Which variable to split first?
- --What values to use to split?
- --How many splits to form for each node?
- -Stopping criteria
- --When to stop building the tree
- -Pruning (generalization method)
- --Pre-pruning versus post-pruning
-
Most popular DT algorithms include
- -ID3, C4.5, C5; CART; CHAID; M5
-
Alternative splitting criteria
- -Gini index determines the purity of a specific class as a result of a decision to branch along a particular attribute/value
- --Used in CART
- -Information gain uses entropy to measure the extent of uncertainty or randomness of a particular attribute/value split
- --Used in ID3, C4.5, C5
- -Chi-square statistics (used in CHAID)
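To make the first two criteria concrete, here is a small Python sketch that computes Gini impurity, entropy, and the information gain of a candidate split. It uses the standard textbook formulas; the function names and the toy label lists are assumptions for illustration, and the chi-square criterion is not covered.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

# Toy node with 5 "yes" and 5 "no" cases, split into two branches.
parent = ["yes"] * 5 + ["no"] * 5
left, right = ["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4
print(gini(parent), entropy(parent))  # 0.5 1.0
print(information_gain(parent, [left, right]))
```

A split that produces purer branches lowers the weighted impurity of the children, which shows up as a larger information gain.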
-
Cluster Analysis for Data Mining
- -Used for automatic identification of natural groupings of things
- -Part of the machine-learning family
- -Employs unsupervised learning
- -Learns the clusters of things from past data, then assigns new instances
- -There is not an output variable
- -Also known as segmentation
-
Clustering results may be used to
- -Identify natural groupings of customers
- -Identify rules for assigning new cases to classes for targeting/diagnostic purposes
- -Provide characterization, definition, labeling of populations
- -Decrease the size and complexity of problems for other data mining methods
- -Identify outliers in a specific domain (e.g., rare-event detection)
-
Analysis methods
- -Statistical methods (including both hierarchical and nonhierarchical), such as k-means, k-modes, and so on
- -Neural networks (adaptive resonance theory [ART], self-organizing map [SOM])
- -Fuzzy logic (e.g., fuzzy c-means algorithm)
- -Genetic algorithms
-
Divisive versus Agglomerative methods
-
How many clusters?
- -There is not a “truly optimal” way to calculate it
- -Heuristics are often used
- --Look at the sparseness of clusters
- --Number of clusters = (n/2)^(1/2) (n: number of data points)
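As a quick worked example of the rule of thumb above, the following Python sketch evaluates the square root of n/2. Rounding to the nearest integer and enforcing a minimum of one cluster are assumptions added here, since the heuristic itself does not say how to handle fractional results.

```python
from math import sqrt

def rule_of_thumb_k(n: int) -> int:
    """Heuristic number of clusters, sqrt(n / 2), rounded and at least 1."""
    return max(1, round(sqrt(n / 2)))

print(rule_of_thumb_k(200))  # 10
```

The result is only a starting point; inspecting the resulting clusters remains necessary.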
-
Most cluster analysis methods involve the use of a distance measure to calculate the closeness between pairs of items
- -Euclidean versus Manhattan (rectilinear) distance
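The two distance measures can be compared with a short Python sketch. The function names are illustrative; points are assumed to be equal-length tuples of numbers.

```python
from math import sqrt

def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Rectilinear distance: sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```

For the same pair of points the Manhattan distance is never smaller than the Euclidean distance, so the choice of measure can change which items count as "closest".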
-
k-Means Clustering Algorithm
- -k : pre-determined number of clusters
- -Algorithm (Step 0: determine value of k)
- Step 1: Randomly generate k points as initial cluster centers
- Step 2: Assign each point to the nearest cluster center
- Step 3: Re-compute the new cluster centers
- Repetition step: Repeat steps 2 and 3 until some convergence criterion is met (usually that the assignment of points to clusters becomes stable)
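A minimal Python sketch of the algorithm above, using only the standard library (math.dist requires Python 3.8 or later). Several details are assumptions for illustration: the initial centers are drawn from the data points themselves, Euclidean distance is used, an empty cluster keeps its previous center, and an iteration cap guards against non-termination.

```python
import random
from math import dist

def k_means(points, k, max_iterations=100, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial cluster centers.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iterations):
        # Step 2: assign each point to the nearest cluster center.
        new_assignment = [min(range(k), key=lambda c: dist(p, centers[c]))
                          for p in points]
        # Convergence criterion: the assignments no longer change.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each center as the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, assignment

# Toy data: two visibly separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, assignment = k_means(points, k=2)
print(centers)
print(assignment)
```

Because the starting centers are random, different seeds can label the clusters differently, and on less well-separated data they can also produce different groupings.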
-
Cluster Analysis for Data Mining - k-Means Clustering Algorithm
-
Association Rule Mining
- -A very popular DM method in business
- -Finds interesting relationships (affinities) between variables (items or events)
- -Part of the machine-learning family
- -Employs unsupervised learning
- -There is no output variable
- -Also known as market basket analysis
- -Often used as an example to describe DM to ordinary people, such as the famous “relationship between diapers and beer!”
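A hedged Python sketch of market basket analysis on a handful of invented transactions. It scores item pairs with support and confidence, two standard association-rule measures that the outline above does not name; the function name pair_rules, the 0.5 support threshold, and the baskets are all assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for frequent pairs."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in set(t))
    pair_counts = Counter(pair for t in transactions
                          for pair in combinations(sorted(set(t)), 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support >= min_support:
            # Confidence: share of baskets with the antecedent that also
            # contain the consequent.
            rules.append((a, b, support, count / item_counts[a]))
            rules.append((b, a, support, count / item_counts[b]))
    return rules

# Toy baskets, invented for illustration.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"milk", "bread"},
]
for antecedent, consequent, support, confidence in pair_rules(baskets):
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data, every basket containing beer also contains diapers, while only two of the three diaper baskets contain beer, so the rule is stronger in one direction than the other.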
-
Data Mining Software
- Commercial
- -SPSS - PASW (formerly Clementine)
- -SAS - Enterprise Miner
- -IBM - Intelligent Miner
- -StatSoft – Statistica Data Miner
- -… many more
- Free and/or Open Source
- -Weka
- -RapidMiner…