1. Introduction

We are drowning in data, but starving for knowledge. (John Naisbitt, 1982)
Data, Information, Knowledge, and Wisdom form a hierarchy of increasing value (the DIKW pyramid).
Data mining draws ideas from machine learning, statistics, and database systems.
Methods

| Descriptive methods (unsupervised) | Predictive methods (supervised) |
| --- | --- |
| no target attribute | with a target (class) attribute |
| Clustering, Association Rule Mining, Text Mining, Anomaly Detection, Sequential Pattern Mining | Classification, Regression, Text Mining, Time Series Prediction |

None of the data mining steps actually requires a computer, but computers provide scalability and help to avoid human bias.

Basic process:
Apply data mining method -> Evaluate resulting model / patterns -> Iterate:
– Experiment with different parameter settings
– Experiment with alternative methods
– Improve preprocessing and feature generation
– Combine different methods

2. Clustering

Intra-cluster distances are minimized: Data points in one cluster are similar to one another.
Inter-cluster distances are maximized: Data points in separate clusters are different from each other.
Application area: Market segmentation, Document Clustering
Types:

  1. Partitional Clustering: A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
  2. Hierarchical Clustering: A set of nested clusters organized as a hierarchical tree

**Clustering algorithm:** Partitional, Hierarchical, Density-based Algorithms
**Proximity (similarity, or dissimilarity) measure:** Euclidean Distance, Cosine Similarity, Domain-specific Similarity Measures
Application area: Product Grouping, Social Network Analysis, Grouping Search Engine Results, Image Recognition

2.1 K-Means Clustering

Weakness 1: Initial Seeds
Results can vary significantly depending on the initial choice of seeds (number and position).
Improvements: run K-Means several times with different random seeds and keep the best result, or choose initial seeds that lie far apart from each other (the idea behind k-means++).
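A minimal scikit-learn sketch of both remedies (the toy data `X` is an assumption; `n_init` and `init="k-means++"` are scikit-learn's built-in options for multiple restarts and spread-out initial seeds):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two well-separated blobs (stand-in for a real dataset).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])

# n_init=10 restarts K-Means with 10 different random seeds and keeps
# the run with the lowest SSE; init="k-means++" picks initial seeds
# that are spread far apart instead of uniformly at random.
km = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=42)
labels = km.fit_predict(X)
print(km.inertia_)  # SSE (cohesion) of the best of the 10 runs
```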

Weakness 2: Outlier Handling
Remedies:

  1. Remove data points that are far away from the centroids.
  2. Random sampling:
    Choose a small subset of the data points. The chance of selecting an outlier is very small if the data set is large enough. After determining the centroids based on the sample, assign the remaining data points to them. This also improves runtime performance (see the sketch below).
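A minimal sketch of the sampling remedy, assuming scikit-learn and a made-up dataset: the centroids are determined on a small random sample, then `predict` assigns all remaining points to the nearest centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = rng.normal(size=(100_000, 2))  # large toy dataset

# 1. Determine centroids on a small random sample: outliers are
#    unlikely to be drawn, and fitting is much faster.
sample = X[rng.choice(len(X), size=1_000, replace=False)]
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(sample)

# 2. Assign all (remaining) data points to the nearest learned centroid.
labels = km.predict(X)
```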

Evaluation
Maximize cohesion (similarity of points within a cluster) and separation (dissimilarity between clusters).
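One common way to quantify cohesion and separation together is the silhouette coefficient; a sketch with scikit-learn on toy blobs (the data and k=3 are assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# The silhouette of a point compares its average distance to its own
# cluster (cohesion) with its distance to the nearest other cluster
# (separation); values near 1 indicate tight, well-separated clusters.
print(silhouette_score(X, labels))
```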

Summary
Advantages: simple; efficient, with time complexity O(tkn) [n: number of data points, k: number of clusters, t: number of iterations]
Disadvantages: the number of clusters must be picked beforehand; all items are forced into a cluster; sensitive to outliers; sensitive to initial seeds

2.2 K-Medoids

K-Medoids is a K-Means variation that uses the medoid of each cluster instead of the mean.
Medoids are the most central existing data points in each cluster.
K-Medoids is more robust against outliers, as the medoid is not affected by extreme values.
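A minimal NumPy sketch of the medoid idea (not a full K-Medoids/PAM implementation; the cluster data is made up): the medoid is the existing cluster member with the smallest total distance to all other members.

```python
import numpy as np

def medoid(points: np.ndarray) -> np.ndarray:
    """Return the cluster member with minimal total distance to the rest."""
    # Pairwise Euclidean distances between all points in the cluster.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Unlike the mean, the result is always an actual data point
    # and is not dragged towards extreme values.
    return points[dist.sum(axis=1).argmin()]

cluster = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [10.0, 10.0]])  # one outlier
print(medoid(cluster))  # stays near (1, 1); the mean would be pulled towards the outlier
```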

2.3 DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based algorithm.
Density = number of points within a specified radius (Eps)

DBSCAN Algorithm: Classify points as core, border, or noise points -> Eliminate noise points -> Perform clustering on the remaining points
Advantages: resistant to noise; can handle clusters of different shapes and sizes
Disadvantages: struggles with clusters of varying densities and with high-dimensional data
Determining Eps and MinPts: a common heuristic is to sort all points by their distance to their k-th nearest neighbor and choose Eps at the "knee" of this curve, with MinPts = k (see the sketch below).
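A sketch with scikit-learn showing the k-distance heuristic and a DBSCAN run (the moons data and `eps=0.2` are assumptions; `eps` and `min_samples` are scikit-learn's names for Eps and MinPts):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# k-distance heuristic: sort every point's distance to its k-th nearest
# neighbor; the "knee" of this curve is a candidate value for Eps.
k = 4
dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
k_dist = np.sort(dist[:, -1])  # plot this curve to spot the knee

db = DBSCAN(eps=0.2, min_samples=k).fit(X)
print(set(db.labels_))  # label -1 marks noise points
```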

2.4 Hierarchical Clustering

Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram (a tree-like diagram that records the sequences of merges or splits; the y-axis displays the distance at which clusters were merged).
Advantages: We do not have to assume any particular number of clusters. May be used to look for meaningful taxonomies.
Steps:
Starting Situation: Start with clusters of individual points and a proximity matrix
Intermediate Situation: After some merging steps, we have a number of clusters.
How to Define Inter-Cluster Similarity?

  1. Single Link (MIN): Similarity of two clusters is based on the two most similar (closest) points in the different clusters, i.e., the cluster distance is the minimum pairwise distance. Can handle non-elliptical shapes, but is sensitive to outliers.
  2. Complete Link (MAX): Similarity of two clusters is based on the two least similar (most distant) points in the different clusters, i.e., the cluster distance is the maximum pairwise distance. Less sensitive to noise and outliers, but biased towards globular clusters and tends to break large clusters.
  3. Group Average: Average of the pairwise proximities between points in the two clusters. Uses average connectivity for scalability, since total proximity favors large clusters. A compromise between Single and Complete Link: less susceptible to noise and outliers, but biased towards globular clusters.
  4. Distance Between Centroids
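A sketch comparing these linkage criteria with SciPy; the method names "single", "complete", "average", and "centroid" correspond to the four options above (the random toy data is an assumption):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))

# Each method defines inter-cluster similarity differently, so the
# merge order (and hence the dendrogram) can differ noticeably.
for method in ["single", "complete", "average", "centroid"]:
    Z = linkage(X, method=method)  # (n-1) x 4 merge history
    print(method, Z[-1, 2])        # distance at which the final merge happens

# scipy.cluster.hierarchy.dendrogram(Z) draws the corresponding dendrogram.
```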

Limitations: once a decision to merge two clusters is made, it cannot be undone; no global objective function is directly minimized; the O(n²) space needed for the proximity matrix makes it expensive for large data sets.

2.5 Proximity Measures

Single attributes: Similarity in [0, 1]; dissimilarity in [0, upper limit varies]
Many attributes: e.g., Euclidean Distance

$$d(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$$

where n is the number of attributes and $p_k$, $q_k$ are the k-th attributes of data points p and q.

Caution
We can easily end up comparing apples and oranges: changing the units of measurement changes the clustering result.
Recommendation: normalize the attributes before clustering (and, in general, before any data mining algorithm involving distances).
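A small sketch of the problem and the fix, assuming scikit-learn's `StandardScaler` for z-score normalization and a made-up two-attribute dataset:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Height in metres, income in euros: income dominates raw distances.
X = np.array([[1.80, 30_000.0],
              [1.60, 31_000.0],
              [1.82, 60_000.0]])

print(np.linalg.norm(X[0] - X[1]))  # ~1000, driven almost entirely by income

# z-score normalization: every attribute gets mean 0 and std 1,
# so both attributes contribute comparably to Euclidean distances.
Xs = StandardScaler().fit_transform(X)
print(np.linalg.norm(Xs[0] - Xs[1]))
```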

Similarity of Binary Attributes
A common situation is that two objects, p and q, have only binary attributes.
1. Symmetric binary attributes -> hobbies, favorite bands, favorite movies
A binary attribute is symmetric if both of its states (0 and 1) have equal importance and carry the same weight.
Similarity measure: Simple Matching Coefficient (SMC)

$$SMC(p, q) = \frac{f_{11} + f_{00}}{f_{00} + f_{01} + f_{10} + f_{11}}$$

where $f_{xy}$ counts the attributes with value x in p and value y in q.

2. Asymmetric binary attributes -> (dis)agreement with political statements, recommendation for voting
Asymmetric: one of the states is more important or more valuable than the other. By convention, state 1 represents the more important state; 1 is typically the rare or infrequent state. Examples: shopping baskets, word/document vectors.
Similarity measure: Jaccard Coefficient (shared zeros are ignored)

$$J(p, q) = \frac{f_{11}}{f_{01} + f_{10} + f_{11}}$$
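A small pure-Python sketch computing both coefficients on made-up binary vectors (sparse, as in a shopping-basket setting), illustrating why Jaccard is the better choice for asymmetric attributes:

```python
def smc(p, q):
    """Simple Matching Coefficient: matches (1-1 and 0-0) over all attributes."""
    return sum(a == b for a, b in zip(p, q)) / len(p)

def jaccard(p, q):
    """Jaccard: 1-1 matches over attributes where at least one vector is 1."""
    f11 = sum(a == 1 and b == 1 for a, b in zip(p, q))
    nonzero = sum(a == 1 or b == 1 for a, b in zip(p, q))
    return f11 / nonzero if nonzero else 0.0

p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]  # basket of customer 1
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]  # basket of customer 2
print(smc(p, q))      # 0.7 -- inflated by the many shared zeros
print(jaccard(p, q))  # 0.0 -- no item was bought by both customers
```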

3. Association Rule Discovery

Given a set of records, each of which contains some number of items from a given collection, produce dependency rules that predict the occurrence of an item based on the occurrences of other items.
Application area: Marketing and Sales Promotion, Content-based recommendation, Customer loyalty programs
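A tiny pure-Python sketch of the underlying idea with made-up baskets: the support of an itemset is the fraction of records containing it, and the confidence of a rule X -> Y is support(X ∪ Y) / support(X).

```python
# Hypothetical transactions (market baskets).
baskets = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "beer"},
    {"milk", "bread", "beer"},
]

def support(itemset):
    """Fraction of baskets that contain every item of the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

# Rule {milk} -> {bread}
antecedent, consequent = {"milk"}, {"bread"}
confidence = support(antecedent | consequent) / support(antecedent)
print(support(antecedent | consequent), confidence)  # 0.75 1.0
```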
