Unsupervised Learning

1. k-Means Clustering:

Description: k-Means is a simple and widely used clustering algorithm. It partitions the data into k clusters, where each data point belongs to the cluster with the nearest mean.
How it works:
1. Initialize k centroids randomly.
2. Assign each data point to the nearest centroid.
3. Recalculate the centroids based on the current cluster members.
4. Repeat steps 2 and 3 until convergence (the centroids no longer change) or a maximum number of iterations is reached.
Use Case: Customer segmentation, image compression.
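
The four steps above can be sketched from scratch with NumPy. This is a minimal illustration, not how scikit-learn implements it: the function name kmeans, the parameters n_iters and seed, and the toy data are all assumptions made for the example. It also assumes the initial centroids are drawn from the data points themselves, which is one common way to "initialize randomly".

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal k-means sketch (hypothetical helper); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated toy blobs (illustrative data only).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, labels = kmeans(X, k=2)
```

Because the result depends on the random starting centroids, practical implementations usually run the algorithm several times and keep the best run.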

2. Hierarchical Clustering:

Description: Hierarchical clustering creates a tree of clusters, where each node is a cluster containing its child clusters. This can be done in an agglomerative manner (bottom-up) or a divisive manner (top-down).
How it works (Agglomerative):
1. Start with each data point as a single cluster.
2. Merge the two closest clusters, where "closest" is defined by a linkage criterion (e.g., single, complete, or average linkage).
3. Repeat until all points are merged into a single cluster.
Use Case: Creating taxonomies, social network analysis.
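
The agglomerative steps can be sketched in plain Python. This is a deliberately naive version for illustration: the function name agglomerative, its n_clusters parameter, and the toy points are assumptions, and single linkage (distance between the closest pair of members) is just one possible way to define "closest". Library implementations are far more efficient.

```python
from math import dist

def agglomerative(points, n_clusters=1):
    """Naive single-linkage sketch (hypothetical helper).

    Returns (clusters, merges): the remaining clusters as lists of point
    indices, and the merge history as (cluster_a, cluster_b, distance).
    """
    # Step 1: every point starts as its own cluster.
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > n_clusters:
        # Step 2: find the two closest clusters; single linkage measures
        # the distance between their closest pair of members.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        # Step 3: merge them and repeat.
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters, merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
clusters, merges = agglomerative(points, n_clusters=2)
```

The merge history is what a dendrogram visualizes: stopping early (here at two clusters) corresponds to cutting the tree at a chosen height.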

3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise):

Description: DBSCAN is a density-based clustering algorithm that groups closely packed points into clusters and marks points in low-density regions as outliers.
How it works:
1. Identify core points, which are points with at least a minimum number of neighbors (often called minPts) within a given distance (often called eps).
2. Expand clusters from these core points, including all points that are density-reachable from them.
3. Mark points that are not part of any cluster as noise (outliers).
Use Case: Clustering in data with noise, spatial data analysis.
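
A short sketch using scikit-learn's DBSCAN class shows the noise label in practice. The toy points and the values eps=0.5 and min_samples=3 are arbitrary choices for this illustration; in scikit-learn, min_samples counts the point itself, and noise points receive the label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups plus one isolated point (illustrative data only).
X = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1],
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0], [5.1, 5.1],
    [20.0, 20.0],
])

# eps is the neighborhood radius; min_samples is the number of points
# (including the point itself) needed for a point to count as a core point.
db = DBSCAN(eps=0.5, min_samples=3).fit(X)

labels = db.labels_                      # cluster index per point, -1 = noise
core_indices = db.core_sample_indices_   # indices of the core points
noise_mask = labels == -1
```

Unlike k-means, the number of clusters is not specified up front; it follows from the density parameters.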

Anomaly Detection Techniques

1. Statistical Methods:

Description: Anomalies are detected by identifying data points that significantly deviate from the statistical distribution of the data (e.g., z-scores, Grubbs’ test).
Use Case: Fraud detection, quality control.
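
As a small illustration of the z-score approach, the sketch below flags values that lie more than a chosen number of standard deviations from the mean. The helper name zscore_outliers, the threshold of 3, and the sample values are assumptions for the example; the method also presumes the data are roughly normally distributed and have non-zero variance.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return indices of values whose |z-score| exceeds the threshold.

    Hypothetical helper; uses the population standard deviation.
    """
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.flatnonzero(np.abs(z) > threshold)

# 19 ordinary readings and one extreme value (illustrative data only).
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 10,
            11, 9, 10, 12, 10, 11, 9, 10, 10, 50]
outlier_idx = zscore_outliers(readings)
```

One caveat: in very small samples a single extreme value inflates the standard deviation so much that its own z-score can never reach 3, so a lower threshold or a robust alternative may be needed there.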

2. Isolation Forest:

Description: Isolation Forest is an ensemble method that isolates anomalies by recursively partitioning data points. Anomalies are more likely to be isolated sooner because they are fewer and different.
How it works:
1. Randomly select a feature and a split value between the maximum and minimum values of the selected feature.
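2. Recursively partition the data until every point is isolated or a maximum tree depth is reached.
3. Score each point by its average path length across the trees; anomalies have shorter paths because they are easier to isolate.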
Use Case: Detecting rare events, outlier detection in high-dimensional datasets.
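
The idea can be tried out with scikit-learn's IsolationForest. The sketch below is only an illustration: the synthetic data, the planted outlier at (10, 10), and the settings n_estimators=100 and random_state=0 are assumptions chosen for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# 200 "normal" points around the origin plus one planted far-away point.
rng = np.random.default_rng(0)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X = np.vstack([X_normal, [[10.0, 10.0]]])

forest = IsolationForest(n_estimators=100, random_state=0)
pred = forest.fit_predict(X)       # +1 = inlier, -1 = anomaly
scores = forest.score_samples(X)   # lower score = more anomalous

most_anomalous = int(scores.argmin())  # index of the lowest-scoring point
```

With the default contamination setting, a few ordinary points near the edge of the cloud may also be labeled -1, so the scores are often more informative than the hard labels.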

3. One-Class SVM:

Description: One-Class SVM learns a decision boundary that encloses the normal data points; points falling outside it are treated as outliers. It is well suited to imbalanced datasets in which anomalies are very rare.
How it works:
1. Train the model on normal data (assumes that the majority of data points are normal).
2. Data points that fall outside the learned boundary are classified as anomalies.
Use Case: Anomaly detection in network security, fraud detection.
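
The two steps can be sketched with scikit-learn's OneClassSVM. The training data, the two query points, and the settings kernel="rbf", gamma=0.1, and nu=0.05 are illustrative assumptions; in practice gamma and nu usually need tuning, and the results are sensitive to feature scaling.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Step 1: train on data assumed to be (mostly) normal.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# nu is an upper bound on the fraction of training points treated as outliers.
model = OneClassSVM(kernel="rbf", gamma=0.1, nu=0.05)
model.fit(X_train)

# Step 2: points outside the learned boundary are classified as anomalies.
X_new = np.array([[0.0, 0.0],    # near the center of the training data
                  [8.0, 8.0]])   # far from anything seen in training
pred = model.predict(X_new)                 # +1 = normal, -1 = anomaly
decision = model.decision_function(X_new)   # signed distance to the boundary
```

Here the point near the center of the training data falls inside the boundary, while the distant point falls outside it.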

Example: k-Means Clustering in Python

Here’s a Python example demonstrating k-means clustering with the scikit-learn library (imported as sklearn):

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Generate synthetic data
X, y = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Apply k-means clustering
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)  # fixed seed and explicit n_init for reproducible results
y_kmeans = kmeans.fit_predict(X)

# Plot the results
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')

# Plot the centroids
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1], c='red', s=200, alpha=0.75)
plt.show()

Clustering algorithms: k-means, hierarchical clustering, DBSCAN

Dimensionality Reduction

Anomaly Detection Techniques