# 🧩 Unsupervised Learning
## 📘 What Is Unsupervised Learning?
Unsupervised Learning is a type of Machine Learning where the model learns from unlabeled data, meaning the input data has no predefined output.
The goal is to find hidden patterns, relationships, or structures within the data.
Key idea: The algorithm explores the data and organizes it based on similarities or underlying patterns.
Examples:
- Grouping customers based on shopping behavior 🛒
- Detecting unusual transactions (anomalies) 💳
- Reducing high-dimensional data for visualization 📊
## 🔹 K-Means Clustering
### Concept
K-Means is one of the most popular clustering algorithms.
It divides data into K clusters, where each data point belongs to the cluster with the nearest mean (centroid).
### How It Works
1. Choose the number of clusters (K).
2. Randomly initialize K centroids.
3. Assign each data point to the nearest centroid.
4. Recalculate centroids based on the current assignments.
5. Repeat steps 3–4 until the centroids no longer change significantly.
Goal: Minimize the total squared distance (SSE) between data points and their assigned centroids.
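Before turning to scikit-learn, here is a minimal from-scratch sketch of the loop described above. The function name and the initialization scheme are illustrative assumptions, and it assumes no cluster ever ends up empty:

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly pick k distinct points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        # (assumes no cluster goes empty; a real implementation must handle that)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once the centroids no longer move significantly
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels
```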
### Example in Python
```python
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

# Example data: two well-separated groups
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Create and fit the model (n_init=10 runs 10 random initializations
# and keeps the best; it also avoids version-dependent warnings)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
kmeans.fit(X)

# Get cluster centers and labels
print("Cluster centers:", kmeans.cluster_centers_)
print("Labels:", kmeans.labels_)

# Visualize clusters
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            color='red', marker='X', s=200)
plt.title("K-Means Clustering Example")
plt.show()
```
Choosing K:
Use the Elbow Method โ plot the sum of squared errors (SSE) for different K values and look for the "elbow" point where improvement slows down.
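As a concrete illustration, scikit-learn stores the fitted model's SSE in the `inertia_` attribute, so an elbow plot for the toy data above can be sketched like this:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Fit K-Means for several K values and record the SSE (inertia_)
ks = range(1, 6)
sse = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel("Number of clusters (K)")
plt.ylabel("SSE (inertia)")
plt.title("Elbow Method")
plt.show()
```

For this toy data the curve drops sharply up to K=2 and flattens afterwards, so the elbow suggests K=2.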
Use Cases:
- Customer segmentation
- Market analysis
- Image compression (see the sketch below)
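As a sketch of the image-compression use case: each pixel's color is treated as a 3-D point, and every pixel is replaced by its cluster's centroid color, so the image can be stored with only K distinct colors. The random image here is a stand-in; in practice you would load a real one, e.g. with `matplotlib.image.imread`:

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in "image": 64x64 pixels with random RGB values in [0, 1]
rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))

# Cluster the pixel colors into K = 8 representative colors
pixels = image.reshape(-1, 3)
kmeans = KMeans(n_clusters=8, n_init=10, random_state=42).fit(pixels)

# Rebuild the image using only the 8 centroid colors
compressed = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape)
print("Distinct colors after compression:",
      len(np.unique(compressed.reshape(-1, 3), axis=0)))
```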
## 🔹 Dimensionality Reduction (PCA)
### Concept
Dimensionality Reduction simplifies large datasets by reducing the number of variables (features) while keeping important information.
This helps speed up algorithms, remove noise, and make visualization easier.
The most common technique is Principal Component Analysis (PCA).
### What PCA Does
PCA transforms data into a new coordinate system where:
- The first component explains the most variance.
- The second component explains the next most variance.
- And so on…
Essentially, PCA finds directions (principal components) that best represent the data.
### Steps in PCA
1. Standardize the data.
2. Compute the covariance matrix.
3. Calculate its eigenvalues and eigenvectors.
4. Select the top components explaining the most variance.
5. Transform the data into the new feature space.
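These steps translate almost line-for-line into NumPy. A minimal sketch (the function name is illustrative, and there is no input validation):

```python
import numpy as np

def pca_sketch(X, n_components):
    # 1. Standardize the data (zero mean, unit variance per feature)
    X_std = (X - X.mean(axis=0)) / X.std(axis=0)
    # 2. Compute the covariance matrix of the standardized features
    cov = np.cov(X_std, rowvar=False)
    # 3. Eigenvalues and eigenvectors (eigh suits symmetric matrices)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # 4. Select the top components (eigh returns ascending order, so reverse)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    # 5. Transform the data into the new feature space
    return X_std @ components
```

Note that the sign of each principal component is arbitrary, so the output can differ from scikit-learn's by a per-component sign flip.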
### Example in Python
```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np

# Example data
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0],
              [2.3, 2.7],
              [2.0, 1.6],
              [1.0, 1.1],
              [1.5, 1.6],
              [1.1, 0.9]])

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Apply PCA (n_components=2 keeps both components here so the full variance
# split is visible; pick fewer than the original feature count to actually reduce)
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print("Explained Variance Ratio:", pca.explained_variance_ratio_)
print("Transformed Data:\n", X_pca)
```
### Visualizing PCA Results
```python
import matplotlib.pyplot as plt

plt.scatter(X_pca[:, 0], X_pca[:, 1])
plt.title("PCA - Dimensionality Reduction")
plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.show()
```
Use Cases:
- Visualizing high-dimensional data
- Reducing noise in datasets
- Preprocessing for clustering or classification (see the sketch below)
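As a sketch of that last use case, a scikit-learn pipeline can chain standardization, PCA, and K-Means. Note that `PCA` also accepts a float `n_components`, meaning "keep enough components to explain that fraction of the variance"; the synthetic data and parameter choices here are illustrative:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Synthetic stand-in for high-dimensional data: 100 samples, 20 features
rng = np.random.default_rng(0)
X_high = rng.normal(size=(100, 20))

# Standardize -> keep components explaining 90% of the variance -> cluster
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.90),
    KMeans(n_clusters=3, n_init=10, random_state=42),
)
labels = pipeline.fit_predict(X_high)
print("Cluster sizes:", np.bincount(labels))
```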
## 🧠 Summary
| Concept | Description | Use Case |
|---|---|---|
| Unsupervised Learning | Finds hidden patterns without labels | Customer segmentation, anomaly detection |
| K-Means Clustering | Groups similar data points into K clusters | Market segmentation, pattern discovery |
| PCA (Dimensionality Reduction) | Reduces the number of features while preserving most of the variance | Visualization, noise reduction |
Unsupervised learning helps uncover the hidden structure in data, making it easier to explore and interpret.
It's especially useful when you don't have labeled datasets but still want to understand relationships within your data.