1. Scikit-learn Cheat Sheet
- 1. Scikit-learn Cheat Sheet
- 1.1 Quick Reference
- 1.2 Machine Learning Workflow
- 1.3 Getting Started
- 1.4 Data Preprocessing
- 1.5 Model Selection and Training
- 1.6 Model Evaluation
- 1.7 Hyperparameter Tuning
- 1.8 Pipelines
- 1.9 Ensemble Methods
- 1.10 Dimensionality Reduction
- 1.11 Model Inspection
- 1.12 Calibration
- 1.13 Dummy Estimators
- 1.14 Multi-label Classification
- 1.15 Multi-class and Multi-label Classification
- 1.16 Outlier Detection
- 1.17 Semi-Supervised Learning
- 1.18 Common Use Cases
- 1.19 Complete End-to-End Example
- 1.20 Tips and Best Practices
This cheat sheet provides an exhaustive overview of the Scikit-learn (sklearn) machine learning library, covering essential concepts, code examples, and best practices for efficient model building, training, evaluation, and deployment. It aims to be a one-stop reference for common tasks.
1.1 Quick Reference
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β COMMON TASKS QUICK REFERENCE β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Task Module / Function
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
DATA PREPARATION
Load dataset datasets.load_iris()
Split data train_test_split()
Scale features StandardScaler()
Handle missing values SimpleImputer()
Encode categories OneHotEncoder()
MODEL TRAINING
Linear classification LogisticRegression()
Non-linear classification RandomForestClassifier()
Regression LinearRegression()
Clustering KMeans()
Dimensionality reduction PCA()
MODEL EVALUATION
Classification metrics accuracy_score(), f1_score()
Regression metrics r2_score(), mean_squared_error()
Cross-validation cross_val_score()
Confusion matrix confusion_matrix()
HYPERPARAMETER TUNING
Grid search GridSearchCV()
Random search RandomizedSearchCV()
PIPELINES & WORKFLOWS
Create pipeline Pipeline()
Ensemble methods VotingClassifier()
MODEL PERSISTENCE
Save model joblib.dump()
Load model joblib.load()
1.2 Machine Learning Workflow
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β ML WORKFLOW WITH SKLEARN β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
ββββββββββββββββββββ
β Load Data β
β (datasets.load) β
ββββββββββ¬ββββββββββ
β
ββββββββββββββββββββ
β Explore & Clean β
β (Imputation) β
ββββββββββ¬ββββββββββ
β
ββββββββββββββββββββ
β Preprocess Data β
β (Scale/Encode) β
ββββββββββ¬ββββββββββ
β
ββββββββββββββββββββ
β Split Dataset β
β (train_test_split)β
ββββββββββ¬ββββββββββ
β
ββββββββββββββββββββ
β Select Model β
β (Choose Algo) β
ββββββββββ¬ββββββββββ
β
ββββββββββββββββββββ
β Train Model β
β (model.fit) β
ββββββββββ¬ββββββββββ
β
ββββββββββββββββββββ
β Evaluate Model β
β (metrics) β
ββββββββββ¬ββββββββββ
β
ββββββ Poor Performance?
β β
β ββββββββββββββββββββ
β β Tune Hyperparams β
β β (GridSearchCV) β
β ββββββββββ¬ββββββββββ
β β
β ββββββββββββ [Back to Train]
β
ββββββββββββββββββββ
β Deploy Model β
β (joblib.dump) β
ββββββββββββββββββββ
1.3 Getting Started
1.3.1 Installation
pip install scikit-learn
1.3.2 Importing Scikit-learn
import sklearn
from sklearn import datasets # For built-in datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
1.4 Data Preprocessing
1.4.1 Loading Data
1.4.1.1 Built-in Datasets
from sklearn import datasets
# Classification dataset: Iris (150 samples, 4 features, 3 classes)
iris = datasets.load_iris()
X = iris.data
y = iris.target
# Regression dataset: California Housing (20,640 samples, 8 features)
california_housing = datasets.fetch_california_housing()
X = california_housing.data
y = california_housing.target
# Image dataset: Handwritten digits (1,797 samples, 64 features, 10 classes)
digits = datasets.load_digits()
X = digits.data
y = digits.target
1.4.1.2 From Pandas DataFrame
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv("your_data.csv")
X = df.drop("target_column", axis=1)
y = df["target_column"]
1.4.2 Splitting Data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # 80% training, 20% testing
1.4.3 Feature Scaling
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β FEATURE SCALING METHODS β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
StandardScaler MinMaxScaler RobustScaler
(Mean=0, Std=1) (Range: 0-1) (Uses Median/IQR)
β β β
β β β
z = (x-ΞΌ)/Ο x' = (x-min)/(max-min) (x-median)/IQR
β β β
β β β
Best for: Normal Best for: Bounded Best for: Data
distributions ranges needed with outliers
Normalizer (L2): Scales each sample to unit norm
x' = x / ||x||β
Best for: When direction matters more than magnitude
1.4.3.1 Standardization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
1.4.3.2 Min-Max Scaling
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
1.4.3.3 Robust Scaling
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
1.4.3.4 Normalization
from sklearn.preprocessing import Normalizer
normalizer = Normalizer()
X_train_normalized = normalizer.fit_transform(X_train)
X_test_normalized = normalizer.transform(X_test)
1.4.4 Handling Missing Values
1.4.4.1 Imputation (SimpleImputer)
from sklearn.impute import SimpleImputer
import numpy as np
imputer = SimpleImputer(strategy="mean") # Replace missing values with the mean
# Other strategies: "median", "most_frequent", "constant"
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)
1.4.4.2 Imputation (KNNImputer)
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
X_train_imputed = imputer.fit_transform(X_train)
X_test_imputed = imputer.transform(X_test)
1.4.4.3 Dropping Missing Values
import pandas as pd
# Assuming X_train and X_test are pandas DataFrames
X_train_dropped = X_train.dropna()
X_test_dropped = X_test.dropna()
1.4.5 Encoding Categorical Features
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β ENCODING STRATEGIES β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Original: ['cat', 'dog', 'bird', 'cat']
OneHotEncoder OrdinalEncoder LabelEncoder
β β β
β β β
βββββββββββββ βββββββββββ βββββββ
β 1 0 0 β β 0 β β 0 β
β 0 1 0 β β 1 β β 1 β
β 0 0 1 β β 2 β β 2 β
β 1 0 0 β β 0 β β 0 β
βββββββββββββ βββββββββββ βββββββ
cat dog bird Ordered Target
relationship only
Use Case: Use Case: Use Case:
- No ordinal - Ordered - Target variable
relationship categories encoding
- Creates sparse - Tree-based - Simple integer
features models mapping
1.4.5.1 One-Hot Encoding
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
# Assuming X_train and X_test are pandas DataFrames
encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False) # sparse=False for older versions
X_train_encoded = encoder.fit_transform(X_train[['categorical_feature']])
X_test_encoded = encoder.transform(X_test[['categorical_feature']])
# Or, using pandas:
X_train_encoded = pd.get_dummies(X_train, columns=['categorical_feature'])
X_test_encoded = pd.get_dummies(X_test, columns=['categorical_feature'])
1.4.5.2 Ordinal Encoding
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder()
X_train_encoded = encoder.fit_transform(X_train[['ordinal_feature']])
X_test_encoded = encoder.transform(X_test[['ordinal_feature']])
1.4.5.3 Label Encoding (for target variable)
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
y_train_encoded = encoder.fit_transform(y_train)
y_test_encoded = encoder.transform(y_test)
1.4.6 Feature Engineering
1.4.6.1 Polynomial Features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
1.4.6.2 Custom Transformers
from sklearn.preprocessing import FunctionTransformer
import numpy as np
def log_transform(x):
return np.log1p(x)
log_transformer = FunctionTransformer(log_transform)
X_train_log = log_transformer.transform(X_train)
X_test_log = log_transformer.transform(X_test)
1.4.7 Feature Selection
1.4.7.1 VarianceThreshold
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.1) # Remove features with variance below 0.1
X_train_selected = selector.fit_transform(X_train)
X_test_selected = selector.transform(X_test)
1.4.7.2 SelectKBest
from sklearn.feature_selection import SelectKBest, f_classif
selector = SelectKBest(score_func=f_classif, k=5) # Select top 5 features
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)
1.4.7.3 SelectFromModel
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
estimator = LogisticRegression(penalty="l1", solver='liblinear')
selector = SelectFromModel(estimator)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)
1.4.7.4 RFE (Recursive Feature Elimination)
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
estimator = LogisticRegression()
selector = RFE(estimator, n_features_to_select=5)
X_train_selected = selector.fit_transform(X_train, y_train)
X_test_selected = selector.transform(X_test)
1.5 Model Selection and Training
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β MODEL SELECTION GUIDE β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Problem Type?
β
βββββββββββββββββΌββββββββββββββββ
β β
Classification Regression
β β
β β
Data Size? Data Size?
β β
ββββββ΄βββββ ββββββ΄βββββ
β β β β
Small Large Small Large
β β β β
β β β β
Linear? Linear? Linear? Linear?
β β β β
βββββ΄ββββ ββββ΄ββββ βββββ΄ββββ ββββ΄ββββ
β β β β β β β β
Yes No Yes No Yes No Yes No
β β β β β β β β
β β β β β β β β
CLASSIFICATION MODELS REGRESSION MODELS
βββββββββββββββββββββ ββββββββββββββββββ
Small + Linear: Small + Linear:
- Logistic Regression - Linear Regression
- Linear SVM - Ridge/Lasso
- SVR (linear)
Small + Non-linear:
- SVM (RBF kernel) Small + Non-linear:
- Decision Tree - SVR (RBF kernel)
- Random Forest - Decision Tree
- Random Forest
Large + Linear:
- SGD Classifier Large + Linear:
- Logistic Regression - SGD Regressor
- Linear Regression
Large + Non-linear:
- Random Forest Large + Non-linear:
- Gradient Boosting - Random Forest
- Neural Networks - Gradient Boosting
- Neural Networks
Special Cases:
- Multi-class: OneVsRest, OneVsOne
- Imbalanced: Use class_weight='balanced'
- High dimensions: L1/L2 regularization
- Interpretability needed: Linear models, Decision Trees
1.5.1 Linear Regression
from sklearn.linear_model import LinearRegression
# Basic linear regression
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Access coefficients and intercept
print(f"Coefficients: {model.coef_}")
print(f"Intercept: {model.intercept_}")
# Make predictions
new_data = [[1.5, 2.0]]
prediction = model.predict(new_data)
1.5.2 Logistic Regression
from sklearn.linear_model import LogisticRegression
# Binary classification
model = LogisticRegression(solver='liblinear', random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# Get probability estimates
y_pred_proba = model.predict_proba(X_test) # Returns probabilities for each class
print(f"Probability estimates: {y_pred_proba[:5]}")
# Multi-class classification (One-vs-Rest by default)
model_multiclass = LogisticRegression(multi_class='ovr', max_iter=200, random_state=42)
model_multiclass.fit(X_train, y_train)
# Regularization: Control overfitting with C parameter (smaller C = stronger regularization)
model_regularized = LogisticRegression(C=0.1, penalty='l2', random_state=42, max_iter=200)
model_regularized.fit(X_train, y_train)
1.5.3 Support Vector Machines (SVM)
from sklearn.svm import SVC, SVR
# For classification
model = SVC(kernel='linear', C=1.0)
model.fit(X_train, y_train)
# For regression
model = SVR(kernel='linear', C=1.0)
model.fit(X_train, y_train)
1.5.4 Decision Trees
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
# For classification
model = DecisionTreeClassifier(max_depth=5)
model.fit(X_train, y_train)
# For regression
model = DecisionTreeRegressor(max_depth=5)
model.fit(X_train, y_train)
1.5.5 Random Forest
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
# For classification
model = RandomForestClassifier(n_estimators=100, max_depth=5)
model.fit(X_train, y_train)
# For regression
model = RandomForestRegressor(n_estimators=100, max_depth=5)
model.fit(X_train, y_train)
1.5.6 Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
# For classification
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
# For regression
model = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)
model.fit(X_train, y_train)
1.5.7 K-Nearest Neighbors (KNN)
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
# For classification
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
# For regression
model = KNeighborsRegressor(n_neighbors=5)
model.fit(X_train, y_train)
1.5.8 Naive Bayes
from sklearn.naive_bayes import GaussianNB
model = GaussianNB()
model.fit(X_train, y_train)
1.5.9 Clustering (K-Means)
from sklearn.cluster import KMeans
model = KMeans(n_clusters=3, random_state=42, n_init = 'auto') # Added n_init
model.fit(X_train)
labels = model.predict(X_test)
1.5.10 Principal Component Analysis (PCA)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
1.5.11 Model Persistence
import joblib
# Save the model
joblib.dump(model, 'my_model.pkl')
# Load the model
loaded_model = joblib.load('my_model.pkl')
1.6 Model Evaluation
1.6.1 Regression Metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
1.6.2 Classification Metrics
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β CONFUSION MATRIX & METRICS β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Predicted
Positive Negative
ββββββββββββ¬βββββββββββ
Actual Pos β TP β FN β
ββββββββββββΌβββββββββββ€
Neg β FP β TN β
ββββββββββββ΄βββββββββββ
Metrics:
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Accuracy = (TP + TN) / (TP + TN + FP + FN)
β Overall correctness
Precision = TP / (TP + FP)
β Of predicted positives, how many are correct?
β Important when FP is costly
Recall = TP / (TP + FN)
β Of actual positives, how many did we catch?
β Important when FN is costly
F1 Score = 2 Γ (Precision Γ Recall) / (Precision + Recall)
β Harmonic mean of Precision and Recall
β Balanced metric for imbalanced data
Example Use Cases:
- Spam Detection: High Precision (avoid blocking real emails)
- Disease Screening: High Recall (catch all potential cases)
- Balanced: F1 Score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
import numpy as np
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average='binary') # 'macro', 'micro', 'weighted' for multi-class
recall = recall_score(y_test, y_pred, average='binary')
f1 = f1_score(y_test, y_pred, average='binary')
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
# Detailed report
report = classification_report(y_test, y_pred)
print("\nClassification Report:")
print(report)
# For multi-class classification
# Use average='weighted' to account for class imbalance
precision_multi = precision_score(y_test, y_pred, average='weighted')
1.6.3 ROC Curve and AUC
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.linear_model import LogisticRegression
import matplotlib.pyplot as plt
# Train model and get probability predictions
model = LogisticRegression(random_state=42, max_iter=200)
model.fit(X_train, y_train)
# For binary classification - get probabilities for positive class
y_pred_proba = model.predict_proba(X_test)[:, 1]
# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
auc = roc_auc_score(y_test, y_pred_proba)
# Plot ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'Model (AUC = {auc:.3f})', linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier (AUC = 0.5)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.show()
# For multi-class classification
from sklearn.metrics import roc_auc_score
y_pred_proba_multi = model.predict_proba(X_test)
auc_multi = roc_auc_score(y_test, y_pred_proba_multi, multi_class='ovr')
print(f"Multi-class AUC: {auc_multi:.3f}")
1.6.4 Cross-Validation
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β K-FOLD CROSS-VALIDATION (K=5) β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Full Dataset: [ββββββββββββββββββββββββββββββββββββββββ]
Fold 1: [TEST][TRAIN][TRAIN][TRAIN][TRAIN] β Scoreβ
Fold 2: [TRAIN][TEST][TRAIN][TRAIN][TRAIN] β Scoreβ
Fold 3: [TRAIN][TRAIN][TEST][TRAIN][TRAIN] β Scoreβ
Fold 4: [TRAIN][TRAIN][TRAIN][TEST][TRAIN] β Scoreβ
Fold 5: [TRAIN][TRAIN][TRAIN][TRAIN][TEST] β Scoreβ
β
Final Score = Mean(Scoreβ, Scoreβ, ..., Scoreβ
)
Benefits:
- Every sample used for both training and testing
- More reliable performance estimate
- Reduces variance in model evaluation
from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
# K-Fold Cross-Validation
cv_scores = cross_val_score(model, X, y, cv=5) # 5-fold cross-validation
# Stratified K-Fold (for classification - preserves class distribution)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(model, X, y, cv=cv)
print(f"Cross-validation scores: {cv_scores}")
print(f"Mean cross-validation score: {cv_scores.mean():.2f}")
print(f"Std deviation: {cv_scores.std():.2f}")
1.6.5 Learning Curves
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β BIAS-VARIANCE TRADEOFF & LEARNING CURVES β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
UNDERFITTING (High Bias) GOOD FIT OVERFITTING (High Variance)
ββββββββββββββββββββββββββ ββββββββββ ββββββββββββββββββββββββββ
Score Score Score
β β β
1.0β 1.0β 1.0β ββTrain
β βββββββββββ Train β ββTrain β β±
β β β β± β β±
0.5β β 0.5β 0.5β β
β βββββββββββ Valid β ββValid β βββββValid
β β β
0.0βββββββββββββββββ 0.0βββββββββββ 0.0βββββββββββ
Training Size Training Size Training Size
Symptoms: Symptoms: Symptoms:
- Low train score - High train score - Very high train score
- Low valid score - High valid score - Much lower valid score
- Similar scores - Similar scores - Large gap
- Model too simple - Right complexity - Model too complex
Solutions: Keep it! Solutions:
- More features - More training data
- More complex model - Reduce features
- Reduce regularization - Increase regularization
- Simpler model
- Early stopping
from sklearn.model_selection import learning_curve
import matplotlib.pyplot as plt
import numpy as np
# Generate learning curves
train_sizes, train_scores, test_scores = learning_curve(
model, X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 10),
scoring='accuracy', n_jobs=-1
)
# Calculate mean and standard deviation
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
# Plot learning curves
plt.figure(figsize=(10, 6))
plt.plot(train_sizes, train_mean, label='Training score', marker='o')
plt.plot(train_sizes, test_mean, label='Cross-validation score', marker='s')
plt.fill_between(train_sizes, train_mean - train_std, train_mean + train_std, alpha=0.15)
plt.fill_between(train_sizes, test_mean - test_std, test_mean + test_std, alpha=0.15)
plt.xlabel('Training examples')
plt.ylabel('Score')
plt.legend(loc='best')
plt.title('Learning Curve')
plt.grid(True)
plt.show()
# Diagnose overfitting/underfitting
gap = train_mean[-1] - test_mean[-1]
if gap > 0.1:
print("β Model may be overfitting (large gap between train and test)")
elif test_mean[-1] < 0.6:
print("β Model may be underfitting (low scores on both sets)")
else:
print("β Model appears to be fitting well")
1.6.6 Validation Curves
from sklearn.model_selection import validation_curve
import numpy as np
param_range = np.logspace(-6, -1, 5)
train_scores, test_scores = validation_curve(
model, X, y, param_name="gamma", param_range=param_range,
cv=5, scoring="accuracy")
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)
plt.plot(param_range, train_mean, label='Training score')
plt.plot(param_range, test_mean, label='Cross-validation score')
plt.fill_between(param_range, train_mean - train_std, train_mean + train_std, alpha=0.1)
plt.fill_between(param_range, test_mean - test_std, test_mean + test_std, alpha=0.1)
plt.xscale('log')
plt.xlabel('Parameter Value')
plt.ylabel('Score')
plt.legend()
plt.title('Validation Curve')
plt.show()
1.7 Hyperparameter Tuning
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β HYPERPARAMETER TUNING STRATEGIES β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
GridSearchCV (Exhaustive Search)
ββββββββββββββββββββββββββββββββ
Parameter Grid:
C: [0.1, 1, 10]
kernel: ['linear', 'rbf']
gamma: [0.1, 1]
β
Tests ALL combinations: 3 Γ 2 Γ 2 = 12 models
β
βββ Model(C=0.1, kernel='linear', gamma=0.1)
βββ Model(C=0.1, kernel='linear', gamma=1)
βββ Model(C=0.1, kernel='rbf', gamma=0.1)
βββ ... (9 more)
β
Cross-Validate each
β
Select Best Params
Best for: Small parameter space
RandomizedSearchCV (Random Sampling)
ββββββββββββββββββββββββββββββββββββ
Parameter Distributions:
C: uniform(0.1, 10)
kernel: ['linear', 'rbf']
gamma: log-uniform(0.001, 1)
β
Sample N random combinations (e.g., n_iter=20)
β
βββ Model(C=3.2, kernel='rbf', gamma=0.034)
βββ Model(C=7.1, kernel='linear', gamma=0.421)
βββ ... (18 more)
β
Cross-Validate each
β
Select Best Params
Best for: Large parameter space, continuous parameters
Comparison:
ββββββββββββββββββββββββββββββββββββ
GridSearchCV RandomizedSearchCV
- Exhaustive - Sampling-based
- Guaranteed best - May miss optimal
- Slow for large - Faster
parameter space
- Good for discrete - Good for continuous
parameters parameters
1.7.1 GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
# Simple grid search
param_grid = {
'C': [0.1, 1, 10],
'kernel': ['linear', 'rbf'],
'gamma': [0.1, 1, 'scale', 'auto']
}
grid_search = GridSearchCV(
SVC(random_state=42),
param_grid,
cv=5,
scoring='accuracy',
verbose=1,
n_jobs=-1 # Use all CPU cores
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.2f}")
best_model = grid_search.best_estimator_
# Evaluate on test set
test_score = best_model.score(X_test, y_test)
print(f"Test set score: {test_score:.2f}")
# Grid search with Pipeline (RECOMMENDED)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', SVC(random_state=42))
])
# Use double underscore to access pipeline step parameters
param_grid_pipeline = {
'classifier__C': [0.1, 1, 10],
'classifier__kernel': ['linear', 'rbf'],
'classifier__gamma': ['scale', 'auto']
}
grid_search_pipeline = GridSearchCV(
pipeline,
param_grid_pipeline,
cv=5,
scoring='accuracy',
n_jobs=-1
)
grid_search_pipeline.fit(X_train, y_train)
print(f"\nPipeline best parameters: {grid_search_pipeline.best_params_}")
print(f"Pipeline best score: {grid_search_pipeline.best_score_:.2f}")
1.7.2 RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint
param_dist = {
'n_estimators': randint(10, 200),
'max_depth': [3, 5, 10, None],
'min_samples_split': randint(2, 11),
'min_samples_leaf': randint(1, 11),
'bootstrap': [True, False]
}
random_search = RandomizedSearchCV(RandomForestClassifier(), param_distributions=param_dist,
n_iter=20, cv=5, scoring='accuracy', random_state=42, verbose=2)
random_search.fit(X_train, y_train)
print(f"Best parameters: {random_search.best_params_}")
print(f"Best cross-validation score: {random_search.best_score_:.2f}")
best_model = random_search.best_estimator_
1.8 Pipelines
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β PIPELINE FLOW β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Raw Data (X_train)
β
β
ββββββββββββββββββββ
β StandardScaler β Step 1: Transform
β (fit_transform) β
ββββββββββ¬ββββββββββ
β
Scaled Data
β
β
ββββββββββββββββββββ
β Feature Select β Step 2: Transform
β (fit_transform) β
ββββββββββ¬ββββββββββ
β
Selected Features
β
β
ββββββββββββββββββββ
β Classifier β Step 3: Fit
β (fit) β
ββββββββββ¬ββββββββββ
β
Trained Model
Benefits:
- Prevents data leakage (fit only on training data)
- Simplifies workflow
- Easy hyperparameter tuning with GridSearchCV
- Ensures consistent preprocessing
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.svm import SVC
# Create pipeline with multiple steps
pipeline = Pipeline([
('scaler', StandardScaler()),
('feature_selection', SelectKBest(f_classif, k=10)),
('classifier', SVC(kernel='rbf'))
])
# Fit and predict (all steps executed automatically)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
# Access individual steps
scaler = pipeline.named_steps['scaler']
classifier = pipeline.named_steps['classifier']
1.9 Ensemble Methods
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β ENSEMBLE LEARNING STRATEGIES β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
BAGGING (Bootstrap Aggregating)
ββββββββββββββββββββββββββββββ
Training Data β [Random Samples] β Parallel Training
β
βββββββββββββββΌββββββββββββββ
β β β
Model 1 Model 2 Model 3
β β β
βββββββββββββββΌββββββββββββββ
β
Vote / Average
β
Final Prediction
Examples: Random Forest, Bagging Classifier
BOOSTING (Sequential Learning)
ββββββββββββββββββββββββββββββ
Training Data β Model 1 β Reweight β Model 2 β Reweight β Model 3
β β β β β
ββββββββ Focus on ββββ β β
Errors β β
ββββββ Weighted
Combination
Examples: AdaBoost, Gradient Boosting
STACKING (Meta-Learning)
ββββββββββββββββββββββββββββββ
Training Data
β
βββββ Model 1 (SVM) β Prediction 1
βββββ Model 2 (Tree) β Prediction 2 } Level 0
βββββ Model 3 (KNN) β Prediction 3
β
β
[Predictions as Features]
β
β
Meta-Model (LogReg) } Level 1
β
β
Final Prediction
VOTING
ββββββββββββββββββββββββββββββ
Hard Voting: Majority class wins
Soft Voting: Average predicted probabilities
1.9.1 Bagging
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
base_estimator = DecisionTreeClassifier(max_depth=5)
bagging = BaggingClassifier(base_estimator=base_estimator, n_estimators=10, random_state=42)
bagging.fit(X_train, y_train)
1.9.2 Boosting (AdaBoost)
from sklearn.ensemble import AdaBoostClassifier
adaboost = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
adaboost.fit(X_train, y_train)
1.9.3 Stacking
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
estimators = [
('svm', SVC(kernel='linear', C=1.0)),
('dt', DecisionTreeClassifier(max_depth=5))
]
final_estimator = LogisticRegression()
stacking = StackingClassifier(estimators=estimators, final_estimator=final_estimator)
stacking.fit(X_train, y_train)
1.9.4 Voting Classifier
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
estimator1 = LogisticRegression(solver='liblinear')
estimator2 = SVC(kernel='linear', C=1.0, probability=True) # probability=True for soft voting
voting = VotingClassifier(estimators=[('lr', estimator1), ('svc', estimator2)], voting='soft') # 'hard' for majority voting
voting.fit(X_train, y_train)
1.10 Dimensionality Reduction
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β DIMENSIONALITY REDUCTION TECHNIQUES β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
High-Dimensional Data
β
β
ββββββββββββββββββΌβββββββββββββββββ
β β
β β
LINEAR NON-LINEAR
β β
βββββββ΄ββββββ βββββββ΄ββββββ
β β β β
PCA LDA t-SNE NMF
β β β β
β β β β
PCA (Principal Component Analysis)
ββββββββββββββββββββββββββββββββββββ
- Unsupervised
- Finds directions of maximum variance
- Orthogonal components
- Preserves global structure
Use: General dimensionality reduction, visualization
LDA (Linear Discriminant Analysis)
βββββββββββββββββββββββββββββββββββ
- Supervised (requires labels)
- Maximizes class separability
- Linear transformation
Use: Classification preprocessing, feature extraction
t-SNE (t-Distributed Stochastic Neighbor Embedding)
βββββββββββββββββββββββββββββββββββββββββββββββββ
- Non-linear
- Preserves local structure
- Computationally expensive
- Stochastic (different runs β different results)
Use: Visualization (2D/3D), cluster analysis
NMF (Non-negative Matrix Factorization)
ββββββββββββββββββββββββββββββββββββββ
- Non-negative data only
- Parts-based representation
- Interpretable components
Use: Topic modeling, image analysis, recommender systems
Comparison:
ββββββββββββββββββββββββββββββββββββββββββββββββ
Technique Supervised Linear Preserves Speed
PCA No Yes Global Fast
LDA Yes Yes Separability Fast
t-SNE No No Local Slow
NMF No No Parts Medium
1.10.1 PCA
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import numpy as np
# Fit PCA
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.transform(X_test)
# Explained variance
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")
print(f"Total variance explained: {pca.explained_variance_ratio_.sum():.2%}")
# Determine optimal number of components
pca_full = PCA()
pca_full.fit(X_train)
cumsum = np.cumsum(pca_full.explained_variance_ratio_)
# Plot explained variance
plt.figure(figsize=(10, 6))
plt.plot(cumsum, marker='o')
plt.axhline(y=0.95, color='r', linestyle='--', label='95% variance')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('PCA Analysis')
plt.legend()
plt.grid(True)
plt.show()
# Choose components to retain 95% variance
n_components_95 = np.argmax(cumsum >= 0.95) + 1
print(f"Components needed for 95% variance: {n_components_95}")
1.10.2 Linear Discriminant Analysis (LDA)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=2)
X_train_lda = lda.fit_transform(X_train, y_train) # Supervised, needs y_train
X_test_lda = lda.transform(X_test)
1.10.3 t-distributed Stochastic Neighbor Embedding (t-SNE)
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=42)
X_train_tsne = tsne.fit_transform(X_train) # Usually only fit_transform
1.10.4 Non-negative Matrix Factorization (NMF)
from sklearn.decomposition import NMF
nmf = NMF(n_components=2, random_state=42)
X_train_nmf = nmf.fit_transform(X_train)
X_test_nmf = nmf.transform(X_test)
1.11 Model Inspection
1.11.1 Feature Importances
# For tree-based models (RandomForest, GradientBoosting)
importances = model.feature_importances_
print(importances)
# For linear models (LogisticRegression, LinearRegression)
coefficients = model.coef_
print(coefficients)
1.11.2 Partial Dependence Plots
from sklearn.inspection import plot_partial_dependence
plot_partial_dependence(model, X_train, features=[0, 1]) # Plot for features 0 and 1
1.11.3 Permutation Importance
from sklearn.inspection import permutation_importance
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=42)
print(result.importances_mean)
1.12 Calibration
from sklearn.calibration import CalibratedClassifierCV
calibrated_model = CalibratedClassifierCV(model, method='isotonic', cv=5) # 'sigmoid' is another method
calibrated_model.fit(X_train, y_train)
1.13 Dummy Estimators
from sklearn.dummy import DummyClassifier, DummyRegressor
# For classification
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train)
# For regression
dummy_reg = DummyRegressor(strategy="mean")
dummy_reg.fit(X_train, y_train)
1.14 Multi-label Classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
multilabel_model = MultiOutputClassifier(RandomForestClassifier())
multilabel_model.fit(X_train, y_train) # y_train is a 2D array of shape (n_samples, n_labels)
1.15 Multi-class and Multi-label Classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
ovr_model = OneVsRestClassifier(SVC(kernel='linear'))
ovr_model.fit(X_train, y_train)
1.16 Outlier Detection
from sklearn.ensemble import IsolationForest
outlier_detector = IsolationForest(random_state=42)
outlier_detector.fit(X_train)
outliers = outlier_detector.predict(X_test) # 1 for inliers, -1 for outliers
1.17 Semi-Supervised Learning
from sklearn.semi_supervised import LabelPropagation
label_prop_model = LabelPropagation()
label_prop_model.fit(X_train, y_train) # y_train can contain -1 for unlabeled samples
1.18 Common Use Cases
1.18.1 Handling Imbalanced Data
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.utils.class_weight import compute_class_weight
import numpy as np
# Method 1: Use class_weight parameter
model = LogisticRegression(class_weight='balanced', random_state=42, max_iter=200)
model.fit(X_train, y_train)
# Method 2: Compute custom class weights
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
class_weight_dict = dict(enumerate(class_weights))
model = LogisticRegression(class_weight=class_weight_dict, random_state=42, max_iter=200)
model.fit(X_train, y_train)
# Method 3: Resampling (requires imbalanced-learn library)
# from imblearn.over_sampling import SMOTE
# smote = SMOTE(random_state=42)
# X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
1.18.2 Feature Importance Analysis
from sklearn.ensemble import RandomForestClassifier
import numpy as np
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
# Get feature importances
importances = model.feature_importances_
indices = np.argsort(importances)[::-1]
# Print feature ranking
print("Feature ranking:")
for i, idx in enumerate(indices):
print(f"{i+1}. Feature {idx}: {importances[idx]:.4f}")
# Visualize feature importances
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
plt.bar(range(len(importances)), importances[indices])
plt.xlabel('Feature Index')
plt.ylabel('Importance')
plt.title('Feature Importances')
plt.show()
1.18.3 Time Series Split for Sequential Data
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import LinearRegression
import numpy as np
# Create time series cross-validator
tscv = TimeSeriesSplit(n_splits=5)
# Perform cross-validation
scores = []
for train_idx, test_idx in tscv.split(X):
X_train_ts, X_test_ts = X[train_idx], X[test_idx]
y_train_ts, y_test_ts = y[train_idx], y[test_idx]
model = LinearRegression()
model.fit(X_train_ts, y_train_ts)
score = model.score(X_test_ts, y_test_ts)
scores.append(score)
print(f"Time series CV scores: {scores}")
print(f"Mean score: {np.mean(scores):.3f}")
1.18.4 Custom Transformer for Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
import numpy as np
# Create custom transformer
class LogTransformer(BaseEstimator, TransformerMixin):
def __init__(self, features=None):
self.features = features
def fit(self, X, y=None):
return self
def transform(self, X):
X_copy = X.copy()
if self.features is None:
return np.log1p(X_copy)
else:
X_copy[:, self.features] = np.log1p(X_copy[:, self.features])
return X_copy
# Use in pipeline
pipeline = Pipeline([
('log_transform', LogTransformer(features=[0, 1])),
('scaler', StandardScaler()),
('classifier', LogisticRegression(random_state=42, max_iter=200))
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
1.18.5 Model Comparison
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
import numpy as np
# Define models to compare
models = {
'Logistic Regression': LogisticRegression(random_state=42, max_iter=200),
'Decision Tree': DecisionTreeClassifier(random_state=42),
'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
'SVM': SVC(random_state=42)
}
# Compare models using cross-validation
results = {}
for name, model in models.items():
scores = cross_val_score(model, X_train, y_train, cv=5, scoring='accuracy')
results[name] = {
'mean': scores.mean(),
'std': scores.std(),
'scores': scores
}
print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
# Select best model
best_model_name = max(results, key=lambda x: results[x]['mean'])
print(f"\nBest model: {best_model_name}")
1.18.6 Saving Multiple Models
import joblib
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
# Train and save multiple models
models = {
'random_forest': RandomForestClassifier(n_estimators=100, random_state=42),
'gradient_boosting': GradientBoostingClassifier(n_estimators=100, random_state=42)
}
for name, model in models.items():
model.fit(X_train, y_train)
joblib.dump(model, f'{name}_model.pkl')
print(f"Saved {name} model")
# Load and use models
loaded_rf = joblib.load('random_forest_model.pkl')
loaded_gb = joblib.load('gradient_boosting_model.pkl')
# Ensemble predictions
rf_pred = loaded_rf.predict_proba(X_test)
gb_pred = loaded_gb.predict_proba(X_test)
ensemble_pred = (rf_pred + gb_pred) / 2
final_pred = ensemble_pred.argmax(axis=1)
1.19 Complete End-to-End Example
# Complete ML pipeline: Classification on Iris dataset
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import joblib
# βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
# 1. LOAD DATA
# βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
iris = datasets.load_iris()
X, y = iris.data, iris.target
print(f"Dataset shape: {X.shape}")
print(f"Classes: {np.unique(y)}")
# βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
# 2. SPLIT DATA
# βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set: {X_train.shape}, Test set: {X_test.shape}")
# βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
# 3. CREATE PIPELINE
# βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', SVC(random_state=42))
])
# βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
# 4. HYPERPARAMETER TUNING
# βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
param_grid = {
'classifier__C': [0.1, 1, 10],
'classifier__kernel': ['linear', 'rbf'],
'classifier__gamma': ['scale', 'auto']
}
grid_search = GridSearchCV(
pipeline, param_grid, cv=5, scoring='accuracy', verbose=1, n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.3f}")
# βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
# 5. EVALUATE ON TEST SET
# βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
print(f"\nTest set accuracy: {accuracy_score(y_test, y_pred):.3f}")
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))
# βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
# 6. CROSS-VALIDATION ON FULL DATASET
# βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
cv_scores = cross_val_score(best_model, X, y, cv=5)
print(f"\nCross-validation scores: {cv_scores}")
print(f"Mean CV score: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
# βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
# 7. SAVE MODEL
# βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
joblib.dump(best_model, 'iris_classifier.pkl')
print("\nModel saved as 'iris_classifier.pkl'")
# βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
# 8. LOAD AND USE MODEL
# βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
loaded_model = joblib.load('iris_classifier.pkl')
new_sample = [[5.1, 3.5, 1.4, 0.2]] # Example: likely setosa
prediction = loaded_model.predict(new_sample)
print(f"\nPrediction for {new_sample}: {iris.target_names[prediction[0]]}")
1.20 Tips and Best Practices
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β BEST PRACTICES CHECKLIST β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
DATA PREPARATION
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Always split data BEFORE preprocessing
β Use stratify=y for imbalanced classification
β Handle missing values appropriately
β Scale features (especially for SVM, KNN, Linear models)
β Encode categorical variables correctly
β Check for data leakage
MODEL TRAINING
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Start with simple baseline (DummyClassifier/Regressor)
β Use pipelines to prevent data leakage
β Apply cross-validation for reliable estimates
β Tune hyperparameters systematically
β Use appropriate random_state for reproducibility
β Consider class imbalance (class_weight='balanced')
MODEL EVALUATION
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Choose metrics appropriate for the task
β Never evaluate on training data
β Use confusion matrix for classification
β Check learning curves for overfitting/underfitting
β Compare with baseline model
β Consider computational cost vs. performance
COMMON PITFALLS
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Fitting scaler on entire dataset β Use pipeline
β Not using cross-validation β Unreliable estimates
β Ignoring class imbalance β Biased model
β Over-tuning on test set β Use validation set
β Not setting random_state β Non-reproducible results
β Forgetting to scale for distance-based models
1.20.1 Key Guidelines
- Data Preprocessing: Always preprocess your data (scaling, encoding, handling missing values) before training a model.
- Cross-Validation: Use cross-validation to get a reliable estimate of your model's performance.
- Hyperparameter Tuning: Use
GridSearchCVorRandomizedSearchCVto find the best hyperparameters for your model. - Pipelines: Use pipelines to streamline your workflow and prevent data leakage.
- Model Persistence: Save your trained models using
jobliborpickle. - Feature Importance: Use feature importance techniques to understand which features are most important for your model.
- Regularization: Use regularization techniques (L1, L2) to prevent overfitting.
- Ensemble Methods: Combine multiple models to improve performance.
- Choose the Right Model: Select a model that is appropriate for your data and task (see Model Selection Guide).
- Evaluate Your Model: Use appropriate evaluation metrics for your task.
- Understand Your Data: Spend time exploring and understanding your data before building a model.
- Start Simple: Begin with a simple model and gradually increase complexity.
- Iterate: Machine learning is an iterative process. Experiment with different models, features, and hyperparameters.
- Document Your Work: Keep track of your experiments and results.
- Use Version Control: Use Git to track changes to your code.
- Use Virtual Environments: Isolate project dependencies.
- Read the Documentation: The Scikit-learn documentation is excellent.