Classical Machine Learning
Overview of Classical Machine Learning
Classical machine learning serves as the foundation for data-driven discovery in the GWData-Bootcamp. Before moving to complex deep learning architectures, we focus on supervised learning workflows using Scikit-Learn. These methods are essential for baseline comparisons, feature engineering, and understanding the statistical properties of gravitational wave data and auxiliary datasets.
The curriculum focuses on two primary paradigms:
- Classification: Identifying the category of an observation (e.g., Signal vs. Noise, Credit Risk).
- Ensemble Methods: Combining multiple models to improve predictive performance and robustness.
Core Workflow: Supervised Learning
The bootcamp utilizes a standard Scikit-Learn pipeline for training and evaluating classical models. While the examples often use the CreditScoring dataset to teach fundamentals, the logic is directly applicable to gravitational wave event classification.
1. Data Preprocessing
High-quality input is critical. We use sklearn.preprocessing to handle feature scaling and categorical encoding.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Splitting data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Normalizing features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
2. Model Implementation
We explore a variety of estimators to compare performance across linear and non-linear boundaries.
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
# Initialize models
log_reg = LogisticRegression()
svm_clf = SVC(probability=True)  # probability=True exposes predict_proba, needed for soft voting later
dt_clf = DecisionTreeClassifier()
# Training: fit each estimator on the scaled training set
for clf in (log_reg, svm_clf, dt_clf):
    clf.fit(X_train_scaled, y_train)
Model Ensembles
A key highlight of the classical ML section is the use of Ensemble Learning. By aggregating predictions from multiple models, we can achieve higher accuracy than any single constituent model.
Bagging and Boosting
The repository includes practical guides on:
- Random Forests: Using RandomForestClassifier to reduce variance through bagging.
- Gradient Boosting: Implementing XGBoost or GradientBoostingClassifier to reduce bias.
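The two strategies can be contrasted in a short, self-contained sketch. Synthetic data from make_classification stands in for the bootcamp datasets here, and the hyperparameter values are illustrative defaults, not the course's tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data stands in for the bootcamp datasets
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Bagging: many deep trees on bootstrap samples, averaged to reduce variance
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(X_train, y_train)

# Boosting: shallow trees fit sequentially to the previous errors, reducing bias
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)
gb_clf.fit(X_train, y_train)

print(f"Random Forest accuracy:     {rf_clf.score(X_test, y_test):.3f}")
print(f"Gradient Boosting accuracy: {gb_clf.score(X_test, y_test):.3f}")
```

Both ensembles are built from decision trees; the difference is whether the trees are trained independently (bagging) or sequentially (boosting).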
Voting Classifiers
We demonstrate how to combine different algorithms (e.g., Logistic Regression, SVM, and Trees) into a single ensemble.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
# A random forest to pair with the linear and kernel models defined above
rf_clf = RandomForestClassifier()
voting_clf = VotingClassifier(
    estimators=[('lr', log_reg), ('rf', rf_clf), ('svc', svm_clf)],
    voting='soft'  # Uses predicted probabilities for the weighted average
)
voting_clf.fit(X_train_scaled, y_train)
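To see the soft-voting idea end to end, the following sketch builds a small ensemble on synthetic data and compares it against its constituent models. The dataset and estimator settings are illustrative assumptions, not the course's exact configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

estimators = [
    ("lr", LogisticRegression()),
    ("dt", DecisionTreeClassifier(random_state=7)),
    ("svc", SVC(probability=True, random_state=7)),  # probability=True required for soft voting
]
voting_clf = VotingClassifier(estimators=estimators, voting="soft")

# Fit each single model and the ensemble, then compare test accuracy
for name, clf in estimators + [("ensemble", voting_clf)]:
    clf.fit(X_train, y_train)
    print(f"{name:8s} accuracy = {clf.score(X_test, y_test):.3f}")
```

Soft voting averages the predicted class probabilities, so a confident model can outweigh two uncertain ones; hard voting (voting='hard') instead takes a majority of predicted labels.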
Case Study: Credit Scoring
The CreditScoring.ipynb (and associated HTML exports) serves as the primary hands-on project for this section. Students are tasked with building a model to predict loan default risks.
Key learning objectives include:
- Feature Selection: Identifying which attributes (income, age, debt ratio) are most predictive.
- Hyperparameter Tuning: Using GridSearchCV to find the optimal settings for ensemble models.
- Performance Metrics: Evaluating models using ROC-AUC, F1-score, and confusion matrices rather than raw accuracy alone.
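A minimal GridSearchCV sketch ties the tuning and metrics objectives together. The parameter grid and synthetic data below are assumptions for illustration; the notebook's actual grid will differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Illustrative grid; real tuning would explore more values
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # rank candidates by ROC-AUC, not raw accuracy
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)               # best combination found on the grid
print(round(grid.best_score_, 3))      # mean cross-validated ROC-AUC of that combination
```

Setting scoring="roc_auc" makes the search optimize the same metric the section asks students to report, which matters whenever classes are imbalanced (as in both default prediction and signal-vs-noise classification).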
Practical Exercises
The homework_credit_scoring_finetune_ensemble.html guide walks students through a complete optimization loop:
- Baseline: Establish the performance of a simple reference model.
- Fine-tuning: Adjust parameters like n_estimators, max_depth, and learning_rate.
- Validation: Use K-Fold Cross-Validation to ensure the model generalizes to unseen data.
- Ensemble Construction: Finalize a weighted ensemble that outperforms individual baselines.
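The validation step in the loop above can be sketched with cross_val_score. The synthetic data and 5-fold split are assumptions; the guide's own fold count and estimator may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# 5-fold cross-validation: each sample is held out exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(GradientBoostingClassifier(random_state=1), X, y, cv=cv)

# One accuracy score per fold; a small spread suggests the model generalizes
print(scores.round(3))
print(f"mean = {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean together with the fold-to-fold spread, rather than a single train/test split, gives a more honest estimate of how the tuned ensemble will behave on unseen data.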
These skills prepare students for the more advanced "Deep Learning" modules, where similar concepts of training, validation, and optimization are applied to neural networks and time-series gravitational wave data.