XGBoost also offers various hyperparameters for performance optimization. Let's explore the core ones often tuned during cross-validation and grid search.
Core Parameters
N-estimators: The number of boosting rounds. Higher values increase model capacity but can lead to overfitting, so this parameter is usually tuned together with the learning rate.
Max depth: Determines the maximum depth of each tree for better control over model complexity. Deeper trees can lead to overfitting.
Subsample: Represents the fraction of data to be randomly sampled for each boosting round, helping to prevent overfitting.
Learning rate (eta): Scales the contribution of each tree, offering both control over speed and potential for better accuracy.
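A minimal sketch of how these core parameters map onto the scikit-learn wrapper; the synthetic dataset and all values are placeholders, not recommendations.
import xgboost as xgb
from sklearn.datasets import make_regression
# Synthetic data stands in for a real dataset
X, y = make_regression(n_samples=500, n_features=10, random_state=42)
model = xgb.XGBRegressor(
    n_estimators=300,     # number of boosting rounds
    max_depth=4,          # maximum depth of each tree
    subsample=0.8,        # fraction of rows sampled per boosting round
    learning_rate=0.05    # shrinkage applied to each tree's contribution
)
model.fit(X, y)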
Regularization Parameters
Gamma (min_split_loss): Specifies the minimum loss reduction required to make a further partition.
Alpha & Lambda: Control the L1 and L2 regularization terms on leaf weights, shrinking the model and helping when features are numerous or highly correlated (see the sketch below).
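A short sketch of these regularization knobs on the scikit-learn wrapper; the values are illustrative and would normally be tuned by cross-validation.
import xgboost as xgb
reg_model = xgb.XGBRegressor(
    gamma=1.0,       # min_split_loss: minimum loss reduction required to split a node
    reg_alpha=0.1,   # L1 penalty on leaf weights (native-API name: alpha)
    reg_lambda=2.0   # L2 penalty on leaf weights (native-API name: lambda)
)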
Cross-Validation and Scoring
Objective: Defines the loss function to be optimized, such as 'reg:squarederror' for regression and 'binary:logistic' for binary classification. There are various objectives catering to different problems.
Evaluation Metric: Defines the metric to be used for cross-validation and model evaluation, such as 'rmse' for regression and 'auc' for binary classification.
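For instance, a binary classifier can pair the logistic objective with AUC as its evaluation metric. The sketch below uses synthetic data and assumes a recent XGBoost release that accepts eval_metric in the constructor.
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Synthetic binary-classification data and a validation split
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
clf = xgb.XGBClassifier(
    objective='binary:logistic',  # loss being optimized
    eval_metric='auc'             # metric reported on the evaluation set
)
clf.fit(X_train, y_train, eval_set=[(X_val, y_val)])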
Specialized Parameters
Max delta step: Caps the size of each leaf's output update; mainly useful for logistic-regression objectives on extremely imbalanced classes.
Tree method: Specifies the tree-construction algorithm, e.g. 'exact', 'approx', or the histogram-based 'hist'.
Scale pos weight: Balances positive and negative examples in imbalanced binary classification, commonly set to the ratio of negative to positive samples.
Silent: Formerly suppressed all messages; deprecated in favor of 'verbosity'.
Seed: Controls randomness for reproducible results (random_state in the scikit-learn API); several of these options are combined in the sketch below.
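The sketch below combines a few of these specialized parameters for an artificially imbalanced binary problem; the class balance and parameter values are placeholders.
import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification
# Roughly 90/10 class imbalance in the synthetic labels
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=42)
neg, pos = np.bincount(y)  # counts of the 0 and 1 classes
clf = xgb.XGBClassifier(
    scale_pos_weight=neg / pos,  # re-weight the positive class
    max_delta_step=1,            # cap leaf updates; helps with extreme imbalance
    tree_method='hist',          # histogram-based tree construction
    random_state=42              # seed for reproducibility
)
clf.fit(X, y)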
Model Training Parameters
colsample_bytree: The fraction of features (columns) randomly sampled when constructing each tree.
lambda (alias: reg_lambda): L2 regularization term on weights.
alpha (alias: reg_alpha): L1 regularization term on weights.
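In the native training API these parameters are passed as a plain dictionary under their canonical names; a minimal sketch on synthetic data follows.
import xgboost as xgb
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=500, n_features=10, random_state=42)
dtrain = xgb.DMatrix(X, label=y)
params = {
    'objective': 'reg:squarederror',
    'colsample_bytree': 0.8,  # fraction of features sampled per tree
    'lambda': 2.0,            # L2 term (reg_lambda in the scikit-learn API)
    'alpha': 0.1              # L1 term (reg_alpha in the scikit-learn API)
}
booster = xgb.train(params, dtrain, num_boost_round=200)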
Device and Storage Parameters
Memory and speed: Tree-construction settings such as the histogram-based 'hist' method bound memory use and speed up training.
Device selection: Determines whether training runs on the CPU or a GPU (sketched below).
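A rough sketch, assuming an XGBoost 2.x installation where the device parameter selects the accelerator (older releases used tree_method='gpu_hist' instead).
import xgboost as xgb
gpu_model = xgb.XGBRegressor(
    tree_method='hist',  # memory-efficient histogram algorithm
    device='cuda'        # requires a CUDA-capable GPU; the default is 'cpu'
)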
Advanced Parameters
Dart: The 'dart' booster randomly drops a fraction of previously built trees each round (dropout), which can help avoid overfitting.
Colsample_bynode: Controls the fraction of features sampled for each split (node), applied on top of colsample_bytree and colsample_bylevel.
Categorical Features: Native categorical support lets XGBoost split on categorical columns directly (enable_categorical in the scikit-learn API); see the sketch below.
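A hedged sketch combining these advanced options; enable_categorical assumes the training data will be a pandas DataFrame with categorical dtypes.
import xgboost as xgb
dart_model = xgb.XGBRegressor(
    booster='dart',          # DART: dropout applied to previously built trees
    rate_drop=0.1,           # fraction of trees dropped per boosting round
    colsample_bynode=0.8,    # fraction of features sampled at each split
    enable_categorical=True  # expects pandas categorical columns at fit time
)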
APIs for Distributed Computing
process_type and updater: Low-level controls over how trees are built or refreshed, useful in distributed setups or when updating an existing model with new data (sketched below).
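A minimal sketch of the refresh workflow these parameters enable, using synthetic data: an existing booster's trees are kept and only their statistics are updated on newer data.
import numpy as np
import xgboost as xgb
# Train an initial booster on one batch of synthetic data
rng = np.random.default_rng(0)
X_old, y_old = rng.normal(size=(200, 5)), rng.normal(size=200)
booster = xgb.train({'objective': 'reg:squarederror'},
                    xgb.DMatrix(X_old, label=y_old), num_boost_round=50)
# Refresh the existing trees on newer data instead of growing new ones
X_new, y_new = rng.normal(size=(200, 5)), rng.normal(size=200)
update_params = {
    'objective': 'reg:squarederror',
    'process_type': 'update',  # reuse the existing trees
    'updater': 'refresh',      # recompute node statistics / leaf values
    'refresh_leaf': 1          # also update leaf values, not just statistics
}
refreshed = xgb.train(update_params, xgb.DMatrix(X_new, label=y_new),
                      num_boost_round=50, xgb_model=booster)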
Code Example: Parameter Tuning
Here is the Python code:
import numpy as np
import xgboost as xgb
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error
# Load the California housing regression data
housing = fetch_california_housing()
X, y = housing.data, housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define hyperparameters for grid search
params = {
'n_estimators': [100, 200, 300],
'max_depth': [3, 5, 7],
'learning_rate': [0.1, 0.01, 0.001]
}
# Instantiate and fit the model with cross-validation and grid search
xg_reg = xgb.XGBRegressor(eval_metric='rmse')
grid_search = GridSearchCV(estimator=xg_reg, param_grid=params, scoring='neg_mean_squared_error', cv=5)
grid_search.fit(X_train, y_train)
# Get best parameters and evaluate on test set
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Best parameters from grid search:", best_params)
print("RMSE on test set:", rmse)