
How does XGBoost handle missing or null values in the dataset?

1 Answer

XGBoost has built-in handling for missing or null values, which makes it well suited to datasets with incomplete information.

Sparsity-Aware Split Finding

XGBoost detects missing entries automatically and learns how to route them during split finding: for each candidate split, it evaluates sending all instances with a missing value to the left child and then to the right child, and keeps whichever direction reduces the loss more. That choice is stored as the split's default direction, which missing values follow at prediction time.

Weight Adjustments

No imputation takes place. Instances with missing values follow the default direction at each split, so they contribute to the gradient statistics, and therefore to the learned weight, of the leaf they end up in, just like fully observed instances.


Code Example: Utilizing xgboost.DMatrix

The Python code below uses xgboost.DMatrix for finer-grained control: its missing parameter lets you declare a sentinel value (here -999) that should be treated as missing.

import xgboost as xgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Creating a sample dataset with missing values
data = {
    'feature1': [1, 2, np.nan, 4, 5],
    'label': [0, 1, 0, 1, 0]
}
df = pd.DataFrame(data)

# Encoding missing values with a sentinel; DMatrix is told about it below
df.loc[df['feature1'].isnull(), 'feature1'] = -999

# Splitting the dataset (features must be 2-D, hence the double brackets)
X_train, X_test, y_train, y_test = train_test_split(
    df[['feature1']], df['label'], test_size=0.2, random_state=42)

# Converting to 'DMatrix', declaring -999 as the missing-value marker
dtrain = xgb.DMatrix(X_train, label=y_train, missing=-999)
dtest = xgb.DMatrix(X_test, label=y_test, missing=-999)

# Defining model parameters
params = {
    'objective': 'binary:logistic',
    'eval_metric': 'logloss'
}

# Training the model
model = xgb.train(params, dtrain, num_boost_round=10)

# Making predictions
preds = model.predict(dtest)
...