TMVA.Experimental.SaveXGBoost and feature names #20267

@jlidrych

Description

Check duplicate issues.

  • Checked for duplicates

Hello,

Recently, I attempted to save an XGBoost BDT model into a ROOT file using TMVA::Experimental::SaveXGBoost, and later evaluate it with TMVA::Experimental::RBDT.

I encountered an issue where TMVA::Experimental::SaveXGBoost appears to expect feature names in the format f0, f1, etc.
If the features instead have descriptive names such as leptonPt or leptonEta, the conversion fails with the following error.

---------------------------------------------------------------------------
runtime_error                             Traceback (most recent call last)
Cell In[10], line 2
      1 from ROOT import TMVA
----> 2 TMVA.Experimental.SaveXGBoost(clf, "myBDT", "model.root", len(feature_names))

File /cvmfs/sft.cern.ch/lcg/views/LCG_108/x86_64-el9-gcc13-opt/lib/ROOT/_pythonization/_tmva/_tree_inference.py:75, in SaveXGBoost(xgb_model, key_name, output_path, num_inputs)
     73 bs = get_basescore(xgb_model)
     74 logistic = objective == "logistic"
---> 75 bdt = cppyy.gbl.TMVA.Experimental.RBDT.LoadText(
     76     output_path,
     77     features,
     78     num_outputs,
     79     logistic,
     80     cppyy.gbl.std.log(bs / (1.0 - bs)) if logistic else bs,
     81 )
     83 with cppyy.gbl.TFile.Open(output_path, "RECREATE") as tFile:
     84     tFile.WriteObject(bdt, key_name)

runtime_error: static TMVA::Experimental::RBDT TMVA::Experimental::RBDT::LoadText(const string& txtpath, vector<string>& features, int nClasses, bool logistic, TMVA::Experimental::RBDT::Value_t baseScore) =>
    runtime_error: constructing RBDT from istream: feature leptonPt not in list of features

It looks like the feature names are hard-coded to f0, f1, ... in SaveXGBoost:

features = cppyy.gbl.std.vector["std::string"]([f"f{i}" for i in range(num_inputs)])
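
To illustrate the mismatch, here is a minimal pure-Python sketch. It assumes the model reaches RBDT::LoadText as a text dump in which split nodes refer to features by name, which is what the error message suggests. The dump lines are hand-written in the style of an XGBoost text dump rather than real output, and split_features and missing_features are hypothetical helpers, not part of ROOT or XGBoost.

```python
import re

# Hand-written lines in the style of an XGBoost text dump (illustrative only).
dump_text = """\
0:[leptonPt<60] yes=1,no=2,missing=1
  1:leaf=-0.4
  2:[leptonEta<1.2] yes=3,no=4,missing=3
    3:leaf=0.3
    4:leaf=-0.2
"""


def split_features(dump):
    """Return the feature name of every split node, in order of appearance."""
    # Split nodes look like "id:[name<threshold] ..."; leaves have no brackets.
    return re.findall(r"\[([^<\]]+)<", dump)


def missing_features(dump, known):
    """Return feature names used in the dump that are absent from `known`."""
    known = set(known)
    missing = []
    for name in split_features(dump):
        if name not in known and name not in missing:
            missing.append(name)
    return missing


# The hard-coded list for num_inputs = 2, as built in SaveXGBoost.
hardcoded = [f"f{i}" for i in range(2)]

print(missing_features(dump_text, hardcoded))
print(missing_features(dump_text, ["leptonPt", "leptonEta"]))
```

With the hard-coded list, both descriptive names are reported as missing, which mirrors the "feature leptonPt not in list of features" error. With the real names, nothing is missing.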

If I’m not mistaken, a possible fix could be:

feature_names = xgb_model.get_booster().feature_names
if feature_names is None:
    # No names stored in the booster: fall back to the default f0, f1, ...
    feature_names = [f"f{i}" for i in range(num_inputs)]
features = cppyy.gbl.std.vector["std::string"](feature_names)
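
The name-selection logic of this proposal can be sketched without ROOT or XGBoost. resolve_feature_names is a hypothetical helper, not an existing function in either library, and the length check against num_inputs is an extra assumption that is not part of the proposal above.

```python
def resolve_feature_names(booster_feature_names, num_inputs):
    """Pick the feature names to hand to the loader (hypothetical helper).

    booster_feature_names: the value of booster.feature_names (list or None).
    num_inputs: number of input features passed by the caller.
    """
    if booster_feature_names is None:
        # No names stored in the model: use the default f0, f1, ...
        return [f"f{i}" for i in range(num_inputs)]

    names = list(booster_feature_names)
    # Assumption: a mismatch with num_inputs indicates a caller error.
    if len(names) != num_inputs:
        raise ValueError(
            f"model has {len(names)} feature names but num_inputs={num_inputs}"
        )
    return names


print(resolve_feature_names(None, 2))
print(resolve_feature_names(["leptonPt", "leptonEta"], 2))
```

The returned list would then be converted with cppyy.gbl.std.vector["std::string"](...) as in the snippet above.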

I’m not sure whether this is the intended behavior or a bug.

Cheers,
Jindrich

Reproducer

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# --- Generate dummy data ---
np.random.seed(42)
n_events = 10000

# feature_names = ["f0", "f1"]  # this works
feature_names = ["leptonPt", "leptonEta"]  # this fails in SaveXGBoost

df = pd.DataFrame({
    feature_names[0]: np.random.uniform(10, 200, n_events),
    feature_names[1]: np.random.uniform(-2.5, 2.5, n_events),
})
df["label"] = ((df[feature_names[0]] > 60) & (np.abs(df[feature_names[1]]) < 1.2)).astype(int)

# Train/test split
X_train, X_valid, y_train, y_valid = train_test_split(
    df[[feature_names[0], feature_names[1]]],
    df["label"],
    test_size=0.3,
    random_state=123,
    stratify=df["label"],
)

# --- Define classifier ---
clf = xgb.XGBClassifier(
    tree_method="hist",
    objective="binary:logistic",
    eval_metric="auc",
    n_estimators=50,
    subsample=0.8,
    max_depth=2,
)

# --- Train ---
clf.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    verbose=10
)

# --- Evaluate ---
y_pred_proba = clf.predict_proba(X_valid)[:, 1]
auc = roc_auc_score(y_valid, y_pred_proba)
print(f"Validation AUC: {auc:.3f}")

# --- Feature names ---
print("Feature names used in the model:", clf.get_booster().feature_names)

# --- Save model in json ---
clf.save_model("xgb_lepton_model.json")

# --- Save model in a ROOT file ---
from ROOT import TMVA
TMVA.Experimental.SaveXGBoost(clf, "myBDT", "model.root", len(feature_names))
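
Since the reproducer works when the columns are named f0 and f1, one possible interim workaround is to rename the columns to positional names before training and keep the mapping for later reference. This is a sketch under that assumption and has not been verified against ROOT here; positional_feature_map is a hypothetical helper.

```python
def positional_feature_map(feature_names):
    """Map descriptive names to f0, f1, ... in column order (hypothetical helper)."""
    return {name: f"f{i}" for i, name in enumerate(feature_names)}


feature_names = ["leptonPt", "leptonEta"]
rename_map = positional_feature_map(feature_names)

# With pandas, the training and validation frames could be renamed before
# clf.fit, for example:
#   X_train = X_train.rename(columns=rename_map)
#   X_valid = X_valid.rename(columns=rename_map)
print(rename_map)  # {'leptonPt': 'f0', 'leptonEta': 'f1'}
```

The column order must stay the same at evaluation time, because the positional names carry no meaning beyond their index.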

ROOT version

ROOT 6.36.02

/cvmfs/sft.cern.ch/lcg/views/LCG_108/x86_64-el9-gcc13-opt/setup.sh

Installation method

LCG_108, LCG_106

Operating system

Linux

Additional context

No response
