Check duplicate issues.
Description
Hello,
Recently, I attempted to save an XGBoost BDT model into a ROOT file using TMVA::Experimental::SaveXGBoost, and later evaluate it with TMVA::Experimental::RBDT.
I encountered an issue where TMVA::Experimental::SaveXGBoost appears to expect feature names in the format f0, f1, etc.
If the features instead have descriptive names such as leptonPt or leptonEta, the conversion fails with the following error.
---------------------------------------------------------------------------
runtime_error Traceback (most recent call last)
Cell In[10], line 2
1 from ROOT import TMVA
----> 2 TMVA.Experimental.SaveXGBoost(clf, "myBDT", "model.root", len(feature_names))
File /cvmfs/sft.cern.ch/lcg/views/LCG_108/x86_64-el9-gcc13-opt/lib/ROOT/_pythonization/_tmva/_tree_inference.py:75, in SaveXGBoost(xgb_model, key_name, output_path, num_inputs)
73 bs = get_basescore(xgb_model)
74 logistic = objective == "logistic"
---> 75 bdt = cppyy.gbl.TMVA.Experimental.RBDT.LoadText(
76 output_path,
77 features,
78 num_outputs,
79 logistic,
80 cppyy.gbl.std.log(bs / (1.0 - bs)) if logistic else bs,
81 )
83 with cppyy.gbl.TFile.Open(output_path, "RECREATE") as tFile:
84 tFile.WriteObject(bdt, key_name)
runtime_error: static TMVA::Experimental::RBDT TMVA::Experimental::RBDT::LoadText(const string& txtpath, vector<string>& features, int nClasses, bool logistic, TMVA::Experimental::RBDT::Value_t baseScore) =>
runtime_error: constructing RBDT from istream: feature leptonPt not in list of features
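To illustrate the failure mode, here is a minimal sketch of the kind of check that produces this error. It is not ROOT's actual implementation: the helper `unknown_features`, the regular expression, and the hand-written sample dump are all assumptions made for illustration. It scans an XGBoost-style text dump for split feature names and reports any that are missing from the supplied feature list.

```python
import re

# Matches the split condition of a text-dump node, e.g. "0:[leptonPt<60] yes=1,no=2,missing=1".
# The tree header "booster[0]:" is not matched because no "<" follows the bracket content.
SPLIT_PATTERN = re.compile(r"\[([^<\]]+)<")


def unknown_features(dump_text, features):
    """Return feature names used in the dump but absent from `features` (hypothetical helper)."""
    known = set(features)
    missing = []
    for name in SPLIT_PATTERN.findall(dump_text):
        if name not in known and name not in missing:
            missing.append(name)
    return missing


# Hand-written sample in the style of an XGBoost text dump (values are made up).
sample_dump = """booster[0]:
0:[leptonPt<60] yes=1,no=2,missing=1
    1:leaf=-0.5
    2:[leptonEta<1.2] yes=3,no=4,missing=3
        3:leaf=0.4
        4:leaf=-0.3
"""

# With the hard-coded names, both descriptive features are reported as unknown.
print(unknown_features(sample_dump, ["f0", "f1"]))
# With the real names, nothing is missing.
print(unknown_features(sample_dump, ["leptonPt", "leptonEta"]))
```

Running a check like this on a dumped model before calling SaveXGBoost would show up front which feature names the converter will reject.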
It looks like the feature names are hard-coded to f0, f1,... here:
features = cppyy.gbl.std.vector["std::string"]([f"f{i}" for i in range(num_inputs)])
If I’m not mistaken, a possible fix could be:
if xgb_model.get_booster().feature_names is None:
    features = cppyy.gbl.std.vector["std::string"]([f"f{i}" for i in range(num_inputs)])
else:
    features = cppyy.gbl.std.vector["std::string"](xgb_model.get_booster().feature_names)
I’m not sure whether this is the intended behavior or a bug.
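The name-selection logic of the proposed fix can be sketched as a standalone, ROOT-free helper, which makes it easy to test on its own. The function name `resolve_feature_names` is hypothetical, and the length check against `num_inputs` is an extra assumption that is not part of the suggestion above.

```python
def resolve_feature_names(booster_feature_names, num_inputs):
    """Pick the feature names to pass on to the converter (hypothetical helper).

    booster_feature_names: the value of xgb_model.get_booster().feature_names,
        which is None when the model was trained without named columns.
    num_inputs: the number of input features given by the caller.
    """
    if booster_feature_names is None:
        # No names stored in the model: fall back to the default f0, f1, ...
        return [f"f{i}" for i in range(num_inputs)]

    names = list(booster_feature_names)
    # Assumption: a mismatch with num_inputs is treated as a caller error.
    if len(names) != num_inputs:
        raise ValueError(
            f"model has {len(names)} feature names but num_inputs is {num_inputs}"
        )
    return names


print(resolve_feature_names(None, 2))
print(resolve_feature_names(["leptonPt", "leptonEta"], 2))
```

The returned list could then be wrapped in `cppyy.gbl.std.vector["std::string"](...)` exactly as in the snippet above.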
Cheers,
Jindrich
Reproducer
import os
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
# --- Generate dummy data ---
np.random.seed(42)
n_events = 10000
#feature_names = ["f0", "f1"] # this works
feature_names = ["leptonPt", "leptonEta"] # this will crash
df = pd.DataFrame({
    feature_names[0]: np.random.uniform(10, 200, n_events),
    feature_names[1]: np.random.uniform(-2.5, 2.5, n_events),
})
df["label"] = ((df[feature_names[0]] > 60) & (np.abs(df[feature_names[1]]) < 1.2)).astype(int)
# Train/test split
X_train, X_valid, y_train, y_valid = train_test_split(
    df[[feature_names[0], feature_names[1]]],
    df["label"],
    test_size=0.3,
    random_state=123,
    stratify=df["label"],
)
# --- Define classifier ---
clf = xgb.XGBClassifier(
    tree_method="hist",
    objective="binary:logistic",
    eval_metric="auc",
    n_estimators=50,
    subsample=0.8,
    max_depth=2,
)
# --- Train ---
clf.fit(
    X_train, y_train,
    eval_set=[(X_valid, y_valid)],
    verbose=10,
)
# --- Evaluate ---
y_pred_proba = clf.predict_proba(X_valid)[:, 1]
auc = roc_auc_score(y_valid, y_pred_proba)
print(f"Validation AUC: {auc:.3f}")
# --- Feature names ---
print("Feature names used in the model:", clf.get_booster().feature_names)
# --- Save model as JSON ---
clf.save_model("xgb_lepton_model.json")
# --- Save model in a ROOT file ---
from ROOT import TMVA
TMVA.Experimental.SaveXGBoost(clf, "myBDT", "model.root", len(feature_names))
ROOT version
ROOT 6.36.02
/cvmfs/sft.cern.ch/lcg/views/LCG_108/x86_64-el9-gcc13-opt/setup.sh
Installation method
LCG_108, LCG_106
Operating system
Linux
Additional context
No response