Why can xgboost not deal with this simple Chinese sentence case?
There is only one feature dimension, but the result is unreasonable. The code and data are below. The purpose of the code is to judge whether two sentences are the same.
In fact, the final input to the model is simply: feature [1] with label 1, and feature [0] with label 0.
The data is quite simple:
sent1 sent2 label
我想听 我想听 1
我想听 我想说 0
我想说 我想说 1
我想说 我想听 0
我想听 我想听 1
我想听 我想说 0
我想说 我想说 1
我想说 我想听 0
我想听 我想听 1
我想听 我想说 0
我想说 我想说 1
我想说 我想听 0
我想听 我想听 1
我想听 我想说 0
我想说 我想说 1
我想说 我想听 0
我想听 我想听 1
我想听 我想说 0
我想说 我想说 1
我想说 我想听 0
import pandas as pd
import xgboost as xgb
d = pd.read_csv("data_small.tsv",sep=" ")
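def my_test(sent1,sent2):
    # 我想说 = "I want to say", 我想听 = "I want to listen"
    # The single feature is 1 when both sentences contain the same phrase.
    result = [0]
    if "我想说" in sent1 and "我想说" in sent2:
        result[0] = 1
    if "我想听" in sent1 and "我想听" in sent2:
        result[0] = 1
    return result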
fea_ = d.apply(lambda row: my_test(row['sent1'], row['sent2']), axis=1).tolist()
labels = d["label"].tolist()
fea = pd.DataFrame(fea_)
for i in range(len(fea_)):
    print(fea_[i],labels[i])
labels = pd.DataFrame(labels)
from sklearn.model_selection import train_test_split
# train_x_pd_split, valid_x_pd, train_y_pd_split, valid_y_pd = train_test_split(fea, labels, test_size=0.2,
# random_state=1234)
train_x_pd_split = fea[0:16]
valid_x_pd = fea[16:20]
train_y_pd_split = labels[0:16]
valid_y_pd = labels[16:20]
train_xgb_split = xgb.DMatrix(train_x_pd_split, label=train_y_pd_split)
valid_xgb = xgb.DMatrix(valid_x_pd, label=valid_y_pd)
watch_list = [(train_xgb_split, 'train'), (valid_xgb, 'valid')]
params3 = {
    'seed': 1337,
    'colsample_bytree': 0.48,
    'silent': 1,
    'subsample': 1,
    'eta': 0.05,
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'max_depth': 8,
    'min_child_weight': 20,
    'nthread': 8,
    'tree_method': 'hist',
}
xgb_trained_model = xgb.train(params3, train_xgb_split, 1000, watch_list, early_stopping_rounds=50,
                              verbose_eval=10)
# xgb_trained_model.save_model("predict/model/xgb_model_all")
print("feature importance 0:")
importance = xgb_trained_model.get_fscore()
temp1 = []
temp2 = []
for k in importance:
    temp1.append(k)
    temp2.append(importance[k])
print("-----")
feature_importance_df = pd.DataFrame({
    'column': temp1,
    'importance': temp2,
}).sort_values(by='importance')
# print(feature_importance_df)
feature_sort_list = feature_importance_df["column"].tolist()
feature_importance_list = feature_importance_df["importance"].tolist()
print()
for i,item in enumerate(feature_sort_list):
    print(item,feature_importance_list[i])
train_x_xgb = xgb.DMatrix(train_x_pd_split)
train_predict = xgb_trained_model.predict(train_x_xgb)
print(train_predict)
train_predict_binary = (train_predict >= 0.5) * 1
print("TRAIN DATA SELF")
from sklearn import metrics
print('LogLoss: %.4f' % metrics.log_loss(train_y_pd_split, train_predict))
print('AUC: %.4f' % metrics.roc_auc_score(train_y_pd_split, train_predict))
print('ACC: %.4f' % metrics.accuracy_score(train_y_pd_split, train_predict_binary))
print('Recall: %.4f' % metrics.recall_score(train_y_pd_split, train_predict_binary))
print('F1-score: %.4f' % metrics.f1_score(train_y_pd_split, train_predict_binary))
print('Precesion: %.4f' % metrics.precision_score(train_y_pd_split, train_predict_binary))
print()
valid_xgb = xgb.DMatrix(valid_x_pd)
valid_predict = xgb_trained_model.predict(valid_xgb)
print(valid_predict)
valid_predict_binary = (valid_predict >= 0.5) * 1
print("TEST DATA PERFORMANCE")
from sklearn import metrics
print('LogLoss: %.4f' % metrics.log_loss(valid_y_pd, valid_predict))
print('AUC: %.4f' % metrics.roc_auc_score(valid_y_pd, valid_predict))
print('ACC: %.4f' % metrics.accuracy_score(valid_y_pd, valid_predict_binary))
print('Recall: %.4f' % metrics.recall_score(valid_y_pd, valid_predict_binary))
print('F1-score: %.4f' % metrics.f1_score(valid_y_pd, valid_predict_binary))
print('Precesion: %.4f' % metrics.precision_score(valid_y_pd, valid_predict_binary))
But the result shows that XGBoost does not fit the data:
TRAIN DATA SELF
LogLoss: 0.6931
AUC: 0.5000
ACC: 0.5000
Recall: 1.0000
F1-score: 0.6667
Precesion: 0.5000
TEST DATA PERFORMANCE
LogLoss: 0.6931
AUC: 0.5000
ACC: 0.5000
Recall: 1.0000
F1-score: 0.6667
Precesion: 0.5000
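These numbers are consistent with a model that made no split at all and outputs 0.5 for every row. A minimal sketch in plain Python, assuming constant predictions of 0.5 (the printed prediction arrays are not shown above, so this is an assumption) and the alternating 1/0 labels from the data:

```python
import math

# Labels of the 16 training rows: alternating 1, 0 as in the data above.
labels = [1, 0] * 8
# Assumption: the model made no split, so every row is predicted as 0.5.
preds = [0.5] * len(labels)

# Binary log loss, averaged over rows.
log_loss = -sum(
    y * math.log(p) + (1 - y) * math.log(1 - p) for y, p in zip(labels, preds)
) / len(labels)

# Thresholding at >= 0.5 turns every prediction into class 1.
binary = [int(p >= 0.5) for p in preds]
tp = sum(1 for y, b in zip(labels, binary) if y == 1 and b == 1)
fp = sum(1 for y, b in zip(labels, binary) if y == 0 and b == 1)
fn = sum(1 for y, b in zip(labels, binary) if y == 1 and b == 0)

accuracy = sum(1 for y, b in zip(labels, binary) if y == b) / len(labels)
recall = tp / (tp + fn)
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)

print("LogLoss: %.4f" % log_loss)
print("ACC: %.4f" % accuracy)
print("Recall: %.4f" % recall)
print("F1-score: %.4f" % f1)
print("Precision: %.4f" % precision)
```

A log loss of ln 2 together with a recall of 1.0 and an accuracy of 0.5 is the signature of "predict 0.5 everywhere", which points at the tree-growing constraints rather than at the data.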
python artificial-intelligence random-forest decision-tree xgboost
Assuming it's an algorithmic rather than a straight-up programming issue, migrate to DataScience.SE?
– smci
Nov 19 at 4:04
Question belongs on DataScience.SE
– smci
Nov 19 at 4:08
This is identical to XOR, which, though simply stated, is a notoriously hard problem in machine learning.
– Scott
Nov 19 at 7:34
1
wtf r u guys talking about?? xgboost solves XOR easy! and it is NOT HARD problem in machine learning! It's only impossible for LINEAR classifiers. Though it's new to me xgboost deals with strings rather than numerics.
– Eran Moshe
Nov 19 at 12:58
EranMoshe is right, @Scott. XOR is not exactly "notoriously hard", it just forces a maximum-depth tree, so for 2^N cases you simply get a depth-N (balanced) tree. XOR is only hard/impossible for linear classifiers.
– smci
Nov 20 at 1:11
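To illustrate the point made in the comments, here is a hand-written depth-2 decision tree over two binary inputs. It is only a sketch: the function name `xor_tree` is made up for this example and is not part of XGBoost.

```python
def xor_tree(a: int, b: int) -> int:
    """Hypothetical hand-built tree: one split per input, depth 2."""
    # Root split on the first input.
    if a < 0.5:
        # Second-level split on the second input.
        return 1 if b >= 0.5 else 0
    return 0 if b >= 0.5 else 1


# Enumerate all 2^2 input combinations.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_tree(a, b))
```

Two inputs give 2^2 cases and a depth-2 tree covers all of them, which is why XOR is a problem for linear classifiers but not for tree models.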
edited Nov 19 at 7:08
asked Nov 19 at 3:45
jet
1 Answer
I obtained 100% convergence. Here are the differences between the configurations:
- I set min_child_weight to 0. It's unreasonable to set it to 20 and expect XGBoost to find a split.
- I removed colsample_bytree. You only have 1 feature, so I don't think column sampling is a good choice.
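The min_child_weight point can be checked with a little arithmetic. For the binary:logistic objective, each row contributes a Hessian of p * (1 - p), and min_child_weight is the minimum Hessian sum a child node must have. The sketch below assumes an initial prediction of 0.5 for every row and uses a made-up helper name (`child_hessian_sums`); it mimics the check in plain Python and does not call XGBoost.

```python
def child_hessian_sums(features, preds):
    """Hypothetical helper: Hessian sum on each side of a split at feature < 0.5."""
    left = sum(p * (1 - p) for x, p in zip(features, preds) if x < 0.5)
    right = sum(p * (1 - p) for x, p in zip(features, preds) if x >= 0.5)
    return left, right


# The 16 training rows: the single feature alternates 1, 0 like the label.
features = [1, 0] * 8
# Assumption: every row starts at a predicted probability of 0.5.
preds = [0.5] * len(features)

left, right = child_hessian_sums(features, preds)
print("Hessian sums per child:", left, right)

for min_child_weight in (20, 0):
    # A split is only kept if both children reach the minimum Hessian sum.
    allowed = left >= min_child_weight and right >= min_child_weight
    print("min_child_weight =", min_child_weight, "-> split allowed:", allowed)
```

Each child holds 8 rows with a Hessian of 0.25, so its sum is 2.0. Even the root only reaches 4.0, which means a threshold of 20 can never be met on 16 rows and the model stays at its constant starting prediction.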
edited Nov 22 at 1:10
answered Nov 20 at 6:35
jet