Why can xgboost not deal with this simple Chinese sentence case?























There is only one feature dimension, but the result is unreasonable. The code and data are below. The purpose of the code is to judge whether two sentences are the same.



In fact, the final input to the model is simply: feature [1] with label 1, and feature [0] with label 0.



The data is quite simple:





sent1 sent2 label
我想听 我想听 1
我想听 我想说 0
我想说 我想说 1
我想说 我想听 0
我想听 我想听 1
我想听 我想说 0
我想说 我想说 1
我想说 我想听 0
我想听 我想听 1
我想听 我想说 0
我想说 我想说 1
我想说 我想听 0
我想听 我想听 1
我想听 我想说 0
我想说 我想说 1
我想说 我想听 0
我想听 我想听 1
我想听 我想说 0
我想说 我想说 1
我想说 我想听 0





import pandas as pd
import xgboost as xgb
from sklearn import metrics

# Space-separated file with the columns sent1, sent2, label
d = pd.read_csv("data_small.tsv", sep=" ")


def my_test(sent1, sent2):
    # Single feature: 1 if both sentences contain the same phrase, else 0
    result = [0]
    if "我想说" in sent1 and "我想说" in sent2:
        result[0] = 1
    if "我想听" in sent1 and "我想听" in sent2:
        result[0] = 1
    return result


fea_ = d.apply(lambda row: my_test(row['sent1'], row['sent2']), axis=1).tolist()

labels = d["label"].tolist()
fea = pd.DataFrame(fea_)
for i in range(len(fea_)):
    print(fea_[i], labels[i])

labels = pd.DataFrame(labels)
# from sklearn.model_selection import train_test_split
# train_x_pd_split, valid_x_pd, train_y_pd_split, valid_y_pd = train_test_split(fea, labels, test_size=0.2,
#                                                                               random_state=1234)

# Fixed split: first 16 rows for training, last 4 for validation
train_x_pd_split = fea[0:16]
valid_x_pd = fea[16:20]
train_y_pd_split = labels[0:16]
valid_y_pd = labels[16:20]

train_xgb_split = xgb.DMatrix(train_x_pd_split, label=train_y_pd_split)
valid_xgb = xgb.DMatrix(valid_x_pd, label=valid_y_pd)
watch_list = [(train_xgb_split, 'train'), (valid_xgb, 'valid')]

params3 = {
    'seed': 1337,
    'colsample_bytree': 0.48,
    'silent': 1,
    'subsample': 1,
    'eta': 0.05,
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'max_depth': 8,
    'min_child_weight': 20,
    'nthread': 8,
    'tree_method': 'hist',
}

xgb_trained_model = xgb.train(params3, train_xgb_split, num_boost_round=1000, evals=watch_list,
                              early_stopping_rounds=50, verbose_eval=10)
# xgb_trained_model.save_model("predict/model/xgb_model_all")
print("feature importance 0:")
importance = xgb_trained_model.get_fscore()
temp1 = []
temp2 = []

for k in importance:
    temp1.append(k)
    temp2.append(importance[k])

print("-----")
feature_importance_df = pd.DataFrame({
    'column': temp1,
    'importance': temp2,
}).sort_values(by='importance')

# print(feature_importance_df)

feature_sort_list = feature_importance_df["column"].tolist()
feature_importance_list = feature_importance_df["importance"].tolist()
print()
for i, item in enumerate(feature_sort_list):
    print(item, feature_importance_list[i])

# Evaluate on the training rows
train_x_xgb = xgb.DMatrix(train_x_pd_split)
train_predict = xgb_trained_model.predict(train_x_xgb)

print(train_predict)

train_predict_binary = (train_predict >= 0.5) * 1
print("TRAIN DATA SELF")
print('LogLoss: %.4f' % metrics.log_loss(train_y_pd_split, train_predict))
print('AUC: %.4f' % metrics.roc_auc_score(train_y_pd_split, train_predict))
print('ACC: %.4f' % metrics.accuracy_score(train_y_pd_split, train_predict_binary))
print('Recall: %.4f' % metrics.recall_score(train_y_pd_split, train_predict_binary))
print('F1-score: %.4f' % metrics.f1_score(train_y_pd_split, train_predict_binary))
print('Precesion: %.4f' % metrics.precision_score(train_y_pd_split, train_predict_binary))

# Evaluate on the held-out rows
print()
valid_xgb = xgb.DMatrix(valid_x_pd)
valid_predict = xgb_trained_model.predict(valid_xgb)

print(valid_predict)

valid_predict_binary = (valid_predict >= 0.5) * 1
print("TEST DATA PERFORMANCE")
print('LogLoss: %.4f' % metrics.log_loss(valid_y_pd, valid_predict))
print('AUC: %.4f' % metrics.roc_auc_score(valid_y_pd, valid_predict))
print('ACC: %.4f' % metrics.accuracy_score(valid_y_pd, valid_predict_binary))
print('Recall: %.4f' % metrics.recall_score(valid_y_pd, valid_predict_binary))
print('F1-score: %.4f' % metrics.f1_score(valid_y_pd, valid_predict_binary))
print('Precesion: %.4f' % metrics.precision_score(valid_y_pd, valid_predict_binary))


But the result shows that XGBoost does not fit the data:



TRAIN DATA SELF
LogLoss: 0.6931
AUC: 0.5000
ACC: 0.5000
Recall: 1.0000
F1-score: 0.6667
Precesion: 0.5000

TEST DATA PERFORMANCE
LogLoss: 0.6931
AUC: 0.5000
ACC: 0.5000
Recall: 1.0000
F1-score: 0.6667
Precesion: 0.5000
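
As a reading aid (not part of the original script): the numbers above are consistent with a model that outputs probability 0.5 for every row. The sketch below recomputes the threshold-based metrics and the log loss in plain Python, assuming that constant prediction and the alternating 1, 0 labels of the 16 training rows. The function name `constant_half_metrics` is hypothetical.

```python
import math


def constant_half_metrics(labels):
    """Metrics for a hypothetical model that predicts 0.5 for every row."""
    probs = [0.5] * len(labels)
    # Same threshold as the script: probability >= 0.5 counts as class 1
    preds = [1 if p >= 0.5 else 0 for p in probs]

    log_loss = -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(labels, probs)
    ) / len(labels)

    tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
    fp = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 1)
    fn = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 0)
    correct = sum(1 for y, yhat in zip(labels, preds) if y == yhat)

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "LogLoss": log_loss,
        "ACC": correct / len(labels),
        "Recall": recall,
        "F1-score": 2 * precision * recall / (precision + recall),
        "Precision": precision,
    }


train_labels = [1, 0] * 8  # the 16 training rows alternate 1, 0
result = constant_half_metrics(train_labels)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

The log loss comes out as ln 2, which rounds to 0.6931, and every row is predicted as class 1, which explains the recall of 1.0 next to an accuracy of 0.5. An AUC of 0.5 is likewise what fully tied scores produce.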

































  • Assuming it's an algorithmic rather than a straight-up programming issue, migrate to DataScience.SE?
    – smci
    Nov 19 at 4:04










  • Question belongs on DataScience.SE
    – smci
    Nov 19 at 4:08












  • This is identical to XOR, which, though simply stated, is a notoriously hard problem in machine learning.
    – Scott
    Nov 19 at 7:34






  • wtf r u guys talking about?? xgboost solves XOR easy! and it is NOT HARD problem in machine learning! It's only impossible for LINEAR classifiers. Though it's new to me xgboost deals with strings rather than numerics.
    – Eran Moshe
    Nov 19 at 12:58












  • EranMoshe is right, @Scott. XOR is not exactly "notoriously hard", it just forces a maximum-depth tree, so for 2^N cases you simply get a depth-N (balanced) tree. XOR is only hard/impossible for linear classifiers.
    – smci
    Nov 20 at 1:11























python artificial-intelligence random-forest decision-tree xgboost














edited Nov 19 at 7:08

























asked Nov 19 at 3:45









jet



























1 Answer




















accepted










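I got the model to converge to 100% accuracy. Here are the differences from the configuration in the question:




  1. I set min_child_weight to 0. It is unreasonable to set it to 20 and expect XGBoost to find a split: with only 16 training rows and a logistic hessian of at most 0.25 per row, no child node can reach a hessian sum of 20.


  2. I removed colsample_bytree. There is only one feature, so I don't think column sampling is a good choice.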
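
To illustrate the first point, here is a rough sketch of the arithmetic behind min_child_weight, which XGBoost documents as the minimum sum of instance weight (hessian) needed in a child. It assumes unit instance weights and an initial prediction of 0.5, which matches the balanced labels in the question's training split. The helper name `logistic_hessian` is made up for this example and is not an XGBoost function.

```python
def logistic_hessian(p: float) -> float:
    """Second derivative of the logistic loss for one row with predicted probability p."""
    return p * (1.0 - p)


n_train_rows = 16
initial_prediction = 0.5  # assumption: balanced labels, default starting point
min_child_weight = 20     # value from the question's params

# The only possible split separates feature == 0 from feature == 1,
# which puts 8 of the 16 training rows on each side.
rows_per_child = n_train_rows // 2
child_hessian_sum = rows_per_child * logistic_hessian(initial_prediction)
root_hessian_sum = n_train_rows * logistic_hessian(initial_prediction)

# A split is only kept if each child reaches min_child_weight
split_allowed = child_hessian_sum >= min_child_weight

print(child_hessian_sum)  # 2.0
print(root_hessian_sum)   # 4.0
print(split_allowed)      # False
```

Under these assumptions each child would carry a hessian sum of 2.0, far below 20, so every tree stays a single leaf and the prediction never moves away from 0.5. Any min_child_weight of 2 or less would let the split through on this data.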































































        edited Nov 22 at 1:10

























        answered Nov 20 at 6:35









        jet































             
