Why can xgboost not deal with this simple Chinese sentence case?

There is only one feature dimension, yet the result is unreasonable. The code and data are below. The purpose of the code is to judge whether the two sentences are the same.

In fact, the final input to the model is simply: feature [1] with label 1, and feature [0] with label 0.

The data is quite simple:

sent1 sent2 label
我想听 我想听 1
我想听 我想说 0
我想说 我想说 1
我想说 我想听 0
我想听 我想听 1
我想听 我想说 0
我想说 我想说 1
我想说 我想听 0
我想听 我想听 1
我想听 我想说 0
我想说 我想说 1
我想说 我想听 0
我想听 我想听 1
我想听 我想说 0
我想说 我想说 1
我想说 我想听 0
我想听 我想听 1
我想听 我想说 0
我想说 我想说 1
我想说 我想听 0

(我想听 means "I want to listen"; 我想说 means "I want to say".)
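As a quick sanity check of the claim that the single feature equals the label, here is a minimal sketch in plain Python. It uses a hypothetical in-memory copy of the four distinct rows (`ROWS`) and a helper named `same_phrase_feature`, both introduced only for illustration; the helper follows the same substring logic as `my_test` in the code below.

```python
# Hypothetical in-memory copy of the four distinct rows from the data listing.
ROWS = [
    ("我想听", "我想听", 1),
    ("我想听", "我想说", 0),
    ("我想说", "我想说", 1),
    ("我想说", "我想听", 0),
]

def same_phrase_feature(sent1, sent2):
    """Return 1 when both sentences contain the same key phrase, else 0."""
    for phrase in ("我想说", "我想听"):
        if phrase in sent1 and phrase in sent2:
            return 1
    return 0

# Compute the feature for every row and compare it with the given label.
features = [same_phrase_feature(s1, s2) for s1, s2, _ in ROWS]
labels = [label for _, _, label in ROWS]
print(features, labels)
```

If the two printed lists match, the feature alone separates the classes, so any failure to fit has to come from the training configuration rather than from the data.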

import pandas as pd

import xgboost as xgb

d = pd.read_csv("data_small.tsv",sep=" ")





def my_test(sent1,sent2):

    result = [0]

    if "我想说" in sent1 and "我想说" in sent2:

        result[0] = 1

    if "我想听" in sent1 and "我想听" in sent2:

        result[0] = 1

    return result



fea_ = d.apply(lambda row: my_test(row['sent1'], row['sent2']), axis=1).tolist()



labels = d["label"].tolist()

fea = pd.DataFrame(fea_)

for i in range(len(fea_)):

    print(fea_[i],labels[i])



labels = pd.DataFrame(labels)

from sklearn.model_selection import train_test_split

# train_x_pd_split, valid_x_pd, train_y_pd_split, valid_y_pd = train_test_split(fea, labels, test_size=0.2,

#                                                                                random_state=1234)



train_x_pd_split = fea[0:16]

valid_x_pd = fea[16:20]

train_y_pd_split = labels[0:16]

valid_y_pd = labels[16:20]





train_xgb_split = xgb.DMatrix(train_x_pd_split, label=train_y_pd_split)

valid_xgb = xgb.DMatrix(valid_x_pd, label=valid_y_pd)

watch_list = [(train_xgb_split, 'train'), (valid_xgb, 'valid')]





params3 = {

    'seed': 1337,

    'colsample_bytree': 0.48,

    'silent': 1,

    'subsample': 1,

    'eta': 0.05,

    'objective': 'binary:logistic',

    'eval_metric': 'logloss',

    'max_depth': 8,

    'min_child_weight': 20,

    'nthread': 8,

    'tree_method': 'hist',

}



xgb_trained_model = xgb.train(params3, train_xgb_split, 1000, watch_list, early_stopping_rounds=50,

                              verbose_eval=10)

# xgb_trained_model.save_model("predict/model/xgb_model_all")

print("feature importance 0:")

importance = xgb_trained_model.get_fscore()

temp1 = []

temp2 = []



for k in importance:

    temp1.append(k)

    temp2.append(importance[k])



print("-----")

feature_importance_df = pd.DataFrame({

    'column': temp1,

    'importance': temp2,

}).sort_values(by='importance')



# print(feature_importance_df)



feature_sort_list = feature_importance_df["column"].tolist()

feature_importance_list = feature_importance_df["importance"].tolist()

print()

for i,item in enumerate(feature_sort_list):

    print(item,feature_importance_list[i])





train_x_xgb = xgb.DMatrix(train_x_pd_split)

train_predict = xgb_trained_model.predict(train_x_xgb)



print(train_predict)



train_predict_binary = (train_predict >= 0.5) * 1

print("TRAIN DATA SELF")

from sklearn import metrics

print('LogLoss: %.4f' % metrics.log_loss(train_y_pd_split, train_predict))

print('AUC: %.4f' % metrics.roc_auc_score(train_y_pd_split, train_predict))

print('ACC: %.4f' % metrics.accuracy_score(train_y_pd_split, train_predict_binary))

print('Recall: %.4f' % metrics.recall_score(train_y_pd_split, train_predict_binary))

print('F1-score: %.4f' % metrics.f1_score(train_y_pd_split, train_predict_binary))

print('Precision: %.4f' % metrics.precision_score(train_y_pd_split, train_predict_binary))



print()

valid_xgb = xgb.DMatrix(valid_x_pd)

valid_predict = xgb_trained_model.predict(valid_xgb)



print(valid_predict)



valid_predict_binary = (valid_predict >= 0.5) * 1

print("TEST DATA PERFORMANCE")

from sklearn import metrics

print('LogLoss: %.4f' % metrics.log_loss(valid_y_pd, valid_predict))

print('AUC: %.4f' % metrics.roc_auc_score(valid_y_pd, valid_predict))

print('ACC: %.4f' % metrics.accuracy_score(valid_y_pd, valid_predict_binary))

print('Recall: %.4f' % metrics.recall_score(valid_y_pd, valid_predict_binary))

print('F1-score: %.4f' % metrics.f1_score(valid_y_pd, valid_predict_binary))

print('Precision: %.4f' % metrics.precision_score(valid_y_pd, valid_predict_binary))

But the results show that XGBoost does not fit the data:

TRAIN DATA SELF

LogLoss: 0.6931

AUC: 0.5000

ACC: 0.5000

Recall: 1.0000

F1-score: 0.6667

Precision: 0.5000



TEST DATA PERFORMANCE

LogLoss: 0.6931

AUC: 0.5000

ACC: 0.5000

Recall: 1.0000

F1-score: 0.6667

Precision: 0.5000

edited Nov 19 at 7:08

asked Nov 19 at 3:45

jet

6761927

Assuming it's an algorithmic rather than a straight-up programming issue, migrate to DataScience.SE?
– smci
Nov 19 at 4:04

Question belongs on DataScience.SE
– smci
Nov 19 at 4:08

This is identical to XOR, which, though simply stated, is a notoriously hard problem in machine learning.
– Scott
Nov 19 at 7:34

wtf r u guys talking about?? xgboost solves XOR easy! and it is NOT HARD problem in machine learning! It's only impossible for LINEAR classifiers. Though it's new to me xgboost deals with strings rather than numerics.
– Eran Moshe
Nov 19 at 12:58

EranMoshe is right, @Scott. XOR is not exactly "notoriously hard", it just forces a maximum-depth tree, so for 2^N cases you simply get a depth-N (balanced) tree. XOR is only hard/impossible for linear classifiers.
– smci
Nov 20 at 1:11
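
To illustrate the point made in the comments above, here is a small sketch in plain Python. It compares a hand-written depth-2 decision tree with a single linear threshold rule on the four XOR cases. The function names and the weight grid are assumptions made for this example, and the grid search is only a demonstration, not a proof.

```python
from itertools import product

# The four XOR cases: the label is 1 when the two bits differ.
XOR_CASES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def depth2_tree(a, b):
    """Hand-written depth-2 decision tree: split on a, then on b."""
    if a < 0.5:
        return 1 if b >= 0.5 else 0
    return 0 if b >= 0.5 else 1

def linear_rule(w1, w2, bias, a, b):
    """Single linear threshold unit."""
    return 1 if w1 * a + w2 * b + bias > 0 else 0

# Number of XOR cases the tree classifies correctly.
tree_correct = sum(depth2_tree(a, b) == y for (a, b), y in XOR_CASES)

# Coarse grid over weights and bias in [-2, 2]; illustrative, not a proof.
grid = [i / 4 for i in range(-8, 9)]
best_linear = max(
    sum(linear_rule(w1, w2, bias, a, b) == y for (a, b), y in XOR_CASES)
    for w1, w2, bias in product(grid, repeat=3)
)

print(tree_correct, best_linear)
```

The tree gets all four cases right, while the best linear rule on this grid gets three of the four.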

python artificial-intelligence random-forest decision-tree xgboost

1 Answer
accepted

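I got the model to converge to 100% accuracy. Here are the differences between my configuration and yours:

I set min_child_weight to 0. It is unreasonable to set it to 20 and expect XGBoost to find a split: with binary:logistic, each row contributes a hessian of at most 0.25, so the 16 training rows sum to at most 4, and no child node can ever reach a weight of 20.

I removed colsample_bytree. You have only one feature, so column sampling is not a good choice.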
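
The min_child_weight point can be checked with a little arithmetic. The sketch below is plain Python, not XGBoost itself. It assumes boosting starts from a margin of 0 (a predicted probability of 0.5) and uses the fact that, for the logistic objective, a row's hessian is p * (1 - p). The constant names are made up for this example; the numbers come from the question (16 training rows, split 8/8 by the feature, min_child_weight of 20).

```python
import math

def sigmoid(margin):
    """Map a raw margin to a probability."""
    return 1.0 / (1.0 + math.exp(-margin))

def logistic_hessian(margin):
    """Second derivative of log loss w.r.t. the raw margin: p * (1 - p)."""
    p = sigmoid(margin)
    return p * (1.0 - p)

N_TRAIN = 16            # rows fea[0:16] in the question
N_PER_CHILD = 8         # the feature splits them into 8 ones and 8 zeros
MIN_CHILD_WEIGHT = 20   # value from params3 in the question

# Assumption: boosting starts from a margin of 0 (probability 0.5).
h = logistic_hessian(0.0)
root_hessian = N_TRAIN * h
child_hessian = N_PER_CHILD * h
split_allowed = child_hessian >= MIN_CHILD_WEIGHT

# Log loss of a constant 0.5 prediction, for comparison with the reported 0.6931.
constant_logloss = -math.log(0.5)

print(h, root_hessian, child_hessian, split_allowed, constant_logloss)
```

Since p * (1 - p) never exceeds 0.25, each child holds a hessian sum of at most 2, so the split is rejected no matter how many rounds run. The model then keeps predicting 0.5, which is consistent with the reported LogLoss of 0.6931 and AUC of 0.5.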

edited Nov 22 at 1:10

answered Nov 20 at 6:35

jet

6761927
