Ridge regression: RMSE on all subsets higher than on the total set

I trained a model on a full data set and then evaluated it on each of its subsets.

Mathematically, the total RMSE and MAE (mean absolute error) should lie between the smallest and the largest of the per-subset RMSEs and MAEs. But every per-subset RMSE and MAE is higher than the total one.
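The expectation above can be checked numerically. The pooled MSE and MAE are size-weighted means of the per-subset values, so the pooled RMSE and MAE cannot fall outside the per-subset range, provided the same predictions are used in both evaluations. The following is a minimal sketch on synthetic residuals; the group sizes, spreads, and helper names are made up for illustration and do not come from the data in the question.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical residuals (y - y_pred) for three groups of different size and spread.
residuals = [rng.normal(0.0, s, size=n) for s, n in [(10.0, 50), (40.0, 200), (80.0, 30)]]


def rmse(r):
    return float(np.sqrt(np.mean(r ** 2)))


def mae(r):
    return float(np.mean(np.abs(r)))


sizes = np.array([len(r) for r in residuals])
subset_rmse = np.array([rmse(r) for r in residuals])
subset_mae = np.array([mae(r) for r in residuals])

# Errors on the pooled set, using exactly the same residuals.
pooled = np.concatenate(residuals)
pooled_rmse = rmse(pooled)
pooled_mae = mae(pooled)

# Pooled MSE / MAE rebuilt as size-weighted means of the per-subset values.
weighted_rmse = float(np.sqrt(np.sum(sizes * subset_rmse ** 2) / sizes.sum()))
weighted_mae = float(np.sum(sizes * subset_mae) / sizes.sum())

print("subset RMSE:", subset_rmse, "pooled RMSE:", pooled_rmse)
print("subset MAE: ", subset_mae, "pooled MAE: ", pooled_mae)
```

If the per-subset errors all exceed the pooled error, as in the results further down, the predictions being compared cannot be the same ones.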

I did the following:

%pyspark
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.preprocessing import RobustScaler

def preprocessing(features, attributes):
    features_2 = features[attributes]
    y = features['y'].values
    x = features_2.values

    # Scale every column except the first one (xCustomers), then clip to [-2, 2].
    robustScaler = RobustScaler(quantile_range=(25.0,75.0))
    xScaled = robustScaler.fit_transform(x[:,1:x.shape[1]])
    xScaled[xScaled < -2.0] = -2.0
    xScaled[xScaled > 2.0] = 2.0

    # Constant column plus polynomial terms up to degree 3.
    x_T0_all = np.hstack((np.ones((xScaled.shape[0], 1)), xScaled, xScaled**2, xScaled**3))

    # The same terms multiplied by the first column (interaction terms).
    xCustR = x[:,0].reshape((x[:,0].size, 1))
    x_TS_all = np.hstack((xCustR*np.ones((xScaled.shape[0], 1)), xCustR*xScaled, xCustR*(xScaled**2), xCustR*(xScaled**3)))

    x_all = np.hstack((x_T0_all, x_TS_all))
    variable_names = features_2.columns[1:].tolist()
    return x_all, variable_names, y



def trainModel(features,attributes,optAlpha):
    x_all, variable_names, y = preprocessing(features, attributes)
    # The constant column is already part of x_all, hence fit_intercept=False.
    ridge = linear_model.Ridge(fit_intercept=False, copy_X=True, alpha=optAlpha, solver='auto')
    ridge.fit(x_all, y)
    return ridge

def useModel(features,ridge,attributes):
    x_all, variable_names, y = preprocessing(features, attributes)
    y_pred = ridge.predict(x_all)
    rmse = np.sqrt(mean_squared_error(y,y_pred))
    mae = mean_absolute_error(y, y_pred)
    print("RMSE on test set: ", round(rmse,2))
    print("MAE on test set:  ", round(mae,2))
    return y_pred, y, rmse, mae

ridge = trainModel(df_features_train, attributes, optAlpha)
useModel(df_features_train,ridge,attributes)



RMSE on test set:  67.05
MAE on test set:   52.5

Next, I applied the useModel function (including the preprocessing) to each orgID separately.

# Collect one single-row frame per orgID, then concatenate them.
frames = []

for orgID in df_features['orgID'].unique():
    yPred, y, rmse, mae = useModel(df_features_train[df_features_train.orgID == orgID],ridge,attributes)
    frames.append(pd.DataFrame([[orgID,rmse,mae]],columns=['orgID','rmse','mae']))

orgIDError = pd.concat(frames)
print(orgIDError)



   orgID       rmse          mae
0  615   194.848564   155.502885
0  577   101.156573    76.083797
0  957  1564.256952   814.316566
0  763   832.782755   501.865561
0  616  1337.456555   860.404253
0  968   526.207558   347.265139
0  954  1570.315284  1149.191017
0  874   241.254153   202.429037
0  554   402.013992   344.846957
0  950  1073.348186   673.874603

Any ideas what went wrong?

asked Nov 21 '18 at 10:37

Thomas R

python scikit-learn regression


1 Answer

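I found it myself.

The RobustScaler in preprocessing behaves differently on different sets and subsets: because preprocessing calls fit_transform, the scaler is refitted on whatever data is passed in, so each subset is centered and scaled by its own median and interquartile range instead of those of the training set.

Therefore, the values in the subsets are prepared differently from the training data and no longer fit the model.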
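To illustrate the effect, here is a minimal sketch that fits one RobustScaler on a full set and another on a single subset, then compares how the same raw rows are scaled. The two groups and their scales are invented for the example; only RobustScaler and its fitted center_ and scale_ attributes come from scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)

# Hypothetical single feature for two groups that live on very different scales.
group_a = rng.normal(10.0, 1.0, size=(100, 1))
group_b = rng.normal(500.0, 50.0, size=(100, 1))
full = np.vstack([group_a, group_b])

# Scaler fitted on the full set (the situation during training).
full_scaler = RobustScaler(quantile_range=(25.0, 75.0)).fit(full)
# Scaler refitted on one subset (the situation when evaluating one group).
subset_scaler = RobustScaler(quantile_range=(25.0, 75.0)).fit(group_b)

print("full   center/scale:", full_scaler.center_, full_scaler.scale_)
print("subset center/scale:", subset_scaler.center_, subset_scaler.scale_)

# The same raw rows end up with different scaled values.
b_with_full = full_scaler.transform(group_b)
b_with_subset = subset_scaler.transform(group_b)
max_gap = float(np.abs(b_with_full - b_with_subset).max())
print("largest difference in scaled values:", max_gap)
```

The model coefficients were learned on features scaled like b_with_full, so feeding it b_with_subset amounts to predicting on different inputs.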

answered Nov 21 '18 at 11:17

Thomas R


Yes, you are correct. In that case, you will need to save the fitted RobustScaler during trainModel and reuse it during useModel, calling only transform (not fit_transform) there.

– Vivek Kumar
Nov 21 '18 at 13:16
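One way this suggestion could look is sketched below. It assumes the same feature layout as in the question (first column unscaled, the rest scaled, clipped, and expanded to degree 3). The function names expand, train_model, and use_model and the synthetic data are hypothetical, not the original poster's code.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import RobustScaler


def expand(x_raw, scaler):
    """Scale with an already fitted scaler, clip, and build the polynomial features."""
    customers = x_raw[:, [0]]  # first column stays unscaled
    x_scaled = np.clip(scaler.transform(x_raw[:, 1:]), -2.0, 2.0)
    ones = np.ones((x_raw.shape[0], 1))
    base = np.hstack((ones, x_scaled, x_scaled ** 2, x_scaled ** 3))
    return np.hstack((base, customers * base))


def train_model(x_raw, y, alpha):
    # Fit the scaler exactly once, on the training data, and return it with the model.
    scaler = RobustScaler(quantile_range=(25.0, 75.0)).fit(x_raw[:, 1:])
    ridge = Ridge(fit_intercept=False, alpha=alpha).fit(expand(x_raw, scaler), y)
    return ridge, scaler


def use_model(x_raw, y, ridge, scaler):
    # Only transform here; the scaler is never refitted.
    y_pred = ridge.predict(expand(x_raw, scaler))
    return float(np.sqrt(mean_squared_error(y, y_pred)))


# Synthetic data with three hypothetical groups on different feature scales.
rng = np.random.default_rng(0)
org = rng.integers(0, 3, size=300)
x_raw = np.column_stack((rng.uniform(1, 5, 300),
                         rng.normal(org * 10.0, 1.0, 300),
                         rng.normal(0.0, 1.0, 300)))
y = 3.0 * x_raw[:, 0] + 2.0 * x_raw[:, 1] + rng.normal(0.0, 1.0, 300)

ridge, scaler = train_model(x_raw, y, alpha=1.0)
total_rmse = use_model(x_raw, y, ridge, scaler)
subset_rmse = [use_model(x_raw[org == k], y[org == k], ridge, scaler) for k in range(3)]
print("total RMSE:", total_rmse)
print("subset RMSE:", subset_rmse)
```

With the scaler fitted once, every row gets the same prediction whether it is evaluated in the full set or in a subset, so the total RMSE falls back inside the per-subset range.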

Great. Thanks :)

– Thomas R
Nov 21 '18 at 16:48

I have already joined the results to the original features and analyzed the RMSE and MAE on this new DataFrame. But your solution seems to be perfect when the prediction has to be done with new data.

– Thomas R
Nov 21 '18 at 16:54
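The workaround described in this comment (predict once on the full set, join the predictions back, then compute the errors per group) might look roughly like the sketch below. The column names y_pred, sq_err, and abs_err and the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one row per observation with its group, its target, and the
# prediction obtained from a single pass over the full set.
df = pd.DataFrame({
    "orgID": [615, 615, 577, 577, 577, 957],
    "y": [100.0, 120.0, 50.0, 55.0, 60.0, 900.0],
    "y_pred": [110.0, 115.0, 48.0, 60.0, 57.0, 850.0],
})

# Per-row errors, computed once from the joined predictions.
df["sq_err"] = (df["y"] - df["y_pred"]) ** 2
df["abs_err"] = (df["y"] - df["y_pred"]).abs()

# Aggregate the row errors per orgID.
grouped = df.groupby("orgID")
per_org = pd.DataFrame({
    "rmse": np.sqrt(grouped["sq_err"].mean()),
    "mae": grouped["abs_err"].mean(),
})

total_rmse = float(np.sqrt(df["sq_err"].mean()))
total_mae = float(df["abs_err"].mean())

print(per_org)
print("total RMSE:", total_rmse, "total MAE:", total_mae)
```

Because no scaler is refitted per group here, the per-group errors are consistent with the total ones. This only covers data that was already predicted in the full pass.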
