Ridge regression: RMSE on all subsets is higher than on the total set

I trained a model on a full data set and then evaluated it on each subset of that set.



Mathematically, the total RMSE and MAE (mean absolute error) should lie between the smallest and largest per-subset RMSEs and MAEs. But every per-subset RMSE and MAE is higher than the total one.
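
To make the expectation concrete: if the same predictions are scored both ways, the total MSE is the size-weighted mean of the per-subset MSEs, MSE_total = sum(n_k * MSE_k) / sum(n_k), so the total RMSE cannot fall below the smallest per-subset RMSE (the same holds for MAE). A minimal check on synthetic data, where the groups, sizes, and noise levels are made up for illustration and are not the data from the question:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))

# Hypothetical data: three groups of different sizes and error levels.
rng = np.random.default_rng(0)
groups = np.repeat([0, 1, 2], [50, 30, 20])
y_true = rng.normal(size=groups.size)
y_pred = y_true + rng.normal(scale=1.0 + groups, size=groups.size)

total_rmse = rmse(y_true, y_pred)
total_mae = mae(y_true, y_pred)

labels = np.unique(groups)
sizes = [int(np.sum(groups == g)) for g in labels]
group_rmse = [rmse(y_true[groups == g], y_pred[groups == g]) for g in labels]
group_mae = [mae(y_true[groups == g], y_pred[groups == g]) for g in labels]

# The total MSE equals the size-weighted mean of the per-group MSEs.
weighted_mse = float(np.average(np.square(group_rmse), weights=sizes))

print("total RMSE:", total_rmse, "per-group RMSE:", group_rmse)
print("total MAE:", total_mae, "per-group MAE:", group_mae)
```

This only holds when each row gets the same prediction in the total and in the per-subset evaluation. If the per-subset numbers all exceed the total, the predictions themselves must differ between the two runs.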



I did the following:



%pyspark
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.preprocessing import RobustScaler

def preprocessing(features, attributes):
    features_2 = features[attributes]
    y = features['y'].values
    x = features_2.values

    # All attributes except the first one (xCustomers) are scaled.
    # Note: fit_transform re-fits the scaler on whatever data is passed in.
    robustScaler = RobustScaler(quantile_range=(25.0, 75.0))
    xScaled = robustScaler.fit_transform(x[:, 1:x.shape[1]])

    # Clip the scaled values to [-2, 2].
    xScaled[xScaled < -2.0] = -2.0
    xScaled[xScaled > 2.0] = 2.0
    xCustomers = x[:, 0]
    x_TS = xScaled
    x_T0 = xScaled[:, :]
    # Polynomial terms up to degree 3, once plain and once multiplied by xCustomers.
    x_T0_all = np.hstack((np.ones((x_T0.shape[0], 1)), x_T0, x_T0**2, x_T0**3))
    xCustR = xCustomers.reshape((x[:, 0].size, 1))
    x_TS_all = np.hstack((xCustR*np.ones((x_TS.shape[0], 1)), xCustR*x_TS, xCustR*(x_TS**2), xCustR*(x_TS**3)))
    x_all = np.hstack((x_T0_all, x_TS_all))
    variable_names = features_2.columns[1:].tolist()
    return x_all, variable_names, y

def trainModel(features, attributes, optAlpha):
    x_all, variable_names, y = preprocessing(features, attributes)
    ridge = linear_model.Ridge(fit_intercept=False, copy_X=True, alpha=optAlpha, solver='auto')
    ridge.fit(x_all, y)
    return ridge

def useModel(features, ridge, attributes):
    x_all, variable_names, y = preprocessing(features, attributes)
    y_pred = ridge.predict(x_all)
    rmse = np.sqrt(mean_squared_error(y, y_pred))
    mae = mean_absolute_error(y, y_pred)
    print("RMSE on test set: {}".format(round(rmse, 2)))
    print("MAE on test set: {}".format(round(mae, 2)))
    return y_pred, y, rmse, mae

ridge = trainModel(df_features_train, attributes, optAlpha)
useModel(df_features_train, ridge, attributes)

RMSE on test set: 67.05
MAE on test set: 52.5


Next, I applied the useModel function, including the preprocessing, to each orgID separately.



orgIDError = pd.DataFrame(columns=['orgID', 'rmse', 'mae'])

for orgID in df_features['orgID'].unique():
    yPred, y, rmse, mae = useModel(df_features_train[df_features_train.orgID == orgID], ridge, attributes)
    df = pd.DataFrame([[orgID, rmse, mae]], columns=['orgID', 'rmse', 'mae'])
    orgIDError = pd.concat([orgIDError, df])
print(orgIDError)

   orgID         rmse          mae
0    615   194.848564   155.502885
0    577   101.156573    76.083797
0    957  1564.256952   814.316566
0    763   832.782755   501.865561
0    616  1337.456555   860.404253
0    968   526.207558   347.265139
0    954  1570.315284  1149.191017
0    874   241.254153   202.429037
0    554   402.013992   344.846957
0    950  1073.348186   673.874603


Any ideas what went wrong?

Tags: python, scikit-learn, regression

asked Nov 21 '18 at 10:37 by Thomas R

1 Answer

I found it myself.

The RobustScaler in the preprocessing is re-fitted on whatever data is passed in (fit_transform), so it scales each subset differently from the full set.

The values in the subsets are therefore prepared differently from the training data and no longer fit the model.







answered Nov 21 '18 at 11:17 by Thomas R
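
The effect described in this answer can be reproduced in isolation. The sketch below uses a made-up single feature whose two halves have different locations and spreads (an assumption for illustration, not the question's data) and shows that a RobustScaler fitted on a subset learns a different center and scale than one fitted on the full set, so the same rows end up with different scaled values:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)

# Hypothetical feature column: two groups with different medians and spreads.
full = np.concatenate([
    rng.normal(loc=0.0, scale=1.0, size=200),
    rng.normal(loc=10.0, scale=5.0, size=200),
]).reshape(-1, 1)
subset = full[:200]  # only the first group

# One scaler fitted on everything, one fitted on the subset alone.
scaler_full = RobustScaler(quantile_range=(25.0, 75.0)).fit(full)
scaler_subset = RobustScaler(quantile_range=(25.0, 75.0)).fit(subset)

print("fitted on full set: center", scaler_full.center_, "scale", scaler_full.scale_)
print("fitted on subset:   center", scaler_subset.center_, "scale", scaler_subset.scale_)

# The same raw rows, scaled two different ways.
scaled_with_full_fit = scaler_full.transform(subset)
scaled_with_subset_fit = scaler_subset.transform(subset)
max_gap = float(np.max(np.abs(scaled_with_full_fit - scaled_with_subset_fit)))
print("largest difference in scaled values:", max_gap)
```

A model trained on features scaled by the full-set fit sees shifted and stretched inputs when the subset fit is used, which is enough to push every per-subset error above the total.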

• Yes, you are correct. In that case, you need to save the robustScaler after fitting it in trainModel and reuse it (calling only transform) in useModel. – Vivek Kumar, Nov 21 '18 at 13:16

• Great. Thanks :) – Thomas R, Nov 21 '18 at 16:48

• I have already joined the results to the original features and analyzed the RMSE and MAE on this new DataFrame. But your solution seems perfect for when predictions have to be made on new data. – Thomas R, Nov 21 '18 at 16:54
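
The suggestion in the first comment could look roughly like the sketch below. It is a simplified illustration, not the original code: the names build_features, train_model and use_model, the synthetic data, and alpha=1.0 are all assumptions, and the customer-interaction terms from the question are left out to keep it short. The point is only that the scaler is fitted once during training and later applied with transform.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import RobustScaler

def build_features(x_raw, scaler):
    """Scale with an already-fitted scaler, clip to [-2, 2], add polynomial terms."""
    x_scaled = np.clip(scaler.transform(x_raw), -2.0, 2.0)
    ones = np.ones((x_scaled.shape[0], 1))
    return np.hstack((ones, x_scaled, x_scaled ** 2, x_scaled ** 3))

def train_model(x_raw, y, alpha):
    """Fit the scaler and the ridge model on the training data; return both."""
    scaler = RobustScaler(quantile_range=(25.0, 75.0)).fit(x_raw)
    ridge = Ridge(fit_intercept=False, alpha=alpha).fit(build_features(x_raw, scaler), y)
    return ridge, scaler

def use_model(x_raw, y, ridge, scaler):
    """Predict with the stored scaler (transform only, no re-fit) and return the RMSE."""
    y_pred = ridge.predict(build_features(x_raw, scaler))
    return float(np.sqrt(mean_squared_error(y, y_pred)))

# Hypothetical data: three organisations whose features have different locations.
rng = np.random.default_rng(0)
org_ids = np.repeat([0, 1, 2], 100)
x_raw = rng.normal(loc=org_ids[:, None] * 3.0, scale=1.0, size=(300, 2))
y = 3.0 * x_raw[:, 0] - x_raw[:, 1] + rng.normal(scale=0.5, size=300)

ridge, scaler = train_model(x_raw, y, alpha=1.0)
total_rmse = use_model(x_raw, y, ridge, scaler)
rmse_by_org = {
    int(org): use_model(x_raw[org_ids == org], y[org_ids == org], ridge, scaler)
    for org in np.unique(org_ids)
}
print("total RMSE:", total_rmse)
print("per-org RMSE:", rmse_by_org)
```

Because every row is now transformed identically in both evaluations, the total RMSE lies between the smallest and largest per-organisation RMSE again. Wrapping the scaler and the model in a scikit-learn Pipeline is another way to keep the fitted scaler attached to the model.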













