Deep learning from scratch: Debug (Loss increases)
I'm trying to implement a neural network with a single hidden layer of 40 neurons on a handwritten-letter dataset (26 classes). I know convnets are better suited for this task, but from what I've read this approach should also work reasonably well with a cross-entropy loss and sigmoid activations.
In my code, I have included:
- mini-batches
- weight decay (0.005 per epoch)
- batch norm
- learning rate decay (currently a 10% decay every 10 epochs; I plan to add the Adam optimizer next)
- L2 regularization (loss = loss + sum of squared weights of each layer; sketched below)
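For the regularization item above, this is roughly what I mean by adding the squared weights to the loss (a minimal illustration with placeholder names, not my actual code):

import numpy as np

def l2_regularized_loss(data_loss, weights, reg_coef, n_samples):
    # data_loss: scalar cross-entropy loss for the batch
    # weights:   list of weight matrices, one per layer
    penalty = (reg_coef / (2 * n_samples)) * sum(np.sum(np.square(w)) for w in weights)
    return data_loss + penalty

# The matching term in each layer's weight gradient would then be
#     gradient += (reg_coef / n_samples) * w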
However, for some reason my loss keeps increasing. I tried initial learning rates ranging from 0.0001 to 1 and adding or removing the regularization, but could not figure out what the problem was. At one point the loss was decreasing, but the accuracy still stayed near 0.03 on both the training and the test set. Have I made a mistake in calculating the gradient?
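One way to verify the analytic gradient would be a numerical check along these lines (a rough sketch; loss_fn is a hypothetical helper that runs the forward pass with the current weights and returns the scalar loss):

import numpy as np

def gradient_check(loss_fn, w, analytic_grad, eps=1e-5, n_checks=10):
    # Compare a few analytic gradient entries against central differences.
    rng = np.random.default_rng(0)
    for _ in range(n_checks):
        idx = tuple(rng.integers(0, s) for s in w.shape)
        old = w[idx]
        w[idx] = old + eps
        loss_plus = loss_fn(w)
        w[idx] = old - eps
        loss_minus = loss_fn(w)
        w[idx] = old
        numeric = (loss_plus - loss_minus) / (2 * eps)
        print(idx, 'analytic:', analytic_grad[idx], 'numeric:', numeric)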
You can access the full code and the dataset here:
https://files.fm/u/f64dfz9a
For simplicity, I'm only adding the train function here. Can someone check whether my backprop is right? What I basically do is multiply the cross-entropy error by the derivative of the sigmoid activation, matrix-multiply the previous layer's output with that, scale by the learning rate, and subtract the result from the current weights.
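In code, the output-layer step I'm describing should reduce to something like this (a minimal sketch with placeholder names, not my actual implementation; it assumes a sigmoid output with the binary cross-entropy form, for which the delta at the output simplifies to output - target):

import numpy as np

def output_layer_update(w_out, hidden_out, output, target, learning_rate):
    # Delta at a sigmoid output with cross-entropy: dL/da * a * (1 - a) = output - target.
    n = hidden_out.shape[0]
    delta_out = output - target
    gradient = np.dot(hidden_out.T, delta_out) / n   # average over the batch
    return w_out - learning_rate * gradient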
def train(self, x_train, y_train):
    ### Takes the first elements of all inputs at the same time. Should instead take them one by one or in batches/minibatches.
    # FEEDFORWARD
    # print('x_train: ', x_train)
    hid_layer_out = []
    for i in range(self.num_of_hidden_layers):
        # First iteration is between the input and the first hidden layer.
        if i == 0:
            hid_layer_out.append(activation_func.sigmoid(np.dot(x_train.T, self.weights[i][:-1, :]) + self.weights[i][-1, :]))  # wx + b (bias added)
        # Next ones are between hidden layers only.
        else:
            hid_layer_out.append(activation_func.sigmoid(np.dot(hid_layer_out[i - 1], self.weights[i][:-1, :]) + self.weights[i][-1, :]))  # wx + b (bias added)
    # Last step is the final feedforward output of the network.
    forwardprop_out = np.dot(hid_layer_out[-1], self.weights[-1][:-1, :])
    # Add bias.
    bias = self.weights[-1][-1, :]
    bias = bias.reshape(1, len(bias))
    forwardprop_out += bias
    forwardprop_out = activation_func.sigmoid(forwardprop_out)
    # forwardprop_out = activation_func.softmax(forwardprop_out)
    # print('HID_LAYER_OUT:\n', hid_layer_out)
    # print('\nFORWARDPROP_OUT:\n', forwardprop_out)

    ##### BACKPROP
    # One-hot encoding.
    N = x_train.shape[1]
    target = np.zeros((N, self.num_output_nodes))
    target[np.arange(N), (y_train.T.astype(int) - 1)] = 1
    # Find error and loss values.
    error, loss = CEcost(forwardprop_out, target)
    if self.regularize:
        reg_coef = 2
        error = np.log(error + (reg_coef/2*N) * (np.sum(np.square(self.weights[0])) + np.sum(np.square(self.weights[1]))))
        loss = np.log(np.sum(error))
    # Record loss value of the current iteration.
    self.loss_values.append(loss)
    hid_layer_out.append(forwardprop_out)
    # print('error.shape: {} forwardprop.shape: {}'.format(error.shape, forwardprop_out.shape))
    for i in range(self.num_of_hidden_layers, 0, -1):
        # print('\n', i)
        # error_M = error * forwardprop_out * (1 - forwardprop_out)
        # For sigmoid backprop.
        error_M = error * hid_layer_out[i] * (1 - hid_layer_out[i])
        gradient = np.dot(hid_layer_out[i - 1].T, error_M) / N
        # error_mean = np.mean(error, axis=0)
        # error_mean = error_mean.reshape(len(error_mean), 1)
        db = [(self.learning_rate / 10) * np.mean(error_M, axis=0)]
        temp = self.learning_rate * gradient
        temp = np.concatenate([temp, db])
        self.weights[i] = self.weights[i] - temp
        # print('error_M shape: ', error_M.shape)
        # Update error with only the weights.
        error = np.dot(error, self.weights[i][:-1, :].T)
        # print(self.learning_rate, gradient.shape, len(self.weights))
    # Again for sigmoid backprop.
    error_M = error * hid_layer_out[0] * (1 - hid_layer_out[0])
    # error_M = error * (1 - error)  # alternatively try with this
    gradient = np.dot(x_train, error_M) / N
    db = [(self.learning_rate / 10) * np.mean(error_M, axis=0)]
    temp = self.learning_rate * gradient
    temp = np.concatenate([temp, db])
    self.weights[0] = self.weights[0] - temp
python debugging neural-network loss cross-entropy
edited 2 days ago
asked Nov 17 at 14:36
kerem kurban
12
Hi, could you at least describe the model architecture here? That would help us with diagnosing the issue.
– Róbert Druska
Nov 17 at 14:50
28x28 images are flattened at the input layer, so a 784 x batch_size array comes in at each epoch. The first weight matrix (input -> HL) is 785x40 (784+1 = 785 with bias) and the second (HL -> out) is 41x26 (40+1 = 41 with bias).
– kerem kurban
Nov 17 at 21:29
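For concreteness, those shapes correspond to an initialization along these lines (a hypothetical sketch with my own variable names; the bias is folded into the last row of each matrix, as in the code above):

import numpy as np

rng = np.random.default_rng(0)
w_in_hidden = rng.normal(0.0, 0.01, size=(785, 40))   # 784 pixels + 1 bias row -> 40 hidden units
w_hidden_out = rng.normal(0.0, 0.01, size=(41, 26))   # 40 hidden units + 1 bias row -> 26 classes
weights = [w_in_hidden, w_hidden_out]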