Deep learning from scratch: Debug (Loss increases)

I'm trying to implement a neural network with a single hidden layer of 40 neurons on a handwritten-letter dataset (26 classes). I know convnets are better suited for this task, but I've seen that this approach should also work if a cross-entropy loss and sigmoid activation functions are used.



In my code, I have included:
- mini-batches
- weight decay (0.005 per epoch)
- batch norm
- learning rate decay (currently a 10% decay every 10 epochs; I will add the Adam optimizer next)
- regularization (loss = loss + sum of squared weights of each layer); a rough sketch of this and the decay schedule follows this list
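
To make the last two items concrete, here is a rough sketch of how I understand L2 regularization and step-based learning-rate decay are usually applied. The names (l2_penalty, decayed_lr, base_lr) are placeholders for illustration, not taken from my actual code:

import numpy as np

def l2_penalty(weights, reg_coef, n_samples):
    # lambda / (2N) times the sum of squared weights over all layers
    return (reg_coef / (2 * n_samples)) * sum(np.sum(w ** 2) for w in weights)

def decayed_lr(base_lr, epoch, decay=0.10, every=10):
    # shrink the learning rate by `decay` once every `every` epochs
    return base_lr * (1.0 - decay) ** (epoch // every)

# loss = data_loss + l2_penalty(weights, reg_coef, N)
# and each layer's weight gradient then gets an extra (reg_coef / N) * W term.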



However, for some reason my loss keeps increasing. I tried initial learning rates ranging from 0.0001 to 1 and adding or removing the regularization, but I could not figure out what the problem was. At one point the loss was decreasing, but accuracy was still around 0.03 on both the train and test sets. Have I made a mistake in calculating the gradient?
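
One thing I plan to try is comparing my analytic gradient with a finite-difference estimate. Below is a minimal sketch of such a check; loss_fn is a hypothetical helper that returns the scalar loss for a given weight matrix and is not part of my code:

import numpy as np

def gradient_check(loss_fn, w, analytic_grad, eps=1e-5, n_checks=10):
    # Compare a few random entries of the analytic gradient against central differences.
    for i in np.random.randint(0, w.size, size=n_checks):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus.flat[i] += eps
        w_minus.flat[i] -= eps
        numeric = (loss_fn(w_plus) - loss_fn(w_minus)) / (2 * eps)
        analytic = analytic_grad.flat[i]
        rel_err = abs(numeric - analytic) / max(1e-12, abs(numeric) + abs(analytic))
        print('index {}: numeric {:.6e} analytic {:.6e} rel_err {:.2e}'.format(i, numeric, analytic, rel_err))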



You can access the code and the dataset here:
https://files.fm/u/f64dfz9a



For simplicity, I'm only including the train function here. Can someone check whether my backprop is right? What I basically do is: multiply the cross-entropy error by the derivative of the sigmoid activation function, matrix-multiply the previous layer's output with that, multiply the result by the learning rate, and subtract it from the corresponding weights.
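
Written out for a single layer, the update I'm describing looks roughly like this. It is only a sketch with placeholder names; `error` stands for whatever my CEcost function returns, and the bias is kept separate here even though in my code it lives in the last row of each weight matrix:

import numpy as np

def single_layer_update(prev_out, layer_out, error, W, b, lr):
    # delta = (cross-entropy error) * sigmoid'(z), using sigmoid'(z) = out * (1 - out)
    delta = error * layer_out * (1 - layer_out)
    dW = np.dot(prev_out.T, delta) / prev_out.shape[0]  # average gradient over the batch
    db = np.mean(delta, axis=0)
    return W - lr * dW, b - lr * db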




def train(self, x_train, y_train):
    ### Takes all inputs at the same time; should instead take them one by one or in batches / minibatches.

    # FEEDFORWARD
    # print('x_train: ', x_train)
    hid_layer_out = []
    for i in range(self.num_of_hidden_layers):

        # First iteration is between the input and the first hidden layer.
        if i == 0:
            # wx + b (bias stored in the last row of the weight matrix)
            hid_layer_out.append(activation_func.sigmoid(np.dot(x_train.T, self.weights[i][:-1, :]) + self.weights[i][-1, :]))

        # The next ones are between hidden layers only.
        else:
            # wx + b (bias stored in the last row of the weight matrix)
            hid_layer_out.append(activation_func.sigmoid(np.dot(hid_layer_out[i - 1], self.weights[i][:-1, :]) + self.weights[i][-1, :]))

    # Final feedforward output of the network.
    forwardprop_out = np.dot(hid_layer_out[-1], self.weights[-1][:-1, :])
    # Add bias
    bias = self.weights[-1][-1, :]
    bias = bias.reshape(1, len(bias))
    forwardprop_out += bias
    forwardprop_out = activation_func.sigmoid(forwardprop_out)
    # forwardprop_out = activation_func.softmax(forwardprop_out)
    # print('HID_LAYER_OUT:\n', hid_layer_out)
    # print('\nFORWARDPROP_OUT:\n', forwardprop_out)

    ##### BACKPROP
    # One-hot encoding.
    N = x_train.shape[1]
    target = np.zeros((N, self.num_output_nodes))
    target[np.arange(N), (y_train.T.astype(int) - 1)] = 1

    # Find error and loss values.
    error, loss = CEcost(forwardprop_out, target)
    if self.regularize:
        reg_coef = 2
        error = np.log(error + (reg_coef / 2 * N) * (np.sum(np.square(self.weights[0])) + np.sum(np.square(self.weights[1]))))
        loss = np.log(np.sum(error))
    # Record the loss value of the current iteration.
    self.loss_values.append(loss)

    hid_layer_out.append(forwardprop_out)
    # print('error.shape: {} forwardprop.shape: {}'.format(error.shape, forwardprop_out.shape))
    for i in range(self.num_of_hidden_layers, 0, -1):
        # print('\n', i)
        # error_M = error * forwardprop_out * (1 - forwardprop_out)

        # For sigmoid backprop.
        error_M = error * hid_layer_out[i] * (1 - hid_layer_out[i])

        gradient = np.dot(hid_layer_out[i - 1].T, error_M) / N

        # error_mean = np.mean(error, axis=0)
        # error_mean = error_mean.reshape(len(error_mean), 1)

        db = [(self.learning_rate / 10) * np.mean(error_M, axis=0)]
        temp = self.learning_rate * gradient
        temp = np.concatenate([temp, db])

        self.weights[i] = self.weights[i] - temp
        # print('error_M shape: ', error_M.shape)

        # Propagate the error backwards using only the weights (bias row excluded).
        error = np.dot(error, self.weights[i][:-1, :].T)

        # print(self.learning_rate, gradient.shape, len(self.weights))

    # Again sigmoid backprop, now for the input -> first hidden layer weights.
    error_M = error * hid_layer_out[0] * (1 - hid_layer_out[0])
    # error_M = error * (1 - error)  # alternatively try this

    gradient = np.dot(x_train, error_M) / N

    db = [(self.learning_rate / 10) * np.mean(error_M, axis=0)]
    temp = self.learning_rate * gradient
    temp = np.concatenate([temp, db])

    self.weights[0] = self.weights[0] - temp









python debugging neural-network loss cross-entropy

asked Nov 17 at 14:36
edited 2 days ago
kerem kurban

  • Hi, if you could at least write out the model architecture here, that would help us diagnose the issue.
    – Róbert Druska
    Nov 17 at 14:50










  • 28x28 images are flattened at the input layer, so a 784 x batch_size array comes in at each epoch. The first weight matrix (input -> hidden layer) is 785x40 (784 + 1 = 785 with the bias row) and the second (hidden layer -> output) is 41x26 (40 + 1 = 41 with the bias row).
    – kerem kurban
    Nov 17 at 21:29
















