Deep learning from scratch: Debug (Loss increases)
I'm trying to implement a neural network with a single hidden layer of 40 neurons on a handwritten-letter dataset (26 classes). I know convnets are better suited for this task, but from what I've read this approach should also work reasonably well with a cross-entropy loss and sigmoid activations.
In my code, I have included:
- mini-batches
- weight decay (0.005 per epoch)
- batch norm
- learning rate decay (currently a 10% decay every 10 epochs; I plan to add the Adam optimizer next)
- L2 regularization (loss = loss + sum of squared weights of each layer; sketched below)
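For the regularization item above, this is roughly what I mean by adding the squared weights to the loss (a minimal illustration with placeholder names, not my actual code):

import numpy as np

def l2_regularized_loss(data_loss, weights, reg_coef, n_samples):
    # data_loss: scalar cross-entropy loss for the batch
    # weights:   list of weight matrices, one per layer
    penalty = (reg_coef / (2 * n_samples)) * sum(np.sum(np.square(w)) for w in weights)
    return data_loss + penalty

# The matching term in each layer's weight gradient would then be
#     gradient += (reg_coef / n_samples) * w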
However, for some reason my loss keeps increasing. I tried initial learning rates ranging from 0.0001 to 1 and adding or removing the regularization, but could not figure out what the problem was. At one point the loss was decreasing, but the accuracy still stayed near 0.03 on both the training and the test set. Have I made a mistake in calculating the gradient?
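One way to verify the analytic gradient would be a numerical check along these lines (a rough sketch; loss_fn is a hypothetical helper that runs the forward pass with the current weights and returns the scalar loss):

import numpy as np

def gradient_check(loss_fn, w, analytic_grad, eps=1e-5, n_checks=10):
    # Compare a few analytic gradient entries against central differences.
    rng = np.random.default_rng(0)
    for _ in range(n_checks):
        idx = tuple(rng.integers(0, s) for s in w.shape)
        old = w[idx]
        w[idx] = old + eps
        loss_plus = loss_fn(w)
        w[idx] = old - eps
        loss_minus = loss_fn(w)
        w[idx] = old
        numeric = (loss_plus - loss_minus) / (2 * eps)
        print(idx, 'analytic:', analytic_grad[idx], 'numeric:', numeric)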
You can access the full code and the dataset here:
https://files.fm/u/f64dfz9a
For simplicity, I'm only adding the train function here. Can someone check whether my backprop is right? What I basically do is multiply the cross-entropy error by the derivative of the sigmoid activation, matrix-multiply the previous layer's output with that, scale by the learning rate, and subtract the result from the current weights.
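In code, the output-layer step I'm describing should reduce to something like this (a minimal sketch with placeholder names, not my actual implementation; it assumes a sigmoid output with the binary cross-entropy form, for which the delta at the output simplifies to output - target):

import numpy as np

def output_layer_update(w_out, hidden_out, output, target, learning_rate):
    # Delta at a sigmoid output with cross-entropy: dL/da * a * (1 - a) = output - target.
    n = hidden_out.shape[0]
    delta_out = output - target
    gradient = np.dot(hidden_out.T, delta_out) / n   # average over the batch
    return w_out - learning_rate * gradient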
def train(self, x_train, y_train):
    ### Takes the first elements of all inputs at the same time. Should instead take them one by one or in batches/minibatches.
    # FEEDFORWARD
    # print('x_train: ', x_train)
    hid_layer_out = []
    for i in range(self.num_of_hidden_layers):
        # First iteration is between the input and the first hidden layer.
        if i == 0:
            hid_layer_out.append(activation_func.sigmoid(np.dot(x_train.T, self.weights[i][:-1, :]) + self.weights[i][-1, :]))  # wx + b (bias added)
        # Next ones are between hidden layers only.
        else:
            hid_layer_out.append(activation_func.sigmoid(np.dot(hid_layer_out[i - 1], self.weights[i][:-1, :]) + self.weights[i][-1, :]))  # wx + b (bias added)
    # Last step is the final feedforward output of the network.
    forwardprop_out = np.dot(hid_layer_out[-1], self.weights[-1][:-1, :])
    # Add bias.
    bias = self.weights[-1][-1, :]
    bias = bias.reshape(1, len(bias))
    forwardprop_out += bias
    forwardprop_out = activation_func.sigmoid(forwardprop_out)
    # forwardprop_out = activation_func.softmax(forwardprop_out)
    # print('HID_LAYER_OUT:\n', hid_layer_out)
    # print('\nFORWARDPROP_OUT:\n', forwardprop_out)

    ##### BACKPROP
    # One-hot encoding.
    N = x_train.shape[1]
    target = np.zeros((N, self.num_output_nodes))
    target[np.arange(N), (y_train.T.astype(int) - 1)] = 1
    # Find error and loss values.
    error, loss = CEcost(forwardprop_out, target)
    if self.regularize:
        reg_coef = 2
        error = np.log(error + (reg_coef/2*N) * (np.sum(np.square(self.weights[0])) + np.sum(np.square(self.weights[1]))))
        loss = np.log(np.sum(error))
    # Record loss value of the current iteration.
    self.loss_values.append(loss)
    hid_layer_out.append(forwardprop_out)
    # print('error.shape: {} forwardprop.shape: {}'.format(error.shape, forwardprop_out.shape))
    for i in range(self.num_of_hidden_layers, 0, -1):
        # print('\n', i)
        # error_M = error * forwardprop_out * (1 - forwardprop_out)
        # For sigmoid backprop.
        error_M = error * hid_layer_out[i] * (1 - hid_layer_out[i])
        gradient = np.dot(hid_layer_out[i - 1].T, error_M) / N
        # error_mean = np.mean(error, axis=0)
        # error_mean = error_mean.reshape(len(error_mean), 1)
        db = [(self.learning_rate / 10) * np.mean(error_M, axis=0)]
        temp = self.learning_rate * gradient
        temp = np.concatenate([temp, db])
        self.weights[i] = self.weights[i] - temp
        # print('error_M shape: ', error_M.shape)
        # Update error with only the weights.
        error = np.dot(error, self.weights[i][:-1, :].T)
        # print(self.learning_rate, gradient.shape, len(self.weights))
    # Again for sigmoid backprop.
    error_M = error * hid_layer_out[0] * (1 - hid_layer_out[0])
    # error_M = error * (1 - error)  # alternatively try with this
    gradient = np.dot(x_train, error_M) / N
    db = [(self.learning_rate / 10) * np.mean(error_M, axis=0)]
    temp = self.learning_rate * gradient
    temp = np.concatenate([temp, db])
    self.weights[0] = self.weights[0] - temp
python debugging neural-network loss cross-entropy
edited 2 days ago
asked Nov 17 at 14:36
kerem kurban
12
Hi, could you at least describe the model architecture here? That would help us with diagnosing the issue.
– Róbert Druska
Nov 17 at 14:50
28x28 images are flattened at the input layer, so a 784 x batch_size array comes in at each epoch. The first weight matrix (input -> HL) is 785x40 (784+1 = 785 with bias) and the second (HL -> out) is 41x26 (40+1 = 41 with bias).
– kerem kurban
Nov 17 at 21:29
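For concreteness, those shapes correspond to an initialization along these lines (a hypothetical sketch with my own variable names; the bias is folded into the last row of each matrix, as in the code above):

import numpy as np

rng = np.random.default_rng(0)
w_in_hidden = rng.normal(0.0, 0.01, size=(785, 40))   # 784 pixels + 1 bias row -> 40 hidden units
w_hidden_out = rng.normal(0.0, 0.01, size=(41, 26))   # 40 hidden units + 1 bias row -> 26 classes
weights = [w_in_hidden, w_hidden_out]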