Calculate the average word vector of a tweet using word2vec











I have a script that predicts the sentiment of each tweet in a list as either positive or negative.



My biggest issue is the step that calculates the average word vector of a tweet using two nested for loops. This is slow, and when I have too many tweets it consumes all my memory. The function creates a matrix that I will use to predict the sentiment (positive or negative) of the tweets.



Here is my code for creating the matrix:



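def create_X(list_of_tweets, w2v, features):
    # One row per tweet, one column per embedding dimension
    X = np.zeros((len(list_of_tweets), features))

    for indeks, tweet in enumerate(list_of_tweets):
        for word in tweet:
            try:
                # NOTE: this reads the global model_tot; the w2v argument is unused
                X[indeks, :] = X[indeks, :] + model_tot.wv[str(word)]
            except KeyError:
                # word is not in the model's vocabulary
                pass
        N = len(tweet)
        if N > 0:  # in case tweet is empty
            X[indeks] = X[indeks] / N  # taking average of word vectors
    return X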
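
For comparison, here is a minimal sketch of the same averaging step with the inner loop replaced by a list comprehension and a single `np.mean` call per tweet. It is an illustration under stated assumptions, not a drop-in replacement: `average_vectors`, the `vectors` argument and the toy vocabulary are hypothetical names, `vectors` is assumed to be any mapping that supports `word in vectors` and `vectors[word]` (as far as I know, gensim's `model.wv` behaves this way), and the matrix is allocated as `float32`, which halves its size compared with NumPy's default `float64`. Unlike the function above, this sketch averages over the words found in the vocabulary rather than over all words in the tweet.

```python
import numpy as np


def average_vectors(tweets, vectors, features, dtype=np.float32):
    """Return one averaged word vector per tweet.

    `vectors` is any mapping that supports `word in vectors` and
    `vectors[word]` (a plain dict here; an assumption for illustration).
    """
    X = np.zeros((len(tweets), features), dtype=dtype)
    for i, tweet in enumerate(tweets):
        # Collect the vectors of the words that are in the vocabulary
        known = [vectors[w] for w in tweet if w in vectors]
        if known:  # leave the row at zero when no word is in the vocabulary
            X[i] = np.mean(known, axis=0)
    return X


# Toy vocabulary with 3-dimensional vectors (made-up values).
toy = {
    "good": np.array([1.0, 0.0, 2.0]),
    "day": np.array([3.0, 2.0, 0.0]),
}
X_demo = average_vectors([["good", "day"], ["unknown"], []], toy, features=3)
```

The membership test replaces the try/except, so unknown words are skipped without raising and catching an exception for each one.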


Here is the complete script:



import numpy as np
import gensim
from gensim.models import Word2Vec
import csv
from sklearn import naive_bayes as nb
from sklearn.linear_model import LogisticRegression

def open_file(fileName):
    with open(str(fileName), "r", encoding="utf8") as sample:
        s = sample.readlines()
    return s

def create_X(list_of_tweets, w2v, features):
    # One row per tweet, one column per embedding dimension
    X = np.zeros((len(list_of_tweets), features))

    for indeks, tweet in enumerate(list_of_tweets):
        for word in tweet:
            try:
                # NOTE: this reads the global model_tot; the w2v argument is unused
                X[indeks, :] = X[indeks, :] + model_tot.wv[str(word)]
            except KeyError:
                # word is not in the model's vocabulary
                pass
        N = len(tweet)
        if N > 0:
            X[indeks] = X[indeks] / N
    return X

def processTrainingData(list_of_tweets):
    list_of_tweets = list(set(list_of_tweets))  # remove duplicates
    list_of_tweets = [gensim.utils.simple_preprocess(line) for line in list_of_tweets]  # simple preprocessing
    return list_of_tweets

def createWordEmbedding(list_of_tweets, features, epoc):
    # size= is the gensim 3.x parameter name; gensim 4.x renamed it to vector_size
    model = Word2Vec(list_of_tweets, size=features, window=5, min_count=1, workers=4)
    model.train(list_of_tweets, total_examples=len(list_of_tweets), epochs=epoc)
    return model

def save_csv(fileName, test_y):
    ids = np.arange(len(test_y))
    # newline='' is what the csv module documentation recommends for writers
    with open(fileName, 'w', newline='') as csvfile:
        tempwriter = csv.writer(csvfile)
        tempwriter.writerow(["Id", "Prediction"])
        count = 0
        for row in test_y:
            if row == 0:
                row = -1  # negative tweets are written as -1
            tempwriter.writerow([(ids[count]) + 1, str(row)])
            count = count + 1

def train(method, x, y, x_test):
    met = method.fit(x, y)
    test_y = met.predict(x_test)
    return test_y


features = 350
epoc = 50
spos = open_file("train_pos.txt")
sneg = open_file("train_neg.txt")

spos = processTrainingData(spos)
sneg = processTrainingData(sneg)

y = [1]*len(spos) + [0]*len(sneg)

stotal = spos + sneg

model_tot = createWordEmbedding(stotal, features, epoc)

X = create_X(stotal, model_tot, features)


testd = open_file("test_data.txt")
testd = [gensim.utils.simple_preprocess(line) for line in testd]

model_test = createWordEmbedding(testd, features, epoc)

X_test = create_X(testd, model_test, features)

# Build logistic regression classifier to identify the polarity of tweets
test_y = train(LogisticRegression(), X, y, X_test)

# Build naive bayes classifier to identify the polarity of tweets
test_y_nb = train(nb.GaussianNB(), X, y, X_test)  # this one isn't working


save_csv('test_resultLR.csv', test_y)
save_csv('test_resultNB.csv', test_y_nb)


What I primarily need help with is code that uses less memory; faster code would be a bonus. I was thinking of using a list comprehension, but I am unable to figure out a good way to do that.
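
To make the memory side concrete: a dense matrix of 1,000,000 tweets by 350 features takes 1,000,000 × 350 × 8 bytes = 2.8 GB in `float64`, and half of that in `float32`. The sketch below does that arithmetic and shows one possible way to avoid holding the whole matrix at once, by yielding the averaged vectors in batches. All names here (`matrix_bytes`, `iter_batches`, `vectors`, the toy vocabulary and the batch size) are hypothetical, and `vectors` is again assumed to be a mapping that supports `in` and indexing. Batches only reduce peak memory if the downstream classifier can be trained incrementally (for example through a `partial_fit` method), which the script above does not currently do.

```python
import numpy as np


def matrix_bytes(n_rows, n_cols, dtype):
    """Size in bytes of a dense n_rows x n_cols array of the given dtype."""
    return n_rows * n_cols * np.dtype(dtype).itemsize


def iter_batches(tweets, vectors, features, batch_size=10000):
    """Yield averaged-vector matrices for `batch_size` tweets at a time."""
    for start in range(0, len(tweets), batch_size):
        chunk = tweets[start:start + batch_size]
        X = np.zeros((len(chunk), features), dtype=np.float32)
        for i, tweet in enumerate(chunk):
            # Average only over words that are in the vocabulary
            known = [vectors[w] for w in tweet if w in vectors]
            if known:
                X[i] = np.mean(known, axis=0)
        yield X


# Memory needed for one million tweets with 350 features
full64 = matrix_bytes(1_000_000, 350, np.float64)
full32 = matrix_bytes(1_000_000, 350, np.float32)

# Toy vocabulary with 2-dimensional vectors (made-up values)
toy = {"a": np.array([2.0, 4.0]), "b": np.array([4.0, 0.0])}
batches = list(iter_batches([["a"], ["a", "b"], ["zzz"]], toy,
                            features=2, batch_size=2))
```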



model_tot.wv[str(word)] gives a vector of length = features:



model_tot.wv[str(word)]





























  • Hi Bamsebu, here at Code Review, we review working and complete code. We aren't here to help you solve bugs in your code.
    – Quill
    7 hours ago










  • @Quill, thanks for letting me know! I may have misunderstood the rules here, or I have been a bit unclear, but my code is working. The problem is that it is slow, and when the number of tweets approaches a million, my PC is in agony. If that's not the type of problem meant for this site, I will search elsewhere.
    – bamsebu
    7 hours ago










  • Does it work okay for small inputs, but then exceeds time/memory limits for larger inputs? If so, that should be okay for this site. Refer to the discussion on this meta post
    – Sᴀᴍ Onᴇᴌᴀ
    58 mins ago

















python machine-learning memory-optimization






edited 3 hours ago by Jamal






asked 8 hours ago by bamsebu



