Counting lower vs non-lowercase tokens for tokenized text with several conditions











Assuming the text has been tokenized on whitespace for a natural language processing task, the goal is to count how often each word occurs in each of its casings (grouped under the lowercased word), subject to several conditions.



The current code works as it's supposed to, but is there a way to optimize the if-else conditions to make them cleaner or more direct?



First, the function has to determine whether:

  • the token is an XML tag; if so, ignore it and move on to the next token

  • the token is in a predefined list of delayed sentence-start symbols; if so, ignore it and move on to the next token


# Skip XML tags.
if re.search(r"(<\S[^>]*>)", token):
    continue
# Skip sentence-start symbols.
elif token in self.DELAYED_SENT_START:
    continue
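One possible way to tidy this step is to fold both skip checks into a single predicate. This is only a minimal sketch: the names `XML_TAG`, `DELAYED_SENT_START` (as a module-level `frozenset`) and `should_skip` are assumptions made for illustration, and the tag pattern is assumed to mean "`<` followed by a non-space character".

```python
import re

# Hypothetical module-level constants; the names are assumptions for this sketch.
XML_TAG = re.compile(r"<\S[^>]*>")  # compiled once instead of on every token
DELAYED_SENT_START = frozenset(["(", "[", '"', "'", "&apos;", "&quot;", "]"])


def should_skip(token):
    """Return True for XML tags and delayed sentence-start symbols."""
    return bool(XML_TAG.search(token)) or token in DELAYED_SENT_START


tokens = "<s> ( Hello world ) </s>".split()
kept = [t for t in tokens if not should_skip(t)]
print(kept)
```

Inside the loop, the two branches would then shrink to a single `if should_skip(token): continue`.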


Then it decides whether to toggle is_first_word, which tracks whether the token is the first word of a sentence; note that a line can contain several sentences.

  • if the token is in a predefined list of sentence-ending symbols and is_first_word is False, set is_first_word to True and move on to the next token

  • if there is nothing to case, because none of the characters matches the letter regex, set is_first_word to False and move on to the next token



# Resets the `is_first_word` after seeing sent end symbols.
if not is_first_word and token in self.SENT_END:
    is_first_word = True
    continue

# Skips words with nothing to case.
if not re.search(self.SKIP_LETTERS_REGEX, token):
    is_first_word = False
    continue
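The sentence-boundary bookkeeping could also be separated from the weighting by moving it into a generator. The following is a sketch under assumptions, not the actual implementation: `with_first_word_flag` and `has_letters` are hypothetical names, `has_letters` uses `str.isalpha` as a stand-in for the Perl Unicode letter regex, and the XML and sentence-start skipping is left out for brevity.

```python
SENT_END = frozenset([".", ":", "?", "!"])


def has_letters(token):
    # Stand-in for the Perl Unicode letter regex; an assumption of this sketch.
    return any(ch.isalpha() for ch in token)


def with_first_word_flag(tokens):
    """Yield (token, is_first_word) for every token that has something to case."""
    is_first_word = True
    for token in tokens:
        # A sentence-end symbol re-arms the flag for the next word.
        if not is_first_word and token in SENT_END:
            is_first_word = True
            continue
        # Tokens without letters are not counted and clear the flag.
        if not has_letters(token):
            is_first_word = False
            continue
        yield token, is_first_word
        is_first_word = False


pairs = list(with_first_word_flag("Hello world . It works !".split()))
print(pairs)
```

The caller then only sees tokens that can be weighted, together with the flag it needs.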


Finally, after filtering out the tokens that cannot be weighted, the function updates the weight.

The weight starts at 0 and is set to 1 if the token is not the first word of a sentence.

If the token is the first word and the possibly_use_first_token option is set, check whether the first character of the token is lowercase; if so, count the word with a full weight of 1. Otherwise, if i == 1, assign it a weight of 0.1, which is better than leaving the weight at 0.

Finally, update the counts if the weight is non-zero, and set the is_first_word toggle to False.



current_word_weight = 0
if not is_first_word:
    current_word_weight = 1
elif possibly_use_first_token:
    # Gated special handling of first word of sentence.
    # Check if first character of token is lowercase.
    if token[0].islower():
        current_word_weight = 1
    elif i == 1:
        current_word_weight = 0.1

if current_word_weight > 0:
    casing[token.lower()][token] += current_word_weight

is_first_word = False
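The nested branches for the weight might read more directly as a small function with early returns. This is a sketch that mirrors the conditions described above; the function name `word_weight` and its parameter order are assumptions for illustration.

```python
def word_weight(token, is_first_word, possibly_use_first_token, i):
    """Return the evidence weight for one token (0 means: do not count it)."""
    if not is_first_word:
        return 1
    if not possibly_use_first_token:
        return 0
    # Gated special handling of the first word of a sentence.
    if token[0].islower():
        return 1
    return 0.1 if i == 1 else 0


print(word_weight("the", False, False, 3))
print(word_weight("The", True, True, 1))
```

The loop body would then reduce to computing `weight = word_weight(...)`, adding it to the counter only when it is non-zero, and resetting `is_first_word`.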


The full code is in the train() function below:



import re

from collections import defaultdict, Counter
from six import text_type

from sacremoses.corpus import Perluniprops
from sacremoses.corpus import NonbreakingPrefixes

perluniprops = Perluniprops()


class MosesTruecaser(object):
    """
    This is a Python port of the Moses Truecaser from
    https://github.com/moses-smt/mosesdecoder/blob/master/scripts/recaser/train-truecaser.perl
    https://github.com/moses-smt/mosesdecoder/blob/master/scripts/recaser/truecase.perl
    """
    # Perl Unicode Properties character sets.
    Lowercase_Letter = text_type(''.join(perluniprops.chars('Lowercase_Letter')))
    Uppercase_Letter = text_type(''.join(perluniprops.chars('Uppercase_Letter')))
    Titlecase_Letter = text_type(''.join(perluniprops.chars('Uppercase_Letter')))

    def __init__(self):
        # Initialize the object.
        super(MosesTruecaser, self).__init__()
        # Character class matching any letter that can be cased.
        self.SKIP_LETTERS_REGEX = r"[{}{}{}]".format(
            self.Lowercase_Letter, self.Uppercase_Letter, self.Titlecase_Letter)

        self.SENT_END = [".", ":", "?", "!"]
        self.DELAYED_SENT_START = ["(", "[", "\"", "'", "&apos;", "&quot;", "[", "]"]

    def train(self, filename, possibly_use_first_token=False):
        casing = defaultdict(Counter)
        with open(filename) as fin:
            for line in fin:
                # Keep track of first words in the sentence(s) of the line.
                is_first_word = True
                for i, token in enumerate(line.split()):
                    # Skip XML tags.
                    if re.search(r"(<\S[^>]*>)", token):
                        continue
                    # Skip sentence-start symbols.
                    elif token in self.DELAYED_SENT_START:
                        continue

                    # Resets the `is_first_word` after seeing sent end symbols.
                    if not is_first_word and token in self.SENT_END:
                        is_first_word = True
                        continue

                    # Skips words with nothing to case.
                    if not re.search(self.SKIP_LETTERS_REGEX, token):
                        is_first_word = False
                        continue

                    current_word_weight = 0
                    if not is_first_word:
                        current_word_weight = 1
                    elif possibly_use_first_token:
                        # Gated special handling of first word of sentence.
                        # Check if first character of token is lowercase.
                        if token[0].islower():
                            current_word_weight = 1
                        elif i == 1:
                            current_word_weight = 0.1

                    if current_word_weight > 0:
                        casing[token.lower()][token] += current_word_weight

                    is_first_word = False
        return casing
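To make the shape of the returned `casing` structure concrete, here is a small illustration that does not depend on sacremoses. The (token, weight) pairs are made up for the example rather than produced by `train()`, and the `best` lookup is only one hypothetical way to read the counts back.

```python
from collections import defaultdict, Counter

casing = defaultdict(Counter)

# Made-up (token, weight) pairs, as the loop in train() might emit them.
observations = [("the", 1), ("The", 1), ("the", 1), ("Paris", 1), ("The", 0.1)]
for token, weight in observations:
    # Group every surface form under its lowercased key.
    casing[token.lower()][token] += weight

# Most heavily weighted surface form per lowercased word.
best = {lowered: forms.most_common(1)[0][0] for lowered, forms in casing.items()}
print(dict(casing))
print(best)
```

Each key is a lowercased word and each value is a `Counter` mapping the observed surface forms to their accumulated weights.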









  • Could you provide some example inputs and the corresponding outputs?
    – 200_success
    11 mins ago
















































python regex natural-language-processing














edited 14 mins ago by 200_success

asked 18 mins ago by alvas



























