Text cleaning script, producing lowercase words with minimal punctuation











I created the following script to clean text that I scraped. Ideally, the clean text would be lowercase words without numbers, with at most commas and a dot at the end of a sentence. It should contain only single white-space between words, and all "\n" characters should be removed from the text.



In particular, I'm interested in feedback on the following code:



def cleaning(text):

    import string
    exclude = set(string.punctuation)

    import re
    # remove newlines and digits with regular expressions
    text = re.sub(r'\n', '', text)
    text = re.sub(r'\d', '', text)
    # remove patterns matching URL format
    url_pattern = r'((http|ftp|https)://)?[\w\-_]+(\.[\w\-_]+)+([\w\-.,@?^=%&:/~+#]*[\w\-@?^=%&/~+#])?'
    text = re.sub(url_pattern, ' ', text)
    # remove non-ASCII characters
    text = ''.join(character for character in text if ord(character) < 128)
    # remove punctuation
    text = ''.join(character for character in text if character not in exclude)
    # standardize white space
    text = re.sub(r'\s+', ' ', text)
    # drop capitalization
    text = text.lower()
    # remove leading and trailing white space
    text = text.strip()

    return text
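For concreteness, a sample call of the function above (the input string here is made up):

print(cleaning("Check https://example.com/foo!\nPrice: 123 EUR."))
# check price eur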


The text is then cleaned via



import numpy as np  # df is the scraped DataFrame with a 'text' column

cleaner = lambda x: cleaning(x)
df['text_clean'] = df['text'].apply(cleaner)
# Replace and remove empty rows
df['text_clean'] = df['text_clean'].replace('', np.nan)
df = df.dropna(how='any')


So far, the script does the job, which is great. However, how could it be improved, or written more cleanly?



What is also unclear to me is the difference between



text = re.sub(r'\n', '', text)
text = re.sub('\n', '', text)


and whether



text = re.sub(r'\s+', ' ', text)
...
text = text.strip()


makes sense.
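As an aside, the raw-string question can be checked directly in a REPL. This is generic Python behavior, not specific to this script:

import re

# '\n' is one character (a literal newline); r'\n' is two characters
# (backslash + n) that the regex engine itself interprets as a newline,
# so both substitutions do the same thing here:
print(len('\n'), len(r'\n'))      # 1 2
print(re.sub(r'\n', '', 'a\nb'))  # ab
print(re.sub('\n', '', 'a\nb'))   # ab

# For sequences like \d, the non-raw form '\d' also still works because
# Python leaves unrecognized escapes intact, but recent Python versions
# warn about them, so the raw-string habit is the safer one.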










Tags: python, strings, parsing, regex






asked Feb 2 at 16:08 by TensorJoe, edited Feb 2 at 18:44 by 200_success












  • Also, maybe text = ''.join(character for character in text if ord(character) < 128) can be replaced by .decode('utf-8')?
    – TensorJoe
    Feb 2 at 21:01






  • cleaner is not necessary. Just do .apply(cleaning).
    – Dair
    Feb 2 at 22:34










  • Good suggestion. I actually wondered, too, why another script did it. Performance?
    – TensorJoe
    Feb 3 at 8:14
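For reference, the simplification from Dair's comment would look like this; a minimal sketch assuming the cleaning function above and a made-up DataFrame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'text': ['Hello, World!\n', 'http://example.com', '123']})

df['text_clean'] = df['text'].apply(cleaning)  # no lambda wrapper needed
df['text_clean'] = df['text_clean'].replace('', np.nan)
df = df.dropna(how='any')
# only the 'hello world' row survives: the URL-only and digits-only
# rows clean down to empty strings and are dropped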


















1 Answer






answered Feb 4 at 17:39 by Casimir et Hippolyte
Your approach consists of removing, or replacing with a space, everything that isn't a word (URLs and characters that aren't ASCII letters). Then you finish the job by removing duplicated spaces and the spaces at the beginning or end of the string, and converting everything to lowercase.



The idea makes sense.



But concretely, what is the result of this script?
It returns all the words in lowercase, separated by single spaces.



Described like that, it's easy to see that you can simply extract the words and join them with spaces. A simple re.findall(r'[a-z]+', text) suffices for that, but you have to remove the URLs first if you don't want to pick up letter sequences contained in them.



The URL pattern



If you read your URL pattern, you can see that the only part that isn't optional is in fact [\w-]+(?:\.[\w-]+)+ (written [\w\-_]+(\.[\w\-_]+)+ in your script: _ is already included in \w, a - at the end of a character class doesn't need to be escaped, and the capture group is useless).
Everything that comes after this part of the pattern doesn't require a precise description and can be replaced with \S* (zero or more non-whitespace characters). Even if that catches a closing parenthesis or a comma, it doesn't matter for what you want to do (we will see how to handle commas and dots later).



One weakness of the URL pattern is that it starts with an alternation inside an optional group. This means that at each failing position in the string, the regex engine has to test the three alternatives (http|ftp|https), and then the rest of the pattern without the whole group, all for nothing.
It's possible to improve that a little by starting the pattern with a word boundary and replacing the last alternative (https) with an optional s in the first.



The URL pattern can be rewritten like this:



\b(?:(?:https?|ftp)://)?\w[\w-]*(?:\.[\w-]+)+\S*


and the whole function:



import re

def cleaning2(text):
    text = re.sub(r'\b(?:(?:https?|ftp)://)?\w[\w-]*(?:\.[\w-]+)+\S*', ' ', text.lower())
    words = re.findall(r'[a-z]+', text)
    return ' '.join(words)
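To illustrate, here is what this produces on a small made-up sample:

sample = "Read more at https://example.com/page-1, or visit www.test-site.org!\nCall 555-1234."
print(cleaning2(sample))  # read more at or visit call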


Note that URL syntax can be particularly complex, and that it isn't always possible to reliably extract a URL from unformatted text.



If you want to keep commas and dots:

Only a few changes are needed: make sure that the \S* in the URL pattern doesn't eat a comma or a dot at the end of the URL, using a negative lookbehind (?<![.,]), and add those two characters to the character class in the re.findall pattern:



import re

def cleaning2(text):
    text = re.sub(r'\b(?:(?:https?|ftp)://)?\w[\w-]*(?:\.[\w-]+)+\S*(?<![.,])', ' ', text.lower())
    words = re.findall(r'[a-z.,]+', text)
    return ' '.join(words)
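On the same made-up sample, sentence punctuation now survives; note that the comma that followed the URL and the dot left over from the digits-only token 555-1234. remain behind as stray tokens:

sample = "Read more at https://example.com/page-1, or visit www.test-site.org!\nCall 555-1234."
print(cleaning2(sample))  # read more at , or visit call .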




