Find most occurring words in text file












2















I have a log file that records the category and subcategory names that failed with an error message. My goal is to find the most frequently occurring categories.



Example log:



Mon, 26 Nov 2018 07:51:07 +0100 | 164: [ERROR ***] Category ID not found for 'mcat-name1' 'subcat-name1' ref: '073' 
Mon, 26 Nov 2018 07:51:08 +0100 | 278: [ERROR ***] Category ID not found for 'mcat-name2' 'subcat-name2' ref: '020'


Now I want to identify the top 10 categories that failed.



Using sed:



sed -e 's/\s/\n/g' < file.log | grep ERROR | sort | uniq -c | sort -nr | head -10


I am getting 1636 [ERROR



whereas I was looking for a list of categories sorted by number of occurrences, e.g.:



139 category1
23 category 2
...


































  • Please post more explanatory samples of input and output in your post and let us know then as it is not clear.

    – RavinderSingh13
    Nov 26 '18 at 7:28











  • agree @RavinderSingh13 - there is no category1 in your example, but you want it in the output; also fix the question title - it seems you are looking for a count of something, not the word itself

    – Drako
    Nov 26 '18 at 7:30
















unix command-line text-processing






edited Nov 27 '18 at 14:29







merlin

















asked Nov 26 '18 at 7:25









5 Answers


















1














You say you want to do the counting with sed, but you actually have an entire pipeline of sed, grep, sort, uniq and head. Generally, when this happens, your problem is screaming for awk:



awk 'BEGIN{FS="\047"; PROCINFO["sorted_in"]="@val_num_desc"}
/\[ERROR /{c[$2]++}
END{for(i in c) { print c[i],i; if(++j == 10) exit } }' file


The above is a GNU awk solution, as it relies on non-POSIX features such as controlling the order of array traversal (PROCINFO["sorted_in"]). The field separator is set to the single quote ('), written as the octal escape \047, on the assumption that the category name is enclosed in single quotes.



If you are not using GNU awk, you could use sort and head or do the sorting yourself. One way is:



awk 'BEGIN{FS="\047"; n=10 }
/\[ERROR /{ c[$2]++ }
END {
  for (l in c) {
    for (i=1;i<=n;++i) {
      if (c[l] > c[s[i]]) {
        for(j=n;j>i;--j) s[j]=s[j-1];
        s[i]=l
        break
      }
    }
  }
  for (i=1;i<=n;++i) {
    if (s[i]=="") break
    print c[s[i]], s[i]
  }
}' file


or just do:



awk 'BEGIN{FS="\047"}
/\[ERROR /{c[$2]++}
END{for(i in c) print c[i],i}' file |
sort -nr | head -10
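As a portable variant of the same idea, here is a minimal sketch that counts each category/subcategory pair with plain awk and leaves the ordering to sort. It assumes every error line carries two single-quoted names, as in the sample lines in the question; the function name top_categories and its arguments are made up for illustration.

```shell
# Hypothetical helper: top_categories LOGFILE [N]
# Prints up to N (default 10) lines of the form "count category subcategory".
top_categories() {
  # Split each line on single quotes: $2 is the first quoted name,
  # $4 the second one (assumption based on the sample log lines).
  awk -F"'" '/\[ERROR / { count[$2 " " $4]++ }
             END { for (k in count) print count[k], k }' "$1" |
    # Highest count first; ties are ordered by name so the output is stable.
    sort -k1,1nr -k2 |
    head -n "${2:-10}"
}
```

For example, top_categories file.log 10 would list the ten most frequent pairs, most frequent first.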































  • awk seems to be the best solution. The result lists each category with a number, presumably the number of occurrences, but it is not sorted by most occurrences.

    – merlin
    Nov 27 '18 at 10:29













  • @merlin do you have an example of the input file?

    – kvantour
    Nov 27 '18 at 10:32











  • Here is another line which has more txt after ref number. make and model are the ones I try to count in order to identify the cats with most errors: Mon, 26 Nov 2018 07:51:21 +0100 | 1232: [ERROR ***] Category ID not found for 'make' 'model' ref: '228239'

    – merlin
    Nov 27 '18 at 10:42













  • @merlin, could you please provide us with more than one more line. We need a sample of your input.

    – kvantour
    Nov 27 '18 at 10:48











  • How can I add a file to stackoverflow? awk version 20070501

    – merlin
    Nov 27 '18 at 10:51



















0














You got 1636 [ERROR because you first turn every whitespace character into a newline, then grep for the word ERROR, and only then count.



This :



sed -e 's/\s/\n/g' < file.log | grep ERROR


Gives you this :



[ERROR
[ERROR
[ERROR
[ERROR
[ERROR
[ERROR
... (1630 more)


You need to grep first then sed (pretty sure you can do better with sed but I'm just talking about the logic behind the commands) :



grep ERROR file.log | sed -e 's/\s/\n/g' | sort | uniq -c | sort -nr | head -10


This may not be the best solution as it counts the word ERROR and other useless words, but you didn't give us a lot of information on the input file.
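The effect of the ordering can be reproduced on a single line. The sketch below uses one of the sample lines from the question and, as an assumption made for portability, swaps the sed substitution for tr to split on spaces; the two function names are hypothetical.

```shell
# One sample line in the format shown in the question.
line="Mon, 26 Nov 2018 07:51:07 +0100 | 164: [ERROR ***] Category ID not found for 'mcat-name1' 'subcat-name1' ref: '073'"

# Split first, filter second: only the word containing ERROR survives.
split_then_grep() {
  printf '%s\n' "$line" | tr -s ' ' '\n' | grep ERROR
}

# Filter first, split second: every word of the matching line survives.
grep_then_split() {
  printf '%s\n' "$line" | grep ERROR | tr -s ' ' '\n'
}

split_then_grep            # prints: [ERROR
grep_then_split | wc -l    # 19 lines, one per word of the log line
```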
































  • Of course, sed is perfectly able to perform the work of grep, too. See useless use of grep

    – tripleee
    Nov 26 '18 at 9:12



















0














Assuming 'Bulgari' is an example of a category you want to extract, try



sed -n "s/.*ERROR.*] Category '\([^']*\)'.*/\1/p" file.log |
sort | uniq -c | sort -rn | head -n 10


The sed command finds lines which match a fairly complex regular expression and captures part of the line, then replaces the match with the captured substring, and prints it (the -n option disables the default print action, so we only print the extracted lines). The rest is basically identical to what you already had.



In the regex, we look for (beginning of line followed by) anything (except a newline) followed by ERROR and later on followed by ] Category ' and then a string which doesn't contain a single quote, then the closing single quote followed by anything. The lots of "anything (except newline)" are required in order to replace the entire line with just the captured string from inside the single quotes. The backslashed parentheses are what capture an expression; google for "backref" for the full scoop.



Your original attempt would only extract the actual ERROR strings, because you replaced all the surrounding spaces with newlines (assuming vaguely that your sed accepts the Perl \s shorthand, which isn't standard in sed, and that \n gets interpreted as a literal newline in the replacement, which also isn't entirely standard or portable).
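For the log format shown in the question, where the first quoted name follows the text Category ID not found for, the same capture technique might look like the sketch below. The message wording is taken from the sample lines; the function name extract_category is hypothetical.

```shell
# Hypothetical helper: print the first quoted name of every matching
# error line. Reads the files given as arguments, or stdin if none.
extract_category() {
  # -n plus the p flag prints only lines where the substitution succeeded.
  sed -n "s/.*\[ERROR.*Category ID not found for '\([^']*\)'.*/\1/p" "$@"
}
```

It would slot into the rest of the pipeline unchanged, for example: extract_category file.log | sort | uniq -c | sort -rn | head -n 10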







































    0














The way to go is to select the lines with failed categories and replace each whole line with just the category name using sed.

Give this a try:

sed -e "s/^.* \[ERROR .*\] Category '\([^']*\)' .*$/\1/g" file.log | sort | uniq -c | sort -nr | head -16

^ is the start of the line.

\( ... \): the character sequence enclosed in these escaped parentheses can be referred to as \1 for the first pair appearing in the regex, \2 for the second pair, etc.

$ is the end of the line.

The sed command selects a line which contains [ERROR and some characters up to a ], followed by the word Category. After the space and the opening single quote, any sequence of characters up to the next single quote is captured by the pair of escaped parentheses, followed by any sequence of characters up to the end of the line. If such a line is found, it is replaced with the captured character sequence after Category.
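To illustrate numbered backreferences, this sketch captures both quoted names from a line in the question's format and prints them via \1 and \2. The function name category_pair and the " > " separator are assumptions for the example only.

```shell
# Hypothetical helper: print "category > subcategory" for each line that
# contains two adjacent single-quoted names separated by one space.
category_pair() {
  # \1 is the text inside the first pair of quotes, \2 the second.
  sed -n "s/.*'\([^']*\)' '\([^']*\)'.*/\1 > \2/p" "$@"
}

printf '%s\n' "[ERROR ***] Category ID not found for 'make' 'model' ref: '228239'" |
  category_pair    # prints: make > model
```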







































      0














Using Perl

> cat merlin.txt
Mon, 26 Nov 2018 07:51:07 +0100 | 164: [ERROR ***] Category ID not found for 'mcat-name1' 'subcat-name1' ref: '073'
Mon, 26 Nov 2018 07:51:08 +0100 | 278: [ERROR ***] Category ID not found for 'mcat-name2' 'subcat-name2' ref: '020'
Mon, 26 Nov 2018 07:51:21 +0100 | 1232: [ERROR ***] Category ID not found for 'make' 'model' ref: '228239'
> perl -ne ' { s/(.*)Category.*for(.+)ref.*/\2/g and s/(\047\S+\047)/$kv{$1}++/ge if /ERROR/} END { foreach (sort keys %kv) { print "$_ $kv{$_}\n" } } ' merlin.txt | sort -nr
'subcat-name2' 1
'subcat-name1' 1
'model' 1
'mcat-name2' 1
'mcat-name1' 1
'make' 1
>




























answered Nov 27 '18 at 9:52 by kvantour (the awk answer); edited Nov 27 '18 at 14:33













answered Nov 26 '18 at 8:54 by Corentin Limier (the "grep first, then sed" answer); edited Nov 26 '18 at 9:18













        • Of course, sed is perfectly able to perform the work of grep, too. See useless use of grep

          – tripleee
          Nov 26 '18 at 9:12



















        • Of course, sed is perfectly able to perform the work of grep, too. See useless use of grep

          – tripleee
          Nov 26 '18 at 9:12

















        Of course, sed is perfectly able to perform the work of grep, too. See useless use of grep

        – tripleee
        Nov 26 '18 at 9:12





        Of course, sed is perfectly able to perform the work of grep, too. See useless use of grep

        – tripleee
        Nov 26 '18 at 9:12











        0














        Assuming 'Bulgari' is an example of a category you want to extract, try



        sed -n "s/.*ERROR.*] Category '([^']*)'.*/1/p" file.log |
        sort | uniq -c | sort -rn | head -n 10


        The sed command finds lines which match a fairly complex regular expression and captures part of the line, then replaces the match with the captured substring, and prints it (the -n option disables the default print action, so we only print the extracted lines). The rest is basically identical to what you already had.



        In the regex, we look for (beginning of line followed by) anything (except a newline) followed by ERROR and later on followed by ] Category ' and then a string which doesn't contain a single quote, then the closing single quote followed by anything. The lots of "anything (except newline)" are required in order to replace the entire line with just the captured string from inside the single quotes. The backslashed parentheses are what capture an expression; google for "backref" for the full scoop.



        Your original attempt would only extract the actual ERROR strings, because you replaced all the surrounding spaces with newlines (assuming vaguely that your sed accepts the Perl s shorthand, which isn't standard in sed, and that n gets interpreted as a literal newline in the replacement, which also isn't entirely standard or portable).






            edited Nov 26 '18 at 9:27

            answered Nov 26 '18 at 9:17

            tripleee























                The way to go is to select the lines that report failed categories and use sed to replace each whole line with only the category name.

                Give this a try:



                sed -e "s/^.* \[ERROR .*] Category '\([^']*\)' .*$/\1/g" file.log | sort | uniq -c | sort -nr | head -n 16


                ^ is the start of the line.

                \( ... \): the character sequence enclosed in this pair of escaped parentheses can be referenced as \1 for the first pair appearing in the regex, \2 for the second pair, etc.

                $ is the end of the line.

                The sed command selects a line which contains [ERROR and some characters up to a ], followed by the word Category and a space. After the opening single quote, any sequence of characters up to the next single quote is captured by the pair of escaped parentheses; the closing quote is then followed by any sequence of characters up to the end of the line. If such a line is found, it is replaced with the captured character sequence after Category. Lines that do not match are printed unchanged, because the command does not use -n.
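As a sketch of how \1 and \2 can be combined, the following counts pairs of main category and subcategory. It assumes the line format from the question's sample (Category ID not found for 'mcat' 'subcat' ref: ...), not the Category 'name' format targeted above, and the function name top_category_pairs is hypothetical. It also uses -n with the p flag so that non-matching lines are dropped rather than counted.

```shell
#!/bin/sh
# Hypothetical helper: count (main category, subcategory) pairs.
# Assumes lines shaped like the question's sample:
#   ... [ERROR ***] Category ID not found for 'mcat' 'subcat' ref: '073'
top_category_pairs() {
    # $1 = log file, $2 = number of entries to show (defaults to 10)
    # \1 is the first quoted name and \2 the second; only lines where
    # the substitution succeeded are printed.
    sed -n "s/^.*\[ERROR .*] Category ID not found for '\([^']*\)' '\([^']*\)' .*$/\1 \2/p" "$1" |
    sort | uniq -c | sort -rn | head -n "${2:-10}"
}
```

Counting pairs rather than single names keeps two subcategories with the same name under different main categories apart.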






                    edited Nov 27 '18 at 7:19

                    answered Nov 26 '18 at 9:40

                    Jay jargot























                        Using Perl



                        > cat merlin.txt
                        Mon, 26 Nov 2018 07:51:07 +0100 | 164: [ERROR ***] Category ID not found for 'mcat-name1' 'subcat-name1' ref: '073'
                        Mon, 26 Nov 2018 07:51:08 +0100 | 278: [ERROR ***] Category ID not found for 'mcat-name2' 'subcat-name2' ref: '020'
                        Mon, 26 Nov 2018 07:51:21 +0100 | 1232: [ERROR ***] Category ID not found for 'make' 'model' ref: '228239'
                        > perl -ne ' { s/(.*)Category.*for(.+)ref.*/\2/g and s/(\47\S+\47)/$kv{$1}++/ge if /ERROR/} END { foreach (sort keys %kv) { print "$_ $kv{$_}\n" } } ' merlin.txt | sort -nr
                        'subcat-name2' 1
                        'subcat-name1' 1
                        'model' 1
                        'mcat-name2' 1
                        'mcat-name1' 1
                        'make' 1
                        >





                            answered Nov 27 '18 at 14:57

                            stack0114106





























