Find most occurring words in text file
I have a log file which records the category and subcategory names that failed, along with an error message. My goal is to find the most frequently occurring categories.
Example log:
Mon, 26 Nov 2018 07:51:07 +0100 | 164: [ERROR ***] Category ID not found for 'mcat-name1' 'subcat-name1' ref: '073'
Mon, 26 Nov 2018 07:51:08 +0100 | 278: [ERROR ***] Category ID not found for 'mcat-name2' 'subcat-name2' ref: '020'
Now I want to identify the top 10 categories that failed.
Using sed:
sed -e 's/\s/\n/g' < file.log | grep ERROR | sort | uniq -c | sort -nr | head -10
I am getting only:
1636 [ERROR
whereas I was looking for a list of categories sorted by number of occurrences, e.g.:
139 category1
23 category2
...
unix command-line text-processing
Please add more explanatory samples of the input and the expected output to your post; as it stands, it is not clear.
– RavinderSingh13, Nov 26 '18 at 7:28
Agree with @RavinderSingh13: there is no category1 in your example, but you want it in the output. Also fix the question title; it seems you are looking for a count of something, not the word itself.
– Drako, Nov 26 '18 at 7:30
asked Nov 26 '18 at 7:25 by merlin, edited Nov 27 '18 at 14:29
5 Answers
You say you want to do the counting with sed, but you actually have an entire pipeline of sed, grep, sort, uniq and head. Generally, when this happens, your problem is screaming for awk:
awk 'BEGIN{FS="\047"; PROCINFO["sorted_in"]="@val_num_desc"}
     /\[ERROR /{c[$2]++}
     END{for(i in c) { print c[i],i; if(++j == 10) exit } }' file
The above is a GNU awk solution, as it uses non-POSIX features such as controlling the order of array traversal (PROCINFO["sorted_in"]). The field separator is set to the single quote ('), which has octal value 047, on the assumption that the category name is enclosed in single quotes.
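To see what that field separator does, here is a minimal sketch run on a single line in the format from the question. The variable names line and fields are only for illustration.

```shell
# One sample line in the format shown in the question.
line="Mon, 26 Nov 2018 07:51:07 +0100 | 164: [ERROR ***] Category ID not found for 'mcat-name1' 'subcat-name1' ref: '073'"

# "\047" is the octal escape for a single quote, so awk splits like this:
#   $1 = text before the first quote, $2 = main category,
#   $3 = the space between the quoted names, $4 = sub category,
#   $5 = " ref: ", $6 = the ref number
fields=$(printf '%s\n' "$line" |
         awk 'BEGIN { FS = "\047" } { print $2 "|" $4 "|" $6 }')
printf '%s\n' "$fields"
```

Since the even-numbered fields are the quoted strings, c[$2]++ counts main categories only; the sub category would be $4.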
If you are not using GNU awk, you could use sort and head, or do the sorting yourself. One way is:
awk 'BEGIN{FS="\047"; n=10 }
     /\[ERROR /{ c[$2]++ }
     END {
       # s[1..n] holds the n most frequent categories, most frequent first
       for (l in c) {
         for (i=1;i<=n;++i) {
           if (c[l] > c[s[i]]) {
             for(j=n;j>i;--j) s[j]=s[j-1];
             s[i]=l
             break
           }
         }
       }
       for (i=1;i<=n;++i) {
         if (s[i]=="") break
         print c[s[i]], s[i]
       }
     }' file
or just print every count and let sort and head pick the top 10:
awk 'BEGIN{FS="\047"}
     /\[ERROR /{c[$2]++}
     END{for(i in c) print c[i],i}' file | sort -nr | head -10
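If the goal is to count the main and sub category together (the asker mentions make and model in the comments), a possible variation is to key the array on both quoted fields. This is only a sketch: the temporary file, the variable names and the fourth sample line are made up for illustration, and -F"'" is simply another way of setting the single quote as field separator.

```shell
# Hypothetical sample file; the last line is invented so that one pair repeats.
log=$(mktemp)
cat > "$log" <<'EOF'
Mon, 26 Nov 2018 07:51:07 +0100 | 164: [ERROR ***] Category ID not found for 'mcat-name1' 'subcat-name1' ref: '073'
Mon, 26 Nov 2018 07:51:08 +0100 | 278: [ERROR ***] Category ID not found for 'mcat-name2' 'subcat-name2' ref: '020'
Mon, 26 Nov 2018 07:51:21 +0100 | 1232: [ERROR ***] Category ID not found for 'make' 'model' ref: '228239'
Mon, 26 Nov 2018 07:52:02 +0100 | 1240: [ERROR ***] Category ID not found for 'make' 'model' ref: '228240'
EOF

# Count each (main category, sub category) pair, then let sort and head
# produce the top 10. This uses only POSIX awk features.
top=$(awk -F"'" '/\[ERROR / { count[$2 " / " $4]++ }
                 END { for (k in count) print count[k], k }' "$log" |
      sort -rn | head -n 10)
printf '%s\n' "$top"
rm -f "$log"
```

The first output line is the most frequent pair; pairs with equal counts come out in whatever order sort gives them.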
awk seems to be the best solution. The result lists each category with a number, presumably its occurrence count, but it is not sorted by number of occurrences.
– merlin, Nov 27 '18 at 10:29
@merlin do you have an example of the input file?
– kvantour, Nov 27 '18 at 10:32
Here is another line, which has more text after the ref number. make and model are the ones I am trying to count in order to identify the categories with the most errors: Mon, 26 Nov 2018 07:51:21 +0100 | 1232: [ERROR ***] Category ID not found for 'make' 'model' ref: '228239'
– merlin, Nov 27 '18 at 10:42
@merlin, could you please provide us with more than one extra line? We need a sample of your input.
– kvantour, Nov 27 '18 at 10:48
How can I add a file to Stack Overflow? My awk version is 20070501.
– merlin, Nov 27 '18 at 10:51
You got 1636 [ERROR because you change every space character into a newline character, then grep for the word ERROR, then count.
This:
sed -e 's/\s/\n/g' < file.log | grep ERROR
gives you this:
[ERROR
[ERROR
[ERROR
[ERROR
[ERROR
[ERROR
... (1630 more)
You need to grep first, then sed (I'm pretty sure you can do better with sed, but I'm just talking about the logic behind the commands):
grep ERROR file.log | sed -e 's/\s/\n/g' | sort | uniq -c | sort -nr | head -10
This may not be the best solution, as it also counts the word ERROR and other useless words, but you didn't give us much information about the input file.
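The effect of the ordering can be checked on a single line. The sketch below uses tr instead of sed for the splitting, since \s and \n in sed are not portable; that substitution is my own choice, not part of the commands above, and the variable names are only for illustration.

```shell
# One sample line in the format shown in the question.
line="Mon, 26 Nov 2018 07:51:07 +0100 | 164: [ERROR ***] Category ID not found for 'mcat-name1' 'subcat-name1' ref: '073'"

# Split first, filter second: only the token containing ERROR survives.
split_then_grep=$(printf '%s\n' "$line" | tr -s ' ' '\n' | grep ERROR)

# Filter first, split second: every word of the matching line survives.
# Here we just count how many tokens are left.
grep_then_split=$(printf '%s\n' "$line" | grep ERROR | tr -s ' ' '\n' | grep -c .)

printf 'split then grep: %s\n' "$split_then_grep"
printf 'grep then split: %s tokens\n' "$grep_then_split"
```

With the first ordering the category names are already gone before uniq -c ever sees them, which is why only [ERROR was counted.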
Of course, sed is perfectly able to perform the work of grep, too. See "useless use of grep".
– tripleee, Nov 26 '18 at 9:12
Assuming 'Bulgari' is an example of a category you want to extract, try
sed -n "s/.*ERROR.*] Category '\([^']*\)'.*/\1/p" file.log |
sort | uniq -c | sort -rn | head -n 10
The sed command finds lines which match a fairly complex regular expression and captures part of the line, then replaces the match with the captured substring and prints it (the -n option disables the default print action, so we only print the extracted lines). The rest is basically identical to what you already had.
In the regex, we look for (beginning of line followed by) anything (except a newline) followed by ERROR, later followed by ] Category ' and then a string which doesn't contain a single quote, then the closing single quote followed by anything. The many instances of "anything (except newline)" are required in order to replace the entire line with just the captured string from inside the single quotes. The backslashed parentheses are what capture an expression; google for "backref" for the full scoop.
Your original attempt would only extract the actual ERROR strings, because you replaced all the surrounding spaces with newlines (assuming vaguely that your sed accepts the Perl \s shorthand, which isn't standard in sed, and that \n gets interpreted as a literal newline in the replacement, which also isn't entirely standard or portable).
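The pattern above expects the category right after the word Category. For the log format actually shown in the question, the literal text would have to be adjusted; one possible adaptation is sketched below. The phrase "Category ID not found for" is taken from the sample lines, while the temporary file, the variable names and the last two sample lines are invented for illustration.

```shell
# Hypothetical sample file; the last two lines are made up (one repeated
# category, one non-error line that must be ignored).
log=$(mktemp)
cat > "$log" <<'EOF'
Mon, 26 Nov 2018 07:51:07 +0100 | 164: [ERROR ***] Category ID not found for 'mcat-name1' 'subcat-name1' ref: '073'
Mon, 26 Nov 2018 07:51:21 +0100 | 1232: [ERROR ***] Category ID not found for 'make' 'model' ref: '228239'
Mon, 26 Nov 2018 07:52:02 +0100 | 1240: [ERROR ***] Category ID not found for 'make' 'other-model' ref: '228240'
Mon, 26 Nov 2018 07:52:03 +0100 | 1241: [INFO] import finished
EOF

# -n plus the p flag prints only lines where the substitution succeeded;
# \( ... \) captures the first quoted string and \1 puts it back alone.
cats=$(sed -n "s/.*\[ERROR[^]]*] Category ID not found for '\([^']*\)'.*/\1/p" "$log" |
       sort | uniq -c | sort -rn | head -n 10)
printf '%s\n' "$cats"
rm -f "$log"
```

Note that uniq -c pads the counts with leading spaces, so the output is aligned rather than starting in the first column.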
The way to go is to select the failed categories and replace the whole line with only the category name, using sed.
Give this a try:
sed -e "s/^.* \[ERROR .*\] Category '\([^']*\)' .*$/\1/g" file.log | sort | uniq -c | sort -nr | head -16
^ is the start of the line.
\( ... \): the character sequence enclosed in this pair of escaped parentheses can be referred to as \1 for the first pair appearing in the regex, \2 for the second pair, etc.
$ is the end of the line.
The sed command selects a line which contains [ERROR and some characters up to a ], followed by the word Category; then, after the space and the opening single quote, any sequence of characters up to the next single quote is captured by the pair of escaped parentheses, followed by any sequence of characters up to the end of the line. If such a line is found, it is replaced with the captured sequence after Category.
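As a small illustration of how \1 and \2 refer to the first and second pair of escaped parentheses, the sketch below swaps two quoted strings. The input string and the variable name swapped are only an example.

```shell
# Capture two quoted strings and emit them in reverse order, without quotes.
swapped=$(printf '%s\n' "'make' 'model'" |
          sed "s/'\([^']*\)' '\([^']*\)'/\2 \1/")
printf '%s\n' "$swapped"
```

This prints model make: the second captured group comes first in the replacement.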
Using Perl
> cat merlin.txt
Mon, 26 Nov 2018 07:51:07 +0100 | 164: [ERROR ***] Category ID not found for 'mcat-name1' 'subcat-name1' ref: '073'
Mon, 26 Nov 2018 07:51:08 +0100 | 278: [ERROR ***] Category ID not found for 'mcat-name2' 'subcat-name2' ref: '020'
Mon, 26 Nov 2018 07:51:21 +0100 | 1232: [ERROR ***] Category ID not found for 'make' 'model' ref: '228239'
> perl -ne ' { s/(.*)Category.*for(.+)ref.*/\2/g and s/(\047\S+\047)/$kv{$1}++/ge if /ERROR/} END { foreach (sort keys %kv) { print "$_ $kv{$_}\n" } } ' merlin.txt | sort -nr
'subcat-name2' 1
'subcat-name1' 1
'model' 1
'mcat-name2' 1
'mcat-name1' 1
'make' 1
>
awk answer by kvantour: answered Nov 27 '18 at 9:52, edited Nov 27 '18 at 14:33
grep-then-sed answer by Corentin Limier: answered Nov 26 '18 at 8:54, edited Nov 26 '18 at 9:18
sed -n answer by tripleee: answered Nov 26 '18 at 9:17, edited Nov 26 '18 at 9:27
edited Nov 27 '18 at 7:19
answered Nov 26 '18 at 9:40
Jay jargot
Using Perl
> cat merlin.txt
Mon, 26 Nov 2018 07:51:07 +0100 | 164: [ERROR ***] Category ID not found for 'mcat-name1' 'subcat-name1' ref: '073'
Mon, 26 Nov 2018 07:51:08 +0100 | 278: [ERROR ***] Category ID not found for 'mcat-name2' 'subcat-name2' ref: '020'
Mon, 26 Nov 2018 07:51:21 +0100 | 1232: [ERROR ***] Category ID not found for 'make' 'model' ref: '228239'
> perl -ne ' { s/(.*)Category.*for(.+)ref.*/\2/g and s/(\047\S+\047)/$kv{$1}++/ge if /ERROR/} END { foreach (sort keys %kv) { print "$_ $kv{$_}\n" } } ' merlin.txt | sort -nr
'subcat-name2' 1
'subcat-name1' 1
'model' 1
'mcat-name2' 1
'mcat-name1' 1
'make' 1
>
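One thing to keep in mind with the output above: the count is in the second column, so sort -nr has no leading number to sort on. A sketch of one possible variant prints the count first. It uses awk rather than Perl (a substitution made for this example, not the answer's method), splits each ERROR line on single quotes, and assumes the category names themselves contain no single quote. The first three sample lines are the ones shown above; the fourth is invented so that one name occurs twice.

```shell
# Sample log in a temporary file; the last line is invented for the demo.
log=$(mktemp)
cat > "$log" <<'EOF'
Mon, 26 Nov 2018 07:51:07 +0100 | 164: [ERROR ***] Category ID not found for 'mcat-name1' 'subcat-name1' ref: '073'
Mon, 26 Nov 2018 07:51:08 +0100 | 278: [ERROR ***] Category ID not found for 'mcat-name2' 'subcat-name2' ref: '020'
Mon, 26 Nov 2018 07:51:21 +0100 | 1232: [ERROR ***] Category ID not found for 'make' 'model' ref: '228239'
Mon, 26 Nov 2018 07:51:30 +0100 | 1240: [ERROR ***] Category ID not found for 'make' 'model2' ref: '228240'
EOF

# With a single quote as field separator, fields 2 and 4 hold the
# main category and the subcategory names.
counts=$(awk -F"'" '/ERROR/ { n[$2]++; n[$4]++ }
  END { for (k in n) print n[k], k }' "$log" | sort -k1,1nr -k2)
printf '%s\n' "$counts"
rm -f "$log"
```

Because each line now starts with its count, the numeric sort puts make (count 2) first, and appending | head -10 would give the top ten.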
answered Nov 27 '18 at 14:57
stack0114106