Parsing log file

Parsing log file - regex

I am trying to parse a text like this from a log file:
[2016-01-29 11:31:33,809: WARNING/Worker-1283] 1030140:::DEAL_OF_DAY:::29:::1:::11
[2016-01-29 11:31:34,103: WARNING/Worker-1197] 1025311:::DEAL_OF_DAY:::29:::1:::11
[2016-01-29 11:31:34,291: WARNING/Worker-1197] 1025158:::DEAL_OF_DAY:::29:::1:::11
I want to extract the numbers 1030140, 1025311, 1025158 and so on.
I have tried the following:
cat deals29.txt | egrep -o '[0-9]+'
But this returns the other digits as well.
I then tried:
cat deals29.txt | egrep -o ' [0-9]+:::'
but now the output includes the colons as well, and the command-line version of grep has no way to print only a capture group.
Any suggestions? A grep solution would be preferred, but I can go with sed/awk if grep cannot do the job.

Using grep -oP and match reset \K:
grep -oP '^\[.*?\] \K\d+' file.log
1030140
1025311
1025158
If your grep doesn't support -P (PCRE) then use awk:
awk -F '\\] |:::' '{print $2}' file.log
1030140
1025311
1025158

You can practice regexes here: https://regex101.com/
I ended up with
] [0-9]*
and then you have to delete the first 2 characters of each match.
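
As a rough sketch of that idea: the sample line is copied from the question, and using cut to drop the two leading characters is only my assumption about how to "delete the first 2 chars".

```shell
# Sample line taken from the question
line='[2016-01-29 11:31:33,809: WARNING/Worker-1283] 1030140:::DEAL_OF_DAY:::29:::1:::11'

# grep -o prints "] 1030140"; cut -c3- then drops the "] " prefix
id=$(printf '%s\n' "$line" | grep -o '] [0-9]*' | cut -c3-)
printf '%s\n' "$id"
```

This relies on the closing bracket of the log prefix being the only ] on the line.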

You could use a solution like:
(\d{3,})::
# looks for at least 3 digits (or more) followed by two colons
# puts the matched numbers in group 1
See a demo for this approach here.
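
Since grep -o cannot print only group 1, here is a minimal sketch of the same idea with sed, which can. It swaps \d for [0-9] (an assumption made for portability) and adds a leading group so the greedy prefix cannot swallow digits of the ID; the sample line is copied from the question.

```shell
# Sample line taken from the question
line='[2016-01-29 11:31:33,809: WARNING/Worker-1283] 1030140:::DEAL_OF_DAY:::29:::1:::11'

# Group 1 is the prefix, which must end in a non-digit (or be empty);
# group 2 is the run of 3+ digits that is followed by "::"
id=$(printf '%s\n' "$line" | sed -nE 's/(^|.*[^0-9])([0-9]{3,})::.*/\2/p')
printf '%s\n' "$id"
```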

Related

Extract version using grep/regex in bash

I have a file that has a line stating
version = "12.0.08-SNAPSHOT"
The word version and quoted strings can occur on multiple lines in that file.
I am looking for a single line bash statement that can output the following string:
12.0.08-SNAPSHOT
The version can have RELEASE tag too instead of SNAPSHOT.
So to summarize, given
version = "12.0.08-SNAPSHOT"
expected output: 12.0.08-SNAPSHOT
And given
version = "12.0.08-RELEASE"
expected output: 12.0.08-RELEASE

The following command prints the quoted strings from lines of the form version = "...":
grep -Po '\bversion\s*=\s*"\K.*?(?=")' yourFile
-P enables perl regexes, which allow us to use features like \K and so on.
-o only prints matched parts instead of the whole lines.
\b ensures that version starts at a word boundary and we do not match things like abcversion.
\s stands for any kind of whitespace.
\K lets grep forget that it matched the part before \K. The forgotten part will not be printed.
.*? matches as few characters as possible (the matching part will be printed) ...
(?=") ... until we see a ", which won't be included in the match either (this is called a lookahead).
Not all grep implementations support the -P option. Alternatively, you can use perl, as described in this answer:
perl -nle 'print $& if m{\bversion\s*=\s*"\K.*?(?=")}' yourFile
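
If neither grep -P nor perl is available, plain sed can probably do a similar job. A minimal sketch, assuming the line begins (after optional whitespace) with the word version; the two-line sample input is made up for illustration:

```shell
# Hypothetical sample input with another quoted string on a different line
input='name = "demo"
version = "12.0.08-SNAPSHOT"'

# -n plus the p flag prints only lines where the substitution matched
ver=$(printf '%s\n' "$input" |
  sed -n 's/^[[:space:]]*version[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p')
printf '%s\n' "$ver"
```

Unlike the \b version above, this one is anchored to the start of the line, so it will not find version = "..." in the middle of a line.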

Seems like a job for cut:
$ echo 'version = "12.0.08-SNAPSHOT"' | cut -d'"' -f2
12.0.08-SNAPSHOT
$ echo 'version = "12.0.08-RELEASE"' | cut -d'"' -f2
12.0.08-RELEASE

Portable solution:
$ echo 'version = "12.0.08-RELEASE"' |sed -E 's/.*"(.*)"/\1/g'
12.0.08-RELEASE
or even:
$ perl -pe 's/.*"(.*)"/$1/'
$ awk -F"\"" '{print $2}'

Grepping for overlapping pattern matches

This is what I'm running
grep -o ',[tcb],' <<< "r,t,c,q,c b,b,"
The output is
,t,
,b,
But I want to get
,t,
,c,
,b,
(I do not want the b without a preceding , or the c without a trailing , to be matched.)
That is because ,[tcb], should be found in 'r",t,"c,q b,b,', 'r,t",c,"q b,b,' and 'r,t,c,q b",b,"'.
But it seems that once a , has been consumed by the first match, grep does not consider it again for the next match.
Is there a way around this, or is grep not meant to do this?

You can use awk instead of grep for this with record separator as comma:
awk -v RS=, '/^[tcb]$/{print RS $0 RS}' <<< "r,t,c,q,c b,b,"
,t,
,c,
,b,

You can use grep with a Perl RE, which supports zero-width look-behind and look-ahead assertions, to extract the letters surrounded by commas. You can then restore the separators as needed:
grep -o -P '(?<=,)[tcb](?=,)' <<< "r,t,c,q,c b,b,"|while read c; do echo ",$c,"; done

The awk solution is nice. I have another with sed+grep:
echo "r,t,c,q,c b,b," | sed "s/,/,,/g" | grep -o ',[tcb],'
,t,
,c,
,b,
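
To see why doubling the commas helps, here is a small sketch that runs the same grep with and without the sed step on the input from the question (the variable names are only for illustration):

```shell
input='r,t,c,q,c b,b,'

# grep -o resumes scanning after the end of the previous match, so the comma
# that closes ",t," is no longer available to open ",c,"
plain=$(printf '%s\n' "$input" | grep -o ',[tcb],')

# Doubling every comma gives each field its own leading and trailing comma
doubled=$(printf '%s\n' "$input" | sed 's/,/,,/g')
fixed=$(printf '%s\n' "$doubled" | grep -o ',[tcb],')

printf 'doubled: %s\n' "$doubled"
printf '%s\n' "$fixed"
```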

Bash grep ip from line

I have the file ip.txt which contain the following
ata001dcfe16f85.mm.ph.ph.cox.net (24.252.231.220)
220.231.252.24.xxx.com (24.252.231.220)
and I made this bash command to extract ips :
grep -Eo '(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)' ip.txt | sort -u > good.txt
I want to edit the code so it extracts the ips between the parentheses ONLY . not all the ips on the line because the current code extract the ip 220.231.252.24

To get the IP within parentheses, all you need is to wrap the entire regex in an escaped \( \):
grep -Eo '\((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\)'
will give output as
(24.252.231.220)
(24.252.231.220)
If you want to get rid of the parentheses in the output as well, lookarounds are useful:
grep -oP '(?<=\()(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)(?=\))'
would produce output as
24.252.231.220
24.252.231.220
A much lighter version would be
grep -oP '(?<=\()(25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})(\.(25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})){3}(?=\))'
Here
[0-9]{1,2} matches one or two digits
(\.(25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})){3} matches a . followed by an octet of up to 3 digits, three times
Duplicate lines can be removed by piping to sort -u (uniq alone only removes adjacent duplicates):
grep -oP '(?<=\()(25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})(\.(25[0-5]|2[0-4][0-9]|[01]?[0-9]{1,2})){3}(?=\))' input | sort -u
giving the output
24.252.231.220

You can try awk
awk -F"[()]" '{print $(NF-1)}' file
24.252.231.220
24.252.231.220
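
If you also want the duplicates removed in the same step, a possible variation is sketched below. Note that it does not validate that the text in parentheses is an IP; the seen array name is arbitrary, and the input is the sample from the question.

```shell
# Sample input taken from the question
input='ata001dcfe16f85.mm.ph.ph.cox.net (24.252.231.220)
220.231.252.24.xxx.com (24.252.231.220)'

# NF >= 3 skips lines without a parenthesised part;
# seen[] lets each distinct value through only once
ips=$(printf '%s\n' "$input" |
  awk -F'[()]' 'NF >= 3 && !seen[$(NF-1)]++ { print $(NF-1) }')
printf '%s\n' "$ips"
```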

How can I extract the content between two brackets?

My input:
1:FAILED + *1 0 (8328832,AR,UNDECLARED)
This is what I expect:
8328832,AR,UNDECLARED
I am trying to find a general regular expression that allows to take any content between two brackets out.
My attempt is
grep -o '\[(.*?)\]' test.txt > output.txt
but it doesn't match anything.

Still using grep and regex
grep -oP '\(\K[^\)]+' file
\K resets the start of the reported match, so the opening parenthesis is matched but not printed. The effect is similar to a positive look-behind assertion, which you can also write explicitly:
grep -oP '(?<=\()[^\)]+' file
If your grep lacks the -P option, you can do this with perl:
perl -lne '/\(\K[^\)]+/ and print $&' file
Another simpler approach using awk
awk -F'[()]' '{print $2}' file
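
Note that the attempt in the question looks for square brackets (\[ and \]), while the input only contains parentheses, so it cannot match. For completeness, a sed sketch that needs neither -P nor awk; the sample line is the one from the question:

```shell
# Sample line taken from the question
line='1:FAILED + *1 0 (8328832,AR,UNDECLARED)'

# In a basic regex a bare ( is literal, while \( ... \) captures;
# [^)]* stops at the first closing parenthesis
inner=$(printf '%s\n' "$line" | sed -n 's/.*(\([^)]*\)).*/\1/p')
printf '%s\n' "$inner"
```

If a line holds several parenthesised groups, this prints only the last one, because the leading .* is greedy.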

grep: group capturing

I have following string:
{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}
and I need to get the value of "scheme_version", which is 1234 in this example.
I have tried
grep -Eo "\"scheme_version\":(\w*)"
however it returns
"scheme_version":1234
How can I do that? I know I can add a sed call, but I would prefer to do it with a single grep.

You'll need to use a look behind assertion so that it isn't included in the match:
grep -Po '(?<=scheme_version":)[0-9]+'

This might work for you:
echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' |
sed -n 's/.*"scheme_version":\([^}]*\)}/\1/p'
1234
Sorry it's not grep, so disregard this solution if you like.
Or stick with grep and add:
grep -Eo "\"scheme_version\":(\w*)"| cut -d: -f2

I would recommend that you use jq for the job. jq is a command-line JSON processor.
$ cat tmp
{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}
$ cat tmp | jq .scheme_version
1234

As an alternative to the positive lookbehind method suggested by SiegeX, you can reset the match starting point to directly after scheme_version": with the \K escape sequence. E.g.,
$ grep -Po 'scheme_version":\K[0-9]+'
This restarts the matching process after having matched scheme_version":, and tends to perform far better than the positive lookbehind. Comparing the two on regex101 shows that the reset-match-start method takes 37 steps and 1ms, while the positive lookbehind method takes 194 steps and 21ms.
You can compare the performance yourself on regex101 and you can read more about resetting the match starting point in the PCRE documentation.

To avoid grep's PCRE feature, which is available in GNU grep but not in the BSD version, another option is ripgrep, e.g.
$ rg -o 'scheme_version.?:(\d+)' -r '$1' <file.json
1234
-r (--replace) replaces each match with the given text; capture group indices (e.g., $5) and names (e.g., $foo) are supported in the replacement.
Another example, with Python's json.tool module, which can validate and pretty-print:
$ python -mjson.tool file.json | rg -o 'scheme_version[^\d]+(\d+)' -r '$1'
1234
Related: Can grep output only specified groupings that match?

You can do this:
$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | awk -F ':' '{print $4}' | tr -d '}'
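
Counting colon-separated fields breaks as soon as the key order changes or a value contains a colon. A sketch that looks the key up by name instead, using only awk's match() and substr() and assuming the value is a bare integer as in the question:

```shell
# Sample JSON line taken from the question
json='{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}'

val=$(printf '%s\n' "$json" | awk '{
  # Find the key followed by a colon and digits; the "_id" value does not
  # qualify because it is followed by a comma
  if (match($0, /"scheme_version":[0-9]+/)) {
    pair = substr($0, RSTART, RLENGTH)
    sub(/.*:/, "", pair)   # drop everything up to and including the colon
    print pair
  }
}')
printf '%s\n' "$val"
```

For anything beyond a one-off, a real JSON parser such as jq is still the safer choice.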

Improving on @potong's answer, which only extracts "scheme_version", you can use this expression to extract any of the keys:
$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | sed -n 's/.*"_id":["]*\([^(",})]*\)[",}].*/\1/p'
scheme_version
$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | sed -n 's/.*"_rev":["]*\([^(",})]*\)[",}].*/\1/p'
4-cad1842a7646b4497066e09c3788e724
$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | sed -n 's/.*"scheme_version":["]*\([^(",})]*\)[",}].*/\1/p'
1234
