grep: group capturing - regex

I have the following string:
{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}
and I need to get the value of "scheme_version", which is 1234 in this example.
I have tried
grep -Eo "\"scheme_version\":(\w*)"
however it returns
"scheme_version":1234
How can I do this? I know I can add a sed call, but I would prefer to do it with a single grep.

You'll need to use a lookbehind assertion so that the prefix isn't included in the match:
grep -Po '(?<=scheme_version":)[0-9]+'

This might work for you:
echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' |
sed -n 's/.*"scheme_version":\([^}]*\)}/\1/p'
1234
Sorry it's not grep, so disregard this solution if you like.
Or stick with grep and add:
grep -Eo "\"scheme_version\":(\w*)"| cut -d: -f2

I would recommend that you use jq for the job. jq is a command-line JSON processor.
$ cat tmp
{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}
$ cat tmp | jq .scheme_version
1234

As an alternative to the positive lookbehind method suggested by SiegeX, you can reset the match starting point to directly after scheme_version": with the \K escape sequence. E.g.,
$ grep -Po 'scheme_version":\K[0-9]+'
This restarts the matching process after having matched scheme_version":, and tends to perform far better than the positive lookbehind. Comparing the two on regex101 shows that the reset match start method takes 37 steps and 1ms, while the positive lookbehind method takes 194 steps and 21ms.
You can compare the performance yourself on regex101, and you can read more about resetting the match starting point in the PCRE documentation.
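As a rough check of the two approaches, the sketch below runs both patterns on the sample string from the question. The function names version_lookbehind and version_reset are made up for illustration, and the sketch assumes a grep built with PCRE support (-P), such as GNU grep.

```shell
# Sample document from the question, assumed to arrive on a single line.
json='{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}'

# Lookbehind: assert that scheme_version": precedes the digits.
version_lookbehind() { grep -Po '(?<=scheme_version":)[0-9]+'; }

# \K: match scheme_version": first, then drop it from the reported match.
version_reset() { grep -Po 'scheme_version":\K[0-9]+'; }

printf '%s\n' "$json" | version_lookbehind
printf '%s\n' "$json" | version_reset
```

Both calls print 1234 for this input. The "_id" value is skipped because scheme_version is followed there by a quote and a comma, not by a quote and a colon.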

To avoid relying on grep's PCRE feature, which is available in GNU grep but not in the BSD version, another option is to use ripgrep, e.g.
$ rg -o 'scheme_version.?:(\d+)' -r '$1' <file.json
1234
Here -r (--replace) rewrites each match; the replacement text can refer to capture groups by index (e.g., $5) or by name (e.g., $foo).
Another example uses Python's json.tool module, which can validate and pretty-print JSON:
$ python -mjson.tool file.json | rg -o 'scheme_version[^\d]+(\d+)' -r '$1'
1234
Related: Can grep output only specified groupings that match?

You can do this:
$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | awk -F ':' '{print $4}' | tr -d '}'

Improving on @potong's answer, which only extracts "scheme_version", you can use this expression for any key:
$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | sed -n 's/.*"_id":["]*\([^(",})]*\)[",}].*/\1/p'
scheme_version
$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | sed -n 's/.*"_rev":["]*\([^(",})]*\)[",}].*/\1/p'
4-cad1842a7646b4497066e09c3788e724
$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | sed -n 's/.*"scheme_version":["]*\([^(",})]*\)[",}].*/\1/p'
1234
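The three commands above differ only in the key name, so they can be folded into one helper. This is a minimal sketch, not a JSON parser: json_field is a hypothetical name, and it assumes flat, single-line JSON whose keys contain no regex metacharacters or slashes and whose values contain no quotes, commas, or braces.

```shell
json='{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}'

# Hypothetical helper: print the value stored under key $1, reading JSON from stdin.
# The key is spliced into the sed pattern, so it must be a plain word.
json_field() {
    sed -n 's/.*"'"$1"'":"*\([^",}]*\)[",}].*/\1/p'
}

printf '%s\n' "$json" | json_field _id
printf '%s\n' "$json" | json_field _rev
printf '%s\n' "$json" | json_field scheme_version
```

For anything beyond this simple shape (nested objects, escaped quotes, values with commas), a real JSON tool such as jq is the safer choice.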

Related

Regex using sed and or grep

How can I display the arch and version of queried rpm package using sed or grep?
[root@kitchen-vm-centos6-box boot]# rpm -qa | grep kernel-devel
kernel-devel-2.6.32-642.11.1.el6.x86_64
kernel-devel-2.6.32-696.10.2.el6.x86_64
What I need is only:
2.6.32-642.11.1.el6.x86_64
What is missing in my sed? => sed 's/[^\.]\+\.//'
Thanks in advance!
You can also use cut:
rpm -qa | grep kernel-devel | cut -d \- -f 3-4
You can use sed like this and avoid an extra grep (-n together with the p flag prints only the lines where the substitution happened):
rpm -qa | sed -n '/kernel-devel/s/^[^0-9]*//p'
2.6.32-642.11.1.el6.x86_64
2.6.32-696.10.2.el6.x86_64
Your sed removes everything up to and including the first dot, i.e. "kernel-devel-2.", because [^\.]\+ matches every character before that dot, including the leading "2".
You can fix this easily by making the regex more explicit.
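One way to make the regex more explicit is to anchor it and name the prefix literally. The sketch below uses the two package names from the question as stand-in input instead of calling rpm; strip_kernel_devel is a made-up helper name.

```shell
# Two package names, copied from the rpm -qa output quoted in the question.
pkgs='kernel-devel-2.6.32-642.11.1.el6.x86_64
kernel-devel-2.6.32-696.10.2.el6.x86_64'

# Original attempt: with GNU sed this deletes everything through the first dot.
printf '%s\n' "$pkgs" | sed 's/[^\.]\+\.//'

# More explicit: anchor at the line start and delete only the literal prefix.
# -n plus the p flag prints only lines that carried the prefix.
strip_kernel_devel() {
    sed -n 's/^kernel-devel-//p'
}

printf '%s\n' "$pkgs" | strip_kernel_devel
```

With real data the helper would be fed from rpm -qa directly, which also makes the separate grep unnecessary.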
Other answers already suggested solutions, here's another one using grep:
$ rpm -qa | grep -oP "devel-\K(.*)"
2.6.32-642.11.1.el6.x86_64
2.6.32-696.10.2.el6.x86_64
\K tells the engine to pretend that the match attempt started at this position (that's the alternative that Perl suggested for lookbehind).
You can do it with grep only:
rpm -qa | grep -P -o '(?<=kernel-devel-).*'
Explanation:
-o is match only. I.e. grep will return the matched part only
-P is perl regex mode. It enables lookarounds.
(?<=...) is lookbehind. I.e. stuff before the match. This is not part of the match so -o is not going to retain it
Of course, sed can help too:
rpm -qa | grep 'kernel-devel' | sed 's/^[^.0-9]*-//g'
Explanation:
^ matches the start of the string
[^.0-9] matches the non-dot, non-number characters from the start of the string. This is the part that we don't need.
The empty replacement (//) deletes the matched part; the g flag is redundant here because the pattern is anchored with ^
One in awk:
$ rpm -qa | awk 'match($0,/^kernel-devel-./){print substr($0,RLENGTH)}'
2.6.32-642.11.1.el6.x86_64
2.6.32-696.10.2.el6.x86_64
Explained:
match($0,/^kernel-devel-./) { # if the record starts with kernel-devel-[ANYTHING]
print substr($0,RLENGTH) # print starting from the [ANYTHING]
}

What does '\K' mean in this regex?

Given the following shell script, would someone be so kind as to explain the grep -Po regex please?
#!/bin/bash
# Issue the request for a bearer token, json is returned
raw_json=`curl -s -X POST -d "username=name&password=secret&client_id=security-admin-console" http://localhost:8081/auth/realms/master/tokens/grants/access`
# Strip away all but the "access_token" field's value using a Python regular expression
bearerToken=`echo $raw_json | grep -Po '"'"access_token"'"\s*:\s*"\K([^"]*)'`
echo "The bearer token is:"
echo $bearerToken
So specifically, I'm interested in understanding the parts of the regex
grep -Po '"'"access_token"'"\s*:\s*"\K([^"]*)'
and how it works. Why so many quotes? What is the "K" for? I've some experience with grep regex but this confuses me.
This is the actual output of the curl command and the shell script (grep) works as desired returning just the contents of the "access_token" value.
{"access_token":"eyJhbGciOiJSandNoThisIsntRealndmbS1yZWFsbSI6eyJyb2xlcyI6WyJtYW5hZ2UtY2xpZW50cyIsInZpZXctcmVhbG0iLCJtYW5hZ2UtZXZlbnRzIiwidmlldy1ldmVudHMiLCJ2aWV3LWFwcGxpY2F0aW9ucyIsInZpZXctdXNlcnMiLCJ2aWV3LWNsaWVudHMiLCJtYW5hZ2UtdXNlcnMiLCJtYW5hZ2UtYXBwbGljYXRpb25zIiwibWFuYWdlLXJlYWxtIl19LCJtYXN0ZXItcmVhbG0iOnsicm9sZXMiOlsibWFuYWdlLWV2ZW50cyIsIm1hbmFnZS1jbGllbnRzIiwidmlldy1yZWFsbSIsInZpZXctZXZlbnRzIiwidmlldy1hcHBsaWNhdGlvbnMiLCJ2aWV3LXVzZXJzIiwidmlldy1jbGllbnRzIiwibWFuYWdlLXJlYWxtIiwibWFuYWdlLXVzZXJzIiwibWFuYWdlLWFwcGxpY2F0aW9ucyJdfX19.fQmQKn-xatvflHPAaxCfrrVow3ynpw0sREho7__jZo2d0g1SwZV7Lf4C26CcweNLlb3wmKHHo63HRz35qRxJ7BXyiZwHgXokvDJj13yuOb6Sirg9z02n6fwGy8Iog30pUvffnDaVnUWHfVL-h_R4-OZNf-_YUK5RcL2DHt0zUXI","expires_in":60,"refresh_expires_in":1800,"refresh_token":"eyJhbGciOiJSUzI1NiJ9.eyJqdGkiOiJlNWFmYTZiOC04ZjM5LTQ5MjUtOWZiMC00MmY3MTM4YzUzMGIiLCJleHAiOjE0NDY4Mjk3OTksIm5iZiI6MCwAreYouKiddingIwouldnotputSOmethigRealHereNpb25fc3RhdGUiOiI2MmVmYzA1Yy0xYmY1LTRmNTUtYjc0OS01ZTBlZmY5NDE1NWIiLCJyZWFsbV9hY2Nlc3MiOnsicm9sZXMiOlsiYWRtaW4iLCJjcmVhdGUtcmVhbG0iXX0sInJlc291cmNlX2FjY2VzcyI6eyJ3Zm0tcmVhbG0iOnsicm9sZXMiOlsibWFuYWdlLWV2ZW50cyIsInZpZXctcmVhbG0iLCJtYW5hZ2UtY2xpZW50cyIsInZpZXctYXBwbGljYXRpb25zIiwidmlldy1ldmVudHMiLCJ2aWV3LXVzZXJzIiwidmlldy1jbGllbnRzIiwibWFuYWdlLXJlYWxtIiwibWFuYWdlLWFwcGxpY2F0aW9ucyIsIm1hbmFnZS11c2VycyJdfSwibWFzdGVyLXJlYWxtIjp7InJvbGVzIjpbInZpZXctcmVhbG0iLCJtYW5hZ2UtY2xpZW50cyIsIm1hbmFnZS1ldmVudHMiLCJ2aWV3LWFwcGxpY2F0aW9ucyIsInZpZXctZXZlbnRzIiwidmlldy11c2VycyIsInZpZXctY2xpZW50cyIsIm1hbmFnZS1hcHBsaWNhdGlvbnMiLCJtYW5hZ2UtdXNlcnMiLCJtYW5hZ2UtcmVhbG0iXX19fQ.WeiJOC1jQ52aKgnW8UN2Lv9rJ_yKZiOhijOYKLN2EEOkYF8rvRZsSKbTPFKTIUvjnwy2A7V_N-GhhJH4C-T7F5__QPNofSXbCNyvATj52jGLxk9V0Afvk-Z5QAWi55PJRTC0qteeMRcO2Frw-0KtKYe9o3UcGICJubxhZHsXBLA","token_type":"bearer","id_token":"eyJhbGciOiJSUzI1NiJ9.eyJuYW1lIjoiIiwianRpIjoiMGIyMGI0ODctOTI4OS00YTFhLTgyNmMtM2NiOTg0MDJkMzVkIiwiZXhwIjoxNDQ2ODI4MDU5LCJuYmYiOjAsImlhdCI6MTQ0NjgyNzk5OIwouldhaveToBeNutsUiLCJwcmVmZXJyZWRfdXNlcm5hbWUiOiJhZG1
pbiIsImVtYWlsX3ZlcmlmaWVkIjpmYWxzZX0.DmG8Lm4niL1djzNrLsZ2CrsB1ZzUPnR2Nm7IZnrwrmkXsrPxjl6pyXKCWSj6pbk2sgVI8NNFqrGIJmEJ7gkTZWm328VGGpJsmMuJBki0KbqBRKORGQSgkas_34rwzhcTE3Iki8h_YVs2vvNIx_eZSOvIzyEcP3IGHuBoxcR6W3E","not-before-policy":0,"session-state":"62efc05c-1bf5-4f55-b749-5e0eff94155b"}
In case anyone finds this post, this is what I ended up using:
if hash jq 2>/dev/null; then
    # Use the jq command to safely parse json
    bearerToken=$(echo "$raw_json" | jq -r '.access_token')
else
    # Strip away all but the "access_token" field's value using a Perl regular expression
    bearerToken=$(echo "$raw_json" | grep -Po '"'"access_token"'"\s*:\s*"\K([^"]*)')
fi
Since not all regex flavors support lookbehind, Perl introduced the \K. In general when you have:
a\Kb
When “b” is matched, \K tells the engine to pretend that the match attempt started at this position.
In your example, you want to pretend that the match attempt started at what appears after the "access_token":" text.
This example will better demonstrate the \K usage:
~$ echo 'hello world' | grep -oP 'hello \K(world)'
world
~$ echo 'hello world' | grep -oP 'hello (world)'
hello world
In addition, \K allows a variable-length look-behind:
$ echo foooooo bar | grep -oP "(?<=foo+) \Kbar"
grep: lookbehind assertion is not fixed length
$ echo foooooo bar | grep -oP "foo+ \Kbar"
bar
My solution was: sed -n 's/cut off this part \(display this part only\) cut off this part/\1/gp'
References:
https://www.cyberciti.biz/faq/unix-linux-sed-print-only-matching-lines-command/
info sed (texinfo package)
man 1 sed
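Applied to the access_token case, the sed template above might look like the following sketch. The helper name access_token and the shortened raw_json value are invented for illustration; it assumes the JSON is on one line and the token contains no double quotes.

```shell
# Shortened stand-in for the curl response; the real token is much longer.
raw_json='{"access_token":"abc.def.ghi","expires_in":60,"token_type":"bearer"}'

# Hypothetical helper: keep only the capture group \(...\), i.e. the token value.
# Everything before and after the group is matched and thrown away.
access_token() {
    sed -n 's/.*"access_token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

printf '%s\n' "$raw_json" | access_token
```

Unlike the grep -Po variant, this needs no PCRE support, so it should also work where grep lacks -P.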

How to extract value from the string in bash?

I have an input string in the following format:
bugfix/ABC-12345-1-00
I want to extract "ABC-12345". Regex for that format in C# looks like this:
.*\\/([A-Z]+-[0-9]+).*
How can I do that in a bash script? I've tried sed and awk but had no success because I need to extract value from the capturing group and skip the rest.
If your grep supports -P then you could use the below grep commands.
$ echo 'bugfix/ABC-12345-1-00' | grep -oP '/\K[A-Z]+-\d+'
ABC-12345
\K keeps the text matched so far out of the overall regex match.
$ echo 'bugfix/ABC-12345-1-00' | grep -oP '(?<=/)[A-Z]+-\d+'
ABC-12345
(?<=/) is a positive lookbehind which asserts that the match must be preceded by a / symbol.
With sed:
$ echo 'bugfix/ABC-12345-1-00' | sed 's~.*/\([A-Z]\+-[0-9]\+\).*~\1~'
ABC-12345
echo "bugfix/ABC-12345-1-00"| perl -ane '/.*?([A-Z]+\-[0-9]+).*/;print $1."\n"'
You could try something like:
echo "bugfix/ABC-12345-1-00" | egrep -o '[A-Z]+-[0-9]+'
OUTPUT:
ABC-12345
If you do not like to use regex, you can use this awk:
echo "bugfix/ABC-12345-1-00" | awk -F\/ '{print $NF}'
ABC-12345-1-00
Or just this:
awk -F\/ '$0=$NF'
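If the goal is specifically "print what the capturing group matched", the standard expr utility does that without grep -P. This is only a sketch: ticket_id is a hypothetical name, and the pattern is a basic regular expression written to mirror the C# one from the question.

```shell
# Hypothetical helper: print the ticket id (e.g. ABC-12345) that follows a slash in $1.
ticket_id() {
    # expr STRING : REGEX matches a basic regex anchored at the start of STRING
    # and prints the text captured by \(...\).
    # The leading X keeps expr from mistaking the input for one of its operators.
    LC_ALL=C expr "X$1" : 'X.*/\([A-Z][A-Z]*-[0-9][0-9]*\)'
}

ticket_id 'bugfix/ABC-12345-1-00'
```

expr exits with a non-zero status when nothing matches, which can be used to detect branch names that do not follow the expected format.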

grep with extended regex over multiple lines

I'm trying to get a pattern over multiple lines. I would like to ensure the line I'm looking for ends in \r\n and that there is specific text that comes after it at some point. The two problems I've had are I often get unmatched parenthesis in groupings or I get a positive match when there is none. Here are two simple examples.
echo -e -n "ab\r\ncd" | grep -U -c -z -E $'(\r\n)+.*TEST'
grep: Unmatched ( or \(
What exactly is unmatched there? I don't get it.
echo -e -n "ab\r\ncd" | grep -U -c -z -E $'\r\n.*TEST'
1
There is no TEST in the string, so why does this return a count of 1 for matches?
I'm using grep (GNU grep) 2.16 on Ubuntu 14. Thanks
Instead of -E, you can use -P (PCRE support in GNU grep) and write the line ending as the regex escapes \r\n inside ordinary single quotes:
echo -ne "ab\r\ncd" | ggrep -UczP '\r\n.*TEST'
0
echo -ne "ab\r\ncd" | ggrep -UczP '\r\n.*cd'
1
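grep -E cannot match a newline from within the pattern: a literal newline in the pattern, which $'\r\n' produces, separates alternative patterns. So $'(\r\n)+.*TEST' is read as two patterns, "(" followed by a carriage return, and ")+.*TEST", hence the unmatched parenthesis. Likewise $'\r\n.*TEST' is read as "carriage return" or ".*TEST", which matches because the input contains a carriage return.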
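One way to see why -E behaves this way: grep treats a newline inside a pattern as a separator between alternative patterns, and $'\r\n' puts a literal newline there. A minimal sketch with made-up input:

```shell
# A pattern containing a literal newline is really two patterns: foo, bar.
pattern='foo
bar'

# Counts the lines that match foo OR bar; the newline itself is never matched.
printf 'foo\nbar\nbaz\n' | grep -c -E "$pattern"
```

This prints 2: the lines foo and bar each match one of the two patterns, and baz matches neither.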

Extract IPv4 and IPv6 Address Ranges in Bash?

I'm writing a bash script in which I need to extract IPv4 and IPv6 Address Ranges from multiple strings and then format it as per the requirements before saving to the file.
I've got the regex working fine: http://regexr.com?38jsb (Not optimized, roughly added)
However, bash throws an error when I use it with egrep: egrep: repetition-operator operand invalid
Here's my bash script:
#!/bin/bash
regex="(?>(?>([a-f\d]{1,4})(?>:(?1)){3}|(?!(?:.*[a-f\d](?>:|$)){})((?1)(?>:(?1)){0,6})?::(?2)?)|(?>(?>(?1)(?>:(?1)){5}:|(?!(?:.*[a-f\d]:){6,})(?3)?::(?>((?1)(?>:(?1)){0,4}):)?)?(25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)(?>\.(?4)){3}))\/\d{1,2}"
echo "v=abc ip4:127.0.0.1/19 ip4:192.168.1.1/32 ip4:192.168.2.50/20 ip6:2001:4860:4000::/36 ip6:2404:6800:4000::/36 ip6:2607:f8b0:4000::/36 ip6:2800:3f0:4000::/36 ip6:2a00:1450:4000::/36 ip6:2c0f:fb50:4000::/36 ~all" | egrep -o $regex
How can I extract both types of IP ranges in bash? Is there a better solution?
Note: I'm using sample data for testing purposes.
First, single-quote the regex variable assignment (regex='...').
Then use grep -Po and double-quote $regex, as @BroSlow suggests. -P activates support for PCREs (Perl-Compatible Regular Expressions), which your regex requires; note that -P is not available on all platforms (e.g., OSX).
To put it all together:
regex='(?>(?>([a-f\d]{1,4})(?>:(?1)){3}|(?!(?:.*[a-f\d](?>:|$)){})((?1)(?>:(?1)){0,6})?::(?2)?)|(?>(?>(?1)(?>:(?1)){5}:|(?!(?:.*[a-f\d]:){6,})(?3)?::(?>((?1)(?>:(?1)){0,4}):)?)?(25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)(?>\.(?4)){3}))\/\d{1,2}'
txt="v=abc ip4:127.0.0.1/19 ip4:192.168.1.1/32 ip4:192.168.2.50/20 ip6:2001:4860:4000::/36 ip6:2404:6800:4000::/36 ip6:2607:f8b0:4000::/36 ip6:2800:3f0:4000::/36 ip6:2a00:1450:4000::/36 ip6:2c0f:fb50:4000::/36 ~all"
echo "$txt" | grep -Po "$regex"
Alternative: Following @l'L'l's example, here's a greatly simplified solution that works with the sample data (again relies on -P):
echo "$txt" | grep -Po '\bip[46]:\K[^ ]+'
Variant for OSX, where grep doesn't support -P:
echo "$txt" | egrep -o '\<ip[46]:[^ ]+' | cut -c 5-
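Since the question also mentions formatting the ranges before saving them, here is a sketch that avoids -P and uses two sed capture groups to label each range with its address family. The helper name ip_ranges, the shortened sample, and the "ipv4 <range>" output format are assumptions made for illustration.

```shell
# Shortened version of the sample record from the question.
sample='v=abc ip4:127.0.0.1/19 ip6:2001:4860:4000::/36 ~all'

# Hypothetical helper: one token per line, then keep only ip4:/ip6: tokens.
# Group 1 is the family digit (4 or 6), group 2 is the address range.
ip_ranges() {
    tr ' ' '\n' | sed -n 's/^ip\([46]\):\(.*\)$/ipv\1 \2/p'
}

printf '%s\n' "$sample" | ip_ranges
```

Like the simplified grep variants, this does not validate the addresses; it only relies on the ip4:/ip6: prefixes in the input.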
This pattern should work in combination with sed:
str="v=abc ip4:127.0.0.1/19 ip4:192.168.1.1/32 ip4:192.168.2.50/20 ip6:2001:4860:4000::/36 ip6:2404:6800:4000::/36 ip6:2607:f8b0:4000::/36 ip6:2800:3f0:4000::/36 ip6:2a00:1450:4000::/36 ip6:2c0f:fb50:4000::/36 ~all"
echo $str | grep -s -i -o "ip[0-9]\:[a-z0-9\.:/]*" --color=always | sed 's/ip[0-9]\://g'
output:
127.0.0.1/19
192.168.1.1/32
192.168.2.50/20
2001:4860:4000::/36
2404:6800:4000::/36
2607:f8b0:4000::/36
2800:3f0:4000::/36
2a00:1450:4000::/36
2c0f:fb50:4000::/36
Omit --color=always to exclude color output if desired.