What does '\K' mean in this regex?

Given the following shell script, would someone be so kind as to explain the grep -Po regex please?
#!/bin/bash
# Issue the request for a bearer token, json is returned
raw_json=`curl -s -X POST -d "username=name&password=secret&client_id=security-admin-console" http://localhost:8081/auth/realms/master/tokens/grants/access`
# Strip away all but the "access_token" field's value using a Perl-compatible regular expression (grep -P)
bearerToken=`echo $raw_json | grep -Po '"'"access_token"'"\s*:\s*"\K([^"]*)'`
echo "The bearer token is:"
echo $bearerToken
So specifically, I'm interested in understanding the parts of the regex
grep -Po '"'"access_token"'"\s*:\s*"\K([^"]*)'
and how it works. Why so many quotes? What is the "K" for? I've some experience with grep regex but this confuses me.
This is the actual output of the curl command and the shell script (grep) works as desired returning just the contents of the "access_token" value.
{"access_token":"eyJhbGciOiJSandNoThisIsntRealndmbS1yZWFsbSI6eyJyb2xlcyI6WyJtYW5hZ2UtY2xpZW50cyIsInZpZXctcmVhbG0iLCJtYW5hZ2UtZXZlbnRzIiwidmlldy1ldmVudHMiLCJ2aWV3LWFwcGxpY2F0aW9ucyIsInZpZXctdXNlcnMiLCJ2aWV3LWNsaWVudHMiLCJtYW5hZ2UtdXNlcnMiLCJtYW5hZ2UtYXBwbGljYXRpb25zIiwibWFuYWdlLXJlYWxtIl19LCJtYXN0ZXItcmVhbG0iOnsicm9sZXMiOlsibWFuYWdlLWV2ZW50cyIsIm1hbmFnZS1jbGllbnRzIiwidmlldy1yZWFsbSIsInZpZXctZXZlbnRzIiwidmlldy1hcHBsaWNhdGlvbnMiLCJ2aWV3LXVzZXJzIiwidmlldy1jbGllbnRzIiwibWFuYWdlLXJlYWxtIiwibWFuYWdlLXVzZXJzIiwibWFuYWdlLWFwcGxpY2F0aW9ucyJdfX19.fQmQKn-xatvflHPAaxCfrrVow3ynpw0sREho7__jZo2d0g1SwZV7Lf4C26CcweNLlb3wmKHHo63HRz35qRxJ7BXyiZwHgXokvDJj13yuOb6Sirg9z02n6fwGy8Iog30pUvffnDaVnUWHfVL-h_R4-OZNf-_YUK5RcL2DHt0zUXI","expires_in":60,"refresh_expires_in":1800,"refresh_token":"eyJhbGciOiJSUzI1NiJ9.eyJqdGkiOiJlNWFmYTZiOC04ZjM5LTQ5MjUtOWZiMC00MmY3MTM4YzUzMGIiLCJleHAiOjE0NDY4Mjk3OTksIm5iZiI6MCwAreYouKiddingIwouldnotputSOmethigRealHereNpb25fc3RhdGUiOiI2MmVmYzA1Yy0xYmY1LTRmNTUtYjc0OS01ZTBlZmY5NDE1NWIiLCJyZWFsbV9hY2Nlc3MiOnsicm9sZXMiOlsiYWRtaW4iLCJjcmVhdGUtcmVhbG0iXX0sInJlc291cmNlX2FjY2VzcyI6eyJ3Zm0tcmVhbG0iOnsicm9sZXMiOlsibWFuYWdlLWV2ZW50cyIsInZpZXctcmVhbG0iLCJtYW5hZ2UtY2xpZW50cyIsInZpZXctYXBwbGljYXRpb25zIiwidmlldy1ldmVudHMiLCJ2aWV3LXVzZXJzIiwidmlldy1jbGllbnRzIiwibWFuYWdlLXJlYWxtIiwibWFuYWdlLWFwcGxpY2F0aW9ucyIsIm1hbmFnZS11c2VycyJdfSwibWFzdGVyLXJlYWxtIjp7InJvbGVzIjpbInZpZXctcmVhbG0iLCJtYW5hZ2UtY2xpZW50cyIsIm1hbmFnZS1ldmVudHMiLCJ2aWV3LWFwcGxpY2F0aW9ucyIsInZpZXctZXZlbnRzIiwidmlldy11c2VycyIsInZpZXctY2xpZW50cyIsIm1hbmFnZS1hcHBsaWNhdGlvbnMiLCJtYW5hZ2UtdXNlcnMiLCJtYW5hZ2UtcmVhbG0iXX19fQ.WeiJOC1jQ52aKgnW8UN2Lv9rJ_yKZiOhijOYKLN2EEOkYF8rvRZsSKbTPFKTIUvjnwy2A7V_N-GhhJH4C-T7F5__QPNofSXbCNyvATj52jGLxk9V0Afvk-Z5QAWi55PJRTC0qteeMRcO2Frw-0KtKYe9o3UcGICJubxhZHsXBLA","token_type":"bearer","id_token":"eyJhbGciOiJSUzI1NiJ9.eyJuYW1lIjoiIiwianRpIjoiMGIyMGI0ODctOTI4OS00YTFhLTgyNmMtM2NiOTg0MDJkMzVkIiwiZXhwIjoxNDQ2ODI4MDU5LCJuYmYiOjAsImlhdCI6MTQ0NjgyNzk5OIwouldhaveToBeNutsUiLCJwcmVmZXJyZWRfdXNlcm5hbWUiOiJhZG1
pbiIsImVtYWlsX3ZlcmlmaWVkIjpmYWxzZX0.DmG8Lm4niL1djzNrLsZ2CrsB1ZzUPnR2Nm7IZnrwrmkXsrPxjl6pyXKCWSj6pbk2sgVI8NNFqrGIJmEJ7gkTZWm328VGGpJsmMuJBki0KbqBRKORGQSgkas_34rwzhcTE3Iki8h_YVs2vvNIx_eZSOvIzyEcP3IGHuBoxcR6W3E","not-before-policy":0,"session-state":"62efc05c-1bf5-4f55-b749-5e0eff94155b"}
In case anyone finds this post, this is what I ended up using:
if hash jq 2>/dev/null; then
# Use the jq command to safely parse json
bearerToken=$(echo $raw_json | jq -r '.access_token')
else
# Strip away all but the "access_token" field's value using a perl regular expression
bearerToken=$(echo $raw_json | grep -Po '"'"access_token"'"\s*:\s*"\K([^"]*)')
fi

Since not all regex flavors support lookbehind, Perl introduced the \K. In general when you have:
a\Kb
Here “a” must still match, but \K tells the engine to pretend that the match started at the position of \K, so only “b” is reported as the match.
In your example, you want to pretend that the match attempt started at what appears after the "access_token":" text.
This example will better demonstrate the \K usage:
~$ echo 'hello world' | grep -oP 'hello \K(world)'
world
~$ echo 'hello world' | grep -oP 'hello (world)'
hello world
In addition, \K allows a variable-length look-behind:
$ echo foooooo bar | grep -oP "(?<=foo+) \Kbar"
grep: lookbehind assertion is not fixed length
$ echo foooooo bar | grep -oP "foo+ \Kbar"
bar
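On the "why so many quotes?" part of the question: the shell joins adjacent quoted strings into a single argument, so grep never sees most of those quotes. A minimal sketch, assuming GNU grep built with -P support; the JSON sample is made up for illustration and is not real server output:

```shell
# Three adjacent quoted pieces: '"' + "access_token" + '"\s*:\s*"\K([^"]*)'
pattern='"'"access_token"'"\s*:\s*"\K([^"]*)'
printf '%s\n' "$pattern"    # shows what grep actually receives

# The same pattern as one single-quoted string (the group is not needed with -o).
simple='"access_token"\s*:\s*"\K[^"]*'

# Made-up sample input.
json='{"access_token":"abc123","expires_in":60}'
printf '%s\n' "$json" | grep -Po "$pattern"
printf '%s\n' "$json" | grep -Po "$simple"
```

The first printf shows "access_token"\s*:\s*"\K([^"]*), and both grep calls should print abc123.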

My solution was: sed -n 's/cut off this part \(display this part only\) cut off this part/\1/gp'
References:
https://www.cyberciti.biz/faq/unix-linux-sed-print-only-matching-lines-command/
info sed (texinfo package)
man 1 sed
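Applied to the access_token case, that sed template might look like the sketch below. The function name extract_token and the sample JSON are made up for illustration; the pattern uses only basic regular expressions, so it should not depend on GNU extensions:

```shell
# Hypothetical helper: print only the value of "access_token" read from stdin.
extract_token() {
  # Everything before and after the capture group is matched and discarded;
  # -n plus the p flag prints only lines where the substitution happened.
  sed -n 's/.*"access_token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

printf '%s\n' '{"access_token":"abc123","expires_in":60}' | extract_token
```

This should print abc123. Like the grep version, it assumes the token value contains no escaped double quotes.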

Related

Applying regex in bash

I'm trying to get my filename without its extension using a regex I found on Stack Overflow. The regex is:
(.+?)(\.[^.]*$|$)
I try this on the command line
echo TestFileName.1.0.0.2.zip | grep "(.+?)(\.[^.]*$|$)"
And I get nothing in the command line. If I try it with this regex:
echo TestFileName.1.0.0.2.zip | grep "Test"
I do see the TestFileName.1.0.0.2.zip gets printed to the console with Test highlighted in red. When I tried my data in this website: http://rubular.com/r/LNrI4inMU1
It does seem to work. Am I applying the regex wrong in Bash?
You're using an extended regular expression; the standard regex language which grep uses doesn't support what you're trying to do. Change grep to be grep -E and the match will work. This specifies that your regex is an extended one.
$ echo TestFileName.1.0.0.2.zip | grep -E "(.+?)(\.[^.]*$|$)"
TestFileName.1.0.0.2.zip
See this link for more information on the distinction between regular and extended regex.
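The difference can be checked directly. This is a small sketch assuming GNU grep; the helper names matches_bre and matches_ere are made up here. In the default (basic) syntax the characters ( ) + ? | are ordinary literals, so the pattern only matches input that really contains a parenthesis:

```shell
# Hypothetical helpers: succeed if the argument matches the pattern from the question.
matches_bre() { printf '%s\n' "$1" | grep -q  '(.+?)(\.[^.]*$|$)'; }   # basic syntax
matches_ere() { printf '%s\n' "$1" | grep -qE '(.+?)(\.[^.]*$|$)'; }   # extended syntax

name='TestFileName.1.0.0.2.zip'
if matches_bre "$name"; then echo "BRE: match"; else echo "BRE: no match"; fi
if matches_ere "$name"; then echo "ERE: match"; else echo "ERE: no match"; fi
```

Note that grep -E has no lazy quantifiers, so .+? there is not the non-greedy match it would be in Perl; the whole line matches, which is why the full file name is printed rather than the name without its extension.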
Using BASH regex:
s='TestFileName.1.0.0.2.zip'
[[ "$s" =~ ^(.*)\.[^.]+$ ]] && echo "${BASH_REMATCH[1]}"
TestFileName.1.0.0.2
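If the goal is only to drop the last extension, plain parameter expansion may be enough and avoids regexes altogether. This is a different technique from the =~ match above, sketched under the assumption that the input is a bare file name; strip_ext is a made-up helper name:

```shell
# Hypothetical helper: remove the shortest trailing ".something" from the argument.
strip_ext() {
  printf '%s\n' "${1%.*}"
}

strip_ext 'TestFileName.1.0.0.2.zip'
strip_ext 'README'    # no dot, so the name is returned unchanged
```

The first call should print TestFileName.1.0.0.2. For paths whose directory part contains a dot but whose file name does not, this would cut in the wrong place, so it is best applied to the base name only.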
Add -P (Perl-regexp) parameter to your grep along with -o (only-matching).
$ echo TestFileName.1.0.0.2.zip | grep -oP "(.+)(?=\.)"
TestFileName.1.0.0.2

Use sed to grab a string

I'm using curl to get the html from a site then I just need a specific string which is between 'standards.xml?revision=' and '&amp'. I'm using sed to do this but I can't seem to get the regex right and needed some help.
curl website.com | sed -r 's|.*standards\.xml\?revision=([0-9]+).*|\1|'
The output I'm getting is the full html--any help would be appreciated.
You're almost there. Use sed's -n option to suppress printing of unmatched lines, and add the p flag to the s||| command to print the result of the substitution:
curl website.com | sed -n -r 's|.*standards\.xml\?revision=([0-9]+).*|\1|p'
you can use grep -oP (PCRE option):
grep -oP 'standards\.xml\?revision=\K[0-9]+'
\K resets the start of the reported match, so only the later part, [0-9]+, is returned.
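Both approaches can be tried without hitting the network. The HTML fragment and the helper names rev_grep and rev_sed below are made up for illustration; the sketch assumes GNU grep for -P and a sed that accepts -E (the same extended syntax as -r):

```shell
# Made-up sample line standing in for the curl output.
html='<a href="standards.xml?revision=4711&amp;view=markup">standards.xml</a>'

# \K drops everything matched so far, so -o prints only the digits.
rev_grep() { grep -oP 'standards\.xml\?revision=\K[0-9]+'; }

# sed replaces the whole line with the captured digits and prints only changed lines.
rev_sed() { sed -n -E 's|.*standards\.xml\?revision=([0-9]+).*|\1|p'; }

printf '%s\n' "$html" | rev_grep
printf '%s\n' "$html" | rev_sed
```

Both commands should print 4711 for this sample.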
curl website.com | sed -n -r '/xml/ {s|.*standards\.xml\?revision=([^&]+).*|\1|p;q;}'
The [0-9]+ in the previous sed command only works if the revision is purely numeric; [^&]+ may be more appropriate. Note that -r is needed here for the unescaped ( ) and + to act as operators.
Using ' and | is a good way to avoid problems with \, so I kept them :-)

grep: group capturing

I have following string:
{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}
and I need to get value of "scheme version", which is 1234 in this example.
I have tried
grep -Eo "\"scheme_version\":(\w*)"
however it returns
"scheme_version":1234
How can I make it? I know I can add sed call, but I would prefer to do it with single grep.
You'll need to use a look behind assertion so that it isn't included in the match:
grep -Po '(?<=scheme_version":)[0-9]+'
This might work for you:
echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' |
sed -n 's/.*"scheme_version":\([^}]*\)}/\1/p'
1234
Sorry it's not grep, so disregard this solution if you like.
Or stick with grep and add:
grep -Eo "\"scheme_version\":(\w*)"| cut -d: -f2
I would recommend that you use jq for the job. jq is a command-line JSON processor.
$ cat tmp
{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}
$ cat tmp | jq .scheme_version
1234
As an alternative to the positive lookbehind method suggested by SiegeX, you can reset the match starting point to directly after scheme_version": with the \K escape sequence. E.g.,
$ grep -Po 'scheme_version":\K[0-9]+'
This restarts the matching process after having matched scheme_version":, and tends to have far better performance than the positive lookbehind. Comparing the two on regex101 demonstrates that the reset match start method takes 37 steps and 1ms, while the positive lookbehind method takes 194 steps and 21ms.
You can compare the performance yourself on regex101 and you can read more about resetting the match starting point in the PCRE documentation.
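To confirm that the two patterns return the same text, both can be run against the string from the question. A minimal sketch assuming GNU grep with -P support; the helper names are made up:

```shell
json='{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}'

# Lookbehind: the prefix is asserted but never part of the match.
with_lookbehind() { grep -Po '(?<=scheme_version":)[0-9]+'; }

# \K: the prefix is matched, then dropped from the reported match.
with_reset() { grep -Po 'scheme_version":\K[0-9]+'; }

printf '%s\n' "$json" | with_lookbehind
printf '%s\n' "$json" | with_reset
```

Both should print 1234. The first "scheme_version" in the line is a value followed by a comma, not a colon, so neither pattern matches there.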
To avoid using grep's PCRE feature, which is available in GNU grep but not in the BSD version, another method is to use ripgrep, e.g.
$ rg -o 'scheme_version.?:(\d+)' -r '$1' <file.json
1234
Here -r (--replace) replaces each match with the given text; capture group indices (e.g., $5) and names (e.g., $foo) can be used in the replacement.
Another example with Python and json.tool module which can validate and pretty-print:
$ python -mjson.tool file.json | rg -o 'scheme_version[^\d]+(\d+)' -r '$1'
1234
Related: Can grep output only specified groupings that match?
You can do this:
$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | awk -F ':' '{print $4}' | tr -d '}'
Improving on @potong's answer, which works only for "scheme_version", you can use this expression:
$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | sed -n 's/.*"_id":["]*\([^(",})]*\)[",}].*/\1/p'
scheme_version
$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | sed -n 's/.*"_rev":["]*\([^(",})]*\)[",}].*/\1/p'
4-cad1842a7646b4497066e09c3788e724
$ echo '{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}' | sed -n 's/.*"scheme_version":["]*\([^(",})]*\)[",}].*/\1/p'
1234
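Since the three commands differ only in the key name, they can be folded into one function. This is a sketch, not a JSON parser: json_field is a made-up helper, and it assumes a flat, single-line object and a key that contains no regex metacharacters:

```shell
# Hypothetical helper: print the value of key $1 from a one-line JSON object on stdin.
json_field() {
  # The key is spliced into the sed script between two single-quoted pieces.
  sed -n 's/.*"'"$1"'":["]*\([^(",})]*\)[",}].*/\1/p'
}

json='{"_id":"scheme_version","_rev":"4-cad1842a7646b4497066e09c3788e724","scheme_version":1234}'
printf '%s\n' "$json" | json_field _id
printf '%s\n' "$json" | json_field _rev
printf '%s\n' "$json" | json_field scheme_version
```

For anything nested, or values containing commas or quotes, jq remains the safer choice.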

Return a regex match in a Bash script, instead of replacing it

I just want to match some text in a Bash script. I've tried using sed but I can't seem to make it just output the match instead of replacing it with something.
echo -E "TestT100String" | sed 's/[0-9]+/dontReplace/g'
Which will output TestTdontReplaceString.
Which isn't what I want, I want it to output 100.
Ideally, it would put all the matches in an array.
edit:
Text input is coming in as a string:
newName()
{
#Get input from function
newNameTXT="$1"
if [[ $newNameTXT ]]; then
#Use code that im working on now, using the $newNameTXT string.
fi
}
You could do this purely in bash using the double square bracket [[ ]] test operator, which stores results in an array called BASH_REMATCH:
[[ "TestT100String" =~ ([0-9]+) ]] && echo "${BASH_REMATCH[1]}"
echo "TestT100String" | sed 's/[^0-9]*\([0-9]\+\).*/\1/'
echo "TestT100String" | grep -o '[0-9]\+'
The method you use to put the results in an array depends somewhat on how the actual data is being retrieved. There's not enough information in your question to be able to guide you well. However, here is one method:
index=0
while read -r line
do
array[index++]=$(echo "$line" | grep -o '[0-9]\+')
done < filename
Here's another way:
array=($(grep -o '[0-9]\+' filename))
Pure Bash. Use parameter substitution (no external processes and pipes):
string="TestT100String"
echo ${string//[^[:digit:]]/}
Removes all non-digits.
I know this is an old topic, but I came here through the same searches and found another great way to apply a regex to a string or variable using grep:
# Simple
$(echo "TestT100String" | grep -Po "[0-9]{3}")
# More complex using lookaround
$(echo "TestT100String" | grep -Po "(?i)TestT\K[0-9]{3}(?=String)")
Using lookaround capabilities, search expressions can be extended for better matching. Here (?i) makes the match case-insensitive, \K discards everything matched before it (so TestT acts like a lookbehind),
and (?=...) is a lookahead: it requires the given pattern to follow the match without including it in the output.
https://www.regular-expressions.info/lookaround.html
The given example matches the same as the PCRE regex TestT([0-9]{3})String
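A quick way to see what each piece of that pattern does is to run it with differently cased text. A minimal sketch assuming GNU grep with -P support; extract_code is a made-up helper name:

```shell
# (?i)       case-insensitive matching for the whole pattern
# testt\K    must match, but is dropped from the reported match
# (?=string) must follow the digits, but is not part of the match
extract_code() {
  grep -Po '(?i)testt\K[0-9]{3}(?=string)'
}

printf '%s\n' 'TestT100String' | extract_code
printf '%s\n' 'TESTT200STRING' | extract_code
```

These should print 100 and 200. The pattern is written in lowercase on purpose, so the output shows that (?i) is what makes the mixed-case input match.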
Use grep. Sed is an editor. If you only want to match a regexp, grep is more than sufficient.
using awk
linux$ echo -E "TestT100String" | awk '{gsub(/[^0-9]/,"")}1'
100
I don't know why nobody ever uses expr: it's portable and easy.
newName()
{
#Get input from function
newNameTXT="$1"
if num=`expr "$newNameTXT" : '[^0-9]*\([0-9][0-9]*\)'`; then
echo "contains $num"
fi
}
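A standalone version of the expr approach can be sketched as follows; first_number is a made-up helper name. The pattern is implicitly anchored at the start of the string, which is why the leading [^0-9]* is needed, and it sticks to basic regex syntax ([0-9][0-9]* rather than the GNU-only \+):

```shell
# Hypothetical helper: print the first run of digits found in the argument.
first_number() {
  expr "$1" : '[^0-9]*\([0-9][0-9]*\)'
}

first_number 'TestT100String'
first_number 'abc7def88'
```

These should print 100 and 7. One caveat when using it in an if: expr exits with status 1 when the result is empty or 0, so a string whose first number is 0 prints 0 but still takes the else branch.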
Well, sed with s/pattern1/pattern2/g simply replaces every occurrence of pattern1 with pattern2.
Besides that, sed prints the entire line by default.
I suggest piping the output to a cut command and extracting the numbers you want.
If you want to use only sed, then use tagged regular expressions (capture groups):
sed -n 's/.*\([0-9]\)\([0-9]\)\([0-9]\).*/\1,\2,\3/p'
I didn't run the above command, so make sure the syntax is right.
Hope this helped.
using just the bash shell
declare -a array
i=0
while read -r line
do
case "$line" in
*TestT*String* )
while true
do
line=${line#*TestT}
array[$i]=${line%%String*}
line=${line#*String*}
i=$((i+1))
case "$line" in
*TestT*String* ) continue;;
*) break;;
esac
done
esac
done <"file"
echo ${array[@]}

Match domain name from url (www.google.com=google)

So I want to match just the domain from either:
http://www.google.com/test/
http://google.com/test/
http://google.net/test/
Output should be for all 3: google
I got this code working for just .com
echo "http://www.google.com/test/" | sed -n "s/.*www\.\(.*\)\.com.*$/\1/p"
Output: 'google'
Then I thought it would be as simple as doing say (com|net) but that doesn't seem to be true:
echo "http://www.google.com/test/" | sed -n "s/.*www\.\(.*\)\.(com|net).*$/\1/p"
Output: '' (nothing)
I was going to use a similar method to get rid of the "www" but it seems I'm doing something wrong… (does it not work with regex outside the \( \) …)
This will output "google" in all cases:
sed -n "s|http://\(.*\.\)*\(.*\)\..*|\2|p"
Edit:
This version will handle URLs like "http://google.com.cn/test" and "http://www.google.co.uk/" as well as the ones in the original question:
sed -nr "s|http://(www\.)?([^.]*)\.(.*\.?)*|\2|p"
This version will handle cases that don't include "http://" (plus the others):
sed -nr "s|(http://)?(www\.)?([^.]*)\.(.*\.?)*|\3|p"
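The same steps (drop the scheme, drop the path, drop a leading www., keep the first label) can also be written with plain parameter expansion instead of sed. This is a different technique from the regex above, sketched with a made-up function name domain_label, and it should work in any POSIX shell:

```shell
# Hypothetical helper: print the first host label after an optional scheme and "www.".
domain_label() {
  host=${1#*://}      # remove "http://" or any other scheme, if present
  host=${host%%/*}    # remove the path
  host=${host#www.}   # remove a leading "www."
  printf '%s\n' "${host%%.*}"   # keep everything before the first remaining dot
}

domain_label 'http://www.google.com/test/'
domain_label 'http://google.net/test/'
domain_label 'www.google.com.cn/test'
```

Each call should print google. Like the sed versions, it treats whatever follows an optional www. as the name, so a host such as mail.google.com would yield mail.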
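if you have Python, you can use the urllib.parse module (called urlparse in Python 2)
from urllib.parse import urlparse  # Python 3; in Python 2 use "import urlparse"

with open("file") as urls:
    for line in urls:
        # netloc is the host part, e.g. "www.google.com"
        o = urlparse(line.strip())
        d = o.netloc.split(".")
        if d[0] == "www":
            print(d[1])
        else:
            print(d[0])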
output
$ cat file
http://www.google.com/test/
http://google.com/test/
http://google.net/test/
$ ./python.py
google
google
google
or you can use awk
awk -F"/" '{
gsub(/http:\/\/|\/.*$/,"")
split($0,d,".")
if(d[1]~/www/){
print d[2]
}else{
print d[1]
}
} ' file
$ cat file
http://www.google.com/test/
http://google.com/test/
http://google.net/test/
www.google.com.cn/test
google.com/test
$ ./shell.sh
google
google
google
google
google
s|http://(www\.)?([^.]*)|$2|
It's Perl with alternate delimiters (because it makes it more legible), I'm sure you can port it to sed or whatever you need.
#! /bin/bash
urls=( \
http://www.google.com/test/ \
http://google.com/test/ \
http://google.net/test/ \
)
for url in ${urls[@]}; do
echo $url | sed -re 's,^http://(.*\.)*(.+)\.[a-z]+/.+$,\2,'
done
Have you tried using the "-r" switch on your sed command? This enables the extended regular expression mode (egrep-compatible regexes).
Edit: try this. Note that sed does not support the Perl-style non-capturing group (?:...), so plain parentheses are used for the alternation; that group becomes \2, which is simply not used.
echo "http://www.google.com/test/" | sed -nr "s/.*www\.(.*)\.(com|net).*$/\1/p"