BASH URL path extraction [duplicate]

This question already has answers here:
Parse URL in shell script
(16 answers)
Closed 6 years ago.
I am trying to extract a path from the url with the following expression:
url=()
url+="http://www.google.co.uk/setprefdomain?prefdom=US&sig=__REM5I87ZmVOTkq-ipnJx6oisXz0%3D"
url_path=`echo "${url[0]}"| cut -d# -f2`
echo "$url_path"
I would like to get: /setprefdomain?prefdom=US&sig=__REM5I87ZmVOTkq-ipnJx6oisXz0%3D
Any ideas please?
An additional challenge comes when the URLs vary in format, for example:
url=()
url+="http://www.google.co.uk/setprefdomain?prefdom=US&sig=__REM5I87ZmVOTkq-ipnJx6oisXz0%3D"
url+="www.google.co.uk/shopping?hl=en&tab=wf"
url+="https://photos.google.com/?tab=wq"
url+="accounts.google.com/ServiceLogin?hl=en&passive=true&continue=http://www.google.co.uk"
Then result should be:
/setprefdomain?prefdom=US&sig=__REM5I87ZmVOTkq-ipnJx6oisXz0%3D
/shopping?hl=en&tab=wf
/?tab=wq
/ServiceLogin?hl=en&passive=true&continue=http://www.google.co.uk

echo $url | awk -F / '{print "/"$NF}'
/setprefdomain?prefdom=US&sig=__REM5I87ZmVOTkq-ipnJx6oisXz0%3D

In straight bash, if there are no slashes after google.co.uk/, you can use
url_path=${url[0]/#*\//\/}
The ${<var>/#<pat>/<repl>} construct replaces <pat> at the beginning (#) of the expansion of <var> with <repl>. Here, that is
var => url[0]
pat => *\/ , i.e., anything followed by a slash
repl => \/ , i.e., a single slash
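A quick way to see what this expansion keeps is to try it on a URL with a deeper path. The sketch below uses ${var##*/}, which removes the longest prefix matching */ and is assumed here to be an easier-to-read equivalent of the construct above; last_segment is a hypothetical helper name.

```bash
# Keep only the text after the final slash, with a slash prepended.
# last_segment is a hypothetical name used for illustration.
last_segment() {
    local u=$1
    printf '/%s\n' "${u##*/}"
}

last_segment 'http://www.google.co.uk/setprefdomain?prefdom=US'
last_segment 'http://www.google.co.uk/a/b?x=1'
```

The first call prints /setprefdomain?prefdom=US, but the second prints only /b?x=1, which is why this approach only suits URLs with no further slashes after the host.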

The issue with your code specifically is that you are supplying the wrong delimiter to cut as well as the wrong field. Instead of -d# you should be using -d'/', and in the example you provided you want the 4th field, not the second. So what you should have used was this:
url_path=`echo "${url[0]}"| cut -d'/' -f4`
echo $url_path
setprefdomain?prefdom=US&sig=__REM5I87ZmVOTkq-ipnJx6oisXz0%3D
However, that omits the beginning slash. If you need that, you can manually prepend it like this:
url_path='/'`echo "${url[0]}"| cut -d'/' -f4`
echo $url_path
/setprefdomain?prefdom=US&sig=__REM5I87ZmVOTkq-ipnJx6oisXz0%3D
If there is a chance the urls will have differing numbers of slashes though, you may need a more robust solution. If you're certain you always want the 4th field, this cut example will work fine. If you always want the last field, you will be better off with awk or with bash's parameter expansion.
Here's what you'd need for either of those:
With awk, the delimiter is set with -F /, and $NF accesses the final field:
url_path=`echo ${url[0]} | awk -F / '{print "/"$NF}'`
With bash parameter expansion, ${var/pattern/} removes pattern from var. The pattern http:\/\/*\/ matches everything from "http://" to the final slash. Once again, the final slash is not in the output and is manually prepended:
url_path=`echo "/${url/http:\/\/*\//}"`
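For URLs that vary in format (with or without a scheme, or with more slashes in the query string), one possible approach is to strip an optional leading scheme and then keep everything from the first remaining slash. This is only a sketch: url_path_of is a hypothetical helper name, and the scheme pattern is an assumption about what counts as a scheme.

```bash
# Print the path (plus query string) of a URL, with or without a scheme.
# url_path_of is a hypothetical helper, not part of any answer above.
url_path_of() {
    local u=$1
    local re='^[A-Za-z][A-Za-z0-9+.-]*://'
    # Drop a leading scheme such as http:// or https://, if present.
    if [[ $u =~ $re ]]; then
        u=${u#"${BASH_REMATCH[0]}"}
    fi
    # Everything after the first remaining slash is the path.
    if [[ $u == */* ]]; then
        printf '/%s\n' "${u#*/}"
    else
        printf '/\n'
    fi
}

urls=(
    "http://www.google.co.uk/setprefdomain?prefdom=US&sig=__REM5I87ZmVOTkq-ipnJx6oisXz0%3D"
    "www.google.co.uk/shopping?hl=en&tab=wf"
    "https://photos.google.com/?tab=wq"
    "accounts.google.com/ServiceLogin?hl=en&passive=true&continue=http://www.google.co.uk"
)
for u in "${urls[@]}"; do
    url_path_of "$u"
done
```

Note that each URL is appended as its own array element with parentheses; url+="..." without parentheses appends to the string in element 0 instead. Because the scheme is only matched at the start of the string, the http:// inside the last URL's query string is left alone.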

Related

Replace unknown sub-string in an URL

I have an URL in the format like https://foo.bar.whoo.dum.io, for which I like to replace the foo string with something else. Of course, the foo part is unknown and can be anything.
I tried with a simple regex like (.+?)\.(.+), but it seems that regex in Bash is always greedy (or?).
My best attempt is to split the string by . and then join it back with the first part left out, but I was wondering, whether there is a more intuitive, different solution.
Thank you

There are a lot of ways of getting the desired output.
If you're sure the url will always start with https://, we can use parameter expansion to remove everything before the first . and then add the replacement you need:
input="https://foo.bar.whoo.dum.io"
echo "https://new.${input#*.}"
Will output
https://new.bar.whoo.dum.io

You can use sed:
url='https://foo.bar.whoo.dum.io'
url=$(sed 's,\(.*://\)[^/.]*,\1new_value,' <<< "$url")
Here, the sed command means:
\(.*://\) - Capturing group 1: any text and then ://
[^/.]* - zero or more chars other than / and .
\1new_value - replaces the match with the Group 1 and new_value is appended to this group value.
See the online demo:
url='https://foo.bar.whoo.dum.io'
sed 's,\(.*://\)[^/.]*,\1new_value,' <<< "$url"
# => https://new_value.bar.whoo.dum.io

1st solution: Using bash parameter expansion, where newValue is a variable holding the new value you want in your URL.
url='https://foo.bar.whoo.dum.io'
newValue="newValue"
echo "${url%//*}//$newValue.${url#*.}"
2nd solution: With your shown sample, try the following sed command, where the variable url holds your sample URL.
echo "$url" | sed 's/:\/\/[^.]*/:\/\/new_value/'
Explanation: echo prints the value of the shell variable url and pipes it to sed as standard input. The sed substitution then replaces :// plus everything up to the first occurrence of . with ://new_value.

Using sed (or any other tool) to remove the quotes in a json file

I have a json file
{"doc_type":"user","requestId":"1000778","clientId":"42114"}
I want to change it to
{"doc_type":"user","requestId":1000778,"clientId":"42114"}
i.e. convert the requestId from String to Integer. I have tried some ways, but none seem to work :
sed -e 's/"requestId":"[0-9]"/"requestId":$1/g' test.json
sed -e 's/"requestId":"\([0-9]\)"/"requestId":444/g' test.json
Could someone help me out please?

Try
sed -e 's/\("requestId":\)"\([0-9]*\)"/\1\2/g' test.json
or
sed -e 's/"requestId":"\([0-9]*\)"/"requestId":\1/g' test.json
The main differences with your attempts are:
Your regular expressions were looking for [0-9] between double quotes, and that's a single digit. By using [0-9]* instead you are looking for any number of digits (zero or more digits).
If you want to copy a sequence of characters from your search in your replacing string, you need to define a group with a starting \( and a final \) in the regexp, and then use \1 in the replacing string to insert the string there. If there are multiple groups, you use \1 for the first group, \2 for the second group, and so on.
Also note that the final g after the last / is used to apply this substitution in all matches, in every processed line. Without that g, the substitution would only be applied to the first match in every processed line. Therefore, if you are only expecting one such replacement per line, you can drop that g.
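The effect of the capture group and of the trailing g can be seen on a line with two quoted numbers. This is a small sketch on made-up input; unquote_first and unquote_all are hypothetical helper names, and [0-9][0-9]* is used so that at least one digit is required.

```bash
# Without g: only the first quoted number on each line loses its quotes.
unquote_first() { sed 's/"\([0-9][0-9]*\)"/\1/'; }
# With g: every quoted number on each line loses its quotes.
unquote_all()   { sed 's/"\([0-9][0-9]*\)"/\1/g'; }

echo '{"a":"1","b":"22"}' | unquote_first
echo '{"a":"1","b":"22"}' | unquote_all
```

The first command prints {"a":1,"b":"22"} and the second prints {"a":1,"b":22}.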

Since you said "or any other tool", I'd recommend jq! While sed is great for line-based text, JSON is not line-based, and newlines are sometimes added just to pretty-print the output and make developers' lives easier. Its rules also get trickier when handling Unicode or double quotes in string content. jq is specifically designed to understand the JSON format and can dissect it appropriately.
For your case, this should do the job:
jq '.requestId = (.requestId | tonumber)'
Note, this will throw an error if requestId is missing and not output the JSON object. If that's a concern, you might need something a little more sophisticated like this example:
jq 'if has("requestId") then .requestId = (.requestId | tonumber) else . end'
Also, jq pretty-prints and colorizes its output if sent to a terminal. To avoid that and just see a compact, one-line-per-object format, add -Mc to the command. jq will also work if provided multiple objects back-to-back without a newline in the input. Here's a full demo to show this filter:
$ (echo '{"doc_type":"bare"}{}'
echo '{"doc_type":"user","requestId":"0092","clientId":"11"}'
echo '{"doc_type":"user","requestId":"1000778","clientId":"42114"}'
) | jq 'if has("requestId") then .requestId = (.requestId | tonumber) else . end' -Mc
Which produced this output:
{"doc_type":"bare"}
{}
{"doc_type":"user","requestId":92,"clientId":"11"}
{"doc_type":"user","requestId":1000778,"clientId":"42114"}

sed -e 's/"requestId":"\([0-9]\+\)"/"requestId":\1/g' test.json
You were close. The "new" regex terms I had to add: \1 means "whatever is contained in the first \( \) on the search side", and \+ means "1 or more of the previous thing".
Thus, we search for the string "requestId":" followed by a group of 1 or more digits, followed by ", and replace it with "requestId": followed by that group we found earlier.

Perhaps the jq (json query) tool would help you out?
$ cat test
{"doc_type":"user","requestId":"1000778","clientId":"42114"}
$ cat test |jq '.doc_type' --raw-output
user
$

bash regexp to extract part of URL

From the following URL:
https://console.developers.google.com/storage/browser/test-lab-acteghe53j0sf-jrf3f8u8p12n4/2017-09-27_15:23:07.566833_MPoy/]
I need to extract the following part:
test-lab-acteghe53j0sf-jrf3f8u8p12n4/2017-09-27_15:23:07.566833_MPoy/
I'm pretty bad at regex. I came up with the following but it doesn't work:
sed -n "s/^.*browser\(test-lab.*/.*/\).*$/\1/p"
Can anyone help with what I'm doing wrong?

You could also try this awk solution:
echo "https://console.developers.google.com/storage/browser/test-lab-acteghe53j0sf-jrf3f8u8p12n4/2017-09-27_15:23:07.566833_MPoy/" | awk '{sub(/.*browser\//,"");sub(/\/$/,"");print}'
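Explanation: The first sub() removes everything up to and including browser/, and the second removes the final /. Since the requested output keeps the trailing slash, drop the second sub() if you need it.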
EDIT1: Adding a sed solution here too.
sed 's/\(.[^//]*\)\/\/\(.[^/]*\)\(.[^/]*\)\(.[^/]*\)\/\(.*\)/\5/' Input_file
Output will be as follows.
test-lab-acteghe53j0sf-jrf3f8u8p12n4/2017-09-27_15:23:07.566833_MPoy/
Explanation of the sed command: It divides the whole line into parts, using sed's ability to keep matched groups in memory. These are the parts:
\(.[^//]*\): Holds everything up to https: and could be printed with \1, because it is sed's first buffer.
\/\/: Matches the // that comes next in the URL.
\(.[^/]*\): The second buffer, which holds console.developers.google.com, because the match stops at the first occurrence of /.
\(.[^/]*\), \(.[^/]*\) and \/\(.*\): The next three groups work the same way. The third and fourth buffers hold /storage and /browser, and the fifth holds everything after the following /.
/\5/: Replaces the whole line with \5, the fifth buffer, which contains the part the OP asked for.

Use a different sed delimiter and don't forget to escape the parentheses.
avinash:~/Desktop$ echo 'https://console.developers.google.com/storage/browser/test-lab-acteghe53j0sf-jrf3f8u8p12n4/2017-09-27_15:23:07.566833_MPoy/]' | sed 's~.*/browser/\([^/]*/[^/]*/\).*~\1~'
test-lab-acteghe53j0sf-jrf3f8u8p12n4/2017-09-27_15:23:07.566833_MPoy/
OR
Use grep with the -oP options (-P, for Perl-compatible regexes, is a GNU grep option).
avinash:~/Desktop$ echo 'https://console.developers.google.com/storage/browser/test-lab-acteghe53j0sf-jrf3f8u8p12n4/2017-09-27_15:23:07.566833_MPoy/]' | grep -oP '/browser/\K[^/]*/[^/]*/'
test-lab-acteghe53j0sf-jrf3f8u8p12n4/2017-09-27_15:23:07.566833_MPoy/
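Where neither sed nor GNU grep is wanted, the same capture can be sketched in bash itself with [[ =~ ]] and BASH_REMATCH. This assumes the wanted part is always the two path segments right after /browser/; path_after_browser is a hypothetical helper name.

```bash
# Print the two path segments (with trailing slash) that follow /browser/.
# Returns a non-zero status when the URL does not match.
path_after_browser() {
    local re='/browser/([^/]+/[^/]+/)'
    [[ $1 =~ $re ]] && printf '%s\n' "${BASH_REMATCH[1]}"
}

path_after_browser 'https://console.developers.google.com/storage/browser/test-lab-acteghe53j0sf-jrf3f8u8p12n4/2017-09-27_15:23:07.566833_MPoy/]'
```

The regex is kept in a variable so that the shell does not need any quoting inside [[ ]]. Anything after the second segment's slash, such as the trailing ], is ignored.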

Regular Expression to parse Common Name from Distinguished Name

I am attempting to parse (with sed) just First Last from the following DN(s) returned by the DSCL command in OSX terminal bash environment...
CN=First Last,OU=PCS,OU=guests,DC=domain,DC=edu
I have tried multiple regexs from this site and others with questions very close to what I wanted... mainly this question... I have tried following the advice to the best of my ability (I don't necessarily consider myself a newbie...but definitely a newbie to regex..)
DSCL returns a list of DNs, and I would like to only have First Last printed to a text file. I have attempted using sed, but I can't seem to get the correct function. I am open to other commands to parse the output. Every line begins with CN= and then there is a comma between Last and OU=.
Thank you very much for your help!

I think all of the regular expression answers provided so far are buggy, insofar as they do not properly handle quoted ',' characters in the common name. For example, consider a distinguishedName like:
CN=Doe\, John,CN=Users,DC=example,DC=local
Better to use a real library able to parse the components of a distinguishedName. If you're looking for something quick on the command line, try piping your DN to a command like this:
echo "CN=Doe\, John,CN=Users,DC=activedir,DC=local" | python3 -c 'import sys, ldap.dn; print(ldap.dn.explode_dn(sys.stdin.read().strip(), notypes=1)[0])'
(depends on having the python-ldap library installed). You could cook up something similar with PHP's built-in ldap_explode_dn() function.

Two cut commands is probably the simplest (although not necessarily the best):
DSCL | cut -d, -f1 | cut -d= -f2
First, split the output from DSCL on commas and print the first field ("CN=First Last"); then split that on equal signs and print the second field.

Using sed:
sed 's/^CN=\([^,]*\).*/\1/' input_file
^ matches start of line
CN= literal string match
\([^,]*\) everything until a comma
.* rest
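The pattern above stops at the first comma, so a common name containing an escaped comma (such as Doe\, John) would be cut short. Below is a minimal sketch of one variant, assuming commas inside the name are escaped with a backslash and every line starts with CN=; extract_cn is a hypothetical helper name.

```bash
# Read DNs on stdin and print the first CN value of each line.
# The group accepts either a character that is neither a comma nor a
# backslash, or a backslash followed by any one character (an escape).
extract_cn() {
    sed -E 's/^CN=(([^,\\]|\\.)*).*/\1/'
}

printf '%s\n' 'CN=First Last,OU=PCS,OU=guests,DC=domain,DC=edu' | extract_cn
printf '%s\n' 'CN=Doe\, John,CN=Users,DC=example,DC=local' | extract_cn
```

This prints First Last and Doe\, John. The backslash is kept in the output, and DNs that quote the value with double quotes instead are not handled.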

http://www.gnu.org/software/gawk/manual/gawk.html#Field-Separators
awk -v RS=',' -v FS='=' '$1=="CN"{print $2}' foo.txt

I like awk too, so I print the substring from the fourth char:
DSCL | awk -F, '{print substr($1,4)}' > filterednames.txt

This regex will parse a distinguished name, giving name and val as capture groups for each match.
When DN strings contain commas, they are meant to be quoted - this regex correctly handles both quoted and unquoted strings, and also handles escaped quotes in quoted strings:
(?:^|,\s?)(?:(?<name>[A-Z]+)=(?<val>"(?:[^"]|"")+"|[^,]+))+
Here it is nicely formatted:
(?:^|,\s?)
(?:
(?<name>[A-Z]+)=
(?<val>"(?:[^"]|"")+"|[^,]+)
)+
Here's a link so you can see it in action:
https://regex101.com/r/zfZX3f/2
If you want a regex to get only the CN, then this adapted version will do it:
(?:^|,\s?)(?:CN=(?<val>"(?:[^"]|"")+"|[^,]+))

Extract string from string using RegEx in the Terminal [duplicate]

This question already has answers here:
How to extract a value from a string using regex and a shell?
(7 answers)
Closed 4 years ago.
I have a string like first url, second url, third url and would like to extract only the url after the word second in the OS X Terminal (only the first occurrence). How can I do it?
In my favorite editor I used the regex /second (url)/ and used $1 to extract it, I just don't know how to do it in the Terminal.
Keep in mind that url is an actual url, I'll be using one of these expressions to match it: Regex to match URL

echo 'first url, second url, third url' | sed 's/.*second//'
Edit: I misunderstood. Better:
echo 'first url, second url, third url' | sed 's/.*second \([^ ]*\).*/\1/'
or:
echo 'first url, second url, third url' | perl -nle 'm/second ([^ ]*)/; print $1'

Piping to another process (like 'sed' and 'perl' suggested above) might be very expensive, especially when you need to run this operation multiple times. Bash does support regexp:
[[ "string" =~ regex ]]
Similarly to the way you extract matches in your favourite editor by using $1, $2, etc., Bash fills in the $BASH_REMATCH array with all the matches.
In your particular example:
str="first url1, second url2, third url3"
if [[ $str =~ (second )([^,]*) ]]; then
echo "match: '${BASH_REMATCH[2]}'"
else
echo "no match found"
fi
Output:
match: 'url2'
Specifically, =~ supports extended regular expressions as defined by POSIX, but with platform-specific extensions (which vary in extent and can be incompatible).
On Linux platforms (GNU userland), see man grep; on macOS/BSD platforms, see man re_format.

In the other answer provided you still remain with everything after the desired URL. So I propose you the following solution.
echo 'first url, second url, third url' | sed 's/.*second \(url\)*.*/\1/'
In sed you group an expression by escaping the parentheses around it (POSIX standard).

While trying this, what you probably forgot was the -E argument for sed.
From sed --help:
-E, -r, --regexp-extended
use extended regular expressions in the script
(for portability use POSIX -E).
You don't have to change your regex significantly, but you do need to add .* to match greedily around it to remove the other part of string.
This works fine for me:
echo "first url, second url, third url" | sed -E 's/.*second (url).*/\1/'
Output:
url
Here the output "url" is actually the second instance in the string. If you already know that each URL sits between a space and a comma, and you don't allow commas in URLs, then the regex [^,]* should be fine.
Optionally:
echo "first http://test.url/1, second ://test.url/with spaces/2, third ftp://test.url/3" \
| sed -E 's/.*second ([a-zA-Z]*:\/\/[^,]*).*/\1/'
Which correctly outputs:
://test.url/with spaces/2
