Match domain name from url (www.google.com=google) - regex

So I want to match just the domain from either:
http://www.google.com/test/
http://google.com/test/
http://google.net/test/
Output should be for all 3: google
I got this code working for just .com
echo "http://www.google.com/test/" | sed -n "s/.*www\.\(.*\)\.com.*$/\1/p"
Output: 'google'
Then I thought it would be as simple as doing say (com|net) but that doesn't seem to be true:
echo "http://www.google.com/test/" | sed -n "s/.*www\.\(.*\)\.(com|net).*$/\1/p"
Output: '' (nothing)
I was going to use a similar method to get rid of the "www", but it seems I'm doing something wrong… (does it not work with regex outside the \( \) …)

This will output "google" in all cases:
sed -n "s|http://\(.*\.\)*\(.*\)\..*|\2|p"
Edit:
This version will handle URLs like "http://google.com.cn/test" and "http://www.google.co.uk/" as well as the ones in the original question:
sed -nr "s|http://(www\.)?([^.]*)\.(.*\.?)*|\2|p"
This version will handle cases that don't include "http://" (plus the others):
sed -nr "s|(http://)?(www\.)?([^.]*)\.(.*\.?)*|\3|p"
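As a rough sketch of the same idea, the expression can be wrapped in a small shell function that also tolerates https:// (an assumption on my part, the question only mentions http://). The name domain_label is made up for illustration, and -E is used as the more portable spelling of -r.

```shell
# domain_label is a hypothetical helper name, not taken from the answers above.
# It prints the first host label after an optional scheme and an optional "www.".
domain_label() {
    printf '%s\n' "$1" | sed -E 's|^(https?://)?(www\.)?([^./]+)\..*$|\3|'
}

domain_label "http://www.google.com/test/"
domain_label "http://google.net/test/"
domain_label "www.google.com.cn/test"
```

Like the sed versions above, this only returns the first label after "www.", so a host such as mail.google.com would give "mail".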

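If you have Python, you can use the urlparse module (urllib.parse in Python 3):
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
for line in open("file"):
    # netloc is the host part of the URL, e.g. "www.google.com"
    d = urlparse(line.strip()).netloc.split(".")
    if d[0] == "www":
        print(d[1])
    else:
        print(d[0])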
output
$ cat file
http://www.google.com/test/
http://google.com/test/
http://google.net/test/
$ ./python.py
google
google
google
or you can use awk
awk -F"/" '{
    gsub(/http:\/\/|\/.*$/,"")
    split($0,d,".")
    if(d[1]~/www/){
        print d[2]
    }else{
        print d[1]
    }
} ' file
$ cat file
http://www.google.com/test/
http://google.com/test/
http://google.net/test/
www.google.com.cn/test
google.com/test
$ ./shell.sh
google
google
google
google
google

s|http://(www\.)?([^.]*).*|$2|
It's Perl with alternate delimiters (because it makes it more legible), I'm sure you can port it to sed or whatever you need.

#! /bin/bash
urls=(
    http://www.google.com/test/
    http://google.com/test/
    http://google.net/test/
)
for url in "${urls[@]}"; do
    echo "$url" | sed -re 's,^http://(.*\.)*(.+)\.[a-z]+/.+$,\2,'
done

Have you tried using the "-r" switch on your sed command? This enables the extended regular expression mode (egrep-compatible regexes).
Edit: try this, it seems to work. Note that sed's extended regexes have no "(?:...)" non-capturing groups (that is Perl/PCRE syntax), so use a plain group for com|net; it is captured as \2 and simply not used in the replacement.
echo "http://www.google.com/test/" | sed -nr "s/.*www\.(.*)\.(com|net).*$/\1/p"
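The reason the (com|net) attempt in the question printed nothing is that, without -r, sed uses basic regular expressions, where ( ) and | are ordinary characters unless escaped. Here is a small sketch of both spellings. The function names are invented for illustration, and \| in basic mode is a GNU sed extension, so the extended form is the safer one to rely on.

```shell
# Hypothetical helper names, for illustration only.

# Basic regex (default mode): groups and alternation must be backslash-escaped.
# The \| operator here is a GNU sed extension.
strip_bre() {
    printf '%s\n' "$1" | sed -n 's/.*www\.\(.*\)\.\(com\|net\).*$/\1/p'
}

# Extended regex (-E, or -r on GNU sed): ( ) and | are operators as written.
strip_ere() {
    printf '%s\n' "$1" | sed -n -E 's/.*www\.(.*)\.(com|net).*$/\1/p'
}

strip_bre "http://www.google.com/test/"
strip_ere "http://www.google.net/test/"
```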

Related

How to remove special characters like a single quote from a string?

I tried using sed, but it did not work out.
Basically, I have a string, say:
Input:
'http://www.google.com/photos'
Output required:-
http://www.google.com
I tried using sed, but I could not escape the '.
What I did was:
sed 's/\'//' | sed 's/photos//'
The sed for photos worked, but the one for ' didn't.
Please suggest a solution.
Escaping ' in sed is possible via a workaround:
sed 's/'"'"'//g'
#      |^^^+--- bash string with the single quote inside
#      |   '--- return to sed string
#      '------- leave sed string and go to bash
But for this job you should use tr:
tr -d "'"
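To make the quoting trick a bit more concrete, here is a minimal sketch that runs the same input through three variants and ends up with the same result each time. The variable names are mine; the input string is the one from the question.

```shell
s="'http://www.google.com/photos'"

# 1. Leave the single-quoted sed script, add a double-quoted ', then re-enter it.
a=$(printf '%s\n' "$s" | sed 's/'"'"'//g')

# 2. Put the whole sed script in double quotes, so ' needs no escaping.
b=$(printf '%s\n' "$s" | sed "s/'//g")

# 3. Delete the character with tr.
c=$(printf '%s\n' "$s" | tr -d "'")

printf '%s\n' "$a" "$b" "$c"
```

All three lines of output should be the URL without the surrounding quotes.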
Perl substitutions have a syntax nearly identical to sed's, and Perl is installed by default on almost every system and works the same way on all machines (portability):
$ echo "'http://www.google.com/photos'" |perl -pe "s#\'##g;s#(.*//.*/)(.*$)#\1#g"
http://www.google.com/
Mind that this solution will keep only the domain name with http in front, discarding all words following http://www.google.com/
If you want to do it with sed , you can use sed "s/'//g" as advised by Wiktor Stribiżew in comments.
PS: I sometimes refer to special characters by their ASCII hex code, as listed in man ascii; for ' that is \x27.
So for sed you can do it:
$ echo "'http://www.google.com/photos'" |sed -r "s#'##g; s#(.*//.*/)(.*$)#\1#g;"
http://www.google.com/
# sed "s#\x27##g" will also remove the single quote using hex ascii code.
$ echo "'http://www.google.com/photos'" |sed -r "s#'##g; s#(.*//.*)(/.*$)#\1#g;"
http://www.google.com #Without the last slash
If your string is stored in a variable, you can achieve above operations with pure bash, without the need of external tools like sed or perl like this:
$ a="'http://www.google.com/photos'" && a="${a:1:-1}" && echo "$a"
http://www.google.com/photos
# This removes 1st and last char of the variable , whatever this char is.
$ a="'http://www.google.com/photos'" && a="${a:1:-1}" && echo "${a%/*}"
http://www.google.com
#This deletes every char from the end of the string up to the first found slash /.
#If you need the last slash you can just add it to the echo manually like echo "${a%/*}/" -->http://www.google.com/
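A variation on the pure-shell approach, as a sketch: the ${a:1:-1} form removes the first and last character whatever they are, while the # and % forms below only remove a single quote if one is actually there, and they are plain POSIX parameter expansion rather than bash-only. The variable base is a name I made up.

```shell
a="'http://www.google.com/photos'"

a=${a#"'"}     # drop one leading single quote, if present
a=${a%"'"}     # drop one trailing single quote, if present
base=${a%/*}   # drop the shortest trailing match of "/*", i.e. "/photos"

printf '%s\n' "$a" "$base"
```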
It's unclear if the ' are actually around your string, although this should take care of it:
str="'http://www.google.com/photos'"
echo "$str" | sed s/\'//g | sed 's/\/photos//g'
Combined:
echo "$str" | sed -e "s/'//g" -e 's/\/photos//g'
Using tr:
echo "$str" | sed -e "s/\/photos//g" | tr -d \'
Result:
http://www.google.com
If the single quotes are not around your string it should work regardless.

What does '\K' mean in this regex?

Given the following shell script, would someone be so kind as to explain the grep -Po regex please?
#!/bin/bash
# Issue the request for a bearer token, json is returned
raw_json=`curl -s -X POST -d "username=name&password=secret&client_id=security-admin-console" http://localhost:8081/auth/realms/master/tokens/grants/access`
# Strip away all but the "access_token" field's value using a Python regular expression
bearerToken=`echo $raw_json | grep -Po '"'"access_token"'"\s*:\s*"\K([^"]*)'`
echo "The bearer token is:"
echo $bearerToken
So specifically, I'm interested in understanding the parts of the regex
grep -Po '"'"access_token"'"\s*:\s*"\K([^"]*)'
and how it works. Why so many quotes? What is the "K" for? I've some experience with grep regex but this confuses me.
This is the actual output of the curl command and the shell script (grep) works as desired returning just the contents of the "access_token" value.
{"access_token":"eyJhbGciOiJSandNoThisIsntRealndmbS1yZWFsbSI6eyJyb2xlcyI6WyJtYW5hZ2UtY2xpZW50cyIsInZpZXctcmVhbG0iLCJtYW5hZ2UtZXZlbnRzIiwidmlldy1ldmVudHMiLCJ2aWV3LWFwcGxpY2F0aW9ucyIsInZpZXctdXNlcnMiLCJ2aWV3LWNsaWVudHMiLCJtYW5hZ2UtdXNlcnMiLCJtYW5hZ2UtYXBwbGljYXRpb25zIiwibWFuYWdlLXJlYWxtIl19LCJtYXN0ZXItcmVhbG0iOnsicm9sZXMiOlsibWFuYWdlLWV2ZW50cyIsIm1hbmFnZS1jbGllbnRzIiwidmlldy1yZWFsbSIsInZpZXctZXZlbnRzIiwidmlldy1hcHBsaWNhdGlvbnMiLCJ2aWV3LXVzZXJzIiwidmlldy1jbGllbnRzIiwibWFuYWdlLXJlYWxtIiwibWFuYWdlLXVzZXJzIiwibWFuYWdlLWFwcGxpY2F0aW9ucyJdfX19.fQmQKn-xatvflHPAaxCfrrVow3ynpw0sREho7__jZo2d0g1SwZV7Lf4C26CcweNLlb3wmKHHo63HRz35qRxJ7BXyiZwHgXokvDJj13yuOb6Sirg9z02n6fwGy8Iog30pUvffnDaVnUWHfVL-h_R4-OZNf-_YUK5RcL2DHt0zUXI","expires_in":60,"refresh_expires_in":1800,"refresh_token":"eyJhbGciOiJSUzI1NiJ9.eyJqdGkiOiJlNWFmYTZiOC04ZjM5LTQ5MjUtOWZiMC00MmY3MTM4YzUzMGIiLCJleHAiOjE0NDY4Mjk3OTksIm5iZiI6MCwAreYouKiddingIwouldnotputSOmethigRealHereNpb25fc3RhdGUiOiI2MmVmYzA1Yy0xYmY1LTRmNTUtYjc0OS01ZTBlZmY5NDE1NWIiLCJyZWFsbV9hY2Nlc3MiOnsicm9sZXMiOlsiYWRtaW4iLCJjcmVhdGUtcmVhbG0iXX0sInJlc291cmNlX2FjY2VzcyI6eyJ3Zm0tcmVhbG0iOnsicm9sZXMiOlsibWFuYWdlLWV2ZW50cyIsInZpZXctcmVhbG0iLCJtYW5hZ2UtY2xpZW50cyIsInZpZXctYXBwbGljYXRpb25zIiwidmlldy1ldmVudHMiLCJ2aWV3LXVzZXJzIiwidmlldy1jbGllbnRzIiwibWFuYWdlLXJlYWxtIiwibWFuYWdlLWFwcGxpY2F0aW9ucyIsIm1hbmFnZS11c2VycyJdfSwibWFzdGVyLXJlYWxtIjp7InJvbGVzIjpbInZpZXctcmVhbG0iLCJtYW5hZ2UtY2xpZW50cyIsIm1hbmFnZS1ldmVudHMiLCJ2aWV3LWFwcGxpY2F0aW9ucyIsInZpZXctZXZlbnRzIiwidmlldy11c2VycyIsInZpZXctY2xpZW50cyIsIm1hbmFnZS1hcHBsaWNhdGlvbnMiLCJtYW5hZ2UtdXNlcnMiLCJtYW5hZ2UtcmVhbG0iXX19fQ.WeiJOC1jQ52aKgnW8UN2Lv9rJ_yKZiOhijOYKLN2EEOkYF8rvRZsSKbTPFKTIUvjnwy2A7V_N-GhhJH4C-T7F5__QPNofSXbCNyvATj52jGLxk9V0Afvk-Z5QAWi55PJRTC0qteeMRcO2Frw-0KtKYe9o3UcGICJubxhZHsXBLA","token_type":"bearer","id_token":"eyJhbGciOiJSUzI1NiJ9.eyJuYW1lIjoiIiwianRpIjoiMGIyMGI0ODctOTI4OS00YTFhLTgyNmMtM2NiOTg0MDJkMzVkIiwiZXhwIjoxNDQ2ODI4MDU5LCJuYmYiOjAsImlhdCI6MTQ0NjgyNzk5OIwouldhaveToBeNutsUiLCJwcmVmZXJyZWRfdXNlcm5hbWUiOiJhZG1
pbiIsImVtYWlsX3ZlcmlmaWVkIjpmYWxzZX0.DmG8Lm4niL1djzNrLsZ2CrsB1ZzUPnR2Nm7IZnrwrmkXsrPxjl6pyXKCWSj6pbk2sgVI8NNFqrGIJmEJ7gkTZWm328VGGpJsmMuJBki0KbqBRKORGQSgkas_34rwzhcTE3Iki8h_YVs2vvNIx_eZSOvIzyEcP3IGHuBoxcR6W3E","not-before-policy":0,"session-state":"62efc05c-1bf5-4f55-b749-5e0eff94155b"}
In case anyone finds this post, this is what I ended up using:
if hash jq 2>/dev/null; then
    # Use the jq command to safely parse json
    bearerToken=$(echo $raw_json | jq -r '.access_token')
else
    # Strip away all but the "access_token" field's value using a perl regular expression
    bearerToken=$(echo $raw_json | grep -Po '"'"access_token"'"\s*:\s*"\K([^"]*)')
fi
Since not all regex flavors support lookbehind, Perl introduced the \K. In general when you have:
a\Kb
When “b” is matched, \K tells the engine to pretend that the match attempt started at this position.
In your example, you want to pretend that the match attempt started at what appears after the "access_token":" text.
This example will better demonstrate the \K usage:
~$ echo 'hello world' | grep -oP 'hello \K(world)'
world
~$ echo 'hello world' | grep -oP 'hello (world)'
hello world
In addition, \K allows a variable-length look-behind:
$ echo foooooo bar | grep -oP "(?<=foo+) \Kbar"
grep: lookbehind assertion is not fixed length
$ echo foooooo bar | grep -oP "foo+ \Kbar"
bar
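As for the "why so many quotes?" part of the question: the shell glues adjacent quoted pieces into one word, so '"' followed by "access_token" followed by '"\s*...' reaches grep as exactly the same pattern as if it had been written inside one pair of single quotes. A small sketch, assuming GNU grep with -P support; the JSON sample and its token value abc123 are made up for illustration.

```shell
# Hypothetical sample input, much shorter than the real response.
json='{"access_token":"abc123","expires_in":60}'

# The pattern as written in the script: three quoted pieces glued together.
quoted='"'"access_token"'"\s*:\s*"\K([^"]*)'
# The same pattern written in a single pair of single quotes.
plain='"access_token"\s*:\s*"\K([^"]*)'

# \K discards everything matched so far, so -o prints only the token value.
token=$(printf '%s\n' "$json" | grep -Po "$plain")
printf '%s\n' "$token"
```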
My solution was: sed -n 's/cut off this part \(display this part only\) cut off this part/\1/gp'
References:
https://www.cyberciti.biz/faq/unix-linux-sed-print-only-matching-lines-command/
info sed (texinfo package)
man 1 sed

Sed replace domain in URL

I have these strings http://sub.domain.com/myuri/default.aspx, https://sub.domain.com/myuri/default.aspx and https://domain.com
Is it possible to use sed to replace only the domain part?
For example, this URL:
http://sub.domain.com/myuri/default.aspx
Would become:
http://anotherdomain.com/myuri/default.aspx
Please note that the protocol may differ between https and http.
I did search but could not find something similar.
You will need a non-greedy pattern, which sed can't offer; use perl instead:
perl -pe '/(http|https):\/\/(.*?)(\/|$)/ && s/$2/anotherdomain/g'
Edit:
awk also does the job well and it's even simpler actually:
awk -F/ 'gsub($3,"anotherdomain",$0)' <<< "$urls"
Example:
#!/bin/bash
urls=$(cat << 'EOF'
https://sub.domain.com/myuri/default.aspx
http://sub.domain.com/myuri/default.aspx
http://blabla
EOF
)
perl -pe '/(http|https):\/\/(.*?)(\/|$)/ && s/$2/anotherdomain/g' <<< "$urls"
Output:
bash test.sh
https://anotherdomain/myuri/default.aspx
http://anotherdomain/myuri/default.aspx
http://anotherdomain
If I follow your question, then yes sed 's/sub\.domain\.com/anotherdomain\.com/1' -
echo "http://sub.domain.com/myuri/default.aspx" | \
sed 's/sub\.domain\.com/anotherdomain\.com/1'
Output is
http://anotherdomain.com/myuri/default.aspx
And with,
echo "https://sub.domain.com/myuri/default.aspx" | \
sed 's/sub\.domain\.com/anotherdomain\.com/1'
Output is
https://anotherdomain.com/myuri/default.aspx
You can use sed like this:
sed -r 's|(https?://)[^/]+([^[:blank:]]*)|\1anotherdomain.com\2|g' file
http://anotherdomain.com/myuri/default.aspx
https://anotherdomain.com/myuri/default.aspx
https://anotherdomain.com
PS: Use sed -E on OSX.
Based on @hek2mgl's solution:
SERVER=www.example.com
sed "s=\(https\?://\)[^/]\+=\1$SERVER=" \
<<< 'https://anotherdomain.com/myuri/default.aspx'
It will output:
https://www.example.com/myuri/default.aspx
Modifications from hek2mgl's sed line:
a little shorter (no need to catch the part after domain name to paste it as is in replacement)
deals with both http:// and https:// syntax
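The same [^/]+ idea can be wrapped in a function, roughly like this. The name replace_host is hypothetical, and the sketch assumes the new host contains no |, & or backslash, since those would be interpreted by sed in the replacement.

```shell
# replace_host NEWHOST: reads URLs on stdin, one per line, and swaps the host part.
# Hypothetical helper; NEWHOST must not contain |, & or backslashes.
replace_host() {
    sed -E "s|^(https?://)[^/]+|\1$1|"
}

printf '%s\n' \
    "http://sub.domain.com/myuri/default.aspx" \
    "https://sub.domain.com/myuri/default.aspx" \
    "https://domain.com" | replace_host anotherdomain.com
```

Because [^/]+ stops at the first slash after the scheme, no non-greedy matching is needed and the path is left untouched.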
You can use sed:
SERVER=www.example.com
sed "s~https\?://\([^/]\+\)\(.*\)~http://$SERVER\2~" <<< "http://newsub.domain.com/myuri/default

Use sed to grab a string

I'm using curl to get the html from a site then I just need a specific string which is between 'standards.xml?revision=' and '&amp'. I'm using sed to do this but I can't seem to get the regex right and needed some help.
curl website.com | sed -r 's|.*standards\.xml\?revision=([0-9]+).*|\1|'
The output I'm getting is the full html--any help would be appreciated.
You're almost there. Use sed's -n option so that unmatched lines are not printed, and add the p flag to s||| to print the result of the substitution:
curl website.com | sed -n -r 's|.*standards\.xml\?revision=([0-9]+).*|\1|p'
you can use grep -oP (PCRE option):
grep -oP 'standards\.xml\?revision=\K[0-9]+'
\K resets the start of the matched text, hence only the later part, [0-9]+, is returned.
curl website.com | sed -n -r '/xml/ {s|.*standards\.xml\?revision=([^&]+).*|\1|p;q;}'
Compared with the previous sed command, [0-9]+ only matches if the revision is numeric, so [^&]+ may be more appropriate.
Using ' for quoting and | as the delimiter is a good way to avoid problems with \, so I kept them :-)
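To try these without hitting a real site, here is a sketch that feeds one made-up line of HTML through the extended-regex version. The markup and the revision number 1234 are assumptions for illustration, not what website.com actually returns.

```shell
# Hypothetical HTML fragment standing in for the curl output.
html='<a href="standards.xml?revision=1234&amp;view=markup">standards.xml</a>'

# -n plus the p flag prints only lines where the substitution succeeded;
# [^&]+ captures everything up to the "&" that starts "&amp;".
rev=$(printf '%s\n' "$html" | sed -n -E 's|.*standards\.xml\?revision=([^&]+).*|\1|p')
printf '%s\n' "$rev"
```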

Using sed and regex to capture last part of url

I'm trying to make sed match the last part of a url and output just that. For example:
echo "http://randomurl/suburl/file.mp3" | sed (expression)
should give the output:
file.mp3
So far I've tried sed 's|\([^/]+mp3\)$|\1|g' but it just outputs the whole url. Maybe there's something I'm not seeing here but anyways, help would be much appreciated!
this works:
echo "http://randomurl/suburl/file.mp3" | sed 's#.*/##'
basename is your good friend.
> basename "http://randomurl/suburl/file.mp3"
=> file.mp3
This should do the job:
$ echo "http://randomurl/suburl/file.mp3" | sed -r 's|.*/(.*)$|\1|'
file.mp3
where:
| has been used instead of / to separate the arguments of the s command.
Everything is matched and replaced with whatever is found after the last /.
Edit: You could also use bash parameter substitution capabilities:
$ url="http://randomurl/suburl/file.mp3"
$ echo ${url##*/}
file.mp3
echo 'http://randomurl/suburl/file.mp3' | grep -oP '[^/\n]+$'
Here's another solution using grep.
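For what it's worth, the approaches above agree on the example URL but not on a URL that ends in a slash. A small sketch of the difference; the second URL is an example I made up.

```shell
url="http://randomurl/suburl/file.mp3"
from_sed=$(printf '%s\n' "$url" | sed 's#.*/##')
from_expansion=${url##*/}
from_basename=$(basename "$url")

# With a trailing slash, "everything after the last /" is empty,
# while basename ignores trailing slashes and returns the last component.
dir_url="http://randomurl/suburl/"
dir_expansion=${dir_url##*/}
dir_basename=$(basename "$dir_url")

printf '%s\n' "$from_sed" "$from_expansion" "$from_basename"
printf '[%s] [%s]\n' "$dir_expansion" "$dir_basename"
```

So if the input may end in a slash, basename is probably the more forgiving choice.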