Bash regex to grab subdomain from list of urls

I have a file that contains a list of URLs, and I want to extract the subdomains from them.
The URLs are:
https://www.google.com [match www]
https://www.something.random-name.domain.com [match www, something, and random-name]
https://facebook.com [don't match anything]
http://test.prod-op.bpo.yahoo.com [match test, prod-op and bpo]
I've been using the "sed" command to strip the https and http prefixes and then the "awk" command to get the subdomains, but the problem is that I can only match the first subdomain. For example:
https://www.something.random-name.domain.com
In the above example my approach would only match "www", but I want it to match "www" along with "something" and "random-name".
Input would be:
https://www.google.com
https://www.something.random-name.domain.com
https://facebook.com
http://test.prod-op.bpo.yahoo.com
Output would be:
www
www something random-name
null
test prod-op bpo
Please explain what I should do to match and extract the subdomains.
Thank you!

Here is your example file, and how to use sed to get all subdomains:
$ cat test.txt
https://www.google.com
https://www.something.random-name.domain.com
https://facebook.com
http://test.prod-op.bpo.yahoo.com
$ cat test.txt | sed -e 's/https*:\/\///; s/\.*[^\.]*\.[^\.]*$//; s/^$/null/; s/\./ /g'
www
www something random-name
null
test prod-op bpo
$
Explanation:
s/https*:\/\///; - remove protocol
s/\.*[^\.]*\.[^\.]*$//; - remove domain name and TLD
s/^$/null/; - change an empty line to null
s/\./ /g - change all dots to space
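The same steps can also be sketched without sed, using only shell parameter expansion. This is an illustration rather than a drop-in replacement: the function name subdomains is made up here, and like the sed version it assumes the last two labels are always the domain name and TLD (so it would give wrong results for something like example.co.uk).

```shell
# Print the subdomain labels of each URL on stdin, or "null" if there are none.
# Assumption: the last two labels are always the registered domain and the TLD.
subdomains() {
  while IFS= read -r url; do
    host=${url#*://}     # drop the scheme, if present
    host=${host%%/*}     # drop any path after the host
    case $host in
      *.*.*)
        sub=${host%.*}   # strip the TLD
        sub=${sub%.*}    # strip the domain name
        printf '%s\n' "$sub" | tr '.' ' '   # dots to spaces
        ;;
      *) echo null ;;    # two labels or fewer: no subdomain
    esac
  done
}

# Prints "www" and then "null".
printf '%s\n' 'https://www.google.com' 'https://facebook.com' | subdomains
```

To process the file from the question you would run it as subdomains < test.txt.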

With two GNU awk invocations:
awk -F '/' '{$0=$NF}1' file | awk -F '.' '{NF=NF-2}; NF<1{$0="null"}1'
$NF: contains last column
NF=NF-2: Removes the last two columns from current row
Output:
www
www something random-name
null
test prod-op bpo
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR

This awk can do it in a single command:
awk -F. '{gsub(/^https?:\/\/|\.?[^.]+\.[^.]+$/, ""); $1=$1; print (/./ ? $0 : "null")}' file
www
www something random-name
null
test prod-op bpo

This might work for you (GNU sed):
sed -E 's#^https?://(.*)(\.[^.]+){2}#\1#;y/./ /;t;cnull' file
Pattern match on the url, removing everything but the required section.
Manipulate the section into the required format and print the result.
Otherwise, change the existing line to null.
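For readers who find the one-liner dense, here is the same script spread over several lines with comments. It is only a sketch: the wrapper name strip_domain is made up, it was written with GNU sed in mind, and the c command is given in its multi-line form.

```shell
# Same logic as the one-liner above, one sed command per line.
strip_domain() {
  sed -E '
    # keep only what precedes the last two dot-separated labels
    s#^https?://(.*)(\.[^.]+){2}#\1#
    # turn the remaining dots into spaces
    y/./ /
    # if the s command matched, branch to the end of the script and print
    t
    # otherwise replace the whole line
    c\
null
  '
}

# Prints "test prod-op bpo" and then "null".
printf '%s\n' 'http://test.prod-op.bpo.yahoo.com' 'https://facebook.com' | strip_domain
```

The t command only branches when a substitution has succeeded on the current line, which is what lets the final c act as the "no subdomain" case.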

Related

Replace in line using sed

I'm creating a shell script, which reads the following list.log
1.15.2.119
1.15.86.33
1.15.251.60
1.20.178.145/31
1.37.33.24
1.54.202.216
1.58.10.126/28
1.80.225.84
1.116.240.174/30
I would like to append /32 to every IP except the ones that already have a prefix length such as /31 or /28.
Example:
1.14.191.227/32
1.15.2.119/32
1.15.86.33/32
1.15.251.60/32
1.20.178.145/31
1.37.33.24/32
1.54.202.216/32
1.58.10.126/28
1.80.225.84/32
1.116.240.174/30
My attempt also appends /32 to the lines that already have a prefix length:
cat list.log | sed 's/$/\/32/'
1.14.191.227/32
1.15.2.119/32
1.15.86.33/32
1.15.251.60/32
1.20.178.145/31/32
1.37.33.24/32
1.54.202.216/32
1.58.10.126/28/32
1.80.225.84/32
1.116.240.174/30/32

This is easy to do in awk. Append /32 only to lines that do not already contain a slash:
awk '!/\//{$0=$0"/32"} 1' Input_file
Explanation: if the line does not contain a /, append /32 to it; the trailing 1 then prints every line, edited or not.

Using sed
$ sed 's|\.[0-9]\+$|&/32|' list.log
1.15.2.119/32
1.15.86.33/32
1.15.251.60/32
1.20.178.145/31
1.37.33.24/32
1.54.202.216/32
1.58.10.126/28
1.80.225.84/32
1.116.240.174/30

You can add /32 to the end of lines that do not contain /
sed '\,/,!s,$,/32,' list.log > newlist.log
Details:
\,/,! - find lines not containing /
s,$,/32, - and replace end of string position with /32 there.
See the online demo:
#!/bin/bash
s='1.15.2.119
1.15.86.33
1.15.251.60
1.20.178.145/31
1.37.33.24
1.54.202.216
1.58.10.126/28
1.80.225.84
1.116.240.174/30'
sed '\,/,!s,$,/32,' <<< "$s"
Output:
1.15.2.119/32
1.15.86.33/32
1.15.251.60/32
1.20.178.145/31
1.37.33.24/32
1.54.202.216/32
1.58.10.126/28
1.80.225.84/32
1.116.240.174/30
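The rule "only touch lines without a slash" can also be written as a plain shell loop, which may be handy if the script already reads list.log line by line. A rough sketch; the function name add_prefix is invented for the example.

```shell
# Append /32 to every address on stdin that has no prefix length yet.
add_prefix() {
  while IFS= read -r addr; do
    case $addr in
      */*) printf '%s\n' "$addr" ;;     # already has a /NN suffix: keep as is
      '')  printf '\n' ;;               # leave blank lines alone
      *)   printf '%s/32\n' "$addr" ;;  # bare address: add /32
    esac
  done
}

# Prints "1.15.2.119/32" and then "1.20.178.145/31".
printf '%s\n' '1.15.2.119' '1.20.178.145/31' | add_prefix
```

For a large file the sed or awk one-liners will be faster; the loop is mainly useful when other per-line work happens in the shell anyway.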

How can I use sed to find a line starting with AAA but NOT end with BBB

I'm trying to create a script to append oracleserver to /etc/hosts as an alias of localhost. Which means I need to:
Locate the line that matches ^127.0.0.1 but NOT oracleserver$
Then, append oracleserver to this line
I know the best practice is probably using negative look ahead. However, sed does not have look around feature: What's wrong with my lookahead regex in GNU sed?. Can anyone provide me some possible solutions?

sed -i '/oracleserver$/! s/^127\.0\.0\.1.*$/& oracleserver/' filename
/oracleserver$/! - on lines not ending with oracleserver
^127\.0\.0\.1.*$ - replace the whole line if it is starting with 127.0.0.1
& oracleserver - with the line plus a space separator ' ' (required) and oracleserver after that
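A quick way to convince yourself that the command is safe to run more than once is to pipe sample data through it twice. The sketch below drops -i and reads stdin so that nothing on disk is touched; the wrapper name add_alias and the sample lines are made up.

```shell
# Append the alias to the 127.0.0.1 line unless the line already ends with it.
# Reads hosts-style lines on stdin and writes the result to stdout.
add_alias() {
  sed '/oracleserver$/! s/^127\.0\.0\.1.*$/& oracleserver/'
}

# Running it twice still yields a single alias:
#   127.0.0.1 localhost oracleserver
#   ::1 localhost
printf '%s\n' '127.0.0.1 localhost' '::1 localhost' | add_alias | add_alias
```

Note that ^127\.0\.0\.1 also matches addresses such as 127.0.0.10; add a trailing space or tab to the pattern if that matters for your hosts file.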

Just use awk with && to combine the two conditions:
awk '/^127\.0\.0\.1/ && !/oracleserver$/ { $0 = $0 " oracleserver" } 1' file
This appends a space and the string when the first pattern is matched but the second one isn't. The 1 at the end is always true, so awk prints each line (the default action is { print }).

I wouldn't use sed but perl instead:
Locate the line that ^127.0.0.1 and NOT oracleserver$
perl -pe 'if ( m/^127\.0\.0\.1/ and not m/oracleserver$/ ) { s/$/ oracleserver/ }'
That should do the trick. You can add -i.bak to edit the file in place too.

Check if a URL has anything after the domain name in Perl

I’m trying to match any and every URL that contains only the domain name (e.g. www.domain.com or otherdomain.com) with a regex. So domain.com or domain.com/ should be matched, but domain.com/index shouldn’t.
I’ve tried this:
^://[^/]*(/)?$
But it doesn’t seem to work. How can I do this?

Try this. With m{...} as the delimiter, the slashes inside the pattern need no escaping:
m{^(?:https?://)?[^/]*(/)?$}i

$ printf '%s\n' domain.com/index sub.domain.com/ domain.com |
perl -lne 'm|/\w+| or print'
sub.domain.com/
domain.com
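If the check has to happen in a shell script rather than in Perl, roughly the same pattern can be handed to grep -E. This is a translation for illustration, not one of the answers above; the function name is_bare_domain is invented, and [^/]+ is used instead of [^/]* so that an empty host does not match.

```shell
# Succeeds when the URL has nothing but an optional trailing slash after the host.
is_bare_domain() {
  printf '%s\n' "$1" | grep -Eqi '^(https?://)?[^/]+/?$'
}

for url in domain.com domain.com/ http://www.domain.com domain.com/index; do
  if is_bare_domain "$url"; then
    echo "match:    $url"
  else
    echo "no match: $url"
  fi
done
```

Only domain.com/index is reported as "no match" in this loop.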

Understanding a sed example

I found a solution for extracting the password from a Mac OS X Keychain item. It uses sed to get the password from the security command:
security 2>&1 >/dev/null find-generic-password -ga $USER | \
sed -En '/^password: / s,^password: "(.*)"$,\1,p'
The code is here in a comment by 'sr105'. The part before the | evaluates to password: "secret". I'm trying to figure out exactly how the sed command works. Here are some thoughts:
I understand the flags -En, but what are the commas doing in this example? In the sed docs it says a comma separates an address range, but there's 3 commas.
The first 'address' /^password: / has a trailing s; in the docs s is only mentioned as the replace command like s/pattern/replacement/. Not the case here.
The ^password: "(.*)"$ part looks like the Regex for isolating secret, but it's not delimited.
I can understand the end part where the back-reference \1 is printed out, but again, what are the commas doing there??
Note that I'm not interested in an easier alternative to this sed example. This will only be part of a larger bash script which will include some more sed parsing in an .htaccess file, so I'd really like to learn the syntax even if it is obscure.
Thanks for your help!

Here is sed command:
sed -En '/^password: / s,^password: "(.*)"$,\1,p'
The commas are used as the delimiter of the s command. Any other delimiter would work just as well, for example #:
sed -En '/^password: / s#^password: "(.*)"$#\1#p'
/^password: / finds an input line that starts with password:
s#^password: "(.*)"$#\1#p finds and captures double-quoted string after password: and replaces the entire line with the captured string \1 ( so all that remains is the password )

First, the command extracts passwords from a file (or stream) and prints them to stdout.
While you "normally" might execute a sed command on all lines of a file, sed offers to specify a regex pattern which describes which lines the following command should get applied to.
In your case
/^password: /
is a regex, saying that the command:
s,^password: "(.*)"$,\1,p
should get executed for all lines looking like password: "secret". The command substitutes those lines with the password itself and prints them (the p flag), while -n suppresses all the other lines.
The substitute command might look uncommon but you can choose the delimiter in an sed command, it is not limited to /. In this case , was chosen.
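To see the address and the delimiter choice in isolation, you can feed sed some stand-in input instead of calling security. The helper fake_security_output and its first line are invented for this sketch; only the password: "secret" line comes from the question.

```shell
# Stand-in for the output of `security`; the real command is not run here.
fake_security_output() {
  printf '%s\n' 'other: "ignored"' 'password: "secret"'
}

# The address /^password: / selects the line; s,...,...,p replaces it
# with capture group 1 and prints it. Everything else is hidden by -n.
fake_security_output | sed -En '/^password: / s,^password: "(.*)"$,\1,p'

# The identical command written with the default / delimiter.
fake_security_output | sed -En '/^password: / s/^password: "(.*)"$/\1/p'
```

Both commands print secret. The comma form simply avoids having to think about escaping if the pattern ever contains a slash.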

How can I match a pattern that occurs after a known pattern

Consider the format of a bind dns zone file:
zone "mydomain.com" {
type slave;
file "db.mydomain";
masters {
192.168.5.15;
};
};
...
repeated several more times for other zones in the conf file.
I need to discover in a script some details about the zone.conf file.
I know the domain I am looking for so I can regex for something like '^zone "mydomain.com"'
But I need to discover the file line that occurs first after the zone name I am looking at.
I also want to discover the ip address in the masters list.
Our configuration only has one master ip so I don't have to worry about multiple ip's.
Ideas appreciated.

sed can be used to isolate the right section of the dns file, then print the next line after a pattern matched:
# sed -n '/"mydomain.com"/,/^};$/{/^zone "mydomain.com"/{n;p}}' dnsfile
type slave;
# sed -n '/"mydomain.com"/,/^};$/{/masters/{n;p}}' dnsfile
192.168.5.15;

One approach here would be to use sed to first output the zone block you are interested in, and then grab just the lines you want. This might look something like the following:
sed -n '/^zone "mydomain.com"/,/^};/p' zone.conf | sed -n -e '2p' -e '/[0-9]/p'
2p will print only the second line (first line after the zone name), and /[0-9]/p will print only lines that contain digits (ip address).
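If you would rather not depend on line positions inside the block, a small awk state machine can pick out the file name and the master IP by what the lines look like. This is a sketch under a few assumptions: each zone starts with a line whose first word is zone, the file statement sits on its own line, and the master IP is alone on its line followed by a semicolon. The function name zone_info and the output labels are made up.

```shell
# Print the file name and master IP for one zone of a BIND-style config on stdin.
zone_info() {
  awk -v zone="$1" '
    # every "zone" line switches the flag on or off, depending on the name
    $1 == "zone" { in_zone = ($2 == "\"" zone "\""); next }
    !in_zone     { next }
    # file "db.mydomain";  ->  db.mydomain
    $1 == "file" { gsub(/[";]/, "", $2); print "file:", $2 }
    # a bare IPv4 address followed by a semicolon
    $1 ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+;$/ { sub(/;$/, "", $1); print "master:", $1 }
  '
}

zone_info mydomain.com <<'EOF'
zone "mydomain.com" {
type slave;
file "db.mydomain";
masters {
192.168.5.15;
};
};
EOF
```

For the sample above this prints file: db.mydomain followed by master: 192.168.5.15. In a real script you would call it as zone_info mydomain.com < zone.conf.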

To get the line after masters, with the IP trimmed:
awk -F';' '/^ *masters/ { getline; sub(/^ */, "", $1); print $1 }' file
OUTPUT
192.168.5.15
To get the line after the zone line:
awk -F';' '/^zone "mydomain.com"/ { getline; sub(/^ */, "", $1); print $1 }' file
OUTPUT
type slave
