Check if a URL has anything after the domain name in Perl - regex

I’m trying to match any and every URL that contains only the domain name (e.g. www.domain.com or otherdomain.com) with regex. So domain.com or domain.com/ should be matched but domain.com/index shouldn’t.
I’ve tried this:
^://[^/]*(/)?$
But it doesn’t seem to work. How can I do this?

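m{^(?:https?://)?[^/]*/?$}i
Try this. The m{...} delimiters keep the slashes inside the pattern from ending the regex early, which is what would happen between /.../ delimiters.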

$ printf '%s\n' domain.com/index sub.domain.com/ domain.com |
perl -lne 'm|/\w+| or print'
sub.domain.com/
domain.com
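For comparison, here is a minimal Python sketch of the same check. The function name is_domain_only is made up for illustration, and the pattern assumes an optional http or https scheme followed by a non-empty host, so unlike the [^/]* version above it rejects an empty string.

```python
import re

# Optional scheme, a host containing no slash, then at most one trailing slash.
DOMAIN_ONLY = re.compile(r"^(?:https?://)?[^/]+/?$", re.IGNORECASE)


def is_domain_only(url: str) -> bool:
    """Return True when nothing follows the host except an optional '/'."""
    return DOMAIN_ONLY.match(url) is not None


if __name__ == "__main__":
    for url in ["domain.com", "domain.com/", "http://www.domain.com/", "domain.com/index"]:
        print(url, is_domain_only(url))
```

Anything with a path component after the host, such as domain.com/index, is rejected because the host part may not contain a slash.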

Related

Bash regex to grab subdomain from list of urls

I have a file which contains list of URLs and I want to grab the subdomains from them.
List of URLs are:
https://www.google.com [match www]
https://www.something.random-name.domain.com [match www, something, and random-name]
https://facebook.com [don't match anything]
http://test.prod-op.bpo.yahoo.com [match test, prod-op and bpo]
I've been using the "sed" command to strip the https and http prefixes and then the "awk" command to get the subdomains, but the problem is that I can only match the first subdomain. For example:
https://www.something.random-name.domain.com
In the above example my approach would only match "www" But I want it to match "www" along with "something" and "random-name".
Input would be:
https://www.google.com
https://www.something.random-name.domain.com
https://facebook.com
http://test.prod-op.bpo.yahoo.com
Output would be:
www
www something random-name
null
test prod-op bpo
Kindly explain what should be done so that I can match and extract the subdomains.
Thank you!
Here is your example file, and how to use sed to get all subdomains:
$ cat test.txt
https://www.google.com
https://www.something.random-name.domain.com
https://facebook.com
http://test.prod-op.bpo.yahoo.com
$ cat test.txt | sed -e 's/https*:\/\///; s/\.*[^\.]*\.[^\.]*$//; s/^$/null/; s/\./ /g'
www
www something random-name
null
test prod-op bpo
$
Explanation:
s/https*:\/\///; - remove protocol
s/\.*[^\.]*\.[^\.]*$//; - remove domain name and TLD
s/^$/null/; - change an empty line to null
s/\./ /g - change all dots to space
With two GNU awk invocations:
awk -F '/' '{$0=$NF}1' file | awk -F '.' '{NF=NF-2}; NF<1{$0="null"}1'
$NF: contains last column
NF=NF-2: Removes the last two columns from current row
Output:
www
www something random-name
null
test prod-op bpo
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
This awk can do it in a single command:
awk -F. '{gsub(/^https?:\/\/|\.?[^.]+\.[^.]+$/, ""); $1=$1; print (/./ ? $0 : "null")}' file
www
www something random-name
null
test prod-op bpo
This might work for you (GNU sed):
sed -E 's#^https?://(.*)(\.[^.]+){2}#\1#;y/./ /;t;cnull' file
Pattern match on the url, removing everything but the required section.
Manipulate the section into the required format and print the result.
Otherwise, change the existing line to null.
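The same idea can be sketched in Python. Like the sed and awk answers above, this assumes the registered domain is always the last two labels of the host name, which does not hold for suffixes such as co.uk. The function name subdomains is illustrative only.

```python
from urllib.parse import urlsplit


def subdomains(url: str) -> str:
    """Return every label left of the last two, space-separated, or 'null'."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")[:-2]  # drop the domain name and the TLD
    return " ".join(labels) if labels else "null"


if __name__ == "__main__":
    urls = [
        "https://www.google.com",
        "https://www.something.random-name.domain.com",
        "https://facebook.com",
        "http://test.prod-op.bpo.yahoo.com",
    ]
    for url in urls:
        print(subdomains(url))
```

Letting urlsplit pick out the host name avoids a hand-written regex for the scheme and also ignores any port or path that might follow the host.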

grep regex to find emails of a certain tld

I'd like to run a grep command that searches a text file and matches email addresses with a certain TLD.
For example, if the text file contains the following lines:
tom@google.com
mark@google.com
tom.comber@google.cz
And I'm searching for the .com TLD emails:
It should match tom@google.com and mark@google.com but not tom.comber@google.cz
I'm currently using the following grep command, which matches pretty much every string that contains .com. I want it to match only the TLD of the domain.
grep -rnwi "/Users/Me/Desktop/Folder/" -e ".com"
EDIT
grep -rnwi '@.+\.com$' "/Users/Me/Desktop/Folder/" matches nothing, but grep -rnwi "/Users/Me/Desktop/Folder/" -e "hotmail.com" matches plenty. I don't want just hotmail.com but all .com emails.
EDIT2: this seems to match nothing either. Is it because I'm searching in multiple text files in a folder?
grep -rnwi '@.\+\.com$' "/Users/Me/Desktop/Folder/"
EDIT3: I wasn't totally clear. There are characters after the TLD, so I had to leave off the trailing $. That works.
Do:
grep '@.\+\.com$' file.txt
@.\+ matches an @ followed by one or more characters
\.com$ matches a literal .com at the end of the line
To do the same for other TLDs, replace com at the end with that TLD.
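As a rough Python equivalent, the sketch below checks one address at a time. It assumes the addresses use the usual @ separator and that each address stands alone on its line, so the match is anchored at the end. The function name has_tld is hypothetical.

```python
import re


def has_tld(address: str, tld: str) -> bool:
    """True when the domain part of the address ends in the given TLD."""
    # re.escape keeps any regex metacharacters in the TLD from being interpreted.
    pattern = rf"@[^@\s]+\.{re.escape(tld)}$"
    return re.search(pattern, address.strip(), re.IGNORECASE) is not None


if __name__ == "__main__":
    for address in ["tom@google.com", "mark@google.com", "tom.comber@google.cz"]:
        print(address, has_tld(address, "com"))
```

Requiring the @ before the domain is what stops tom.comber from counting as a .com match: the .com in the local part is never followed by the end of the string.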

Match a username after a keyword

I am trying to match for a username in the format of DOMAIN\USERNAME after the first appearance of a keyword "Details:"
The following is a good sample for the text that I will be looking through.
Source: Product | Action: ADDUSER |
Administrator: domain\admin | Details: Alpha Snack Foods:
Added user domain\fuser to the group Viewers |
In this example I would want to return only "domain\fuser"
I have tried using (\bdomain\\.)\w+ but this returns both "domain\admin" and "domain\fuser".
I also tried using (?<=\buser\s)(\w+) but this only returned the second instance of domain.
So I feel like I am getting close here, but I could use some help.
Thanks!
You can change your second attempt to allow the backslash character.
(?<=\buser\s)[\w\\]+
Or you can use \S which matches any non-whitespace characters.
(?<=\buser\s)\S+
Using perl :
$ perl -lne '/Added user (domain\\fuser) to the group/ and print $1' file
or
$ perl -lne '/Added user (\Qdomain\fuser\E) to the group/ and print $1' file
Output :
domain\fuser
Details:.+(domain\\[\w]+)
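This finds "Details:" in the text, skips ahead to domain\, then captures the user name together with the domain. Because .+ is greedy, it captures the last domain\user that follows "Details:", and since . does not match a newline by default, the text must either sit on one line or be matched in single-line (dotall) mode.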
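A small Python sketch of the "first match after the keyword" variant is shown below. It uses a lazy .*? together with re.DOTALL so that the search can cross line breaks and stops at the first DOMAIN\USERNAME it meets. The names SAMPLE and user_after_details are made up for illustration, and the pattern assumes both the domain and the user name consist only of word characters.

```python
import re

SAMPLE = r"""Source: Product | Action: ADDUSER |
Administrator: domain\admin | Details: Alpha Snack Foods:
Added user domain\fuser to the group Viewers |"""

# Lazy .*? skips as little text as possible after "Details:";
# DOTALL lets the dot cross the line break in the sample.
FIRST_USER_AFTER_DETAILS = re.compile(r"Details:.*?(\w+\\\w+)", re.DOTALL)


def user_after_details(text):
    """Return the first DOMAIN\\USERNAME after 'Details:', or None."""
    match = FIRST_USER_AFTER_DETAILS.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(user_after_details(SAMPLE))
```

Since domain\admin appears before the keyword, it is never considered; only text after "Details:" is searched.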

Bash based regex domain name validation

I want to create a script that will add new domains to our DNS Servers.
I found this "Fully qualified domain name validation" regex.
However, when I use it with sed, it is not working as I would expect:
echo test | sed '/(?=^.{5,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(:[a-zA-Z]{2,})$)/p'
--------
Output is:
test
echo test.com | sed '/(?=^.{5,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(:[a-zA-Z]{2,})$)/p'
--------
Output is:
test.com
I expected that the output of the first command should be a blank line.
What am I doing wrong?
I find this to be a more comprehensive regex:
(?=^.{4,253}$)(^(?:[a-zA-Z0-9](?:(?:[a-zA-Z0-9\-]){0,61}[a-zA-Z0-9])?\.)+([a-zA-Z]{2,}|xn--[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])$)
RFC 1034§3: Allows for a length of 4-253, with the shortest operational domain I'm aware of, "t.co", still matching where the other answers don't. 255 bytes is the maximum length, minus the length octet for each label (TLD and "primary" subdomain) gives us 253: (?=^.{4,253}$)
RFC 3696§2: Single-letter TLDs are technically permitted, meaning the minimum length would be 3, but as there are currently no single-letter TLDs a minimum length of 4 is practical.
RFC 1034§3: Allows numbers in subdomains, which Conor Clafferty's apparently doesn't (by not distinguishing other subdomains from "primary" subdomains -- i.e. the domain you register -- which the DNS spec doesn't)
RFC 1034§3: Restricts individual labels to 63 characters, permitting hyphens in the middle while restricting the beginning and end to alphanumerics (?:[a-zA-Z0-9](?:(?:[a-zA-Z0-9\-]){0,61}[a-zA-Z0-9])?\.)
Requires a two-letter or larger TLD, but may be punycoded ([a-zA-Z]{2,}|xn--[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])
RFC 3696§2: The DNS spec technically permits numerics in the TLD, as well as single-letter TLDs; however, there are currently no single-letter TLDs or TLDs with numbers, and all-numeric TLDs are not permitted, so this part of the regex has been simplified to [a-zA-Z]{2,}.
--OR--
RFC 3490§5: an internationalized domain name ccTLD (IDN ccTLD) may be punycoded, as indicated by an "xn--" prefix, after which it may contain letters, numbers, or hyphens. This approximates to xn--[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9]
Be aware that this pattern does not validate a punycode TLD! Invalid punycode will be tolerated, e.g. "xn--qqqq", because attempting to validate punycode against the appropriate encoding mechanisms is beyond the scope of a regular expression. While punycode itself technically permits an encoded string ending in a hyphen, RFC 3492§5 observes and respects the IDNA limitation that labels may not end in a hyphen.
EDIT 02/2021: Hat tip to user2241415 for pointing out that IDN ccTLDs did not match the previously-specified regex.
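To try the pattern outside the shell, here is a minimal Python sketch that compiles it unchanged (Python's re module supports the lookahead it relies on). The names FQDN and looks_like_fqdn are illustrative, and the check is purely syntactic: it says nothing about whether the name resolves.

```python
import re

# The pattern from this answer, split over three lines for readability.
FQDN = re.compile(
    r"(?=^.{4,253}$)"
    r"(^(?:[a-zA-Z0-9](?:(?:[a-zA-Z0-9\-]){0,61}[a-zA-Z0-9])?\.)+"
    r"([a-zA-Z]{2,}|xn--[a-zA-Z0-9][a-zA-Z0-9\-]*[a-zA-Z0-9])$)"
)


def looks_like_fqdn(name: str) -> bool:
    """Syntactic check only: overall length, label shape, and TLD form."""
    return FQDN.match(name) is not None


if __name__ == "__main__":
    for name in ["t.co", "www.test.com", "test", "-bad.com", "example.xn--p1ai"]:
        print(name, looks_like_fqdn(name))
```

A bare word such as test is rejected because the pattern requires at least one label followed by a dot before the TLD.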
You are missing a question mark in your regex :
(?=^.{5,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(?:[a-zA-Z]{2,})$)
You can test your regex here
You can do what you want with grep :
$ echo test.com | grep -P '(?=^.{5,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(?:[a-zA-Z]{2,})$)'
test.com
$ echo test | grep -P '(?=^.{5,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+(?:[a-zA-Z]{2,})$)'
$
No sed implementation I am aware of supports the various Perl extensions you are using in that regex. Try with Perl or grep -P or pcregrep, or simplify the regex to something sed can cope with. Here is a quick and dirty adaptation which splits the regex into a script of three different regexes, and rejects when something fails to match (or matches, in the middlemost case).
echo 'test' | sed -r '/^.{5,254}$/!d
/^([^.]*\.)*[0-9]+\./d # Seems incorrect; 112.com is valid
/^([a-zA-Z0-9_\-]{1,63}\.?)+([a-zA-Z]{2,})$/!d' # should disallow underscore
# also, what's with the question mark after the literal dot?
This also completely fails to accept IDNA domains (which can contain dashes and numbers in the TLD, among other things) so I would definitely not recommend this, but hopefully it shows you how to adapt something like this to sed if you wish to.
Pierre-Louis' answer didn't quite work for me. e.g. "kittens" is considered a domain name.
I added one slight adjustment to ensure that the domain at least had a dot in it.
(?=^.{5,254}$)(^(?:(?!\d+\.)[a-zA-Z0-9_\-]{1,63}\.?)+\.(?:[a-z]{2,})$)
There's an extra \. just before it reads the last portion of the domain.
I use grep -P to do this.
echo test | grep -P "^[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9](?:\.[a-zA-Z]{2,})+$"
--------
Output is:
echo www.test.com | grep -P "^[a-zA-Z0-9][a-zA-Z0-9-]{1,61}[a-zA-Z0-9](?:\.[a-zA-Z]{2,})+$"
--------
Output is: www.test.com
if the domain has to exist you can try:
$ cat test.sh
#!/bin/bash
for h in "bert" "ernie" "www.google.com"
do
    # Discard both stdout and stderr; only the exit status of host matters
    if host "$h" > /dev/null 2>&1
    then
        echo "$h is a FQDN"
    else
        echo "$h is not a FQDN"
    fi
done
jalderman@mba:/tmp$ ./test.sh
bert is not a FQDN
ernie is not a FQDN
www.google.com is a FQDN

Regular Expressions for Linux - scan the Apache HTTPD access log for all the response code other than 200

This is a question about grep and regular expression.
If I want to see all the requests whose response is a 200 code, I can do:
grep -e '^.* - - .* .* .* .* .* 200' access_log
Quite easy peasy.
But what if I want to retrieve all the requests whose response is NOT a 200 code?
I would like to be able to do that with only one grep instruction. Is that possible?
Thanks,
Dan
You can simply use the -v option for grep. This inverts the matches, so it returns all the lines that do not match the pattern.
So like this:
grep -v [pattern] [file]
I'd use this:
^\S+\s+\S+\s+\S+\s+\[[^]]+\]\s+"(?:GET|POST|HEAD) [^ ?"]+\??[^ ?"]* HTTP/[0-9.]+"\s+200
and then invert the result as Daniel Egeberg suggested.
With comments and capturing groups, courtesy of RegexBuddy:
^((?#client IP or domain name)\S+)\s+((?#basic authentication)\S+\s+\S+)\s+\[((?#date and time)[^]]+)\]\s+"(?:GET|POST|HEAD) ((?#file)[^ ?"]+)\??((?#parameters)[^ ?"]+)? HTTP/[0-9.]+"\s+(?#status code)200
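If the filtering is done in a script rather than with grep -v, the status code can be captured and compared directly. The sketch below assumes the Apache Common or Combined Log Format (host, ident, user, bracketed timestamp, quoted request line, status). The names LOG_LINE and non_200_lines are hypothetical, and the log lines in the demo are invented sample data.

```python
import re

# host ident user [timestamp] "request line" status ...
# Lines that do not fit this shape (for example a request line containing
# an escaped quote) are skipped rather than reported.
LOG_LINE = re.compile(r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3})\b')


def non_200_lines(lines):
    """Yield the log lines whose status code parses and is not 200."""
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group(1) != "200":
            yield line


if __name__ == "__main__":
    demo = [
        '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
        '127.0.0.1 - - [10/Oct/2000:13:55:37 -0700] "GET /missing HTTP/1.0" 404 209',
    ]
    for line in non_200_lines(demo):
        print(line)
```

Anchoring on the quoted request line means a 200 that happens to appear in the URL or in the byte count is not mistaken for the status code.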