Regex for matching last two parts of a URL - regex

I am trying to figure out the best regex to match only the last two parts of a URL's host name.
For instance, with www.stackoverflow.com I just want to match stackoverflow.com
The issue I have is that some strings can have a large number of periods, for instance
a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com
should also return only yimg.com
The set of URLs I am working with does not have any path information, so one can assume the last part of the string is always .org or .com or something of that nature.
What regular expression will return stackoverflow.com when run against www.stackoverflow.com and will return yimg.com when run against a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com
under the conditions above?

You don't have to use a regex; instead you can use a simple explode function.
You're looking to split your URL at the periods, so something like
$url = "a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com";
$url_split = explode(".",$url);
And then you need to get the last two elements, so you can echo them out from the array created.
//this will return the second to last element, yimg
echo $url_split[count($url_split)-2];
//this will echo the period
echo ".";
//this will return the last element, com
echo $url_split[count($url_split)-1];
So in the end you'll get yimg.com as the final output.
Hope this helps.
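For comparison, here is the same split-and-rejoin idea as a minimal Python sketch. The function name last_two_labels is made up for this illustration, and the code assumes the input is a bare host name with no scheme or path, as the question states.

```python
def last_two_labels(host: str) -> str:
    """Return the last two dot-separated parts of a host name."""
    parts = host.split(".")
    # Joining the final two elements mirrors the three echo calls above.
    return ".".join(parts[-2:])


if __name__ == "__main__":
    for host in ("www.stackoverflow.com",
                 "a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com"):
        print(host, "->", last_two_labels(host))
```

Like the explode version, this simply takes the last two parts, so a host such as www.example.co.uk would come back as co.uk.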

I don't know what you have tried so far, but I can offer the following solution:
/.*?([\w]+\.[\w]+)$/
There are a couple of tricks here:
Use $ to anchor the match at the end of the string. This way you can be sure the regex engine won't pick up a match from the very beginning.
Use grouping with (...). The group means the following: match a word of at least one character, then a dot (backslashed, because a dot has a special meaning in regex and we want it 'as is'), and then again a word of at least one character.
Use a reluctant (lazy) quantifier at the beginning of the pattern, because otherwise .* will match as much as it can in a greedy manner. For example, if your text is:
abc.def.gh
the greedy match will give f.gh in your group, and it's not what you want.
I assumed that your host parts consist only of word characters (\w matches letters, digits and the underscore, but not the hyphen, so for a host part like a-abcnewsplus you would need something like [\w-]).
Here is a working Groovy example; you didn't specify the language you use, but the engine should be similar.
def s = "abc.def.gh"
def m = s =~/.*?([\w]+\.[\w]+)$/
println m[0][1] // outputs the first (and the only you have) group in groovy
Hope this helps
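The difference between the greedy and the reluctant prefix can be checked with a short Python sketch. The names GREEDY, LAZY and last_group are illustrative only; the two patterns are the one from this answer and the same pattern with .* in place of .*?.

```python
import re

# Same pattern twice: once with a greedy prefix, once with a lazy one.
GREEDY = re.compile(r".*([\w]+\.[\w]+)$")
LAZY = re.compile(r".*?([\w]+\.[\w]+)$")


def last_group(pattern, text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    m = pattern.search(text)
    return m.group(1) if m else None


if __name__ == "__main__":
    print(last_group(GREEDY, "abc.def.gh"))  # greedy prefix leaves only f.gh
    print(last_group(LAZY, "abc.def.gh"))    # lazy prefix leaves def.gh
```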

If you need a solution using Perl-compatible regular expressions, which will work in a number of languages, you can use something like this; the example is in PHP:
$url = "a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com";
preg_match('|[a-zA-Z-0-9]+\.[a-zA-Z]{2,3}$|', $url, $m);
print($m[0]);
This regex fetches the last part of the URL plus the domain name, provided the last part is two or three letters long (that is what {2,3} requires). For example, with a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com this produces
yimg.com
as an output, and with www.stackoverflow.com (with or without preceding triple w) it gives you
stackoverflow.com
as a result

A shorter version (note that the match starts with a dot, e.g. .yimg.com, so the host needs at least three parts):
/(\.[^\.]+){2}$/
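To see what the last two patterns return, here is a small Python check. It is only a sketch: the names TWO_LABELS, SHORT and find are made up, and the character class of the PHP pattern is written as [a-zA-Z0-9-], which is assumed to be what [a-zA-Z-0-9] was meant to express.

```python
import re

# PHP answer: one host part followed by a two- or three-letter last part.
TWO_LABELS = re.compile(r"[a-zA-Z0-9-]+\.[a-zA-Z]{2,3}$")
# Shorter version: exactly two ".part" groups at the end of the string.
SHORT = re.compile(r"(\.[^\.]+){2}$")


def find(pattern, host):
    """Return the matched text, or None if the pattern does not match."""
    m = pattern.search(host)
    return m.group(0) if m else None


if __name__ == "__main__":
    for host in ("www.stackoverflow.com", "stackoverflow.com", "www.example.info"):
        print(host, find(TWO_LABELS, host), find(SHORT, host))
```

The first pattern misses hosts whose last part is longer than three letters (such as .info), and the second one keeps the leading dot and finds nothing when the host has only two parts.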

Related

Simple regex - finding words including numbers but only on occasion

I'm really bad at regex, I have:
/(@[A-Za-z-]+)/
which finds words after the @ symbol in a textbox, however I need it to ignore email addresses, like:
foo@things.com
however it finds @things
I also need it to include numbers, like:
@He2foo
however it only finds the @He part.
Help is appreciated, and if you feel like explaining regex in simple terms, that'd be great :D
/(?:^|(?<=\s))@([A-Za-z0-9]+)(?=[.?]?\s)/
@This (matched) regex ignores@this but matches on @separate tokens as well as tokens at the end of a sentence like @this. or @this? (without picking the . or the ?) And yes email@addresses.com are ignored too.
The regex while matching on @ also lets you quickly access what's after it (like userid in @userid) by picking up the regex group(1). Check PHP documentation on how to work with regex groups.
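The lookaround pattern from this answer can be tried out with the following Python sketch, written with the usual @ sign for mentions and email addresses. MENTION and mentions are illustrative names, and the sample sentence is invented.

```python
import re

# "@" must sit at the start of the string or right after whitespace, and the
# token must be followed by whitespace, optionally preceded by "." or "?".
MENTION = re.compile(r"(?:^|(?<=\s))@([A-Za-z0-9]+)(?=[.?]?\s)")


def mentions(text):
    """Return the names captured in group 1 for every match in text."""
    return MENTION.findall(text)


if __name__ == "__main__":
    print(mentions("@alice mailed bob@example.com about @carol2. Thanks"))
```

Because the lookahead requires whitespace after the token, a mention that is the very last thing in the string is not found; allowing the end of the string as an alternative in the lookahead would be one way to relax that.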
You can just add 0-9 to your regex, like so:
/(@[A-Za-z0-9-]+)/
I don't think any more explanation is needed since you've been able to come this far by yourself. 0-9 is just like a-z (though numeric, of course).
In order to ignore email addresses you will need to provide more specific requirements. You could try preceding @ with (^| ), which states that your value MUST be preceded by either the start of the string or a space.
Extending this, you can also use ($| ) at the end to require the value to be followed by the end of the string or a space (which means no period is allowed after it, and a period is required for a valid email address).
Update
$subject = "@a @b a@b a@ @b";
preg_match_all("/(^| )@[A-Za-z0-9-]+/", $subject, $matches);
print_r($matches[0]);
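A rough Python equivalent of the update above, written with the usual @ sign; the function name mentions_after_space is made up for this sketch. It collects the full matches, which is what $matches[0] holds in the PHP version.

```python
import re


def mentions_after_space(subject):
    """Return every full match of '(^| )@name' in subject."""
    pattern = re.compile(r"(^| )@[A-Za-z0-9-]+")
    # group(0) is the whole match, including a leading space if one was consumed.
    return [m.group(0) for m in pattern.finditer(subject)]


if __name__ == "__main__":
    print(mentions_after_space("@a @b a@b a@ @b"))
```

Two of the three matches carry the leading space that (^| ) consumed, so they may need trimming before use.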

Pattern matching in Perl

I am doing a pattern match on names like the ones below:
ABCD123_HH1
ABCD123_HH1_K
My code to match the above names is:
($name, $kind) = $dirname =~ /ABCD(\d+)\w*_([\w\d]+)/;
The problem I am facing is that I get both names, ABCD123_HH1 and ABCD123_HH1_K, in $dirname. However, my variable $kind is not set correctly for ABCD123_HH1_K. It is for ABCD123_HH1.
I appreciate your time. Could you please tell me what can be done to capture the part with _K?
You need to add the _K part to the end of your regex and make it optional with ?:
/ABCD(\d+)_([\w\d]+(_K)?)/
I also erased the \w*, which is useless and keeps you from correctly getting the HH1_K.
You should check for zero or more occurrences of _K.
* in Perl's regexp means zero or more times
+ means at least one time.
Hence in your regexp, append (_K)*. The \w* also has to go, because it greedily swallows _HH1 and leaves only K for the second group.
Finally, your regexp should be this:
/ABCD(\d+)_([\w\d]+(_K)*)/
\w includes letters, numbers as well as underscores.
So you can use something as simple as this:
/ABCD\w+/
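The effect of the greedy \w* in the question's pattern can be reproduced with a short Python sketch. ORIGINAL is the pattern from the question; FIXED is one possible simplification (an assumption for this illustration, relying on \w already covering digits and underscores); name_and_kind is a made-up helper.

```python
import re

ORIGINAL = re.compile(r"ABCD(\d+)\w*_([\w\d]+)")  # pattern from the question
FIXED = re.compile(r"ABCD(\d+)_(\w+)")            # no \w* before the underscore


def name_and_kind(pattern, dirname):
    """Return the two captured groups as a tuple, or None without a match."""
    m = pattern.search(dirname)
    return m.groups() if m else None


if __name__ == "__main__":
    for dirname in ("ABCD123_HH1", "ABCD123_HH1_K"):
        print(dirname, name_and_kind(ORIGINAL, dirname), name_and_kind(FIXED, dirname))
```

With the original pattern, \w* first takes everything up to the end and then backs off only as far as the last underscore, which is why the second group ends up holding just K.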

What regular expression can I use to find the Nᵗʰ entry in a comma-separated list?

I need a regular expression that can be used to find the Nth entry in a comma-separated list.
For example, say this list looks like this:
abc,def,4322,mail@mailinator.com,3321,alpha-beta,43
...and I wanted to find the value of the 6th entry (alpha-beta).
My first thought would not be to use a regular expression, but to use something that splits the string into an array on the comma. But since you asked for a regex:
Most regex engines allow you to specify a minimum or maximum number of repetitions, so something like this would probably work.
/(?:[^\,]*,){5}([^,]*)/
This is intended to match any number of characters that are not a comma, followed by a comma, exactly five times, (?:[^,]*,){5} (the ?: says not to capture), and then to match and capture any number of characters that are not a comma, ([^,]*). You want to use the first capture group.
Let me know if you need more info.
EDIT: I edited the above to not capture the first part of the string. This regex works in C# and Ruby.
You could use something like:
([^,]*,){$m}([^,]*),
As a starting point. (Replace $m with the value of (n-1).) The content would be in capture group 2. Because of the trailing comma, this doesn't handle the case where the entry you want is the last one in the list, but that's just a matter of making the appropriate modifications for your situation.
@list = split /,/ => $string;
$it = $list[5];   # index 5 holds alpha-beta
or just
$it = (split /,/ => $string)[5];
Beats writing a pattern with a {5} in it every time.
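Both approaches, the counted regex and the plain split, can be put side by side in a small Python sketch. The function names nth_entry_regex and nth_entry_split are made up, and n is assumed to be 1-based.

```python
import re


def nth_entry_regex(line, n):
    """Return the n-th (1-based) comma-separated entry using a counted regex."""
    # Skip n-1 "entry," groups without capturing, then capture the next entry.
    pattern = r"(?:[^,]*,){%d}([^,]*)" % (n - 1)
    m = re.match(pattern, line)
    return m.group(1) if m else None


def nth_entry_split(line, n):
    """Return the n-th (1-based) entry by splitting on commas."""
    parts = line.split(",")
    return parts[n - 1] if 0 < n <= len(parts) else None


if __name__ == "__main__":
    line = "abc,def,4322,mail@mailinator.com,3321,alpha-beta,43"
    print(nth_entry_regex(line, 6), nth_entry_split(line, 6))
```

Neither version understands quoted fields that contain commas; for real CSV data a CSV parser is the safer choice.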

regex string does not contain substring

I am trying to match a string which does not contain a substring
My string always starts "http://www.domain.com/"
The substring I want to exclude from matches is ".a/" which comes after the string (a folder name in the domain name)
There will be characters in the string after the substring I want to exclude
For example:
"http://www.domain.com/.a/test.jpg" should not be matched
But "http://www.domain.com/test.jpg" should be
Use a negative lookahead assertion as:
^http://www\.domain\.com/(?!\.a/).*$
The part (?!\.a/) fails the match if http://www.domain.com/ is immediately followed by the string .a/.
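The negative lookahead can be tried out with a short Python sketch; ALLOWED and is_allowed are illustrative names, and the URLs are the ones used as examples in this thread.

```python
import re

# The lookahead (?!\.a/) rejects URLs whose path starts with ".a/".
ALLOWED = re.compile(r"^http://www\.domain\.com/(?!\.a/).*$")


def is_allowed(url):
    """Return True if url is on the domain and does not start with the .a/ folder."""
    return ALLOWED.match(url) is not None


if __name__ == "__main__":
    for url in ("http://www.domain.com/test.jpg",
                "http://www.domain.com/.a/test.jpg"):
        print(url, is_allowed(url))
```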
My advice in such cases is not to construct overly complicated regexes with negative lookahead assertions or such stuff.
Keep it simple and stupid!
Do 2 matches, one for the positives, and sort out later the negatives (or the other way around). Most of the time, the regexes become easier, if not trivial.
And your program gets clearer.
For example, to extract all lines with foo, but not foobar, I use:
grep foo | grep -v foobar
I would try with
^http:\/\/www\.domain\.com\/([^.]|\.[^a]).*$
You want to match your domain, plus everything that does not continue with a ., plus everything that does continue with a . but not with an a after it. (You can add the / after the a if needed.)
If you don't use lookahead, but just simple regexes, you can say that the string passes if it matches your domain but doesn't match your domain followed by .a/
<?php
function foo($s) {
$regexDomain = '{^http://www\.domain\.com/}';
$regexDomainBadPath = '{^http://www\.domain\.com/\.a/}';
return preg_match($regexDomain, $s) && !preg_match($regexDomainBadPath, $s);
}
var_dump(foo('http://www.domain.com/'));
var_dump(foo('http://www.otherdomain.com/'));
var_dump(foo('http://www.domain.com/hello'));
var_dump(foo('http://www.domain.com/hello.html'));
var_dump(foo('http://www.domain.com/.a'));
var_dump(foo('http://www.domain.com/.a/hello'));
var_dump(foo('http://www.domain.com/.b/hello'));
var_dump(foo('http://www.domain.com/da/hello'));
?>
Note that http://www.domain.com/.a will pass the test, because the .a is not followed by a /.

Regex to detect one of several strings

I've got a list of email addresses belonging to several domains. I'd like a regex that will match addresses belonging to three specific domains (for this example: foo, bar, & baz)
So these would match:
a@foo
a@bar
b@baz
This would not:
a@fnord
Ideally, these would not match either (though it's not critical for this particular problem):
a@foobar
b@foofoo
Abstracting the problem a bit: I want to match a string that contains at least one of a given list of substrings.
Use the pipe symbol to indicate "or":
/a@(foo|bar|baz)\b/
If you don't want the capture-group, use the non-capturing grouping symbol:
/a@(?:foo|bar|baz)\b/
(Of course I'm assuming "a" is OK for the front of the email address! You should replace that with a suitable regex.)
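A minimal Python sketch of the alternation with a word boundary, written with the usual @ separator of email addresses and without the fixed a in front. DOMAIN and in_known_domain are illustrative names.

```python
import re

# "@" followed by one of the three domains, ending at a word boundary.
DOMAIN = re.compile(r"@(?:foo|bar|baz)\b")


def in_known_domain(address):
    """Return True if address contains @foo, @bar or @baz as a whole word."""
    return DOMAIN.search(address) is not None


if __name__ == "__main__":
    for address in ("a@foo", "a@bar", "b@baz", "a@fnord", "a@foobar", "b@foofoo"):
        print(address, in_known_domain(address))
```

The \b is what rejects a@foobar and b@foofoo. It would still accept something like a@foo.com, since a dot also ends the word.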
^(a|b)@(foo|bar|baz)$
if you have such a strongly defined list. The start and end anchors make sure that only those exact strings match.
Use:
/@(foo|bar|baz)\.?$/i
Note the differences from other answers:
\.? - matching 0 or 1 dots, in case the domains in the e-mail address are "fully qualified"
$ - to indicate that the string must end with this sequence,
/i - to make the test case insensitive.
Note, this assumes that each e-mail address is on a line on its own.
If the string being matched could be anywhere in the string, then drop the $, and replace it with \s+ (which matches one or more white space characters)
This should be more generic: the a shouldn't count, although the @ should.
/@(foo|bar|baz)(?:\W|$)/
Here is a good reference on regex.
edit: change ending to allow end of pattern or word break. now assuming foo/bar/baz are full domain names.
If the previous (and logical) answers about '|' don't suit you, have a look at
http://metacpan.org/pod/Regex::PreSuf
module description : create regular expressions from word lists
Ok I know you asked for a regex answer.
But have you considered just splitting the string with the '@' char
taking the second array value (the domain)
and doing a simple match test
if (splitString[1] == "foo" || splitString[1] == "bar" || splitString[1] == "baz")
{
//Do Something!
}
Seems to me that RegEx is overkill. Of course my assumption is that your case is really as simple as you have listed.
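The split-and-compare idea might look like this in Python. This is a sketch under the assumption that addresses use the usual @ separator; ALLOWED_DOMAINS and domain_is_allowed are made-up names.

```python
ALLOWED_DOMAINS = {"foo", "bar", "baz"}


def domain_is_allowed(address):
    """Return True if the part after the last '@' is one of the allowed domains."""
    # rpartition returns ('', '', address) when there is no "@" at all.
    _local, sep, domain = address.rpartition("@")
    return sep == "@" and domain in ALLOWED_DOMAINS


if __name__ == "__main__":
    for address in ("a@foo", "a@fnord", "a@foobar"):
        print(address, domain_is_allowed(address))
```

A set lookup expresses "one of these domains" directly, and the exact comparison rejects a@foobar without any extra work.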
You don't need a regex to find whether a string contains at least one of a given list of substrings. In Python:
def contain(string_, substrings):
    return any(s in string_ for s in substrings)
The above is slow for a large string_ and many substrings. GNU fgrep can efficiently search for multiple patterns at the same time.
Using regex
import re
def contain(string_, substrings):
    regex = '|'.join("(?:%s)" % re.escape(s) for s in substrings)
    return re.search(regex, string_) is not None
Related
Multiple Skip Multiple Pattern Matching Algorithm (MSMPMA) [pdf]