Match first character in a regex?

I have the following regex:
http://([^:]*):?([0-9]*)(/.*)
When I match that against http://brandonhsiao.com/essays/showers.html, the parentheses grab: http://brandonhsiao.com/essays and /showers.html. How can I get it to grab http://brandonhsiao.com and /essays/showers.html?

Put a question mark after the first * to make it non-greedy. Right now the part that matches the hostname grabs everything all the way up to the last /.
http://([^:]*?):?([0-9]*)(/.*)
But that's not even what I would recommend. Try this instead:
(http://[^\s/]+)([^\s?#]*)
$1 should have http://brandonhsiao.com and $2 should have /essays/showers.html and any hash or query string is ignored.
Note that this is not designed to validate a URL, just to divide a URL up into the portion before the path, and the path itself. For example, it would happily accept invalid characters as part of the hostname. However, it does work fine for URLs with or without paths.
P.S. I don't know exactly what you are doing with this in Lisp, so I have taken the liberty of only testing it in other PCRE-compatible environments. Usually I test my answers in the exact context where they will be used.
$_ = "http://brandonhsiao.com/essays/showers.html";
m|(http://[^\s/]+)([^\s?#]*)|;
print "1 = '$1' and 2 = '$2'\n";
# [j#5 ~]$ perl test2.pl
# 1 = 'http://brandonhsiao.com' and 2 = '/essays/showers.html'
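For readers who are not using Perl, here is a minimal sketch of the same split in Python. The pattern is the one from this answer; the helper name split_url is hypothetical and only there for illustration.

```python
import re

# Same pattern as in the answer: group 1 is scheme plus host,
# group 2 is the path (query string and fragment are left out).
URL_PARTS = re.compile(r"(http://[^\s/]+)([^\s?#]*)")

def split_url(url):
    """Return (origin, path), or None if the pattern does not match.

    split_url is a hypothetical helper name, not part of any library.
    """
    m = URL_PARTS.match(url)
    return (m.group(1), m.group(2)) if m else None

origin, path = split_url("http://brandonhsiao.com/essays/showers.html")
print(origin)  # http://brandonhsiao.com
print(path)    # /essays/showers.html
```

As with the Perl version, this only divides the string; it does not validate the URL.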

http://([^/:]*):?([0-9]*)(/.*)
The first group used to match everything but :, and now it also excludes /. That works because [^...] matches any character except the ones listed inside it. Everything else is the same.
Hope it helped!

http:\/\/([^:]*?)(\/.*)
The *? is a non-greedy match, so the first group stops at the first slash (the one just after .com).
See http://rubular.com/r/VmU2ghAX0k for match groups
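To see the difference between the greedy and the non-greedy quantifier side by side, here is a small Python sketch. It uses the question's original pattern and the variant with *? from the first answer; the variable names are just for illustration.

```python
import re

url = "http://brandonhsiao.com/essays/showers.html"

# Greedy: [^:]* first swallows the rest of the string, then backtracks
# only as far as needed, which is the LAST slash.
greedy = re.match(r"http://([^:]*):?([0-9]*)(/.*)", url)

# Non-greedy: [^:]*? grows one character at a time and stops as soon as
# the rest of the pattern can match, which is at the FIRST slash.
lazy = re.match(r"http://([^:]*?):?([0-9]*)(/.*)", url)

print(greedy.groups())  # ('brandonhsiao.com/essays', '', '/showers.html')
print(lazy.groups())    # ('brandonhsiao.com', '', '/essays/showers.html')
```

The middle group is the optional port, which is empty for this URL.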

Related

Get all matches for a certain pattern using RegEx

I am not really a RegEx expert and hence asking a simple question.
I have a few parameters that I need to use which are in a particular pattern
For example
$$DATA_START_TIME
$$DATA_END_TIME
$$MIN_POID_ID_DLAY
$$MAX_POID_ID_DLAY
$$MIN_POID_ID_RELTM
$$MAX_POID_ID_RELTM
And these will be replaced at runtime in a string with their values (a SQL statement).
For example I have a simple query
select * from asdf where asdf.starttime = $$DATA_START_TIME and asdf.endtime = $$DATA_END_TIME
Now when I try to use the RegEx pattern
\$\$[^\W+]\w+$
I do not get all the matches (I get only the last match).
I am trying to test my usage here https://regex101.com/r/xR9dG0/2
If someone could correct my mistake, I would really appreciate it.
Thanks!
This will do the job:
\$\$\w+/g
Some clarification on why your regex behaves the way it does:
\$\$[^\W+]\w+$
An unescaped $ means end of string, so your pattern only matches something at the end of the string. That's why it gets only the last match.
The class [^\W+] doesn't really make sense. A class starting with [^ negates the characters inside it, \W is the negation of a word character, and + inside a class means a literal +. So you are saying: match anything that is not a non-word character and not a + sign. I guess that was not what you wanted.
To match the following word, \w+ alone will do. The global modifier /g ensures that matching does not stop at the first match.
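The same behaviour can be reproduced in Python, where re.findall plays the role of the /g flag. This is only a sketch: the values dictionary is made up for illustration and is not part of the question.

```python
import re

sql = ("select * from asdf where asdf.starttime = $$DATA_START_TIME "
       "and asdf.endtime = $$DATA_END_TIME")

# findall returns every non-overlapping match, like the /g modifier.
all_params = re.findall(r"\$\$\w+", sql)
print(all_params)  # ['$$DATA_START_TIME', '$$DATA_END_TIME']

# The question's pattern ends in an unescaped $, so only a parameter
# sitting at the very end of the string can match.
anchored = re.findall(r"\$\$[^\W+]\w+$", sql)
print(anchored)  # ['$$DATA_END_TIME']

# Hypothetical runtime values, keyed by parameter name without the $$.
values = {"DATA_START_TIME": "100", "DATA_END_TIME": "200"}
filled = re.sub(r"\$\$(\w+)", lambda m: values[m.group(1)], sql)
print(filled)
```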
This should also work, based on what you said you wanted to match. It won't match $$lower_case_strings, if that's what you wanted. If not, add the "i" flag as well.
\${2}[A-Z_]+/g

Regex until repeat character (uri path)

I've got such examples
/path/to/service/People("Peter")
/i/dont/care/about/how/much/pathes/we/have/here/Customer("John")
/itcouldbejustone/Client("Rick")
I need a regex that leaves just People("Peter"), Customer("John"), Client("Rick") respectively.
I was trying to use:
\/.+?(?=\/)
but there are a lot of / slashes. How can I avoid matching them? Thanks.
Make it greedy:
\/.+(?=\/)
To also match the last /:
\/.+\/
You can do this without regex if using PHP:
$url = '/path/to/service/People("Peter")';
$parts = explode("/", $url); // end() takes its argument by reference, so it needs a variable
$name = end($parts);
$name will then hold People("Peter").
That depends a bit on which regex tool you are using.
Anyway, there are many ways of getting that:
If you will never have a slash at the final part, you can ask for whatever comes after the last slash:
.*\/([^\/]+)
Your desired result will be in group 1.
If the last part always has the format name("string"), you can match that format instead, which strikes me as a more explicit pattern:
\w+\(".*"\)
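Both approaches from the answers above (capture what follows the last slash, or split on the slash) can be sketched in Python. The function names are hypothetical; the regex is the one given in this answer, written without the escaped slashes since Python does not need them.

```python
import re

# Greedy .* runs to the end and backtracks to the last slash;
# group 1 is whatever follows it.
LAST_SEGMENT = re.compile(r".*/([^/]+)")

def last_segment_regex(path):
    m = LAST_SEGMENT.match(path)
    return m.group(1) if m else None

def last_segment_split(path):
    # No regex: split once from the right and keep the tail.
    return path.rsplit("/", 1)[-1]

paths = [
    '/path/to/service/People("Peter")',
    '/i/dont/care/about/how/much/pathes/we/have/here/Customer("John")',
    '/itcouldbejustone/Client("Rick")',
]
for p in paths:
    print(last_segment_regex(p), last_segment_split(p))
```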

How to capture text between two markers?

For clarity, I have created this:
http://rubular.com/r/ejYgKSufD4
My strings:
http://blablalba.com/foo/bar_soap/foo/dir2
http://blablalba.com/foo/bar_soap/dir
http://blablalba.com/foo/bar_soap
My Regular expression:
\/foo\/(.*)
This returns:
/foo/bar_soap/dir/dir2
/foo/bar_soap/dir
/foo/bar_soap
But I only want
/foo/bar_soap
Any ideas how I can achieve this? As illustrated above, I want everything after foo up until the first forward slash.
Thanks in advance.
Edit: I only want the text after foo until the next forward slash. Some directories may also be named foo, and this would render incorrect results. Thanks
. will match anything, so you should change it to [^/] (not slash) instead:
\/foo\/([^\/]*)
Some of the other answers use + instead of *. That might be correct depending on what you want to do. Using + forces the regex to match at least one non-slash character, so this URL would not match since there isn't a trailing character after the slash:
http://blablalba.com/foo/
Using * instead would allow that to match since it matches "zero or more" non-slash characters. So, whether you should use + or * depends on what matches you want to allow.
Update
If you want to filter out query strings too, you could also filter against ?, which must come at the front of all query strings. (I think the examples you posted below are actually missing the leading ?):
\/foo\/([^?\/]*)
However, rather than rolling your own solution, it might be better to use split from the URI module. You could use URI::split to get the path part of the URL, then use String#split to split it on / and grab the segment after foo. This would handle all the weird cases for URLs. One that you probably haven't thought of yet is a URL with a fragment, e.g.:
http://blablalba.com/foo#bar
You would need to add # to your filtered-character class to handle those as well.
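The parse-then-split idea is not specific to Ruby. As a rough sketch, here is the same approach in Python, with urllib.parse.urlsplit standing in for Ruby's URI::split. The function name segment_after_foo is made up for this example.

```python
from urllib.parse import urlsplit

def segment_after_foo(url):
    """Return the path segment right after the first 'foo' segment.

    Hypothetical helper. Returns None when there is no such segment.
    """
    # urlsplit separates the path from the query string and fragment,
    # so neither ? nor # needs special handling here.
    segments = urlsplit(url).path.split("/")
    # "/foo/bar_soap/dir" -> ["", "foo", "bar_soap", "dir"]
    if "foo" in segments:
        i = segments.index("foo")
        if i + 1 < len(segments):
            return segments[i + 1]
    return None

print(segment_after_foo("http://blablalba.com/foo/bar_soap/foo/dir2"))  # bar_soap
print(segment_after_foo("http://blablalba.com/foo#bar"))                # None
```

Like the * version of the regex, this returns an empty string for a URL that ends in /foo/.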
You can try this regular expression
/\/foo\/([^\/]+)/
\/foo\/([^\/]+)
[^\/]+ matches a series of characters that are not a forward slash.
The parentheses cause the regex engine to store the matched contents in a group, ([^\/]+), so you can get bar_soap out of the entire match /foo/bar_soap.
For example, in javascript you would get the matched group as follows:
const regexp = /\/foo\/([^\/]+)/;
const match = regexp.exec("/foo/bar_soap/dir");
console.log(match[1]); // prints bar_soap

Regex for matching last two parts of a URL

I am trying to figure out the best regex to simply match only the last two strings in a url.
For instance with www.stackoverflow.com I just want to match stackoverflow.com
The issue I have is that some strings can have a large number of periods, for instance
a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com
should also return only yimg.com
The set of URLS I am working with does not have any of the path information so one can assume the last part of the string is always .org or .com or something of that nature.
What regular expression will return stackoverflow.com when run against www.stackoverflow.com and yimg.com when run against a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com
under the conditions above?
You don't have to use regex, instead you can use a simple explode function.
So you're looking to split your URL at the periods, so something like
$url = "a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com";
$url_split = explode(".",$url);
And then you need to get the last two elements, so you can echo them out from the array created.
//this will return the second to last element, yimg
echo $url_split[count($url_split)-2];
//this will echo the period
echo ".";
//this will return the last element, com
echo $url_split[count($url_split)-1];
So in the end you'll get yimg.com as the final output.
Hope this helps.
I don't know what you have tried so far, but I can offer the following solution:
/.*?([\w]+\.[\w]+)$/
There are a couple of tricks here:
Use $ to anchor the match at the end of the string. This way you can be sure the regex engine won't settle for a match near the beginning.
Use grouping with (...). Here it means: match a word of at least one character, then a dot (backslash-escaped because a dot has a special meaning in regex and we want it literally), then another word of at least one character.
Use a reluctant quantifier at the beginning of the pattern, because otherwise it will match everything in a greedy manner. For example, if your text is:
abc.def.gh
the greedy match will give f.gh in your group, and that's not what you want.
I assumed that the host contains only word characters (\w matches letters, digits and underscore; for your real data you may need something more permissive, for example to allow hyphens).
Here is a working Groovy example. You didn't specify the language you use, but the engine should be similar.
def s = "abc.def.gh"
def m = s =~/.*?([\w]+\.[\w]+)$/
println m[0][1] // outputs the first (and the only you have) group in groovy
Hope this helps
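The greedy versus reluctant point is easy to check. Here is a minimal Python sketch using the pattern from this answer (with [\w] simplified to \w) on the abc.def.gh example; the variable names are only for illustration.

```python
import re

host = "abc.def.gh"

# Greedy .* takes as much as it can and leaves the group the bare minimum.
greedy = re.match(r".*(\w+\.\w+)$", host).group(1)

# Reluctant .*? takes as little as it can, so the group gets whole labels.
reluctant = re.match(r".*?(\w+\.\w+)$", host).group(1)

print(greedy)     # f.gh
print(reluctant)  # def.gh

long_host = "a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com"
print(re.match(r".*?(\w+\.\w+)$", long_host).group(1))  # yimg.com
```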
If you need a solution using Perl-compatible regular expressions, which work in a number of languages, you can use something like this. The example is in PHP:
$url = "a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com";
preg_match('|[a-zA-Z0-9-]+\.[a-zA-Z]{2,3}$|', $url, $m);
print($m[0]);
This regex fetches the domain name plus the top-level domain, provided the top-level domain is two or three letters long. For example, with a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com this produces
yimg.com
as output, and with www.stackoverflow.com (with or without the leading www) it gives you
stackoverflow.com
as a result.
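A shorter version (note that the match includes the leading dot, e.g. .yimg.com, so the input needs at least three labels):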
/(\.[^\.]+){2}$/
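A quick Python sketch shows what the shorter pattern actually returns, next to a split-based version in the spirit of the explode answer above. The helper name last_two_labels is hypothetical.

```python
import re

SHORT = re.compile(r"(\.[^.]+){2}$")

# The match starts at a dot, so the leading dot is part of the result.
with_www = SHORT.search("www.stackoverflow.com")
print(with_www.group(0))  # .stackoverflow.com

# With only two labels there is just one dot, so nothing matches.
without_www = SHORT.search("stackoverflow.com")
print(without_www)  # None

def last_two_labels(host):
    """Split on dots and rejoin the last two labels (no regex)."""
    return ".".join(host.split(".")[-2:])

print(last_two_labels("a-abcnewsplus.i-a277eea3.rtmp.atlas.cdn.yimg.com"))  # yimg.com
print(last_two_labels("stackoverflow.com"))  # stackoverflow.com
```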

regex string does not contain substring

I am trying to match a string which does not contain a substring
My string always starts "http://www.domain.com/"
The substring I want to exclude from matches is ".a/" which comes after the string (a folder name in the domain name)
There will be characters in the string after the substring I want to exclude
For example:
"http://www.domain.com/.a/test.jpg" should not be matched
But "http://www.domain.com/test.jpg" should be
Use a negative lookahead assertion as:
^http://www\.domain\.com/(?!\.a/).*$
The part (?!\.a/) fails the match if the domain prefix is immediately followed by .a/.
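Here is a minimal Python sketch of the lookahead in action, using the pattern from this answer. The helper name is_allowed is made up for the example.

```python
import re

# (?!\.a/) looks at what follows the domain without consuming it and
# rejects the match when that text starts with ".a/".
ALLOWED = re.compile(r"^http://www\.domain\.com/(?!\.a/).*$")

def is_allowed(url):
    return ALLOWED.match(url) is not None

print(is_allowed("http://www.domain.com/.a/test.jpg"))  # False
print(is_allowed("http://www.domain.com/test.jpg"))     # True
print(is_allowed("http://www.domain.com/.b/test.jpg"))  # True
```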
My advice in such cases is not to construct overly complicated regexes with negative lookahead assertions or similar.
Keep it simple and stupid!
Do 2 matches, one for the positives, and sort out later the negatives (or the other way around). Most of the time, the regexes become easier, if not trivial.
And your program gets clearer.
For example, to extract all lines with foo, but not foobar, I use:
grep foo | grep -v foobar
I would try with
^http:\/\/www\.domain\.com\/([^.]|\.[^a]).*$
You want to match your domain, followed either by a character that is not a . or by a . that is not followed by an a. (You can add the / afterwards if needed.)
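It is worth checking where this alternation differs from the negative lookahead pattern given in the earlier answer. The sketch below compares the two in Python; the test URLs beyond the two from the question are made up for illustration.

```python
import re

ALTERNATION = re.compile(r"^http://www\.domain\.com/([^.]|\.[^a]).*$")
LOOKAHEAD = re.compile(r"^http://www\.domain\.com/(?!\.a/).*$")

urls = [
    "http://www.domain.com/test.jpg",     # accepted by both
    "http://www.domain.com/.a/test.jpg",  # rejected by both
    "http://www.domain.com/.about.html",  # starts with ".a" but not ".a/"
    "http://www.domain.com/",             # nothing after the domain
]

results = {
    url: (bool(ALTERNATION.match(url)), bool(LOOKAHEAD.match(url)))
    for url in urls
}
for url, (alt, look) in results.items():
    print(url, alt, look)
```

The alternation needs at least one character after the domain and rejects anything starting with .a, so it turns down the last two URLs, while the lookahead accepts them.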
If you don't want lookahead but just simple regexes, you can check that the string matches your domain but does not match the domain followed by .a/:
<?php
function foo($s) {
$regexDomain = '{^http://www\.domain\.com/}'; // dots escaped so they only match a literal .
$regexDomainBadPath = '{^http://www\.domain\.com/\.a/}';
return preg_match($regexDomain, $s) && !preg_match($regexDomainBadPath, $s);
}
var_dump(foo('http://www.domain.com/'));
var_dump(foo('http://www.otherdomain.com/'));
var_dump(foo('http://www.domain.com/hello'));
var_dump(foo('http://www.domain.com/hello.html'));
var_dump(foo('http://www.domain.com/.a'));
var_dump(foo('http://www.domain.com/.a/hello'));
var_dump(foo('http://www.domain.com/.b/hello'));
var_dump(foo('http://www.domain.com/da/hello'));
?>
Note that http://www.domain.com/.a will pass the test, because it doesn't end with /.