How to capture text between two markers? - regex

For clarity, I have created this:
http://rubular.com/r/ejYgKSufD4
My strings:
http://blablalba.com/foo/bar_soap/foo/dir2
http://blablalba.com/foo/bar_soap/dir
http://blablalba.com/foo/bar_soap
My Regular expression:
\/foo\/(.*)
This returns:
/foo/bar_soap/foo/dir2
/foo/bar_soap/dir
/foo/bar_soap
But I only want
/foo/bar_soap
Any ideas how I can achieve this? As illustrated above, I want everything after foo up until the first forward slash.
Thanks in advance.
Edit: I only want the text after foo up until the next forward slash. Some later directories may also be named foo, and this would produce incorrect results. Thanks

. will match anything, so you should change it to [^/] (not slash) instead:
\/foo\/([^\/]*)
Some of the other answers use + instead of *. That might be correct depending on what you want to do. Using + forces the regex to match at least one non-slash character, so this URL would not match since there isn't a trailing character after the slash:
http://blablalba.com/foo/
Using * instead would allow that to match since it matches "zero or more" non-slash characters. So, whether you should use + or * depends on what matches you want to allow.
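To see the difference between * and + concretely, here is a small Python sketch. Python is only my choice for illustration (the question's demo is on Rubular), and first_segment is a made-up helper name. Python does not need the slashes escaped.

```python
import re

urls = [
    "http://blablalba.com/foo/bar_soap/foo/dir2",
    "http://blablalba.com/foo/bar_soap/dir",
    "http://blablalba.com/foo/bar_soap",
    "http://blablalba.com/foo/",
]

star = re.compile(r"/foo/([^/]*)")  # zero or more non-slash characters
plus = re.compile(r"/foo/([^/]+)")  # at least one non-slash character


def first_segment(pattern, url):
    """Return the captured segment after the first /foo/, or None if no match."""
    match = pattern.search(url)
    return match.group(1) if match else None


star_results = [first_segment(star, u) for u in urls]
plus_results = [first_segment(plus, u) for u in urls]

print(star_results)  # ['bar_soap', 'bar_soap', 'bar_soap', '']
print(plus_results)  # ['bar_soap', 'bar_soap', 'bar_soap', None]
```

The only difference shows up on the last URL: * matches with an empty capture, while + does not match at all.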
Update
If you want to filter out query strings too, you could also filter against ?, which must come at the front of all query strings. (I think the examples you posted below are actually missing the leading ?):
\/foo\/([^?\/]*)
However, rather than rolling your own solution, it might be better to just use split from the URI module. You could use URI::split to get the path part of the URL, then use String#split to split it up by /, and grab the first one. This would handle all the weird cases for URLs. One that you probably haven't thought of yet is a URL with a specified fragment, e.g.:
http://blablalba.com/foo#bar
You would need to add # to your filtered-character class to handle those as well.
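For comparison, the same parse-then-split idea can be sketched in Python with urllib.parse.urlsplit, which plays roughly the role of Ruby's URI::split here. This is only an illustration under my own assumptions: segment_after and its marker parameter are hypothetical names, and it picks the segment after the first path segment named foo.

```python
from urllib.parse import urlsplit


def segment_after(url, marker="foo"):
    """Return the path segment right after the first `marker` segment, or None."""
    # urlsplit separates the path from the query string and the fragment,
    # so neither ?... nor #... can leak into the result.
    segments = urlsplit(url).path.split("/")
    if marker in segments:
        position = segments.index(marker)
        if position + 1 < len(segments):
            return segments[position + 1]
    return None


print(segment_after("http://blablalba.com/foo/bar_soap/dir?x=1#top"))  # bar_soap
print(segment_after("http://blablalba.com/foo#bar"))                   # None
```

Because the query string and fragment are stripped before splitting, no extra characters need to be added to a character class.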

You can try this regular expression
/\/foo\/([^\/]+)/

\/foo\/([^\/]+)
[^\/]+ gives you a series of characters that are not a forward slash.
The parentheses cause the regex engine to store the matched contents in a group ([^\/]+), so you can get bar_soap out of the entire match of /foo/bar_soap.
For example, in JavaScript you would get the matched group as follows:
const regexp = /\/foo\/([^\/]+)/;
const match = regexp.exec("/foo/bar_soap/dir");
console.log(match[1]); // prints bar_soap

Related

How to use regex to select whole string in one match except the forward slash /

I need some help with regex with this test string:
Kershing_User ID/Electronic Delivery_6ZZ138429_ 3142-999999__1
I want one match to select everything except the forward slash, so this would be acceptable:
Kershing_User IDElectronic Delivery_6ZZ138429_ 3142-999999__1
Even better would be to return this with a substitution of a _.
Kershing_User ID_Electronic Delivery_6ZZ138429_ 3142-999999__1
I know how to do lookarounds and can individually match the part before and after the /, but not all in one match. Anything else I have tried has come up with two separate matches. I am using this with an application called Laserfiche, so as far as I know there is no ability to do find & replace or to extract a group, just a single match. My apologies if I don't have the terminology correct. I am not even sure if this is possible. I tried for a while and came up with the patterns below, but can't get it in one match.
This does the before: .*(?=\/)
This does the after: (?<=\/).*
Even better would be to return this with a substitution of a _.
import re
a = "Kershing_User ID/Electronic Delivery_6ZZ138429_ 3142-999999__1"
print(re.sub('/', '_', a))
This will replace every / with _. See the Python documentation for re.sub for more.

Regex to match only urls that contains a certain path

I am using the following regex
(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?
and it's showing me a url but I want to show only URLS that contain
/video/hd/
The following correction of the Regex above did not deal correctly with slashes
((?:https\:\/\/)|(?:http\:\/\/)|(?:www\.))?([a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,3}(?:\??)[a-zA-Z0-9\-\._\?\,\'\/\\\+&%\$#\=~]+)
You said only the whole match is used, and the regex contains no backreferences. Therefore we can replace all capturing groups (( )) in the regex by non-capturing groups ((?: )). A few of the groups are redundant, and http|https can be simplified to https?. Together this gives us
(?:https?|ftp)://[\w_-]+(?:\.[\w_-]+)+(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?
_ is not allowed in hostnames:
(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?
Technically - cannot appear at the beginning or end of a hostname, but we'll ignore that. Your regex doesn't allow non-default ports or IPv6 hosts either, but we'll ignore that, too.
The stuff matched by the last part of your regex (which is presumably meant to match path, query string, and anchor all together) can overlap with the hostname (both \w and - are in both character classes). We can fix this by requiring a separator of either / or ? after the hostname:
(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+(?:[/?][\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?
Now we can start looking at your additional requirement: The URL should contain /video/hd/. Presumably this string should appear somewhere in the path. We can encode this as follows:
(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+/(?:[\w.,@^=%&:/~+-]*/)?video/hd/(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?
Instead of matching an optional separator of / or ?, we now always require a / after the hostname. This / must be followed by either video/hd/ directly or 0 or more path characters and another /, which is then followed by video/hd/. (The set of path characters does not include ? (which would start the query string) or # (which would start the anchor).)
As before, after /video/hd/ there can be a final part of more path components, a query string, and an anchor (all optional).
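A quick way to sanity-check the final pattern is to run it against a few URLs. The sketch below uses Python's re module; the sample URLs and the helper name contains_video_hd are my own, not from the question. The class used for the leading path segments deliberately leaves out ? and #, as described above.

```python
import re

VIDEO_HD_URL = re.compile(
    r"(?:https?|ftp)://[\w-]+(?:\.[\w-]+)+"   # scheme and hostname
    r"/(?:[\w.,@^=%&:/~+-]*/)?video/hd/"      # optional leading path, then /video/hd/
    r"(?:[\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?"  # optional rest: path, query, anchor
)


def contains_video_hd(url):
    """True if the whole string is a URL whose path contains /video/hd/."""
    return VIDEO_HD_URL.fullmatch(url) is not None


print(contains_video_hd("http://example.com/video/hd/clip.mp4"))     # True
print(contains_video_hd("https://example.com/media/video/hd/"))      # True
print(contains_video_hd("http://example.com/video/sd/clip.mp4"))     # False
print(contains_video_hd("http://example.com/page?ref=/video/hd/x"))  # False
```

The last URL is rejected because /video/hd/ only appears in the query string, not in the path.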
First of all, you need a regex to match URLs (be they http, https...)
(([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))
Once you've got that, you need to select them but not "consume" them. You can do this with a lookahead, i.e. a regex that asserts that what follows the current position is e.g. foo:
(?=foo)
Of course you will replace foo with the first regex I wrote.
At this point, you know you selected a URL; now you just constrain your search to URLs that contain /video/hd/:
.*\/video\/hd\/.*
So the complete regex is
(?=(([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+@)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+@)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%@.\w_]*)#?(?:[\w]*))).*\/video\/hd\/.*
You can test it here with a live demo.

Regex - Matching a part of a URL

I'm trying to use regular expression to match a part of the following url:
http://www.example.com/store/store.html?ptype=lst&id=370&3434323&root=nav_3&dir=desc&order=popularity
I want the Regex to find:
&3434323
Basically, it's meant to find any part of the query string that doesn't follow the variable=value format. So I need it to search for sections of the URL that don't have an equals sign in them, and match just that part.
I tried using:
&\w*+[^=_-]
But it returns: &3434323&. I need it to not return the next ampersand.
And it must be done in regex. Thanks in advance!
You can use this regex:
[?&][^=]+(&|$)
It looks for any string that doesn't contain the equals sign ([^=]+), starts with a question mark or an ampersand ([?&]), and ends with an ampersand or the end of the URL ((&|$)).
Please note that this will return &3434323&, so you'll have to strip the ampersands on both sides in your code. I assume that you're fine with that. If you really don't want the second ampersand, you can use a lookahead:
[?&][^=]+(?=&|$)
If you don't want even the first ampersand, you can use this regex, but not all regex engines support lookbehind:
(?<=\?|&)[^=]+(?=&|$)
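Here is how the three variants behave on the URL from the question, sketched in Python (my choice of language; the question does not name one). The only change is that the trailing group is written as non-capturing, (?:&|$), so that re.findall returns the whole match rather than just the group.

```python
import re

url = (
    "http://www.example.com/store/store.html"
    "?ptype=lst&id=370&3434323&root=nav_3&dir=desc&order=popularity"
)

# Consumes the delimiter on both sides.
with_both = re.findall(r"[?&][^=]+(?:&|$)", url)

# Lookahead: the trailing & is checked but not consumed.
with_leading = re.findall(r"[?&][^=]+(?=&|$)", url)

# Lookbehind plus lookahead: neither delimiter is part of the match.
bare = re.findall(r"(?<=\?|&)[^=]+(?=&|$)", url)

print(with_both)     # ['&3434323&']
print(with_leading)  # ['&3434323']
print(bare)          # ['3434323']
```

Python accepts the lookbehind here because both alternatives are exactly one character wide.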
Parsing query parameters can be tricky, but this may do the job:
((?:[?&])[^=&]+)(?=&|$)
It will not catch the ampersand at the end of the parameter, but it will include either the question mark or the ampersand at the beginning. It will match any parameter not in the form of a key-value pair.
Demo here.

ColdFusion - How to get only the URL's in this block of text?

How can I extract only the URL's from the given block of text?
background(http://w1.sndcdn.com/f15ikDS9X_m.png)
background-image(http://w1.sndcdn.com/5ikDIlS9X_m.png)
background('http://w1.sndcdn.com/m1kDIl9X_m.png')
background-image('http://w1.sndcdn.com/fm15iIlS9X_m.png')
background("http://w1.sndcdn.com/fm15iklS9X_m.png")
background-image("http://w1.sndcdn.com/m5iIlS9X_m.png")
Perhaps Regex would work, but I'm not advanced enough to work it out!
Many thanks!
Mikey
You're over-thinking the problem - all you need to do is match the URLs, which is a simple match:
rematch('\bhttps?:[^)''"]+',input)
That'll work based on the input provided - might need tweaking if different input used.
(e.g. You can optionally add a \s into the char class if that might be a factor.)
The regex itself is simple:
\bhttps?: ## look for http: or https: with no alphanumeric chars beforehand.
[^)'"]+ ## match characters that are NOT ) or ' or "
## match as many as possible, at least one required.
If this is matching false positives, you can of course look for a more refined URL regex, such as these.
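The same pattern is not specific to ColdFusion's rematch. As a rough illustration, this is what it does with Python's re.findall on the input from the question; I have included the optional \s in the character class so that a URL also stops at whitespace.

```python
import re

css = """\
background(http://w1.sndcdn.com/f15ikDS9X_m.png)
background-image(http://w1.sndcdn.com/5ikDIlS9X_m.png)
background('http://w1.sndcdn.com/m1kDIl9X_m.png')
background-image('http://w1.sndcdn.com/fm15iIlS9X_m.png')
background("http://w1.sndcdn.com/fm15iklS9X_m.png")
background-image("http://w1.sndcdn.com/m5iIlS9X_m.png")
"""

# http: or https: at a word boundary, then everything up to a closing
# parenthesis, a quote, or whitespace.
urls = re.findall(r"""\bhttps?:[^)'"\s]+""", css)

for found in urls:
    print(found)
```

This prints the six bare URLs, one per line, with the surrounding parentheses and quotes left out.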
DEMO
background(?:-image)?\((["']?)(?<url>http.*)\1\)
Explanation:
background(?:-image)? -> matches background or background-image (without capturing)
\( -> matches a literal opening parenthesis
(["']?) -> matches a ' or a " before the URL, or nothing at all
(?<url>http.*) -> matches the URL
\1\) -> matches whatever the first group captured (the third line of this explanation) and then a literal closing parenthesis
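If you want to try this pattern outside ColdFusion, note that named-group syntax varies between engines. The sketch below is for Python, which spells the named group (?P<url>...) instead of (?<url>...); everything else is the pattern as explained above, and the three sample lines are taken from the question.

```python
import re

css = """\
background(http://w1.sndcdn.com/f15ikDS9X_m.png)
background-image('http://w1.sndcdn.com/fm15iIlS9X_m.png')
background("http://w1.sndcdn.com/fm15iklS9X_m.png")
"""

pattern = re.compile(
    r"""background(?:-image)?\((["']?)(?P<url>http.*)\1\)"""
)

# The backreference \1 forces the closing quote to match the opening one
# (or to be absent when there was no opening quote).
urls = [match.group("url") for match in pattern.finditer(css)]

for found in urls:
    print(found)
```

The greedy .* is safe here only because . does not cross line breaks by default, so each match stays on its own line.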
If you want an answer without regular expressions, something like this will work.
YourString = "background(http://w1.sndcdn.com/f15ikDS9X_m.png)";
YourString = ListLast(YourString, "("); // yields http://w1.sndcdn.com/f15ikDS9X_m.png)
YourString = replace(YourString, ")", ""); // http://w1.sndcdn.com/f15ikDS9X_m.png
Since you are doing it more than once, you can make it a function. Also, you might need some other replace commands to handle the quotes that are in some of your strings.
Having said all that, getting a regex to work would be better.

How to match a string that does not end in a certain substring?

How can I write a regular expression that matches a string that does not contain a certain substring at the end?
In my project, all classes whose names don't end with certain strings such as "controller" and "map" should inherit from a base class. How can I do this using a regular expression?
but using both
public*.class[a-zA-Z]*(?<!controller|map)$
public*.class*.(?<!controller)$
there isn't any match!
Do a search for all filenames matching this:
(?<!controller|map|anythingelse)$
(Remove the |anythingelse if no other keywords, or append other keywords similarly.)
If you can't use negative lookbehinds (the (?<!..) bit), do a search for filenames that do not match this:
(?:controller|map)$
And if that still doesn't work (might not in some IDEs), remove the ?: part and it probably will - that just makes it a non-capturing group, but the difference here is fairly insignificant.
If you're using something where the full string must match, then you can just prefix either of the above with ^.* to do that.
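Both approaches can be tried out quickly. The sketch below uses Python, which is my assumption since the question doesn't say which engine is in use. Python's re module only allows fixed-width lookbehinds, so the alternation is split into two separate lookbehinds. The sample names and the case-insensitive flag are also my own additions.

```python
import re

names = ["UserController", "OrderMap", "Invoice", "MapReader"]

# Python rejects (?<!controller|map) because the alternatives differ in
# length, so each suffix gets its own lookbehind.
not_suffix = re.compile(r"(?<!controller)(?<!map)$", re.IGNORECASE)

# The inverted approach: find names that DO end with a suffix, then
# keep the ones that do not match.
has_suffix = re.compile(r"(?:controller|map)$", re.IGNORECASE)

via_lookbehind = [n for n in names if not_suffix.search(n)]
via_inverted = [n for n in names if not has_suffix.search(n)]

print(via_lookbehind)  # ['Invoice', 'MapReader']
print(via_inverted)    # ['Invoice', 'MapReader']
```

MapReader is kept in both cases because map appears at the start of the name, not at the end.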
Update:
In response to this:
but using both
public*.class[a-zA-Z]*(?<!controller|map)$
public*.class*.(?<!controller)$
there isn't any match!
Not quite sure what you're attempting with the public/class stuff there, so try this:
public.*class.*(?<!controller|map)$
The . is a regex char that means "anything except newline", and the * means zero or more times.
If this isn't what you're after, edit the question with more details.
Depending on your regex implementation, you might be able to use a lookbehind for this task. This would look like
(?<!SomeText)$
This matches any lines NOT having "SomeText" at their end. If you cannot use that, the expression
^(?!.*SomeText$).*$
matches any line (including an empty one) that does not end with "SomeText".
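The lookahead form is handy when lookbehind is unavailable or restricted. A minimal Python sketch, with "SomeText" swapped for the two suffixes from the question (the sample strings are made up):

```python
import re

# The negative lookahead runs at the start of the string and rejects it
# if the rest ends with one of the suffixes.
no_suffix = re.compile(r"^(?!.*(?:Controller|Map)$).*$")

candidates = ["UserController", "OrderMap", "Invoice", ""]
matches = [c for c in candidates if no_suffix.match(c) is not None]

print(matches)  # ['Invoice', '']
```

Note that the empty string also passes, because .* is allowed to match nothing; use .+ instead if empty input should be rejected.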
You could write a regex that contains two groups, one consists of one or more characters before controller or map, the other contains controller or map and is optional.
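^(.+?)(controller|map)?$
With that you can match your string, and if the regex API you use has a group() method and group(2) is empty, the string does not end with controller or map. (The first group has to be lazy, .+?, because a greedy .+ would swallow the suffix and leave the second group always empty.)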
Check if the name does not match [a-zA-Z]*controller or [a-zA-Z]*map.
Finally, I did it this way:
public.*class.*[^(controller|map|spec)]$
It worked.
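One caveat on the pattern just above: square brackets form a character class, so [^(controller|map|spec)] only checks that the last character is not one of those letters or symbols. It never lets a forbidden suffix through, but it also rejects any name that happens to end in c, o, n, t, r, l, e, m, a, p or s. A small Python sketch with made-up class names shows the difference from a lookbehind:

```python
import re

# Character class: tests only the final character.
char_class = re.compile(r"public.*class.*[^(controller|map|spec)]$")

# Lookbehinds: test the actual suffixes (one per suffix, since Python
# requires fixed-width lookbehinds).
lookbehind = re.compile(
    r"public.*class.*(?<!controller)(?<!map)(?<!spec)$"
)

samples = [
    "public class usercontroller",  # forbidden suffix
    "public class invoice",         # allowed, but ends in 'e'
    "public class task",            # allowed, ends in 'k'
]

char_class_hits = [s for s in samples if char_class.search(s)]
lookbehind_hits = [s for s in samples if lookbehind.search(s)]

print(char_class_hits)  # ['public class task']
print(lookbehind_hits)  # ['public class invoice', 'public class task']
```

So the character-class version can appear to work on a small set of names while silently skipping classes such as invoice.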