Simple regex replaces more than it should in a string - regex

I'm trying to create a regular expression that replaces a URL with a token. This is how far I got, and I don't understand why this expression replaces everything after the URL as well, except for the last word.
string<-"This is a website http://www.bla.com that I like very much"
gsub("https?://.*\\s|www.*\\s"," [url] ",string)
>>"this is a website [url] much"
Appreciate your help very much!

The problem is the .*: it matches anything greedily, so the match runs all the way up to the last whitespace character in the string. Try
gsub("https?://[^[:blank:]]*","[url]",string) instead.
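For comparison, here is a minimal sketch of the same two patterns in Python rather than R. It assumes that `[^ \t]` is a fair stand-in for R's `[^[:blank:]]` (space and tab); the variable names are only illustrative.

```python
import re

text = "This is a website http://www.bla.com that I like very much"

# Greedy ".*" followed by "\s": the match is extended to the LAST whitespace
# character in the string, so everything up to the final word is replaced.
greedy = re.sub(r"https?://.*\s|www.*\s", " [url] ", text)

# Negated class: the match stops at the first space or tab after the URL.
bounded = re.sub(r"https?://[^ \t]*", "[url]", text)

print(greedy)
print(bounded)
```

The first call reproduces the behaviour from the question (only "much" survives after the token), while the second leaves the rest of the sentence intact.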

Related

URL regex that skips ending periods

I'm trying to create a regex that matches url strings within normal text. I have this:
http[s]?://[^\s]+
This seems to work well with the exception that if the url is at the end of a sentence it will grab the period as well. For example for this string:
I am typing some text with the url http://something.com/some-thing?args=someargs. This is another sentence.
it matches:
http://something.com/some-thing?args=someargs.
I would like it to match:
http://something.com/some-thing?args=someargs
Obviously I can't exclude periods because they are in the url previously but I can't figure out how to tell it to exclude the last period if there is one. I could potentially use a negative lookahead for end of line or whitespace, but if it's in the middle of the line (without a period after it) that would leave off the last character of the url.
Most of the ones I have seen online have the same issue that they match the ending dot so maybe it's not possible? I know basic regex but certainly not a genius with it so if someone has a solution I would be very grateful :).
Also, I can do some post-process in this case to remove the dot if I need to, just seems like there should be a Regex solution...
Try this one
http[s]?://[^\s]+[^. ]
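A quick way to check this is the sketch below, written in Python as an assumption since the question does not name a language. The second pattern is a hypothetical variant, not part of the answer: it excludes all whitespace (not just a space) from the last position, which matters when the URL is followed by a newline or tab.

```python
import re

text = ("I am typing some text with the url "
        "http://something.com/some-thing?args=someargs. This is another sentence.")

# Pattern from the answer: the last character must be neither a period nor a
# space, so the regex backtracks and gives the trailing period back.
answer_pattern = re.compile(r"http[s]?://[^\s]+[^. ]")

# Hypothetical variant: the last character must be neither whitespace nor a period.
stricter_pattern = re.compile(r"https?://\S*[^\s.]")

print(answer_pattern.findall(text))
print(stricter_pattern.findall(text))

# Edge case: a URL directly followed by a newline.
multiline = "See http://a.com/x\nnext line"
print(answer_pattern.findall(multiline))    # the newline is swallowed here
print(stricter_pattern.findall(multiline))
```

Both patterns drop the sentence-ending period; they only differ in the newline case.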

Clojure Regex: given a string, how can I return valid URLs in that string?

I am trying to return valid URLs (as a substring) in a string in Clojurescript, what Regular Expression can I use?
(re-find #"regex for valid URL" "You can visit www.google.com")
=> "www.google.com"
(re-find #"regex for valid URL" "<b>www.google.com</b>")
=> "www.google.com"
(re-find #"regex for valid URL" "<b>www.google.com</b> and www.yahoo.com")
=> "www.google.com, www.yahoo.com"
Depending on how carefully you want your script to validate the URL, the regex you provided, as long as you get rid of the '^' and '$' anchors, works fairly well (as seen here).
Note that I added some whitespace in the regex just for readability.
There are several issues that I see with that regex (as you can probably see on that page). It matches in places where it shouldn't (such as repeated .. characters), and for sites ending in .co.uk it matches the .co portion along with the domain name and the .uk separately. That, by itself, can be fairly easy to fix by simply adding those edge cases directly into the second group (the one with (com|org|...)).
The reason you'll need to remove the '^' and '$' anchors is that the pattern will only match if the URL is the only thing on the line: ^ has to match at the beginning of the line, and $ can only match at the end. Having <b>www.google.com</b> means that the <b> will make the ^ anchor fail to match the URL since it's not starting at the beginning of the line.
The other suggestions, such as @amalloy's link, give a much more comprehensive solution and will match everything correctly, but they are very complex.
So knowing exactly what you want to match, and what you're willing to ignore/trade/give up, will help craft something that works for you.
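To illustrate the point about anchors, here is a rough sketch in Python (not ClojureScript). The pattern `simple_url` is a deliberately simplistic assumption made up for this example, not the regex discussed above, and it only recognises hosts that start with "www.".

```python
import re

# Hypothetical, deliberately loose pattern: "www." + a name + one or more ".tld" parts.
simple_url = r"www\.[A-Za-z0-9-]+(?:\.[A-Za-z]{2,})+"

text = "<b>www.google.com</b> and www.yahoo.com"

# With ^ and $ the whole input has to be a URL, so nothing is found inside the HTML.
anchored = re.findall("^" + simple_url + "$", text)

# Without the anchors every occurrence inside the text is returned.
unanchored = re.findall(simple_url, text)

print(anchored)
print(unanchored)
```

Note that getting all URLs back needs a "find all" style call; a single-match function returns only the first one.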

URL Rewrite Pattern to exclude application name from path

I'm trying to use the IIS 7 URL Rewrite feature for the first time, and I'm having trouble getting my regular expression working. It seems like it should be simple enough. All I need to do is rewrite a URL like this:
http://localhost/myApplication/MySpecialFolder
To:
http://localhost/MySpecialFolder
Is this possible? I want the regular expression to ignore everything before "myApplication" in the original URL, so that I could use "http://localhost" OR "http://mysite", etc.
Here's what I've got so far:
^myApplication/MySpecialFolder$
But using the "Test Pattern..." feature in IIS, it says my patterns don't match unless I supply "myApplication/MySpecialFolder" exactly. Does anyone know how I can update my regular expression so that everything prior to "myApplication" is ignored and the following URLs will be seen as a match?
http://localhost/myApplication/MySpecialFolder
http://mysite/myApplication/MySpecialFolder
Many thanks in advance!
SOLUTION:
I needed to change my regex to:
myApplication/MySpecialFolder
Without the ^ at the beginning and without the $ at the end.
Your regular expression is correct; the pattern is matched against the path starting after the first slash that follows the domain.
So only the myApplication/MySpecialFolder part of http://localhost/myApplication/MySpecialFolder is used for matching.
To limit the rewriting to a specific domain, you have to use the Conditions section with Condition input = {HTTP_HOST}.
Unless there is something radically different with regexes in IIS, you would want to take out the anchor (^) at the beginning to match.
myApplication/MySpecialFolder$
The caret ^ anchors the match to the beginning of the string and the dollar sign $ anchors it to the end. A regex like abc finds "abc" anywhere in the string, ^abc matches strings that start with "abc", abc$ matches strings that end with "abc", and ^abc$ only matches when the whole string is "abc".
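The four cases can be checked with a short sketch. Python's `re` module is used here purely as a stand-in for the IIS regex engine (an assumption; the anchors behave the same way for these inputs), and the sample strings are made up.

```python
import re

samples = ["abc", "xabc", "abcx", "xabcx"]

def matching(pattern):
    """Return the samples in which the pattern is found."""
    return [s for s in samples if re.search(pattern, s)]

print(matching("abc"))    # found anywhere
print(matching("^abc"))   # must start with "abc"
print(matching("abc$"))   # must end with "abc"
print(matching("^abc$"))  # whole string must be "abc"

# Applied to the question: the anchored pattern matches the bare path,
# but not a full URL that has extra text in front of it.
rule = r"^myApplication/MySpecialFolder$"
print(bool(re.search(rule, "myApplication/MySpecialFolder")))
print(bool(re.search(rule, "http://localhost/myApplication/MySpecialFolder")))
```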

Jmeter - Regex Extractor not working

I'm trying to extract a simple string from the HTML response.
The response looks like this
patients-list-of-visits.aspx?p=a1363839-76fb-43f3-97ba-26218faefee1
The Regex I have tried so far are
patients-list-of-visits.aspx?p=(.+?)
patients-list-of-visits.aspx?p=(.+)
Can someone please let me know what I am doing wrong here?
Thanks!
This is better:
patients-list-of-visits\.aspx\?p=(.+)
Two remarks:
Don't forget to escape . and ? if you want to match them literally.
Your first attempt, (.+?), is a lazy match and will capture only the first character. Your second attempt is better.
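Both remarks can be seen in a small sketch. It uses Python's `re` module as a stand-in for JMeter's regex engine, which is an assumption; the escaping and lazy-matching behaviour shown is standard. The last pattern, with a hex-and-dash class, is a hypothetical alternative that is not part of the answer.

```python
import re

response = "patients-list-of-visits.aspx?p=a1363839-76fb-43f3-97ba-26218faefee1"

# Unescaped: "x?" means "an optional x", so a literal "?" in the text never matches.
unescaped = re.search(r"patients-list-of-visits.aspx?p=(.+)", response)

# Escaped and greedy: captures everything after "p=".
greedy = re.search(r"patients-list-of-visits\.aspx\?p=(.+)", response)

# Escaped but lazy with nothing after the group: stops after one character.
lazy = re.search(r"patients-list-of-visits\.aspx\?p=(.+?)", response)

# Hypothetical tighter variant: only hex digits and dashes, so trailing
# markup such as a closing quote would not be captured.
bounded = re.search(r"patients-list-of-visits\.aspx\?p=([0-9a-fA-F-]+)", response)

print(unescaped)
print(greedy.group(1))
print(lazy.group(1))
print(bounded.group(1))
```

In a real HTML response the greedy (.+) would run to the end of the line, so a bounded class like the last one may be the safer choice.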

Regular Expressions with part of a URI

I'm not very familiar with regular expressions.
I'm trying to create a regular expression that will match the text between the first group of two forward slashes. It's easiest just to show an example.
Search Texts:
/index
/index/
/index/foo/
/index/foo/bar/
All of those should return just "index"
Another example:
Search Texts:
/page.php
/page.php/foo?bar=1
Should return just "page.php" for both of those
Thanks a lot, guys!
Try this one for javascript or php preg_match: ^\/([^\/]*)
The pattern matches only if there is a slash at the beginning, and then captures everything up to the next slash (or the end of the string).
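As a sanity check against the examples from the question, here is a minimal sketch in Python (an assumption; the answer targets JavaScript or PHP, where the slashes are escaped because / is the pattern delimiter). The helper name `first_segment` is made up for illustration.

```python
import re

# Same pattern as above; in a Python raw string the slashes need no escaping.
FIRST_SEGMENT = re.compile(r"^/([^/]*)")

def first_segment(path):
    """Return the text between the leading slash and the next slash, or None."""
    match = FIRST_SEGMENT.match(path)
    return match.group(1) if match else None

for path in ["/index", "/index/", "/index/foo/", "/index/foo/bar/",
             "/page.php", "/page.php/foo?bar=1"]:
    print(path, "->", first_segment(path))
```

A path without a leading slash, such as "index", returns None with this helper.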