Golang regex to find URLs in a string

I am trying to find all links in a string and then hyperlink them,
like this JS library does: https://github.com/bryanwoods/autolink-js
I tried a lot of regexes, but I always got too many errors:
http://play.golang.org/p/iQiccXvFiB
I don't know whether Go has a different regex syntax.
So, which regex works in Go and is good at matching URLs in strings?
Thanks.

You can use xurls:
package main

import (
	"fmt"

	"mvdan.cc/xurls/v2" // Relaxed() and Strict() are functions in the v2 module
)

func main() {
	fmt.Println(xurls.Relaxed().FindString("Do gophers live in golang.org?"))
	// golang.org
	fmt.Println(xurls.Relaxed().FindAllString("foo.com is http://foo.com/.", -1))
	// [foo.com http://foo.com/]
	fmt.Println(xurls.Strict().FindAllString("foo.com is http://foo.com/.", -1))
	// [http://foo.com/]
}

Use back-ticks instead of double-quotes for your string literals. Back-slashes inside double-quotes start escape sequences, which you don't need/want for this use case.
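To illustrate the back-tick advice, here is a minimal sketch using only the standard regexp package. The pattern is a deliberately simple placeholder, not a complete URL matcher, and the names urlPattern and autolink are made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// urlPattern is a simplified, hypothetical pattern. It is written as a raw
// (back-tick) string, so \s reaches the regexp engine without extra escaping.
var urlPattern = regexp.MustCompile(`https?://[^\s<>"]+`)

// autolink wraps every match in an anchor tag.
// In the replacement template, ${0} expands to the whole match.
func autolink(text string) string {
	return urlPattern.ReplaceAllString(text, `<a href="${0}">${0}</a>`)
}

func main() {
	fmt.Println(autolink("see http://golang.org for docs"))
	// see <a href="http://golang.org">http://golang.org</a> for docs
}
```

Writing ${0} with braces avoids any ambiguity about where the group name ends when it is followed by letters or digits.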
Additionally, how did you expect this to work?
"$0"

Related

How to match all URLs until slash?

I need a RegEx for matching all of these URLs:
https://www.domain.tld/service?itm_pm=de:ncp:ctr:c1cn:0:0
https://www.domain.tld/service
https://www.domain.tld/service/
But not these:
https://www.domain.tld/service/afdsasdaf
https://www.domain.tld/service/afdsasdaf/asdasd
I tried it with
https://www.domain.tld/service[^/]*
but it doesn't work.
Mark the end of the string.
Summary of changes:
Use $ to anchor the match at the end of the string.
A / usually needs to be escaped, although this depends on your language and regex delimiters.
The . must be escaped as well; otherwise wwwwdomain.tld would also match.
Solution with working example:
https:\/\/www\.domain\.tld\/service[^\/]*\/?$
You can play around with it here:
https://regex101.com/r/wm6Nit/1
If you want to allow https://www.domain.tld/service/ specifically, do that explicitly:
https://www.domain.tld/service(/?|[^/]*)$

Setting regular expression to validate URL format in Adobe CQ5

I want to validate a URL inside a textfield using Adobe CQ5, so I set up the properties regex and regexText as usual, but for some reason it is not working:
<facebook
jcr:primaryType="cq:Widget"
emptyText="http://www.facebook.com/account-name"
fieldDescription="Set the Facebook URL"
fieldLabel="Facebook"
name="./facebookUrl"
regex="/^(http://www.|https://www.|http://|https://)[a-z0-9]+([-.]{1}[a-z0-9]+)*.[a-z]{2,5}(:[0-9]{1,5})?(/.*)?$/"
regexText="Invalid URL format"
xtype="textfield"/>
So when I type inside the component I can see an error message at the console:
Uncaught TypeError: this.regex.test is not a function
To be more accurate the error comes from this line:
if (this.regex && !this.regex.test(value)) {
I tried several regular expressions and none of them worked. I guess the problem is the regular expression itself, because on the other hand I have this other regex to validate email addresses, and it works perfectly fine:
/^[A-za-z0-9]+[\\._]*[A-za-z0-9]*@[A-za-z.-]+[\\.]+[A-Za-z]{2,4}$/
Any suggestions? Thanks in advance.
The syntax of your regex seems to treat the forward slashes (/) as special characters. Since you want to parse a URL containing slashes, my guess is you should escape them twice like this: '\\/' instead of '/'. The result would be:
/^(http:\\/\\/www.|https:\\/\\/www.|http:\\/\\/|https:\\/\\/)[a-z0-9]+([-.]{1}[a-z0-9]+)*.[a-z]{2,5}(:[0-9]{1,5})?(\\/.*)?$/
You need to escape them twice because the string to be compiled as a regex must contain '\/' to escape the slashes, but to introduce a backslash in a string you have to escape the backslash itself too.

Regex with non-capturing hashbangs

I'm trying to write a regex which will parse the hash portion of a URL, removing whichever conventionally-formatted hashbang may be present.
For example, I wish to remove any of the following:
#
#/
#!
#!/
This is what I currently have:
/[(?:#|#\/|#!|#!\/)]+/
However, this is capturing an empty group at the start, and splitting the remaining strings. For example,
"#!/E/F".split(/[(?:#|#\/|#!|#!\/)]/); // ["", "", "", "E", "F"]
Whereas the desirable outcome is simply a single group
["E/F"]
Could someone please point out the error in my regex?
[If it makes a difference, I produced the above output using the JavaScript console in Firebug.]
Use string.replace instead of string.split.
#!?\/?
Use the above regex and then replace the match with empty string.
> '#!/E/F'.replace(/#!?\/?/g, '');
'E/F'
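The same replace-instead-of-split approach can be sketched in Go. This is only an illustration: the function name stripHashbang is made up, and the pattern is anchored with ^ (an addition to the regex above) so that only a leading prefix is removed.

```go
package main

import (
	"fmt"
	"regexp"
)

// hashbangPrefix matches "#", "#/", "#!" or "#!/" at the start of the string.
var hashbangPrefix = regexp.MustCompile(`^#!?/?`)

// stripHashbang removes a leading hashbang-style prefix, if there is one.
func stripHashbang(hash string) string {
	return hashbangPrefix.ReplaceAllString(hash, "")
}

func main() {
	fmt.Println(stripHashbang("#!/E/F")) // E/F
	fmt.Println(stripHashbang("#/E/F"))  // E/F
}
```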
Your regex seems awfully complicated. Maybe this is more what you're looking for:
"#!/E/F".split(/(#!\/|#\/|#!|#)/);
Did you check out the JavaScript regex documentation?
It might be different from what you imagined, since I don't understand why you're using the : and ? in your regex.
If you're using Javascript then you can just use:
location.assign(location.href.replace(/#.*$/, ""));
However, if you only want to remove the hash prefixes listed above, use:
var repl = location.href.replace(/#(!\/?|\/)?$/, '');

Selecting URLs using RegExp but ignoring them when surrounded by double quotes

I've searched around quite a bit now, but I can't get any suggestions to work in my situation. I've seen success with negative lookahead or lookaround, but I really don't understand it.
I wish to use RegExp to find URLs in blocks of text but ignore them when quoted. While not perfect yet I have the following to find URLs:
(https?\://)?(\w+\.)+\w{2,}(:[0-9])?\/?((/?\w+)+)?(\.\w+)?
I want it to match the following:
www.test.com:50/stuff
http://player.vimeo.com/video/63317960
odd.name.amazone.com/pizza
But not match:
"www.test.com:50/stuff
http://plAyerz.vimeo.com/video/63317960"
"odd.name.amazone.com/pizza"
Edit:
To clarify, I could be passing a full paragraph of text through the expression. Sample paragraph of what I'd like below:
I would like the following link to be found www.example.com. However this link should be ignored "www.example.com". It would be nice, but not required, to have "www.example.com and www.example.com" ignored as well.
Below is a sample of a different one that I have working. The language is PHP:
$articleEntry = "Hey guys! Check out this cool video on Vimeo: player.vimeo.com/video/63317960";
$pattern = array('/\n+/', '/(https?\:\/\/)?(player\.vimeo\.com\/video\/[0-9]+)/');
$replace = array('<br/><br/>',
'<iframe src="http://$2?color=40cc20" width="500" height="281" frameborder="0" webkitAllowFullScreen mozallowfullscreen allowFullScreen></iframe>');
$articleEntry = preg_replace($pattern,$replace,$articleEntry);
The result of the above will replace any new lines "\n" with a double break "<br/><br/>" and will embed the Vimeo video by replacing the Vimeo address with an iframe and link.
I've found a solution!
(?=(([^"]+"){2})*[^"]*$)((https?:\/\/)?(\w+\.)+\w{2,}(:[0-9]+)?((\/\w+)+(\.\w+)?)?\/?)
The first part, from (? to *$), is what makes it work for me. I found it in an answer to "java Regex - split but ignore text inside quotes?" by https://stackoverflow.com/users/548225/anubhava
While I had read that question before, I had overlooked his answer because it wasn't the one that "solved" the question. I just changed the single quote to double quote and it works out for me.
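Note that a lookahead like this is not available in Go's standard regexp package, which does not support lookaround. A possible workaround, sketched below under that assumption, is to let the pattern consume quoted runs as one alternative and capture bare URLs in a group, then keep only the matches where the group is non-empty. The names quotedOrURL and unquotedURLs are made up, and the URL part is a lightly adapted version of the pattern above.

```go
package main

import (
	"fmt"
	"regexp"
)

// quotedOrURL matches either a double-quoted run (group 1 stays empty)
// or a bare URL-like token (captured in group 1).
var quotedOrURL = regexp.MustCompile(
	`"[^"]*"|((?:https?://)?(?:\w+\.)+\w{2,}(?::[0-9]+)?(?:/\w+)*/?)`)

// unquotedURLs returns the URL-like tokens that are not inside double quotes.
func unquotedURLs(text string) []string {
	var urls []string
	for _, m := range quotedOrURL.FindAllStringSubmatch(text, -1) {
		if m[1] != "" { // skip matches that were quoted runs
			urls = append(urls, m[1])
		}
	}
	return urls
}

func main() {
	text := `Find www.example.com. But ignore "www.example.com" please.`
	fmt.Printf("%q\n", unquotedURLs(text))
	// ["www.example.com"]
}
```

Because the quoted alternative swallows everything between a pair of quotes, URLs inside the quotes are never offered to the URL alternative.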
Add ^ and $ to your regex:
^(https?\://)?(\w+\.)+\w{2,}(:[0-9])?\/?((/?\w+)+)?(\.\w+)?$
Note that you might need to escape the slashes after http (that is, https?\:\/\/).
Update:
If you want it to be case-sensitive, use [a-z] rather than \w. \w matches all letters (upper and lower case), digits, and the underscore, so be careful when using it.

Regex to get a filename from a url

I am trying to write a regex to get the filename from a url if it exists.
This is what I have so far:
(?:[^/][\d\w\.]+)+$
So from the url http://www.foo.com/bar/baz/filename.jpg, I should match filename.jpg
Unfortunately, I match anything after the last /.
How can I tighten it up so it only grabs it if it looks like a filename?
The examples above fail to get the file name "file-1.name.zip" from this URL:
"http://sub.domain.com/sub/sub/handler?file=data/file-1.name.zip&v=1"
So I created my REGEX version:
[^/\\&\?]+\.\w{3,4}(?=([\?&].*$|$))
Explanation:
[^/\\&\?]+ # file name - group of chars without URL delimiters
\.\w{3,4} # file extension - 3 or 4 word chars
(?=([\?&].*$|$)) # positive lookahead to ensure that file name is at the end of string or there is some QueryString parameters, that needs to be ignored
This one works well for me.
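If you need this in Go, the lookahead has to go, because the standard regexp package does not support it. A rough equivalent, sketched here with made-up names, captures the file name in a group and lets the rest of the pattern consume the optional query-string tail.

```go
package main

import (
	"fmt"
	"regexp"
)

// fileNamePattern captures "name.ext" (3 or 4 word characters in the
// extension) when it is followed by the end of the string or by a
// query-string tail starting with ? or &.
var fileNamePattern = regexp.MustCompile(`([^/\\&?]+\.\w{3,4})(?:[?&].*)?$`)

// lastFileName returns the captured file name, or "" if nothing matches.
func lastFileName(rawURL string) string {
	m := fileNamePattern.FindStringSubmatch(rawURL)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	fmt.Println(lastFileName("http://sub.domain.com/sub/sub/handler?file=data/file-1.name.zip&v=1"))
	// file-1.name.zip
	fmt.Println(lastFileName("http://www.foo.com/bar/baz/filename.jpg"))
	// filename.jpg
}
```

Like the original pattern, this works on the raw string only, so a URL with no path at all can still have part of its host name mistaken for a file name.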
(\w+)(\.\w+)+(?!.*(\w+)(\.\w+)+)
(?:.+\/)(.+)
Select all up to the last forward slash (/), capture everything after this forward slash. Use subpattern $1.
Non Pcre
(?:[^/][\d\w\.]+)$(?<=\.\w{3,4})
Pcre
(?:[^/][\d\w\.]+)$(?<=(?:.jpg)|(?:.pdf)|(?:.gif)|(?:.jpeg)|(more_extension))
Since you are testing with regexpal.com, which is based on JavaScript (and doesn't support lookbehind), try this instead:
(?=\w+\.\w{3,4}$).+
I'm using this:
(?<=\/)[^\/\?#]+(?=[^\/]*$)
Explanation:
(?<=): positive look behind, asserting that a string has this expression, but not matching it.
(?<=/): positive look behind for the literal forward slash "/", meaning I'm looking for an expression that is preceded by a forward slash, without including the slash in the match.
[^/\?#]+: one or more characters which are not either "/", "?" or "#", stripping search params and hash.
(?=[^/]*$): positive look ahead for anything not matching a slash, then matching the line ending. This is to ensure that the last forward slash segment is selected.
Example usage:
const urlFileNameRegEx = /(?<=\/)[^\/\?#]+(?=[^\/]*$)/;
const testCases = [
"https://developer.mozilla.org/en-US/docs/Web/API/MutationObserverInit#yo",
"https://developer.mozilla.org/static/fonts/locales/ZillaSlab-Regular.subset.bbc33fb47cf6.woff2",
"https://developer.mozilla.org/static/build/styles/locale-en-US.520ecdcaef8c.css?is-nice=true"
];
testCases.forEach(testStr => console.log(`The file of ${testStr} is ${urlFileNameRegEx.exec(testStr)[0]}`))
This might work as well:
(\w+\.)+\w+$
You know what your delimiters look like, so you don't need a regex. Just split the string. Since you didn't mention a language, here's an implementation in Perl:
use strict;
use warnings;
my $url = "http://www.foo.com/bar/baz/filename.jpg";
my @url_parts = split /\//, $url;
my $filename = $url_parts[-1];
if(index($filename,".") > 0 )
{
print "It appears as though we have a filename of $filename.\n";
}
else
{
print "It seems as though the end of the URL ($filename) is not a filename.\n";
}
Of course, if you need to worry about specific filename extensions (png,jpg,html,etc), then adjust appropriately.
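The same no-regex idea can be sketched in Go with the standard net/url and path packages. The function name fileNameFromURL is made up for this example, and the "contains a dot that is not the first character" test simply mirrors the Perl check above.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
	"strings"
)

// fileNameFromURL returns the last path segment if it looks like a file
// name, that is, if it contains a "." that is not its first character.
func fileNameFromURL(rawURL string) (string, bool) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", false
	}
	// u.Path excludes the query string and fragment.
	name := path.Base(u.Path)
	if strings.Index(name, ".") > 0 {
		return name, true
	}
	return "", false
}

func main() {
	name, ok := fileNameFromURL("http://www.foo.com/bar/baz/filename.jpg")
	fmt.Println(name, ok) // filename.jpg true
}
```

Parsing the URL first means query strings and fragments are dropped before the last segment is inspected.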
> echo "http://www.foo.com/bar/baz/filename.jpg" | sed 's/.*\/\([^\/]*\..*\)$/\1/g'
filename.jpg
Assuming that you will be using javascript:
var fn=window.location.href.match(/([^/])+/g);
fn = fn[fn.length-1]; // get the last element of the array
alert(fn.substring(0,fn.indexOf('.')));//alerts the filename
Here is the code you may use:
\/([\w.][\w.-]*)(?<!\/\.)(?<!\/\.\.)(?:\?.*)?$
The names "." and ".." are not treated as valid file names.
You can play with this regexp here: https://regex101.com/r/QaAK06/1/
In case you are using the JavaScript URL object, you can use the pathname combined with the following RegExp:
.*\/(.[^(\/)]+)
Benefit:
It matches anything at the end of the path, but excludes a possible trailing slash (as long as there aren't two trailing slashes)!
Try this one instead:
(?:[^/]*+)$(?<=\..*)
This worked for me. Whether or not the name contains a '.', it takes the last segment (suffix) of the URL:
\/(\w+)[\.|\w]+$