Detect URL in a string without any whitespace regexp - regex

So I know the idea of catching any URL is a very difficult task, and that's not what i'm wanting to do. I'm wanting to find a piece of regex that'll catch urls in the form of
http://something.xx.yy
http://www.something.xxx
www.something.xx.yy
in a string that will contain lots of other text and no whitespace, so for example
hellopleasevisitwww.something.xxthankyou
I've tried my best to detect something like that by myself, but it's been pretty fruitless. Any help would be great. Below are some of the expressions I tried to modify in order to have these requirements met
.*\\(?\\b(http://|www[.])[-A-Za-z0-9+&##/%?=~_()|!:,.;]*[-A-Za-z0-9+&##/%=~_()|].*
\\b\\w*\\(?\\b(http://|www[.])[-A-Za-z0-9+&##/%?=~_()|!:,.;]*[-A-Za-z0-9+&##/%=~_()|]\\w*\\b
\\(?\\b(http://|www[.])[-A-Za-z0-9+&##/%?=~_()|!:,.;]*[-A-Za-z0-9+&##/%=~_()|]
Thanks for your time

If it really can be as simple as you're saying...
(http://(www\\.)?|www\\.)[^.]+\\.(\\w{3}|\\w{2}\\.\\w{2})
The expressions you tried all have \\b which is a word boundary and your string unfortunately does not have word boundaries.
See it in action

Related

Regular expression/Regex with Java/Javascript: performance drop or infinite loop

I want here to submit a very specific performance problem that i want to understand.
Goal
I'm trying to validate a custom synthax with a regex. Usually, i'm not encountering performance issues, so i like to use it.
Case
The regex:
^(\{[^\][{}(),]+\}\s*(\[\s*(\[([^\][{}(),]+\s*(\(\s*([^\][{}(),]+\,?\s*)+\))?\,?\s*)+\]\s*){1,2}\]\s*)*)+$
A valid synthax:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4]]
You could find the regex and a test text here :
https://regexr.com/3jama
I hope that be sufficient enough, i don't know how to explain what i want to match more than with a regex ;-).
Issue
Applying the regex on valid text is not costing much, it's almost instant.
But when it comes to specific not valid text case, the regexr app hangs. It's not specific to regexr app since i also encountered dramatic performances with my own java code or javascript code.
Thus, my needs is to validate all along the user is typing the text. I can even imagine validating the text on click, but i cannot afford that the app will be hanging if the text submited by the user is structured as the case below, or another that produce the same performance drop.
Reproducing the issue
Just remove the trailing "]" character from the test text
So the invalid text to raise the performance drop becomes:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4
Another invalid test could be, and with no permformance drop:
{Section}[[actor1, actor2(syno1, syno2)][expr1,expr2]][[actor3,actor4(syno3, syno4)][expr3,expr4]]]
Request
I'll be glad if a regex guru coming by could explain me what i'm doing wrong, or why my use case isn't adapted for regex.
This answer is for the condensed regex from your comment:
^(\{[^\][{}(),]+\}(\[(\[([^\][{}(),]+(\(([^\][{}(),]+\,?)+\))?\,?)+\]){1,2}\])*)+$
The issues are similar for your original pattern.
You are facing catastrophic backtracking. Whenever the regex engine cannot complete a match, it backtracks into the string, trying to find other ways to match the pattern to certain substrings. If you have lots of ambiguous patterns, especially if they occur inside repetitions, testing all possible variations takes a looooong time. See link for a better explanation.
One of the subpatterns that you use is the following (multilined for better visualisation):
([^\][{}(),]+
(\(
([^\][{}(),]+\,?)+
\))?
\,?)+
That is supposed to match a string like actor4(syno3, syno4). Condensing this pattern a little more, you get to ([^\][{}(),]+,?)+. If you remove the ,? from it, you get ([^\][{}(),]+)+ which is an opening gate to the catasrophic backtracking, as string can be matched in quite a lot of different ways with this pattern.
I get what you try to do with this pattern - match an identifier - and maybe other other identifiers that are separated by comma. The proper way of doing this however is: ([^\][{}(),]+(?:,[^\][{}(),]+)*). Now there isn't an ambiguous way left to backtrack into this pattern.
Doing this for the whole pattern shown above (yes, there is another optional comma that has to be rolled out) and inserting it back to your complete pattern I get to:
^(\{[^\][{}(),]+\}(\[(\[([^\][{}(),]+(\(([^\][{}(),]+(?:,[^\][{}(),]+)*)\))?(?:\,[^\][{}(),]+(\(([^\][{}(),]+(?:,[^\][{}(),]+))*\))?)*)\]){1,2}\])*)+$
Which doesn't catastrophically backtrack anymore.
You might want to do yourself a favour and split this into subpatterns that you concat together either using strings in your actual source or using defines if you are using a PCRE pattern.
Note that some regex engines allow the use of atomic groups and possessive quantifiers that further help avoiding needless backtracking. As you have used different languages in your title, you will have to check yourself, which one is available for your language of choice.

Regex skipping first match?

I'm trying to match every plain "twitter-like hashtags" in a text, and make hyperlinks of them.
I've had some success, but strangely enough my regular expression /(^|\s)#(\w*[a-zA-Z_]+\w*)/ is skipping the first case when the string starts with the # sign (the rest are properly matched). If I write anything else before, then the first case is properly matched. It fails only when a # sign appears to be the first character of the string.
Do you know why could it be?
function make_hashes_into_twitter_hashtag_urls($content){
$content = preg_replace('/(^|\s)#(\w*[a-zA-Z_]+\w*)/', '<span class="color_my_hash">\1#</span>\2 ', $content);
echo $content;
}
add_filter('the_content','make_hashes_into_twitter_hashtag_urls');
Thanks,
The the_content(); template tag outputs HTML and text, and HTML and regular expressions don't play nice together.
I don't know the details about why exactly something like <p>#, (which is what the_content(); was outputing), was making my regex to fail.
I'd love to know why, and how my regex should be - to avoid being trapped (even if, as stated, it's not recommended to mix regex and HTML).
My question is mainly answered, but any additional details on how to hack/workaround this situation would be hugely appreciated!

Is there a function to create a regex pattern from a string input?

I'm lousy at regular expressions but occasionally they're the only thing that's the right solution for a problem.
Is there something in the .NET framework that allows you to input an unencoded string and get a pattern from it? Which you could then modify as required?
e.g. I want to remove a CDATA section that contains a file from some XML but I can't work out what the right pattern is for <![CDATA[hugepileofrandombinarydataherethatalsoneedstogo]]> and I don't want to ask for help each time I'm stuck on a regex pattern.
Such tools exist, google by "regex generator".
But, as suggested in comments, better learn regex. Simple patterns are easy. Something like <!\[.*?]]>
in your case.
There are Regex Design tools like expresso...
http://www.ultrapico.com/expresso.htm
It's not perfect but as there is no suitable .Net component the text to regex page at txt2re.com is the best I've seen for those people who occasionally need to build a regex to match a string but don't have the time to relearn regex each time they want to use one.

please regex url request

I would like to know if anybody can help me with a regular expression problem. I want to write a regular expression to catch URLs similar to this URL:
www.justin.tv/channel_name_here
I have tried:
/justin\.tv\/(.*)
The problem I get is that when this channel goes live, sometimes the URL transforms to something like this:
www.justin.tv/channel_name_here#/w/45365675688
I can't catch this. :( Can anybody please help me with this? I just want to catch the channel name without the pound symbol and the rest of the URL.
Here are some example URLs:
www.justin.tv/winning_movies#/w/6347562128
http://www.justin.tv/cine_accion_hd16#/w/6347562128/18
http://www.justin.tv/fox_movies_hd1/
I would want to get:
winning_movies
cine_accion_hd16
fox_movies_hd1
Thanks in advance! :)
Short answer:
(?<=justin\.tv\/)([^#\/]+)
Long answer:
Let's split this up into parts. Look at the back part first.
([^#\/]+)
This delimits the string into parts that don't include either '#' or '/'.
Now let's look at the first part.
(?<=justin\.tv\/)
The syntax "(?<=" followed by ")" is called positive lookbehind (this page has good examples and explanation of the different types of lookaround). Using a simple example:
(?<=A)B
The above example says "I want all 'B' that are immediately after an 'A'." Going to our big example, we're saying we want all parts (separated by '#' or '/') that are immediately after a part called "justin.tv/".
Look here for an example of the expression in action.
#justin\.tv/([^#/]+)#
If you want everything up to a certain character(-set), use a negated class.
Also, when working on regex for urls, using / as delimiter is error-prone, as you have to escape all the /'s. Use something else instead (like # in this case)

In what ways can I improve this regular expression?

I have written this regex that works, but honestly, it’s like 75% guesswork.
The goal is this: I have lots of imports in Xcode, like so:
#import <UIKit/UIKit.h>
#import "NSString+MultilineFontSize.h"
and I only want to return the categories that contain +. There are also lots of lines of code throughout the source which include + in other contexts.
Right now, this returns all of the proper lines throughout the Xcode project. But if there is one thing I’ve learned from googling and searching Stack Overflow for regex tutorials, it is that there are LOTS of different ways to do things. I’d love to see all of the different ways you guys can come up with that make it either more efficient or more bulletproof regarding potential spoofs or misses.
^\#import+.[\"]*+.(?:(?!\+).)*+.*[\"]
Thanks in advance for all of your help.
Update
Also I suppose I’ll accept the answer of whoever does this with the shortest string, without missing any possible spoofs. But again, thanks to everyone who participates in this learning experience.
Resources from answers
This is an awesome resource for practicing regex from Dan Rasmussen: RegExr
The first thing I notice is that your + characters are misplaced: t+. matches t one or more times, followed by a single character .. I'm assuming you wanted to match the end of import, followed by one or more of any character: import.+
Secondly, # doesn't need to be escaped.
Here's what I came up with: ^#import\s+(.*\+.*)$
\s+ matches one or more whitespace character, so you're guaranteed that the line actually starts with #import and not #importbutnotreally or anything else.
I'm not familiar with xcode syntax, but the following part of the expression, (.*\+.*), simply matches any string with a + character somewhere in it. This means invalid imports may be matched, but I'm working under the assumption your trying to match valid code. If not, this will need to be modified to validate the importer syntax as well.
P.S. To test your expression, try RegExr. You can hover over characters to check what they do.
sed 's:^#import \(.*[+].*\):\1:' FILE
will display
"NSString+MultilineFontSize.h"
for your sample.