Why is this line of regex capturing white spaces? - regex

I'm using the following line of regex which I found from this SO answer:
(?:[\w[a-z]-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.-]+[.??][a-z]{2,4}/)(?:[^\s()<>]+|(([^\s()<>]+|(([^\s()<>]+)))))+(?:(([^\s()<>]+|(([^\s()<>]+))))|[^\s`!()[]{};:'".,<>?«»“”‘’])
I am testing it on the following string:
"Quattro Amici in Concert Mar. 3, 2014. Long-time collaborators Lun Jiang, violin; Roberta Zalkind, viola; Pegsoon Whang, cello; and Karlyn Bond, piano, will perform works by Franz Joseph Haydn, Wolfgang Amadeus Mozart, Ludwig van Beethoven and Gabriel Faure. To purchase tickets visit westminstercollege.edu/culturalevents or call 801-832-2457. - See more at: http://entertainment.sltrib.com/events/view/quattro_amici_in_concert#sthash.QRsLXXiA.dpuf"
I'm simply attempting to extract URLs from strings, and based on a bunch of SO answers, regex seems to be the recommended tool for that job. I'm not a regex expert (or even intermediate in my understanding), so I'm baffled by the empty strings my re.findall() keeps returning. I've stepped through the regex using RegexBuddy and still no luck. Any help would be hugely appreciated.

I'm not sure that a big regex like that is entirely necessary - if you're just looking to get links, you could use a much simpler regex, like this:
/(https?:\/\/[\w\d\$-_\.\+!\*'\(\),\/#]+)/ig
According to RFC 1738, URLs are only allowed to use the characters specified in the class above, so it should cover any valid http or https URL without such a gigantic mess of a regex.
You can also use a tool like regexpal.com to validate regexes, which helps find issues. That said, I pasted your regex in there and it crashed Chrome, so it may not be a great help for a beast like that :)
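Since the question mentions re.findall(), here is a minimal Python sketch of the simpler approach. The pattern in it is my own simplified assumption (a scheme followed by anything that is not whitespace, a quote, or an angle bracket), not the class from the answer above, and it will not pick up scheme-less links such as westminstercollege.edu/culturalevents. The second half shows one likely source of the empty strings: when a pattern contains capturing groups, re.findall() returns the groups rather than the whole match, and an optional group that takes no part in the match comes back as ''.

```python
import re

text = (
    "To purchase tickets visit westminstercollege.edu/culturalevents "
    "or call 801-832-2457. - See more at: "
    "http://entertainment.sltrib.com/events/view/"
    "quattro_amici_in_concert#sthash.QRsLXXiA.dpuf"
)

# Hypothetical simplified pattern with no capturing groups, so findall
# returns the full matched text of each URL.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)

urls = URL_RE.findall(text)
print(urls)

# With a capturing group, findall returns the group, not the whole match.
# The optional group does not participate here, so the result is [''].
with_group = re.findall(r"(www\.)?example\.com", "visit example.com")

# Making the group non-capturing with (?:...) restores the full match.
without_group = re.findall(r"(?:www\.)?example\.com", "visit example.com")
print(with_group, without_group)
```

If the big pattern has to stay, turning every plain ( into (?: or switching to re.finditer() and reading match.group(0) should avoid the empty strings.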

Related

Regex to get all characters AFTER first comma

I'm trying to find a regex pattern to extract the names that appear in a string after the first comma (David Peter Richard below).
Example string:
PALMER, David Peter Richard
I came across this thread, which successfully extracts the name before the comma, but I need all the names after the comma.
I've tried to modify ^.*?(?=,), but I'm not having any joy. It needs to be a JavaScript regex, and capture groups are not supported on the platform I'm using (Bubble).
Any help appreciated, thanks a lot!
I tried this: (?<=,)[^,]+
Which seems to work on Desktop, however on a wrapped mobile app, it doesn't seem to work.
Similarly for the Name before, I was using ^[^,]+ and experiencing the same issue, but when I use the pattern in the ^.*?(?=,) it works fine.
So now I just need the pattern to be adjusted for the names after.
In JavaScript, I would suggest a simple string split:
var input = "PALMER, David Peter Richard";
var names = input.split(/,\s*/);
console.log(names);
Try this regex for the string after the comma:
(?<=,\s)(.*)$
Hope this helps.

Regular expression cannot match "</p>" correctly

Hi everyone.
I'm having some difficulty using regular expressions to grep text from HTML that contains
</p>
I'm using unsung hero.*</p> to grep the paragraph I'm interested in, but I cannot make the match stop at the next </p>
The command I use is:
egrep "unsung hero.*</p>" test
and in test is a webpage like:
<p>There are going to be outliers among us, people with extraordinary skill at recognizing faces. Some of them may end up as security officers or gregarious socialites or politicians. The rest of us are going to keep smiling awkwardly at office parties at people we\'re supposed to know. It\'s what happens when you stumble around in the 21st century with a mind that was designed in the Stone Age.</p>\n <p>(SOUNDBITE OF MUSIC)</p>\n <p>VEDANTAM: This week\'s show was produced by Chris Benderev and edited by Jenny Schmidt. Our supervising producer is Tara Boyle. Our team includes Renee Cohen, Parth Shah, Laura Kwerel, Thomas Lu and Angus Chen.</p>\n <p>Our unsung hero this week is Alexander Diaz, who troubleshoots technical problems whenever they arise and has the most unflappable, kind disposition in the face of whatever crisis we throw his way. Producers at NPR have taken to calling him Batman because he\'s constantly, silently, secretly saving the day. Thanks, Batman.</p>\n <p>If you like today\'s episode, please take a second to share it with a friend. We\'re always looking for new people to discover our show. I\'m Shankar Vedantam, and this is NPR.</p>\n <p>(SOUNDBITE OF MUSIC)</p>\n\n <p class="disclaimer">Copyright © 2019 NPR. All rights reserved. Visit our website terms of use and permissions pages at www.npr.org for further information.</p>\n\n <p class="disclaimer">NPR transcripts are created on a rush deadline by Verb8tm, Inc., an NPR contractor, and produced using a proprietary transcription process developed with NPR. This text may not be in its final form and may be updated or revised in the future. Accuracy and availability may vary. The authoritative record of NPR’s programming is the audio record.</p>\n</div><div class="share-tools share-tools--secondary" aria-label="Share tools">\n <ul>\n
I'm expecting the match to stop before
</p>\n <p>If you like
But it actually went way further than that.
I feel like the regular expression I used has an issue, but I don't know what it is. Any help will be appreciated.
Thanks!
20190523:
Thanks for your suggestions, guys.
I tried
egrep "unsung hero.*?</p>" test
But it didn't give me the result I want, instead it's like
Leo, I feel like this is a useful expression and I'd like to get it right. Could you explain a bit?
The other test I did, with
[^<]*
actually gave the expected result.
With .* the match will be greedy and match the longest substring possible. (Which is in your case until the last paragraph.)
What you actually want is a non-greedy match with .*?
Your specific command should most likely look like this:
grep -P -o "unsung hero.*?</p>" test
Another solution would be to extend your regex to the end of the string/webpage and then pick out the wanted substring with a group.
UPDATE
As Charles Duffy correctly pointed out, this will not work with the standard (POSIX ERE) syntax. Therefore the command above uses the -P flag to specify that it is a Perl regular expression.
If your system or application does not support Perl regular expressions and you are OK with matching until the first < (instead of matching until the first </p>), matching every character except < is the way to go.
With this, the complete command should look like this:
grep -o "unsung hero[^<]*</p>" test
Thanks to Charles for pointing that out in the comments.
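To see the three behaviours side by side, here is a small Python sketch (Python's re module supports the lazy .*? quantifier, like grep -P). The sample text is a shortened, made-up version of the page in the question, kept on one line with literal \n markers as in the original file.

```python
import re

# One line of text with literal "\n" markers, shortened from the question.
html = (
    r"<p>Our unsung hero this week is Alexander Diaz. Thanks, Batman.</p>\n "
    r"<p>If you like today's episode, please share it. This is NPR.</p>\n"
)

# Greedy: .* runs to the LAST </p> on the line.
greedy = re.search(r"unsung hero.*</p>", html).group(0)

# Lazy: .*? stops at the FIRST </p> after "unsung hero".
lazy = re.search(r"unsung hero.*?</p>", html).group(0)

# Negated class: [^<]* cannot cross any "<", so it also stops at the
# first tag; this form works in POSIX ERE as well.
negated = re.search(r"unsung hero[^<]*</p>", html).group(0)

print(greedy.endswith("This is NPR.</p>"))
print(lazy)
print(negated == lazy)
```

The negated-class form only behaves like the lazy one as long as there is no other tag (for example <b>) inside the paragraph.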

Regex to match everything."LettersNumbers"."extension" and forum searching tip

I need a regex to match my files named "something".Title"numberFrom1to99".mp4 in Windows' File Explorer. My first approach as a regex newbie was something like
"..mp4"
, but it didn't work, so I tried
"*.Title[1-9][0-9].mp4"
, which also did not work.
I would also like a tip on how to search for regex-related advice in the Stack Overflow archive and on the web, so that I can be specific without the regex itself interfering with the search bar.
Thank you!
EDIT
About the second part of the question: the question itself shows "..mp4", but I wrote "asterisk"."asterisk".mp4. Is there any universal way to write a regex on the web without it taking effect and without escaping the characters? (Escaping makes the backslash show up inside the regex, and that could be misunderstood.)
Try something like this:
(.*)\.[A-Za-z]+\d+\.mp4
See this Regex Demo to get an explanation of the regex.
Use regex101.com to test your regexes.
Here it is:
^[\s\S]*\.Title[1-9][0-9]?\.mp4$
I suggest regexr.com, which has many interesting regexes (Favourites tab) and a simple tutorial.
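As far as I know, the File Explorer search box takes wildcards rather than regular expressions, so one way to try the pattern above is to filter the names in a script instead. This is only a sketch in Python; the file names are made up for illustration, and re.fullmatch() stands in for the ^ and $ anchors.

```python
import re

# Same idea as the answer above: anything, then ".Title", then a number
# from 1 to 99 without a leading zero, then ".mp4".
PATTERN = re.compile(r".*\.Title[1-9][0-9]?\.mp4")

# Hypothetical file names used only for this example.
names = [
    "Holiday.Title1.mp4",
    "Holiday.Title42.mp4",
    "Holiday.Title0.mp4",    # 0 is outside 1..99
    "Holiday.Title100.mp4",  # three digits
    "Holiday.Title7.mkv",    # wrong extension
]

# fullmatch requires the whole name to match, like ^...$.
matches = [name for name in names if PATTERN.fullmatch(name)]
print(matches)
```

For a real folder, the list could come from os.listdir() on the directory in question.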

Is there a function to create a regex pattern from a string input?

I'm lousy at regular expressions but occasionally they're the only thing that's the right solution for a problem.
Is there something in the .NET framework that allows you to input an unencoded string and get a pattern from it? Which you could then modify as required?
e.g. I want to remove a CDATA section that contains a file from some XML but I can't work out what the right pattern is for <![CDATA[hugepileofrandombinarydataherethatalsoneedstogo]]> and I don't want to ask for help each time I'm stuck on a regex pattern.
Such tools exist; search for "regex generator".
But, as suggested in the comments, it is better to learn regex. Simple patterns are easy. Something like <!\[.*?]]>
in your case.
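A short sketch of both ideas, written in Python rather than .NET purely for illustration. The XML snippet is invented, and the pattern is a slightly stricter variant of the one above that spells out CDATA. re.escape() turns a literal string into a pattern that matches it exactly, which is the "string in, pattern out" step the question asks about; .NET has Regex.Escape for the same job.

```python
import re

# Invented sample document; the CDATA payload spans two lines.
xml = (
    "<doc><name>report</name>"
    "<file><![CDATA[hugepileofrandom\nbinarydataherethatalsoneedstogo]]></file>"
    "</doc>"
)

# Lazy match from "<![CDATA[" to the first "]]>"; DOTALL lets "." cross
# line breaks inside the payload.
CDATA_RE = re.compile(r"<!\[CDATA\[.*?\]\]>", re.DOTALL)

cleaned = CDATA_RE.sub("", xml)
print(cleaned)

# Escaping a literal gives a starting pattern that can then be edited.
literal = "<![CDATA["
escaped = re.escape(literal)
print(escaped)
```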
There are regex design tools like Expresso...
http://www.ultrapico.com/expresso.htm
It's not perfect, but as there is no suitable .NET component, the text-to-regex page at txt2re.com is the best I've seen for people who occasionally need to build a regex to match a string but don't have the time to relearn regex each time they want to use one.

Regex - match a string not contain a 'semi-word'

I tried to write a regex for this but failed.
I have 2 variables
PlayerInfo[playerid][pLevel]
and
Character[playerid]
and I want to catch only the second variable, I mean only the word that doesn't contain PlayerInfo but contains [playerid].
"(\S+)\[playerid\]" catches both words, and (\S+[^PlayerInfo])\[playerid\] skips some variables, namely those that contain p, l, a, y ...
I need to replace, in Notepad++, all variables like Text[playerid] with ExClass [playerid][Text]
A couple of plausible solutions:
Notepad++ has a plugin called Python Script. Running regex from there gives full regex functionality (the Python version, anyway) and a lot of powerful potential beyond that. I also use the online Python regex tester to help out.
The RegRexReplace plugin helps create regex plugins in Notepad++, so when you do hit a limitation, you find out a lot quicker.
Or of course fall back to your alternate editor (I'm assuming you have one?), or use this online regex tool, which is absolutely amazing. You can perform the action on the text online as well.
(I'd try to build a regex for you, but I'm a bit lost as to what you're looking for, unless Ivo Abeloos got it. If you're still coming up short, maybe post a code example along with the values displayed?)
Good luck!
It seems that Notepad++ has supported negative lookbehind since v6.
In notepad++ you could try to replace (.+)\[(.+)\] with ExClass\[\2\]\[\1\]
Try to use negative lookbehind.
(?<!PlayerInfo)\[playerid\]
EDIT: unfortunately Notepad++ does not support negative lookbehind.
I tried to make a workaround based on the following naive idea:
(.[^o]|[^f]o)[playerid]
But this expression does not work either. Notepad++ seems to fail on the alternation operator. Thus the answer is: it is impossible to do exactly what you want. Try to solve the problem another way or use an alternative tool.
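If the replacement is run outside the editor's own search box, for example through the Python Script plugin mentioned above, something like the following sketch might do. It uses a negative lookahead at a word boundary instead of the lookbehind, so that the variable name can be captured and reused. The sample lines are made up, and the output format ExClass[playerid][Name] (without a space) is my assumption about what is wanted.

```python
import re

# Made-up sample source lines.
source = (
    "PlayerInfo[playerid][pLevel] = 1;\n"
    "Character[playerid] = 2;\n"
    "Text[playerid] = 3;"
)

# \b            start of an identifier
# (?!PlayerInfo\b)  reject the identifier if it is exactly PlayerInfo
# (\w+)         capture the identifier
# \[playerid\]  it must be followed by the literal index [playerid]
PATTERN = re.compile(r"\b(?!PlayerInfo\b)(\w+)\[playerid\]")

# \1 in the replacement is the captured identifier.
result = PATTERN.sub(r"ExClass[playerid][\1]", source)
print(result)
```

The PlayerInfo line is left untouched, while Character[playerid] and Text[playerid] are rewritten.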