regex best practice? - regex

Today I got an email from my boss saying to change the regex in our java script code that goes onto our client's website from
[a-zA-Z0-9]+[a-zA-Z0-9_\.\-]
to
[a-zA-Z0-9]+[a-zA-Z0-9_\-\.]
because one of our clients were complaining that it wasn't regex best practices and it's causing problems with their CMS and their DB.
Looking at those two regexes, It appears to me they match the exact same thing.
the . and the - are swapped at the end, but that shouldn't make a difference. Should it?
Am I missing something?
The developer from our client's company is really adamant about us changing it.
Can someone shed some light?
Thanks!

There is no functional difference.
If anything is having issues with that regex, then it is a non-standard/buggy implementation. I recommend finding out exactly what the problem is.
While I see no reason to change it, I see no reason not to change it, so do what you wish.
Tip: I'm guessing the regex is written wrong. If I know what it is supposed to mean, I would write it:
[a-zA-Z0-9]+[_\.\-]?

If you use a - in a character group, it goes last otherwise it denotes a range of characters, like A-Z. If you're escaping it, like you are, then it can be anywhere.
It's possible the CMS or other code they use un-escapes the regex, so in this case it will throw errors if the - isn't the last character in the group. I would say that having as few escaped characters in a regular expression as possible makes it easier to read, but that's from a personal perspective.

Related

Improve regex that works

I'm not a regex expert, so please be nice :-)
I created this regex to verify if a user submitted a day of the week (in italian language):
/((lun|mart|giov)e|mercol(e?)|vener)d(ì|i('?)|í)|sabato|domenica/
This regex perfectly works and it matches the following:
lunedi
lunedì
lunedí
lunedi’
martedi
martedì
martedí
martedi'
mercoledi
mercoledì
mercoledí
mercoledi'
mercoldi
mercoldì
mercoldí
mercoldi'
giovedi
giovedì
giovedí
giovedi'
venerdi
venerdì
venerdí
venerdi'
sabato
domenica
Now consider the first part of the regex and focus on venerdì: as you can see, I added an OR (|) just to manage the venerdì day, just because of the presence of that “r”.
Anything works just fine but I’m here to ask if is there any way to start the regex this way:
(lun|mar|giov|ven)e
and then manage that “r” some way.
I red about backrefences and conditionals but I’m not sure they can be of any help.
My idea is something like: “if the first group captured ‘ven’, than add “r” to the “e” right after the end of the group.
Is this possible?
Don't "golf" your regex. If you want to improve it at all, make it more readable. While it it certainly worthwile to use different cases for the different "i" variants, everything else should IMHO be kept as simple as possible.
How about something like this?
(lune|marte|mercole?|giove|vener)d(ì|i'?|í)|sabato|domenica
Don't use backreferences and other advanced features if you don't need them, just to make your regex a few chars shorter. Even if you would still understand what it means, think about your fellow co-developers -- or just yourself two months from now.
I just removed a few redundant (...) and the "shared e" part. Note how (besides the (...)) it is the same length, whether you use (lun|mart|giov)e or lune|marte|giove, but the latter is arguably more readable. Similarly, a backreference or some conditional would likely make your regex longer instead of shorter -- and considerably more complicated.

Monitoring bad links with RegEx in Google Analytics

How do I optimize this to find all links ending in weird typos, yet still exclude correct links (ending with .html) from the results?
htmll$|hhtml$|httml$|htmml$|htmll$|btml$|hml$|htl$
Thanks in advance!
Wow, that's some pretty restrictive regex rules but that kinda makes it interesting.
since we have no character negation but we do have character classes we could do:
[a-gi-z]tml$|h[a-su-z]ml|ht[a-ln-z]l|htm[a-km-z]
for my second suggestion and:
h.+tml|ht.+ml|htm.+l|html.+
to replace the first option leading to a total of:
[a-gi-z]tml$|h[a-su-z]ml|ht[a-ln-z]l|htm[a-km-z]|h.+tml|ht.+ml|htm.+l|html.+
EDIT: Having noticed that the .+'s can catch things we don't want this should be changed slightly.
(.*[a-gi-z]tml|h.*[a-su-z]ml|ht.*[a-ln-z]l|htm.*[a-km-z])$

Regex help: Identifying websites in text

I am trying to write a function which removes websites from a piece of text. I have:
removeWebsites<- function(text){
text = gsub("(http://|https://|www.)[[:alnum:]~!#$%&+-=?,:/;._]*",'',text)
return(text)
}
This handles a large set of the problem, but not a popular one, i.e something of the form xyz.com
I do not wish to add .com at the end of the above regex, as it limits the scope of that regex. However I tried writing some more regexex like:
gsub("[[:alnum:]~!#$%&+-=?,:/;._]*.com",'',testset[10])
This worked, but it also modified email ids of the form abc#xyz.com to abc#. I don't want this, so I modified it to
gsub("*((^#)[[:alnum:]~!#$%&+-=?,:/;._]*).com",'\\1',testset[10])
This left the email ids alone but stopped recognising websites of the form xyz.com
I understand that I need some sort of a set difference here, of the form of what was explained here but I was not able to implement it (mainly because I was not able to completely understand it). Any idea on how I go about solving my problem?
Edit: I tried negative lookaheads:
gsub("[[:alnum:]~!#$%&+-=?,:/;._](?!#)[^(?!.*#)]*.com",'',testset[10])
I got a 'invalid regex' error. I believe a little help in correcting may get this to work...
I can't believe it. There actually is a simple solution to it.
gsub(" ([[:alnum:]~!#$%&+-=?,:/;._]+)((.com)|(.net)|(.org)|(.info))",' ',text)
This works by:
Start with a space.
Put all sorts of things, except an '#' in.
end with a .com/net/org/info/
Please do look into breaking it! I'm sure there will be cases that will break this as well.
your lookarounds look a bit funny to me: you cant look behind inside a character class and why are you looking ahead? A look behind is imho more appropriate.
I think the following expression should work, although i didn't test it:
gsub("*((?<!#)[[:alnum:]~!#$%&+-=?,:/;._]*).com",'\\1',testset[10])
also note that lookbehinds must have a fixed length, so no multipliers are allowed

In what ways can I improve this regular expression?

I have written this regex that works, but honestly, it’s like 75% guesswork.
The goal is this: I have lots of imports in Xcode, like so:
#import <UIKit/UIKit.h>
#import "NSString+MultilineFontSize.h"
and I only want to return the categories that contain +. There are also lots of lines of code throughout the source which include + in other contexts.
Right now, this returns all of the proper lines throughout the Xcode project. But if there is one thing I’ve learned from googling and searching Stack Overflow for regex tutorials, it is that there are LOTS of different ways to do things. I’d love to see all of the different ways you guys can come up with that make it either more efficient or more bulletproof regarding potential spoofs or misses.
^\#import+.[\"]*+.(?:(?!\+).)*+.*[\"]
Thanks in advance for all of your help.
Update
Also I suppose I’ll accept the answer of whoever does this with the shortest string, without missing any possible spoofs. But again, thanks to everyone who participates in this learning experience.
Resources from answers
This is an awesome resource for practicing regex from Dan Rasmussen: RegExr
The first thing I notice is that your + characters are misplaced: t+. matches t one or more times, followed by a single character .. I'm assuming you wanted to match the end of import, followed by one or more of any character: import.+
Secondly, # doesn't need to be escaped.
Here's what I came up with: ^#import\s+(.*\+.*)$
\s+ matches one or more whitespace character, so you're guaranteed that the line actually starts with #import and not #importbutnotreally or anything else.
I'm not familiar with xcode syntax, but the following part of the expression, (.*\+.*), simply matches any string with a + character somewhere in it. This means invalid imports may be matched, but I'm working under the assumption your trying to match valid code. If not, this will need to be modified to validate the importer syntax as well.
P.S. To test your expression, try RegExr. You can hover over characters to check what they do.
sed 's:^#import \(.*[+].*\):\1:' FILE
will display
"NSString+MultilineFontSize.h"
for your sample.

Create a valid CSV with regular expressions

I have a horribly formated, tab delimited, "CSV" that I'm trying to clean up.
I would like to quote all the fields; currently only some of them are. I'm trying to go through, tab by tab, and add quotes if necessary.
This RegEx will give me all the tabs.
\t
This RegEx will give me the tabs that do not END with a ".
\t(?!")
How do I get the tabs that do not start with a "?
Generally for these kinds of problems, if it's a one time occurrence, I will use Excels capabilities or other applications (SSIS? T-SQL?) to produce the desired output.
A general purpose regex will usually run into bizarre exceptions and getting it just right will often take longer and is prone to missed groups your regex didn't catch.
If this is going to happen regularly, try to fix the problem at the source and/or create a special utility program to do it.
Use negative lookbehind: (?<!")\t
For one shots like this I usually just write a little program to clean up the data, that way I also can add some validation to make sure it really has converted properly after the run. I have nothing against regex but often in my case it takes longer for me figure out the regex expression than writing a small program. :)
edit: come to think about it, the main motivator is that it is more fun - for me at least :)