Monitoring bad links with RegEx in Google Analytics

How do I optimize this to find all links ending in weird typos, yet still exclude correct links (ending with .html) from the results?
htmll$|hhtml$|httml$|htmml$|htmll$|btml$|hml$|htl$
Thanks in advance!

Wow, those are some pretty restrictive regex rules, but that kind of makes it interesting.
Since we have no character negation but we do have character classes, we could do:
[a-gi-z]tml$|h[a-su-z]ml|ht[a-ln-z]l|htm[a-km-z]
as my second suggestion (a wrong letter in one position), and:
h.+tml|ht.+ml|htm.+l|html.+
to replace the first option (extra characters inserted), leading to a total of:
[a-gi-z]tml$|h[a-su-z]ml|ht[a-ln-z]l|htm[a-km-z]|h.+tml|ht.+ml|htm.+l|html.+
EDIT: Having noticed that the .+'s can catch things we don't want, this should be changed slightly to:
(.*[a-gi-z]tml|h.*[a-su-z]ml|ht.*[a-ln-z]l|htm.*[a-km-z])$
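A quick way to sanity-check a pattern like this before putting it into a Google Analytics filter is to run it over a handful of sample paths. The sketch below uses Python's re module rather than Google Analytics itself, and the page paths are made up for illustration, so treat the results as indicative only.

```python
import re

# Final pattern from the answer above.
PATTERN = re.compile(
    r"(.*[a-gi-z]tml|h.*[a-su-z]ml|ht.*[a-ln-z]l|htm.*[a-km-z])$"
)

# Hypothetical page paths: one correct link plus the typos from the question.
SAMPLES = [
    "page.html", "page.htmll", "page.hhtml", "page.httml",
    "page.htmml", "page.btml", "page.hml", "page.htl",
]


def flagged(path):
    """Return True if the pattern would report this path as a bad link."""
    return PATTERN.search(path) is not None


for path in SAMPLES:
    print(path, "flagged" if flagged(path) else "not flagged")
```

With these sample paths the correct .html link is left alone, but hhtml, hml and htl are not flagged either, so further alternatives may be needed for those cases.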

Rewrite regex without negation

I wrote this regex to help me extract some links from some text files:
https?:\/\/(?:.(?!https?:\/\/))+$
Because I am using Go's regexp library, I'm not able to use it, due to the negative lookahead (?!..).
What I would like to do is select all the text from the last occurrence of http/https to the end.
sometextsometexhttp://websites.com/path/subpath/#query1sometexthttp://websites.com/path/subpath/#query2
=> Output: http://websites.com/path/subpath/#query2
Can anyone help me with a solution? I've spent several hours trying different ways of reproducing the same result, with no success.
Try this regex:
https?:[^:]*$
The lookaheads exist for a reason.
However, if you insist on a supposedly equivalent alternative, a general strategy you can use is:
(?!xyz)
is somewhat equivalent to:
$|[^x]|x(?:[^y]|$)|xy(?:[^z]|$)
With that said, hopefully I didn't make any mistakes:
https?:\/\/(?:$|(?:[^h]|$)|(?:h(?:[^t]|$))|(?:ht(?:[^t]|$))|(?:htt(?:[^p]|$))|(?:http(?:[^s:]|$))|(?:https?(?:[^:]|$))|(?:https?:(?:[^\/]|$))|(?:https?:\/(?:[^\/]|$)))*$
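One lookahead-free way to get the last URL is to let a greedy .* consume everything before it and capture the rest. The sketch below shows that idea in Python for brevity; the pattern uses no lookaround, so it should carry over to Go's regexp package, though that is untested here. The function name last_url is made up for this example.

```python
import re

# Greedy ".*" consumes as much as possible, so the capture group starts
# at the LAST "http://" or "https://" in the text. No lookahead is used.
# DOTALL lets "." cross line breaks when the input spans several lines.
LAST_URL = re.compile(r".*(https?://.*)$", re.DOTALL)


def last_url(text):
    """Return the text from the last http(s):// to the end, or None."""
    match = LAST_URL.match(text)
    return match.group(1) if match else None


sample = (
    "sometextsometexhttp://websites.com/path/subpath/#query1"
    "sometexthttp://websites.com/path/subpath/#query2"
)
print(last_url(sample))  # http://websites.com/path/subpath/#query2
```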

RegEx all URLs that do NOT contain a string

I seem to be having a bit of a brain fart atm. I've got Google counting my transitions correctly but I'm getting false positives.
This is the current goal RegEx which works great.
^/click/[0-9]+\.html\?.*
But I also want the RegEx to NOT count anything that has &confirm=1. I'm quite stuck as to how to do that in the RegEx. I thought I might be able to use [^(?:&confirm=1)], but I don't think that's valid.
Use the "exclude" filter option, not "include".
Try this:
^/click/[0-9]+\.html\?(?!.*\bconfirm=1).*
I changed it slightly so it will still exclude if confirm=1 is the first param (preceded by the ? rather than &).
I'm afraid you can't... I've tried doing this before. What I found was that you used to be able to do this with negative lookahead (see Rubens), but Google Analytics stopped supporting this at some point (source: http://productforums.google.com/forum/#!topic/analytics/3YnwXM0WYxE).
Maybe I'm a little late.
What about just writing:
[^(&confirm=1)]
?
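To see the difference between the negative lookahead and the bracket expression, the sketch below tries both in Python's re module, which does support lookahead; Google Analytics may not, as noted above. The sample URLs are made up for illustration.

```python
import re

# Pattern from the answer above: reject URLs whose query has confirm=1.
GOAL = re.compile(r"^/click/[0-9]+\.html\?(?!.*\bconfirm=1).*")

# A bracket expression is a set of single characters, not a negated string:
# this matches any ONE character other than ( & c o n f i r m = 1 ).
NOT_A_STRING = re.compile(r"[^(&confirm=1)]")


def counts_as_goal(url):
    """Return True if the URL matches the goal pattern."""
    return GOAL.match(url) is not None


print(counts_as_goal("/click/42.html?src=mail"))            # True
print(counts_as_goal("/click/42.html?src=mail&confirm=1"))  # False
print(counts_as_goal("/click/42.html?confirm=1"))           # False

# The character class still finds a match in a URL containing &confirm=1,
# because it only needs one character outside the set (here "/").
print(NOT_A_STRING.search("/click/42.html?x=2&confirm=1").group())  # /
```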

In what ways can I improve this regular expression?

I have written this regex that works, but honestly, it’s like 75% guesswork.
The goal is this: I have lots of imports in Xcode, like so:
#import <UIKit/UIKit.h>
#import "NSString+MultilineFontSize.h"
and I only want to return the categories that contain +. There are also lots of lines of code throughout the source which include + in other contexts.
Right now, this returns all of the proper lines throughout the Xcode project. But if there is one thing I’ve learned from googling and searching Stack Overflow for regex tutorials, it is that there are LOTS of different ways to do things. I’d love to see all of the different ways you guys can come up with that make it either more efficient or more bulletproof regarding potential spoofs or misses.
^\#import+.[\"]*+.(?:(?!\+).)*+.*[\"]
Thanks in advance for all of your help.
Update
Also I suppose I’ll accept the answer of whoever does this with the shortest string, without missing any possible spoofs. But again, thanks to everyone who participates in this learning experience.
Resources from answers
This is an awesome resource for practicing regex from Dan Rasmussen: RegExr
The first thing I notice is that your + characters are misplaced: t+. matches t one or more times, followed by any single character (.). I'm assuming you wanted to match the end of import, followed by one or more of any character: import.+
Secondly, # doesn't need to be escaped.
Here's what I came up with: ^#import\s+(.*\+.*)$
\s+ matches one or more whitespace characters, so you're guaranteed that the line actually starts with #import and not #importbutnotreally or anything else.
I'm not familiar with Xcode syntax, but the following part of the expression, (.*\+.*), simply matches any string with a + character somewhere in it. This means invalid imports may be matched, but I'm working under the assumption you're trying to match valid code. If not, this will need to be modified to validate the import syntax as well.
P.S. To test your expression, try RegExr. You can hover over characters to check what they do.
sed -n 's:^#import \(.*[+].*\):\1:p' FILE
will display
"NSString+MultilineFontSize.h"
for your sample.
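As a rough check of the ^#import\s+(.*\+.*)$ pattern, the sketch below runs it over a few lines in Python. The third sample line is an invented example of ordinary code that contains a +, and category_imports is a hypothetical helper name.

```python
import re

# Pattern from the answer above; MULTILINE lets ^ and $ work per line.
IMPORT_WITH_PLUS = re.compile(r"^#import\s+(.*\+.*)$", re.MULTILINE)

SOURCE = '''#import <UIKit/UIKit.h>
#import "NSString+MultilineFontSize.h"
int total = a + b;
'''


def category_imports(source):
    """Return the import targets that contain a + character."""
    return IMPORT_WITH_PLUS.findall(source)


print(category_imports(SOURCE))  # ['"NSString+MultilineFontSize.h"']
```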

regex best practice?

Today I got an email from my boss saying to change the regex in our JavaScript code that goes onto our client's website from
[a-zA-Z0-9]+[a-zA-Z0-9_\.\-]
to
[a-zA-Z0-9]+[a-zA-Z0-9_\-\.]
because one of our clients was complaining that it wasn't regex best practice and that it's causing problems with their CMS and their DB.
Looking at those two regexes, it appears to me they match the exact same thing.
The . and the - are swapped at the end, but that shouldn't make a difference. Should it?
Am I missing something?
The developer from our client's company is really adamant about us changing it.
Can someone shed some light?
Thanks!
There is no functional difference.
If anything is having issues with that regex, then it is a non-standard/buggy implementation. I recommend finding out exactly what the problem is.
While I see no reason to change it, I see no reason not to change it, so do what you wish.
Tip: I'm guessing the regex is written wrong. If I understand what it is supposed to mean, I would write it:
[a-zA-Z0-9]+[_\.\-]?
If you use a - in a character group, it goes last; otherwise it denotes a range of characters, like A-Z. If you're escaping it, like you are, then it can be anywhere.
It's possible the CMS or other code they use un-escapes the regex, so in this case it will throw errors if the - isn't the last character in the group. I would say that having as few escaped characters in a regular expression as possible makes it easier to read, but that's from a personal perspective.
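The points above can be checked mechanically. The sketch below, in Python rather than JavaScript, compares the two escaped character classes over all ASCII characters, then shows what happens if the backslashes are stripped. Regex engines differ in how they treat a misplaced hyphen, so the error shown is Python's behaviour and only an illustration of what a CMS might run into.

```python
import re

FIRST = re.compile(r"[a-zA-Z0-9_\.\-]")
SECOND = re.compile(r"[a-zA-Z0-9_\-\.]")

# With the hyphen escaped, both classes accept exactly the same characters.
ascii_chars = [chr(code) for code in range(128)]
same = all(
    bool(FIRST.fullmatch(ch)) == bool(SECOND.fullmatch(ch))
    for ch in ascii_chars
)
print(same)  # True


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Same classes with the backslashes removed, as un-escaping code might do.
print(compiles("[a-zA-Z0-9_.-]"))  # True: trailing hyphen is a literal
print(compiles("[a-zA-Z0-9_-.]"))  # False: "_-." is read as a bad range
```

In this check it is the second form, with the hyphen in the middle, that breaks once un-escaped, which fits the advice to keep the hyphen last.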

Create a valid CSV with regular expressions

I have a horribly formatted, tab-delimited "CSV" that I'm trying to clean up.
I would like to quote all the fields; currently only some of them are. I'm trying to go through, tab by tab, and add quotes if necessary.
This RegEx will give me all the tabs.
\t
This RegEx will give me the tabs that are not followed by a ".
\t(?!")
How do I get the tabs that are not preceded by a "?
Generally, for these kinds of problems, if it's a one-time occurrence, I will use Excel's capabilities or other applications (SSIS? T-SQL?) to produce the desired output.
A general-purpose regex will usually run into bizarre exceptions, and getting it just right often takes longer and is prone to missing cases your regex didn't anticipate.
If this is going to happen regularly, try to fix the problem at the source and/or create a special utility program to do it.
Use negative lookbehind: (?<!")\t
For one-shots like this I usually just write a little program to clean up the data; that way I can also add some validation to make sure it really has converted properly after the run. I have nothing against regex, but often in my case it takes longer for me to figure out the regex than to write a small program. :)
Edit: come to think of it, the main motivator is that it is more fun, for me at least :)
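For the "write a little program" route, a minimal sketch in Python might look like the following. It assumes the input is tab-delimited with optional double quotes around fields, and it relies on the standard csv module instead of a regex; the function name quote_all_fields and the sample rows are made up for this example.

```python
import csv
import io


def quote_all_fields(text):
    """Re-emit tab-delimited text with every field wrapped in quotes."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter="\t")
    out = io.StringIO(newline="")
    writer = csv.writer(
        out, delimiter="\t", quoting=csv.QUOTE_ALL, lineterminator="\n"
    )
    for row in reader:
        writer.writerow(row)
    return out.getvalue()


sample = 'id\t"full name"\tcity\n1\t"Ada Lovelace"\tLondon\n'
print(quote_all_fields(sample), end="")
```

Comparing the number of rows and fields before and after the conversion would be one way to add the validation mentioned above.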