Matching "Not a Word" in URL? - regex

I'm trying to catch:
http://anydomain/MYDIR/filename.aspx
but NOT
http://anydomain/subdir/MYDIR/filename.aspx
(essentially, the rule is to capture the first, and redirect to the second -- I've moved the files - and anyone bookmarking the old, I want them whisked away to the new).
Everything I've done is capturing both, and generates a fun redirect loop. Fun as in GRR, not fun as in YAY. Admittedly, I'm terrible at RegEx beyond the basics, for the 2 times a decade I need it, and have promptly forgotten everything. The closest I've gotten is something like this:
^.*!NEWDIR\/MYDIR\/filename\.aspx$
but it doesn't seem to validate. I believe it's my "grouping" of NEWDIR in the regex, is it thinking I'm only not'ing the N and EWDIR is supposed to be there? How do I get it to "not" NEWDIR entirely?

Try this one:
^.*(?<!NEWDIR)\/MYDIR\/filename\.aspx$
This uses a negative lookbehind. The goal here is to match a string that is not preceded by another string.
Here is a working example. For details on lookbehind check this page.
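For anyone who wants to try the lookbehind locally, here is a minimal Python sketch of the pattern above. The helper name needs_redirect and the sample URLs are only illustrative assumptions, and the \/ escapes are dropped because Python does not need them.

```python
import re

# Pattern from the answer above, without the optional "\/" escapes.
OLD_URL = re.compile(r"^.*(?<!NEWDIR)/MYDIR/filename\.aspx$")

def needs_redirect(url: str) -> bool:
    """Return True only for URLs that still point at the old location."""
    return OLD_URL.match(url) is not None

print(needs_redirect("http://anydomain/MYDIR/filename.aspx"))         # True
print(needs_redirect("http://anydomain/NEWDIR/MYDIR/filename.aspx"))  # False
```

Because the already-redirected URL no longer matches, the rule should not fire a second time, which is what breaks the loop.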

Related

Stripping 3 parentheses in regex

I run a small forum that has an issue with people using parentheses to bracket statements. They do it to signify they are talking about Jews. I guess it is called echoes or something. So they will put a name like (((Prominent Person))) like that in the middle of a conversation.
I have recently been trying to combat this without just banning people that can't behave. I have a decent word filter but it doesn't block that. I recently installed something that allows me to use regex to strip things out but I am having trouble finding the proper string that doesn't break everything else.
"/\W{3}(.*)\W{3}/","$1"
The first is the matching string, and after the comma comes the replacement. This string works: it strips the parentheses out and leaves everything else alone. The problem is that the string is too broad. It also strips out [ brackets, which breaks all of the bbcode in a post. Any post with at least 3 brackets will be broken after that.
I have been playing with different strings on regex101 but not finding the best solution. I need any time that ((( or ))) is seen to strip out those and replace it with nothing, like it never happened. It has to be exactly three and only ((( and not the other brackets it could trigger on.
Does anyone have a good solution?
\({3}(.*)\){3}
https://regex101.com/r/wD5TMb/1
So in your format probably: "/\({3}(.*)\){3}/","$1"
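A quick way to check the behaviour outside the forum software is a small Python sketch like the one below. The function name strip_echo and the sample post are made up for illustration; only the pattern comes from the answer.

```python
import re

# Exactly three "(" then anything, then exactly three ")".
TRIPLE_PARENS = re.compile(r"\({3}(.*)\){3}")

def strip_echo(text: str) -> str:
    """Drop the triple parentheses and keep what was inside them."""
    return TRIPLE_PARENS.sub(r"\1", text)

post = "hello (((Name))) and [b]bold[/b] (aside)"
print(strip_echo(post))  # hello Name and [b]bold[/b] (aside)
```

The bbcode brackets and the single parentheses are left alone. One caveat: with two bracketed names on the same line the greedy .* spans from the first ((( to the last ))), so a lazy .*? may be the safer variant there.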

select area within characters using regex (spaces are an issue)

Some other guy asked a similar question earlier which got a lot of down votes, and I was interested in solving it. I came to a similar issue and would like some help with it.
Take into consideration this wall of text:
__don't__ and __do it__
__yellow__
__green__ and __purple__
I would like to select all the area within the underscores __'s
I attempted the following regex:
/__[!-~]+__/g which worked great on most things. I would like to add the ability to have spaces within the underscores. __do it__ will not be encapsulated in the search because it includes a space which was ruled out by the regex. I attempted the following:
/__[ -~]+__/g
It didn't work as planned, and selected everything from the very first __ to the very last. I was wondering how to tell the regex it has reached the end of a search once it sees a space after a __.
Here is the regex you could play around with below:
http://regexr.com/39br7
I tried using __[^ ]/g at the end, but it didn't seem to help.
You could simply use the below regex,
__[^_]*__
DEMO
__(.*?)__
This seems to work. Look at the demo.
http://regex101.com/r/lJ1jB1/1
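Both suggestions can be compared side by side with a short Python sketch on the sample text from the question. The variable names are assumptions for illustration only.

```python
import re

text = "__don't__ and __do it__\n__yellow__\n__green__ and __purple__"

# Negated class: anything except an underscore between the markers.
with_markers = re.findall(r"__[^_]*__", text)

# Lazy quantifier: stop at the first closing "__"; the group drops the markers.
inner_only = re.findall(r"__(.*?)__", text)

print(with_markers)  # ["__don't__", '__do it__', '__yellow__', '__green__', '__purple__']
print(inner_only)    # ["don't", 'do it', 'yellow', 'green', 'purple']
```

The negated class cannot cross an underscore, so it also rejects text such as __snake_case__, while the lazy version accepts it. Which one fits depends on whether single underscores should be allowed inside.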

Regex help: Identifying websites in text

I am trying to write a function which removes websites from a piece of text. I have:
removeWebsites <- function(text){
  text = gsub("(http://|https://|www.)[[:alnum:]~!#$%&+-=?,:/;._]*",'',text)
  return(text)
}
This handles a large part of the problem, but not one popular form, i.e. something like xyz.com
I do not wish to add .com at the end of the above regex, as it limits the scope of that regex. However, I tried writing some more regexes like:
gsub("[[:alnum:]~!#$%&+-=?,:/;._]*.com",'',testset[10])
This worked, but it also modified email ids of the form abc#xyz.com to abc#. I don't want this, so I modified it to
gsub("*((^#)[[:alnum:]~!#$%&+-=?,:/;._]*).com",'\\1',testset[10])
This left the email ids alone but stopped recognising websites of the form xyz.com
I understand that I need some sort of a set difference here, of the form of what was explained here but I was not able to implement it (mainly because I was not able to completely understand it). Any idea on how I go about solving my problem?
Edit: I tried negative lookaheads:
gsub("[[:alnum:]~!#$%&+-=?,:/;._](?!#)[^(?!.*#)]*.com",'',testset[10])
I got an 'invalid regex' error. I believe a little help in correcting it may get this to work...
I can't believe it. There actually is a simple solution to it.
gsub(" ([[:alnum:]~!#$%&+-=?,:/;._]+)((.com)|(.net)|(.org)|(.info))",' ',text)
This works by:
Start with a space.
Put all sorts of things, except an '#' in.
end with a .com/net/org/info/
Please do look into breaking it! I'm sure there will be cases that will break this as well.
Your lookarounds look a bit funny to me: you can't look behind inside a character class, and why are you looking ahead? A lookbehind is, in my opinion, more appropriate.
I think the following expression should work, although I didn't test it:
gsub("*((?<!#)[[:alnum:]~!#$%&+-=?,:/;._]*).com",'\\1',testset[10])
Also note that lookbehinds must have a fixed length, so no multipliers are allowed inside them. In R, gsub also needs perl=TRUE before it accepts lookarounds at all.
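The lookbehind idea can be sketched in Python as follows. This is not the expression from either answer: the pattern, the name remove_bare_domains and the list of top-level domains are my own assumptions, and '#' is treated as the separator in the email-like ids exactly as written in the question.

```python
import re

BARE_DOMAIN = re.compile(
    r"(?<![#\w.])"                        # not right after '#', a word char or a dot
    r"[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)*"  # one or more dot-separated labels
    r"\.(?:com|net|org|info)\b"           # assumed list of endings
)

def remove_bare_domains(text: str) -> str:
    """Delete things like xyz.com but leave abc#xyz.com untouched."""
    return BARE_DOMAIN.sub("", text)

print(remove_bare_domains("visit xyz.com or mail abc#xyz.com today"))
# visit  or mail abc#xyz.com today
```

The lookbehind is a single fixed-width character class, so it stays within the fixed-length restriction. Excluding word characters and dots as well stops the engine from starting a match in the middle of the id.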

regex best practice?

Today I got an email from my boss saying to change the regex in our JavaScript code that goes onto our client's website from
[a-zA-Z0-9]+[a-zA-Z0-9_\.\-]
to
[a-zA-Z0-9]+[a-zA-Z0-9_\-\.]
because one of our clients was complaining that it didn't follow regex best practices and that it's causing problems with their CMS and their DB.
Looking at those two regexes, it appears to me they match the exact same thing.
The . and the - are swapped at the end, but that shouldn't make a difference. Should it?
Am I missing something?
The developer from our client's company is really adamant about us changing it.
Can someone shed some light?
Thanks!
There is no functional difference.
If anything is having issues with that regex, then it is a non-standard/buggy implementation. I recommend finding out exactly what the problem is.
While I see no reason to change it, I see no reason not to change it, so do what you wish.
Tip: I'm guessing the regex is written wrong. If I understand what it is supposed to mean, I would write it:
[a-zA-Z0-9]+[_\.\-]?
If you use a - in a character group, it goes last; otherwise it denotes a range of characters, like A-Z. If you're escaping it, like you are, then it can be anywhere.
It's possible the CMS or other code they use un-escapes the regex, so in this case it will throw errors if the - isn't the last character in the group. I would say that having as few escaped characters in a regular expression as possible makes it easier to read, but that's from a personal perspective.
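Both points can be checked with a small Python sketch. It uses Python's re module rather than the JavaScript engine or the client's CMS, so treat it as an approximation. The helper names and sample strings are assumptions, and the last two patterns are simply what the classes would look like if some tool stripped the backslashes.

```python
import re

ORIGINAL = r"[a-zA-Z0-9]+[a-zA-Z0-9_\.\-]"
SWAPPED = r"[a-zA-Z0-9]+[a-zA-Z0-9_\-\.]"

def first_match(pattern, text):
    """Return the first matched substring, or None if nothing matches."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

def compiles(pattern):
    """Report whether the pattern is accepted by the regex compiler."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

samples = ["user.name", "a-b", "x_", "!!", "abc"]
same = all(first_match(ORIGINAL, s) == first_match(SWAPPED, s) for s in samples)
print(same)                         # True: escaped, the order does not matter

print(compiles(r"[a-zA-Z0-9_.-]"))  # True: unescaped "-" last is a literal
print(compiles(r"[a-zA-Z0-9_-.]"))  # False: "_-." is read as a reversed range
```

So the order only starts to matter once the escapes are gone, and then the form with - at the very end is the one that survives.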

Regex pattern to format url

I have this pattern ^(?:http://)?(?:www.)?(.*?)/?(.*?)$ but it's still not perfect.
Let's say we have these urls to test against it:
example.com
example.com/
www.example.com/
http://example.com/
example.com/param
http://example.com/params/
The final output should be example.com/ if there's no parameters and example.com/params/ if with parameters. My problem is that it matches only second group. It doesn't look like /? is working otherwise it would stop on slash character. Is it possible to achieve what I want using only one pattern?
So you want the host name in $1? Your regex is ambiguous: there are many ways to match it, and the engine takes the first one it finds. Because your first group is lazy, it is happy to match nothing there and leave everything to the second group. If you don't want slashes in the first part, then say so. Explicitly:
(?:http://)?(?:www\.)?([^/]*)?/?(.*)?$
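Here is a minimal Python sketch of that "no slashes in the first part" idea, run against the URLs from the question. The pattern is slightly simplified from the one above (the redundant ? after each group is dropped), and the function name normalize is an assumption for illustration.

```python
import re

# Host is everything up to the first slash; the rest (if any) is the path.
URL = re.compile(r"^(?:http://)?(?:www\.)?([^/]*)/?(.*)$")

def normalize(url: str) -> str:
    """Return 'host/' or 'host/path' without scheme and leading 'www.'."""
    host, path = URL.match(url).groups()
    return f"{host}/{path}"

for url in ["example.com", "example.com/", "www.example.com/",
            "http://example.com/", "example.com/param",
            "http://example.com/params/"]:
    print(normalize(url))
# example.com/
# example.com/
# example.com/
# example.com/
# example.com/param
# example.com/params/
```

Note that a path without a trailing slash is returned as given. Adding one would be a separate step.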
One that I've used is:
((?:(?:https?://)?[\w\d:##%/;$()~_?\+\-=&]+|www|ftp)\.[\w\d:##%/;$()~_?\+\-=&\.]+)
The problem with URLs is that there are SO many ways one can be written, which is why the above code looks so congested. This will match all your examples above, but it will also match things like:
alkasi.jaias
Hopefully this will get you headed to where you need or want to go, and perhaps someone might be able to come up behind me and clean it up some (it's early morning, I'm getting ready for work, and am exhausted. :P)