Regex - Match URLs except an isolated case - regex

I have a regex pattern which was made to match attempts of URLs advertisement.
[a-zA-Z0-9\-\.]+\s?
(\.|\(\.\)|dot|\(dot\)|-|;|:|,)\s
(com|org|net|cz|co|uk|sk|biz|mobi|xxx|eu|me)\b
I also made this formula detect attempts of outsmarting the protection such as:
www-google-com or google-com (using '-' instead of '.')
The Problem
I got reported that, in the Portuguese language, words like
"ganhou-me" or "fugiu-me"
are valid and still getting caught by the protection. The hyphen is used together with "me" domain and causing the confusion.
I'm trying to find a way to exclude that particular case from the expression but:
Still be able to detect attempts like: google.me or google;me
But ignore attempts like: google-me or ganhou-me
I thought about removing the "me" from the main expression and add a disjunction that included that particular case, but that sounds like the hardest solution?

If you want all -me addresses to not be matched and your language supports negative Look-behind you could use [a-zA-Z0-9\-\.]+\s?(\.|\(\.\)|dot|\(dot\)|-|;|:|,)\s?(com|org|net|cz|co|uk|sk|biz|mobi|xxx|eu|(?<!-)me)\b or here is a Look-ahead version [a-zA-Z0-9\-\.]+\s?(\.|\(\.\)|dot|\(dot\)|-(?!me)|;|:|,)\s?(com|org|net|cz|co|uk|sk|biz|mobi|xxx|eu|me)\b.
This works by using (?<!-) to check there is a - before 'me' when matching in the first one or uses this -(?!me) to check there is an 'me' after the - in the second one.
Here is it working in a java-script example. Note- I used the second version as java-script does not support Look-behind.
var value = "www.google.com www.google;me www.google-me";
var matches = value.match(
new RegExp("[a-zA-Z0-9\\-\\.]+\s?(\\.|\\(\\.\\)|dot|\\(dot\\)|-(?!me)|;|:|,)\\s?(com|org|net|cz|co|uk|sk|biz|mobi|xxx|eu|me)\\b", "g")
);
document.writeln(matches);
Of course it might be better to use a white list (suggested in comments above) as this is very general.

Related

Why /^[a-zA-Z0-9]+#[a-zA-Z0-9]\.(com)|(edu)|(org)$/i does not work as expected

I have this regex for email validation (assume only x#y.com, abc#defghi.org, something#anotherhting.edu are valid)
/^[a-zA-Z0-9]+#[a-zA-Z0-9]\.(com)|(edu)|(org)$/i
But #abc.edu and abc#xyz.eduorg are both valid as to the regex above. Can anyone explain why that is?
My approach:
there should be at least one character or number before #
then there comes #
there should be at least one character or number after # and before .
the string should end with either edu, com, or org.
Try this
/^[a-zA-Z0-9]+#[a-zA-Z0-9]+\.(com|edu|org)$/i
and it should become clear - you need to group those alternatives, otherwise you can match any string that has 'edu' in it, or any string that ends with org. To put it another way, your version matches any of these patterns
^[a-zA-Z0-9]+#[a-zA-Z0-9]\.(com)
(edu)
(org)$
It's worth pointing out that the original poster is using this as a regex learning exercise. This would be a terrible regex for actual production use! It's a thorny problem - see Using a regular expression to validate an email address for a lot more depth.
Your grouping parentheses are incorrect:
/^[a-zA-Z0-9]+#[a-zA-Z0-9]+\.(com|edu|org)$/i
Can also just use one case as you're using the i modifier:
/^[a-z0-9]+#[a-z0-9]+\.(com|edu|org)$/i
N.B. you were also missing a + from the second set, I assume this was just a typo...
What you have written is the equivalent of matching something that:
Begins with [a-zA-Z0-9]+#[a-zA-Z0-9].com
contains edu
or ends with org
What you were looking for was:
/^[a-z0-9]+#[a-z0-9]+\.(com|edu|org)$/i
Your regex looks ok.
I guess you are looking using a find function in stead of a match function
Without specifying what you use it is a bit difficult, but in Python you would write
import re
pattern = re.compile ('^[a-zA-Z0-9]+#[a-zA-Z0-9]\.(com)|(edu)|(org)$')
re.match('#abc.edu') # fails, use this to validate an input
re.search('#abc.edu') # matches, finds the edu
Try to use it:
[a-zA-Z0-9]+#[a-zA-Z0-9]+.(com|edu|org)+$
U forget about + modificator if u want to catch any combinations of (com|edu|org)
Upd: as i see second [a-zA-Z0-9] u missed + too

RegEx: Match Mr. Ms. etc in a "Title" Database field

I need to build a RegEx expression which gets its text strings from the Title field of my Database. I.e. the complete strings being searched are: Mr. or Ms. or Dr. or Sr. etc.
Unfortunately this field was a free field and anything could be written into it. e.g.: M. ; A ; CFO etc.
The expression needs to match on everything except: Mr. ; Ms. ; Dr. ; Sr. (NOTE: The list is a bit longer but for simplicity I keep it short.)
WHAT I HAVE TRIED SO FAR:
This is what I am using successfully on on another field:
^(?!(VIP)$).* (This will match every string except "VIP")
I rewrote that expression to look like this:
^(?!(Mr.|Ms.|Dr.|Sr.)$).*
Unfortunately this did not work. I assume this is because because of the "." (dot) is a reserved symbol in RegEx and needs special handling.
I also tried:
^(?!(Mr\.|Ms\.|Dr\.|Sr\.)$).*
But no luck as well.
I looked around in the forum and tested some other solutions but could not find any which works for me.
I would like to know how I can build my formula to search the complete (short) string and matches everything except "Mr." etc. Any help is appreciated!
Note: My Question might seem unusual and seems to have many open ends and possible errors. However the rest of my application is handling those open ends. Please trust me with this.
If you want your string simply to not start with one of those prefixes, then do this:
^(?!([MDS]r|Ms)\.).*$
The above simply ensures that the beginning of the string (^) is not followed by one of your listed prefixes. (You shouldn't even need the .*$ but this is in case you're using some engine that requires a complete match.)
If you want your string to not have those prefixes anywhere, then do:
^(.(?!([MDS]r|Ms)\.))*$
The above ensures that every character (.) is not followed by one of your listed prefixes, to the end (so the $ is necessary in this one).
I just read that your list of prefixes may be longer, so let me expand for you to add:
^(.(?!(Mr|Ms|Dr|Sr)\.))*$
You say entirely of the prefixes? Then just do this:
^(?!Mr|Ms|Dr|Sr)\.$
And if you want to make the dot conditional:
^(?!Mr|Ms|Dr|Sr)\.?$
^
Through this | we can define any number prefix pattern which we gonna match with string.
var pattern = /^(Mrs.|Mr.|Ms.|Dr.|Er.).?[A-z]$/;
var str = "Mrs.Panchal";
console.log(str.match(pattern));
this may do it
/(?!.*?(?:^|\W)(?:(?:Dr|Mr|Mrs|Ms|Sr|Jr)\.?|Miss|Phd|\+|&)(?:\W|$))^.*$/i
from that page I mentioned
Rather than trying to construct a regex that matches anything except Mr., Ms., etc., it would be easier (if your application allows it) to write a regex that matches only those strings:
/^(Mr|Ms|Dr|Sr)\.$/
and just swap the logic for handling matching vs non-matching strings.
re.sub(r'^([MmDdSs][RSrs]{1,2}|[Mm]iss)\.{0,1} ','',name)

Regex for URL routing - match alphanumeric and dashes except words in this list

I'm using CodeIgniter to write an app where a user will be allowed to register an account and is assigned a URL (URL slug) of their choosing (ex. domain.com/user-name). CodeIgniter has a URL routing feature that allows the utilization of regular expressions (link).
User's are only allowed to register URL's that contain alphanumeric characters, dashes (-), and under scores (_). This is the regex I'm using to verify the validity of the URL slug: ^[A-Za-z0-9][A-Za-z0-9_-]{2,254}$
I am using the url routing feature to route a few url's to features on my site (ex. /home -> /pages/index, /activity -> /user/activity) so those particular URL's obviously cannot be registered by a user.
I'm largely inexperienced with regular expressions but have attempted to write an expression that would match any URL slugs with alphanumerics/dash/underscore except if they are any of the following:
default_controller
404_override
home
activity
Here is the code I'm using to try to match the words with that specific criteria:
$route['(?!default_controller|404_override|home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254}'] = 'view/slug/$1';
but it isn't routing properly. Can someone help? (side question: is it necessary to have ^ or $ in the regex when trying to match with URL's?)
Alright, let's pick this apart.
Ignore CodeIgniter's reserved routes.
The default_controller and 404_override portions of your route are unnecessary. Routes are compared to the requested URI to see if there's a match. It is highly unlikely that those two items will ever be in your URI, since they are special reserved routes for CodeIgniter. So let's forget about them.
$route['(?!home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254}'] = 'view/slug/$1';
Capture everything!
With regular expressions, a group is created using parentheses (). This group can then be retrieved with a back reference - in our case, the $1, $2, etc. located in the second part of the route. You only had a group around the first set of items you were trying to exclude, so it would not properly capture the entire wild card. You found this out yourself already, and added a group around the entire item (good!).
$route['((?!home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254})'] = 'view/slug/$1';
Look-ahead?!
On that subject, the first group around home|activity is not actually a traditional group, due to the use of ?! at the beginning. This is called a negative look-ahead, and it's a complicated regular expression feature. And it's being used incorrectly:
Negative lookahead is indispensable if you want to match something not followed by something else.
There's a LOT more I could go into with this, but basically we don't really want or need it in the first place, so I'll let you explore if you'd like.
In order to make your life easier, I'd suggest separating the home, activity, and other existing controllers in the routes. CodeIgniter will look through the list of routes from top to bottom, and once something matches, it stops checking. So if you specify your existing controllers before the wild card, they will match, and your wild card regular expression can be greatly simplified.
$route['home'] = 'pages';
$route['activity'] = 'user/activity';
$route['([A-Za-z0-9][A-Za-z0-9_-]{2,254})'] = 'view/slug/$1';
Remember to list your routes in order from most specific to least. Wild card matches are less specific than exact matches (like home and activity), so they should come after (below).
Now, that's all the complicated stuff. A little more FYI.
Remember that dashes - have a special meaning when in between [] brackets. You should escape them if you want to match a literal dash.
$route['([A-Za-z0-9][A-Za-z0-9_\-]{2,254})'] = 'view/slug/$1';
Note that your character repetition min/max {2,254} only applies to the second set of characters, so your user names must be 3 characters at minimum, and 255 at maximum. Just an FYI if you didn't realize that already.
I saw your own answer to this problem, and it's just ugly. Sorry. The ^ and $ symbols are used improperly throughout the lookahead (which still shouldn't be there in the first place). It may "work" for a few use cases that you're testing it with, but it will just give you problems and headaches in the future.
Hopefully now you know more about regular expressions and how they're matched in the routing process.
And to answer your question, no, you should not use ^ and $ at the beginning and end of your regex -- CodeIgniter will add that for you.
Use the 404, Luke...
At this point your routes are improved and should be functional. I will throw it out there, though, that you might want to consider using the controller/method defined as the 404_override to handle your wild cards. The main benefit of this is that you don't need ANY routes to direct a wild card, or to prevent your wild card from goofing up existing controllers. You only need:
$route['404_override'] = 'view/slug';
Then, your View::slug() method would check the URI, and see if it's a valid pattern, then check if it exists as a user (same as your slug method does now, no doubt). If it does, then you're good to go. If it doesn't, then you throw a 404 error.
It may not seem that graceful, but it works great. Give it a shot if it sounds better for you.
I'm not familiar with codeIgniter specifically, but most frameworks routing operate based on precedence. In other words, the default controller, 404, etc routes should be defined first. Then you can simplify your regex to only match the slugs.
Ok answering my own question
I've seem to come up with a different expression that works:
$route['(^(?!default_controller$|404_override$|home$|activity$)[A-Za-z0-9][A-Za-z0-9_-]{2,254}$)'] = 'view/slug/$1';
I added parenthesis around the whole expression (I think that's what CodeIgniter matches with $1 on the right) and added a start of line identifier: ^ and a bunch of end of line identifiers: $
Hope this helps someone who may run into this problem later.

regex: trying to improve this regex

I am using this regex :
[']?[%]?[^"]#([^#]*)#[%]?[']?
on this text:
insert into table (id,name,age) values ('#var1#' ,#var2#,'#var3#', 3, 'name') where id = '#id#' like ""
and test=<cfqueryparam value="#id#">
For some reason it is catching the comma between #var2# and '#var3#'
but when I include a [^,] it starts doing weird stuff.
Can someone help me with this one.
As I read my regex now, it should find anything that:
might have a single quote
might have a percentage
doesn't have a double quote
then has a hash (#)
followed by no hash, but all other characters
then has a hash and followed by a percentage or quote
So why, when I add "no comma" in front does the regex break??
Updated Question:
okay, Ill try to explain: a query can look like this:
SELECT e.*, m.man_id, m.man_title, c.cat_id, c.cat_name
FROM ec_products e, ec_categories c, ec_manufacturers m
WHERE c.cat_id = e.prod_category AND
e.prod_manufacturer = m.man_id AND
e.prod_title LIKE <cfqueryparam value="%#attributes.keyword#%"> and
test='#var1#'
ORDER BY e.prod_title
Now I want every value between ##, but not the values that are surrounded by a queryparam tag. So in the example I do want #var1# but not #attributes.keyword#. Reason for this is that all params in the query that are not surrounded by a tag are unsafe and can cause SQL injection. My current regex is
(?!")'?%?#(?!\d)[\w.\(\)]+#%?'?(?!")
and it is almost there. It does find the attributes.keyword because of the %. I just want anything that that has ## but not surrounded by double quotes, so not "##". This will give me all unsafe params in the sql, like '#var#', or #aNumber#, or '%##', or '%##%', or '##%, but NOT things like
<cfqueryparam value="#variable#">
. I hope you understand my intentions?
I think you might be misunderstanding [^"]. It doesn't mean "doesn't have a double quote", but rather means, "one character, which is not a double-quote". Similarly, [^,] means "one character, which is not a comma". So your regex:
[']?[%]?[^"]#([^#]*)#[%]?[']?
will match — for example — this:
2#,'#
which consists of zero single-quotes, zero percent-signs, one character-which-is-not-a-double-quote (namely 2), one hash-sign, two characters-which-are-not-hash-signs (namely ,'), one hash-sign, zero percent-signs, and zero apostrophes. The ,' is what will be captured by the parentheses.
Update for updated question:
I don't think that what you describe is possible using just a ColdFusion regex, because it would require "lookbehind" (to ensure that something is not preceded by a double-quote), which ColdFusion regexes apparently (according to a Google-search) do not support. However:
This StackOverflow answer gives a way of using Java regexes in ColdFusion. If you use that technique, then you can use the Java regex '?%?(?<!")(?<!"')(?<!"%)(?<!"'%)#(?!\d)[\w.()]+#(?!%?'?")%?'? to ensure that there's no preceding double-quote.
You never mentioned how you're actually using this regex. Would it work for you to match .'?%?#(?!\d)[\w.()]+#%?'?(?!") (i.e., to match not just the section of interest, but also the preceding character), and then separately confirm that the matched substring doesn't start with a double-quote?
I also feel compelled to mention, since it sounds like you're trying to use regex-based pattern-matching to help detect and address points of possible SQL injection, that this is a bad idea; you will never be able to do this perfectly, so if anything, I think it will end up increasing your risk of SQL injection (by increasing your reliance on a buggy methodology).
Preserving your capture group from the initial regex, here is a revised expression.
'?%?(?!")#([^#]+)#%?'?
Based on the information you provided this should be correct.
'?%?(?!")#[^#]+#%?'?

Looking for a regex to match more than one reference string in TortoiseSVN

We used two different methods to reference external documents and Bugzilla bug numbers.
I'm now looking for a regular expression that matches these two possibilities of reference strings for convenient display and linking in the TortoiseSVN 1.6.16 log screen. First should be a bugzilla entry of the form [BZ#123], second is [some text and numbers], which has not to be converted into a url.
This can be matched with
\[BZ#\d+\]
and
\[.*?\]
My problem now is to concatenate those two match strings together. Usually this would be done by the regex (first|second), and I've done it this way:
(\[.*?\]|\[BZ#\d+\])
Unfortunately in this case TortoiseSVN seems to catch it all as the bug number because of the round braces. Even if I add a second expression which (according to the documentation) is meant to be used to extract the issue number itself, this second expression is supposed to be ignored:
(\[.*?\]|\[BZ#\d+\])
\[BZ#(\d+)\]
In this case TortoiseSVN displays the bug and document references correctly in the separate column, but uses them completely for the bugtracker url, which is of course not working:
https://mybugzillaserver/show_bug.cgi?id=[BZ#949]
BTW, Mercurial uses a better way by using {1}, {2}, ... as the placeholder in URLs.
Has anybody an idea how to solve this problem?
EDIT
In short: We have used [BZ#123] as bug number references and [anytext] as references to other (partly non-electronic) documents. We would like to have both patterns listed in TortoiseSVN's extra column, but only the bug number from the first part shpuld be used as %BUGID% in the URL string.
EDIT 2
Supposedly TortoiseSVN cannot handle nested regex groups (round braces), so this question doesn't have any satisfactory answer at the moment.
I'm not familiar with TortoiseSVN regex, but what it looked like the problem was that the first piece of the regex ([.*?\]) would always match, so you would never even get to the part evaluating the second part, \[BZ#(\d+)\]
Try this one instead:
((?<=\[BZ#)\d+(?=\])|\[.*?\])
Explanation:
( #Opening group.
(?<=\[BZ#) #Look behind for a bugzilla placeholder.
\d+ #Capture just the digits.
(?=\]) #Look ahead for the closing bracket (probably not necessary.)
| #Or, if that fails,
\[.*?\] #Find all other placeholders.
) #Closing the group.
Edit: I've just looked at TortoiseSVN docs. You could also try to keep the Message part expression the same, but change the Bug-ID expression to:
(?<=\[BZ#)(\d+)(?=\])
Edit: ?<= represents a zero-width lookbehind. See http://www.regular-expressions.info/lookaround.html. It is possible that TortoiseSVN doesn't support lookbehinds.
What happens if you just use (\d+) for your Bug-ID expression?