RegEx: Match Mr. Ms. etc in a "Title" Database field - regex

I need to build a RegEx expression which gets its text strings from the Title field of my Database. I.e. the complete strings being searched are: Mr. or Ms. or Dr. or Sr. etc.
Unfortunately this field was a free field and anything could be written into it. e.g.: M. ; A ; CFO etc.
The expression needs to match on everything except: Mr. ; Ms. ; Dr. ; Sr. (NOTE: The list is a bit longer but for simplicity I keep it short.)
WHAT I HAVE TRIED SO FAR:
This is what I am using successfully on another field:
^(?!(VIP)$).* (This will match every string except "VIP")
I rewrote that expression to look like this:
^(?!(Mr.|Ms.|Dr.|Sr.)$).*
Unfortunately this did not work. I assume this is because the "." (dot) is a reserved symbol in RegEx and needs special handling.
I also tried:
^(?!(Mr\.|Ms\.|Dr\.|Sr\.)$).*
But no luck either.
I looked around in the forum and tested some other solutions but could not find any which works for me.
I would like to know how I can build my expression so that it searches the complete (short) string and matches everything except "Mr." etc. Any help is appreciated!
Note: My Question might seem unusual and seems to have many open ends and possible errors. However the rest of my application is handling those open ends. Please trust me with this.

If you want your string simply to not start with one of those prefixes, then do this:
^(?!([MDS]r|Ms)\.).*$
The above simply ensures that the beginning of the string (^) is not followed by one of your listed prefixes. (You shouldn't even need the .*$ but this is in case you're using some engine that requires a complete match.)
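As a quick check of the prefix lookahead above, here is a small Python sketch. The pattern is copied from the answer; the sample strings are made up for illustration, and Python's re module is only an assumption about the engine (the question does not say which one is in use).

```python
import re

# Pattern copied from the answer; the sample strings are illustrative only.
not_prefixed = re.compile(r"^(?!([MDS]r|Ms)\.).*$")

samples = ["Mr. Smith", "Dr.", "CFO", "M.", "Mrs. Jones"]

# True means the pattern matched, i.e. the string does NOT start with a listed prefix.
results = {s: bool(not_prefixed.match(s)) for s in samples}
print(results)
```

Note that "Mrs. Jones" is accepted, because "Mrs." is not in the short list and the lookahead requires the dot directly after "Mr".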
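If you want your string to not have those prefixes anywhere, then do:
^((?!([MDS]r|Ms)\.).)*$
The above ensures that no position in the string, including the very first one, is the start of one of your listed prefixes, all the way to the end (so the $ is necessary in this one). The lookahead has to come before the dot; if it comes after, the first character is consumed before anything is checked and a string that starts with "Mr." slips through.
I just read that your list of prefixes may be longer, so here is the same pattern in a form that is easier to extend:
^((?!(Mr|Ms|Dr|Sr)\.).)*$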
You mean the string consists entirely of one of the prefixes? Then just do this:
^(?!(Mr|Ms|Dr|Sr)\.$).*$
And if you want to make the dot optional:
^(?!(Mr|Ms|Dr|Sr)\.?$).*$

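With | (alternation) we can list any number of prefix patterns to match against the string. The dots must be escaped, and the name needs a + so that more than one letter is allowed.
var pattern = /^(Mrs\.|Mr\.|Ms\.|Dr\.|Er\.)\s?[A-Za-z]+$/;
var str = "Mrs.Panchal";
console.log(str.match(pattern));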

this may do it
/(?!.*?(?:^|\W)(?:(?:Dr|Mr|Mrs|Ms|Sr|Jr)\.?|Miss|Phd|\+|&)(?:\W|$))^.*$/i
from that page I mentioned

Rather than trying to construct a regex that matches anything except Mr., Ms., etc., it would be easier (if your application allows it) to write a regex that matches only those strings:
/^(Mr|Ms|Dr|Sr)\.$/
and just swap the logic for handling matching vs non-matching strings.
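A minimal sketch of that "swap the logic" idea in Python, assuming the application can call a small helper instead of relying on a single negative pattern. The helper name is_unexpected_title is hypothetical, and the list of titles is the short one from the question.

```python
import re

# Positive pattern: exactly one of the allowed titles, nothing else.
ALLOWED_TITLE = re.compile(r"(Mr|Ms|Dr|Sr)\.")

def is_unexpected_title(value: str) -> bool:
    """Return True when the field is anything other than exactly Mr./Ms./Dr./Sr."""
    # fullmatch anchors at both ends, so no ^ or $ is needed in the pattern.
    return ALLOWED_TITLE.fullmatch(value) is None

samples = ["Mr.", "Dr.", "M.", "A", "CFO", "Mr. Smith"]
flagged = [s for s in samples if is_unexpected_title(s)]
print(flagged)
```

Extending the list then only means adding alternatives inside the group, with no lookahead to maintain.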

re.sub(r'^([MmDdSs][RSrs]{1,2}|[Mm]iss)\.{0,1} ','',name)

Related

What is the correct regex pattern to use to clean up Google links in Vim?

As you know, Google links can be pretty unwieldy:
https://www.google.com/search?q=some+search+here&source=hp&newwindow=1&ei=A_23ssOllsUx&oq=some+se....
I have MANY Google links saved that I would like to clean up to make them look like so:
https://www.google.com/search?q=some+search+here
The only issue is that I cannot figure out the correct regex pattern for Vim to do this.
I figure it must be something like this:
:%s/&source=[^&].*//
:%s/&source=[^&].*[^&]//
:%s/&source=.*[^&]//
But none of these are working; they start at &source, and replace until the end of the line.
Also, the search?q=some+search+here can appear anywhere after the .com/, so I cannot rely on it being in the same place every time.
So, what is the correct Vim regex pattern to use in order to clean up these links?
Your example can easily be dealt with by using a very simple pattern:
:%s/&.*
because you want to keep everything that comes before the second parameter, which is marked by the first & in the string.
But, if the q parameter can be anywhere in the query string, as in:
https://www.google.com/search?source=hp&newwindow=1&q=some+search+here&ei=A_23ssOllsUx&oq=some+se....
then no amount of capturing or whatnot will be enough to cover every possible case with a single pattern, let alone a readable one. At this point, scripting is really the only reasonable approach, preferably with a language that understands URLs.
--- EDIT ---
Hmm, scratch that. The following seems to work across the board:
:%s#^\(https://www.google.com/search?\)\(.*\)\(q=.\{-}\)&.*#\1\3
We use # as separator because of the many / in a typical URL.
We capture a first group, up to and including the ? that marks the beginning of the query string.
We capture a second group, whatever comes between the ? and the q parameter, but we never use it in the replacement.
We capture a third group, the q parameter, up to and excluding the next &.
We replace the whole thing with the first capture group followed by the third capture group.
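For comparison, the scripting route mentioned earlier in this answer could look roughly like the Python sketch below. It uses only the standard urllib.parse module; the helper name keep_only_q is hypothetical, and the sketch assumes that re-encoding the q value is acceptable (unusual characters may come out normalized).

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

def keep_only_q(url: str) -> str:
    """Drop every query parameter except q (hypothetical helper name)."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    if "q" not in params:
        return url  # no q parameter, so leave the link untouched
    # urlencode turns spaces back into "+", matching the original links.
    query = urlencode({"q": params["q"][0]})
    # Rebuild the URL with the reduced query and no fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))

link = "https://www.google.com/search?source=hp&newwindow=1&q=some+search+here&ei=A_23ssOllsUx"
print(keep_only_q(link))
```

Because the query string is parsed into named parameters, the position of q no longer matters, and a parameter such as oq cannot be mistaken for q.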

Get only first match in Regex

Given this string: hello"C07","73" (quotes included) I want to get "C07". I'm using (?:hello)|(?<=")(?<screen>[a-zA-Z0-9]+)?(?=") to try to do this. However, it consistently matches "73" as well. I've tried ...0-9]+){1}..., but that doesn't work either. I must be misunderstanding how this is supposed to work, but I can't figure out any other way.
How can I get just the first set of characters between quotes?
EDIT: Here's a link to show my problem.
EDIT: Ok, here's exactly what I'm trying to do:
Basically, what I'm trying to get is this: 1) a positive match on "hello", 2) a group named "screen" with, in this case, "C07" in it and 3) a group named "format" with, in this case, "73" in it.
Both the "C07" and "73" will vary. "hello" will always be the same. There may or may not be an extra comma between "hello" and the first double-quote.
For your initial question of how to stop after the first match: either removing the global flag or anchoring the search at the start of the string would accomplish that.
For the latter question you can name your groups and just keep extending the pattern throughout the line(s).
hello"(?<screen>[^"]+)","(?<format>[^"]+)"
Demo: http://regex101.com/r/PBXe8l/1
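The same idea as a Python sketch, in case that is the target language (an assumption; the question does not name one). Python spells named groups as (?P<name>...), and the ,? is an addition that covers the optional extra comma mentioned in the question.

```python
import re

# Named groups in Python syntax; ",?" allows the optional comma after hello.
pattern = re.compile(r'hello,?"(?P<screen>[^"]+)","(?P<format>[^"]+)"')

m = pattern.search('hello"C07","73"')
print(m.group("screen"), m.group("format"))

# Same pattern with the extra comma present (sample values are made up).
m_comma = pattern.search('hello,"A1","9"')
print(m_comma.groupdict())
```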
Based on your regex example, why not:
^(?:hello)"([a-zA-Z\d]+)"
Regex Demo

Why /^[a-zA-Z0-9]+#[a-zA-Z0-9]\.(com)|(edu)|(org)$/i does not work as expected

I have this regex for email validation (assume only x#y.com, abc#defghi.org, something#anotherhting.edu are valid)
/^[a-zA-Z0-9]+#[a-zA-Z0-9]\.(com)|(edu)|(org)$/i
But #abc.edu and abc#xyz.eduorg are both accepted by the regex above. Can anyone explain why that is?
My approach:
there should be at least one character or number before #
then there comes #
there should be at least one character or number after # and before .
the string should end with either edu, com, or org.
Try this
/^[a-zA-Z0-9]+#[a-zA-Z0-9]+\.(com|edu|org)$/i
and it should become clear - you need to group those alternatives, otherwise you can match any string that has 'edu' in it, or any string that ends with org. To put it another way, your version matches any of these patterns
^[a-zA-Z0-9]+#[a-zA-Z0-9]\.(com)
(edu)
(org)$
It's worth pointing out that the original poster is using this as a regex learning exercise. This would be a terrible regex for actual production use! It's a thorny problem - see Using a regular expression to validate an email address for a lot more depth.
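The precedence issue described above is easy to reproduce. This Python sketch compares the original pattern with the grouped one on the strings from the question; Python's re module is used only for illustration, and re.IGNORECASE stands in for the /i flag.

```python
import re

# Original pattern: the | splits the WHOLE expression into three alternatives.
loose = re.compile(r"^[a-zA-Z0-9]+#[a-zA-Z0-9]\.(com)|(edu)|(org)$", re.IGNORECASE)
# Grouped pattern: the alternation only covers the suffix.
strict = re.compile(r"^[a-zA-Z0-9]+#[a-zA-Z0-9]+\.(com|edu|org)$", re.IGNORECASE)

for candidate in ["x#y.com", "#abc.edu", "abc#xyz.eduorg"]:
    print(candidate, bool(loose.search(candidate)), bool(strict.search(candidate)))
```

With the loose pattern all three strings are accepted, because "edu" anywhere in the string satisfies the second alternative; the grouped pattern accepts only x#y.com.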
Your grouping parentheses are incorrect:
/^[a-zA-Z0-9]+#[a-zA-Z0-9]+\.(com|edu|org)$/i
Can also just use one case as you're using the i modifier:
/^[a-z0-9]+#[a-z0-9]+\.(com|edu|org)$/i
N.B. you were also missing a + from the second set, I assume this was just a typo...
What you have written is the equivalent of matching something that:
Begins with [a-zA-Z0-9]+#[a-zA-Z0-9]\.com
contains edu
or ends with org
What you were looking for was:
/^[a-z0-9]+#[a-z0-9]+\.(com|edu|org)$/i
Your regex looks ok.
I guess you are using a find function instead of a match function.
Without knowing what you use it is a bit difficult to say, but in Python you would write
import re
pattern = re.compile(r'^[a-zA-Z0-9]+#[a-zA-Z0-9]\.(com)|(edu)|(org)$')
pattern.match('#abc.edu')   # returns None; use this to validate an input
pattern.search('#abc.edu')  # matches, finds the edu
Try this:
[a-zA-Z0-9]+#[a-zA-Z0-9]+\.(com|edu|org)+$
You forgot the + quantifier, which you need if you want to catch any combination of (com|edu|org).
Update: as I see it, the second [a-zA-Z0-9] was missing a + too.

regex: trying to improve this regex

I am using this regex :
[']?[%]?[^"]#([^#]*)#[%]?[']?
on this text:
insert into table (id,name,age) values ('#var1#' ,#var2#,'#var3#', 3, 'name') where id = '#id#' like ""
and test=<cfqueryparam value="#id#">
For some reason it is catching the comma between #var2# and '#var3#'
but when I include a [^,] it starts doing weird stuff.
Can someone help me with this one.
As I read my regex now, it should find anything that:
might have a single quote
might have a percentage
doesn't have a double quote
then has a hash (#)
followed by no hash, but all other characters
then has a hash and followed by a percentage or quote
So why, when I add "no comma" in front does the regex break??
Updated Question:
Okay, I'll try to explain: a query can look like this:
SELECT e.*, m.man_id, m.man_title, c.cat_id, c.cat_name
FROM ec_products e, ec_categories c, ec_manufacturers m
WHERE c.cat_id = e.prod_category AND
e.prod_manufacturer = m.man_id AND
e.prod_title LIKE <cfqueryparam value="%#attributes.keyword#%"> and
test='#var1#'
ORDER BY e.prod_title
Now I want every value between ##, but not the values that are surrounded by a queryparam tag. So in the example I do want #var1# but not #attributes.keyword#. Reason for this is that all params in the query that are not surrounded by a tag are unsafe and can cause SQL injection. My current regex is
(?!")'?%?#(?!\d)[\w.\(\)]+#%?'?(?!")
and it is almost there. It does find the attributes.keyword because of the %. I just want anything that has ## but is not surrounded by double quotes, so not "##". This will give me all unsafe params in the SQL, like '#var#', or #aNumber#, or '%##', or '%##%', or '##%', but NOT things like
<cfqueryparam value="#variable#">
I hope you understand my intentions?
I think you might be misunderstanding [^"]. It doesn't mean "doesn't have a double quote", but rather means, "one character, which is not a double-quote". Similarly, [^,] means "one character, which is not a comma". So your regex:
[']?[%]?[^"]#([^#]*)#[%]?[']?
will match — for example — this:
2#,'#
which consists of zero single-quotes, zero percent-signs, one character-which-is-not-a-double-quote (namely 2), one hash-sign, two characters-which-are-not-hash-signs (namely ,'), one hash-sign, zero percent-signs, and zero apostrophes. The ,' is what will be captured by the parentheses.
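The walkthrough above can be reproduced with a short Python sketch. Python's re module is an assumption here (the question is about ColdFusion), but this particular pattern uses no engine-specific features.

```python
import re

# The asker's original pattern, unchanged.
pattern = re.compile(r"""[']?[%]?[^"]#([^#]*)#[%]?[']?""")

# The example string from the explanation: [^"] consumes the "2".
m = pattern.fullmatch("2#,'#")
print(repr(m.group(1)))

# Why the comma shows up in the match: [^"] consumes the "," before #var2#.
m2 = pattern.search(",#var2#,")
print(repr(m2.group(0)), repr(m2.group(1)))
```

In the second case the captured group is still var2, but the overall match includes the preceding comma, which is the behaviour the question describes.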
Update for updated question:
I don't think that what you describe is possible using just a ColdFusion regex, because it would require "lookbehind" (to ensure that something is not preceded by a double-quote), which ColdFusion regexes apparently (according to a Google-search) do not support. However:
This StackOverflow answer gives a way of using Java regexes in ColdFusion. If you use that technique, then you can use the Java regex '?%?(?<!")(?<!"')(?<!"%)(?<!"'%)#(?!\d)[\w.()]+#(?!%?'?")%?'? to ensure that there's no preceding double-quote.
You never mentioned how you're actually using this regex. Would it work for you to match .'?%?#(?!\d)[\w.()]+#%?'?(?!") (i.e., to match not just the section of interest, but also the preceding character), and then separately confirm that the matched substring doesn't start with a double-quote?
I also feel compelled to mention, since it sounds like you're trying to use regex-based pattern-matching to help detect and address points of possible SQL injection, that this is a bad idea; you will never be able to do this perfectly, so if anything, I think it will end up increasing your risk of SQL injection (by increasing your reliance on a buggy methodology).
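To illustrate the two-step idea (match the preceding character too, then filter), here is a rough Python sketch. It only demonstrates the mechanics on two lines from the question's query; the variable names are made up, and, as noted above, it should not be treated as a reliable injection detector.

```python
import re

# Pattern from the suggestion above: the leading "." grabs the preceding character.
candidate = re.compile(r""".'?%?#(?!\d)[\w.()]+#%?'?(?!")""")

sql = (
    'e.prod_title LIKE <cfqueryparam value="%#attributes.keyword#%"> and\n'
    "test='#var1#'"
)

# Step 2: discard matches whose extra leading character is a double quote,
# then strip that extra character from the ones we keep.
unsafe = [
    m.group(0)[1:]
    for m in candidate.finditer(sql)
    if not m.group(0).startswith('"')
]
print(unsafe)
```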
Preserving your capture group from the initial regex, here is a revised expression.
'?%?(?!")#([^#]+)#%?'?
Based on the information you provided this should be correct.
'?%?(?!")#[^#]+#%?'?

Capture string until first caret sign hit in regex?

I am working with legacy systems at the moment, and a lot of work involves breaking up delimited strings and testing against certain rules.
With this string, how could I return "Active" in a back reference and search terms, stopping when it hits the first caret (^)?:
Active^20080505^900^LT^100
Can it be done with an inclusion in the regex of this "(.+)" ? The reason I ask is that the actual regex "(.+)" is defined in a database as cutting up these messages and their associated rules can be set from a front-end system. The content could be anything ('Active' in this case), that's why ".+" has been used in this case.
Rule: The caret sign cannot feature between the brackets, as that would result with it being stored in the database field too, and it is defined elsewhere in another system field.
If you have a better suggestion than "(.+)", I will be happy to hear it.
Thanks in advance.
(.+?)\^
Should grab up to the first ^
If you have to include (.+) w/o modifications you could use this:
(.+?)\^(.+)
The first backreference will still be the correct one and you can ignore the second.
A regex is really overkill here.
Just take the first n characters of the string where n is the position of the first caret.
Pseudo code:
InputString.Left(InputString.IndexOf("^"))
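In Python (an assumption, since the legacy system's language is not stated) the pseudo code above could be sketched like this. str.partition has the advantage of not raising an error when no caret is present.

```python
message = "Active^20080505^900^LT^100"

# Slice up to the first caret; index() raises ValueError if there is none.
first_by_index = message[:message.index("^")]

# partition() splits at the first caret and copes with a missing one.
first_field, _, rest = message.partition("^")

print(first_by_index, first_field, rest)
```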
^([^\^]+)
That should work if your RE library doesn't support non-greediness.
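A small Python comparison of the patterns suggested in this thread, using the sample string from the question. It also shows why the plain greedy (.+) is a problem when followed by a caret; Python's re module is used purely for illustration.

```python
import re

message = "Active^20080505^900^LT^100"

# Lazy quantifier: stops at the first caret.
lazy = re.match(r"(.+?)\^", message).group(1)

# Negated character class: works without lazy quantifiers.
negated = re.match(r"^([^\^]+)", message).group(1)

# Greedy quantifier: runs to the LAST caret, so carets end up in the group.
greedy = re.match(r"(.+)\^", message).group(1)

print(lazy, negated, greedy)
```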