Regex to validate company names - regex

I have this regex to validate a few things. Unfortunately it won't validate "P.C.", only "P.C". I tried adding {0,1} to each period, but it still will not validate. Any ideas?
(new-line characters for readability)
/(^|\s)Corporation\.{0,1}(^|$)|
(^|\s)Corp\.{0,1}(^|$)|
(^|\s)Inc\.{0,1}(^|$)|
(^|\s)Incorporated\.{0,1}(^|$)|
(^|\s)Company\.{0,1}(^|$)|
(^|\s)(^|$)|
(^|\s)LTD\.{0,1}(^|$)|
(^|\s)PLLC\.{0,1}(^|$)|
(^|\s)P\.{0,1}C\.{0,1}(^|$)/ig;

Here's a simplified version of your regex:
/(?:^|\s)(?:Corporation|Corp|Inc|Incorporated|Company|LTD|PLLC|P\.?C)\.?$/ig;
{0,1} can be replaced by ?
Repetition can be eliminated with some grouping.
This doesn't make much sense: (^|$). It requires either the beginning or the end of a line to occur right after a match. Since the beginning of a line cannot follow matched text, this is functionally the same as requiring the match to be at the end of the line, so I just replaced it with $.
When you need to group things, use non-capturing groups (?:...) unless you need to grab that part of the match. They avoid the overhead of storing the captured text.
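All that being said, your original pattern should have matched P.C. at the end of a line. The problem may be with your input data or the way you are using the regex. For example, in JavaScript a regex with the g flag remembers lastIndex between test() calls, so reusing the same regex object can make alternate calls fail.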
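To check the simplified pattern against a few inputs, here is a minimal Python sketch. The function name has_company_suffix and the sample company names are made up for illustration, and the sketch uses P\.?C so that "PC" without periods is still accepted, as in the original pattern.

```python
import re

# Suffix pattern from the answer, translated to Python's re module.
# Assumption: P\.?C keeps both periods optional, like the original regex.
SUFFIX_RE = re.compile(
    r"(?:^|\s)(?:Corporation|Corp|Inc|Incorporated|Company|LTD|PLLC|P\.?C)\.?$",
    re.IGNORECASE,
)


def has_company_suffix(name: str) -> bool:
    """Return True if the name ends with one of the known company suffixes."""
    return SUFFIX_RE.search(name) is not None


if __name__ == "__main__":
    for sample in ["Smith & Jones P.C.", "Smith & Jones P.C", "Acme Corp.", "Zinc"]:
        print(sample, "->", has_company_suffix(sample))
```

The search is anchored only at the end, so text before the suffix is unrestricted.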

Related

Abort regex execution when pattern found in negative lookahead syntax

While struggling trying to validate SQL Server's connection string pattern using regex I've achieved the following result:
^(?!.*?(?<=^|\;)[a-zA-Z]+( [a-zA-Z]+)*(\=[^\;]+?\=[^\;]*)?(\;|$))+([a-zA-Z]+( [a-zA-Z]+)*\=[^\;]+\;?)+$
Sample string used was:
option=value;missingvalue;multiple assignment=123=456
* (hosted and tested in regex101)
And, as expected, the string didn't match. The issue is that I think this may not be standard, recommended nor optimal regex implementation — especially at the negative lookahead part, considering it's just going through the whole string even after a successful match.
I'll try to break down how it works below:
Negative Lookahead
1. ^(?!.*?(?<=^|;)
Negative lookahead pattern starting either at the beginning of the string or recursively throughout just after the semi colon character
2. [a-zA-Z]+( [a-zA-Z]+)*(=[^;]+?=[^;]*)?(;|$))+
Matching the simple or composite option names — that is, just [a-zA-Z]+ (mandatory) or, additionally, ( [a-zA-Z]+)* any number of times; afterwards there's an optional group that tries to match when there's more than one consecutive value assignment for any given option; finally it ends with either ; or $ (end of string) — in case of the first one, the lookahead pattern restarts from the beginning (recursion)
Regular Pattern Matching
([a-zA-Z]+( [a-zA-Z]+)*=[^;]+;?)+$
Not much new to say here other than that this is the pattern which should actually match the string after the initial Negative Lookahead thorough scan/validation.
I can't deny that it's kinda working for what I intended, but I can't hold back the feeling that I'm misunderstanding something about regex's workings.
Is there an easier way to do this while avoiding having to recursively look ahead using the pattern described above multiple times?
EDIT: As requested, some closer to real life examples would be the following — for both valid and invalid formatting:
VALID
Database=somedb;Username=admin;Password=P#ssword!23;Port=1433
INVALID
missing delimiter between Username and Password options
Database=somedb;Username=adminPassword=P#ssword!23;Port=1433
missing value for Port option
Database=somedb;Port;Username=admin;Password=P#ssword!23
The following regex accepts only letters in the names. For the purposes of testing, it accepts any character except equals and semi-colon in the values. This would need to be tightened, since characters like line endings and tabs would need to be excluded.
The character class [^=;] forbids a second equals sign in a value, and every name=value pair must end with a semi-colon. Please note that your "correct" example is therefore found to be wrong, because there is no semi-colon at the end.
If we try to block it the other way round, it becomes impossible to match the regex.
I've added an optional single space in the name to match "Connection Timeout" and similar.
/^(\s*[a-zA-Z]+ ?[a-zA-Z]+=[^=;]+;)+$/gm
I have also allowed spaces before the name.
Our regex is made up of:
^ beginning of line
( start group
\s* optional whitespace before the name
[a-zA-Z]+ ?[a-zA-Z]+ name containing at least one letter before and after an optional space. This means at least two letters
= an equals sign
[^=;]+ any character except equals and semi-colon, at least once
; a literal semi-colon
)+ close the group and repeat it at least once
$ end of line
Thank you Casimir et Hippolyte for the improvement. I was using look-aheads and look-backs following the question but your syntax is much cleaner.
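For anyone who wants to try the pattern outside regex101, here is a rough Python sketch using the sample strings from the question. The function name is_valid_connection_string is hypothetical, and the valid sample has a trailing semi-colon added because this pattern requires one.

```python
import re

# The pattern from this answer: one or more "name=value;" pairs, nothing else.
CONN_RE = re.compile(r"^(\s*[a-zA-Z]+ ?[a-zA-Z]+=[^=;]+;)+$")


def is_valid_connection_string(text: str) -> bool:
    """Return True if every option has exactly one value and ends with ';'."""
    return CONN_RE.match(text) is not None


if __name__ == "__main__":
    samples = [
        "Database=somedb;Username=admin;Password=P#ssword!23;Port=1433;",
        "Database=somedb;Username=adminPassword=P#ssword!23;Port=1433;",
        "Database=somedb;Port;Username=admin;Password=P#ssword!23;",
    ]
    for sample in samples:
        print(is_valid_connection_string(sample), sample)
```

Because the whole string must be a run of well-formed pairs, no lookahead pass is needed: any malformed option simply makes the repetition fail.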

How to format allowance of multiple whitespaces between characters in Regex more compactly?

I came up with this regex to check if an IBAN is entered correctly into a field. It also lets the user enter up to 4 whitespaces between characters without causing an error.
^\s?\s?\s?\s?N\s?\s?\s?\s?\s?O\s?\s?\s?\s?([0-9a-zA-Z]\s?\s?\s?\s?){13}$
It works perfectly, but I want to get rid of the "\s?\s?\s?\s?" and format it more compact, I've tried [\s?]{4} but that doesn't work.
What's the correct way to shorten this up?
The system I work with doesn't allow me to use any Javascript, I can only put pure regEx definitions to control entry into the field.
thank you
You can shorten the repeating \s parts using the quantifier {0,4}, which matches a whitespace char 0 to 4 times, and add an anchor $ to assert the end of the string to prevent a partial match.
If you don't need the value of the capturing group afterwards, you could make it non-capturing with (?:...) instead.
^\s{0,4}N\s{0,4}O\s{0,4}(?:[0-9a-zA-Z]\s{0,4}){13}$
If you don't want to match a newline, you could use [^\S\r\n]{0,4} instead of \s{0,4} but that would defeat the purpose of making the pattern smaller.
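The field in the question only accepts a pure regex, but a quick way to sanity-check the shortened pattern is a small Python sketch like the one below. The function name looks_like_no_iban and the sample inputs are illustrative only, and the regex checks the shape of the value, not the IBAN checksum.

```python
import re

# "NO" followed by 13 alphanumerics, with up to 4 whitespace chars allowed
# before the first character and after every character.
IBAN_NO_RE = re.compile(r"^\s{0,4}N\s{0,4}O\s{0,4}(?:[0-9a-zA-Z]\s{0,4}){13}$")


def looks_like_no_iban(value: str) -> bool:
    """Return True if the value has the expected NO + 13 characters shape."""
    return IBAN_NO_RE.match(value) is not None


if __name__ == "__main__":
    for sample in ["NO9386011117947", "NO93 8601 1117 947", "NO93     8601 1117 947"]:
        print(repr(sample), "->", looks_like_no_iban(sample))
```

The last sample has five spaces in a row, one more than \s{0,4} allows, so it is rejected.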

Select Northings from a 1 Line String

I have the following string;
Start: 738392E, 6726376N
I extracted 738392 ok using (?<=.art\:\s)([0-9A-Z]*). This gave me a one-group match, allowing me to extract it as a column value.
I want to extract 6726376 the same way, with only one group appearing, because I am parsing that to a column value.
Not sure why (?=(art\:\s\s*))(?=[,])*(.*[0-9]*) is giving me the entire line after S.
Helping me get it right with an explanation will go a long way.
Because you used positive lookaheads. Those just make some assertions, but don't "move the head along".
(?=(art\:\s\s*)) makes sure you're before "art: ...". The next thing is another positive lookahead that you quantify with a star to make it optional. Finally you match anything, so you get the rest of the line in your capture group.
I propose a simpler regex:
(?<=(art\:\s))(\d+)\D+(\d+)
First we make a positive lookbehind that makes sure we're after "art: ", then we match two numbers, separated by non-digits.
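The difference between asserting and consuming can be seen in a short Python sketch. It uses the sample line from the question; the inner capture group of the lookbehind is dropped here for simplicity, so the two numbers land in groups 1 and 2 rather than 2 and 3.

```python
import re

line = "Start: 738392E, 6726376N"

# A lookahead only asserts: the capture begins where the lookahead was tested,
# so "art: " itself ends up inside the captured text.
with_lookahead = re.search(r"(?=art:\s)(.*)", line)

# The proposed pattern: a lookbehind for "art: ", then two numbers
# separated by one or more non-digit characters.
proposed = re.search(r"(?<=art:\s)(\d+)\D+(\d+)", line)
easting, northing = proposed.group(1), proposed.group(2)

if __name__ == "__main__":
    print(with_lookahead.group(1))
    print(easting, northing)
```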
There is no need for you to make it this complicated. Just use something like
Start: (\d+)E, (\d+)N
or
\b\d+(?=[EN]\b)
if you need to match each bit separately.
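Both suggestions can be tried with Python's re module. This is only a sketch on the sample line from the question, and the variable names are made up.

```python
import re

line = "Start: 738392E, 6726376N"

# Option 1: spell out the whole format and capture both numbers at once.
whole = re.search(r"Start: (\d+)E, (\d+)N", line)
easting, northing = whole.groups()

# Option 2: match each number separately; the lookahead requires a trailing
# E or N but leaves it out of the match.
numbers = re.findall(r"\b\d+(?=[EN]\b)", line)

if __name__ == "__main__":
    print(easting, northing)
    print(numbers)
```

With option 2 the northing is simply the second match, so no capture group is needed at all.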
Your expression (?=(art\:\s\s*))(?=[,])*(.*[0-9]*) has several problems besides the ones already mentioned: 1) your first and second lookahead match at different locations, 2) your second lookahead is quantified, which, in 25 years, I have never seen someone do, so kudos. ;), 3) your capturing group matches about anything, including any line or the empty string.
You match the whole part after it because you use .* which will match until the end of the line.
Note that the [0-9]* part at the end of the pattern never matches anything: it is optional, and the preceding .* has already consumed everything up to the end of the string.
You could get the match without any lookarounds:
(art:\s)(\d+)[^,]+,\s(\d+)
If you want the matches only, you could make use of the PyPI regex module, which supports variable-length lookbehind:
(?<=\bStart:(?:\s+\d+[A-Z],)* )\d+(?=[A-Z])

Name validation - Adding a check to this regex to stop entering just identical characters

I'm trying to add another feature to a regex which is trying to validate names (first or last).
At the moment it looks like this:
/^(?!^mr$|^mrs$|^ms$|^miss$|^dr$|^mr-mrs$)([a-z][a-z'-]{1,})$/i
https://regex101.com/r/pQ1tP2/1
The idea is to do the following
Don't allow just adding a title like Mr, Mrs etc
Ensure the first character is a letter
Ensure subsequent characters are either letters, hyphens or apostrophes
Minimum of two characters
I have managed to get this far (shockingly I find regex so confusing lol).
It matches things like O'Brian or Anne-Marie etc and is doing a pretty good job.
My next addition I've struggled with, though! I'm trying to extend the regex so that it does not match the following:
Just entering the same characters i.e. aaa bbbbb etc
Thanks :)
I'd add another negative lookahead alternative matching against ^(.)\1*$, that is, any character, repeated until the end of the string.
Included as is in your regex, that would give:
/^(?!^mr$|^mrs$|^ms$|^miss$|^dr$|^mr-mrs$|^(.)\1*$)([a-z][a-z'-]{1,})$/i
However, I would probably simplify your negative lookahead as follows :
/^(?!(mr|mrs|ms|miss|dr|mr-mrs|(.)\2*)$)([a-z][a-z'-]{1,})$/i
The modifications are as follows:
We're evaluating the lookahead at the start of the string, as indicated by the ^ preceding it, so there is no need to repeat the start-of-string anchor in each of its clauses
Each alternative matches up to the end of the string, so we can put the alternatives in a group followed by a single end-of-string anchor
We have created a new group, which we have to take into account in our back-reference: to reference the same group, it must now use \2 rather than \1. An alternative in regex flavours that support it would have been to use a non-capturing group (?:...) for the alternatives, which leaves the back-reference as \1
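As a rough check of the idea, here is a Python sketch of the non-capturing variant mentioned above. The function name is_valid_name is hypothetical, the pattern keeps all six titles from the original regex, and because the alternatives sit in a non-capturing group, (.) stays group 1 and the back-reference is \1.

```python
import re

NAME_RE = re.compile(
    r"^(?!(?:mr|mrs|ms|miss|dr|mr-mrs|(.)\1*)$)"  # reject bare titles and one repeated character
    r"[a-z][a-z'-]{1,}$",                         # a letter, then letters, hyphens or apostrophes
    re.IGNORECASE,
)


def is_valid_name(name: str) -> bool:
    """Return True if the name passes all the checks listed in the question."""
    return NAME_RE.match(name) is not None


if __name__ == "__main__":
    for sample in ["O'Brian", "Anne-Marie", "Mrs", "aaa", "bbbbb", "A"]:
        print(sample, "->", is_valid_name(sample))
```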

regex optional word match

I'm trying to create a regex for extracting singers, lyricists. I was wondering how to make lyricists search optional.
Sample Multiline String:
Fireworks Singer: Katy Perry
Vogue Singers: Madonna, Karen Lyricist: Madonna
Regex: /Singers?:(.*)\s?Lyricists?:(.*)/
This matches the second line correctly and extracts Singers(Madonna, Karen) and Lyricists(Madonna)
But it does not work with the first line, when there are no Lyricists.
How do I make Lyricists search optional?
You can enclose the part you want to match in a non-capturing group: (?:). Then it can be treated as a single unit in the regex, and subsequently you can put a ? after it to make it optional. Example:
/Singers?:(.*)\s?(?:Lyricists?:(.*))?/
Note that here the \s? is useless since .* will greedily eat all characters, and no backtracking will be necessary. This also means that the (?:Lyricists?:(.*)) part will never be matched for the same reason. You can use the non-greedy version of .*, .*? along with the $ to fix this:
/Singers?:(.*?)\s*(?:Lyricists?:(.*))?$/
Some extra whitespace ends up captured; this can be removed also, giving a final regex of:
/Singers?:\s*(.*?)\s*(?:Lyricists?:\s*(.*))?$/
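To see the greedy problem and the fix side by side, here is a small Python sketch on the second sample line. The variable names are made up, and the patterns are the ones from this answer translated to Python's re module.

```python
import re

line = "Vogue Singers: Madonna, Karen Lyricist: Madonna"

# Greedy version: the first (.*) swallows the rest of the line, so the
# optional Lyricists group is skipped and its capture stays empty (None).
greedy = re.search(r"Singers?:(.*)\s?(?:Lyricists?:(.*))?", line)

# Final version: lazy (.*?) plus the $ anchor lets the optional group match.
final = re.search(r"Singers?:\s*(.*?)\s*(?:Lyricists?:\s*(.*))?$", line)

if __name__ == "__main__":
    print(greedy.groups())
    print(final.groups())
```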
Just to add to Cameron's solution: if the source string has multiple lines, each containing Singers and possibly Lyricists, you'll probably need to add the 'm' multi-line modifier so that the '$' will match at the end of each line. (You didn't say what language you are using. You may want to add the 'i' modifier as well.)
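In Python, for instance, the multi-line modifier is the re.MULTILINE flag. The sketch below assumes the two sample lines from the question are joined by a newline; the names text and results are made up.

```python
import re

text = (
    "Fireworks Singer: Katy Perry\n"
    "Vogue Singers: Madonna, Karen Lyricist: Madonna"
)

# re.MULTILINE makes $ match at the end of every line, not just the string.
PATTERN = re.compile(
    r"Singers?:\s*(.*?)\s*(?:Lyricists?:\s*(.*))?$",
    re.MULTILINE,
)

# One (singers, lyricists) tuple per line; lyricists is None when absent.
results = [match.groups() for match in PATTERN.finditer(text)]

if __name__ == "__main__":
    for singers, lyricists in results:
        print(singers, "|", lyricists)
```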