Proper validation of the Name entered by a user - regex

I am trying to validate the Name entered by a user. I need to make sure the Name entered is 'literally' valid. I tried many regular expressions within my limited knowledge, none of them seem to work fully. For example /^[^.'-]+([a-zA-Z]+|[ .'-]{1})$/
Since the PHP website I'm working on, is fully in English, only English names are allowed.
The rules applicable to filtering the name are:
A name may contain any of these characters: [a-zA-Z .'-]
The name may start only with an Alphabet or an Apostrophe
Any of the characters in [ .'-] may not occur more than once in a stretch, ie., no '---' or '--'
A space should not follow - or ' nor come immediately before . or -
Can anyone please provide the proper regular expression to implement these?

Here's a regex to solve your problem (demo at regex101):
/^(?!.*(''| |--|- |' | \.| -|\.\.|\n))['a-z][- '.a-z]*$/gi
Breakdown:
(?!.*(''| |--|- |' | \.| -|\.\.|\n)) negative lookahead to ensure that no doubled characters are found
['a-z] start with one of these characters
[- '.a-z]* the rest can also include spaces and dashes, and are not required (* instead of +)

This should work for you:
/^'?([a-z]+((\. )|( ')|['\- \.])?)+$/i
Demo at regex101.com
Explanation:
'? Optionally start with an apostrophe
([a-z]+((\. )|( ')|['\- \.])?)+ Afterwards allow 1-n groups of at least 1 alphabetic character, followed by certain (both) valid combinations of special characters. These are ". ", " '" (quotes to see the spaces) or any single special character.
/i Match case insensitive, otherwise you'd just specify [a-zA-Z] instead of [a-z].
I am not sure if John,.Doe or ' should be considered valid names. With my expression they would not.

You could probably get better performance by limiting all the backtracking a giant assertion would have to do.
Something like this..
# (?i)^(?=[a-z'])(?:[a-z]+|([ ](?!-)(?<!'[ ])|[.'-])(?!\1))*$
(?i) # Modifier, case insensitive
^ # BOS
(?= [a-z'] ) # Starts with an alpha or apos '
(?: # Cluster
[a-z]+ # 1 to many alpha's
| # or
( # (1 start), Capture group, one of [ .'-]
[ ] # Single space
(?! - ) # not a dash - ahead
(?<! ' [ ] ) # nor apos space '[ ] behind
| # or,
[.'-] # Single dot . or apos ' or dash -
) # (1 end)
(?! \1 ) # Backref capt grp 1,
# not adjacent duplicate of one of [ .'-]
)* # Cluster end, do 0 to many times
$ # EOS

Related

Regex: match an empty string instead of nothing

I have a Python script in which I'm trying to parse a string of the form:
one[two=three].four
Each word should be in its own capture group. The punctuation should not be captured.
Additionally, each part of the string is optional, and the part delimited by brackets can be repeated. So the above is the most complete example, but all of the following should also be valid matches:
one
.four
one[two=three][five=six]
[two=three]
[two].four
[two][five]
[]
In the case that one of the words is not present, instead of failing to capture, I'd like to capture a string of length 0.
The regex that I'm using is as follows:
pattern = re.compile(
r"""
^ # Assert start of string
(?P<cap1> # Start a new group for "one"
[a-z]* #
) #
(?: # Start a group for "two" and "three"
\[ # Match the "["
(?P<cap_2> # Start a group for "two"
[a-z]* #
) #
=? # Delimit two/three with "="
(?P<cap_3> # Start a group for "three"
[a-z]* #
) #
\] # Match the "]"
)* # End the two-three group, allowing repeats
\.? # Delimit three/four with "."
(?P<cap_4> # Begin a group for "four"
[a-z]* #
) #
$ # Assert end of string
""", re.IGNORECASE|re.VERBOSE)
What I've tried to do during that regex is, instead of allowing 0 or 1 of a group by appending ? to the entire group, I allowed any number of characters to be in the actual match itself by appending * to the character selection. Therefore, the match is forced to exist, but the string itself can have a length of 0.
The problem comes with the bracketed block. The package I'm using allows me to access all captures of a named group using match.captures(groupname). This way, I can access all matches for cap_2 using match.captures("cap_2"):
>>> pattern.match("one[two=three][five=six].four").captures("cap_2")
["two", "five"]
This works fine when the brackets are present. However, when they're not:
>>> pattern.match("one.four").captures("cap_2")
[]
Expected: [""]
I expect there to be at least an empty string present for cap_2 and cap_3. However, there's nothing.
This is because of the * I place after the two+three section of the regex, in order to allow multiple of those groups - this is allowing that part of the regex to be skipped altogether.
Changing that * to + breaks the regex, as now it won't match the above example at all because it's trying to match the brackets. Adding a ? after each bracket means that cap_1 and cap_2 are not delimited and includes what should be in cap_4 in cap_3.
What's the solution here? How can I allow a group containing two capturing groups to be executed multiple times, but match only empty strings when the brackets are not present?
You may solve the problem by replacing * after the (?:\[(?P<cap_2>[a-z]*)=?(?P<cap_3>[a-z]*)\])* repeated group with + and adding an alternative with a second occurrence of groups cap_2 and cap_3 (note that PyPi regex module supports multiple identically named groups in the same regex):
import regex as re
s = 'one.four'
pattern = re.compile(
r"""
^ # Assert start of string
(?P<cap1> # Start a new group for "one"
[a-z]* #
) #
(?:
(?: # Start a group for "two" and "three"
\[ # Match the "["
(?P<cap_2> # Start a group for "two"
[a-z]* #
) #
=? # Delimit two/three with "="
(?P<cap_3> # Start a group for "three"
[a-z]* #
) #
\] # Match the "]"
)+ # End the two-three group, allowing repeats
|
(?P<cap_2>)(?P<cap_3>)
)
\.? # Delimit three/four with "."
(?P<cap_4> # Begin a group for "four"
[a-z]* #
) #
$ # Assert end of string
""", re.IGNORECASE|re.VERBOSE)
print ( pattern.match("one.four").captures("cap_2") )
# => ['']
See the Python demo
The thing is, the (?:\[(?P<cap_2>[a-z]*)=?(?P<cap_3>[a-z]*)\])* part matches by all means since it can match an empty string, and if you just add the alternatives without changing the modifier, the expected results won't be achieved. So, if there is no [...]s, the second cap_2 and cap_3 groups with empty patterns willmatch by all means capturing an empty string.
if you want it to either match the empty string or something else, you need the OR operator: |
if you want your regexp to match an empty string, you need something that matches the empty string: e.g. () or (not empty|)
Combined and applied to your case, that would look like this (simplified):
((?:\[stuff inside the brackets\])+|)
The outermost group captures the whole bracket construct (e.g. [two][three]) if it's present or the empty string. Notice that the left part of the | operator now has to match at least once (+).

How to comment SQL statements in Notepad++?

How can I "block comment" SQL statements in Notepad++?
For example:
CREATE TABLE gmr_virtuemart_calc_categories (
id int(1) UNSIGNED NOT NULL,
virtuemart_calc_id int(1) UNSIGNED NOT NULL DEFAULT '0',
virtuemart_category_id int(1) UNSIGNED NOT NULL DEFAULT '0'
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
It should be wrapped with /* at the start and */ at the end using regex in Notepad++ to produce:
/*CREATE TABLE ... (...) ENGINE=MyISAM DEFAULT CHARSET=utf8;*/
You only offer one sample input, so I am forced to build the pattern literally. If this pattern isn't suitable because there are alternative queries and/or other interfering text, then please update your question.
Tick the "Match case" box.
Find what: (CREATE[^;]+;) Replace with: /*$1*/
Otherwise, you can use this for sql query blocks that start with a capital and end in semicolon:
Find what: ([A-Z][^;]+;) Replace with: /*$1*/
To improve accuracy, you might include ^ start of line anchors or add \r\n after the semi-colon or match the CHARSET portion before the semi-colon. There are several adjustments that can be made. I cannot be confident of accuracy without knowing more about the larger body of text.
You could use a recursive regex.
I think NP uses boost or PCRE.
This works with both.
https://regex101.com/r/P75bXC/1
Find (?s)(CREATE\s+TABLE[^(]*(\((?:[^()']++|'.*?'|(?2))*\))(?:[^;']|'.*?')*;)
Replace /*$1*/
Explained
(?s) # Dot-all modifier
( # (1 start) The whole match
CREATE \s+ TABLE [^(]* # Create statement
( # (2 start), Recursion code group
\(
(?: # Cluster group
[^()']++ # Possesive, not parenth's or quotes
| # or,
' .*? ' # Quotes (can wrap in atomic group if need be)
| # or,
(?2) # Recurse to group 2
)* # End cluster, do 0 to many times
\)
) # (2 end)
# Trailer before colon statement end
(?: # Cluster group, can be atomic (?> ) if need be
[^;'] # Not quote or colon
| # or,
' .*? ' # Quotes
)* # End cluster, do 0 to many times
; # Colon at the end
) # (1 end)

Regex word can be optional but only if it matches the characters

Following pattern: (v[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2})(-[0-9]{1,2})?((-schema)?(-dev)?)((-schema)?(-dev)?) from http://regexr.com/ is meant to be used in a shell script with grep and does match the following strings (working example):
Hello I am a text and this is my v1.12.33-32 version
Hello I am a text and this is my v1.12.33-dev version
Hello I am a text and this is my v1.12.33-dev-schema version
Hello I am a text and this is my v1.12.33-schema version
Hello I am a text and this is my v1.12.33-3-schema version
and so forth
So I made the words schema and dev optional. They can be ommitted or used in a arbitrary order. What I don't what is this:
Hello I am a text and this is my v1.12.33-foo version
or Hello I am a text and this is my v1.12.33-asfs version
to match.
I want the option to be a bit more constrained. At the moment the Regex is still matching the stuff that...well actually matches.
This for example:
Hello I am a text and this is my v1.123.33
results in an empty string while this:
`Hello I am a text and this is my v1.12.33-bla"
still results in v.1.12.33
Is this because of the grouping I made? So at least the fully matching groups will be taken for the returned match-string?
To match only the version string, disallow extra trailing tags, yet allow trailing unmatched text, you need a regex language that supports lookahead. Standard grep / egrep regexes do not support lookahead.
You have two options:
Since you seem to be relying on GNU grep anyway, you could use a Perl regex, such as
v[0-9]{1,2}(\.[0-9]{1,2}){2}(-[0-9]{1,2})?((-schema(-dev)?)?|(-dev(-schema)?)?)?(?!\S)
The negative lookahead at the end allows the match to appear at the end of the line, but also requires that if it does not end the line then the next character following the match must be whitespace (which is not itself included in the match).
You could give up on completely isolating the target text via -o, and instead allow the pattern to match the trailing context, too:
v[0-9]{1,2}(\.[0-9]{1,2}){2}(-[0-9]{1,2})?((-schema(-dev)?)?|(-dev(-schema)?)?)?(\s.*)?$
In this case, you could isolate the target text in a second step, by stripping off any tail beginning with whitespace.
Note that neither of these pays attention to text preceeding the match. You have similar options for handling that portion as you do for handling the trailing portion.
The problem seems to be all the optional expressions lurking at the
edge (end).
You can solve that a few ways, but none are %100 because you'd need
more rules to control what matches.
It's not like you can say no - is allowed afterword, the engine will
backtrack to one of the range digits {1,2} to make a match.
What seems to work for now is passing on a whitespace end edge
or matching the dev/schema items.
(v[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2})(-[0-9]{1,2})?(?:(?!\S)|(-(schema|dev)(?:-(schema|dev))?))
Expanded
( # (1 start)
v [0-9]{1,2}
\. [0-9]{1,2}
\. [0-9]{1,2}
) # (1 end)
( - [0-9]{1,2} )? # (2)
(?:
(?! \S ) # Whitespace boundary
| # or,
( # (3 start)
-
( schema | dev ) # (4)
(?:
-
( schema | dev ) # (5)
)?
) # (3 end)
)
edit
If you want to avoid matching the same schema|dev word twice, just add
a negative assertion of group 4, before capture group 5 above.
(v[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2})(-[0-9]{1,2})?(?:(?!\S)|(-(schema|dev)(?:-(?!\4)(schema|dev))?))
Expanded
( # (1 start)
v [0-9]{1,2}
\. [0-9]{1,2}
\. [0-9]{1,2}
) # (1 end)
( - [0-9]{1,2} )? # (2)
(?:
(?! \S ) # Whitespace boundary
| # or,
( # (3 start)
-
( schema | dev ) # (4)
(?:
-
(?! \4 ) # Not same word twice
( schema | dev ) # (5)
)?
) # (3 end)
)
Since regular expressions are open-ended, you need to specify with $ where you want the match to end, so you don't let the regex engine silently ignore trailing junk.
With only two tags in the optional set, I would just enumerate the 4 possibilities:
(v[0-9]{1,2}\.[0-9]{1,2}\.[0-9]{1,2})(-[0-9]{1,2})?(-schema|-dev|-dev-schema|-schema-dev)?$
My version:
grep --perl-regexp \
'\bv(?:\d{1,2}\.){2}\d{1,2}(?:\-\d{1,2})?(?:\-(?:schema|dev))?(?:\s|$)' \
path/to/file
Where
the first \b is a word boundary(you might want to make it stricter);
(?: ... ) expressions are non-capturing groups;
\s|$ is either a space character, or the end of line
The rest is just refactored for simplicity.
The expression allows only schema, or dev at the "end".

Make sure the presence of a particular character in a string

I'm using the following regex to parse my application log file to search for particular string
\[\s*\b(?:[0-9A-Za-z][0-9A-Za-z-_.#]{0,200})(?:\.(?:[0-9*A-Za-z][0-9A-Za-z-_.#]{0,200}))*(\.?|\b)\s*]
This works fine but now we need to make sure that the string "must" contain "-" character to match. I'm confused to add this condition to the original regx.
Any pointers would be helpful.
Thanks and Regards,
Santhosh
The regex matches a string inside square brackets, [ and ], and may only consist of non-[ and non-] symbols.
You can easily add a positive lookahead restriction after the opening [ like check if the next characters other than ] and [ are followed with -:
\[ # opening [
(?=[^\]\[]*-) # There must be a hyphen in [...]
\s* # 0+ whitespaces
\b(?:[0-9A-Za-z][0-9A-Za-z-_.#]{0,200}) # Part 1 (with obligatory subpattern)
(?:\. # Part 2, optional
(?:[0-9*A-Za-z][0-9A-Za-z-_.#]{0,200})
)*
(\.?|\b) # optional . or word boundary
\s* # 0+ whitespaces
] # closing ]
See the regex demo
And a one-liner:
\[(?=[^\]\[]*-)\s*\b(?:[0-9A-Za-z][0-9A-Za-z-_.#]{0,200})(?:\.(?:[0-9*A-Za-z][0-9A-Za-z-_.#]{0,200}))*(\.?|\b)\s*]
Tip: use the verbose /x modifier to split the pattern into separate multiline blocks for analysis, it will help you in the future when you need to modify the pattern again.
If you need to match only if - or # is present inside [...], modify the lookahead as (?=[^\]\[]*[-#]). For a more general case, use (?=[^\]\[]*(?:one|another|must-be-present)) alternatives inside an additional group inside the lookahead.
Updated answer - Assertion
In this case, the better way to do it is to use an assertion consisting of checking only the position's expected to match the character in question.
I know it's simple, but using the outter pseudo-anchor text \[ ... \] as
a delimiter that cannot exist in the body is a rarity.
You should always try to avoid doing it like this.
Things change, your input could change.
The rule to follow in validation of known characters that are Mid-String is to use only them
when using an assertion validator.
This avoids the necessity of relying on what is not there at the moment ie, not a ],
but should rely on what is there.
Again, this pertains to mid-string matching.
BOL/EOL is a different thing entirely ^$, and is a more permanent construct
with which to leverage.
It's always better to code smarter.
\[\s*\b(?=[0-9A-Za-z][0-9A-Za-z_.#]{0,199}-|[0-9A-Za-z][0-9A-Za-z_.#]{0,200}(?:\.[0-9*A-Za-z][0-9A-Za-z_.#]{0,200})*\.[0-9*A-Za-z][0-9A-Za-z_.#]{0,199}-)(?:[0-9A-Za-z][0-9A-Za-z_.#-]{0,200})(?:\.(?:[0-9*A-Za-z][0-9A-Za-z_.#-]{0,200}))*(\.?|\b)\s*\]
Using Conditionals
If your engine supports conditionals, the easy way is to not rely on a fluke
of pseudo anchor text, ie. [..].
\[\s*\b[0-9A-Za-z](?:[0-9A-Za-z_.#]|(-)){0,200}(?:\.(?:[0-9*A-Za-z](?:[0-9A-Za-z_.#]|(-)){0,200}))*(\.?|\b)\s*\](?(1)|(?(2)|(?!)))
Expanded
\[ \s* \b
[0-9A-Za-z]
(?:
[0-9A-Za-z_.#]
| ( - ) # (1)
){0,200}
(?:
\.
(?:
[0-9*A-Za-z]
(?:
[0-9A-Za-z_.#]
| ( - ) # (2)
){0,200}
)
)*
( \.? | \b ) # (3)
\s* \]
(?(1) # Fail if no dash found
| (?(2)
| (?!)
)
)
This conditional would work if just want to make sure that - occurs within your string before running your block of code.
if (myString.indexOf('-') >= 0) {
//your code
}
If you have to have a single hyphen, you'll have to either repeat most of the pattern, or check for it in a second phase:
if re.match(pattern, line):
if not '-' in line:
raise MissingDash('No dash in line: {}'.format(line))
I'd suggest adding the second check, since adding the requirement to the regex would make it even more horrible to read.

Matching percentages

I've been trying to enhance some code which determines whether a string is a valid percentage.
I decided that it was time to finally have a hundred problems, and learned regex.
I've been using this web regex tester to build my pattern.
I'm trying to do this rather loosely, such that valid percentages may be integer or decimal, positive or negative, include commas or not, and have any amount of whitespace at the beginning and end, as well as around the optional negative sign and the required percentage sign.
So far, I have \s*-?\s*\d+(,\d+)*(?:\.\d*)?\s*%\s*, which matches almost all of my test cases correctly:
0
0
0
% 0
- 0 %
20948.924780%
315%
2,456,875 %
2,104.86%
89fqyf0gp948y1-%ghghpq98fy92,.?><
, , , ,,,, 0,0,000,00,00,,,0
, , , ,,,, 0,0,000,00,00,,,0%
000000000,00000000000 %
000000000,00000000000,00000000000 %
000000000,00000000000,00000000000,00000000000.00000000000 %
These are not in any particular order, some pass and some fail, but only one is incorrect. In , , , ,,,, 0,0,000,00,00,,,0%, the last 0%\n is a match, but the whole line should be invalid. Start and end indicators do not seem to have the effect I had assumed, as a $ makes only the last example match, while a ^ at the beginning makes no matches register.
It may be something small, but as someone who only learned regex yesterday, it's far beyond my reach.
Thanks!
Start and end indicators do not seem to have the effect I had assumed, as a $ makes only the last example match, while a ^ at the beginning makes no matches register.
Those anchors should be working. However, it does depend on the regex engine and the options whether they match line begins/ends or file begins/ends. On RegExr, you'd have to check the multiline option: http://regexr.com?380p9 - in programming, use the m flag.
It could be done like this.
Edit: So after realizing its a line thing, this is the regex now.
Note(s) -
Uses multiline mode line Bergi's.
Also, you CANNOT just use \s wihitespace class in this.
It doesn't matter what mode used, \s will WILL match CRLF if it can, which means
-
000,000000.22
%
will match because it satisfies all the conditions.
[^\S\r\n] means match whitespace except CRLF characters. It could be replaced with
[^\S\n] in the real world. The initial input on that tester used \r\n linebreaks.
Good Luck!!
# ^[^\S\r\n]*-?[^\S\r\n]*(?:(?:\.\d+)|(?:\d+(?:,\d+)*(?:\.\d*)?))[^\S\r\n]*%[^\S\r\n]*$
^ # BOL
[^\S\r\n]*
-? # optional -
[^\S\r\n]*
(?: # group
(?: \. \d+ ) # .number
| # or
(?: # group
\d+ # number
(?: , \d+ )* # optional many ,number
(?: \. \d* )? # optional . optional number
) # end group
) # end group
[^\S\r\n]*
% # %
[^\S\r\n]*
$ # EOL