Regex Social Security number validation with dummy characters

I am modifying existing code that displays an SSN. I am trying to figure out the existing validation, although I know next to nothing about regular expressions. What I need to do is refactor the existing validation to ALSO accept dummy characters (probably upper-case "X") in the first 5 places, effectively displaying only the last 4, all without breaking the existing validation. What I pass into the control will depend on roles within the application: either the full number, 000000000, or XXXXX0000. Any suggestions would be greatly appreciated.
<dx:ASPxTextBox ID="SSN" runat="server" CssClass="ContractTextEntry"
MaxLength="9" Width="145px" AutoPostBack="True"
ValidationSettings-RegularExpression-ValidationExpression="^(?!000)(?!666)(?!9)\d{3}([- ]?)(?!00)\d{2}\1(?!0000)\d{4}$">
<MaskSettings Mask="000-00-0000" PromptChar=" " />
<ValidationSettings SetFocusOnError="True">
<RegularExpression ErrorText="Please enter a valid SSN" />
</ValidationSettings>
</dx:ASPxTextBox>

If you just want to accept X as well as a digit in the first 5 positions, then it's a fairly straightforward modification:
^(?!000)(?!666)(?!9)[X0-9]{3}([- ]?)(?!00)[X0-9]{2}\1(?!0000)\d{4}$
All I've done is replace the first two instances of \d (meaning any digit) with [X0-9] (meaning X or a character in the range 0-9).
FYI: the {3} following the first character class means "repeated 3 times" (and the {2} on the second one means "repeated 2 times").
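To see what this modified pattern accepts, here is a small sketch in Python (not the ASP.NET control from the question; the sample values are made up for illustration). It suggests two things worth knowing: mixed values such as X2X4X1234 pass, and XXXXX0000 is still rejected by the existing (?!0000) lookahead.

```python
import re

# Pattern copied from the answer above; only the surrounding code is new.
LOOSE_SSN = re.compile(
    r"^(?!000)(?!666)(?!9)[X0-9]{3}([- ]?)(?!00)[X0-9]{2}\1(?!0000)\d{4}$"
)

def accepts(value):
    """Return True when the loose pattern matches the whole value."""
    return LOOSE_SSN.match(value) is not None

# Hypothetical sample inputs, chosen to probe the edge cases.
samples = ["123456789", "XXXXX1234", "X2X4X1234", "000000000", "XXXXX0000"]
results = {s: accepts(s) for s in samples}

for value, ok in results.items():
    print(value, ok)
```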

Since what you pass in has either X's in all of the first 5 places or digits in all of them, the pattern should enforce exactly that.
I think .NET supports conditionals on a group number, but I'm not sure.
I know it supports conditionals on a group name.
# ^(?!000)(?!666)(?!9)(?:(XXX)|\d{3})([- ]?)(?!00)(?(1)XX|\d{2})\2(?!0000)\d{4}$
^
(?! 000 )
(?! 666 )
(?! 9 )
(?:
( XXX ) # (1), XXX
| \d{3} # Or digits
)
( [- ]? ) # (2), Separator
(?! 00 )
(?(1) # Conditional, did group 1 match ?
XX # yes, get XX
| \d{2} # no, get digits
)
\2 # Backref to separator
(?! 0000 )
\d{4}
$
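As a rough check of the conditional approach, the sketch below runs the compact pattern through Python's re module, which happens to accept the same (?(1)yes|no) numbered-conditional syntax. This is an assumption-laden stand-in for the .NET engine the question targets, and the sample values are invented.

```python
import re

# Compact form of the pattern above, unchanged.
STRICT_SSN = re.compile(
    r"^(?!000)(?!666)(?!9)(?:(XXX)|\d{3})([- ]?)(?!00)(?(1)XX|\d{2})\2(?!0000)\d{4}$"
)

def accepts(value):
    """Return True when the all-X-or-all-digit pattern matches."""
    return STRICT_SSN.match(value) is not None

# Hypothetical inputs: masked, fully numeric, and several mixed forms.
samples = [
    "XXXXX1234", "XXX-XX-1234", "123-45-6789",
    "X2X4X1234", "XXX121234", "123XX1234", "123-45 6789",
]
results = {s: accepts(s) for s in samples}

for value, ok in results.items():
    print(value, ok)
```

The mixed forms fail because group 1 either captured XXX (so XX must follow) or did not participate (so two digits must follow).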

Related

How does this regex for FQDNs (excluding .arpa) work?

I am trying to understand how regex works. I understand it little by little. However, I don't understand this one completely. It's basically a regex for fully qualified domain names but a requirement is that the ending can't be .arpa.
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}[^.arpa]$)
https://regex101.com/r/hU6tP0/3
This doesn't match google.uk. If I change it to:
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{1,63}[^.arpa]$)
It works again.
But this works as well
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}$)
Here is my thought process for
(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}[^.arpa]$)
I see it as this
(?=
Is a positive look ahead (Can someone explain to me what this actually means?) As I understand it now, it just means that the string needs to match the regex.
^.{4,253}$)
Match all characters but it needs to be between 4 and 253 characters long.
(^([a-zA-Z0-9]{1,63}\.)
Start a capture group and make another capture group within. This capture group says that every non special character can be written 1 to 63 times or till the . is written.
+
The previous capture group can be repeated indefinitely, but it should always end with a .. This way the next capture group is started.
[a-zA-Z]{2,63}
Then as many times as you want you can write a to z with upper, but it needs to be between 2 and 63.
[^.arpa]$)
The last characters can't be .arpa.
Can someone tell me where I am going wrong?
This doesn't do what you think it does:
[^.arpa]
All that says is 'ends with one character that isn't ".", "a", "r" or "p"' - it's a negated character class, so it matches exactly one character rather than excluding the string ".arpa".
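A quick way to see the effect is to run the original pattern against a few names. The sketch below uses Python's re module rather than regex101, and example.cuba is a made-up name picked only because it ends in "a".

```python
import re

# The asker's original pattern, unchanged.
ORIGINAL = re.compile(
    r"(?=^.{4,253}$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}[^.arpa]$)"
)

names = ["google.uk", "google.com", "domain.arpa", "example.cuba"]
matched = {name: ORIGINAL.search(name) is not None for name in names}

for name, ok in matched.items():
    print(name, ok)
```

google.uk fails because the final class needs a third character after the dot, and example.cuba fails only because its last letter happens to be in the class.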
You might be thinking of a negative lookahead assertion:
(?!\.arpa)$
But if you're trying to compound multiple criteria in a regex, I'd suggest you're probably using the wrong tool for the job. It ends up complicated and hard to debug, thanks to greedy/non-greedy matching, etc.
Lookaheads (positive or negative) assert what does or doesn't follow the current position without consuming any characters. That can have some unexpected outcomes if the rest of the pattern matches variable widths, because the regex engine will backtrack until it finds something that matches.
A simpler example:
([\w.]+)(?!arpa)$
Applied to:
www.test.arpa
Will it match? What's in the group?
... it will match, because [\w.]+ will consume all of it, and then the lookahead won't "see" anything.
If you use:
([\w]+)\.(?!arpa)
Instead, though, you'll capture www, but you won't match test (even with e.g. the g flag), because the www. doesn't have arpa after it, but the test. does.
https://regex101.com/r/hU6tP0/5
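Both behaviours can be reproduced outside regex101. This is a minimal sketch using Python's re module, where findall plays the role of the g flag.

```python
import re

subject = "www.test.arpa"

# Greedy class swallows everything, so the lookahead runs at the very end
# of the string, where nothing follows and it trivially succeeds.
greedy = re.search(r"([\w.]+)(?!arpa)$", subject)
greedy_group = greedy.group(1) if greedy else None

# Lookahead placed right after the dot: only labels NOT followed by "arpa".
labels = re.findall(r"(\w+)\.(?!arpa)", subject)

print(greedy_group)
print(labels)
```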
It really does get complicated using negative assertions in a pattern as a result. I'd suggest simply not doing so, and applying two separate tests. It's hard for you to figure out, and it's hard for a future maintenance programmer too!
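One way to read "two separate tests" is sketched below in Python: a plain shape check with no lookarounds, followed by an ordinary string test for the .arpa suffix. The function name and the decision to compare case-insensitively are assumptions for illustration, not part of the original answer.

```python
import re

# Shape only: dot-separated alphanumeric labels and an alphabetic last label.
HOSTNAME_SHAPE = re.compile(r"(?:[a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}")

def is_allowed_fqdn(name):
    """Hypothetical helper: length check, shape check, then suffix check."""
    if not 4 <= len(name) <= 253:
        return False
    if HOSTNAME_SHAPE.fullmatch(name) is None:
        return False
    # Second, separate test: reject names whose last label is "arpa".
    return not name.lower().endswith(".arpa")

for candidate in ["google.uk", "domain.arpa", "domain.arpanet"]:
    print(candidate, is_allowed_fqdn(candidate))
```

Each rule can then be read, tested and changed on its own.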
This is an analysis of your regex:
(?=^.{4,253}$) # force min length: 4 chars, max length: 253 chars
( # Capturing Group 1 (CG1) - not needed
^ # Match start of the string
( # CG2 (can be a non capturing group '(?:...)')
[a-zA-Z0-9]{1,63} # any sequence of letters and numbers with length between 1 and 63
\. # a literal dot
)+ # CLOSE CG2
[a-zA-Z]{1,63} # any letter sequence with length between 1 to 63
[^.arpa] # a negated char class: any char that is not a "literal" '.','a','r','p' (last 'a' is redundant)
$ # end of the string
) # CLOSE CG1
To prevent the string from ending in .arpa you need to use a negative lookahead (?!...), so modify it like this:
(?=^.{4,253}$)(?!.*\.arpa$)(^([a-zA-Z0-9]{1,63}\.)+[a-zA-Z]{2,63}$)
An online demo
Update:
I've reworked the regex to simplify it (I've also incorporated Sobrique's suggestion, adding an important detail):
/^(?=.{4,253}$)([a-z0-9]{1,63}[.])+(?!arpa$)[a-z]{2,63}$/i
Compact version online demo
Legend
/ # js regex delimiter
^ # start of the string
(?=.{4,253}$) # force min length: 4 chars, max length: 253 chars
( # Group 1 (G1); could be made non-capturing with (?:...)
[a-z0-9]{1,63} # any letter or digit in a sequence with length from 1 to 63 chars
[.] # a literal dot '.' (more readable than \.)
)+ # CLOSE G1 - repeat its content one or more times
(?!arpa$) # after the last literal dot '.', the string must not end with 'arpa' (I added '$' to Sobrique's suggestion; without it, '.arpanet' would be rejected too)
[a-z]{2,63} # a sequence of letters with length from 2 to 63
$ # end of the string
/i # close the regex delimiter and add the case-insensitive flag, so [a-z] also matches [A-Z]
var re = /^(?=.{4,253}$)([a-z0-9]{1,63}[.])+(?!arpa$)[a-z]{2,63}$/i;
var tests = ['google.uk','domain.arpa','domain.arpa2','another.domain.arpa.net','domain.arpanet'];
var out = document.getElementById("r");
var t;
// pop() walks the list from the last entry to the first
while ((t = tests.pop())) {
out.innerHTML += '"' + t + '"<br/>';
out.innerHTML += 'Valid domain? ' + (re.test(t) ? '<span style="color:green">YES</span>' : '<span style="color:red">NO</span>') + '<br/><br/>';
}
<div id="r"></div>

Regex to Extract Last Part of URL that Contains User ID Strings

I'm having a hard time figuring this one out and could use some help.
I'm using Google Analytics filters to reduce the number of unique pages being reported in our app by stripping out ID strings from the URLs that are coming in.
What I need is a regex that will look for URLs that have these IDs in the URL. Here's what sets them apart from the rest of the URL:
ID strings are always the last part of the URL
ID strings always contain both letters and numbers
ID strings are always either 16- or 32-characters in length
ID strings can show up twice in a URL
ID strings can end with either a "/" or without
Here are some example URLs that show how they appear in our reporting:
/app/6be031b9672be9b5/
/app/admin/client/settings/6be031b9672be9b5
/app/subscribers/ea33fb38c9efc4dc0367819f23434f99/
/app/subscribers/customfieldsettings/0359c487066727ae/
/app/reports/6fa92d36be0e6c16/dc5aa096fba9cbb97eea1dae616d4b3c/
The second part of my question is that this regex should also group everything before these ID strings into a capturing group so that I can call that group later on in the filter, effectively stripping out these ID strings to look like the following:
/app/6be031b9672be9b5/ --> /app/
/app/subscribers/ea33fb38c9efc4dc0367819f23434f99/ --> /app/subscribers/
etc.
I've tried a couple different approaches but none seem to work perfectly, so I could really use the help, thank you!
Here's a solution:
^(.*?)(?:\/[a-zA-Z0-9]{16}|\/[a-zA-Z0-9]{32}){0,2}\/?$
Demo
This will remove the last part or 2 parts of URLs which are 16 or 32 characters long and contain only letters and digits.
You can make sure these parts contain both letters and numbers like this, if the tool supports lookaheads:
^(.*?)(?:\/(?=.{0,15}?\d)(?=.{0,15}?[a-zA-Z])[a-zA-Z0-9]{16}|\/(?=.{0,31}?\d)(?=.{0,31}?[a-zA-Z])[a-zA-Z0-9]{32}){0,2}\/?$
Demo
This adds assertions to the pattern.
Breakdown:
^(.*?) # Start of URL
(?:
\/ # a slash
(?=.{0,15}?\d) # check there's a digit at most 16 chars ahead
(?=.{0,15}?[a-zA-Z]) # check there's a letter at most 16 chars ahead
[a-zA-Z0-9]{16} # check the next 16 chars are digits or letters
| # .. or:
\/ # a slash
(?=.{0,31}?\d) # check there's a digit at most 32 chars ahead
(?=.{0,31}?[a-zA-Z]) # check there's a letter at most 32 chars ahead
[a-zA-Z0-9]{32} # check the next 32 chars are digits or letters
){0,2} # .. at most 2 times
\/?$ # optional slash at end
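To check the pattern against the example URLs, here is a small Python sketch (Google Analytics filters are not involved; the function name is hypothetical). Note that group 1 stops before the slash that precedes the first ID, so the sketch appends "/" to reproduce the form asked for in the question.

```python
import re

# The lookahead version of the pattern above, unchanged.
ID_TAIL = re.compile(
    r"^(.*?)(?:\/(?=.{0,15}?\d)(?=.{0,15}?[a-zA-Z])[a-zA-Z0-9]{16}"
    r"|\/(?=.{0,31}?\d)(?=.{0,31}?[a-zA-Z])[a-zA-Z0-9]{32}){0,2}\/?$"
)

def strip_ids(url):
    """Hypothetical helper: return the URL prefix with trailing IDs removed."""
    match = ID_TAIL.match(url)
    if match is None:
        return url
    return match.group(1) + "/"

urls = [
    "/app/6be031b9672be9b5/",
    "/app/admin/client/settings/6be031b9672be9b5",
    "/app/subscribers/ea33fb38c9efc4dc0367819f23434f99/",
    "/app/reports/6fa92d36be0e6c16/dc5aa096fba9cbb97eea1dae616d4b3c/",
]
for url in urls:
    print(url, "-->", strip_ids(url))
```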
This will do it:
([a-z0-9]+)(?:\/?$)
Demo
Explanation:
([a-z0-9]+) matches and captures the alphanumeric part
(?:\/?$) matches (but doesn't capture) the optional final / and then the end of the string ($)
Modified: I totally missed that there can be 1 or 2 IDs at the end.
Oh well, revised FWIW.
# (?i)^(.*?)/((?:(?=[^/]{0,31}[a-f])(?=[^/]{0,31}[0-9])(?:[a-f0-9]{16}|[a-f0-9]{32})(?:(?:/[a-z])?/?$|/)){1,2})$
(?i) # Case insensitive modifier
^ # BOS, begin the ride ..
( .*? ) # (1), Kreep up on the first ID
/ # Trim this / junk
( # (2 start), 1-2 ID's separated by a /
(?:
(?= [^/]{0,31} [a-f] ) # Use largest range (32), Must be a letter AND number
(?= [^/]{0,31} [0-9] )
(?: # One of 16 or 32 length
[a-f0-9]{16}
| [a-f0-9]{32}
)
(?:
(?: / [a-z] )? # optional / letter
/? $ # /? EOS for end of 1 or 2
| # or,
/ # / between 2 only
)
){1,2}
) # (2 end)
$ # EOS, rides over !!
Sample output:
** Grp 0 - ( pos 195 , len 63 )
/app/reports/6fa92d36be0e6c16/dc5aa096fba9cbb97eea1dae616d4b3c/
** Grp 1 - ( pos 195 , len 12 )
/app/reports
** Grp 2 - ( pos 208 , len 50 )
6fa92d36be0e6c16/dc5aa096fba9cbb97eea1dae616d4b3c/

Matching percentages

I've been trying to enhance some code which determines whether a string is a valid percentage.
I decided that it was time to finally have a hundred problems, and learned regex.
I've been using this web regex tester to build my pattern.
I'm trying to do this rather loosely, such that valid percentages may be integer or decimal, positive or negative, include commas or not, and have any amount of whitespace at the beginning and end, as well as around the optional negative sign and the required percentage sign.
So far, I have \s*-?\s*\d+(,\d+)*(?:\.\d*)?\s*%\s*, which matches almost all of my test cases correctly:
0
0
0
% 0
- 0 %
20948.924780%
315%
2,456,875 %
2,104.86%
89fqyf0gp948y1-%ghghpq98fy92,.?><
, , , ,,,, 0,0,000,00,00,,,0
, , , ,,,, 0,0,000,00,00,,,0%
000000000,00000000000 %
000000000,00000000000,00000000000 %
000000000,00000000000,00000000000,00000000000.00000000000 %
These are not in any particular order, some pass and some fail, but only one is incorrect. In , , , ,,,, 0,0,000,00,00,,,0%, the last 0%\n is a match, but the whole line should be invalid. Start and end indicators do not seem to have the effect I had assumed, as a $ makes only the last example match, while a ^ at the beginning makes no matches register.
It may be something small, but as someone who only learned regex yesterday, it's far beyond my reach.
Thanks!
Start and end indicators do not seem to have the effect I had assumed, as a $ makes only the last example match, while a ^ at the beginning makes no matches register.
Those anchors should be working. However, it depends on the regex engine and its options whether they match at the beginning and end of each line or only at the beginning and end of the whole input. On RegExr, you'd have to check the multiline option: http://regexr.com?380p9 - in code, use the m (multiline) flag.
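The difference is easy to reproduce. The sketch below uses Python's re module, where re.MULTILINE corresponds to the m flag; the two-line input and the simplified pattern are made up for illustration.

```python
import re

text = "5%\n10%"

# Without the flag, ^ and $ only anchor at the ends of the whole input.
without_flag = re.findall(r"^\d+%$", text)

# With the flag, ^ and $ also anchor at every line boundary.
with_flag = re.findall(r"^\d+%$", text, re.MULTILINE)

print(without_flag)
print(with_flag)
```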
It could be done like this.
Edit: So after realizing it's a line thing, this is the regex now.
Note(s) -
Uses multiline mode like Bergi's.
Also, you CANNOT just use the \s whitespace class in this.
No matter which mode is used, \s WILL match CR and LF if it can, which means
-
000,000000.22
%
will match because it satisfies all the conditions.
[^\S\r\n] means "match whitespace except the CR and LF characters". It could be replaced with
[^\S\n] in the real world. The initial input on that tester used \r\n linebreaks.
Good Luck!!
# ^[^\S\r\n]*-?[^\S\r\n]*(?:(?:\.\d+)|(?:\d+(?:,\d+)*(?:\.\d*)?))[^\S\r\n]*%[^\S\r\n]*$
^ # BOL
[^\S\r\n]*
-? # optional -
[^\S\r\n]*
(?: # group
(?: \. \d+ ) # .number
| # or
(?: # group
\d+ # number
(?: , \d+ )* # optional many ,number
(?: \. \d* )? # optional . optional number
) # end group
) # end group
[^\S\r\n]*
% # %
[^\S\r\n]*
$ # EOL
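A short Python sketch of the difference follows. PERCENT_LINE is the compact regex above; LOOSE is an assumption, namely the asker's \s-based pattern with ^ and $ added, and the sample text is a small subset of the test cases.

```python
import re

# Compact regex from above: horizontal whitespace only.
PERCENT_LINE = re.compile(
    r"^[^\S\r\n]*-?[^\S\r\n]*(?:(?:\.\d+)|(?:\d+(?:,\d+)*(?:\.\d*)?))"
    r"[^\S\r\n]*%[^\S\r\n]*$",
    re.MULTILINE,
)
# Assumed variant: the asker's \s-based pattern with anchors added.
LOOSE = re.compile(r"^\s*-?\s*\d+(?:,\d+)*(?:\.\d*)?\s*%\s*$", re.MULTILINE)

text = "315%\n2,104.86%\n- 0 %\n, , 0,0,0%\n0"
strict_hits = [m.group(0) for m in PERCENT_LINE.finditer(text)]

# A value split over three lines: \s happily crosses the line breaks.
split_value = "-\n000,000000.22\n%"
loose_spans_lines = LOOSE.search(split_value) is not None
strict_spans_lines = PERCENT_LINE.search(split_value) is not None

print(strict_hits)
print(loose_spans_lines, strict_spans_lines)
```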

limit expression length

I am using the following in a script of mine to verify minutes entered... it allows for numbers and a comma for thousands in the correct format only... however, I would like to add a length restriction as well... I can't seem to do it or I'm just putting it in the wrong spot... here is the code as is with no limit:
(!preg_match("#^(\d{1,3}(\,\d{3})*|(\d+))$#",$values['minutes']))
I would like to make this at least one with a max of five... the entry is for minutes online per day... well there are only 1440 minutes in a day... if you entered 1,440 which is valid currently that is 5 characters and I want to limit the expression to that...
Anyone?
Two suggestions:
preg_match("#^(?:\d{1,3}|1,?\d{3})$#", $values['minutes'])
Explanation:
^ # Start of string
(?: # Either match...
\d{1,3} # a number of one to three digits
| # or
1 # a four digit number that starts with a 1
,? # and may have a thousands separator
\d{3} # (and three more digits)
)
$ # End of string
The problem is of course that this also allows 1,999, so you'd still need an extra sanity check. This probably is the better solution.
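The "regex for the shape, plain comparison for the range" idea might look like the sketch below. It is written in Python rather than PHP, and the function name and the 1440 limit constant are assumptions taken from the question's "minutes per day" description.

```python
import re

# Same shape as the first suggestion; fullmatch supplies the ^...$ anchoring.
MINUTES_SHAPE = re.compile(r"(?:\d{1,3}|1,?\d{3})", re.ASCII)
MAX_MINUTES_PER_DAY = 1440

def valid_minutes(raw):
    """Hypothetical helper: shape check first, then the numeric sanity check."""
    if MINUTES_SHAPE.fullmatch(raw) is None:
        return False
    return int(raw.replace(",", "")) <= MAX_MINUTES_PER_DAY

for raw in ["999", "1,440", "1,999", "12,34"]:
    print(raw, valid_minutes(raw))
```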
You can also do the range limitation in the regex itself, but that's cumbersome:
preg_match("#^(?:1,?440|1,?4[0-3]\d|1,?[0-3]\d{2}|[1-9]\d{1,2}|\d)$#", $values['minutes'])
Explanation:
^ # Start of string
(?: # Either match...
1,?440 # 1440
| # or
1,?4[0-3]\d # 1400-1439
| # or
1,?[0-3]\d{2} # 1000-1399
| # or
[1-9]\d{1,2} # 10-999
| # or
\d # 0-9
)
$ # End of string
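Because the range is small, the pattern can be checked exhaustively. This Python sketch (not PHP) tries every integer from 0 to 9999 written without leading zeros, plus two comma forms.

```python
import re

# Range-limiting pattern from above; fullmatch supplies the anchoring.
RANGE_0_1440 = re.compile(
    r"(?:1,?440|1,?4[0-3]\d|1,?[0-3]\d{2}|[1-9]\d{1,2}|\d)", re.ASCII
)

def in_range(raw):
    return RANGE_0_1440.fullmatch(raw) is not None

accepted = [n for n in range(10000) if in_range(str(n))]

print(accepted[0], accepted[-1], len(accepted))
print(in_range("1,440"), in_range("1,441"))
```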
You're probably better off just testing the string's length or even the integer value. But just to show that it's possible:
preg_match("#^(\d,\d{3}|\d{1,4})$#", $values['minutes'])
Yes, it's very simple, since a four-digit number can only take one of the forms
one digit, comma, three digits
four digits

Simple regex validation

I want to implement the following validation: match at least 5 digits, with some other characters allowed in between (for example letters and slashes). For example, 12345, 1A/2345, B22226 and 21113C are all valid combinations, but 1234 and AA1234 are not. I know that {5,} gives a minimum number of occurrences, but I don't know how to cope with the other characters. I mean, [0-9A-Z/]{5,} won't work :(. I just don't know where to put the other characters in the regex.
Thanks in advance!
Best regards,
Petar
Using the simplest regex features since you haven't specified which engine you're using, you can try:
.*([0-9].*){5}
|/|\   /|/| |
| | \ / | | +--> exactly five occurrences of the group
| |  |  | +----> end group
| |  |  +------> zero or more of any character
| |  +---------> any digit
| +------------> begin group
+--------------> zero or more of any character
This gives you any number (including zero) of characters, followed by a group consisting of a single digit and any number of characters again. That group is repeated exactly five times.
That'll match any string with five or more digits in it, along with anything else.
If you want to limit what the other characters can be, use something other than .. For example, alphas only would be:
[A-Za-z]*([0-9][A-Za-z]*){5}
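The first pattern can be tried against the examples from the question. This is a minimal sketch using Python's re module, since no engine was specified; the helper name is made up.

```python
import re

# ".*([0-9].*){5}" from above: five digits with anything around them.
AT_LEAST_FIVE_DIGITS = re.compile(r".*([0-9].*){5}")

def has_five_digits(text):
    return AT_LEAST_FIVE_DIGITS.fullmatch(text) is not None

valid = ["12345", "1A/2345", "B22226", "21113C"]
invalid = ["1234", "AA1234"]

for text in valid + invalid:
    print(text, has_five_digits(text))
```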
EDIT: I'm picking up your suggestion from a comment to paxdiablo's answer: This regex now implements an upper bound of five for the number of "other" characters:
^(?=(?:[A-Z/]*\d){5})(?!(?:\d*[A-Z/]){6})[\dA-Z/]*$
will match and return a string that has at least five digits and zero or more of the "other" allowed characters A-Z or /. No other characters are allowed.
Explanation:
^ # Start of string
(?= # Assert that it's possible to match the following:
(?: # Match this group:
[A-Z/]* # zero or more non-digits, but allowed characters
\d # exactly one digit
){5} # five times
) # End of lookahead assertion.
(?! # Now assert that it's impossible to match the following:
(?: # Match this group:
\d* # zero or more digits
[A-Z/] # exactly one "other" character
){6} # six times (change this number to "upper bound + 1")
) # End of assertion.
[\dA-Z/]* # Now match the actual string, allowing only these characters.
$ # Anchor the match at the end of the string.
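Here is a brief Python sketch exercising both bounds of this pattern. The inputs beyond those in the question are invented to hit the five-digit minimum and the five-"other" maximum.

```python
import re

# The lookahead-based pattern from above, unchanged.
BOUNDED = re.compile(r"^(?=(?:[A-Z/]*\d){5})(?!(?:\d*[A-Z/]){6})[\dA-Z/]*$")

def is_valid(text):
    return BOUNDED.match(text) is not None

checks = {
    "1A/2345": is_valid("1A/2345"),          # 5 digits, 2 others
    "AAAAA12345": is_valid("AAAAA12345"),    # exactly 5 others
    "AAAAAA12345": is_valid("AAAAAA12345"),  # 6 others: over the bound
    "1234": is_valid("1234"),                # only 4 digits
    "12345a": is_valid("12345a"),            # lowercase is not allowed
}
for text, ok in checks.items():
    print(text, ok)
```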
You may want to try counting the digits instead. I feel it's much cleaner than writing a complex regex.
>> "ABC12345".gsub(/[^0-9]/,"").size >= 5
=> true
The above says: remove everything that is not a digit, then check the length of what remains. You can do the same thing in your own choice of language. The most fundamental way would be to iterate over the string, counting each character that is a digit until the count reaches 5 (or not), and acting accordingly.
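The iterate-and-count variant could be sketched like this in Python. The function name and the default threshold are assumptions for illustration; it deliberately counts only the ASCII digits 0-9, like the Ruby character class above.

```python
def has_at_least_n_digits(text, n=5):
    """Hypothetical helper: count ASCII digits, stopping early once n are seen."""
    count = 0
    for ch in text:
        if "0" <= ch <= "9":
            count += 1
            if count >= n:
                return True
    return count >= n

for sample in ["ABC12345", "1A/2345", "AA1234"]:
    print(sample, has_at_least_n_digits(sample))
```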