RegEx to find credit card numbers with embedded spaces - regex

We currently have a content compliance rule in place whereby we monitor anything that contains a credit card number with no spaces (e.g. 5100080000000000)
What we need is for a reg ex to pick up credit card numbers that are entered with spaces every 4 digits (eg: 5100 0800 0000 0000)
We've been looking at alternate reg exs but have not yet found one that works for both scenarios mentioned above.
The current reg ex we use is below
^((4\d{3})|(5[1-5]\d{2})|(6011)|(34\d{1})|(37\d{1}))-?\d{4}-?\d{4}-?\d{4}|3[4,7][\d\s-]{15}$

Just add an optional \s? wherever you already have the optional -?
So your regex becomes
^((4\d{3})|(5[1-5]\d{2})|(6011)|(34\d{1})|(37\d{1}))-?\s?\d{4}-?\s?\d{4}-?\s?\d{4}|3[4,7][\d\s-]{15}$

It seems that you already accept a dash every four characters. Thus you can simply replace -? with [- ]? everywhere.
If you require the dashes or spaces to be consistent - that is, allow no grouping at all, or a dash every four characters, or a space every four characters, you can use a back reference to force the repetitions to be identical to the first match:
^(?:4\d{3}|5[1-5]\d{2}|6011|3[47]\d{2})([- ]?)\d{4}\1\d{4}\1\d{4}$
You will notice I removed the final 3[4,7]... which looked like an erroneous addition, apparently made when attempting to solve this problem partially. Also I changed the parentheses to non-grouping ones (?:...) or simply removed them where no grouping seemed necessary or useful, mainly because this makes it easier to see what the backreference \1 refers to. Finally, the 34.. and 37.. patterns had \d{1} where apparently \d{2} was intended (or if those particular series are only three digits before the first dash, the repetition {1} was just superfluous, but then the 3[4,7]... part would have been even more wrong!)
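As a quick check of the back-reference pattern above, here is a minimal sketch in Python. The use of Python's re module and the names CARD_RE and looks_like_card are assumptions made for illustration; the original question does not say which engine the compliance tool uses.

```python
import re

# Back-reference version from the answer above: group 1 captures the first
# separator (a dash, a space, or nothing) and \1 forces the later separators
# to be identical to it.
CARD_RE = re.compile(
    r"^(?:4\d{3}|5[1-5]\d{2}|6011|3[47]\d{2})([- ]?)\d{4}\1\d{4}\1\d{4}$"
)

def looks_like_card(text: str) -> bool:
    """Return True for a 16-digit number with consistent (or no) grouping."""
    return CARD_RE.match(text) is not None

for sample in ["5100080000000000", "5100 0800 0000 0000",
               "5100-0800-0000-0000", "5100 0800-0000 0000"]:
    print(sample, looks_like_card(sample))
```

The last sample mixes a space and a dash, so the back reference rejects it.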

Won't all these ideas blow up on you as soon as someone uses an AMEX card and enters 3 or 5 numbers instead of 4 in any one 'block'?

((\d+) *(\d+) *(\d+) *(\d+))
That would be the general idea (and it even works!), you can polish it if you want. There is a great page to test your regexp live - http://rubular.com/

Try this:
(\d{4} *\d{4} *\d{4} *\d{4})

Regular Expressions (Regex) 2 expressions: remove spaces before verifying 2nd one

TLDR:
I have the option to add only one regex.
How to make those 2 expressions:
\s
(\d{10})(19|20)(\d{2})$:$1$3
work at the same time (one after another) and not separately?
This is not enough: \s|(\d{10})(19|20)(\d{2})$:$1$3
Long description:
I have an expression: '(\d{10})(19|20)(\d{2})$:$1$3'
What it does:
user password should have 12 digits - ending with last 2 digits of the year
in case phrase has 14 digits (someone added full year) - ignore digits 11th and 12th
Thanks to that we can accept both codes: 308814310175 and 30881431011975.
Now I'm looking for a way to ignore spaces in case user adds them anywhere by mistake (not my requirement).
Theoretically I can just add '|\s', to get '\s|(\d{10})(19|20)(\d{2})$:$1$3'.
Both regexes work separately:
when someone adds full year - it removes 11th and 12th digits
when someone adds space - it removes it
but if someone adds space AND adds full year then only removing of spaces works (because phrase is longer than 14 digits).
So this works:
308814310175
30881431011975
3088143 10119
But this is not working:
3088143101 1975
because it removes space OR 11th/12th digits - not making both things work one after another.
How to make both expressions work at the same time?
Thank you in advance for your help.
A somewhat long solution would be to capture each digit separately, allowing spaces anywhere and skipping the 11th and 12th digits when there are 14 digits in total:
^\s*(\d)\s*(\d)\s*(\d)\s*(\d)\s*(\d)\s*(\d)\s*(\d)\s*(\d)\s*(\d)\s*(\d)\s*(?:1\s*9|2\s*0)?\s*(\d)\s*(\d)\s*$
You would then replace the match with $1$2$3$4$5$6$7$8$9$10$11$12
Another possibility (if supported) could be to replace:
(?:[^\S\n]|(?<=^\s*(?:\d\s*?){10})\s*(?:1\s*9|2\s*0)(?=\s*\d\s*\d\s*$))
With nothing. Note that this requires an engine that supports variable-length lookbehind.
You are trying to solve a simple problem in a complicated way. Instead of using a complicated regex, just use two simple steps:
Remove unwanted spaces.
Apply the regex to validate the string and remove other unwanted characters.
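The two steps above can be sketched as follows, assuming you are able to run code rather than being limited to a single regex field. Python and the function name normalize_code are assumptions for illustration only.

```python
import re
from typing import Optional

def normalize_code(raw: str) -> Optional[str]:
    """Return the 12-digit code, or None if the input is not acceptable."""
    # Step 1: remove any whitespace the user typed by mistake.
    compact = re.sub(r"\s+", "", raw)
    # Step 2: validate; an optional 19/20 century prefix before the last
    # two digits is matched but not captured, so it is dropped.
    match = re.fullmatch(r"(\d{10})(?:19|20)?(\d{2})", compact)
    if match is None:
        return None
    return match.group(1) + match.group(2)

print(normalize_code("3088143101 1975"))
```

Because the whitespace is gone before the second step runs, the 14-digit check no longer depends on where the spaces were.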

Regex how to get a full match of nth word (without using non-capturing groups)

I am trying to use Regex to return the nth word in a string. This would be simple enough using other answers to similar questions; however, I do not have access to any of the code. I can only access a regex input field and the server only returns the 'full match' and cannot be made to return any captured groups such as 'group 1'
EDIT:
From the developers explaining the version of regex used:
"...its javascript regex so should mostly be compatible with perl i
believe but not as advanced, its fairly low level so wasn't really
intended for use by end users when originally implemented - i added
the dropdown with the intention of having some presets going
forwards."
/EDIT
Sample String:
One Two Three Four Five
Attempted solution (which is meant to get just the 2nd word):
^(?:\w+ ){1}(\S+)$
The result is:
One Two
I have also tried other variations of the regex:
(?:\w+ ){1}(\S+)$
^(?:\w+ ){1}(\S+)
But these just return the entire string.
I have tried replicating the behaviour that I see using regex101 but the results seem to be different, particularly when changing around the ^ and $.
For example, I get the same output on regex101 if I use the altered regex:
^(?:\w+ ){1}(\S+)
In any case, none of the comparing has helped me actually achieve my stated aim.
I am hoping that I have just missed something basic!
===EDIT===
Thanks to all of you who have contributed thus far, however, I am still running into issues. I am afraid that I do not know the language or restrictions on the regex other than what I can ascertain through trial and error, therefore here is a list of attempts and results all of which are trying to return "Two" from a sample of:
One Two Three Four Five
\w+(?=( \w+){1}$)
returns all words
^(\w+ ){1}\K(\w+)
returns no words at all (so I assume that \K does not work)
(\w+? ){1}\K(\w+?)(?= )
returns no words at all
\w+(?=\s\w+\s\w+\s\w+$)
returns all words
^(?:\w+\s){1}\K\w+
returns all words
====
With all of the above not working, I thought I would test out some others to see the limitations of the system
Attempting to return the last word:
\w+$
returns all words
This leads me to believe that something strange is going on with the start ^ and end $ characters, perhaps the server puts these in automatically if they are omitted? Any more ideas greatly appreciated.
I don't know if your language supports positive lookbehind, so using your example,
One Two Three Four Five
here is a solution which should work in any engine that supports lookahead:
\w+ match the first word
\w+$ match the last word
\w+(?=\s\w+$) match the 4th word
\w+(?=\s\w+\s\w+$) match the 3rd word
\w+(?=\s\w+\s\w+\s\w+$) match the 2nd word
So if a string contains 10 words :
The first and the last word are easy to find. To find a word at a position, then you simply have to use this rule :
\w+(?= followed by \s\w+ (10 - position) times followed by $)
Example
In this string :
One Two Three Four Five Six Seven Eight Nine Ten
I want to find the 6th word.
10 - 6 = 4
\w+(?= followed by \s\w+ 4 times followed by $)
Our final regex is
\w+(?=\s\w+\s\w+\s\w+\s\w+$)
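The counting rule above can be generated mechanically. Here is a small sketch in Python, used only to build and test the pattern; the helper name nth_word is hypothetical, and it assumes words are separated by single whitespace characters as in the sample.

```python
import re
from typing import Optional

def nth_word(text: str, position: int) -> Optional[str]:
    """Return the word at 1-based position using only the full match."""
    total = len(text.split())
    remaining = total - position  # words that must follow the target
    if position < 1 or remaining < 0:
        return None
    # \w+ followed by exactly `remaining` more words, then end of string.
    pattern = r"\w+(?=(?:\s\w+){%d}$)" % remaining
    match = re.search(pattern, text)
    return match.group(0) if match else None

print(nth_word("One Two Three Four Five Six Seven Eight Nine Ten", 6))
```

The word comes back as the full match (group 0), which is the constraint in the question.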
It's possible to use reset match (\K) to reset the position of the match and obtain the third word of a string as follows:
(\w+? ){2}\K(\w+?)(?= )
I'm not sure what language you're working in, so you may or may not have access to this feature.
I'm not sure if your language does support \K, but still sharing this anyway in case it does support:
^(?:\w+\s){3}\K\w+
to get the 4th word.
^ represents starting anchor
(?:\w+\s){3} is a non-capturing group that matches three words (ending with spaces)
\K is a match reset, so it resets the match and the previously matched characters aren't included
\w+ helps consume the nth word
And similarly,
^(?:\w+\s){1}\K\w+ for the 2nd word
^(?:\w+\s){2}\K\w+ for the 3rd word
^(?:\w+\s){3}\K\w+ for the 4th word
and so on...
So, on the down side, you can't use look behind because that has to be a fixed width pattern, but the "full match" is just the last thing that "full matches", so you just need something whose last match is your word.
With Positive look-ahead, you can get the nth word from the right
\w+(?=( \w+){n}$)
If your server has extended regex, \K can "clear matched items", but most regex engines don't support this.
^(\w+ ){n}\K(\w+)
Unfortunately, Regex doesn't have a standard "match only n'th occurrence", So counting from the right is the best you can do. (Also, Regex101 has a searchable quick reference in the bottom right corner for looking up special characters, just remember that most of those characters are not supported by all regex engines)

Multiple spaces, multiple commas and multiple hypens in alphanumeric regex

I am very new to regex and regular expressions, and I am stuck in a situation where I want to apply a regex on a JSF input field.
Where
alphanumeric
multiple spaces
multiple dot(.)
multiple hyphen (‐)
are allowed, and Minimum limit is 1 and Maximum limit is 5.
And for multiple values - they must be separated by comma (,)
So a Single value can be:
3kd-R
or
k3
or
-4
And multiple values (must be comma separated):
kdk30,3.K-4,ER--U,2,.I3,
By the help of stackoverflow, so far I am able to achieve only this:
(^[a-zA-Z0-9 ]{5}(,[a-zA-Z0-9 ]{5})*$)
Something like
^[-.a-zA-Z0-9 ]{1,5}(,[-.a-zA-Z0-9 ]{1,5})*$
Changes made
[-.a-zA-Z0-9 ] Added - and . to the character class so that those are matched as well.
{1,5} Quantifier, ensures that it is matched minimum 1 and maximum 5 characters
You've done pretty well. You need to add the hyphen and dot to that first character class. Note: since the hyphen denotes ranges within a character class, you need to place it where it positionally cannot specify a range, i.e., first or last in the class. (Placing it where the range merely looks invalid, e.g., 7-., is not enough.) So your first character class would look something like this:
[a-zA-Z 0-9.-]{1,5} or [-a-zA-Z0-9 .]{1,5}
So, we've just defined what one segment looks like. That pattern can reoccur zero or more times. Of course, there are many ways to do that, but I would favor a regex subroutine because this allows code reuse. Now if the specs change or you're testing and realize you have to tweak that segment pattern, you only need to change it in one place.
Subroutines are not supported in BRE or ERE, but most widely-used modern regex engines support them (Perl, PCRE, Ruby, Delphi, R, PHP). They are very simple to use and understand. Basically, you just need to be able to refer to it (sound familiar? refer-back? back-reference?), so this means we need to capture the regex we wish to repeat. Then it's as simple as referring back to it, but instead of \1 which refers to the captured value (data), we want to refer to it as (?1), the capturing expression. In doing so, we've logically defined a subroutine:
([a-zA-Z 0-9.-]{1,5})(,(?1))*
So, the first group basically defines our subroutine and the second group consists of a comma followed by the same segment-definition expression we used for the first group, and that is optional ('*' is the zero-or-more quantifier).
If you operate on large quantities of data where efficiency is a consideration, don't capture when you don't have to. If your sole purpose for using parentheses is to alternate (e.g., \b[bB](asset|eagle)\b hound) or to quantify, as in our second group, use the (?: ... ) notation, which signifies to the regex engine that this is a non-capturing group. Without going into great detail, there is a lot of overhead in maintaining the match locations--not that it's complex, per se, just potentially highly repetitive. Regex engines will match, store the information, then when the match fails, they "give up" the match and try again starting with the next matching substring. Each time they match your capture group, they're storing that information again. Okay, I'm off the soapbox now. :-)
So, we're almost there. I say "almost" because I don't have all the information. But if this should be the sole occupant of the "subject" (line, field, etc.--the data sample you're evaluating), you should anchor it to "assert" that requirement. The caret '^' is beginning of subject, and the dollar '$' is end of subject, so by encapsulating our expression in ^ ... $ we are asserting that the subject matches in its entirety, front-to-back. These assertions have zero-length; they consume no data, only assert a relative position. You can operate on them, e.g., s/^/  / would indent your entire document two spaces. You haven't really substituted the beginning of line with two spaces, but you're able to operate on that imaginary, zero-length location. (Do some research on zero-length assertions [aka zero-width assertions, or look-arounds] to uncover a powerful feature of modern regex. For example, in the previous regex if I wanted to make sure I did not insert two spaces on blank lines: s/^(?!$)/  /)
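To make the zero-length idea concrete, here is a rough Python equivalent of the s/^(?!$)/  / substitution. The function name indent_nonblank is hypothetical, and re.MULTILINE is used so that ^ and $ apply to each line rather than to the whole string.

```python
import re

def indent_nonblank(text: str, prefix: str = "  ") -> str:
    """Insert prefix at the start of every line that is not blank."""
    # ^ asserts start of line and consumes nothing; (?!$) is a negative
    # lookahead that rejects positions where the line ends immediately.
    return re.sub(r"^(?!$)", lambda _match: prefix, text, flags=re.MULTILINE)

print(indent_nonblank("a\n\nb"))
```

Nothing is replaced here; the prefix is inserted at a position the pattern only asserts.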
Also, you didn't say if you need to capture the results to do something with them. My impression was it's validation only, so that's not necessary. However, if it is needed, you can wrap the entire expression in capturing parentheses: ^( ... )$.
I'm going to provide a final solution that does not assume you need to capture but does assume the entire subject should consist of this value:
^([a-zA-Z 0-9. -]{1,5})(?:,(?1))*$
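Python's built-in re module does not support the (?1) subroutine syntax, so the sketch below swaps in plain string composition to get the same "define the segment once" benefit. The names SEGMENT, VALUE_LIST_RE and is_valid are made up for this example, and fullmatch stands in for the ^ ... $ anchors.

```python
import re

# One segment: 1 to 5 characters from letters, digits, space, dot, hyphen.
# The hyphen is last in the class so it cannot be read as a range.
SEGMENT = r"[a-zA-Z 0-9.-]{1,5}"

# A segment, then zero or more ",segment" repetitions (non-capturing).
VALUE_LIST_RE = re.compile(r"%s(?:,%s)*" % (SEGMENT, SEGMENT))

def is_valid(value: str) -> bool:
    """Return True if the whole value is a comma-separated segment list."""
    return VALUE_LIST_RE.fullmatch(value) is not None

for sample in ["3kd-R", "kdk30,3.K-4,ER--U,2,.I3", "abcdef", "a,,b"]:
    print(repr(sample), is_valid(sample))
```

Note that a trailing comma, as in the question's multi-value sample, is rejected by this pattern because every comma must be followed by a segment.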
I know I went on a bit, but you said you were new to regex, so wanted to provide some detail. I hope it wasn't too much detail.
By the way, an excellent resource with tutorials is regular-expressions dot info, and a wonderful regex development and testing tool is regex101 dot com. And I can never say enough about stack overflow!

Finding repetition by Vim's Regex and globbing

How can you find the repeating sequences of at least 30 digits?
Sample of the data
2.375854214123006833712984054669703872437357630979498861047835990888382687927107061503416856492027334851936218678815489749430523917995444191343963553530751708428246013667425968109339407744874715261958997722095671981776765375854214123006833712984054669703872437357630979498861047835990888382687927107061503416856492027334851936218678815489749430523917995444191343963553530751708428246013667425968109339407744874715261958997722095671981776765375854214123006833712984054669703872437357630979498861047835990888382687927107061503416856492027334851936218678815489749430523917995444191343963553530751708428246013667425968109339407744874715261958997722095671981776765375854214123006833712984054669703872437357630979498861047835990888382687927107061503416856492027334851936218678815489749430523917995444191343963553530751708428246013667425968109339407744874715261958997722095671981776765375854214123006833712984054669703872437357630979498861047835990888382687927107061503416856492027334851936218678815489749430523917995444191343963553530751708428246013667425968109339407744874715261958997722095671981776765375854214123006833712984054669703872437357630979498861047835990888382687927107061503416856492027334851936218678815489749430523917995444191343963553530751708428246013667425968109339407744874715261958997722095671981776765375854214123006833712984054669703872437357630979498861047835990888382687927107061503416856492027334851936218678815489749430523917995444191343963553530751708428246013667425968109339407744874715261958997722095671981776765375854214123006833712984054669703872437357630979498861047835990888382687927107061503416856492027334851936218678815489749430523917995444191343963553530751708428246013667425968109339407744874715261958997722095671981776765375854214123006833712984054669703872437357630979498861047835990888382687927107061503416856492027334851936218678815489749430523917995444191343963553530751708428246013667425968109339407744874715261958997722095671981776765375854214123006833712984054
7
My attempt in Vim
:g/\(\d\{4}\)\[^\1\]\1/
|
|----------- Problem here!
I do not know how you can have the negation of the first glob.
First of all, to find your repeating numbers, you can use this simple search:
/\(\d\{5\}\).\{-}\1
This search finds repetitions of 5 digits. Unfortunately, vim highlights from the start of the 5 digit number to the end of the repetition - including every digit in between - and this makes it hard to see what the 5 digit number is. Also, because your number sequence repeats so much, the whole thing is highlighted because there are repeats all the way through.
You will probably find it's more useful to use :set incsearch and type /\(\d\{5\}\).\{-}\1 or /\(\d\{5\}\)\ze.\{-}\1 without hitting enter so you can see what the digits are.
This command might be more useful to you:
:syn region repeatSection matchgroup=Search start=/\z(\d\{30}\)/ matchgroup=Error end=/\z1/ oneline
This will highlight a sequence of 30 digits in yellow (first time it is seen) or red (when it is repeated). Note that this only works for a single line of text (multi-line isn't possible).
How about :g/\(\d\{30,\}\{2,\}\)/?
I'm not sure why you need the negation. /\(\d\{4\}\)\1/ will match a sequence of (exactly) four digits, repeated once. You probably actually want something like /\(\d\{30,\}\)\1/ to get your "at least 30". This appears to work for me, unless I've misunderstood what you're trying to search for. Note that since the regex are greedy, you will get the longest possible repeated sequence.
If it helps you on the way, the appropriate way to make sure that the following set of characters aren't the same as what is stored in back-reference #1 would be (?!\1). Note that the (?!) (negative look-ahead) group is a zero-width assertion (i.e., it will not change the position of the cursor, it just checks whether the regex should fail or not.)
Whether that is supported by the regex engine you're using, I don't know.
Update
I just had a quick sketch on paper, and something along these lines might work in PCRE... but I haven't tested it and can't right now, but maybe it'll give you some ideas:
(?=(\d{30}))\d(?=\d{29,}?\1)
To ensure that I understood you correctly, the purpose of the above regex would be to match any sequence of 30 digits that also exists later in the whole string being searched.
My thoughts for the above regex were these:
First I want to match a sequence of 30 digits, but I don't want to consume them since I want to check 1 digit later (not 30) next time. Therefore I use a look-ahead with a capturing group that stores the next 30 digits.
Then I consume one digit to ensure I don't match the 30 digits with themselves.
Then I match at least 29 digits (which means I'll be starting on the digit just outside the current sequence of digits) with a non-greedy quantifier, so that it will try 30, then 31, etc.
Then I match the 30 digits I'm currently testing. If they exist later in the sequence, the regular expression will succeed; otherwise, it will fail.
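The untested PCRE idea above can be tried out in Python, whose re module accepts the same lookahead and back-reference syntax. This is only a sketch; the function name first_repeated_run and the parameter n are assumptions for illustration, and the pattern may be slow on very long inputs.

```python
import re
from typing import Optional

def first_repeated_run(digits: str, n: int = 30) -> Optional[str]:
    """Return the first run of n digits that occurs again later, or None."""
    # (?=(\d{n}))  capture the next n digits without consuming them
    # \d           consume one digit so the run cannot match itself
    # (?=\d{n-1,}?\1)  skip at least n-1 digits, then require the same run
    pattern = r"(?=(\d{%d}))\d(?=\d{%d,}?\1)" % (n, n - 1)
    match = re.search(pattern, digits)
    return match.group(1) if match else None

block = "123456789012345678901234567890"
print(first_repeated_run(block + "9999" + block))
```

Because at least n-1 digits are skipped after the consumed one, the second occurrence never overlaps the first.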
This command will match lines with 123451234 but not 111111111
:g/\(\d\{4}\)\1\@!.\1/
\1\@!. uses a negative lookahead to say "make sure this position doesn't match (\@!) group 1 (\1), then consume a character (.)"

Regular expression greedy match not working as expected

I have a very basic regular expression that I just can't figure out why it's not working so the question is two parts. Why does my current version not work and what is the correct expression.
Rules are pretty simple:
Must have minimum 3 characters.
If a % character is the first character must be a minimum of 4 characters.
So the following cases should work out as follows:
AB - fail
ABC - pass
ABCDEFG - pass
% - fail
%AB - fail
%ABC - pass
%ABCDEFG - pass
%%AB - pass
The expression I am using is:
^%?\S{3}
Which to me means:
^ - Start of string
%? - Greedy check for 0 or 1 % character
\S{3} - 3 other characters that are not white space
The problem is, the %? for some reason is not doing a greedy check. It's not eating the % character if it exists so the '%AB' case is passing which I think should be failing. Why is the %? not eating the % character?
Someone please show me the light :)
Edit: The answer I used was Dav below: ^(%\S{3}|[^%\s]\S{2})
Although it was a 2 part answer and Alan's really made me understand why. I didn't use his version of ^(?>%?)\S{3} because it worked but not in the javascript implementation. Both great answers and a lot of help.
The word for the behavior you described isn't greedy, it's possessive. Normal, greedy quantifiers match as much as they can originally, but back off if necessary to allow the whole regex to match (I like to think of them as greedy but accommodating). That's what's happening to you: the %? originally matches the leading percent sign, but if there aren't enough characters left for an overall match, it gives up the percent sign and lets \S{3} match it instead.
Some regex flavors (including Java and PHP) support possessive quantifiers, which never back off, even if that causes the overall match to fail. .NET doesn't have those, but it has the next best thing: atomic groups. Whatever you put inside an atomic group acts like a separate regex--it either matches at the position where it's applied or it doesn't, but it never goes back and tries to match more or less than it originally did just because the rest of the regex is failing (that is, the regex engine never backtracks into the atomic group). Here's how you would use it for your problem:
^(?>%?)\S{3}
If the string starts with a percent sign, the (?>%?) matches it, and if there aren't enough characters left for \S{3} to match, the regex fails.
Note that atomic groups (or possessive quantifiers) are not necessary to solve this problem, as @Dav demonstrated. But they're very powerful tools which can easily make the difference between impossible and possible, or too damn slow and slick as can be.
Regex will always try to match the whole pattern if it can - "greedy" doesn't mean "will always grab the character if it exists", but instead means "will always grab the character if it exists and a match can be made with it grabbed".
Instead, what you probably want is something like this:
^(%\S{3}|[^%\s]\S{2})
Which will match either a % followed by 3 characters, or a non-%, non-whitespace followed by 2 more.
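To see the difference on the cases listed in the question, here is a short Python check. Python is an assumption here (the question mentions JavaScript), and the names NAIVE, FIXED and EXPECTED are invented for the example.

```python
import re

NAIVE = re.compile(r"^%?\S{3}")               # pattern from the question
FIXED = re.compile(r"^(%\S{3}|[^%\s]\S{2})")  # alternation from this answer

# Expected results taken from the list of cases in the question.
EXPECTED = {
    "AB": False, "ABC": True, "ABCDEFG": True, "%": False,
    "%AB": False, "%ABC": True, "%ABCDEFG": True, "%%AB": True,
}

for text, should_pass in EXPECTED.items():
    naive = NAIVE.search(text) is not None
    fixed = FIXED.search(text) is not None
    print(text, "naive:", naive, "fixed:", fixed, "expected:", should_pass)
```

The naive pattern accepts %AB because %? gives the percent sign back so that \S{3} can match it; the alternation leaves no such fallback.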
I always love to look at RE questions to see how much time people spend on them to "Save time"
str.len() >= (str[0] == '%' ? 4 : 3)
Although in real life I'd be more explicit, I just wrote it that way because for some reason some people consider code brevity an advantage (I'd call it an anti-advantage, but that's not a popular opinion right now)
Try the regex modified a little based on Dav's original one:
^(%\S{3,}|[^%\s]\S{2,})
with the regex option "^ and $ match at line breaks" on.