Replacing Credit Card Numbers - regex

I'm using an sql to replace credit card numbers with xxxx and finding that REGEX_REPLACE does not consistently replace everything. Below is the SET command i'm using on the SQL
SET COMMENTS_LONG =
REGEXP_REPLACE (COMMENTS_LONG,'\D[1-6]\d{3}.\d{4}.\d{4}.\d{3}(\d{1}.\d{3})?|\D[1-6]\d{12,15}|\D[1-6]\d{3}.\d{3}.?\d{3}.\d{5}', ' XXXXXXXXXXXXXXXX')
Before
Elizabeth aclled to change address.5430-6000-2111-1931 A
After
Elizabeth aclled to change address XXXXXXXXXXXXXXXX1 A
I tried increasing the number of X but result is the same. I also find that i have to put a space in front of the first X as it appears to move 1 char to the left.

I would't make the regex to specific, that increases the change of accidentially letting real numbers that don't match your expression pass to the end user.
I would just use a simple regex like this:
(\d+-){3}\d+
btw: why did you include \D at the beginning? The . is not part of the creditcard number, right?
Edit: Just found this regex
\b(?:\d[ -]*?){13,16}\b
at this site: http://www.regular-expressions.info/creditcard.html
You should read the paragraph "Finding Credit Card Numbers in Documents"

Related

Regular Expression - Extract Words and number

So I'm using Regexextract in GoogleSheet to find the value for a big amount of data. I have 2 problems I don't know to extract or what I did wrong. Feel free to point out my mistakes and help me with a solution.
Require: I need to extract the part number which format is ABCD#### or ABCD-#### which is is Upper character and numbers follow after, with or w/o "-" , for example KTA1763 or SPD-4124
# I use this formula: =Regexextract(A1,"([A-Z]+-?[0-9]*)") .FYI, the values I'm extracting, it could appear either at the beginning, middle or last.
1.First problem, I have the value as below:
REACH TECH 223/224 list document for KTD2026BEWE-TR
=> Extract result : REACH
[What I need: KTD2026]
2.I have the value as:
information for Part number KTA1550EDS-TR
=> Extract result: P
[What I need: KTA1550]
Please let me know which part in the formula should I fix to have the final expected result. Or how should I alter my formula for that matter, big thanks
go for:
=INDEX(IFNA(REGEXEXTRACT(A1:A, "[A-Z]+\d+")))
Try this in one cell.
=ArrayFormula(IF(A2:A="",,REGEXEXTRACT(REGEXEXTRACT(A2:A, ".+"&REGEXEXTRACT(A2:A, "[0-9]+")), ".+\s(.+)")))

Vim: Placing (,) in between CERTAIN high numbers Issue

source txt file:
34|Gurla Mandhata|7694|25243|2788|Nalakankar Himalaya|30°26'19"N
81°17'48"E|Dhaulagiri|1985|6 (4)|China
command input:
:%s/\(\d\+\)\(\d\d\d\)/\1,\2/g
command output:
34|Gurla Mandhata|7,694|25,243|2,788|Nalakankar Himalaya|30°26'19"N
81°17'48"E|Dhaulagiri|1,985|6 (4)|China
Desired output:
34|Gurla Mandhata|7,694|25,243|2,788|Nalakankar Himalaya|30°26'19"N
81°17'48"E|Dhaulagiri|1985|6 (4)|China
Basically 1985 is supposed to be 1985 and not 1,985. I tried to put a \? so every time the pattern matches it stops and a °+ after so it has to detect a ° to match the pattern, but no success. It just replaces the ° and everything before that, complete mess.
My knowledge of regular expressions however combined with the substitute is weak and I'm stuck here.
EDIT
the first 3 numbers represent heights of mountains, those 3 need to change with a (,) and the last number ( 1985 ) represents a year, which must not be changed.
Mathematical solutions are not going to work as loophole since there are mountains with a height off less than 1900
You haven't told us what is the difference between 1985 and other numbers, so I assumed that your "small" numbers are less than 2000.
You almost got it:
:%s/(\d*[2-90])(\d\d\d)/\1,\2/g
Alternatively if that isn't what you want, you can use c flag (:h s_flags):
:%s/\(\d\+\)\(\d\d\d\)/\1,\2/gc
this line will leave the last 3 columns untouched, just do substitution on the content before it:
%s/\v(.*)((\|[^|]*){3}$)/\=substitute(submatch(1),'\v(\d+)(\d{3})','\1,\2','g').submatch(2)/g
Note that the above line will change 1000000 into 1000,000 instead of 1,000,000. Vim's printf() doesn't support %'d, it is pity. If you do have number > 1m, we can find other solutions.
update
I solved it myself, by using 3 seperate commands; one for every number string in the file:
%s/^\(\d*|[^|]*|\)\(\d\+\)\(\d\d\d\)|/\1\2,\3|/g
:%s/^\(\d*|[^|]*|\d\+,*\d*|\)\(\d\+\)\(\d\d\d\)|/\1\2,\3|/g
:%s/^\(\d*|[^|]*|\d\+,*\d*|\d\+,*\d*|\)\(\d\+\)\(\d\d\d\)|/\1\2,\3|/g
In case you want to use perl:
:%!perl -F'\|' -lane 'for(#F[2..4]) { s/(\d+)(\d{3})/\1,\2/;} print join "|", #F'

RegEx to clean VISA merchant names (remove random strings)

I am trying to develop a ReGex (.Net flavor), which I can use to clean VISA merchant names.
Examples:
Norton *AP1223506209 --> Norton *AP
Norton *AP1223511428
EUROWINGS VYJD6J_123001 --> EUROWINGS
EUROWINGS W6PDFI_125626
AER LINGUCB22QKM2 --> AER LINGUCB
AER LINGUCB248L2W
AIR FRANCE JWNCSC --> AIR FRANCE
AIR FRANCE K8L7TT
PAYPAL *AIRBNB HMQXBW --> PAYPAL *AIRBNB
PAYPAL *AIRBNB HMQXNZ
SAS 1174565172360 --> SAS
SAS 1174565172368
I would like to keep the first "name" part, but remove the second "gibberish" part.
The following Regex works for Norton and Air Lingu as well as for Eurowings and Air France, if they contain numbers in the gibberish part. It totally fails for PAYPAL *AIRBNB and other strings, that don't contain any numbers in the gibberish part, and also for SAS, probably because the name is too short / there are too many spaces:
Search:
([A-z *-]{2,50}[A-z]{2,50})(.{0,3}([0-9-]{0,3}[A-z *+.#-/]{0,3}){1,10})
Replace:
$1
Is there any way to make this work for gibberish parts that don't contain numbers? I have something like this in mind, but don't manage to create an according RegEx:
Group 1 (to keep)
Must contain consonants and vowels
Can contain few numbers, spaces or punctuation signs (e.g.: "7x7: Taxi Service")
Group 2 (to be removed)
Consists of sequences of numbers, letters and optional punctuation signs
OR: consists of consonants, only
OR: consists of numbers, only
Thanks for any help and best regards
Pesche
Edit:
If I add more examples, Lindens solution still works quite well, but does not recognize all of the examples or in some cases too much of the string. I tried to adjust it, but with my lacking skills didn't quite succeed:
https://regex101.com/r/7y9zGl/4
The following problems remain:
with a length of 6 for the last \w, longer patterns would not be matched in full length (e.g. after easyjet and after EMP Merchan). Increasing it, however, causes other strings to be truncated (e.g. AER LINGU, potentially also HOTELS.COM if > 12 was used).
The merchant names after PAYPAL * and GOOGLE * should not be deleted, as they are true merchant names. I tried to exclude strings containing GOOGLE * with a negative lookbehind, but it does not seem to work like that.
Whereas the merchant name after PAYPAL * should generally remain, in some cases it is followed by gibberish, e.g. PAYPAL *AIRBNB HMQXBW. If the negative lookbehind worked, those cases would no longer be cleaned.
if the merchant name is not followed by gibberish, part of the name itself may be deleted (e.g. EMP Merchan)
As the full list of merchant names is long and versatile, the approach to detect "gibberish" should be as generic as possible (i.e. not rely on a certain length of the gibberish part). Hence my original, now slightly modified "pattern":
Consists of sequences of numbers, letters and optional punctuation signs
OR: consists non or very few vowels (EASYJET 000ESJ5TWN -> the gibberish contains only one vowel, EASYJET 3 of them; PAYPAL *NITSCHKE -> NITSCHKE should not be matched, it contains 2 vowels)
OR: consists of numbers, only
Is such a thing even possible? The goal is to use SQL to clean the merchant names. If necessary, this can be done in several run throughs (for different kind of patterns).
Thx again!
Updated regex based on extended sample and desired results:
[\s*<]+\d+$|[\s*<]+(?![A-Z]{6}.*)\w*\d[\w>]*$|\d{6,}$|[\s*<]+[A-Z]{6}$|(?![A-Z]+$)(?<=[A-Z])\w{6}$
Demo
I cannot validate as I'm only on my phone, but can you try something like this?
^([0-9A-Za-z\*][ ]{0-2})
Take all the numbers, the letters (capital and minor) the star and max 2 spaces from the beginning of the line.
Please check the () but I guess the idea is here.
Sorry, it seems wrong when there is no double space.
You want to take all the char until 2 spaces or 2 numbers according to your examples.
.* {2}|.*[0-9]{2}
Is it better?
Regards,
Thomas

Check if cell contains numbers in Google Spreadsheet using RegExMatch

I want to check if specific cell contain only numbers.
I know I should use RegExMatch but I get an error.
This is what I wrote : =if(RegExMatch(H2,[0-9]),"a","b")
I want it to say : write 'a' if H2 contains only numbers, 'b' otherwise.
Thank you
Try this:
=IF(ISNUMBER(H2,"A","B"))
or
=if(isna(REGEXEXTRACT(text(H2,"#"),"\d+")),"b","a")
One reason your match isn't working also - is that it in interpreting your numbers as text. the is number function is a bit more consistent, but if you really need to use regex, then you can see in the second formula where im making sure the that source text is matching against a string.
Your formula is right, simple you forget the double quotes at regexmatch function's regular_expression .
This is the right formula: =if(RegExMatch(B20,"[0-9]"),"a","b")
=REGEXREPLACE(“text”,”regex”,”replacement”)
It spits out the entire content but with the regular expression matched content replaced. =REGEXREPLACE(A2,[0-9],"a")
=REGEXREPLACE(A2,![0-9],"b")//not sure about not sign.
will fill a cell with the same text as A2, but with the 0-9 becoming an a!

EditPad: Need a regex that handles multiple possible data formats

First, I'm using EditPadPro for my regex cleaning, so any answers given should work within that environment.
I get a large spreadsheet full of data that I have to clean every day. I've managed to get it down to a couple of different regexes that I run, and this works... but I'm curious to see if it's possible to reduce down to a single regex.
Here is some sample data:
3-CPC_114851_70095_70095_CAN-bre
3-CPC_114851_70095_70095_CAN
b11-ao1-113775-bre
b7-ao-114441
b7-ao-114441-bre
b7-ao1-114441
b7-ao1-114441-bre
http://go.nlvid.com/results1/?http://bo
go.nlv/results1/?click
b4-sm-1359
b6-sm-1356-bre
1359_195_1453814569-bre
1356_104_1456856729
b15-rad-8905
b15-rad-8905-bre
Here is how the above data needs to end up:
114851-bre
114851
113775-bre
114441
114441-bre
114441
114441-bre
http://go.nlvid.com/results1/
go.nlv/results1/
sm-1359
sm-1356-bre
sm-1359-bre
sm-1356
rad-8905
rad-8905-bre
So, there are numerous rules, such as:
In cases of more than 2 underscores, the result needs to contain only the value immediately after the first underscore, and everything from the dash onwards.
In cases where the string contains "-ao-", "-ao1-", everything prior to the final numeric string should be removed.
If a question mark is present, everything from the mark onwards should be removed.
If the string contains "-sm-" or "-rad-", everything prior to those alpha strings should be removed.
If the string contains 2 underscores, averything after the first numeric string up to a dash
(if present) should be removed, and the string "sm-" should be prepended.
Additionally there is other data that must be left untouched, including but not limited to:
113535|24905|24905
as well as many variations on this pattern of xxxxxx|yyyyy|zzzzz (and not always those string lengths)
This may be asking way too much of regex, I'm not sure as I'm not great with it. But I've seen some pretty impressive things done with it, so I thought I'd put this out to the community and see what you come back with.
Jonathan, I can wrap all of those into one regex, except the last one (where you prepend sm- to a string that does not contain sm). It is not possible in this context, because we cannot capture "sm" to reuse in the replacement, and because there is no "conditional replacement" syntax in EPP.
That being said, you can achieve what you want in EPP with two regexes and one macro to chain the two.
Here is how.
The solution below is tested in EPP.
Regex 1
Press Ctrl + Sh + F to enter Search / Replace mode
Enter the following Search and Replace in the appropriate boxes
At the top right of the Search bar, click the Favorite Searches pull-down, select "Add", give it a name, e.g. Regex 1
Search:
(?mx)^
(?=(?:[^_\r\n]*?_){3})[^_\r\n]+?_([^_\r\n]+)[^-\r\n]+(-[^\r\n]+)?
|
[^\r\n]*?-ao1?-\D*([^\r\n]+)
|
([^\r\n?]*)(?=\?)[^\r\n]+
|
[^\r\n]*?-((?:sm|rad)-[^\r\n]+)
Replace:
\1\2\3\4\5
Regex 2
Same 1-2-3 steps as above.
Search
^(?!(?:[^_\r\n]*?_){3})(?=(?:[^_\r\n]*?_){2})(\d+)(?:[^-\r\n]+(-[^\r\n]+)?)
Replace
sm-\1\2
Chaining Regex 1 and Regex 2
Top menu: Macros, Record Macro, give it a name.
Click the Favorite searches pulldown, select Regex 1
Hit Replace All.
Click the Favorite searches pulldown, select Regex 2
Hit Replace All.
Macros, Stop recording.
Whenever you want to do your sequence of replacements, pull it by name under the Macros menu.
Testing This
I have tested my "Jonathan macro" on your input. Here is the result:
114851-bre
114851
113775-bre
114441
114441-bre
114441
114441-bre
http://go.nlvid.com/results1/
go.nlv/results1/
sm-1359
sm-1356-bre
sm-1359-bre
sm-1356
rad-8905
rad-8905-bre
Try this:
Toggle the Search Panel : SHIFT+CTRL+F
SEARCH: .*?((?:sm-|rad-)?(?:(?:\d+|[\w\.]+\/.*?))(?:-\w+)?$)
REPLACE: $1
Check REGEX and WORDS
Click Replace All or Hit CTRL+ALT+F3
Check the image below: