Regular Expression to Anonymize Names


I am using Notepad++ and the Find and Replace pattern with regular expressions to alter usernames such that only the first and last character of the screen name is shown, separated by exactly four asterisks (*). For example, "albobz" would become "a****z".
Usernames are listed directly after the cue "screen_name: " and I know I can find all the usernames using the regular expression:
screen_name:\s([^\s]+)
However, this expression won't store the first or last letter and I am not sure how to do it.
Here is a sample line:
February 3, 2018 screen_name: FR33Q location: Europe verified: false lang: en

Method 1
You have to work with \G meta-character. In N++ using \G is kinda tricky.
Regex to find:
(?>(screen_name:\s+\S)|\G(?!^))\S(?=\S)
Breakdown:
(?> Construct a non-capturing group (atomic)
( Beginning of first capturing group
screen_name:\s+\S Match up to first letter of name
) End of first CG
| Or
\G(?!^) Continue from previous match
) End of NCG
\S Match a non-whitespace character
(?=\S) Up to last but one character
Replace with:
\1*
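JavaScript's regex flavor has neither \G nor atomic groups, so the find/replace above cannot be pasted into it as-is. As a rough sketch of the same length-preserving result, the following swaps in a different technique, a replacer callback; the variable names are illustrative and not part of the original answer.

```javascript
// Sample line from the question.
const line =
  "February 3, 2018 screen_name: FR33Q location: Europe verified: false lang: en";

// Group 1: the cue plus the first character of the name.
// Group 2: the inner characters; the lookahead leaves the last character unmatched.
const masked = line.replace(
  /(screen_name:\s+\S)(\S*)(?=\S)/g,
  (_match, head, inner) => head + "*".repeat(inner.length)
);

console.log(masked);
// February 3, 2018 screen_name: F***Q location: Europe verified: false lang: en
```

The callback emits one * per inner character, which is what the \G-based pattern achieves match by match in Notepad++.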
Method 2
The above solution substitutes each inner character with a *, so the length remains intact. If you want to put exactly four *s regardless of length, you would search for:
(screen_name:\s+\S)(\S*)(\S)
and replace with: \1****\3
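For anyone trying Method 2 outside Notepad++, here is a minimal JavaScript sketch of the same search and replace. Note that JavaScript writes group references in the replacement as $1 and $3 rather than \1 and \3; the helper name anonymize is made up for this example.

```javascript
// Keep the first and last character of the name and put exactly four
// asterisks between them, whatever the original length.
const anonymize = (text) =>
  text.replace(/(screen_name:\s+\S)(\S*)(\S)/g, "$1****$3");

console.log(anonymize("screen_name: albobz"));
// screen_name: a****z
console.log(anonymize("February 3, 2018 screen_name: FR33Q location: Europe"));
// February 3, 2018 screen_name: F****Q location: Europe
```

Because the pattern needs at least two non-whitespace characters, a one-character name is left unchanged.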

Related

Match string between delimiters, but ignore matches with specific substring

I have to parse all the text inside parentheses, except text that contains "GST".
e.g:
(AUSTRALIAN RED CROSS – ATHERTON)
(Total GST for this Invoice $1,104.96)
today for a quote (07) 55394226 − admin.nerang#waste.com.au − this applies to your Nerang services.
expected parsed value:
AUSTRALIAN RED CROSS – ATHERTON
I am trying:
^\(((?!GST).)*$
But it's only matching the value and not grouping correctly.
https://regex101.com/r/HndrUv/1
What would be the correct regex for the same?

This regex should work to get the expected string:
^\((?!.*GST)(.*)\)$
It first checks that the line does not contain the regular expression .*GST. If that holds, it then captures the entire text.
(?!.*GST)(.*)
All of that is then surrounded by \( and \) so that the parentheses themselves stay outside the capturing group.
\((?!.*GST)(.*)\)
Finally you add the BOL and EOL symbols and you get the result.
^\((?!.*GST)(.*)\)$
The expected value is saved in the first capture group (.*).
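As a quick check of this pattern, here is a small JavaScript sketch that runs it against the three sample lines from the question, one line at a time. The variable names are only for illustration.

```javascript
// The three sample lines from the question.
const lines = [
  "(AUSTRALIAN RED CROSS – ATHERTON)",
  "(Total GST for this Invoice $1,104.96)",
  "today for a quote (07) 55394226 − admin.nerang#waste.com.au − this applies to your Nerang services.",
];

const pattern = /^\((?!.*GST)(.*)\)$/;

// Keep group 1 of every line that matches.
const parsed = lines
  .map((line) => line.match(pattern))
  .filter(Boolean)
  .map((m) => m[1]);

console.log(parsed); // only the first line's inner text remains
```

The second line is rejected by the lookahead, and the third never matches because it neither starts with ( nor ends with ).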

You can use
^\((?![^()]*\bGST\b)([^()]*)\)$
See the regex demo. Details:
^ - start of string
\( - a ( char
(?![^()]*\bGST\b) - a negative lookahead that fails the match if, immediately to the right of the current location, there are zero or more chars other than ) and ( and then GST as a whole word (remove \bs if you do not need whole word matching)
([^()]*) - Group 1: any zero or more chars other than ) and (
\) - a ) char
$ - end of string
Bonus:
If substrings in longer texts need to be matched, too, you need to remove ^ and $ anchors in the above regex.
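To illustrate the bonus note, here is a rough JavaScript sketch of the unanchored variant applied to the sample text as one string. It assumes a runtime with String.prototype.matchAll; the variable names are illustrative.

```javascript
const lines = [
  "(AUSTRALIAN RED CROSS – ATHERTON)",
  "(Total GST for this Invoice $1,104.96)",
  "today for a quote (07) 55394226 − admin.nerang#waste.com.au − this applies to your Nerang services.",
];
const text = lines.join("\n");

// Same pattern as above with ^ and $ removed, plus the g flag for matchAll.
const pattern = /\((?![^()]*\bGST\b)([^()]*)\)/g;

// Collect group 1 of every parenthesized chunk without the whole word GST.
const found = [...text.matchAll(pattern)].map((m) => m[1]);

console.log(found); // the Red Cross text and "07"
```

Without the anchors the pattern also picks up (07) from the third line, so keep ^ and $ if only whole-line matches are wanted.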

Regex to extract static text and number using only regular expression

I am completely new to regular expressions.
But I tried to write a regular expression to get some static text and a phone number from the text below
"password":"password123:cityaddress:mailaddress:9233321110:gender:45"
I wrote the following to extract this: "password":9233321110
(([\"]password[\"][\s]*:{1}[\s]*))(\d{10})?
regex link for demo:
https://regex101.com/r/2vNpMU/2
The correct regexp should give the full match "password":9233321110 in the regex tool.
I am not using any programming language here, this is for network packet capture at F5 level.
Please help me with the regexp.

I would use /^([^:]+)(?::[^:]+){3}:([^:]+)/ for this.
Explained (more detailed explanation at regex101):
^ matches from the start of the string
(…) is the first capture group. This will collect that initial "password"
[^:]+ matches one or more non-colon characters
(?:…) is a non-capturing group (it collects nothing for later)
:[^:]+ matches a colon and then 1+ non-colons
{3} instructs us to match the previous item (the non-capturing group) 3 times
: matches a literal colon
([^:]+) captures a match of 1+ non-colons, which will get us 9233321110 in this example
The first capture group is typically stored as $1 or the first item of the returned array. (In JavaScript, the zeroth item is the full match and item index 1 is the first capture group.) The second capture group is $2, etc.
To always match the "password" key, hard-code it: /^("password")(?::[^:]+){3}:([^:]+)/
Here's a live snippet demonstrating it:
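// Sample input from the question
const x = `"password":"password123:cityaddress:mailaddress:9233321110:gender:45"`;
// Group 1 is the key, group 2 is the fifth colon-separated field
const match = x.match(/^([^:]+)(?::[^:]+){3}:([^:]+)/);
if (match) console.log(match[1] + ":" + match[2]);
else console.log("no match");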
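The hard-coded variant mentioned above can be sketched the same way. The function name extractPasswordNumber and the second test string are made up for this example.

```javascript
// Returns `"password":<fifth field>` or null when the key is not "password".
function extractPasswordNumber(input) {
  const m = input.match(/^("password")(?::[^:]+){3}:([^:]+)/);
  return m ? `${m[1]}:${m[2]}` : null;
}

const sample = `"password":"password123:cityaddress:mailaddress:9233321110:gender:45"`;
console.log(extractPasswordNumber(sample)); // "password":9233321110

// Hypothetical input with a different key: no match, so null.
console.log(extractPasswordNumber(`"username":"u:a:b:123:x"`)); // null
```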

Regular expressions in notepad++ (Search and Replace)

I have a list of thousands of records within a .txt document.
some of them look like these records
201910031044 "00059" "11.31AG" "Senior Champion"
201910031044 "00060" "GBA146" "Junior Champion"
201910031044 "00999" "10.12G" "ProAM"
201910031044 "00362" "113.1LI" "Abcd"
Whenever a record similar to this occurs I'd like to get rid of the last words/numbers/etc in the last quotation marks (like "Senior Champion", "Junior Champion" etc. There are many possibilities here)
e.g. (before)
201910031044 "00059" "11.31AG" "Senior Champion"
after
201910031044 "00059" "11.31AG"
I tried the following regex but it wouldn't work.
Search: ^([0-9]{17,17} + "[0-9]{8,8}" + "[a-zA-Z0-9]").*$
Replace: \1 (replace string)
OK, I forgot the . (dot) sign; however, even without a . (dot) sign it would not work. I'm not sure whether it has anything to do with using the + sign more than once.

I'd like to get rid of the last words/numbers/etc in the last quotation marks
This does the job:
Ctrl+H
Find what: ^.+\K\h+".*?"$
Replace with: LEAVE EMPTY
CHECK Wrap around
CHECK Regular expression
UNCHECK . matches newline
Replace all
Explanation:
^ # beginning of line
.+ # 1 or more any character but newline
\K # forget all we have seen until this position
\h+ # 1 or more horizontal spaces
".*?" # something inside quotes
$ # end of line
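Outside Notepad++, \K and \h are not always available (JavaScript has neither). As a sketch under that assumption, a capture group can stand in for \K and [ \t] can approximate \h; this is a swapped-in equivalent, not the expression above.

```javascript
// Records from the question, one per line.
const records = [
  '201910031044 "00059" "11.31AG" "Senior Champion"',
  '201910031044 "00060" "GBA146" "Junior Champion"',
  '201910031044 "00999" "10.12G" "ProAM"',
  '201910031044 "00362" "113.1LI" "Abcd"',
].join("\n");

// Group 1 holds the part that \K would "forget"; it is put back via $1.
// The m flag makes ^ and $ work per line.
const trimmed = records.replace(/^(.+)[ \t]+".*?"$/gm, "$1");

console.log(trimmed);
// 201910031044 "00059" "11.31AG"
// 201910031044 "00060" "GBA146"
// 201910031044 "00999" "10.12G"
// 201910031044 "00362" "113.1LI"
```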

The RegEx looks for the 4th double quote:
^(?:[^"]*\"){4}([^|]*)
You can see this demo: https://regex101.com/r/wJ9yS6/163
You will still need to parse the lines, so it is probably easier to open the file in Excel or to parse it as CSV using code.

You have a problem with the count of your characters:
you specify that the line should start with exactly 17 digits ([0-9]{17,17}). However, there are only 12 digits in the data 201910031044.
you can specify exactly 12 digits by using {12} or if it could be 12-17, then {12,17}. I'll assume exactly 12 based on the current data.
similarly, for the second column you specify that it's exactly 8 digits surrounded by quotes ("[0-9]{8,8}") but it only has 5 digits surrounded by quotes.
again, you can specify exactly 5 with {5} or 5-8 with {5,8}. I will assume exactly 5.
finally, there is no quantifier for the final field, so the regex tries to match exactly one character that is a letter or a number surrounded by quotes "[a-zA-Z0-9]".
I'm not sure if there is any limit on the number of characters, so I would go with one or more using + as quantifier "[a-zA-Z0-9]+" - if you can have zero or more, then you can use *, or if it's any other count from m to n, then you can use {m,n} as before.
This is not a character count problem, but the final column can also contain dots, which the regex doesn't account for. You can just add . inside the square brackets and it will match only literal dot characters. A dot is usually a wildcard, but it loses its special meaning inside a character class ([]), so you get "[a-zA-Z0-9.]+"
Putting it all together, you get
Search: ^([0-9]{12} + "[0-9]{5}" + "[a-zA-Z0-9.]+").*$
Replace: \1
Which will get rid of anything after the third field in Notepad++.
This can be shortened a bit by using \d instead of [0-9] for digits and \s+ for whitespace instead of +. As a benefit, \s will also match other whitespace like tabs, so you don't have to manually account for those. This leads to
Search: ^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$
Replace: \1
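To see the corrected search and replace in action outside Notepad++, here is a small JavaScript sketch. JavaScript uses $1 instead of \1 in the replacement, and the third input line is a hypothetical non-record added only to show that lines not fitting the field format are left alone.

```javascript
const input = [
  '201910031044 "00059" "11.31AG" "Senior Champion"',
  '201910031044 "00362" "113.1LI" "Abcd"',
  'not-a-record "keep me"', // hypothetical line that does not fit the format
].join("\n");

// g: replace every record; m: let ^ and $ match at each line.
const cleaned = input.replace(
  /^(\d{12}\s+"\d{5}"\s+"[a-zA-Z0-9.]+").*$/gm,
  "$1"
);

console.log(cleaned);
// 201910031044 "00059" "11.31AG"
// 201910031044 "00362" "113.1LI"
// not-a-record "keep me"
```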

If you want to get rid of the last quoted field, you could capture everything before it in a group, then match the last pair of quotation marks and everything between them using a negated character class, and replace the match with the group.
If what is between the values can be spaces or tabs, you could use [ \t]+ to match those (using \s could also match a newline)
Note that {17,17} and {8,8} may also be written as {17} and {8} which in this case should be {12} and {5}
^([0-9]{12}[ \t]+"[0-9]{5}"[ \t]+"[a-zA-Z0-9.]+")[ \t]{2,}"[^"\r\n]+"
In parts
^ Start of string
( Capture group 1
[0-9]{12}[ \t]+ Match 12 digits and 1+ spaces or tabs
"[0-9]{5}"[ \t]+ Match 5 digits between " and 1+ spaces or tabs
"[a-zA-Z0-9.]+" Match 1+ times any of the listed between "
) Close group
[ \t]{2,} Match 2+ spaces or tabs
"[^"\r\n]+" Match the last quoted field: 1+ chars other than " or a newline, between "
In the replacement, use group 1: $1

RegEx for matching uppercase and dash followed by a comma

Trying to remove strings that follow this pattern: the tag starts with "Size:" and, before the next COMMA (,), includes the - character.
Example:
Size: XS-S-M-L-XL-2XL,
or
Size: XS-S-M,
etc.
WOULD get selected (including ,)
but Size_S, would be ignored because there is no -
I'm close with:
Size:(.*)-*(.?),
But still not stopping at ,
Here is 1 line of tags:
Athletics, Fitted, Mesh, Feature_Moisture Wicking, Material_Polyester 100%, , Material_Polyester 100%, Material_Polyester Over 50%, School, Style_Short Sleeves, Size_2XL, Size_L, Size_M, Size_S, Size_XL, Size_XS, Size: XS-S-M-L-XL-2XL, Uniforms, Unisex, V-Neck, VisibleLogos, Youth
The goal is to remove all size 'range' tags from my cells and leave only the single size tags.
Solution can be found here: regex101.com/r/VuTzba/1

In your pattern Size:(.*)-*(.?), you are first matching until the end of the string using (.*).
After that the hyphen -* and single character in the group (.?) are optional so it will backtrack until the last comma as that is the only character that has to be matched.
To get a more exact match, you could use a repeating pattern to match the sizes:
Size: (?:\d*X[SL]|L|M|S)(?:-(?:\d*X[LS]|L|M|S))*,
Explanation
Size: Match Size followed by a space
(?: Non capturing group
\d*X[SL]|L|M|S match one of the listed items in the alternation
) Close group
(?: Non capturing group
-(?:\d*X[LS]|L|M|S) Match a hyphen followed by any of the listed items
)*, Close group and repeat 0+ times and match a comma
A broader pattern could use a character class listing all the allowed characters, Size: [XSML\d]+(?:-[XSML\d]+)*, or simply match until the first comma with Size:[^,]+,
Edit
To also match Size: 28W-30W-32W-34W-36W-38W-40W, or Size: 28W-30W-32W-34W at the end of the string, you could extend the alternation by adding |\d+W to it and end the pattern by either matching a comma or asserting the end of the string with $
Size: (?:\d*X[SL]|L|M|S|\d+W)(?:-(?:\d*X[LS]|L|M|S|\d+W))*(?:,|$)
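As a rough illustration of this extended pattern, the JavaScript sketch below strips the range tags from a shortened tag line. The tag line is abbreviated from the question with a waist-size range appended, and the split/trim cleanup at the end is an extra step not part of the answer.

```javascript
// Extended pattern from above, with the g flag to remove every range tag.
const sizeRange =
  /Size: (?:\d*X[SL]|L|M|S|\d+W)(?:-(?:\d*X[LS]|L|M|S|\d+W))*(?:,|$)/g;

// Shortened, partly hypothetical tag line for illustration.
const tags =
  "Size_S, Size_XL, Size_XS, Size: XS-S-M-L-XL-2XL, Uniforms, Size: 28W-30W-32W";

const cleaned = tags.replace(sizeRange, "");

// Tidy the leftover separators into a list of the remaining tags.
const remaining = cleaned
  .split(",")
  .map((tag) => tag.trim())
  .filter(Boolean);

console.log(remaining); // [ 'Size_S', 'Size_XL', 'Size_XS', 'Uniforms' ]
```

The single-size tags such as Size_S survive because the pattern requires the literal "Size: " prefix with a colon and a space.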

We might want to add more boundaries in our expression here. Let's start with something similar to:
Size:\s+([A-Z0-9-]+),
where the capturing group () collects our desired data.
RegEx
If this expression wasn't desired, it can be modified or changed in regex101.com.
RegEx Circuit
jex.im also helps to visualize the expressions.

Size:\s*(.*?), will grab everything after the colon and before the next comma skipping leading white space.

Match 3 and 4 delimiters and between them; not less not more

I have a command-line program that its first argument ( = argv[ 1 ] ) is a regex pattern.
./program 's/one-or-more/anything/gi/digit-digit'
So I need a regex to check whether the input entered by the user is correct. This regex can be solved easily, but since I use the C++ library and std::regex_match, and this function by default puts begin and end assertions (^ and $) around the given string, the non-greedy quantifier is ignored.
Let me clarify the subject. If I want to match /anything/ then I can use /.*?/ but std::regex_match considers this pattern as ^/.*?/$ and therefore if the user enters: /anything/anything/anything/ the std::regex_match still returns true whereas the input pattern is not correct. The std::regex_match only returns true or false, and the expected input from the user can only be a text according to the pattern. Since the pattern varies, I cannot list all possibilities here, but here are some examples.
Should match
/.//
s/.//
/.//g
/.//i
/././gi
/one-or-more/anything/
/one-or-more/anything/g/3
/one-or-more/anything/i
/one-or-more/anything/gi/99
s/one-or-more/anything/g/4
s/one-or-more/anything/i
s/one-or-more/anything/gi/54
and anything look like this pattern
Rules:
delimiters are /|##
s letter at the beginning and g, i and 2 digits at the end are optional
std::regex_match function returns true if the entire target character sequence can be match, otherwise return false
between first and second delimiter can be one-or-more +
between second and third delimiter can be zero-or-more *
between third and fourth can be g or i
At least 3 delimiter should be match /.// not less so /./ should not be match
ECMAScript 262 is allowed for the pattern
NOTE
You may want to see my question about std::regex_match:
std::regex_match and lazy quantifier with strange behavior
I don't need any C++ code, I just need a pattern.
Do not try d?([/|##]).+?\1.*?\1[gi]?[gi]?\1?d?\d?\d?. It fails.
My attempt so far: ^(?!s?([/|##]).+?\1.*?\1.*?\1)s?([/|##]).+?\2.*?\2[gi]?[gi]?\d?\d?$
If you are willing to try, you should put ^ and $ around your pattern
If you need more details please comment me, and I will update the question.
Thanks.

You could use this regular expression:
^s?([/|##])((?!\1).)+\1((?!\1).)*\1((gi?|ig)(\1\d\d?)?|i)?$
See regex101.com
Note how this also rejects these cases:
///anything/
/./anything/gg
/./anything/ii
/./anything/i/12
How it works:
Some explanation of the parts that are different:
((?!\1).): this matches any character that is not the delimiter. This way you can keep track of the exact number of delimiters used. It also prevents the first character after the first delimiter from being that delimiter again, which should not be allowed.
(gi?|ig): matches any of the valid modifier combinations, except a sole i, which is treated separately. So this also excludes gg and ii as valid character sequences.
(\1\d\d?)?: optionally allows for an extra delimiter (after a g modifier -- see previous) to be added with one or two digits following it.
|i)?: covers the case where there is no g modifier, just an i or nothing at all; then no digits are allowed to follow.
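Since the question allows ECMAScript syntax, the pattern can be checked quickly in JavaScript. The sketch below is only a test harness with illustrative names; the slash inside the character class is escaped so that it can sit in a regex literal, and the delimiter class is copied as shown above.

```javascript
const pattern = /^s?([\/|##])((?!\1).)+\1((?!\1).)*\1((gi?|ig)(\1\d\d?)?|i)?$/;

// Inputs the question says should match.
const shouldMatch = [
  "/.//", "s/.//", "/.//g", "/.//i", "/././gi",
  "/one-or-more/anything/", "/one-or-more/anything/g/3",
  "/one-or-more/anything/i", "/one-or-more/anything/gi/99",
  "s/one-or-more/anything/g/4", "s/one-or-more/anything/i",
  "s/one-or-more/anything/gi/54",
];

// Inputs that should be rejected, from the question and this answer.
const shouldReject = [
  "/./", "/anything/anything/anything/", "///anything/",
  "/./anything/gg", "/./anything/ii", "/./anything/i/12",
];

const allAccepted = shouldMatch.every((s) => pattern.test(s));
const noneAccepted = shouldReject.every((s) => !pattern.test(s));

console.log(allAccepted, noneAccepted); // true true
```

The pattern has no g flag, so calling test repeatedly carries no state between calls.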

This is a tricky one, but I took the challenge - here is what I have ended up with:
^s?([\/|##])(?:(?!\1).)+\1(?:(?!\1).)*\1(?:i|(?:gi?|ig)(\1\d{1,2})?)?$
Pattern breakdown:
^ matches start of string
s? matches an optional 's' character
([\/|##]) matches the delimiter characters and captures as group 1
(?:(?!\1).)+ matches anything other than the delimiter character one or more times (uses negative lookahead to make sure that the character isn't the delimiter matched in group 1)
\1 matches the delimiter character captured in group 1
(?:(?!\1).)* matches anything other than the delimiter character zero or more times
\1 matches the delimiter character captured in group 1
(?: starts a new group
i matches the i character
| or
(?:gi?|ig) matches either g, gi, or ig
(\1\d{1,2})? followed by an optional extra delimiter and 0-9 once or twice
)? closes group and makes it optional
$ matches end of string
I have used non-capturing groups throughout - these are groups that start with ?:
