O365 DLP custom regex - regex

I am trying to create a custom regex to detect Social Security numbers in O365 DLP. The conditions are that the first three digits must not be 000, 666 or 150, and the last four digits must not be 0000. I came up with the regex below:
(?!000|666|150)\d{3}-\d{2}-(?!0000)\d{4} - This works fine
Need Solution:
What if I want to exclude the same pattern when it is preceded by a word, for example Apple: 173-12-9878 or Content: 173-12-9878? I tried adding the word to a negative lookahead like
(?!Apple: |Content: )(?!000|666|150)\d{3}-\d{2}-(?!0000)\d{4}, but I am not able to get this to work.
Please advise, and also suggest if there is a better way to achieve this. Thanks.

Use a regex with a lookbehind:
\b(?<!Apple: |Content: )(?!0{3}|666|150)\d{3}-\d{2}-(?!0{4})\d{4}\b
See proof & explanation.
The (?<!Apple: |Content: ) negative lookbehind will prevent matches after Apple: and Content:.
Note that \b is a word boundary; it prevents matches inside longer runs of digits or letters than you expect.
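As a rough sketch of the lookbehind idea in Python (an assumption for illustration; the O365 DLP engine may behave differently): Python's built-in re module only accepts fixed-width lookbehinds, so the two labels are written as two separate lookbehinds instead of one alternation. The names SSN_PATTERN, find_ssns and sample are made up for this example.

```python
import re

# Python's `re` needs fixed-width lookbehinds, so "Apple: " and "Content: "
# each get their own lookbehind instead of sharing one alternation.
SSN_PATTERN = re.compile(
    r"\b(?<!Apple: )(?<!Content: )"  # not directly after an excluded label
    r"(?!000|666|150)\d{3}"          # first three digits, three values banned
    r"-\d{2}-"
    r"(?!0000)\d{4}\b"               # last four digits, 0000 banned
)

def find_ssns(text: str) -> list[str]:
    """Return every SSN-like match that does not follow an excluded label."""
    return SSN_PATTERN.findall(text)

sample = "Apple: 173-12-9878, Content: 173-12-9878, SSN 173-12-9878, 000-12-3456"
print(find_ssns(sample))
```

Only the occurrence after "SSN " is returned; the two labelled ones and the 000 prefix are skipped.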

Related

Split complex string into multiple parts using regex

I've tried a lot to split this string into something I can work with, but my experience isn't enough to reach the goal. I went through the first three pages of Google results, which helped but still didn't give me an idea of how to do this properly:
I have a string which looks like this:
My Dogs,213,220#Gallery,635,210#Screenshot,219,530#Good Morning,412,408#
The result should be:
My Dogs
213,220
Gallery
635,210
Screenshot
219,530
Good Morning
412,408
Anyone have an idea how to use regex to split the string like shown above?
Given the shared patterns, it seems you're looking for a regex like the following:
[A-Za-z ]+|\d+,\d+
It matches two patterns:
[A-Za-z ]+: any combination of letters and spaces
\d+,\d+: any combination of digits + a comma + any combination of digits
Check the demo here.
If you want a stricter regex, you can wrap the previous pattern in a lookbehind and a lookahead, so that every match is preceded by a comma, a # or the start of the string, and followed by a comma, a # or the end of the string.
(?<=^|,|#)([A-Za-z ]+|\d+,\d+)(?=,|#|$)
Check the demo here.
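A minimal Python sketch of both patterns, assuming re.findall is an acceptable way to do the split. Python's re rejects the variable-width lookbehind (?<=^|,|#), so the strict variant below swaps in the double-negative forms (?<![^,#]) and (?![^,#]), which accept the same positions. The variable names are illustrative only.

```python
import re

text = "My Dogs,213,220#Gallery,635,210#Screenshot,219,530#Good Morning,412,408#"

# Simple version: a run of letters/spaces, or two comma-separated numbers.
simple = re.compile(r"[A-Za-z ]+|\d+,\d+")

# Strict version. "(?<![^,#])" succeeds at the start of the string or right
# after ',' or '#'; "(?![^,#])" does the same for the end / following char.
strict = re.compile(r"(?<![^,#])(?:[A-Za-z ]+|\d+,\d+)(?![^,#])")

parts = simple.findall(text)
for part in parts:
    print(part)
```

On this input both patterns return the same eight pieces, alternating between a name and its number pair.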

Regex for Wordle

I'm using the online word game Wordle (https://www.powerlanguage.co.uk/wordle/) to sharpen my regex skills.
I could use a little help with something that I imagine regex should solve easily.
Given a 5-letter English word
Given that I know the word begins with pr
Given that I know that the letters outyase are not found in the word
Given that I know that the letter i IS found in the word
What is the correct, most simplified regex?
My limited regex skill gives me this: ^pr.[^outyase][^outyase]$ which is
a. redundant and
b. does not include the request to match i
If any of you regex ninjas want to lend a hand, I would be much obliged.
By the way, the correct regex should return two nouns in the English language, prick and primi; you can validate here: https://www.visca.com/regexdict/
You may use this regex with a positive and negative lookahead conditions:
^pr(?=[a-z]*i)(?![a-z]*[outyase])[a-z]{3}$
Regex Explanation:
^: Start
pr: Match pr
(?=[a-z]*i): Positive lookahead to make sure we have an i ahead after 0 or more letters
(?![a-z]*[outyase]): Negative lookahead to disallow any of the [outyase] characters
[a-z]{3}: Match 3 letters
Trivially, you can use:
^pr([^outyase][^outyase]i|[^outyase]i[^outyase]|i[^outyase][^outyase])$
Also, according to your site, there's actually four words matching, not just two:
prick
primi
primp
prink
Try
^pr(?!.*[outyase])(?=.*i)[a-z]{3}$
(?!.*[outyase]) means don't match if any of outyase is found ahead in the string.
(?=.*i) means only match if there is an i ahead in the string.
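A quick way to try the two lookaheads in Python. The candidates list is a hypothetical stand-in for a real dictionary, chosen so that some words break each rule.

```python
import re

# Hypothetical word list; a real check would load a dictionary file.
candidates = ["prick", "primi", "primp", "prink",
              "print", "prime", "proxy", "price", "prim"]

# "pr", then: no banned letter ahead, an 'i' somewhere ahead, exactly 3 letters.
pattern = re.compile(r"^pr(?!.*[outyase])(?=.*i)[a-z]{3}$")

matches = [word for word in candidates if pattern.match(word)]
print(matches)
```

The four words listed in the earlier answer pass; the others fail on a banned letter or on length.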
Adding a note for general usage.
For any char position:
(?!.*[<BadChars>])(?=.*<firstGoodChar>)(?=.*<SecondGoodChar>)(?=.*<ThirdGoodChar>).*
If you know mghtlc are bad and i, o, and s are good:
^(?!.*[mghtlc])(?=.*i)(?=.*o)(?=.*s).{5}$
It's trivial to add a pinned char at the front/back:
^b(?!.*[mghtlc])(?=.*i)(?=.*o)(?=.*s).{4}$
^(?!.*[mghtlc])(?=.*i)(?=.*o)(?=.*s).{4}n$
but I'm not sure how the lookahead would work with a pinned char in the middle, given that the found chars (i, o) can be on either side of the pinned s:
NOT WORKING:
^(?!.*[mghtlc])(?=.*i)(?=.*o).{2}s(?!.*[mghtlc])(?=.*i)(?=.*o).{2}$
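One possible way around the pinned-middle case, offered as a sketch rather than the poster's own solution: keep every lookahead directly after ^ so each one scans the whole word, and express the pinned letter only in the consuming part. The word list is made up for the example.

```python
import re

# Lookaheads placed right after ^ inspect the entire word, so the found
# letters (i, o) may sit on either side of the pinned 's' in position 3.
middle_s = re.compile(r"^(?!.*[mghtlc])(?=.*i)(?=.*o).{2}s.{2}$")

# Hypothetical test words.
words = ["visor", "bison", "noisy", "poise", "moist"]
hits = [word for word in words if middle_s.match(word)]
print(hits)
```

Here visor and bison match; noisy and poise have the s in the wrong position, and moist contains a banned letter.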

How to allow only WhatsApp format numbers in a regex?

I'm trying to make this regex allow the dash symbol (-). For example, this phone number does not match right now:
+212 659-123456
So I need someone to help me change the regex to allow it.
Here is the regex:
^\+(?:[0-9]\x20?){6,14}[0-9]$
I am trying to accept only the format used by WhatsApp, and some numbers might have multiple spaces or multiple dashes. Also, the plus sign has to be mandatory. Here are some more examples of the format on WhatsApp:
+96274567123
+967773-123-123
+212 627-024321
+212689-881234
+966 54 666 4373
The numbers above cover 99% of the cases. I would appreciate any help, thanks and regards
I would just use:
^(?=(?:[+ -]*[0-9][+ -]*){11,12}$)\+(?:[0-9]+[ -]?)+[0-9]$
Explanation:
(?=(?:[+ -]*[0-9][+ -]*){11,12}$) Positive lookahead which checks that the string contains exactly 11 or 12 digits.
\+(?:[0-9]+[ -]?)+[0-9] The string has to start with a + and end with a digit; in between can be groups of one or more digits, each optionally followed by a single space or -.
regex101 demo
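A small Python check of that pattern against the numbers from the question. The function name is_whatsapp_format and the invalid samples are assumptions added for illustration.

```python
import re

# Same pattern as above, split across two lines for readability.
WA_NUMBER = re.compile(
    r"^(?=(?:[+ -]*[0-9][+ -]*){11,12}$)"  # 11 or 12 digits in total
    r"\+(?:[0-9]+[ -]?)+[0-9]$"            # '+', digit groups, single separators
)

def is_whatsapp_format(number: str) -> bool:
    return WA_NUMBER.match(number) is not None

valid = [
    "+96274567123", "+967773-123-123", "+212 627-024321",
    "+212689-881234", "+966 54 666 4373", "+212 659-123456",
]
invalid = [
    "212 659-123456",    # missing plus sign
    "+212  659-123456",  # two separators in a row
    "+212 659-123456-",  # ends with a separator
    "+1234",             # too few digits
]
for number in valid + invalid:
    print(number, is_whatsapp_format(number))
```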
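^\+([\s\-0-9]){6,15}$
This catches all your entries. Note that the bound counts separators as well as digits: +966 54 666 4373 has 15 characters after the plus sign, so the upper limit is 15 rather than 14. It would be easier to delete all whitespace and unwanted characters first and test the result, especially since the string to test grows longer with every extra separator.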
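The "strip separators first, then test" idea could look roughly like this in Python. The 11-to-12 digit range is an assumption taken from the sample numbers, and normalize and is_plausible are hypothetical helper names.

```python
import re

def normalize(number: str) -> str:
    """Drop spaces and dashes so only '+' and digits remain."""
    return re.sub(r"[ -]", "", number)

def is_plausible(number: str) -> bool:
    # Assumption: 11 or 12 digits after the mandatory plus, as in the samples.
    return re.fullmatch(r"\+[0-9]{11,12}", normalize(number)) is not None

print(normalize("+966 54 666 4373"))
print(is_plausible("+212 659-123456"))
```

After normalization the length check no longer depends on how many separators the user typed.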

Regex - Skip characters to match

I'm having an issue with Regex.
I'm trying to match T0000001 (2, 3 and so on).
However, some of the lines it searches have what I can describe as positioners. These are shown as a question mark followed by 2 digits, such as ?21.
These positioners describe a new position if the document were to be printed off the website.
Example:
T123?214567
T?211234567
I need to disregard ?21 and match T1234567.
From what I can see, this is not possible.
I have looked everywhere and tried numerous attempts.
All we have to work from is the linked image. The creators can't even confirm which flavour of regex it is; they believe it's Python, but I'm unsure.
Regex Image
Update
Unfortunately, none of the patterns below have worked so far. I tested each one live (rather than only in a regex tester, in case it behaved differently), but they still didn't work.
There is no replace feature, and as mentioned before, I'm not sure if it is Python. I appreciate your help.
Do two regex operations
First do the regex replace to replace the positioners with an empty string.
(\?[0-9]{2})
Then do the regex match
T[0-9]{7}
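Assuming a replace step is available (the question's update says it may not be in the target tool), the two operations could be sketched in Python as follows. extract_id is a hypothetical helper name.

```python
import re
from typing import Optional

def extract_id(line: str) -> Optional[str]:
    """Strip ?NN positioners, then look for T followed by 7 digits."""
    cleaned = re.sub(r"\?[0-9]{2}", "", line)   # step 1: remove positioners
    match = re.search(r"T[0-9]{7}", cleaned)     # step 2: match the ID
    return match.group(0) if match else None

for line in ["T123?214567", "T?211234567", "T0000001"]:
    print(line, "->", extract_id(line))
```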
If there's only one occurrence of the 'positioners' in each match, something like this should work: (T.*?)\?\d{2}(.*)
This can be tested here: https://regex101.com/r/XhQXkh/2
Basically, match two capture groups before and after the '?21' sequence. You'll need to concatenate these two matches.
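A short Python sketch of the concatenation step, assuming at most one positioner per line. Lines without a positioner are returned unchanged; that fallback, and the name join_around_positioner, are additions for the example.

```python
import re

SPLIT = re.compile(r"(T.*?)\?\d{2}(.*)")

def join_around_positioner(line: str) -> str:
    """Concatenate the text before and after a single ?NN positioner."""
    match = SPLIT.search(line)
    if match is None:
        return line  # no positioner found, keep the line as it is
    return match.group(1) + match.group(2)

print(join_around_positioner("T123?214567"))
print(join_around_positioner("T?211234567"))
```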
First, match the ?21 and replace it with a distinctive character such as #:
\?21
Demo
Then you may try this regex to find what you want:
(T(?:\d{7}|[\#\d]{8}))\s
Demo, in which the target string is captured in group 1 (or \1).
Finally, replace # with ?21 or anything else you like.
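A Python script may look like this:
import re

ss = """T123?214567
T?211234567
T1234567
T1234434?21
T5435433"""

# Step 1: swap each ?21 positioner for a single '#' placeholder.
rexpre = re.compile(r'\?21')
# Step 2: capture T + 7 digits, or T + 8 characters when a '#' is among them.
# The trailing \s means the last line (no newline after it) is not matched.
regx = re.compile(r'(T(?:\d{7}|[\#\d]{8}))\s')

# Print the matches with the placeholder still in place.
for m in regx.findall(rexpre.sub('#', ss)):
    print(m)
print()
# Print the same matches with the placeholder restored to ?21.
for m in regx.findall(rexpre.sub('#', ss)):
    print(re.sub('#', r'?21', m))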
Output is
T123#4567
T#1234567
T1234567
T1234434#
T123?214567
T?211234567
T1234567
T1234434?21
If using a replace functionality is an option for you then this might be an approach to match T0000001 or T123?214567:
Capture a T followed by zero or more digits before the optional part in group 1 (T\d*)
Make the question mark followed by 2 digits part optional (?:\?\d{2})?
Capture one or more digits after in group 2 (\d+).
Then in the replacement you could use group 1 followed by group 2: \1\2.
Using word boundaries \b (Or use assertions for the start and the end of the line ^ $) this could look like:
\b(T\d*)(?:\?\d{2})?(\d+)\b
Example Python
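A minimal sketch of that replacement with Python's re.sub, assuming the word-boundary variant of the pattern. The sample text is made up from the IDs in the question.

```python
import re

PATTERN = re.compile(r"\b(T\d*)(?:\?\d{2})?(\d+)\b")

text = "T123?214567\nT?211234567\nT0000001"

# Keep group 1 and group 2, dropping the optional ?NN part between them.
result = PATTERN.sub(r"\1\2", text)
print(result)
```

IDs without a positioner pass through unchanged, because the middle part is optional.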
Is the below what you want?
Use RegExReplace with multiline tag (m) and enable replace all occurrences!
Pattern = (T\d*)\?\d{2}(\d*)
replace = $1$2

TextMate: Regex replacing $1 with following 0

I'm trying to fix a file full of 1- and 2-digit numbers to make them all 2 digits long.
The file is of the form:
10,5,2
2,4,5
7,7,12
...
I've managed to match the problem numbers with:
(^|,)(\d)(,|$)
All I want to do now is replace the offending string with:
${1}0$2$3
but TextMate gives me:
10${1}05,2
Any ideas?
Thanks in advance,
Ross
According to this, TextMate supports word boundary anchors, so you could also search for \b\d\b and replace all with 0$0. (Thanks to Peter Boughton for the suggestion!)
This has the advantage of catching all the numbers in one go - your solution will have to be applied at least twice because the regex engine has already consumed the comma before the next number after a successful replace.
Note: Tim's solution is simpler and solves this problem, but I'll leave this here for reference, in case someone has a similar but more complex problem that lookarounds can handle.
A simpler way than your expression is to replace:
(?<!\d)\d(?!\d)
With:
0$0
Which is "replace all single digits with 0 then itself".
The regex is:
Negative lookbehind to not find a digit (?<!\d)
A single digit: \d
Negative lookahead to not find a digit (?!\d)
Since this is a positional match (not a character match), it caters for both comma and start/end positions.
The $0 part says "entire match" - since the lookbehind/ahead match positions, this will contain the single digit that was matched.
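The same lookaround idea can be tried outside TextMate. This Python sketch uses \g<0> for the whole match, which is Python's counterpart to $0 here; the sample text comes from the question.

```python
import re

text = "10,5,2\n2,4,5\n7,7,12"

# A digit with no digit before or after it is a lone single-digit number.
padded = re.sub(r"(?<!\d)\d(?!\d)", r"0\g<0>", text)
print(padded)
```

Because the lookarounds consume nothing, adjacent single digits such as 2,4,5 are all padded in one pass.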
To anyone coming here: as @Amarghosh suggested, it's a bug, or intentional behavior that leads to problems if nothing else.
I just had this problem and had to use the following workaround: If you set up another capture group, and then use a conditional insertion, it will work. For example, I had a string like <WebObject name=Frage01 and wanted to replace the 01 with 02, so I captured the main string in $1 and the end number in $2, which gave me a regex that looked like (<WebObject name=(Frage|Antwort))(01).
Then the replace was $1(?2:02).
The (?2:02) is the conditional insertion, and in this instance will always find something, but it was necessary in order to work around the odd conundrum of appending a number to the end of $n. Hope that helps someone. There is documentation on the conditional insertion here
In TextMate 1.5.11 (1635) ${1} does not work (like the OP described).
I appreciate the many suggestions re altering the query string, however there is a much simpler solution, if you want to break between a capture group and a number: \u.
It is a TextMate specific replacement syntax, that converts the following character to uppercase. As there is no uppercase for numbers, it does nothing and moves on. It is described in the link from Tim Pietzcker's answer.
In my case I had to clean up a csv file, where box measurements were given in cm x cm x mm. Thus I had to add a zero to the first two numbers.
Text: "80 x 40 x 5 mm"
Desired text: "800 x 400 x 5 mm"
Find: (\d+) x (\d+) x (\d+)
Replace: $1\u0 x $2\u0 x $3
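For comparison only (this is Python, not TextMate syntax): Python's re.sub has the same ambiguity between a group reference and a following digit, and resolves it with the \g<1> form. A sketch of the box-measurement example:

```python
import re

text = "80 x 40 x 5 mm"

# r"\10" would be read as group 10, so \g<1>0 is used to mean
# "group 1 followed by a literal 0".
result = re.sub(r"(\d+) x (\d+) x (\d+)", r"\g<1>0 x \g<2>0 x \g<3>", text)
print(result)
```

The trailing " mm" is not part of the match, so it is left in place by the substitution.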
Regarding the support of more than 10 capture groups, I do not know if this is a bug. But as the OP and @rossmcf wrote, $10 is replaced with null.
You do not need ${1}: replacement strings support at most nine groups, so $1 followed by 0 won't be mistaken for $10.
Replace with $10$2$3