PHP: checking an NCR with a negative lookbehind and a greedy quantifier doesn't work

I want to find erroneous NCRs (numeric character references) that are missing the leading &# and repair them. The code point is a 4- or 5-digit decimal number. I wrote this PHP code:
function repl0($m) {
return '&#'.$m[0];
}
$s = "This is a good 23200; sample ship";
echo "input1= ".htmlentities($s)."<br>";
$out1=preg_replace_callback('/(?<!#)(\d{4,5};)/','repl0',$s);
echo 'output1 = '.htmlentities($out1).'<br>';
The output is:
input1= This is a good 23200; sample ship
output1 = This is a good 2&#3200; sample ship
According to the output, the match happens only once.
What I want is to match '23200;' instead of '3200;'.
Matching should be greedy by default, so I thought it would capture the 5-digit number rather than the 4-digit one.
Do I misunderstand 'greedy' here? How can I get what I want?

The (?<!#)(\d{4,5};) pattern matches like this:
(?<!#) - matches a location that is not immediately preceded with #
(\d{4,5};) - then tries to match and consume four or five digits and a ; char immediately after these digits.
So, if the input contains #32000;, the 3 cannot start a match because it is preceded by #, but the following 2 can: it is not preceded by #, and the four digits 2000 followed by ; satisfy the pattern.
What you need here is to curb the match on the left by adding a digit to the lookbehind,
(?<![#\d])(\d{4,5};)
With this trick, you ensure that the match cannot be immediately preceded with either # or a digit.
You say you finally used (?<!#)(?<!\d)\d{4,5};, and this pattern is functionally equivalent to the pattern above since the lookbehinds, as all lookarounds, "stand their ground", i.e. the regex index does not move when the lookaround patterns are matched. So, the check for a digit or a # char occurs at the same location in the string.
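The difference between the two lookbehinds can be checked with a small sketch. It is written in Python rather than PHP purely for illustration; the helper name `fix_ncr` and the sample strings are assumptions and not part of the question.

```python
import re

# Pattern from the question: only '#' is excluded before the digits.
LOOSE = re.compile(r'(?<!#)(\d{4,5};)')
# Pattern from the answer: neither '#' nor a digit may precede the digits.
STRICT = re.compile(r'(?<![#\d])(\d{4,5};)')


def fix_ncr(text, pattern=STRICT):
    """Prefix each bare 'NNNN;' or 'NNNNN;' run with '&#' (hypothetical helper)."""
    return pattern.sub(lambda m: '&#' + m.group(0), text)


bare = "This is a good 23200; sample ship"
already_ok = "This is a good &#23200; sample ship"

print(fix_ncr(bare))                # bare reference gets its '&#' prefix
print(fix_ncr(already_ok))          # valid reference is left untouched
print(fix_ncr(already_ok, LOOSE))   # loose pattern matches '3200;' inside it
```

With the loose pattern, a match may begin at the second digit of an already valid reference, which is why the digit is added to the lookbehind.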

Related

Regex for masking data

I am trying to implement regex for a JSON Response on sensitive data.
JSON response comes with AccountNumber and AccountName.
Masking details are as below.
accountNumber Before: 7835673653678365
accountNumber Masked: 783567365367****
accountName Before : chris hemsworth
accountName Masked : chri* *********
I am able to match the first part with [0-9]{12} and (?![0-9]{12}), but when I replace the match, it is replaced with just a single *, so my regex is not producing the correct output.
How can I produce output as above from regex?
If all you want is to mask every character except the first N, I don't think you really need a complicated regex. To skip the first N characters and replace every character after them with *, you can write a generic regex like this:
(?<=.{N}).
where N can be any number (1, 2, 3, etc.), and replace each match with *.
This regex matches every character that has at least N characters before it, so once one character matches, all the following characters match as well.
For example, in your AccountNumber case N = 12, so the regex becomes
(?<=.{12}).
Java code,
String s = "7835673653678365";
System.out.println(s.replaceAll("(?<=.{12}).", "*"));
Prints,
783567365367****
And for AccountName case, N = 4, hence your regex becomes,
(?<=.{4}).
Java code,
String s = "chris hemsworth";
System.out.println(s.replaceAll("(?<=.{4}).", "*"));
Prints,
chri***********
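The generic (?<=.{N}). idea can be wrapped in a small function. The following is a minimal sketch in Python (the answer itself uses Java); the function name `mask_after` and its parameters are assumptions made for illustration.

```python
import re


def mask_after(value, keep):
    """Replace every character after the first `keep` characters with '*'."""
    # Python's re module needs a fixed-width lookbehind; .{keep} is fixed-width.
    pattern = r'(?<=.{%d}).' % keep
    return re.sub(pattern, '*', value)


print(mask_after("7835673653678365", 12))
print(mask_after("chris hemsworth", 4))
```

A value shorter than `keep` characters is returned unchanged, since no character has enough characters before it.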
If you match [0-9]{12} and replace the match directly with a single asterisk, you are left with *8365.
There is no programming language listed, but one option to replace the digits at the end is to use a positive lookbehind to assert what is on the left are 12 digits followed by a positive lookahead to assert what is on the right are 0+ digits followed by the end of the string.
Then in the replacement use *
If the JSON value is exactly chris hemsworth or 7835673653678365, you can omit the positive lookaheads (?=\d*$) and (?=[\w ]*$), which assert the end of the string in the following two expressions.
Use the versions with the positive lookahead if the data to match is at the end of the string and the string contains more data, so that you don't replace more matches than you expect.
(?<=[0-9]{12})(?=\d*$)\d
In Java:
(?<=[0-9]{12})(?=\\d*$)\\d
(?<=[0-9]{12}) Positive lookbehind, assert what is on the left are 12 digits
(?=\d*$) Positive lookahead, assert what is on the right are 0+ digits and assert the end of the string
\d Match a single digit
Result:
783567365367****
For the account name you might do that with 4 word characters \w, but this will also replace the whitespace with an asterisk, because I believe you cannot skip matching that space in one regex.
(?<=[\w ]{4})(?=[\w ]*$)[\w ]
In Java
(?<=[\\w ]{4})(?=[\\w ]*$)[\\w ]
Result
chri***********
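The effect of the end-of-string lookahead shows up when the value is embedded in a longer string. Below is a hedged sketch in Python; the sample line and the constant names are made up for illustration. The last pattern is an assumed variant that is not taken from the answer above: it masks only non-whitespace characters, which appears to keep the space in the account name.

```python
import re

# Mask digits that follow 12 digits and run to the end of the string.
ANCHORED = r'(?<=[0-9]{12})(?=\d*$)\d'
# Same idea without the end-of-string lookahead.
UNANCHORED = r'(?<=[0-9]{12})\d'
# Assumed variant: mask only non-whitespace after the first 4 characters.
KEEP_SPACES = r'(?<=.{4})\S'

line = "ref 1234567890123456 acct 7835673653678365"

print(re.sub(ANCHORED, '*', line))      # only the trailing number is masked
print(re.sub(UNANCHORED, '*', line))    # both 16-digit numbers are masked
print(re.sub(KEEP_SPACES, '*', "chris hemsworth"))
```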

How to make regex that can take at most one asterisk in character class?

I want to create a regular expression that matches a string that starts with an optional minus sign - and ends with a minus sign. The part in between must begin with a letter (upper or lower case), which can be followed by any combination of letters and numbers and may contain at most one asterisk (*).
So far I have come up with this
[-]?[a-zA-Z]+[a-zA-Z0-9(*{0,1})]*[-]
Some examples of what I am trying to achieve.
"-yyy-" // valid
"-u8r*y75-" // valid
"-u8r**y75-" // invalid
Code
^-?[a-z](?!(?:.*\*){2})[a-z\d*]*-$
Alternatively, you can use the following regex to achieve the same results without using a negative lookahead.
^-?[a-z][a-z\d]*(?:\*[a-z\d]*)?-$
Results
Input
** VALID **
-yyy-
-u8r*y75-
** INVALID **
-u8r**y75-
Output
-yyy-
-u8r*y75-
Explanation
^ Assert position at the start of the line
-? Match zero or one of the hyphen character -
[a-z] Match a single ASCII alpha character between a and z. Note that the i modifier is turned on, thus this will also match uppercase variations of the same letters
(?!(?:.*\*){2}) Negative lookahead ensuring what follows doesn't match
(?:.*\*){2} Match any run of characters followed by an asterisk *, twice (i.e. two asterisks anywhere ahead)
[a-z\d*]* Match any ASCII letter between a and z, or a digit, or the asterisk symbol * literally, any number of times
- Match this character literally
$ Assert position at the end of the line
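Both patterns can be compared side by side. This is a small sketch in Python, assuming case-insensitive matching as the explanation above does; the constant names and the two extra samples ("Abc-" and "-8abc-") are not from the original post.

```python
import re

# Both patterns from the answer, compiled case-insensitively.
WITH_LOOKAHEAD = re.compile(r'^-?[a-z](?!(?:.*\*){2})[a-z\d*]*-$', re.IGNORECASE)
WITHOUT_LOOKAHEAD = re.compile(r'^-?[a-z][a-z\d]*(?:\*[a-z\d]*)?-$', re.IGNORECASE)

samples = ["-yyy-", "-u8r*y75-", "-u8r**y75-", "Abc-", "-8abc-"]

for text in samples:
    a = bool(WITH_LOOKAHEAD.search(text))
    b = bool(WITHOUT_LOOKAHEAD.search(text))
    print(f"{text!r}: lookahead={a}, no lookahead={b}")
```

The second pattern expresses "at most one asterisk" structurally, as an optional group, so it needs no lookahead at all.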
Try this one:
-(((\w|\d)*)(\*?)((\w|\d)*))-
You can try it here:
https://regex101.com/
(-)?(\w)+(\*(?!\*)|\w+)(-)
I used grouping to make it clearer. I changed [a-zA-Z0-9] to \w, which is nearly the same (\w also matches the underscore _).
(\*(?!\*)|\w+)
This is the important change. Explained in words:
It matches either a star \* that is not followed by another star, (?!\*) (called a negative lookahead: it looks at the part that follows), or one or more \w characters.
Use this site to test: https://regexr.com/
They have a pretty good explanation in the left menu under "Reference".

Global regex with multiple matches, where the separator should be shared in several matches

First of all, sorry for the unclear title, it's hard to describe (and to find an existing solution for the same reason).
I use this regex in Javascript, to collect numbers in a string :
/(?:^|[^\d])([\d]+)(?:$|[^\d])/g
Executing it on "5358..2145" returns 2 matches, where the submatches are "5358" and "2145"
But if I use it on "5358.2145", I receive only 1 match : "5358"
My understanding is this:
The first match found is "5358.", so the point is consumed by the first match.
What I want as the second match is then preceded neither by the start of the string nor by the point, because the point already belongs to the first match.
How can I change my pattern to find all numbers separated by a single non-digit character?
Use a negative lookahead at the end:
/(?:^|\D)(\d+)(?!\d)/g
The pattern matches:
(?:^|\D) - either start of string (^) or any non-digit char (\D)
(\d+) - Group 1: one or more digits
(?!\d) - the negative lookahead failing the match if there is a digit immediately to the right of the current location.
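The behaviour can be reproduced with a short sketch. It is ported to Python for illustration (the question uses JavaScript, where these two patterns should behave the same way); the constant names are assumptions.

```python
import re

# Original pattern: the trailing non-digit is consumed by the match.
CONSUMING = re.compile(r'(?:^|[^\d])([\d]+)(?:$|[^\d])')
# Suggested pattern: the lookahead checks the next character without consuming it.
LOOKAHEAD = re.compile(r'(?:^|\D)(\d+)(?!\d)')

for text in ("5358..2145", "5358.2145"):
    print(text, CONSUMING.findall(text), LOOKAHEAD.findall(text))
```

Because the lookahead consumes nothing, the single separator is still available as the leading \D of the next match.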

Regular expression of two digit number where two digits are not same

I am trying to write a regular expression that will match a two digit number where the two digits are not same.
I have used the following expression:
^([0-9])(?!\1)$
However, neither "11" nor "12" matches. I thought "12" would match. Can anyone please tell me where I am going wrong?
You need to allow matching 2 digits. Your regex ^([0-9])(?!\1)$ only allows a one-digit string. Note that a lookahead does not consume characters; it only checks for the presence or absence of something after the current position.
Use
^(\d)(?!\1)\d$
           ^^
Explanation of the pattern:
^ - start of string
(\d) - match and capture into Group #1 a digit
(?!\1) - make sure the next character is not the same digit as in Group 1
\d - one digit
$ - end of string.
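A quick check of both patterns, sketched in Python; the constant names and the test strings "1" and "123" are assumptions added for illustration.

```python
import re

ORIGINAL = re.compile(r'^([0-9])(?!\1)$')   # lookahead only; no second digit is consumed
FIXED = re.compile(r'^(\d)(?!\1)\d$')       # lookahead, then actually match the second digit

for text in ("1", "11", "12", "123"):
    print(text, bool(ORIGINAL.search(text)), bool(FIXED.search(text)))
```

The original pattern accepts only the one-digit string, while the fixed pattern accepts only "12".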

regex: find one-digit number

I need to find all the one-digit numbers in a text.
My code:
$string = 'text 4 78 text 558 my.name#gmail.com 5 text 78998 text';
$pattern = '/ [\d]{1} /';
(result: 4 and 5)
Everything works perfectly; I just wanted to ask whether it is correct to use spaces.
Maybe there is some other way to distinguish one-digit number.
Thanks
First of all, [\d]{1} is equivalent to \d.
As for your question, it would be better to use a zero width assertion like a lookbehind/lookahead or word boundary (\b). Otherwise you will not match consecutive single digits because the leading space of the second digit will be matched as the trailing space of the first digit (and overlapping matches won't be found).
Here is how I would write this:
(?<!\S)\d(?!\S)
This means "match a digit only if there is not a non-whitespace character before it, and there is not a non-whitespace character after it".
I used the double negative like (?!\S) instead of (?=\s) so that you will also match single digits that are at the beginning or end of the string.
I prefer this over \b\d\b for your example because it looks like you really only want to match when the digit is surrounded by spaces, and \b\d\b would match the 4 and the 5 in a string like 192.168.4.5
To allow punctuation at the end, you could use the following:
(?<!\S)\d(?![^\s.,?!])
Add any additional punctuation characters that you want to allow after the digit to the character class (inside of the square brackets, but make sure it is after the ^).
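The differences between these approaches can be seen on a few inputs. This is a sketch in Python rather than PHP, with made-up sample strings and constant names.

```python
import re

SPACES = re.compile(r' \d ')                    # pattern from the question
STRICT = re.compile(r'(?<!\S)\d(?!\S)')         # whitespace or string edge on both sides
PUNCT = re.compile(r'(?<!\S)\d(?![^\s.,?!])')   # also allow . , ? ! after the digit

print(SPACES.findall("a 1 2 3 b"))              # the 2 is skipped: its space was consumed
print(STRICT.findall("a 1 2 3 b"))              # all three digits are found
print(STRICT.findall("5 text 192.168.4.5 text 7"))
print(re.findall(r'\b\d\b', "192.168.4.5"))     # word boundaries match inside the address
print(PUNCT.findall("Call agent 7, pronto!"))
```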
Use word boundaries. Note that the quantifier {1} is redundant (a single \d already matches exactly one digit), and so is the character class [], because it contains only one element.
\b\d\b
Search around word boundaries:
\b\d\b
As explained by the others, this will extract single digits meaning that some special characters might not be respected like "." in an ip address. To address that, see F.J and Mike Brant's answer(s).
It really depends on where the numbers can appear and whether you care if they are adjacent to other characters (like . at the end of a sentence). At the very least, I would use word boundaries so that you can get numbers at the beginning and end of the input string:
$pattern = '/\b\d\b/';
But you might consider punctuation at the end like:
$pattern = '/\b\d(\b|\.|\?|\!)/';
If one-digit numbers can be preceded or followed by characters other than digits (e.g., "a1 cat" or "Call agent 7, pronto!") use
(?<!\d)\d(?!\d)
The regular expression reads: match a digit (\d) that is neither preceded nor followed by a digit, (?<!\d) being a negative lookbehind and (?!\d) being a negative lookahead.
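A minimal sketch of this pattern in Python; the constant name is an assumption, and the last two sample strings are added for comparison with the other answers.

```python
import re

LONE_DIGIT = re.compile(r'(?<!\d)\d(?!\d)')

for text in ("a1 cat", "Call agent 7, pronto!", "text 4 78 text 558", "192.168.4.5"):
    print(repr(text), LONE_DIGIT.findall(text))
```

Note that this variant, like \b\d\b, still picks up the 4 and the 5 in the IP address, so the choice depends on which neighbouring characters should disqualify a digit.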