Extract a specific string from within a string using regular expressions - regex

I need a regular expression to match a string within a longer string.
Specifically, the match must exclude any leading zeros and the last 2 digits of the string.
For example, my input might be the following:
00009666666605
00010444444404
00007Z22222205
00033213433104
00009000G00005
And I would like to match
96666666
104444444
7Z222222
332134331
9000G000
For further information, the last 2 digits are always numbers and describe the starting point of the valid reference, after the leading zeros.
I thought I'd cracked it with something like
(?<=0000).{8}|((?<=000).{9})+? but that doesn't work as expected.

It sure takes a lot of steps, but this should do the trick:
(?<=^000)[^0].{8}|(?<=^0000).{8}
(?<= 'start lookbehind
^000 'for the beginning of the string then three zeroes
) 'end lookbehind
[^0] 'match a non-zero
.{8} 'match the remaining 8 chars
| ' OR
(?<= 'start lookbehind
^0000 'for the beginning of the string then four zeroes
) 'end lookbehind
.{8} 'match the remaining 8 chars
That said, in .NET it will be quicker to do the following, provided the format of these strings is always the same:
Dim trimmed = line.TrimStart("0"c)
Dim numberString = trimmed.Substring(0, trimmed.Length - 2)
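For readers outside .NET, the same trim-and-slice idea can be sketched in Python. This is only an illustration: the function name extract_reference is made up here, and it assumes every line consists of leading zeros, the reference, and exactly two trailing digits.

```python
def extract_reference(line: str) -> str:
    """Drop leading zeros, then drop the final two characters."""
    trimmed = line.lstrip("0")  # counterpart of TrimStart("0"c)
    return trimmed[:-2]         # counterpart of Substring(0, Length - 2)


samples = [
    "00009666666605",
    "00010444444404",
    "00007Z22222205",
    "00033213433104",
    "00009000G00005",
]
for sample in samples:
    print(extract_reference(sample))
```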

I would use:
^0*(.*).{2}$
And access your matches via $1
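As a rough cross-check, the capture-group pattern can be tried in Python, where group(1) plays the role of $1. The helper name reference_from is an assumption for illustration only.

```python
import re

# Leading zeros are consumed by 0*, the last two characters by .{2},
# and whatever sits in between lands in group 1.
PATTERN = re.compile(r"^0*(.*).{2}$")


def reference_from(line: str) -> str:
    match = PATTERN.match(line)
    return match.group(1) if match else ""


for sample in ("00009666666605", "00010444444404", "00007Z22222205"):
    print(reference_from(sample))
```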

Related

Regular Expression: Find a specific group within other groups in VB.Net

I need to write a regular expression that has to replace everything except for a single group.
E.g
IN: OK THT PHP This is it 06222021
OUT: This is it
IN: NO MTM PYT Get this content 111111
OUT: Get this content
I wrote the following Regular Expression: (\w{0,2}\s\w{0,3}\s\w{0,3}\s)(.*?)(\s\d{6}(\s|))
This RegEx creates 4 groups, using the first entry as an example the groups are:
OK THT PHP
This is it
06222021
Space character
I need a way to:
Replace Group 1,2,4 with String.Empty
OR
Get Group 3, ONLY
You don't need 4 groups. A single capture group (group 1) used in the replacement is enough, and the last part can match 6-8 digits instead of only 6.
Note that \w{0,2} will also match an empty string; use \w{1,2} if there has to be at least one word character.
^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$
^ Start of string
\w{0,2}\s\w{0,3}\s\w{0,3}\s Match 3 times word characters with a quantifier and a whitespace in between
(.*?) Capture group 1, match any char as few times as possible
\s\d{6,8} Match a whitespace char and 6-8 digits
\s? Match an optional whitespace char
$ End of string
Example code
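Imports System.Text.RegularExpressions

' Keep only capture group 1 by replacing the whole match with $1
Dim s As String = "OK THT PHP This is it 06222021"
Dim result As String = Regex.Replace(s, "^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$", "$1")
Console.WriteLine(result)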
Output
This is it
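For comparison, here is a minimal Python sketch of the same replace-with-group-1 idea. The function name keep_middle is hypothetical, and \1 is Python's spelling of the $1 replacement reference.

```python
import re

PATTERN = r"^\w{0,2}\s\w{0,3}\s\w{0,3}\s(.*?)\s\d{6,8}\s?$"


def keep_middle(line: str) -> str:
    # The whole line is matched and replaced by capture group 1 only.
    return re.sub(PATTERN, r"\1", line)


for line in ("OK THT PHP This is it 06222021",
             "NO MTM PYT Get this content 111111"):
    print(keep_middle(line))
```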
My approach does not work with groups and does not use a Replace operation. The match itself yields the desired result.
It uses look-around expressions. To find a pattern between two other patterns, you can use the general form
(?<=prefix)find(?=suffix)
This will only return find as match, excluding prefix and suffix.
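The general form can be tried out with a small Python sketch. Note the assumptions: the helper name between is made up, and Python's re module only accepts fixed-width lookbehind (unlike .NET), so the prefix and suffix here are treated as literal strings.

```python
import re


def between(prefix: str, suffix: str, text: str):
    """Return the text found between a literal prefix and a literal suffix."""
    pattern = "(?<=" + re.escape(prefix) + ").*?(?=" + re.escape(suffix) + ")"
    match = re.search(pattern, text)
    # Only the "find" part is in the match; prefix and suffix are asserted, not consumed.
    return match.group(0) if match else None


print(between("[", "]", "log [warn] disk full"))
```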
If we insert your expressions, we get
(?<=\w{0,2}\s\w{0,3}\s\w{0,3}\s).*?(?=\s\d{6}\s?)
where I simplified (\s|) as \s?. We can also drop it completely, since we don't care about trailing spaces.
(?<=\w{0,2}\s\w{0,3}\s\w{0,3}\s).*?(?=\s\d{6})
Note that this works also if we have more than 6 digits because regex stops searching after it has found 6 digits and doesn't care about what follows.
This also gives a match if other things precede our pattern like in 123 OK THT PHP This is it 06222021. We can exclude such results by specifying that the search must start at the beginning of the string with ^.
If the exact length of the words and numbers does not matter, we simply write
(?<=^\w+\s\w+\s\w+\s).*?(?=\s\d+)
If the find part can contain numbers, we must specify that we want to match until the end of the line with $ (and include a possible space again).
(?<=^\w+\s\w+\s\w+\s).*?(?=\s\d+\s?$)
Finally, we use a quantifier for the 3 occurrences of word-space:
(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)
This is compact and will only return This is it or Get this content.
string result = Regex.Match(input, @"(?<=^(\w+\s){3}).*?(?=\s\d+\s?$)").Value; // input is the string to search

How to use Ruby gsub with regex to do partial string substitution

I have a pipe delimited file which has a line
H||CUSTCHQH2H||PHPCCIPHP|1010032000|28092017|25001853||||
I want to substitute the date (28092017) with a regex "[0-9]{8}" if the first character is "H"
I tried the following example to test my understanding, where I'm trying to substitute "a" with "i".
str = "|123||a|"
str.gsub /\|(.*?)\|(.*?)\|(.*?)\|/, "\|\\1\|\|\\1\|i\|"
But this gives the output
"|123||123|i|"
Any clue how this can be achieved?
You may replace the first occurrence of 8 digits inside pipes if a string starts with H using
s = "H||CUSTCHQH2H||PHPCCIPHP|1010032000|28092017|25001853||||"
p s.gsub(/\A(H.*?\|)[0-9]{8}(?=\|)/, '\100000000')
# or
p s.gsub(/\AH.*?\|\K[0-9]{8}(?=\|)/, '00000000')
Here, the value is replaced with 8 zeros.
Pattern details
\A - start of string (^ is the start of a line in Ruby)
(H.*?\|) - Capturing group 1 (you do not need it when using the variation with \K): H, then any 0+ chars as few as possible, then a |
\K - match reset operator that discards the text matched so far
[0-9]{8} - eight digits
(?=\|) - the next char must be |, but it is not added to the match value since it is a positive lookahead that does not consume text.
The \1 in the first gsub is a replacement backreference to the value in Group 1.
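The capture-group variant carries over to other engines. Below is a hedged Python sketch (Python's re has no \K, so only the group form is shown); the function name mask_first_date is an assumption for illustration.

```python
import re


def mask_first_date(line: str) -> str:
    # \g<1> is used because a bare \1 followed by more digits would not be
    # read as a reference to group 1 in a Python replacement string.
    return re.sub(r"\A(H.*?\|)[0-9]{8}(?=\|)", r"\g<1>00000000", line, count=1)


s = "H||CUSTCHQH2H||PHPCCIPHP|1010032000|28092017|25001853||||"
print(mask_first_date(s))
```

Lines that do not start with H are returned unchanged, because \A(H...) cannot match them.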

Regex for masking data

I am trying to implement regex for a JSON Response on sensitive data.
JSON response comes with AccountNumber and AccountName.
Masking details are as below.
accountNumber Before: 7835673653678365
accountNumber Masked: 783567365367****
accountName Before : chris hemsworth
accountName Masked : chri* *********
I am able to match the above if I just use [0-9]{12} and (?![0-9]{12}), but when I replace, the whole match is replaced with a single *, so my regex is not producing the correct output.
How can I produce output as above from regex?
If all you want is to mask every character except the first N, I don't think you really need a complicated regex. To skip the first N characters and replace every character after them with *, you can write a generic regex like this,
(?<=.{N}).
where N can be any number like 1,2,3 etc. and replace the match with *
The way this regex works is, it selects every character which has at least N characters before it and hence once it selects a character, all following characters also get selected.
For example, in your AccountNumber case, N = 12, hence your regex becomes,
(?<=.{12}).
Java code,
String s = "7835673653678365";
System.out.println(s.replaceAll("(?<=.{12}).", "*"));
Prints,
783567365367****
And for AccountName case, N = 4, hence your regex becomes,
(?<=.{4}).
Java code,
String s = "chris hemsworth";
System.out.println(s.replaceAll("(?<=.{4}).", "*"));
Prints,
chri***********
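The generic (?<=.{N}). idea can be wrapped in a small helper. The sketch below uses Python; mask_after and keep are hypothetical names, and the pattern is built per call on the assumption that N is a small positive integer.

```python
import re


def mask_after(text: str, keep: int) -> str:
    """Replace every character that has at least `keep` characters before it."""
    pattern = r"(?<=.{%d})." % keep  # e.g. keep=12 gives (?<=.{12}).
    return re.sub(pattern, "*", text)


print(mask_after("7835673653678365", 12))
print(mask_after("chris hemsworth", 4))
```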
If you match [0-9]{12} and replace that directly with a single asterisk, you are left with *8365
There is no programming language listed, but one option to replace the digits at the end is to use a positive lookbehind to assert what is on the left are 12 digits followed by a positive lookahead to assert what is on the right are 0+ digits followed by the end of the string.
Then in the replacement use *
If the JSON value is exactly chris hemsworth or 7835673653678365, you can omit the positive lookaheads (?=\d*$) and (?=[\w ]*$), which assert the end of the string in the following 2 expressions.
Use the versions with the positive lookahead if the data to match is at the end of the string and the string contains more data so you don't replace more matches than you would expect.
(?<=[0-9]{12})(?=\d*$)\d
In Java:
(?<=[0-9]{12})(?=\\d*$)\\d
(?<=[0-9]{12}) Positive lookbehind, assert what is on the left are 12 digits
(?=\d*$) Positive lookahead, assert what is on the right are 0+ digits and assert the end of the string
\d Match a single digit
Result:
783567365367****
For the account name you might do that with 4 word characters \w, but this will also replace the whitespace with an asterisk because I believe you cannot skip matching that space in one regex.
(?<=[\w ]{4})(?=[\w ]*$)[\w ]
In Java
(?<=[\\w ]{4})(?=[\\w ]*$)[\\w ]
Result
chri***********
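One possible way to keep the space, as in the expected output chri* *********, is to mask only non-whitespace characters after the first four. This is a Python sketch under that assumption; mask_name is a made-up helper name and \S is swapped in for the character class used above.

```python
import re


def mask_name(name: str, keep: int = 4) -> str:
    # \S skips whitespace, so spaces after the first `keep` characters survive.
    return re.sub(r"(?<=.{%d})\S" % keep, "*", name)


print(mask_name("chris hemsworth"))
```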

Regex perl with letters and numbers

I need to extract a string from a text file that contains both letters and numbers. The lines start like this
Report filename: ABCL00-67900010079415.rpt ______________________
All I need is the last 8 numbers so in this example that would be 10079415
while (<DATA>) {
    if (/Report filename/) {
        my ($bagID) = ( m/(\d{8}+)./ );
        print $bagID;
    }
}
Right now this prints out the first 8 but I want the last 8.
You just need to escape the dot, so that it matches the 8 digits that come right before the dot character.
my ($bagID) = ( m/(\d{8}+)\./ );
. is a special character in regex which matches any character. In order to match a literal dot, you must escape it.
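The difference between the unescaped and the escaped dot can be seen in a short Python sketch. The possessive + is left out here since it does not change which digits are matched, and the variable names are illustrative only.

```python
import re

line = "Report filename: ABCL00-67900010079415.rpt ______________________"

# Unescaped dot: any character may follow, so the first 8 digits win.
unescaped = re.search(r"(\d{8}).", line).group(1)

# Escaped dot: the 8 digits must sit directly before a literal ".".
escaped = re.search(r"(\d{8})\.", line).group(1)

print(unescaped)
print(escaped)
```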
To match the last of anything, just precede it with a wildcard that will match as many characters as possible
my ($bag_id) = / .* (\d{8}) /x;
Note that I have also used the /x modifier so that the regex can contain insignificant whitespace for readability. Also, your \d{8}+ is what is called a possessive quantifier; it is used for optimising some regex constructions and makes no difference at the end of the pattern.
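The greedy-prefix trick is not specific to Perl. A minimal Python sketch, with last_eight_digits as a hypothetical helper name:

```python
import re


def last_eight_digits(line: str):
    # The greedy .* grabs as much as possible and backtracks only until
    # eight digits can follow, so the last run of 8 digits is captured.
    match = re.search(r".*(\d{8})", line)
    return match.group(1) if match else None


line = "Report filename: ABCL00-67900010079415.rpt ______________________"
print(last_eight_digits(line))
```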

Regex in PHP: take all the words after the first one in string and truncate all of them to the first character

I'm quite terrible at regexes.
I have a string that may have 1 or more words in it (generally 2 or 3), usually a person name, for example:
$str1 = 'John Smith';
$str2 = 'John Doe';
$str3 = 'David X. Cohen';
$str4 = 'Kim Jong Un';
$str5 = 'Bob';
I'd like to convert each as follows:
$str1 = 'John S.';
$str2 = 'John D.';
$str3 = 'David X. C.';
$str4 = 'Kim J. U.';
$str5 = 'Bob';
My guess is that I should first match the first word, like so:
preg_match( "^([\w\-]+)", $str1, $first_word )
then all the words after the first one... but how do I match those? should I use again preg_match and use offset = 1 in the arguments? but that offset is in characters or bytes right?
Anyway, after I have matched the words following the first, if they exist, should I do something like this for each of them:
$second_word = substr( $following_word, 0, 1 ) . '. ';
Or my approach is completely wrong?
Thanks
ps - it would be a boon if the regex could keep the first two words whole when the string contains three or more words... (e.g. 'Kim Jong U.').
It can be done in single preg_replace using a regex.
You can search using this regex:
^\w+(?:$| +)(*SKIP)(*F)|(\w)\w+
And replace by:
$1.
Code:
$name = preg_replace('/^\w+(?:$| +)(*SKIP)(*F)|(\w)\w+/', '$1.', $name);
Explanation:
(*FAIL) behaves like a failing negative assertion and is a synonym for (?!)
(*SKIP) defines a point beyond which the regex engine is not allowed to backtrack when the subpattern fails later
(*SKIP)(*FAIL) together provide a nice way around the restriction that you cannot have a variable-length lookbehind in the above regex.
^\w+(?:$| +)(*SKIP)(*F) matches first word in a name and skips it (does nothing)
(\w)\w+ matches all other words and replaces it with first letter and a dot.
You could use a positive lookbehind assertion.
(?<=\h)([A-Z])\w+
OR
Use this regex if you want to turn Bob F to Bob F.
(?<=\h)([A-Z])\w*(?!\.)
Then replace the matched characters with \1.
Code would be like,
preg_replace('~(?<=\h)([A-Z])\w+~', '\1.', $string);
(?<=\h)([A-Z]) Captures each uppercase letter that is preceded by a horizontal space character.
\w+ matches one or more word characters.
Replace the matched chars with the chars inside the group index 1 \1 plus a dot will give you the desired output.
A simple solution with only look-ahead and word boundary check:
preg_replace('~(?!^)\b(\w)\w+~', '$1.', $string);
(\w)\w+ is a word in the name, with the first character captured
(?!^)\b performs a word boundary check \b, and makes sure the match is not at the start of the string (?!^).
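This last pattern uses only constructs that Python's re module also supports, so it can be tried there as a rough check. The function name abbreviate is an assumption for illustration; \1 stands in for PHP's $1.

```python
import re


def abbreviate(name: str) -> str:
    # Every word of two or more characters that does not start the string
    # is reduced to its first character plus a dot.
    return re.sub(r"(?!^)\b(\w)\w+", r"\1.", name)


for name in ("John Smith", "John Doe", "David X. Cohen", "Kim Jong Un", "Bob"):
    print(abbreviate(name))
```

A single-letter word such as X in 'David X. Cohen' is left alone because \w+ needs at least one more word character after the captured one.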