Good way to find / replace using regex - regex

I have a value that is not being read correctly by our OCR program. The error is predictable, so I would like to use a find/replace in regex (because this is how we are already extracting the data).
We get the named group like this: (?<Foo>.*?)
I would like to replace 'N1123456' with 'NY123456'. We know that we expect NY when we are getting N1.
What can I try to get this done in the same regular expression?
Edit: (?<Foo>.*?)

Capture the non-digit prefix, skip the misread digit, capture the remaining digits, and insert Y between the two groups.
(\D+)\d(\d+)
Enclose it inside \b or ^ and $ for better precision.
Sample code:
PHP:
$re = "/(\\D+)\\d(\\d+)/";
$str = "N1123456";
$subst = '$1Y$2';
$result = preg_replace($re, $subst, $str, 1);
Python:
import re
p = re.compile(r'(\D+)\d(\d+)')
test_str = "N1123456"
subst = r"\g<1>Y\g<2>"  # Python uses \g<n> (or \n), not $n
result = p.sub(subst, test_str)
Java:
System.out.println("N1123456".replaceAll("(\\D+)\\d(\\d+)","$1Y$2"));

If you expect N1 to be always followed by 6 digits, then you can do this:
Replace this: \bN1(\d{6})\b with this: NY$1.
This replaces the N1 prefix with NY whenever it is followed by exactly 6 digits, and keeps those digits.
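As a rough Python sketch of the fixed-width replacement above (the name fix_ocr_prefix is made up for illustration, and it assumes the misread value is always N1 followed by exactly six digits):

```python
import re

# Hypothetical helper: the name and the six-digit assumption are illustrative only.
N1_PATTERN = re.compile(r"\bN1(\d{6})\b")

def fix_ocr_prefix(text):
    # Swap the misread "N1" prefix for "NY" and keep the six captured digits.
    return N1_PATTERN.sub(r"NY\g<1>", text)

print(fix_ocr_prefix("N1123456"))
```

Values with more or fewer than six digits after N1 are left untouched, because the trailing \b cannot match in the middle of a digit run.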

This is what I would do:
var str = Regex.Replace("N1123456", @"\bN1(\d+)", "NY$1");
The expression to find the text is N1 followed by digits: \bN1(\d+).
The digits belong to group 1, which I want to preserve and append to NY in the replacement: NY$1

Related

Python regex negative lookbehind embedded numeric number

I am trying to pull a certain number from various strings. The number has to be standalone, before ', or before (. The regex I came up with was:
\b(?<!\()(x)\b(,|\(|'|$) <- x is the numeric number.
If x is 2, this pulls the following string (almost) fine, except it also pulls 2'abd'. Any advice what I did wrong here?
2(2'Abf',3),212,2'abc',2(1,2'abd',3)
As I understand it, your actual question is how to get this specific number except where it appears inside parentheses.
To do so, I suggest using the skip_what_to_avoid|what_i_want pattern, like this:
(\((?>[^()\\]++|\\.|(?1))*+\))
|\b(2)(?=\b(?:,|\(|'|$))
The idea here is to completely disregard the overall matches and the first group, which uses a recursive pattern to capture everything between parentheses: (\((?>[^()\\]++|\\.|(?1))*+\)). That's the trash bin. Instead, we only need to check capture group 2, which, when set, contains the number found outside parentheses.
Sample Code:
import regex as re
regex = r"(\((?>[^()\\]++|\\.|(?1))*+\))|\b(2)(?=\b(?:,|\(|'|$))"
test_str = "2(2'Abf',3),212,2'abc',2(1,2'abd',3)"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    if match.groups()[1] is not None:
        print("Found at {start}-{end}: {group}".format(start=match.start(2), end=match.end(2), group=match.group(2)))
Output:
Found at 0-1: 2
Found at 16-17: 2
Found at 23-24: 2
This solution requires the third-party regex package (imported above as re); the standard library re module does not support the recursive (?1) construct.
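If installing the regex package is not an option, one possible workaround with the standard re module is to drop the recursion and count parenthesis depth by hand. This is a different technique from the recursive pattern above, and the sketch below (the function name find_outside_parens is invented here) assumes the parentheses are balanced and ignores backslash escapes:

```python
import re

def find_outside_parens(text, number="2"):
    # Tokenize into "(", ")", or the standalone number followed by , ( ' or end.
    token = re.compile(r"\(|\)|\b" + re.escape(number) + r"\b(?=,|\(|'|$)")
    depth = 0
    spans = []
    for m in token.finditer(text):
        if m.group() == "(":
            depth += 1
        elif m.group() == ")":
            depth -= 1
        elif depth == 0:
            # Keep the number only when it is not nested inside parentheses.
            spans.append(m.span())
    return spans

print(find_outside_parens("2(2'Abf',3),212,2'abc',2(1,2'abd',3)"))
```

For the sample string this reports the same three positions as the output listed above.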

How to search and add a string before the pattern in python using regular expressions

I am writing a program to find and replace a string in all files with a given extension. I am using regular expressions for the search. The task is to find all the occurrences and modify them.
If my string is "The number is 1234567890"
the result after searching and replacing should be "The number is +911234567890"
I think I can use re.sub() here, like:
s = "The number is 1234567890"
re.sub(r"\d{10}",??,s)
What can I pass as the second argument here? I don't know in advance what the number will be; I have to modify the matched string itself by prefixing it with +91.
I could do it using findall from re and replace from str, like:
s = "The number is 1234567890 and 2345678901"
matches = re.findall(r'\d{10}',s)
for match in matches:
    s = s.replace(match,"+91"+match)
After this, s is "The number is +911234567890 and +912345678901"
Is this the only way of doing it? Is it not possible using re.sub()?
If it is, please help me. Thank you!
Try this regex:
(?=\d{10})
Explanation:
(?=\d{10}) - positive lookahead to find a zero-length match which is immediately followed by 10 digits
Code
import re
regex = r"(?=\d{10})"
test_str = "The number is 1234567890"
subst = "+91"
result = re.sub(regex, subst, test_str, count=0)
if result:
    print(result)
OR
If I use your regex, the code would look like:
import re
regex = r"(\d{10})"
test_str = "The number is 1234567890"
subst = r"+91\g<1>"  # Python backreferences use \g<1> (or \1), not $1
result = re.sub(regex, subst, test_str, count=0)
if result:
    print(result)
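A variation on the second approach, sketched under the assumption that a phone number is exactly ten digits standing on their own: \g<0> refers to the whole match, so no capture group is needed, and \b keeps longer digit runs from being partially matched. The helper name add_country_code is made up for this example.

```python
import re

def add_country_code(text):
    # \g<0> inserts the entire match; \b...\b requires a run of exactly ten digits.
    return re.sub(r"\b\d{10}\b", r"+91\g<0>", text)

print(add_country_code("The number is 1234567890 and 2345678901"))
```

Unlike the findall/replace loop in the question, this touches each occurrence once, so a number that appears twice in the text is not prefixed twice.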

Regex using condition on match results of another regex

I have a long string S and a string-to-string map M, where keys in M are the results of a regex match on S. I want to do a find-and-replace on S where, whenever one of the matches from that same regex is exactly one of my keys K in M, I replace it with its value M[K].
In order to do this I think I'd need to access the result of regex matches within a regex. If I try to store the result of a match and test equality outside a regex, I can't do my replace because I no longer know where the match was. How do I accomplish my goal?
Examples:
S = "abcd_a", regex = "[a-z]", M = {a:b}
result: "bbcd_b" because the regex would match the a's and replace them with b's
S = "abcd_a", regex = "[a-z]*", M = {a:b}
result: "abcd_b" because the regex would match "abcd" (but not replace it because it is not exactly "a") and the final 'a' (which it would replace because it is exactly "a")
EDIT Thanks for AlanMoore's suggestion. The code is now simpler.
I tried using python (2.7x) to solve this simple example, but it can be achieved with any other language. What's important is the approach (algorithm). Hope it helps:
import re
S = "abcd_a"
REGEX = "[a-z]"
M = {'a':'b'}
def ReplaceWithDict(pattern):
    # split by match group and map the match against map dict
    return ''.join([M[v] if v and v in M else v for v in re.split(pattern, S)])
print(ReplaceWithDict('([a-z])'))
print(ReplaceWithDict('([a-z]*)'))
Output:
bbcd_b
abcd_b
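Another way to sketch the same idea in Python 3 is to pass a function to re.sub. It is called with each match object, so the match position never has to be tracked by hand. The name replace_with_dict is invented for this example; unlike the split-based version, text between matches is never looked up in the map.

```python
import re

def replace_with_dict(pattern, mapping, text):
    # Replace a match only when its full text is a key in the mapping;
    # otherwise put the matched text back unchanged.
    return re.sub(pattern, lambda m: mapping.get(m.group(0), m.group(0)), text)

print(replace_with_dict(r"[a-z]", {"a": "b"}, "abcd_a"))
print(replace_with_dict(r"[a-z]*", {"a": "b"}, "abcd_a"))
```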

Sliding window pattern match in perl or matlab regular expressions

I am trying to use either Perl or MATLAB to parse a few numbers out of a single line of text. My text line is:
t10_t20_t30_t40_
Now, in MATLAB, I used the following script:
str = 't10_t20_t30_t40_';
a = regexp(str,'t(\d+)_t(\d+)','match')
and it returns
a =
't10_t20' 't30_t40'
What I want is for it to also return 't20_t30', since this obviously is a match. Why doesn't regexp scan it?
I thus turned to Perl, and wrote the following in Perl:
#!/usr/bin/perl -w
$str = "t10_t20_t30_t40_";
while($str =~ /(t\d+_t\d+)/g)
{
print "$1\n";
}
and the result is the same as in MATLAB:
t10_t20
t30_t40
but I really wanted "t20_t30" also be in the results.
Can anyone tell me how to accomplish that? Thanks!
[update with a solution]:
With help from colleagues, I identified a solution using the so-called "look-around assertion" afforded by Perl.
#!/usr/bin/perl -w
$str = "t10_t20_t30_t40_";
while($str =~ m/(?=(t\d+_t\d+))/g)
{print "$1\n";}
The key is to use a zero-width lookahead assertion. When Perl (and other regex engines) scans a string, it resumes each search where the previous match ended, so characters consumed by one match cannot be part of the next. That is why t20_t30 never shows up in the original results. A lookahead consumes no characters, so matches found through it do not exclude any substring from subsequent searches (see the working code above). With the global modifier (m//g), the search starts at position zero and effectively advances one position at a time, trying the lookahead at each one.
This is explained in more detail in this blog post.
The expression (?=t\d+_t\d+) matches any zero-width position that is followed by t\d+_t\d+, and this creates the actual "sliding window". This effectively returns all t\d+_t\d+ patterns in $str without exclusion, since every position in $str is a candidate zero-width match. The additional parentheses in (?=(t\d+_t\d+)) capture the pattern while the window slides, and thus return the desired outcome.
Using Perl:
#!/usr/bin/perl
use Data::Dumper;
use Modern::Perl;
my $re = qr/(?=(t\d+_t\d+))/;
my @l = 't10_t20_t30_t40' =~ /$re/g;
say Dumper(\@l);
Output:
$VAR1 = [
't10_t20',
't20_t30',
't30_t40'
];
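The same lookahead trick carries over to other engines. As a minimal Python sketch (the function name sliding_matches is made up here), re.findall returns the contents of the capture group for every position where the lookahead succeeds:

```python
import re

def sliding_matches(text):
    # The lookahead consumes nothing, so the scan moves on by one position
    # and can find overlapping matches; the inner group captures the text.
    return re.findall(r"(?=(t\d+_t\d+))", text)

print(sliding_matches("t10_t20_t30_t40_"))
```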
Once the regexp algorithm has found a match, the matched characters are not considered for further matches (and usually, this is what one wants, e.g. .* is not supposed to match every conceivable contiguous substring of this post). A workaround would be to start the search again one character after the first match, and collect the results:
str = 't10_t20_t30_t40_';
sub_str = str;
reg_ex = 't(\d+)_t(\d+)';
start_idx = 0;
all_start_indeces = [];
all_end_indeces = [];
off_set = 0;
%// While there are matches later in the string and the first match of the
%// remaining string is not the last character
while ~isempty(start_idx) && (start_idx < numel(str))
    %// Calculate offset to original string
    off_set = off_set + start_idx;
    %// extract string starting at first character after first match
    sub_str = sub_str((start_idx + 1):end);
    %// find further matches
    [start_idx, end_idx] = regexp(sub_str, reg_ex, 'once');
    %// save match if any
    if ~isempty(start_idx)
        all_start_indeces = [all_start_indeces, start_idx + off_set];
        all_end_indeces = [all_end_indeces, end_idx + off_set];
    end
end
display(all_start_indeces)
display(all_end_indeces)
matched_strings = arrayfun(@(st, en) str(st:en), all_start_indeces, all_end_indeces, 'uniformoutput', 0)

regex match where order of substrings doesn't matter

I have a problem using regexp. I have a code in the following format:
(01)123456789(17)987654321
Now I want to capture the digits after (01) in a named group, group01, and the digits after (17) in a named group, group17.
The problem is that the code could be in a different order, like this:
(17)987654321(01)123456789
The named groups should contain the same content.
Any ideas?
Thank you, Marco
In PCRE and PHP (Python's re module spells named groups (?P<name>...) instead):
(?:(?<=\(17\))(?<group17>\d+)|(?<=\(01\))(?<group01>\d+)|.)+
.NET supports the above syntax and this one:
(?:(?<=\(17\))(?'group17'\d+)|(?<=\(01\))(?'group01'\d+)|.)+
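Python's standard re module spells named groups (?P<name>...), so the patterns above need adapting there. One order-independent sketch, assuming each label appears at most once in the code, anchors two lookaheads at the start of the string (the name parse_code is made up here):

```python
import re

PATTERN = re.compile(
    r"^(?=.*\(01\)(?P<group01>\d+))"   # find (01) anywhere and capture its digits
    r"(?=.*\(17\)(?P<group17>\d+))"    # find (17) anywhere and capture its digits
)

def parse_code(code):
    # Both lookaheads start at position 0, so the order of the labels is irrelevant.
    m = PATTERN.match(code)
    return m.groupdict() if m else None

print(parse_code("(17)987654321(01)123456789"))
```

If either label is missing, the match fails and the function returns None.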
This worked for me:
(?<group01>\(01\))[0-9]{9}|(?<group17>\(17\))[0-9]{9}
Everyone seems to be hardcoding "01" and "17". Here's a more general solution:
my %group;
# $data is assumed to hold the input string
while ( $data =~ /\((\d+)\)(\d+)/g ) {
    my $group_number = $1;
    my $group_data = $2;
    $group{$group_number} = $group_data;
}
As long as unconsumed (numbers)numbers patterns remain in your data, it will grab each one in succession. In this Perl snippet, it stores each group's data into a hash keyed on the group number.
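For comparison, roughly the same label-agnostic idea in Python, as a sketch only (the function name parse_groups is invented for this example): re.findall returns one (label, digits) tuple per pair, which dict() turns into a mapping keyed on the label.

```python
import re

def parse_groups(code):
    # Each (label)digits pair becomes one dictionary entry keyed on the label.
    return dict(re.findall(r"\((\d+)\)(\d+)", code))

print(parse_groups("(17)987654321(01)123456789"))
```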
You didn't say what language; they all have their own quirks. But something like this should work if there are always 9 digits after the () (in Ruby).
No groups, but it's a little clearer like this, in my opinion; it may not work for you.
string = "(01)123456789(17)987654321"
group17 = string =~ /\(17\)\d{9}/
group01 = string =~ /\(01\)\d{9}/
string[group17+4,9]
string[group01+4,9]
EDIT:
With named capture groups in Ruby 1.9:
string = "(01)123456789(17)987654321"
if string =~ /\(17\)(?<g17>\d{9})/
  match = Regexp.last_match
  group17 = match[:g17]
end
if string =~ /\(01\)(?<g01>\d{9})/
  match = Regexp.last_match
  group01 = match[:g01]
end
Are you looking for something like this?
\((01|17)\)(\d+)\((01|17)\)(\d+)
Expected matches:
0 => In most cases the whole match
1 => 01 or 17
2 => first decimal string
3 => second 01 or 17
4 => second decimal string
Tell me if it helps.
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. -- Jamie Zawinski
Glib quotes aside, regex seems like overkill. Python code:
string = "(17)987654321(01)123456789"
substrings = [s for s in string.split("(") if len(s) > 0]
results = dict()
for substring in substrings:
    substring = substring.split(")")
    results["group" + substring[0]] = substring[1]
print(results)
>>> {'group17': '987654321', 'group01': '123456789'}