Replacing unknown number of named groups - regex

I am working with the following pattern:
<type>"<prefix>"<format>"<suffix>";<neg_type>"<prefix>"<format>"<suffix>"
Here are two examples, with and without a prefix:
n"prefix"#,##0"suffix";-"prefix"#,##0"suffix"
n#,##0"suffix";-#,##0"suffix"
I wrote the following regex to capture my groups:
raw = r"(?P<type>^.)(?:\"(?P<prefix>[^\"]*)\"){0,1}(?P<format>[^\"]*)(?:\"(?P<suffix>[^\"]*)\"){0,1};(?P<negformat>.)(?:\"(?P=prefix)\"){0,1}(?P=format)(?:\"(?P=suffix)\"){0,1}"
Now I am parsing a large text that contains such structures, and I would like to replace the prefix or suffix (only if they exist!). Because the number of captured groups is unknown (possibly zero), I do not know how to make my replacements easily with re.sub.
Additionally, due to an implementation constraint, I process the prefix and the suffix sequentially (so I do not get the suffix replacement at the same time as the prefix replacement, even if they belong to the same sentence).

First, we can simplify your regex by using single quotes for the string. That removes the necessity of escaping the " character. Second, {0,1} can be replaced by ?:
raw = r'(?P<type>^.)(?:"(?P<prefix>[^"]*)")?(?P<format>[^"]*)(?:"(?P<suffix>[^"]*)")?;(?P<negformat>.)(?:"(?P<prefix2>(?P=prefix))")?(?P=format)(?:"(?P<suffix2>(?P=suffix))")?'
Notice that I have added (?P<prefix2>) and (?P<suffix2>) named groups above for the second occurrences of the prefix and suffix.
I am working on the assumption that your pattern may be repeated within the text (if the pattern appears only once, this code will still work). In that case, the substitutions must be made from the last occurrence to the first, so that the start and end offsets returned by the regex engine remain correct even after earlier substitutions have changed the length of the string. Similarly, within each occurrence of the pattern, we must replace in the order suffix2, prefix2, suffix, prefix.
We use re.finditer to iterate through the text to return match objects and form these into a list, which we reverse so that we can process the last matches first:
import re
raw = r'(?P<type>^.)(?:"(?P<prefix>[^"]*)")?(?P<format>[^"]*)(?:"(?P<suffix>[^"]*)")?;(?P<negformat>.)(?:"(?P<prefix2>(?P=prefix))")?(?P=format)(?:"(?P<suffix2>(?P=suffix))")?'
s = """a"prefix"format"suffix";b"prefix"format"suffix"
x"prefix_2"format_2"suffix_2";y"prefix_2"format_2"suffix_2"
"""
new_string = s
matches = list(re.finditer(raw, s, flags=re.MULTILINE))
matches.reverse()
if matches:
    # Work from the last match to the first so earlier offsets stay valid.
    for match in matches:
        if match.group('suffix2'):
            new_string = new_string[0:match.start('suffix2')] + 'new_suffix' + new_string[match.end('suffix2'):]
        if match.group('prefix2'):
            new_string = new_string[0:match.start('prefix2')] + 'new_prefix' + new_string[match.end('prefix2'):]
        if match.group('suffix'):
            new_string = new_string[0:match.start('suffix')] + 'new_suffix' + new_string[match.end('suffix'):]
        if match.group('prefix'):
            new_string = new_string[0:match.start('prefix')] + 'new_prefix' + new_string[match.end('prefix'):]
print(new_string)
Prints:
a"new_prefix"format"new_suffix";b"new_prefix"format"new_suffix"
x"new_prefix"format_2"new_suffix";y"new_prefix"format_2"new_suffix"
The above code, for demo purposes, makes the same substitutions for each occurrence of the pattern.
As far as your second concern:
There is nothing preventing you from making two passes against the text, once to replace the prefixes and once to replace the suffixes as these become known. Obviously, you would only be checking certain groups on each pass, but you could still use the same regex. And, of course, each occurrence of the pattern can have its own substitutions. The above code shows how to find and make the substitutions.
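The two-pass idea could be sketched as follows. This is only an illustration under stated assumptions: the helper name replace_groups, the short replacement texts 'P' and 'S', and the sample string are made up for the example, and the regex is the one from above split over several lines. The helper rebuilds each match from right to left and touches only the groups that actually took part in it.

```python
import re

# Same regex as above, split over several lines for readability.
RAW = re.compile(
    r'(?P<type>^.)(?:"(?P<prefix>[^"]*)")?(?P<format>[^"]*)(?:"(?P<suffix>[^"]*)")?'
    r';(?P<negformat>.)(?:"(?P<prefix2>(?P=prefix))")?(?P=format)'
    r'(?:"(?P<suffix2>(?P=suffix))")?',
    flags=re.MULTILINE,
)

def replace_groups(text, replacements):
    """Hypothetical helper: replace only the named groups listed in
    `replacements`, and only where the group took part in the match."""
    def rebuild(match):
        whole = match.group(0)
        base = match.start()
        spans = sorted(
            ((match.span(name), new) for name, new in replacements.items()
             if match.group(name) is not None),
            reverse=True,  # right to left, so earlier offsets stay valid
        )
        for (start, end), new in spans:
            whole = whole[:start - base] + new + whole[end - base:]
        return whole
    return RAW.sub(rebuild, text)

s = 'a"prefix"format"suffix";b"prefix"format"suffix"\nn#,##0"suffix";-#,##0"suffix"\n'
pass1 = replace_groups(s, {'prefix': 'P', 'prefix2': 'P'})      # prefixes only
pass2 = replace_groups(pass1, {'suffix': 'S', 'suffix2': 'S'})  # suffixes later
print(pass2)
```

Note that this sketch tests `is not None`, so an empty prefix ("") counts as present, whereas the truthiness test used above skips it; pick whichever matches your definition of "exists".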
To allow 0 to 9 instances of the prefix in the second half of the pattern:
import re
raw = r'(?P<type>^.)(?:"(?P<prefix>[^"]*)")?(?P<format>[^"]*)(?:"(?P<suffix>[^"]*)")?;(?P<negformat>.)(?P<prefix2>(?:"(?P=prefix)"){0,9})(?P=format)(?:"(?P<suffix2>(?P=suffix))")?'
s = """a"prefix"format"suffix";b"prefix""prefix""prefix"format"suffix"
x"prefix_2"format_2"suffix_2";y"prefix_2"format_2"suffix_2"
"""
new_string = s
matches = list(re.finditer(raw, s, flags=re.MULTILINE))
matches.reverse()
if matches:
    for match in matches:
        if match.group('suffix2'):
            new_string = new_string[0:match.start('suffix2')] + 'new_suffix' + new_string[match.end('suffix2'):]
        if match.group('prefix2'):
            start = match.start('prefix2')
            end = match.end('prefix2')
            repl = s[start:end]
            # Each repeated prefix is wrapped in a pair of double quotes.
            n = repl.count('"') // 2
            new_string = new_string[0:start] + (n * '"new_prefix"') + new_string[end:]
        if match.group('suffix'):
            new_string = new_string[0:match.start('suffix')] + 'new_suffix' + new_string[match.end('suffix'):]
        if match.group('prefix'):
            new_string = new_string[0:match.start('prefix')] + 'new_prefix' + new_string[match.end('prefix'):]
print(new_string)
Prints:
a"new_prefix"format"new_suffix";b"new_prefix""new_prefix""new_prefix"format"new_suffix"
x"new_prefix"format_2"new_suffix";y"new_prefix"format_2"new_suffix"

Related

Regex split string by two consecutive pipe ||

I want to split the string below on two consecutive pipes (||) using a regex.
Input String
value1=data1||value2=da|ta2||value3=test&user01|
Expected Output
value1=data1
value2=da|ta2
value3=test&user01|
I tried ([^||]+), but it also splits on a single pipe |.
Try out my example - Regex
value2 contains a single pipe, which should not be treated as a separator.
I am using a Lua script like this:
for pair in string.gmatch(params, "([^||]+)") do
    print(pair)
end
You can explicitly find each ||.
$ cat foo.lua
s = 'value1=data1||value2=da|ta2||value3=test&user01|'
offset = 1
for idx in string.gmatch(s, '()||') do
    print(string.sub(s, offset, idx - 1) )
    offset = idx + 2
end
-- Deal with the part after the right-most `||`.
-- Must +1 or it'll fail to handle s like "a=b||".
if offset <= #s + 1 then
    print(string.sub(s, offset) )
end
$ lua foo.lua
value1=data1
value2=da|ta2
value3=test&user01|
Regarding ()||, see Lua's documentation about Patterns (Lua does not have regex support):
Captures:
A pattern can contain sub-patterns enclosed in parentheses; they describe captures. When a match succeeds, the substrings of the subject string that match captures are stored (captured) for future use. Captures are numbered according to their left parentheses. For instance, in the pattern "(a*(.)%w(%s*))", the part of the string matching "a*(.)%w(%s*)" is stored as the first capture, and therefore has number 1; the character matching "." is captured with number 2, and the part matching "%s*" has number 3.
As a special case, the capture () captures the current string position (a number). For instance, if we apply the pattern "()aa()" on the string "flaaap", there will be two captures: 3 and 5.
The easiest way is to replace the two-character sequence || with a single character (e.g. ;) that does not occur in the data, and then use that character as the separator:
local params = "value1=data1||value2=da|ta2||value3=test&user01|"
for pair in string.gmatch(params:gsub('||',';'), "([^;]+)") do
    print(pair)
end
If every printable character can occur in the data, a non-printable character can be used as the separator instead, written by its code: string.char("10") == "\10" == "\n"
Even the character with code 1 works: "\1"
string.gmatch( params:gsub('||','\1'), "([^\1]+)" )

Replace a certain part of the string which matches a pattern in python 2.7 and save in the same file

I am trying to achieve the following.
My input file has entries like these:
art.range.field = 100
art.net.cap = 200
art.net.ht = 1000
art.net.dep = 8000
I want to find the line where art.range.field appears and change its value to 500, so the output of the code should look like this:
art.range.field = 500
art.net.cap = 200
art.net.ht = 1000
art.net.dep = 8000
Here is my attempt at solving this problem:
file_path = "/tmp/dimension"
with open(file_path,"r") as file:
    file_content = file.read()
    new_content = re.sub(r"^.*"+parameter+".*$",parameter+" = %s" % value, file_content)
    file.seek(0)
    file.write(new_content)
    file.truncate()
Here I have taken parameter = art.range.field and value = 500.
But my file remains unchanged, because new_content never takes on the desired output.
So I want to know where I am going wrong and what a possible solution would be.
You can get what you want with
import re
parameter = 'art.range.field'
value = 500
with open(file_path,"r+") as file:
    new_content = re.sub(r"^("+re.escape(parameter)+r"\s*=\s*).*", r"\g<1>%d" % value, file.read(), flags=re.M)
    file.seek(0)
    file.write(new_content)
    file.truncate()
See the regex demo.
Note:
You need to open the file with r+ to actually read from and write to it
re.M makes ^ match at the start of every line
re.escape escapes special characters in the parameter variable
Regex details:
^ - start of line
(art\.range\.field\s*=\s*) - Group 1 (\g<1> in the replacement pattern, the unambiguous backreference is required as value starts with a digit):
art\.range\.field - the literal string art.range.field
\s*=\s* - an = sign surrounded by zero or more whitespace characters
.* - any 0 or more chars other than line break chars as many as possible
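As a quick check of the pattern described above, here is a minimal sketch that applies the same substitution to an in-memory string instead of a real file. The variable content is an assumption standing in for the file contents; the pattern and replacement mirror the answer.

```python
import re

# In-memory stand-in for the file contents (assumed for illustration).
content = (
    "art.range.field = 100\n"
    "art.net.cap = 200\n"
    "art.net.ht = 1000\n"
    "art.net.dep = 8000\n"
)
parameter = "art.range.field"
value = 500

pattern = r"^(" + re.escape(parameter) + r"\s*=\s*).*"
# \g<1> re-inserts "art.range.field = "; the digits of value follow it.
new_content = re.sub(pattern, r"\g<1>%d" % value, content, flags=re.M)
print(new_content)

# Without re.escape, each "." would match any character, not just a dot.
loose = re.search(parameter, "artXrangeYfield") is not None
strict = re.search(re.escape(parameter), "artXrangeYfield") is not None
print(loose, strict)
```

Only the first line changes; the other entries are left as they were.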
You are not changing anything because you are not in multi-line mode. If you prepend (?m) to your regex (before the ^), it should work. See also this resource for more detail on regex modifiers.
Furthermore:
You don't need $: since you are not in single-line mode, .* already matches every character up to the end of the line.
To avoid false positives, I would also make sure that the parameter is followed by an equals sign.
You should escape the parameter (using re.escape) if you want to use it as part of a regex.
So you should use this line of code:
new_content = re.sub(r"(?m)^\s*" + re.escape(parameter) + r"\s*=.*", parameter + " = " + str(value), file_content)
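To see why the original attempt left the text unchanged, the sketch below runs the question's pattern with and without (?m) on an in-memory string. The two-line content is an assumed stand-in for the file, and re.escape is added to the question's pattern so that the only difference between the two calls is the flag.

```python
import re

# Two-line stand-in for the file contents (assumed for illustration).
content = "art.range.field = 100\nart.net.cap = 200\n"
parameter = "art.range.field"
value = 500
replacement = parameter + " = " + str(value)

# Without multi-line mode, $ only matches at the very end of the text
# (or just before a final newline), and .* cannot cross a newline,
# so the pattern finds nothing to replace.
unchanged = re.sub(r"^.*" + re.escape(parameter) + r".*$", replacement, content)

# With (?m), ^ and $ anchor at every line, so the first line is rewritten.
changed = re.sub(r"(?m)^.*" + re.escape(parameter) + r".*$", replacement, content)

print(unchanged == content)
print(changed)
```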

Simple Regular Expression matching

I'm new to regular expressions and I'm trying to use RegExp on the GWT client side. I want to do simple * matching (say, if the user enters 006*, I want to match 006...). I'm having trouble writing this. What I have is:
input = (006*)
input = input.replaceAll("\\*", "(" + "\\" + "\\" + "S\\*" + ")");
RegExp regExp = RegExp.compile(input).
It returns true for strings like BKLFD006* too. What am I doing wrong?
Put a ^ at the start of the regex you're generating.
The ^ character means to match at the start of the source string only.
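The effect of the leading ^ can be illustrated with a short sketch. It is written in Python rather than GWT, purely to show the anchoring behaviour; the helper name wildcard_to_regex and the test strings are assumptions made for the example.

```python
import re

def wildcard_to_regex(user_input):
    """Hypothetical helper: escape the input, then turn each escaped *
    into "any run of non-space characters", anchored at the start."""
    return "^" + re.escape(user_input).replace(r"\*", r"\S*")

anchored = wildcard_to_regex("006*")
unanchored = anchored[1:]  # the same pattern without the leading ^

print(bool(re.search(anchored, "006123")))         # starts with 006
print(bool(re.search(anchored, "BKLFD006123")))    # 006 is not at the start
print(bool(re.search(unanchored, "BKLFD006123")))  # matches in the middle
```

Without the anchor, the search is free to find 006 anywhere in the string, which is why BKLFD006* was accepted.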
I think you are mixing two things here, namely replacement and matching.
Matching is used when you want to extract the part of the input string that matches a specific pattern. In your case that seems to be what you want. To get one or more digits that are followed by a star and not preceded by anything, you can use the following regex:
^[0-9]+(?=\*)
and here is a Java snippet:
String subjectString = "006*";
String ResultString = null;
Pattern regex = Pattern.compile("^[0-9]+(?=\\*)");
Matcher regexMatcher = regex.matcher(subjectString);
if (regexMatcher.find()) {
    ResultString = regexMatcher.group();
}
On the other hand, replacement is used when you want to replace a re-occurring pattern from the input string with something else.
For example, if you want to replace all digits followed by a star with the same digits surrounded by parentheses then you can do it like this:
String input = "006*";
String result = input.replaceAll("^([0-9]+)\\*", "($1)");
Notice the use of $1 to reference the digits that were captured by the capture group ([0-9]+) in the regex pattern.

regular expression to find the repeating pattern +X+X+X

I have a regular expression
'^[0-9]*d[0-9]+(\+[0-9]*)*$'
to limit an input in the following format
str1 = '3d8+10'
str2 = 'd8+2+4'
However, the re I have also lets the string below through:
str3 = 'd8++2'
Is there a way to write the regular expression in order to limit the pattern to
+X+X+X...?
You need
^[0-9]*d[0-9]+(\+[0-9]+)*$
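The * in (\+[0-9]*) of your original pattern also lets a bare + match with no digits after it; using + there requires at least one digit after every +.
If the string must have at least one +n part, use + (one or more) at the end: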
^[0-9]*d[0-9]+(\+[0-9]+)+$
It appears you are looking for
'^[0-9]*d[0-9]+(\+[0-9]+)*$'
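A small check of the difference, assuming Python's re module and the three sample strings from the question:

```python
import re

original = re.compile(r'^[0-9]*d[0-9]+(\+[0-9]*)*$')  # pattern from the question
fixed = re.compile(r'^[0-9]*d[0-9]+(\+[0-9]+)*$')     # corrected pattern

for s in ('3d8+10', 'd8+2+4', 'd8++2'):
    # Print whether each pattern accepts the string.
    print(s, bool(original.match(s)), bool(fixed.match(s)))
```

Both patterns accept the first two strings; only the original accepts 'd8++2'.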

Regex not matching repeated letters

myString = "THIS THING CAN KISS MY BUTT. HERE ARE MORE SSS";
myNewString = reReplace(myString, "[^0-9|^S{2}]", "|", "All");
myNewString is "|||S||||||||||||SS||||||||||||||||||||||||SSS"
What I want is "||||||||||||||||SS|||||||||||||||||||||||||||" which is what I thought ^S{2} would do (exclude exactly 2 S). Why is it matching any S? Can someone tell me how to fix it? TIA.
actual goal
I'm trying to validate a list of values. Acceptable values are 6-digit numbers or 5-digit numbers preceded by SS, so 123456,SS12345 is a valid list.
What I'm trying to do is turn everything that isn't SS or a digit into a new delimiter, because I have no control over the input.
For example, 123456 AND SS12345 should be changed to 123456|||||SS12345. After changing the | delimiter to a comma, the result is 123456,SS12345. If a user enters 123456 PLUS SS12345, it ends up as 123456||||S|SS12345, which becomes 123456,S,SS12345. That is invalid and the user gets an error, but it would be valid if the single S were replaced as well.
The [^0-9|^S{2}] actually means:
[^      # any character except
  0-9   # 0 to 9
  |     # a vertical bar
  ^     # a caret
  S     # an S   <-----
  {     # an open brace
  2     # a 2, and
  }     # a close brace
]
Therefore the class never matches an S, so every S is left in place no matter how many appear in a row.
Since ColdFusion doesn't support lookbehind or a callback in the replacement, I don't think this can be solved simply with just REReplace.
I don't know CF but I'll try something like:
resultString = ""
sCount = 0
# The trailing "$" is a sentinel that flushes a run of S at the end.
for character in myString + "$":
    if character == 'S':
        sCount += 1
    else:
        if sCount == 2:
            resultString += "SS"
        else:
            resultString += "|" * sCount
        sCount = 0
        if character.isdigit():
            resultString += character
        else:
            resultString += "|"
# Drop the "|" produced by the sentinel.
resultString = resultString[:-1]
Am I reading this correctly in that you want to replace everything except exactly two consecutive S characters?
Is this restricted to a single replace call or can you run it through multiple regex operations? If multiple operations are allowed, it might be easier to run the string through one regex that matches against S{3,} (to pick up instances of three or more S characters) and then through a second one that uses ([^S])S([^S]) (to pick up single S characters). A third run could match against the rest of your rule ([^0-9]).
You're using a negated character class, [^...]: any character NOT in 0-9|^S{2} is replaced, so the digits 0-9 and the characters |, ^, S, { and } all survive. Negative matching of whole strings rather than single characters is quite hard. Matching exactly two consecutive S characters would be (?<!S)SS(?!S); matching anything BUT 'SS' is hardly doable. My best effort would be (?<=SS)S|S(?=SS)|(?<=S)S(?=S)|(?<!S)S(?!S)|[^S0-9], but I cannot guarantee it.
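For comparison, here is a hedged sketch of the same goal in Python rather than ColdFusion, since Python's re module supports both lookbehind and a replacement callback. The function name to_pipes is hypothetical. The pattern tries (?<!S)SS(?!S) first and keeps that text; any other non-digit character is captured by group 1 and replaced with a pipe, while digits are never matched and so stay as they are.

```python
import re

def to_pipes(text):
    """Hypothetical helper: keep digits and runs of exactly two S,
    and turn every other character into "|"."""
    return re.sub(
        r'(?<!S)SS(?!S)|([^0-9])',
        # Group 1 is set only when the second alternative matched.
        lambda m: '|' if m.group(1) is not None else m.group(0),
        text,
    )

piped = to_pipes("123456 PLUS SS12345")
# Collapse each run of delimiters into a single comma.
as_list = re.sub(r'\|+', ',', piped)
print(piped)
print(as_list)
```

With this approach the single S in PLUS becomes a delimiter like any other letter, so the list validates.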