How to replace partial groups in Python regex?

I have a regex
(obligor_id): (\d+);(obligor_id): (\d+):
A sample match like below:
Match 1
Full match 57-95 `obligor_id: 505732;obligor_id: 505732:`
Group 1. 57-67 `obligor_id`
Group 2. 69-75 `505732`
Group 3. 76-86 `obligor_id`
Group 4. 88-94 `505732`
I am trying to replace part of the full match, as follows:
obligor_id: 505732;obligor_id: 505732: -> obligor_id: 505732;
There are two ways to achieve this:
replace groups 3 and 4 with an empty string
replace groups 1 and 2 with an empty string, and then rewrite group 4 as (\d+);
How can I do both of these in Python? I know about the re.sub function, but I only know how to replace the whole match, not individual groups.
Thanks in advance.

You can change the capturing groups and reference them in the substitution string:
import re
s = 'obligor_id: 505732;obligor_id: 505732:'
re.sub(r'(obligor_id: \d+;)(obligor_id: \d+:)', r'\1', s)
# => 'obligor_id: 505732;'

Thanks for the answers and advice.
For future readers, I achieved both approaches as follows (s is the input string):
re.sub(regex, r'\1: \2;', s)  # keep groups 1 and 2
re.sub(regex, r'\3: \4;', s)  # keep groups 3 and 4
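The two replacements above can be sketched end to end as follows. This is a minimal illustration, not the asker's exact code: the sample string with surrounding text and the helper name keep_first_pair are assumptions made for the example. It also shows that re.sub accepts a function as the replacement, which helps when the new text has to be computed from the groups.

```python
import re

# Pattern from the question: four capture groups.
pattern = re.compile(r'(obligor_id): (\d+);(obligor_id): (\d+):')

# Hypothetical sample input with some text around the match.
s = 'x obligor_id: 505732;obligor_id: 505732: y'

# Keep groups 1 and 2, drop the rest of the match.
keep_first = pattern.sub(r'\1: \2;', s)

# Keep groups 3 and 4, and turn the trailing ':' into ';'.
keep_second = pattern.sub(r'\3: \4;', s)

# Same as keep_first, but with a function as the replacement.
def keep_first_pair(m):
    return f'{m.group(1)}: {m.group(2)};'

via_callable = pattern.sub(keep_first_pair, s)

print(keep_first)
```

Only the text covered by the full match is rewritten; everything outside it is left untouched.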

How to return first match sub-string of a string using Ruby regex? [duplicate]

I'm looking for a way to perform a regex match on a string in Ruby, get the first matching substring, and assign it to a variable. I have checked different solutions here on Stack Overflow but couldn't find a proper one so far.
This is my string
/usr/share/filebeat/reports/ui/local/20200904_151507/API/API_Test_suite/20200904_151508/20200904_151508.csv
I need to get the first substring, 20200904_151507. The file path can change from time to time, and so can the substring, but the pattern is always date_time. In the regex below, I tried to match the first eight (8) digits, an underscore, and the last six (6) digits.
Here are the solutions I tried:
report_path[/^[0-9]{8}[_][0-9]{6}$/,1]
report_path.scan(/^[0-9]{8}[_][0-9]{6}$/).first
The report_path variable above holds the full file path mentioned earlier.
What did I do wrong here?
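Your pattern fails because ^ and $ anchor the match to the start and end of a line, so it only matches when the whole line is the timestamp. The ,1 in the first attempt also asks for capture group 1, which the pattern does not define.
scan returns all substrings that match the pattern. You can use match, scan or [] to achieve your goal: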
report_path = '/usr/share/filebeat/reports/ui/local/20200904_151507/API/API_Test_suite/20200904_151508/20200904_151508.csv'
report_path.match(/\d{8}_\d{6}/)[0]
# => "20200904_151507"
report_path.scan(/\d{8}_\d{6}/)[0]
# => "20200904_151507"
# String#[] supports regex
report_path[/\d{8}_\d{6}/]
# => "20200904_151507"
Note that match returns a MatchData object, which can hold several captured substrings (if the pattern uses capture groups). scan returns an Array containing all matches.
Here we call [0] on the MatchData to get the whole matched substring, i.e. the first match in the string.
Capture groups:
A regex lets us capture multiple substrings with one pattern. We use () to create capture groups, and (?'some_name'<pattern>) to create named capture groups.
report_path = '/usr/share/filebeat/reports/ui/local/20200904_151507/API/API_Test_suite/20200904_151508/20200904_151508.csv'
matches = report_path.match(/(\d{8})_(\d{6})/)
matches[0] #=> "20200904_151507"
matches[1] #=> "20200904"
matches[2] #=> "151507"
matches = report_path.match(/(?'date'\d{8})_(?'id'\d{6})/)
matches[0] #=> "20200904_151507"
matches["date"] #=> "20200904"
matches["id"] #=> "151507"
We can even use (named) capture groups with []
From String#[] documentation:
If a Regexp is supplied, the matching portion of the string is returned. If a capture follows the regular expression, which may be a capture group index or name, follows the regular expression that component of the MatchData is returned instead.
report_path = '/usr/share/filebeat/reports/ui/local/20200904_151507/API/API_Test_suite/20200904_151508/20200904_151508.csv'
# returns the full match if no second parameter is passed
report_path[/(\d{8})_(\d{6})/]
# => "20200904_151507"
# returns capture group number 2
report_path[/(\d{8})_(\d{6})/, 2]
# => "151507"
# returns the capture group called "date"
report_path[/(?'date'\d{8})_(?'id'\d{6})/, 'date']
# => "20200904"
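For comparison, the same ideas can be sketched in Python's re module. This is only an analog of the Ruby calls above, not part of the original answer: re.search plays the role of match, re.findall the role of scan, and named groups use the (?P<name>...) syntax.

```python
import re

report_path = ('/usr/share/filebeat/reports/ui/local/20200904_151507/'
               'API/API_Test_suite/20200904_151508/20200904_151508.csv')

# re.search returns the first match anywhere in the string (or None).
m = re.search(r'(?P<date>\d{8})_(?P<id>\d{6})', report_path)
first = m.group(0)        # whole match
date = m.group('date')    # named group
run_id = m.group('id')    # named group

# re.findall returns every non-overlapping match, like Ruby's scan.
all_matches = re.findall(r'\d{8}_\d{6}', report_path)

# With ^ and $ the pattern must cover the whole string, so nothing matches.
anchored = re.search(r'^\d{8}_\d{6}$', report_path)

print(first, date, run_id)
```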

How to replace different matching groups with different text in Regex

I have the following text:
dangernounC2
cautionnounC2
alertverbC1
dangerousadjectiveB1
What I need as an output is:
danger (n)
caution (n)
alert (v)
dangerous (adj)
I would know how to do this if the list was, for example, all nouns or all verbs etc., but is there a way to replace each matching group with different corresponding text?
Here is a regular expression that would work for you. It is a bit of a trick, though: it only works because the replacement text (n, v, adj) is itself part of the matched text.
Regular expression
(n)ounC2|(v)erbC1|(adj)ectiveB1
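Substitution (note the leading space)
 ($1$2$3)
Use  (\1\2\3), again with a leading space, if you're using Python (3.5 or later, where unmatched groups are replaced with an empty string)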
Explanation
(n)ounC2|(v)erbC1|(adj)ectiveB1 will match either nounC2, verbC1 or adjectiveB1
When it matches nounC2, Group 1 will contain n, Group 2 and 3 contain nothing
When it matches verbC1, Group 2 will contain v, Group 1 and 3 contain nothing
When it matches adjectiveB1, Group 3 will contain adj, Group 1 and 2 contain nothing
Every match is replaced with a space followed by the values of the 3 groups in parentheses.
Demos
Demo on RegEx101
Code snippet (JavaScript)
const regex = /(n)ounC2|(v)erbC1|(adj)ectiveB1/gm;
const str = `
dangernounC2
cautionnounC2
alertverbC1
dangerousadjectiveB1
eatverbC1
prettyadjectiveB1`;
const subst = ` ($1$2$3)`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
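A Python version of the same substitution might look like the sketch below. The second variant is an alternative that is not in the original answer: it looks the label up in a dictionary, so the replacement no longer has to be part of the match. The abbrev mapping and the [A-C][12] level suffix are assumptions based on the sample data.

```python
import re

text = "dangernounC2\ncautionnounC2\nalertverbC1\ndangerousadjectiveB1"

# Variant 1: the alternation trick. On Python 3.5+ unmatched groups
# expand to an empty string in the replacement template.
trick = re.sub(r'(n)ounC2|(v)erbC1|(adj)ectiveB1', r' (\1\2\3)', text)

# Variant 2 (assumed mapping): pick the label with a replacement function.
abbrev = {'noun': 'n', 'verb': 'v', 'adjective': 'adj'}
pattern = re.compile(r'(noun|verb|adjective)[A-C][12]$', re.MULTILINE)
mapped = pattern.sub(lambda m: f' ({abbrev[m.group(1)]})', text)

print(mapped)
```

The dictionary variant scales better when the output labels are unrelated to the matched words.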

match url params with regex

I'm trying to extract URL params with a regex.
Here is an example string: param1=val1&param2=val2&adv=val3&param3=val4&param4=val5
This is the regex I'm using right now:
(\&)([^=]+)\=([^&]+)
I can't figure out how to match the first param. I want to have param1 in group 2 and val1 in group 3, like the rest of the param matches.
https://regex101.com/r/Qzxyyo/1
How can I do this?
Edit:
So this seems to work (meaning param1 is in group 2 and val1 is in group 3), but I don't understand why it works or whether it is reliable:
(\&|^)([^=]+)\=([^&]+)
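(\&)([^=]+)\=([^&]+)
Break it down!
\& looks for a & character
[^=]+ matches characters up until it hits a =
\= matches the = character
[^&]+ matches characters up until the next & (or the end of the string)
() these define the groups!
In your edited pattern, (\&|^) matches either a & or the start of the string, which is why the first param is captured as well.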
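To see the breakdown in action, here is a small Python sketch that runs both patterns over the sample string. The comparison with urllib.parse.parse_qsl is an addition of this example, not part of the original answer; for real query strings a dedicated parser is usually the more reliable choice.

```python
import re
from urllib.parse import parse_qsl

query = 'param1=val1&param2=val2&adv=val3&param3=val4&param4=val5'

# Original pattern: every match must start with '&', so param1 is skipped.
without_anchor = [(m.group(2), m.group(3))
                  for m in re.finditer(r'(&)([^=]+)=([^&]+)', query)]

# Edited pattern: '&' OR start of string, so param1 is included.
with_anchor = [(m.group(2), m.group(3))
               for m in re.finditer(r'(&|^)([^=]+)=([^&]+)', query)]

# Standard-library parser for comparison.
parsed = parse_qsl(query)

print(with_anchor)
```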

Scala - Explanation for regex statement

Assuming I have a dataframe called df and regex as follows:
import scala.util.matching.Regex
var df2 = df
val regex = new Regex("_(.)")
for (col <- df.columns) {
  df2 = df2.withColumnRenamed(col, regex.replaceAllIn(col, { M => M.group(1).toUpperCase }))
}
I know that this code is renaming columns of df2 such that if I had a column name called "user_id", it would become userId.
I understand what the withColumnRenamed and replaceAllIn functions do. What I do not understand is this part: { M => M.group(1).toUpperCase }
What is M? What is group(1)?
I can guess what is happening because I know that the expected output is userId but I do not think I fully understand how this is happening.
Could someone help me understand this? Would really appreciate it.
Thanks!
M is just the parameter name standing for the match, and group(1) refers to the first group captured by the regex. Consider this example:
World Cup
If you want to match the example above with a regex, you could write something like \w+\s\w+. However, you can make use of groups and write it this way:
(\w+)\s(\w+)
Parentheses in a regex are used to indicate groups. In the example above, the first (\w+) is group 1, which matches World. The second (\w+) is group 2, which matches Cup. If you want the whole match, use group 0.
See the groups in action here on the right side:
https://regex101.com/r/v0Ybsv/1
The signature of the replaceAllIn method is
replaceAllIn(target: CharSequence, replacer: (Match) ⇒ String): String
So that M is a Match and it has a group method, which returns
The matched string in group i, or null if nothing was matched
A group in regex is what's matched by the (sub)regex in parentheses: (.), i.e. one character in your case. You can have several capturing groups, and you can name them or refer to them by index. You can read more about capturing groups here and in the Scala API docs for Regex.
So { M => M.group(1).toUpperCase } means that every match is replaced with the character that follows _, converted to upper case.
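The same callback idea can be sketched in Python for comparison, where re.sub accepts a function that receives each match object. The helper name snake_to_camel and the sample column names are made up for illustration.

```python
import re

def snake_to_camel(name):
    # Each match is '_' plus one character; group(1) is that character.
    # The whole match (underscore included) is replaced by its upper case.
    return re.sub(r'_(.)', lambda m: m.group(1).upper(), name)

columns = ['user_id', 'created_at_utc', 'name']
renamed = [snake_to_camel(c) for c in columns]

print(renamed)
```

Because the underscore is part of the match but outside the group, it disappears from the result while the captured character survives in upper case.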

1 to 5 of the same groups in REGEX

For a string such as:
abzyxcabkmqfcmkcde
Notice the substrings between ab and c: zyx and kmqf. To capture the first one:
ab([a-z]{3,5})c
Is it possible to match both of the groups from the sample string? Actually, there should be 1 to 5 groups.
Note: python style regex.
You can verify that a given string conforms to the 1-5 repetitions of ab([a-z]{3,5})c using this regex
(?:ab([a-z]{3,5})c){1,5}
or this one if there are characters expected between the groups
(?:ab([a-z]{3,5})c.*?){1,5}
However, you will only be able to extract the last repetition of the capture group from that match, not any of the previous ones. To get the earlier ones you need to use hsz's approach.
Just match all results, e.g. with the g flag:
/ab([a-z]{3,5})c/g
or with a function such as Python's:
re.findall(pattern, string, flags=0)
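Both behaviours can be checked on the sample string with a short Python sketch. The 1 to 5 count check at the end is one possible way to enforce the limit from the question, not something the answers prescribe.

```python
import re

s = 'abzyxcabkmqfcmkcde'

# findall returns the captured group for every non-overlapping match.
groups = re.findall(r'ab([a-z]{3,5})c', s)

# A repeated group only remembers its last repetition.
m = re.search(r'(?:ab([a-z]{3,5})c){1,5}', s)
whole = m.group(0)
last_only = m.group(1)

# Assumed check: require between 1 and 5 captured groups.
count_ok = 1 <= len(groups) <= 5

print(groups, last_only)
```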