Scala - Explanation for regex statement - regex

Assuming I have a dataframe called df and regex as follows:
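import scala.util.matching.Regex

var df2 = df
val regex = new Regex("_(.)")
for (col <- df.columns) {
  // Replace every "_x" in the column name with "X", e.g. "user_id" -> "userId"
  df2 = df2.withColumnRenamed(col, regex.replaceAllIn(col, { M => M.group(1).toUpperCase }))
}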
I know that this code is renaming columns of df2 such that if I had a column name called "user_id", it would become userId.
I understand what the withColumnRenamed and replaceAllIn functions do. What I do not understand is this part: { M => M.group(1).toUpperCase }
What is M? What is group(1)?
I can guess what is happening because I know that the expected output is userId but I do not think I fully understand how this is happening.
Could someone help me understand this? Would really appreciate it.
Thanks!

M just stands for match, and group(1) refers to the first group captured by the regex. Consider this example:
World Cup
If you want to match the example above with a regex, you could write something like \w+\s\w+. However, you can also make use of groups and write it this way:
(\w+)\s(\w+)
Parentheses in a regex are used to indicate groups. In the example above, the first (\w+) is group 1, which matches World. The second (\w+) is group 2, which matches Cup. Group 0 always refers to the whole match, here World Cup.
See the groups in action here on the right side:
https://regex101.com/r/v0Ybsv/1
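The same group numbering can be checked outside Scala. Here is a minimal sketch using Python's re module (not the asker's Scala code; the variable names are only illustrative):

```python
import re

# Two capturing groups separated by a single whitespace character.
match = re.fullmatch(r"(\w+)\s(\w+)", "World Cup")

whole = match.group(0)   # the entire match
first = match.group(1)   # text captured by the first (\w+)
second = match.group(2)  # text captured by the second (\w+)

print(whole, first, second, sep=" | ")  # World Cup | World | Cup
```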

The signature of the replaceAllIn method is
replaceAllIn(target: CharSequence, replacer: (Match) ⇒ String): String
So that M is a Match and it has a group method, which returns
The matched string in group i, or null if nothing was matched
A group in a regex is what is matched by the (sub)regex in parentheses: (.), i.e. one symbol, in your case. You can have several capturing groups, and you can name them or refer to them by index. You can read more about capturing groups in the Scala API docs for Regex.
So { M => M.group(1).toUpperCase } means that you replace every match with the symbol in it that goes after _ changed to upper case.
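For comparison, a rough Python analogue of the same idea, where re.sub accepts a callback that receives a match object for every match. The helper name snake_to_camel is made up for this sketch:

```python
import re

def snake_to_camel(name: str) -> str:
    # The callback is called once per "_x" match; group(1) is the single
    # character captured after the underscore.
    return re.sub(r"_(.)", lambda m: m.group(1).upper(), name)

print(snake_to_camel("user_id"))  # userId
```

The underscore itself is part of the match but not part of group 1, which is why it disappears from the result.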

Related

Match same number of repetitions as previous group

I'm trying to match strings that are repeated the same number of times, like
abc123
abcabc123123
abcabcabc123123123
etc.
That is, I want the second group (123) to be matched the same number of times as the first group (abc). Something like
(abc)+(123){COUNT THE PREVIOUS GROUP MATCHED}
This is using the Rust regex crate https://docs.rs/regex/1.4.2/regex/
Edit: As I feared, and as pointed out by the answers and comments, this is not possible to represent in a regex, at least not without some sort of recursion, which the Rust regex crate doesn't support for the time being. In this case, as I know the input length is limited, I just generated a rule like
(abc123)|(abcabc123123)|(abcabcabc123123123)
Horribly ugly, but got the job done, as this wasn't "serious" code, just a fun exercise.
As others have commented, I don't think it's possible to accomplish this in a single regex. If you can't guarantee the strings are well-formed then you'd have to validate them with the regex, capture each group, and then compare the group lengths to verify they are of equal repetitions. However, if it's guaranteed all strings will be well-formed then you don't even need to use regex to implement this check:
fn matching_reps(string: &str, group1: &str, group2: &str) -> bool {
    // Assumes a well-formed string: group1 repeated, then group2 repeated.
    let group2_start = string.find(group2).unwrap();
    // Everything before the first group2 consists of group1 repetitions.
    let group1_reps = group2_start / group1.len();
    // Everything from there to the end consists of group2 repetitions.
    let group2_reps = (string.len() - group2_start) / group2.len();
    group1_reps == group2_reps
}

fn main() {
    assert_eq!(matching_reps("abc123", "abc", "123"), true);
    assert_eq!(matching_reps("abcabc123", "abc", "123"), false);
    assert_eq!(matching_reps("abcabc123123", "abc", "123"), true);
    assert_eq!(matching_reps("abcabc123123123", "abc", "123"), false);
}
Pure regular expressions are not able to represent that.
There may be some way to define back references, but I am not familiar with regexp syntax in Rust, and this would technically be a way to represent something more than a pure regular expression.
There is, however, a simple way to compute it:
use a regexp to make sure your string is a ^((abc)*)((123)*)$
if your string matches, take the two captured substrings, and compare their lengths
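The two steps above can be sketched as follows in Python (the question targets the Rust regex crate, so this only illustrates the approach). The function name same_repetitions is hypothetical, and rejecting the empty string is an assumption taken from the (abc)+ in the question:

```python
import re

# Step 1: check the overall shape, some "abc" followed by some "123".
SHAPE = re.compile(r"((?:abc)*)((?:123)*)")

def same_repetitions(text: str) -> bool:
    m = SHAPE.fullmatch(text)
    if m is None:
        return False
    # Step 2: compare how many times each part was repeated.
    reps_abc = len(m.group(1)) // len("abc")
    reps_123 = len(m.group(2)) // len("123")
    # Assumption: at least one repetition is required.
    return reps_abc > 0 and reps_abc == reps_123

print(same_repetitions("abcabc123123"))  # True
print(same_repetitions("abcabc123"))     # False
```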
Building a pattern dynamically is also an option. Matching one, two or three nested abc and 123 is possible with
abc(?:abc(?:abc(?:)?123)?123)?123
The innermost (?:)? is redundant, since it matches no text; each (?:...)? matches an optional pattern.
Rust snippet:
let a = "abc"; // Prefix
let b = "123"; // Suffix
let level = 3; // Recursion (repetition) level
let mut result = "".to_string();
for _n in 0..level {
    result = format!("{}(?:{})?{}", a, result, b);
}
println!("{}", result);
// abc(?:abc(?:abc(?:)?123)?123)?123
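To check that a generated pattern of this kind behaves as intended, here is a small Python sketch of the same construction. The function name nested_pattern is made up, and unlike the Rust snippet it skips the redundant empty group at the innermost level:

```python
import re

def nested_pattern(prefix: str, suffix: str, levels: int) -> str:
    pattern = ""
    for _ in range(levels):
        # Wrap the pattern built so far in an optional non-capturing group.
        inner = f"(?:{pattern})?" if pattern else ""
        pattern = f"{re.escape(prefix)}{inner}{re.escape(suffix)}"
    return pattern

pattern = nested_pattern("abc", "123", 3)
print(pattern)  # abc(?:abc(?:abc123)?123)?123

for candidate in ["abc123", "abcabc123123", "abcabc123"]:
    print(candidate, bool(re.fullmatch(pattern, candidate)))
```

The pattern only covers up to the chosen number of levels, so four repetitions would not match a three-level pattern.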
Many regex libraries support an extension that dates back to early Unix tools: the backreference. It matches, literally, the text that an earlier group has already captured. (The Rust regex crate does not support backreferences.)
For example, let's say you have a number, and that number must be equal to another one (e.g. the score of a soccer game, where you are interested only in draws between the two teams). You can use the following regexp:
([0-9][0-9]*)-\1
Suppose we feed it "123-123": it matches. With "123-12" it does not, because \1 is not the same string as what the first group matched. Once the first group has matched, \1 behaves like the literal sequence of characters that the group captured.
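But there's a problem with your sample: a backreference repeats the captured text, not the number of repetitions, so it cannot express (abc)+ followed by the same count of 123. A pattern with no separator, such as
([0-9][0-9]*)\1
does still match 123123 in a backtracking engine: the first group initially takes all six digits, and the engine then backtracks until the group holds 123 and \1 can match the remainder.
Backreferences are useful for patterns like this one: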
\+\(([0-9][0-9]*)\)\1(-\1)*
and this will match phone numbers in the form
+(358)358-358-358
or
+(1)1-1-1-1-1-1-1
(The number between the parentheses is captured as a sample, and the backreference then builds a sequence of that number separated by dashes.)
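A quick way to try backreferences is Python's re module, which supports \1. This is only a sketch with made-up variable names, and it writes [0-9]+ where the answer writes [0-9][0-9]*:

```python
import re

# A draw: the same number on both sides of the dash.
draw = re.compile(r"([0-9]+)-\1")
# A plus sign, a number in parentheses, then that same number repeated.
phone = re.compile(r"\+\(([0-9]+)\)\1(-\1)*")

print(bool(draw.fullmatch("123-123")))             # True
print(bool(draw.fullmatch("123-12")))              # False
print(bool(phone.fullmatch("+(358)358-358-358")))  # True

# Without a separator, the engine backtracks until group 1 holds "123".
doubled = re.fullmatch(r"([0-9]+)\1", "123123")
print(doubled.group(1))  # 123
```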

Python Regex - How to extract the third portion?

My input is of this format: (xxx)yyyy(zz)(eee)fff where {x,y,z,e,f} are all digits. The fff part is optional, though.
Input: x = (123)4567(89)(660)
Expected output: Only the eee part, i.e. the number inside the 3rd "()", which is 660 in my example.
I am able to achieve this so far:
re.search(r"\((\d*)\)", x).group()
Output: (123)
Expected: (660)
I am surely missing something fundamental. Please advise.
Edit 1: Just added fff to the input data format.
You could find all those matches that have round braces (), and print the third match with findall
import re
n = "(123)4567(89)(660)999"
r = re.findall(r"\(\d*\)", n)
print(r[2])
Output:
(660)
The (eee) part is identical to the (xxx) part in your regex. If you don't provide an anchor, or some sequencing requirement, then an unanchored search will match the first thing it finds, which is (xxx) in your case.
If you know the (eee) always appears at the end of the string, you could append an "at-end" anchor ($) to force the match at the end. Or perhaps you could append a following character, like a space or comma or something.
Otherwise, you might do well to match the other parts of the pattern and not capture them:
pattern = r'[0-9()]{13}\((\d{3})\)'
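As a rough check of both suggestions, the sketch below runs the end anchor and the fixed-width prefix against the sample input, with and without the optional fff part. The variable names and the 999 suffix are only for illustration:

```python
import re

without_fff = "(123)4567(89)(660)"
with_fff = "(123)4567(89)(660)999"

# End anchor: only works when (eee) is the last thing in the string.
anchored = re.search(r"\((\d+)\)$", without_fff)
anchored_fff = re.search(r"\((\d+)\)$", with_fff)

# Fixed-width prefix: skip the 13 characters of "(xxx)yyyy(zz)" first.
pattern = r"[0-9()]{13}\((\d{3})\)"
prefixed = re.match(pattern, without_fff)
prefixed_fff = re.match(pattern, with_fff)

print(anchored.group(1))      # 660
print(anchored_fff)           # None
print(prefixed.group(1))      # 660
print(prefixed_fff.group(1))  # 660
```

The fixed-width version depends on the field widths never changing, so it is the more brittle of the two if the format varies.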
If you want to get the third group of numbers in brackets, you need to skip the first two groups which you can do with a repeating non-capturing group which looks for a set of digits enclosed in () followed by some number of non ( characters:
import re
x = '(123)4567(89)(660)'
print(re.search(r"(?:\(\d+\)[^(]*){2}(\(\d+\))", x).group(1))
Output:
(660)

How to replace partial groups in python regex?

I have a regex
(obligor_id): (\d+);(obligor_id): (\d+):
A sample match like below:
Match 1
Full match 57-95 `obligor_id: 505732;obligor_id: 505732:`
Group 1. 57-67 `obligor_id`
Group 2. 69-75 `505732`
Group 3. 76-86 `obligor_id`
Group 4. 88-94 `505732`
I am trying to partially replace the full match to the following:
obligor_id: 505732;obligor_id: 505732: -> obligor_id: 505732;
There are two ways to achieve this:
replace groups 3 and 4 with an empty string
replace groups 1 and 2 with an empty string, and then replace group 4 with (\d+);
How can I achieve these 2 in python? I know there is a re.sub function, but I only know how to replace the whole, not partially replace group.
Thanks in advance.
You can change capturing groups and reference them in the substitution string:
import re
s = 'obligor_id: 505732;obligor_id: 505732:'
re.sub(r'(obligor_id: \d+;)(obligor_id: \d+:)', r'\1', s)
# => 'obligor_id: 505732;'
Thanks for the answers and advice.
For future readers, this is how I achieved both:
re.sub(regex, r'\1: \2;', str)
re.sub(regex, r'\3: \4;', str)
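Here is a self-contained sketch of those two calls. The sample text uses two different numbers (an invented value, 777) purely so the difference between the results is visible:

```python
import re

regex = r"(obligor_id): (\d+);(obligor_id): (\d+):"
# Hypothetical input with distinct ids in the two halves.
text = "obligor_id: 505732;obligor_id: 777:"

# Rebuild the match from groups 1 and 2, dropping groups 3 and 4.
keep_first = re.sub(regex, r"\1: \2;", text)
# Rebuild the match from groups 3 and 4, dropping groups 1 and 2.
keep_second = re.sub(regex, r"\3: \4;", text)

print(keep_first)   # obligor_id: 505732;
print(keep_second)  # obligor_id: 777;
```

re.sub always replaces the whole match, so a "partial" replacement means writing back the groups you want to keep.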

Compound Words - Regex [duplicate]

I would expect this line of JavaScript:
"foo bar baz".match(/^(\s*\w+)+$/)
to return something like:
["foo bar baz", "foo", " bar", " baz"]
but instead it returns only the last captured match:
["foo bar baz", " baz"]
Is there a way to get all the captured matches?
When you repeat a capturing group, in most flavors, only the last capture is kept; any previous capture is overwritten. In some flavors, e.g. .NET, you can get all intermediate captures, but this is not the case with JavaScript.
That is, in JavaScript, if you have a pattern with N capturing groups, you can capture at most N strings per match, even if some of those groups were repeated.
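Python's re module behaves the same way, which makes for a quick demonstration. This is a sketch in Python, not the JavaScript from the question:

```python
import re

text = "foo bar baz"

# A repeated capturing group keeps only its final iteration.
repeated = re.match(r"^(\s*\w+)+$", text)
last_only = repeated.groups()

# Matching the inner pattern repeatedly returns every word instead.
all_words = re.findall(r"\w+", text)

print(last_only)  # (' baz',)
print(all_words)  # ['foo', 'bar', 'baz']
```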
So generally speaking, depending on what you need to do:
If it's an option, split on delimiters instead
Instead of matching /(pattern)+/, maybe match /pattern/g, perhaps in an exec loop
Do note that these two aren't exactly equivalent, but it may be an option
Do multilevel matching:
Capture the repeated group in one match
Then run another regex to break that match apart
References
regular-expressions.info/Repeating a Capturing Group vs Capturing a Repeating Group
Javascript flavor notes
Example
Here's an example of matching <some;words;here> in a text, using an exec loop, and then splitting on ; to get individual words (see also on ideone.com):
var text = "a;b;<c;d;e;f>;g;h;i;<no no no>;j;k;<xx;yy;zz>";
var r = /<(\w+(;\w+)*)>/g;
var match;
while ((match = r.exec(text)) != null) {
  // Group 1 holds the whole word list; split it on the delimiter.
  console.log(match[1].split(";").join(","));
}
// c,d,e,f
// xx,yy,zz
The pattern used is:
      _2__
     /    \
<(\w+(;\w+)*)>
 \__________/
      1
This matches <word>, <word;another>, <word;another;please>, etc. Group 2 is repeated to capture any number of words, but it can only keep the last capture. The entire list of words is captured by group 1; this string is then split on the semicolon delimiter.
Related questions
How do you access the matched groups in a javascript regex?
How about this? "foo bar baz".match(/(\w+)+/g)
Unless you have a more complicated requirement for how you're splitting your strings, you can split them, and then return the initial string with them:
var data = "foo bar baz";
var pieces = data.split(' ');
pieces.unshift(data);
Try using the g flag:
"foo bar baz".match(/\w+/g)
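You can match one word at a time instead of repeating the group with +.
Note that a ? directly after a group makes the group optional; it acts as a lazy modifier only when it follows another quantifier, as in *? or +?. Applied with the global flag, the pattern below matches the words one at a time: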
REGEX: (\s*\w+)?
RESULT:
Match 1: foo
Match 2: bar
Match 3: baz

scala regex to match tab separated words from a string

I'm trying to match the following string
"name type this is a comment"
Name and type are definitely there.
Comment may or may not exist.
I'm trying to store this into variables n,t and c.
val nameTypeComment = """^(\w+\s+){2}(?:[\w+\s*)*\(\,\,]+)"""
str match { case nameType(n, t, c) => print(n,t,c) }
This is what I have but doesn't seem to be working. Any help is appreciated.
val nameType = """^(\w+)\s+([\w\)\(\,]+)""".r
However, this works when I try it on strings that contain only a name and a type. The comment is a group of words that may or may not be there.
Note that ^(\w+\s+){2}(?:[\w+\s*)*\(\,\,]+) regex only contains 1 capturing group ((\w+\s+)) while you define 3 in the match block.
The ^(\w+)\s+([\w\)\(\,]+) only contains 2 capturing groups: (\w+) and ([\w\)\(\,]+).
To make your code work, you need to define 3 capturing groups. Also, it is not clear what the separators are, so let me assume the first two fields are each 1 or more alphanumeric/underscore symbols, separated by 1 or more whitespace characters. The comment is anything after the first two fields.
Then, use
val s = "name type this comment a comment"
val nameType = """(\w+)\s+(\w+)\s+(.*)""".r
val res = s match {
  case nameType(n, t, c) => print(n,t,c)
  case _ => print("NONE")
}
Note that we need to compile a regex object: pay attention to the .r after the regex pattern assigned to nameType.
Note that a regex pattern used inside match must match the entire string, so the start-of-string anchor ^ can be omitted.
Also, it is a good idea to add case _ to define the behavior when no match is found.
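The question says the comment may be missing, while the pattern above requires whitespace after the type. One possible variation is to make the whole comment part optional. The sketch below shows the idea in Python rather than Scala; the name parse and the adjusted pattern are assumptions, not part of the answer:

```python
import re

# The trailing "whitespace + comment" part is wrapped in an optional group.
LINE = re.compile(r"(\w+)\s+(\w+)(?:\s+(.*))?")

def parse(line: str):
    m = LINE.fullmatch(line)
    if m is None:
        return None  # plays the role of the `case _` branch
    name, type_, comment = m.groups()
    return name, type_, comment  # comment is None when it is absent

print(parse("name type this is a comment"))  # ('name', 'type', 'this is a comment')
print(parse("name type"))                    # ('name', 'type', None)
print(parse("name"))                         # None
```

In Scala the same pattern would bind a null to the third variable when the comment is missing, so that case would need handling there as well.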