Scala regex to match tab separated words from a string

I'm trying to match the following string
"name type this is a comment"
Name and type are definitely there.
Comment may or may not exist.
I'm trying to store this into variables n,t and c.
val nameTypeComment = """^(\w+\s+){2}(?:[\w+\s*)*\(\,\,]+)"""
str match { case nameType(n, t, c) => print(n,t,c) }
This is what I have, but it doesn't seem to be working. Any help is appreciated.
val nameType = """^(\w+)\s+([\w\)\(\,]+)""".r
However, this works for strings that contain only a name and a type, without the comment (a group of words that may or may not be there).

Note that ^(\w+\s+){2}(?:[\w+\s*)*\(\,\,]+) regex only contains 1 capturing group ((\w+\s+)) while you define 3 in the match block.
The ^(\w+)\s+([\w\)\(\,]+) only contains 2 capturing groups: (\w+) and ([\w\)\(\,]+).
To make your code work, you need to define 3 capturing groups. Also, it is not clear what the separators are, so let me assume the first two fields are one or more alphanumeric/underscore characters separated by one or more whitespace characters. The comment is anything after the first two fields.
Then, use
val s = "name type this comment a comment"
val nameType = """(\w+)\s+(\w+)\s+(.*)""".r
val res = s match {
  case nameType(n, t, c) => print(n,t,c)
  case _ => print("NONE")
}
Note that we need to compile a regex object: pay attention to the .r after the regex pattern assigned to nameType.
Note that a pattern inside match is anchored by default, so the start-of-string anchor ^ can be omitted.
Also, it is a good idea to add case _ to define the behavior when no match is found.
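The question says the comment may be absent, while the pattern above requires whitespace after the type. A minimal sketch of one way to make the third field optional follows. The object name NameTypeComment and the parse helper are illustrative and not from the original code. A group that did not take part in the match is bound to null, hence the Option wrapper.

```scala
object NameTypeComment {
  // Same first two fields as above; the comment and its leading whitespace
  // sit in an optional non-capturing group, so there are still 3 capturing groups.
  private val Line = """(\w+)\s+(\w+)(?:\s+(.*))?""".r

  // Returns (name, type, optional comment), or None if the line does not match.
  def parse(line: String): Option[(String, String, Option[String])] = line match {
    case Line(n, t, c) => Some((n, t, Option(c))) // c is null when the comment is absent
    case _             => None
  }
}
```

For example, NameTypeComment.parse("name type") gives a result with no comment instead of falling through to the default case.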

Related

Match same number of repetitions as previous group

I'm trying to match strings that are repeated the same number of times, like
abc123
abcabc123123
abcabcabc123123123
etc.
That is, I want the second group (123) to be matched the same number of times as the first group (abc). Something like
(abc)+(123){COUNT THE PREVIOUS GROUP MATCHED}
This is using the Rust regex crate https://docs.rs/regex/1.4.2/regex/
Edit: As I feared, and as pointed out by answers and comments, this is not possible to represent in regex, at least not without some sort of recursion, which the Rust regex crate doesn't support for the time being. In this case, as I know the input length is limited, I just generated a rule like
(abc123)|(abcabc123123)|(abcabcabc123123123)
Horribly ugly, but got the job done, as this wasn't "serious" code, just a fun exercise.
As others have commented, I don't think it's possible to accomplish this in a single regex. If you can't guarantee the strings are well-formed then you'd have to validate them with the regex, capture each group, and then compare the group lengths to verify they are of equal repetitions. However, if it's guaranteed all strings will be well-formed then you don't even need to use regex to implement this check:
fn matching_reps(string: &str, group1: &str, group2: &str) -> bool {
    // Byte offset where the second group starts; assumes well-formed input.
    let group2_start = string.find(group2).unwrap();
    // Everything before that offset belongs to group1, everything after to group2.
    let group1_reps = group2_start / group1.len();
    let group2_reps = (string.len() - group2_start) / group2.len();
    group1_reps == group2_reps
}
fn main() {
    assert_eq!(matching_reps("abc123", "abc", "123"), true);
    assert_eq!(matching_reps("abcabc123", "abc", "123"), false);
    assert_eq!(matching_reps("abcabc123123", "abc", "123"), true);
    assert_eq!(matching_reps("abcabc123123123", "abc", "123"), false);
}
Pure regular expressions are not able to represent that.
There may be some way to define back references, but I am not familiar with regexp syntax in Rust, and this would technically be a way to represent something more than a pure regular expression.
There is, however, a simple way to compute it:
1. use a regexp to make sure your string has the form ^((abc)*)((123)*)$
2. if your string matches, take the two captured substrings and compare their lengths
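A minimal Scala sketch of these two steps follows. The same idea should carry over to the Rust regex crate, which supports capture groups. The object and method names are illustrative, and both groups are assumed to be non-empty. Unlike the pattern above, the sketch uses + so that the empty string is rejected.

```scala
import java.util.regex.Pattern

object MatchingReps {
  // Step 1: check the overall shape (one or more a's, then one or more b's).
  // Step 2: compare how many times each group was repeated.
  // Assumes a and b are non-empty; Pattern.quote treats them as literal text.
  def sameRepetitions(s: String, a: String, b: String): Boolean = {
    val shape = s"((?:${Pattern.quote(a)})+)((?:${Pattern.quote(b)})+)".r
    s match {
      case shape(as, bs) => as.length / a.length == bs.length / b.length
      case _             => false
    }
  }
}
```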
Building a pattern dynamically is also an option. Matching one, two or three nested abc and 123 is possible with
abc(?:abc(?:abc(?:)?123)?123)?123
The innermost (?:)? is redundant because it matches no text; each (?:...)? wraps an optional nested pattern.
Rust snippet:
fn main() {
    let a = "abc"; // Prefix
    let b = "123"; // Suffix
    let level = 3; // Nesting (repetition) level
    let mut result = String::new();
    for _ in 0..level {
        result = format!("{}(?:{})?{}", a, result, b);
    }
    println!("{}", result);
    // abc(?:abc(?:abc(?:)?123)?123)?123
}
Many regex libraries implement an extension, dating back to the early Unix tools, called a backreference: it matches the exact text that an earlier group has already captured.
For example, let's say you have a number, and that number must be equal to another (e.g. the score of a soccer game, where you are interested only in draws between the two teams). You can use the following regexp:
([0-9][0-9]*)-\1
Suppose we feed it "123-123": it will match. If we use "123-12", it will not match, as \1 is not the same string as what was matched in the first group. Once the first group is matched, the engine treats \1 as the literal sequence of characters captured by that group.
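Your sample is a different problem, though: a backreference repeats the text a group captured, not the number of times it was repeated. For instance,
([0-9][0-9]*)\1
matches 123123 (the engine backtracks until the first group holds 123 and \1 matches the second 123), but there is no way to say "repeat 123 as many times as abc was repeated". Also note that the Rust regex crate does not support backreferences at all.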
Where backreferences are available, this means that you can use, for example:
\+\(([0-9][0-9]*)\)\1(-\1)*
and this will match phone numbers in the form
+(358)358-358-358
or
+(1)1-1-1-1-1-1-1
(the number between the parentheses is captured as a sample, and then the group is used to build a sequence of that number separated by dashes).
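As a small illustration of the draw-score idea, here is a sketch in Scala, whose regex engine (java.util.regex) does support backreferences. The object name DrawScore is made up for the example, and the score format digits-dash-digits is an assumption.

```scala
object DrawScore {
  // \1 must repeat exactly the text captured by group 1.
  private val Draw = """(\d+)-\1""".r

  // Pattern matching requires the whole string to match the pattern.
  def isDraw(score: String): Boolean = score match {
    case Draw(_) => true
    case _       => false
  }
}
```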

How to replace different matching groups with different text in Regex

I have the following text:
dangernounC2
cautionnounC2
alertverbC1
dangerousadjectiveB1
What I need as an output is:
danger (n)
caution (n)
alert (v)
dangerous (adj)
I would know how to do this if the list was, for example, all nouns or all verbs etc., but is there a way to replace each matching group with different corresponding text?
Here is a regular expression that would work for you. But it's a kind of trick that only works because this substitution is part of the match.
Regular expression
(n)ounC2|(v)erbC1|(adj)ectiveB1
Substitution (note the leading space)
 ($1$2$3)
Use " (\1\2\3)" instead if you're using Python
Explanation
(n)ounC2|(v)erbC1|(adj)ectiveB1 will match either nounC2, verbC1 or adjectiveB1
When it matches nounC2, Group 1 will contain n, Group 2 and 3 contain nothing
When it matches verbC1, Group 2 will contain v, Group 1 and 3 contain nothing
When it matches adjectiveB1, Group 3 will contain adj, Group 1 and 2 contain nothing
Every match is replaced with a space followed by the values of the 3 groups in parentheses.
Demos
Demo on RegEx101
Code snippet (JavaScript)
const regex = /(n)ounC2|(v)erbC1|(adj)ectiveB1/gm;
const str = `
dangernounC2
cautionnounC2
alertverbC1
dangerousadjectiveB1
eatverbC1
prettyadjectiveB1`;
const subst = ` ($1$2$3)`;
// The substituted value will be contained in the result variable
const result = str.replace(regex, subst);
console.log('Substitution result: ', result);
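When the replacement text is not a substring of the match, a replacement function with a lookup table is another option. The following Scala sketch shows that variant, which is a different technique from the alternation trick above. The table contents, the names, and the level-code shape [A-C][12] are assumptions based on the sample data.

```scala
import scala.util.matching.Regex

object PosAbbreviator {
  // Hypothetical lookup table: full part-of-speech name -> abbreviation.
  private val Abbreviations = Map("noun" -> "n", "verb" -> "v", "adjective" -> "adj")

  // Part of speech followed by a level code such as C2 or B1 at the end of the line.
  private val Suffix: Regex = """(noun|verb|adjective)[A-C][12]$""".r

  // The function receives each Match and returns its replacement text;
  // quoteReplacement keeps '$' and '\' from being interpreted in the result.
  def abbreviate(line: String): String =
    Suffix.replaceAllIn(line, m => Regex.quoteReplacement(s" (${Abbreviations(m.group(1))})"))
}
```

Adding a new part of speech then means adding one entry to the map and one alternative to the pattern.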

Regex for n-directory file path

I'm trying to use regex to evaluate if a given path is valid or not:
List of acceptable values:
1. /a_B_1/b_Sc2/c_d3/23_DS_xy/some_file_name.txt
2. /x_y_q/ffs/www/A/a_ol/some_file_name.txt
3. /tsf/ggg/wWw/abc/a_o#l/some=file name.csv
4. /a/b/c/d/some file.txt
As you can see for all groups, accepted range is [a-zA-Z0-9_]. Only the last group can have spaces, #, =.
Group ordering: /<group1>/<group2>/<group3>/<group4>/<group5>.
Group 5 can have sub-directories and hence the '*'.
I've tried:
"""/?[^/\\n]+/([^/\\n]+)/([^/\\n]+)/([^/\\n]+)/([^/\\n]+)/.*""".r
"""/(^[a-zA-Z0-9-_]+)/(^[a-zA-Z0-9-_]+)/(^[a-zA-Z0-9-_]+)/(^[a-zA-Z0-9-_]+)/(^[a-zA-Z0-9-_\\s]*)""".r
"""/([\\w,\\s-_]+)/([\\w,\\s-_]+)/([\\w,\\s-_]+)/([\\w,\\s-_]+)/([\\w,\\s]*)""".r
Can someone please guide?
Sample code
val regex = """ ... """.r
val testString = "/a/b/c/d/some file.txt"
regex.findFirstMatchIn(testString) match {
case Some(r) => println(r)
case _ => println("Regex did not match")
}
Not sure what your question is, but note that you don't need to escape backslashes inside triple-quoted strings.
Something like """(/\w+){4}/([\w\s#=./]+)""".r seems like it should do what you are describing; note the . in the last character class, which is needed to accept file extensions such as .txt.
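A minimal sketch of a full-string check along those lines follows. The object name PathCheck is illustrative, and the final character class includes a literal '.', which is an assumption made here so that extensions such as ".txt" are accepted. Pattern matching is used instead of findFirstMatchIn because the latter also succeeds on a partial match.

```scala
object PathCheck {
  // Four directory levels of word characters, then a remainder that may also
  // contain spaces, '#', '=', '/' and '.' (the '.' is an assumption, see above).
  private val ValidPath = """(?:/\w+){4}/([\w\s#=./]+)""".r

  // A regex used as a pattern must match the whole string, so no anchors are needed.
  def isValid(path: String): Boolean = path match {
    case ValidPath(_) => true
    case _            => false
  }
}
```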

Scala - Explanation for regex statement

Assuming I have a dataframe called df and regex as follows:
import scala.util.matching.Regex
var df2 = df
val regex = new Regex("_(.)")
for (col <- df.columns) {
  df2 = df2.withColumnRenamed(col, regex.replaceAllIn(col, { M => M.group(1).toUpperCase }))
}
I know that this code is renaming columns of df2 such that if I had a column name called "user_id", it would become userId.
I understand what the withColumnRenamed and replaceAllIn functions do. What I do not understand is this part: { M => M.group(1).toUpperCase }
What is M? What is group(1)?
I can guess what is happening because I know that the expected output is userId but I do not think I fully understand how this is happening.
Could someone help me understand this? Would really appreciate it.
Thanks!
M just stands for match, and group(1) refers to capturing group 1 of the regex. Consider this example:
World Cup
If you want to match the example above with a regex, you could write something like \w+\s\w+. However, you can make use of groups and write it this way:
(\w+)\s(\w+)
Parentheses in a regex indicate groups. In the example above, the first (\w+) is group 1, which matches World. The second (\w+) is group 2, which matches Cup. If you want the whole match, use group 0.
See the groups in action here on the right side:
https://regex101.com/r/v0Ybsv/1
The signature of the replaceAllIn method is
replaceAllIn(target: CharSequence, replacer: (Match) ⇒ String): String
So that M is a Match and it has a group method, which returns
The matched string in group i, or null if nothing was matched
A group in a regex is what's matched by the (sub)regex in parentheses ((.), i.e. one character in your case). You can have several capturing groups, and you can name them or refer to them by index. You can read more about capturing groups in the Scala API docs for Regex.
So { M => M.group(1).toUpperCase } means that you replace every match with the character that follows _ in it, converted to upper case.
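The same replacement can be tried without Spark on a plain string. This is a minimal sketch; the object name ColumnNames and the helper toCamelCase are made up for the example, and the quoteReplacement call is an addition that the original code does not have.

```scala
import scala.util.matching.Regex

object ColumnNames {
  // "_(.)" matches an underscore followed by any one character;
  // the parentheses make that character capturing group 1.
  private val UnderscoreThenChar: Regex = "_(.)".r

  def toCamelCase(name: String): String =
    UnderscoreThenChar.replaceAllIn(
      name,
      // m is the Match for one occurrence, e.g. "_i"; m.group(1) is "i".
      // quoteReplacement guards against '$' or '\' in the captured character.
      m => Regex.quoteReplacement(m.group(1).toUpperCase)
    )
}
```

For "user_id", the only match is "_i", group 1 is "i", and the whole match is replaced with "I".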

scala matching optional set of characters

I am using scala regex to extract a token from a URL
my url is http://www.google.com?x=10&id=x10_23&y=2
here I want to extract the value of x10 in front of id. note that _23 is optional and may or may not appear but if it appears it must be removed.
The regex which I have written is
val regex = "^.*id=(.*)(\\_\\d+)?.*$".r
x match {
case regex(id) => print(id)
case _ => print("none")
}
this should work because (\\_\\d+)? should make the _23 optional as a whole.
So I don't understand why it prints none.
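Note that your pattern ^.*id=(.*)(\\_\\d+)?.*$ contains two capturing groups, while case regex(id) binds only one, so the extractor never matches and none is printed. Even with the arity fixed, the pattern would put x10_23&y=2 into Group 1 because of the first greedy dot-matching subpattern: since (_\d+)? is optional, the greedy subpattern does not have to yield any characters to that capture group.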
You can use
val regex = "(?s).*[?&]id=([^\\W&]+?)(?:_\\d+)?(?:&.*)?".r
val x = "http://www.google.com?x=10&id=x10_23&y=2"
x match {
case regex(id) => print(id)
case _ => print("none")
}
Note that there is no need to define ^ and $: the pattern is anchored by default when it is used in a Scala match block. (?s) ensures we match the full input string even if it contains newline symbols.
Another idea, instead of using a regular expression to extract tokens, would be to use the built-in Java URI class with its getQuery() method. You can then split the query by &, find the pair that starts with id=, and extract the value.
For instance (just as an example):
import java.net.URI
val x = "http://www.google.com?x=10&id=x10_23&y=2"
val uri = new URI(x)
uri.getQuery.split('&').find(_.startsWith("id=")) match {
  case Some(param) => println(param.split('=')(1).replace("_23", ""))
  case None => println("None")
}
I find it simpler to maintain than the regular expression you have, but that's just my opinion!
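The snippet above removes the literal "_23". A sketch of a variant that strips any trailing _<digits> suffix, and that copes with a URL lacking a query string, could look like this. The object name IdParam and the helper extractId are hypothetical.

```scala
import java.net.URI

object IdParam {
  // Returns the id value without an optional trailing "_<digits>" suffix,
  // or None when the URL has no query or no id parameter.
  def extractId(url: String): Option[String] =
    Option(new URI(url).getQuery)                      // getQuery is null when there is no query
      .flatMap(_.split('&').find(_.startsWith("id="))) // pick the "id=..." pair
      .map(_.stripPrefix("id=").replaceFirst("_\\d+$", ""))
}
```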