How to get overlapping regex captures in Rust? - regex

I'm trying to match the two characters after a specific character. The trailing values may contain the specified character, which is ok, but I also need to capture that specified character as the beginning of the next capture group.
This code should illustrate what I mean:
extern crate regex;
use regex::Regex;
pub fn main() {
let re = Regex::new("(a..)").unwrap();
let st = String::from("aba34jf baacdaab");
println!("String to match: {}", st);
for cap in re.captures_iter(&st) {
println!("{}", cap[1].to_string());
// Prints "aba", "aac" and "aab",
// Should print "aba", "a34", "aac", "acd", "aab"
}
}
How do I get overlapping captures without using look around (which the regex crate doesn't support in Rust)? Is there something similar to what is in Python (as mentioned here) but in Rust?
Edit:
Using onig as BurntSushi5 suggested, we get the following:
extern crate onig;
use onig::*;
pub fn main() {
let re = Regex::new("(?=(a.{2}))").unwrap();
let st = String::from("aba34jf baacdaab");
println!("String to match: {}", st);
for ch in re.find_iter(&st) {
print!("{} ", &st[ch.0..=ch.1+2]);
// aba a34 aac acd aab, as it should.
// but we have to know how long the capture is.
}
println!("");
}
The problem with this is that you have to know how long the match is, because the look-ahead itself consumes nothing and find_iter only reports the (empty) overall match. Is there a way to get the text captured inside the look-ahead without knowing its length beforehand? How would we print it if the regex were something like (?=(a.+))?

You can't. Your only recourse is to either find a different approach entirely, or use a different regex engine that supports look-around, such as onig or pcre2.

I found a solution, unfortunately not regex though:
pub fn main() {
print_char_matches ("aba34jf baacdaab", 'a', 2);
//aba a34 aac acd aab, as it should.
}
pub fn print_char_matches( st:&str, char_match:char, match_length:usize ) {
let chars:Vec<_> = st.char_indices().collect();
println!("String to match: {}", st);
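// saturating_sub avoids an underflow panic when the string is shorter than match_length.
for i in 0..chars.len().saturating_sub(match_length) {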
if chars[i].1 == char_match {
for j in 0..=match_length {
print!("{}", chars[i+j].1);
}
print!(" ");
}
}
println!("");
}
This is a bit more generalizable. It matches the given character plus the specified number of characters after it. Because it walks over chars rather than bytes, it is not actually limited to ASCII.
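If the matches are needed as values rather than printed output, the same idea can return string slices. The following is only a sketch using the standard library; the function name overlapping_windows and its parameters are made up for illustration and are not part of any crate.

```rust
/// Returns every window that starts with `needle` and spans `extra` further
/// characters. Windows are allowed to overlap.
/// (Hypothetical helper, not part of the regex or onig crates.)
fn overlapping_windows(haystack: &str, needle: char, extra: usize) -> Vec<&str> {
    // Byte offset of every character, plus the end of the string as a
    // sentinel, so that slices always fall on character boundaries.
    let mut offsets: Vec<usize> = haystack.char_indices().map(|(i, _)| i).collect();
    offsets.push(haystack.len());

    let mut windows = Vec::new();
    for (n, ch) in haystack.chars().enumerate() {
        // offsets[n + extra + 1] is the byte offset just past the window;
        // windows that would run off the end of the string are skipped.
        if ch == needle && n + extra + 1 < offsets.len() {
            windows.push(&haystack[offsets[n]..offsets[n + extra + 1]]);
        }
    }
    windows
}
```

Calling overlapping_windows("aba34jf baacdaab", 'a', 2) yields the five windows aba, a34, aac, acd and aab. Returning slices leaves it to the caller to decide whether to print, count or collect them.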

Related

Regex array of named group matches

I would like to get an array of all captured group matches in chronological order (the order in which they appear in the input string).
For example, with the following regex:
(?P<fooGroup>foo)|(?P<barGroup>bar)
and the following input:
foo bar foo
I would like to get something that resembles the following output:
[("fooGroup", (0,3)), ("barGroup", (4,7)), ("fooGroup", (8,11))]
Is this possible to do without manually sorting all matches?
I don't know what you mean by "without manually sorting all matches," but this Rust code produces the output you want for this particular style of pattern:
use regex::Regex;
fn main() {
let pattern = r"(?P<fooGroup>foo)|(?P<barGroup>bar)";
let haystack = "foo bar foo";
let mut matches: Vec<(String, (usize, usize))> = vec![];
let re = Regex::new(pattern).unwrap();
// We skip the first capture group, which always corresponds
// to the entire pattern and is unnamed. Otherwise, we assume
// every capturing group has a name and corresponds to a single
// alternation in the regex.
let group_names: Vec<&str> =
re.capture_names().skip(1).map(|x| x.unwrap()).collect();
for caps in re.captures_iter(haystack) {
for name in &group_names {
if let Some(m) = caps.name(name) {
matches.push((name.to_string(), (m.start(), m.end())));
}
}
}
println!("{:?}", matches);
}
The only real trick here is to make sure group_names is correct. It's correct for any pattern of the form (?P<name1>re1)|(?P<name2>re2)|...|(?P<nameN>reN) where each reI contains no other capturing groups.

Filter strings with regex

I need to filter (select) strings that follow certain rules, print them and count the number of filtered strings. The input is a big string and I need to apply the following rules on each line:
line must not contain any of ab, cd, pq, or xy
line must contain any of the vowels
line must contain a letter that repeats itself, like aa, ff, yy etc
I'm using the regex crate and it provides regex::RegexSet so I can combine multiple rules. The rules I added are as follows
let regexp = regex::RegexSet::new(&[
r"^((?!ab|cd|pq|xy).)*", // rule 1
r"((.)\1{9,}).*", // rule 3
r"(\b[aeiyou]+\b).*", // rule 2
])
But I don't know how to use these rules to filter the lines and iterate over them.
pub fn p1(lines: &str) -> u32 {
lines
.split_whitespace().filter(|line| { /* regex filter goes here */ })
.map(|line| println!("{}", line))
.count() as u32
}
Also the compiler says that the crate doesn't support look-around, including look-ahead and look-behind.
If you're looking to use a single regex, then doing this via the regex crate (which, by design, and as documented, does not support look-around or backreferences) is probably not possible. You could use a RegexSet, but implementing your third rule would require using a regex that lists every repetition of a Unicode letter. This would not be as bad if you were okay limiting this to ASCII, but your comments suggest this isn't acceptable.
So I think your practical options here are to either use a library that supports fancier regex features (such as fancy-regex for a pure Rust library, or pcre2 if you're okay using a C library), or write just a bit more code:
use regex::Regex;
fn main() {
let corpus = "\
baz
ab
cwm
foobar
quux
foo pq bar
";
let blacklist = Regex::new(r"ab|cd|pq|xy").unwrap();
let vowels = Regex::new(r"[aeiouy]").unwrap();
let it = corpus
.lines()
.filter(|line| !blacklist.is_match(line))
.filter(|line| vowels.is_match(line))
.filter(|line| repeated_letter(line));
for line in it {
println!("{}", line);
}
}
fn repeated_letter(line: &str) -> bool {
let mut prev = None;
for ch in line.chars() {
if prev.map_or(false, |prev| prev == ch) {
return true;
}
prev = Some(ch);
}
false
}
Playground link: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=c0928793474af1f9c0180c1ac8fd2d47
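Since all three rules are simple, they can also be checked without any regex. The following is a minimal sketch using only the standard library; the function names are invented for illustration, and y is treated as a vowel to mirror the [aeiouy] class above.

```rust
/// Rule 1: the line contains none of the forbidden pairs.
fn has_no_forbidden_pair(line: &str) -> bool {
    !["ab", "cd", "pq", "xy"].iter().any(|pair| line.contains(*pair))
}

/// Rule 2: the line contains at least one vowel (`y` included, as an assumption).
fn has_vowel(line: &str) -> bool {
    line.chars().any(|c| "aeiouy".contains(c))
}

/// Rule 3: some character appears twice in a row.
fn has_repeated_char(line: &str) -> bool {
    // Pair each character with its successor and look for an equal pair.
    line.chars().zip(line.chars().skip(1)).any(|(a, b)| a == b)
}

/// Keeps only the lines that satisfy all three rules.
fn nice_lines(corpus: &str) -> Vec<&str> {
    corpus
        .lines()
        .filter(|line| has_no_forbidden_pair(line) && has_vowel(line) && has_repeated_char(line))
        .collect()
}
```

For the corpus above, nice_lines keeps foobar and quux, and the length of the returned vector gives the count the question asks for.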

Kotlin Regex Boundary Matching Not working

I'm trying to parse a word, bounded by whitespace or punctuation on either side.
I tried this:
fun main(args: Array<String>) {
val regex = "\bval\b".toRegex();
regex.matches("fn foo() { val x = 2;} x;").also { println(it) }
}
But this prints out false. I tested the regex at https://regex101.com/r/vNBefF/2 and it works, matching against the input string.
What am I doing wrong?
I think you're using the wrong method. From the KotlinDoc:
Indicates whether the regular expression matches the entire input.
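I think what you may want is containsMatchIn. You can play with this on the playground. Note also that in an ordinary Kotlin string literal \b is the backspace escape, so the pattern needs to be written as "\\bval\\b" or as a raw string, """\bval\b""".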

Regex, close it off in both ends? [duplicate]

What is the regular expression (in JavaScript if it matters) to only match if the text is an exact match? That is, there should be no extra characters at either end of the string.
For example, if I'm trying to match for abc, then 1abc1, 1abc, and abc1 would not match.
Use the start and end delimiters: ^abc$
It depends. You could use
string.match(/^abc$/)
But that would not match the following string: 'the first 3 letters of the alphabet are abc. not abc123'
I think you would want to use \b (word boundaries):
var str = 'the first 3 letters of the alphabet are abc. not abc123';
var pat = /\b(abc)\b/g;
console.log(str.match(pat));
Live example: http://jsfiddle.net/uu5VJ/
If the former solution (anchoring with ^ and $) is all you need, I would advise against using a regex at all.
In that case a plain comparison does the job:
var strs = ['abc', 'abc1', 'abc2']
for (var i = 0; i < strs.length; i++) {
if (strs[i] == 'abc') {
//do something
}
else {
//do something else
}
}
While you could use
if (strs[i].match(/^abc$/g)) {
//do something
}
It would be considerably more resource-intensive. As a general rule of thumb, use a plain comparison for a simple string match and a regular expression for a more dynamic pattern.
More on JavaScript regexes: https://developer.mozilla.org/en/JavaScript/Guide/Regular_Expressions
"^" marks the beginning of the line and "$" the end of it. E.g.:
var re = /^abc$/;
Would match "abc" but not "1abc" or "abc1". You can learn more at https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions

How to match a word between single but not double brackets without look-around?

The Rust Regex crate has no look-around so I cannot use negative look-behind for { and a negative look-ahead for }.
I tried:
extern crate regex;
use regex::Regex;
fn main() {
let exp = Regex::new("(?:[^{]|^)\\{([^{}]*)\\}").unwrap();
let text = "{this} is a match, {{escaped}} is not, but {these}{also} are.";
for capture in exp.captures_iter(text) {
println!("{}", &capture[1]);
}
// expected result: "this", "these", "also"
}
This does not catch "also" because the matches do not overlap. Is there a way to do so without look-around?
You can use the discard technique and use a pattern like this:
{{|}}|{([^}]+)}
Or simpler to read if you need to match alphanumeric and underscore
{{|}}|{(\w+)}
In your code, you must now check whether group 1 participated in the match:
extern crate regex;
use regex::Regex;
fn main() {
let exp = Regex::new(r"\{\{|\}\}|\{([^}]+)\}").unwrap();
let text = "{this} is a match, {{escaped}} is not, but {these}{also} are.";
for capture in exp.captures_iter(text) {
if let Some(matched) = capture.get(1) {
println!("{}", matched.as_str());
}
}
// printed: "this", "these", "also"
}
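The discard technique can also be written as a small hand-rolled scanner, which may help to see why it works: doubled braces are consumed first and thrown away, so only single braces are left to capture. This is a sketch using only the standard library, with a hypothetical function name; it is meant to behave like the pattern above, not to replace the regex crate.

```rust
/// Collects the text inside single-brace pairs, discarding `{{` and `}}`.
/// (Hypothetical helper written for illustration.)
fn single_braced(text: &str) -> Vec<&str> {
    let mut found = Vec::new();
    let mut pos = 0;
    // Jump from brace to brace; everything in between is ignored.
    while let Some(offset) = text[pos..].find(|c: char| c == '{' || c == '}') {
        let at = pos + offset;
        let tail = &text[at..];
        if tail.starts_with("{{") || tail.starts_with("}}") {
            // Escaped brace: consume both characters and discard them.
            pos = at + 2;
        } else if tail.starts_with('{') {
            match tail.find('}') {
                // Require at least one character between the braces,
                // like `[^}]+` in the regex.
                Some(close) if close > 1 => {
                    found.push(&tail[1..close]);
                    pos = at + close + 1;
                }
                // Empty or unterminated group: skip the opening brace.
                _ => pos = at + 1,
            }
        } else {
            // A lone closing brace with no opener: skip it.
            pos = at + 1;
        }
    }
    found
}
```

On the example text from the question this returns this, these and also, and skips {{escaped}}.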