Filter strings with regex - regex

I need to filter (select) strings that follow certain rules, print them, and count the number of filtered strings. The input is one big string, and I need to apply the following rules to each line:
line must not contain any of ab, cd, pq, or xy
line must contain any of the vowels
line must contain a letter that repeats itself, like aa, ff, yy etc
I'm using the regex crate and it provides regex::RegexSet so I can combine multiple rules. The rules I added are as follows
let regexp = regex::RegexSet::new(&[
    r"^((?!ab|cd|pq|xy).)*", // rule 1
    r"((.)\1{9,}).*", // rule 3
    r"(\b[aeiyou]+\b).*", // rule 2
])
But I don't know how to use these rules to filter the lines and iterate over them.
pub fn p1(lines: &str) -> u32 {
    lines
        .split_whitespace().filter(|line| { /* regex filter goes here */ })
        .map(|line| println!("{}", line))
        .count() as u32
}
Also the compiler says that the crate doesn't support look-around, including look-ahead and look-behind.

If you're looking to use a single regex, then doing this via the regex crate (which, by design, and as documented, does not support look-around or backreferences) is probably not possible. You could use a RegexSet, but implementing your third rule would require using a regex that lists every repetition of a Unicode letter. This would not be as bad if you were okay limiting this to ASCII, but your comments suggest this isn't acceptable.
So I think your practical options here are either to use a library that supports fancier regex features (such as fancy-regex for a pure Rust library, or pcre2 if you're okay using a C library), or to write just a bit more code:
use regex::Regex;

fn main() {
    let corpus = "\
baz
ab
cwm
foobar
quux
foo pq bar
";
    let blacklist = Regex::new(r"ab|cd|pq|xy").unwrap();
    let vowels = Regex::new(r"[aeiouy]").unwrap();
    let it = corpus
        .lines()
        .filter(|line| !blacklist.is_match(line))
        .filter(|line| vowels.is_match(line))
        .filter(|line| repeated_letter(line));
    for line in it {
        println!("{}", line);
    }
}

fn repeated_letter(line: &str) -> bool {
    let mut prev = None;
    for ch in line.chars() {
        if prev.map_or(false, |prev| prev == ch) {
            return true;
        }
        prev = Some(ch);
    }
    false
}
Playground link: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=c0928793474af1f9c0180c1ac8fd2d47
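For rules this simple, the same three checks can also be written with the standard library alone, which additionally makes it easy to return the count the question asked for. The following is only a sketch under a few assumptions: `is_nice` and `count_nice` are hypothetical names, 'y' is treated as a vowel (as in the `[aeiouy]` class above), and rule 3 compares any two adjacent characters, just like `repeated_letter` does.

```rust
/// Returns true when `line` satisfies all three rules from the question.
fn is_nice(line: &str) -> bool {
    // Rule 1: none of the forbidden pairs may appear anywhere in the line.
    let forbidden = ["ab", "cd", "pq", "xy"];
    if forbidden.iter().any(|pair| line.contains(*pair)) {
        return false;
    }
    // Rule 2: at least one vowel ('y' included, by assumption).
    if !line.chars().any(|c| "aeiouy".contains(c)) {
        return false;
    }
    // Rule 3: some character is immediately followed by itself.
    let chars: Vec<char> = line.chars().collect();
    chars.windows(2).any(|w| w[0] == w[1])
}

/// Prints every line that passes the rules and returns how many passed.
fn count_nice(input: &str) -> u32 {
    let mut count = 0;
    for line in input.lines().filter(|line| is_nice(line)) {
        println!("{}", line);
        count += 1;
    }
    count
}

fn main() {
    let corpus = "baz\nab\ncwm\nfoobar\nquux\nfoo pq bar\n";
    println!("{} lines matched", count_nice(corpus));
}
```

Collecting the characters into a `Vec<char>` keeps rule 3 correct for non-ASCII input, at the cost of one allocation per line.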

Related

Regex array of named group matches

I would like to get an array of all captured group matches in chronological order (the order they appear in in the input string).
So, for example, with the following regex:
(?P<fooGroup>foo)|(?P<barGroup>bar)
and the following input:
foo bar foo
I would like to get something that resembles the following output:
[("fooGroup", (0,3)), ("barGroup", (4,7)), ("fooGroup", (8,11))]
Is this possible to do without manually sorting all matches?
I don't know what you mean by "without manually sorting all matches," but this Rust code produces the output you want for this particular style of pattern:
use regex::Regex;

fn main() {
    let pattern = r"(?P<fooGroup>foo)|(?P<barGroup>bar)";
    let haystack = "foo bar foo";
    let mut matches: Vec<(String, (usize, usize))> = vec![];
    let re = Regex::new(pattern).unwrap();
    // We skip the first capture group, which always corresponds
    // to the entire pattern and is unnamed. Otherwise, we assume
    // every capturing group has a name and corresponds to a single
    // alternation in the regex.
    let group_names: Vec<&str> =
        re.capture_names().skip(1).map(|x| x.unwrap()).collect();
    for caps in re.captures_iter(haystack) {
        for name in &group_names {
            if let Some(m) = caps.name(name) {
                matches.push((name.to_string(), (m.start(), m.end())));
            }
        }
    }
    println!("{:?}", matches);
}
The only real trick here is to make sure group_names is correct. It's correct for any pattern of the form (?P<name1>re1)|(?P<name2>re2)|...|(?P<nameN>reN) where each reI contains no other capturing groups.

How to get overlapping regex captures in Rust?

I'm trying to match the two characters after a specific character. The trailing values may contain the specified character, which is ok, but I also need to capture that specified character as the beginning of the next capture group.
This code should illustrate what I mean:
extern crate regex;
use regex::Regex;

pub fn main() {
    let re = Regex::new("(a..)").unwrap();
    let st = String::from("aba34jf baacdaab");
    println!("String to match: {}", st);
    for cap in re.captures_iter(&st) {
        println!("{}", &cap[1]);
        // Prints "aba", "aac" and "aab" (non-overlapping matches only),
        // Should print "aba", "a34", "aac", "acd", "aab"
    }
}
How do I get overlapping captures without using look around (which the regex crate doesn't support in Rust)? Is there something similar to what is in Python (as mentioned here) but in Rust?
Edit:
Using onig as BurntSushi5 suggested, we get the following:
extern crate onig;
use onig::*;

pub fn main() {
    let re = Regex::new("(?=(a.{2}))").unwrap();
    let st = String::from("aba34jf baacdaab");
    println!("String to match: {}", st);
    for ch in re.find_iter(&st) {
        print!("{} ", &st[ch.0..=ch.1+2]);
        // aba a34 aac acd aab, as it should.
        // but we have to know how long the capture is.
    }
    println!("");
}
Now the problem with this is that you have to know how long the match is, because the look-ahead group doesn't capture. Is there a way to get the look-ahead regex captured without knowing the length beforehand? How would we print it out if we had something like (?=(a.+)) as the regex?
You can't. Your only recourse is to either find a different approach entirely, or use a different regex engine that supports look-around like onig or pcre2.
I found a solution, unfortunately not regex though:
pub fn main() {
    print_char_matches("aba34jf baacdaab", 'a', 2);
    // aba a34 aac acd aab, as it should.
}

pub fn print_char_matches(st: &str, char_match: char, match_length: usize) {
    let chars: Vec<_> = st.char_indices().collect();
    println!("String to match: {}", st);
    // saturating_sub avoids an underflow panic when the input is
    // shorter than match_length.
    for i in 0..chars.len().saturating_sub(match_length) {
        if chars[i].1 == char_match {
            for j in 0..=match_length {
                print!("{}", chars[i + j].1);
            }
            print!(" ");
        }
    }
    println!();
}
This is a bit more generalizable. It matches the given character plus the specified number of characters after it. Since it compares chars rather than bytes, it is not limited to ASCII.
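If the matches are needed as values rather than printed output, the same idea can be expressed with sliding windows over the characters. This is a minimal sketch, not a regex replacement: `overlapping_matches` is a hypothetical helper, and it only handles the fixed-length case (one given character followed by `extra` arbitrary characters).

```rust
/// Collects every window of `extra + 1` characters that starts with `needle`.
/// Windows may overlap, unlike the matches yielded by a regex iterator.
fn overlapping_matches(text: &str, needle: char, extra: usize) -> Vec<String> {
    let chars: Vec<char> = text.chars().collect();
    chars
        .windows(extra + 1) // yields nothing if the text is too short
        .filter(|w| w[0] == needle)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

fn main() {
    let found = overlapping_matches("aba34jf baacdaab", 'a', 2);
    println!("{}", found.join(" "));
}
```

Because `windows` never yields a partial window, no explicit bounds arithmetic is needed and short inputs simply produce an empty result.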

"decimal literal empty" when combining several strings for a regex in Rust

I'm looking to parse a string to create a vector of floats:
fn main() {
    let vector_string: &str = "{12.34, 13.}";
    let vec = parse_axis_values(vector_string);
    // --- expected output vec: Vec<f32> = vec![12.34, 13.]
}

use regex::Regex;

pub fn parse_axis_values(str_values: &str) -> Vec<f32> {
    let pattern_float = String::from(r"\s*(\d*.*\d*)\s*");
    let pattern_opening = String::from(r"\s*{{");
    let pattern_closing = String::from(r"}}\s*");
    let pattern =
        pattern_opening + "(" + &pattern_float + ",)*" + &pattern_float + &pattern_closing;
    let re = Regex::new(&pattern).unwrap();
    let mut vec_axis1: Vec<f32> = Vec::new();
    // --- snip : for loop for adding the elements to the vector ---
    vec_axis1
}
This code compiles but an error arises at runtime when unwrapping the Regex::new():
regex parse error:
\s*{{(\s*(\d*.*\d*)\s*,)*\s*(\d*.*\d*)\s*}}\s*
^
error: decimal literal empty
According to other posts, this error can arise when escaping the curly bracket { is not properly done, but I think I escaped the bracket properly.
What is wrong with this regex?
There are several problems in your code:
Escaping a { in regex is done with \{.
Your . matches any character, not just a literal decimal point. You must escape it as \. to match only the point.
You're capturing more than just the number, which makes the parsing more complex.
Your regex building is unnecessarily verbose. With the (?x) flag you can comment the pattern without splitting it into separate strings.
Here's a proposed improved version:
use regex::Regex;

pub fn parse_axis_values(str_values: &str) -> Vec<f32> {
    let re = Regex::new(r"(?x)
        \s*\{\s* # opening
        (\d*\.\d*) # captured float
        \s*,\s* # separator
        \d*\.\d* # ignored float
        \s*\}\s* # closing
    ").unwrap();
    let mut vec_axis1: Vec<f32> = Vec::new();
    if let Some(c) = re.captures(str_values) {
        if let Some(g) = c.get(1) {
            vec_axis1.push(g.as_str().parse().unwrap());
        }
    }
    vec_axis1
}

fn main() {
    let vector_string: &str = "{12.34, 13.}";
    let vec = parse_axis_values(vector_string);
    println!("v: {:?}", vec);
}
If you call this function several times, you might want to avoid recompiling the regex at each call too.
I want to be able to match 0.123, .123, 123 or 123.; the use of \d+ would break these possibilities
It looks like you want to fetch all the floats in the string. This could be simply done like this:
use regex::Regex;

pub fn parse_axis_values(str_values: &str) -> Vec<f32> {
    let re = Regex::new(r"\d*\.\d*").unwrap();
    let mut vec_axis1: Vec<f32> = Vec::new();
    for c in re.captures_iter(str_values) {
        vec_axis1.push(c[0].parse().unwrap());
    }
    vec_axis1
}
If you want both:
to check the complete string is correctly wrapped between { and }
to capture all numbers
Then you could either:
combine two regexes (the first one used to extract the internal part)
use a Serde-based parser (I wouldn't at this point but it would be interesting if the problem's complexity grows)
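For input this regular, a third possibility is to skip regexes entirely and let the standard library do both jobs: check the braces, then parse each comma-separated item. The sketch below is an assumption-laden alternative, not the approach suggested above. `parse_axis_values_strict` is a hypothetical name, and it relies on Rust's `f32` parser, which accepts all four forms mentioned earlier (0.123, .123, 123 and 123.).

```rust
/// Parses a string such as "{12.34, 13.}" into a vector of floats.
/// Returns None if the braces are missing or any item is not a number.
fn parse_axis_values_strict(input: &str) -> Option<Vec<f32>> {
    // Check the wrapping first: trim outer whitespace, then peel off '{' and '}'.
    let inner = input.trim().strip_prefix('{')?.strip_suffix('}')?;
    // Collecting Option<f32> items into Option<Vec<f32>> stops at the first failure.
    inner
        .split(',')
        .map(|item| item.trim().parse::<f32>().ok())
        .collect()
}

fn main() {
    println!("{:?}", parse_axis_values_strict("{12.34, 13.}"));
    println!("{:?}", parse_axis_values_strict("12.34, 13."));
}
```

Returning `Option` instead of unwrapping lets the caller decide how to handle malformed input. Note that an empty list such as "{}" is rejected by this sketch, because the empty item between the braces does not parse as a number.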

Nicer way to access match results?

My requirement is to transform some textual message ids. Input is
a.messageid=X0001E
b.messageid=Y0001E
The task is to turn that into
a.messageid=Z00001E
b.messageid=Z00002E
In other words: fetch the first part of each line (like: a.), and append a slightly different id.
My current solution:
val matcherForIds = Regex("(.*)\\.messageid=(X|Y)\\d{4,6}E")
var messageCounter = 5
fun transformIds(line: String): String {
    val result = matcherForIds.matchEntire(line) ?: return line
    return "${result.groupValues.get(1)}.messageid=Z%05dE".format(messageCounter++)
}
This works, but I find the way I access the first match, ${result.groupValues.get(1)}, not very elegant.
Is there a more readable or more concise way to access that first match?
You may get the result without a separate function:
val line = s.replace("""^(.*\.messageid=)[XY]\d{4,6}E$""".toRegex()) {
    "${it.groupValues[1]}Z%05dE".format(messageCounter++)
}
However, as you need to format the messageCounter into the result, you cannot just use a string replacement pattern and you cannot get rid of ${it.groupValues[1]}.
Also, note:
You may get rid of double backslashes by means of the triple-quoted string literal
There is no need adding .messageid= to the replacement if you capture that part into Group 1 (see (.*\.messageid=))
There is no need capturing X or Y since you are not using them later, thus, (X|Y) can be replaced with a more efficient character class [XY].
The ^ and $ make sure the pattern matches the entire string; otherwise there is no match and the string is returned as is, without any modification.
See the Kotlin demo online.
Maybe not really what you are looking for, but maybe it is. What if you first ensure (filter) the lines of interest and just replace what needs to be replaced instead, e.g. use the following transformation function:
val matcherForIds = Regex("(.*)\\.messageid=(X|Y)\\d{4,6}E")
val idRegex = Regex("[XY]\\d{4,6}E")
var idCounter = 5
fun transformIds(line: String) = idRegex.replace(line) {
    "Z%05dE".format(idCounter++)
}
with the following filter:
"a.messageid=X0001E\nb.messageid=Y0001E"
    .lineSequence()
    .filter(matcherForIds::matches)
    .map(::transformIds)
    .forEach(::println)
If there are other lines that you also want to keep in the output, the following is also possible, though not as nice as the solution at the end:
"a.messageid=X0001E\nnot interested line, but required in the output!\nb.messageid=Y0001E"
    .lineSequence()
    .map {
        when {
            matcherForIds.matches(it) -> transformIds(it)
            else -> it
        }
    }
    .forEach(::println)
Alternatively (now just copying Wiktor's regex, as it already contains all we need: a complete match from the beginning of the line ^ up to the end of the line $, etc.):
val matcherForIds = Regex("""^(.*\.messageid=)[XY]\d{4,6}E$""")
fun transformIds(line: String) = matcherForIds.replace(line) {
    "${it.groupValues[1]}Z%05dE".format(idCounter++)
}
This way you ensure that lines that completely match the desired input are replaced and the others are kept but not replaced.

regex match string starting at offset

I'm learning Rust and trying to write a simple tokenizer right now. I want to go through a string, running each regular expression against the current position in the string, create a token, then skip ahead and repeat until I've processed the whole string. I know I can put them into a larger regex and loop through captures, but I need to process them individually for domain reasons.
However, I see nowhere in the regex crate that allows an offset so I can begin matching again at specific point.
extern crate regex;
use regex::Regex;

fn main() {
    let input = "3 + foo/4";
    let ident_re = Regex::new("[a-zA-Z][a-zA-Z0-9]*").unwrap();
    let number_re = Regex::new("[1-9][0-9]*").unwrap();
    // The hyphen goes first so it is a literal rather than an invalid range.
    let ops_re = Regex::new(r"[-+*/]").unwrap();
    let ws_re = Regex::new(r"[ \t\n\r]*").unwrap();
    let mut i: usize = 0;
    while i < input.len() {
        // Here check each regex to see if a match starting at input[i]
        // if so copy the match and increment i by length of match.
    }
}
The regexes that I'm currently scanning for will actually vary at runtime too. Sometimes I may only be looking for a few of them, while at other times (at top level) I might be looking for almost all of them.
The regex crate works on string slices. You can always take a sub-slice of another slice and then operate on that one. Instead of moving along indices, you can modify the variable that points to your slice to point to your subslice.
fn main() {
    let mut s = "hello";
    while !s.is_empty() {
        println!("{}", s);
        s = &s[1..];
    }
}
Note that the slice operation slices at byte positions, not at UTF-8 character positions. This allows the slicing operation to be done in O(1) instead of O(n), but it will also cause the program to panic if the indices you are slicing from and to happen to be in the middle of a multi-byte UTF-8 character.
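To make the sub-slicing idea concrete for a tokenizer, here is a rough sketch that advances a `rest` slice by the byte length of whatever was matched at its start. Everything in it is an assumption for illustration: `Token`, `prefix_len` and `tokenize` are hypothetical names, and the matchers are plain character predicates from the standard library rather than regexes. With the regex crate, each matcher would presumably be a pattern anchored with ^ and run against `rest`. Unknown characters are skipped by `len_utf8()` bytes, so the next slice always starts on a character boundary and the panic described above cannot occur.

```rust
#[derive(Debug, PartialEq)]
enum Token<'a> {
    Ident(&'a str),
    Number(&'a str),
    Op(char),
}

/// Byte length of the longest prefix of `s` whose chars all satisfy `pred`.
fn prefix_len(s: &str, pred: impl Fn(char) -> bool) -> usize {
    s.char_indices()
        .find(|&(_, c)| !pred(c))
        .map_or(s.len(), |(i, _)| i)
}

fn tokenize(input: &str) -> Vec<Token<'_>> {
    let mut rest = input;
    let mut tokens = Vec::new();
    while let Some(first) = rest.chars().next() {
        // Each branch returns how many bytes it consumed from `rest`.
        let len = if first.is_ascii_alphabetic() {
            let n = prefix_len(rest, |c| c.is_ascii_alphanumeric());
            tokens.push(Token::Ident(&rest[..n]));
            n
        } else if first.is_ascii_digit() {
            let n = prefix_len(rest, |c| c.is_ascii_digit());
            tokens.push(Token::Number(&rest[..n]));
            n
        } else if "+-*/".contains(first) {
            tokens.push(Token::Op(first));
            first.len_utf8()
        } else {
            // Whitespace or an unknown character: skip exactly one char.
            first.len_utf8()
        };
        // Re-point the slice instead of tracking an index.
        rest = &rest[len..];
    }
    tokens
}

fn main() {
    println!("{:?}", tokenize("3 + foo/4"));
}
```

Since the set of matchers is just a chain of branches here, varying it at runtime would need a different structure, for example a list of matcher functions tried in order. That part is left out of the sketch.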