I'd like to partition a string into two groups by providing the regex for only one group in Rust.
The regex for the opposite group is not known. I only know the regex for the separator.
For example, with the regex \d+ and the following string
123abcdef456ghj789
I'd like to obtain both of these strings
abcdefghj
and
123456789
Using the regex and itertools crates, I'm able to get the first group like this
let text = "123abcdef456ghj789";
let re = Regex::new(r"\d+").unwrap();
let text1 = re.split(text).join(""); //abcdefghj
How can I get the second group?
You can get the desired result very similarly:
re.find_iter(text).map(|m| m.as_str()).join("");
.find_iter() returns an iterator over all matches; calling .as_str() on each match yields the matched text. Then use .join() from itertools, as you've done before.
Full example on the playground.
It would be nice though if there was a single method that returned a tuple of the disjoined partitions.
It would be nice and certainly possible since the matches return all the information needed to slice-and-dice the text in one pass. Here's my attempt that iteratively calls .find_at():
fn partition_regex(re: &Regex, text: &str) -> (String, String) {
    let mut a = String::new();
    let mut b = String::new();
    let mut search_idx = 0;
    while let Some(m) = re.find_at(text, search_idx) {
        a.push_str(m.as_str());
        b.push_str(&text[search_idx..m.start()]);
        search_idx = m.end();
    }
    b.push_str(&text[search_idx..]);
    (a, b)
}
Full example on the playground.
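You can use Iterator::partition to split items into two collections based on a predicate. Note that this works on whole items (here, the lines of some input content), not on the matched and unmatched parts within a single line: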
let re = Regex::new(r"(^[a-z]+)").unwrap();
let (matches, non_matches): (String, String)
= content.lines().partition(|x| re.is_match(x));
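For the digit example from the question, the same `partition` method can be applied to the characters of a single string instead of to lines. This is a minimal sketch using only the standard library, and it assumes the group of interest can be described by a per-character predicate. It uses `char::is_ascii_digit`, which is narrower than the regex `\d` (the latter also matches non-ASCII digits by default). The name `partition_digits` is hypothetical.

```rust
/// Splits `text` into (digits, everything else), keeping the original
/// order of characters within each group.
fn partition_digits(text: &str) -> (String, String) {
    // `String` implements `Default` and `Extend<char>`, so `partition`
    // can collect both halves directly into strings.
    text.chars().partition(|c| c.is_ascii_digit())
}
```

Calling `partition_digits("123abcdef456ghj789")` yields the pair `("123456789", "abcdefghj")`. For separators that cannot be expressed per character, the regex-based approaches are still needed.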
A bit more verbose, but without an additional crate such as itertools:
let re = regex::Regex::new(r"\d+").unwrap();
let mut text1 = String::new();
let mut text2 = String::new();
let mut beg = 0;
let txt = "123abcdef456ghj789";
for r in re.find_iter(txt).map(|m| m.range()) {
text1 += &txt[r.clone()];
text2 += &txt[beg..r.start];
beg = r.end;
}
text2 += &txt[beg..];
println!("{text1}\n{text2}");
Playground
My Rust code parses a log file and accumulates some information:
use regex::Regex;
fn parse(line: &str) {
let re_str = concat!(
r"^\s+(?P<qrw1>\d+)\|(?P<qrw2>\d+)",//qrw 0|0
r"\s+(?P<arw1>\d+)\|(?P<arw2>\d+)",//arw 34|118
);
let re = Regex::new(re_str).unwrap();
match re.captures(line) {
Some(caps) => {
let qrw1 = caps.name("qrw1").unwrap().as_str().parse::<i32>().unwrap();
let qrw2 = caps.name("qrw2").unwrap().as_str().parse::<i32>().unwrap();
let arw1 = caps.name("arw1").unwrap().as_str().parse::<i32>().unwrap();
let arw2 = caps.name("arw2").unwrap().as_str().parse::<i32>().unwrap();
}
None => todo!(),
}
}
Playground
This works as expected, but I think those long chained calls which I created to get integer values of regex capture groups are a bit ugly. How do I make them shorter/nicer?
One thing you could do is extract the parsing into a closure internal_parse:
fn parse(line: &str) -> Option<(i32, i32, i32, i32)> {
let re_str = concat!(
r"^\s+(?P<qrw1>\d+)\|(?P<qrw2>\d+)",//qrw 0|0
r"\s+(?P<arw1>\d+)\|(?P<arw2>\d+)",//arw 34|118
);
let re = Regex::new(re_str).unwrap();
match re.captures(line) {
Some(caps) => {
let internal_parse = |key| {
caps.name(key).unwrap().as_str().parse::<i32>().unwrap()
};
let qrw1 = internal_parse("qrw1");
let qrw2 = internal_parse("qrw2");
let arw1 = internal_parse("arw1");
let arw2 = internal_parse("arw2");
Some((qrw1, qrw2, arw1, arw2))
}
None => None,
}
}
However, you should keep in mind that parse::<i32> may fail. (Consider e.g. the string " 00|45 57|4894444444444444444444444 ".)
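To illustrate that failure mode without panicking, here is a small sketch that parses a single `a|b` field using only the standard library. `parse_pair` is a hypothetical helper and not part of the code above; it only shows the `.parse().ok()?` pattern, which could replace the final `.unwrap()` in any function that returns `Option`.

```rust
/// Parses a field such as "34|118" into two integers.
/// Returns `None` if the separator is missing or if either number is not
/// a valid `i32` (for example, because it overflows).
fn parse_pair(field: &str) -> Option<(i32, i32)> {
    // `split_once` returns `None` when there is no '|' in the field.
    let (left, right) = field.trim().split_once('|')?;
    // `.ok()?` turns a `ParseIntError` into an early `None`.
    Some((left.parse().ok()?, right.parse().ok()?))
}
```

With this helper, `parse_pair("34|118")` gives `Some((34, 118))`, while `parse_pair("57|4894444444444444444444444")` gives `None` instead of aborting the program.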
You could also solve this problem with a parser combinator library (the crates nom, pest or combine come to mind) that traverses the string and produces the i32s directly, so that you do not have to parse manually after matching via regex.
I'm looking to parse a string to create a vector of floats:
fn main() {
let vector_string: &str = "{12.34, 13.}";
let vec = parse_axis_values(vector_string);
// --- expected output vec: Vec<f32> = vec![12.34, 13.]
}
use regex::Regex;
pub fn parse_axis_values(str_values: &str) -> Vec<f32> {
let pattern_float = String::from(r"\s*(\d*.*\d*)\s*");
let pattern_opening = String::from(r"\s*{{");
let pattern_closing = String::from(r"}}\s*");
let pattern =
pattern_opening + "(" + &pattern_float + ",)*" + &pattern_float + &pattern_closing;
let re = Regex::new(&pattern).unwrap();
let mut vec_axis1: Vec<f32> = Vec::new();
// --- snip : for loop for adding the elements to the vector ---
vec_axis1
}
This code compiles but an error arises at runtime when unwrapping the Regex::new():
regex parse error:
\s*{{(\s*(\d*.*\d*)\s*,)*\s*(\d*.*\d*)\s*}}\s*
^
error: decimal literal empty
According to other posts, this error can arise when escaping the curly bracket { is not properly done, but I think I escaped the bracket properly.
What is wrong with this regex?
There are several problems in your code:
Escaping a { in regex is done with \{.
Your unescaped . matches any character rather than a literal decimal point. You must escape it as \.
You're capturing more than just the number, which makes the parsing more complex.
Your regex construction is unnecessarily verbose; you can comment a regex without splitting it into pieces.
Here's a proposed improved version:
use regex::Regex;
pub fn parse_axis_values(str_values: &str) -> Vec<f32> {
let re = Regex::new(r"(?x)
\s*\{\s* # opening
(\d*\.\d*) # captured float
\s*,\s* # separator
\d*\.\d* # ignored float
\s*\}\s* # closing
").unwrap();
let mut vec_axis1: Vec<f32> = Vec::new();
if let Some(c) = re.captures(str_values) {
if let Some(g) = c.get(1) {
vec_axis1.push(g.as_str().parse().unwrap());
}
}
vec_axis1
}
fn main() {
let vector_string: &str = "{12.34, 13.}";
let vec = parse_axis_values(vector_string);
println!("v: {:?}", vec);
}
playground
If you call this function several times, you might want to avoid recompiling the regex at each call too.
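One common way to build a value once and reuse it across calls is a function-local `static` holding a `std::sync::OnceLock`. The sketch below assumes Rust 1.70 or later. To stay runnable without external crates it uses a stand-in `Pattern` type and a `compile` function in place of `regex::Regex` and `Regex::new`; those two names, `float_pattern`, and the `BUILD_COUNT` counter are all hypothetical and exist only for the demonstration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

// Counts how often the expensive construction runs (demonstration only).
static BUILD_COUNT: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for an expensive-to-build value such as a compiled regex.
struct Pattern {
    source: String,
}

/// Stand-in for the expensive constructor.
fn compile(source: &str) -> Pattern {
    BUILD_COUNT.fetch_add(1, Ordering::SeqCst);
    Pattern { source: source.to_owned() }
}

/// Returns a shared pattern, building it on the first call only.
fn float_pattern() -> &'static Pattern {
    static PATTERN: OnceLock<Pattern> = OnceLock::new();
    PATTERN.get_or_init(|| compile(r"\d*\.\d*"))
}
```

With the regex crate, the static would hold a `Regex` and the closure would call `Regex::new(...).unwrap()`; every later call to the accessor then returns the same compiled instance.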
I want to be able to match 0.123, .123, 123 or 123.; using \d+ would rule out some of these.
It looks like you want to fetch all the floats in the string. This could be simply done like this:
use regex::Regex;
pub fn parse_axis_values(str_values: &str) -> Vec<f32> {
// Matches 0.123, 123., 123 and .123, but never a lone "." (which would fail to parse).
let re = Regex::new(r"\d+\.?\d*|\.\d+").unwrap();
let mut vec_axis1: Vec<f32> = Vec::new();
for c in re.captures_iter(str_values) {
vec_axis1.push(c[0].parse().unwrap());
}
vec_axis1
}
If you want both:
to check the complete string is correctly wrapped between { and }
to capture all numbers
Then you could either:
combine two regexes (the first one used to extract the internal part)
use a Serde-based parser (I wouldn't at this point but it would be interesting if the problem's complexity grows)
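The two-step idea (first check the wrapper and extract the inner part, then collect the numbers) can also be sketched without any regex, using standard-library string methods instead of the two regexes suggested above. This is an alternative technique, not the answer's code, and the name `parse_axis_values_checked` is hypothetical. It relies on `f32`'s `FromStr` implementation accepting forms such as `13.` and `.123`.

```rust
/// Parses a brace-wrapped, comma-separated list such as "{12.34, 13.}".
/// Returns `None` if the braces are missing or if any element is not a
/// valid float. An empty list "{}" is also rejected by this sketch.
fn parse_axis_values_checked(input: &str) -> Option<Vec<f32>> {
    // Step 1: check the wrapper and keep only the inner part.
    let inner = input.trim().strip_prefix('{')?.strip_suffix('}')?;
    // Step 2: parse every comma-separated element; collecting into
    // `Option<Vec<_>>` stops at the first element that fails to parse.
    inner
        .split(',')
        .map(|item| item.trim().parse::<f32>().ok())
        .collect()
}
```

For the input `"{12.34, 13.}"` this returns `Some` of a vector holding `12.34` and `13.0`; without the braces, or with a non-numeric element, it returns `None`.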
Swift 3 introduced String.range(of:options:). With this function, it is possible to match part of a string without creating an NSRegularExpression object, for example:
let text = "it is need #match my both #hashtag!"
let match = text.range(of: "(?:^#|\\s#)[\\p{L}0-9_]*", options: .regularExpression)!
print(text[match]) // prints " #match" (the leading space is part of the match)
But is it possible to match both occurrences of the regexp (that is, #match and #hashtag), instead of only the first?
let text = "it is need #match my both #hashtag!"
// create an object to store the ranges found
var ranges: [Range<String.Index>] = []
// create an object to store your search position
var start = text.startIndex
// create a while loop to find your regex ranges
while let range = text.range(of: "(?:^#|\\s#)[\\p{L}0-9_]*", options: .regularExpression, range: start..<text.endIndex) {
// append your range found
ranges.append(range)
// and change the startIndex of your string search
start = range.lowerBound < range.upperBound ? range.upperBound : text.index(range.lowerBound, offsetBy: 1, limitedBy: text.endIndex) ?? text.endIndex
}
ranges.forEach({print(text[$0])})
This will print
#match
#hashtag
If you need to use it more than once in your code you should add this extension to your project:
extension StringProtocol {
    func ranges<S: StringProtocol>(of string: S, options: String.CompareOptions = []) -> [Range<Index>] {
        var result: [Range<Index>] = []
        var start = startIndex
        while start < endIndex,
            let range = self[start...].range(of: string, options: options) {
                result.append(range)
                start = range.lowerBound < range.upperBound ?
                    range.upperBound : index(after: range.lowerBound)
        }
        return result
    }
}
usage:
let text = "it is need #match my both #hashtag!"
let pattern = "(?<!\\S)#[\\p{L}0-9_]*"
let ranges = text.ranges(of: pattern, options: .regularExpression)
let matches = ranges.map{text[$0]}
print(matches) // ["#match", "#hashtag"]
How can I check whether a WHOLE string matches a regex? In Java there is the method String.matches(regex).
You need to use anchors, ^ (start of string anchor) and $ (end of string anchor), with range(of:options:range:locale:), passing the .regularExpression option:
import Foundation
let phoneNumber = "123-456-789"
let result = phoneNumber.range(of: "^\\d{3}-\\d{3}-\\d{3}$", options: .regularExpression) != nil
print(result)
Or, you may pass an array of options, [.regularExpression, .anchored], where .anchored will anchor the pattern at the start of the string only, and you will be able to omit ^, but still, $ will be required to anchor at the string end:
let result = phoneNumber.range(of: "\\d{3}-\\d{3}-\\d{3}$", options: [.regularExpression, .anchored]) != nil
See the online Swift demo
Also, using NSPredicate with MATCHES is an alternative here:
The left hand expression equals the right hand expression using a regex-style comparison according to ICU v3 (for more details see the ICU User Guide for Regular Expressions).
MATCHES actually anchors the regex match both at the start and end of the string (note this might not work in all Swift 3 builds):
let pattern = "\\d{3}-\\d{3}-\\d{3}"
let predicate = NSPredicate(format: "self MATCHES [c] %@", pattern)
let result = predicate.evaluate(with: "123-456-789")
What you are looking for is range(of:options:range:locale:). You can then compare the range it returns with the whole range of the string being tested.
Example:
let phoneNumber = "(999) 555-1111"
let wholeRange = phoneNumber.startIndex..<phoneNumber.endIndex
if let match = phoneNumber.range(of: "\\(?\\d{3}\\)?\\s\\d{3}-\\d{4}", options: .regularExpression), wholeRange == match {
print("Valid number")
}
else {
print("Invalid number")
}
//Valid number
Edit: You can also use NSPredicate and test your string with its evaluate(with:) method.
let pattern = "^\\(?\\d{3}\\)?\\s\\d{3}-\\d{4}$"
let predicate = NSPredicate(format: "self MATCHES [c] %@", pattern)
if predicate.evaluate(with: "(888) 555-1111") {
print("Valid")
}
else {
print("Invalid")
}
Adapted from "Swift extract regex matches", with a small edit:
import Foundation
func matches(for regex: String, in text: String) -> Bool {
do {
let regex = try NSRegularExpression(pattern: regex)
let nsString = text as NSString
let results = regex.matches(in: text, range: NSRange(location: 0, length: nsString.length))
return !results.isEmpty
} catch let error {
print("invalid regex: \(error.localizedDescription)")
return false
}
}
Example usage from link above:
let string = "19320"
let matched = matches(for: "^[1-9]\\d*$", in: string)
print(matched) // true: the whole string matches
let otherString = "a19320"
let otherMatched = matches(for: "^[1-9]\\d*$", in: otherString)
print(otherMatched) // false: the leading "a" prevents a match
I am making an app in Swift and I need to capture 8 digits from a string.
Here's the string:
index.php?page=index&l=99182677
My pattern is:
&l=(\d{8,})
And here's my code:
var yourAccountNumber = "index.php?page=index&l=99182677"
let regex = try! NSRegularExpression(pattern: "&l=(\\d{8,})", options: NSRegularExpressionOptions.CaseInsensitive)
let range = NSMakeRange(0, yourAccountNumber.characters.count)
let match = regex.matchesInString(yourAccountNumber, options: NSMatchingOptions.Anchored, range: range)
Firstly, I don't understand what NSMatchingOptions means; the official Apple documentation doesn't make .Anchored, .ReportProgress, etc. clear to me. Could anyone enlighten me on this?
Then, when I print(match), the variable appears to be empty ([]).
I am using Xcode 7 Beta 3, with Swift 2.0.
ORIGINAL ANSWER
Here is a function you can leverage to get captured group texts:
import Foundation
extension String {
func firstMatchIn(string: NSString!, atRangeIndex: Int!) -> String {
var error : NSError?
let re = NSRegularExpression(pattern: self, options: .CaseInsensitive, error: &error)
let match = re.firstMatchInString(string, options: .WithoutAnchoringBounds, range: NSMakeRange(0, string.length))
return string.substringWithRange(match.rangeAtIndex(atRangeIndex))
}
}
And then:
var result = "&l=(\\d{8,})".firstMatchIn(yourAccountNumber, atRangeIndex: 1)
The 1 in atRangeIndex: 1 will extract the text captured by (\d{8,}) capture group.
NOTE1: If you plan to extract 8, and only 8 digits after &l=, you do not need the , in the limiting quantifier, as {8,} means 8 or more. Change to {8} if you plan to capture just 8 digits.
NOTE2: NSMatchingAnchored is something you would like to avoid if your expected result is not at the beginning of a search range. See documentation:
Specifies that matches are limited to those at the start of the search range.
NOTE3: Speaking about "simplest" things, I'd advise to avoid using look-arounds whenever you do not have to. Look-arounds usually come at some cost to performance, and if you are not going to capture overlapping text, I'd recommend to use capture groups.
UPDATE FOR SWIFT 2
I have come up with a function that will return all matches with all capturing groups (similar to preg_match_all in PHP). Here is a way to use it for your scenario:
func regMatchGroup(regex: String, text: String) -> [[String]] {
    do {
        var resultsFinal = [[String]]()
        let regex = try NSRegularExpression(pattern: regex, options: [])
        let nsString = text as NSString
        let results = regex.matchesInString(text,
            options: [], range: NSMakeRange(0, nsString.length))
        for result in results {
            // Collect group 0 (the whole match) followed by each capture group.
            var internalString = [String]()
            for i in 0..<result.numberOfRanges {
                internalString.append(nsString.substringWithRange(result.rangeAtIndex(i)))
            }
            resultsFinal.append(internalString)
        }
        return resultsFinal
    } catch let error as NSError {
        print("invalid regex: \(error.localizedDescription)")
        // Return no matches at all, so that callers checking `count > 0` stay safe.
        return []
    }
}
// USAGE:
let yourAccountNumber = "index.php?page=index&l=99182677"
let matches = regMatchGroup("&l=(\\d{8,})", text: yourAccountNumber)
if (matches.count > 0) // If we have matches....
{
print(matches[0][1]) // Print the first one, Group 1.
}
It may be easier just to use the NSString method instead of NSRegularExpression.
var yourAccountNumber = "index.php?page=index&l=99182677"
println(yourAccountNumber) // index.php?page=index&l=99182677
let regexString = "(?<=&l=)\\d{8,}+"
let options :NSStringCompareOptions = .RegularExpressionSearch | .CaseInsensitiveSearch
if let range = yourAccountNumber.rangeOfString(regexString, options:options) {
let digits = yourAccountNumber.substringWithRange(range)
println("digits: \(digits)")
}
else {
print("Match not found")
}
The (?<=&l=) is a look-behind: the digits must be preceded by &l=, which is not itself part of the match.
In detail:
Look-behind assertion. True if the parenthesized pattern matches text preceding the current input position, with the last character of the match being the input character just before the current position. Does not alter the input position. The length of possible strings matched by the look-behind pattern must not be unbounded (no * or + operators.)
In general, worrying about the performance of a look-behind without measurements to back it up is just premature optimization. That being said, there may be other valid reasons for and against look-arounds in regular expressions.
ICU User Guide: Regular Expressions
For Swift 2, you can use this extension of String:
import Foundation
extension String {
func firstMatchIn(string: NSString!, atRangeIndex: Int!) -> String {
do {
let re = try NSRegularExpression(pattern: self, options: NSRegularExpressionOptions.CaseInsensitive)
let match = re.firstMatchInString(string as String, options: .WithoutAnchoringBounds, range: NSMakeRange(0, string.length))
return string.substringWithRange(match!.rangeAtIndex(atRangeIndex))
} catch {
return ""
}
}
}
You can get the account-number with:
var result = "&l=(\\d{8,})".firstMatchIn(yourAccountNumber, atRangeIndex: 1)
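Replace NSMatchingOptions.Anchored with NSMatchingOptions() (no options). With .Anchored, a match must begin at the start of the search range, so the &l=... part in the middle of the string is never found.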