How to parse matched separators by nom? - regex

I want to parse a YMD date in four forms ("20190919", "2019.09.19", "2019-09-19", and "2019/09/19") with the nom library.
I started with the iso8601 parser, which parses only the "YYYY-MM-DD" form. I tried to match the separator and reuse it for the next match, as a regex backreference would: (\d{4})([-./]?)(\d{2})\2(\d{2}).
It turned out that this code works:
fn parse_ymd(i: &[u8]) -> IResult<&[u8], DateType> {
let (i, y) = year(i)?;
// Match the separator if it exists.
let (i, sep) = opt(one_of(".-/"))(i)?;
let (i, m) = month(i)?;
// If first separator was matched then try to find next one.
let (i, _) = if let Some(sep) = sep {
tag(&[sep as u8])(i)?
} else {
// Support the same signature as previous branch.
(i, &[' ' as u8][..])
};
let (i, d) = day(i)?;
Ok((
i,
DateType::YMD {
year: y,
month: m,
day: d,
},
))
}
But it obviously looks weird.
Does nom offer tools to do this in a more appropriate way?
(This question is about nom functionality and how to do things properly there, not just about this particular example.)

Your solution is decent enough. There is only one suggestion I can offer really:
fn parse_ymd(i: &[u8]) -> IResult<&[u8], DateType> {
...
// If first separator was matched then try to find next one.
let i = match sep {
Some(sep) => tag(&[sep as u8])(i)?.0,
_ => i,
};
...
}
You may not be familiar with the syntax for accessing a tuple element directly. From the Rust book:
In addition to destructuring through pattern matching, we can access a tuple element directly by using a period (.) followed by the index of the value we want to access.
In this case, it saves you the awkwardness of trying to match the signature of two arms.
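To see the "remember the first separator, then require the same one" idea in isolation, here is a minimal sketch that uses only the standard library rather than nom. The names parse_ymd and digits and the tuple return type are assumptions made for illustration; they are not part of the iso8601 parser or of nom.

```rust
/// Parses "YYYYMMDD", "YYYY.MM.DD", "YYYY-MM-DD" or "YYYY/MM/DD".
/// Returns (year, month, day), or None if the input is malformed.
/// Illustrative only: no range checks on month or day.
fn parse_ymd(input: &str) -> Option<(u16, u8, u8)> {
    // Take exactly `n` ASCII digits from the front of `s`.
    fn digits(s: &str, n: usize) -> Option<(&str, &str)> {
        let head = s.get(..n)?;
        if head.bytes().all(|b| b.is_ascii_digit()) {
            Some((head, &s[n..]))
        } else {
            None
        }
    }

    let (year, rest) = digits(input, 4)?;
    // Optional first separator; remember which one it was.
    let sep = rest.chars().next().filter(|c| ".-/".contains(*c));
    let rest = match sep {
        Some(c) => &rest[c.len_utf8()..],
        None => rest,
    };
    let (month, rest) = digits(rest, 2)?;
    // The second separator must equal the first (the `\2` backreference).
    let rest = match sep {
        Some(c) => rest.strip_prefix(c)?,
        None => rest,
    };
    let (day, rest) = digits(rest, 2)?;
    if !rest.is_empty() {
        return None; // trailing input is rejected
    }
    Some((
        year.parse::<u16>().ok()?,
        month.parse::<u8>().ok()?,
        day.parse::<u8>().ok()?,
    ))
}
```

As in the answer above, the second match only yields the remaining input, so both arms have the same type without a dummy value.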

Related

Extract Int from regex using pattern matching without extracting as String and then casting toInt

I have a year, expressed in the format XXYY-ZZ. For example, the year 2020-21 would represent a year spanning 2020 to 2021. I need to extract XXYY, YY and ZZ as Ints to use in calculations later.
Using Pattern matching and regex, I can extract values I want as Strings, like this:
import scala.util.matching.Regex
val YearFormatRegex: Regex = "(20([1-9][0-9]))-([1-9][0-9])".r
"2020-21" match {
case YearFormatRegex(fullStartYear, start, end) => println(fullStartYear, start, end)
case _ => println("did not match")
}
// will print (2020, 20, 21)
However I need the values as Ints. Is there a way to extract these values as Ints without throwing .toInt all over the place? I understand that the regex specifically looks for numbers so extracting them as Strings and then parsing as Ints seems like an unnecessary step if I can avoid it.
If you want to simply encapsulate the conversion, one way to do it could be to create your own extractor object built around your regular expression, e.g.:
import scala.util.matching.Regex
object Year {
private val regex: Regex = "(20([1-9][0-9]))-([1-9][0-9])".r
def unapply(s: String): Option[(Int, Int, Int)] =
s match {
case regex(prefix, from, to) => Some((prefix.toInt, from.toInt, to.toInt))
case _ => None
}
}
"2020-21" match {
case Year(fullStartYear, start, end) => fullStartYear - start + end
case _ => 0
} // returns 2020 - 20 + 21 = 2021
You can read more on extractor objects here on the Scala official documentation.
You can play around with this code here on Scastie.
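For comparison, the same "convert once, in one place" idea can be sketched in Rust with only the standard library. The function name parse_year_span is hypothetical, and the digit checks are a hand-written stand-in for the regex above, not a translation of the Scala extractor.

```rust
/// Parses "20YY-ZZ" (for example "2020-21") into (full start year, YY, ZZ).
/// Returns None if the input does not have that shape.
fn parse_year_span(s: &str) -> Option<(u32, u32, u32)> {
    let (start, end) = s.split_once('-')?;
    let yy = start.strip_prefix("20")?;
    // Each part must be two digits with a non-zero first digit,
    // matching [1-9][0-9] in the regex.
    let two_digit = |t: &str| {
        let b = t.as_bytes();
        b.len() == 2 && (b'1'..=b'9').contains(&b[0]) && b[1].is_ascii_digit()
    };
    if !two_digit(yy) || !two_digit(end) {
        return None;
    }
    Some((
        start.parse::<u32>().ok()?,
        yy.parse::<u32>().ok()?,
        end.parse::<u32>().ok()?,
    ))
}
```

Callers then work with integers only, for example `parse_year_span("2020-21").map(|(full, from, to)| full - from + to)`.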

How to use match with regular expressions in Scala

I am starting to learn Scala and want to use regular expressions to match a character from a string so I can populate a mutable map of characters and their value (String values, numbers etc) and then print the result.
I have looked at several answers on SO and gone over the Scala Docs but can't seem to get this right. I have a short Lexer class that currently looks like this:
class Lexer {
private val tokens: mutable.Map[String, Any] = collection.mutable.Map()
private def checkCharacter(char: Character): Unit = {
val Operator = "[-+*/^%=()]".r
val Digit = "[\\d]".r
val Other = "[^\\d][^-+*/^%=()]".r
char.toString match {
case Operator(c) => tokens(c) = "Operator"
case Digit(c) => tokens(c) = Integer.parseInt(c)
case Other(c) => tokens(c) = "Other" // Temp value, write function for this
}
}
def lex(input: String): Unit = {
val inputArray = input.toArray
for (s <- inputArray)
checkCharacter(s)
for((key, value) <- tokens)
println(key + ": " + value)
}
}
I'm pretty confused by the sort of strange method syntax, Operator(c), that I have seen being used to handle the value to match and am also unsure if this is the correct way to use regex in Scala. I think what I want this code to do is clear, I'd really appreciate some help understanding this. If more info is needed I will supply what I can
The official documentation has lots of examples: https://www.scala-lang.org/api/2.12.1/scala/util/matching/Regex.html. What might be confusing is the type of the regular expression and its use in pattern matching...
You can construct a regex from any string by using .r:
scala> val regex = "(something)".r
regex: scala.util.matching.Regex = (something)
Your regex becomes an object that has a few useful methods to be able to find matching groups like findAllIn.
In Scala it's idiomatic to use pattern matching for safe extraction of values, so the Regex class also has an unapplySeq method to support pattern matching. This makes it an extractor object. You can use it directly (not common):
scala> regex.unapplySeq("something")
res1: Option[List[String]] = Some(List(something))
or you can let Scala compiler call it for you when you do pattern matching:
scala> "something" match {
| case regex(x) => x
| case _ => ???
| }
res2: String = something
You might ask why exactly this return type on unapply/unapplySeq. The doc explains it very well:
The return type of an unapply should be chosen as follows:
If it is just a test, return a Boolean. For instance case even().
If it returns a single sub-value of type T, return an Option[T].
If you want to return several sub-values T1,...,Tn, group them in an optional tuple Option[(T1,...,Tn)].
Sometimes, the number of values to extract isn’t fixed and we would
like to return an arbitrary number of values, depending on the input.
For this use case, you can define extractors with an unapplySeq method
which returns an Option[Seq[T]]. Common examples of these patterns
include deconstructing a List using case List(x, y, z) => and
decomposing a String using a regular expression Regex, such as case
r(name, remainingFields @ _*) =>
In short your regex might match one or more groups, thus you need to return a list/seq. It has to be wrapped in an Option to comply with extractor contract.
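The way you are using regex is almost correct: each pattern needs a capture group (note the added parentheses in the code that follows) so that case Operator(c) has something to bind to c. I would also map your function over the input array to avoid creating mutable maps. Perhaps something like this: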
class Lexer {
private def getCharacterType(char: Character): Any = {
val Operator = "([-+*/^%=()])".r
val Digit = "([\\d])".r
//val Other = "[^\\d][^-+*/^%=()]".r
char.toString match {
case Operator(c) => "Operator"
case Digit(c) => Integer.parseInt(c)
case _ => "Other" // Temp value, write function for this
}
}
def lex(input: String): Unit = {
val inputArray = input.toArray
val tokens = inputArray.map(x => x -> getCharacterType(x))
for((key, value) <- tokens)
println(key + ": " + value)
}
}
scala> val l = new Lexer()
l: Lexer = Lexer@60f662bd
scala> l.lex("a-1")
a: Other
-: Operator
1: 1

"decimal literal empty" when combining several strings for a regex in Rust

I'm looking to parse a string to create a vector of floats:
fn main() {
let vector_string: &str = "{12.34, 13.}";
let vec = parse_axis_values(vector_string);
// --- expected output vec: Vec<f32> = vec![12.34, 13.]
}
use regex::Regex;
pub fn parse_axis_values(str_values: &str) -> Vec<f32> {
let pattern_float = String::from(r"\s*(\d*.*\d*)\s*");
let pattern_opening = String::from(r"\s*{{");
let pattern_closing = String::from(r"}}\s*");
let pattern =
pattern_opening + "(" + &pattern_float + ",)*" + &pattern_float + &pattern_closing;
let re = Regex::new(&pattern).unwrap();
let mut vec_axis1: Vec<f32> = Vec::new();
// --- snip : for loop for adding the elements to the vector ---
vec_axis1
}
This code compiles but an error arises at runtime when unwrapping the Regex::new():
regex parse error:
\s*{{(\s*(\d*.*\d*)\s*,)*\s*(\d*.*\d*)\s*}}\s*
^
error: decimal literal empty
According to other posts, this error can arise when escaping the curly bracket { is not properly done, but I think I escaped the bracket properly.
What is wrong with this regex?
There are several problems in your code:
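Escaping a { in a regex is done with \{. Doubling it as {{ is only an escape in format strings; in a regex, an unescaped { starts a repetition count, hence "decimal literal empty".
An unescaped . matches any character, so .* takes far more than the decimal point. You must escape it as \. to match a literal dot.
You're capturing more than just the number, which makes the parsing more complex.
Your regex building is unnecessarily verbose; you can comment a single pattern instead by using the (?x) flag.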
Here's a proposed improved version:
use regex::Regex;
pub fn parse_axis_values(str_values: &str) -> Vec<f32> {
let re = Regex::new(r"(?x)
\s*\{\s* # opening
(\d*\.\d*) # captured float
\s*,\s* # separator
\d*\.\d* # ignored float
\s*\}\s* # closing
").unwrap();
let mut vec_axis1: Vec<f32> = Vec::new();
if let Some(c) = re.captures(str_values) {
if let Some(g) = c.get(1) {
vec_axis1.push(g.as_str().parse().unwrap());
}
}
vec_axis1
}
fn main() {
let vector_string: &str = "{12.34, 13.}";
let vec = parse_axis_values(vector_string);
println!("v: {:?}", vec);
}
If you call this function several times, you might want to avoid recompiling the regex at each call too.
I want to be able to match 0.123, .123, 123 or 123.; using \d+ would rule out some of these.
It looks like you want to fetch all the floats in the string. This could be simply done like this:
use regex::Regex;
pub fn parse_axis_values(str_values: &str) -> Vec<f32> {
let re = Regex::new(r"\d*\.\d*").unwrap();
let mut vec_axis1: Vec<f32> = Vec::new();
for c in re.captures_iter(str_values) {
vec_axis1.push(c[0].parse().unwrap());
}
vec_axis1
}
If you want both:
to check the complete string is correctly wrapped between { and }
to capture all numbers
Then you could either:
combine two regexes (the first one used to extract the internal part)
use a Serde-based parser (I wouldn't at this point but it would be interesting if the problem's complexity grows)
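A third option, if the input is always a brace-wrapped, comma-separated list, is to skip the regex and lean on the standard library. The sketch below is an assumption about the input format, not a drop-in replacement: it returns Option<Vec<f32>> instead of Vec<f32>, and f32's parser is more permissive than \d*\.\d* (it also accepts signs, exponents, "inf" and "nan"). It does accept all four shapes from the comment above, including 123 without a dot, which \d*\.\d* would not match.

```rust
/// Parses a brace-wrapped, comma-separated list such as "{12.34, 13.}".
/// Returns None if the braces are missing or any item is not a valid float.
/// Note that an empty list "{}" is also rejected, because "" is not a float.
fn parse_axis_values(input: &str) -> Option<Vec<f32>> {
    // Check the wrapping braces and keep only the inner part.
    let inner = input.trim().strip_prefix('{')?.strip_suffix('}')?;
    inner
        .split(',')
        // One failed item turns the whole result into None.
        .map(|item| item.trim().parse::<f32>().ok())
        .collect()
}
```

Collecting into an Option makes malformed input visible to the caller instead of panicking on unwrap.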

Find index locations by regex pattern and replace them with a list of indexes in Scala

I have strings in this format:
object[i].base.base_x[i] and I get lists like List(0,1).
I want to use regular expressions in Scala to find the match [i] in the given string and replace the first occurrence with 0 and the second with 1, giving something like object[0].base.base_x[1].
I have the following code:
val stringWithoutIndex = "object[i].base.base_x[i]" // basically this string is generated dynamically
val indexReplacePattern = raw"\[i\]".r
val indexValues = List(0,1) // list generated dynamically
if (indexValues.nonEmpty) {
indexValues.map(row => {
indexReplacePattern.replaceFirstIn(stringWithoutIndex, "[" + row + "]")
})
} else stringWithoutIndex
Since String is immutable, I cannot update stringWithoutIndex, so the output is List("object[0].base.base_x[i]", "object[1].base.base_x[i]").
I tried looking into StringBuilder but I am not sure how to update it. Also, is there a better way to do this? Suggestions other than regex are also welcome.
You could loop through the integers in indexValues using foldLeft, passing the string stringWithoutIndex as the start value.
Then use replaceFirst to replace the first match with the current value of indexValues.
If you want to use a regex, you might use a positive lookahead (?=]) and a positive lookbehind (?<=\[) to assert that the i is between opening and closing square brackets.
(?<=\[)i(?=])
For example:
val strRegex = """(?<=\[)i(?=])"""
val res = indexValues.foldLeft(stringWithoutIndex) { (s, row) =>
s.replaceFirst(strRegex, row.toString)
}
See the regex demo | Scala demo
How about this:
scala> val str = "object[i].base.base_x[i]"
str: String = object[i].base.base_x[i]
scala> str.replace('i', '0').replace("base_x[0]", "base_x[1]")
res0: String = object[0].base.base_x[1]
This sounds like a job for foldLeft. No need for the if (indexValues.nonEmpty) check.
indexValues.foldLeft(stringWithoutIndex) { (s, row) =>
indexReplacePattern.replaceFirstIn(s, "[" + row + "]")
}
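The fold pattern is not specific to Scala. As a rough illustration, here is the same "thread the string through a fold, replacing one placeholder per step" idea in Rust using only the standard library; the function name fill_indices is made up for this sketch, and it matches the literal text [i] rather than a regex.

```rust
/// Replaces successive "[i]" placeholders with the given indices, left to right.
/// Placeholders beyond the number of indices are left untouched.
fn fill_indices(template: &str, indices: &[u32]) -> String {
    indices.iter().fold(template.to_string(), |acc, idx| {
        // A count of 1 replaces only the first remaining placeholder.
        acc.replacen("[i]", &format!("[{}]", idx), 1)
    })
}
```

With an empty index list the fold returns the start value unchanged, which is why no separate nonEmpty check is needed.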

Nicer way to access match results?

My requirement is to transform some textual message ids. Input is
a.messageid=X0001E
b.messageid=Y0001E
The task is to turn that into
a.messageid=Z00001E
b.messageid=Z00002E
In other words: fetch the first part of each line (like a.) and append a slightly different id.
My current solution:
val matcherForIds = Regex("(.*)\\.messageid=(X|Y)\\d{4,6}E")
var idCounter = 5
fun transformIds(line: String): String {
val result = matcherForIds.matchEntire(line) ?: return line
return "${result.groupValues.get(1)}.messageid=Z%05dE".format(idCounter++)
}
This works, but I find the way I get at the first group, "${result.groupValues.get(1)}", not very elegant.
Is there a nicer-to-read or more concise way to access that first match?
You may get the result without a separate function:
val line = s.replace("""^(.*\.messageid=)[XY]\d{4,6}E$""".toRegex()) {
"${it.groupValues[1]}Z%05dE".format(messageCounter++)
}
However, as you need to format the messageCounter into the result, you cannot just use a string replacement pattern and you cannot get rid of ${it.groupValues[1]}.
Also, note:
You may get rid of double backslashes by means of the triple-quoted string literal
There is no need adding .messageid= to the replacement if you capture that part into Group 1 (see (.*\.messageid=))
There is no need capturing X or Y since you are not using them later, thus, (X|Y) can be replaced with a more efficient character class [XY].
The ^ and $ make sure the pattern should match the entire string, else, there will be no match and the string will be returned as is, without any modification.
See the Kotlin demo online.
Maybe not really what you are looking for, but maybe it is. What if you first ensure (filter) the lines of interest and just replace what needs to be replaced instead, e.g. use the following transformation function:
val matcherForIds = Regex("(.*)\\.messageid=(X|Y)\\d{4,6}E")
val idRegex = Regex("[XY]\\d{4,6}E")
var idCounter = 5
fun transformIds(line: String) = idRegex.replace(line) {
"Z%05dE".format(idCounter++)
}
with the following filter:
"a.messageid=X0001E\nb.messageid=Y0001E"
.lineSequence()
.filter(matcherForIds::matches)
.map(::transformIds)
.forEach(::println)
In case there are also other strings that are relevant which you want to keep then the following is also possible but not as nice as the solution at the end:
"a.messageid=X0001E\nnot interested line, but required in the output!\nb.messageid=Y0001E"
.lineSequence()
.map {
when {
matcherForIds.matches(it) -> transformIds(it)
else -> it
}
}
.forEach(::println)
Alternatively (now just copying Wiktor's regex, as it already contains all we need: a complete match from the beginning of the line ^ up to the end of the line $, etc.):
val matcherForIds = Regex("""^(.*\.messageid=)[XY]\d{4,6}E$""")
fun transformIds(line: String) = matcherForIds.replace(line) {
"${it.groupValues[1]}Z%05dE".format(idCounter++)
}
This way you ensure that lines that completely match the desired input are replaced and the others are kept but not replaced.
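The "replace only lines that match completely, keep the rest" behaviour can also be sketched without a regex. The Rust version below uses only the standard library; transform_id and is_old_id are hypothetical names, and the hand-written check is meant to mirror [XY]\d{4,6}E under the assumption that ids are plain ASCII.

```rust
/// True if `id` looks like X or Y, then 4 to 6 digits, then E.
fn is_old_id(id: &str) -> bool {
    let b = id.as_bytes();
    (6..=8).contains(&b.len())
        && matches!(b[0], b'X' | b'Y')
        && b[b.len() - 1] == b'E'
        && b[1..b.len() - 1].iter().all(|c| c.is_ascii_digit())
}

/// Rewrites "<prefix>.messageid=X0001E" to "<prefix>.messageid=Z<counter>E",
/// zero-padding the counter to five digits and incrementing it on each match.
/// Lines that do not match completely are returned unchanged.
fn transform_id(line: &str, counter: &mut u32) -> String {
    const KEY: &str = ".messageid=";
    // Split at the last occurrence, like the greedy (.*) in the regex.
    let pos = match line.rfind(KEY) {
        Some(pos) => pos,
        None => return line.to_string(),
    };
    let (prefix, id) = (&line[..pos], &line[pos + KEY.len()..]);
    if !is_old_id(id) {
        return line.to_string();
    }
    let out = format!("{}{}Z{:05}E", prefix, KEY, *counter);
    *counter += 1;
    out
}
```

Passing the counter as a mutable reference keeps the numbering state explicit instead of relying on a global variable.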