regex match string starting at offset - regex

I'm learning Rust and trying to write a simple tokenizer right now. I want to go through a string running each regular expression against the current position in the string, create a token, then skip ahead and repeat until I've processed the whole string. I know I can put them into a larger regex and loop through captures, but I need to process them individually for domain reasons.
However, I see nothing in the regex crate that accepts an offset so I can begin matching again at a specific point.
extern crate regex;
use regex::Regex;

fn main() {
    let input = "3 + foo/4";
    let ident_re = Regex::new("[a-zA-Z][a-zA-Z0-9]*").unwrap();
    let number_re = Regex::new("[1-9][0-9]*").unwrap();
    // '-' goes first in the class so it is a literal; "[+-*/]" is rejected as an invalid range.
    let ops_re = Regex::new(r"[-+*/]").unwrap();
    let ws_re = Regex::new(r"[ \t\n\r]*").unwrap();
    let mut i: usize = 0;
    while i < input.len() {
        // Here check each regex to see if a match starting at input[i]
        // if so copy the match and increment i by length of match.
    }
}
The regexes that I'm currently scanning for will actually vary at runtime too. Sometimes I may only be looking for a few of them, while at other times (at top level) I might be looking for almost all of them.

The regex crate works on string slices. You can always take a sub-slice of another slice and then operate on that one. Instead of moving along indices, you can modify the variable that points to your slice to point to your subslice.
fn main() {
    let mut s = "hello";
    while !s.is_empty() {
        println!("{}", s);
        s = &s[1..];
    }
}
Note that the slice operation slices at byte positions, not UTF-8 character positions. This allows the slicing operation to be done in O(1) instead of O(n), but the program will panic if either index you slice at falls in the middle of a multi-byte UTF-8 character.
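Applied to the tokenizer from the question, the slice-advancing loop might look like the sketch below. To keep it free of external crates, plain functions stand in for the regexes; this is an assumption of the sketch, and match_ident, match_number, match_op, match_ws and tokenize are hypothetical names. Each matcher takes the remaining slice and returns the byte length of a match at its very start. With the regex crate, the equivalent would be to anchor each pattern with ^ and call find on the remaining slice. Unrecognized characters are skipped by a whole character's UTF-8 length, so the slice never starts in the middle of a multi-byte character.

```rust
/// A matcher reports the byte length of a match at the very start of `s`.
type Matcher = fn(&str) -> Option<usize>;

/// Byte length of the longest prefix whose chars all satisfy `pred`, if non-empty.
fn prefix_len(s: &str, pred: fn(char) -> bool) -> Option<usize> {
    let len: usize = s.chars().take_while(|&c| pred(c)).map(char::len_utf8).sum();
    if len > 0 { Some(len) } else { None }
}

fn match_ident(s: &str) -> Option<usize> {
    // Must start with a letter, followed by letters or digits.
    if s.starts_with(|c: char| c.is_ascii_alphabetic()) {
        prefix_len(s, |c: char| c.is_ascii_alphanumeric())
    } else {
        None
    }
}

fn match_number(s: &str) -> Option<usize> {
    prefix_len(s, |c: char| c.is_ascii_digit())
}

fn match_op(s: &str) -> Option<usize> {
    if s.starts_with(|c: char| "+-*/".contains(c)) { Some(1) } else { None }
}

fn match_ws(s: &str) -> Option<usize> {
    prefix_len(s, |c: char| c.is_ascii_whitespace())
}

/// Tries each matcher against the start of the remaining input and advances
/// the slice past whatever matched. The matcher list can differ per call.
fn tokenize<'a>(
    input: &'a str,
    matchers: &[(&'static str, Matcher)],
) -> Vec<(&'static str, &'a str)> {
    let mut tokens = Vec::new();
    let mut rest = input;
    'outer: while !rest.is_empty() {
        for &(kind, matcher) in matchers {
            if let Some(len) = matcher(rest) {
                tokens.push((kind, &rest[..len]));
                rest = &rest[len..]; // O(1): only the start of the slice moves
                continue 'outer;
            }
        }
        // No matcher applied: skip one whole character, not one byte.
        let skip = rest.chars().next().map_or(1, char::len_utf8);
        rest = &rest[skip..];
    }
    tokens
}

fn main() {
    let matchers: [(&'static str, Matcher); 4] = [
        ("ident", match_ident),
        ("number", match_number),
        ("op", match_op),
        ("ws", match_ws),
    ];
    for (kind, text) in tokenize("3 + foo/4", &matchers) {
        println!("{kind}: {text:?}");
    }
}
```

Because the set of matchers is passed in as a slice, the caller can choose a different subset at runtime, which is what the question asks for.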

Related

Incremental regex matching in Kotlin

This code snippet searches for the first regex match in a file for which one of the capture groups matches a local variable, and then obtains the value of the other capture group.
Is it possible to write this in an idiomatic, more efficient version which doesn't find all matches up front, but rather matches incrementally, without an explicit loop?
val id = ANCHOR_REGEX.findAll(apiFile.readText())
    .find { label == it.groups["label"]?.value }
    ?.let { it.groups["id"]?.value }
Yes, there is. And you've already done it. findAll returns a Sequence<MatchResult>, and just to be extra sure, we can look in the source code and see the implementation ourselves.
public actual fun findAll(input: CharSequence, startIndex: Int = 0): Sequence<MatchResult> {
    if (startIndex < 0 || startIndex > input.length) {
        throw IndexOutOfBoundsException("Start index out of bounds: $startIndex, input length: ${input.length}")
    }
    return generateSequence({ find(input, startIndex) }, MatchResult::next)
}
That's generateSequence from the standard library, which produces a lazy sequence whose next element is computed by calling the function on the previous one. find on a Sequence also short-circuits, so the code you've already written will incrementally find matches until it finds the one you want or exhausts the string.
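The short-circuiting can be observed directly by counting how many elements are pulled from a lazy sequence before the search stops. The sketch below does this in Rust rather than Kotlin, with std::iter::successors playing the role of generateSequence and plain integers standing in for match results; both are assumptions of the illustration, and first_match is a hypothetical helper, not part of any Kotlin API.

```rust
use std::cell::Cell;
use std::iter::successors;

/// Returns the first generated value satisfying `wanted`, together with the
/// number of elements that were pulled from the sequence to find it.
fn first_match(limit: u32, wanted: impl Fn(u32) -> bool) -> (Option<u32>, u32) {
    let pulled = Cell::new(0u32);
    // Lazy sequence 1, 2, 3, ... up to `limit`; elements are produced on demand.
    let mut seq = successors(Some(1u32), |&prev| {
        if prev < limit { Some(prev + 1) } else { None }
    })
    // Count every element that actually leaves the sequence.
    .inspect(|_| pulled.set(pulled.get() + 1));
    // `find` stops pulling as soon as the predicate holds.
    let found = seq.find(|&n| wanted(n));
    (found, pulled.get())
}

fn main() {
    let (found, pulled) = first_match(1_000_000, |n| n % 7 == 0);
    println!("found {found:?} after pulling {pulled} elements");
}
```

Even though the sequence could yield a million elements, only the ones up to the first hit are ever requested.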

Quick regex_search/replace, or clear indication of replacement?

I must browse a collection of strings to replace a pattern and save the changes.
The saving operation is (very) expensive and out of my hands, so I would like to know beforehand if the replacement did anything.
I can use std::regex_search to find out whether the pattern is present in my input, and use capture groups to store details in a std::smatch. std::regex_replace does not seem to explicitly tell me whether it did anything.
The patterns and strings are arbitrarily long and complicated; running regex_replace after a regex_search seems wasteful.
I can directly compare the input and output to search for a discrepancy but that too is uncomfortable.
Is there either a simple way to observe regex_replace to determine its impact, or a way to use an smatch filled by regex_search to do a faster replacement operation?
Thanks in advance.
No, regex_replace doesn't provide this information, and yes, you can do it with a regex_search loop.
For example like this:
#include <regex>
#include <string>

std::regex pattern("...");
std::string replacement_format = "...";
std::string input = "......"; // a very, very long string
std::string output, replacement;
std::smatch match;
auto begin = input.cbegin();
int replacements = 0;
while (std::regex_search(begin, input.cend(), match, pattern)) {
    output += match.prefix();                       // text before the match
    replacement = match.format(replacement_format);
    if (match[0] != replacement) {                  // count only real changes
        replacements++;
    }
    output += replacement;
    begin = match.suffix().first;                   // continue after the match
}
output.append(begin, input.cend());                 // remaining tail
if (replacements > 0) {
    // process output ...
}
As regex_replace creates a copy of your string you could simply compare the replaced string with the original one and only "store" the new one if they differ.
For C++14 it seems that the regex_replace overloads taking an output iterator return an iterator just past the last place they have written to. From https://www.cplusplus.com/reference/regex/regex_replace/: "Versions 5 and 6 return an iterator that points to the element past the last character written to the sequence pointed by out."

Nicer way to access match results?

My requirement is to transform some textual message ids. Input is
a.messageid=X0001E
b.messageid=Y0001E
The task is to turn that into
a.messageid=Z00001E
b.messageid=Z00002E
In other words: fetch the first part each line (like: a.), and append a slightly different id.
My current solution:
val matcherForIds = Regex("(.*)\\.messageid=(X|Y)\\d{4,6}E")
var messageCounter = 5
fun transformIds(line: String): String {
    val result = matcherForIds.matchEntire(line) ?: return line
    return "${result.groupValues.get(1)}.messageid=Z%05dE".format(messageCounter++)
}
This works, but I find the way I get to the first match, ${result.groupValues.get(1)}, not very elegant.
Is there a nicer-to-read, more concise way to access that first match?
You may get the result without a separate function:
val line = s.replace("""^(.*\.messageid=)[XY]\d{4,6}E$""".toRegex()) {
    "${it.groupValues[1]}Z%05dE".format(messageCounter++)
}
However, as you need to format the messageCounter into the result, you cannot just use a string replacement pattern and you cannot get rid of ${it.groupValues[1]}.
Also, note:
- You can get rid of the double backslashes by using a triple-quoted string literal.
- There is no need to add .messageid= to the replacement if you capture that part in Group 1 (see (.*\.messageid=)).
- There is no need to capture X or Y since you are not using them later; thus (X|Y) can be replaced with the more efficient character class [XY].
- The ^ and $ ensure the pattern matches the entire string; otherwise there is no match and the string is returned as is, without any modification.
See the Kotlin demo online.
Maybe not really what you are looking for, but maybe it is. What if you first ensure (filter) the lines of interest and just replace what needs to be replaced instead, e.g. use the following transformation function:
val matcherForIds = Regex("(.*)\\.messageid=(X|Y)\\d{4,6}E")
val idRegex = Regex("[XY]\\d{4,6}E")
var idCounter = 5
fun transformIds(line: String) = idRegex.replace(line) {
    "Z%05dE".format(idCounter++)
}
with the following filter:
"a.messageid=X0001E\nb.messageid=Y0001E"
    .lineSequence()
    .filter(matcherForIds::matches)
    .map(::transformIds)
    .forEach(::println)
If there are other lines that you are not transforming but still want to keep in the output, the following is also possible, but not as nice as the solution at the end:
"a.messageid=X0001E\nnot interested line, but required in the output!\nb.messageid=Y0001E"
    .lineSequence()
    .map {
        when {
            matcherForIds.matches(it) -> transformIds(it)
            else -> it
        }
    }
    .forEach(::println)
Alternatively (now just copying Wiktor's regex, as it already contains all we need: a complete match from the beginning of the line ^ up to the end of the line $, etc.):
val matcherForIds = Regex("""^(.*\.messageid=)[XY]\d{4,6}E$""")
fun transformIds(line: String) = matcherForIds.replace(line) {
    "${it.groupValues[1]}Z%05dE".format(idCounter++)
}
This way you ensure that lines that completely match the desired input are replaced and the others are kept but not replaced.

RegExp JS regarding sequential pattern matching

P.S.: I know there is an easy solution to my needs, and I could do it that way, but I am looking for a different solution for the sake of learning and the challenge. So this is just about solving an algorithm in a less traditional way.
I am working on solving an algorithm and thought I had everything working, but one use case is failing. That is because I am building a regexp dynamically. My issue is this:
I need to match letters sequentially up until one doesn't match, and then keep only what did match sequentially.
So, let's say I was matching this:
"zaazizz"
with this: /\bz[a]?[z]?/
"zizzi".match(/\bz[z]?[i]?/)
Currently, that matches [zi], but the match should only be [z].
The letters zzi, taken in that order, only match "z" from the front of "zizzi". I know I am using [z]? etc., so each letter is optional, but what I really need is to match sequentially: I should only get "zi" if, from the front, the input matched zzi as my regex spells it. So I probably need some sort of lookahead. I tried ?= and != with no luck.
I still think a non-regex-approach is best here. Have a look at the following JS-Code:
var match = "abcdef";
var input = "abcxdef";
var mArray = match.split("");
var inArray = input.split("");
// Compare every position up to the end of the shorter string.
var max = Math.min(mArray.length, inArray.length);
for (var i = 0; i < max; i++) {
    if (mArray[i] != inArray[i]) { break; }
}
input.substring(0, i); // "abc"
Where match is the string to be partially matched, input is the input and input.substring(0, i) is the result of the matching part. And you can change match as often as you like.

VB.Net Matching and replacing the contents of multiple overlapping sets of brackets in a string

I am using vb.net to parse my own basic scripting language, sample below. I am a bit stuck trying to deal with the 2 separate types of nested brackets.
Assuming name = Sam
Assuming timeFormat = hh:mm:ss
Assuming time() is a function that takes a format string but
has a default value and returns a string.
Hello [[name]], the time is [[time(hh:mm:ss)]].
Result: Hello Sam, the time is 19:54:32.
The full time is [[time()]].
Result: The full time is 05/06/2011 19:54:32.
The time in the format of your choice is [[time([[timeFormat]])]].
Result: The time in the format of your choice is 19:54:32.
I could in theory change the syntax of the script completely but I would rather not. It is designed like this to enable strings without quotes because it will be included in an XML file and quotes in that context were getting messy and very prone to errors and readability issues. If this fails I could redesign using something other than quotes to mark out strings but I would rather use this method.
Preferably, unless there is some other way I am not aware of, I would like to do this using regex. I am aware that the standard regex is not really capable of this but I believe this is possible using MatchEvaluators in vb.net and some form of recursion based replacing. However I have not been able to get my head around it for the last day or so, possibly because it is hugely difficult, possibly because I am ill, or possibly because I am plain thick.
I do have the following regex for parts of it.
Detecting the parentheses: (\w*?)\((.*?)\)(?=[^\(+\)]*(\(|$))
Detecting the square brackets: \[\[(.*?)\]\](?=[^\[+\]]*(\[\[|$))
I would really appreciate some help with this as it is holding the rest of my project back at the moment. And sorry if I have babbled on too much or not put enough detail, this is my first question on here.
Here's a little sample which might help you iterate through several matches/groups/captures. I realize that I am posting C# code, but it would be easy for you to convert that into VB.Net
using System.Diagnostics;
using System.Text.RegularExpressions;

// These two may be passed in as parameters:
string tosearch; // the string you are searching through
string regex;    // your pattern to match
//...
Match m;
CaptureCollection cc;
GroupCollection gc;
Regex r = new Regex(regex, RegexOptions.IgnoreCase);
m = r.Match(tosearch);
gc = m.Groups;
Debug.WriteLine("Number of groups found = " + gc.Count.ToString());
// Loop through each group.
for (int i = 0; i < gc.Count; i++)
{
    cc = gc[i].Captures;
    int counter = cc.Count;
    int grpnum = i + 1;
    Debug.WriteLine("Scanning group: " + grpnum.ToString());
    // Print number of captures in this group.
    Debug.WriteLine(" Captures count = " + counter.ToString());
    if (cc.Count >= 1)
    {
        foreach (Capture cap in cc)
        {
            Debug.WriteLine(string.Format(" Capture found: {0}", cap.ToString()));
        }
    }
}
Here is a slightly simplified version of the code I wrote for this. Thanks for the help everyone and sorry I forgot to post this before. If you have any questions or anything feel free to ask.
Function processString(ByVal scriptString As String)
    ' Functions
    Dim pattern As String = "\[\[((\w+?)\((.*?)\))(?=[^\(+\)]*(\(|$))\]\]"
    scriptString = Regex.Replace(scriptString, pattern, New MatchEvaluator(Function(match) processFunction(match)))
    ' Variables
    pattern = "\[\[([A-Za-z0-9+_]+)\]\]"
    scriptString = Regex.Replace(scriptString, pattern, New MatchEvaluator(Function(match) processVariable(match)))
    Return scriptString
End Function
Function processFunction(ByVal match As Match)
    Dim nameString As String = match.Groups(2).Value
    Dim paramString As String = match.Groups(3).Value
    paramString = processString(paramString)
    Select Case nameString
        Case "time"
            Return getLocalValueTime(paramString)
        Case "math"
            Return getLocalValueMath(paramString)
    End Select
    Return ""
End Function
Function processVariable(ByVal match As Match)
    Try
        Return moduleDictionary("properties")("vars")(match.Groups(1).Value)
    Catch ex As Exception
    End Try
End Function
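For comparison, the nesting can also be resolved without regular expressions by always expanding the innermost [[...]] first. The following Rust sketch uses that different technique; it is not a translation of the MatchEvaluator code above. The names expand and eval, the variable table, and the stubbed time function are all hypothetical stand-ins for the real script environment.

```rust
use std::collections::HashMap;

/// Evaluates one bracket-free expression: either `name` or `func(arg)`.
fn eval(expr: &str, vars: &HashMap<&str, &str>) -> String {
    if let Some((name, rest)) = expr.split_once('(') {
        let arg = rest.strip_suffix(')').unwrap_or(rest);
        match name {
            // Stub (assumption): a real implementation would format the current time.
            "time" => format!("<time:{}>", if arg.is_empty() { "default" } else { arg }),
            // Unknown functions expand to nothing.
            _ => String::new(),
        }
    } else {
        // Unknown variables expand to nothing.
        vars.get(expr).copied().unwrap_or("").to_string()
    }
}

/// Repeatedly replaces the innermost `[[...]]` until none remain.
fn expand(script: &str, vars: &HashMap<&str, &str>) -> String {
    let mut text = script.to_string();
    // The first "]]" closes an innermost expression;
    // the nearest "[[" before it is the matching opener.
    while let Some(close) = text.find("]]") {
        let Some(open) = text[..close].rfind("[[") else { break };
        let value = eval(&text[open + 2..close], vars);
        text.replace_range(open..close + 2, &value);
    }
    text
}

fn main() {
    let vars = HashMap::from([("name", "Sam"), ("timeFormat", "hh:mm:ss")]);
    let script = "Hello [[name]], the time is [[time([[timeFormat]])]].";
    println!("{}", expand(script, &vars));
}
```

One caveat of this sketch: a substituted value that itself contains [[...]] is expanded again, so a variable whose value refers back to itself would loop forever.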