Nicer way to access match results? - regex

My requirement is to transform some textual message ids. Input is
a.messageid=X0001E
b.messageid=Y0001E
The task is to turn that into
a.messageid=Z00001E
b.messageid=Z00002E
In other words: fetch the first part of each line (like: a.), and append a slightly different id.
My current solution:
val matcherForIds = Regex("(.*)\\.messageid=(X|Y)\\d{4,6}E")
var messageCounter = 5
fun transformIds(line: String): String {
    val result = matcherForIds.matchEntire(line) ?: return line
    return "${result.groupValues.get(1)}.messageid=Z%05dE".format(messageCounter++)
}
This works, but I find the way I get to the first match, "${result.groupValues.get(1)}", not very elegant.
Is there a nicer-to-read or more concise way to access that first match?

You may get the result without a separate function:
val line = s.replace("""^(.*\.messageid=)[XY]\d{4,6}E$""".toRegex()) {
"${it.groupValues[1]}Z%05dE".format(messageCounter++)
}
However, as you need to format the messageCounter into the result, you cannot just use a string replacement pattern and you cannot get rid of ${it.groupValues[1]}.
Also, note:
You may get rid of the double backslashes by using a triple-quoted string literal.
There is no need to add .messageid= to the replacement if you capture that part into Group 1 (see (.*\.messageid=)).
There is no need to capture X or Y since you are not using them later; thus, (X|Y) can be replaced with a more efficient character class, [XY].
The ^ and $ make sure the pattern matches the entire string; otherwise, there will be no match and the string will be returned as is, without any modification.
See the Kotlin demo online.
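For comparison, here is a minimal sketch of the same callback-based replacement in Python rather than Kotlin. The names ID_LINE, transform_ids and renumber are made up for this illustration, and the counter starting at 1 is an assumption chosen to reproduce the Z00001E/Z00002E output from the question.

```python
import re
from itertools import count

# Group 1 keeps everything up to and including ".messageid=".
ID_LINE = re.compile(r"^(.*\.messageid=)[XY]\d{4,6}E$")

def transform_ids(lines, start=1):
    """Renumber matching lines; leave all other lines untouched."""
    counter = count(start)

    def renumber(match):
        # Reuse the captured prefix and append a zero-padded running id.
        return f"{match.group(1)}Z{next(counter):05d}E"

    return [ID_LINE.sub(renumber, line) for line in lines]

print(transform_ids(["a.messageid=X0001E", "b.messageid=Y0001E"]))
```

As in the Kotlin version, the prefix still has to be fetched from the match inside the callback, because the replacement depends on a counter and cannot be written as a static replacement pattern.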

Maybe not really what you are looking for, but maybe it is. What if you first ensure (filter) the lines of interest and just replace what needs to be replaced instead, e.g. use the following transformation function:
val matcherForIds = Regex("(.*)\\.messageid=(X|Y)\\d{4,6}E")
val idRegex = Regex("[XY]\\d{4,6}E")
var idCounter = 5
fun transformIds(line: String) = idRegex.replace(line) {
"Z%05dE".format(idCounter++)
}
with the following filter:
"a.messageid=X0001E\nb.messageid=Y0001E"
.lineSequence()
.filter(matcherForIds::matches)
.map(::transformIds)
.forEach(::println)
In case there are also other strings that are relevant which you want to keep then the following is also possible but not as nice as the solution at the end:
"a.messageid=X0001E\nnot interested line, but required in the output!\nb.messageid=Y0001E"
.lineSequence()
.map {
when {
matcherForIds.matches(it) -> transformIds(it)
else -> it
}
}
.forEach(::println)
Alternatively (now just copying Wiktor's regex, as it already contains all we need: a complete match from the beginning of the line ^ up to the end of the line $, etc.):
val matcherForIds = Regex("""^(.*\.messageid=)[XY]\d{4,6}E$""")
fun transformIds(line: String) = matcherForIds.replace(line) {
"${it.groupValues[1]}Z%05dE".format(idCounter++)
}
This way you ensure that lines that completely match the desired input are replaced and the others are kept but not replaced.

Related

Regex array of named group matches

I would like to get an array of all captured group matches in chronological order (the order they appear in in the input string).
So, for example, with the following regex:
(?P<fooGroup>foo)|(?P<barGroup>bar)
and the following input:
foo bar foo
I would like to get something that resembles the following output:
[("fooGroup", (0,3)), ("barGroup", (4,7)), ("fooGroup", (8,11))]
Is this possible to do without manually sorting all matches?
I don't know what you mean by "without manually sorting all matches," but this Rust code produces the output you want for this particular style of pattern:
use regex::Regex;
fn main() {
let pattern = r"(?P<fooGroup>foo)|(?P<barGroup>bar)";
let haystack = "foo bar foo";
let mut matches: Vec<(String, (usize, usize))> = vec![];
let re = Regex::new(pattern).unwrap();
// We skip the first capture group, which always corresponds
// to the entire pattern and is unnamed. Otherwise, we assume
// every capturing group has a name and corresponds to a single
// alternation in the regex.
let group_names: Vec<&str> =
re.capture_names().skip(1).map(|x| x.unwrap()).collect();
for caps in re.captures_iter(haystack) {
for name in &group_names {
if let Some(m) = caps.name(name) {
matches.push((name.to_string(), (m.start(), m.end())));
}
}
}
println!("{:?}", matches);
}
The only real trick here is to make sure group_names is correct. It's correct for any pattern of the form (?P<name1>re1)|(?P<name2>re2)|...|(?P<nameN>reN) where each reI contains no other capturing groups.
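The same idea can be sketched in Python's re module, purely as an illustration and under the same assumption (one named group per alternative, no nested capturing groups). The function name named_spans is made up for this example; finditer yields matches in the order they occur in the input, and lastgroup reports which named group matched.

```python
import re

PATTERN = re.compile(r"(?P<fooGroup>foo)|(?P<barGroup>bar)")

def named_spans(haystack):
    """Return (group name, (start, end)) for each match, in input order."""
    return [(m.lastgroup, m.span()) for m in PATTERN.finditer(haystack)]

print(named_spans("foo bar foo"))
```

No sorting is needed because the matches are produced left to right.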

regex match string starting at offset

I'm learning Rust and trying to write a simple tokenizer right now. I want to go through a string running each regular expression against the current position in the string, create a token, then skip ahead and repeat until I've processed the whole string. I know I can put them into a larger regex and loop through captures, but I need to process them individually for domain reasons.
However, I see nothing in the regex crate that accepts an offset so I can begin matching again at a specific point.
extern crate regex;
use regex::Regex;
fn main() {
let input = "3 + foo/4";
let ident_re = Regex::new("[a-zA-Z][a-zA-Z0-9]*").unwrap();
let number_re = Regex::new("[1-9][0-9]*").unwrap();
let ops_re = Regex::new(r"[-+*/]").unwrap();
let ws_re = Regex::new(r"[ \t\n\r]*").unwrap();
let mut i: usize = 0;
while i < input.len() {
// Here check each regex to see if a match starting at input[i]
// if so copy the match and increment i by length of match.
}
}
The regexes that I'm currently scanning for will actually vary at runtime too. Sometimes I may only be looking for a few of them, while at other times (at top level) I might be looking for almost all of them.
The regex crate works on string slices. You can always take a sub-slice of another slice and then operate on that one. Instead of moving along indices, you can modify the variable that points to your slice to point to your subslice.
fn main() {
let mut s = "hello";
while !s.is_empty() {
println!("{}", s);
s = &s[1..];
}
}
Note that the slice operation slices at byte positions, not UTF-8 character positions. This allows the slicing operation to be done in O(1) instead of O(n), but it will also cause the program to panic if the indices you are slicing from and to happen to fall in the middle of a multi-byte UTF-8 character.
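To illustrate the "shrink the remaining slice" loop on the tokenizer from the question, here is a rough sketch in Python rather than Rust. The names TOKEN_PATTERNS, WHITESPACE and tokenize are hypothetical, and the number pattern is simplified to [0-9]+. Note two differences from the Rust setting: Python's pattern.match only matches at the start of the string it is given, and Python slices copy the text instead of borrowing it.

```python
import re

# Each token kind is tried on its own, in order, at the current position.
TOKEN_PATTERNS = [
    ("ident", re.compile(r"[a-zA-Z][a-zA-Z0-9]*")),
    ("number", re.compile(r"[0-9]+")),
    ("op", re.compile(r"[-+*/]")),
]
WHITESPACE = re.compile(r"[ \t\n\r]+")

def tokenize(text):
    tokens = []
    rest = text
    while rest:
        ws = WHITESPACE.match(rest)
        if ws:
            rest = rest[ws.end():]  # skip whitespace, keep the remainder
            continue
        for kind, pattern in TOKEN_PATTERNS:
            m = pattern.match(rest)  # anchored at the start of `rest`
            if m:
                tokens.append((kind, m.group()))
                rest = rest[m.end():]  # advance past the token
                break
        else:
            raise ValueError(f"unexpected input at {rest!r}")
    return tokens

print(tokenize("3 + foo/4"))
```

Because the list of patterns is ordinary data, the set of regexes that is tried can be changed at runtime, which is what the question asks for.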

Google sheet : REGEXREPLACE match everything except a particular pattern

I would like to replace everything inside this string:
[JGMORGAN - BANK2] n° 10 NEWYORK, n° 222 CAEN, MONTELLIER, VANNES / TARARTA TIs
1303222074, 1403281851 & 1307239335 et Cloture TIs 1403277567,
1410315029
Except the following numbers :
1303222074
1403281851
1307239335
1403277567
1410315029
I have built a REGEX to match them :
1[0-9]{9}
But I have not figured out how to do the opposite, that is, match everything except those numbers ...
Google Sheets uses the RE2 regex engine, which doesn't support many useful features that could help you do that. So a basic workaround can help you:
match what you want to preserve first and capture it:
pattern: [0-9]*(?:[0-9]{0,9}[^0-9]+)*(?:([0-9]{9,})|[0-9]*\z)
replacement: $1 (with a space after)
So probably something like this:
=TRIM(REGEXREPLACE("[JGMORGAN - BANK2] n° 10 NEWYORK, n° 222 CAEN, MONTELLIER, VANNES / TARARTA TIs 1303222074, 1403281851 & 1307239335 et Cloture TIs 1403277567, 1410315029"; "[0-9]*(?:[0-9]{0,9}[^0-9]+)*(?:([0-9]{9,})|[0-9]*\z)"; "$1 "))
You can also do this with dynamic native functions:
=REGEXEXTRACT(A1,rept("(\d{10}).*",counta(split(regexreplace(A1,"\d{10}","#"),"#"))-1))
Basically, it first splits by the desired string to figure out how many occurrences there are, then repeats the regex to dynamically create that number of capture groups, leaving you in the end with only those values.
First of all, thank you Casimir for your help. It gave me an idea that will not be possible with built-in functions and a strong regex lol.
I found out that I can make a homemade function for my own purposes (yes, I'm not very "up to date").
It's not very well coded and it returns duplicates. But rather than fixing it properly, I use the built-in UNIQUE() function on top of it to get rid of them; it's ugly and I'm lazy, but it does the job, that is, it returns a list of all matches of one specific regex (which is: 1[0-9]{9}). Here it is:
function ti_extract(input) {
var tab_tis = new Array();
var tab_strings = new Array();
tab_tis.push(input.match(/1[0-9]{9}/)); // get the TI and insert in tab_tis
var string_modif = input.replace(tab_tis[0], " "); // modify source string (remove everything except the TI)
tab_strings.push(string_modif); // insert this new string in the table
var v = 0;
var patt = new RegExp(/1[0-9]{9}/);
var fin = patt.test(tab_strings[v]);
var first_string = tab_strings[v];
do {
first_string = tab_strings[v]; // string 0, or the string with the first removed TI
tab_tis.push(first_string.match(/1[0-9]{9}/)); // analyze the string and get the new TI to put it in the table
var string_modif2 = first_string.replace(tab_tis[v], " "); // modify the string again to remove the new TI from the old string
tab_strings.push(string_modif2);
v += 1;
}
while(v < 15)
return tab_tis;
}
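For reference, the underlying task (collect every match of 1[0-9]{9}, without duplicates) can be sketched in a few lines of Python. This is not Apps Script and not a drop-in replacement for the function above; the names TI_PATTERN and extract_tis are made up for the illustration.

```python
import re

TI_PATTERN = re.compile(r"1[0-9]{9}")

def extract_tis(text):
    """Return every 10-digit id starting with 1, in order, without duplicates."""
    # dict.fromkeys keeps first-seen order while dropping repeats.
    return list(dict.fromkeys(TI_PATTERN.findall(text)))

sample = ("TIs 1303222074, 1403281851 & 1307239335 "
          "et Cloture TIs 1403277567, 1410315029")
print(extract_tis(sample))
```

Finding all matches in one call avoids the fixed 15-iteration loop and the separate de-duplication step.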

regex how can I split this word?

I have a list of several phrases in the following format
thisIsAnExampleSentance
hereIsAnotherExampleWithMoreWordsInIt
and I'm trying to end up with
This Is An Example Sentance
Here Is Another Example With More Words In It
Each phrase has the white space condensed and the first letter is forced to lowercase.
Can I use regex to add a space before each A-Z and have the first letter of the phrase be capitalized?
I thought of doing something like
([a-z]+)([A-Z])([a-z]+)([A-Z])([a-z]+) // etc
$1 $2$3 $4$5 // etc
but on 50 records of varying length, my idea is a poor solution. Is there a way to regex in a way that will be more dynamic? Thanks
A Java fragment I use looks like this (now revised):
result = source.replaceAll("(?<=^|[a-z])([A-Z])|([A-Z])(?=[a-z])", " $1$2");
result = result.substring(0, 1).toUpperCase() + result.substring(1);
This, by the way, converts the string givenProductUPCSymbol into Given Product UPC Symbol, so make sure this is fine with the way you use this type of thing.
Finally, a single-line version could be:
result = source.substring(0, 1).toUpperCase() + source.substring(1).replaceAll("(?<=^|[a-z])([A-Z])|([A-Z])(?=[a-z])", " $1$2");
Also, in an Example similar to one given in the question comments, the string hiMyNameIsBobAndIWantAPuppy will be changed to Hi My Name Is Bob And I Want A Puppy
For the space problem, it's easy if your language supports zero-width look-behind:
var result = Regex.Replace(@"thisIsAnExampleSentanceHereIsAnotherExampleWithMoreWordsInIt", "(?<=[a-z])([A-Z])", " $1");
or even if it doesn't support them
var result2 = Regex.Replace(@"thisIsAnExampleSentanceHereIsAnotherExampleWithMoreWordsInIt", "([a-z])([A-Z])", "$1 $2");
I'm using C#, but the regexes should be usable in any language that supports replacement using $1...$n.
But for the lower-to-upper case conversion, you can't do it directly in the regex. You can get the first character through a regex like ^[a-z], but you can't convert it.
For example in C# you could do
var result3 = Regex.Replace(result, "^([a-z])", m =>
{
return m.ToString().ToUpperInvariant();
});
using a match evaluator to change the input string.
You could then even fuse the two together
var result4 = Regex.Replace(@"thisIsAnExampleSentanceHereIsAnotherExampleWithMoreWordsInIt", "^([a-z])|([a-z])([A-Z])", m =>
{
if (m.Groups[1].Success)
{
return m.ToString().ToUpperInvariant();
}
else
{
return m.Groups[2].ToString() + " " + m.Groups[3].ToString();
}
});
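The single-pass callback approach can also be sketched in Python. This is an illustrative variant, not a translation of the C# above: the function name split_camel_case is made up, and the pattern puts a space before every capital that is not at the start of the string, so single-letter words such as I and A are separated as well.

```python
import re

def split_camel_case(phrase):
    """Capitalize the first letter and put a space before each later capital."""
    def fix(match):
        if match.start() == 0:
            # Leading lowercase letter: upper-case it.
            return match.group().upper()
        # Any later capital letter: prefix it with a space.
        return " " + match.group()

    return re.sub(r"^[a-z]|(?<!^)[A-Z]", fix, phrase)

print(split_camel_case("thisIsAnExampleSentance"))
```

One trade-off to be aware of: because every capital gets its own space, an acronym such as UPC would come out as U P C, unlike the Java pattern earlier in this thread.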
A Perl example with unicode character support:
s/\p{Lu}/ $&/g;
s/^./\U$&/;

VB.Net Matching and replacing the contents of multiple overlapping sets of brackets in a string

I am using vb.net to parse my own basic scripting language, sample below. I am a bit stuck trying to deal with the 2 separate types of nested brackets.
Assuming name = Sam
Assuming timeFormat = hh:mm:ss
Assuming time() is a function that takes a format string but
has a default value and returns a string.
Hello [[name]], the time is [[time(hh:mm:ss)]].
Result: Hello Sam, the time is 19:54:32.
The full time is [[time()]].
Result: The full time is 05/06/2011 19:54:32.
The time in the format of your choice is [[time([[timeFormat]])]].
Result: The time in the format of your choice is 19:54:32.
I could in theory change the syntax of the script completely but I would rather not. It is designed like this to enable strings without quotes because it will be included in an XML file and quotes in that context were getting messy and very prone to errors and readability issues. If this fails I could redesign using something other than quotes to mark out strings but I would rather use this method.
Preferably, unless there is some other way I am not aware of, I would like to do this using regex. I am aware that the standard regex is not really capable of this but I believe this is possible using MatchEvaluators in vb.net and some form of recursion based replacing. However I have not been able to get my head around it for the last day or so, possibly because it is hugely difficult, possibly because I am ill, or possibly because I am plain thick.
I do have the following regex for parts of it.
Detecting the parentheses: (\w*?)\((.*?)\)(?=[^\(+\)]*(\(|$))
Detecting the square brackets: \[\[(.*?)\]\](?=[^\[+\]]*(\[\[|$))
I would really appreciate some help with this as it is holding the rest of my project back at the moment. And sorry if I have babbled on too much or not put enough detail, this is my first question on here.
Here's a little sample which might help you iterate through several matches/groups/captures. I realize that I am posting C# code, but it would be easy for you to convert that into VB.Net
//these two may be passed in as parameters:
string tosearch;//the string you are searching through
string regex;//your pattern to match
//...
Match m;
CaptureCollection cc;
GroupCollection gc;
Regex r = new Regex(regex, RegexOptions.IgnoreCase);
m = r.Match(tosearch);
gc = m.Groups;
Debug.WriteLine("Number of groups found = " + gc.Count.ToString());
// Loop through each group.
for (int i = 0; i < gc.Count; i++)
{
cc = gc[i].Captures;
int counter = cc.Count;
int grpnum = i + 1;
Debug.WriteLine("Scanning group: " + grpnum.ToString() );
// Print number of captures in this group.
Debug.WriteLine(" Captures count = " + counter.ToString());
if (cc.Count >= 1)
{
foreach (Capture cap in cc)
{
Debug.WriteLine(string.Format(" Capture found: {0}", cap.ToString()));
}
}
}
Here is a slightly simplified version of the code I wrote for this. Thanks for the help everyone and sorry I forgot to post this before. If you have any questions or anything feel free to ask.
Function processString(ByVal scriptString As String)
' Functions
Dim pattern As String = "\[\[((\w+?)\((.*?)\))(?=[^\(+\)]*(\(|$))\]\]"
scriptString = Regex.Replace(scriptString, pattern, New MatchEvaluator(Function(match) processFunction(match)))
' Variables
pattern = "\[\[([A-Za-z0-9+_]+)\]\]"
scriptString = Regex.Replace(scriptString, pattern, New MatchEvaluator(Function(match) processVariable(match)))
Return scriptString
End Function
Function processFunction(ByVal match As Match)
Dim nameString As String = match.Groups(2).Value
Dim paramString As String = match.Groups(3).Value
paramString = processString(paramString)
Select Case nameString
Case "time"
Return getLocalValueTime(paramString)
Case "math"
Return getLocalValueMath(paramString)
End Select
Return ""
End Function
Function processVariable(ByVal match As Match)
Try
Return moduleDictionary("properties")("vars")(match.Groups(1).Value)
Catch ex As Exception
End Try
End Function
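The recursive structure of this solution (replace function calls first, processing their argument string recursively, then replace plain variables) can be sketched in Python as follows. Everything here is an assumption made for illustration: the VARIABLES and FUNCTIONS tables, the fixed timestamp, the simplified patterns, and the use of strftime-style formats (%H:%M:%S) instead of the hh:mm:ss syntax from the question.

```python
import re
from datetime import datetime

# Hypothetical script state; a real interpreter would supply these.
VARIABLES = {"name": "Sam", "timeFormat": "%H:%M:%S"}
FIXED_NOW = datetime(2011, 6, 5, 19, 54, 32)  # fixed so output is repeatable

def script_time(fmt):
    # An empty argument falls back to a default full format.
    return FIXED_NOW.strftime(fmt or "%d/%m/%Y %H:%M:%S")

FUNCTIONS = {"time": script_time}

FUNCTION_RE = re.compile(r"\[\[(\w+)\((.*?)\)\]\]")
VARIABLE_RE = re.compile(r"\[\[(\w+)\]\]")

def process_string(script):
    def run_function(match):
        func = FUNCTIONS.get(match.group(1))
        if func is None:
            return ""
        # Resolve anything nested inside the argument before calling.
        return func(process_string(match.group(2)))

    def run_variable(match):
        return VARIABLES.get(match.group(1), "")

    script = FUNCTION_RE.sub(run_function, script)
    return VARIABLE_RE.sub(run_variable, script)

print(process_string("Hello [[name]], the time is [[time([[timeFormat]])]]."))
```

The lazy (.*?) followed by \)\]\] is what lets the argument contain a nested [[...]] variable. A function call nested inside another function call's arguments would need a real parser or a balanced-group regex instead.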