Regex: Capture any (dynamic) number of lines

I've been trying to match the following:
First Group:Line1,
Line2,
..
LineX
Second Group:Some_Sample_text
With this query:
First Group:(?<first_group>.+\n*\n)Second Group:(?<second_group>.*)
My main goal is to capture any amount of lines between Line1 and LineX (because I can't anticipate how many there'll be), but since there's no option to match the end of files I'll probably need to use the "\n" tokens. I've also tried with IF and THEN statements but I just can't get it to work.
Any ideas appreciated.

Here, we might want to design an expression that'd just pass newlines, such as
First Group:([\s\S]*)Second Group:(.*)
First Group:([\d\D]*)Second Group:(.*)
First Group:([\w\W]*)Second Group:(.*)
Demo 1
and, if the second group could also span multiple lines, we'd expand it to:
First Group:([\s\S]*)Second Group:([\s\S]*)
First Group:([\d\D]*)Second Group:([\d\D]*)
First Group:([\w\W]*)Second Group:([\w\W]*)
Demo 2
Advice
The fourth bird advises that:
You could make the character class non-greedy to prevent overmatching: ([\s\S]*?)
With that change, the expression would become,
First Group:([\s\S]*?)Second Group:([\s\S]*)
for instance.
Demo 3
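As a rough illustration in Python (not part of the original answer), the non-greedy version can be tried on the sample from the question. The literal \n placed before "Second Group:" is an assumption added for this sketch, only so that the captured block does not end with a trailing newline.

```python
import re

# Sample input copied from the question.
text = "First Group:Line1,\nLine2,\n..\nLineX\nSecond Group:Some_Sample_text"

# [\s\S]*? matches any character, newlines included, as few times as possible.
# The literal \n before "Second Group:" is an assumption made for this sketch.
pattern = re.compile(
    r"First Group:(?P<first_group>[\s\S]*?)\nSecond Group:(?P<second_group>.*)"
)

match = pattern.search(text)
first_lines = match.group("first_group").split("\n")
second_group = match.group("second_group")

print(first_lines)
print(second_group)
```

Splitting the first group on "\n" afterwards gives one entry per line, however many lines there were.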

Related

Regex Multiple rows [duplicate]

I'm trying to get the list of all digits preceding a hyphen in a given string (let's say in cell A1), using a Google Sheets regex formula :
=REGEXEXTRACT(A1, "\d-")
My problem is that it only returns the first match... how can I get all matches?
Example text:
"A1-Nutrition;A2-ActPhysiq;A2-BioMeta;A2-Patho-jour;A2-StgMrktg2;H2-Bioth2/EtudeCas;H2-Bioth2/Gemmo;H2-Bioth2/Oligo;H2-Bioth2/Opo;H2-Bioth2/Organo;H3-Endocrino;H3-Génétiq"
My formula returns 1-, whereas I want to get 1-2-2-2-2-2-2-2-2-2-3-3- (either as an array or concatenated text).
I know I could use a script or another function (like SPLIT) to achieve the desired result, but what I really want to know is how I could get a re2 regular expression to return such multiple matches in a "REGEX.*" Google Sheets formula.
Something like the "global - Don't return after first match" option on regex101.com
I've also tried removing the undesired text with REGEXREPLACE, with no success either (I couldn't get rid of other digits not preceding a hyphen).
Any help appreciated!
Thanks :)
You can actually do this in a single formula using regexreplace to surround all the values with a capture group instead of replacing the text:
=join("",REGEXEXTRACT(A1,REGEXREPLACE(A1,"(\d-)","($1)")))
Basically, the inner REGEXREPLACE surrounds every instance of \d- with a capture group, and REGEXEXTRACT then neatly returns all the captures. JOIN packs them back into a single cell; leave it out if you want the matches as an array.
You may create your own custom function in the Script Editor:
function ExtractAllRegex(input, pattern,groupId) {
return [Array.from(input.matchAll(new RegExp(pattern,'g')), x=>x[groupId])];
}
Or, if you need to return all matches in a single cell joined with some separator:
function ExtractAllRegex(input, pattern,groupId,separator) {
return Array.from(input.matchAll(new RegExp(pattern,'g')), x=>x[groupId]).join(separator);
}
Then, just call it like =ExtractAllRegex(A1, "\d-", 0, ", ").
Description:
input - current cell value
pattern - regex pattern
groupId - Capturing group ID you want to extract
separator - text used to join the matched results.
Edit
I came up with a more general solution:
=regexreplace(A1,"(.)?(\d-)|(.)","$2")
It replaces any text except the second group match (\d-) with just the second group $2.
"(.)?(\d-)|(.)"
1 2 3
Groups are in ()
---------------------------------------
"$2" -- means return the group number 2
Learn regular expressions: https://regexone.com
Try this formula:
=regexreplace(regexreplace(A1,"[^\-0-9]",""),"(\d-)|(.)","$1")
It will handle string like this:
"A1-Nutrition;A2-ActPhysiq;A2-BioM---eta;A2-PH3-Généti***566*9q"
with output:
1-2-2-2-3-
I wasn't able to get the accepted answer to work for my case. I'd like to do it that way, but needed a quick solution and went with the following:
Input:
1111 days, 123 hours 1234 minutes and 121 seconds
Expected output:
1111 123 1234 121
Formula:
=split(REGEXREPLACE(C26,"[a-z,]"," ")," ")
The shortest possible regex:
=regexreplace(A1,".?(\d-)|.", "$1")
Which returns 1-2-2-2-2-2-2-2-2-2-3-3- for "A1-Nutrition;A2-ActPhysiq;A2-BioMeta;A2-Patho-jour;A2-StgMrktg2;H2-Bioth2/EtudeCas;H2-Bioth2/Gemmo;H2-Bioth2/Oligo;H2-Bioth2/Opo;H2-Bioth2/Organo;H3-Endocrino;H3-Génétiq".
Explanation of regex:
.? -- optional character
(\d-) -- capture group 1 with a digit followed by a dash (use (\d+-) to allow multiple digits)
| -- logical or
. -- any character
the replacement "$1" uses just the capture group 1, and discards anything else
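For anyone who wants to check the pattern outside Sheets, here is a small Python sketch of the same idea. It is only an illustration: Python's re module is not RE2, and re.findall is shown as a contrast because Python does have a "global" match call.

```python
import re

# Example text copied from the question.
text = (
    "A1-Nutrition;A2-ActPhysiq;A2-BioMeta;A2-Patho-jour;A2-StgMrktg2;"
    "H2-Bioth2/EtudeCas;H2-Bioth2/Gemmo;H2-Bioth2/Oligo;H2-Bioth2/Opo;"
    "H2-Bioth2/Organo;H3-Endocrino;H3-Génétiq"
)

# Same idea as the formula: keep capture group 1, drop every other character.
kept = re.sub(r".?(\d-)|.", r"\1", text)

# Engines with a "find all" call can collect the matches directly.
found = re.findall(r"\d-", text)

print(kept)
print(found)
```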
Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex
This seems to work and I have tried to verify it.
The logic is
(1) Replace letter followed by hyphen with nothing
(2) Replace any digit not followed by a hyphen with nothing
(3) Replace everything which is not a digit or hyphen with nothing
=regexreplace(A1,"[a-zA-Z]-|[0-9][^-]|[a-zA-Z;/é]","")
Result
1-2-2-2-2-2-2-2-2-2-3-3-
Analysis
I had to step through these procedurally to convince myself that this was correct. According to this reference when there are alternatives separated by the pipe symbol, regex should match them in order left-to-right. The above formula doesn't work properly unless rule 1 comes first (otherwise it reduces all characters except a digit or hyphen to null before rule (1) can come into play and you get an extra hyphen from "Patho-jour").
Here are some examples of how I think it must deal with the text
The solution that wraps the matches in capture groups with REGEXREPLACE and then runs REGEXEXTRACT works here too, but there is a catch.
=join("",REGEXEXTRACT(A1,REGEXREPLACE(A1,"(\d-)","($1)")))
If the cell you are extracting from contains special regex characters, such as a parenthesis "(" or a question mark "?", this solution won't work.
In my case, I was trying to list all the "variable texts" contained in the cell, which were written like this: "{example_name}". But the full content of the cell had special characters that made the regex formula break. Once I removed those special characters, I could list all the captured groups as the solution describes.
There are two general ('Excel' / 'native' / non-Apps Script) solutions to return an array of regex matches in the style of REGEXEXTRACT:
Method 1)
insert a delimiter around matches, remove junk, and call SPLIT
Regexes work by iterating over the string from left to right, and 'consuming'. If we are careful to consume junk values, we can throw them away.
(This gets around the problem faced by the currently accepted solution, which is that as Carlos Eduardo Oliveira mentions, it will obviously fail if the corpus text contains special regex characters.)
First we pick a delimiter, which must not already exist in the text. The proper way to do this is to parse the text to temporarily replace our delimiter with a "temporary delimiter", like if we were going to use commas "," we'd first replace all existing commas with something like "<<QUOTED-COMMA>>" then un-replace them later. BUT, for simplicity's sake, we'll just grab a random character from the private-use Unicode blocks and use it as our special delimiter (note that it is 2 bytes... Google Sheets might not count bytes in graphemes in a consistent way, but we'll be careful later).
=SPLIT(
LAMBDA(temp,
MID(temp, 1, LEN(temp)-LEN(""))
)(
REGEXREPLACE(
"xyzSixSpaces:[ ]123ThreeSpaces:[ ]aaaa 12345",".*?( |$)",
"$1"
)
),
""
)
We just use a lambda to define temp="match1<delim>match2<delim>match3<delim>" (writing <delim> for the delimiter character), then use that to remove the last delimiter, giving "match1<delim>match2<delim>match3", then SPLIT it.
Taking COLUMNS of the result will prove that the correct result is returned, i.e. {" ", " ", " "}.
This is a particularly good function to turn into a Named Function, and call it something like REGEXGLOBALEXTRACT(text,regex) or REGEXALLEXTRACT(text,regex), e.g.:
=SPLIT(
LAMBDA(temp,
MID(temp, 1, LEN(temp)-LEN(""))
)(
REGEXREPLACE(
text,
".*?("&regex&"|$)",
"$1"
)
),
""
)
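The delimiter trick in Method 1 can be sketched in Python to see the moving parts. This is an illustration under assumptions, not the Sheets formula itself: the delimiter U+E000 and the name regex_all_extract are chosen here, and Python's re module handles empty matches at the end of the text differently from RE2, so the sketch filters out empty pieces instead of trimming one trailing delimiter.

```python
import re

# Assumption for this sketch: a private-use code point that never occurs in the text.
DELIM = "\ue000"


def regex_all_extract(text, regex):
    """Return every match of `regex` in `text` using the replace-then-split trick."""
    # Lazily consume junk up to the next match (or the end of the text),
    # keep only the match, and append the delimiter after it.
    marked = re.sub(
        r".*?(" + regex + r"|$)",
        lambda m: m.group(1) + DELIM,
        text,
        flags=re.DOTALL,
    )
    # Pieces are empty wherever only junk was consumed, so drop them.
    return [piece for piece in marked.split(DELIM) if piece]


print(regex_all_extract("A1-Nutrition;A2-ActPhysiq;H3-Endocrino", r"\d-"))
```

In Python itself re.findall would be the direct route; the point here is only to show why consuming the junk inside the same pattern avoids building a regex out of the cell text.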
Method 2)
use recursion
With LAMBDA (which lets you define a function, as in any other programming language), you can use some tricks from the well-studied lambda calculus and functional programming: you have access to recursion. Defining a recursive function is confusing because there's no easy way for it to refer to itself, so you have to use a trick/convention:
trick for recursive functions: to actually define a function f which needs to refer to itself, instead define a function that takes a parameter of itself and returns the function you actually want; pass in this 'convention' to the Y-combinator to turn it into an actual recursive function
The plumbing which makes such a function work is called the Y-combinator. Here is a good article to understand it if you have some programming background.
For example to get the result of 5! (5 factorial, i.e. implement our own FACT(5)), we could define:
Named Function Y(f)=LAMBDA(f, (LAMBDA(x,x(x)))( LAMBDA(x, f(LAMBDA(y, x(x)(y)))) ) ) (this is the Y-combinator and is magic; you don't have to understand it to use it)
Named Function MY_FACTORIAL(n)=
Y(LAMBDA(self,
LAMBDA(n,
IF(n=0, 1, n*self(n-1))
)
))
result of MY_FACTORIAL(5): 120
The Y-combinator makes writing recursive functions look relatively easy, like an introduction to programming class. I'm using Named Functions for clarity, but you could just dump it all together at the expense of sanity...
=LAMBDA(Y,
Y(LAMBDA(self, LAMBDA(n, IF(n=0,1,n*self(n-1))) ))(5)
)(
LAMBDA(f, (LAMBDA(x,x(x)))( LAMBDA(x, f(LAMBDA(y, x(x)(y)))) ) )
)
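If the Sheets syntax gets in the way, the same combinator can be written in Python, where the shape is easier to see. This is just a transliteration of the formula above for illustration; the names Y and my_factorial mirror the Named Functions.

```python
# Transliteration of the combinator above: f receives "self" and returns
# the function we actually want.
Y = lambda f: (lambda x: x(x))(lambda x: f(lambda y: x(x)(y)))

# The factorial is written as if "self" were already the finished recursive function.
my_factorial = Y(lambda self: lambda n: 1 if n == 0 else n * self(n - 1))

print(my_factorial(5))
```

The inner lambda y: x(x)(y) delays the self-application until the function is actually called, which is what keeps an eagerly evaluated language from looping forever.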
How does this apply to the problem at hand? Well a recursive solution is as follows:
in pseudocode below, I use 'function' instead of LAMBDA, but it's the same thing:
// code to get around the fact that you can't have 0-length arrays
function emptyList() {
return {"ignore this value"}
}
function listToArray(myList) {
return OFFSET(myList,0,1)
}
function allMatches(text, regex) {
return allMatchesHelper(emptyList(), text, regex)
}
function allMatchesHelper(resultsToReturn, text, regex) {
currentMatch = REGEXEXTRACT(...)
if (currentMatch succeeds) {
textWithoutMatch = SUBSTITUTE(text, currentMatch, "", 1)
return allMatchesHelper(
{resultsToReturn,currentMatch},
textWithoutMatch,
regex
)
} else {
return listToArray(resultsToReturn)
}
}
Unfortunately, the recursive approach is quadratic order of growth (because it's appending the results over and over to itself, while recreating the giant search string with smaller and smaller bites taken out of it, so 1+2+3+4+5+... = big^2, which can add up to a lot of time), so may be slow if you have many many matches. It's better to stay inside the regex engine for speed, since it's probably highly optimized.
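Here is a runnable Python sketch of the recursive idea in the pseudocode, for illustration only. The function name all_matches and the tuple accumulator are choices made for this sketch; Python has empty lists, so the "ignore this value" workaround is not needed. Like the pseudocode, it deletes each match from the text before searching again, so it shares the same quadratic behaviour, and deleting a match can in principle join its neighbours into a new match.

```python
import re


def all_matches(text, regex, results=()):
    """Collect matches by taking the first one, removing it, and recursing."""
    current = re.search(regex, text)
    # Stop when nothing matches; also stop on an empty match to avoid looping forever.
    if current is None or current.group(0) == "":
        return list(results)
    # Same as SUBSTITUTE(text, currentMatch, "", 1): drop the first occurrence.
    text_without_match = text.replace(current.group(0), "", 1)
    return all_matches(text_without_match, regex, results + (current.group(0),))


print(all_matches("A1-Nutrition;A2-ActPhysiq;H3-Endocrino", r"\d-"))
```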
You could of course avoid using Named Functions by doing temporary bindings with LAMBDA(varName, expr)(varValue) if you want to use varName in an expression. (You can define this pattern as a Named Function =cont(varValue) to invert the order of the parameters to keep code cleaner, or not.)
Whenever I use varName = varValue, write that instead.
to see if a match succeeds, use ISNA(...)
It would look something like:
Named Function allMatches(resultsToReturn, text, regex):
UNTESTED:
LAMBDA(helper,
OFFSET(
helper({"ignore"}, text, regex),
0,1)
)(
Y(LAMBDA(helperItself,
LAMBDA(results, partialText,
LAMBDA(currentMatch,
IF(ISNA(currentMatch),
results,
LAMBDA(textWithoutMatch,
helperItself({results,currentMatch}, textWithoutMatch)
)(
SUBSTITUTE(partialText, currentMatch, "", 1)
)
)
)(
REGEXEXTRACT(partialText, regex)
)
)
))
)

Modify position in a line if Regular Expression found

I need to modify position 10 of every line that contains the word 'Example' (I can't use the actual data here) by inserting the string '(ID) ' there. The line doesn't necessarily have to begin with 9 digits; the string just needs to be added at position 10.
For example, this line should be modified like this:
ORIGINAL: 123456789This line is being used as an Example
SOLUTION: 123456789(ID) This line is being used as an Example
So far I have this, to find the Example and copy the rest of the line as to not lose the text:
Find: (.*)Example
Bonus points if it works for two different words 'Example1' and 'Example2' in different sentences, the 'and also' part of this example would change in every line.
ORIGINAL: 123456789This line is being used as an Example1 and also Example2
SOLUTION: 123456789(ID) This line is being used as an Example1 and also Example2
This would have this search:
Find: (.*)Example1(.*)Example2
Thank you
You could try:
Find: (\d{9})(?=.*\bExample1\b.*\bExample2\b)
Replace: $1(ID) 
^^^ single space after (ID)
Demo
The regex pattern used matches and captures a 9 digit number (you may adjust to any width, or range of widths, which you want). It also uses a positive lookahead to assert that Example1 and Example2 in fact occur later in the same line:
(?=.*\bExample1\b.*\bExample2\b)
This is how you add characters at a certain position. I accepted Tim's answer because it's very similar and helped me figure it out:
^(\S{9})(?=.*\bExample1\b.*\bExample2\b)
As you can see, I only added '^', so the position is counted from the start of the line, and used 'S' instead of 'd', so it counts any non-whitespace characters instead of only digits. This should work for any line whose first nine characters contain no whitespace.
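To try the pattern outside an editor, here is a small Python sketch. The function name add_id is made up for the example, and Python writes the backreference as \1 where the Find/Replace dialog above uses $1.

```python
import re

# Nine non-whitespace characters at the start of a line, but only if
# Example1 and Example2 both occur later on that same line.
PATTERN = re.compile(r"^(\S{9})(?=.*\bExample1\b.*\bExample2\b)", re.MULTILINE)


def add_id(text):
    """Insert '(ID) ' at position 10 of every qualifying line."""
    return PATTERN.sub(r"\1(ID) ", text)


sample = (
    "123456789This line is being used as an Example1 and also Example2\n"
    "123456789This line does not qualify"
)
print(add_id(sample))
```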

Regular Expression to match groups that may not exist

I'm trying to capture some data from logs in an application. The logs look like so:
*junk* [{count=240.0, state=STATE1}, {count=1.0, state=STATE2}, {count=93.0, state=STATE3}, {count=1.0, state=STATE4}, {count=1147.0, state=STATE5}, etc. ] *junk*
If the count for a particular state is ever 0, it actually won't be in the log at all, so I can't guarantee the ordering of the objects in the log (The only ordering is that they are sorted alphabetically by state name)
So, this is also a potential log:
*junk* [{count=240.0, state=STATE1}, {count=1.0, state=STATE4}, {count=1147.0, state=STATE5}, etc. ] *junk*
I'm somewhat new to using regular expressions, and I think I'm overdoing it, but this is what I've tried.
^[^=\n]*=(?:(?P<STATE1>\d+)(?=\.0,\s+\w+=STATE1))*.*?=(?P<STATE2>\d+)(?=\.0,\s+\w+=STATE2)*.*?=(?P<STATE3>\d+)(?=\.0,\s+\w+=STATE3)
The idea is that I'll look for the '=' and then look ahead to see if this is for the state that I want, which may or may not be there. Then I skip all the junk after the count until the next state that I'm interested in (this is the part that I believe I'm having issues with). Sometimes it matches too far and skips the state I'm interested in, giving me a bad value. If I use the lazy operator (as above), sometimes it doesn't go far enough and gets the count for a state that comes before the one I want in the log.
See if this approach works for you:
Regex: (?<=count=)\d+(?:\.\d+)?(?=, state=(STATE\d+))
Demo
The group will be your State# and Full match will be the count value
You might use 2 capturing groups to capture the count and the state.
To capture for example STATE1, STATE2, STATE3 and STATE5, you could specify the numbers using a character class with ranges and / or an alternation.
{count=(\d+(?:\.\d+)?), state=(STATE(?:[123]|5))}
Explanation
{count= Match literally
( Capture group 1
\d+(?:\.\d+)? Match 1+ digits with an optional decimal part
) Close group
, state= Match literally
( Capture group 2
STATE(?:[123]|5) Match STATE and specify the allowed numbers
)} Close group and match }
Regex demo
If you want to match all states and digits:
{count=(\d+(?:\.\d+)?), state=(STATE\d+)}
Regex demo
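One way to use the two-group pattern from a program, sketched in Python: collect every (count, state) pair first, then look up the states you care about and treat a missing one as 0, since the question says a state with a count of 0 is simply absent from the log. The variable names are made up for the example, and the braces are escaped here to keep them unambiguous.

```python
import re

# Log line from the question, with the "etc." placeholder removed.
log = (
    "*junk* [{count=240.0, state=STATE1}, {count=1.0, state=STATE4}, "
    "{count=1147.0, state=STATE5}] *junk*"
)

# Group 1 is the count, group 2 is the state name.
PAIR = re.compile(r"\{count=(\d+(?:\.\d+)?), state=(STATE\d+)\}")

counts = {state: float(count) for count, state in PAIR.findall(log)}

# States that do not appear in the log are reported as 0.
wanted = ["STATE1", "STATE2", "STATE3"]
result = {state: counts.get(state, 0.0) for state in wanted}
print(result)
```

Doing the lookup after the match, instead of inside one long regex, means the order of the states and the missing ones stop mattering.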
After some experimentation, this is what I've come up with:
The answers provided here, although good answers, don't quite work if your state names don't end with a number (mine don't, I just changed them to make the question easier to read and to remove business information from the question).
Here's a completely tile-able regex where you can add on as many matches as needed
count=(?P<GROUP_NAME_HERE>\d+(?=\.0, state=STATE_NAME_HERE))?
This can be copied and appended with the new state name and group name.
Additionally, if any of the states do not appear in the string, it will still match the following states. For example:
count=(?P<G1>\d+(?=\.0, state=STATE_ONE))?(?P<G2>\d+(?=\.0, state=STATE_TWO))?(?P<G3>\d+(?=\.0, state=STATE_THREE))?
will match states STATE_ONE and STATE_THREE with named groups G1 & G3 in the following string even though STATE_TWO is missing:
[{count=55.0, state=STATE_ONE}, {count=10.0, state=STATE_THREE}]
I'm sure this could be improved, but it's fast enough for me, and with 11 groups, regex101 shows 803 steps with a time of ~1ms
Here's a regex101 playground to mess with: https://regex101.com/r/3a3iQf/1
Notice how groups 1,2,3,4,5,6,7,9, & 11 match. 8 & 10 are missing and the following groups still match.
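A short Python sketch of how the tiled pattern behaves, for illustration. Each "count=" in the log produces its own match, and in each match at most one of the optional groups is filled, so the named groups have to be merged across matches. The dictionary name found is an assumption of this sketch.

```python
import re

# The three-state example pattern from above, split over lines for readability.
TILED = re.compile(
    r"count="
    r"(?P<G1>\d+(?=\.0, state=STATE_ONE))?"
    r"(?P<G2>\d+(?=\.0, state=STATE_TWO))?"
    r"(?P<G3>\d+(?=\.0, state=STATE_THREE))?"
)

log = "[{count=55.0, state=STATE_ONE}, {count=10.0, state=STATE_THREE}]"

# Merge the groups that actually participated in each match.
found = {}
for match in TILED.finditer(log):
    for name, value in match.groupdict().items():
        if value is not None:
            found[name] = value

print(found)
```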

Complex regex situation

I have a results list that looks like this:
1lemon_king9mumu (2-1), YearofHell (2-0), kriswithak (2-1)0.44440.75000.4444
2mumu6lemon_king (1-2), MogwaiAC (2-0), Dathanja (2-1)0.66670.62500.5655
3MogwaiAC6Dathanja (2-0), mumu (0-2), Jebnarf (2-1)0.55560.57140.5417
4Jebnarf6YearofHell (2-1), kriswithak (2-0), MogwaiAC (1-2)0.44440.62500.4266
5YearofHell3Jebnarf (1-2), lemon_king (0-2), Mig82 (2-1)0.66670.37500.6012
6Dathanja3MogwaiAC (0-2), Mig82 (2-1), mumu (1-2)0.55560.37500.5417
7Mig823Bye, Dathanja (1-2), YearofHell (1-2)0.33330.42860.3750
8kriswithak0Jebnarf (0-2), lemon_king (1-2)0.83330.20000.6875
I want to be able to pull the username of the person AFTER the rank (first number) but it is mashed together with points gained by the player, as well as their first opponent.
For example, the first person's name is "Lemon_king", and his opponents were "Mumu", "YearofHell" and "Kriswithak". The numbers on the right are irrelevant for me, but the major problem I have is that the number of points won by the player is in there too. Lemon_King wins 9 points for first place. I would normally just get the name by looking for the string between 1 and 9, but players' usernames can contain a 9 as well.
Can anyone think of a good solution to this problem so that I can grab the person's username?
Thanks
I think you'd need a list of the usernames to compare against; it doesn't look like the results list is "regular" enough for a regular expression.
For example the line
7Mig823Bye, Dathanja
Could be "Mig82" 3 points vs "Bye, Dathanja", but it could also be "Mig8", 23 points, "Bye, Dathanja" or "Mig8", 2 points, "3Bye, Dathanja".
Is that correct? Because if it is, you aren't going to get away with a simple solution.
Edit: Wilson commented that getting the list of usernames might be an option. In that case, something like the following might work:
/^\d+?(username1|username2|username3)\d+?(username1|username2|username3)/
It will probably take some fiddling to get right.
Here's a plnkr demonstrating it with the data you provided: http://plnkr.co/edit/nJeGfbfHgvh5zJcTWRXS?p=preview
That said, a regex might not be the right tool for this job.
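If the list of usernames is available, the alternation can be built from it, as in this Python sketch. The usernames are taken from the sample data; the names ROW and split_row, and treating "Bye" as a possible first opponent, are assumptions made for the example.

```python
import re

# Usernames taken from the sample results list.
usernames = [
    "lemon_king", "mumu", "MogwaiAC", "Jebnarf",
    "YearofHell", "Dathanja", "Mig82", "kriswithak",
]

# Longest names first, so a shorter name cannot shadow a longer one.
alternation = "|".join(
    re.escape(name) for name in sorted(usernames, key=len, reverse=True)
)

# rank, username, points, first opponent ("Bye" is allowed as an opponent).
ROW = re.compile(rf"^(\d+?)({alternation})(\d+?)({alternation}|Bye)")


def split_row(line):
    """Return (rank, username, points, first_opponent), or None if nothing matches."""
    match = ROW.match(line)
    return match.groups() if match else None


print(split_row("7Mig823Bye, Dathanja (1-2), YearofHell (1-2)0.33330.42860.3750"))
```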
As far as I can tell, you want something like
(?x) # allow whitespace and comments just like
# any real programming language
^ # beginning of line
( \d+ ) # starts with one or more digits: CAPTURE 1
(?= \D ) # must have a non-digit following
( \w+ ) # capture one or more "word" characters: CAPTURE 2
( \d ) # next is a single digit: CAPTURE 3
(?= \D ) # must have a non-digit following
( \w+ ) # capture one or more "word" characters: CAPTURE 4
# now add things for the rest of the line if you want
Your username should now be in the second capture. I’ve been a tad more careful than strictly necessary, but if you end up munging this, you may need that. I’ve also put in all the captures in case you want to move stuff around or pull more stuff out.
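The commented pattern carries over to Python's re.VERBOSE mode more or less unchanged; here is a sketch for trying it on the sample rows. The name parse_row is made up for the example, and the pattern keeps the assumption from above that the points value is a single digit, so a player with 10 or more points would be split in the wrong place.

```python
import re

ROW = re.compile(
    r"""
    ^          # beginning of line
    (\d+)      # rank: CAPTURE 1
    (?=\D)     # must have a non-digit following
    (\w+)      # username: CAPTURE 2
    (\d)       # points, assumed to be a single digit: CAPTURE 3
    (?=\D)     # must have a non-digit following
    (\w+)      # first opponent: CAPTURE 4
    """,
    re.VERBOSE,
)


def parse_row(line):
    """Return (rank, username, points, first_opponent), or None if the line does not fit."""
    match = ROW.match(line)
    return match.groups() if match else None


print(parse_row("7Mig823Bye, Dathanja (1-2), YearofHell (1-2)0.33330.42860.3750"))
```

Backtracking is what makes this work: \w+ first swallows "Mig823Bye" and then gives characters back until a digit followed by a non-digit is left for the points.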
Please provide a bit more information. If you want the thing between the first number and the second number:
[0-9]+([^0-9]+)
The first group will contain the first username, as long as the username itself contains no digits.
Please comment on this (so I can check) and edit your question with more detail, though.
I wouldn't use regex. It will be a pain to debug, and you'll never be 100% certain you've covered all the edge cases.
Try doing 'manual' parsing using your language of choice's built in string manipulation functions.

Remove the first character of each line and append using Vim

I have a data file as follows.
1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
1,13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050
1,13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185
1,14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480
1,13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735
Using Vim, I want to remove the 1's from the start of each line and append them to the end. The resulting file would look like this:
14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065,1
13.2,1.78,2.14,11.2,100,2.65,2.76,.26,1.28,4.38,1.05,3.4,1050,1
13.16,2.36,2.67,18.6,101,2.8,3.24,.3,2.81,5.68,1.03,3.17,1185,1
14.37,1.95,2.5,16.8,113,3.85,3.49,.24,2.18,7.8,.86,3.45,1480,1
13.24,2.59,2.87,21,118,2.8,2.69,.39,1.82,4.32,1.04,2.93,735,1
I was looking for an elegant way to do this.
Actually I tried it like
:%s/$/,/g
And then
:%s/$/^./g
But I could not make it to work.
EDIT: Well, actually I made one mistake in my question. In the data file, the first character is not always 1; it is a mixture of 1, 2 and 3. So, from all the answers to this question, I came up with the solution:
:%s/^\([1-3]\),\(.*\)/\2,\1/g
and it is working now.
Here is a regular expression that doesn't care which number comes first, how many digits it has, or which separator you've used. That is, it would work for lines whose first number is 1 as well as for lines whose first number is 114:
:%s/\([0-9]*\)\(.\)\(.*\)/\3\2\1/
Explanation:
:%s// - Substitute on every line (%)
\(<something>\) - Capture and store in \1, \2, \3, ...
[0-9]* - A digit, 0 or more times
. - Any single character; in this case, the separator (,)
.* - Any character, 0 or more times
\3\2\1 - Replace with the captured groups in reverse order
So: Cut the line up into 1 , <the rest> as \1, \2 and \3 respectively, and reorder them.
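To check the three-group reordering outside Vim, the same substitution can be sketched in Python. This is only an illustration: Python writes the groups as plain parentheses where Vim uses \( and \), and the ^ and $ anchors plus the name move_first_field are additions for this sketch.

```python
import re

# Group 1: leading digits, group 2: the separator, group 3: the rest of the line.
FIRST_TO_LAST = re.compile(r"^([0-9]*)(.)(.*)$", re.MULTILINE)


def move_first_field(text):
    """Move the leading number of each line to the end, reusing its separator."""
    return FIRST_TO_LAST.sub(r"\3\2\1", text)


data = "1,14.23,1.71,2.43\n114;2.5;3.5"
print(move_first_field(data))
```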
This
:%s/^1,//
:%s/$/,1/
could be somewhat simpler to understand.
:%s/^1,\(.*\)/\1,1/
This will do the replacement on each line in the file. The \1 in the replacement stands for everything captured by \(.*\).
:%s/1,\(.*$\)/\1,1/gc
You could also solve this one using a macro. First, think about how to delete the 1, from the start of a line and append it to the end:
0 go to the start of the line
df, delete everything to and including the first ,
A,<ESC> append a comma to the end of the line
p paste the thing you deleted with df,
x delete the trailing comma
So, to sum it up, the following will convert a single line:
0df,A,<ESC>px
Now if you'd like to apply this set of modifications to all the lines, you will first need to record them:
qj start recording into the 'j' register
0df,A,<ESC>px convert a single line
j go to the next line
q stop recording
Finally, you can execute the macro anytime you want using @j, or convert your entire file with 99@j (using a higher number than 99 if you have more than 99 lines).
Here's the complete version:
qj0df,A,<ESC>pxjq99@j
This one might be easier to understand than the other solutions if you're not used to regular expressions!