I use string.format(str, regex) of LUA to fetch some key word.
local RICH_TAGS = {
"texture",
"img",
}
--\[((img)|(texture))=
local START_OF_PATTER = "\\[("
for index = 1, #RICH_TAGS - 1 do
START_OF_PATTER = START_OF_PATTER .. "(" .. RICH_TAGS[index]..")|"
end
START_OF_PATTER = START_OF_PATTER .. "("..RICH_TAGS[#RICH_TAGS].."))"
function RichTextDecoder.decodeRich(str)
local result = {}
print(str, START_OF_PATTER)
dump({string.find(str, START_OF_PATTER)})
end
output
hello[img=123] \[((texture)|(img))
dump from: [string "utils/RichTextDecoder.lua"]:21: in function 'decodeRich'
"<var>" = {
}
The output means:
str = hello[img=123]
START_OF_PATTER = \[((texture)|(img))
This regex works well with some online regex tools. But it find nothing in LUA.
Is there any wrong using in my code?
You cannot use regular expressions in Lua. Use Lua's string patterns to match strings.
See How to write this regular expression in Lua?
Try dump({str:find("\\%[%("))})
Also note that this loop:
for index = 1, #RICH_TAGS - 1 do
START_OF_PATTER = START_OF_PATTER .. "(" .. RICH_TAGS[index]..")|"
end
will leave out the last element of RICH_TAGS, I assume that was not your intention.
Edit:
But what I want is to fetch several specific word. For example, the
pattern can fetch "[img=" "[texture=" "[font=" any one of them. With
the regex string I wrote in my question, regex can do the work. But
with Lua, the way to do the job is write code like string.find(str,
"[img=") and string.find(str, "[texture=") and string.find(str,
"[font="). I wonder there should be a way to do the job with a single
pattern string. I tryed pattern string like "%[%a*=", but obviously it
will fetch a lot more string I need.
You cannot match several specific words with a single pattern unless they are in that string in a specific order. The only thing you could do is to put all the characters that make up those words into a class, but then you risk to find any word you can build from those letters.
Usually you would match each word with a separate pattern or you match any word and check if the match is one of your words using a look up table for example.
So basically you do what a regex library would do in a few lines of Lua.
I'm trying to get the list of all digits preceding a hyphen in a given string (let's say in cell A1), using a Google Sheets regex formula :
=REGEXEXTRACT(A1, "\d-")
My problem is that it only returns the first match... how can I get all matches?
Example text:
"A1-Nutrition;A2-ActPhysiq;A2-BioMeta;A2-Patho-jour;A2-StgMrktg2;H2-Bioth2/EtudeCas;H2-Bioth2/Gemmo;H2-Bioth2/Oligo;H2-Bioth2/Opo;H2-Bioth2/Organo;H3-Endocrino;H3-Génétiq"
My formula returns 1-, whereas I want to get 1-2-2-2-2-2-2-2-2-2-3-3- (either as an array or concatenated text).
I know I could use a script or another function (like SPLIT) to achieve the desired result, but what I really want to know is how I could get a re2 regular expression to return such multiple matches in a "REGEX.*" Google Sheets formula.
Something like the "global - Don't return after first match" option on regex101.com
I've also tried removing the undesired text with REGEXREPLACE, with no success either (I couldn't get rid of other digits not preceding a hyphen).
Any help appreciated!
Thanks :)
You can actually do this in a single formula using regexreplace to surround all the values with a capture group instead of replacing the text:
=join("",REGEXEXTRACT(A1,REGEXREPLACE(A1,"(\d-)","($1)")))
basically what it does is surround all instances of the \d- with a "capture group" then using regex extract, it neatly returns all the captures. if you want to join it back into a single string you can just use join to pack it back into a single cell:
You may create your own custom function in the Script Editor:
function ExtractAllRegex(input, pattern,groupId) {
return [Array.from(input.matchAll(new RegExp(pattern,'g')), x=>x[groupId])];
}
Or, if you need to return all matches in a single cell joined with some separator:
function ExtractAllRegex(input, pattern,groupId,separator) {
return Array.from(input.matchAll(new RegExp(pattern,'g')), x=>x[groupId]).join(separator);
}
Then, just call it like =ExtractAllRegex(A1, "\d-", 0, ", ").
Description:
input - current cell value
pattern - regex pattern
groupId - Capturing group ID you want to extract
separator - text used to join the matched results.
Edit
I came up with more general solution:
=regexreplace(A1,"(.)?(\d-)|(.)","$2")
It replaces any text except the second group match (\d-) with just the second group $2.
"(.)?(\d-)|(.)"
1 2 3
Groups are in ()
---------------------------------------
"$2" -- means return the group number 2
Learn regular expressions: https://regexone.com
Try this formula:
=regexreplace(regexreplace(A1,"[^\-0-9]",""),"(\d-)|(.)","$1")
It will handle string like this:
"A1-Nutrition;A2-ActPhysiq;A2-BioM---eta;A2-PH3-Généti***566*9q"
with output:
1-2-2-2-3-
I wasn't able to get the accepted answer to work for my case. I'd like to do it that way, but needed a quick solution and went with the following:
Input:
1111 days, 123 hours 1234 minutes and 121 seconds
Expected output:
1111 123 1234 121
Formula:
=split(REGEXREPLACE(C26,"[a-z,]"," ")," ")
The shortest possible regex:
=regexreplace(A1,".?(\d-)|.", "$1")
Which returns 1-2-2-2-2-2-2-2-2-2-3-3- for "A1-Nutrition;A2-ActPhysiq;A2-BioMeta;A2-Patho-jour;A2-StgMrktg2;H2-Bioth2/EtudeCas;H2-Bioth2/Gemmo;H2-Bioth2/Oligo;H2-Bioth2/Opo;H2-Bioth2/Organo;H3-Endocrino;H3-Génétiq".
Explanation of regex:
.? -- optional character
(\d-) -- capture group 1 with a digit followed by a dash (specify (\d+-) multiple digits)
| -- logical or
. -- any character
the replacement "$1" uses just the capture group 1, and discards anything else
Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex
This seems to work and I have tried to verify it.
The logic is
(1) Replace letter followed by hyphen with nothing
(2) Replace any digit not followed by a hyphen with nothing
(3) Replace everything which is not a digit or hyphen with nothing
=regexreplace(A1,"[a-zA-Z]-|[0-9][^-]|[a-zA-Z;/é]","")
Result
1-2-2-2-2-2-2-2-2-2-3-3-
Analysis
I had to step through these procedurally to convince myself that this was correct. According to this reference when there are alternatives separated by the pipe symbol, regex should match them in order left-to-right. The above formula doesn't work properly unless rule 1 comes first (otherwise it reduces all characters except a digit or hyphen to null before rule (1) can come into play and you get an extra hyphen from "Patho-jour").
Here are some examples of how I think it must deal with the text
The solution to capture groups with RegexReplace and then do the RegexExctract works here too, but there is a catch.
=join("",REGEXEXTRACT(A1,REGEXREPLACE(A1,"(\d-)","($1)")))
If the cell that you are trying to get the values has Special Characters like parentheses "(" or question mark "?" the solution provided won´t work.
In my case, I was trying to list all “variables text” contained in the cell. Those “variables text “ was wrote inside like that: “{example_name}”. But the full content of the cell had special characters making the regex formula do break. When I removed theses specials characters, then I could list all captured groups like the solution did.
There are two general ('Excel' / 'native' / non-Apps Script) solutions to return an array of regex matches in the style of REGEXEXTRACT:
Method 1)
insert a delimiter around matches, remove junk, and call SPLIT
Regexes work by iterating over the string from left to right, and 'consuming'. If we are careful to consume junk values, we can throw them away.
(This gets around the problem faced by the currently accepted solution, which is that as Carlos Eduardo Oliveira mentions, it will obviously fail if the corpus text contains special regex characters.)
First we pick a delimiter, which must not already exist in the text. The proper way to do this is to parse the text to temporarily replace our delimiter with a "temporary delimiter", like if we were going to use commas "," we'd first replace all existing commas with something like "<<QUOTED-COMMA>>" then un-replace them later. BUT, for simplicity's sake, we'll just grab a random character such as from the private-use unicode blocks and use it as our special delimiter (note that it is 2 bytes... google spreadsheets might not count bytes in graphemes in a consistent way, but we'll be careful later).
=SPLIT(
LAMBDA(temp,
MID(temp, 1, LEN(temp)-LEN(""))
)(
REGEXREPLACE(
"xyzSixSpaces:[ ]123ThreeSpaces:[ ]aaaa 12345",".*?( |$)",
"$1"
)
),
""
)
We just use a lambda to define temp="match1match2match3", then use that to remove the last delimiter into "match1match2match3", then SPLIT it.
Taking COLUMNS of the result will prove that the correct result is returned, i.e. {" ", " ", " "}.
This is a particularly good function to turn into a Named Function, and call it something like REGEXGLOBALEXTRACT(text,regex) or REGEXALLEXTRACT(text,regex), e.g.:
=SPLIT(
LAMBDA(temp,
MID(temp, 1, LEN(temp)-LEN(""))
)(
REGEXREPLACE(
text,
".*?("®ex&"|$)",
"$1"
)
),
""
)
Method 2)
use recursion
With LAMBDA (i.e. lets you define a function like any other programming language), you can use some tricks from the well-studied lambda calculus and function programming: you have access to recursion. Defining a recursive function is confusing because there's no easy way for it to refer to itself, so you have to use a trick/convention:
trick for recursive functions: to actually define a function f which needs to refer to itself, instead define a function that takes a parameter of itself and returns the function you actually want; pass in this 'convention' to the Y-combinator to turn it into an actual recursive function
The plumbing which takes such a function work is called the Y-combinator. Here is a good article to understand it if you have some programming background.
For example to get the result of 5! (5 factorial, i.e. implement our own FACT(5)), we could define:
Named Function Y(f)=LAMBDA(f, (LAMBDA(x,x(x)))( LAMBDA(x, f(LAMBDA(y, x(x)(y)))) ) ) (this is the Y-combinator and is magic; you don't have to understand it to use it)
Named Function MY_FACTORIAL(n)=
Y(LAMBDA(self,
LAMBDA(n,
IF(n=0, 1, n*self(n-1))
)
))
result of MY_FACTORIAL(5): 120
The Y-combinator makes writing recursive functions look relatively easy, like an introduction to programming class. I'm using Named Functions for clarity, but you could just dump it all together at the expense of sanity...
=LAMBDA(Y,
Y(LAMBDA(self, LAMBDA(n, IF(n=0,1,n*self(n-1))) ))(5)
)(
LAMBDA(f, (LAMBDA(x,x(x)))( LAMBDA(x, f(LAMBDA(y, x(x)(y)))) ) )
)
How does this apply to the problem at hand? Well a recursive solution is as follows:
in pseudocode below, I use 'function' instead of LAMBDA, but it's the same thing:
// code to get around the fact that you can't have 0-length arrays
function emptyList() {
return {"ignore this value"}
}
function listToArray(myList) {
return OFFSET(myList,0,1)
}
function allMatches(text, regex) {
allMatchesHelper(emptyList(), text, regex)
}
function allMatchesHelper(resultsToReturn, text, regex) {
currentMatch = REGEXEXTRACT(...)
if (currentMatch succeeds) {
textWithoutMatch = SUBSTITUTE(text, currentMatch, "", 1)
return allMatches(
{resultsToReturn,currentMatch},
textWithoutMatch,
regex
)
} else {
return listToArray(resultsToReturn)
}
}
Unfortunately, the recursive approach is quadratic order of growth (because it's appending the results over and over to itself, while recreating the giant search string with smaller and smaller bites taken out of it, so 1+2+3+4+5+... = big^2, which can add up to a lot of time), so may be slow if you have many many matches. It's better to stay inside the regex engine for speed, since it's probably highly optimized.
You could of course avoid using Named Functions by doing temporary bindings with LAMBDA(varName, expr)(varValue) if you want to use varName in an expression. (You can define this pattern as a Named Function =cont(varValue) to invert the order of the parameters to keep code cleaner, or not.)
Whenever I use varName = varValue, write that instead.
to see if a match succeeds, use ISNA(...)
It would look something like:
Named Function allMatches(resultsToReturn, text, regex):
UNTESTED:
LAMBDA(helper,
OFFSET(
helper({"ignore"}, text, regex),
0,1)
)(
Y(LAMBDA(helperItself,
LAMBDA(results, partialText,
LAMBDA(currentMatch,
IF(ISNA(currentMatch),
results,
LAMBDA(textWithoutMatch,
helperItself({results,currentMatch}, textWithoutMatch)
)(
SUBSTITUTE(partialText, currentMatch, "", 1)
)
)
)(
REGEXEXTRACT(partialText, regex)
)
)
))
)
How i can read 4th Value(inside "" i.e "vV0...." using Regex in below condition ?
I am updating a bit this part - Is it possible to first find Word "LaunchFileUploader" and then select the 4th Value, if there are multiple instance of LaunchFileUploader in the file just select 4th Value of first word found ? Attaching screenshot of file where this needs to be searched (In the file word is "LaunchFileUploader")
I tried this but it gives as - I need 4th value (Group 1 is giving me third value)
\bLaunchFileUploader\b(\:?.*?,){3}.*?\)
Match 1
Full match 11030-11428 LaunchFileUploader("ERM-1BLX3D04R10-0001", 1662, "2ecbb644-34fa-4919-9809-a5ff47594c2d", "8dZOPyHKBK...
Group 1. n/a "2ecbb644-34fa-4919-9809-a5ff47594c2d",
I am still looking for solution for this. Any help is aprreciated.
Depending on what's available to you to use, there's a couple of ways to do it.
Either way, this would work better if there were no new lines in the string, just plain ("value1","value2","value3","value4") etc. It'll still work, but you may need to clean up some new lines from the resulting string.
The easy way - use code for the hard part. Grab the inner string with:
(?<=\().*?(?=\))
This will get everything that's between the 2 parentheses (using positive lookarounds). In code, you could then split/explode this string on , and take the 4th item.
If you want to do it all in regex, you could use something along the lines of:
(?<=\()(?:.*?,){3}(.*?)(?=\))
This would a) match the entire contents of the parentheses and b) capture the 4th option in a capture group. To go even deeper:
(?<=\()(?:.*?,){3}\"(.*?)\"(?=\))
would capture the contents of the "" quotation marks only.
Some tools don't allow you to use lookarounds, if this is the case let me know and I'll see what other ways there are around it.
EDIT Ran this in JS console on browser. This absolutely does work.
EDIT 2 I see you've updated your question with the text you're actually searching in. This pattern will include the space and the new line character as per the copy/paste of the above text.
(?<=\(\")(?:.*?,\s?\n?){3}\"(.*?)\"(?=\))
See my second image for the test in console
This works for python and PHP:
(?<=\")(.*)(?:\"\);)\Z
Demo for Python and PHP
For Java, replace \Z with $ as follows:
(?:")(.*)(?:\"\);)$
Demo for JavaScript
NOTE: Be sure to look the captured group and not the matched group.
UPDATE:
Try this for your updated request:
"(.*)"(?:[\\);\] \/>}]*)$
Demo for updated input string
all the above regex patterns assume there is a line break after each comma
Auto-generated Java Code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
final String regex = "\"(.*)\"(?:[\\\\);\\] \\/>\\}]*)$";
final String string = "\n"
+ "}$(document).ready( function(){ PathUploader\n"
+ " (\"ERM-1BLX3D04R10-0001\", \n"
+ " 1662, \n"
+ " \"1bff5c85-7a52-4cc5-86ef-a4ccbf14c5d5\", \n"
+ "\"vV0mX3VadCSPnN8FsAO7%2fysNbP5b3SnaWWHQETFy7ORSoz9QUQUwK7jqvCEr%2f8UnHkNNVLkJedu5l%2bA%2bne%2fD%2b2F5EWVlGox95BYDhl6EEkVAVFmMlRThh1sPzPU5LLylSsR9T7TAODjtaJ2wslruS5nW1A7%2fnLB%2bljZaQhaT9vZLcFkDqLjouf9vu08K9Gmiu6neRVSaISP3cEVAmSz5kxxhV2oiEF9Y0i6Y5%2f5ASaRiW21w3054SmRF0rq3IwZzBvLx0%2fAk1m6B0gs3841b%2fw%3d%3d\"); } );//]]>";
final Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
final Matcher matcher = pattern.matcher(string);
while (matcher.find()) {
System.out.println("Full match: " + matcher.group(0));
for (int i = 1; i <= matcher.groupCount(); i++) {
System.out.println("Group " + i + ": " + matcher.group(i));
}
}
So i am trying to search php files using preg_match_all to find method usages in there and take parameter values from it.
For example if i there is method call like:
method("some text", null, "third param")
I want to get some text, null, and third param to an array.
I was able to get first parameter, bug having problems for second and third one.
What i tryed:
\(method)\([\'"](.*?[ \na-zA-Z0-9_-]+)[\'"]\im
https://regex101.com/r/dC8wV6/2
You can try
method\s*\(((?:\s*(?:['"][^'"]*['"]|\w+)\s*(?:,|(?=\))))*)
to capture the parameters in group 1, then split them on , or ,(?=(?:(?:[^'"]*['"]){2})*[^'"]*$) to get an array of parameters.
This pattern comes with the usual pitfalls, i.e. commas in strings (method(" , ")) or mismatched quotes (" ' ") or escaped quotes (" \" ") will trip it up. Those issues could be fixed, but I think the regex is long enough as is.
I'm trying to retrieve a filename without the extension in ColdFusion. I am using the following function:
REMatchNoCase( "(.+?)(\.[^.]*$|$)" , "Doe, John 8.15.2012.docx" );
I would like this to return an array like: ["Doe, John 8.15.2012","docx"]
but instead I always get an array with one element - the entire filename:["Doe, John 8.15.2012.docx"]
I tried the regex string above on rexv.org and it works as expected, but not on ColdFusion. I got the string from this SO question: Regex: Get Filename Without Extension in One Shot?
Does ColdFusion use a different syntax? Or am I doing something wrong?
Thanks.
Why you're not getting expected results...
The reason you are getting a one-item array with the whole filename is because your pattern matches the entire filename, and matches once.
It is capturing the two groups, but rematch returns arrays of matches, not arrays of the captured groups, so you don't see those groups.
How to solve the problem...
If you are dealing with simple files (i.e. no .htaccess or similar), then the simplest solution is to just use...
ListLast( filename , '.' )
....to get only the file extension and to get the name without extension you can do...
rematch( '.+(?=\.[^.]+$)' , filename )
This uses a lookahead to ensure there is a . followed by at least one non-. at the end of the string, but (since it's a lookahead) it is excluded from the match (so you only get the pre-extension part in your match).
To deal with non-extensioned files (e.g. .htaccess or README) you can modify the above regex to .+(?=(?:\.[^.]+)?$) which basically does the same thing except making the extension optional. However, there isn't a trivial way to get update the ListLast method for these (guess you'd need to check len(extension) LT len(filename)-1 or similar).
(optional) Accessing captured groups...
If you want to get at the actual captured groups, the closest native way to do this in CF is using the refind function, with the fourth argument set to true - however, this only gives you positions and lengths - requiring that you use mid to extract them yourself.
For this reason (amongst many others), I've created an improved regex implementation for CF, called cfRegex, which lets you return the group text directly (i.e. no messing around with mid).
If you wanted to use cfRegex, you can do so with your original pattern like so:
RegexMatch( '(.+?)(\.[^.]*$|$)' , filename , 1 , 0 , 'groups' )
Or with named arguments:
RegexMatch( pattern='(.+?)(\.[^.]*$|$)' , text=filename , returntype='groups' )
And you get returned an array of matches, within each element being an array of the captured groups for that match.
If you're doing lots of regex work dealing with captured groups, cfRegex is definitely better than doing it with CF's re methods.
If all you care about is getting the extension and/or the filename with extension excluded then the previous examples above are sufficient.
#Peter's response is great, however the approach is perhaps a bit longer-winded than necessary. One can do this with reMatch() with a slight tweak to the regex.
<cfscript>
param name="URL.filename";
sRegex = "^.+?(?=(?:\.[^.]+?)?$)";
aMatch = reMatch(sRegex, URL.filename);
writeDump(aMatch);
</cfscript>
This works on the following filename patterns:
foo.bar
foo
.htaccess
John 8.15.2012.docx
Explanation of the regex:
^ From the beginning of the string
.+? One or more (+) characters (.), but the fewest (?) that will work with the rest of the regex. This is the file name.
(?=) Look ahead. Make sure the stuff in here appears in the string, but don't actually match it. This is the key bit to NOT return any file extension that might be present.
(?: Group this stuff together, but don't remember it for a back reference.
. A dot. This is the separator between file name and file extension.
[^.]+? One or more (+) single ([]) non-dot characters (^.), again matching the fewest possible (?) that will allow the regex as a whole to work.
? (This is the one after the (?:) group). Zero or one of those groups: ie: zero or one file extensions.
$ To the end of the string
I've only tested with those four file name patterns, but it seems to work OK. Other people might be able to finetune it.
A few more ways of achieving the same result. They all execute in roughly the same amount of time.
<cfscript>
str = 'Doe, John 8.15.2012.docx';
// sans regex
arr1 = [
reverse( listRest( reverse( str ), '.' ) ),
listLast( str, '.' )
];
// using Java String lastIndexOf()
arr2 = [
str.substring( 0, str.lastIndexOf( '.' ) ),
str.substring( str.lastIndexOf( '.' ) + 1 )
];
// using listToArray with non-filename safe character replace
arr3 = listToArray( str.replaceAll( '\.([^\.]+)$', '|$1' ), '|' );
</cfscript>