My RAKU Code:
sub comments {
if ($DEBUG) { say "<filtering comments>\n"; }
my #filteredtitles = ();
# This loops through each track
for #tracks -> $title {
##########################
# LAB 1 TASK 2 #
##########################
## Add regex substitutions to remove superflous comments and all that follows them
## Assign to $_ with smartmatcher (~~)
##########################
$_ = $title;
if ($_) ~~ s:g:mrx/ .*<?[\(^.*]> / {
# Repeat for the other symbols
########################## End Task 2
# Add the edited $title to the new array of titles
#filteredtitles.push: $_;
}
}
# Updates #tracks
return #filteredtitles;
}
Result when compiling:
Error Compiling! Placeholder variable '#_' may not be used here because the surrounding block doesn't take a signature.
Is there something obvious that I am missing? Any help is appreciated.
So, in contrast with #raiph's answer, here's what I have:
my #tracks = <Foo Ba(r B^az>.map: { S:g / <[\(^]> // };
Just that. Nothing else. Let's dissect it, from the inside out:
This part: / <[\(^]> / is a regular expression that will match one character, as long as it is an open parenthesis (represented by the \() or a caret (^). When they go inside the angle brackets/square brackets combo, it means that is an Enumerated character class.
Then, the: S introduces the non-destructive substitution, i.e., a quoting construct that will make regex-based substitutions over the topic variable $_ but will not modify it, just return its value with the modifications requested. In the code above, S:g brings the adverb :g or :global (see the global adverb in the adverbs section of the documentation) to play, meaning (in the case of the substitution) "please make as many as possible of this substitution" and the final / marks the end of the substitution text, and as it is adjacent to the second /, that means that
S:g / <[\(^]> //
means "please return the contents of $_, but modified in such a way that all its characters matching the regex <[\(^]> are deleted (substituted for the empty string)"
At this point, I should emphasize that regular expressions in Raku are really powerful, and that reading the entire page (and probably the best practices and gotchas page too) is a good idea.
Next, the: .map method, documented here, will be applied to any Iterable (List, Array and all their alikes) and will return a sequence based on each element of the Iterable, altered by a Code passed to it. So, something like:
#x.map({ S:g / foo /bar/ })
essencially means "please return a Sequence of every item on #x, modified by substituting any appearance of the substring foo for bar" (nothing will be altered on #x). A nice place to start to learn about sequences and iterables would be here.
Finally, my one-liner
my #tracks = <Foo Ba(r B^az>.map: { S:g / <[\(^]> // };
can be translated as:
I have a List with three string elements
Foo
Ba(r
B^az
(This would be a placeholder for your "list of titles"). Take that list and generate a second one, that contains every element on it, but with all instances of the chars "open parenthesis" and "caret" removed.
Ah, and store the result in the variable #tracks (that has my scope)
Here's what I ended up with:
my #tracks = <Foo Ba(r B^az>;
sub comments {
my #filteredtitles;
for #tracks -> $_ is copy {
s:g / <[\(^]> //;
#filteredtitles.push: $_;
}
return #filteredtitles;
}
The is copy ensures the variable set up by the for loop is mutable.
The s:g/...//; is all that's needed to strip the unwanted characters.
One thing no one can help you with is the error you reported. I currently think you just got confused.
Here's an example of code that generates that error:
do { #_ }
But there is no way the code you've shared could generate that error because it requires that there is an #_ variable in your code, and there isn't one.
One way I can help in relation to future problems you may report on StackOverflow is to encourage you to read and apply the guidance in Minimal Reproducible Example.
While your code did not generate the error you reported, it will perhaps help you if you know about some of the other compile time and run time errors there were in the code you shared.
Compile-time errors:
You wrote s:g:mrx. That's invalid: Adverb mrx not allowed on substitution.
You missed out the third slash of the s///. That causes mayhem (see below).
There were several run-time errors, once I got past the compile-time errors. I'll discuss just one, the regex:
.*<?[...]> will match any sub-string with a final character that's one of the ones listed in the [...], and will then capture that sub-string except without the final character. In the context of an s:g/...// substitution this will strip ordinary characters (captured by the .*) but leave the special characters.
This makes no sense.
So I dropped the .*, and also the ? from the special character pattern, changing it from <?[...]> (which just tries to match against the character, but does not capture it if it succeeds) to just <[...]> (which also tries to match against the character, but, if it succeeds, does capture it as well).
A final comment is about an error you made that may well have seriously confused you.
In a nutshell, the s/// construct must have three slashes.
In your question you had code of the form s/.../ (or s:g/.../ etc), without the final slash. If you try to compile such code the parser gets utterly confused because it will think you're just writing a long replacement string.
For example, if you wrote this code:
if s/foo/ { say 'foo' }
if m/bar/ { say 'bar' }
it'd be as if you'd written:
if s/foo/ { say 'foo' }\nif m/...
which in turn would mean you'd get the compile-time error:
Missing block
------> if m/⏏bar/ { ... }
expecting any of:
block or pointy block
...
because Raku(do) would have interpreted the part between the second and third /s as the replacement double quoted string of what it interpreted as an s/.../.../ construct, leading it to barf when it encountered bar.
So, to recap, the s/// construct requires three slashes, not two.
(I'm ignoring syntactic variants of the construct such as, say, s [...] = '...'.)
So I had this code working for a few months already, lets say I have a table called Categories, which has a string column called name, so I receive a string and I want to know if any category was mentioned (a mention occur when the string contains the substring: #name_of_a_category), the approach I follow for this was something like below:
categories.select { |category_i| content_received.downcase.match(/##{category_i.downcase}/)}
That worked pretty well until today suddenly started to receive an exception unmatched close parenthesis, I realized that the categories names can contain special chars so I decided to not consider special chars or spaces anymore (don't want to add restrictions to the user and at the same time don't want to deal with those cases so the policy is just to ignore it).
So the question is there a clean way of removing these special chars (maintaining the #) and matching the string (don't want to modify the data just ignore it while looking for mentions)?
You can also use
prep_content_received = content_received.gsub(/[^\w\s]|_/,'')
p categories.select { |c|
prep_content_received.match?(/\b#{c.gsub(/[^\w\s]|_/, '').strip()}\b/i)
}
See the Ruby demo
Details:
The prep_content_received = content_received.gsub(/[^\w\s]|_/,'') creates a copy of content_received with no special chars and _. Using it once reduced overhead if there are a lot of categories
Then, you iterate over the categories list, and each time check if the prep_content_received matches \b (word boundary) + category with all special chars, _ and leading/trailing whitespace stripped from it + \b in a case insensitive way (see the /i flag, no need to .downcase).
So after looking around I found some answers on the platform but nothing with my specific requirements (maybe I missed something, if so please let me know), and this is how I fix it for my case:
content_received = 'pepe is watching a #comedy :)'
categories = ['comedy :)', 'terror']
temp_content = content_received.downcase
categories.select { |category_i| temp_content.gsub(/[^\sa-zA-Z0-9]/, '#' => '#').match?(/##{category_i.downcase.
gsub(/[^\sa-zA-Z0-9]/, '')}/) }
For the sake of the example, I reduced the categories to a simple array of strings, basically the first gsub, remove any character that is not a letter or a number (any special character) and replace each # with an #, the second gsub is a simpler version of the first one.
You can test the snippet above here
I'm trying to get the list of all digits preceding a hyphen in a given string (let's say in cell A1), using a Google Sheets regex formula :
=REGEXEXTRACT(A1, "\d-")
My problem is that it only returns the first match... how can I get all matches?
Example text:
"A1-Nutrition;A2-ActPhysiq;A2-BioMeta;A2-Patho-jour;A2-StgMrktg2;H2-Bioth2/EtudeCas;H2-Bioth2/Gemmo;H2-Bioth2/Oligo;H2-Bioth2/Opo;H2-Bioth2/Organo;H3-Endocrino;H3-Génétiq"
My formula returns 1-, whereas I want to get 1-2-2-2-2-2-2-2-2-2-3-3- (either as an array or concatenated text).
I know I could use a script or another function (like SPLIT) to achieve the desired result, but what I really want to know is how I could get a re2 regular expression to return such multiple matches in a "REGEX.*" Google Sheets formula.
Something like the "global - Don't return after first match" option on regex101.com
I've also tried removing the undesired text with REGEXREPLACE, with no success either (I couldn't get rid of other digits not preceding a hyphen).
Any help appreciated!
Thanks :)
You can actually do this in a single formula using regexreplace to surround all the values with a capture group instead of replacing the text:
=join("",REGEXEXTRACT(A1,REGEXREPLACE(A1,"(\d-)","($1)")))
basically what it does is surround all instances of the \d- with a "capture group" then using regex extract, it neatly returns all the captures. if you want to join it back into a single string you can just use join to pack it back into a single cell:
You may create your own custom function in the Script Editor:
function ExtractAllRegex(input, pattern,groupId) {
return [Array.from(input.matchAll(new RegExp(pattern,'g')), x=>x[groupId])];
}
Or, if you need to return all matches in a single cell joined with some separator:
function ExtractAllRegex(input, pattern,groupId,separator) {
return Array.from(input.matchAll(new RegExp(pattern,'g')), x=>x[groupId]).join(separator);
}
Then, just call it like =ExtractAllRegex(A1, "\d-", 0, ", ").
Description:
input - current cell value
pattern - regex pattern
groupId - Capturing group ID you want to extract
separator - text used to join the matched results.
Edit
I came up with more general solution:
=regexreplace(A1,"(.)?(\d-)|(.)","$2")
It replaces any text except the second group match (\d-) with just the second group $2.
"(.)?(\d-)|(.)"
1 2 3
Groups are in ()
---------------------------------------
"$2" -- means return the group number 2
Learn regular expressions: https://regexone.com
Try this formula:
=regexreplace(regexreplace(A1,"[^\-0-9]",""),"(\d-)|(.)","$1")
It will handle string like this:
"A1-Nutrition;A2-ActPhysiq;A2-BioM---eta;A2-PH3-Généti***566*9q"
with output:
1-2-2-2-3-
I wasn't able to get the accepted answer to work for my case. I'd like to do it that way, but needed a quick solution and went with the following:
Input:
1111 days, 123 hours 1234 minutes and 121 seconds
Expected output:
1111 123 1234 121
Formula:
=split(REGEXREPLACE(C26,"[a-z,]"," ")," ")
The shortest possible regex:
=regexreplace(A1,".?(\d-)|.", "$1")
Which returns 1-2-2-2-2-2-2-2-2-2-3-3- for "A1-Nutrition;A2-ActPhysiq;A2-BioMeta;A2-Patho-jour;A2-StgMrktg2;H2-Bioth2/EtudeCas;H2-Bioth2/Gemmo;H2-Bioth2/Oligo;H2-Bioth2/Opo;H2-Bioth2/Organo;H3-Endocrino;H3-Génétiq".
Explanation of regex:
.? -- optional character
(\d-) -- capture group 1 with a digit followed by a dash (specify (\d+-) multiple digits)
| -- logical or
. -- any character
the replacement "$1" uses just the capture group 1, and discards anything else
Learn more about regex: https://twiki.org/cgi-bin/view/Codev/TWikiPresentation2018x10x14Regex
This seems to work and I have tried to verify it.
The logic is
(1) Replace letter followed by hyphen with nothing
(2) Replace any digit not followed by a hyphen with nothing
(3) Replace everything which is not a digit or hyphen with nothing
=regexreplace(A1,"[a-zA-Z]-|[0-9][^-]|[a-zA-Z;/é]","")
Result
1-2-2-2-2-2-2-2-2-2-3-3-
Analysis
I had to step through these procedurally to convince myself that this was correct. According to this reference when there are alternatives separated by the pipe symbol, regex should match them in order left-to-right. The above formula doesn't work properly unless rule 1 comes first (otherwise it reduces all characters except a digit or hyphen to null before rule (1) can come into play and you get an extra hyphen from "Patho-jour").
Here are some examples of how I think it must deal with the text
The solution to capture groups with RegexReplace and then do the RegexExctract works here too, but there is a catch.
=join("",REGEXEXTRACT(A1,REGEXREPLACE(A1,"(\d-)","($1)")))
If the cell that you are trying to get the values has Special Characters like parentheses "(" or question mark "?" the solution provided won´t work.
In my case, I was trying to list all “variables text” contained in the cell. Those “variables text “ was wrote inside like that: “{example_name}”. But the full content of the cell had special characters making the regex formula do break. When I removed theses specials characters, then I could list all captured groups like the solution did.
There are two general ('Excel' / 'native' / non-Apps Script) solutions to return an array of regex matches in the style of REGEXEXTRACT:
Method 1)
insert a delimiter around matches, remove junk, and call SPLIT
Regexes work by iterating over the string from left to right, and 'consuming'. If we are careful to consume junk values, we can throw them away.
(This gets around the problem faced by the currently accepted solution, which is that as Carlos Eduardo Oliveira mentions, it will obviously fail if the corpus text contains special regex characters.)
First we pick a delimiter, which must not already exist in the text. The proper way to do this is to parse the text to temporarily replace our delimiter with a "temporary delimiter", like if we were going to use commas "," we'd first replace all existing commas with something like "<<QUOTED-COMMA>>" then un-replace them later. BUT, for simplicity's sake, we'll just grab a random character such as from the private-use unicode blocks and use it as our special delimiter (note that it is 2 bytes... google spreadsheets might not count bytes in graphemes in a consistent way, but we'll be careful later).
=SPLIT(
LAMBDA(temp,
MID(temp, 1, LEN(temp)-LEN(""))
)(
REGEXREPLACE(
"xyzSixSpaces:[ ]123ThreeSpaces:[ ]aaaa 12345",".*?( |$)",
"$1"
)
),
""
)
We just use a lambda to define temp="match1match2match3", then use that to remove the last delimiter into "match1match2match3", then SPLIT it.
Taking COLUMNS of the result will prove that the correct result is returned, i.e. {" ", " ", " "}.
This is a particularly good function to turn into a Named Function, and call it something like REGEXGLOBALEXTRACT(text,regex) or REGEXALLEXTRACT(text,regex), e.g.:
=SPLIT(
LAMBDA(temp,
MID(temp, 1, LEN(temp)-LEN(""))
)(
REGEXREPLACE(
text,
".*?("®ex&"|$)",
"$1"
)
),
""
)
Method 2)
use recursion
With LAMBDA (i.e. lets you define a function like any other programming language), you can use some tricks from the well-studied lambda calculus and function programming: you have access to recursion. Defining a recursive function is confusing because there's no easy way for it to refer to itself, so you have to use a trick/convention:
trick for recursive functions: to actually define a function f which needs to refer to itself, instead define a function that takes a parameter of itself and returns the function you actually want; pass in this 'convention' to the Y-combinator to turn it into an actual recursive function
The plumbing which takes such a function work is called the Y-combinator. Here is a good article to understand it if you have some programming background.
For example to get the result of 5! (5 factorial, i.e. implement our own FACT(5)), we could define:
Named Function Y(f)=LAMBDA(f, (LAMBDA(x,x(x)))( LAMBDA(x, f(LAMBDA(y, x(x)(y)))) ) ) (this is the Y-combinator and is magic; you don't have to understand it to use it)
Named Function MY_FACTORIAL(n)=
Y(LAMBDA(self,
LAMBDA(n,
IF(n=0, 1, n*self(n-1))
)
))
result of MY_FACTORIAL(5): 120
The Y-combinator makes writing recursive functions look relatively easy, like an introduction to programming class. I'm using Named Functions for clarity, but you could just dump it all together at the expense of sanity...
=LAMBDA(Y,
Y(LAMBDA(self, LAMBDA(n, IF(n=0,1,n*self(n-1))) ))(5)
)(
LAMBDA(f, (LAMBDA(x,x(x)))( LAMBDA(x, f(LAMBDA(y, x(x)(y)))) ) )
)
How does this apply to the problem at hand? Well a recursive solution is as follows:
in pseudocode below, I use 'function' instead of LAMBDA, but it's the same thing:
// code to get around the fact that you can't have 0-length arrays
function emptyList() {
return {"ignore this value"}
}
function listToArray(myList) {
return OFFSET(myList,0,1)
}
function allMatches(text, regex) {
allMatchesHelper(emptyList(), text, regex)
}
function allMatchesHelper(resultsToReturn, text, regex) {
currentMatch = REGEXEXTRACT(...)
if (currentMatch succeeds) {
textWithoutMatch = SUBSTITUTE(text, currentMatch, "", 1)
return allMatches(
{resultsToReturn,currentMatch},
textWithoutMatch,
regex
)
} else {
return listToArray(resultsToReturn)
}
}
Unfortunately, the recursive approach is quadratic order of growth (because it's appending the results over and over to itself, while recreating the giant search string with smaller and smaller bites taken out of it, so 1+2+3+4+5+... = big^2, which can add up to a lot of time), so may be slow if you have many many matches. It's better to stay inside the regex engine for speed, since it's probably highly optimized.
You could of course avoid using Named Functions by doing temporary bindings with LAMBDA(varName, expr)(varValue) if you want to use varName in an expression. (You can define this pattern as a Named Function =cont(varValue) to invert the order of the parameters to keep code cleaner, or not.)
Whenever I use varName = varValue, write that instead.
to see if a match succeeds, use ISNA(...)
It would look something like:
Named Function allMatches(resultsToReturn, text, regex):
UNTESTED:
LAMBDA(helper,
OFFSET(
helper({"ignore"}, text, regex),
0,1)
)(
Y(LAMBDA(helperItself,
LAMBDA(results, partialText,
LAMBDA(currentMatch,
IF(ISNA(currentMatch),
results,
LAMBDA(textWithoutMatch,
helperItself({results,currentMatch}, textWithoutMatch)
)(
SUBSTITUTE(partialText, currentMatch, "", 1)
)
)
)(
REGEXEXTRACT(partialText, regex)
)
)
))
)
I had a programming exercise where we had to write a format method that would take as parameters a String, as the text input we want to format, and an integer, as the length of the lines we want.
We also had specific rules, and one of them was: if a word is longer than the desired length, it should go on a line by itself.
So an easy way to do this was to search with a Regex expression a series of non-white space characters being at least length and possibly more, and to replace this matched by itself with two \n at each side...
perhaps the code is clearer: let's imagine our desired line length is 10
String s = line.replaceAll("([ \t])(\\S{10,})([ \t])", "\n$2\n");
System.out.println(s);
This actually works fine.
My problem is that the line length should be passed as a parameter to the format method; so the only way I can refer to this parameter is by using its reference. For example:
public static String format(String s, int length) { ...
//missing stuff
String s = line.replaceAll("([ \t])(\S{length,})([ \t])", "\n$2\n");
System.out.println(s);
However in that case it will return an error: PatternSyntaxException
My question is: is there any way to use a reference into a quantifier (instead of a number), as argument for the number of occurrences we want to match? As it seems it would be really useful, I was surprised that it didn't seem to work. Also it doesn't look like a lot of people encountered this issue.
try this
String s = line.replaceAll("([ \t])(\S{"+length+",})([ \t])", "\n$2\n");
System.out.println(s);
I am wanting to create a regular expression for the following scenario:
If a string contains the percentage character (%) then it can only contain the following: %20, and cannot be preceded by another '%'.
So if there was for instance, %25 it would be rejected. For instance, the following string would be valid:
http://www.test.com/?&Name=My%20Name%20Is%20Vader
But these would fail:
http://www.test.com/?&Name=My%20Name%20Is%20VadersAccountant%25
%%%25
Any help would be greatly appreciated,
Kyle
EDIT:
The scenario in a nutshell is that a link is written to an encoded state and then launched via JavaScript. No decoding works. I tried .net decoding and JS decoding, each having the same result - The results stay encoded when executed.
Doesn't require a %:
/^[^%]*(%20[^%]*)*$/
Which language are you using?
Most languages have a Uri Encoder / Decoder function or class.
I would suggest you decode the string first and than check for valid (or invalid) characters.
i.e. something like /[\w ]/ (empty is a space)
With a regex in the first place you need to respect that www.example.com/index.html?user=admin&pass=%%250 means that the pass really is "%250".
Another solution if look-arounds are not available:
^([^%]|%([013-9a-fA-F][0-9a-fA-F]|2[1-9a-fA-F]))*$
Reject the string if it matches %[^2][^0]
I think that would find what you need
/^([^%]|%%|%20)+$/
Edit: Added case where %% is valid string inside URI
Edit2: And fixed it for case where it should fail :-)
Edit3:
In case you need to use it in editor (which would explain why you can't use more programmatic way), then you have to correctly escape all special characters, for example in Vim that regex should lool:
/^\([^%]\|%%\|%20\)\+$/
Maybe a better approach is to deal with that validation after you decode that string:
string name = HttpUtility.UrlDecode(Request.QueryString["Name"]);
/^([^%]|%20)*$/
This requires a test against the "bad" patterns. If we're allowing %20 - we don't need to make sure it exists.
As others have said before, %% is valid too... and %%25would be %25
The below regex matches anything that doesn't fit into the above rules
/(?<![^%]%)%(?!(20|%))/
The first brackets check whether there is a % before the character (meaning that it's %%) and also checks that it's not %%%. it then checks for a %, and checks whether the item after doesn't match 20
This means that if anything is identified by the regex, then you should probably reject it.
I agree with dominic's comment on the question. Don't use Regex.
If you want to avoid scanning the string twice, you can just iteratively search for % and then check that it is being followed by 20 and nothing else. (Update: allow a % after to be interpreted as a literal %nnn sequence)
// pseudo code
pos = 0
while (pos = mystring.find(pos, '%'))
{
if mystring[pos+1] = "%" then
pos = pos + 2 // ok, this is a literal, skip ahead
else if mystring.substring(pos,2) != "20"
return false; // string is invalid
end if
}
return true;