Is Groovy regex (slightly broken)?

println "p(cat || cats, n)" ==~ /^p\(.+||.+,sn\)$/
println "" ==~ /^p\(.+||.+,sn\)$/
Why does the 2nd line return true? Is this a bug?

| is a special character that means "OR" and must be escaped as \| to match a literal |. The second test returns true because || leaves an empty alternative between the two "OR"s, and that empty alternative matches the empty string.
Note also that the first string has a space, not an "s", after the comma, so the pattern probably wants \s there.
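A minimal sketch of the same behaviour, written with Python's re module rather than Groovy (an assumption made purely for illustration). re.fullmatch stands in for Groovy's ==~, which also requires the whole string to match. The variable names are hypothetical, and the \s in the "fixed" pattern is a guess at what the asker intended.

```python
import re

# Unescaped pipes: three alternatives, and the middle one is empty.
broken = r'^p\(.+||.+,sn\)$'
# Escaped pipes match a literal "||"; \s (assumed intent) matches the space.
fixed = r'^p\(.+\|\|.+,\sn\)$'

text = "p(cat || cats, n)"

broken_matches_empty = re.fullmatch(broken, "") is not None
broken_matches_text = re.fullmatch(broken, text) is not None
fixed_matches_empty = re.fullmatch(fixed, "") is not None
fixed_matches_text = re.fullmatch(fixed, text) is not None

print(broken_matches_empty, broken_matches_text)  # True True
print(fixed_matches_empty, fixed_matches_text)    # False True
```

The broken pattern accepts the first string only because its first alternative, ^p\(.+, can swallow everything; the part after the pipes is never consulted.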

Related

match everything but a given string and do not match single characters from that string

Let's start with the following input.
Input = 'blue, blueblue, b l u e'
I want to match everything that is not the string 'blue'. Note that blueblue should not match, but single characters should (even if present in match string).
If I replace the matches with an empty string, the result should be:
Result = 'blueblueblue'
I have tried with [^\bblue\b]+
but this matches the last four single characters 'b', 'l','u','e'
Another solution:
(?<=blue)(?:(?!blue).)+(?=blue|$)|^(?:(?!blue).)+(?=blue|$)
If your regex engine supports \K, then we can try:
/blue\K|.*?(?=blue|$)/gm
This pattern says to match:
blue          match "blue"
\K            but then forget that match
|             OR
.*?           match anything else until reaching
(?=blue|$)    the next "blue" or the end of the string
Edit:
In JavaScript, which does not support \K, we can use a replacement callback instead:
var input = "blue, blueblue, b l u e";
var output = input.replace(/blue|.*?(?=blue|$)/g, (x) => x != "blue" ? "" : "blue");
console.log(output);
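The same callback idea can be sketched in Python, whose re module also lacks \K. The function name keep_only is hypothetical. The pattern tries the word first and otherwise lazily consumes text up to the next occurrence or the end of the string; the callback keeps the word and drops everything else.

```python
import re

def keep_only(word: str, text: str) -> str:
    """Delete everything except whole occurrences of `word`."""
    w = re.escape(word)
    # Either the word itself, or the shortest run up to the next word / end.
    pattern = re.compile(w + r'|.*?(?=' + w + r'|$)')
    # Keep matches that are the word, replace all other matches with ''.
    return pattern.sub(lambda m: m.group() if m.group() == word else '', text)

print(keep_only("blue", "blue, blueblue, b l u e"))  # blueblueblue
```

As written, . does not match newlines, so multi-line input would need the re.DOTALL flag.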

Why ++ becomes -+-+-+- : string.gsub "strange" behavior

Why ++ becomes -+-+-+- ?
I'd like to collapse doubled operator signs in a string. How should I proceed?
String = "++"
print (String ) -- -> ++
String = string.gsub( String, "++", "+")
print (String ) -- -> + ok
String = string.gsub( String, "--", "+")
print (String ) -- -> +++ ?
String = string.gsub( String, "+-", "-")
print (String ) -- -> -+-+-+- ??
String = string.gsub( String, "-+", "-")
print (String ) -- -> -+-+-+- ??? ;-)
The core problem is that gsub operates on patterns (Lua's minimal regular expressions) and your pattern strings contain unescaped magic characters. However, even knowing that, I found myself surprised by your results.
It's easier to see what gsub is doing if we change the replacement string:
string.gsub('+', '--', '|') => |+|
string.gsub('+++', '--', '|') => |+|+|+|
- means "0 or more occurrences of the preceding atom". Unlike *, it's non-greedy, matching the fewest characters possible.
I just tested it, and when nothing follows the - in the pattern, "fewest characters possible" means 0 characters. For instance, my intuition about this:
string.gsub('aaa','a-', '|')
is that the expression a- would match each a and replace it with '|', resulting in '|||'. In fact, it matches the 0-length gaps before and after each character, resulting in '|a|a|a|'.
It doesn't matter which atom precedes -; as the last item in the pattern it always matches the smallest length, 0:
string.gsub('aaa','x-', '|') => |a|a|a|
string.gsub('aaa','a-', '|') => |a|a|a|
string.gsub('aaa','?-', '|') => |a|a|a|
string.gsub('aaa','--', '|') => |a|a|a|
You can see that last one is your case and explains your results. Your next result is the exact same thing:
string.gsub('+++','+-','|') => |+|+|+|
Your final result is more straightforward:
string.gsub('-+-+-+-','-+','|') => |+|+|+|
In this case, you're matching "1 or more occurrences of the atom -", so you're just replacing the - characters, just as you'd expect.
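In Lua itself the fix is to escape the magic characters with %, for example the pattern "%+%+" for a literal "++". The sketch below shows the same idea of treating the signs as literals, but in Python rather than Lua. It assumes the goal is the usual sign rule implied by the four replacements in the question (an odd number of minus signs gives "-", otherwise "+"). The function name collapse_signs is hypothetical.

```python
import re

def collapse_signs(expr: str) -> str:
    """Collapse each run of '+' and '-' into a single sign."""
    def repl(match: re.Match) -> str:
        # Odd number of minus signs -> '-', even (including zero) -> '+'.
        return '-' if match.group().count('-') % 2 else '+'
    # Inside a character class '+' is literal, and a trailing '-' is literal too.
    return re.sub(r'[+-]+', repl, expr)

print(collapse_signs("++"))    # +
print(collapse_signs("1--2"))  # 1+2
print(collapse_signs("3+-4"))  # 3-4
```

Handling a whole run in one pass avoids chaining several replacements whose results feed into each other.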

Groovy regex. Matching beginning of the line

I am a little puzzled by Groovy regex behavior.
"dog" == /dog/ - returns true
"dog" == /^dog/ - returns false
My understanding is that ^ matches the start of the line, so the second expression should return true as well.
What I am actually trying to do is to replace "#" at beginning of the line using
line = line.replace(/^#/, '')
but "#" does not get removed
In Groovy, there are many ways of declaring Strings, e.g.:
println( 'foo' ) // regular string
println( '''foo''' ) // multiline string
println( "foo" ) // templatable string
println( """foo""" ) // multiline templatable string
println( /foo/ ) // slashy string
println( $/foo/$ ) // dollar slashy string (also multiline)
All of the above are Strings, so:
"dog" == /dog/ - returns true
because both sides are Strings with the same content, so they are equal.
If you want to do Pattern matching, you need the ==~ operator:
"dog" ==~ /^dog/
which returns true. As for the replacement, replace treats its first argument as literal text rather than a regex, so it looks for the two characters "^#"; use replaceAll instead:
def line = '#Foo'
line.replaceAll( /^#/, '' ) == 'Foo'
Returns true
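The two distinctions above (string equality versus pattern matching, and literal versus regex replacement) have close analogues in Python, sketched here for illustration only. This is not Groovy code: re.fullmatch plays the role of ==~, str.replace is literal like Java's String.replace, and re.sub is regex-based like replaceAll. The variable names are hypothetical.

```python
import re

line = "#Foo"

# Plain string comparison: "^dog" is just four characters, not a pattern.
equal_as_text = ("dog" == "^dog")

# Pattern match over the whole string, like Groovy's ==~.
full_match = re.fullmatch(r"^dog", "dog") is not None

# Literal replacement: searches for the text "^#", which is not present.
literal = line.replace("^#", "")

# Regex replacement: "#" anchored at the start of the string.
by_regex = re.sub(r"^#", "", line)

print(equal_as_text, full_match)  # False True
print(literal, by_regex)          # #Foo Foo
```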

How to validate a string to have only certain letters by perl and regex

I am looking for a perl regex which will validate a string containing only the letters ACGT. For example "AACGGGTTA" should be valid while "AAYYGGTTA" should be invalid, since the second string has "YY" which is not one of A,C,G,T letters. I have the following code, but it validates both the above strings
if($userinput =~/[A|C|G|T]/i)
{
$validEntry = 1;
print "Valid\n";
}
Thanks
Use a character class, and make sure you check the whole string by using the start of string token, \A, and end of string token, \z.
You should also use * or + to indicate how many characters you want to match -- * means "zero or more" and + means "one or more."
Thus, the regex below is saying "between the start and the end of the (case insensitive) string, there should be one or more of the following characters only: a, c, g, t"
if($userinput =~ /\A[acgt]+\z/i)
{
$validEntry = 1;
print "Valid\n";
}
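For comparison, a minimal Python sketch of the same anchored check (Python is used here only for illustration; the name is_dna is hypothetical). re.fullmatch anchors at both ends, so no explicit \A or \z is needed. The last line reproduces the original problem: an unanchored class with pipes finds a match in the invalid string.

```python
import re

def is_dna(seq: str) -> bool:
    """True if seq consists of one or more of A, C, G, T (any case)."""
    return re.fullmatch(r"[ACGT]+", seq, flags=re.IGNORECASE) is not None

print(is_dna("AACGGGTTA"))  # True
print(is_dna("AAYYGGTTA"))  # False

# The original pattern: unanchored, so a single matching letter is enough.
unanchored_hit = re.search(r"[A|C|G|T]", "AAYYGGTTA", flags=re.IGNORECASE) is not None
print(unanchored_hit)       # True
```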
Using the character-counting tr operator:
if( $userinput !~ tr/ACGT//c )
{
$validEntry = 1;
print "Valid\n";
}
tr/characterset// counts how many characters in the string are in characterset; with the /c flag, it counts how many are not in the characterset. Using !~ instead of =~ negates the result, so it will be true if there are no characters not in characterset or false if there are characters not in characterset.
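The counting idea can be sketched without any regex. The Python below is an illustration, not a translation of tr, and count_outside is a hypothetical helper. Like tr/ACGT//c it is case-sensitive, and an empty string counts zero offending characters, so it would pass this test even though it fails the + regex.

```python
def count_outside(seq: str, allowed: str = "ACGT") -> int:
    """Count the characters of seq that are not in the allowed set."""
    return sum(1 for ch in seq if ch not in allowed)

for candidate in ("AACGGGTTA", "AAYYGGTTA"):
    # Valid when no character falls outside the allowed set.
    verdict = "Valid" if count_outside(candidate) == 0 else "Invalid"
    print(candidate, verdict)
```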
Your character class [A|C|G|T] contains |. | does not stand for alternation in a character class, it only stands for itself. Therefore, the character class would include the | character, which is not what you want.
Your pattern is not anchored. The pattern /[ACGT]+/ would match any string that contains one or more of any of those characters. Instead, you need to anchor your pattern, so that only strings that contain just those characters from beginning to end are matched.
$ can also match just before a trailing newline. To avoid that, use \z to anchor at the end. \A anchors at the beginning (although it doesn't make a difference whether you use that or ^ in this case, using \A provides a nice symmetry).
So, your check should be written:
if ($userinput =~ /\A [ACGT]+ \z/ix)
{
$validEntry = 1;
print "Valid\n";
}
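The trailing-newline point can be checked with a short Python sketch. Python's $ behaves the same way here, but note the naming difference: Python spells the strict end-of-string anchor \Z, which corresponds to Perl's \z. re.VERBOSE plays the role of the /x flag. The variable names are hypothetical.

```python
import re

with_newline = "ACGT\n"

# $ also matches just before a newline at the very end of the string.
dollar_accepts = re.search(r"^[ACGT]+$", with_newline) is not None

# \Z (Python's spelling of Perl's \z) matches only at the true end.
strict_accepts = re.search(
    r"\A [ACGT]+ \Z", with_newline, flags=re.IGNORECASE | re.VERBOSE
) is not None

print(dollar_accepts, strict_accepts)  # True False
```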

Emacs align-regexp working with "="

Example code:
f x
| "s" == x = 1
| otherwise = 0
I can see the regexp as "match the equals sign when surrounded by whitespace characters". However, \s-+=\s-+ doesn't work (\s-+ is the pattern for 1+ whitespace) because it ends up inserting an extra space before the equals sign. I need a pattern that says "match empty string when there is whitespace here", but not sure how to do this?
This works for me:
C-u M-x align-regexp RET \(\s-+\)=\s- RET RET RET n
Note the '+' inside the parens, the default has '*'
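To see why this pattern picks out the right equals sign, here is a rough Python model of the alignment step. It is a simplified sketch, not how Emacs implements align-regexp, and align_equals is a hypothetical name. Group 1 is the whitespace before a standalone =, and that group is padded so every matched = lands in the same column. The == is skipped because its first = is not followed by whitespace and its second is not preceded by it.

```python
import re

# Python spelling of \(\s-+\)=\s- : whitespace (group 1), '=', whitespace.
ANCHOR = re.compile(r"(\s+)=\s")

def align_equals(lines: list[str]) -> list[str]:
    """Pad the whitespace before each standalone '=' to a common column."""
    matches = [ANCHOR.search(line) for line in lines]
    # Widest text to the left of the whitespace group decides the column.
    column = max((m.start(1) for m in matches if m), default=0)
    aligned = []
    for line, m in zip(lines, matches):
        if m is None:
            aligned.append(line)  # no standalone '=': leave untouched
        else:
            aligned.append(line[:m.start(1)].ljust(column) + " " + line[m.end(1):])
    return aligned

for row in align_equals(["f x", '| "s" == x = 1', "| otherwise = 0"]):
    print(row)
```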