Emacs align-regexp working with "=" - regex

Example code:
f x
| "s" == x = 1
| otherwise = 0
I can see the regexp as "match the equals sign when surrounded by whitespace characters". However, \s-+=\s-+ doesn't work (\s-+ is the pattern for 1+ whitespace) because it ends up inserting an extra space before the equals sign. I need a pattern that says "match empty string when there is whitespace here", but not sure how to do this?

This works for me:
C-u M-x align-regexp RET \(\s-+\)=\s- RET RET RET n
Note the '+' inside the parens, the default has '*'

Related

Character not at begining of line; not followed or preceded by character

I'm trying to isolate a " character when (simultaneously):
it's not in the beginning of the line
it's not followed by the character ";"
it's not preceded by the character ";"
E.g.:
Line: "Best Before - NO MATCH
Line: Best Before"; - NO MATCH
Line: ;"Best "Before - NO MATCH
Line: Best "Before - MATCH
My best solution is (?<![;])([^^])(")(?![;]) but it's not working correctly.
I also tried (?<![;])(")(?![;]), but it's only partial (missing the "not at the beginning" part)
I don't understand why I'm spelling the "AND not at the beginning" wrong.
Where am I missing it?
If you want to allow partial matches, you can extend the lookbehind with an alternation not asserting the start of the string to the left.
The semi colon [;] does not have to be between square brackets.
(?<!;|^)"(?!;)
Regex demo
if you want to match the " when there is no occurrence of '" to the left and right, and a infinite quantifier in a lookbehind assertion is allowed:
(?<!^.*;(?=").*|^)"(?!;|.*;")
Regex demo
In notepad++ you can use
^.*(?:;"|";).*$(*SKIP)(*F)|(?<!^)"
Regex demo
You can use the fact that not preceded by ; means that it's also not the first character on the line to simplify things
[^;]"(?:[^;]|$)
This gives you
Match a character that's not a ; (so there must be a character and thus the next character can't be the start of the line)
Match a "
Match a character that's not a ; or the end of the line
I know you are asking for a regex solution, but, almost always, strings can also be filtered using string methods in whatever language you are working in.
For the sake of completeness, to show that regex is not your only available tool here, here is a short javascript using the string methods:
myString.charAt()
myString.includes()
Working Example:
const checkLine = (line) => {
switch (true) {
// DOUBLE QUOTES AT THE BEGINNING
case(line.charAt(0) === '"') :
return console.log(line, '// NO MATCH');
// DOUBLE QUOTES IMMEDIATELY FOLLOWED BY SEMI-COLON
case(line.includes('";')) :
return console.log(line, '// NO MATCH');
// DOUBLE QUOTES IMMEDIATELY PRECEDED BY SEMI-COLON
case(line.includes(';"')) :
return console.log(line, '// NO MATCH');
default:
return console.log(line, '// MATCH');
}
}
checkLine('"Best Before');
checkLine('Best Before";');
checkLine(';"Best "Before');
checkLine('Best "Before');
Further Reading:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/charAt
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/includes

is groovy regex (slightly broken)?

println "p(cat || cats, n)" ==~ /^p\(.+||.+,sn\)$/
println "" ==~ /^p\(.+||.+,sn\)$/
why does the 2nd line return true? Is this a bug?
| is a special character that means "OR" and needs to be escaped to obtain a literal |. The second regex returns true because || matches the empty string (between the two "OR")
Note there is no "s" after the comma in the first string but a space.

(Vim regex) Following by anything except bracket character

Test string:
best.string_a = true;
best.string_b + bad.string_c;
best.string_d ();
best.string_e );
I want to catch string that after '.' and followed by anything except '('. My expression:
\.\#<=[_a-z]\+\(\s*[^(]\)\#=
I want :
string_a
string_b
string_c
string_e
But it doesn't work and result :
string_a
string_b
string_c
string_d
string_e
I am new to vim regex and i dont know why :(
Make this \.\#<=\<[_a-z]\+\>\(\s*(\)\#!
This matches:
\.\#<= Assure a dot is in front of the match followed by
\<[_a-z]\+\> A word containing only lowercase or '_' chars
\(\s*(\)\#! not followed by (any amount of spaces in front of a '(')
this would work for your needs too:
\.\zs[_a-z]\+\>\ze\s*[^( ]

Non-greedy regular expression match for multicharacter delimiters in awk

Consider the string "AB 1 BA 2 AB 3 BA". How can I match the content between "AB" and "BA" in a non-greedy fashion (in awk)?
I have tried the following:
awk '
BEGIN {
str="AB 1 BA 2 AB 3 BA"
regex="AB([^B][^A]|B[^A]|[^B]A)*BA"
if (match(str,regex))
print substr(str,RSTART,RLENGTH)
}'
with no output. I believe the reason for no match is that there is an odd number of characters between "AB" and "BA". If I replace str with "AB 11 BA 22 AB 33 BA" the regex seems to work..
Merge your two negated character classes and remove the [^A] from the second alternation:
regex = "AB([^AB]|B|[^B]A)*BA"
This regex fails on the string ABABA, though - not sure if that is a problem.
Explanation:
AB # Match AB
( # Group 1 (could also be non-capturing)
[^AB] # Match any character except A or B
| # or
B # Match B
| # or
[^B]A # Match any character except B, then A
)* # Repeat as needed
BA # Match BA
Since the only way to match an A in the alternation is by matching a character except B before it, we can safely use the simple B as one of the alternatives.
The other answer didn't really answer: how to match non-greedily?
Looks like it can't be done in (G)AWK. The manual says this:
awk (and POSIX) regular expressions always match the leftmost, longest
sequence of input characters that can match.
https://www.gnu.org/software/gawk/manual/gawk.html#Leftmost-Longest
And the whole manual doesn't contain the words "greedy" nor "lazy". It mentions Extended Regular Expressions, but for greedy matching you'd need Perl-Compatible Regular Expressions. So… no, can't be done.
For general expressions, I'm using this as a non-greedy match:
function smatch(s, r) {
if (match(s, r)) {
m = RSTART
do {
n = RLENGTH
} while (match(substr(s, m, n - 1), r))
RSTART = m
RLENGTH = n
return RSTART
} else return 0
}
smatch behaves like match, returning:
the position in s where the regular expression r occurs, or 0 if it does not. The variables RSTART and RLENGTH are set to the position and length of the matched string.

How to validate a string to have only certain letters by perl and regex

I am looking for a perl regex which will validate a string containing only the letters ACGT. For example "AACGGGTTA" should be valid while "AAYYGGTTA" should be invalid, since the second string has "YY" which is not one of A,C,G,T letters. I have the following code, but it validates both the above strings
if($userinput =~/[A|C|G|T]/i)
{
$validEntry = 1;
print "Valid\n";
}
Thanks
Use a character class, and make sure you check the whole string by using the start of string token, \A, and end of string token, \z.
You should also use * or + to indicate how many characters you want to match -- * means "zero or more" and + means "one or more."
Thus, the regex below is saying "between the start and the end of the (case insensitive) string, there should be one or more of the following characters only: a, c, g, t"
if($userinput =~ /\A[acgt]+\z/i)
{
$validEntry = 1;
print "Valid\n";
}
Using the character-counting tr operator:
if( $userinput !~ tr/ACGT//c )
{
$validEntry = 1;
print "Valid\n";
}
tr/characterset// counts how many characters in the string are in characterset; with the /c flag, it counts how many are not in the characterset. Using !~ instead of =~ negates the result, so it will be true if there are no characters not in characterset or false if there are characters not in characterset.
Your character class [A|C|G|T] contains |. | does not stand for alternation in a character class, it only stands for itself. Therefore, the character class would include the | character, which is not what you want.
Your pattern is not anchored. The pattern /[ACGT]+/ would match any string that contains one or more of any of those characters. Instead, you need to anchor your pattern, so that only strings that contain just those characters from beginning to end are matched.
$ can match a newline. To avoid that, use \z to anchor at the end. \A anchors at the beginning (although it doesn't make a difference whether you use that or ^ in this case, using \A provides a nice symmetry.
So, you check should be written:
if ($userinput =~ /\A [ACGT]+ \z/ix)
{
$validEntry = 1;
print "Valid\n";
}