re.VERBOSE with embedded space fails to compile - regex

import re
re.compile(r"(?: *)", flags=re.VERBOSE)
raises re.error: nothing to repeat at position 4. Any of the following compile fine:
re.compile(r"(?:\ *)", flags=re.VERBOSE)
re.compile(r"(?:[ ]*)", flags=re.VERBOSE)
re.compile(r"(?: *)")
The docs for VERBOSE say "Whitespace within the pattern is ignored, except ... within tokens like *?, (?: or (?P<...>." but it seems not to be honoring the "(?:" part. Is this a library bug, or am I just not getting my head around what the docs mean?
I can reproduce this on either:
Python 3.9.13 / macOS 12.6.1 (Monterey)
Python 3.9.2 / Debian 11.5 (bullseye)

Two details of the documentation matter here:
First, "ignored, except" does not promise that the resulting pattern is valid. It only means that whitespace in those positions is taken into account instead of being skipped.
Second, the wording is "within tokens". As the documentation also says (immediately after your citation): "For example, (? : and * ? are not allowed." Notice that the whitespace there appears inside the tokens, not after them. In your pattern the space follows the complete token (?:, so it is ignored like any other whitespace.
If you run re.compile(r"(?:*)", flags=re.VERBOSE) (i.e. whitespace removed), you will get the same error (though now at position 3).
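A minimal sketch that checks the four patterns from the question plus the whitespace-free variant. The helper name compiles is made up for this illustration; only re.compile, re.error and re.VERBOSE come from the standard library.

```python
import re

def compiles(pattern, flags=0):
    """Return True if the pattern compiles, False if re.error is raised."""
    try:
        re.compile(pattern, flags)
        return True
    except re.error:
        return False

# The space after the complete token "(?:" is dropped in VERBOSE mode,
# so the parser sees "(?:*)" and has nothing to repeat.
verbose_plain_space = compiles(r"(?: *)", re.VERBOSE)
no_space_at_all = compiles(r"(?:*)", re.VERBOSE)

# An escaped space or a character class survives VERBOSE mode.
verbose_escaped = compiles(r"(?:\ *)", re.VERBOSE)
verbose_char_class = compiles(r"(?:[ ]*)", re.VERBOSE)

# Without VERBOSE the space is an ordinary literal.
non_verbose = compiles(r"(?: *)")

print(verbose_plain_space, no_space_at_all)
print(verbose_escaped, verbose_char_class, non_verbose)
```

The two failing cases fail for the same reason, which is the point of the answer: VERBOSE removes the space before the quantifier is parsed.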

Regex: Why is there a difference between \p{Katakana} and \x{30A0}-\x{30FF}?

I found that ー, ゠ and ・ are not matched by \p{Katakana} but are matched by the range \x{30A0}-\x{30FF}.
See https://regex101.com/r/PZzTLm/1 and http://www.rikai.com/library/kanjitables/kanji_codes.unicode.shtml
I can't find anything on this. Does anyone have a source that explains why these characters are not included? The problem is not unique to \p{Katakana}. \p{Hiragana} and others have similar issues.
Within this block, \p{Katakana} covers \x{30A1}-\x{30FA}\x{30FD}-\x{30FF} rather than the full \x{30A0}-\x{30FF} range, so \x{30A0}, \x{30FB} and \x{30FC} are excluded.
It is hard to see why these chars were left out, because the \p{Block=Katakana} Unicode block property class matches all chars in the \x{30A0}-\x{30FF} range.
You can combine the two, [\p{Katakana}\p{Block=Katakana}], to match all the chars you expect.
If you use the ECMAScript regex flavor, the equivalent is
\p{scx=Katakana}
See the regex demo. The scx prefix means all indicated script extensions are included:
Scx set contains multiple explicit Script values; Script(cp) is implicit
and
For example, U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK is shared across the Hiragana and Katakana scripts, but is not used in other scripts, so it is assigned an scx set value of {Hira Kana}.
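Python's standard re module has no \p{...} support, so the following is only a sketch that emulates the two classes with the explicit ranges quoted in the answer. The names script_like and block are made up for this illustration, and script_like covers only the part of \p{Katakana} that falls inside this block.

```python
import re
import unicodedata

# Explicit ranges taken from the answer; these stand in for \p{Katakana}
# (restricted to this block) and \p{Block=Katakana}.
script_like = re.compile(r"[\u30A1-\u30FA\u30FD-\u30FF]")
block = re.compile(r"[\u30A0-\u30FF]")

# Code points that are in the block but not in the script-like class.
excluded = [
    cp for cp in range(0x30A0, 0x3100)
    if block.fullmatch(chr(cp)) and not script_like.fullmatch(chr(cp))
]

for cp in excluded:
    print(f"U+{cp:04X}", unicodedata.name(chr(cp)))
```

The loop lists exactly the three code points named in the answer: U+30A0, U+30FB and U+30FC.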

commands like 'stack exec — yesod --help'

In the Yesod book
https://www.yesodweb.com/book/scaffolding-and-the-site-template
commands are given like
'stack exec — yesod --help'
and
'stack exec — yesod devel'
but what is that '—' symbol (I can only copy / paste it), and what is its meaning?
It is called an em dash. In the Yesod book it is unclear to me whether the em dash was used on purpose. Some programs (like Microsoft Word) automatically convert two hyphens (--) into an em dash. I think that is what happened here, because a post on GitHub uses the command stack exec -- yesod devel. That would also be more consistent with typical command-line usage: single and double hyphens have specific, standard meanings in command-line utilities.
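If you want to confirm what a pasted character is, a small sketch like this can help. It assumes the character copied from the book is U+2014, and the variable names are made up for this illustration.

```python
import unicodedata

# The command as it might be copied from the book; \u2014 is written as an
# escape so the character cannot be confused with a hyphen.
printed = "stack exec \u2014 yesod devel"

dash = printed.split()[2]
dash_name = unicodedata.name(dash)

# Replace the em dash with the two hyphens it most likely stood for.
typed = printed.replace("\u2014", "--")

print(dash_name)
print(typed)
```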

ctags and R regex

I'm trying to use ctags with R. Using this answer I've added
--langdef=R
--langmap=r:.R.r
--regex-R=/^[ \t]*"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<-[ \t]function/\1/f,Functions/
--regex-R=/^"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<-[ \t][^f][^u][^n][^c][^t][^i][^o][^n]/\1/g,GlobalVars/
--regex-R=/[ \t]"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<-[ \t][^f][^u][^n][^c][^t][^i][^o][^n]/\1/v,FunctionVariables
to my .ctags file. However when trying to generate tags with the following R file
x <- 1
foo <- function () {
y <- 2
return(y)
}
only the function foo is recognized. I also want to generate tags for variables (i.e. for x and y in my code above). Should I change the regex in the ctags file? Thanks in advance.
Those patterns don't appear to be correct: a variable is only identified if the value assigned to it is at least as long as "function" (8 characters) and none of its first 8 characters equals the character at the same position in "function". E.g.:
x <- aaaaaaaa
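To see this behaviour outside ctags, here is a sketch that runs the question's GlobalVars pattern through Python's re module. This is an approximation: ctags uses its own regex engine, and the helper name tagged_name is made up for this illustration.

```python
import re

# GlobalVars pattern copied from the question's .ctags file.
ORIGINAL_GLOBAL_VAR = re.compile(
    r'^"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<-[ \t]'
    r'[^f][^u][^n][^c][^t][^i][^o][^n]'
)

def tagged_name(line):
    """Return the variable name the pattern would tag, or None."""
    match = ORIGINAL_GLOBAL_VAR.search(line)
    return match.group(1) if match else None

print(tagged_name("x <- 1"))               # too short, not tagged
print(tagged_name("x <- aaaaaaaa"))        # 8 characters, tagged
print(tagged_name("foo <- function () {")) # starts with "f", not tagged
```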
The following ctags configuration should work properly:
--langdef=R
--langmap=R:.R.r
--regex-R=/^[ \t]*"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<-[ \t]function[ \t]*\(/\1/f,Functions/
--regex-R=/^"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<-[ \t][^\(]+$/\1/g,GlobalVars/
--regex-R=/[ \t]"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<-[ \t][^\(]+$/\1/v,FunctionVariables/
The idea here is that a function must be followed by a parenthesis while the variable declarations do not. Given the limitations of a regex parser this parenthesis must appear in the same line as the keyword "function" for this configuration to work.
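A quick way to try these three patterns against sample lines is to run them through Python's re module, as in this sketch. It assumes Python's engine treats these patterns the same way ctags does; the classify helper is made up for this illustration and returns only the first kind that matches, whereas ctags applies every pattern to every line.

```python
import re

# Patterns copied from the configuration in this answer.
PATTERNS = [
    ("f", re.compile(r'^[ \t]*"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<-[ \t]function[ \t]*\(')),
    ("g", re.compile(r'^"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<-[ \t][^\(]+$')),
    ("v", re.compile(r'[ \t]"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<-[ \t][^\(]+$')),
]

def classify(line):
    """Return (kind, name) for the first matching pattern, or None."""
    for kind, pattern in PATTERNS:
        match = pattern.search(line)
        if match:
            return kind, match.group(1)
    return None

for line in ["x <- 1", "foo <- function () {", "    y <- 2", "z <- c(1, 2)"]:
    print(repr(line), classify(line))
```

The last sample line is not tagged at all, because its right-hand side contains a parenthesis.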
Update: R support will be included in Universal Ctags, so give it a try and report bugs and missing features.
The problem with the original regex is that it does not capture variable declarations / assignments whose right-hand side is shorter than 8 characters (since the regex requires 8 characters that are NOT f-u-n-c-t-i-o-n in that specific order).
The problem with the regex updated by Vitor is that it does not capture variable declarations / assignments that include parentheses, a situation which is quite common in R.
A further, overlooked issue is that neither of them captures objects superassigned with <<- (only local assignment with <- is covered).
I checked the Universal Ctags repo: PCRE support is planned (as also raised in Issue 519) and a commented-out pcre flag exists in the configuration file, but unfortunately ctags does not yet support PCRE-style positive/negative lookahead or lookbehind expressions. Once that support arrives, things will be much easier.
My solution, first of all, takes into account superassignment by "<{1,2}-" and also the fact that the right side of assignment can include:
either a sequence of 8 characters which are not f-u-n-c-t-i-o-n followed by any or no characters (.*)
OR a sequence of at most 7 any characters (.{1,7})
My proposed regex patterns are as such:
--langdef=R
--langmap=R:.R.r
--regex-R=/^[ \t]*"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<-[ \t]function[ \t]*\(/\1/f,Functions/
--regex-R=/^"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<{1,2}-[ \t]([^f][^u][^n][^c][^t][^i][^o][^n].*|.{1,7}$)/\1/g,GlobalVars/
--regex-R=/[ \t]"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<{1,2}-[ \t]([^f][^u][^n][^c][^t][^i][^o][^n].*|.{1,7}$)/\1/v,FunctionVariables/
My tests show that these capture many more objects than the previous ones.
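The GlobalVars pattern from this answer can be exercised with Python's re module, as in the sketch here. It assumes Python's engine handles the pattern the same way ctags does, and the helper name global_var is made up for this illustration.

```python
import re

# GlobalVars pattern copied from this answer: "<{1,2}-" accepts both
# <- and <<-, and the alternation accepts short right-hand sides.
GLOBAL_VAR = re.compile(
    r'^"?([.A-Za-z][.A-Za-z0-9_]*)"?[ \t]*<{1,2}-[ \t]'
    r'([^f][^u][^n][^c][^t][^i][^o][^n].*|.{1,7}$)'
)

def global_var(line):
    """Return the tagged variable name, or None if the line is skipped."""
    match = GLOBAL_VAR.search(line)
    return match.group(1) if match else None

for line in ["x <- 1", "x <<- 1", "z <- c(1, 2, 3)", "foo <- function () {"]:
    print(repr(line), global_var(line))
```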
Edit:
Even these do not capture the variables that are declared as arguments to functions. Since ctags regex makes a non-greedy search and non-capturing groups do not work, the result is still limited. But we can at least capture the first argument of each function and tag it with an additional kind, a (Arguments):
--regex-R=/.+function[ ]*\([ \t]*(,*([^,= \(\)]+)[,= ]*)/\2/a,Arguments/
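As a rough check of what the Arguments pattern extracts, this sketch applies it with Python's re module. Python's engine may differ from the one ctags uses, and the helper name first_argument is made up for this illustration.

```python
import re

# Arguments pattern copied from the line above; group 2 is what ctags
# would emit through \2.
ARGUMENT = re.compile(r'.+function[ ]*\([ \t]*(,*([^,= \(\)]+)[,= ]*)')

def first_argument(line):
    """Return the first argument name of a function definition, or None."""
    match = ARGUMENT.search(line)
    return match.group(2) if match else None

print(first_argument("foo <- function(a, b) {"))
print(first_argument("bar <- function (x = 1, y) {"))
print(first_argument("baz <- function () {"))
```

Only the first argument is captured in each case; b and y are not tagged.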

Log parsing rules in Jenkins

I'm using Jenkins log parser plugin to extract and display the build log.
The rule file looks like,
# Compiler Error
error /(?i) error:/
# Compiler Warning
warning /(?i) warning:/
Everything works fine, but for some reason, at the end of "Parsed Output Console", I see this message:
NOTE: Some bad parsing rules have been found:
Bad parsing rule: , Error:1
Bad parsing rule: , Error:1
I'm sure this is a trivial issue, but I'm not able to figure it out at the moment.
Please help :)
EDIT:
Based on Kobi's answer and having looked into the "Parsing rules files", I fixed it this way (a single space after colon). This worked perfectly as expected.
# Compiler Error
error /(?i)error: /
# Compiler Warning
warning /(?i)warning: /
The Log Parser Plugin does not support spaces in your pattern.
This can be clearly seen in their source code:
final String ruleParts[] = parsingRule.split("\\s");
String regexp = ruleParts[1];
They should probably have used .split("\\s", 2).
As an alternative, you can use \s, \b, or an escape sequence - \u0020.
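The effect of the unlimited split can be reproduced in Python. This is an analogue of the Java call, not the plugin's code: re.split with maxsplit=1 plays the role of Java's .split("\\s", 2), and the variable names are made up for this illustration.

```python
import re

rule = "error /(?i) error:/"

# Splitting on every whitespace character cuts the pattern in two,
# so index 1 holds only the part before the embedded space.
unlimited = re.split(r"\s", rule)

# Limiting the split to one cut keeps the whole pattern together.
limited = re.split(r"\s", rule, maxsplit=1)

# Writing the space as \s avoids the problem even with an unlimited split.
escaped = re.split(r"\s", r"error /(?i)\serror:/")

print(unlimited)
print(limited)
print(escaped)
```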
I had tried no spaces in the pattern, but that did not work. It turns out that the parsing rules file does not support empty lines. Once I removed the empty lines, I no longer got this "Bad parsing rule: , Error:1". I think that since the line is empty, it doesn't echo any rule after the first colon. It would have been nice if the line number of the problem had been reported.
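Since the plugin does not report line numbers, a tiny helper can locate the empty lines before you upload a rules file. This is a hypothetical pre-check sketch, not part of the plugin; the function name blank_line_numbers and the sample text are made up.

```python
def blank_line_numbers(rules_text):
    """Return the 1-based numbers of lines that are empty or whitespace-only."""
    return [
        number
        for number, line in enumerate(rules_text.splitlines(), start=1)
        if not line.strip()
    ]

sample_rules = (
    "# Compiler Error\n"
    "error /(?i)error:/\n"
    "\n"
    "# Compiler Warning\n"
    "warning /(?i)warning:/\n"
)

print(blank_line_numbers(sample_rules))
```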

StackOverflowError with Checkstyle 4.4 RegExp check

Hello,
Background:
I'm using Checkstyle 4.4.2 with a RegExp checker module to detect when the file name in our Java source headers does not match the name of the class or interface in which they reside. This can happen when a developer copies a header from one class to another and does not modify the "File:" tag.
The regular expression used in the RegExp checker has been through many incarnations and (though it is possibly overkill at this point) looks like this:
File: (\w+)\.java\n(?:.*\n)*?(?:[\w|\s]*?(?: class | interface )\1)
The basic form of files I am checking (though greatly simplified) looks like this
/*
*
* Copyright 2009
* ...
* File: Bar.java
* ...
*/
package foo
...
import ..
...
/**
* ...
*/
public class Bar
{...}
The Problem:
When no match is found (i.e. when a header containing "File: Bar.java" is copied into file Bat.java), I receive a StackOverflowError on very long files (my test case is ~1300 lines).
I have experimented with several visual regular expression testers and can see that in the non-matching case, once the regex engine passes the line containing the class or interface name, it starts searching again on the next line and does some backtracking, which probably causes the StackOverflowError.
The Question:
How can I prevent the StackOverflowError by modifying the regular expression?
Is there some way to modify my regular expression so that in the non-matching case (i.e. when a header containing "File: Bar.java" is copied into file Bat.java) the matching stops once it examines the line containing the interface or class name and sees that "\1" does not match the first group?
Alternatively, if that cannot be done, is it possible to minimize the searching and matching that takes place after it examines the line containing the interface or class, thus minimizing processing and (hopefully) avoiding the StackOverflowError?
Try
File: (\w+)\.java\n.*^[\w \t]+(?:class|interface) \1
in dot-matches-all mode. Rationale:
[\w\s] (the | doesn't belong there) matches anything, including line breaks. This results in a lot of backtracking back up into the lines that the previous part of the regex had matched.
If you let the greedy dot gobble up everything up to the end of the file (quick) and then backtrack until you find a line that starts with words or spaces/tabs (but no newlines) and then class or interface and \1, then that doesn't require as much stack space.
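The suggested pattern can be tried in Python with the DOTALL and MULTILINE flags (MULTILINE is needed so that ^ matches at line starts). This sketch only shows what the pattern accepts and rejects; Python's regex engine is not Java's, so it says nothing about stack usage in Checkstyle. The sample sources are made up.

```python
import re

# Suggested pattern: greedy dot in dot-matches-all mode, then backtrack
# to a line that declares a class or interface named like the header.
HEADER_CHECK = re.compile(
    r"File: (\w+)\.java\n.*^[\w \t]+(?:class|interface) \1",
    re.DOTALL | re.MULTILINE,
)

GOOD_SOURCE = """/*
 * Copyright 2009
 * File: Bar.java
 */
package foo;

public class Bar
{
}
"""

# Same header copied into a file that declares a different class.
BAD_SOURCE = GOOD_SOURCE.replace("public class Bar", "public class Bat")

print(HEADER_CHECK.search(GOOD_SOURCE) is not None)
print(HEADER_CHECK.search(BAD_SOURCE) is not None)
```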
A different, and probably even better solution would be to split the problem into parts.
First match the File: (\w+)\.java part. Then do a second search with ^[\w \t]+(?:class|interface) plus the \1 match from the first search on the same file.
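A minimal sketch of this two-step approach, written in Python rather than as a Checkstyle module. The function name header_matches_type is made up, and the trailing \b is an addition of this sketch (so that Bar does not also accept Barn), not part of the suggestion.

```python
import re

def header_matches_type(source):
    """Check that the name in the 'File:' header is declared in the source."""
    # Step 1: pull the name out of the header.
    header = re.search(r"File: (\w+)\.java", source)
    if header is None:
        return False
    # Step 2: look for a class or interface line that uses that name.
    name = re.escape(header.group(1))
    declaration = re.compile(
        rf"^[\w \t]+(?:class|interface) {name}\b", re.MULTILINE
    )
    return declaration.search(source) is not None

good = "/*\n * File: Bar.java\n */\npackage foo;\n\npublic class Bar\n{\n}\n"
bad = good.replace("public class Bar", "public class Bat")

print(header_matches_type(good))
print(header_matches_type(bad))
```

Because each search is simple and independent, neither step needs to backtrack across the whole file.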
Follow up:
I plugged in Tim Pietzcher's suggestion above, and his greedy solution did indeed fail faster and without a StackOverflowError when no match was found. However, in the positive case, the StackOverflowError still occurred.
I took a look at the source code of RegexpCheck.java. The class's pattern is constructed in multiline mode, so that the expressions ^ and $ match just after or just before, respectively, a line terminator or the end of the input sequence. It then reads the entire class file into a string and does a recursive search for the pattern (see findMatch()). That is undoubtedly the source of the StackOverflowError.
In the end I didn't get it to work (and gave up). Since Maven 2 released the maven-checkstyle-plugin-2.4/Checkstyle 5.0 about 6 weeks ago, we've decided to upgrade our tools. This may not solve the StackOverflowError problem, but it will give me something else to work on until someone decides that we need to pursue this again.