Cross platform inconsistency in haxe Regular Expressions - regex

I'm trying to adapt a Haxe Markdown library ( http://code.google.com/p/mdown/ ) into an official haxelib that works across platforms. I'm running into some weirdness where something works on flash and javascript, but not neko.
See this sample code:
var str = "<p>This is a blockquote</p>";
var out = ~/(^|\n)/g.replace(str, "$1 ");
trace(out);
On Javascript and Flash I get this, as expected:
" <p>This is a blockquote</p>"
On Neko I get this:
" < p > T h i s i s a b l o c k q u o t e < / p > "
I can work around it for now (not use regular expressions) - but can anyone show me at what point this breaking?
Thanks,
Jason
p.s. This might help answer the question: http://haxe.org/doc/cross/regexp#implementation-details

If you use the m flag to convert it into a multiline regex, you can leave out the newline part. That might help.
The relevant part of documentation is right at the beginning of the linked page:
m : multiline matching, ^ and $ represent the beginning and end of a line
As for why your problem is happening, it would appear that Neko's regex library is wrongly simplifying your regex to empty, which will match between every charater. You could put a . at the end of the regex and move the space to the front of your replacement string, which might prevent that bug from occurring, and it should be compatible with all the implementations.

Related

A Perl 6 Regex to match a Perl 6 delimited comment

Anyone have a Perl 6 regular expression that will match Perl 6 delimited comments? I would prefer something that's short rather than a full grammar, but I rule out nothing.
As an example of what I am looking for, I want something that can parse the comments in here:
#`{ foo {} bar }
#`« woo woo »
say #`(
This is a (
long )
multiliner()) "You rock!"
#`{{ { And don't forget the tricky repeating delimiters }}
My overall goal is to be able to take a source file and strip the pod and comments and then do interesting things with the code that is left. Stripping line comments and pod is pretty easy, but delimited comments requires additional finesse. I also want this solution to be small and using only Perl 6 core so I can stick it in my dotfiles repo without having external dependencies.
Matching your examples
my %openers-closers = < { } « » ( ) >; # (many more in reality)
my #openers = %openers-closers.keys; # { « ( ...
my ($open, $close); # possibly multiple chars
my token comment { '#`' <&open> <&middle> <&close> }
my token open {
# Store first delimiter char: Slurp as many as are repeated:
( ( #openers ) $0* )
# Store the full (possibly multiple character) delimiters:
{ $open = ~$0; $close = %openers-closers{$0[0]} x $0.chars }
}
my token middle {
:my $nest-level; # for tracking nesting
[
# Continue if nested: or if not at unnested end delimiter:
[ <?{$nest-level}> || <!&close> ]
# Match either a nested delimiter: or a single character:
( $open || $close || . )
# Keep track of nesting:
{ $_ = ~$0.tail; # set topic to latest match in list
$nest-level++ when $open; $nest-level-- when $close }
]*
}
my token close { $close }
.say for $your-examples ~~ m:g / <.&comment> /
displays:
「{ foo {} bar }」
「« woo woo »」
「(
This is a (
long )
multiliner())」
「{{ { And don't forget the tricky repeating delimiters }}」
Hopefully the code is self-explanatory if you know Raku regexes. Please use the comments if you want clarification of any of it.
Looking at related Rakudo source code
I wrote the above without referring to Rakudo's source code. (I wanted to see what I came up with without doing so.)
But I've now looked at the source code, which imo would be a more or less mandatory thing to do for anyone trying to do what you're trying to do and serious about understanding how well it might work in the general case.
As I starting point, I was particularly interested in seeing if I could figure out why feeding this code to rakudo (2018.12):
#`{{ {{ And don't forget the tricky repeating delimiters } }}
yields the rather LTA (Less Than Awesome) compiler error:
Starter {{ is immediately followed by a combining codepoint...
This doesn't look directly relevant to your question but I encountered it when trying to understand the nested delimiter rules.
So when I got to this part of my answer I started by searching the Rakudo repo for "immediately followed". That led to a fail-terminator method in the Raku grammar. (Perhaps not of interest to you but it is to me.)
Here's what else I found in the standard grammar that imo is directly related to what you're trying to do, or at least understanding precisely what the code says the rules are about matching comments:
The comment:sym<#`(...)> token that parses these comments. This leads to:
The list of openers. This list should replace the measly 3 opener/closer pairs in my code that just match your examples.
The quibble token. This seems to be a generic "parse 'quoted' (delimited) thing". It leads to:
The babble token. This establishes a "start" and "stop" with this code:
$<B>=[<?before .>]
{
# Work out the delimiters.
my $c := $/;
my #delims := $c.peek_delimiters($c.target, $c.pos);
my $start := #delims[0];
my $stop := #delims[1];
The rule peek_delimiters is not in the Raku grammar file.
A search in the Rakudo repo shows it's not anywhere in Rakudo or Raku.
A search in NQP yields a routine in nqp's grammar (from which the Raku grammar inherits, which is why the peek_delimiters call works and why I looked in NQP when I didn't find it in Rakudo/Raku).
I'll stop at this point to draw a conclusion.
Conclusion
You've got a regex. It might work out as you intend. I don't know.
If you end up investigating the above Rakudo/NQP code and understand it well enough to write a walk through of what quibble, babble, nibble, et al do, or discover a good existing write up (I haven't searched for one yet), please add a comment to this answer linking to it. I'll do likewise. TIA!

Type? of text blocks regEx function

I have recently started to use regEx for work and have now found a rather peculiar problem which I apparently can't solve myself...
Problem: I receive data from customers (from all over the world) and have to analyze it. The data this time has some specialties.
e.g. for raw data:
Screw М4х20 , DIN7985 - This is the original text with the problem
Screw M4x20 , DIN7985 - This is manual written text, which gives me
perfect results
If I now try to pick out the dimension "M4x20" with following regEx:
(\b[M]?\d+x\d+\b)
it yields me no results... neither in Excel, nor in websites like regExr:
Regex demo
If I delete the M4x20 and write it a new, I do get results.
I have absolutely no idea where the problem lies, except that it is caused by the M char and the x char - for reference: the rest of the text/letters (a-z) also doesn't work. The numbers are working ok.
Is there some way to analyze it?
Edit:
There is and I just found out: The letters are Cyrillic letters which are not being recognized.
Though they can apparently be changed to latin letters quite easily.
Two chars M and x are part of Cyrillic letters and they are represented in regex as \u041C (M) and \u0445 (x).
Regex demo
VBA code:
Set re = CreateObject("VBScript.RegExp")
re.Global = True
re.Pattern = "\u041C?\d+\u0445\d+"
For Each Match In re.Execute("Screw М4х20 , DIN7985")
Debug.Print (Match)
Next
Output:
М4х20

error: multiple repeat for regex in robot [duplicate]

I'm trying to determine whether a term appears in a string.
Before and after the term must appear a space, and a standard suffix is also allowed.
Example:
term: google
string: "I love google!!! "
result: found
term: dog
string: "I love dogs "
result: found
I'm trying the following code:
regexPart1 = "\s"
regexPart2 = "(?:s|'s|!+|,|.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + term + regexPart2 , re.IGNORECASE)
and get the error:
raise error("multiple repeat")
sre_constants.error: multiple repeat
Update
Real code that fails:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
regexPart1 = r"\s"
regexPart2 = r"(?:s|'s|!+|,|.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + term + regexPart2 , re.IGNORECASE)
On the other hand, the following term passes smoothly (+ instead of ++)
term = 'lg incite" OR author:"http+www.dealitem.com" OR "for sale'
The problem is that, in a non-raw string, \" is ".
You get lucky with all of your other unescaped backslashes—\s is the same as \\s, not s; \( is the same as \\(, not (, and so on. But you should never rely on getting lucky, or assuming that you know the whole list of Python escape sequences by heart.
Either print out your string and escape the backslashes that get lost (bad), escape all of your backslashes (OK), or just use raw strings in the first place (best).
That being said, your regexp as posted won't match some expressions that it should, but it will never raise that "multiple repeat" error. Clearly, your actual code is different from the code you've shown us, and it's impossible to debug code we can't see.
Now that you've shown a real reproducible test case, that's a separate problem.
You're searching for terms that may have special regexp characters in them, like this:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
That p++ in the middle of a regexp means "1 or more of 1 or more of the letter p" (in the others, the same as "1 or more of the letter p") in some regexp languages, "always fail" in others, and "raise an exception" in others. Python's re falls into the last group. In fact, you can test this in isolation:
>>> re.compile('p++')
error: multiple repeat
If you want to put random strings into a regexp, you need to call re.escape on them.
One more problem (thanks to Ωmega):
. in a regexp means "any character". So, ,|.|;|:" (I've just extracted a short fragment of your longer alternation chain) means "a comma, or any character, or a semicolon, or a colon"… which is the same as "any character". You probably wanted to escape the ..
Putting all three fixes together:
term = 'lg incite" OR author:"http++www.dealitem.com" OR "for sale'
regexPart1 = r"\s"
regexPart2 = r"(?:s|'s|!+|,|\.|;|:|\(|\)|\"|\?+)?\s"
p = re.compile(regexPart1 + re.escape(term) + regexPart2 , re.IGNORECASE)
As Ωmega also pointed out in a comment, you don't need to use a chain of alternations if they're all one character long; a character class will do just as well, more concisely and more readably.
And I'm sure there are other ways this could be improved.
The other answer is great, but I would like to point out that using regular expressions to find strings in other strings is not the best way to go about it. In python simply write:
if term in string:
#do whatever
i have an example_str = "i love you c++" when using regex get error multiple repeat Error. The error I'm getting here is because the string contains "++" which is equivalent to the special characters used in the regex. my fix was to use re.escape(example_str ), here is my code.
example_str = "i love you c++"
regex_word = re.search(rf'\b{re.escape(word_filter)}\b', word_en)
Also make sure that your arguments are in the correct order!
I was trying to run a regular expression on some html code. I kept getting the multiple repeat error, even with very simple patterns of just a few letters.
Turns out I had the pattern and the html mixed up. I tried re.findall(html, pattern) instead of re.findall(pattern, html).
A general solution to "multiple repeat" is using re.escape to match the literal pattern.
Example:
>>>> re.compile(re.escape("c++"))
re.compile('c\\+\\+')
However if you want to match a literal word with space before and after try out this example:
>>>> re.findall(rf"\s{re.escape('c++')}\s", "i love c++ you c++")
[' c++ ']

Regexp: Keyword followed by value to extract

I had this question a couple of times before, and I still couldn't find a good answer..
In my current problem, I have a console program output (string) that looks like this:
Number of assemblies processed = 1200
Number of assemblies uninstalled = 1197
Number of failures = 3
Now I want to extract those numbers and to check if there were failures. (That's a gacutil.exe output, btw.) In other words, I want to match any number [0-9]+ in the string that is preceded by 'failures = '.
How would I do that? I want to get the number only. Of course I can match the whole thing like /failures = [0-9]+/ .. and then trim the first characters with length("failures = ") or something like that. The point is, I don't want to do that, it's a lame workaround.
Because it's odd; if my pattern-to-match-but-not-into-output ("failures = ") comes after the thing i want to extract ([0-9]+), there is a way to do it:
pattern(?=expression)
To show the absurdity of this, if the whole file was processed backwards, I could use:
[0-9]+(?= = seruliaf)
... so, is there no forward-way? :T
pattern(?=expression) is a regex positive lookahead and what you are looking for is a regex positive lookbehind that goes like this (?<=expression)pattern but this feature is not supported by all flavors of regex. It depends which language you are using.
more infos at regular-expressions.info for comparison of Lookaround feature scroll down 2/3 on this page.
If your console output does actually look like that throughout, try splitting the string on "=" when the word "failure" is found, then get the last element (or the 2nd element). You did not say what your language is, but any decent language with string splitting capability would do the job. For example
gacutil.exe.... | ruby -F"=" -ane "print $F[-1] if /failure/"

looking for a regular expression to extract all text outputs to user from js file

i have some huge js files and there are some texts/messages/... which are output for a human beeing. the problem is they don't run over the same method.
but i want to find them all to refactor the code.
now i am searching for a regular expression to find those messages.
...his.submit_register = function(){
if(!this.agb_accept.checked) {
out_message("This is a Messge tot the User in English." , "And the Title of the Box. In English as well");
return fals;
}
this.valida...
what i want to find is all the strings which are not source code.
in this case i want as return:
This is a Messge tot the User in
English. And the Title of the Box. In
English as well
i tried something like: /\"(\S+\s{1})+\S\"/, but this wont work ...
thanks for help
It's not possible to parse Javascript source code using regular expressions because Javascript is not a regular language. You can write a regular expression that works most of the time:
/"(.*?)"/
The ? means that the match is not greedy.
Note: this will not correctly handle strings that contain ecaped quotes.
A simple java regex solving your problem (assuming that the message doesn't contain a " character):
Pattern p = Pattern.compile("\"(.+?)\"");
The extraction code :
Matcher m;
for(String line : lines) {
m = p.matcher(line);
while(m.find()) {
System.out.println(m.group(1));
}
}