Setting up rules for Flex, warning: "rule cannot be matched"

I have these flex rules:
^User-Agent: [^\n]*Firefox {useragent = TFIREFOX; }
^User-Agent: [^\n]*MSIE {useragent = TMSIE; }
^User-Agent: [^\n]*Opera {useragent = TOPERA; }
^User-Agent: [^\n]*Safari {useragent = TSAFARI; }
...
I get the warning "rule cannot be matched" for every rule after the first one. I expected the first rule to match only lines with "Firefox" in them, but I think I'm wrong. How can I repair these rules? I have read the flex man page and I'm still stuck.

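I believe the issue here is that flex treats unescaped whitespace as the end of the pattern. So when it parses your file, it takes "^User-Agent:" as the whole pattern and everything after the space as part of the action. All four rules then share the same pattern, which is why every rule after the first is reported as unmatchable. You can make this work by escaping the space: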
^User-Agent:\ [^\n]*Firefox
^User-Agent:\ [^\n]*MSIE
^User-Agent:\ [^\n]*Opera
^User-Agent:\ [^\n]*Safari
I tested this with flex 2.5.35, and it does what you want.

Related

Having a problem with multiple hyphens in a multistring regex match in Perl

I am downloading a webpage and converting into a string using LWP::Simple. When I copy the results into an editor I find multiple instances of the pattern I'm looking for "data-src-hq".
I eventually want to do something more complex with regex, but I am starting in baby steps so I can learn it properly. I began by just matching "data-src-hq" with the following code:
if($html =~ /data-src-hq/ism)
{
print "match\n";
}
else
{
print "nope\n";
}
My code returns "nope". However, if I modify the pattern search to just "data" or "data-src" I do get a match. The same happens no matter how I use and combine the string and multiline modifier.
My understanding is that a hyphen is not a special character unless it's within brackets, am I missing something simple?
How to fix this?
You are likely getting two outputs, one of match and one of nope. Your code is missing the keyword else:
if($html =~ /data-src-hq/ism)
{
print "match\n";
}
{
print "nope\n";
}
Should be:
if($html =~ /data-src-hq/ism)
{
print "match\n";
}
else {
print "nope\n";
}
Otherwise, your code is fine and works to identify whether data-src-hq exists in $html.
So why does your existing code output nope?
That's because {} is a basic block (see Basic BLOCKs in Perl's documentation). An excerpt from the documentation:
A BLOCK by itself (labeled or not) is semantically equivalent to a
loop that executes once. Thus you can use any of the loop control
statements in it to leave or restart the block. (Note that this is NOT
true in eval{}, sub{}, or contrary to popular belief do{} blocks,
which do NOT count as loops.) The continue block is optional.

A Perl 6 Regex to match a Perl 6 delimited comment

Anyone have a Perl 6 regular expression that will match Perl 6 delimited comments? I would prefer something that's short rather than a full grammar, but I rule out nothing.
As an example of what I am looking for, I want something that can parse the comments in here:
#`{ foo {} bar }
#`« woo woo »
say #`(
This is a (
long )
multiliner()) "You rock!"
#`{{ { And don't forget the tricky repeating delimiters }}
My overall goal is to be able to take a source file and strip the pod and comments and then do interesting things with the code that is left. Stripping line comments and pod is pretty easy, but delimited comments requires additional finesse. I also want this solution to be small and using only Perl 6 core so I can stick it in my dotfiles repo without having external dependencies.
Matching your examples
my %openers-closers = < { } « » ( ) >; # (many more in reality)
my @openers = %openers-closers.keys; # { « ( ...
my ($open, $close); # possibly multiple chars
my token comment { '#`' <&open> <&middle> <&close> }
my token open {
# Store first delimiter char: Slurp as many as are repeated:
( ( @openers ) $0* )
# Store the full (possibly multiple character) delimiters:
{ $open = ~$0; $close = %openers-closers{$0[0]} x $0.chars }
}
my token middle {
:my $nest-level; # for tracking nesting
[
# Continue if nested: or if not at unnested end delimiter:
[ <?{$nest-level}> || <!&close> ]
# Match either a nested delimiter: or a single character:
( $open || $close || . )
# Keep track of nesting:
{ $_ = ~$0.tail; # set topic to latest match in list
$nest-level++ when $open; $nest-level-- when $close }
]*
}
my token close { $close }
.say for $your-examples ~~ m:g / <.&comment> /
displays:
「{ foo {} bar }」
「« woo woo »」
「(
This is a (
long )
multiliner())」
「{{ { And don't forget the tricky repeating delimiters }}」
Hopefully the code is self-explanatory if you know Raku regexes. Please use the comments if you want clarification of any of it.
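For readers who don't know Raku regexes, the same nesting idea can be sketched in Python. This is only an illustration under assumptions: the names `OPENERS`, `match_comment` and `find_comments` are made up for this sketch, the table holds just the three pairs from the examples, and the logic mirrors the tokens above, not the rules Rakudo itself applies.

```python
# Opener -> closer pairs; only the three used in the examples (assumption).
OPENERS = {"{": "}", "«": "»", "(": ")"}


def match_comment(text, start=0):
    """Return the index just past the comment starting at `start`, or None."""
    if not text.startswith("#`", start):
        return None
    i = start + 2
    if i >= len(text) or text[i] not in OPENERS:
        return None
    ch = text[i]
    # Count how many times the opening character is repeated.
    n = 1
    while i + n < len(text) and text[i + n] == ch:
        n += 1
    opener = ch * n
    closer = OPENERS[ch] * n
    i += n
    depth = 1  # we are inside the outermost delimiter pair
    while i < len(text):
        if text.startswith(opener, i):
            depth += 1
            i += n
        elif text.startswith(closer, i):
            depth -= 1
            i += n
            if depth == 0:
                return i
        else:
            i += 1
    return None  # unterminated comment


def find_comments(text):
    """Collect every delimited comment, without the leading '#`'."""
    found = []
    i = text.find("#`")
    while i != -1:
        end = match_comment(text, i)
        if end is None:
            i = text.find("#`", i + 2)
        else:
            found.append(text[i + 2:end])
            i = text.find("#`", end)
    return found
```

Calling `find_comments` on the example text from the question should return the same four strings that the Raku version displays.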
Looking at related Rakudo source code
I wrote the above without referring to Rakudo's source code. (I wanted to see what I came up with without doing so.)
But I've now looked at the source code, which imo would be a more or less mandatory thing to do for anyone trying to do what you're trying to do and serious about understanding how well it might work in the general case.
As a starting point, I was particularly interested in seeing if I could figure out why feeding this code to rakudo (2018.12):
#`{{ {{ And don't forget the tricky repeating delimiters } }}
yields the rather LTA (Less Than Awesome) compiler error:
Starter {{ is immediately followed by a combining codepoint...
This doesn't look directly relevant to your question but I encountered it when trying to understand the nested delimiter rules.
So when I got to this part of my answer I started by searching the Rakudo repo for "immediately followed". That led to a fail-terminator method in the Raku grammar. (Perhaps not of interest to you but it is to me.)
Here's what else I found in the standard grammar that imo is directly related to what you're trying to do, or at least understanding precisely what the code says the rules are about matching comments:
The comment:sym<#`(...)> token that parses these comments. This leads to:
The list of openers. This list should replace the measly 3 opener/closer pairs in my code that just match your examples.
The quibble token. This seems to be a generic "parse 'quoted' (delimited) thing". It leads to:
The babble token. This establishes a "start" and "stop" with this code:
$<B>=[<?before .>]
{
# Work out the delimiters.
my $c := $/;
my @delims := $c.peek_delimiters($c.target, $c.pos);
my $start := @delims[0];
my $stop := @delims[1];
The rule peek_delimiters is not in the Raku grammar file.
A search in the Rakudo repo shows it's not anywhere in Rakudo or Raku.
A search in NQP yields a routine in nqp's grammar (from which the Raku grammar inherits, which is why the peek_delimiters call works and why I looked in NQP when I didn't find it in Rakudo/Raku).
I'll stop at this point to draw a conclusion.
Conclusion
You've got a regex. It might work out as you intend. I don't know.
If you end up investigating the above Rakudo/NQP code and understand it well enough to write a walk through of what quibble, babble, nibble, et al do, or discover a good existing write up (I haven't searched for one yet), please add a comment to this answer linking to it. I'll do likewise. TIA!

flex regex not matching properly

In my tokenizer (.lex) file I want to match the following pattern :
AaBC12/awD41/dfs21 etc...
I've written this rule
[A-Za-z]+[A-Za-z0-9]*[[/]+[A-Za-z][A-Za-z0-9]*]*
{lline = cpflineno;cpflval.str = strdup(cpftext);return K_IDENTIFIER;}
This rule seems correct to me, but if I have an input like this:
TOP/MD1
TOP/MD2
TOP/MD2/D/E
My output is
TOP/MD1
TOP/MD2
TOP/MD2
/D/E
instead of
TOP/MD1
TOP/MD2
TOP/MD2/D/E
Could you tell me where my rule fails ?
What about this:
[A-Za-z]+[A-Za-z0-9]*([/]+[A-Za-z][A-Za-z0-9]*)*
Replaced [] with () where you mean a group.
Note that it will match foo////bar, if you don't want that remove the second + (and the first + for that matter too, it's useless in this case).
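The difference is easy to check outside flex. Below is a small Python sketch, not flex code: the names `broken`, `fixed`, `strict` and `longest_prefix` are made up here, and `broken` spells out how Python's `re` reads the original rule (a character class containing `[` and `/`, then a literal `]` repeated), which appears to be how flex read it too, judging by the output in the question.

```python
import re

# The original rule as a regex engine reads it: "[[/]" is a character class
# holding '[' and '/', and the trailing "]*" is just a literal ']' repeated.
broken = re.compile(r"[A-Za-z]+[A-Za-z0-9]*[\[/]+[A-Za-z][A-Za-z0-9]*\]*")

# The corrected rule: parentheses make the "/segment" part a repeatable group.
fixed = re.compile(r"[A-Za-z]+[A-Za-z0-9]*([/]+[A-Za-z][A-Za-z0-9]*)*")

# Variant without the two '+' signs: exactly one '/' between segments.
strict = re.compile(r"[A-Za-z][A-Za-z0-9]*(/[A-Za-z][A-Za-z0-9]*)*")


def longest_prefix(pattern, text):
    """Return what the pattern consumes from the start of text, like one token."""
    m = pattern.match(text)
    return m.group(0) if m else None


for line in ["TOP/MD1", "TOP/MD2", "TOP/MD2/D/E"]:
    print(line, "->", longest_prefix(broken, line), "|", longest_prefix(fixed, line))
```

With the `broken` pattern the third line stops after `TOP/MD2`, leaving `/D/E` behind, which is the split seen in the question.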

Flex 3 Regular Expression Problem

I've written a URL validator for a project I am working on. For my requirements it works great, except that it breaks when the last part of the URL goes longer than 22 characters. My expression:
/((https?):\/\/)([^\s.]+.)+([^\s.]+)(:\d+\/\S+)/i
It expects input that looks like "http(s)://hostname:port/location".
When I give it the input:
https://demo10:443/111112222233333444445
it works, but if I pass the input
https://demo10:443/1111122222333334444455
it breaks. You can test it out easily at http://ryanswanson.com/regexp/#start. Oddly, I can't reproduce the problem with just the relevant (I would think) part /(:\d+\/\S+)/i. I can have as many characters after the required / and it works great. Any ideas or known bugs?
Edit:
Here is some code for a sample application that demonstrates the problem:
<mx:Application xmlns:mx="http://www.adobe.com/2006/mxml" layout="absolute">
<mx:Script>
<![CDATA[
private function click():void {
var value:String = input.text;
var matches:Array = value.match(/((https?):\/\/)([^\s.]+.)+([^\s.]+)(:\d+\/\S+)/i);
if(matches == null || matches.length < 1 || matches[0] != value) {
area.text = "No Match";
}
else {
area.text = "Match!!!";
}
}
]]>
</mx:Script>
<mx:TextInput x="10" y="10" id="input"/>
<mx:Button x="178" y="10" label="Button" click="click()"/>
<mx:TextArea x="10" y="40" width="233" height="101" id="area"/>
</mx:Application>
I debugged your regular expression on RegexBuddy and apparently it takes millions of steps to find a match. This usually means that something is terribly wrong with the regular expression.
Look at ([^\s.]+.)+([^\s.]+)(:\d+\/\S+).
1- It seems like you're trying to match subdomains too, but it doesn't work as intended since you didn't escape the dot. If you escape it, demo10:443/123 won't match because it'll need at least one dot. Change ([^\s.]+\.)+ to ([^\s.]+\.)* and it'll work.
2- [^\s.]+ is too permissive a character class here: it matches the whole rest of the string, and the engine then has to backtrack from there. You can avoid this by using [^\s:.], which stops at the colon.
This one should work as you want:
https?:\/\/([^\s:.]+\.)*([^\s:.]+):\d+\/\S+
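As a quick sanity check, the corrected expression can be tried in another regex engine. The sketch below uses Python's `re` module instead of ActionScript, so it says nothing about Flex itself; `URL_RX` and `is_valid` are names invented for the example.

```python
import re

# Corrected expression: the dot is escaped, the subdomain group is optional,
# and ':' is excluded from the host character classes so they stop at the port.
URL_RX = re.compile(r"https?://([^\s:.]+\.)*([^\s:.]+):\d+/\S+", re.IGNORECASE)


def is_valid(url):
    """True if the whole string looks like http(s)://hostname:port/location."""
    return URL_RX.fullmatch(url) is not None


for candidate in [
    "https://demo10:443/111112222233333444445",
    "https://demo10:443/1111122222333334444455",
    "http://www.example.com:8080/path",
    "https://demo10/no-port",
]:
    print(candidate, is_valid(candidate))
```

Because the host classes can no longer run past the colon, the length of the location part has no effect on how much backtracking is needed.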
This is a bug, either in Ryan's implementation or within Flex/Flash.
The regular expression syntax used above (less surrounding slashes and flags) matches Python which provides the following output:
# ignore case insensitive flag as it doesn't matter in this case
>>> import re
>>> rx = re.compile(r'((https?):\/\/)([^\s.]+.)+([^\s.]+)(:\d+\/\S+)')
>>> print(rx.match('https://demo10:443/1111122222333334444455').groups())
('https://', 'https', 'demo1', '0', ':443/1111122222333334444455')

Regex - If contains '%', can only contain '%20'

I am wanting to create a regular expression for the following scenario:
If a string contains the percentage character (%) then it can only contain the following: %20, and cannot be preceded by another '%'.
So if there was for instance, %25 it would be rejected. For instance, the following string would be valid:
http://www.test.com/?&Name=My%20Name%20Is%20Vader
But these would fail:
http://www.test.com/?&Name=My%20Name%20Is%20VadersAccountant%25
%%%25
Any help would be greatly appreciated,
Kyle
EDIT:
The scenario in a nutshell is that a link is written to an encoded state and then launched via JavaScript. No decoding works. I tried .net decoding and JS decoding, each having the same result - The results stay encoded when executed.
Doesn't require a %:
/^[^%]*(%20[^%]*)*$/
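A short Python sketch of that pattern, tried against the strings from the question. The question does not say which language is in use, so Python is an assumption here, and the names `ONLY_PERCENT_20` and `only_percent_20` are made up. `fullmatch` stands in for the `^...$` anchors.

```python
import re

# Any run of non-% characters, with %20 allowed between runs.
ONLY_PERCENT_20 = re.compile(r"[^%]*(?:%20[^%]*)*")


def only_percent_20(s):
    """True if every '%' in s is the start of a '%20' sequence."""
    return ONLY_PERCENT_20.fullmatch(s) is not None


print(only_percent_20("http://www.test.com/?&Name=My%20Name%20Is%20Vader"))
print(only_percent_20("http://www.test.com/?&Name=My%20Name%20Is%20VadersAccountant%25"))
print(only_percent_20("%%%25"))
```

Note that this version treats `%%` as invalid; some of the other answers deliberately allow it.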
Which language are you using?
Most languages have a Uri Encoder / Decoder function or class.
I would suggest you decode the string first and then check for valid (or invalid) characters.
i.e. something like /[\w ]/ (empty is a space)
With a regex in the first place you need to respect that www.example.com/index.html?user=admin&pass=%%250 means that the pass really is "%250".
Another solution if look-arounds are not available:
^([^%]|%([013-9a-fA-F][0-9a-fA-F]|2[1-9a-fA-F]))*$
Reject the string if it matches %[^2][^0]
I think that would find what you need
/^([^%]|%%|%20)+$/
Edit: Added case where %% is valid string inside URI
Edit2: And fixed it for case where it should fail :-)
Edit3:
In case you need to use it in an editor (which would explain why you can't use a more programmatic way), you have to correctly escape all special characters. For example, in Vim that regex should look like this:
/^\([^%]\|%%\|%20\)\+$/
Maybe a better approach is to deal with that validation after you decode that string:
string name = HttpUtility.UrlDecode(Request.QueryString["Name"]);
/^([^%]|%20)*$/
This requires a test against the "bad" patterns. If we're allowing %20 - we don't need to make sure it exists.
As others have said before, %% is valid too... and %%25 would be %25
The below regex matches anything that doesn't fit into the above rules
/(?<![^%]%)%(?!(20|%))/
The first brackets check whether there is a % before the character (meaning that it's %%) and also check that it's not %%%. The regex then matches a % and checks that what follows is neither 20 nor another %.
This means that if anything is identified by the regex, then you should probably reject it.
I agree with dominic's comment on the question. Don't use Regex.
If you want to avoid scanning the string twice, you can just iteratively search for % and then check that it is being followed by 20 and nothing else. (Update: allow a % after to be interpreted as a literal %nnn sequence)
// pseudo code
pos = 0
while ((pos = mystring.find(pos, '%')) != NOT_FOUND)
{
    if mystring[pos+1] = "%" then
        pos = pos + 2 // ok, this is a literal, skip ahead
    else if mystring.substring(pos+1,2) != "20" then
        return false; // string is invalid
    else
        pos = pos + 3 // valid %20, continue after it
    end if
}
return true;
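For anyone who wants something runnable, the scan could look like this in Python. It is a sketch under assumptions: the question does not name a language, and the function name `is_valid_percent_usage` is invented for the example. It follows the same rules as the pseudo code: `%%` is a literal and is skipped, `%20` is accepted, and any other `%` makes the string invalid.

```python
def is_valid_percent_usage(s):
    """Scan once; accept only '%%' (literal percent) and '%20' after a '%'."""
    pos = s.find("%")
    while pos != -1:
        if s.startswith("%%", pos):
            pos += 2  # literal percent, skip both characters
        elif s.startswith("%20", pos):
            pos += 3  # the one allowed escape
        else:
            return False  # any other use of '%' is rejected
        pos = s.find("%", pos)
    return True


print(is_valid_percent_usage("http://www.test.com/?&Name=My%20Name%20Is%20Vader"))
print(is_valid_percent_usage("http://www.test.com/?&Name=VadersAccountant%25"))
print(is_valid_percent_usage("%%%25"))
```

Using `startswith` with an offset avoids index errors when a `%` is the last character of the string.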