Regexp in lex. Why does flex behave this way - regex

Consider a simple integer digit identifying expression like this:
[0-9]+ printf("Integer");
Now if I give 123 as input, it prints Integer, fair enough. But if I give s123 as input, it prints sInteger. The unmatched s is printed by the default ECHO rule, and that's fine with me. But why is Integer also printed? Shouldn't lex just return s? My input is considered as a whole string, right? I mean, isn't s123 treated as one complete input? As soon as s is encountered, which does not match [0-9]+, it should just echo the unmatched input s123 by default, so why sInteger?

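Lex does not treat s123 as a single unit; it scans the input for tokens. The s matches no rule, so the default rule echoes it, and scanning then resumes at 123, which matches [0-9]+. If you want to match only lines that consist entirely of digits, you should try ^[0-9]+$.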
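As a rough illustration of why s123 produces sInteger, here is a small Python sketch that imitates a scanner with one rule plus the default echo rule. It is not how flex is implemented, and the function name scan is hypothetical; it only shows the "try a rule at the current position, otherwise echo one character" loop.

```python
import re

# The single rule from the question: [0-9]+ printf("Integer");
INTEGER = re.compile(r"[0-9]+")

def scan(text):
    """Imitate a lex-style scanner: one rule plus the default ECHO rule."""
    out = []
    pos = 0
    while pos < len(text):
        m = INTEGER.match(text, pos)   # try the rule at the current position
        if m:
            out.append("Integer")      # the rule's action
            pos = m.end()              # resume right after the matched token
        else:
            out.append(text[pos])      # default rule: echo one unmatched character
            pos += 1
    return "".join(out)

print(scan("123"))   # Integer
print(scan("s123"))  # sInteger
```

The input is never matched "as a whole"; each unmatched character is echoed on its own and the rule gets another chance at the next position.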

Related

Need regex expression with multiple conditions

I need regex with following conditions
It should accept maximum of 5 digits then upto 3 decimal places
it can be negative
it can be zero
it can be only numbers (max. upto 5 digit place)
it can be null
I have tried the following, but it does not fulfil all the conditions:
@"^([\-\+]?)\d{0,5}(.[0-9]{1,3})?)$"
E.g. maximum value can hold is from -99999.999 to 99999.999
Use this regex:
^[-+]?\d{0,5}(\.[0-9]{1,3})?$
I only made two changes here. First, you don't need to escape any characters inside a character class normally, except for opening and closing brackets, or possibly backslash itself. Hence, we can use [-+] to capture an initial plus or minus. Second, you need to escape the dot in your regex, to tell the engine that you want to match a literal dot.
However, I would probably phrase this regex as follows:
^[-+]?\d{1,5}(\.[0-9]{1,3})?$
This will match one to five digits, optionally followed by a decimal point and one to three digits.
Note that we want to capture things like:
0.123
But not
.123
i.e. we don't want to capture a leading decimal point should it not be prefixed by at least one number.
Demo here:
Regex101
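For a quick local check of the stricter pattern, a minimal Python sketch follows. The helper name is_valid and the sample values are assumptions made for illustration, and [0-9] is used in place of \d so that only ASCII digits are accepted in Python.

```python
import re

# Stricter pattern from the answer: optional sign, 1-5 digits,
# then optionally a dot followed by 1-3 digits.
NUMBER = re.compile(r"^[-+]?[0-9]{1,5}(\.[0-9]{1,3})?$")

def is_valid(s):
    """Return True if s is accepted by the pattern."""
    return NUMBER.match(s) is not None

for s in ["0", "0.123", "-99999.999", "+12.5", ".123", "123456", "1.2345", ""]:
    print(repr(s), is_valid(s))
```

With {1,5} instead of {0,5}, both the empty string and a bare leading decimal point such as .123 are rejected.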
I assume you're doing this in C# given the notation. Here's a little code you can use to test your expression, with two corrections:
You have to escape the dot, otherwise it means "any character". So, \. instead of .
There was an extraneous close parenthesis that prevented the expression from compiling
C#:
var expr = @"^([\-\+]?)\d{0,5}(\.[0-9]{1,3})?$";
var re = new Regex(expr);
string[] samples = {
"",
"0",
"1.1",
"1.12",
"1.123",
"12.3",
"12.34",
"12.345",
"123.4",
"12345.123",
".1",
".1234"
};
foreach(var s in samples) {
Console.WriteLine("Testing [{0}]: {1}", s, re.IsMatch(s) ? "PASS" : "FAIL");
}
Results:
Testing []: PASS
Testing [0]: PASS
Testing [1.1]: PASS
Testing [1.12]: PASS
Testing [1.123]: PASS
Testing [12.3]: PASS
Testing [12.34]: PASS
Testing [12.345]: PASS
Testing [123.4]: PASS
Testing [12345.123]: PASS
Testing [.1]: PASS
Testing [.1234]: FAIL
It should accept maximum of 5 digits
[0-9]{1,5}
then upto 3 decimal places
[0-9]{1,5}(\.[0-9]{1,3})?
it can be negative
[-]?[0-9]{1,5}(\.[0-9]{1,3})?
it can be zero
Already covered.
it can be only numbers (max. upto 5 digit place)
Already covered. 'Up to 5 digit place' contradicts your first rule, which allows 5.3.
it can be null
Not covered. I strongly suggest you remove this requirement. Even if you mean 'empty', as I sincerely hope you do, you should detect that case separately and beforehand, as you will certainly have to handle it differently.
Your regular expression contains ^ and $. I don't know why. There is nothing about start of line or end of line in the rules you specified. It also allows a leading +, which again isn't specified in your rules.

regex cant be equal to 0

I am trying to come up with a regular expression that will accept an input as long as it doesn't equal 0. To clarify, I mean only 0. An input that contains a 0 would still match, for instance 408.
This is what I have so far:
^[^0]
Use this:
^(?!0$).*
Essentially, it checks that the start of the string is not immediately followed by a 0 and then the end of the string. The only string for which that is the case is "0", so all others will be matched.
However, if you're controlling the validation yourself, it would be easier to just forgo regex and check that the string is not equal to "0".
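Both options can be compared side by side. The following Python sketch is only an illustration under the assumption that the input is a single line; the helper name accepts is hypothetical.

```python
import re

# Negative lookahead: fail only when the whole input is exactly "0".
NOT_ZERO = re.compile(r"^(?!0$).*")

def accepts(s):
    """Return True unless s is exactly the string "0"."""
    return NOT_ZERO.match(s) is not None

for s in ["0", "408", "00", "0.5", ""]:
    # The plain comparison gives the same answer without any regex.
    print(repr(s), accepts(s), s != "0")
```

For these samples the regex and the plain string comparison agree, which is why the comparison is the simpler choice when you control the validation code.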
see this demo https://regex101.com/r/vS6vT3/1
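/^[^0]{1}.*$/gm
Note that this one rejects any input that begins with 0 (for example 05), not just 0 itself. To reject only 0, use a negative lookahead: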
/^(?!0{1}$).*$/gm

Difference between ? and * in regular expressions - match same input?

I am not able to understand the practical difference between ? and * in regular expressions. I know that ? means to check if previous character/group is present 0 or 1 times and * means to check if the previous character/group is present 0 or more times.
But this code
while(<>) {
chomp($_);
if(/hello?/) {
print "metch $_ \n";
}
else {
print "naot metch $_ \n";
}
}
gives the same output for both hello? and hello*. The external file that is given to this Perl program contains
hello
helloooo
hell
And the output is
metch hello
metch helloooo
metch hell
for both hello? and hello*. I am not able to understand the exact difference between ? and *
In Perl (and unlike Java), the m//-match operator is not anchored by default.
As such, all of the input is trivially matched by both /hello?/ and /hello*/. That is, these will match any string that contains "hell" anywhere (as both quantifiers make the "o" optional).
Compare with /^hello?$/ and /^hello*$/, respectively. Since these employ anchors the former will not match "helloo" (as at most one "o" is allowed) while the latter will.
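The effect of the anchors can be checked quickly. This is a small Python sketch rather than Perl; re.search is used because, like Perl's m//, it looks for the pattern anywhere in the string. The results dictionary is just an illustrative name.

```python
import re

lines = ["hello", "helloooo", "hell"]

results = {}
for line in lines:
    results[line] = (
        bool(re.search(r"hello?", line)),    # unanchored, ? quantifier
        bool(re.search(r"hello*", line)),    # unanchored, * quantifier
        bool(re.search(r"^hello?$", line)),  # anchored, at most one "o"
        bool(re.search(r"^hello*$", line)),  # anchored, any number of "o"
    )
    print(line, results[line])
```

Only the anchored ? pattern rejects helloooo; every other combination matches all three lines.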
Under Regexp Quote-like Operators:
m/PATTERN/ searches [anywhere in] a string for a pattern match, and in scalar context returns true if it succeeds, false if it fails.
What is confusing you is that, without anchors like ^ and $ a regex pattern match checks only whether the pattern appears anywhere in the target string.
If you add something to the pattern after the hello, like
if (/hello?, Ashwin/) { ... }
Then the strings
hello, Ashwin
and
hell, Ashwin
will match, but
helloooo, Ashwin
will not, because there are too many o characters between hell and the comma ,.
However, if you use a star * instead, like
if (/hello*, Ashwin/) { ... }
then all three strings will match.
? means the last item is optional. * means it is optional and may also be repeated any number of times.
i.e.
hello? matches hell, hello
hello* matches hell, hello, helloo, hellooo, ....
But not using either ^ or $ means these matches can occur anywhere in the string
Here's an example I came up with that makes it quite clear:
What if you wanted to only match up to tens of people and your data was like below:
2 people. 20 people. 200 people. 2000 people.
Only ? would be useful in that case, whereas * would incorrectly capture larger numbers.
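The answer does not give the patterns it has in mind, so the two below are assumptions chosen to illustrate the point: one digit followed by an optional second digit, versus one digit followed by any number of further digits. A minimal Python sketch:

```python
import re

data = "2 people. 20 people. 200 people. 2000 people."

# Hypothetical patterns: \b keeps a match from starting in the middle of a number.
up_to_tens = re.findall(r"\b[0-9][0-9]? people", data)  # one or two digits
any_size = re.findall(r"\b[0-9][0-9]* people", data)    # one or more digits

print(up_to_tens)  # ['2 people', '20 people']
print(any_size)    # ['2 people', '20 people', '200 people', '2000 people']
```

Without the \b, the ? version would also find "00 people" inside "200 people", so the word boundary matters as much as the quantifier here.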

How do I specify a regex of certain length where I want to disallow characters instead of allowing them?

I need a regular expression which will validate a string to have length 7 and doesn't contain vowels, number 0 and number 1.
I know about character classes like [a-z] but it seems a pain to have to specify every possibility that way: [2-9~!@#$%^&*()b-df-hj-np-t...]
For example:
If I pass a String June2013 - it should fail because length of the string is 8 and it contains 2 vowels and number 0 and 1.
If I pass a String XYZ2003 - it should fail because it contains 0.
If I pass a String XYZ2223 - it should pass.
Thanks in advance!
So that would be something like this:
^[^aeiouAEIOU01]{7}$
The ^$ anchors ensure there's nothing in there but what you specify, the character class [^...] means any character except those listed and the {7} means exactly seven of them.
That's following the English definition of vowel, other cultures may have a different idea as to what constitutes voweliness.
Based on your test data, the results are:
pax> echo 'June2013' | egrep '^[^aeiouAEIOU01]{7}$'
pax> echo 'XYZ2003' | egrep '^[^aeiouAEIOU01]{7}$'
pax> echo 'XYZ2223' | egrep '^[^aeiouAEIOU01]{7}$'
XYZ2223
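The same check can be sketched in Python, using the three test strings from the question; the constant name PATTERN is only illustrative.

```python
import re

# Exactly seven characters, none of which is a vowel, 0 or 1.
PATTERN = re.compile(r"^[^aeiouAEIOU01]{7}$")

for s in ["June2013", "XYZ2003", "XYZ2223"]:
    print(s, bool(PATTERN.match(s)))
```

Note that a negated class accepts anything not listed, including spaces and punctuation, so a string like "B C D F" would also pass.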
This is the briefest way to express it:
(?i)^[^aeiou01]{7}$
The term (?i) means "ignore case", which obviates typing both upper and lower vowels.

RegEx Lookaround issue

I am using Powershell 2.0. I have file names like my_file_name_01012013_111546.xls. I am trying to get my_file_name.xls. I have tried:
.*(?=_.{8}_.{6})
which returns my_file_name. However, when I try
.*(?=_.{8}_.{6}).{3}
it returns my_file_name_01.
I can't figure out how to get the extension (which can be any 3 characters). The time/date part will always be _ 8 characters _ 6 characters.
I've looked at a ton of examples and tried a bunch of things, but no luck.
If you just want to find the name and extension, you probably want something like this: ^(.*)_[0-9]{8}_[0-9]{6}(\..{3})$
my_file_name will be in backreference 1 and .xls in backreference 2.
If you want to remove everything else and return the answer, you want to substitute the "numbers" with nothing: 'my_file_name_01012013_111546.xls' -replace '_[0-9]{8}_[0-9]{6}', ''. You can't simply pull two bits (name and extension) of the string out as one match - regex patterns match contiguous chunks only.
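Both approaches, capturing the two pieces and deleting the middle, can be tried outside PowerShell as well. The following is a Python sketch of the same two patterns, not PowerShell code; the variable names are illustrative.

```python
import re

name = "my_file_name_01012013_111546.xls"

# Approach 1: capture the base name and the extension as two groups.
m = re.match(r"^(.*)_[0-9]{8}_[0-9]{6}(\..{3})$", name)
base, ext = m.group(1), m.group(2)
print(base + ext)   # my_file_name.xls

# Approach 2: delete the date/time part and keep everything else.
stripped = re.sub(r"_[0-9]{8}_[0-9]{6}", "", name)
print(stripped)     # my_file_name.xls
```

The second approach is shorter because it only has to describe the part to remove, not the parts to keep.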
Try this (not tested); it should work for any 'my_file_name' length, any number of digits, and any kind of extension.
"my_file_name_01012013_111546.xls" -replace '(?<=[\D_]*)(_[\d_]*)(\..*)','$2'
non regex solution:
$a = "my_file_name_01012013_111546.xls"
$a.replace( ($a.substring( ($a.LastIndexOf('.') - 16 ) , 16 )),"")
The original regex you specified returns the longest match that is followed by 16 characters of the form _, 8 characters, _, 6 characters. (A plain (?=.{16}) is not equivalent, because it does not require the underscores.)
Once you've changed it, it returns that same match plus the next 3 characters. This is why you're getting this result.
The approach described by Inductiveload is probably better in case you can use backreferences. I'd use the following regex: (.*)[_\d]{16}\.(.*) Otherwise, I'd do it in two separate stages
get the initial part
get the extension
The reason you get my_file_name_01 when you add that is because lookaheads are zero-width. This means that they do not consume characters in the string.
As you stated, .*(?=_.{8}_.{6}) matches my_file_name because that string is followed by something matching _.{8}_.{6}. However, once that match is found, you've only consumed my_file_name, so the addition of .{3} will then consume the next 3 characters, namely _01.
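The zero-width behaviour is easy to observe directly. Here is a brief Python sketch using the two patterns from the question; Python's lookahead works the same way for this case, though the question itself is about PowerShell.

```python
import re

name = "my_file_name_01012013_111546.xls"

# The lookahead checks what follows but consumes nothing.
first = re.match(r".*(?=_.{8}_.{6})", name).group()
print(first)    # my_file_name

# .{3} therefore starts consuming right where the lookahead began.
second = re.match(r".*(?=_.{8}_.{6}).{3}", name).group()
print(second)   # my_file_name_01
```

After the lookahead succeeds, the match position is still just after my_file_name, so the next three characters consumed are _01 rather than the extension.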
As for a regex that would fit your needs, others have posted viable alternatives.