Regex: plus sign not doing what I expect - regex

I have a file with many lines including a string like this: blah blah num=12345; blah blah
I would like to find lines where the number after the equals sign is greater than 1, with no upper limit. (I do not expect a number to ever start with zero.)
I started with this expression that will match any number starting with any digit that's not a 1, and it works fine and I understand it.
grep 'num=[2-9][0-9]*;'
This next expression should, I thought, return any number starting with a 1 that has two or more digits, but I instead get nothing back:
grep 'num=1[0-9]+;'
I thought the above meant: must match num=1, then must match something between 0-9 one or more times, then must match ;. Where am I going wrong?

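By default grep uses basic regular expressions, where a bare + is a literal character. With GNU grep you must escape the + to use it as a quantifier (or use grep -E, where + works unescaped):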
grep 'num=1[0-9]\+;'
For your problem you can use this (for all numbers greater than 1, if I understand correctly):
grep 'num=\([2-9]\|1[0-9]\)[0-9]*;'
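To check the pattern outside grep, here is a minimal sketch using Python's re module, where +, | and () need no backslashes (as with grep -E). The sample lines and the helper name num_greater_than_one are assumptions made for illustration only.

```python
import re

# Same idea as the grep pattern above, in extended syntax:
# a first digit of 2-9, or a 1 followed by at least one more digit.
NUM_GT_ONE = re.compile(r"num=(?:[2-9]|1[0-9])[0-9]*;")


def num_greater_than_one(line: str) -> bool:
    """Return True if the line contains num=N; with N > 1 (no leading zeros assumed)."""
    return NUM_GT_ONE.search(line) is not None


if __name__ == "__main__":
    for sample in ["blah num=12345; blah", "blah num=1; blah", "blah num=7; blah"]:
        print(sample, "->", num_greater_than_one(sample))
```

The pattern relies on the stated assumption that numbers never start with zero.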

Related

Using regular expressions in Unix to find lines with numbers greater than 40000?

So I think the first thing I need to do is make the first digit be greater than 4. If the first digit is found to be greater than 4 and the number is five digits or more, that number should be found. If the digit is 4, then one of the digits following has to be greater than 0. I'm really struggling with how to set this up. I don't know if I have all of the correct conditions, and the technical aspect of writing the regex is confusing me also. Any help is appreciated.
Why are you using regular expressions for this? Iterate over the words in each line, and compare the word to the number 40000
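awk '{
    for (i = 1; i <= NF; i++)
        # only compare words made of digits; $i+0 forces a numeric comparison
        # (a non-numeric word would otherwise be compared as a string)
        if ($i ~ /^[0-9]+$/ && $i+0 > 40000) {
            print
            break
        }
}' file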
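For comparison, the same word-by-word numeric check can be sketched in Python. This is only an illustration of the approach; the function name has_number_above and its default limit are assumptions, not part of the answer above.

```python
def has_number_above(line: str, limit: int = 40000) -> bool:
    """Return True if any whitespace-separated word is an integer greater than limit."""
    for word in line.split():
        # Only plain ASCII digit strings are treated as numbers;
        # words such as "abc" or "12a" are skipped.
        if word.isascii() and word.isdigit() and int(word) > limit:
            return True
    return False


if __name__ == "__main__":
    for sample in ["a 50000 b", "a 40000 b", "zebra 123"]:
        print(sample, "->", has_number_above(sample))
```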
For greater or equal to 40000, use
egrep '[4-9][[:digit:]]{4}|[[:digit:]]{6,}' file
Note the extra brackets: [[:digit:]] is the digit class inside a bracket expression, whereas a bare [:digit:] would match any one of the five characters d, g, i, t and :.
Note further that the {6,} part will also match values less than 40000 if there are zero-padded numbers, such as 012345.
For strictly greater than 40000 the regex gets ugly unless you add a second grep just to remove 40000, so glenn jackman's solution is what I recommend.
5 Digit numbers - regex [4-9][[:digit:]]{4}
6 or more digits - regex [1-9][[:digit:]]{5,} - Checks that the first digit is not zero!
Put it together
[4-9][[:digit:]]{4}|[1-9][[:digit:]]{5,}
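A quick way to try the combined pattern is a small Python sketch. The lookarounds on either side are an addition of mine, not part of the answer: they stop the pattern from matching inside a longer run of digits such as 012345. The name find_large_numbers is hypothetical.

```python
import re

# 5 digits starting with 4-9, or 6+ digits with a non-zero first digit.
# (?<![0-9]) and (?![0-9]) are added here so that a match cannot start
# or end in the middle of a longer digit string.
AT_LEAST_40000 = re.compile(
    r"(?<![0-9])(?:[4-9][0-9]{4}|[1-9][0-9]{5,})(?![0-9])"
)


def find_large_numbers(line: str) -> list:
    """Return every number >= 40000 found in the line, as strings."""
    return AT_LEAST_40000.findall(line)


if __name__ == "__main__":
    print(find_large_numbers("id 40000 and 39999 and 123456"))
```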

Matching numbers greater than 40

I'm trying to match numbers greater than 40. The good point is that all of them have 2 decimal places, so all of them are like: 3.25, 5.89, 999.75 and they don't use any leading zeros (except in the decimal part, which always has 2 digits)...
At first I tried the following code but then I realized this wouldn't match numbers like 100, 1000... even if they are greater than 40.
[4-9][0-9]\.
I don't have to match the decimal part, so don't worry about matching that, just help me to find how to match numbers greater than 40 (up to 9999 would be fine).
Thanks for your help.
This should do the job:
([4-9][0-9]|\d{3,})\.
Check it here:
http://www.regexr.com/3a5v9
Don't use regular expressions for number comparison. If, for example, you're using Javascript:
var aNumber = parseFloat("50");
if (aNumber > 40) {
// yay!
}
If your regex flavour supports negative lookbehind, this matches the numbers from 41 to 9999 without decimals:
\b(?:[1-9][0-9]{2,3}|[5-9][0-9]|4[1-9])(?<!\.\d{1,2})\b
(40\.(?!0[^\d]|00)\d{1,2}|(((4[1-9](?!\d)|[5-9][0-9])(?![\d])|\d*[1-9]\d{2,})(\.\d{1,2})?))
This prevents false positives from leading 0s.
This worked for me.
It tries to match 40 followed by one or two decimals that are not 0 or 00.
It then tries to match 4 followed by 1-9, decimal optional.
If it can't match that, it matches 5-9 followed by 0-9, decimal optional.
It then tries to match any digit, any number of times, followed by 1-9, followed by two or more digits, decimal optional.
If you want to require the decimal, just remove the last question mark.
This will do it:
([4-9][0-9]+|\d{3,})
This will match any number of two or more digits whose first digit is 4 or greater, or any number with three or more digits.
As an example http://www.regexr.com/3a5v0
You can use braces to indicate a minimum and, if desired, maximum number of repetitions to match. So,
([4-9][0-9]|[1-9][0-9]{2,})\.
matches 4-9 followed by exactly one digit, or 1-9 followed by two or more digits. Presumably there's a boundary of some sort at the beginning of this, but it sounds like you have that part worked out. This uses an OR to allow for two possible groups of first digits.
(Most of the other answers are perfect for me -- This is paranoia and a bad idea :)
for use with grep -Po or Perl we could use:
'\b(\d{3,}|[4-9]\d)\.\d\d'
but this would get 40.00 (not greater than 40)
'\b(\d{3,}|[5-9]\d|4[1-9])\.\d\d|\b40\.\d?[1-9]\d?'
Corresponding to:
DDD.DD
| [5-9]D.DD
| 4[1-9].DD
| 40.D[1-9]
| 40.[1-9]D
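The second pattern can be checked against plain numeric comparison with a short Python sketch. It assumes, as the question states, that every number has exactly two decimal places and no leading zeros; the helper name is_greater_than_40 is hypothetical.

```python
import re

# The pattern from above: 3+ digits, or 50-99, or 41-49, each with two
# decimals; or 40 followed by a fractional part that is not all zeros.
GT_40 = re.compile(r"\b(\d{3,}|[5-9]\d|4[1-9])\.\d\d|\b40\.\d?[1-9]\d?")


def is_greater_than_40(text: str) -> bool:
    """Return True if text (a number with two decimals) is greater than 40."""
    return GT_40.search(text) is not None


if __name__ == "__main__":
    for sample in ["3.25", "39.99", "40.00", "40.01", "40.10", "999.75"]:
        print(sample, "->", is_greater_than_40(sample))
```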
In flex(1) you have this code to parse strings and get numbers greater than 40:
pru.l:
%option noyywrap
%%
\+?(0*[4-9][0-9]|0*[1-9][0-9][0-9][0-9]*)(\.[0-9]*)? { printf("Greater than 40: %s\n", yytext); }
\-?[0-9]*(\.[0-9]*)? { printf("Lesser than 40: %s\n", yytext); }
\n |
. ;
%%
int main()
{ yylex(); }
Install flex and compile this file with
make pru
Then run it as:
pru <filein >fileout
or just
pru
This code constructs a deterministic finite automaton from the regular expressions listed and runs the actions listed on the right when it recognizes a value greater than 40. It allows an optional leading sign and leading zeros, and an optional fractional part composed of any number of digits. And it does this with only one assignment and one decision for each character read. You have access to the automaton state table generated by flex (it writes C code for you).
the regex that recognizes numbers greater than 40 (with decimals and leading sign and zeros) is:
\+?(0*[4-9][0-9]|0*[1-9][0-9][0-9][0-9]*)(\.[0-9]*)?
and can be abbreviated as:
\+?(0*[4-9][0-9]|0*[1-9][0-9]{2,})(\.[0-9]*)?
explanation:
\+? matches an optional plus sign.
(...|...) two options:
0* optional arbitrary number of leading zeros.
[4-9][0-9] the numbers 40 to 99
[1-9][0-9]{2,} the numbers 100 and up.
(\.[0-9]*)? optional decimal point followed by an arbitrary number of digits.
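A short Python sketch can confirm that [0-9][0-9][0-9]* (a digit, a digit, then any number of digits) is the same as [0-9]{2,}, so the long and short forms accept the same strings. Note that both forms also accept 40 itself, so strictly they test for 40 or more. The helper name accepts is an assumption for illustration.

```python
import re

LONG_FORM = re.compile(r"\+?(0*[4-9][0-9]|0*[1-9][0-9][0-9][0-9]*)(\.[0-9]*)?")
SHORT_FORM = re.compile(r"\+?(0*[4-9][0-9]|0*[1-9][0-9]{2,})(\.[0-9]*)?")


def accepts(pattern, text: str) -> bool:
    """Return True if the whole text matches the pattern."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for text in ["7", "39", "40", "040", "99.5", "100", "+1000.25"]:
        print(text, accepts(LONG_FORM, text), accepts(SHORT_FORM, text))
```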

Regex - how to make sure a string contain a word and numbers

I need a little help with Regex.
I want the regex to validate the following sentences:
fdsufgdsugfugh PCL 6
dfdagf PCL 11
fdsfds PCL6
fsfs PCL13
kl;klkPCL6
fdsgfdsPCL13
some chars, then PCL and then 6 or a greater number.
How can this be done?
I'd go with something like this:
^(.*)(PCL *)([6-9][0-9]*|[1-5][0-9]+)$
Meaning:
(.*) = some chars
(PCL *) = then PCL with optional whitespaces afterwards
([6-9][0-9]*|[1-5][0-9]+) then 6 or a greater number
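The pattern can be tried against the sample sentences from the question with a minimal Python sketch. The function name is_valid is hypothetical, and the two rejected inputs in the demo are made-up examples.

```python
import re

# Some chars, then PCL with optional spaces, then 6 or a greater number:
# either a first digit of 6-9, or a first digit of 1-5 with more digits after it.
PCL_6_OR_MORE = re.compile(r"^(.*)(PCL *)([6-9][0-9]*|[1-5][0-9]+)$")


def is_valid(sentence: str) -> bool:
    """Return True if the sentence ends with PCL followed by a number >= 6."""
    return PCL_6_OR_MORE.match(sentence) is not None


if __name__ == "__main__":
    samples = [
        "fdsufgdsugfugh PCL 6", "dfdagf PCL 11", "fdsfds PCL6",
        "fsfs PCL13", "kl;klkPCL6", "fdsgfdsPCL13",
        "abc PCL 5", "abc PCL",  # made-up inputs that should be rejected
    ]
    for s in samples:
        print(repr(s), "->", is_valid(s))
```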
This one should suit your needs:
^.*PCL\s*(?:[6-9]|\d{2,})$
In bash:
EXPR='^[a-zA-Z]\+ *PCL *\([6-9]\|[0-9]\{2,\}\)'
Translated:
Line begins with at least one letter (upper or lower case)
Any number of spaces, PCL, any number of spaces
Either a digit between 6 and 9, or a number with at least 2 digits
This expression, used with something like grep "$EXPR" file.txt, will print the valid lines to stdout. The single quotes around the pattern keep the shell from splitting it at the spaces or removing the backslashes.
This worked well for me. Reads logically too according to the way you described the matching
/.+PCL\s*(?:[6-9]|[1-9]\d+)/

How to match this expression with regex?

I have a text with some lines (200+) in this format:
10684 - The jackpot ? discuss Lev 3 --- ? ---
10755 - Garbage Heap ? discuss Lev 5 --- ? ---
I want to retrieve the first number (10684 or 10755) only if the number after "Lev" is greater than 3.
I'm able to get the first number with this regex: ([0-9]+) - but without the 'level' restriction.
How could this be done?
Thanks in advance.
(\d+) - .*?Lev (?:[4-9]|[1-9]\d+)
The first \d+ matches line number as you have done.
The next .*? is a lazy quantifier, which will not consume too many characters. And the following expression will guide it to the right place. (lazy quantifier is usually more efficient)
The second parenthesis, (?:[4-9]|[1-9]\d+), matches either a single-digit number greater than 3 or a number of two or more digits without a leading zero.
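As a rough check, the pattern can be run over the two sample lines from the question in Python. The third line in the demo and the function name ids_with_level_above_3 are made up for illustration.

```python
import re

LEVEL_ABOVE_3 = re.compile(r"(\d+) - .*?Lev (?:[4-9]|[1-9]\d+)")


def ids_with_level_above_3(lines):
    """Return the leading number of every line whose Lev value is above 3."""
    ids = []
    for line in lines:
        m = LEVEL_ABOVE_3.search(line)
        if m:
            ids.append(m.group(1))
    return ids


if __name__ == "__main__":
    sample = [
        "10684 - The jackpot ? discuss Lev 3 --- ? ---",
        "10755 - Garbage Heap ? discuss Lev 5 --- ? ---",
        "10800 - Made-up title ? discuss Lev 12 --- ? ---",  # hypothetical line
    ]
    print(ids_with_level_above_3(sample))
```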
Alright stackoverflow doesn't properly show my image. Take this link : http://regexr.com?36n5l
Regular expressions don't recognize numbers as numbers (only strings). You can do this though:
([0-9]+) - .*Lev (?:[4-9][^0-9]|[1-9][0-9]+)
Basically, we use the alternation operator (|) to accept only a single digit greater than 3 (enforced by checking that the following character is not a digit) or a multi-digit number not beginning with a zero.
In case that level number might be the end of the line, though, you might have to do this:
([0-9]+) - .*Lev (?:[4-9](?:[^0-9]|$)|[1-9][0-9]+)
(I'm assuming whatever regex engine you're using can't handle lookaround assertions. In the future, try to always include what language you're using when you're asking a regex question.)
Ah, I just read your edit that the number is always less than 10. Well, that's much easier then:
([0-9]+) - .*Lev [4-9]
A lookahead is really the best thing because it will leave just the number:
/\d+(?=.*Lev (0*[4-9]|[1-9]\d))/
A bit of Awk trickery:
awk -F '\? +discuss +Lev' '$2+0 > 3 { split($1,a,/ */); print a[1] }' file
In bash use this:
var=">3"
perl -lne 'print $1 if /(\d+) - .*Lev (\d+)/ && $2'"$var"
This is a good solution to be able to pass the condition by parameter.
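The same idea, capturing both numbers and comparing the level numerically with the threshold passed as a parameter, might look like this in Python. It is a sketch only; the names ids_above and min_level are assumptions.

```python
import re

# Capture the leading number and the number after "Lev".
LINE = re.compile(r"(\d+) - .*Lev (\d+)")


def ids_above(lines, min_level):
    """Yield the leading number of each line whose level is greater than min_level."""
    for line in lines:
        m = LINE.search(line)
        # Lines that do not match are skipped entirely.
        if m and int(m.group(2)) > min_level:
            yield m.group(1)


if __name__ == "__main__":
    sample = [
        "10684 - The jackpot ? discuss Lev 3 --- ? ---",
        "10755 - Garbage Heap ? discuss Lev 5 --- ? ---",
    ]
    print(list(ids_above(sample, 3)))
```

Comparing with int() avoids encoding the threshold in the regex, so changing it does not require rewriting the pattern.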

Regex match and grouping

Here's a sample string which I want to run a regex on
101-nocola_conte_-_fuoco_fatuo_(koop_remix)
The first digit in "101" is the disc number and the next 2 digits are the track numbers. How do I match the track numbers and ignore the disc number (first digit)?
Something like
/^\d(\d\d)/
Would match one digit at the start of the string, then capture the following two digits
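In Python, for instance, that pattern could be used as follows. This is a minimal sketch; the function name track_number is hypothetical, and re.match already anchors at the start of the string, so the ^ is left out.

```python
import re


def track_number(name: str):
    """Return the two track digits that follow the disc digit, or None."""
    m = re.match(r"\d(\d\d)", name)
    return m.group(1) if m else None


if __name__ == "__main__":
    print(track_number("101-nocola_conte_-_fuoco_fatuo_(koop_remix)"))
```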
Do you mean that you don't mind what the disc number is, but you want to match, say, track number 01?
In perl you would match it like so: "^[0-9]01.*"
or more simply "^.01.*" - which means that you don't even mind if the first char is not a digit.
^\d(\d\d)
You may need \ in front of the ( depending on which environment you intend to run the regex in (like vi(1)).
Which programming language? For the shell something with egrep will do the job:
echo '101-nocola_conte_-_fuoco_fatuo_(koop_remix)' | egrep -o '^[0-9]{3}' | egrep -o '[0-9]{2}$'