I'm using regex in Notepad++ (the most recent version at the time of posting, Nov 2016).
I'm working on a problem where I have dozens of text files containing financial information on wage and commission earnings. I need to find all employees that earned a commission.
Files are formatted as such.
emp001smithj20150000095000
is an example of no commission earned last year (position 17, 5 chars = "00000")
emp002jonest20151752545000
is an example of $17525 commission earned last year (position 17, 5 chars = "17525")
I've tried...
^.{16}\b[1-9]{5}\b
The rationale is that I want any non-zero five-digit number, delimited by word boundaries, that starts right after position 16, but no luck. I'm obviously missing something!
^.{16}(?!00000)
Use a negative lookahead to require that the next five characters are not 00000.
You shouldn't use word boundaries here, as your numbers are surrounded by other numbers. You will need a lookahead to check that your 5-digit number doesn't consist of zeros only, so one way would be:
^.{16}(?=0{0,4}[1-9])\d{5}
You shouldn't try to match [1-9]{5}, since that won't match 15000, for example.
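As a quick check outside Notepad++, here is a minimal Python sketch of the lookahead pattern above. The function name and the third record are hypothetical, made up for illustration; the first two records come from the question.

```python
import re

# Pattern from the answer: skip 16 characters, then require five digits
# that are not all zeros (the lookahead needs a 1-9 within those five).
COMMISSION = re.compile(r"^.{16}(?=0{0,4}[1-9])\d{5}")


def earned_commission(record: str) -> bool:
    """Return True when the 5-digit commission field is non-zero."""
    return COMMISSION.match(record) is not None


records = [
    "emp001smithj20150000095000",  # from the question: no commission
    "emp002jonest20151752545000",  # from the question: 17525 commission
    "emp003brownk20151500045000",  # hypothetical record: commission of 15000
]
for record in records:
    print(record, earned_commission(record))
```

The third record is the case that [1-9]{5} would miss, because 15000 contains zeros.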
Hello and thank you in advance,
I buy items that have a variety of human written listings on auction sites and forums.
Often times, the quantity is clear to a person, but extracting it has been a real challenge. I'm using google sheets and REGEXEXTRACT().
I consider myself to be an intermediate regex user, but this has me stumped, so I need an expert.
Here's a few examples, my desired return, and what I'm getting.
| Listing | Desired Return | Actual Return |
| --- | --- | --- |
| Red 1996 Corvette 2x - Matchbox | 2 | 2 |
| 3 x SmartCar, broken 2nd door | 3 | 3 |
| 2nd edition Kindle (x3) | 3 | 3 |
| **1x** 2008 financial crash notice | 1 | 1 |
| Collectors Edition Beannie Baby, item 204/343 | 1 | 4 |
| (6) Nissan window motors (1995-1998 ONLY) | 6 | N/A |
| White chevy F150, 1996 | 1 | 6 |
| Green bowl, cracked (stored in room 2A5) | 1 | 5 |
As I thought through this, I think I can put some reasonable limitations on this logic, but the code is harder.
The quantities will only be a single number 1-9. (perhaps reject all numbers > 9?)
They'll possibly be preceded by or followed by an X or x, with or without a space
The quantity may be next to a special character like * , () or -
It should ignore all 1st, 2nd, 3rd, - 9th style notation
If a number is mixed into a word, like 2A3, it should ignore the whole word
Obviously most descriptions don't have any quantity, so if there's no return or zero, that's fine.
I have something that feels close, and does a reasonable job:
[^a-wy-zA-WY-Z0-9]*([1-4]){1}([^a-wA-w0-9]|$)
It doesn't return anything with the returns marked of 1*, and that's fine. It breaks on the last two, and I've struggled for too long!
Thanks in advance!
You can use
=IFNA(INT(REGEXEXTRACT(REGEXREPLACE(LOWER(A27), "\d{2,}|(x\d)|(\dx)|[^\W\d]+\d\w*|\d+[^\W\d]\w*", "$1$2"), "(\d)")), 1)
Here,
REGEXREPLACE(LOWER(A27), "\d{2,}|(x\d)|(\dx)|[^\W\d]+\d\w*|\d+[^\W\d]\w*", "$1$2") finds and removes chunks of two or more digits, or chunks with a digit and at least one letter, but keeps the sequences where a digit is preceded or followed with x
REGEXEXTRACT(..., "(\d)") extracts the first digit left after the replacement
=IFNA(INT(...), 1) either casts the found digit to integer, or, if there was no match, inserts 1 into the column.
See the long regex demo.
\d{2,} - two or more digits
| - or
(x\d) - Group 1 ($1): x and a digit
| - or
(\dx) - Group 2 ($2): a digit and x
| - or
[^\W\d]+\d\w* - one or more word chars except digits, a digit and then zero or more word chars
| - or
\d+[^\W\d]\w* - one or more digits, a letter or underscore, and then zero or more word chars.
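The same replace-then-extract idea can be sketched in Python to try it against the listings from the question. This is only an approximation of the Sheets formula: Python's re module is not the RE2 engine Google Sheets uses, and the function name and default argument are assumptions made for illustration.

```python
import re

# Same alternation as in the formula: drop multi-digit numbers and
# digit/letter mixes, but keep "x<digit>" and "<digit>x" (groups 1 and 2).
CLEANUP = re.compile(
    r"\d{2,}|(x\d)|(\dx)|[^\W\d]+\d\w*|\d+[^\W\d]\w*", re.ASCII
)


def extract_quantity(listing: str, default: int = 1) -> int:
    """Return the first surviving digit, or `default` when none is left."""
    cleaned = CLEANUP.sub(
        lambda m: (m.group(1) or "") + (m.group(2) or ""), listing.lower()
    )
    digit = re.search(r"\d", cleaned)
    return int(digit.group()) if digit else default


listings = [
    "Red 1996 Corvette 2x - Matchbox",
    "3 x SmartCar, broken 2nd door",
    "2nd edition Kindle (x3)",
    "**1x** 2008 financial crash notice",
    "Collectors Edition Beannie Baby, item 204/343",
    "(6) Nissan window motors (1995-1998 ONLY)",
    "White chevy F150, 1996",
    "Green bowl, cracked (stored in room 2A5)",
]
for listing in listings:
    print(extract_quantity(listing), listing)
```

The replacement callback plays the role of "$1$2": whichever of the two groups matched is written back, and everything else the pattern matched is dropped.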
I'm figuring out a way to validate my input date in Altova MapForce, which is in the format "YYYYMMDD".
I know that to verify the year I can use [0-9]{4}, but I'm having trouble figuring out a way to "restrict" the day range to "01-31" and the month to "01-12". Please note "01" is valid while "1" should be invalid.
Can someone please provide a regex expression to validate this sort of input?
From searching the internet I found one for the day: ([1-9]|[12]\d|3[01]), but this one is valid for the range 1-31. I want 01-31 and so on.
For a month from 1-12 adding a zero for a single digit 1-9:
(?:0[1-9]|1[012])
For a day 1-31 adding a zero for a single digit 1-9:
(?:0[1-9]|[12]\d|3[01])
Putting it all together with 4 digits to match a year (note that \d{4} can also match 0000 and 9999), enclosed in word boundaries \b to prevent a partial match with leading or trailing digits / word characters:
\b\d{4}(?:0[1-9]|1[012])(?:0[1-9]|[12]\d|3[01])\b
A variation, limiting the scope for a year to for example 1900 - 2099
\b(?:19|20)\d{2}(?:0[1-9]|1[012])(?:0[1-9]|[12]\d|3[01])\b
But note that this does not validate the date itself; it can, for example, also match 20210231. To validate a date, use a dedicated date-handling API in the tool or code.
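To illustrate the difference between checking the shape and checking the date, here is a minimal Python sketch (not MapForce; the function name is hypothetical). It uses the shape pattern above with fullmatch instead of \b, then lets the standard datetime module reject impossible dates.

```python
import re
from datetime import datetime

# Shape check only: 4-digit year, month 01-12, day 01-31.
DATE_SHAPE = re.compile(r"\d{4}(?:0[1-9]|1[012])(?:0[1-9]|[12]\d|3[01])", re.ASCII)


def is_valid_yyyymmdd(text: str) -> bool:
    """True when `text` has the YYYYMMDD shape and is a real calendar date."""
    if DATE_SHAPE.fullmatch(text) is None:
        return False
    try:
        # strptime raises ValueError for dates such as February 31st.
        datetime.strptime(text, "%Y%m%d")
    except ValueError:
        return False
    return True


for candidate in ["20210228", "20210231", "2021011", "20211301", "20200229"]:
    print(candidate, is_valid_yyyymmdd(candidate))
```

The regex is kept as a first step because strptime on its own is lenient about missing leading zeros.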
I'm having trouble trying to regex extract the 'positions' from the following types of strings:
6 red players position 5, button 2
earn $50 pos3, up to $1,000
earn $50 pos 2, up to $500
table button 4, before Jan 21
I want to get the number that comes after 'pos' or 'position', and if there's no such keyword, get the last number before the first comma. The position value can be a number between 1 and 100. So 'position' for each of the previous rows would be:
Input text
Desired match (position)
6 red players position 5, button 2
5
earn $50 pos3, up to $1,000
3
earn $50 pos 2, up to $500
2
table button 4, before Jan 21
4
I have a big data set (in BigQuery) populated with basically those 4 types of strings.
I've already searched for this type of problem but found no solution or point to start from.
I've tried .+?(?=,) (link) which extracts everything up to the first comma (,), but then I'm not sure how to go about extracting only the numbers from this.
I've tried (?:position|pos)\s?(\d) (link) which extracts what I want for group 1 (by using non-capturing groups), but doesn't solve the 4th type of string.
I feel like there's a way to combine these two, but I just don't know how to get there yet.
And so, after the two things I've tried, I have two questions:
Is this possible with only regex? If so, how?
What would I need to do in SQL to make my life easier at getting these values?
I'd appreciate the help/guidance with this. Thanks a ton!
You can use
^(?:[^,]*[^0-9,])?(\d+),
See the RE2 regex demo. Details:
^ - start of string
(?:[^,]*[^0-9,])? - an optional sequence of:
[^,]* - zero or more chars other than comma
[^0-9,] - a char other than a digit and comma
(\d+) - Group 1: one or more digits
, - a comma
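A minimal sketch for trying the pattern against the four sample rows, using Python's re module rather than BigQuery (the pattern only uses syntax that RE2 also accepts). The function name is hypothetical. Note that the pattern returns the last number before the first comma, which coincides with the pos/position value in all four samples.

```python
import re

# Optional prefix that must end in a non-digit, non-comma character,
# then the captured digits, then the first comma in the string.
POSITION = re.compile(r"^(?:[^,]*[^0-9,])?(\d+),")


def extract_position(text: str):
    """Return the number right before the first comma, or None."""
    match = POSITION.search(text)
    return int(match.group(1)) if match else None


rows = [
    "6 red players position 5, button 2",
    "earn $50 pos3, up to $1,000",
    "earn $50 pos 2, up to $500",
    "table button 4, before Jan 21",
]
for row in rows:
    print(extract_position(row), row)
```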
If your regex engine supports lookarounds (RE2, which BigQuery uses, does not), use a lookahead for a comma, with a lookbehind requiring the previous char to be a space or a letter to prevent matching the “1” in “$1,000”:
(?<=[ a-z])(\d+)(?=,)
See live demo.
I want to check if a number is 50 or more using a regular expression. This in itself is no problem but the number field has another regex checking the format of the entered number.
The number will be in the continental format: 123.456,78 (a dot between groups of three digits and always a comma with 2 digits at the end)
Examples:
100.000,00
50.000,00
50,00
34,34
etc.
I want to capture numbers which are 50 or more. So from the four examples above the first three should be matched.
I've come up with this rather complicated one and am wondering if there is an easier way to do this.
^(\d{1,3}[.]|[5-9][0-9]|\d{3}|[.]\d{1,3})*[,]\d{2}$
EDIT
I want to match continental numbers here. The numbers have this format due to internal regulations and specify a price.
Example: 1000 EUR would be written as 1.000,00 EUR
50000 as 50.000,00 and so on.
It's a matter of taste, obviously, but using a negative lookahead gives a simple solution.
^(?!([1-4]?\d),)[1-9](\d{1,2})?(\.\d{3})*,\d{2}\b
In words: starting from a boundary ignore all numbers that start with 1 digit OR 2 digits (the first being a 1,2,3 or 4), followed by a comma.
Check on regex101.com
Try:
EDIT ^(.{3,}|[5-9]\d),\d{2}$
It checks if:
there are 3 chars or more before the ,
there are 2 numbers before the , and the first is between 5 and 9
and then a , and 2 numbers
I don't know if it answers your question, as it'll return true for:
aa50,00
1sdf,54
But this assumes that your original string is a number in the format you expect (as it was not a requirement in your question).
EDIT 3
The regex below tests if the number is valid referring to the continental format and if it's equal or greater than 50. See tests here.
Regex: ^((([1-9]\d{0,2}\.)(\d{3}\.){0,}\d{3})|([1-9]\d{2})|([5-9]\d)),\d{2}$
Explanation (d is a number):
([1-9]\d{0,2}\.): either d., dd. or ddd. one time with the first d between 1 and 9.
(\d{3}\.){0,}: ddd. zero or more times
\d{3}: ddd, three digits
These 3 parts combined match any number equal to or greater than 1000, like 1.000, 22.002 or 100.000.000.
([1-9]\d{2}): any number between 100 and 999.
([5-9]\d): a digit between 5 and 9 followed by any digit. Matches anything between 50 and 99.
So it's either the one of the parts above or this one.
Then ,\d{2}$ matches the comma and the two last digits.
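The breakdown above can be checked with a short Python sketch. The pattern is the one from EDIT 3; the function name and the extra sample values beyond the question's four examples are assumptions for illustration.

```python
import re

# Format check plus ">= 50" check in a single pattern (from EDIT 3).
AT_LEAST_50 = re.compile(
    r"^((([1-9]\d{0,2}\.)(\d{3}\.){0,}\d{3})|([1-9]\d{2})|([5-9]\d)),\d{2}$"
)


def is_valid_and_at_least_50(text: str) -> bool:
    """True for well-formed continental numbers that are 50 or more."""
    return AT_LEAST_50.match(text) is not None


samples = ["100.000,00", "50.000,00", "50,00", "34,34", "49,99", "1.000,00", "05,00"]
for sample in samples:
    print(sample, is_valid_and_at_least_50(sample))
```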
I have named all inner groups, for better understanding what part of number is matched by each group. After you understand how it works, change all ?P<..> to ?:.
This one is for any decimal number in the continental format.
^(?P<common_int>(?P<int>(?P<int_start>[1-9]\d{1,2}|[1-9]\d|[1-9])(?P<int_end>\.\d{3})*|0)(?!,)|(?P<dec_int_having_frac>(?P<dec_int>(?P<dec_int_start>[1-9]\d{1,2}|[1-9]\d|[1-9])(?P<dec_int_end>\.\d{3})*,)|0,|,)(?=\d))(?P<frac_from_comma>(?<=,)(?P<frac>(?P<frac_start>\d{3}\.)*(?P<frac_end>\d{1,3})))?$
This one is for the same with the limit number>=50
^(?P<common_int>(?P<int>(?P<int_start>[1-9]\d{1,2}|[1-9]\d|[1-9])(?P<int_end>\.\d{3})+|(?P<int_short>[1-9]\d{2}|[5-9]\d))(?!,)|(?P<dec_int_having_frac>(?P<dec_int>(?P<dec_int_start>[1-9]\d{1,2}|[1-9]\d|[1-9])(?P<dec_int_end>\.\d{3})+,)|(?P<dec_short_int>[1-9]\d{2}|[5-9]\d),)(?=\d))(?P<frac_from_comma>(?<=,)(?P<frac>(?P<frac_start>\d{3}\.)*(?P<frac_end>\d{1,3})))?$
If the integer part is always at most 999.999 and the fractional part always has 2 digits, it will be a bit simpler:
^(?P<dec_int_having_frac>(?P<dec_int>(?P<dec_int_start>[1-9]\d{1,2}|[1-9]\d|[1-9])(?P<dec_int_end>\.\d{3})?,)|(?P<dec_short_int>[1-9]\d{2}|[5-9]\d),)(?=\d)(?P<frac_from_comma>(?<=,)(?P<frac>(?P<frac_end>\d{1,2})))?$
If you can guarantee that the number is correctly formed -- that is, that the regex isn't expected to detect that 5,0.1 is invalid, then there are a limited number of passing cases:
ends with \d{3}
ends with [5-9]\d
contains \d{3},
contains [5-9]\d,
It's not actually necessary to do anything with \.
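The easiest regex is to code for each of these individually:
(\d{3}$|[5-9]\d$|\d{3},|[5-9]\d,)
You could make it more compact by merging some of the cases. The $ has to stay outside a character class, because inside [...] it is a literal dollar sign:
(\d{3}|[5-9]\d)(,|$)
Note that the two "ends with" cases only make sense if the fractional part can be absent. When it is always present, as in the question, they would wrongly accept 34,50, so keep only the two "contains" cases.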
If you need to also validate the format, you will need extra complexity. I would advise against attempting to do both in a single regex.
However, unless you have a very good reason for having to do this with a regex, I recommend against it. Parse the string into a number, and compare it with 50.
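A minimal sketch of the parse-and-compare approach in Python, assuming the input has already been validated as a well-formed continental number. Both function names are hypothetical.

```python
from decimal import Decimal


def continental_to_decimal(text: str) -> Decimal:
    """Convert e.g. '50.000,00' to Decimal('50000.00').

    Assumes `text` is already well-formed: dots as thousands
    separators and a single comma before the decimals.
    """
    return Decimal(text.replace(".", "").replace(",", "."))


def is_at_least_50(text: str) -> bool:
    return continental_to_decimal(text) >= 50


for sample in ["100.000,00", "50.000,00", "50,00", "34,34", "34,50"]:
    print(sample, is_at_least_50(sample))
```

Decimal is used instead of float so that price values keep their exact two decimals.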
I'm using an online tool to create contests. In order to send prizes, there's a form in there asking for user information (first name, last name, address,... etc).
There's an option to use regular expressions to validate the data entered in this form.
I'm struggling with the regular expression to put for the street number (I'm located in Belgium).
A street number can be the following:
1234
1234a
1234a12
begins with a number (max 4 digits)
can have letters as well (max 2 char)
Can have numbers after the letter(s) (max3)
I came up with the following expression:
^([0-9]{1,4})([A-Za-z]{1,2})?([0-9]{1,3})?$
But the problem is that as letters and second part of numbers are optional, it allows to enter numbers with up to 8 digits, which is not optimal.
1234 (first group)(no letters in the second group) 5678 (third group)
If one of you can tip me on how to achieve the expected result, it would be greatly appreciated !
You might use this regex:
^\d{1,4}([a-zA-Z]{1,2}\d{1,3}|[a-zA-Z]{1,2}|)$
where:
\d{1,4} - 1-4 digits
([a-zA-Z]{1,2}\d{1,3}|[a-zA-Z]{1,2}|) - optional group, which can be
[a-zA-Z]{1,2}\d{1,3} - 1-2 letters + 1-3 digits
or
[a-zA-Z]{1,2} - 1-2 letters
or
empty
\d{0,4}[a-zA-Z]{0,2}\d{0,3}
\d{0,4} The first group matches a number with 4 digits max
[a-zA-Z]{0,2} The second group matches letters, 2 chars max
\d{0,3} The third group matches a number with 3 digits max
You have to keep the last two groups together, not allowing the last one to be present, if the second isn't, e.g.
^\d{1,4}(?:[a-zA-Z]{1,2}\d{0,3})?$
or a little less optimized (but showing the approach a bit better)
^\d{1,4}(?:[a-zA-Z]{1,2}(?:\d{1,3})?)?$
As you are using this for a validation I assumed that you don't need the capturing groups and replaced them with non-capturing ones.
You might want to change the first number check to [1-9]\d{0,3} to disallow leading zeros.
Thank you so much for your answers ! I tried Sebastian's solution :
^\d{1,4}(?:[a-zA-Z]{1,2}\d{0,3})?$
And it works like a charm ! I still don't really understand what the ":" stands for, but I'll try to figure it out next time I have to fiddle with Regex !
Have a nice day,
Stan
The first digit cannot be 0.
There shouldn't be other symbols before and after the number.
So:
^[1-9]\d{0,3}(?:[a-zA-Z]{1,2}\d{0,3})?$
The ?: combination means that the () construction is a non-capturing group: it groups part of the pattern but does not create a captured substring.
Here is the regex with tests for it.
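For a quick local check, here is a minimal Python sketch of the last pattern. The function name and the sample values beyond the three in the question are made up for illustration. The final two lines show what ?: changes: the group still takes part in the match, but it is not reported as a captured group.

```python
import re

# 1-4 digits without a leading zero, optionally followed by
# 1-2 letters and then up to 3 more digits.
STREET_NUMBER = re.compile(r"^[1-9]\d{0,3}(?:[a-zA-Z]{1,2}\d{0,3})?$")


def is_valid_street_number(text: str) -> bool:
    return STREET_NUMBER.match(text) is not None


for sample in ["1234", "1234a", "1234a12", "12345678", "0123", "12abc"]:
    print(sample, is_valid_street_number(sample))

# Capturing vs non-capturing: only the first pattern reports the letters.
print(re.match(r"(\d+)([a-z]+)", "12ab").groups())    # two captured groups
print(re.match(r"(\d+)(?:[a-z]+)", "12ab").groups())  # one captured group
```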