Regex newbie: How to isolate 'num-num-num' in a string

I'm sure this is a super simple question for many of you, but I've only just started learning regex and at the moment can't for the life of me isolate what I'm after from the following:
June 2015 - Won / Void / Lost = 3-0-1
I need a solution to isolate the 'num-num-num' part at the end of the string that would work for any positive integers.
Thanks for any help
EDIT
So this line of code from a scrapy spider I'm writing produces the line above:
tips_str = sel.xpath('//*[@class="recent-picks"]//div[@class="title3"]/text()').extract()[0]
I've tried to isolate the part I'm after with:
tips_str = sel.xpath('//*[@class="recent-picks"]//div[@class="title3"]/text()').re(r'\d+-\d+-\d+$').extract()[0]
No luck though :(

The regex to capture that is:
\d+-\d+-\d+$
It works as follows:
\d+- means: capture 1 or more digits (the numbers [0-9]), and then a "-".
$ means: you should now be at the end of the line.
Translating that into the full regex pattern:
Capture 1 or more digits, then a hyphen, then 1 or more digits, then a hyphen, then 1 or more digits, and we should now be at the end of the string.
EDIT: Addressing your edits and comments:
I'm not so sure what you mean by "isolate". I'll assume that you mean you want tips_str to equal "3-0-1".
I believe the easiest way would be to first use xpath to extract the string for the entire line without doing any regex. Then, when we're simply dealing with a string (instead of xpath stuff), it should be nice and easy to use regex and get the pattern out.
As far as I understand, sel.xpath('//*[@class="recent-picks"]//div[@class="title3"]/text()').extract()[0] (without .re()) is providing you with the string: "June 2015 - Won / Void / Lost = 3-0-1".
So then:
full_str = sel.xpath('//*[@class="recent-picks"]//div[@class="title3"]/text()').extract()[0]
Now that we've got the full string, we can use standard string regex to pluck the part we want out:
import re
tips_str = None
search = re.search(r'\d+-\d+-\d+$', full_str)
if search:
    tips_str = search.group(0)
Now tips_str will equal "3-0-1". If the pattern wasn't matched at all, it'd instead equal None.
If any of my assumptions are wrong then let me know what's actually happening (like if .extract()[0] isn't giving back a string, then what is it giving back?) and I'll try to adjust this response.

Any and all numbers, so negatives, scientific notation, etc? This will match it.
/(\-?[\.\d]+(e\+|e\-)?[\.\d]*)-(\-?[\.\d]+(e\+|e\-)?[\.\d]*)-(\-?[\.\d]+(e\+|e\-)?[\.\d]*)$/ig
Tested with these:
June 2015 - Won / Void / Lost = -1.1e+3-1.01-0.1e+2
June 2015 - Won / Void / Lost = 1-2-3
June 2015 - Won / Void / Lost = 0.1--5-5.6
If you take the $ out of it, it will match on all lines at the same time.

Related

How to match a whole string if certain conditions are met

I work a lot with isolating sizes from strings, but I have run into some issues.
Current:
https://regex101.com/r/zbEtOU/1
Current regex
^([a-z]+\d*(?:\s*-\s*[a-z\d]+[/-][a-z\d]+)?|\d+)
Examples:
30/32
Fixed 8 (32-36)
XS/S
m/l
1-2Y
s/m
0-3M
32
Desired result:
I want to isolate the first value, but when I encounter parentheses I want to match the values inside them instead.
So actual desired outcome from the examples:
30/32 = 30
Fixed 8 (32-36) = 32
XS/S = XS
m/l = m
1-2Y = 1-2Y (I'm guessing there is no way to output "1Y" in this case? Otherwise it would overlap with 1-2M and cause confusion, as 1 != 1 in this case. When this happens I would prefer to get the original string.) Ideal case = 1Y
s/m = s
1-3M = 1-3M (I'm guessing there is no way to output "1M" in this case? Otherwise it would overlap with 1-2Y and cause confusion, as 1 != 1 in this case. When this happens I would prefer to get the original string.) Ideal case = 1M
32 = 32
I'm really out of my depth on solving this, as there are a lot of different conditions here!
All regexes are run case-insensitively, so no need to worry about capital letters.
Has anyone got a nice and easy way to solve my issue?
Everything needs to be captured in Group 1, otherwise my system can't isolate it.
Run in Python 3.7
You can use
(?:^|.*\()(\d+(?:-\d+[A-Za-z]{1,3})?|[A-Za-z]{1,3})\b
See the regex demo.
Details:
(?:^|.*\() - start of string or any zero or more chars other than line break chars as many as possible, and then a ( char
(\d+(?:-\d+[A-Za-z]{1,3})?|[A-Za-z]{1,3}) - Group 1:
\d+(?:-\d+[A-Za-z]{1,3})? - one or more digits, followed with an optional occurrence of a -, one or more digits, and then one to three ASCII letters
| - or
[A-Za-z]{1,3} - one, two or three ASCII letters
\b - a word boundary.
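To try the pattern against the examples from the question, here is a minimal Python sketch. The helper name first_size and the samples list are only illustrative and not part of the answer above. It uses re.search, so the .*\( alternative can skip ahead to the parenthesis when the start-of-string alternative fails.

```python
import re

# Pattern from the answer above; Group 1 holds the isolated size.
SIZE = re.compile(r'(?:^|.*\()(\d+(?:-\d+[A-Za-z]{1,3})?|[A-Za-z]{1,3})\b')

def first_size(text):
    """Return Group 1 for the first match, or None if nothing matches."""
    m = SIZE.search(text)
    return m.group(1) if m else None

samples = ["30/32", "Fixed 8 (32-36)", "XS/S", "m/l", "1-2Y", "s/m", "0-3M", "32"]
results = {s: first_size(s) for s in samples}

for text, size in results.items():
    print(f"{text} = {size}")
```

Note that "Fixed 8 (32-36)" yields 32 only because "Fix" is not followed by a word boundary, which forces the regex to fall back to the parenthesis branch.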

Detecting whole number with an "x" or "-" after using regex

I'm trying to use regex to detect the quantity in a list of items on a receipt. The software uses OCR so the return can vary a bit. To help, I've narrowed it to assume that the quantity will always be at the start of the line and is always a whole number. The use cases I'm trying to cover are:
2 Burgers $4.00
2 x Burgers $4.00
2 X Burgers $4.00
2x Burgers $4.00
2X Burgers $4.00
2- Burgers $4.00
2 - Burgers $4.00
The plan is for the regex to return 2 for each example above. The regex I have so far is \\d{1,2}(\\s[xX]|[xX]). This returns the top three examples fine, but as much as I have tried I can't seem to get the rest detected. I haven't looked at adding the - yet, as I was stuck on detecting the x next to the Int.
Any help would be great, thanks
To help, I've narrowed it to assume that the quantity will always be at the start of the line and is always a whole number.
I suggest using something like
let pattern = "(?m)^\\d+"
See the regex demo.
The pattern will match 1 or more digits at the start of any line:
(?m) - a MULTILINE modifier that makes ^ match the start of a line rather than the start of a string
^ - start of a line
\d+ - 1 or more (+) digits.
If you need to specify that some text should follow the digits, use a positive lookahead. E.g. you may require x/X/- after zero or more whitespace characters, or a whitespace right after. Then, you need to use
let pattern = "(?m)^\\d+(?=\\s*[xX-]|\\s)"
Here, (?=\\s*[xX-]|\\s) will make the regex match only those digits at the start of the line(s) that are immediately followed with either zero or more whitespace chars and then X, x or -, or that are immediately followed with a whitespace.
See this regex demo.
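The question is about Swift, but the lookahead version is easy to try in Python first. This is only a sketch: the receipt text is assembled from the examples in the question, and the names QTY and quantities are illustrative. The pattern is written with the ^ anchor so that only digits at the start of a line are considered.

```python
import re

# (?m) makes ^ match at the start of every line; the lookahead requires
# either optional whitespace plus x/X/-, or a single whitespace character.
QTY = re.compile(r'(?m)^\d+(?=\s*[xX-]|\s)')

receipt = """2 Burgers $4.00
2 x Burgers $4.00
2 X Burgers $4.00
2x Burgers $4.00
2X Burgers $4.00
2- Burgers $4.00
2 - Burgers $4.00"""

# One quantity per line; the lookahead consumes nothing, so only digits are returned.
quantities = [int(q) for q in QTY.findall(receipt)]
print(quantities)
```

Because the lookahead does not consume characters, the match itself is just the number, so no capture group is needed.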
^(\\d+)\\s?[xX-]?.*?([$£](?:\\d{1,2})(?:,?\\d{3})*\\.?\\d{0,2})$
See it working here (extra backslashes have been added in the code above to allow it to work in Swift, whereas the linked demo shows the expected result in JS, Python, Go and PHP, which means there are fewer backslashes there).
This captures the number of items and the price; the item name is not captured.

Using Regex to clean a csv file in R

This is my first post so I hope it is clear enough.
I am having a problem regarding cleaning my CSV files before I can read them into R and have spent the entire day trying to find a solution.
My data is supposed to be in the form of two columns. The first column is a timestamp consisting of 10 digits and the second an ID consisting of 11 or 12 Letters and numbers (the first 6 are always numbers).
For example:
logger10 |
0821164100 | 010300033ADD
0821164523 | 010300033ADD
0821164531 | 010700EDDA0F0831102744
010700EDDA0F|
would become:
0821164100 | 010300033ADD
0821164523 | 010300033ADD
0821164531 | 010700EDDA0F
0831102744 | 010700EDDA0F
(please excuse the lines in the middle, that was my attempt at separating the columns...).
The csv file seems to occasionally be missing a comma which means that sometimes one row will end up like this:
0923120531,010300033ADD0925075301,010700EDD00A
My hardware also adds the word logger10 (or whichever number logger this is) whenever it restarts which gives a similar problem e.g. logger10logger100831102744.
I think I have managed to solve the logger text problem (see code) but I am sure this could be improved. Also, I really don't want to delete any of the data.
My real trouble is making sure there is a line break in the right place after the ID and, if not, I would like to add one. I thought I could use regex for this but I'm having difficulty understanding it.
Any help would be greatly appreciated!
Here is my attempt:
temp <- list.files(pattern="*.CSV") #list of each csv/logger file
for(i in temp){
  #clean each csv
  tmp<-readLines(i) #check each line in file
  tmp<-gsub("logger([0-9]{2})","",tmp) #remove logger text
  pattern <- ("[0-9]{10}\\,[0-9]{6}[A-Z,0-9]{5,6}") #regex pattern ??
  if (tmp!= pattern){
    #I have no idea where to start here...
  }
}
here is some raw data:
logger01
0729131218,020700EE1961
0729131226,020700EE1961
0831103159,0203000316DB
0831103207,0203000316DB0831103253,010700EDE28C
0831103301,010700EDE28C
0831103522,010300029815
0831103636,010300029815
0831103657,020300029815
If you want to do this in a single pass:
(?:logger\d\d )?([\dA-F]{10}),?([\dA-F]{12}) ?
can be replaced with
\1\t\2\n
This optionally looks for any of those rogue logger01 entries (including the space after it). The trailing ? after the group means that it can match 0 or 1 time: if the text is there, it is matched; if it's not, the match just keeps going anyway.
Following that, you look for (and capture) exactly 10 hex values (either digits or A-F). The ,? means that if a comma exists, it will match, but it can match 0 or 1 time as well (making it optional).
Following that, look for (and capture) exactly 12 hex values. Finally, to get rid of any strange trailing spaces, the final " ?" (a space character followed by ?) will optionally match the trailing space.
The replacement writes out the first captured group (the 10 hex digits), a tab, the second captured group (the 12 hex digits), and then a newline.
You can see this in use on regex101 to see the results. You can use code generator on the left side of that page to get some preformatted PHP/Javascript/Python that you can just drop into a script.
If you're doing this from the command line, perl could be used:
perl -pe 's/(?:logger\d\d )?([\dA-F]{10}),?([\dA-F]{12}) ?/\1\t\2\n/g'
If another language, you may need to adapt it slightly to fit your needs.
EDIT
Re-reading the OP and comments, a slightly more rigid regex could be
(?:logger\d\d\ )?([\dA-F]{10}),?(\d{6}[\dA-F]{5,6})\ ?
I updated the regex101 link with the changes.
This still looks for the first 10 hex values, but now looks for exactly 6 digits, followed by 5-6 hex values, so the total number of characters matched is 11 or 12.
The replacement would be the same.
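The asker works in R, but as a rough illustration of the stricter pattern, here is a small Python sketch. The names PAIR and split_records are hypothetical, and the input is a few lines taken from the raw data in the question. It strips the logger text first, then pulls every timestamp/ID pair out of each line, so a line with a missing line break yields two rows.

```python
import re

# 10 hex characters, an optional comma, then 6 digits plus 5-6 hex characters.
PAIR = re.compile(r'([\dA-F]{10}),?(\d{6}[\dA-F]{5,6})')

def split_records(lines):
    """Return (timestamp, id) tuples, splitting lines that hold two records."""
    rows = []
    for line in lines:
        line = re.sub(r'logger\d\d', '', line)  # drop restart markers
        rows.extend(PAIR.findall(line))
    return rows

raw = [
    "logger01",
    "0729131218,020700EE1961",
    "0831103207,0203000316DB0831103253,010700EDE28C",
    "0831103301,010700EDE28C",
]
rows = split_records(raw)
for timestamp, ident in rows:
    print(f"{timestamp}\t{ident}")
```

One caveat, assuming IDs can really be 11 characters long: the greedy {5,6} would then take the first digit of a timestamp that follows the ID without a line break, so that case needs a separate check.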
Paste your regex into https://regex101.com/ to see whether it catches all cases. The 5 or 6 letters or digits could pose an issue, as the pattern may capture the first digit of the next timestamp when the logger misses out a comma. Appending a '\n' to the end of the tmp string should work, provided the regex catches all cases.

VIN Validation RegEx

I have written a VIN validation RegEx based on http://en.wikipedia.org/wiki/Vehicle_identification_number, but when I run some tests it does not accept some valid VINs.
My RegEx:
^[A-HJ-NPR-Za-hj-npr-z\\d]{8}[\\dX][A-HJ-NPR-Za-hj-npr-z\\d]{2}\\d{6}$
VIN Number Not Working:
1ftfw1et4bfc45903
WP0ZZZ99ZTS392124
VIN Numbers Working:
19uya31581l000000
1hwa31aa5ae006086
(I think the problem occurs with the numbers at the end, Wikipedia made it sound like it would end with only 6 numbers and the one that is not working but is a valid number only ends with 5)
Any Help Correcting this issue would be greatly appreciated!
I can't help you with a perfect regex for VIN numbers -- but I can explain why this one is failing in your example of 1ftfw1et4bfc45903:
^[A-HJ-NPR-Za-hj-npr-z\d]{8}[\dX][A-HJ-NPR-Za-hj-npr-z\d]{2}\d{6}$
Explanation:
^[A-HJ-NPR-Za-hj-npr-z\d]{8}
This allows for 8 characters, composed of any digits and any letters except I, O, and Q; it properly finds the first 8 characters:
1ftfw1et
[\dX]
This allows for 1 character, either a digit or a capital X; it properly finds the next character:
4
[A-HJ-NPR-Za-hj-npr-z\d]{2}
This allows for 2 characters, composed of any digits and any letters except I, O, and Q; it properly finds the next 2 characters:
bf
\d{6}$
This allows for exactly 6 digits, and is the reason the regex fails; because the final 6 characters are not all digits:
c45903
Dan is correct - VINs have a checksum. You can't check that with a regex, so the best you can do with regex is cast too wide a net. By that I mean that your regex will accept all valid VINs, and also around a trillion (rough estimate) non-VIN 17-character strings.
If you are working in a language with named capture groups, you can extract that data as well.
So, if your goal is:
Only to not reject valid VINs (letting in invalid ones is ok) then use Fransisco's answer, [A-HJ-NPR-Z0-9]{17}.
Not reject valid VINs, and grab info like model year, plant code, etc, then use this (note, you must use a language that can support named capture groups - off the top of my head: Perl, Python, Elixir, almost certainly others but not JS): /^(?<wmi>[A-HJ-NPR-Z\d]{3})(?<vds>[A-HJ-NPR-Z\d]{5})(?<check>[\dX])(?<vis>(?<year>[A-HJ-NPR-Z\d])(?<plant>[A-HJ-NPR-Z\d])(?<seq>[A-HJ-NPR-Z\d]{6}))$/ where the names are defined at the end of this answer.
Not reject valid VINs, and prevent some but not all invalid VINs, you can get specific like Pedro does.
Only accept valid VINs: you need to write code (just kidding, GitHub exists).
Capture group name key:
wmi - World manufacturer identifier
vds - Vehicle descriptor section
check - Check digit
vis - Vehicle identifier section
year - Model year
plant - Plant code
seq - Production sequence number
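As a sketch of how the named groups could be used from Python: note that Python's re module spells named groups as (?P<name>...), so the pattern is adapted accordingly. The names VIN_PARTS and split_vin are illustrative, and the sample VIN is the first one from the question.

```python
import re

# Same structure as the pattern above, written with Python's (?P<name>...) syntax.
VIN_PARTS = re.compile(
    r'^(?P<wmi>[A-HJ-NPR-Z\d]{3})'
    r'(?P<vds>[A-HJ-NPR-Z\d]{5})'
    r'(?P<check>[\dX])'
    r'(?P<vis>(?P<year>[A-HJ-NPR-Z\d])(?P<plant>[A-HJ-NPR-Z\d])(?P<seq>[A-HJ-NPR-Z\d]{6}))$',
    re.IGNORECASE,
)

def split_vin(vin):
    """Return a dict of the named sections, or None if the shape does not fit."""
    m = VIN_PARTS.match(vin)
    return m.groupdict() if m else None

parts = split_vin("1ftfw1et4bfc45903")
print(parts)
```

With this pattern the second VIN from the question, WP0ZZZ99ZTS392124, is not matched, because its ninth character is Z rather than a digit or X. If such VINs must be accepted, the check group would need to be loosened.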
This regular expression is working fine for validating US VINs, including the one you described:
[A-HJ-NPR-Z0-9]{17}
Remember to make it case insensitive with flag i
Source: https://github.com/rfink/angular-vin
VIN should have only A-Z, 0-9 characters, but not I, O, or Q
Last 6 characters of VIN should be a number
VIN should be 17 characters long
You didn't specify which language you're using but the following regex can be used to validate a US VIN with php:
/^(?:([A-HJ-NPR-Z]){3}|\d{3})(?1){2}\d{2}(?:(?1)|\d)(?:\d|X)(?:(?1)+\d+|\d+(?1)+)\d{6}$/i
I feel regex is not the ideal validation. VINs have a built in check digit. https://en.wikibooks.org/wiki/Vehicle_Identification_Numbers_(VIN_codes)/Check_digit or http://www.vsource.org/VFR-RVF_files/BVINcalc.htm
I suggest you build an algorithm using this. (Untested algorithm example)
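A minimal Python sketch of such an algorithm, assuming the North American check-digit scheme: each character is transliterated to a number, multiplied by a position weight (8 7 6 5 4 3 2 10 0 9 8 7 6 5 4 3 2), the products are summed, and the sum modulo 11 gives the ninth character, with 10 written as X. The function names are illustrative.

```python
# Transliteration table: digits map to themselves, letters to 1-9 (I, O, Q are not allowed).
TRANSLIT = dict(zip("ABCDEFGH", range(1, 9)))
TRANSLIT.update(zip("JKLMN", range(1, 6)))
TRANSLIT.update({"P": 7, "R": 9})
TRANSLIT.update(zip("STUVWXYZ", range(2, 10)))
TRANSLIT.update({str(d): d for d in range(10)})

# Position 9 has weight 0, so the check digit itself does not affect the sum.
WEIGHTS = (8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2)

def check_digit(vin):
    """Compute the expected ninth character for a 17-character VIN."""
    vin = vin.upper()
    if len(vin) != 17 or any(ch not in TRANSLIT for ch in vin):
        raise ValueError("not a 17-character VIN")
    total = sum(TRANSLIT[ch] * w for ch, w in zip(vin, WEIGHTS))
    remainder = total % 11
    return "X" if remainder == 10 else str(remainder)

def has_valid_check_digit(vin):
    """True if the ninth character matches the computed check digit."""
    return vin.upper()[8] == check_digit(vin)

print(check_digit("1M8GDM9AXKP042788"))
```

A shape check such as [A-HJ-NPR-Z0-9]{17} can run first, with this function as the second step. VINs from markets that do not use the check digit will fail this test even when they are genuine.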
This should work. It is from a Splunk search, and the character class excludes I, O, and Q:
(?i)(?<VIN>[A-HJ-NPR-Z0-9]{11}\d{6})
The NHTSA website provides the method used to calculate the 9th character checksum, if you're interested. It also provides lots of other useful data, such as which characters are allowed in which position, or how to determine whether the 10th character, if alphabetic, refers to a model year up to 1999 or a model year from 2010.
NHTSA VIN eCFR
Hope that helps.
Please, use this regular expression. It is shorter and works with all VIN types
(?=.*\d|[A-Z])(?=.*[A-Z])[A-Z0-9]{17}
I changed the formula above to the new one below:
(?=.*\d|=.*[A-Z])(?=.*[A-Z])[A-Z0-9]{17}
This regular expression is meant to accept any letter but require at least one digit, with a length of 17 characters.

Create shortest possible regex

I want to create a regex that will match any of these values
7-5
6-6 ((0-99) - (0-99))
6-4
6-3
6-2
6-1
6-0
0-6
1-6
2-6
3-6
4-6
the 6-6 example is a special case, here are some examples of values:
6-6 (23-8)
6-6 (4-25)
6-6 (56-34)
Is it possible to make one regex that can do this?
If so, is it possible to further extend that regex for the 6-6 special case such that the the difference between the two numbers within the parentheses is equal to 2 or -2?
I could easily write this with procedural code, but i'm really curious if someone can devise a regex for this.
Lastly, if it could be further extended such that the individual digits were in their own match groups I'd be amazed. An example would be for 7-5, i could have a match group that just had the value 7, and another that had the value 5. However for 6-6 (24-26) I'd like a match group that had the first six, a match group for the second 6, a match group for the 24 and a match group for the 26.
This may be impossible, but some of you can probably get this part of the way there.
Good luck, and thanks for the help.
NO. The answer is "We can't," and the reason is because you're trying to use a hammer to dig a hole.
The problem with writing one long "clever" (this word causes a knee-jerk reaction in many people who are far more anti-regex than I) regex is that, six months from now, you'll have forgotten those clever regex features that you used so heavily, and you'll have written six months worth of code related to something else, and you'll get back to your impressive regex and have to tweak one detail, and you'll say, "WTF?"
This is what (I understand) you want, in Perl:
# data is in $_
if(/7-5|6-[0-4]|[0-4]-6|6-6 \((\d{1,2})-(\d{1,2})\)/) {
    # use defined() so that a captured "0" is not treated as false
    if(defined $1 and defined $2 and abs($1 - $2) == 2) {
        # we have the right difference
    }
}
Some might say that the given regex is a bit much, but I don't think it's too bad. If the \d{1,2} bit is a little too obscure you could use \d\d? (which is what I used at first, but didn't like the repetition).
You can do it like this:
7-5|6-[0-4]|[0-4]-6|6-6 \(\d\d?-\d\d?\)
Just add parens to get your match groups.
Off the top of my head (there may be some errors but the principle should be good):
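6-6 \(\d+-\d+\)|\d-\d
And like with any regexp, you can surround what you want to extract with parentheses for match groups:
(6)-(6) \((\d+)-(\d+)\)|(\d)-(\d)
In the 6-6 case, the first two parentheses should get the sixes, and the second two should get the multi-digit values that come afterwards. The 6-6 alternative is listed first so that the plain digit-digit alternative does not match it before the parenthesised part is seen, and the literal parentheses are escaped.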
Here is one that will match only the numbers you want and let you get each digit by name:
p = r'(?P<a>[0-4]|6|7)-(?P<b>[0-4]|6|5) *(\((?P<c>\d{1,2})-(?P<d>\d{1,2})\))?'
To get each digit you could use:
values = re.search(p, string).group('a', 'b', 'c', 'd')
Which will return a four-element tuple with the values you are looking for; c and d are None when there is no parenthesised part. Note that re.search itself returns None if no match was found, so check for that before calling group.
One problem with this pattern is that it will match the stuff in the parentheses whether or not there was a match to '6-6'. This one will only match the final parentheses if 6-6 is matched:
p = r'(?P<a>[0-4]|(?P<tmp_a>6)|7)-(?P<b>(?(tmp_a)(?P<tmp_b>6)|([0-4]|5)))(?(tmp_b) *(\((?P<c>\d{1,2})-(?P<d>\d{1,2})\))?)'
I don't know of any way to look for a difference between the numbers in the parenthesis; regex only knows about strings, not numerical values . . .
(I am assuming python syntax here; the perl syntax is slightly different, though perl supports the python way of doing things.)
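Putting the two halves together, here is a small Python sketch: a regex to validate the shape and capture the digits, and plain arithmetic for the difference check that a regex cannot express. The names SCORE, parse_score and has_two_point_margin are illustrative and not taken from any of the answers.

```python
import re

# Each alternative has its own groups; only the groups of the alternative
# that matched are non-None.
SCORE = re.compile(
    r'(7)-(5)|(6)-([0-4])|([0-4])-(6)|(6)-(6) \((\d{1,2})-(\d{1,2})\)'
)

def parse_score(text):
    """Return the captured numbers as a tuple of ints, or None if invalid."""
    m = SCORE.fullmatch(text)
    if m is None:
        return None
    return tuple(int(g) for g in m.groups() if g is not None)

def has_two_point_margin(score):
    """True for a 6-6 score whose parenthesised numbers differ by exactly 2."""
    return len(score) == 4 and abs(score[2] - score[3]) == 2

print(parse_score("7-5"))
print(parse_score("6-6 (24-26)"))
```

Using fullmatch means extra text around the score is rejected; if the score is embedded in a longer string, search with explicit anchors or boundaries would be needed instead.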