Problems with regex parsing


I am trying to write a program that runs the lynx command on the page "http://www.rottentomatoes.com/movie/box_office.php", and I can't wrap my head around one problem: getting the title by itself. A title can contain special characters and numbers, and titles vary in length. I want to write a regex that can parse the entire page and find lines like these:
(I added spaces between the title and the next number, which is how many weeks the film has been out, to distinguish the title from the weeks released.)
1 -- 30% The Vow 1 $41.2M $41.2M $13.9k 2958
2 -- 53% Safe House 1 $40.2M $40.2M $12.9k 3119
3 -- 42% Journey 2: The Mysterious Island 1 $27.3M $27.3M $7.9k 3470
4 -- 57% Star Wars: Episode I - The Phantom Menace (in 3D) 1 $22.5M $22.5M $8.5k 2655
5 1 86% Chronicle 2 $12.1M $40.0M $4.2k 2908
The regex I have started out with is:
/(\d+)\s(\d+|\-\-)\s(\d+\%)\s
If someone can help me figure out how to grab the title successfully, that would be much appreciated! Thanks in advance.

Capture all the things!!
^(\d+)\s+(\d+|\-\-)\s+(\d+\%)\s+(.*)\s+(\d+)\s+(\$\d+(?:\.\d+)?[Mk])\s+(\$\d+(?:\.\d+)?[Mk])\s+(\$\d+(?:\.\d+)?[Mk])\s+(\d+)$
Explained:
^ <- Start of the line
(\d+)\s+ <- Numbers (captured) followed by one or more whitespace characters
(\d+|\-\-)\s+ <- Numbers [or "--"] (captured) followed by one or more whitespace characters
(\d+\%)\s+ <- Numbers [with '%'] (captured) followed by one or more whitespace characters
(.*)\s+ <- Anything at all, i.e. the title [use (.*?) to make it lazy] (captured) followed by one or more whitespace characters
(\d+)\s+ <- Numbers (captured) followed by one or more whitespace characters
(\$\d+(?:\.\d+)?[Mk])\s+ <- "$" and Numbers [with optional decimal part] and "M or k" (captured) followed by one or more whitespace characters
(\$\d+(?:\.\d+)?[Mk])\s+ <- "$" and Numbers [with optional decimal part] and "M or k" (captured) followed by one or more whitespace characters
(\$\d+(?:\.\d+)?[Mk])\s+ <- "$" and Numbers [with optional decimal part] and "M or k" (captured) followed by one or more whitespace characters
(\d+) <- Numbers (captured)
$ <- End of the line
More seriously, this is what I've done: I cheated a bit and captured every field (as I think you'll do in the end), so that the rest of the pattern acts like a lookahead for the title capture.
(.*) is greedy by default; write (.*?) if you want it lazy. Either way, the rest of the pattern is anchored to the end of the line, so the engine backtracks until the trailing fields match.
The title group therefore ends up holding only the title (the only thing left).
You could also use an actual lookahead and make assertions.
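As a rough check of the pattern above, here is a minimal Python sketch run against the sample lines from the question. The names ROW, SAMPLE_LINES and parse_title are made up for illustration. It assumes each row arrives as one line of text, uses the lazy (.*?) form for the title, and escapes the dots in the money fields.

```python
import re

# Full-row pattern; group 4 is the title. ROW is a hypothetical name.
ROW = re.compile(
    r"^(\d+)\s+(\d+|--)\s+(\d+%)\s+(.*?)\s+(\d+)"
    r"\s+(\$\d+(?:\.\d+)?[Mk])\s+(\$\d+(?:\.\d+)?[Mk])\s+(\$\d+(?:\.\d+)?[Mk])"
    r"\s+(\d+)$"
)

SAMPLE_LINES = [
    "1 -- 30% The Vow 1 $41.2M $41.2M $13.9k 2958",
    "2 -- 53% Safe House 1 $40.2M $40.2M $12.9k 3119",
    "3 -- 42% Journey 2: The Mysterious Island 1 $27.3M $27.3M $7.9k 3470",
    "4 -- 57% Star Wars: Episode I - The Phantom Menace (in 3D) 1 $22.5M $22.5M $8.5k 2655",
    "5 1 86% Chronicle 2 $12.1M $40.0M $4.2k 2908",
]


def parse_title(line):
    """Return the title from one box-office row, or None if the row does not match."""
    m = ROW.match(line)
    return m.group(4) if m else None


titles = [parse_title(line) for line in SAMPLE_LINES]
for title in titles:
    print(title)
```

Titles that contain digits, such as "Journey 2: ..." or "(in 3D)", still come out whole, because the digit inside the title is not followed by the whitespace and "$" fields that the rest of the pattern requires.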
Resources:
regular-expressions.info - Lookaround
regexr.com - This regex tested

Related

Regex to get any numbers after the occurrence of a string in a line

Hi guys, I'm trying to get a substring as well as the corresponding number from this string:
text = "Milk for human consumption may be taken only from cattle from 80 hours after the last treatment."
I want to select the word milk and the corresponding number 80 from this sentence. This is part of a larger file, and I want a generic solution that finds the word milk in a line and then the first number that occurs after it anywhere in that line.
(Milk+)\d
This is what I came up with, thinking that I could make a group for milk and then check for digits, but I'm stumped on how to search for numbers anywhere on the line and not just immediately after the word milk. Also, is there any way to make the search case-insensitive?
Edit: I'm looking to get both the word and the number if possible, e.g. "milk" "80", and I'm using Python.

/(?<!\p{L})([Mm]ilk)(?!\p{L})\D*(\d+)/
This matches the following strings, with the match and the contents of the two capture groups noted.
"The Milk99" # "Milk99" 1:"Milk" 2:"99"
"The milk99 is white" # "milk99" 1:"milk" 2:"99"
"The 8 milk is 99" # "milk is 99" 1:"milk" 2:"99"
"The 8milk is 45 or 73" # "milk is 45" 1:"milk" 2:"45"
The following strings are not matched.
"The Milk is white"
"The OJ is 99"
"The milkman is 37"
"Buttermilk is 99"
"MILK is 99"
This regular expression could be made self-documenting by writing it in free-spacing mode:
/
(?<!\p{L}) # the following match is not preceded by a Unicode letter
([Mm]ilk) # match 'M' or 'm' followed by 'ilk' in capture group 1
(?!\p{L}) # the preceding match is not followed by a Unicode letter
\D* # match zero or more characters other than digits
(\d+) # match one or more digits in capture group 2
/x # free-spacing regex definition mode
\D* could be replaced with .*?, where the ? makes the match non-greedy. If the greedy variant (.*) were used, the second capture group for "The 8milk is 45 or 73" would contain "3".
To match "MILK is 99", change ([Mm]ilk) to (?i)(milk).
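Since the question asks for Python, here is a rough translation as a sketch. As far as I know, the standard-library re module does not accept \p{L}, so this version swaps in [^\W\d_] as an approximation of "a letter". The names LETTER, MILK, MILK_ANY_CASE and find_milk are hypothetical.

```python
import re

# Approximation of \p{L}: a word character that is neither a digit nor "_".
LETTER = r"[^\W\d_]"

# Same structure as the pattern above: "milk" as a whole word, then the first number.
MILK = re.compile(rf"(?<!{LETTER})([Mm]ilk)(?!{LETTER})\D*(\d+)")
MILK_ANY_CASE = re.compile(rf"(?<!{LETTER})(milk)(?!{LETTER})\D*(\d+)", re.IGNORECASE)


def find_milk(text, ignore_case=False):
    """Return (word, number) for the first match, or None if there is none."""
    pattern = MILK_ANY_CASE if ignore_case else MILK
    m = pattern.search(text)
    return m.groups() if m else None


print(find_milk("The 8milk is 45 or 73"))
print(find_milk("The milkman is 37"))
print(find_milk("MILK is 99", ignore_case=True))
```

Passing re.IGNORECASE to re.compile plays the role of the (?i) flag mentioned above.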

This seems to do what you want in Java (I overlooked that the asker wanted Python, or the question was edited later):
String example =
"Test 40\n" +
"Test Test milk for human consumption may be taken only from cattle from hours after the last treatment." +
"\nTest Milk for human consumption may be taken only from cattle from 80 hours after the last treatment." +
"\nTest miLk for human consumption may be taken only from cattle from 80 hours after the last treatment.";
Matcher m = Pattern.compile("((?i)(milk).*?(\\d+).*\n?)+").matcher(example);
m.find();
System.out.print(m.group(2) + m.group(3));
Note how it checks, case-insensitively, that the word "milk" appears somewhere before a number on the same line, and prints only those two values. It also prints only the first occurrence found (finding all occurrences is possible with small modifications to this code).
I hope extracting the two values from the match this way fits your task.

You should try this one:
(Milk).*?(\d+)
Depending on your language, you can also request a case-insensitive search. Example in JS: /(Milk).*?(\d+)/i, where the final i makes the search case-insensitive.
Note the *?, which is the most important part! This is a lazy quantifier. In other words, it consumes any character, but stops as soon as the next part of the pattern can match. Here, as soon as a digit can be read, it is read. A plain * would instead have consumed as much as possible, leaving only the last digit on the line after Milk for the second group.
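To make the lazy versus greedy difference concrete, here is a small Python sketch using the sentence from the question. The variable names are only for illustration, and re.IGNORECASE stands in for the JS i flag.

```python
import re

text = ("Milk for human consumption may be taken only from cattle "
        "from 80 hours after the last treatment.")

# Lazy: .*? stops at the first place where \d+ can match.
lazy = re.search(r"(milk).*?(\d+)", text, re.IGNORECASE)

# Greedy: .* runs to the end of the line, then backtracks just far enough
# to leave a single digit for \d+.
greedy = re.search(r"(milk).*(\d+)", text, re.IGNORECASE)

print(lazy.groups())    # ('Milk', '80')
print(greedy.groups())  # ('Milk', '0')
```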

Regex Select & Replace to Clean Up US phone numbers

We pull in a list of phone numbers as part of a datafeed. They are all for North America based companies. I would like to remove any leading "1" or "+1" and any trailing information like "x100", " EXT400", etc. They are stored in MariaDB so I would like to do
UPDATE `CompanyPhone` SET `number`= REGEXP_SUBSTR(`number`,pattern)
to remove the unwanted stuff, I just need the REGEX to select the correct part of the phone number.
"1 (555) 555-5555 x100" -> "(555) 555-5555"
"+15555555555 EXT400" -> "5555555555"
" 555-555-5555" -> "555-555-5555" (remove leading space)
Basically, I need just the first 10 digits, ignoring the first digit if it is a 1, and the formatting currently in the first 10 digits ("()" or " " or "-") if it is possible to keep it.
If everything could be reformatted to (555) 555-5555, that would be a bonus but is not required. I could do this in a second query if needed.

You could use REGEXP_REPLACE for this. Assuming you are using MariaDB 10.0.5 or later, you can use PCRE regular expressions. For your sample expressions, this regexp will give you the desired results (demo on Regex101). It looks for 3 groups of numbers (3 digits, 3 digits and then 4 digits) possibly preceded by a 1, and with other non-digit characters (e.g. +, -) around them.
^(?:\D*)1?(?:\D*)(\d{3})(?:\D*)(\d{3})(?:\D*)(\d{4}).*$
So your UPDATE statement will become
UPDATE `CompanyPhone` SET `number`= REGEXP_REPLACE(`number`, '^(?:\\D*)1?(?:\\D*)(\\d{3})(?:\\D*)(\\d{3})(?:\\D*)(\\d{4}).*$', '(\\1) \\2-\\3')
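Before running the UPDATE against real data, it may help to try the pattern outside the database. The following is a minimal Python sketch of the same pattern and replacement. PHONE and clean_phone are hypothetical names, and Python's re is assumed to behave like MariaDB's PCRE for this simple pattern.

```python
import re

# Same pattern as in the UPDATE statement, written as a Python raw string.
PHONE = re.compile(r"^(?:\D*)1?(?:\D*)(\d{3})(?:\D*)(\d{3})(?:\D*)(\d{4}).*$")


def clean_phone(raw):
    """Reformat a North American number as (555) 555-5555.

    Input that does not contain the expected 10 digits is returned unchanged.
    """
    return PHONE.sub(r"(\1) \2-\3", raw)


for raw in ["1 (555) 555-5555 x100", "+15555555555 EXT400", " 555-555-5555"]:
    print(repr(raw), "->", repr(clean_phone(raw)))
```

All three sample inputs come out in the bonus (555) 555-5555 format rather than keeping their original punctuation.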

Python Regular Expression: No space in between

I have the following string:
"......(some chars) aaa bbb ###8/13/2018 ......(some chars)"
The ### in the string represent some random characters. ###'s length is unknown and it could be None (just "aaa bbb 8/13/2018").
My goal is to find the date from the string (8/13/2018) and the starting index of ###.
I currently used the following code:
m = re.search(r'\s.*?([0-9]{1,}/[0-9]{1,}/[0-9]{2,})', str)
m.groups()[0] ## The date
m.start() ## index of ###
But the regex matches bbb ###8/13/2018 instead of ###8/13/2018.
I also tried changing the regex to:
r'\s(?!\s).*?[0-9]{1,}/[0-9]{1,}/[0-9]{2,}'
r'\s(?!\s)*?[0-9]{1,}/[0-9]{1,}/[0-9]{2,}'
But neither of them works.
I would appreciate any help or comments. Thank you.

I believe you are looking for:
#*(?:\d{1,2}/){2}\d{2,4} or even \S*(?:\d{1,2}/){2}\d{2,4}
This simply says:
\S* start with 0 or more non-space characters.
(?:\d{1,2}/){2} find two groups of \d{1,2}/ but do not capture them; (?:...) is the non-capturing form. This matches the month and day part, 8/13/. \d{1,2} means at least one digit and at most two digits.
\d{2,4} match the year: at least 2 digits and at most 4 digits.

Using a part of your regex, I think you mean something like this
r'\S*([0-9]+/[0-9]+/[0-9]{2,})'
https://regex101.com/r/dxF4sT/1
The starting index is the position where the match begins.
Note that \S* matches a run of consecutive non-whitespace characters.
You can change this to something narrower, such as [#a-zA-Z]; just add the characters you need to the class.
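A minimal sketch of how this could look in Python, using the \S* pattern above. DATE and find_date are hypothetical names, and the sample strings are made up to resemble the one in the question.

```python
import re

# \S* absorbs the "###" prefix; group 1 is the date itself.
DATE = re.compile(r"\S*([0-9]+/[0-9]+/[0-9]{2,})")


def find_date(s):
    """Return (date, index where the prefix starts), or None if no date is found."""
    m = DATE.search(s)
    if m is None:
        return None
    # m.start() is the start of the whole match, i.e. the start of "###".
    return m.group(1), m.start()


sample = "xx aaa bbb ###8/13/2018 yy"
print(find_date(sample))
print(find_date("aaa bbb 8/13/2018"))
```

When the prefix is empty, the two positions coincide: m.start() equals m.start(1).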

Detecting whole number with an "x" or "-" after using regex

I'm trying to use regex to detect the quantity in a list of items on a receipt. The software uses OCR so the return can vary a bit. To help ive narrowed it to assume that the quantity will always be at the start of the line and is always a whole number. The use cases I'm trying to cover are:
2 Burgers $4.00
2 x Burgers $4.00
2 X Burgers $4.00
2x Burgers $4.00
2X Burgers $4.00
2- Burgers $4.00
2 - Burgers $4.00
The plan is for the regex to return 2 for each example above. The regex I have so far is \\d{1,2}(\\s[xX]|[xX]). This handles the top three examples fine, but however much I have tried, I can't get the rest detected. I haven't looked at adding the - yet, as I was stuck on detecting the x directly next to the integer.
Any help would be great, thanks.

To help, I've narrowed it to assume that the quantity will always be at the start of the line and is always a whole number.
I suggest using something like
let pattern = "(?m)^\\d+"
See the regex demo.
The pattern will match 1 or more digits at the start of any line:
(?m) - a MULTILINE modifier that makes ^ match the start of a line rather than the start of a string
^ - start of a line
\d+ - 1 or more (+) digits.
If you need to specify that some text should follow the digits, use a positive lookahead. E.g. you may require x/X/- after 0+ whitespace characters, or a whitespace character right after the digits. Then, you need to use
let pattern = "(?m)^\\d+(?=\\s*[xX-]|\\s)"
Here, (?=\\s*[xX-]|\\s) makes the regex match only those digits at the start of a line that are immediately followed by either 0+ whitespace characters and then X, x or -, or by a single whitespace character.
See this regex demo.
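As a quick cross-check outside Swift, here is a minimal Python sketch of the lookahead pattern, with the ^ anchor in place, run over the seven receipt lines from the question. QTY, receipt and quantities are hypothetical names; re.MULTILINE plays the role of (?m).

```python
import re

# Digits at the start of a line, followed by optional whitespace and x/X/-,
# or by a single whitespace character.
QTY = re.compile(r"^\d+(?=\s*[xX-]|\s)", re.MULTILINE)

receipt = "\n".join([
    "2 Burgers $4.00",
    "2 x Burgers $4.00",
    "2 X Burgers $4.00",
    "2x Burgers $4.00",
    "2X Burgers $4.00",
    "2- Burgers $4.00",
    "2 - Burgers $4.00",
])

quantities = QTY.findall(receipt)
print(quantities)
```

Because of the ^ anchor, the 4 in $4.00 is never picked up, even though it is a run of digits.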

^(\\d+)\\s?[xX-]?.*?([$£](?:\\d{1,2})(?:,?\\d{3})*\\.?\\d{0,2})$
See it working here (the backslashes are doubled in the code above so that it works in a Swift string, whereas the link below shows the expected result in JS, Python, Go and PHP, so there are fewer backslashes there).
This captures the number of items and the price; the item name itself is not captured.
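For comparison, the same pattern written as a Python raw string needs only single backslashes. This is a sketch under the assumption that each receipt line is matched on its own; LINE and parse_line are made-up names.

```python
import re

# Group 1: quantity at the start of the line. Group 2: price at the end of the line.
LINE = re.compile(r"^(\d+)\s?[xX-]?.*?([$£](?:\d{1,2})(?:,?\d{3})*\.?\d{0,2})$")


def parse_line(line):
    """Return (quantity, price) as strings, or None if the line does not match."""
    m = LINE.match(line)
    return m.groups() if m else None


print(parse_line("2 x Burgers $4.00"))
print(parse_line("2- Burgers $4.00"))
```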

Regular Expressions in R

I found somewhat similar questions
R - Select string text between two values, regex for n characters or at least m characters,
but I'm still having trouble
Say I have a string in R:
testing_String <- "AK ADAK NAS PADK ADK 70454 51 53N 176 39W 4 X T 7"
I need to pull out everything between the first element in the string, which contains 2 characters (AK), and the pair PADK, ADK. PADK and ADK will change in content but will always be 4 and 3 characters long respectively.
So I would need to pull
ADAK NAS
I came up with this, but it picks up everything from AK to ADK:
^[A-Za-z0_9_]{2}(.*?) +[A-Za-z0_9_]{4}|[A-Za-z0_9_]{3,}

If I understood your question correctly, this should do the trick:
\b[A-Z]{2}\s+(.+?)\s+[A-Z]{4}\s+[A-Z]{3}\b
Demo
You'll have to set the perl = TRUE option (to use a more capable regex engine).
\b means word boundary. So this pattern looks for a match starting with a 2-letter word and ending with a 4-letter word followed by a 3-letter word. Your value will be in the first group.
Alternatively, you can write the following to avoid using the capturing group:
\b[A-Z]{2}\s+\K.+?(?=\s+[A-Z]{4}\s+[A-Z]{3}\b)
But I'd prefer the first method because it's easier to read.
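The capturing-group version carries over to other engines unchanged. As a rough illustration (in Python rather than R), here is a sketch of the first pattern applied to the string from the question. STATION is a made-up name. Python's re does not support \K, so only the capture-group form is shown.

```python
import re

# 2-letter word, then the wanted text (group 1), then a 4-letter and a 3-letter word.
STATION = re.compile(r"\b[A-Z]{2}\s+(.+?)\s+[A-Z]{4}\s+[A-Z]{3}\b")

testing_string = "AK ADAK NAS PADK ADK 70454 51 53N 176 39W 4 X T 7"

m = STATION.search(testing_string)
print(m.group(1))  # ADAK NAS
```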

Lookbehind is supported for perl=TRUE, so this regex will do what you want:
(?<=\w{2}\s).*?(?=\s+[^\s]{4}\s[^\s]{2})
