Cannot for the life of me figure this out...
I am parsing a recipe:
Steak and Eggs
Serves 1
150g potatoes, diced
2 tsp olive oil
½ white onion, peeled and chopped
Salt and pepper
200g sirloin steak or rump steak, trimmed of fat
2 large eggs
Reduced carb meal
Family friendly
This is the first part of my regex, which matches the title and the serving number:
(\D*\n*)(?:\nServes )(\d\n)
I want to stop matching when it hits the categories at the end. I've worked out these may contain 'Reduced' 'Family' 'Quick' etc.
I have then tried to do this as so:
/(\D*\n*)(?:\nServes )(\d\n)((.*\n)*)(?:Carb|Reduced|Quick|Family)/
However, if there are two tags, as in the example, 'Reduced carb meal' is included in the match, because the regex continues until the 'Family' line.
Any help would be appreciated as I've been on this for 2 hours!
You simply need to change your non capturing group to a negative lookahead, like this:
/(\D*\n*)(?:\nServes )(\d\n)(.*(?:\n(?!Carb|Reduced|Quick|Family).*)*)/
Edited according to @bobble bubble's suggestion; give him the credit!
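As a rough check of the negative-lookahead version, here is a minimal Python sketch. It assumes Python's `re` module behaves like the asker's engine for this pattern and that the recipe is held in one multi-line string; the variable names are made up for illustration.

```python
import re

recipe = """Steak and Eggs
Serves 1
150g potatoes, diced
2 tsp olive oil
½ white onion, peeled and chopped
Salt and pepper
200g sirloin steak or rump steak, trimmed of fat
2 large eggs
Reduced carb meal
Family friendly
"""

# Same pattern as in the answer, without the /.../ delimiters.
pattern = re.compile(
    r"(\D*\n*)(?:\nServes )(\d\n)"
    r"(.*(?:\n(?!Carb|Reduced|Quick|Family).*)*)"
)

match = pattern.search(recipe)
title, serves, ingredients = match.groups()

print(title)                     # title line
print(serves.strip())            # serving count
print(ingredients.splitlines())  # ingredient lines only, no tag lines
```

The lookahead is tested at the start of every new line, so the third group stops before the first line that begins with one of the tag words instead of running on to the last one.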
I need to be able to populate a cell in a Google Sheets spreadsheet with the measurement units extracted from the end of a string value in another cell. The raw data comes through with every source cell ending with a measurement unit, either preceded with a numeric value or not, as in the example data below...
SAMPLE DATA:
Colgate Plax Spearmint Alcohol Free Mouthwash 500ml
Peckish Tangy BBQ Rice Crackers 100g
Alison's Pantry BBQ Chickpea Snacks kg
Yoghurt Raisins Miscellaneous Confectionery kg
Roasted Unsalted Supreme Mixed Nuts kg
Alison's Pantry Honey & Dijon Snippets kg
Banana Chips kg
Sealord Satay Tuna 95g
Sealord Savoury Onion Tuna 95g
Coca-Cola No Sugar Soft Drink 2.25l
Tongariro Natural Spring 15l
Trident Sweet Chilli Sauce With Ginger 285ml
Pams Lite Whole Egg Mayonnaise 443ml
Value Lite Milk 2l
Morning Harvest Caged Size 7 Eggs 12pk
EXPECTED RESULT:
![New column showing the measurement units][1]
CURRENT METHODOLOGY:
=IF(A1<>"",REGEXEXTRACT(A1,"^.*([a-zA-Z][a-zA-Z])$|^.*([a-zA-Z])$"),"")
CURRENT RESULT:
![Result being split over two columns][2]
While I can combine the two values into a third column using the expression =IF(B1<>"",B1,IF(C1<>"",C1,"")), this becomes messy, convoluted, and adds unnecessary columns. I would prefer to tweak the regular expression to return just a single value, either the one or two character measurement unit. I have no idea how to achieve this, though. Any help would be appreciated.
You could also make the pattern a bit more specific by matching either a digit or a space, and then capturing one of the units at the end of the string.
=IF(A1<>"",REGEXEXTRACT(A1, "[\d ]((?:m?l|[mk]?g|pk|[cm]?m))$"),"")
See a regex demo for the capture group values.
Match 1 optional letter, then 1 letter anchored to end:
=IF(A1<>"",REGEXEXTRACT(A1, "[a-zA-Z]?[a-zA-Z]$"),"")
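To try the more specific pattern outside of Sheets, here is a small Python sketch. It assumes Python's `re` module treats this pattern the same way as REGEXEXTRACT does; the helper name `extract_unit` is hypothetical.

```python
import re

# Pattern from the first answer: a digit or space, then a known unit at the end.
UNIT_AT_END = re.compile(r"[\d ]((?:m?l|[mk]?g|pk|[cm]?m))$")


def extract_unit(name: str) -> str:
    """Return the trailing measurement unit, or "" if none is found."""
    match = UNIT_AT_END.search(name) if name else None
    return match.group(1) if match else ""


samples = [
    "Colgate Plax Spearmint Alcohol Free Mouthwash 500ml",
    "Banana Chips kg",
    "Coca-Cola No Sugar Soft Drink 2.25l",
    "Morning Harvest Caged Size 7 Eggs 12pk",
]
for sample in samples:
    print(sample, "->", extract_unit(sample))
```

Because there is only one capture group, a single value comes back per row, which is what avoids the two-column result in the question.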
This question relates to PCRE regular expressions.
Part of my big dataset are address data like this:
12H MARKET ST. Canada
123 SW 4TH Street USA
ONE HOUSE USA
1234 Quantity Dr USA
123 Quality Court Canada
1234 W HWY 56A USA
12345 BERNARDO CNTR DRIVE Canada
12 VILLAGE PLAZA USA
1234 WEST SAND LAKE RD ?567 USA
1234 TELEGRAM BLVD SUITE D USA
1234-A SOUTHWEST FRWY USA
123 CHURCH STREET USA
123 S WASHINGTON USA
123 NW-SE BLVD USA
# USA
1234 E MAIN STREET USA
I would like to extract the street names including house numbers and additional information from these records. (Of course there are other things in those records and I already know how to extract them).
For the purpose of this question I just manually clipped the interesting part from the data for this example.
The number of words in the address parts is not known before. The only criterion I have found so far is to find the occurrence of country names belonging to some finite set, which of course is bigger than (USA|Canada). For brevity I limit my example just to those two countries.
This regular expression
([a-zA-Z0-9?\-#.]+\s)
already isolates the words making up what I am after, including one space after them. Unfortunately there are cases where the to-be-extracted street information is separated from the country by only a single space, e.g. in the first and in the last example.
Since I want to capture the matching parts glued together, I place a + sign behind my regular expression:
([a-zA-Z0-9?\-#.]+\s)+
but then in the two nasty cases with only one separating space before the country, the country is also caught!
Since I know the possible countries from looking at the data, I could try to exclude them by a look ahead-condition like this:
([a-zA-Z0-9?\-#.]+\s)(?!USA|Canada)
which excludes ST. from the match in the first line and STREET from the match in the last line. Of course the single capture groups are not yet glued together by this.
So I would add a plus sign to the group on the left:
([a-zA-Z0-9?\-#.]+\s)+(?!USA|Canada)
But then ST. and STREET, which are separated from the country by only a single space, are caught again together with the country, which I want to exclude from my result!
How would you proceed in such a case?
If it would be possible by properly using regular expressions to replace each country name by the same one preceded by an additional space (or even to do this only for cases, where there is only a single space in front of one of the country-names), my problem would be solved. But I want to avoid such a substitution for the whole database in a separate run because a country name might appear in some other column too.
I am quite new to regular expressions and I have no idea how to do two processing steps onto the same input in sequence. - But maybe, someone has a better idea how to cope with this problem.
If I understand correctly, you want all content before the country (excluding spaces before the country). The country will always be present at the end of the line and comes from a list.
So you should be able to set the 'global' and 'multiline' options and then use the following regex:
^(.*?)(?=\s+(USA|Canada)\s*$)
Explanation:
^(.*?) match as few characters as possible from the start of the line
(?=\s+(USA|Canada)\s*$) look ahead for one or more spaces, followed by one of the country names, followed by zero or more spaces and end of line.
That should give you a list with all addresses.
Edit:
I have changed the first part to: (.*?), making it non-greedy. That way the match will stop at the last letter before country instead of including some spaces.
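A quick way to try this is the Python sketch below. It assumes Python's `re` module is close enough to PCRE for this pattern; `re.MULTILINE` stands in for the 'multiline' option and `finditer` for 'global'. The sample records are a subset of the question's data, with extra spaces added to one line.

```python
import re

# Regex from the answer, unchanged.
ADDRESS = re.compile(r"^(.*?)(?=\s+(USA|Canada)\s*$)", re.MULTILINE)

records = """12H MARKET ST. Canada
123 SW 4TH Street      USA
# USA
1234 E MAIN STREET USA"""

# Group 1 holds the address without the spaces in front of the country.
addresses = [m.group(1) for m in ADDRESS.finditer(records)]
print(addresses)
```

The lazy `.*?` grows one character at a time and stops as soon as the lookahead succeeds, so the captured text ends at the last character before the run of spaces, whether that run is one space or several.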
The source I'm getting addresses from is inconsistent; it comes in 3 different forms:
100 rue des Commandeurs Lévis Québec Canada
100 rue des Commandeurs Lévis Québec G6V7N5 Canada
100 rue des Commandeurs Lévis Québec G6V 7N5 Canada
The First address and street part are always going to be different lengths. "Canada" provides a good anchor for finding the province but the challenge is the postal code is sometimes broken into 2, sometimes combined, and sometimes not there.
I have a solution but I'm looking for a better one. My solution was to extract the first three terms before Canada.
RegExExtract Address (\S+)\h(\S+)\h(\S+)\h+Canada
And analyze each phrase to see if it had a digit.
RegExExtract Phrase 1 (\d)
If RegEx Fails, Phrase 1 = Territory
If Success, RegExExtract Phrase 2 (\d)
If RegEx Fails, Phrase 2 = Territory
If Success, RegExExtract Phrase 3 (\d)
If RegEx Fails, Phrase 3 = Territory
If Success, "Something went wrong"
This works fine but I assume there's a better way.
Maybe,
(?i)(\S+)\h*(?:G[A-Z0-9]+\h?[A-Z0-9]+)?\h+Canada
might be somewhat close, yet maybe a better option would be to simply list those States in a capturing or non-capturing group, such as with:
(?i)(Québec|Ontario|British Columbia|Montreal|Victoria|Saskatchewan|Calgary|Newfoundland|Nova Scotia|Alberta)(?:\h+)?(G[A-Z0-9]+)?(?:\h+)?([A-Z0-9]+)?\h+Canada$
RegEx Demo
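The first pattern can be tried with a short Python sketch. Note the assumptions: Python's `re` module has no `\h`, so `[ \t]` is substituted for it here, and the function name `extract_province` is hypothetical. The leading `G` is kept from the answer, so as written the optional part only covers postal codes that start with G.

```python
import re

# First pattern from the answer, with \h replaced by [ \t] for Python.
PROVINCE = re.compile(
    r"(\S+)[ \t]*(?:G[A-Z0-9]+[ \t]?[A-Z0-9]+)?[ \t]+Canada",
    re.IGNORECASE,
)


def extract_province(address: str) -> str:
    """Return the word before the optional postal code and 'Canada'."""
    match = PROVINCE.search(address)
    return match.group(1) if match else ""


addresses = [
    "100 rue des Commandeurs Lévis Québec Canada",
    "100 rue des Commandeurs Lévis Québec G6V7N5 Canada",
    "100 rue des Commandeurs Lévis Québec G6V 7N5 Canada",
]
for address in addresses:
    print(extract_province(address))
```

Making the postal code an optional non-capturing group is what lets one expression handle the missing, joined, and split forms without the digit checks from the question. A single `(\S+)` cannot capture a two-word province name, which is one reason the explicit list in the second pattern may be the safer option.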
I am cleaning text in R. My text has the form
but he could not avoid the subject FULLSTOP \n\n\n\n\nsimilar pieces
by the author\n\n\nlife is great 13022015\nreal men don t eath quiche
22042013\nback to the future 01072012\n\n\n\n and as he takes the
stage here wednesday night to rally democrats around hillary clinton
mr FULLSTOP obama will revisit his own promise to guide the nation
into an era of reconciliation and unity harking back to the themes
that propelled his improbable rise but that seem even more out of
reach today FULLSTOP \n\n\n\n\nobama at convention to lay out stakes for
a divided nation \n\n\n\n we get frustrated with political gridlock worry
about racial divisions are shocked and saddened by the madness of
orlando or nice mr FULLSTOP
I'm trying to get rid of
\n\n\n\n\nsimilar pieces by the author\n\n\nlife is great 13022015\nreal men don t eath quiche 22042013\nback to the future 01072012\n\n\n\n
so to obtain something like
but he could not avoid the subject FULLSTOP and as he takes the stage
here wednesday night to rally democrats around hillary clinton mr
FULLSTOP obama will revisit his own promise to guide the nation into
an era of reconciliation and unity harking back to the themes that
propelled his improbable rise but that seem even more out of reach
today FULLSTOP \n\n\n\n\nobama at convention to lay out stakes for a
divided nation \n\n\n\n we get frustrated with political gridlock
worry about racial divisions are shocked and saddened by the madness
of orlando or nice mr FULLSTOP
I'm trying with something like
gsub("\\\n{3,}(similar pieces)?.*\\\n{3,}", "", my_string)
or
gsub("\\\n{3,}(similar pieces)?.*?\\\n{3,}", "", my_string)
But it overtrims or does not work.
Any help (as well as an explanation of what I'm doing wrong and why the alternative works) would be very appreciated.
You need to match everything between the first 5 newline symbols up to the first 4 newline symbols.
I suggest the regex " *\n{5}.*?\n{4} *":
" *" - zero or more literal spaces
\n{5} - 5 newline symbols
.*? - zero or more of any character, as few as possible, up to the first...
\n{4} - 4 newline (LF) symbols
" *" - zero or more literal spaces (just to trim the match)
and replace with a space.
Use sub since you only need 1 replacement:
sub(" *\n{5}.*?\n{4} *", " ", s)
See R demo
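For comparison, the same replacement can be sketched in Python. This is an illustration under assumptions, not the R code itself: `re.DOTALL` is added because Python's `.` does not match a newline by default (R's default engine, as far as I know, does), and `count=1` plays the role of `sub` versus `gsub`. The input is a shortened version of the question's text.

```python
import re

text = (
    "but he could not avoid the subject FULLSTOP \n\n\n\n\nsimilar pieces "
    "by the author\n\n\nlife is great 13022015\nreal men don t eath quiche "
    "22042013\nback to the future 01072012\n\n\n\n and as he takes the stage"
)

# Replace only the first block: 5 newlines, anything (lazily), then 4 newlines.
cleaned = re.sub(r" *\n{5}.*?\n{4} *", " ", text, count=1, flags=re.DOTALL)
print(cleaned)
```

The lazy `.*?` skips over the run of three newlines after "by the author" because `\n{4}` cannot match there, and it stops at the first run of four. A greedy `.*` would instead run on to the last run of newlines in the text, which is the overtrimming seen in the question.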
I've found several questions that touch on this, but none that seem to answer it. I am trying to build a Regex that will allow me to identify Proper Nouns in a group of text.
I am defining a Proper Noun as follows: a word or group of words that begin with a capital letter, are longer than 1 character (to exclude things like I, A, etc.), and are NOT the first word of a new sentence.
So, in the following text
"Susan Dow stayed at the Holiday Inn on Thursday. She met Tom and Shirley Temple at the bar where they ordered Green Eggs and Ham"
I would want the following returned
Holiday Inn
Thursday
Tom
Shirley Temple
Green Eggs
Ham
Right now, [A-Z]{1,1}[a-z]*([\s][A-Z]{1,1}[a-z]*)* is what I have, but it's returning Susan Dow and She in addition to the ones listed above. How can I get my . look-up to work?
You can use:
(?<!^|\. |\.  )[A-Z][a-z]+
per this rubular
Update: Integrated the two negative looks using alternation. Also added check for two spaces between sentences. Note that repetition operators cannot be used in negative lookbehinds per notes in http://www.regular-expressions.info/lookaround.html
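The same idea can be sketched in Python, with one adaptation: Python's `re` module rejects alternatives of different lengths inside a single lookbehind, so the three conditions are written here as three separate negative lookbehinds. The second pattern, which also glues consecutive capitalized words together, is an extension that is not part of the answer above.

```python
import re

text = (
    "Susan Dow stayed at the Holiday Inn on Thursday. She met Tom and "
    "Shirley Temple at the bar where they ordered Green Eggs and Ham"
)

# Not at the start of the text, and not after ". " or ".  " (two spaces).
NOT_SENTENCE_START = r"(?<!^)(?<!\. )(?<!\.  )"

# Single capitalized words, as in the answer.
single_words = re.findall(NOT_SENTENCE_START + r"[A-Z][a-z]+", text)

# Extension (assumption): join runs of capitalized words separated by one space.
phrases = re.findall(NOT_SENTENCE_START + r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", text)

print(single_words)
print(phrases)
# phrases -> ['Dow', 'Holiday Inn', 'Thursday', 'Tom', 'Shirley Temple', 'Green Eggs', 'Ham']
```

"She" and "Susan" are rejected because they start a sentence, but "Dow" still appears: the lookbehinds only look at what directly precedes each candidate word, so the second word of a sentence-initial name is not excluded.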