Splitting a string that is not uniform with commas


I have a strange problem that I am not sure how to solve.
My goal is to be able to split a string that has certain ingredients that are separated by commas into an array so that each element in the array is an ingredient. However, some of the strings that I will come across that list ingredients have a list within the list that looks like this:
WATER, CORN SYRUP AND 2% OR LESS OF EACH OF THE FOLLOWING: CONCENTRATED JUICES (ORANGE, TANGERINE, APPLE, LIME, GRAPEFRUIT), CITRIC ACID, MALIC ACID, ASCORBIC ACID (VITAMIN C), THIAMIN HYDROCHOLORIDE (VITAMIN B1), NATURAL FLAVORS. MODIFIED CORNSTARCH, CANOLA OIL, CELLULOSE GUM, SUCRALOSE, SODIUM HEXAMETAPHOSPHATE, POTASSIUM SORBATE TO PROTECT FLAVOR, YELLOW #5, YELLOW #6 AND DISODIUM EDTA TO PROTECT COLOR.
As you can see, there is a portion of the string that says "2% OR LESS OF EACH OF THE FOLLOWING: CONCENTRATED JUICES (ORANGE, TANGERINE, APPLE, LIME, GRAPEFRUIT)". If the delimiter for the split method is a simple comma, then one of the ingredients will be "2% OR LESS OF EACH OF THE FOLLOWING: CONCENTRATED JUICES (ORANGE" which does not look correct. My goal is to get that entire portion of the string into one element e.g. the element should be "2% OR LESS OF EACH OF THE FOLLOWING: CONCENTRATED JUICES (ORANGE, TANGERINE, APPLE, LIME, GRAPEFRUIT)".
Thank you for taking the time to look at my question!

Give this a shot:
\,+(?![^\(]*\))
It seems to work in JavaScript against your example.
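For readers working outside JavaScript, here is a minimal Python sketch of the same idea: split only on commas that are not followed by a closing parenthesis before the next opening one. The name `split_ingredients` and the shortened sample string are illustrative assumptions, and the pattern assumes parentheses are balanced and not nested.

```python
import re

# A comma (plus any whitespace after it) is treated as a separator only when
# the text that follows does not reach a ")" before reaching a "(".
TOP_LEVEL_COMMA = re.compile(r",\s*(?![^()]*\))")

def split_ingredients(text):
    """Split on commas outside parentheses; assumes balanced, non-nested parentheses."""
    return [part.strip() for part in TOP_LEVEL_COMMA.split(text) if part.strip()]

sample = (
    "WATER, CONCENTRATED JUICES (ORANGE, TANGERINE, APPLE), "
    "CITRIC ACID, ASCORBIC ACID (VITAMIN C)"
)
for ingredient in split_ingredients(sample):
    print(ingredient)
```

Note that this only handles commas. The period in "NATURAL FLAVORS. MODIFIED CORNSTARCH" in the original string would still leave those two items in one element.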

Related

Extract the last 1 or 2 alphabetic character(s) from a quantity (mg,g,ml,l,cm,mm,m) when preceded by a digit

I need to be able to populate a cell in a Google Sheets spreadsheet with the measurement units extracted from the end of a string value in another cell. The raw data comes through with every source cell ending with a measurement unit, either preceded with a numeric value or not, as in the example data below...
SAMPLE DATA:
Colgate Plax Spearmint Alcohol Free Mouthwash 500ml
Peckish Tangy BBQ Rice Crackers 100g
Alison's Pantry BBQ Chickpea Snacks kg
Yoghurt Raisins Miscellaneous Confectionery kg
Roasted Unsalted Supreme Mixed Nuts kg
Alison's Pantry Honey & Dijon Snippets kg
Banana Chips kg
Sealord Satay Tuna 95g
Sealord Savoury Onion Tuna 95g
Coca-Cola No Sugar Soft Drink 2.25l
Tongariro Natural Spring 15l
Trident Sweet Chilli Sauce With Ginger 285ml
Pams Lite Whole Egg Mayonnaise 443ml
Value Lite Milk 2l
Morning Harvest Caged Size 7 Eggs 12pk
EXPECTED RESULT:
![New column showing the measurement units][1]
CURRENT METHODOLOGY:
=IF(A1<>"",REGEXEXTRACT(A1,"^.*([a-zA-Z][a-zA-Z])$|^.*([a-zA-Z])$"),"")
CURRENT RESULT:
![Result being split over two columns][2]
While I can combine the two values into a third column using the expression =IF(B1<>"",B1,IF(C1<>"",C1,"")), this becomes messy, convoluted, and adds unnecessary columns. I would prefer to tweak the regular expression to return just a single value, either the one or two character measurement unit. I have no idea how to achieve this, though. Any help would be appreciated.

You could also make the pattern a bit more specific by matching either a digit or a space, and then capturing one of the units at the end of the string.
=IF(A1<>"",REGEXEXTRACT(A1, "[\d ]((?:m?l|[mk]?g|pk|[cm]?m))$"),"")
See a regex demo for the capture group values.
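To check the capture group outside a spreadsheet, the same pattern can be tried in Python. This is only a sketch: Google Sheets uses the RE2 engine rather than Python's `re`, although this particular pattern uses syntax common to both, and the helper name `extract_unit` is an assumption for illustration.

```python
import re

# Same pattern as in the formula: a digit or a space, then one known unit,
# anchored to the end of the string. Only the unit is captured.
UNIT_AT_END = re.compile(r"[\d ]((?:m?l|[mk]?g|pk|[cm]?m))$")

def extract_unit(product_name):
    """Return the trailing measurement unit, or "" when none is recognised."""
    match = UNIT_AT_END.search(product_name)
    return match.group(1) if match else ""

samples = [
    "Colgate Plax Spearmint Alcohol Free Mouthwash 500ml",
    "Banana Chips kg",
    "Coca-Cola No Sugar Soft Drink 2.25l",
    "Morning Harvest Caged Size 7 Eggs 12pk",
]
for name in samples:
    print(name, "->", extract_unit(name))
```

Because the units are listed explicitly, a name that ends in ordinary letters (with no unit) yields an empty string instead of its last two characters.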

Match 1 optional letter, then 1 letter anchored to the end:
=IF(A1<>"",REGEXEXTRACT(A1, "[a-zA-Z]?[a-zA-Z]$"),"")

Is there an error-proof way in google sheets to extract House numbers from address cell (street + house number) into another cell

In Sheet1!AK2:AK I have addresses in the following formats:
rotenkamper weg, 323, Kirchstieg 2345, Im Schleedörn 20b
I need the street names to export into Sheet2!C3:C, i.e:
rotenkamper weg, Kirchenstieg, Im Schleedörn
The House numbers have to go into Sheet2!D3:D.
I have researched and tried for hours but couldn't find a solution that could fetch the house numbers including the letter i.e. 20b or if the number is a range 24-27.
Also, I have a lot of trouble getting it to work when the street name consists of two or more words.
Does anyone know an elegant solution for this?
Any help would be much appreciated. This will save me weeks of data entry work.

Try this in Sheet2!C3:
=ARRAYFORMULA(
{
REGEXREPLACE(REGEXREPLACE(Sheet1!AK2:AK, "\s+\S*\d\S*\b", ""), ",+", ","),
IFNA(REGEXEXTRACT(Sheet1!AK2:AK, "\S+$"))
}
)
Explanation:
REGEXREPLACE(Sheet1!AK2:AK, "\s+\S*\d\S*\b", "") removes any "word" that has a digit in it. All of these (323, 2345, 20b) will be gone.
REGEXREPLACE(..., ",+", ",") cleans up any run of consecutive commas that may be left after the removal in the first step. This will be the value for the first column.
IFNA(REGEXEXTRACT(Sheet1!AK2:AK, "\S+$")) just gets whatever is at the end of the address string, from the last space to the end. This will be the value for the second column.
{value_for_the_first_column, value_for_the_second_column} placed in the C3 cell will populate C3 with value_for_the_first_column and D3 with value_for_the_second_column.
ARRAYFORMULA will do all of the above for every row.
Regex pattern could be refined if you provide more than one example of the address.
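The two regex steps can be tried outside Sheets with a small Python sketch. It is an approximation only: Sheets uses RE2 while Python uses its own `re` engine, the helper name `split_address` is an assumption, and the "24-27" address is an invented example of the range case mentioned in the question.

```python
import re

def split_address(address):
    """Mimic the formula: return (street, house_number) for one address."""
    # Step 1: drop every whitespace-preceded "word" that contains a digit.
    street = re.sub(r"\s+\S*\d\S*\b", "", address)
    # Step 2: collapse runs of commas left behind by step 1.
    street = re.sub(r",+", ",", street)
    # Second column: whatever follows the last whitespace in the original.
    number = re.search(r"\S+$", address)
    return street, (number.group(0) if number else "")

for address in ["Kirchstieg 2345", "Im Schleedörn 20b", "Musterweg 24-27"]:
    print(split_address(address))
```

Two behaviours are worth knowing when testing on real data: an address written as "rotenkamper weg, 323" keeps its trailing comma in the street column, and an address with no number at all returns its last word in the number column, because `\S+$` does not require a digit.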

Regexmatch in Google Sheet to identify cells that include any string in another sheet

I have a ColumnA where each cell includes multiple values separated by commas, e.g.:
Elvis Costello, Madonna
Bob, Elvis Presley, Morgan Stanley
Frank, Morgan Stanley, Madonna Ford,
Elvis Costello, Madonna Ford
And I want to identify which rows/cells include any of the exact terms in another sheet/column, e.g.:
Elvis Presley
Madonna
I found this simple solution using REGEXMATCH (the last solution on the page "Is there a way to REGEXMATCH from a range of cells from A1:A1000 for example?"):
Say you want to search for a match from a list of cities.
Put your list of cities in one tab.
Make them into lowercase for easier lookup since search terms are all in lowercase. You can do this by adding a new column and using the LOWER function.
Go back to your cell that has the list of search phrases.
In any blank cell out of the way (off to the side on the top row is a good place) put this formula: CITY LIST FORMULA: =TEXTJOIN("|",1,'vlookup city'!B$2:B$477) (if your tab is named 'vlookup city' and your cities are in column B of that tab)
Add a new column next to your search terms, or pick an existing one where you want to put your "match found" info.
In that new column, add this formula (if your data starts in row 4 and you put the City List formula in cell G3:) =REGEXMATCH(A4,G$4)
Fill the formula all the way down your list. You can double-click the little blue square in the bottom right corner of the cell, or grab-and-drag all the way to the bottom of the list.
Ba-ding! It will search for any one of those city names, anywhere in your search phrase.
If the search phrase contains at least one matching term, it will return "True."
You can then add extra features on your formula to make it return something else. For example: =IF(REGEXMATCH(A4,G$4), "match found", "no match found")
This is a super lightweight solution that won't slow your sheet down too much and is easy to use.
https://docs.google.com/spreadsheets/d/1XAIDB98r2CGu7hL3ISirErDPNlgT6lVt-TCG0qI1uTE/edit?usp=sharing
The problem is that the REGEXMATCH solution also identifies "Elvis Costello" and "Madonna Ford", and I only want to identify cells/rows that include the exact term to match, i.e. "Elvis Presley" and "Madonna". In other words, whatever is between the commas has to be an exact match with one of the search terms, not just a partial one.
I hope it made sense:)
Thanks all!

I think I might have found the answer, but I am still double-checking that it's correct.
I added \b before and after. So in the example sheet re-posted in the quoted part of my question I changed the cell:
Cell B3:
=TEXTJOIN("|",1,'vlookup city'!B$2:B$476)
and added another cell like this:
Cell B2:
=concatenate("\b(",$B$3,")\b")
Still checking if all false flags are removed.
Thanks
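While double-checking, it may help to run the `\b` version against the sample rows from the question. A word boundary also exists between "Madonna" and the space before "Ford", so that variant appears to still flag "Madonna Ford". The Python sketch below compares it with a pattern anchored on the commas instead. The function name and the comma-anchored pattern are assumptions for illustration, not something taken from the linked sheet, and Python's `re` stands in for the RE2 engine that Sheets uses.

```python
import re

def build_patterns(terms):
    """Return (word_boundary_pattern, whole_entry_pattern) for the search terms."""
    alternation = "|".join(re.escape(term) for term in terms)
    # The approach from the post: terms wrapped in \b ... \b.
    word_boundary = re.compile(rf"\b(?:{alternation})\b")
    # Assumed alternative: the term must fill a whole comma-separated entry.
    whole_entry = re.compile(rf"(?:^|,)\s*(?:{alternation})\s*(?:,|$)")
    return word_boundary, whole_entry

rows = [
    "Elvis Costello, Madonna",
    "Bob, Elvis Presley, Morgan Stanley",
    "Frank, Morgan Stanley, Madonna Ford,",
    "Elvis Costello, Madonna Ford",
]
word_boundary, whole_entry = build_patterns(["Elvis Presley", "Madonna"])
boundary_hits = [bool(word_boundary.search(row)) for row in rows]
entry_hits = [bool(whole_entry.search(row)) for row in rows]
print(boundary_hits)
print(entry_hits)
```

If the comma-anchored behaviour is what is wanted, the Sheets equivalent would presumably wrap the TEXTJOIN result in `(^|,)\s*(` and `)\s*(,|$)` instead of `\b(` and `)\b`, though that has not been tested in the sheet itself.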

How do I count emoji and symbols in a cell?

What formula can I use to get a count of emoji and characters in a single cell?
For example, in cells A1, A2 and A3:
🙌🙌🙌
🤜✋️👈🤜🤜
??👊👊👊
Total count of characters in each cell (desired output):
3
5
5

For the given emojis, this will work well:
=LEN(REGEXREPLACE(A13,".","."))
MID/LEN considers each emoji as 2 separate characters.
REGEX will consider them as one.
But even REGEX will fail with a complex emoji like this:
👨‍👩‍👧‍👦
This contains a literal man emoji 👨, a woman emoji 👩, a girl emoji 👧 and a boy emoji 👦, all joined by a zero-width joiner. You could even swap the boy for another girl with this formula:
=SUBSTITUTE("‍👨‍👩‍👧‍👦","👦","👧")
It'll become like this:
‍👨‍👩‍👧‍👧

=COUNTA(FILTER(
SPLIT(REGEXREPLACE(A1,"(.)","#$1"),"#"),
SPLIT(REGEXREPLACE(A1,"(.)","#$1"),"#")<>""
))
Based on the answer by #I'-'I
Some emojis consist of multiple emojis joined by CHAR(8205):
👨‍👩‍👧‍👦‍👦‍👆
The result differs depending on the browser you use.
I wonder, how do we count them?
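The three ways of counting discussed in these answers can be compared in a short Python sketch: UTF-16 code units (what the answers say LEN reports), code points (what the regex approach counts), and a rough count that treats a joiner-linked sequence as one emoji. The helper names are assumptions, the strings are built from escapes so the exact code points are visible, and the second sample assumes the raised-hand emoji carries a variation selector (U+FE0F). The last function is a simplification that ignores skin-tone modifiers, flags and keycaps; full grapheme segmentation follows the much larger Unicode rules in UAX #29.

```python
ZWJ = "\u200d"   # zero-width joiner, CHAR(8205) in Sheets
VS16 = "\ufe0f"  # variation selector-16, often attached to older symbols

def utf16_units(text):
    """Number of UTF-16 code units; emoji outside the BMP take two each."""
    return len(text.encode("utf-16-le")) // 2

def count_joined_emoji(text):
    """Rough count: skip joiners, selectors, and anything glued on by a ZWJ."""
    count = 0
    previous = ""
    for ch in text:
        if ch not in (ZWJ, VS16) and previous != ZWJ:
            count += 1
        previous = ch
    return count

hands = "\U0001F64C" * 3
mixed = "\U0001F91C\u270B\uFE0F\U0001F448\U0001F91C\U0001F91C"
family = ZWJ.join(["\U0001F468", "\U0001F469", "\U0001F467", "\U0001F466"])

for label, text in [("hands", hands), ("mixed", mixed), ("family", family)]:
    print(label, utf16_units(text), len(text), count_joined_emoji(text))
```

In Python, `len` already counts code points, so the family emoji reports 7 (four people plus three joiners) even though it renders as one glyph where the font supports it.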

Regex for words that don't differ by only one letter

I want to create series of puzzle games where you change one letter in a word to create a new word, with the aim of reaching a given target word. For example, to change "this" to "that":
this
thin
than
that
What I want to do is create a regex which will scan a list of words and choose all those that do not match the current word by all but one letter. For example, if my starting word is "pale" and my list of words is...
pale
male
sale
tale
pile
pole
pace
page
pane
pave
palm
peal
leap
play
help
pack
... I want all the words from "peal" to "pack" to be selected. This means that I can delete them from my list, leaving only the words that could be the next match. (It's OK for "pale" itself to be unselected.)
I can do this in parts:
^.(?!ale).{3}\n selects words not like "*ale"
^.(?<!p).{3}\n|^.{2}(?!le).{2}\n selects words not like "p*le"
^.{2}(?<!pa).{2}\n|^.{3}(?!e).\n selects words not like "pa*e"
^.{3}(?<!pal).\n selects words not like "pal*".
However, when I put them together...
^.(?!ale).{3}\n|^.(?<!p).{3}\n|^.{2}(?!le).{2}\n|^.{2}(?<!pa).{2}\n|^.{3}(?!e).\n|^.{3}(?<!pal).\n
... everything but "pale" is matched.
I need some way to create an AND relationship between the different regexes, or (more likely) a completely different approach.

You can use the Python regex module that allows fuzzy matching:
>>> import regex
>>> regex.findall(r'(?:pale){s<=1}', "male sale tale pile pole pace page pane pave palm peal leap play help pack")
['male', 'sale', 'tale', 'pile', 'pole', 'pace', 'page', 'pane', 'pave', 'palm']
In this case, zero or one substitution counts as a match.
Or consider the TRE library and the command line agrep which supports a similar syntax.
Given:
$ echo $s
male sale tale pile pole pace page pane pave palm peal leap play help pack
You can filter to a list of a single substitution:
$ echo $s | tr ' ' '\n' | agrep '(?:pale){ 1s <2 }'
male
sale
tale
pile
pole
pace
page
pane
pave
palm

Here's a solution that uses cool Python tricks and no regex:
def almost_matches(word1, word2):
    # Count positions where the letters agree; a neighbour agrees in all but one.
    return sum(map(str.__eq__, word1, word2)) == len(word1) - 1
for word in "male sale tale pile pole pace page pane pave palm peal leap play help pack".split():
    print(word, almost_matches("pale", word))

A completely different approach: Levenshtein distance
...the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
PHP example:
$words = array(
    "pale",
    "male",
    "sale",
    "tale",
    "pile",
    "pole",
    "pace",
    "page",
    "pane",
    "pave",
    "palm",
    "peal",
    "leap",
    "play",
    "help",
    "pack"
);
foreach ($words as $word) {
    if (levenshtein("pale", $word) > 1) {
        echo $word . "\n";
    }
}
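For comparison with the Python answers in this thread, the same filter can be sketched in Python. The standard library has no Levenshtein function, so the version below is a hand-written dynamic-programming implementation offered as an illustration, not a port of PHP's internals.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions or substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]

words = ("pale male sale tale pile pole pace page pane pave palm "
         "peal leap play help pack").split()
too_far = [word for word in words if levenshtein("pale", word) > 1]
print(too_far)
```

One caveat of this approach: a distance of 1 also covers a single insertion or deletion, so with a word list of mixed lengths, "pal" or "pales" would count as neighbours of "pale" unless word length is checked separately.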

This assumes the word on the first line is the keyword. Just a brute force parallel letter-match and count gets the job done:
awk 'BEGIN{FS=""}
NR==1{n=NF;for(i=1;i<=n;++i)c[i]=$i}
NR>1{j=0;for(i=1;i<=n;++i)j+=c[i]==$i;if(j<n-1)print}'
A general regexp solution would need two steps, I think: generate the regexp from the keyword in the first step, then run it against the file in the second.
By the way, the way to do an "and" of regexps is to string negative lookaheads together (and they don't need to be as complicated as the ones you had above, I think):
^(?!.ale)(?!p.le)(?!pa.e)(?!pal.)
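The two-step idea (generate the pattern from the keyword, then run it over the list) might look like this in Python. It is a sketch: the function name is an assumption, and the generated pattern differs slightly from the one above in that each lookahead ends in `$` and the pattern finishes with `.+$`, so that whole lines are matched and longer words starting with a neighbour are not excluded by accident.

```python
import re

def build_non_neighbour_pattern(keyword):
    """Build a regex matching lines that are NOT one substitution away from keyword."""
    lookaheads = []
    for i in range(len(keyword)):
        # Replace the i-th letter with a wildcard: .ale, p.le, pa.e, pal.
        variant = re.escape(keyword[:i]) + "." + re.escape(keyword[i + 1:])
        lookaheads.append(f"(?!{variant}$)")
    return re.compile("^" + "".join(lookaheads) + ".+$", re.MULTILINE)

word_list = "\n".join(
    "pale male sale tale pile pole pace page pane pave palm "
    "peal leap play help pack".split()
)
pattern = build_non_neighbour_pattern("pale")
print(pattern.pattern)
print(pattern.findall(word_list))
```

Since "pale" itself matches every wildcard variant, it is left unselected along with its neighbours, which the question says is acceptable.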
