How do I get all values inside CDATA through a regular expression? - regex

[CDATA[The presence of proteins can be detected by the Biuret test.<br/>
(i) Crush the food sample and put some of it into a clean test tube. Add some
sodium hydroxide solution.<br/>
(ii) Cork the test tube and shake it to mix the food with the sodium hydroxide
solution. Then add a little copper sulphate solution. Cork and shake it
again.<br/>If the solution turns blue, there is no protein in the food. But it turns violet, there is protein in the food.]]

\[[^\[\]]*?\]
Try this. See the demo:
http://regex101.com/r/sA8iT4/4
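For a quick check outside regex101, the same idea can be sketched in Python. This is only a sketch, assuming the input looks like the sample above (a single [CDATA[...]] block with no square brackets inside the payload). The capture group and the shortened `sample` string are my additions, not part of the original pattern.

```python
import re

# Shortened stand-in for the sample above (assumption: no brackets inside the payload).
sample = "[CDATA[The presence of proteins can be detected.<br/>\n(i) Crush the food sample.]]"

# Same character class as the pattern above, plus a capture group so that
# only the text between the innermost brackets is returned.
values = re.findall(r"\[([^\[\]]*)\]", sample)
print(values)
```

Because the negated character class also matches newlines, the multi-line payload comes back as a single string.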

Related

Extract the last 1 or 2 alphabetic character(s) from a quantity (mg,g,ml,l,cm,mm,m) when preceded by a digit

I need to be able to populate a cell in a Google Sheets spreadsheet with the measurement units extracted from the end of a string value in another cell. The raw data comes through with every source cell ending with a measurement unit, either preceded with a numeric value or not, as in the example data below...
SAMPLE DATA:
Colgate Plax Spearmint Alcohol Free Mouthwash 500ml
Peckish Tangy BBQ Rice Crackers 100g
Alison's Pantry BBQ Chickpea Snacks kg
Yoghurt Raisins Miscellaneous Confectionery kg
Roasted Unsalted Supreme Mixed Nuts kg
Alison's Pantry Honey & Dijon Snippets kg
Banana Chips kg
Sealord Satay Tuna 95g
Sealord Savoury Onion Tuna 95g
Coca-Cola No Sugar Soft Drink 2.25l
Tongariro Natural Spring 15l
Trident Sweet Chilli Sauce With Ginger 285ml
Pams Lite Whole Egg Mayonnaise 443ml
Value Lite Milk 2l
Morning Harvest Caged Size 7 Eggs 12pk
EXPECTED RESULT:
![New column showing the measurement units][1]
CURRENT METHODOLOGY:
=IF(A1<>"",REGEXEXTRACT(A1,"^.*([a-zA-Z][a-zA-Z])$|^.*([a-zA-Z])$"),"")
CURRENT RESULT:
![Result being split over two columns][2]
While I can combine the two values into a third column using the expression =IF(B1<>"",B1,IF(C1<>"",C1,"")), this becomes messy, convoluted, and adds unnecessary columns. I would prefer to tweak the regular expression to return just a single value, either the one or two character measurement unit. I have no idea how to achieve this, though. Any help would be appreciated.
You could also make the pattern a bit more specific by matching either a digit or a space, and capturing one of the units at the end of the string.
=IF(A1<>"",REGEXEXTRACT(A1, "[\d ]((?:m?l|[mk]?g|pk|[cm]?m))$"),"")
See a regex demo for the capture group values.
Alternatively, match one optional letter followed by one letter anchored to the end of the string:
=IF(A1<>"",REGEXEXTRACT(A1, "[a-zA-Z]?[a-zA-Z]$"),"")
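To see why the original formula spills into two columns, it may help to look at the capture groups directly. The sketch below uses Python's `re` module rather than Google Sheets (whose regex engine may differ in details), and `extract_unit` is a hypothetical helper name. The original pattern has two capture groups, one per alternative, so one of them is always empty; the unit-specific pattern has a single group.

```python
import re

# The asker's pattern: two alternatives, each with its own capture group.
ORIGINAL = re.compile(r"^.*([a-zA-Z][a-zA-Z])$|^.*([a-zA-Z])$")
# The unit-specific pattern from the answer above: a single capture group.
UNITS = re.compile(r"[\d ]((?:m?l|[mk]?g|pk|[cm]?m))$")

def extract_unit(text):
    """Return the trailing unit, or "" when none is found (hypothetical helper)."""
    match = UNITS.search(text)
    return match.group(1) if match else ""

# Two groups come back, one of them empty: that is the two-column result.
print(ORIGINAL.search("Value Lite Milk 2l").groups())
# One group, one value.
print(extract_unit("Value Lite Milk 2l"))
```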

My formula only keeps one word and not anything after spaces

I am basically trying to omit all illegal characters and numbers from a soft drinks column. I currently have something like this:
Soft Drinks
Dr Pepper;1234
Pepsi369
Coca Cola
Red Bull
Mountain Dew;11
Gatorade
Fanta
Crush Soda456
Essentially I want something like this:
Soft Drinks
Dr Pepper
Pepsi
Coca Cola
Red Bull
Mountain Dew
Gatorade
Fanta
Crush Soda
I tried using this formula: =ARRAYFORMULA(REGEXEXTRACT(A1:A9&"", "[a-zA-Z]+")), but I only get the first word of each entry, as shown below:
Soft
Dr
Pepsi
Coca
Red
Mountain
Gatorade
Fanta
Crush
I am not sure where I went wrong. I even tried changing the regex like this, and it still doesn't work: =ARRAYFORMULA(REGEXEXTRACT(A1:A9&"", "[a-zA-Z]+[a-zA-Z]+"))
This regex should work: [a-zA-Z\s]+
=ARRAYFORMULA(REGEXEXTRACT(A1:A9&"", "[a-zA-Z\s]+"))
I only added \s so that space between words would also be included in your pattern. You can also use [a-zA-Z ]+, as using \s would match any whitespace character.
Use this formula instead; it should work better for your case:
=ARRAYFORMULA(IFERROR(REGEXREPLACE(B2:B,"[\d]+$|;","")))
See the problem with the other formula here.
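The difference between the three patterns can be compared side by side. This is a sketch in Python's `re` module, not Google Sheets, and the `drinks` list is a subset of the sample column; it only illustrates how the patterns behave on that data.

```python
import re

drinks = ["Dr Pepper;1234", "Pepsi369", "Coca Cola", "Mountain Dew;11", "Crush Soda456"]

# Letters only: the match stops at the first space, so only the first word survives.
first_word = [re.search(r"[a-zA-Z]+", d).group() for d in drinks]

# Letters and spaces: the match runs until the first digit or semicolon.
with_spaces = [re.search(r"[a-zA-Z ]+", d).group() for d in drinks]

# Replacement approach: strip trailing digits and any semicolon instead of extracting.
stripped = [re.sub(r"\d+$|;", "", d) for d in drinks]

print(first_word)
print(with_spaces)
print(stripped)
```

On this sample the extraction with spaces and the replacement give the same result; they would differ for a name that contains a digit or other character in the middle.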

Regex for words that don't differ by only one letter

I want to create series of puzzle games where you change one letter in a word to create a new word, with the aim of reaching a given target word. For example, to change "this" to "that":
this
thin
than
that
What I want to do is create a regex which will scan a list of words and select all those that differ from the current word by more than one letter. For example, if my starting word is "pale" and my list of words is...
pale
male
sale
tale
pile
pole
pace
page
pane
pave
palm
peal
leap
play
help
pack
... I want all the words from "peal" to "pack" to be selected. This means that I can delete them from my list, leaving only the words that could be the next match. (It's OK for "pale" itself to be unselected.)
I can do this in parts:
^.(?!ale).{3}\n selects words not like "*ale"
^.(?<!p).{3}\n|^.{2}(?!le).{2}\n selects words not like "p*le"
^.{2}(?<!pa).{2}\n|^.{3}(?!e).\n selects words not like "pa*e"
^.{3}(?<!pal).\n selects words not like "pal*".
However, when I put them together...
^.(?!ale).{3}\n|^.(?<!p).{3}\n|^.{2}(?!le).{2}\n|^.{2}(?<!pa).{2}\n|^.{3}(?!e).\n|^.{3}(?<!pal).\n
... everything but "pale" is matched.
I need some way to create an AND relationship between the different regexes, or (more likely) a completely different approach.
You can use the Python regex module that allows fuzzy matching:
>>> import regex
>>> regex.findall(r'(?:pale){s<=1}', "male sale tale pile pole pace page pane pave palm peal leap play help pack")
['male', 'sale', 'tale', 'pile', 'pole', 'pace', 'page', 'pane', 'pave', 'palm']
Here, {s<=1} means that a candidate with zero or one substitutions counts as a match.
Or consider the TRE library and the command line agrep which supports a similar syntax.
Given:
$ echo $s
male sale tale pile pole pace page pane pave palm peal leap play help pack
You can filter to a list of a single substitution:
$ echo $s | tr ' ' '\n' | agrep '(?:pale){ 1s <2 }'
male
sale
tale
pile
pole
pace
page
pane
pave
palm
Here's a solution that uses cool python tricks and no regex:
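def almost_matches(word1, word2):
    # True when the two words agree in exactly three positions,
    # i.e. exactly one letter of a four-letter word differs.
    return sum(map(str.__eq__, word1, word2)) == 3

for word in "male sale tale pile pole pace page pane pave palm peal leap play help pack".split():
    print(word, almost_matches("pale", word))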
A completely different approach: Levenshtein distance
...the Levenshtein distance between two words is the minimum number of single-character edits (i.e. insertions, deletions or substitutions) required to change one word into the other.
PHP example:
$words = array(
    "pale",
    "male",
    "sale",
    "tale",
    "pile",
    "pole",
    "pace",
    "page",
    "pane",
    "pave",
    "palm",
    "peal",
    "leap",
    "play",
    "help",
    "pack"
);
foreach ($words as $word)
    if (levenshtein("pale", $word) > 1)
        echo $word."\n";
This assumes the word on the first line is the keyword. Just a brute force parallel letter-match and count gets the job done:
awk 'BEGIN{FS=""}
NR==1{n=NF;for(i=1;i<=n;++i)c[i]=$i}
NR>1{j=0;for(i=1;i<=n;++i)j+=c[i]==$i;if(j<n-1)print}'
A regexp general solution would need to be a 2-stepper I think -- generate the regexp in first step (from the keyword), run the regexp against the file in the second step.
By the way, the way to do an "and" of regexps is to string lookaheads together (and the lookaheads don't need to be as complicated as the ones above, I think):
^(?!.ale)(?!p.le)(?!pa.e)(?!pal.)
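The two-step idea (generate the lookaheads from the keyword, then run the result against the list) could look roughly like this in Python. It is a sketch under a few assumptions: one word per line, `not_neighbor_pattern` is a hypothetical helper name, and I added `$` inside each lookahead plus a trailing `.+$` so that whole words are selected rather than just the start of the line.

```python
import re

def not_neighbor_pattern(keyword):
    """Build a regex matching lines that are NOT the keyword with one letter wildcarded."""
    lookaheads = []
    for i in range(len(keyword)):
        # Replace position i with "." to get e.g. ".ale", "p.le", "pa.e", "pal."
        variant = re.escape(keyword[:i]) + "." + re.escape(keyword[i + 1:])
        lookaheads.append(f"(?!{variant}$)")
    # All negative lookaheads must pass (logical AND) before the line is consumed.
    return re.compile("^" + "".join(lookaheads) + ".+$", re.MULTILINE)

words = "pale male sale tale pile pole pace page pane pave palm peal leap play help pack".split()
pattern = not_neighbor_pattern("pale")
to_delete = pattern.findall("\n".join(words))
print(to_delete)
```

The keyword itself is left unselected, which the question says is acceptable.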

Splitting a string that is not uniform with commas

I have a strange problem that I am not sure how to solve.
My goal is to be able to split a string that has certain ingredients that are separated by commas into an array so that each element in the array is an ingredient. However, some of the strings that I will come across that list ingredients have a list within the list that looks like this:
WATER, CORN SYRUP AND 2% OR LESS OF EACH OF THE FOLLOWING: CONCENTRATED JUICES (ORANGE, TANGERINE, APPLE, LIME, GRAPEFRUIT), CITRIC ACID, MALIC ACID, ASCORBIC ACID (VITAMIN C), THIAMIN HYDROCHOLORIDE (VITAMIN B1), NATURAL FLAVORS. MODIFIED CORNSTARCH, CANOLA OIL, CELLULOSE GUM, SUCRALOSE, SODIUM HEXAMETAPHOSPHATE, POTASSIUM SORBATE TO PROTECT FLAVOR, YELLOW #5, YELLOW #6 AND DISODIUM EDTA TO PROTECT COLOR.
As you can see, there is a portion of the string that says "2% OR LESS OF EACH OF THE FOLLOWING: CONCENTRATED JUICES (ORANGE, TANGERINE, APPLE, LIME, GRAPEFRUIT)". If the delimiter for the split method is a simple comma, then one of the ingredients will be "2% OR LESS OF EACH OF THE FOLLOWING: CONCENTRATED JUICES (ORANGE" which does not look correct. My goal is to get that entire portion of the string into one element e.g. the element should be "2% OR LESS OF EACH OF THE FOLLOWING: CONCENTRATED JUICES (ORANGE, TANGERINE, APPLE, LIME, GRAPEFRUIT)".
Thank you for taking the time to look at my question!
Give this a shot:
\,+(?![^\(]*\))
It seems to work in JavaScript against your example.
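The same lookahead also appears to behave this way in Python. A minimal sketch, assuming the parentheses are balanced and never nested; the `ingredients` string is a shortened version of the example, and I dropped the unnecessary escapes and the `+` from the pattern.

```python
import re

ingredients = (
    "WATER, CONCENTRATED JUICES (ORANGE, TANGERINE, APPLE), "
    "CITRIC ACID, ASCORBIC ACID (VITAMIN C)"
)

# Split on a comma only when it is NOT followed by a ")" that arrives
# before any "(" -- in other words, only on commas outside parentheses.
parts = [p.strip() for p in re.split(r",(?![^(]*\))", ingredients)]
print(parts)
```

With nested parentheses the lookahead is no longer reliable, and a small hand-written scanner that tracks nesting depth would be the safer choice.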

Create Fill-In-The-Blanks Text From Text Chunks using Regex and Replace

I am trying to create a fill-in-the-blanks worksheet from a chunk of text, and I think regex and a replace function in a text editor will greatly expedite my project.
Example text:
HAMLET O, that this too too solid flesh would melt Thaw and resolve
itself into a dew! Or that the Everlasting had not fix'd His canon
'gainst self-slaughter! O God! God! How weary, stale, flat and
unprofitable, Seem to me all the uses of this world! Fie on't! ah fie!
'tis an unweeded garden, That grows to seed; things rank and gross in
nature Possess it merely. That it should come to this! But two months
dead: nay, not so much, not two: So excellent a king; that was, to
this, Hyperion to a satyr; so loving to my mother That he might not
beteem the winds of heaven Visit her face too roughly. Heaven and
earth! Must I remember? why, she would hang on him, As if increase of
appetite had grown By what it fed on: and yet, within a month-- Let me
not think on't--Frailty, thy name is woman!-- A little month, or ere
those shoes were old With which she follow'd my poor father's body,
Like Niobe, all tears:--why she, even she-- O, God! a beast, that
wants discourse of reason, Would have mourn'd longer--married with my
uncle, My father's brother, but no more like my father Than I to
Hercules: within a month: Ere yet the salt of most unrighteous tears
Had left the flushing in her galled eyes, She married. O, most wicked
speed, to post With such dexterity to incestuous sheets! It is not nor
it cannot come to good: But break, my heart; for I must hold my
tongue.
Replace every other text set with a blank made of "_" characters, equal in length to the text it replaces, where a text set is defined as a group of words ending with "!", ",", "--", "?", etc.
So the above text from Hamlet becomes like
HAMLET O, ___________________ Or that the
Everlasting had not fix'd His canon 'gainst self-slaughter! __
God! _____, stale, ________ ......
What is the regex that I should use to achieve this end?
Here is an attempt using perl regex:
perl -pe 's/(.*?)([\!\?\,;\.]|--)(.*?)([\!\?\,;\.]|--)/\1\2________________\4/g' file
Output:
HAMLET O,_______! Or that the Everlasting had not fix'd His
canon 'gainst self-slaughter!_______! God!_______,
stale,_______, Seem to me all the uses of this
world!_______! ah fie!_______, That grows to
seed;_______. That it should come to this!_______,
not so much,_______; that was,_______, Hyperion to a
satyr;_______. Heaven and earth!_______?
why,_______, As if increase of appetite had grown By what it
fed on: and yet,_______-- Let me not think
on't--_______, thy name is woman!_______-- A little
month,_______, Like Niobe,_______--why
she,_______-- O,_______! a beast,_______,
Would have mourn'd longer--_______, My father's
brother,_______, She married._______, most wicked
speed,_______! It is not nor it cannot come to good: But
break,_______; for I must hold my tongue.
This solution inserts a fixed number of "_" characters; I have yet to figure out how to make the blank match the length of the replaced text.
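One way to get blanks of matching length is to do the replacement in a small script instead of a single substitution. The following is a sketch in Python, not a text-editor regex; `blank_alternate` is a hypothetical helper, and the delimiter set (the same one used in the perl attempt) is an assumption about what ends a "text set".

```python
import re

# Delimiters that end a text set; the capture group makes split() keep them.
DELIMITER = re.compile(r"(--|[!?,;.])")

def blank_alternate(text):
    """Replace every second chunk with underscores of the same length."""
    pieces = DELIMITER.split(text)  # [chunk, delim, chunk, delim, ..., chunk]
    out = []
    for i in range(0, len(pieces), 2):
        chunk = pieces[i]
        delim = pieces[i + 1] if i + 1 < len(pieces) else ""
        if (i // 2) % 2 == 1:
            # Keep the surrounding whitespace, blank everything in between.
            lead = len(chunk) - len(chunk.lstrip())
            body = chunk.strip()
            chunk = chunk[:lead] + "_" * len(body) + chunk[lead + len(body):]
        out.append(chunk + delim)
    return "".join(out)

print(blank_alternate("O, God! God! How weary, stale"))
```

Line breaks inside a blanked chunk are replaced by underscores as well, so the text may need re-wrapping afterwards.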