Extract two words regardless of order In Google Sheets - regex

I have a Google Sheets table with input in column A, and I'd like to achieve this result using REGEXEXTRACT.
Desired result:
Input                                Output
Stock OutNew21554 - Shirt - Red      New | Stock Out
NewStock Out54872 - Shirt - Green    New | Stock Out
This is what I attempted.
01
=ArrayFormula(REGEXEXTRACT(A1:A2, "[(Stock Out)|(New)]+"))
Input                                Output
Stock OutNew21554 - Shirt - Red      Stock OutNew
NewStock Out54872 - Shirt - Green    NewStock Out
02
=ArrayFormula(REGEXEXTRACT(A2:A3, "(Stock Out)|(New)+"))
Input                                Output
Stock OutNew21554 - Shirt - Red      Stock Out
NewStock Out54872 - Shirt - Green

Use two instances of regexextract() in an { array expression }, wrapped in iferror():
=arrayformula( iferror(
{
regexextract(A2:A3, "New"),
regexextract(A2:A3, "Stock Out")
}
) )

There's no way to do this in a single regex without generating all possible permutations or without lookaround support. However, we can call REGEXEXTRACT repeatedly using REDUCE. For example, to extract New, Stock Out, and the color:
=BYROW(A2:A3,LAMBDA(row,TEXTJOIN(" | ",1,REDUCE(,{"New","Stock Out","Red|Green"},LAMBDA(a,c,{a;IFNA(REGEXEXTRACT(row,c))})))))
This supports an unlimited number of words to extract.
Output
New | Stock Out | Red
New | Stock Out | Green
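For readers who want to check the idea outside Sheets, the same "one search per pattern, then join" logic can be sketched in Python. This is only an illustration under stated assumptions: the function name extract_in_order is hypothetical, and Python's re module stands in for REGEXEXTRACT.

```python
import re

def extract_in_order(text, patterns, sep=" | "):
    """Search `text` once per pattern and join the hits in pattern order."""
    found = []
    for pattern in patterns:
        match = re.search(pattern, text)
        if match:  # skip patterns that do not occur, like IFNA() in the formula
            found.append(match.group(0))
    return sep.join(found)

rows = [
    "Stock OutNew21554 - Shirt - Red",
    "NewStock Out54872 - Shirt - Green",
]
for row in rows:
    print(extract_in_order(row, ["New", "Stock Out", "Red|Green"]))
# prints:
# New | Stock Out | Red
# New | Stock Out | Green
```

Because each pattern is searched independently, the order of the words in the input does not matter; the output order is set by the pattern list.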

Related

Regex: Kusto Query to fetch /extract text after a word

Using a Kusto query, is there a way to extract the text that follows the word "Measures"?
For example, from the string below I would like to fetch two values:
cubeCount of Sales
Number of Product Categories
string:
SELECT NON EMPTY
CrossJoin(Hierarchize(AddCalculatedMembers({DrilldownLevel({[Office
View].[Office View].[All]})})), {[Measures].[cubeCount of
Sales],[Measures].[Number of Product Categories]}) DIMENSION
PROPERTIES PARENT_UNIQUE_NAME,HIERARCHY_UNIQUE_NAME ON COLUMNS , NON
EMPTY
Hierarchize(AddCalculatedMembers({DrilldownLevel({[Board].[Board].[All]})}))
DIMENSION PROPERTIES PARENT_UNIQUE_NAME,HIERARCHY_UNIQUE_NAME ON ROWS
FROM [EZI_NS] WHERE ([Entity].[Entity Schema].&[Total],[Date].[FY
Year].&[FY2021],[Date].[FY Month Short].&[Jan],[Type].[Service
Type].[All],[DateView].[DateView].&[Periodic]) CELL PROPERTIES VALUE,
FORMAT_STRING, LANGUAGE, BACK_COLOR, FORE_COLOR, FONT_FLAGS
I tried using a regex, but I was unable to frame the query for the extract_all function.
print txt = "SELECT NON EMPTY CrossJoin(Hierarchize(AddCalculatedMembers({DrilldownLevel({[Office View].[Office View].[All]})})), {[Measures].[cubeCount of Sales],[Measures].[Number of Product Categories]}) DIMENSION PROPERTIES PARENT_UNIQUE_NAME,HIERARCHY_UNIQUE_NAME ON COLUMNS , NON EMPTY Hierarchize(AddCalculatedMembers({DrilldownLevel({[Board].[Board].[All]})})) DIMENSION PROPERTIES PARENT_UNIQUE_NAME,HIERARCHY_UNIQUE_NAME ON ROWS FROM [EZI_NS] WHERE ([Entity].[Entity Schema].&[Total],[Date].[FY Year].&[FY2021],[Date].[FY Month Short].&[Jan],[Type].[Service Type].[All],[DateView].[DateView].&[Periodic]) CELL PROPERTIES VALUE, FORMAT_STRING, LANGUAGE, BACK_COLOR, FORE_COLOR, FONT_FLAGS"
| project Measures = extract_all(@"\[Measures]\.\[(.*?)]", txt)
Measures
["cubeCount of Sales","Number of Product Categories"]
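The pattern itself can be tried outside Kusto. The sketch below uses Python's re.findall as a stand-in for extract_all and, for brevity, a shortened excerpt of the query string (an assumption made for readability; the full string gives the same matches). The closing brackets are escaped here, which is optional in this regex flavor.

```python
import re

# Shortened excerpt of the MDX query from the question (assumption: trimmed for readability)
txt = (
    "SELECT NON EMPTY CrossJoin(Hierarchize(AddCalculatedMembers("
    "{DrilldownLevel({[Office View].[Office View].[All]})})), "
    "{[Measures].[cubeCount of Sales],[Measures].[Number of Product Categories]}) "
    "DIMENSION PROPERTIES PARENT_UNIQUE_NAME,HIERARCHY_UNIQUE_NAME ON COLUMNS"
)

# Lazy capture of everything between "[Measures].[" and the next "]"
measures = re.findall(r"\[Measures\]\.\[(.*?)\]", txt)
print(measures)  # ['cubeCount of Sales', 'Number of Product Categories']
```

The lazy quantifier `.*?` matters: a greedy `.*` would run to the last `]` in the string instead of stopping at the first one after each measure name.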

Merging two Pandas Dataframes using Regular Expressions

I'm new to Python and Pandas, and I'm trying to merge two DataFrames based on regular expressions.
I have one DataFrame with some 2 million rows. This table contains data about cars, but the model name is often specified in a, let's say, creative way, e.g. 'Audi A100', 'Audi 100', 'Audit 100 Quadro', or just 'A 100'. The same goes for other brands. This is stored in a column called "Model". In a second column I have the manufacturer.
| Index | Model       | Manufacturer |
| 0     | A 100       | Audi         |
| 1     | A100 Quadro | Audi         |
| 2     | Audi A 100  | Audi         |
| ...   | ...         | ...          |
To clean up the data I created about 1000 regular expressions that search for certain keywords and stored them in a DataFrame called 'regex'. In a second column of this table I save the manufacturer. This value is used in a second step to validate the result.
| Index | RegEx       | Manufacturer |
| 0     | .* A100 .*  | Audi         |
| 1     | .* A 100 .* | Audi         |
| 2     | .* C240 .*  | Mercedes     |
| 3     | .* ID3 .*   | Volkswagen   |
I hope you get the idea.
As far as I understood, the Pandas function "merge()" does not work with regular expressions. Therefore I use a loop to process the list of regular expressions, then use the "match" function to locate matching rows in the car DataFrame and assign the successfully used RegEx and the suggested manufacturer.
I added two additional columns to the cars table 'RegEx' and 'Manufacturer'.
for index, row in regex.iterrows():
    # Tag every car whose model matches this pattern with the pattern and its manufacturer
    cars.loc[cars['Model'].str.match(row['RegEx']),'RegEx'] = row['RegEx']
    cars.loc[cars['Model'].str.match(row['RegEx']),'Manufacturer'] = row['Manufacturer']
I learned that 'iterrows' should not be used for performance reasons. The loop takes 8 minutes to finish, which isn't too bad. However, is there a better way to get it done?
Kind regards
Jiriki
I have no idea whether it would be faster (I'd be glad if you could test it), but this doesn't use iterrows():
regex.groupby(["RegEx", "Manufacturer"])["RegEx"]\
.apply(lambda x: cars.loc[cars['Model'].str.match(x.iloc[0])])
EDIT: Code for reproduction:
import pandas as pd

cars = pd.DataFrame({"Model": ["A 100", "A100 Quatro", "Audi A 100", "Passat V", "Passat Gruz"],
                     "Manufacturer": ["Audi", "Audi", "Audi", "VW", "VW"]})
regex = pd.DataFrame({"RegEx": [".*A100.*", ".*A 100.*", ".*Passat.*"],
                      "Manufacturer": ["Audi", "Audi", "VW"]})
#Output:
#                                 Model Manufacturer
#RegEx      Manufacturer
#.*A 100.*  Audi         0        A 100         Audi
#                        2   Audi A 100         Audi
#.*A100.*   Audi         1  A100 Quatro         Audi
#.*Passat.* VW           3     Passat V           VW
#                        4  Passat Gruz           VW
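One more idea, offered as an untested-at-scale sketch rather than a measured improvement: if the 2 million rows contain many repeated model strings (an assumption about the data), each distinct string only needs to be matched once, and the result can be looked up per row afterwards. The helper name build_lookup is hypothetical, and the core below uses only the standard library so it can be checked without pandas.

```python
import re

def build_lookup(models, rules):
    """Map each distinct model string to (pattern, manufacturer), or (None, None)."""
    compiled = [(re.compile(pattern), pattern, maker) for pattern, maker in rules]
    lookup = {}
    for model in set(models):
        lookup[model] = (None, None)
        # Walk the rules backwards so the last matching rule wins,
        # which mirrors the overwrite behaviour of the original loop.
        for rx, pattern, maker in reversed(compiled):
            if rx.match(model):  # anchored at the start, like Series.str.match
                lookup[model] = (pattern, maker)
                break
    return lookup

models = ["A 100", "A100 Quatro", "Audi A 100", "Passat V", "Passat Gruz", "A 100"]
rules = [(".*A100.*", "Audi"), (".*A 100.*", "Audi"), (".*Passat.*", "VW")]

lookup = build_lookup(models, rules)
labels = [lookup[m] for m in models]  # one (pattern, manufacturer) pair per row
```

With pandas, the same dictionary could be applied via cars['Model'].map(lookup), and the rules could be built from zip(regex['RegEx'], regex['Manufacturer']). Whether this beats the loop depends on how many distinct model strings there are, so it is worth timing on the real data.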

Summing up number values extracted from one cell using regexextract or regexreplace

I have numbers like the sample below stored in one cell:
First:
[9miles 12lbs weight 1g Raw]
Second:
[1miles 3lbs weight 7g Raw]
Third:
[20miles 6lbs weight 3g Raw]
I'd like to extract the numbers, sum them up, and place the result in another cell in the same row. So far I can only manage to extract the first match using the REGEXEXTRACT formula. Is this even possible?
Desired outcome:
[30miles 21lbs weight 11g Raw]
try:
=INDEX(QUERY(IFERROR(REGEXEXTRACT(SPLIT(
FLATTEN(SPLIT(A1, ":")), " "), "\d+")*1, 0),
"select sum(Col1),sum(Col2),sum(Col4)"), 2)
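To sanity-check the expected totals independently of the formula, the same extract-then-sum idea can be sketched in Python. The function name sum_measurements is hypothetical, and the sketch assumes every entry follows the exact "Nmiles Nlbs weight Ng Raw" layout shown in the question.

```python
import re

cell = """First:
[9miles 12lbs weight 1g Raw]
Second:
[1miles 3lbs weight 7g Raw]
Third:
[20miles 6lbs weight 3g Raw]"""

def sum_measurements(text):
    """Sum the miles, lbs and g values across all bracketed entries."""
    rows = re.findall(r"(\d+)miles (\d+)lbs weight (\d+)g", text)
    if not rows:
        return ""
    # zip(*rows) turns the list of per-entry tuples into one tuple per unit
    miles, lbs, grams = (sum(int(v) for v in column) for column in zip(*rows))
    return f"[{miles}miles {lbs}lbs weight {grams}g Raw]"

print(sum_measurements(cell))  # [30miles 21lbs weight 11g Raw]
```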

How to get the email address in between 2 different characters in Excel or Google Sheets using formula only?

The cell A2 has the following sample email address:
Jose Rizal <jose@email.com>
I want to get only the email address in cell B2. I tried:
=right(A2,len(A2) - search("<",A2,1))
but the result was jose@email.com> (with the trailing > still attached).
The table looks like this; the expected result is in B2:
| A | B |
1| complete email address | email address only |
2| Jose Rizal <jose@email.com> | jose@email.com |
How can I improve my formula?
Paste this in B2:
=REGEXEXTRACT(A2, "<(.*)>")
The ARRAYFORMULA version would be:
=ARRAYFORMULA(IFERROR(REGEXEXTRACT(A2:A, "<(.*)>")))
In Excel or Google Sheets (though player0's REGEXEXTRACT is the better choice in Google Sheets):
=MID(REPLACE(A2,FIND(">",A2),LEN(A2),""),FIND("<",A2)+1,LEN(A2))
And drag the formula down.
Add another Left trim in there:
=LEFT(RIGHT(A2,LEN(A2) - SEARCH("<",A2,1)),LEN(RIGHT(A2,LEN(A2) - SEARCH("<",A2,1)))-1)
Another approach uses FILTERXML, if your version of Excel supports it:
=FILTERXML("<b><a>"&SUBSTITUTE(LEFT(A2,LEN(A2)-1),"<","</a><a>")&"</a></b>","//a[2]")
Suppose your data starts from cell A2; drag the formula down to apply it to the other rows.
For the logic behind this formula, you may want to read this article: Extract Words with FILTERXML.
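The capture-group idea behind "<(.*)>" can also be tried outside a spreadsheet. Below is a minimal Python sketch; the function name extract_address is hypothetical, and the pattern is deliberately swapped for [^<>]+ so that it stops at the first closing bracket rather than matching greedily.

```python
import re

def extract_address(cell):
    """Return the text between the first <...> pair, or "" if there is none."""
    match = re.search(r"<([^<>]+)>", cell)
    return match.group(1) if match else ""

print(extract_address("Jose Rizal <jose@email.com>"))  # jose@email.com
```

Returning an empty string for cells without angle brackets plays the same role as IFERROR in the array formula.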

How to extract specific information from strings

I have a dataset with the addresses of authors' affiliations. The addresses differ in length, but the text before the first comma is the name of the institution and the text after the last comma is the country. What I want to do is extract the country and create a new variable for it.
I tried this code in Stata. It works to extract the name of institutions.
generate splitat = strpos(institutions ,",")
generate str80 univ = substr(institutions, 1, splitat - 1)
I am wondering whether this code can also be applied to extract the country.
Could it search from the end instead of from the start?
My dataset looks like the following example:
Natl Taiwan Univ, Inst Epidemiol, Taipei 106, Taiwan
Radboud Univ Nijmegen, Inst Water & Wetland Res, Dept Anim Ecol & Ecophysiol, NL-6525 AJ Nijmegen, Netherlands
There is a specific function in Stata 14+ to look for the last occurrence of a substring (e.g. a specific character) in a string. See help string functions in Stata 14 for documentation of strrpos().
If that is not in your version of Stata, you merely reverse the string, find the substring using the method you already know, and then reverse what you found.
If you are not using the latest version of Stata, it is always a good idea to say so in questions in any forum that supports Stata questions.
clear
input str244 institutions
"Natl Taiwan Univ, Inst Epidemiol, Taipei 106, Taiwan"
"Radboud Univ Nijmegen, Inst Water & Wetland Res, Dept Anim Ecol & Ecophysiol, NL-6525 AJ Nijmegen, Netherlands"
end
compress
gen country = substr(institutions, strrpos(institutions, ",") + 1, .)
local rev strreverse(institutions)
gen country2 = strreverse(substr(`rev', 1, strpos(`rev', ",") - 1))
assert country == country2
l country
     +--------------+
     |      country |
     |--------------|
  1. |       Taiwan |
  2. |  Netherlands |
     +--------------+
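For comparison, both strategies (search from the end, or reverse and search from the start) can be sketched in Python. The function names are hypothetical, and unlike the Stata code above, this sketch also strips the space that follows the last comma.

```python
def country_from_address(address):
    """Take everything after the last comma (the analogue of strrpos())."""
    last_comma = address.rfind(",")  # -1 if there is no comma, so the whole string is kept
    return address[last_comma + 1:].strip()

def country_via_reverse(address):
    """Same result by reversing, cutting at the first comma, and reversing back."""
    if "," not in address:
        return address.strip()
    rev = address[::-1]
    return rev[:rev.find(",")][::-1].strip()

addresses = [
    "Natl Taiwan Univ, Inst Epidemiol, Taipei 106, Taiwan",
    "Radboud Univ Nijmegen, Inst Water & Wetland Res, Dept Anim Ecol & Ecophysiol, NL-6525 AJ Nijmegen, Netherlands",
]
for address in addresses:
    print(country_from_address(address))
# prints:
# Taiwan
# Netherlands
```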