Substring match in Google Sheets using Regex and Query

This is the data in Google Sheets:
Account Number    Names
7728550,543216    Govt Req
772855,65432      Vodafone
I am trying to look up an account number with the formula:
=QUERY(Sheet1!B$3:C$4,"Select C where B matches '^.*(" & B2 & ").*$' limit 1")
For 772855 this returns Govt Req instead of Vodafone, because the pattern also matches 7728550.
How do I solve this? There is a large amount of data, so I can't paste the values into separate rows.

use:
=ARRAYFORMULA(IFNA(VLOOKUP(B2:B,
SPLIT(FLATTEN(SPLIT(Sheet1!F2:F, ",")&"×"&Sheet1!G2:G), "×"), 2, )))
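The formula above avoids the substring problem by splitting each comma-separated cell into individual account numbers and then doing an exact lookup. A minimal Python sketch of the same idea, using the sample rows from the question (the function name `build_lookup` is made up for illustration):

```python
def build_lookup(rows):
    """Map each individual account number to its name.

    rows: iterable of (comma_separated_accounts, name) pairs.
    """
    lookup = {}
    for accounts, name in rows:
        for account in accounts.split(","):
            account = account.strip()
            if account:
                # Keep the first name seen, like VLOOKUP returning the first match.
                lookup.setdefault(account, name)
    return lookup


rows = [("7728550,543216", "Govt Req"), ("772855,65432", "Vodafone")]
lookup = build_lookup(rows)

# A substring test wrongly matches 772855 inside 7728550 (first row).
substring_hit = next(name for accounts, name in rows if "772855" in accounts)

# An exact match on the split values finds the intended row.
exact_hit = lookup.get("772855")
```

Here `substring_hit` is "Govt Req" while `exact_hit` is "Vodafone", which is the difference between the regex approach and the split-then-lookup approach.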

Related

Summing up number values extracted from one cell using regexextract or regexreplace

I have numbers like the sample below stored in one cell:
First:
[9miles 12lbs weight 1g Raw]
Second:
[1miles 3lbs weight 7g Raw]
Third:
[20miles 6lbs weight 3g Raw]
I'd like to extract the numbers, sum them up by position, and place the result in another cell in the same row. So far I can only manage to extract the first match using the REGEXEXTRACT formula. Is this even possible?
Desired outcome:
[30miles 21lbs weight 11g Raw]
try:
=INDEX(QUERY(IFERROR(REGEXEXTRACT(SPLIT(
FLATTEN(SPLIT(A1, ":")), " "), "\d+")*1, 0),
"select sum(Col1),sum(Col2),sum(Col4)"), 2)
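For comparison, the same extract-and-sum logic can be sketched in Python. This assumes the cell content looks like the sample (labels followed by bracketed groups with the numbers in the same order); `sum_bracketed` is an illustrative name, not part of any library:

```python
import re


def sum_bracketed(text):
    """Sum numbers position-wise across every [...] group in text."""
    groups = re.findall(r"\[([^\]]*)\]", text)
    totals = []
    for group in groups:
        numbers = [int(n) for n in re.findall(r"\d+", group)]
        # Grow the totals list if this group has more numbers than seen so far.
        totals.extend([0] * (len(numbers) - len(totals)))
        for i, n in enumerate(numbers):
            totals[i] += n
    return totals


cell = ("First: [9miles 12lbs weight 1g Raw] "
        "Second: [1miles 3lbs weight 7g Raw] "
        "Third: [20miles 6lbs weight 3g Raw]")
miles, lbs, grams = sum_bracketed(cell)
result = f"[{miles}miles {lbs}lbs weight {grams}g Raw]"
```

With the sample data, `result` reproduces the desired outcome from the question.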

Regex in google sheets - Extract parts of URL between slashes

I am having difficulty extracting parts of a URL.
I need to know which category the product belongs to (jeans, socks, tshirts) and, as a subcategory, which color it is (blue, black, white).
https://www.examplewebsite.com/shop/jeans/blue/123456
https://www.examplewebsite.com/shop/socks/black/234567
https://www.examplewebsite.com/shop/tshirst/white/4321
What is the best way to extract this in Google Sheets?
Please try:
=index(split(A1,"/"),,4)
and
=index(split(A1,"/"),,5)
copied down to suit.
Or, as a single formula for the whole column, with a header row:
={"PRODUCT", "CATEGORY";
INDEX(SPLIT(A2:A, "/"), , 4),
INDEX(SPLIT(A2:A, "/"), , 5)}
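The formulas work because splitting on "/" (with empty pieces dropped) puts the category in the 4th piece and the color in the 5th. A rough Python equivalent, assuming every URL has the shape /shop/<category>/<color>/<id> as in the samples (`category_and_color` is a hypothetical helper):

```python
from urllib.parse import urlparse


def category_and_color(url):
    """Return (category, color) from a URL shaped like /shop/<category>/<color>/<id>."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    # For the sample URLs, segments is ["shop", "<category>", "<color>", "<id>"].
    if len(segments) < 3:
        return None
    return segments[1], segments[2]


urls = [
    "https://www.examplewebsite.com/shop/jeans/blue/123456",
    "https://www.examplewebsite.com/shop/socks/black/234567",
    "https://www.examplewebsite.com/shop/tshirst/white/4321",
]
pairs = [category_and_color(u) for u in urls]
```

Parsing the path first means the scheme and host never count as pieces, so the indexes do not shift if a URL is missing "https://".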

How to know if a variation (e.g. an abbreviation) of a string in a list matches against another list if the original does not?

I am currently searching for a method in R which lets me match/merge two data frames. Alas, both of these data frames contain non-optimal data: they can contain abbreviations or even typos. Therefore I would like to define a list of variants for each abbreviation and check whether a string contains one of those elements. If the original entries don't match, R should check whether any of the other variants of the abbreviation gives a match. To illustrate: the name of a company could end with "Limited" but also with "Ltd." or "Ltd", etc.
EXAMPLE
Data
The Original "Address" file contains:
Company name         Address
Deloitte Ltd.        New York
Coca-Cola            New York
Tesla ltd            California
Microsoft Limited    Washington
This would have to be merged with "EnterpriseNrList":
Company name         EnterpriseNumber
Deloitte Ltd.        221
Coca-Cola            334
Tesla ltd            725
Microsoft Limited    127
So the abbreviations should work in "both directions". That is why I said that if R recognises any of the abbreviations, it should try to match all of them.
All of the matches should be reported in the result.
Therefore I would make up a list "Abbreviations" containing each possible abbreviation:
Limited.
limited
Ltd.
ltd.
Ltd
ltd
Questions
1) Would this be a good method, or would there be a more efficient way?
2) How can I check a list against a list of possible abbreviations (step 1, see below), sort of a containsx from excel?
3) How could I build a list that, for the entries that do not match, replaces the abbreviation with all other abbreviations (step 2, see below)?
Thoughts for solution
Step 1
As I am still very new to this kind of work, I was thinking the following: use a regular expression to determine whether a string contains any of the abbreviation options, and create a list which contains -1 if no match was found and a value > 0 if a match was found. The entries without an abbreviation can already be matched against the "Address" list. With the other entries I continue to step 2.
In this step I don't really know how to check against a list of options ("Abbreviations" list).
Step 2
Next I would create a list with the matches from step 1 and rbind together all options. In this step I don't really know how I could create a list that combines, e.g., Coca-Cola with all its possible abbreviations:
Coca-Cola Limited
Coca-Cola Ltd.
Coca-Cola Ltd
etc.
Step 3
Lastly I would match/merge this more complete list of companies again with the original "Data" list. With the introduction of step 2 I thought it might be a bit easier on the required computing power, as the original list is about 8000 rows.
I would take a different approach and fix the tables before the merge.
To fix the abbreviations, I would use a case-insensitive regex with an optional final dot. I start with a list of the form 'normal word' = vector of abbreviations:
abbrevs <- list('Limited'=c('Limited','Ltd'),'Incorporated'=c('Incorporated','Inc'))
Then I build the corresponding regexes (alternations with an optional dot at the end; case will be ignored via a parameter in gsub, and in agrep later):
regexes <- lapply(abbrevs,function(x) { paste0("(",paste0(x,collapse='|'),")[.]?") })
Which gives:
$Limited
[1] "(Limited|Ltd)[.]?"
$Incorporated
[1] "(Incorporated|Inc)[.]?"
Now we have to apply each regex to the Company.name column of each data frame:
for (i in seq_along(regexes)) {
  Address$Company.name <- gsub(regexes[[i]], names(regexes[i]), Address$Company.name, ignore.case=TRUE)
  Enterprise$Company.name <- gsub(regexes[[i]], names(regexes[i]), Enterprise$Company.name, ignore.case=TRUE)
}
This does not take typos into account. For those you will need to work with agrep or adist.
Result for Address example data set:
> Address
       Company.name    Address
1  Deloitte Limited   New York
2         Coca-Cola   New York
3     Tesla Limited California
4 Microsoft Limited Washington
Input data used:
Address <- structure(list(Company.name = c("Deloitte Ltd.", "Coca-Cola",
"Tesla ltd", "Microsoft Limited"), Address = c("New York", "New York",
"California", "Washington")), .Names = c("Company.name", "Address"
), class = "data.frame", row.names = c(NA, -4L))
Enterprise <- structure(list(Company.name = c("Deloitte Ltd.", "Coca-Cola",
"Tesla ltd", "Microsoft Limited"), EnterpriseNumber = c(221L,
334L, 725L, 127L)), .Names = c("Company.name", "EnterpriseNumber"
), class = "data.frame", row.names = c(NA, -4L))
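The normalize-then-merge idea from this answer can also be sketched in Python. This is only an illustration: the variant spellings in the sample rows are made up so that the two tables differ, and the word boundaries (`\b`) are an addition to the R regex so that "Inc" does not match inside a word such as "Income".

```python
import re

# Canonical word -> accepted variants (same idea as the abbrevs list in R).
ABBREVS = {
    "Limited": ["Limited", "Ltd"],
    "Incorporated": ["Incorporated", "Inc"],
}

# One case-insensitive pattern per canonical word, with an optional trailing dot.
PATTERNS = {
    canonical: re.compile(
        r"\b(?:" + "|".join(map(re.escape, variants)) + r")\b\.?",
        re.IGNORECASE,
    )
    for canonical, variants in ABBREVS.items()
}


def normalize(name):
    """Replace every known abbreviation variant with its canonical word."""
    for canonical, pattern in PATTERNS.items():
        name = pattern.sub(canonical, name)
    return name


def merge(address_rows, enterprise_rows):
    """Inner-join two lists of (company_name, value) on the normalized name."""
    numbers = {normalize(name): number for name, number in enterprise_rows}
    return [
        (normalize(name), address, numbers[normalize(name)])
        for name, address in address_rows
        if normalize(name) in numbers
    ]


# Hypothetical rows with mismatched spellings between the two tables.
address = [("Deloitte Ltd.", "New York"), ("Tesla ltd", "California")]
enterprise = [("Deloitte Limited", 221), ("Tesla Ltd", 725)]
merged = merge(address, enterprise)
```

Because both sides are normalized before the join, the abbreviations work in both directions without generating every possible variant of every company name.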
I would say that the answer depends on whether you have a list of abbreviations or not.
If you have one, you can check which elements of your list contain an abbreviation with the grep or grepl functions (grep returns the indexes of all elements that match the pattern, whereas grepl returns a logical vector).
Also, use the ignore.case = TRUE parameter of these functions so you don't have to try all capitalized/lowercase possibilities.
If you don't have such a list, my first guess would be to extract the first "word" of each company name (I would guess that there is a single "Deloitte" company, and that it is "Deloitte Ltd"). You can do so with:
sapply(strsplit(CompanyNames, split = " "), `[`, 1)
If you wanted to also correct for typos, this is more a question of string distance.
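As a rough illustration of the string-distance idea, Python's standard library offers `difflib.get_close_matches`, used here as a stand-in for agrep/adist. It scores similarity as a ratio rather than an edit distance, and the 0.85 cutoff is an arbitrary assumption that would need tuning on real data:

```python
from difflib import get_close_matches


def closest_name(name, candidates, cutoff=0.85):
    """Return the most similar candidate, or None if nothing clears the cutoff."""
    matches = get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


candidates = ["Deloitte Limited", "Coca-Cola", "Tesla Limited", "Microsoft Limited"]
fixed = closest_name("Microsft Limited", candidates)
missing = closest_name("Zzz", candidates)
```

A stricter cutoff reduces false matches between names that only share a suffix such as "Limited", at the cost of missing heavier typos.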
Hope that it helped!

Grouping Similar words/phrases

I have a frequency table of words which looks like this:
> head(freqWords)
employees work bose people company
1879 1804 1405 971 959
employee
100
> tail(freqWords)
youll younggood yoyo ytd yuorself zeal
1 1 1 1 1 1
I want to create another frequency table which combines similar words and adds up their frequencies.
In the above example, my new table should contain both employee and employees as one element with a frequency of 1979. For example:
> head(newTable)
employee,employees work bose people
1979 1804 1405 971
company
959
I know how to find similar words (using adist, stringdist), but I am unable to create the frequency table. For instance, I can use the following to get a list of similar words:
words <- names(freqWords)
lapply(words, function(x) words[stringdist(x, words) < 3])
and the following to get a list of similar two-word phrases:
lapply(words, function(x) words[stringdist2(x, words) < 3])
where stringdist2 is the following:
stringdist2 <- function(word1, word2){
  min(stringdist(word1, word2),
      stringdist(word1, gsub(word2,
                             pattern = "(.*) (.*)",
                             repl="\\2,\\1")))
}
I do not have any punctuation/special symbols in my words/phrases. (I do not know a lot of R; I created stringdist2 by tweaking an implementation of adist2 I found here, but I do not understand everything about how pattern and repl work.)
So I need help creating the new frequency table.
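One possible way to build such a table, sketched in Python rather than R: compute pairwise edit distances, merge words that fall within a threshold into groups, and sum the counts per group. The threshold of 1 and the transitive (single-linkage) grouping are assumptions made for this sketch; a larger threshold such as the `< 3` used above can chain unrelated short words together.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution
            ))
        previous = current
    return previous[-1]


def group_frequencies(freq, max_distance=1):
    """Merge words within max_distance of each other and sum their counts."""
    words = list(freq)
    parent = {w: w for w in words}

    def find(w):
        # Follow parent links to the group representative (with path halving).
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if levenshtein(a, b) <= max_distance:
                parent[find(b)] = find(a)

    groups = {}
    for w in words:
        groups.setdefault(find(w), []).append(w)
    return {
        ",".join(sorted(members)): sum(freq[m] for m in members)
        for members in groups.values()
    }


freq_words = {"employees": 1879, "work": 1804, "bose": 1405,
              "people": 971, "company": 959, "employee": 100}
new_table = group_frequencies(freq_words)
```

With the sample counts, "employee" and "employees" end up in one entry with a combined frequency of 1979. The pairwise loop is quadratic in the number of words, which may matter for a large vocabulary.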

How to match Amazon / CJ / Linkshare Products

I need to create a database from the Amazon, Commission Junction, and LinkShare APIs and data feeds, and then match identical products to compare product information.
My problem is related to the matching process.
I started by matching products via SKU/UPC/ASIN, but this does not perform well because many of the products don't contain this information.
I did some research, and the most popular techniques I found are:
- Measuring cosine similarity via TF-IDF
- Measuring edit distance / Levenshtein / Jaro-Winkler
I used cosine similarity and Jaro-Winkler.
How I do the matching:
Step 1: Preprocessing
Preprocessing transforms strings into a normal form:
- Lowercase
- Filter stop words (new, by, the, …)
- Strip whitespace
- Replace all whitespace occurrences with a single space character
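The preprocessing steps listed above could look roughly like this in Python. The stop-word set only contains the three words named in the list and is an assumption; a real list would be longer:

```python
# Hypothetical stop-word list; the question only names "new", "by", "the".
STOP_WORDS = {"new", "by", "the"}


def preprocess(title):
    """Lowercase, drop stop words, and collapse whitespace to single spaces."""
    # str.split() with no argument strips the ends and collapses whitespace runs.
    tokens = title.lower().split()
    return " ".join(t for t in tokens if t not in STOP_WORDS)


normalized = preprocess("  Orange  By Hugo   Boss ")
```

Punctuation is left untouched here, so a token like "boss," would still differ from "boss"; stripping punctuation would be a natural extra step.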
Step 2: Indexing
Index Amazon products in one Solr core [core A] and CJ/LinkShare products in another core [core B]. The goal of indexing is to limit the number of string comparisons (via TF-IDF and Jaro-Winkler).
Step 3: Matching
I start by retrieving a product title from core B, run a Solr search in core A with this title, and take the top 30 results.
I measure TF-IDF similarity between the product I want to match (the query) and the 30 results retrieved by the Solr search, and keep the products with similarity > 80%.
I then tokenize both strings (the query and the candidate product), sort the tokens of each alphabetically, and compare the transformed strings with Jaro-Winkler distance, keeping the products with similarity > 80%. (This amounts to a Jaro-Winkler similarity between phrases.)
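To make the TF-IDF step concrete, here is a small pure-Python sketch of cosine similarity over TF-IDF vectors. It is not the asker's Solr setup: the smoothed idf formula and the tiny three-title corpus are assumptions chosen for illustration, and the Jaro-Winkler step is left out.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {token: weight} dict per document using smoothed TF-IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    n = len(tokenized)
    doc_freq = Counter(token for tokens in tokenized for token in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed idf keeps weights positive even for tokens in every document.
            token: count * (math.log((1 + n) / (1 + doc_freq[token])) + 1)
            for token, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {token: weight} vectors."""
    dot = sum(w * v.get(token, 0.0) for token, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


titles = [
    "orange hugo boss 3 ounce eau de toilette spray",
    "in motion orange hugo boss eau de toilette spray 3 ounces",
    "wireless optical mouse",
]
vectors = tfidf_vectors(titles)
perfume_score = cosine(vectors[0], vectors[1])
unrelated_score = cosine(vectors[0], vectors[2])
```

The two perfume titles share most of their tokens, so `perfume_score` is far above `unrelated_score` even though "In Motion" marks a different product. This is the weakness described in the question: bag-of-words similarity gives little weight to the few tokens that actually distinguish two variants.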
But these techniques also don't perform well. Example:
Product 1 : Orange by Hugo Boss, 3 Ounce Eau de toilette Spray
Product 2 : In Motion Orange By Hugo Boss Eau De Toilette Spray 3 Ounces
Products 1 and 2 are similar according to these techniques, but they are actually different products.
How can I improve this algorithm? Is this the right way to match products?
What if I train a classifier on token weights (using Jaro-Winkler), with training data taken from products matched via UPC, and use this classifier to match products in a final step?
PS: I have products from different categories (health, beauty, electronics, books, movies...) and the data is very unstructured or incomplete.
Any advice will be helpful.
Thanks
Smail