Google Sheets formula to check if string contains ALL unsorted characters in another string (Not App Script)?

I need a Google Sheets native formula (not an App Script function) that returns true if all characters in string B exist in string A.
For example, if A = ‘CDEFGH’
If B = ‘EDX’ —> Match = False (character X is not in string A)
If B = ‘HCE’ —> Match = True (Characters H, C, and E are all in string A)
Note the following:
The characters in B can be in any order and are not necessarily contiguous. I need to check that ALL (not just any) of B's characters are present in string A, in any order.
I know how to do this in App Script but it is too slow for my application as I need to call this function thousands of times. So the solution has to be using Google Sheets native built-in functions excluding App Script.
I would love to do this using regular expression. If so, please show me how to get Google Sheets to read strings A and B in cells and return true/false when the condition above is met.

Try
=arrayformula(sum(--REGEXMATCH(B1,split(REGEXREPLACE($A$1,"(.)","$1~"),"~"))))=len(B1)
explanation
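REGEXREPLACE($A$1,"(.)","$1~") adds a tilde after each character of A1
SPLIT then breaks the result on ~, giving one cell per character
REGEXMATCH(B1,split(REGEXREPLACE($A$1,"(.)","$1~"),"~")) tests, for each of those characters, whether it occurs in B1
SUM adds up the matches (-- transforms each boolean into a number), and the total is compared with the length of B1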
extension
to handle repeated characters in either string, try
=arrayformula(sum(--REGEXMATCH(B1,unique(transpose(split(REGEXREPLACE($A$1,"(.)","$1~"),"~"))))))=len(join("",unique(transpose(split(REGEXREPLACE(B1,"(.)","$1~"),"~")))))
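If you want to sanity-check the logic outside Sheets, here is a minimal Python sketch of the same test. The function name contains_all is made up for illustration. Like the extended formula, it only looks at distinct characters, so repeats in either string do not change the result.

```python
def contains_all(a: str, b: str) -> bool:
    """Return True when every distinct character of b occurs somewhere in a."""
    # Order and contiguity are ignored, and repeated characters count once,
    # which is what the UNIQUE(...) version of the formula does as well.
    return set(b) <= set(a)


print(contains_all("CDEFGH", "HCE"))  # True
print(contains_all("CDEFGH", "EDX"))  # False
```

The counting in the formula (distinct characters of A found in B, compared with the number of distinct characters in B) is equivalent to this subset test.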

Related

If cell contains '?' then formula X if not then copy value

At the moment I am busy with a spreadsheet to analyse results per URL. The problem is that when I make a list of unique URLs, URLs with a parameter appended (for example '?fbads') are treated as unique; instead, I need those results merged with the main URL. See the example below:
https://www.holidayguru.nl/deal/accommodatie/luxe-strandvakantie-in-ijmuiden-5e25ba62-e001-4072-8eb5-b6c3b0e7e66f/?fbclid=IwA
&
https://www.holidayguru.nl/deal/accommodatie/luxe-strandvakantie-in-ijmuiden-5e25ba62-e001-4072-8eb5-b6c3b0e7e66f/
Should both be: https://www.holidayguru.nl/deal/accommodatie/luxe-strandvakantie-in-ijmuiden-5e25ba62-e001-4072-8eb5-b6c3b0e7e66f/
I already fixed this with a formula, but I need one list with all URLs. So I'm looking for one of two options. Either, in the
=LEFT(A11,FIND("?",A11)-1)
that I use right now, I need a way to say: if you don't find a '?', then just copy cell A11.
Or...
I have to work with an IF function to say: if A11 contains '?', then execute the LEFT function; otherwise use A11.
I can't manage to get the formula working. Demo sheet is down below :). Thanks!
Example spreadsheet
Delete everything from Sheet1!A:A (including the header) and place the following in Sheet1!A1:
=ArrayFormula({"UNIQUE URLS"; UNIQUE(FILTER(REGEXEXTRACT(URLs!A2:A,"[^\?]+"),URLs!A2:A<>""))})
This will create the header (which you can change as you like within the formula itself) and a unique list of URLs as determined only by the portion before a question mark (if a question mark exists) or to the end of the original URL.
For your reference, the expression [^\?]+ means "a string of the greatest length that can be extracted without containing a literal question mark."
[ ] = "any of the characters contained herein"
[^ ] = "not any of these characters"
\ = literal marker (i.e., whatever is next will be treated as a literal character)
\? = literal question mark (using the literal marker before the ? is necessary, since alone, the ? has a separate special meaning in REGEX-type expressions)
+ = "one or more of the preceding character or group of characters"
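To see the same idea outside Sheets, here is a small Python sketch of what the formula does: take everything before the first question mark, skip blanks, and keep only unique results. The function names and the example.com URLs are made up for illustration.

```python
import re


def strip_query(url: str) -> str:
    """Return the part of url before the first '?' (the whole URL if it has none)."""
    # Inside a character class the question mark needs no escaping.
    match = re.match(r"[^?]+", url)
    return match.group(0) if match else ""


def unique_base_urls(urls):
    """Drop blanks, strip query strings, and de-duplicate in first-seen order."""
    seen = {}
    for url in urls:
        if url:
            seen.setdefault(strip_query(url), None)
    return list(seen)


urls = [
    "https://example.com/deal/?fbclid=abc",
    "",
    "https://example.com/deal/",
]
print(unique_base_urls(urls))  # ['https://example.com/deal/']
```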

Find group of strings starting and ending by a character using regular expression

I have a string, and I want to extract, using regular expressions, groups of characters that are between the character : and the other character /.
typically, here is a string example I'm getting:
'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
So I want to retrieve 45.72643,4.91203 and also hereanotherdata,
as they are both between the characters : and /.
I tried this syntax on a simpler string where the pattern occurs only once:
[tt]=regexp(str,':(\w.*)/','match')
tt = ':45.72643,4.91203/'
But it works only if the pattern occurs once. If I use it on a string containing the pattern multiple times, I get everything between the first : and the last /.
How can I specify that the pattern will occur multiple times, and how can I retrieve each occurrence?
Use lookaround and a lazy quantifier:
regexp(str, '(?<=:).+?(?=/)', 'match')
Example (Matlab R2016b):
>> str = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
>> result = regexp(str, '(?<=:).+?(?=/)', 'match')
result =
1×2 cell array
'45.72643,4.91203' 'hereanotherdata'
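For comparison, the same lookaround pattern behaves the same way in Python's re module. This is only an illustrative sketch (the function name is made up), not part of the Matlab answer:

```python
import re


def between_colon_and_slash(text: str) -> list:
    # (?<=:) requires a ':' immediately before the match,
    # .+? is lazy, so it stops at the first point where a '/' follows,
    # (?=/) requires that '/' without consuming it.
    return re.findall(r"(?<=:).+?(?=/)", text)


s = "abcd:45.72643,4.91203/Rou:hereanotherdata/defgh"
print(between_colon_and_slash(s))  # ['45.72643,4.91203', 'hereanotherdata']
```

With a greedy .+ instead of .+?, the first match would run all the way to the last /, which is the behaviour described in the question.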
In most languages this is hard to do with a single regexp. Ultimately you'll only ever get back the one string, and you want to get back multiple strings.
I've never used Matlab, so it may be possible in that language, but based on other languages, this is how I'd approach it...
I can't give you the exact code, but a search indicates that in Matlab there is a function called strsplit, example...
C = strsplit(data,':')
That should break your original string up into an array of strings, using the ":" as the break point. You can then ignore the first array element (as it contains the text before the first ":"), loop over the rest of the array, and extract everything that comes before a "/".
So for instance...
'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
Breaks down into an array with parts...
1 - 'abcd'
2 - '45.72643,4.91203/Rou'
3 - 'hereanotherdata/defgh'
Then Ignore 1, and extract everything before the "/" in 2 and 3.
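The split-then-trim steps above can be sketched as follows. This is written in Python rather than Matlab, purely to illustrate the approach; the function name is made up.

```python
def extract_by_splitting(text: str) -> list:
    pieces = text.split(":")
    results = []
    # pieces[0] is the text before the first ':', so it is skipped.
    for piece in pieces[1:]:
        head, slash, _tail = piece.partition("/")
        # Keep the piece only if it really contains a '/'.
        if slash:
            results.append(head)
    return results


s = "abcd:45.72643,4.91203/Rou:hereanotherdata/defgh"
print(extract_by_splitting(s))  # ['45.72643,4.91203', 'hereanotherdata']
```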
As John Mawer and Adriaan mentioned, strsplit is a good place to start with. You can use it for both ':' and '/', but then you will not be able to determine where each of them started. If you do it with strsplit twice, you can know where the ':' starts :
A='abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
B=cellfun(@(x) strsplit(x,'/'),strsplit(A,':'),'uniformoutput',0);
Now each cell of B corresponds to one ':'-delimited piece of A, itself split on '/'. Pieces that contained a '/' have more than one cell, so you can select those and take the first part of each:
C=cellfun(@(x) x{1},B(cellfun('length',B)>1),'uniformoutput',0)
C =
1×2 cell array
'45.72643,4.91203' 'hereanotherdata'
Starting in R2016b you can use extractBetween:
>> str = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
>> result = extractBetween(str,':','/')
result =
2×1 cell array
{'45.72643,4.91203'}
{'hereanotherdata' }
If all your text elements have the same number of delimiters this can be vectorized too.

How can I normalize / asciify Unicode characters in Google Sheets?

I'm trying to write a formula for Google Sheets which will convert Unicode characters with diacritics to their plain ASCII equivalents.
I see that Google uses RE2 in its "REGEXREPLACE" function. And I see that RE2 offers Unicode character classes.
I tried to write a formula (similar to this one):
REGEXREPLACE("público","(\pL)\pM*","$1")
But Sheets produces the following error:
Function REGEXREPLACE parameter 2 value "\pL" is not a valid regular expression.
I suppose I could write a formula consisting of a long set of nested SUBSTITUTE functions (Like this one), but that seems pretty awful.
Can anyone offer a suggestion for a better way to normalize Unicode letters with diacritical/accent marks in a Google Sheets formula?
[[:^alpha:]] (negated ASCII character class) works fine for REGEXEXTRACT formula.
But =REGEXREPLACE("público","([[:alpha:]])[[:^alpha:]]","$1") gives "pblic" as a result. So, I guess, formula doesn't know what exact ASCII character must replace "ú".
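As a point of comparison outside Sheets: the idea behind (\pL)\pM* (keep each letter, drop the marks that follow it) only has something to remove when the text is in decomposed form, because a precomposed "ú" is a single letter with no separate mark. Here is a minimal Python sketch of that idea using the standard unicodedata module. The function name strip_marks is made up for illustration, and this is not something Sheets can do natively.

```python
import unicodedata


def strip_marks(text: str) -> str:
    # NFD splits a precomposed letter such as "ú" into "u" plus a combining mark.
    decomposed = unicodedata.normalize("NFD", text)
    # combining() is non-zero for combining marks, so those are dropped.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_marks("público"))  # publico
```

Letters that have no decomposition (for example "ø") pass through unchanged, so a lookup table is still needed for those.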
Workaround
Let's take the word públicē; we need to replace two symbols in it. Put this word in cell A1, and this formula in cell B1:
=JOIN("",ArrayFormula(IFERROR(VLOOKUP(SPLIT(REGEXREPLACE(A1,"(.)","$1-"),"-"),D:E,2,0),SPLIT(REGEXREPLACE(A1,"(.)","$1-"),"-"))))
Then build a lookup table of replacements in range D:E:
     D     E
1    ú     u
2    ē     e
3    ...   ...
This formula is still ugly, but it is more useful because you can control the replacements by adding more characters to the table.
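The workaround boils down to: split into characters, look each one up, fall back to the original character when there is no entry, and join. A minimal Python sketch of that logic, where the dict stands in for the D:E table and the names are made up for illustration:

```python
# Plays the role of the D:E table: character to replace -> replacement.
replacements = {"ú": "u", "ē": "e"}


def transliterate(text: str, table: dict) -> str:
    # table.get(ch, ch) is the IFERROR branch: characters without an
    # entry are kept as they are.
    return "".join(table.get(ch, ch) for ch in text)


print(transliterate("públicē", replacements))  # publice
```

One difference to keep in mind: a dict lookup is case-sensitive, whereas VLOOKUP is not.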
Or use JavaScript
Also found a good solution, which works in google sheets.
This did it for me in Google Sheets, Google Apps Scripts, GAS
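function normalizetext(text) {
  // Each character found in `weird` is replaced by the character at the same
  // index in `normalized`; indexes past the end of `normalized` yield ''
  // (the character is removed).
  var weird = 'öüóőúéáàűíÖÜÓŐÚÉÁÀŰÍçÇ!#£$%^&*()_+?/*."';
  var normalized = 'ouooueaauiOUOOUEAAUIcC ';
  var str = text.toString();
  var new_text = '';
  for (var i = 0; i < str.length; i++) {
    var ch = str.charAt(i);
    // indexOf does a literal search; String.search() would treat characters
    // such as ?, *, ( and . as regular-expression syntax.
    var idoff = weird.indexOf(ch);
    new_text += (idoff === -1) ? ch : normalized.charAt(idoff);
  }
  return new_text;
}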
This answer doesn't require a Google App Script, and it's still fast, and relatively simple. It builds on Max's answer by providing a full lookup table, and it also allows for case-sensitive transliteration (normally VLOOKUP is NOT case-sensitive).
Here is a link to the Google Spreadsheet if you want to jump right into it. If you want to use your own sheet, you'll need to copy the TRANS_TABLE sheet into your Spreadsheet.
In the code snippet below, the source cell is A2, so you'd place this formula in any column on row 2. Using REGEXREPLACE AND SPLIT, we split apart the string in A2 into an array of characters, then USING ARRAYFORMULA, we do the following to EACH character in the array: First, the character is converted to its 'decimal' CODE equivalent, then matched against a table on the TRANS_TABLE sheet by that number, then using VLOOKUP, a character X number of columns over (the index value provided) on the TRANS_TABLE sheet (in this case, the 3rd column over) is returned. When all characters in the array have been transliterated, we finally JOIN the array of characters back into a single string. I provided examples with named ranges as well.
=iferror(
  join(
    "",
    ARRAYFORMULA(
      vlookup(
        code(split(REGEXREPLACE($A2,"(.)", "$1;"),";",TRUE)),
        TRANS_TABLE!$A$5:$F,3
      )
    )
  )
,)
You'll note on the TRANS_TABLE sheet I made, I created 4 different transliteration columns, which makes it easy to have a column for each of your transliteration needs. To reference the column, just use a different index number in the VLOOKUP. Each column is simply a replacement character column. In some cases, you don't want any conversion made (A -> A or 3 -> 3), so you just copy the same character from the source Glyph column. Where you DO want to convert characters, you type in whatever character you want replaced (ñ -> n etc). If you want a character removed altogether, you leave the cell blank (? -> ''). You can see examples of the transliteration output on the data sheet in which I created 4 different transliteration columns (A-D) referencing each of the Transliteration tables from the TRANS_TABLE sheet for different use case scenarios.
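To make the table layout concrete, here is a rough Python sketch of the lookup-by-character-code idea with several replacement columns. The table contents, the column meanings, and the function name are all invented for illustration and do not reproduce the actual TRANS_TABLE sheet. Leaving unknown characters unchanged is a choice made for this sketch only.

```python
# Hypothetical stand-in for the TRANS_TABLE sheet:
# character code -> (replacement in column 0, replacement in column 1)
TRANS_TABLE = {
    ord("A"): ("A", "a"),  # no conversion in column 0, lowercased in column 1
    ord("ñ"): ("n", "n"),
    ord("Ñ"): ("N", "n"),
    ord("?"): ("", ""),    # blank cell: the character is removed
}


def transliterate_by_code(text: str, column: int) -> str:
    out = []
    for ch in text:
        row = TRANS_TABLE.get(ord(ch))
        # Keying on the character code keeps "ñ" and "Ñ" distinct.
        out.append(row[column] if row is not None else ch)
    return "".join(out)


print(transliterate_by_code("Ñañ?A", 0))  # NanA
print(transliterate_by_code("Ñañ?A", 1))  # nana
```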
I hope this finally answers your question in a fashion that isn't so "ugly." Cheers.

Spotfire: count the number of a certain character in a string

I am trying to add a new calculated column that counts the number of semicolons in a string and adds one to it. The column I have contains a bunch of aliases, and I need to know how many there are in each row.
For example,
A; B; C; D
So basically this means there are 4 aliases (3 semi colons + 1)
Need to do this for over 2 million rows. Help please!
The basic idea is to subtract the length of your string without the ; characters from its original length:
len([columnName])-len(Substitute([columnName],";",""))+1
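The arithmetic is easy to check outside Spotfire. A minimal Python sketch of the same length-difference trick (the function name is made up for illustration):

```python
def count_aliases(value: str) -> int:
    # Removing every ';' shortens the string by exactly the number of
    # semicolons; adding 1 turns separators into items.
    return len(value) - len(value.replace(";", "")) + 1


print(count_aliases("A; B; C; D"))  # 4
```

Note that an empty string yields 1 with this formula, so empty rows may need separate handling.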
Here it is with a regular expression:
Len(RXReplace([Column 1], "(?!;).", "", "gis"))+1
RXReplace takes as arguments:
The string you want to work on (in this case, Column 1)
The regular expression you want to use (here it is (?!;). )
What you want to replace matches with (blank in this situation, so that everything that matches the regex is removed)
Finally, a parameter saying how you want it to work (we are passing in gis, which means replace all matches rather than just the first, ignore case, and let . match newlines as well)
We wrap this in a Len, which gives us the number of semicolons since that is all that is left, and finally we add 1 to it to get the final result.
You can read more about the regular expression here: https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx but in a nutshell it says match everything that isn't a semi colon.
You can read more about RXReplace and Len here: https://docs.tibco.com/pub/spotfire/6.0.0-november-2013/userguide-webhelp/ncfe/ncfe_text_functions.htm
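The same regular expression can be tried in Python to see what it leaves behind. This is only a sketch with a made-up function name; re.sub replaces all matches by default, and re.DOTALL is used here as the counterpart of letting . match newlines.

```python
import re


def count_aliases_regex(value: str) -> int:
    # (?!;). matches any single character that is not a semicolon,
    # so after the substitution only the semicolons remain.
    only_semicolons = re.sub(r"(?!;).", "", value, flags=re.DOTALL | re.IGNORECASE)
    return len(only_semicolons) + 1


print(count_aliases_regex("A; B; C; D"))  # 4
```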

subset doesn't recognize regex

I just cannot get this working. I want to subset all the rows containing "mail". I use this:
Email <- subset(Total_Content, source == ".*mail.*")
I have rows like this ones:
"snt152.mail.live.com",
"mailing.serviciosmovistar.com",
"blu179.mail.live.com"
But when using "View(Email)",
I just get an empty data.frame (I only see the columns). I don't need to escape any metacharacters, because I need the "." to mean "any character" and the "*" to mean "0 or more times", right? Thanks.
Well, no, it doesn't - it's not meant to. You're not passing it a regular expression to be evaluated against each row, you're just passing it a character string; it doesn't know that . and * are regex characters because it's not performing a regex search. It's returning all rows where source is the literal string .*mail.* - which in this case is 0 rows.
What you probably want to be doing (I'm assuming this is a data.frame, here) is:
Email <- Total_Content[grepl(x = Total_Content$source, pattern = ".*mail.*"),]
grepl produces a set of boolean values of whether each entry in Total_Content$source matched the pattern. Total_Content[boolean_vector,] limits to those rows of Total_Content where the equivalent boolean is TRUE.
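The difference between an equality test and a regex test is not specific to R. Here is a small Python sketch of the same contrast, using the three values from the question plus one made-up row (example.org) that should not match:

```python
import re

rows = [
    {"source": "snt152.mail.live.com"},
    {"source": "mailing.serviciosmovistar.com"},
    {"source": "blu179.mail.live.com"},
    {"source": "example.org"},  # made-up row for illustration
]

# Equality compares against the literal text ".*mail.*", so nothing matches.
literal = [r for r in rows if r["source"] == ".*mail.*"]

# A regex search actually interprets the pattern.
pattern = [r for r in rows if re.search(r".*mail.*", r["source"])]

print(len(literal), len(pattern))  # 0 3
```

Since a regex search matches anywhere in the string, the plain pattern "mail" would select the same rows.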
Why not use subset with a logical regex function?
Email <- subset(Total_Content, grepl(".*mail.*", source) )
The subset function does create a local environment for the evaluation of expressions that are used in either the 'subset' (row targets) or the 'select' (column targets) arguments.