How can I normalize / asciify Unicode characters in Google Sheets? - regex

I'm trying to write a formula for Google Sheets which will convert Unicode characters with diacritics to their plain ASCII equivalents.
I see that Google uses RE2 in its "REGEXREPLACE" function. And I see that RE2 offers Unicode character classes.
I tried to write a formula (similar to this one):
REGEXREPLACE("público","(\pL)\pM*","$1")
But Sheets produces the following error:
Function REGEXREPLACE parameter 2 value "\pL" is not a valid regular expression.
I suppose I could write a formula consisting of a long set of nested SUBSTITUTE functions (Like this one), but that seems pretty awful.
Can any offer a suggestion for a better way to normalize Unicode letters with diacritical/accent marks in a Google Sheets formula?

[[:^alpha:]] (negated ASCII character class) works fine for REGEXEXTRACT formula.
But =REGEXREPLACE("público","([[:alpha:]])[[:^alpha:]]","$1") gives "pblic" as a result. So, I guess, formula doesn't know what exact ASCII character must replace "ú".
Workaround
Let's take the word públicē; we need to replace two symbols in it. Put this word in cell A1, and this formula in cell B1:
=JOIN("",ArrayFormula(IFERROR(VLOOKUP(SPLIT(REGEXREPLACE(A1,"(.)","$1-"),"-"),D:E,2,0),SPLIT(REGEXREPLACE(A1,"(.)","$1-"),"-"))))
And then make directory of replacements in range D:E:
D E
1 ú u
2 ē e
3 ... ...
This formula is still ugly, but more useful because you can control your directory by adding more characters to the table.
Or use Java Script
Also found a good solution, which works in google sheets.

This did it for me in Google Sheets, Google Apps Scripts, GAS
function normalizetext(text) {
var weird = 'öüóőúéáàűíÖÜÓŐÚÉÁÀŰÍçÇ!#£$%^&*()_+?/*."';
var normalized = 'ouooueaauiOUOOUEAAUIcC ';
var idoff = -1,new_text = '';
var lentext = text.toString().length -1
for (i = 0; i <= lentext; i++) {
idoff = weird.search(text.charAt(i));
if (idoff == -1) {
new_text = new_text + text.charAt(i);
} else {
new_text = new_text + normalized.charAt(idoff);
}
}
return new_text;
}

This answer doesn't require a Google App Script, and it's still fast, and relatively simple. It builds on Max's answer by providing a full lookup table, and it also allows for case-sensitive transliteration (normally VLOOKUP is NOT case-sensitive).
Here is a link to the Google Spreadsheet if you want to jump right into it. If you want to use your own sheet, you'll need to copy the TRANS_TABLE sheet into your Spreadsheet.
In the code snippet below, the source cell is A2, so you'd place this formula in any column on row 2. Using REGEXREPLACE AND SPLIT, we split apart the string in A2 into an array of characters, then USING ARRAYFORMULA, we do the following to EACH character in the array: First, the character is converted to its 'decimal' CODE equivalent, then matched against a table on the TRANS_TABLE sheet by that number, then using VLOOKUP, a character X number of columns over (the index value provided) on the TRANS_TABLE sheet (in this case, the 3rd column over) is returned. When all characters in the array have been transliterated, we finally JOIN the array of characters back into a single string. I provided examples with named ranges as well.
=iferror(
join(
"",
ARRAYFORMULA(
vlookup(
code(split(REGEXREPLACE($A2,"(.)", "$1;"),";",TRUE)),
TRANS_TABLE!$A$5:$F,3
)
)
)
,)
You'll note on the TRANS_TABLE sheet I made, I created 4 different transliteration columns, which makes it easy to have a column for each of your transliteration needs. To reference the column, just use a different index number in the VLOOKUP. Each column is simply a replacement character column. In some cases, you don't want any conversion made (A -> A or 3 -> 3), so you just copy the same character from the source Glyph column. Where you DO want to convert characters, you type in whatever character you want replaced (ñ -> n etc). If you want a character removed altogether, you leave the cell blank (? -> ''). You can see examples of the transliteration output on the data sheet in which I created 4 different transliteration columns (A-D) referencing each of the Transliteration tables from the TRANS_TABLE sheet for different use case scenarios.
I hope this finally answers your question in a fashion that isn't so "ugly." Cheers.

Related

Google Sheets formula to check if string contains ALL unsorted characters in another string (Not App Script)?

I need a Google Sheets native formula (not App Script function) that return true if all characters in string B exist in string A.
For example, if A = ‘CDEFGH’
If B = ‘EDX’ —> Match = False (character X is not in string A)
If B = ‘HCE’ —> Match = True (Characters H, C, and E are all in string A)
Note the following:
The characters in B can be in any order and are not necessarily contiguous - I need to check for the presence of ALL (not any) of B’s characters (in any order) are present in string A.
I know how to do this in App Script but it is too slow for my application as I need to call this function thousands of times. So the solution has to be using Google Sheets native built-in functions excluding App Script.
I would love to do this using regular expression. If so, please show me how to get Google Sheets to read strings A and B in cells and return true/false when the condition above is met.
Try
=arrayformula(sum(--REGEXMATCH(B1,split(REGEXREPLACE($A$1,"(.)","$1~"),"~"))))=len(B1)
explanation
REGEXREPLACE($A$1,"(.)","$1~") will add a tilde after each character
then split the result by ~
compare with REGEXMATCH(B1,split(REGEXREPLACE(A1,"(.)","$1~"),"~"))
then sum (-- will tranform boolean to number) and compare with the length of A1
extension
to avoid any repetition, try
=arrayformula(sum(--REGEXMATCH(B1,unique(transpose(split(REGEXREPLACE($A$1,"(.)","$1~"),"~"))))))=len(join("",unique(transpose(split(REGEXREPLACE(B1,"(.)","$1~"),"~")))))

Extract a list of unique text characters/ emojis from a cell

I have a text in cell (A1) like this:
✌😋👅👅☝️😉🍌🍪💧💧
I want to extract the unique emojis from this cell into separate cells:
✌😋👅☝️😉🍌🍪💧
Is this possible?
You want to put each character of ✌😋👅👅☝️😉🍌🍪💧💧 to each cell by splitting using the built-in function of Google Spreadsheet.
Sample formula:
=SPLIT(REGEXREPLACE(A1,"(.)","$1#"),"#")
✌😋👅👅☝️😉🍌🍪💧💧 is put in a cell "A1".
Using REGEXREPLACE, # is put to between each character like ✌#😋#👅#👅#☝#️#😉#🍌#🍪#💧#💧#.
Using SPLIT, the value is splitted with #.
Result:
Note:
In your question, the value of ️ which cannot be displayed is included. It's \ufe0f. So "G1" can be seen like no value. But the value is existing. So please be careful this. If you want to remove the value, you can use ✌😋👅👅☝😉🍌🍪💧💧.
References:
REGEXREPLACE
SPLIT
Added:
From marikamitsos's comment, I could notice that my understanding was not correct. So the final result is as follows. This is from marikamitsos.
=TRANSPOSE(UNIQUE(TRANSPOSE(SPLIT(REGEXREPLACE(A1,"(.)","$1#"),"#"))))
or try:
=TRANSPOSE(UNIQUE(TRANSPOSE(REGEXEXTRACT(A1, REPT("(.)", LEN(A1))))))
Formula
Appears, one of the best formula solutions would be:
=SPLIT(REGEXREPLACE(A1,"(.)","$1#"),"#")
You may also add some additional checks like skin tones & intermediate chars:
=TRANSPOSE(SPLIT(REGEXREPLACE(A2,"(.[🏻🏼🏽🏾🏿"&CHAR(8205)&CHAR(65039)&"]*)","#$1"),"#"))
It will help to join some emojis as a single emoji.
Script
More precise way is to use the script:
https://github.com/orling/grapheme-splitter/blob/master/index.js
↑
Add the code to Script editor
Add code for sample usage:
function splitEmojis(string) {
var splitter = new GraphemeSplitter();
// split the string to an array of grapheme clusters (one string each)
var graphemes = splitter.splitGraphemes(string);
return graphemes;
}
Tests
Not 100% precise
1
Please note: some emojis are not correctly shown in sheets
🏴󠁧󠁢󠁷󠁬󠁳󠁿🏴󠁧󠁢󠁳󠁣󠁴󠁿🏴󠁧󠁢󠁥󠁮󠁧󠁿🏴
↑ emojis:
flag: England
flag: Scotland
flag: Wales
black flag
are the same for Google Sheets.
2
Vlookup function in #GoogleSheets and in #Excel thinks chars
#️⃣ and
*️⃣
are the same!

How can I separate a string by underscore (_) in google spreadsheets using regex?

I need to create some columns from a cell that contains text separated by "_".
The input would be:
campaign1_attribute1_whatever_yes_123421
And the output has to be in different columns (one per field), with no "_" and excluding the final number, as it follows:
campaign1 attribute1 whatever yes
It must be done using a regex formula!
help!
Thanks in advance (and sorry for my english)
=REGEXEXTRACT("campaign1_attribute1_whatever_yes_123421","(("&REGEXREPLACE("campaign1_attribute1_whatever_yes_123421","((_)|(\d+$))",")$1(")&"))")
What this does is replace all the _ with parenthesis to create capture groups, while also excluding the digit string at the end, then surround the whole string with parenthesis.
We then use regex extract to actuall pull the pieces out, the groups automatically push them to their own cells/columns
To solve this you can use the SPLIT and REGEXREPLACE functions
Solution:
Text - A1 = "campaign1_attribute1_whatever_yes_123421"
Formula - A3 = =SPLIT(REGEXREPLACE(A1,"_+\d*$",""), "_", TRUE)
Explanation:
In cell A3 We use SPLIT(text, delimiter, [split_by_each]), the text in this case is formatted with regex =REGEXREPLACE(A1,"_+\d$","")* to remove 123421, witch will give you a column for each word delimited by ""
A1 = "campaign1_attribute1_whatever_yes_123421"
A2 = "=REGEXREPLACE(A1,"_+\d*$","")" //This gives you : *campaign1_attribute1_whatever_yes*
A3 = SPLIT(A2, "_", TRUE) //This gives you: campaign1 attribute1 whatever yes, each in a separate column.
I finally figured it out yesterday in stackoverflow (spanish): https://es.stackoverflow.com/questions/55362/c%C3%B3mo-separo-texto-por-guiones-bajos-de-una-celda-en...
It was simple enough after all...
The reason I asked to be only in regex and for google sheets was because I need to use it in Google data studio (same regex functions than spreadsheets)
To get each column just use this regex extract function:
1st column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){0}([^_]*)_')
2nd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){1}([^_]*)_')
3rd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){2}([^_]*)_')
etc...
The only thing that has to be changed in the formula to switch columns is the numer inside {}, (column number - 1).
If you do not have the final number, just don't put the last "_".
Lastly, remember to do all the calculated fields again, because (for example) it gets an error with CPC, CTR and other Adwords metrics that are calculated automatically.
Hope it helps!

String excerpts

I would like to copy a certain string (out of a longer range of strings in one cell) and show it in a different cell with Google Sheets. This is what is in the initial cell A1:A :
"String 1","String 2","String 3"
In B1:B I'd like ONLY String 3, so without the "" and the other strings.
Is this possible with spreadsheets?
Or is there any other way of doing so?
Update
So the task is to get word inside double quotes. And the mathcing string is placed in the end of text.
You may use regular expressions to deal with that, the basic formula is:
=REGEXEXTRACT(A1,"([^""]+)""$")
This will give a word inside "" from text in cell A1 at the end of text.
For example:
some text...,"Thisthat","https://www.url.com/de/Thisthat"
gives https://www.url.com/de/Thisthat
You may also use arrayformula:
=ArrayFormula(REGEXEXTRACT(A1:A3,"([^""]+)""$"))
Please, read more about this functions here and here.
Old answer
if you want strings to be on their rows, use this formula in B1:
=ArrayFormula(if(A1:A = "String 3";A1:A;""))
If you have cells in A1:A, which contain 'string 3', and you want to match them too, use this:
=ArrayFormula(if(REGEXMATCH(A1:A , "String 3"),"String 3",""))

Regular Expression in ms excel

How can I use regular expression in excel ?
In above image I have column A and B. I have some values in column A. Here I need to move data after = in column B. For e.g. here in 1st row I have SELECT=Hello World. Here I want to remove = sign and move Hello world in column B. How can I do such thing?
Stackoverflow has many posts about adding regular expressions to Excel using VBA. For your particular example, you would need VBA to actually move a substring from one cell to another.
If you simply want to copy the substring, you can do so easily using the MID function:
=IFERROR(MID(A1,FIND("=",A1)+1,999),A1)
I used 999 to ensure that enough characters were grabbed.
IFERROR returns the cell as-is if an equals sign is not found.
To return the portion of string before the equals sign, do this:
=LEFT(A1,FIND("=",A1&"=")-1)
In this case, I appended the equals sign to A1, so FIND won't return an error if not found.
You can use the Text to Column functionality of MS-Excel giving '=' as delimiter.
Refer to this link:
Chop text in column to 60 charactersblocks
You can simply use Text to Column feature of excel for this:
Follow the below steps :
1) Select Column A.
2) Goto Data Tab in Menu Bar.
3) Click Text to Column icon.
4) Choose Delimited option and do Next and then check the Other options in delimiter and enter '=' in the entry box.
5) Just click finish.
Here are URL for Text to Column : http://www.excel-easy.com/examples/text-to-columns.html