Removing the last specific character from the results of my formula - regex

I'm using some VLOOKUPs to pull in text from another tab of my spreadsheet with the formula below:
={"Product Category Test";ARRAYFORMULA(IF(ISBLANK(A2:A),"",
VLOOKUP(A2:A,'Import Template'!A:DB,MATCH("Product Category",'Import
Template'!A1:DB1,0),false)&"|"&IF(VLOOKUP(A2:A,'Import Template'!A:DB,MATCH("Automatic
Categories",'Import Template'!A1:DB1,0),false)<>"",VLOOKUP(A2:A,'Import
Template'!A:DB,MATCH("Automatic Categories",'Import Template'!A1:DB1,0),false),"")))}
Example of results: Books|Coming Soon Images|
All of my results will be delimited by a "|", which will also be the final character. I need to remove the final "|" from the results, ideally without using a helper column. Is there a way to wrap another function around my formula to achieve this? I've played around with RIGHT and LEN but can't figure it out.
Thanks,

Use regex:
=ARRAYFORMULA({"Product Category Test"; REGEXREPLACE(""&IF(ISBLANK(A2:A),,
VLOOKUP(A2:A,'Import Template'!A:DB,MATCH("Product Category",'Import
Template'!A1:DB1,0),)&"|"&IF(VLOOKUP(A2:A,'Import Template'!A:DB,MATCH("Automatic
Categories",'Import Template'!A1:DB1,0), )<>"",VLOOKUP(A2:A,'Import
Template'!A:DB,MATCH("Automatic Categories",'Import Template'!A1:DB1,0),),)), "\|$", )})
If this doesn't work, make sure there are no trailing spaces after the last |.
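For anyone who wants to check the pattern outside Sheets, here is a minimal Python sketch of the same idea. The function name strip_trailing_pipe is hypothetical, and the pattern is widened to \|\s*$ (an assumption that goes beyond the \|$ used above) so that trailing spaces after the last | do not block the match.

```python
import re

def strip_trailing_pipe(value):
    # Remove one final "|" plus any whitespace after it; earlier pipes are kept.
    return re.sub(r"\|\s*$", "", value)

print(strip_trailing_pipe("Books|Coming Soon Images|"))
```

Values that do not end in | pass through unchanged, so the wrapper should be safe to apply to every row.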

Extract a list of unique text characters/ emojis from a cell

I have a text in cell (A1) like this:
✌😋👅👅☝️😉🍌🍪💧💧
I want to extract the unique emojis from this cell into separate cells:
✌😋👅☝️😉🍌🍪💧
Is this possible?
You want to split ✌😋👅👅☝️😉🍌🍪💧💧 so that each character goes into its own cell, using the built-in functions of Google Sheets.
Sample formula:
=SPLIT(REGEXREPLACE(A1,"(.)","$1#"),"#")
✌😋👅👅☝️😉🍌🍪💧💧 is put in a cell "A1".
REGEXREPLACE inserts # after each character, giving ✌#😋#👅#👅#☝#️#😉#🍌#🍪#💧#💧#.
SPLIT then splits the value at each #.
Result:
Note:
The text in your question includes an invisible character, \ufe0f. Because of this, cell "G1" looks empty even though it does contain a value, so please be careful with it. If you want to avoid that value, you can use ✌😋👅👅☝😉🍌🍪💧💧 instead.
References:
REGEXREPLACE
SPLIT
Added:
marikamitsos's comment made me realize that I had misunderstood the question. The final formula, which comes from marikamitsos, is as follows:
=TRANSPOSE(UNIQUE(TRANSPOSE(SPLIT(REGEXREPLACE(A1,"(.)","$1#"),"#"))))
or try:
=TRANSPOSE(UNIQUE(TRANSPOSE(REGEXEXTRACT(A1, REPT("(.)", LEN(A1))))))
Formula
It appears that one of the best formula solutions would be:
=SPLIT(REGEXREPLACE(A1,"(.)","$1#"),"#")
You may also add some additional checks like skin tones & intermediate chars:
=TRANSPOSE(SPLIT(REGEXREPLACE(A2,"(.[🏻🏼🏽🏾🏿"&CHAR(8205)&CHAR(65039)&"]*)","#$1"),"#"))
This helps keep some multi-character emojis together as a single emoji.
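The same approach can be sketched in Python for comparison. This is only an illustration of the regex idea, not of how Google Sheets evaluates it: the names split_emojis and unique_emojis are hypothetical, and the pattern simply keeps skin-tone modifiers (U+1F3FB to U+1F3FF), the zero-width joiner (U+200D, CHAR(8205) above), and the variation selector (U+FE0F, CHAR(65039) above) attached to the preceding character.

```python
import re

# One base character followed by any modifiers that should stay attached to it.
CLUSTER = re.compile("(?s).[\U0001F3FB-\U0001F3FF\u200D\uFE0F]*")

def split_emojis(text):
    # Return the clusters in their original order, duplicates included.
    return CLUSTER.findall(text)

def unique_emojis(text):
    # dict.fromkeys keeps the first occurrence of each cluster, in order.
    return list(dict.fromkeys(split_emojis(text)))

sample = "\u270C\U0001F60B\U0001F445\U0001F445\u261D\uFE0F\U0001F609"
print(unique_emojis(sample))
```

Like the formula, this is approximate: an emoji that follows a zero-width joiner still starts a new match, so long joined sequences are not fully kept together.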
Script
A more precise way is to use a script:
https://github.com/orling/grapheme-splitter/blob/master/index.js
Add the code from that file to the Script editor.
Then add this code for sample usage:
function splitEmojis(string) {
  var splitter = new GraphemeSplitter();
  // split the string to an array of grapheme clusters (one string each)
  var graphemes = splitter.splitGraphemes(string);
  return graphemes;
}
Tests
Not 100% precise
1
Please note: some emojis are not correctly shown in sheets
🏴󠁧󠁢󠁷󠁬󠁳󠁿🏴󠁧󠁢󠁳󠁣󠁴󠁿🏴󠁧󠁢󠁥󠁮󠁧󠁿🏴
↑ emojis:
flag: England
flag: Scotland
flag: Wales
black flag
are the same for Google Sheets.
2
The VLOOKUP function in #GoogleSheets and in #Excel thinks the characters
#️⃣ and
*️⃣
are the same!

How can I separate a string by underscore (_) in google spreadsheets using regex?

I need to create some columns from a cell that contains text separated by "_".
The input would be:
campaign1_attribute1_whatever_yes_123421
And the output has to be in different columns (one per field), with no "_" and excluding the final number, as follows:
campaign1 attribute1 whatever yes
It must be done using a regex formula!
help!
Thanks in advance (and sorry for my english)
=REGEXEXTRACT("campaign1_attribute1_whatever_yes_123421","(("&REGEXREPLACE("campaign1_attribute1_whatever_yes_123421","((_)|(\d+$))",")$1(")&"))")
What this does is replace every _ with parentheses to create capture groups, while also excluding the digit string at the end, and then surround the whole string with parentheses.
We then use REGEXEXTRACT to actually pull the pieces out; the groups automatically push them into their own cells/columns.
To solve this you can use the SPLIT and REGEXREPLACE functions
Solution:
Text - A1 = "campaign1_attribute1_whatever_yes_123421"
Formula - A3 = =SPLIT(REGEXREPLACE(A1,"_+\d*$",""), "_", TRUE)
Explanation:
In cell A3 we use SPLIT(text, delimiter, [split_by_each]). The text is first cleaned with REGEXREPLACE(A1,"_+\d*$","") to remove the trailing _123421, and SPLIT then gives you one column for each word delimited by "_".
A1 = "campaign1_attribute1_whatever_yes_123421"
A2 = REGEXREPLACE(A1,"_+\d*$","") // This gives you: campaign1_attribute1_whatever_yes
A3 = SPLIT(A2, "_", TRUE) // This gives you: campaign1 attribute1 whatever yes, each in a separate column.
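The two steps above (trim the trailing number, then split) can be sketched in Python as a quick check of the pattern. The function name split_fields is hypothetical; the regex is the same _+\d*$ used in the formula.

```python
import re

def split_fields(text):
    # Step 1: drop the final underscore(s) and the digits that follow them.
    trimmed = re.sub(r"_+\d*$", "", text)
    # Step 2: split what is left on the underscore delimiter.
    return trimmed.split("_")

print(split_fields("campaign1_attribute1_whatever_yes_123421"))
```

Unlike SPLIT in Sheets, Python's str.split keeps empty fields, so doubled underscores in the middle of the text would produce empty strings here.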
I finally figured it out yesterday in stackoverflow (spanish): https://es.stackoverflow.com/questions/55362/c%C3%B3mo-separo-texto-por-guiones-bajos-de-una-celda-en...
It was simple enough after all...
The reason I asked for a regex-only solution for Google Sheets is that I need to use it in Google Data Studio (which has the same regex functions as spreadsheets).
To get each column just use this regex extract function:
1st column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){0}([^_]*)_')
2nd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){1}([^_]*)_')
3rd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){2}([^_]*)_')
etc...
The only thing that has to be changed in the formula to switch columns is the number inside {}, which is (column number - 1).
If you do not have the final number, just don't put the last "_".
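To see how the {n} counter selects a column, here is a small Python sketch that uses the same pattern. The helper name nth_field is hypothetical, and Python's re module stands in for Data Studio's REGEXP_EXTRACT, so treat it as an approximation of the behavior.

```python
import re

def nth_field(text, n):
    # Skip n "field_" prefixes, then capture the next field. The trailing "_"
    # in the pattern means the last field (the final number) never matches.
    match = re.search(r"^(?:[^_]*_){%d}([^_]*)_" % n, text)
    return match.group(1) if match else None

campaign = "campaign1_attribute1_whatever_yes_123421"
print([nth_field(campaign, n) for n in range(4)])
```

Asking for index 4 returns None here because the final number has no "_" after it, which matches the advice above about dropping the last "_" when there is no final number.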
Lastly, remember to do all the calculated fields again, because (for example) it gets an error with CPC, CTR and other Adwords metrics that are calculated automatically.
Hope it helps!

perform substring extraction on data frame column

I have a dataframe with 1 column called 'full_url'. Each element of the column is just a url. How do I write a function to remove the 'http://' from all of the elements at once? I need to use some kind of regex because some don't have it at all, some have https, etc. The closest I've gotten is gsub(".*//","",unlist(full_url))
but that also returns 'full_url1' 'full_url2' 'full_url3' ... as the row names for some reason
Without a reproducible example I'm not sure, but would something like this work?
sapply(df$full_url, function(x) ifelse(substr(x, 1, 7) == "http://", substr(x, 8, nchar(x)), x), USE.NAMES = FALSE)
This uses sapply to go element by element and substr to check whether the first 7 characters are "http://". If they are, it returns the string without the "http://"; if they're not, it returns x unchanged. Note that this only handles "http://", not "https://".
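Since the question asks for a regex that also covers https and URLs with no scheme, an anchored pattern such as ^https?:// is one option. The sketch below uses Python purely for illustration (the question itself is about R), and the name strip_scheme and the sample URLs are hypothetical.

```python
import re

def strip_scheme(url):
    # "^" anchors the match at the start, and "s?" makes the "s" optional,
    # so both "http://" and "https://" are removed; other URLs are untouched.
    return re.sub(r"^https?://", "", url)

urls = ["http://example.com/a", "https://example.org", "example.net/b"]
print([strip_scheme(u) for u in urls])
```

Anchoring at the start avoids the problem with ".*//", which would also consume any later "//" inside the URL.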

find a pattern in string and remove that pattern of the string from excel cells without touching the pattern in the middle of the string

I have a column which has "--" pattern in the beginning, middle and end of the string. For example:
-- myString
my -- String
myString --
I want to find these two types of cells
-- myString
myString --
and remove the "--" pattern, so it will look fine! I am an amateur Excel user, but I can use functions if you suggest them. It should be possible to use Find and pass its results to Replace, but I do not know how to pass the results to Replace.
Please note: the answer should handle all the cells in the column, of which there are hundreds. I need one solution that changes all of them, not a solution for a single cell.
EDIT: Just reread the request, per instruction from Gary'sStudent. This will remove all instances of "--", not only those at the beginning/end.
If the data is in A1, use the following formula:
=SUBSTITUTE(A1,"--","")
With data in A1, enter in B1:
=IF(LEFT(A1,2)="--",MID(A1,3,9999),IF(RIGHT(A1,2)="--",MID(A1,1,LEN(A1)-2),A1))
OK, I found the answer. The answer from @Dubison helped me to find the right one.
If the first two characters in the cell are "--" or the last two characters are "--", then substitute "--" with ""; otherwise do nothing.
=IF(LEFT(A1,2)="--",SUBSTITUTE(A1,"--",""),IF(RIGHT(A1,2)="--",SUBSTITUTE(A1,"--",""), A1))
This is pretty much the same as the previous answers, only using simpler logic. If your string's first or last character is "-", do nothing; otherwise replace "--" with "".
=IF(LEFT(A1,1)="-",A1,IF(RIGHT(A1,1)="-",A1, SUBSTITUTE(A1,"--","")))
UPDATE:
I noticed that I misread the question. The formula above removes the "--" only if it is in the middle. However, the original question was to remove "--" only if it is at the beginning or at the end. So the formula should be:
=IF(OR(LEFT(A1,2)="--",RIGHT(A1,2)="--"),SUBSTITUTE(A1,"--",""),A1)
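The rule being aimed for (drop "--" at the start or end, keep it in the middle) can also be written as a single anchored regex. The Python sketch below is only an illustration of that rule, not an Excel feature; the name strip_edge_dashes is hypothetical, and trimming the spaces next to the removed dashes is an added assumption.

```python
import re

def strip_edge_dashes(cell):
    # First alternative: "--" at the very start (with surrounding spaces).
    # Second alternative: "--" at the very end (with surrounding spaces).
    # A "--" in the middle matches neither anchor and is left alone.
    return re.sub(r"^\s*--\s*|\s*--\s*$", "", cell)

for value in ["-- myString", "my -- String", "myString --"]:
    print(repr(strip_edge_dashes(value)))
```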

How to split CSV line according to specific pattern

In a .csv file I have lines like the following :
10,"nikhil,khandare","sachin","rahul",viru
I want to split line using comma (,). However I don't want to split words between double quotes (" "). If I split using comma I will get array with the following items:
10
nikhil
khandare
sachin
rahul
viru
But I don't want the items between double-quotes to be split by comma. My desired result is:
10
nikhil,khandare
sachin
rahul
viru
Please help me to sort this out.
The character used for separating fields should not be present in the fields themselves. If possible, replace , with ; for separating fields in the csv file, it'll make your life easier. But if you're stuck with using , as separator, you can split each line using this regular expression:
/((?:[^,"]|"[^"]*")+)/
For example, in Python:
import re
s = '10,"nikhil,khandare","sachin","rahul",viru'
re.split(r'((?:[^,"]|"[^"]*")+)', s)[1::2]
=> ['10', '"nikhil,khandare"', '"sachin"', '"rahul"', 'viru']
Now to get the exact result shown in the question, we only need to remove those extra " characters:
[e.strip('" ') for e in re.split(r'((?:[^,"]|"[^"]*")+)', s)[1::2]]
=> ['10', 'nikhil,khandare', 'sachin', 'rahul', 'viru']
If your lines really always have such a simple structure, you can split on "," (yes, including the quotes) after discarding the first number and comma.
If not, you can use a very simple state machine that parses your input from left to right. It has two states: inside quotes and outside quotes. Regular expressions are also a good (and simpler) option if you already know them, as they are basically equivalent to a state machine, just in another form.
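A minimal sketch of such a two-state parser, in Python, might look like this. The function name split_csv_line is hypothetical, and the sketch assumes that quotes are never escaped inside a field (no "" sequences).

```python
def split_csv_line(line, sep=",", quote='"'):
    fields = []
    current = []
    in_quotes = False  # the two states: inside quotes or outside quotes
    for ch in line:
        if ch == quote:
            # A quote toggles the state; the quote character itself is dropped.
            in_quotes = not in_quotes
        elif ch == sep and not in_quotes:
            # A separator outside quotes ends the current field.
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields

print(split_csv_line('10,"nikhil,khandare","sachin","rahul",viru'))
```

In real Python code, the built-in csv module already implements these quoting rules, including escaped quotes, so a hand-written parser is mostly useful for understanding the idea.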