REGEXP_EXTRACT () every word except ‘,’ in a field - regex

I’d like to select the countries, without the ‘,’ separators, from a data field that looks like this:
Japan,Singapore,Italy,France
My code looks like this: REGEXP_EXTRACT(country,'([^,]*)'). It works, but only the first country is selected. How can I write it so that all of the countries are selected?

I slightly changed the regex to ([^,]+) so that a country name must contain at least one character. Using * produces empty matches, so only every other match contains a country name.
The /g flag at the end is also important: it makes the regex match globally.
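The difference between * and + is easy to check outside Data Studio. The sketch below uses Python's re module purely as a stand-in (it is not the Data Studio function), with re.findall playing the role of a global match:

```python
import re

country = "Japan,Singapore,Italy,France"

# '*' can also match the empty string, so empty matches
# show up in between the country names.
star_matches = re.findall(r"[^,]*", country)

# '+' requires at least one character, so only the names are returned.
plus_matches = re.findall(r"[^,]+", country)

print(star_matches)
print(plus_matches)  # ['Japan', 'Singapore', 'Italy', 'France']
```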

If you want to extract all the characters except the comma, this can be achieved with either of the REGEXP_REPLACE calculated fields below:
1) Replace , with (space)
REGEXP_REPLACE(country, ",", " ")
2) Remove ,
REGEXP_REPLACE(country, ",", "")
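For comparison, the same two replacements can be sketched in Python with re.sub. This is only an illustration of what each calculated field returns, not Data Studio syntax:

```python
import re

country = "Japan,Singapore,Italy,France"

with_spaces = re.sub(",", " ", country)    # 1) replace each comma with a space
without_commas = re.sub(",", "", country)  # 2) remove the commas entirely

print(with_spaces)     # Japan Singapore Italy France
print(without_commas)  # JapanSingaporeItalyFrance
```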


How to split a string in db2?

I have some URLs in my cas_fnd_dwd_det table:
casi_imp_urls cas_code
----------------------------------- -----------
www.casiac.net/fnds/CASI/qnxp.pdf
www.casiac.net/fnds/casi/as.pdf
www.casiac.net/fnds/casi/vindq.pdf
www.casiac.net/fnds/CASI/mnip.pdf
How do I copy the letters between the last '/' and '.pdf' to another column?
Expected outcome:
casi_imp_urls cas_code
----------------------------------- -----------
www.casiac.net/fnds/CASI/qnxp.pdf qnxp
www.casiac.net/fnds/casi/as.pdf as
www.casiac.net/fnds/casi/vindq.pdf vindq
www.casiac.net/fnds/CASI/mnip.pdf mnip
The URL prefixes below are static:
www.casiac.net/fnds/CASI/
www.casiac.net/fnds/casi/
Please advise: how do I select the codes between the last '/' and '.pdf'?
I would recommend taking a look at REGEXP_SUBSTR, which lets you apply a regular expression. Db2 has string processing functions, but the regex function may be the easiest solution. See the SO question on regex and URI parts for different ways of writing the expression. The following returns the last slash, the file name and the extension:
SELECT REGEXP_SUBSTR('http://fobar.com/one/two/abc.pdf','\/(\w)*\.pdf' ,1,1)
FROM sysibm.sysdummy1
/abc.pdf
The following uses REGEXP_REPLACE; the pattern is from this SO question, with the pdf file extension added. It splits the string into three parts: everything up to the last slash (a non-capturing group), then the file name (group 1), then the ".pdf" (group 2). The '$1' returns group 1; group 0 would be the whole match.
SELECT REGEXP_REPLACE('http://fobar.com/one/two/abc.pdf','(?:.+\/)(.+)(\.pdf)','$1' ,1,1)
FROM sysibm.sysdummy1
abc
You could apply LENGTH and SUBSTR to extract the relevant part or try to build that into the regex.
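The LENGTH and SUBSTR idea can be sketched without any regex. The Python below is only an illustration of the arithmetic; code_from_url is a hypothetical helper name, and it assumes every URL really ends with the given extension:

```python
def code_from_url(url: str, extension: str = ".pdf") -> str:
    """Return the text between the last '/' and the trailing extension.

    Assumption: the URL ends with `extension` (same length, any case).
    """
    start = url.rfind("/") + 1       # position just after the last slash
    end = len(url) - len(extension)  # position where the extension starts
    return url[start:end]


print(code_from_url("http://fobar.com/one/two/abc.pdf"))  # abc
```

In SQL terms, start corresponds to the position of the last slash plus one, and end - start is the length passed to SUBSTR.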
For Db2 versions older than 11.1: I'm not sure whether it works on 9.5, but it should definitely work from 9.7 onward.
Try this as is:
with cas_fnd_dwd_det (casi_imp_urls) as (values
'www.casiac.net/fnds/CASI/qnxp.pdf'
, 'www.casiac.net/fnds/casi/as.pdf'
, 'www.casiac.net/fnds/casi/vindq.pdf'
, 'www.casiac.net/fnds/CASI/mnip.PDF'
)
select
casi_imp_urls
, xmlcast(xmlquery('fn:replace($s, ".*/(.*)\.pdf", "$1", "i")' passing casi_imp_urls as "s") as varchar(50)) cas_code
from cas_fnd_dwd_det
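To see what the fn:replace pattern and its "i" flag do, here is a rough Python equivalent run against the same four sample URLs. It is a sketch for checking the regex only, not something Db2 executes:

```python
import re

urls = [
    "www.casiac.net/fnds/CASI/qnxp.pdf",
    "www.casiac.net/fnds/casi/as.pdf",
    "www.casiac.net/fnds/casi/vindq.pdf",
    "www.casiac.net/fnds/CASI/mnip.PDF",
]

# Greedy '.*/' consumes everything up to the last slash; group 1 is the code.
# re.IGNORECASE stands in for the "i" flag, so '.PDF' is matched as well.
pattern = re.compile(r".*/(.*)\.pdf", re.IGNORECASE)

cas_codes = [pattern.sub(r"\1", url) for url in urls]
print(cas_codes)  # ['qnxp', 'as', 'vindq', 'mnip']
```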

string replace method to be replaced by regular expression

I am using string replace method to clean-up column names.
df.columns=df.columns.str.replace("#$%./- ","").str.replace(' ', '_').str.replace('.', '_').str.replace('(','').str.replace(')','').str.replace('.','').str.lower()
Though it works, it certainly does not look pythonic. Any suggestions?
I need only A-Za-z, plus underscore _ where required, in the column names.
Update:
I tried using a regular expression in the first replace call, but I still need to chain the calls like this:
terms.columns=terms.columns.str.replace(r"^[^a-zA-Z1-9]*", '').str.replace(' ', '_').str.replace('(','').str.replace(')','').str.replace('.', '').str.replace(',', '')
Update showing test data:
Original string (Tab separated):
[Sr.No. Course Terms Besic of Education Degree Course Course Approving Authority (i.e Medical Council, etc.) Full form of Course 1 year Duration 2nd year 3rd year Duration 4 th year Duration]
Change column names:
terms.columns=terms.columns.str.replace(r"^[^a-zA-Z1-9]*", '').str.replace(' ', '_').str.replace('(','').str.replace(')','').str.replace('.', '').str.replace(',', '').str.lower()
Output:
['srno', 'course', 'terms', 'besic_of_education', 'degree_course',
'course_approving_authority_ie_medical_council_etc',
'full_form_of_course', '1_year_duration', '2nd_year_',
'3rd_year_duration', '4_th_year_duration']
The above output is correct. The question: is there any way to achieve the same result other than the way I have used?
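You can use fewer .replace operations by first replacing runs of non-word, non-whitespace characters with an empty string and then replacing runs of whitespace with an underscore. (regex=True is needed in recent pandas versions, where str.replace treats the pattern as a literal string by default.)
df.columns.str.replace(r"[^\w\s]+", "", regex=True).str.replace(r"\s+", "_", regex=True).str.lower()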
I hope this helps.
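The two-step idea can be checked on the test data from the question. The sketch below uses plain re instead of pandas so it runs without extra dependencies; the column boundaries are an assumption, inferred from the output shown in the question, and clean_name is a hypothetical helper name:

```python
import re

# Assumed column boundaries (the original was tab separated).
columns = [
    "Sr.No.", "Course", "Terms", "Besic of Education", "Degree Course",
    "Course Approving Authority (i.e Medical Council, etc.)",
    "Full form of Course", "1 year Duration", "2nd year ",
    "3rd year Duration", "4 th year Duration",
]


def clean_name(name: str) -> str:
    name = re.sub(r"[^\w\s]+", "", name)  # drop punctuation such as . , ( )
    name = re.sub(r"\s+", "_", name)      # collapse whitespace into one '_'
    return name.lower()


cleaned = [clean_name(c) for c in columns]
print(cleaned)
```

Note that \w also keeps digits and underscores, which is why names such as 1_year_duration survive unchanged.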

How can I separate a string by underscore (_) in google spreadsheets using regex?

I need to create some columns from a cell that contains text separated by "_".
The input would be:
campaign1_attribute1_whatever_yes_123421
And the output has to be in different columns (one per field), with no "_" and excluding the final number, as follows:
campaign1 attribute1 whatever yes
It must be done using a regex formula!
help!
Thanks in advance (and sorry for my english)
=REGEXEXTRACT("campaign1_attribute1_whatever_yes_123421","(("&REGEXREPLACE("campaign1_attribute1_whatever_yes_123421","((_)|(\d+$))",")$1(")&"))")
What this does is replace every _ with parentheses to create capture groups, while also excluding the digit string at the end, and then surround the whole string with parentheses.
We then use REGEXEXTRACT to actually pull the pieces out; the groups automatically push them into their own cells/columns.
To solve this you can use the SPLIT and REGEXREPLACE functions
Solution:
Text - A1 = "campaign1_attribute1_whatever_yes_123421"
Formula - A3 = =SPLIT(REGEXREPLACE(A1,"_+\d*$",""), "_", TRUE)
Explanation:
In cell A3 we use SPLIT(text, delimiter, [split_by_each]). The text in this case is first cleaned with REGEXREPLACE(A1,"_+\d*$","") to remove 123421, which then gives you one column for each word delimited by "_".
A1 = "campaign1_attribute1_whatever_yes_123421"
A2 = REGEXREPLACE(A1,"_+\d*$","") //This gives you: campaign1_attribute1_whatever_yes
A3 = SPLIT(A2, "_", TRUE) //This gives you: campaign1 attribute1 whatever yes, each in a separate column.
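The same strip-then-split logic can be traced in Python. This is only a sketch of the two steps, not Sheets syntax:

```python
import re

a1 = "campaign1_attribute1_whatever_yes_123421"

# Step A2: remove the trailing underscore(s) and the final number.
a2 = re.sub(r"_+\d*$", "", a1)

# Step A3: split what is left on the underscore.
a3 = a2.split("_")

print(a2)  # campaign1_attribute1_whatever_yes
print(a3)  # ['campaign1', 'attribute1', 'whatever', 'yes']
```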
I finally figured it out yesterday on Stack Overflow in Spanish: https://es.stackoverflow.com/questions/55362/c%C3%B3mo-separo-texto-por-guiones-bajos-de-una-celda-en...
It was simple enough after all...
The reason I asked for a regex-only solution for Google Sheets is that I need to use it in Google Data Studio (which has the same regex functions as spreadsheets).
To get each column just use this regex extract function:
1st column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){0}([^_]*)_')
2nd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){1}([^_]*)_')
3rd column: REGEXP_EXTRACT(Campaña, '^(?:[^_]*_){2}([^_]*)_')
etc...
The only thing that has to be changed in the formula to switch columns is the number inside {}, which is (column number - 1).
If you do not have the final number, just don't put the last "_".
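The {n} trick can be tried out in Python as a quick check. nth_field is a hypothetical helper name, and Python's re is used here only to exercise the pattern, not as Data Studio syntax:

```python
import re


def nth_field(text: str, column: int):
    """Return the field in the given 1-based column, or None if no match."""
    # Skip (column - 1) 'field_' prefixes, then capture the next field,
    # which must itself be followed by an underscore.
    pattern = r"^(?:[^_]*_){" + str(column - 1) + r"}([^_]*)_"
    match = re.search(pattern, text)
    return match.group(1) if match else None


campaign = "campaign1_attribute1_whatever_yes_123421"
for column in range(1, 5):
    print(column, nth_field(campaign, column))
```

Because the pattern ends with "_", the final number (which has no underscore after it) is never returned.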
Lastly, remember to do all the calculated fields again, because (for example) it gets an error with CPC, CTR and other Adwords metrics that are calculated automatically.
Hope it helps!

variable number of capturing groups

I have an XPath expression which I want to use to extract the city and the date from a td that contains a string of this kind:
City(may contain spaces and may be missing, but the following space is always present) on 2013/07/20
So far, I got to the following solution for extracting the date, which works partially:
//path/to/my/td/text()/replace(.,'(.*) on (.*)','$3')
This works when City is present, but when City is missing I get "on 2013/07/20" as a result.
I think this is because the first capturing group fails and so the number of groups is different.
How can I get this expression to work?
I did not fully check your regex, but it looks fine at first sight. Anyway, you can also go an easier way if you only want to get the date by extracting the text after "on ":
//path/to/my/td/text()/substring-after(.,'on ')
Edit: alternatively, you can use substring and select the last 10 characters of the content:
//path/to/my/td/text()/substring(., string-length(.) - 9)
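Both variants are simple enough to mirror in Python for a quick comparison. The function names are hypothetical, and the sample strings are made up to follow the format described in the question:

```python
def date_after_on(text: str) -> str:
    # Like substring-after(., 'on '): everything after the FIRST 'on '.
    return text.partition("on ")[2]


def last_ten(text: str) -> str:
    # Like substring(., string-length(.) - 9): the last 10 characters.
    return text[-10:]


for sample in ["Paris on 2013/07/20", " on 2013/07/20", "London on 2013/07/20"]:
    print(repr(sample), "->", repr(date_after_on(sample)), repr(last_ten(sample)))
```

One thing this makes visible: a city whose name itself contains "on " at the split point (such as "London" followed by a space) trips the first variant, while the fixed-length variant is unaffected as long as the date is always 10 characters long.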

Use REGEXP_SUBSTR like a Split function

I need to extract a text value from data in a VARCHAR2 column. Sample:
EDKES^Visit: ^PRIMARY INSURANCE COMMENTS: ^SECONDARY INSURANCE COMMENTS: ^TERTIARY INSURANCE COMMENTS: ^NO PRIMARY INSURANCE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NO SECONDARY INSURANCE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NONE^NO TERTIARY INS*
I need to get the text that precedes the 6th occurrence of the '^' (excluding the '^'). In this example, the text would be NO PRIMARY INSURANCE.
([\w\s\:\*]+(\^?)) mostly works, but doesn't exclude the '^'.
When I try to use this expression REGEXP_SUBSTR(VARCHAR_COL, '([\w\s\:\*]+(\^?))', 1, 6), I get a single character ('s'), rather than the expected match NO PRIMARY INSURANCE^.
What am I missing?
This should work pretty well:
REPLACE(REGEXP_SUBSTR(VARCHAR_COL, '[^^]+\^?', 1, 6), '^', '')
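The pattern can be checked against a shortened version of the sample data. The Python below is a sketch of the same logic (take the 6th match, then strip the delimiter), not Oracle syntax:

```python
import re

# Shortened copy of the sample value from the question.
sample = (
    "EDKES^Visit: ^PRIMARY INSURANCE COMMENTS: ^SECONDARY INSURANCE COMMENTS: "
    "^TERTIARY INSURANCE COMMENTS: ^NO PRIMARY INSURANCE^NONE^NONE"
)

# Each match is a run of non-'^' characters plus an optional trailing '^'.
matches = re.findall(r"[^^]+\^?", sample)

# Occurrence 6 in REGEXP_SUBSTR corresponds to index 5 here.
sixth = matches[5].replace("^", "")
print(sixth)  # NO PRIMARY INSURANCE
```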
You might be able to account for blank columns as well. And if the engine only returns the capture groups, it will trim the delimiter:
([^^]*).?
This of course means that the last column found is always invalid.
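The effect of the trailing invalid column is easy to see with a tiny made-up row that contains a blank column. This sketch assumes Python 3.7 or later, where re.findall returns only the capture group and also reports the empty match at the end of the input:

```python
import re

row = "a^^b"  # made-up sample: the second column is blank

# Group 1 is the column text; '.?' consumes the delimiter if there is one.
fields = re.findall(r"([^^]*).?", row)

print(fields)       # the last entry is the spurious empty match
print(fields[:-1])  # ['a', '', 'b']
```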