How to get text from URLs using regexp_extract in data studio - regex

Example URLs:
/en/current-season/abc-note-book/2018-abc-note-book-arun-1
/en/current-season/xyz-note-book/2018-xyz-note-book-kumar-2
/en/current-season/pqr-note-book/2018-pqr-note-book-rahul-3
I want to extract 'abc-note-book' section as column 1 from all the URLs
Expected Result:
abc note book
xyz note book
pqr note book
I also need to extract the 'arun-1' section as column 2 from all the URLs
Expected Result
arun-1
kumar-2
rahul-3
Please suggest how to extract these using REGEXP_EXTRACT in Data Studio, or whether there is another formula that would do it.
Thanks.

I created a Google Data Studio report (with an embedded Google Sheet) to demonstrate. The required text can be extracted using the REGEXP_EXTRACT function, and for Column 1, REGEXP_REPLACE can then replace each - with a space:
Column 1 (e.g. abc note book)
REGEXP_REPLACE(REGEXP_EXTRACT(URL, "/\\d+-(\\w+-\\w+-\\w+)"), "-", " ")
Column 2 (e.g. arun-1)
REGEXP_EXTRACT(URL, "(\\w+-\\d+)$")
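For anyone who wants to check the two patterns outside Data Studio, here is a minimal Python sketch. The function name extract_columns is made up for illustration, and Python raw strings need only a single backslash where the Data Studio formulas use a double one.

```python
import re

def extract_columns(url):
    """Return (column 1, column 2) for one URL, or None where a pattern fails."""
    # Column 1: three hyphenated words right after the "/<digits>-" prefix,
    # with the hyphens replaced by spaces (the REGEXP_REPLACE step).
    m1 = re.search(r"/\d+-(\w+-\w+-\w+)", url)
    col1 = m1.group(1).replace("-", " ") if m1 else None

    # Column 2: "<word>-<digits>" anchored at the end of the URL.
    m2 = re.search(r"(\w+-\d+)$", url)
    col2 = m2.group(1) if m2 else None
    return col1, col2

urls = [
    "/en/current-season/abc-note-book/2018-abc-note-book-arun-1",
    "/en/current-season/xyz-note-book/2018-xyz-note-book-kumar-2",
    "/en/current-season/pqr-note-book/2018-pqr-note-book-rahul-3",
]
for url in urls:
    print(extract_columns(url))
```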

Related

extract data immediately after numeric value in a cell in google sheets

I have cells in a Google spreadsheet that each contain a quantity of some entity, and I want to extract only the text that follows the quantity (the number).
For example, if my data is:
learn 10 functions
Watch 3 YT tutorial videos
complete 10 charts
I want the result to be:
functions
YT tutorial videos
charts
try:
=INDEX(IFNA(REGEXEXTRACT(A1:A, "\d+ (.+)")))
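Assuming the text never contains more than one number, you can use this regex:
(?<=\d\s).+$
It matches everything from just after a digit followed by a single whitespace character to the end of the line. Note that it relies on a lookbehind, which RE2 (the engine behind Google Sheets' regex functions) does not support, so it only works in engines that do, such as PCRE.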
Regex Demo
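To compare the two patterns above on the sample data, here is a small Python sketch. The function names are hypothetical, and Python's re module stands in for the Sheets and demo engines, so treat it as a check of the patterns rather than of Sheets behaviour.

```python
import re

def after_number(text):
    """Capture-group variant: the text after '<digits><space>'."""
    m = re.search(r"\d+ (.+)", text)
    return m.group(1) if m else None

def after_number_lookbehind(text):
    """Lookbehind variant: the whole match is the text after '<digit><whitespace>'."""
    m = re.search(r"(?<=\d\s).+$", text)
    return m.group(0) if m else None

samples = ["learn 10 functions", "Watch 3 YT tutorial videos", "complete 10 charts"]
for s in samples:
    print(after_number(s), "|", after_number_lookbehind(s))
```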

Regex for values that are in between spaces

I am new to regex and having difficulty obtaining values that are separated by spaces.
I am trying to get the values "field 1" and "abc/def try" from the sample data below using only regex.
Currently I'm using (^.{18}\s+) to skip the first 18 characters, but I am at a loss as to how to grab values that contain spaces.
A1234567890 field 1 abc/def try
02021051812 12 test test 12 pass
3333G132021 no test test cancel
Any help or pointers would be appreciated.
If this text has fixed-width columns, you can match and trim the column values knowing the amount of chars between start of string and the column text.
For example, this regex will work for the text you posted:
^(.*?)\s*(?<=.{19})(.*?)\s*(?<=^.{34})(.*?)\s*(?<=^.{46})
See the regex demo.
So, Column 2 starts at Position 19, Column 3 starts at Position 34 and Column 4 (end of string here) is at Position 46.
However, this regex is not very efficient, and it would be much better if the data format were fixed on the provider's side.
Since it is not known whether the data always has the same length, I created the following, which gives you one group per column (columns are assumed to be separated by two or more whitespace characters):
^((\s{0,1}\S{1,})*)(\s{2,})((\s{0,1}\S{1,})*)(\s{2,})((\s{0,1}\S{1,})*)
Regex demo
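As a rough check of the second pattern, here is a Python sketch. The padded row is an assumption: the spacing in the posted sample is not preserved, so the row is rebuilt with runs of two or more spaces between columns. The name split_columns is made up for illustration.

```python
import re

# The pattern from the answer above; groups 1, 4 and 7 hold the three columns.
COLUMNS = re.compile(
    r"^((\s{0,1}\S{1,})*)(\s{2,})((\s{0,1}\S{1,})*)(\s{2,})((\s{0,1}\S{1,})*)"
)

def split_columns(line):
    """Return the three column values, or None if the line does not match."""
    m = COLUMNS.match(line)
    if m is None:
        return None
    return m.group(1), m.group(4), m.group(7)

# Assumed layout: columns padded with at least two spaces.
row = "A1234567890       field 1        abc/def try"
print(split_columns(row))
```

Single spaces stay inside a column value, while runs of two or more spaces act as separators. Because the pattern nests quantifiers, it can backtrack heavily on lines that do not match, so it is worth testing on realistic input first.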

REGEXP_EXTRACT with String Value in Bigquery

I want to extract the words from a column whose values look like 'p-fr-youtube-car'. Each word should be extracted into its own column.
INPUT:
p-fr-youtube-car
DESIRED OUTPUT:
Country = fr
Channel = youtube
Item = car
I've tried the expression below to extract the first word, but can't figure out the rest. What regex will produce my desired output from this input? And how can I make it case insensitive, so that fr and FR are treated the same?
REGEXP_EXTRACT_ALL(CampaignName, r"^p-([a-z]*)") AS Country
You can use [^-]+ to match parts between hyphens and only capture what you need to fetch.
To get strings like youtube, you can use
REGEXP_EXTRACT_ALL(CampaignName, r'^p-[^-]+-([^-]+)')
To get strings like car, you can use
REGEXP_EXTRACT_ALL(CampaignName, r'^p-[^-]+-[^-]+-([^-]+)')
So, [^-]+ matches one or more chars other than - and ([^-]+) is the same pattern wrapped with a capturing group whose contents REGEXP_EXTRACT actually returns as a result.
You can use named groups.
Example Regex:
p-(?P<Country>[a-z]*)\-(?P<Channel>[a-z]*)\-(?P<Item>[a-z]*)$
https://regex101.com/r/fKoBIn/3
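The named-group pattern can be tried in Python, which uses the same (?P<name>...) syntax. This is only a sketch run in Python rather than BigQuery: parse_campaign is a hypothetical helper, and re.IGNORECASE plus lower() is one assumed way to make fr and FR come out the same.

```python
import re

# Same pattern as above; hyphens need no escaping outside a character class.
CAMPAIGN = re.compile(
    r"p-(?P<Country>[a-z]*)-(?P<Channel>[a-z]*)-(?P<Item>[a-z]*)$",
    re.IGNORECASE,
)

def parse_campaign(name):
    """Return a dict of lower-cased parts, or None if the name does not match."""
    m = CAMPAIGN.match(name)
    if m is None:
        return None
    return {key: value.lower() for key, value in m.groupdict().items()}

print(parse_campaign("p-fr-youtube-car"))
print(parse_campaign("P-FR-YouTube-Car"))
```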
Below is for BigQuery Standard SQL
I would recommend use of SPLIT in cases like yours
#standardSQL
SELECT CampaignName,
parts[SAFE_OFFSET(1)] AS Country,
parts[SAFE_OFFSET(2)] AS Channel,
parts[SAFE_OFFSET(3)] AS Item
FROM `project.dataset.table`,
UNNEST([STRUCT(SPLIT(CampaignName, '-') AS parts)])
Applied to the sample data from your question, the output is:
Row  CampaignName      Country  Channel  Item
1    p-fr-youtube-car  fr       youtube  car
If for some reason you are required to use a regexp, you can use the query below instead:
#standardSQL
SELECT CampaignName,
parts[SAFE_OFFSET(1)] AS Country,
parts[SAFE_OFFSET(2)] AS Channel,
parts[SAFE_OFFSET(3)] AS Item
FROM `project.dataset.table`,
UNNEST([STRUCT(REGEXP_EXTRACT_ALL(CampaignName, r'(?:^|-)([^-]*)') AS parts)])
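The split-and-index idea behind both queries can be sketched in Python as follows. The helper safe_offset is a hypothetical stand-in that imitates SAFE_OFFSET by returning None for a missing part, and the lower() call is an added assumption to cover the case-insensitivity part of the question.

```python
def safe_offset(parts, index):
    """Return parts[index], or None when the index is out of range."""
    return parts[index] if 0 <= index < len(parts) else None

def split_campaign(name):
    """Split a campaign name on '-' and pick the parts by position."""
    parts = name.lower().split("-")
    return {
        "Country": safe_offset(parts, 1),
        "Channel": safe_offset(parts, 2),
        "Item": safe_offset(parts, 3),
    }

print(split_campaign("p-fr-youtube-car"))
print(split_campaign("p-FR"))  # missing parts come back as None
```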

How can I achieve this price REGEX with REGEXMATCH in Google Spreadsheet?

Here is the deal,
I want to allow user to enter this kind of entries in my price column:
1 or 1234 or 1234,1 or 1234,1234 ...
So I used this regex, which works fine on regex101:
^\d+(,\d+)?$
https://regex101.com/r/D5dAXx/1
The only problem is that it does not behave the same way with Google Sheets' REGEXMATCH function:
=REGEXMATCH(TO_TEXT(C2), "^\d+(,\d+)?$")
For example, these entries do not match:
1
12
1,123
while these entries match correctly:
1,1
1,12
Why is that and what could be the correct REGEX?
My problem was a bad format on the column.
When I entered:
12,1234
the format turned it into
12.1234
which was not matching my REGEXMATCH.
This means that in Google Sheets the data validation criterion is applied after the cell formatting, so the regex sees the reformatted value rather than what was typed.
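A quick way to confirm that the pattern itself is fine and that the reformatted value is what fails is to run it in Python. This is only a sketch: is_valid_price is a made-up name, and fullmatch takes the place of the ^ and $ anchors.

```python
import re

PRICE = re.compile(r"\d+(,\d+)?")

def is_valid_price(text):
    """True when the whole string is digits, optionally followed by ',digits'."""
    return PRICE.fullmatch(text) is not None

for value in ["1", "12", "1,123", "1,1", "1,12", "12,1234", "12.1234"]:
    print(value, is_valid_price(value))
```

Every comma-separated entry passes, while 12.1234 (the value after the column format changed the separator) does not.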

how to do a fast regex search on a hdf5 database

I have an HDF5 database with 100 million+ rows of text each storing a simple three column set of values:
ID WORD HEADWORD
1 the the
2 cats cat
3 sat sit
4 on on
5 the the
6 mats mat
...
I want to search the "WORD" column to find all hits for 'at' (i.e., 'cats', 'sat', 'mats').
In some other database (e.g. PostgreSQL) I might do this with a simple regex search '?at?'. If I could search the HDF5 index using regex, that would be fine, but I don't think this is possible. Any suggestions for how to do this kind of 'wildcard' (regex) search quickly?
Try the following regex:
[^\s]+[\s]+([a-zA-Z]*at[a-zA-Z]*)[\s]+[^\s]+
Group 1 in the above regex will give you the desired result.
Debuggex Demo
Regex Demo
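To see what the pattern returns on the sample rows, here is a minimal Python sketch. It assumes each row is available as a whitespace-separated line of text; the in-memory list only stands in for the HDF5 data, and no HDF5 library is used. It illustrates the matching, not a fast search over 100 million rows.

```python
import re

# Pattern from the answer above: ID, whitespace, WORD containing "at", whitespace, HEADWORD.
ROW = re.compile(r"[^\s]+[\s]+([a-zA-Z]*at[a-zA-Z]*)[\s]+[^\s]+")

def words_containing_at(rows):
    """Return the WORD value (group 1) of every row whose WORD contains 'at'."""
    hits = []
    for row in rows:
        m = ROW.search(row)
        if m:
            hits.append(m.group(1))
    return hits

rows = ["1 the the", "2 cats cat", "3 sat sit", "4 on on", "5 the the", "6 mats mat"]
print(words_containing_at(rows))
```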