REGEXP_EXTRACT with String Value in Bigquery

I want to extract words from a column whose values look like this: 'p-fr-youtube-car'. Each word should be extracted into its own column.
INPUT:
p-fr-youtube-car
DESIRED OUTPUT:
Country = fr
Channel = youtube
Item = car
I've tried the expression below to extract the first word, but can't figure out the rest. What regex will produce my desired output from this input? And how can I make it case-insensitive, so that fr and FR are treated the same?
REGEXP_EXTRACT_ALL(CampaignName, r"^p-([a-z]*)") AS Country

You can use [^-]+ to match parts between hyphens and only capture what you need to fetch.
To get strings like youtube, you can use
REGEXP_EXTRACT(CampaignName, r'^p-[^-]+-([^-]+)')
To get strings like car, you can use
REGEXP_EXTRACT(CampaignName, r'^p-[^-]+-[^-]+-([^-]+)')
So, [^-]+ matches one or more chars other than - and ([^-]+) is the same pattern wrapped with a capturing group whose contents REGEXP_EXTRACT actually returns as a result.
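The same patterns can be tried outside BigQuery. The following is a minimal Python sketch, not BigQuery code: it assumes Python's re module behaves like RE2 for these simple patterns, and the helper names extract and parse_campaign are made up for illustration. It also shows one way to handle the case-insensitivity part of the question, by matching with IGNORECASE and lower-casing the result.

```python
import re

# Same patterns as in the answer, compiled case-insensitively so that
# a leading "P-" is accepted as well as "p-".
COUNTRY = re.compile(r'^p-([^-]+)', re.IGNORECASE)
CHANNEL = re.compile(r'^p-[^-]+-([^-]+)', re.IGNORECASE)
ITEM = re.compile(r'^p-[^-]+-[^-]+-([^-]+)', re.IGNORECASE)


def extract(pattern, text):
    """Return the contents of the first capturing group, or None if no match."""
    m = pattern.search(text)
    return m.group(1) if m else None


def parse_campaign(name):
    """Hypothetical helper: split a campaign name into its three parts."""
    parts = {
        'Country': extract(COUNTRY, name),
        'Channel': extract(CHANNEL, name),
        'Item': extract(ITEM, name),
    }
    # Lower-case the results so that 'FR' and 'fr' compare equal downstream.
    return {key: value.lower() if value else None for key, value in parts.items()}


print(parse_campaign('p-fr-youtube-car'))
print(parse_campaign('P-FR-YouTube-Car'))
```

In BigQuery itself, a comparable effect should be achievable by wrapping the result in LOWER() or by prefixing the pattern with the RE2 flag (?i); treat that as a suggestion to verify against your own data.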

You can use named groups.
Example Regex:
p-(?P<Country>[a-z]*)\-(?P<Channel>[a-z]*)\-(?P<Item>[a-z]*)$
https://regex101.com/r/fKoBIn/3

Below is for BigQuery Standard SQL
I would recommend use of SPLIT in cases like yours
#standardSQL
SELECT CampaignName,
  parts[SAFE_OFFSET(1)] AS Country,
  parts[SAFE_OFFSET(2)] AS Channel,
  parts[SAFE_OFFSET(3)] AS Item
FROM `project.dataset.table`,
  UNNEST([STRUCT(SPLIT(CampaignName, '-') AS parts)])
Applied to the sample data from your question, the output is:
Row  CampaignName      Country  Channel  Item
1    p-fr-youtube-car  fr       youtube  car
Alternatively, if for some reason you are required to use a regular expression, you can use the query below:
#standardSQL
SELECT CampaignName,
  parts[SAFE_OFFSET(1)] AS Country,
  parts[SAFE_OFFSET(2)] AS Channel,
  parts[SAFE_OFFSET(3)] AS Item
FROM `project.dataset.table`,
  UNNEST([STRUCT(REGEXP_EXTRACT_ALL(CampaignName, r'(?:^|-)([^-]*)') AS parts)])
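To see why the two queries give the same parts array, here is a small Python sketch of both approaches. It is an illustration only: safe_offset is a hypothetical stand-in for BigQuery's SAFE_OFFSET (returning None where BigQuery returns NULL), and Python's re.findall is assumed to behave like REGEXP_EXTRACT_ALL for this pattern.

```python
import re


def safe_offset(parts, index):
    """Mimic SAFE_OFFSET: return None instead of raising when index is out of range."""
    return parts[index] if 0 <= index < len(parts) else None


def split_parts(name):
    """Counterpart of SPLIT(CampaignName, '-')."""
    return name.split('-')


def regex_parts(name):
    """Counterpart of REGEXP_EXTRACT_ALL(CampaignName, r'(?:^|-)([^-]*)')."""
    return re.findall(r'(?:^|-)([^-]*)', name)


def split_campaign(name):
    parts = split_parts(name)
    return {
        'Country': safe_offset(parts, 1),
        'Channel': safe_offset(parts, 2),
        'Item': safe_offset(parts, 3),
    }


print(split_parts('p-fr-youtube-car'))
print(regex_parts('p-fr-youtube-car'))
print(split_campaign('p-fr'))  # missing parts come back as None, not an error
```

The point of the safe lookup is that a short campaign name such as 'p-fr' yields empty values for the missing columns rather than failing the whole query.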

Related

Azure data factory - mapping data flows regex implementation to format a number

I am creating a mapping data flow where I have a phone number column which can contain values like
(555) 555-1234 or
(555)555-1234 or
555555-1234
I want to extract the digits from this value. How can that be done? I have tried the functions below, with different variations, but nothing works.
regexExtract("(555) 555-1234",'\d+)')
regexExtract("(555) 555-1234",'(\d\d\d\d\d\d\d\d\d\d)')

Because you have multiple phone formats, you would need to strip parentheses, spaces, and dashes, which takes several regexExtract statements and complicates the solution.
Instead, I suggest using regexReplace to keep only the digits.
I tried it in ADF and it worked. For the sake of the demo, I added a derived column phoneNumber with the value (555) 555-1234.
In the derived column activity I added a new column 'validPhoneNumber' with a regexReplace expression like so:
regexReplace(phoneNumber,'[^0-9]', '')
You can read about it here: https://learn.microsoft.com/en-us/azure/data-factory/data-flow-expressions-usage#regexReplace
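The replace-everything-but-digits idea can be checked quickly outside ADF. This is a Python sketch, not data flow expression syntax; digits_only is a hypothetical helper name, and the inputs are the three formats from the question.

```python
import re


def digits_only(phone):
    """Remove every character that is not a digit, as regexReplace(..., '[^0-9]', '') does."""
    return re.sub(r'[^0-9]', '', phone)


for raw in ['(555) 555-1234', '(555)555-1234', '555555-1234']:
    print(raw, '->', digits_only(raw))
```

All three formats reduce to the same ten-digit string, which is why a single replace is simpler than one extract per format.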

Extract data from dataset

I need to extract the title from the name, but I cannot understand how the code works. I have provided the code below:
combine = [traindata, testdata]
for dataset in combine:
    # Capture the run of letters that follows a space and precedes a literal dot.
    dataset["title"] = dataset["Name"].str.extract(r' ([A-Za-z]+)\.', expand=False)
There is no error, but I need to understand how the above code works.
Name
Braund, Mr. Owen Harris
Cumings, Mrs. John Bradley (Florence Briggs Thayer)
Heikkinen, Miss. Laina
Futrelle, Mrs. Jacques Heath (Lily May Peel)
Allen, Mr. William Henry
Moran, Mr. James
Above is the Name feature from the CSV file. dataset["title"] stores the title of each name, that is Mr, Miss, Master, etc.

Your code extracts the title from the name using the pandas.Series.str.extract function, which takes a regular expression.
pandas.Series.str.extract - Extract capture groups in the regex pat as columns in a DataFrame.
' ([A-Za-z]+)\.' is the regex pattern in your code. It finds the run of letters in Name that sits between a space and a . character.
The leading space matches a literal space before the title.
[A-Za-z] - matches any single letter in the ranges a-z and A-Z
+ - means one or more of the preceding character class
( ) - the capturing group; only the text matched inside it is returned
\. - matches a literal . immediately after the letters
An example is provided in the documentation for that function, where it extracts parts of a string and puts them in separate columns.
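To watch the pattern work on individual names, here is a small sketch that uses Python's built-in re module instead of pandas, so it runs without a DataFrame. The helper name extract_title is made up for illustration; the names are taken from the question.

```python
import re

# Same pattern as in the question: space, one or more letters (captured), literal dot.
TITLE = re.compile(r' ([A-Za-z]+)\.')


def extract_title(name):
    """Return the captured title from the first match, or None when nothing matches."""
    m = TITLE.search(name)
    return m.group(1) if m else None


names = [
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Heikkinen, Miss. Laina',
]

for name in names:
    print(name, '->', extract_title(name))
```

Note that the space and the dot are part of the match but sit outside the parentheses, so they are not part of the returned value. Where this sketch returns None, str.extract would leave a missing value (NaN) in the column.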

I found this response, together with the link, very helpful for understanding how to use the str.extract method and how changing expand from True to False switches the result from DataFrame columns to a Series.

PostgreSQL - finding string using regular expression

What I am looking to do is to, within Postgres, search a column for a string (an account number). I have a log table, which has a parameters column that takes in parameters from the application. It is a paragraph of text and one of the parameters stored in the column is the account number.
The position of the account number is not consistent in the text and some rows in this table have nothing in the column (since no parameters are passed on certain screens). The account number has the following format: L1234567899. So for the account number, the first character is a letter and then it is followed by ten digits.
I am looking for a way to extract the account number alone from this column so I can use it in a view for a report.
So far what I have tried is getting it into an array, but since the position changes, I cannot count on it being in the same place.
select foo from regexp_split_to_array(
(select param from log_table where id = 9088), E'\\s+') as foo

You can use regexp_match() to achieve that result.
(regexp_match(foo,'[A-Z][0-9]{10}'))[1]

Use substring to pull out the match group.
select substring ('column text' from '[A-Z]\d{10}')
Reference: PostgreSQL regular expression capture group in select
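Both answers rely on the same idea: search the whole text for one letter followed by ten digits, wherever it appears. A Python sketch of that search is below. It is not PostgreSQL code; the parameter text and the helper name find_account are invented for illustration, since the question does not show a real row.

```python
import re

# One uppercase letter followed by exactly ten digits, anywhere in the text.
ACCOUNT = re.compile(r'[A-Z][0-9]{10}')


def find_account(params):
    """Return the first account number found, or None for empty or non-matching text."""
    if not params:
        return None
    m = ACCOUNT.search(params)
    return m.group(0) if m else None


# Hypothetical parameter strings; the account number sits at different positions.
print(find_account('screen=summary user=jdoe account=L1234567899 mode=view'))
print(find_account('L1234567899 was opened from the search screen'))
print(find_account(None))
```

One caveat with the bare pattern: it would also match inside a longer run of characters, for example the first eleven characters of a letter followed by twelve digits. If that can occur in the log text, anchoring the pattern at word boundaries is worth considering.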

Regular expression for words not starting or ending with vowels?

I used this regex: ^[aeiou](\w|\s)*[aeiou]$
for words starting and ending with vowels and it works fine.
But when I use this regex:
^[^aeiou](\w|\s)*[^aeiou]$ for words not starting and ending with vowels, it doesn't work. Can you tell me what is wrong in my 2nd regex?
Words are like:
South Britain
Rives Junction
Larkspur
Southport
Compton
Linden
Sedgwick
Humeston
Siler
Panther Burn

select distinct(city) from station
WHERE regexp_like(city, '^[^aeiou](\w|\s)*$', 'i')
OR regexp_like(city, '^(\w|\s)*[^aeiou]$', 'i');
The question asks for "starts with OR ends with". You can do that inside the regex, but I did it as an OR in the WHERE clause.

If I understand the desired logic and if you happen to want a single regular expression, I would use something like !/(^[aeiou](\w|\s)*)|((\w|\s)*[aeiou]$)/i. This, of course, is not the most readable format, but should grab only those words that start AND end with non-vowels.
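The answers in this thread split into two readings of the question: a name qualifies if either end is a non-vowel (OR), or only if both ends are non-vowels (AND). A short Python sketch makes the difference visible. The function names and the test cities other than Siler are made up for illustration.

```python
import re

STARTS_WITH_VOWEL = re.compile(r'^[aeiou]', re.IGNORECASE)
ENDS_WITH_VOWEL = re.compile(r'[aeiou]$', re.IGNORECASE)


def either_end_non_vowel(city):
    """OR reading: does not start with a vowel, or does not end with one."""
    return not STARTS_WITH_VOWEL.search(city) or not ENDS_WITH_VOWEL.search(city)


def both_ends_non_vowel(city):
    """AND reading: does not start with a vowel and does not end with one."""
    return not STARTS_WITH_VOWEL.search(city) and not ENDS_WITH_VOWEL.search(city)


for city in ['Siler', 'Upton', 'Oslo']:
    print(city, either_end_non_vowel(city), both_ends_non_vowel(city))
```

A name such as Upton (vowel at the start, consonant at the end) passes the OR test but fails the AND test, so it is worth deciding which reading is wanted before choosing among the queries.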

Late to the game but if it’s MySQL this is what works for me:
SELECT DISTINCT CITY FROM STATION
WHERE NOT REGEXP_LIKE(CITY,'[aeiou]$','i')
OR NOT REGEXP_LIKE(CITY, '^[aeiou]','i');

Simple and readable solution
SELECT DISTINCT city
FROM station
WHERE city NOT REGEXP '^[aeiou]' AND city NOT REGEXP '[aeiou]$'

Below is the simplest MS SQL Server command that you can come up with -
SELECT DISTINCT CITY
FROM STATION
WHERE CITY LIKE '[^aeiouAEIOU]%' OR CITY LIKE '%[^aeiouAEIOU]';
It cannot get simpler than this.

In MYSQL I used:
Select distinct City from STATION where City RLike "^[^aeiou].*|[^aeiou]$"

Regex parse with alteryx

One of the columns has the data as below and I only need the suburb name, not the state or postcode.
I'm using Alteryx and tried regex (\<\w+\>)\s\<\w+\> but only get a few records to the new column.
Input:
CABRAMATTA
CANLEY HEIGHTS
ST JOHNS PARK
Parramatta NSW 2150
Claymore 2559
CASULA
Output
CABRAMATTA
CANLEY HEIGHTS
ST JOHNS PARK
Parramatta
Claymore
CASULA

This regex matches all letter-words up to but not including an Australian state abbreviation (since the addresses are clearly Australian):
( ?(?!(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)\b)\b[a-zA-Z]+)+
The negative lookahead includes a word boundary so that suburbs whose names merely start with a state abbreviation are still matched.
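The lookahead pattern can be exercised against the inputs from the question. This is a Python sketch rather than an Alteryx expression; it assumes Python's re engine treats lookaheads and word boundaries the same way for this pattern. The helper name suburb and the last input (WATERLOO NSW 2017) are additions for illustration.

```python
import re

# Repeated group: optional space, then a word of letters that is not a
# standalone state abbreviation.
SUBURB = re.compile(r'( ?(?!(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)\b)\b[a-zA-Z]+)+')


def suburb(address):
    """Return the leading run of letter-only words, stopping before a state or postcode."""
    m = SUBURB.match(address)
    return m.group(0) if m else None


inputs = [
    'CABRAMATTA',
    'ST JOHNS PARK',
    'Parramatta NSW 2150',
    'Claymore 2559',
    'WATERLOO NSW 2017',  # hypothetical: starts with "WA" but is not the state
]

for address in inputs:
    print(repr(address), '->', repr(suburb(address)))
```

The postcode is dropped because the pattern only accepts letters, and the state is dropped because the lookahead rejects it. WATERLOO survives because the \b inside the lookahead requires the abbreviation to be a whole word.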

Expanding on Bohemian's answer, you can use groupings to do a REGEX_Replace in Alteryx. So:
REGEX_Replace([Field1], "(.*)(VIC|NSW|QLD|TAS|SA|WA|ACT|NT)+(\s*\d+)" , "\1")
This keeps whatever matches the first group (so just the suburb). The second and third groups match the state and the postcode. Not a perfect regex, but it should get you most of the way there.
