Split string, extract and add to another column regex BIGQUERY - regex

I have a table with Equipment column containing strings. I want to split string, take a part of it and add this part to a new column (SerialNumber_Asset). Part of the string i want to extract always has the same pattern: A + 7 digits. Example:
Equipment SerialNumber_Asset
1 AXION 920 - A2302888 - BG-ADM-82 -NK A2302888
2 Case IH Puma T4B 220 - BG-AEH-87 - NK null
3 ARION 650 - A7702047 - BG-ADZ-74 - MU A7702047
4 ARION 650 - A7702039 - BG-ADZ-72 - NK A7702039
My code:
select x, y, z,
regexp_extract(Equipment, r'([\A][\d]{7})') as SerialNumber_Asset
FROM `aa.bb.cc`
The message i got:
Cannot parse regular expression: invalid escape sequence: \A
Any suggestions what could be wrong? Thanks

Just use A instead of [\A], check example below:
select regexp_extract('AXION 920 - A2302888 - BG-ADM-82 -NK', r'(A[\d]{7})') as SerialNumber_Asset

Related

Merging two Pandas Dataframes using Regular Expressions

I'm new to Python and Pandas but I try to use Pandas Dataframes to merge two dataframes based on regular expression.
I have one dataframe with some 2 million rows. This table contains data about cars but the model name is often specified in - lets say - a creative way, e.g. 'Audi A100', 'Audi 100', 'Audit 100 Quadro', or just 'A 100'. And the same for other brands. This is stored in a column called "Model". In a second model I have the manufacturer.
Index
Model
Manufacturer
0
A 100
Audi
1
A100 Quadro
Audi
2
Audi A 100
Audi
...
...
...
To clean up the data I created about 1000 regular expressions to search for some key words and stored it in a dataframe called 'regex'. In a second column of this table I save the manufacture. This value is used in a second step to validate the result.
Index
RegEx
Manufacturer
0
.* A100 .*
Audi
1
.* A 100 .*
Audi
2
.* C240 .*
Mercedes
3
.* ID3 .*
Volkswagen
I hope you get the idea.
As far as I understood, the Pandas function "merge()" does not work with regular expressions. Therefore I use a loop to process the list of regular expressions, then use the "match" function to locate matching rows in the car DataFrame and assign the successfully used RegEx and the suggested manufacturer.
I added two additional columns to the cars table 'RegEx' and 'Manufacturer'.
for index, row in regex.iterrows():
cars.loc[cars['Model'].str.match(row['RegEx']),'RegEx'] = row['RegEx']
cars.loc[cars['Model'].str.match(row['RegEx']),'Manufacturer'] = row['Manfacturer']
I learnd 'iterrows' should not be used for performance reasons. It takes 8 minutes to finish the loop, what isn't too bad. However, is there a better way to get it done?
Kind regards
Jiriki
I have no idea if it would be faster (I'll be glad, if you would test it), but it doesn't use iterrows():
regex.groupby(["RegEx", "Manufacturer"])["RegEx"]\
.apply(lambda x: cars.loc[cars['Model'].str.match(x.iloc[0])])
EDIT: Code for reproduction:
cars = pd.DataFrame({"Model": ["A 100", "A100 Quatro", "Audi A 100", "Passat V", "Passat Gruz"],
"Manufacturer": ["Audi", "Audi", "Audi", "VW", "VW"]})
regex = pd.DataFrame({"RegEx": [".*A100.*", ".*A 100.*", ".*Passat.*"],
"Manufacturer": ["Audi", "Audi", "VW"]})
#Output:
# Model Manufacturer
#RegEx Manufacturer
#.*A 100.* Audi 0 A 100 Audi
# 2 Audi A 100 Audi
#.*A100.* Audi 1 A100 Quatro Audi
#.*Passat.* VW 3 Passat V VW
# 4 Passat Gruz VW

Regex to extract shoe size from string column

I have a database with string column product_name which has data like:
Vans Classic Slip-On Black & White Checkerboard/ White - veľkosť (US) : 6 (EUR: 38)
Vans Old Skool - čierna - veľkosť (US) : 9.5 (EUR: 42.5)
I am trying to extract the US size...
SELECT REGEXP_SUBSTR("product_name", ...) AS "size"
...with desired output like this.
size
6
9.5
I have tried this, but to no avail
SELECT REGEXP_SUBSTR("product_name", '(US)(\d+)') AS "size"
I need to agree with B001, this might not be the best way of saving your information. However, if you are sure your strings are going to have this format, you could use this regex
\(US\) ?: ?(\d+\.?\d*) \(EUR: ?(\d+\.?\d*)\)
This will match the US shoe size first and then the EUR one.
Here is a visual explaination of the regex
Please note that this regex will match BOTH sizes, I'm not sure which one you prefer
You can test more cases in this regex101
When working in the web UI I had to double slash my slashes. Thus the following worked as you want.
select REGEXP_SUBSTR(str, '\\(US\\)\\s\\:\\s(\\d+\\.?\\d*)',1,1,'i',1)
from values ('Vans Classic Slip-On Black & White Checkerboard/ White - veľkosť (US) : 6 (EUR: 38)'),
('Vans Old Skool - čierna - veľkosť (US) : 9.5 (EUR: 42.5)') v(str);
gives:
REGEXP_SUBSTR(STR, '\\(US\\)\\S\\:\\S(\\D+\\.?\\D*)',1,1,'I',1)
6
9.5

SQLite extract string from text in column

I have a Spatialite Database and I've imported OSM Data into this database.
With the following query I get all motorways:
SELECT * FROM lines
WHERE other_tags GLOB '*A [0-9]*'
AND highway='motorway'
I use GLOB '*A [0-9]*' here, because in Germany every Autobahn begins with A, followed by a number (like A 73).
There is a column called other_tags with information about the motorway part:
"bdouble"=>"yes","hazmat"=>"designated","lanes"=>"2","maxspeed"=>"none","oneway"=>"yes","ref"=>"A 73","width"=>"7"
If you look closer there is the part "ref"=>"A 73".
I want to extract the A 73 as the name for the motorway.
How can I do this in sqlite?
If the format doesn't change, that means that you can expect that the other_tags field is something like %"ref"=>"A 73","width"=>"7"%, then you can use instr and substr (note that 8 is the length of "ref"=>"):
SELECT substr(other_tags,
instr(other_tags, '"ref"=>"') + 8,
instr(other_tags, '","width"') - 8 - instr(other_tags, '"ref"=>"')) name
FROM lines
WHERE other_tags GLOB '*A [0-9]*'
AND highway='motorway'
The result will be
name
A 73
Check with following condition..
other_tags like A% -- Begin With 'A'.
abs(substr(other_tags, 3,2)) <> 0.0 -- Substring from 3rd character, two character is number.
length(other_tags) = 4 -- length of other_tags is 4
So here is how your query should be:
SELECT *
FROM lines
WHERE other_tags LIKE 'A%'
AND abs(substr(other_tags, 3,2)) <> 0.0
AND length(other_tags) = 4
AND highway = 'motorway'

Extracting ZIP code from the address line

I have a data frame which has address as one of the column, the address can sometimes contain ZIP/PIN code in it and sometimes not.
Data Frame:
BANK ADDRESS
ABU DHABI COMMERCIAL BANK REHMAT MANZIL, V. N. ROAD,CURCHGATE, MUMBAI - 400020
VIJAYA BANK BOKARO CITY JHARKHAND,15/D1 HOTEL BLUE-,DIAMOND COMPLEX,BOKARO CITY,JHARKHAND,JHARKHAND
ALLAHABAD BANK DANKIN GANJ DIST. MIRZAPUR - 231 001 UTTAR PRADESH
How can i extract only ZIP/PIN code with the following information:
1. ZIP/PIN code are 6 digits (INDIAN ZIP/PIN CODE)
2. ZIP are sometimes split by 3 digits, 560 015
3. ZIP are sometimes separated by -, eg: 560-015
Below is my present code:
df$zip <- stri_extract_all_regex(df$ADDRESS, "(?<!\\d)\\d{6}(?!\\d)")
But the above code does not account point 2 and 3 of my logic, that is handle the ZIP split by "" or "-"
But the above code does not account point 2 and 3 of my logic, that is
handle the ZIP split by "" or "-"
m = regexpr("\\<\\d{3}[- ]?\\d{3}\\>", df$ADDRESS)
df$zip = substr(df$ADDRESS, m, m + attr(m, "match.length") - 1)

In R, use regular expression to match multiple patterns and add new column to list

I've found numerous examples of how to match and update an entire list with one pattern and one replacement, but what I am looking for now is a way to do this for multiple patterns and multiple replacements in a single statement or loop.
Example:
> print(recs)
phonenumber amount
1 5345091 200
2 5386052 200
3 5413949 600
4 7420155 700
5 7992284 600
I would like to insert a new column called 'service_provider' with /^5/ as Company1 and /^7/ as Company2.
I can do this with the following two lines of R:
recs$service_provider[grepl("^5", recs$phonenumber)]<-"Company1"
recs$service_provider[grepl("^7", recs$phonenumber)]<-"Company2"
Then I get:
phonenumber amount service_provider
1 5345091 200 Company1
2 5386052 200 Company1
3 5413949 600 Company1
4 7420155 700 Company2
5 7992284 600 Company2
I'd like to provide a list, rather than discrete set of grepl's so it is easier to keep country specific information in one place, and all the programming logic in another.
thisPhoneCompanies<-list(c('^5','Company1'),c('^7','Company2'))
In other languages I would use a for loop on on the Phone Company list
For every row in thisPhoneCompanies
Add service provider to matched entries in recs (such as the grepl statement)
end loop
But I understand that isn't the way to do it in R.
Using stringi :
library(stringi)
recs$service_provider <- stri_replace_all_regex(str = recs$phonenumber,
pattern = c('^5.*','^7.*'),
replacement = c('Company1', 'Company2'),
vectorize_all = FALSE)
recs
# phonenumber amount service_provider
# 1 5345091 200 Company1
# 2 5386052 200 Company1
# 3 5413949 600 Company1
# 4 7420155 700 Company2
# 5 7992284 600 Company2
Thanks to #thelatemail
Looks like if I use a dataframe instead of a list for the phone companies:
phcomp <- data.frame(ph=c(5,7),comp=c("Company1","Company2"))
I can match and add a new column to my list of phone numbers in a single command (using the match function).
recs$service_provider <- phcomp$comp[match(substr(recs$phonenumber,1,1), phcomp$ph)]
Looks like I lose the ability to use regular expressions, but the matching here is very simple, just the first digit of the phone number.