bigquery standard sql = extracting data from strings - regex

I am looking to extract parts of a string that follow specified letters, for example from the below string:
wc:275,nwc:267,c1.3:2,c12.1:25,c12.10:39,c12.12:21,c12.13:4
I am looking to extract 275 (for wc), 2 (for c1.3) and 25 (for c12.1).
I have tried the following but the fields anew, ridanxietycnt and wordcount just show "NULL".
SELECT substr(CAST((DATE) AS STRING),0,8) as daydate,
  count(1) as count,
  avg(CAST(REGEXP_REPLACE(V2Tone, r',.*', "") AS FLOAT64)) tone,
  avg(CAST(REGEXP_EXTRACT(GCAM, r'c1.3:([-d.]+)') AS FLOAT64)) anew,
  sum(CAST(REGEXP_EXTRACT(GCAM, r'c12.1:([-d.]+)') AS FLOAT64)) ridanxietycnt,
  sum(CAST(REGEXP_EXTRACT(GCAM, r'wc:(d+)') AS FLOAT64)) wordcount
FROM `gdelt-bq.gdeltv2.gkg_partitioned`
where _PARTITIONTIME BETWEEN TIMESTAMP('2019-02-02') AND TIMESTAMP('2019-02-02')
group by daydate
I would expect to see the aggregated number for each column.
I wonder if the issue is with the regex expression?

You can treat them as simple key/value pairs and then do the aggregation on top of it. Something like:
select
  substr(CAST((DATE) AS STRING),0,8) as daydate,
  split(x,':')[safe_offset(0)] as key,
  cast(split(x,':')[safe_offset(1)] as float64) as value
from `gdelt-bq.gdeltv2.gkg_partitioned`,
  unnest(split(GCAM, ',')) as x
where _PARTITIONTIME BETWEEN TIMESTAMP('2019-02-02') AND TIMESTAMP('2019-02-02')
Hope this helps.
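The key/value idea above can be sketched outside BigQuery as well. The following is a minimal Python illustration, not the answer's SQL: the function names `parse_gcam` and `sum_by_key` and the second sample record are hypothetical, and unlike the plain `cast` in the query, this sketch silently skips values that are not numeric.

```python
from collections import defaultdict


def parse_gcam(gcam):
    """Split a GCAM-style string into (key, value) pairs."""
    pairs = []
    for item in gcam.split(","):
        key, sep, raw = item.partition(":")
        if not sep:
            continue  # entry has no ':' separator, so there is no value
        try:
            pairs.append((key, float(raw)))
        except ValueError:
            continue  # assumption: non-numeric values are skipped
    return pairs


def sum_by_key(rows, wanted):
    """Sum the values of the wanted keys across all rows."""
    totals = defaultdict(float)
    for gcam in rows:
        for key, value in parse_gcam(gcam):
            if key in wanted:
                totals[key] += value
    return dict(totals)


rows = [
    "wc:275,nwc:267,c1.3:2,c12.1:25,c12.10:39,c12.12:21,c12.13:4",
    "wc:100,c12.1:5",  # hypothetical second record
]
totals = sum_by_key(rows, {"wc", "c1.3", "c12.1"})
print(totals)
```

Because the keys are compared as whole strings, `c12.1` cannot be confused with `c12.10`, and `wc` cannot be confused with `nwc`.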

This should return 2 (anew), 25 (ridanxietycnt) and 275 (wordcount), in that order:
SELECT
SAFE_CAST(REGEXP_EXTRACT('wc:275,nwc:267,c1.3:2,c12.1:25,c12.10:39,c12.12:21,c12.13:4', r'c1.3:(\d+)') as FLOAT64) anew,
SAFE_CAST(REGEXP_EXTRACT('wc:275,nwc:267,c1.3:2,c12.1:25,c12.10:39,c12.12:21,c12.13:4', r'c12.1:(\d+)') as FLOAT64) ridanxietycnt,
SAFE_CAST(REGEXP_EXTRACT('wc:275,nwc:267,c1.3:2,c12.1:25,c12.10:39,c12.12:21,c12.13:4', r'wc:(\d+)') as FLOAT64) wordcount
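To see why the patterns as written in the question return NULL, note that `d` without a backslash is a literal letter, so `wc:(d+)` and `[-d.]+` never match a digit. The Python sketch below illustrates this, and also shows one possible tightening that is not part of the answer: escaping the dots in the key and anchoring the key at the start of an entry so that `wc` cannot match inside `nwc`. The helper name `extract` is hypothetical; a similar pattern should carry over to `REGEXP_EXTRACT`, but that is an assumption to verify.

```python
import re

GCAM = "wc:275,nwc:267,c1.3:2,c12.1:25,c12.10:39,c12.12:21,c12.13:4"


def extract(key, text):
    """Return the numeric value stored under key, or None if absent."""
    # (?:^|,) anchors the key at the start of an entry;
    # re.escape makes the dots in keys such as "c1.3" literal.
    pattern = r"(?:^|,)" + re.escape(key) + r":(-?\d+(?:\.\d+)?)"
    match = re.search(pattern, text)
    return float(match.group(1)) if match else None


# Pattern as written in the question: 'd' is a literal letter, not a digit.
broken = re.search(r"wc:(d+)", GCAM)

print(extract("c1.3", GCAM), extract("c12.1", GCAM), extract("wc", GCAM))
print(broken)
```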

Related

How to merge these two records into one row, removing the null values, in Informatica using a transformation

Input:
Code  value  Min   Max
A     abc    10    null
A     abc    null  20
Output:
Code  value  Min  Max
A     abc    10   20

You can use an Aggregator transformation to remove the nulls and get a single row. This solution is based only on the data you provided.
Use an Aggregator with the following ports:
inout_Code (group by)
inout_value (group by)
in_Min
in_Max
out_Min= MAX(in_Min)
out_Max = MAX(in_Max)
Then connect out_Min, out_Max, code and value to the target.
You will get one record for each combination of code and value, and the null values will be gone.
If a code/value combination has more than two input rows, some of the min and max columns are null, and you want multiple output records, you will need more complex mapping logic. Let me know if this helps. :)
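The grouping logic described above can be sketched in a few lines of Python. This is only an illustration of the idea (group by code and value, then take the maximum of each column while ignoring nulls), not Informatica code; the function name `merge_rows` is hypothetical.

```python
def merge_rows(rows):
    """Collapse rows sharing (code, value) into one, taking MAX of min/max.

    None stands in for NULL and is ignored, which mirrors the behaviour
    the Aggregator answer relies on.
    """
    merged = {}
    for code, value, lo, hi in rows:
        current = merged.setdefault((code, value), [None, None])
        for i, candidate in enumerate((lo, hi)):
            if candidate is not None and (current[i] is None or candidate > current[i]):
                current[i] = candidate
    return [(code, value, lo, hi) for (code, value), (lo, hi) in merged.items()]


rows = [
    ("A", "abc", 10, None),
    ("A", "abc", None, 20),
]
result = merge_rows(rows)
print(result)
```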

Importhtml Query Extract Between String

I'm trying to find a formula that fits two tables
=QUERY(IMPORTHTML(A1,"table", 16), "Select Col4")
output is
Page 1/10Page 2/10Page 3/10Page 4/10Page 5/10Page 6/10Page 7/10Page 8/10Page 9/10Page 10/10
Another:
=QUERY(IMPORTHTML(A2,"table", 16), "Select Col4")
output is
Page 1/3Page 2/3Page 3/3
I want to extract the digits between the space and the "/". Is there a way to do this within the formula itself?
I then tried this:
=transpose(SPLIT(REGEXREPLACE(A2,"Page|/10","~"),"~",0,1))
This doesn't work for both cases, because I have to manually change /10 to /3 in the second formula.
Is there any way to achieve this for both data sets?
The sheet is here

try:
=ARRAYFORMULA(IF(ROW(A1:B)<=(1*{
REGEXEXTRACT(IMPORTXML(A1, "//option[@value='21']"), "\d+$"),
REGEXEXTRACT(IMPORTXML(B1, "//option[@value='21']"), "\d+$")}),
ROW(A1:B), ))
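The formula above takes the total page count from the trailing digits (`\d+$`) and generates the sequence 1..N from row numbers. The regular-expression side of that can be sketched in Python as follows; this is an illustration of the patterns only, not a Sheets formula, and the function names are hypothetical. The first helper answers the question as literally asked (digits between the space and the "/"), so it does not depend on whether the total is 10 or 3.

```python
import re


def page_numbers(text):
    """Return the digits that sit between 'Page ' and '/'."""
    return [int(n) for n in re.findall(r"Page\s+(\d+)/", text)]


def page_count(text):
    """Return the total page count taken from the trailing digits."""
    match = re.search(r"/(\d+)\s*$", text)
    return int(match.group(1)) if match else None


ten = ("Page 1/10Page 2/10Page 3/10Page 4/10Page 5/10"
       "Page 6/10Page 7/10Page 8/10Page 9/10Page 10/10")
three = "Page 1/3Page 2/3Page 3/3"

print(page_numbers(three), page_count(three))
print(page_numbers(ten), page_count(ten))
```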

Is there any way to do format matching within a column in Power BI (something similar to fuzzy matching)?

I have a column look like as below.
DK060
DK705
DK715
dk681
dk724
Dk716
Dk 685 (there is a space after Dk).
This is obviously due to human error. Is there any way to make sure every value follows the specified format, which is the two uppercase letters DK followed by three digits?
Or am I being too ambitious?

Go to the Power Query Editor, open the Advanced Editor and paste these two steps:
#"Uppercase" = Table.TransformColumns(#"Source",{{"Column", Text.Upper, type text}}),
#"Replace Value" = Table.ReplaceValue(#"Uppercase"," ","",Replacer.ReplaceText,{"Column"})
Note: if your previous step is not named "Source", replace #"Source" in the Uppercase step with the name of your previous step.
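The two steps above normalise the values but do not check that the result actually matches the required format. As a rough illustration of the full rule (uppercase, remove spaces, then require DK followed by exactly three digits), here is a small Python sketch; it is not Power Query code, the function names are hypothetical, and the last sample value is invented to show a failing case.

```python
import re

# Two uppercase letters "DK" followed by exactly three digits.
CODE_FORMAT = re.compile(r"DK\d{3}")


def normalize(code):
    """Uppercase the value and remove spaces, like the two steps above."""
    return code.upper().replace(" ", "")


def is_valid(code):
    """True only if the whole value matches the required format."""
    return CODE_FORMAT.fullmatch(code) is not None


raw = ["DK060", "dk681", "Dk716", "Dk 685", "DK71"]  # last value is hypothetical
cleaned = [normalize(code) for code in raw]
invalid = [code for code in cleaned if not is_valid(code)]

print(cleaned)
print(invalid)
```

Values that still fail the check after normalisation cannot be repaired mechanically and would need a manual look.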

How to build a regex in Hive to get string until Nth occurrence of a delimiter

I have some sample data in Hive as
select "abc:def:ghi:jkl" as data
union all
select "jkl:mno:23ar:stu:abc:def:ghi:7345" as data
I want to extract the strings until 3rd colon so that I get the output as
abc:def:ghi
jkl:mno:23ar
I want to keep N as variable so that I can shrink the output text as needed. How do I do this in Hive?

SELECT regexp_replace(`data`, '^([^:]+:[^:]+:[^:]+).*$', "$1")
FROM
( SELECT "abc:def:ghi:jkl" AS `data`
UNION ALL SELECT "jkl:mno:23ar:stu:abc:def:ghi:7345" AS `data`) AS tmp

Using the split and posexplode functions, you can split the string, filter the parts by position and then join them again:
select t.dataId, concat_ws(":", collect_list(t.cell)) as firstN from (
SELECT x.dataId, pos as pos, cell
FROM (
select 1 as dataId, "jkl:mno:23ar:stu:abc:def:ghi:7345" as data
union all
select 2 as dataId, "abc:def:ghi:7345" as data
) x
LATERAL VIEW posexplode(split(x.data,':')) dataTable AS pos, cell
) t
where t.pos<3
group by t.dataId

With variable:
set hivevar:n=3; --variable, you can pass it to the script
with your_table as(
select stack(2,"abc:def:ghi:jkl", "jkl:mno:23ar:stu:abc:def:ghi:7345")as data
)
select regexp_replace(regexp_extract(data,'([^:]*:){1,${hivevar:n}}',0),':$','') from your_table;
Result:
OK
abc:def:ghi
jkl:mno:23ar
Time taken: 0.105 seconds, Fetched: 2 row(s)
After variable substitution, the quantifier {1,${hivevar:n}} becomes {1,3}, which means 1 to 3 repetitions; this also allows values with fewer than 3 elements to be extracted. If you do not want shorter values to be extracted, use the {${hivevar:n}} quantifier instead; in that case, strings with fewer than N elements yield an empty string.
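The difference between the two quantifiers can be tried out with a short Python sketch. It mirrors the regex logic only (Python's `re` rather than Hive's Java regex), the function name `first_n` is hypothetical, and returning an empty string when nothing matches follows the description above.

```python
import re


def first_n(data, n, allow_shorter=True):
    """Return the text up to the n-th colon, without the trailing colon."""
    # {1,n} accepts 1 to n "element:" groups; {n} requires exactly n.
    quantifier = "{1,%d}" % n if allow_shorter else "{%d}" % n
    match = re.search(r"(?:[^:]*:)" + quantifier, data)
    if not match:
        return ""
    return re.sub(r":$", "", match.group(0))  # drop the trailing colon


for sample in ("abc:def:ghi:jkl", "jkl:mno:23ar:stu:abc:def:ghi:7345"):
    print(first_n(sample, 3))

print(repr(first_n("abc:def", 3)))
print(repr(first_n("abc:def", 3, allow_shorter=False)))
```

Note that each repetition must end with a colon, so with the {1,n} form a short input such as "abc:def" yields only the part before its last colon.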

Select substring_index('abc:def:ghi:jkl',':',3) as data
Union all
Select substring_index('jkl:mno:23ar:stu:abc:def:ghi:7345',':',3) as data;
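For readers unfamiliar with substring_index, its behaviour for a positive count can be approximated in Python by splitting on the delimiter and joining the first N parts, which is also the idea behind the split/posexplode answer. This is a sketch under that assumption, not Hive's implementation, and it deliberately rejects non-positive counts.

```python
def substring_index(text, delim, count):
    """Return everything before the count-th occurrence of delim.

    Sketch for positive counts only; if delim occurs fewer than
    count times, the whole string is returned.
    """
    if count < 1:
        raise ValueError("this sketch only handles positive counts")
    return delim.join(text.split(delim)[:count])


n = 3  # the variable N from the question
for sample in ("abc:def:ghi:jkl", "jkl:mno:23ar:stu:abc:def:ghi:7345"):
    print(substring_index(sample, ":", n))
```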

How to find all the source lines containing desired table names from user_source by using 'regexp'

For example, we have a large database containing many Oracle packages, and we want to see where a specific table resides in the source code. The source code is stored in the user_source table and the table we are looking for is called 'company'.
Normally, I would use:
select * from user_source
where upper(text) like '%COMPANY%'
This will return every line containing 'company', such as:
121 company cmy
14 company_id, idx_name %% end of coding
453 ;companyname
1253 from db.company.company_id where
989 using company, idx, db_name,
So how can I make this result more intelligent, using a regular expression to return only the source lines that contain a meaningful table name (that is, a table name as the compiler would see it)?
The matched word may be adjacent to characters such as . ; , '' "" but not _
Can anyone make this work?

To find company as a "whole word" with a regular expression:
SELECT * FROM user_source
WHERE REGEXP_LIKE(text, '(^|\s)company(\s|$)', 'i');
The third argument, 'i', makes the REGEXP_LIKE search case-insensitive.
To ignore the characters . ; , '' "", you can use REGEXP_REPLACE to strip them out of the string before doing the comparison:
SELECT * FROM user_source
WHERE REGEXP_LIKE(REGEXP_REPLACE(text, '[.;,''"]'), '(^|\s)company(\s|$)', 'i');
Addendum: The following query will also help locate table references. It won't give the source line, but it's a start:
SELECT *
FROM user_dependencies
WHERE referenced_name = 'COMPANY'
AND referenced_type = 'TABLE';

If you want to identify the objects that refer to your table, you can get that information from the data dictionary:
select *
from all_dependencies
where referenced_owner = 'DB'
and referenced_name = 'COMPANY'
and referenced_type = 'TABLE';
You can't get the individual line numbers from that, but you can then either look at user_source or use a regexp on the specific source code, which would at least reduce false positives.

SELECT * FROM user_source
WHERE REGEXP_LIKE(text,'([^_a-z0-9])company([^_a-z0-9])','i')
Thanks @Ed Gibbs, with a little trick this modified answer could be more intelligent.
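One caveat with the pattern above is that it requires a character on both sides of the name, so a line that begins or ends with company is missed. The Python sketch below checks the sample lines from the question against that pattern and against a variant that also accepts the start or end of the line. It uses Python's `re` rather than Oracle's REGEXP_LIKE, and the variant is a suggestion, not something taken from the answers.

```python
import re

# Pattern from the answer above: needs a non-identifier character on both sides.
ORIGINAL = re.compile(r"([^_a-z0-9])company([^_a-z0-9])", re.IGNORECASE)

# Variant (an assumption, not from the answers): also accept line start or end.
ANCHORED = re.compile(r"(^|[^_a-z0-9])company($|[^_a-z0-9])", re.IGNORECASE)

lines = [
    "company cmy",
    "company_id, idx_name %% end of coding",
    ";companyname",
    "from db.company.company_id where",
    "using company, idx, db_name,",
]

original_hits = [line for line in lines if ORIGINAL.search(line)]
anchored_hits = [line for line in lines if ANCHORED.search(line)]

print(original_hits)
print(anchored_hits)
```

In this sketch the first sample line is found only by the anchored variant, while company_id and companyname are rejected by both.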
