What is the equivalent of the SQL function REGEXP_EXTRACT in Azure Synapse? - regex

I want to convert code that I was running in Netezza (SQL) to Azure Synapse (T-SQL). I was using the built-in Netezza SQL function REGEXP_EXTRACT, but this function is not built into Azure Synapse.
Here is the code I'm trying to convert:
-- Assume that "column_v1" has datatype CHARACTER VARYING(3) and can take values between 0 and 999, or NULL
SELECT
column_v1
, REGEXP_EXTRACT(column_v1, '[0-9]+') as column_v2
FROM INPUT_TABLE
;
Thanks,
John

The regexExtract() function is supported in Synapse data flows.
To implement it you need to combine a couple of things. Here is a demo I built; I am using the SalesLT.Customer data that Microsoft provides as sample data:
In Synapse -> Integrate tab:
Create a new pipeline.
Add a Data flow activity to your pipeline.
In the Data flow activity, under the Settings tab, create a new data flow.
Double-click the data flow (it should open it) and add a source (it can be blob storage, on-premises files, etc.).
Add a derived column transformation.
In the derived column, add a new column (or override an existing column). In Expression, add this command: regexExtract(Phone,'(\\d{3})'). It will select the first 3 digits. Since my data has dashes in it, it makes more sense to first replace all characters that are not digits using the regexReplace method: regexReplace(Phone,'[^0-9]', ''). A sketch applied to the question's own column follows these steps.
Add a sink.
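Applied to the question's own column (a sketch: column_v1 is the name from the question, not from the SalesLT.Customer demo), the derived-column expression would simply be:
regexExtract(column_v1, '[0-9]+')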
(Screenshots of the data flow activities, the derived column transformation, and the resulting output are omitted here.)
Please check the MS docs:
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-derived-column
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-expression-functions

REGEXP_EXTRACT is not available in T-SQL. Instead, we can get similar functionality using the SUBSTRING/LEFT/RIGHT functions together with the PATINDEX function:
SELECT input = '789A',
       extract = SUBSTRING('789A', PATINDEX('%[0-9][0-9][0-9]%', '789A'), 3);
Result: input = 789A, extract = 789
Refer to the Microsoft documentation for PATINDEX (Transact-SQL) and SUBSTRING (Transact-SQL) for additional information.
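Applied to the question's own table, a minimal sketch (assuming INPUT_TABLE and column_v1 exist as described, and allowing for stray non-digit characters and a variable number of digits) could look like this; the trailing 'X' is only a sentinel so PATINDEX always finds a non-digit:
SELECT
column_v1
-- start at the first digit, take at most 3 characters from there,
-- then keep only the leading digits of that fragment
, CASE WHEN PATINDEX('%[0-9]%', column_v1) > 0
       THEN LEFT(
                SUBSTRING(column_v1, PATINDEX('%[0-9]%', column_v1), 3),
                PATINDEX('%[^0-9]%', SUBSTRING(column_v1, PATINDEX('%[0-9]%', column_v1), 3) + 'X') - 1)
  END AS column_v2
FROM INPUT_TABLE
;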

Related

In BigQuery, why can I still query the geo_census_blockgroups table but cannot find it in the public data?

In BigQuery, I can run this query fine:
select
a.geo_id,
a.total_pop,
a.white_pop,
a.black_pop,
a.hispanic_pop,
a.asian_pop,
a.amerindian_pop,
a.other_race_pop,
a.two_or_more_races_pop,
b.blockgroup_geom
FROM `bigquery-public-data.census_bureau_acs.blockgroup_2010_5yr` a
join `bigquery-public-data.geo_census_blockgroups.us_blockgroups_national` b
using(geo_id)
limit 100
But when I search for the table geo_census_blockgroups, I can't find it.
If I search boundaries, I cannot see block groups or census tracts.
Is geo_census_blockgroups being phased out or why doesn't it appear more easily in public data searches?
The current BigQuery interface which you are using is in Preview.
As a workaround, you can use the legacy BigQuery interface to search for the geo_census_blockgroups dataset in bigquery-public-data by following the steps below.
Go to the BigQuery UI.
Select Disable Editor tabs.
Confirm the opt-out, and the BigQuery UI will switch to the legacy BigQuery interface.
Search for the dataset in the search bar using keywords and the dataset will appear.
Also, for reference, you can check the public issue tracker; any update related to this will be provided there.
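Independently of the UI search, you can confirm that the dataset and its tables still exist by querying INFORMATION_SCHEMA directly (a minimal sketch; the dataset name is taken from the question, and you need permission to run BigQuery jobs in some project):
select table_name
from `bigquery-public-data`.geo_census_blockgroups.INFORMATION_SCHEMA.TABLES
order by table_name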

Google Cloud Data Fusion - Dynamic arguments based on functions

Good morning all,
I'm looking for a way in Google Cloud Data Fusion to make the name of a source file stored on GCS dynamic. The files to be processed are named according to their value date, for example: 2020-12-10_data.csv
My need is to set the filename dynamically so that the pipeline uses the correct file every day (something like this: ${ new Date(). Getfullyear()... }_data.csv).
I managed to use runtime arguments by specifying the date as a string (2020-12-10), but not with a function.
And more generally, is there any documentation on how to enter dynamic parameters with ready-made or custom "functions"? (I couldn't find any.)
Thanks in advance for your help.
There is a ready-made workaround: you can give the "BigQuery Execute" plugin a try.
Steps:
Put the query below in the SQL field:
select cast(current_date as string) ||'_data.csv' as filename
--for output '2020-12-15_data.csv'
Set "Row As Arguments" to 'true'.
Now use the above argument via ${filename} wherever you want.
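For example, the Path property of the GCS source could then reference the argument like this (bucket and folder names here are hypothetical):
gs://your-bucket/incoming/${filename}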

AWS Quicksight : How to use parameter value inside SQL (data set) to render dynamic data on dashboard?

There is a provision to pass values for QuickSight parameters via URL. But how can I use the value of the parameter inside the SQL (data set) to get dynamic data on the dashboard?
For example:
QUERY as of now:
select * from CITYLIST;
Dashboard:
CITYLIST
city_name | cost_of_living
AAAAAAAAA | 20000
BBBBBBBBB | 25000
CCCCCCCCC | 30000
Parameter Created : cityName
URL Triggered : https://aws-------------------/dashboard/abcd123456xyz#p.cityName=AAAAAAAAA
Somehow I need to use the value passed in the URL inside the SQL so that I can write a dynamic query like the one below:
select * from CITYLIST where city_name = SomeHowNeedAccessOfParameterValue;
QuickSight doesn't provide a way to access parameters via SQL directly.
Instead you should create a filter from the parameter to accomplish your use-case.
This is effectively QuickSight's way of creating the WHERE clause for you.
This design decision makes sense to me. Though it takes filtering out of your control in the SQL, it makes your data sets more reusable (what would happen in the SQL if the parameter weren't provided?)
Create a parameter, then a control, and then a filter ("Custom filter" -> "Use parameters").
If you select the Direct query option and a Custom SQL query for the data set, then the SQL query will be executed on each visual change/update.
The final query on DB side will look like [custom SQL query] + WHERE clause.
For example:
Visual side:
For the control Control_1, the selected values are "A", "B", "C";
DB side:
[Custom SQL from data set] + 'WHERE column in ("A", "B", "C")'
Quicksight builds a query for you and runs it on DB side.
This allows reducing the amount of data sent over the network.
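Applied to the CITYLIST example above, if the control bound to cityName passes the value AAAAAAAAA, the effective query QuickSight runs would look roughly like:
select * from CITYLIST where city_name in ('AAAAAAAAA')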
Yes, QuickSight now provides a SQL editor, and you can use it for the same purpose.
For full details, please see the reference below:
https://docs.aws.amazon.com/quicksight/latest/user/adding-a-SQL-query.html

Glue custom classifiers for CSV with non standard delimiter

I am trying to use AWS Glue to crawl a data set and make it available to query in Athena. My data set is a delimited text file using ^ to separate columns. Glue is not able to infer the schema for this data, as the CSV classifier only recognises comma (,), pipe (|), tab (\t), semicolon (;), and Ctrl-A (\u0001). Is there a way of updating this classifier to include non-standard delimiters? The option to build custom classifiers only seems to support Grok, JSON or XML, which are not applicable in this case.
You will need to create a custom classifier using a custom Grok pattern and use that in the crawler. Suppose your data is like below, with four fields:
qwe^123^22.3^2019-09-02
To process the above data, your custom pattern will look like below:
%{NOTSPACE:name}^%{INT:class_num}^%{BASE10NUM:balance}^%{CUSTOMDATE:balance_date}
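Note that NOTSPACE, INT and BASE10NUM are built-in Grok patterns, but CUSTOMDATE is not; it has to be declared in the classifier's Custom patterns box. A minimal sketch of such a definition, assuming the date is always formatted as yyyy-MM-dd:
CUSTOMDATE %{YEAR}-%{MONTHNUM}-%{MONTHDAY}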
Please let me know if that worked for you.

Does Spark-SQL supports Hive Select All Query with Except Columns using regex specification

I am trying to achieve this functionality using Spark SQL through a pyspark wrapper. I have run into this error:
pyspark.sql.utils.AnalysisException: u"cannot resolve '```(qtr)?+.+```'
given input columns:
This is my query; basically I am trying to exclude the column 'qtr'.
select `(qtr)?+.+` from project.table;
It works perfectly fine in Hive/Beeline using the below property:
set hive.support.quoted.identifiers=none;
Any help is appreciated.
Spark allows a regex as a column name in a SELECT expression. By default this behavior is disabled; to enable it, we need to set the property below to true before running a query with regex columns.
spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")
After setting this property, we can use a regex in the select expression as below.
spark.sql("SELECT `(.*time.*)+.+` FROM test.orders LIMIT 2").show(truncate=False)
Note: it accepts any valid Java regex. I have tested this solution in Spark 2.3.
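Applied to the question's own query (a sketch, assuming the pyspark wrapper exposes a SparkSession named spark and that the table project.table exists), excluding the qtr column would then look roughly like this:
spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")
# every column except qtr, same regex as in the Hive query
df = spark.sql("SELECT `(qtr)?+.+` FROM project.table")
df.show(truncate=False)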