Use function from future presto version - amazon-athena

binomial_cdf(numberOfTrials, successProbability, value) → double is available in the current version of Presto, but not in version 0.217
Is there any way I can copy the implementation of this function into my sql code?
Background:
current https://prestodb.io/docs/current/functions/math.html
0.217 https://prestodb.io/docs/0.217/functions/math.html
Athena engine version 2 is based on Presto 0.217.
see aws-docs

Athena flavour of Presto does not list CREATE FUNCTION as supported DDL operation. Though you can create and use a user defined function (UDF) referencing some lambda.

Related

What is the equivalent of the SQL function REGEXP_EXTRACT in Azure Synapse?

I want to convert my code that I was running in Netezza (SQL) to Azure Synapse (T-SQL). I was using the built-in Netezza SQL function REGEXP_EXTRACT but this function is not built-in Azure Synapse.
Here's the code I'm trying to convert
-- Assume that "column_v1" has datatype Character Varying(3) and can take value between 0 to 999 or NULL
SELECT
column_v1
, REGEXP_EXTRACT(column_v1, '[0-9]+') as column_v2
FROM INPUT_TABLE
;
Thanks,
John
regexExtract() function is supported in Synapse.
In order to implement it, you need to use couple of things, here is a demo that i built, here im using the SalesLT.Customer data that is supported as a sample data in microsoft:
In Synapse -> Integrate tab:
Create new pipeline
Add dataflow activity to your pipline
In dataflow activity: under settings tab -> create new data flow
double click on the dataflow (it should open it) Add source (it can be blob storage / files on prem etc.)
add a derived column transformation
in derived column add new column (or override an existing column) in Expression: add this command regexExtract(Phone,'(\\d{3})') it will select the 3 first digits, since my data has dashes in it, its makes more sense to replace all characters that are not digits using regexReplace method: regexReplace(Phone,'[^0-9]', '')
add sink
DataFlow activities:
derived column transformation:
Output:
please check MS docs:
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-derived-column
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-expression-functions
Regex_extract is not available in T-SQL. Thus, we try to do similar functionalities using Substring/left/right functions along with Patindex function
SELECT input='789A',
extract= SUBSTRING('789A', PATINDEX('[0-9][0-9][0-9]', '789A'),4);
Result
Refer Microsoft documents patindex (T-SQL), substring (T-SQL) for additional information.

What is LATEST_ON syntax in QuestDB?

I'm using QuestDB and SQL for the first time, and I stumbled upon the LATEST_ON syntax used in QuestDB. Can someone explain it's usage and where to use it?
Quoted from the docs:
For scenarios where multiple time series are stored in the same table, it is relatively difficult to identify the latest items of these time series with standard SQL syntax. QuestDB introduces LATEST ON clause for a SELECT statement to remove boilerplate clutter and splice the table with relative ease.
For more information visit the official documentation
LATEST ON is to find the latest record for each unique time series in a table. See this page for some examples: https://questdb.io/docs/reference/sql/latest-on/
It gives you the latest available record for each combination of the PARTITION BY values, according to the ON timestamp
Maybe easier to understand with an example. If you go to https://demo.questdb.io you can execute this query
select * from trades latest on timestamp
partition by symbol, side
It will then show you the latest existing row for each combination of Symbol and Side. If you wanted to do this using standard SQL, you would probably have to use a window function, something like this
select * from
(select *
,ROW_NUMBER() over (partition by Symbol, Side
order by timestamp DESC) AS RowNumber
from trades where timestamp > '2022-10-01') t
where t.RowNumber = 0
Latest on retrieves the latest entry by timestamp for a given key or combination of keys, for scenarios where multiple time series are stored in the same table.
Check this link for some examples: https://questdb.io/docs/reference/sql/latest-on/

Why there are so many duplicate SQL functions in Redshift?

In Amazon Redshift, I found that there are many SQL functions with different names but does exactly the same thing. The Redshift document even mentions those as synonym functions.
For eg: STRPOS Function has two other synonym function - CHARINDEX Function and POSITION Function. All three do the exact same thing - Return the position of a substring within a specified string.
What is the reason for having three functions for exact same task? Is there any performance difference among these?
Possibly to make it more compatible with other forms of SQL.
For example, CHARINDEX is a command from Microsoft SQL Server, whereas POSITION is a command from PostgreSQL (upon which Amazon Redshift is based).
Given that Redshift is from Amazon, I would guess that the answer is "because customers asked for it!"

In Redshift, how do you combine CTAS with the "if not exists" clause?

I'm having some trouble getting this table creation query to work, and I'm wondering if I'm running in to a limitation in redshift.
Here's what I want to do:
I have data that I need to move between schema, and I need to create the destination tables for the data on the fly, but only if they don't already exist.
Here are queries that I know work:
create table if not exists temp_table (id bigint);
This creates a table if it doesn't already exist, and it works just fine.
create table temp_2 as select * from temp_table where 1=2;
So that creates an empty table with the same structure as the previous one. That also works fine.
However, when I do this query:
create table if not exists temp_2 as select * from temp_table where 1=2;
Redshift chokes and says there is an error near as (for the record, I did try removing "as" and then it says there is an error near select)
I couldn't find anything in the redshift docs, and at this point I'm just guessing as to how to fix this. Is this something I just can't do in redshift?
I should mention that I absolutely can separate out the queries that selectively create the table and populate it with data, and I probably will end up doing that. I was mostly just curious if anyone could tell me what's wrong with that query.
EDIT:
I do not believe this is a duplicate. The post linked to offers a number of solutions that rely on user defined functions...redshift doesn't support UDF's. They did recently implement a python based UDF system, but my understanding is that its in beta, and we don't know how to implement it anyway.
Thanks for looking, though.
I couldn't find anything in the redshift docs, and at this point I'm
just guessing as to how to fix this. Is this something I just can't do
in redshift?
Indeed this combination of CREATE TABLE ... AS SELECT AND IF NOT EXISTS is not possible in Redshift (per documentation). Concerning PostgreSQL, it's possible since version 9.5.
On SO, this is discussed here: PostgreSQL: Create table if not exists AS . The accepted answer provides options that don't require any UDF or procedural code, so they're likely to work with Redshift too.

Does Alia support insert-batch

Cassaforte has an insert-batch function for inserting multiple rows into a cassandra CQL
table in one go.
I've recently switched to Alia and I'm wondering if it offers the same? I can't see anything immediately in the documentation, and (hayt/values ..) seems to only support a single row insertion at one time.
Alia supports CQL batch inserts through the Hayt DSL.
(alia/execute
session
(hayt/batch
(hayt/queries
(hayt/insert ...)
(hayt/insert ...)
(hayt/insert ...))
As per the CQL spec, only DML statements are supported:
http://www.datastax.com/documentation/cql/3.0/cql/cql_reference/batch_r.html