Hash Array Column AWS Athena - amazon-athena

I am working with data without any primary keys. I am trying to hash the unique columns in order to create a surrogate key, however I'm running into an issue as the data contains arrays. So I do want to keep the data in arrays, because if I change it to just a blob of text I lose unnesting. Ultimately, I need to move the rows into columns and in order to do that, I need the unique key to join back to.
I have tried
SELECT md5(to_utf8(array_column)) from my_table;
I have also tried to cast the column as a varchar:
SELECT CAST(array_column as VARCHAR) from my_table
I keep getting results that complain about the type:
Unexpected parameters (array(row(**remaining data definitions))

You can create a unique identifier using the uuid function. For example:
select uuid(), ... from mytable
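Note that uuid() generates a random value on every call, so it is not reproducible across runs. If you need a deterministic surrogate key derived from the array contents, one common workaround in Athena/Presto is to serialize the array to a JSON string before hashing, since complex types cannot be cast to VARCHAR directly. A sketch, reusing the column and table names from the question:

```sql
-- Serialize the array to JSON text, then hash the UTF-8 bytes.
-- json_format(CAST(... AS JSON)) handles array(row(...)) types that
-- CAST(... AS VARCHAR) rejects.
SELECT
  md5(to_utf8(json_format(CAST(array_column AS JSON)))) AS surrogate_key,
  array_column  -- the array itself stays intact for later UNNEST
FROM my_table;
```

Because JSON serialization preserves element order, identical arrays will always produce the same hash, which makes the key safe to join back on.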

Related

Big Query - Convert an int column into float

I would like to convert a column called lauder from int to float in Big Query. My table is called historical. I have been able to use this SQL query
SELECT *, CAST(lauder as float64) as temp
FROM sandbox.dailydev.historical
The query works but the changes are not saved into the table. What should I do?
If you use SELECT * you will scan the whole table, and the cost will reflect that. If the table is small this shouldn't be a problem, but if it is big enough for cost to be a concern - below is another approach:
apply ALTER TABLE ADD COLUMN to add new column of needed data type
apply UPDATE for new column
UPDATE table
SET new_column = CAST(old_column as float64)
WHERE true
Do you want to save the results in a temporary table to use later?
You can save them to a temporary table like below and then refer to "temp":
with temp as
( SELECT *, CAST(lauder as float64) as lauder_float
FROM sandbox.dailydev.historical)
You cannot change a column's data type in a table:
https://cloud.google.com/bigquery/docs/manually-changing-schemas#changing_a_columns_data_type
What you can do is either:
Create a view to sit on top and handle the data type conversion
Create a new column and set the data type to float64 and insert values into it
Overwrite the table
Options 2 and 3 are outlined well including pros and cons in the link I shared above.
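Option 1 can be sketched roughly as follows, reusing the table and column names from the question (the view name is hypothetical):

```sql
-- A view that performs the conversion on read; the base table is untouched.
-- SELECT * EXCEPT(...) drops the original int column so the cast can
-- reuse its name.
CREATE VIEW sandbox.dailydev.historical_v AS
SELECT * EXCEPT (lauder),
       CAST(lauder AS FLOAT64) AS lauder
FROM sandbox.dailydev.historical;
```

Queries against the view see lauder as FLOAT64, at the cost of performing the cast on every read.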
Your statement is correct, but table columns in Big Query are immutable. You need to run your query and save the results to a new table with the modified column.
Click "More" > "Query settings", and in "Destination" select "Set a destination table for query results" and fill the table name. You can even select if you want to overwrite the existing table with generated one.
After these settings are set, just "Run" your query as usual.
You can use CREATE or REPLACE TABLE to write the structural changes along with data into the same table:
CREATE OR REPLACE TABLE sandbox.dailydev.historical
AS SELECT *, CAST(lauder as float64) as temp FROM sandbox.dailydev.historical;
In this example, the historical table will be restructured with an additional column, temp.
In some cases you can change column types:
CREATE TABLE mydataset.mytable(c1 INT64);
ALTER TABLE mydataset.mytable ALTER COLUMN c1 SET DATA TYPE NUMERIC;
Check the conversion rules and the Google docs.

How to create table based on minimum date from other table in DAX?

I want to create a second table from the first table using filters with dates and other variables as follows. How can I create this?
Following is the expected table and original table,
Go to Edit Queries. Let's say our base table is named RawData. Add a blank query and use this expression to copy your RawData table:
=RawData
The new table will be RawDataGrouped. Now select the new table and go to Home > Group By and use the following settings:
The result will be the following table. Note that I didn't use exactly the values you used, to keep this sample to a minimum of effort:
You can now also create a relationship between these two tables (by the Index column) to use cross filtering between them.
You could show the grouped data and use the relationship to display the RawData in a subreport (or custom tooltip), for example.
I assume you are looking for a calculated table. Below is a workaround:
In the Query Editor you can create a duplicate of the existing (original) table and select Date Filter -> Is Earliest by clicking the corner of the Date column in the new duplicate table. Your table should then contain only the rows that have the minimum date in that column.
Note: This table is dynamic and will give subsequent results based on data changes in the original table, but you have to refresh both tables.
Original Table:
Desired Table:
When I added a new column to it, after refreshing the dataset I got the result below (this implies it recalculates based on each data change in the original source):
New data entry:
Output:

Performance of views on redshift

I have a cluster that I use, among other things, for reporting via PowerBi. For this I created views to show only the required fields so the queries run faster. If the source table is sorted by date and the view is 'select fields from table;' will it use the date if I query the view using WHERE on that field?
Any recommendations? For better performance!
Thank you!
For better performance in Redshift, it is absolutely important to set the SortKey, DistributionKey, and Encoding properly. I guess you want to generate date-wise reports. In that case, the "date" column should be the distribution key. Do not encode the "date" column, i.e. keep its ENCODING as RAW / NONE.
Then, you can use the "date" column as a COMPOUND sort key. If you have any other column you want to filter with then use that column as the first key and the "date" column as the second key in the SORT key order. Otherwise, you can define the SORT key only using the "date" column.
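The advice above can be sketched as DDL; the table and column names here are hypothetical, and event_date stands in for the "date" column:

```sql
-- Hypothetical reporting table: distribute on the date column and use it
-- in the compound sort key, keeping it unencoded (RAW) as recommended above.
CREATE TABLE report_events (
    event_date  DATE          ENCODE RAW,
    customer_id INT           ENCODE az64,
    amount      DECIMAL(12,2) ENCODE az64
)
DISTKEY (event_date)
COMPOUND SORTKEY (event_date);
```

If you commonly filter on another column first, list that column before event_date in the COMPOUND SORTKEY clause instead.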

Prevent duplicates on insert with billions of rows in SQL Data Warehouse?

I'm trying to determine if there's a practical way to prevent duplicate rows from being inserted into a table using Azure SQL DW when the table already holds billions of rows (say 20 billion).
The root cause of needing this is that the source of the data is a third party that sends over supposedly unique data, but sometimes sends duplicates which have no identifying key. I unfortunately have no idea if we've already received the data they're sending.
What I've tried is to create a table that contains a row hash column (pre-calculated from several other columns) and distribute the data based on that row hash. For example:
CREATE TABLE [SomeFact]
(
Row_key BIGINT NOT NULL IDENTITY,
EventDate DATETIME NOT NULL,
EmailAddress NVARCHAR(200) NOT NULL,
-- other columns
RowHash BINARY(16) NOT NULL
)
WITH
(
DISTRIBUTION = HASH(RowHash)
)
The insert SQL is approximately:
INSERT INTO [SomeFact]
(
EmailAddress,
EventDate,
-- Other columns
RowHash
)
SELECT
temp.EmailAddress,
temp.EventDate,
-- Other columns
temp.RowHash
FROM #StagingTable temp
WHERE NOT EXISTS (SELECT 1 FROM [SomeFact] f WHERE f.RowHash = temp.RowHash);
Unfortunately, this is just too slow. I added some statistics and even created a secondary index on RowHash and inserts of any real size (10 million rows, for example) won't run successfully without erroring due to transaction sizes. I've also tried batches of 50,000 and those too are simply too slow.
Two things I can think of that would avoid the singleton lookups you have in your query:
Outer join your staging table with the fact table and filter on NULL values. Assuming you're using a clustered columnstore index on your fact table, this should be much less expensive than the above.
Do a CTAS with a Select Distinct from the existing fact table, and a Select Distinct from the staging table joined together with a UNION.
My gut says the first option will be faster, but you'll probably want to look at the query plan and test both approaches.
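The first option (an anti-join) might look like this, reusing the tables and columns from the question:

```sql
-- Anti-join: keep only staging rows whose RowHash has no match in the
-- fact table, then insert them in one set-based operation.
INSERT INTO [SomeFact] (EmailAddress, EventDate, RowHash)
SELECT temp.EmailAddress, temp.EventDate, temp.RowHash
FROM #StagingTable temp
LEFT JOIN [SomeFact] f
    ON f.RowHash = temp.RowHash
WHERE f.RowHash IS NULL;  -- no match found => the row is new
```

Since both tables are distributed on RowHash, the join should be distribution-aligned and avoid data movement.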
Can you partition the 'main' table by EventDate? Assuming new data has a recent EventDate, CTAS out only the partitions that include the EventDates of the new data, then merge the data, either with a CTAS over a UNION of the 'old' and 'new' data into a table with the same partition scheme (UNION will remove the duplicates), or with the INSERT method you developed run against the smaller table. Finally, swap the partition(s) back into the 'main' table.
Note - There is a new option on the partition swap command that allows you to directly 'swap in' a partition in one step: "WITH (TRUNCATE_TARGET = ON)".
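A hedged sketch of that partition-level merge; the partition boundaries, partition number, and table names here are hypothetical, and only the columns from the question are shown:

```sql
-- 1. Rebuild one affected partition together with the new rows.
--    UNION (not UNION ALL) removes duplicate rows.
CREATE TABLE dbo.SomeFact_Merged
WITH
(
    DISTRIBUTION = HASH(RowHash),
    PARTITION ( EventDate RANGE RIGHT FOR VALUES ('2019-01-01', '2019-02-01') )
)
AS
SELECT EmailAddress, EventDate, RowHash
FROM dbo.SomeFact
WHERE EventDate >= '2019-01-01' AND EventDate < '2019-02-01'
UNION
SELECT EmailAddress, EventDate, RowHash
FROM #StagingTable;

-- 2. Swap the rebuilt partition back into the main table in one step.
ALTER TABLE dbo.SomeFact_Merged SWITCH PARTITION 2 TO dbo.SomeFact PARTITION 2
WITH (TRUNCATE_TARGET = ON);
```

For the switch to succeed, the two tables must share the same partition scheme and distribution on the switched range.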

Redshift sequential scan on columns defined as sort keys

Referring to the attached image of the query plan of the query I'm executing and created_time is Interleaved sort key which is used as a range filter on the data.
Though it looks like a seq scan is happening on the table, the rows-scanned column in the image is empty - does that mean there is no scan happening and the sort keys work?
Even though created_time is your sort key, in this query it is not used because you are converting it to a date, so the entire table is scanned.
You need to leave the column unmodified in the predicate for the planner to use the sort key.
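In other words, move the transformation to the constant side of the predicate so the column itself is compared raw. A sketch, with a hypothetical table name standing in for yours:

```sql
-- Likely full scan: the sort key is wrapped in a function,
-- so Redshift cannot use zone maps to skip blocks.
SELECT count(*) FROM events
WHERE DATE(created_time) = '2019-06-01';

-- Sort-key friendly: created_time is compared unmodified, so the
-- range restriction lets Redshift skip blocks outside the range.
SELECT count(*) FROM events
WHERE created_time >= '2019-06-01'
  AND created_time <  '2019-06-02';
```

The half-open range covers the same day while keeping the sort-key column untouched.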