How to convert a VARCHAR column to a Timestamp in a table - amazon-athena

I am trying to convert a VARCHAR column from a table into a TIMESTAMP.
Currently my column looks like this:
2020-05-31T05:00:21Z
My goal is to transform this column into a timestamp so I can use it for a visualization (Big Number with Trendline).
When I leave the column as it is, I get this error:
awsathena error: INVALID_CAST_ARGUMENT: Value cannot be cast to timestamp: 2020-07-07T18:56:56Z
When I change the type of the column to TIMESTAMP, I get this other error:
awsathena error: SYNTAX_ERROR: line 4:19: '>=' cannot be applied to varchar, timestamp with time zone
This is the query from the visualization:
SELECT date_trunc('day', CAST("eventtime" AS TIMESTAMP)) AS "__timestamp",
COUNT(*) AS "count"
FROM "cloudtrail_logs_cloud_trail_elk"
WHERE "eventtime" >= '2020-07-01 00:00:00.000000'
AND "eventtime" < '2020-07-08 00:00:00.000000'
GROUP BY date_trunc('day', CAST("eventtime" AS TIMESTAMP))
ORDER BY "count" DESC
LIMIT 50000;
Is there any way of changing the type of this column so I can use it for my visualization?

In case someone runs into the same problem as I did, you can use this function:
from_iso8601_timestamp(eventtime)
in Superset, under: Superset -> Edit Table -> Edit Column -> Expression
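For reference, here is a sketch (not tested against the actual table) of what the visualization query itself could look like if you keep the column as VARCHAR and parse the ISO 8601 value directly in Athena; the table and column names are taken from the question:
SELECT date_trunc('day', CAST(from_iso8601_timestamp("eventtime") AS TIMESTAMP)) AS "__timestamp",
COUNT(*) AS "count"
FROM "cloudtrail_logs_cloud_trail_elk"
-- parse the VARCHAR value before comparing, so both sides of >= and < are timestamps
WHERE CAST(from_iso8601_timestamp("eventtime") AS TIMESTAMP) >= TIMESTAMP '2020-07-01 00:00:00'
AND CAST(from_iso8601_timestamp("eventtime") AS TIMESTAMP) < TIMESTAMP '2020-07-08 00:00:00'
GROUP BY date_trunc('day', CAST(from_iso8601_timestamp("eventtime") AS TIMESTAMP))
ORDER BY "count" DESC
LIMIT 50000;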

One possibility is to do the casting on the datetime values themselves, i.e. compare against TIMESTAMP literals:
SELECT date_trunc('day', CAST("eventtime" AS TIMESTAMP)) AS "__timestamp",
COUNT(*) AS "count"
FROM "cloudtrail_logs_cloud_trail_elk"
WHERE "eventtime" >= TIMESTAMP '2020-07-01 00:00:00'
AND "eventtime" < TIMESTAMP '2020-07-08 00:00:00'
GROUP BY date_trunc('day', CAST("eventtime" AS TIMESTAMP))
ORDER BY "count" DESC
LIMIT 50000;
or using BETWEEN instead of >= and < (note that BETWEEN is inclusive on both ends, so midnight of 2020-07-08 itself is also matched):
SELECT date_trunc('day', CAST("eventtime" AS TIMESTAMP)) AS "__timestamp",
COUNT(*) AS "count"
FROM "cloudtrail_logs_cloud_trail_elk"
WHERE "eventtime" between TIMESTAMP '2020-07-01 00:00:00'
AND TIMESTAMP '2020-07-08 00:00:00'
GROUP BY date_trunc('day', CAST("eventtime" AS TIMESTAMP))
ORDER BY "count" DESC
LIMIT 50000;

Related

Cannot find running sum of a blended chart in data studio

I'm trying to create a chart for the following BigQuery query using Data Studio. Instead of auto-generating the chart from GCP, I'm trying to build the chart using the tools in Data Studio.
SELECT t.timestamp, sum(t.introduced_violation)
OVER(
PARTITION BY t.introduced_user_id
ORDER BY t.timestamp desc
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
as cumulative_introduced_violation,
sum(t.fixed_violation)
OVER(
PARTITION BY t.introduced_user_id
ORDER BY t.timestamp desc
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
as cumulative_fixed_violation
FROM (SELECT SUM(CASE
WHEN is_fixed = 1 THEN 1
ELSE 0
END) AS fixed_violation,
SUM(1) AS introduced_violation,
timestamp, introduced_user_id
FROM `project_id.violation.table_name`
where introduced_user_id = 'username#company.com'
and timestamp >=1622556834000
and timestamp <=1631231999999
group by timestamp, introduced_user_id
order by timestamp desc) as t;
Expected output from the query:
At first, I tried to create a chart for the inner query (below). I succeeded at this step by creating two charts and blending them together.
SELECT SUM(CASE
WHEN is_fixed = 1 THEN 1
ELSE 0
END) AS fixed_violation,
SUM(1) AS introduced_violation,
timestamp, introduced_user_id
FROM `project_id.violation.table_name`
where introduced_user_id = 'username#company.com'
and timestamp >=1622556834000
and timestamp <=1631231999999
group by timestamp, introduced_user_id
order by timestamp desc;
Expected output from inner query:
In the query output, the introduced_violation and fixed_violation values are running-sum values.
Is there a way to compute the running sum of the introduced_violation and fixed_violation columns in the blended charts, or some other way to achieve the whole scenario?

Getting table names and row counts for all tables in an athena database

I have an AWS database with multiple tables that I am trying to get the row counts for in a single query.
The ideal query output would be:
table_name row_count
table2_name row_count
etc...
So far I've been able to either get all the table names from the database or all the row counts of the tables (in random order), but not both in the same query.
This query returns a column of all the table names that exist in the database:
SELECT table_name FROM information_schema.tables WHERE table_schema = '<database_name>';
This query returns all the row counts for the tables:
SELECT COUNT(*) FROM table_name
UNION ALL
SELECT COUNT(*) FROM table2_name
UNION ALL
etc..for the rest of the tables
The issue with this query is that it displays the row counts in a random order that doesn't correspond to the order of the tables in the query, so I don't know which row count goes with which table - hence why I need both the table names and row counts.
Simply add the names of the tables as literals in your queries:
SELECT 'table_name' AS table_name, COUNT(*) AS row_count FROM table_name
UNION ALL
SELECT 'table_name2' AS table_name, COUNT(*) AS row_count FROM table_name2
UNION ALL
…
The following query generates the UNION query that produces counts of all records.
The problem to solve is that (as of December 2022) INFORMATION_SCHEMA.TABLES incorrectly defines every table and view as a BASE TABLE, so you will need some logic to eliminate the views.
In data warehousing it is common practice to record snapshots of the record counts of landing tables at frequent intervals; any unexpected deviations from expected counts can be used for reporting/alerting.
WITH Table_List AS (
SELECT table_schema,table_name, CONCAT('SELECT CURRENT_DATE AS run_date, ''',table_name, ''' AS table_name, COUNT(*) AS Records FROM "',table_schema,'"."', table_name, '"') AS BaseSQL
FROM INFORMATION_SCHEMA.TABLES
WHERE
table_schema = 'YOUR_DB_NAME' -- Change this
AND table_name LIKE 'YOUR TABLE PATTERN%' -- Change or remove this line
)
, Total_Records AS (
SELECT COUNT(*) AS Table_Count
FROM Table_List
)
SELECT
CASE WHEN ROW_NUMBER() OVER (ORDER BY table_name) = Table_Count
THEN BaseSQL
ELSE CONCAT(BaseSql, ' UNION ALL') END AS All_Table_Record_count_SQL
FROM Table_List CROSS JOIN Total_Records
ORDER BY table_name;
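For illustration, assuming the database is named my_db and contains just two tables, t1 and t2 (hypothetical names), the All_Table_Record_count_SQL column produced above would contain something like:
SELECT CURRENT_DATE AS run_date, 't1' AS table_name, COUNT(*) AS Records FROM "my_db"."t1" UNION ALL
SELECT CURRENT_DATE AS run_date, 't2' AS table_name, COUNT(*) AS Records FROM "my_db"."t2"
You then run that generated text as a single query to get one (run_date, table_name, Records) row per table.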

How to find missing dates in BigQuery table using sql

How do I get a list of missing dates from a BigQuery table? For example, a table (test_table) is populated every day by some job, but on a few days the job fails and data isn't written into the table.
Use Case:
We have a table (test_table) which is populated every day by some job (a scheduled query or Cloud Function). Sometimes those jobs fail and data isn't available for those particular dates in my table.
How can I find those dates rather than scrolling through thousands of rows?
The query below will return a list of dates and ad_ids where data wasn't uploaded (null).
Note: I have used MAX(DATE) as I knew dates were missing in between my boundary dates. To be on the safe side you can also specify the starting_date and ending_date explicitly, in case data hasn't been populated in the last few days at all.
WITH Date_Range AS
-- anchor for date range
(
SELECT MIN(DATE) as starting_date,
MAX(DATE) AS ending_date
FROM `project_name.dataset_name.test_table`
),
day_series AS
-- anchor to get all the dates within the range
(
SELECT *
FROM Date_Range
,UNNEST(GENERATE_TIMESTAMP_ARRAY(starting_date, ending_date, INTERVAL 1 DAY)) AS days
-- other options depending on your date type ( mine was timestamp)
-- GENERATE_DATETIME_ARRAY or GENERATE_DATE_ARRAY
)
SELECT
day_series.days,
original_table.ad_id
FROM day_series
-- do a left join on the source table
LEFT JOIN `project_name.dataset_name.test_table` AS original_table ON (original_table.date)= day_series.days
-- I only want the records where data is not available or in other words empty/missing
WHERE original_table.ad_id IS NULL
GROUP BY 1,2
ORDER BY 1
Final output will look like below:
An alternate solution: you can try the following query to get the desired output:
with t as (select 1 as id, cast ('2020-12-25' as timestamp) Days union all
select 1 as id, cast ('2020-12-26' as timestamp) Days union all
select 1 as id, cast ('2020-12-27' as timestamp) Days union all
select 1 as id, cast ('2020-12-31' as timestamp) Days union all
select 1 as id, cast ('2021-01-01' as timestamp) Days union all
select 1 as id, cast ('2021-01-04' as timestamp) Days)
SELECT *
FROM (
select TIMESTAMP_ADD(Days, INTERVAL 1 DAY) AS Days, TIMESTAMP_SUB(next_days, INTERVAL 1 DAY) AS next_days from (
select t.Days,
(case when lag(Days) over (partition by id order by Days) = Days
then NULL
when lag(Days) over (partition by id order by Days) is null
then Null
else Lead(Days) over (partition by id order by Days)
end) as next_days
from t) where next_days is not null
and Days <> TIMESTAMP_SUB(next_days, INTERVAL 1 DAY)),
UNNEST(GENERATE_TIMESTAMP_ARRAY(Days, next_days, INTERVAL 1 DAY)) AS days
Output will be the missing dates in the generated days column: 2020-12-28, 2020-12-29, 2020-12-30, 2021-01-02 and 2021-01-03.
I used the code above but had to restructure it for BigQuery:
-- anchor for date range - this will select dates from the source table (i.e. the table your query runs off of)
WITH day_series AS(
SELECT *
FROM (
SELECT MIN(DATE) as starting_date,
MAX(DATE) AS ending_date
FROM --enter source table here--
---OPTIONAL: filter for a specific date range
WHERE DATE BETWEEN 'YYYY-MM-DD' AND 'YYYY-MM-DD'
),UNNEST(GENERATE_DATE_ARRAY(starting_date, ending_date, INTERVAL 1 DAY)) as days
-- other options depending on your date type ( mine was timestamp)
-- GENERATE_DATETIME_ARRAY or GENERATE_DATE_ARRAY
)
SELECT
day_series.days,
output_table.date
FROM day_series
-- do a left join on the output table (i.e. the table you are searching the missing dates for)
LEFT JOIN `project_name.dataset_name.test_table` AS output_table
ON (output_table.date)= day_series.days
-- I only want the records where data is not available or in other words empty/missing
WHERE output_table.date IS NULL
GROUP BY 1,2
ORDER BY 1

Pivot with dynamic DATE columns

I have a query that I created from a table.
example:
select
pkey,
trunc(createdformat) business_date,
regexp_substr(statistics, 'business_\w*') business_statistics
from business_data
where statistics like '%business_%'
group by regexp_substr(statistics, 'business_\w*'), trunc(createdformat)
This works great thanks to your help.
Now I want to show that in a crosstab / pivot.
That means the first column contains the business_statistics values, and the column headings are the dynamic days from business_date.
I've tried the following, but it doesn't quite work yet:
SELECT *
FROM (
select
pkey,
trunc(createdformat) business_date,
regexp_substr(statistics, 'business_\w*') business_statistics
from business_data
where statistics like '%business_%'
)
PIVOT(
count(pkey)
FOR business_date
IN ('17.06.2020','18.06.2020')
)
ORDER BY business_statistics
If I specify the dates, like 17.06.2020 and 18.06.2020 here, it works: 3 columns (business_statistics, 17.06.2020, 18.06.2020). But from column 2 onwards it should be dynamic, i.e. it should show me the days (dates) that are actually contained in the query/table. So the result would be X columns (business_statistics, Date1, Date2, Date3, Date4, ...), dynamic based on the table data.
For example, this does not work:
...
IN (SELECT DISTINCT trunc(createdformat) FROM BUSINESS_DATA WHERE statistics like '%business_%' order by trunc(createdformat))
...
The pivot clause doesn't work with dynamic values.
But there are some workarounds discussed here: How to Convert Rows to Columns and Back Again with SQL (Aka PIVOT and UNPIVOT)
You may find one workaround that suits your requirements.
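One such workaround, sketched below on the query from the question, is Oracle's PIVOT XML clause, which does accept a subquery (or the ANY keyword) in the IN list, at the cost of returning the pivoted columns as a single XML value that you then have to unpack:
SELECT *
FROM (
select
pkey,
trunc(createdformat) business_date,
regexp_substr(statistics, 'business_\w*') business_statistics
from business_data
where statistics like '%business_%'
)
PIVOT XML(
count(pkey)
FOR business_date
IN (SELECT DISTINCT trunc(createdformat) FROM business_data WHERE statistics like '%business_%')
)
ORDER BY business_statistics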
Unfortunately, I am not very familiar with PL/SQL. But could I still process the user's start date and end date for the query?
For example, the user enters in the APEX environment StartDate: 17 June 2020 and EndDate: 20 June 2020.
Then the day difference is calculated in the PL/SQL query, and a variable is filled with the dates of the entered period using a loop.
Example (just an idea, I'm not that fluent in PL/SQL yet):
DECLARE
startdate := :P9999_StartDate  -- example: 17.06.2020
enddate := :P9999_EndDate  -- example: 20.06.2020
BEGIN
LOOP  -- from the startdate to the enddate, day by day
businessdate := businessdate ....  -- example: 17.06.2020,18.06.2020,19.06.2020, ...
END LOOP
SELECT *
FROM (
select
pkey,
trunc(createdformat) business_date,
regexp_substr(statistics, 'business_\w*') business_statistics
from business_data
where statistics like '%business_%'
)
PIVOT(
count(pkey)
FOR business_date
IN (businessdate)
)
ORDER BY business_statistics
END;
That would be my idea, but I have failed to implement it. Is that possible? I hope you understand what I mean.
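For what it's worth, a rough sketch of how that idea could be written with dynamic SQL (untested; only the APEX item names come from the example above, everything else is hypothetical):
DECLARE
l_start DATE := TO_DATE(:P9999_StartDate, 'DD.MM.YYYY');  -- e.g. 17.06.2020
l_end DATE := TO_DATE(:P9999_EndDate, 'DD.MM.YYYY');  -- e.g. 20.06.2020
l_in_list VARCHAR2(4000);
l_sql CLOB;
l_result SYS_REFCURSOR;
BEGIN
-- build the IN list, e.g. DATE '2020-06-17' AS "17.06.2020", DATE '2020-06-18' AS "18.06.2020", ...
FOR i IN 0 .. TRUNC(l_end - l_start) LOOP
l_in_list := l_in_list || CASE WHEN i > 0 THEN ', ' END
|| 'DATE ''' || TO_CHAR(l_start + i, 'YYYY-MM-DD') || ''''
|| ' AS "' || TO_CHAR(l_start + i, 'DD.MM.YYYY') || '"';
END LOOP;
-- plug the generated list into the pivot query and open it as a cursor
l_sql := 'SELECT * FROM (
select pkey, trunc(createdformat) business_date,
regexp_substr(statistics, ''business_\w*'') business_statistics
from business_data
where statistics like ''%business_%''
) PIVOT (count(pkey) FOR business_date IN (' || l_in_list || '))
ORDER BY business_statistics';
OPEN l_result FOR l_sql;
END;
The ref cursor would then be consumed by whatever renders the report (for example an APEX region or an explicit fetch loop).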

Bigquery Standard Dialect REGEXP_REPLACE input type

I am exploring the power of Google BigQuery with the GDELT database using this tutorial; however, the SQL dialect is 'legacy' and I would like to use the standard dialect.
In legacy dialect:
SELECT
theme,
COUNT(*) AS count
FROM (
SELECT
REGEXP_REPLACE(SPLIT(V2Themes,';'), r',.*',"") theme
from [gdelt-bq:gdeltv2.gkg]
where DATE>20150302000000 and DATE < 20150304000000 and V2Persons like '%Netanyahu%'
)
group by theme
ORDER BY 2 DESC
LIMIT 300
and when I try to translate it into the standard dialect:
SELECT
theme,
COUNT(*) AS count
FROM (
SELECT
REGEXP_REPLACE(SPLIT(V2Themes,';') , r',.*', " ") AS theme
FROM
`gdelt-bq.gdeltv2.gkg`
WHERE
DATE>20150302000000
AND DATE < 20150304000000
AND V2Persons LIKE '%Netanyahu%' )
GROUP BY
theme
ORDER BY
2 DESC
LIMIT
300
it throws the following error:
No matching signature for function REGEXP_REPLACE for argument types: ARRAY<STRING>, STRING, STRING. Supported signatures: REGEXP_REPLACE(STRING, STRING, STRING); REGEXP_REPLACE(BYTES, BYTES, BYTES) at [6:5]
It seems like I have to cast the result of the SPLIT() operation as a string. How do I do this?
UPDATE: I found a talk explaining the UNNEST operation:
SELECT
COUNT(*),
REGEXP_REPLACE(themes,",.*","") AS theme
FROM
`gdelt-bq.gdeltv2.gkg_partitioned`,
UNNEST( SPLIT(V2Themes,";") ) AS themes
WHERE
_PARTITIONTIME >= "2018-08-09 00:00:00"
AND _PARTITIONTIME < "2018-08-10 00:00:00"
AND V2Persons LIKE '%Netanyahu%'
GROUP BY
theme
ORDER BY
2 DESC
LIMIT
100
Flatten the array first:
SELECT
REGEXP_REPLACE(theme , r',.*', " ") AS theme,
COUNT(*) AS count
FROM
`gdelt-bq.gdeltv2.gkg`,
UNNEST(SPLIT(V2Themes,';')) AS theme
WHERE
DATE>20150302000000
AND DATE < 20150304000000
AND V2Persons LIKE '%Netanyahu%'
GROUP BY
theme
ORDER BY
2 DESC
LIMIT
300
The legacy SQL equivalent in your question actually has the effect of flattening the array as well, although it's implicit in the GROUP BY on the theme.