subtract 1 month interval from date? - postgresql-11

SELECT CONCAT(EXTRACT(MONTH FROM '{{ date.start }}'::timestamp), '/',
              EXTRACT(YEAR FROM '{{ date.start }}'::timestamp)) - interval '1' MONTH
I get an error when I run this query:
Error running query: operator does not exist: text - interval LINE 25: ... FROM '2019-01-01'::timestamp )) - interval... ^ HINT: No operator matches the given name and argument types. You might need to add explicit type casts.
How to solve it?

You need to remove the concat() as it turns the timestamp into a varchar.
If you want to get the start of the month of the "timestamp" value, there is an easier way to do that:
date_trunc('month', '{{ date.start }}'::timestamp)
The result of that is a timestamp from which you can subtract the interval:
date_trunc('month', '{{ date.start }}'::timestamp) - interval '1 month'
The following sample query:
with sample_data (input_date) as (
    values
        (timestamp '2019-01-01 17:18:19'),
        (timestamp '2019-02-07 16:30:40'),
        (timestamp '2019-03-02 23:30:42')
)
select input_date,
       (date_trunc('month', input_date) - interval '1 month')::date as previous_month_start
from sample_data;
returns the following result:
input_date | previous_month_start
--------------------+---------------------
2019-01-01 17:18:19 | 2018-12-01
2019-02-07 16:30:40 | 2019-01-01
2019-03-02 23:30:42 | 2019-02-01
If you want to display the result of that in a different format, apply to_char() on the result:
to_char(date_trunc('month', input_date) - interval '1 month', 'mm/yyyy')
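For example, applied to the sample_data CTE above (a minimal sketch; the formatted_month alias is just illustrative):
select input_date,
       to_char(date_trunc('month', input_date) - interval '1 month', 'mm/yyyy') as formatted_month
from sample_data;
-- e.g. 12/2018 for the first sample row, 01/2019 for the second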

From the comments and the query in the OP, I think you are trying to convert a timestamp to a custom MM/YYYY format after subtracting a 1-month interval.
The following is just one of a few approaches to achieve that, using the concatenation operator ||:
SELECT (extract(month FROM (input_date - interval '1 month')))::text
|| '/'
|| (extract(year FROM (input_date - interval '1 month')))::text
AS formatted_string;
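Note that extract() returns plain numbers, so single-digit months come out without a leading zero (e.g. 1/2019 rather than 01/2019). Run against the sample_data CTE from the first answer (a hedged sketch reusing the same made-up rows), it would look like this:
select input_date,
       extract(month from (input_date - interval '1 month'))::text
       || '/'
       || extract(year from (input_date - interval '1 month'))::text as formatted_string
from sample_data;
-- 2019-01-01 17:18:19  ->  12/2018
-- 2019-02-07 16:30:40  ->  1/2019
-- 2019-03-02 23:30:42  ->  2/2019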

Related

How to query the time in unix epoch timestamp in AWS Athena

I have a simple table containing node, message, starttime, and endtime details, where starttime and endtime are Unix timestamps. The query I am running is:
select node, message, (select from_unixtime(starttime)), (select from_unixtime(endtime)) from table1 WHERE try(select from_unixtime(starttime)) > to_iso8601(current_timestamp - interval '24' hour) limit 100
The query is not working and throws a syntax error.
I am trying to fetch the following information from the table:
query the table using start time and end time for past 'n' hours or 'n' days and get the output of starttime and endtime in human readable format
query the table using a specific date and time in human readable format
You don't need "extra" selects and you don't need to_iso8601 in the where clause:
WITH dataset AS (
    SELECT * FROM (VALUES
        (1627409073, 1627409074),
        (1627225824, 1627225826)
    ) AS t (starttime, endtime)
)
SELECT from_unixtime(starttime), from_unixtime(endtime)
FROM dataset
WHERE from_unixtime(starttime) > (current_timestamp - interval '24' hour)
LIMIT 100
Output:
_col0                   | _col1
------------------------+------------------------
2021-07-27 18:04:33.000 | 2021-07-27 18:04:34.000
To search the last week you can use:
WHERE your_date >= to_unixtime(CAST(now() - interval '7' day AS timestamp))
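For example, filtering the same inline dataset for the past 7 days while still printing human-readable times (a hedged sketch along the same lines):
WITH dataset AS (
    SELECT * FROM (VALUES
        (1627409073, 1627409074),
        (1627225824, 1627225826)
    ) AS t (starttime, endtime)
)
SELECT from_unixtime(starttime) AS start_ts, from_unixtime(endtime) AS end_ts
FROM dataset
WHERE starttime >= to_unixtime(CAST(now() - interval '7' day AS timestamp))
LIMIT 100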

How to find missing dates in a BigQuery table using SQL

How can I get a list of missing dates from a BigQuery table? For example, a table (test_table) is populated every day by some job, but on a few days the job fails and data isn't written into the table.
Use Case:
We have a table (test_table) which is populated every day by some job (a scheduled query or cloud function). Sometimes those jobs fail and data isn't available for those particular dates in my table.
How can I find those dates rather than scrolling through thousands of rows?
The below query will return a list of dates and ad_ids where data wasn't uploaded (null).
Note: I have used MAX(Date) as I knew dates were missing in between my boundary dates. To be safe, you can also specify starting_date and ending_date explicitly in case data hasn't been populated in the last few days at all.
WITH Date_Range AS
-- anchor for date range
(
    SELECT MIN(DATE) AS starting_date,
           MAX(DATE) AS ending_date
    FROM `project_name.dataset_name.test_table`
),
day_series AS
-- anchor to get all the dates within the range
(
    SELECT *
    FROM Date_Range,
         UNNEST(GENERATE_TIMESTAMP_ARRAY(starting_date, ending_date, INTERVAL 1 DAY)) AS days
    -- other options depending on your date type (mine was timestamp)
    -- GENERATE_DATETIME_ARRAY or GENERATE_DATE_ARRAY
)
SELECT
    day_series.days,
    original_table.ad_id
FROM day_series
-- do a left join on the source table
LEFT JOIN `project_name.dataset_name.test_table` AS original_table
    ON (original_table.date) = day_series.days
-- I only want the records where data is not available or in other words empty/missing
WHERE original_table.ad_id IS NULL
GROUP BY 1, 2
ORDER BY 1
The final output is the list of day_series.days values for which no matching row exists (ad_id will be NULL for those days).
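As a self-contained illustration of the same approach (the inline rows and the ad_id values below are made up for the example):
WITH test_table AS (
    SELECT TIMESTAMP '2021-01-01' AS date, 'ad_1' AS ad_id UNION ALL
    SELECT TIMESTAMP '2021-01-02', 'ad_1' UNION ALL
    SELECT TIMESTAMP '2021-01-05', 'ad_1'
),
Date_Range AS (
    SELECT MIN(date) AS starting_date, MAX(date) AS ending_date FROM test_table
),
day_series AS (
    SELECT days
    FROM Date_Range,
         UNNEST(GENERATE_TIMESTAMP_ARRAY(starting_date, ending_date, INTERVAL 1 DAY)) AS days
)
SELECT day_series.days, test_table.ad_id
FROM day_series
LEFT JOIN test_table ON test_table.date = day_series.days
WHERE test_table.ad_id IS NULL
ORDER BY 1
-- returns 2021-01-03 and 2021-01-04, the two days with no data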
An alternate solution: you can try the following query to get the desired output:
with t as (
    select 1 as id, cast('2020-12-25' as timestamp) Days union all
    select 1 as id, cast('2020-12-26' as timestamp) Days union all
    select 1 as id, cast('2020-12-27' as timestamp) Days union all
    select 1 as id, cast('2020-12-31' as timestamp) Days union all
    select 1 as id, cast('2021-01-01' as timestamp) Days union all
    select 1 as id, cast('2021-01-04' as timestamp) Days
)
SELECT *
FROM (
    select TIMESTAMP_ADD(Days, INTERVAL 1 DAY) AS Days,
           TIMESTAMP_SUB(next_days, INTERVAL 1 DAY) AS next_days
    from (
        select t.Days,
               (case when lag(Days) over (partition by id order by Days) = Days
                     then NULL
                     when lag(Days) over (partition by id order by Days) is null
                     then NULL
                     else lead(Days) over (partition by id order by Days)
                end) as next_days
        from t
    )
    where next_days is not null
      and Days <> TIMESTAMP_SUB(next_days, INTERVAL 1 DAY)
),
UNNEST(GENERATE_TIMESTAMP_ARRAY(Days, next_days, INTERVAL 1 DAY)) AS days
The output will contain one row per missing day; for the sample data above, the generated days column lists 2020-12-28, 2020-12-29, 2020-12-30, 2021-01-02 and 2021-01-03.
I used the code above but had to restructure it for BigQuery:
-- anchor for date range - this will select dates from the source table (i.e. the table your query runs off of)
WITH day_series AS (
    SELECT *
    FROM (
        SELECT MIN(DATE) AS starting_date,
               MAX(DATE) AS ending_date
        FROM --enter source table here--
        -- OPTIONAL: filter for a specific date range
        WHERE DATE BETWEEN 'YYYY-MM-DD' AND 'YYYY-MM-DD'
    ), UNNEST(GENERATE_DATE_ARRAY(starting_date, ending_date, INTERVAL 1 DAY)) AS days
    -- other options depending on your date type (mine was timestamp)
    -- GENERATE_DATETIME_ARRAY or GENERATE_DATE_ARRAY
)
SELECT
    day_series.days,
    output_table.date
FROM day_series
-- do a left join on the output table (i.e. the table you are searching the missing dates for)
LEFT JOIN `project_name.dataset_name.test_table` AS output_table
    ON (output_table.date) = day_series.days
-- I only want the records where data is not available or in other words empty/missing
WHERE output_table.date IS NULL
GROUP BY 1, 2
ORDER BY 1

Is there a way to evaluate for first day of the month and last day of the month?

I wanted to know if there is a way to evaluate dates to determine whether they fall on the 1st of the month or the last day of the month.
I have two columns:
| Start Date | End Date  | Result    |
|------------|-----------|-----------|
| 1/1/2020   | 1/31/2020 | Standard  |
| 2/5/2020   | 2/15/2020 | Irregular |
The intended goal is to use conditional formatting to highlight date ranges that do not start on the 1st of a month or end on the last day of a month.
I tried this solution that was suggested:
IsStandard = [End Date] = Date.EndOfMonth(Date.FromText([End Date])) and [Start Date] = Date.AddDays(Date.AddMonths(Date.EndOfMonth(Date.FromText([Start Date])),-1),1)
The expression evaluates but returns all results as "False" (NOTE: the dates are stored as text values, so I had to add the additional Date.FromText()).
Is there a way to get the expression to evaluate as True for date ranges that run from the 1st of the month to the last day of the month, and False for anything else?
PowerQuery version: You could create a custom column as follows:
IsStandard = [End Date] = Date.EndOfMonth([End Date]) and [Start Date] = Date.AddDays(Date.AddMonths(Date.EndOfMonth([Start Date]),-1),1)
This version returns TRUE if the start date is the start of a month and the end date is the end of a month. To suit your use-case, you could invert the logic or break it up into a column to flag is/is not start of month and another to flag is/is not end of month.
The reason for the weird logic around start of month is PowerQuery provides the Date.EndOfMonth function but not a Date.StartOfMonth, so you have to take the end of a month, subtract a month, then add a day to get to the start of month.
Hopefully this helps :)
Edit RE: always false...
This is another quirk of the language and not intuitive, but the key is that the type conversion via Date.FromText() returns a nullable date, which is different from a strict Date datatype (this can be confirmed by adding a column based on the formula Date.FromText([End Date]) and examining the data type implicitly assigned to the new column, which is not Date).
Probably the simplest option would be to Change Type on the [Start Date] and [End Date] columns to Date prior to evaluating for 'standard' vs 'irregular' and remove the Date.FromText calls from the new column. This might make the fields more consistent anyway but you may run into an issue if any of the raw values are actually null or empty.
If any values are null or empty, you can do a check for null to short-circuit the date conversion and the start/end-of-month check. If the null check fails, you would then have the opportunity to flag the record as something different from 'Standard' and 'Irregular', if that is important for your use-case.

Calculate date and weekending date on Presto

Given day, month, and year as integer columns in the table, calculate the date and the week-ending date from these values.
I tried the following:
select date_parse(cast (2020 as varchar)||cast (03 as varchar)||cast (02 as varchar),'%Y%m%d')
returns an error saying "INVALID_FUNCTION_ARGUMENT: Invalid format: "202032" is too short"
The simplest way is to use format() + cast to date:
presto> SELECT CAST(format('%d-%d-%d', 2020, 3, 31) AS date);
_col0
------------
2020-03-31
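If you want the intermediate string zero-padded (purely cosmetic here, since the cast accepts both), Java-style width specifiers also work; a small variation on the same idea:
presto> SELECT CAST(format('%d-%02d-%02d', 2020, 3, 2) AS date);
_col0
------------
2020-03-02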
Since Athena is still based on Presto 0.172, it doesn't have this function yet, so you can do the same without format():
presto> SELECT CAST(CAST(2020 AS varchar) || '-' || CAST(3 AS varchar) || '-' || CAST(31 AS varchar) AS date);
_col0
------------
2020-03-31
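Applied to the actual table, and adding one possible interpretation of the week-ending date (the Sunday that closes the week, given that date_trunc('week', ...) returns the Monday), it might look like the sketch below. The table name my_table and the columns year_col, month_col and day_col are hypothetical stand-ins for your integer columns:
SELECT
    CAST(CAST(year_col AS varchar) || '-' || CAST(month_col AS varchar) || '-' || CAST(day_col AS varchar) AS date) AS the_date,
    -- date_trunc('week', d) gives the Monday of the week containing d; adding 6 days gives the Sunday
    date_add('day', 6,
             date_trunc('week',
                 CAST(CAST(year_col AS varchar) || '-' || CAST(month_col AS varchar) || '-' || CAST(day_col AS varchar) AS date))) AS week_ending_date
FROM my_table;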

Hive: Usage of Concat in Column Name

I am trying to get data from a table that has column names like year_2016, year_2017, year_2018, etc.
I am not sure how to get the data from this table.
The data looks like:
| count_of_accidents | year_2016 | year_2017 | year_2018 |
|--------------------|-----------|-----------|-----------|
| 15                 | 12        | 5         | 1         |
| 5                  | 10        | 6         | 18        |
I have tried the 'concat' function but it doesn't really work.
I have tried this:
select SUM( count_of_accidents * concat('year_',year(regexp_replace('2018_1_1','_','-'))))
from table_name;
The column name (year_2017 or year_2018, etc.) will be passed as a parameter, so I am not really able to hardcode the column name like this:
select SUM( count_of_accidents * year_2018) from table_name;
Is there any way I can do this?
You can do it using regular expressions. Like this:
--create test table
create table test_col(year_2018 string, year_2019 string);
set hive.support.quoted.identifiers=none;
set hive.cli.print.header=true;
--test select using hard-coded pattern
select year_2018, `(year_)2019` from test_col;
OK
year_2018 year_2019
Time taken: 0.862 seconds
--test pattern parameter
set hivevar:year_param=2019;
select year_2018, `(year_)${year_param}` from test_col;
OK
year_2018 year_2019
Time taken: 0.945 seconds
--two parameters
set hivevar:year_param1=2018;
set hivevar:year_param2=2019;
select `(year_)${year_param1}`, `(year_)${year_param2}` from test_col t;
OK
year_2018 year_2019
Time taken: 0.159 seconds
--parameter contains full column_name and using more strict regexp pattern
set hivevar:year_param2=year_2019;
select `^${year_param2}$` from test_col t;
OK
year_2019
Time taken: 0.053 seconds
--select all columns using single pattern year_ and four digits
select `^year_[0-9]{4}$` from test_col t;
OK
year_2018 year_2019
The parameter should be calculated and passed to the Hive script; functions like concat() and regexp_replace() are not supported in column names.
Also column aliasing does not work for columns extracted using regular expressions:
select t.number_of_incidents, `^${year_param}$` as year1 from test_t t;
throws exception:
FAILED: SemanticException [Error 10004]: Line 1:30 Invalid table alias
or column reference '^year_2018$': (possible column names are:
number_of_incidents, year_2016, year_2017, year_2018)
I found a workaround to alias a column using UNION ALL with an empty dataset; see this test:
create table test_t(number_of_incidents int, year_2016 int, year_2017 int, year_2018 int);
insert into table test_t values(15, 12, 5, 1); --insert test data
insert into table test_t values(5,10,6,18);
--parameter, can be passed from outside the script from command line
set hivevar:year_param=year_2018;
--enable regex columns and print column names
set hive.support.quoted.identifiers=none;
set hive.cli.print.header=true;
--Alias column using UNION ALL with empty dataset
select sum(number_of_incidents*year1) incidents_year1
from
(--UNION ALL with empty dataset to alias columns extracted
select 0 number_of_incidents, 0 year1 where false --returns no rows because of false condition
union all
select t.number_of_incidents, `^${year_param}$` from test_t t
)s;
Result:
OK
incidents_year1
105
Time taken: 38.003 seconds, Fetched: 1 row(s)
The first query in the UNION ALL does not affect the data because it returns no rows, but its column names become the names of the whole UNION ALL dataset and can be used in the outer query. This trick works. If you find a better workaround to alias columns extracted using a regexp, please add your solution as well.
Update:
There is no need for regular expressions if you can pass the full column name as a parameter. Hive substitutes variables as-is (it does not evaluate them) before query execution. Use a regexp only if you cannot pass the full column name for some reason and, as in the original query, some pattern concatenation is needed. See this test:
--parameter, can be passed from outside the script from command line
set hivevar:year_param=year_2018;
select sum(number_of_incidents*${year_param}) incidents_year1 from test_t t;
Result:
OK
incidents_year1
105
Time taken: 63.339 seconds, Fetched: 1 row(s)
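Because the substitution is purely textual, the statement Hive actually executes in this last test is effectively:
select sum(number_of_incidents*year_2018) incidents_year1 from test_t t;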