Retrieving the row with the greatest timestamp in QuestDB

I'm currently running QuestDB 6.1.2 on Linux. How do I get the row with the maximum timestamp from a table? I have tried the following on a test table with around 5 million rows:
1. select * from table where cast(timestamp as symbol) in (select cast(max(timestamp) as symbol) from table);
2. select * from table inner join (select max(timestamp) mm from table) on timestamp >= mm;
3. select * from table where timestamp = max(timestamp);
4. select * from table where timestamp = (select max(timestamp) from table);
Query 1 is correct but runs in ~5 s; query 2 is also correct and runs in ~500 ms, but looks unnecessarily verbose for such a simple request; query 3 compiles but returns an empty table; and query 4 is a syntax error, even though that is how it would normally be written in standard SQL.

select * from table limit -1 works. QuestDB returns rows sorted by timestamp by default, and limit -1 takes the last row, which is therefore the row with the greatest timestamp. To be explicit about ordering by timestamp, select * from table order by timestamp limit -1 can be used instead. This query runs in around 300-400 ms on the same table.
As a side note, the fourth query, timestamp = (select max(timestamp) from table), doesn't work because QuestDB does not support subqueries in the WHERE clause yet (as of QuestDB 6.1.2).
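For reference, a minimal sketch of both forms, assuming a hypothetical table named trades with a designated timestamp column named ts:
-- last row in storage order; QuestDB keeps rows ordered by the designated timestamp
select * from trades limit -1;
-- the same result, with the ordering spelled out explicitly
select * from trades order by ts limit -1;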

Related

How to find missing dates in a BigQuery table using SQL

How to get a list of missing dates from a BigQuery table. For example, a table (test_table) is populated every day by some job, but on some days the job fails and data isn't written into the table.
Use case:
We have a table (test_table) which is populated every day by some job (a scheduled query or cloud function). Sometimes those jobs fail and data isn't available for those particular dates in the table.
How can we find those dates rather than scrolling through thousands of rows?
The query below returns a list of dates and ad_ids where data wasn't uploaded (null).
Note: I have used the MIN(Date)/MAX(Date) boundaries as I knew dates were missing in between them. To be safe, you can also specify the starting_date and ending_date explicitly in case data hasn't been populated at all for the last few days; a sketch of that variant follows the query.
WITH Date_Range AS
-- anchor for date range
(
SELECT MIN(DATE) as starting_date,
MAX(DATE) AS ending_date
FROM `project_name.dataset_name.test_table`
),
day_series AS
-- anchor to get all the dates within the range
(
SELECT *
FROM Date_Range
,UNNEST(GENERATE_TIMESTAMP_ARRAY(starting_date, ending_date, INTERVAL 1 DAY)) AS days
-- other options depending on your date type ( mine was timestamp)
-- GENERATE_DATETIME_ARRAY or GENERATE_DATE_ARRAY
)
SELECT
day_series.days,
original_table.ad_id
FROM day_series
-- do a left join on the source table
LEFT JOIN `project_name.dataset_name.test_table` AS original_table ON (original_table.date)= day_series.days
-- I only want the records where data is not available or in other words empty/missing
WHERE original_table.ad_id IS NULL
GROUP BY 1,2
ORDER BY 1
The final output is the list of dates (with a null ad_id) for which no data was uploaded.
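For the safe-side variant mentioned above, the Date_Range anchor can be hard-coded instead of derived with MIN/MAX; a sketch with placeholder boundary dates (day_series and the LEFT JOIN stay exactly as in the query above):
WITH Date_Range AS
-- hard-coded anchor for the date range (placeholder dates)
(
SELECT TIMESTAMP '2020-12-01' AS starting_date,
       TIMESTAMP '2021-01-31' AS ending_date
)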
As an alternate solution, you can try the following query to get the desired output:
with t as (select 1 as id, cast('2020-12-25' as timestamp) Days union all
           select 1 as id, cast('2020-12-26' as timestamp) Days union all
           select 1 as id, cast('2020-12-27' as timestamp) Days union all
           select 1 as id, cast('2020-12-31' as timestamp) Days union all
           select 1 as id, cast('2021-01-01' as timestamp) Days union all
           select 1 as id, cast('2021-01-04' as timestamp) Days)
SELECT *
FROM (
  select TIMESTAMP_ADD(Days, INTERVAL 1 DAY) AS Days,
         TIMESTAMP_SUB(next_days, INTERVAL 1 DAY) AS next_days
  from (
    select t.Days,
           (case when lag(Days) over (partition by id order by Days) = Days
                   then null
                 when lag(Days) over (partition by id order by Days) is null
                   then null
                 else lead(Days) over (partition by id order by Days)
            end) as next_days
    from t)
  where next_days is not null
    and Days <> TIMESTAMP_SUB(next_days, INTERVAL 1 DAY)),
  UNNEST(GENERATE_TIMESTAMP_ARRAY(Days, next_days, INTERVAL 1 DAY)) AS days
The output is the list of missing dates within each gap; for the sample data above, 2020-12-28 through 2020-12-30 and 2021-01-02 through 2021-01-03.
I used the code above but had to restructure it for BigQuery:
-- anchor for date range - this will select dates from the source table (i.e. the table your query runs off of)
WITH day_series AS(
SELECT *
FROM (
SELECT MIN(DATE) as starting_date,
MAX(DATE) AS ending_date
FROM --enter source table here--
---OPTIONAL: filter for a specific date range
WHERE DATE BETWEEN 'YYYY-MM-DD' AND 'YYYY-MM-DD'
),UNNEST(GENERATE_DATE_ARRAY(starting_date, ending_date, INTERVAL 1 DAY)) as days
-- other options depending on your date type ( mine was timestamp)
-- GENERATE_DATETIME_ARRAY or GENERATE_DATE_ARRAY
)
SELECT
day_series.days,
output_table.date
FROM day_series
-- do a left join on the output table (i.e. the table you are searching the missing dates for)
LEFT JOIN `project_name.dataset_name.test_table` AS output_table
ON (output_table.date)= day_series.days
-- I only want the records where data is not available or in other words empty/missing
WHERE output_table.date IS NULL
GROUP BY 1,2
ORDER BY 1

How to select data from aws athena table which is partitioned like 'year=yyyy/month=MM/date=dd/' for a given date range?

The Athena table is partitioned the same way as the S3 folder path:
parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=17
parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=9
parent=0fc966a0-bba7-4c0b-a648-cff7f0332059/year=2020/month=4/date=16
parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=14
PARTITIONED BY (
`parent` string,
`year` int,
`month` tinyint,
`date` tinyint)
Now, how should I form the WHERE condition for a SELECT query to get data for parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003' from 2019-06-01 to 2020-04-31?
SELECT *
FROM table
WHERE parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003' AND year >= 2019 AND year <= 2020 AND month >= 04 AND month <= 06 AND date >= 01 AND date <= 31 ;
But this isn't correct. Please help
Partitioning on year, month, and day separately makes tables unnecessarily difficult to query. If you're starting out, I really suggest avoiding this kind of partitioning scheme. If you can't avoid it, you can still make things easier by creating the table partitions differently.
Most guides will tell you to create directory structures like year=2020/month=4/date=1/file1, create a table with three corresponding partition columns, and then run MSCK REPAIR TABLE to load the partitions. This works, but it's far from the best way to use Athena: MSCK REPAIR TABLE has atrocious performance, and that partitioning scheme is awkward to query.
I suggest creating directory structures that are just 2020-03-01/file1, but if you can't, you can actually have any structure you want; 2020/03/01/file1, year=2020/month=4/date=1/file1, or any other structure with one distinct prefix per date will work more or less equally well.
I also suggest creating tables with only one partition column: date (or dt or day if you want to avoid quoting), typed as DATE, not string.
Then, instead of running MSCK REPAIR TABLE, you use ALTER TABLE … ADD PARTITION (or the Glue APIs directly) to add partitions. This command lets you specify the location separately from the partition column value:
ALTER TABLE my_table ADD
PARTITION (day = '2020-04-01') LOCATION 's3://some-bucket/path/to/2020-04-01/'
The important thing here is that the partition column value doesn't have to have any relationship at all with the location; this would work equally well:
ALTER TABLE my_table ADD
PARTITION (day = '2020-04-01') LOCATION 's3://some-bucket/path/to/data-for-first-of-april/'
For your specific case you could have:
PARTITIONED BY (`parent` string, `day` date)
and then do:
ALTER TABLE your_table ADD
PARTITION (parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003', day = '2020-04-17') LOCATION 's3://your-bucket/parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=17'
PARTITION (parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003', day = '2020-04-09') LOCATION 's3://your-bucket/parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=9'
PARTITION (parent = '0fc966a0-bba7-4c0b-a648-cff7f0332059', day = '2020-04-16') LOCATION 's3://your-bucket/parent=0fc966a0-bba7-4c0b-a648-cff7f0332059/year=2020/month=4/date=16'
PARTITION (parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003', day = '2020-04-14') LOCATION 's3://your-bucket/parent=9ab4fcca-65d8-11ea-bc55-0242ac130003/year=2020/month=4/date=14'
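With the day partition column typed as DATE, the date range from the question becomes a plain range predicate; a sketch (the end date is assumed to be 2020-04-30, since April has 30 days):
SELECT *
FROM your_table
WHERE parent = '9ab4fcca-65d8-11ea-bc55-0242ac130003'
  AND day BETWEEN DATE '2019-06-01' AND DATE '2020-04-30'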
Here is how you can use the year, month and date values that come from the partitions in order to select a date range:
SELECT col1, col2
FROM my_table
WHERE CAST(date_parse(concat(CAST(year AS VARCHAR(4)), '-',
                             CAST(month AS VARCHAR(2)), '-',
                             CAST("date" AS VARCHAR(2))), '%Y-%m-%d') AS DATE)
      BETWEEN DATE '2019-06-01' AND DATE '2020-04-30'
You can add additional filter statements as needed; one example follows.
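For example, a plain predicate on the year partition column can be added so Athena can prune partitions before evaluating the CAST (a sketch of one such extra filter):
SELECT col1, col2
FROM my_table
WHERE year BETWEEN 2019 AND 2020
  AND CAST(date_parse(concat(CAST(year AS VARCHAR(4)), '-',
                             CAST(month AS VARCHAR(2)), '-',
                             CAST("date" AS VARCHAR(2))), '%Y-%m-%d') AS DATE)
      BETWEEN DATE '2019-06-01' AND DATE '2020-04-30'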

I can't sort a table column by a calculated column

I'm trying to sort a column in my custom date table (a CSV file) by a calculated column in the same table, but I'm seeing an error. The calculated column does not reference the column I wish to sort by. Here's the DAX for the calculated column:
PeriodOffset =
Dates[Period] + Dates[FiscalYear] * 13
- CALCULATE ( VALUES ( Dates[Period] ), Dates[Date] = TODAY () )
- CALCULATE ( VALUES ( Dates[FiscalYear] ), Dates[Date] = TODAY () ) * 13
My date table has every date from 2003/4 to 2034/35, along with custom period numbers, calendar and fiscal years, etc. The column I am trying to sort is called PeriodFiscalYear. Each value in that column maps to only one value in the PeriodOffset column, so the problem isn't a one-to-many mapping.
The weird thing is, I have had this working in a previous report. In this instance I was simply trying to recreate the functionality, but it won't do it. Even stranger, if I create the PeriodFiscalYear column as a calculated column (currently it's hard-coded in the CSV file), it works! So I have a workaround of sorts; I would just like to understand what is going on.
Thanks
I believe this has to do with the fact that data columns are sorted when the data is ingested into Power BI, whereas calculated columns are only evaluated later.
Therefore:
you can sort a data column only by other data columns (because calculated columns have not been calculated yet)
you can sort a calculated column by both data columns and calculated columns
Solution:
A) PeriodFiscalYear becomes a calculated column, or
B) PeriodOffset becomes a data column (either in your CSV or in Power Query)
I actually figured this out. The problem was with my data model - I had a circular relationship in there, as I was deriving the Period column in one table using my calendar table and then linking them back in the relationship!
I created a linking table with the keys in both to make the relationship, then hid it.
Thanks

How do I select rows from an SQLite table excluding ones from a previous query?

I have an SQLite table with more than 25 million rows. I selected 1 million rows at random from this table using the following code:
# using the sqlite3 module; `cursor` comes from an open connection
c = cursor.execute(
    "SELECT * FROM reviews_table "
    "WHERE ROWID IN (SELECT ROWID FROM reviews_table ORDER BY RANDOM() LIMIT 1000000)"
)
Now, I wish to select another 1 million rows from the table, excluding those rows in the previous query. How would I go about doing this?
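One way to do this is sketched below, assuming you can create a helper table (hypothetically named sampled_rowids) and that the ROWIDs of each sample are recorded when that sample is drawn (RANDOM() is not reproducible, so a sample that wasn't saved cannot be reconstructed afterwards):
-- remember which ROWIDs have already been handed out
CREATE TABLE IF NOT EXISTS sampled_rowids (rid INTEGER PRIMARY KEY);
-- record the ROWIDs of the first sample when it is drawn
INSERT INTO sampled_rowids
SELECT ROWID FROM reviews_table ORDER BY RANDOM() LIMIT 1000000;
-- next sample: 1 million random rows whose ROWIDs have not been drawn before
SELECT *
FROM reviews_table
WHERE ROWID NOT IN (SELECT rid FROM sampled_rowids)
ORDER BY RANDOM()
LIMIT 1000000;
Inserting the ROWIDs of each new sample into sampled_rowids as well lets the same pattern repeat for further non-overlapping draws.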

Cassandra only returning a subset of my rows when using varchar

I have encountered a problem when using Apache Cassandra: I have 500k rows of entries in a four-column table. Three of the columns make up the compound key and the last one is a help column for indexing, so that I can search between the other ones using greater-than or less-than operators. The three components of the compound key are integers, and the help column is a varchar filled with 'help' for all 500k entries. Now, when I use:
select count(*) from table where help='help' limit 1kk allow filtering;
I should have gotten 500k as the result, but I get 36738.
Any ideas as to why this is happening?
The table has four columns: id, column1, column2, help. My query needs to be something similar to:
select * from table where column1 > 15 and column1 < 1000 and column2 > 200 and column2 < 10000 and help='help' limit 1kk allow filtering;
Also, when I created the table, I used PRIMARY KEY (id, column1, column2).
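For reference, based on that description the table definition would look roughly like this (a sketch; the table and column names are the placeholders used above):
CREATE TABLE table_name (
    id int,
    column1 int,
    column2 int,
    help varchar,
    PRIMARY KEY (id, column1, column2)
);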