Optimal timestamp-based query in Django

What is the optimal query to obtain all the records for one specific day?
In my Weather model, 'timestamp' is a standard DateTimeField.
I'm currently using
start = datetime.datetime(2009, 1, 31)
end = start + datetime.timedelta(hours=23, minutes=59, seconds=59)
Weather.objects.filter(timestamp__range=(start, end))
but wonder if there is a more efficient method.

The way it's done in django.views.generic.date_based is:
{'date_field__range': (datetime.datetime.combine(date, datetime.time.min),
datetime.datetime.combine(date, datetime.time.max))}
There should soon be a patch merged into Django that will provide a __date lookup for exactly this type of query (http://code.djangoproject.com/ticket/9596).
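As a minimal sketch of a variation on the range above: a half-open interval from midnight to the next midnight avoids depending on the last microsecond of the day. The helper name `day_bounds` is hypothetical, and the sketch assumes naive datetimes (no time zone handling). The Django call is shown only as a comment because it needs a configured project.

```python
import datetime


def day_bounds(day):
    """Return (start, end): midnight of `day` and midnight of the next day.

    `end` is meant to be used as an exclusive upper bound.
    """
    start = datetime.datetime.combine(day, datetime.time.min)
    end = start + datetime.timedelta(days=1)
    return start, end


start, end = day_bounds(datetime.date(2009, 1, 31))
print(start, end)

# Assumed usage with the Weather model from the question:
# Weather.objects.filter(timestamp__gte=start, timestamp__lt=end)
```

Whether this is any faster than `__range` depends on the database and its indexes; the point is only that the bounds are unambiguous.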

Do not prematurely optimize.
Index the columns that your queries filter on frequently.
Optimize expensive columns, for example by adding auto-updated year, month, and day values (maybe just as a string), if and only if tests show a significant speedup, and only after trying what already works now and determining that it isn't viable.

Related

Django and PostgreSQL - how to store time range?

I would like to store a time range (without dates) like "HH:MM-HH:MM".
Can you give me a hint, how can I implement it most easily? Maybe there is a way to use DateRangeField for this aim or something else.
Thanks for your time!
A time without a date doesn't make much sense if you ever need a range that spans midnight. You could always convert to text using to_char(<yourtimestamp>, 'hh24:mi:ss') or extract the individual parts. Unfortunately Postgres does not provide an extract(time from <yourtimestamp>) function. The following function provides essentially that.
create or replace
function extract_tod(date_time_in timestamp without time zone)
returns time
language sql
as $$
select to_char(date_time_in, 'hh24:mi:ss')::time;
$$;
See here for timestamp ranges and here for their associated functions. As for how to store them, just store them with the date, as a standard TIMESTAMP (date + time).
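To illustrate why a bare time range gets awkward around midnight, here is a small Python sketch. The helper `time_in_range` is hypothetical and not part of Django or Postgres; it shows the extra branch that is needed once the end of the range is earlier than its start.

```python
import datetime


def time_in_range(start, end, t):
    """Return True if time-of-day `t` lies in [start, end).

    If `end` is earlier than `start`, the range is treated as wrapping
    past midnight (for example 22:00-02:00).
    """
    if start <= end:
        return start <= t < end
    # Wrapped range: either late in the evening or early in the morning.
    return t >= start or t < end


night_shift = (datetime.time(22, 0), datetime.time(2, 0))
print(time_in_range(*night_shift, datetime.time(23, 30)))
print(time_in_range(*night_shift, datetime.time(12, 0)))
```

Storing full timestamps, as suggested above, removes the need for this special case because the end is then always later than the start.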

Checking if context is total

What might be an alternative way, possibly more effective, for checking if context is total. I use this measure as benchmark:
IsTotal1 = CALCULATE(COUNT(Tab[Store]), ALLSELECTED(Tab)) = COUNT(Tab[Store])
The idea is that it calculates COUNT on a table with filters removed (left side, so we get counts for all dimensions in context) and checks it against the COUNT with current context. If both are the same, we have total.
I know that using the function HASONEVALUE might be tempting:
IsTotal2 = NOT(HASONEVALUE(Tab[Store]))
However, using this approach has a serious drawback. If we make a table displaying sales by store and by product then the first measure will work and the second will fail. Moreover, if we display sales by product only the first measure will still work, and the second should be retyped to HASONEVALUE(Tab[Product]).
So I want the measure to be resistant to any change of granularity caused by adding a new dimension to the table visual.
Based on the information you provided in the comments, it sounds like you have a page- or report level filter. In that case, you can't rely on functions such as ISFILTERED(...) or ISCROSSFILTERED(...), as these external filters or slicers could impact the result returned from these two functions.
So you have to either stick with your approach (perhaps changing COUNT(...) to COUNTROWS(Tab) could improve the performance slightly), or write something like
ISINSCOPE('Tab'[Store]) || ISINSCOPE('Tab'[Product]) || etc...
where you repeat ISINSCOPE for every column that could potentially be used to slice the data, as ISINSCOPE is the only function that distinguishes using a column on a filter/slicer vs. using it as a row/column grouping on a table/matrix visual.

AWS IoT Analytics Delta Window

I am having real problems getting the AWS IoT Analytics Delta Window (docs) to work.
I am trying to set it up so that every day a query is run to get the last 1 hour of data only. According to the docs the schedule feature can be used to run the query using a cron expression (in my case every hour) and the delta window should restrict my query to only include records that are in the specified time window (in my case the last hour).
The SQL query I am running is simply SELECT * FROM dev_iot_analytics_datastore and if I don't include any delta window I get the records as expected. Unfortunately when I include a delta expression I get nothing (ever). I left the data accumulating for about 10 days now so there are a couple of million records in the database. Given that I was unsure what the optimal format would be I have included the following temporal fields in the entries:
datetime : 2019-05-15T01:29:26.509
(A string formatted using ISO Local Date Time)
timestamp_sec : 1557883766
(A unix epoch expressed in seconds)
timestamp_milli : 1557883766509
(A unix epoch expressed in milliseconds)
There is also a value automatically added by AWS called __dt, which uses the same format as my datetime except that it seems to be accurate only to within 1 day, i.e. all values entered within a given day have the same value (e.g. 2019-05-15 00:00:00.00).
I have tried a range of expressions (including the suggested AWS expression) from both standard SQL and Presto as I'm not sure which one is being used for this query. I know they use a subset of Presto for the analytics so it makes sense that they would use it for the delta but the docs simply say '... any valid SQL expression'.
Expressions I have tried so far with no luck:
from_unixtime(timestamp_sec)
from_unixtime(timestamp_milli)
cast(from_unixtime(unixtime_sec) as date)
cast(from_unixtime(unixtime_milli) as date)
date_format(from_unixtime(timestamp_sec), '%Y-%m-%dT%h:%i:%s')
date_format(from_unixtime(timestamp_milli), '%Y-%m-%dT%h:%i:%s')
from_iso8601_timestamp(datetime)
What are the offset and time expression parameters that you are using?
Since delta windows are effectively filters inserted into your SQL, you can troubleshoot them by manually inserting the filter expression into your data set's query.
Namely, applying a delta window filter with -3 minute (negative) offset and 'from_unixtime(my_timestamp)' time expression to a 'SELECT my_field FROM my_datastore' query translates to an equivalent query:
SELECT my_field FROM
(SELECT * FROM "my_datastore" WHERE
(__dt between date_trunc('day', iota_latest_succeeded_schedule_time() - interval '1' day)
and date_trunc('day', iota_current_schedule_time() + interval '1' day)) AND
iota_latest_succeeded_schedule_time() - interval '3' minute < from_unixtime(my_timestamp) AND
from_unixtime(my_timestamp) <= iota_current_schedule_time() - interval '3' minute)
Try using a similar query (with no delta time filter) with the correct values for offset and time expression and see what you get. The (__dt between ...) clause is just an optimization for limiting the scanned partitions. You can remove it for the purposes of troubleshooting.
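The window arithmetic in the translated query can be checked offline with a short Python sketch. This is not an AWS API: `delta_window` and `in_window` are hypothetical helpers, and the sketch assumes the bounds behave exactly as in the equivalent query above (lower bound exclusive, upper bound inclusive, negative offset added to both schedule times).

```python
import datetime


def delta_window(latest_succeeded, current, offset):
    """Return (lower, upper) bounds of the delta window.

    `offset` is a timedelta, negative for a window shifted into the past,
    mirroring the `- interval '3' minute` terms in the query above.
    """
    return latest_succeeded + offset, current + offset


def in_window(ts, window):
    """True if lower < ts <= upper, matching the query's comparison operators."""
    lower, upper = window
    return lower < ts <= upper


# Assumed schedule times for an hourly run, chosen for illustration only.
latest = datetime.datetime(2019, 5, 15, 1, 0)
current = datetime.datetime(2019, 5, 15, 2, 0)
window = delta_window(latest, current, datetime.timedelta(minutes=-3))

record_time = datetime.datetime.fromisoformat("2019-05-15T01:29:26.509")
print(window, in_window(record_time, window))
```

Running a few known record timestamps through a check like this can help confirm whether the chosen offset and time expression should match any rows at all.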
Please try the following:
Set query to SELECT * FROM dev_iot_analytics_datastore
Data selection filter:
Data selection window: Delta time
Offset: -1 Hours
Timestamp expression: from_unixtime(timestamp_sec)
Wait for dataset content to run for a bit, say 15 minutes or more.
Check contents
After several weeks of testing and trying all the suggestions in this post along with many more, it appears that the extremely technical answer was to 'switch it off and back on'. I deleted the whole analytics stack and rebuilt everything with different names, and it now seems to be working!
It's important to note that, even though I have flagged this as the correct answer because it was the actual resolution, both the answers provided by @Populus and @Roger would have been correct had my deployment been functioning as expected.
I found by chance that changing SELECT * FROM datastore to SELECT id1, id2, ... FROM datastore solved the problem.

Reducing query time in table with unsorted timeranges

I had a question regarding this matter some days ago, but I'm still wondering about how to tune my performance on this query.
I have a table looking like this (SQLite)
CREATE TABLE ZONEDATA (
TIME INTEGER NOT NULL,
CITY INTEGER NOT NULL,
ZONE INTEGER NOT NULL,
TEMPERATURE DOUBLE,
SERIAL INTEGER ,
FOREIGN KEY (SERIAL) REFERENCES ZONES,
PRIMARY KEY ( TIME, CITY, ZONE));
I'm running a query like this:
SELECT temperature, time, city, zone from zonedata
WHERE (city = 1) and (zone = 1) and (time BETWEEN x AND y);
x and y are variables which may have several hundred thousand records between them.
temperature ranges from -10.0 to 10.0, city and zone from 0-20 (in this case it is 1 and 2, but can be something else). Records are logged continuously at intervals of about 5-6 seconds from different zones and cities. This creates a lot of data, and the records are not necessarily logged in the correct time order.
The question is how I can optimize retrieval of records in a big time range (where records are not sorted 100% correctly by time). This can take a lot of time, especially when I'm retrieving from several cities and zones. That means running the mentioned query with different parameters several times. What I'm looking for is specific changes to the query, table structure (preferably not) or other changeable settings.
My application using this is btw implemented in c++.
Your data already is sorted by Time.
By having a Primary Key on (Time, City, Zone) all the records with that same Time value will be next to each other. (Unless you have specified a CLUSTER INDEX elsewhere, though I'm not familiar enough with SQLite to know if that's possible.)
In your particular case, however, that means the records that you want are not next to each other. Instead they're in bunches. Each bunch of records will have (city=1, zone=1) and have the same Time value. One bunch for Time1, another bunch for Time2, etc, etc.
It's like putting it all in Excel and ordering by Time, then by City, then by Zone.
To bunch ALL the records you want (for the same City and Zone) change that to (City, Zone, Time).
Note, however, that if you also have a query for all cities and zones but with time = ???, the key I suggested won't be ideal for that; your original key would be better.
For that reason you may wish/need to add different indexes in different orders, for different queries.
This means that to give you a specific recommended solution we need to know the specific query you will be running. My suggested key/index order may be ideal for your simplified example, but the real-life scenario may be different enough to warrant a different index altogether.
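A minimal sketch of the suggested column order, using Python's built-in sqlite3 module so it can be tried without touching the real database. The index name `idx_city_zone_time` is hypothetical, the foreign key from the question is left out to keep the example self-contained, and an extra index is added rather than changing the primary key, which is one possible way to apply the advice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ZONEDATA (
        TIME INTEGER NOT NULL,
        CITY INTEGER NOT NULL,
        ZONE INTEGER NOT NULL,
        TEMPERATURE DOUBLE,
        SERIAL INTEGER,
        PRIMARY KEY (TIME, CITY, ZONE))
""")
# Secondary index in (City, Zone, Time) order, as suggested above.
conn.execute("CREATE INDEX idx_city_zone_time ON ZONEDATA (CITY, ZONE, TIME)")

# Small synthetic data set: 100 time steps for 2 cities x 2 zones.
rows = [(t, city, zone, 0.5 * city, None)
        for t in range(100) for city in (1, 2) for zone in (1, 2)]
conn.executemany("INSERT INTO ZONEDATA VALUES (?, ?, ?, ?, ?)", rows)

QUERY = """
    SELECT temperature, time, city, zone FROM zonedata
    WHERE (city = ?) AND (zone = ?) AND (time BETWEEN ? AND ?)
"""
params = (1, 1, 10, 19)

# The last column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY, params)]
result = conn.execute(QUERY, params).fetchall()
print(plan)
print(len(result))
```

The printed plan shows which index SQLite chose for the query; comparing it with and without the extra index on the real data is a cheap way to check whether the reordering helps.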
You can index those columns. The database will sort them internally for faster queries, but you will not see it.
For a database, BETWEEN is hard to optimize. One way out of this is adding extra fields so you can replace BETWEEN with an =. For example, if you add a day field, you could query for:
where city = 1 and zone = 1 and day = '2012-06-22' and
time between '2012-06-22 08:00' and '2012-06-22 12:00'
This query is relatively fast with an index on city, zone, day.
This requires thought to pick the proper extra fields. It requires additional code to maintain the field. If this query is in an important performance path of your application, it might be worth it.

Simple query working for years, then suddenly very slow

I've had a query that has been running fine for about 2 years. The database table has about 50 million rows, and is growing slowly. This last week one of my queries went from returning almost instantly to taking hours to run.
Rank.objects.filter(site=Site.objects.get(profile__client=client, profile__is_active=False)).latest('id')
I have narrowed the slow query down to the Rank model. It seems to have something to do with using the latest() method. If I just ask for a queryset, it returns an empty queryset right away.
#count returns 0 and is fast
Rank.objects.filter(site=Site.objects.get(profile__client=client, profile__is_active=False)).count() == 0
Rank.objects.filter(site=Site.objects.get(profile__client=client, profile__is_active=False)) == [] #also very fast
Here are the results of running EXPLAIN. http://explain.depesz.com/s/wPh
And EXPLAIN ANALYZE: http://explain.depesz.com/s/ggi
I tried vacuuming the table, no change. There is already an index on the "site" field (ForeignKey).
Strangely, if I run this same query for another client that already has Rank objects associated with her account, then the query returns very quickly once again. So it seems that this is only a problem when there are no Rank objects for that client.
Any ideas?
Versions:
Postgres 9.1,
Django 1.4 svn trunk rev 17047
Well, you've not shown the actual SQL, so that makes it difficult to be sure. But, the explain output suggests it thinks the quickest way to find a match is by scanning an index on "id" backwards until it finds the client in question.
Since you said it has been fast until recently, this is probably not a silly choice. However, there is always the chance that a particular client's record will be right at the far end of this search.
So - try two things first:
Run an analyze on the table in question, see if that gives the planner enough info.
If not, increase the stats (ALTER TABLE ... SET STATISTICS) on the columns in question and re-analyze. See if that does it.
http://www.postgresql.org/docs/9.1/static/planner-stats.html
If that's still not helping, then consider an index on (client,id), and drop the index on id (if not needed elsewhere). That should give you lightning fast answers.
latest() is normally used for date comparison; maybe you should try ordering by id descending and then limiting to one.