Postgres db index not being used on Heroku - django

I'm trying to debug a slow query for a model that looks like:
class Employee(TimeStampMixin):
    title = models.TextField(blank=True, db_index=True)
    seniority = models.CharField(blank=True, max_length=128, db_index=True)
The query is:
Employee.objects.exclude(seniority='').filter(title__icontains=title).order_by('seniority').values_list('seniority')
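Judging by the filter shown in the plans below, the SQL this queryset boils down to is roughly the following (a sketch for reference only; the exact statement Django emits, including any LIMIT added by slicing, may differ):
SELECT seniority
FROM companies_employee
WHERE seniority <> ''
  AND UPPER(title) LIKE '%INFORMATION SPECIALIST%'
ORDER BY seniority;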
When I run it locally it takes ~0.3 seconds (same database size). An explain locally shows:
Limit (cost=1000.58..196218.23 rows=7 width=1) (actual time=299.016..300.366 rows=1 loops=1)
Output: seniority
Buffers: shared hit=2447163 read=23669
-> Gather Merge (cost=1000.58..196218.23 rows=7 width=1) (actual time=299.015..300.364 rows=1 loops=1)
Output: seniority
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=2447163 read=23669
-> Parallel Index Only Scan using companies_e_seniori_12ac68_idx on public.companies_employee (cost=0.56..195217.40 rows=3 width=1) (actual time=293.195..293.200 rows=0 loops=3)
Output: seniority
Filter: (((companies_employee.seniority)::text <> ''::text) AND (upper(companies_employee.title) ~~ '%INFORMATION SPECIALIST%'::text))
Rows Removed by Filter: 2697599
Heap Fetches: 2819
Buffers: shared hit=2447163 read=23669
Worker 0: actual time=291.087..291.088 rows=0 loops=1
Buffers: shared hit=820222 read=7926
Worker 1: actual time=291.056..291.056 rows=0 loops=1
Buffers: shared hit=812538 read=7888
Planning Time: 0.209 ms
Execution Time: 300.400 ms
However, when I run the same code on Heroku I get execution times of 3s+, possibly because the local query uses an index while the Heroku one does not:
Limit (cost=216982.74..216983.39 rows=6 width=1) (actual time=988.738..1018.964 rows=1 loops=1)
Output: seniority
Buffers: shared hit=199527 dirtied=5
-> Gather Merge (cost=216982.74..216983.39 rows=6 width=1) (actual time=980.932..1011.157 rows=1 loops=1)
Output: seniority
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=199527 dirtied=5
-> Sort (cost=215982.74..215982.74 rows=3 width=1) (actual time=959.233..959.234 rows=0 loops=3)
Output: seniority
Sort Key: companies_employee.seniority
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=199527 dirtied=5
Worker 0: actual time=957.414..957.414 rows=0 loops=1
Sort Method: quicksort Memory: 25kB
JIT:
Functions: 4
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 1.179 ms, Inlining 0.000 ms, Optimization 0.879 ms, Emission 9.714 ms, Total 11.771 ms
Buffers: shared hit=54855 dirtied=2
Worker 1: actual time=939.591..939.592 rows=0 loops=1
Sort Method: quicksort Memory: 25kB
JIT:
Functions: 4
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 0.741 ms, Inlining 0.000 ms, Optimization 0.654 ms, Emission 6.531 ms, Total 7.926 ms
Buffers: shared hit=87867 dirtied=1
-> Parallel Seq Scan on public.companies_employee (cost=0.00..215982.73 rows=3 width=1) (actual time=705.244..959.146 rows=0 loops=3)
Output: seniority
Filter: (((companies_employee.seniority)::text <> ''::text) AND (upper(companies_employee.title) ~~ '%INFORMATION SPECIALIST%'::text))
Rows Removed by Filter: 2939330
Buffers: shared hit=199449 dirtied=5
Worker 0: actual time=957.262..957.262 rows=0 loops=1
Buffers: shared hit=54816 dirtied=2
Worker 1: actual time=939.491..939.491 rows=0 loops=1
Buffers: shared hit=87828 dirtied=1
Query Identifier: 2827140323627869732
Planning:
Buffers: shared hit=293 read=1 dirtied=1
I/O Timings: read=0.021
Planning Time: 1.078 ms
JIT:
Functions: 13
Options: Inlining false, Optimization false, Expressions true, Deforming true
Timing: Generation 2.746 ms, Inlining 0.000 ms, Optimization 2.224 ms, Emission 23.189 ms, Total 28.160 ms
Execution Time: 1050.493 ms
I confirmed that the indexes are identical in my local database and on Heroku; this is what they are:
indexname | indexdef
----------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------
companies_employee_pkey | CREATE UNIQUE INDEX companies_employee_pkey ON public.companies_employee USING btree (id)
companies_employee_company_id_c24081a8 | CREATE INDEX companies_employee_company_id_c24081a8 ON public.companies_employee USING btree (company_id)
companies_employee_person_id_936e5c6a | CREATE INDEX companies_employee_person_id_936e5c6a ON public.companies_employee USING btree (person_id)
companies_employee_role_8772f722 | CREATE INDEX companies_employee_role_8772f722 ON public.companies_employee USING btree (role)
companies_employee_role_8772f722_like | CREATE INDEX companies_employee_role_8772f722_like ON public.companies_employee USING btree (role text_pattern_ops)
companies_employee_seniority_b10393ff | CREATE INDEX companies_employee_seniority_b10393ff ON public.companies_employee USING btree (seniority)
companies_employee_seniority_b10393ff_like | CREATE INDEX companies_employee_seniority_b10393ff_like ON public.companies_employee USING btree (seniority varchar_pattern_ops)
companies_employee_title_78009330 | CREATE INDEX companies_employee_title_78009330 ON public.companies_employee USING btree (title)
companies_employee_title_78009330_like | CREATE INDEX companies_employee_title_78009330_like ON public.companies_employee USING btree (title text_pattern_ops)
companies_employee_institution_75d6c7e9 | CREATE INDEX companies_employee_institution_75d6c7e9 ON public.companies_employee USING btree (institution)
companies_employee_institution_75d6c7e9_like | CREATE INDEX companies_employee_institution_75d6c7e9_like ON public.companies_employee USING btree (institution text_pattern_ops)
companies_e_seniori_12ac68_idx | CREATE INDEX companies_e_seniori_12ac68_idx ON public.companies_employee USING btree (seniority, title)
title_seniority | CREATE INDEX title_seniority ON public.companies_employee USING btree (upper(title), seniority)
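One way to tell whether the Heroku planner is merely pricing the index scan higher than a sequential scan (rather than being unable to use the index at all) is to refresh the table statistics and, in a single diagnostic session, disable sequential scans before re-running the EXPLAIN. A sketch using standard Postgres commands, nothing Heroku-specific:
-- Refresh planner statistics in case they are stale on the Heroku database
ANALYZE companies_employee;
-- Diagnosis only: discourage sequential scans for this session, then compare the plans
SET enable_seqscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT seniority
FROM companies_employee
WHERE seniority <> ''
  AND UPPER(title) LIKE '%INFORMATION SPECIALIST%'
ORDER BY seniority;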

Related

show_chunks() does not correspond to what DELETE complains about

This is really puzzling. I need to delete one date's worth of rows from a hypertable in TimescaleDB 1.7:
DELETE FROM raw WHERE tm::date = '2020-11-06' -- the local date style is YYYY-MM-DD
Before doing that, I check which chunks I need to decompress, giving it a one-day margin, and receive two chunks:
SELECT show_chunks('raw', newer_than => '2020-11-05 00:00'::timestamp)
---
Result:
"_timescaledb_internal._hyper_1_19_chunk"
"_timescaledb_internal._hyper_1_21_chunk"
So I decompress these two. However, when I run the DELETE command above, I still get an error about a totally different chunk:
ERROR: cannot update/delete rows from chunk "_hyper_1_1_chunk" as it is compressed
SQL state: XX000
BTW, this chunk is empty as far as I can see by looking at it in pgAdmin. Any idea what's going on? It looks like a bug to me, but maybe I'm doing something wrong?
Thanks!
Edit:
Below is an excerpt from the result of EXPLAIN DELETE, as requested by @k_rus:
EXPLAIN DELETE FROM raw WHERE tm::date = '2020-11-06'
Result:
"Delete on raw (cost=0.00..719.63 rows=147 width=6)"
" Delete on raw"
" Delete on _hyper_1_1_chunk"
" Delete on _hyper_1_2_chunk"
...
" Delete on _hyper_1_22_chunk"
" -> Seq Scan on raw (cost=0.00..0.00 rows=1 width=6)"
" Filter: ((tm)::date = '2020-11-06'::date)"
" -> Custom Scan (CompressChunkDml) on _hyper_1_1_chunk (cost=0.00..27.40 rows=6 width=6)"
" -> Seq Scan on _hyper_1_1_chunk (cost=0.00..27.40 rows=6 width=6)"
" Filter: ((tm)::date = '2020-11-06'::date)"
" -> Custom Scan (CompressChunkDml) on _hyper_1_2_chunk (cost=0.00..27.40 rows=6 width=6)"
" -> Seq Scan on _hyper_1_2_chunk (cost=0.00..27.40 rows=6 width=6)"
" Filter: ((tm)::date = '2020-11-06'::date)"
...
" -> Custom Scan (CompressChunkDml) on _hyper_1_22_chunk (cost=0.00..27.40 rows=6 width=6)"
" -> Seq Scan on _hyper_1_22_chunk (cost=0.00..27.40 rows=6 width=6)"
" Filter: ((tm)::date = '2020-11-06'::date)"
Thank you for providing the explain. It shows that the DELETE statement is planned to touch every chunk of the hypertable, and only at runtime will the execution of the DELETE realise that there is nothing to delete in many of them.
Since some chunks are compressed, TimescaleDB returns an error for the deletes planned on compressed chunks.
The only way to avoid the error is for the selection condition to trigger chunk exclusion at planning time. In the question the selection condition is tm::date = '2020-11-06', which first extracts the date from the column tm and then compares it with the constant. Thus the planner cannot decide whether a chunk can be excluded, and instead pushes the filter down for runtime execution on every chunk.
To resolve this, it is best to have a selection condition that compares the time dimension column with a constant, or with a value that can be calculated at planning time. Assuming tm is the time dimension column of the hypertable raw, I suggest converting the constant date into a timestamp, e.g. '2020-11-06'::timestamp, and leaving the column itself untouched. You will need to specify a range of timestamps that covers all rows belonging to the targeted date.
For example, the DELETE statement can be:
DELETE FROM raw WHERE tm BETWEEN '2020-11-06 00:00' AND '2020-11-06 23:59'
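One caveat with BETWEEN: '2020-11-06 23:59' excludes any rows falling in the final minute of the day. A half-open range covers the whole day and is just as friendly to chunk exclusion, for example:
DELETE FROM raw WHERE tm >= '2020-11-06 00:00' AND tm < '2020-11-07 00:00';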
Answers to questions:
show_chunks() does not correspond to what DELETE complains about
The show_chunks call and the DELETE statement have different conditions and thus cannot be compared directly. show_chunks only shows chunks that cover times newer than the given constant, while the DELETE is planned to check every chunk, so it can complain about any chunk of the hypertable.
BTW this chunk is empty as far as I can see by looking at it in the pgAdmin. Any idea what's going on? Looks like a bug to me, but maybe I'm doing something wrong?
A compressed chunk stores its data in a different internal chunk, so no data can be seen in _hyper_1_1_chunk itself. TimescaleDB assumes that data are read through the hypertable, not directly from the chunks; the hypertable is an abstraction that hides TimescaleDB's implementation details.
Which specific version of 1.7? There was a bug around this, but it should be fixed from 1.7.3 onward: https://github.com/timescale/timescaledb/pull/2092
If you’re on 1.7.3 or later and still seeing this, it’d be best to open an issue on the timescaledb GitHub repo.
You can check your version by connecting with psql and running \dx
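If you prefer a query over the psql meta-command, the installed extension version is also visible in the catalog (standard PostgreSQL, nothing TimescaleDB-specific):
SELECT extversion FROM pg_extension WHERE extname = 'timescaledb';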

Simple Neptune Gremlin query to perform date comparison degrades due to large join

We have a graph that contains both customer and product vertices. For a given product, we want to find out how many customers who signed up before DATE have purchased this product. My query looks something like:
g.V('PRODUCT_GUID') // get product vertex
.out('product-customer') // get all customers who ever bought this product
.has('created_on', gte(datetime('2020-11-28T00:33:44.536Z'))) // see if the customer was created after a given date
.count() // count the results
This query is incredibly slow, so I looked at the Neptune profiler and saw something odd. Below is the full profiler output. Ignore the elapsed time in the profiler; this was after many attempts at the same query, so the cache is warm. In the wild, it can take 45 seconds or more.
*******************************************************
Neptune Gremlin Profile
*******************************************************
Query String
==================
g.V('PRODUCT_GUID').out('product-customer').has('created_on', gte(datetime('2020-11-28T00:33:44.536Z'))).count()
Original Traversal
==================
[GraphStep(vertex,[PRODUCT_GUID]), VertexStep(OUT,[product-customer],vertex), HasStep([created_on.gte(Sat Nov 28 00:33:44 UTC 2020)]), CountGlobalStep]
Optimized Traversal
===================
Neptune steps:
[
NeptuneCountGlobalStep {
JoinGroupNode {
PatternNode[(?1=<PRODUCT_GUID>, ?5=<product-customer>, ?3, ?6) . project ?1,?3 . IsEdgeIdFilter(?6) .], {estimatedCardinality=30586, expectedTotalOutput=30586, indexTime=0, joinTime=14, numSearches=1, actualTotalOutput=13424}
PatternNode[(?3, <created_on>, ?7, ?) . project ask . CompareFilter(?7 >= Sat Nov 28 00:33:44 UTC 2020^^<DATETIME>) .], {estimatedCardinality=1285574, indexTime=10, joinTime=140, numSearches=13424}
}, annotations={path=[Vertex(?1):GraphStep, Vertex(?3):VertexStep], joinStats=true, optimizationTime=0, maxVarId=8, executionTime=165}
}
]
Physical Pipeline
=================
NeptuneCountGlobalStep
|-- StartOp
|-- JoinGroupOp
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?1=<PRODUCT_GUID>, ?5=<product-customer>, ?3, ?6) . project ?1,?3 . IsEdgeIdFilter(?6) .], {estimatedCardinality=30586, expectedTotalOutput=30586})
|-- SpoolerOp(1000)
|-- DynamicJoinOp(PatternNode[(?3, <created_on>, ?7, ?) . project ask . CompareFilter(?7 >= Sat Nov 28 00:33:44 UTC 2020^^<DATETIME>) .], {estimatedCardinality=1285574})
Runtime (ms)
============
Query Execution: 164.996
Traversal Metrics
=================
Step Count Traversers Time (ms) % Dur
-------------------------------------------------------------------------------------------------------------
NeptuneCountGlobalStep 1 1 164.919 100.00
>TOTAL - - 164.919 -
Predicates
==========
# of predicates: 131
Results
=======
Count: 1
Output: [22]
Index Operations
================
Query execution:
# of statement index ops: 13425
# of unique statement index ops: 13425
Duplication ratio: 1.0
# of terms materialized: 0
In particular
DynamicJoinOp(PatternNode[(?3, <created_on>, ?7, ?) . project ask . CompareFilter(?7 >= Sat Nov 28 00:33:44 UTC 2020^^<DATETIME>) .], {estimatedCardinality=1285574})
This line surprises me. The way I'm reading this is that Neptune is ignoring the vertices coming from ".out('product-customer')" to satisfy the ".has('created_on'...)" requirement, and is instead joining on every single customer vertex that has the created_on attribute.
I would have expected that the cardinality is only the number of customers with an edge from the product, not every single customer.
I'm wondering if there's a way to only run this comparison on the customers coming from the "out('product-customer')" step.
Neptune actually must solve the first pattern,
(?1=<PRODUCT_GUID>, ?5=<product-customer>, ?3, ?6)
before it can solve the second,
(?3, <created_on>, ?7, ?)
Each quad pattern is an indexed lookup bound by at least two fields. So the first lookup uses the SPOG index in Neptune bound by the Subject (the ID) and the Predicate (the edge label). This will return a set of Objects (the vertex IDs for the vertices at the other end of the product-customer edges) and references them via the ?3 variable for the next pattern.
In the next pattern those vertex IDs (?3) are bound with the Predicate (the created_on property key) to evaluate the condition of the date range. Because this is a conditional evaluation, each vertex in the set of ?3 has to be evaluated (the created_on property on each of those vertices has to be read).

Windows Defender increases file write times

On a Windows 10 machine, I seem to be running into substantially increased write times on our cache files.
Below I have included timing operations for our writes with/without Defender's intervention. For this test, we are writing 32KB blocks to a 1GB, pre-allocated cache file, 36000 times.
Here are file write times with Windows Defender enabled (default behavior on machines):
### Manager: [CacheFile] --> 123.524 secs
### Count: 36784. Time: 123524(ms). Average: 3(ms). Max Finished: 218(ms).
### Unfinished: 0. Max Unfinished: 0(ms). Min Unfinished: 0(ms).
### Max Finished Item: [Name: [DirectFileWrite:4294967293]. Pid: 0x00000000000002E8. Tid: 0x00000000000010A0. Data: 0x0000000000000000.].
### Max Unfinished Item: [].
### Min Unfinished Item: [].
### Reporting Time: 0(ms).
And here's the same operations performed when our cache file is added to Windows Defender's exclusion list:
### Manager: [CacheFile] --> 9.194 secs
### Count: 36784. Time: 9194(ms). Average: 0(ms). Max Finished: 126(ms).
### Unfinished: 0. Max Unfinished: 0(ms). Min Unfinished: 0(ms).
### Max Finished Item: [Name: [DirectFileWrite:4294967293]. Pid: 0x00000000000006F4. Tid: 0x000000000000130C. Data: 0x0000000000000000.].
### Max Unfinished Item: [].
### Min Unfinished Item: [].
### Reporting Time: 0(ms).
#########
I'm thinking that Windows Defender is running some sort of check on the entire file (opening and scanning all 1GB of its data) every time we write to it.
Adding the cache files to the exclusion list would be a last resort, so I'm wondering whether anyone has run into issues of a similar nature?
I'm using the Windows C++ API for all I/O operations.

Unable to find custom Hive InputFormat when using `where 1=1`

I'm using Hive and I'm encountering an exception when I'm performing a query with a custom InputFormat.
When I use the query select * from micmiu_blog; Hive works without problems, but if I use select * from micmiu_blog where 1=1; it seems that the framework cannot find my custom InputFormat class.
I have put the JAR file into "hive/lib" and "hadoop/lib", and I have also added "hadoop/lib" to the CLASSPATH. This is the log:
hive> select * from micmiu_blog where 1=1;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1415530028127_0004, Tracking URL = http://hadoop01-master:8088/proxy/application_1415530028127_0004/
Kill Command = /home/hduser/hadoop-2.2.0/bin/hadoop job -kill job_1415530028127_0004
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-11-09 19:53:32,996 Stage-1 map = 0%, reduce = 0%
2014-11-09 19:53:52,010 Stage-1 map = 100%, reduce = 0%
Ended Job = job_1415530028127_0004 with errors
Error during job, obtaining debugging information...
Examining task ID: task_1415530028127_0004_m_000000 (and more) from job job_1415530028127_0004
Task with the most failures(4):
-----
Task ID:
task_1415530028127_0004_m_000000
URL:
http://hadoop01-master:8088/taskdetails.jsp?jobid=job_1415530028127_0004&tipid=task_1415530028127_0004_m_000000
-----
Diagnostic Messages for this Task:
Error: java.io.IOException: cannot find class hiveinput.MyDemoInputFormat
at org.apache.hadoop.hive.ql.io.CombineHiveInputFormat.getRecordReader(CombineHiveInputFormat.java:564)
at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.<init>(MapTask.java:167)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:408)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask
MapReduce Jobs Launched:
Job 0: Map: 1 HDFS Read: 0 HDFS Write: 0 FAIL
Total MapReduce CPU Time Spent: 0 msec
I ran into this problem just now. You should add your JAR file to the classpath from the Hive CLI.
You can do it like this:
hive> add jar /usr/lib/xxx.jar;
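You can then verify that the JAR was actually registered for the current session before re-running the query, for example:
hive> list jars;
hive> select * from micmiu_blog where 1=1;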

Cached data structure design

I've got a C++ program that needs to access the wind data shown below, which is refreshed every 6 hours. As clients of the server need the data, the server queries the database and provides the data to the client. The client will use lat, lon, and mb as keys to find the 5 values.
+------------+-------+-----+-----+----------+----------+-------+------+------+
| id | lat | lon | mb | wind_dir | wind_spd | uv | vv | ts |
+------------+-------+-----+-----+----------+----------+-------+------+------+
| 1769584117 | -90.0 | 0.0 | 100 | 125 | 9 | -3.74 | 2.62 | 2112 |
| 1769584118 | -90.0 | 0.5 | 100 | 125 | 9 | -3.76 | 2.59 | 2112 |
| 1769584119 | -90.0 | 1.0 | 100 | 124 | 9 | -3.78 | 2.56 | 2112 |
Because the data changes so infrequently, I'd like the data to be cached by the server so if a client needs data previously queried, a second SQL query is not necessary.
I'm trying to determine the most efficient in-memory data structure, in terms of storage/speed, but more importantly, ease of access.
My initial thought was a map keyed by lat, containing a map keyed by lon, containing a map keyed by mb for which the value is a map containing the wind_dir, wind_speed, uv, vv and ts fields.
However, that gets complicated fast. Another thought of course is a 3-dimensional array (lat, lon, mb indices) containing a struct of the last 5 fields.
As I'm sitting here, I came up with the thought of combining lat, lon and mb into a string, which could be used as an index into a map, given that I'm 99% sure the combination of lat, lon and mb would always be unique.
What other ideas make sense?
Edit: More detail from comment below
In terms of data, there are 3,119,040 rows in the data set. That will be fairly constant, though it may slowly grow over the years as new reporting stations are added. There are generally between 700 and 1500 clients requesting the data. The clients are flight simulators. They'll be requesting the data every 5 minutes by default, though the maximum possible frequency would be every 30 seconds. There is no additional information; what you see above is the data we want to return.
One final note I forgot to mention: I'm quite rusty in my C++ and especially STL stuff, so the simpler, the better.
You can use a std::map with a three-part key and a suitable less-than operator (this is what Crazy Eddie proposed, extended with a few lines of code):
struct key
{
    double mLat;
    double mLon;
    double mMb;

    key(double lat, double lon, double mb) :
        mLat(lat), mLon(lon), mMb(mb) {}
};

// Strict weak ordering on (lat, lon, mb) so key can be used with std::map
bool operator<(const key& a, const key& b)
{
    return (a.mLat < b.mLat ||
            (a.mLat == b.mLat && a.mLon < b.mLon) ||
            (a.mLat == b.mLat && a.mLon == b.mLon && a.mMb < b.mMb));
}
Defining and inserting into the map would look like:
std::map<key, your_wind_struct> values;
values[key(-90.0, 0.0, 100)] = your_wind_struct(1769584117, 125, ...);
A sorted vector also makes sense; you can feed it a less-than predicate that compares your three-part key. You could do the same with a map or set. A hash... which container you choose depends on a lot of factors.
Another option is the c++11 unordered_set, which uses a hash table instead of red black tree as the internal data structure, and gives (I believe) an amortized lookup time of O(1) vs O(logn) for red-black. Which data structure you use depends on the characteristics of the data in question - how many pieces of data, how often will a particular record likely be accessed, etc. I'm in agreement with several commentors, that using a structure as a key is the cleanest way to go. It also allows you to more simply alter what the unique key is, should that change in the future; you would just need to add a member to your key structure, and not create a whole new level of maps.