PostgreSQL ArrayField vs ForeignKey: which one is more performant? - django

I want to design a phone book application in which each contact can have multiple numbers. There are two DB designs:
Use a contact foreign key in each number.
Store the numbers in an ArrayField inside each contact.
Which solution is more performant in production, and why?
Thanks in advance.

If you build a GIN index on the array column, and Django writes its queries in such a way that they can use that index, then read performance will be quite similar between the two.
It is very unlikely that the performance difference should be the driving factor behind this choice. For example, do you need more information about a phone number than just the number itself, such as when it was added, when it was last used, or whether it is a mobile phone or something else?
The array column should be faster because it only has to consult one index and table, rather than two of each. Also, it will be more compact, and so more cacheable.
On the other hand, the statistical estimates for your array column will have a problem when estimating rare values, which you are likely to have here, as no phone number is likely to be shared by a large number of people. This misestimate could have devastating results on your query performance. For example, in a little test, overestimating the number of rows by many thousandfold caused it to launch parallel workers for a single-row query, making it about 20 times slower than with parallelization turned off, and 10 times slower than the foreign-key representation, which doesn't suffer from the estimation problem.
For example:
create table contact as select md5(floor(random()*50000000)::text) as name, array_agg(floor(random()*100000000)::int) phones from generate_series(1,100000000) f(x) group by name;
vacuum analyze contact;
create index on contact using gin (phones );
explain analyze select * from contact where phones @> ARRAY[123456];
                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=3023.30..605045.19 rows=216167 width=63) (actual time=0.668..8.071 rows=2 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Bitmap Heap Scan on contact  (cost=2023.30..582428.49 rows=90070 width=63) (actual time=0.106..0.110 rows=1 loops=3)
         Recheck Cond: (phones @> '{123456}'::integer[])
         Heap Blocks: exact=2
         ->  Bitmap Index Scan on contact_phones_idx  (cost=0.00..1969.25 rows=216167 width=0) (actual time=0.252..0.252 rows=2 loops=1)
               Index Cond: (phones @> '{123456}'::integer[])
 Planning Time: 0.820 ms
 Execution Time: 8.137 ms
You can see that it estimates there will be 216167 rows, but in fact there are only 2. (For convenience, I used ints rather than the text field you would probably use for phone numbers, but this doesn't change anything fundamental.)
If this is really vital to you, then you should do the test and see, using your own data and your own architecture. It will depend on what does and does not fit in memory, what kinds of queries you are doing (do you ever look up numbers in bulk? Join them to other tables besides the immediately discussed foreign key?), and maybe how your driver/library handles columns/parameters with array types.
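As a rough illustration of the foreign-key design with extra per-number attributes, here is a minimal sketch. It uses Python's built-in sqlite3 module rather than PostgreSQL or Django, purely so that it runs on its own; the table names, the `kind` and `added_at` columns, and the sample data are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contact (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    -- One row per number, so per-number attributes have a natural home.
    CREATE TABLE phone (
        id         INTEGER PRIMARY KEY,
        contact_id INTEGER NOT NULL REFERENCES contact(id),
        number     TEXT NOT NULL,
        kind       TEXT NOT NULL DEFAULT 'mobile',           -- hypothetical attribute
        added_at   TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP   -- hypothetical attribute
    );
    -- Ordinary B-tree index used when looking a contact up by number.
    CREATE INDEX phone_number_idx ON phone (number);
""")
conn.executemany("INSERT INTO contact (id, name) VALUES (?, ?)",
                 [(1, "Alice"), (2, "Bob")])
conn.executemany(
    "INSERT INTO phone (contact_id, number, kind) VALUES (?, ?, ?)",
    [(1, "555-0100", "mobile"), (1, "555-0101", "work"), (2, "555-0199", "home")],
)

def contacts_for_number(number):
    """Return (contact name, kind) pairs for every contact owning this number."""
    return conn.execute(
        "SELECT c.name, p.kind FROM phone p "
        "JOIN contact c ON c.id = p.contact_id WHERE p.number = ?",
        (number,),
    ).fetchall()

found = contacts_for_number("555-0101")
```

With an array column, the per-number attributes in this sketch would have no obvious place to live, which is the design consideration raised earlier in the answer.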

Related

Application for filtering database for the short period of time

I need to create an application that would allow me to get the phone numbers of users matching specific conditions as fast as possible. For example, we've got 4 columns in a SQL table (region, income, age, and a 4th with the phone number itself). I want to get phone numbers from the table for a specific region and income. Just running a SQL query won't help because it takes a significant amount of time. The database is updated once per day, and I have some time to prepare the data as I wish.
The question is: how would you make the process of getting phone numbers with specific conditions as fast as possible? O(1) in the best scenario. Consider storing values from the SQL table in RAM for the fastest access.
I came up with the following idea:
For each phone number, create something like a bitset: 0 if a particular condition is false and 1 if it is true. But I'm not sure I can implement it for columns with non-boolean values.
Create a vector with phone numbers.
Create a vector with the phone numbers' bitsets.
To get phone numbers, iterate over the 2nd vector and compare each bitset with the required one.
It's not O(1) at all. And I still don't know what to do about non-boolean columns. I thought maybe it's possible to do something good with std::unordered_map (all phone numbers are unique) or improve my idea with vectors and masks.
P.S. The SQL table consumes 4 GB of memory and I can store up to 8 GB in RAM. There are 500 columns.
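One possible way to stretch the bitset idea to non-boolean columns is to flip it around: keep one bitmap per (column, value) pair, with bit i set when row i has that value, and AND the bitmaps of the requested conditions together. The sketch below does this in Python with plain integers as bitmaps. The function names and sample rows are invented for illustration, and a numeric column such as income would presumably have to be bucketed into a small number of values first. Lookup is still not O(1): the cost grows with the number of rows and matches.

```python
from collections import defaultdict

def build_bitmaps(rows, columns):
    """Return {(column, value): bitmap}; bit i is set when rows[i][column] == value."""
    bitmaps = defaultdict(int)
    for i, row in enumerate(rows):
        for column in columns:
            bitmaps[(column, row[column])] |= 1 << i
    return dict(bitmaps)

def matching_phones(rows, bitmaps, **conditions):
    """Return the phone numbers of rows that satisfy every column=value condition."""
    mask = (1 << len(rows)) - 1              # start with every row selected
    for column, value in conditions.items():
        mask &= bitmaps.get((column, value), 0)
    phones = []
    while mask:
        lowest = mask & -mask                # isolate the lowest set bit
        phones.append(rows[lowest.bit_length() - 1]["phone"])
        mask ^= lowest                       # clear it and continue
    return phones

# Made-up sample data: one dict per row of the SQL table.
rows = [
    {"phone": "555-0001", "region": "north", "income": "high", "age": 30},
    {"phone": "555-0002", "region": "south", "income": "high", "age": 41},
    {"phone": "555-0003", "region": "north", "income": "low", "age": 30},
    {"phone": "555-0004", "region": "north", "income": "high", "age": 52},
]
bitmaps = build_bitmaps(rows, ["region", "income", "age"])
result = matching_phones(rows, bitmaps, region="north", income="high")
```

Since the data changes only once a day, the bitmaps could be rebuilt after each update and kept in RAM between queries.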
I want to get phone numbers from the table with specific region and income.
You would create indexes in the database on (region, income). Let the database do the work.
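A small sketch of that suggestion, using Python's built-in sqlite3 module as a stand-in for whatever database is actually in use. The table name `people`, the index name, and the generated data are assumptions made for the example; `EXPLAIN QUERY PLAN` is SQLite-specific, and other engines have their own equivalent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (phone TEXT PRIMARY KEY, region INTEGER, income INTEGER, age INTEGER)"
)
# Made-up data: 1000 rows with a handful of distinct regions and income bands.
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [(f"555-{i:04d}", i % 20, i % 7, 18 + i % 60) for i in range(1000)],
)
# Composite index matching the WHERE clause of the lookup below.
conn.execute("CREATE INDEX people_region_income_idx ON people (region, income)")

query = "SELECT phone FROM people WHERE region = ? AND income = ?"

# Ask the planner how it intends to run the query; the last column is the description.
plan = " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, (3, 3)))
phones = [row[0] for row in conn.execute(query, (3, 3))]
```

Inspecting `plan` is a quick way to confirm that the lookup goes through the composite index instead of scanning the whole table.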
If you really want it to be fast, I think you should consider Elasticsearch. Think of every phone in the DB as a document with properties (your columns).
You will need to reindex the table once a day (or in real time), but when it's time to search you just use Elasticsearch's filters to find the results.
Another option is to have an index on every column. In this case the engine will do an Index Merge to increase performance. I would also consider using MEMORY tables. If you write to this table, consider having a read replica just for reads.
To optimize your table, log your queries somewhere and add multi-column indexes just for the top X most popular searches, depending on your memory limitations.
You can use NVMe as your DB disk (if you can't load it into memory).

Is there any real sense in uniform distributed partition keys for small applications using DynamoDB?

The Amazon DynamoDB documentation stresses that uniform distribution of partition keys is the most important point in designing a correct DB architecture.
On the other hand, when it comes to real numbers, you may find that your app will never grow beyond one partition. That is, according to the docs:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.Partitions
partition calculation formula is
( readCapacityUnits / 3,000 ) + ( writeCapacityUnits / 1,000 ) = initialPartitions (rounded up)
So you need a demand of more than 1,000 writes per second (for 1 KB items) to grow beyond one partition. But according to my calculations, most small applications don't even need the default 5 writes per second; 1 is enough. (To be precise, you can also grow beyond one partition if your data exceeds 10 GB, but that's also a big number.)
The question becomes more important when you realize that creating any additional index requires allocating additional writes per second.
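The arithmetic can be checked with a few lines of Python. This is only a sketch of the formula quoted above from the documentation; the function name is made up, and clamping the result to at least one partition is an assumption of the example.

```python
import math

def initial_partitions(read_capacity_units, write_capacity_units):
    """Apply (RCU / 3,000) + (WCU / 1,000), rounded up, as quoted from the docs."""
    raw = read_capacity_units / 3000 + write_capacity_units / 1000
    # Assumption: a table always has at least one partition.
    return max(1, math.ceil(raw))

small_app = initial_partitions(5, 5)          # default-sized small application
write_heavy = initial_partitions(0, 1001)     # just over 1,000 writes per second
```

With the default 5 read and 5 write units the raw value is well below 1, so the table stays in a single partition; the write side alone only pushes it to a second partition once it passes 1,000 units.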
Just imagine, I have some data related to particular user, for example, "posts".
I create a "posts" data table and then, following the Amazon guidelines, I choose the following key format:
partition: id, // post id like uuid
sort: // don't need it
Since no two posts have the same id, we don't need a sort key here. But then you realize that the most common operation is requesting the list of posts for a particular user. So you need to create a secondary index like:
partition: userId,
sort: id // post id
But every secondary index requires additional read/write units, so the cost of such a decision is doubled!
On the other hand, keeping in mind that you have only one partition, you could simply have this primary key:
partition: userId
sort: id // post id
That works fine for your purposes and doesn't double your cost.
So the question is: have I missed something? Maybe a partition key is much more efficient than a sort key, even inside one partition?
Addition: you may say, "OK, having userId as the partition key for posts is fine for now, but when you have 100,000 users in your app you'll run into trouble with scaling." But in reality the trouble can only arise in some "transition" case, when you have only a few partitions, with a group of active users' posts all in one partition and inactive ones in another. If you have thousands of users, it's natural that many of them have active posts, the impact of any one user is negligible, and statistically their posts are evenly distributed across many partitions due to the large numbers.
I think it's absolutely fine as long as you make sure you won't exceed the partition limits, either by increasing RCU/WCU or through growth of your data. Moreover, the best practices guide says:
If the table will fit entirely into a single partition (taking into consideration growth of your data over time), and if your application's read and write throughput requirements do not exceed the read and write capabilities of a single partition, then your application should not encounter any unexpected throttling as a result of partitioning.

Best way to store mappings in a database

Suppose I have an employees table (with around a million employees) and a tasks table (with a few hundred tasks).
Now, I have a mechanism to predict how probable (as a percentage) it is that an employee will complete a task. Let's say I have four such mechanisms, and each mechanism outputs its own probability.
Putting it all together, I now have n1 (employees) times n2 (tasks) times n3 (mechanisms) results to store.
I was wondering what would be the best way to store these results.
I have a few options and thoughts:
Maintain a column (JSONField) in either the employees or the tasks table -- Concern: I have to update the whole column's data if one of the values changes
Maintain a third table, predictions, with foreign keys to employee and task and a column to store the predicted_probability -- Concern: I will have to store n1 * n2 * n3 records, and I'm worried about scalability and performance
Thanks for any help.
PS: I'm using Django with postgres
The predictions table is the correct way to go. Depending on how you access the data, the size of the table won't matter. e.g. I would expect that reading the prediction for a single employee has a pretty constant performance. Large tables tend to be a problem only when you need to process all (or a large fraction) of the rows. If you hit a performance problem once you test this, you could e.g. partition that table by task or by task and mechanism (depending on how your queries are structured)
-Credits to @a_horse_with_no_name
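A minimal sketch of such a predictions table follows. It uses Python's built-in sqlite3 module instead of Postgres and Django so that it runs on its own. The column names other than predicted_probability, the integer `mechanism` column, and the composite primary key are assumptions of the example, not something stated in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE task     (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    -- One row per (employee, task, mechanism) combination.
    CREATE TABLE prediction (
        employee_id           INTEGER NOT NULL REFERENCES employee(id),
        task_id               INTEGER NOT NULL REFERENCES task(id),
        mechanism             INTEGER NOT NULL,
        predicted_probability REAL NOT NULL,
        PRIMARY KEY (employee_id, task_id, mechanism)
    );
""")
conn.executemany("INSERT INTO employee VALUES (?, ?)", [(1, "Ann"), (2, "Raj")])
conn.executemany("INSERT INTO task VALUES (?, ?)", [(10, "Audit"), (11, "Deploy")])
conn.executemany(
    "INSERT INTO prediction VALUES (?, ?, ?, ?)",
    [(1, 10, 1, 0.25), (1, 10, 2, 0.5), (1, 11, 1, 0.75), (2, 10, 1, 0.5)],
)

# Changing one value touches exactly one row, unlike rewriting a whole JSON column.
conn.execute(
    "UPDATE prediction SET predicted_probability = ? "
    "WHERE employee_id = ? AND task_id = ? AND mechanism = ?",
    (1.0, 1, 10, 2),
)

def predictions_for_employee(employee_id):
    """Return (task_id, mechanism, probability) rows for a single employee."""
    return conn.execute(
        "SELECT task_id, mechanism, predicted_probability FROM prediction "
        "WHERE employee_id = ? ORDER BY task_id, mechanism",
        (employee_id,),
    ).fetchall()

ann = predictions_for_employee(1)
```

Because the key in this sketch starts with employee_id, reading the predictions for one employee is an index range lookup whose cost depends on that employee's rows, not on the total size of the table.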

Reducing query time in table with unsorted timeranges

I had a question regarding this matter some days ago, but I'm still wondering about how to tune my performance on this query.
I have a table looking like this (SQLite)
CREATE TABLE ZONEDATA (
TIME INTEGER NOT NULL,
CITY INTEGER NOT NULL,
ZONE INTEGER NOT NULL,
TEMPERATURE DOUBLE,
SERIAL INTEGER ,
FOREIGN KEY (SERIAL) REFERENCES ZONES,
PRIMARY KEY ( TIME, CITY, ZONE));
I'm running a query like this:
SELECT temperature, time, city, zone from zonedata
WHERE (city = 1) and (zone = 1) and (time BETWEEN x AND y);
x and y are variables which may have several hundred thousand values between them.
temperature ranges from -10.0 to 10.0, city and zone from 0-20 (in this case it is 1 and 2, but can be something else). Records are logged continuously at intervals of about 5-6 seconds from different zones and cities. This creates a lot of data, and records are not necessarily logged in correct time order.
The question is how I can optimize retrieval of records in a big time range (where records are not sorted 100% correctly by time). This can take a lot of time, especially when I'm retrieving from several cities and zones, which means running the mentioned query several times with different parameters. What I'm looking for is specific changes to the query, table structure (preferably not) or other changeable settings.
The application using this is, by the way, implemented in C++.
Your data already is sorted by Time.
By having a Primary Key on (Time, City, Zone) all the records with that same Time value will be next to each other. (Unless you have specified a CLUSTER INDEX elsewhere, though I'm not familiar enough with SQLite to know if that's possible.)
In your particular case, however, that means the records that you want are not next to each other. Instead they're in bunches. Each bunch of records will have (city=1, zone=1) and have the same Time value. One bunch for Time1, another bunch for Time2, etc, etc.
It's like putting it all in Excel and ordering by Time, then by City, then by Zone.
To bunch ALL the records you want (for the same City and Zone) change that to (City, Zone, Time).
Note, however, that if you also have a query for all cities and zones but a time = ??? the key I suggested won't be perfect for that, your original key would be better.
For that reason you may wish/need to add different indexes in different orders, for different queries.
This means that to give you a specific recommended solution we need to know the specific query you will be running. My suggested key/index order may be ideal for your simplified example, but the real-life scenario may be different enough to warrant a different index altogether.
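Here is a small sketch of the (City, Zone, Time) key order, written with Python's built-in sqlite3 module. The foreign key to ZONES is left out so the example is self-contained, and the inserted rows are made up. `EXPLAIN QUERY PLAN` is used only to look at which constraints the planner applies through the key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ZONEDATA (
        TIME        INTEGER NOT NULL,
        CITY        INTEGER NOT NULL,
        ZONE        INTEGER NOT NULL,
        TEMPERATURE DOUBLE,
        SERIAL      INTEGER,
        PRIMARY KEY (CITY, ZONE, TIME)   -- reordered from (TIME, CITY, ZONE)
    )
""")
# Made-up data: 100 time steps for each of 3 cities x 3 zones.
conn.executemany(
    "INSERT INTO ZONEDATA (TIME, CITY, ZONE, TEMPERATURE) VALUES (?, ?, ?, ?)",
    [(t, c, z, 0.5) for t in range(100) for c in range(3) for z in range(3)],
)

query = (
    "SELECT temperature, time, city, zone FROM zonedata "
    "WHERE city = ? AND zone = ? AND time BETWEEN ? AND ?"
)
params = (1, 1, 10, 19)

# The last column of each EXPLAIN QUERY PLAN row describes the chosen strategy.
plan = " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + query, params))
rows = conn.execute(query, params).fetchall()
```

With this order, the rows for one city and zone sit next to each other in the key, ordered by time, so the time range becomes one contiguous slice.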
You can index those columns; the database will sort them internally for faster queries, but you will not see it.
For a database, BETWEEN is hard to optimize. One way out of this is adding extra fields so you can replace BETWEEN with an =. For example, if you add a day field, you could query for:
where city = 1 and zone = 1 and day = '2012-06-22' and
time between '2012-06-22 08:00' and '2012-06-22 12:00'
This query is relatively fast with an index on city, zone, day.
This requires thought to pick the proper extra fields. It requires additional code to maintain the field. If this query is in an important performance path of your application, it might be worth it.

JPA 2.0: Batching queries with IN clause

I am looking for a strategy to batch all my queries (with IN clause) to overcome the restrictions by databases on IN clause (See here).
I usually get list of size 100000 to 305000. So, this has become very important to tackle.
I have tried two strategies so far.
Strategy 1:
Create an entity and hence a table with one column to hold such values (can we create temp tables on the fly with JPA 2.0 vendor-independent?) and use the data from the temp table as a subquery to the original query before eventually cleaning up the temp table.
Advantage: Very performant queries. Really quick, I must admit for the numbers I have mentioned, it was mostly under a minute.
Possible drawback: Use of temp table which is actually a permanent one in my case thus far.
Strategy 2:
Calculate the batch size for the given input list and for each batch execute the query and accumulate the result.
Advantage: No temp tables. Easy for any threads within the same transaction.
Disadvantage: A big disadvantage is amount of time it takes to execute all the batches. For the mentioned numbers, this is at an unacceptable level at the moment. Takes anything between 5 to 15 mins!
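To make the two strategies concrete, here is a rough sketch in Python with the built-in sqlite3 module, not JPA. The `item` table, the `wanted_ids` temp table, the function names, and the batch size of 500 are all invented for the example, and real limits on IN lists and bind parameters differ between databases.

```python
import sqlite3

def chunks(values, size):
    """Yield consecutive slices of at most `size` elements."""
    for start in range(0, len(values), size):
        yield values[start:start + size]

def fetch_in_batches(conn, ids, batch_size=500):
    """Strategy 2: one IN query per batch, accumulating the results."""
    results = []
    for batch in chunks(ids, batch_size):
        placeholders = ",".join("?" * len(batch))
        sql = f"SELECT id, name FROM item WHERE id IN ({placeholders})"
        results.extend(conn.execute(sql, batch).fetchall())
    return results

def fetch_via_temp_table(conn, ids):
    """Strategy 1: load the ids into a temp table and join against it."""
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS wanted_ids (id INTEGER PRIMARY KEY)")
    conn.execute("DELETE FROM wanted_ids")
    conn.executemany("INSERT OR IGNORE INTO wanted_ids (id) VALUES (?)",
                     ((i,) for i in ids))
    rows = conn.execute(
        "SELECT item.id, item.name FROM item JOIN wanted_ids w ON w.id = item.id"
    ).fetchall()
    conn.execute("DROP TABLE wanted_ids")   # clean up after the query
    return rows

# Made-up data to exercise both strategies.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO item VALUES (?, ?)",
                 ((i, f"item-{i}") for i in range(5000)))
ids = list(range(0, 5000, 2))

batched = fetch_in_batches(conn, ids)
joined = fetch_via_temp_table(conn, ids)
```

Both functions return the same rows. The batched version pays for one round trip per batch, which is where the time goes as the list grows; the temp-table version pays once for loading the ids and once for the join.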
I would appreciate any feedback, suggestions or improvements from all you JPA gurus.
Thanks.
I only tested up to 50,000 integers but I have some pretty good performance data around splitting large lists using various methods, with CLR and a numbers table leading the pack at the higher end:
Splitting a list of integers : another roundup
Not sure if you are using integers or strings but the results should be roughly equivalent.
As an aside, I'll confess I have no idea what JPA 2.0 is, but I assume you can control the format of the lists that it sends to SQL Server.