Reducing query time in table with unsorted timeranges - c++

I had a question regarding this matter some days ago, but I'm still wondering about how to tune my performance on this query.
I have a table looking like this (SQLite)
CREATE TABLE ZONEDATA (
TIME INTEGER NOT NULL,
CITY INTEGER NOT NULL,
ZONE INTEGER NOT NULL,
TEMPERATURE DOUBLE,
SERIAL INTEGER ,
FOREIGN KEY (SERIAL) REFERENCES ZONES,
PRIMARY KEY ( TIME, CITY, ZONE));
I'm running a query like this:
SELECT temperature, time, city, zone from zonedata
WHERE (city = 1) and (zone = 1) and (time BETWEEN x AND y);
x and y are variables which may have several hundred thousands variables between them.
temperature ranges from -10.0 to 10.0, city and zone from 0-20 (in this case it is 1 and 2, but can be something else). Records are logged continuously with intervals on about 5-6 seconds from different zones and cities. This creates a lot of data, and does not necessarily mean that every record is logged in correct order of time.
The question is how I can optimize retrieval of records in a big time range (where records are not sorted 100% correctly by time). This can take a lot of time, especially when I'm retrieving from several cities and zones. That means running the mentioned query with different parameters several times. What I'm looking for is specific changes to the query, table structure (preferably not) or other changeable settings.
My application using this is btw implemented in c++.

Your data already is sorted by Time.
By having a Primary Key on (Time, City, Zone) all the records with that same Time value will be next to each other. (Unless you have specified a CLUSTER INDEX elsewhere, though I'm not familiar enough with SQLite to know if that's possible.)
In your particular case, however, that means the records that you want are not next to each other. Instead they're in bunches. Each bunch of records will have (city=1, zone=1) and have the same Time value. One bunch for Time1, another bunch for Time2, etc, etc.
It's like putting it all in Excel and ordering by Time, then by City, then by Zone.
To bunch ALL the records you want (for the same City and Zone) change that to (City, Zone, Time).
Note, however, that if you also have a query for all cities and zones but a time = ??? the key I suggested won't be perfect for that, your original key would be better.
For that reason you may wish/need to add different indexes in different orders, for different queries.
This means that to give you a specific recommended solution we need to know the specific query you will be running. My suggested key/index order may be ideal for your simplified example, but the real-life scenario may be different enough to warrant a different index altogether.

You can index those columns, it will sort it internally for faster query but you will not see it.

For a database between is hard to optimize. One way out of this is adding extra fields so you can replace between with an =. For example, if you add a day field, you could query for:
where city = 1 and zone = 1 and day = '2012-06-22' and
time between '2012-06-22 08:00' and '2012-06-22 12:00'
This query is relatively fast with an index on city, zone, day.
This requires thought to pick the proper extra fields. It requires additional code to maintain the field. If this query is in an important performance path of your application, it might be worth it.

Related

Application for filtering database for the short period of time

I need to create an application that would allow me to get phone numbers of users with specific conditions as fast as possible. For example we've got 4 columns in sql table(region, income, age [and 4th with the phone number itself]). I want to get phone numbers from the table with specific region and income. Just make a sql query won't help because it takes significant amount of time. Database updates 1 time per day and I have some time to prepare data as I wish.
The question is: How would you make the process of getting phone numbers with specific conditions as fast as possible. O(1) in the best scenario. Consider storing values from sql table in RAM for the fastest access.
I came up with the following idea:
For each phone number create smth like a bitset. 0 if the particular condition is false and 1 if the condition is true. But I'm not sure I can implement it for columns with not boolean values.
Create a vector with phone numbers.
Create a vector with phone numbers' bitsets.
To get phone numbers - iterate for the 2nd vector and compare bitsets with required one.
It's not O(1) at all. And I still don't know what to do about not boolean columns. I thought maybe it's possible to do something good with std::unordered_map (all phone numbers are unique) or improve my idea with vector and masks.
P.s. SQL table consumes 4GB of memory and I can store up to 8GB in RAM. The're 500 columns.
I want to get phone numbers from the table with specific region and income.
You would create indexes in the database on (region, income). Let the database do the work.
If you really want it to be fast I think you should consider ElasticSearch. Think of every phone in the DB as a doc with properties (your columns).
You will need to reindex the table once a day (or in realtime) but when it's time to search you just use the filter of ElasticSearch to find the results.
Another option is to have an index for every column. In this case the engine will do an Index Merge to increase performance. I would also consider using MEMORY Tables. In case you write to this table - consider having a read replica just for reads.
To optimize your table - save your queries somewhere and add index(for multiple columns) just for the top X popular searches depends on your memory limitations.
You can use use NVME as your DB disk (if you can't load it to memory)

Update in data warehouse fact table

Reading upon many Kimball design tips regarding fact tables (transaction, accumulating, periodic) etc. I'm still vague what should I do with my case of updating a fact table which I believe is not that uncommon. To the case.
We're processing complaints from clients, and we want to be able to reflect current status of complaint in the Data Warehouse. Our complaints have a workflow of statuses they go through, different assignees that deal with them on time, but for our analysis this is irrelevant as of now. We would like to review what the current situation on complaint is.
To my understanding the grain of the fact table would be single complaint, with columns (irrelevant for this question whether it should be junk dimension, degenerate etc) such as:
Complaint Number
Current Status
Current Status Date
Current Assignee
Type of complaint
As far as I understand, since we don't want to view the process history, but instead see what the current status of the process is, storing multiple rows for each complaint representing it's state is an overkill, so instead we store only one row per complaint and update it.
Now, is my reasoning correct to do that? In above case, complaint number and type of complaint store values that don't change, while "Current" columns do and we need to update the row, so we could implement Change Data Capture mechanism (just like we do for dimensions right now) to compare incoming rows from source system for this fact with currently stored fact rows to improve time cost of such operation.
It honestly looks like a Dimension table with mixed SCD Type 0&1 for me, but it stores facts of receiving complaints.
SO Post for reference: Fact table with information that is regularly updatable in source system
Edit
I'm aware that I could use accumulating fact table with time stamps which is somewhat SCD Type 2 alike but the end user doesn't really care about the history of the process. There are more facts involved in the analysis later on, so separating this need from data warehouse doesn't really work in this case.
I’ve encountered similar use cases in the past, where an accumulating snapshot would be the default solution.
However, the accumulating snapshot doesn’t allow processes with varying length. I’ve designed a different pattern, when 2 rows are added for each event: if an object goes from state A to state B you first insert a row with state A and quantity -1, then a new one with state B and quantity +1.
The end result allows:
- no updates necessary, only inserts;
- map-reduce friendly;
- arbitrary length processes;
- counting how many of each in each state at any point in time (with the help of a periodic snapshot for performance reasons);
- how many entered or left any state at any point in time.;
- calculate time in each state and age overall.
Details in 5 blog posts here (with implementation in Pentaho Data Integration):
http://ubiquis.co.uk/dwh/status-change-fact-table-part-1-the-problem/

Is there any real sense in uniform distributed partition keys for small applications using DynamoDB?

Amazon DynamoDB doc is focused on partition key uniform distribution is the most important point in creating correct db architecture.
From the other hand, when things come to real numbers, you can find that your app will never go out of one partition. That is, according to doc:
https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.Partitions
partition calculation formula is
( readCapacityUnits / 3,000 ) + ( writeCapacityUnits / 1,000 ) = initialPartitions (rounded up)
So you need more than 1000 writes per second demand (for 1 kb data) to go out from one partition. But according to my calculation for the most of small application you don't even need default 5 writes per second - 1 is enough. (To be precise you can go out of one partition if your data excesses 10Gb but it's also a big number).
The question becomes more important when you realize that creating of any additional indexes requires additional writes per second allocation.
Just imagine, I have some data related to particular user, for example, "posts".
I create "posts" data table and then according to Amazon guidelines I choose the next key format:
partition: id, // post id like uuid
sort: // don't need it
Since there is no any two posts having the same id we don't need sort key here. But then you realize that the most common operation you have is requesting a list of posts for a particular user. So you need to create secondary index like:
partition: userId,
sort: id // post id
But every secondary index requires additional read/write units so the cost of such decision is doubled!
From the other hand, keeping in mind that you have only one partition, you could already have such primary key:
partition: userId
sort: id // post id
That works fine for your purposes and doesn't double your cost.
So the question is: have I missed something? May be partition key is much more effective than sort one even inside one partition?
Addition: you may say "ok, now having userId as partition key for posts is ok but when you have 100000 users in your app you'll run into troubles with scaling". But in reality the trouble can be only for some "transition" case - when you have only a few partitions with a group of active users posts all in one partition and inactive ones in the other one. If you have thousands of users it's natural that you have a lot of users with active posts, the impact of one user is negligible and statistically their posts are evenly distributed between a lot of partitions due to big numbers.
I think its absolutely fine if you make sure you wont exceed partition limits by increasing RCU/WCU or by growth of your data. Moreover, best practices says
If the table will fit entirely into a single partition (taking into consideration growth of your data over time), and if your application's read and write throughput requirements do not exceed the read and write capabilities of a single partition, then your application should not encounter any unexpected throttling as a result of partitioning.

Modelling EVERY day in Django

I have a booking system for something where the price can change based on the day. The admins for the site can make these changes. If a booking crosses the boundary of a daily rate, they pay pro-rata for the rates they used.
I'm losing confidence in how this is implemented. There are at least two ways:
Having Rates that specify their validity (start, end fields) and then working out which of those apply. But which overlapping ones take priority? Etc. Nasty. This is what we're trying to do and cannot currently answer sufficiently well.
The same except that there is some form of unique quality to date so that no two rates can overlap. The problem here is we'd need to split existing Rates on insert and rejoin two on delete/edit, etc if they had the same value. We'd need to make sure there were no gaps. It requires some heavy ORM overriding.
Keeping a DayRate table with every day defined. This means keeping a load of extra data around but most bookings are for tens of days, not thousands so I'm not worried about the database bandwidth requirements here. Date would be primary-unique and I'd just do a range filter for grabbing which ones I need to factor in.
The problem is generating these dates ahead of time. I know that as soon as I implement this, somebody will make a booking for 2032. Is there a good way around this or should we limit them?
None of these answers seems great and I have to imagine that I'm not the first guy with a booking system. Is there a better way of keeping track of a rate over a contiguous (possibly infinite) amount of time?

Storing Time Series in AWS DynamoDb

I would like to store 1M+ different time series in Amazon's DynamoDb database. Each time series will have about 50K data points. A data point is comprised of a timestamp and a value.
The application will add new data points to time series frequently (all the time) and will retrieve (usually the whole time series) time series from time to time, for analytics.
How should I structure the database? Should I create a separate table for each timeseries? Or should I put all data points in one table?
Assuming your data is immutable and given the size, you may want to consider Amazon Redshift; it's written for petabyte-sized reporting solutions.
In Dynamo, I can think of a few viable designs. In the first, you could use one table, with a compound hash/range key (both strings). The hash key would be the time series name, the range key would be the timestamp as an ISO8601 string (which has the pleasant property that alphabetical ordering is also chronological ordering), and there would be an extra attribute on each item; a 'value'. This gives you the abilty to select everything from a time series (Query on hashKey equality) and a subset of a time series (Query on hashKey equality and rangeKey BETWEEN clause). However, your main problem is the "hotspot" problem: internally, Dynamo will partition your data by hashKey, and will disperse your ProvisionedReadCapacity over all your partitions. So you may have 1000 KB of reads a second, but if you have 100 partitions, then you have only 10 KB a second for each partition, and reading all data from a single time series (single hashKey) will only hit one partition. So you may think your 1000 KB of reads gives you 1 MB a second, but if you have 10 MB stored it might take you much longer to read it, as your single partition will throttle you much more heavily.
On the upside, DynamoDB has an extremely high but costly upper-bound on scaling; if you wanted you could pay for 100,000 Read Capacity units, and have sub-second response times on all of that data.
Another theoretical design would be to store every time series in a separate table, but I don't think DynamoDB is meant to scale to millions of tables, so this is probably a no-go.
You could try and spread out your time series across 10 tables where "highly read" data goes in table 1, "almost never read data" in table 10, and all other data somewhere in between. This would let you "game" the provisioned throughput / partition throttling rules, but at a high degree of complexity in your design. Overall, it's probably not worth it; where do you new time series? How do you remember where they all are? How do you move a time series?
I think DynamoDB supports some internal "bursting" on these kinds of reads from my own experience, and it's possible my numbers are off, and you will get adequete performance. However my verdict is to look into Redshift.
How about dripping each time series into JSON or similar and store in S3. At most you'd need a lookup from somewhere like Dynamo.
You still may need redshift to process your inputs.