I am about to integrate TimescaleDB into my Django project, but what's unclear to me is how Timescale groups different timestamps together to form a time-series.
Imagine I have multiple Drinks (Coca-Cola, lime juice, Fanta, water, …). I can have a million drinks in my database. Each drink can have multiple time-series related to it: a time-series of consumption data over the years, a time-series of customer data over the years, and so on. Then imagine I also have food data with the same assumptions. I want to save all those data in a hypertable.
I can start by creating a hypertable and inserting all consumption data for Coca-Cola in the year 2018. Now I also want to store customer data for Coca-Cola too, but the timestamps will collide: I will have 2022-12-03 multiple times. Thus Timescale must have a best practice for how to group timestamps together to form a time-series. Otherwise I could only have a single time-series per hypertable.
I see two solutions:
I can save a foreign key or an object id with the timestamp, e.g. 2022-01-01 and 5, where 5 is the id of the Coca-Cola object (see the sketch below).
Or I can create a meta table which stores the meta information about the timestamp. The meta object holds the information about which object the timestamp belongs to. E.g. in the meta model I can save the id of the Coca-Cola object and also other meta information (e.g. the amount of sugar in the drink over the whole year). Since being able to store more meta information seems important to me, I'd prefer this approach.
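For illustration, this is roughly how I picture the two options as Django models (a rough sketch; all names are made up):
from django.db import models

class Drink(models.Model):
    name = models.CharField(max_length=200)

# Option 1: each reading carries the timestamp plus a foreign key
# to the object it belongs to (e.g. the Coca-Cola object with id 5).
class ConsumptionReading(models.Model):
    drink = models.ForeignKey(Drink, on_delete=models.CASCADE)
    time = models.DateTimeField()
    value = models.FloatField()

# Option 2: a meta model describes the series and carries extra meta
# information; the readings point at the meta object instead of the drink.
class SeriesMeta(models.Model):
    drink = models.ForeignKey(Drink, on_delete=models.CASCADE)
    kind = models.CharField(max_length=50)         # e.g. "consumption", "customers"
    sugar_per_year = models.FloatField(null=True)  # extra meta information

class Reading(models.Model):
    series = models.ForeignKey(SeriesMeta, on_delete=models.CASCADE)
    time = models.DateTimeField()
    value = models.FloatField()
Either way, the reading table would then be converted into a hypertable with TimescaleDB's create_hypertable('<table>', 'time') in a migration.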
My question though is: is that how Timescale is supposed to be used? Or is the idea to have one hypertable per drink in this case? That would surprise me, because then I'd have millions of hypertables. Or will I lose performance if I design my setup as explained above? To put it simply: what is the best practice for grouping timestamps together into a time-series in Timescale?
Thanks a bunch for the feedback
Related
I'm building a Django web application, part of it involves an online ordering system for food. I want to make a "receipt" object to save transactions.
My concern, however, is this: let's say I have an object Receipt that relates to Orders, which relate to Items. If the items get edited or change over time, the receipts will look different down the line. Is there a way to save these at the moment of a transaction?
I am implementing "soft deletion" on my models to avoid deletion issues; however, I don't think this would protect against edits.
The only way I can think of to deal with this is to 'materialize' the Receipt. In other words, when a receipt is generated, use the Order and Items information current at that time and write the actual values, not the Order/Items id, to a receipt table. So for an Items row, write out the attributes (description, price, qty, etc.) you are interested in recording, instead of just an Items.id that points to a value that may change in the future.
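A rough sketch of what that could look like in Django (model and field names here are illustrative, not a definitive design):
from django.db import models

class Receipt(models.Model):
    created_at = models.DateTimeField(auto_now_add=True)
    # Keep the reference for traceability, but don't rely on it for display.
    order = models.ForeignKey("Order", null=True, on_delete=models.SET_NULL)

class ReceiptLine(models.Model):
    receipt = models.ForeignKey(Receipt, related_name="lines", on_delete=models.CASCADE)
    # Copies of the Items attributes as they were at transaction time.
    description = models.CharField(max_length=255)
    unit_price = models.DecimalField(max_digits=10, decimal_places=2)
    qty = models.PositiveIntegerField()

def materialize_receipt(order):
    # Write the current Order/Items values into the receipt tables.
    receipt = Receipt.objects.create(order=order)
    for item in order.items.all():
        ReceiptLine.objects.create(
            receipt=receipt,
            description=item.description,
            unit_price=item.price,
            qty=item.qty,
        )
    return receipt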
I'm trying to learn DynamoDB purely for didactic purposes, so I set myself a small project to sell vehicles (cars, bikes, quad bikes, etc.) in order to learn and get some experience with NoSQL databases. I read a lot of documentation about creating the right models, but I still cannot figure out the best way to store my data.
I want to get all the vehicles by filters like:
get all the cars not older than 3 months.
get all the cars not older than 3 months by brand, year and model.
And so on the same previous queries for bikes, quad bikes, etc.
After reading the official documentation and other pages with examples (https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-general-nosql-design.html#bp-general-nosql-design-approach , https://medium.com/swlh/data-modeling-in-aws-dynamodb-dcec6798e955 , Separate tables vs map lists - DynamoDB), which say that the best designs use only one table to store everything, I ended up with a model like the one below:
-------------------------------------------------------------------------------------
Partition key | Sort key | Specific attributes for each type of vehicle
-------------------------------------------------------------------------------------
cars | date#brand#year#model | {main attributes for the car}
bikes | date#brand#year#model | {main attributes for the bike}
-------------------------------------------------------------------------------------
I've used a composite sort key because the documentation says it is a good practice for searching data (https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-sort-keys.html).
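For example, with boto3 the first query would look roughly like this (the table name and the pk/sk attribute names are placeholders I made up, assuming ISO dates at the front of the sort key):
import boto3
from boto3.dynamodb.conditions import Attr, Key

table = boto3.resource("dynamodb").Table("vehicles")  # placeholder table name

# "All cars not older than 3 months": partition key = "cars", and since the
# sort key starts with the date, a >= condition on the date prefix works.
response = table.query(
    KeyConditionExpression=Key("pk").eq("cars") & Key("sk").gte("2022-09-01"),
    # Brand/year/model can only narrow things further via a filter expression,
    # which DynamoDB applies after reading the items in the key range.
    FilterExpression=Attr("brand").eq("SomeBrand"),
)
items = response["Items"]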
But after defining my model I realized it will have a problem called "hot spotting" or "hot key" (https://medium.com/expedia-group-tech/dynamodb-data-modeling-c4b02729ac08, https://dzone.com/articles/partitioning-behavior-of-dynamodb), because the official documentation recommends partition keys with high cardinality to avoid the problem.
So at this point I'm a little stuck on how to define a good and scalable model. Could you provide some help or examples of how to design a model that supports the queries mentioned above?
Note: I also considered creating a specific table for each vehicle but that would create more problems because to find the information I would need to perform a full table scan.
A few things...
Hot partitions only come into play if you have multiple partitions...
Just because you've got multiple partition (hash) keys doesn't automatically mean DDB will need multiple partitions. You'll also need more than 10 GB of data and/or more than 3000 RCU or 1000 WCU in use.
Next, DDB now supports "Adaptive Capacity", so hot partitions aren't as big a deal as they used to be; see "Why what you know about DynamoDB might be outdated".
In combination with the even newer "Instantaneous Adaptive Capacity", you've also got DDB On-Demand.
One final note, you may be under the impression that a given partition (hash) key can only have a maximum of 10GB of data under it. This is true if your table utilizes Local Secondary Indexes (LSI) but is not true otherwise. Thus, consider using global secondary indexes (GSI). There's extra cost associated with GSIs, so it's a trade off to consider.
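As an illustration only (the table, index, and attribute names are made up, and this assumes on-demand billing so no ProvisionedThroughput is needed), adding a GSI keyed on something higher-cardinality such as brand could look like:
import boto3

client = boto3.client("dynamodb")

# Sketch: add a GSI so brand/date queries don't all go through one partition key.
client.update_table(
    TableName="vehicles",  # placeholder
    AttributeDefinitions=[
        {"AttributeName": "brand", "AttributeType": "S"},
        {"AttributeName": "sk", "AttributeType": "S"},
    ],
    GlobalSecondaryIndexUpdates=[
        {
            "Create": {
                "IndexName": "brand-date-index",
                "KeySchema": [
                    {"AttributeName": "brand", "KeyType": "HASH"},
                    {"AttributeName": "sk", "KeyType": "RANGE"},
                ],
                "Projection": {"ProjectionType": "ALL"},
            }
        }
    ],
)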
I'm in the process of evaluating some different data stores for a project, and I have a strange but inflexible requirement to check the existence of 1500 keys per query... Basically the only query I'll be running is of the form:
SELECT user_id, name, gender
WHERE user_id in (user1, user2, ..., user1500)
I will have around 3.5 billion rows in the table. One data store that has caught my eye is Spanner. I was wondering if querying the data this way would be feasible, or if I would run into performance issues due to the large number of items in my WHERE clause. I have only been able to test these queries on a small amount of data so far, so I'm leaning more on what the theoretical performance hit might look like instead of having the luxury to just "try and find out".
Also, are there other data stores that might work better for this read pattern? I expect to run no more than 80 queries per second. Also, the data will be bulk loaded on a weekly basis. The data is structured by nature, but we don't use it in a relational way (i.e. no joins).
Anyways, sorry if this question is vague in any way. I'm happy to provide more detail if needed.
1500 keys should not be a problem if you use a bound array parameter to specify the keys:
SELECT user_id, name, gender
FROM table
WHERE user_id IN UNNEST(@users)
https://cloud.google.com/spanner/docs/sql-best-practices#write_efficient_queries_for_range_key_lookup
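With the Python client, binding the array would look roughly like this (instance, database, and table names are placeholders):
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-db")  # placeholders

user_ids = [f"user{i}" for i in range(1, 1501)]  # the ~1500 keys to look up

with database.snapshot() as snapshot:
    rows = snapshot.execute_sql(
        "SELECT user_id, name, gender FROM users WHERE user_id IN UNNEST(@users)",
        params={"users": user_ids},
        param_types={"users": spanner.param_types.Array(spanner.param_types.STRING)},
    )
    for user_id, name, gender in rows:
        print(user_id, name, gender)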
I'm trying to build a web application where users can upload a file (specifically the MDF file format) and view the data in the form of various charts. The files can contain any number of time-based signals (various numeric data types), and users may name the signals arbitrarily.
My thought on saving the data involves 2 steps:
Maintain a master table as an index, to save such meta information as file names, who uploaded it, when, etc. Records (rows) are added each time a new file is uploaded.
Create a new table (I'll refer to these as data tables) for each file uploaded; within each table, each column will be one signal (the first column being timestamps).
This raises the problem that I can't pre-define the model for the data tables, because the number, names, and datatypes of the fields will differ among virtually all uploaded files.
I'm aware of some libs that help to build runtime dynamic models but they're all dated and questions about them on SO basically get zero answers. So despite the effort to make it work, I'm not even sure my approach is the optimal way to do what I want to do.
I also came across this Postgres-specific model field which can take nested arrays (which I believe fits the 2-D time-based signal lists). In theory I could parse the raw uploaded file, construct such an array, and basically save all the data in one field. Not knowing the limit on the size of the data, this could also be a nightmare for queries later on, since creating the charts usually takes only a few columns of signals at a time, out of a total of up to hundreds of signals.
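For clarity, what I mean with the single-field approach is roughly this (a sketch using Django's ArrayField; model and field names are made up):
from django.contrib.postgres.fields import ArrayField
from django.db import models

class UploadedFile(models.Model):
    # Master table: one row per uploaded MDF file.
    name = models.CharField(max_length=255)
    uploaded_by = models.CharField(max_length=255)
    uploaded_at = models.DateTimeField(auto_now_add=True)
    # Everything in one nested array: row 0 = timestamps,
    # rows 1..n = one signal each, in the same order as signal_names.
    signal_names = ArrayField(models.CharField(max_length=255))
    data = ArrayField(ArrayField(models.FloatField()))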
So my question is:
Is there a better way to organize the storage of data? And how?
Any insight is greatly appreciated!
If the name, number, and datatypes of the fields will differ for each user, then you do not need an ORM. What you need is a query builder or SQL string composition, such as Psycopg's sql module. You will be programmatically creating a table for each combination of user and uploaded file (if they are different) and programmatically inserting the records.
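A minimal sketch with psycopg2's sql module (connection details, table names, and column types are all illustrative):
import psycopg2
from psycopg2 import sql

conn = psycopg2.connect("dbname=mydb")  # placeholder DSN

def create_data_table(table_name, signal_names):
    # One data table per uploaded file: a timestamp column plus
    # one double precision column per signal from the parsed file.
    columns = [sql.SQL("{} double precision").format(sql.Identifier(name))
               for name in signal_names]
    query = sql.SQL("CREATE TABLE {} (ts double precision, {})").format(
        sql.Identifier(table_name),
        sql.SQL(", ").join(columns),
    )
    with conn.cursor() as cur:
        cur.execute(query)
    conn.commit()
Composing identifiers with sql.Identifier keeps the arbitrary, user-chosen signal names safely quoted.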
Using PostgreSQL might be a good choice; you might also create a GIN index on the arrays to speed up queries.
However, if you are primarily working with time-series data, then using a time-series database like InfluxDB or Prometheus makes more sense.
I'm creating a web based point of sale (think cash register) solution with Django as the backend. I've always taken the 'classic' approach of modeling invoices and their line items.
InvoiceTable
id
date
customer
salesperson
discount
shipping
subtotal
tax
grand_total
[...]
InvoiceLineItems
invoice_id // foreign key
product_id
unit_price
qty
item_discount
extended_price
[...]
After attempting to research best practices, I've found that there aren't many - at least no definitive source that's widely used.
The Kimball Group suggests: "Rather than holding onto the operational notion of a transaction header “object,” we recommend that you bring all the dimensionality of the header down to the line items."
See http://www.kimballgroup.com/2007/10/02/design-tip-95-patterns-to-avoid-when-modeling-headerline-item-transactions/ and http://www.kimballgroup.com/2001/07/01/design-tip-25-designing-dimensional-models-for-parent-child-applications/.
I'm new to development (having only used desktop database software before), but from my understanding this makes sense, as we can drill down into the data any way we want for reporting purposes (though I imagine we could do the same with the first method by joining the tables).
My Questions
The invoice ID will need to be repeated for each row (so we can generate data like totals for the invoice). Is this an intentional feature of this way of modeling the data?
We often have invoice-level data like notes, discounts, shipping charges, etc. How do we represent these using this method? Some discounts are product-specific, so they belong on the line item anyway; others are invoice-wide (think of a deal where you buy two separate products and receive a discount on the two). Could we somehow allocate those across the line items? Same with shipping charges: allocate them by dividing among the line items (see the sketch below)?
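To make the allocation idea concrete, this is roughly what I have in mind (a rough sketch, not tied to any particular schema):
from decimal import ROUND_HALF_UP, Decimal

def allocate(amount, line_totals):
    # Split an invoice-wide discount or shipping charge across line items,
    # proportionally to each line's total, putting the rounding remainder on
    # the last line so the parts always add up to the original amount.
    grand_total = sum(line_totals)
    shares = []
    for total in line_totals[:-1]:
        share = (amount * total / grand_total).quantize(
            Decimal("0.01"), rounding=ROUND_HALF_UP)
        shares.append(share)
    shares.append(amount - sum(shares))
    return shares

# e.g. a 5.00 invoice-wide discount over two lines of 30.00 and 10.00
print(allocate(Decimal("5.00"), [Decimal("30.00"), Decimal("10.00")]))
# -> [Decimal('3.75'), Decimal('1.25')]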
What do we do with invoice 'notes' - we have printed and/or internal notes, would we put the data in the line items and just repeat it for each line item? That seems to go against data normalization. Put it in a related table?
Any open source projects that use this method that I could take a look at? Not sure how to search for them.
It sounds like you're confusing relational design and dimensional design.
A relational design is for facilitating transaction processing, and minimizing data anomalies and duplication. It's your operational database. A dimensional design is for facilitating analysis.
A relational design will have an invoices table and a line_items table and a dimensional design will have a company_invoices_customer fact table with a grain of invoice line item.
Since this is for POS, I assume you want a relational design first.
As for your questions:
First there are tons of good data modelling patterns for this scenario. See https://dba.stackexchange.com/questions/12991/ready-to-use-database-models-example/23831#23831
The invoice ID will need to be repeated for each row (so we can generate data like totals for the invoice). Is this an intentional feature of this way of modeling the data?
Yes
We often have invoice level data like notes, discounts, shipping charges, etc. - How do we represent these using this method?
Probably easiest/simplest to have a "notes" field on the invoice table.
For charges and discounts you should use abstraction (see Table Inheritance), and add them as Order Adjustments. See the book by Silverston in the link above.
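A rough sketch of that shape in Django (names are illustrative; see Silverston for the full pattern):
from django.db import models

class InvoiceAdjustment(models.Model):
    # One row per discount, shipping charge, fee, etc.
    DISCOUNT = "discount"
    SHIPPING = "shipping"
    KIND_CHOICES = [(DISCOUNT, "Discount"), (SHIPPING, "Shipping")]

    invoice = models.ForeignKey("Invoice", on_delete=models.CASCADE)
    # Null means the adjustment applies to the whole invoice;
    # otherwise it is tied to a specific line item.
    line_item = models.ForeignKey(
        "InvoiceLineItem", null=True, blank=True, on_delete=models.CASCADE)
    kind = models.CharField(max_length=20, choices=KIND_CHOICES)
    amount = models.DecimalField(max_digits=10, decimal_places=2)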
Some discounts are product specific - so they belong on the line item anyway, others are invoice wide (think of a deal where you buy two separate products and receive a discount on the two) - could we somehow allocate it across the line items?
The price of the item should be calculated at runtime based on its default price and any discounts or charges that apply in the current "scenario", for example a discount for government customers, nearby customers, or a sale day. You could have hierarchical line items that reference each other, to keep things in order. Again, see the Silverston book.
What do we do with invoice 'notes' - we have printed and/or internal notes, would we put the data in the line items and just repeat it for each line item?
If you want line item notes, add a notes column on the line items table.
That seems to go against data normalization. Put it in a related table?
If notes are nullable, and you want to be strict about normalization, then yes, add an invoice_notes table.
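For example (a sketch; names are illustrative):
from django.db import models

class InvoiceNote(models.Model):
    invoice = models.ForeignKey("Invoice", on_delete=models.CASCADE)
    # Distinguish notes printed on the invoice from internal-only notes.
    is_internal = models.BooleanField(default=False)
    text = models.TextField()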