How do I manage a rapidly growing Postgres table in Django?

I have a website that shows live data from the different machines in a crushing facility. Right now the sensor data from all the machines in one facility is stored in a "sensordata" table, and users use that data to build reports for a time period (up to the last three months). The site has been running for 3 months and the sensor data table is already at 113 million rows. My company is planning to add more facilities, which will multiply the number of machines. What is a good way to store this much data so that analysis (reports and so on) can still be performed on it later, and that stays workable for, say, hundreds of facilities?
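For context, here is a minimal sketch of what such a setup might look like in Django; the model name, fields, and the report query are illustrative assumptions rather than the asker's actual code:

```python
# Illustrative sketch only: the model name, fields, and report query are
# assumptions about the setup described above, not the actual schema.
from datetime import timedelta

from django.db import models
from django.db.models import Avg
from django.utils import timezone


class SensorData(models.Model):
    machine_id = models.IntegerField()                  # machine within the facility
    recorded_at = models.DateTimeField(db_index=True)   # reading timestamp
    value = models.FloatField()

    class Meta:
        db_table = "sensordata"


def three_month_report(machine_id: int) -> dict:
    """Aggregate the last three months of readings for one machine.

    At ~113 million rows this kind of range scan over a time window is
    what makes time-based partitioning of the table attractive.
    """
    since = timezone.now() - timedelta(days=90)
    return (
        SensorData.objects
        .filter(machine_id=machine_id, recorded_at__gte=since)
        .aggregate(avg_value=Avg("value"))
    )
```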

Related

Partitioned tables in BigQuery

I was wondering what the benefit of using a partitioned table in BigQuery is. Most queries seem to take about the same time to finish regardless of size (ignoring extremes, I'm generalizing). Is this mainly a matter of reducing costs on the bytes processed, or what is the main use case for partitioning tables in BQ?
https://cloud.google.com/bigquery/docs/creating-column-partitions
There are multiple benefits, mainly costs.
By writing a query that reads, say, 7 days of partitions instead of 7 years, you process fewer bytes and pay less (see the sketch after this list).
Partitions that haven't been modified for more than 90 days are billed at the lower long-term storage rate.
You can reload a single day's data much more easily, without workarounds.
Yearly tables such as mytable_2018 are still recommended, but you are no longer required to keep daily tables such as mytable_20180101. This leads to simpler queries, and reading more than 1000 tables (a hard limit) is no longer a problem.
When you modify the schema, you only need to alter a few tables instead of scripting ALTERs across thousands of them.
Fewer bytes are processed, so the platform can optimize the workload better and needs fewer resources.
By reorganizing data into partitioned tables, query times will benefit in the future: as customers move their data, the cloud engineering team will optimize the service for this usage pattern.
The cost benefits become clear once your existing data is at least a couple of terabytes.
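As a rough illustration of the first two points, this sketch uses the google-cloud-bigquery Python client to create a column-partitioned table and to run a query whose filter on the partitioning column limits the bytes scanned; the project, dataset, table, and column names are made up:

```python
# Sketch using the google-cloud-bigquery client; dataset, table, and
# column names are placeholders for illustration.
from google.cloud import bigquery

client = bigquery.Client()

# Create a table partitioned by a TIMESTAMP column instead of
# maintaining daily tables like mytable_20180101.
table = bigquery.Table(
    "my-project.my_dataset.events",
    schema=[
        bigquery.SchemaField("created_at", "TIMESTAMP"),
        bigquery.SchemaField("payload", "STRING"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="created_at",
)
client.create_table(table)

# A filter on the partitioning column prunes partitions, so only about
# 7 days of data is scanned (and billed) rather than the whole table.
query = """
    SELECT COUNT(*) AS n
    FROM `my-project.my_dataset.events`
    WHERE created_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
"""
result = client.query(query).result()
```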

VoltDB is exhausting the RAM while loading the data

I am trying to load database tables into a VoltDB database using VoltDB's csvloader utility. When I try to load one table of about 5 GB, VoltDB consumes RAM so fast that free memory drops from 55 GB to 200 MB, and then the VoltDB process gets killed by the system.
What can be the reason for this, and what are the recommended settings for VoltDB to avoid it?
Is the table you are loading partitioned? That's the first thing to check, because if you have the default sitesperhost=8 on a single server, and the table is not partitioned, there will be a complete copy of the table in each of the 8 partitions. If the table is partitioned, the data is distributed among the partitions based on the hashing assignment of the values of the partitioning key column.
If it's partitioned and you still can't load all of the data, the next thing to look at would be the schema. There are formulas in the Planning Guide that describe the memory usage for given datatypes and for indexes. The VMC interface also has a sizing worksheet that gives you the mins and maxes based on the schema. You could also post the definition of the table you are trying to load, along with any indexes you have defined on it, and we can explain more about the bytes it would use per row.
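As a rough illustration of the first point, here is a back-of-envelope estimate of the memory a replicated (unpartitioned) table can consume on a single host with the default sitesperhost=8; the overhead factor is an assumption for illustration, and the Planning Guide formulas should be used for real sizing:

```python
# Back-of-envelope estimate of why a replicated (unpartitioned) table can
# exhaust RAM. The overhead factor is an assumption; the VoltDB Planning
# Guide has the exact per-datatype and per-index formulas.
csv_size_gb = 5          # size of the CSV being loaded
overhead_factor = 2.0    # assumed in-memory overhead vs. raw CSV bytes
sites_per_host = 8       # default sitesperhost

in_memory_per_copy_gb = csv_size_gb * overhead_factor

# Replicated table: a full copy of the table in every partition on the host.
replicated_gb = in_memory_per_copy_gb * sites_per_host

# Partitioned table: rows are hashed across the partitions, so roughly one
# copy's worth of memory in total on a single-node cluster.
partitioned_gb = in_memory_per_copy_gb

print(f"replicated:  ~{replicated_gb:.0f} GB")   # ~80 GB, more than the 55 GB free
print(f"partitioned: ~{partitioned_gb:.0f} GB")  # ~10 GB
```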

How to use Amazon MWS to indicate two different shipping times on items?

I have a bit of a unique problem here. I currently have two warehouses that I ship items out of for selling on Amazon, my primary warehouse and my secondary warehouse. Shipping out of the secondary warehouse takes significantly longer than shipping from the main warehouse, hence why it is referred to as the "secondary" warehouse.
Some of our inventory is split between the two warehouses. Usually this is not a problem, but one particular issue keeps coming up. Allow me to explain:
Let's say that I have 10 red cups in the main warehouse and an additional 300 in the secondary warehouse. Let's also say it's Christmas time, so I have all 310 listed. However, from what I've seen, Amazon only allows one shipping time to be listed for the inventory, so the entire 310 are listed under the primary warehouse's shipping time (2 days) with no account taken of the secondary warehouse's ship time, rather than being split the way they should be: 10 at 2 days and 300 at 15 days.
The problem comes in when someone orders an amount that would have to be split across the two warehouses, such as if someone were to order 12 of said red cups. The first 10 would come out of the primary warehouse, and the remaining two would come out of the secondary warehouse. Due to the secondary warehouse's shipping time, the remaining two cups would have to be shipped out at a significantly different date, but Amazon marks the entire order as needing to be shipped within those two days.
For a variety of reasons, it is not practical to keep all of one product in one warehouse, nor is it practical to increase the secondary warehouse's shipping time. Changing the overall shipping date for the product to the longest ship time causes us to lose the buy box for the listing, which really defeats the purpose of us trying to sell it.
So my question is this: is there some way in MWS to indicate that the inventory is split up in terms of shipping times? If so, how?
Any assistance in this matter would be appreciated.
Short answer: No.
There is no way to specify two values for FulfillmentLatency, just as there is no way to specify two values for the quantity in stock. You can only ever have one inventory with them (plus FBA stock).
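To illustrate the point, here is a sketch of what an MWS Inventory feed (_POST_INVENTORY_AVAILABILITY_DATA_) looks like, assuming I remember the schema correctly; each SKU carries a single Quantity and a single FulfillmentLatency, with no way to split either across warehouses. The merchant identifier and SKU are placeholders:

```python
# Sketch of an MWS Inventory feed document; merchant identifier and SKU
# are placeholders. Note the single Quantity and single FulfillmentLatency
# per SKU -- there is no per-unit or per-warehouse split.
FEED = """<?xml version="1.0" encoding="utf-8"?>
<AmazonEnvelope>
  <Header>
    <DocumentVersion>1.01</DocumentVersion>
    <MerchantIdentifier>MY_MERCHANT_ID</MerchantIdentifier>
  </Header>
  <MessageType>Inventory</MessageType>
  <Message>
    <MessageID>1</MessageID>
    <Inventory>
      <SKU>RED-CUP-001</SKU>
      <Quantity>310</Quantity>
      <FulfillmentLatency>2</FulfillmentLatency>
    </Inventory>
  </Message>
</AmazonEnvelope>"""
```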
Longer answer: You could.
Sign up twice with Amazon:
"MySellerName" has an inventory of 10 and a fulfillment latency of 2 days
"MySellerName Overseas Warehouse" has an inventory of 300 and a fulfillment latency of 30 days
I haven't tried it, but I believe Amazon will then automatically direct the customer to the best seller for them, which should be "MySellerName" for small orders and "MySellerName Overseas Warehouse" for larger quantities.

Amount of Test Data needed for load testing of a web service

I am currently working on a project that requires load testing of web services.
One of the services is called 60,000 times in production during the busy day/busy hour.
{PerfTest Env=PROD}
Input Account Number
Output AccountDetails
Do I really need 60,000 unique account numbers (test data) for this LoadRunner script to simulate the production scenario?
If unique data is required, then for an endurance test I will have to prepare a lot of test data for each web service.
If I can't get that much test data, what is the chance of the load test results being skewed by the application server's cache mechanism?
Can somebody help me?
Thanks
Ram
Are you simulating a day, or the highest-volume hour of the last year? This helps shape the amount of data that you need. Rarely would you start with a 24-hour test; instead you would run your high-water hour with a ramp-up and ramp-down, so you would need approximately 1.333 times your high-water hour's worth of data.
So this can drop your 60K to (potentially) around 20K. I am assuming that your worst hour over the last year carries somewhere around 1/3 of your traditional day; I have observed this pattern over and over in different environments over the past two decades. You will want to verify this objectively with log or query data to support the number in your own environment.
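Worked through with the numbers from the question (the 1/3 ratio is this answer's assumption, not a measured value):

```python
# Rough sizing using the figures above; the "busy hour ~ 1/3 of the busy
# day" ratio is the assumption stated in this answer.
busy_day_calls = 60_000
high_water_hour = busy_day_calls / 3        # ~20,000 calls in the peak hour

# Ramp-up + steady state + ramp-down adds roughly another third of data.
records_needed = high_water_hour * 1.333    # ~26,660 account numbers

print(round(high_water_hour), round(records_needed))
```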
Next up, how many of these inquiries are actually unique? You will really need a log of the queries across a day (or your high-water hour) to determine this. Log processing tools such as Microsoft Log Parser or Splunk/Splunk Storm can help you pull the observed distribution of unique account references within your data, including counts of the ones that repeat. Once you know this, you can simply use a data file with a fixed block size per user for the unique data, and have the user exit once the data is exhausted.
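If raw access logs are available, a quick script along these lines can surface the unique-account distribution; the log file name and the field position of the account number are assumptions:

```python
# Hypothetical log analysis: count how many account numbers are unique and
# how skewed the distribution is. Log format and field position are assumed.
from collections import Counter

counts = Counter()
with open("access_log_busy_hour.txt") as log:
    for line in log:
        account = line.split()[2]   # assumed position of the account number
        counts[account] += 1

print("total requests :", sum(counts.values()))
print("unique accounts:", len(counts))
print("top repeaters  :", counts.most_common(5))
```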

Incremental update of millions of records, indexed vs. join

I'm currently developing a strategy for an incremental update of our user data. We assume 100,000,000 records in our database, of which approximately 1,000,000 records are updated per workflow.
The idea is to update records in a MapReduce job. Is it useful to use an indexed store (e.g. Cassandra) to be able to access current records randomly? Or is it preferable to retrieve data from HDFS and join the new information to the existing records?
The record size is O(200 bytes). The user data has a fixed length but should be extendable. The log events have a similar but not identical structure. The number of user records is likely to grow. Near-real-time updates are desirable, i.e. a 3-hour gap is not acceptable, but a few minutes is OK.
Do you have any experience with either of these strategies and data of this size?
Is the Pig JOIN fast enough? Is it always a bottleneck to read all records? Can Cassandra hold this amount of data efficiently? Which solution is scalable? What about the complexity of the system?
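For scale, the figures in the question work out roughly as follows (a sketch; the record size is the O(200 bytes) stated above, the rest is simple arithmetic):

```python
# Back-of-envelope volumes implied by the figures in the question.
total_records   = 100_000_000
updated_records = 1_000_000
record_size_b   = 200                                # O(200 bytes) per record

total_gb  = total_records * record_size_b / 1e9      # ~20 GB full dataset
update_mb = updated_records * record_size_b / 1e6    # ~200 MB per workflow

print(f"full dataset : ~{total_gb:.0f} GB")
print(f"per workflow : ~{update_mb:.0f} MB updated (~1% of records)")
```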
You need to define your requirements first. Your record volumes are not a problem, but you don't give a record length. Are the records fixed-length, with a fixed number of fields, or likely to change format over time? Are we talking 100-byte records or 100,000-byte records? You need an index on a field/column if you wish to query by that field/column, unless you do all your work using map/reduce. Will the number of user records stay at 100 million (one server will probably suffice), or will it grow 100% per year (probably multiple servers, adding new ones over time)?
How you access records for updating depends on whether you need to update them in real-time or whether you can run a batch job. Will updates be every minute, or hour, or month?
I would strongly suggest you do some experimenting. Have you done any testing already? This will give you a context for your questions and this will lead to more objective questions and answers. It is unlikely that you can 'whiteboard' a solution based on your question.