I'm trying out QuestDB using the binaries, running them in an Ubuntu container under Proxmox. The docs for the binaries don't say what resources you need, so I guesstimated. Looking at the performance metrics for the container when running some of the CRUD examples with 10,000,000 rows, I still managed to over-provision — by a lot.
I provisioned the container with 4 CPU cores, 4 GB RAM & swap, and an 8 GB SSD. It would probably be fine with a fraction of that: CPU usage during queries is <1%, RAM usage is <1.25 GB, and storage usage is <25%.
There is some good info in the capacity planning section of the QuestDB docs (e.g. 8 GB RAM for light workloads), but my question is really about the low end of the scale — what’s the least you can get away with and still be performant when getting started with the examples from the docs?
(I don't mind creating a pull request with this and some other docs additions. Most likely, 2 cores, 2 GB of RAM and 4 GB of storage would be plenty and still give you a nice 'wow, this is quick' factor, with the proviso that this is for evaluation purposes only.)
In QuestDB, ingestion and querying are separated by design, meaning that if you plan to ingest medium- or high-throughput data while running queries, you want one dedicated core for ingestion and another for the shared pool.
The shared pool is used for queries, but also for internal tasks QuestDB needs to run. If you are just running a demo, you probably can do well with just one core for the shared pool, but for production scenarios it is likely you would want to increase that depending on your access patterns.
Regarding disk capacity and memory, it all depends on the size of the data set. QuestDB queries will be faster if the working dataset fits in memory. 2GB of RAM and 4GB of disk storage as you suggested should be more than enough for the examples, but for most production scenarios you would probably want to increase both.
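As a rough illustration of "it all depends on the size of the data set", here is a back-of-envelope sketch in Python. It assumes fixed-width columns; the schema, the column widths, and the 50% RAM headroom are hypothetical values chosen for the example, not QuestDB recommendations, and the estimate ignores indexes and other on-disk overhead.

```python
# Back-of-envelope sizing for a table with fixed-width columns.
# The schema below is hypothetical; substitute your own column widths.
COLUMN_BYTES = {
    "ts": 8,     # timestamp stored as 8 bytes (assumption)
    "price": 8,  # double, 8 bytes (assumption)
    "qty": 4,    # int, 4 bytes (assumption)
}


def estimate_table_bytes(rows: int, column_bytes: dict) -> int:
    """Rough size of the raw column data: rows times the sum of column widths."""
    return rows * sum(column_bytes.values())


def fits_in_memory(data_bytes: int, ram_bytes: int, headroom: float = 0.5) -> bool:
    """True if the data fits in the fraction of RAM set aside for the working set."""
    return data_bytes <= ram_bytes * headroom


size = estimate_table_bytes(10_000_000, COLUMN_BYTES)
print(f"{size / 1e6:.0f} MB")               # prints "200 MB"
print(fits_in_memory(size, 2 * 1024**3))    # prints "True"
```

With these assumed widths, 10,000,000 rows come to about 200 MB of raw column data, which sits comfortably inside a 2 GB evaluation container.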
Our application's user base has reached 2M users, and we are planning to scale up the application using AWS.
The main problem we are facing is the handling of shared data which includes cache, uploads, models, sessions, etc.
One option is AWS EFS, but it will kill the performance of the application, as the files are really small (ranging from a few bytes to a few MB) and are updated very frequently.
We can use Memcache/Redis for sessions and S3 for uploads but still need to manage cache, models, and some other shared files.
Is there any alternative to EFS or any way to make EFS work for this scenario where small files are updated frequently?
Small files and frequent updates should not be a problem for EFS.
The problem some users encountered in the original release was that it had two dimensions tightly coupled together -- the amount of throughput available was a function of how much you were paying, and how much you were paying was a function of the total size of the filesystem (all files combined, regardless of individual file sizes)... so the larger, the faster.
But they have, since then, introduced "provisioned throughput," allowing you to decouple these two dimensions.
This default Amazon EFS throughput bursting mode offers a simple experience that is suitable for a majority of applications. Now with Provisioned Throughput, applications with throughput requirements greater than those allowed by Amazon EFS’s default throughput bursting mode can achieve the throughput levels required immediately and consistently independent of the amount of data.
https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-efs-now-supports-provisioned-throughput/
If you use this feature, you pay for the difference between the throughput you provision, and the throughput that would have been included, anyway, based on the size of the data.
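The billing rule above can be sketched as a small Python function. This is only an illustration: the function name is made up, and the included baseline of 50 MiB/s per TiB stored is an assumption passed in as a parameter, so check the current EFS documentation for the real figure before relying on it.

```python
def billable_throughput_mibps(provisioned_mibps: float,
                              stored_gib: float,
                              included_mibps_per_tib: float = 50.0) -> float:
    """Throughput you would pay extra for: provisioned minus what the stored size already includes.

    included_mibps_per_tib is an assumed baseline rate, not a quoted price term.
    """
    included = stored_gib / 1024 * included_mibps_per_tib
    # Never negative: if the filesystem is large enough, nothing extra is billed.
    return max(0.0, provisioned_mibps - included)


# A small filesystem (100 GiB) that needs 20 MiB/s pays for most of it:
small = billable_throughput_mibps(20, 100)
# A 1 TiB filesystem already includes the assumed 50 MiB/s, so 20 MiB/s costs nothing extra:
large = billable_throughput_mibps(20, 1024)
```

Under these assumptions, the small-files case from the question is exactly the one where provisioned throughput matters: little data stored, so little throughput included.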
See also Amazon EFS Performance in the Amazon Elastic File System User Guide.
Provisioned throughput can be activated and deactivated. Don't confuse it with the two performance modes, General Purpose and Max I/O: one of them must be selected when creating the filesystem, and the selection can't be changed later. These relate to an optional tradeoff in the underlying infrastructure, and the recommended practice is to select General Purpose unless observed metrics give you a reason not to. The Max I/O mode does not have the same metadata consistency model as General Purpose.
I observe some behavior on EC2 instances that I believe is due to the disk cache. Basically:
I have a calculation task that needs to access a large chunk of data sequentially (~60 1 GB files). I have included the files in my Amazon image. I also use MPI to start ~30 processes that access different files simultaneously. BTW, the program is computation bound, but disk I/O takes a decent chunk of the run time. I noticed that when I start the instance and perform the calculation for the first time, it is extremely slow. The top command shows the processes hanging from time to time, and CPU usage is around 60%. However, once that run finishes, if I start another run, it is much faster and CPU usage is around 99%. Is that because my data was still on a network drive (EBS) and was loaded into a local instance disk cache (SSD drive?) automatically? I ran it on a c5n.18xlarge, but it is listed as EBS only.
Has anyone had similar experiences? Or alternative explanations?
It was almost certainly disk cache, but in RAM, not a local SSD.
The c5n.18xlarge instance type has 192 GB of RAM. So, depending on what else you're doing with that RAM, it's entirely possible that your 60 GB of data files were read into the cache and never left.
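One way to check the cache explanation is to time the same sequential read twice. The Python sketch below is a minimal version of that experiment; the function names are hypothetical, and the first read is only "cold" if the file is not already cached (for example, on a freshly started instance).

```python
import time


def timed_read(path, chunk_size=1 << 20):
    """Read the file sequentially in chunks; return (bytes_read, seconds_elapsed)."""
    start = time.perf_counter()
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total += len(chunk)
    return total, time.perf_counter() - start


def compare_reads(path):
    """Read the same file twice; the second pass is usually served from the page cache."""
    first = timed_read(path)
    second = timed_read(path)
    return first, second
```

If the second pass is much faster than the first on a new instance, the data is being served from RAM rather than from EBS.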
For more information: https://www.tldp.org/LDP/sag/html/buffer-cache.html
I'm trying to develop a data exploration environment for heavy processing of "Small Data" (10 - 30 GB). Reliability and stability are not concerns for these lightweight environments (that basically just contain Jupyter, Julia, Python, and R, plus some packages). Instead, I'd like to maximize performance, and the data sets I'm working with are small enough to fit into memory. Is there a way that I can boot a Linux image directly into RAM on Google Compute Engine, bypassing the SSD altogether?
Google provides instructions on how to create a RAM disk for storing data (https://cloud.google.com/compute/docs/disks/mount-ram-disks), but I would like all of the OS files to be on the RAM disk as well.
I am looking at porting an application to the cloud; more specifically, I am looking at Amazon EC2 or Google GCE.
My app heavily uses Linux's mmap to memory map large read-only files, and I would like to understand how mmap would actually work when a file is on an EBS volume.
I would specifically like to know what happens when I call mmap as EBS appears to be a black-box. Also, are the benefits negated?
I can speak for GCE Persistent Disks. It behaves pretty much in the same way a physical disk would. At a high level, pages are faulted in from disk as mapped memory is accessed. Depending on your access pattern these pages might be loaded one by one, or in a larger quantity when readahead kicks in. As the file system cache fills up, old pages are discarded to give space to new pages, writing out dirty pages if needed.
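To make the fault-on-access behavior concrete, here is a minimal Python sketch using the standard `mmap` module. The helper name `read_slice` is made up for illustration; the point is that mapping the file is cheap and only the pages you actually touch are read from the disk.

```python
import mmap


def read_slice(path, offset, length):
    """Map a file read-only and copy out one slice.

    Creating the mapping does not read the file; the pages backing
    [offset, offset + length) are faulted in when the slice is accessed.
    """
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ makes the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]
```

Timing calls like this with your real offsets and file sizes, on the disk type you plan to use, gives a first approximation of how the workload would behave.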
One thing to keep in mind with Persistent Disk is that performance is proportional to disk size. So you'd need to estimate your throughput and IOPS requirements to ensure you get a disk with enough performance for your application. You can find more details here: Persistent disk performance.
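Because performance scales with disk size, the estimate can be turned around: given the IOPS and throughput you need, what is the smallest disk that delivers both? The Python sketch below does that arithmetic. The per-GB rates are parameters you must take from the current Persistent Disk documentation; the numbers in the usage line are placeholders, not Google's figures.

```python
import math


def min_disk_gb(required_iops, required_mbps, iops_per_gb, mbps_per_gb):
    """Smallest disk size (in GB) that satisfies both the IOPS and throughput targets.

    iops_per_gb and mbps_per_gb are the per-GB scaling rates for the chosen
    disk type; they are inputs here, not values this sketch knows.
    """
    by_iops = required_iops / iops_per_gb
    by_throughput = required_mbps / mbps_per_gb
    # The stricter of the two requirements decides the size.
    return math.ceil(max(by_iops, by_throughput))


# Placeholder rates, for illustration only:
size_gb = min_disk_gb(required_iops=1500, required_mbps=60,
                      iops_per_gb=1.5, mbps_per_gb=0.12)
```

With these placeholder rates the IOPS target dominates, so the disk would need to be about 1000 GB even if the data itself is much smaller.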
Is there any aspect of mmap that you're worried about? I would recommend writing a small app that simulates your workload and testing it before deciding to migrate your application.
~ Fabricio.
I'm using cloud VPS instances to host very small private game servers. On Amazon EC2, I get good performance on their micro instance (1 vCPU [single hyperthread on a 2.5GHz Intel Xeon], 1GB memory).
I want to use Google Compute Engine though, because I'm more comfortable with their UX and billing. I'm testing out their small instance (1 vCPU [single hyperthread on a 2.6GHz Intel Xeon], 1.7GB memory).
The issue is that even when I configure near-identical instances with the same game using the same settings, the AWS EC2 instances perform much better than the GCE ones. To give you an idea, while the game isn't Minecraft I'll use that as an example. On the AWS EC2 instances, succeeding world chunks would load perfectly fine as players approach the edge of a chunk. On the GCE instances, even on more powerful machine types, chunks fail to load after players travel a certain distance; and they must disconnect from and re-login to the server to continue playing.
I can provide more information if necessary, but I'm not sure what is relevant. Any advice would be appreciated.
Diagnostic protocols to evaluate this scenario may be more complex than you want to deal with. My first thought is that this shared core machine type might have some limitations in consistency. Here are a couple of strategies:
1) Try backing into the smaller instance: start on a higher-level machine and work down. Since you only pay for 10 minutes, it's cheap to see whether the performance is better on higher-level machines. If you have consistent performance problems no matter what the size of the box, then I'm guessing it's something to do with the nature of your application and the nature of their virtualization technology.
2) Try measuring the consistency of the performance. I get that it is unacceptable, but is it unacceptable based on how long it's been running? The nature of the workload? Time of day? If the performance is sometimes good, but sometimes bad, then it's probably once again related to the type of your work load and their virtualization strategy.
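Strategy 2 can be sketched in Python as a small harness that runs the same piece of work repeatedly and summarizes how much the timings vary. The `workload` callable and both function names are hypothetical stand-ins; in practice you would wrap something representative of the game server's work and run the harness at different times of day.

```python
import statistics
import time


def sample_latencies(workload, runs=20):
    """Run `workload` repeatedly and return the wall-clock time of each run in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Mean, standard deviation, worst case, and coefficient of variation of the samples."""
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return {
        "mean": mean,
        "stdev": stdev,
        "max": max(samples),
        "cv": stdev / mean if mean else 0.0,  # relative spread; higher means less consistent
    }
```

A high coefficient of variation on one provider and a low one on the other, for the same workload, would point at consistency rather than raw speed as the difference.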
Something Amazon is famous for is consistency. They work very hard to manage the consistency of performance; it shouldn't spike up or down.
My best guess here without all the details is you are using a very small disk. GCE throttles disk performance based on the size. You have two options ... attach a larger disk or use PD-SSD.
See here for details on GCE Disk Performance - https://cloud.google.com/compute/docs/disks
Please post back if this helps.
Anthony F. Voellm (aka Tony the #p3rfguy)
Google Cloud Performance Team