Running a Google Compute Engine VM instance entirely on a RAM disk - google-cloud-platform

I'm trying to develop a data exploration environment for heavy processing of "Small Data" (10 - 30 GB). Reliability and stability are not concerns for these lightweight environments (that basically just contain Jupyter, Julia, Python, and R, plus some packages). Instead, I'd like to maximize performance, and the data sets I'm working with are small enough to fit into memory. Is there a way that I can boot a Linux image directly into RAM on Google Compute Engine, bypassing the SSD altogether?
Google provides instructions on how to create a RAM disk for storing data (https://cloud.google.com/compute/docs/disks/mount-ram-disks), but I would like all of the OS files to be on the RAM disk as well.
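For context, the data-only RAM disk from that page essentially amounts to mounting a tmpfs. A minimal sketch of what I have today (mount point and size are placeholders; it needs root and must fit well within the instance's memory):

```python
# Sketch of the data-only RAM disk from the linked docs: mount a tmpfs.
# Needs root; mount point and size are placeholders, and the size must leave
# room for the OS and running processes.
import subprocess

MOUNT_POINT = "/mnt/ramdisk"   # placeholder path
SIZE = "24g"                   # placeholder size, well under total instance RAM

subprocess.run(["mkdir", "-p", MOUNT_POINT], check=True)
subprocess.run(
    ["mount", "-t", "tmpfs", "-o", f"size={SIZE}", "tmpfs", MOUNT_POINT],
    check=True,
)
print(f"tmpfs mounted at {MOUNT_POINT}; anything written there lives in RAM only")
```

What I'm after is the equivalent for the root filesystem itself, not just a data directory.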

Related

Recommended minimum system requirements to try out the QuestDB binaries?

I'm trying out QuestDB using the binaries, running them in an Ubuntu container under Proxmox. The docs for the binaries don't say what resources you need, so I guesstimated. Looking at the performance metrics for the container when running some of the CRUD examples with 10,000,000 rows, I still managed to over-provision — by a lot.
I provisioned the container with 4 CPU cores, 4 GB RAM & swap, and 8 GB SSD. It would probably be fine with a fraction of that: CPU usage during queries is <1%, RAM usage is <1.25 GB, and storage is <25% used.
There is some good info in the capacity planning section of the QuestDB docs (e.g. 8 GB RAM for light workloads), but my question is really about the low end of the scale — what’s the least you can get away with and still be performant when getting started with the examples from the docs?
(I don't mind creating a pull request with this and some other docs additions. Most likely, 2 cores, 2 GB of RAM and 4 GB of storage would be plenty and still give you a nice 'wow, this is quick' factor, with the proviso that this is for evaluation purposes only.)
In QuestDB, ingestion and querying are separated by design, meaning that if you plan to ingest medium/high-throughput data while running queries, you want a dedicated core for ingestion and another for the shared pool.
The shared pool is used for queries, but also for internal tasks QuestDB needs to run. If you are just running a demo, you can probably do well with just one core for the shared pool, but for production scenarios you would likely want to increase that depending on your access patterns.
Regarding disk capacity and memory, it all depends on the size of the data set. QuestDB queries will be faster if the working dataset fits in memory. 2 GB of RAM and 4 GB of disk storage, as you suggested, should be more than enough for the examples, but for most production scenarios you would probably want to increase both.
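If it helps to sanity-check a small instance, here is a rough sketch that times a generate-and-aggregate round trip against a local QuestDB over its HTTP /exec endpoint (default port 9000 assumed; the table name, columns, and row count are illustrative, not taken from the official examples):

```python
# Rough timing sketch against a local QuestDB via its HTTP /exec endpoint
# (default port 9000 assumed). Table name, columns, and row count are
# illustrative only.
import time
import requests

QUESTDB_EXEC = "http://localhost:9000/exec"

def run(sql):
    resp = requests.get(QUESTDB_EXEC, params={"query": sql}, timeout=300)
    resp.raise_for_status()
    return resp.json()

# Generate 10 million synthetic rows server-side.
run("CREATE TABLE bench AS "
    "(SELECT rnd_double() val, timestamp_sequence(0, 1000000) ts "
    "FROM long_sequence(10000000)) TIMESTAMP(ts)")

start = time.time()
result = run("SELECT avg(val) FROM bench")
print(f"avg = {result['dataset'][0][0]}, query took {time.time() - start:.2f}s")
```

Timings like this on a 2-core / 2 GB container give a quick feel for whether the "wow, this is quick" factor still holds at the low end.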

VMware snapshots: how much of the source VM's disk space do they typically occupy, and can they be used to roll back software?

I would like to update Samba on a 3 TB NAS VM. My boss suggested making a clone, but there is no storage large enough to hold the whole thing. If a VM snapshot takes up less space and can restore Samba to its previous state in case of failure, that would make it the better option.
There's no real rule for how much space snapshots occupy. It depends greatly on the activity on the VM after the snapshot has been taken. If it's an active VM (a database or something of the like), a considerable amount of data could be written to the delta; if the VM sees little use, there could be little to no data written to the backing datastore.

mmap performance of Amazon EBS

I am looking at porting an application to the cloud, more specifically to Amazon EC2 or Google GCE.
My app heavily uses Linux's mmap to memory-map large read-only files, and I would like to understand how mmap would actually behave when a file is on an EBS volume.
I would specifically like to know what happens when I call mmap, as EBS appears to be a black box. Also, are the performance benefits of mmap negated?
I can speak for GCE Persistent Disks. It behaves pretty much in the same way a physical disk would. At a high level, pages are faulted in from disk as mapped memory is accessed. Depending on your access pattern these pages might be loaded one by one, or in a larger quantity when readahead kicks in. As the file system cache fills up, old pages are discarded to give space to new pages, writing out dirty pages if needed.
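As a rough illustration of that behaviour (not GCE-specific), here is a small Python sketch that maps a large read-only file and touches it page by page; the file path is hypothetical:

```python
# Illustrative sketch of the behaviour described above: map a large read-only
# file and touch it page by page, letting the kernel fault pages in from the
# underlying disk on demand. The file path is hypothetical.
import mmap
import os

PATH = "/data/large_readonly_file.bin"  # hypothetical file on the Persistent Disk

fd = os.open(PATH, os.O_RDONLY)
size = os.fstat(fd).st_size
page = mmap.PAGESIZE
total = 0
with mmap.mmap(fd, size, prot=mmap.PROT_READ) as mm:
    # Sequential access: readahead kicks in, so pages arrive in larger batches.
    for offset in range(0, size, page):
        total += mm[offset]  # first access to each page faults it in from disk
os.close(fd)
print(f"touched {size // page} pages (byte-sample sum {total})")
```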
One thing to keep in mind with Persistent Disk is that performance is proportional to disk size. So you'd need to estimate your throughput and IOPS requirements to ensure you get a disk with enough performance for your application. You can find more details here: Persistent disk performance.
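A back-of-the-envelope sizing calculation might look like the sketch below; the per-GB rates are placeholders, so take the real numbers for your disk type from that page:

```python
# Back-of-the-envelope sizing for the point above: Persistent Disk performance
# scales with size, so work backwards from target throughput/IOPS to a minimum
# disk size. The per-GB rates are placeholders; use the real values for your
# disk type from the "Persistent disk performance" docs.
READ_MBPS_PER_GB = 0.12   # placeholder throughput rate per GB
READ_IOPS_PER_GB = 0.75   # placeholder IOPS rate per GB

target_mbps = 180         # example workload requirement
target_iops = 3000        # example workload requirement

min_size_gb = max(target_mbps / READ_MBPS_PER_GB,
                  target_iops / READ_IOPS_PER_GB)
print(f"provision at least {min_size_gb:.0f} GB to hit both targets")
```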
Is there any specific aspect of mmap that you're worried about? I would recommend writing a small app that simulates your workload and testing it before deciding to migrate your application.
~ Fabricio.

Amazon EC2 and EBS using Windows AMIs

I put our application on EC2 (Windows 2003 x64 server) and attached up to 7 EBS volumes. The app is very I/O intensive to storage -- typically we use DAS with NTFS mount points (usually around 32 mount points, each to 1 TB drives), so I tried to replicate that using EBS, but the I/O rates are bad, as in 22 MB/s tops. We suspect the network link to EBS (which is dynamic SAN storage, if I read correctly) is limiting the pipeline. Our app mostly uses streaming disk access (not random), so for us it works better when very little gets in the way of our talking to the disk controllers and handling I/O directly.
Also, when I create a volume and attach it, I see it appear in the instance (fine), then I make it into a dynamic disk pointing to my mount point and quick format it -- when I do this, does all the data on the volume get wiped? Because it certainly seems so when I attach it to another AMI. I must be missing something.
I'm curious whether anyone has experience running I/O-intensive apps on the EC2 cloud, and if so, what's the best way to set up the volumes?
Thanks!
I've had limited experience, but I have noticed one small thing:
The initial write is generally slower than subsequent writes.
So if you're streaming a lot of data to disk, like writing logs, this will likely bite you. But if you create a big file, fill it with data, and do a lot of random-access I/O to it, it gets better the second time you write to any specific location.
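One way to take that first-write penalty out of the critical path is to touch every block of the volume once up front, before the streaming workload starts. A hedged sketch from a Linux instance follows (the device name is illustrative; on Windows the same idea applies with a dd-style tool):

```python
# Hedged sketch of the classic "pre-warm" workaround: touch every block of the
# volume once before the streaming workload starts, so later I/O doesn't pay
# the first-access penalty. Device name is illustrative; on current EBS this
# mainly matters for volumes restored from snapshots, where a full read is
# sufficient.
DEVICE = "/dev/xvdf"     # illustrative EBS device name
CHUNK = 1024 * 1024      # 1 MiB reads

read_bytes = 0
with open(DEVICE, "rb", buffering=0) as dev:
    while True:
        block = dev.read(CHUNK)
        if not block:
            break
        read_bytes += len(block)
print(f"pre-warmed {read_bytes / 1024**3:.1f} GiB")
```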

How to measure the power consumption of a virtual machine in a cloud environment?

I'm running a few Blockchain-related containers in a cloud environment (Google Compute Engine). I want to measure the power/energy consumption of the containers or the instance that I'm running.
There are tools like PowerAPI that can do this on real infrastructure, where they have access to real CPU counters. It should also be possible to do it by estimation based on CPU, memory, and network usage; there's one such model proposed in the literature.
Is this theoretically possible? If so, are there already existing tools for this task?
A generalized answer is "No, it is impossible". A program running in a virtual machine deals with virtual hardware, and the goal of virtualization is to abstract running programs from physical hardware, while access to physical equipment is required to measure energy consumption. For instance, without processor affinity enabled, it's unlikely that PowerAPI will be able to collect useful statistics, due to virtual CPU migrations. Time-slicing, which allows multiple vCPUs to run on one physical CPU, is another factor to keep in mind. And that says nothing of the energy consumed by RAM, I/O controllers, storage devices, etc.
A substantive answer is "No, it is impossible". The authors of the manuscript use PowerAPI libraries to collect CPU statistics and a monitoring agent to count bytes transmitted over the network. The HWPC-Sensor that PowerAPI relies on has an explicit requirement:
"monitored node(s) run a Linux distribution that is not hosted in a virtual environment"
Also, the authors emphasize that they couldn't measure absolute values of power consumption; instead they used percentages to compare certain Java classes and methods in order to infer the origin of energy leaks in CPU-intensive workloads.
As for network I/O, the number of bytes used to estimate power consumption in their model differs significantly between the network interface exposed to the virtual guest and the host's hardware port on the network, with the SDN stack in between.
Measuring in cloud instances is more complicated, yes. If you do have access to the underlying hardware, then it would be possible. There is a fairly recent project that lets you get power consumption metrics from a qemu/kvm hypervisor and share the metrics relevant to each virtual machine through another agent in the VM: github.com/hubblo-org/scaphandre/
They are also working (very early) on a "degraded mode" to estimate* the power consumption of the instance/VM when such communication between the hypervisor and the VM is not possible, as when working on a public cloud instance.
*: Based on resource consumption (CPU/RAM/GPU, ...) in the VM and on the characteristics of the instance and the underlying hardware's power consumption "profile".
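To make that degraded mode concrete, an estimator of that shape could look roughly like the sketch below. Every coefficient here is a hypothetical placeholder standing in for a real hardware power profile, so treat the output as a relative indicator at best:

```python
# Purely illustrative sketch of such a "degraded mode" estimate: combine
# resource utilisation sampled inside the VM with an assumed power profile of
# the underlying hardware. Every coefficient below is a hypothetical
# placeholder, not a measurement.
import psutil  # third-party: pip install psutil

IDLE_WATTS = 10.0         # assumed baseline share of the host attributed to this VM
CPU_MAX_WATTS = 35.0      # assumed extra draw at 100% CPU
RAM_WATTS_PER_GB = 0.4    # assumed per-GiB contribution
NET_WATTS_PER_GB = 2.0    # assumed contribution per GiB transferred per second

def estimate_power_watts(interval=1.0):
    net_before = psutil.net_io_counters()
    cpu_frac = psutil.cpu_percent(interval=interval) / 100.0  # blocks for `interval` seconds
    net_after = psutil.net_io_counters()
    ram_gb = psutil.virtual_memory().used / 1024 ** 3
    net_gb = ((net_after.bytes_sent + net_after.bytes_recv)
              - (net_before.bytes_sent + net_before.bytes_recv)) / 1024 ** 3
    return (IDLE_WATTS
            + CPU_MAX_WATTS * cpu_frac
            + RAM_WATTS_PER_GB * ram_gb
            + NET_WATTS_PER_GB * net_gb / interval)

print(f"estimated instance power: {estimate_power_watts():.1f} W")
```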
Can we conclude that it is impossible to measure the power/energy consumption of a process running on a virtual machine?