Over-Provisioned VMware 6.0 VM Storage, VM Won't Start?

I recently updated a free-licensed VMware ESXi host to 6.0 (I do not have access to vCenter). The host has 6 datastores available, the first two of which reside on SSDs and are fairly small (I typically use those for my VM OS disks, and any VMs that need more storage can use one of the mechanical datastores). The upgrade went fine and all my machines started.
I decided to shut down one of the machines and expand its OS storage. My datastore1 has a bit more than 70GB free, so I extended the VM's guest disk size from 160GB to 229GB, figuring I'd still have some wiggle room there. I guess that was my first mistake. I was unaware that while you can easily increase a virtual disk's size, decreasing it is not possible. Now my VM won't start!
Failed to start the virtual machine.
Failed to power on VM.
Could not power on virtual machine: msg.vmk.status.VMK_NO_SPACE.
Current swap file size is 0 KB.
Failed to extend swap file from 0 KB to 16777216 KB.
Now I've tried multiple things, from removing snapshots to free up some space, to migrating the virtual disk to another datastore and then using vCenter Converter to move it back onto a smaller disk (that failed horribly; it took several hours, and when all was said and done the VM could only PXE boot and reported that no operating system was found).
I still have a few copies of the virtual disk, but they're all 230GB virtual disks. If I change the VM settings to run the virtual disk off one of the larger mechanical datastores, it still works fine (the OS boots, etc.), but I really want to get this thing back down to 160GB and moved back to my SSD datastores.
Now, I have NOT utilized the extra space provisioned to this VM; fdisk still shows a 160GB drive and partitions, so I haven't even touched the extra provisioned space yet. I am not trying to shrink the partition; I want to reduce the space provisioned to this VM, and ultimately the VMDK file, so I can move it back to my SSD datastore and fire it up again.
I have searched all over, but I may be using the wrong terminology, because many of my results end in "it's not possible without data loss." Since I haven't used the extra provisioned space, I feel it simply has to be possible. Maybe I'm wrong. Can anyone point me in the right direction?

I don't know of a documented way to shrink a disk without VMware Converter, but VMware Converter should work. Have you verified you gave all the correct arguments (most notably the new size)? You can try mounting the resulting VMDK on a different VM (as a data disk) to see if there's anything wrong with it.
Have you considered making the disk thin-provisioned? See this VMware KB for how to achieve this without vCenter (you'll need to SSH into the ESXi host). Since the last 69GB of the disk are zeros, this can help you reclaim that space.
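For reference, here is a rough sketch of what that looks like if you script it; the host address, credentials, and datastore/VM paths below are hypothetical, and the core of it is a vmkfstools clone to a thin-provisioned target, which is one common way to do the conversion over SSH.

```python
# Minimal sketch: clone a VMDK to a thin-provisioned copy over SSH.
# Host address, credentials, and datastore/VM paths are hypothetical.
import paramiko

ESXI_HOST = "esxi.example.local"                       # hypothetical ESXi host
SRC = "/vmfs/volumes/datastore3/myvm/myvm.vmdk"        # current (thick) disk
DST = "/vmfs/volumes/datastore1/myvm/myvm-thin.vmdk"   # thin-provisioned copy

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(ESXI_HOST, username="root", password="***")

# vmkfstools -i <source> -d thin <destination> clones the disk as thin.
stdin, stdout, stderr = client.exec_command(f"vmkfstools -i {SRC} -d thin {DST}")
print(stdout.read().decode())
print(stderr.read().decode())
client.close()
```

Note that thin provisioning only reduces the space the VMDK actually consumes on the datastore, not its provisioned size: the guest will still see a ~230GB disk.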
If all else fails and you're feeling adventurous, you might be able to manually edit the VMDK file and prune the last part of it.

Related

VMware snapshots: how much of the source VM's disk space do they usually occupy? And can they be used to roll back software?

I would like to update Samba on a 3TB NAS. My boss suggested making a clone, but there is no storage large enough to hold a full copy. If a VM snapshot takes up less space and can be used, in case of failure, to restore Samba to the way it was, that would make it the better option.
There's no real guide to how much space snapshots occupy. It depends greatly on the activity on the VM where the snapshot has been taken. If it's an active VM (a database or something of the sort), there could be a considerable amount of data written. If it's a lightly used VM, there could be little to no data written to the backing datastore.

Neo4j performance discrepancies local vs cloud

I am encountering drastic performance differences between a local Neo4j instance running on a VirtualBox-hosted VM and a basically identical Neo4j instance hosted in Google Cloud (GCP). The task involves performing a simple load from a Postgres instance also located in GCP. The entire load takes 1-2 minutes on the VirtualBox-hosted VM instance and 1-2 hours on the GCP VM instance. The local hardware setup is a 10-year-old 8 core, 16GB desktop running VirtualBox 6.1.
With both VirtualBox and GCP I perform the same steps:
provision a 4 core, 8GB Ubuntu 18 LTS instance
install Neo4j Community Edition 4.0.2
use wget to download the latest apoc and postgres jdbc jars into the plugins dir
(only in GCP do I change the neo4j.conf file from the defaults: I uncomment the "dbms.default_listen_address=0.0.0.0" line to permit non-localhost connections, and create a corresponding GCP firewall rule)
restart neo4j service
install and start htop and iotop for hardware monitoring
log in to the empty Neo4j instance via the browser console
load jdbc driver and run load statement
The load statement uses apoc.periodic.iterate to call apoc.load.jdbc. I've varied the "batchSize" parameter in both environments from 100-10000 but only saw marginal changes in either system. The "parallel" parameter is set to false because true causes lock errors.
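For context, the load statement is roughly of this shape (the JDBC URL, table name, node label, and properties here are placeholders rather than my real ones):

```python
# Rough shape of the load, driven from the Python driver for illustration.
# The JDBC URL, table name, label, and properties are placeholders.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

LOAD_STATEMENT = """
CALL apoc.periodic.iterate(
  "CALL apoc.load.jdbc('jdbc:postgresql://10.0.0.5/mydb?user=me&password=pw',
                       'SELECT * FROM source_table') YIELD row RETURN row",
  "MERGE (r:Record {id: row.id}) SET r += row",
  {batchSize: 1000, parallel: false}
)
"""

with driver.session() as session:
    result = session.run(LOAD_STATEMENT)
    print(result.consume().counters)
driver.close()
```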
Watching network I/O, both take the first ~15-25 seconds to pull the ~700k rows (8 columns) from the database table. Watching CPU, both keep one core maxed at 100% while another core varies from 0-100%. Watching memory, neither takes more than 4GB and swap stays at 0. Initially, I did use the config recommendations from "neo4j-admin memrec" but those didn't seem to significantly change anything either in mem usage or overall execution time.
Watching disk is where there are differences, but I think these are symptoms and not the root cause: the local VM consistently writes 1-2 MB/s throughout the entire execution time (1-2 minutes), while the GCP VM writes in bursts of 300-400 KB/s for 1 second every 20-30 seconds. But I don't think the GCP disks are slow or the problem (I've tried both GCP's standard disk and their SSD disk). If the GCP disks were slow, I would expect to see sustained write activity and a huge write queue. It seems that whenever something should be written to disk, it gets done quickly in GCP; the bottleneck appears to be before the disk writes.
All I can think of is that my 10-year-old cores are somehow way faster than a current GCP vCPU, or that there is some memory/heap issue going on. I don't know much about Java except that heaps are important and can be finicky.
Do you have the exact same :schema on both systems? If you're missing a critical index used in your load query, that could easily explain the differences you're seeing.
For example, if you're using a MATCH or a MERGE on a node by a certain property, it's the difference between doing a quick lookup of the node via the index and performing a label scan of all nodes of that label, checking every single one to see whether the node exists or is the right node. Also understand that this process repeats for every single row, so in the worst case it's not a single label scan but n label scans.
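As a quick check, list the indexes on both instances and, if one is missing on the GCP side, create it before re-running the load. A minimal sketch using the Python driver, assuming a hypothetical :Record label and id property backing the MERGE:

```python
# Minimal sketch: compare indexes on both instances and add a missing one.
# The label (:Record) and property (id) are hypothetical examples.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

with driver.session() as session:
    # Neo4j 4.0: db.indexes() lists existing indexes (same info as :schema).
    for record in session.run("CALL db.indexes()"):
        print(record["name"], record["labelsOrTypes"], record["properties"])

    # If the index backing the MATCH/MERGE is missing, create it
    # (this errors if an index with the same name already exists).
    session.run("CREATE INDEX record_id FOR (r:Record) ON (r.id)")

driver.close()
```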

Replace HDD with SSD on google cloud compute engine

I am running a Geth node on a Google Cloud Compute Engine instance and started with an HDD. It has grown to 1.5TB now, but it is damn slow. I want to move from HDD to SSD now.
How can I do that?
I found a solution along these lines:
- make a snapshot of the existing disk (HDD)
- edit the instance and attach a new SSD created from that snapshot
- disconnect the old disk afterwards
One problem I saw: if my HDD is 500GB, for example, it does not allow an SSD smaller than 500GB. My data is in TBs now, so it will cost a lot.
But I want to understand whether it actually works, because this is a node I want to use in production. I have already waited too long and cannot afford to wait any more.
You should try to use Zonal SSD persistent disks.
As the documentation states:
Each persistent disk can be up to 64 TB in size, so there is no need to manage arrays of disks to create large logical volumes.
The description of the issue is confusing, so I will try to help from my current understanding of the problem. First, you can use a boot disk snapshot to create a new boot disk that meets your requirements; see here. The size limit for a persistent disk is 2 TB, so I don't understand your comment about the 500 GB minimum size. If your disk is 1.5 TB, it will meet that restriction.
Anyway, I don’t recommend having such a big disk as a booting disk. A better approach could be to use a smaller boot disk and expand the total capacity by attaching additional disks as needed, see this link.

Google Cloud hard disk deleted, all data lost

My Google Cloud VM's hard disk got full, so I tried to increase its size. I have done this before, but this time things went differently. I increased the size, but the VM was not picking up the new size, so I stopped the VM. The next thing I knew, my VM got deleted and recreated, and my hard disk returned to its previous size with all data lost. It had my database with over 2 months of changes.
I admit I was careless not to back up, but currently my concern is whether there is a way to retrieve the data. Google Cloud shows $400 for the Gold Plan, which includes tech support. If I can be certain that they will be able to recover the data, I am willing to pay. Does anyone know whether, if I pay the $400, the Google support team will be able to recover the data?
If there are other ways to recover data, kindly let me know.
UPDATE:
A few people have shown interest in investigating this.
This most likely happened because the "Auto-delete boot disk" option is selected by default, which I was not aware of. But even then, I would expect auto-delete to happen when I delete the VM, not when I simply stop it.
I am attaching a screenshot of all the activities that happened after I resized the boot partition.
As you can see, I resized the disk at 2:00AM.
After receiving the "resize successful" message, I stopped the VM.
Suddenly, at 2:01, the VM got deleted.
At that point I had not checked the notifications; I simply thought it had stopped. I then started the VM, hoping to see the newly resized disk.
Instead of starting my VM, a new VM was created with a new disk, and all the previous data was lost.
I tried stopping and starting the VM again, but the result was still the same.
UPDATE:
Adding activities before the incident.
It is not possible to recover deleted persistent disks (PDs).
You have no snapshots either?
The disk may have been marked for auto-delete.
However, this disk shouldn't have been deleted when the instance was stopped even if it was marked for auto-delete.
You can also only recover a persistent disk from a snapshot.
In a managed instance group (MIG), when you stop an instance, its health check fails and the MIG deletes and recreates the instance if the autoscaler is on. The process is discussed here. I hope that sheds some light, if that is your use case.
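Whatever happened here, it's worth confirming the auto-delete setting on your current boot disk so a future stop or delete can't take the disk with it. A minimal sketch using the gcloud CLI (instance, zone, and device names are hypothetical):

```python
# Sketch: inspect and clear the boot disk's auto-delete flag via gcloud.
# Instance, zone, and device names are hypothetical placeholders.
import subprocess

ZONE = "us-central1-a"
INSTANCE = "my-vm"
DEVICE = "my-vm"   # boot disk device name, often the same as the instance

# Show the attached disks (look at deviceName and autoDelete in the output).
subprocess.run(
    ["gcloud", "compute", "instances", "describe", INSTANCE,
     "--zone", ZONE, "--format", "yaml(disks)"],
    check=True,
)

# Turn auto-delete off for the boot disk.
subprocess.run(
    ["gcloud", "compute", "instances", "set-disk-auto-delete", INSTANCE,
     "--zone", ZONE, "--device-name", DEVICE, "--no-auto-delete"],
    check=True,
)
```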

Amazon EC2 and EBS using Windows AMIs

I put our application on EC2 (Windows 2003 x64 server) and attached up to 7 EBS volumes. The app is very I/O-intensive to storage: typically we use DAS with NTFS mount points (usually around 32 mount points, each to a 1TB drive), so I tried to replicate that using EBS, but the I/O rates are bad, as in 22MB/s tops. We suspect the NIC path to the EBS volumes (which are dynamic SANs, if I read correctly) is limiting the pipeline. Our app mostly uses streaming disk access (not random), so it works better for us when very little gets in the way of our talking to the disk controllers and handling I/O directly.
Also, when I create a volume and attach it, I see it appear in the instance (fine), then I make it into a dynamic disk pointing to my mount point and quick-format it. When I do this, does all the data on the volume get wiped? Because it certainly seems so when I attach it to another AMI. I must be missing something.
I'm curious whether anyone has experience putting I/O-intensive apps on EC2 and, if so, what the best way to set up the volumes is.
Thanks!
I've had limited experience, but I have noticed one small thing:
The initial write is generally slower than subsequent writes.
So if you're streaming a lot of data to disk, like writing logs, this will likely bite you. But if you create a big file, fill it with data, and then do a lot of random-access I/O to it, performance improves the second time you write to any specific location.
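One common workaround, assuming that first-write penalty is what you're seeing, is to pre-initialize the file (or volume) once up front so later streaming writes hit already-touched blocks. A minimal sketch, with a hypothetical path and size:

```python
# Minimal sketch: pre-touch a large file so later writes avoid the
# first-write penalty. Path and size are hypothetical placeholders.
import os

PATH = r"E:\data\prewarmed.dat"   # hypothetical file on the EBS volume
SIZE = 10 * 1024**3               # 10 GiB
CHUNK = 4 * 1024 * 1024           # write in 4 MiB chunks

zeros = b"\0" * CHUNK
with open(PATH, "wb") as f:
    written = 0
    while written < SIZE:
        f.write(zeros)
        written += CHUNK
    f.flush()
    os.fsync(f.fileno())          # make sure the blocks really hit the volume
```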