Neo4j performance discrepancies local vs cloud - google-cloud-platform

I am encountering drastic performance differences between a local Neo4j instance running on a VirtualBox-hosted VM and a basically identical Neo4j instance hosted in Google Cloud (GCP). The task is a simple load from a Postgres instance that is also located in GCP. The entire load takes 1-2 minutes on the VirtualBox-hosted VM and 1-2 hours on the GCP VM. The local hardware is a 10-year-old 8-core, 16 GB desktop running VirtualBox 6.1.
In both VirtualBox and GCP I perform essentially the same steps:
provision a 4-core, 8 GB Ubuntu 18.04 LTS instance
install Neo4j Community Edition 4.0.2
use wget to download the latest APOC and Postgres JDBC jars into the plugins dir
(only in GCP is neo4j.conf changed from the defaults: I uncomment the "dbms.default_listen_address=0.0.0.0" line to permit non-localhost connections, and create the corresponding GCP firewall rule)
restart neo4j service
install and start htop and iotop for hardware monitoring
log in to the empty Neo4j instance via the browser console
load the JDBC driver and run the load statement
The load statement uses apoc.periodic.iterate to call apoc.load.jdbc. I've varied the "batchSize" parameter in both environments from 100 to 10000 but only saw marginal changes in either system. The "parallel" parameter is set to false because true causes lock errors.
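For reference, the statement is shaped roughly like this (the label, properties, table name, and connection details below are placeholders, not my real schema):

// register the Postgres JDBC driver with APOC
CALL apoc.load.driver('org.postgresql.Driver');

// pull rows from Postgres in batches and MERGE them into the graph
CALL apoc.periodic.iterate(
  "CALL apoc.load.jdbc('jdbc:postgresql://<pg-host>:5432/<db>?user=<user>&password=<pass>', 'SELECT * FROM source_table') YIELD row RETURN row",
  "MERGE (n:Record {id: row.id}) SET n.col_a = row.col_a, n.col_b = row.col_b",
  {batchSize: 1000, parallel: false}
);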
Watching network I/O, both take the first ~15-25 seconds to pull the ~700k rows (8 columns) from the database table. Watching CPU, both keep one core maxed at 100% while another core varies from 0-100%. Watching memory, neither uses more than 4 GB and swap stays at 0. Initially I did apply the config recommendations from "neo4j-admin memrec", but those didn't noticeably change memory usage or overall execution time either.
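(For context, memrec's output is just heap and page cache settings for neo4j.conf along these lines; the sizes below are illustrative, not the exact values it recommended for my machine:)

dbms.memory.heap.initial_size=3g
dbms.memory.heap.max_size=3g
dbms.memory.pagecache.size=2g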
Watching disk is where the differences show up, but I think these are symptoms rather than the root cause: the local VM consistently writes 1-2 MB/s throughout the entire execution (1-2 minutes), while the GCP VM writes in bursts of 300-400 KB/s for about 1 second every 20-30 seconds. I don't think the GCP disks themselves are slow or the problem (I've tried both GCP's standard disk and their SSD disk). If the GCP disks were slow, I would expect sustained write activity and a huge write queue. Instead, whenever something does need to be written to disk in GCP, it gets done quickly. The bottleneck seems to be before the disk writes.
All I can think of is that my 10-year-old cores are somehow much faster than a current GCP vCPU, or that something heap-related is going on. I don't know much about Java except that heaps are important and can be finicky.

Do you have the exact same :schema on both systems? If you're missing a critical index used by your load query, that could easily explain the differences you're seeing.
For example, if you're using a MATCH or a MERGE on a node by a certain property, it's the difference between a quick index lookup of the node and a label scan of all nodes of that label, checking every single one to see if it's the node you want. Keep in mind that this repeats for every single row, so in the worst case it's not a single label scan, it's n times that.
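A minimal sketch, assuming your MERGE keys on something like an id property on a Record label (substitute whatever your load statement actually uses):

// compare what exists on each system (:schema in the browser shows the same)
CALL db.indexes();

// create the index your MERGE/MATCH needs so each row becomes an index seek instead of a label scan
CREATE INDEX record_id FOR (n:Record) ON (n.id);

If you do add an index, you can CALL db.awaitIndexes() to make sure it's online before re-running the load.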

Related

Redis ElastiCache in AWS - how to get persistence and keep good latency

I am currently using a Redis cluster with 2 node groups and a replica per node.
I chose Redis because of its high performance. I now have a new requirement to persist the data in Redis. I want to keep the good latency Redis gives me and still have some procedure that saves the data in the background. The built-in backup snapshots are not good enough anymore, since there is a maximum of 20 backups per 24 hours; I need the data to be synced approximately every minute.
The data needs to be stored in a way that a restart of the system will not lose it and that it can be restored at any time.
So if I summarize the requirements:
Keep working with Redis ElastiCache
Keep the high performance and low latency
Have the data persist (including when the system is down or restarted)
Sync the data at roughly one-minute intervals
Be able to restore the data back to Redis if it is lost
From googling, I was looking at manually running BGSAVE from a side Docker container in EC2, or at running a replica on another EC2 machine, and then having a Lambda take the RDB file/data and save it in S3.
Will this fit my needs?
What do the experts suggest? What are your ideas?
You can get close to your requirements by enabling AOF persistence.
This is done in the cluster's parameter group:
appendonly yes
appendfsync always|everysec
You will have to restart the nodes as well.
As you can see, Redis only gives you two options here for file system sync: on every write, or once every second.
Syncing on every write will be quite slow, so go with everysec if you want to keep good performance.

EFS taking a long time to respond to requests

Description:
I have a project running on an AWS server which is taking too long to respond to requests.
The project consists of dynamic form fields which are indexed with Solr. Simple Solr queries are taking around 2 minutes to execute. I have kept the Solr index in an EFS-mounted folder on the AWS instance.
I don't want to move Solr to EBS.
Solutions I have tried:
EFS Performance Mode: both General Purpose and Max I/O still take around 2 minutes.
EFS Throughput Mode: tried Bursting mode, which takes roughly the same time.

Adding a GPU to an existing VM instance on Google Compute Engine slows performance

I created a VM instance on Google Compute Engine with a single initial GPU and would like to add a second GPU of the same type. The VM instance is running Windows Server 2016 and has 8 CPUs and 52 GB of memory. I have followed the steps to add a GPU at the following location:
https://cloud.google.com/compute/docs/gpus/add-gpus
When using the updated VM instance, performance is very slow (click on a window or button and it opens 10 seconds later). The CPU does not appear to be heavily utilized. If I remove the second GPU so that only a single GPU is used, performance goes back to normal.
Am I missing a step (maybe updating something in Windows)?

CouchDB load spike (even with low traffic)?

We've been running CouchDB v1.5.0 on AWS and it's been working fine. Recently AWS came out with new prices for their new m3 instances, so we switched our CouchDB instance to an m3.large. We have a relatively small database with < 10GB of data in it.
Our steady-state metrics are a system load of about 0.2 and memory usage of 5% or so. However, we noticed that every few hours (3-4 times per day) we get a huge spike that pushes our load to 1.5 or so and memory usage close to 100%.
We don't run any cronjobs that involve the database, and our traffic is about the same over the course of the day. We do run a continuous replication from one database on the west coast to another on the east coast.
This has been stumping me for a bit - any ideas?
Just wanted to follow up on this question in case it helps anyone.
While I didn't figure out the direct answer to my load-spike question, inspecting the logs did reveal another bug that I was able to solve.
In my case, running "sudo service couchdb stop" was not actually stopping CouchDB. On top of that, every couple of seconds a new CouchDB process would try to spawn, only to be blocked by the existing one.
Ultimately, removing the respawn flag in /etc/init.d/couchdb fixed this error.

Data-intensive process in EC2 - any tips?

We are trying to run an ETL process on a High I/O instance on Amazon EC2. The same process run locally on a very well-equipped laptop (with an SSD) takes about 1/6th the time. The process basically transforms data (30 million rows or so) from flat tables into a third-normal-form schema in the same Oracle instance.
Any ideas on what might be slowing us down?
Or another option is to simply move off of AWS and rent beefy boxes (raw hardware) with SSDs from something like Rackspace.
We have moved most of our ETL processes off of AWS/EMR. We now host most of them on Rackspace and get a lot more CPU/storage/performance for the money. Don't get me wrong, AWS is awesome, but there comes a point where it's not cost-effective. On top of that, you never know how they are really managing/virtualizing the hardware underneath your specific application.
My two cents.