Redis can give sub-millisecond response times. That's a great promise. I'm testing Heroku Redis and I get anywhere from 1 ms up to about 8 ms for a ZINCRBY. I'm using microtime() in PHP to wrap the call. This Heroku Redis (I'm on the free plan) is a shared instance with resource contention, so I expect response times for identical queries to vary, and they certainly do.
I'm curious about the cause of the difference in performance versus Redis installed on my MacBook Pro via Homebrew, where there's obviously no network latency. Does this mean that any cloud Redis (i.e. connecting over the network, say within AWS) will always be quite a bit slower than having one cloud server and running Redis on the same physical machine, thus eliminating network latency?
There is also resource contention in these cloud offerings, unless a private server is chosen, which costs a lot more.
Some numbers: my local MacBook Pro consistently gives 0.2 ms for the identical ZINCRBY that takes between 1 ms and 8 ms on the Heroku Redis.
Is network latency the cause of this?
No, probably not.
The typical latency of a 1 Gbit/s network is about 200 µs, i.e. 0.2 ms.
What's more, in AWS you're probably on at least a 10 Gbit/s network.
As this page in the Redis manual explains, the main cause of the latency variation between these two environments is almost certainly the higher intrinsic latency that comes from running Redis in a Linux container. (There is a Redis command to test this on any particular system: redis-cli --intrinsic-latency 100; see the manual page above.)
In other words, network latency is not the dominant cause of the variation seen here.
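If you want to quantify this yourself, here is a minimal timing sketch, assuming the redis-py client in Python (the question uses PHP with microtime(), but the approach is the same; host and key names are placeholders):

    import time
    import redis  # third-party redis-py client: pip install redis

    # Point this at the instance you want to measure (localhost here as a placeholder).
    r = redis.Redis(host="localhost", port=6379)

    samples = []
    for _ in range(100):
        start = time.perf_counter()
        r.zincrby("latency:test", 1, "member")  # the same ZINCRBY the question measures
        samples.append((time.perf_counter() - start) * 1000.0)  # milliseconds

    print("min %.2f ms, max %.2f ms, avg %.2f ms"
          % (min(samples), max(samples), sum(samples) / len(samples)))

Run it against both the local Homebrew instance and the hosted one and compare the spread; redis-cli --intrinsic-latency 100 on each host then separates container overhead from network cost.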
Here is a checklist from the Redis manual page linked above; a short sketch of the aggregation and pipelining items follows the list.
If you can afford it, prefer a physical machine over a VM to host the server.
Do not systematically connect/disconnect to the server (especially true for web-based applications). Keep your connections as long lived as possible.
If your client is on the same host as the server, use Unix domain sockets.
Prefer to use aggregated commands (MSET/MGET), or commands with variadic parameters (if possible) over pipelining.
Prefer to use pipelining (if possible) over sequence of roundtrips.
Redis supports Lua server-side scripting to cover cases that are not suitable for raw pipelining (for instance when the result of a command is an input for the following commands).
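To make the aggregation and pipelining items concrete, here is a rough sketch with the redis-py client (an assumption; the question uses PHP, but equivalent calls exist in most clients):

    import redis  # third-party redis-py client: pip install redis

    r = redis.Redis()

    # Worst case: one network round trip per command.
    for i in range(100):
        r.set("key:%d" % i, i)

    # Better: a single variadic/aggregated command (MSET) -> one round trip.
    r.mset({"key:%d" % i: i for i in range(100)})

    # For commands MSET cannot express, pipeline them -> still one round trip.
    pipe = r.pipeline(transaction=False)
    for i in range(100):
        pipe.zincrby("scores", 1, "member:%d" % i)
    pipe.execute()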
I have a newbie question here. I'm new to cloud and Linux; I'm using Google Cloud now and wondering, when choosing a machine configuration:
What if my machine is too slow? Will it make the app crash, or just slow it down?
How fast should my VM be? The image below shows
the last 6 hours of CPU usage for a Python script I'm running. It is obviously using less than 2% of the CPU most of the time, but there's a small spike. Should I care about the spike? Also, how high should my CPU usage get before I upgrade? If a script I'm running is using 50-60% of the CPU most of the time, I assume I'm safe, or what's the maximum before you upgrade?
What if my machine is too slow? Will it make the app crash, or just slow it down?
It depends.
Some applications will just respond more slowly. Some will fail if they have timeout restrictions. Some will begin to thrash, which means the app suddenly becomes very, very slow.
A general rule, which varies among architects, is to never consume more than 80% of any resource. I use a 50% rule so that my service can handle burst traffic or denial-of-service attempts.
Based on your graph, your service is fine. The spike is probably normal system processing. If the spike went to 100%, I would be concerned.
Once your service consumes more than 50% of a resource (CPU, memory, disk I/O, etc.), it is time to upgrade that resource.
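As an illustration only, here is a small monitoring sketch of that rule (psutil and the 50% threshold are assumptions for the example, not part of the original answer):

    import psutil  # third-party: pip install psutil

    UPGRADE_THRESHOLD = 50.0  # percent; the conservative rule described above

    cpu = psutil.cpu_percent(interval=1)      # CPU usage sampled over one second
    mem = psutil.virtual_memory().percent     # current RAM usage

    for name, value in (("CPU", cpu), ("memory", mem)):
        if value > UPGRADE_THRESHOLD:
            print("%s at %.0f%% - consider upgrading this resource" % (name, value))
        else:
            print("%s at %.0f%% - headroom left for burst traffic" % (name, value))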
Also, consider that there are other services you might want to add. Examples are load balancers, Cloud Storage, CDNs, and firewalls such as Cloud Armor. Those services tend to offload requirements from your service and make it more resilient, available, and performant. The biggest plus is that your service is usually faster for the end user. Some of those services are so cheap that I almost always deploy them.
You should choose a machine family based on your needs. Check the link below for details and recommendations.
https://cloud.google.com/compute/docs/machine-types
If CPU is your concern, you should create a managed instance group that automatically scales based on CPU usage. Usually 80-85% is a good value for the maximum CPU target. Check the link below for details.
https://cloud.google.com/compute/docs/autoscaler/scaling-cpu
You should also consider the availability your workload needs in order to keep costs efficient. See the link below for other useful information.
https://cloud.google.com/compute/docs/choose-compute-deployment-option
For some reason, download traffic from a virtual machine on GCP (Google Cloud Platform) running Debian 9 is limited to about 50 KB/s. Upload seems to be fine, in line with my local upload link.
It is the same with scp or HTTPS downloads. Any suggestions on what might be wrong, or where to look?
Machine type: n1-standard-1 (1 vCPU, 3.75 GB memory)
CPU platform: Intel Skylake
Zone: europe-west4-a
Network interfaces: Premium tier
Thanks,
Mihaelus
Simple test:
wget https://hrcki.primasystems.si/Nova/assets/download.test.html
Output:
--2018-10-18 15:21:00--  https://hrcki.primasystems.si/Nova/assets/download.test.html
Resolving hrcki.primasystems.si (hrcki.primasystems.si)... 35.204.252.248
Connecting to hrcki.primasystems.si (hrcki.primasystems.si)|35.204.252.248|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 541422592 (516M) [text/html]
Saving to: `download.test.html.1'
0% []  1,073,152  48.7K/s  eta 2h 59m
It is always good to minimize variables when trying to diagnose. So while it is unlikely that the use of HTTP is why things are that slow, you might consider using netperf or iperf3 to measure TCP bulk-transfer performance between your VM in GCP and your local system. You can do that either by hand or via PerfKit Benchmarker: https://cloud.google.com/blog/products/networking/perfkit-benchmarker-for-evaluating-cloud-network-performance
It can be helpful to have packet traces - from both ends when possible - to look at. You want the packet traces to be started before the test; it is important to see the packets used to establish the TCP connection(s). They do not need to be "full packet" traces, and often you don't want them to be. Capturing just the first 96 bytes of each packet would be sufficient for this sort of investigation.
You might also consider taking snapshots of the network statistics offered by the OSes running in your GCP VM and your local system. For example, if running *nix, take a snapshot of "netstat -s" before and after the test. And perhaps run a traceroute from each end towards the other.
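A small sketch of that idea (Python here purely for illustration; redirecting "netstat -s" to files and diffing them in a shell works just as well):

    import subprocess

    def netstat_snapshot():
        # Capture OS network statistics; assumes a *nix system with netstat installed.
        result = subprocess.run(["netstat", "-s"], capture_output=True, text=True)
        return set(result.stdout.splitlines())

    before = netstat_snapshot()
    input("Run the transfer test now, then press Enter...")
    after = netstat_snapshot()

    # Counter lines that changed during the test; a coarse view, but it highlights
    # things like retransmissions without parsing every field.
    for line in sorted(after - before):
        print(line)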
Network statistics and packet traces, along with as many details about the two endpoints as possible are among the sorts of things support organizations are likely to request when looking to help resolve an issue of this sort.
I'm running a Django app in an EC2 instance, which uses RabbitMQ + Celery for task queuing. Are there any drawbacks to running my RabbitMQ node from the same EC2 instance as my production app?
The answer to this question really depends on the context of your application.
When you're faced with a scenario like this, you should always consider a few things.
Separation of concerns
Here, we want to make sure that no single system is responsible for the running of the other systems. This includes things like:
If the EC2 instance running everything goes down, will the remaining tasks in the queue continue running?
If my RAM is full, will all systems remain functioning?
Can I scale just one segment of my app without having to redesign the infrastructure?
By having RabbitMQ and Django (behind some kind of server: WSGI, Gunicorn, Waitress, etc.) all on one box, you lose a lot of resource contingency.
Although RAM and CPU may be abundant, there is a limit to I/O, disk writes, network writes, etc. This means that if for some reason you have a write-heavy function, all other systems may suffer as a result. The same applies if you have a function that writes heavily to RAM.
So the downsides of keeping everything on one system, based on your question and my own experience, are as follows:
A single point of failure for multiple components: if your one RabbitMQ instance fails, your queues and tasks stop working.
If your app starts generating heavy traffic, other systems start to contend for resources.
If any component goes down, that could mean downtime for other services.
System downtime means complete downtime of all components.
Lots of headaches when your application demands more resources with minimal downtime.
Lots of web traffic will slow down task running
Lots of task running will slow down web requests
Lots of IO will slow down all the things
The rule of thumb that I usually follow is to keep single points of failure far from each other - that way you only need to manage those components. A good use case for this would be to use one EC2 instance for your app, another for your workers, and another for your RabbitMQ. That way you can apply smaller or bigger instances to just those components as needed. You can even create AMIs and autoscaling groups, if that fits your use case.
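For instance, here is a minimal Celery configuration that points the broker at a dedicated RabbitMQ instance (hostname and credentials below are placeholders, not taken from the question):

    from celery import Celery  # pip install celery

    # The broker lives on its own EC2 instance; the Django app and the workers only
    # need its URL, so each tier can be resized or scaled independently.
    app = Celery(
        "myapp",
        broker="amqp://user:password@rabbitmq.internal.example.com:5672//",
    )

    @app.task
    def add(x, y):
        return x + y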
Here are some articles for reference
Separation of concerns
Modern design architectures
Single points of failure
TL;DR: If you can run on one EC2 instance, you should, but make it easy to scale from day one.
Both Joshnidhin and Giannis covered the RAM, IO and CPU aspects.
I have run production apps on single instances with containerization and slept peacefully, knowing that if lots of people suddenly want what I have built, I can scale quickly by deploying those containers on different instances instead of one single instance.
Docker allows you to put a limit on CPU consumption and memory usage for each container, so you can also be sure that the services will not step on each other.
If we take EC2 instance out of this question it becomes:
Are there any drawbacks to running a RabbitMQ node on the same server as my production app?
I would say it depends on various things: the kind of workload and its composition, the complexity of the workload, whether you expect growth in usage, and so on.
If your workload is well behaved and the server is big enough for both (app + task queue), then why not; there will be only one server to manage. Make sure to protect these two processes from each other by limiting their system resource usage.
If your traffic is not well behaved, then you might want more than one server. In that case, having dedicated servers is better (separation of concerns), even though you will have to manage more than one server.
Now back to EC2: all of the above still applies. EC2 makes horizontal scaling of applications easier, so if you have the app and the queue on separate instances you can scale them individually and cost-effectively. If not, scaling will waste resources.
Recently we upgraded to ColdFusion 11 Enterprise and noticed that the full-fledged sandbox security tends to have a way bigger overhead than the Standard edition (CF10).
What can one do to make an existing CF app perform well with sandbox security?
Here are my findings so far:
Install VisualVM by adding -Dcom.sun.management.jmxremote.port=8701 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false to CF Admin's JVM arguments. Learn how to use it and pay special attention to the CPU Snapshot & Hotspot tabs: http://boncode.blogspot.ca/2010/04/cf-java-using-free-visualvm-tool-to.html. FYI, the CF Server Monitor in the Enterprise edition is effectively useless here because its memory/performance profiling overhead is far too big to be viable on a live production server, and it doesn't perform well enough under load to give you any useful data about what could be going wrong.
Disable IPv6, and add [serverip] [serverip] to the OS's hosts file to speed up the reverse DNS lookup the Security Manager performs when creating a new physical DB connection. See: On Linux, Java issues reverse DNS lookups when a socket is opened. Why, and how can I stop it? (FYI, Windows is affected too.)
Remove as many <cfmodule> and <cfinclude> calls as possible, as they result in many java.io.File.canRead() and java.io.File.exists() calls, which stress disk I/O under load. Even SSDs suffer under load. I have tried Trusted Cache and it does not help. Instead, try using cached CFCs in the application scope, and make sure the code is thread safe and local-var'ed.
Eliminate the use of <cfinterface>, inheritance with extends, and getMetaData() as much as possible, as they eventually call java.io.File.lastModified(), which stresses disk I/O under load. Bug?
Eliminate the use of access="package", as it results in many java.security.AccessController.checkPermission calls.
The fewer objects per request, the better, as each object instantiation carries a higher cost due to the extra java.security.AccessController.checkPermission call.
First my setup that is used for testing purpose:
3 Virtual Machines running with the following configuration:
MS Windows 2008 Server Standard Edition
Latest version of AppFabric Cache
Each one has a local network share where the config file is stored (I have added all the machines in each config)
The cache is distributed but not highly available (we don't have the Enterprise version of Windows)
Each host is configured as a lead, so according to the documentation the cluster should tolerate at least one host crashing.
Each machine has the website I am testing installed, and a local cache configured
One Linux machine is used as a proxy (running Varnish) to distribute the traffic for testing purposes.
That's the setup; now on to the problem. The scenario I am testing simulates one of the servers crashing and then bringing it back into the cluster. I have problems both with the server crashing and with bringing it back up. Steps I am using to test it:
Direct the traffic with Varnish on the Linux machine to one server only.
Log in to make sure there is something in the cache.
Unplug the network cable for one of the other servers (simulates that server crashing)
Now I get a cache timeout and a service error. I want the application to stay up on the servers that didn't crash, but it takes some time for the cache to come back up on the remaining servers. Is that how it should be? Plugging the network cable back in and starting the host causes a similar problem.
So my question is whether I have missed something. What I would like to see is that if one server crashes, the cache stays up, since a majority of the leads are still up, and starting the crashed server again brings it back gracefully into the cluster without causing any problems on the other hosts. But maybe that is not how it works?
I ran through a similar test scenario a few months ago where I had a test client generating load on a three-lead-server cluster with a variety of Puts, Gets, and Removes. I rebooted one of the servers multiple times while the load test was running, and the cache stayed online. If I remember correctly, there were a limited number of errors as that server rebooted, but overall the cache appeared to remain healthy.
I'm not sure why you're not seeing similar results, but I would try removing the Varnish proxy from your test and see if that helps.