Test connection latency between Amazon RDS instance and external server - amazon-web-services

I need to test the connection latency between a server located in my own datacenter and an Amazon RDS instance. I've tried
time telnet <dns-of-my.instance> 3306
but that measures the time from when I issued the command until I ended it, which is not what I need.
Is there a way of measuring this?

This answer does not assume that ICMP ping is allowed; it uses TCP-based measurements. You will, however, have to ensure there are security group rules allowing access from the machine running the tests to the RDS instance.
First, ensure some useful packages are installed:
apt-get install netcat-openbsd traceroute
Check that basic connectivity to the database port works. This example is for Oracle; make sure you use the endpoint and port from the console:
nc -vz dev-fulfil.cvxzodonju67.eu-west-1.rds.amazonaws.com 1521
Then see what the latency is. The number you want is the final one (hop 12):
sudo tcptraceroute dev-fulfil.cvxzodonju67.eu-west-1.rds.amazonaws.com 1521
traceroute to dev-fulfil.cvxzodonju67.eu-west-1.rds.amazonaws.com (10.32.21.12), 30 hops max, 60 byte packets
1 pc-0-3.ioppublishing.com (172.16.0.3) 0.691 ms 3.341 ms 3.400 ms
2 10.100.101.1 (10.100.101.1) 0.839 ms 0.828 ms 0.811 ms
3 xe-10-2-0-12265.lon-001-score-1-re1.interoute.net (194.150.1.229) 10.591 ms 10.608 ms 10.592 ms
4 ae0-0.lon-001-score-2-re0.claranet.net (84.233.200.190) 10.575 ms 10.668 ms 10.668 ms
5 ae2-0.lon-004-score-1-re0.claranet.net (84.233.200.186) 12.708 ms 12.734 ms 12.717 ms
6 169.254.254.6 (169.254.254.6) 12.673 ms * *
7 169.254.254.1 (169.254.254.1) 10.623 ms 10.642 ms 10.823 ms
8 * * *
9 * * *
10 * * *
11 * * *
12 * 10.32.21.12 (10.32.21.12) <syn,ack> 20.662 ms 21.305 ms
A better measure of "latency" might be "the time a typical transaction takes with no or little data to transfer". To measure that, write a small script that runs such an operation in a loop, say 1000 times, and time it with a high-precision timer (see the sketch below). The exact details will vary according to your needs.
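For illustration, here is a minimal bash sketch of that loop; it only measures TCP connection setup (not a real transaction), and the endpoint, port and iteration count are placeholders taken from the examples above, so adjust them for your instance:
#!/bin/bash
# Time N TCP connection setups to the database port and report the average.
# HOST and PORT are placeholders -- substitute your own RDS endpoint and port.
HOST=dev-fulfil.cvxzodonju67.eu-west-1.rds.amazonaws.com
PORT=1521
N=1000

start=$(date +%s%N)                                # nanoseconds since the epoch
for i in $(seq 1 "$N"); do
    nc -z -w 5 "$HOST" "$PORT" >/dev/null 2>&1     # open and close one TCP connection
done
end=$(date +%s%N)

echo "average connection setup: $(( (end - start) / N / 1000000 )) ms"
Each iteration pays a full TCP handshake, so the average is a reasonable proxy for the network round-trip time to the database port.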

Time the query. RDS must be hosting a SQL database server, so issue a trivial SQL query to it and time the execution.
For example, if your RDS instance is PostgreSQL, connect using psql and enable \timing.
psql -h myhost -U myuser
postgres=> \timing
Timing is on.
postgres=> SELECT 1;
?column?
----------
1
(1 row)
Time: 14.168 ms
The latency is 14.168 ms in this example. Consult the documentation for your specific SQL server implementation for its equivalent timing facility.
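Since the original question points at port 3306 (MySQL/MariaDB), the equivalent quick check there can be done with the mysql command-line client. This is only a sketch; the endpoint and credentials below are placeholders:
# Time a trivial round trip to a MySQL/MariaDB RDS instance.
# Replace host, user and password with your own values.
time mysql -h my-instance.cvxzodonju67.eu-west-1.rds.amazonaws.com \
     -u myuser -p'mypassword' \
     -e 'SELECT 1;' >/dev/null
Note that this timing also includes client start-up and the authentication handshake, so run it a few times and look at the trend; in an interactive session the client prints a per-statement time such as "1 row in set (0.01 sec)".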

Use ping. You will need to enable ping on your EC2 instance per this answer.
Ping will provide a time for each ping in milliseconds:
ping 34.217.36.7
PING 34.217.36.7 (34.217.36.7): 56 data bytes
64 bytes from 34.217.36.7: icmp_seq=0 ttl=227 time=68.873 ms
64 bytes from 34.217.36.7: icmp_seq=1 ttl=227 time=68.842 ms
64 bytes from 34.217.36.7: icmp_seq=2 ttl=227 time=68.959 ms
64 bytes from 34.217.36.7: icmp_seq=3 ttl=227 time=69.053 ms

Related

Load Testing SQL Alchemy: "TimeoutError: QueuePool limit of size 3 overflow 0 reached, connection timed out, timeout 30"

I have a SQLAlchemy-based web application that is running in AWS.
The web app has several c3.2xlarge EC2 instances (8 CPUs each) behind an ELB which take web requests and then query/write to the shared database.
The database I'm using is an RDS instance of type db.m4.4xlarge.
It is running MariaDB 10.0.17
My SQL Alchemy settings are as follows:
SQLALCHEMY_POOL_SIZE = 3
SQLALCHEMY_MAX_OVERFLOW = 0
Under heavy load, my application starts throwing the following errors:
TimeoutError: QueuePool limit of size 3 overflow 0 reached, connection timed out, timeout 30
When I increase the SQLALCHEMY_POOL_SIZE from 3 to 20, the error goes away for the same load-test. Here are my questions:
How many total simultaneous connections can my DB handle?
Is it fair to assume that (number of EC2 instances) * (number of cores per instance) * SQLALCHEMY_POOL_SIZE can go up to but cannot exceed the answer to question #1?
Do I need to know any other constraints regarding DB connection pool sizes for a distributed web app like mine?
MySQL can handle virtually any number of "simultaneous" connections. But if more than a few dozen are actively running queries, there may be trouble.
Without knowing what your queries are doing, one cannot say whether 3 is a limit or 300.
I recommend you turn on the slowlog to gather information on which queries are the hogs. A well-tuned web app can easily survive 99% of the time on 3 connections.
The other 1% -- well, there can be spikes. Because of this, 3 is unreasonably low.
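On RDS, the slow query log mentioned above is enabled through a DB parameter group rather than by editing server configuration files. A minimal sketch with the AWS CLI, assuming a custom parameter group (hypothetically named my-mariadb-params here) is already attached to the instance:
# Enable the slow query log and log queries taking longer than 1 second.
# "my-mariadb-params" is a placeholder for your own parameter group name.
aws rds modify-db-parameter-group \
    --db-parameter-group-name my-mariadb-params \
    --parameters "ParameterName=slow_query_log,ParameterValue=1,ApplyMethod=immediate" \
                 "ParameterName=long_query_time,ParameterValue=1,ApplyMethod=immediate" \
                 "ParameterName=log_output,ParameterValue=FILE,ApplyMethod=immediate"
Dynamic parameters like these take effect without a reboot, and the slow query log should then appear under the instance's logs in the RDS console.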

aerospike cluster crashed after index creation

We have a cluster at AWS of 4 t2.micro machines (1 CPU, 1 GB RAM, 15 GB SSD) and we were testing Aerospike.
We used the AWS Marketplace AMI to install Aerospike v3 Community Edition, and configured only the aerospike.conf file to have a namespace on the disk.
We had one namespace with two sets, totaling 18M documents, 2 GB of RAM occupied and approximately 40 GB of disk space occupied.
After the creation of an index on a 12M-record set, the system crashed.
Some info:
aql on the instance:
[ec2-user@ip-172-XX-XX-XXX ~]$ aql
2015-09-16 18:44:37 WARN AEROSPIKE_ERR_CLIENT Socket write error: 111
Error -1: Failed to seed cluster*
Tail of the log (it keeps appending only repeated lines):
Sep 16 2015 19:08:26 GMT: INFO (drv_ssd): (drv_ssd.c::2406) device /opt/aerospike/data/bar.dat: used 6980578688, contig-free 5382M (5382 wblocks), swb-free 0, n-w 0, w-q 0 w-tot 23 (0.0/s), defrag-q 0 defrag-tot 128 (0.0/s)
Sep 16 2015 19:08:46 GMT: INFO (drv_ssd): (drv_ssd.c::2406) device /opt/aerospike/data/bar.dat: used 6980578688, contig-free 5382M (5382 wblocks), swb-free 0, n-w 0, w-q 0 w-tot 23 (0.0/s), defrag-q 0 defrag-tot 128 (0.0/s)
Sep 16 2015 19:09:06 GMT: INFO (drv_ssd): (drv_ssd.c::2406) device /opt/aerospike/data/bar.dat: used 6980578688, contig-free 5382M (5382 wblocks), swb-free 0, n-w 0, w-q 0 w-tot 23 (0.0/s), defrag-q 0 defrag-tot 128 (0.0/s)
Sep 16 2015 19:09:26 GMT: INFO (drv_ssd): (drv_ssd.c::2406) device /opt/aerospike/data/bar.dat: used 6980578688, contig-free 5382M (5382 wblocks), swb-free 0, n-w 0, w-q 0 w-tot 23 (0.0/s), defrag-q 0 defrag-tot 128 (0.0/s)
asmonitor:
$ asmonitor -h 54.XX.XXX.XX
request to 54.XX.XXX.XX : 3000 returned error
skipping 54.XX.XXX.XX:3000
***failed to connect to any hosts
asadm:
$ asadm -h 54.XXX.XXX.XX -p 3000
Aerospike Interactive Shell, version 0.0.10-6-gdd6fb61
Found 1 nodes
Offline: 54.207.67.238:3000
We tried restarting the instances; one of them is back but working as a standalone node, and the rest are in the described state.
The instances are working, but the aerospike service is not.
There is a guide dedicated to using Aerospike on Amazon EC2 and you probably want to follow it closely to get started.
When you see an AEROSPIKE_ERR_CLIENT "Failed to seed cluster" it means that your client cannot connect to any seed node in the cluster. A seed node is the first node the client connects to, from which it learns the cluster partition table and the other nodes. You are using aql with the default host (127.0.0.1) and port (3000) values. Try with -h and -p, or use --help for information on the flags.
There are many details you're not including: are these nodes all in the same Availability Zone of the same EC2 region? Did you configure your /etc/aerospike.conf with a mesh heartbeat configuration (the mode needed in Amazon EC2)? Simply put, can your nodes see each other? You're using what looks like a public IP, but your nodes need to see each other through their private IP addresses; they have no idea what their public IP is unless you configure it. At the same time, clients may be connecting from other AZs, so you will need to set up the access-address correctly. See this discussion forum post on the topic: https://discuss.aerospike.com/t/problems-configuring-clustering-on-aws-ec2-with-3-db-instances/1676
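For example, point the client tools at one of the nodes explicitly rather than at the 127.0.0.1:3000 default (the 172.31.10.11 address below is just a placeholder for one of your nodes' private IPs):
# Connect to a specific node by its private IP and service port.
aql -h 172.31.10.11 -p 3000
asadm -h 172.31.10.11 -p 3000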

Is there an NTP server I should be using when using Amazon's EC2 service to combat clock drift?

I’m using AWS and am on an EC2 server …
[dalvarado@mymachine ~]$ uname -a
Linux mydomain.org 3.14.33-26.47.amzn1.x86_64 #1 SMP Wed Feb 11 22:39:25 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
My clock is off by a minute or so despite the fact that I already have ntpd installed and running:
[dalvarado@mymachine ~]$ sudo service ntpd status
ntpd (pid 22963) is running...
It would appear NTP packets are blocked, or there is some other problem, because I get this error:
[dalvarado@mymachine ~]$ sudo ntpdate pool.ntp.org
2 Apr 16:43:50 ntpdate[23748]: no server suitable for synchronization found
Does anyone know with AWS if there’s another server I should be contacting for NTP info or if there are other additional configurations I need?
Thanks, - Dave
Edit: Including the output from the comment ...
[dalvarado@mymachine ~]$ sudo ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
173.44.32.10 .INIT. 16 u - 1024 0 0.000 0.000 0.000
deekayen.net .INIT. 16 u - 1024 0 0.000 0.000 0.000
dhcp-147-115-21 .INIT. 16 u - 1024 0 0.000 0.000 0.000
time-b.timefreq .INIT. 16 u - 1024 0 0.000 0.000 0.000
Second edit:
Below are the contents of the /etc/ntp.conf file
# For more information about this file, see the man pages
# ntp.conf(5), ntp_acc(5), ntp_auth(5), ntp_clock(5), ntp_misc(5), ntp_mon(5).
driftfile /var/lib/ntp/drift
# Permit time synchronization with our time source, but do not
# permit the source to query or modify the service on this system.
restrict default nomodify notrap nopeer noquery
# Permit all access over the loopback interface. This could
# be tightened as well, but to do so would effect some of
# the administrative functions.
restrict 127.0.0.1
restrict ::1
# Hosts on local network are less restricted.
#restrict 192.168.1.0 mask 255.255.255.0 nomodify notrap
# Use public servers from the pool.ntp.org project.
# Please consider joining the pool (http://www.pool.ntp.org/join.html).
server 0.amazon.pool.ntp.org iburst
server 1.amazon.pool.ntp.org iburst
server 2.amazon.pool.ntp.org iburst
server 3.amazon.pool.ntp.org iburst
#broadcast 192.168.1.255 autokey # broadcast server
#broadcastclient # broadcast client
#broadcast 224.0.1.1 autokey # multicast server
#multicastclient 224.0.1.1 # multicast client
#manycastserver 239.255.254.254 # manycast server
#manycastclient 239.255.254.254 autokey # manycast client
# Enable public key cryptography.
#crypto
includefile /etc/ntp/crypto/pw
# Key file containing the keys and key identifiers used when operating
# with symmetric key cryptography.
keys /etc/ntp/keys
# Specify the key identifiers which are trusted.
#trustedkey 4 8 42
# Specify the key identifier to use with the ntpdc utility.
#requestkey 8
# Specify the key identifier to use with the ntpq utility.
#controlkey 8
# Enable writing of statistics records.
#statistics clockstats cryptostats loopstats peerstats
# Enable additional logging.
logconfig =clockall =peerall =sysall =syncall
# Listen only on the primary network interface.
interface listen eth0
interface ignore ipv6
# Disable the monitoring facility to prevent amplification attacks using ntpdc
# monlist command when default restrict does not include the noquery flag. See
# CVE-2013-5211 for more details.
# Note: Monitoring will not be disabled with the limited restriction flag.
disable monitor
and below is the output from "ntpq -p"
sudo ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
173.44.32.10 .INIT. 16 u - 1024 0 0.000 0.000 0.000
deekayen.net .INIT. 16 u - 1024 0 0.000 0.000 0.000
dhcp-147-115-21 .INIT. 16 u - 1024 0 0.000 0.000 0.000
time-b.timefreq .INIT. 16 u - 1024 0 0.000 0.000 0.000
(2018) Amazon now recommend "just" using their 169.254.169.123 NTP server because
Your instance does not require access to the internet, and you do not have to configure your security group rules or your network ACL rules to allow access.
(It looks like the link-local "Amazon Time Sync Service" was introduced in late 2017)
Note: The 169.254.169.123 server does "leap smearing" and SHOULD NOT be mixed with other (non-Amazon) NTP servers out on the internet that aren't doing the smearing in exactly the same way. Amazon also recommend using chrony instead of ntpd unless you are stuck in a legacy situation where chrony is unavailable; compared to ntpd, chrony is faster at achieving synchronization, more accurate, and more robust.
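On Amazon Linux the switch to chrony plus the link-local time source looks roughly like this (a sketch only; package and service names may differ on other distributions):
# Install chrony and stop ntpd (Amazon Linux; adjust for your distribution).
sudo yum -y install chrony
sudo service ntpd stop

# Ensure /etc/chrony.conf lists the Amazon Time Sync Service, e.g.:
#   server 169.254.169.123 prefer iburst
grep -q '169.254.169.123' /etc/chrony.conf || \
    echo 'server 169.254.169.123 prefer iburst' | sudo tee -a /etc/chrony.conf

sudo service chronyd restart

# Verify that chrony has selected the link-local source (marked with '*').
chronyc sources -v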
Yes, you should be using at least 3, and ideally 5 or more, servers which are at a low stratum and close (in round-trip time) to your instance.
Amazon provide some documents which detail how to configure NTP. It should be noted that you don't need to use the pool servers listed; they are a front for the public NTP pool which Amazon load-balances to. You can pick any servers you like, just remember to update your security group/ACL settings for any new addresses.
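If you do swap in other servers, remember that NTP uses UDP port 123; a sketch of the corresponding security group change with the AWS CLI (the group ID is a placeholder, and the default egress rule already permits this unless you have tightened it):
# Allow outbound NTP (UDP 123) from the instance's security group.
aws ec2 authorize-security-group-egress \
    --group-id sg-0123456789abcdef0 \
    --protocol udp \
    --port 123 \
    --cidr 0.0.0.0/0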
The output you provided
[dalvarado@mymachine ~]$ sudo ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
173.44.32.10 .INIT. 16 u - 1024 0 0.000 0.000 0.000
deekayen.net .INIT. 16 u - 1024 0 0.000 0.000 0.000
dhcp-147-115-21 .INIT. 16 u - 1024 0 0.000 0.000 0.000
time-b.timefreq .INIT. 16 u - 1024 0 0.000 0.000 0.000
shows that the servers you have configured are not reachable.
Refid=.INIT. means you have not yet initialised communication with the referenced server. You poll them every 1024 seconds, but they all have reach=0, so you can't reach them and are not receiving the time from any server. That's why your clock is still wrong.
It may be that your firewall/network security setup is too strict and you are blocking access to those hosts, or more likely to the NTP port (UDP 123).
Do some network-level diagnostics, as it would appear that's where your problem lies. Please also include your ntp.conf and the output from ntpq -pcrv if you need further help.
Once you fix the reachability issue, check that the numbers in ntpq -p show valid data; you should then find the problem sorted and the clock kept in check as expected.
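As a quick sketch of that network-level check (-q makes ntpdate query only, -u uses an unprivileged source port; neither changes the clock):
# If this times out, outbound UDP 123 (or the return traffic) is being blocked.
ntpdate -q -u 0.amazon.pool.ntp.org

# Show peers, reachability and the system variables ntpd is working with.
ntpq -pcrv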
Just a warning to folks about using the AWS time service at 169.254.169.123: this server does not handle leap seconds in the standard way; instead it does 'leap smearing'.
This may or may not be suitable for your setup, and you should never mix normal NTP and leap-smeared NTP servers together in the same config, or in the same timing domain. Pick one standard and stick to it to avoid problems.
Amazon documents NTP here. They include NTP configuration with their Amazon Linux distributions. An Amazon instance that I currently have running lists these servers in /etc/ntp.conf, which is also what their documentation recommends:
server 0.amazon.pool.ntp.org iburst
server 1.amazon.pool.ntp.org iburst
server 2.amazon.pool.ntp.org iburst
server 3.amazon.pool.ntp.org iburst

AWS VPN issue routing to 2nd ip block

I've just setup a VPN link between our local network and an Amazon VPC.
Our local network has two ip blocks of interest:
192.168.0.0/16 - block local-A
10.1.1.0/24 - block local-B
The AWS VPC has an IP block of:
172.31.0.0/16 - block AWS-A
I have set up the VPN connection with static routes to the local-A and local-B IP blocks.
I can connect from:
local-A to AWS-A and
AWS-A to local-A.
I can't connect from:
AWS-A to local-B (e.g. 172.31.17.135 to 10.1.1.251)
From the 172.31.17.135 server, 10.1.1.251 seems to resolve to an Amazon server rather than to a server on our local network. This is despite setting up local-B (10.1.1.0/24) as a static route on the VPN. See the tracert output...
tracert 10.1.1.251
Tracing route to ip-10-1-1-251.ap-southeast-2.compute.internal [10.1.1.251]
over a maximum of 30 hops:
1 <1 ms <1 ms <1 ms 169.254.247.13
2 <1 ms <1 ms <1 ms 169.254.247.21
3 33 ms 35 ms 36 ms 169.254.247.22
4 34 ms 35 ms 33 ms ip-10-1-1-251.ap-southeast-2.compute.internal [10.1.1.251]
How can I change the config so that the 10.1.1.251 address resolves to the server on my local network when accessed from within AWS?
Disclosure - I've asked this same question at serverfault too.
Edit...tracert to the other subnet
tracert -d 192.168.0.250
Tracing route to 192.168.0.250 over a maximum of 30 hops
1 <1 ms <1 ms <1 ms 169.254.247.13
2 <1 ms <1 ms <1 ms 169.254.247.21
3 36 ms 35 ms * 169.254.247.22
4 35 ms 36 ms 34 ms 192.168.0.250
Edit #2...tracetcp to the bad and good destinations:
>tracetcp 10.1.1.251 -n
Tracing route to 10.1.1.251 on port 80
Over a maximum of 30 hops.
1 0 ms 0 ms 0 ms 169.254.247.13
2 0 ms 0 ms 0 ms 169.254.247.21
3 31 ms 31 ms 16 ms 169.254.247.22
4 * * * Request timed out.
5 ^C
>tracetcp 192.168.0.250 -n
Tracing route to 192.168.0.250 on port 80
Over a maximum of 30 hops.
1 0 ms 0 ms 0 ms 169.254.247.13
2 0 ms 0 ms 0 ms 169.254.247.21
3 47 ms 46 ms 32 ms 169.254.247.22
4 Destination Reached in 31 ms. Connection established to 192.168.0.250
Trace Complete.
Hop #3 is the VPN end point on the local side. We are getting that far and then something is failing to route past that. So yes, the problem is on our side, i.e. AWS setup is fine. I'll update when we get to the bottom of it.
Edit #3
Fixed! The problem was the Barracuda firewall that we run internally. We had created incoming and outgoing holes in the firewall AND there were no logs showing blocked requests to the destination, so we had ruled the firewall out as the cause early on. In desperation we briefly switched the firewall off and the connection went through OK. Since then we have updated the firmware on the firewall, which has fixed the issue, and the firewall now allows the traffic through.
tracert and tracetcp were helpful diagnostics, but the firewall never showed up in the list of hops (even now that it's fixed), so it was hard to tell the firewall was blocking anything. I would have expected all traffic to pass through the firewall and the firewall to appear in the hop list. Thanks for the help.
You appear to be blurring the distinction between routing and name resolution. These two things are completely unrelated.
Looking at the round-trip-times, it certainly doesn't look like you're hitting an AWS server. I would speculate that you're routing fine, but amazon's internal DNS service is showing you the wrong name.
Have you actually tried connecting to your 10.1.1.251 machine? Intuition tells me it's fine, you're just seeing a hostname you don't expect.
The fix, of course, is to configure your EC2 VPC instances to use your corporate DNS servers instead of Amazon's. Or you can just use tracert -d so that you don't see this. :)
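If you do go down the corporate-DNS route, that is done with a VPC DHCP options set; a sketch with placeholder resolver address and IDs:
# Hand out the corporate DNS resolver (10.1.1.10 is a placeholder) to the VPC.
aws ec2 create-dhcp-options \
    --dhcp-configurations "Key=domain-name-servers,Values=10.1.1.10"

# Attach it using the DhcpOptionsId returned above (both IDs are placeholders).
aws ec2 associate-dhcp-options \
    --dhcp-options-id dopt-xxxxxxxx \
    --vpc-id vpc-xxxxxxxx
Instances pick up the new options when their DHCP lease renews (or after a reboot).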
Okay, update. Your references to resolution and the latency involved made me think otherwise. I'm still inclined to think you actually have connectivity at some level, though, because the 36 milliseconds seems inexplicably high for an AWS-internal host. What's the round-trip time to the other subnet?
Can you pull the Ethernet cable from 10.1.1.251 and see if it stops responding?
If you trace from a 10.1.1.x machine on your LAN toward the VPC or try to connect to a VPC machine from 10.1.1.x... what happens there?

Benchmarking EC2

I am running some quick tests to try to estimate hardware costs for a launch and beyond.
Specs
Ubuntu Natty 11.04 64-bit
Nginx 0.8.54
m1.large
I feel like I must be doing something wrong here. What I am trying to do is estimate how many simultaneous requests I can support before having to add an extra machine. I am using Django app servers, but for right now I am just testing nginx serving the static index.html page.
Results:
$ ab -n 10000 http://ec2-107-20-9-180.compute-1.amazonaws.com/
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking ec2-107-20-9-180.compute-1.amazonaws.com (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests
Server Software: nginx/0.8.54
Server Hostname: ec2-107-20-9-180.compute-1.amazonaws.com
Server Port: 80
Document Path: /
Document Length: 151 bytes
Concurrency Level: 1
Time taken for tests: 217.748 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Total transferred: 3620000 bytes
HTML transferred: 1510000 bytes
Requests per second: 45.92 [#/sec] (mean)
Time per request: 21.775 [ms] (mean)
Time per request: 21.775 [ms] (mean, across all concurrent requests)
Transfer rate: 16.24 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 9 11 10.3 10 971
Processing: 10 11 9.7 11 918
Waiting: 10 11 9.7 11 918
Total: 19 22 14.2 21 982
Percentage of the requests served within a certain time (ms)
50% 21
66% 21
75% 22
80% 22
90% 22
95% 23
98% 25
99% 35
100% 982 (longest request)
So before I even add a Django backend, the basic nginx setup can only support 45 req/second?
This is horrible for an m1.large... no?
What am I doing wrong?
You've only set the concurrency level to 1. I would recommend upping the concurrency (the -c flag for ApacheBench) if you want more realistic results, for example:
ab -c 10 -n 1000 http://ec2-107-20-9-180.compute-1.amazonaws.com/
What Mark said about concurrency. Plus I'd shell out a few bucks for a professional load testing service like loadstorm.com and hit the thing really hard that way. Ramp up load until it breaks. Creating simulated traffic that is at all realistic (which is important to estimating server capacity) is not trivial, and these services help by loading resources and following links and such. You won't get very realistic numbers just loading one static page. Get something like the real app running, and hit it with a whole lot of virtual browsers. You can't count on finding the limits of a well configured server with just one machine generating traffic.
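A crude way to do that ramp-up with nothing more than ApacheBench is a loop like the sketch below (the URL is the one from the question); watch where requests per second stop scaling or failed requests appear:
# Re-run the benchmark at increasing concurrency levels.
URL=http://ec2-107-20-9-180.compute-1.amazonaws.com/
for c in 1 10 50 100 200; do
    echo "=== concurrency $c ==="
    ab -q -c "$c" -n 5000 "$URL" | egrep 'Requests per second|Failed requests'
done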