AWS VPN issue routing to 2nd IP block - amazon-web-services

I've just set up a VPN link between our local network and an Amazon VPC.
Our local network has two IP blocks of interest:
192.168.0.0/16 - block local-A
10.1.1.0/24 - block local-B
The AWS VPC has an IP block of:
172.31.0.0/16 - block AWS-A
I have set up the VPN connection with static routes to the local-A and local-B IP blocks.
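For reference, the static routes were added per local CIDR on the VPN connection, roughly equivalent to the following AWS CLI commands (the VPN connection ID below is just a placeholder):
aws ec2 create-vpn-connection-route --vpn-connection-id vpn-12345678 --destination-cidr-block 192.168.0.0/16
aws ec2 create-vpn-connection-route --vpn-connection-id vpn-12345678 --destination-cidr-block 10.1.1.0/24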
I can connect from:
local-A to AWS-A and
AWS-A to local-A.
I can't connect from:
AWS-A to local-B (e.g. 172.31.17.135 to 10.1.1.251)
From the 172.31.17.135 server, 10.1.1.251 seems to be resolving to an Amazon server, rather than a server on our local network. This is despite setting up local-B (10.1.1.0/24) as a route on the VPN. See tracert output...
tracert 10.1.1.251
Tracing route to ip-10-1-1-251.ap-southeast-2.compute.internal [10.1.1.251]
over a maximum of 30 hops:
1 <1 ms <1 ms <1 ms 169.254.247.13
2 <1 ms <1 ms <1 ms 169.254.247.21
3 33 ms 35 ms 36 ms 169.254.247.22
4 34 ms 35 ms 33 ms ip-10-1-1-251.ap-southeast-2.compute.internal [10.1.1.251]
How can I change the config so that the 10.1.1.251 address resolves to the server on my local network when accessed from within AWS?
Disclosure - I've asked this same question at serverfault too.
Edit...tracert to the other subnet
tracert -d 192.168.0.250
Tracing route to 192.168.0.250 over a maximum of 30 hops
1 <1 ms <1 ms <1 ms 169.254.247.13
2 <1 ms <1 ms <1 ms 169.254.247.21
3 36 ms 35 ms * 169.254.247.22
4 35 ms 36 ms 34 ms 192.168.0.250
Edit #2...tracetcp to the bad and good destinations:
>tracetcp 10.1.1.251 -n
Tracing route to 10.1.1.251 on port 80
Over a maximum of 30 hops.
1 0 ms 0 ms 0 ms 169.254.247.13
2 0 ms 0 ms 0 ms 169.254.247.21
3 31 ms 31 ms 16 ms 169.254.247.22
4 * * * Request timed out.
5 ^C
>tracetcp 192.168.0.250 -n
Tracing route to 192.168.0.250 on port 80
Over a maximum of 30 hops.
1 0 ms 0 ms 0 ms 169.254.247.13
2 0 ms 0 ms 0 ms 169.254.247.21
3 47 ms 46 ms 32 ms 169.254.247.22
4 Destination Reached in 31 ms. Connection established to 192.168.0.250
Trace Complete.
Hop #3 is the VPN endpoint on the local side. We are getting that far and then something is failing to route past it. So yes, the problem is on our side, i.e. the AWS setup is fine. I'll update when we get to the bottom of it.
Edit #3
Fixed! The problem was the Barracuda firewall that we run internally. We had created incoming and outgoing holes in the firewall AND there were no logs showing blocked requests to the destination, so we had ruled the firewall out as the cause early on. In desperation we briefly switched the firewall off and the connection went through OK. Since then we have updated the firmware on the firewall, which fixed the issue; the firewall now allows the traffic through.
tracert and tracetcp were helpful diagnostics, but the firewall never showed up in the list of hops (even now that it's fixed), so it was hard to tell that the firewall was blocking the traffic. I would have expected all traffic to pass through the firewall and the firewall to appear in the hop list. Thanks for the help.

You appear to be blurring the distinction between routing and name resolution. These two things are completely unrelated.
Looking at the round-trip times, it certainly doesn't look like you're hitting an AWS server. I would speculate that you're routing fine, but Amazon's internal DNS service is showing you the wrong name.
Have you actually tried connecting to your 10.1.1.251 machine? Intuition tells me it's fine; you're just seeing a hostname you don't expect.
The fix, of course, is to configure your EC2 VPC instances to use your corporate DNS servers instead of Amazon's. Or you can just use tracert -d so that you don't see this. :)
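If you go the corporate-DNS route, one way to sketch it is a DHCP options set pointing at your own resolvers, associated with the VPC (the resolver addresses and resource IDs below are placeholders for your own values):
aws ec2 create-dhcp-options --dhcp-configurations "Key=domain-name-servers,Values=10.1.1.10,192.168.0.10"
aws ec2 associate-dhcp-options --dhcp-options-id dopt-12345678 --vpc-id vpc-12345678
Instances pick up the new resolvers when their DHCP lease renews (or after a reboot).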
Okay, update. Your references to resolution and the latency involved made me think otherwise. I'm still inclined to think you actually have connectivity at some level, though, because 36 milliseconds would be inexplicably high for a hop inside AWS. What's the round-trip time to the other subnet?
Can you pull the Ethernet cable from 10.1.1.251 and see if it stops responding?
If you trace from a 10.1.1.x machine on your LAN toward the VPC or try to connect to a VPC machine from 10.1.1.x... what happens there?

Related

ELB Intermittently Return 504 GATEWAY_TIMEOUT

I've seen this asked here, here, and here - but without any good answers, and I was hoping to maybe get some closure on the issue.
I have an ELB connected to 6 instances, all running Tomcat 7. Up until Friday there were seemingly no issues at all. However, starting about five days ago we started getting around two 504 GATEWAY_TIMEOUT responses from the ELB per day. That's typically 2/2000, i.e. about 0.1%. I turned on logging and see:
2018-06-27T12:56:08.110331Z momt-default-elb-prod 10.196.162.218:60132 - -1 -1 -1 504 0 140 0 "POST https://prod-elb.us-east-1.backend.net:443/mobile/user/v1.0/ HTTP/1.1" "BackendClass" ECDHE-RSA-AES128-GCM-SHA256 TLSv1.2
But my Tomcat 7 logs don't have any 504s present at all, implying that the ELB is rejecting these requests without ever communicating with Tomcat.
I've seen people mention setting Tomcat's timeout to be greater than the ELB's timeout - but if that were what was happening (i.e. Tomcat times out and then the ELB closes the connection), shouldn't I see a 504 in the Tomcat logs?
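For reference, the two timeouts being compared here are the ELB's idle timeout and Tomcat's connector timeouts (connectionTimeout / keepAliveTimeout in server.xml, in milliseconds). On a Classic ELB the idle timeout can be checked and changed roughly like this (the load balancer name is taken from the log line above; the value is only an example):
aws elb describe-load-balancer-attributes --load-balancer-name momt-default-elb-prod
aws elb modify-load-balancer-attributes --load-balancer-name momt-default-elb-prod --load-balancer-attributes "{\"ConnectionSettings\":{\"IdleTimeout\":120}}"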
Similarly, nothing has changed in the code in a few months, so this all started seemingly out of nowhere, and it is too uncommon to be a bigger issue. I checked to see if there was some pattern in the timeouts (e.g. Tomcat restarting, always the same instance, etc.) but couldn't find anything.
I know other people have run into this issue, but any and all help would be greatly appreciated.

Test connection latency between Amazon RDS instance and external server

I need to test the connection between a server located in my own datacenter and an Amazon RDS instance. I've tried with
time telnet <dns-of-my.instance> 3306
but that tracks the time from when I issue the command until I end it, which is not relevant.
Are there any ways of measuring this?
My answer does not assume that ICMP ping is allowed; it uses TCP-based measures instead. You will, however, have to ensure there are security group rules that allow access from the host running the tests to the RDS instance.
First, ensure some useful packages are installed
apt-get install netcat-openbsd traceroute
Check that basic connectivity works to the database port. This example is for Oracle; make sure you use the endpoint and port from the console:
nc -vz dev-fulfil.cvxzodonju67.eu-west-1.rds.amazonaws.com 1521
Then see what the latency is. The number you want is on the final hop (hop 12 here):
sudo tcptraceroute dev-fulfil.cvxzodonju67.eu-west-1.rds.amazonaws.com 1521
traceroute to dev-fulfil.cvxzodonju67.eu-west-1.rds.amazonaws.com (10.32.21.12), 30 hops max, 60 byte packets
1 pc-0-3.ioppublishing.com (172.16.0.3) 0.691 ms 3.341 ms 3.400 ms
2 10.100.101.1 (10.100.101.1) 0.839 ms 0.828 ms 0.811 ms
3 xe-10-2-0-12265.lon-001-score-1-re1.interoute.net (194.150.1.229) 10.591 ms 10.608 ms 10.592 ms
4 ae0-0.lon-001-score-2-re0.claranet.net (84.233.200.190) 10.575 ms 10.668 ms 10.668 ms
5 ae2-0.lon-004-score-1-re0.claranet.net (84.233.200.186) 12.708 ms 12.734 ms 12.717 ms
6 169.254.254.6 (169.254.254.6) 12.673 ms * *
7 169.254.254.1 (169.254.254.1) 10.623 ms 10.642 ms 10.823 ms
8 * * *
9 * * *
10 * * *
11 * * *
12 * 10.32.21.12 (10.32.21.12) <syn,ack> 20.662 ms 21.305 ms
A better measure of "latency" might be "the time a typical transaction takes with no or little data to transfer". To do this, write a script that runs such a transaction in a loop, maybe 1000 times, and time it with a high-precision timer. The exact details will vary according to your needs.
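A rough sketch of that loop, assuming a PostgreSQL endpoint and the psql client with credentials already supplied via PGPASSWORD or ~/.pgpass (swap in your own endpoint, user, database and client if your engine differs); note that it reconnects on every iteration, so it measures connection setup plus one trivial query:
time for i in $(seq 1 1000); do
    psql -h dev-fulfil.cvxzodonju67.eu-west-1.rds.amazonaws.com -U myuser -d mydb -c 'SELECT 1;' >/dev/null
done
Divide the total elapsed time by 1000 for a per-transaction figure.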
Time the query. RDS must be hosting a SQL database server, so issue a trivial SQL query to it and time the execution.
For example, if your RDS instance is PostgreSQL, connect using psql and enable \timing.
psql -h myhost -U myuser
postgres=> \timing
Timing is on.
postgres=> SELECT 1;
?column?
----------
1
(1 row)
Time: 14.168 ms
The latency is 14.168 ms in this example. Consult the manual of your particular SQL client for how to time queries on your engine.
Use ping. You will need to enable ping on your EC2 instance per this answer.
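Allowing ping means permitting inbound ICMP in the instance's security group; a sketch with the AWS CLI (the security group ID and source CIDR below are placeholders for your own values):
aws ec2 authorize-security-group-ingress --group-id sg-12345678 --ip-permissions IpProtocol=icmp,FromPort=-1,ToPort=-1,IpRanges='[{CidrIp=203.0.113.10/32}]'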
Ping will provide a time for each ping in milliseconds:
ping 34.217.36.7
PING 34.217.36.7 (34.217.36.7): 56 data bytes
64 bytes from 34.217.36.7: icmp_seq=0 ttl=227 time=68.873 ms
64 bytes from 34.217.36.7: icmp_seq=1 ttl=227 time=68.842 ms
64 bytes from 34.217.36.7: icmp_seq=2 ttl=227 time=68.959 ms
64 bytes from 34.217.36.7: icmp_seq=3 ttl=227 time=69.053 ms

aerospike cluster crashed after index creation

We have a cluster on AWS of 4 t2.micro machines (1 CPU, 1 GB RAM, 15 GB SSD) and we were testing Aerospike.
We used the AWS Marketplace AMI to install Aerospike v3 Community Edition, and configured only the aerospike.conf file to have a namespace on disk.
We had one namespace with two sets, totaling 18M documents, with about 2 GB of RAM and approximately 40 GB of disk space occupied.
After the creation of an index on a 12M-record set, the system crashed.
Some info:
aql on the instance:
[ec2-user@ip-172-XX-XX-XXX ~]$ aql
2015-09-16 18:44:37 WARN AEROSPIKE_ERR_CLIENT Socket write error: 111
Error -1: Failed to seed cluster*
Tail of the log (it keeps appending only repeated lines like these):
Sep 16 2015 19:08:26 GMT: INFO (drv_ssd): (drv_ssd.c::2406) device /opt/aerospike/data/bar.dat: used 6980578688, contig-free 5382M (5382 wblocks), swb-free 0, n-w 0, w-q 0 w-tot 23 (0.0/s), defrag-q 0 defrag-tot 128 (0.0/s)
Sep 16 2015 19:08:46 GMT: INFO (drv_ssd): (drv_ssd.c::2406) device /opt/aerospike/data/bar.dat: used 6980578688, contig-free 5382M (5382 wblocks), swb-free 0, n-w 0, w-q 0 w-tot 23 (0.0/s), defrag-q 0 defrag-tot 128 (0.0/s)
Sep 16 2015 19:09:06 GMT: INFO (drv_ssd): (drv_ssd.c::2406) device /opt/aerospike/data/bar.dat: used 6980578688, contig-free 5382M (5382 wblocks), swb-free 0, n-w 0, w-q 0 w-tot 23 (0.0/s), defrag-q 0 defrag-tot 128 (0.0/s)
Sep 16 2015 19:09:26 GMT: INFO (drv_ssd): (drv_ssd.c::2406) device /opt/aerospike/data/bar.dat: used 6980578688, contig-free 5382M (5382 wblocks), swb-free 0, n-w 0, w-q 0 w-tot 23 (0.0/s), defrag-q 0 defrag-tot 128 (0.0/s)
asmonitor:
$ asmonitor -h 54.XX.XXX.XX
request to 54.XX.XXX.XX : 3000 returned error
skipping 54.XX.XXX.XX:3000
***failed to connect to any hosts
asadm:
$ asadm -h 54.XXX.XXX.XX -p 3000
Aerospike Interactive Shell, version 0.0.10-6-gdd6fb61
Found 1 nodes
Offline: 54.207.67.238:3000
We tried restarting the instances; one of them is back but is working as a standalone node, and the rest are in the state described above.
The instances are running, but the Aerospike service is not.
There is a guide dedicated to using Aerospike on Amazon EC2 and you probably want to follow it closely to get started.
When you see an AEROSPIKE_ERR_CLIENT "Failed to seed cluster" error, it means that your client cannot connect to any seed node in the cluster. A seed node is the first node the client connects to, from which it learns the cluster partition table and the other nodes. You are using aql with the default host (127.0.0.1) and port (3000) values. Try with -h and -p, or use --help for information on the flags.
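For example, pointing aql at one of your nodes (the address below is only a placeholder for a reachable node IP):
aql -h 172.31.0.11 -p 3000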
There are many details you're not including. Are these nodes all in the same Availability Zone of the same EC2 region? Did you configure your aerospike.conf with a mesh heartbeat configuration (that's the mode needed in Amazon EC2)? Simply put, can your nodes see each other? You're using what looks like a public IP, but your nodes need to see each other through their private IP addresses; they have no idea what their public IP is unless you configure it. At the same time, clients may be connecting from other AZs, so you will need to set up access-address correctly. See this discussion forum post on the topic: https://discuss.aerospike.com/t/problems-configuring-clustering-on-aws-ec2-with-3-db-instances/1676
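As a rough sketch only (directive names vary a little between Aerospike versions, and every IP below is a placeholder), the relevant parts of aerospike.conf for mesh heartbeats on EC2 look something like this:
network {
    service {
        address any
        port 3000
        access-address 172.31.0.11               # address clients should use to reach this node (its private IP within the VPC)
    }
    heartbeat {
        mode mesh                                # multicast heartbeats are not available on EC2
        port 3002
        mesh-seed-address-port 172.31.0.12 3002  # private IP of another node in the cluster
        mesh-seed-address-port 172.31.0.13 3002
        interval 150
        timeout 10
    }
}
The EC2 deployment guide mentioned above has the exact stanzas for your version.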

How do I ping the original webpage, bypassing CDN?

I live in Warsaw, Poland. I am pinging a US webpage (www.nba.com, for example):
$ ping www.nba.com
PING a1570.gd.akamai.net (213.155.152.161) 56(84) bytes of data.
64 bytes from 213-155-152-161.customer.teliacarrier.com (213.155.152.161): icmp_req=1 ttl=58 time=6.90 ms
64 bytes from 213-155-152-161.customer.teliacarrier.com (213.155.152.161): icmp_req=2 ttl=58 time=5.68 ms
The time I get is around 7-10 ms, while the distance from Poland to the US and back (packets travel there and back) is around 16,000 km (16*10^6 m). With c = 3*10^8 m/s, distance/c ≈ 0.05 s = 50 ms.
So I suppose that some webpages are cached on some other server, e.g. in Western Europe (5 ms means less than 750 km from my location). How can I then ping the original, US webpage?
Or did I miss something?
EDIT1: OK, I missed something: I am actually pinging a1570.gd.akamai.net in London, but the distance still seems too far (>750 km). Is it a ping timing error?
You are not pinging www.nba.com but one of the CDN servers they are using, namely:
a1570.gd.akamai.net (213.155.152.161)
This Akamai server is located in London. Hence your ping is so fast, which proves the CDN actually works.
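If you want to see the indirection yourself, the DNS chain makes it visible, for example (output varies by resolver and location, and the exact names may have changed since):
dig +short www.nba.com
You will typically see one or more CNAMEs ending in an *.akamai.net hostname before the edge server's address - that edge server is what your ping actually reaches.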

Benchmarking EC2

I am running some quick tests to try to estimate hardware costs for a launch and going forward.
Specs
Ubuntu Natty 11.04 64-bit
Nginx 0.8.54
m1.large
I feel like I must be doing something wrong here. What I am trying to do is estimate how many simultaneous requests I can support before having to add an extra machine. I am using Django app servers, but for right now I am just testing nginx serving the static index.html page.
Results:
$ ab -n 10000 http://ec2-107-20-9-180.compute-1.amazonaws.com/
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking ec2-107-20-9-180.compute-1.amazonaws.com (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Completed 10000 requests
Finished 10000 requests
Server Software: nginx/0.8.54
Server Hostname: ec2-107-20-9-180.compute-1.amazonaws.com
Server Port: 80
Document Path: /
Document Length: 151 bytes
Concurrency Level: 1
Time taken for tests: 217.748 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Total transferred: 3620000 bytes
HTML transferred: 1510000 bytes
Requests per second: 45.92 [#/sec] (mean)
Time per request: 21.775 [ms] (mean)
Time per request: 21.775 [ms] (mean, across all concurrent requests)
Transfer rate: 16.24 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 9 11 10.3 10 971
Processing: 10 11 9.7 11 918
Waiting: 10 11 9.7 11 918
Total: 19 22 14.2 21 982
Percentage of the requests served within a certain time (ms)
50% 21
66% 21
75% 22
80% 22
90% 22
95% 23
98% 25
99% 35
100% 982 (longest request)
So before I even add a Django backend, the basic nginx setup can only support 45 requests/second?
This is horrible for an m1.large ... no?
What am I doing wrong?
You've only set the concurrency level to 1, so requests are issued one at a time and throughput is bound by round-trip latency rather than by the server: at about 22 ms per request, roughly 45 requests/second is exactly what you'd expect. I would recommend upping the concurrency (-c flag for ApacheBench) if you want more realistic results, e.g.:
ab -c 10 -n 1000 http://ec2-107-20-9-180.compute-1.amazonaws.com/
What Mark said about concurrency. Plus, I'd shell out a few bucks for a professional load testing service like loadstorm.com and hit the thing really hard that way. Ramp up load until it breaks. Creating simulated traffic that is at all realistic (which is important for estimating server capacity) is not trivial, and these services help by loading resources, following links and so on. You won't get very realistic numbers just loading one static page. Get something like the real app running and hit it with a whole lot of virtual browsers. You can't count on finding the limits of a well-configured server with just one machine generating traffic.