Slow response from getaddrinfo - c++

I'm using getaddrinfo to do DNS queries from C++ on Windows. I used to use the Windows API DnsQuery and that worked fine, but when adding IPv6 support to my software I switched to getaddrinfo. Since then, I've seen the following:
My problem is that some times getaddrinfo take very long time to complete. The typical response from getaddrinfo takes just a few milliseconds, but roughly 1 time out of 10000, it takes longer time, in some cases around 15 seconds but there's been several cases when it takes several minutes.
I've run Wireshark on the server and analyzed my applications debug logs and see the following:
I call the function getaddrinfo.
15 seconds later, my machine queries the DNS server.
Some milliseconds later, I get the response from the DNS server.
The weird thing here is that the actual DNS query only takes a tenth of a second, but the time getaddrinfo actually executes is much longer.
The problem has been reported by many users, so it's not something specific to my machine.
So what does getaddrinfo do more than contact the DNS server?
Edit:
The problem has occurred with several addresses. If I try to reproduce the problem using these addresses, the problem does not occur.
I have done something stupid. Upon every DNS query, the etc/services is parsed. However, that doesn't explain a delay on several minutes. (thanks D.Shawley)
Edit 2
One type of DNS queries made by my software is anti-spam DNSBL queries. The log from one user showed me that the lookup for ip.address1.example.com always seemed to take exactly 2039 seconds, while the lookup for another.ip.address.example.com always took exactly 1324 seconds. The day after that, the lookups for those addresses were just fine. At first I thought that the DNS BL authors had put some kind of timeout on their side. But if this was the core problem, getaddrinfo should have timed out earlier?

Windows has a local daemon that does DNS caching. Your call to getaddrinfo() is getting routed to that daemon, which presumably is checking its cache before submitting the query to your DNS server.
See Windows Knowledge Base article 318803 for more information on disabling the cache.
[Edited]
It sounds to me as though your Windows Server 2003 instance is not configured correctly for IPv6. Once the IPv6 lookups timeout, it will fall back to IPv4. Knowledge Base articles that might help include:
Windows Server 2003 Deployment Guide >>> Configuring DNS for IPv6/IPv4 Coexistence
TechNet Library >>> Internet Protocol Version 6
TechNet Library >>> Using Windows Tools to Obtain IPv6 Configuration Information
Unfortunately, I don't have access to any Windows Servers, so I can't test/replicate this myself.

Related

How much overhead would a DNS call add to the response time of my API?

I am working on a cross-platform application that runs on iOS, Android and the web. Currently there is an API server which interacts with all the clients. Each request to the API is made through the ip (eg. http://1.1.1.1/customers). This disallows me to move the backend quickly whenever I want to another cloud VPS as I need to update iOS and Android versions of the app with a painful migration process.
I though the solution would be introducing a subdomain. (eg http://api.example.com/customers). How much would an additional DNS call would affect the response times?
The thing to remember about DNS queries is that, as long as you have configured your DNS sensibly, clients will only ever make a single call the first time communication is needed.
A dns query will typically involve three queries, one to the root server, one to the .com (etc) server, and a final one to the example.com domain. Each of these will take milliseconds and will be performed once, probably every hour or so whenever the TTL expires.
The TL;DR is basically that it is a no brainer, you get far far more advantages from using a domain name than you will ever get from an IP Address. The time is minimal, the packet size is tiny.

Suddenly scheduled tasks are not running in coldfusion 8

I am using Coldfusion MX8 server and one of the scheduled task was running from 2 years but now suddenly from 01/12/2014 scheduled tasks are not running. When i browsed the file in browser then the file is running successfully without error.
I am not sure is there any updatation or license expiration problem. I am aware that mid of this year Adobe closed the support for coldfusion 8.
The first most common problem of this problem is external to the server. When you say you browsed to the file and it worked in a browser, it is very important to know if that test was performed on the server desktop. Knowing that you can browse to the file from your desktop or laptop is of small value.
The most common source of issues like this is a change in the DNS or network stack that is interfereing with resolution. For example, if the internal DNS serving your DMZ suddenly starts serving the "external" address - suddenly your server can't browse to your domain. Or if the IP served by the server for the domain in question goes from being 127.0.0.1 to some other IP that the server can't acces correctly due to reverse proxy or LB or some other rule. Finally, sometimes the Apache or IIS is altered so that an IP that previously was serviced (127.0.0.1 being the most common example) now does not respond.
If it is something intrinsic to the scheduler service then Frank's advice is pretty good - especially look for "proxy schduler" entries in the log - they can give you good clues. I would also log results of a scheduled task to a file. Then check the file. If it exists then your scheduled tasks ARE running - they are just not succeeding. Good luck!
I've seen the cf scheduling service crash in CF8. The rest of CF is unaffected.
Have you tried restarting the server?
Here are your concerns:
Your File (works since you tested it manually).
Your Scheduled Task (failed).
Your Coldfusion Application (Service) (any changes here)?
Your Server (what about here).
To test your problem create a duplicate task and schedule it. Leave the other one in place (maybe set your new one to run earlier). Use the same file too. See if it completes.
If it doesn't then you have a larger problem. Since the Coldfusion Server sits atop of the JVM there could be something happening there. Things just don't stop working unless something got corrupted or you got compromised. If you hardened your server by rearranging/renaming the file structure to make it more secure...It would break your task.
So going back: if your test schedule works then determine what is different between the two. Note you have logging capabilities. Logging abilities for CF8
If you are not directly incharge of maintaining this server, then I would recommend asking around and see if there was recent maintenance, if so, what was done to the server?

Strange apache lag in requests

I have an Apache2 and Django (mod_wsgi) setup that provides a RESTful API. I have a set of automated tests for this, that executes ~1000 API requests (pure http GET/POST/PUT/DELETE) in sequential order.
The problem is, for every 80 requests or so, I get a strange lag/timeout for exactly 5s or 10s. See timestamp examples here:
Request 1: 2013-08-30T03:49:20.915
Response 1: 2013-08-30T03:49:30.940
Request 2: 2013-08-30T03:50:32.559
Response 2: 2013-08-30T03:50:37.597
I can't figure out why this happens. I have an apache config with KeepAlive Off (recommended setup setting for Django) but otherwise standard install for Ubuntu 12.04 LTS.
I'm running the tests from the same server where the webserver is, I first thought this was some kind of DNS cache thing, but I've added the hostname I'm requesting to /etc/hosts but the problem persists.
The system is idle and have lots of cpu and mem when this lag/timeouts happens.
The lag is not specific to a certain request (URL), it seems kinda random.
Considering that it's always exactly to the millisecond 5s or 10s, it feels like this is some specific setting somewhere causing this.
In case it provides some insight, watch my talk from PyCon US.
http://lanyrd.com/2013/pycon/scdyzk/
The talk deals with things like process churn and startup costs. One thing you shouldn't do is set maximum requests if you don't really need it.
Also consider trying New Relic to help diagnose where the issue is. That will save a lot of guessing if it is a web application of backend service infrastructure issue.
As far as seeing how such monitoring can help, watch another one of my PyCon talks.
http://lanyrd.com/2012/pycon/spcdg/
This was a DNS issue, adding the domainname I used locally to /etc/hosts actually solved the problem. I just hadn't reboot the server for the changes to take effect, thought restarting networking would take care of that, but apparently not.

Winsock IOCP Server Stress Test Issue

I have a winsock IOCP server written in c++ using TCP IP connections. I have tested this server locally, using the loopback address with a client simulator. I have been able to get upwards of 60,000 clients no sweat. The issue I am having, is when I run the server at my house and the client simulator at a friends house. Everything works fine up until we hit around 3700 connections, after that every call to connect() fails from the client side with a return of 10060 (this is the winsock timed out error). Last night this number was 3700, but it has been around 300 before, and we also saw it near 1000. But whatever the number is, every time we try to simulate it, it will fail right around that number (within 10 or so).
Both computers are using Windows 7 Ultimate. We have also both modified the TCPIP registry setting MaxTcpConnections to around 16 million. We also changed the MaxUserPort setting from its 5000 default to 65k. No useful information is showing up in the event viewer. We also both watched our resource monitor, and we havent even gotten to 1% network utilization, the CPU is also close to 0% usage as well.
We just got off the phone with our ISP, and they are saying that they are not limiting us in any way but the guy was kinda unsure and ended up hanging up on us anyway after a 30 minute hold time...
We are trying everything to figure this issue out, but cannot come up with the solution. I would be very greatful if someone out there could give us a hand with this issue.
P.S. Both computers are on Verizon FIOS with the same verizon router. Another thing to note, the server is using WSAAccept and NOT AcceptEx. The client simulator is attempting to connect over many seconds though, so I am pretty sure the connects are not getting backlogged. We have tried to change the speed at which the client simulator connects, and no matter what speed it is set to it fails right around the same number each time.
UPDATE
We simulated 2 separate clients (on 2 separate machines) on network A. The server was running on network B. Each client was only able to connect half (about 1600) connections to the server. We were initially using a port below 1,000, this has been changed to above 50,000. The router log on both machines showed nothing. We are both using the Actiontec MI424WR verizon FIOS router. This leads me to believe the problem is not with the client code. The server throws no errors and has no unexpected behavior. Could this be an ISP/Router issue?
UPDATE
The solution has been found. The verizon router we were using (MI424WR revision C) is unable to handle any more than 3700 connections, we tested this with a separate set of networks. Thanks for the help guys!
Thanks
- Rick
I would have guessed that this was a MaxUserPort issue, but you say you've changed that. Did you reboot after changing it?
Run the test on the exact same computers on your local network (this will take the computers out of the equation).
The issue could be one of your routers not being up to the job?

WMI - Cannot connect to certain computers via IP address

I have a really odd issue with WMI that I'm running into on a few machines on our network.
I have a piece of software (.NET/C#) written that scans an IP range on a local network, and then uses WMI to query certain data about the machines (computer names, .NET framework versions, among other things). One issue I've run into recently is that a small subset of these machines will not respond to WMI connections made via their IP address- they simply throw an "RPC Server is Unavailable" exception as if WMI isn't running to begin with.
This occurs both with the C# application and with a vbscript application that attempts a simple query to return the computer's name:
if wscript.arguments.count >= 1 then
host = wscript.arguments(0)
end if
if host = "" or isnull(host) then host = "."
connectionStr = "winmgmts:{impersonationLevel=impersonate}!\\" & host & "\root\cimv2"
wscript.echo connectionStr
set objWMIService = GetObject(connectionStr)
set objCompName = objWMIService.ExecQuery("Select * from Win32_ComputerSystem")
for each x in objCompName
wscript.echo x.Name
next
This returns the following as results:
C:\>nslookup BROKENCOMPUTER
Address: 192.168.1.123
C:\>cscript testwmi.vbs 192.168.1.123
winmgmts:{impersonationLevel=impersonate}!\\192.168.1.123\root\cimv2
C:\testwmi.vbs(9, 1) Microsoft VBScript runtime error: The remote server machine does not exist or is unavailable: 'GetObject'
C:\>cscript testwmi.vbs BROKENCOMPUTER
winmgmts:{impersonationLevel=impersonate}!\\BROKENCOMPUTER\root\cimv2
BROKENCOMPUTER
I can still open a WMI connection if I refer to the computer by its host/computer name. I can also connect to other servers running on the machine via IP address (such as HTTP or RDP)- a request tp http://192.168.1.123 returns successfully.
To make things even weirder, the behavior isn't even consistent. Sometimes the connection to the IP will work correctly, and it happens in batches. To test this, I set up a script that repeatedly spammed a WMI request every 5 seconds to the computer in question and recorded the result (and trends of results). What I found was that all requests would fail or succeed for about a certain number of requests (180- a 15 minute interval) or a multiple of it. Example:
- Start script
- 35 successful requests in a row
- 180 failed requests in a row
- 180 successful requests
- 360 failed requests
- 180 successful requests
- 180 failed requests
- 900 successful requests
- etc etc
I then ran this script on two machines at the same time. What I found was the behavior between the two was similar (had several-minute-long-intervals of being able to connect and not being able to connect) but did not sync up between the two; there were periods where both could connect, periods where only one (or the other) could connect, and periods where neither could connect.
I know this is an incredibly weird and specific problem, and I don't really expect anyone to be able to insta-solve it, but I was wondering if anyone had any hints or direction? I've spoken to the network guys here and they're just as puzzled over the issue as I am.
I can add some perspective on this, in addition to the fine answer from MisterZimbu. Assuming Microsoft doesn't remove my comments on this article, see http://msdn.microsoft.com/en-us/library/windows/desktop/aa393720%28v=vs.85%29.aspx. Basically, Microsoft seems to be doing a reverse DNS lookup when IP addresses are passed into WMI. If your DNS isn't squeaky clean, you will get "unpredictable results", which is to say that you will be connecting to machines you didn't expect to connect to.
Adding the period to the IP address forces the reverse (or forward) lookup to fail, and then by some miracle, they actually use the IP address, and not the (potentially incorrect) hostname returned from DNS. It appears that appending a period to the IP address can be used in many contexts (UNC's, browser, etc.), but there are caveats and other failures you might encounter. Note that if you look at your DNS cache (ipconfig /displaydns) you will see the failed lookups when the period is appended, so it doesn't stop the OS from doing the lookup - it just ensures that the stale DNS entries won't be used.
Oddly enough, adding a "." to the end of the IP address when making the query corrects the issue. I assume this forces it to go through DNS resolution or something like that.
So connecting via
winmgmts:{impersonationLevel=impersonate}!\\192.168.1.123.\root\cimv2
seems to work correctly 100% of the time.
Still would be great to know what actually is the underlying cause of the issue though.