We started seeing this issue about 2 hours ago. It happens seemingly at random. For example, if you copy and paste the URL for this image into a browser, the chance of it not showing up for me is about 20%.
https://d1jbmqjs327xbn.cloudfront.net/_pa/spaces-developer.pxand/assets/images/apps/pos/pos-login-bg.jpg
Even after the browser manages to load the image, it may fail to load again on a hard refresh. It then eventually comes back after 2-3 more hard refreshes.
This seems like a networking issue on AWS's side.
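To put a rough number on the failure rate, you can request the asset repeatedly and count the failed responses. This is just a minimal sketch using Python's standard library (the attempt count is arbitrary, and this is only how I'd measure it, not anything AWS-specific):

```python
import urllib.request
import urllib.error

URL = ("https://d1jbmqjs327xbn.cloudfront.net/_pa/spaces-developer.pxand"
       "/assets/images/apps/pos/pos-login-bg.jpg")
ATTEMPTS = 50  # arbitrary sample size

failures = 0
for i in range(ATTEMPTS):
    try:
        # Fresh request each time; CloudFront may still answer from an edge cache.
        with urllib.request.urlopen(URL, timeout=10) as resp:
            if resp.status != 200:
                failures += 1
    except (urllib.error.URLError, TimeoutError) as exc:
        failures += 1
        print(f"attempt {i}: {exc}")

print(f"{failures}/{ATTEMPTS} requests failed")
```

Running this a few times from different networks would also show whether the failures follow a particular edge location.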
Another thing I saw is that 3 of my domains randomly became unreachable for a few minutes (I tested with ping) and then eventually became reachable again without any change on my side.
Is anyone experiencing the same issue today (Sep 20, 2017)? This is causing problems for 10+ sites/apps I manage, and I'm not sure how to resolve it. Amazon is also not getting back to me on this.
Related
I have two Windows computers, both using the Microsoft Edge browser. When I'm typing on the Overleaf website, the connection drops every 2-5 minutes. What's worse, some unsynced sentences are gone when the connection resumes. I'm not sure whether this is a network problem, since all other websites work fine, including Twitter and Gmail. I wonder if this is related to the framework the Overleaf cloud service uses. Could anyone give some tips about this issue?
I've been running a few ML training sessions on a GCE VM (with Colab). At the start they were saving me a good deal of time and computing resources, but, like everything Google so far, the runtime ultimately disconnects and I cannot reconnect to my VM despite it still being there. a) How do we reconnect to a runtime if the VM exists, we have been disconnected, and it says it cannot reconnect to the runtime?
b) How do we avoid disconnecting, or this issue at all? I am using Colab Pro+ and paying for the VMs, and they always cut out at some point, so it's just another week of time gone out the window. I must be doing something wrong, as there's no way we pay just to lose all of our progress and time over and over, restart, and hope it doesn't collapse again (it's been about 2 weeks of lost time, and I'm wondering why GCE VMs can't run a job for 4 days without collapsing at some point). What am I doing wrong? I just want to pay for an external resource that runs the jobs I pay for, with no connect/disconnect/lose-everything issue every few days. I don't understand why Google does this.
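One way people usually limit the damage from a dropped runtime is to checkpoint progress to durable storage (a mounted Drive folder or a GCS bucket) so a restarted runtime resumes instead of starting over. A minimal sketch under those assumptions; the checkpoint path and the dummy `train_one_epoch` step below are placeholders, not anything from the original setup:

```python
import os
import pickle

# Assumed: a Drive mount at /content/drive; adjust for a GCS bucket or other storage.
CKPT_PATH = "/content/drive/MyDrive/checkpoints/run1.pkl"
TOTAL_EPOCHS = 100

def train_one_epoch(state):
    # Stand-in for the real training step; returns whatever needs to be persisted.
    return {"steps_done": (state or {"steps_done": 0})["steps_done"] + 1}

def save_checkpoint(epoch, state):
    os.makedirs(os.path.dirname(CKPT_PATH), exist_ok=True)
    with open(CKPT_PATH, "wb") as f:
        pickle.dump({"epoch": epoch, "state": state}, f)

def load_checkpoint():
    if os.path.exists(CKPT_PATH):
        with open(CKPT_PATH, "rb") as f:
            return pickle.load(f)
    return {"epoch": 0, "state": None}

ckpt = load_checkpoint()
state = ckpt["state"]
for epoch in range(ckpt["epoch"], TOTAL_EPOCHS):
    state = train_one_epoch(state)
    save_checkpoint(epoch + 1, state)  # resume point if the runtime drops
```

This doesn't stop the disconnects, but it turns a lost runtime into a restart from the last saved epoch rather than a restart from zero.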
I have been testing out an Ubuntu instance on GCS for the last couple of weeks as a possible home for one of our web servers. Last week everything suddenly stopped working: I was not able to SSH into a shell, and I couldn't even visit the site anymore through my browser. I logged into the dashboard and nothing seemed wrong. I had several colleagues try to go to the site and it loaded for them without any issues. I could not find any settings in the dashboard that would suggest some kind of block like this, so I assumed I must have triggered some kind of anti-spam system. I decided to give it a few days before trying to mess with it any further. After 6 days of not touching it at all, I still cannot visit the site or log in via SSH.
Then, to verify that they are blocking my IP address and that it wasn't just something wrong with my machine, I switched my IP, and everything started behaving as expected once again: I can reach the site in my browser and can once again SSH into the VM. After switching back to my previous static IP, everything went back to not letting me view the webpage or SSH into the server.
My problem is that this isn't a permanent solution for me. I have many servers that only allow login from my previous IP address, so I'd rather fix the issue with this VM than change all those systems to allow access from a new IP address. Any help in finding a solution would be greatly appreciated.
Please let me know if I can provide any additional info to help find the problem.
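In case it helps narrow this down: from the blocked IP, a quick TCP-level check against the instance shows whether connections are refused (host reachable, service down) or silently time out (typical of a firewall drop). A minimal sketch; the host below is a placeholder for the VM's external IP:

```python
import socket

HOST = "203.0.113.10"  # placeholder: the VM's external IP
PORTS = [22, 80, 443]

for port in PORTS:
    try:
        with socket.create_connection((HOST, port), timeout=5):
            print(f"port {port}: connected")
    except ConnectionRefusedError:
        print(f"port {port}: refused (host reachable, service not listening)")
    except socket.timeout:
        print(f"port {port}: timed out (consistent with a silent drop/block)")
    except OSError as exc:
        print(f"port {port}: {exc}")
```

Running the same script from the working IP and the blocked IP and comparing the results is a compact way to document the block for a support ticket.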
followup info:
The way our network is set up, the IP we get from DHCP is the real-world IP our device is seen with (I think we own a block or something).
This is the first time I've done anything with a GCS VM.
Edit: added additional information
OK, so before I start, full disclosure: I'm pretty new to Nagios (I've only been using it for 3 weeks), so forgive the lack of brevity in this explanation.
In the environment I inherited, I have two redundant Nagios instances running (primary and secondary). On the primary, I added an active check to see whether Apache is running on a select group of remote hosts (modifying commands.cfg and services.cfg). Unfortunately, it didn't go well, so I had to revert to the previous configuration.
Here is where my issue comes in: after reverting the changes (deleting the added lines and starting Nagios back up), the primary Nagios instance's web UI shows a particular service going critical intermittently, with the duration changing as well; e.g., when the service shows as OK, the duration is 4 hours, but when it's critical, it shows as 10 days (see here for an example host; the screenshots were taken less than a minute apart). This only happens when I refresh any of the Current Status pages, or when I go to an individual host to view its monitored services and refresh there. Also, to note, this is a passive check for the service with freshness checking enabled.
I've already done a manual check from the primary Nagios server via the CLI, and the status comes back as OK every time. I figured there was a stale state somewhere in retention.dat, status.dat, objects.cache, or objects.precache, but even after stopping Nagios, removing those files, starting it back up, and restarting NSCA, the same behavior persists. The secondary Nagios server isn't showing this behavior, shows the correct statuses for all hosts and services, and no modifications were made to it either.
Any help would be greatly appreciated; thanks in advance! I've already posted on the Nagios Support forums, but to no avail.
EDIT: Never mind. It turns out there were two instances of Nagios running, hence the intermittent behavior. I killed both, started Nagios again, and it stabilized.
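For anyone hitting the same flip-flopping statuses, a quick way to spot a stray daemon is to list every running nagios process and compare it against the PID/lock file. A rough sketch, assuming the usual /usr/local/nagios layout (adjust the path for your install):

```python
import subprocess

PID_FILE = "/usr/local/nagios/var/nagios.lock"  # assumed default location

# pgrep -a prints the PID and full command line of every matching process.
result = subprocess.run(["pgrep", "-a", "nagios"], capture_output=True, text=True)
running = [line for line in result.stdout.splitlines() if line]

try:
    with open(PID_FILE) as f:
        expected_pid = f.read().strip()
except FileNotFoundError:
    expected_pid = None

print(f"PID file says: {expected_pid}")
for line in running:
    print("running:", line)

if len(running) > 1:
    print("More than one nagios process: a stray instance may be overwriting status data.")
```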
At 9am this morning I found that my server was unresponsive; I was unable to connect to it via either SSH or HTTP/S.
When I finally logged in, everything was fine other than high CPU usage (averaging 70-80%). Via the console I could see that disk I/O was high and that there was also a peak in the number of API requests per second.
Can anyone point me in the right direction so I can find out what happened?
Thanks,
What you experienced was most likely the result of the networking issue described here. Also, in general, this forum is a good source to stay updated.
Beyond that, if the time window does not line up with your observations (no time zone is specified in the question), you would need logs (e.g. Apache's, syslog, etc.) to determine a possible cause, although it can turn into an exercise in speculation after the fact.
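If you do go down the log route, pulling just the lines from the incident window saves a lot of scrolling. A rough sketch that filters a classic syslog-format file by time window; the path and the 2017-09-20 window are assumptions to adjust for your system:

```python
from datetime import datetime

LOG_FILE = "/var/log/syslog"                 # assumed path; /var/log/messages on RHEL-style systems
WINDOW_START = datetime(2017, 9, 20, 8, 30)  # adjust to the incident window
WINDOW_END = datetime(2017, 9, 20, 9, 30)

def syslog_time(line, year):
    # Classic syslog lines start with e.g. "Sep 20 09:00:01"; skip anything else.
    try:
        return datetime.strptime(line[:15], "%b %d %H:%M:%S").replace(year=year)
    except ValueError:
        return None

with open(LOG_FILE, errors="replace") as f:
    for line in f:
        ts = syslog_time(line, WINDOW_START.year)
        if ts and WINDOW_START <= ts <= WINDOW_END:
            print(line.rstrip())
```

The Apache access/error logs use their own timestamp formats, so they would need a different parser, but the same window-filtering idea applies.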