For the past two days, rsyslogd has been using 100% CPU on all the RHEL EC2 instances in my environment. I stopped and started the rsyslog service, but the issue persists.
This is the first time we have seen this behaviour across multiple servers.
All servers have sufficient disk space and memory.
I checked the kernel logs (/var/log/kernel) and the server messages (/var/log/messages), but found nothing useful.
The OS and kernel versions of all the servers are:
Operating System: Red Hat Enterprise Linux Server 7.2 (Maipo)
Kernel: Linux 3.10.0-1160.42.2.el7.x86_64
Can someone please advise?
The issue turned out to be an expired certificate used by rsyslog. There were a lot of connection-retry errors in the system messages on all clients. After renewing the rsyslog certificates, CPU usage went back down.
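Not part of the original answer, but to catch an expiring rsyslog certificate before it causes a retry storm, a small monitoring sketch could help. The function name and the certificate path in the comment are illustrative assumptions, not from the thread:

```python
from datetime import datetime, timezone

def days_until_expiry(not_after: str) -> int:
    """Days until a certificate expires, given openssl's default
    notAfter format, e.g. "Jun  1 12:00:00 2024 GMT"."""
    expiry = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    expiry = expiry.replace(tzinfo=timezone.utc)
    return (expiry - datetime.now(timezone.utc)).days

# Feed it the value printed by (path is hypothetical):
#   openssl x509 -noout -enddate -in /etc/rsyslog.d/cert.pem
# after stripping the leading "notAfter=".
```

Running this daily from cron and alerting when the result drops below, say, 30 would have flagged the problem before the clients started hammering the server with retries.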
Tested on 2023-01-15.
The script used is: https://github.com/hwdsl2/setup-ipsec-vpn
Test system: Debian GNU/Linux 11 (bullseye)
GCP virtual machine type: e2-small
All the required ports are open.
HTTP/HTTPS/IP forwarding: checked
I'm not sure what is causing the problem; the Windows client reports error 619 or 800.
A free L2TP account I found online connects normally, which suggests there is nothing wrong with my computer's settings.
The author's script mentions that Windows 7 needs a registry update and a reboot. I did that, but I still can't connect.
A recent security scan has shown that our Amazon Linux 2 instances are vulnerable to an issue in systemd:
https://nvd.nist.gov/vuln/detail/CVE-2021-33910
I've looked through the Amazon Linux 2 security bulletins (https://alas.aws.amazon.com/alas2.html), and at present they don't seem to have anything for this problem.
I tried to update systemd manually, but we're already running the latest version.
Has anyone come across this, or found a resolution? At the moment, all I can do is sit and wait for AWS to patch the vulnerability.
As this vulnerability was modified (14 July) since its last analysis, patching might take some time.
In the meantime, if possible, you could use Red Hat Enterprise Linux, as fixed packages are available there and some older versions, such as Red Hat Enterprise Linux 7, are unaffected.
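If you want to script a fleet-wide check of `systemctl --version` output while waiting for an advisory, something like the sketch below could help. The function name and the version values shown are illustrative assumptions; and note, as the questioner found, that the upstream version number alone is not conclusive on Amazon Linux, because fixes are backported without bumping the version (the RPM changelog, via `rpm -q --changelog systemd`, is the more reliable signal):

```python
def version_at_least(installed: str, fixed: str) -> bool:
    """Naive dotted-version comparison. Distro packages often backport
    security fixes without bumping the upstream version, so a version
    check alone cannot confirm a CVE fix; treat this as a first pass."""
    as_ints = lambda v: [int(part) for part in v.split(".")]
    return as_ints(installed) >= as_ints(fixed)

print(version_at_least("246.15", "246.15"))  # True
print(version_at_least("219", "246.15"))     # False
```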
I have issues with websocket performance on AWS EC2.
I use websockets to listen to a server with an incoming data rate of 100-300 Kb/sec; just listening, not sending. On EC2 I get disconnected every 10-20 minutes (code 1006, abnormal connection loss, no reason given). I have tested with t2.micro (which I believe should be more than enough for such a small task) and with t2.large. I use US East, which should be close to the source.
Compare this with only one disconnection every few hours when I run the same app on my personal computer, in a different country. I have used two different libraries (Python aiohttp and websockets) and see the same issue with both.
This points to a network-quality issue on EC2. However, I'm not sure how demanding this websocket task really is, so the result is surprising.
Did anyone experience this before? What other diagnostics can I do to better understand the root cause?
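Whatever the root cause turns out to be, a long-lived websocket client on EC2 should treat code-1006 drops as routine and reconnect automatically with backoff. A minimal sketch of the backoff schedule such a client could use (the function name and parameters are mine, not from any particular library):

```python
import itertools

def backoff_delays(base=1.0, cap=60.0):
    """Exponential backoff schedule for reconnect attempts. A client
    would sleep for each delay in turn after a failed or dropped
    connection, and restart the schedule after a healthy one."""
    delay = base
    while True:
        yield delay
        delay = min(delay * 2, cap)

# First few reconnect delays:
print(list(itertools.islice(backoff_delays(), 8)))
# -> [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

Wrapping the aiohttp or websockets receive loop in a `while True` that sleeps on these delays after each disconnect keeps the listener alive across the drops while you investigate them.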
I'm unable to run the Meteor leaderboard demo after a "Failed to receive keepalive" error on an AWS EC2 t1.micro instance. Starting from a freshly booted Amazon Machine Image (AMI), I can run the leaderboard demo at localhost:3000 from Firefox when connected with a VNC client (TightVNC Viewer). It runs very, very slowly, but it runs.
If I fail to interact with it soon enough, however, I get these messages:
I2051-00:03:03.173(0)?Failed to receive keepalive! Exiting.
=> Exited with code:1
=> Meteor server restarted
From that point forward, everything on that instance runs at a glacial pace. Switching back to the Firefox window takes 3 minutes. When I try to connect to localhost:3000 in Firefox, I usually get a message about a script no longer responding, and eventually the terminal adds this to the output above:
I2051-00:06:02.443(0)?Failed to receive keepalive! Exiting.
=> Exited with code:1
=> Meteor server restarted
I2051-00:08:17.227(0)?Failed to receive keepalive! Exiting.
=> Exited with code:1
=> Your application is crashing. Waiting for file change.
Can anyone translate for me what is happening?
I'm wondering whether the t1.micro instance I'm running is just too under-powered, or whether Meteor isn't being shut down properly, leaving an instance of MongoDB running while it tries to launch another.
I'm using the Amazon Machine Image ubuntu-precise-12.04-amd64-server-20130411.1 (ami-70f96e40), which lists this configuration:
Size: t1.micro
ECUs: up to 2
vCPUs: 1
Memory (GiB): 0.613
Instance Storage (GiB): EBS only
EBS-Optimized Available: -
Network Performance: Very Low
Micro instances
Micro instances are a low-cost instance option, providing a small amount of CPU resources. They are suited for lower throughput applications, and websites that require additional compute cycles periodically, but are not appropriate for applications that require sustained CPU performance. Popular uses for micro instances include low traffic websites or blogs, small administrative applications, bastion hosts, and free trials to explore EC2 functionality.
If my guess is right, can anyone suggest an AMI suitable for Meteor development?
Thanks
Check this answer.
Also try removing autopublish: meteor remove autopublish
How are you running the app on EC2? I have been able to run apps on a micro instance, so I don't see why this should be an issue.
If you are running it with 'meteor' as you would locally, that's probably the issue. You get much better performance running it as a Node app; this typically isn't a problem when developing locally, but it may be too much for an EC2 micro instance.
What you want to do is 'meteor bundle example.tgz', upload that to the server, and run it as a Node app.
Here is a guide that I remember using a while ago to get it done on EC2:
http://julien-c.fr/2012/10/meteor-amazon-ec2/
You shouldn't need VNC either; you can access the app in a browser from your own computer using the public address assigned to your instance.
If you get a node-fibers error message (which is pretty common), cd into bundle/program/server and run 'npm uninstall fibers' followed by 'npm install fibers'.
First, my setup, which is used for testing purposes:
3 Virtual Machines running with the following configuration:
MS Windows 2008 Server Standard Edition
Latest version of AppFabric Cache
Each one has a local network share where the config file is stored (I have added all the machines to each config)
The cache is distributed but not high-availability (we don't have the Enterprise edition of Windows)
Each host is configured as a lead, so according to the documentation at least one host should be allowed to crash.
Each machine has the website I'm testing installed, with the local cache configured
One Linux machine is used as a proxy (running Varnish) to distribute traffic for testing
That's the setup; now on to the problem. The scenario I'm testing simulates one of the servers crashing and then being brought back into the cluster. I have problems both with the server crashing and with bringing it back up. The steps I use to test:
Direct the traffic with Varnish on the linux machine to one server only.
Log in to make sure there is something in the cache.
Unplug the network cable on one of the other servers (simulating that server crashing)
Now I get a cache timeout and a service error. I want the application to stay up on the servers that didn't crash, but it takes some time for the cache to come back up on the remaining servers. Is that how it should be? Plugging the network cable back in and starting the host again causes a similar problem.
So my question is: have I missed something? What I would like to see is that if one server crashes, the cache stays up, since a majority of the leads are still running, and that restarting the crashed server brings it back gracefully into the cluster without causing any problems on the other hosts. But maybe that's not how it works?
I ran through a similar test scenario a few months ago where I had a test client generating load on a 3 lead-server cluster with a variety of Puts, Gets, and Removes. I rebooted one of the servers multiple times while the load test was running, and the cache stayed online. If I remember correctly, there were a limited number of errors as that server rebooted, but overall the cache appeared to remain healthy.
I'm not sure why you're not seeing similar results, but I would try removing the Varnish proxy from your test and see if that helps.
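The lead-majority rule the questioner describes can be sanity-checked with a one-liner; the helper name is mine, and this only models the documented rule that the cluster stays up while a majority of lead hosts are running:

```python
def cluster_survives(lead_hosts: int, failed: int) -> bool:
    """An AppFabric cluster managed by lead hosts stays up only while
    a strict majority of the lead hosts are still running."""
    return (lead_hosts - failed) > lead_hosts // 2

# With 3 leads: one failure is tolerated, two are not.
print(cluster_survives(3, 1))  # True
print(cluster_survives(3, 2))  # False
```

So with the 3-lead setup in the question, unplugging one server should leave a 2-of-3 majority and the cache should stay up; if it doesn't, something other than quorum (such as the proxy or client retry configuration) is likely at fault.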