Intermittent issues with ClamAV clamd INSTREAM on socket

I've got an AWS Lambda function running NodeJS code to stream files from S3 to ClamAV running on an EC2 instance.
Generally (about 75% of the time) the system works, but often (especially when multiple files are being scanned from different Lambda containers) clamd threads get stuck on INSTREAM.
Once a thread has been in INSTREAM for 25-30 seconds it does not seem to be able to recover, and once it has been QUEUEDSINCE 350 seconds it is killed off. I can't figure out how either of these numbers relates to any value in my config.
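For reference, these are the clamd.conf directives I would expect those numbers to relate to (timeouts, thread and stream limits), and how I checked them, assuming a Debian-style layout where the config lives at /etc/clamav/clamd.conf (adjust the path for your distro):
$ grep -Ei '^(ReadTimeout|CommandReadTimeout|IdleTimeout|MaxThreads|MaxQueue|StreamMaxLength)' /etc/clamav/clamd.conf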
I'm struggling to find any sign of an error in the logs - the number of INSTREAM requests matches the number of complete scans:
$ sudo grep -c "got command INSTREAM" /var/log/clamav/clamav.log
129
$ sudo grep -c "Chunks complete" /var/log/clamav/clamav.log
129
$ sudo grep -c "Scanthread: connection shut down" /var/log/clamav/clamav.log
129
...okay, now that I look a little more deeply into the logs, it just takes a lot longer for some files to be scanned. When I do a batch of 16 files with Lambda concurrency restricted to 7, the first 7 files are scanned within a few seconds. The next file begins scanning soon after and gets to "Chunks complete" within a second, but takes 23 seconds before "Scanthread: connection shut down". From here on it just gets worse - 1:24, 1:45... and then the 3rd batch of 7 files takes over 3 minutes to scan.
If I give the system a few minutes to settle down and all the threads die off, the same files that took over 3 minutes now take about 5-7 seconds.
If I run the same test on a faster machine the performance improves, but the issue is still there.
When threads get stuck at INSTREAM I can see that the files are still there:
$ ls -al /tmp
drwx------ 2 clamav clamav 4096 Aug 29 16:52 clamav-493bdf893ce4d8d7763c00fee22d9d69.tmp
-rwx------ 1 clamav clamav 25683921 Aug 29 16:52 clamav-5cdefd83d5531a03c7cf22fda37d133f.tmp
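As far as I can tell, the INSTREAM / QUEUEDSINCE states come from clamd's STATS output; this is what I've been using to watch the live thread and queue state, assuming clamd is listening on TCP port 3310 (clamav-host is a placeholder for the EC2 instance's address; clamdtop clamav-host:3310 gives a rolling view of the same data):
$ echo nSTATS | nc clamav-host 3310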
https://github.com/yongtang/clamav.js/issues/6
https://github.com/yongtang/clamav.js/issues/7
https://bugzilla.clamav.net/show_bug.cgi?id=12181

Related

AWS EC2 terminal session terminated with "Plugin with name Standard_Stream not found"

I was streaming Kafka on AWS EC2 (CentOS 7). My Session Manager idle timeout is set to 60 min, and yet, after running for much less than that, the terminal froze, saying my session had been terminated. Of course, the Kafka streaming was disrupted as well.
When I tried to restart a new session with a new terminal, I got this error popup
Your session has been terminated for the following reasons: Plugin with name Standard_Stream not found. Step name: Standard_Stream
and I am still unable to restart a terminal.
What does this error mean, and how do I resolve it? Thanks.
To debug this, you first need to access the EC2 instance over SSH with a key pair (.pem file) - ask your admin.
Running tail -f hit this issue:
tail: inotify resources exhausted
tail: inotify cannot be used, reverting to polling
Restarting the ssm-agent service also failed with No space left on device,
but it's not actually about disk space:
[root@env-test ec2-user]# systemctl restart amazon-ssm-agent.service
Error: No space left on device
[root@env-test ec2-user]# df -h | grep dev
devtmpfs 32G 0 32G 0% /dev
tmpfs 32G 0 32G 0% /dev/shm
/dev/nvme0n1p1 100G 82G 18G 83% /
So the error itself means that the system is running low on inotify
watches, which enable programs to monitor file/directory changes. To see
the currently set limit (output from my machine included):
$ cat /proc/sys/fs/inotify/max_user_watches
8192
Check which processes are using inotify, so you can improve your apps or increase max_user_watches:
for foo in /proc/*/fd/*; do readlink -f "$foo"; done | grep inotify | sort | uniq -c | sort -nr
5 /proc/1/fd/anon_inode:inotify
2 /proc/7126/fd/anon_inode:inotify
2 /proc/5130/fd/anon_inode:inotify
1 /proc/4497/fd/anon_inode:inotify
1 /proc/4437/fd/anon_inode:inotify
1 /proc/4151/fd/anon_inode:inotify
1 /proc/4147/fd/anon_inode:inotify
1 /proc/4028/fd/anon_inode:inotify
1 /proc/3913/fd/anon_inode:inotify
1 /proc/3841/fd/anon_inode:inotify
1 /proc/31146/fd/anon_inode:inotify
1 /proc/2829/fd/anon_inode:inotify
1 /proc/21259/fd/anon_inode:inotify
1 /proc/1934/fd/anon_inode:inotify
Notice that the inotify list above includes the PIDs of the ssm-agent
processes, which explains why SSM runs into trouble once the
max_user_watches limit is reached:
ps -ef | grep ssm-ag
root 3841 1 0 00:02 ? 00:00:05 /usr/bin/amazon-ssm-agent
root 4497 3841 0 00:02 ? 00:00:33 /usr/bin/ssm-agent-worker
Final solution (permanent, preserved across restarts):
echo "fs.inotify.max_user_watches=1048576" >> /etc/sysctl.conf
sysctl -p
Verify:
$ aws ssm start-session --target i-123abc456efd789xx --region ap-northeast-2
Starting session with SessionId: userdev-03ccb1a04a6345bf5
sh-4.2$
This issue comes from the EC2 instance, not from the SSM agent itself; see the SSM agent documentation for more background on how it works.
In my case, extending the disk space worked!
(syslog had filled the disk in my case)
In my case too, extending the disk space worked, as my /var/log was huge.

GCP VM time sync issue after resuming from suspension (in both linux and windows)

The GCP VM doesn't update the system date/time after resuming from suspension.
It keeps the system date/time the same as it was when it was suspended. Because of this, my scripts that fetch gcloud resources are failing with an auth token expiry error.
As per the Google documentation (https://cloud.google.com/compute/docs/instances/managing-instances#linux_1),
NTP should already be configured, but on my VMs ntpq -p fails with a "command not found" error.
$ sudo timedatectl status
Local time: Wed 2020-08-05 15:31:34 EDT
Universal time: Wed 2020-08-05 19:31:34 UTC
RTC time: Wed 2020-08-05 19:31:34
Time zone: America/New_York (EDT, -0400)
System clock synchronized: yes
NTP service: inactive
RTC in local TZ: no
gcloud auth activate-service-account in my script is failing with the error below:
(gcloud.compute.instances.describe) There was a problem refreshing your current auth tokens: invalid_grant: Invalid JWT: Token must be a short-lived token (60 minutes) and in a reasonable timeframe. Check your iat and exp values in the JWT claim.
OS - Windows/Linux
After resuming, the hardware clock of the VM instance is set correctly as it gets time from the hypervisor. You can check it with sudo hwclock.
The problem is with the time service of the operating system.
For Windows, it can take a few minutes to sync the system time with the time source. If you can't wait for the time-sync cycle to complete, you can log on to Windows and force time synchronization manually:
net stop W32Time
net start W32Time
w32tm /resync /force
In Linux, NTP cannot handle a time offset of more than 1000 seconds (see http://doc.ntp.org/4.1.0/ntpd.htm). Therefore you have to force time synchronization manually. There are various ways to do that (some of them are deprecated, but may still work):
netdate timeserver1
ntpdate -u -s timeserver1
hwclock --hctosys
service ntp restart
systemctl restart ntp.service
If you run into this issue while using Google Cloud Platform: their images replace ntpd and systemd-timesyncd with chronyd.
I had to use systemctl start chrony to get my time in working order. Tried hwclock --hctosys, but it was ignoring time zones and thus setting the wrong time.
This happened because I was suspending every minute by accident. A permanent fix would be to modify the systemd unit definition and ask it to keep retrying to start the service.
The reason it stopped was this: Can't synchronise: no selectable sources
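For completeness, a sketch of both pieces on a chrony-based image (the unit may be named chrony or chronyd depending on the distro): stepping the clock immediately after a resume, and a drop-in override so systemd keeps retrying the service instead of giving up:
sudo chronyc makestep
# systemctl edit chronyd - contents of the drop-in override:
[Unit]
StartLimitIntervalSec=0
[Service]
Restart=on-failure
RestartSec=5s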

Redis keys deletion (aws elastic cache)

I used the script below to delete keys from my Redis node (using the AWS ElastiCache service). The bytes-used-for-cache metric dropped from 100 GB to 80 GB, which is expected since we deleted around 160,000 keys. Within a few minutes, however, bytes used for cache increased rapidly and hit the maximum (106 GB). Is this because of the delete operation?
Is there any fault with the script?
In addition, after reaching 106 GB it dropped drastically back to 80 GB within a few minutes and stabilized.
count=0
while read -r delkeys
do
    ((count=count+1))
    echo "KEYNAME:$delkeys"
    redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" DEL "$delkeys"
    # pause every 1000 deletes to avoid hammering the node
    if [[ $count == 1000 ]]
    then
        sleep 5
        count=0
    fi
done < filename
Engine Version : 2.8.21
Engine : Redis
In addition to the above, on the previous day I fetched values for all 160,000 keys using LRANGE "$dumpkeys" 0 -1 in the same script, but didn't face any performance issue like high CPU or RAM utilisation.
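For what it's worth, a lower-overhead variant is to batch many keys into each DEL call instead of one redis-cli invocation per key - a sketch, assuming one key per line in filename and no whitespace inside key names:
xargs -L 100 redis-cli -h "$REDIS_HOST" -p "$REDIS_PORT" DEL < filename
Each redis-cli call then deletes up to 100 keys in a single command, with far fewer connections to the node.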

Hyperledger Fabric: Peer nodes fail to restart with byfn script when machine is shut down while network is running

I have a hyperledger fabric network running on a single AWS instance using the default byfn script.
ERROR: Orderer, cli, CA docker containers show "Up" status. Peers show "Exited" status.
Error occurs when:
The byfn network is running and the machine is rebooted (not in my control, but due to some external reason).
The network is left running overnight without shutting down the machine; it shows the same status the next morning.
Error shown:
docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
b0523a7b1730 hyperledger/fabric-tools:latest "/bin/bash" 23 seconds ago Up 21 seconds cli
bfab227eb4df hyperledger/fabric-peer:latest "peer node start" 28 seconds ago Exited (2) 23 seconds ago peer1.org1.example.com
6fd7e818fab3 hyperledger/fabric-peer:latest "peer node start" 28 seconds ago Exited (2) 19 seconds ago peer1.org2.example.com
1287b6d93a23 hyperledger/fabric-peer:latest "peer node start" 28 seconds ago Exited (2) 22 seconds ago peer0.org2.example.com
2684fc905258 hyperledger/fabric-orderer:latest "orderer" 28 seconds ago Up 26 seconds 0.0.0.0:7050->7050/tcp orderer.example.com
93d33b51d352 hyperledger/fabric-peer:latest "peer node start" 28 seconds ago Exited (2) 25 seconds ago peer0.org1.example.com
Attaching docker log: https://hastebin.com/ahuyihubup.cs
Only the peers fail to start up.
Steps I have tried to solve the issue:
docker start $(docker ps -aq), or starting individual peers manually.
byfn down, generate and then up again. Shows the same result as above.
Rolled back to previous versions of the fabric binaries. Same result on 1.1, 1.2 and 1.4. With older binaries, the error is not repeated if the network is left running overnight, but it repeats when the machine is restarted.
Used older docker images such as 1.1 and 1.2.
Tried starting up only one peer, orderer and cli.
Changed network name and domain name.
Uninstalled docker, docker-compose and reinstalled.
Changed port numbers of all nodes.
Tried restarting without mounting any volumes.
The only thing that works is reformatting the AWS instance and reinstalling everything from scratch. Also, I am NOT using AWS blockchain template.
Any help would be appreciated. I have been stuck on this issue for a month now.
The error was resolved by adding the following lines to peer-base.yaml:
GODEBUG=netdns=go
dns_search: .
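For context, here is roughly where those two lines sit in the peer-base service definition (a sketch only; the surrounding keys come from the standard byfn peer-base.yaml and may differ in your copy):
services:
  peer-base:
    image: hyperledger/fabric-peer:latest
    environment:
      # ...existing CORE_* variables stay as they are...
      - GODEBUG=netdns=go
    dns_search: .
GODEBUG=netdns=go makes the peer use Go's built-in DNS resolver, and dns_search: . stops Docker from appending a search domain when resolving peer hostnames.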
Thanks to @gari-singh for the answer:
https://stackoverflow.com/a/49649678/5248781

Cntlmd not starting under systemd on Centos 7.1

I had a weird error trying to start cntlmd on CentOS 7.1.
systemctl start cntlmd results in the following in the logs (and yes, that's exactly how it's written in the logs :)):
systemd: Started SYSV: Cntlm is meant to be given your proxy address and becoming
The weird things are:
it did run initially after installation;
the exact same config works perfectly on another machine (provisioned with Chef, so 100% the same config);
and if I run it in the foreground it works, but through systemd it doesn't.
To "fix" it, I had to manually remove and reinstall, whereupon it worked again.
Has anybody seen this error (Google reveals nothing), and does anyone know what's going on?
I realised that the /var/run/cntlm directory seemed to be "removed" after every boot. It turns out the /var/run/cntlm directory is never created by systemd-tmpfiles on boot (thanks to this SO answer), which then resulted in:
Feb 29 06:13:04 node01 cntlm: Using following NTLM hashes: NTLMv2(1) NT(0) LM(0)
Feb 29 06:13:04 node01 cntlm[10540]: Daemon ready
Feb 29 06:13:04 node01 cntlm[10540]: Changing uid:gid to 996:995 - Success
Feb 29 06:13:04 node01 cntlm[10540]: Error creating a new PID file
because cntlm couldn't write its PID file, since /var/run/cntlm didn't exist.
So, to get systemd-tmpfiles to create the /var/run/cntlm directory on boot, you need to add the following file as /usr/lib/tmpfiles.d/cntlm.conf:
d /run/cntlm 700 cntlm cntlm
Reboot and Bob's your uncle.
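If you'd rather not reboot, running systemd-tmpfiles by hand should create the directory immediately; a quick check afterwards:
sudo systemd-tmpfiles --create cntlm.conf
ls -ld /run/cntlm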