Process replayd killed by jetsam reason highwater - replaykit

Recently I added a broadcast upload extension to my host app to implement system-wide screen casting. I found that the broadcast upload extension sometimes stopped for an unknown reason. If I debug the broadcast upload extension process in Xcode, it stops without hitting a breakpoint (if the extension is killed for exceeding the 50 MB memory limit, it does stop at a breakpoint, and Xcode points out that it was killed for exceeding the 50 MB memory limit). For more information, I read the console log line by line. Finally, I found a significant line:
osanalyticshelper Process replayd [26715] killed by jetsam reason highwater
It looks like the ReplayKit serving process 'replayd' was killed by jetsam, and the reason is 'highwater'. So I searched the internet for more information and found a post:
https://www.jianshu.com/p/30f24bb91222
After reading that, I checked the JetsamEvent report on the device and found that the 'replayd' process occupied 100 MB of memory when it was killed. Is there a 100 MB memory limit for the 'replayd' process? How can I prevent it from occupying more than 100 MB of memory?
Furthermore, I found that this problem often occurred when the previous extension process had been stopped via RPBroadcastSampleHandler's finishBroadcastWithError method. If I stop the extension via the Control Center button, it rarely occurs.
By comparison, when the 'Wemeet' app stops its broadcast upload extension, it rarely causes this problem. I compared the console log from when 'Wemeet' stops its broadcast extension with the log from when my app stops its broadcast extension, and found that this line differs:
Wemeet:
mediaserverd MEDeviceStreamClient.cpp:429 AQME Default-InputOutput: client stopping: <ZenAQIONodeClient#0x1080f7a40, sid:0x3456e, replayd(30213), 'prim'>; running count now 0
My app:
mediaserverd MEDeviceStreamClient.cpp:429 AQME Default-InputOutput: client stopping: <ZenAQIONodeClient#0x107e869a0, sid:0x3464b, replayd(30232), 'prim'>; running count now 3
As we can see, the 'running count' is different.

Related

HTCondor - Partitionable slot not working

I am following the tutorials on the
Center for High Throughput Computing and Introduction to Configuration pages of the HTCondor website to set up a partitionable slot. Before any configuration I run
condor_status
and get the following output.
I update the file 00-minicondor in /etc/condor/config.d by adding the following lines at the end of the file.
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=4
SLOT_TYPE_1_PARTITIONABLE = TRUE
and reconfigure
sudo condor_reconfig
Now with
condor_status
I get this output as expected. Now, I run the following command to check everything is fine
condor_status -af Name Slotype Cpus
and find slot1#ip-172-31-54-214.ec2.internal undefined 1 instead of slot1#ip-172-31-54-214.ec2.internal Partitionable 4 61295, which is what I would expect. Moreover, when I try to submit a job that asks for more than 1 CPU, it does not allocate space for it as it should (it stays waiting forever).
I don't know if I made some mistake during the installation process or what could be happening. I would really appreciate any help!
EXTRA INFO: In case it is of any help, I have installed HTCondor with the command
curl -fsSL https://get.htcondor.org | sudo /bin/bash -s -- --no-dry-run
on Ubuntu 18.04 running on an old p2.xlarge instance (it has 4 cores).
UPDATE: After rebooting the whole thing it seems to be working. I can now send jobs with different CPU requests and they start properly.
The only issue that I would say persists is that memory allocation is not shown properly; for example:
But in reality it is allocating enough memory for the job (in this case around 12 GB).
If I run again
condor_status -af Name Slotype Cpus
I still get something I am not supposed to
But at least it is showing the correct number of CPUs (even if it just says undefined).
What is the output of condor_q -better when the job is idle?

operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full

I have a machine running multiple applications which constantly perform UNC access (\\server-ip\share), like so:
std::ifstream src(fileName, std::ios::binary);
std::ofstream dst(newFileName, std::ios::binary);
CopyFromRemote(src, dst);
dst.flush();
dst.close();
src.close();

void CopyFromRemote(std::ifstream& src, std::ofstream& dst)
{
    char buffer[8192]; // read 8KB each chunk
    while (src.read(buffer, sizeof(buffer)))
    {
        dst.write(buffer, sizeof(buffer));
        // Here there is code that checks that some timer !> max read time so as
        // to not be stuck if there is a network issue with this src.
    }
    if (src.eof() && src.gcount() > 0)
    {
        dst.write(buffer, src.gcount()); // few bytes left
    }
}
As can be seen, the network is heavily strained by a round trip for every 8 KB (the files are several MB in size). The benefit here is the ability to abort a file copy from a specific source if it takes too long.
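For reference, here is a minimal sketch of how such a per-file time budget could be enforced with std::chrono. This is purely an illustration: the original timer logic is not shown above, and maxReadTime is an assumed parameter.

#include <chrono>
#include <fstream>

// Copy src to dst in 8KB chunks; give up if the whole copy exceeds maxReadTime.
// Returns false when the copy was aborted because the source was too slow.
bool CopyWithTimeout(std::ifstream& src, std::ofstream& dst,
                     std::chrono::seconds maxReadTime)
{
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    char buffer[8192];

    while (src.read(buffer, sizeof(buffer)))
    {
        dst.write(buffer, src.gcount());
        if (clock::now() - start > maxReadTime)
            return false; // abort: source too slow (e.g. network issue)
    }
    if (src.eof() && src.gcount() > 0)
        dst.write(buffer, src.gcount()); // trailing partial chunk
    return true;
}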
The problem I'm facing is that after several days, all UNC paths become inaccessible from this machine, with the error above. I'm not sure what the source of the problem is, but it's sporadic and hard to pin down. When the problem happens, the first line fails (std::ifstream src...), and telnet also stops working.
Also: when the applications are killed, the UNC paths become accessible again, but when the processes are restarted they immediately become inaccessible again. Restarting the machine solves the problem for several days.
Initially I thought it was port exhaustion, but netstat does not reveal too many connections or hanging connections, and the Task Manager Performance tab does not show abnormal figures. TcpQry shows normal TCP/UDP mapping numbers.
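As a side note, one rough way to watch for a socket leak from inside the applications themselves is to count the TCP connections owned by the current process. Below is a sketch using GetExtendedTcpTable (IPv4 only, link against iphlpapi.lib); it complements rather than replaces netstat.

#include <winsock2.h>
#include <windows.h>
#include <iphlpapi.h>
#include <cstdio>
#include <vector>

// Count the IPv4 TCP connections owned by this process - a crude way to spot
// a socket leak from within the application itself.
static int CountOwnTcpConnections()
{
    DWORD size = 0;
    GetExtendedTcpTable(nullptr, &size, FALSE, AF_INET, TCP_TABLE_OWNER_PID_ALL, 0);
    std::vector<char> buf(size);
    auto* table = reinterpret_cast<MIB_TCPTABLE_OWNER_PID*>(buf.data());
    if (GetExtendedTcpTable(table, &size, FALSE, AF_INET, TCP_TABLE_OWNER_PID_ALL, 0) != NO_ERROR)
        return -1;

    const DWORD pid = GetCurrentProcessId();
    int count = 0;
    for (DWORD i = 0; i < table->dwNumEntries; ++i)
        if (table->table[i].dwOwningPid == pid)
            ++count;
    return count;
}

int main()
{
    std::printf("TCP connections owned by this process: %d\n", CountOwnTcpConnections());
}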
Also: a packet capture shows there is no request when the problem happens (the request is not reaching the network), and Event Viewer does not reveal anything. I made the following registry changes; these would probably just delay the problem rather than eliminate it, and in any case they didn't help:
Find the autodisconnect value in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\lanmanserver\parameters. If it's not there, create a new REG_DWORD called autodisconnect. Edit the value as Hexadecimal and set it to ffffffff.
Find KeepConn in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\lanmanworkstation\parameters. If it doesn't exist create it as a REG_DWORD value and assign it the value 65534.
Find HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters and create a new DWORD value named MaxUserPort. Set the value to 65534.
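For completeness, the same three edits can be scripted rather than done by hand in regedit; the sketch below simply writes the values listed above and is equivalent to the manual steps (it requires administrator rights).

#include <windows.h>

// Create (or open) a key under HKEY_LOCAL_MACHINE and write a REG_DWORD value.
static bool SetDwordValue(const char* subKey, const char* name, DWORD value)
{
    HKEY key = nullptr;
    if (RegCreateKeyExA(HKEY_LOCAL_MACHINE, subKey, 0, nullptr, 0,
                        KEY_SET_VALUE, nullptr, &key, nullptr) != ERROR_SUCCESS)
        return false;
    const LSTATUS rc = RegSetValueExA(key, name, 0, REG_DWORD,
                                      reinterpret_cast<const BYTE*>(&value), sizeof(value));
    RegCloseKey(key);
    return rc == ERROR_SUCCESS;
}

int main()
{
    SetDwordValue("SYSTEM\\CurrentControlSet\\Services\\lanmanserver\\parameters",
                  "autodisconnect", 0xFFFFFFFF);
    SetDwordValue("SYSTEM\\CurrentControlSet\\Services\\lanmanworkstation\\parameters",
                  "KeepConn", 65534);
    SetDwordValue("SYSTEM\\CurrentControlSet\\Services\\Tcpip\\Parameters",
                  "MaxUserPort", 65534);
}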
Eventually this turned out to be due to a Microsoft OS bug. Since the machine is an offline machine, it does not get regular updates automatically. Installing all OS updates solved the problem.

Aerospike error: All batch queues are full

I am running an Aerospike cluster in Google Cloud. Following the recommendation in this post, I updated to the latest version (3.11.1.1) and re-created all servers. In fact, this change caused my 5 servers to operate at a much lower CPU load (it was around 75% before; now it is at about 20%, as shown in the graph below):
Because of this low load, I decided to reduce the cluster size to 4 servers. When I did this, my application started to receive the following error:
All batch queues are full
I found this discussion about the topic, which recommends changing the parameters batch-index-threads and batch-max-unused-buffers with the command
asadm -e "asinfo -v 'set-config:context=service;batch-index-threads=NEW_VALUE'"
I tried many combinations of values (batch-index-threads with 2, 4, 8, 16) and also changed the batch-max-unused-buffers parameter, but none of that solved the problem. I keep receiving the All batch queues are full error.
Here is the relevant information from my aerospike.conf:
service {
    user root
    group root
    paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
    paxos-recovery-policy auto-reset-master
    pidfile /var/run/aerospike/asd.pid
    service-threads 32
    transaction-queues 32
    transaction-threads-per-queue 4
    batch-index-threads 40
    proto-fd-max 15000
    batch-max-requests 30000
    replication-fire-and-forget true
}
I use 300GB SSD disks on these servers.
A quick note which may or may not pertain to you:
A common mistake we have seen in the past is that developers decide to use 'batch get' as a general purpose 'get' for single and multiple record requests. The single record get will perform better for single record requests.
It's possible that you are being constrained by the network between the clients and the servers. Reducing from 5 to 4 nodes reduced the aggregate pipe. In addition, removing a node will start cluster migrations, which add additional network load.
I would look at the batch-max-buffer-per-queue config parameter.
Maximum number of 128KB response buffers allowed in each batch index
queue. If all batch index queues are full, new batch requests are
rejected.
In conjunction with raising this value from the default of 255, you will also want to raise batch-max-unused-buffers to at least batch-index-threads x batch-max-buffer-per-queue + 1. If you do not, new buffers will be created and destroyed constantly, because the number of free (unused) buffers is smaller than the number in use: the moment a batch response is served, the system will strive to trim the buffers down to the max unused number. You will see this reflected in the batch_index_created_buffers metric constantly rising.
Be aware that you need to have enough DRAM for this. For example, if you raise batch-max-buffer-per-queue to 320 you will consume
40 (`batch-index-threads`) x 320 (`batch-max-buffer-per-queue`) x 128K = 1600MB
For the sake of performance, batch-max-unused-buffers should be set to 13000, which gives a maximum memory consumption of 1625 MB (1.59 GB) per node.
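Putting those numbers together, the service stanza from the question might end up looking like the sketch below; the 320 and 13000 are just the example values discussed above, so adjust them to your own memory budget.

service {
    # ... other service settings as in the question ...
    batch-index-threads 40
    batch-max-buffer-per-queue 320    # raised from the default of 255
    batch-max-unused-buffers 13000    # at least batch-index-threads x batch-max-buffer-per-queue + 1
}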

Is there a maximum concurrency for AWS s3 multipart uploads?

Referring to the docs, you can specify the number of concurrent connections when pushing large files to Amazon Web Services S3 using the multipart uploader. While it does say the concurrency defaults to 5, it does not specify a maximum, or whether the size of each chunk is derived from the total file size divided by the concurrency.
I trawled the source code, and the comment is pretty much the same as the docs:
Set the concurrency level to use when uploading parts. This affects
how many parts are uploaded in parallel. You must use a local file as
your data source when using a concurrency greater than 1
So my functional build looks like this (the vars are defined, by the way; this is just condensed for the example):
use Aws\Common\Exception\MultipartUploadException;
use Aws\S3\Model\MultipartUpload\UploadBuilder;

$uploader = UploadBuilder::newInstance()
    ->setClient($client)
    ->setSource($file)
    ->setBucket($bucket)
    ->setKey($file)
    ->setConcurrency(30)
    ->setOption('CacheControl', 'max-age=3600')
    ->build();
It works great, except that a 200 MB file takes 9 minutes to upload... with 30 concurrent connections? That seems suspicious to me, so I upped the concurrency to 100 and the upload time was 8.5 minutes. Such a small difference could just be the connection and not the code.
So my question is whether or not there's a concurrency maximum and what it is, and whether you can specify the size of the chunks or the chunk size is calculated automatically. My goal is to get a 500 MB file to transfer to AWS S3 within 5 minutes, so I have to optimize for that as much as possible.
Looking through the source code, it looks like 10,000 is the maximum number of concurrent connections. There is no automatic calculation of chunk sizes based on concurrent connections, but you can set those yourself if needed for whatever reason.
I set the chunk size to 10 MB with 20 concurrent connections and it seems to work fine. On a real server I got a 100 MB file to transfer in 23 seconds, much better than the 3.5 to 4 minutes it was taking in the dev environments. Interesting, but those are the stats, should anyone else come across this same issue.
This is what my builder ended up being:
$uploader = UploadBuilder::newInstance()
    ->setClient($client)
    ->setSource($file)
    ->setBucket($bucket)
    ->setKey($file)
    ->setConcurrency(20)
    ->setMinPartSize(10485760)
    ->setOption('CacheControl', 'max-age=3600')
    ->build();
I may need to up that max cache value, but as of yet this works acceptably. The key was moving the processing code to the server and not relying on the weakness of my dev environments, no matter how powerful the machine or how high-class the internet connection.
We can abort the process during the upload, halting all operations and aborting the upload at any point. We can also set the concurrency and the minimum part size:
$uploader = UploadBuilder::newInstance()
    ->setClient($client)
    ->setSource('/path/to/large/file.mov')
    ->setBucket('mybucket')
    ->setKey('my-object-key')
    ->setConcurrency(3)
    ->setMinPartSize(10485760)
    ->setOption('CacheControl', 'max-age=3600')
    ->build();

try {
    $uploader->upload();
    echo "Upload complete.\n";
} catch (MultipartUploadException $e) {
    $uploader->abort();
    echo "Upload failed.\n";
}

HP ProLiant 100 G5/G6/G7 Servers Series - Showing Uncorrectable ECC Events Occurred in SEL Log

HP Insight Diagnostics Version 8.4.0.3521A (x86_64)
Computer Name: ezsetupsystem3c4a927c9e88
During the device test it gives the following error:
Total Memory-ECC test Failed
Description- Uncorrectable ECC Events occurred in SEL log Device, Ran on CPU 0
Recommended Repair- Please refer IPMI Sensor Event Log for ECC events
Error Code- 021278
Please help me find out why this error is occurring.
Is it because of the WAMP server installation?
This means that the server has a DIMM which has exceeded its acceptable ECC error count. Its location is denoted in the error; depending on the platform, you can identify the specific slot. Which HP ProLiant model is this? This is a standard warranty repair.
You have faulty memory. To identify the faulty module, run HPS Reports:
http://update.external.hp.com/HPS/HPSreports/
Once it has completed, unzip the archive and open the index.html file inside.
It will collect all hardware information and highlight the faulty components in RED.