I am running a simple program on my ESP32-S3 DevKitC which just reads data from the ADC channels and prints it to the monitor. It all runs fine, but after a few seconds the readout is interrupted by this error, which then just keeps repeating. (What is interesting is that the error occurs in the middle of printing the integer 766 and splits up the 7 and the 66.)
Here is the monitor readout:
2094, 767
2095, 767
2093, 767
2099, 7E (50298) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
E (50298) task_wdt: - IDLE (CPU 0)
E (50298) task_wdt: Tasks currently running:
E (50298) task_wdt: CPU 0: main
E (50298) task_wdt: CPU 1: IDLE
E (50298) task_wdt: Print CPU 0 (current core) backtrace
Backtrace:0x42008262:0x3FC910C00x4037673A:0x3FC910E0 0x4200305A:0x3FCF4250 0x42003EB2:0x3FCF4280 0x42002A95:0x3FCF42A0 0x4200262F:0x3FCF42C0 0x4200A5E5:0x3FCF42E0 0x4201203E:0x3FCF4300 0x420120C6:0x3FCF4320 0x4200A015:0x3FCF4350 0x42015599:0x3FCF4380 0x42010DDF:0x3FCF43A0 0x4200A1B1:0x3FCF46B0 0x42005525:0x3FCF4700 0x42018315:0x3FCF4730 0x4037B759:0x3FCF4750
0x42008262: task_wdt_isr at C:/Espressif/frameworks/esp-idf-v4.4/components/esp_system/task_wdt.c:183 (discriminator 3)
0x4037673a: _xt_lowint1 at C:/Espressif/frameworks/esp-idf-v4.4/components/freertos/port/xtensa/xtensa_vectors.S:1111
0x4200305a: uart_ll_get_txfifo_len at c:\espressif\frameworks\esp-idf-v4.4\projects\esi2022\build/../../../components/hal/esp32s3/include/hal/uart_ll.h:316 (discriminator 1)
(inlined by) uart_tx_char at C:/Espressif/frameworks/esp-idf-v4.4/components/vfs/vfs_uart.c:156 (discriminator 1)
0x42003eb2: uart_write at C:/Espressif/frameworks/esp-idf-v4.4/components/vfs/vfs_uart.c:209
0x42002a95: console_write at C:/Espressif/frameworks/esp-idf-v4.4/components/vfs/vfs_console.c:73
0x4200262f: esp_vfs_write at C:/Espressif/frameworks/esp-idf-v4.4/components/vfs/vfs.c:431 (discriminator 4)
0x4200a5e5: __swrite at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp32s3-elf/src/newlib/newlib/libc/stdio/stdio.c:94
0x4201203e: __sflush_r at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp32s3-elf/src/newlib/newlib/libc/stdio/fflush.c:224
0x420120c6: _fflush_r at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp32s3-elf/src/newlib/newlib/libc/stdio/fflush.c:278
0x4200a015: __sfvwrite_r at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp32s3-elf/src/newlib/newlib/libc/stdio/fvwrite.c:251
0x42015599: __sprint_r at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp32s3-elf/src/newlib/newlib/libc/stdio/vfprintf.c:433
(inlined by) __sprint_r at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp32s3-elf/src/newlib/newlib/libc/stdio/vfprintf.c:403
0x42010ddf: _vfprintf_r at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp32s3-elf/src/newlib/newlib/libc/stdio/vfprintf.c:1781 (discriminator 1)
0x4200a1b1: printf at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp32s3-elf/src/newlib/newlib/libc/stdio/printf.c:56 (discriminator 5)
0x42005525: app_main at c:\espressif\frameworks\esp-idf-v4.4\projects\esi2022\build/../main/hello_world_main.c:36 (discriminator 1)
0x42018315: main_task at C:/Espressif/frameworks/esp-idf-v4.4/components/freertos/port/port_common.c:129 (discriminator 2)
0x4037b759: vPortTaskWrapper at C:/Espressif/frameworks/esp-idf-v4.4/components/freertos/port/xtensa/port.c:131
E (50298) task_wdt: Print CPU 1 backtrace
Backtrace:0x40377C6D:0x3FC916C00x4037673A:0x3FC916E0 0x400559DD:0x3FCF56B0 |<-CORRUPTED
0x40377c6d: esp_crosscore_isr at C:/Espressif/frameworks/esp-idf-v4.4/components/esp_system/crosscore_int.c:92
0x4037673a: _xt_lowint1 at C:/Espressif/frameworks/esp-idf-v4.4/components/freertos/port/xtensa/xtensa_vectors.S:1111
66
2207, 766
2092, 775
2095, 775
2095, 767
2093, 767
2093, 774
2095, 767
And here is the code I am running:
/* This script creates a representative dataset of audio from the ESP32-S3
TO RUN: >> idf.py set-target esp32s3
>> idf.py -p COM3 -b 480600 flash monitor
MAKE SURE SPI FLASH SIZE IS 8MB:
>> idf.py menuconfig
>> Serial Flasher Config >> Flash Size (8 MB)
*/
#include <stdio.h>
#include <driver/adc.h>
#include "sdkconfig.h"
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_system.h"
#include "esp_spi_flash.h"
void app_main(void)
{
    printf("Hello world!\n");
    /* Configure desired precision and attenuation for ADC pins
       Right Channel: GPIO 4  ADC1 Channel 3
       Left Channel:  GPIO 11 ADC2 Channel 0 */
    adc1_config_width(ADC_WIDTH_BIT_DEFAULT);
    adc1_config_channel_atten(ADC1_CHANNEL_3, ADC_ATTEN_DB_0);
    adc2_config_channel_atten(ADC2_CHANNEL_0, ADC_ATTEN_DB_0);
    int val2 = 0;
    int* pval2 = &val2;
    while (true) {
        int val1 = adc1_get_raw(ADC1_CHANNEL_3);
        adc2_get_raw(ADC2_CHANNEL_0, ADC_WIDTH_BIT_DEFAULT, pval2);
        printf("%d, %d\n", val1, val2);
    }
}
If anyone has any idea why this might be happening I would greatly appreciate it.
Thanks!
By adding vTaskDelay() inside the while(true) loop, the issue was fixed. Dave M. had it right that the buffer was filling up too quickly; the delay limits the amount of data going into the buffer. (See comments below.)
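For reference, a minimal sketch of what the fixed loop can look like (an assumption based on the description above, not the exact code that was used; the 10 ms period is just an example value). vTaskDelay() blocks the main task so the IDLE task can run and reset the task watchdog, and it also throttles how fast samples are pushed into the UART buffer:
    while (true) {
        int val1 = adc1_get_raw(ADC1_CHANNEL_3);
        adc2_get_raw(ADC2_CHANNEL_0, ADC_WIDTH_BIT_DEFAULT, pval2);
        printf("%d, %d\n", val1, val2);
        vTaskDelay(pdMS_TO_TICKS(10)); /* yield so IDLE can feed the watchdog; period is an example */
    }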
I have this minimal example of Google Benchmark usage.
The weird thing is that "42" is printed a number of times (4), not just once.
I understand that the library has to run things several times to gather statistics, but I thought that this was handled by the state loop itself.
This is a minimal example of something more complicated where I wanted to print (outside the loop) the result to verify that different implementations of the same function would give the same result.
#include <benchmark/benchmark.h>
#include <iostream>
#include <chrono>  // std::chrono_literals
#include <thread>  // std::this_thread::sleep_for
int SomeFunction(){
using namespace std::chrono_literals;
std::this_thread::sleep_for(10ms);
return 42;
}
static void BM_SomeFunction(benchmark::State& state) {
// Perform setup here
int result = -1;
for (auto _ : state) {
// This code gets timed
result = SomeFunction();
benchmark::DoNotOptimize(result);
}
std::cout << result << std::endl;
}
// Register the function as a benchmark
BENCHMARK(BM_SomeFunction);
// Run the benchmark
BENCHMARK_MAIN();
Output (42 is printed 4 times; why more than once, and why 4?):
Running ./a.out
Run on (12 X 4600 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x6)
L1 Instruction 32 KiB (x6)
L2 Unified 256 KiB (x6)
L3 Unified 12288 KiB (x1)
Load Average: 0.30, 0.65, 0.79
42
42
42
42
----------------------------------------------------------
Benchmark Time CPU Iterations
----------------------------------------------------------
BM_SomeFunction 10243011 ns 11051 ns 1000
How else could I test (at least visually) that different benchmarking blocks give the same answer?
I was reading Chapter 8 of the "Modern C++ Programming Cookbook, 2nd edition" on concurrency and stumbled upon something that puzzles me.
The author implements different versions of parallel map and reduce functions using std::thread and std::async. The implementations are really close; for example, the heart of the parallel_map functions is
// parallel_map using std::async
...
tasks.emplace_back(std::async(
std::launch::async,
[=, &f] {std::transform(begin, last, begin, std::forward<F>(f)); }));
...
// parallel_map using std::thread
...
threads.emplace_back([=, &f] {std::transform(begin, last, begin, std::forward<F>(f)); });
...
The complete code can be found here for std::thread and there for std::async.
What puzzles me is that the computation times reported in the book give a significant and consistent advantage to the std::async implementation. Moreover, the author acknowledges this fact as obvious, without providing any hint of justification:
If we compare this [result with async] with the results from the parallel version using threads, we will find that these are faster execution times and that the speedup is significant, especially for the fold function.
I ran the code above on my computer, and even though the differences are not as compelling as in the book, I find that the std::async implementation is indeed faster than the std::thread one. (The author also later brings in standard implementations of these algorithms, which are even faster). On my computer, the code runs with four threads, which corresponds to the number of physical cores of my CPU.
Maybe I missed something, but why is it obvious that std::async should run faster than std::thread in this example? My intuition was that, std::async being a higher-level abstraction over threads, it should take at least the same amount of time as raw threads, if not more -- obviously I was wrong. Are those findings consistent, as suggested by the book, and what is the explanation?
My original interpretation was incorrect. Refer to OznOg's answer below.
Modified Answer:
I created a simple benchmark that uses std::async and std::thread to do some tiny tasks:
#include <thread>
#include <chrono>
#include <vector>
#include <future>
#include <optional>
#include <iostream>
__thread volatile int you_shall_not_optimize_this;
void work() {
// This is the simplest way I can think of to prevent the compiler and
// operating system from doing naughty things
you_shall_not_optimize_this = 42;
}
[[gnu::noinline]]
std::chrono::nanoseconds benchmark_threads(size_t count) {
std::vector<std::optional<std::thread>> threads;
threads.resize(count);
auto before = std::chrono::high_resolution_clock::now();
for (size_t i = 0; i < count; ++i)
threads[i] = std::thread { work };
for (size_t i = 0; i < count; ++i)
threads[i]->join();
threads.clear();
auto after = std::chrono::high_resolution_clock::now();
return after - before;
}
[[gnu::noinline]]
std::chrono::nanoseconds benchmark_async(size_t count, std::launch policy) {
std::vector<std::optional<std::future<void>>> results;
results.resize(count);
auto before = std::chrono::high_resolution_clock::now();
for (size_t i = 0; i < count; ++i)
results[i] = std::async(policy, work);
for (size_t i = 0; i < count; ++i)
results[i]->wait();
results.clear();
auto after = std::chrono::high_resolution_clock::now();
return after - before;
}
std::ostream& operator<<(std::ostream& stream, std::launch value)
{
if (value == std::launch::async)
return stream << "std::launch::async";
else if (value == std::launch::deferred)
return stream << "std::launch::deferred";
else
return stream << "std::launch::unknown";
}
// #define CONFIG_THREADS true
// #define CONFIG_ITERATIONS 10000
// #define CONFIG_POLICY std::launch::async
int main() {
std::cout << "Running benchmark:\n"
<< " threads? " << std::boolalpha << CONFIG_THREADS << '\n'
<< " iterations " << CONFIG_ITERATIONS << '\n'
<< " async policy " << CONFIG_POLICY << std::endl;
std::chrono::nanoseconds duration;
if (CONFIG_THREADS) {
duration = benchmark_threads(CONFIG_ITERATIONS);
} else {
duration = benchmark_async(CONFIG_ITERATIONS, CONFIG_POLICY);
}
std::cout << "Completed in " << duration.count() << "ns (" << std::chrono::duration_cast<std::chrono::milliseconds>(duration).count() << "ms)\n";
}
I've run the benchmark as follows:
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=false -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::deferred main.cpp -o main && ./main
Running benchmark:
threads? false
iterations 10000
async policy std::launch::deferred
Completed in 4783327ns (4ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=false -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::async main.cpp -o main && ./main
Running benchmark:
threads? false
iterations 10000
async policy std::launch::async
Completed in 301756775ns (301ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=true -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::deferred main.cpp -o main && ./main
Running benchmark:
threads? true
iterations 10000
async policy std::launch::deferred
Completed in 291284997ns (291ms)
$ g++ -Wall -Wextra -std=c++20 -pthread -O3 -DCONFIG_THREADS=true -DCONFIG_ITERATIONS=10000 -DCONFIG_POLICY=std::launch::async main.cpp -o main && ./main
Running benchmark:
threads? true
iterations 10000
async policy std::launch::async
Completed in 293539858ns (293ms)
I re-ran all the benchmarks with strace attached and accumulated the system calls made:
# std::async with std::launch::async
1 access
2 arch_prctl
36 brk
10000 clone
6 close
1 execve
1 exit_group
10002 futex
10028 mmap
10009 mprotect
9998 munmap
7 newfstatat
6 openat
7 pread64
1 prlimit64
5 read
2 rt_sigaction
20001 rt_sigprocmask
1 set_robust_list
1 set_tid_address
5 write
# std::async with std::launch::deferred
1 access
2 arch_prctl
11 brk
6 close
1 execve
1 exit_group
10002 futex
28 mmap
9 mprotect
2 munmap
7 newfstatat
6 openat
7 pread64
1 prlimit64
5 read
2 rt_sigaction
1 rt_sigprocmask
1 set_robust_list
1 set_tid_address
5 write
# std::thread with std::launch::async
1 access
2 arch_prctl
27 brk
10000 clone
6 close
1 execve
1 exit_group
2 futex
10028 mmap
10009 mprotect
9998 munmap
7 newfstatat
6 openat
7 pread64
1 prlimit64
5 read
2 rt_sigaction
20001 rt_sigprocmask
1 set_robust_list
1 set_tid_address
5 write
# std::thread with std::launch::deferred
1 access
2 arch_prctl
27 brk
10000 clone
6 close
1 execve
1 exit_group
2 futex
10028 mmap
10009 mprotect
9998 munmap
7 newfstatat
6 openat
7 pread64
1 prlimit64
5 read
2 rt_sigaction
20001 rt_sigprocmask
1 set_robust_list
1 set_tid_address
5 write
We observe that std::async is significantly faster with std::launch::deferred but that everything else doesn't seem to matter as much.
My conclusions are:
The current libstdc++ implementation does not take advantage of the fact that std::async doesn't need a new thread for each task.
The current libstdc++ implementation does some sort of locking in std::async that std::thread doesn't do.
std::async with std::launch::deferred saves the thread setup and teardown costs and is much faster for this case (see the short sketch below).
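To illustrate the last point, here is a minimal sketch (my own example, not part of the measurements above) showing that a std::launch::deferred task never gets its own OS thread: the callable runs lazily on whichever thread calls get()/wait(), whereas std::launch::async spawns a real thread. Compile with -pthread.
#include <future>
#include <iostream>
#include <thread>

int main() {
    auto worker_id = [] { return std::this_thread::get_id(); };

    auto deferred = std::async(std::launch::deferred, worker_id);
    auto eager    = std::async(std::launch::async,    worker_id);

    std::cout << "caller   " << std::this_thread::get_id() << '\n'
              << "deferred " << deferred.get() << '\n'  // same id as the caller: no new thread
              << "eager    " << eager.get()    << '\n'; // different id: an OS thread was created
}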
My machine is configured as follows:
$ uname -a
Linux linux-2 5.12.1-arch1-1 #1 SMP PREEMPT Sun, 02 May 2021 12:43:58 +0000 x86_64 GNU/Linux
$ g++ --version
g++ (GCC) 10.2.0
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ lscpu # truncated
Architecture: x86_64
Byte Order: Little Endian
CPU(s): 8
Model name: Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
Original Answer:
std::thread is a wrapper around thread objects provided by the operating system; these are extremely expensive to create and destroy.
std::async is similar, but there isn't a 1-to-1 mapping between tasks and operating system threads. This could be implemented with thread pools, where threads are reused for multiple tasks.
So std::async is better if you have many small tasks, and std::thread is better if you have a few tasks that are running for long periods of time.
Also if you have things that truly need to happen in parallel, then std::async might not fit very well. (std::thread also can't make such guarantees, but that's the closest you can get.)
Maybe to clarify, in your case std::async saves the overhead from creating and destroying threads.
(Depending on the operating system, you could also lose performance simply by having a lot of threads running. An operating system might have a scheduling strategy where it tries to guarantee that every thread gets executed every so often, so the scheduler could decide to give the individual threads smaller slices of processing time, creating more overhead for switching between threads.)
Looks like something is not happening as expected. I compiled the whole thing on my Fedora machine and the first results were surprising. I commented out all tests to keep only the two comparing threads and async. The output seems to confirm the behaviour:
Thread version result
size s map p map s fold p fold
10000 642 628 751 770
100000 6533 3444 7985 3338
500000 14885 5760 13854 6304
1000000 23428 11398 27795 12129
2000000 47136 22468 55518 24154
5000000 118690 55752 139489 60142
10000000 236496 112467 277413 121002
25000000 589277 276750 694742 297832
50000000 1.17839e+06 555318 1.39065e+06 594102
Async version:
size s map p1 map p2 map s fold p1 fold p2 fold
10000 248 232 231 273 282 273
100000 2323 1562 2827 2757 1536 1766
500000 12312 5615 12044 14014 6272 7431
1000000 23585 11701 24060 27851 12376 14109
2000000 47147 22796 48035 55433 25565 30095
5000000 118465 59980 119698 140775 62960 68382
10000000 241727 110883 239554 277958 121205 136041
Looks like the async version is actually 2 times faster (for small values) than the threaded one.
Then I used strace to count the number of clone system calls made (i.e., the number of threads created):
64 clone with threads
92 clone with async
So it looks like the explanation based on the time spent creating threads is contradicted, as the async version actually creates as many threads as the thread-based one (the difference comes from the fact that there are two versions of fold in the async code).
Then I tried swapping the order of execution of the two tests (putting async before thread), and here are the results:
size s map p1 map p2 map s fold p1 fold p2 fold
10000 653 694 624 718 748 718
100000 6731 3931 2978 8533 3116 1724
500000 12406 5839 14589 13895 8427 7072
1000000 23813 11578 24099 27853 13091 14108
2000000 47357 22402 48197 55469 24572 33543
5000000 117923 55869 120303 139061 61801 68281
10000000 234861 111055 239124 277153 121270 136953
size s map p map s fold p fold
10000 232 232 273 328
100000 6424 3271 8297 4487
500000 21329 5547 13913 6263
1000000 23654 11419 27827 12083
2000000 47230 22763 55653 24135
5000000 117448 56785 139286 61679
10000000 235394 111021 278177 119805
25000000 589329 279637 696392 301485
50000000 1.1824e+06 556443 1.38722e+06 606279
So now the "thread" version is 2 times faster than async for small values.
Looking at the clone calls did not show any differences:
92 clone
64 clone
I didn't have much time to go further, but at least on Linux, we can consider that there is no difference between the two versions (the async one could even be seen as less efficient, as it requires more threads).
We can see that the timing difference has nothing to do with the async/thread question.
Moreover, if we look at sizes that really need computation time, the difference is really small and not relevant: 55752us vs 56785us for 5'000'000, and the figures keep matching for bigger values.
This looks like the usual problem with microbenchmarks: we somehow measure the latencies of the system more than the computation time itself.
Note: the figures shown are without optimization (original code); adding -O3 obviously speeds up the computations, but the results show the same thing: there is no real difference in computation time for big values.
I am able to use feenableexcept() to change the FPE environment. However, two lines of code later, something changes my control_word. Has anybody seen this before, and do you know how to stop it?
fedisableexcept(FE_ALL_EXCEPT);
feenableexcept(FE_DIVBYZERO | FE_OVERFLOW | FE_INVALID); // FE_UNDERFLOW & FE_INEXACT are too common
fegetenv(&tmp_env); // tmp_env.__control_word = 882. As expected
int AA = 1;
fegetenv(&tmp_env); // tmp_env.__control_word = 895. No good
Could the problem be that multiple threads run in the program? Is there a way to set the default control_word bits so that they do not get overridden?
I have a simple C++ program that 1) loops over operations that raise FPEs, 2) catches the SIGFPE and prints some useful information, 3) clears the FPU status word using uc_mcontext.fpregs->mxcsr, and 4) computes the offending instruction length and then skips that instruction. This way the program can continue without exiting.
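For context, here is a minimal, self-contained sketch of the handler plumbing being described (my reconstruction, not the actual program; it assumes Linux x86-64 with glibc and g++). Instead of decoding the instruction length with insn_get_length() and skipping the faulting instruction, this simplified version escapes the faulting statement with siglongjmp, but the SA_SIGINFO handler and the uc_mcontext.fpregs->mxcsr handling follow the same idea:
#include <fenv.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <ucontext.h>

static sigjmp_buf g_jump;

static void fpe_handler(int sig, siginfo_t* info, void* ctx) {
    (void)sig;
    ucontext_t* uc = static_cast<ucontext_t*>(ctx);
    psiginfo(info, "Caught SIGFPE");          // print signal code and faulting address
    uc->uc_mcontext.fpregs->mxcsr &= ~0x3F;   // clear the sticky SSE exception flags (step 3)
    siglongjmp(g_jump, 1);                    // escape instead of skipping the instruction (simplification of step 4)
}

int main() {
    struct sigaction sa {};
    sa.sa_sigaction = fpe_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGFPE, &sa, nullptr);

    fedisableexcept(FE_ALL_EXCEPT);
    feenableexcept(FE_DIVBYZERO);             // only divide-by-zero should trap

    volatile double num = 1.0, den = 0.0;
    if (sigsetjmp(g_jump, 1) == 0)
        printf("1.0/0.0 = %g\n", num / den);  // traps; the handler jumps back here
    else
        printf("divide-by-zero was caught, continuing\n");
    return 0;
}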
For this last run I specified feenableexcept(FE_DIVBYZERO); It seems to work at the beginning but then something changes the flags and it starts capturing other FPEs. Below is the program output:
- 0: control_word = 891
The SignalCatcher program is running!
=============================================================
Test 01: (float) 1.0e-300 = 0
=============================================================
Test 02: (float) 1.0e300 = inf
=============================================================
Test 03: acos(1.01f) = nan
=============================================================
0: control_word = 895
1. insn_get_length() = 2
B: Boost stacktrace specific: main in /home/lcordova/bin/signalCatcher/main.cpp L89
F. psiginfo specific: Caught signal 8 with signal code Int_Div-by-zero(1). Exception at address 0x401658 .
Test 04: 1/0 = 1
=============================================================
0: control_word = 895
1. insn_get_length() = 5
B: Boost stacktrace specific: main in /home/lcordova/bin/signalCatcher/main.cpp L92
F. psiginfo specific: Caught signal 8 with signal code FP_Div-by-zero(3). Exception at address 0x4016ad .
Test 05: 1.0f/0.0f = 1
=============================================================
0: control_word = 895
1. insn_get_length() = 4
B: Boost stacktrace specific: in L0
F. psiginfo specific: Caught signal 8 with signal code FP_Invalid(7). Exception at address 0x2b44f640 .
Test 06: sqrt(-1) = 0
=============================================================
0: control_word = 895
1. insn_get_length() = 4
B: Boost stacktrace specific: main in /home/lcordova/bin/signalCatcher/main.cpp L77
F. psiginfo specific: Caught signal 8 with signal code FP_Underflow(5). Exception at address 0x4014f1 .
Test 01: (float) 1.0e-300 = -124.475
=============================================================
0: control_word = 895
1. insn_get_length() = 4
B: Boost stacktrace specific: main in /home/lcordova/bin/signalCatcher/main.cpp L81
F. psiginfo specific: Caught signal 8 with signal code FP_Overflow(4). Exception at address 0x40155d .
Test 02: (float) 1.0e300 = -3.86568e-34
=============================================================
0: control_word = 895
1. insn_get_length() = 4
B: Boost stacktrace specific: acos in L0
F. psiginfo specific: Caught signal 8 with signal code FP_Invalid(7). Exception at address 0x2b46d9fa .
Test 03: acos(1.01f) = nan
I have a problem with the heap filling up during long newman runs.
I run the collection using the command line below:
newman run C:\temp\004_L10_Orchestrator_L10_UAT_V4.0.postman_collection.json -g C:\temp\Euro_globals.json -e C:\temp\UAT-L10-WagerOperations.postman_environment.json --folder "5104 - Tzoker (Channel 2)" -k -n 200
After about 100 iterations we get the error below:
<--- Last few GCs --->
[10028:000001BE90168DD0] 355619 ms: Scavenge 2042.2 (2049.5) -> 2041.1 (2049.5) MB, 4.4 / 0.0 ms (average mu = 0.234, current mu = 0.129) allocation failure
[10028:000001BE90168DD0] 355636 ms: Scavenge 2042.7 (2049.7) -> 2041.9 (2050.7) MB, 5.8 / 0.0 ms (average mu = 0.234, current mu = 0.129) allocation failure
[10028:000001BE90168DD0] 355660 ms: Scavenge 2044.7 (2052.0) -> 2043.9 (2056.1) MB, 5.7 / 0.0 ms (average mu = 0.234, current mu = 0.129) allocation failure
<--- JS stacktrace --->
==== JS stack trace =========================================
0: ExitFrame [pc: 00007FF788EA154D]
1: StubFrame [pc: 00007FF788F11F49]
Security context: 0x00d1dc3c0921
2: baseClone [0000008D72BE2A39] [C:\Users\Agent\AppData\Roaming\npm\node_modules\newman\node_modules\lodash\lodash.js:1] [bytecode=000001855E0FE891 offset=0](this=0x0098e94c2531 ,0x011436a82d41 <String[#7]: integer>,5,0x0110437ab189 <JSFunction (sfi = 00000150CA97CAB1)>,0x011436a83259 <String[#4]…
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
Writing Node.js report to file: report.20200122.122008.10028.0.001.json
Node.js report completed
1: 00007FF788290BEF napi_wrap+126287
2: 00007FF78822DF26 public: bool __cdecl v8::base::CPU::has_sse(void)const __ptr64+34950
3: 00007FF78822EBF6 public: bool __cdecl v8::base::CPU::has_sse(void)const __ptr64+38230
4: 00007FF788A57FEE private: void __cdecl v8::Isolate::ReportExternalAllocationLimitReached(void) __ptr64+9
I have increased the memory available to the heap, but the only effect was that we got from 100 to 150 iterations.
I was trying to understand the linux perf, and found some really confusing behavior:
I wrote a simple multi-threading example with one thread pinned to each core; each thread runs a computation locally and does not communicate with the others (see test.cc below). I expected this example to have very few, if not zero, context switches. However, using linux perf to profile the example shows thousands of context switches - much more than I expected. I further profiled the linux command sleep 20 for comparison, which shows far fewer context switches.
This profile result does not make any sense to me. What is causing so many context switches?
> sudo perf stat -e sched:sched_switch ./test
Performance counter stats for './test':
6,725 sched:sched_switch
20.835 seconds time elapsed
> sudo perf stat -e sched:sched_switch sleep 20
Performance counter stats for 'sleep 20':
1 sched:sched_switch
20.001 seconds time elapsed
To reproduce the results, please run the following commands:
perf stat -e context-switches sleep 20
perf stat -e context-switches ./test
To compile the source code, please use the following command:
g++ -std=c++11 -pthread -o test test.cc
// test.cc
#include <iostream>
#include <thread>
#include <vector>
#include <pthread.h>  // pthread_setaffinity_np, cpu_set_t
int main(int argc, const char** argv) {
unsigned num_cpus = std::thread::hardware_concurrency();
std::cout << "Launching " << num_cpus << " threads\n";
std::vector<std::thread> threads(num_cpus);
for (unsigned i = 0; i < num_cpus; ++i) {
threads[i] = std::thread([i] {
int j = 0;
while (j++ < 100) {
int tmp = 0;
while (tmp++ < 110000000) { }
}
});
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(i, &cpuset);
int rc = pthread_setaffinity_np(threads[i].native_handle(),
sizeof(cpu_set_t), &cpuset);
if (rc != 0) {
std::cerr << "Error calling pthread_setaffinity_np: " << rc << "\n";
}
}
for (auto& t : threads) {
t.join();
}
return 0;
}
You can use sudo perf sched record -- ./test to determine which processes are being scheduled to run in place of the one of the threads of your application. When I execute this command on my system, I get:
sudo perf sched record -- ./test
Launching 4 threads
[ perf record: Woken up 10 times to write data ]
[ perf record: Captured and wrote 23.886 MB perf.data (212100 samples) ]
Note that I have four cores and the name of the executable is test. perf sched has captured all the sched:sched_switch events and dumped the data into a file called perf.data by default. The size of the file is about 23 MB and it contains about 212100 events. The profiling period spans from the time perf starts until test terminates.
You can use sudo perf sched map to print all the recorded events in a nice format that looks like this:
*. 448826.757400 secs . => swapper:0
*A0 448826.757461 secs A0 => perf:15875
*. A0 448826.757477 secs
*. . A0 448826.757548 secs
. . *B0 448826.757601 secs B0 => migration/3:22
. . *. 448826.757608 secs
*A0 . . 448826.757625 secs
A0 *C0 . 448826.757775 secs C0 => rcu_sched:7
A0 *. . 448826.757777 secs
*D0 . . 448826.757803 secs D0 => ksoftirqd/0:3
*A0 . . 448826.757807 secs
A0 *E0 . . 448826.757862 secs E0 => kworker/1:3:13786
A0 *F0 . . 448826.757870 secs F0 => kworker/1:0:5886
A0 *G0 . . 448826.757874 secs G0 => hud-service:1609
A0 *. . . 448826.758614 secs
A0 *H0 . . 448826.758714 secs H0 => kworker/u8:2:15585
A0 *. . . 448826.758721 secs
A0 . *I0 . 448826.758740 secs I0 => gnome-terminal-:8878
A0 . I0 *J0 448826.758744 secs J0 => test:15876
A0 . I0 *B0 448826.758749 secs
The two-letter names A0, B0, C0, E0, and so on, are short names given by perf to every thread running on the system. The first four columns show which thread was running on each of the four cores. For example, in the second-to-last row, you can see the first thread that got created in your for loop. The name assigned to this thread is J0. The thread is running on the fourth core. The asterisk indicates that it has just been context-switched to from some other thread. Without an asterisk, it means that the same thread has continued to run on the same core for another time slice. A dot represents an idle core. To determine the names of all four threads, run the following command:
sudo perf sched map | grep 'test'
On my system, this prints:
A0 . I0 *J0 448826.758744 secs J0 => test:15876
J0 A0 *K0 . 448826.758868 secs K0 => test:15878
J0 *L0 K0 . 448826.758889 secs L0 => test:15877
J0 L0 K0 *M0 448826.758894 secs M0 => test:15879
Now that you know the two-letter names assigned to your threads (and all other threads), you can determine which other threads are causing your threads to be context-switched. For example, if you see this:
*G1 L0 K0 M0 448826.822555 secs G1 => firefox:2384
then you'd know that three of your app threads are running, but one of the cores is being used to run Firefox. So the fourth thread has to wait until the scheduler decides to schedule it again.
If you want all the scheduler slots where at least one of your threads is running, you can use the following commands:
sudo perf sched map > mydata
grep -E 'J0|K0|L0|M0' mydata > mydata2
wc -l mydata
wc -l mydata2
The last two commands tell you how many rows (time slices) had at least one thread of your app running. You can compare that to the total number of time slices. Since there are four cores, the total number of scheduler slots is 4 * (number of time slices). From there you can do all sorts of manual calculations and figure out exactly what happened.
We can't tell you exactly what is being scheduled - but you can find out yourself using perf.
perf record -e sched:sched_switch ./test
Note this requires a mounted debugfs and root permissions. Now a perf report will give you an overview of what the scheduler was switching to (or see perf script for a full listing). There is no apparent thing in your code that would cause a context switch (e.g. sleep, waiting for I/O), so it is most likely another task being scheduled onto these cores.
The reason why sleep has almost no context switches is simple. It goes to sleep almost immediately - which is one context switch. While the task is not active, it cannot be displaced by another task.