I have a problem with overfilling Heap memory, using long newman runs.
I run collection using the below command line :
newman run C:\temp\004_L10_Orchestrator_L10_UAT_V4.0.postman_collection.json -g C:\temp\Euro_globals.json -e C:\temp\UAT-L10-WagerOperations.postman_environment.json --folder “5104 - Tzoker (Channel 2)” -k -n 200
after 100 iteration we take the below error :
<— Last few GCs —>
[10028:000001BE90168DD0] 355619 ms: Scavenge 2042.2 (2049.5) -> 2041.1 (2049.5) MB, 4.4 / 0.0 ms (average mu = 0.234, current mu = 0.129) allocation failure
[10028:000001BE90168DD0] 355636 ms: Scavenge 2042.7 (2049.7) -> 2041.9 (2050.7) MB, 5.8 / 0.0 ms (average mu = 0.234, current mu = 0.129) allocation failure
[10028:000001BE90168DD0] 355660 ms: Scavenge 2044.7 (2052.0) -> 2043.9 (2056.1) MB, 5.7 / 0.0 ms (average mu = 0.234, current mu = 0.129) allocation failure
<— JS stacktrace —>
==== JS stack trace =========================================
0: ExitFrame [pc: 00007FF788EA154D]
1: StubFrame [pc: 00007FF788F11F49]
Security context: 0x00d1dc3c0921
2: baseClone [0000008D72BE2A39] [C:\Users\Agent\AppData\Roaming\npm\node_modules\newman\node_modules\lodash\lodash.js:1] [bytecode=000001855E0FE891 offset=0](this=0x0098e94c2531 ,0x011436a82d41 <String[#7]: integer>,5,0x0110437ab189 <JSFunction (sfi = 00000150CA97CAB1)>,0x011436a83259 <String[#4]…
FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
Writing Node.js report to file: report.20200122.122008.10028.0.001.json
Node.js report completed
1: 00007FF788290BEF napi_wrap+126287
2: 00007FF78822DF26 public: bool __cdecl v8::base::CPU::has_sse(void)const __ptr64+34950
3: 00007FF78822EBF6 public: bool __cdecl v8::base::CPU::has_sse(void)const __ptr64+38230
4: 00007FF788A57FEE private: void __cdecl v8::Isolate::ReportExternalAllocationLimitReached(void) __ptr64+9
i have increased the memoery availble for the Heap memory but the effect was that from 100 we pass to 150 iteration .
Related
For context this question is related to the blog post on Cache Processor Effects, specifically Example 1-2.
In the code snippet below, I'm increasing the step size by 2 each time i.e. the number of operations I'm performing is decreasing by a factor of 2 each time. From the blog post, I am expecting to see for step size 1-16, the average time to complete the loop remains roughly the same. The main intuitions discussed by the author were 1) Majority of the time is contributed by memory access (i.e. we fetch then multiply) rather than arithmetic operations, 2) Each time the cpu fetches a cacheline (i.e. 64 bytes or 16 int).
I've tried to replicate the experiment on my local machine with the following code. Note that I've allocated a new int array for every step size so that they do not take advantage of previous cached data. For a similar reason, I've also only "repeated" the inner for loop for each step size only once (instead of repeating the experiment multiple times).
constexpr long long size = 64 * 1024 * 1024; // 64 MB
for (int step = 1; step <= 1<<15 ; step <<= 1) {
auto* arr = new int[size];
auto start = std::chrono::high_resolution_clock::now();
for (size_t i = 0; i < size; i += step) {
arr[i] *= 3;
}
auto finish = std::chrono::high_resolution_clock::now();
auto microseconds = std::chrono::duration_cast<std::chrono::milliseconds>(finish-start);
std::cout << step << " : " << microseconds.count() << "ms\n";
// delete[] arr; (updated - see Paul's comment)
}
Result, however, was very different from what was described in the blog post.
Without optimization:
clang++ -g -std=c++2a -Wpedantic -Wall -Wextra -o a cpu-cache1.cpp
1 : 222ms
2 : 176ms
4 : 152ms
8 : 140ms
16 : 135ms
32 : 128ms
64 : 130ms
128 : 125ms
256 : 123ms
512 : 118ms
1024 : 121ms
2048 : 62ms
4096 : 32ms
8192 : 16ms
16384 : 8ms
32768 : 4ms
With -O3 optimization
clang++ -g -std=c++2a -Wpedantic -Wall -Wextra -o a cpu-cache1.cpp -O3
1 : 155ms
2 : 145ms
4 : 134ms
8 : 131ms
16 : 125ms
32 : 130ms
64 : 130ms
128 : 121ms
256 : 123ms
512 : 127ms
1024 : 123ms
2048 : 62ms
4096 : 31ms
8192 : 15ms
16384 : 8ms
32768 : 4ms
Note that I'm running with Macbook Pro 2019 and my pagesize is 4096. From the observation above, it seems like until 1024 step size the time taken roughly remain linear. Since each int is 4 bytes, this seem related to the size of a page (i.e. 1024*4 = 4096) which makes me think this might be some kind of prefetching/page related optimization even when there is no optimization specified?
Does someone have any ideas or explanation on why these numbers are occurring?
In your code, you called new int[size] which is essentially a wrapper around malloc. The kernel does not immediately allocate a physical page/memory to it due to Linux's optimistic memory allocation strategy (see man malloc).
What happens when you call arr[i] *= 3 is that a page fault will occur if the page is not in the Translation Lookup Buffer (TLB). Kernel will check that your requested virtual page is valid, but an associated physical page is not allocated yet. The kernel will assign your requested virtual page with a physical page.
For step = 1024, you are requesting every page associated to arr. For step = 2048, you are requesting every other page associated to arr.
This act of assigning a physical page is your bottleneck (based on your data it takes ~120ms to assign 64mb of pages one by one). When you increase step from 1024 to 2048, now the kernel doesn't need to allocate physical pages for every virtual page associated to arr hence the runtime halves.
As linked by #Daniel Langr, you'll need to "touch" each element of arr or zero-initialize the arr with new int[size]{}. What this does is to force the kernel to assign physical pages to each virtual page of arr.
I am running a simple program on my ESP32 S3 DevKit C which just reads the data read from the ADC channels and prints them to the monitor. It all runs fine, but then after a few seconds, the readout is interrupted by this error, and then this just keeps repeating itself. (What is interesting is that the error occurs in the middle of printing out the integer 766 and splits up 7 and 66).
Here is the monitor readout:
2094, 767
2095, 767
2093, 767
2099, 7E (50298) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
E (50298) task_wdt: - IDLE (CPU 0)
E (50298) task_wdt: Tasks currently running:
E (50298) task_wdt: CPU 0: main
E (50298) task_wdt: CPU 1: IDLE
E (50298) task_wdt: Print CPU 0 (current core) backtrace
Backtrace:0x42008262:0x3FC910C00x4037673A:0x3FC910E0 0x4200305A:0x3FCF4250 0x42003EB2:0x3FCF4280 0x42002A95:0x3FCF42A0 0x4200262F:0x3FCF42C0 0x4200A5E5:0x3FCF42E0 0x4201203E:0x3FCF4300 0x420120C6:0x3FCF4320 0x4200A015:0x3FCF4350 0x42015599:0x3FCF4380 0x42010DDF:0x3FCF43A0 0x4200A1B1:0x3FCF46B0 0x42005525:0x3FCF4700 0x42018315:0x3FCF4730 0x4037B759:0x3FCF4750
0x42008262: task_wdt_isr at C:/Espressif/frameworks/esp-idf-v4.4/components/esp_system/task_wdt.c:183 (discriminator 3)
0x4037673a: _xt_lowint1 at C:/Espressif/frameworks/esp-idf-v4.4/components/freertos/port/xtensa/xtensa_vectors.S:1111
0x4200305a: uart_ll_get_txfifo_len at c:\espressif\frameworks\esp-idf-v4.4\projects\esi2022\build/../../../components/hal/esp32s3/include/hal/uart_ll.h:316 (discriminator 1)
(inlined by) uart_tx_char at C:/Espressif/frameworks/esp-idf-v4.4/components/vfs/vfs_uart.c:156 (discriminator 1)
0x42003eb2: uart_write at C:/Espressif/frameworks/esp-idf-v4.4/components/vfs/vfs_uart.c:209
0x42002a95: console_write at C:/Espressif/frameworks/esp-idf-v4.4/components/vfs/vfs_console.c:73
0x4200262f: esp_vfs_write at C:/Espressif/frameworks/esp-idf-v4.4/components/vfs/vfs.c:431 (discriminator 4)
0x4200a5e5: __swrite at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp32s3-elf/src/newlib/newlib/libc/stdio/stdio.c:94
0x4201203e: __sflush_r at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp32s3-elf/src/newlib/newlib/libc/stdio/fflush.c:224
0x420120c6: _fflush_r at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp32s3-elf/src/newlib/newlib/libc/stdio/fflush.c:278
0x4200a015: __sfvwrite_r at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp32s3-elf/src/newlib/newlib/libc/stdio/fvwrite.c:251
0x42015599: __sprint_r at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp32s3-elf/src/newlib/newlib/libc/stdio/vfprintf.c:433
(inlined by) __sprint_r at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp32s3-elf/src/newlib/newlib/libc/stdio/vfprintf.c:403
0x42010ddf: _vfprintf_r at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp32s3-elf/src/newlib/newlib/libc/stdio/vfprintf.c:1781 (discriminator 1)
0x4200a1b1: printf at /builds/idf/crosstool-NG/.build/HOST-x86_64-w64-mingw32/xtensa-esp32s3-elf/src/newlib/newlib/libc/stdio/printf.c:56 (discriminator 5)
0x42005525: app_main at c:\espressif\frameworks\esp-idf-v4.4\projects\esi2022\build/../main/hello_world_main.c:36 (discriminator 1)
0x42018315: main_task at C:/Espressif/frameworks/esp-idf-v4.4/components/freertos/port/port_common.c:129 (discriminator 2)
0x4037b759: vPortTaskWrapper at C:/Espressif/frameworks/esp-idf-v4.4/components/freertos/port/xtensa/port.c:131
E (50298) task_wdt: Print CPU 1 backtrace
Backtrace:0x40377C6D:0x3FC916C00x4037673A:0x3FC916E0 0x400559DD:0x3FCF56B0 |<-CORRUPTED
0x40377c6d: esp_crosscore_isr at C:/Espressif/frameworks/esp-idf-v4.4/components/esp_system/crosscore_int.c:92
0x4037673a: _xt_lowint1 at C:/Espressif/frameworks/esp-idf-v4.4/components/freertos/port/xtensa/xtensa_vectors.S:1111
66
2207, 766
2092, 775
2095, 775
2095, 767
2093, 767
2093, 774
2095, 767
And here is the code I am running:
/* This script creates a representative dataset of audio from the ESP32-S3
TO RUN: >> idf.py set-target esp32s3
>> idf.py -p COM3 -b 480600 flash monitor
MAKE SURE SPI FLASH SIZE IS 8MB:
>> idf.py menuconfig
>> Serial Flasher Config >> Flash Size (8 MB)
*/
#include <stdio.h>
#include <driver/adc.h>
#include "sdkconfig.h"
#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "esp_system.h"
#include "esp_spi_flash.h"
void app_main(void)
{
printf("Hello world!\n");
/* Configure desired precision and attenuation for ADC pins
Right Channel: GPIO 4 ADC1 Channel 3
Left Channel: GPIO 11 ADC2 Channel 0 */
adc1_config_width(ADC_WIDTH_BIT_DEFAULT);
adc1_config_channel_atten(ADC1_CHANNEL_3,ADC_ATTEN_DB_0);
adc2_config_channel_atten(ADC2_CHANNEL_0,ADC_ATTEN_DB_0);
int val2 = 0;
int* pval2 = &val2;
while(true){
int val1 = adc1_get_raw(ADC1_CHANNEL_3);
adc2_get_raw(ADC2_CHANNEL_0,ADC_WIDTH_BIT_DEFAULT,pval2);
printf("%d, %d\n",val1,val2);
}
If anyone has any idea why this might be happening I would greatly appreciate it.
Thanks!
By integrating vTaskDelay(); into the while(true) loop, the issue was fixed. Dave M. Had it correct that the buffer was getting filled up too quickly. This limits the amount of data going into the buffer. (See comments below)
I was trying to understand the linux perf, and found some really confusing behavior:
I wrote a simple multi-threading example with one thread pinned to each core; each thread runs computation locally and does not communicate with each other (see test.cc below). I was thinking that this example should have really low, if not zero, context switches. However, using linux perf to profile the example shows thousands of context-switches - much more than what I expected. I further profiled the linux command sleep 20 for a comparison, showing much fewer context switches.
This profile result does not make any sense to me. What is causing so many context switches?
> sudo perf stat -e sched:sched_switch ./test
Performance counter stats for './test':
6,725 sched:sched_switch
20.835 seconds time elapsed
> sudo perf stat -e sched:sched_switch sleep 20
Performance counter stats for 'sleep 20':
1 sched:sched_switch
20.001 seconds time elapsed
For reproducing the results, please run the following code:
perf stat -e context-switches sleep 20
perf stat -e context-switches ./test
To compile the source code, please type the following code:
g++ -std=c++11 -pthread -o test test.cc
// test.cc
#include <iostream>
#include <thread>
#include <vector>
int main(int argc, const char** argv) {
unsigned num_cpus = std::thread::hardware_concurrency();
std::cout << "Launching " << num_cpus << " threads\n";
std::vector<std::thread> threads(num_cpus);
for (unsigned i = 0; i < num_cpus; ++i) {
threads[i] = std::thread([i] {
int j = 0;
while (j++ < 100) {
int tmp = 0;
while (tmp++ < 110000000) { }
}
});
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(i, &cpuset);
int rc = pthread_setaffinity_np(threads[i].native_handle(),
sizeof(cpu_set_t), &cpuset);
if (rc != 0) {
std::cerr << "Error calling pthread_setaffinity_np: " << rc << "\n";
}
}
for (auto& t : threads) {
t.join();
}
return 0;
}
You can use sudo perf sched record -- ./test to determine which processes are being scheduled to run in place of the one of the threads of your application. When I execute this command on my system, I get:
sudo perf sched record -- ./test
Launching 4 threads
[ perf record: Woken up 10 times to write data ]
[ perf record: Captured and wrote 23.886 MB perf.data (212100 samples) ]
Note that I have four cores and the name of the executable is test. perf sched has captured all the sched:sched_switch events and dumped the data into a file called perf.data by default. The size of the file is about 23 MBs and contains abut 212100 events. The duration of profiling will be from the time perf starts until test terminates.
You can use sudo perf sched map to print all the recorded events in a nice format that looks like this:
*. 448826.757400 secs . => swapper:0
*A0 448826.757461 secs A0 => perf:15875
*. A0 448826.757477 secs
*. . A0 448826.757548 secs
. . *B0 448826.757601 secs B0 => migration/3:22
. . *. 448826.757608 secs
*A0 . . 448826.757625 secs
A0 *C0 . 448826.757775 secs C0 => rcu_sched:7
A0 *. . 448826.757777 secs
*D0 . . 448826.757803 secs D0 => ksoftirqd/0:3
*A0 . . 448826.757807 secs
A0 *E0 . . 448826.757862 secs E0 => kworker/1:3:13786
A0 *F0 . . 448826.757870 secs F0 => kworker/1:0:5886
A0 *G0 . . 448826.757874 secs G0 => hud-service:1609
A0 *. . . 448826.758614 secs
A0 *H0 . . 448826.758714 secs H0 => kworker/u8:2:15585
A0 *. . . 448826.758721 secs
A0 . *I0 . 448826.758740 secs I0 => gnome-terminal-:8878
A0 . I0 *J0 448826.758744 secs J0 => test:15876
A0 . I0 *B0 448826.758749 secs
The two-letter names A0, B0, C0, E0, and so on, are short names given by perf to every thread running on the system. The first four columns shows which thread was running on each of the four cores. For example, in the second-to-last row, you can see that the first thread that got created in your for loop. The name assigned to this thread is J0. The thread is running on the fourth core. The asterisk indicates that it has just been context-switched to from some other thread. Without an asterisk, it means that the same thread has continued to run on the same core for another time slice. A dot represents an idle core. To determine the names for all of the four threads, run the following command:
sudo perf sched map | grep 'test'
On my system, this prints:
A0 . I0 *J0 448826.758744 secs J0 => test:15876
J0 A0 *K0 . 448826.758868 secs K0 => test:15878
J0 *L0 K0 . 448826.758889 secs L0 => test:15877
J0 L0 K0 *M0 448826.758894 secs M0 => test:15879
Now that you know the two-letter names assigned to your threads (and all other threads). you can determine which other threads are causing your threads to be context-switched. For example, if you see this:
*G1 L0 K0 M0 448826.822555 secs G1 => firefox:2384
then you'd know that three of your app threads are running, but the one of the cores is being used to run Firefox. So the fourth thread needs to wait until the scheduler decides when to schedule again.
If you want all the scheduler slots where at least one of your threads is occupying, then you can use the following command:
sudo perf sched map > mydata
grep -E 'J0|K0|L0|M0' mydata > mydata2
wc -l mydata
wc -l mydata2
The last two commands tell you how many rows (time slices) where at least one thread of your app was running. You can compare that to the total number of time slices. Since there are four cores, the total number of scheduler slots is 4 * (number of time slices). Then you can do all sorts of manual calculations and figure out exactly what's happened.
We can't tell you exactly what is being scheduled - but you can find out yourself using perf.
perf record -e sched:sched_switch ./test
Note this requires a mounted debugfs and root permissions. Now a perf report will give you an overview of what the scheduler was switching to (or see perf script for a full listing). Now there is no apparent thing in your code that would cause a context switch (e.g. sleep, waiting for I/O), so it is most likely another task that is being scheduled on these cores.
The reason why sleep has almost no context switches is simple. It goes to sleep almost immediately - which is one context switch. While the task is not active, it cannot be displaced by another task.
According to the man pages,
Progress labels are used to define correctness claims. A progress label states the requirement that the labeled global state must be visited infinitely often in any infinite system execution. Any violation of this requirement can be reported by the verifier as a non-progress cycle.
and
Spin has a special mode to prove absence of non-progress cycles. It does so with the predefined LTL formula:
(<>[] np_)
which formalizes non-progress as a standard Buchi acceptance property.
But let's take a look at the very primitive promela specification
bool initialised = 0;
init{
progress:
initialised++;
assert(initialised == 1);
}
In my understanding, the assert should hold but verification fail because initialised++ is executed exactly once whereas the progress label claims it should be possible to execute it arbitrarily often.
However, even with the above LTL formula, this verifies just fine in ispin (see below).
How do I correctly test whether a statement can be executed arbitrarily often (e.g. for a locking scheme)?
(Spin Version 6.4.7 -- 19 August 2017)
+ Partial Order Reduction
Full statespace search for:
never claim + (:np_:)
assertion violations + (if within scope of claim)
non-progress cycles + (fairness disabled)
invalid end states - (disabled by never claim)
State-vector 28 byte, depth reached 7, errors: 0
6 states, stored (8 visited)
3 states, matched
11 transitions (= visited+matched)
0 atomic steps
hash conflicts: 0 (resolved)
Stats on memory usage (in Megabytes):
0.000 equivalent memory usage for states (stored*(State-vector + overhead))
0.293 actual memory usage for states
64.000 memory used for hash table (-w24)
0.343 memory used for DFS stack (-m10000)
64.539 total actual memory usage
unreached in init
(0 of 3 states)
pan: elapsed time 0.001 seconds
No errors found -- did you verify all claims?
UPDATE
Still not sure how to use this ...
bool initialised = 0;
init{
initialised = 1;
}
active [2] proctype reader()
{
assert(_pid >= 1);
(initialised == 1)
do
:: else ->
progress_reader:
assert(true);
od
}
active [2] proctype writer()
{
assert(_pid >= 1);
(initialised == 1)
do
:: else ->
(initialised == 0)
progress_writer:
assert(true);
od
}
And let's select testing for non-progress cycles. Then ispin runs this as
spin -a test.pml
gcc -DMEMLIM=1024 -O2 -DXUSAFE -DNP -DNOCLAIM -w -o pan pan.c
./pan -m10000 -l
Which verifies without error.
So let's instead try this with ltl properties ...
/*pid: 0 = init, 1-2 = reader, 3-4 = writer*/
ltl progress_reader1{ []<> reader[1]#progress_reader }
ltl progress_reader2{ []<> reader[2]#progress_reader }
ltl progress_writer1{ []<> writer[3]#progress_writer }
ltl progress_writer2{ []<> writer[4]#progress_writer }
bool initialised = 0;
init{
initialised = 1;
}
active [2] proctype reader()
{
assert(_pid >= 1);
(initialised == 1)
do
:: else ->
progress_reader:
assert(true);
od
}
active [2] proctype writer()
{
assert(_pid >= 1);
(initialised == 1)
do
:: else ->
(initialised == 0)
progress_writer:
assert(true);
od
}
Now, first of all,
the model contains 4 never claims: progress_writer2, progress_writer1, progress_reader2, progress_reader1
only one claim is used in a verification run
choose which one with ./pan -a -N name (defaults to -N progress_reader1)
or use e.g.: spin -search -ltl progress_reader1 test.pml
Fine, I don't care, I just want this to finally run, so let's just keep progress_writer1 and worry about how to stitch it all together later:
/*pid: 0 = init, 1-2 = reader, 3-4 = writer*/
/*ltl progress_reader1{ []<> reader[1]#progress_reader }*/
/*ltl progress_reader2{ []<> reader[2]#progress_reader }*/
ltl progress_writer1{ []<> writer[3]#progress_writer }
/*ltl progress_writer2{ []<> writer[4]#progress_writer }*/
bool initialised = 0;
init{
initialised = 1;
}
active [2] proctype reader()
{
assert(_pid >= 1);
(initialised == 1)
do
:: else ->
progress_reader:
assert(true);
od
}
active [2] proctype writer()
{
assert(_pid >= 1);
(initialised == 1)
do
:: else ->
(initialised == 0)
progress_writer:
assert(true);
od
}
ispin runs this with
spin -a test.pml
ltl progress_writer1: [] (<> ((writer[3]#progress_writer)))
gcc -DMEMLIM=1024 -O2 -DXUSAFE -DSAFETY -DNOCLAIM -w -o pan pan.c
./pan -m10000
Which does not yield an error, but instead reports
unreached in claim progress_writer1
_spin_nvr.tmp:3, state 5, "(!((writer[3]._p==progress_writer)))"
_spin_nvr.tmp:3, state 5, "(1)"
_spin_nvr.tmp:8, state 10, "(!((writer[3]._p==progress_writer)))"
_spin_nvr.tmp:10, state 13, "-end-"
(3 of 13 states)
Yeah? Splendid! I have absolutely no idea what to do about this.
How do I get this to run?
The problem with your code example is that it does not have any infinite system execution.
Progress labels are used to define correctness claims. A progress
label states the requirement that the labeled global state must be
visited infinitely often in any infinite system execution. Any
violation of this requirement can be reported by the verifier as a
non-progress cycle.
Try this example instead:
short val = 0;
init {
do
:: val == 0 ->
val = 1;
// ...
val = 0;
:: else ->
progress:
// super-important progress state
printf("progress-state\n");
assert(val != 0);
od;
};
A normal check does not find any error:
~$ spin -search test.pml
(Spin Version 6.4.3 -- 16 December 2014)
+ Partial Order Reduction
Full statespace search for:
never claim - (none specified)
assertion violations +
cycle checks - (disabled by -DSAFETY)
invalid end states +
State-vector 12 byte, depth reached 2, errors: 0
3 states, stored
1 states, matched
4 transitions (= stored+matched)
0 atomic steps
hash conflicts: 0 (resolved)
Stats on memory usage (in Megabytes):
0.000 equivalent memory usage for states (stored*(State-vector + overhead))
0.292 actual memory usage for states
128.000 memory used for hash table (-w24)
0.534 memory used for DFS stack (-m10000)
128.730 total actual memory usage
unreached in init
test.pml:12, state 5, "printf('progress-state\n')"
test.pml:13, state 6, "assert((val!=0))"
test.pml:15, state 10, "-end-"
(3 of 10 states)
pan: elapsed time 0 seconds
whereas, checking for progress yields the error:
~$ spin -search -l test.pml
pan:1: non-progress cycle (at depth 2)
pan: wrote test.pml.trail
(Spin Version 6.4.3 -- 16 December 2014)
Warning: Search not completed
+ Partial Order Reduction
Full statespace search for:
never claim + (:np_:)
assertion violations + (if within scope of claim)
non-progress cycles + (fairness disabled)
invalid end states - (disabled by never claim)
State-vector 20 byte, depth reached 7, errors: 1
4 states, stored
0 states, matched
4 transitions (= stored+matched)
0 atomic steps
hash conflicts: 0 (resolved)
Stats on memory usage (in Megabytes):
0.000 equivalent memory usage for states (stored*(State-vector + overhead))
0.292 actual memory usage for states
128.000 memory used for hash table (-w24)
0.534 memory used for DFS stack (-m10000)
128.730 total actual memory usage
pan: elapsed time 0 seconds
WARNING: ensure to write -l after option -search, otherwise it is not handed over to the verifier.
You ask:
How do I correctly test whether a statement can be executed arbitrarily often (e.g. for a locking scheme)?
Simply write a liveness property:
ltl prop { [] <> proc[0]#label };
This checks that process with name proc and pid 0 executes infinitely often the statement corresponding to label.
Since your edit substantially changes the question, I write a new answer to avoid confusion. This answer addresses only the new content. Next time, consider creating a new, separate, question instead.
This is one of those cases in which paying attention to the unreached in ... warning message is really important, because it affects the outcome of the verification process.
The warning message:
unreached in claim progress_writer1
_spin_nvr.tmp:3, state 5, "(!((writer[3]._p==progress_writer)))"
_spin_nvr.tmp:3, state 5, "(1)"
_spin_nvr.tmp:8, state 10, "(!((writer[3]._p==progress_writer)))"
_spin_nvr.tmp:10, state 13, "-end-"
(3 of 13 states)
relates to the content of file _spin_nvr.tmp that is created during the compilation process:
...
never progress_writer1 { /* !([] (<> ((writer[3]#progress_writer)))) */
T0_init:
do
:: (! (((writer[3]#progress_writer)))) -> goto accept_S4 // state 5
:: (1) -> goto T0_init
od;
accept_S4:
do
:: (! (((writer[3]#progress_writer)))) -> goto accept_S4 // state 10
od;
} // state 13 '-end-'
...
Roughly speaking, you can view this as the specification of a Buchi Automaton which accepts executions of your writer process with _pid equal to 3 in which it does not reach the statement with label progress_writer infinitely often, i.e. it does so only a finite number of times.
To understand this you should know that, to verify an ltl property φ, spin builds an automaton containing all paths in the original Promela model that do not satisfy φ. This is done by computing the synchronous product of the automaton modeling the original system with the automaton representing the negation of the property φ you want to verify. In your example, the negation of φ is given by the excerpt of code above taken from _spin_nvr.tmp and labeled with never progress_writer1. Then, Spin checks if there is any possible execution of this automaton:
if there is, then property φ is violated and such execution trace is a witness (aka counter-example) of your property
otherwise, property φ is verified.
The warning tells you that in the resulting synchronous product none of those states is ever reachable. Why is this the case?
Consider this:
active [2] proctype writer()
{
1: assert(_pid >= 1);
2: (initialised == 1)
3: do
4: :: else ->
5: (initialised == 0);
6: progress_writer:
7: assert(true);
8: od
}
At line 2:, you check that initialised == 1. This statement forces writer to block at line 2: until when initialised is set to 1. Luckily, this is done by the init process.
At line 5:, you check that initialised == 0. This statement forces writer to block at line 5: until when initialised is set to 0. However, no process ever sets initialised to 0 anywhere in the code. Therefore, the line of code labeled with progress_writer: is effectively unreachable.
See the documentation:
(1) /* always executable */
(0) /* never executable */
skip /* always executable, same as (1) */
true /* always executable, same as skip */
false /* always blocks, same as (0) */
a == b /* executable only when a equals b */
A condition statement can only be executed (passed) if it holds. [...]
I want to write a program to get my cache size(L1, L2, L3). I know the general idea of it.
Allocate a big array
Access part of it of different size each time.
So I wrote a little program.
Here's my code:
#include <cstdio>
#include <time.h>
#include <sys/mman.h>
const int KB = 1024;
const int MB = 1024 * KB;
const int data_size = 32 * MB;
const int repeats = 64 * MB;
const int steps = 8 * MB;
const int times = 8;
long long clock_time() {
struct timespec tp;
clock_gettime(CLOCK_REALTIME, &tp);
return (long long)(tp.tv_nsec + (long long)tp.tv_sec * 1000000000ll);
}
int main() {
// allocate memory and lock
void* map = mmap(NULL, (size_t)data_size, PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
if (map == MAP_FAILED) {
return 0;
}
int* data = (int*)map;
// write all to avoid paging on demand
for (int i = 0;i< data_size / sizeof(int);i++) {
data[i]++;
}
int steps[] = { 1*KB, 4*KB, 8*KB, 16*KB, 24 * KB, 32*KB, 64*KB, 128*KB,
128*KB*2, 128*KB*3, 512*KB, 1 * MB, 2 * MB, 3 * MB, 4 * MB,
5 * MB, 6 * MB, 7 * MB, 8 * MB, 9 * MB};
for (int i = 0; i <= sizeof(steps) / sizeof(int) - 1; i++) {
double totalTime = 0;
for (int k = 0; k < times; k++) {
int size_mask = steps[i] / sizeof(int) - 1;
long long start = clock_time();
for (int j = 0; j < repeats; j++) {
++data[ (j * 16) & size_mask ];
}
long long end = clock_time();
totalTime += (end - start) / 1000000000.0;
}
printf("%d time: %lf\n", steps[i] / KB, totalTime);
}
munmap(map, (size_t)data_size);
return 0;
}
However, the result is so weird:
1 time: 1.989998
4 time: 1.992945
8 time: 1.997071
16 time: 1.993442
24 time: 1.994212
32 time: 2.002103
64 time: 1.959601
128 time: 1.957994
256 time: 1.975517
384 time: 1.975143
512 time: 2.209696
1024 time: 2.437783
2048 time: 7.006168
3072 time: 5.306975
4096 time: 5.943510
5120 time: 2.396078
6144 time: 4.404022
7168 time: 4.900366
8192 time: 8.998624
9216 time: 6.574195
My CPU is Intel(R) Core(TM) i3-2350M. L1 Cache: 32K (for data), L2 Cache 256K, L3 Cache 3072K.
Seems like it doesn't follow any rule. I can't get information of cache size or cache level from that.
Could anybody give some help? Thanks in advance.
Update:
Follow #Leeor advice, I use j*64 instead of j*16. New results:
1 time: 1.996282
4 time: 2.002579
8 time: 2.002240
16 time: 1.993198
24 time: 1.995733
32 time: 2.000463
64 time: 1.968637
128 time: 1.956138
256 time: 1.978266
384 time: 1.991912
512 time: 2.192371
1024 time: 2.262387
2048 time: 3.019435
3072 time: 2.359423
4096 time: 5.874426
5120 time: 2.324901
6144 time: 4.135550
7168 time: 3.851972
8192 time: 7.417762
9216 time: 2.272929
10240 time: 3.441985
11264 time: 3.094753
Two peaks, 4096K and 8192K. Still weird.
I'm not sure if this is the only problem here, but it's definitely the biggest one - your code would very quickly trigger the HW stream prefetchers, making you almost always hit in L1 or L2 latencies.
More details can be found here - http://software.intel.com/en-us/articles/optimizing-application-performance-on-intel-coret-microarchitecture-using-hardware-implemented-prefetchers
For your benchmark You should either disable them (through BIOS or any other means), or at least make your steps longer by replacing j*16 (* 4 bytes per int = 64B, one cache line - a classic unit stride for the stream detector), with j*64 (4 cache lines). The reason being - the prefetcher can issue 2 prefetches per stream request, so it runs ahead of your code when you do unit strides, may still get a bit ahead of you when your code is jumping over 2 lines, but become mostly useless with longer jumps (3 isn't good because of your modulu, you need a divider of step_size)
Update the questions with the new results and we can figure out if there's anything else here.
EDIT1:
Ok, I ran the fixed code and got -
1 time: 1.321001
4 time: 1.321998
8 time: 1.336288
16 time: 1.324994
24 time: 1.319742
32 time: 1.330685
64 time: 1.536644
128 time: 1.536933
256 time: 1.669329
384 time: 1.592145
512 time: 2.036315
1024 time: 2.214269
2048 time: 2.407584
3072 time: 2.259108
4096 time: 2.584872
5120 time: 2.203696
6144 time: 2.335194
7168 time: 2.322517
8192 time: 5.554941
9216 time: 2.230817
It makes much more sense if you ignore a few columns - you jump after the 32k (L1 size), but instead of jumping after 256k (L2 size), we get too good of a result for 384, and jump only at 512k. Last jump is at 8M (my LLC size), but 9k is broken again.
This allows us to spot the next error - ANDing with size mask only makes sense when it's a power of 2, otherwise you don't wrap around, but instead repeat some of the last addresses again (which ends up in optimistic results since it's fresh in the cache).
Try replacing the ... & size_mask with % steps[i]/sizeof(int), the modulu is more expensive but if you want to have these sizes you need it (or alternatively, a running index that gets zeroed whenever it exceeds the current size)
I think you'd be better off looking at the CPUID instruction. It's not trivial, but there should be information on the web.
Also, if you're on Windows, you can use GetLogicalProcessorInformation function. Mind you, it's only present in Windows XP SP3 and above. I know nothing about Linux/Unix.
If you're using GNU/Linux you can just read the content of the files /proc/cpuinfo and for further details /sys/devices/system/cpu/*. It is just common under UNIX not to define a API, where a plain file can do that job anyway.
I would also take a look at the source of util-linux, it contains a program named lscpu. This should be give you an example how to retrieve the required information.
// update
http://git.kernel.org/cgit/utils/util-linux/util-linux.git/tree/sys-utils/lscpu.c
If just taken a look at the source their. It basically reading from the file mentioned above, thats all. An therefore it is absolutely valid to read also from that files, they are provided by the kernel.