RAM consumption regarding cores - python-2.7

I am working on a Linux Kubuntu machine with 12 cores and 31 GB of available RAM.
I produce simulations which calculate functions over 4 dimensions (x, y, z, t).
I define my dimensions as arrays that I pass through numpy.meshgrid. So, for each point in time, I calculate the result at each point (x, y, z). It comes down to heavy calculations on heavy data.
First, I learned how to use it with only one core. It works well, whatever the size of my "boxes" (x, y, z). Because I work a lot with Fourier transforms, I define x, y, z, t as powers of 2: 64, 128, 256, ...
I can, without difficulty, go up to x = y = z = t = 512, even if it takes a lot of time to run (which makes sense). When I do that, I use around 20-30% of the available RAM of the computer. Great.
Then I wanted to use more cores, so I implemented this code:
import multiprocessing as mp
pool = mp.Pool(processes=8)
results = [pool.apply_async(conv_green, args=(tstep, S_, )) for tstep in t]
So here I ask my script to use 8 cores, and define my results as calls to the function conv_green with the args (tstep, S_) for every value in t.
It works pretty well and uses 8 cores as expected, BUT I can no longer run any simulation that uses values equal to or above 512 for x, y, z, t.
This is where my problem is. Technically, switching from the single-core setup to multi-core changed nothing in the routine of my calculations. I do not understand why I have enough RAM for 512... on a single core and why, suddenly, when I switch to multiple cores, the computer does not even want to launch it (the error occurs at the "results = pool.apply ..." line).
So if you know how this works and why I hit this "threshold", thanks for helping me sort it out!
Best regards.
PS: this is the error which pops up when it crashes with 512 in multi-core:
Traceback (most recent call last):
File "", line 1, in
File "/usr/lib/python2.7/dist
packages/spyderlib/widgets/externalshell/sitecustomize.py", line 540, in runfile
execfile(filename, namespace)
File "/home/alexis/Heat/Simu⁄Lecture Propre/Test Tkinter/Simulation N spots SCAN Tkinter.py", line 280, in
XYslice = array([p.get()[0] for p in results])
File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
raise self._value
SystemError: NULL result without error in PyObject_Call

For multiprocessing in any language, each worker needs private storage that it can write to without interference from the other workers. As soon as interference is possible, the data structure has to be locked, which (in the worst case) takes us back to single threading.
It would appear that your large data structure is being copied into each of the worker processes, effectively multiplying your memory usage by eight when you use eight processes ... or roughly 160-240% of your available RAM, given that a single run already uses 20-30% of it.
The best solution would be to prevent the unnecessary copying.
If that's not feasible, then all you can do is limit the number of processes it can run on; four should be OK in your case, but make sure your machine has lots of swap space. The swap space also gives you some room to let virtual memory exceed physical RAM; if the "working set" is small enough, you may be able to significantly exceed your physical RAM given enough swap.
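A minimal sketch of the "prevent the copying" approach on Linux, assuming the conv_green, S_ and t names from the question: the large arrays are built before the pool is created so the forked workers inherit them copy-on-write, and the hypothetical _worker wrapper ships only the small tstep value per task:
import multiprocessing as mp

# conv_green, S_ and t are assumed to be defined at module level *before*
# the pool is created, so the forked workers inherit them rather than
# receiving a pickled copy through apply_async.

def _worker(tstep):
    # Hypothetical wrapper: reads the inherited S_ instead of taking it as
    # an argument, so only tstep crosses the process boundary.
    return conv_green(tstep, S_)

if __name__ == '__main__':
    pool = mp.Pool(processes=4)  # fewer workers also caps peak RAM usage
    results = [pool.apply_async(_worker, args=(tstep,)) for tstep in t]
    XYslice = [p.get()[0] for p in results]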

Related

Interpreting PGI_ACC_TIME output

I have some OpenACC-accelerated C++ code that I've compiled using the PGI compiler. Things seem to be working, so now I want to play efficiency whack-a-mole with profiling information.
I generate some timing info by setting:
export PGI_ACC_TIME=1
And then running the program.
The following output results:
-bash-4.2$ ./a.out
libcupti.so not found
Accelerator Kernel Timing data
PGI_ACC_SYNCHRONOUS was set, disabling async() clauses
/home/myuser/myprogram.cpp
_MyProgram NVIDIA devicenum=1
time(us): 97,667
75: data region reached 2 times
75: data copyin transfers: 3
device time(us): total=101 max=82 min=9 avg=33
76: compute region reached 1000 times
76: kernel launched 1000 times
grid: [1938] block: [128]
elapsed time(us): total=680,216 max=1,043 min=654 avg=680
95: compute region reached 1000 times
95: kernel launched 1000 times
grid: [1938] block: [128]
elapsed time(us): total=487,365 max=801 min=476 avg=487
110: data region reached 2000 times
110: data copyin transfers: 1000
device time(us): total=6,783 max=140 min=3 avg=6
125: data copyout transfers: 1000
device time(us): total=7,445 max=190 min=6 avg=7
real 0m3.864s
user 0m3.499s
sys 0m0.348s
It raises some questions:
1. I see time(us): 97,667 at the top. This seems like a total time, but, at the bottom, I see real 0m3.864s. Why is there such a difference?
2. If time(us): 97,667 is the total, why is it so much smaller than values lower down, such as elapsed time(us): total=680,216?
3. This kernel line (elapsed time(us): total=680,216 max=1,043 min=654 avg=680) was run 1000 times. Are max, min, and avg values based on per-run values of the kernel?
4. Since the [grid] and [block] values may vary, are the elapsed total values still a good indicator of hotspots?
5. For data regions (device time(us): total=6,783), is the measurement transfer time or the entire time spent dealing with the data (preparing to transfer, post-receipt operations)?
6. The line numbering is weird. For instance, Line 76 in my program is clearly a for loop, Line 95 is a close-brace, and Line 110 is a variable definition. Should line numbers be interpreted as "the loop most closely following the indicated line number", or in some other way?
7. The kernel at 76 contains the kernel at 95. Are the times calculated for 76 inclusive of time spent in 95? If so, is there a convenient way to find the time spent in a kernel minus the times of all the subkernels?
(Some of these questions are a bit anal retentive, but I haven't found documentation for this, so I thought I'd be thorough.)
Part of the issue here is that the runtime can't find the CUDA Profiling library (libcupti.so), hence you're only seeing the PGI CPU-side profiling, not the device profiling. PGI ships the libcupti.so library with the compilers (under $PGI/[linux86-64|linuxpower]/2017/cuda/[7.5|8.0]/lib64), but this is an optional install so you may not have it installed on the system you're running. CUPTI also ships with the CUDA SDK, so if the system has CUDA installed, you can try pointing your LD_LIBRARY_PATH there instead. On my system it's installed in "/opt/cuda-8.0/extras/CUPTI/lib64/".
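For example, assuming the CUDA 8.0 location mentioned above, something along these lines (run before launching the program) should let the runtime pick up CUPTI:
export LD_LIBRARY_PATH=/opt/cuda-8.0/extras/CUPTI/lib64:$LD_LIBRARY_PATH
export PGI_ACC_TIME=1
./a.out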
The missing CUPTI library is why you're seeing the bad time, 97,667, for the file total. Also, since you're missing CUPTI, the times you're seeing are measured from the host. With CUPTI, in addition to the elapsed time, you'd see the device time for each kernel. The difference between the elapsed time and the device time is the launch overhead per kernel.
3. Are max, min, and avg values based on per-run values of the kernel?
Yes.
4. Since the [grid] and [block] values may vary, are the elapsed total values still a good indicator of hotspots?
I tend to look at the avg time first, since there are typically more opportunities to tune these loops. If you are varying the amount of work per kernel iteration (i.e. the grid size changes), then it might not be as useful, but it's a good starting point.
Now, if you have a low average but many calls, then the elapsed time may be dominated by kernel launch overhead. In that case, I'd look to see if you can combine loops or push more work into each loop.
5. For data regions (device time(us): total=6,783), is the measurement transfer time or the entire time spent dealing with the data (preparing to transfer, post-receipt operations)?
Just the data transfer time. For the overhead, you would need to use PGPROF/NVPROF.
6. The line numbering is weird. For instance, Line 76 in my program is clearly a for loop, Line 95 is a close-brace, and Line 110 is a variable definition. Should line numbers be interpreted as "the loop most closely following the indicated line number", or in some other way?
It's because the code has been optimized, so the line numbers may be a bit off, though they should correspond to the line numbers from the compiler feedback messages (-Minfo=accel). So the "the loop most closely..." interpretation should be correct.
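For reference, those feedback messages come from compiling with -Minfo=accel; a hedged example invocation (pgc++ and the source file name are assumptions based on the log above):
pgc++ -acc -Minfo=accel myprogram.cpp -o a.out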

fine tuning vgg raises memory error

Hi, I'm trying to fine-tune VGG on my problem, but when I try to train the net I get this error:
OOM when allocating tensor with shape[25088,4096]
The net has this structure:
I took this TensorFlow pretrained VGG implementation code from this site.
I only added this procedure to train the net:
with tf.name_scope('joint_loss'):
    joint_loss = ya_loss+yb_loss+yc_loss+yd_loss+ye_loss+yf_loss+yg_loss+yh_loss+yi_loss+yl_loss+ym_loss+yn_loss
    # Loss with weight decay
    l2_loss = tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables()])
    self.joint_loss = joint_loss + self.weights_decay * l2_loss
    self.optimizer = tf.train.AdamOptimizer(learning_rate=self.learning_rate).minimize(joint_loss)
I tried to reduce the batch size to 2, but it does not work; I get the same error. The error is due to the big tensor that cannot be allocated in memory. I get this error only when training, because if I just feed a value without minimizing, the net works. How can I avoid this error? How can I save memory on the graphics card (Nvidia GeForce GTX 970)?
UPDATE: if I use the GradientDescentOptimizer the training process starts, whereas if I use AdamOptimizer I get the memory error; it seems that the GradientDescentOptimizer uses less memory.
Without a backward pass ("feed a value without minimizing"), TensorFlow can immediately de-allocate intermediate activations. With a backward pass, the graph has a giant U-shape, where activations from the forward pass need to be kept in memory for the backward pass. There are some tricks (such as swapping to host memory), but in general backprop means that memory usage will be higher.
Adam does keep some extra bookkeeping variables around, so it will increase memory usage proportional to the amount of memory your weight variables are already using. If your training steps take quite a while (in which case having the variable updates on the GPU isn't important), you could instead locate the optimization ops in host memory.
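As a rough back-of-the-envelope check (a sketch assuming float32 weights and that the graph from the question has already been built), you can estimate how much extra state Adam adds, since it keeps two slot variables (m and v) per trainable variable:
import numpy as np
import tensorflow as tf

# Bytes used by the trainable weights themselves (float32 = 4 bytes each).
weight_bytes = sum(np.prod(v.get_shape().as_list()) * 4
                   for v in tf.trainable_variables())
# Adam keeps two extra slots per variable, so roughly double that again.
# The [25088, 4096] fc kernel alone is ~392 MB, i.e. ~784 MB of Adam state.
print("weights: %.0f MB, extra Adam state: %.0f MB"
      % (weight_bytes / 2.0**20, 2 * weight_bytes / 2.0**20))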
If you need a larger batch size and can't reduce image resolution or model size, combining gradients from multiple workers/GPUs using something like SyncReplicasOptimizer can be a good option. Looking at the paper associated with this model, it looks like they were training on 4 GPUs each with 12GB of memory.

Windows shared memory access time slow

I am currently using shared memory with two mapped files (1.9 GB for the first one and 600 MB for the second) in a piece of software.
I am using a process that reads data from the first file, processes the data and writes the results to the second file.
I sometimes notice a strong delay (the reason is beyond my knowledge) when reading from or writing to the mapped view with the memcpy function.
Mapped files are created this way :
m_hFile = ::CreateFileW(SensorFileName,
GENERIC_READ | GENERIC_WRITE,
0,
NULL,
CREATE_ALWAYS,
FILE_ATTRIBUTE_NORMAL,
NULL);
m_hMappedFile = CreateFileMapping(m_hFile,
NULL,
PAGE_READWRITE,
dwFileMapSizeHigh,
dwFileMapSizeLow,
NULL);
And memory mapping is done this way :
m_lpMapView = MapViewOfFile(m_hMappedFile,
FILE_MAP_ALL_ACCESS,
dwOffsetHigh,
dwOffsetLow,
m_i64ViewSize);
The dwOffsetHigh/dwOffsetLow values are "matching" the allocation granularity from the system info.
The process is reading about 300KB * N times, storing that in a buffer, processing and then writing 300KB * N times the processed contents of the previous buffer to the second file.
I have two different memory views (created/moved with the MapViewOfFile function) with a default size of 10 MB.
For the memory view size, I tested 10 kB, 100 kB, 1 MB, 10 MB and 100 MB. Statistically there is no difference: 80% of the time the reading process behaves as described below (~200 ms) but the writing process is really slow.
Normally:
1/ Reading is done in ~200 ms.
2/ Processing is done in 2.9 seconds.
3/ Writing is done in ~200 ms.
I can see that 80% of the time, either reading or writing (in the worst case both are slow) will take between 2 and 10 seconds.
Example: for writing, I am using the code below:
for (unsigned int i = 0; i < N; i++) // N = 500~3k
{
    // Check the position of the memory view for ponderation
    if (###)
        MoveView(iOffset);

    if (m_lpMapView)
    {
        memcpy((BYTE*)m_lpMapView + iOffset, pANNHeader, uiANNStatus);
        // uiSize = ~300 kBytes
        memcpy((BYTE*)m_lpMapView + iTemp, pLine[i], uiSize);
    }
    else
        return uiANNStatus;
}
After using the GetTickCount function to pinpoint where the delay is, I am seeing that the second memcpy call is always the one taking most of the time.
So, so far, I am seeing N (for the test, I used N = 500) calls to memcpy taking 10 seconds in the worst case when using those shared memories.
I made a temporary program that performed the same number of memcpy calls with the same amount of data and could not reproduce the problem.
For the tests, I used the following conditions; they all show the same delay:
1/ I can see this on various computers, 32- or 64-bit, from Windows 7 to Windows 10.
2/ Using the main thread or multiple threads (up to 8, with critical sections for synchronization) for reading/writing.
3/ OS on SATA or SSD, memory-mapped files physically on a SATA or SSD hard disk; when on an external hard disk, tests were done through USB1, USB2 or USB3.
I am kindly asking what you think my mistake might be that makes memcpy go so slow.
Best regards.
I found a solution that works for me, but it might not be the case for others.
Following Thomas Matthews' comments, I checked MSDN and found two interesting functions, FlushViewOfFile and FlushFileBuffers (but couldn't find anything interesting about locking memory).
Calling both after the for loop forces an update of the mapped file.
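For reference, a minimal sketch of what that looks like (a hypothetical helper; the handle and view arguments correspond to the m_hFile and m_lpMapView members from the question):
#include <windows.h>

// Flush once, after the whole write loop, not per iteration.
bool FlushMappedOutput(HANDLE hFile, LPVOID lpMapView, SIZE_T viewSize)
{
    // Write the dirty pages of the view back to the mapped file...
    if (!::FlushViewOfFile(lpMapView, viewSize))
        return false;
    // ...then ask the OS to push its buffers for the file to the disk.
    return ::FlushFileBuffers(hFile) != FALSE;
}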
I no longer have "random" delays, but instead of the expected 200 ms I have an average of 400 ms, which is enough for my application.
After doing some tests I saw that calling these too often causes heavy hard-disk access and makes the delay worse (10 seconds for every for loop), so the flush should be used carefully.
Thanks.

Irregular file writing performance in c++

I am writing an app which receives a binary data stream with a simple function call like put(DataBLock, dateTime); where each data package is 4 MB.
I have to write these datablocks to separate files for future use with some additional data like id, insertion time, tag etc...
So I tried these two methods:
first with FILE:
data.id = seedFileId;
seedFileId++;
std::string fileName = getFileName(data.id);
char *fNameArray = (char*)fileName.c_str();
FILE* pFile;
pFile = fopen(fNameArray,"wb");
fwrite(reinterpret_cast<const char *>(&data.dataTime), 1, sizeof(data.dataTime), pFile);
data.dataInsertionTime = time(0);
fwrite(reinterpret_cast<const char *>(&data.dataInsertionTime), 1, sizeof(data.dataInsertionTime), pFile);
fwrite(reinterpret_cast<const char *>(&data.id), 1, sizeof(long), pFile);
fwrite(reinterpret_cast<const char *>(&data.tag), 1, sizeof(data.tag), pFile);
fwrite(reinterpret_cast<const char *>(&data.data_block[0]), 1, data.data_block.size() * sizeof(int), pFile);
fclose(pFile);
second with ostream:
ofstream fout;
data.id = seedFileId;
seedFileId++;
std::string fileName = getFileName(data.id);
char *fNameArray = (char*)fileName.c_str();
fout.open(fNameArray, ios::out| ios::binary | ios::app);
fout.write(reinterpret_cast<const char *>(&data.dataTime), sizeof(data.dataTime));
data.dataInsertionTime = time(0);
fout.write(reinterpret_cast<const char *>(&data.dataInsertionTime), sizeof(data.dataInsertionTime));
fout.write(reinterpret_cast<const char *>(&data.id), sizeof(long));
fout.write(reinterpret_cast<const char *>(&data.tag), sizeof(data.tag));
fout.write(reinterpret_cast<const char *>(&data.data_block[0]), data.data_block.size() * sizeof(int));
fout.close();
In my tests the first method looks faster, but my main problem is that in both cases everything goes fine at first: every file writing operation takes almost the same time (about 20 milliseconds), but after the 250th - 300th package it starts to show peaks of 150 to 300 milliseconds, then goes down to 20 milliseconds, then again 150 ms and so on... So it becomes very unpredictable.
When I put some timers in the code I figured out that the main reason for these peaks is the fout.open(...) and pFile = fopen(...) lines. I have no idea if this is because of the operating system, hard drive, any kind of cache or buffer mechanism etc...
So the question is: why do these file-opening lines become problematic after some time, and is there a way to make the file writing operation stable, I mean fixed-time?
Thanks.
NOTE: I'm using Visual studio 2008 vc++, Windows 7 x64. (I tried also for 32 bit configuration but the result is same)
EDIT: After some point the writing speed slows down as well, even when the file-opening time is at its minimum. I tried with different package sizes, so here are the results:
For 2 MB packages it takes twice as long to slow down; the slowdown begins after roughly the 600th item.
For 4 MB packages, almost the 300th item.
For 8 MB packages, almost the 150th item.
So it seems to me that it is some sort of caching problem or something (in the hard drive or OS)? But I also tried disabling the hard drive cache and nothing changed...
Any idea?
This is all perfectly normal, you are observing the behavior of the file system cache, which is a chunk of RAM that is set aside by the operating system to buffer disk data. It is normally a fat gigabyte, and can be much more if your machine has lots of RAM. Sounds like you've got 4 GB installed, not that much for a 64-bit operating system. Depends however on the RAM needs of the other processes that run on the machine.
Your calls to fwrite() or ofstream::write() write to a small buffer created by the CRT, which in turn makes operating system calls to flush full buffers. The OS write normally completes very quickly, it is a simple memory-to-memory copy going from the CRT buffer to the file system cache. Effective write speed is in excess of a gigabyte/second.
The file system driver lazily writes the file system cache data to the disk. Optimized to minimize the seek time on the write head, by far the most expensive operation on the disk drive. Effective write speed is determined by the rotational speed of the disk platter and the time needed to position the write head. Typical is around 30 megabytes/second for consumer-level drives, give or take a factor of 2.
Perhaps you see the fire-hose problem here. You are writing to the file cache a lot faster than it can be emptied: at 4 MB every ~20 msec you are pushing roughly 200 megabytes/second into the cache, and the slowdown around the 300th package corresponds to roughly 1.2 gigabytes of buffered data (the 2 MB/600th and 8 MB/150th figures from your edit land at the same point). This hits the wall eventually, you'll manage to fill the cache to capacity and suddenly see the perf of your program fall off a cliff. Your program must now wait until space opens up in the cache so the write can complete, and the effective write speed is now throttled by disk write speeds.
The 20 msec delays you observe are normal as well. That's typically how long it takes to open a file. That is a time that's completely dominated by disk head seek times, it needs to travel to the file system index to write the directory entry. Nominal times are between 20 and 50 msec, you are on the low end of that already.
Clearly there is very little you can do in your code to improve this. What CRT functions you use certainly don't make any difference, as you found out. At best you could increase the size of the files you write, that reduces the overhead spent on creating the file.
Buying more RAM is always a good idea. But it of course merely delays the moment where the firehose overflows the bucket. You need better drive hardware to get ahead. An SSD is pretty nice, so is a striped raid array. Best thing to do is to simply not wait for your program to complete :)
So the question is: why do these file-opening lines become problematic after some time, and is there a way to make the file writing operation stable, I mean fixed-time?
This observation (i.e. the varying time taken by the write operation) does not mean that there is a problem in the OS or file system. There could be various reasons behind your observation. One possible reason is that the kernel may use delayed writes to put the data on disk. Sometimes the kernel caches (buffers) it in case another process should read or write it soon, so that an extra disk operation can be avoided.
This situation may lead to inconsistency in the time taken by different write calls for the same size of data/buffer.
File I/O is a somewhat complex and complicated topic and depends on various other factors. For complete information on the internal algorithms of file systems, you may want to refer to the great classic book "The Design of the UNIX Operating System" by Maurice J. Bach, which describes these concepts and their implementation in detail.
Having said that, you may want to use a flush call immediately after your write calls in both versions of your program (i.e. C and C++). This way you may get consistent times for your file I/O writes. Otherwise your programs' behaviour looks correct to me.
// C program
fwrite(data, 1, size, fp);
fflush(fp);
// C++ program
fout.write(data, size);
fout.flush();
It's possible that the spikes are not related to the I/O itself, but to NTFS metadata: when your file count reaches some limit, some NTFS AVL-like data structure needs some refactoring and... bump!
To check it you should preallocate the file entries, for example by creating all the files with zero size and then opening them when writing, just for testing: if my theory is correct you shouldn't see your spikes anymore.
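A minimal sketch of that test (a hypothetical helper reusing the getFileName() naming scheme from the question):
#include <cstdio>
#include <string>

std::string getFileName(long id);   // helper from the question

// Create every file entry up front with zero size, so the timed writes
// later only open files that already exist in the NTFS directory index.
void PreallocateFileEntries(long firstId, long count)
{
    for (long id = firstId; id < firstId + count; ++id)
    {
        std::string fileName = getFileName(id);
        if (FILE* pFile = std::fopen(fileName.c_str(), "wb"))
            std::fclose(pFile);
    }
}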
UHH - and you must disable file indexing (the Windows Search service) there! Just remembered it... see here.

Should we do nested goroutines?

I'm trying to build a parser for a large number of files, and I can't find any information about what might possibly be called "nested goroutines" (maybe this is not the right name?).
Given a lot of files, each of them having a lot of lines, should I do:
for file in folder:
    go do1

def do1:
    for line in file:
        go do2

def do2:
    do_something
Or should I use only "one level" of goroutines, and do the following:
for file in folder:
    for line in file:
        go do_something
My question primarily targets performance issues.
Thanks for reaching this sentence!
If you go through with the architecture you've specified, you have a good chance of running out of CPU/memory/etc. because you're going to be creating an arbitrary number of workers. I suggest instead going with an architecture that allows you to throttle via channels. For example:
In your main process, feed the files into a channel:
for _, file := range folder {
    fileChan <- file
}
then in another goroutine break the files into lines and feed those into a channel:
for {
    select {
    case file := <-fileChan:
        for _, line := range file {
            lineChan <- line
        }
    }
}
then in a 3rd goroutine pop out the lines and do what you will with them:
for {
    select {
    case line := <-lineChan:
        // process the line
    }
}
The main advantage of this is that you can create as many or as few goroutines as your system can handle and pass them all the same channels; whichever goroutine gets to the channel first will handle the work, so you're able to throttle the amount of resources you're using.
Here is a working example: http://play.golang.org/p/-Qjd0sTtyP
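Below is a self-contained sketch along the same lines with a fixed pool of worker goroutines; the file names and the per-line work are placeholders:
package main

import (
    "bufio"
    "fmt"
    "os"
    "sync"
)

// A fixed number of workers all read from the same channel, so resource
// use stays bounded no matter how many files or lines there are.
func main() {
    files := []string{"a.txt", "b.txt", "c.txt"} // placeholder input
    lineChan := make(chan string, 100)

    var workers sync.WaitGroup
    for i := 0; i < 4; i++ { // pick a worker count your machine can handle
        workers.Add(1)
        go func() {
            defer workers.Done()
            for line := range lineChan {
                fmt.Println(len(line)) // placeholder "process the line"
            }
        }()
    }

    for _, name := range files {
        f, err := os.Open(name)
        if err != nil {
            continue // placeholder error handling
        }
        scanner := bufio.NewScanner(f)
        for scanner.Scan() {
            lineChan <- scanner.Text()
        }
        f.Close()
    }
    close(lineChan)
    workers.Wait()
}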
The answer depends on how processor-intensive the operation on each line is.
If the line operation is short-lived, definitely don't bother to spawn a goroutine for each line.
If it's expensive (think ~5 secs or more), proceed with caution. You may run out of memory. As of Go 1.4, spawning a goroutine allocates a 2048-byte stack, so for 2 million lines you could allocate roughly 4 GB of RAM for the goroutine stacks alone. Consider whether it's worth allocating this memory.
In short, you will probably get the best results with the following setup:
for file in folder:
    go process_file(file)
If the number of files exceeds the number of CPUs, you're likely to have enough concurrency to mask the disk I/O latency involved in reading the files from disk.