How to tell that a C++ application has a disk I/O bottleneck? - c++

I'm working on a "search" project. The main idea is to build an index that responds to search requests as fast as possible. The input is a query such as "termi termj"; the output is the docs in which both termi and termj appear.
The index file looks like this (each line is called a postlist, which is a sorted array of unsigned int and can be compressed with a good compression ratio):
term1:doc1, doc5, doc8, doc10
term2:doc10, doc51, doc111, doc10000
...
termN:doc2, doc4, doc10
The 3 main time-consuming procedures are:
seek to termi's and termj's postlists in the file (random disk read)
decode the postlists (CPU)
calculate the intersection of the 2 postlists (CPU; a sketch of this step follows below)
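For reference, a minimal sketch of that intersection step, assuming the two postlists have already been decoded into sorted std::vector<unsigned int>:

#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

// Two-pointer intersection of two sorted postlists.
// Runs in O(|a| + |b|); for very skewed sizes, galloping/binary search
// on the longer list can be faster.
std::vector<unsigned int> intersect(const std::vector<unsigned int>& a,
                                    const std::vector<unsigned int>& b) {
    std::vector<unsigned int> out;
    out.reserve(std::min(a.size(), b.size()));
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i] < b[j])      ++i;
        else if (b[j] < a[i]) ++j;
        else { out.push_back(a[i]); ++i; ++j; }
    }
    return out;
}

int main() {
    std::vector<unsigned int> term1 = {1, 5, 8, 10};
    std::vector<unsigned int> term2 = {5, 10, 51, 111};
    for (unsigned int doc : intersect(term1, term2))
        std::printf("doc%u\n", doc);  // prints doc5 and doc10
}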
My question is: how can I know that the application can't be made more efficient, i.e. that it has a disk I/O bottleneck? How can I measure whether my machine is using its disk at 100 percent? Are there any tools on Linux to help? Is there any tool that can measure disk I/O as well as the Google CPU profiler measures CPU?
My development environment is Ubuntu 14.04.
CPU: 8 cores 2.6GHz
disk: SSD
The benchmark is currently about 2000 queries/second, but I don't know how to improve it further.
Any suggestions will be appreciated! Thank you very much!
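One way to answer the "is the disk at 100 percent?" part: iostat -x (from the sysstat package) reports a %util column per device, and the same figure can be computed by hand from /proc/diskstats, whose 13th column is the number of milliseconds the device spent doing I/O. A minimal sketch, assuming the device is named sda; note that on an SSD, which services requests in parallel, 100% utilization does not necessarily mean the device is saturated:

#include <chrono>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <thread>

// Read the "time spent doing I/Os (ms)" counter (13th column) for one
// device from /proc/diskstats. Returns 0 if the device is not found.
static unsigned long long io_ticks_ms(const std::string& device) {
    std::ifstream f("/proc/diskstats");
    std::string line;
    while (std::getline(f, line)) {
        std::istringstream ss(line);
        unsigned maj = 0, mino = 0;
        std::string name;
        ss >> maj >> mino >> name;
        if (name != device) continue;
        unsigned long long v = 0;
        for (int col = 4; col <= 13; ++col) ss >> v;  // last read is column 13
        return v;
    }
    return 0;
}

int main() {
    const std::string dev = "sda";  // assumption: adjust to your device
    const auto interval = std::chrono::milliseconds(1000);

    unsigned long long before = io_ticks_ms(dev);
    std::this_thread::sleep_for(interval);
    unsigned long long after = io_ticks_ms(dev);

    // Like iostat's %util: fraction of the interval the device was busy.
    double util = 100.0 * (after - before) / interval.count();
    std::cout << dev << " utilization: " << util << "%\n";
}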

Related

Retrieving disk read/write max speed (programmatically)

I am in the process of creating a C++ application that measures disk usage. I've been able to retrieve current disk usage (read and write speeds) by reading /proc/diskstats at regular intervals.
I would now like to be able to display this usage as a percentage (I find it is more user-friendly than raw numbers, which can be hard to interpret). Therefore, does anyone know of a method for retrieving maximum (or nominal) disk I/O speed programmatically on Linux (API call, reading a file, etc)?
I am aware of various answers about measuring disk speeds (e.g. https://askubuntu.com/questions/87035/how-to-check-hard-disk-performance), but all of them work by testing. I would like to avoid such methods, as they take some time to run and entail heavy disk I/O while running (thus potentially degrading the performance of other running applications).
At the dawn of the IBM PC era there was a great DOS utility, whose name I forget, that measured the speed of the computer (maybe Speedtest? whatever). A bar in the bottom third of the screen represented the speed of the CPU. If you had a 4.0 MHz (not GHz!) machine, the bar occupied about 10% of the screen.
2-3 years later, '386 computers arrived, and the speed indicator bar outgrew not just the line but the screen, and it looked crappy.
So there is no such thing as 100% disk speed, 100% CPU speed, etc.
The best you can do: if your program runs for a while, remember the highest value observed and treat it as 100%. You could also save that value into a tmp file (a sketch of this follows below).
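A minimal sketch of that running-maximum idea in C++, assuming the caller already samples throughput in bytes/s from /proc/diskstats; the class name and the tmp-file path are hypothetical:

#include <fstream>

// Keeps a running maximum of observed disk throughput and persists it,
// so the "100%" reference survives restarts. The path is a hypothetical choice.
class ThroughputScale {
public:
    explicit ThroughputScale(const char* path = "/tmp/disk_max_bps")
        : path_(path) {
        std::ifstream in(path_);
        in >> max_bps_;  // stays 0.0 if the file does not exist yet
    }

    // Feed each new sample; returns the sample as a percentage of the
    // highest value seen so far.
    double percent(double bytes_per_second) {
        if (bytes_per_second > max_bps_) {
            max_bps_ = bytes_per_second;
            std::ofstream(path_) << max_bps_;
        }
        return max_bps_ > 0.0 ? 100.0 * bytes_per_second / max_bps_ : 0.0;
    }

private:
    const char* path_;
    double max_bps_ = 0.0;
};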

Time Consuming Tensorflow C++ Session->Run - Images for Real-time Inference

[Tensorflow (TF) on CPU]
I am using the skeleton code provided for C++ TF inference from GitHub [label_image/main.cc] in order to run a frozen model I have created in Python. This model is an FC NN with two hidden layers.
In my current project's C++ code, I run the NN's frozen classifier for each single image (8x8 pixels). For each sample, a Session->Run call takes about 0.02 seconds, which is expensive in my application, since I can have 64000 samples that I have to run.
When I send a batch of 1560 samples, the Session->Run call takes about 0.03 seconds.
Are these time measurements normal for the Session->Run call? From the C++ end, should I send my frozen model batches of images rather than single samples? From the Python end, are there optimisation tricks to alleviate that bottleneck? Is there a way to make Session->Run calls concurrently in C++?
Environment info
Operating System: Linux
Installed version of CUDA and cuDNN: N/A
What other solutions have you tried?
I installed TF using the optimised instruction set for the CPU, but it does not seem to give the huge time savings mentioned on StackOverflow.
Unified the session for the Graph I created.
EDIT
It seems that MatMul is the bottleneck -- any suggestions on how to improve that?
Should I use the 'optimize_for_inference.py' script on my frozen graph?
How can you measure the time in Python with high precision?
Timeline for feeding an 8x8 sample and getting the result in Python
Timeline for feeding an 8x8 batch and getting the result in Python
For the record, I have done two things that significantly increased the speed of my application:
Compiled TF to work on the optimised ISA of my machine.
Applied batching to my data samples (a sketch of the batching call follows below).
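A minimal sketch of the batching call, assuming the graph's input and output tensors are named "input" and "output" and that the flattened 8x8 samples feed a placeholder of shape {batch, 64}; these names and the shape are assumptions and must match your frozen graph:

#include <algorithm>
#include <vector>
#include "tensorflow/core/framework/tensor.h"
#include "tensorflow/core/public/session.h"

// Runs one Session->Run call over a whole batch instead of one call per
// 8x8 sample. "input"/"output" and the {batch, 64} shape are assumptions.
tensorflow::Status RunBatch(tensorflow::Session* session,
                            const std::vector<float>& samples,  // batch * 64 floats
                            int batch,
                            std::vector<tensorflow::Tensor>* outputs) {
  tensorflow::Tensor input(tensorflow::DT_FLOAT,
                           tensorflow::TensorShape({batch, 64}));
  // Copy all flattened samples into the batch tensor in one go.
  std::copy(samples.begin(), samples.end(), input.flat<float>().data());

  // One Run call amortises the per-call overhead across the whole batch.
  return session->Run({{"input", input}}, {"output"}, {}, outputs);
}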
Please feel free to comment here if you have questions about my answer.

Octo.py only using between 0% and 3% of my CPUs

I have been running a Python octo.py script to do word/author counting on a series of files. The script works well -- I tried it on a limited set of data and got the correct results.
But when I run it on the complete data set it takes forever. I am running on a Windows XP laptop with a dual-core 2.33 GHz CPU and 2 GB of RAM.
I opened up my CPU usage and it shows the processors running at 0%-3% of maximum.
What can I do to force Octo.py to utilize more CPU?
Thanks.
As your application isn't very CPU intensive, the slow disk turns out to be the bottleneck. Old 5400 RPM laptop hard drives are very slow, and fragmentation and low RAM (which hurts disk caching) make reading slower still. This in turn slows down processing and yields low CPU usage. You can try defragmenting, compressing the input files (smaller files on disk mean less data to read, so processing speed increases; see the sketch below), or other means of improving I/O.
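A minimal sketch of the "compress the inputs" idea, assuming the input files have been gzipped and zlib is available (link with -lz); the disk then only has to deliver the compressed bytes, and zlib inflates them in memory:

#include <cstdio>
#include <cstring>
#include <zlib.h>

// Counts lines in a gzip-compressed text file. Stands in for any
// line-oriented processing over compressed input.
long count_lines_gz(const char* path) {
    gzFile f = gzopen(path, "rb");
    if (!f) return -1;
    char buf[1 << 16];
    long lines = 0;
    while (gzgets(f, buf, sizeof(buf)) != nullptr) {
        if (std::strchr(buf, '\n')) ++lines;
    }
    gzclose(f);
    return lines;
}

int main(int argc, char** argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s file.gz\n", argv[0]); return 1; }
    std::printf("%ld lines\n", count_lines_gz(argv[1]));
}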

How can I run a code directly into a processor with a File System?

I have simple anisotropic filter C/C++ code that processes a .pgm image, which is a text file with greyscale information for each pixel, and after processing it generates an output image with the filter applied.
This program takes a few seconds to do about 10 iterations on an x86 CPU running Windows.
An academic finishing his master's degree in applied computing and I need to run the code on an FPGA (Altera DE2-115) to see whether there is a considerable performance gain when running the code directly on the soft processor (Nios II).
We have successfully booted uClinux on the FPGA, but there are some errors with the device hardware, so we can't access the SD card or Ethernet, and therefore we can't get the code and image onto the FPGA to test its performance.
So I am asking for an alternative way to test our code's performance directly on a CPU with a file system, so the code can read the image and generate another one.
The alternative can either be a product that is low cost and easy to use (I was thinking of a Raspberry Pi), or somewhere I could upload the code so that it runs automatically and gives me the reports.
Thanks in advance.
What you're trying to do is benchmark some software on a multi-GHz x86 processor vs. a soft-core processor running at 50 MHz (as far as I can tell from the Altera docs)?
I can guarantee that it will be even slower on the FPGA! Since it is also running an OS (even embedded Linux), it also has threading overhead and whatnot. This cannot be considered running it "directly" on the CPU (whatever you mean by that).
If you really want to leverage the performance of an FPGA, you should "convert" your C code into an HDL and run it directly in hardware. Accessing the data should be possible. I don't know how it's done with an Altera board, but Xilinx has libraries for accessing data on an SD card with FAT.
You can use the on-board SRAM or DDR2 RAM to run the OS and your application.
The hardware design in your FPGA must include a memory controller. In SOPC Builder or Qsys, select the external memory as the reset vector and compile the design.
Then open the Nios II Software Build Tools for Eclipse.
In Eclipse, create a new project by selecting Nios II Application and BSP project.
Once the project is created, go to the BSP properties, enter the offset of the external memory in the linker tab, and generate the BSP.
Compile the project and Run As > Nios II Hardware.
This will run your application from the external memory.
You won't be able to see the image, but the 2-D array representing the image in memory can be printed on the console (a small sketch follows below).
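A minimal sketch of that console trick: dump the filtered buffer as ASCII (P2) PGM text on the Nios II console, capture the terminal output on the host, and save it as a .pgm file. The buffer layout below is an assumption:

#include <cstdio>

// Dumps a greyscale image buffer as a plain (P2) PGM to stdout.
// Capture the console output to a file and open it as image.pgm.
void dump_pgm(const unsigned char* pixels, int width, int height) {
    std::printf("P2\n%d %d\n255\n", width, height);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x)
            std::printf("%d ", pixels[y * width + x]);
        std::printf("\n");
    }
}

int main() {
    // Tiny 4x4 gradient as a stand-in for the filtered image buffer.
    unsigned char img[16];
    for (int i = 0; i < 16; ++i) img[i] = static_cast<unsigned char>(i * 17);
    dump_pgm(img, 4, 4);
}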

How does the compression file attribute impact performance on a file save in windows?

QUESTION:
If I set the compression attribute on a directory on a Windows server, how does that affect file-saving performance?
WHY I WANT TO KNOW:
I have a server that several batch processes save huge files on, mostly TXT or CSV files that I'd like to compress to save disk space.
If it does compression on the fly as it writes the files, I would have to watch CPU usage during the writes, and that may be an issue.
If it writes them uncompressed and a background thread later compresses them, that would be ideal, as the batch processes would not be slowed down when they do their writes.
My alternative solution would be to not set the attribute on the directory, but have a scheduled job run the compact command on these files.
This isn't programming related, but here goes anyway:
Reading from and writing to disk will require some extra CPU processing, since compression is a CPU-intensive task.
However, reading and writing files is typically I/O bound, not CPU bound. So your computer will spend more time waiting for the data to be written/read than it will waiting for the data to be compressed/uncompressed.
As long as your server isn't CPU starved, you shouldn't see a big change in performance.
Of course, before you implement any changes like this, do some testing in a test environment that simulates your real server conditions (a small timing sketch follows below).
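For that kind of test, a minimal C++ timing sketch: write the same highly compressible text payload into a directory with the compression attribute set and into one without, and compare wall-clock times. The paths below are hypothetical:

#include <chrono>
#include <fstream>
#include <iostream>
#include <string>

// Writes `megabytes` MB of repetitive text (compresses well, like CSV)
// to `path` and returns the elapsed wall-clock seconds.
double timed_write(const std::string& path, int megabytes) {
    const std::string row = "2024-01-01,batch-job,0.000000,OK\n";
    std::string chunk;
    while (chunk.size() < (1u << 20)) chunk += row;  // ~1 MB of rows

    auto start = std::chrono::steady_clock::now();
    {
        std::ofstream out(path, std::ios::binary);
        for (int i = 0; i < megabytes; ++i) out.write(chunk.data(), chunk.size());
    }  // ofstream flushes and closes here
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}

int main() {
    // Hypothetical paths: one directory with the compression attribute set,
    // one without. Run each test a few times and ignore the first run (cache warm-up).
    std::cout << "compressed:   " << timed_write("D:/compressed/test.csv", 512)   << " s\n";
    std::cout << "uncompressed: " << timed_write("D:/uncompressed/test.csv", 512) << " s\n";
}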
Edit:
MSDN Docs on NTFS compression
Microsoft NTFS compression best practices