Is there a way to limit the number of output files of a process? - c++

One of our company's applications uses pdfimages (from xpdf) to check whether certain pages of a PDF file, which we know contain no text, consist of a single image.
For this we run pdfimages on that page and count whether zero, one, or two or more output files are created (they could be JPG, PPM, or PGM files).
The problem is that for some PDF files, we get millions of 14-byte PPM images, and the process has to be killed manually.
We know that by assigning the process to a job we can restrict how long it is allowed to run. But it would probably be better if we could ensure that the process creates at most two new files during its execution.
Does anyone have an idea how to do that?
Thank you.

One approach is to monitor the output directory for file creations (see http://msdn.microsoft.com/en-us/library/aa365261(v=vs.85).aspx); the monitoring app could then terminate the PDF image extraction process as soon as too many files appear.
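For example, a rough sketch of that approach in C++ (Win32, C++17 for <filesystem>): the pdfimages command line, the output directory, and the two-file limit below are placeholders/assumptions for your setup.

```cpp
// Launch pdfimages, watch its output directory, and terminate the process
// once more than two output files appear. Paths and command line are placeholders.
#include <windows.h>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

static std::size_t CountFiles(const fs::path& dir)
{
    std::size_t n = 0;
    for (const auto& entry : fs::directory_iterator(dir))
        if (entry.is_regular_file()) ++n;
    return n;
}

int main()
{
    const fs::path outDir = L"C:\\temp\\pdfimages-out";   // hypothetical output directory
    fs::create_directories(outDir);
    std::wstring cmd = L"pdfimages.exe -j input.pdf C:\\temp\\pdfimages-out\\img";

    STARTUPINFOW si{ sizeof(si) };
    PROCESS_INFORMATION pi{};
    if (!CreateProcessW(nullptr, cmd.data(), nullptr, nullptr, FALSE,
                        0, nullptr, nullptr, &si, &pi))
        return 1;

    // Ask for a notification whenever a file is created/renamed in outDir.
    HANDLE hChange = FindFirstChangeNotificationW(outDir.c_str(), FALSE,
                                                  FILE_NOTIFY_CHANGE_FILE_NAME);
    if (hChange != INVALID_HANDLE_VALUE)
    {
        HANDLE handles[2] = { hChange, pi.hProcess };
        for (;;)
        {
            DWORD w = WaitForMultipleObjects(2, handles, FALSE, INFINITE);
            if (w == WAIT_OBJECT_0)                     // directory changed
            {
                if (CountFiles(outDir) > 2)             // more than two outputs: give up
                {
                    TerminateProcess(pi.hProcess, 1);
                    break;
                }
                FindNextChangeNotification(hChange);    // re-arm the notification
            }
            else                                        // process exited (or wait error)
                break;
        }
        FindCloseChangeNotification(hChange);
    }

    WaitForSingleObject(pi.hProcess, INFINITE);
    CloseHandle(pi.hProcess);
    CloseHandle(pi.hThread);
    return 0;
}
```

The same idea works with ReadDirectoryChangesW if you need to know which file was created rather than just that something changed.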
Another would be to use a simple RAM disk that limits the number of files that can be created; you might modify something like http://support.microsoft.com/kb/257405.
If you can set up a FAT16 filesystem, I believe its root directory has a fixed limit (typically 512 entries); with such small files that limit would be reached quickly.

Also, aside from my 'joke' comment, you might want to check out _setmaxstdio and see if that helps ( http://msdn.microsoft.com/en-us/library/6e3b887c(VS.71).aspx ).

Related

Automate desktop screening with Python

I am trying to make a program that automatically scans the images or text on a user's desktop and converts them to .txt files for text analysis.
So far I have found source code to convert PDF and HTML into .txt. However, I would like my program to scan the desktop automatically at certain time intervals rather than having to specify the source manually, such as:
$pdf2txt.py samples/simple1.pdf
I don't know where to start so any suggestion will be appreciated.
First of all, the desktop is just a directory in the file system, for example:
C:\Users\Kirsteen\Desktop
So the next step would be to search through this directory for the types of files you are interested in. You'd be aiming to generate a list of valid file names that need to be converted. This Q/A might help you.
Once the files have been found, run the conversion scripts you have. To repeat this automatically, put all of it in a loop and add a delay so that it runs, say, once an hour or once a week (a rough sketch follows below).
To tidy things up, think about running this process in the background and making sure the program doesn't convert a file again if it hasn't changed.
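The question asks for Python, but the structure of that loop is the same in any language; here is a minimal sketch in C++ (C++17 <filesystem>) just to make the steps concrete. The Desktop path is the one mentioned above, and the pdf2txt.py call and the one-hour interval are placeholders.

```cpp
// Scan the Desktop for PDFs, convert each new one, then sleep and repeat.
#include <chrono>
#include <cstdlib>
#include <filesystem>
#include <set>
#include <string>
#include <thread>

namespace fs = std::filesystem;

int main()
{
    const fs::path desktop = "C:/Users/Kirsteen/Desktop";  // the location mentioned above
    std::set<fs::path> alreadyConverted;                   // don't convert a file twice

    for (;;)
    {
        for (const auto& entry : fs::directory_iterator(desktop))
        {
            if (!entry.is_regular_file() || entry.path().extension() != ".pdf")
                continue;
            if (alreadyConverted.count(entry.path()))
                continue;

            // Placeholder converter call, e.g. the pdf2txt.py script from the question.
            std::string cmd = "pdf2txt.py \"" + entry.path().string() + "\"";
            std::system(cmd.c_str());
            alreadyConverted.insert(entry.path());
        }
        std::this_thread::sleep_for(std::chrono::hours(1)); // run once an hour
    }
}
```

A fuller version would also remember each file's last-modified time so a changed file gets reconverted, per the tidy-up suggestion above.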

Hardlink multiple files to one file

I have many files in a folder and I want to concatenate them all into a single file, for example: cat * > final_file.
But this will use extra disk space. Is there a way I can hard-link all the files to final_file, for example ln * final_file?
This is not possible using links.
If you really need this kind of feature and cannot afford to create one large file, you could go for a custom file system driver. FUSE allows you to write a simple file system driver that runs in user space and presents the files as if they were one large file (a rough sketch follows below).
You could also write a custom block device (e.g. by emulating the NBD "Network Block Device" protocol) which combines two or more files into one large block device.
Knowing the concrete use case would help in giving a better answer.
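To give a feel for the FUSE route, here is a rough sketch in C++ against the libfuse 2.x high-level API: it exposes a single virtual /final_file whose contents are the concatenation of a hard-coded list of source files. The part file names are placeholders, it is read-only, and real code would need proper error handling; build with something like g++ catfs.cpp $(pkg-config fuse --cflags --libs).

```cpp
#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <sys/stat.h>
#include <algorithm>
#include <cerrno>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Source files to present as one virtual "/final_file" (placeholder names).
static std::vector<std::string> g_parts = { "/data/part1", "/data/part2" };

static off_t part_size(const std::string& p)
{
    std::ifstream f(p, std::ios::binary | std::ios::ate);
    return f ? static_cast<off_t>(f.tellg()) : 0;
}

static int cat_getattr(const char* path, struct stat* st)
{
    std::memset(st, 0, sizeof(*st));
    if (std::strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0755;
        st->st_nlink = 2;
        return 0;
    }
    if (std::strcmp(path, "/final_file") == 0) {
        st->st_mode = S_IFREG | 0444;
        st->st_nlink = 1;
        for (const auto& p : g_parts) st->st_size += part_size(p);
        return 0;
    }
    return -ENOENT;
}

static int cat_readdir(const char* path, void* buf, fuse_fill_dir_t filler,
                       off_t, struct fuse_file_info*)
{
    if (std::strcmp(path, "/") != 0) return -ENOENT;
    filler(buf, ".", nullptr, 0);
    filler(buf, "..", nullptr, 0);
    filler(buf, "final_file", nullptr, 0);
    return 0;
}

// Map a read at [offset, offset + size) onto the underlying part files.
static int cat_read(const char* path, char* buf, size_t size, off_t offset,
                    struct fuse_file_info*)
{
    if (std::strcmp(path, "/final_file") != 0) return -ENOENT;
    size_t written = 0;
    off_t start = 0;
    for (const auto& p : g_parts) {
        off_t len = part_size(p);
        if (written < size && offset < start + len) {
            off_t local = std::max<off_t>(0, offset - start);
            std::ifstream f(p, std::ios::binary);
            f.seekg(local);
            f.read(buf + written, static_cast<std::streamsize>(
                       std::min<off_t>(len - local, static_cast<off_t>(size - written))));
            written += static_cast<size_t>(f.gcount());
        }
        start += len;
    }
    return static_cast<int>(written);
}

int main(int argc, char* argv[])
{
    struct fuse_operations ops = {};
    ops.getattr = cat_getattr;
    ops.readdir = cat_readdir;
    ops.read    = cat_read;
    return fuse_main(argc, argv, &ops, nullptr);
}
```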
No. A hard link just makes two names refer to the same file, nothing more. The filesystem does not support this kind of aggregation at an underlying level.

Watermark files on the fly when served

I am looking for a very general answer to the feasibility of the idea, not a specific implementation.
If you want to serve small variations of the same media file to different people (say, an EPUB or music file), is it possible to serve most of the file to everybody but individualized small portions to each recipient for watermarking, using something like Amazon Web Services?
If yes, would it be possible to create a Dropbox-like file hosting service with these individualized media files, where all users “see” mostly the same physically stored file but with tiny parts of it served individually? If, say, 1000 users had the same 10 MB MP3 file with different watermarks on a server, that would amount to 10 GB. But if the same 1000 users were served the same file except for a tiny 10 kB individually watermarked portion, it would only amount to 20 MB in total.
An EPUB is a single file and must be served/downloaded as such, not in pieces. Why don't you implement simple server-side logic to customize the necessary components, build the EPUB from the common assets and the customized ones, and then let users download that?
The answer is, of course, yes, it can be done, using an EC2 instance -- or any other machine that can run a web server, for that matter. The problem is that any given type of media file has different levels of complexity when it comes to customizing the file... from the simplest, where the file contains a string of bytes at a known position that can simply be overwritten with your watermark data, to a more complex format that would have to be fully or partially disassembled and repackaged every time a download is requested.
The bottom line is that for any format I can think of, the server would spend some amount of CPU resources -- possibly a significant amount -- crunching the data and preparing/reassembling the file for download. The ultimate solution would be very format-specific and, as a side note, has really nothing to do with AWS other than the fact that you can host web servers in EC2.
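To make the "simplest" case concrete, here is a rough sketch that streams the shared template to a client and splices per-user watermark bytes over a reserved region at a known offset, so nothing user-specific is ever stored. The offset, the length, and the very existence of such a reserved region are assumptions; as noted above, most real formats need format-aware repackaging instead.

```cpp
#include <fstream>
#include <ostream>
#include <string>
#include <vector>

// Stream `templatePath` to `client` (e.g. an HTTP response body), overwriting
// the bytes at [markOffset, markOffset + markLength) with the user's mark.
void stream_watermarked(const std::string& templatePath,
                        std::ostream& client,
                        const std::string& userMark,
                        std::streamoff markOffset,      // known, format-specific (assumed)
                        std::size_t markLength)         // size of the reserved region
{
    std::ifstream in(templatePath, std::ios::binary);
    std::vector<char> buf(64 * 1024);
    std::streamoff pos = 0;

    while (in) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        std::streamsize got = in.gcount();
        if (got <= 0) break;

        // If the watermark region overlaps this chunk, patch it in memory.
        std::streamoff chunkEnd = pos + got;
        for (std::size_t i = 0; i < markLength; ++i) {
            std::streamoff abs = markOffset + static_cast<std::streamoff>(i);
            if (abs >= pos && abs < chunkEnd)
                buf[static_cast<std::size_t>(abs - pos)] =
                    i < userMark.size() ? userMark[i] : '\0';
        }

        client.write(buf.data(), got);
        pos = chunkEnd;
    }
}
```

With something along these lines, the 1000-user example above really does cost only the one shared file plus a few bytes of per-user state, at the price of some CPU per download.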

Out of Core Implementation of a Quadtree

I am trying to build a quadtree data structure (or, let's just say, a tree) in secondary memory (on the hard disk).
I have a C++ program that does this, and I use fopen to create the files. I use tesseral coding to name each cell's file with its corresponding code, and all files are stored in one directory on the disk.
The problem is that after creating about 1,100 files, fopen just returns NULL and stops creating new files. I can still create files manually in that directory, but my C++ program cannot create any more.
I know about the limit on the number of inodes on an ext3 filesystem, which is (according to Wikipedia) 32,000, but I am far below that; note also that I can create files manually on the disk, just not through fopen.
I would also really appreciate any ideas on the best way to store a very dynamic quadtree on disk (I need the nodes to be in separate files, and the quadtree might have a depth of 50).
Using nested directories is one idea, but I think it will hurt performance because of the extra filesystem lookups needed to reach each file.
Thanks,
Nima
What's the errno value of the failed fopen() call?
Do you keep the files you have created open? If so, you are most probably exceeding the maximum number of open files per process.
When you use directories as data structures, you delegate the work of maintaining that structure to the file system, which is not necessarily designed to do that.
Edit: Frank is probably right that you've exceeded the number of available file descriptors. You can increase those, but that shows that you're also using internals of your ABI as a data structure. Slow and (as resources are exhausted) unstable.
Either code for a very specific OS installation, or use a SQL database.
I have no idea why fopen wouldn't work. Look at errno.
However, storing everything in one directory is a bad idea. When you add a lot of files, it will get slow. Having a directory for every level of the tree will also be slow.
Instead, combine multiple levels into one directory. You could, for example, have one directory for every four levels of the tree. This would limit the number of directories, amount of nesting, and number of files per directory, giving very good performance.
The limitation could come from stdio (the C library), which allows at most 256 open handles by default and can be increased to 1024 (in VC, call _setmaxstdio), or from the OS kernel's limit on open file handles per process (usually 1024).
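A minimal sketch tying those suggestions together: print errno when fopen fails (EMFILE is the tell-tale "too many open files" case), and close each cell's file as soon as it has been written so the per-process limit is never reached. The cell path and payload are placeholders; on Windows/MSVC you could additionally raise the stdio limit with _setmaxstdio, but closing files promptly is the more robust fix.

```cpp
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <string>

// Write one quadtree cell to its own file and close it immediately.
bool write_cell(const std::string& path, const void* data, std::size_t size)
{
    std::FILE* f = std::fopen(path.c_str(), "wb");
    if (!f) {
        // EMFILE here means the process ran out of open file handles --
        // the likely reason fopen starts returning NULL after ~1,100 files.
        std::fprintf(stderr, "fopen(%s) failed: %s\n",
                     path.c_str(), std::strerror(errno));
        return false;
    }
    bool ok = std::fwrite(data, 1, size, f) == size;
    std::fclose(f);   // do not keep every cell's file open
    return ok;
}
```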

How does rsync behave for concurrent file access?

I'm using rsync to back up my machine twice a day, and the ten to fifteen minutes it spends scanning my files for modifications, slowing everything down considerably, are starting to get on my nerves.
Now I'd like to use the inotify interface of my kernel (I'm running Linux) to write a small background app that collects notifications about modified files and adds their pathnames to a list which is then processed regularly by a call to rsync.
Now, because this process by definition always works on files I've just been - and might still be - working on, I'm wondering whether I'll get loads of corrupted / partially updated files in my backup as rsync copies the files while I'm writing to them.
I couldn't find anything in the manpage and have so far been unsuccessful in googling for an answer. I could go read the source, but that might take quite a while. Does anybody know how concurrent file access is handled inside rsync?
It's not handled at all: rsync opens the file, reads as much as it can and copies that over.
So it depends on how your applications handle this: do they rewrite the file in place (not creating a new one), or do they create a temp file and rename it once all data has been written (as they should)?
In the first case, there is little you can do: if two processes access the same data without any kind of synchronization, the result will be a mess. What you could do is defer the rsync for N minutes, assuming that the writing process will finish before then. Reschedule the file if it changes again within this time limit (see the sketch below).
In the second case, you must tell rsync to ignore temp files (e.g. with --exclude patterns for *.tmp, *~, etc.).
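As an illustration of the "defer for N minutes" idea, a small change queue for the background app might look like the sketch below. The settling period, the list-file path, and the rsync invocation (using its --files-from option) are placeholders.

```cpp
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <map>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

// Collects changed paths and releases them only after they have been quiet
// for a settling period; a path that changes again is simply re-deferred.
class ChangeQueue {
public:
    explicit ChangeQueue(std::chrono::minutes settle) : settle_(settle) {}

    // Call this for every inotify event.
    void note_change(const std::string& path) { last_change_[path] = Clock::now(); }

    // Returns the paths that have settled and removes them from the queue.
    std::vector<std::string> take_settled() {
        std::vector<std::string> ready;
        const auto now = Clock::now();
        for (auto it = last_change_.begin(); it != last_change_.end();) {
            if (now - it->second >= settle_) {
                ready.push_back(it->first);
                it = last_change_.erase(it);
            } else {
                ++it;
            }
        }
        return ready;
    }

private:
    std::chrono::minutes settle_;
    std::map<std::string, Clock::time_point> last_change_;
};

// Hand the settled paths to rsync via a list file (paths are placeholders).
void run_rsync(const std::vector<std::string>& paths) {
    if (paths.empty()) return;
    std::FILE* list = std::fopen("/tmp/rsync-files.txt", "w");
    if (!list) return;
    for (const auto& p : paths) std::fprintf(list, "%s\n", p.c_str());
    std::fclose(list);
    std::system("rsync -a --files-from=/tmp/rsync-files.txt / backup-host:/backup/");
}
```

The inotify reader calls note_change() for every event; a timer periodically calls take_settled() and passes the result to run_rsync(). Files that have just been saved again stay in the queue until a later round.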
It isn't handled in any way. If it is a problem, you can use e.g. LVM snapshots, and take the backup from the snapshot. That won't in itself guarantee that the files will be in a usable state, but it does guarantee that, as the name implies, it's a snapshot at a specific time.
Note that this doesn't have anything to do with whether you're letting rsync handle the change detection itself or if you use your own app. Your app, or rsync itself, just produces a list of files that have been changed, and then for each file, the rsync binary diff algorithm is run. The problem is if the file is changed while the rsync algorithm runs, not when producing the file list.