MPI to read all files in a directory - c++

Hi guys, I am learning to program in MPI and I came across this question.
Let's say that in the current working directory I have 10 files. Each file contains a column of numbers.
I want to divide the work among all processors. For example, if I use two nodes, I want node 1 to read the first 5 files and node 2 to read the rest.
Thank you for any help.

There are no metadata operations in MPI-IO aside from opening/creating a file or deleting it, so to enumerate a directory you have to fall back on ordinary (non-MPI) system or library calls. I suppose it was hard to standardize over Windows, Unix, and, I don't know... VAX-y styles back in the old days?
The nice thing about MPI is that it provides a good basis for libraries. Write an "MPI-IO metadata" library... and share it with us!
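Since MPI-IO cannot list a directory, one workaround is for every rank to list it with ordinary C++ and then pick its own share by rank. Below is a minimal sketch, assuming C++17 (std::filesystem) and a directory that all ranks can see. The helpers list_files and files_for_rank are hypothetical names, not part of MPI.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <filesystem>
#include <string>
#include <vector>

// Hypothetical helper: list the regular files in `dir`, sorted by name so
// that every rank sees them in the same order.
std::vector<std::string> list_files(const std::filesystem::path& dir) {
    std::vector<std::string> files;
    for (const auto& entry : std::filesystem::directory_iterator(dir)) {
        if (entry.is_regular_file()) {
            files.push_back(entry.path().string());
        }
    }
    std::sort(files.begin(), files.end());
    return files;
}

// Hypothetical helper: return the contiguous block of `files` owned by
// `rank` out of `size` ranks. When the count does not divide evenly, the
// lowest ranks get one extra file each.
std::vector<std::string> files_for_rank(const std::vector<std::string>& files,
                                        int rank, int size) {
    assert(size > 0 && rank >= 0 && rank < size);
    const std::size_t n = files.size();
    const std::size_t p = static_cast<std::size_t>(size);
    const std::size_t r = static_cast<std::size_t>(rank);
    const std::size_t base = n / p;
    const std::size_t extra = n % p;
    const std::size_t begin = r * base + std::min(r, extra);
    const std::size_t count = base + (r < extra ? 1 : 0);
    const auto first = files.begin() + static_cast<std::ptrdiff_t>(begin);
    return std::vector<std::string>(first, first + static_cast<std::ptrdiff_t>(count));
}
```

In an MPI program, rank and size would come from MPI_Comm_rank and MPI_Comm_size on MPI_COMM_WORLD, and each process would then call files_for_rank(list_files("."), rank, size). With 10 files and two processes, rank 0 gets the first 5 files and rank 1 the remaining 5.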

Related

NGS Analysis: How can I extract the different variants in two VCF files

I am currently working on my thesis and I am trying to analyze the results of Illumina NGS sequencing. I am not really familiar with bioinformatics, and in this part of my project I am trying to compare two VCF files corresponding to the results of healthy tissue and tumor tissue. I want to compare these VCF files and remove their similarities. More specifically, I want to remove the information of the healthy tissue from the tumor one. Do you have any suggestions on which tool I should use, or any way that I can do my analysis? If you can help me I would be more than thankful. Thank you in advance!
I understand your problem. The first thing I would recommend is a piece of Unix software (I don't know which OS you're running) called VCFtools. It's pretty simple to use. But if you want to do all the processing in, for example, Python, you can use the pandas library, which helps to process data in column format, or the PyVCF library, which is a parser for VCF files. I can help you more if you can provide some example data you're processing.

How to track the number of times my console application in C++14 has been launched?

I'm building a barebones Notepad-styled project (console-based, does not have a GUI as of now) and I'd like to track, display (and later use it in some ways) the number of times the console application has been launched. I don't know if this helps, but I'm building my console application on Windows 10, but I'd like it to run on Windows 7+ as well as on Linux distros such as Ubuntu and the like.
I prefer not storing the details in a file and then subsequently reading from it to maintain count. Please suggest a way or any other resource that details how to do this.
I'd put a strikethrough on my quote above, but SO doesn't have it apparently.
Note that this is my first time building such a project so I may not be familiar with advanced stuff... So, when you're answering please try to explain as is required for a not-so-experienced software developer.
Thanks & Have a great one!
Edit: It seems that the general advice is to use text files to protect portability and to account for the fact that if down-the-line, I need to store some extra info, the text file will come in super handy. In light of this, I'll focus my efforts on the text file.
Thanks to all for keeping my efforts from de-railing!
"I prefer not storing the details in a file"
In the comments, you wrote that the reason is security and you consider using a file as "over-kill" in this case.
Security can be solved easily - just encrypt the file. You can use a library like this to get it done.
In addition, since you are writing and reading to/from the file only once each time the application is opened/closed, and the file should take only small number of bytes to store such data, I think it's the right, portable solution.
If you still don't want to use a file, you can use the Windows registry to store the data, but this solution is not portable.
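The file-based approach can be sketched in a few lines of standard C++. This is only an illustration, assuming a plain (unencrypted) text file that holds a single number. The helper name bump_launch_count and the file name used below are hypothetical.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Hypothetical helper: read the launch count stored in `path`, add one,
// write it back, and return the new count. A missing or unreadable file
// is treated as a count of zero.
unsigned long bump_launch_count(const std::string& path) {
    assert(!path.empty());
    unsigned long count = 0;
    {
        std::ifstream in(path);
        if (!(in >> count)) {
            count = 0;  // first launch, or the file held no number
        }
    }  // the input stream is closed here, before the file is rewritten
    ++count;
    std::ofstream out(path, std::ios::trunc);
    out << count << '\n';
    return count;
}
```

Calling bump_launch_count("launch_count.txt") once at startup would return 1 on the first run, 2 on the second, and so on, on both Windows and Linux.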

Multiple processes reading different parts of a big binary file simultaneously

I have a large binary file saved on an NFS shared disk. In the cluster, I want multiple processes to read this big file simultaneously. Each process gets a file offset, opens the big file, seeks to the supplied offset, and reads some number of bytes.
How do I design this project? As far as I am concerned, it is similar to some concurrent databases. Is there any lightweight library or open-source project related to this? I use the C++ language.
Not sure if there is a point in using a library.
You could use basic stuff. Open and reposition yourself in the file and then perform the read:
http://www.cplusplus.com/reference/fstream/ifstream/open/
http://www.cplusplus.com/reference/istream/istream/seekg/
or
http://www.cplusplus.com/reference/cstdio/fopen/
http://www.cplusplus.com/reference/cstdio/fseek/
nicolae: I agree :-)
mining: so far you haven't said anything about a need for interaction between your readers.
Consider a simple scenario.
Let's say you have your C++ program called "dostuff" which takes the following arguments:
--name: something to label your output.
--offset: offset point; seek to here (defaults to zero).
--bytes: number of bytes to process.
inputfile: the file you want to read.
The following would run your two processes in the background.
$ dostuff --name "proc1" --offset=0 --bytes=100 \\myserver\myshare\bigfile.dat &
$ dostuff --name "proc2" --offset=100 --bytes=100 \\myserver\myshare\bigfile.dat &
You can open a file handle within each process.
So long as the data access is read-only, why do you want to make it more complex?
Important: I'm not saying it shouldn't be more complex, I'm suggesting you haven't yet shown a need for additional complexity. And that complexity is going to come from a need for your readers to collaborate. If they don't need to collaborate then you're pretty much done with your architecture - use the links Nicolae provided and good luck to you.
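The "open, seek, read" step that each independent reader performs could look roughly like this. It is a minimal sketch using std::ifstream. The helper name read_chunk is hypothetical, and its offset and bytes parameters correspond to the --offset and --bytes arguments above.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper: read up to `bytes` bytes from `path`, starting at
// `offset`. Returns fewer bytes if the file ends early, and an empty
// vector if the file cannot be opened or the seek fails.
std::vector<char> read_chunk(const std::string& path, std::streamoff offset,
                             std::size_t bytes) {
    assert(offset >= 0);
    std::ifstream in(path, std::ios::binary);
    if (!in) return {};
    in.seekg(offset, std::ios::beg);
    if (!in) return {};
    std::vector<char> buffer(bytes);
    in.read(buffer.data(), static_cast<std::streamsize>(bytes));
    // gcount() reports how many bytes the last read actually delivered.
    buffer.resize(static_cast<std::size_t>(in.gcount()));
    return buffer;
}
```

Each process opens its own stream, so the readers share no state and need no locking as long as nobody writes to the file.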

Out of Core Implementation of a Quadtree

I am trying to build a Quadtree data structure (or let's just say a tree) in secondary memory (hard disk).
I have a C++ program to do so and I use fopen to create the files. Also, I am using tesseral coding to store each cell in a file named with its corresponding code to store it on the disk in one directory.
The problem is that after creating about 1,100 files, fopen just returns NULL and stops creating new files. I can create further files manually in that directory, but using C++ it can not create any further files.
I know about max limit of inode on ext3 filesystem which is (from Wikipedia) 32,000 but mine is way less than that, also note that I can create files manually on the disk; just not through fopen.
Also, I would really appreciate any ideas regarding the best way to store a very dynamic quadtree on disk (I need the nodes to be in separate files, and the quadtree might have a depth of 50).
Using nested directories is one idea, but I think it will slow down the performance because of following the links on the filesystem to access the file.
Thanks,
Nima
What's the errno value of the failed fopen() call?
Do you keep the files you have created open? If yes you are most probably exceeding the maximum number of open files per process.
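A quick way to answer both questions is to report errno when fopen fails and to close each file as soon as it has been written. The sketch below assumes a platform where fopen sets errno on failure (POSIX systems do; the C standard itself does not require it). The helper name try_create is hypothetical.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper: try to create `path`. Returns an empty string on
// success, or a description of the failure taken from errno.
std::string try_create(const char* path) {
    assert(path != nullptr);
    errno = 0;
    std::FILE* f = std::fopen(path, "wb");
    if (f == nullptr) {
        const int err = errno;  // save it before another call overwrites it
        return std::string("fopen failed: ") + std::strerror(err);
    }
    std::fclose(f);  // close right away so the file handle is released
    return {};
}
```

On POSIX systems, running out of per-process file descriptors is reported as EMFILE ("Too many open files"), which would confirm the open-files theory.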
When you use directories as data structures, you delegate the work of maintaining that structure to the file system, which is not necessarily designed to do that.
Edit: Frank is probably right that you've exceeded the number of available file descriptors. You can increase those, but that shows that you're also using internals of your ABI as a data structure. Slow and (as resources are exhausted) unstable.
Either code for a very specific OS installation, or use a SQL database.
I have no idea why fopen wouldn't work. Look at errno.
However, storing everything in one directory is a bad idea. When you add a lot of files, it will get slow. Having a directory for every level of the tree will also be slow.
Instead, combine multiple levels into one directory. You could, for example, have one directory for every four levels of the tree. This would limit the number of directories, amount of nesting, and number of files per directory, giving very good performance.
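The grouping idea can be sketched as a function that maps a cell code to a relative path. This assumes the code is a string with one digit per tree level, which may not match the asker's actual tesseral encoding. The name path_for_code and the ".node" suffix are hypothetical; the suffix keeps a node file from colliding with a directory of the same name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: split `code` (one digit per tree level) into groups
// of `levels_per_dir` digits. Every group but the last becomes a directory
// name, and the last group becomes the file name with a ".node" suffix.
std::string path_for_code(const std::string& code,
                          std::size_t levels_per_dir = 4) {
    assert(levels_per_dir > 0 && !code.empty());
    std::string path;
    for (std::size_t i = 0; i < code.size(); i += levels_per_dir) {
        if (i > 0) path += '/';
        path += code.substr(i, levels_per_dir);
    }
    return path + ".node";
}
```

For example, path_for_code("0123012301") gives "0123/0123/01.node". With four levels per directory, a quadtree directory holds at most 4^4 = 256 subdirectories, and a depth of 50 needs 12 levels of nesting rather than 49.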
The limitation could come from:
stdio (C library): at most 256 handles. This can be increased to 1024 (in VC, call _setmaxstdio).
The OS kernel's limit on file handles per process (usually 1024).

Game Programming: .DAT file?

I've seen a lot of games use something similar to a .DAT file or a specific file type that the game has for itself. I'm just beginning with C++ and DirectX and I was interested in keeping my information in something similar to a .DAT.
My initial conception was that it would hold information on the files you wanted to store within the .DAT file. Something similar to a .RAR file. Unfortunately, my googling skills did not help me in finding the answers.
Right now I'm simply loading textures and sound files from a folder called Data.
EDIT: While I understand that .DAT is short for data, and I've found that a .DAT file generally contains any assortment of information, I'm still unsure how to go about packing images and sound files into a single file of any type and reading them back.
I'm not sure about using fstreams to achieve my task, however I will look into streams related to storing data and how to properly read from that data. Meanwhile if anyone has another answer to offer based on this new information, it would be appreciated.
EDIT: Thanks to the answers, I stumbled across a similar question on stackoverflow and felt I'd share it here. Combining resources into a single binary file
I don't think there is really such thing as .dat file format. It's short for "data," and different applications just put in some proprietary stuff in it and call it ".dat." You can read up on fstream classes to do file IO in C++. See Input/Output with files.
What you then do is make up your own file format. For example, the first 4 bytes are an int that indicates the number of blocks in the .dat, and for each block you have 4 bytes indicating the length of the block, 4 bytes indicating the type of the block, and then the variable-length data itself... something like that.
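The layout described above (block count, then length, type, and data for each block) could be written and read back with fstream roughly as follows. This is only a sketch: the names Block, write_pack, and read_pack are hypothetical, the integers are stored in the host's byte order, and there is no validation of corrupted files.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

struct Block {
    std::uint32_t type;      // application-defined tag, e.g. texture or sound
    std::vector<char> data;  // raw payload bytes
};

// Write [count] followed by [length][type][data] for each block.
bool write_pack(const std::string& path, const std::vector<Block>& blocks) {
    assert(!path.empty());
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    const std::uint32_t count = static_cast<std::uint32_t>(blocks.size());
    out.write(reinterpret_cast<const char*>(&count), sizeof count);
    for (const Block& b : blocks) {
        const std::uint32_t length = static_cast<std::uint32_t>(b.data.size());
        out.write(reinterpret_cast<const char*>(&length), sizeof length);
        out.write(reinterpret_cast<const char*>(&b.type), sizeof b.type);
        out.write(b.data.data(), length);
    }
    return static_cast<bool>(out);
}

// Read the blocks back; stops early if the file is missing or truncated.
std::vector<Block> read_pack(const std::string& path) {
    std::vector<Block> blocks;
    std::ifstream in(path, std::ios::binary);
    std::uint32_t count = 0;
    if (!in.read(reinterpret_cast<char*>(&count), sizeof count)) return blocks;
    for (std::uint32_t i = 0; i < count; ++i) {
        std::uint32_t length = 0;
        Block b{};
        if (!in.read(reinterpret_cast<char*>(&length), sizeof length)) break;
        if (!in.read(reinterpret_cast<char*>(&b.type), sizeof b.type)) break;
        b.data.resize(length);
        if (!in.read(b.data.data(), length)) break;
        blocks.push_back(std::move(b));
    }
    return blocks;
}
```

A real format would usually add a magic number and version at the start, and a fixed byte order, so that files remain readable across platforms and releases.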
DAT obviously stands for data, and there is no real or de facto standard on what that extension actually refers to. Your decisions on the best file formats should be based on technical considerations, not pointless attempts at security through obscurity.
Professional games use a technique where they put all the needed resources (models, textures, sounds, ai, config, etc) zipped/packed into a single file thus making it faster to manage, harder to change (some even make use of a virtual filing system from what's inside the data file). Now, for what's inside the file is different depending on the needs of the game and the data structures that you use.
If you're just starting into gamedev, I recommend you stick with keeping all your assets separate and don't bother too much about packing them into a single file.
Now if you really want to start using a packed format here's a good pointer:
Creating a PAK File Format
Here's a link which claims that .dat is a movie format, 'DAT' being short for Digital Audio Tape.
I'm not sure I believe the link, but I do remember something about a Microsoft supported format called DAT, from long ago, when I used an earlier version of Windows.
It makes more sense as a logical extension for a DATA file of some kind.
.dat, as others have said, is literally just a data file. In reality, the file extension means nothing other than association with a program. For example, I could make a word processor that saves all the documents with the .mp3 file extension. These files wouldn't be playable in any media software, but the software might try. File extensions are used to help programs know what types of files they can and cannot open--however those rules don't have to be followed.
Anyway, you can dump any sort of information to a file. Programmers/software writers will often choose .dat as the extension of that file because it has become the standard to signify 'this file just holds a ton of data' and that the data doesn't necessarily hold any standardized headers, footers, or formatting.
A dat file could really contain anything. It might be as simple as a zip archive with the extension changed, or it could be a completely custom file type. If you're just starting out, you probably don't want to write your own file format, although doing so can be fun and educational. If you want to encapsulate your data files into some kind of container, you should probably go with a zip, paq, or maybe tar.gz.