C++ Write Data to Random HDD Sector

I need to write a C++ program that can write data to and read data from both random and sequential hard disk sectors.
However, I am confused about the term "sector" and its relation to a file.
What I want to know is, if I simply:
1. Create a string containing the words "Hello, world" and then
2. Save the string into "myfile.txt",
is the data written to sequential or random sectors? If it is sequential (my guess), then how can I write the string to a random hard disk sector and read it back again? And also vice versa.

What you are trying to do is pretty much impossible today because of file systems. If you want a file (which you seem to), you need a file system. A file system places the data, in whatever format it wants, on the sectors it thinks are best. Advanced file systems such as btrfs and ZFS also do compression, checksumming and spreading data across multiple hard disks. So you can't just write to a sector: you would likely destroy existing data, and you couldn't read it back anyway because your file system doesn't understand your data format. It wouldn't even know the data is there, because a file must be registered in the MFT/btrfs metadata/... tables.
TL;DR: Don't try it; it will mess up your system and it won't work.
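For contrast, the supported path is simply to hand the string to the file system and let it pick the sectors. A minimal sketch (the filename is the one from the question):

    #include <fstream>
    #include <string>

    int main() {
        // The file system, not the program, decides which sectors end up holding the data.
        std::string msg = "Hello, world";
        std::ofstream out("myfile.txt", std::ios::binary);
        out << msg;                      // buffered write; flushed and closed by the destructor
        return out.good() ? 0 : 1;
    }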

Related

Why isn't lossless compression automatic on computers?

I was just wondering what the impact would be if, say, Microsoft decided to automatically apply lossless compression to every single file saved on a computer.
What are the pros? The cons? Is it feasible?
Speed.
When compressing a file of any kind you're encoding its contents in a more compact form, often using dictionaries and/or prefix codes (an example: Huffman coding). To access the data you have to decompress it, and this translates into time and memory used, because to access a specific piece of the file you have to decompress it as a whole. While decompressing you have to save the results somewhere, and the most appropriate place is RAM.
Of course this wouldn't be a great problem (decompressing the whole file) if all of it needed to be read, and not even in the case of a stream reading it; but if a program wanted to write to the compressed file, all of the data (or at least part of it) would have to be compressed again.
As you can see, compressing files in the file system would greatly reduce the bandwidth available to applications - to read a single byte you have to read a chunk of the file and decompress it - and also require more RAM.
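To make the overhead concrete, here is a minimal sketch of that round trip using zlib's one-shot compress()/uncompress() calls (this assumes zlib is installed and the program is linked with -lz): even to look at a single byte of the payload, the whole chunk has to be inflated into RAM first.

    #include <zlib.h>
    #include <iostream>
    #include <vector>

    int main() {
        const char text[] = "Hello, world. Hello, world. Hello, world.";
        uLong srcLen = sizeof(text);

        // Compress the whole buffer in one shot.
        uLongf compLen = compressBound(srcLen);
        std::vector<Bytef> comp(compLen);
        compress(comp.data(), &compLen, reinterpret_cast<const Bytef*>(text), srcLen);

        // To read back *any* byte, the whole chunk must be decompressed first.
        uLongf destLen = srcLen;
        std::vector<Bytef> dest(destLen);
        uncompress(dest.data(), &destLen, comp.data(), compLen);

        std::cout << "original " << srcLen << " bytes, compressed " << compLen
                  << " bytes, first byte after decompression: " << dest[0] << '\n';
    }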

Creating separate regions of a mapped file to write and read simultaneously in C++

I'm trying to create a file on an SSD and map it for a one-writer-many-readers application, and to make it safe I would like to divide the file into sectors/regions. While the writer is writing data to its exclusively owned region, the readers can create views and reach the data from the part that is flagged, let's say, "available".
As soon as the writer has written the new data, the "available" sector/region should include the newly added data as well for the reader applications.
The file size is known from the beginning, so there is no need for any kind of dynamic memory operation, but it will be big - over 55 GB - which means it is too big to map the whole file at once. The data writing is pretty fast, about 15 packages per second, each package 1.5 MB.
The OS will be Win 7 x64.
Is it possible to implement this kind of mechanism? I couldn't find any example, tutorial or document on the internet; they just give simple examples like creating a file, creating a view, etc. I need to know how to implement the pointer operations to divide the file into sectors and switch the parts between "reserved" (for writing) and "available" (for reading). (I'm not very familiar with the C++ language.)
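I'm not certain this is exactly what you need, but a rough sketch of mapping just one region of the big file with the Win32 API could look like the following. The file path, mapping name, region size and region index are made-up example values, and error handling is omitted; the one hard constraint is that a view's offset must be a multiple of the system allocation granularity (usually 64 KB).

    #include <windows.h>
    #include <cstdint>

    int main() {
        const std::uint64_t fileSize   = 55ull * 1024 * 1024 * 1024;  // size known from the beginning
        const std::uint64_t regionSize = 64ull * 1024 * 1024;         // example region size: 64 MB
        const std::uint64_t regionIdx  = 3;                           // which region the writer owns
        const std::uint64_t offset     = regionIdx * regionSize;      // must be a multiple of the
                                                                      // allocation granularity (64 KB)

        HANDLE file = CreateFileW(L"C:\\data\\big.bin",               // example path
                                  GENERIC_READ | GENERIC_WRITE, FILE_SHARE_READ,
                                  nullptr, OPEN_ALWAYS, FILE_ATTRIBUTE_NORMAL, nullptr);

        // Create a named mapping sized to the whole file so reader processes can open it too.
        HANDLE mapping = CreateFileMappingW(file, nullptr, PAGE_READWRITE,
                                            static_cast<DWORD>(fileSize >> 32),
                                            static_cast<DWORD>(fileSize & 0xFFFFFFFF),
                                            L"Local\\BigFileMapping");             // example name

        // Map only the region being written, not the whole 55 GB file.
        void* view = MapViewOfFile(mapping, FILE_MAP_WRITE,
                                   static_cast<DWORD>(offset >> 32),
                                   static_cast<DWORD>(offset & 0xFFFFFFFF),
                                   static_cast<SIZE_T>(regionSize));

        // ... write the 1.5 MB packets into `view` here, then flag the region "available" ...

        UnmapViewOfFile(view);
        CloseHandle(mapping);
        CloseHandle(file);
    }

Reader processes would open the same named mapping (e.g. with OpenFileMappingW) and map whichever regions are currently flagged "available"; how that flag is shared (a small header region, an event, etc.) is up to you.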

Application that Recovers Disk Contents in RAW form and Dumps as One Big Binary File?

I have a bunch of work that was lost on a hard drive that I accidentally formatted at one time. I realize there are many data recovery tools out there, but all of them seem to scan the drive for actual partition data in order to recover files. I don't need this. I simply need a program to access each sector in a RAW fashion and dump the contents, byte by byte, to some file. The reason is that most of my work consists of ASCII files, so I don't care if the contents are part of a recognizable and complete file. I'm OK with parsing through the RAW data and trying to recover whatever text I can.
So does something like this exist, or am I resigned to coding my own program (something I'm more than OK with)? It would be a rather simple program that literally scans the entire disk (sector by sector) and dumps the contents (byte by byte) to a file (or likely several smaller files to make it more manageable) on another disk. What's the suggested way to make RAW disk reads from C/C++?
Could you boot from a Linux Live CD, identify the device name of the hard disk (e.g. with fdisk, which will also tell you the partition size), and then use the dd command, e.g. dd if=/dev/sda1 of=/mnt/path_to_external_drive, to extract the raw contents of the disk?
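If you would rather do it from C/C++, a minimal POSIX sketch of the same idea (essentially a hand-rolled dd) could look like this. The device and output paths are example values; you would need root privileges, and running it from a live system against the unmounted disk is safest.

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <vector>

    int main() {
        // Device and output paths are example values; adjust them for your system.
        const char* device = "/dev/sda";
        const char* output = "/mnt/external/disk.img";

        int in  = open(device, O_RDONLY);
        int out = open(output, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (in < 0 || out < 0) { perror("open"); return 1; }

        // Read the disk sector by sector (in multi-sector chunks for speed)
        // and dump every byte to the image file.
        std::vector<char> buf(1 << 20);   // 1 MB chunks
        ssize_t n;
        while ((n = read(in, buf.data(), buf.size())) > 0) {
            if (write(out, buf.data(), n) != n) { perror("write"); return 1; }
        }

        close(in);
        close(out);
    }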

Truncating the file in c++

I was writing a program in C++ and wondered if anyone can help me with the situation explained below.
Suppose I have a log file of about 30 MB in size, and I have copied the last 2 MB of the file to a buffer within the program.
I delete the file (or clear its contents) and then write my 2 MB back to the file.
Everything works fine up to here. But the concern is that I read the file (the last 2 MB), clear the file (the whole 30 MB) and then write back the last 2 MB.
Too much time will be needed in a scenario where I am copying the last 300 MB from a 1 GB file.
Does anyone have an idea for making this process simpler?
When dealing with a large log file, the following points should be considered.
Disk space: Log files are uncompressed plain text and consume large amounts of space. Typical compression reduces the file size by 10:1. However, a file cannot be compressed while it is in use (locked), so a log file must be rotated out of use.
System resources: Opening and closing a file regularly consumes a lot of system resources and reduces the performance of the server.
File size: Small files are easier to back up and restore in case of a failure.
I just do not want to copy, clear and rewrite the last specific lines to a file. I just want a simpler process.... :-)
EDIT: I'm not writing any in-house process to support log rotation.
logrotate is the tool.
I would suggest a slightly different approach:
Create a new temporary file
Copy the required data from the original file to the temporary file
Close both files
Delete the original file
Rename the temp file to the same name as the original file
To improve the performance of the copy, copy the data in chunks; you can play around with the chunk size to find the optimal value.
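A rough C++17 sketch of those steps, copying the tail in chunks and then doing the delete-and-rename with std::filesystem; the file name and the 2 MB tail size are just placeholders:

    #include <cstdint>
    #include <filesystem>
    #include <fstream>
    #include <vector>

    namespace fs = std::filesystem;

    void keep_tail(const fs::path& logFile, std::uintmax_t tailBytes) {
        const fs::path tmpFile = logFile.string() + ".tmp";
        const std::uintmax_t fileSize = fs::file_size(logFile);
        const std::uintmax_t start = fileSize > tailBytes ? fileSize - tailBytes : 0;

        {
            std::ifstream in(logFile, std::ios::binary);
            std::ofstream out(tmpFile, std::ios::binary);
            in.seekg(static_cast<std::streamoff>(start));

            // Copy the tail in chunks; tune the chunk size for your disk.
            std::vector<char> buf(1 << 20);   // 1 MB chunks
            while (in) {
                in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
                out.write(buf.data(), in.gcount());
            }
        }   // both files are closed here

        fs::remove(logFile);              // delete the original file
        fs::rename(tmpFile, logFile);     // rename the temp file to the original name
    }

    int main() {
        keep_tail("app.log", 2 * 1024 * 1024);   // keep only the last 2 MB (placeholder values)
    }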
If this is your file before:
-----------------++++
Where - is what you don't want and + is what you do want, the most portable way of getting:
++++
...is just as you said. Read in the section you want (+), delete/clear the file (as with fopen(..., "wb") or something similar) and write out the bit you want (+).
Anything more complicated requires OS-specific help and isn't portable. Unfortunately, I don't believe any major OS out there has support for what you want. There might be support for "truncate after position X" (a sort of head), but not the tail-like operation you're requesting.
Such an operation would be difficult to implement, as varying block sizes on file systems (if the file system has a block size) would cause trouble. At best, you'd be limited to cutting on block-size boundaries, and even that would be hairy. This is such a rare case that it is probably why such a procedure is not directly supported.
A better approach might be not to let the file grow that big but rather use rotating log files with a set maximum size per log file and a maximum number of old files being kept.
If you can control the writing process, what you probably want to do here is to write to the file like a circular buffer. That way you can keep the last X bytes of data without having to do what you're suggesting at all.
Even if you can't control the writing process, if you can at least control what file it writes to, then maybe you could get it to write to a named pipe. You could attach your own program at the end of this named pipe that writes to a circular buffer as discussed.
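A very rough sketch of the circular-buffer idea, assuming you control the writer: the file gets a fixed maximum size up front and the write offset simply wraps around, so old data is overwritten in place. A real implementation would also persist the write offset (for example in a small header) so a reader knows where the newest data ends; that part is left out here.

    #include <algorithm>
    #include <cstddef>
    #include <fstream>
    #include <string>

    class CircularLog {
    public:
        CircularLog(const std::string& path, std::size_t maxBytes)
            : file_(path, std::ios::binary | std::ios::in | std::ios::out | std::ios::trunc),
              maxBytes_(maxBytes), pos_(0) {}

        // Appends data, wrapping around to the start of the file when the limit is hit.
        // Assumes a single append is never larger than maxBytes_.
        void append(const std::string& data) {
            std::size_t first = std::min(data.size(), maxBytes_ - pos_);
            file_.seekp(static_cast<std::streamoff>(pos_));
            file_.write(data.data(), static_cast<std::streamsize>(first));
            if (first < data.size()) {            // wrap: the rest goes to the front
                file_.seekp(0);
                file_.write(data.data() + first,
                            static_cast<std::streamsize>(data.size() - first));
            }
            pos_ = (pos_ + data.size()) % maxBytes_;
            file_.flush();
        }

    private:
        std::fstream file_;
        std::size_t  maxBytes_;
        std::size_t  pos_;
    };

    int main() {
        CircularLog log("ring.log", 2 * 1024 * 1024);   // keep at most the last 2 MB
        log.append("some log line\n");
    }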

How do you process a large data file with size such as 10G?

I found this open question online: how do you process a large data file with a size such as 10 GB?
This should be an interview question. Is there a systematic way to answer this type of question?
If you're interested, you should check out Hadoop and MapReduce, which were created with big (BIG) datasets in mind.
Otherwise, chunking or streaming the data is a good way to reduce the amount held in memory.
I have used stream-based processing in such cases. An example was when I had to download a quite large (in my case ~600 MB) CSV file from an FTP server, extract the records found and put them into a database. I combined three streams reading from each other:
A database inserter, which read a stream of records from
a record factory, which read a stream of text from
an FTP reader class, which downloaded the stream from the server.
That way I never had to store the entire file locally, so it should work with arbitrarily large files.
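In C++ the same idea is just chaining the steps over a std::istream so that only one record is in memory at a time. Here is a minimal sketch that reads CSV records line by line and hands each one to a placeholder insert step; the file name and the insert function are made up for illustration:

    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>

    // Placeholder for the real "database inserter" step.
    void insert_record(const std::vector<std::string>& fields) {
        std::cout << "inserting record with " << fields.size() << " fields\n";
    }

    int main() {
        std::ifstream in("big.csv");          // could be any std::istream, e.g. a download stream
        std::string line;
        while (std::getline(in, line)) {      // only one record in memory at a time
            std::vector<std::string> fields;
            std::istringstream record(line);
            std::string field;
            while (std::getline(record, field, ',')) fields.push_back(field);
            insert_record(fields);            // hand the parsed record to the next stage
        }
    }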
It would depend on the file and how the data in the file may be related. If you're talking about something where you have a bunch of independent records that you need to process and output to a database or another file, it would be beneficial to multi-thread the process. Have one thread that reads in the records and then passes them off to one of many threads that do the time-consuming work of processing the data and producing the appropriate output.
In addition to what Bill Carey said, not only does the type of file determine "meaningful chunks", it also determines what "processing" would mean.
In other words, what you do to process the data and how you determine what to process will vary tremendously.
What separates a "large" data file from a small one is, broadly speaking, whether you can fit the whole file into memory or whether you have to load portions of the file from disk one at a time.
If the file is so large that you can't load the whole thing into memory, you can process it by identifying meaningful chunks of the file, then reading and processing them serially. How you define "meaningful chunks" will depend very much on the type of file (e.g. binary image files will require different processing from massive XML documents).
Look for opportunities to split the file up so that it can be tackled by multiple processes. You don't say whether the records in the file are related, which makes the problem harder, but the solution is in principle the same: identify mutually exclusive partitions of the data that you can process in parallel.
A while back I needed to process hundreds of millions of test data records for some performance testing I was doing on a massively parallel machine. I used some Perl to split the input file into 32 parts (to match the number of CPUs) and then spawned 32 processes, each transforming the records in one file.
Because this job ran over the 32 processors in parallel, it took minutes rather than the hours it would have taken serially. I was lucky, though, in having no dependencies between any of the records in the file.
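A rough C++ sketch of the same split-and-spawn approach, assuming fixed-size, independent records so that each thread can work on its own byte range of the file; the record size and file name are made-up example values:

    #include <algorithm>
    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <thread>
    #include <vector>

    constexpr std::size_t kRecordSize = 128;   // assumed fixed record size

    // Each worker processes one contiguous slice of records from its own stream.
    void process_range(const std::string& path, std::uint64_t firstRecord, std::uint64_t count) {
        std::ifstream in(path, std::ios::binary);
        in.seekg(static_cast<std::streamoff>(firstRecord * kRecordSize));
        std::vector<char> record(kRecordSize);
        for (std::uint64_t i = 0; i < count && in.read(record.data(), kRecordSize); ++i) {
            // ... transform the record here ...
        }
    }

    int main() {
        const std::string path = "records.bin";
        std::ifstream probe(path, std::ios::binary | std::ios::ate);
        const std::uint64_t totalRecords = static_cast<std::uint64_t>(probe.tellg()) / kRecordSize;
        probe.close();

        const unsigned workers = std::max(1u, std::thread::hardware_concurrency());
        const std::uint64_t perWorker = (totalRecords + workers - 1) / workers;

        std::vector<std::thread> pool;
        for (unsigned w = 0; w < workers; ++w) {
            const std::uint64_t first = w * perWorker;
            if (first >= totalRecords) break;
            const std::uint64_t count = std::min<std::uint64_t>(perWorker, totalRecords - first);
            pool.emplace_back(process_range, path, first, count);
        }
        for (auto& t : pool) t.join();   // runs in parallel when records are independent
    }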