Planning for a file indexing program

I'm somewhat new to C++, but not programming in general. I want to write my first practice program in C++ as a file indexing program.
It seems easy enough to scan directories for names, store that information, and filter it depending on what I want to view.
What I'm concerned about is at some point, I want to index a whole drive (I have an extra 1TB drive apart from my OS to store files on). I have about 400,000-500,000 files on there and I was wondering what would be the best way to store this information? I highly doubt keeping all those records in a text file is optimal and would like to think it's naive.
Is there anything else I should be concerned about?
Thanks.

Isn't some kind of database the obvious answer?
If you don't want to hook up to a server, you can try something like SQLite. Alternatively, if you only need to do basic lookups, you could also create your own proprietary file format. You can utilize any combination of binary and textual data in your file. It's hard to suggest possible layouts without knowing what data you need to store and how you'll be accessing it.

You can safely persist your data to a text file. However, you'd need to read the file into memory at startup and do all the complex operations in memory. Even with a naive approach that stores the full path with every file, you're looking at roughly 100 bytes per file, or about 50 MB for 500,000 files. A smarter approach stores just the filename and a pointer to the directory name.
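One way to picture the "filename plus pointer to the directory" layout is the sketch below. It is only an illustration under assumptions: the names FileIndex, FileEntry, add and path_of are made up for this example, and the "pointer" is an index into a table of directory strings rather than a raw pointer, so it would survive being written to disk.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// One record per file: the bare file name plus an index into the
// directory table, instead of a full path stored with every file.
struct FileEntry {
    std::string name;
    std::uint32_t dir_id;
};

// Hypothetical in-memory index; not tied to any particular library.
class FileIndex {
public:
    // Splits the path at its last separator and stores the directory
    // part (including the trailing separator) only once.
    void add(const std::string& full_path) {
        const std::size_t slash = full_path.find_last_of("/\\");
        const std::size_t cut = (slash == std::string::npos) ? 0 : slash + 1;
        const std::string dir = full_path.substr(0, cut);
        const std::string name = full_path.substr(cut);
        files_.push_back(FileEntry{name, intern(dir)});
    }

    // Rebuilds the full path of the i-th file on demand.
    std::string path_of(std::size_t i) const {
        const FileEntry& e = files_.at(i);
        return dirs_.at(e.dir_id) + e.name;
    }

    std::size_t file_count() const { return files_.size(); }
    std::size_t dir_count() const { return dirs_.size(); }

private:
    // Returns the id of an already-seen directory, or registers a new one.
    std::uint32_t intern(const std::string& dir) {
        const auto it = dir_ids_.find(dir);
        if (it != dir_ids_.end()) return it->second;
        const std::uint32_t id = static_cast<std::uint32_t>(dirs_.size());
        dirs_.push_back(dir);
        dir_ids_.emplace(dir, id);
        return id;
    }

    std::vector<std::string> dirs_;
    std::unordered_map<std::string, std::uint32_t> dir_ids_;
    std::vector<FileEntry> files_;
};
```

With many files per directory, each directory string is stored once and every file costs only its own name plus a 4-byte id.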

Related

Maintaining state of a program

This might be a simple question for most people out there, but I'm stuck on it.
I was wondering: when most bank software, or let's say any commercial software, is closed at the end of the day and then reopened the next, how do those programs remember everything from the previous day? I hope I've made myself clear. Thanks in advance for your guidance.
Best.

This is not black magic.
The answer is by saving its data. You do this by putting it in a database, or writing data files.
The trick is to write your programs in a way that makes it easy to guarantee that you've restored the state you thought you saved.
A common approach is to use serialization. This means that you are able to take your giant data structure and recursively call a 'Save' function on it and its contained objects. This is very intuitive if you are taking advantage of object inheritance and polymorphism. Of course, you also write a 'Load' function to do the reverse.
You write your data in such a way that it can be read back in. For example, if you wanted to write a string, you might first write its length and then its characters. That way, when you read it you know how many bytes to allocate.
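The length-then-characters idea for strings could look roughly like this. It is a minimal sketch, not a full serialization framework: save_string, load_string and round_trip_demo are names invented here, the length is assumed to fit in 32 bits, and it is written in the machine's native byte order, so the file would not be portable between machines with different endianness.

```cpp
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Writes a 32-bit length followed by the raw characters.
void save_string(std::ostream& out, const std::string& s) {
    const std::uint32_t len = static_cast<std::uint32_t>(s.size());
    out.write(reinterpret_cast<const char*>(&len), sizeof len);
    out.write(s.data(), static_cast<std::streamsize>(len));
}

// Reads the length first, so the exact number of bytes to allocate
// is known before the characters are read.
std::string load_string(std::istream& in) {
    std::uint32_t len = 0;
    in.read(reinterpret_cast<char*>(&len), sizeof len);
    if (!in) throw std::runtime_error("truncated length");
    std::string s(len, '\0');
    if (len > 0) in.read(&s[0], static_cast<std::streamsize>(len));
    if (!in) throw std::runtime_error("truncated payload");
    return s;
}

// Usage sketch: serialize two strings into memory and read them back.
bool round_trip_demo() {
    std::stringstream buf(std::ios::in | std::ios::out | std::ios::binary);
    save_string(buf, "hello");
    save_string(buf, "");
    const std::string first = load_string(buf);
    const std::string second = load_string(buf);
    return first == "hello" && second.empty();
}
```

A 'Save' function for a larger object would call save_string (and similar helpers for numbers) on each member in a fixed order, and 'Load' would read them back in the same order.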
The above approach is pretty standard if you are writing binary file formats. In fact, it's the philosophy behind chunk-based formats such as AVI.
For text-based, you might choose to serialize your data in popular formats like XML or JSON. But you are only restricted by your imagination.

Is there any way to prevent access to text files other than encryption?

Basically, I am implementing a dictionary and I don't want any user to access my dictionary .txt file. I need to access the file from my program. The .txt files are so large that decrypting them takes a lot of time. What other methods are there?

You have set yourself a nearly impossible task.
By some definitions of "read", any user of your program can "read" the file, just by asking the program to make a lookup.
If you want to prevent read access to the file by means other than your program, your only effective option is to deny physical access to the file (by hosting it on a server and putting an API in front of it that only allows queries). Any knowledgeable and determined user with access to both the program and the data file will be able to decipher it.
Encrypting the data file can make this much harder for some people, but only if the key is not built into the program, but users have to enter it manually. Then only those with the key have a reasonable chance of deciphering the data file.

Bart's answer is completely valid. But the problem could have a solution.
Depending what the dictionary actually does, you might be able to use a Bloom filter or other probabilistic hash data structure to avoid storing its contents in any direct fashion whatsoever. But you wouldn't be able to access the words in that case, only to check whether a word is there or not.
Such a dictionary would still be vulnerable to attack by querying the program for all possible words. That could be feasible or not; it's unclear what you're really doing.
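For illustration, a very small Bloom filter might look like the following. This is a sketch under assumptions, not a tuned implementation: the class name BloomFilter, the choice of FNV-1a as the hash, and the double-hashing scheme for deriving several bit positions are all choices made for this example. The bit count must be non-zero.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Minimal Bloom filter: stores only bits, never the words themselves.
class BloomFilter {
public:
    // bit_count must be greater than zero.
    BloomFilter(std::size_t bit_count, unsigned hash_count)
        : bits_(bit_count, false), hash_count_(hash_count) {}

    void insert(const std::string& word) {
        for (unsigned i = 0; i < hash_count_; ++i) bits_[position(word, i)] = true;
    }

    // false: the word was definitely never inserted.
    // true: the word was probably inserted (false positives are possible).
    bool maybe_contains(const std::string& word) const {
        for (unsigned i = 0; i < hash_count_; ++i)
            if (!bits_[position(word, i)]) return false;
        return true;
    }

private:
    // 64-bit FNV-1a, started from a caller-chosen value.
    static std::uint64_t fnv1a(const std::string& s, std::uint64_t h) {
        for (unsigned char c : s) {
            h ^= c;
            h *= 1099511628211ULL;
        }
        return h;
    }

    // Double hashing: the i-th probe position is h1 + i * h2.
    std::size_t position(const std::string& word, unsigned i) const {
        const std::uint64_t h1 = fnv1a(word, 14695981039346656037ULL);
        const std::uint64_t h2 = fnv1a(word, 0x9E3779B97F4A7C15ULL) | 1u;
        return static_cast<std::size_t>((h1 + i * h2) % bits_.size());
    }

    std::vector<bool> bits_;
    unsigned hash_count_;
};
```

Only the bit array would be shipped with the program. As the answer says, this supports membership checks only, and anyone can still probe it by querying candidate words.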

If you REALLY take security seriously, then you should use proper encryption with a long key. [But see comments about "where do you store the key" - you can't do this safely].
However, if you simply want the file to be a little difficult to read in a text editor, you could use some form of simple compression [with the added benefit of it taking less space]. Byte pair encoding is very simple to implement. Huffman coding is a little more effort. Obviously, using an existing encoding technique such as gzip, bzip2, xz or some such would be somewhat less difficult for someone to break, so it won't have the same protective properties.
If the purpose is to make it hard for REGULAR users (e.g. people who are not well versed in cracking codes and breaking into software), then storing the data with some simple "bit-twiddling" would make the content pretty much unreadable for ordinary users. For example, swap the top and bottom 4 bits of each character: unsigned char ch = (unsigned char)((orig_ch >> 4) | ((orig_ch & 0x0F) << 4)); where orig_ch is an unsigned char (with a signed char, the right shift can drag the sign bit into the result).
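The nibble swap can be wrapped up as below. A minimal sketch: swap_nibbles and scramble are names made up for this example, and this is obfuscation only, not encryption.

```cpp
#include <string>

// Swaps the high and low 4 bits of one byte.
// Applying it twice restores the original byte.
unsigned char swap_nibbles(unsigned char b) {
    return static_cast<unsigned char>((b >> 4) | ((b & 0x0F) << 4));
}

// Obfuscates a whole buffer in place; calling it again restores the buffer.
void scramble(std::string& data) {
    for (char& c : data)
        c = static_cast<char>(swap_nibbles(static_cast<unsigned char>(c)));
}
```

Because the operation is its own inverse, the same scramble call can be used when writing the file and when reading it back.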

As many people have already observed, it's not possible to secure your data from users 100% of the time. Any sufficiently determined user will get to it, given enough time.
However, there is a "simple" way to make sure that the user doesn't "mess" with the data in your dictionary file: you could use binary file I/O:
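// your_data must be a trivially copyable type (no pointers, std::string or containers)
std::ofstream out("dictionary.dat", std::ios::binary);
out.write(reinterpret_cast<const char*>(&your_data), sizeof(your_data));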
And for reading, you do something similar:
std::ifstream in("dictionary.dat", std::ios::binary);
in.read(reinterpret_cast<char*>(&your_data), sizeof(your_data));
This isn't "encryption" in the true sense of the word; it's just an alternative representation that makes it harder for people to "mess" with your data.
Also, rather than worrying about whether users will mess with your data files, it would make more sense to implement a way of detecting that the file has been tampered with (e.g. checksums, file hashes, etc.) and refusing to load it.
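A rough sketch of the checksum idea follows. The function names file_checksum and is_untampered are invented for this example, and FNV-1a is used only because it is short: it is not a cryptographic hash, so it catches accidental corruption and casual edits, not a deliberate attacker who can recompute it.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Computes a 64-bit FNV-1a hash over all bytes of a file.
// Returns false if the file cannot be opened.
bool file_checksum(const std::string& path, std::uint64_t& out_hash) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    std::uint64_t h = 14695981039346656037ULL;  // FNV offset basis
    char c;
    while (in.get(c)) {
        h ^= static_cast<unsigned char>(c);
        h *= 1099511628211ULL;  // FNV prime
    }
    out_hash = h;
    return true;
}

// The caller should refuse to load the file when this returns false:
// either the file is missing or its contents no longer match.
bool is_untampered(const std::string& path, std::uint64_t expected) {
    std::uint64_t actual = 0;
    return file_checksum(path, actual) && actual == expected;
}
```

The expected value would be computed once when the data file is produced and stored somewhere the program can read it, for example compiled into the program.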

Writing raw data "signature" to disk without corrupting filesystem

I am trying to create a program that will write a series of 10-30 letters/numbers to a disk in raw format (not to a file that the OS will read). Perhaps to make my attempt clearer, if you were to open the disk in a hex editor, you would see the 10-30 letters/numbers but a file manager such as Windows Explorer would not see it (because the data is not a file).
My goal is to be able to "sign" a disk with a series of characters and to be able to read and write that "signature" in my program. I understand NTFS signs its partitions with an NTFS flag, as do other file systems, and I have to be careful not to write my signature to any of those critical parts.
Are there any libraries in C++/C that could help me write at a low level to a disk and how will I know a safe sector to start writing my signature to? To narrow this down, it only needs to be able to write to NTFS, FAT, FAT32, FAT16 and exFAT file systems and run on Windows. Any links or references are greatly appreciated!
Edit: After some research, USB drives allow only 1 partition without applying hacking tricks that would unfold further problems for the user. This rules out the "partition idea" unfortunately.

First, as the commenters said, you should look at why you're trying to do this, and see if it's really a good idea. Most apps which try to circumvent the normal UI the user has for using his/her computer are "bad", in various ways.
That said, you could try finding a well-known file which will always be on the system and has some slack in the block size for the disk, and write to the slack. I don't think most filesystems would care about extra data in the slack, and it would probably even be copied if the file happens to be relocated (more efficient to copy the whole block at the disk level).
Just a thought; dunno how feasible it would be, but seems like it could work in theory.

Though I think this is generally a pretty poor idea, the obvious way to do it would be to mark a cluster as "bad", then use it for your own purposes.
Problems with that:
Marking it as bad is non-trivial: on NTFS, bad clusters are stored in a file named something like $BadClus, but it's not accessible to user code (and I'm not even sure it's accessible to a device driver either).
There are various programs to scan for (and attempt to repair) bad clusters/sectors. Since we don't even believe this one is really bad, almost any of these that works at all will find that it's good and put it back into use.
Most of the reasons people think of doing things like this (like tying a particular software installation to a particular computer) are pretty silly anyway.
You'd have to scan through all the "bad" sectors to see if any of them contained your signature.

This is very dangerous. However, zero-fill programs do the same thing, so you can google how to wipe your hard drive with zeros in C++.
The hard part is finding a place you KNOW is unused and won't be used.

How to create a temporary file to help cache large data in C++?

I have some large vectors whose data comes from calculations on files on the hard disk.
I've seen a lot of software that uses a single temporary file to cache data.
I'm very curious about how to do this, but I don't know the name of this technique.
I want to change my code as little as possible.
Thank you.
My environment is Windows/MFC/VC10/Boost.

I am pretty sure you're looking for memory mapped files.
http://www.boost.org/doc/libs/1_47_0/libs/iostreams/doc/classes/mapped_file.html

What is the best way I should go about creating a program to store information into a file, edit the information in that file, and add new information

I'm about to start on a little project where I create a C++ program to store inventory data in a file (I guess a .txt will do):
• Item Description
• Quantity on Hand
• Wholesale Cost
• Retail Cost
• Date Added to Inventory
I need to be able to:
• Add new records to the file
• Display any record in the file
• Change any record in the file
Is there anything I should know before I start that could make this easier and more efficient?
For example, should I try to use XML, or would that be too hard to work with via C++?
I've never really understood the most efficient way of doing this.
Like, would I search through the file and look for things in brackets or something?
EDIT
The data size shouldn't be too large. It is for homework, I guess you could say. I want to write the struct's contents directly into a file; how would I go about doing that?

There are many approaches. Is this for homework or for real use? If it's for homework, there are probably some restrictions on what you may use.
Otherwise I suggest some embedded DBMS like SQLite. There are others too, but this will be the most powerful solution, and will also have the easiest implementation.
XML is also acceptable, and has many reusable implementations available, but it will start losing performance once you go into thousands of records. The same goes for JSON. And one might still debate which one is simpler - JSON or XML.
Another possibility is to create a struct and write its contents directly to the file. This gets tricky, though, if the record size is not constant. And if the record format changes, the file will need to be rebuilt. Otherwise this solution could be one of the best performance-wise - if implemented carefully.

Could you please enlighten us as to why you don't want to use a database engine for it?
If it is just for learning, then please give us an estimated size of the data stored in that file and the access pattern (how many users, how often they access it, etc.).
The challenge will be to create an efficient search and modification code.
For the search, it's about data structures and organization.
For the modification, it's about how to write updates to the file without reading it completely into memory, updating it there, and then writing it all back to the file.

If this is a project that will actually be used, with the potential to have features added over time, go for a database solution from the start, even if it seems overkill. I've been down this road before, small features get added over time, and before you realize it you have implemented a database. Poorly. Bite the bullet and use a database.
If this is a learning exercise, it depends on the amount of data you want to store. If it is small, the easiest thing to do is read the entire file into memory and operate on it there. When changes are made, write the entire file back out to disk. If the data is too large to do that, the next best thing is to have fixed-size records. Create a POD struct that contains all of the data (i.e., no pointers, STL containers, etc). Then you can rewrite individual records without needing to rewrite the entire file. If neither of these will work, your best bet is a database solution.
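The fixed-size record approach might be sketched like this for the inventory fields from the question. Everything here is an assumption made for illustration: the struct layout, the field sizes, and the helper names make_item, append_item, read_item and update_item. The file written this way is only readable by a program built with the same struct layout.

```cpp
#include <cstddef>
#include <cstring>
#include <fstream>
#include <string>

// Fixed-size POD record: no pointers and no containers, so every record
// occupies exactly sizeof(Item) bytes in the file.
struct Item {
    char description[48];
    int quantity;
    double wholesale_cost;
    double retail_cost;
    char date_added[11];  // room for "YYYY-MM-DD" plus the terminator
};

// Builds a zero-filled record; over-long strings are truncated.
Item make_item(const char* description, int quantity,
               double wholesale, double retail, const char* date) {
    Item item;
    std::memset(&item, 0, sizeof item);
    std::strncpy(item.description, description, sizeof item.description - 1);
    std::strncpy(item.date_added, date, sizeof item.date_added - 1);
    item.quantity = quantity;
    item.wholesale_cost = wholesale;
    item.retail_cost = retail;
    return item;
}

// Adds a new record at the end of the file.
bool append_item(const std::string& path, const Item& item) {
    std::ofstream out(path, std::ios::binary | std::ios::app);
    out.write(reinterpret_cast<const char*>(&item), sizeof item);
    out.flush();
    return static_cast<bool>(out);
}

// Reads record number `index` (0-based) by seeking straight to its offset.
bool read_item(const std::string& path, std::size_t index, Item& item) {
    std::ifstream in(path, std::ios::binary);
    in.seekg(static_cast<std::streamoff>(index * sizeof(Item)));
    in.read(reinterpret_cast<char*>(&item), sizeof item);
    return static_cast<bool>(in);
}

// Overwrites one record in place without touching the rest of the file.
// The caller is expected to pass an index of an existing record.
bool update_item(const std::string& path, std::size_t index, const Item& item) {
    std::fstream io(path, std::ios::binary | std::ios::in | std::ios::out);
    io.seekp(static_cast<std::streamoff>(index * sizeof(Item)));
    io.write(reinterpret_cast<const char*>(&item), sizeof item);
    io.flush();
    return static_cast<bool>(io);
}
```

Because record i always starts at byte i * sizeof(Item), adding, displaying and changing a record each touch only that record's bytes.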

If you insist on doing it manually, I suggest JSON instead of XML.
Also, consider SQLite.

This sounds like a perfect job for SQLite. Small, fast, flexible and easy to use.
