Storing URLs to a file so they can be reached quickly - C++

I have a file and plenty of URLs. These URLs are written to the file, all with the same structure, plus a URL checksum of type int. For example, stackoverflow.com is written as:
12534214214 http://stackoverflow.com
Now, every time I want to put a URL into the file, I need to check whether the URL already exists; only then can I add it.
But this takes too much time with 1,000,000 URLs:
//list of urls
list<string> urls;
size_t hashUrl(string argUrl); //this function will hash the url and return an int
file.open("anchors");
//search for the int 12534214214; if it isn't found, then write "12534214214 http://stackoverflow.com"
file.close();
Question 1: How can I search in the file using the checksum so that the search takes only a few milliseconds?
Question 2: Is there another way of storing these URLs so that they can be reached quickly?
Thanks, and sorry for my bad English.

There is (most likely [1]) no way to search a million URLs in a plain text file in "a few milliseconds". You need to either load the entire file into memory (and when you do, you may just as well load it into some reasonable data structure, for example a std::map or std::unordered_map), or use some sort of indexing for the file - e.g. keep a smaller file containing just the checksums and the positions in the main file where the corresponding records are stored.
The problem with a plain text file is that there is no way to know where anything is. One line can be 10 bytes, another 10,000 bytes. This means that you literally have to read every byte up to the point you are interested in.
Of course, the other option is to use a database library such as SQLite (or a proper database server, such as MySQL) that lets the data be stored and retrieved based on a query. This hides all the index generation and similar problems, and is already optimised both in its search algorithms and in its caching and disk read/write code.
[1] If all the URLs are short, it's perhaps possible that the file is small enough to cache well, and code can be written to linearly scan the entire file fast enough. But a file with, say, an average of 50 bytes per URL will be 50 MB. If each byte takes 10 clock cycles to process, we're already at around 130 ms to process the file, even if it's directly available in memory.
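As a rough sketch of the in-memory approach (a minimal example, not the asker's exact code; the hashUrl shown here is just std::hash standing in for the real checksum function): load every existing checksum from "anchors" into a std::unordered_set once, then each new URL costs only a hash lookup before it is appended to the file.

#include <fstream>
#include <functional>
#include <string>
#include <unordered_set>

size_t hashUrl(const std::string& url)      // assumed stand-in for the question's hash function
{
    return std::hash<std::string>{}(url);
}

int main()
{
    std::unordered_set<size_t> known;

    // One-time load: read "<checksum> <url>" lines into the set.
    {
        std::ifstream in("anchors");
        size_t checksum;
        std::string url;
        while (in >> checksum >> url)
            known.insert(checksum);
    }

    // Adding a URL is now an O(1) average-case lookup plus an append.
    std::ofstream out("anchors", std::ios::app);
    auto addUrl = [&](const std::string& url) {
        size_t h = hashUrl(url);
        if (known.insert(h).second)          // insert() reports whether the checksum was new
            out << h << ' ' << url << '\n';
    };

    addUrl("http://stackoverflow.com");
}

For a million short URLs the set fits comfortably in memory, and persistence is just the existing append-only file.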

Related

What's the smallest possible file size on disk?

I'm trying to find a solution to store a binary file at its smallest size on disk. I'm reading a vehicle's VIN and plate number from a database, which is 30 bytes; when I put it in a txt file and save it, its size is 30 B, but its size on disk is 4 KB, which means that if I save 100,000 files or more, it will kill my storage space.
So my question is: how can I write these 30 bytes to an individual binary file at its smallest size on disk, and what is the smallest possible on-disk size of 30 bytes of data, including other info such as file name and permissions?
Note: I do not want to save this text in a database; I just want to make separate binary files.
The smallest size of a file on disk is always the cluster size of your disk, which is typically 4 KB. For data like this, having many records in a single file is really the only reasonable solution.
Another possibility would be to store those files in an archive, for example a zip file. Under Windows you can even access the zip contents in Explorer pretty much like ordinary files.
Another creative possibility: store all the data in the filename only. A zero-byte file takes only 1024 bytes in the MFT (assuming NTFS).
Edit: reading up on resident files, I found that on the newer 4K-sector drives the MFT entry is actually 4 KB too, so it doesn't get smaller than this, whether the data size is 0 or not.
Another edit: huge directories, with tens or hundreds of thousands of entries, become quite unwieldy. Don't try to open one in Explorer, or be prepared to go drink a coffee while it loads.
Most file systems allocate disk space to files in chunks. It is not possible to take less than one chunk, except for possibly a zero-length file.
Google 'Cluster size'
You should consider using an indexed file library like gdbm: it associates arbitrary data with an arbitrary key. You won't spend a file on each association (there is only a single file for all of them).
You should also reconsider your opposition to "databases". SQLite is a library that gives you SQL and database abilities, and there are NoSQL databases like MongoDB.
Of course, all this is horribly operating-system- and filesystem-specific (but gdbm and SQLite should work on many systems).
AFAIU, you can configure and use both gdbm and SQLite to store millions of entries of a few dozen bytes each quite efficiently.
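As a hedged sketch of the SQLite route (table and column names, the database file name, and the sample record are assumptions): one database file holds all the records, and the primary key gives indexed lookups, so you never pay a filesystem cluster per record.

#include <sqlite3.h>
#include <cstdio>

int main()
{
    sqlite3* db = nullptr;
    if (sqlite3_open("records.db", &db) != SQLITE_OK) return 1;

    // One table for all records; the PRIMARY KEY gives indexed lookups.
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS vehicle ("
        "  vin   TEXT PRIMARY KEY,"
        "  plate TEXT)",
        nullptr, nullptr, nullptr);

    // Insert one small record (VIN + plate) via a prepared statement.
    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db,
        "INSERT OR REPLACE INTO vehicle (vin, plate) VALUES (?, ?)",
        -1, &stmt, nullptr);
    sqlite3_bind_text(stmt, 1, "1HGCM82633A004352", -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(stmt, 2, "ABC-1234", -1, SQLITE_TRANSIENT);
    sqlite3_step(stmt);
    sqlite3_finalize(stmt);

    // Look a record up again by its key.
    sqlite3_prepare_v2(db, "SELECT plate FROM vehicle WHERE vin = ?",
                       -1, &stmt, nullptr);
    sqlite3_bind_text(stmt, 1, "1HGCM82633A004352", -1, SQLITE_TRANSIENT);
    if (sqlite3_step(stmt) == SQLITE_ROW)
        std::printf("plate: %s\n",
                    reinterpret_cast<const char*>(sqlite3_column_text(stmt, 0)));
    sqlite3_finalize(stmt);

    sqlite3_close(db);
}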
On filesystems you have the same problem: the smallest allocation is one data block plus an inode. For example, in IBM JFS2 the smallest block size is 4 KB, and you also have an inode to allocate. The second problem is that you will be writing many files in a short time, and writing many inodes in a short time causes performance problems.
Every write operation must be journaled and committed, unless you use an old, non-journaled filesystem.
One idea is to group many of your data records, put a separator between them, and write 200-1000 of them into one file.
For example:
0102030400506070809101112131415;;0102030400506070809101112131415;;...
You can index them with the file name, sequence numbers, or similar; a sketch of this packing approach is shown below.
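A minimal sketch of that packing idea (record content, batch size, and the file-name scheme are placeholders): write a few hundred records into one file, separated by ";;", and use a sequence number in the file name as the index.

#include <cstdio>
#include <fstream>
#include <string>
#include <vector>

void writeBatch(const std::vector<std::string>& records, int sequenceNumber)
{
    // e.g. "batch_000042.dat"; the sequence number is the external index.
    char name[32];
    std::snprintf(name, sizeof name, "batch_%06d.dat", sequenceNumber);

    std::ofstream out(name, std::ios::binary);
    for (size_t i = 0; i < records.size(); ++i) {
        if (i) out << ";;";              // separator between records
        out << records[i];
    }
}

int main()
{
    // 200-1000 records per file keeps the per-file overhead negligible.
    std::vector<std::string> batch;
    for (int i = 0; i < 500; ++i)
        batch.push_back("0102030400506070809101112131415");   // placeholder record
    writeBatch(batch, 0);
}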

How to deserialize a file containing multiple records

I've written a Thrift definition and used this definition to serialize multiple records into one file (I've added the size of the whole record at the beginning of each record). That is, in short, what I have done:
boost::shared_ptr<apache::thrift::transport::TMemoryBuffer> transport(new apache::thrift::transport::TMemoryBuffer);
boost::shared_ptr<apache::thrift::protocol::TBinaryProtocol> protocol(new apache::thrift::protocol::TBinaryProtocol(transport));
myClass->write(protocol.get());
const std::string & data(transport->getBufferAsString());
Afterwards I just write the string data to the file in binary mode. Now I want to deserialize this file again. I wouldn't have any problem if there were only one record in the file; unfortunately I have to write multiple records, so I guess I have to work with offsets based on the size I saved in the file along with each record. However, I can't seem to find any example I can use to achieve my goal, and the official documentation is quite lacking. Does anyone have any tips for me? If I'm missing some information, just ask.
Further information:
Of course I want to use Thrift to deserialize. However, one file can contain multiple records. For example, imagine I have defined a struct in a Thrift definition file that contains car information. Now I serialize multiple car structs into one output file. Serializing is no problem, as I just append the data. If I want to deserialize, however, I have to know where one record starts and the next begins. That is my problem: I don't know how to tell Thrift where one record begins and ends. I've searched the internet but can't seem to find an example for C++ (I found one for Python so far, but am not able to translate it to C++). The structure of one file can be described as follows: [lengthOfRecord1][record1][lengthOfRecord2][record2][...]
Thanks in advance,
Michael
How about having a list<records> that you de/serialize as a whole? Or is it an absolute requirement to read them independently and randomly? If yes, I see 1.5 (one and a half) possible solutions:
Have a second file as an index. This holds a map<recordNumber, offset>, or simply a sorted list of integer pairs, to quickly locate records. Since this data is much smaller than the records, you can probably cache it in memory all the time.
The half solution: iff the record size is fixed, any record's position can be calculated easily by multiplying recordSize * (recordNr - 1). This way you don't even need the size prefix. If you have strings in the record or other variable-sized entities, this will not work, unless you force a fixed record size by reserving a buffer for each record with a predefined (maximum) size. It's a little ugly, hence the "half" solution, but you don't need the index file.
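A rough sketch of the index idea, assuming the [lengthOfRecord][record]... layout described in the question with a 4-byte length prefix written in native byte order (the function names are mine, and ThriftRecord stands in for the Thrift-generated class): walk the file once, following the length prefixes, to build a record-number-to-offset index, then use it for random access.

#include <cstdint>
#include <fstream>
#include <vector>
#include <boost/shared_ptr.hpp>
#include <thrift/protocol/TBinaryProtocol.h>
#include <thrift/transport/TBufferTransports.h>   // include paths may differ by Thrift version

// Build the map<recordNumber, offset> (here simply a vector indexed by record number).
// The file must be opened with std::ios::binary.
std::vector<std::streamoff> buildIndex(std::ifstream& file)
{
    std::vector<std::streamoff> offsets;
    std::streamoff pos = 0;
    uint32_t len = 0;
    while (file.read(reinterpret_cast<char*>(&len), sizeof len)) {
        offsets.push_back(pos);            // a record starts at its length prefix
        pos += sizeof len + len;
        file.seekg(pos);                   // skip over the record body
    }
    file.clear();                          // clear EOF so later seeks work
    return offsets;
}

// Load and deserialize record n using the index.
template <typename ThriftRecord>
void readRecord(std::ifstream& file, const std::vector<std::streamoff>& index,
                size_t n, ThriftRecord& out)
{
    file.seekg(index[n]);
    uint32_t len = 0;
    file.read(reinterpret_cast<char*>(&len), sizeof len);
    std::vector<uint8_t> buf(len);
    file.read(reinterpret_cast<char*>(buf.data()), len);

    boost::shared_ptr<apache::thrift::transport::TMemoryBuffer> transport(
        new apache::thrift::transport::TMemoryBuffer(buf.data(), len));
    boost::shared_ptr<apache::thrift::protocol::TBinaryProtocol> protocol(
        new apache::thrift::protocol::TBinaryProtocol(transport));
    out.read(protocol.get());              // the Thrift-generated read()
}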
Although maybe not the perfect solution, this seems to work for me:
boost::shared_ptr<apache::thrift::transport::TMemoryBuffer> transport(new apache::thrift::transport::TMemoryBuffer);
boost::shared_ptr<apache::thrift::protocol::TBinaryProtocol> protocol(new apache::thrift::protocol::TBinaryProtocol(transport));
transport->resetBuffer((uint8_t*) buffer, sizeOfEntry);
Buffer is a char array containing the desired record (I used seekg for the offset) and sizeOfEntry is the record's size. Afterwards I can go on with the automatically generated read method of my Thrift-generated class. In fact I had this solution earlier; I had just messed up my offset, which is why it didn't work.

What is a good way to traverse a big file in C++?

I have really big files, which contain data packages. The file itself is simply a really big string, and the packages are separated by the string "PACK1.0".
Assuming "XXX" is data, a package looks like this :
PACK1.0XXXXXXXXXXXXXXXXXPACK1.0XXXXXXXXXXXXXXPACK1.0XXXXXXXXXX
I am creating a hash map which contains the package number and the byte offset where each package begins.
Example:
PACKAGE NR | BYTE WHERE IT BEGINS IN THE STREAM
0 | 0
1 | 128
2 | 256
. | .
. | .
If I want package number 5340, I look in the hash map for the byte at which the package begins, go to that byte with stream.seekg(POSITION), and parse the package - in theory.
My final problem is: I want to travel through the file with a slider, with play & pause options. My thought was that the slider has a range of min = 0 and max = package count.
Is this a good way to traverse through a file?
What problems can this cause? What is a better way to do this?
This is my code for storing the hash map (this code assumes a package is 128 bytes long):
std::map<int, std::streamoff> THEMAP;   // package number -> byte offset where it begins
thefile.seekg(0, std::ios::end);
std::streamoff dataLength = thefile.tellg();
thefile.seekg(0, std::ios::beg);
std::streamoff position = 0;
int packagecount = 0;
while (position < dataLength)
{
    // record this package's start offset *before* advancing past it
    THEMAP.insert(std::make_pair(packagecount, position));
    position += 128;        // the fixed package size assumed by this example
    ++packagecount;
}
This is usually a case for memory-mapped I/O (MMIO). If you are Windows-only, use MapViewOfFile and the other functions in that family. For cross-platform usage I recommend GLib's file-mapping functions. What MMIO does is map a part of a file (or an entire file) into the process's memory space, so you can access it via a simple pointer. You can choose arbitrarily which part of the file is mapped and how large the mapping is.
A possible strategy for you could be: on startup, map a fixed block of the file into memory in a loop, block by block, and search for the first package identifier in each block. This is relatively quick and gives you a first set of markers. On the next access you can use this initial set to find the proper part of the file, map it, and scan only that part. Of course, you'd store any marker that comes along.
Later, when you scroll through your file, you just map the page (it can be smaller this time, depending on how much data you need at a certain point in time) and display the needed data. Obviously, the addresses of the package markers can be used directly as start addresses for the memory mapping.
A nice side effect is that it is completely irrelevant how big the packages are, and you can map files of any size, even gigabyte-sized files. By using small views on the file, the memory requirements of your application can be very small.
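For illustration, here is a hedged sketch using POSIX mmap (MapViewOfFile on Windows or GLib's g_mapped_file_new would be the equivalents mentioned above); it maps the whole file and records the byte offset of every "PACK1.0" marker. The file name is an assumption.

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>
#include <vector>

int main()
{
    const char*  path      = "bigfile.dat";   // assumed file name
    const char*  marker    = "PACK1.0";
    const size_t markerLen = std::strlen(marker);

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    // Map the whole file read-only; the OS pages it in on demand.
    const char* data = static_cast<const char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (data == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    // Collect the byte offset of every package marker.
    std::vector<size_t> offsets;
    for (size_t i = 0; i + markerLen <= static_cast<size_t>(st.st_size); ++i)
        if (std::memcmp(data + i, marker, markerLen) == 0)
            offsets.push_back(i);

    std::printf("found %zu packages\n", offsets.size());

    munmap(const_cast<char*>(data), st.st_size);
    close(fd);
}

The offsets collected this way can be fed straight into the hash map described in the question, or used directly as start addresses for smaller mappings later on.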

FAT, optimize performance when retrieve a file

I have an implementation of a database with one file per record, and I have about 10,000 records.
I'm trying to optimize the performance of file access, and I have a little doubt.
Is splitting the files into folders better than keeping them all in a single folder for quick access? E.g. records 0 to 999 in folder 0, 1000 to 1999 in the next folder, etc.
Which is better for this, FAT16 or FAT32?
If you are accessing the files directly, you won't have any performance drop. If you are searching for a particular file on the disk, it will be faster to store them in folders; this way the folders emulate DB indexes. But as @blow mentioned, why don't you use something like SQLite?
When you retrieve a file by filename, you most likely do a linear search in the directory containing that file: you skip over directory entries until you find the one that matches the given filename.
This search operation may be slow if you do it every time for every file, if there are many files in the directory, and if reads are slow (and if your CPU is slow you lose even more).
You may want to build some sort of index: a compact array of filename+location pairs sorted by filename, which you can keep in memory to quickly find files without re-reading the directory entries (see the sketch below).
Things can be greatly simplified if there is a constant number of files and they all have the same length or are padded to the same length. In that case you don't need any search at all, as you can calculate the location of each file directly from the filename, provided, of course, that the order of the files is fixed.
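A small sketch of that in-memory index (the entry layout, the use of the first cluster number as the "location", and the sample names are assumptions; any value that lets you open the file directly would do): a vector of pairs kept sorted by name, searched with binary search.

#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

struct IndexEntry {
    std::string name;       // file name
    uint32_t    cluster;    // first cluster / location on disk (assumed)
};

// Kept sorted by name so lookups are O(log n).
std::vector<IndexEntry> fileIndex;

void addEntry(const std::string& name, uint32_t cluster)
{
    IndexEntry e{name, cluster};
    auto it = std::lower_bound(fileIndex.begin(), fileIndex.end(), e,
        [](const IndexEntry& a, const IndexEntry& b) { return a.name < b.name; });
    fileIndex.insert(it, e);
}

// Returns true and fills `cluster` if the file is known.
bool findEntry(const std::string& name, uint32_t& cluster)
{
    auto it = std::lower_bound(fileIndex.begin(), fileIndex.end(), name,
        [](const IndexEntry& a, const std::string& n) { return a.name < n; });
    if (it == fileIndex.end() || it->name != name) return false;
    cluster = it->cluster;
    return true;
}

int main()
{
    addEntry("rec_0001.dat", 120);
    addEntry("rec_0002.dat", 260);
    uint32_t c = 0;
    if (findEntry("rec_0002.dat", c)) { /* open the file at location c */ }
}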
The only practical difference between FAT1x and FAT32 in this context is the size of the file allocation table, the set of linked lists/chains that tells you which clusters are free or occupied by file/directory data and which cluster follows the given one in a file or directory. In FAT32 the cluster chain elements are 32 bits, twice as large as on FAT16. If the number of used clusters is small (less than ~64K), you will read twice as much data from FAT32 while traversing the cluster chains compared with FAT16. Also, finding a free cluster on FAT32 (when you create a new file/directory or grow an existing one) can be slow if there are many clusters on the disk (and there can be up to 2^28 on FAT32, AFAIR, vs 2^16 on FAT16). You don't want to start searching for a free cluster from the beginning of the FAT every time; instead, keep a pointer to the last place you stopped the search, resume from there next time, and wrap around to the beginning of the FAT when you reach its end.
Split them across directories (the split factor depending on your cluster size) and do not use LFN (long file names) if you can, because LFN will slow down your operations. I also work on embedded systems. I did not have to access thousands of files like you, but I avoided LFN (especially for royalty reasons).

C++ inserting a line into a file at a specific line number

I want to be able to read from an unsorted source text file (one record in each line), and insert the line/record into a destination text file by specifying the line number where it should be inserted.
Where to insert the line/record into the destination file will be determined by comparing the incoming line from the incoming file to the already ordered list in the destination file. (The destination file will start as an empty file and the data will be sorted and inserted into it one line at a time as the program iterates over the incoming file lines.)
Incoming File Example:
1 10/01/2008 line1data
2 11/01/2008 line2data
3 10/15/2008 line3data
Desired Destination File Example:
2 11/01/2008 line2data
3 10/15/2008 line3data
1 10/01/2008 line1data
I could do this by performing the sort in memory via a linked list or similar, but I want to allow this to scale to very large files. (And I am having fun trying to solve this problem as I am a C++ newbie :).)
One way to do this may be to open two file streams with fstream (one in and one out, or just one in/out stream), but then I run into the problem that it's difficult to find and seek to a file position, because positions seem to be absolute byte offsets from the start of the file rather than line numbers :).
I'm sure problems like this have been tackled before, and I would appreciate advice on how to proceed in a manner that is good practice.
I'm using Visual Studio 2008 Pro C++, and I'm just learning C++.
The basic problem is that under common OSs, files are just streams of bytes. There is no concept of lines at the filesystem level; those semantics have to be added as an additional layer on top of the OS-provided facilities. Although I have never used it, I believe that VMS has a record-oriented filesystem that would make what you want to do easier. But under Linux or Windows, you can't insert into the middle of a file without rewriting the rest of the file. It is similar to memory: at the lowest level it's just a sequence of bytes, and if you want something more complex, like a linked list, it has to be added on top.
If the file is just a plain text file, then I'm afraid the only way to find a particular numbered line is to walk the file counting lines as you go.
The usual 'non-memory' way of doing what you're trying to do is to copy the file from the original to a temporary file, inserting the data at the right point, and then do a rename/replace of the original file.
Obviously, once you've done your insertion, you can copy the rest of the file in one big lump, because you don't care about counting lines any more.
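A minimal sketch of that copy-and-rename approach (the file names, the 1-based line numbering, and the helper name insertLine are assumptions for illustration): copy the destination file line by line into a temporary file, emit the new line when the target line number is reached, then replace the original.

#include <cstdio>
#include <fstream>
#include <string>

// Insert newLine so that it becomes line number lineNo (1-based) of the file at `path`.
bool insertLine(const std::string& path, long lineNo, const std::string& newLine)
{
    std::ifstream in(path);
    const std::string tmpPath = path + ".tmp";
    std::ofstream out(tmpPath);
    if (!in || !out) return false;

    std::string line;
    long current = 1;
    while (std::getline(in, line)) {
        if (current == lineNo)
            out << newLine << '\n';
        out << line << '\n';          // after the insertion point, this just copies the rest
        ++current;
    }
    if (current <= lineNo)            // insert position past the end: append
        out << newLine << '\n';

    in.close();
    out.close();
    std::remove(path.c_str());                        // rename/replace the original
    return std::rename(tmpPath.c_str(), path.c_str()) == 0;
}

int main()
{
    insertLine("destination.txt", 2, "3 10/15/2008 line3data");
}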
A [distinctly non-C++] solution would be to use the *nix sort tool, sorting on the second column of data. It might look something like this:
cat <file> | sort -k 2,2 > <file2> ; mv <file2> <file>
It's not exactly in-place, and it fails the request of using C++, but it does work :)
Might even be able to do:
cat <file> | sort -k 2,2 > <file>
I haven't tried that route, though - and since the shell truncates <file> before sort has produced any output, it most likely won't work.
* http://www.ss64.com/bash/sort.html - sort man page
One way to do this is not to keep the file sorted, but to use a separate index, using Berkeley DB. Each record in the DB holds the sort keys and the offset into the main file. The advantage of this is that you can have multiple sort orders without duplicating the text file. You can also change lines without rewriting the file, by appending the changed line at the end and updating the index to ignore the old line and point to the new one. We used this successfully for multi-GB text files that we had to make many small changes to.
Edit: The code I developed to do this is part of a larger package that can be downloaded here. The specific code is in the btree* files under source/IO.
Try a modified bucket sort. Assuming the id values lend themselves well to it, you'll get a much more efficient sorting algorithm. You may be able to improve I/O efficiency by actually writing out the buckets (use small ones) as you scan, thus potentially reducing the amount of randomized file I/O you need. Or not.
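A hedged sketch of this idea, assuming the numeric id in the first column is the sort key, ids fall in a known range, and each bucket fits in memory (the file names, the bucket count, and the id range are placeholders): distribute the lines into range-partitioned bucket files in one pass, then sort each bucket in memory; concatenating the buckets in order yields a fully sorted destination file.

#include <algorithm>
#include <fstream>
#include <string>
#include <vector>

int main()
{
    const long kMaxId   = 1000000;     // assumed upper bound on the ids
    const int  kBuckets = 100;
    const long kWidth   = kMaxId / kBuckets;

    // Pass 1: distribute each line into the bucket covering its id range.
    std::vector<std::ofstream> buckets(kBuckets);
    for (int i = 0; i < kBuckets; ++i)
        buckets[i].open("bucket" + std::to_string(i) + ".txt");

    std::ifstream in("incoming.txt");              // assumed input file name
    std::string line;
    while (std::getline(in, line)) {
        long id = std::stol(line);                 // the id is the first field
        int  b  = static_cast<int>(std::min<long>(id / kWidth, kBuckets - 1));
        buckets[b] << line << '\n';
    }
    for (auto& b : buckets) b.close();

    // Pass 2: sort each bucket in memory; since the buckets partition the key
    // range in order, concatenating them gives a globally sorted file.
    std::ofstream out("destination.txt");
    for (int i = 0; i < kBuckets; ++i) {
        std::ifstream bin("bucket" + std::to_string(i) + ".txt");
        std::vector<std::string> lines;
        while (std::getline(bin, line)) lines.push_back(line);
        std::sort(lines.begin(), lines.end(),
                  [](const std::string& a, const std::string& b)
                  { return std::stol(a) < std::stol(b); });
        for (const auto& l : lines) out << l << '\n';
    }
}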
Hopefully, there are some good code examples on how to insert a record based on line number into the destination file.
You can't insert content into the middle of a file (i.e., without overwriting what was previously there); I'm not aware of any production-level filesystem that supports it.
I think the question is more about implementation than about specific algorithms - specifically, handling very large datasets.
Suppose the source file has 2^32 lines of data. What would be an efficient way to sort the data?
Here's how I'd do it:
Parse the source file and extract the following information: sort key, offset of the line in the file, length of the line. This information is written to another file. This produces a dataset of fixed-size elements that is easy to index; call it the index file. (A sketch of this first step follows after these steps.)
Use a modified merge sort. Recursively divide the index file until the number of elements to sort reaches some minimum - a true merge sort recurses down to 1 or 0 elements; I suggest stopping at 1024 or so, and this will need fine-tuning. Load each block of data from the index file into memory, quicksort it, and then write the data back to disk.
Perform the merge on the index file. This is tricky, but can be done like this: load a block of data from each source (1024 entries, say). Merge into a temporary output file and write. When a block is emptied, refill it. When no more source data is found, read the temporary file from the start and overwrite the two parts being merged - they should be adjacent. Obviously, the final merge doesn't need to copy the data (or even create a temporary file). Thinking about this step, it is probably possible to set up a naming convention for the merged index files so that the data doesn't need to overwrite the unmerged data (if you see what I mean).
Read the sorted index file and, for each entry, pull the line of data out of the source file and write it to the result file.
It certainly won't be quick with all that file reading and writing, but it should be quite efficient - the real killer is the random seeking in the source file during the final step. Up to that point, the disk access is mostly linear and should therefore be reasonably efficient.
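A rough sketch of the first step, building the fixed-size index file (the key extraction - the date in the second column packed into a fixed-width field - and the file names are assumptions based on the question's example records):

#include <algorithm>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <sstream>
#include <string>

struct IndexEntry {
    char     key[10];    // e.g. "10/01/2008"
    uint64_t offset;     // byte offset of the line in the source file
    uint32_t length;     // length of the line, excluding the newline
};

int main()
{
    std::ifstream src("incoming.txt");                     // assumed input name
    std::ofstream idx("incoming.idx", std::ios::binary);   // assumed index name

    std::string line;
    uint64_t offset = 0;
    while (std::getline(src, line)) {
        std::istringstream iss(line);
        std::string id, date;
        iss >> id >> date;                     // the second field is the sort key

        IndexEntry e{};
        std::memcpy(e.key, date.c_str(), std::min<size_t>(date.size(), sizeof e.key));
        e.offset = offset;
        e.length = static_cast<uint32_t>(line.size());
        idx.write(reinterpret_cast<const char*>(&e), sizeof e);

        offset += line.size() + 1;             // +1 for the '\n' consumed by getline
                                               // (assumes Unix line endings)
    }
}

Because every IndexEntry has the same size, the later quicksort/merge passes can seek to entry n at offset n * sizeof(IndexEntry) without any scanning.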