We're operating a private blockchain using Quorum for a research project, and we're storing large amounts of data in transactions as strings. We've hit the size limit (we think; open to thoughts there too) and have been able to span the data across multiple transactions, etc., so the ability to store everything we want isn't really the issue.
We're storing just a long string of hexadecimal characters, and what we're wondering is whether, instead of a string, we could store them more efficiently as something else, and maybe stretch the data field further by switching from a string to a different datatype specialized for hexadecimal.
Any ideas? Like I said, we're operational, but I've been having a hard time figuring out whether this is possible, or even a good idea. The hexadecimal "0xab34ef6......26ef" would not resolve to a number or anything (it's a string of independent characters that just happen to be hex).
Thanks and take care,
Mark
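(For illustration, here is a minimal C++ sketch of the packing idea the question describes: two hex characters collapse into one raw byte, so the stored payload roughly halves. The function names are made up, and it assumes an even-length, well-formed hex string, with or without the 0x prefix.)

    #include <cstddef>
    #include <cstdint>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Pack a hex string ("ab34ef..." or "0xab34ef...") into raw bytes.
    // Two hex characters collapse into one byte, halving the payload size.
    // Assumes an even number of hex digits; throws on anything else.
    std::vector<std::uint8_t> hexToBytes(std::string hex) {
        if (hex.rfind("0x", 0) == 0) hex.erase(0, 2);   // strip optional prefix
        if (hex.size() % 2 != 0) throw std::invalid_argument("odd-length hex string");

        auto nibble = [](char c) -> std::uint8_t {
            if (c >= '0' && c <= '9') return c - '0';
            if (c >= 'a' && c <= 'f') return c - 'a' + 10;
            if (c >= 'A' && c <= 'F') return c - 'A' + 10;
            throw std::invalid_argument("non-hex character");
        };

        std::vector<std::uint8_t> out;
        out.reserve(hex.size() / 2);
        for (std::size_t i = 0; i < hex.size(); i += 2)
            out.push_back(static_cast<std::uint8_t>((nibble(hex[i]) << 4) | nibble(hex[i + 1])));
        return out;
    }

    // Expand raw bytes back into the lowercase hex string.
    std::string bytesToHex(const std::vector<std::uint8_t>& bytes) {
        static const char digits[] = "0123456789abcdef";
        std::string out;
        out.reserve(bytes.size() * 2);
        for (std::uint8_t b : bytes) {
            out.push_back(digits[b >> 4]);
            out.push_back(digits[b & 0x0F]);
        }
        return out;
    }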
Related
Is there a way to implement search directly in the file, instead of copying its data out first?
In theory: yes, but it will be quite inefficient.
I'd recommend putting the data in an SQLite database; that way you still have a single file, but you can query and search for entries nicely.
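For example, a rough sketch using the sqlite3 C API (the database path, table, and column names are illustrative only):

    #include <sqlite3.h>
    #include <cstdio>

    // Rough sketch: load entries into an SQLite file once, then search with SQL
    // instead of scanning a text file. "entries"/"line" are made-up names.
    int main() {
        sqlite3* db = nullptr;
        if (sqlite3_open("data.db", &db) != SQLITE_OK) return 1;

        sqlite3_exec(db, "CREATE TABLE IF NOT EXISTS entries(line TEXT);",
                     nullptr, nullptr, nullptr);
        // ... insert your lines here, ideally inside one transaction ...

        sqlite3_stmt* stmt = nullptr;
        sqlite3_prepare_v2(db, "SELECT line FROM entries WHERE line LIKE ?;",
                           -1, &stmt, nullptr);
        sqlite3_bind_text(stmt, 1, "needle%", -1, SQLITE_TRANSIENT);

        while (sqlite3_step(stmt) == SQLITE_ROW)
            std::printf("%s\n",
                        reinterpret_cast<const char*>(sqlite3_column_text(stmt, 0)));

        sqlite3_finalize(stmt);
        sqlite3_close(db);
    }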
tl;dr: Yes, but it's often not worth it
You neglected to mention how the text file is sorted, exactly, and whether there are escaped characters, quotation marks, multi-octet characters etc. - these would all impact the answer.
But let's make the following assumptions:
Plain printable ASCII text, with no newlines in each string.
Newlines (i.e. 0xA characters) separate strings.
This is still not enough for a set of assumptions, because maybe some of the strings are much longer than the others? In fact, what about the not-that-extreme case of n strings overall, but a few of them taking up most of the characters? If you start sampling characters in the file, you'll need to go back and forth, linearly, at least to both edges of a single string (or forward until you hit a newline twice).
So let's add more assumptions, although frankly they may well be invalid:
You know the minimum Min and maximum Max string lengths.
The ratio R of maximum to minimum string length is not very high.
This makes it at least theoretically reasonable to start reading from some arbitrary point in the file, and look for a complete string. However, files are usually on disks; and disks are accessed by blocks. So for reading even a single character from the file you need to read a whole block of size B (think of B as, say, 1 KiB as a reasonable example). We'll assume Max < B, otherwise you're in the huge-strings case.
Another point to be made is that disk latencies are high. This is especially true for magnetic (or optical) disks, where you can wait as much as 10 msec for a single read! If you read sequentially, there's no need to "seek" or look up the position you're interested in, and you can make use of the disk's full bandwidth. This is less of a problem with SSDs, but it's still not negligible.
So, as you can see, there's quite a bit of overhead for your binary search. It may still be worth it if your file is really, really large relative to Min, Max, R and B. So in a file of several gigabytes, I'd certainly consider it. Otherwise, it's probably not worth the bother.
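To make the mechanics concrete, here is a rough C++ sketch under the assumptions above (plain ASCII, one non-empty string per line, file sorted ascending): seek to a midpoint, skip the partial line you landed in, read the next complete line, and narrow the offset range. It is an illustration only and does none of the block-level buffering a serious implementation would.

    #include <cstdint>
    #include <fstream>
    #include <optional>
    #include <string>

    // Read the first complete line starting at or after byte offset `pos`.
    // Offset 0 is a line start; for any other offset we first skip to just
    // past the next '\n'. Returns nullopt if no complete line starts there.
    static std::optional<std::string> lineAtOrAfter(std::ifstream& in, std::int64_t pos) {
        in.clear();
        in.seekg(pos);
        std::string line;
        if (pos != 0 && !std::getline(in, line)) return std::nullopt; // skip partial line
        if (!std::getline(in, line)) return std::nullopt;
        return line;
    }

    // True if `needle` occurs as a complete line in a lexicographically
    // sorted file of non-empty lines. O(log size) reads; a sketch only.
    bool containsLine(const std::string& path, const std::string& needle) {
        std::ifstream in(path, std::ios::binary);
        if (!in) return false;
        in.seekg(0, std::ios::end);
        std::int64_t lo = 0, hi = in.tellg();

        while (lo < hi) {                        // lower_bound over byte offsets
            std::int64_t mid = lo + (hi - lo) / 2;
            auto line = lineAtOrAfter(in, mid);
            if (line && *line < needle) lo = mid + 1;   // everything here is too small
            else hi = mid;                              // candidate line is at or after mid
        }
        auto line = lineAtOrAfter(in, lo);
        return line && *line == needle;
    }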
I haven't seen, anywhere in the documentation of such databases, the common-sense notion of converting an integer to network byte order and writing the resulting bytes as an indexable entity into a string-to-string database, as opposed to writing the string representation of the number.
Surely the size overhead of writing a 64-bit int as a string into a database must outweigh the trivial complexity of having to do an ntohl call before writing the bytes back into an integer type.
I am therefore missing something here: what are the downsides to using big-endian bytes vs. strings as indexable entities in string-to-string databases?
(C++/C tags as I am talking about writing bytes into the memory location of a programmatic type; BDB as that is the database I am using, but it could be kyotodb as well.)
The advantage of big-endian in this case is that the strings would sort correctly in ascending order.
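A minimal sketch of that encoding (function names are illustrative): because the most significant byte comes first, byte-wise/lexicographic comparison of the keys matches numeric order for unsigned values.

    #include <cstdint>
    #include <string>

    // Encode a 64-bit unsigned integer as 8 big-endian bytes. Lexicographic
    // (memcmp) order on the resulting strings matches numeric order.
    std::string encodeKey(std::uint64_t v) {
        std::string key(8, '\0');
        for (int i = 7; i >= 0; --i) {
            key[i] = static_cast<char>(v & 0xFF);
            v >>= 8;
        }
        return key;
    }

    // Decode the 8 big-endian bytes back into the integer.
    std::uint64_t decodeKey(const std::string& key) {
        std::uint64_t v = 0;
        for (unsigned char c : key) v = (v << 8) | c;
        return v;
    }

(Note this only holds for unsigned values; signed integers would need the sign bit flipped, or an offset applied, before encoding.)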
If the database architecture cannot natively store 64-bit integers, but you need to store them anyway, stringifying them this way is a way to do it.
Of course if you later upgrade the database to one that can store 64-bit integers natively, you will either be "stuck" with the implementation or have to go through a migration process.
If the database validates that the string data you send is valid in the expected encoding then you can't just give it any data you want. You'll only be able to send such integers as happen to look like a valid encoding. I don't know if BDB or kyotodb do such validation.
Also it seems to me like a hack to try to trick one data type into holding something else, and then rely on all clients to know the trick. Of course that applies whether you're using the string to hold an ASCII-decimal representation of the integer or using the string as a raw memory buffer to hold the integer. It seems to me that it'd be better to use a database that actually holds the types you want to hold, instead of just strings.
I need to work with a series of characters. The number of characters can be up to 10^11.
It's not possible with a usual array. What should I use?
I wanted to use the gets() function to read the string. But is this possible with STL containers?
If not, then what's the way?
Example:
input:
AMIRAHID
output: A.M.I.R.A.H.I.D
How could this be made possible if the number of characters were reduced to 10^10, on a 32-bit machine?
Thank you in advance.
Well, that's roughly 100 GByte of data. No usual string class will be able to hold more than fits into your main memory. You might want to look at STXXL, which is an implementation of the STL that allows part of the data to be stored on disk.
If your machine has 10^11 bytes == ~93 GB of memory then it's probably a 64-bit machine, so string will work. Otherwise nothing will help you.
Edited answer for the edited question: in that case you don't really need to store the whole string in memory. You can store only the small part of it that fits into memory.
Just read every character from the input, write it to the output, and write a dot after it. Repeat until you get an EOF on the input. To increase performance you can read and write the data in large chunks, as long as each chunk still fits into memory.
Such algorithms are called online algorithms.
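For example, a minimal sketch of that online approach in C++, reading standard input and writing each character separated by dots to standard output (so "AMIRAHID" becomes "A.M.I.R.A.H.I.D"), without ever holding the whole string in memory:

    #include <cstdio>

    // Online version: emit each input character separated by dots.
    // Memory use is constant, no matter how long the input is.
    int main() {
        int c;
        bool first = true;
        while ((c = std::getchar()) != EOF) {
            if (c == '\n') { std::putchar('\n'); first = true; continue; }
            if (!first) std::putchar('.');
            std::putchar(c);
            first = false;
        }
        return 0;
    }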
It is possible for an array that large to be created. But not on a 32-bit machine. Switching to STL will likely not help, and is unnecessary.
You need to contemplate how much memory that is, and if you have any chance of doing it at all.
10^11 is roughly 100 gigabytes, which means you will need a 64-bit system (and compiler) to even be able to address it.
STL's strings support a max of max_size() characters, so the answer can change with the implementation.
A string suffers from the same problem as an array: it has to fit in memory.
10^11 characters would take up over 4GB. That's hard to fit into memory on a 32-bit machine, which has a 4GB memory space. You either need to split up your data into smaller chunks and only load a bit of it at a time, or switch to 64-bit, in which case both arrays and strings should be able to hold the data (although it may still be preferable to split it up into multiple smaller strings/arrays).
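For reference, a quick way to query those implementation-defined limits on your own system:

    #include <iostream>
    #include <string>
    #include <vector>

    // Print the theoretical maximum element counts for string and vector<char>.
    // Real limits are lower: whatever memory the OS will actually give you.
    int main() {
        std::cout << "string max_size:       " << std::string().max_size() << '\n';
        std::cout << "vector<char> max_size: " << std::vector<char>().max_size() << '\n';
    }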
The SGI version of STL has a ROPE class (A rope is a big string, get it).
I am not sure it is designed to handle that much data but you can have a look.
http://www.sgi.com/tech/stl/Rope.html
If all you're trying to do is read in some massive file and write to another file the same data with periods interspersed between each character, why bother reading the whole thing into memory at once? Pick some reasonable buffer size and do it in chunks.
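A rough sketch of that chunked variant, assuming the input and output are plain files passed on the command line (the buffer size is arbitrary); it is the buffered counterpart of the character-by-character sketch above:

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Chunked version: read fixed-size blocks, expand each block into
    // "char dot char dot ..." in memory, and write it out. Memory use stays
    // at a few megabytes regardless of file size.
    int main(int argc, char** argv) {
        if (argc != 3) { std::fprintf(stderr, "usage: %s in out\n", argv[0]); return 1; }
        std::FILE* in  = std::fopen(argv[1], "rb");
        std::FILE* out = std::fopen(argv[2], "wb");
        if (!in || !out) return 1;

        const std::size_t BUF = 1 << 20;              // 1 MiB per read
        std::vector<char> inBuf(BUF), outBuf(2 * BUF);
        bool first = true;
        std::size_t n;
        while ((n = std::fread(inBuf.data(), 1, BUF, in)) > 0) {
            std::size_t m = 0;
            for (std::size_t i = 0; i < n; ++i) {
                if (!first) outBuf[m++] = '.';
                outBuf[m++] = inBuf[i];
                first = false;
            }
            std::fwrite(outBuf.data(), 1, m, out);
        }
        std::fclose(in);
        std::fclose(out);
    }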
I want to count words occurrences in a set of plain text files. Just like here http://doc.trolltech.com/4.5/qtconcurrent-wordcount-main-cpp.html
The problem is that I need to process a very big amount of plain text files, so my result stored in a QMap cannot fit into memory.
I googled external-memory (file-based) merge sort algorithms, but I'm too lazy to implement one myself. So I want to divide the result set into portions that each fit into memory, store these portions in files on disk, then call a magic function mergeSort(QList, result_file) and have the final result in result_file.
Does anyone know a Qt-compatible implementation of this algorithm?
In short, I'm looking for an analog of Python's heapq.merge (http://docs.python.org/library/heapq.html#heapq.merge), but for Qt containers.
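For illustration, here is a rough C++ sketch of that merge step, assuming each sorted run file holds lines of the form "word count", sorted by word (the file format and names are made up). A min-heap pops the runs in word order and sums counts for equal words, much like heapq.merge followed by an aggregation pass; it uses plain STL streams and containers rather than Qt classes just to show the shape of the algorithm.

    #include <fstream>
    #include <iostream>
    #include <memory>
    #include <queue>
    #include <string>
    #include <vector>

    // One sorted run file, advanced one "word count" record at a time.
    struct Source {
        std::ifstream in;
        std::string word;
        long long count = 0;
        bool next() { return static_cast<bool>(in >> word >> count); }
    };

    // k-way merge of sorted runs; counts of equal words are summed on output.
    void mergeRuns(const std::vector<std::string>& runFiles, std::ostream& out) {
        auto cmp = [](const Source* a, const Source* b) { return a->word > b->word; };
        std::priority_queue<Source*, std::vector<Source*>, decltype(cmp)> heap(cmp);

        std::vector<std::unique_ptr<Source>> sources;
        for (const auto& path : runFiles) {
            auto s = std::make_unique<Source>();
            s->in.open(path);
            if (s->next()) heap.push(s.get());
            sources.push_back(std::move(s));
        }

        std::string current;
        long long total = 0;
        while (!heap.empty()) {
            Source* s = heap.top(); heap.pop();
            if (s->word != current) {                 // new word: flush the previous one
                if (total > 0) out << current << ' ' << total << '\n';
                current = s->word;
                total = 0;
            }
            total += s->count;
            if (s->next()) heap.push(s);              // advance this run
        }
        if (total > 0) out << current << ' ' << total << '\n';
    }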
You might wanna check out this one:
http://stxxl.sourceforge.net/
It's not exactly what you are looking for (close enough, though), but I guess you will not find exactly what you want working with Qt lists. Since you are implementing the algorithm that creates this list, changing its type shouldn't be a problem. As far as I remember, you can use the standard STL sorting algorithms on those lists. The only remaining problem is performance.
I presume that the map contains the association between a word and its number of occurrences. In that case, why do you say you have such significant memory consumption? How many distinct words and forms could you have, and what is the average memory consumption for one word?
Considering 1,000,000 words, with 1 KB of memory consumption per word (that includes the word text and the QMap-specific storage), that would lead to (approx.) 1 GB of memory, which... doesn't seem so much to me.
I want to store lots of information in a block, bit by bit, and save it to a file.
To keep my file from getting too big, I want to use a small number of bits to save a given piece of information, instead of an int.
For example, I want to store Day, Hour, Minute to a file.
I only want 5 bits (day) + 5 bits (hour) + 6 bits (minute) = 16 bits of storage for this data.
I cannot find an efficient way to store it in a block to put in a file.
There are some big problems in my situation:
The data length I want to store each time is not constant; it depends on the incoming information. So I cannot use a structure to store it.
There must not be any unused bits in my block. I found some topics mentioning that if I store 30 bits in an int (a 4-byte variable), then the next 3 bits I save will automatically go into the next int, but I do not want that to happen!
I know I can use shift-right and shift-left to put a number into a char, and put the char into a block, but it is inefficient.
I want a char array that I can keep putting specified bits into, and then use write to put it into a file.
I think I'd just use the number of bits necessary to store the largest value you might ever need for any given piece of information. Then, Huffman encode the data as you write it (and obviously Huffman decode it as you read it). Most other approaches are likely to be less efficient, and many are likely to be more complex as well.
I haven't seen such a library. So I'm afraid you'll have to write one yourself. It won't be difficult, anyway.
And about efficiency: this kind of operation always needs bit shifting and masking, because few CPUs support operating directly on bits, especially across two machine words. The only difference is whether you or your compiler does the translation.
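If it helps, here is a rough sketch of such an append-only bit buffer in C++ (class and method names are made up): values of arbitrary widths are packed back to back with no unused bits between fields, using exactly the shift-and-mask approach described above, and the byte array can then be written to a file in one call.

    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <vector>

    // Append-only bit buffer: values are packed back to back with no unused
    // bits between fields. Plain shift-and-mask, as described above.
    class BitWriter {
    public:
        // Append the low `bits` bits of `value`, most significant bit first.
        void put(std::uint32_t value, unsigned bits) {
            for (unsigned i = bits; i-- > 0;) {
                if (bitPos_ == 0) buf_.push_back(0);          // start a new byte
                if ((value >> i) & 1u)
                    buf_.back() |= static_cast<std::uint8_t>(1u << (7 - bitPos_));
                bitPos_ = (bitPos_ + 1) % 8;
            }
        }

        // Write the packed bytes out; the last byte may be partially used.
        void save(const std::string& path) const {
            std::ofstream out(path, std::ios::binary);
            out.write(reinterpret_cast<const char*>(buf_.data()),
                      static_cast<std::streamsize>(buf_.size()));
        }

    private:
        std::vector<std::uint8_t> buf_;
        unsigned bitPos_ = 0;   // next free bit within buf_.back(), 0..7
    };

    // Usage: 5-bit day, 5-bit hour, 6-bit minute -> exactly 2 bytes on disk.
    // BitWriter w; w.put(day, 5); w.put(hour, 5); w.put(minute, 6); w.save("t.bin");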