CouchDB size keeps growing after deleting - mapreduce

I want to reduce the disk size of the database by deleting old metrics, i.e. those older than 3 hours. But what I currently see is that when I delete & clean up, the number of documents goes down but the size of the db increases.
So, after some time of this constant auto clean-up, I see that the size of the database has grown very much, but the number of documents remains constant because of the deleting. To delete, I do a bulk_update of the items to be deleted and then run compact & cleanup.
Where can I read how this mechanism actually works, and how should I delete the data properly? In other words, how do I keep the size of the database constant?

If you delete a document in CouchDB, the document is only marked as deleted, but its content stays in the database (this is due to the append-only design of CouchDB).
Last year I wrote a blog post about this topic, laying out three different approaches to solving this problem. Maybe one of them is suitable for you.

Related

Indexing a large file (32gb worth of file)

Apologies in advance, as I think I need to give some background to my problem.
We have a proprietary database engine written in native C++ built for a 32-bit runtime. The database records are identified by their record number (basically the offset in the file where the record is written) and a "unique id" (which is nothing more than a value from -100 down to LONG_MIN).
Previously the engine limited a database to only 2 GB (where a record block could have a minimum size of 512 bytes, up to 512*(1 to 7)). This effectively limited the number of records to about 4 million.
We index these 4 million records and store the index in a hashtable (we implemented extensible hashing for this), and it works brilliantly for a 2 GB database. Each index entry is 24 bytes. Each record's record number is indexed, as well as the record's "unique id" (the index entries reside in the heap, and both the record number and the "unique id" can point to the same entry in the heap). The index is persisted in memory and stored in the file (however, only the record-number-based entries are stored in the file). While in memory, a 2 GB database's index consumes about 95 MB, which is still fine in a 32-bit runtime (but we limited the software to host about 7 databases per database engine as a safety measure).
The problem began when we decided to increase the size of the database from 2 GB to 32 GB. This effectively increased the number of records to about 64 million, which means the hashtable would contain 1.7 GB worth of index entries in heap memory for a single 32 GB database alone.
I ditched the in-memory hashtable and wrote the index straight to a file, but I failed to consider the time it would take to search for an index entry in the file, considering I cannot sort the index on demand (because writes to the database happen all the time, which means the index must be updated almost immediately). Basically I'm having problems with re-indexing: our software needs to check whether a record exists, and it does so by looking it up in the current index, but since I changed it from an in-memory to a file-I/O index, it now takes forever just to finish indexing 32 GB (indexing 2 GB, as I have computed it, would apparently take 3 days to complete).
I then decided to store the index entries in order based on record number, so I don't have to search for them in the file, and structured my index entry like this:
struct node {
    long recNum;   // record number
    long uId;      // unique id
    long prev;
    long next;
    long rtype;
    long parent;
};
It works perfectly if I use recNum to determine where in the file the index record is stored and retrieve it using read(...), but my problem is when the search is based on the "unique id".
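For reference, the record-number lookup that already works is essentially this (a simplified sketch, assuming POSIX pread and that recNum maps directly to a slot in the index file; the file name is made up):

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

struct node {      // same layout as above
    long recNum;   // record number (assumed here to be the slot in the index file)
    long uId;      // unique id
    long prev;
    long next;
    long rtype;
    long parent;
};

// Fetch the index entry for a given record number with one O(1) read.
bool readIndexByRecNum(int fd, long recNum, node& out)
{
    off_t offset = static_cast<off_t>(recNum) * sizeof(node);
    ssize_t n = pread(fd, &out, sizeof(node), offset);
    return n == static_cast<ssize_t>(sizeof(node));
}

int main()
{
    int fd = open("index.dat", O_RDONLY);   // hypothetical index file
    if (fd < 0) return 1;

    node entry{};
    if (readIndexByRecNum(fd, 12345, entry))
        std::printf("recNum %ld has uId %ld\n", entry.recNum, entry.uId);

    close(fd);
    return 0;
}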
When I do a search on the index file based on "unique id", what I'm essentially doing is loading chunks of the 1.7 GB index file and checking the "unique id" of each entry until I get a hit; however, this proves to be a very slow process. I attempted to create an index of the index so that I could loop more quickly, but it is still slow. Basically, there is a function in the software that will eventually check every record in the database by first checking whether it exists in the index using the "unique id" query, and when this function runs, finishing the 1.7 GB index would take 4 weeks by my calculation if I implement a file-based index query and write.
So I guess what I'm trying to ask is: when dealing with large databases (such as 30 GB worth of data), persisting the index in memory in a 32-bit runtime probably isn't an option due to limited resources, so how does one implement a file-based index or hashtable without sacrificing time (at least not so much that it's impractical)?
It's quite simple: Do not try to reinvent the wheel.
Any full SQL database out there is easily capable of storing and indexing tables with several million entries.
For a large table you would commonly use a B+Tree. You don't need to rebalance the tree on every insert, only when a node overflows its maximum size or falls below its minimum fill. This gives a bad worst-case runtime, but the cost is amortized.
There is also a lot of logic involved in efficiently and dynamically caching and evicting parts of the index in memory. I strongly advise against trying to re-implement all of that on your own.
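For illustration, here is a minimal sketch of what "not reinventing the wheel" could look like with SQLite as the index store (any SQL engine would do; the table and column names are made up, and error handling is omitted):

#include <sqlite3.h>
#include <cstdio>

int main()
{
    sqlite3* db = nullptr;
    if (sqlite3_open("index.db", &db) != SQLITE_OK) return 1;

    // One row per record; the secondary index makes uId lookups logarithmic.
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS rec_index ("
        "  recNum INTEGER PRIMARY KEY,"
        "  uId    INTEGER NOT NULL);"
        "CREATE INDEX IF NOT EXISTS idx_uid ON rec_index(uId);",
        nullptr, nullptr, nullptr);

    // Look a record up by its unique id.
    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db,
        "SELECT recNum FROM rec_index WHERE uId = ?;", -1, &stmt, nullptr);
    sqlite3_bind_int64(stmt, 1, -100);                    // example unique id
    if (sqlite3_step(stmt) == SQLITE_ROW)
        std::printf("found recNum %lld\n",
                    (long long)sqlite3_column_int64(stmt, 0));

    sqlite3_finalize(stmt);
    sqlite3_close(db);
    return 0;
}

SQLite keeps its indexes as B-trees on disk and manages its own page cache, which is exactly the caching/eviction logic mentioned above.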

Efficient lookup of a buffer with stack of data modifications applied

I am trying to write a C++11 library, as part of a wider project, that implements a stack of changes (modifications, insertions and deletions) on top of an original buffer. The aim is then to be able to quickly look "through" the changes and get the modified data out.
My current approach is:
Maintain an ordered list of changes, ordered by offset of the start of the change
Also maintain a stack of the same changes, so they can be rolled back in order
New changes are pushed onto the stack and inserted into the list at the right place
The changes-by-offset list may be modified if the change interacts with others
For example, a modification of bytes 5-10 invalidates the start of an earlier modification from 8-12
Also, insertion or deletion changes will change the apparent offset of data occurring after them (deleting bytes 5-10 means that what used to be byte 20 is now found at 15)
To find the modified data, you can look through the list for the change that applies (and the offset within that change that applies; another change might have invalidated some of it), or find the right offset in the original data if no change touched that offset (see the sketch below)
The aim here is to make the lookup fast: adding a change might take some effort to mess with the list, but the lookups that follow, which will greatly outnumber the modifications, should be pretty straightforward against an ordered list.
Also, you don't need to continuously copy data; each change's data is kept with it, and the original data is untouched
Undo is then implemented by popping the last change off the stack and rolling back any modifications that its addition made to the other entries in the list.
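To make the lookup concrete, here is a stripped-down sketch of what I mean by looking "through" the change list. It only handles overwrite-style changes, and it assumes the insertion code (not shown) keeps the stored changes non-overlapping and deals with the offset-shifting inserts/deletes:

#include <cstdint>
#include <map>
#include <vector>

struct Change {
    std::vector<std::uint8_t> data;   // bytes written by this change
};

// Changes keyed by their start offset, kept non-overlapping by insertion code.
using ChangeList = std::map<std::size_t, Change>;

// Read one byte "through" the change list: O(log n) in the number of changes.
std::uint8_t readByte(const std::vector<std::uint8_t>& original,
                      const ChangeList& changes,
                      std::size_t offset)
{
    // Find the last change starting at or before `offset`.
    auto it = changes.upper_bound(offset);
    if (it != changes.begin()) {
        --it;
        std::size_t start = it->first;
        const auto& bytes = it->second.data;
        if (offset - start < bytes.size())
            return bytes[offset - start];   // covered by this change
    }
    return original[offset];                // untouched: comes from the original
}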
This seems to be quite a difficult task - there are a lot of things to take care of and I am quickly piling up complex code!
I feel sure that this must be a problem that has been dealt with in other software, but looking around various hex editors and so on hasn't pointed me to a useful implementation. Is there a name for this problem ("data undo stack" and friends haven't got me very far!), or a library that can be used, even as a reference, for this kind of thing?
I believe the most common approach (one I have used successfully in the past) is to simply store the original state and then put each change operation (what's being done + arguments) on the undo stack. Then, to get to a particular prior state you start from the original and apply all changes except the ones you want undone.
This is a lot easier to implement than trying to identify what parts of the data changed, and it works well unless the operations themselves are very time-consuming (and therefore slow to "replay" onto the original state).
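In case it helps, here is a minimal sketch of that replay scheme, with the buffer as a byte vector and each operation stored as a small callable (the class and names are only illustrative):

#include <cstdint>
#include <functional>
#include <vector>

using Buffer = std::vector<std::uint8_t>;
using Operation = std::function<void(Buffer&)>;   // applies one edit

class EditHistory {
public:
    explicit EditHistory(Buffer original) : original_(std::move(original)) {}

    void apply(Operation op) { ops_.push_back(std::move(op)); }

    void undo() { if (!ops_.empty()) ops_.pop_back(); }

    // Rebuild the current state by replaying every remaining operation
    // onto a copy of the original buffer.
    Buffer current() const {
        Buffer buf = original_;
        for (const auto& op : ops_) op(buf);
        return buf;
    }

private:
    Buffer original_;
    std::vector<Operation> ops_;
};

// Usage: record an overwrite of bytes 5..7, then undo it.
//   EditHistory h(Buffer(32, 0));
//   h.apply([](Buffer& b) { b[5] = 1; b[6] = 2; b[7] = 3; });
//   h.undo();                      // current() is the original again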
I would look at persistent data structures, such as https://en.wikipedia.org/wiki/Persistent_data_structure and http://www.toves.org/books/persist/#s2 - or websearch on terms from these. I think you could do this with a persistent tree whose leaves carry short strings.

How to handle allocation/deallocation for small objects of variable size in C++

I am currently writing C++ code to store and retrieve tabular data (e.g. a spreadsheet) in memory. The data is loaded from a database. The user can work with the data, and there is also a GUI class which should render the tabular data. The GUI renders only a few rows at a time, but the tabular data could contain hundreds of thousands of rows.
My classes look like this:
Table: provides access to rows (by index) and column-definitions (by name of column)
Column: contains the column definition like name of the column and data-type
Row: contains multiple fields (as many as there are columns) and provides access to those fields (by column name)
Field: contains some "raw" data of variable length and methods to get/set this data
With this design, a table with 40 columns and 200k rows contains over 8 million objects. After some experiments I saw that allocating and deallocating 8 million objects is a very time-consuming task. Some research showed that other people use custom allocators (like Boost's pool_allocator) to solve that problem. The problem is that I can't use them in my problem domain, since their performance boost comes from relying on the fact that all allocated objects have the same size. This is not the case in my code, since my objects differ in size.
Are there any other techniques I could use for memory management? Or do you have suggestions about the design?
Any help would be greatly appreciated!
Cheers,
gdiquest
Edit: In the meantime I found out what my problem was. I started my program under Visual Studio, which means that the debugger was attached to the debug build and also to the release build. With an attached debugger my executable uses a so-called debug heap, which is very slow. (Further details here.) When I start my program without a debugger attached, everything is as fast as I would have expected.
Thank you all for participating in this question!
Why not just allocate 40 large blocks of memory, one for each column? Most of the columns will hold fixed-length data, which makes those easy and fast, e.g. vector<int> col1(200000). For the variable-length ones, just use vector<string> col5(200000). The small string optimization will ensure that your short strings require no extra allocation; only rows with longer strings (generally > 15 characters) will require allocations.
If your variable-length columns are not storing strings, then you could also use vector<vector<unsigned char>>. This also allows a nice pre-allocation strategy. E.g., assuming the biggest variable-length field in this column is 100 bytes, you could do:
vector<vector<unsigned char>> col2(200000);
for (auto& cell : col2)
{
    cell.resize(100);
}
Now you have a preallocated column that supports 200,000 rows with a maximum data length of 100 bytes. I would definitely go with the std::string version, though, if you can, as it is conceptually simpler.
Try rapidjson allocators; they are not limited to objects of the same size, AFAIK.
You might attach an allocator to a table and allocate all table objects with it.
For more granularity, you might have row or column pools.
Apache does this, attaching all data to request and connection pools.
If you want them to be STL-compatible then perhaps this answer will help to integrate them, although I'm not sure. (I plan to try something like this myself, but haven't gotten to it yet).
Also, some allocators might be faster than what your system offers by default. TCMalloc, for example. (See also). So, you might want to profile and see whether using a different system allocator helps.

How to deal with an atomicity situation

Hi, imagine I have code like this:
0.  void someFunction()
1.  {
2.      ...
3.      if (x > 5)
4.          doSmth();
5.
6.      writeDataToCard(handle, data1);
7.
8.      writeDataToCard(handle, data2);
9.
10.     incrementDataOnCard(handle, data);
11. }
The thing is the following: if steps 6 & 8 get executed, and then someone, say, removes the card, then operation 10 will not complete successfully. But this would be a bug in my system. Meaning, if 6 & 8 are executed, then 10 MUST also be executed. How do I deal with such situations?
Quick summary: what I mean is that, say, after step 8 someone may remove my physical card, which means that step 10 will never be reached, and that will cause a problem in my system. Namely, the card will be initialized with incomplete data.
You will have to create some kind of protocol, for instance you write to the card a list of operations to complete:
Step6, Step8, Step10
and as you complete the tasks you remove the corresponding entry from the list.
When you re-read the data from the card, you check whether any entry remains in the list. If it does, the operation did not complete successfully before, and you restore a previous state.
Unless you can somehow physically prevent the user from removing the card, there is no other way.
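A rough sketch of that journal idea, with an ordinary file standing in for the card (the step names and file layout are made up, and a real card would need a reliably written journal area of its own):

#include <algorithm>
#include <fstream>
#include <string>
#include <vector>

// Write the list of pending steps to the card before touching any data.
void writeJournal(const std::vector<std::string>& steps)
{
    std::ofstream out("journal.txt", std::ios::trunc);
    for (const auto& s : steps) out << s << '\n';
}

// Mark a step as done by rewriting the journal without it.
void completeStep(std::vector<std::string>& steps, const std::string& done)
{
    steps.erase(std::remove(steps.begin(), steps.end(), done), steps.end());
    writeJournal(steps);
}

// A non-empty journal means a previous run was interrupted and the data
// must be rolled back (or repaired) before it is trusted.
bool journalIsClean()
{
    std::ifstream in("journal.txt");
    std::string line;
    return !std::getline(in, line);
}

int main()
{
    std::vector<std::string> steps = {"Step6", "Step8", "Step10"};
    writeJournal(steps);
    // ... perform step 6 ...
    completeStep(steps, "Step6");
    // ... and so on; journalIsClean() is checked on the next startup.
    return 0;
}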
If the transaction is interrupted then the card is in a fault state. You have three options:
Do nothing. The card is in a fault state, and it will remain there. Advise users not to play with the card. The card can be eligible for a complete clean or format.
Roll back the transaction the next time the card becomes available. You need enough information on the card and/or some central repository to perform the rollback.
Complete the transaction the next time the card becomes available. You need enough information on the card and/or some central repository to perform the completion.
In all three cases you need to have a flag on the card denoting a transaction in progress.
More details are required in order to answer this.
However, making some assumptions, I will suggest two possible solutions (more are possible...).
I assume the write operations are persistent, hence data written to the card is still there after the card is removed and reinserted, and that you are referring to the coherency of the data on the card, not the state of the program performing the function calls.
Also assumed is that the increment method increments the data already written, and that the system must have this operation done in order to guarantee consistency:
For each record written, maintain another data element (on the card) that indicates the record's state. This state will be initialized to something (say "WRITING" state) before performing the writeData operation. This state is then set to "WRITTEN" after the incrementData operation is (successfully!) performed.
When reading from the card, you first check this state and ignore (or delete) the record if it's not WRITTEN.
Another option would be to maintain two (persistent) counters on the card: one counting the number of records that began writing, the other counting the number of records that finished writing.
You increment the first before performing the write, and then increment the second after (successfully) performing the incrementData call.
When later reading from the card, you can easily check whether a record is indeed valid or needs to be discarded.
This option is valid if the written records are somehow ordered or indexed, so you can see which and how many records are valid just by checking the counters. It has the advantage of requiring only two counters for any number of records (compared to one state flag for EACH record in option 1).
On the host (software) side you then need to check that the card is available prior to beginning the write (don't write if it's not there). If after the incrementData op you detect that the card was removed, you need to be sure to tidy things up (remove unfinished records, update the counters), either once you detect that the card has been reinserted, or before doing another write. For this you'll need to maintain state information on the software side.
Again, the type of solution (out of many more) depends on the exact system and requirements.
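A rough sketch of option 1 (the per-record state flag); the Handle/Data types and the card I/O calls are only stand-ins for whatever the real API looks like:

#include <cstdio>

struct Handle {};                 // stand-in for the real card handle
struct Data  { int value; };      // stand-in for the payload written to the card

enum class RecordState { WRITING, WRITTEN };

// Stubs standing in for the real card I/O; here they only print what they would do.
void writeState(Handle, int recordId, RecordState s)
{
    std::printf("record %d -> %s\n", recordId,
                s == RecordState::WRITTEN ? "WRITTEN" : "WRITING");
}
void writeDataToCard(Handle, const Data& d)     { std::printf("write %d\n", d.value); }
void incrementDataOnCard(Handle, const Data& d) { std::printf("increment %d\n", d.value); }

// The state flag brackets the non-atomic group of writes: the record only
// counts as valid once the final flag update has been written.
void writeRecord(Handle h, int recordId, const Data& d1, const Data& d2, const Data& counter)
{
    writeState(h, recordId, RecordState::WRITING);   // mark "in progress" first
    writeDataToCard(h, d1);
    writeDataToCard(h, d2);
    incrementDataOnCard(h, counter);
    writeState(h, recordId, RecordState::WRITTEN);   // commit only at the very end
}

int main()
{
    writeRecord(Handle{}, 7, Data{1}, Data{2}, Data{3});
    return 0;
}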
Isn't that just:
Copy data to temporary_data.
Write to temporary_data.
Increment temporary_data.
Rename data to old_data.
Rename temporary_data to data.
Delete the old_data.
You will still have a race condition (if a user removes the card at just the wrong moment) at the two rename steps, but you might be able to restore from data or temporary_data.
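A sketch of that sequence using std::filesystem, with plain files standing in for the data on the card (the file names are made up, and a real card would need rename and delete primitives with comparable guarantees):

#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

void updateAtomically(const fs::path& data)
{
    fs::path tmp = data;  tmp += ".tmp";   // temporary_data
    fs::path old = data;  old += ".old";   // old_data

    // Copy data to temporary_data, then apply the writes and the increment
    // to the temporary copy only.
    fs::copy_file(data, tmp, fs::copy_options::overwrite_existing);
    { std::ofstream out(tmp, std::ios::app); out << "new data\n"; }

    fs::rename(data, old);   // data -> old_data
    fs::rename(tmp, data);   // temporary_data -> data
    fs::remove(old);         // delete old_data
}

int main()
{
    { std::ofstream out("data.bin"); out << "original\n"; }   // seed a data file
    updateAtomically("data.bin");
    return 0;
}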
You haven't said what you're incrementing (or why), or how your data is structured (presumably there is some relationship between whatever you're writing with writeDataToCard and whatever you're incrementing).
So, while there may be clever techniques specific to your data, we don't have enough to go on. Here are the obvious general-purpose techniques instead:
the simplest thing that could possibly work - full-card commit-or-rollback
Keep two copies of all the data, the good one and the dirty one. A single byte at the lowest address is sufficient to say which is the current good one (it's essentially an index into an array of size 2).
Write your new data into the dirty area, and when that's done, update the index byte (so swapping clean & dirty).
Either the index is updated and your new data is all good, or the card is pulled out and the previous clean copy is still active.
Pro - it's very simple
Con - you're wasting exactly half your storage space, and you need to write a complete new copy to the dirty area when you change anything. You haven't given enough information to decide whether this is a problem for you.
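A sketch of this two-copy layout, with a file standing in for the card; the sizes and offsets are made up:

#include <cstdint>
#include <cstdio>

constexpr long kCopySize = 512;                      // size of one data copy
constexpr long kCopyOffset[2] = {1, 1 + kCopySize};  // byte 0 holds the index

// Write the new data into the non-current slot, then flip the index byte.
bool commit(std::FILE* card, const std::uint8_t* data, std::size_t len)
{
    if (len > static_cast<std::size_t>(kCopySize)) return false;

    std::fseek(card, 0, SEEK_SET);
    int current = std::fgetc(card);
    if (current != 0 && current != 1) current = 0;   // uninitialised card
    int dirty = 1 - current;

    // 1) write the complete new copy into the dirty slot
    std::fseek(card, kCopyOffset[dirty], SEEK_SET);
    std::fwrite(data, 1, len, card);
    std::fflush(card);

    // 2) only now flip the index byte; until this single write lands,
    //    the previous clean copy is still the active one
    std::fseek(card, 0, SEEK_SET);
    std::fputc(dirty, card);
    std::fflush(card);
    return true;
}

int main()
{
    std::FILE* card = std::fopen("card.bin", "w+b");
    if (!card) return 1;
    const char msg[] = "hello";
    commit(card, reinterpret_cast<const std::uint8_t*>(msg), sizeof msg);
    std::fclose(card);
    return 0;
}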
... now using less space ... - commit-or-rollback smaller subsets
If you can't waste 50% of your storage, split your data into independent chunks and version each of those independently. Now you only need enough space to duplicate your largest single chunk, but instead of a simple index you need an offset or pointer for each chunk.
Pro - still fairly simple
Con - you can't handle dependencies between chunks, they have to be isolated
journalling
As per RedX's answer, this is used by a lot of filesystems to maintain integrity.
Pro - it's a solid technique, and you can find documentation and reference implementations for existing filesystems
Con - you just wrote a modern filesystem. Is this really what you wanted?

Best way to store addresses of pointers to objects of a class

I have a class A whose objects are created dynamically:
A *object;
object = new A;
There will be many objects of A during an execution.
I have created a method in A to return the address of a particular object depending on the passed id:
A *A::get_obj(int id)
The implementation of get_obj requires iteration, so I chose a vector to store the addresses of the objects:
vector<A *> myvector
I think another way to do this is by creating a file and storing the address as text on a particular line (the line number being the id).
This would help me reduce memory usage, as I would not create a vector then.
What I don't know is: will this method consume more time than the vector method?
Any other option for doing the same is welcome.
Don't store pointers in files. Ever. The A objects take up more space than the pointers to them anyway. If you need more A's than you can hold onto at one time, then you need to create them as needed and serialize the instance data to disk if you need to get them back later before deleting them, but not the address.
will this method consume more time than the vector method?
Yes, it will consume a lot more time - every lookup will take several thousand times longer. This does not hurt if lookups are rare, but if they are frequent, it could be bad.
this will help me reduce memory usage
How many objects are you expecting to manage this way? Are you certain that memory usage will be a problem?
any other option of doing the same is welcome
These are your two options, really. You can either manage the list in memory, or on disk. Depending on your usage scenario, you can combine both methods. You could, for instance, keep frequently used objects in memory, and write infrequently used ones out to disk (this is basically caching).
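For the in-memory route, here is a minimal sketch of an id-to-object registry built on std::unordered_map (average O(1) lookups); the Registry class and the ownership scheme are illustrative, not taken from your code:

#include <memory>
#include <unordered_map>

class A {
public:
    explicit A(int id) : id_(id) {}
    int id() const { return id_; }
private:
    int id_;
};

class Registry {
public:
    // Create an object and remember its address under the given id.
    A* create(int id) {
        auto obj = std::make_unique<A>(id);
        A* raw = obj.get();
        objects_[id] = std::move(obj);   // the registry owns the object
        return raw;
    }

    // Equivalent of A *A::get_obj(int id) from the question.
    A* get_obj(int id) const {
        auto it = objects_.find(id);
        return it != objects_.end() ? it->second.get() : nullptr;
    }

private:
    std::unordered_map<int, std::unique_ptr<A>> objects_;
};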
Storing your data in a file will be considerably slower than in RAM.
Regarding the data structure itself: if you usually use all the IDs, that is, if your vector would usually have no empty cells, then std::vector is probably the most suitable approach. But if your vector would have many empty cells, std::map may give you a better solution. It will consume less memory and give O(log N) access complexity.
The most important thing here, IMHO, is the size of your data set and your platform. For a modern PC, handling an in-memory map of thousands of entries is very fast, but if you handle gigabytes of data, you'd better store it in a real on-disk database (e.g. MySQL).