Why isn't there a size method for Flink MapState?

I need to count daily orders per user. The data may arrive late for quite a long time, so I think I can use a MapState[String, Long] to store the count, where the key is the date and the value is the number of orders for that date. But as time goes by, one more key-value entry is added per user per day, and the state could eventually become too big. Since the data won't be later than one day, I only need to keep two days of data. In this situation I need to remove the earliest date whenever the size of the MapState[String, Long] reaches 3, but I found out that there is no size method on Flink's MapState.
I know I can use an iterator to achieve this, and that is exactly what I did. But since java.util.Map has a size method, why isn't there a size method for Flink's MapState?

Check out https://issues.apache.org/jira/browse/FLINK-5917; it explains the reason. size() has been removed since then.

This is because with the RocksDB state backend the underlying representation of MapState doesn't allow for an efficient implementation of a size method.
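For reference, here is a minimal sketch of the iterator workaround mentioned in the question, using Flink's Java API. The DailyOrderCounter class and the Order POJO are made up for illustration; this is not the accepted solution, just one way the eviction could look under those assumptions.

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical operator: keyed by user id, it counts orders per day and keeps at most
// two days of map entries by evicting the oldest date once a third one appears.
public class DailyOrderCounter extends KeyedProcessFunction<String, DailyOrderCounter.Order, Long> {

    public static class Order {
        public String userId;
        public String date;   // e.g. "2019-06-01"
    }

    private transient MapState<String, Long> countsPerDay;

    @Override
    public void open(Configuration parameters) {
        countsPerDay = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("countsPerDay", String.class, Long.class));
    }

    @Override
    public void processElement(Order order, Context ctx, Collector<Long> out) throws Exception {
        Long current = countsPerDay.get(order.date);
        countsPerDay.put(order.date, current == null ? 1L : current + 1);

        // MapState has no size(), so count the keys by iterating and evict the oldest date.
        int size = 0;
        String oldestDate = null;
        for (String date : countsPerDay.keys()) {
            size++;
            if (oldestDate == null || date.compareTo(oldestDate) < 0) {
                oldestDate = date;
            }
        }
        if (size > 2) {
            countsPerDay.remove(oldestDate);
        }
        out.collect(countsPerDay.get(order.date));
    }
}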

Related

Indexing a large file (32gb worth of file)

Apologies in advance as I think I need to give a background of my problem.
We have a proprietary database engine written in native C++ and built for a 32-bit runtime. Database records are identified by their record number (basically the offset in the file where the record is written) and a "unique id" (which is nothing more than -100 down to LONG_MIN).
Previously the engine limited a database to only 2GB (where a record block could be a minimum of 512 bytes up to 512*(1 to 7)). This effectively limited the number of records to about 4 million.
We index these 4 million records and store the index in a hashtable (we implemented extensible hashing for this), and it works brilliantly for a 2GB database. Each index entry is 24 bytes. Each record's record number is indexed, as well as the record's "unique id" (the index entries reside in the heap, and both the record number and the "unique id" can point to the same entry in the heap). The index is persisted in memory and stored in the file (however, only the record-number-based entries are stored in the file). While in memory, the index for a 2GB database consumes about 95MB, which is still fine in a 32-bit runtime (but we limited the software to hosting about 7 databases per database engine as a safety measure).
The problem began when we decided to increase the size of a database from 2GB to 32GB. This effectively increased the number of records to about 64 million, which means the hashtable would contain 1.7GB worth of index entries in heap memory for a single 32GB database alone.
I ditched the in-memory hashtable and wrote the index straight to a file, but I failed to consider the time it would take to search for an entry in the file, given that I cannot sort the entries on demand (writes to the database happen all the time, which means the index must be updated almost immediately). Basically I'm having problems with re-indexing: the software needs to check whether a record exists by looking it up in the current index, but since I changed it from an in-memory index to file I/O, it now takes forever just to finish indexing 32GB (by my calculation, indexing 2GB alone would apparently take 3 days to complete).
I then decided to store the index entries in order of record number so I don't have to search for them in the file, and structured my index like this:
struct node {
    long recNum; // Record Number
    long uId;    // Unique Id
    long prev;
    long next;
    long rtype;
    long parent;
};
It works perfectly if I use recNum to determine where in the file the index record is stored and retrieve it using read(...), but my problem is searching based on the "unique id".
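For illustration, that record-number lookup amounts to a fixed-size-record seek. A minimal sketch (in Java for brevity rather than the original C++, assuming the 24-byte layout above, six 32-bit fields, big-endian, and zero-based record numbers):

import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch: with fixed-size index records, the entry for recNum lives at byte offset
// recNum * RECORD_SIZE, so no scanning is needed for record-number lookups.
public class IndexFileLookup {
    private static final int RECORD_SIZE = 24;

    public static int[] readByRecNum(RandomAccessFile indexFile, long recNum) throws IOException {
        indexFile.seek(recNum * RECORD_SIZE);      // O(1) positioning
        int[] fields = new int[6];                 // recNum, uId, prev, next, rtype, parent
        for (int i = 0; i < fields.length; i++) {
            fields[i] = indexFile.readInt();
        }
        return fields;
    }
}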
When I do a search on the index file based on "unique id", what I'm essentially doing is loading chunks of the 1.7GB index file and checking the "unique id" of each entry until I get a hit; however, this proves to be a very slow process. I attempted to create an index of the index so that I could loop through it more quickly, but it is still slow. Basically, there is a function in the software that will eventually check every record in the database by first checking whether it exists in the index using the "unique id" query, and if this function runs against a file-based index, finishing the 1.7GB index would take 4 weeks by my calculation.
So I guess what I'm trying to ask is: when dealing with large databases (such as 30GB worth of data), persisting the index in memory in a 32-bit runtime probably isn't an option due to limited resources, so how does one implement a file-based index or hashtable without sacrificing time (at least not so much that it's impractical)?
It's quite simple: Do not try to reinvent the wheel.
Any full SQL database out there is easily capable of storing and indexing tables with several million entries.
For a large table you would commonly use a B+Tree. You don't need to rebalance the tree on every insert, only when a node falls below its minimum or exceeds its maximum size. This gives a bad worst-case runtime, but the cost is amortized.
There is also a lot of logic involved in efficiently, dynamically caching and evicting parts of the index in memory. I strongly advise against trying to re-implement all of that on your own.
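As a hedged illustration of "let a SQL database do the indexing", here is a minimal JDBC sketch. The table layout, the file name and the SQLite driver (xerial's sqlite-jdbc, URL prefix jdbc:sqlite:) are assumptions for the example, not part of the original answer.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class RecordIndexDb {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:record-index.db")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS record_index ("
                         + "rec_num INTEGER PRIMARY KEY, uid INTEGER, rtype INTEGER, parent INTEGER)");
                // The secondary index lets the engine answer uid lookups without a full scan.
                st.execute("CREATE INDEX IF NOT EXISTS idx_uid ON record_index (uid)");
            }
            try (PreparedStatement ps =
                     conn.prepareStatement("SELECT rec_num FROM record_index WHERE uid = ?")) {
                ps.setLong(1, -12345L);            // example "unique id"
                try (ResultSet rs = ps.executeQuery()) {
                    if (rs.next()) {
                        System.out.println("record number: " + rs.getLong(1));
                    }
                }
            }
        }
    }
}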

CouchDB size keeps growing after deleting

I want to reduce the disk size of the database by deleting old metrics, i.e. those older than 3 hours. But what I currently see is that when I do the delete & cleanup, the number of documents is reduced but the size of the db has increased.
So, after some time of this constant auto clean-up, I see that the size of the database has increased very much while the number of documents remains constant (because of the deleting). To delete, I do a bulk_update of the items to be deleted and then run compact & cleanup.
Where can I read how this mechanism actually works, and how should I delete the data properly? In other words, how do I keep the size of the database constant?
If you delete a document in CouchDB, the document is only marked as deleted, but its content stays in the database (this is due to the append-only design of CouchDB).
Last year I wrote a blog post about this topic, laying out three different approaches to solving this problem. Maybe one of them is suitable for you.

How to deal with an atomicity situation

Hi, imagine I have code like this:
0. void someFunction()
1. {
2. ...
3. if(x>5)
4. doSmth();
5.
6. writeDataToCard(handle, data1);
7.
8. writeDataToCard(handle, data2);
9.
10. incrementDataOnCard(handle, data);
11. }
The thing is the following: if steps 6 & 8 get executed and then someone, say, removes the card, then operation 10 will not be completed successfully. But this would be a bug in my system. Meaning, if 6 & 8 are executed then 10 MUST also be executed. How do I deal with such situations?
Quick summary: what I mean is that, say, after step 8 someone may remove my physical card, which means that step 10 will never be reached, and that will cause a problem in my system. Namely, the card will be initialized with incomplete data.
You will have to create some kind of protocol; for instance, you write to the card a list of operations to complete:
Step6, Step8, Step10
and as you complete the tasks you remove the corresponding entry from the list.
When you reread the data from the card, you check whether any entry remains in the list. If one does, the operation did not complete successfully before, and you restore a previous state.
Unless you can somehow physically prevent the user from removing the card, there is no other way.
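A minimal sketch of that protocol idea, in Java; the persist() call standing in for the actual card write is hypothetical, as are all the names. Usage would mirror the list above: begin("Step6", "Step8", "Step10"), then complete(...) after each step.

import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: a list of pending operations is written to the card before the work starts,
// and entries are removed (and the list re-persisted) as each step completes.
public class CardJournal {
    private final Deque<String> pending = new ArrayDeque<>();

    public void begin(String... steps) {
        for (String step : steps) {
            pending.add(step);
        }
        persist();                      // the full to-do list hits the card first
    }

    public void complete(String step) {
        pending.remove(step);
        persist();                      // the list shrinks as steps finish
    }

    // Checked after re-reading the card: a non-empty list means the previous
    // run was interrupted and a previous state should be restored.
    public boolean hasUnfinishedWork() {
        return !pending.isEmpty();
    }

    private void persist() {
        // Hypothetical: serialize 'pending' to a reserved area of the card here.
    }
}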
If the transaction is interrupted then the card is in the fault state. You have three options:
Do nothing. The card is in fault state, and it will remain there. Advise users not to play with the card. Card can be eligible for complete clean or format.
Roll back the transaction the next time the card becomes available. You need enough information on the card and/or some central repository to perform the rollback.
Complete the transaction the next time the card becomes available. You need enough information on the card and/or some central repository to perform the completion.
In all three cases you need to have a flag on the card denoting a transaction in progress.
More details are required in order to answer this.
However, making some assumptions, I will suggest two possible solutions (more are possible...).
I assume the write operations are persistent - hence data written to the card is still there after the card is removed and reinserted - and that you are referring to the coherency of the data on the card, not the state of the program performing the function calls.
Also assumed is that the increment method increments the data already written, and that the system must have this operation done in order to guarantee consistency:
For each record written, maintain another data element (on the card) that indicates the record's state. This state is initialized to something (say a "WRITING" state) before performing the writeData operation, and is then set to "WRITTEN" after the incrementData operation is (successfully!) performed.
When reading from the card, you first check this state and ignore (or delete) the record if it is not WRITTEN.
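A minimal sketch of this first option, in Java; the CardRecord interface is made up to stand in for whatever the real card API looks like.

// Sketch of the per-record state flag: WRITING is persisted before the payload,
// WRITTEN only after the final increment succeeds; readers skip anything not WRITTEN.
public class StateFlagProtocol {

    public enum RecordState { WRITING, WRITTEN }

    // Hypothetical stand-in for the real card API.
    public interface CardRecord {
        void setState(RecordState state);
        RecordState getState();
        void writeData(byte[] data);
        void incrementData();
    }

    public static void writeRecord(CardRecord record, byte[] data1, byte[] data2) {
        record.setState(RecordState.WRITING);    // persisted before anything else
        record.writeData(data1);
        record.writeData(data2);
        record.incrementData();
        record.setState(RecordState.WRITTEN);    // only now is the record valid
    }

    public static boolean isValid(CardRecord record) {
        return record.getState() == RecordState.WRITTEN;
    }
}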
Another option would be to maintain two (persistent) counters on the card: one counting the number of records that began writing, the other counting the number of records that finished writing.
You increment the first before performing the write, and increment the second after (successfully) performing the incrementData call.
When later reading from the card, you can easily check whether a record is indeed valid or needs to be discarded.
This option is valid if the written records are somehow ordered or indexed, so you can see which and how many records are valid just by checking the counters. It has the advantage of requiring only two counters for any number of records (compared to one state flag for EACH record in option 1). A sketch of this variant follows.
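Again a minimal Java sketch; the Card interface and its counter methods are hypothetical placeholders for the real card API.

// Sketch of the two-counter scheme: 'began' is bumped before a record is written,
// 'ended' only after its incrementData step succeeds. On read, records with an
// index at or beyond 'ended' are treated as incomplete and discarded.
public class CounterProtocol {

    // Hypothetical stand-in for two persistent counters stored on the card.
    public interface Card {
        long readBegan();
        long readEnded();
        void writeBegan(long value);
        void writeEnded(long value);
        void writeRecord(long index, byte[] data);
        void incrementData(long index);
    }

    public static void append(Card card, byte[] data) {
        long index = card.readBegan();
        card.writeBegan(index + 1);          // announce the write before doing it
        card.writeRecord(index, data);
        card.incrementData(index);
        card.writeEnded(index + 1);          // commit: the record is now known-good
    }

    public static boolean isComplete(Card card, long index) {
        return index < card.readEnded();
    }
}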
On the host (software) side you then need to check that the card is available prior to beginning the write (don't write if it's not there). If after the incrementData op you detect that the card was removed, you need to be sure to tidy things up (remove unfinished records, update the counters), either once you detect that the card is reinserted or before doing another write. For this you'll need to maintain state information on the software side.
Again, the type of solution (out of many more) depends on the exact system and requirements.
Isn't that just:
Copy data to temporary_data.
Write to temporary_data.
Increment temporary_data.
Rename data to old_data.
Rename temporary_data to data.
Delete the old_data.
You will still have a race condition (if a lucky user removes the card) at the two rename steps, but you might restore the data or temporary_data.
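If the card shows up as a filesystem, the rename dance above might look roughly like this (Java NIO; the file names follow the steps listed, everything else is an assumption for the sketch):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Sketch of the copy/rename sequence: the live 'data' file is only swapped out
// after the temporary copy has been fully written and incremented.
public class RenameCommit {

    public static void commit(Path data, Path temporaryData, Path oldData) throws IOException {
        Files.copy(data, temporaryData, StandardCopyOption.REPLACE_EXISTING); // 1. copy data
        // 2./3. write to and increment temporaryData here
        Files.move(data, oldData, StandardCopyOption.REPLACE_EXISTING);       // 4. data -> old_data
        Files.move(temporaryData, data);                                      // 5. temporary_data -> data
        Files.delete(oldData);                                                // 6. delete old_data
    }
}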
You haven't said what you're incrementing (or why), or how your data is structured (presumably there is some relationship between whatever you're writing with writeDataToCard and whatever you're incrementing).
So, while there may be clever techniques specific to your data, we don't have enough to go on. Here are the obvious general-purpose techniques instead:
the simplest thing that could possibly work - full-card commit-or-rollback
Keep two copies of all the data, the good one and the dirty one. A single byte at the lowest address is sufficient to say which is the current good one (it's essentially an index into an array of size 2).
Write your new data into the dirty area, and when that's done, update the index byte (so swapping clean & dirty).
Either the index is updated and your new data is all good, or the card is pulled out and the previous clean copy is still active.
Pro - it's very simple
Con - you're wasting exactly half your storage space, and you need to write a complete new copy to the dirty area when you change anything. You haven't given enough information to decide whether this is a problem for you.
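A minimal sketch of that scheme in Java; the Card interface is made up, and the single-byte index write is assumed to be atomic, which is the whole point of the technique.

// Sketch of full-card commit-or-rollback: two data areas plus a one-byte index
// at the lowest address that says which area currently holds the good copy.
public class DoubleBufferedCard {

    // Hypothetical stand-in for the real card API.
    public interface Card {
        int readIndexByte();                 // returns 0 or 1
        void writeIndexByte(int which);      // assumed atomic single-byte write
        byte[] readArea(int which);
        void writeArea(int which, byte[] data);
    }

    public static void update(Card card, byte[] newData) {
        int good = card.readIndexByte();
        int dirty = 1 - good;
        card.writeArea(dirty, newData);      // write the complete new copy first
        card.writeIndexByte(dirty);          // flipping the byte is the commit point
    }

    public static byte[] read(Card card) {
        return card.readArea(card.readIndexByte());
    }
}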
... now using less space ... - commit-or-rollback smaller subsets
if you can't waste 50% of your storage, split your data into independent chunks, and version each of those independently. Now you only need enough space to duplicate your largest single chunk, but instead of a simple index you need an offset or pointer for each chunk.
Pro - still fairly simple
Con - you can't handle dependencies between chunks, they have to be isolated
journalling
As per RedX's answer, this is used by a lot of filesystems to maintain integrity.
Pro - it's a solid technique, and you can find documentation and reference implementations for existing filesystems
Con - you just wrote a modern filesystem. Is this really what you wanted?

Efficiently and safely assigning unique IDs

I am writing a database and I wish to assign every item of a specific type a unique ID (for internal data management purposes). However, the database is expected to run for a long (theoretically infinite) time and with a high turnover of entries (as in, entries being deleted and added on a regular basis).
If we model our unique ID as an unsigned int, and assume that there will always be fewer than 2^32 - 1 (we cannot use 0 as a unique ID) entries in the database, we could do something like the following:
void GenerateUniqueID( Object* pObj )
{
    static unsigned int iCurrUID = 1;
    pObj->SetUniqueID( iCurrUID++ );
}
However, this is only fine until entries start getting removed and others added in their place: there may still be fewer than 2^32 - 1 entries, but iCurrUID may overflow and we end up assigning "unique" IDs which are already in use.
One idea I had was to use a std::bitset<std::numeric_limits<unsigned int>::max() - 1> and traverse it to find the first free unique ID, but this would have a high memory consumption and would take linear time to find a free ID, so I'm looking for a better method if one exists.
Thanks in advance!
I'm aware that changing the datatype to a 64-bit integer instead of a 32-bit integer would resolve my problem; however, because I am working in the Win32 environment and working with lists (with DWORD_PTR being 32 bits), I am looking for an alternative solution. Moreover, the data is sent over a network, and I was trying to reduce bandwidth consumption by using a smaller unique ID.
With a uint64_t (64 bits), it would take you well, well over 100 years, even if you insert somewhere close to 100k entries per second.
Over 100 years at that rate you would insert somewhere around 315,360,000,000,000 records (not taking into account leap years, leap seconds, etc.). This number fits into 49 bits.
How long do you anticipate that application to run?
Over 100 years?
This is what database administrators commonly do when they have an autoincrement field that approaches the 32-bit limit: they change the value to the native 64-bit type (or 128-bit) of their DB system.
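A minimal sketch of the 64-bit counter approach (Java; starting at 1 to respect the question's rule that 0 is not a valid ID). The class name is made up for illustration.

import java.util.concurrent.atomic.AtomicLong;

// Sketch: a 64-bit monotonically increasing ID source. Even at 100k IDs per second
// it would take close to three million years to exhaust the positive long range.
public class UniqueIdSource {
    private final AtomicLong next = new AtomicLong(1);   // 0 is reserved as "no ID"

    public long nextId() {
        return next.getAndIncrement();
    }
}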
The real question is how many entries you can have before you are guaranteed that the first one has been deleted, and how often you create new entries. An unsigned long long is guaranteed to have a maximum value of at least 2^64 - 1, about 1.8x10^19. Even at one creation per microsecond, this will last for a couple of thousand centuries. Realistically, you're not going to be able to create entries that fast (since disk speed won't allow it), and your program isn't going to run for hundreds of centuries (because the hardware won't last that long). If the unique IDs are for something disk based, you're safe using unsigned long long for the ID.
Otherwise, of course, generate as many bits as you think you might need. If you're really paranoid, it's trivial to use a 256-bit unsigned integer, or even longer. At some point, you'll be fine even if every atom in the universe creates a new entry every picosecond, until the end of the universe. (But realistically, unsigned long long should suffice.)

Size/Resize of GHashTable

Here is my use case: I want to use glib's GHashTable with IP addresses as keys and the volume of data sent/received by each IP address as the value. I already succeeded in implementing the whole thing in user space, using some kernel variables to look up the volume per IP address.
Now the question: suppose I have a LOT of IP addresses (i.e. 500,000 up to 1,000,000 unique ones). It is really not clear what space is allocated and what initial size a new hash table gets when it is created with g_hash_table_new()/g_hash_table_new_full(), or how the whole thing works in the background. It is known that resizing a hash table can take a lot of time, so how can we play with these parameters?
Neither g_hash_table_new() nor g_hash_table_new_full() lets you specify the size.
The size of a hash table is only available as the number of values stored in it; you don't have access to the actual array size that is typically used in the implementation.
However, the existence of g_spaced_primes_closest() kind of hints that glib's hash table uses a prime-sized internal array.
I would say that although a million keys is quite a lot, it's not extraordinary. Try it, and then measure the performance to determine if it's worth digging deeper.