How to discern between network flows - c++

I want to be able to discern between network flows. I am defining a flow as a tuple of three values (sourceIP, destIP, protocol). I am storing these in a C++ map for fast access. However, if two packets have their source and destination IPs swapped, e.g.
[packet 1: source = 1.2.3.4, dest = 5.6.7.8]
[packet 2: source = 5.6.7.8, dest = 1.2.3.4 ]
I would like to create a key that treats these as the same.
I could solve this by creating a primary and a secondary key, and if the primary key doesn't match, loop through the elements in my table to see if the secondary key matches, but this seems really inefficient.
I think this might be a perfect opportunity for hashing, but it seems like string hashes are only available through Boost, and we are not allowed to bring in libraries, and I am not sure I know of a hash function that depends only on the elements, not their order.
How can I easily tell flows apart according to these rules?

Compare the values of the source and dest IPs as 64-bit numbers. Use the lower one as the map key, and store the higher one, the protocol, and the direction as the value.
Do lookups the same way, using the lower value as the key.

If you consider that a single client can have more than one connection to a service, you'll see that you actually need four values to uniquely identify a flow: the source and destination IP addresses and the source and destination ports. For example, imagine two developers in the same office are searching StackOverflow at the same time. They'll both connect to stackoverflow.com:80, and they'll both have the same source address. But the source ports will be different (otherwise the company's firewall wouldn't know where to route the returned packets). So you'll need to identify each node by an <address, port> pair.
Some ideas:
As stark suggested, sort the source and destination nodes, concatenate them, and hash the result.
Hash the source, hash the destination, and XOR the result. (Note that this may weaken the hash and allow more collisions.)
Make two entries for each flow: hash both <src_addr, src_port, dst_addr, dst_port> and <dst_addr, dst_port, src_addr, src_port>, add both keys to the map, and point them both to the same data structure.
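The sorted-key idea can be sketched as follows (shown in Python for brevity; the same normalization works as the key of a C++ std::map). The function name is illustrative, not from any library:

```python
def flow_key(src_addr, src_port, dst_addr, dst_port, protocol):
    # Sort the two <address, port> endpoints so that both directions
    # of the same flow normalize to an identical key.
    a = (src_addr, src_port)
    b = (dst_addr, dst_port)
    lo, hi = (a, b) if a <= b else (b, a)
    return (lo, hi, protocol)
```

With a normalized key like this, a single map lookup finds the flow regardless of packet direction, with no secondary-key scan.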


How is the 'finalDigest' calculated in the 'sevLaunchAttestationReportEvent' log entry for confidential VMs in GCP?

I've experimented with launching some confidential VM instances. The simple scenario includes:
Launch an instance named 'Alice'.
Stop and relaunch instance 'Alice'.
Delete instance 'Alice', create a new VM instance named 'Alice'
I checked the 'sevLaunchAttestationReportEvent' log entry.
As expected, the 'guestMemoryRegion' digest was identical in all three cases.
However, the 'finalDigest' was different in all three cases. My questions are:
A. How is the 'finalDigest' calculated?
B. What is the purpose of a 'finalDigest' that is different at each launch of an identical VM image?
C. Can the 'finalDigest' be pre-calculated before instantiation?
Thanks.
First of all, a Confidential Virtual Machine runs on hosts based on the second generation of AMD EPYC processors. It is optimized for security workloads and includes inline memory encryption that ensures data is encrypted while it's in RAM.
You can consult the following documentation to get further information.
Regarding your questions:
A. How is the 'finalDigest' calculated?
To calculate the digest value, a digest algorithm is used; those algorithms could be:
SHA-1
SHA-256
SHA-384
SHA-512
MD5
These are functions that take a large document and compute a "digest" (also called a "hash"); this is typically used in a digital signing process.
B. What is the purpose of a 'finalDigest' that is different at each launch of an identical VM image?
A message digest or hash function is used to turn input of arbitrary length into an output of fixed length, and this output can then be used in place of the original input. The digest can change every time the VM instance is started because some changes were made internally in the instance. The hash algorithm takes those changes into consideration: even if a single byte changes, the digest or hash changes completely.
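This avalanche behaviour is easy to see with Python's hashlib. SHA-256 here only illustrates the property; it is not a claim about which algorithm GCP uses for the finalDigest:

```python
import hashlib

# Two inputs that differ in a single byte.
d1 = hashlib.sha256(b"vm-launch-state-1").hexdigest()
d2 = hashlib.sha256(b"vm-launch-state-2").hexdigest()

# The two digests are completely different despite the one-byte change.
print(d1)
print(d2)
```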
C. Can the 'finalDigest' be pre-calculated before instantiation?
In my opinion this is not feasible because the digest algorithm is a one-way function, that is, a function which is practically infeasible to invert.
You can get more information about the hash functions on this link.

Google Cloud Datastore - Scalability of indexing a short list of enums

Is it an issue in Datastore to index a property that can only have 4-5 possible values? Would this lead to tablet hotspots?
I am thinking of a property with an enum of string values like "done", "working", "complete". The reason for indexing such a property would be so you can create a composite index that lets you query on all entities that are "done", for example.
Yes, it would be an issue if/when you have high rates of queries using these composite indexes you mentioned, as noted in Indexes:
Do not index properties with monotonically increasing values (such as a NOW() timestamp). Maintaining such an index could lead to hotspots that impact Cloud Datastore latency for applications with high read and write rates. For further guidance on dealing with monotonic properties, see High read/write rates for a narrow key range below.
You would also have a tablet hotspot problem if/when you hit high rates of datastore writes for entities with the same property value (for example 100s of entities becoming done per second) - another facet of the same problem. It's this case mentioned in High read/write rates to a narrow key range:
You will also see this problem if you create new entities at a high rate with a monotonically increasing indexed property like a timestamp, because these properties are the keys for rows in the index tables in Bigtable.
TLDR: It scales so long as entity keys are scattered.
DR:
Let's first consider the index entries being written.
We have something like:
SomeKind\E1 -> FullEntityKey1
SomeKind\E2 -> FullEntityKey2
SomeKind\E2 -> FullEntityKey3
SomeKind\E3 -> FullEntityKey4
We note that each individual index entry points to some entity.
As far as the load sharding is concerned the value being sharded is like the following:
SomeKind\E1\FullEntityKey1
SomeKind\E2\FullEntityKey2
SomeKind\E2\FullEntityKey3
SomeKind\E3\FullEntityKey4
Now let's imagine we were using randomly allocated ids for the entity keys (range [0,2] to keep it simple) -- we assume an even distribution of writes across the random entity ids.
SomeKind\E1\0\RestOfKey1
SomeKind\E2\0\RestOfKey2
SomeKind\E2\1\RestOfKey3
SomeKind\E3\2\RestOfKey4
And then we can note that there are clear split points for the load to shard across -- that is, each of the [0,2] possible random ids is a shard, and the system can scale indefinitely so long as the writes are evenly distributed across the entities written in SomeKind (make the random id longer for more split points/scaling).
So indexed enum value scaling/hotspotting is highly associated with the entity keys being indexed, which are generally constructed in ways that are shardable, which means that the associated index entries are too.
This is not to say that it is impossible to create situations in which hotspots may occur (for example, if the entity keys had a monotonically increasing value (like a timestamp)), or by targeting a small section of keys for a very high write rate -- but that shouldn't happen by default with typical traffic patterns and entity keys.

HD wallet (bip32) addresses derivation path

I am creating an application that needs to generate a new address from a provided XPUB key.
For instance xpub6CUGRUonZSQ4TWtTMmzXdrXDtypWKiKrhko4egpiMZbpiaQL2jkwSB1icqYh2cfDfVxdx4df189oLKnC5fSwqPfgyP3hooxujYzAu3fDVmz
I am using the Electrum wallet and a key provided by this app.
My application allows users to add their own xpub keys, so my application will be able to generate new addresses without affecting users' privacy, as long as the xpub keys are only used by my application and not exposed to the public.
So I am looking for a way to generate new addresses correctly. I have found some libraries; however, I am not sure about the derivation path -- what should it look like?
Consider the following path example
Is the derivation path more a convention than a rule?
Bitcoin, first account, external chain, first address: m / 44' / 0' / 0' / 0 / 0 -- is this a valid path? I have found it here: https://github.com/bitcoin/bips/blob/master/bip-0044.mediawiki
I have also found out that Electrum wallets use another scheme (https://bitcoin.stackexchange.com/questions/36955/what-bip32-derivation-path-does-electrum-use/36956): m/0/ for receiving addresses and m/1/ for change addresses.
What is the maximum number (n) of addresses? How do online tools calculate the balance of an HD wallet? If n is quite large, it will require a lot of processing power to calculate the sum.
So, all in all, I wonder: what format of derivation path should I use in order to have no compatibility problems?
I would be grateful for any help.
Questions 1-3:
It's the bip44 convention; Electrum isn't following it, and is therefore not compatible with other wallets which support bip44.
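To make the convention concrete: a BIP44 path is five levels under the master key m -- purpose / coin_type / account / change / address_index. A minimal, illustrative parser (not from any wallet library):

```python
def parse_bip44(path):
    # m / purpose' / coin_type' / account' / change / address_index
    parts = path.split("/")
    if len(parts) != 6 or parts[0] != "m":
        raise ValueError("not a BIP44-shaped path")
    labels = ("purpose", "coin_type", "account", "change", "address_index")
    return dict(zip(labels, parts[1:]))
```

For the path from the question, m/44'/0'/0'/0/0 parses to purpose 44' (BIP44), coin_type 0' (Bitcoin), account 0', external chain 0, address index 0.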
Question 4:
The number can be infinite. If you are talking about the maximum for a certain parent key, the answer is:
Each extended key has 2^31 normal child keys, and 2^31 hardened child keys.
-- https://github.com/bitcoin/bips/blob/master/bip-0032.mediawiki
If your application design leads to a very large quantity of addresses, that's your own issue which you need to handle with a better design. If you mean compatibility with other wallets, then according to bip44:
Address gap limit is currently set to 20. If the software hits 20 unused addresses in a row, it expects there are no used addresses beyond this point and stops searching the address chain.
https://github.com/bitcoin/bips/blob/master/bip-0044.mediawiki#Address_gap_limit

What is an efficient method to compare file lists on a client and a remote server

I have the situation below, which needs to be addressed efficiently.
I'm doing file sync from client devices to a server. Sometimes a file from one device doesn't get fetched to another device from the server due to some issue with the server. I need to make sure that all the files on the server are synced to all the client devices using a separate thread. I am using C++ for the development and libcurl for client-to-server communication.
Here on the client device, we have an entry for downloaded files in the SQLite database. Likewise on the server, we have similar updates in the server database (MySQL) too. I need to list all the available files on the client device, send the list to the server, and compare it with the list taken from the server database to find the missed files.
I did a rough estimation that a list of 1 million files (file name with full path) is about 85 MB in size. Upon compression it goes down to 10 MB. So transferring this entire file list (even after compression) from client to server is not a good idea. I planned to implement Bloom filters for this as below:
Fetch files list from client side database and convert those to Bloom Filter Data Structure.
Just transferring the bloom data structure alone from client to the server.
Fetch files list from server side database and compare it with Bloom data structure received from the client and find out the missing files.
Please note that the above process, initiated from the client, should be handled in a thread at a regular interval, say every hour or so.
The problem with Bloom filters is the false positive rate, even if it is very low. I don't want to miss out even a single file. Is there any better way of doing this?
As you've noticed, this isn't a problem for which Bloom filters are appropriate. With a Bloom filter, when you get a hit you must then check the authoritative source to differentiate between a false positive and a true positive - they're useful in situations where most queries against the filter are expected to give a negative result, which is the opposite of your case.
What you could do is have each side build a partial prefix tree in memory of all the filenames known to that side. It wouldn't be a full prefix tree - once the number of filenames below a node drops below a certain level, you'd just include the full list of those filenames in that node. You then synchronise the prefix trees using a recursive algorithm starting at the root of the trees:
Each side creates a hash of all the sorted, concatenated filenames below the current node.
If the hashes are equal, then this node and all descendants are synchronised - return.
If there are no child nodes, send the (short) list of filenames at this terminal node from one side to the other to synchronise and return.
Otherwise, recursively synchronise the child nodes and return.
The hash should be at least 128 bits, and make sure that when you concatenate the filenames for the hash you do so in a reversible manner (i.e. separate them with a character that can't appear in filenames, like \0, or prefix each one with its length).
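A much-simplified, single-level sketch of this idea in Python (the real structure would recurse over tree nodes; all names here are illustrative). Each side buckets filenames by their first path component and hashes each bucket; only buckets whose hashes differ need their file lists exchanged:

```python
import hashlib
from collections import defaultdict

def buckets(files):
    # One-level "prefix tree": group filenames by first path component.
    b = defaultdict(set)
    for f in files:
        b[f.split("/", 1)[0]].add(f)
    return b

def bucket_hash(names):
    # Hash the sorted names, NUL-separated so the concatenation is
    # reversible (NUL cannot appear in a filename).
    h = hashlib.sha256()
    for n in sorted(names):
        h.update(n.encode() + b"\0")
    return h.digest()

def missing_on_server(client_files, server_files):
    cb, sb = buckets(client_files), buckets(server_files)
    missing = set()
    for prefix, names in cb.items():
        if bucket_hash(names) != bucket_hash(sb.get(prefix, set())):
            # Only mismatching buckets cost any transfer.
            missing |= names - sb.get(prefix, set())
    return missing
```

In the real recursive version, matching hashes prune entire subtrees, so the traffic is proportional to the number of differences rather than the total number of files.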
For file/pathname compression, I've found prefix-suffix compression to work better, even alone, than generic (bz2) compression. When combined, the filename list can be reduced even more.
The trick is to use escape codes (e.g. bytes < 32) to indicate the number of characters shared with the previous row, then use regular characters for the unique part, and finally (optionally) encode the number of common characters at the end of the string.
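A sketch of that prefix compression in Python, storing the shared-prefix length as an integer rather than an escape byte (the escape-code encoding described above is a more compact on-the-wire form of the same idea):

```python
def prefix_compress(paths):
    # Encode each sorted path as (length of prefix shared with the
    # previous path, unique suffix).
    out, prev = [], ""
    for p in sorted(paths):
        n = 0
        while n < min(len(p), len(prev)) and p[n] == prev[n]:
            n += 1
        out.append((n, p[n:]))
        prev = p
    return out

def prefix_decompress(pairs):
    # Rebuild each path from the previous one plus its unique suffix.
    out, prev = [], ""
    for n, suffix in pairs:
        p = prev[:n] + suffix
        out.append(p)
        prev = p
    return out
```

Sorted pathname lists share long prefixes, so most entries collapse to a small integer plus a short suffix before any generic compressor even runs.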

Check a fingerprint in the database

I am saving fingerprints in a "blob" field. I wonder: is the only way to compare these prints to retrieve all the prints saved in the database and then build a vector to check, using the "identify_finger" function? Can you check directly from the database using a SELECT?
I'm working with libfprint. In this code the verification is done against a vector:
def test_identify():
    cur = DB.cursor()
    cur.execute('select id, fp from print')
    id = []
    gallary = []
    for row in cur.fetchall():
        data = pyfprint.pyf.fp_print_data_from_data(str(row['fp']))
        gallary.append(pyfprint.Fprint(data_ptr = data))
        id.append(row['id'])
    n, fp, img = FingerDevice.identify_finger(gallary)
There are two fundamentally different ways to use a fingerprint database. One is to verify the identity of a person who is known through other means, and one is to search for a person whose identity is unknown.
A simple library such as libfprint is suitable for the first case only. Since you're using it to verify someone you can use their identity to look up a single row from the database. Perhaps you've scanned more than one finger, or perhaps you've stored multiple scans per finger, but it will still be a small number of database blobs returned.
A fingerprint search algorithm must be designed from the ground up to narrow the search space, to compare quickly, and to rank the results and deal with false positives. Just as a Google search may come up with pages totally unrelated to what you're looking for, so too will a fingerprint search. There are companies that devote their entire existence to solving this problem.
Another way would be to have a MySQL plugin that knows how to work with fingerprint images and select based on what you are looking for.
I really doubt that there is such a thing.
You could also try to parallelize the fingerprint comparison, i.e. calling:
FingerDevice.identify_finger(gallary)
in parallel, on different cores/machines
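pyfprint isn't available here, so the sketch below uses a stand-in match function; the point is only the chunking pattern for scanning a gallery concurrently (match_fn stands in for whatever per-print comparison you run):

```python
from concurrent.futures import ThreadPoolExecutor

def identify_in_chunks(gallery, match_fn, workers=4, chunk_size=100):
    # Split the gallery into chunks and scan the chunks concurrently;
    # return the index of the first match found, or None.
    starts = range(0, len(gallery), chunk_size)
    chunks = [gallery[s:s + chunk_size] for s in starts]
    scan = lambda c: [i for i, g in enumerate(c) if match_fn(g)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for base, hits in zip(starts, pool.map(scan, chunks)):
            if hits:
                return base + hits[0]
    return None
```

Note that with CPython, threads only help if the matcher releases the GIL (a C library call typically does); otherwise use ProcessPoolExecutor, or separate machines as suggested above.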
You can't check directly from the database using a SELECT because each scan is different and will produce different blobs. libfprint does the hard work of comparing different scans and judging whether they are from the same person or not.
What zinking and Tudor are saying, I think, is that if you understand how that judgement process works (which is, by the way, by minutiae comparison), you can develop a method of storing the relevant data for the process (the minutiae, maybe?) in the database, and then a method for fetching the relevant values -- maybe a kind of index or some type of extension to the database.
In other words, you would have to reimplement the libfprint algorithms in a more complex (and beautiful) way, instead of just accepting the libfprint method of comparing the scan with all stored fingerprints in a loop.
Other solutions for speeding up your program:
Use C:
I only know enough C to write hello-world-type programs, but it was not hard to write pure C code using the fp_identify_finger_img function of libfprint, and I can tell you it is much faster than pyfprint.identify_finger.
You can continue doing the enrollment part of the stuff in Python. I do.
Use a time/location-based SELECT:
If you know your users will scan their fingerprints with higher probability at some times than others, or at some places than others (maybe arriving at work at a certain time and scanning their fingers, or leaving, or entering the building by one gate or another), you can collect data at each scan to measure those probabilities and create parallel tables that sort the users by their probability of arriving at each time and location.
We know that identify_finger tries to identify fingers in a loop over the fingerprint objects you provide in a list, so we can use that: give it the objects sorted so that the most likely user for that time and location is first in the list, and so on.