matching against a list of subnets - c++

There is a list of subnets in the form of net-addr/mask, such as
12.34.45.0/24 192.168.0.0/16 45.0.0.0/10 ...
I wonder what the best way is to tell whether a given IP address is in any of the subnets.
Here is a little background on the matching:
For an IP address x, we convert it to an integer. For example, 11.12.13.14 is converted to 0x0b0c0d0e. For a mask length m, we convert it to a 32-bit integer M whose leading m bits are 1 and whose remaining (32-m) bits are 0.
To check whether IP x is in subnet A/m,
we just need to check (x&M) == (A&M)
I am curious which data structures or functions make matching against a list of subnets fast. Of course, we can go through the subnets in a loop, but that's not efficient.
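For reference, the baseline check described above can be sketched as follows. This is only a minimal illustration of the mask comparison and the plain loop; the names Subnet, ipv4, prefix_mask, in_subnet and in_any_subnet are made up for this example and are not from any library.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// A subnet as a network address plus prefix length (the "/24" part).
struct Subnet {
    std::uint32_t addr;
    unsigned prefix_len;  // 0..32
};

// Pack four octets into one integer: 11.12.13.14 -> 0x0b0c0d0e.
constexpr std::uint32_t ipv4(unsigned a, unsigned b, unsigned c, unsigned d) {
    return (std::uint32_t(a) << 24) | (std::uint32_t(b) << 16) |
           (std::uint32_t(c) << 8) | std::uint32_t(d);
}

// Mask with the leading prefix_len bits set. /0 is special-cased because
// shifting a 32-bit value by 32 bits is undefined behaviour in C++.
std::uint32_t prefix_mask(unsigned prefix_len) {
    assert(prefix_len <= 32);
    return prefix_len == 0 ? 0u : ~std::uint32_t{0} << (32 - prefix_len);
}

// The (x & M) == (A & M) test from the question.
bool in_subnet(std::uint32_t ip, const Subnet& s) {
    const std::uint32_t m = prefix_mask(s.prefix_len);
    return (ip & m) == (s.addr & m);
}

// Straightforward linear scan over all subnets.
bool in_any_subnet(std::uint32_t ip, const std::vector<Subnet>& subnets) {
    for (const Subnet& s : subnets)
        if (in_subnet(ip, s)) return true;
    return false;
}
```

With the three subnets from the example list, in_any_subnet(ipv4(192, 168, 7, 9), subnets) would be true and in_any_subnet(ipv4(11, 12, 13, 14), subnets) would be false.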

Make a tree where each level represents n bits of the IP address. Store each subnet on the level where its number of mask bits is greater than n * level and at most n * (level + 1). For example, with n = 4 you have 16 children per node. If you are testing against 11.12.13.14 (== 0x0b0c0d0e), you could walk the tree like this:
0 -> b -> 0 -> c -> 0 -> d -> 0 -> e
On each node you keep track of the subnets with the corresponding size. I mean: level 0 should have subnets /1 to /4 (inclusive), level 1 should have subnets /5 to /8, and so on up to /29 to /32. Note that /0 matches everything, so it would be useless to have in the data structure.
To search in the tree, split the IP into groups of n bits (4 in my example). Descend to the first level matching the first n bits and test all subnets on that level. If nothing matches, descend to the next level matching the next n bits.
This way you have to test at most 32/n levels, each with at most 2^n subnets. For n=4, that is 8 levels with at most 16 subnets each. This is done in no time.
Clarification: A node is a subnet, for example (in hex, one digit is a nibble, which is 4 bits): 0a.5a.00.00/16. The parent of this node would be a subnet containing this subnet, for example 0a.50.00.00/12. The edge towards a child node can be read as "contains", as in: "this (the parent) subnet contains the subnet represented by the child node". For this tree to contain all the subnets you want, you will likely have to insert nodes that represent a subnet that is not in your list. Mark these as auxiliary nodes so that, when searching the tree, you know there are more specific subnets under them, but the node itself is not part of the list of subnets you want to check against. You should only add the nodes that are directly in the list, plus the parent nodes needed to make them reachable in the tree structure.
Here is a struct showing how I see it:
#include <cstdint>

struct subnet_tree_node
{
    std::uint32_t ip;                // 32-bit IP address
    subnet_tree_node *children;      // array of child nodes
    std::uint8_t number_of_children;
    std::uint8_t mask;               // number of mask bits for this subnet
    std::uint8_t valid;              // whether this node is valid or auxiliary
};
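To make the idea concrete, here is a rough sketch of the simplest variant of such a tree: one bit per level (n = 1) rather than the nibble-per-level layout described above. It is an illustration under that simplifying assumption, not the exact structure from the struct; the class name SubnetTrie and its members are invented for this example, and it needs C++14 for std::make_unique.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Binary prefix tree: one level per address bit (the n = 1 case).
class SubnetTrie {
    struct Node {
        std::unique_ptr<Node> child[2];
        bool valid = false;  // true if a listed subnet ends here, false if auxiliary
    };
    Node root_;

public:
    // Insert addr/prefix_len; bits of addr beyond the prefix are ignored.
    void insert(std::uint32_t addr, unsigned prefix_len) {
        assert(prefix_len <= 32);
        Node* n = &root_;
        for (unsigned i = 0; i < prefix_len; ++i) {
            const unsigned bit = (addr >> (31 - i)) & 1u;
            if (!n->child[bit]) n->child[bit] = std::make_unique<Node>();
            n = n->child[bit].get();
        }
        n->valid = true;
    }

    // True if any inserted subnet contains ip.
    bool contains(std::uint32_t ip) const {
        const Node* n = &root_;
        for (unsigned i = 0; i < 32; ++i) {
            if (n->valid) return true;  // a listed subnet is a prefix of ip
            n = n->child[(ip >> (31 - i)) & 1u].get();
            if (!n) return false;       // no more specific subnet exists
        }
        return n->valid;  // reached a /32 node
    }
};
```

A lookup visits at most 32 nodes regardless of how many subnets are stored. Grouping bits into nibbles, as the answer suggests, trades fewer levels for wider nodes.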

So you've established performance is a problem.
Consider each netmask/addr pair as a pair of IP addresses: first-valid and last-valid.
Let us assume last-valid is always odd (I'm not sure that holds for a /32 network, but that is a really, really strange case).
Construct a sorted vector of these IP addresses. (Complain if the networks overlap or anything stupid.)
Search the vector for your target IP address with some sort of binary chop.
If the IP address is in the vector, it is a) weird; b) in one of the subnets.
If the IP address is not in the vector and the nearest value below it is even, it is in a subnet. If the nearest value below it is odd, it is not in a subnet.
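A possible sketch of this lookup is shown below. Note one deliberate change: instead of testing whether the value below is even or odd, it tests whether the number of boundaries below the target is odd, which tells you the nearest boundary below is a "first" address. That sidesteps the /32 question. The function name in_sorted_bounds is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// bounds holds first0, last0, first1, last1, ... in ascending order,
// for subnets that do not overlap.
bool in_sorted_bounds(const std::vector<std::uint32_t>& bounds, std::uint32_t ip) {
    assert(bounds.size() % 2 == 0);
    assert(std::is_sorted(bounds.begin(), bounds.end()));
    // First boundary that is not less than ip.
    const auto it = std::lower_bound(bounds.begin(), bounds.end(), ip);
    if (it != bounds.end() && *it == ip) return true;  // ip is itself a boundary
    // Count of boundaries strictly below ip: an odd count means the nearest
    // one below is a "first" address, so ip lies inside that subnet.
    return (it - bounds.begin()) % 2 == 1;
}
```

For 12.34.45.0/24 the vector would contain 0x0c222d00 and 0x0c222dff, and any address between them has exactly one boundary below it.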

Do you have any evidence that performance is an issue? There are only 2^24 subnets (well, you can have /28 subnets, but they are usually internal to an organization, so even if the organization has a class A network, there's still only 2^24 of them).
Doing 16 million ands and comparisons is going to take no time.
Keep it simple (until you really have to do better).

Thanks for the discussions here, they got me inspired with this solution.
First, without loss of generality, we assume no subnet covers another subnet (otherwise we just remove the smaller one).
Each subnet is considered as an interval [subnet_min, subnet_max].
We just need to organize all the subnets into a binary search tree, each node being a pair (subnet_min, subnet_max). A search for an IP traverses the tree like a regular binary search based only on subnet_min, with the purpose of finding the node whose subnet_min is the largest among all the subnet_min's that are <= the given IP. Once we find this node, we check whether the node's subnet_max is >= the given IP. If so, the given IP is covered by that subnet. Otherwise, the IP is not covered by the subnet in this node, and it is not covered by any of the other subnets either.
The last point is guaranteed by the assumption that no subnet contains another.
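One way to sketch this search is with a sorted vector and std::upper_bound in place of an explicit binary tree; the comparison logic is the same. The names Range and covered are made up for this example, and the input is assumed to be sorted by subnet_min with no nested or overlapping subnets.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <vector>

struct Range {
    std::uint32_t subnet_min;
    std::uint32_t subnet_max;
};

// ranges must be sorted by subnet_min and must not nest or overlap.
bool covered(const std::vector<Range>& ranges, std::uint32_t ip) {
    assert(std::is_sorted(ranges.begin(), ranges.end(),
                          [](const Range& a, const Range& b) {
                              return a.subnet_min < b.subnet_min;
                          }));
    // First range whose subnet_min is greater than ip; the candidate is the
    // one just before it, i.e. the largest subnet_min that is <= ip.
    const auto it = std::upper_bound(
        ranges.begin(), ranges.end(), ip,
        [](std::uint32_t value, const Range& r) { return value < r.subnet_min; });
    if (it == ranges.begin()) return false;  // every subnet_min is above ip
    return std::prev(it)->subnet_max >= ip;
}
```

Each lookup costs O(log n) comparisons. A balanced tree would only be preferable if subnets are added and removed frequently.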

Terraform - AWS - create multiple instances - different AZ (where instance count is greater than AZ list length)

I am facing a problem with Terraform (v0.12) when creating multiple instances using the count variable and a list of subnet IDs, where the count is greater than the length of the subnet ID list.
For example:
resource "aws_instance" "main" {
count = 20
ami = var.ami_id
instance_type = var.instance_type
# ...
subnet_id = var.subnet_ids_list[count.index]
}
Where my count is '20' and length(var.subnet_ids_list) is 2. It throws the following error:
count.index is 2
var.instance_subnet_id is tuple with 2 elements
The given key does not identify an element in this collection value.
I tried making "subnet_ids_list" a comma-separated string and using "split", but that gave the same error.
Later I thought of appending the subnet elements to "subnet_ids_list" repeatedly to bring its length up to 20, something like:
Python 2.7
>>> subnet_ids_list = subnet_ids_list * 10
Can someone help me achieve something similar with Terraform, or suggest another approach to solve this problem?
Original:
subnet_ids_list = ["sub-1", "sub-2"]
Converted to satisfy the value provided to count:
subnet_ids_list = ["sub-1", "sub-2", "sub-1", "sub-2",....., "sub-1", "sub-2",] (length=20).
I don't want to use AWS Auto Scaling groups for this purpose.
You can use the element function if you need to loop back through a list of things as mentioned in the linked documentation:
The index is zero-based. This function produces an error if used with
an empty list.
Use the built-in index syntax list[index] in most cases. Use this
function only for the special additional "wrap-around" behavior
described below.
> element(["a", "b", "c"], 3)
a
It doesn't make sense to create a new subnet whenever you need to spin up a new EC2. I'd recommend you to take a look at the official documentation about the basics of VPC and subnets: https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Subnets.html#vpc-subnet-basics
For example, if you create a VPC with CIDR block 10.0.0.0/24, it supports 256 IP addresses. You can break this CIDR block into two subnets, each supporting 128 IP addresses. One subnet uses CIDR block 10.0.0.0/25 (for addresses 10.0.0.0 - 10.0.0.127) and the other uses CIDR block 10.0.0.128/25 (for addresses 10.0.0.128 - 10.0.0.255).
In your Terraform example, it looks like you have 2 subnets (private and public?), so your index must be either 0 or 1 when accessing subnet_ids_list. An even better solution would be to tag your subnets: https://www.terraform.io/docs/providers/aws/r/subnet.html#inner
You might have a separate counter, though, to control the number of instances. Hope it helps!
EDIT: Based on your comments, a map would be a better data structure to control the instance/subnet pairing. The key could be the instance or the subnet itself, e.g. { "aws_instance" = "sub-1" }
Reference: https://www.terraform.io/docs/configuration-0-11/variables.html#maps

IP address matching filter function

I am writing code in C++ that runs on both Windows and Mac. I want to write a function that accepts a list of machine IP addresses and a list of IP filters in CIDR format. The function will check whether a machine IP matches an IP filter.
For example, if the machine IP is 10.210.177.47 and the filter is 10.210.177.1/32,
the function will check whether 10.210.177.47 falls inside the filter range.
A filter can also be a plain IP address like 10.210.177.45.
I need to write common code that runs on both Windows and Mac.
The easiest solution is to convert the mask length into a bit mask. E.g. a /8 uses the upper 8 bits to identify the network and the lower 24 bits to identify hosts within that network. Hence, by shifting the IP address (expressed as std::uint32_t) right by 24 bits (>> 24), you keep just the network part. For 10.210.177.47 within 10.0.0.0/8, that leaves 10, which matches. For a /24 you shift right by 8 bits instead, which leaves 10.210.177; compared against 10.0.0 for the filter 10.0.0.0/24, that is no match.
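A minimal portable sketch of this, using only the standard library so it should build the same way on Windows and Mac, might look like the following. The helper names parse_filter and matches are invented for this example, a plain address is treated as a /32 filter, and trailing characters after a valid filter are not rejected.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Parse "a.b.c.d" or "a.b.c.d/len". A plain address is treated as /32.
// Returns false on malformed input.
bool parse_filter(const std::string& text, std::uint32_t& addr, unsigned& prefix_len) {
    std::istringstream in(text);
    unsigned a = 0, b = 0, c = 0, d = 0, len = 32;
    char d1 = 0, d2 = 0, d3 = 0, slash = 0;
    if (!(in >> a >> d1 >> b >> d2 >> c >> d3 >> d)) return false;
    if (d1 != '.' || d2 != '.' || d3 != '.') return false;
    if (in >> slash) {  // optional "/len" suffix
        if (slash != '/' || !(in >> len)) return false;
    }
    if (a > 255 || b > 255 || c > 255 || d > 255 || len > 32) return false;
    addr = (std::uint32_t(a) << 24) | (std::uint32_t(b) << 16) |
           (std::uint32_t(c) << 8) | std::uint32_t(d);
    prefix_len = len;
    return true;
}

// True if ip_text falls inside filter_text; false if either fails to parse.
bool matches(const std::string& ip_text, const std::string& filter_text) {
    std::uint32_t ip = 0, net = 0;
    unsigned ip_len = 0, net_len = 0;
    if (!parse_filter(ip_text, ip, ip_len) || !parse_filter(filter_text, net, net_len))
        return false;
    assert(net_len <= 32);             // guaranteed by parse_filter
    if (net_len == 0) return true;     // /0 matches everything
    const unsigned shift = 32 - net_len;  // 0..31, so the shift is well defined
    return (ip >> shift) == (net >> shift);
}
```

With this sketch, matches("10.210.177.47", "10.0.0.0/8") is true, while matches("10.210.177.47", "10.210.177.1/32") is false because a /32 filter only matches that single address.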

How to discern between network flows

I want to be able to discern between network flows. I am defining a flow as a tuple of three values (sourceIP, destIP, protocol). I am storing these in a C++ map for fast access. However, if the destinationIP and the sourceIP are swapped but contain the same values, e.g.
[packet 1: source = 1.2.3.4, dest = 5.6.7.8]
[packet 2: source = 5.6.7.8, dest = 1.2.3.4 ]
I would like to create a key that treats these as the same.
I could solve this by creating a secondary key and a primary key, and if the primary key doesn't match I could loop through the elements in my table and see if the secondary key matches, but this seems really inefficient.
I think this might be a perfect opportunity for hashing, but it seems like string hashes are only available through Boost, and we are not allowed to bring in libraries. I am also not sure I know of a hash function that depends only on the elements, not on their order.
How can I easily tell flows apart according to these rules?
Compare the values of the source and dest IPs as 64-bit numbers. Use the lower one as the hash key, and put the higher one, the protocol and the direction as the values.
Do lookups the same way, use the lower value as the key.
If you consider that a single client can have more than one connection to a service, you'll see that you actually need four values to uniquely identify a flow: the source and destination IP addresses and the source and destination ports. For example, imagine two developers in the same office are searching StackOverflow at the same time. They'll both connect to stackoverflow.com:80, and they'll both have the same source address. But the source ports will be different (otherwise the company's firewall wouldn't know where to route the returned packets). So you'll need to identify each node by an <address, port> pair.
Some ideas:
As stark suggested, sort the source and destination nodes, concatenate them, and hash the result.
Hash the source, hash the destination, and XOR the result. (Note that this may weaken the hash and allow more collisions.)
Make 2 entries for each flow by hashing
<src_addr, src_port, dst_addr, dst_port> and also
<dst_addr, dst_port, src_addr, src_port>. Add them both to the map and point them both to the same data structure.
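The first idea (order the two endpoints, then use them together as the key) can be sketched with a plain std::map, which needs no hash function at all. The aliases Endpoint, FlowKey and FlowTable, the FlowStats struct, and the two functions are all hypothetical names for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <tuple>
#include <utility>

using Endpoint = std::pair<std::uint32_t, std::uint16_t>;      // <address, port>
using FlowKey = std::tuple<Endpoint, Endpoint, std::uint8_t>;  // <lower, higher, protocol>

// Build a direction-independent key by putting the smaller endpoint first.
FlowKey make_flow_key(Endpoint src, Endpoint dst, std::uint8_t protocol) {
    if (dst < src) std::swap(src, dst);
    assert(!(dst < src));  // canonical order now holds
    return FlowKey(src, dst, protocol);
}

struct FlowStats {
    unsigned long packets = 0;
};

using FlowTable = std::map<FlowKey, FlowStats>;

// Both directions of a conversation update the same entry.
void record_packet(FlowTable& table, Endpoint src, Endpoint dst, std::uint8_t protocol) {
    ++table[make_flow_key(src, dst, protocol)].packets;
}
```

Because the key is canonical, a packet from 1.2.3.4 to 5.6.7.8 and its reply from 5.6.7.8 to 1.2.3.4 land in one entry. If the direction matters later, it can be stored in the value, as the first answer suggests.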

OPTICSXi - ELKI ResultWriter

I'm using ELKI to cluster, in a hierarchical way, a dataset of geolocations using OPTICSXi.
The result of the execution of the algorithm is a set of files.
The content of a file could be:
# Cluster: nameOfCluster
# OPTICSModel
# Parents: nameOfParents (this element doesn't exist for the root cluster)
# Children: nameOfChild_0, nameOfChild_1 ... nameOfChild_n, (optional)
ID=1 lat0 lon0 reachability=?
ID=3062 lat1 lon1 reachability=1.30972586 predecessor=1
ID=7383 lat2 lon2 reachability=2.56784445 predecessor=3062
ID=42839 lat3 lon3 reachability=4.05510623 predecessor=1
I don't understand whether the elements in each file (four in the example) belong to the same cluster or could belong to different clusters. In the latter case, do I need to write some code that builds the clusters (for example by looking at the predecessor of each node), or are there parameters I can specify in ELKI to obtain each single cluster?
By default, ELKI will produce a directory with one file per cluster, unless the output file already exists, in which case you will get all the clusters written into the same file, separated by comments as seen above.
With a hierarchical result, such as that of OPTICSXi, you should however also treat all members of the child clusters as part of the parent. These are clusters nested into the parent. They are not repeated in the parent, to reduce redundancy in the output.
Compare the output of OPTICSXi to OPTICS output. What the Xi approach does, is split the data for you, based on sudden drops in reachability-distance. All clusters of Xi should be subsequences of the original OPTICS cluster order.
In your case, you may have chosen minPts too small, if your cluster has just 4 elements. (Although, you may have truncated the file, or you may have a lot of elements in child clusters; so the output may be fine).
Also note that you will usually want to validate whether you want the first element(s) of your cluster to belong to the cluster or not; similarly the last elements. OPTICSXi tends to err on the first elements, but not in a systematic way that would be trivial to fix. The first and last elements are those that bridge the gap from one cluster to another. You really should verify these manually (which is a good reason to not choose minPts too small).
I strongly recommend building or using a visualization for your specific use case. Then you could just load such a cluster into your visualization and visually inspect whether the result makes sense to you. I have used OPTICSXi on geographic data, and that worked very well for me.
So, if I've understood well, in the example above, the cluster is composed of the elements
ID=1, ID=3062, ID=7383, ID=42839, and all the elements in nameOfChild_0, nameOfChild_1 ... nameOfChild_n.
Maybe I shouldn't join the children into the root element, because I guess I would obtain a single big cluster containing all my geolocations; in fact I have 903 child elements and 18795 nodes (IDs).
I've done a lot of tests, choosing minPts = {2,5,10} and xi = {0.1, 0.01, 0.001, 0.0001, 0.00001, 0.000001}. I use a visualization of my clusters, but I can't find a good result. I'm having a lot of trouble.
Thanks to your reply I've understood that I split my elements too much: I treated each file as a cluster, so I didn't count the child elements as part of the parent but considered them separate clusters.
Moreover, I noticed that the first and the last elements are sometimes wrong. I thought of verifying whether these elements are the predecessor of at least one element in the cluster, or whether at least one element in the cluster is a predecessor of them. Does this make sense?

How to find all clusters of forest on map?

I have a simple class Cell like this (Type is enum {RIVER, FOREST, GRASS, HILL}):
class Cell{
public:
Type type;
int x;
int y;
};
and a map like vector<Cell> grid. Can anyone suggest an algorithm to create list<list<Cell>> clusters, where each list contains the FOREST cells in the same cluster? A cluster is a set of connected cells, and a connection can be in eight directions: up, down, left, right, up_right, up_left, down_left, down_right. I need to find all clusters of forest on the map and put every single cluster in a list<Cell>.
The algorithm is rather simple and it actually doesn't even depend on the exact definition of what a cluster is. Say you have a predicate cluster(f0, f1) which yields true if f0 and f1 are in the same cluster. All you need to do is run through the grid and find a forest. If a cell f is a forest, you check cluster(f, other) for each known forest. If cluster(f, other) yields true, you add f to the cluster of other. You continue to check the other known forests in other clusters: when you find another cell c in another cluster for which cluster(f, c) also yields true, you merge the two clusters (std::list<Cell>::splice()).
I had put this as a comment, but may as well answer:
Look up the union-find algorithm. Using path compression, you can just
walk through the structure afterwards and create a list for each root,
adding your cells to the appropriate list as you go.
Link: http://en.wikipedia.org/wiki/Disjoint-set_data_structure
For all your cells, perform a union with the cell above and the cell to the left. If you want diagonals to join, then also include the top-left and top-right diagonals.
Use the path-compression version of union-find so that all nodes in a cluster point to a single root. Then all you have to do is walk through your structure (after doing all the unions) and add nodes as you go. Pseudo(ish)code:
foreach node
    Find(node) // this ensures path compression
    if not clusters.hasList(node.root)
        clusters.createList(node.root)
    end
    list <- clusters.getList(node.root)
    list.append(node)
end
The above assumes that if a node is a root, then node.root points to node.
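Put together, the union-find approach might look roughly like this in C++. The sketch assumes the grid is stored row-major with a known width (so the cell at (x, y) is grid[y * width + x]), which the question does not state; the function names find_root and forest_clusters are invented here, and path halving is used as the path-compression variant.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <map>
#include <numeric>
#include <utility>
#include <vector>

enum Type { RIVER, FOREST, GRASS, HILL };
struct Cell { Type type; int x; int y; };

// Find with path halving, a simple form of path compression.
std::size_t find_root(std::vector<std::size_t>& parent, std::size_t i) {
    while (parent[i] != i) {
        parent[i] = parent[parent[i]];
        i = parent[i];
    }
    return i;
}

// Assumption: grid is row-major, so the cell at (x, y) is grid[y * width + x].
std::list<std::list<Cell>> forest_clusters(const std::vector<Cell>& grid, int width) {
    assert(width > 0 && grid.size() % static_cast<std::size_t>(width) == 0);
    const int height = static_cast<int>(grid.size()) / width;
    std::vector<std::size_t> parent(grid.size());
    std::iota(parent.begin(), parent.end(), std::size_t{0});  // each cell is its own root

    // Neighbours already visited in scan order: left, up-left, up, up-right.
    const int dx[] = {-1, -1, 0, 1};
    const int dy[] = {0, -1, -1, -1};
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            const std::size_t i = static_cast<std::size_t>(y) * width + x;
            if (grid[i].type != FOREST) continue;
            for (int k = 0; k < 4; ++k) {
                const int nx = x + dx[k], ny = y + dy[k];
                if (nx < 0 || nx >= width || ny < 0) continue;
                const std::size_t j = static_cast<std::size_t>(ny) * width + nx;
                if (grid[j].type != FOREST) continue;
                const std::size_t ri = find_root(parent, i);
                const std::size_t rj = find_root(parent, j);
                parent[ri] = rj;  // union the two sets
            }
        }
    }

    // Group forest cells by their root, then move the groups into the result.
    std::map<std::size_t, std::list<Cell>> by_root;
    for (std::size_t i = 0; i < grid.size(); ++i)
        if (grid[i].type == FOREST) by_root[find_root(parent, i)].push_back(grid[i]);

    std::list<std::list<Cell>> clusters;
    for (auto& entry : by_root) clusters.push_back(std::move(entry.second));
    return clusters;
}
```

Only the four already-visited neighbours need checking, because the other four directions are handled when the scan reaches those cells. Dropping the two diagonal entries from dx and dy gives four-direction connectivity instead.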