Efficient way of storing Huffman tree - c++

I am writing a Huffman encoding/decoding tool and am looking for an efficient way to store the Huffman tree that is created inside the output file.
Currently there are two different versions I am implementing.
The first reads the entire file into memory character by character and builds a frequency table for the whole document. This only requires outputting the tree once, so efficiency is not that big of a concern there, unless the input file is small.
The other method is to read a chunk of data, about 64 kilobytes in size, run the frequency analysis over that, build a tree, and encode the chunk. However, in this case before every chunk I will need to output my frequency tree so that the decoder is able to re-build its tree and properly decode the encoded file. This is where efficiency comes into play, since I want to save as much space as possible.
In my searches so far I have not found a good way of storing the tree in as little space as possible; I am hoping the StackOverflow community can help me find a good solution!

Since you already have to implement code to handle a bit-wise layer on top of your byte-organized stream/file, here's my proposal.
Do not store the actual frequencies, they're not needed for decoding. You do, however, need the actual tree.
So for each node, starting at root:
If leaf-node: output a 1-bit + the N-bit character/byte
If not leaf-node: output a 0-bit, then encode both child nodes (left first, then right) the same way
To read, do this:
Read a bit. If 1, then read the N-bit character/byte and return a new node around it with no children
If the bit was 0, decode the left and right child nodes the same way, and return a new node around them with those children, but no value
A leaf-node is basically any node that doesn't have children.
With this approach, you can calculate the exact size of your output before writing it, to figure out if the gains are enough to justify the effort. This assumes you have a dictionary of key/value pairs that contains the frequency of each character, where frequency is the actual number of occurrences.
Pseudo-code for calculation:
Tree-size = 10 * NUMBER_OF_CHARACTERS - 1
Encoded-size = Sum(for each char,freq in table: freq * len(PATH(char)))
The tree-size calculation takes the leaf and non-leaf nodes into account: each leaf costs 1 + SIZE_OF_ONE_CHARACTER bits, each inline (non-leaf) node costs 1 bit, and there's one less inline node than there are characters, hence 10 * N - 1 for 8-bit characters.
SIZE_OF_ONE_CHARACTER would be the number of bits per character, and those two formulas together give you the total number of bits that my approach for the tree + the encoded data will occupy.
PATH(c) is a function/table that would yield the bit-path from root down to that character in the tree.
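Since the question is tagged C++, here's a rough sketch of that calculation in C++. It is only illustrative: the map-based containers and the assumption of 8-bit characters are mine, not part of the answer above.
#include <cstddef>
#include <map>

// Bits needed for the tree + encoded data, given character frequencies
// and the length of each character's code path, i.e. len(PATH(char)).
std::size_t outputSizeBits(const std::map<unsigned char, std::size_t>& freq,
                           const std::map<unsigned char, std::size_t>& codeLen)
{
    // 9 bits per leaf (1 flag bit + the 8-bit character), 1 bit per inline
    // node, and one less inline node than there are leaves: 10 * N - 1.
    std::size_t treeBits = 10 * freq.size() - 1;
    std::size_t dataBits = 0;
    for (const auto& [ch, f] : freq)
        dataBits += f * codeLen.at(ch); // freq * len(PATH(char))
    return treeBits + dataBits;
}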
Here's C#-looking pseudo-code for the encoding itself, which assumes one character is just a simple byte.
void EncodeNode(Node node, BitWriter writer)
{
    if (node.IsLeafNode)
    {
        writer.WriteBit(1);
        writer.WriteByte(node.Value);
    }
    else
    {
        writer.WriteBit(0);
        EncodeNode(node.LeftChild, writer);
        EncodeNode(node.RightChild, writer);
    }
}
To read it back in:
Node ReadNode(BitReader reader)
{
    if (reader.ReadBit() == 1)
    {
        return new Node(reader.ReadByte(), null, null);
    }
    else
    {
        Node leftChild = ReadNode(reader);
        Node rightChild = ReadNode(reader);
        return new Node(0, leftChild, rightChild);
    }
}
An example (simplified, use properties, etc.) Node implementation:
public class Node
{
    public Byte Value;
    public Node LeftChild;
    public Node RightChild;

    public Node(Byte value, Node leftChild, Node rightChild)
    {
        Value = value;
        LeftChild = leftChild;
        RightChild = rightChild;
    }

    public Boolean IsLeafNode
    {
        get
        {
            return LeftChild == null;
        }
    }
}
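For the C++ question itself, here is a sketch of the same encode/decode idea. To keep it self-contained it writes "bits" into a std::string of '0'/'1' characters instead of a real bit-packed stream; with a proper bit-wise layer you'd swap the string appends/reads for WriteBit/WriteByte equivalents.
#include <bitset>
#include <cstddef>
#include <memory>
#include <string>

struct Node {
    unsigned char value = 0;
    std::unique_ptr<Node> left, right;
    bool isLeaf() const { return !left; }  // a leaf has no children
};

// Pre-order: '1' + 8-bit value for a leaf, '0' followed by both children.
void encodeNode(const Node& node, std::string& out)
{
    if (node.isLeaf()) {
        out += '1';
        out += std::bitset<8>(node.value).to_string();
    } else {
        out += '0';
        encodeNode(*node.left, out);
        encodeNode(*node.right, out);
    }
}

// Rebuild the tree by consuming the same stream; pos tracks the read cursor.
std::unique_ptr<Node> readNode(const std::string& in, std::size_t& pos)
{
    auto node = std::make_unique<Node>();
    if (in[pos++] == '1') {
        node->value = static_cast<unsigned char>(
            std::bitset<8>(in.substr(pos, 8)).to_ulong());
        pos += 8;
    } else {
        node->left = readNode(in, pos);
        node->right = readNode(in, pos);
    }
    return node;
}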
Here's a sample output from a specific example.
Input: AAAAAABCCCCCCDDEEEEE
Frequencies:
A: 6
B: 1
C: 6
D: 2
E: 5
Each character is just 8 bits, so the size of the tree will be 10 * 5 - 1 = 49 bits.
The tree could look like this:
           20
       ----------
      12         8
    -----     -------
    |   |     |     3
    |   |     |   -----
    A   C     E   B   D
    6   6     5   1   2
So the paths to each character are as follows (0 is left, 1 is right):
A: 00
B: 110
C: 01
D: 111
E: 10
So to calculate the output size:
A: 6 occurrences * 2 bits = 12 bits
B: 1 occurrence * 3 bits = 3 bits
C: 6 occurrences * 2 bits = 12 bits
D: 2 occurrences * 3 bits = 6 bits
E: 5 occurrences * 2 bits = 10 bits
Sum of the encoded data is 12 + 3 + 12 + 6 + 10 = 43 bits.
Add that to the 49 bits from the tree, and the output will be 92 bits, or 12 bytes. Compare that to the 20 * 8 = 160 bits (20 bytes) necessary to store the original 20 characters unencoded, and you'll save 8 bytes.
The final output, including the tree to begin with, is as follows. Each character in the stream (A-E) is encoded as 8 bits, whereas 0 and 1 are just a single bit each. The space in the stream is just to separate the tree from the encoded data and does not take up any space in the final output.
001A1C01E01B1D 0000000000001100101010101011111111010101010
For the concrete example you have in the comments, AABCDEF, you will get this:
Input: AABCDEF
Frequencies:
A: 2
B: 1
C: 1
D: 1
E: 1
F: 1
Tree:
            7
      -------------
     3             4
   -----      ---------
   |   |      2       2
   |   |    -----   -----
   A   B    C   D   E   F
   2   1    1   1   1   1
Paths:
A: 00
B: 01
C: 100
D: 101
E: 110
F: 111
Tree: 001A1B001C1D01E1F = 59 bits
Data: 000001100101110111 = 18 bits
Sum: 59 + 18 = 77 bits = 10 bytes
Since the original was 7 characters of 8 bits = 56 bits (7 bytes), you get too much overhead on such small pieces of data.

If you have enough control over the tree generation, you could make it do a canonical tree (the same way DEFLATE does, for example), which basically means you create rules to resolve any ambiguous situations when building the tree. Then, like DEFLATE, all you actually have to store are the lengths of the codes for each character.
That is, if you had the tree/codes Lasse mentioned above:
A: 00
B: 110
C: 01
D: 111
E: 10
Then you could store those as:
2, 3, 2, 3, 2
And that's actually enough information to regenerate the huffman table, assuming you're always using the same character set -- say, ASCII. (Which means you couldn't skip letters -- you'd have to list a code length for each one, even if it's zero.)
If you also put a limitation on the bit lengths (say, 7 bits), you could store each of these numbers using short binary strings. So 2,3,2,3,2 becomes 010 011 010 011 010 -- which fits in 2 bytes.
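For illustration, here's a sketch of the RFC 1951 length-to-code reconstruction in C++ (the names are mine, not DEFLATE's). Conveniently, for the lengths 2, 3, 2, 3, 2 over A..E it yields exactly the codes listed above:
#include <cstddef>
#include <cstdio>
#include <vector>

// RFC 1951-style canonical code assignment: codes of a given length are
// consecutive binary numbers, assigned in symbol order, shorter codes first.
std::vector<int> canonicalCodes(const std::vector<int>& lengths, int maxBits)
{
    std::vector<int> blCount(maxBits + 1, 0);   // how many codes per length
    for (int len : lengths)
        if (len > 0) blCount[len]++;

    std::vector<int> nextCode(maxBits + 1, 0);  // smallest code per length
    int code = 0;
    for (int bits = 1; bits <= maxBits; ++bits) {
        code = (code + blCount[bits - 1]) << 1;
        nextCode[bits] = code;
    }

    std::vector<int> codes(lengths.size(), 0);
    for (std::size_t n = 0; n < lengths.size(); ++n)
        if (lengths[n] > 0) codes[n] = nextCode[lengths[n]]++;
    return codes;
}

int main()
{
    std::vector<int> lengths = {2, 3, 2, 3, 2};      // A, B, C, D, E
    std::vector<int> codes = canonicalCodes(lengths, 3);
    for (std::size_t n = 0; n < codes.size(); ++n) { // prints 00 110 01 111 10
        std::printf("%c: ", static_cast<char>('A' + n));
        for (int b = lengths[n] - 1; b >= 0; --b)
            std::putchar(((codes[n] >> b) & 1) ? '1' : '0');
        std::putchar('\n');
    }
}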
If you want to get really crazy, you could do what DEFLATE does, and make another huffman table of the lengths of these codes, and store its code lengths beforehand. Especially since they add extra codes for "insert zero N times in a row" to shorten things further.
The RFC for DEFLATE isn't too bad, if you're already familiar with huffman coding: http://www.ietf.org/rfc/rfc1951.txt

Branches are 0, leaves are 1. Traverse the tree depth-first to get its "shape".
e.g. the shape for this tree
0 -+- 0 -+- 1 (A)
   |     \- 1 (E)
   \- 0 -+- 1 (C)
         \- 0 -+- 1 (B)
               \- 1 (D)
would be 001101011
Follow that with the bits for the characters in the same depth-first order, AECBD (when reading, you'll know how many characters to expect from the shape of the tree). Then output the codes for the message. You then have a long series of bits that you can divide up into characters for output.
If you are chunking the input, you could test whether storing a new tree for the next chunk is actually more efficient than just reusing the previous chunk's tree, and use a tree shape of "1" as an indicator to reuse the tree from the previous chunk.

The tree is generally created from a frequency table of the bytes. So store that table, or just the bytes themselves sorted by frequency, and re-create the tree on the fly. This of course assumes that you're building the tree to represent single bytes, not larger blocks.
UPDATE: As pointed out by j_random_hacker in a comment, you actually can't do this: you need the frequency values themselves. They are combined and "bubble" upwards as you build the tree. This page describes the way a tree is built from the frequency table. As a bonus, it also saves this answer from being deleted by mentioning a way to save out the tree:
The easiest way to output the huffman tree itself is to, starting at the root, dump first the left hand side then the right hand side. For each node you output a 0, for each leaf you output a 1 followed by N bits representing the value.

A better approach
Tree:
            7
      -------------
     3             4
   -----      ---------
   |   |      2       2
   |   |    -----   -----
   A   B    C   D   E   F
   2   1    1   1   1   1   : frequencies
   2   2    3   3   3   3   : tree depth (encoding bits)
Now just derive this table:
depth number of codes
----- ---------------
2 2 [A B]
3 4 [C D E F]
You don't need to use the same binary tree; just keep the computed tree depth, i.e. the number of encoding bits. So keep the vector of uncompressed values [A B C D E F] ordered by tree depth, and use relative indexes into this separate vector instead. Now recreate the aligned bit patterns for each depth:
depth number of codes
----- ---------------
2 2 [00x 01x]
3 4 [100 101 110 111]
What you immediately see is that only the first bit pattern in each row is significant. You get the following lookup table:
first pattern depth first index
------------- ----- -----------
000 2 0
100 3 2
This LUT has a very small size (even if your Huffman codes can be 32 bits long, it will only contain 32 rows), and in fact the first pattern is always null, so you can ignore it completely when performing a binary search of patterns in it (here only 1 pattern needs to be compared to know whether the bit depth is 2 or 3, and to get the first index at which the associated data is stored in the vector). In our example you'll need to perform a fast binary search of input patterns in a search space of 31 values at most, i.e. a maximum of 5 integer compares. These compares can be unrolled into straight-line code, avoiding all loops and the need to manage state while browsing the integer binary lookup tree.
All of this fits in a small fixed-length structure (the LUT needs at most 31 rows for Huffman codes no longer than 32 bits, and the 2 other columns above will fill at most 32 rows).
In other words, the LUT above requires 31 ints of 32-bit size each, plus 32 bytes to store the bit-depth values; but you can avoid this by implying the depth column (and the first row for depth 1):
first pattern (depth) first index
------------- ------- -----------
(000) (1) (0)
000 (2) 0
100 (3) 2
000 (4) 6
000 (5) 6
... ... ...
000 (32) 6
So your LUT contains [000, 100, 000 (30 times)]. To search in it you must find the position where the input bit pattern sits between two patterns: it must be lower than the pattern at the next position in this LUT but still higher than or equal to the pattern at the current position (if both positions contain the same pattern, the current row does not match; the input pattern fits below). You then divide and conquer, using 5 tests at most (the binary search requires a single piece of code with 5 nested if/then/else levels; it has 32 branches, and the branch reached directly indicates the bit depth, which therefore does not need to be stored; you then perform a single directly indexed lookup into the second table to get the first index, and derive the final index into the vector of decoded values additively).
Once you get a position in the lookup table (by searching the 1st column), you immediately have the number of bits to take from the input and the start index into the vector. The bit depth you get can be used to derive the adjusted index position directly, by basic bit masking after subtracting the first index.
In summary: never store linked binary trees, and you don't need any loop to perform the lookup, which just requires 5 nested ifs comparing patterns at fixed positions in a table of 31 patterns, plus a table of 31 ints containing the start offset within the vector of decoded values (in the first branch of the nested if/then/else tests, the start offset into the vector is implied: it is always zero; it is also the most frequent branch taken, since it matches the shortest codes, which belong to the most frequent decoded values).
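As a concrete, hedged illustration of this idea, here's a C++ sketch that decodes with small per-depth tables instead of a linked tree. Rather than the left-aligned-pattern binary search described above, it uses the equivalent per-depth first-code/first-index form, reading one bit at a time; firstCode and firstIndex play the role of the "first pattern" and "first index" columns:
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Canonical decode without a linked tree.
// symbols:    decoded values ordered by code length (here A B | C D E F)
// firstCode:  value of the first code of each length (indexed by length)
// count:      number of codes of each length
// firstIndex: index in `symbols` of the first code of each length
char decodeOne(const std::string& bits, std::size_t& pos,
               const std::vector<char>& symbols,
               const std::vector<int>& firstCode,
               const std::vector<int>& count,
               const std::vector<int>& firstIndex)
{
    int code = 0;
    for (std::size_t len = 1; len < firstCode.size(); ++len) {
        code = (code << 1) | (bits[pos++] - '0');
        if (code - firstCode[len] < count[len])  // code fits at this depth
            return symbols[firstIndex[len] + code - firstCode[len]];
    }
    return '?'; // invalid input
}

int main()
{
    // Tables hand-filled for the A..F example: A,B use 2 bits, C..F use 3.
    std::vector<char> symbols = {'A', 'B', 'C', 'D', 'E', 'F'};
    std::vector<int> firstCode  = {0, 0, 0, 4}; // 4 = 100 in binary
    std::vector<int> count      = {0, 0, 2, 4};
    std::vector<int> firstIndex = {0, 0, 0, 2};

    std::string bits = "000001100101110111"; // encodes AABCDEF
    std::size_t pos = 0;
    while (pos < bits.size())
        std::putchar(decodeOne(bits, pos, symbols, firstCode, count, firstIndex));
    std::putchar('\n'); // prints AABCDEF
}
In a real decoder you would derive the three tables from the stored code lengths, and you could unroll the per-bit loop into the nested ifs described above.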

There are two main ways to store huffman code LUTs as the other answers state. You can either store the geometry of the tree, 0 for a node, 1 for a leaf, then put in all the leaf values, or you can use canonical huffman encoding, storing the lengths of the huffman codes.
The thing is, one method is better than the other depending on the circumstances.
Let's say the number of unique symbols in the data you wish to compress is n (in aabbbcdddd there are 4 unique symbols: a, b, c, d).
The number of bits to store the geometry of the tree alongside the symbols in the tree is 10n - 1.
Assuming you store the code lengths in order of the symbols they belong to, using 8 bits per length (a code length for a 256-symbol alphabet cannot exceed 255, so it fits in 8 bits), the size of the code length table will be a flat 2048 bits.
When you have a high number of unique symbols, say 256, it will take 2559 bits to store the geometry of the tree. In this case, the code length table is much more efficient. 511 bits more efficient, to be exact.
But if you only have 5 unique symbols, the tree geometry only takes 49 bits, and in this case, when compared to storing the code length table, storing the tree geometry is almost 2000 bits better.
The tree geometry is most efficient for n < 205, while a code length table is more efficient for n >= 205. So, why not get the best of both worlds, and use both? Have 1 bit at the start of your compressed data represent whether the next however many bits are going to be in the format of a code length table, or the geometry of the huffman tree.
In fact, why not add two bits, and when both of them are 0, there is no table, the data is uncompressed. Because sometimes, you can't get compression! And it would be best to have a single byte at the beginning of your file that is 0x00 telling your decoder not to worry about doing anything. Saves space by not including the table or geometry of a tree, and saves time, not having to unnecessarily compress and decompress data.
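A sketch of that header decision, assuming a 256-symbol byte alphabet and the two-bit tag suggested above (the exact tag values and sizes are illustrative, not prescribed):
// 00 = stored (no compression), 01 = code-length table, 10 = tree geometry.
enum class HeaderKind { Stored, LengthTable, TreeGeometry };

HeaderKind chooseHeader(int uniqueSymbols, long payloadBits, long rawBits)
{
    long geometryBits = 10L * uniqueSymbols - 1; // 1 bit/node + 8 bits/leaf
    long lengthTableBits = 8L * 256;             // one 8-bit length per symbol
    long tableBits = geometryBits < lengthTableBits ? geometryBits
                                                    : lengthTableBits;
    if (2 + tableBits + payloadBits >= rawBits)  // compression didn't pay off
        return HeaderKind::Stored;
    return geometryBits < lengthTableBits ? HeaderKind::TreeGeometry
                                          : HeaderKind::LengthTable;
}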

Related

Traversing lists of 0, 1 with constraint

My apologies if this was answered somewhere; I tried searching, but I do not know if this kind of problem has a specific name, so nothing came up in my search...
I have a list of objects, and each of these objects can either be accepted or rejected. Every combination is assigned a value, while some combinations are not valid. (So for example we have 4 objects, and objects 1 and 2 don't go together, then every combination that has objects 1 and 2 as accepted is invalid.) It is not known beforehand which objects don't go together and it is not possible to find the invalid ones just by looking at pairs. (For example it is possible that objects 1, 2 are valid together, objects 2,3 are valid, objects 1,3 are valid, but 1,2,3 are invalid.) I modeled this as lists of 0 and 1, so now I want to traverse these lists to find the one with the maximum value in an efficient way.
My idea was to traverse the lists like a tree by starting at all zeros and then in each step flipping a zero to a one, so for example for 3 objects this gives the tree
               000
            /   |   \
        100    010    001
        /  \   /  \   /  \
     110 101 110 011 101 011
         \    \   |   /   /
               111
which is actually worse than just listing all 2^n options, since there are duplicates, but at each node I could stop if I discovered that it is invalid. By saving the invalid combinations of ones and keeping a list of all already visited nodes, I could make sure that I don't revisit already checked nodes. (But I would still have to check whether a node was already visited.)
Is there any better way to do this?
You can try to build the tree of variants (at most 2^n options, as you noticed), but cut inappropriate branches as early as possible.
In the example below I've set two binary masks - no 1,2,3 together and no 2,4 together:
def buildtree(x, maxsize, level, masks):
    if level == maxsize:
        print("{0:b}".format(x).zfill(maxsize))
    else:
        buildtree(x, maxsize, level + 1, masks)
        t = x | (1 << level)
        good = True
        for m in masks:
            if (t & m) == m:
                good = False
                break
        if good:
            buildtree(t, maxsize, level + 1, masks)

buildtree(0, 4, 0, [7, 10])
0000
1000
0100
1100
0010
0110
0001
1001
0101
1101
0011
It is also possible to remove some masks, but the code will be more complicated.

Hash tables runtime complexity for chaining with 2 hash function

This question deals with collisions, based on a new approach to chaining in hash tables.
There are 2 hash functions. The first function is h1(x) = x mod m1;
with this function all the items are hashed into the primary hash table.
Inside each index of the primary hash table there is an internal hash table that hashes the key with the second function, h2(x) = x mod m2 (with m1 != m2).
For example, let's say m1 = 5 and m2 = 3,
and I want to insert 2: h1(2) = 2 mod 5 = 2 and h2(2) = 2 mod 3 = 2.
This means 2 will be inserted at the second index of the primary table, and within it at the second index of the internal table.
When a collision happens in the primary table (meaning h1(x) = x % m1 = y % m1 = h1(y)), we go to the second hash function, calculate h2(x) and h2(y), and put each one at the corresponding index of the internal hash table. Say h1(x) = x % 5 and h2(x) = x % 3: if we insert 7 and 12, we get h1(7) = 2 and h1(12) = 2, so both land at index 2 of the primary hash table. We then compute h2 for both (h2(7) = 1 and h2(12) = 0), which means we put 7 at index 1 and 12 at index 0 of the internal table (and by this we avoid the collision).
This was a question on an exam. The first section asked whether there is a collision for these numbers - 0, 5, 15, 17 (with m1 = 5 and m2 = 3) - and obviously 0 and 15 have the same modulo for both 5 and 3. The second section asked for the worst-case runtime complexity of search, and the third section asked for 5 numbers that produce the worst case when searching for the number 2 in the table.
So the question is: what is the worst-case runtime complexity of search?
And what is an example of 5 numbers that cause the worst case when searching for the number 2?
I think the complexity is O(1), and I used these 5 numbers:
7 12 17 22 42
Is this correct? Can anybody help with this?
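For concreteness, here is a minimal C++ sketch of the two-level scheme as I understand it from the description (m1 = 5, m2 = 3). Note that the inner slots still need chains, since two keys can agree on both functions, e.g. the 0 and 15 mentioned above:
#include <cstdio>
#include <list>
#include <vector>

// Primary table of m1 buckets; each bucket is an internal table of m2
// buckets; each internal slot is a chain (list) of keys.
struct TwoLevelTable {
    int m1, m2;
    std::vector<std::vector<std::list<int>>> slots;

    TwoLevelTable(int m1_, int m2_)
        : m1(m1_), m2(m2_),
          slots(m1_, std::vector<std::list<int>>(m2_)) {}

    void insert(int x) { slots[x % m1][x % m2].push_back(x); }

    bool search(int x) const {
        for (int k : slots[x % m1][x % m2])  // walk the final chain
            if (k == x) return true;
        return false;
    }
};

int main()
{
    TwoLevelTable t(5, 3);
    for (int x : {7, 12, 0, 5, 15, 17}) t.insert(x);
    std::printf("%d %d\n", t.search(7), t.search(2)); // prints 1 0
}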

How to write a loop across hierarchical data (household-individual) in Stata?

I'm now working on a household survey data set and I'd like to give certain members extra IDs according to their relationship to the household head. More specifically, I need to identify the adult children of household head and his/her spouse, if married, and assign them "sub-household IDs".
The variables are: hhid - household ID; pid -individual ID; relhead - relationship with head.
Regarding relhead, a 1 represents the head, a 6 represents a child, and a 7 represents a child-in-law. Below is some example data, with the desired outcome in the last column. I assume that whenever a 6 is followed by a 7, they constitute a couple and belong to the same sub-household.
hhid   pid   relhead   sub_hhid (desired)
  50     1         1          1
  50     2         3          1
  50     3         6          2
  50     4         6          3
  50     5         7          3
----------------------------------------------
  67     1         1          1
  67     3         6          2
  67     4         7          2
Here are some thoughts:
There may be married and unmarried adult children within one household; the family structure is a little bit complicated, so I want to write a loop across the members of a household.
The basic idea is that in the outer loop we identify the children staying at home, and then check whether a spouse is present; if there is, we give the couple one indicator, and if not, we continue and give the single stay-at-home child another indicator. After walking through all the possible members within a household, we get a series of within-household IDs. To facilitate further analysis, I need some kind of external ID variable to separate the sub-families.
* Define N as the total number of households, n as the number of individuals (household size)
* sty_chil is an indicator for an adult child living with the parents (head)
* sty_chil_sp is the adult child's spouse
* "hid" and "ind_id" are local macros

forvalue hid = 1/N {
    forvalue ind_id = 1/n {
        if sty_chil[`ind_id'] == 1 {
            check if sty_chil_sp[`ind_id'+1] == 1 {
                * a 6-7 pair, identified as a couple
                if yes, then assign sub_hhid to this couple
            }
            else {
                * a single 6, identified as a single child
                assign sub_hhid to this child
            }
        }
        else {
            * other relationships than 6, move forward
            ++ind_id across the members within a household
        }
    }
    ++hid    * move forward across households
}
The built-in Stata by, sort: prefix is pretty powerful, but here I want to treat the part of the family members who fall under a certain criterion and leave the others untouched, so an if-else type loop is more natural for me (even if by: may achieve my goal, it always gets too fiddly when the situation becomes not so simple, and we cannot exhaust all the possible household patterns).
An immediate problem is that I don't know how to write a loop across household IDs and individual IDs, because I used to acquire the household size (the increment of the outer loop) using the by command (I'm not sure whether in this case it's 1 or the number of family members), and I'm not sure whether mixing up by and if loops is good programming practice; I favor writing a "full loop" in this case. Please give me some clues on how to achieve my goal and provide illustrative pseudo-code.
An extra question: I cannot find the ado-file which contains the implementation of the by command; does it exist?
I will abstract from the issue of whether the assumption used to create matches is a sensible one or not. Rather, let this be an example of reaching the desired results without using explicit loops. Some logic and the use of subscripting (see help subscripting) can get you far.
clear
set more off
*----- example data -----
input ///
hhid pid relhead sub_hhid
  50   1       1        1
  50   3       6        2
  50   4       6        3
  50   5       7        3
  67   1       1        1
  67   3       6        2
  67   4       7        2
  67   5       6        3
end
list, sepby(hhid)
*----- what you want -----
bysort hhid (pid): gen hhid2 = sum( !(relhead == 7 & relhead[_n-1] == 6) )
list, sepby(hhid)
As you can see, one line of code gets you there. The reasoning is the following:
sum() gives the running sum. The arguments to sum(), being conditions, can either be True or False. The ! denotes the logical not (see help operators).
If it is not the case that the relationship is daughter/son-in-law AND the previous relationship is daughter/son, the condition evaluates to True and takes on the value of 1, increasing the running sum by 1. If it evaluates to False, meaning that the relationship is daughter/son-in-law AND the previous relationship is daughter/son, then it takes on the value of 0 and the running sum will not increase. This gives the result you seek.
You do this using the by: prefix, since you want to check each original household independently, so to speak.
For the first observation of each original household, the condition always evaluates to True. This is because there exists no "previous" observation (relationship), and Stata considers that nonexistent relhead to be missing (., a very large number) and therefore not equal to 6. This takes the running sum from 0 to 1 for the first observation of each sub-group, and so on.
Bottom line: learn how to use by: and take advantage of the features offered by Stata. Do not swim against the current; not here.
Edit
Please note that instead of progressively changing your example data set, you should provide a representative example from the beginning. Not doing so can render answers that are initially OK completely inadequate.
For your modified example, add:
replace hhid2 = 1 if !inlist(relhead,6,7)
That will simply assign anyone not 6 or 7 to the same household as the head. The head is assumed to always have hhid2 == 1. If the head can have hhid2 != 1, then
bysort hhid (relhead): replace hhid2 = hhid2[1] if !inlist(relhead,6,7)
should work.
You can follow with:
bysort hhid (pid): replace hhid2 = hhid2[_n-1] + 1 if hhid2 != hhid2[_n-1] & _n > 1
but because they are IDs, it's not really necessary.
Finally, use:
gen hhid3 = string(hhid) + "_" + string(hhid2)
to create IDs with the form 50_1, 50_2, 50_3, etc.
Like I said before, if your data presents more complications, you should present a relevant example.

is vector < list <marker> > the right way to approach this?

I'm trying to solve a problem for work and am a novice programmer. I have three files, all tab delimited.
File1 has two fields, Marker_id and position. This file is sorted by position (0-26), and within a position the Marker_id values are in an order that is a consequence of another application, not alphabetical. The order of Marker_id is important because the goal of my program is to find a starting Marker_id and count the number of markers between that and an ending marker. This file contains nearly 2,500,000 entries.
File2 has one field, Marker_id. This is the same Marker_id that is used in File1, but this file contains only around 2,200,000 entries. It is a list of "active" markers, i.e. markers that should be counted by my program.
File3 has the fields position, starting_marker, ending_marker, number_markers, and others. I basically need to update the number_markers field by counting the number of markers between start and end.
I already have code that reads file1 into
vector< list<MARKER> >;
where MARKER is a struct:
struct MARKER {
    string snp_id;
    bool included;
    MARKER(string temp_id) : snp_id(temp_id), included(false) { }
};
The position (0-26) from file one specifies which index of the vector the markers are stored at. I also successfully update the count in file3 with the number of markers between start and stop.
However, I'm having trouble implementing a function to trim my list to only "active" markers. I was going to simply set included to true for the entries found in file2, until I realized file2 does not contain position, and therefore I'd have to search every list at each vector index. This is possible; I just feel like it would be incredibly slow with so many entries.
I'm trying to think of alternatives such as storing file1 in a map where the key is Marker_id, but needing to keep Marker_id's in original order for counting is hanging me up.
Does anyone have any advice? Thanks.
UPDATE (example files):
***File1***
Marker_id position
test_marker_1 1
test_marker_2 1
test_marker_3 1
test_marker_4 1
test_marker_5 1
test_marker_6 1
test_marker_7 1
test_marker_8 1
test_marker_9 1
.
***File2***
Marker_id C20020.Log R Ratio C20020.B Allele Freq
test_marker_1 0.0180 0.0010
test_marker_3 -0.0340 0.5000
test_marker_4 0.0500 0.0700
test_marker_5 0.0500 0.0700
test_marker_6 0.0500 0.0700
test_marker_7 0.0500 0.0700
test_marker_9 0.0500 0.0700
Note: test_marker_2 and test_marker_8 are omitted from file 2 and therefore, will not be included in counts.
***File3***
position copy_num sampleID startMarker endMarker conf num_Markers
1 4 C20020 test_marker_1 test_marker_3 1774.967 0
1 3 C20020 test_marker_3 test_marker_5 17.967 0
1 0 C20020 test_marker_7 test_marker_9 107.967 0
.
***My desired output***
position copy_num sampleID startMarker endMarker conf num_Markers
1 4 C20020 test_marker_1 test_marker_3 1774.967 2
1 3 C20020 test_marker_3 test_marker_5 17.967 3
1 0 C20020 test_marker_7 test_marker_9 107.967 2
As it stands now, I have everything functioning except my counts would be 3 for all three examples since I do not exclude those Markers not found in file2.
A couple of approaches come to mind.
You could sort files 1 and 2 by marker id (temporary copies, of course), then could easily determine the markers that are in file 1 but not in file 2 in a single pass. You could then use this "exclusion list" to determine markers to ignore in the other part of the algorithm. Per your numbers, this would be ~300,000 items, which could be inserted into a hash map for quick lookups.
Of course, if you have oodles of memory, you could always just put all of file 2 into a hash map, and use it in the same way.
If memory is a real issue, but the marker values are such that they define a complete space (e.g. numbers 1 to 10 million, or whatever), where markers can be mapped to offsets, then you could create a bit map of the entire space, with 1's only for markers that are active. Again, using this bitmap to exclude those markers that are to be ignored.
Basically, as long as you can get a constant-time check for the include/exclude test, you're laughing.
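For example, the hash-map variant might look like this sketch (assuming, as in the example files, that File2 is tab-delimited with the marker id in the first column; all names are illustrative):
#include <fstream>
#include <sstream>
#include <string>
#include <unordered_set>

// Load the "active" marker ids from file2 into a hash set so the
// include/exclude test while scanning file1 is constant time.
std::unordered_set<std::string> loadActiveMarkers(const std::string& path)
{
    std::unordered_set<std::string> active;
    std::ifstream in(path);
    std::string line;
    std::getline(in, line); // skip the header row
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        std::string markerId;
        if (std::getline(fields, markerId, '\t'))
            active.insert(markerId);
    }
    return active;
}

// Usage while filling the vector< list<MARKER> > from file1:
//     marker.included = active.count(marker.snp_id) != 0;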

Count the number of possible permutations of numbers less than integer N, given N-1 constraints

We are given an integer N and we need to count the total number of permutations of the numbers less than N, subject to N-1 given constraints. e.g.:
if N=4 then count permutations of 0,1,2,3 given:
0>1
0>2
0>3
I thought about making a graph and then counting the total number of permutations of the numbers at the same level and multiplying it by the permutations at other levels, e.g.:
For above example:
        0
      / | \
     /  |  \
    1   2   3    ------> 3! = 6, so the total number of permutations is 6.
But I am having difficulty implementing it in C++. Also, this question was asked in the Facebook Hacker Cup; the competition is over now. I have seen other people's code and found that they did it using DFS. Any help?
The simplest way to do this is to use a standard permutation generator and filter out each permutation that violates the conditions. This is obviously very inefficient, and for larger values of N it is not computable. Doing this is sort of the "booby" option that these contests have, which allows the less smart contestants to complete the problem.
The skilled approach requires insight into the ways of counting combinations and permutations. To illustrate the method I will use an example. Inputs:
N = 7
2 < 4
0 < 3
3 < 6
We first simplify this by combining the dependent conditions into a single condition, as follows:
2 < 4
0 < 3 < 6
Start with the longest condition, and determine the combination count of the gaps (this is the key insight). For example, some of the combinations are as follows:
XXXX036
XXX0X36
XXX03X6
XXX036X
XX0XX36
etc.
Now, you can see there are 4 gaps: ? 0 ? 3 ? 6 ?. We need to count the possible partitions of X's in these four gaps. The number of such partitions is (7 choose 3) = 35 (do you see why?). Now, we next multiply by the combinations of the next condition, which is 2 < 4 over the remaining blank spots (the Xs). We can multiply because this condition is fully independent of the 0<3<6 condition. This combination count is (4 choose 2) = 6. The final condition has 2 values in 2 spots = 2! = 2. Thus, the answer is 35 x 6 x 2 = 420.
Now, let's make it a little more complicated. Add the condition:
1 < 6
The way this changes the calculation is that, before, 036 had to appear in that order. But now we have three possible arrangements:
1036
0136
0316
Thus, the total count is now (7 choose 4) x 3 x (3 choose 2) = 35 x 3 x 3 = 315.
So, to recap: the procedure is to isolate the problem into independent conditions. For each independent condition you calculate the combinations of partitions, then you multiply them together.
I have walked through this example manually, but you can program the same procedure.
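If you want something concrete to code: whenever each number appears on the right-hand side of at most one "x < y" constraint, the constraints form a forest, and the number of valid permutations is n! divided by the product of all subtree sizes (the hook length formula for forests; plausibly what the DFS solutions compute). This covers the worked example above, but not the extra 1 < 6 condition, which turns the constraints into a general DAG. A sketch, ignoring overflow (a contest solution would work modulo a prime):
#include <cstdio>
#include <utility>
#include <vector>

// Subtree size rooted at v; multiplies every subtree size into denom.
long long subtreeSize(const std::vector<std::vector<int>>& children,
                      int v, long long& denom)
{
    long long size = 1;
    for (int c : children[v])
        size += subtreeSize(children, c, denom);
    denom *= size;
    return size;
}

int main()
{
    int n = 7;
    // Constraints "a before b": 2<4, 0<3, 3<6 (the worked example above).
    std::vector<std::pair<int, int>> edges = {{2, 4}, {0, 3}, {3, 6}};

    std::vector<std::vector<int>> children(n);
    std::vector<bool> hasParent(n, false);
    for (auto [a, b] : edges) {
        children[a].push_back(b);
        hasParent[b] = true;
    }

    long long denom = 1;
    for (int v = 0; v < n; ++v)          // visit each root of the forest
        if (!hasParent[v]) subtreeSize(children, v, denom);

    long long fact = 1;
    for (int i = 2; i <= n; ++i) fact *= i;
    std::printf("%lld\n", fact / denom); // prints 420, matching the text
}
For the question's own example (0 before each of 1, 2, 3), the subtree sizes are 4, 1, 1, 1, giving 4!/4 = 6, matching the 3! = 6 in the question.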