For my programming project, I need to create a list of 65536 (2^16) lists, each of which is again 65536 elements long. Each of the inner lists contains integers with two hexadecimal digits. I was wondering whether my machine can handle this very long nested list without losing any values, never mind the running time.
I read a response on here which says that sys.maxsize gives the maximum size of a list, and my machine gives me 2147483647. However, does it follow from the fact that a list can store at most 2147483647 elements that it can store 65536 lists, even though each of those lists is itself 65536 elements long?
The limit on the length of a list only refers to the number of elements in the list. It doesn't matter what those elements are, since the list itself only holds references to them; it doesn't store the contents of the objects inside itself.
In your case, you have a list which stores 65536 elements, and 65536 is less than the maximum size of a list, so that's fine. The fact that the elements of the list are themselves lists is irrelevant.
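As a rough sanity check of that point, here is a minimal Python sketch (the filler value 0x2A and the row count of 16 are just placeholders); the practical limit is memory rather than list length, since the full table holds 2^32 references:

    import sys

    print(sys.maxsize)                     # per-list element limit (2147483647 here)

    row = [0x2A] * 65536                   # one inner list of two-hex-digit integers
    table = [list(row) for _ in range(16)] # try a small slice first; the full
                                           # 65536 x 65536 table needs tens of GB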
The maximum size of a Python list on a 32-bit system is 536,870,912 elements.
I don't believe that limit applies to the total number of elements across the nested lists. (65536^2 = 4,294,967,296, which would exceed it.)
Why not try it and find out?
I'm creating a List data structure in PHP based on the WHATWG Infra Standard as a programming exercise, and I'm having trouble clarifying a couple of items.
I don't see anything that says indices in a list must be consecutive. However, it seems to be implied by the definition of getting the indices of a list:
To get the indices of a list, return the range from 0 to the list’s size, exclusive.
The definition of getting the indices (quoted above) says to return the range from 0 to the list's size, but wouldn't that include one more index than is actually present? I see the sentence ends with ", exclusive", but I don't know what that means in this context.
Any insight is much appreciated!
There is no operation that would allow for a "sparse" list if that's what you are asking for.
Removing items from a list will make the following items "move" to fill the now empty positions.
And regarding "return the range from 0 to the list's size, exclusive": "range" links to https://infra.spec.whatwg.org/#the-exclusive-range, which states:
The range n to m, exclusive, creates a new ordered set containing all of the integers from n up to and including m − 1 in consecutively increasing order, as long as m is greater than n. If m equals n, then it creates an empty ordered set.
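In other words, it behaves like Python's range(): for a list of size 4 the indices are 0 through 3, and 4 itself is excluded. A tiny illustration (in Python rather than PHP, purely to show the semantics):

    size = 4
    print(list(range(0, size)))   # [0, 1, 2, 3] -- the size itself is excluded
    print(list(range(0, 0)))      # []           -- empty list, empty index set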
Don't be afraid of following all the links in the specs if you want to implement it.
I need to map an array of sorted integers, with a length varying from 1 to at most 4, to an index of a global array, e.g. [13,24,32] becoming a number in the range 0..n, with no other array mapping to that same number.
The quantity of arrays is a few million, and the mapping has to be "unique" (or at least have very few collisions for the smaller arrays), because these arrays represent itemsets, and I use the smaller itemsets of size k-1 to build the ones of size k.
My current implementation uses an efficient hash function that produces a double between 0 and 1 for an array, and I store the itemsets in an STL map with the double as the key. I got it from this article:
N. D. Atreas and C. Karanikas, "A faster pattern matching algorithm based on prime numbers and hashing approximation", 2007.
I'm going to implement a parallel version of this in CUDA, so I can't use something like an STL map. I could easily create a self-balancing binary search tree as a map in GPU global memory, but that would be really slow. So, in order to reduce global memory accesses to a minimum, I need to map each itemset to a huge array in global memory.
I've tried casting the double to a long integer and hashing it with a 64-bit hash function, but it produces some collisions, as expected.
So I ask: is there a "unique" hash function for doubles between 0 and 1, or for arrays of 1 to 4 integers, that gives a unique index into a table of size N?
If I make this assumption about your arrays:
each item of your arrays (such as 13) is a 32-bit integer,
Then what you ask is impossible.
You have on the order of 2^(32*4) possible inputs, i.e. roughly 128 bits of information, and you are trying to pack them into an array of much smaller size (about 20 bits of index for a million entries). You cannot do this without collisions (or without some agreement amongst the elements, such as each element choosing the "next available index", but then it isn't a hash).
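A back-of-the-envelope check of that counting argument (the table size of one million is taken from the question; the rest is arithmetic):

    import math

    input_bits = 32 * 4                                # up to 2**128 possible arrays
    table_entries = 1_000_000                          # roughly "a few million" slots
    index_bits = math.ceil(math.log2(table_entries))   # about 20 bits of output

    print(input_bits, index_bits)   # 128 vs 20: by pigeonhole, collisions must occur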
There are two integer arrays, each in a file too large to fit in RAM. How would you find the common elements of the arrays in linear time?
I can't find a decent solution to this problem. Any ideas?
One pass over one file to build a bitmap (or a Bloom filter if the integer range is too large for a bitmap in memory).
One pass over the other file to find the common elements (or candidates, if using a Bloom filter).
If you use a Bloom filter, the result is probabilistic. Additional passes can reduce the false positives (Bloom filters have no false negatives).
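A minimal Bloom-filter sketch in Python, assuming the integers fit in 8 bytes; the filter size and the number of hash functions are placeholder choices you would tune to your memory budget and false-positive target:

    import hashlib

    M_BITS = 1 << 24        # assumed filter size (2 MB of bits)
    K_HASHES = 4            # assumed number of hash functions

    def positions(value):
        for salt in range(K_HASHES):
            digest = hashlib.blake2b(value.to_bytes(8, "little", signed=True),
                                     salt=bytes([salt])).digest()
            yield int.from_bytes(digest[:8], "little") % M_BITS

    def add(bits, value):
        for p in positions(value):
            bits[p // 8] |= 1 << (p % 8)

    def maybe_contains(bits, value):
        return all(bits[p // 8] & (1 << (p % 8)) for p in positions(value))

    bits = bytearray(M_BITS // 8)
    # Pass 1: add() every integer from the first file.
    # Pass 2: maybe_contains() for each integer of the second file; a True
    # result is only a candidate, since false positives are possible.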
Assuming integer size is 4 bytes.
Now, there can be at most 2^32 distinct integers, so I can have a bit vector of 2^32 bits (512 MB) to represent all of them, where each bit represents one integer.
1. Initialize this vector with all zeroes
2. Now go through one file and set the bit for each integer you find to 1.
3. Now go through the other file and, for each integer, check whether its bit is set in the bit vector (see the sketch below).
Time complexity: O(n+m)
Space complexity: 512 MB
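A sketch of that bit-vector approach in Python, assuming the files contain raw little-endian 4-byte unsigned integers (the file names and layout are assumptions):

    BITS = 1 << 32
    bitvec = bytearray(BITS // 8)          # 512 MB, one bit per possible value

    def read_u32(path):
        with open(path, "rb") as f:
            while word := f.read(4):
                yield int.from_bytes(word, "little")

    for x in read_u32("file_a.bin"):       # step 2: mark every value in file A
        bitvec[x >> 3] |= 1 << (x & 7)

    common = set()
    for x in read_u32("file_b.bin"):       # step 3: report values also seen in A
        if bitvec[x >> 3] & (1 << (x & 7)):
            common.add(x)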
You can obviously use a hash table to find the common elements with O(n) time complexity.
First, build a hash table from the first array, then check the second array against that hash table.
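For example, if at least one file's integers fit in memory (the file names and the one-integer-per-line format are assumptions):

    with open("first.txt") as f:
        seen = {int(line) for line in f}    # hash table from the first array

    common = set()
    with open("second.txt") as f:
        for line in f:
            x = int(line)
            if x in seen:                   # O(1) expected lookup
                common.add(x)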
Let's say enough RAM is available to hold a hash of 5% of either given file array (FA).
So, I can split the file arrays (FA1 and FA2) into 20 chunks each, say by taking each value MOD 20. We get FA1(0)...FA1(19) and FA2(0)...FA2(19). This can be done in linear time.
Hash FA1(0) in memory and compare the contents of FA2(0) against this hash. Hashing and checking for existence are constant-time operations.
Destroy this hash and repeat for FA1(1)...FA1(19). This is also linear, so the whole operation is linear.
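A sketch of that partitioning idea; for brevity it re-reads the files once per bucket instead of pre-splitting them into chunk files, but the matching logic is the same (the file names and one-integer-per-line format are assumptions):

    NUM_BUCKETS = 20

    def bucket_values(path, bucket):
        with open(path) as f:
            for line in f:
                x = int(line)
                if x % NUM_BUCKETS == bucket:   # values can only match within
                    yield x                     # their own MOD-20 bucket

    common = set()
    for b in range(NUM_BUCKETS):
        in_fa1 = set(bucket_values("fa1.txt", b))   # hash FA1(b) in memory
        common.update(x for x in bucket_values("fa2.txt", b) if x in in_fa1)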
Assuming you are talking about fixed-size integers written to the files in binary mode, you first sort the two files (use a quicksort, but one that reads and writes at file offsets instead of array indices).
Then you just move forward from the start of the two files and check for matches; when you have a match, write it to another output file (assuming you can't store the result in memory either) and keep moving through both files until EOF.
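A sketch of that final matching pass over two already-sorted files of 4-byte little-endian integers (the file names and layout are assumptions):

    def next_u32(f):
        word = f.read(4)
        return int.from_bytes(word, "little") if word else None

    with open("a.sorted", "rb") as fa, open("b.sorted", "rb") as fb, \
         open("common.bin", "wb") as out:
        a, b = next_u32(fa), next_u32(fb)
        while a is not None and b is not None:
            if a == b:
                out.write(a.to_bytes(4, "little"))
                a, b = next_u32(fa), next_u32(fb)
            elif a < b:
                a = next_u32(fa)
            else:
                b = next_u32(fb)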
Sort the files. With fixed-length integers this can be done in O(n) time:
Read a chunk of the file, sort it with radix sort (see the sketch below), and write it to a temporary file. Repeat until all the data has been processed. This part is O(n).
Merge the sorted parts. With a bounded number of parts this is O(n) too. You can even skip repeated numbers.
On the sorted files, find the common subset of integers: compare the current numbers, write one down if they are equal, then step forward in the file with the smaller number. This is O(n).
All operations are O(n), so the final algorithm is O(n) too.
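As a sketch of the radix-sort pass from step 1 (for non-negative 32-bit integers, one byte per pass, four passes in total):

    def radix_sort_u32(values):
        for shift in (0, 8, 16, 24):                 # least significant byte first
            buckets = [[] for _ in range(256)]
            for v in values:
                buckets[(v >> shift) & 0xFF].append(v)
            values = [v for bucket in buckets for v in bucket]
        return values

    print(radix_sort_u32([13, 2, 250_000, 7, 2]))    # [2, 2, 7, 13, 250000]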
EDIT: the bitmap method is much faster if you have enough memory for the bitmap. It works for any fixed-size integers, 64-bit for example, but a bitmap covering all 64-bit values would take 2^61 bytes (2^31 GB), which will not be practical for at least a few years :)
If I want to use extendible hashing to store a maximum of 100 records, then what is the minimum array size that I need?
I am guessing that an array of 100 would be sufficient, but I could be wrong. I also suspect that I can use a smaller array.
What do you know about your hash function?
You mentioned extendible hashing.
With extendible hashing you treat the hash as a bit string and typically implement the bucket lookup via a trie. Instead of a trie-based lookup, though, I assume you are converting the leading bits of the hash into an index into your array.
You mentioned you will have at most 100 elements. If you wanted every element in its own slot, you'd need 128 slots, since 7 bits (2^7 = 128) is the smallest number of bits that covers 100 values.
If your hash function gives every element a distinct value in its first 7 bits, then you have the optimal solution with a bucket size of 1, leaving 128 leaf nodes, i.e. an array of size 128.
If it only guarantees distinct values in the first 6 bits, then you have a bucket size of 2 and 64 leaf nodes/combinations, i.e. an array of size 64.
If it only guarantees distinct values in the first 5 bits, then you have a bucket size of 4 and 32 leaf nodes/combinations, i.e. an array of size 32.
Since you said you want a bucket size of 4, I think your answer would be 32, with the hard requirement that you have a good hash function that gives you at least the first 5 bits as distinct.
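A rough sketch of that directory lookup in Python: take the first few bits of the hash as an index into the bucket array (the blake2b hash, the names, and the error in place of a real bucket split are all simplifications):

    import hashlib

    GLOBAL_DEPTH = 5                      # 2**5 = 32 directory entries
    BUCKET_SIZE = 4

    directory = [[] for _ in range(1 << GLOBAL_DEPTH)]

    def dir_index(key):
        h = int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big")
        return h >> (64 - GLOBAL_DEPTH)   # first GLOBAL_DEPTH bits of the hash

    def insert(key):
        bucket = directory[dir_index(key)]
        if len(bucket) >= BUCKET_SIZE:    # real extendible hashing would split the
            raise RuntimeError("bucket full")  # bucket and maybe double the directory
        bucket.append(key)

    insert(b"record-1")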
I think it depends on whether you need high performance or want to save storage. You could just save the elements into an array of 100. I don't know a lot about extendible hashing, but my general understanding of hashing is that there will be some collisions, and if you use a bigger array the number of collisions goes down and adding, deleting, and querying get faster. I think you should use at least 128 (just so it's 2^k; I am not an expert in hashing) :)
Say you have a List of 32-bit Integers and the same collection of 32-bit Integers in a Multiset (a set that allows duplicate members).
Since Sets don't preserve order but Lists do, does this mean we can encode a Multiset in fewer bits than the List?
If so how would you encode the Multiset?
If this is true, what other examples are there where not needing to preserve order saves bits?
Note, I just used 32-bit Integers as an example. Does the data type matter for the encoding? Does the data type need to be fixed-length and comparable for you to get the savings?
EDIT
Any solution should work well for collections with low duplication as well as high duplication. It's obvious that with high duplication, encoding a Multiset by simply counting duplicates is very easy, but this takes more space if there is no duplication in the collection.
In the multiset, each entry would be a pair of numbers: the integer value and a count of how many times it occurs in the set. This means additional repeats of a value in the multiset cost nothing extra to store (you just increment the counter).
However (assuming both values are ints), this is only more compact than a simple list if each item is repeated twice or more on average. There could be more efficient or higher-performance ways of implementing this, depending on the range, sparsity, and repetitiveness of the numbers being stored. (For example, if you know there won't be more than 255 repeats of any value, you could use a byte rather than an int to store the counter.)
This approach works with any type of data, as you are just storing the count of how many repeats there are of each data item. Each data item needs to be comparable (but only to the point where you know whether two items are the same or different). There is no need for the items to each take the same amount of storage.
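A small Python sketch of the pair encoding, with 4-byte values and 4-byte counters assumed, showing the break-even point at an average of two repeats per value:

    from collections import Counter

    data = [7, 7, 7, 42, 42, 1_000_000]

    list_bytes = 4 * len(data)              # 24 bytes as a plain list
    pairs = Counter(data)                   # {7: 3, 42: 2, 1000000: 1}
    pair_bytes = 8 * len(pairs)             # 24 bytes as (value, count) pairs

    print(list_bytes, pair_bytes)           # break-even: average repeat count is 2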
If there are duplicates in the multiset, it can be compressed to a smaller size than a naive list. You may want to have a look at run-length encoding, a very simple algorithm that stores duplicates efficiently.
Hope that is what you meant...
Data compression is a fairly complicated subject, and there are redundancies in data that are hard to use for compression.
It is fundamentally ad hoc, since a non-lossy scheme (one where you can recover the input data) that shrinks some data sets has to enlarge others. A collection of integers with lots of repeats will do very well in the counted-multiset representation, but if there's no repetition you're wasting a lot of space repeating counts of 1. You can test this by running compression utilities on different files: text files have a lot of redundancy and can typically be compressed a lot, while files of random numbers tend to grow when compressed.
I don't know that there really is an exploitable advantage in losing the order information. It depends on what the actual numbers are, primarily on whether there's a lot of duplication or not.
In principle, this is equivalent to sorting the values and storing the first entry plus the ordered differences between subsequent entries.
In other words, for a sparsely populated set only a small saving can be had, but for a denser set, or one with clustered entries, more significant compression is possible (i.e. fewer bits need to be stored per entry, possibly less than one in the case of many duplicates). So compression is possible, but the level depends on the actual data.
The operation sort followed by list delta will result in a serialized form that is easier to compress.
E.g. [2 12 3 9 4 4 0 11] -> [0 2 3 4 4 9 11 12] -> [0 2 1 1 0 5 2 1], which weighs about half as much.
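A small sketch of that sort-plus-delta transform; the gaps are small numbers that a variable-length or entropy coder can then pack into fewer bits:

    def delta_encode(values):
        s = sorted(values)
        return [s[0]] + [b - a for a, b in zip(s, s[1:])]

    def delta_decode(deltas):
        out, total = [], 0
        for d in deltas:
            total += d
            out.append(total)
        return out

    print(delta_encode([2, 12, 3, 9, 4, 4, 0, 11]))   # [0, 2, 1, 1, 0, 5, 2, 1]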