remove elements: but which container to prefer - c++

I am keeping the nonzeros of a sparse matrix representation as triplets, known in the numerical community as coordinate (COO) or triplet storage; the entries are stored row-wise. For instance, a 4x4 matrix is represented as
r:0 0 1 1 2 2 3 3 3
c:0 3 2 3 2 3 1 2 3
v:1 5 2 2 4 1 5 4 5
so 'r' gives row indices, 'c' gives column indices, and 'v' holds the value associated with the two indices above it.
I would like to delete some rows and columns from my matrix representation, say rows and columns 1 and 3, so I should remove the 1s and 3s from the 'r' and 'c' arrays. I am also trying to learn more about the performance of the STL containers, so I have been reading a bit. As a first try, I created a multimap and deleted the items by looping over them with multimap's find method. This removes the found keys but can leave some of the searched values behind in the 'c' array, so I then swapped the key/value pairs and repeated the operation on this second map. That did not seem like a very good solution to me, although it is pretty fast (on a problem with 50000 entries). So the question is: what would be the most efficient way to do this with standard containers?

You could use a map from a (row, column) pair to the value, something like map<pair<int,int>, int>.
If you then want to delete a row, you iterate over the elements and erase those with the to-be-deleted row. The same can be done for columns.
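A minimal sketch of that idea, using the example matrix from the question; the C++11 "erase returns the next iterator" idiom keeps the loop valid while erasing:

```cpp
#include <iostream>
#include <map>
#include <utility>

int main() {
    // Sparse 4x4 matrix stored as (row, col) -> value.
    std::map<std::pair<int, int>, int> m = {
        {{0, 0}, 1}, {{0, 3}, 5}, {{1, 2}, 2}, {{1, 3}, 2}, {{2, 2}, 4},
        {{2, 3}, 1}, {{3, 1}, 5}, {{3, 2}, 4}, {{3, 3}, 5}};

    // Erase every entry whose row or column is 1 or 3.
    for (auto it = m.begin(); it != m.end();) {
        int r = it->first.first, c = it->first.second;
        if (r == 1 || r == 3 || c == 1 || c == 3)
            it = m.erase(it);  // erase returns the next valid iterator
        else
            ++it;
    }

    for (const auto& e : m)  // prints 0,0 -> 1 and 2,2 -> 4
        std::cout << e.first.first << ',' << e.first.second << " -> "
                  << e.second << '\n';
}
```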

How are you accessing the matrix? Do you look up particular rows/columns and do things with them that way, or do you use the whole matrix at a time for operations like matrix-vector multiplications or factorization routines? If you're not normally indexing by row/column, then it may be more efficient to store your data in std::vector containers.
Your deletion operation is then a matter of iterating straight through the container, sliding down subsequent elements in place of the entries you wish to delete. Obviously, there are tradeoffs involved here. Your map/multimap approach will take something like O(k log n) time to delete k entries, but whole-matrix operations in that representation will be very inefficient (though hopefully still O(n) and not O(n log n)).
Using the array representation, deleting a single row or column would take O(n) time, but you could delete an arbitrary number of rows or columns in the same single pass, by keeping their indices in a pair of hash tables or splay trees and doing a lookup for each entry. After the deletion scan, you could either resize the vectors down to the number of elements you have left, which saves memory but might entail a copy, or just keep an explicit count of how many entries are valid, trading dead memory for saving time.
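A sketch of that single-pass deletion over the three parallel arrays, assuming the doomed row/column indices are kept in std::unordered_set (the function name and signature are mine, not from the question):

```cpp
#include <unordered_set>
#include <vector>

// Remove all entries whose row or column appears in the given sets,
// compacting r, c and v in one pass: O(n) expected time for any
// number of deleted rows/columns.
void removeRowsCols(std::vector<int>& r, std::vector<int>& c,
                    std::vector<int>& v,
                    const std::unordered_set<int>& deadRows,
                    const std::unordered_set<int>& deadCols) {
    std::size_t out = 0;  // next write position
    for (std::size_t in = 0; in < r.size(); ++in) {
        if (deadRows.count(r[in]) || deadCols.count(c[in]))
            continue;  // entry is being deleted: don't copy it down
        r[out] = r[in];
        c[out] = c[in];
        v[out] = v[in];
        ++out;
    }
    r.resize(out);  // shrink to the surviving entries (or keep a count
    c.resize(out);  // of valid entries instead, as described above)
    v.resize(out);
}
```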

What is the cheapest way to sort a permutation in C++?

The problem is:
You have to sort an array in ascending order (a permutation: the numbers from 1 to N in random order) using a series of swaps. Every swap has a price, and there are 5 types of prices. Write a program that sorts the given array for the smallest price.
There are two kinds of prices: priceByValue and priceByIndex. Each kind is given as an N*N two-dimensional array. Example of how to access prices:
You want to swap the 2nd and the 5th elements from the permutation with values of 4 and 7. The price for this swap will be priceByValue[4][7] + priceByIndex[2][5].
Indexes of all arrays are counted from 1 (not from 0) so that all of the prices are reachable (the permutation elements' values start from 1): priceByIndex[2][5] would actually be priceByIndex[1][4] in code. Moreover, the order of the indexes by which you access prices doesn't matter: priceByIndex[i][j] = priceByIndex[j][i], and priceByIndex[i][i] is always 0 (the same holds for priceByValue).
Types of prices:
Price[i][j] = 0;
Price[i][j] = random number between 1 and 4*N;
Price[i][j] = |i-j|*6;
Price[i][j] = sqrt(|i-j|) * sqrt(N) * 15/4;
Price[i][j] = max(i,j)*3;
When you access prices by index, i and j are the indexes of the elements you want to swap in the original array; when you access prices by value, i and j are the values of those elements. (Both are always counted from 1.)
Things given:
N - an integer from 1 to 400, Mixed array, Type of priceByIndex, priceByIndex matrix, Type of priceByValue, priceByValue matrix. (all elements of a matrix are from the given type)
Things that should 'appear on the screen': the number of swaps, all swaps (by index only - "2 5" means that you have swapped the 2nd and 5th elements), and the total price.
As I am still learning C++, I was wondering what the most effective way is to find the sorting sequence with the smallest cost.
There might be a way to enumerate series of swaps that result in a sorted array and pick the one with the smallest price; my intuition is to sort the array by swapping elements that are close by both value and index, but I don't know how to turn that into code. I would be very grateful if someone could show me how to find the cheapest sort. Thank you in advance!
More: this problem might have no exact solution; I am just trying to get a result close to the ideal.
Dynamic Programming!
Think of the problem as a graph. Each of the N-factorial permutations represents a graph vertex, and the allowed swaps are just arcs between vertices. The price-tag of a swap is just the weight on the arc.
When you look at the problem this way, it can be solved with Dijkstra's algorithm for finding the lowest-cost path through a graph from one vertex to another.
This is also called the single-pair shortest path problem.
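A minimal sketch of that Dijkstra formulation. The cost tables are placeholders (I use the |i-j|*6 index price type and ignore the value price; plug in the real matrices), and since the state space is all N! permutations this is only feasible for very small N. For the full problem you would also record predecessors to print the actual swap sequence:

```cpp
#include <cstdlib>
#include <functional>
#include <iostream>
#include <map>
#include <queue>
#include <utility>
#include <vector>

using Perm = std::vector<int>;

int main() {
    const int N = 4;
    const Perm start = {3, 1, 4, 2};  // the mixed array
    const Perm goal = {1, 2, 3, 4};

    // Placeholder swap cost: index price type |i-j|*6 only. With real
    // data this would be priceByValue[p[i]][p[j]] + priceByIndex[i][j]
    // (adjusted for the 1-based counting in the problem statement).
    auto swapCost = [](const Perm& p, int i, int j) {
        (void)p;
        return std::abs(i - j) * 6;
    };

    // Dijkstra over permutation states: vertices are permutations,
    // arcs are single swaps weighted by their price.
    std::map<Perm, int> dist;
    using Entry = std::pair<int, Perm>;  // (cost so far, state)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pq;
    dist[start] = 0;
    pq.push({0, start});

    while (!pq.empty()) {
        auto [d, p] = pq.top();
        pq.pop();
        if (d > dist[p]) continue;  // stale queue entry
        if (p == goal) {
            std::cout << "cheapest sort costs " << d << '\n';
            break;
        }
        for (int i = 0; i < N; ++i)
            for (int j = i + 1; j < N; ++j) {
                Perm q = p;
                std::swap(q[i], q[j]);
                int nd = d + swapCost(p, i, j);
                auto it = dist.find(q);
                if (it == dist.end() || nd < it->second) {
                    dist[q] = nd;
                    pq.push({nd, q});
                }
            }
    }
}
```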
You can use an algorithm for sorting an array in lexicographical order and modify it so that it fits your needs (you did not mention the sorting criteria, i.e. the desired result such as least value first). There are multiple algorithms available for this, e.g. quicksort.
A code example is at https://www.geeksforgeeks.org/lexicographic-permutations-of-string/

Best data structure for finding maximum in a 2d matrix with update queries

I have a 2d matrix of doubles. My task is to find the maximum element of the matrix at any point.
Queries will be of 2 types:
Update query: In this query, 2n - 1 elements will be updated, i.e. all elements of row i and column i (by update I mean changing the element; it can be anything: an increment, a decrement, or an arbitrary new value).
Maximum query: Return the maximum element of the 2d array.
I came up with a solution using binary heaps. My idea is to keep a max-heap of n^2 elements implemented using an array, and maintain another array of size n^2 to keep the indices of heap elements. So the (i,j)th element of the matrix, which is just the (i*n + j)th element of the flat array, stores the index of its position in the heap.
So this way, the 2n-1 updates are handled in O((2n-1) log(n^2)) = O(n log n) time, and a maximum query is answered in O(1) time.
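A sketch of that indexed max-heap, assuming the matrix is flattened into a vector of doubles (the struct name and layout are mine):

```cpp
#include <utility>
#include <vector>

// Max-heap over the n*n matrix entries with back-pointers: pos[f]
// always gives the heap slot of flat cell f = i*n + j, so a cell's
// key can be changed in O(log n^2).
struct IndexedMaxHeap {
    std::vector<double> val;  // heap-ordered values
    std::vector<int> cell;    // heap slot -> flat matrix index
    std::vector<int> pos;     // flat matrix index -> heap slot

    explicit IndexedMaxHeap(const std::vector<double>& flat)
        : val(flat), cell(flat.size()), pos(flat.size()) {
        for (int i = 0; i < (int)flat.size(); ++i) cell[i] = pos[i] = i;
        for (int i = (int)flat.size() / 2 - 1; i >= 0; --i) siftDown(i);
    }

    double max() const { return val[0]; }  // O(1) maximum query

    void update(int flatIndex, double newValue) {  // O(log n^2)
        int s = pos[flatIndex];
        bool up = newValue > val[s];
        val[s] = newValue;
        if (up) siftUp(s); else siftDown(s);
    }

private:
    void swapSlots(int a, int b) {  // keep pos[] in sync with the heap
        std::swap(val[a], val[b]);
        std::swap(cell[a], cell[b]);
        pos[cell[a]] = a;
        pos[cell[b]] = b;
    }
    void siftUp(int s) {
        while (s > 0 && val[(s - 1) / 2] < val[s]) {
            swapSlots(s, (s - 1) / 2);
            s = (s - 1) / 2;
        }
    }
    void siftDown(int s) {
        for (;;) {
            int l = 2 * s + 1, r = l + 1, m = s, n = (int)val.size();
            if (l < n && val[l] > val[m]) m = l;
            if (r < n && val[r] > val[m]) m = r;
            if (m == s) return;
            swapSlots(s, m);
            s = m;
        }
    }
};
```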
I wasn't able to use the STL implementation because I have to keep track of the heap elements, i.e. upon an update query I need to know which heap elements to update. std::priority_queue also doesn't support changing keys.
How do I improve the update query time? Is there some other data structure which can handle these operations faster?
I'd use an STL vector of the indices i*n+j, kept sorted with your own compare function that looks up the matrix values. Re-sorting this n^2-sized array after an update is O(n^2 log n^2). Querying the maximum is just reading the first element of the vector.
Edit
If you're interested only in the maximum value, you can cache its position (i,j). When the matrix is updated, the vector needs to be re-sorted only if this cached position changes.
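A sketch of that caching idea, with an extra assumption of mine: skip the sorted vector entirely, rescan the whole matrix only when the update overwrites the cached maximum itself, and otherwise just compare the 2n-1 changed entries against it:

```cpp
#include <vector>

struct CachedMax {
    std::vector<std::vector<double>> m;  // square n x n matrix
    int mi = 0, mj = 0;                  // cached position of the maximum

    double max() const { return m[mi][mj]; }  // O(1)

    // Call after row i and column i have been overwritten.
    void updated(int i) {
        int n = (int)m.size();
        if (mi == i || mj == i) {
            // The old maximum was among the overwritten cells, so any
            // cell could now be the maximum: full O(n^2) rescan.
            mi = mj = 0;
            for (int a = 0; a < n; ++a)
                for (int b = 0; b < n; ++b)
                    if (m[a][b] > m[mi][mj]) { mi = a; mj = b; }
        } else {
            // Old maximum still valid; only the 2n-1 new cells can
            // beat it: O(n) check.
            for (int k = 0; k < n; ++k) {
                if (m[i][k] > m[mi][mj]) { mi = i; mj = k; }
                if (m[k][i] > m[mi][mj]) { mi = k; mj = i; }
            }
        }
    }
};
```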

Hashing algorithm for pair of integers [duplicate]

The problem: storing the dynamic adjacency list of a graph in a file while retaining O(1) algorithmic complexity of the operations.
I am trying to store a dynamic bidirectional graph in a file (or files). Both nodes and edges can be added and removed, and the operations must be O(1). My current design is:
File 1 - Nodes
Stores two integers per node (inserts are appends and removals use free list):
number of incoming edges
number of outgoing edges
File 2 - Edges
Stores 4 integers per edge (inserts are appends and removals use free list + swap with last edge for a node to update its new index):
from node (index into File 1)
from index (i.e. third incoming edge)
to node (index into File 1)
to index (i.e. second outgoing edge)
File 3 - Links
Serves as an open-addressed hash table of the locations of edges in File 2. Basically, when you read a node from File 1 you know it has x incoming edges and y outgoing edges. With that you can go to File 3 to get the position of each of these edges in File 2. The key thus being:
index of node in File 1 (i.e. 0 for first node, 1 for second node)
0 <= index of edge < number of outgoing/incoming edges
Example of File 3 keys if represented as chained hash table (that is unfortunately not suitable for files but would not require hashing...):
Keys (index of node from `File 1`, plus 0 <= index < number of edges from `File 1`; not actually stored)
1 | 0 1 2
2 | 0 1
3 |
4 | 0
5 | 0 1 2
I am using qHash and QPair to hash these at the moment; however, the number of collisions is very high, especially when I compare it to single-int hashing, which is very efficient with qHash. Since the values stored are indices into yet another file, probing is rather expensive, so I would like to cut the number of collisions down.
Is there a specialized hashing algorithm or approach for a pair of ints that could perform better in this situation? Or a different approach that would avoid this problem, e.g. a way to implement a chained hash table in a file (I can only think of using buffers, but I believe that would be overkill for sparse graphs like mine)?
If you read through the comments on this answer, they claim qHash of an int just returns that int unchanged (which is a fairly common way to hash integers for undemanding use in in-memory hash tables). So, using a strong general-purpose hash function will achieve a dramatic reduction in collisions, though you may lose out on some incidental caching benefits of having nearby keys more likely to hash to the same area on disk, so do measure rather than taking it for granted that fewer collisions means better performance.
I also suggest trying boost::hash_combine to create an overall hash from multiple hash values (just using + or XOR is a very bad idea).
Then, if you're reading from disk, there's probably some kind of page size - e.g. 4k, 8k - which you'll have to read in to access any data anywhere on that page, so if there's a collision it's still better to look elsewhere on the already-loaded page rather than waiting for another page to load from disk. Simple linear probing manages that a lot of the time, but you can improve on it further by wrapping back to the start of the page, ensuring you've searched all of it before probing elsewhere.
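A sketch of boost::hash_combine's mixing step for a pair of ints, written out without the Boost dependency (the constant 0x9e3779b9 is the one Boost uses):

```cpp
#include <cstddef>
#include <functional>

// boost::hash_combine's mixing formula: the shifts and the
// golden-ratio constant spread each input's bits across the seed.
inline std::size_t hashCombine(std::size_t seed, std::size_t v) {
    seed ^= v + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    return seed;
}

// Hash a pair of ints: hash each component, then mix.
inline std::size_t hashPair(int a, int b) {
    std::size_t h = std::hash<int>{}(a);
    return hashCombine(h, std::hash<int>{}(b));
}
```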

Perfect hashing function in a hash table implementation of a sparse matrix class

I'm currently implementing a sparse matrix for my matrix library - it will be a hash table. I already implemented a dense matrix as a nested vector, and since I'm doing it just to learn new stuff, I decided that my matrices will be multi-dimensional (not just a 2D table of numbers, but also cubes, tesseracts, etc).
I use an index type which holds n numbers of type size_t, n being a number of dimensions for this particular index. Index of dimension n may be used only as an address of an element in a matrix of appropriate dimension. It is simply a tuple with implicit constructor for easy indexing, like Matrix[{1,2,3}].
My question is centered on the hashing function I plan to use for my sparse matrix implementation. I think the function is always minimal, but it is perfect only up to a certain point: the point of size_t overflow, or of overflow in an intermediate operation of the hashing function (the intermediates are actually unsigned long long). Sparse matrices have huge boundaries, so overflow is practically guaranteed at some point (see below).
What the hashing function does is assign consecutive numbers to matrix elements as follows:
[1 2 3 4 5 6 7 8 ...]^T //transposed 1-dimensional matrix
1 4 7
2 5 8
3 6 9 //2-dimensional matrix
and so on. Unfortunately, I'm unable to show the ordering for higher-order matrices, but I hope you get the idea - the value increases top to bottom, left to right, back to front (for cube matrices), etc.
The hashing function is defined like this:
value = x1 + d1*x2 + d1*d2*x3 + d1*d2*d3*x4 + ... + d1*d2*...*d(n-1)*xn
where:
x1...xn are the index members - row, column, height, etc. - {x1, x2, x3, ..., xn}
d1...d(n-1) are the matrix boundary dimensions - one past the end of the matrix in the appropriate direction
I'm actually using a recursive form of this function (simple factoring, but complexity becomes O(n) instead of O(n^2)):
value = x1 + d1*(x2 + d2*(x3 + d3*(... (x(n-1) + d(n-1)*xn) ...)))
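A sketch of that Horner-style evaluation (names are mine: x holds the index members, d the boundary dimensions, with d[k] being one past the end of dimension k+1):

```cpp
#include <cstddef>
#include <vector>

// value = x1 + d1*(x2 + d2*(x3 + ...)): O(n) multiplications.
// Overflows for huge boundaries exactly as described above.
// E.g. flatten({1, 2}, {3}) == 1 + 3*2 == 7, which is the cell
// holding 8 in the 2-dimensional example matrix shown earlier.
std::size_t flatten(const std::vector<std::size_t>& x,
                    const std::vector<std::size_t>& d) {
    std::size_t value = x.back();
    for (std::size_t k = x.size() - 1; k-- > 0;)
        value = x[k] + d[k] * value;
    return value;
}
```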
I'm assuming that elements will be distributed randomly (uniformly) across the matrix, but the bucket number is hash(key) mod numberOfBuckets, so collisions are practically guaranteed despite the fact that the function is perfect.
Is there any way to exploit the features of my hash function in order to minimize collisions?
How should I choose the load factor? Should I leave the choice to the user? Is there any good default value for this case?
Is a hash table actually a good solution for this problem? Are there other data structures that guarantee average O(1) complexity, given that I can roll an index into a number and a number back into an index (mind the size_t overflow)? I am aware of different ways to store a sparse matrix, but I want to try the DOK (dictionary of keys) way first.

Fast Algorithm for finding largest values in 2d array

I have a 2D array (an image actually) that is size N x N. I need to find the indices of the M largest values in the array ( M << N x N) . Linearized index or the 2D coords are both fine. The array must remain intact (since it's an image). I can make a copy for scratch, but sorting the array will bugger up the indices.
I'm fine with doing a full pass over the array (ie. O(N^2) is fine). Anyone have a good algorithm for doing this as efficiently as possible?
Selection is sorting's austere sister (repeat this ten times in a row). Selection algorithms are less known than sort algorithms, but nonetheless useful.
You can't do better than O(N^2) (in N) here, since nothing lets you avoid visiting each element of the array.
A good approach is to keep a priority queue of the M largest elements seen so far. This makes the whole scan O(N x N x log M).
You traverse the array, enqueuing (element, index) pairs as you go. The queue keeps its elements sorted by first component.
Once the queue has M elements, instead of enqueuing you do the following (see the sketch after these steps):
Query the min element of the queue
If the current element of the array is greater, insert it into the queue and discard the min element of the queue
Else do nothing.
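A minimal sketch of that scan with std::priority_queue; std::greater makes it a min-heap, so the smallest of the M best sits on top, ready to be compared and discarded:

```cpp
#include <functional>
#include <iostream>
#include <queue>
#include <vector>

int main() {
    const int N = 4, M = 3;
    std::vector<std::vector<double>> img = {{0.1, 0.9, 0.2, 0.4},
                                            {0.8, 0.3, 0.7, 0.5},
                                            {0.2, 0.6, 0.1, 0.95},
                                            {0.4, 0.2, 0.85, 0.3}};

    using Entry = std::pair<double, std::pair<int, int>>;  // (value, (i,j))
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> q;

    for (int i = 0; i < N; ++i)          // row-major scan: contiguous reads
        for (int j = 0; j < N; ++j) {
            if ((int)q.size() < M)
                q.push({img[i][j], {i, j}});
            else if (img[i][j] > q.top().first) {
                q.pop();                 // discard the current minimum
                q.push({img[i][j], {i, j}});
            }
        }

    while (!q.empty()) {                 // the M largest, smallest first
        std::cout << q.top().first << " at (" << q.top().second.first
                  << ',' << q.top().second.second << ")\n";
        q.pop();
    }
}
```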
If M is bigger, sorting the array is preferable.
NOTE: @Andy Finkenstadt makes a good point (in the comments to your question): you definitely should traverse your array in the direction of data locality, making sure that you read memory contiguously.
Also, this is trivially parallelizable; the only non-parallelizable part is merging the queues when joining the subprocesses.
You could copy the array into a single-dimensional array of tuples (value, original X, original Y) and build a basic heap out of it in O(n) time, provided you implement the heap as an array.
You could then retrieve the M largest tuples in O(M lg n) time and reference their original x and y from the tuple.
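A sketch of that approach with the standard heap algorithms; std::make_heap is O(n) and each std::pop_heap is O(log n), and tuples compare lexicographically, so the value field drives the ordering:

```cpp
#include <algorithm>
#include <iostream>
#include <tuple>
#include <vector>

int main() {
    const int N = 3, M = 2;
    std::vector<std::vector<double>> img = {
        {0.1, 0.9, 0.2}, {0.8, 0.3, 0.7}, {0.2, 0.6, 0.95}};

    // Copy into (value, original X, original Y) tuples.
    std::vector<std::tuple<double, int, int>> heap;
    for (int x = 0; x < N; ++x)
        for (int y = 0; y < N; ++y)
            heap.emplace_back(img[x][y], x, y);

    std::make_heap(heap.begin(), heap.end());     // O(n)

    for (int k = 0; k < M; ++k) {                 // O(M log n) total
        std::pop_heap(heap.begin(), heap.end());  // max moves to the back
        auto [v, x, y] = heap.back();
        heap.pop_back();
        std::cout << v << " at (" << x << ',' << y << ")\n";
    }
}
```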
If you are going to make a copy of the input array in order to do a sort, that's way worse than just walking linearly through the whole thing to pick out numbers.
So the question is: how big is your M? If it is small, you can store the results (i.e. structs with the 2D indexes and values) in a simple array or a vector. That minimizes heap allocations, but whenever you find a value larger than what's in your vector, you'll have to shift things around.
If you expect M to get really large, then you may need a better data structure, like a binary tree (std::set) or a sorted std::deque. std::set reduces the number of times elements must be shifted in memory, while a std::deque still does some shifting but significantly reduces the number of trips to the heap allocator, which may give you better performance.
Your problem doesn't use the 2 dimensions in any interesting way; it is easier to consider the equivalent problem on a 1d array.
There are 2 main ways to solve this problem:
Maintain a set of the M largest elements and iterate through the array (using a heap lets you do this efficiently).
This is simple and is probably better in your case (M << N)
Use selection (the following algorithm is an adaptation of quicksort; see the std::nth_element sketch after this list):
Create an auxiliary array containing the indexes [1..N].
Choose an arbitrary index (and corresponding value), and partition the index array so that indexes corresponding to smaller elements go to the left and those of bigger elements go to the right.
Repeat the process binary-search style until you narrow down to the M largest elements.
This is good for cases with large M. If you want to avoid the worst-case issues (which plain quicksort also has), look at more advanced algorithms, like median-of-medians selection.
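std::nth_element already implements this partition-and-recurse selection; a sketch over an auxiliary index array, leaving the image itself untouched:

```cpp
#include <algorithm>
#include <iostream>
#include <numeric>
#include <vector>

int main() {
    const int N = 3, M = 4;
    std::vector<double> img = {0.1, 0.9, 0.2,    // row 0
                               0.8, 0.3, 0.7,    // row 1
                               0.2, 0.6, 0.95};  // row 2, flattened N x N

    std::vector<int> idx(N * N);                 // auxiliary index array
    std::iota(idx.begin(), idx.end(), 0);

    // Partition so the M largest (by image value) come first;
    // average O(N^2), like one quickselect pass.
    std::nth_element(idx.begin(), idx.begin() + M, idx.end(),
                     [&](int a, int b) { return img[a] > img[b]; });

    for (int k = 0; k < M; ++k)                  // unordered among themselves
        std::cout << img[idx[k]] << " at (" << idx[k] / N << ','
                  << idx[k] % N << ")\n";
}
```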
How many times do you search for the largest value from the array?
If you only search 1 time, then just scan through it keeping the M largest ones.
If you do it many times, just insert the values into a sorted list (probably best implemented as a balanced tree).