DAG algorithms and Latin square problems

Every Latin square corresponds to a directed acyclic graph whose vertices are arranged in a lattice and whose edges indicate order (<). For example:
[Example image omitted; source: enjoysudoku.com]
I'm interested in determining which Latin squares have unique orderings, that is, whose corresponding DAG permits no other valid vertex labelling.
A valid labelling maintains uniqueness in rows and columns, and also satisfies the order relationships specified by the edges.
I can't find any references to labelling problems of this particular kind. Does this imply that the DAG structure is not particularly useful for this kind of analysis?
This relates to a study of uniqueness properties with respect to Futoshiki puzzles.
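For concreteness, here is a minimal sketch of the validity check I have in mind (my own ad-hoc encoding, not taken from any reference): the grid holds values 1..n, and each DAG edge says one cell must be strictly less than another.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// A cell of the n x n grid, identified by (row, column).
using Cell = std::pair<std::size_t, std::size_t>;

// Returns true if `labels` is a Latin square (values 1..n, each exactly once
// per row and column) and every order constraint label(u) < label(v) holds.
bool isValidLabelling(const std::vector<std::vector<int>>& labels,
                      const std::vector<std::pair<Cell, Cell>>& lessThan) {
    const std::size_t n = labels.size();
    for (std::size_t i = 0; i < n; ++i) {
        std::vector<bool> inRow(n + 1, false), inCol(n + 1, false);
        for (std::size_t j = 0; j < n; ++j) {
            const int r = labels[i][j], c = labels[j][i];   // row i / column i
            if (r < 1 || r > static_cast<int>(n) || inRow[r]) return false;
            if (c < 1 || c > static_cast<int>(n) || inCol[c]) return false;
            inRow[r] = inCol[c] = true;
        }
    }
    for (const auto& [u, v] : lessThan)                      // the DAG edges
        if (labels[u.first][u.second] >= labels[v.first][v.second]) return false;
    return true;
}
```

A brute-force uniqueness test would then enumerate candidate labellings and count how many pass this check.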


How does word2vec learn word relations?

Which part of the algorithm specifically gives the embeddings the king - boy + girl = queen ability? Did this just happen by accident?
Edit:
Take CBOW as an example. I understand that they encode the words with trainable embeddings instead of one-hot vectors (whereas with one-hot encoding the representation itself is not trainable), and that the output is a one-hot vector for the target word. At some point they average all the surrounding word embeddings and then stack a few layers on top. So did they discover the mentioned property by surprise, or is there a training procedure or network structure that gives the embeddings that property?
The algorithm simply trains (optimizes) a shallow neural-network model to be good at predicting words from other nearby words.
That's the only internal training goal – subject to the neural network's constraints on how the words are represented (N floating-point dimensions), or combined with the model's internal weights to render an interpretable prediction (forward propagation rules).
There's no other 'coaching' about what words 'should' do in relation to each other. All words are still just opaque tokens to word2vec. It doesn't even consider their letters: the whole-token is just a lookup key for a whole-vector. (Though, the word2vec variant FastText varies that somewhat by also training vectors for subwords – & thus can vaguely simulate the same intuitions that people have for word-roots/suffixes/etc.)
The interesting 'neighborhoods' of nearby words, and relative orientations that align human-interpretable aspects to vague directions in the high-dimensional coordinate space, fall out of the prediction task. And those relative orientations are what gives rise to the surprising "analogical arithmetic" you're asking about.
Internally, there's a tiny internal training cycle applied over and over: "nudge this word-vector to be slightly better at predicting these neighboring words". Then, repeat with another word, and other neighbors. And again & again, millions of times, each time only looking at a tiny subset of the data.
But the updates that contradict each other cancel out, and those that represent reliable patterns in the source training texts reinforce each other.
From one perspective, it's essentially trying to "compress" a giant vocabulary – tens of thousands to millions of unique words – into a smaller N-dimensional representation, usually 100-400 dimensions when you have enough training data. The dimensional values that become as good as possible (but never necessarily great) at predicting neighbors turn out to exhibit the other desirable positionings, too.
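To make that "analogical arithmetic" concrete, here is a toy sketch with hand-made 3-dimensional vectors (real word2vec vectors are learned and have hundreds of dimensions; the numbers below are invented purely for illustration):

```cpp
#include <cmath>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

using Vec = std::vector<float>;

// Cosine similarity between two vectors of equal length.
float cosine(const Vec& a, const Vec& b) {
    float dot = 0, na = 0, nb = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb));
}

int main() {
    // Made-up embeddings, just to show the arithmetic.
    std::unordered_map<std::string, Vec> emb = {
        {"king",  {0.9f, 0.8f, 0.1f}}, {"queen", {0.9f, 0.1f, 0.8f}},
        {"boy",   {0.1f, 0.9f, 0.1f}}, {"girl",  {0.1f, 0.1f, 0.9f}},
    };
    // king - boy + girl: subtraction removes what "king" shares with "boy",
    // addition re-applies what "girl" contributes.
    Vec target(3);
    for (int i = 0; i < 3; ++i)
        target[i] = emb["king"][i] - emb["boy"][i] + emb["girl"][i];
    // Nearest remaining word by cosine similarity (the query words themselves
    // are excluded, as the usual analogy evaluation does).
    std::string best;
    float bestSim = -2.0f;
    for (const auto& [word, v] : emb) {
        if (word == "king" || word == "boy" || word == "girl") continue;
        const float s = cosine(target, v);
        if (s > bestSim) { bestSim = s; best = word; }
    }
    std::cout << "king - boy + girl ~= " << best << "\n";   // "queen"
}
```

Nothing forces this to work in a trained model; it works to the extent that the training nudges described above have arranged the vectors so that such offsets line up.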

Removing edges and splitting a connected component if necessary (C++, Boost)

I have a large graph (the number of vertices can be in the range of 50,000-100,000, the adjacency matrix need not be sparse). Edges in the graph can be removed/added, and I want to update the resulting connected components structure after such changes. I have implemented this in a straightforward fashion with a BFS search myself in C++ (keeping track of unordered_maps of vertices to connected component ids and updating them), but I am wondering if there is a more efficient way to do this using Boost's graph library.
I was able to find some similar questions here on Stack Overflow, and came to know of filtered_graph (and the connected_components function), but I am worried about the overhead involved in creating such filtered instances every time we add or remove an edge. (Or should this be a concern at all?!)
I believe your solution is essentially the best possible. If you are only allowed to add edges, then the algorithm can be improved by keeping track of connected components in terms of the vertices they include; when an edge is added, you check whether its two endpoints belong to different connected components, and if so you merge those components. With a disjoint-set (union-find) structure this brings the cost down from a full BFS after every change to nearly constant amortized time per edge added. However, if you are allowed to both insert and delete edges, I don't see any asymptotically faster way to solve the problem than what you described.
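A minimal sketch of that insert-only approach with union-find (the class and member names below are just illustrative):

```cpp
#include <numeric>
#include <utility>
#include <vector>

// Disjoint-set (union-find) over vertices 0..n-1. Each edge insertion either
// merges two components or does nothing; deletions are NOT supported here.
struct DisjointSet {
    std::vector<int> parent, rank_;

    explicit DisjointSet(int n) : parent(n), rank_(n, 0) {
        std::iota(parent.begin(), parent.end(), 0);   // every vertex starts as its own component
    }

    int find(int v) {                                 // with path compression
        return parent[v] == v ? v : parent[v] = find(parent[v]);
    }

    // Returns true if adding edge (a, b) merged two components.
    bool addEdge(int a, int b) {
        a = find(a); b = find(b);
        if (a == b) return false;
        if (rank_[a] < rank_[b]) std::swap(a, b);     // union by rank
        parent[b] = a;
        if (rank_[a] == rank_[b]) ++rank_[a];
        return true;
    }
};
```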
There are algorithms for maintaining connectivity under edge insertions and deletions that are faster than recalculating. This is called "dynamic graph connectivity". Here is a paper on experimental evaluations (some newer theoretical results have been found since, but it is unclear whether they have practical relevance).

Which Data Mining task to retrieve a unique instance

I'm working with data mining, and I'm familiar with classification, clustering and regression tasks. In classification, one can have a lot of instances (e.g. animals), their features (e.g. number of legs) and a class (e.g. mammal, reptile).
But what I need to accomplish is, given some attributes (including the class attribute), to determine which unique instance I'm referring to (e.g. giraffe). I can supply all the attributes I know, and if the model can't figure out the answer, it can ask for another attribute – analogous to a game of 20 questions.
So, my question is: does this specific task have a name? It seems similar to classification where the class is unique to each instance, but that wouldn't fit the usual training models, except perhaps a decision tree.
Your inputs, denoted features in machine learning, are tuples of species (what I think you mean by "instance"), and physical attributes. Your outputs are broader taxonomic ranks. Thus, assigning one to each input is a classification problem. Since your features are incomplete, you want to perform ... classification with incomplete data, or impute missing features. Searching for these terms will give you enough leads.
(And the other task is properly called clustering.)
IMHO you are simply looking for a decision tree.
Except that you don't train it on your categorical attribute (your "class"), but on the individual instance label.
You need to choose the splitting measure carefully, though, as many measures are driven by class sizes – and all of your classes now have size 1. Finding a good split may involve planning some splits ahead to get a well-balanced tree. A random-forest-like approach may help improve the chance of finding a good tree.
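A minimal sketch of that idea (the animals, attributes, and names below are invented for illustration): at each node, greedily pick the attribute that splits the remaining candidates most evenly, and keep asking until one instance remains.

```cpp
#include <cstddef>
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

// A hypothetical instance: a unique name plus yes/no attributes keyed by question text.
struct Instance {
    std::string name;
    std::map<std::string, bool> attrs;
};

// A node either asks a question or names a single instance (leaf).
struct Node {
    std::string question;            // empty at a leaf
    std::string answer;              // instance name at a leaf
    std::unique_ptr<Node> yes, no;
};

// Pick the attribute whose yes/no split over the remaining instances is most
// balanced; this keeps the tree close to the ideal depth of log2(n) questions.
std::string bestSplit(const std::vector<Instance>& items,
                      const std::vector<std::string>& attrs) {
    std::string best;
    std::size_t bestImbalance = items.size() + 1;
    for (const auto& a : attrs) {
        std::size_t yes = 0;
        for (const auto& it : items) yes += it.attrs.at(a) ? 1 : 0;
        const std::size_t no = items.size() - yes;
        const std::size_t imbalance = yes > no ? yes - no : no - yes;
        if (yes > 0 && no > 0 && imbalance < bestImbalance) {
            bestImbalance = imbalance;
            best = a;
        }
    }
    return best;   // empty if no attribute separates the remaining instances
}

std::unique_ptr<Node> build(const std::vector<Instance>& items,
                            const std::vector<std::string>& attrs) {
    auto node = std::make_unique<Node>();
    node->question = items.size() > 1 ? bestSplit(items, attrs) : std::string();
    if (node->question.empty()) {    // single (or indistinguishable) instance left
        node->answer = items.front().name;
        return node;
    }
    std::vector<Instance> yesSide, noSide;
    for (const auto& it : items)
        (it.attrs.at(node->question) ? yesSide : noSide).push_back(it);
    node->yes = build(yesSide, attrs);
    node->no  = build(noSide, attrs);
    return node;
}

int main() {
    const std::vector<std::string> attrs = {"has four legs", "has a long neck", "lays eggs"};
    const std::vector<Instance> zoo = {
        {"giraffe", {{"has four legs", true},  {"has a long neck", true},  {"lays eggs", false}}},
        {"dog",     {{"has four legs", true},  {"has a long neck", false}, {"lays eggs", false}}},
        {"snake",   {{"has four legs", false}, {"has a long neck", false}, {"lays eggs", true}}},
        {"ostrich", {{"has four legs", false}, {"has a long neck", true},  {"lays eggs", true}}},
    };
    const auto root = build(zoo, attrs);
    const Node* cur = root.get();    // 20-questions style loop
    while (!cur->question.empty()) {
        std::cout << cur->question << "? (y/n) ";
        char c;
        std::cin >> c;
        cur = (c == 'y') ? cur->yes.get() : cur->no.get();
    }
    std::cout << "It is a " << cur->answer << "\n";
}
```

With real data you would also need to handle non-boolean attributes and instances that share identical feature values.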

Clarification needed about min/sim hashing + LSH

I have a reasonable understanding of a technique to detect similar documents that consists of first computing their minhash signatures (from their shingles, or n-grams), and then using an LSH-based algorithm to cluster them efficiently (i.e. avoiding the quadratic complexity that a naive exhaustive pairwise search would entail).
What I'm trying to do is bridge three different algorithms, which I think are all related to this minhash + LSH framework, but in non-obvious ways:
(1) The algorithm sketched in Section 3.4.3 of Chapter 3 of the book Mining of Massive Datasets (Rajaraman and Ullman), which seems to be the canonical description of minhashing
(2) Ryan Moulton's Simple Simhashing article
(3) Charikar's so-called SimHash algorithm, described in this article
I find this confusing, because my assumption is that although (2) uses the term "simhashing", it's actually doing minhashing in a way similar to (1), but with the crucial difference that a cluster can only be represented by a single signature (even though multiple hash functions might be involved), whereas with (1) two documents have more chances of being found similar, because the signatures can collide in multiple "bands". (3) seems like a different beast altogether, in that the signatures are compared in terms of their Hamming distance, and the LSH technique involves multiple sortings of the signatures instead of banding them. I also saw (somewhere else) that this last technique can incorporate a notion of weighting, which can be used to put more emphasis on certain document parts, and which seems to be lacking in (1) and (2).
So at last, my two questions:
(a) Is there a (satisfying) way in which to bridge those three algorithms?
(b) Is there a way to import this notion of weighting from (3) into (1)?
Article 2 is actually discussing minhash, but has erroneously called it simhash. That's probably why it is now deleted (it's archived here). Also, confusingly, it talks about "concatenating" multiple minhashes, which as you rightly observe reduces the chance of finding two similar documents. It seems to prescribe an approach that yields only a single "band", which will give very poor results.
Can the algorithms be bridged/combined?
Probably not. To see why, you should understand what the properties of the different hashes are, and how they are used to avoid O(n²) comparisons between documents.
Properties of minhash and simhash:
Essentially, minhash generates multiple hashes per document, and when there are two similar documents it is likely that a subset of these hashes will be identical. Simhash generates a single hash per document, and where there are two similar documents it is likely that their simhashes will be similar (having a small Hamming distance).
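For illustration, here is a minimal minhash sketch (a simplified construction, not the exact scheme from the MMDS chapter): each of the k signature slots is the minimum, over all shingles of the document, of a different randomized hash of the shingle, and two signatures agree in a slot roughly with probability equal to the documents' Jaccard similarity.

```cpp
#include <cstdint>
#include <functional>
#include <limits>
#include <random>
#include <string>
#include <unordered_set>
#include <vector>

// Compute a k-slot minhash signature for a document given as a set of shingles.
std::vector<std::uint64_t> minhashSignature(
        const std::unordered_set<std::string>& shingles,
        std::size_t k, std::uint64_t seed = 42) {
    std::mt19937_64 rng(seed);
    std::vector<std::uint64_t> sig(k, std::numeric_limits<std::uint64_t>::max());
    for (std::size_t i = 0; i < k; ++i) {
        const std::uint64_t a = rng() | 1, b = rng();   // parameters of the i-th hash
        for (const auto& s : shingles) {
            const std::uint64_t h = a * std::hash<std::string>{}(s) + b;
            if (h < sig[i]) sig[i] = h;                 // keep the minimum per slot
        }
    }
    return sig;
}

// Estimated Jaccard similarity: the fraction of slots where the signatures agree.
double estimatedSimilarity(const std::vector<std::uint64_t>& x,
                           const std::vector<std::uint64_t>& y) {
    std::size_t same = 0;
    for (std::size_t i = 0; i < x.size(); ++i) same += (x[i] == y[i]);
    return static_cast<double>(same) / x.size();
}
```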
How they solve the O(n²) problem:
With minhash you index all hashes to the documents that contain them; so if you are storing 100 hashes per document, then for each new incoming document you need to look up each of its 100 hashes in your index and find all documents that share at least (e.g.) 50% of them. This could mean building large temporary tallies, as hundreds of thousands of documents could share at least one hash.
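A rough sketch of that index (the names and the shared-hash threshold are just illustrative):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Inverted index: hash value -> documents containing it. For a new document we
// tally how many hashes it shares with each indexed document and keep those
// above a threshold (e.g. 50 of 100).
struct MinhashIndex {
    std::unordered_map<std::uint64_t, std::vector<int>> docsByHash;

    void add(int docId, const std::vector<std::uint64_t>& sig) {
        for (const auto h : sig) docsByHash[h].push_back(docId);
    }

    // Returns candidate docIds sharing at least minShared signature hashes.
    std::vector<int> similar(const std::vector<std::uint64_t>& sig,
                             std::size_t minShared) const {
        std::unordered_map<int, std::size_t> tally;   // the "large temporary tallies"
        for (const auto h : sig) {
            const auto it = docsByHash.find(h);
            if (it == docsByHash.end()) continue;
            for (const int d : it->second) ++tally[d];
        }
        std::vector<int> result;
        for (const auto& [d, count] : tally)
            if (count >= minShared) result.push_back(d);
        return result;
    }
};
```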
With simhash there is a clever technique of storing each document's hash in multiple permutations in multiple sorted tables, such that any similar hash up to a certain Hamming distance (3, 4, 5, possibly as high as 6 or 7 depending on hash size and table structure) is guaranteed to be found nearby in at least one of these tables, differing only in the low order bits. This makes searching for similar documents efficient, but restricts you to only finding very similar documents. See this simhash tutorial.
Because minhash and simhash are used so differently, I don't see a way to bridge/combine them. You could theoretically have a minhash that generates 1-bit hashes and concatenate them into something like a simhash, but I don't see any benefit in this.
Can weighting be used in minhash?
Yes. Think of the minhashes as slots: if you store 100 minhashes per document, that's 100 slots you can fill. These slots don't all have to be filled from the same selection of document features. You could devote a certain number of slots to hashes calculated just from specific features (care should be taken, though, to always select features in such a way that the same features will be selected in similar documents).
So for example if you were dealing with journal articles, you could assign some slots for minhashes of document title, some for minhashes of article abstract, and the remainder of the slots for minhashes of the document body. You can even set aside some individual slots for direct hashes of things such as author's surname, without bothering about the minhash algorithm.
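Sketching that slot allocation, building on the minhashSignature function from the sketch above (the field names and slot counts are arbitrary):

```cpp
// (Uses the includes and the minhashSignature function from the previous sketch.)
// Weighting by slot allocation: more slots for a field = more weight for it.
std::vector<std::uint64_t> weightedSignature(
        const std::unordered_set<std::string>& titleShingles,
        const std::unordered_set<std::string>& abstractShingles,
        const std::unordered_set<std::string>& bodyShingles) {
    std::vector<std::uint64_t> sig;
    const auto append = [&sig](const std::vector<std::uint64_t>& part) {
        sig.insert(sig.end(), part.begin(), part.end());
    };
    append(minhashSignature(titleShingles,    20, /*seed=*/1));  // 20% of the weight
    append(minhashSignature(abstractShingles, 30, /*seed=*/2));  // 30% of the weight
    append(minhashSignature(bodyShingles,     50, /*seed=*/3));  // 50% of the weight
    return sig;                                                  // 100 slots in total
}
```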
It's not quite as elegant as how simhash does it, where in effect you're just creating a bitwise weighted average of all the individual feature hashes, then rounding each bit up to 1 or down to 0. But it's quite workable.

Distinguish directed and undirected graph

I need to write a graph in C++ and I have a little problem. My graph should be directed or undirected, weighted or unweighted, and based on a matrix or a list, all at the user's choice. Distinguishing a matrix graph from a list graph is not a big deal, since those are two different classes, but I have some problems with the other parameters. The most obvious way to distinguish them is to keep two bool variables and check them on every addition and deletion of a vertex. That is quite obvious and easy to understand, but I doubt its efficiency, because every time I add or delete a vertex I have to do an additional if. I could also write subclasses for it, but I seriously doubt whether it's worth it.
Any library is okay to use, as long as it doesn't represent the graph itself.
For directed vs. undirected, the simplest approach is a bool variable on your graph. However, you can also treat every graph as weighted and directed: for an undirected edge, add one edge from a→b and one edge from b→a, and if there is no weight function, set every weight to 1.
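A minimal sketch of that single representation (the type and member names are just placeholders):

```cpp
#include <unordered_map>
#include <utility>
#include <vector>

// One directed, weighted adjacency list that also serves for undirected
// and/or unweighted graphs, as suggested above.
struct Graph {
    bool directed;
    std::unordered_map<int, std::vector<std::pair<int, double>>> adj;

    // Weight defaults to 1.0, so unweighted graphs simply omit the argument.
    void addEdge(int a, int b, double w = 1.0) {
        adj[a].push_back({b, w});
        if (!directed)                 // undirected: store the reverse arc as well
            adj[b].push_back({a, w});
    }
};
```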
If you are looking for a graph library, it depends on your programming language, but I'd suggest the Boost Graph Library, which is fully implemented in C++; many people have also partially implemented it in other languages.