calculating "levenshtein social network" *very* efficiently - levenshtein-distance

I'm doing an online code challenge that involves finding the 'social network' of words that are related through their Levenshtein distances. My Levenshtein function is correct. I'm recursively adding to a global set, and I'm using a map from word pairs (tuples) to booleans to cache whether any pair of words has a Levenshtein distance of 1. The code is supposed to terminate in 5 seconds. I'm not sure how this is even close to possible; I'm sure there is some aha insight that makes it possible. Can anyone see it right off the bat?
Problem Statement:
Two words are friends if they have a Levenshtein distance of 1. That is, you can add, remove, or substitute exactly one letter in word X to create word Y. A word’s social network consists of all of its friends, plus all of their friends, and all of their friends’ friends, and so on. Write a program to tell us how big the social network for the word 'hello' is, using this word list
My pseudocode:
# assumes word_list (the given word list) and levenshtein(a, b) are already defined
network = set()
lev_cache = {}  # (word_a, word_b) -> True if the Levenshtein distance is 1

def get_network(friend):
    if friend not in network:
        network.add(friend)
        friends = []
        for word in word_list:            # check friend against all words in the word list
            key = (friend, word)
            if key not in lev_cache:      # consult the cache or calculate the distance
                lev_cache[key] = levenshtein(friend, word) == 1
            if lev_cache[key]:
                friends.append(word)
        for f in friends:
            get_network(f)
To rephrase the question: "what's the fundamental insight that makes possible an astronomical boost in efficiency?"

Algorithm to rank the simplicity of a random name

I have been looking for a name for a new project. I want the name to have available domains and social media handles. For months, all those I can think of are taken.
So I generated a list of names with at least a consonant and a vowel and checked if the domains are available (which is very fast). I have about a million possible names.
I would like to sort them by some rank of simplicity. "Aaazq" would be close to the bottom, "Cawel" would be close to the top. I thought of the CVC structure (Consonant-Vowel-Consonant) and wonder if some more sophisticated algorithm exists. I searched for "sonority" but it has a different meaning in linguistics.
How can I automatically rank the simplicity of a random name?
I assume you would judge simplicity as compared to a target language, say English. Something that is 'simple' in English might not be 'simple' in German or Korean, as these languages have very different phonological structures.
I would recommend the following:
collect some text data in the language you are targeting. Just get some novels from Project Gutenberg, for example, or newspaper articles; whatever you can easily get hold of.
now generate n-grams from this: all sequences of two (bigrams) or three (trigrams) letters. Turn this into a frequency list, so that common n-grams are at the top of the list with a high frequency.
turn your suggested name into n-grams as well. Look up the frequency of each of its n-grams in your frequency list, and take the average or median of the results.
Your examples would do as follows:
aa, aa, az, zq: "aa" is rare ("aardvark"), "az" is a bit more common ("glaze", "raze"), and "zq" would not exist. So, not a very high score.
ca aw we el: all of these are fairly common in English words, so a reasonably high score.
You could also add a dummy # at the beginning and the end, so in your first example you'd get #a, which is fine, as many English words start with "a", but the final q# bombs out, as there are only a few words, such as "Iraq", that end in "q".
You can obviously do the same for other languages.
Also, you can reverse the process in a way, and pick random n-grams from your frequency list to generate names: by picking higher-frequency n-grams you will make sure the name is a good match with the phonological structure of your target language.
Note for pedants: I use phonological structure, but it's really its representation in the spelling system that we're dealing with here.
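Here is a minimal sketch of this scoring in Python, assuming an English corpus saved as a plain-text file named "corpus.txt" (the filename and the helper names are made up for illustration):

# Sketch of the n-gram scoring described above, using bigrams with "#" boundary markers.
# Assumes an English corpus saved as plain text in "corpus.txt" (a made-up filename).
import re
import statistics
from collections import Counter

def bigrams(word):
    padded = "#" + word.lower() + "#"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

with open("corpus.txt", encoding="utf-8") as f:
    corpus_words = re.findall(r"[a-z]+", f.read().lower())

freq = Counter(bg for w in corpus_words for bg in bigrams(w))

def simplicity(name):
    # Average corpus frequency of the name's bigrams; higher means more English-like.
    return statistics.mean(freq.get(bg, 0) for bg in bigrams(name))

print(simplicity("Cawel"), simplicity("Aaazq"))   # "Cawel" should come out higher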

Classify K-means in Text Mining

The goal is to create a computer-generated news site that aggregates headlines from different news sources around the world.
Looking at the centroid table results (https://ibb.co/n1mvnbk), I used K=5 with TF-IDF features, and I want to understand the following:
What do those numbers mean?
When an attribute is zero in multiple clusters, what does it mean?
When I sort the centroid table by each cluster in descending order, I find some words or attributes that have a higher value in this cluster and zero values in the other clusters. Does this mean that these words occur more or less frequently in this cluster?
How can I discuss the clustering model?
Do all the clusters make sense, and why?
Do you think k=5 is a good choice for this dataset, or should I choose 3? How can I decide?
I believe K=5 denotes the number of clusters you are looking for in the current dataset. On that basis, 5 centroids will be placed and the data points will be grouped around them.
Do you think k=5 is a good choice for this dataset? It is hard to say up front; it comes down to trying different combinations and seeing what the numbers tell you.
You might use the elbow method to identify a suitable number of clusters for a given dataset. It is based on WCSS (Within-Cluster Sum of Squares), which measures the squared distances between the points and their cluster centroids.
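For illustration (my addition, not part of the original answer), here is a rough sketch of the elbow method on TF-IDF vectors with scikit-learn; the documents are placeholders standing in for the real headlines:

# Sketch of the elbow method on TF-IDF features (assumes scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Placeholder documents; in practice use your real headlines.
docs = [
    "trump erdogan meeting at white house",
    "impeachment inquiry hearing continues",
    "local team wins championship final",
    "stock markets fall on trade fears",
    "new smartphone released this week",
    "heavy rain causes flooding in city",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)

for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the WCSS; look for the "elbow" where it stops dropping sharply
    print(k, round(km.inertia_, 3))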
Those numbers are the average TF-IDF values of the cluster, so a 0 means that the word does not occur in the cluster, and the highest-valued words are the most characteristic words for the cluster.
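To make the centroid table concrete, here is a small sketch (assuming a recent scikit-learn; the headlines are placeholders) that lists the highest-valued terms of each centroid:

# Sketch: show the most characteristic terms per cluster centroid.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = ["trump erdogan meeting", "impeachment hearing vote",
        "team wins final", "markets fall on trade fears"]   # placeholder headlines
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for i, centroid in enumerate(km.cluster_centers_):
    top = np.argsort(centroid)[::-1][:3]                    # indices of the 3 largest centroid values
    print(f"cluster {i}:", [(terms[j], round(centroid[j], 3)) for j in top])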
Note that for text you'll want to use spherical k-means rather than regular k-means.
Choosing k is a big problem. Forget the elbow method; it never works except on toy examples. Experiment with different k and choose the one that is most convincing or most useful. None of the usual heuristics for choosing the k in k-means will work here, I fear (VRC is IMHO the best). The main reason is that the data cannot be well partitioned into k clusters: there is no reason to assume there are exactly k topics in the world, nor that every document contains only one topic. Instead, the topics themselves form a complex structure. For example, there is Trump, but there is also the Trump-Erdogan meeting, and there is the impeachment; these are not disjoint. But you will also have articles that don't fit into any of these topics. The effect is that the true best k would likely be very, very large, as large as the number of articles (and hence not useful).

Custom word weights for sentences when calling h2o transform and word2vec, instead of straight AVERAGE of words

I am using H2O machine learning package to do natural language predictions, including the functions h2o.word2vec and h2o.transform. I need sentence level aggregation, which is provided by the AVERAGE parameter value:
h2o.transform(word2vec, words, aggregate_method = c("NONE", "AVERAGE"))
However, in my case I strongly wish to avoid equal weighting of "the" and "platypus" for example.
Here's a scheme I concocted to achieve custom word-weightings. If H2O's word2vec "AVERAGE" option uses all the words including duplicates that might appear, then I could effect a custom word weighting when calling h2o.transform by adding additional duplicates of certain words to my sentences, when I want to weight them more heavily than other words.
Can any H2O experts confirm that the word2vec AVERAGE parameter uses all the words, rather than just the unique words, when computing the AVERAGE of the words in a sentence?
Alternatively, is there a better way? I tried, but I cannot see any correct math for re-weighting the sentence average by some factor after it has already been computed.
Yes, h2o.transform will consider each occurrence of a word for the averaging, not just the unique words. Your trick will therefore work.
There is currently no direct way to provide user-defined weights. You could probably do an ugly hack and weight the word embeddings directly, but that wouldn't be a straightforward solution I could recommend.
We can add this feature to H2O. I would love to hear what API would work for you (how would you like to provide the weights).
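To illustrate the underlying arithmetic only (this is not H2O's API): duplicating a word k times in a sentence is the same as giving it weight k in a weighted average of the word vectors. A small numpy sketch with made-up embeddings and weights:

import numpy as np

# Hypothetical per-word embeddings (in practice, exported from your word2vec model).
vecs = {"the": np.array([0.1, 0.2]), "platypus": np.array([0.9, -0.4])}

def weighted_sentence_vector(words, weights):
    # Equivalent to averaging a sentence in which each word is repeated weights[word] times.
    num = sum(weights[w] * vecs[w] for w in words)
    den = sum(weights[w] for w in words)
    return num / den

plain = weighted_sentence_vector(["the", "platypus"], {"the": 1, "platypus": 1})
boosted = weighted_sentence_vector(["the", "platypus"], {"the": 1, "platypus": 3})  # same as "the platypus platypus platypus"
print(plain, boosted)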

How to normalize sequence of numbers?

I am working on a user behavior project. Based on user interaction I have got some data. There is a nice sequence which smoothly increases and decreases over time, but there are little discrepancies, which are very bad. Please refer to the graph below:
You can also find data here:
2.0789 2.09604 2.11472 2.13414 2.15609 2.17776 2.2021 2.22722 2.25019 2.27304 2.29724 2.31991 2.34285 2.36569 2.38682 2.40634 2.42068 2.43947 2.45099 2.46564 2.48385 2.49747 2.49031 2.51458 2.5149 2.52632 2.54689 2.56077 2.57821 2.57877 2.59104 2.57625 2.55987 2.5694 2.56244 2.56599 2.54696 2.52479 2.50345 2.48306 2.50934 2.4512 2.43586 2.40664 2.38721 2.3816 2.36415 2.33408 2.31225 2.28801 2.26583 2.24054 2.2135 2.19678 2.16366 2.13945 2.11102 2.08389 2.05533 2.02899 2.00373 1.9752 1.94862 1.91982 1.89125 1.86307 1.83539 1.80641 1.77946 1.75333 1.72765 1.70417 1.68106 1.65971 1.64032 1.62386 1.6034 1.5829 1.56022 1.54167 1.53141 1.52329 1.51128 1.52125 1.51127 1.50753 1.51494 1.51777 1.55563 1.56948 1.57866 1.60095 1.61939 1.64399 1.67643 1.70784 1.74259 1.7815 1.81939 1.84942 1.87731
1.89895 1.91676 1.92987
I want to smooth out this sequence. The technique should be able to eliminate points like X and Y, i.e. errors in an otherwise monotonically increasing or decreasing run.
If it cannot eliminate them, the technique should be able to shift them so that the series is not affected by the errors.
What I have tried and failed:
I tried thresholding the difference between consecutive values. In some special cases it works, but for a sequence like the one presented here, the differences between numbers are not such that I can cut out the errors.
I tried applying a threshold X: a change is accepted only if it exceeds X; otherwise the point is mapped to the previous point. Here I have great trouble deciding on the value of X, because it depends on the user interaction, which I do not really control. If the user interaction produces a zigzag pattern, I end up in a 'no user movement data detected at all' situation.
Please share the techniques that you are aware of.
PS: The data made available in this example is a particular case. There is no typical pattern in which the numbers will occur, but we expect some range to be continuous across all examples. The solution I am seeking is generic.
I do not know how much effort you want to put into this problem, but if you want theoretical guarantees, topological persistence seems well adapted to your problem IMHO.
Basically, with that method you can filter local maxima/minima by fixing a scale, and there are theoretical proofs saying that if your sampling is close to the underlying function, then persistence extracts the correct number of maxima.
You can see these slides (mainly pages 7-9) to get an idea of the method.
Basically, if you take your points as a landscape and imagine a water level starting at the maximum height and decreasing, you get some peaks.
Every peak has a time when it is born, which is when it first emerges, and a time when it dies, which is when it merges with a higher peak. A persistence diagram then plots a point for every peak whose x/y coordinates are its times of birth and death (by convention, the first peak does not die and is not shown).
If a peak is a global maximum, it will be further from the diagonal in the persistence diagram than a peak that is only a local maximum. To remove local maxima, you remove the peaks close to the diagonal. There are four local maxima in your example, as you can see from the persistence diagram of your data (thanks for providing the data, btw), and two global ones (the first peak is not pictured in a persistence diagram):
If you add noise to your data like this:
you will still get a very decent persistence diagram that allows you to filter local maxima as you want:
Please ask if you want more details or references.
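As a practical shortcut (my addition, not part of the answer above): the "prominence" of a peak in scipy.signal.find_peaks is closely related to its persistence, so a prominence threshold filters out insignificant local maxima. A sketch on synthetic data standing in for the posted series:

# Filter insignificant local maxima by peak prominence (closely related to persistence).
import numpy as np
from scipy.signal import find_peaks

t = np.linspace(0, 1, 100)
values = np.sin(np.pi * t) + 0.02 * np.sin(40 * np.pi * t)   # one big hump plus small wiggles

all_peaks, _ = find_peaks(values)
significant, _ = find_peaks(values, prominence=0.1)           # keep only peaks that stand out by >= 0.1

print("all local maxima:", len(all_peaks))
print("significant maxima:", len(significant))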
Since you cannot decide on a cut-off frequency, nor even on the filter you want to use, I would implement several and let the user set the parameters.
The first thing I thought of is a running average, and even there you can see there are many parameters to set, each giving different outputs.
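For example, a minimal centered running average with numpy, where the window size is the parameter you would expose to the user (the values below are a stand-in for the posted series):

# Centered running (moving) average; the window size is the knob to expose to the user.
import numpy as np

def running_average(values, window=5):
    kernel = np.ones(window) / window
    # mode="same" keeps the output the same length as the input (edges are affected by implicit zero padding).
    return np.convolve(values, kernel, mode="same")

values = [2.08, 2.10, 2.11, 2.13, 2.16, 2.18, 2.20, 2.23, 2.19, 2.25, 2.27]  # stand-in data
print(running_average(values, window=3))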

The New Villa ACM solution strategy

I am trying to solve this ACM problem, The New Villa, and I cannot figure out how to approach it. It is definitely a graph problem, but the doors and the rooms that have switches for other rooms make it confusing to come up with a generic solution. Can somebody help me define a strategy for this problem?
Also, if you know of any discussion forums for ACM problems, please share them.
Thanks
A.S
It seems like a pathfinding problem on states.
You can represent each vertex as a binary vector of size n (the on/off state of every light) plus an identifier for the room you are currently in [n is the number of rooms].
G=(V,E), where V = {all binary vectors of size n together with a record of which room you are in} and E = {(u,v) | you can go from state u to state v by clicking a switch in the room you are in, or by moving to an adjacent room whose light is on}.
Now you only need to run a search algorithm on the possible paths.
Possible search algorithms:
BFS - simplest to program, though slowest run time
Bidirectional BFS - since there is only one target node, a bidirectional search will work here; it is expected to be much faster than BFS
A* - find an admissible heuristic function and run informed A* on the problem. It is harder to program than the rest, but if you find a good heuristic, it will most likely perform much better.
(*) All of the above are both complete [they will find a solution if one exists] and optimal [they will find the shortest solution, if one exists].
(*) This solution runs in exponential time in the number of rooms, but for d <= 10, as indicated in the problem, it should finish in reasonable time.
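For illustration, here is a minimal sketch of plain BFS over the (room, light-state) vertices described above; the tiny villa, the adjacency and switches maps, and the goal test are made-up placeholders rather than the actual problem input:

from collections import deque

# Hypothetical tiny villa: rooms 0..n-1; room n-1 is the bedroom we want to reach.
# adjacency[r] = rooms connected to r by a door; switches[r] = rooms whose light can be toggled from r.
# (Extra rules of the real problem, e.g. never darkening the room you stand in, would be enforced in neighbors().)
n = 3
adjacency = {0: [1], 1: [0, 2], 2: [1]}
switches = {0: [1], 1: [0, 2], 2: [1]}

start = (0, 1 << 0)                      # standing in room 0, only room 0's light is on

def is_goal(state):
    room, lights = state
    return room == n - 1 and lights == 1 << (n - 1)   # in the bedroom with every other light off

def neighbors(state):
    room, lights = state
    for r in switches[room]:             # flip a switch available in the current room
        yield (room, lights ^ (1 << r))
    for r in adjacency[room]:            # walk through a door, but only into a lit room
        if lights & (1 << r):
            yield (r, lights)

# Plain BFS over the state graph (bidirectional BFS or A* would use the same state encoding).
dist = {start: 0}
queue = deque([start])
while queue:
    state = queue.popleft()
    if is_goal(state):
        print("goal reachable in", dist[state], "steps")
        break
    for nxt in neighbors(state):
        if nxt not in dist:
            dist[nxt] = dist[state] + 1
            queue.append(nxt)
else:
    print("no solution")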