Finding connected components in an implicit representation of a graph

Finding connected components in an implicit representation of a graph - c++

I have a CS problem for school that I just can't seem to wrap my head around.
We're given a vector of strings, which composes an "image". This vector of strings effectively represents a 2 dimensional matrix, and, in each of the spaces, there can be one of 4 characters: 'K' - A knight, 'D' - a dragon, '#' - a castle wall, and ' ' - empty space.
The problem requires us to write several functions, for instance: safe_knights - which finds how many knights are positioned such that dragons cannot move to them, castle_count - which counts the castles in the image, and knights_in_castles - which counts how many knights reside within castles. (see definitions below)
The assignment is somewhat vague, and I'm confused as to where to begin with this. The one hint that I have is that we should be looking for connected components of spaces - if I knew which spaces are connected, I'd be able to determine whether a dragon has a path to a knight (safe_knights needs this), or if a region of spaces is enclosed by castle walls (would be used for castle_count).
I know that I can use one of the graph traversal algorithms to help find connected components, but I'm scratching my head over how to implement this to work on the graph which is represented implicitly in the vector of strings. The pieces just aren't coming together for me.
Any hints or ideas to point me in the right direction would be appreciated!
//Definitions of terms above:
A castle is a region of at least one space enclosed by walls. Walls are connected if they are orthogonal or on the same line, not diagonal.
Castles: || Not Castles:
#### ### ###### || ## ### #
# # # # # #### || # # # # #
#### # # # # || ### # ####
### ######### ||
A knight is safe if there is no dragon that can move to it. Dragons cannot move diagonally.
Safe Knights: || Tasty Knights:
### ### || ##### ### D
#K# D # K# || #K D# # # K
### ### D || ##### ####
Example question to clarify what I don't understand:
The vector of strings is called image.
Say that (i, j) is equivalent to the character at (image[i])[j] (i.e. the jth char in the string at image[i].)
Then, how can I look at (i,j) and say "this is in a castle" or "this space can be reached by the dragon at (m,n)"?
Do I need some kind of tracking, like storing known connected components as a member of the class?
I'm assuming that each point (i,j) is a vertex in the graph, so I think I need some way to look at (i,j) and decide which other vertices it's adjacent to. I was told by the instructor that I don't need to represent the graph separately (e.g. scan through the vector and construct an adjacency matrix.) So I need to operate on the graph implicitly as opposed to having an explicit representation.
Edit:
So, I've thought about this some more, and it seems like the basis of each function is going to be doing a traversal, with slightly different stipulations on what constitutes adjacency. For example:
safe_knights - starting with a knight, DFS to legal spaces until a dragon is found or no more moves can be made. legal spaces are those in cardinal directions not blocked by a wall or the edge of the board.
castle_count - starting with a wall, DFS to spaces in cardinal directions containing walls, until this can't be done anymore. I think I'm going to also need some way to tell whether I've made it all the way back to where I started - maybe I can remember that starting node within that function. I might also have to check if there's a space in the middle.
knights_in_castles - this one is a little confusing - maybe starting with each knight, check the spaces around him until a wall is found, then check if that wall is part of a castle?

I guess your problem is how to use an incidence matrix (storing edge length in each cell) to represent the chessboard, before you apply ordinary graph traversal algorithm on it. Yet actually there is no need to do this.
Normally on a chessboard, we directly use the board matrix to represent this problem, no need to store edge length in a cell. If we want to find connected components, we use depth-first search (DFS) or breadth-first search (BFS) or A* search to find them.
A recursive DFS algorithm is quite simple to implement, but a little hard to understand at first.
DFS(node v)
{
visit(v);
for each child w of v {
// On a chessboard, they are right, left, up, and down neighbor cells
DFS(w);
}
}
In most cases, we use a global flag matrix to mark which cell has
been visited.
Before 'visit(v)', we always need to check that v is a legal cell (by check coordinates not over bound).
For example, if want the 'safe_knights', we start with a knight, DFS recursively search neighbor legal (not over bound, not a wall) cells, until we encounter a dragon, and fast return.

Related

Checking if there exists a string of directions that always leads to same vertex

There is a graph with n vertices. Each one has 4 edges, one for each side of the world. All of them are directed. I have to write a program that checks if there exists a string of directions that always leads to same vertex, no matter where you start.
For example:
example 1
If you go S W S you always will be in 3.
example 2
Such string of directions doesn't exist.
I have an idea how to do it but I would need a bool array with size 2^n. However, program has to work for n up to size 1000.
What's the best way to do it?

The sort of string you're describing is called a synchronizing word. If you are just trying to test whether such a word exists, there's a polynomial-time algorithm described in these lecture slides. Intuitively, for each pair of nodes u and v, you build a new graph where the start node is the pair {u, v}. Each node has a transition defined on each character c to the set {t(c, u), t(c, v)}, where t(c, u) represents the node transitioned to by reading character c in state u. You can expand out this graph using DFS or BFS. The original graph has a synchronizing word if and only if for each pair of nodes {u, v}, the above process produces a graph that has a path from {u, v} to some singleton node.
If you search online, you can find all sorts of other readings on this topic. Hopefully the terminology and the above links can help get you started!

Maximum Bipartite Matching C++

I'm solving a matching problem with two vectors of a class
class matching
{
public:
int n;
char match;
};
This is the algorithm I'm trying to implement:
int augment(vector<matching> &left, vector<matching> &right)
{
while(there's no augmenting path)
if(condition for matching)
<augment>
return "number of matching";
}
For the rough matching, if left[i] matches with right[j], then left[i].n = j, left[i].match ='M' , right[j].n = i and right[j].match = 'M' and the unmatched ones have members n = -1 and match = 'U'
While finding the augmenting paths, if one exists for another (i, j), then we change the member match of the one being unmatched from 'M' to 'U' and its n = -1 and the two matched with the augmenting path have their members match changed to 'A' while we change their members n according to their indices.
I don't know if this is the right approach to solving this, this is my first attempt on maximum matching and I've read a lot of articles and watched tutorials online and I can't get my 'code' to function appropriately.
I do not need a code, I can write my code. I just want to understand this algorithm step by step. If someone can give me an algorithm like the one I was trying above, I would appreciate it. Also, if I have been going the wrong direction since, please correct me.

I am not sure if you are finding the augmenting paths correctly. I suggest the following approach.
Find an initial matching in a greedy way. To obtain this we travel through every vertex in the left side and greedily try to match it with some free (unmatched) vertex on the right side.
Try to find an augmenting path P in the graph. For this we need to do a breadth-first search starting from all the free vertices on the left side and alternating through matched and unmatched edges in the search. (i.e. the second level contains all the right side vertices adjacent to level-1
vertices, the third level contains all the left side vertices that are
matched to level-2 vertices, the fourth level contains all the right side
vertices adjacent to level-3 vertices etc). We stop the search when we
visit a free vertex in any future level and compute the augmenting path P
using the breadth-first search tree computed so far.
If we can find an augmenting path P in the previous step: Change the matched and unmatched edges in P to unmatched and matched edges respectively and goto step 2.
Else: The resulting matching obtained is maximum.
This algorithm requires a breadth-first search for every augumentation and so it's worst-case complexity is O(nm). Although Hopcroft-Karp algorithm can perform multiple augmentations for each breadth-first search and has a better worst-case complexity, it
seems (from the Wikipedia article) that it isn't faster in practice.

Algorithm to print to screen path(s) through a text maze

For my C++ assignment, I'm basically trying to search through a chunk of text in a text file (that's streamed to my vector vec) beginning at the second top character on the left. It's for a text maze, where my program in the end is supposed to print out the characters for a path through it.
An example of a maze would be like:
###############
Sbcde####efebyj
####hijk#m#####
#######lmi#####
###############
###############
###############
###############
###############
###############
###############
###############
###############
###############
###############
Where '#' is an unwalkable wall and you always begin on the left at the second top character. Alphabetical characters represent walkable squares. Exit(s) are ALWAYS on the right. The maze is always a 15x15 size in a maze.text file. Alphabetical characters repeat within the same maze, but not directly beside each other.
What I'm trying to do here is: if a square next to the current one has an alphabetical character, add it to the vector vec, and repeat this process until I get to the end of the maze. Eventually I am supposed to make this more complicated by printing to the screen multiple paths that exist in some mazes.
So far I have this for the algorithm itself, which I know is wrong:
void pathcheck()
{
if (isalpha(vec.at(x)) && !(find(visited.begin(), visited.end(), (vec.at(x))) != visited.end()) )
{
path.push_back(vec.at(x));
visited.push_back(vec.at(x));
pathcheck(vec.at(x++));
pathcheck(vec.at(x--));
pathcheck(vec.at(x + 16));
pathcheck(vec.at(x - 16));
}
}
visited is my vector keeping track of the visited squares.
How would I update this so it actually works, and eventually so I can manage more than one path (i.e. if there were 2 paths, the program would print to the screen both of them)? I recall being told that I may need another vector/array that keeps track of squares that I've already visited/checked, but then how would I implement that here exactly?

You're on the right track. When it comes to mazes, the typical method of solving is through either a depth-first search (the most efficient solution for finding some path) or breadth-first search (less efficient, but is guarenteed to find the optimal path). Since you seem to want to do an exhaustive search, these choices are basically interchangeable. I suggest you read up on them:
http://en.wikipedia.org/wiki/Depth-first_search
http://en.wikipedia.org/wiki/Breadth-first_search
Basically, you will need to parse your maze and represent it as a graph (where each non "#" is a node and each link is a walkable path). Then, you keep a list of partial paths (i.e. a list of nodes, in the order you visited them, for example, [S, b, c] is the partial path starting from S and ending at c). The main idea of DFS and BFS is that you have a list of partial paths, and one by one you remove items from the list, generate all possible partial paths leading from that partial path, then place them in the list and repeat. The main difference between DFS and BFS is that DFS implements this list as a stack (i.e. new items have greatest priority) and BFS uses a queue (i.e. new items have lowest priority).
So, for your maze using DFS it would work like this:
Initial node is S, so your initial path is just [S]. Push [S] into your stack ([ [S] ]).
Pop the first item (in this case, [S]).
Make a list of all possible nodes you can reach in 1 move from the current node (in your case, just b).
For each node from step 3, remove any nodes that are part of your current partial path. This will prevent loops. (i.e. for partial path [S, b], from b we can travel to c and to S, but S is already part of our partial path so returning is pointless)
If one of the nodes from step 4 is the goal node, add it to your partial path to create a completed path. Print the path.
For each node from step 4 that IS NOT the goal node, generate a new partial path and push it into the stack (i.e. for [S], we generate [S, b] and push it into the stack, which now should look like [ [S, b] ])
Repeat steps 2 through 6 until the stack is empty, meaning you have traversed every possible path from the starting node.
NOTE: in your example there are duplicate letters (for example, three "e"s). For your case, maybe make a simple "Node" class that includes a variable to hold the letter. That way each "e" will have it's own instance and the pointers will be different values letting you easily tell them apart. I don't know C++ exactly, but in pseudo code:
class Node:
method Constructor(label):
myLabel = label
links = list()
method addLink(node):
links.add(node)
You could read every character in the file and if it is not "#", create a new instance of Node for that character and add all the adjacent nodes.
EDIT: I've spent the last 3 years as a Python developer and I've gotten a bit spoiled. Look at the following code.
s = "foo"
s == "foo"
In Python, that assertion is true. "==" in Python compares the string's content. What I forgot from my days as a Java developer is that in many languages "==" compares the string's pointers. That's why in many languages like Java and C++ the assertion is false because the strings point to different parts of memory.
My point is that because this assertion is not true, you COULD forgo making a Node class and just compare the characters (using ==, NOT using strcmp()!) BUT this code could be a bit confusing to read and must be documented.
Overall, I'd use some sort of Node class just because it's fairly simple to implement and results in more readable code AND only requires parsing your maze once!
Good Luck

Checking if a list of strings can be chained

Question
Implement a function bool chainable(vector<string> v), which takes a set of strings as parameters and returns true if they can be chained. Two strings can be chained if the first one ends with the same character the second starts with, e.g.:
ship->petal->lion->nick = true; (original input is list not LL)
ship->petal axe->elf = false;
My Solution:
My logic is that if its chainable there will only be one start and end that don't match. So I create a list of starts and a list of ends. like so.
starts:s,p,l,n
ends: p,l,n,k
if I remove the common elements, the lists should contain at most one items. namely s and k. If so the list is chainable. If the list is cyclic, the final lists are empty.
But i think I am missing some cases here,
EDIT:
Okay clearly I had faults in my solution. Can we conclude the best algorithm for this ?

The problem is to check if a Eulerian path exists in the directed graph whose vertices are the letters occurring as first or last letter of at least one of the supplied words and whose edges are the supplied words (each word is the edge from its first letter to its last).
Some necessary conditions for the existence of Eulerian paths in such graphs:
The graph has to be connected.
All vertices with at most two exceptions have equally many incoming and outgoing edges. If exceptional vertices exist, there are exactly two, one of them has one more outgoing edge than incoming, the other has one more incoming edge than outgoing.
The necessity is easily seen: If a graph has Eulerian paths, any such path meets all vertices except the isolated vertices (neither outgoing nor incoming edges). By construction, there are no isolated vertices in the graph under consideration here. In a Eulerian path, every time a vertex is visited, except the start and end, one incoming edge and one outgoing edge is used, so each vertex with the possible exception of the starting and ending vertex has equally many incoming and outgoing edges. The starting vertex has one more outgoing edge than incoming and the ending vertex one more incoming edge than outgoing unless the Eulerian path is a cycle, in which case all vertices have equally many incoming and outgoing edges.
Now the important thing is that these conditions are also sufficient. One can prove that by induction on the number of edges.
That allows for a very efficient check:
record all edges and vertices as obtained from the words
use a union find structure/algorithm to count the connected components of the graph
record indegree - outdegree for all vertices
If number of components > 1 or there is (at least) one vertex with |indegree - outdegree| > 1 or there are more than two vertices with indegree != outdegree, the words are not chainable, otherwise they are.

Isn't that similar to the infamous traveling salesman problem?
If you have n strings, you can construct a graph out of them, where each node corresponds to one string. You construct the edges the following way:
If string (resp. node) a and b are chainable, you introduce an edge a -> b with weight 1.
For all unchainable strings (resp. nodes) a and b, you introduce an edge a -> b with weight n.
Then, all your strings are chainable (without repetition) if and only if you can find an optimal TSP route in the graph whose weight is less than 2n.
Note: Your problem is actually simpler than TSP, since you always can transform string chaining into TSP, but not necessarily the other way around.

Here's a case where your algorithm doesn't work:
ship
pass
lion
nail
Your start and end lists are both s, p, l, n, but you can't make a single chain (you get two chains - ship->pass and lion->nail).
A recursive search is probably going to be best - pick a starting word (1), and, for each word that can follow it (2), try to solve the smaller problem of creating a chain starting with (2) that contains all of the words except (1).

As phimuemue pointed out, this is a graph problem. You have a set of strings (vertices), with (directed) edges. Clearly, the graph must be connected to be chainable -- this is easy to check. Unfortunately, the rules beyond this are a little unclear:
If strings may be used more than once, but links can't, then the problem is to find an Eulerian path, which can be done efficiently. An Eulerian path uses each edge once, but may use vertices more than once.
// this can form a valid Eulerian path
yard
dog
god
glitter
yard -> dog -> god -> dog -> glitter
If the strings may not be used more than once, then the problem is to find a Hamiltonian path. Since the Hamiltonian path problem is NP-complete, no exact efficient solution is known. Of course, for small n, efficiency isn't really important and a brute force solution will work fine.
However, things are not quite so simple, because the set of graphs that can occur as inputs to this problem are limited. For example, the following is a valid directed graph (in dot notation) (*).
digraph G {
alpha -> beta;
beta -> gamma;
gamma -> beta;
gamma -> delta;
}
However, this graph cannot be constructed from strings using the rules of this puzzle: Since alpha and gamma both connect to beta, they must end with the same character (let's assume they end with 'x'), but gamma also connects to delta, so delta must also start with 'x'. But delta cannot start with 'x', because if it did, then there would be an edge alpha -> delta, which is not in the original graph.
Therefore, this is not quite the same as the Hamiltonian path problem, because the set of inputs is more restricted. It is possible that an efficient algorithm exists to solve the string chaining problem even if no efficient algorithm exists to solve the Hamiltonian path problem.
But... I don't know what that algorithm would be. Maybe someone else will come up with a real solution, but in the mean time I hope someone finds this answer interesting.
(*) It also happens to have a Hamiltonian path: alpha -> beta -> gamma -> delta, but that's irrelevant for what follows.

if you replace petal and lion with pawn and label, you still have:
starts:s,p,l,n
ends: p,l,n,k
You're algorithm decides its chainable, but they aren't.
The problem is you are disconnecting the first and last letters of each word.
A recursive backtracking or dynamic programming algorithm should solve this problem.

seperatedly check for "Is chainable" and is "cylcic"
if it's to be cyclic it must be chainable first. you could do something like this:
if (IsChainable)
{
if (IsCyclic() { ... }
}
Note: That's the case if you check only the first and last element of the chain for "cylic".

This can be solved by a reduction to the Eulerian path problem by considering a digraph G with N(G) = Σ and E(G) = a->e for words aWe.

Here's a simple program to do this iteratively:
#include <string>
#include <vector>
#include <iostream>
using std::vector;
using std::string;
bool isChained(vector<string> const& strngs)
{
if (strngs.size() < 2) return false; //- make sure we have at least two strings
if (strngs.front().empty()) return false; //- make sure 1st string is not empty
for (vector<string>::size_type i = 1; i < strngs.size(); ++i)
{
string const& head = strngs.at(i-1);
string const& tail = strngs.at(i);
if (tail.empty()) return false;
if (head[head.size()-1] != tail[0]) return false;
}
return true;
}
int main()
{
vector<string> chained;
chained.push_back("ship");
chained.push_back("petal");
chained.push_back("lion");
chained.push_back("nick");
vector<string> notChained;
notChained.push_back("ship");
notChained.push_back("petal");
notChained.push_back("axe");
notChained.push_back("elf");
std::cout << (isChained(chained) ? "true" : "false") << "\n"; //- prints 'true'
std::cout << (isChained(notChained) ? "true" : "false") << "\n"; //- prints 'false'
return 0;
}

Similar String algorithm

I'm looking for an algorithm, or at least theory of operation on how you would find similar text in two or more different strings...
Much like the question posed here: Algorithm to find articles with similar text, the difference being that my text strings will only ever be a handful of words.
Like say I have a string:
"Into the clear blue sky"
and I'm doing a compare with the following two strings:
"The color is sky blue" and
"In the blue clear sky"
I'm looking for an algorithm that can be used to match the text in the two, and decide on how close they match. In my case, spelling, and punctuation are going to be important. I don't want them to affect the ability to discover the real text. In the above example, if the color reference is stored as "'sky-blue'", I want it to still be able to match. However, the 3rd string listed should be a BETTER match over the second, etc.
I'm sure places like Google probably use something similar with the "Did you mean:" feature...
* EDIT *
In talking with a friend, he worked with a guy who wrote a paper on this topic. I thought I might share it with everyone reading this, as there are some really good methods and processes described in it...
Here's the link to his paper, I hope it is helpful to those reading this question, and on the topic of similar string algorithms.

Levenshtein distance will not completely work, because you want to allow rearrangements. I think your best bet is going to be to find best rearrangement with levenstein distance as cost for each word.
To find the cost of rearrangement, kinda like the pancake sorting problem. So, you can permute every combination of words (filtering out exact matches), with every combination of other string, trying to minimize a combination of permute distance and Levenshtein distance on each word pair.
edit:
Now that I have a second I can post a quick example (all 'best' guesses are on inspection and not actually running the algorithms):
original strings | best rearrangement w/ lev distance per word
Into the clear blue sky | Into the c_lear blue sky
The color is sky blue | is__ the colo_r blue sky
R_dist = dist( 3 1 2 5 4 ) --> 3 1 2 *4 5* --> *2 1 3* 4 5 --> *1 2* 3 4 5 = 3
L_dist = (2D+S) + (I+D+S) (Total Subsitutions: 2, deletions: 3, insertion: 1)
(notice all the flips include all elements in the range, and I use ranges where Xi - Xj = +/- 1)
Other example
original strings | best rearrangement w/ lev distance per word
Into the clear blue sky | Into the clear blue sky
In the blue clear sky | In__ the clear blue sky
R_dist = dist( 1 2 4 3 5 ) --> 1 2 *3 4* 5 = 1
L_dist = (2D) (Total Subsitutions: 0, deletions: 2, insertion: 0)
And to show all possible combinations of the three...
The color is sky blue | The colo_r is sky blue
In the blue clear sky | the c_lear in sky blue
R_dist = dist( 2 4 1 3 5 ) --> *2 3 1 4* 5 --> *1 3 2* 4 5 --> 1 *2 3* 4 5 = 3
L_dist = (D+I+S) + (S) (Total Subsitutions: 2, deletions: 1, insertion: 1)
Anyway you make the cost function the second choice will be lowest cost, which is what you expected!

One way to determine a measure of "overall similarity without respect to order" is to use some kind of compression-based distance. Basically, the way most compression algorithms (e.g. gzip) work is to scan along a string looking for string segments that have appeared earlier -- any time such a segment is found, it is replaced with an (offset, length) pair identifying the earlier segment to use. You can use measures of how well two strings compress to detect similarities between them.
Suppose you have a function string comp(string s) that returns a compressed version of s. You can then use the following expression as a "similarity score" between two strings s and t:
len(comp(s)) + len(comp(t)) - len(comp(s . t))
where . is taken to be concatenation. The idea is that you are measuring how much further you can compress t by looking at s first. If s == t, then len(comp(s . t)) will be barely any larger than len(comp(s)) and you'll get a high score, while if they are completely different, len(comp(s . t)) will be very near len(comp(s) + comp(t)) and you'll get a score near zero. Intermediate levels of similarity produce intermediate scores.
Actually the following formula is even better as it is symmetric (i.e. the score doesn't change depending on which string is s and which is t):
2 * (len(comp(s)) + len(comp(t))) - len(comp(s . t)) - len(comp(t . s))
This technique has its roots in information theory.
Advantages: good compression algorithms are already available, so you don't need to do much coding, and they run in linear time (or nearly so) so they're fast. By contrast, solutions involving all permutations of words grow super-exponentially in the number of words (although admittedly that may not be a problem in your case as you say you know there will only be a handful of words).

One way (although this is perhaps better suited a spellcheck-type algorithm) is the "edit distance", ie., calculate how many edits it takes to transform one string to another. A common technique is found here:
http://en.wikipedia.org/wiki/Levenshtein_distance

You might want to look into the algorithms used by biologists to compare DNA sequences, since they have to cope with many of the same things (chunks may be missing, or have been inserted, or just moved to a different position in the string.
The Smith-Waterman algorithm would be one example that'd probably work fairly well, although it might be too slow for your uses. Might give you a starting point, though.

i had a similar problem, i needed to get the percentage of characters in a string that were similar. it needed exact sequences, so for example "hello sir" and "sir hello" when compared needed to give me five characters that are the same, in this case they would be the two "hello"'s. it would then take the length of the longest of the two strings and give me a percentage of how similar they were. this is the code that i came up with
int compare(string a, string b){
return(a.size() > b.size() ? bigger(a,b) : bigger(b,a));
}
int bigger(string a, string b){
int maxcount = 0, currentcount = 0;//used to see which set of concurrent characters were biggest
for(int i = 0; i < a.size(); ++i){
for(int j = 0; j < b.size(); ++j){
if(a[i+j] == b[j]){
++currentcount;
}
else{
if(currentcount > maxcount){
maxcount = currentcount;
}//end if
currentcount = 0;
}//end else
}//end inner for loop
}//end outer for loop
return ((int)(((float)maxcount/((float)a.size()))*100));
}

I can't mark two answers here, so I'm going to answer and mark my own. The Levenshtein distance appears to be the correct method in most cases for this. But, it is worth mentioning j_random_hackers answer as well. I have used an implementation of LZMA to test his theory, and it proves to be a sound solution. In my original question I was looking for a method for short strings (2 to 200 chars), where the Levenshtein Distance algorithm will work. But, not mentioned in the question was the need to compare two (larger) strings (in this case, text files of moderate size) and to perform a quick check to see how similar the two are. I believe that this compression technique will work well but I have yet to study it to find at which point one becomes better than the other, in terms of the size of the sample data and the speed/cost of the operation in question. I think a lot of the answers given to this question are valuable, and worth mentioning, for anyone looking to solve a similar string ordeal like I'm doing here. Thank you all for your great answers, and I hope they can be used to serve others well too.

There's another way. Pattern recognition using convolution. Image A is run thru a Fourier transform. Image B also. Now superimposing F(A) over F(B) then transforming this back gives you a black image with a few white spots. Those spots indicate where A matches B strongly. Total sum of spots would indicate an overall similarity. Not sure how you'd run an FFT on strings but I'm pretty sure it would work.

The difficulty would be to match the strings semantically.
You could generate some kind of value based on the lexical properties of the string. e.g. They bot have blue, and sky, and they're in the same sentence, etc etc... But it won't handle cases where "Sky's jean is blue", or some other odd ball English construction that uses same words, but you'd need to parse the English grammar...
To do anything beyond lexical similarity, you'd need to look at natural language processing, and there isn't going to be one single algorith that would solve your problem.

Possible approach:
Construct a Dictionary with a string key of "word1|word2" for all combinations of words in the reference string. A single combination may happen multiple times, so the value of the Dictionary should be a list of numbers, each representing the distance between the words in the reference string.
When you do this, there will be duplication here: for every "word1|word2" dictionary entry, there will be a "word2|word1" entry with the same list of distance values, but negated.
For each combination of words in the comparison string (words 1 and 2, words 1 and 3, words 2 and 3, etc.), check the two keys (word1|word2 and word2|word1) in the reference string and find the closest value to the distance in the current string. Add the absolute value of the difference between the current distance and the closest distance to a counter.
If the closest reference distance between the words is in the opposite direction (word2|word1) as the comparison string, you may want to weight it smaller than if the closest value was in the same direction in both strings.
When you are finished, divide the sum by the square of the number of words in the comparison string.
This should provide some decimal value representing how closely each word/phrase matches some word/phrase in the original string.
Of course, if the original string is longer, it won't account for that, so it may be necessary to compute this both directions (using one as the reference, then the other) and average them.
I have absolutely no code for this, and I probably just re-invented a very crude wheel. YMMV.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js