C++ String Comparisons [duplicate] - c++

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
string comparison with the most similar string
I was wondering what the best way is to compare two strings for a certain percentage of similarity. For example, String 1 is "I really like to eat pie," and String 2 is "I really like to eat cheese," and a function comparing them would return "true" because more than 50% of the characters are similar.
I was thinking that I could see if each character in one string is somewhere in the other, but there's probably a more precise way to go about things. Any suggestions?

Levenshtein distance might be suitable. It tells how many single-character insertions, deletions or replacements must be made in order to transform one string into the other. You can also give different priorities to the three operations.
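As a rough illustration of the Levenshtein suggestion, here is a minimal sketch using the standard dynamic-programming formulation with unit costs for all three operations. The `similarity` helper and the idea of normalizing by the longer string's length are assumptions made for this example, not something the answer specifies.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Edit distance with unit cost for insertion, deletion and replacement.
// Only two rows of the DP table are kept in memory.
std::size_t levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), curr(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;

    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1,          // deletion
                                curr[j - 1] + 1,      // insertion
                                prev[j - 1] + cost}); // replacement or match
        }
        std::swap(prev, curr);
    }
    return prev[b.size()];
}

// Hypothetical helper: maps the distance to a score in [0, 1],
// where 1 means identical. The normalization is an assumption.
double similarity(const std::string& a, const std::string& b) {
    const std::size_t longest = std::max(a.size(), b.size());
    if (longest == 0) return 1.0;
    return 1.0 - static_cast<double>(levenshtein(a, b)) / static_cast<double>(longest);
}
```

With this sketch, the check from the question could be written as `similarity(s1, s2) > 0.5`. Giving the three operations different weights would mean replacing the `+ 1` and `cost` terms with per-operation costs.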

For a fuzzy compare like this you could split each string up into words (using strtok()) and compare the two word arrays case-insensitively using stricmp(). Note that stricmp() is not standard; POSIX systems offer strcasecmp() and MSVC offers _stricmp(). There is also the SOUNDEX algorithm to compare words to see if they sound the same.
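The word-based idea can be sketched as follows. This version swaps strtok() and stricmp() for standard C++ facilities (`std::istringstream` and `std::tolower`), and the `word_overlap` score is a hypothetical measure chosen for illustration.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <set>
#include <sstream>
#include <string>

// Splits on whitespace and lowercases each word so that the later
// comparison is case-insensitive.
std::set<std::string> lowercase_words(const std::string& text) {
    std::set<std::string> words;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        std::transform(word.begin(), word.end(), word.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        words.insert(word);
    }
    return words;
}

// Hypothetical score: fraction of the distinct words in `a` that also occur in `b`.
double word_overlap(const std::string& a, const std::string& b) {
    const std::set<std::string> wa = lowercase_words(a);
    const std::set<std::string> wb = lowercase_words(b);
    if (wa.empty()) return wb.empty() ? 1.0 : 0.0;
    std::size_t shared = 0;
    for (const std::string& w : wa) shared += wb.count(w);
    return static_cast<double>(shared) / static_cast<double>(wa.size());
}
```

Punctuation stays attached to words here (so "pie," and "pie" differ); stripping it first would be a reasonable refinement.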

Related

How to implement a fastest algorithm for match the prefix with string?

There are about 100K strings (the prefixes), and we need to know whether a given string matches one of these prefixes. For example, the prefixes are:
12
123
1234
12345
If the given string is 123abc, it matches the "123" prefix;
if the given string is 12340098, it matches the "1234" prefix.
Since there are 100K prefixes, we need a very fast way to do the matching. How could we implement this in C++?
I think you're looking for the trie data structure, which is optimized for queries of the form "are any of these strings prefixes of a given string?" or "is this given string a prefix of any of these other strings?" (This is related to the deterministic finite automaton that @Sam Varshavchik mentions in the comments, though that connection requires a bit of CS theory to fully understand.)
There are many ways to implement a trie in C++. I'd advise starting off by reading up on the data structure to get a better sense for how it works, then using that to guide your implementation. If in the course of coding it up you run into some issues, feel free to post a follow-up question.
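For orientation, here is one minimal way a trie for this problem might look. The class name `PrefixTrie`, the choice of `std::map` for child links, and returning the longest matching prefix (as in the question's examples) are all assumptions of this sketch.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>

class PrefixTrie {
    struct Node {
        std::map<char, std::unique_ptr<Node>> children;
        bool is_end = false;  // true if a stored prefix ends at this node
    };
    Node root_;

public:
    // Adds one prefix to the trie, creating nodes as needed.
    void insert(const std::string& prefix) {
        Node* node = &root_;
        for (char c : prefix) {
            std::unique_ptr<Node>& child = node->children[c];
            if (!child) child.reset(new Node());
            node = child.get();
        }
        node->is_end = true;
    }

    // Returns the longest stored prefix that `text` starts with,
    // or an empty string if no stored prefix matches.
    std::string longest_prefix(const std::string& text) const {
        const Node* node = &root_;
        std::size_t best = 0;
        for (std::size_t i = 0; i < text.size(); ++i) {
            auto it = node->children.find(text[i]);
            if (it == node->children.end()) break;
            node = it->second.get();
            if (node->is_end) best = i + 1;
        }
        return text.substr(0, best);
    }
};
```

A lookup walks at most one node per character of the query, so its cost depends on the length of the query string rather than on the number of stored prefixes. If the prefixes are known to be digits only, a fixed array of ten children per node would likely be faster than `std::map`.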

Search string for string sequence [duplicate]

This question already has answers here:
How would you count occurrences of a string (actually a char) within a string?
(34 answers)
Optimized version of strstr (search has constant length)
(5 answers)
Closed 7 years ago.
What is the most efficient way to count the number of occurrences of a substring in another string in C++? For example, I have a very large string like
"GQWHIWQGHWGGEEEGQIHIGWHIQWGHIEEEGPHIQPIWGHQPWGPHEEEGQIHWPWGQHPQWGEEE"
and I want to count how often "EEE" occurs.
I could go step by step in a for loop, check whether each letter is an E, count consecutive Es, and increment a counter whenever there are three of them, but I guess there is a more efficient way of doing this.
Maybe a string function? I just wasn't able to find or google a suitable one.
I am searching for a clean C++11 solution.
Well, if you want a fast and efficient solution, take a look at the Knuth–Morris–Pratt algorithm. It takes only O(N+M) time to search, where N is the length of the text and M is the length of the pattern.
If you want something in STL style, then take a look at std::string::find.
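A short sketch of the std::string::find approach follows. The function name `count_occurrences` and the `allow_overlap` flag are hypothetical; the question does not say whether overlapping matches should count, so the flag makes that choice explicit.

```cpp
#include <cstddef>
#include <string>

// Counts occurrences of `pattern` in `text` by repeatedly calling find()
// from just past the previous match.
std::size_t count_occurrences(const std::string& text,
                              const std::string& pattern,
                              bool allow_overlap = false) {
    if (pattern.empty()) return 0;  // avoid an endless loop on empty patterns
    const std::size_t step = allow_overlap ? 1 : pattern.size();
    std::size_t count = 0;
    for (std::size_t pos = text.find(pattern);
         pos != std::string::npos;
         pos = text.find(pattern, pos + step)) {
        ++count;
    }
    return count;
}
```

For the string in the question, calling `count_occurrences(text, "EEE")` counts each run of three Es once. The complexity of `find` itself is implementation-defined, so KMP may still be worth considering for adversarial inputs.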

How to find longest palindrome [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Write a function that returns the longest palindrome in a given string
I have a C++ assignment that asks me to write a program that finds the longest palindrome in a given text. For example, if the text is asdqerderdiedasqwertunut, my program should find tunut at index 19. However, if the input is changed to astunutsaderdiedasqwertunut, it should find astunutsa at index 0 instead of tunut at index 22.
So, that is my problem. But I am a beginner at the subject; I know only the string class, loops, and ifs. It would be great if you could help me with this.
Thanks in advance.
The idea is very simple:
Write a function is_palindrome(string) that takes a string, and returns true if it is a palindrome and false if it is not
With that function in hand, write two nested loops cutting out different substrings from the original string. Pass each substring to is_palindrome(string), and pick the longest one among the strings returning true.
You can further optimize your program by examining longest substrings ahead of shorter ones. If you examine substrings from longest to shortest, you'll be able to return as soon as you find the first palindrome.
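The steps above could be sketched like this, using only the string class, loops, and ifs. The function names are illustrative, and the longest-first ordering from the last step is already built in.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` reads the same forwards and backwards.
bool is_palindrome(const std::string& s) {
    if (s.empty()) return true;
    for (std::size_t i = 0, j = s.size() - 1; i < j; ++i, --j) {
        if (s[i] != s[j]) return false;
    }
    return true;
}

// Tries substrings from longest to shortest and returns the first
// palindrome found, so the result is the leftmost longest palindrome.
std::string longest_palindrome(const std::string& text) {
    for (std::size_t len = text.size(); len > 0; --len) {
        for (std::size_t start = 0; start + len <= text.size(); ++start) {
            const std::string candidate = text.substr(start, len);
            if (is_palindrome(candidate)) return candidate;
        }
    }
    return "";
}
```

If the start index is needed as well, the inner loop's `start` value at the point of return is that index. This brute-force version is cubic in the worst case, which is fine for short assignment inputs.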
Dasblinkenlight's idea is pretty good, but it's faster this way:
A palindrome has either an even or an odd number of letters, so you have two situations. Let's start with the even case. You need to find two consecutive identical letters, and then check whether the letter immediately before them is identical to the letter immediately after them, continuing outward for as long as the letters match. The odd case is the same, except that you start from a single letter. I don't speak English that well, so I hope you understood. :)
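One possible reading of this "grow outward from the middle" idea is sketched below. The function name and the tie-breaking rule (the leftmost longest palindrome wins) are assumptions of this example.

```cpp
#include <cstddef>
#include <string>

// For every position, treats it as the middle of an odd-length palindrome,
// and every pair of equal neighbours as the middle of an even-length one,
// then grows outward while the surrounding letters match.
std::string longest_palindrome_centered(const std::string& text) {
    std::size_t best_start = 0, best_len = 0;

    // `left` and `right` are the inclusive bounds of a known palindrome.
    auto expand = [&](std::size_t left, std::size_t right) {
        while (left > 0 && right + 1 < text.size() &&
               text[left - 1] == text[right + 1]) {
            --left;
            ++right;
        }
        const std::size_t len = right - left + 1;
        if (len > best_len) {
            best_start = left;
            best_len = len;
        }
    };

    for (std::size_t i = 0; i < text.size(); ++i) {
        expand(i, i);                                   // odd length
        if (i + 1 < text.size() && text[i] == text[i + 1]) {
            expand(i, i + 1);                           // even length
        }
    }
    return text.substr(best_start, best_len);
}
```

This runs in quadratic time in the worst case, compared with cubic time for checking every substring separately.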

C++: Comparing two strings [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
comparing two strings with comma seperated values
I am working in C++, where I have two strings:
string str1 = "1,4,8,",
str2 = "4,1,8,";
Both strings contain comma-separated values. Now I just want to check whether all the elements in str1 also exist in str2, regardless of their position. Is there any direct way to check this? Do I need to write custom code for this?
As far as C++ is concerned, those strings are just sequences of characters. If you apply meaning to those characters (such as "comma separated values"), then you'll have to write some code to extract the data and deal with it.
I would do something like:
split the string on ','
convert each sequence of digits into an integer (skipping over empty elements)
insert those integers into a set (one for each input string)
compare the sets
It's up to you to determine what kind of integer to use.
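The four steps above might look roughly like this. The sketch assumes the values fit in an `int` and that every non-empty field is a valid number (`std::stoi` throws otherwise); the function names are made up for illustration.

```cpp
#include <algorithm>
#include <set>
#include <sstream>
#include <string>

// Splits on ',' and collects the integer values, skipping empty fields
// such as the one produced by a doubled comma.
std::set<int> parse_csv_ints(const std::string& text) {
    std::set<int> values;
    std::istringstream in(text);
    std::string field;
    while (std::getline(in, field, ',')) {
        if (field.empty()) continue;
        values.insert(std::stoi(field));
    }
    return values;
}

// True if every element of `a` also appears in `b`, regardless of order.
bool all_elements_present(const std::string& a, const std::string& b) {
    const std::set<int> sa = parse_csv_ints(a);
    const std::set<int> sb = parse_csv_ints(b);
    // std::includes requires sorted ranges, which std::set provides.
    return std::includes(sb.begin(), sb.end(), sa.begin(), sa.end());
}
```

If the two strings must contain exactly the same elements, comparing the sets with `sa == sb` would be the stricter check.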
Yes, you need to write custom code, although not a lot of it. Once you figure out the algorithm you can post here if you have further questions on how to implement each part.

Matching unmatched strings based on a unknown pattern

Alright guys, I really hurt my brain over this one and I'm curious if you guys can give me any pointers towards the right direction I should be taking.
The situation is this:
Let's say I have a collection of strings. To be clear, the pattern of these strings is unknown. What I do know is that the strings contain only ASCII characters, so I don't have to worry about non-ASCII text such as Chinese characters.
For this example, I take the following collection of strings (note that the strings don't have to make any human sense so don't try figuring them out :)):
"[001].[FOO].[TEST] - 'foofoo.test'",
"[002].[FOO].[TEST] - 'foofoo.test'",
"[003].[FOO].[TEST] - 'foofoo.test'",
"[001].[FOO].[TEST] - 'foofoo.test.sample'",
"[002].[FOO].[TEST] - 'foofoo.test.sample'",
"-001- BAR.[TEST] - 'bartest.xx1",
"-002- BAR.[TEST] - 'bartest.xx1"
Now, what I need is a way of finding logical groups (and subgroups) within this set of strings. In the above example, just by rational thinking, you can combine the first 3, the 2 after that, and the last 2. The groups resulting from the first 5 can also be combined into one main group with 2 subgroups, which should give you something like this:
{
{
"[001].[FOO].[TEST] - 'foofoo.test'",
"[002].[FOO].[TEST] - 'foofoo.test'",
"[003].[FOO].[TEST] - 'foofoo.test'",
}
{
"[001].[FOO].[TEST] - 'foofoo.test.sample'",
"[002].[FOO].[TEST] - 'foofoo.test.sample'",
}
}
{
{
"-001- BAR.[TEST] - 'bartest.xx1",
"-002- BAR.[TEST] - 'bartest.xx1"
}
}
Sorry for the layout above, but indenting with 4 spaces doesn't seem to work correctly (or I'm messing it up).
Anyway, I'm not sure how to approach this problem (how to get the result desired as indicated above).
First off, I thought of creating a huge set of regexes that would parse most known patterns, but the number of different patterns is just too large for this to be realistic.
Another thing I thought of was parsing each individual word within a string (so split on all non-alphanumeric characters and strip them), and if X% of the words match, I can assume the strings belong to the same group (where X will probably be around 80 to 90). However, I find that this leaves a lot of room for speculation. For example, when matching strings of 20 words each, the chance of hitting above 80% is fairly large (4 words can differ), whereas when matching strings of only 8 words, at most 2 words can differ.
My question to you is, what would be a logical approach in the above situation?
As for a real-life example:
Thanks in advance!
Basically I would consider each string as a bag of characters. I would define a kind of distance between two strings, something like "number of characters belonging to both strings" divided by "total number of characters in string 1 + total number of characters in string 2" (well, it's not a distance mathematically speaking...), and then I would try to apply some algorithms to cluster your set of strings.
Well, this is just a basic idea but I think it would be a good start to try some experiments...
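To make the bag-of-characters measure concrete, here is one possible interpretation. Counting shared characters with multiplicity (a multiset intersection) is an assumption of this sketch; the answer leaves that detail open. The function name is hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <string>

// Shared characters (counted with multiplicity) divided by the combined
// length of both strings. Identical strings score 0.5; strings with no
// characters in common score 0.
double bag_similarity(const std::string& a, const std::string& b) {
    std::array<std::size_t, 256> count_a{}, count_b{};
    for (unsigned char c : a) ++count_a[c];
    for (unsigned char c : b) ++count_b[c];

    std::size_t shared = 0;
    for (std::size_t i = 0; i < 256; ++i) {
        shared += std::min(count_a[i], count_b[i]);
    }

    const std::size_t total = a.size() + b.size();
    return total == 0 ? 0.0 : static_cast<double>(shared) / static_cast<double>(total);
}
```

Because the score ignores character order, "abc" and "cba" are treated as identical, which may or may not be desirable for the grouping task. A clustering step would then take `0.5 - bag_similarity(a, b)` or a similar transform as its dissimilarity.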
Building on @PierrOz's answer, you might want to experiment with multiple measures, and do a statistical cluster analysis on those measures.
For example, you could use four measures:
How many letters (upper/lowercase)
How many digits
How many of ([,],.)
How many other characters (probably) not included above
You then have, in this example, four measures for each string, and you could, if you wished, apply a different weight to each measure.
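The four example measures could be computed along these lines. The `Measures` alias, the function names, and the weighted Euclidean distance are assumptions chosen for illustration; any other combination of the measures would fit the same idea.

```cpp
#include <array>
#include <cctype>
#include <cmath>
#include <cstddef>
#include <string>

// Index 0: letters, 1: digits, 2: one of [ ] . , 3: everything else.
using Measures = std::array<int, 4>;

Measures measure(const std::string& s) {
    Measures m{};
    for (unsigned char c : s) {
        if (std::isalpha(c))                         ++m[0];
        else if (std::isdigit(c))                    ++m[1];
        else if (c == '[' || c == ']' || c == '.')   ++m[2];
        else                                         ++m[3];
    }
    return m;
}

// Hypothetical weighted Euclidean distance between two measure vectors.
double weighted_distance(const Measures& a, const Measures& b,
                         const std::array<double, 4>& weights) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double diff = static_cast<double>(a[i] - b[i]);
        sum += weights[i] * diff * diff;
    }
    return std::sqrt(sum);
}
```

Strings that differ only in which digits they contain, such as the "[001]" and "[002]" entries from the question, end up with identical measure vectors and therefore a distance of zero.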
R has a number of functions for cluster analysis. This might be a good starting point.
Afterthought: the measures can be almost anything you invent. Some more examples:
Binary: does the string contain a given character (0 or 1)?
Binary: does the string contain a given substring?
Count: how many times does the given substring appear?
Binary: does the string include all these characters?
Enough for at least a weekend's tinkering...
I would recommend using the Hamming distance (http://en.wikipedia.org/wiki/Hamming_distance) as the distance. Note that it is only defined for strings of equal length.
Also, for files, a good heuristic would be to remove the checksum at the end of the filename before calculating the distance:
[BSS]_Darker_Than_Black_-_The_Black_Contractor_-_Gaiden_-_01_[35218661].mkv
->
[BSS]_Darker_Than_Black_-_The_Black_Contractor_-_Gaiden_-_01_.mkv
The check is simple: it's always 10 characters, the first being [, the last being ], and the rest alphanumeric :)
With the heuristic and the distance max of 4, your stuff will work in the vast majority of the cases.
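A sketch of this heuristic follows. Both function names are hypothetical, and treating strings of different lengths as "not within distance" is an assumption made here because the Hamming distance is only defined for equal lengths.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Removes the last 10-character block of the form [xxxxxxxx], where the
// eight inner characters are alphanumeric. Returns the name unchanged if
// no such block exists.
std::string strip_checksum(const std::string& name) {
    const std::size_t kLen = 10;
    if (name.size() < kLen) return name;
    for (std::size_t start = name.size() - kLen + 1; start-- > 0;) {
        if (name[start] != '[' || name[start + kLen - 1] != ']') continue;
        bool all_alnum = true;
        for (std::size_t k = 1; k + 1 < kLen; ++k) {
            if (!std::isalnum(static_cast<unsigned char>(name[start + k]))) {
                all_alnum = false;
                break;
            }
        }
        if (all_alnum) {
            std::string result = name;
            result.erase(start, kLen);
            return result;
        }
    }
    return name;
}

// True if `a` and `b` have equal length and differ in at most
// `max_distance` positions.
bool within_hamming_distance(const std::string& a, const std::string& b,
                             std::size_t max_distance) {
    if (a.size() != b.size()) return false;
    std::size_t differences = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i] && ++differences > max_distance) return false;
    }
    return true;
}
```

Two filenames would then be grouped when `within_hamming_distance(strip_checksum(x), strip_checksum(y), 4)` holds.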
Good luck!
Your question is not easy to understand, but I think what you ask is impossible to do in a satisfying way given any group of strings. Take these strings for instance:
[1].[2].[3].[4].[5]
[a].[2].[3].[4].[5]
[a].[b].[3].[4].[5]
[a].[b].[c].[4].[5]
[a].[b].[c].[d].[5]
[a].[b].[c].[d].[e]
Each is close to those listed next to it, so they should all group with their neighbours, but the first and the last are completely different, so it would not make sense to group those together. Given a more "grouping" dataset you might get pretty good results with a method like the one PierrOz describes, but there is no guarantee for meaningful results.
May I enquire what the purpose is? It would allow us all to better understand what errors might be tolerated, or perhaps even come up with a different approach to solving the problem.
Edit: I wonder, would it be OK if one string ends up in multiple different groups? That could make the problem a lot simpler, and more reliably give you useful information, but you would end up with a bigger grouping tree with the same node copied to different branches.
I'd be tempted to tackle this with cluster analysis techniques. Hit Wikipedia for an introduction. And the other answers probably fall within the domain of cluster analysis, but you might find some other useful approaches by reading a bit more widely.