How to get the most frequent element of a list? - c++

Let's say I have a list:
std::list<std::string> list {"the", "the", "friend", "hello", "the"};
In this case, the most common element in the list is "the". Is there a way to get this element in C++?
Thanks!

A general way to solve your problem is to build a dictionary of word frequencies. Here is a pseudocode algorithm that does exactly that:
let L be the input sequence of strings (can be a list, doesn't matter)
let F be an empty dictionary that maps string to a number
for each string S in L
    if not F contains S then
        F[S] = 0
    F[S] += 1
Once the dictionary is constructed, all you need to do is to find the mapping with the highest value, and return the key.
The C++ standard library provides associative containers (aka dictionaries, aka maps), and an algorithm for searching for the greatest element within a container.
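As a rough illustration of the pseudocode above, here is a minimal sketch written in Python 3 rather than C++, purely for brevity. The function name most_frequent is made up for this example. In C++ the dictionary role would be played by std::map or std::unordered_map and the final search by std::max_element.

```python
def most_frequent(items):
    """Return the item that occurs most often in `items` (hypothetical helper).

    Assumes `items` is non-empty; max() raises ValueError on an empty dictionary.
    """
    freq = {}                 # F: maps each string to its count
    for s in items:           # for each string S in L
        if s not in freq:
            freq[s] = 0
        freq[s] += 1
    # Find the mapping with the highest value and return its key.
    return max(freq, key=freq.get)


words = ["the", "the", "friend", "hello", "the"]
print(most_frequent(words))  # prints: the
```

If several elements are tied for the highest count, this sketch returns whichever of them was seen first.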

Related

Python - iterate over two lists and find matches and position of mis-matches

I am working in Python 2.7
I am trying to iterate over 2 lists of unequal length, and I want to create a new list containing the matching elements (same elements in the same position). When the elements do not match, I need to have some text as well as the position of the mismatching elements.
list1=[1,2,3,4]
list2=[1,2,3,5,6]
This outputs the matches
match=[b for a, b in zip(list1, list2) if a==b]
result:
[1,2,3]
But I do not know, in a one-liner, how to also flag the mis-matches:
[1,2,3,"nomatch-pos4"]
or
[1,2,3,"nomatch-pos4","nomatch-pos5"]
It does not matter if it will iterate over the maximum or minimum of the 2 list lengths.
First find the length of the shorter of the two lists, then iterate over that range and check whether the elements at the same position in both lists match. See the code below:
match = [list1[i] if list1[i] == list2[i] else 'nomatch-pos'+str(i+1) for i in range(0,min(len(list1),len(list2)))]
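If the positions past the end of the shorter list should be flagged as well (the second output in the question), one possible sketch uses itertools.zip_longest. This is written for Python 3; on Python 2.7 the equivalent function is itertools.izip_longest. The sentinel name _MISSING is an assumption made for this example.

```python
from itertools import zip_longest

list1 = [1, 2, 3, 4]
list2 = [1, 2, 3, 5, 6]

# Sentinel used as padding, so a real None inside a list is not mistaken for "missing".
_MISSING = object()

result = [
    a if a == b else "nomatch-pos" + str(i)
    for i, (a, b) in enumerate(zip_longest(list1, list2, fillvalue=_MISSING), start=1)
]
print(result)  # [1, 2, 3, 'nomatch-pos4', 'nomatch-pos5']
```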

How to get list elements by matching string using scala?

I have following list-
List((name1,233,33),(name2,333,22),(name3,444,55),())
I have another string which I want to match with list and get matched elements from list.
There will be only one element in list that matches to given string.
The list may contain some empty elements, like the last element in the list above.
Suppose I am matching the string 'name2', which will occur only once in the list; then
My expected output is -
List(name2,333,22)
How do I find the matching list element using Scala?
.find(_._1 == name2)
will be better
Consider collect over the tuples list, for instance like this,
val a = List(("name1",233,33),("name2",333,22),("name3",444,55),())
Then
a collect {
case v @ ("name2",_,_) => v
}
If you want only the first occurrence, use collectFirst. This partial function ignores tuples that do not include 3 items.

Using regex to access values from a map in keys

val m = Map("a"->2,"ab"->3,"c"->4)
scala> m.get("a");
scala> println(res.get)
2
scala> m.get(/a\.*/)
// or something similar.
Can I get a list of all key-value pairs where the key contains "a" without having to iterate over the entire map, by doing something as simple as specifying a regex in the key value?
Thanks in advance!
No, you cannot do that without iterating over the entire map. In fact, I can't even think of a single data structure that would allow it, to say nothing of the API.
Of course, iterating is pretty simple:
m.filterKeys(_ matches "a.*")

Which string-finding algorithm is appropriate for this?

I have a big string, say "aaaaaaaaaaabbbbbbbbbcccccccccccddddddddddd" (but maybe longer), and I have a collection of lots of little strings. I want to count (overlap is OK) how many times the little strings are found in the big string. I care only about speed. KMP seemed good, but it looked like Rabin-Karp dealt with multiple patterns but was slow.
The problem with most string searching algorithms is that they will take at least time O(k) to return k matches, so if you have a string with say 1 million "a"s, and 1 million queries of the little string "a", then it will take around 1 million, million iterations to count all the matches!
An alternative linear time approach would be to:
1) Construct a suffix tree of the big string: O(n) where n is len(big string)
2) Precompute the number of suffixes below each node in the suffix tree: O(n)
3) For each small string, find the node in the suffix tree reached by following the small string from the root: O(m) where m is len(small string)
4) Add to the total count the number of suffixes below that node. (Each suffix corresponds to a different match of the small string in the big string)
This will take time O(n+p) where n is the length of the big string, and p is the total length of all the small strings.
Example Code
As requested, here is some small(ish) example code in Python that uses this approach:
from collections import defaultdict

class SuffixTree:
    def __init__(self):
        """Returns an empty suffix tree"""
        self.T=''
        self.E={}
        self.nodes=[-1] # 0th node is empty string

    def add(self,s):
        """Adds the input string to the suffix tree.

        This inserts all substrings into the tree.
        End the string with a unique character if you want a leaf-node for every suffix.

        Produces an edge graph keyed by (node,character) that gives (first,last,end)
        This means that the edge has characters from T[first:last+1] and goes to node end."""
        origin,first,last = 0,len(self.T),len(self.T)-1
        self.T+=s
        nc = len(self.nodes)
        self.nodes += [-1]*(2*len(s))
        T=self.T
        E=self.E
        nodes=self.nodes

        Lm1=len(T)-1
        for last_char_index in xrange(first,len(T)):
            c=T[last_char_index]
            last_parent_node = -1
            while 1:
                parent_node = origin
                if first>last:
                    if (origin,c) in E:
                        break
                else:
                    key = origin,T[first]
                    edge_first, edge_last, edge_end = E[key]
                    span = last - first
                    A = edge_first+span
                    m = T[A+1]
                    if m==c:
                        break
                    E[key] = (edge_first, A, nc)
                    nodes[nc] = origin
                    E[nc,m] = (A+1,edge_last,edge_end)
                    parent_node = nc
                    nc+=1
                E[parent_node,c] = (last_char_index, Lm1, nc)
                nc+=1
                if last_parent_node>0:
                    nodes[last_parent_node] = parent_node
                last_parent_node = parent_node
                if origin==0:
                    first+=1
                else:
                    origin = nodes[origin]

                if first <= last:
                    edge_first,edge_last,edge_end=E[origin,T[first]]
                    span = edge_last-edge_first
                    while span <= last - first:
                        first+=span+1
                        origin = edge_end
                        if first <= last:
                            edge_first,edge_last,edge_end = E[origin,T[first]]
                            span = edge_last - edge_first

            if last_parent_node>0:
                nodes[last_parent_node] = parent_node
            last+=1
            if first <= last:
                edge_first,edge_last,edge_end=E[origin,T[first]]
                span = edge_last-edge_first
                while span <= last - first:
                    first+=span+1
                    origin = edge_end
                    if first <= last:
                        edge_first,edge_last,edge_end = E[origin,T[first]]
                        span = edge_last - edge_first
        return self

    def make_choices(self):
        """Construct a sorted list for each node of the possible continuing characters"""
        choices = [list() for n in xrange(len(self.nodes))] # Contains set of choices for each node
        for (origin,c),edge in self.E.items():
            choices[origin].append(c)
        choices=[sorted(s) for s in choices] # should not have any repeats by construction
        self.choices=choices
        return choices

    def count_suffixes(self,term):
        """Recurses through the tree finding how many suffixes are based at each node.
        Strings assumed to use term as the terminating character"""
        C = self.suffix_counts = [0]*len(self.nodes)
        choices = self.make_choices()
        def f(node=0):
            t=0
            X=choices[node]
            if len(X)==0:
                t+=1 # this node is a leaf node
            else:
                for c in X:
                    if c==term:
                        t+=1
                        continue
                    first,last,end = self.E[node,c]
                    t+=f(end)
            C[node]=t
            return t
        return f()

    def count_matches(self,needle):
        """Return the count of matches for this needle in the suffix tree"""
        i=0
        node=0
        E=self.E
        T=self.T
        while i<len(needle):
            c=needle[i]
            key=node,c
            if key not in E:
                return 0
            first,last,node = E[key]
            while i<len(needle) and first<=last:
                if needle[i]!=T[first]:
                    return 0
                i+=1
                first+=1
        return self.suffix_counts[node]


big="aaaaaaaaaaabbbbbbbbbcccccccccccddddddddddd"
small_strings=["a","ab","abc"]
s=SuffixTree()
term=chr(0)
s.add(big+term)
s.count_suffixes(term)
for needle in small_strings:
    x=s.count_matches(needle)
    print needle,'has',x,'matches'
It prints:
a has 11 matches
ab has 1 matches
abc has 0 matches
However, in practice I would recommend you simply use a pre-existing Aho-Corasick implementation as I would expect this to be much faster in your particular case.
Matching against a large collection of strings sounds like http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm to me. It does find matches one at a time, so Peter de Rivaz's idea might be better if there are a huge number of matches. On the other hand, Aho-Corasick doesn't need to keep the big string in memory (you can just stream it through) and is very practical to implement and tune: the Wikipedia link notes that the original fgrep used it.
Thinking about it, you can work round the mega-match problem. Aho-Corasick can be viewed as creating a deterministic finite state machine just capable of recognizing each of the strings it is searching for. The state of the machine corresponds to the last N characters seen. If you wish to match two strings and one is a suffix of the other you need to be careful that when you are in the state that says you have just matched the longer string that you also recognize that this means that you have just matched the shorter string. If you deliberately choose not to do this, then the counts you accumulate for the shorter string will be incorrect - but you know that they are too low by the number of times the longer string was seen. So if you modify Aho-Corasick so that only the longest match at each point is recognized and counted, then the cost of matching remains linear in the number of characters in the string you are searching, and you can fix up the counts at the end by going through the long strings and then incrementing counts for the shorter strings which are suffixes of the long strings. This will take time at most linear in the total size of the strings being searched for.
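The following is a rough Python 3 sketch of that idea, not a tuned implementation. During the scan it records only the deepest automaton state reached at each position, then fixes up the counts at the end. The fix-up here pushes counts along the failure links from deeper states to shallower ones, which is one way (an assumption of this sketch) to credit every shorter string that is a suffix of a longer match. The name count_all and the list-based trie layout are made up for illustration, and the patterns are assumed to be non-empty.

```python
from collections import deque


def count_all(patterns, text):
    """Count overlapping occurrences of each (non-empty) pattern in text."""
    # Trie: goto[state] maps a character to the next state; state 0 is the root.
    goto, fail, hits = [{}], [0], [0]
    end_state = []  # end_state[k] is the trie state that spells patterns[k]
    for p in patterns:
        state = 0
        for ch in p:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                hits.append(0)
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        end_state.append(state)

    # Breadth-first pass: fail[s] is the longest proper suffix of s that is in the trie.
    order = []
    queue = deque(goto[0].values())  # depth-1 states keep fail = 0
    while queue:
        state = queue.popleft()
        order.append(state)
        for ch, nxt in goto[state].items():
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            queue.append(nxt)

    # Scan: one counter increment per character, at the deepest matching state only.
    state = 0
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        hits[state] += 1

    # Fix-up: deepest states first, add each count to the state's longest suffix.
    for state in reversed(order):
        hits[fail[state]] += hits[state]
    return [hits[s] for s in end_state]


big = "aaaaaaaaaaabbbbbbbbbcccccccccccddddddddddd"
print(count_all(["a", "ab", "abc"], big))  # [11, 1, 0]
```

The scan does a constant amount of counting work per character regardless of how many patterns end there, and the fix-up visits each trie state once, so the cost of a text with a huge number of matches stays linear in its length.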
Building on another answer (and hopefully convincing you that this is the best type of answer): look up http://en.wikipedia.org/wiki/Suffix_tree and go through the references listed there to learn about suffix trees if you really want the fastest solution for your problem. A suffix tree also makes it possible to get the number of matches without iterating over all the matches, and the running times and memory requirements you get are the best possible for any substring matching or match counting algorithm. Once you understand how the suffix tree works and how to build and use it, the only additional tweak you need is to store, at each internal node of the tree, the number of distinct string starting positions that it represents. This is a minor modification that you can do efficiently (linear time, as already claimed) by recursively getting the counts from the child nodes and adding them up to get the count at the current node. These counts then allow you to count substring matches without iterating over all of them.
If I understood correctly, your input string consists of many one-character blocks.
In this case, you can compress your text using run-length encoding.
For example:
s = aaabbbbcc
is encoded as:
encoded_s = (a3)(b4)(c2)
Now you may search for patterns in encoded text.
If you want a concrete algorithm, just search the web for pattern matching in run-length encoded strings. You can achieve time complexity O(N + M), where N and M are the lengths of the compressed text and the compressed pattern. Both M and N are in general much smaller than the original lengths, so it beats any standard pattern matching algorithm, e.g. KMP.
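For illustration, the encoding step alone (not the pattern matching on the encoded form) could be sketched in Python 3 like this. The function name rle_encode is made up for this example, and the (character, count) pair representation is just one possible choice.

```python
from itertools import groupby


def rle_encode(s):
    """Return the run-length encoding of s as a list of (character, run_length) pairs."""
    # groupby yields one group per maximal run of equal consecutive characters.
    return [(ch, len(list(run))) for ch, run in groupby(s)]


print(rle_encode("aaabbbbcc"))  # [('a', 3), ('b', 4), ('c', 2)]
```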
1) I'd go with finite automata. Can't think of a specialized library right now, but the general-purpose PCRE can be used to construct an automaton that efficiently searches for the given substrings. For the substrings "foo" and "bar" one can construct a pattern /(foo)|(bar)/, scan a string and get the "id" number of the matched substring by iterating over the ovector and checking which group matched.
RE2::FindAndConsume is good if you only need the total count, not grouping by substring.
P.S. Example using Boost.Xpressive and loading the strings from a map: http://ericniebler.com/2010/09/27/boost-xpressive-ftw/
P.P.S. Recently I've had a good time creating a Ragel machine for a similar task. For a small set of searched strings a "normal" DFA would work, but if you have a larger ruleset then using Ragel scanners shows good results (here is a related answer).
P.P.P.S. PCRE has the MARK keyword which is super useful for that kind of subpattern classification (cf).
2) Quite some time ago I wrote a Trie-based thingie in Scala for that kind of load: https://gist.github.com/ArtemGr/6150594; Trie.search goes over the string trying to match the current position to a number encoded in the Trie. The trie is encoded in a single cache-friendly array, I expect it to be as good as non-JIT DFAs.
3) I've been using boost::spirit for substring matching, but never got to measuring how it fares against other approaches. Spirit uses some efficient structure for the symbols matching, perhaps the structure can be used on its own without the overhead of Spirit.
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
using qi::lit; using qi::alnum; using qi::digit; using qi::_val; using qi::_1; using boost::phoenix::ref;
static struct exact_t: qi::symbols<char, int> {
exact_t() {add
("foo", 1)
("bar", 2);}
} exact;
int32_t match = -1;
qi::rule<const char*, int()> rule =
+alnum >> exact [ref (match) = _1];
const char* it = haystack; // Mutable iterator for Spirit.
qi::parse (it, haystack + haystackLen, rule);

SML pair tuples conversion

I've been trying to solve this pair tuples problem where the input is a list of tuples and the output is a tuple of lists where the first element of each tuple is grouped together and similarly with the second (i.e. [(1,2),(3,4),(5,6)] --> ([1,3,5],[2,4,6])).
I've thought of this code but it gives me an error:
fun convert L = foldl (fn ((x,y),(u,v)) => ((u@x),(v@y)) ([],[]) L;
Any suggestions for a fix?
Concatenation (@) takes two lists, but x and y are values, so you need to wrap them with [] to make a single-element list:
fun convert l=foldl (fn((x,y),(u,v))=>(u@[x],v@[y])) (nil,nil) l
You can use cons instead of concatenation, though the lists inside the returned tuple are reversed:
fun convert l=foldl (fn((x,y),(u,v))=>(x::u,y::v)) (nil,nil) l
@ concatenates lists (and x and y are not lists).
Try (u@[x],v@[y]).
Note, however, that appending is a linear-time operation, while prepending (i.e. x::u) is constant. As Alex pointed out, this will build your lists in reverse, but you can resolve this by processing your input in reverse as well - i.e., by using foldr instead of foldl.