Count number of characters in Haskell MapReduce code

In Haskell code from Real World Haskell, Chapter 24, an example of using MapReduce to count the number of LINES in a file is implemented as follows:
import qualified Data.ByteString.Lazy.Char8 as LB

lineCount :: [LB.ByteString] -> Int64
lineCount = mapReduce rdeepseq (LB.count '\n')
                      rdeepseq sum
It's clear to me that this is counting the number of newline characters. If I wanted to count the number of a's, I do:
import qualified Data.ByteString.Lazy.Char8 as LB

lineCount :: [LB.ByteString] -> Int64
lineCount = mapReduce rdeepseq (LB.count 'a')
                      rdeepseq sum
I've tried this, and it works. How do I modify this code to count the number of characters (i.e., the total number of characters present)? Is there some sort of regular expression framework I can use?

It's clear to me that this is counting the number of newline characters.
Well, not really. A ByteString is a string of bytes. (If you want a string of characters, you should use Text from either Data.Text or Data.Text.Lazy, in the text package.)
Data.ByteString.Lazy.Char8 exports an interface that lets you pretend you're working with characters, but it assumes one character = one byte, à la ISO-8859-1 or ASCII. Unicode it ain't.
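The byte-versus-character distinction bites in every language, not just Haskell. As a quick illustration, here is a Python sketch of the same pitfall (purely an analogy, nothing Haskell-specific): byte counts and character counts diverge as soon as the text leaves ASCII.
s = u'à la carte'       # 10 characters
b = s.encode('utf-8')   # 'à' occupies two bytes in UTF-8
print(len(s))           # 10 - counting characters
print(len(b))           # 11 - counting bytes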
How do I modify this code to count the number of characters (i.e., the total number of characters present)?
LB.count :: Char -> ByteString -> Int64, so we're looking for a function of type ByteString -> Int64. That function is LB.length.
lineCount = mapReduce rdeepseq LB.length
                      rdeepseq sum
Is there some sort of regular expression framework I can use?
It's so easy to use full-blown parsers in Haskell that we (well, I at least) use parsers instead of regular expressions. If your data is in the form of a ByteString (or a Text, for that matter), I'd recommend using attoparsec.

Related

Remove all emojis from a string in Haskell

I made a Mastodon / Twitter <--> IRC bot a while back. It's been working great, but someone complained that when people use emojis on Mastodon (which seems to happen a lot in some usernames...) it breaks his terminal.
I was wondering if there is a way to remove those from the ByteStrings before sending them to IRC (or at least provide an option to do so). Googling a bit, I found this: removing emojis from a string in Python.
Looks like \U0001F600-\U0001F64F should be the emoji range if I understand it correctly, but I've never been big on regex. Any easy-ish way to translate that to Haskell? I've tried reading up a bit on regex, but I only get "lexical error in string/character literal at character 'U'" when I try; I assume that syntax must be a Python thing.
Thanks
Unicode characters are written as a single backslash, followed by x for a hexadecimal, o for an octal, or no prefix for a decimal number representing the character [0]:
putStrLn "\x1f600" -- 😀
Here, \x is a prefix for the hexadecimal representation of the first emoji character in Unicode.
You can now remove the emojis using RegExp or you could simply do:
emojis = concat [['\x1f600'..'\x1F64F'],
                 ['\x1f300'..'\x1f5ff'],
                 ['\x1f680'..'\x1f6ff'],
                 ['\x1f1e0'..'\x1f1ff']]
someString = "hello 🙋"
removeEmojis = filter (`notElem` emojis)
putStrLn . removeEmojis $ someString -- "hello "
[0] Haskell Language 2010: Lexical Structure#Character and String Literals
Not an emoji or Unicode expert, but this seems to work:
isEmoji :: Char -> Bool
isEmoji c = let uc = fromEnum c
            in uc >= 0x1F600 && uc <= 0x1F64F
str = "😁wew😁"
As Daniel Wagner points out, this can be made even better:
isEmoji :: Char -> Bool
isEmoji c = c >= '\x1F600' && c <= '\x1F64F'
Demo in ghci:
λ> str
"\128513wew\128513"
λ> filter isEmoji str
"\128513\128513"
λ> filter (not . isEmoji) str
"wew"
Explanation: the fromEnum function converts the character to the corresponding Int value defined by Unicode. I just check whether that value falls within the Unicode range for emoji to determine if it's actually an emoji.

Reg exp in matlab

I'm analyzing a file in MATLAB and I want to find the number of occurrences of the letter I (capitalized). I'm confused about how to write the regular expression for this step. Would it be something like (lines,'.I.')? Any help would be greatly appreciated.
If you want to count the number of capital 'I's in a file, assuming you have read the file in as a string, you could just do this:
count = sum(file_string == 'I');
If, as in this case, the file is read into a cell-string, one possible way of doing this would be to use:
count = sum(strcat(file_cellstr{:}) == 'I');
strcat will concatenate all of the strings passed to it into a single string. Passing file_cellstr{:} to strcat essentially concatenates each of the cells (i.e. each line, in your case) into a single string, which you can then search for the letter 'I'. If you wanted to find a whole word, you could use
count = length(strfind(strcat(file_cellstr{:}),'word'));
If you wanted a regular expression match, you could do the following:
count = length(regexp(strcat(file_cellstr{:}),'[a-z]+'));

How can I parse a char array with octal values in Python?

EDIT: I should note that I want a general case for any hex array, not just the google one I provided.
EDIT BACKGROUND: Background is networking: I'm parsing a DNS packet and trying to get its QNAME. I'm taking in the whole packet as a string, and every character represents a byte. Apparently this problem looks like a Pascal string problem, and using the struct module seems like the way to go.
I have a char array in Python 2.7 which includes octal values. For example, let's say I have an array
DNS = "\03www\06google\03com\0"
I want to get:
www.google.com
What's an efficient way to do this? My first thought would be iterating through the DNS char array and adding chars to my new array answer. Every time I see a '\' char, I would ignore the '\' and the two chars after it. Is there a way to get the resulting www.google.com without using a new array?
My disgusting implementation (my answer is an array of chars, which is not what I want; I want just the string www.google.com):
DNS = "\\03www\\06google\\03com\\0"
answer = []
i = 0
while i < len(DNS):
    if DNS[i] == '\\' and DNS[i+1] != 0:
        i += 3
    elif DNS[i] == '\\' and DNS[i+1] == 0:
        break
    else:
        answer.append(DNS[i])
        i += 1
Now that you've explained your real problem, none of the answers you've gotten so far will work. Why? Because they're all ways to remove sequences like \03 from a string. But you don't have sequences like \03, you have single control characters.
You could, of course, do something similar, just replacing any control character with a dot.
But what you're really trying to do is not replace control characters with dots, but parse DNS packets.
DNS is defined by RFC 1035. The QNAME in a DNS packet is:
a domain name represented as a sequence of labels, where each label consists of a length octet followed by that number of octets. The domain name terminates with the zero length octet for the null label of the root. Note that this field may be an odd number of octets; no padding is used.
So, let's parse that. If you understand how labels consisting of "a length octet followed by that number of octets" relate to "Pascal strings", there's a quicker way. Also, you could write this more cleanly and less verbosely as a generator. But let's do it the dead-simple way:
import struct

def parse_qname(packet):
    components = []
    offset = 0
    while True:
        length, = struct.unpack_from('B', packet, offset)
        offset += 1
        if not length:
            break
        component, = struct.unpack_from('{}s'.format(length), packet, offset)
        offset += length
        components.append(component)
    return components, offset
import re

DNS = "\\03www\\06google\\03com\\0"
m = re.sub(r"\\[0-9a-f]{1,2}", "", DNS)
print(m)  # wwwgooglecom
Maybe something like this?
#!/usr/bin/python3
import re

def convert(adorned_hostname):
    result1 = re.sub(r'^\\03', '', adorned_hostname)
    result2 = re.sub(r'\\0[36]', '.', result1)
    result3 = re.sub(r'\\0$', '', result2)
    return result3

def main():
    adorned_hostname = r"\03www\06google\03com\0"
    expected_result = 'www.google.com'
    actual_result = convert(adorned_hostname)
    print(actual_result, expected_result)
    assert actual_result == expected_result

main()
For the question as originally asked, replacing the backslash-hex sequences in strings like "\\03www\\06google\\03com\\0" with dots…
If you want to do this with a regular expression:
\\ matches a backslash.
[0-9A-Fa-f] matches any hex digit.
[0-9A-Fa-f]+ matches one or more hex digits.
\\[0-9A-Fa-f]+ matches a backslash followed by one or more hex digits.
You want to find each such sequence, and replace it with a dot, right? If you look through the re docs, you'll find a function called sub which is used for replacing a pattern with a replacement string:
re.sub(r'\\[0-9A-Fa-f]+', '.', DNS)
I suspect these may actually be octal, not hex, in which case you want [0-7] rather than [0-9A-Fa-f], but nothing else would change.
A different way to do this is to recognize that these are valid Python escape sequences. And, if we unescape them back to where they came from (e.g., with DNS.decode('string_escape')), this turns into a sequence of length-prefixed (aka "Pascal") strings, a standard format that you can parse in any number of ways, including the stdlib struct module. This has the advantage of validating the data as you read it, and not being thrown off by any false positives that could show up if one of the string components, say, had a backslash in the middle of it.
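For example, here is a sketch of that approach, reusing the parse_qname function from above (Python 2 only, since the string_escape codec is gone in Python 3):
raw = "\\03www\\06google\\03com\\0"
packet = raw.decode('string_escape')   # now "\x03www\x06google\x03com\x00"
components, offset = parse_qname(packet)
print '.'.join(components)             # www.google.com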
Of course that's presuming more about the data. It seems likely that the real meaning of this is "a sequence of length-prefixed strings, concatenated, then backslash-escaped", in which case you should parse it as such. But it could be just a coincidence that it looks like that, in which case it would be a very bad idea to parse it as such.

Which string-finding algorithm is appropriate for this?

I have a big string, say "aaaaaaaaaaabbbbbbbbbcccccccccccddddddddddd" (but maybe longer), and I have a collection of lots of little strings. I want to count (overlap is OK) how many times the little strings are found in the big string. I care only about speed. KMP seemed good, but it looked like Rabin-Karp dealt with multiple patterns but was slow.
The problem with most string searching algorithms is that they will take at least time O(k) to return k matches, so if you have a string with, say, 1 million "a"s, and 1 million queries of the little string "a", then it will take around a million million iterations to count all the matches!
An alternative linear time approach would be to:
Construct a suffix tree of the big string: O(n) where n is len(big string)
Precompute the number of suffixes below each node in the suffix tree: O(n)
For each small string, find the node in the suffix tree reached by following the small string's characters from the root: O(m) where m is len(small string)
Add to the total count the number of suffixes below that node. (Each suffix corresponds to a different match of the small string in the big string)
This will take time O(n+p) where n is the length of the big string, and p is the total length of all the small strings.
Example Code
As requested, here is some small(ish) example code in Python that uses this approach:
from collections import defaultdict

class SuffixTree:
    def __init__(self):
        """Returns an empty suffix tree"""
        self.T=''
        self.E={}
        self.nodes=[-1] # 0th node is empty string

    def add(self,s):
        """Adds the input string to the suffix tree.
        This inserts all substrings into the tree.
        End the string with a unique character if you want a leaf-node for every suffix.
        Produces an edge graph keyed by (node,character) that gives (first,last,end)
        This means that the edge has characters from T[first:last+1] and goes to node end."""
        origin,first,last = 0,len(self.T),len(self.T)-1
        self.T+=s
        nc = len(self.nodes)
        self.nodes += [-1]*(2*len(s))
        T=self.T
        E=self.E
        nodes=self.nodes
        Lm1=len(T)-1
        for last_char_index in xrange(first,len(T)):
            c=T[last_char_index]
            last_parent_node = -1
            while 1:
                parent_node = origin
                if first>last:
                    if (origin,c) in E:
                        break
                else:
                    key = origin,T[first]
                    edge_first, edge_last, edge_end = E[key]
                    span = last - first
                    A = edge_first+span
                    m = T[A+1]
                    if m==c:
                        break
                    E[key] = (edge_first, A, nc)
                    nodes[nc] = origin
                    E[nc,m] = (A+1,edge_last,edge_end)
                    parent_node = nc
                    nc+=1
                E[parent_node,c] = (last_char_index, Lm1, nc)
                nc+=1
                if last_parent_node>0:
                    nodes[last_parent_node] = parent_node
                last_parent_node = parent_node
                if origin==0:
                    first+=1
                else:
                    origin = nodes[origin]
                if first <= last:
                    edge_first,edge_last,edge_end=E[origin,T[first]]
                    span = edge_last-edge_first
                    while span <= last - first:
                        first+=span+1
                        origin = edge_end
                        if first <= last:
                            edge_first,edge_last,edge_end = E[origin,T[first]]
                            span = edge_last - edge_first
            if last_parent_node>0:
                nodes[last_parent_node] = parent_node
            last+=1
            if first <= last:
                edge_first,edge_last,edge_end=E[origin,T[first]]
                span = edge_last-edge_first
                while span <= last - first:
                    first+=span+1
                    origin = edge_end
                    if first <= last:
                        edge_first,edge_last,edge_end = E[origin,T[first]]
                        span = edge_last - edge_first
        return self

    def make_choices(self):
        """Construct a sorted list for each node of the possible continuing characters"""
        choices = [list() for n in xrange(len(self.nodes))] # Contains set of choices for each node
        for (origin,c),edge in self.E.items():
            choices[origin].append(c)
        choices=[sorted(s) for s in choices] # should not have any repeats by construction
        self.choices=choices
        return choices

    def count_suffixes(self,term):
        """Recurses through the tree finding how many suffixes are based at each node.
        Strings assumed to use term as the terminating character"""
        C = self.suffix_counts = [0]*len(self.nodes)
        choices = self.make_choices()
        def f(node=0):
            t=0
            X=choices[node]
            if len(X)==0:
                t+=1 # this node is a leaf node
            else:
                for c in X:
                    if c==term:
                        t+=1
                        continue
                    first,last,end = self.E[node,c]
                    t+=f(end)
            C[node]=t
            return t
        return f()

    def count_matches(self,needle):
        """Return the count of matches for this needle in the suffix tree"""
        i=0
        node=0
        E=self.E
        T=self.T
        while i<len(needle):
            c=needle[i]
            key=node,c
            if key not in E:
                return 0
            first,last,node = E[key]
            while i<len(needle) and first<=last:
                if needle[i]!=T[first]:
                    return 0
                i+=1
                first+=1
        return self.suffix_counts[node]

big="aaaaaaaaaaabbbbbbbbbcccccccccccddddddddddd"
small_strings=["a","ab","abc"]
s=SuffixTree()
term=chr(0)
s.add(big+term)
s.count_suffixes(term)
for needle in small_strings:
    x=s.count_matches(needle)
    print needle,'has',x,'matches'
It prints:
a has 11 matches
ab has 1 matches
abc has 0 matches
However, in practice I would recommend you simply use a pre-existing Aho-Corasick implementation, as I would expect this to be much faster in your particular case.
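For instance, with the third-party pyahocorasick package (an assumption on my part - any Aho-Corasick library that reports overlapping matches would do), the whole counting job is a few lines:
import ahocorasick  # third-party: pip install pyahocorasick
from collections import Counter

A = ahocorasick.Automaton()
for needle in ["a", "ab", "abc"]:
    A.add_word(needle, needle)   # store the needle itself as the payload
A.make_automaton()

counts = Counter()
for end_index, needle in A.iter("aaaaaaaaaaabbbbbbbbbcccccccccccddddddddddd"):
    counts[needle] += 1          # iter() reports overlapping matches as well
print(counts)                    # Counter({'a': 11, 'ab': 1})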
Matching against a large collection of strings sounds like http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm to me. It does find matches one at a time, so Peter de Rivaz's idea might be better if there are a huge number of matches. On the other hand, Aho-Corasick doesn't need to keep the big string in memory - you can just stream it through - and it is very practical to implement and tune - the Wikipedia link notes that the original fgrep used it.
Thinking about it, you can work around the mega-match problem. Aho-Corasick can be viewed as creating a deterministic finite state machine just capable of recognizing each of the strings it is searching for; the state of the machine corresponds to the last N characters seen. If one string is a suffix of another, you need to be careful that when you are in the state that says you have just matched the longer string, you also recognize that this means you have just matched the shorter string. If you deliberately choose not to do this, then the counts you accumulate for the shorter string will be too low - but you know by exactly how much: the number of times the longer string was seen. So if you modify Aho-Corasick so that only the longest match at each point is recognized and counted, the cost of matching remains linear in the number of characters in the string you are searching, and you can fix up the counts at the end by going through the long strings and incrementing the counts for the shorter strings which are suffixes of the long strings. This will take time at most linear in the total size of the strings being searched for.
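A sketch of just that final fix-up pass (a hypothetical helper, not part of any library; raw_counts is assumed to hold the longest-match-only totals from the modified Aho-Corasick scan):
def fix_up(patterns, raw_counts):
    # Process longest patterns first and credit each pattern's total to the
    # longest proper suffix that is itself a pattern (its "suffix link"),
    # so every occurrence propagates down the whole chain exactly once.
    fixed = dict(raw_counts)
    for p in sorted(patterns, key=len, reverse=True):
        suffixes = [q for q in patterns if q != p and p.endswith(q)]
        if suffixes:
            fixed[max(suffixes, key=len)] += fixed[p]
    return fixed

# "cba" was seen 3 times; each of those occurrences also ends "ba" and "a".
print(fix_up(["a", "ba", "cba"], {"a": 2, "ba": 1, "cba": 3}))
# {'a': 6, 'ba': 4, 'cba': 3} (key order may vary)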
Building on another answer (and hopefully convincing you this is the best type of answer), you can look up http://en.wikipedia.org/wiki/Suffix_tree and also go through the references listed there to learn about suffix trees if you really want the fastest solution for your problem. Suffix trees also make it possible to get the number of matches without iterating over all the matches, and the running times and memory requirements you get are the absolute best possible for any substring matching or match counting algorithm. Once you understand how the suffix tree works and how to build and use it, the only additional tweak you need is to store the number of distinct string starting positions represented at each internal node of the tree. This is a minor modification that you can do efficiently (in linear time, as already claimed) by recursively getting the counts from the child nodes and adding them up to get the count at the current node. These counts then allow you to count substring matches without iterating over all of them.
If I understood correctly, your input string consists of long blocks of a single repeated character.
In this case, you can compress your text using run-length encoding.
For example:
s = aaabbbbcc
is encoded as:
encoded_s = (a3)(b4)(c2)
Now you may search for patterns in encoded text.
If you want a concrete algorithm, just search the web for pattern matching in run-length encoded strings. You can achieve time complexity O(N + M), where N and M are the lengths of the compressed text and the compressed pattern. Both M and N are in general much smaller than the original lengths, so this beats any standard pattern matching algorithm, e.g. KMP.
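A minimal sketch of the encoding step itself (the compressed-domain matching algorithms are more involved; this only shows the representation):
from itertools import groupby

def run_length_encode(s):
    # Collapse each maximal run of equal characters into a (char, length) pair.
    return [(c, len(list(g))) for c, g in groupby(s)]

print(run_length_encode("aaabbbbcc"))  # [('a', 3), ('b', 4), ('c', 2)]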
1) I'd go with finite automata. Can't think of a specialized library right now, but general-purpose PCRE can be used to construct an automaton that efficiently searches for the given substrings. For the substrings "foo" and "bar" one can construct a pattern /(foo)|(bar)/, scan a string, and get the "id" number of the matched substring by iterating over the ovector and checking which group matched.
RE2::FindAndConsume is good if you only need the total count, not grouping by substring.
P.S. Example using Boost.Xpressive and loading the strings from a map: http://ericniebler.com/2010/09/27/boost-xpressive-ftw/
P.P.S. Recently I've had a good time creating a Ragel machine for a similar task. For a small set of searched strings a "normal" DFA would work, but if you have a larger ruleset then using Ragel scanners shows good results (here is a related answer).
P.P.P.S. PCRE has the MARK keyword, which is super useful for that kind of subpattern classification.
2) Quite some time ago I wrote a Trie-based thingie in Scala for that kind of load: https://gist.github.com/ArtemGr/6150594; Trie.search goes over the string, trying to match the current position to a number encoded in the Trie. The trie is encoded in a single cache-friendly array; I expect it to be as good as non-JIT DFAs.
3) I've been using boost::spirit for substring matching, but never got around to measuring how it fares against other approaches. Spirit uses some efficient structure for its symbols matching; perhaps that structure could be used on its own, without the overhead of Spirit.
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
namespace qi = boost::spirit::qi;
using qi::lit; using qi::alnum; using qi::digit; using qi::_val; using qi::_1;
using boost::phoenix::ref;

static struct exact_t: qi::symbols<char, int> {
    exact_t() {add
        ("foo", 1)
        ("bar", 2);}
} exact;

int32_t match = -1;
qi::rule<const char*, int()> rule =
    +alnum >> exact [ref (match) = _1];
const char* it = haystack; // Mutable iterator for Spirit; haystack is the input buffer.
qi::parse (it, haystack + haystackLen, rule);

Simulating range(L,N) in erlang

Early in the morning playing with Erlang I got a curious result:
-module(bucle01).
-compile(export_all).

for(N) when N >= 0 ->
    lists:seq(1, N).

for(L, N) when L =< N ->
    lists:seq(L, N);
for(L, N) when L > N ->
    lists:reverse(for(N, L)).
When I run the program I see this:
> bucle01:for(1,10).
[1,2,3,4,5,6,7,8,9,10]
> bucle01:for(10,1).
[10,9,8,7,6,5,4,3,2,1]
> bucle01:for(7,10).
[7,8,9,10]
> bucle01:for(8,10).
"\b\t\n"    %% What's that!?!
> bucle01:for(10,8).
"\n\t\b"    %% After all it has some logic!
Any "Kool-Aid" to "Don't drink too much" please ?
Strings in Erlang are just lists of ASCII numbers. The Erlang shell tries to determine, without metadata, if your list is a list of numbers or a string by looking for printable characters.
\b (backspace), \t (tab) and \n (newline) are all somewhat common ASCII characters and therefore the shell shows you the string instead of the numbers. The internal structure of the list is exactly the same, however.
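The same trick is easy to reproduce elsewhere. Here is a quick Python analogy (just an illustration, not Erlang semantics): a list of small integers and the string they spell are related by chr.
codes = [8, 9, 10]                     # what lists:seq(8, 10) produces
text = ''.join(chr(n) for n in codes)  # view the same numbers as characters
print(repr(text))                      # '\x08\t\n' - backspace, tab, newline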
This is also covered by the Erlang FAQ:
Why do lists of numbers get printed incorrectly?
And here are a few ideas for preventing this magic: Can I disable printing lists of small integers as strings in Erlang shell?