Replace all occurrences based on map - c++

This question might have already been answered but I haven't found it yet.
Let's say I have a std::map<string,string> which contains string pairs of <replace_all_this, to_this>.
I checked Boost's format library, which is close but not perfect:
std::map<string,string> m;
m['$search1'] = 'replace1';
m['$search2'] = 'replace2';
format fmter1("let's try to %1% and %2%.");
fmter % 36; fmter % 77;
for(auto r : m) {
fmter % r.second;
}
// would print "let's try to replace1 and replace2
This would work, but I lose control of what.
Actually I'd like to have this as result:
format fmter1("let's try to $search2 and $search1 even if their order is different in the map.");
...
//print: "let's try to replace2 and replace1 even if their order is different in the map".
Please note: map can contain more items, and items can occur multiple times in the formatter.
What is the way to go for this in 2020, I'd like it to be effective and fast, so I'd avoid iterating over the map multiple times.

There may be new libraries but there is no new algorithm to do that faster than what we have so far.
Assuming your format implies a $<name> for your variables, you can search for the first '$', read the <name> search for that in the map, then do the replace. This allows you to either skip the replacement or process it too (i.e. make it recursive where a variable can reference another).
I don't think that doing it the other way around would be any faster: i.e. go through the map and search for the names in the string means you'd be parsing the strings many times and if you have many variables, it will be a huge waste if most are not likely part of your string. Also if you want to prevent some level of recursivity, it's very complicated.
Where you can eventually optimize is in calculating the size of the resulting string and allocate that buffer once instead of using += which is not unlikely going to be slower.
I have such an implementation in my snaplogger. The variable has to be between brackets and it can include multiple parameters to further tweak the data. There is documentation here about what is supported by that function and as written, you can easily extend the class to add more features. It's probably not a one to one match to what you're looking for, but it shows you that there is not 20 ways of implementing that function.

Related

Maxima: creating a function that acts on parts of a string

Context: I'm using Maxima on a platform that also uses KaTeX. For various reasons related to content management, this means that we are regularly using Maxima functions to generate the necessary KaTeX commands.
I'm currently trying to develop a group of functions that will facilitate generating different sets of strings corresponding to KaTeX commands for various symbols related to vectors.
Problem
I have written the following function makeKatexVector(x), which takes a string, list or list-of-lists and returns the same type of object, with each string wrapped in \vec{} (i.e. makeKatexVector(string) returns \vec{string} and makeKatexVector(["a","b"]) returns ["\vec{a}", "\vec{b}"] etc).
/* Flexible Make KaTeX Vector Version of List Items */
makeKatexVector(x):= block([ placeHolderList : x ],
if stringp(x) /* Special Handling if x is Just a String */
then placeHolderList : concat("\vec{", x, "}")
else if listp(x[1]) /* check to see if it is a list of lists */
then for j:1 thru length(x)
do placeHolderList[j] : makelist(concat("\vec{", k ,"}"), k, x[j] )
else if listp(x) /* check to see if it is just a list */
then placeHolderList : makelist(concat("\vec{", k, "}"), k, x)
else placeHolderList : "makeKatexVector error: not a list-of-lists, a list or a string",
return(placeHolderList));
Although I have my doubts about the efficiency or elegance of the above code, it seems to return the desired expressions; however, I would like to modify this function so that it can distinguish between single- and multi-character strings.
In particular, I'd like multi-character strings like x_1 to be returned as \vec{x}_1 and not \vec{x_1}.
In fact, I'd simply like to modify the above code so that \vec{} is wrapped around the first character of the string, regardless of how many characters there may be.
My Attempt
I was ready to tackle this with brute force (e.g. transcribing each character of a string into a list and then reassembling); however, the real programmer on the project suggested I look into "Regular Expressions". After exploring that endless rabbit hole, I found the command regex_subst; however, I can't find any Maxima documentation for it, and am struggling to reproduce the examples in the related documentation here.
Once I can work out the appropriate regex to use, I intend to implement this in the above code using an if statement, such as:
if slength(x) >1
then {regex command}
else {regular treatment}
If anyone knows of helpful resources on any of these fronts, I'd greatly appreciate any pointers at all.
Looks like you got the regex approach working, that's great. My advice about handling subscripted expressions in TeX, however, is to avoid working with names which contain underscores in Maxima, and instead work with Maxima expressions with indices, e.g. foo[k] instead of foo_k. While writing foo_k is a minor convenience in Maxima, you'll run into problems pretty quickly, and in order to straighten it out you might end up piling one complication on another.
E.g. Maxima doesn't know there's any relation between foo, foo_1, and foo_k -- those have no more in common than foo, abc, and xyz. What if there are 2 indices? foo_j_k will become something like foo_{j_k} by the preceding approach -- what if you want foo_{j, k} instead? (Incidentally the two are foo[j[k]] and foo[j, k] when represented by subscripts.) Another problematic expression is something like foo_bar_baz. Does that mean foo_bar[baz], foo[bar_baz] or foo_bar_baz?
The code for tex(x_y) yielding x_y in TeX is pretty old, so it's unlikely to go away, but over the years I've come to increasing feel like it should be avoided. However, the last time it came up and I proposed disabling that, there were enough people who supported it that we ended up keeping it.
Something that might be helpful, there is a function texput which allows you to specify how a symbol should appear in TeX output. For example:
(%i1) texput (v, "\\vec{v}");
(%o1) "\vec{v}"
(%i2) tex ([v, v[1], v[k], v[j[k]], v[j, k]]);
$$\left[ \vec{v} , \vec{v}_{1} , \vec{v}_{k} , \vec{v}_{j_{k}} ,
\vec{v}_{j,k} \right] $$
(%o2) false
texput can modify various aspects of TeX output; you can take a look at the documentation (see ? texput).
While I didn't expect that I'd work this out on my own, after several hours, I made some progress, so figured I'd share here, in case anyone else may benefit from the time I put in.
to load the regex in wxMaxima, at least on the MacOS version, simply type load("sregex");. I didn't have this loaded, and was trying to work through our custom platform, which cost me several hours.
take note that many of the arguments in the linked documentation by Dorai Sitaram occur in the reverse, or a different order than they do in their corresponding Maxima versions.
not all the "pregexp" functions exist in Maxima;
In addition to this, escaping special characters varied in important ways between wxMaxima, the inline Maxima compiler (running within Ace editor) and the actual rendered version on our platform; in particular, the inline compiler often returned false for expressions that compiled properly in wxMaxima and on the platform. Because I didn't have sregex loaded on wxMaxima from the beginning, I lost a lot of time to this.
Finally, the regex expression that achieved the desired substitution, in my case, was:
regex_subst("\vec{\\1}", "([[:alpha:]])", "v_1");
which returns vec{v}_1 in wxMaxima (N.B. none of my attempts to get wxMaxima to return \vec{v}_1 were successful; escaping the backslash just does not seem to work; fortunately, the usual escaped version \\vec{\\1} does return the desired form).
I have yet to adjust the code for the rest of the function, but I doubt that will be of use to anyone else, and wanted to be sure to post an update here, before anyone else took time to assist me.
Always interested in better methods / practices or any other pointers / feedback.

How to find special values in large file using C++ or C

I've some values I want to find in a large (> 500 MB) text file using C++ or C. I know that a possible matching value can only exist at the very beginning of each line and its length is exactly ten characters. Okay, I can read the whole file line by line searching the value with substr() or use regexp but that is a little bit ugly and very slow. I consider to use a embedded database (e.g. Berkeley DB) but the file I want to search in is very dynamic and I see a problem to bring it into the database every time. Due to a limit of memory it is not possible to load the whole file at once into memory. Many thanks in advance.
This doesn't seem well suited to C/C++. Since the problem is defined with the need to parse whole lines of text, and perform pattern matching on the first 10-chars, something interpreted, such as python or perl would seem to be simpler.
How about:
import os
pattern ='0123456789' # <-- replace with pattern
with open('myfile.txt') as f:
for line in f:
if line.startswith(pattern):
print "Eureka!'
I don't see how you're going to do this faster than using the stdio library, reading each line in turn into a buffer, and using strchr, strcmp, strncmp or some such. Given the description of your problem, that's already fairly optimal. There's no magic that will avoid the need to go through the file line by line looking for your pattern.
That said, regular expressions are almost certainly not needed here if you're dealing with a fixed pattern of exactly ten characters at the start of a line -- that would be needlessly slow and I wouldn't use the regex library.
If you really, really need to beat the last few microseconds out of this, and the pattern is literally constant and at the start of a line, you might be able to do a memchr on read-in buffers looking for "\npattern" or some such (that is, including the newline character in your search) but you make it sound like the pattern is not precisely constant. Assuming it is not precisely constant, the most obvious method (see first paragraph) is the the most obvious thing to do.
If you have a large number of values that you are looking for then you want to use Aho-Corasick. This algorithm allows you to create a single finite state machine that can search for all occurrences of any string in a set simultaneously. This means that you can search through your file a single time and find all matches of every value you are looking for. The wikipedia link above has a link to a C implementation of Aho-Corasick. If you want to look at a Go implementation that I've written you can look here.
If you are looking for a single or a very small number of values then you'd be better off using Boyer-Moore. Although in this case you might want to just use grep, which will probably be just as fast as anything you write for this application.
How about using memory mapped files before search?
http://beej.us/guide/bgipc/output/html/multipage/mmap.html
One way may be loading and searching for say first 64 MB in memory, unload this then load the next 64 MB and so on (in multiples of 4 KB so that you are not overlooking any text which might be split at the block boundary)
Also view Boyer Moore String Search
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
Yes this can be done fast. Been there. Done that. It is easy to introduce bugs, however.
The trick is in managing end of buffer, since you will read a buffer full of data, search that buffer, and then go on to the next. Since the pattern could span the boundary between two buffers, you wind up writing most of your code to cover that case.
At any rate, outside of the boundary case, you have a loop that looks like the following:
unsigned short *p = buffer;
while( (p < EOB) && ( patterns[*p] ) ) ++p;
This assumes that EOB has been appropriately initialized, and that patterns[] is an array of 65536 values which are 0 if you can't be at the start of your pattern and 1 if you can.
Depending on your CR/LF and byte order conventions, patterns to set to 1 might include \nx or \rx where x is the first character in your 10 character pattern. Or x\n or x\r for the other byte order. And if you don't know the byte order or convention you can include all four.
Once you have a candidate location (EOL followed by the first byte) you do the work of checking the remaining 9 bytes. Building the patterns array is done offline, ahead of time. Two byte patterns fit in a small enough array that you don't have too much memory thrashing when doing the indexing, but you get to zip through the data twice as fast as if you did single byte.
There is one crazy optimization you can add into this, and that is to write a sentinel at the end of buffer, and put it in your patterns array. But that sentinel must be something that couldn't appear in the file otherwise. It gets the loop down to one test, one lookup and one increment, though.

Stream or Iterator to generate all strings that match a regular expression?

This is a follow-up to my previous question.
Suppose I want to generate all strings that match a given (simplified) regular expression.
It is just a coding exercise and I do not have any additional requirements (e.g. how many strings are generated actually). So the main requirement is to produce nice, clean, and simple code.
I thought about using Stream but after reading this question I am thinking about Iterator. What would you use?
The solution to this question asks for too much code for it to be practical to answer here, but the outline goes as follows.
First, you want to parse your regular expression--you can look into parser combinators for this, for example. You'll then have an evaluation tree that looks like, for example,
List(
Constant("abc"),
ZeroOrOne(Constant("d")),
Constant("efg"),
OneOf(Constant("h"),List(Constant("ij"),ZeroOrOne(Constant("klmnop")))),
Constant("qrs"),
AnyChar()
)
Rather than running this expression tree as a matcher, you can run it as a generator by defining a generate method on each term. For some terms, (e.g. ZeroOrOne(Constant("d"))), there will be multiple options, so you can define an iterator. One way to do this is to store internal state in each term and pass in either an "advance" flag or a "reset" flag. On "reset", the generator returns the first possible match (e.g. ""); on advance, it goes to the next one and returns that (e.g. "d") while consuming the advance flag (leaving the rest to evaluate with no flags). If there are no more items, it produces a reset instead for everything inside itself and leaves the advance flag intact for the next item. You start by running with a reset; on each iteration, you put an advance in, and stop when you get it out again.
Of course, some regex constructs like "d+" can produce infinitely many values, so you'll probably want to limit them in some way (or at some point return e.g. d...d meaning "lots"); and others have very many possible values (e.g. . matches any char, but do you really want all 64k chars, or howevermany unicode code points there are?), and you may wish to restrict those also.
Anyway, this, though time-consuming, will result in a working generator. And, as an aside, you'll also have a working regex matcher, if you write a match routine for each piece of the parsed tree.

Simple spell checking algorithm

I've been tasked with creating a simple spell checker for an assignment but have given next to no guidance so was wondering if anyone could help me out. I'm not after someone to do the assignment for me, but any direction or help with the algorithm would be awesome! If what I'm asking is not within the guildlines of the site then I'm sorry and I'll look elsewhere. :)
The project loads correctly spelled lower case words and then needs to make spelling suggestions based on two criteria:
One letter difference (either added or subtracted to get the word the same as a word in the dictionary). For example 'stack' would be a suggestion for 'staick' and 'cool' would be a suggestion for 'coo'.
One letter substitution. So for example, 'bad' would be a suggestion for 'bod'.
So, just to make sure I've explained properly.. I might load in the words [hello, goodbye, fantastic, good, god] and then the suggestions for the (incorrectly spelled) word 'godd' would be [good, god].
Speed is my main consideration here so while I think I know a way to get this work, I'm really not too sure about how efficient it'll be. The way I'm thinking of doing it is to create a map<string, vector<string>> and then for each correctly spelled word that's loaded in, add the correctly spelled work in as a key in the map and the populate the vector to be all the possible 'wrong' permutations of that word.
Then, when I want to look up a word, I'll look through every vector in the map to see if that word is a permutation of one of the correctly spelled word. If it is, I'll add the key as a spelling suggestion.
This seems like it would take up HEAPS of memory though, cause there would surely be thousands of permutations for each word? It also seems like it'd be very very slow if my initial dictionary of correctly spelled words was large?
I was thinking that maybe I could cut down time a bit by only looking in the keys that are similar to the word I'm looking at. But then again, if they're similar in some way then it probably means that the key will be a suggestion meaning I don't need all those permutations!
So yeah, I'm a bit stumped about which direction I should look in. I'd really appreciate any help as I really am not sure how to estimate the speed of the different ways of doing things (we haven't been taught this at all in class).
The simpler way to solve the problem is indeed a precomputed map [bad word] -> [suggestions].
The problem is that while the removal of a letter creates few "bad words", for the addition or substitution you have many candidates.
So I would suggest another solution ;)
Note: the edit distance you are describing is called the Levenshtein Distance
The solution is described in incremental step, normally the search speed should continuously improve at each idea and I have tried to organize them with the simpler ideas (in term of implementation) first. Feel free to stop whenever you're comfortable with the results.
0. Preliminary
Implement the Levenshtein Distance algorithm
Store the dictionnary in a sorted sequence (std::set for example, though a sorted std::deque or std::vector would be better performance wise)
Keys Points:
The Levenshtein Distance compututation uses an array, at each step the next row is computed solely with the previous row
The minimum distance in a row is always superior (or equal) to the minimum in the previous row
The latter property allow a short-circuit implementation: if you want to limit yourself to 2 errors (treshold), then whenever the minimum of the current row is superior to 2, you can abandon the computation. A simple strategy is to return the treshold + 1 as the distance.
1. First Tentative
Let's begin simple.
We'll implement a linear scan: for each word we compute the distance (short-circuited) and we list those words which achieved the smaller distance so far.
It works very well on smallish dictionaries.
2. Improving the data structure
The levenshtein distance is at least equal to the difference of length.
By using as a key the couple (length, word) instead of just word, you can restrict your search to the range of length [length - edit, length + edit] and greatly reduce the search space.
3. Prefixes and pruning
To improve on this, we can remark than when we build the distance matrix, row by row, one world is entirely scanned (the word we look for) but the other (the referent) is not: we only use one letter for each row.
This very important property means that for two referents that share the same initial sequence (prefix), then the first rows of the matrix will be identical.
Remember that I ask you to store the dictionnary sorted ? It means that words that share the same prefix are adjacent.
Suppose that you are checking your word against cartoon and at car you realize it does not work (the distance is already too long), then any word beginning by car won't work either, you can skip words as long as they begin by car.
The skip itself can be done either linearly or with a search (find the first word that has a higher prefix than car):
linear works best if the prefix is long (few words to skip)
binary search works best for short prefix (many words to skip)
How long is "long" depends on your dictionary and you'll have to measure. I would go with the binary search to begin with.
Note: the length partitioning works against the prefix partitioning, but it prunes much more of the search space
4. Prefixes and re-use
Now, we'll also try to re-use the computation as much as possible (and not just the "it does not work" result)
Suppose that you have two words:
cartoon
carwash
You first compute the matrix, row by row, for cartoon. Then when reading carwash you need to determine the length of the common prefix (here car) and you can keep the first 4 rows of the matrix (corresponding to void, c, a, r).
Therefore, when begin to computing carwash, you in fact begin iterating at w.
To do this, simply use an array allocated straight at the beginning of your search, and make it large enough to accommodate the larger reference (you should know what is the largest length in your dictionary).
5. Using a "better" data structure
To have an easier time working with prefixes, you could use a Trie or a Patricia Tree to store the dictionary. However it's not a STL data structure and you would need to augment it to store in each subtree the range of words length that are stored so you'll have to make your own implementation. It's not as easy as it seems because there are memory explosion issues which can kill locality.
This is a last resort option. It's costly to implement.
You should have a look at this explanation of Peter Norvig on how to write a spelling corrector .
How to write a spelling corrector
EveryThing is well explain in this article, as an example the python code for the spell checker looks like this :
import re, collections
def words(text): return re.findall('[a-z]+', text.lower())
def train(features):
model = collections.defaultdict(lambda: 1)
for f in features:
model[f] += 1
return model
NWORDS = train(words(file('big.txt').read()))
alphabet = 'abcdefghijklmnopqrstuvwxyz'
def edits1(word):
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
deletes = [a + b[1:] for a, b in splits if b]
transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
inserts = [a + c + b for a, b in splits for c in alphabet]
return set(deletes + transposes + replaces + inserts)
def known_edits2(word):
return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)
def known(words): return set(w for w in words if w in NWORDS)
def correct(word):
candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
return max(candidates, key=NWORDS.get)
Hope you can find what you need on Peter Norvig website.
for spell checker many data structures would be useful for example BK-Tree. Check Damn Cool Algorithms, Part 1: BK-Trees I have done implementation for the same here
My Earlier code link may be misleading, this one is correct for spelling corrector.
off the top of my head, you could split up your suggestions based on length and build a tree structure where children are longer variations of the shorter parent.
should be quite fast but i'm not sure about the code itself, i'm not very well-versed in c++

efficient algorithm for searching one of several strings in a text?

I need to search incoming not-very-long pieces of text for occurrences of given strings. The strings are constant for the whole session and are not many (~10). Additional simplification is that none of the strings is contained in any other.
I am currently using boost regex matching with str1 | str2 | .... The performance of this task is important, so I wonder if I can improve it. Not that I can program better than the boost guys, but perhaps a dedicated implementation is more efficient than a general one.
As the strings stay constant over long time, I can afford building a data structure, like a state transition table, upfront.
e.g., if the strings are abcx, bcy and cz, and I've read so far abc, I should be in a combined state that means you're either 3 chars into string 1, 2 chars into string 2 or 1 char into string 1. Then reading x next will move me to the string 1 matched state etc., and any char other than xyz will move to the initial state, and I will not need to retract back to b.
Any ideas or references are appreciated.
Check out the Aho–Corasick string matching algorithm!
Take a look at Suffix Tree.
Look at this: http://www.boost.org/doc/libs/1_44_0/libs/regex/doc/html/boost_regex/configuration/algorithm.html
The existence of a recursive/non-recursive distinction is a pretty strong suggestion that BOOST is not necessarily a linear-time discrete finite-state machine. Therefore, there's a good chance you can do better for your particular problem.
The best answer depends quite a bit on how many haystacks you have and the minimum size of a needle. If the smallest needle is longer than a few characters, you may be able to do a little bit better than a generalized regex library.
Basically all string searches work by testing for a match at the current position (cursor), and if none is found, then trying again with the cursor slid farther to the right.
Rabin-Karp builds a DFSM out of the string (or strings) for which you are searching so that the test and the cursor motion are combined in a single operation. However, Rabin-Karp was originally designed for a single needle, so you would need to support backtracking if one match could ever be a proper prefix of another. (Remember that for when you want to reuse your code.)
Another tactic is to slide the cursor more than one character to the right if at all possible. Boyer-Moore does this. It's normally built for a single needle. Construct a table of all characters and the rightmost position that they appear in the needle (if at all). Now, position the cursor at len(needle)-1. The table entry will tell you (a) what leftward offset from the cursor that the needle might be found, or (b) that you can move the cursor len(needle) farther to the right.
When you have more than one needle, the construction and use of your table grows more complicated, but it still may possibly save you an order of magnitude on probes. You still might want to make a DFSM but instead of calling a general search method, you call does_this_DFSM_match_at_this_offset().
Another tactic is to test more than 8 bits at a time. There's a spam-killer tool that looks at 32-bit machine words at a time. It then does some simple hash code to fit the result into 12 bits, and then looks in a table to see if there's a hit. You have four entries for each pattern (offsets of 0, 1, 2, and 3 from the start of the pattern) and then this way despite thousands of patterns in the table they only test one or two per 32-bit word in the subject line.
So in general, yes, you can go faster than regexes WHEN THE NEEDLES ARE CONSTANT.
I've been looking at the answers but none seem quite explicit... and mostly boiled down to a couple of links.
What intrigues me here is the uniqueness of your problem, the solutions exposed so far do not capitalize at all on the fact that we are looking for several needles at once in the haystack.
I would take a look at KMP / Boyer Moore, for sure, but I would not apply them blindly (at least if you have some time on your hand), because they are tailored for a single needle, and I am pretty convinced we could capitalized on the fact that we have several strings and check all of them at once, with a custom state machine (or custom tables for BM).
Of course, it's unlikely to improve the big O (Boyer Moore runs in 3n for each string, so it'll be linear anyway), but you could probably gain on the constant factor.
Regex engine initialization is expected to have some overhead,
so if there are no real regular expressions involved,
C - memcmp() should do fine.
If you can tell the file sizes and give some
specific use cases, we could build a benchmark
(I consider this very interesting).
Interesting: memcmp explorations and timing differences
Regards
rbo
There is always Boyer Moore
Beside Rabin-Karp-Algorithm and Knuth-Morris-Pratt-Algorithm, my Algorithm-Book suggests a Finite State Machine for String Matching. For every Search String you need to build such a Finite State Machine.
You can do it with very popular Lex & Yacc tools, with the support of Flex and Bison tools.
You can use Lex for getting tokens of the string.
Compare your pre-defined strings with the tokens returned from Lexer.
When match is found, perform the desired action.
There are many sites which describe about Lex and Yacc.
One such site is http://epaperpress.com/lexandyacc/