replace (key, value) in a string with boost regex - c++

I have a map (collection of pairs) of strings I have to replace in a string.
For example if I have map { {"foo", "foo2"}, {"bar", "bar2"} }
the final result for string "foo barman football bar" should be: "foo2 bar2man foo2tball bar2".
There are a number of questions/answers already but none of them mention the biggest problem here. If the map has "circular" replacement { {"foo", "bar"}, {"bar", "foo"} }
the result should be: "bar fooman bartball foo".
Another problem could be { {"foo", "fo"}, {"fo", "f"} }
The algorithm should replace an instance only once.
Sorry for not providing SSCCE ... I am still looking how to approach it.
Option 1:
Sort the map by key length (descending); this eliminates inclusion problems, where one key is contained in another.
Then, borrowing the idea of the classic swap through a temporary, go through the map and replace all occurrences of each key with something unique. In my case ? could not be used, so I use ?1 for the first key, ?2 for the second key, etc.
On the second pass, replace ?1 with the first value, ?2 with the second value, etc.
Cons:
1. You need an easy way to define placeholder keys that do not match any of the keys or values.
2. It seems expensive performance-wise.
Option 2:
Create a general match pattern say ((key1|key2|key3...)(.*?))*.
Enumerate the matches and for every key match replace the key with the value.
Regenerate the result.
Cons:
1. Creating the matching tree could be memory expensive
2. preparing general pattern could be a hassle
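A minimal sketch of the single-pass idea behind Option 2, written in JavaScript rather than C++ purely to illustrate the technique. The names replaceByMap and escapeRegExp are hypothetical. It assumes the regex engine tries alternatives left to right (so longer keys are listed first) and that the replacement can be computed by a callback; because replaced text is never rescanned, each occurrence is replaced only once and circular maps are safe.

```javascript
// Escape characters that have special meaning in a regular expression,
// so that keys are always matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Replace every occurrence of a key with its value in one left-to-right pass.
function replaceByMap(input, replacements) {
  // Longest keys first, so "foo" wins over "fo" at the same position.
  const keys = Object.keys(replacements).sort((a, b) => b.length - a.length);
  if (keys.length === 0) return input;
  const pattern = new RegExp(keys.map(escapeRegExp).join("|"), "g");
  // The callback looks up the value for whichever key matched.
  return input.replace(pattern, (match) => replacements[match]);
}

console.log(replaceByMap("foo barman football bar", { foo: "bar", bar: "foo" }));
// prints "bar fooman bartball foo"
```

The same approach should carry over to any regex library that lets the replacement be computed per match; no placeholder keys are needed.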

Related

RegEx 5 Strings to Match - Replace Based on Match

Angular/JS Application
I have this: input.replace('/&lt;|&gt;|&quot;|&amp;|&apos;/gm', need this to be based on match value).
So I want to search for all those strings, but I want the replacement to depend on which one matched. So if &quot; matches, replace it with ", and if &gt; matches, replace it with >.
I basically want to avoid this:
input.replace('/&lt;/gm', '<')
input.replace('/&gt;/gm', '>')
input.replace('/&quot;/gm', '"')
I think it has something to do with capturing groups - not a regex person.
Maybe the answer can only be: inputString.replace('/&lt;/gm', '<').replace('/&gt;/gm', '>').replace('/&quot;/gm', '"').replace('/&amp;/gm', '&').replace('/&apos;/gm', '\'');
What's commonly done is to simply chain the replacements, executing one after another as in your example:
input.replace(/&lt;/g, "<").replace(/&gt;/g, ">").replace(/&quot;/g, '"').replace(/&amp;/g, "&").replace(/&apos;/g, "'")
The downside of this is that it doesn't scale well: each replace operation runs in linear time, so for m replacements and a string of length n the time complexity is O(n * m). If you were to support all 2k+ named HTML entities, performance would degrade severely, not to mention the O(m) intermediate strings created in the process, which amount to O(n * m) garbage data.
The proper way is to create a lookup table (a hash table; in JS a plain object serves as a dictionary) with O(1) access, holding all the named entities and their replacements:
const namedEntities = {lt: "<", gt: ">", quot: '"', amp: "&", apos: "'"}
return input.replace(/&(lt|gt|quot|amp|apos);/g, (_, match) => namedEntities[match])
This passes a replacement function to String.replace; no intermediate garbage strings are created and the time complexity, assuming an ideal RegEx implementation, is O(n).
If you want to religiously follow DRY, you might want to build the RegEx from the keys:
const regex = new RegExp("&(" + Object.keys(namedEntities).join("|") + ");", "g")
return input.replace(regex, (_, match) => namedEntities[match])
Alternatively, consider using a more general RegEx, leveraging the dictionary to check whether an entity is valid and defaulting to no replacement. Matching only word characters between & and ; keeps a stray & (as in "AT&T &lt;") from swallowing a later entity:
return input.replace(/&(\w+);/g, (entity, match) => namedEntities[match] || entity)
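For completeness, here is a self-contained sketch of the lookup-table approach as a function. The names NAMED_ENTITIES and decodeNamedEntities are made up for this example, and only the five entities from the question are included. It uses a Map instead of a plain object so that names such as constructor, which exist on every object's prototype, are not mistaken for entities.

```javascript
// Lookup table: entity name -> replacement character.
const NAMED_ENTITIES = new Map([
  ["lt", "<"],
  ["gt", ">"],
  ["quot", '"'],
  ["amp", "&"],
  ["apos", "'"],
]);

// Decode known named entities in a single pass; unknown ones are left as-is.
function decodeNamedEntities(input) {
  return input.replace(/&(\w+);/g, (entity, name) =>
    NAMED_ENTITIES.has(name) ? NAMED_ENTITIES.get(name) : entity
  );
}

console.log(decodeNamedEntities("&lt;b&gt; AT&T &amp; &quot;x&quot;"));
// prints <b> AT&T & "x"
```

Because the input is scanned only once, "&amp;lt;" decodes to "&lt;" and is not decoded a second time.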

parse URL params in Perl

I am working on some tutorials to explain things like GET/POST requests and need to parse the URI manually. The following Perl code works, but I am trying to do two things:
list each key/value
be able to look up one specific value
What I do NOT care about is replacing the special chars to spaces or anything, the one value I need to get should be a number. In other languages I have used, the regular expression in question should group each key/value into one grouping with a part 1/part 2, does Perl do the same? If so, how do I put that into a map?
my @paramList = split /(?:\?|&|;)([^=]+)=([^&|;]+)/, $ENV{'REQUEST_URI'};
if (@paramList)
{
    print "<h1>The Params</h1><ul>";
    foreach my $i (@paramList) {
        if ($i) {
            print "<li>$i</li>";
        }
    }
    print "</ul>";
}
Per the request, here is a basic example of the input:
REQUEST_URI = /cgi-bin/printenv_html.pl?customer_name=fdas&phone_number=fdsa&email_address=fads%40fd.com&taxi=van&extras=tip&pickup_time=2020-01-14T20%3A45&pickup_place=&dropoff_place=Airport&comments=
goal is the following where the left of the equal is the key, and the right is the value:
customer_name=fdas
phone_number=fdsa
email_address=fads%40fd.com
taxi=van
extras=tip
pickup_time=2020-01-14T20%3A45
pickup_place=
dropoff_place=Airport
comments=
How about feeding your list of key-value pairs into a hash?
my %paramList = $ENV{'REQUEST_URI'} =~ /(?:\?|&|;)([^=]+)=([^&|;]+)/g;
(no reason for the split as far as I can tell)
This relies crucially on there being an even-sized list of matches, where each "before-=" thing becomes a key in the hash, with the value being its pairing "after-=" thing.
In order to also get "pairs" without a value (like comments=), change the + at the end of the pattern to *, so that the value group may be empty.
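The same pairing idea can be sketched outside Perl. The following JavaScript version is only an illustration (parseParams is a hypothetical name): it collects each key/value capture pair into a Map, allows empty values, and performs no percent-decoding, just as in the question.

```javascript
// Collect key=value pairs that follow '?', '&' or ';' into a Map.
// Values may be empty; nothing is percent-decoded.
function parseParams(uri) {
  const params = new Map();
  for (const [, key, value] of uri.matchAll(/[?&;]([^=&;]+)=([^&;]*)/g)) {
    params.set(key, value);
  }
  return params;
}

const uri = "/cgi-bin/printenv_html.pl?customer_name=fdas&taxi=van&pickup_place=&comments=";
const params = parseParams(uri);

// List each key/value pair.
for (const [key, value] of params) {
  console.log(`${key}=${value}`);
}

// Look up one specific value.
console.log(params.get("taxi")); // prints "van"
```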

Does mongodb $regex without the option `i` still make use of the index if I am searching on the Index?

I have a model with a normal index using Mongoose.
const mod = new mongoose.Schema({
number: { type: String, required: true, index: { unique: true } },
});
I am using a regex in a query to get the mod corresponding to a specific number. Will my regex query utilize the index that is on this model?
query.number = {
$regex: `.*Q10.*`
}
modelName.find(query)
I am concerned that this is looking through the entire collection without using the index. What would be the best way to know whether I am using the index? Or, if you happen to know a way that will utilize the index, could you show me? Here I am looking for everything close to Q10, not trying to get an exact match. Would using /^Q10.*/ be better and use the index?
Referencing MongoDB regex information on index and comments made on this post stackoverflow previous question
The best way to confirm index usage for a given query is using MongoDB's query explain() feature. See Explain Results in the manual for your version of MongoDB for more information on the output fields and interpretation.
With regular expressions a main concern is efficient use of indexes. An unanchored substring match like /Q10/ will require examining all index keys (assuming a candidate index exists, as in your example). This is an improvement over scanning the full collection data (as would be the case without an index), but not as ideal as being able to check a subset of relevant index keys as is possible with a regex prefix search.
If you are routinely searching for substring matches and there is a common pattern to your strings, you could design a more scalable schema. For example, you could save whatever your Q10 value represents into a separate field (such as part_number) where you could use a prefix match or an exact match (non-regex).
To illustrate, I set up some test data using MongoDB 3.4.2 and the mongo shell:
// Needles: strings to search for
db.mod.insert([{number:'Q10'}, {number: 'foo-Q10'}, {number:'Q10-123'}])
// Haystack: some string values to illustrate key comparisons
for (i=0; i<1000; i++) { db.mod.insert({number: "I" + i}) }
Regex search without an index:
db.mod.find({ number: { $regex: /Q10/ }}).explain('executionStats')
The winningPlan is a COLLSCAN (collection scan) which requires the server retrieve every document in the collection to perform the comparison. Note that the original regex includes an unnecessary .* prefix and suffix; this is implicit with a substring match so can be written more concisely as /Q10/.
Highlights from the executionStats section of the explain output:
"nReturned": 3,
"totalKeysExamined": 0,
"totalDocsExamined": 1003,
The explain output confirms there are no index keys examined and 1003 documents (all docs in this collection).
Add an index for the following two examples:
db.mod.createIndex({number:1}, {unique: true})
Regex substring search with an index:
db.mod.find({ number: { $regex: /Q10/}}).explain('executionStats')
The winningPlan is now an IXSCAN, but it still has to examine all 1003 indexed string values to find substring matches:
"nReturned": 3,
"totalKeysExamined": 1003,
"totalDocsExamined": 3,
Regex prefix search with an index:
db.mod.find({ number: { $regex: /^Q10/}}).explain('executionStats')
The winningPlan is an IXSCAN (Index scan) which requires 3 key comparisons and 2 document fetches to return the 2 matching documents:
"nReturned": 2,
"totalKeysExamined": 3,
"totalDocsExamined": 2,
A prefix search isn't equivalent to the first two searches, as it will not match the document with value foo-Q10. However, this does illustrate a more efficient regex search.
Note that totalKeysExamined is 3. It might be reasonable to expect this to be 2 since there were only 2 matches, however this metric includes any comparisons with out-of-range keys (eg. end of a range of values). For more information see Explain Results: keysExamined.
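To make the difference between the two index scans concrete, here is a toy model in plain JavaScript. It is a simplified illustration, not how MongoDB is implemented: the "index" is just a sorted array of the test strings, and the function names (buildIndex, lowerBound, prefixScan, substringScan) are invented for the example. A prefix search can jump to the first candidate key and stop at the first key outside the range, while a substring search must look at every key.

```javascript
// Toy "index": the string values in sorted order (by UTF-16 code units).
function buildIndex(values) {
  return [...values].sort();
}

// Binary search for the position of the first key >= target.
function lowerBound(keys, target) {
  let lo = 0;
  let hi = keys.length;
  while (lo < hi) {
    const mid = (lo + hi) >>> 1;
    if (keys[mid] < target) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// Prefix search: seek to the start of the range, stop at the first miss.
function prefixScan(keys, prefix) {
  const matches = [];
  let keysExamined = 0;
  for (let i = lowerBound(keys, prefix); i < keys.length; i++) {
    keysExamined++; // the out-of-range key that ends the scan is counted too
    if (!keys[i].startsWith(prefix)) break;
    matches.push(keys[i]);
  }
  return { matches, keysExamined };
}

// Substring search: sorted order does not help, so every key is examined.
function substringScan(keys, needle) {
  return { matches: keys.filter((k) => k.includes(needle)), keysExamined: keys.length };
}

// Same test data as above.
const values = ["Q10", "foo-Q10", "Q10-123"];
for (let i = 0; i < 1000; i++) values.push("I" + i);
const index = buildIndex(values);

console.log(prefixScan(index, "Q10"));                 // 2 matches, keysExamined: 3
console.log(substringScan(index, "Q10").keysExamined); // 1003
```

In this model the third key examined by the prefix scan is foo-Q10, the first key past the end of the Q10 range, which mirrors why the explain output reports one more key than there are matches.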
With an index in place, a case-sensitive regular expression query traverses the entire index (loading it into memory) and then loads the matching documents to be returned. It's expensive, but it can still be better than a full collection scan.
For a regex such as /John Doe/, mongo will scan the entire key set in the index
and then fetch the matched documents.
However, if you use a prefix query:
Further optimization can occur if the regular expression is a “prefix
expression”, which means that all potential matches start with the
same string. This allows MongoDB to construct a “range” from that
prefix and only match against those values from the index that fall
within that range.

Checking if a string contains a character in Scala

I have a collection of Strings and I'm checking if they're correctly masked or not.
They're in a map and so I'm iterating over it, pulling out the text value and then checking. I'm trying various different combinations but none of which are giving me the finished result that I need. I have gotten it working by iterating over each character but that feels very java-esque.
My collection is something like:
"text"-> "text"
"text"-> "**xt"
"text"-> "****"
in the first two cases I need to confirm that the value is not all starred out and then add them to another list that can be returned.
Edit
My question: I need to check if the value contains anything other than an '*'; how might I accomplish this in the most efficient Scala-esque way?
My attempt at regex also failed giving many false positives and it seems like such a simple task. I'm not sure if regex is the way to go, I also wondered if there was a method I could apply to .contains or use pattern matching
!string.matches("\\*+") will tell you if the string contains characters other than *.
If I understand correctly, you want to find the keys in your map for which the value is not just stars. You can do this with a regex :
val reg = "\\*+".r
yourMap.filter{ case (k,v) => !reg.matches(v) }.keys
If you're not comfortable with a regex, you can use forall and negate it, keeping only the values that are not made up entirely of stars:
yourMap.filter{ case (k,v) => !v.forall(_ == '*') }.keys
Perhaps I misunderstood your question, but if you started with a Map you could try something like:
val myStrings = Map("1"-> "text", "2"-> "**xt", "3"-> "****")
val newStrings = myStrings.filterNot( _._2.contains("*") )
This would give you a Map with just Map(1 -> "text").
Try:
val myStrings = Map("1"-> "text", "2"-> "**xt", "3"-> "****")
val goodStrings = myStrings.filter(_._2.exists(_ !='*'))
This finds all cases where the value in the map contains something other than an asterisk. It will remove all empty and asterisk-only strings. For something this simple, i.e. one check, a regex is overkill: you're just looking for strings that contain any non-asterisk character.
If you only need the values and not the whole map, use:
val goodStrings = myStrings.values.filter(_.exists(_ !='*'))

Using regex to access values from a map in keys

val m = Map("a"->2,"ab"->3,"c"->4)
scala> m.get("a")
scala> println(res0.get)
2
scala> m.get(/a\.*/)
// or something similar.
Can I get a list of all key-value pairs where the key contains "a" without having to iterate over the entire map, by doing something as simple as specifying a regex in place of the key?
Thanks in advance!
No, you cannot do that without iterating over the entire map. In fact, I can't even think of a single data structure that would allow it, say nothing of the API.
Of course, iterating is pretty simple:
m.filterKeys(_ matches "a.*")