MongoDB word count using map reduce - regex

I have a problem with counting words.
I want to count the words in projects.log.subject.
e.g. count [A], [B], [C], ...
I searched for how to use map reduce, but I don't understand how to use it to get the result I want.
{
"_id": ObjectID("569f3a3e9d2540764d8bde59"),
"A": "book",
"server": "us",
"projects": [
{
"domainArray": [
{
~~~~
}
],
"log": [
{
~~~~~,
"subject": "[A][B]I WANT THIS"
}
],
"before": "234234234"
},
{
"domainArray": [
{
~~~~
}
],
"log": [
{
~~~~~,
"subject": "[B][C]I WANT THIS"
}
],
"before": "234234234"
},....
] //end of projects
}//end of document

This is a basic principle of using regular expressions and testing each string against the source string and emitting the found count for the result. In mapReduce terms, you want your "mapper" function to possibly emit multiple values for each "term" as a key, and for every array element present in each document.
So you basically want a source array of regular expressions to process ( likely just a word list ) to iterate and test and also iterate each array member.
Basically something like this:
db.collection.mapReduce(
function() {
var list = ["the", "quick", "brown" ]; // words you want to count
this.projects.forEach(function(project) {
project.log.forEach(function(log) {
list.forEach(function(word) {
var res = log.subject.match(new RegExp("\\b" + word + "\\b","ig"));
if ( res != null )
emit(word,res.length); // returns number of matches for word
});
});
});
},
function(key,values) {
return Array.sum(values);
},
{ "out": { "inline": 1 } }
)
So the loop processes the array elements in the document and then tests each word with a regular expression. The .match() method returns an array of matches in the string, or null if none was found. Note the i and g options for the regex, which make the search case insensitive and continue beyond the first match. You might need m for multi-line as well if your text includes line break characters.
If null is not returned, then we emit the current word as the "key" and the count as the length of the matched array.
The reducer then takes all output values from those emit calls in the mapper and simply adds up the emitted counts.
The result will be one document keyed by each "word/term" provided, with the count of total occurrences in the inspected field within the collection. For more fields, just add more logic to sum up the results, or similarly just keep "emitting" in the mapper and let the reducer do the work.
Note that "\\b" is the word boundary expression wrapped around each term, with the backslash itself escaped by another \ because the expression is constructed from strings. You need these boundaries to discriminate "the" from "then", for example, by specifying where the word/term ends.
Also note that in regular expressions characters like [] are reserved, so if you were actually looking for strings like that, then you similarly escape them (again doubling the backslash inside a string), i.e.:
"\\[A\\]"
But if you were actually doing that, then remove the word boundary characters:
new RegExp( "\\[A\\]", "ig" )
As that is enough of a complete match in itself.
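To try the counting logic for the bracketed tags from the question without a MongoDB server, here is a minimal local sketch of what the mapper and reducer do together. The names escapeRegExp and countTerms are hypothetical helpers, not MongoDB APIs, and the sample documents only contain the fields shown in the question.

```javascript
// Escape characters that have a special meaning in a regular expression,
// so that a term like "[A]" is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Local stand-in for mapper + reducer: for every projects[].log[].subject,
// add the number of matches of each term to a running total.
function countTerms(docs, terms) {
  const totals = {};
  for (const doc of docs) {
    for (const project of doc.projects || []) {
      for (const log of project.log || []) {
        for (const term of terms) {
          const matches = String(log.subject || "").match(
            new RegExp(escapeRegExp(term), "ig")
          );
          if (matches !== null) {
            totals[term] = (totals[term] || 0) + matches.length;
          }
        }
      }
    }
  }
  return totals;
}

// Sample data reduced to the fields that matter here (assumed shape).
const sampleDocs = [
  {
    projects: [
      { log: [{ subject: "[A][B]I WANT THIS" }] },
      { log: [{ subject: "[B][C]I WANT THIS" }] }
    ]
  }
];

const tagCounts = countTerms(sampleDocs, ["[A]", "[B]", "[C]"]);
console.log(tagCounts);
```

Escaping the terms in code avoids having to hand-write the doubled backslashes for each tag.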

Related

$expr with $regexMatch doesn't work when the pattern is inside an array

Based on this example, I'm using $expr and $regexMatch to implement "reverse regex" queries in MongoDB. For instance, this example works.
However, this only seems to work when the regex is in a first-level field of the MongoDB document. When the regex is within an element of an array (as in this other example), I get errors like this:
query failed: (Location51105) Executor error during find command :: caused by :: $regexMatch needs 'regex' to be of type string or regex
Is there any way of supporting this case?
The regex argument of $regexMatch only accepts a string or a regex, so passing it an array of patterns fails. You can use the $map operator to loop over the array elements and check the condition:
$map iterates over the patterns.pattern array and evaluates the $regexMatch condition for each element, returning a boolean value
$anyElementTrue returns true if any element of that result is true
db.collection.find({
"$expr": {
"$anyElementTrue": {
"$map": {
"input": "$patterns.pattern",
"in": {
"$regexMatch": {
"input": "Room1",
"regex": "$$this",
"options": "i"
}
}
}
}
}
})
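To make the logic of that query easier to follow, here is a rough local equivalent in plain JavaScript. It only mimics the $map + $anyElementTrue combination in memory; the document shape (a patterns array of objects with a pattern string) is an assumption inferred from the "$patterns.pattern" path, and matchesAnyPattern is a hypothetical helper name.

```javascript
// Local illustration only: mimics $anyElementTrue over $map + $regexMatch.
// Assumed document shape: { patterns: [{ pattern: "..." }, ...] }
function matchesAnyPattern(doc, input) {
  const patterns = Array.isArray(doc.patterns) ? doc.patterns : [];
  // some() plays the role of $anyElementTrue: true as soon as one pattern matches.
  return patterns.some(function (entry) {
    return new RegExp(entry.pattern, "i").test(input);
  });
}

// Made-up sample documents for the sketch.
const sampleDocs = [
  { _id: 1, patterns: [{ pattern: "^room[0-9]$" }, { pattern: "^hall" }] },
  { _id: 2, patterns: [{ pattern: "^kitchen" }] }
];

const matchedIds = sampleDocs
  .filter(function (doc) { return matchesAnyPattern(doc, "Room1"); })
  .map(function (doc) { return doc._id; });

console.log(matchedIds);
```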

Extracting multiple values with RegEx in a Google Sheet formula

I have a Google spreadsheet with 2 columns.
Each cell of the first one contains JSON data, like this:
{
"name":"Love",
"age":56
},
{
"name":"You",
"age":42
}
Then I want a second column that would, using a formula, extract every value of name and string it like this:
Love,You
Right now I am using this formula:
=REGEXEXTRACT(A1, CONCATENER(CHAR(34),"name",CHAR(34),":",CHAR(34),"([^",CHAR(34),"]+)",CHAR(34),","))
The RegEx expression being "name":"([^"]+)",
The problem is that it currently returns only the first occurrence, like this:
Love
(Also, I don't know how many occurrences of "name" there are. Could be anywhere from 0 to around 20.)
Is it even possible to achieve what I want?
Thank you so much for reading!
EDIT:
My JSON data starts with:
{
"time":4,
"annotations":[
{
Then in the middle, something like this:
{
"name":"Love",
"age":56
},
{
"name":"You",
"age":42
}
and ends with:
],
"topEntities":[
{
"id":247120,
"score":0.12561166,
"uri":"http://en.wikipedia.org/wiki/Revenue"
},
{
"id":31512491,
"score":0.12504959,
"uri":"http://en.wikipedia.org/wiki/Wii_U"
}
],
"lang":"en",
"langConfidence":1.0,
"timestamp":"2020-05-22T12:17:47.380"
}
Since your text is basically a JSON string, you may parse all name fields from it using the following custom function:
function ExtractNamesFromJSON(input) {
var obj = JSON.parse("[" + input + "]");
var results = obj.map((x) => x["name"])
return results.join(",")
}
Then use it as =ExtractNamesFromJSON(C1).
If you need a regex, use a similar approach:
function ExtractAllRegex(input, pattern,groupId,separator) {
return Array.from(input.matchAll(new RegExp(pattern,'g')), x=>x[groupId]).join(separator);
}
Then use it as =ExtractAllRegex(C1, """name"":""([^""]+)""",1,",").
Note:
input - current cell value
pattern - regex pattern
groupId - Capturing group ID you want to extract
separator - text used to join the matched results.
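If the cell holds the complete JSON object shown in the EDIT (with the name entries inside an "annotations" array), a variant along these lines may be a closer fit. This is only a sketch: it assumes the cell content is valid JSON and that every name lives in annotations[].name, and extractAnnotationNames is a hypothetical function name.

```javascript
// Parse the whole JSON object and join the "name" of every annotation.
// Assumption: the names live in parsed.annotations[].name.
function extractAnnotationNames(jsonText) {
  const parsed = JSON.parse(jsonText);
  const annotations = Array.isArray(parsed.annotations) ? parsed.annotations : [];
  return annotations
    .filter(function (item) { return item && typeof item.name === "string"; })
    .map(function (item) { return item.name; })
    .join(",");
}

// Sample input shaped like the JSON in the EDIT (shortened).
const sampleJson = JSON.stringify({
  time: 4,
  annotations: [
    { name: "Love", age: 56 },
    { name: "You", age: 42 }
  ],
  topEntities: [],
  lang: "en"
});

const joinedNames = extractAnnotationNames(sampleJson);
console.log(joinedNames);
```

With no annotations at all the function returns an empty string, which covers the "0 occurrences" case from the question.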

how to match and replace the repeated group patterns and align the result?

I have a code snippet like below
[ "sortBy", "String", "sort by method" ],
[ "sortOrder", "String", "sort order includes ascend and descend" ],
[ "count", "Int", "The number of results to return." ],
[ "names", "Array<String>", "array of strings represents name" ]
I want to use regular expression to match and replace and align so that the result would be look like this:
{ Name = "sortBy"; Ref = "String"; Description = Some "sort by method" }
{ Name = "sortOrder"; Ref = "String"; Description = Some "sort order includes ascend and descend" }
{ Name = "count"; Ref = "Int"; Description = Some "The number of results to return." }
{ Name = "names"; Ref = "Array<String>"; Description = Some "array of strings represents name" }
and each column should be aligned. I am stuck at the beginning how to group match it and align the result. My search is this
*\[ *"(.*)", *"(.*)", *"(.*)" *\],
in visual studio code but it only match the first row. Instead I want to to match all rows at once and replace it and then align it.
The point here is to match and capture only the parts you need to keep, and just match other parts.
You may use
^( *)\[( *)(".*?"),( *)(".*?"),( *)(".*?" *)\],?$
Replace with $1{$2Name = $3;$4Ref = $5;$6Description = Some $7}.
Details
^ - start of line
( *) - Group 1 ($1): leading spaces
\[ - a [ char (will be replaced with {)
( *) - Group 2 ($2): spaces after [
(".*?") - Group 3 ($3): "..." substring
, - a comma (will be replaced with ;)
( *) - Group 4 ($4): spaces after the first ,
(".*?") - Group 5 ($5): "..." substring
, - a comma (will be replaced with ;)
( *) - Group 6 ($6): spaces after the second ,
(".*?" *) - Group 7 ($7): "..." substring and 0+ spaces after
\],?$ - ], an optional , and end of line.
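The same pattern and replacement can be checked outside the editor. Below is a small JavaScript sketch that applies them to the sample rows from the question; the g and m flags are added here so that ^ and $ match at every line, which the VS Code find widget does implicitly. The variable names are made up for the example.

```javascript
// Sample rows from the question, joined into one multi-line string.
const inputRows = [
  '[ "sortBy", "String", "sort by method" ],',
  '[ "sortOrder", "String", "sort order includes ascend and descend" ],',
  '[ "count", "Int", "The number of results to return." ],',
  '[ "names", "Array<String>", "array of strings represents name" ]'
].join("\n");

// Same pattern as above; "g" replaces every row, "m" makes ^ and $ work per line.
const rowPattern = /^( *)\[( *)(".*?"),( *)(".*?"),( *)(".*?" *)\],?$/gm;

// Same replacement string as above.
const converted = inputRows.replace(
  rowPattern,
  "$1{$2Name = $3;$4Ref = $5;$6Description = Some $7}"
);

console.log(converted);
```

The captured runs of spaces are reused as-is, so the output is only as aligned as the input was.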
Here is an answer using a macro extension, because you need to run two separate regexes (although the second regex is very simple). The demo shows your original text first, some badly formatted text second and your desired results last.
Select your text first and then trigger the macro. I am using alt+r as the keybinding but you can choose whatever you want.
Using the macro extension multi-command put this into your settings.json:
"multiCommand.commands": [
{
"command": "multiCommand.insertAlignRows",
"sequence": [
"editor.action.insertCursorAtEndOfEachLineSelected",
"cursorHomeSelect",
{
"command": "editor.action.insertSnippet",
"args": {
"snippet": "${TM_SELECTED_TEXT/^(\\s*)\\[\\s*(.{12})\\s*(.{18})\\s*([^\\]]*)\\],?/$1{ Name = $2 Ref = $3Description = Some $4}/g}",
}
},
"cursorHomeSelect",
{
"command": "editor.action.insertSnippet",
"args": {
"snippet": "${TM_SELECTED_TEXT/,/;/g}",
}
},
]
}
]
In keybindings.json:
{
"key": "alt+r", // choose whatever keybinding you want
"command": "extension.multiCommand.execute",
"args": { "command": "multiCommand.insertAlignRows" },
"when": "editorTextFocus"
},
The regex that is doing almost all of the work is:
^(\s*)\[\s*(.{12})\s*(.{18})\s*([^\]]*)\],?
I removed the double escapes, which are necessary in snippets but not in the find/replace widget, so you could just use this regex in your find input (and not do the macro at all) and
$1{ Name = $2 Ref = $3Description = Some $4}
in the replace field. And then just replace , with ; after that.
Back to that regex: ^(\s*)\[\s*(.{12})\s*(.{18})\s*([^\]]*)\],? It looks brittle because of the "magic numbers" 12 and 18 derived from your sample text, but it isn't as bad as it first seems, as the demo with the badly formatted original shows. The numbers are just counting characters, and as long as your input is reasonably close to what you presented it'll work.
The 12 can actually be anything from 12 to 16, with 12 being the length of your longest first item (like "sortOrder",) and 16 being the minimum distance from the beginning of the first items to where the second items (like "String") begin.
Likewise the 18 could be 17 to 24 given your input and where you want the final column to start. Play with the numbers; it is pretty easy in a regex101 demo.
I think the only restriction is that your input not look like this:
[ "names", "Array<String>", "array of strings represents name" ]
[ "sortOrder","String", "sort order includes ascend and descend" ],
where a later column starts before the end of the previous column - as in column 3 starts before all the column 2's end. Likewise for some column 2 item starting before all the column 1 items have ended like
[ "sortOrder", "String", "sort order includes ascend and descend" ],
[ "names", "Array<String>", "array of strings represents name" ]
If your input is that badly formatted, you could fix it first with some simple regexes.
Remember you can also adjust where the columns start in your replace by adding/subtracting spaces, as between the $2 Ref in my example above or $3Description - you can add space(s) after the $3 if you wish.
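If the fixed widths are a concern, a different technique (not the macro above) is to compute the column widths from the data and pad each cell. The following JavaScript sketch does that for rows that may be badly formatted; alignRows is a hypothetical helper, and it assumes every row has exactly three double-quoted items with no embedded quotes.

```javascript
// Convert rows and align the columns by padding to the widest cell,
// instead of relying on fixed character counts.
function alignRows(text) {
  const rowPattern = /^\s*\[\s*("[^"]*")\s*,\s*("[^"]*")\s*,\s*("[^"]*")\s*\],?\s*$/;
  const rows = text
    .split("\n")
    .map(function (line) { return line.match(rowPattern); })
    .filter(function (match) { return match !== null; }); // skip lines that are not rows

  // Widest first and second item, including their quotes.
  const nameWidth = Math.max(...rows.map(function (m) { return m[1].length; }));
  const refWidth = Math.max(...rows.map(function (m) { return m[2].length; }));

  return rows
    .map(function (m) {
      // +1 accounts for the ";" appended to each cell before padding.
      const name = (m[1] + ";").padEnd(nameWidth + 1);
      const ref = (m[2] + ";").padEnd(refWidth + 1);
      return "{ Name = " + name + " Ref = " + ref + " Description = Some " + m[3] + " }";
    })
    .join("\n");
}

// Badly formatted input similar to the example above.
const messyRows = [
  '[ "sortOrder", "String", "sort order includes ascend and descend" ],',
  '[ "names", "Array<String>", "array of strings represents name" ]'
].join("\n");

const alignedRows = alignRows(messyRows);
console.log(alignedRows);
```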

Regexp_replace for JSON

How can I use regexp_replace so that phone_num and phone_ext keep only their numeric characters?
[ {
"phone_type":"HOME",
"phone_num":"(+1)123-456-7890",
"phone_ext":"-85254-",
"phone_status":"Y",
},
{
"phone_type":"HOME",
"phone_num":"+001-123-456-7890",
"phone_ext":"85-254",
"phone_status":"N",
}
]
should be displayed as
[ {
"phone_type":"HOME",
"phone_num":"11234567890",
"phone_ext":"85254",
"phone_status":"Y",
},
{
"phone_type":"HOME",
"phone_num":"0011234567890",
"phone_ext":"85254",
"phone_status":"N",
}
]
Well, finding the text is fairly easy.
/phone_(num|ext)"\s*:\s*"([^"]*)",/gmi
The next part is taking the second capture group ([^"]*) inside your match function and stripping all non-numeric characters. This will vary by application.
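Since the question does not say which environment is used, here is one possible way to do that second step, sketched in JavaScript with a replacement callback. The pattern is a slight variation of the one above (it captures the key and opening quote so they can be put back), and stripPhoneNonDigits is a hypothetical function name.

```javascript
// Keep only the digits in the values of "phone_num" and "phone_ext".
function stripPhoneNonDigits(jsonText) {
  return jsonText.replace(
    /("phone_(?:num|ext)"\s*:\s*")([^"]*)(")/gi,
    function (match, prefix, value, closingQuote) {
      // \D matches every character that is not a digit.
      return prefix + value.replace(/\D/g, "") + closingQuote;
    }
  );
}

// Sample input taken from the question (first entry only).
const rawEntry = [
  '"phone_type":"HOME",',
  '"phone_num":"(+1)123-456-7890",',
  '"phone_ext":"-85254-",',
  '"phone_status":"Y",'
].join("\n");

const cleanedEntry = stripPhoneNonDigits(rawEntry);
console.log(cleanedEntry);
```

Working on the text rather than parsing it keeps the sketch usable even though the sample contains trailing commas, which a strict JSON parser would reject.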

How to find all values in word by using regexp in MongoDB?

Let's say I have the following string in MongoDB document:
{"name": "space delimited string"}
I need to build a MongoDB query with a regexp that finds this document when the following search request is entered:
space string
It looks like the LIKE operator in an RDBMS. I know that the latest MongoDB 3 has full-text search, but I need a regexp because my version is outdated.
Please help me construct a MongoDB query with a regexp that finds the document for the search above.
Thanks
As I see it there are a couple of options.
If you mean "AND" for all words then use positive lookahead:
{ "name": /(?=.*\bspace\b)(?=.*\bstring\b).+/ }
or if an $all operator suits you better:
{ "name": { "$all": [/\bspace\b/,/\bstring\b/] } }
And if you mean "OR" for either of the words then you can do:
{ "name": /\bspace\b|\bstring\b/ }
or use an $in operator:
{ "name": { "$in": [/\bspace\b/,/\bstring\b/] } }
Noting that in all cases you likely want those \b boundary matches in there to delimit the "word", or otherwise you are getting "partial" words.
So it depends on which you mean and which suits you best. You can construct the regular expression using its own syntax to mean either "AND" or "OR", or alternatively you can just use the equivalent MongoDB logical expressions ( $all or $in ) that take a "list" of regular expressions instead.
So build a string for regex or build a list. Your choice.
Naturally you need to "break up" the string into "words" in order to process it. There is no language tag here, but as a JavaScript example:
As a single regular expression for "AND":
var searchString = "space string";
var expression = new RegExp(
"" + searchString.split(" ").map(function(word) {
return "(?=.*\\b" + word + "\\b)"
}).join("") + ".+"
)
var query = { "name": expression };
Or for an "OR" condition on a single expression:
var expression = new RegExp(
searchString.split(" ").map(function(word) {
return "\\b" + word + "\\b"
}).join("|")
);
var query = { "name": expression };
Or as a list of expressions:
var type = "AND",
query = { "name": {} };
// List of expressions
var list = searchString.split(" ").map(function(word) {
return new RegExp("\\b" + word + "\\b")
});
// Determine operator based on type
query.name[( type === "AND") ? "$all" : "$in"] = list;
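One caveat when the search string comes from user input: characters such as . or ( would be interpreted as regex syntax. A minimal sketch of the "AND" variant with escaping and whitespace handling follows; escapeRegExp and buildAllWordsRegExp are hypothetical helper names, and the resulting expression is only tested locally here rather than against a server.

```javascript
// Escape regex metacharacters so each word is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a single "AND" expression: one lookahead per word, in any order.
function buildAllWordsRegExp(searchString) {
  const words = searchString
    .trim()
    .split(/\s+/) // tolerate repeated spaces between words
    .filter(function (word) { return word.length > 0; });
  const lookaheads = words.map(function (word) {
    return "(?=.*\\b" + escapeRegExp(word) + "\\b)";
  });
  return new RegExp(lookaheads.join("") + ".+");
}

const allWords = buildAllWordsRegExp("space  string");
const query = { "name": allWords }; // usable as a find() filter

console.log(allWords.test("space delimited string"));
console.log(allWords.test("spaces delimited string"));
```

The second test string contains "spaces" rather than "space", so the \b boundaries reject it.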