MongoDB query with special characters in key - regex

In my case, I have keys in my MongoDB database that contain a dot in their name (see attached screenshot). I have read that it is possible to store data in MongoDB this way, but the driver prevents queries with dots in the key. Anyway, in my MongoDB database, keys do contain dots and I have to work with them.
I have now tried to encode the dots in the query (. to \u002e), but it did not seem to work. Then I had the idea of using a regex to replace the dots in the query with an arbitrary character, but regex seems to only work for values, not for keys.
Does anyone have a creative idea of how I can get around this problem? For example, I want to get all the CVE numbers for 'cve_results.BusyBox 1.12.1'.
Update #1:
The structure of cve_results is as follows:
"cve_results" : {
"BusyBox 1.12.1" : {
"CVE-2018-1000500" : {
"score2" : "6.8",
"score3" : "8.1",
"cpe_version" : "N/A"
},
"CVE-2018-1000517" : {
"score2" : "7.5",
"score3" : "9.8",
"cpe_version" : "N/A"
}
}}

With the following workaround I was able to directly access documents by their keys, even though they have a dot in their key:
db.getCollection('mycollection').aggregate([
    { $match: { mymapfield: { $type: "object" } } },                 // filter objects with the right field type
    { $project: { mymapfield: { $objectToArray: "$mymapfield" } } }, // "unwind" map to array of {k: key, v: value} objects
    { $match: { mymapfield: { k: "my.key.with.dot", v: "myvalue" } } } // query
])
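Applied to the cve_results structure from the question, a sketch along the same lines (the collection name is assumed) that pulls the CVE numbers for 'BusyBox 1.12.1':
db.getCollection('mycollection').aggregate([
    { $match: { cve_results: { $type: "object" } } },
    { $project: { pairs: { $objectToArray: "$cve_results" } } },
    { $unwind: "$pairs" },
    { $match: { "pairs.k": "BusyBox 1.12.1" } },
    // the CVE numbers are the keys of the nested object
    { $project: { cveNumbers: { $map: { input: { $objectToArray: "$pairs.v" }, as: "cve", in: "$$cve.k" } } } }
])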

If possible, it could be worth inserting documents using \u002e instead of the dot; that way you can query them while retaining the ASCII value of the . for any client rendering.
However, it appears there's a workaround to query them like so:
db.collection.aggregate([
    { $match: { "BusyBox 1.12.1": "<value>" } }
])

You should be able to use the $eq operator to query fields with dots in their names.
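On MongoDB 5.0+ there is also the $getField aggregation operator, which reads a field literally even when its name contains dots. A hedged sketch against the structure above (collection name assumed):
db.getCollection('mycollection').aggregate([
    // read the dotted key without path interpretation
    { $project: { busybox: { $getField: { field: "BusyBox 1.12.1", input: "$cve_results" } } } },
    // keep only documents where the key exists
    { $match: { busybox: { $ne: null } } }
])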


Fluentd Parsing

Hi, I'm trying to parse a single-line log using Fluentd. Here is the log I'm trying to parse:
F2:4200000000000000,F3:000000,F4:000000060000,F6:000000000000,F7:000000000,F8..........etc
It should parse into something like this:
{ "F2" : "4200000000000000", "F3" : "000000", "F4" : "000000060000" ............etc }
I tried to use regex, but it's confusing and makes me write multiple regexes for different keys and values. Is there an easier way to achieve this?
EDIT 1: Heya! I will make this more detailed. I'm currently tailing logs with Fluentd into Elasticsearch+Kibana. Here is an unparsed example log that Fluentd sends to Elasticsearch:
21/09/02 16:36:09.927238: 1 frSMS:0:13995:#HTF4J::141:141:msg0210,00000000000000000,000000,000000,007232,00,#,F2:00000000000000000,F3:002000,F4:000000820000,F6:Random message and strings,F7:.......etc
Elasticsearch received message:
{"message":"frSMS:0:13995:#HTF4J::141:141:msg0210,00000000000000000,000000,000000,007232,00,#,F2:00000000000000000,F3:002000,F4:000000820000,F6:Random digits and chars,F7:.......etc"}
This log only has a message key, so I can't index and build dashboards on anything but the whole message field. What I'm trying to achieve is to capture only the useful fields, add a key to any value that lacks one, and make indexing easier.
Expected output:
{"logdate" : "21/09/02 16:36:09.927238",
"source" : "frSMS",
"UID" : "#HTF4J",
"statuscode" : "msg0210",
"F2": "00000000000000000",
"F3": "randomchar314516",.....}
I used the regexp parser plugin to get this far, but it was too overwhelming. Here is what I did so far:
^(?<logDate>\d{2}.\d{2}.\d{2}\s\d{2}:\d{2}:\d{2}.\d{6}\b)....(?<source>fr[A-Z]{3,4}|to[A-Z]{3,4}\b).(?<status>\d\b).(?<dummyfield>\d{5}\b).(?<HUID>.[A-Z]{5}\b)..(?<d1>\d{3}\b).(?<d2>\d{3}\b).(?<msgcode>msg\d{4}\b).(?<dummyfield1>\d{16}\b).(?<dummyfield2>\d{6}\b).(?<dummyfield3>\d{6,7}\b).(?<dummyfield4>\d{6}\b).(?<dummyfield5>\d{2}\b)...
Which results in:
"logDate": "21/09/02 16:36:09.205706",
"source": "toSMS" ,
"status": "0",
"dummyfield": "13995" ,
"UID" : "#HTFAA" ,
"d1" : "156" ,
"d2" : "156" ,
"msgcode" : "msg0210",
"dummyfield1" :"0000000000000000" ,
"dummyfield2" :"002000",
"dummyfield3" :"2000000",
"dummyfield4" :"00",
"dummyfield5" :"2000000" ,
"dummyfield6" :"867202"
This only applies to the example log, and it captures useless fields like field1, dummyfield, dummyfield1, etc.
Other logs have the useful keys and values (date, source, msgcode, UID, F1, F2 fields) as I showcased in the expected output. The not-useful fields are not static (they can be absent, or have more or fewer digits and characters), so they trigger a "pattern not matched" error.
So the questions are:
How do I capture the useful fields I mentioned, using regex?
How do I capture the F1, F2, F3... fields, whose values have different patterns (like strings mixed with chars)?
PS: I wrapped the regex I wrote in an HTML snippet so the <> capture groups don't get deleted.
Regex pattern to use:
(F[\d]+):([\d]+)
This pattern will catch all the 'F' keys with whatever digits come after; even if it's F105, it still works. The whole 'F105' will be stored as the first group of each regex match.
The right part of the pattern catches the digits following ':' up until any character that is not a digit (i.e. ',', 'F', etc.) and stores them as the second group of the match.
Use
Depending on your programming language, you will iterate over the regex matches and extract group 1 and group 2 respectively.
Python example:
import re

log = 'F2:4200000000000000,F3:000000,F4:000000060000,F6:000000000000,F7:000000000,F105:9726450'
pattern = r'(F[\d]+):([\d]+)'  # raw string so \d is not treated as an escape
matches = re.finditer(pattern, log)
log_dict = {}
for match in matches:
    log_dict[match.group(1)] = match.group(2)
print(log_dict)
Output
{'F2': '4200000000000000', 'F3': '000000', 'F4': '000000060000', 'F6': '000000000000', 'F7': '000000000', 'F105': '9726450'}
Assuming the logdate is static (pattern-wise), you can skip the useless values with ".+" and collect the useful values by their patterns. So the regex will be like this:
(?<logdate>\d{2}.\d{2}.\d{2}\s\d{2}:\d{2}:\d{2}.\d{6}\b).+(?<source>fr[A-Z]{3,4}|to[A-Z]{3,4}).+(?<UID>#[A-Z0-9]{5}).+(?<statuscode>msg\d{4})
And the output will be like:
{"logdate" : "21/09/02 16:36:09.927238", "source" : "frSMS",
"UID" : "#HTF4J","statuscode" : "msg0210"}
And I'm working on getting the F2, F3, ..., FN keys and values.
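Building on the two patterns above, a combined Python sketch (the regexes and the log line are assumptions pieced together from the examples in this thread) that captures both the fixed header fields and the F-fields in one pass:
import re

log = ('21/09/02 16:36:09.927238: 1 frSMS:0:13995:#HTF4J::141:141:msg0210,'
       '00000000000000000,000000,000000,007232,00,#,'
       'F2:00000000000000000,F3:002000,F4:000000820000,F6:Random message and strings')

# Fixed fields via named groups (names taken from the expected output).
header = re.search(
    r'(?P<logdate>\d{2}/\d{2}/\d{2}\s\d{2}:\d{2}:\d{2}\.\d{6})'
    r'.+?(?P<source>(?:fr|to)[A-Z]{3,4})'
    r'.+?(?P<UID>#[A-Z0-9]{5})'
    r'.+?(?P<statuscode>msg\d{4})', log)
record = header.groupdict() if header else {}

# F-fields: [^,]+ accepts values that mix digits and other characters.
for m in re.finditer(r'(F\d+):([^,]+)', log):
    record[m.group(1)] = m.group(2)

print(record)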

Does @DynamoDBAttribute support document paths in the attribute name?

I've checked the DynamoDB documentation, and I can't find anything to confirm or deny whether this is allowed.
Is it valid to use a document path for the attributeName of @DynamoDBAttribute, as in this code snippet?
@DynamoDBDocument
public class MyClass {
    @DynamoDBAttribute(attributeName = "object.nestedObject.myAttribute")
    private String myAttribute;

    // ...
    // Getters & setters, etc.
}
Edit: Just to be clear, I am specifically trying to find out whether document paths are valid in the @DynamoDBAttribute Java annotation as a way to directly access a nested value. I know that document paths work in general when specifying a query, but this question is specifically about DynamoDBMapper annotations.
Yes, the attribute name can contain a dot. However, in my opinion, having dots in attribute names is not recommended, because the dot is normally used to navigate the tree inside a Map attribute.
The following are the naming rules for DynamoDB:
All names must be encoded using UTF-8, and are case-sensitive.
Table names and index names must be between 3 and 255 characters long, and can contain only the following characters:
a-z
A-Z
0-9
_ (underscore)
- (dash)
. (dot)
Attribute names must be between 1 and 255 characters long.
Accessing map elements:
The dereference operator for a map element is . (a dot). Use a dot as a separator between elements in a map:
MyMap.nestedField
MyMap.nestedField.deeplyNestedField
I can create an item whose attribute name contains a dot and query it using a FilterExpression successfully. It works similarly in all the AWS language SDKs. As long as the data type is defined as String, it works as expected.
Some JS examples:
Create item:
var table = "Movies";
var year = 2017;
var title = "putitem data test 2";
var dotAttr = "object.nestedObject.myAttribute";

var params = {
    TableName: table,
    Item: {
        "yearkey": year,
        "title": title,
        [dotAttr]: "S123" // computed key: the attribute name contains dots
    },
    ReturnValues: 'NONE'
};
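The FilterExpression query mentioned above has one catch: a dotted name written literally inside the expression string would be parsed as a document path, so it has to go through an ExpressionAttributeNames placeholder. A sketch reusing the variables above (the DocumentClient setup is an assumption):
var AWS = require('aws-sdk');
var docClient = new AWS.DynamoDB.DocumentClient();

var scanParams = {
    TableName: table,
    FilterExpression: "#da = :v",
    ExpressionAttributeNames: { "#da": dotAttr }, // treated as a single attribute name, not a path
    ExpressionAttributeValues: { ":v": "S123" }
};

docClient.scan(scanParams, function(err, data) {
    if (err) console.error(err);
    else console.log(data.Items);
});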
Update:
It works fine with the @DynamoDBAttribute annotation as well.
private String dotAttr;

@DynamoDBAttribute(attributeName = "object.nestedObject.myAttribute")
public String getDotAttr() {
    return dotAttr;
}
It is not possible to reference a nested path using the attribute name in a @DynamoDBAttribute. I needed to use a POJO type, with an added @DynamoDBDocument annotation, to represent each level of nesting.
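For reference, a minimal sketch of that nested-POJO approach (class and member names are hypothetical):
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBAttribute;
import com.amazonaws.services.dynamodbv2.datamodeling.DynamoDBDocument;

@DynamoDBDocument
public class OuterObject {
    private NestedObject nestedObject;

    @DynamoDBAttribute(attributeName = "nestedObject")
    public NestedObject getNestedObject() { return nestedObject; }
    public void setNestedObject(NestedObject nestedObject) { this.nestedObject = nestedObject; }

    // each level of nesting gets its own @DynamoDBDocument type
    @DynamoDBDocument
    public static class NestedObject {
        private String myAttribute;

        @DynamoDBAttribute(attributeName = "myAttribute")
        public String getMyAttribute() { return myAttribute; }
        public void setMyAttribute(String myAttribute) { this.myAttribute = myAttribute; }
    }
}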

regex to convert key values in a column to an hstore or jsonb column?

I have a database that has a table called activity with a column called detail that has this unfortunate representation of key/value pairs:
Key ID=[813],\n
Key Name=[Name of Key],\n
Some Field=[2732],\n
Another Field=[2751],\n
Description=[A text string here],\n
Location=[sometext],\n
Other ID=[2360578],\n
It's maybe clear from the formatting above: this is one value per line, and \n is a newline character, so there's always one extra newline. I'm trying to avoid having an external program process this data, so I'm looking into PostgreSQL's regex functions. The goal is to convert this to a jsonb or hstore column; I don't really care which.
The schema for the table looks like this:
CREATE TABLE activity
(
    id integer NOT NULL,
    activity_type integer NOT NULL,
    ts timestamp with time zone,
    detail text NOT NULL,
    details_hstore hstore,
    details_jsonb jsonb,
    CONSTRAINT activity_pkey PRIMARY KEY (id)
);
So I'd like to run an UPDATE where I update the details_jsonb or details_hstore with the processed data from detail.
This:
select regexp_matches(activity.detail, '(.*?)=\[(.*?)\]\,[\r|\n]', 'g') as val from activity
gets me these individual rows (this is from pgAdmin; I assume these are all strings):
{"Key ID",813}
{"Key Name","Name of Key"}
{"Some Field",2732}
{"Another Field",2751}
{Description,"A text string here"}
{Location,sometext}
{"Other ID",2360578}
I'm not a regex whiz, but I think I need some kind of grouping. Also, that's returning a text array of some kind, but what I really want is something like this for jsonb:
{"Key ID": "813", "Key Name": "Name of Key"}
or even better, if it's a number only then
{"Key ID": 813, "Key Name": "Name of Key"}
and/or the equivalent for hstore.
I feel like I'm a number of regex-in-postgres concepts away from this goal.
First: how do I get ALL the pairs together in some kind of array, not as separate rows?
Second: can I detect whether a value is a number, and optionally get "" around strings and nothing around numbers for jsonb or hstore?
Third: how do I get that as some kind of string/text?
Fourth: how do I write that into another jsonb/hstore field using an update?
Is this kind of regex update too much to get working in an update, i.e. update activity set details_jsonb = [[insane regex here]]? hstore is also an option (though I like that jsonb has types), so if it's easier to go through an hstore function like hstore(text[]), that's fine too.
Am I crazy and do I need to just write an external process not-in-postgresql that does this?
I would first split the single value into multiple lines. Each line can then be converted to an array, which can then be aggregated into a JSON object:
select string_to_array(regexp_replace(t.line, '(^\s+)|(\s+$)', '', 'g'), '=')
from activity a, regexp_split_to_table(a.detail, ',\s*\n') t (line)
This returns the following:
element
------------------------------------
{"Key ID",[813]}
{"Key Name","[Name of Key]"}
{"Some Field",[2732]}
{"Another Field",[2751]}
{Description,"[A text string here]"}
{Location,[sometext]}
{"Other ID",[2360578]}
{}
The regex to split the detail value into lines might need some improvements though.
The regexp_replace(t.line, '(^\s+)|(\s+$)', '', 'g') is there to trim the values before converting them to an array.
Now this can be aggregated into a single JSON value, or each line can be converted into a single hstore value (unfortunately there is no hstore_agg())
with activity (detail) as (
    values (
'Key ID=[813],
Key Name=[Name of Key],
Some Field=[2732],
Another Field=[2751],
Description=[A text string here],
Location=[sometext],
Other ID=[2360578],
')
), elements (element) as (
    select string_to_array(regexp_replace(t.line, '\s', ''), '=')
    from activity a, regexp_split_to_table(a.detail, ',') t (line)
)
select json_agg(jsonb_object(element))
from elements
where cardinality(element) > 1 -- this removes the empty line
The above returns a JSON array:
[ { "KeyID" : "[813]" },
{ "Key Name" : "[Name of Key]" },
{ "Some Field" : "[2732]" },
{ "Another Field" : "[2751]" },
{ "Description" : "[A text string here]" },
{ "Location" : "[sometext]" },
{ "Other ID" : "[2360578]" }
]
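To get from this array of single-pair objects to the single merged object the question asks for, and to write it back with an UPDATE, jsonb_object_agg() (PostgreSQL 9.5+) can aggregate the pairs directly. A sketch, with btrim stripping the brackets; note the values stay JSON strings, and emitting bare numbers would need an extra CASE on the value pattern:
update activity a
set details_jsonb = (
    select jsonb_object_agg(
               btrim(split_part(t.line, '=', 1)),              -- key, whitespace trimmed
               btrim(btrim(split_part(t.line, '=', 2)), '[]')) -- value, brackets stripped
    from regexp_split_to_table(a.detail, ',\s*\n') t (line)
    where t.line like '%=%' -- skips the empty trailing line
);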

Using a CouchDB view, can I count groups and filter by key range at the same time?

I'm using CouchDB. I'd like to be able to count occurrences of values of specific fields within a date range that can be specified at query time. I seem to be able to do parts of this, but I'm having trouble understanding the best way to pull it all together.
Assuming documents that have a timestamp field and another field, e.g.:
{ date: '20120101-1853', author: 'bart' }
{ date: '20120102-1850', author: 'homer'}
{ date: '20120103-2359', author: 'homer'}
{ date: '20120104-1200', author: 'lisa'}
{ date: '20120815-1250', author: 'lisa'}
I can easily create a view that filters documents by a flexible date range. This can be done with a view like the one below, called with key range parameters, e.g. _view/all-docs?startkey="20120101-0000"&endkey="20120201-0000" (keys are JSON, so string keys need quotes).
all-docs/map.js:
function(doc) {
    emit(doc.date, doc);
}
With the data above, this would return a CouchDB view containing just the first 4 docs (the only docs in the date range).
I can also create a query that counts occurrences of a given field, like this, called with grouping, i.e. _view/author-count?group=true:
author-count/map.js:
function(doc) {
    emit(doc.author, 1);
}
author-count/reduce.js:
function(keys, values, rereduce) {
    return sum(values);
}
This would yield something like:
{
    "rows": [
        {"key":"bart","value":1},
        {"key":"homer","value":2},
        {"key":"lisa","value":2}
    ]
}
However, I can't find the best way to both filter by date and count occurrences. For example, with the data above, I'd like to be able to specify range parameters like startkey="20120101-0000"&endkey="20120201-0000" and get a result like this, where the last doc is excluded from the count because it is outside the specified date range:
{
    "rows": [
        {"key":"bart","value":1},
        {"key":"homer","value":2},
        {"key":"lisa","value":1}
    ]
}
What's the most elegant way to do this? Is this achievable with a single query? Should I be using another CouchDB construct, or is a view sufficient for this?
You can get pretty close to the desired result with a list:
{
    _id: "_design/authors",
    views: {
        authors_by_date: {
            map: function(doc) {
                emit(doc.date, doc.author);
            }
        }
    },
    lists: {
        count_occurrences: function(head, req) {
            start({ headers: { "Content-Type": "application/json" } });
            var result = {};
            var row;
            while (row = getRow()) {
                var val = row.value;
                if (result[val]) result[val]++;
                else result[val] = 1;
            }
            return JSON.stringify(result); // list functions must return a string
        }
    }
}
This design can be requested as such:
http://<couchurl>/<db>/_design/authors/_list/count_occurrences/authors_by_date?startkey=<startDate>&endkey=<endDate>
This will be slower than a normal map-reduce, and is a bit of a workaround. Unfortunately, this is the only way to do a multi-dimensional query, "which CouchDB isn’t suited for".
The result of requesting this design will be something like this:
{
"bart": 1,
"homer": 2,
"lisa": 2
}
What we do is basically emit a lot of elements, then use a list to group them as we want. A list can be used to display a result in any way you want, but it will also often be slower. Whereas a normal map-reduce can be cached and only changes according to the diffs, the list has to be built anew every time it is requested.
It is pretty much as slow as getting all the elements resulting from the map (the overhead of orchestrating the data is mostly negligible): a lot slower than getting the result of a reduce.
If you want to use the list for a different view, you can simply exchange it in the URL you request:
http://<couchurl>/<db>/_design/authors/_list/count_occurrences/<view>
Read more about lists on the couchdb wiki.
You need to create a combined view:
combined/map.js:
function(doc) {
    emit([doc.date, doc.author], 1);
}
combined/reduce.js:
_sum
This way you will be able to filter documents by start/end date:
startkey=["20120101-0000", "a"]&endkey=["20120201-0000", "a"]
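If the range should cover every author rather than bounding the second key component with "a", CouchDB's collation order can help (a note based on the collation rules, an assumption on my part): an empty object sorts after any string, so it works as a high sentinel for the author component:
startkey=["20120101-0000"]&endkey=["20120201-0000", {}]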
Although your problem is hard to solve in the general case, knowing some more restrictions on the possible queries can help a lot. E.g. if you know you will search on ranges that cover full days/months, you can use arrays of [year, month, day, time] instead of the string:
emit([doc.date_year, doc.date_month, doc.date_day, doc.date_time, doc.author], doc);
Even if you cannot predict that all possible queries will fit into grouping based on this key type, splitting the key may help you optimize your range queries and decrease the number of lookups needed (at the cost of some extra space).

How to query MongoDB for matching documents where item is in document array

If my stored document looks like this:
doc = {
    'Things' : [ 'one', 'two', 'three' ]
}
How can I query for documents whose Things array contains 'one'?
I know the $in operator queries a document item against a list, but this is kind of the reverse. Any help would be awesome.
Use MongoDB's multikeys support:
MongoDB provides an interesting "multikey" feature that can automatically index arrays of an object's values.
[...]
db.articles.find( { tags: 'april' } )
{"name" : "Warm Weather" , "author" : "Steve" ,
"tags" : ["weather","hot","record","april"] ,
"_id" : "497ce4051ca9ca6d3efca323"}
Basically, you don't have to worry about the array-ness of Things; MongoDB will take care of that for you. Something like this in the MongoDB shell would work:
db.your_collection.find({ Things: 'one' })
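If you need to match against several values at once, $in also works on array fields, matching documents whose Things array contains at least one of the listed values; $all requires all of them:
db.your_collection.find({ Things: { $in: ['one', 'two'] } })  // contains 'one' OR 'two'
db.your_collection.find({ Things: { $all: ['one', 'two'] } }) // contains 'one' AND 'two'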