I am searching for an efficient way to access a deeply nested element in JSON.
Let's say:
{
    "parent": [
        0,
        {
            "child": "nested elem"
        },
        "string"
    ]
}
I want to access the "nested elem" value with a JSON pointer/URL: /parent/1/child
I have tried a recursive function that checks at each node whether the value is an array or an object (structured) and gets the nested value, but it involved a lot of copies and I thought there might be a better way.
I am aware of boost::json::value::at_pointer(); it would do the job perfectly, but I'm stuck with Boost.JSON 1.75 and I cannot upgrade to 1.81 for various reasons.
Thank you for your help!
While searching for entries in a MongoDB instance using the text indexing function of MongoDB, I seem to receive results which contain any of the words in the input string. So for example if I search for 'google seo', it'd return results for google seo, google, and seo. I only need it to return results which have the entire string, or at least both of the words somewhere in the sentence, so results like 'Why should I google seo', 'What is google seo', 'What does google have to do with seo' etc. should return. Any combination of the above would be perfect.
I can currently mitigate the entire issue by just using a MongoDB regex, but that is way slower than the index search as I have over 250m entries. As a test, index searches took 1.72s on average whilst the regex searches took over 27.23s. I want the speed of the index searches with even just half the accuracy of the regex searches; if the user can search quicker, it doesn't really matter if the results aren't the most accurate.
Also, programmatically creating regex searches to match all the words in a string wherever they are located in the input string is a lot of unnecessary code which also isn't 100% accurate.
The current database schema is as follows:
{
    _id: 0000000000,
    search_string: string,
    difficulty: number,
    clicks: number,
    volume: number,
    keyword: string
}
The backend is a NodeJS server.
Any help is appreciated. Thanks!
Would combining the two approaches (text search and a regex) work?
No playground link since this needs a text index to demonstrate, but consider the following sample documents:
test> db.foo.find()
[
{ _id: 1, val: 'google seo' },
{ _id: 2, val: 'google ' },
{ _id: 3, val: 'seo random ' },
{ _id: 4, val: 'none' }
]
As described in the question and noted in the documentation, a search on 'google seo' returns all documents that match at least one of those terms (3 of the 4 in this sample data):
test> db.foo.find({$text:{$search:'google seo'}})
[
{ _id: 2, val: 'google ' },
{ _id: 1, val: 'google seo' },
{ _id: 3, val: 'seo random ' }
]
If we expand the query predicates to also include regexes on both of the terms via the $all operator, the results are narrowed down to just the single document:
test> db.foo.find({$text:{$search:'google seo'}, val:{$all:[/google/i, /seo/i]}})
[
{ _id: 1, val: 'google seo' }
]
It also works if the words are out of order as we'd expect:
test> db.foo.insert({_id:5, val:'seo out of order google string'})
{ acknowledged: true, insertedIds: { '0': 5 } }
test> db.foo.find({$text:{$search:'google seo'}, val:{$all:[/google/i, /seo/i]}})
[
{ _id: 1, val: 'google seo' },
{ _id: 5, val: 'seo out of order google string' }
]
The database first selects the candidate documents using the text index and then performs the final filtering via the regex prior to returning them to the client.
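Since the question mentions a Node.js backend, roughly the same combined query could be issued from the official mongodb Node.js driver. The sketch below is untested and assumes the db/collection names and the val field from the sample documents above; adjust to your schema:

// Minimal sketch: text-index search narrowed down by one regex per word.
// Connection string, db name and collection name are placeholders.
// Words are assumed to be plain alphanumeric terms (no regex escaping here).
const { MongoClient } = require('mongodb');

async function searchBoth(words) {
  const client = new MongoClient('mongodb://localhost:27017');
  try {
    await client.connect();
    const coll = client.db('test').collection('foo');

    // The text index selects the candidates cheaply; $all with one
    // case-insensitive regex per word keeps only documents containing all of them.
    const regexes = words.map(w => new RegExp(w, 'i'));
    return await coll
      .find({ $text: { $search: words.join(' ') }, val: { $all: regexes } })
      .toArray();
  } finally {
    await client.close();
  }
}

// Example: searchBoth(['google', 'seo']).then(console.log);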
Alternatively, if you are using Atlas, you might look into the Atlas Search functionality. It seems like must or filter would satisfy this use case as well (reference).
What worked for me
Running any sort of regex on possibly hundreds of thousands of data points will always be very time and resource intensive. Also, doing it natively with MongoDB means that data is not sent in chunks / asynchronously (at least as far as my knowledge extends).
Instead, there are two approaches that can decrease time, server resources, or bandwidth usage.
Using the server to process the data before sending it over. This might seem obvious, but if you have the hardware overhead to perform such an operation on the server end then it is much better and faster to run string comparisons server side and send the data back in chunks to be lazy loaded into your app.
For me this decreased search times from over 29.3s to just below 2.23s on average, with a database of 250m entries, around 80k results per search, and around 10k-15k filtered results.
If you don't have the processing overhead and are willing to sacrifice bandwidth and user experience, then doing this on the client side isn't out of the question, especially considering just how capable modern hardware is. This provides a bit more flexibility, such that all the data can be shown with the relevant data shown first and the other 'irrelevant' data shown last. This does need to be well optimized and implemented with the supposition that your app would mostly be run on modern hardware, or preferably not on mobile devices.
These were the best solutions for me. There might be better native techniques, but over the span of a week I wasn't able to find any better (faster) solutions.
EDIT
I feel it's kind of necessary to elaborate on what kind of processing the data undergoes before it is sent out and exactly how I do it. Currently I have a database of around 250m entries, each entry having the schema described in the question. The average query would usually be something like 'who is putin', 'android', 'iphone 13' etc. The database is made up of 12 collections, one for each 'major' keyword (what, why, should, how, when, which, who etc.), so the query is first stripped of those. So if the query was 'who is putin' it is converted to just 'is putin'. For cases where there is no keyword, all collections are checked. If there is a better way, let me know.
After that we send the query to the database and retrieve the results.
The query then undergoes another function which rids it of 'unnecessary' words, so words like is, if, a, be etc. are also removed, and it returns an array of the major words; a query like 'What is going on between Russia and Ukraine' gets converted to ['going', 'between', 'Russia', 'Ukraine']. As the results are received, we go over each of them to see if they include all the words from the array, and whichever do are returned to the client. Pretty basic operation here as we don't care about cases, spaces and so on; it simply uses the JavaScript includes() method. Retrieving a query with precisely 2,344,123 results takes around 2.12s cold to return the first results and just over 8.32s cold to end. Running the same query again reduces times to around 0.84s warm and 1.98s warm to finish (cold for the first request and warm for subsequent requests).
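To make the description above concrete, here is a rough sketch of the kind of post-filtering described; the word lists and the search_string field name are illustrative rather than the exact production code:

// 'Major' keywords (one collection each) and throw-away words; both lists
// are illustrative and would be longer in practice.
const MAJOR_KEYWORDS = ['what', 'why', 'should', 'how', 'when', 'which', 'who'];
const STOP_WORDS = ['is', 'if', 'a', 'be', 'the', 'and', 'on', 'of'];

// 'What is going on between Russia and Ukraine' -> ['going', 'between', 'russia', 'ukraine']
function significantWords(query) {
  return query
    .toLowerCase()
    .split(/\s+/)
    .filter(w => w && !MAJOR_KEYWORDS.includes(w) && !STOP_WORDS.includes(w));
}

// Keep only the results whose search_string contains every significant word.
function filterResults(results, query) {
  const words = significantWords(query);
  return results.filter(r => {
    const s = r.search_string.toLowerCase();
    return words.every(w => s.includes(w));
  });
}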
I am doing some tests with DynamoDB locally and I have seen a behaviour that I can't explain, which leads to a particular question. For context, I was doing my tests with the Node.js SDK v3 using DynamoDBDocumentClient (a utility that converts JavaScript objects to DynamoDB attributes: https://docs.aws.amazon.com/sdk-for-javascript/v2/developer-guide/dynamodb-example-document-client.html).
So I simply noticed locally that when I was calling PutItem with a list of maps it was a lot faster (300ms) than when calling with a map of lists (4s).
List of maps (case 1):
{
    "L": [
        {
            "M": {
                "Element1": {
                    "L": [
                        { "S": "1" },
                        { "N": "1660964494" },
                        { "S": "1" }
                    ]
                }
            }
        }, // ... more elements continuing here
Map of lists (case 2):
{
    "M": {
        "Element1": {
            "L": [
                { "S": "1" },
                { "N": "1660964550" },
                { "S": "1" }
            ]
        }, // More and more elements after that...
Also, as expected, depending on the structure of the data the request will cost more or fewer capacity units. Locally, for my data I have:
case 1 => 176 CapacityUnits
case 2 => 138 CapacityUnits
However, since case 1 seems a lot faster than case 2 locally (maybe because of the way DynamoDB stores lists and maps), I would like to know whether it would be better to use case 2, since it uses fewer capacity units and so, I guess, I will pay less.
Maybe case 1 is better because there is a cost in DynamoDB for the speed of the request? (I can't find any documentation on this.)
Or maybe it's just local behaviour and the speed doesn't correlate with production DynamoDB?
A write capacity unit is 1 KB, suggesting that your list-of-maps and map-of-lists have slightly different lengths: 176 KB vs 138 KB. I don't think it's surprising; even in your example the first one seems a little bit longer. It's not a big difference... But obviously the shorter version has the advantage that it will cost you less to write, store, and finally read. However, I suggest that you measure this on the actual DynamoDB, not on DynamoDB Local, which isn't guaranteed to be identical to DynamoDB.
However, you should also keep in mind that having very large items is not a good idea in DynamoDB. There is a hard limit of 400 KB, which your 176 KB example comes pretty close to reaching, so if your use case grows a bit it will exceed this hard limit. Also, every read of the 176 KB item will need to read it in its entirety, and every modification to a small piece will need to re-write the entire 176 KB, costing you a lot.
Instead, a third alternative you can consider instead of a list is a partition: a partition can have multiple items (identified by different sort keys), so instead of a list you can keep the items in a partition. This will allow you to read or modify an individual item without paying for reading or writing the entire list, but will also allow you to read the entire list (the partition) when you want to.
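To illustrate the partition idea, here is a rough sketch with the AWS SDK for JavaScript v3 (the same DynamoDBDocumentClient mentioned in the question); the table and attribute names (MyTable, pk, sk, values) are made up for the example:

// Each element of the former list becomes its own item sharing a partition key.
const { DynamoDBClient } = require('@aws-sdk/client-dynamodb');
const { DynamoDBDocumentClient, PutCommand, QueryCommand } = require('@aws-sdk/lib-dynamodb');

const doc = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Write (or rewrite) a single element without touching the others.
async function putElement(parentId, elementName, values) {
  await doc.send(new PutCommand({
    TableName: 'MyTable',
    Item: { pk: parentId, sk: elementName, values }, // e.g. sk: 'Element1'
  }));
}

// Read the whole "list" back by querying the partition.
async function getAllElements(parentId) {
  const out = await doc.send(new QueryCommand({
    TableName: 'MyTable',
    KeyConditionExpression: 'pk = :pk',
    ExpressionAttributeValues: { ':pk': parentId },
  }));
  return out.Items;
}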
I currently have a custom visual which works great; it displays a gauge based upon an input measure, whose fill can change colour based upon predefined limits.
Now the percentage is based upon an external number, 33, which is entered as text within the gauge definition. However, I would like this to be entered as a measure, since that way it can be driven by an external source (a SharePoint list, for example).
However, I'm having great difficulty using more than a single measure in my capabilities file. I understand that usually you have a category and several measures which relate to elements within the category (thinking of graphs etc.).
I currently have the following data roles section within my capabilities file:
"dataRoles": [
{
"displayName": "Value 1",
"name": "dataValue1",
"kind": "Measure"
},
{
"displayName": "Value 2",
"name": "dataValue2",
"kind": "Measure"
}
],
The data view mappings section is of that below:
"dataViewMappings": [
{
"conditions": [
{
"dataValue1": {
"max": 1
}
},
{
"dataValue2": {
"max": 1
}
}
],
"single": {
"role": ""
}
}
]
It compiles and seems to work until you add a second measure; then weird things happen (yes, I know that's not a technical explanation :) ), but I'll explain.
I also have a section which defines the colours and at what value each colour is used, and whilst I can switch off the title etc. with no issue, the custom section toggles quickly from being switched off back to on again (so it stays at the same value).
I know that this is something to do with the multiple measures that I'm trying to implement, since without them it works flawlessly. Any help, or if anyone has source code of a visual using multiple independent measures, I'd be most grateful.
Kind Regards.
It would appear that you can only bring in multiple measures if they are grouped as opposed to being single entities in their own right.
So I've gone about this in a different manner.
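For what it's worth, a grouped mapping might look roughly like the sketch below, using a categorical data view mapping instead of the "single" mapping from the question. The role names dataValue1/dataValue2 are taken from the question; the exact shape is an assumption and may need adjusting for your visual:

"dataViewMappings": [
    {
        "conditions": [
            {
                "dataValue1": { "max": 1 },
                "dataValue2": { "max": 1 }
            }
        ],
        "categorical": {
            "values": {
                "select": [
                    { "bind": { "to": "dataValue1" } },
                    { "bind": { "to": "dataValue2" } }
                ]
            }
        }
    }
],

The visual would then read both measures from dataView.categorical.values rather than from the single value.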
I am storing transactions tagged with geo-coordinates in a DynamoDB table. I want to do a geo-spatial query to find all transactions within e.g. 10 miles distance from an input pair of coordinates.
I saw here that I could perform such a geo-spatial query using AWS ElasticSearch. However, I am not sure if I want to pay the hourly fee for the service at this time if that is the only purpose I will use it for.
An alternative I thought of is to keep only 4 digits after the decimal point of each coordinate when storing, and then read all the transactions that have the same set of coordinates, since they would essentially belong to the same roughly 100~200 m^2 area. This isn't a very good solution in terms of accuracy and range.
Any suggestion to a better alternative for such a geo-spatial query or on whether ElasticSearch would be a worthy investment based on time/cost?
You could consider using "Geo Library for Amazon DynamoDB". Features include Radius Queries: "Return all of the items that are within a given radius of a geo point."
It seems to have at least Java and JavaScript versions:
https://github.com/awslabs/dynamodb-geo
https://www.npmjs.com/package/dynamodb-geo
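Based on the JavaScript package's README, a radius query looks roughly like the sketch below; treat the exact API as an assumption and check the package docs, and note that the table name and coordinates are placeholders:

// Rough sketch of a radius query with the dynamodb-geo npm package
// (it queries a table created and populated through the same library).
const AWS = require('aws-sdk');
const ddbGeo = require('dynamodb-geo');

const ddb = new AWS.DynamoDB();
const config = new ddbGeo.GeoDataManagerConfiguration(ddb, 'TransactionsGeo');
const geoTableManager = new ddbGeo.GeoDataManager(config);

// "All transactions within ~10 miles (about 16 km) of an input coordinate pair."
geoTableManager
  .queryRadius({
    RadiusInMeter: 16000,
    CenterPoint: { latitude: 40.7128, longitude: -74.006 },
  })
  .then(items => console.log(items))
  .catch(console.error);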
Elasticsearch seems to support GeoHashing natively so it will probably have even better performance: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-geohashgrid-aggregation.html
Personally I would recommend using Elasticsearch for searching because it's extremely powerful at that and searching with DynamoDB can be difficult.
You can't change the data type after the index field is created. I used this code in Kibana to declare the data type as "geo_point". Then I uploaded an item with the geopoint field and it worked.
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-geo-distance-query.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/geo-point.html
POST yourProjectName/_mappings/yourProjectType
{
    "properties": {
        "geopoint or whatever field you're storing the geo data": {
            "type": "geo_point"
        }
    }
}
POST _search
{
    "query": {
        "bool": {
            "must": {
                "match": {
                    "summary": "something"
                }
            },
            "filter": {
                "geo_distance": {
                    "distance": "12km",
                    "geopoint": "40.054447974637476,-82.92002800852062"
                }
            }
        }
    }
}
This page shows an update reaching into a previously retrieved (find) document and querying a sub-element (array) to update it. I pretty much need to do the exact same thing. Code for the example:
> t.find()
{ "_id" : ObjectId("4b97e62bf1d8c7152c9ccb74"), "title" : "ABC",
"comments" : [ { "by" : "joe", "votes" : 3 }, { "by" : "jane", "votes" : 7 } ] }
> t.update( {'comments.by':'joe'}, {$inc:{'comments.$.votes':1}}, false, true )
What are the rules governing find-followed-by-update? I haven't noticed an explanation for this in the documentation. Does the same apply to using MongoDB via drivers? A link to the relevant semantics would be helpful. I am using the C++ driver.
Edit: self-answer
The two commands can be rolled into one (and this is one way of removing the ambiguity this question raises): the query part of an update can refer to an array sub-element, and the $ symbol will reference it. I assume you can only reference one sub-element in the query part of an update operation. In my case the update operation looks as follows:
db.qrs.update ( { "_id" : ObjectId("4f1fa126adf93ab96cb6e848"), "urls.u_id" : 171 }, { "$inc" : { "urls.$.CC": 1} })
The _id correctly "primes" the right unique row, and the second query element "urls.u_id" : 171 ensures that the row in question has the right field. urls.$.CC then routes the $inc operation to the correct array entry.
Recommendation to any MongoDB dev or documentation writer
Do not show examples which have potential race conditions in them. Avoid showing as multiple operations something that can be done atomically.
The rules are relatively straightforward. The results of the update may or may not be available to any subsequent reads depending on a number of things (slaveOk true/false in combination with replica sets, update and find using different connections, write safety). You can guarantee it to be available if you do a safe write (w >= 1) and perform the find on the same connection. Most drivers offer functionality for this (typically "requestStart" and "requestDone").
All that said, there's a much better solution available to you for this, namely findAndModify. This operation finds a document, updates it and returns either the old version of the document or the newly updated version. This command is available in the C++ driver. For a reference look here : http://www.mongodb.org/display/DOCS/findAndModify+Command
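For completeness, the update from the question expressed as a single findAndModify in the shell might look like this (same collection and field names as above; the drivers expose an equivalent call):

// Find the document, bump the matched array element's counter and return
// the updated document in one atomic operation.
db.qrs.findAndModify({
    query: { _id: ObjectId("4f1fa126adf93ab96cb6e848"), "urls.u_id": 171 },
    update: { $inc: { "urls.$.CC": 1 } },
    new: true   // return the post-update version of the document
})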
EDIT: Please note that the "find" in the example is only there to show the reader of the documentation what the structure/schema of the documents inside the collection is, in order to place the subsequent "update" in context. The "update" operation is in no way affected by the "find" before it.