MarkLogic Optic JavaScript Geospatial Difference - map/reduce

I want to reduce the selected items by their distance from a point using MarkLogic Optic.
I have a table with data and a lat/long pair:
const geoData = op.fromView("namespace", "coordinate");
geoData.where(op.le(distance(-28.13, 153.4, geoData.col("lat"), geoData.col("long")), 100))
The distance function I have already written; it utilises geo.distance(cts.point(lat, long), cts.point(lat, long)). The problem is that geoData.col("lat") passes an object that describes the fully qualified name of the column, not its value:
op.schemaCol('namespace', 'coordinate', 'long')
I suspect I need a map/reduce function, but the MarkLogic documentation gives the usual simplistic examples that are next to useless.
I would appreciate some help.
FURTHER INFORMATION
I have mostly solved this problem, except that some columns have null values. The data is sparse and not all rows have a lat/long.
So when cts.point runs in the where statement and two null values are passed, it raises an exception.
How do I coalesce, or prevent execution of, cts.point when the data columns are null? I don't want to reduce the data set, as the records with null values still need to be returned; they will just have a null distance.

Where possible, it's best to do filtering by passing a constraining cts.query() to where().
A constraining query matches the indexed documents and filters the row set down to the rows that the TDE projected from those documents, before the filtered rows are retrieved from the indexes.
If the lat and long columns are each distinct JSON properties or XML elements in the indexed documents, it may be possible to express the distance constraint using techniques similar to those summarized here:
http://docs.marklogic.com/guide/search-dev/geospatial#id_42017
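For instance, a rough sketch along those lines, assuming the lat and long values are JSON properties under a common parent property and that a matching geospatial property-pair index has been configured (the property names and the radius units here are assumptions, not taken from your schema):

const op = require('/MarkLogic/optic');
// Constrain rows at the index level with a geospatial pair query:
// 'coordinate' is the hypothetical parent property, 'lat'/'long' its children,
// and the circle keeps only points within 100 units of the centre
// (the units depend on the query/index options).
const constrained = op.fromView('namespace', 'coordinate')
  .where(cts.jsonPropertyPairGeospatialQuery(
    'coordinate',
    'lat',
    'long',
    cts.circle(100, cts.point(-28.13, 153.4))
  ));
constrained.result();
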
In general, it's best to use map/reduce SJS functions for postprocessing on the filtered result set, because the rows have to be retrieved to the e-node to be processed in SJS.
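For the null-value case, one possibility is a minimal sketch using the Optic .map() postprocessing hook; it assumes an empty view qualifier so the row properties are the bare column names, and it reuses the view, column names, and reference point from your question:

const op = require('/MarkLogic/optic');
// Pass '' as the qualifier so rows expose plain 'lat'/'long' properties
// instead of fully qualified 'namespace.coordinate.lat' names.
op.fromView('namespace', 'coordinate', '')
  .map(row => {
    // Only build cts.points when both coordinates are present; sparse rows
    // keep a null distance instead of raising an exception.
    if (row.lat != null && row.long != null) {
      row.distance = geo.distance(
        cts.point(-28.13, 153.4),
        cts.point(row.lat, row.long)
      );
    } else {
      row.distance = null;
    }
    return row;
  })
  .result();
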
Hoping that helps,

Related

Return subset of attribute value when reading item

My DynamoDB table stores multiple attributes on each item, one of which is an array with a relatively large number of entries. Whenever I read an item from my table, I really only need a certain subset of that array, so it's a waste of data throughput to query the entire thing.
Example: A good analogy would be stock price history; each entry in my table represents a certain stock, and the array attribute is the price history of the stock. When I query a stock, I will always know a start and end date (index of my array), which will usually be a very small subset of the entire array, so I'd optimally like to return my item with just that subset filled in the array attribute.
I guess a more standard way of doing this would be to use a relational database instead, with a "prices" table, but that wouldn't fit well with the rest of my model, while also ending up being a table with an absolutely humongous number of entries.
Really, what I'm trying to do is just reduce the cost of my AWS calls.
Note: the question asks how to return a subset of a single attribute value (which is unsupported in DynamoDB) rather than how to return a subset of the available attributes (which is supported). I've answered the latter.
You haven't indicated which language SDK you are using, but all SDKs expose the DynamoDB GetItem API call and they all allow you to indicate which attributes you want returned to you in ProjectionExpression or the legacy AttributesToGet.
An example ProjectionExpression is "Artist, Genre" for the two named attributes.
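As an illustration only, a sketch with the AWS SDK for JavaScript (v2) DocumentClient; the table name, key, and attribute names here are hypothetical:

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();
// GetItem with a ProjectionExpression returns only the named top-level
// attributes; everything else on the item stays on the server.
docClient.get({
  TableName: 'Music',
  Key: { SongId: '123' },
  ProjectionExpression: 'Artist, Genre'
}, (err, data) => {
  if (err) console.error(err);
  else console.log(data.Item); // { Artist: ..., Genre: ... }
});

Note that this trims which attributes come back, not which elements of an array-valued attribute come back, which is the limitation described above.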

Setting PowerBi filter on Report Load and maintaining dataColor order

I'm trying to set a Power BI report filter on load while also maintaining my dataColors array positions. I've created a video to illustrate my issue - I hope that's allowed...
https://www.loom.com/share/40f0040311ee4487a46a0ad23c6ea1c9
When I apply a filter, the behaviour differs from selecting the filter in the UI. I guess it's because there is only one strand of data on load, so it takes the first data colour from the array, but I'd like to maintain the original order.
Any help appreciated - cheers!
Rob
Currently, it is not possible to maintain the dataColors array position with respect to filters/data. The dataColors array within the themes file will be applied sequentially; the color at position one will be applied to the very first data strand, which is what happened in this scenario.

Google Sheets Array formula for counting the number of values in each column

I'm trying to create an array formula to auto-populate the total count of values for each column as columns are added.
I've tried doing this using a combination of COUNT and INDIRECT, and have also tried my hand at QUERY, but I can't seem to get it to show the unique value counts for each column.
This is my first time attempting to use query, and at first it seemed possible from reading through the documentation on the query language, but I haven't been able to figure it out.
Here's the shared document: https://docs.google.com/spreadsheets/d/15VwsL7uTsORLqBDrnT3VdwAWlXLh-JgoJVbz7wkoMAo/edit?usp=sharing
I know I can do this by writing a custom function in apps script, but I'd like to use the built-in functions if I can for performance reasons (there is going to be a lot of data), and I want quick refresh rates.
try:
=ARRAYFORMULA(IF(B5:5="",,TRANSPOSE(MMULT(TRANSPOSE(N(B6:99<>"")), SIGN(ROW(B6:99))))))
In B3 try
=ArrayFormula(IF(LEN(B5:5), COUNTIF(IF(B6:21<>"", COLUMN(B6:21)), COLUMN(B6:21)),))

How to get row count for large dataset in Informatica?

I am trying to get the row count for a dataset with 280 fields without affecting performance. Looking for the best possible ways to do this.
The better option to avoid performance issues is to use a Sorter transformation to sort the columns, then pass the pipeline to an Aggregator transformation. In the Aggregator transformation, check the Sorted Input option.
If your source is a database, index the required conditional columns in the table and also partition the table if required.
For your solution, I have two options in mind:
Using an Aggregator (remember to use a predefined ORDER BY to improve performance with the next transformation): SQ > Aggregator > Target. Inside the Aggregator, add new ports with the sum() and/or count() functions. Remember to select the columns to group by.
Check out this example:
https://www.guru99.com/aggregator-transformation-informatica.html
Using a Source Qualifier query override: use a traditional SELECT count/sum with GROUP BY from the database - SQ > Target.
By the way, Informatica is very good with performance; rather than the number of columns, you need to review how many records you are processing. A best practice is always to stress the data source/database more than the Infa app.
Regards,
Juan
If all you need is to count the rows, use the Aggregator. That's what it's for. However, this will create a cache - to limit its size, use a single port.
To avoid caching, you can use a variable in an Expression transformation and just increment it. However, this will give you an extra column with all rows numbered rather than a single value, so you'll still need to aggregate it. Here it would be possible to use an Aggregator with no function, to return just the last value.

MongoDB: what is the most efficient way to query a single random document?

I need to pick a document from a collection at random (alternatively - a small number of successive documents from a randomly-positioned "window").
I've found two solutions: 1 and 2. The first is unacceptable since I anticipate a large collection size and wish to minimize the document size. The second seems ineffective (I'm not sure about the complexity of the skip operation). And here one can find a mention of querying a document at a specified index, but I don't know how to do it (I'm using the C++ driver).
Are there other solutions to the problem? Which is the most efficient?
I had a similar issue once. In my case, I had a date property on my documents. I knew the earliest date possible in the dataset, so in my application code I would generate a random date between EARLIEST_DATE_IN_SET and NOW, then query MongoDB with a $gte query on the date property and simply limit it to 1 result.
There was a small chance that the random date would be greater than the highest date in the data set, so I accounted for that in the application code.
With an index on the date property, this was a super fast query.
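A rough sketch of that approach in mongo shell syntax (the question mentions the C++ driver, but the query shape carries over; the collection name, field name, and earliest date here are hypothetical):

// Earliest date known to exist in the data set (hypothetical value).
var earliest = new Date("2015-01-01T00:00:00Z").getTime();
var span = Date.now() - earliest;
var randomDate = new Date(earliest + Math.floor(Math.random() * span));
// With an index on createdAt, this is an index seek plus a single document fetch.
var doc = db.stocks.find({ createdAt: { $gte: randomDate } })
                   .sort({ createdAt: 1 })
                   .limit(1)
                   .toArray()[0];
// Account for the small chance that randomDate falls past the newest document.
if (!doc) {
  doc = db.stocks.find().sort({ createdAt: -1 }).limit(1).toArray()[0];
}
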
It seems like you could mold solution 1 there (assuming your _id key is an auto-incrementing value): just do a count on your records, use that as the upper limit for a random int in C++, then grab that row.
Likewise, if you don't have an auto-inc _id key, just create one with your results... having an additional INT field shouldn't add that much to your document size.
If you don't have an auto-inc field, Mongo talks about how to quickly add one here:
Auto Inc Field.
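For completeness, a sketch of that count-plus-random-index idea in mongo shell syntax, assuming a hypothetical auto-incrementing seq field maintained as described in the linked docs:

// Count the documents, pick a random position, and fetch the matching row.
// Assumes seq runs from 0 to total - 1; shift by one if yours starts at 1.
// An index on seq keeps the lookup fast.
var total = db.stocks.count();
var pick = Math.floor(Math.random() * total);
var doc = db.stocks.findOne({ seq: pick });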