Fetching With Distinct/Unique Values - mapreduce

I have a Cloudant database with objects that use the following format:
{
"_id": "0ea1ac7d5ef28860abc7030444515c4c",
"_rev": "1-362058dda0b8680a818b38e9c68c5389",
"text": "text-data",
"time-data": "1452988105",
"time-text": "3:48 PM - 16 Jan 2016",
"link": "http://url/to/website"
}
I want to fetch objects where the text attribute is distinct. There will be objects with duplicate text and I want Cloudant to handle removing them from a query.
How do I go about creating a MapReduce view that will do this for me? I'm completely new to MapReduce and I'm having difficulty understanding the relationship between the map and reduce functions. I tried tinkering with the built-in COUNT function and writing my own view, but they've failed catastrophically, haha.
Anyways, would it be easier to just delete the duplicates? If so, how do I do that?
While I'm trying to study this and find ELI5s, would anyone help me out? Thanks in advance! I appreciate it.

I'm not sure a MapReduce view is what you are looking for. A MapReduce view will essentially allow you to get the text and the number of docs with that same text, but you really won't be able to get the rest of the fields in the doc (because MapReduce has no idea which doc to return when multiple docs match the text). Here is a sample MapReduce view:
{
"_id": "_design/textObjects",
"views": {
"by_text": {
"map": "function (doc) { if (doc.text) { emit(doc.text, 1); }}",
"reduce": "_count"
}
},
"language": "javascript"
}
What this is doing:
The map part of the MapReduce takes each doc and emits a key/value row that looks like this:
{"key":"text-data", "value":1}
So, if you had 7 docs, 2 where text="text-data" and 5 where text="other-text-data", the emitted rows would look like this:
{"key":"text-data", "value":1}
{"key":"text-data", "value":1}
{"key":"other-text-data", "value":1}
{"key":"other-text-data", "value":1}
{"key":"other-text-data", "value":1}
{"key":"other-text-data", "value":1}
{"key":"other-text-data", "value":1}
The reduce part of the MapReduce ("reduce": "_count") groups the rows above by key and returns the count:
{"key":"text-data","value":2},
{"key":"other-text-data","value":5}
You can query this view on your Cloudant instance:
https://<yourcloudantinstance>/<databasename>
/_design/textObjects
/_view/by_text?group=true
This will result in something similar to the following:
{"rows":[
{"key":"text-data","value":2},
{"key":"other-text-data","value":5}
]}
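If you want to call the view from code, here's a rough Python sketch using the requests library (the instance, database, and credentials are placeholders you'd replace with your own):
import requests

# Placeholders -- substitute your own Cloudant instance, database and credentials
BASE = "https://<yourcloudantinstance>/<databasename>"
AUTH = ("user", "password")

# group=true groups rows by key, so the _count reduce runs per distinct text value
resp = requests.get(BASE + "/_design/textObjects/_view/by_text",
                    params={"group": "true"}, auth=AUTH)
resp.raise_for_status()
for row in resp.json()["rows"]:
    print(row["key"], row["value"])  # e.g. text-data 2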
If this is not what you are looking for, and instead you just want to keep the latest info for a specific text value, then you can simply find an existing document that matches that text and update it with new values:
Add an index on text:
{
"index": {
"fields": [
"text"
]
},
"type": "json"
}
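You can create this index by POSTing the definition to the database's _index endpoint; for example, in Python (same placeholder credentials as above):
import requests

BASE = "https://<yourcloudantinstance>/<databasename>"
AUTH = ("user", "password")

# POST the index definition to the database's _index endpoint
requests.post(BASE + "/_index",
              json={"index": {"fields": ["text"]}, "type": "json"},
              auth=AUTH).raise_for_status()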
Whenever you add a new document, find the document with that same exact text:
{
"selector": {
"text": "text-value"
},
"fields": [
"_id",
"text"
]
}
If it exists update it. If not then insert a new document.
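Sketched in Python, that upsert might look something like this (a rough outline, not production code; same placeholders as before):
import requests

BASE = "https://<yourcloudantinstance>/<databasename>"
AUTH = ("user", "password")

def upsert_by_text(new_doc):
    # Look for an existing doc with exactly the same text (uses the JSON index above)
    found = requests.post(BASE + "/_find",
                          json={"selector": {"text": new_doc["text"]},
                                "fields": ["_id", "_rev"]},
                          auth=AUTH).json()["docs"]
    if found:
        # Reuse _id/_rev so Cloudant saves this as a new revision of the same doc
        new_doc["_id"] = found[0]["_id"]
        new_doc["_rev"] = found[0]["_rev"]
    requests.post(BASE, json=new_doc, auth=AUTH).raise_for_status()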
Finally, if you want to keep multiple docs with the same text value, but just want to be able to query the latest you could do something like this:
Add a property called latest or similar to your docs.
Add an index on text and latest:
{
"index": {
"fields": [
"text",
"latest"
]
},
"type": "json"
}
Whenever you add a new document, find the document with that same exact text where latest == true:
{
"selector": {
"text": "text-value",
"latest" : true
},
"fields": [
"_id",
"text",
"latest"
]
}
Set latest = false on the existing document (if one exists)
Insert the new document with latest = true
This query will find the latest doc for all text values:
{
"selector": {
"text": {"$gt":null}
"latest" : true
},
"fields": [
"_id",
"text",
"latest"
]
}
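And putting the insert path together in code, a rough Python sketch (same placeholders as earlier; error handling omitted):
import requests

BASE = "https://<yourcloudantinstance>/<databasename>"
AUTH = ("user", "password")

def insert_latest(new_doc):
    # Find the current "latest" doc for this text value, if there is one
    found = requests.post(BASE + "/_find",
                          json={"selector": {"text": new_doc["text"], "latest": True},
                                "fields": ["_id"]},
                          auth=AUTH).json()["docs"]
    for hit in found:
        # Fetch the full doc so the update doesn't drop any other fields
        doc = requests.get(BASE + "/" + hit["_id"], auth=AUTH).json()
        doc["latest"] = False
        requests.put(BASE + "/" + doc["_id"], json=doc, auth=AUTH)
    # Insert the new doc as the latest one
    new_doc["latest"] = True
    requests.post(BASE, json=new_doc, auth=AUTH)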

Related

How do I extract a string of numbers from random text in Power Automate?

I am setting up a flow to organize and save emails as PDF in a Dropbox folder. The first email that will arrive includes a 10-digit identification number which I extract along with an address. My flow creates a folder in Dropbox named in this format: 2023568684 : 123 Main St. Over a few weeks, additional emails arrive that I need to put into that folder. The subject always has a 10-digit number in it. I was building around each email and using functions like split, first, last, etc. to isolate the 10-digit ID. The problem is that there is no consistency in the subjects or bodies of the messages to be able to easily find the ID with that method. I ended up starting to build around each email format individually, but there are way too many, not to mention the possibility of new senders or format changes.
My idea is to use List files in folder when a new message arrives which will create an array that I can filter to find the folder ID the message needs to be saved to. I know there is a limitation on this because of the 20 file limit but that is a different topic and question.
For now, how do I find a random 10-digit number in a randomly formatted email subject line so I can use it with the filter function?
For this requirement you really need regex, and at present Power Automate doesn't support regular expressions, but the good news is that it looks like support is coming ...
https://powerusers.microsoft.com/t5/Power-Automate-Ideas/Support-for-regex-either-in-conditions-or-as-an-action-with/idi-p/24768
There is a connector but it looks like it's not free ...
https://plumsail.com/actions/request-free-license
To get around it for now, my suggestion would be to create a function app in Azure and let it do the work. This may not be your cup of tea but it will work.
I created a .NET (C#) function with the following code (straight in the portal) ...
#r "Newtonsoft.Json"
using System.Net;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Extensions.Primitives;
using Newtonsoft.Json;
public static async Task<IActionResult> Run(HttpRequest req, ILogger log)
{
string requestBody = await new StreamReader(req.Body).ReadToEndAsync();
dynamic data = JsonConvert.DeserializeObject(requestBody);
string strToSearch = System.Text.Encoding.UTF8.GetString(Convert.FromBase64String((string)data?.Text));
string regularExpression = data?.Pattern;
var matches = System.Text.RegularExpressions.Regex.Matches(strToSearch, regularExpression);
var responseString = JsonConvert.SerializeObject(matches, new JsonSerializerSettings()
{
ReferenceLoopHandling = ReferenceLoopHandling.Ignore
});
return new ContentResult()
{
ContentType = "application/json",
Content = responseString
};
}
Then in Power Automate, call the HTTP action, passing in a base64-encoded string of the content you want to search ...
This is the expression in the JSON ... base64(variables('String to Search')) ... and this is the JSON you need to pass in ...
{
"Text": "#{base64(variables('String to Search'))}",
"Pattern": "[0-9]{10}"
}
This is an example of the response ...
[
{
"Groups": {},
"Success": true,
"Name": "0",
"Captures": [],
"Index": 33,
"Length": 10,
"Value": "2023568684"
},
{
"Groups": {},
"Success": true,
"Name": "0",
"Captures": [],
"Index": 98,
"Length": 10,
"Value": "8384468684"
}
]
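Before wiring up the next step, you can sanity-check the function from any HTTP client; for example, a small Python sketch of the same call (the function URL and key are placeholders):
import base64
import requests

FUNCTION_URL = "https://<yourfunctionapp>.azurewebsites.net/api/<functionname>?code=<key>"

text = "We're going to search for string 2023568684 within this text"
payload = {"Text": base64.b64encode(text.encode("utf-8")).decode("ascii"),
           "Pattern": "[0-9]{10}"}

matches = requests.post(FUNCTION_URL, json=payload).json()
if matches:
    print(matches[0]["Value"])  # first ten-digit number found: 2023568684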
Next, add a Parse JSON action and use this schema ...
{
"type": "array",
"items": {
"type": "object",
"properties": {
"Groups": {
"type": "object",
"properties": {}
},
"Success": {
"type": "boolean"
},
"Name": {
"type": "string"
},
"Captures": {
"type": "array"
},
"Index": {
"type": "integer"
},
"Length": {
"type": "integer"
},
"Value": {
"type": "string"
}
},
"required": [
"Groups",
"Success",
"Name",
"Captures",
"Index",
"Length",
"Value"
]
}
}
Finally, extract the first value that matches the regex pattern. The function returns multiple matches if it finds them, so you can do something with the others if you need to.
This is the expression ... #{first(body('Parse_JSON'))?['value']}
From this string ...
We're going to search for string 2023568684 within this text and we're also going to try and find 8384468684, this should work.
... this is the result: 2023568684 (the first match).
Don't have a Premium Power Automate licence so can't use the HTTP action?
You can do this exact same thing using the LogicApps service in Azure. It's the same engine with some slight differences re: connectors and behaviour.
Instead of the HTTP, use the Azure Functions action.
In relation to your trigger that fires when an email is received: in LogicApps it will poll every x seconds/minutes/hours/etc. rather than fire on the event. I'm not 100% sure which email connector you're using, but it should exist.
Dropbox connectors exist, that's no problem.
You can export your Power Automate flow into a LogicApps format so you don't have to start from scratch.
https://learn.microsoft.com/en-us/azure/logic-apps/export-from-microsoft-flow-logic-app-template
If you're concerned about cost, don't be. Just make sure you use the consumption plan. Costs only really rack up for these services when the apps run for minutes at a time on a regular basis. Just keep an eye on it for your own mental health.
To get the function URL, you can find it in the function itself. You have to be in the function ...

How to highlight custom extractions using a2i's crowd-textract-analyze-document?

I would like to create a human review loop for images that undergone OCR using Amazon Textract and Entity Extraction using Amazon Comprehend.
My process is:
send image to Textract to extract the text
send text to Comprehend to extract entities
find the Block IDs in Textract's output of the entities extracted by Comprehend
add new Blocks of type KEY_VALUE_SET to Textract's JSON output per the docs
create a Human Task with crowd-textract-analyze-document element in the template and feed it the modified textract output
What fails to work in this process is step 5. My custom entities are not rendered properly. By "fails to work" I mean that the entities are not highlighted on the image when I click them on the sidebar. There is no error in the browser's console.
Has anyone tried such a thing?
Sorry for not including examples. I will remove secrets/PII from my files and attach them to the question
I used the AWS documentation of the a2i-crowd-textract-detection human task element to generate the value of the initialValue attribute. It appears the doc for that attribute is incorrect. While the doc shows that the value should be in the same format as the output of Textract, namely:
[
{
"BlockType": "KEY_VALUE_SET",
"Confidence": 38.43309020996094,
"Geometry": { ... }
"Id": "8c97b240-0969-4678-834a-646c95da9cf4",
"Relationships": [
{ "Type": "CHILD", "Ids": [...]},
{ "Type": "VALUE", "Ids": [...]}
],
"EntityTypes": ["KEY"],
"Text": "Foo bar"
},
]
the a2i-crowd-textract-detection expects the input to have lowerCamelCase attribute names (rather than UpperCamelCase). For example:
[
{
"blockType": "KEY_VALUE_SET",
"confidence": 38.43309020996094,
"geometry": { ... }
"id": "8c97b240-0969-4678-834a-646c95da9cf4",
"relationships": [
{ "Type": "CHILD", "ids": [...]},
{ "Type": "VALUE", "ids": [...]}
],
"entityTypes": ["KEY"],
"text": "Foo bar"
},
]
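Until the docs are corrected, you can convert Textract's output yourself. A minimal recursive sketch in Python (it lowercases the first character of every key except "Type", which the expected-input example above shows unchanged):
def to_lower_camel(obj, keep=("Type",)):
    # Recursively lowercase the first character of each dict key;
    # "Type" inside relationships stays UpperCamelCase per the example above
    if isinstance(obj, dict):
        return {(k if k in keep else k[0].lower() + k[1:]): to_lower_camel(v, keep)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [to_lower_camel(v, keep) for v in obj]
    return obj

# usage (textract_response is your modified Textract output):
# initial_value = to_lower_camel(textract_response["Blocks"])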
I opened a support case with AWS about this documentation error.

Error when importing CSV file into Amazon Personalize

I am trying to import a CSV file into Amazon Personalize. My schema looks like this:
{
"type": "record",
"name": "Items",
"namespace": "com.amazonaws.personalize.schema",
"fields": [
{
"name": "ITEM_ID",
"type": "string"
},
{
"name": "AUTHOR",
"type": "string",
"categorical": true
},
{
"name": "COUNTRY",
"type": "string",
"categorical": true
},
{
"name": "CITY",
"type": "string",
"categorical": true
},
{
"name": "STYLES",
"type": "string",
"categorical": true
},
{
"name": "CATEGORIES",
"type": "string",
"categorical": true
}
],
"version": "1.0"
}
The first few rows of data look like this:
ITEM_ID,AUTHOR,COUNTRY,CITY,STYLES,CATEGORIES
5b4253a7e12434f55875381e,5acd193f48ed4b9b3add5be6,US,city_us_austin,5ad45bc575eb016f3cdb562b|571aa21888a4fd9934f0fd7b|571aa21888a4fd9934f0fd79|5ad45e8c75eb016f3cdb563f|5b4ea35abaa12285687a1f47,593a866a082c26444eab2d3c|5a8e4820fc112d414fbc1be3
5b4253a7e12434f55875381f,5acd193f48ed4b9b3add5be6,US,city_us_jackson,571aa21888a4fd9934f0fd82|57600e419e4959cd069658eb|5ad45c3a75eb016f3cdb5631|571aa21888a4fd9934f0fd7b|57aaa7094a393f531ace43f0|575e6d8e34ca56f742bea1c8|571aa21888a4fd9934f0fd8f,593a866a082c26444eab2d3c|5a8e4820fc112d414fbc1be3
I get the error
Failed to create a data import job for item dataset.
Input csv has rows that do not conform to the dataset schema. Please ensure all required data fields are present and that they are of the type specified in the schema.
How can I figure out what is wrong with the CSV? It's thousands of lines long, so I have no idea if it's a general mistake or something wrong on a specific line.
In my experience, so long as the dataset is not >250 thousand records, you can still use Excel to check the data utilizing data filters and corresponding search functions. If it's more than that, look into using Notepad++ and RegEx. Your problem may be one of the following things:
(1) There's a missing comma. This would misalign your data and keep it from being processed.
(2) There's a missing ITEM_ID value. For Items, Personalize requires ITEM_ID and at least one metadata field. It might give this error if there is an instance where you are missing ITEM_ID or have ITEM_ID but no other metadata field values.
(3) STYLES and/or CATEGORIES exceeds 256 characters. There is probably a limit on string length, but I can't get a clear answer on this from the developer's guide. I would guess it's 256 characters. If I were betting money, this would be my guess at your problem.
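If you'd rather script the check than eyeball it, here's a rough pandas sketch covering those three suspects (the file name is a placeholder):
import csv
import pandas as pd

# (1) Rows whose field count differs from the header's (misaligned commas)
with open("items.csv", newline="") as f:
    rows = list(csv.reader(f))
width = len(rows[0])
print("misaligned rows:", [i + 1 for i, r in enumerate(rows) if len(r) != width])

# Read everything as strings so nothing gets coerced along the way
df = pd.read_csv("items.csv", dtype=str)

# (2) Missing ITEM_ID, or ITEM_ID present but every metadata field empty
print(df[df["ITEM_ID"].isna()])
print(df[df.drop(columns=["ITEM_ID"]).isna().all(axis=1)])

# (3) Values that exceed a guessed 256-character limit
for col in ["STYLES", "CATEGORIES"]:
    print(col, "rows over 256 chars:", list(df[df[col].str.len() > 256].index))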
Here is a different approach to solving the problem; maybe it will be useful for other cases. I had the same issue, but when dealing with int columns having null values. Pandas by default converts such columns to a float data type, which the AWS Personalize dataset import job will not accept if you have defined these columns as int or long. Long story short, converting these columns back to int solves the problem:
df.column_name = df.column_name.astype(pd.Int32Dtype())
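To see the promotion this is talking about (the column name here is hypothetical):
import pandas as pd

df = pd.DataFrame({"RATING": [1, 2, None]})   # an int column with a null
print(df["RATING"].dtype)                     # float64 -- pandas promoted it

df["RATING"] = df["RATING"].astype(pd.Int32Dtype())
print(df["RATING"].dtype)                     # Int32 -- nullable integer, null kept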

FB Graph API - Filtering results by specific IDs?

I am making a request to a specific node and edge using the graph API:
https://graph.facebook.com/v2.6/NODE_ID/EDGE_NAME
Example:
https://graph.facebook.com/v2.6/00000000000000/reports
which returns the results below:
"data": [
{
"id": "111111111111111",
"name": "Report A"
},
{
"id": "22222222222222",
"name": "Report B"
},
{
"id": "33333333333333",
"name": "Report C"
}
]
The above is literally returning a list of reports by id/name that exist under a specific company.
If I want to filter the results by specific reports, how can I go about doing this?
I tried variations such as the below, but they haven't worked and still return all reports:
https://graph.facebook.com/v2.6/00000000000000/reports?ids=22222222222222
I know I can use the report ID as the node to access it directly:
https://graph.facebook.com/v2.6/22222222222222/
But I want to view the properties of a subset of reports that belong to the company, so I was thinking I could build an array to do this.
https://graph.facebook.com/v2.6/00000000000000/reports?ids=22222222222222,33333333333333
Expected Result:
"data": [
{
"id": "22222222222222",
"name": "Report B"
},
{
"id": "33333333333333",
"name": "Report C"
}
]
This seems like it should work based on the below documentation, but it does not...
https://developers.facebook.com/docs/graph-api/using-graph-api
Could it be because the edge I'm accessing isn't able to recognize these IDs for some reason...? I know it's hard to say without knowing what I'm doing, but I can't disclose fully as it's proprietary...
Any advice is appreciated.

Populating search results with meta data in Amazon CloudSearch

Unfortunately, Amazon CloudSearch does not support nested JSON, meaning that the below document structure is not valid.
[{
"type": "add",
"id": 1,
"fields": {
"company_name": "My Company",
"services": [
{
"id": 123,
"name": "Construction",
"logo": "logo1.png"
},
{
"id": 456,
"name": "Programming",
"logo": "logo2.png"
}
]
}
}]
Basically I cannot nest an array of objects under the services key. In this particular scenario, only the nested name field has to be searchable, so what I could do is the following:
[{
"type": "add",
"id": 1,
"fields": {
"company_name": "My Company",
"services": [ "Construction", "Programming" ]
}
}]
The above JSON is valid, and I can still search for the service names. However, now I have lost some metadata about my services that I need when displaying the search results. Is there any way in which I can add the metadata to the document in Amazon CloudSearch and have it returned with my search results, such that I can use it when displaying the results?
Or do I have to fetch this additional metadata from my database afterwards to populate the search results with the additional data required to display them? This does not seem feasible, because it complicates my code much more than if I could fetch this data straight from CloudSearch. It would also impact the performance of the search, even though I could use caching - but I kind of want to avoid that if possible, because I don't need it for anything else right now.
So my questions are:
Can I somehow add the metadata for services to the CloudSearch documents and have it returned with my search results?
If not, should I then extract this data from my data store upon receiving the search results from CloudSearch?
Do you have any other solutions or ideas? Are there any best practices with this?
Thank you in advance!