Best way to load 1MM JSON records into AWS Redshift with Kinesis Firehose?

I've got a bunch of JSON records that I want to add to an Amazon Redshift instance from S3, via Kinesis Firehose. It's several hundred files, give or take, with 1,000 or so records each; each file looks like the sample below. For my purposes, I don't care about the info entry, at least for now. I have a working Kinesis Firehose delivery stream that can update my Redshift DB with the sample stock ticker data, so that part is OK. My questions are (and hopefully this shouldn't actually be split into two different posts):
This is in large part a learning exercise, so if it's overkill for what I'm trying to do, that's OK. If there's a reason it's actually a bad idea, let me know.
1. If I want to just ignore the info field, do I have to use a Lambda to strip it, or is there a way to do that without one? If so, are there any tricks beyond what I'd write in a script that processes a regular text file? As I'm typing this I realize I could probably just put info in the DB and never touch it, but if there's a reason not to do that, or a cleaner way, I'd appreciate hearing it. (A rough sketch of such a transform follows the sample below.)
2. When I have individual manufacturers with a set of features, and there could be dozens of features per manufacturer, does it make sense to make a separate DB table for features, or am I coming at it from a Python dict/Perl hash perspective that doesn't make sense for a SQL DB when I need to tie them back together later?
Sample:
{
    "info": {
        "generated_on": "2022-08-09 19:25:34",
        "version": "v1"
    },
    "manufacturer": [
        {
            "name": "Audi",
            "id": 1,
            "num_features": 2,
            "features": [
                {
                    "name": "seat heaters",
                    "standard": "N",
                    "cost": 100
                },
                {
                    "name": "A/C",
                    "standard": "Y",
                    "cost": 0
                }
            ]
        },
        {
            "name": "BMW",
            "id": 2,
            "num_features": 3,
            "features": [
                {
                    "name": "seat heaters",
                    "standard": "Y",
                    "cost": 0
                },
                {
                    "name": "backup camera",
                    "standard": "N",
                    "cost": 500
                },
                {
                    "name": "A/C",
                    "standard": "Y",
                    "cost": 0
                }
            ]
        }
    ]
}
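For concreteness on question 1, here is a minimal sketch of the kind of transformation Lambda I have in mind, assuming Python and the standard Firehose record-transformation contract (base64-encoded records in; recordId, result, and data back out). The one-row-per-feature layout is just a guess at a reasonable target:

import base64
import json

def lambda_handler(event, context):
    # Firehose record-transformation contract: each record arrives
    # base64-encoded and must be returned with the same recordId
    output = []
    for record in event["records"]:
        doc = json.loads(base64.b64decode(record["data"]))
        rows = []
        for mfr in doc.get("manufacturer", []):  # "info" is simply never read
            for feat in mfr.get("features", []):
                rows.append({
                    "manufacturer_id": mfr["id"],
                    "manufacturer_name": mfr["name"],
                    "feature_name": feat["name"],
                    "standard": feat["standard"],
                    "cost": feat["cost"],
                })
        # Newline-delimited JSON objects suit Redshift's COPY from JSON
        payload = "\n".join(json.dumps(r) for r in rows) + "\n"
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(payload.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}

If the answer to question 2 is a separate features table (the usual relational shape), each of those rows maps onto it directly, joined back to manufacturers by manufacturer_id.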

Related

How do I extract a string of numbers from random text in Power Automate?

I am setting up a flow to organize and save emails as PDF in a Dropbox folder. The first email that arrives includes a 10-digit identification number, which I extract along with an address. My flow creates a folder in Dropbox named in this format: 2023568684 : 123 Main St. Over a few weeks, additional emails arrive that I need to put into that folder. The subject always has a 10-digit number in it. I was building around each email and using functions like split, first, last, etc. to isolate the 10-digit ID. The problem is that there is no consistency in the subjects or bodies of the messages, so the ID can't easily be found with that method. I ended up starting to build around each email format individually, but there are way too many, not to mention the possibility of new senders or format changes.
My idea is to use List files in folder when a new message arrives, which will create an array I can filter to find the folder ID the message needs to be saved to. I know there is a limitation on this because of the 20-file limit, but that is a different topic and question.
For now, how do I find a random 10-digit number in a randomly formatted email subject line so I can use it with the filter function?
For this requirement, you really need regex and, at present, Power Automate doesn't support regular expressions. The good news is that it looks like it's coming ...
https://powerusers.microsoft.com/t5/Power-Automate-Ideas/Support-for-regex-either-in-conditions-or-as-an-action-with/idi-p/24768
There is a connector but it looks like it's not free ...
https://plumsail.com/actions/request-free-license
To get around it for now, my suggestion would be to create a function app in Azure and let it do the work. This may not be your cup of tea but it will work.
I created a .NET (C#) function with the following code (straight in the portal) ...
#r "Newtonsoft.Json"
using System.Net;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Extensions.Primitives;
using Newtonsoft.Json;
public static async Task<IActionResult> Run(HttpRequest req, ILogger log)
{
string requestBody = await new StreamReader(req.Body).ReadToEndAsync();
dynamic data = JsonConvert.DeserializeObject(requestBody);
string strToSearch = System.Text.Encoding.UTF8.GetString(Convert.FromBase64String((string)data?.Text));
string regularExpression = data?.Pattern;
var matches = System.Text.RegularExpressions.Regex.Matches(strToSearch, regularExpression);
var responseString = JsonConvert.SerializeObject(matches, new JsonSerializerSettings()
{
ReferenceLoopHandling = ReferenceLoopHandling.Ignore
});
return new ContentResult()
{
ContentType = "application/json",
Content = responseString
};
}
Then in Power Automate, call the HTTP action, passing in a base64-encoded string of the content you want to search ...
This is the expression used inside the JSON ... base64(variables('String to Search')) ... and this is the JSON you need to pass in ...
{
    "Text": "#{base64(variables('String to Search'))}",
    "Pattern": "[0-9]{10}"
}
This is an example of the response ...
[
    {
        "Groups": {},
        "Success": true,
        "Name": "0",
        "Captures": [],
        "Index": 33,
        "Length": 10,
        "Value": "2023568684"
    },
    {
        "Groups": {},
        "Success": true,
        "Name": "0",
        "Captures": [],
        "Index": 98,
        "Length": 10,
        "Value": "8384468684"
    }
]
Next, add a Parse JSON action and use this schema ...
{
    "type": "array",
    "items": {
        "type": "object",
        "properties": {
            "Groups": {
                "type": "object",
                "properties": {}
            },
            "Success": {
                "type": "boolean"
            },
            "Name": {
                "type": "string"
            },
            "Captures": {
                "type": "array"
            },
            "Index": {
                "type": "integer"
            },
            "Length": {
                "type": "integer"
            },
            "Value": {
                "type": "string"
            }
        },
        "required": [
            "Groups",
            "Success",
            "Name",
            "Captures",
            "Index",
            "Length",
            "Value"
        ]
    }
}
Finally, extract the first value you find that matches the regex pattern. It returns multiple results if found, so you can do something with the others if you need to.
This is the expression ... #{first(body('Parse_JSON'))?['value']}
From this string ...
We're going to search for string 2023568684 within this text and we're also going to try and find 8384468684, this should work.
... the result is 2023568684, the first match found.
Don't have a Premium Power Automate licence so can't use the HTTP action?
You can do this exact same thing using the LogicApps service in Azure. It's the same engine with some slight differences re: connectors and behaviour.
Instead of the HTTP, use the Azure Functions action.
In relation to your action to fire when an email is received, in LogicApps, it will poll every x seconds/minutes/hours/etc. rather than fire on event. I'm not 100% sure which email connector you're using but it should exist.
Dropbox connectors exist, that's no problem.
You can export your PowerAutomate flow into a LogicApps format so you don't have to start from scratch.
https://learn.microsoft.com/en-us/azure/logic-apps/export-from-microsoft-flow-logic-app-template
If you're concerned about cost, don't be. Just make sure you use the consumption plan. Costs only really rack up for these services when the apps run for minutes at a time on a regular basis. Just keep an eye on it for your own mental health.
To get the function URL, you can find it in the function itself. You have to be in the function ...

How to POST logical type decimal values for Avro records to Confluent Rest Proxy?

We are using the Confluent Rest Proxy to communicate with Kafka and need to test a variety of data. We are using the Rest Proxy to allow a vendor to communicate with our Kafka system.
One of our fields in the Avro schema has a logical type of decimal. To keep this simple, let's assume the schema shown here:
{
    "fields": [
        {
            "name": "fieldName",
            "type": "string"
        },
        {
            "name": "amount",
            "type": {
                "logicalType": "decimal",
                "precision": 16,
                "scale": 2,
                "type": "bytes"
            }
        }
    ],
    "name": "Sample",
    "namespace": "com.test.sample",
    "type": "record"
}
It's easy enough to write to the topic via a Java producer, using Avro Tools to produce the appropriate class files. But when attempting to use the Rest Proxy, we have to pass values such as this:
{"value_schema_id":132,"records": [{"value":{"fieldName":"Field Name","amount":"\u0001ã"}}]}
This was copied from a record created via the Java producer and then downloaded from the topic. But in the amount field, we'd like to be able to pass a value such as 123.45. We're using Postman for the most part to send data. Is there a way to do this with a logical decimal field and without having to create and serialize the data first to see the representation such as \u0001ã?
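For context, my understanding is that the escaped string is just the decimal's unscaled integer in big-endian two's-complement bytes, with each byte read as an ISO-8859-1 character. A sketch of computing it by hand, assuming Python (decimal_to_avro_bytes is my own helper, not a library call):

import json
from decimal import Decimal

def decimal_to_avro_bytes(value: str, scale: int = 2) -> str:
    # Avro's decimal logical type stores the unscaled integer as
    # big-endian two's-complement bytes
    unscaled = int(Decimal(value).scaleb(scale))  # "123.45" -> 12345
    length = max(1, (unscaled.bit_length() + 8) // 8)  # room for the sign bit
    raw = unscaled.to_bytes(length, byteorder="big", signed=True)
    # REST Proxy JSON carries bytes fields as ISO-8859-1 strings,
    # one character per byte
    return raw.decode("latin-1")

print(json.dumps({"amount": decimal_to_avro_bytes("4.83")}))
# {"amount": "\u0001\u00e3"} -- 0x01 0xE3 = 483, scale 2 -> 4.83
print(json.dumps({"amount": decimal_to_avro_bytes("123.45")}))
# {"amount": "09"} -- 0x30 0x39 = 12345, scale 2 -> 123.45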

What's a "cloud-native" way to convert a Location History REST API into AWS Location pings?

My use case: I've got a Spot Tracker that sends location data up every 5 minutes. I'd like to get these pings into AWS Location, so I can do geofencing, mapping, and other fun stuff with them.
Spot offers a REST API that will show the last X number of events, such as:
"messages": {
"message": [
{
"id": 1605371088,
"latitude": 41.26519,
"longitude": -95.99069,
"dateTime": "2021-06-26T23:21:24+0000",
"batteryState": "GOOD",
"altitude": -103
},
{
"id": 1605371124,
"latitude": 41.2639,
"longitude": -95.98545,
"dateTime": "2021-06-26T23:11:24+0000",
"altitude": 0
},
{
"id": 1605365385,
"latitude": 41.25448,
"longitude": -95.94189,
"dateTime": "2021-06-26T23:06:01+0000",
"altitude": -103
},
...
]
}
What's the most idiomatic, cloud-native way to turn these into pings that go into AWS Location?
My initial approach: use a timed Lambda to periodically hit the Spot endpoint, and keep track of the latest ping I've sent in a store like DynamoDB.
I'm not an AWS expert, but I feel like there must be a cleaner integration. Are there other tools that would help with this? Is there anything in AWS IOT, for example, that would help me not have to keep track of the last one I uploaded?
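To make that concrete, the poll-and-forward core of the Lambda might look something like this sketch, assuming Python/boto3 and an existing Amazon Location tracker; fetching the Spot messages and loading/storing the last_sent_time checkpoint in DynamoDB are left out:

from datetime import datetime
import boto3

location = boto3.client("location")
SPOT_TIME = "%Y-%m-%dT%H:%M:%S%z"  # matches "2021-06-26T23:21:24+0000"

def forward_pings(messages, tracker_name, device_id, last_sent_time):
    # Keep only messages newer than the checkpoint loaded from DynamoDB
    fresh = [
        m for m in messages
        if datetime.strptime(m["dateTime"], SPOT_TIME) > last_sent_time
    ]
    updates = [
        {
            "DeviceId": device_id,
            "Position": [m["longitude"], m["latitude"]],  # lon, lat order
            "SampleTime": datetime.strptime(m["dateTime"], SPOT_TIME),
        }
        for m in fresh
    ]
    # BatchUpdateDevicePosition accepts at most 10 updates per call
    for i in range(0, len(updates), 10):
        location.batch_update_device_position(
            TrackerName=tracker_name, Updates=updates[i:i + 10]
        )
    # New checkpoint to write back to DynamoDB
    times = [datetime.strptime(m["dateTime"], SPOT_TIME) for m in fresh]
    return max(times, default=last_sent_time)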

CouchDB Map syntax

I have documents in CouchDB (v. 2.1.1) as follows:
{ "xyz": "a", "abc": "def" },
{ "xyz": "a", "ghi": "jkl" },
{ "xyz": "a", "mno": "pqr" },
{ "xyz": "a", "stu": "vwx" },
{ "xyz": "a", "bcd": 1000 }
If I run a simple map function, for example:
function (doc) {
    if (doc.xyz) {
        emit(doc.xyz, doc.abc);
    }
}
I get:
{ "id": "4c3406a1d92942b4fb10d1314e0061a9", "key": "a", "value": "def" },
{ "id": "4c3406a1d92942b4fb10d1314e006ccf", "key": "a", "value": null },
{ "id": "4c3406a1d92942b4fb10d1314e00787f", "key": "a", "value": null },
{ "id": "4c3406a1d92942b4fb10d1314e00871e", "key": "a", "value": null },
{ "id": "4c3406a1d92942b4fb10d1314e00906a", "key": "a", "value": null }
I want to try and eliminate the 'null' outputs.
I am looking at having a CouchDB database with many small documents, each containing a small snippet of information, rather than fewer, larger documents containing much more information each.
My question is: is my document design a good one, and if so, how do I get just what I am looking for rather than rows of 'null'? If my storage design is not ideal, what kind of design should I be looking at to simplify the output, given my plan to have many small docs?
EDIT:
Having looked at possible answers, I have decided that having numerous small documents as I described in my question is not giving me the kind of benefit I imagined it would.
I was unable to get a satisfactory solution to the map function to get readable answers.
However, I investigated the 'Mango' query system available in recent releases of CouchDB, and using these queries I was able to get acceptable output from a database like my supplied one.
This is what I did:
curl -X POST http://admin:123@127.0.0.1:5984/ptn/_find -d '{"selector": {"$or": [{"abc": {"$gt": null}},{"ghi": {"$gt": null}}]},"fields": ["abc","ghi"]}' -H "Content-Type:application/json"
Un-minified:
{
    "selector": {
        "$or": [
            { "abc": { "$gt": null } },
            { "ghi": { "$gt": null } }
        ]
    },
    "fields": ["abc", "ghi"]
}
The output:
{"docs":[
{"abc":"def"},
{"ghi":"jkl"}
]
.....
A concise answer.
Sorting can be done but sorted fields must be indexed. Indexing is in any case advised for larger data sets.
Reference:
http://docs.couchdb.org/en/2.1.1/api/database/find.html
As my question required a map function, this perhaps cannot be regarded as a valid answer, but for me it is an answer. I have tried the 'Mango' query system a little on other databases and it seems to be more useful/powerful than I thought it was, although it offers no means of totaling etc.
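For anyone scripting this rather than using curl, the same _find query is easy to issue from code. A minimal sketch, assuming Python with the requests package and the credentials/database from the curl example:

import requests

# The same Mango query as the curl example: match docs where either
# field exists ($gt null), and return only those two fields
query = {
    "selector": {
        "$or": [
            {"abc": {"$gt": None}},
            {"ghi": {"$gt": None}},
        ]
    },
    "fields": ["abc", "ghi"],
}

resp = requests.post(
    "http://127.0.0.1:5984/ptn/_find",
    json=query,
    auth=("admin", "123"),
)
resp.raise_for_status()
for doc in resp.json()["docs"]:
    print(doc)  # {'abc': 'def'} then {'ghi': 'jkl'}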

Populating search results with meta data in Amazon CloudSearch

Unfortunately, Amazon CloudSearch does not support nested JSON, meaning that the document structure below is not valid.
[{
    "type": "add",
    "id": 1,
    "fields": {
        "company_name": "My Company",
        "services": [
            {
                "id": 123,
                "name": "Construction",
                "logo": "logo1.png"
            },
            {
                "id": 456,
                "name": "Programming",
                "logo": "logo2.png"
            }
        ]
    }
}]
Basically I cannot nest an array of objects under the services key. In this particular scenario, only the nested name field has to be searchable, so what I could do is the following:
[{
    "type": "add",
    "id": 1,
    "fields": {
        "company_name": "My Company",
        "services": ["Construction", "Programming"]
    }
}]
The above JSON is valid, and I can still search for the service names. However, now I have lost some meta data about my services that I need when displaying the search results. Is there any way in which I can add the meta data to the document in Amazon CloudSearch and have it returned with my search results, such that I can use it when displaying the results?
Or do I have to fetch this additional meta data from my database afterwards to populate the search results with the additional data required to display the results? This does not seem feasible because it complicates my code much more than if I could fetch this data straight from CloudSearch. This would also impact the performance of the search, even though I could use caching - but I kind of want to avoid that if possible, because I don't need it for anything else right now.
So my questions are:
Can I somehow add the meta data for services to the CloudSearch documents and have it returned with my search results?
If not, should I then extract this data from my data store upon receiving the search results from CloudSearch?
Do you have any other solutions or ideas? Are there any best practices with this?
Thank you in advance!
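One possible workaround, sketched under the assumption that a return-enabled literal field can hold a JSON string: keep the searchable names in a text-array field and carry the service objects along JSON-encoded, decoding them at display time. Assuming Python; the services_meta field name is made up:

import json

def to_cloudsearch_doc(company):
    # Names stay in a searchable text-array field; the full objects
    # ride along JSON-encoded in a return-enabled literal field
    return {
        "type": "add",
        "id": company["id"],
        "fields": {
            "company_name": company["company_name"],
            "services": [s["name"] for s in company["services"]],
            "services_meta": json.dumps(company["services"]),  # hypothetical field
        },
    }

company = {
    "id": 1,
    "company_name": "My Company",
    "services": [
        {"id": 123, "name": "Construction", "logo": "logo1.png"},
        {"id": 456, "name": "Programming", "logo": "logo2.png"},
    ],
}
print(json.dumps([to_cloudsearch_doc(company)], indent=2))
# At display time, json.loads(hit["fields"]["services_meta"])
# restores id/name/logo without a database round trip.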