AWS IoT guaranteed delivery - amazon-web-services

I am starting out with the AWS IoT service, using a Raspberry Pi as the device, and I do not understand how I can guarantee delivery of my data to the AWS IoT MQTT service.
There are two cases:
The device has no Internet connection but is powered on. In this case, I can use an in-memory store (the offline queue from the AWS SDK library).
The device is powered off. In this case, I lose the data held in RAM.
How can I save my data without running a full database engine on the Raspberry Pi?
Do you have any best practices?

You will need to somehow save your data to disk to mitigate issue #2. The best practice is to use an established database system. SQLite is a very lightweight database and not that hard to use, so give it a shot! If you really hate that idea, you could just save the data as JSON to a text file; that works as well.
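If you go the SQLite route, here is a minimal sketch of a disk-backed spool (Python, standard library only). The topic name is made up and the publish callback stands in for whatever MQTT client you use (for example the AWS IoT Device SDK), so treat it as an illustration rather than a drop-in solution:

```python
import json
import sqlite3
import time

DB_PATH = "spool.db"  # lives on the SD card, so it survives power loss


def init_spool(path=DB_PATH):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS spool ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " created REAL NOT NULL,"
        " payload TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def enqueue(conn, reading):
    """Persist a reading to disk before attempting to publish it."""
    conn.execute(
        "INSERT INTO spool (created, payload) VALUES (?, ?)",
        (time.time(), json.dumps(reading)),
    )
    conn.commit()


def flush(conn, publish):
    """Publish everything in the spool, oldest first.

    `publish(topic, payload)` should return True only when the message
    was accepted (e.g. a QoS 1 publish that was acknowledged).
    """
    rows = conn.execute("SELECT id, payload FROM spool ORDER BY id").fetchall()
    for row_id, payload in rows:
        if not publish("sensors/raspberry", payload):  # hypothetical topic
            break  # still offline; keep the rest for later
        conn.execute("DELETE FROM spool WHERE id = ?", (row_id,))
        conn.commit()
```

Because every reading is committed to disk before you try to send it, a power cut only costs you whatever was in flight; everything else is replayed by flush() on the next boot or whenever connectivity returns.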

Related

AWS IoT Device Onboarding

I'm working on a learning project for IoT with AWS IoT Things and an ESP32 using Arduino/C (no MicroPython). While I have shadows and messages working well, the part I'm not sure about is the best approach to onboarding new devices.
Currently the onboarding process is:
I create the Thing in the AWS Console
I create the certs
I save the certs to laptop
I copy the cert contents into Shadow.h and then upload the sketch to the ESP32
This feels incredibly manual :(
Hypothetically, how would a reseller of ESP32-based IoT devices automate the onboarding process? How can the Things and certs be automated?
Many thanks in advance
Ant
We're talking about provisioning devices in the cloud.
If you (or your organization) are adding your own devices to your own cloud, then it's quite easy to automate. Steps 1 and 2 are the cloud-side part of provisioning: just install the required SDKs and write a script in your favourite supported scripting language to do the dirty work. For steps 3 and 4 you use the device's own Flash to store the device certificates. Espressif has a useful non-volatile storage system called NVS; it's fairly easy to use and supports Flash encryption (this bit could be more elegant, but it works). You can use their NVS Partition Generator to pre-create the required storage with the device's certs in it, then flash it into the device when setting it up. Device-side provisioning can be scripted together with cloud-side provisioning, so you can do the whole thing in a single step. The Arduino IDE is not the tool to use for this, though: you just need the final program binaries, and everything else you create on your own.
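To make the cloud-side half concrete, a provisioning script could look roughly like the sketch below (Python with boto3). The thing name, policy name and output file names are placeholders, and error handling is omitted:

```python
import boto3

iot = boto3.client("iot")  # uses your usual AWS credentials and region


def provision_thing(thing_name, policy_name):
    """Create a Thing and a certificate/key pair, then wire them together."""
    iot.create_thing(thingName=thing_name)

    cert = iot.create_keys_and_certificate(setAsActive=True)

    # Let the certificate act as this Thing and attach the IoT policy to it.
    iot.attach_thing_principal(thingName=thing_name,
                               principal=cert["certificateArn"])
    iot.attach_policy(policyName=policy_name,
                      target=cert["certificateArn"])

    # These are what you bake into the device's flash (e.g. an NVS partition).
    return cert["certificatePem"], cert["keyPair"]["PrivateKey"]


if __name__ == "__main__":
    cert_pem, private_key = provision_thing("esp32-unit-0001",      # placeholder
                                            "esp32-device-policy")  # placeholder
    with open("device.cert.pem", "w") as f:
        f.write(cert_pem)
    with open("device.private.key", "w") as f:
        f.write(private_key)
```

The returned PEM strings are exactly what you would then pack into an NVS partition image and flash onto the device, so the cloud-side and device-side steps can hang off the same script.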
If you're talking about a third party taking your device and provisioning it in their cloud, this is a bit more difficult (but not impossible). Presumably they need to do steps 1 & 2 on their own and you need to give them a way to configure their AWS endpoints and certificates on the device. So you need to build some interface which allows them to do it.

Sending Data From Elasticsearch to AWS Databases in Real Time

I know this is a very different use case for Elasticsearch and I need your help.
Main structure (can't be changed):
There are some physical machines and we have sensors on them. Data from these sensors goes to AWS Greengrass.
Then, via a Lambda function, the data is sent to Elasticsearch using MQTT. Elasticsearch is running in Docker.
This is the structure, and up to this point everything is ready and running ✅
Now, on top of ES, I need some software that can send this data (using MQTT) to a cloud database, for example DynamoDB.
But this is not a one-time migration; it should send the data continuously. Basically, I need a channel between ES and AWS DynamoDB.
Also, the sensors produce a lot of data and we don't want to store all of it in the cloud, but we do want to store it all in ES. Some filtering is needed on the Elasticsearch side before we send data to the cloud, like "save every 10th record to the cloud", so we only keep 1 record out of 10.
Do you have any idea how this could be done? I have no experience in this field and it looks like a challenging task. I would love to get some suggestions from people experienced in these areas.
Thanks a lot! 🙌😊
I haven't worked on a similar use case, but you can try looking into Logstash for this.
It's an open-source service, part of the ELK stack, and it provides the option of filtering the output. The pipeline will look something like this:
data ----> ES ----> Logstash -----> DynamoDB or any other destination.
It supports various plugins required for your use case, like:
DynamoDB output plugin -
https://github.com/tellapart/logstash-output-dynamodb
Logstash MQTT Output Plugin -
https://github.com/kompa3/logstash-output-mqtt
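If running Logstash (or that DynamoDB output plugin) turns out to be awkward, the same pipeline can also be scripted by hand. The sketch below is not Logstash, just a rough Python illustration of the ES → filter → DynamoDB idea using the official elasticsearch and boto3 clients; the index name, table name and the "every 10th document" rule are placeholders:

```python
import json
from decimal import Decimal

import boto3
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch("http://localhost:9200")               # your dockerised ES
table = boto3.resource("dynamodb").Table("sensor_data")   # placeholder table


def forward_every_nth(index="sensors", n=10):
    """Scroll through an index and push every n-th document to DynamoDB."""
    for i, hit in enumerate(scan(es, index=index,
                                 query={"query": {"match_all": {}}})):
        if i % n != 0:
            continue  # the "keep 1 record out of 10" filter
        doc = hit["_source"]
        doc["es_id"] = hit["_id"]  # assumes the table's partition key is "es_id"
        # DynamoDB does not accept Python floats, so round-trip through Decimal.
        table.put_item(Item=json.loads(json.dumps(doc), parse_float=Decimal))


if __name__ == "__main__":
    forward_every_nth()
```

Run on a schedule (and restricted to new documents with a timestamp range query instead of match_all), this gives you the continuous ES → DynamoDB channel; Logstash does the same job with less code once the plugins are installed.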

What happens to the data when uploading it to GCP BigQuery when there is no internet?

I am using GCP BigQuery to store some data. I have created a Pub/Sub job for the Dataflow of the event. Currently, I am facing an issue with data loss. Sometimes, due to "no internet connection", the data is not uploaded to BigQuery and the data for that time window is lost. How can I overcome this situation?
Or what kind of database should I use to store the data offline and then upload it whenever there is connectivity?
Thank You in advance!
What you need is either a retry mechanism or persistent storage. There are several ways to implement this.
You can use a message queue to store the data and process it later. The message queue can be either cloud-based, like AWS SQS or Cloud Pub/Sub (GCP), or self-hosted, like Kafka or RabbitMQ.
Another, somewhat less optimized, way is to persist the data locally until it is successfully uploaded to the cloud. Local storage can be a buffer, a database, etc. If the upload fails, you retry from that storage. This is similar to the producer-consumer problem.
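As a rough illustration of that "persist locally, retry until acknowledged" idea, assuming you publish to Pub/Sub with the google-cloud-pubsub client (the project id, topic name and spool directory are placeholders):

```python
import json
import os
import time

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "sensor-events")  # placeholders
SPOOL_DIR = "/var/spool/bq-events"  # local buffer that survives restarts


def save_locally(event):
    """Always write the event to disk first, one file per event."""
    os.makedirs(SPOOL_DIR, exist_ok=True)
    path = os.path.join(SPOOL_DIR, f"{time.time_ns()}.json")
    with open(path, "w") as f:
        json.dump(event, f)


def flush_spool():
    """Retry everything in the spool; delete a file only after Pub/Sub acks it."""
    if not os.path.isdir(SPOOL_DIR):
        return
    for name in sorted(os.listdir(SPOOL_DIR)):
        path = os.path.join(SPOOL_DIR, name)
        with open(path, "rb") as f:
            data = f.read()
        try:
            publisher.publish(topic_path, data).result(timeout=30)  # wait for the ack
            os.remove(path)
        except Exception:
            break  # still offline; keep the remaining files and try again later
```

A local SQLite database works just as well as a spool directory; the important part is that nothing is deleted until Pub/Sub has acknowledged it.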
You can use a Google Compute Engine instance to store your data and always run your data-loading job from there. In that case, if your local internet connection is lost, the data will still continue to load into BigQuery.
From what I understood, you are publishing data to Pub/Sub and Dataflow does the rest to get the data into BigQuery, is that right?
The options I suggest to you:
If your connection loss happens occasionally and for a short amount of time, a retry mechanism could be enough to solve this problem.
If you have frequent connection loss, or connection loss for long periods of time, I suggest that you mix a retry mechanism with some process redundancy. You could, for example, have two processes running on different machines to avoid this kind of situation. It's important to mention that for this case you could also try only a retry mechanism, but it would be more complex because you would need to determine whether the process failed, save the data somewhere (if it's not already saved) and trigger the process again in the future.
I suggest that you take a look at Apache NiFi. It's a very powerful data-flow automation tool that might help you solve this kind of issue. Apache NiFi has specific processors to push data directly to Pub/Sub.
As a last suggestion, you could create an automated process that performs data-quality analysis after the data ingestion. With this process in place you could more easily determine whether your pipeline failed.

Uploading large files to server

The project I'm working on logs data on distributed devices that needs to be consolidated into a single database on a remote server.
The logs cannot be streamed as they are recorded (the network may not be available, etc.), so they must occasionally be sent as bulky 0.5-1 GB text-based CSV files.
As far as I understand, this means that having a web service receive the data in the form of POST requests is out of the question because of the file sizes.
So far I've come up with this approach: use some file transfer protocol (FTP or similar) to upload files from the device to the server. Devices would have to figure out a unique filename to do this. Have the server periodically check for new files, process them by committing them to the database, and delete them afterwards.
It seems like a very naive way to go about it, but simple to implement.
However, I want to avoid any pitfalls before I implement any specifics. Is this approach scalable (more devices, larger files)? Implementation will either be done using a private/company-owned server or a cloud service (Azure, for instance) - will it work on different platforms?
You could actually do this over the web/HTTP as well, after setting a higher value for POST requests in the web server (post_max_size and upload_max_filesize for PHP). This will allow devices to interact regardless of platform. Shouldn't be too hard to make a POST request from any device; a simple cURL request could get this job done.
FTP is also possible. Or SCP, to make it safer.
Either way, I think this does need some application on the server to be able to fetch and manage these files using a database. Perhaps a small web application? ;)
As for the unique name, you could use a combination of the device's unique ID/name along with the current Unix time. You could even hash this (MD5/SHA-1) afterwards if you like.
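To make that concrete, here is a rough device-side sketch (Python with the requests library; the endpoint URL, device id and header name are made up). It streams the CSV so a 0.5-1 GB file never has to fit in memory, and it derives the unique name from the device id plus the current Unix time:

```python
import hashlib
import time

import requests

DEVICE_ID = "logger-0042"                       # placeholder unique device name
UPLOAD_URL = "https://example.com/api/uploads"  # placeholder endpoint


def unique_name(device_id):
    """Device id + Unix time, hashed so the server sees an opaque, unique key."""
    raw = f"{device_id}-{int(time.time())}"
    return hashlib.sha1(raw.encode()).hexdigest() + ".csv"


def upload(csv_path):
    name = unique_name(DEVICE_ID)
    with open(csv_path, "rb") as f:
        # Passing the open file object makes requests stream the body,
        # so memory use stays flat even for 1 GB logs. The server still
        # needs its request-size limits raised accordingly.
        resp = requests.post(
            UPLOAD_URL,
            data=f,
            headers={"X-Filename": name, "Content-Type": "text/csv"},
        )
    resp.raise_for_status()
    return name
```

The server side then only needs to write the request body to disk under that name and let the periodic import job pick it up, exactly as in the FTP variant.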

What is the "proper" way to use DynamoDB for an iOS app?

I've just started messing around with AWS DynamoDB in my iOS app and I have a few questions.
Currently, I have my app communicating directly with my DynamoDB database. I've been reading around lately and people are saying this isn't the proper way to go about getting data from my database.
By this I mean that I just have a function in my code querying my DynamoDB database and returning the result.
The way I do it works, but is there a better way I should be going about this?
Amazon DynamoDB itself is a highly scalable service, and standing up another server in front of it means you also have to scale that server in line with the RCU/WCU configured for your tables, which you can and should avoid.
If your mobile application doesn't need a backend server and you can perform all the business functions from the mobile device, then you should probably think about:
Using the AWS DynamoDB SDK for iOS to write the client application that runs on the mobile device.
Using the AWS Token Vending Machine to authenticate your mobile users and grant them credentials for running operations on DynamoDB tables.
Controlling access (i.e., which operations should be allowed on which tables) using IAM policies, for example something like the sketch below.
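For the IAM part, fine-grained access to DynamoDB is usually expressed with a dynamodb:LeadingKeys condition, so that each authenticated user can only touch items whose partition key is their own identity id. Below is a hedged sketch of such a policy document, built as a Python dict; the table ARN and allowed actions are assumptions, and the ${cognito-identity.amazonaws.com:sub} variable assumes Amazon Cognito (the successor to the Token Vending Machine approach):

```python
import json

# Restrict a Cognito-authenticated user to items keyed by their own identity id.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["dynamodb:GetItem", "dynamodb:Query", "dynamodb:PutItem"],
            # Placeholder account id, region and table name.
            "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/UserData",
            "Condition": {
                "ForAllValues:StringEquals": {
                    "dynamodb:LeadingKeys": [
                        "${cognito-identity.amazonaws.com:sub}"
                    ]
                }
            },
        }
    ],
}

# Paste this into the IAM role attached to your (authenticated) identity pool.
print(json.dumps(policy_document, indent=2))
```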
HTH.
From what you say, I guess you are talking about a way to distribute data to many clients (iOS apps).
There are a few integration patterns (a very good book on this: Enterprise Integration Patterns), one of which is called shared database. It is essentially about using a common database that multiple clients share. The main drawback of that pattern (in your case) is that every client makes assumptions about what the database schema looks like, which can give you headaches supporting that schema in the future if your business logic changes.
The more advanced approach would be to send events on every change in your data instead of writing changes directly to the database from the client apps. This way you can add additional processing to the events before the data they carry is written to the database. For example, you may want to change the event format in a new version of your app but still support legacy users, so you add a translation step that transforms both types of events into the format that fits the database schema. It's basically a question of whether to work with diffs or snapshots.
You should be aware of the added complexity of working with events; it can be overkill if your app is simple and schema changes are unlikely.
Also consider that you can do data preprocessing using DynamoDB Streams, which gives you some of the advantages of events while keeping the implementation simple.
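To illustrate that last point, a DynamoDB Streams consumer is typically just a small Lambda function. Here is a rough Python sketch; what "preprocessing" means for your app, and the attribute names, are assumptions:

```python
def handler(event, context):
    """Lambda handler invoked with a batch of DynamoDB Streams records."""
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue  # e.g. ignore deletes
        new_image = record["dynamodb"].get("NewImage", {})
        # Values arrive in DynamoDB's attribute-value format, e.g. {"S": "..."}.
        user_id = new_image.get("userId", {}).get("S")  # hypothetical attribute
        # Do your preprocessing here: validate, translate old event formats,
        # fan out to other tables, update aggregates, etc.
        print(f"change for user {user_id}: {record['eventName']}")
    return {"processed": len(event["Records"])}
```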