Framework to run small independent tasks - amazon-web-services

I have a requirement where I need to do the following:
Pull records from a database at a certain frequency (hourly, daily, etc.)
For each record, make some API calls
For each record, update the database with new info
Do more stuff...
The number of records can be huge. Is there a framework that can manage this workflow, where each task/record can be at a different stage during execution?

Look at Cadence Workflow. There are multiple production applications with requirements similar to yours that rely on Cadence.
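To make the shape of a Cadence solution concrete, here is a minimal sketch using the Cadence Java client. The interface names, timeouts and activity bodies are made up for illustration; the real design depends on your data volumes, and each record could also be fanned out into its own child workflow so records progress through stages independently.

```java
import com.uber.cadence.activity.ActivityMethod;
import com.uber.cadence.activity.ActivityOptions;
import com.uber.cadence.workflow.Workflow;
import com.uber.cadence.workflow.WorkflowMethod;
import java.time.Duration;
import java.util.List;

public class RecordSyncSketch {

    // Activities hold the side-effecting work: DB reads, API calls, DB writes.
    public interface RecordActivities {
        @ActivityMethod
        List<String> fetchPendingRecordIds();

        @ActivityMethod
        String callExternalApi(String recordId);

        @ActivityMethod
        void updateRecord(String recordId, String apiResult);
    }

    // The workflow orchestrates the activities. Cadence persists its progress,
    // so every record can be at a different stage even across worker restarts.
    public interface RecordSyncWorkflow {
        @WorkflowMethod
        void syncRecords();
    }

    public static class RecordSyncWorkflowImpl implements RecordSyncWorkflow {
        private final RecordActivities activities =
                Workflow.newActivityStub(
                        RecordActivities.class,
                        new ActivityOptions.Builder()
                                .setScheduleToCloseTimeout(Duration.ofMinutes(5))
                                .build());

        @Override
        public void syncRecords() {
            for (String id : activities.fetchPendingRecordIds()) {
                String result = activities.callExternalApi(id);   // API calls per record
                activities.updateRecord(id, result);              // write new info back
            }
        }
    }
}
```

The hourly/daily trigger itself can be expressed by starting the workflow with a cron schedule in its workflow options, so you don't need a separate scheduler.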

Related

How do I get all transactions in an account without going through all the blocks?

How do I fetch all the transactions for a specific NEAR account without going through all the blocks, like the example here:
https://docs.near.org/docs/api/naj-cookbook#recent-transaction-details
I'm trying to show the transactions in React, and fetching all the blocks takes too much time.
It's impossible to do through the API, but it is possible with an SQL query against the public PostgreSQL database of the Indexer for Explorer:
https://github.com/near/near-indexer-for-explorer#shared-public-access
However, the access is shared across everyone and has a very limited number of connections, so it's not the most reliable solution if you're building a project and need to run such queries regularly.
So if you need to get all transactions for an account regularly, you will need to write and run your own indexer that stores the data you need in a database you can access on a regular basis (a rough query sketch follows the links below).
Useful links:
https://docs.near.org/docs/concepts/indexer
https://docs.near.org/docs/tutorials/near-indexer
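For illustration, here is a rough sketch of that SQL approach from Java/JDBC. The connection URL is a placeholder (the shared-public-access credentials are in the README linked above), the `transactions` table with `signer_account_id`/`receiver_account_id`/`block_timestamp` columns is assumed from the Indexer for Explorer schema, and a PostgreSQL JDBC driver is needed on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class NearAccountTransactions {
    public static void main(String[] args) throws Exception {
        // Placeholder URL; fill in the shared-public-access details from the README.
        String url = "jdbc:postgresql://<host>/<database>?user=<user>&password=<password>";
        String account = "example.near";

        // Transactions where the account is either the signer or the receiver.
        String sql =
            "SELECT transaction_hash, block_timestamp "
          + "FROM transactions "
          + "WHERE signer_account_id = ? OR receiver_account_id = ? "
          + "ORDER BY block_timestamp DESC "
          + "LIMIT 100";

        try (Connection conn = DriverManager.getConnection(url);
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, account);
            ps.setString(2, account);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getString("transaction_hash")
                            + " @ " + rs.getLong("block_timestamp"));
                }
            }
        }
    }
}
```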

Updating all entities of KIND in Google Cloud Datastore

We have a dataset of ~10 million entities of a certain Kind in Datastore. We want to change the product's functionality, so we would like to change the fields on all entities of that Kind.
Is there a smart/quick way to do it that does not involve iterating over all of the entities serially?
You can probably use Dataflow to help with your problem.
Dataflow is a stream and batch data processing service, fully managed by GCP.
Its programming model was open-sourced as the Apache Beam project, and Dataflow is fully compatible with the Beam SDK. This allows you to test your pipelines locally before running them on GCP.
It exposes two main concepts: a PCollection, basically the data being handled by the tool, and a pipeline, the steps needed to read the data, the transformations to perform, and how and where the results should be written.
It provides support for Java, Python and Go, and a rich feature set with a wide variety of data sources and transformations.
In the specific case of Datastore, Dataflow supports reading, writing and deleting data. See, for instance, the relevant documentation for Python.
You can see a good example of how to interact with Datastore in the Apache Beam GitHub repository.
These two other articles could be also interesting: 1 2.
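For a rough idea of what such a pipeline could look like with the Beam Java SDK: the project id, Kind name and added property below are placeholders, and the Beam GCP IO and protobuf extension modules are assumed on the classpath, so treat it as a sketch rather than a drop-in migration.

```java
import com.google.datastore.v1.Entity;
import com.google.datastore.v1.KindExpression;
import com.google.datastore.v1.Query;
import com.google.datastore.v1.Value;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.extensions.protobuf.ProtoCoder;
import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptor;

public class UpdateKindEntities {
  public static void main(String[] args) {
    String project = "my-gcp-project";  // placeholder project id

    // Read every entity of the Kind; Dataflow parallelizes this instead of
    // iterating in series.
    Query query = Query.newBuilder()
        .addKind(KindExpression.newBuilder().setName("Product"))
        .build();

    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    PCollection<Entity> updated = p
        .apply("ReadEntities",
            DatastoreIO.v1().read().withProjectId(project).withQuery(query))
        .apply("AddField", MapElements
            .into(TypeDescriptor.of(Entity.class))
            .via((Entity e) -> e.toBuilder()
                .putProperties("new_field",
                    Value.newBuilder().setStringValue("default").build())
                .build()))
        .setCoder(ProtoCoder.of(Entity.class));  // be explicit about the proto coder

    // Upsert the modified entities back into Datastore.
    updated.apply("WriteEntities", DatastoreIO.v1().write().withProjectId(project));

    p.run().waitUntilFinish();
  }
}
```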
I would presume that you have to loop through each one and update it, as it's a NoSQL data store like Mongo from what I can see. We have a system that uses SQL and Mongo, and the denormalised data is a pain; we had to write migrations that would loop through everything and update it.

Need help building an uptime dashboard for a distributed system

I have a product for which I would like to create a dashboard to show
its availability/uptime over time and display any outages.
Specifically, I am looking for:
the ability to report historical information on service uptime
details on any service outages
The product runs on a fleet of Linux servers and connects to a DB running on a separate instance; we also have some dedicated instances that run nightly batch jobs. The system also relies on some external services to provide additional functionality for select customers. There is also a Redis cache for caching data for multiple customers.
We replicate the entire setup above (application servers, DB, job servers, Redis cache, etc.) into dedicated clusters for large customers. Small customers are placed on one of the shared clusters to keep costs low.
Currently we run health checks on the application servers only and provide that information on a simple HTML page. This is the go-to page for end users/customers and support teams.
Since the product is built from multiple systems/services, our current HTML page often says the system is up and running fine while it is actually experiencing issues with some of its components or external services.
The current health check uses a simple HTTP request and looks for a 200 status code; it runs every minute and we plot the data into a simple chart showing the last 30 days. We also show a list of outages with timestamps and additional static information that is added manually.
We would like to build a more robust solution that monitors much more than the HTTP port and gives us more detail, such as which part of the system is having issues, how those issues are impacting the system, and which customers are impacted.
Appreciate any guidance or help. We prefer to build the solution using open-source tools since we don't have much of a budget. The goal is to improve things for my team members, who are already overloaded.
I'm not sure whether this will be overkill for your setup, given that I don't know your product, but have a look at the ELK Stack and see if you can use some of its components, or at least some ideas, from there:
What is the ELK Stack?
The Complete Guide to the ELK Stack
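Whichever stack you feed the data into, the per-component visibility you're after starts with the checks themselves. Below is a rough Java sketch (hostnames, ports and credentials are placeholders, and a real version would also probe your external services and each cluster) of a health endpoint that checks the DB, Redis and the jobs host individually and reports each status instead of a single 200:

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.LinkedHashMap;
import java.util.Map;

public class ComponentHealth {

    // Probe a TCP dependency (e.g. Redis) with a short timeout.
    static boolean tcpUp(String host, int port) {
        try (Socket s = new Socket()) {
            s.connect(new InetSocketAddress(host, port), 2000);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    // Probe the database with a real connection check, not just the port.
    // Requires a JDBC driver on the classpath (PostgreSQL assumed here).
    static boolean dbUp(String jdbcUrl, String user, String pass) {
        try (Connection c = DriverManager.getConnection(jdbcUrl, user, pass)) {
            return c.isValid(2);
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);
        server.createContext("/health", exchange -> {
            // Check each component separately so the dashboard can show
            // exactly which part of the system is degraded.
            Map<String, Boolean> components = new LinkedHashMap<>();
            components.put("database", dbUp("jdbc:postgresql://db-host/app", "app", "secret"));
            components.put("redis", tcpUp("redis-host", 6379));
            components.put("jobs_host", tcpUp("jobs-host", 22));

            boolean allUp = components.values().stream().allMatch(Boolean::booleanValue);
            StringBuilder json = new StringBuilder("{");
            components.forEach((name, up) ->
                    json.append('"').append(name).append("\":\"")
                        .append(up ? "up" : "down").append("\","));
            json.append("\"overall\":\"").append(allUp ? "up" : "degraded").append("\"}");

            byte[] body = json.toString().getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(allUp ? 200 : 503, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
    }
}
```

The minute-by-minute results (one record per component rather than one per host) can then be indexed into Elasticsearch and charted in Kibana, which would give you the 30-day history and outage list without the manual bookkeeping.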

Online update spanner schema is extremely slow

Online Spanner schema updates take minutes even for very small tables (tens of rows),
i.e. adding/dropping/altering columns, adding tables, etc.
This can be quite frustrating for development processes and new version deployments.
Any plans for improvement?
A few more questions:
Does anyone know of a 3rd-party schema comparison tool for Spanner? I couldn't find any.
What about data backups, in order to save historical snapshots?
Thanks in advance.
Schema Updates:
Since Cloud Spanner is a distributed database, it has to update all moving parts of the system, which accounts for the latency you describe.
As a suggestion, you could batch the schema updates. This keeps the latency low (nearly equivalent to executing a single schema update) and can be done using the API or the gcloud command-line tool, as in the sketch below.
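A rough sketch of the batched approach with the Java client; the instance, database and DDL statements below are placeholders:

```java
import com.google.cloud.spanner.DatabaseAdminClient;
import com.google.cloud.spanner.Spanner;
import com.google.cloud.spanner.SpannerOptions;
import java.util.Arrays;

public class BatchedSchemaUpdate {
  public static void main(String[] args) throws Exception {
    Spanner spanner = SpannerOptions.newBuilder().build().getService();
    DatabaseAdminClient admin = spanner.getDatabaseAdminClient();

    // Submitting several DDL statements as one batch costs roughly the same
    // as a single statement, instead of paying the propagation latency per change.
    admin.updateDatabaseDdl(
            "my-instance",
            "my-database",
            Arrays.asList(
                "ALTER TABLE Products ADD COLUMN Sku STRING(64)",
                "ALTER TABLE Products ADD COLUMN Price INT64",
                "CREATE INDEX ProductsBySku ON Products(Sku)"),
            null)          // operationId: null lets Spanner generate one
        .get();            // block until the whole batch has been applied
    spanner.close();
  }
}
```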
Schema Comparison Tool:
You could use the getDatabaseDdl API to maintain a history of your schema changes and use your tool of choice to diff them.
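A small sketch of that with the Java client, dumping the DDL so snapshots can be committed to version control and compared with any text diff tool (instance and database names are placeholders):

```java
import com.google.cloud.spanner.DatabaseAdminClient;
import com.google.cloud.spanner.Spanner;
import com.google.cloud.spanner.SpannerOptions;
import java.util.List;

public class DumpSchema {
  public static void main(String[] args) {
    Spanner spanner = SpannerOptions.newBuilder().build().getService();
    DatabaseAdminClient admin = spanner.getDatabaseAdminClient();

    // Print the current schema; store each dump and diff revisions to
    // keep a history of schema changes.
    List<String> ddl = admin.getDatabaseDdl("my-instance", "my-database");
    ddl.forEach(statement -> System.out.println(statement + ";"));

    spanner.close();
  }
}
```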

Check changes before run synchronize

Is there any way to check for changes in the database before running synchronization with MS Sync Framework?
I have a database with about 100 tables; 80% of these tables do not change very often. I divided the database into multiple scopes to handle the sync priority. Even so, when there are no changes in the database, it takes a long time to finish synchronization.
I suggest you trace the sync process to find out what's going on: How to: Trace the Synchronization Process.
There is no specific API call in the Sync Framework SDK for simply checking whether a table has changed; most of the API calls will do an actual change enumeration (read: query the base and tracking tables).
If you have a large number of rows in your tables, you might want to set a retention period on the Sync Framework metadata to keep it small; see How to: Clean Up Metadata for Collaborative Synchronization (SQL Server).
Yes. Check out the Sync Framework Team Blog post on Synchronization Services for ADO.NET for Devices: Improving performance by skipping tables that don't need synchronization.