Schedule a task inside the Hyperledger Composer ledger - blockchain

I need to update some field every day.
I came up with 2 possible solutions:
Update it via REST API - for example, a program that runs on some server, updates the field via REST API and then sleeps 1 day. Problem: if the program stops it does not update the ledger, thus the network does not work correctly anymore.
Make a smart contract that sleeps 1 day and then updates the fields. Problem: as far as I know how the internals are working, isn't that going to make problems with reaching the consensus?

correct on 2 - you likely won't get a deterministic result (whatever you're updating it with but sounds like its date-based and not sure whether its a calendar or elapsed day etc etc) and the update will fail. You're best to manage the update based on a field value from the client side. You need some high availability solution and/or checks for your client-side issue that the update takes place. The ledger is not really the place to rely on applying your operational, schedule-based update.

Related

Is there a way to compute this amount of data and still serve a responsive website?

Currently I am developing a django + react website, that will (I hope) serve a decent number of users. The project demo is mostly complete, and I am starting to think about the scale required to put this thing into production
The website essentially does three things:
Grab data from external APIs (i.e. Twitter) for 50,000 unique keywords (the keywords dont change). This process happens every 30 minutes
Run computation on all of the data, and save that computation to the database. Assume that the algorithm is as optimized as possible
When a user visits the website it should serve a pretty graph/chart of all of the computational data per keyword
The issue being, this is far too intense a task to be done by the same application that serves the website, users would be waiting decades to see their data. My current plan is to have a separate API made that services the website with the data, that the website can then store in it's database. This separate API would process the data without fear of affecting users, and it should be able to finish its current computation in under 30 minutes, in time for the next round of data.
Can anyone help me understand how I can better equip my project to handle the scale? I'd love some ideas.
As a 4th year CS Student I figured it's time to put a real project out into the world and I am very excited about it and the progress I've made so far. My main worry is that the end users will be negatively effected, if I don't figure out some kind of pipeline to make this process happen.
To re-iterate my idea:
Django + React - This is the forward facing website
External API - Grabs the data off the internet and processes it, and waits for a GET request from the website
Is there a better way to do this? Or on the other hand am I severely overestimating how computationally heavy this is.
Edit: Including current research
Handling computationally intensive tasks in a Django webapp
Separation of business logic and data access in django
What you want is to have the computation task to be executed by a different process in the "background".
The most straight-forward and popular solution is to use Celery, see here.
The Celery worker(s) - which performs the background task - can either run on the same machine as the web-application or (when scale becomes an issue), you can change the configuration so that it will run on an entirely different machine.

DJANGO : Hit an api multiple times simultaneously

Here is my project situation. I am using Django
I am running a cron job every day at a particular time. What I do is I query a list of users whose payment needs to be requested. The payment api is collection of two api's each per user depending on the type of subscription.
Question :: how should the api be hit.
1. Should be hit in order(i.e. one after the other. I think this is not a good way)
2. Should I hit it concurrently (all api hits are independent of each other).
If the second way is better, what are the tools/libraries in django that I can use.
Definitely the second way will be much faster since it is involving parallelism. But will it put a lot of pressure/load on the server. What is the standard way to go in such a scenario?

MassTransit and Quartz.net - How to populate lots of scheduled jobs

I have been looking at using MassTransit's Quartz.Net (using AdoJobStore) implementation to schedule messages send/publish for future and all of this works fairly smoothly.
The bit where I am stuck is, as a part of the production deployment, I need to set up a lot of "Scheduled Messages" to be issued at various times during the next year odd.
Is there a mechanism available to pre-populate the Quartz SQL store with Triggers/Jobs externally ?
I finally figured a way to do this; posting here, so it might help others if needed.
The Quartz SQL DB is nothing but a simple data serialised into objects.
e.g. varbinary for JOB_DATA and ticks in for time. Other values are fairly simple.
I ended up created a sample app to setup some schedules and then reversed the database later to figure.
It was all quite simple in the end, and now I have splain SQL insert script which inserts the schedules as a part of the CD pipeline.

Designing a distributed web scraper

The Problem
Lately I've been thinking about how to go about scraping the contents of a certain big, multi-national website, to get specific details about the products the company offers for sale. The website has no API, but there is some XML you can download for each product by sending a GET request with the product ID to a specific URL. So at least that's something.
The problem is that there are hundreds of millions of potential product ID's that could exist (between, say, 000000001 and 500000000), yet only a few hundred thousand products actually exist. And it's impossible to know which product ID's are valid.
Conveniently, sending a HEAD request to the product URL yields a different response depending on whether or not the product ID is valid (i.e. the product actually exists). And once we know that the product actually exists, we can download the full XML and scrape it for the bits of data needed.
Obviously sending hundreds of millions of HEAD requests will take an ungodly amount of time to finish if left to run on a single server, so I'd like to take the opportunity to learn how to develop some sort of distributed application (totally new territory for me). At this point, I should mention that this particular website can easily handle a massive amount of incoming requests per second without risk of DOS. I'd prefer not to name the website, but it easily gets millions of hits per day. This scraper will have a negligible impact on the performance of the website. However, I'll immediately put a stop to it if the company complains.
The Design
I have no idea if this is the right approach, but my current idea is to launch a single "coordination server", and some number of nodes to communicate with that server and perform the scraping, all running as EC2 instances.
Each node will launch some number of processes, and each process will be designated a job by the coordination server containing a distinct range of potential product ID's to be scraped (e.g. product ID 00001 to 10000). These jobs will be stored in a database table on the coordination server. Each job will contain info about:
Product ID start number
Product ID end number
Job status (idle, in progress, complete, expired)
Job expiry time
Time started
Time completed
When a node is launched, a query will be sent to the coordination server asking for some configuration data, and for a job to work on. When a node completes a job, a query will be sent updating the status of the job just completed, and another query requesting a new job to work on. Each job has an expiry time, so if a process crashes, or if a node fails for any reason, another node can take over an expired job to try it again.
To maximise the performance of the system, I'll need to work out how many nodes should be launched at once, how many processes per node, the rate of HTTP requests sent, and which EC2 instance type will deliver the most value for money (I'm guessing high network performance, high CPU performance, and high disk I/O would be the key factors?).
At the moment, the plan is to code the scraper in Python, running on Ubuntu EC2 instances, possibly launched within Docker containers, and some sort of key-value store database to hold the jobs on the coordination server (MongoDB?). A relational DB should also work, since the jobs table should be fairly low I/O.
I'm curious to know from more experienced engineers if this is the right approach, or if I'm completely overlooking a much better method for accomplishing this task?
Much appreciated, thanks!
You are trying to design a distributed workflow system which is, in fact, a solved problem. Instead of reinventing the wheel, I suggest you look at AWS's SWF, which can easily do all state management for you, leaving you free to only worry about coding your business logic.
This is how a system designed using SWF will look like (Here, I'll use SWF's standard terminologies- you might have to go through the documentation to understand those exactly):
Start one workflow per productID.
1st activity will check whether this productID is valid, by making a HEAD request as you mentioned.
If it isn't, terminate workflow. Otherwise, 2nd activity will fetch relevant XML content, by making the necessary GET request, and persist it, say, in S3.
3rd activity will fetch the S3 file, scrape the XML data and do whatever with it.
You can easily change the design above to have one workflow process a batch of product IDs.
Some other points that I'd suggest you keep in mind:
Understand the difference between crawling and scraping: crawling means fetching relevant content from the website, scraping means extracting necessary data from it.
Ensure that what you are doing is strictly legal!
Don't hit the website too hard, or they might blacklist your IP ranges. You have two options:
Add delay between two crawls. This too can be easily achieved in SWF.
Use anonymous proxies.
Don't rely too much on XML results from some undocumented API, because that can change anytime.
You'll need high network performance EC2 instances. I don't think high CPU or memory performance would matter to you.

How to set up website periodic tasks?

I'm not sure if the topic is appropiate to the question... Anyway, suppose I've a website, done with PHP or JSP and a cheap hosting with limited functionalities. More precisely I don't own the server and I can't run daemons or services at my will. Now I want to run periodic (say every minute or two) tasks to do certain statistics. They are likely to be time consuming so I just can't repeat the calculation for every user accessing a page. I could do the calculation once when a user loads a page and I calculate that enough time has passed, but in this situation if the calculation gets very long the response time may be excessive and timeout (it's not likely I don't mean to run such long task, but I'm considering the worst case).
So given these costraints, what solutions would you suggest?
Each cheap hosting will have support of crontabs. Check out the hosting packages first.
If not, load in the page, and launch the task by AJAX. This way your response time doesn't suffer and you do in a different thread the work.
If you choose to use crontab, you're going to have to know a bit more to execute your PHP scripts from them. Depending on if your PHP executes as CGI or an apache module has an effect. There's a good article on how to do this at:
http://www.htmlcenter.com/blog/running-php-scripts-with-cron/
If you don't have access to crontab on your hosting provider (find a new one) there are other options. For example:
http://www.setcronjob.com/
Will call a script on your site, remotely every X period .. you have to renew it once a month I think. If you take their paid service ($5/year according to the front page) I think the jobs you set up last until you cancel them or your paid term runs out.
In Java, you could just create a Timer. It can create a background thread that will perform a given function every so often.
My preference would be to start the Timer object in a ContextListener.contextInitialized() method, but you may have a more appropriate place.