Airflow | Scheduling URLs in batches over multiple days - airflow-scheduler

I need some suggestions related to setting up scrapers with Airflow.
The issue: let's say I have a website to scrape that has some 3k links. I want to divide these into 3 batches of 1k each, and each batch has to run on a different day.
What would be the best approach in this case? Is there a way to do conditional, parameter-based scheduling? For example, I have Excel as my data source, so next to every URL I can note a batch number; could the scheduling then differ based on that batch number?
Hope this makes some sense. Please suggest.
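For what it's worth, here is one way the batch-number idea could look as a rough sketch (not an authoritative answer; the file name, column names, start date and the print placeholder are all assumptions): read the sheet inside a daily task, map the run date onto a batch number, and scrape only the matching rows.

    from datetime import datetime
    import pandas as pd
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def scrape_batch(**context):
        # Assumed layout: urls.xlsx with 'url' and 'batch' columns (values 1, 2, 3).
        df = pd.read_excel('urls.xlsx')
        start = datetime(2024, 1, 1)                     # placeholder campaign start date
        days_elapsed = (datetime.strptime(context['ds'], '%Y-%m-%d') - start).days
        batch = days_elapsed % 3 + 1                     # rotate through batches 1..3
        for url in df.loc[df['batch'] == batch, 'url']:
            print(f'would scrape {url}')                 # placeholder for the real scraper

    with DAG('batched_scrape', start_date=datetime(2024, 1, 1),
             schedule_interval='@daily', catchup=False) as dag:
        PythonOperator(task_id='scrape_current_batch', python_callable=scrape_batch)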

Related

Cloud SQL to BigQuery ETL tool

I have a Cloud SQL instance with hundreds of databases, one for each customer. Each database has the same tables in it, but data only for the specific customer.
What I want to do is transform it in various ways so I can get an overview table covering all of the customers. Unfortunately, I cannot seem to find a tool that can iterate over all the databases in a Cloud SQL instance, execute queries, and then write that data to BigQuery.
I was really hoping that Dataflow would be the solution but as far as I have tried and looked online, I cannot find a way to make it work. Since I spent a lot of time already on investigating Dataflow, I thought it might be best to ask here.
Currently I am looking at Data Fusion, Datastream, Apache Airflow.
Any suggestions?
Why doesn't Dataflow fit your needs? You could run a query to find out the tables, and then iteratively build the pipeline's JdbcIO sources/PCollections based on those results.
Beam has a Flatten transform that can join PCollections.
What you are trying to do is one of the use cases Dataflow Flex Templates were created for (to have dynamic DAG creation within Dataflow itself), but it can be pulled off without Flex Templates as well.
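To make the shape of that concrete, here is a minimal Python SDK sketch of the "one source per database, then Flatten" idea (my illustration, not the answerer's code; read_customer_db stands in for a real JdbcIO/ReadFromJdbc read, and the database names, destination table and schema are made up):

    import apache_beam as beam

    def read_customer_db(pipeline, db_name):
        # Stand-in for a real JdbcIO / ReadFromJdbc read of one customer database;
        # here it just emits a dummy row so the sketch runs end to end.
        return pipeline | f'read_{db_name}' >> beam.Create([{'customer': db_name, 'value': 1}])

    with beam.Pipeline() as p:
        db_names = ['customer_001', 'customer_002']   # e.g. discovered via a metadata query
        per_db = [read_customer_db(p, name) for name in db_names]
        merged = tuple(per_db) | beam.Flatten()       # join all per-database PCollections
        merged | beam.io.WriteToBigQuery(
            'my-project:analytics.customers_overview',      # made-up destination table
            schema='customer:STRING,value:INTEGER',
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)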
Airflow can be used for this sort of thing (essentially, you're doing the same task over and over, so with an appropriate operator and a for-loop you can certainly generate a DAG with hundreds of near-identical tasks that export each of your databases).
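As a rough sketch of that for-loop pattern (my illustration, not the answer's code; the connection details are omitted, and the database list and export callable are placeholders):

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    CUSTOMER_DBS = ['customer_001', 'customer_002']   # in practice, hundreds of names

    def export_db_to_bigquery(db_name, **context):
        # Placeholder: run the per-database query and load the result into BigQuery.
        print(f'exporting {db_name}')

    with DAG('cloudsql_to_bq_fanout', start_date=datetime(2024, 1, 1),
             schedule_interval='@daily', catchup=False) as dag:
        for db_name in CUSTOMER_DBS:
            PythonOperator(
                task_id=f'export_{db_name}',
                python_callable=export_db_to_bigquery,
                op_kwargs={'db_name': db_name},
            )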
However, I'd be remiss not to ask: should you?
There may be a really excellent reason why you've created hundreds of databases in one instance rather than one database with a customer field on each table. But if security is the concern, a row-level security policy could add that extra element of safety without putting you in this difficult situation. Adding an index on the customer field would let you retrieve the appropriate sub-table swiftly (in return for a small speed cost when inserting new rows), so performance doesn't seem like a reason to do this either.
Given that it would then be pretty straightforward to get your data into BigQuery, I would be moving heaven and earth to switch over to this setup if I were you!

Is it possible to stage the model files in Dataflow?

I am having a tough time deploying Dataflow pipelines. Thanks to the GCP docs... Below is what I am trying to achieve.
I have 4 deep learning models (binary files, 1 GB each). I want to get predictions from all 4 models, so I stored all 4 models in a bucket, and in my pipeline I do:
download_blob(......, destination_file_name = 'model.bin')
fasttext.load_model('model.bin')
It works fine, but I have the below concerns.
Every time a job is created, it downloads these files, which consumes a lot of time. If I launch 100 jobs, the models will be downloaded 100 times. Is there any way I could avoid this?
Is there any way I could stage these files in some location so that even if I trigger the job 100 times, the models are downloaded just once?
As mentioned at GCP Dataflow Computation Graph and Job Execution, you could put the model data in a custom container. Of course the container itself will still have to be staged on the workers.
You could also consider whether a single pipeline (perhaps streaming, if the input is not known ahead of time) would serve your needs better than many successive runs.
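If the single-pipeline route looks attractive, one common pattern (my sketch, not part of the original answer; the bucket and blob names are made up) is to confine the download and fasttext.load_model call to DoFn.setup(), so each worker loads a model once rather than once per job or per element:

    import apache_beam as beam
    import fasttext
    from google.cloud import storage

    class Predict(beam.DoFn):
        def __init__(self, bucket_name='my-models', blob_name='model1.bin'):  # made-up names
            self.bucket_name = bucket_name
            self.blob_name = blob_name
            self.model = None

        def setup(self):
            # Runs once per worker, not once per element: download and load the model here.
            client = storage.Client()
            blob = client.bucket(self.bucket_name).blob(self.blob_name)
            blob.download_to_filename('model.bin')
            self.model = fasttext.load_model('model.bin')

        def process(self, text):
            yield self.model.predict(text)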

Cron Jobs vs Task Scheduler table for scheduled emails

Preamble: I have a web app whose backend is based on a serverless architecture. It's basically an Amplify app hosted on AWS with a DynamoDB database. I've learnt it's possible to create a task scheduling system of sorts (more here). A quick summary of the article is: "It's possible to create a task scheduling table, taking advantage of TTL and DynamoDB Streams, to execute Lambda functions at specific times. The TTL specifies a set time for a record to be deleted; we can capture this delete event in a DynamoDB stream and run some tasks based on information from the stream."
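For illustration (my own sketch, not from the article; the attribute names are made up), the Lambda on the receiving end of the stream would look roughly like this, reacting to the REMOVE records that TTL expiry produces:

    def handler(event, context):
        for record in event['Records']:
            # TTL expiries show up as REMOVE events; genuine TTL deletes also carry
            # userIdentity.principalId == 'dynamodb.amazonaws.com'.
            if record['eventName'] != 'REMOVE':
                continue
            old_item = record['dynamodb'].get('OldImage', {})
            task_type = old_item.get('taskType', {}).get('S')   # made-up attribute
            user_id = old_item.get('userId', {}).get('S')       # made-up attribute
            if task_type and task_type.startswith('email'):
                send_email(user_id, task_type)

    def send_email(user_id, task_type):
        # Hypothetical helper standing in for the real email send (e.g. via SES).
        print(f'would send {task_type} to {user_id}')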
Problem:
The goal is to send a series of emails to users who sign up for our service. Each user that signs up gets a series of "Getting Started" emails. The first of the emails is sent 24 hours after a user signs up, the second 3 days later and the third exactly 7 days after sign up.
I see how a cron job would be suitable here, but it just seems a bit inefficient to me. I would basically have to search the users table for users whose sign-up time falls within a specific 24-hour window and send them the email, whereas with a task scheduler table I could add a task to the table (something like "send first email to user300", with a TTL of when I want it sent) and listen for delete events to run the task. No need to run a cron job daily, just a function that handles each task as it comes.
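For concreteness, scheduling the three emails at sign-up time might look like the sketch below (my illustration; the 'ScheduledTasks' table, its attributes and the hard-coded delays are assumptions, with 'runAt' configured as the table's TTL attribute):

    import time
    import boto3

    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('ScheduledTasks')      # assumed table name

    def schedule_getting_started_emails(user_id):
        now = int(time.time())
        delays = {'email_1': 1 * 86400, 'email_2': 3 * 86400, 'email_3': 7 * 86400}
        for task_type, delay in delays.items():
            table.put_item(Item={
                'taskId': f'{user_id}#{task_type}',
                'userId': user_id,
                'taskType': task_type,
                'runAt': now + delay,             # TTL attribute: approximate fire time
            })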
I think this is more of a performance vs. storage problem. Having a task scheduler table would take up space: if we add all the emails to be sent to a user as tasks on the table (each email to be sent to a specific user is its own task) each time a user signs up, then I see the task scheduler table growing by 3n records for every n users signed up. But this may not really be a problem, as tasks are deleted after they are run. I do not know the performance cost of using a cron job for this particular task, hence I'm here. I also may be wrong, and the cost of running and updating this task scheduler table may be more than that of the cron job.
I initially thought of setting up a dummy users table, running both the cron job and the task scheduler, and documenting the cost of running both, but you can imagine how much time and effort that would take.
So I guess my question is which is a more efficient solution in terms of performance and cost?
There is no perfect solution here. Keep in mind that DynamoDB TTL can take up to 48h to fire, so it's probably unacceptable. Cron jobs with Lambda are cheap and easy to set up. You could also use SQS and populate it with a daily cron. Yan Cui wrote a great article about this problem: https://theburningmonk.com/2019/03/dynamodb-ttl-as-an-ad-hoc-scheduling-mechanism/
This may not exactly be an answer. Based on the Medium article you linked, the author had a plausible reason why TTL and DynamoDB Streams would be better than a cron job, which you reiterated. Setting up a cron job is easier and cheaper (free), and I doubt the performance will be that much worse unless the database is huge. I don't have any experience doing something like this, so I wouldn't know how large the database would have to be for switching over to make sense. Alternatively, you can have as many cron jobs as you want, so you could also just set up a user-specific cron job whenever someone signs up.
You can set up a CloudWatch Events rule to fire a Lambda function on a regular schedule. The Lambda function can search a database for an applicable result set and perform other actions: send an email, a text message, etc.
Here is an AWS tutorial that covers a very similar use case with step-by-step instructions. This tutorial is implemented using the AWS Java API (but you can implement it using other supported programming languages).
https://github.com/awsdocs/aws-doc-sdk-examples/tree/master/javav2/usecases/creating_scheduled_events
From a cost perspective, Lambda allows 1M free requests per month. Details are here: https://aws.amazon.com/lambda/pricing/
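As a rough Python sketch of that scheduled-Lambda approach (my own illustration, not from the tutorial; the 'Users' table, its signUpAt attribute, the sender address, and a table small enough to scan are all assumptions):

    import time
    import boto3

    dynamodb = boto3.resource('dynamodb')
    users = dynamodb.Table('Users')          # assumed table name
    ses = boto3.client('ses')

    def handler(event, context):
        # Triggered once a day by a CloudWatch Events / EventBridge schedule.
        now = int(time.time())
        resp = users.scan(
            FilterExpression='signUpAt BETWEEN :lo AND :hi',    # assumed epoch attribute
            ExpressionAttributeValues={':lo': now - 2 * 86400, ':hi': now - 86400},
        )
        for user in resp.get('Items', []):
            ses.send_email(
                Source='hello@example.com',                     # placeholder sender
                Destination={'ToAddresses': [user['email']]},
                Message={'Subject': {'Data': 'Getting started'},
                         'Body': {'Text': {'Data': 'Welcome aboard!'}}},
            )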

Scrapy concurrent or distributed crawls

I would like to use Scrapy to crawl fairly large websites. In some cases I will already have the links to scrape, and in others I will need to extract (crawl) them. I will also need to access a database twice while running: once to determine whether a URL needs to be scraped (spider middleware) and once to store the extracted information (item pipeline).
Ideally, I would be able to run concurrent or distributed crawls in order to speed things up. What is the recommended way to run concurrent or distributed crawls with scrapy?
You should check scrapy_redis.
It is very simple to implement. Your scheduler and duplicate filter will be stored in a Redis queue, all the spiders will work concurrently, and you should speed up your crawl time.
Hope this helps.
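For reference, the scrapy_redis wiring is mostly configuration; a minimal settings.py might look roughly like this (the Redis URL is a placeholder):

    # settings.py (sketch)
    SCHEDULER = "scrapy_redis.scheduler.Scheduler"              # Redis-backed scheduler
    DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"  # shared duplicate filter
    SCHEDULER_PERSIST = True                 # keep the queue between runs
    REDIS_URL = "redis://localhost:6379"     # placeholder; point at your Redis instance

    # Optionally push scraped items into Redis as well
    ITEM_PIPELINES = {
        "scrapy_redis.pipelines.RedisPipeline": 300,
    }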
The Scrapy Cluster documentation contains a page listing many existing Scrapy-based solutions for distributed crawls.

Amazon MapReduce with cronjob + APIs

I have a website set up on an EC2 instance which lets users view info from 4 of their social networks.
Once a user joins, the site should update their info every night, to show up-to-date and relevant information the next day.
Initially we had a cron job which went through each user, made the necessary calls to the APIs, and then stored the data in the DB (an Amazon RDS instance).
This operation should take between 2 and 30 seconds per person, which means doing it one by one would take days to update.
I was looking at MapReduce and would like to know if it would be a suitable option for what I'm trying to do, but at the moment I can't tell for sure.
Would I be able to give MapReduce an .sql file with all the records I want to update, plus a script that tells MapReduce what to do with each record, and have it process them all simultaneously?
If not, what would be the best way to go about it?
Thanks for your help in advance.
I am assuming each user's data is independent of the other users' data, which seems logical to me. If that's not the case, please ignore this answer.
Since you have mutually independent data (that is, each user's data is independent of the other users'), there is no need to use MapReduce. MR is just a programming paradigm that simplifies data manipulation when the data is not independent (map prepares the data, then there is a sorting phase, then reduce pulls the results from the sorted records).
In your case, if you want to use more computers, just split the load between them; each computer should process ~10000 users per hour (very rough estimate). Users can either be distributed among the computers beforehand, or they can be requested in chunks of 1000 or so, so the machines that finish sooner can process more users.
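As a trivial sketch of that chunked, pull-based split (my illustration; the chunk size, user IDs and refresh_user placeholder are assumptions):

    from concurrent.futures import ThreadPoolExecutor

    def chunks(seq, size=1000):
        # Yield successive fixed-size batches of user IDs.
        for i in range(0, len(seq), size):
            yield seq[i:i + size]

    def refresh_user(user_id):
        # Placeholder: call the four social-network APIs and write the results to RDS.
        pass

    def process_chunk(user_ids):
        for user_id in user_ids:
            refresh_user(user_id)

    user_ids = list(range(100_000))          # placeholder user IDs
    with ThreadPoolExecutor(max_workers=8) as pool:
        # Each worker grabs the next 1000-user chunk as soon as it finishes one,
        # which is the same idea as machines requesting chunks when they are done.
        list(pool.map(process_chunk, chunks(user_ids)))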
BUT there is an added bonus in using an MR framework (such as Hadoop), even if you only use one phase (map only): it does the error handling for you (nodes failing, jobs failing, ...) and it takes care of distributing the input among the nodes.
I'm not sure if MR is worth all the trouble to set up; it depends on your previous experience - YMMV.
If my understanding is correct: should this application be implemented with MapReduce, all the processing would be done in the map phase, and reduce might simply output the map phase's results.
So if I were to implement this, I would just divide the job across multiple EC2 instances, with each instance processing a given range of records in your SQL data. This assumes that you have a good idea of how to divide the data among the instances.
The advantage is that you don't have to pay for Elastic MapReduce, and you avoid any possible MapReduce overhead.
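A back-of-the-envelope way to derive those per-instance ranges (a sketch; the ID bounds and instance count are placeholders):

    def id_ranges(min_id, max_id, num_instances):
        # Split [min_id, max_id] into num_instances contiguous ranges,
        # one per EC2 instance (e.g. passed in as a startup argument).
        total = max_id - min_id + 1
        step = -(-total // num_instances)    # ceiling division
        for i in range(num_instances):
            lo = min_id + i * step
            hi = min(lo + step - 1, max_id)
            yield (lo, hi)

    print(list(id_ranges(1, 100_000, 4)))    # placeholder bounds, 4 instances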