I would like to crawl 3 million web pages in a day. Because web content varies (HTML, PDF, etc.), I need to use Selenium, Playwright, or similar. I noticed that to use Selenium with Google Dataflow, one has to build a custom container.
Is it a good choice to use Selenium inside ParDo Fns? Can we use a single instance of Selenium across multiple instances?
Does the same apply to Playwright? Should I build a custom image?
You can do anything in a Python DoFn that you can do from Python. Yes, I would definitely use custom containers for complex dependencies like this.
You can share instances of Selenium (or any other object) per DoFn instance by initializing it in your setup method. You can share it for the whole process by using a module-level global or something like shared (noting that it may be accessed by more than one thread at once).
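A minimal sketch of the two sharing patterns described above, written with the standard library only so it runs without Beam installed. `FakeBrowser`, `get_browser`, and `CrawlFn` are hypothetical names used for illustration; in a real pipeline `CrawlFn` would subclass `apache_beam.DoFn` and the factory would start an actual Selenium or Playwright driver.

```python
import threading


class FakeBrowser:
    """Hypothetical stand-in for a Selenium or Playwright driver."""

    def __init__(self):
        self._lock = threading.Lock()

    def fetch(self, url):
        # Serialize access: a process-wide object may be used by several threads.
        with self._lock:
            return f"<html>{url}</html>"


_browser = None                    # module-level global, one per process
_browser_lock = threading.Lock()


def get_browser(factory=FakeBrowser):
    """Return the single browser for this process, creating it on first use."""
    global _browser
    with _browser_lock:
        if _browser is None:
            _browser = factory()
        return _browser


class CrawlFn:  # assumption: would subclass apache_beam.DoFn in a real pipeline
    def setup(self):
        # Runs once per DoFn instance, before any elements are processed.
        self.browser = get_browser()

    def process(self, url):
        yield url, self.browser.fetch(url)
```

If each DoFn instance should own its own browser instead, `setup` could call the factory directly rather than going through `get_browser`.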
Related
The use case is this: we need to pull in data from a third-party service and update the database with fresh records every week. The options I have explored so far are
either creating a custom django-admin command
or running a background task using Celery (and probably ELK for logging)
I just want to know which way is more feasible and simpler, and whether there's another way that I can explore. What I want is to monitor the task for the first few runs and then just rely on the logs.
I am used to Google Cloud SQL, where you can connect to a database outside of GAE. Is something like this possible for the GAE datastore, using the Python NDB interface ideally ?
Basically, my use-case is I want to run acceptance tests that pre-populate and clean a datastore.
It looks like the current options are a JSON API or protocol buffers (in beta). If so, it's kind of a pain, because then I can't use my NDB models to populate the data but have to reimplement them for the tests, and worry that they haven't been saved to the datastore in exactly the same way as if through the application.
Just checking I'm not missing something....
PS. yes I know about remote_api_shell, I don't want a shell though. I guess piping commands into it is one way, but ugghh ...
Cloud Datastore can be accessed via client libraries outside of App Engine. They run on the "v1 API" which just went GA (August 16, 2016) after a few years in Beta.
The Client Libraries are available for Python, Java, Go, Node.js, Ruby, and there is even .NET.
As a note, the GQL language variant supported in DB/NDB is a bit different from what the Cloud Datastore service itself supports via the v1 API. The NDB Client Library does some of its own custom parsing that can split certain queries into multiple ones to send to the service, combining the results client-side.
Take a read of our GQL reference docs.
Short answer: they're working on it. Details in google-cloud-datastore#2 and gcloud-python#40.
I'm building a simple website with Django that requires constant monitoring of text-based data from another website; that's the way it has to be.
How could I run this service on my web host using Django? Would I have to start a separate app and run it via SSH so that it updates the database used by Django, or is there an easier/better way?
You could use celery to schedule a job that would read data from that other website and do whatever you want with it.
As an alternative to celery, you could also create a cron job that executes a custom django-admin command. That would give you full access to your django install and ORM. The downside is that cron's smallest time resolution is 1 minute, so if you need it to be real-time, you're not going to be able to do that.
If you do need realtime, then creating a python daemon might be a better option.
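A minimal sketch of what the core loop of such a daemon might look like, assuming simple polling is acceptable. The function name `poll` and its parameters are hypothetical; `fetch` would read from the other website and `handle` would write to the Django database.

```python
import time


def poll(fetch, handle, interval_s=5.0, max_iterations=None, sleep=time.sleep):
    """Call fetch() repeatedly and pass the result to handle() when it changes."""
    last = object()  # sentinel that never equals real data
    done = 0
    while max_iterations is None or done < max_iterations:
        data = fetch()
        if data != last:
            handle(data)  # only act on new or changed data
            last = data
        done += 1
        sleep(interval_s)
```

Passing `max_iterations` and a custom `sleep` is only there to make the loop testable; a long-running daemon would leave both at their defaults and choose an interval shorter than cron's one-minute resolution.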
How can I make Django execute something automatically at a particular time?
For example, my Django application has to FTP upload to remote servers at predefined times. The FTP server addresses, usernames, passwords, time, day and frequency are defined in a Django model.
I want to run a file upload automatically based on the values stored in the model.
One way to do it is to write a Python script and add it to the crontab. This script runs every minute and keeps an eye on the time values defined in the model.
The other thing I can roughly think of is Django signals. I'm not sure if they can handle this issue. Is there a way to generate signals at predefined times? (I haven't read in depth about them yet.)
Just for the record, there is also celery, which lets you schedule messages for future dispatch. It is, however, a different beast than cron, as it requires/uses RabbitMQ and is meant for message queues.
I have been thinking about this recently and have found django-cron which seems as though it would do what you want.
Edit: Also if you are not specifically looking for Django based solution, I have recently used scheduler.py, which is a small single file script which works well and is simple to use.
I've had really good experiences with django-chronograph.
You need to set one crontab task: to call the chronograph python management command, which then runs other custom management commands, based on an admin-tweakable schedule
The problem you're describing is best solved using cron, not Django directly. Since it seems that you need to store data about your ftp uploads in your database (using Django to access it for logs or graphs or whatever), you can make a python script that uses Django which runs via cron.
James Bennett wrote a great article on how to do this which you can read in full here: http://www.b-list.org/weblog/2007/sep/22/standalone-django-scripts/
The gist of it is that you can write standalone Django scripts that cron can launch and run periodically, and these scripts can fully utilize your Django database, models, and anything else they want to. This gives you the flexibility to run whatever code you need and populate your database, while not trying to make Django do something it wasn't meant to do (Django is a web framework, and is event-driven, not time-driven).
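A minimal sketch of the check such a cron-launched script might perform each minute. `UploadSchedule` is a hypothetical plain-Python stand-in for the Django model from the question (its field names are assumptions), and `due_uploads` only decides what is due; the actual FTP upload is left out.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class UploadSchedule:
    """Hypothetical stand-in for the model row describing one upload."""
    host: str
    weekday: int  # 0 = Monday, matching datetime.weekday()
    at: time      # time of day, compared at minute resolution


def due_uploads(schedules, now):
    """Return the schedules whose weekday and minute match `now`."""
    return [
        s for s in schedules
        if s.weekday == now.weekday()
        and (s.at.hour, s.at.minute) == (now.hour, now.minute)
    ]
```

Comparing at minute resolution matches a crontab entry that runs the script once per minute; in the real script the list of schedules would come from a model query.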
Best of luck!
My company uses a lot of different web services on a daily basis. I find that I repeat the same steps over and over again every day.
For example, when I start a new project, I perform the following actions:
Create a new client & project in Liquid Planner.
Create a new client in Freshbooks
Create a project in Github or Codebasehq
Add the developers who are going to be working on this project to Codebasehq or Github
Create tasks in Ticketing system on Codebasehq and tasks in Liquid Planner
This is just when starting new projects. When I have to track tasks, it gets even trickier because I have to monitor tasks in 2 different systems.
So my question is, is there a tool that I can use to create a web service that will automate some of these interactions? Ideally, it would be something that would allow me to graphically work with the web service API and produce an executable that I can run on a server.
I don't want to build it from scratch. I know, I can do it with Python or RoR, but I don't want to get that low level.
I would like to add my sources and pass data around from one service to another. What could I use? Any suggestions?
Progress DataXtend Semantic Integrator lets you build WebServices through an Eclipse based GUI.
It is a commercial product, and I happen to work for the company that makes it. In some respects I think it might be overkill for you, as it's really an enterprise-level data mapping tool for mapping disparate data sources (web services, databases, XML files, COBOL) to a common model, as opposed to a simple web services builder, and it doesn't really support your github bits any more than normal Eclipse plugins would.
That said, I do believe there are Mantis plugins for github to do task tracking, and I know there's a git plugin for Eclipse that works really well (jgit).
Couldn't you simply use Selenium to execute some of these tasks for you? Basically, as long as you can do something from the browser, Selenium will be able to do it too. Selenium comes with a language called "selenese", so you can even use it to programmatically create an "API" with your tasks.
I know this is a different approach to what you're originally looking for, but I've been using selenium for a number of tasks, and found it's even good to execute ANT tasks or unit tests.
Hope this helps you
What about Apache Camel?
Camel lets you create the Enterprise Integration Patterns to implement routing and mediation rules in either a Java based Domain Specific Language (or Fluent API), via Spring based Xml Configuration files or via the Scala DSL. This means you get smart completion of routing rules in your IDE whether in your Java, Scala or XML editor.
Apache Camel uses URIs so that it can easily work directly with any kind of Transport or messaging model such as HTTP, ActiveMQ, JMS, JBI, SCA, MINA or CXF Bus API together with working with pluggable Data Format options.