We're currently trying to convert HTML files to PDF on App Engine using Python. The HTML files come from a third-party vendor, so we have no control over their format. Both the Flexible and Standard environments are options, but every path we go down seems to hit a roadblock:
PDFkit requires a wkhtmltopdf install; no pip package is available for the binary, although it converts perfectly offline
xhtml2pdf / PISA - works even on GAE Standard, but lacks support for many CSS features such as floats and copes poorly with badly formatted HTML
WeasyPrint - its C dependencies should in theory run on the Flexible environment, but no pip packages are available for dependencies such as Cairo and Pango
Has anyone got a robust solution running on App Engine with any of the above, or with other libraries I'm missing?
I ran into this same problem a year back and concluded that it is currently not possible on App Engine, at least not with good conversion quality. (Please point out if things have changed.)
xhtml2pdf - I was able to run it successfully on standard App Engine, but I was not at all happy with the conversion quality.
PDFkit - I hit the same roadblock and came up with a different solution: I hosted PDFkit on a Compute Engine instance and exposed an endpoint where a POST request carrying the HTML file returns the converted PDF in the response. This gave me the best results in terms of quality and processing speed.
It did incur some extra charges, but I was able to utilize the instance for something else too ;). I chose the smallest possible configuration initially, since I was not storing anything on the Compute Engine instance.
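To make the setup concrete, here is a minimal sketch of such a conversion service, assuming wkhtmltopdf is installed on the instance; the Flask framework and the /convert route are my own choices for illustration, not details from the setup above:

    # Minimal sketch of the Compute Engine conversion service described above.
    # Assumes wkhtmltopdf is installed on the instance; Flask and the /convert
    # route are illustrative choices, not details of the original setup.
    from flask import Flask, request, Response
    import pdfkit

    app = Flask(__name__)

    @app.route('/convert', methods=['POST'])
    def convert():
        html = request.get_data(as_text=True)
        # With False as the output path, pdfkit returns the PDF as bytes.
        pdf = pdfkit.from_string(html, False)
        return Response(pdf, mimetype='application/pdf')

    if __name__ == '__main__':
        app.run(host='0.0.0.0', port=8080)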
Related
I will be doing some PDF generation for my application. Currently my plan is to create HTML using templates and convert it to PDF.
The PDFs aren't long, three pages at most, and we will be generating roughly 100 documents a day.
I was happy with the results I got from chrome --headless on my local machine, calling the CLI command directly from my Clojure code. So far so good. But looking at the number of wrappers available (Browserless, Chromeless, Puppeteer, ...), I'm not sure about the scalability factor in production.
Is it safe to call the Chrome CLI directly on production boxes?
What will I miss if I skip these wrappers?
My server-side stack is Clojure/Compojure/Leiningen. Any insights/alternatives are appreciated.
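For reference, the call in question is essentially the following (sketched in Python for illustration, since the Clojure shell-out does the same thing; the flags are standard headless Chrome options and the paths are placeholders):

    # Sketch of shelling out to headless Chrome for PDF generation.
    # Python stands in for the Clojure shell-out; paths are placeholders.
    import subprocess

    def html_to_pdf(input_path, output_path):
        subprocess.run(
            [
                'google-chrome',  # or 'chromium-browser', depending on the box
                '--headless',
                '--disable-gpu',
                '--print-to-pdf=' + output_path,
                'file://' + input_path,
            ],
            check=True,
            timeout=30,  # guard against hung renders in production
        )

    html_to_pdf('/tmp/invoice.html', '/tmp/invoice.pdf')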
I'm using Athena PDF for PDF generation in combination with Clojure:
https://github.com/arachnys/athenapdf
It has a REST interface, and since it runs in Docker it's easy to scale.
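A minimal sketch of calling that REST interface from Python; the /convert endpoint and the default auth key below are from my reading of the athenapdf README, so verify them against the repo:

    # Sketch of calling the athenapdf microservice ('weaver') REST API.
    # The endpoint and default auth key follow my reading of the project
    # README; verify against the repo before relying on them.
    import requests

    resp = requests.get(
        'http://localhost:8080/convert',
        params={'auth': 'arachnys-weaver', 'url': 'https://example.com/report.html'},
        timeout=60,
    )
    resp.raise_for_status()
    with open('report.pdf', 'wb') as f:
        f.write(resp.content)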
Instead of detouring through HTML and Chrome, I'd just use a PDF-generating library such as clj-pdf. Here is a nice blog post about it.
P.S. If you don't mind running a third program to generate the PDF, I would use Emacs with org-mode (or heck, even write the whole thing in Elisp) ;)
I am running a Django app inside GCP. My idea was to call a Python script from "views.py" to run a machine learning algorithm and then display the result on the page.
But now I understand that running a machine learning library like scikit-learn on GAE is not possible (see Tim's answer here and this thread).
But suppose I still need to do this. I believe there are two possible ways, though I am not sure whether my guesses are right:
1) Since Google Datalab provides an entire Anaconda-like distribution, is there a Datalab API that can be called from a Python file in the Django app to achieve my goal?
2) Could I install the scikit-learn library on a Compute Engine instance on GCP, somehow send it a request to run my code, and then return the output to the Python file in the Django app?
I am very new to client-server and cloud computing as a whole, so please provide examples (if possible) with any suggestions or pointers.
I believe what you want is to use the App Engine Flex environment rather than the standard App Engine environment.
App Engine Flex uses a compute engine VM for running your code, so it does not have the library limitations that standard App Engine has.
Specifically, you'll need to add a requirements.txt file specifying the version of scikit-learn you want installed, and then add a 'vm: true' clause to your app.yaml file.
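Sketched out, the two files would look something like this (the scikit-learn version below is a placeholder; pin whichever one you need):

    # requirements.txt (version is a placeholder)
    scikit-learn==0.18.1

    # app.yaml -- 'vm: true' switches the app onto the Flex (Managed VM) environment
    runtime: python27
    api_version: 1
    threadsafe: true
    vm: true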
scikit-learn is now supported on ML Engine.
So another alternative is to use online prediction on Cloud ML Engine and deploy your scikit-learn model as a web service.
Here is a fully worked example of fully managed scikit-learn training, online prediction, and hyperparameter tuning:
https://github.com/GoogleCloudPlatform/training-data-analyst/blob/master/blogs/sklearn/babyweight_skl.ipynb
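Once a model is deployed, an online prediction call from Python looks roughly like this; the project name, model name, and feature vector are placeholders:

    # Sketch of calling a deployed scikit-learn model on Cloud ML Engine.
    # 'my-project' and 'my_sklearn_model' are placeholders.
    from googleapiclient import discovery

    service = discovery.build('ml', 'v1')
    name = 'projects/{}/models/{}'.format('my-project', 'my_sklearn_model')
    response = service.projects().predict(
        name=name,
        body={'instances': [[5.1, 3.5, 1.4, 0.2]]},  # one feature vector per instance
    ).execute()
    print(response['predictions'])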
I have a Foswiki wiki on a server. Is it possible to script the following without FTP access (for various reasons I can't use it):
Download a topic's wikitext, modify it locally, then upload it again (overwriting the topic)
Upload wikitext to a new topic
I've been doing these tasks manually, but I'd like to automate them. I've looked into the Foswiki API and a few plugins, but nothing seems capable of doing this.
Is there a way? (any programming language)
If you have web access, you could drive the bin/view and bin/save scripts remotely from a script.
Take a look at our BuildContrib upload target for an example. It gets a strikeone key and downloads the original topic to recover any form data, then uploads the topic text, creating a new version. It's written in Perl and uses LWP.
https://github.com/foswiki/distro/blob/master/BuildContrib/lib/Foswiki/Contrib/BuildContrib/Targets/upload.pm
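The same flow in outline, sketched in Python with requests rather than Perl/LWP; the URL, credentials, and topic are placeholders, and the field names ('text', 'validation_key') are from my recollection of the save script's interface, so check them against upload.pm above:

    # Sketch of driving Foswiki's bin/view and bin/save remotely.
    # Placeholders: URL, credentials, topic. Field names are my recollection
    # of the save script's interface; verify against upload.pm above.
    import re
    import requests

    BASE = 'https://wiki.example.org/bin'
    session = requests.Session()
    session.auth = ('WikiUser', 'password')

    # Fetch the raw topic text.
    raw = session.get(BASE + '/view/Sandbox/MyTopic', params={'raw': 'text'}).text

    # Fetch the edit page to harvest the validation (strikeone) key. Note that
    # with StrikeOne enabled the key is normally hashed by JavaScript, so a
    # non-browser client may be shown a confirmation page instead.
    edit = session.get(BASE + '/edit/Sandbox/MyTopic').text
    key = re.search(r'name="validation_key"\s+value="([^"]+)"', edit).group(1)

    # Save the modified text back, creating a new revision.
    session.post(BASE + '/save/Sandbox/MyTopic',
                 data={'text': raw + '\nAutomated edit.', 'validation_key': key})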
The following isn't(!) the right solution (surely a nicer Foswiki-native approach exists), but if you know Perl, you can do almost anything with the following:
Install Firefox
Install the MozRepl add-on into it
Install the WWW::Mechanize::Firefox Perl module
Now you can script anything you could do directly in the browser: log into Foswiki, click buttons, save topics, and so on. The drawback is that it isn't easy; you need to know many details.
I use this technique myself for testing.
I want to run image-processing algorithms on a server that can interact easily with web apps. The algorithms are compute-heavy and aren't available in pre-built libraries. Currently I am using Ruby on Rails on Heroku for my website.
What would be the best architecture to achieve this? Take images from the website, run the image-processing algorithm on them, and display the results back on the website.
Most of my image-processing code is in C/C++.
Can I call C/C++ code from Ruby on Rails directly? Is this possible on Heroku?
Or should I design a system where the C/C++ code exposes some APIs which can be called by the Ruby on Rails server?
Heroku typically uses small virtual machine instances, so depending on just how heavy your processing is, it may not be the best choice of architecture. However, if you do use it I would do this:
Use a background-task gem to do your processing, and run it in a separate process (a worker dyno rather than a web dyno, in Heroku terminology). Delayed Job is a tried and tested solution for background tasks, with a wealth of online information about integrating it with Heroku; there are also newer options like Sidekiq, which uses the threading support in modern versions of Ruby. Sidekiq's threading would let everything run inside a single dyno, but I'd keep all background processing away from the web dynos anyway, so Delayed Job (or similar) would be fine.
As for integrating C/C++, I haven't needed to do this yet. However, I know it is possible to create gems that wrap C or C++ code and compile it natively. As long as you're using MRI Ruby rather than JRuby, I don't think Heroku will have a problem with them. There are other ways of accomplishing this; look at SO questions specifically about this topic, such as:
How can I call C++ functions from within ruby
It seems that you need to create an extension, then create a gem to contain it. These links may or may not help.
http://www.rubyinside.com/how-to-create-a-ruby-extension-in-c-in-under-5-minutes-100.html
http://guides.rubygems.org/gems-with-extensions/
I recommend making a gem, as I think it may otherwise be difficult to get libraries or executables onto a Heroku instance. You can keep the gem in your vendor directory if you don't want to make it public.
Overall, I would have the webserver upload the images to S3 or wherever you're storing them. (This can be done directly from the browser with the AWS JS API, without using the webserver as a stepping stone; have a look for gems to help.)
Then the webserver can request a background task to process the image.
If you're not storing them, things become a little more interesting. You'll need a database anyway if you're using background tasks, so you could perhaps pass the image data to the worker as a blob in the database.
I really wouldn't do all the processing in the webserver dyno unless you're only hitting this thing very occasionally. With multiple users you'd hit a bottleneck very quickly.
The background process can set a flag on the image's table row so the webserver can let the user know when processing is complete. (You can poll for status with AJAX from the upload-complete screen.)
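The polling side is framework-agnostic; here is a minimal sketch of the pattern in Python (the stack above is Rails and Delayed Job, so treat this purely as an illustration, with an in-memory dict standing in for the image table):

    # Illustration of the 'flag on the image row + AJAX polling' pattern.
    # An in-memory dict stands in for the image table; the worker flips
    # 'processed' to True when it finishes.
    from flask import Flask, jsonify

    app = Flask(__name__)
    IMAGES = {1: {'processed': False}}  # stand-in for the database

    @app.route('/images/<int:image_id>/status')
    def status(image_id):
        row = IMAGES.get(image_id)
        # The upload-complete page polls this endpoint via AJAX.
        return jsonify(processed=bool(row and row['processed']))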
Of course, there are many other ways of accomplishing this, depending on a number of factors.
Apologies that the answer is vague, but the question is quite open-ended.
Good luck.
I followed the directions on http://www.allbuttonspressed.com/projects/djangoappengine, but afterwards realized that the resulting project has almost 5,000 files, mainly because of the django directory.
Am I not supposed to include Django 1.3, and should I just use the Django 1.2 built into Google App Engine? Or am I missing something? I've heard that zipimport is not a good idea with Django.
There are not many options:
Remove every unnecessary lib from the Django directory.
Use zipimport if you don't want to use the Django 1.2 provided with GAE.
Reduce the number of files you use in your project.
But note that for your instances, loading a lot of files is slower because it means many reads from the file system. A django.zip is read only once and kept in memory to be unzipped: just one read on the file system instead of 3,000 or more.
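If you do go the django.zip route, the import side is just a sys.path entry, along these lines (the zip's location is a placeholder):

    # Sketch: make a bundled django.zip importable via Python's zipimport
    # mechanism before anything imports Django.
    import os
    import sys

    sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'django.zip'))

    import django  # now resolved from django.zip via zipimport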