Keep track of visited URLs - itsy crawler - Clojure

I'm making an application in Clojure and I'm using the itsy crawler to crawl a specific site.
Is it possible to run the crawler for some time, stop the whole application, and then, when I start the application again, skip the already visited URLs?

From looking at the source, itsy does not provide a built-in mechanism for saving the current state of the crawler. However, the current state is accessible in the result of the crawl function, under the :state key.
You could serialize the values in the :seen-urls atom and the :queued-urls queue when your application exits, and then deserialize them when you start it again. It looks like you would have to add your saved values after running the crawl function, to make sure everything is initialized correctly.
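itsy itself is Clojure, but the persistence idea is language-agnostic: dump the seen-URL set and the pending queue to disk on shutdown, then pre-load them before crawling again. A minimal sketch of that shape in Python (the file name and data layout are illustrative assumptions, not itsy's API):

```python
# Illustrative only: persist a crawler's "seen" set and pending queue between runs.
import json

STATE_FILE = "crawler-state.json"  # hypothetical location


def save_state(seen_urls, queued_urls):
    """Write the seen-URL set and pending queue to disk before shutting down."""
    with open(STATE_FILE, "w") as f:
        json.dump({"seen": sorted(seen_urls), "queued": list(queued_urls)}, f)


def load_state():
    """Read the saved state back, or start fresh if nothing was saved yet."""
    try:
        with open(STATE_FILE) as f:
            data = json.load(f)
    except FileNotFoundError:
        return set(), []
    return set(data["seen"]), list(data["queued"])
```

In the Clojure application the equivalent would be writing the dereferenced :seen-urls atom and the queue contents out to a file (EDN, for example) on exit, then reading them back into the crawler's state after calling crawl.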

Related

I am wondering if it is normal for AWS EMR to send a lot of list and head requests for S3 model files

I am using the AWS EMR cluster service.
Machine learning tasks such as Spark builds run on the cluster and reference model files stored in an S3 bucket.
The jobs issue a large number of HEAD and LIST requests against S3, and I am wondering whether this is normal for AWS EMR.
Symptom: the EMR cluster makes roughly 2.7 million HEAD and LIST requests per day to S3.
A lot of list/head requests get sent.
This is related to how directories are emulated by the Hadoop/Spark/Hive S3 clients; every time a process checks whether there's a directory at a path, it will issue a LIST request, possibly preceded by a HEAD request (to see if it's a file).
Then there's the listing of the contents, more LIST requests, and finally reading the files. There'll be one HEAD request on every open() call to verify the file exists and to determine how long it is.
Files are read with GET requests. Every time there's a seek()/buffered read on the input stream and the data isn't already in a buffer, the client has to do one of:
read to the end of the current ranged GET (assuming it is a ranged GET), discarding the data, then issue a new ranged GET
abort() the HTTPS connection and negotiate a new one. Slow.
Overall, then, there is a lot of I/O, especially if the application is inefficient about caching the output of directory listings and whether files exist, or does needless checks before operations (if (fs.exists(path)) fs.delete(path, false)) and the like.
If this is your code, try not to do that.
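As an illustration, here is roughly what that redundant probe-then-delete pattern looks like from PySpark; the bucket/path and the use of the py4j gateway (sc._jvm, sc._jsc) are assumptions for the sketch, not something taken from the question:

```python
# Hypothetical sketch: exists() costs an extra HEAD/LIST round trip against S3
# before every delete; delete() alone already returns False for a missing path.
sc = spark.sparkContext  # assumes an existing SparkSession named `spark`
hadoop_conf = sc._jsc.hadoopConfiguration()

Path = sc._jvm.org.apache.hadoop.fs.Path
path = Path("s3a://example-bucket/models/checkpoint")  # hypothetical path
fs = path.getFileSystem(hadoop_conf)

# Anti-pattern: one extra round trip per call.
if fs.exists(path):
    fs.delete(path, False)

# Cheaper: just call delete(); it is a no-op when the path is absent.
fs.delete(path, False)
```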
(Disclaimer: this is all guesswork based on the experience of tuning the open source Hive/Spark apps to work through the S3A connector. I'm assuming the same applies to EMR.)

Run next request with new iteration data from json by using setNextRequest in postman

I have a case where a folder in a collection has to run for a specific set of iterations before moving on to the next folder.
There is no direct feature in Postman yet that allows you to set different iteration counts at each folder level and still run the collection as a whole.
I want to run a specific request in a folder with all the test data I provide in a JSON file, and then move to the next request as per the collection's folder order.
I tried achieving this using setNextRequest in the request that has to run multiple times with different data.
However, it runs the same data again on the second pass, and in the runner it moves on to iteration 2, which was set to 2 at the collection level.
How can I achieve my use case?

Migrate ColdFusion scheduled tasks using neo-cron.xml

We currently have two ColdFusion 10 dedicated servers which we are migrating to a single VPS server. We have many scheduled tasks on each. I have taken each of the neo-cron.xml files and copied the var XML elements, from within the struct type='coldfusion.server.ConfigMap' XML element, and pasted them within that element in the neo-cron.xml file on the new server. Afterward I restarted the ColdFusion service, logged into CF Admin, and the tasks all showed as expected.
My problem is, when I try to update any of the tasks I get the following error when saving:
An error occured scheduling the task. Unable to store Job :
'SERVERSCHEDULETASK#$%^DEFAULT.job_MAKE CATALOGS (SITE CONTROL)',
because one already exists with this identification
Also, when I try to delete a task it tells me a task with that name does not exist. So it seems to me that the task information must also be stored somewhere else. When I try to update a task, the record doesn't exist in that secondary location, so ColdFusion tries to add it as new to the neo-cron.xml file, which causes an error because it already exists there. And when I try to delete, it doesn't exist in the secondary location, so it says a task with that name does not exist. That is just a guess, though.
Any ideas how I can get this to work without manually re-creating dozens of tasks? From what I've read this should work, but I need to be able to edit the tasks.
Thank you.
After a lot of hair-pulling I was able to figure out the problem. It all boiled down to having parentheses in the scheduled task names. That was causing both the "Unable to store Job : 'SERVERSCHEDULETASK#$%^DEFAULT.job_MAKE CATALOGS (SITE CONTROL)', because one already exists with this identification" error and the inability to delete jobs. I believe it has something to do with how the parentheses are encoded, because the actual neo-cron.xml name attribute of the var element encodes the name like so:
serverscheduletask#$%^default#$%^MAKE CATALOGS (SITE CONTROL)
Note that this anomaly did not exist on ColdFusion 10, Update 10, but does exist on Update 13. I'm not sure which update broke it, but there you go.
You will have to copy the neo-cron.xml from C:\ColdFusion10\lib on one server to the other. After that, restart the server to make the changes effective, then log into the CF Admin and check the functionality.
This should work.
Note: please take a backup of the existing neo-cron.xml before making the changes.

Updating a hit counter when an image is accessed in Django

I am working on doing some simple analytics on a Django website (v1.4.1). Seeing as this data will be gathered on pretty much every server request, I figured the right way to do this would be with a piece of custom middleware.
One important metric for the site is how often given images are accessed. Since each image is its own object, I thought about using django-hitcount, but figured that was unnecessary for what I was trying to do. If it proves easier, I may use it though.
The current conundrum I face is that I don't want to query the database and look for a given object for every HttpRequest that occurs. Instead, I would like to wait until a successful response (indicated by an HttpResponse.status_code of 200 or whatever), and then query the database and update a hit field for the corresponding image. The reason is that the only way to access the path of the image is in process_request, while the only way to access the status code is in process_response.
So, what do I do? Is it as simple as creating a class variable that can hold the path and then looking up the file once a response code of 200 is returned, or should I just use django-hitcount?
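For reference, a minimal sketch of that middleware approach, assuming a hypothetical Image model with path and hits fields and the old-style MIDDLEWARE_CLASSES used by Django 1.4; stashing the path on the request object (rather than a class variable) keeps it scoped to the current request:

```python
# Hypothetical sketch; `Image` with `path`/`hits` fields is an assumed model.
from django.db.models import F

from myapp.models import Image  # hypothetical app/model


class ImageHitMiddleware(object):
    def process_request(self, request):
        # Remember the requested path for this request/response cycle.
        request._hit_path = request.path
        return None  # continue normal processing

    def process_response(self, request, response):
        path = getattr(request, "_hit_path", None)
        if path is not None and response.status_code == 200:
            # Single UPDATE using an F() expression; no read-modify-write race.
            Image.objects.filter(path=path).update(hits=F("hits") + 1)
        return response
```

A class variable would be shared across requests (and threads), so attaching the value to the request object is a safer way to carry it from process_request to process_response.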
Thanks for your help
Set up a cron task to parse your Apache/Nginx/whatever access logs on a regular basis, perhaps with something like pylogsparser.
You could use memcache to store the counters and then periodically persist them to the database. There are risks that memcache will evict the value before it's been persisted but this could be acceptable to you.
This article provides more information and highlights a risk arising when using hosted memcache with keys distributed over multiple servers. http://bjk5.com/post/36567537399/dangers-of-using-memcache-counters-for-a-b-tests
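A rough sketch of that counter pattern with Django's cache framework; the key naming, the model, and the periodic flush job are assumptions added for illustration:

```python
# Hypothetical sketch: count hits in the cache, persist them periodically.
from django.core.cache import cache
from django.db.models import F

from myapp.models import Image  # hypothetical model with a `hits` field


def record_hit(image_id):
    key = "image-hits-%s" % image_id
    cache.add(key, 0)   # no-op if the key already exists
    cache.incr(key)


def flush_hits(image_ids):
    """Run from a cron job or management command to persist the counters."""
    for image_id in image_ids:
        key = "image-hits-%s" % image_id
        count = cache.get(key) or 0
        if count:
            Image.objects.filter(pk=image_id).update(hits=F("hits") + count)
            cache.decr(key, count)  # keep any hits that arrived while flushing
```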

Best way to work with temp images in Django?

I'm developing a Django project where I need to serve temporary images, which are generated online. The sessions should be anonymous; anyone should be able to use the service. The images should be destroyed when the session expires or closes.
I'm not sure, however, what the best approach is. For instance, I could use file-based sessions and just have the images generated in the session folder, where they would (or at least should) be destroyed with the session. I suppose I could do something similar with database sessions, maybe saving the images in the database or just removing them when the session ends; however, the file-based solution sounds more reliable to me.
Is it a good solution, or are there more solid alternatives?
I'd name the temporary images based on a hash of the session key and then create a management command that:
makes a list containing potential temp filename hashes for all the current sessions.
grabs a list of all the current filenames in your temporary directory
deletes filenames which don't have a matching entry in the hash list
Since there's no failsafe way to know whether a session has "closed", you should run the cleanup management command first - either before this one, or implicitly as part of this new command by using the call_command() function.
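A sketch of such a command, assuming database-backed sessions, temp images stored in a single directory, and filenames of the form <sha1-of-session-key>.png; the directory, naming scheme, and hash choice are all assumptions for illustration:

```python
# Hypothetical cleanup command: remove temp images whose session no longer exists.
import hashlib
import os

from django.conf import settings
from django.contrib.sessions.models import Session
from django.core.management import call_command
from django.core.management.base import BaseCommand

TEMP_IMAGE_DIR = os.path.join(settings.MEDIA_ROOT, "tmp-images")  # assumed location


class Command(BaseCommand):
    help = "Delete temporary images that no longer belong to a live session."

    def handle(self, *args, **options):
        # Purge expired sessions first, as suggested above
        # (newer Django versions name this command "clearsessions").
        call_command("cleanup")

        # Hashes that files belonging to a *current* session would carry.
        live_hashes = set(
            hashlib.sha1(key.encode("utf-8")).hexdigest()
            for key in Session.objects.values_list("session_key", flat=True)
        )

        for filename in os.listdir(TEMP_IMAGE_DIR):
            stem, _ext = os.path.splitext(filename)
            if stem not in live_hashes:
                os.remove(os.path.join(TEMP_IMAGE_DIR, filename))
```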