Scrapy how to save a State between spider runs (via scrapinghub)? - python-2.7

I have a spider that will run on a schedule. The spider input is based on a date: from the date of the last scrape to today's date. So the question is how to save the date of the last scrape within the Scrapy project? There is an option to get data from Scrapy settings using the pkgutil module, but I did not find any reference in the docs on how to write data to that file. Any idea? Maybe an alternative?
P.S. My other option is to use some free remote MySQL DB just for this. But that looks like more work if a simpler solution is available.
import json
import pkgutil

import scrapy

class CodeSpider(scrapy.Spider):
    name = "code"
    allowed_domains = ["google.com.au"]

    def start_requests(self):
        # Read the saved state from the package resource file
        f = pkgutil.get_data("au_go", "res/state.json")
        ids = json.loads(f)
        id = ids[0]['state']
        yield {'state': id}

        # Write the new state back to the resource file
        ids[0]['state'] = 'New State'
        with open('./au_go/res/state.json', 'w') as f:
            json.dump(ids, f)
The above solution works fine when run locally. But I am getting "no such file or directory" when running the code at Scrapinghub.
File "/tmp/unpacked-eggs/__main__.egg/au_go/spiders/test_state.py", line 33, in parse
with open(savePath, 'w') as f:
IOError: [Errno 2] No such file or directory: './au_go/res/state.json'

The problem was fixed by using Scrapinghub Collections
and the scrapinghub API. It works nicely now.
Here is example code in case somebody finds it useful.
from scrapinghub import ScrapinghubClient

client = ScrapinghubClient('YOUR_API_KEY')       # your API key
project = client.get_project(YOUR_PROJECT_ID)    # your numeric project ID
collections = project.collections
last_accessed = collections.get_store('last_accessed')
last_accessed.set({'_key': 'Date', 'value': '12-54-1235'})
print last_accessed.get('Date')['value']
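For context, here is a minimal sketch of how the store could be wired into the spider itself; the placeholder API key, project ID, and date-driven URL are assumptions for illustration, not part of the original answer:

import datetime

import scrapy
from scrapinghub import ScrapinghubClient

class CodeSpider(scrapy.Spider):
    name = "code"

    def start_requests(self):
        # Read the date saved by the previous run (placeholder credentials)
        store = ScrapinghubClient('YOUR_API_KEY').get_project(12345) \
            .collections.get_store('last_accessed')
        last_date = store.get('Date')['value']
        # Hypothetical date-driven URL, for illustration only
        url = 'https://example.com/data?from=%s' % last_date
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # ... scrape the page, then persist today's date for the next run
        store = ScrapinghubClient('YOUR_API_KEY').get_project(12345) \
            .collections.get_store('last_accessed')
        store.set({'_key': 'Date', 'value': str(datetime.date.today())})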

Related

Python request download a file and save to a specific directory

Hello, sorry if this question has been asked before, but I have tried a lot of the methods provided.
Basically, I want to download a file from a website; I will show my code below. The code works perfectly, but the problem is that the file is automatically downloaded to the default download folder.
My concern is to download the file and save it to a specific folder.
I'm aware we can change the browser setting, but this is a server that will be accessed remotely by different users. So the file would automatically download to their temporary /users/adam_01/download/ folder.
I want it to save to the server disk, which is C://ExcelFile/
Below is my script; some of the data has been changed because it is confidential.
import pandas as pd
import html5lib
import time
from bs4 import BeautifulSoup
import requests
import csv
from datetime import datetime
import urllib.request
import os
with requests.Session() as c:
    proxies = {"http": "http://:911"}
    url = 'https://......./login.jsp'
    USERNAME = 'mwirzonw'
    PASSWORD = 'Fiqr123'
    c.get(url, verify=False)
    csrftoken = ''
    login_data = dict(proxies, atl_token=csrftoken, os_username=USERNAME,
                      os_password=PASSWORD, next='/')
    c.post(url, data=login_data, headers={"referer": "https://.....com"})
    page = c.get('https://........s...../SearchRequest-96010.csv')
    location = 'C:/Users/..../Downloads/'
    with open('asdsad906010.csv', 'wb') as output:
        output.write(page.content)
print("Done!")
Thank you, and please ask if any of the information given is confusing.
Regards,
Fiqri
It seems from your script that you are writing the file to asdsad906010.csv. You should be able to change the output directory as follows.
# Set the output directory to your desired location
output_directory = 'C:/ExcelFile/'

# Create a file path by joining the directory name with the desired file name
file_path = os.path.join(output_directory, 'asdsad906010.csv')

# Write the file
with open(file_path, 'wb') as output:
    output.write(page.content)
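One caveat, as a sketch under the assumption that C:/ExcelFile/ may not already exist on the server: create the directory before writing, otherwise open() will fail with "No such file or directory".

import os

output_directory = 'C:/ExcelFile/'
# Create the target directory first if it is missing
if not os.path.isdir(output_directory):
    os.makedirs(output_directory)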

Python Scrapy - Run Spider

Running Python 2.7 on a Windows machine ... attempting to use Scrapy,
following the basic Scrapy tutorial at http://doc.scrapy.org/en/latest/intro/overview.html
I've created the following spider and saved it as Test2 in C:\Python27\Scrapy
import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['http://stackoverflow.com/questions?sort=votes']

    def parse(self, response):
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        yield {
            'title': response.css('h1 a::text').extract_first(),
            'votes': response.css('.question .vote-count-post::text').extract_first(),
            'body': response.css('.question .post-text').extract_first(),
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }
The next step tells me to run the spider using
scrapy runspider stackoverflow_spider.py -o top-stackoverflow-questions.json
But I have no idea where to run that line of code.
I am used to running a print or a store-to-CSV command at the end of my Python file in order to retrieve results.
I'm sure this is an easy fix, but I'm not getting it. Thanks in advance.
You will need to execute the runspider command in whatever command-line utility you are using, e.g. Cygwin, cmd, etc.
That command will create a file called top-stackoverflow-questions.json in the directory in which you run the command.
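Alternatively, since you are used to driving everything from the Python file itself, Scrapy can be run from a script with CrawlerProcess. A minimal sketch, assuming it lives in the same file as the spider class above (the FEED_FORMAT/FEED_URI settings match the older Scrapy releases current for Python 2.7):

from scrapy.crawler import CrawlerProcess

# Run the spider in-process and export items to a JSON file,
# equivalent to the runspider command above
process = CrawlerProcess({
    'FEED_FORMAT': 'json',
    'FEED_URI': 'top-stackoverflow-questions.json',
})
process.crawl(StackOverflowSpider)
process.start()  # blocks until the crawl finishes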

How can I fetch the latest log file to scrape every time I run my python script?

There is a list of json logs in my Downloads folder.
Every time I run this script I want the latest one to be fetched.
import json

json_file = '/home/Downloads/log12-41-12.json'
# Use a context manager so the file is closed automatically
with open(json_file) as json_data:
    logs = json.load(json_data)

for index in logs:
    print(index)
Code Snippet
import os

logdir = '/home/Downloads/'  # path to your log directory
# Sort the matching names; the last entry is the most recent by name
logfiles = sorted([f for f in os.listdir(logdir) if f.startswith('log')])
print "Most recent file = %s" % (logfiles[-1],)

Django: No such file or directory

I have a process that scans a tape library and looks for media that has expired, so they can be removed and reused before sending the tapes to an offsite vault. (We have some 7-day policies that never make it offsite.) This process takes around 20 minutes to run, so I didn't want it to run on-demand when loading/refreshing the page. Rather, I set up a django-cron job (I know I could have done this in Linux cron, but wanted the project to be as self-contained as possible) to run the scan and create a file in /tmp. I've verified that this works: the file exists in /tmp from this morning's execution.
The problem I'm having is that now I want to display a list of those expired (scratch) media on my web page, but the script is saying that it can't find the file. When the file was created, I used the absolute filename "/tmp/scratch.2015-11-13.out" (for example), but here's the error I get in the browser:
IOError at /
[Errno 2] No such file or directory: '/tmp/corpscratch.2015-11-13.out'
My assumption is that this is a "web root" issue, but I just can't figure it out. I tried copying the file to the /static/ and /media/ directories configured in Django, and even to the Django root directory and the project root directory, but nothing seems to work. When it says it can't find /tmp/file, where is it really looking?
import datetime

def sample():
    """ Just testing """
    today = datetime.date.today()  # format: 2015-11-13
    inputfile = "/tmp/corpscratch.%s.out" % str(today)
    with open(inputfile) as fh:  # This is the line reporting the error
        lines = [line.strip('\n') for line in fh]
    print(lines)
The print statement was used for testing in the shell (which works, I might add), but the browser gives an error.
And the file does exist:
$ ls /tmp/corpscratch.2015-11-13.out
/tmp/corpscratch.2015-11-13.out
Thanks.
Edit: I was mistaken; it doesn't work in the Python shell either. I was thinking of a previous issue.
Use this instead:
today = datetime.datetime.today().date()
inputfile = "/tmp/corpscratch.%s.out" % str(today)
Or:
today = datetime.datetime.today().strftime('%Y-%m-%d')
inputfile = "/tmp/corpscratch.%s.out" % today # No need to use str()
See the difference:
>>> str(datetime.datetime.today().date())
'2015-11-13'
>>> str(datetime.datetime.today())
'2015-11-13 15:56:19.578569'
I ended up finding this elsewhere:
today = datetime.date.today()  # format: 2015-11-13
inputfilename = "tmp/corpscratch.%s.out" % str(today)
inputfile = os.path.join(settings.PROJECT_ROOT, inputfilename)
With settings.py containing the following:
PROJECT_ROOT = os.path.abspath(os.path.dirname(__file__))
Completely resolved my issues.

Encoding is not kept

I am using Python 2.7 and the Google Plus public API to get activity data into a file. I am encountering issues maintaining the JSON encoding in my file: double quotes are coming out as u'' strings in my file. Below is my code:
from apiclient import discovery

API_KEY = 'MY API KEY'
service = discovery.build("plus", "v1", developerKey=API_KEY)
activities_resource = service.activities()
request = activities_resource.search(query='India versus South Africa', maxResults=1, orderBy='best',)

while request != None:
    activities_document = request.execute()
    if 'items' in activities_document:
        with open("output.json", mode='a') as file:
            data = str(activities_document['items'])
            file.write(data + "\n\n")
    request = service.activities().list_next(request, activities_document)
Output:
[{u'kind': u'plus#activity', u'provider': {u'title': u'Google+'}, u'titl.......
I am expecting [{"kind": "plus#activity", .....
I am running my code on Windows and have tried it both from the DOS prompt and the PyCharm IDE. I have also run the code on an Ubuntu machine, but got the same output. Please let me know what I am doing wrong.
The json module is used for generating JSON. Use it.
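For instance, a minimal change to the write step in the loop above: only the str() call is replaced, everything else stays as in the question.

import json

# ... inside the while loop, as before
if 'items' in activities_document:
    with open("output.json", mode='a') as file:
        # json.dump writes real JSON (double quotes, no u'' prefixes)
        # instead of the Python repr produced by str()
        json.dump(activities_document['items'], file)
        file.write("\n\n")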