How to solve 403 error in scrapy - python-2.7

I'm new to Scrapy and I created a Scrapy project to scrape data.
I'm trying to scrape data from the website, but I'm getting the following error logs:
2016-08-29 14:07:57 [scrapy] INFO: Enabled item pipelines:
[]
2016-08-29 13:55:03 [scrapy] INFO: Spider opened
2016-08-29 13:55:03 [scrapy] INFO: Crawled 0 pages (at 0 pages/min),scraped 0 items (at 0 items/min)
2016-08-29 13:55:04 [scrapy] DEBUG: Crawled (403) <GET http://www.justdial.com/robots.txt> (referer: None)
2016-08-29 13:55:04 [scrapy] DEBUG: Crawled (403) <GET http://www.justdial.com/Mumbai/small-business> (referer: None)
2016-08-29 13:55:04 [scrapy] DEBUG: Ignoring response <403 http://www.justdial.com/Mumbai/small-business>: HTTP status code is not handled or not allowed
2016-08-29 13:55:04 [scrapy] INFO: Closing spider (finished)
When I run the following commands in the browser's web console I get a response, but when I use the same XPath inside my Python script I get the error described above.
Commands on web console:
$x('//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingR0"]/h4/span/a/text()')
$x('//div[@class="col-sm-5 col-xs-8 store-details sp-detail paddingR0"]/p[@class="contact-info"]/span/a/text()')
Please help me.
Thanks

As Avihoo Mamka mentioned in the comments, you need to provide some extra request headers to avoid being rejected by this website.
In this case it seems to just be the User-Agent header. By default Scrapy identifies itself with the user agent "Scrapy/{version} (+http://scrapy.org)". Some websites might reject this for one reason or another.
To avoid this, just set the headers parameter of your Request with a common user agent string:
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
yield Request(url, headers=headers)
You can find a huge list of user agents here, though you should stick with popular web browser ones like Firefox, Chrome, etc. for the best results.
You can implement it to work with your spider's start_urls too:
import scrapy
from scrapy import Request

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = (
        'http://scrapy.org',
    )

    def start_requests(self):
        headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
        for url in self.start_urls:
            yield Request(url, headers=headers)

Add the following line to your settings.py file. This works well if you are combining Selenium with Scrapy:
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'}
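Note that, as far as I know, a bare headers dict in settings.py is only used if your own code (e.g. the Selenium glue) imports it. If you instead want Scrapy itself to send the user agent on every request, a minimal sketch using the built-in USER_AGENT and DEFAULT_REQUEST_HEADERS settings would be:
# settings.py -- read by Scrapy itself, no manual import needed
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'

# optionally, other headers to send with every request:
DEFAULT_REQUEST_HEADERS = {
    'Accept-Language': 'en',
}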

I just needed to get my shell to work and run some quick tests, so Granitosaurus's solution was a bit overkill for me.
I literally just went to settings.py, where you'll find that mostly everything is commented out. Around line 16-17 or so you'll find something like this...
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'exercise01part01 (+http://www.yourdomain.com)'
You just need to uncomment it and replace it with any user agent like 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'.
You can find a list of them here: https://www.useragentstring.com/pages/useragentstring.php
So it'll look something like this...
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0'
You'll definitely want to rotate user agents if you want to make a large-scale crawler. But I just needed to get my scrapy shell to work and make some quick tests without getting that pesky 403 error, so this one-liner sufficed. It was nice because I did not need to make a fancy function or anything.
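If you do end up needing rotation, a minimal sketch of a downloader middleware that picks a random user agent per request could look like this (the class name and the list of strings are illustrative, not part of the original answer):
# middlewares.py -- illustrative sketch, not from the original answer
import random

USER_AGENTS = [
    'Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36',
]

class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # set a random User-Agent on every outgoing request
        request.headers['User-Agent'] = random.choice(USER_AGENTS)

# settings.py -- enable it (400 is a typical downloader-middleware priority):
# DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.RandomUserAgentMiddleware': 400}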
Happy scrapy-ing
Note: PLEASE make sure you are in the same directory as settings.py when you run scrapy shell in order to utilize the changes you just made. It does not work if you are in a parent directory.

Here is what the whole process of solving the error could look like:
You can find a huge list of user-agents at https://www.useragentstring.com/pages/useragentstring.php, though you should stick with popular web-browser ones like Firefox, Chrome etc. for the best results (find more at How to solve 403 error in scrapy).
An example of the steps that worked for me on Windows 10 in the scrapy shell follows:
https://www.useragentstring.com/pages/useragentstring.php -> choose one link from BROWSERS (but you can also try a link from CRAWLERS, ...) ->
e.g. Chrome = https://www.useragentstring.com/pages/Chrome/ -> choose one of the lines, e.g.:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36 -> choose one part (text that belongs together) from that line, e.g.: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) ->
Command Prompt -> go into the project folder -> scrapy shell
from scrapy import Request
req = Request('https://www.whiskyshop.com/scotch-whisky?item_availability=In+Stock', headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6)'})
fetch(req)
Now, the result should be 200.
You see that it works even though I am on Windows 10 and there is Macintosh in Request().
You can also use the previous steps to add a chosen header to the "settings.py" file.
Note 1: Comments in the following Stack Overflow pages are also more or less related (and I used them for this example):
https://stackoverflow.com/questions/52196040/scrapy-shell-and-scrapyrt-got-403-but-scrapy-crawl-works,
https://stackoverflow.com/questions/16627227/problem-http-error-403-in-python-3-web-scraping,
https://stackoverflow.com/questions/37010524/set-headers-for-scrapy-shell-request
Note 2: I also recommend reading, e.g.:
https://scrapeops.io/web-scraping-playbook/403-forbidden-error-web-scraping/
https://scrapeops.io/python-scrapy-playbook/scrapy-managing-user-agents/
https://www.simplified.guide/scrapy/change-user-agent

Related

Including additional measurements with Telegraf plugin inputs.logparser using "grok" patterns (Or regex)

I am using the telegraf plugin [[inputs.logparser]] to grab the access_log data from Apache based on a local web page I have running.
Using the ["%{COMBINED_LOG_FORMAT}"] patterns, I am able to retrieve the default measurements provided by the access_logs, including http_version, request, resp_bytes etc.
I have appended the "Log Format" within the httpd.conf file so that each request the access_log records includes the additional "Response time" with %D at the end; this has been successful when I look at the access_log after implementing it.
However, I have so far been unable to tell Telegraf to acknowledge this new measurement with inputs.logparser - I am using a Grafana dashboard with InfluxDB to monitor this data and it has not yet appeared as an additional measurement.
So far I have attempted the following:
The first [[inputs.logparser]] section remains the same throughout my attempts and is always present/active; this seems right in order to be able to obtain the default measurements?
######## default logparser using COMBINED to obtain default access_log measurements ######
# Stream and parse log file(s).
[[inputs.logparser]]
  files = ["/var/log/httpd/access_log"]
  from_beginning = true
  ## Parse logstash-style "grok" patterns:
  [inputs.logparser.grok]
    patterns = ["%{COMBINED_LOG_FORMAT}"]
    measurement = "apache_access_log"
    custom_patterns = '''
    '''
Attempt 1 at matching the response time appended to access_log:
############# Grok/RegEx for matching response time ######################
# Stream and parse log file(s).
[[inputs.logparser]]
  ## Log files to parse.
  files = ["/var/log/httpd/access_log"]
  from_beginning = true
  ## Parse logstash-style "grok" patterns:
  [inputs.logparser.grok]
    patterns = ["%{METRICS_INCLUDE_RESPONSE}"]
    measurement = "apache_access_log"
    custom_patterns = '''
      METRICS_INCLUDE_RESPONSE [%{NUMBER:resp}]
    '''
For my 2nd attempt I thought I'd try normal regular expressions:
############# Grok/RegEx for matching response time ######################
# Stream and parse log file(s).
[[inputs.logparser]]
  ## Log files to parse.
  files = ["/var/log/httpd/access_log"]
  from_beginning = true
  ## Parse logstash-style "grok" patterns:
  [inputs.logparser.grok]
    patterns = ["%{METRICS_INCLUDE_RESPONSE}"]
    measurement = "apache_access_log"
    custom_patterns = '''
      METRICS_INCLUDE_RESPONSE [%([0-9]{1,3})]
    '''
After both of these attempts the default measurements are still recorded and grabbed fine by Telegraf, but the response time does not appear as an additional measurement.
I believe the issue is the syntax within my custom grok pattern, and that it is not matching as I intended because I am not telling it to pull the correct information? But I am unsure.
I have provided an example of the access_log output below. ALL details are pulled by Telegraf without issue under COMBINED_LOG_FORMAT, except for the number at the end, which represents the response time.
10.30.20.32 - - [09/Jan/2020:11:08:14 +0000] "POST /404.php HTTP/1.1" 200 252 "http://10.30.10.77/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36" 600
10.30.20.32 - - [09/Jan/2020:11:08:15 +0000] "POST /boop.html HTTP/1.1" 200 76 "http://10.30.10.77/404.php" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36" 472
You are essentially extending a pre-defined pattern, so the pattern should be written like this (assuming your response time value is within square brackets in the log):
######## default logparser using COMBINED to obtain default access_log measurements ######
# Stream and parse log file(s).
[[inputs.logparser]]
  files = ["/var/log/httpd/access_log"]
  from_beginning = true
  ## Parse logstash-style "grok" patterns:
  [inputs.logparser.grok]
    patterns = ["%{COMBINED_LOG_FORMAT} \\[%{NUMBER:responseTime:float}\\]"]
    measurement = "apache_access_log"
    custom_patterns = '''
    '''
You will get the response time value in a metric named 'responseTime' in float data type.
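Note that in the sample log lines quoted in the question the trailing response time appears to be a bare number with no square brackets, so it may be worth sanity-checking the tail of a real log line before settling on a pattern. A rough Python sketch (the regexes are plain approximations, not the exact grok definitions):
import re

# last sample access_log line from the question (User-Agent shortened)
line = '10.30.20.32 - - [09/Jan/2020:11:08:15 +0000] "POST /boop.html HTTP/1.1" 200 76 "http://10.30.10.77/404.php" "Mozilla/5.0 ..." 472'

bracketed = re.search(r'\[(\d+(?:\.\d+)?)\]\s*$', line)  # a [600]-style bracketed tail
bare = re.search(r'(\d+(?:\.\d+)?)\s*$', line)           # a bare trailing number

print('bracketed: %s' % (bracketed.group(1) if bracketed else None))  # None for this sample
print('bare: %s' % (bare.group(1) if bare else None))                 # 472
If the bare form is what your log actually produces, the fragment appended to COMBINED_LOG_FORMAT would need to match that shape instead of a bracketed one.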

Internal Server Error when I try to use HTTPS protocol for traefik backend

My setup is ELB --https--> traefik --https--> service
I get back a 500 Internal Server Error from traefik on every request. It doesn't appear the request ever makes it to the service. The service is running Apache with access logging and I see no incoming requests logged. I am able to curl the service directly and receive an expected response. Both traefik and the service are running in Docker containers. I am also able to use port 80 all the way through with success, and I can use https to traefik and port 80 to the service; I get an error from Apache, but the request does go all the way through.
traefik.toml
logLevel = "DEBUG"
RootCAs = [ "/etc/certs/ca.pem" ]
#InsecureSkipVerify = true
defaultEntryPoints = ["https"]
[entryPoints]
[entryPoints.https]
address = ":443"
[entryPoints.https.tls]
[[entryPoints.https.tls.certificates]]
certFile = "/etc/certs/cert.pem"
keyFile = "/etc/certs/key.pem"
[entryPoints.http]
address = ":80"
[web]
address = ":8080"
[traefikLog]
[accessLog]
[consulCatalog]
endpoint = "127.0.0.1:8500"
domain = "consul.localhost"
exposedByDefault = false
prefix = "traefik"
The tags used for the consul service:
"traefik.enable=true",
"traefik.protocol=https",
"traefik.frontend.passHostHeader=true",
"traefik.frontend.redirect.entryPoint=https",
"traefik.frontend.entryPoints=https",
"traefik.frontend.rule=Host:hostname"
The debug output from traefik for each request:
time="2018-04-08T02:46:36Z"
level=debug
msg="vulcand/oxy/roundrobin/rr: begin ServeHttp on request"
Request="{"Method":"GET","URL":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":""},"Proto":"HTTP/1.1","ProtoMajor":1,"ProtoMinor":1,"Header":{"Accept":["text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"],"Accept-Encoding":["gzip, deflate, br"],"Accept-Language":["en-US,en;q=0.9"],"Cache-Control":["max-age=0"],"Cookie":["__utmc=80117009; PHPSESSID=64c928bgf265fgqdqqbgdbuqso; _ga=GA1.2.573328135.1514428072; messagesUtk=d353002175524322ac26ff221d1e80a6; __hstc=27968611.cbdd9ce39324304b461d515d0a8f4cb0.1523037648547.1523037648547.1523037648547.1; __hssrc=1; hubspotutk=cbdd9ce39324304b461d515d0a8f4cb0; __utmz=80117009.1523037658.5.2.utmcsr=|utmccn=(referral)|utmcmd=referral|utmcct=/; __utma=80117009.573328135.1514428072.1523037658.1523128344.6"],"Upgrade-Insecure-Requests":["1"],"User-Agent":["Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.81 Safari/537.36"],"X-Amzn-Trace-Id":["Root=1-5ac982a8-b9615451a35258e3fd2a825d"],"X-Forwarded-For":["76.105.255.147"],"X-Forwarded-Port":["443"],"X-Forwarded-Proto":["https"]},"ContentLength":0,"TransferEncoding":null,"Host”:”hostname”,”Form":null,"PostForm":null,"MultipartForm":null,"Trailer":null,"RemoteAddr":"10.200.20.130:4880","RequestURI":"/","TLS":null}"
time="2018-04-08T02:46:36Z" level=debug
msg="vulcand/oxy/roundrobin/rr: Forwarding this request to URL"
Request="{"Method":"GET","URL":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/","RawPath":"","ForceQuery":false,"RawQuery":"","Fragment":""},"Proto":"HTTP/1.1","ProtoMajor":1,"ProtoMinor":1,"Header":{"Accept":["text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8"],"Accept-Encoding":["gzip, deflate, br"],"Accept-Language":["en-US,en;q=0.9"],"Cache-Control":["max-age=0"],"Cookie":["__utmc=80117009; PHPSESSID=64c928bgf265fgqdqqbgdbuqso; _ga=GA1.2.573328135.1514428072; messagesUtk=d353002175524322ac26ff221d1e80a6; __hstc=27968611.cbdd9ce39324304b461d515d0a8f4cb0.1523037648547.1523037648547.1523037648547.1; __hssrc=1; hubspotutk=cbdd9ce39324304b461d515d0a8f4cb0; __utmz=80117009.1523037658.5.2.utmcsr=|utmccn=(referral)|utmcmd=referral|utmcct=/; __utma=80117009.573328135.1514428072.1523037658.1523128344.6"],"Upgrade-Insecure-Requests":["1"],"User-Agent":["Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.81 Safari/537.36"],"X-Amzn-Trace-Id":["Root=1-5ac982a8-b9615451a35258e3fd2a825d"],"X-Forwarded-For":["76.105.255.147"],"X-Forwarded-Port":["443"],"X-Forwarded-Proto":["https"]},"ContentLength":0,"TransferEncoding":null,"Host”:”hostname”,”Form":null,"PostForm":null,"MultipartForm":null,"Trailer":null,"RemoteAddr":"10.200.20.130:4880","RequestURI":"/","TLS":null}" ForwardURL="https://10.200.115.53:443"
assume "hostname" is the correct host name. Any assistance is appreciated.
I think your problem comes from "traefik.protocol=https"; remove this tag.
You can also remove "traefik.frontend.redirect.entryPoint=https" because it's useless: this tag creates a redirection to the https entrypoint, but your frontend is already on the https entry point ("traefik.frontend.entryPoints=https").
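For reference, applying both suggestions would leave the consul tags from the question looking roughly like this (simply the original list minus the two tags discussed; whether the backend should then be reached over plain HTTP depends on your service):
"traefik.enable=true",
"traefik.frontend.passHostHeader=true",
"traefik.frontend.entryPoints=https",
"traefik.frontend.rule=Host:hostname"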

Forbidden (CSRF cookie not set.) when csrf is in header [duplicate]

This question already has an answer here:
Django bug on CRSF token
(1 answer)
Closed 5 years ago.
The request header is as below.
Accept:application/json, text/plain, */*
Accept-Encoding:gzip, deflate, br
Accept-Language:en-US,en;q=0.8
Connection:keep-alive
Content-Length:129
Content-Type:text/plain
Host:localhost:9000
Origin:http://localhost:8000
Referer:http://localhost:8000/
User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36
X-CSRFTOKEN:t5Nx0SW9haZTeOcErcBDtaq6psqBfeyuX4LRQ1WOOXq5g93tQkvcUZDGoWz8wSeD
The X-CSRFTOKEN header is there, but Django still complains about the CSRF cookie not being set. What is happening with Django?
In settings.py, the naming is perfectly correct:
CSRF_HEADER_NAME = "HTTP_X_CSRFTOKEN"
Check if CSRF_COOKIE_SECURE is set to true.
You would get such an error message if CSRF_COOKIE_SECURE is true and you access a site through http instead of https.
Or you need to use (for testing only) csrf_exempt.
For example, curtisp mentions in the comments:
I had conditional dev vs prod settings and accidentally set the dev settings to CSRF_COOKIE_SECURE = True and SESSION_COOKIE_SECURE = True.
My dev site is localhost on my laptop, and it does not have SSL.
So changing the dev settings to False fixed it for me.
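A minimal sketch of such conditional settings (using DEBUG as the dev/prod switch is an assumption; adapt it to however the environments are actually distinguished):
# settings.py -- illustrative sketch only
DEBUG = True  # True on the local dev machine, False in production

# Secure cookies are only sent over HTTPS, so keep them off for local HTTP dev
CSRF_COOKIE_SECURE = not DEBUG
SESSION_COOKIE_SECURE = not DEBUG

# unchanged from the question
CSRF_HEADER_NAME = "HTTP_X_CSRFTOKEN"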

Web crawler script will not run when accessing Internet from another source - Python

I have run into an issue where my web crawler will only run correctly when I am connected to my home Internet.
Using Python 2.7 with the Mechanize module on Windows 7.
Here are a few details about the code (snippet below) - this web crawler logs into a website, navigates through a series of links, locates a link to download a file, downloads the file, saves the file to a preset folder, then repeats the process several thousand times.
I am able to run the code successfully at home on both my wired and wireless internet. When I connect to the Internet via a different source (e.g. work, Starbucks, a neighbor's house, a mobile hotspot) the script runs but returns an error when trying to access the link to download a file:
httperror_seek_wrapper: HTTP ERROR 404: Not Found
This is what prints in the IDE when I access this site:
send: 'GET /download/8635/CLPOINT.E00.GZ HTTP/1.1\r\nHost: dl1.geocomm.com\r\nUser-Agent: Mozilla/5.0 (x11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1\r\nCookie: MSGPOPUP=1391465678; TBMSESSION=5dee7266e3dcfa0193972102c73a2543\r\nConnection: close\r\nAccept-Encoding: gzip\r\n\r\n'
reply: 'HTTP/1.1 404 Not Found\r\n'
header: Content-Type: text/html
header: Content-Length: 345
header: Connection: close
header: Date: Mon, 03 Feb 2014 22:14:44 GMT
header: Server: lighttpd/1.4.32
Simply changing back to my home internet fixes it. What confuses me is that I am not changing anything but the source of the internet - I simply disconnect from one router, connect to another, and rerun the code.
I have tried to change the browser headers using these three options:
br.addheaders = [('User-agent', 'Mozilla/5.0 (x11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.97 Safari/537.11')]
br.addheaders = [('User-agent', 'Firefox')]
I am using the Mechanize module to access the Internet and create a browser session. Here is the login code snippet and download file code snippet (where I am getting the 404 error).
import os
import time
import shutil
import mechanize
import cookielib

def websiteLogin():
    ## Logs into GeoComm website using predefined credentials (username/password hardcoded in definition)
    br = mechanize.Browser()
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)
    br.set_handle_equiv(True)
    br.set_handle_gzip(True)
    br.set_handle_redirect(True)
    br.set_handle_referer(True)
    br.set_handle_robots(False)
    br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
    br.set_debug_http(True)
    br.set_debug_redirects(True)
    br.set_debug_responses(False)
    br.addheaders = [('User-agent', 'Mozilla/5.0 (x11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
    # a br.open(<login page URL>) call is needed before select_form; the URL is omitted from this snippet
    br.select_form(nr=0)
    br.form['username'] = '**********'  ## stars replace my actual un and pw
    br.form['password'] = '**********'
    br.submit()
    return br
def downloadData(br, url, outws):
    br.open(url)
    for l in br.links(url_regex='download/[0-9]{4}'):
        fname = l.text
        outfile = os.path.join(outws, fname)
        if not os.path.exists(outfile):
            f = br.retrieve(l.absolute_url)[0]
            time.sleep(7.5)
            shutil.copy2(f, outfile)
This code does run as expected (i.e. downloads files without 404 error) on my home internet, but that is a satellite internet service and my daily download and monthly data allotments are limited - that is why I need to run this using another source of internet. I am looking for some help better understanding why the code runs one place but not another. Let me know if you require more information to help troubleshoot this.
As you can see from your GET request, your mechanize browser object is trying to get the resource /download/8635/CLPOINT.E00.GZ from the host dl1.geocomm.com.
When you try to re-check this you will get the 404 because the resource is simply not available: dl1.geocomm.com is redirected to another target.
What I'd recommend you do is start debugging your application in an appropriate way.
You could start with adding at least some debugging print statements.
def downloadData(br, url, outws):
    br.open(url)
    for l in br.links(url_regex='download/[0-9]{4}'):
        print(l.url)
After that you'll see how the output differs. Make sure to pass the url in the same way every time.
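It can also help to check where the browser actually lands after any redirects before iterating over the links; a small sketch along the same lines (mechanize's Browser.geturl() returns the URL of the currently opened document):
def downloadData(br, url, outws):
    br.open(url)
    # check whether the opened page was redirected to a different host
    print('requested: %s' % url)
    print('landed on: %s' % br.geturl())
    for l in br.links(url_regex='download/[0-9]{4}'):
        # compare the relative link with the absolute URL mechanize resolves it to
        print('%s -> %s' % (l.url, l.absolute_url))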

Is it possible to exclude specified GET parameters in apache access logs?

I need to exclude some sensitive details in my apache log, but I want to keep the log and the URIs in it. Is it possible to achieve the following in my access log:
127.0.0.1 - - [27/Feb/2012:13:18:12 +0100] "GET /api.php?param=secret HTTP/1.1" 200 7600 "http://localhost/api.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"
I want to replace "secret" with "[FILTERED]" like this:
127.0.0.1 - - [27/Feb/2012:13:18:12 +0100] "GET /api.php?param=[FILTERED] HTTP/1.1" 200 7600 "http://localhost/api.php" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11"
I know I probably should have used POST to send this variable, but the damage is already done. I've looked at http://httpd.apache.org/docs/2.4/logs.html and LogFormat, but could not find any way to use regular expressions or similar. Any suggestions?
[edit]
Do NOT send sensitive variables as GET parameters if you have the possibility to choose.
I've found one way to solve the problem. If I pipe the log output to sed, I can perform a regex replace on the output before I append it to the log file.
Example 1
CustomLog "|/bin/sed -E s/'param=[^& \t\n]*'/'param=\[FILTERED\]'/g >> /your/path/access.log" combined
Example 2
It's also possible to exclude several parameters:
exclude.sh
#!/bin/bash
# read each log line from Apache, filter the listed parameters, then emit it
while read x ; do
    result=$x
    for ARG in "$@"
    do
        cleanArg=`echo $ARG | sed -E 's|([^0-9a-zA-Z_])|\\\\\1|g'`
        result=`echo $result | sed -E s/$cleanArg'=[^& \t\n]*'/$cleanArg'=\[FILTERED\]'/g`
    done
    echo $result
done
Move the script above to the folder /opt/scripts/ or somewhere else, give the script execute rights (chmod +x exclude.sh) and modify your apache config like this:
CustomLog "|/opt/scripts/exclude.sh param param1 param2 >> /your/path/access.log" combined
Documentation
http://httpd.apache.org/docs/2.4/logs.html#piped
http://www.gnu.org/software/sed/manual/sed.html
If you want to exclude several parameters but don't want to use a script, you can use groups like this:
CustomLog "|$/bin/sed -E s/'(email|password)=[^& \t\n]*'/'\\\\\1=\[FILTERED\]'/g >> /var/log/apache2/${APACHE_LOG_FILENAME}.access.log" combined