I want to use Curl to download and parse data from this website:
http://xetra.com/xetra/dispatch/en/xetraCSV/navigation/xetra/100_market_structure_instruments/100_instruments/100_all_tradable_instruments/
I have used my Curl code on different websites before and it works without issue but this site is different in that it returns a redirect response with an actual link containing the data.
I enabled this setting:
curl_easy_setopt(m_pCurl, CURLOPT_FOLLOWLOCATION, 1L);
but I get caught in an infinite loop of redirects filling the log file.
To avoid this, I then parsed the initial HTTP response myself to get the redirect location and attempted the download using that link. However, curl reports that the headers and body are empty (CURLE_GOT_NOTHING) and my code throws. When I visit the link in a browser I can see the data loading, so I know there is something there; curl just doesn't seem to be able to see it.
Any help on this issue would be greatly appreciated.
Many thanks,
pma07pg
Many thanks to Captain Giraffe for this answer!
If you have a redirect link and need to store the cookies, then add these options:
curl_easy_setopt(m_pCurl, CURLOPT_MAXREDIRS, 5L);  // Stop redirecting ad infinitum
curl_easy_setopt(m_pCurl, CURLOPT_COOKIEFILE, ""); // An empty name just enables the in-memory cookie engine
You need the JSESSIONID cookie to not get redirected.
Add the cookie you receive on the first request (302 Found) to your headers, repeat the request, et voilà.
Sample dealing with libcurl Cookies here
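Putting those pieces together, here is a minimal sketch; the write callback name is my own, and the URL is the one from the question:

#include <curl/curl.h>
#include <string>

// Append each chunk of the response body to a std::string.
static size_t writeBody(char* data, size_t size, size_t nmemb, void* userp)
{
    static_cast<std::string*>(userp)->append(data, size * nmemb);
    return size * nmemb;
}

int main()
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL* curl = curl_easy_init();
    std::string body;

    curl_easy_setopt(curl, CURLOPT_URL,
        "http://xetra.com/xetra/dispatch/en/xetraCSV/navigation/xetra/"
        "100_market_structure_instruments/100_instruments/100_all_tradable_instruments/");
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);  // follow the redirect
    curl_easy_setopt(curl, CURLOPT_MAXREDIRS, 5L);       // but not forever
    curl_easy_setopt(curl, CURLOPT_COOKIEFILE, "");      // keep JSESSIONID across the hops
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, writeBody);
    curl_easy_setopt(curl, CURLOPT_WRITEDATA, &body);

    CURLcode res = curl_easy_perform(curl);
    // On success, `body` now holds the instrument data.

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return res == CURLE_OK ? 0 : 1;
}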
I'm trying to perform a file upload operation (which is done using multiple HTTP POST requests). Hence I need to save the cookies from the response to the first HTTP POST and set those cookies on the request of the second HTTP POST. I save cookies using CURLINFO_COOKIELIST and set them manually using CURLOPT_COOKIELIST.
CURLcode result = curl_easy_setopt(curlHandle, CURLOPT_COOKIELIST, my_cookies.c_str());
This works only if I set the cookies on the same curlHandle. If I close the handles and create new ones after each request, it fails.
Is it not possible to use the CURLOPT_COOKIELIST option on different curl handles to execute multiple HTTP requests in the same session?
Any help is much appreciated.
Update:
I'm trying to save and set the cookies like this. Is there anything wrong I might be doing?
#include <sstream>  // for splitting the saved cookie lines

std::string my_cookies;
// Setting other options using curl_easy_setopt ...
// Start the cookie engine (an empty file name reads no file)
curl_easy_setopt(curlHandle, CURLOPT_COOKIEFILE, "");
// Restore saved cookies, one line per CURLOPT_COOKIELIST call;
// the option expects a char*, not a std::string
std::istringstream saved(my_cookies);
for (std::string line; std::getline(saved, line); )
    if (!line.empty())
        curl_easy_setopt(curlHandle, CURLOPT_COOKIELIST, line.c_str());
curl_easy_perform(curlHandle);
// Save cookies from the response of the first HTTP POST
struct curl_slist* cookies = nullptr;
curl_easy_getinfo(curlHandle, CURLINFO_COOKIELIST, &cookies);
// Copy the cookies back to my_cookies, one Netscape-format line each
my_cookies.clear();
for (struct curl_slist* c = cookies; c; c = c->next)
    my_cookies.append(c->data).append("\n");
curl_slist_free_all(cookies);
There's nothing in an extracted cookie list that binds it to that particular easy handle, so yes, it can be moved over and inserted into another handle.
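As a minimal sketch of that transfer (the helper name is mine; both handles need the cookie engine enabled with CURLOPT_COOKIEFILE, ""):

#include <curl/curl.h>

// Pull every cookie known to `from` and replay it into `to`,
// one Netscape-format line per CURLOPT_COOKIELIST call.
static void moveCookies(CURL* from, CURL* to)
{
    struct curl_slist* cookies = nullptr;
    curl_easy_getinfo(from, CURLINFO_COOKIELIST, &cookies);
    for (struct curl_slist* c = cookies; c; c = c->next)
        curl_easy_setopt(to, CURLOPT_COOKIELIST, c->data);
    curl_slist_free_all(cookies);
}

Perform the first POST on the old handle, call moveCookies(oldHandle, newHandle), then perform the second POST on the new handle.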
How to get secured cookie from curl after authentication?
curl_easy_getinfo(curl_handler, CURLINFO_COOKIELIST, &cookies);
fetched only one cookie; the secured cookie wasn't fetched.
Same with
curl_easy_setopt(curl_handler, CURLOPT_COOKIEJAR, "cookie.txt");
However, in Java we could use a CookieManager for the login, and after all the operations, iterating over the CookieManager showed two cookies: "Cookie" and "_WL_AUTHCOOKIE_JSESSIONID".
In curl I am not able to fetch "_WL_AUTHCOOKIE_JSESSIONID".
Any help would be appreciated.
First, curl should get the same set of cookies that any other HTTP client gets.
Unfortunately, that is only a should, as servers sometimes act differently depending on which client they think they are speaking to, and thus may respond differently. Also, since you're comparing with another client, it is possible that the Java version did some additional HTTP requests that made it receive the second cookie your curl request doesn't.
To minimize the risk of all this, make sure the requests are as similar as possible, so that the server cannot spot a difference between your clients; then it should respond identically and you will get the same set of cookies in both cases.
When the curl based client gets both cookies, you can extract them fine with CURLINFO_COOKIELIST just as you want.
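For the diagnosis, a small sketch that logs the full header traffic (CURLOPT_VERBOSE) so you can compare it against the Java client, then dumps every cookie the engine holds; the URL and form fields are placeholders, not your actual site:

#include <curl/curl.h>
#include <cstdio>

int main()
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL* curl = curl_easy_init();

    curl_easy_setopt(curl, CURLOPT_URL, "https://example.com/login");     // placeholder
    curl_easy_setopt(curl, CURLOPT_POSTFIELDS, "user=alice&pass=secret"); // placeholder
    curl_easy_setopt(curl, CURLOPT_COOKIEFILE, "");  // enable the cookie engine
    curl_easy_setopt(curl, CURLOPT_VERBOSE, 1L);     // log headers for comparison

    if (curl_easy_perform(curl) == CURLE_OK) {
        struct curl_slist* cookies = nullptr;
        curl_easy_getinfo(curl, CURLINFO_COOKIELIST, &cookies);
        for (struct curl_slist* c = cookies; c; c = c->next)
            std::printf("%s\n", c->data);  // secure cookies included, if the server sent them
        curl_slist_free_all(cookies);
    }

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}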
I'm trying to implement an answer from another question on this site:
Detect when browser receives file download
I've followed all of the steps and everything works up to the point where I try to retrieve the cookie. When I use Firebug I can see the cookie that I created in the response headers, along with a cookie that was created earlier in the app by JavaScript.
The info in Firebug for the two cookies is:
name: earlierCookie, value: 1234, Domain: localhost, Path: /, Expires: Session, HttpOnly: false
name: cookieFromServer, value: 5678, Domain: localhost, Path: /resource/upload/file, Expires: Session, HttpOnly: false
So, you can see that the cookies are in the same domain (they have different paths). When looking at document.cookie, only earlierCookie is present.
Why can I see cookieFromServer in Firebug and not in document.cookie?
Also, please tell me if I need to post more info.
I figured this out on my own. The problem is the path. Setting the path to / from the server allows the cookie to show up in document.cookie. I have no idea why this is and can't find good resources explaining it; presumably the browser only exposes a cookie to document.cookie on pages whose URL path matches the cookie's Path attribute, and this page was not under /resource/upload/file.
I'm programming with WinInet functions in C++ but I came across a problem.
My program opens a URL with the HttpOpenRequest(), HttpSendRequest(), InternetReadFile()... functions and saves the output data. I need to save the URL along with the output data, but in some cases the server gives me 301 Moved and InternetReadFile() reads the file from the new address.
This is OK, but I need to find out what that new address is. I tried to use HttpQueryInfo() with HTTP_QUERY_RAW_HEADERS_CRLF but I didn't obtain this info, only Content-Type, Cache-Control, cookies, etc. When I use HTTP_QUERY_CONTENT_LOCATION or something similar I get ERROR_HTTP_HEADER_NOT_FOUND.
Can you help me?
After WinInet receives a redirect response, by default it automatically sends a new HTTP request to the new URL. By the time WinInet is ready for you to start reading data with InternetReadFile(), the headers that are available belong to the last URL requested, which may not be the URL you originally requested; that is why you are not seeing a Location header. To process the headers of a redirect response yourself, specify the INTERNET_FLAG_NO_AUTO_REDIRECT flag when calling HttpOpenRequest(). You can then use HttpQueryInfo() to detect a redirect status code and read its Location header before requesting the new URL being redirected to.
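A minimal sketch of that approach (error handling omitted; hConnect comes from an earlier InternetConnect() call and the path is a placeholder):

#include <windows.h>
#include <wininet.h>

HINTERNET hRequest = HttpOpenRequestW(hConnect, L"GET", L"/some/path",
                                      NULL, NULL, NULL,
                                      INTERNET_FLAG_NO_AUTO_REDIRECT, 0);
HttpSendRequestW(hRequest, NULL, 0, NULL, 0);

// Read the numeric status code.
DWORD status = 0, size = sizeof(status);
HttpQueryInfoW(hRequest, HTTP_QUERY_STATUS_CODE | HTTP_QUERY_FLAG_NUMBER,
               &status, &size, NULL);

if (status == 301 || status == 302) {
    // The Location header holds the URL being redirected to.
    WCHAR location[2048];
    DWORD locSize = sizeof(location);
    HttpQueryInfoW(hRequest, HTTP_QUERY_LOCATION, location, &locSize, NULL);
    // Save `location`, then issue a new request for it.
}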
When a redirect happens automatically in WinInet, you can get the redirect URL by using an InternetStatusCallback function. Status code INTERNET_STATUS_REDIRECT (110) supplies a buffer with the new URL to the callback function. Use InternetSetStatusCallback() on the HINTERNET handle to set a callback function for the request.
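A sketch of the callback route, assuming an ANSI build (in a Unicode build the buffer is a wide string); note that WinInet suppresses callbacks on handles whose context value is zero, so pass a non-zero dwContext to HttpOpenRequest():

#include <windows.h>
#include <wininet.h>
#include <cstdio>

// Invoked by WinInet as the request progresses.
static void CALLBACK OnStatus(HINTERNET hInternet, DWORD_PTR dwContext,
                              DWORD dwInternetStatus,
                              LPVOID lpvStatusInformation,
                              DWORD dwStatusInformationLength)
{
    if (dwInternetStatus == INTERNET_STATUS_REDIRECT) {
        // For this status the buffer holds the new URL as a string.
        std::printf("Redirected to: %s\n",
                    static_cast<const char*>(lpvStatusInformation));
    }
}

// Install on the request handle before HttpSendRequest():
InternetSetStatusCallback(hRequest, OnStatus);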
I'm stuck on a cookie-related question. I want to write a program that can automatically download the attachments of a forum, so I need to maintain the cookies the site sends me. When I send a GET request from my program to the login page, I get a cookie such as Set-Cookie: sso_sid=0589a967; domain=.it168.com. Now if I use a cookie viewer such as Cookie Monster and send the same GET request, my program gets the same result, but the cookie viewer shows that the site also sent me two more cookies, which are:
testcookie (screenshot: http://get2know.it/myimages/2009-12-27_072438.jpg) and token (screenshot: http://get2know.it/myimages/2009-12-27_072442.jpg)
My question is: where did those two cookies come from? Why did they not show up in my program?
Thanks.
Your best bet for figuring out screen-scraping problems like this one is to use Fiddler. With Fiddler you can compare exactly what goes over the wire in your app vs. when accessing the site from a browser. I suspect you'll see some difference between the headers sent by your app and the headers sent by the browser; this will likely account for the difference you're seeing.
Next, you can do one of two things:
change your app to send exactly the headers that the browser does (if you do this, you should get exactly the response that a real browser gets); see the sketch after this list.
using Fiddler's "request builder" feature, start removing headers one by one and re-issuing the request. At some point, you'll remove a header which makes the response not match the response you're looking for. That means that header is required. Continue for all other headers until you have a list of headers that are required by the site to yield the response you want.
Personally, I like option #2 since it requires a minimum amount of header-setting code, although it's harder initially to figure out which headers the site requires.
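If you go with option #1 and your client uses libcurl like the earlier questions here, replaying the browser's headers looks roughly like this; the header values are examples, so substitute whatever Fiddler actually captured:

#include <curl/curl.h>

// `curl` is an existing easy handle; these header values are illustrative.
struct curl_slist* headers = nullptr;
headers = curl_slist_append(headers, "User-Agent: Mozilla/5.0 (Windows NT 10.0)");
headers = curl_slist_append(headers, "Accept: text/html,application/xhtml+xml");
headers = curl_slist_append(headers, "Accept-Language: en-US,en;q=0.9");

curl_easy_setopt(curl, CURLOPT_HTTPHEADER, headers);
curl_easy_perform(curl);
curl_slist_free_all(headers);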
On your actual question of why you're seeing 2 cookies, only the diagnosis above will tell you for sure, but I suspect it has to do with the mechanism some sites use to detect clients who don't accept cookies. On the first request in a session, many sites will "probe" a client to see if it accepts cookies. Typically they'll do this (a client-side sketch follows the list):
if the request doesn't have a cookie on it, the site will redirect the client to a special "cookie setting" URL.
The redirect response, in addition to having a Location: header which does the redirect, will also return a Set-Cookie header to set the cookie. The redirect will typically contain the original URL as a query string parameter.
The server-side handler for the "cookie setter" page will then look at the incoming cookie. If it's blank, this means that the user's browser is set to not accept cookies, and the site will typically redirect the user to a "sorry, you must use cookies to use this site" page.
If, however, there is a cookie header sent to the "cookie setter" URL, then the client does in fact accept cookies, and the handler will simply redirect the client back to the original URL.
The original URL, once you move on to the next page, may add an additional cookie (e.g. for a login token).
Anyway, that's one way you could end up with two cookies. Only diagnosis with Fiddler (or a similar tool) will tell you for sure, though.
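To watch that probe from the client side (again assuming libcurl), one hedged sketch: perform the first request without following redirects, then ask where the redirect points and which cookie was set. The URL is a placeholder built from the domain in the question:

#include <curl/curl.h>
#include <cstdio>

int main()
{
    curl_global_init(CURL_GLOBAL_DEFAULT);
    CURL* curl = curl_easy_init();

    // First hop only: don't follow, so the probe is visible by hand.
    curl_easy_setopt(curl, CURLOPT_URL, "http://www.it168.com/");  // placeholder
    curl_easy_setopt(curl, CURLOPT_COOKIEFILE, "");  // record any Set-Cookie
    curl_easy_perform(curl);

    // Where the "cookie setter" redirect points, if the response was a redirect.
    char* redirect = nullptr;
    curl_easy_getinfo(curl, CURLINFO_REDIRECT_URL, &redirect);
    if (redirect)
        std::printf("Probe redirects to: %s\n", redirect);

    // Whatever cookie the first hop set.
    struct curl_slist* cookies = nullptr;
    curl_easy_getinfo(curl, CURLINFO_COOKIELIST, &cookies);
    for (struct curl_slist* c = cookies; c; c = c->next)
        std::printf("Got cookie: %s\n", c->data);
    curl_slist_free_all(cookies);

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return 0;
}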