For a project I'm trying to get data from Goodreads.com that is only accessible when you're logged in. I'm new to Jsoup, since I'm using it only for this particular project. Getting the relevant data from a page is not a problem, but I can't seem to reach the particular page I need. The page I'm trying to access is viewable only when logged in; when not logged in, it redirects to the log-in page.
I've looked through the answers here, but the answers given so far have not helped.
What I have now:
String url = "http://www.goodreads.com/friend/user/7493379-judith";

Connection.Response res = Jsoup.connect("http://www.goodreads.com/user/sign_in")
        .data("email", "MYEMAIL", "user_password", "MYPASSWORD")
        .method(Connection.Method.POST)
        .execute();

Document doc2 = res.parse();
String sessionId = res.cookie("_session_id");

Document doc = Jsoup.connect(url)
        .cookie("_session_id", sessionId)
        .get();
I got this far with the help of the answers here, but it doesn't work; I'm still only getting the data from the log-in page it redirects to.
I have several questions:
Most importantly, of course: how can I make it work?
The answers given here have used .method(Method.POST) instead of .method(Connection.Method.POST). When I use the former, however, I get an error that Method cannot be resolved. Does anyone know why?
The examples I've seen use "username" and "password" in .data(). What exactly do these refer to? I've now used the name of the input box. Is it the name, the type, the id, what exactly? Since Goodreads does not refer to the login as a username but as an e-mail, I assume I have to change them. (username & password doesn't work either.)
Examples also use http://example.com/login.php as the example URL. Goodreads doesn't have a /login.php page, though. Am I correct in assuming I have to use the URL of the log-in screen?
_session_id is the name of the relevant cookie on Goodreads.
I'd be very grateful if anyone can point me in the right direction!
1. Look carefully at what data is posted on login:
user[email]:email@email
remember_me:on
user[password]:plain_password
n:667387
So your POST must send exactly the same keys.
2. Make sure you use the right import: import org.jsoup.Connection.Method; but Connection.Method.POST is still fine.
3. See point 1.
4. Yes, you are correct.
5. What is the question?
Goodreads requires two things when logging in: first, that you have a session ID stored in a cookie, and second, that you send along a randomly generated number. You can get both by first visiting the login page without logging in: it will set a cookie with a session ID, and the form will contain a hidden input (i.e. <input type="hidden" name="n" value="...">) whose value is a number. Save these and pass them along when logging in, as a cookie and a form value respectively.
Some remarks about the way I found this out:
The first thing you need to realise is that you're trying to recreate the exact same requests your browser does with Jsoup. So, in order to check whether what you have right now will work, you can try to recreate the exact same situation with your browser.
To recreate your code, I went to the login page, deleted all my Goodreads cookies (since you don't send along any cookies with your login request either), and attempted to sign in passing only the username and password form values. It gave an error that my session had timed out. When I first loaded the login page, deleted all cookies except the session ID, and did not remove the "n" form value, I could log in successfully. Therefore, you want to make a plain GET request to the sign-in page first, retrieve the session ID cookie and the hidden form value you get there, and pass them along with the POST request.
It could be that the API changed, or that there are simply several ways. Using Connection.Method.POST will do fine in any case.
Yes, they refer to the names of the input boxes. Strictly speaking, it is the name attribute that is submitted as form data; since many websites give their inputs the same id and name, either usually matches what you see in the page source.
If you look at the source code of the sign-in form, you can see that the "action" attribute of the form element is indeed the sign-in page itself, so that's where it sends the request.
PS. As a general tip, you can use the Firefox extension "Tamper Data" to remove form data or even cookies (though there are easier extensions for that).
You can log in with this code:
public static void main(String[] args) throws Exception {
    Connection.Response execute = Jsoup
            .connect("https://www.goodreads.com/")
            .method(Connection.Method.GET).execute();
    Element sign_in = execute.parse().getElementById("sign_in");
    String authenticityToken = sign_in.select("input[name=authenticity_token]").first().val();
    String n = sign_in.select("input[name=n]").first().val();
    Document document = Jsoup.connect("https://www.goodreads.com/user/sign_in")
            .data("cookieexists", "✓")
            .data("authenticity_token", authenticityToken)
            .data("user[email]", "user@email.com")
            .data("user[password]", "password")
            .data("remember_me", "on")
            .data("n", n)
            .cookies(execute.cookies())
            .post();
}
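Note that .post() returns only the parsed Document. If you then need to fetch the logged-in-only friends page from the question, you would presumably repeat the same pattern: use .method(Connection.Method.POST).execute() instead of .post() so you can read the response cookies, and pass those cookies along with the follow-up Jsoup.connect(url) request.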
Hello, I am trying to get page-level insights and post-level insights in the same request, but I can't seem to get the syntax correct.
page id /published_posts?fields=permalink_url,created_time,message,shares,reactions.limit(0).summary(1),comments.limit(0).summary(1),insights.metric(post_reactions_by_type_total,post_impressions_unique,page_posts_impressions_organic)&since=yesterday
This is my request for now, but I want to add page insights like page_fans and page_fans_city.
How can I do that?
You are already using the published_posts endpoint there, so you cannot go back “up” to the page object from it. You need to rewrite the whole thing so that you use the page ID itself as the base endpoint, and then request everything else via the fields parameter. The trick is to get the syntax and nesting right …
/page-id?fields=insights.metric(page_fans,page_fans_city),published_posts{…}
should work; inside the {…} you then put all the fields you originally requested from the published_posts endpoint, so
/page-id?fields=insights.metric(page_fans,page_fans_city),published_posts{permalink_url,
created_time,…,insights.metric(post_reactions_by_type_total,post_impressions_unique,
page_posts_impressions_organic)}
And &since=yesterday then just goes at the end again, after all that.
To have the since limitation still apply on the post level, it apparently needs to be added on that “field” again, syntax similar to .metric():
?fields=…,published_posts.since(yesterday){…}
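If it helps, here is a rough sketch of that combined request using Python's requests library; the page ID, access token, and Graph API version are placeholders, and the field string is just the nesting described above:
import requests

# Placeholders: substitute your own page ID, token, and preferred API version
PAGE_ID = "your-page-id"
ACCESS_TOKEN = "your-page-access-token"

fields = (
    "insights.metric(page_fans,page_fans_city),"
    "published_posts.since(yesterday)"
    "{permalink_url,created_time,message,shares,"
    "reactions.limit(0).summary(1),comments.limit(0).summary(1),"
    "insights.metric(post_reactions_by_type_total,post_impressions_unique,"
    "page_posts_impressions_organic)}"
)

resp = requests.get(
    "https://graph.facebook.com/v2.12/" + PAGE_ID,
    params={"fields": fields, "access_token": ACCESS_TOKEN},
)
print(resp.json())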
I need to fetch all URLs from this page:
http://www.questdiagnostics.com/testcenter/BUSearch.action?submitValue=BUSearch&keyword=Toxoplasma+Abs+IgG+%2F+IgM
whenever I select a value from the drop-down and click the Go button.
I can select a value from the drop-down options using XPath, but I am not able to click the Go button.
My code is:
import requests
import lxml.html

req = requests.get('http://www.questdiagnostics.com/testcenter/BUSearch.action?submitValue=BUSearch&keyword=Toxoplasma+Abs+IgG+%2F+IgM')
hdoc = lxml.html.fromstring(req.content)
hdoc.xpath('//select[@id="labs"]/option/text()')
How to get all links without using selenium?
Normal Use Case
lxml is a great library, and it has decent support for filling out and submitting forms, as documented here. The real challenge for this particular use case is rooted in the way the form works.
The regional laboratory select box is not part of the form; its value is submitted with a cookie instead. This makes things a little more difficult.
If this wasn't the case, you could just issue your GET, pull the form out of it, change the values you're interested in, submit it, and examine the links that come back. That script might look something like this:
req = requests.get('http://www.questdiagnostics.com/testcenter/BUSearch.action?submitValue=BUSearch&keyword=Toxoplasma+Abs+IgG+%2F+IgM')
hdoc = lxml.html.fromstring(req.content)
form = hdoc.forms[1]
# Set form inputs using `form.fields = dict(...)`
form.action = "http://www.questdiagnostics.com" + form.action
submitResult = lxml.html.parse(lxml.html.submit_form(form)).getroot()
links = submitResult.xpath('//*[@id="maincolumn"]/ol/li/a[@class="title"]/@href')
While you can add arbitrary request parameters when calling lxml.html.submit_form(), I don't see a way to add arbitrary cookies.
This Use Case
That said, since this form essentially works by redirecting back to itself (with an additional cookie to identify the lab), you could simulate this behavior by just adding the cookie to your initial GET. You might not need to mess around with a form submission at all. This script will show the first ten links for the SKB lab:
cookies = dict(TC11SelectedLabCode='SKB')
req = requests.get('http://www.questdiagnostics.com/testcenter/BUSearch.action?submitValue=BUSearch&keyword=Toxoplasma+Abs+IgG+%2F+IgM', cookies=cookies)
hdoc = lxml.html.fromstring(req.content)
links = hdoc.xpath('//*[@id="maincolumn"]/ol/li/a[@class="title"]/@href')
print(links)
You could take this a step further, and issue a GET with no cookies to obtain the list of labs, and then iterate over that list, calling requests.get() on each one, sending the appropriate TC11SelectedLabCode cookie to simulate the form submission.
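A rough sketch of that idea (the cookie name and XPaths are the ones from above; reading the lab codes from the option values of the select box is an assumption about how the page encodes them):
import requests
import lxml.html

URL = ('http://www.questdiagnostics.com/testcenter/BUSearch.action'
       '?submitValue=BUSearch&keyword=Toxoplasma+Abs+IgG+%2F+IgM')

# First GET without cookies: read the available lab codes from the select box
first = requests.get(URL)
doc = lxml.html.fromstring(first.content)
lab_codes = [v for v in doc.xpath('//select[@id="labs"]/option/@value') if v]

# Then re-issue the GET once per lab, sending the lab-selection cookie
# to simulate submitting the form for that lab
all_links = {}
for code in lab_codes:
    resp = requests.get(URL, cookies={'TC11SelectedLabCode': code})
    page = lxml.html.fromstring(resp.content)
    all_links[code] = page.xpath(
        '//*[@id="maincolumn"]/ol/li/a[@class="title"]/@href')

for code, links in all_links.items():
    print(code, len(links))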
Notes
Note that while lxml has decent form submission support, you're not actually clicking anything. There is nothing "breathing life into" the DOM.
None of the javascript on the page is running.
To illustrate why this is important, consider an example: if you wanted to verify the links on page 2 of the results, I can't say how you'd accomplish that with this approach. If your tests need to exercise javascript on the page, I think you'll need more than requests and lxml.
I am working on a Django setup where I can receive a URL containing a query string as part of a GET. I would like to be able to process the data provided in the query string and return a page that is adjusted for that data but does not contain the query string in the URL.
Ordinarily I would just use reverse(), but I am not sure how to apply it in this case. Here are the details of the situation:
Example URL: .../test/123/?list_options=1&list_options=2&list_options=3
urls.py
urlpatterns = patterns('',
    url(r'test/(?P<testrun_id>\d+)/', views.testrun, name='testrun'),
)
views.py
def testrun(request, testrun_id):
    if 'list_options' in request.GET.keys():
        lopt = request.GET.getlist('list_options')
        :
        :
        [process lopt list]
        :
        :
    :
    :
    [other processing]
    :
    :
    context = { ...stuff... }
    return render(request, 'test_tracker/testview.html', context)
When the example URL is processed, Django returns the page I want, but with the query string still on the end of the URL. The standard way of stripping off the unwanted query string would be to end the testrun view with return HttpResponseRedirect(reverse('testrun', args=(testrun_id,))). However, if I do that here, I'm going to get an infinite loop through the testrun function. Furthermore, I am unsure whether the list_options data from the original request will still be available after the redirect, given that it has been removed from the URL.
How should I work around this? I can see that it might make sense to move the parsing of the list_options variable out into a separate function to avoid the infinite recursion, but I'm afraid I will lose the list_options data from the request if I do it that way. Is there a neat way of simultaneously lopping the query string off the end of the URL and returning the page I want, in one place, so I can avoid having to separate things out into multiple functions?
EDIT: A little bit of extra background, since there have been a couple of "Why would you want to do this?" queries.
The website I'm designing is to report on the results of various tests of the software I'm working on. This particular page is for reporting on the results of a single test, and often I will link to it from a bigger list of tests.
The list_options array is a way of specifying the other tests in the list I have just come from. This allows me to populate a drop-down menu with other relevant tests to allow me to easily switch between them.
As such, I could easily end up passing in 15-20 different values and creating huge URLs, which I'd like to avoid. The page is designed to have a default set of other tests to fill in the menu in question if I don't suggest any others in the URL, so it's not a big deal if I remove the list_options. If the user wishes to come back to the page directly he won't care about the other tests in the list, so it's not a problem if that information is not available.
First, a word of caution. This is probably not a good idea, for various reasons:
Bookmarking. Imagine that .../link?q=bar&order=foo filters some search results and also sorts them in a particular order. If you automatically strip out the querystring, you effectively prevent users from bookmarking specific search queries.
Tests. Any time you add automation, things can and probably will go wrong in ways you never imagined. It is always better to stick with simple yet effective approaches, since they are widely used and thus less error-prone. I'll give an example of this below.
Maintenance. This is not a standard behaviour model, so it will make maintenance harder for future developers, who will first have to understand what is going on.
If you still want to achieve this, one of the simplest methods is to use sessions. The idea is that when there is a querystring, you save its contents into a session and then you retrieve it later on when there is no querystring. For example:
def testrun(request, testrun_id):
    # save the GET data
    if request.META['QUERY_STRING']:
        request.session['testrun_get'] = request.GET
        # the following will not have a querystring, hence no infinite loop
        return HttpResponseRedirect(reverse('testrun', args=(testrun_id,)))

    # there is no querystring, so retrieve it from the session;
    # however, someone could visit the url without a querystring
    # without ever visiting the querystring version first,
    # hence you have to test for it
    get_data = request.session.get('testrun_get', None)
    if get_data:
        if 'list_options' in get_data.keys():
            ...
    else:
        # do some default option
        ...

    context = { ...stuff... }
    return render(request, 'test_tracker/testview.html', context)
That should work; however, it can break rather easily, and there is no way to easily fix it. This should illustrate the second bullet above. For example, imagine a user wants to compare two search queries side by side, so he tries to visit .../link?q=bar&order=foo and .../link?q=cat&order=dog in different tabs of the same browser. So far so good, because each page will initially show the correct results; but as soon as the user refreshes the first tab, he will get the results from the second tab, since that is what is currently stored in the session, and the browser has a single session cookie shared by both tabs.
Even if you find some other method to achieve what you want without using sessions, I imagine you will encounter similar issues, because HTTP is stateless, so you have to store state on the server.
There is actually a way to do this without breaking much functionality: store state on the client instead of the server. You would have a URL without a querystring and then let JavaScript query some API for whatever you need to display on that page. That, however, forces you to build some sort of API and use some JavaScript, which does not exactly fall within the scope of your question. So it is possible to do this cleanly, but it involves more than just Django.
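A minimal sketch of what such an endpoint could look like (the view name and the get_related_tests() helper are made up; the JavaScript that calls it is omitted):
# views.py: hypothetical JSON endpoint that the page's JavaScript could query
import json
from django.http import HttpResponse

def testrun_options(request, testrun_id):
    # get_related_tests() is a placeholder for however you look up
    # the other tests that should populate the drop-down menu
    related = get_related_tests(testrun_id)
    return HttpResponse(json.dumps({'list_options': related}),
                        content_type='application/json')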
I'm developing an Android app that uses the Places API to retrieve information and display it on a map. The initial request to retrieve the places fails with an ACCESS_DENIED status message from the HTTP request. Below is the code that I used to generate the request:
try {
    HttpRequestFactory httpRequestFactory = createRequestFactory(HTTP_TRANSPORT);
    HttpRequest request = httpRequestFactory
            .buildGetRequest(new GenericUrl(PLACES_SEARCH_URL));
    request.getUrl().put("key", API_KEY);
    request.getUrl().put("location", _latitude + "," + _longitude);
    request.getUrl().put("radius", _radius); // in meters
    request.getUrl().put("sensor", "false");
    if (types != null)
        request.getUrl().put("types", types);
    PlacesList list = request.execute().parseAs(PlacesList.class);
    // Check log cat for places response status
    Log.d("Places Status", "" + list.status);
    return list;
In another Stack Overflow posting, someone suggested that the poster try the following to test their key:
Go to the api console here, then to SERVICES. Click Active services
tab and verify 'Places API' is turned ON. Click on the ? "try" link
next to it. It should create a proper URL with your key which should
work. Compare the link that you are trying against this URL for
differences.
I followed these instructions. Based on the results I received when I clicked on the ? to "try" the link, I suspect something is fundamentally wrong with the API key, independent of the code; otherwise I would expect a success status rather than REQUEST_DENIED:
{
"html_attributions" : [],
"results" : [],
"status" : "REQUEST_DENIED"
}
I obtained my key by entering the SHA1 fingerprint of my debug certificate (which I obtained using keytool with all the appropriate parameters, e.g. androiddebugkey, ...debug.keystore), followed by a ";" and the package name of the app.
Not sure what the problem is...I'm sure it's something simple but I'm not seeing it and I'm stuck. Thoughts?
I never received a response to this posting, so ultimately I resolved the problem by creating a brand new key under a new project name, and I was at least able to retrieve Places from Google. I'm still having issues populating maps, but that could be a code issue.
I noticed that the key I was using that gave me the ACCESS_DENIED results had a title of "Key for Android apps (with certificates)" and a label "Android apps:" listed just under the actual key; its value is the SHA1 fingerprint, followed by a ";" and the package name. The key I created under a new project name (Places API), which ultimately worked, had a title of "Key for browser apps (with referers)", a label of "Referers:", and a value of "Any referer allowed".
So there is definitely something different about these two keys. I'm not sure what I did differently when I generated the keys. I'd like to understand what I did to generate these two "different" types of keys so that I and perhaps others won't repeat my "mistake(s)".
There are many references to creating keys in the Google documentation. The fact that there are so many postings about problems with keys tells me the documentation is not very clear; otherwise, so many issues on this topic wouldn't exist.
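For what it's worth, the Places web service behind this is just an HTTP endpoint, so a browser/server key can be sanity-checked outside the app. A rough Python sketch (the key and coordinates are placeholders; the sensor parameter reflects the API as it was at the time):
import requests

API_KEY = "YOUR_BROWSER_OR_SERVER_KEY"  # placeholder

resp = requests.get(
    "https://maps.googleapis.com/maps/api/place/nearbysearch/json",
    params={
        "location": "40.714353,-74.005973",  # example coordinates
        "radius": 500,                        # metres
        "sensor": "false",
        "key": API_KEY,
    },
)
# A working key returns status "OK"; a rejected one returns "REQUEST_DENIED"
print(resp.json().get("status"))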
I have 500 or so spambots and about 5 actual registered users on my wiki. I have used nuke to delete their pages but they just keep reposting. I have spambot registration under control using reCaptcha. Now, I just need a way to delete/block/merge about 500 users at once.
You could just delete the accounts from the user table manually, or at least disable their authentication info with a query such as:
UPDATE /*_*/user SET
user_password = '',
user_newpassword = '',
user_email = '',
user_token = ''
WHERE
/* condition to select the users you want to nuke */
(Replace /*_*/ with your $wgDBprefix, if any. Oh, and do make a backup first.)
Wiping out the user_password and user_newpassword fields prevents the user from logging in. Also wiping out user_email prevents them from requesting a new password via email, and wiping out user_token drops any active sessions they may have.
Update: Since I first posted this, I've had further experience of cleaning up large numbers of spam users and content from a MediaWiki installation. I've documented the method I used (which basically involves first deleting the users from the database, then wiping out up all the now-orphaned revisions, and finally running rebuildall.php to fix the link tables) in this answer on Webmasters Stack Exchange.
Alternatively, you might also find Extension:RegexBlock useful:
"RegexBlock is an extension that adds special page with the interface for blocking, viewing and unblocking user names and IP addresses using regular expressions."
There are risks involved in applying the solution in the accepted answer. The approach may damage your database! It incompletely removes users, doing nothing to preserve referential integrity, and will almost certainly cause display errors.
A much better solution is presented here (a prerequisite is that you have the UserMerge extension installed):
I have a slightly awkward work-around to accomplish the bulk merge. I hope someone finds it useful! (You need a little string-concatenation skill in spreadsheets; or you can use a Python or similar script; or use a text editor with bulk replacement features.)
Prepare a list of all spam user IDs and store them in a spreadsheet or text file. The list may be prepared from the user creation logs. If you have DB access, the wiki user table can be imported into a local list.
The POST method used for submitting the Merge & Delete User form (by clicking the button) should be converted to a GET method. This will give us a long URL. See the second comment (by Matthew Simoneau, dated 13/Jan/2009) at
http://www.mathworks.com/matlabcentral/newsreader/view_thread/242300
for the method.
The resulting URL string should be something like below:
http://(Your Wiki domain)/Special:UserMerge?olduser=(OldUserNameHere)&newuser=(NewUserNameHere)&deleteuser=1&token=0d30d8b4033a9a523b9574ccf73abad8%2B\
Now, divide this URL into four sections:
A: http://(Your Wiki domain)/Special:UserMerge?olduser=
B: (OldUserNameHere)
C: &newuser=(NewUserNameHere)&deleteuser=1
D: &token=0d30d8b4033a9a523b9574ccf73abad8%2B\
Now, using a text editor or spreadsheet, prefix each spam user ID with part A and suffix each with parts C and D. Part C includes the new user (a specially created single dummy user ID). Part D, the token string, is a session-dependent token that changes per user per session, so you will need to get a new token every time a new session/batch of work is required.
With the above step, you should get a long list of URLs, each good for one Merge & Delete operation for one user. We can now create a simple HTML file, view it, and use a batch downloader like DownThemAll in Firefox.
Add two more pieces to each line, <a href=" at the beginning and ">Linktext</a> at the end, so that each URL becomes a link. Also add <html><body> at the top and </body></html> at the bottom, and save the file as (for example) userlist.html.
Open the file in Firefox, use DownThemAll add-on and download all the files! Effectively, you are visiting the Merge&Delete page for
each user and clicking the button!
Although this might look like a lengthy and tricky job at first, once you follow this method you can remove tens of thousands of users without much manual effort.
You can verify if the operation is going well by opening some of the
downloaded html files (or by looking through the recent changes in
another window).
One advantage is that it does not directly edit the MySQL tables, nor does it require direct database access.
I did a bit of rewriting to the quoted text, since the original text contains some flaws.
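Since the quoted steps note that a script can replace the spreadsheet work, here is a rough Python sketch of generating the userlist.html described above (the wiki domain, dummy target user, token, and file names are all placeholders you would substitute yourself):
# Build userlist.html from a plain-text list of spam user names (one per line).
# WIKI_DOMAIN, NEW_USER and TOKEN are placeholders; the token must be copied
# from a current Special:UserMerge session, as described above.
import urllib.parse

WIKI_DOMAIN = "http://your.wiki.example"
NEW_USER = "SpamDump"
TOKEN = "paste-current-session-token-here"

with open("spam_users.txt") as f:
    spam_users = [line.strip() for line in f if line.strip()]

links = []
for old_user in spam_users:
    query = urllib.parse.urlencode({
        "olduser": old_user,
        "newuser": NEW_USER,
        "deleteuser": "1",
        "token": TOKEN,
    })
    links.append('<a href="{}/Special:UserMerge?{}">{}</a>'.format(
        WIKI_DOMAIN, query, old_user))

with open("userlist.html", "w") as out:
    out.write("<html><body>\n" + "\n".join(links) + "\n</body></html>\n")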