I have a Rails application as the back end and a BackboneJS application as the front end. The BackboneJS app starts with the Rails server. Now I have the problem of search engines indexing pages of my application, which I tried to solve with PhantomJS (I use this gem). I wrote a before_filter in the application controller like this:
def phantom_response
  if params["_escaped_fragment_"].present?
    url = "#{request.base_url}/#!#{params["_escaped_fragment_"]}"
    output = Phantomjs.run('/lib/assets/get_page.js', url)
    render :text => output
  end
end
And my get_page.js looks like this:
var page = require('webpage').create();
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36';
page.settings.localToRemoteUrlAccessEnabled = true;

page.open(phantom.args[0], function (d) {
    var body = page.evaluate(function (s) {
        return document.querySelector(s).innerText;
    }, '#provider_details_tpl');
    console.log(body);
    phantom.exit();
});
The element with id "provider_details_tpl" contains a Handlebars template. But PhantomJS returns just my raw template, without compiling it. Can you help me?
I need to search Kibana logs for fields with specific content. The field is "message" and looks like this:
11.111.72.58 - - [26/Nov/2020:08:44:23 +0000] "GET /images/image.jpg HTTP/1.1" 200 123456 "https://website.com/questionnaire/uuid/result" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.14 (KHTML, like Gecko) Version/14.0.1 Safari/605.1.14" "5.158.163.231"
This field contains URIs, for example "https://website.com/questionnaire/uuid/result". How can I search for specific URIs in that field?
I need to get all logs where the field "message" contains "https://website.com/questionnaire/someUUID*/result",
or where the URI is exactly "https://website.com/".
I've tried with Lucene:
message:/https://.+/result/
Nothing found.
message:https.*\result
This finds URIs starting with "https", but it also returns URIs without "result" at the end.
message : "https://website.com/questionnaire" AND message : "result"
This works, but it would also match if "result" were not part of the URI at all and just stood alone at the end of the "message" field. I need something that really queries the URIs between the quotation marks.
Later I need to visualise the number of requests per URI with Kibana, so I think I need to use Lucene or the Query DSL.
Any ideas?
This is a good use case for the new wildcard field type (introduced in 7.9), which allows you to better search within potentially long strings.
If you declare your message field as wildcard like this:
PUT test
{
  "mappings": {
    "properties": {
      "message": {
        "type": "wildcard"
      }
    }
  }
}
And then index your documents:
PUT test/_doc/1
{
  "message": """11.111.72.58 - - [26/Nov/2020:08:44:23 +0000] "GET /images/image.jpg HTTP/1.1" 200 123456 "https://website.com/questionnaire/uuid/result" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.14 (KHTML, like Gecko) Version/14.0.1 Safari/605.1.14" "5.158.163.231"
"""
}
You can then run wildcard searches (even with leading wildcards, which are discouraged on normal keyword fields) and find your document easily:
GET test/_search
{
  "query": {
    "wildcard": {
      "message": {
        "value": "*https*uuid*"
      }
    }
  }
}
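If you also need to guarantee that "result" belongs to the quoted URI itself (the concern raised about the AND query above), you can anchor the wildcard pattern on the surrounding quotation marks. Below is a minimal sketch of that search from Python with the requests library; the localhost address is an assumption, and "test" is the index created above:

import requests

# Wildcard pattern anchored on the quotes around the URI, so "result"
# must be part of the quoted URL itself, not just anywhere in the line.
query = {
    "query": {
        "wildcard": {
            "message": {
                "value": '*"https://website.com/questionnaire/*/result"*'
            }
        }
    }
}

# Assumed local cluster; "test" is the index created above.
resp = requests.post("http://localhost:9200/test/_search", json=query)

for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"]["message"])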
I'm struggling with the Apache HUE REST API and Django CSRF.
The problem is that I can kind of log in, but the rest doesn't work: I always get redirected to the login page. It seems like the server doesn't like my csrftoken or sessionid cookie.
I have absolutely no idea why.
Here is my login code:
val accessToken = getAccessToken(Http(s"$baseUrl/accounts/login/?next=/").asString)
val response =
  Http(s"$baseUrl/accounts/login/")
    .postForm(Seq(
      "username" -> username,
      "password" -> password,
      "csrfmiddlewaretoken" -> accessToken.csrftoken.getValue,
      "next" -> "/"
    ))
    .cookie(accessToken.csrftoken)
    .asString
getAccessToken(response) // wrapper for cookies and headers from the response
Now I try just to get a page from HUE that is protected with CSRF:
def getDir(hdfsPathDirParent: String): Unit = {
  val accessToken = login()
  val response = Http(s"$baseUrl/filebrowser/view=$hdfsPathDirParent")
    .cookie(accessToken.csrftoken) // retrieved after login
    .cookie(accessToken.sessionid) // retrieved after login
    .header("X-CSRFToken", accessToken.csrftoken.getValue)
    .header("Host", "localhost:8888")
    .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9")
    .header("Connection", "keep-alive")
    .header("Sec-Fetch-Dest", "empty")
    .header("Sec-Fetch-Mode", "cors")
    .header("Sec-Fetch-Site", "same-origin")
    //.header("Sec-Fetch-User", "?1")
    .header("Upgrade-Insecure-Requests", "1")
    .header("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36")
    .header("Accept-Encoding", "gzip, deflate, br")
    .header("Accept-Language", "en,en-US;q=0.9,ru;q=0.8")
    .header("Cache-Control", "max-age=0")
    .header("X-Requested-With", "XMLHttpRequest")
    .asString
}
I literally copy-pasted all the tokens from the Google Chrome debug panel, but it doesn't work:
[30/May/2020 05:19:29 -0700] access WARNING 172.17.0.1 test_user - "POST /accounts/login/ HTTP/1.1" -- Successful login for user: test_user
[30/May/2020 05:19:29 -0700] middleware INFO Redirecting to login page: /filebrowser/view=/user/test_user
[30/May/2020 05:19:29 -0700] access INFO 172.17.0.1 -anon- - "GET /filebrowser/view=/user/test_user HTTP/1.1" -- login redirection
So I do pass the login form, but the rest doesn't work. I can't find what I'm missing...
Their code example doesn't work either: http://cloudera.github.io/hue/latest/developer/api/
Aren't you getting back an HTTP 302 redirect instead of a 200? (If so, you would need to follow it in your code.)
Also, the doc site above is stale; https://docs.gethue.com/developer/api/#python is the new one.
It doesn't seem intuitive, but:
def getDir(accessToken: CookiesAndHeaders, hdfsPathDirParent: String): (String, CookiesAndHeaders) = {
  val req = Http(s"$baseUrl/filebrowser/view=$hdfsPathDirParent")
    .cookie(accessToken.sessionid)
  val response = req.asString
just DO NOT pass the csrftoken as a cookie; use ONLY the sessionid cookie.
I have no idea why, but it helped...
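For comparison, here is the same flow as a minimal Python requests sketch, assuming HUE's standard Django login on a local instance (the base URL and credentials are placeholders); note that the follow-up request sends only the sessionid cookie:

import requests

BASE_URL = "http://localhost:8888"  # placeholder HUE instance

session = requests.Session()

# GET the login page first to receive a csrftoken cookie.
login_page = session.get(f"{BASE_URL}/accounts/login/?next=/")
csrftoken = login_page.cookies["csrftoken"]

# POST credentials with the CSRF token in the form body.
session.post(
    f"{BASE_URL}/accounts/login/",
    data={
        "username": "test_user",  # placeholder credentials
        "password": "secret",
        "csrfmiddlewaretoken": csrftoken,
        "next": "/",
    },
)

# Follow-up request: send ONLY the sessionid cookie, not the csrftoken.
sessionid = session.cookies["sessionid"]
response = requests.get(
    f"{BASE_URL}/filebrowser/view=/user/test_user",
    cookies={"sessionid": sessionid},
)
print(response.status_code)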
I am trying to grab the links from a Google search with bs4, but my code is returning an empty list.
import requests
from bs4 import BeautifulSoup
website = "https://www.google.co.uk/?gws_rd=ssl#q=science"
response=requests.get(website)
soup = BeautifulSoup(response.content)
link_info = soup.find_all("h3", {"class": "r"})
print link_info
The <h3 class="r"> is where the links for all the results are, not just the link for the first result.
In response I get [], and this happens for any other class I try to request, including <div class="rc">.
Here is a screenshot of what I am after:
Try the following code:
url = 'http://www.google.com/search?'
params = {'q': 'science'}
response = requests.get(url, params=params).content
soup = BeautifulSoup(response)
link_info = soup.find_all("h3", {"class": "r"})
print link_info
You're looking for this:
# select the container with the needed elements and grab each element in a loop
for result in soup.select('.tF2Cxc'):
    # grab each <a> tag from the container, then its href attribute
    link = result.select_one('.yuRUbf a')['href']
Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser. CSS selectors reference.
The code from Andersson will throw an error because there's no such r CSS selector anymore; it has changed.
Make sure you're using a user-agent, because the default requests user-agent is python-requests. Google blocks such requests because it knows it's a bot and not a "real" user visit, and you'll receive different HTML with some sort of error. A user-agent fakes a real user visit by adding this information to the HTTP request headers.
I wrote a dedicated blog post about how to reduce the chance of being blocked while web scraping search engines; it covers multiple solutions.
Pass user-agent in request headers:
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get('YOUR_URL', headers=headers)
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "samurai cop what does katana mean",
    "gl": "us",
    "hl": "en"
}

html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc')[:5]:
    link = result.select_one('.yuRUbf a')['href']
    print(link, sep='\n')
--------
'''
https://www.youtube.com/watch?v=paTW3wOyIYw
https://www.quotes.net/mquote/1060647
https://www.reddit.com/r/NewTubers/comments/47hw1g/what_does_katana_mean_it_means_japanese_sword_2/
https://www.imdb.com/title/tt0130236/characters/nm0360481
http://www.subzin.com/quotes/Samurai+Cop/What+does+Katana+mean%3F+-+It+means+Japanese+sword
'''
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to deal with picking the correct selectors or figuring out why certain things don't work as expected, and then maintaining it over time. Instead, you only need to iterate over structured JSON and get the data you want fast.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "samurai cop what does katana mean",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"][:5]:
    print(result['link'])
--------
'''
https://www.youtube.com/watch?v=paTW3wOyIYw
https://www.quotes.net/mquote/1060647
https://www.reddit.com/r/NewTubers/comments/47hw1g/what_does_katana_mean_it_means_japanese_sword_2/
https://www.imdb.com/title/tt0130236/characters/nm0360481
http://www.subzin.com/quotes/Samurai+Cop/What+does+Katana+mean%3F+-+It+means+Japanese+sword
'''
Disclaimer, I work for SerpApi.
I'm pretty new to PhantomJS. I just started out with headless automation of the application that I work on. Somehow, the following code seems to work just fine for websites like Hotmail, Facebook, etc., but it doesn't work for my application under test. Here is the code I'm using:
var page = require("webpage").create();
page.settings.userAgent="Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36"
phantom.clearCookies();
phantom.cookiesEnabled = true;
var homePage = "https://www.somewebsite.com";
page.open(homePage, function(status) {
var url = page.url;
console.log("Status: " + status);
console.log("Loaded: " + url);
page.evaluate(function(){
document.getElementById('myUsername').value='username;
document.getElementById('myPassword').value='password';
});
page.render("before.png");
page.evaluate(function(){
document.getElementById('myLoginButton').click();
});
setTimeout(function() {
page.render("after.png");
phantom.exit();
}, 10000);
});
The error message that I get is "Your browser has been set to block all cookies. Please enable them to log into the website."
Although I have written the statement "phantom.cookiesEnabled = true;", it doesn't seem to enable them. I already tried changing the user agent, but with no luck. Am I missing something?
Thanks in advance,
Harshit Kohli
For anyone who might face this issue: setting a user agent for the page should work.
page.settings.userAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:44.0) Gecko/20100101 Firefox/44.0"
I've been looking for real-time intraday quote data for the major FOREX pairs, backfilled for several days. I want to use it programmatically from an Android or web application, so a CSV- or XML-formatted webservice would be ideal.
So far I have found a few websites by googling, such as finam.ru. But it is in Russian, and I was not able to find any clearly documented source on the exact URL format for obtaining the data.
Could someone guide me on how to do it?
At the URL http://www.finam.ru/analysis/profile041CA00007/ you can choose the data to export as CSV.
Choose "Мировые валюты" (International Currencies) in the left combobox, and choose the currency pair in the right one.
Then click the button and download a CSV file. Here is an example with minute data:
http://195.128.78.52/AUDCAD_140501_140526.txt?market=5&em=181410&code=AUDCAD&df=1&mf=4&yf=2014&dt=26&mt=4&yt=2014&p=2&f=AUDCAD_140501_140526&e=.txt&cn=AUDCAD&dtf=1&tmf=1&MSOR=0&mstime=on&mstimever=1&sep=1&sep2=1&datf=2&at=1
I have never seen documentation for it; I just experimented with the dates and got the needed period.
You can add the header Referer: http://www.finam.ru/analysis/profile2C4A000007/ to get the tick data. C# code that bypasses the restriction is at the end of this answer.
In the form, the period parameter p takes the following values (from the <select id="issuer-profile-export-period" name="p"> element):
p=1: ticks
p=2: 1 minute
p=3: 5 minutes
p=4: 10 minutes
p=5: 15 minutes
p=6: 30 minutes
p=7: 1 hour (the default)
p=8: 1 day
p=9: 1 week
p=10: 1 month
mstimever=1: Moscow time
mstimever=0: not Moscow time
f: the name of the output file
To change the currency, you most likely need to change it in three places: em= (the numeric instrument code), code=, and cn=.
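Assuming the parameters behave as described above, here is a quick Python sketch for assembling an export URL. The em code for each pair still has to be looked up on the site, and judging by the example URLs the month parameters appear to be 0-based (4 = May); the values below reproduce the AUDCAD minute-data example:

from urllib.parse import urlencode

# Assemble a finam export URL from the parameters described above.
# em is the numeric instrument code, looked up on the site per pair.
params = {
    "market": 5, "em": 181410, "code": "AUDCAD",
    "df": 1, "mf": 4, "yf": 2014,   # from: day, month (0-based), year
    "dt": 26, "mt": 4, "yt": 2014,  # to: day, month (0-based), year
    "p": 2,                         # period: 2 = 1 minute
    "f": "AUDCAD_140501_140526", "e": ".txt", "cn": "AUDCAD",
    "dtf": 1, "tmf": 1, "MSOR": 0, "mstime": "on", "mstimever": 1,
    "sep": 1, "sep2": 1, "datf": 2, "at": 1,
}
url = "http://195.128.78.52/AUDCAD_140501_140526.txt?" + urlencode(params)
print(url)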
Currencies:
Aud/Cad, Aud/Chf, Aud/Dkk, Aud/Jpy, Aud/Nok, Aud/Nzd, Aud/Sek, Aud/Sgd, Aud/Usd,
Cad/Chf, Cad/Jpy, Cad/Usd, Chf/Dkk, Chf/Jpy, Chf/Sgd, Chf/Usd, Dkk/Usd,
Eur/Aud, Eur/Byr, Eur/Cad, Eur/Chf, Eur/Cny, Eur/Gbp, Eur/Hkd, Eur/Huf, Eur/Jpy, Eur/Kzt, Eur/Lvl, Eur/Mdl, Eur/Nok, Eur/Nzd, Eur/Rub, Eur/Sek, Eur/Sgd, Eur/Tjs, Eur/Uah, Eur/Usd, Eur/Uzs,
Gbp/Aud, Gbp/Cad, Gbp/Chf, Gbp/Jpy, Gbp/Nok, Gbp/Sek, Gbp/Sgd, Gbp/Usd,
Hkd/Usd, Huf/Usd, Jpy/Usd, Mxn/Usd, Nok/Usd,
Nzd/Cad, Nzd/Jpy, Nzd/Sgd, Nzd/Usd, Pln/Usd,
Rub/Eur, Rub/Lvl, Rub/Usd, Sek/Usd, Sgd/Jpy, Sgd/Usd,
Usd/Byr, Usd/Cad, Usd/Chf, Usd/Cny, Usd/Dem, Usd/Idr, Usd/Inr, Usd/Jpy, Usd/Kzt, Usd/Lvl, Usd/Mdl, Usd/Rub, Usd/Tjs, Usd/Uah, Usd/Uzs,
Xag, Xau, Zar/Usd
The only difference between the requests that bypass the restriction and those that don't is the header Referer: http://www.finam.ru/analysis/profile2C4A000007/.
Here is the link to the tick data:
http://195.128.78.52/AUDJPY_140526_140526.txt?market=5&em=181408&code=AUDJPY&df=26&mf=4&yf=2014&dt=26&mt=4&yt=2014&p=1&f=AUDJPY_140526_140526&e=.txt&cn=AUDJPY&dtf=1&tmf=1&MSOR=0&mstime=on&mstimever=1&sep=1&sep2=1&datf=6&at=1
Here are the headers of the original request that worked:
Remote Address:195.128.78.52:80
Request URL:http://195.128.78.52/AUDJPY_140526_140526.txt?market=5&em=181408&code=AUDJPY&df=26&mf=4&yf=2014&dt=26&mt=4&yt=2014&p=1&f=AUDJPY_140526_140526&e=.txt&cn=AUDJPY&dtf=1&tmf=1&MSOR=0&mstime=on&mstimever=1&sep=1&sep2=1&datf=6&at=1
Request Method:GET
Status Code:200 OK
Request Headers:
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip,deflate,sdch
Accept-Language:ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4,it;q=0.2
Connection:keep-alive
Host:195.128.78.52
Referer:http://www.finam.ru/analysis/profile2C4A000007/
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36
Here are the headers of the request that did not work:
Remote Address:195.128.78.52:80
Request URL:http://195.128.78.52/AUDJPY_140526_140526.txt?market=5&em=181408&code=AUDJPY&df=26&mf=4&yf=2014&dt=26&mt=4&yt=2014&p=1&f=AUDJPY_140526_140526&e=.txt&cn=AUDJPY&dtf=1&tmf=1&MSOR=0&mstime=on&mstimever=1&sep=1&sep2=1&datf=6&at=1
Request Method:GET
Status Code:200 OK
Request Headers:
Accept:text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding:gzip,deflate,sdch
Accept-Language:ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4,it;q=0.2
Connection:keep-alive
Host:195.128.78.52
User-Agent:Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36
The code to bypass restrictions and get the tick data:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net;
using System.Text;
using System.Threading.Tasks;

namespace ConsoleApplication6
{
    class Program
    {
        static void Main(string[] args)
        {
            var client = new WebClient();
            client.Headers.Add("user-agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36");
            client.Headers.Add("Accept-Language", "ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4,it;q=0.2");
            client.Headers.Add("Referer", "http://www.finam.ru/analysis/profile2C4A000007/");

            string url = "http://195.128.78.52/AUDJPY_140526_140526.txt?market=5&em=181408&code=AUDJPY&df=26&mf=4&yf=2014&dt=26&mt=4&yt=2014&p=1&f=AUDJPY_140526_140526&e=.txt&cn=AUDJPY&dtf=1&tmf=1&MSOR=0&mstime=on&mstimever=1&sep=1&sep2=1&datf=6&at=1";
            string htmlString = client.DownloadString(url);

            Console.WriteLine(htmlString);
            Console.ReadLine();
        }
    }
}
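For reference, the same request is easy to reproduce in Python with the requests library. This is just a sketch of the C# program above under the same assumptions: the tick-data URL is the one from this answer, and only the Referer header is needed to unlock the download.

import requests

# Same tick-data URL as in the C# example above.
url = ("http://195.128.78.52/AUDJPY_140526_140526.txt"
       "?market=5&em=181408&code=AUDJPY&df=26&mf=4&yf=2014"
       "&dt=26&mt=4&yt=2014&p=1&f=AUDJPY_140526_140526&e=.txt"
       "&cn=AUDJPY&dtf=1&tmf=1&MSOR=0&mstime=on&mstimever=1"
       "&sep=1&sep2=1&datf=6&at=1")

# The Referer header is what bypasses the restriction.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/35.0.1916.114 Safari/537.36",
    "Referer": "http://www.finam.ru/analysis/profile2C4A000007/",
}

print(requests.get(url, headers=headers).text)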