I need to search in Kibana Logs for fields with specific content. The field is "message", which looks like this:
11.111.72.58 - - [26/Nov/2020:08:44:23 +0000] "GET /images/image.jpg HTTP/1.1" 200 123456 "https://website.com/questionnaire/uuid/result" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.14 (KHTML, like Gecko) Version/14.0.1 Safari/605.1.14" "5.158.163.231"
This field contains URIs, for example "https://website.com/questionnaire/uuid/result". How can I search for specific URIs in that field?
I need to get all logs where the field "message" contains "https://website.com/questionnaire/someUUID*/result"
or where the URI is exactly "https://website.com/"
I've tried with Lucene:
message:/https://.+/result/
nothing found
message:https.*\result
URIs starting with "https" are found, but this also returns URIs without "result" at the end
message : "https://website.com/questionnaire" AND message : "result"
This works, but it would also match if "result" were not part of the URI and just appeared somewhere else at the end of the "message" field. I need something that really queries those URIs between the quotation marks.
I need to visualise the number of requests for each URI in Kibana later, so I think I need to use Lucene or the Query DSL.
Any ideas?
This is a good use case for the new wildcard field type (introduced in 7.9), which allows you to better search within potentially long strings.
If you declare your message field as wildcard like this:
PUT test
{
  "mappings": {
    "properties": {
      "message": {
        "type": "wildcard"
      }
    }
  }
}
And then index your documents
PUT test/_doc/1
{
"message": """11.111.72.58 - - [26/Nov/2020:08:44:23 +0000] "GET /images/image.jpg HTTP/1.1" 200 123456 "https://website.com/questionnaire/uuid/result" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.14 (KHTML, like Gecko) Version/14.0.1 Safari/605.1.14" "5.158.163.231"
"""
}
You can then run wildcard searches (even with leading wildcards, which are discouraged on normal keyword fields) and find your document easily:
GET test/_search
{
  "query": {
    "wildcard": {
      "message": {
        "value": "*https*uuid*"
      }
    }
  }
}
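And if you want to match only the URI that appears between double quotes in the message (as in your examples), a more specific pattern along these lines should work. This is just a sketch, assuming the quoted-URI layout of the sample log line above:
GET test/_search
{
  "query": {
    "wildcard": {
      "message": {
        "value": "*\"https://website.com/questionnaire/*/result\"*"
      }
    }
  }
}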
I'm struggling with the Apache HUE REST API and Django CSRF.
The problem is that I can sort of log in, but the rest doesn't work. I always get redirected to the login page. It seems like the server doesn't like my csrftoken or sessionid cookie.
I have absolutely no idea why.
Here is my login code:
val accessToken = getAccessToken(Http(s"$baseUrl/accounts/login/?next=/").asString)
val response =
  Http(s"$baseUrl/accounts/login/")
    .postForm(Seq(
      "username" -> username,
      "password" -> password,
      "csrfmiddlewaretoken" -> accessToken.csrftoken.getValue,
      "next" -> "/"
    ))
    .cookie(accessToken.csrftoken)
    .asString
getAccessToken(response) // wrapper for cookies and headers from response
Now I try to just get a page from HUE that is protected by CSRF:
def getDir(hdfsPathDirParent: String): Unit = {
  val accessToken = login()
  val response = Http(s"$baseUrl/filebrowser/view=$hdfsPathDirParent")
    .cookie(accessToken.csrftoken) // retrieved after login
    .cookie(accessToken.sessionid) // retrieved after login
    .header("X-CSRFToken", accessToken.csrftoken.getValue)
    .header("Host", "localhost:8888")
    .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9")
    .header("Connection", "keep-alive")
    .header("Sec-Fetch-Dest", "empty")
    .header("Sec-Fetch-Mode", "cors")
    .header("Sec-Fetch-Site", "same-origin")
    //.header("Sec-Fetch-User", "?1")
    .header("Upgrade-Insecure-Requests", "1")
    .header("User-Agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36")
    .header("Accept-Encoding", "gzip, deflate, br")
    .header("Accept-Language", "en,en-US;q=0.9,ru;q=0.8")
    .header("Cache-Control", "max-age=0")
    .header("X-Requested-With", "XMLHttpRequest")
    .asString
}
I literally copy-pasted all tokens from the Google Chrome debug panel. It doesn't work:
[30/May/2020 05:19:29 -0700] access WARNING 172.17.0.1 test_user - "POST /accounts/login/ HTTP/1.1" -- Successful login for user: test_user
[30/May/2020 05:19:29 -0700] middleware INFO Redirecting to login page: /filebrowser/view=/user/test_user
[30/May/2020 05:19:29 -0700] access INFO 172.17.0.1 -anon- - "GET /filebrowser/view=/user/test_user HTTP/1.1" -- login redirection
So I do pass the login form, but the rest doesn't work. I can't find what I'm missing...
Their code example doesn't work
http://cloudera.github.io/hue/latest/developer/api/
Aren't you getting back an HTTP 302 redirect instead of a 200? (If so, you would need to follow it in your code.)
Also, the doc site above is stale; https://docs.gethue.com/developer/api/#python is the new one.
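If it is indeed a 302, one way to handle it (a sketch, assuming the snippets above use scalaj-http) is to enable redirect following on the request:
import scalaj.http.{Http, HttpOptions}

// follow the 302 instead of treating it as the final response
val response = Http(s"$baseUrl/filebrowser/view=$hdfsPathDirParent")
  .cookie(accessToken.sessionid)
  .option(HttpOptions.followRedirects(true))
  .asString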
It doesn't seem intuitive, but this works:
def getDir(accessToken: CookiesAndHeaders, hdfsPathDirParent: String): (String, CookiesAndHeaders) = {
  val req = Http(s"$baseUrl/filebrowser/view=$hdfsPathDirParent")
    .cookie(accessToken.sessionid)
  val response = req.asString
Just DO NOT pass the csrftoken as a cookie; use ONLY the sessionid cookie.
I have no idea why, but it helped...
I have Nginx 1.17 with a config that successfully excludes the POST request method from logging:
map $request_method $loggable {
    default 1;
    POST 0;
}
access_log /var/log/nginx/nginx_access.log main if=$loggable;
I tried the config below to exclude from logging any URL that contains the word fragment "rsey", matched case-insensitively, for example JerSey2018:
map_hash_bucket_size 128;
map $request_uri $loggable {
    (.*?)rsey(.*?) 0;
    default 1;
}
map $request_method $loggable {
    default 1;
    POST 0;
}
access_log /var/log/nginx/nginx_access.log main if=$loggable;
But with this config Nginx still excludes POST requests from logging, yet it still writes log entries for URLs like
http://example.com/ID-16409696108601-JerSey2018-report.html
so my map/regex rule does not catch what it should. Log example:
example.com 88.256.54.27 - - [01/Aug/2019:06:52:00 -0500] "GET /ID-16409696108601-JerSey2018-report.html HTTP/1.1" 200 3366 "http://example.com/sitemap.xml" "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393"
Thanks for any hints and ideas to try. Maybe somebody could point me to an error in the config above?
A little later: I got so tired of fighting with the Nginx map that it seemed better to catch the location by part of a word, using something like this (although this does not work either):
location ~ /(.*)rsey(.*)/ {
    access_log off;
}
So now, thanks for any ideas on what to try here to catch a real wildcard in Nginx in this line:
location ~ /(.*)rsey(.*)/ {
You cannot control the same variable from two map blocks. But you can use the value of the first map as the default value of the second.
For example:
map $request_uri $loguri {
    default 1;
    ~*rsey 0;
}
map $request_method $loggable {
    default $loguri;
    POST 0;
}
access_log /var/log/nginx/nginx_access.log main if=$loggable;
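For illustration, this is how the two maps combine for a few hypothetical requests:
# GET  /ID-16409696108601-JerSey2018-report.html  ->  $loguri = 0  ->  $loggable = 0  (not logged)
# GET  /index.html                                ->  $loguri = 1  ->  $loggable = 1  (logged)
# POST /index.html                                ->                   $loggable = 0  (not logged)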
See this document for details.
I am attempting to parse and structure raw proxy data using a Grok filter in the ELK stack, and I can't get the timestamp and user agent string into the correct format. Refer to the following sample log:
"1488852784.440 1 10.11.62.19 TCP_DENIED/403 0 GET http://xxx.xxx.com/xxx - NONE/- - BLOCK_WEBCAT_12-XXX-XXX-NONE-NONE-NONE-NONE <IW_aud,0.0,-,""-"",-,-,-,-,""-"",-,-,-,""-"",-,-,""-"",""-"",-,-,IW_aud,-,""-"",""-"",""Unknown"",""Unknown"",""-"",""-"",0.00,0,-,""-"",""-"",-,""-"",-,-,""-"",""-""> - L ""http://xxx.xxx.xxx"" 10.11.11.2 - 403 TCP_DENIED ""Streaming Audio"" - - - GET ""Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"" http://xxx.xxx.xxx"
I am using the following filter:
%{NUMBER:timestamp}%{SPACE}%{NUMBER:request_msec:float} %{IPORHOST:src_ip} %{WORD}/%{NUMBER:response_status:int} %{NUMBER:response_size} %{WORD:http_method} (%{URIPROTO:http_proto}://)?%{IPORHOST:dst_host}(?::%{POSINT:port})?(?:%{NOTSPACE:uri_param})? %{USERNAME:user} %{WORD}/(%{IPORHOST:dst_ip}|-)%{GREEDYDATA:content_type}
Based on http://grokconstructor.appspot.com, I am able to parse out some of the fields, except the timestamp (1488852784.440) and the user agent string. I have tried different default Grok patterns on the timestamp, but it still shows up as a number.
That's because Grok can't convert to a date datatype. For that, you need to use the date filter which does this exact conversion for you.
filter {
  date {
    # "UNIX" parses seconds since the epoch, including decimals like 1488852784.440
    match => [ "timestamp", "UNIX" ]
  }
}
This will set the @timestamp field of the event to the parsed timestamp from the timestamp field.
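For the user agent string, one option is the useragent filter, which splits a raw UA string into browser and OS fields. This is just a sketch: it assumes you first capture the quoted UA into its own field (for example with %{QS:user_agent} at the right spot in your grok pattern); "user_agent" and "ua" are placeholder names here.
filter {
  useragent {
    # "user_agent" is the assumed field captured by the grok pattern
    source => "user_agent"
    target => "ua"
  }
}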
I am trying to grab the links from a Google search with bs4, but my code is returning an empty list.
import requests
from bs4 import BeautifulSoup
website = "https://www.google.co.uk/?gws_rd=ssl#q=science"
response=requests.get(website)
soup = BeautifulSoup(response.content)
link_info = soup.find_all("h3", {"class": "r"})
print link_info
The <h3 class="r"> is where the links for all the results are, not just the link for the first result.
In response I get [], and the same happens for any other class I try to request, including <div class="rc">.
Here is a screenshot of what I am after.
Try using the following code:
url = 'http://www.google.com/search?'
params = {'q': 'science'}
response = requests.get(url, params=params).content
soup = BeautifulSoup(response)
link_info = soup.find_all("h3", {"class": "r"})
print link_info
You're looking for this:
# select container with needed elements and grab each element in a loop
for result in soup.select('.tF2Cxc'):
    # grabs each <a> tag from the container and then grabs an href attribute
    link = result.select_one('.yuRUbf a')['href']
Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser. CSS selectors reference.
The code from Andersson will throw an error because there's no such r CSS selector anymore; it has changed.
Make sure you're using a user-agent, because the default requests user-agent is python-requests; Google knows the request comes from a bot rather than a "real" user visit, so it serves different HTML with some sort of error. Passing a user-agent makes the request look like a real user visit by adding this information to the HTTP request headers.
I wrote a dedicated blog post about how to reduce the chance of being blocked while web scraping search engines that covers multiple solutions.
Pass the user-agent in the request headers:
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get('YOUR_URL', headers=headers)
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "samurai cop what does katana mean",
    "gl": "us",
    "hl": "en"
}
html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc')[:5]:
    link = result.select_one('.yuRUbf a')['href']
    print(link, sep='\n')
--------
'''
https://www.youtube.com/watch?v=paTW3wOyIYw
https://www.quotes.net/mquote/1060647
https://www.reddit.com/r/NewTubers/comments/47hw1g/what_does_katana_mean_it_means_japanese_sword_2/
https://www.imdb.com/title/tt0130236/characters/nm0360481
http://www.subzin.com/quotes/Samurai+Cop/What+does+Katana+mean%3F+-+It+means+Japanese+sword
'''
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to deal with selecting the correct selectors or figuring out why certain things don't work as expected, and then maintaining it over time. Instead, you only need to iterate over structured JSON and get the data you want quickly.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "samurai cop what does katana mean",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"][:5]:
    print(result['link'])
--------
'''
https://www.youtube.com/watch?v=paTW3wOyIYw
https://www.quotes.net/mquote/1060647
https://www.reddit.com/r/NewTubers/comments/47hw1g/what_does_katana_mean_it_means_japanese_sword_2/
https://www.imdb.com/title/tt0130236/characters/nm0360481
http://www.subzin.com/quotes/Samurai+Cop/What+does+Katana+mean%3F+-+It+means+Japanese+sword
'''
Disclaimer, I work for SerpApi.
I'm pretty new to PhantomJS. I've just started out with headless automation of the application that I work on. Somehow, the following code seems to work just fine for websites like Hotmail, Facebook, etc., but it doesn't work for my application under test. Here is the code that I'm using:
var page = require("webpage").create();
page.settings.userAgent="Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36"
phantom.clearCookies();
phantom.cookiesEnabled = true;
var homePage = "https://www.somewebsite.com";
page.open(homePage, function(status) {
    var url = page.url;
    console.log("Status: " + status);
    console.log("Loaded: " + url);
    page.evaluate(function(){
        document.getElementById('myUsername').value='username';
        document.getElementById('myPassword').value='password';
    });
    page.render("before.png");
    page.evaluate(function(){
        document.getElementById('myLoginButton').click();
    });
    setTimeout(function() {
        page.render("after.png");
        phantom.exit();
    }, 10000);
});
The error message that I get is "Your browser has been set to block all cookies. Please enable them to log into the website."
Although I have written the statement "phantom.cookiesEnabled = true;", it doesn't seem to enable them. I already tried changing the user agent, but with no luck. Am I missing something?
For anyone who might face this issue: setting a user agent for the page should work:
page.settings.userAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:44.0) Gecko/20100101 Firefox/44.0"
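As a quick sanity check (a sketch; the URL is a placeholder), you can verify what the target site actually sees by reading the user agent and cookie flag inside the page context:
var page = require("webpage").create();
// set the UA on the page object before opening the page
page.settings.userAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:44.0) Gecko/20100101 Firefox/44.0";

page.open("https://www.somewebsite.com", function (status) {
    var seen = page.evaluate(function () {
        // what the site's JavaScript observes for this visit
        return { userAgent: navigator.userAgent, cookiesEnabled: navigator.cookieEnabled };
    });
    console.log(status + " " + JSON.stringify(seen));
    phantom.exit();
});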