I am attempting to parse and structure raw proxy data using the grok filter in the ELK stack, and I can't get the timestamp and user agent string into the correct format. Here is a sample log entry:
"1488852784.440 1 10.11.62.19 TCP_DENIED/403 0 GET http://xxx.xxx.com/xxx - NONE/- - BLOCK_WEBCAT_12-XXX-XXX-NONE-NONE-NONE-NONE <IW_aud,0.0,-,""-"",-,-,-,-,""-"",-,-,-,""-"",-,-,""-"",""-"",-,-,IW_aud,-,""-"",""-"",""Unknown"",""Unknown"",""-"",""-"",0.00,0,-,""-"",""-"",-,""-"",-,-,""-"",""-""> - L ""http://xxx.xxx.xxx"" 10.11.11.2 - 403 TCP_DENIED ""Streaming Audio"" - - - GET ""Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"" http://xxx.xxx.xxx"
I am using the following filter:
%{NUMBER:timestamp}%{SPACE}%{NUMBER:request_msec:float} %{IPORHOST:src_ip} %{WORD}/%{NUMBER:response_status:int} %{NUMBER:response_size} %{WORD:http_method} (%{URIPROTO:http_proto}://)?%{IPORHOST:dst_host}(?::%{POSINT:port})?(?:%{NOTSPACE:uri_param})? %{USERNAME:user} %{WORD}/(%{IPORHOST:dst_ip}|-)%{GREEDYDATA:content_type}
Based on http://grokconstructor.appspot.com, I am able to parse out some of the fields, except the timestamp (1488852784.440) and the user agent string. I have tried different default Grok patterns on the timestamp, but it still shows up as a number.
That's because grok can't convert a field to a date data type. For that you need the date filter, which does this exact conversion for you.
filter {
  date {
    # 1488852784.440 is epoch seconds with a millisecond fraction,
    # so the pattern is "UNIX" (UNIX_MS expects epoch milliseconds)
    match => [ "timestamp", "UNIX" ]
  }
}
This will set the @timestamp field of the event to the date parsed from the timestamp field.
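The user agent string is a separate problem: grok only captures it as plain text, and breaking it into browser/OS/version fields is the job of the useragent filter. A minimal sketch, assuming you extend your grok pattern so the quoted UA string lands in a field named user_agent (e.g. with %{QS:user_agent} at the right position; that field name is an assumption, not something your current pattern produces):
filter {
  useragent {
    # "user_agent" is the hypothetical field your grok pattern captured the raw UA string into
    source => "user_agent"
    # parsed fields (name, os, version, ...) will be nested under [ua]
    target => "ua"
  }
}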
I've been getting weird results from my FB CAPI in Facebook's Event Test tool.
Is it a Facebook bug, or is something wrong with my payload?
Here's what I'm doing; I've been able to replicate this on different machines with different IPs.
Here's how I can reproduce the problem most of the time:
I open the Event Test tool for my pixel in Business Manager, and open Graph API Explorer to send test events to that Event Test tool.
In Graph API Explorer I enter my access token and use the following JSON code to send a test payload to the Event Test tool:
{
  "data": [
    {
      "event_name": "ViewContent",
      "event_time": 1661938013,
      "event_id": "1661886269650_16619383723281",
      "event_source_url": "https://example.com/?gtm_debug=1661936451103",
      "action_source": "website",
      "user_data": {
        "client_ip_address": "111.111.111.111",
        "client_user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
        "em": null,
        "ph": null
      },
      "custom_data": {
        "contents": null
      }
    }
  ],
  "partner_agent": "gtmss-1.0.0-0.0.5",
  "test_event_code": "TEST83629"
}
I then check the Event Test tool and see the following message received:
As you can see from the above screenshot, the event name is CUSTOM EVENT (blank), even though it was sent as a standard ViewContent. Also, the source is marked as WEBSITE, when it was obviously sent through the Graph API and should be marked as SERVER.
I then go back to Graph API Explorer, change ONE digit of client_ip_address to something like "112.111.111.111", and send the same payload again.
I check the Event Test tool and this time I see the following message received:
WHY does the same payload react so differently, and why is it even marked as received from a WEBSITE when it was sent through the SERVER? And why does fiddling with the IP sometimes fix the problem?
I've been able to replicate this issue with three different users, three different Business accounts, and three different pixels. What am I doing wrong?
I encountered the same issue, also when using the same server-side GTM template: Conversions API Tag.
The issue is that in your payload you are sending the "em" and "ph" parameters under "user_data" as null. This somehow confuses the API; these values must either be a hashed string or not be defined at all.
See - https://developers.facebook.com/docs/marketing-api/conversions-api/parameters/customer-information-parameters/
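For reference, those customer information parameters must be SHA-256 hashes of normalized values (email lowercased and trimmed; phone reduced to digits with country code). A minimal Python sketch of that normalization and hashing, in case you do have a real value to send:
import hashlib

def hash_customer_param(value: str) -> str:
    # Normalize (trim + lowercase) and hash with SHA-256,
    # as the customer information parameters docs require.
    return hashlib.sha256(value.strip().lower().encode("utf-8")).hexdigest()

print(hash_customer_param("User@Example.com"))  # suitable as a value for "em"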
Edit the Conversions API Tag template code.
Find these lines:
// Common Event Schema Parameters
event.user_data.em = eventModel['x-fb-ud-em'] ||
    (eventModel.user_data != null ? hashFunction(eventModel.user_data.email_address) : null);
event.user_data.ph = eventModel['x-fb-ud-ph'] ||
    (eventModel.user_data != null ? hashFunction(eventModel.user_data.phone_number) : null);
and replace them with:
let emData = eventModel['x-fb-ud-em'] || (eventModel.user_data != null ? hashFunction(eventModel.user_data.email_address) : null);
if (emData != null) {
    event.user_data.em = emData;
}
let phData = eventModel['x-fb-ud-ph'] || (eventModel.user_data != null ? hashFunction(eventModel.user_data.phone_number) : null);
if (phData != null) {
    event.user_data.ph = phData;
}
This makes it so the data is not added at all if it's not defined, instead of adding a null.
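With that change, a payload without an email or phone simply omits those keys; the user_data block from the question would come out as, for example:
"user_data": {
  "client_ip_address": "111.111.111.111",
  "client_user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}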
I need to search Kibana logs for fields with specific content. The field is "message", and it looks like this:
11.111.72.58 - - [26/Nov/2020:08:44:23 +0000] "GET /images/image.jpg HTTP/1.1" 200 123456 "https://website.com/questionnaire/uuid/result" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.14 (KHTML, like Gecko) Version/14.0.1 Safari/605.1.14" "5.158.163.231"
This field contains URIs, for example "https://website.com/questionnaire/uuid/result" here. How can I search for specific URIs in that field?
I need to get all logs where the field "message" contains "https://website.com/questionnaire/someUUID*/result",
or where the URI is exactly "https://website.com/".
I've tried with Lucene:
message:/https://.+/result/
nothing found
message:https.*\result
URIs with "https: at the beginning found, but also returns URIs without "result" at the end
message : "https://website.com/questionnaire" AND message : "result"
This works, but it would also match if "result" were not part of the URI and just appeared on its own at the end of the "message" field. I need something that really queries the URIs between the quotes.
I need to visualise the number of requests for each URI in Kibana later, so I think I need to use Lucene or Query DSL.
Any ideas?
This is a good use case for the new wildcard field type (introduced in 7.9), which allows you to better search within potentially long strings.
If you declare your message field as wildcard like this:
PUT test
{
  "mappings": {
    "properties": {
      "message": {
        "type": "wildcard"
      }
    }
  }
}
And then index your documents:
PUT test/_doc/1
{
  "message": """11.111.72.58 - - [26/Nov/2020:08:44:23 +0000] "GET /images/image.jpg HTTP/1.1" 200 123456 "https://website.com/questionnaire/uuid/result" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.14 (KHTML, like Gecko) Version/14.0.1 Safari/605.1.14" "5.158.163.231"
  """
}
You can then run wildcard searches (even with leading wildcards, which are discouraged on normal keyword fields) and find your document easily.
GET test/_search
{
  "query": {
    "wildcard": {
      "message": {
        "value": "*https*uuid*"
      }
    }
  }
}
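Since the URIs inside your message field are wrapped in double quotes, you can also include those quotes in the wildcard pattern so that "result" has to be part of the URI rather than stray text at the end of the message. A sketch for the questionnaire case from the question:
GET test/_search
{
  "query": {
    "wildcard": {
      "message": {
        "value": "*\"https://website.com/questionnaire/*/result\"*"
      }
    }
  }
}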
We have a field for "url" in our logs, and I'd like to be able to filter down to just the requests hitting the homepage. That means requests for / and for /?*, i.e. with any query string.
Getting just the bare homepage requests is | filter url = "/", but how do you also include the requests that have a query string?
You can use regex: https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax.html
Something like this should work:
filter url = "/" or url like /^\/\?.*/
I am trying to grab the links from a Google search with bs4, but my code is returning an empty list.
import requests
from bs4 import BeautifulSoup
website = "https://www.google.co.uk/?gws_rd=ssl#q=science"
response=requests.get(website)
soup = BeautifulSoup(response.content)
link_info = soup.find_all("h3", {"class": "r"})
print link_info
The <h3 class="r"> is where the links for all the results are not just the link for the first result.
In response I get [] and this is for any other class I try to request including <div class="rc">.
Here is a prt sc of what I am after,
Try the following code:
import requests
from bs4 import BeautifulSoup

url = 'http://www.google.com/search?'
params = {'q': 'science'}
response = requests.get(url, params=params).content
soup = BeautifulSoup(response, 'html.parser')
link_info = soup.find_all("h3", {"class": "r"})
print(link_info)
You're looking for this:
# select the container with the needed elements and grab each one in a loop
for result in soup.select('.tF2Cxc'):
    # grab the <a> tag from the container, then its href attribute
    link = result.select_one('.yuRUbf a')['href']
Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser. CSS selectors reference.
The code from Andersson will no longer return anything, because there's no such r CSS selector anymore; Google has changed its markup.
Make sure you're using a user-agent, because the default requests user-agent is python-requests; Google blocks such requests because it knows it's a bot and not a "real" user visit, and you'll receive different HTML with some sort of error. Passing a user-agent in the HTTP request headers fakes a real user visit.
I wrote a dedicated blog post about how to reduce the chance of being blocked while web scraping search engines; it covers multiple solutions.
Pass user-agent in request headers:
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get('YOUR_URL', headers=headers)
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "samurai cop what does katana mean",
    "gl": "us",
    "hl": "en"
}

html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc')[:5]:
    link = result.select_one('.yuRUbf a')['href']
    print(link, sep='\n')
--------
'''
https://www.youtube.com/watch?v=paTW3wOyIYw
https://www.quotes.net/mquote/1060647
https://www.reddit.com/r/NewTubers/comments/47hw1g/what_does_katana_mean_it_means_japanese_sword_2/
https://www.imdb.com/title/tt0130236/characters/nm0360481
http://www.subzin.com/quotes/Samurai+Cop/What+does+Katana+mean%3F+-+It+means+Japanese+sword
'''
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to deal with picking the correct selectors or figuring out why certain things don't work as expected and then maintaining it over time. Instead, you only need to iterate over structured JSON and get the data you want, fast.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "samurai cop what does katana mean",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"][:5]:
    print(result['link'])
--------
'''
https://www.youtube.com/watch?v=paTW3wOyIYw
https://www.quotes.net/mquote/1060647
https://www.reddit.com/r/NewTubers/comments/47hw1g/what_does_katana_mean_it_means_japanese_sword_2/
https://www.imdb.com/title/tt0130236/characters/nm0360481
http://www.subzin.com/quotes/Samurai+Cop/What+does+Katana+mean%3F+-+It+means+Japanese+sword
'''
Disclaimer: I work for SerpApi.
I am trying to configure a filter in the Google Analytics API, but I can't manage to extract URLs containing numbers.
e.g. /comptable-opcvm-debutant-3
My configuration:
ga:medium==organic;ga:PagePath==~.[0-9]+
Here is the report status:
=> Cannot read property "0" from undefined.
Filtering by page path with a digit at the end works for me without any problems after a few changes to the original expression. I tested it in Query Explorer:
metrics: ga:pageviews
dimensions: ga:pagePath
filters: ga:pagePath=~^[^?]+\d+$
Example of results:
/071620716207162 1
/JZepeda13277JZepeda13277JZepeda13277 1
/help/how-do-i-send-photo-or-file-expert-0 48
/help/topics/141 47
...
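If you move from the Query Explorer to a raw Core Reporting API request, keep in mind that the filter expression has to be URL-encoded. A small Python sketch (the "ga:XXXXXXXX" view ID is a placeholder):
from urllib.parse import urlencode

# Mirrors the Query Explorer settings above; "ga:XXXXXXXX" is a placeholder view ID.
params = {
    "ids": "ga:XXXXXXXX",
    "metrics": "ga:pageviews",
    "dimensions": "ga:pagePath",
    "filters": r"ga:medium==organic;ga:pagePath=~^[^?]+\d+$",
}
print(urlencode(params))  # the encoded query string for the request URL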