How can I exclude search pattern within double quotes in Notepad++ - regex

I have the following line in which I want to replace each space with a tab, but keep the spaces inside the double quotes as they are. I am using Notepad++.
[11/May/2020:10:10:20 -0400] "GET / HTTP/1.1" 302 523 52197 url.com - - TLSv1.2 19922 "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36" https://somelinkhere - -
Desired output:
[11/May/2020:10:10:20 -0400] "GET / HTTP/1.1" 302 523 52197 url.com - - TLSv1.2 19922 "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36" https://somelinkhere - -
With the following regex I was able to select the strings inside the double quotes, but that is of no use to me.
"([^"]*)"
Can you please help me achieve this?

You can use
("[^"]*")|[ ]
Replace with (?1$1:\t).
Details:
("[^"]*") - Capturing group 1: a ", then zero or more chars other than " and then a "
| - or
[ ] - matches a space (you may remove the [ and ] here; they are only used to make the space pattern visible in the answer).
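For use outside Notepad++, the same match-and-keep idea can be sketched in Python. This is only an illustration: the function name spaces_to_tabs_outside_quotes is made up here, and because Python's re module has no (?1$1:\t) conditional replacement, a callback stands in for it.

```python
import re

# Group 1 captures a whole double-quoted string; the second alternative is a bare space.
QUOTED_OR_SPACE = re.compile(r'("[^"]*")| ')

def spaces_to_tabs_outside_quotes(line: str) -> str:
    """Replace spaces with tabs, leaving double-quoted substrings untouched."""
    def keep_or_tab(match: re.Match) -> str:
        # If group 1 took part in the match, put the quoted text back unchanged.
        if match.group(1) is not None:
            return match.group(1)
        return "\t"
    return QUOTED_OR_SPACE.sub(keep_or_tab, line)

sample = '[11/May/2020:10:10:20 -0400] "GET / HTTP/1.1" 302 523'
converted = spaces_to_tabs_outside_quotes(sample)
print(repr(converted))
```

Because the quoted alternative comes first, a space inside quotes is always consumed as part of group 1 and never reaches the space alternative.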

Related

Count IP on URLs begins with "domain/product" in Apache access_logs

I am trying to count the accesses to a specific URL that always begins with "shop/product?traffic=ads" using AWK, but I failed.
The following code counts how often each IP address has accessed this URL:
awk -F'[ "]+' '$7 == "/shop/product?traffic=ads" { ipcount[$1]++ }
END { for (i in ipcount) {
printf "%15s - %d\n", i, ipcount[i] } }' /var/www/vhosts/domain.com/logs/access_ssl_log
An example for the access_log (input-file) is here:
66.249.68.xx- - [19/Dec/2022:09:14:15 +0100] "GET /shop/other-product/1.0" 404 16996 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.5304.xxx Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
109.42.242.xxx - - [19/Dec/2022:09:14:55 +0100] "GET /shop/product?traffic=ads&gclid=Cj0KCQiAtICdBhCLARIsALUBFcFMmvFbA_1EyTTMRDp9IWhDXFA_HCeuEsIBXl886PoaAapen2KdussaAniSEALw_wcB HTTP/1.0" 200 30589 "https://www.google.com/" "Mozilla/5.0 (Linux; Android 11; SM-A515F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Mobile Safari/537.36"
80.187.75.xx - - [20/Dec/2022:06:40:12 +0100] "GET /shop/product HTTP/1.0" 200 10821 "https://www.example.com/shop/product?traffic=ads&gclid=EAIaIQobChMIg_Ks5vWF_AIVAgGLCh3k_gBKEAAYASAAEgKBOfD_BwE&dt=1671461107791" "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Mobile/15E148 Safari/604.1"
The "gclid" and the "dt" (session cookie) values are dynamic.
I tried adding ^ after ads and before /shop, but that returned no results.
For example, I want the following output:
6 Clicks from 109.42.242.xxx to /shop/product?traffic=ads&gclid=Cj0KCQiAtICdBhCLARIsALUBFcFMmvFbA_1EyTTMRDp9IWhDXFA_HCeuEsIBXl886PoaAapen2KdussaAniSEALw_wcB
1 Clicks from 80.187.75.xx to https://www.example.com/shop/product?traffic=ads&gclid=EAIaIQobChMIg_Ks5vWF_AIVAgGLCh3k_gBKEAAYASAAEgKBOfD_BwE&dt=1671461107791"
You can check if the string occurs in field 7 using index(), and then store the values of field 1 and field 7 with a space in between as the key, to retrieve both values in the END block by splitting on a space again.
awk -F'[ "]+' 'index($7, "/shop/product?traffic=ads") { ipcount[$1 " " $7]++ }
END { for (i in ipcount) {
split(i, a, " ")
# use a fixed format string: URLs may contain % characters
printf "%d Clicks from %s to %s\n", ipcount[i], a[1], a[2]
}
}' file
Test data
66.249.68.xx- - [19/Dec/2022:09:14:15 +0100] "GET /shop/other-product/1.0" 404 16996 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.5304.xxx Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
109.42.242.xxx - - [19/Dec/2022:09:14:55 +0100] "GET /shop/product?traffic=ads&gclid=Cj0KCQiAtICdBhCLARIsALUBFcFMmvFbA_1EyTTMRDp9IWhDXFA_HCeuEsIBXl886PoaAapen2KdussaAniSEALw_wcB HTTP/1.0" 200 30589 "https://www.google.com/" "Mozilla/5.0 (Linux; Android 11; SM-A515F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Mobile Safari/537.36"
109.42.242.xxx - - [19/Dec/2022:09:15:55 +0100] "GET /shop/product?traffic=ads&gclid=Cj0KCQiAtICdBhCLARIsALUBFcFMmvFbA_1EyTTMRDp9IWhDXFA_HCeuEsIBXl886PoaAapen2KdussaAniSEALw_wcB HTTP/1.0" 200 30589 "https://www.google.com/" "Mozilla/5.0 (Linux; Android 11; SM-A515F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Mobile Safari/537.36"
80.187.75.xx - - [20/Dec/2022:06:40:12 +0100] "GET /shop/product HTTP/1.0" 200 10821 "https://www.example.com/shop/product?traffic=ads&gclid=EAIaIQobChMIg_Ks5vWF_AIVAgGLCh3k_gBKEAAYASAAEgKBOfD_BwE&dt=1671461107791" "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Mobile/15E148 Safari/604.1"
109.42.242.xxx - - [19/Dec/2022:09:15:55 +0100] "GET /shop/product?traffic=ads&gclid=Aj0KCQiAtICdBhCLARIsALUBFcFMmvFbA_1EyTTMRDp9IWhDXFA_HCeuEsIBXl886PoaAapen2KdussaAniSEALw_wcB HTTP/1.0" 200 30589 "https://www.google.com/" "Mozilla/5.0 (Linux; Android 11; SM-A515F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Mobile Safari/537.36"
Output
1 Clicks from 109.42.242.xxx to /shop/product?traffic=ads&gclid=Aj0KCQiAtICdBhCLARIsALUBFcFMmvFbA_1EyTTMRDp9IWhDXFA_HCeuEsIBXl886PoaAapen2KdussaAniSEALw_wcB
2 Clicks from 109.42.242.xxx to /shop/product?traffic=ads&gclid=Cj0KCQiAtICdBhCLARIsALUBFcFMmvFbA_1EyTTMRDp9IWhDXFA_HCeuEsIBXl886PoaAapen2KdussaAniSEALw_wcB
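The same counting logic can be sketched in Python for comparison. This is a rough equivalent under assumptions, not a drop-in replacement: the name count_clicks and the shortened sample lines (with placeholder gclid values) are made up for illustration, and the field split imitates -F'[ "]+'.

```python
import re
from collections import Counter

FIELD_SEP = re.compile(r'[ "]+')        # same separator as awk -F'[ "]+'
NEEDLE = "/shop/product?traffic=ads"    # literal substring, like index() in awk

def count_clicks(lines):
    """Count requests per (ip, url) pair whose request path contains NEEDLE."""
    counts = Counter()
    for line in lines:
        fields = FIELD_SEP.split(line)
        # fields[0] is the IP ($1 in awk), fields[6] is the request path ($7 in awk)
        if len(fields) >= 7 and NEEDLE in fields[6]:
            counts[(fields[0], fields[6])] += 1
    return counts

# Shortened, hypothetical log lines in the same layout as the test data
log_lines = [
    '109.42.242.xxx - - [19/Dec/2022:09:14:55 +0100] "GET /shop/product?traffic=ads&gclid=abc HTTP/1.0" 200 30589 "-" "Mozilla/5.0"',
    '109.42.242.xxx - - [19/Dec/2022:09:15:55 +0100] "GET /shop/product?traffic=ads&gclid=abc HTTP/1.0" 200 30589 "-" "Mozilla/5.0"',
    '80.187.75.xx - - [20/Dec/2022:06:40:12 +0100] "GET /shop/product HTTP/1.0" 200 10821 "https://www.example.com/shop/product?traffic=ads&dt=1" "Mozilla/5.0"',
]

clicks = count_clicks(log_lines)
for (ip, url), n in clicks.items():
    print(f"{n} Clicks from {ip} to {url}")
```

Using a tuple as the key avoids having to join and re-split the IP and URL on a space.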
With your shown samples, please try the following awk code. It uses the match function with the regex \/shop\/product\?traffic=ads\S+ (each / is escaped to match a literal /; note that \S is a GNU awk extension). If a match is found, it increments an array element whose index is $1, FS and the matched value. The END block of the program then prints the values in the required format.
awk '
match($7,/\/shop\/product\?traffic=ads\S+/){
value[$1 FS substr($7,RSTART,RLENGTH)]++
}
END{
for(i in value){
split(i,arr)
print value[i] " Clicks from " arr[1] " to " arr[2]
}
}
' Input_file

Is it possible to write multiple regex for the same input in Fluent Bit?

My logs look like this:
200 59903 0.056 - [24/Jun/2020:00:06:56 +0530] "GET /xxxxx/xxxxx/xxxxx HTTP/1.1" xxxxx.com [xxxx:4900:xxxx:b798:xxxx:c8ba:xxxx:6a23] - - xxx.xxx.xxx.xxx - - - "http://xxxxx/xxxxx/xxxxx" 164551836 1 HIT "-" "-" "Mozilla/5.0 (Linux; Android 9; Mi A1 Build/PKQ1.180917.001; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/77.0.3865.92 Mobile Safari/537.36" "-" "-" "dhDebug=-" "-" - -
200 11485 0.000 - [24/Jun/2020:00:06:56 +0530] "GET /xxxxx/xxxxx/xxxxx/xxxxx HTTP/1.1" xxxxx.com xxx.xxx.xxx.xxx - - xxx.xxx.xxx.xxx - - - "-" 164551710 7 HIT "-" "-" "Dalvik/2.1.0 (Linux; U; Android 9; vivo 1915 Build/PPR1.180610.011)" "-" "-" "dhDebug=appVersion=13.0.8&osVersion=9&clientId=1271210612&conn_type=4G&conn_quality=NO_CONNECTION&sessionSource=organic&featureMask=1879044085&featureMaskV1=635" "-" 40 -
The two logs are almost the same, except that the last one contains a detailed dhDebug value.
This is what my parsers.conf looks like:
[PARSER]
Name head
Format regex
Regex (?<responseCode>\d{3})\s(?<responseSize>\d+)\s(?<responseTime>\d+.\d+)\s.*?\s\[(?<time>.*?)\]\s"(?<method>.*?)\s(?<url1>.*?)\s(?<protocol>.*?)"\s(?<servedBy>.*?)\s(?<Akamai_ip1>.*?)\s(?<ClientId_ip2>.*?)\s(?<ip3>.*?)\s(?<lb_ip4>.*?)\s(?<ip5>.*?)\s(?<ip6>.*?)\s(?<ip7>.*?)\s+"(?<url2>.*?)".*?".*?"\s".*?"\s"(?<agentInfo>.*?)"
Time_Key time
Time_Format %d/%b/%Y:%H:%M:%S %z
Time_Keep On
Types responseTime:float
Please suggest how to capture the dhDebug information as separate key-value pairs in the same regex, so that it works on both types of logs.
You can use (?:case1|case2), where case1 covers an empty dhDebug (-) and case2 covers a populated one.
So the regex will be:
(?<responseCode>\d{3})\s(?<responseSize>\d+)\s(?<responseTime>\d+.\d+)\s.*?\s\[(?<time>.*?)\]\s"(?<method>.*?)\s(?<url1>.*?)\s(?<protocol>.*?)"\s(?<servedBy>.*?)\s(?<Akamai_ip1>.*?)\s(?<ClientId_ip2>.*?)\s(?<ip3>.*?)\s(?<lb_ip4>.*?)\s(?<ip5>.*?)\s(?<ip6>.*?)\s(?<ip7>.*?)\s+"(?<url2>.*?)".*?".*?"\s".*?"\s"(?<agentInfo>.*?)"\s"-"\s"-"\s"dhDebug=(?:-|appVersion=(?<appVersion>.*?)&osVersion=(?<osVersion>.*?)&clientId=(?<clientId>.*?)&conn_type=(?<conn_type>.*?)&conn_quality=(?<conn_quality>.*?)&sessionSource=(?<sessionSource>.*?)&featureMask=(?<featureMask>.*?)&featureMaskV1=(?<featureMaskV1>.*?))"
With this you get null for each dhDebug field on the first log line, and the fields with their values on the second one.
You can test it at http://grokdebug.herokuapp.com/
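The behaviour of the (?:case1|case2) alternation can be checked with a small Python sketch. It is reduced to the dhDebug part and only two of the fields, purely for illustration; Python spells named groups (?P<name>...) where the Fluent Bit regex above uses (?<name>...), and the name DH_DEBUG is made up here.

```python
import re

# Either a bare "-" or a key=value list; groups in the untaken branch stay unset.
DH_DEBUG = re.compile(
    r'"dhDebug=(?:-|appVersion=(?P<appVersion>[^&"]*)'
    r'&osVersion=(?P<osVersion>[^&"]*)[^"]*)"'
)

empty = DH_DEBUG.search('"-" "dhDebug=-" "-"')
filled = DH_DEBUG.search('"-" "dhDebug=appVersion=13.0.8&osVersion=9&clientId=1271210612" "-"')

print(empty.groupdict())   # both named groups are None
print(filled.groupdict())  # both named groups carry values
```

Both log shapes match the same pattern; only the set of populated named groups differs.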

Regex to parse blue coat log file

I have this log file that I'm currently trying to parse.
Jan 12 2019, 14:51:23, 117, 10.0.0.1, neil.armstrong, standard-users, -, TCP_Connect, "sports betting", -, 201, accept, GET, text, https, www.best-site.com, 443, /pages/home.php, ?user=narmstrong&team=wizards, -, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome Safari/537.36", 192.168.1.1, 1400, 1463, -, -, -
Jan 12 2019, 14:52:14, 86, 10.0.0.1, neil.armstrong, standard-users, -, TCP_Connect, "sports betting", -, 200, accept, POST, text, https, www.upload.best-site.com, 443, /, -, -, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/ Safari/537", 192.168.1.1, 230056, 600, -, -, -
Jan 12 2019, 14:52:54, 118, 10.0.0.1, neil.armstrong, standard-users, -, TCP_Connect, "sports betting", -, 200, accept, GET, text/javascript, http, google.fr, 80, /search, ?q=wizards, -, "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/ Safari/537", 192.168.1.1, 1717, 17930, -, -, -
This is the regex that I'm currently using: https://regex101.com/r/Asbpkx/3. It parses the log file fine until it reaches "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/ Safari/537", where it splits at (KHTML, like Gecko).
How can I complete the regex so that this does not happen?
I looked into this more closely: the log file is not in CSV format, which is why the CSV parsing regex in my previous answer didn't work. (I also tried parsing it with Excel and Python csv, and both split at the comma after 'KHTML'.)
Using a negative lookbehind makes the example you gave parse correctly.
(.+?)(?<!KHTML),
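As a possible alternative to the lookbehind, here is a minimal Python sketch. In the sample lines every comma is followed by a space, and Python's csv module only treats a double quote as a quoting character at the very start of a field, so with default settings the quoted user agent is split at its inner comma. Passing skipinitialspace=True makes the reader skip that space. The shortened sample line is an assumption made for brevity.

```python
import csv

# Shortened line in the same ", "-separated layout as the log above
line = ('Jan 12 2019, 14:52:14, 86, "sports betting", -, '
        '"Mozilla/5.0 (KHTML, like Gecko) Chrome/ Safari/537", 192.168.1.1')

# Default settings: the space before each quote hides the quoting
default_fields = next(csv.reader([line]))

# skipinitialspace=True drops the space after each delimiter
fields = next(csv.reader([line], skipinitialspace=True))

print(len(default_fields), len(fields))
print(fields[5])
```

This keeps the quoted user agent in one field without hard-coding KHTML into the pattern.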
It looks like you are trying to parse csv using regex.
Use the regex described in this post:
https://stackoverflow.com/a/18147076/9397882
Regex: (?:^|,)(?=[^"]|(")?)"?((?(1)[^"]*|[^,"]*))"?(?=,|$)
Don't use regex for a CSV. Try these props.conf settings.
[mysourcetype]
INDEXED_EXTRACTIONS = CSV
FIELD_DELIMITER = ,
FIELD_QUOTE = "
FIELD_NAMES = Date, Time, Field3, IP_Addr, Field4, Field5, Field6
TIMESTAMP_FIELDS = Date, Time

Regex number range parsing

I am trying to parse out a specific number range, and can't seem to get it right. I am looking to extract specific browser versions from user agent strings. For example, I want to parse Chrome 1-42 and Firefox 1-40, but I can't figure out the syntax.
What I have so far is this, which kind of works, but it grabs the first number it sees and doesn't respect the 2 digit range:
Gecko..Chrome/([1-9].|[1-4][1-2].)
Sample:
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36
Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.1847.137 Safari/537.36
Firefox 29: Mozilla/5.0 (Android; Mobile; rv:29.0) Gecko/29.0 Firefox/23.0
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:29.0) Gecko/20100101 Firefox/29.0
Any ideas? TIA.
((?:(?:Firefox\/(?:[1-9]|[1-3][0-9]|40))|(?:Chrome\/(?:[1-9]|[1-3][0-9]|4[0-2])))\.[^ ]+)
Is this what you would like? /Edited/
Demo:
https://regex101.com/r/gH1nU9/2
Because a regex only matches text and numbers are treated as text, to match something like 1 to 42 you would have to do something like this:
\b(?:[1-9]|[1-3][0-9]|4[0-2])\b
This matches 1 to 9, or 10 to 39, or 40 to 42. I have added the boundaries \b so that nothing except these numbers is matched.
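To see the digit-by-digit range in action, here is a small Python sketch that applies it to the browser tokens from the question (Chrome 1-42, Firefox 1-40). The pattern and the helper name is_old_browser are illustrative assumptions, not taken from either answer verbatim.

```python
import re

# Major version must be followed by a dot, so "4" cannot match the start of "43".
OLD_BROWSER = re.compile(
    r'\b(?:Chrome/(?:[1-9]|[1-3][0-9]|4[0-2])'
    r'|Firefox/(?:[1-9]|[1-3][0-9]|40))\.'
)

def is_old_browser(user_agent: str) -> bool:
    """True if the user agent names Chrome 1-42 or Firefox 1-40."""
    return OLD_BROWSER.search(user_agent) is not None

samples = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.137 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:29.0) Gecko/20100101 Firefox/29.0",
    "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36",
]
results = [is_old_browser(ua) for ua in samples]
print(results)
```

Anchoring on the trailing dot plays the same role as \b here: it stops a one-digit alternative from matching the first digit of a larger number.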

Regular Expression in Apache-Alias not working

Why does
AliasMatch .*\.(png|ico|gif|jpg|jpeg|js|css|woff|ttf|svg)$ /my-location/
+
GET /pages/index/index.js HTTP/1.1
=
[30/Jul/2014:12:55:28 -0700] "GET /pages/index/index.js HTTP/1.1" 404 433 "http://localhost/" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36"
?
The solution was
AliasMatch (.*\.(png|ico|gif|jpg|jpeg|js|css|woff|ttf|svg))$ /my-location/$1
The reason for that is: "[...] Alias will automatically copy any additional part of the URI, past the part that matched, onto the end of the file path on the right side, while AliasMatch will not. This means that in almost all cases, you will want the regular expression to match the entire request URI from beginning to end, and to use substitution on the right side." (http://httpd.apache.org/docs/2.2/mod/mod_alias.html#aliasmatch)
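The effect of the capture group can be imitated with a short Python sketch. It only mimics the regex substitution, not Apache itself; the helper name alias_match is hypothetical, and Python writes the backreference as \1 where Apache uses $1.

```python
import re

PATTERN = re.compile(r"(.*\.(png|ico|gif|jpg|jpeg|js|css|woff|ttf|svg))$")

def alias_match(uri: str, target: str):
    """Return the target with backreferences expanded, or None if the URI does not match."""
    match = PATTERN.search(uri)
    if match is None:
        return None
    return match.expand(target)

# Without a backreference the matched path is simply lost
without_ref = alias_match("/pages/index/index.js", r"/my-location/")
# With the backreference the matched path is carried over
with_ref = alias_match("/pages/index/index.js", r"/my-location/\1")

print(without_ref)
print(with_ref)
```

In this sketch the first call yields only the directory, which corresponds to the 404 in the question, while the second carries the requested file path along.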