Regex to parse certain fields of a log file - regex

I have this log line:
blabla#gmail.com, Portal, qtp724408050-38, com.blabla.search.lib.SearchServiceImpl .logRequest, [Input request is lookupRequestDTO]
I need to find a regex that grabs that email, then matches lookupRequestDTO ignoring everything in between.
Currently my regex grabs the whole line:
([\w-\.]+)#gmail.com,(.+)lookupRequestDTO
How do I not match anything in between the email and lookupRequestDTO ?

What about this?
([^,]+).*?lookupRequestDTO
[^,]+ matches everything up until the first comma so it should get you the email
It assumes lookupRequestDTO is a criteria for your search. If it is a variable you want to retrieve, you could use this :
([^,]+).*?\[Input request is ([^\]]+)

Assuming you're using PCRE (php, perl, etc., and this should work in javascript):
([\w-\.]+?#gmail\.com),(?:.+)(lookupRequestDTO)
Out of capture groups 1 and 2, you'll get:
MATCH 1
blabla#gmail.com
lookupRequestDTO
Working example: http://regex101.com/r/yW9eU3

Related

regex extract username from 2 types of url

I'm currently using this regex (?<=\/movie\/)[^\/]+, but it only matches the username from the second url, i know i could make a if (contains /movie/): use this regex, else: use another regex on my code, but i'm trying to do this directly on regex.
http://example.com:80/username/token/30000
http://example.com:80/movie/username/token/30000.mp4
To complete the Tensibai's answer, if you have not a port in url, you can use the last dot in url to start your regex :
\.[^\/\.]+\/(?:movie\/)?([^\/]+)
(demo)
You can use something like this to make the movie/ optional and have the username in a named capture group (Live exemple):
\d[/](?:movie\/)?(?<username>[^/]+)[/]
using \d/ to anchor the start of match at after the url.

Consolidated RegEx to parse syslog data

Goal
I am trying to craft a RegEx that will parse out specific data from various syslog entries that contain subtle differences in logged content. While I am able to accomplish my goal using multiple RegEx statements, if possible, I would like to combine these statements into a single consolidated RegEx.
Log entries
The main issue I'm having is that some log entries have a URL that needs to be parsed to a named group and other log entries do not have any URL. Examples of these two different log entries are provided below.
Entry with URL
Nov 3 11:33:04 host1 postfix/smtpd[12812]: NOQUEUE: reject: RCPT from 178.red-83-59-180.dynamicip.rima-tde.net[83.59.180.178]: 554 5.7.1 Service unavailable; Client host [83.59.180.178] blocked using b.barracudacentral.org; http://www.barracudanetworks.com/reputation/?pr=1&ip=83.59.180.178; from=<lmclapp68#newmail.spamcop.net> to=<user1#example.com> proto=ESMTP helo=<178.red-83-59-180.dynamicip.rima-tde.net>
Entry without URL
Nov 2 16:01:25 host1 postfix/smtpd[31667]: NOQUEUE: reject_warning: RCPT from mail1.sendersrv.com[185.3.229.125]: 554 5.7.1 Service unavailable; Client host [185.3.229.125] blocked using bl.spamcop.net; from=<bounces+rL59wUXq98_inBrG#sendersrv.com> to=<user1#example.com> proto=ESMTP helo=<mail1.sendersrv.com>
RegEx statements
In the RegEx statements that follow, the first two are what I currently use for each of the previous log messages. The third RegEx is my attempt at consolidating these both into a single RegEx that will parse data from either log message. My attempt was to use a conditional statement that would basically check for the existence of http(s) and if found, then to parse the URL to a named group. If http(s) was not found, then it would parse out everything until the next RegEx token.
The issue is that when I test the RegEx against a log entry that has a URL, the RegEx does not seem to find http(s) despite this token being set as optional (i.e. using the ? quantifier). However, if I remove the ? quantifier, it does find http(s) and then parses the URL as desired. However, without the quantifier, the RegEx does not work with log entries that do not have a URL.
Parse entries with URL
^(?P<datetime>.+) host1 postfix.+RCPT from (?P<srcDns>.+)\[(?P<srcIp>[0-9\.]+)\]:.+blocked using (?P<blkList>.+);.+https?:\/{2}(?P<entryUrl>.+);\s.+\sto=\<(?P<destEm>.+)>.+$
Parse entries without URL
^(?P<datetime>.+) host1 postfix.+RCPT from (?P<srcDns>.+)\[(?P<srcIp>[0-9\.]+)\]:.+blocked using (?P<blkList>.+);\s.+\sto=\<(?P<destEm>.+)>.+$
Attempt at consolidating RegEx
^(?P<datetime>.+) host1 postfix.+RCPT from (?P<srcDns>.+)\[(?P<srcIp>[0-9\.]+)\]:.+blocked using (?P<blkList>.+)(?<=[a-z]);.+(https?:\/{2})?(?(5)(?P<entryUrl>.+)|.+)to=\<(?P<destEm>.+)>.+$
I'm sure the issue is my misunderstanding as to how the conditional statements and the ? quantifier works.
Looking at your patterns, the email address for to: is between tags < and > but due to the formatting in the question they are not shown.
The parts in your pattern like .+ first match until the end of the string, and will then backtrack and try to match the rest of the pattern.
You can make the pattern a bit more performant making the parts that you want and know more specific.
For the datetime, you can make the pattern match the specified format instead of .+ using ^(?P<datetime>[A-Z][a-z]{2}\s+\d{1,2}\s* \d{1,2}:\d{1,2}:\d{1,2})
For (?P<blkList>[^;]+) and (?P<entryUrl>[^;]+) you can use a negated character class matching any char except ;
For group (?P<destEm>[^<>\s]+) you can exclude matching tags.
To make match the url, instead of using a condition you can make the group optional using ?
For example
^(?P<datetime>[A-Z][a-z]{2}\s+\d{1,2}\s* \d{1,2}:\d{1,2}:\d{1,2}) host1 postfix\b.*? RCPT from (?P<srcDns>.*?)\[(?P<srcIp>[0-9\.]+)\]:.*? blocked using (?P<blkList>[^;]+);(?:.+?https?:\/\/(?P<entryUrl>[^;]+);)?\s.*? to=[^<]*<(?P<destEm>[^<>\s]+)>
See a regex demo.
Have you tried to test your regex on page like regex101?
to=\<(?P<destEm>.+)> doesn't seem to match your examples. You should either remove <> or replace to with helo. Be careful to make your quantifier lazy after blkList otherwise you might catch too much text.
You can then make your url optional with ? and it should work in both cases:
^(?P<datetime>.+) host1 postfix.+RCPT from (?P<srcDns>.+)\[(?P<srcIp>[0-9\.]+)\]:.+blocked using (?P<blkList>.+?);(.+https?:\/{2}(?P<entryUrl>.+);\s)?.+\sto=(?P<destEm>.+?)\s.*$
One approach would be to replace in the first regex .+https?:\/{2}(?P<entryUrl>.+); with (?:.+https?:\/{2}(?P<entryUrl>.+);)? where ?: indicates that it is a non-capturing group and the ? at the end means that it is optional.
However, it still does not work because .+ is greedy, so use lazy .+? instead.
Final regex:
^(?P<datetime>.+?) host1 postfix.+?RCPT from (?P<srcDns>.+?)\[(?P<srcIp>[0-9\.]+)\]:.+?blocked using (?P<blkList>.+?);(?:.+?https?:\/{2}(?P<entryUrl>.+?);)?\s.+?\sto=\<(?P<destEm>.+?)>.+?$
https://regex101.com/r/QkmXWz (to see it in action)

How to write Regex expression to extract the content in brackets, after string and the first match?

I would like to use Regular expression to extract content between brackets, after some specific string and the 1st match.
Example text:
**-n --command PING being applied--:
Wed May 34 7:23:18 2010
[ZZZ_6323] Command [ping] failed with error [[TEZZZGH_IUE] [[EIJERTMMMMIJE_EIEJ] gdyugedyue Service [ABC] is not available in domain [DEF]. Check the content and review diejidjei. Service [ABC] Domain [DEF] ] did not ping back. It might be due to one of the following reasons:
=> Reason1
=> Reason3
=> Reason 4: deijdije djkeoidjeio.
info=4343 day=Mon year=2010*
I would like to extract the string between [] but after string Service and 1st match as Service could appear again later. In this case ABC
Could someone help me?
I am not able to combine these three conditionals.
Thanks
Assuming that you don't care about capturing square brackets inside the [ ] pair, by far the easiest way to do this is to use the following simple regex:
Service (\[[^\]]*\])
and extract only the 1st capturing group from the result using whatever regex functionality you're using. For example, using JS, you would write
string.match(/Service (\[[^\]]*\])/)[1]
to extract the first capturing group.
If you instead want a regex that will only capture the first occurrence, you can exploit the greedy nature of the * quantifier and change the regex to this:
Service (\[[^\]]*\]).*
Service \[([^\]]+)\]
will match Service [anything besides brackets] and capture anything besides brackets in group number 1. Since regex engines work left-to-right, the first match will be the leftmost match.
Test it live on regex101.com.
In PHP, you could do this (code snippet generated by RegexBuddy):
if (preg_match('/Service \[([^\]]+)\]/', $subject, $groups)) {
$result = $groups[1];
} else {
$result = "";
}
The definition of the group name How should I write it? I know that it can be like this: (?) but I dont know how to combine it with this part Service [([^]]+)] in a single way

How to extract file name from URL?

I have file names in a URL and want to strip out the preceding URL and filepath as well as the version that appears after the ?
Sample URL
Trying to use RegEx to pull, CaptialForecasting_Datasheet.pdf
The REGEXP_EXTRACT in Google Data Studio seems unique. Tried the suggestion but kept getting "could not parse" error. I was able to strip out the first part of the url with the following. Event Label is where I store URL of downloaded PDF.
The URL:
https://www.dudesolutions.com/Portals/0/Documents/HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033
REGEXP_EXTRACT( Event Label , 'Documents/([^&]+)' )
The result:
HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033
Now trying to determine how do I pull out everything after the? where the version data is, so as to extract just the Filename.pdf.
You could try:
[^\/]+(?=\?[^\/]*$)
This will match CaptialForecasting_Datasheet.pdf even if there is a question mark in the path. For example, the regex will succeed in both of these cases:
https://www.dudesolutions.com/somepath/CaptialForecasting_Datasheet.pdf?ver
https://www.dudesolutions.com/somepath?/CaptialForecasting_Datasheet.pdf?ver
Assuming that the name appears right after the last / and ends with the ?, the regular expression below will leave the name in group 1 where you can get it with \1 or whatever the tool that you are using supports.
.*\/(.*)\?
It basically says: get everything in between the last / and the first ? after, and put it in group 1.
Another regular expression that only matches the file name that you want but is more complex is:
(?<=\/)[^\/]*(?=\?)
It matches all non-/ characters, [^\/], immediately preceded by /, (?<=\/) and immediately followed by ?, (?=\?). The first parentheses is a positive lookbehind, and the second expression in parentheses is a positive lookahead.
This REGEXP_EXTRACT formula captures the characters a-zA-Z0-9_. between / and ?
REGEXP_EXTRACT(Event Label, "/([\\w\\.]+)\\?")
Google Data Studio Report to demonstrate.
Please try the following regex
[A-Za-z\_]*.pdf
I have tried it online at https://regexr.com/. Attaching the screenshot for reference
Please note that this only works for .pdf files
Following regex will extract file name with .pdf extension
(?:[^\/][\d\w\.]+)(?<=(?:.pdf))
You can add more extensions like this,
(?:[^\/][\d\w\.]+)(?<=(?:.pdf)|(?:.jpg))
Demo

RegEx to filter E-Mail Adresses from URLs in Google Analytics

I want to use a Google Analytics filter to remove email addresses from incoming URIs. I am using the custom advanced filter, filtering field A on a RegEx for the Request URI and replacing the respective part later. However, my RegEx does not seem to work correctly. It should find email addresses, not only if an '#' is used, but also if '(at)', '%40', or '$0040' are used to represent the '#'.
My latest RegEx version (see below) still allows '$0040' to go through undetected. Can someone advise me what to change?
^(.*)=([A-Z0-9._%+-]+[#|[\(at\)]|[\$0040]|[\%40]][A-Z0-9.-]+\.[A-Z]{2,4})(.*)$
I suggest using
([A-Za-z0-9._%+-]+(#|\(at\)|[$]0040|\%40)[A-Za-z0-9.-]+\.[A‌​-Za-z]{2,4})
See the regex demo.
If you need to match the whole string, you may keep that pattern enclosed with your ^(.*) and (.*)$.
Details
([A-Za-z0-9._%+-]+(#|\(at\)|[$]0040|\%40)[A-Za-z0-9.-]+\.[A‌​-Za-z]{2,4}) - Group 1 capturing
[A-Za-z0-9._%+-]+ - 1 or more ASCII letters/digits, ., _, %, +, or -
(#|\(at\)|[$]0040|\%40) - one of the alternatives: #, (at), $0040 or %40
[A-Za-z0-9.-]+ - 1 or more ASCII letters/digits, . or -
\. - a dot
[A‌​-Za-z]{2,4} - 2 to 4 ASCII letters.