Regex to parse certain fields of a log file

I have this log line:
blabla#gmail.com, Portal, qtp724408050-38, com.blabla.search.lib.SearchServiceImpl .logRequest, [Input request is lookupRequestDTO]
I need to find a regex that grabs that email, then matches lookupRequestDTO ignoring everything in between.
Currently my regex grabs the whole line:
([\w-\.]+)#gmail.com,(.+)lookupRequestDTO
How do I avoid matching everything between the email and lookupRequestDTO?

What about this?
([^,]+).*?lookupRequestDTO
[^,]+ matches everything up to the first comma, so group 1 should give you the email.
This assumes lookupRequestDTO is a fixed criterion for your search. If it is a variable you want to retrieve, you could use this:
([^,]+).*?\[Input request is ([^\]]+)
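As a rough check, here is a minimal Python sketch that runs both patterns against the log line from the question. The variable names are only illustrative, and the email is kept exactly as it is written in the question.

```python
import re

# Log line copied from the question.
line = (
    "blabla#gmail.com, Portal, qtp724408050-38, "
    "com.blabla.search.lib.SearchServiceImpl .logRequest, "
    "[Input request is lookupRequestDTO]"
)

# Variant 1: lookupRequestDTO is a fixed search criterion.
fixed = re.search(r"([^,]+).*?lookupRequestDTO", line)

# Variant 2: the request name is itself captured as group 2.
variable = re.search(r"([^,]+).*?\[Input request is ([^\]]+)", line)

email = fixed.group(1) if fixed else None
fields = variable.groups() if variable else None
print(email)
print(fields)
```

The lazy .*? only skips the text in between; nothing from that stretch ends up in a capture group.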

Assuming you're using PCRE (PHP, Perl, etc.; this should also work in JavaScript):
([\w-\.]+?#gmail\.com),(?:.+)(lookupRequestDTO)
Capture groups 1 and 2 will then contain:
MATCH 1
blabla#gmail.com
lookupRequestDTO
Working example: http://regex101.com/r/yW9eU3

Related

regex extract username from 2 types of url

I'm currently using this regex: (?<=\/movie\/)[^\/]+, but it only matches the username in the second URL. I know I could branch in my code (if the URL contains /movie/, use this regex, otherwise use another one), but I'm trying to do this directly in the regex.
http://example.com:80/username/token/30000
http://example.com:80/movie/username/token/30000.mp4

To complement Tensibai's answer: if the URL has no port, you can use the last dot of the host name to anchor the start of your regex:
\.[^\/\.]+\/(?:movie\/)?([^\/]+)
(demo)

You can use something like this to make movie/ optional and capture the username in a named group (live example):
\d[/](?:movie\/)?(?<username>[^/]+)[/]
It uses \d[/] (a digit followed by a slash) to anchor the start of the match just after the port.
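Both suggestions can be tried on the two sample URLs with a short Python sketch. Note that Python spells named groups (?P<name>...), and that the slashes are written without escapes here, which is equivalent. The names after_port and after_dot are made up for this example.

```python
import re

urls = [
    "http://example.com:80/username/token/30000",
    "http://example.com:80/movie/username/token/30000.mp4",
]

# Anchored on the digit that ends the port; movie/ is optional.
after_port = re.compile(r"\d/(?:movie/)?(?P<username>[^/]+)/")

# Anchored on the last dot of the host name instead of the port.
after_dot = re.compile(r"\.[^/.]+/(?:movie/)?([^/]+)")

by_port = [after_port.search(u).group("username") for u in urls]
by_dot = [after_dot.search(u).group(1) for u in urls]
print(by_port, by_dot)
```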

Consolidated RegEx to parse syslog data

Goal
I am trying to craft a RegEx that will parse out specific data from various syslog entries that contain subtle differences in logged content. While I am able to accomplish my goal using multiple RegEx statements, if possible, I would like to combine these statements into a single consolidated RegEx.
Log entries
The main issue I'm having is that some log entries have a URL that needs to be parsed to a named group and other log entries do not have any URL. Examples of these two different log entries are provided below.
Entry with URL
Nov 3 11:33:04 host1 postfix/smtpd[12812]: NOQUEUE: reject: RCPT from 178.red-83-59-180.dynamicip.rima-tde.net[83.59.180.178]: 554 5.7.1 Service unavailable; Client host [83.59.180.178] blocked using b.barracudacentral.org; http://www.barracudanetworks.com/reputation/?pr=1&ip=83.59.180.178; from=<lmclapp68#newmail.spamcop.net> to=<user1#example.com> proto=ESMTP helo=<178.red-83-59-180.dynamicip.rima-tde.net>
Entry without URL
Nov 2 16:01:25 host1 postfix/smtpd[31667]: NOQUEUE: reject_warning: RCPT from mail1.sendersrv.com[185.3.229.125]: 554 5.7.1 Service unavailable; Client host [185.3.229.125] blocked using bl.spamcop.net; from=<bounces+rL59wUXq98_inBrG#sendersrv.com> to=<user1#example.com> proto=ESMTP helo=<mail1.sendersrv.com>
RegEx statements
In the RegEx statements that follow, the first two are what I currently use for each of the previous log messages. The third RegEx is my attempt at consolidating these both into a single RegEx that will parse data from either log message. My attempt was to use a conditional statement that would basically check for the existence of http(s) and if found, then to parse the URL to a named group. If http(s) was not found, then it would parse out everything until the next RegEx token.
The issue is that when I test the RegEx against a log entry that has a URL, the RegEx does not seem to find http(s) despite this token being set as optional (i.e. using the ? quantifier). However, if I remove the ? quantifier, it does find http(s) and then parses the URL as desired. However, without the quantifier, the RegEx does not work with log entries that do not have a URL.
Parse entries with URL
^(?P<datetime>.+) host1 postfix.+RCPT from (?P<srcDns>.+)\[(?P<srcIp>[0-9\.]+)\]:.+blocked using (?P<blkList>.+);.+https?:\/{2}(?P<entryUrl>.+);\s.+\sto=\<(?P<destEm>.+)>.+$
Parse entries without URL
^(?P<datetime>.+) host1 postfix.+RCPT from (?P<srcDns>.+)\[(?P<srcIp>[0-9\.]+)\]:.+blocked using (?P<blkList>.+);\s.+\sto=\<(?P<destEm>.+)>.+$
Attempt at consolidating RegEx
^(?P<datetime>.+) host1 postfix.+RCPT from (?P<srcDns>.+)\[(?P<srcIp>[0-9\.]+)\]:.+blocked using (?P<blkList>.+)(?<=[a-z]);.+(https?:\/{2})?(?(5)(?P<entryUrl>.+)|.+)to=\<(?P<destEm>.+)>.+$
I'm sure the issue is my misunderstanding of how conditional statements and the ? quantifier work.

Looking at your patterns, the email address after to= sits between < and >, but the question's formatting hides them.
Parts of your pattern such as .+ first match all the way to the end of the string, then backtrack to try to match the rest of the pattern.
You can make the pattern somewhat faster by making the parts whose format you know more specific.
For the datetime, you can make the pattern match the specified format instead of .+ using ^(?P<datetime>[A-Z][a-z]{2}\s+\d{1,2}\s* \d{1,2}:\d{1,2}:\d{1,2})
For blkList and entryUrl you can use a negated character class that matches any character except ; as in (?P<blkList>[^;]+) and (?P<entryUrl>[^;]+)
For destEm you can exclude the angle brackets and whitespace: (?P<destEm>[^<>\s]+)
To match the URL, instead of using a conditional you can make the whole group optional using ?
For example
^(?P<datetime>[A-Z][a-z]{2}\s+\d{1,2}\s* \d{1,2}:\d{1,2}:\d{1,2}) host1 postfix\b.*? RCPT from (?P<srcDns>.*?)\[(?P<srcIp>[0-9\.]+)\]:.*? blocked using (?P<blkList>[^;]+);(?:.+?https?:\/\/(?P<entryUrl>[^;]+);)?\s.*? to=[^<]*<(?P<destEm>[^<>\s]+)>
See a regex demo.
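To see the consolidated pattern at work, the following Python sketch applies it to both log entries from the question. The pattern is split across several string literals for readability only; SYSLOG, with_url and without_url are names chosen for this example.

```python
import re

# Pattern from the answer above, split across lines for readability only.
SYSLOG = re.compile(
    r"^(?P<datetime>[A-Z][a-z]{2}\s+\d{1,2}\s* \d{1,2}:\d{1,2}:\d{1,2})"
    r" host1 postfix\b.*? RCPT from (?P<srcDns>.*?)\[(?P<srcIp>[0-9\.]+)\]:"
    r".*? blocked using (?P<blkList>[^;]+);"
    r"(?:.+?https?:\/\/(?P<entryUrl>[^;]+);)?"
    r"\s.*? to=[^<]*<(?P<destEm>[^<>\s]+)>"
)

with_url = (
    "Nov 3 11:33:04 host1 postfix/smtpd[12812]: NOQUEUE: reject: RCPT from "
    "178.red-83-59-180.dynamicip.rima-tde.net[83.59.180.178]: 554 5.7.1 "
    "Service unavailable; Client host [83.59.180.178] blocked using "
    "b.barracudacentral.org; "
    "http://www.barracudanetworks.com/reputation/?pr=1&ip=83.59.180.178; "
    "from=<lmclapp68#newmail.spamcop.net> to=<user1#example.com> "
    "proto=ESMTP helo=<178.red-83-59-180.dynamicip.rima-tde.net>"
)
without_url = (
    "Nov 2 16:01:25 host1 postfix/smtpd[31667]: NOQUEUE: reject_warning: "
    "RCPT from mail1.sendersrv.com[185.3.229.125]: 554 5.7.1 "
    "Service unavailable; Client host [185.3.229.125] blocked using "
    "bl.spamcop.net; from=<bounces+rL59wUXq98_inBrG#sendersrv.com> "
    "to=<user1#example.com> proto=ESMTP helo=<mail1.sendersrv.com>"
)

# One dict of named groups per entry; entryUrl is None when no URL is logged.
parsed = [SYSLOG.match(entry).groupdict() for entry in (with_url, without_url)]
for fields in parsed:
    print(fields)
```

When the optional group does not take part in the match, Python reports entryUrl as None, so the caller can tell the two entry types apart without a second regex.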

Have you tried testing your regex on a site like regex101?
to=\<(?P<destEm>.+)> doesn't seem to match your examples. You should either remove <> or replace to with helo. Also make the quantifier in blkList lazy, otherwise it may capture too much text.
You can then make the URL part optional with ? and it should work in both cases:
^(?P<datetime>.+) host1 postfix.+RCPT from (?P<srcDns>.+)\[(?P<srcIp>[0-9\.]+)\]:.+blocked using (?P<blkList>.+?);(.+https?:\/{2}(?P<entryUrl>.+);\s)?.+\sto=(?P<destEm>.+?)\s.*$

One approach is to replace .+https?:\/{2}(?P<entryUrl>.+); in the first regex with (?:.+https?:\/{2}(?P<entryUrl>.+);)? where ?: marks the group as non-capturing and the trailing ? makes it optional.
However, that alone still does not work because .+ is greedy, so use the lazy .+? instead.
Final regex:
^(?P<datetime>.+?) host1 postfix.+?RCPT from (?P<srcDns>.+?)\[(?P<srcIp>[0-9\.]+)\]:.+?blocked using (?P<blkList>.+?);(?:.+?https?:\/{2}(?P<entryUrl>.+?);)?\s.+?\sto=\<(?P<destEm>.+?)>.+?$
https://regex101.com/r/QkmXWz (to see it in action)
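The effect of greedy versus lazy quantifiers around the optional group can be illustrated with a cut-down Python sketch. It uses only the tail of the question's URL entry and simplified fragments of the regex, so it is an illustration of the behaviour rather than the full pattern.

```python
import re

# Tail of the question's log entry that contains a URL.
tail = (
    "blocked using b.barracudacentral.org; "
    "http://www.barracudanetworks.com/reputation/?pr=1&ip=83.59.180.178; "
    "from=<lmclapp68#newmail.spamcop.net> to=<user1#example.com> "
    "proto=ESMTP helo=<178.red-83-59-180.dynamicip.rima-tde.net>"
)

# Simplified fragments of the answer's regex; only the quantifiers differ.
greedy = re.compile(
    r"using (?P<blkList>.+);(?:.+https?://(?P<entryUrl>.+);)?\s.+\sto=<(?P<destEm>.+)>"
)
lazy = re.compile(
    r"using (?P<blkList>.+?);(?:.+?https?://(?P<entryUrl>.+?);)?\s.+?\sto=<(?P<destEm>.+?)>"
)

greedy_fields = greedy.search(tail).groupdict()
lazy_fields = lazy.search(tail).groupdict()
print(greedy_fields["entryUrl"])  # the greedy blkList swallows the URL
print(lazy_fields["entryUrl"])
```

With the greedy version, blkList runs up to the last semicolon, so the optional group has nothing left to match and is skipped.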

How to write Regex expression to extract the content in brackets, after string and the first match?

I would like to use a regular expression to extract the content between brackets that follows a specific string, taking only the first match.
Example text:
**-n --command PING being applied--:
Wed May 34 7:23:18 2010
[ZZZ_6323] Command [ping] failed with error [[TEZZZGH_IUE] [[EIJERTMMMMIJE_EIEJ] gdyugedyue Service [ABC] is not available in domain [DEF]. Check the content and review diejidjei. Service [ABC] Domain [DEF] ] did not ping back. It might be due to one of the following reasons:
=> Reason1
=> Reason3
=> Reason 4: deijdije djkeoidjeio.
info=4343 day=Mon year=2010*
I would like to extract the string between [] that follows the string Service, taking only the first match, since Service could appear again later. In this case that is ABC.
Could someone help me?
I am not able to combine these three conditions.
Thanks

Assuming that you don't care about capturing square brackets inside the [ ] pair, by far the easiest way to do this is to use the following simple regex:
Service (\[[^\]]*\])
and extract only the 1st capturing group from the result using whatever regex functionality you're using. For example, using JS, you would write
string.match(/Service (\[[^\]]*\])/)[1]
to extract the first capturing group.
If you instead want the regex itself to stop after the first occurrence (for example in a global search), you can append .* so that the greedy quantifier consumes the rest of the line:
Service (\[[^\]]*\]).*

Service \[([^\]]+)\]
will match Service [anything besides brackets] and capture anything besides brackets in group number 1. Since regex engines work left-to-right, the first match will be the leftmost match.
Test it live on regex101.com.
In PHP, you could do this (code snippet generated by RegexBuddy):
if (preg_match('/Service \[([^\]]+)\]/', $subject, $groups)) {
    $result = $groups[1];
} else {
    $result = "";
}
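For comparison, a minimal Python sketch of the same first-match extraction, using an excerpt of the sample text from the question:

```python
import re

# Relevant line of the sample text from the question (excerpt).
text = (
    "[ZZZ_6323] Command [ping] failed with error [[TEZZZGH_IUE] "
    "[[EIJERTMMMMIJE_EIEJ] gdyugedyue Service [ABC] is not available in "
    "domain [DEF]. Check the content and review diejidjei. "
    "Service [ABC] Domain [DEF] ] did not ping back."
)

# re.search scans left to right and stops at the first (leftmost) match.
match = re.search(r"Service \[([^\]]+)\]", text)
result = match.group(1) if match else ""
print(result)
```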

How should I write the definition of the group name? I know that it can be like this: (?) but I don't know how to combine it with the Service \[([^\]]+)\] part in a single expression.
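One way to combine a group name with that pattern, sketched in Python: the name goes right after the opening parenthesis of the capturing group. Python writes this as (?P<name>...), while engines such as PCRE and JavaScript also accept (?<name>...). The group name service is a hypothetical choice for this example.

```python
import re

# "service" is a hypothetical group name chosen for this sketch.
pattern = re.compile(r"Service \[(?P<service>[^\]]+)\]")

match = pattern.search("gdyugedyue Service [ABC] is not available in domain [DEF].")
service = match.group("service") if match else ""
print(service)
```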

How to extract file name from URL?

I have file names in a URL and want to strip out the preceding URL and filepath as well as the version that appears after the ?
Sample URL
I'm trying to use a regex to pull out CaptialForecasting_Datasheet.pdf.
REGEXP_EXTRACT in Google Data Studio seems to be unusual. I tried the suggestion but kept getting a "could not parse" error. I was able to strip out the first part of the URL with the following; Event Label is the field where I store the URL of the downloaded PDF.
The URL:
https://www.dudesolutions.com/Portals/0/Documents/HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033
REGEXP_EXTRACT( Event Label , 'Documents/([^&]+)' )
The result:
HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033
Now I'm trying to work out how to drop everything from the ? onward, where the version data is, so as to extract just Filename.pdf.

You could try:
[^\/]+(?=\?[^\/]*$)
This will match CaptialForecasting_Datasheet.pdf even if there is a question mark in the path. For example, the regex will succeed in both of these cases:
https://www.dudesolutions.com/somepath/CaptialForecasting_Datasheet.pdf?ver
https://www.dudesolutions.com/somepath?/CaptialForecasting_Datasheet.pdf?ver
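A small Python sketch can confirm the claim for both URLs. The slash is written unescaped here, which is equivalent, and FILE_NAME is a name chosen for this example.

```python
import re

# Pattern from the answer: non-slash characters followed by "?" and then
# no further "/" up to the end of the string.
FILE_NAME = re.compile(r"[^/]+(?=\?[^/]*$)")

urls = [
    "https://www.dudesolutions.com/somepath/CaptialForecasting_Datasheet.pdf?ver",
    "https://www.dudesolutions.com/somepath?/CaptialForecasting_Datasheet.pdf?ver",
]
names = [FILE_NAME.search(u).group(0) for u in urls]
print(names)
```

The lookahead rejects the question mark inside the path because a slash still follows it before the end of the string.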

Assuming that the name appears right after the last / and ends at the ?, the regular expression below leaves the name in group 1, where you can retrieve it with \1 or whatever your tool supports.
.*\/(.*)\?
It basically says: take everything between the last / and the ? that follows it, and put it in group 1. Because both .* are greedy, a URL containing several ? characters would be matched up to the last one.
Another regular expression, which matches only the file name you want but is more complex, is:
(?<=\/)[^\/]*(?=\?)
It matches a run of non-/ characters, [^\/]*, immediately preceded by /, (?<=\/), and immediately followed by ?, (?=\?). The first parenthesized expression is a positive lookbehind and the second is a positive lookahead.
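Both variants can be compared on the URL from the question with a short Python sketch; the only difference is whether the file name arrives in group 1 or as the whole match.

```python
import re

url = (
    "https://www.dudesolutions.com/Portals/0/Documents/"
    "HC_Brochure_Digital.pdf?ver=2018-03-18-110927-033"
)

# Capture-group variant: the file name ends up in group 1.
grouped = re.search(r".*/(.*)\?", url).group(1)

# Lookaround variant: the whole match is the file name.
lookaround = re.search(r"(?<=/)[^/]*(?=\?)", url).group(0)
print(grouped, lookaround)
```

The lookaround form is handy in tools that return only the full match, while the group form fits tools that expose capture groups.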

This REGEXP_EXTRACT formula captures the characters a-zA-Z0-9_. between / and ?
REGEXP_EXTRACT(Event Label, "/([\\w\\.]+)\\?")
Google Data Studio Report to demonstrate.

Please try the following regex:
[A-Za-z_]*\.pdf
I have tried it online at https://regexr.com/.
Please note that this only works for .pdf files whose names consist of letters and underscores.

The following regex will extract a file name with the .pdf extension:
(?:[^\/][\d\w\.]+)(?<=(?:\.pdf))
You can add more extensions like this:
(?:[^\/][\d\w\.]+)(?<=(?:\.pdf)|(?:\.jpg))
Demo

RegEx to filter E-Mail Addresses from URLs in Google Analytics

I want to use a Google Analytics filter to remove email addresses from incoming URIs. I am using the custom advanced filter, filtering field A on a RegEx for the Request URI and replacing the respective part later. However, my RegEx does not seem to work correctly. It should find email addresses, not only if an '#' is used, but also if '(at)', '%40', or '$0040' are used to represent the '#'.
My latest RegEx version (see below) still allows '$0040' to go through undetected. Can someone advise me what to change?
^(.*)=([A-Z0-9._%+-]+[#|[\(at\)]|[\$0040]|[\%40]][A-Z0-9.-]+\.[A-Z]{2,4})(.*)$

I suggest using
([A-Za-z0-9._%+-]+(#|\(at\)|[$]0040|\%40)[A-Za-z0-9.-]+\.[A-Za-z]{2,4})
See the regex demo.
If you need to match the whole string, you may keep that pattern enclosed with your ^(.*) and (.*)$.
Details
([A-Za-z0-9._%+-]+(#|\(at\)|[$]0040|\%40)[A-Za-z0-9.-]+\.[A-Za-z]{2,4}) - Group 1 capturing:
  [A-Za-z0-9._%+-]+ - 1 or more ASCII letters/digits, ., _, %, +, or -
  (#|\(at\)|[$]0040|\%40) - one of the alternatives: #, (at), $0040 or %40
  [A-Za-z0-9.-]+ - 1 or more ASCII letters/digits, . or -
  \. - a dot
  [A-Za-z]{2,4} - 2 to 4 ASCII letters.
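As a quick check outside Google Analytics, this Python sketch applies the suggested pattern to four made-up request URIs, one per separator variant, and replaces each address with a placeholder. The URIs and the [redacted] placeholder are assumptions for illustration; the pattern itself is copied from the answer, including the # separator as written there.

```python
import re

# Pattern from the answer; "#" is kept exactly as it appears there.
EMAIL = re.compile(
    r"([A-Za-z0-9._%+-]+(#|\(at\)|[$]0040|\%40)[A-Za-z0-9.-]+\.[A-Za-z]{2,4})"
)

# Hypothetical request URIs, one per separator variant.
uris = [
    "/signup?email=jane#example.org",
    "/signup?email=jane(at)example.org",
    "/signup?email=jane$0040example.org",
    "/signup?email=jane%40example.org",
]
cleaned = [EMAIL.sub("[redacted]", uri) for uri in uris]
print(cleaned)
```

Writing the separators as a parenthesized alternation, rather than inside square brackets, is what lets multi-character variants such as $0040 match as a unit.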
