Splunk - regex extract fields from source - regex

I am trying to extract the job name and region from a Splunk source using regex.
Below is the format of my sample source:
/home/app/abc/logs/20200817/job_DAILY_HR_REPORT_44414_USA_log
With the pattern below, I am able to extract the job name:
(?<logdir>\/[\W\w]+\/[\W\w]+\/)(?<date>[^\/]+)\/job_(?<jobname>.+)_\d+
Here is the match so far:
Full match 0-53 /home/app/abc/logs/20200817/job_DAILY_HR_REPORT_44414
Group `logdir` 0-19 /home/app/abc/logs/
Group `date` 19-27 20200817
Group `jobname` 32-47 DAILY_HR_REPORT
I also need USA (the region) from the source. Can you please suggest how?
The region will always appear after the number field (44414), which can vary in number of digits.
Ex: 123, 1234, 56789
Thank you in advance.

You could make the pattern a bit more specific about what you allow to match, as [\W\w]+ and .+ will cause more backtracking to fit the rest of the pattern.
Then for the region you can add a named group at the end, (?<region>[^\W_]+), matching one or more word characters except an underscore.
The full pattern, followed by its parts:
(?<logdir>\/(?:[^\/]+\/)*)(?<date>(?:19|20)\d{2}(?:0?[1-9]|1[012])(?:0[1-9]|[12]\d|3[01]))\/job_(?<jobname>\w+)_\d+_(?<region>[^\W_]+)_log
(?<logdir> Group logdir
\/(?:[^\/]+\/)* match / and optionally repeat any char except / followed by matching the / again
) Close group
(?<date> Group date
(?:19|20)\d{2} Match a year starting with 19 or 20
(?:0?[1-9]|1[012]) Match a month
(?:0[1-9]|[12]\d|3[01]) Match a day
) Close group
\/job_ Match /job_
(?<jobname>\w+) Group jobname, match 1+ word chars
_\d+_ Match 1+ digits between underscores
(?<region>[^\W_]+) Group region Match 1+ occurrences of a word char except _
_log Match literally
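As a quick check outside Splunk, the pattern above can be tried in Python. This is only a sketch: Python's re module writes named groups as (?P<name>...) instead of (?<name>...), and the names SOURCE_PATTERN, source and fields are made up for the illustration. In Splunk itself the original (?<name>...) syntax would typically be used as-is, for example inside a rex command on the source field.

```python
import re

# Same pattern as in the answer, with Python's (?P<name>...) group syntax.
SOURCE_PATTERN = re.compile(
    r"(?P<logdir>/(?:[^/]+/)*)"
    r"(?P<date>(?:19|20)\d{2}(?:0?[1-9]|1[012])(?:0[1-9]|[12]\d|3[01]))"
    r"/job_(?P<jobname>\w+)_\d+_(?P<region>[^\W_]+)_log"
)

# Sample source path taken from the question.
source = "/home/app/abc/logs/20200817/job_DAILY_HR_REPORT_44414_USA_log"

match = SOURCE_PATTERN.search(source)
fields = match.groupdict() if match else {}
print(fields)
```

For the sample path this should give logdir /home/app/abc/logs/, date 20200817, jobname DAILY_HR_REPORT and region USA.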

Related

Match specific letter from group N Regex

I have the following log message:
Aug 25 03:07:19 localhost.localdomainASM:unit_hostname="bigip1",management_ip_address="192.168.41.200",management_ip_address_2="N/A",http_class_name="/Common/log_to_elk_policy",web_application_name="/Common/log_to_elk_policy",policy_name="/Common/log_to_elk_policy",policy_apply_date="2020-08-10 06:50:39",violations="HTTP protocol compliance failed",support_id="5666478231990524056",request_status="blocked",response_code="0",ip_client="10.43.0.86",route_domain="0",method="GET",protocol="HTTP",query_string="name='",x_forwarded_for_header_value="N/A",sig_ids="N/A",sig_names="N/A",date_time="2020-08-25 03:07:19",severity="Eror",attack_type="Non-browser Client,HTTP Parser Attack",geo_location="N/A",ip_address_intelligence="N/A",username="N/A",session_id="0",src_port="39348",dest_port="80",dest_ip="10.43.0.201",sub_violations="HTTP protocol compliance failed:Bad HTTP version",virus_name="N/A",violation_rating="5",websocket_direction="N/A",websocket_message_type="N/A",device_id="N/A",staged_sig_ids="",staged_sig_names="",threat_campaign_names="N/A",staged_threat_campaign_names="N/A",blocking_exception_reason="N/A",captcha_result="not_received",microservice="N/A",tap_event_id="N/A",tap_vid="N/A",vs_name="/Common/adv_waf_vs",sig_cves="N/A",staged_sig_cves="N/A",uri="/random",fragment="",request="GET /random?name=' or 1 = 1' HTTP/1.1\r\n",response="Response logging disabled"
And I have the following RegEx:
request="(?<Flag1>.*?)"
I am now trying to match some text from inside the previous group named "Flag1". The new match that I want to capture as Flag2 is /random?name=' or 1 = 1'.
How can I match the needed text from another matched group, by number or by name, without inserting the new group inside the targeted group like this:
request="(?<Flag1>\w+\s+(?<Flag2>.*?)\s+HTTP.*?)"
https://regex101.com/r/EcBv7p/1
Thanks.
You can use
request="(?<Flag1>[A-Z]+\s+(?<Flag2>\/\S+='[^']*')[^"]*)"
See the regex demo.
Details:
(?<Flag1> - Flag1 group:
[A-Z]+ - one or more uppercase ASCII letters
\s+ - one or more whitespaces
(?<Flag2>\/\S+='[^']*') - Group Flag2: /, one or more non-whitespace chars, =', zero or more chars other than ', and then a ' char
[^"]* - zero or more chars other than "
) - end of Flag1 group.
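A small Python sketch of the nested-group pattern above, assuming a shortened version of the log line (only the fields around request are kept) and Python's (?P<name>...) spelling for named groups. The variable names are illustrative only.

```python
import re

# Shortened copy of the log line from the question; \r\n are literal backslash sequences.
log_line = r'''uri="/random",fragment="",request="GET /random?name=' or 1 = 1' HTTP/1.1\r\n",response="Response logging disabled"'''

# Flag2 is nested inside Flag1, so one match yields both values.
pattern = re.compile(r'''request="(?P<Flag1>[A-Z]+\s+(?P<Flag2>/\S+='[^']*')[^"]*)"''')

match = pattern.search(log_line)
flag1 = match.group("Flag1") if match else None
flag2 = match.group("Flag2") if match else None
print(flag1)
print(flag2)
```

Here Flag1 holds the whole request value and Flag2 only the path with its quoted query value.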
If I understand you correctly, you want to match whatever string a previous group has matched, right?
In that case you can use a backreference \N, where N is the group number; in this case \1 matches the same text that your first capture group matched.
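To illustrate what a backreference does, here is a minimal Python sketch with a made-up sample sentence. Note that \1 only re-matches the exact text captured by group 1 later in the input; it does not extract a part of that group.

```python
import re

# \1 must match the same text that group 1 captured, so this finds a doubled word.
repeated_word = re.compile(r"\b(\w+) \1\b")

sample = "this is is a test"
match = repeated_word.search(sample)
repeated = match.group(1) if match else None
print(repeated)
```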

Problem with a regex to extract informations of a filename

I have to extract information from filenames.
There is an optional name, which can contain a-z , 0-9 , - , _
There is an optional separator, which can be _ , . , / , \ , * , # or a space
There is always a number before the extension
There is always an extension
For now, this is what I have:
^((?<name>[a-z0-9-_]+)(?<separator>[_\.#\/\\*\- ])){0,1}(?<number>\d+)\.(?<extension>[a-z]{3})$
I have to match all of these:
tot0_tutu_00001.tif
tot0.0001.tif
tot0#00001.tif
tot0/0001.tif
tot0\00001.tif
tot0*0001.tif
00001.tif
tot0-tutu_0001.tif
tot0-tutu-00001.tif
tot0-tutu 000001.tif
tot0-tutu000001.tif
That regex works for all cases except the last one:
tot0-tutu000001.tif
I cannot figure out how to solve this.
Here is a sandbox
https://regex101.com/r/wZP6RI/1
The filenames do not start with a separator, so you could make the whole name part optional. If it is present, make sure it starts with a-z0-9 and optionally match the separator.
For the digits part, you can use a negative lookbehind to start matching digits where there is no digit directly before it.
^(?<name>(?:[a-z0-9]+(?:[-_][a-z0-9]+)*(?<separator>[_\.#\/\\*\- ]?))?)(?<!\d)(?<number>\d+)\.(?<extension>[a-z]{3})$
^ Start of string
(?<name> Named group name
(?: Non capture group
[a-z0-9]+ Match 1+ occurrences of a range a-z or 0-9
(?:[-_][a-z0-9]+)* Optionally repeat the previous with either - or _ prepended
(?<separator> Named group separator
[_\.#\/\\*\- ]? Optionally match any of the listed chars
) Close group separator
)? Close non capture group and make it optional
) Close group name
(?<!\d)(?<number>\d+) Named group number, match 1+ digits, asserting no digit directly to the left
\.(?<extension>[a-z]{3}) Match . and named group extension matching 3 times a char in range a-z
$ End of string
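The pattern can be checked against every filename from the question with a short Python sketch. It assumes Python's (?P<name>...) syntax for named groups; FILENAME, samples and results are illustrative names.

```python
import re

# Same pattern as above; the separator group stays nested inside the name group.
FILENAME = re.compile(
    r"^(?P<name>(?:[a-z0-9]+(?:[-_][a-z0-9]+)*(?P<separator>[_.#/\\*\- ]?))?)"
    r"(?<!\d)(?P<number>\d+)\.(?P<extension>[a-z]{3})$"
)

# All filenames listed in the question.
samples = [
    "tot0_tutu_00001.tif",
    "tot0.0001.tif",
    "tot0#00001.tif",
    "tot0/0001.tif",
    r"tot0\00001.tif",
    "tot0*0001.tif",
    "00001.tif",
    "tot0-tutu_0001.tif",
    "tot0-tutu-00001.tif",
    "tot0-tutu 000001.tif",
    "tot0-tutu000001.tif",
]

results = {s: FILENAME.match(s) for s in samples}
for s, m in results.items():
    print(s, "->", m.groupdict() if m else None)
```

Because separator sits inside name, the name group includes the trailing separator when one is present (for example tot0_tutu_ for the first sample).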
According to the criteria you stated, it feels like it could be simplified:
^[a-z0-9_-]*[_.\/\\*# ]?\d+\.[a-z]{3}$
See https://regex101.com/r/z5eRwM/1.
Feel free to regroup as you need.
Side note: be careful with - in character classes. It needs to be at the very beginning or at the very end (or be escaped as \-) to be treated as a literal - and not as the range separator, as in [a-z]:
[-az]: matches -, a and z
[a-z]: matches chars between a and z (- excluded)
[az-]: matches a, z and -
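The three cases above can be verified with a few lines of Python. The sample text is made up, and the escaped form [a\-z] is included as a fourth variant.

```python
import re

text = "a-b-z"

leading = re.findall(r"[-az]", text)    # hyphen first: literal '-', 'a', 'z'
ranged = re.findall(r"[a-z]", text)     # hyphen in the middle: range a..z, '-' excluded
trailing = re.findall(r"[az-]", text)   # hyphen last: 'a', 'z', literal '-'
escaped = re.findall(r"[a\-z]", text)   # escaped hyphen: literal '-' in any position

print(leading, ranged, trailing, escaped)
```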

Comma separated prefix list with commas inside

I'm trying to match a comma-separated list of prefixed values, where the values can themselves contain commas.
So far I have managed to match all occurrences that don't contain a comma.
Sample String (With NL for visualization - original string doesn't have NL):
field01=Value 1,
field02=Value 2,
field03=<xml value>,
field04=127.0.0.1,
field05=User-Agent: curl/7.28.0\r\nHost: example.org\r\nAccept: */*,
field06=Location, Resource,
field07={Item 1},{Item 2}
My current regex is this unoptimized piece:
(?'fields'(field[0-9]{2,3})=?([\s\w\d_<>.:="*?\-\/\\(){}<>'#]+))([^,](?&fields))*
Does anyone have a clue how to solve this?
EDIT:
The first pattern is near to my expected result.
This is an anonymized full example of the string:
asm01=Predictable Resource Location,Information Leakage,asm02=N/A,asm04=Uncategorized,asm08=2021-02-15 09:18:16,asm09=127.0.0.1,asm10=443,asm11=N/A,asm15=,asm16=DE,asm17=User-Agent: curl/7.29.0\r\nHost: dev.example.com\r\nAccept: */*\r\nX-Forwarded-For: 127.0.0.1\r\n\r\n,asm18=/Common/_www.example.com_live_v1,asm20=127.0.0.1,asm22=,asm27=HEAD,asm34=/Common/_www.example.com_live_v1,asm35=HTTPS,asm39=blocked,asm41=0,asm42=3,asm43=0,asm44=Error,asm46=200000028,200100015,asm47=Unix hidden (dot-file) access,.htaccess access,asm48={Unix/Linux Signatures},{Apache/NCSA HTTP Server Signatures},asm50=40622,asm52=200000028,asm53=Unix hidden (dot-file) access,asm54={Unix/Linux Signatures},asm55=,asm61=,asm62=,asm63=8985143867830069446,asm64=example-waf.example.com,asm65=/.htaccess,asm67=Attack signature detected,asm68=<?xml version='1.0' encoding='UTF-8'?><BAD_MSG><violation_masks><block>13020008202d8a-f803000000000000</block><alarm>417020008202f8a-f803000000000000</alarm><learn>13000008202f8a-f800000000000000</learn><staging>200000-0</staging></violation_masks><request-violations><violation><viol_index>42</viol_index><viol_name>VIOL_ATTACK_SIGNATURE</viol_name><context>request</context><sig_data><sig_id>200000028</sig_id><blocking_mask>7</blocking_mask><kw_data><buffer>Ly5odGFjY2Vzcw==</buffer><offset>0</offset><length>2</length></kw_data></sig_data><sig_data><sig_id>200000028</sig_id><blocking_mask>4</blocking_mask><kw_data><buffer>Ly5odGFjY2Vzcw==</buffer><offset>0</offset><length>3</length></kw_data></sig_data><sig_data><sig_id>200100015</sig_id><blocking_mask>7</blocking_mask><kw_data><buffer>Ly5odGFjY2Vzcw==</buffer><offset>1</offset><length>9</length></kw_data></sig_data></violation></request-violations></BAD_MSG>,asm69=5,asm71=/Common/_dev.example.com_SSL,asm75=127.0.0.1,asm100=,asm101=HEAD /.htaccess HTTP/1.1\r\nUser-Agent: curl/7.29.0\r\nHost: dev.example.com\r\nAccept: */*\r\nX-Forwarded-For: 127.0.0.1\r\n\r\n#015
The pattern does not work because the fields group expects the literal string field.
You are trying to repeat the named group fields, but the example string uses the prefix asm and does not contain the string field.
Note that [^,] matches any char except a comma. You can also omit the capture group inside the named group fields, as it already is a group, and \d is redundant in the character class because \w also matches digits.
With 2 capture groups:
\b(asm[0-9]+)=(.*?)(?=,asm[0-9]+=|$)
\b A word boundary
(asm[0-9]+) Capture group 1, match asm and 1+ digits
= Match literally
(.*?) Capture group 2, match any char, as few as possible
(?= Positive lookahead, assert what is at the right is
,asm[0-9]+= Match ,asm followed by 1+ digits and =
| Or
$ Assert the end of the string
) Close lookahead
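A minimal Python sketch of the two-group pattern, run on a shortened, made-up string in the same key=value shape as the question's example. It shows that values containing commas and empty values are both handled; FIELD, line and fields are illustrative names.

```python
import re

# Shortened sample: comma inside values (asm01, asm46) and empty values (asm15, asm100).
line = (
    "asm01=Predictable Resource Location,Information Leakage,"
    "asm02=N/A,asm15=,asm46=200000028,200100015,asm100="
)

# A value ends right before the next ",asmNN=" key or at the end of the string.
FIELD = re.compile(r"\b(asm[0-9]+)=(.*?)(?=,asm[0-9]+=|$)")

# findall returns (key, value) tuples because the pattern has two capture groups.
fields = dict(FIELD.findall(line))
for key, value in fields.items():
    print(key, "=>", repr(value))
```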
A simple solution would be (see regexr.com/5mg1b):
/((asm\d{2,3})=(.*?))(?=,asm|$)/g
Match groupings will be:
group #1 - asm01=Predictable Resource Location,Information Leakage
group #2 - asm01
group #3 - Predictable Resource Location,Information Leakage
Conditions:
This will match everything including empty values
The key here is to make sure that each match is delimited by either a comma and your field descriptor, or an end of string. A look ahead will be handy here: (?=,asm|$).

RegEx - Return pattern to the right of a text string for URL

I'm looking to return the URL string to the right of a specific set of text using RegEx:
URL:
www.websitename/countrycode/websitename/contact/thank-you/whitepaper/countrycode/whitepapername.pdf
What I would like to just return:
/whitepapername.pdf
I've tried using ^\w+"countrycode"(\w.*) but the match won't recognize countrycode.
In Google Data Studio, I want to create a new field to remove the beginning of the URL using the REGEXP_REPLACE function.
Ideally using:
REGEXP_REPLACE(Page,......)
The REGEXP_REPLACE function below does the trick, capturing all (.*) the characters after the last countrycode, where Page represents the respective field:
REGEXP_REPLACE(Page, ".*(countrycode)(.*)$", "\\2")
Alternatively - Adapting the RegEx by The fourth bird to Google Data Studio:
REGEXP_REPLACE(Page, "^.*/countrycode(/[^/]+\\.\\w+)$", "\\1")
You could use a capturing group and replace with group 1. You could match /countrycode literally, or use a pattern for two a-z chars, an underscore, and two more a-z chars, like /[a-z]{2}_[a-z]{2}
In the replacement, use group 1: \\1
^.*/countrycode(/[^/]+\.\w+)$
Or using a country code pattern from the comments:
^.*/[a-z]{2}_[a-z]{2}(/[^/]+\.\w+)$
The second pattern in parts
^ Start of string
.*/ Match until the last occurrence of a forward slash
[a-z]{2}_[a-z]{2} Match the country code part, an underscore between 2 times 2 chars a-z
( Capture group 1
/[^/]+ Match a forward slash, then match 1+ occurrences of any char except / using a negated character class
\.\w+ Match a dot and 1+ word chars
) Close group
$ End of string
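Both patterns can be tried outside Data Studio with Python's re.sub, which plays the role of REGEXP_REPLACE in this sketch. The second URL, with an en_us segment on example.com, is a hypothetical value invented for the illustration.

```python
import re

# URL from the question, with the literal placeholder "countrycode".
page = (
    "www.websitename/countrycode/websitename/contact/thank-you/"
    "whitepaper/countrycode/whitepapername.pdf"
)
literal = re.sub(r"^.*/countrycode(/[^/]+\.\w+)$", r"\1", page)

# Hypothetical URL using a locale-style country code such as en_us.
localized_page = "www.example.com/en_us/site/whitepaper/en_us/whitepapername.pdf"
localized = re.sub(r"^.*/[a-z]{2}_[a-z]{2}(/[^/]+\.\w+)$", r"\1", localized_page)

print(literal)
print(localized)
```

In both cases the greedy .* runs to the last country code segment, so only the final /whitepapername.pdf part is kept.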

Regex pattern in vbscript to match Text with multiple line

I have a long string with serial numbers (Sl. no.) in it. I want to split the string into sentences at each serial number.
Sample text:
1. Able to click new button and proceed to ONB-002 dialogue.
2. - Partner connection name **(text field empty)(MANDATORY)**
- GS1 company prefix **(text field empty)(MANDATORY)**
I tried using a VBScript regex to match the pattern, but it matches only the first line of the string (1. text), not the second one.
^\d+\.\s(-?).*[\r\n].[\r\n\*+]*.*|^\d+\.\s(-?).*[\r\n]
And when splitting the string, for serial number 2 I also want to get the line below it, which I am having difficulty with.
Please assist me.
Set regex = CreateObject("VBScript.RegExp")
With regex
.Pattern = "^\d+\.\s(-?).*[\r\n].[\r\n\*+]*.*|^\d+\.\s(-?).*[\r\n]"
.Global = True
End With
Set matches = regex.Execute(txt)
My expectation: I am looking for a regex pattern that matches
1. Able to click new button and proceed to ONB-002 dialogue.
&
2. - Partner connection name **(text field empty)(MANDATORY)**
- GS1 company prefix **(text field empty)(MANDATORY)**
as separate sentence or group.
If I am not mistaken, to get the 2 separate parts including the line after you could use:
^\d+\..*(?:\r?\n(?!\d+\.).*)*
Explanation
^ Start of the line (this requires multiline mode; with VBScript.RegExp, set .MultiLine = True)
\d+\. Match 1+ digits followed by a dot
.* Match any character except a newline 0+ times
(?: Non capturing group
\r?\n(?!\d+\.).* Match a newline, use a negative lookahead to assert that what is on the right is not 1+ digits followed by a dot, then match the rest of the line
)* Close non capturing group and repeat 0+ times
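The pattern can be tried in Python as a rough sketch of the same idea. It is not VBScript: re.MULTILINE plays the role of multiline mode so that ^ matches at each line start, and txt, ITEM and items are illustrative names.

```python
import re

# Sample text from the question: item 2 continues on a following line.
txt = (
    "1. Able to click new button and proceed to ONB-002 dialogue.\n"
    "2. - Partner connection name **(text field empty)(MANDATORY)**\n"
    "- GS1 company prefix **(text field empty)(MANDATORY)**"
)

# A numbered line, plus any following lines that do not start a new number.
ITEM = re.compile(r"^\d+\..*(?:\r?\n(?!\d+\.).*)*", re.MULTILINE)

items = ITEM.findall(txt)
for item in items:
    print(repr(item))
```

With the sample text this yields two items, and the second one includes the continuation line that starts with "- GS1".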