I am using a data analysis package that exposes a Regex function for string parsing. I am trying to parse a response from a website that is in the format...
key1=val1&key2=val2&key3=val3 ...
The keys and values may be percent-encoded, but the current return values are not: they are tokens and other alphanumeric info.
I understand this data to be application/x-www-form-urlencoded, also known as query string format.
The goal is to extract the value for a given key when the order of the keys cannot be relied upon. For example, I might know that one of the keys I should receive is "token", so what regex pattern can I use to extract the value for the key "token"? I have searched for this but cannot find anything that does what I need; if this is a duplicate question, apologies in advance.
In Alteryx, you may use Tokenize with a regex containing a capturing group around the part you need to extract:
The Tokenize Method allows you to specify a regular expression to match on and that part of the string is parsed into separate columns (or rows). When using the Tokenize method, you want to match to the whole token, and if you have a marked group, only that part is returned.
The last clause of that description ("if you have a marked group, only that part is returned") confirms that when the pattern contains a capturing group, only the captured part is returned rather than the whole match.
Thus, you may use
(?:^|[?&])token=([^&]*)
where token can be replaced with whichever key you want to extract the value for.
Details
(?:^|[?&]) - the start of a string, ? or & (if the string is just a plain key-value pair string, you may omit ? and use (?:^|&) or (?<![^&]))
token - the key
= - an equal sign
([^&]*) - Group 1 (this will get extracted): 0 or more chars other than & (if you do not want to extract empty values, replace * with + quantifier).
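Outside Alteryx, the same pattern can be tried in a few lines of Python. This is only a sketch: extract_value is a made-up helper name, the sample string is invented, and re.escape is there in case the key contains regex metacharacters. For input that may be percent-encoded, the standard library's urllib.parse.parse_qs decodes the pairs for you and is shown for comparison.

```python
import re
from urllib.parse import parse_qs


def extract_value(query, key):
    """Return the value for `key` in a key=value&key=value string, or None."""
    # (?:^|[?&]) makes sure we match a whole key, not the tail of a longer one.
    pattern = r"(?:^|[?&])" + re.escape(key) + r"=([^&]*)"
    match = re.search(pattern, query)
    return match.group(1) if match else None


# Invented sample data in the format from the question.
sample = "key1=val1&mytoken=zzz&token=abc123&key3=val3"

print(extract_value(sample, "token"))    # abc123
print(extract_value(sample, "missing"))  # None

# Standard-library alternative that also decodes percent-encoding.
print(parse_qs(sample)["token"][0])      # abc123
```

Note that the mytoken key in the sample is not picked up when asking for token, which is what the (?:^|[?&]) prefix is for.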
I'm trying to parse the output of the "display interface brief" Comware switch command to convert it to a CSV file using RegEx. This command is printed using the following format:
Interface Link Speed Duplex Type PVID Description
BAGG51 UP 4G(a) F(a) T 1
FGE1/0/42 DOWN auto A T 1 ### LIVRE ###
GE6/0/20 UP 100M(a) F(a) A 1 LIVRE (MGMT - [WAN8-P8]
It seems quite challenging to me because no matter which regex I try, it doesn't properly handle output such as "DOWN auto" and "100M(a) F(a)", where only one space separates the fields. I also couldn't find a way to properly handle the last field, which can contain one or more spaces: most of the regexes I tried create a separate capture group for each space-separated word instead of treating the text as a single field.
I have tried countless ways to parse it, and I couldn't find much about parsing non-uniform columns on the Internet or in the Stack Overflow community.
I need to parse it into the following format, with 7 capture groups per line, respecting the end of line:
BAGG51;UP;4G(a);F(a);T;1
FGE1/0/42;DOWN;auto;A;T;1;### LIVRE ###
GE6/0/20;UP;100M(a);F(a);A;1;LIVRE (MGMT - [WAN8-P8]
The most successful regex I have found so far is ^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+), replaced with $1;$2;$3;$4;$5;$6;$7 in Notepad++, but it doesn't properly handle the "Description" field, which can be empty.
The following pattern seems to be working here:
^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)(?:[ ]+(.*))?
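This follows your pattern with six mandatory capture groups, followed by an optional seventh capture group. The (?:[ ]+(.*))? at the end of the pattern matches one or more spaces followed by the rest of the line, so a description containing spaces ends up in a single group, and the whole part is optional for lines with no description. Note that this pattern should be used in multiline mode so that ^ matches at the start of every line.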
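If you would rather script the conversion than use Notepad++, here is a rough Python sketch of the same idea. The to_csv_line helper and the LINE_RE name are made up for illustration, and the sample rows are the ones from the question. The pattern is applied one line at a time, so \s+ cannot run over into the next line.

```python
import re

# Six whitespace-separated fields, then an optional free-text description.
LINE_RE = re.compile(
    r"^(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)(?:[ ]+(.*))?"
)


def to_csv_line(line, sep=";"):
    """Convert one table row to a separated line, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    # A missing description comes back as None; emit it as an empty field.
    return sep.join(field or "" for field in match.groups())


output = """\
BAGG51 UP 4G(a) F(a) T 1
FGE1/0/42 DOWN auto A T 1 ### LIVRE ###
GE6/0/20 UP 100M(a) F(a) A 1 LIVRE (MGMT - [WAN8-P8]"""

for row in output.splitlines():
    print(to_csv_line(row))
```

This sketch always writes seven columns, so a row without a description ends with a trailing separator; strip it if you need the exact output shown in the question.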
I need to extract date_from and date_to from the following log field value.
date_from=11-04-2020&date_to=01-04-2021&page_size=1000&page=1 in AWS cloudwatch
So far I have tried the parse keyword with the regex \d{2}-\d{2}-\d{4}, and it does not work.
What I ultimately want to do is extract these two dates and get the time difference between them in days.
Here's the query I tried,
filter #logStream like /<log-stream>/ and process like /rest-call/ | parse parameters '\d{2}-\d{2}-\d{4}' as #date | display #date
You can capture both date_from and date_to into two named capturing groups:
parse parameters /date_from=(?<date_from>\d{2}-\d{2}-\d{4}).*?date_to=(?<date_to>\d{2}-\d{2}-\d{4})/ | display date_from, date_to
If the date format can be any, you may replace the \d{2}-\d{2}-\d{4} specific pattern with a more generic [^&]+:
/date_from=(?<date_from>[^&]+).*?date_to=(?<date_to>[^&]+)/
Note that .*? matches zero or more chars other than line break chars, as few as possible. It is needed so that the regex engine can "travel" from the first capture to the second one: the engine processes the string from left to right and cannot "skip" parts of it, it has to consume them.
For anyone looking: AWS does not currently have any date-time functions to convert a date (e.g. mm/dd/yyyy) to a timestamp. Therefore, I exported the results of the above query to a CSV and did the timestamp calculations in Google Sheets.
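If the calculation ends up outside CloudWatch anyway, it can also be scripted. Below is a minimal Python sketch, not a CloudWatch feature: days_between is a made-up helper, and the day-month-year order of the dates is an assumption, since the log sample does not say which order is used. Python's re module writes named groups as (?P<name>...) instead of (?<name>...).

```python
import re
from datetime import datetime

# Python's re module spells named groups (?P<name>...) rather than (?<name>...).
DATES_RE = re.compile(
    r"date_from=(?P<date_from>[^&]+).*?date_to=(?P<date_to>[^&]+)"
)


def days_between(parameters, date_format="%d-%m-%Y"):
    """Return date_to minus date_from in whole days, or None if either is missing."""
    match = DATES_RE.search(parameters)
    if match is None:
        return None
    # ASSUMPTION: day-month-year order; pass "%m-%d-%Y" if the log uses month first.
    start = datetime.strptime(match.group("date_from"), date_format)
    end = datetime.strptime(match.group("date_to"), date_format)
    return (end - start).days


params = "date_from=11-04-2020&date_to=01-04-2021&page_size=1000&page=1"
print(days_between(params))              # 355 (11 Apr 2020 to 1 Apr 2021)
print(days_between(params, "%m-%d-%Y"))  # 61 (4 Nov 2020 to 4 Jan 2021)
```

The two calls show how much the result depends on the date order, so it is worth confirming the format before trusting the number.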
From the following example pattern, I want to select the first 3 entries in the line.
Say:
timestamp
hostname
the first word after the hostname
Example pattern:
2017-04-24T09:20:01.687387+00:00 aabvabcw74.def.co.uk hostd-probe: lacp: DEBUG]:147, Recv signal 15, LACP service is about to stop
2017-04-24T09:20:01.687387+00:00 aacdefabcw74.def.co.uk hostd-probe: lacp: DEBUG]:147, Recv signal 15, LACP service is about to stop
I have used the following regexes and they worked fine.
REGEX 1 - ^(?:[^\s]*\s){1}([^\s]*) - to select the timestamp and hostname.
REGEX 2 - ^(?:[^\s]*\s){2}([^\s]\w+) - to select the word after the hostname.
2017-04-24T09:20:01.687387+00:00 hostd probing is done Fdm: sslThumbprint>95:43:64:71:A3:60:D8:17:C8:6F:68:83:92:CE:E4:3B:53:4E:1D:AD10.199.6.5a2:0e:09:01:0a:00a2:0e:09:01:0b:01/vmfs/volumes/b01f388c-aaa4889f/vmfs/volumes/6ad2d8d7-86746df14435.5.03568722host-619286aabvabcs16.def.co.uk
But the above log has created a problem: as it is not in standard syslog format, "hostd" was picked as the hostname.
I would like a regex that selects only the logs that have a timestamp as the first entry and a hostname as the second entry (it always ends with .def.co.uk), and, if both conditions are satisfied, selects the third entry.
How can I achieve this?
^(\S+[^\s])\s(\w+\.def\.co\.uk)\s(.+?)\s
Breakdown:
(\S+[^\s])\s captures the date and timestamp, and leaves out the space after it
(\w+\.def\.co\.uk)\s captures the hostname only if it has the form something.def.co.uk, and leaves the space out again (the dots are escaped so that they match literal dots only)
(.+?) non-greedily captures the first word (assuming a word contains no spaces)
EDIT:
If you also want the date and the time to be in their own capture groups, then it should be like this:
^(\S+)(T\S+)\s(\w+\.def\.co\.uk)\s(.+?)\s
Hope this helps!
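To check the behaviour quickly, here is a small Python sketch using the first pattern with the dots in def.co.uk escaped. parse_line is a made-up helper name; the first test line is from the question, and the second is a shortened version of the non-standard log line.

```python
import re

# Timestamp, hostname ending in .def.co.uk (dots escaped), then the next word.
SYSLOG_RE = re.compile(r"^(\S+[^\s])\s(\w+\.def\.co\.uk)\s(.+?)\s")


def parse_line(line):
    """Return (timestamp, hostname, first_word), or None for non-conforming lines."""
    match = SYSLOG_RE.match(line)
    return match.groups() if match else None


good = ("2017-04-24T09:20:01.687387+00:00 aabvabcw74.def.co.uk hostd-probe: "
        "lacp: DEBUG]:147, Recv signal 15, LACP service is about to stop")
# Shortened version of the non-standard line: no hostname in second position.
bad = "2017-04-24T09:20:01.687387+00:00 hostd probing is done"

print(parse_line(good))
# ('2017-04-24T09:20:01.687387+00:00', 'aabvabcw74.def.co.uk', 'hostd-probe:')
print(parse_line(bad))  # None
```

Because the pattern is anchored with ^, a hostname that only appears later in the line does not count.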
I have these sample data. (Current Balance is numeric field and has some bad records which need to be replaced)
Accno,Cust_id,gender,DOB,Current_balance
0008647447654709299,87128110,M,29/02/1960,184126.23
0008650447626799299,143500723,F,4/18/1967,165198.85
0008651447674209299,479941323,M,5/5/1979,NULL
0008653447693589299,687746622,M,18-08-1981,#20
0008654447606469299,890134223,M,18-08-1983,0
0008655447659179299,684451923,F,10/9/1982,142.25
0008658447686789299,57470921,F,25-02-1978,458518.25
0008669447629759299,57470925,M,23-01-1981,xx
I need to validate data in Pentaho and want the output like below :
Accno,Cust_id,gender,DOB,Current_balance
0008647447654709299,87128110,M,29/02/1960,184126.23
0008650447626799299,143500723,F,4/18/1967,165198.85
0008651447674209299,479941323,M,5/5/1979,
0008653447693589299,687746622,M,18-08-1981,
0008654447606469299,890134223,M,18-08-1983,0
0008655447659179299,684451923,F,10/9/1982,142.25
0008658447686789299,57470921,F,25-02-1978,458518.25
0008669447629759299,57470925,M,23-01-1981,
That means the validator passes the good rows through and replaces the bad values with null.
Can anyone suggest how I can do this?
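I'm not sure about Pentaho, but to point you in the right direction, you can use the following regex:
,(?=[^,]+$)(?!\d+(\.\d{2})?$).*$
In multi-line mode. The (?=[^,]+$) lookahead restricts the match to the last field of a line, and the (?!\d+(\.\d{2})?$) lookahead skips fields that are valid numbers: digits optionally followed by a two-digit decimal part, up to the end of the line. The decimal part has to be optional and the lookahead anchored with $, otherwise a valid integer balance such as 0 would be blanked as well.
If you replace all matches with ',' you should have the desired output. Note that a header row whose last field is not numeric (Current_balance here) matches too, so apply the replacement to the data rows only.
RegexPlanet translates this into the following Java regex (looks like you just need to escape the backslashes):
,(?=[^,]+$)(?!\\d+(\\.\\d{2})?$).*$
So in Java I guess you'd use something like:
str = str.replaceAll("(?m),(?=[^,]+$)(?!\\d+(\\.\\d{2})?$).*$", ",");
The (?m) at the start is the multi-line flag mentioned above.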
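For a quick check outside Pentaho, here is a rough Python sketch of the replacement. It uses the pattern with an optional, end-anchored decimal part, so that a plain 0 counts as a valid balance. blank_bad_balances is a made-up helper name, the rows are a subset of the sample data from the question, and the header row is deliberately left alone because its last field is not numeric.

```python
import re

# Last field of a row that is NOT digits with an optional two-digit decimal part.
BAD_BALANCE_RE = re.compile(r",(?=[^,]+$)(?!\d+(?:\.\d{2})?$).*$")


def blank_bad_balances(csv_text):
    """Blank out invalid Current_balance values, leaving the header row untouched."""
    header, *rows = csv_text.splitlines()
    # Each row is handled on its own, so no multi-line flag is needed here.
    cleaned = [BAD_BALANCE_RE.sub(",", row) for row in rows]
    return "\n".join([header] + cleaned)


data = """\
Accno,Cust_id,gender,DOB,Current_balance
0008647447654709299,87128110,M,29/02/1960,184126.23
0008651447674209299,479941323,M,5/5/1979,NULL
0008653447693589299,687746622,M,18-08-1981,#20
0008654447606469299,890134223,M,18-08-1983,0
0008669447629759299,57470925,M,23-01-1981,xx"""

print(blank_bad_balances(data))
```

Rows ending in NULL, #20 and xx come back with an empty last field, while 184126.23 and 0 are kept as they are.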