How to parse a path with regex - optional fields

How to parse a path with regex - optional fields - regex

I am using the following regex: (example here: https://regex101.com/r/dVTUrM/1)
\/(?<field1>.{4})\/(?<field2>.*?)\/(?<field3>.*?)\/(?<field4>.*?)\/(?<field5>.*?)\/(?<field6>.*)
to parse the following text:
pyramid:/A49E/18DA-6FAB-4921-8AEB-45A07B162DA5/{E3646FA1-4652-45E9-885A-3756FC574057}/{F1864679-1D9D-4084-B38D-231D793AA15D}/9/abc.tif
giving the following result:
Group `field1` 9-13 `A49E`
Group `field2` 14-46 `18DA-6FAB-4921-8AEB-45A07B162DA5`
Group `field3` 47-85 `{E3646FA1-4652-45E9-885A-3756FC574057}`
Group `field4` 86-124 `{F1864679-1D9D-4084-B38D-231D793AA15D}`
Group `field5` 125-126 `9`
Group `field6` 127-134 `abc.tif`
But if field5 and field 6 are missing:
pyramid:/A49E/18DA-6FAB-4921-8AEB-45A07B162DA5/{E3646FA1-4652-45E9-885A-3756FC574057}/{F1864679-1D9D-4084-B38D-231D793AA15D}
I would like this to work and for field5 and field6 to be blank.
Is this possible by modifying the regex statement?
Note: only field6 may be missing as well.

Here you go:
(?x)^pyramid:
/(?P<field1>[^/]{4})
/(?P<field2>[^/]+)
/(?P<field3>[^/]+)
/(?P<field4>[^/]+)
(?:
/(?P<field5>[^/]+)
/(?P<field6>[^/]+)
)?
See a demo on regex101.com.
Or, in short (without the verbose flag):
^pyramid:/(?P<field1>[^/]{4})/(?P<field2>[^/]+)/(?P<field3>[^/]+)/(?P<field4>[^/]+)(?:/(?P<field5>[^/]+)/(?P<field6>[^/]+))?
Depending on the programming language / flavour used, you might use other delimiters like ~ so that you don't need to escape the forward slashes anymore. The (?: ... ) construct is a non capturing group which is made optional with ? to allow 4 or 6 (but not five!) fields.

Related

Regex a Block of Yaml Data

Im currently using regex101 to try and work out the following, id like to be able to capture a full items data for example name_template_2 and its associated description, define and write data
Here's my data model
templates:
name_template:
description: test_description
define: yes
write: true
name_template_2:
description: test_description2
define: false
write: true
I can capture the lines I need with the following
^[[:space:]][[:space:]][[:space:]][[:space:]].*
and
^[[:space:]][[:space:]]name_template_2:
but I am unable to join both patterns together to filter just the key and data related to name_template_2. The more I read online the more I understand it less. Has anyone achieved this before or is there a much more efficient way of doing this?

Using 2 capture groups:
^[^\S\n]{2}(name_template_2:)((?:\n[^\S\n]{4}\S.*)+)
Explanation
^ Start of string
[^\S\n]{2} Match 2 spaces without newlines
(name_template_2:) Match the string and capture in group 1
( Capture group 2
(?: Non capture group
\n Match a newline
[^\S\n]{4} Match 4 spaces without newlines
\S.* Match a non whitespace char and the rest of the line
)+ Repeat the non capture group 1 or more times
) Close group 2
Regex demo

I suggest using a structure-aware tool like yq to manipulate YAML and not using regular expressions.
#!/bin/bash
INPUT='
templates:
name_template:
description: test_description
define: yes
write: true
name_template_2:
description: test_description2
define: false
write: true
'
echo "$INPUT" | yq '{"name_template_2": .templates.name_template_2}'
Output
name_template_2:
description: test_description2
define: false
write: true

Regex pattern for mm/dd/yyyy and mmddyyyy in Scala

I have date in my .txt file which comes like either of the below:
mmddyyyy
OR
mm/dd/yyyy
Below is the regex which works fine for mm/dd/yyyy.
^02\/(?:[01]\d|2\d)\/(?:19|20)(?:0[048]|[13579][26]|[2468][048])|(?:0[13578]|10|12)\/(?:[0-2]\d|3[01])\/(?:19|20)\d{2}|(?:0[469]|11)\/(?:[0-2]\d|30)\/(?:19|20)\d{2}|02\/(?:[0-1]\d|2[0-8])\/(?:19|20)\d{2}$
However, unable to build the regex for mmddyyyy. I just want to understand is there any generic regex that would work for both cases?

Why use regex for this? Seems like a case of "Now you have two problems"
It would be more effective (and easier to understand) to use a DateTimeFormatter (assuming you are on the JVM and not using scala-js)
The format patterns support using [] to surround optional sections, such as the /, and the formatters inherently perform input validation so if you plug in a month or day that can't exist, it'll throw an exception.
import java.time.format.DateTimeFormatter
import java.time.LocalDate
val mdy = DateTimeFormatter.ofPattern("MM[/]dd[/]yyyy")
def parse(rawDate: String) = LocalDate.parse(rawDate, mdy)
scala> parse("12252022")
res7: java.time.LocalDate = 2022-12-25
scala> parse("12/25/2022")
res8: java.time.LocalDate = 2022-12-25
scala> parse("25/12/2022")
java.time.format.DateTimeParseException: Text '25/12/2022' could not be parsed: Invalid value for MonthOfYear (valid values 1 - 12): 25
scala> parse("abc123")
java.time.format.DateTimeParseException: Text 'abc123' could not be parsed at index 0

If you want to match all those variations with either 2 forward slashes or only digits, you can use a positive lookahead to assert either only digits or 2 forward slashes surrounded by digits.
Then in the pattern itself you can make matching the / optional.
Note that you don't have to escape the \/
^(?=\d+(?:/\d+/\d+)?$)(?:02/?(?:[01]\d|2\d)/?(?:19|20)(?:0[048]|[13579][26]|[2468][048])|(?:0[13578]|10|12)/?(?:[0-2]\d|3[01])/?(?:19|20)\d{2}|(?:0[469]|11)/?(?:[0-2]\d|30)/?(?:19|20)\d{2}|02/?(?:[0-1]\d|2[0-8])\?(?:19|20)\d{2})$
Regex demo
Another option is to write an alternation | matching the same pattern without the / in it.

First of all, there is a tiny shortcoming in your regex: the ^ anchor only applies to the first part of your regex, not to the other alternatives that are separated by |. Similarly the final $ applies only to the final alternative. You should put all alternatives in a non-capturing group, like ^(?: | | | )$
Then for the question itself, you could make the forward slash that follows the month optional and put it in a capture group. Then what comes between the day and the year could be a backreference to that capture group. So (\/?) and \1.
^(?:02(\/?)(?:[01]\d|2\d)\1(?:19|20)(?:0[048]|[13579][26]|[2468][048])|(?:0[13578]|10|12)(\/?)(?:[0-2]\d|3[01])\2(?:19|20)\d{2}|(?:0[469]|11)(\/?)(?:[0-2]\d|30)\3(?:19|20)\d{2}|02(\/?)(?:[0-1]\d|2[0-8])\4(?:19|20)\d{2})$

Grok pattern/Regex to parse string with nested parenthesis

I am trying to parse out several dynamic strings via Grok/Regex that exist in log messages between (). For example (SenderPartyName below):
2021/05/23 16:01:26.094 High Messaging.Message.Delivered Id(ci1653336085475.12327434#test_te) MessageId(EPIUM#1130754#84601671) SenderPartyName(Mcdonalds (CFH) Restaurant Glen) ReceiverPartyName(TEST_HERE_AGAIN) SenderRoutingId(08Mdsfkm853)
I would want to parse each key-value out from the string that follow the () format. Here is my grok pattern so far. I've been testing with https://grokdebug.herokuapp.com/
%{DATESTAMP:ts} %{WORD:loglevel} %{DATA:reason}\s ?(Id\(%{DATA:id}\))? ?(MessageId\(%{DATA:originalmessageid}\))? ?(SenderPartyName\((?<senderpartyname>.+?\).+?)\))? ?(ReceiverPartyName\(%{DATA:receiverpartyname}\))? ?(SenderRoutingId\(%{DATA:senderroutingid}\))?
This works when there are () within the nested string like this:
Mcdonalds (CFH) Restaurant Glen
...but it is dynamic and could appear without () like such: Mcdonalds Restaurant Glen
Trying to build regex to account for both scenarios with this portion of the grok pattern:
?(SenderPartyName\((?<senderpartyname>.+?\).+?)\))?
Currently this parses the non-parenthesis case like this though:
"senderpartyname": "Mcdonalds Restaurant Glen) ReceiverPartyName(TEST_HERE_AGAIN"
..where desired state is one of the following depending on the string:
"senderpartyname": "Mcdonalds Restaurant Glen"
or
"senderpartyname": "Mcdonalds (CFH) Restaurant Glen"

You can use
%{DATESTAMP:ts}\s+%{WORD:loglevel}\s+%{DATA:reason}\s+Id\(%{DATA:id}\)(?:\s+MessageId\(%{DATA:originalmessageid}\))?(?:\s+SenderPartyName(?<senderpartyname>\((?:[^()]++|\g<senderpartyname>)*\)))?(?:\s+ReceiverPartyName\(%{DATA:receiverpartyname}\))?(?:\s+SenderRoutingId\(%{DATA:senderroutingid}\))?
Note I revamped it so that all optional fields match one or more whitespaces and the fields as obligatory patterns, but they are made optional as a sequence, which makes matching more efficient.
The main thing changed is (?:\s+SenderPartyName(?<senderpartyname>\((?:[^()]++|\g<senderpartyname>)*\)))?, it matches
(?: - start of a non-capturing group:
\s+ - one or more whitespaces
SenderPartyName - a fixed word
(?<senderpartyname>\((?:[^()]++|\g<senderpartyname>)*\)) - Group "senderpartyname": ( (matched with \(), then zero or more repetitions of any char other than ( and ) or the Group "senderpartyname" pattern recursed ( see (?:[^()]++|\g<senderpartyname>)*) and then a ) char (matched with \))
)? - end of the group, one or zero repetitions (optional)

Hello together I have the following problem:
I have a long list of SQL queries which I would like to adapt to one of my changes. Finally, I have a renaming problem and I'm afraid I want to solve it more complicated than expected.
The query looks like this:
INSERT member (member, prename, name, street, postalcode, town, tel1, tel2, fax, bem, anrede, salutation, email, name2, name3, association, project) VALUES (2005, N'John', N'Doe', N'Street 4711', N'1234', N'Town', N'1234-5678', N'1234-5678', N'1234-5678', N'Leader', NULL, N'Dear Mr. Doe', N'a#b.com', N'This is the text i want to delete', N'Name2', N'Name3', NULL, NULL);
In the "Insert" there was another column which I removed (which I did simply via Notepad++ by typing the search term - "example, " - and replaced it with an empty field. Only the following entry in Values I can't get out using this method, because the text varies here. So far I have only worked with the text file in which I adjusted the list of queries.
So as you can see there is one more entry in Values than in the insertions (there was another column here, but it was removed by my change).
It is the entry after the email address. I would like to remove this including the comma (N'This is the text i want to delete',).
My idea was to form a group and say that the 14th digit after the comma should be removed. However, even after research I do not know how to realize this.
I thought it could look like this (tried in https://regex101.com/)
VALUES\s?\((,) something here
Is this even the right approach or is there another method? I only knew Regex to solve this problem, because of course the values look different here.
And how can I finally use the regex to get the queries adapted (because the queries are local to my computer and not yet included in the code).
Short summary:
Change the query from
VALUES (... test5, test6, test7 ...)
To
VALUES (... test5, test7 ...)

As per my comment, you could use find/replace, where you search for:
(\bVALUES +\((?:[^,]+,){13})[^,]+,
And replace with $1
See the online demo
( - Open 1st capture group.
\bValues +\( - Match a word-boundary, literally 'VALUES', followed by at least a single space and a literal open paranthesis.
(?: - Open non-capturing group.
[^,]+, - Match anything but a comma at least once followed by a comma.
){13} - Close non-capture group and repeat it 13 times.
) - Close 1st capture group.
[^,]+, - Match anything but a comma at least once followed by a comma.

You may use the following to remove / replace the value you need:
Find What: \bVALUES\s*\((\s*(?:N'[^']*'|\w+))(?:,(?1)){12}\K,(?1)
Replace With: (empty string, or whatever value you need)
See the regex demo
Details
\bVALUES - whole word VALUES
\s* - 0+ whitespaces
\( - a (
(\s*(?:N'[^']*'|\w+)) - Group 1: 0+ whitespaces and then either N' followed with any 0 or more chars other than ' and then a ', or 1+ word chars
(?:,(?1)){12} - twelve repetitions of , followed with the Group 1 pattern
\K - match reset operator that discards the text matched so far from the match memory buffer
, - a comma
(?1) - Group 1 pattern.
Settings screen:

Regular expression enforcing at least one in two groups

I have to parse a string using regular expressions in which at least one group in a set of two is required. I cannot figure out how to write this case.
To illustrate the problem we can think parsing this case:
String: aredhouse theball bluegreencar the
Match: ✓ ✓ ✓ ✗
Items are separated by spaces
Each item is composed by an article, a colour and an object defined by groups in the following expression (?P<article>the|a)?(?P<colour>(red|green|blue|yellow)*)(?P<object>car|ball|house)?\s*
An item can have an 'article' but must have a 'colour' or/and an 'object'.
Is there a way of making 'article' optional but require at least one 'colour' or 'object' using regular expressions?
Here is the coded Go version of this example, however I guess this is generic regexp question that applies to any language.

This is working with your testcases.
/
(?P<article>the|a)? # optional article
(?: # non-capture group, mandatory
(?P<colour>(?:red|green|blue|yellow)+) # 1 or more colors
(?P<object>car|ball|house) # followed by 1 object
| # OR
(?P<colour>(?:red|green|blue|yellow)+) # 1 or more colors
| # OR
(?P<object>car|ball|house) # 1 object
) # end group
/x
It can be reduced to:
/
(?P<article>the|a)? # optional article
(?: # non-capture group, mandatory
(?P<colour>(?:red|green|blue|yellow)+) # 1 or more colors
(?P<object>car|ball|house)? # followed by optional object
| # OR
(?P<object>car|ball|house) # 1 object
) # end group
/x

In regex, there's a few special signs that indicate the expected number of matches for a character or a group:
* - zero or more
+ - one or more
? - zero or one
These applied, your regex looks like this:
(?P<article>(the|a)?)(?P<colour>(red|green|blue|yellow)+)(?P<object>(car|ball|house)+)\s*
None or one article.
One or more colors.
Finally one or more objects.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js