String mask in Perl for known format variations [closed] - regex

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I am trying to pull information from a list of folders that are organised in a logical manner but have optional parts.
Below is my folder structure with optional fields noted inside <> :
artist - album_nr. album_title <(type)> <(issue_info)> (year) [quality]
So some examples of directories would be named like this
Emperor - 03. Reverence (EP) (1997) [flac]
Emperor - 05b. IX Equilibrium (reissue 2007) (1999) [cue-flac]
Exodus - 01a. Bonded By Blood (1985) [cue-flac]
Exodus - 01b. Bonded By Blood (remaster 2008) (1985) [cue-flac]
Exodus - 03.Tempo of the Damned (EP) (remaster 2008) (1985) [cue-flac]
I need a regex that will correctly pull the relevant parts into an array for further processing, but I am struggling, mostly because of the optional fields.
At most, the array will contain 7 pieces of information and 5 pieces of information at the very least.
If anyone can help me I will be extremely grateful and it will save me a lot of manual effort.

Using extended notation for legibility:
my $re = qr/
([^-]+?) # artist
\h* #
- # literal '-'
\h* #
([0-9]+[a-z]?) # album number
\. # literal '.'
\h* #
([^(]+?) # album title
\h* #
(?:\(([^)]+)\))? # type (optional)
\h* #
(?:\(([^)]+)\))? # issue info (optional)
\h* #
\(([^)]+)\) # year
\h* #
\[(.+)\] # quality
/x;
Note that this regex always returns seven values (on match) because there are seven captures.
The "trick" to the optional parts you said you were having trouble with is to
navigate among capturing, non-capturing, and literal parentheses. Those portions of the regex break down as follows:
(?: # begin non-capturing grouping (for '?' quantifier at the end)
\( # literal '('
( # begin capture
[^)]+ # any character other than ')', one or more times
) # end capture
\) # literal ')'
) # end non-capturing grouping
? # zero or one quantifier (make everything in group optional)
Edit: In the comments, Jerry correctly points out that there's potential ambiguity about what matched when only one of the optional fields (type or issue info) is present in the data. This can be fixed by making the regex less permissive (at the risk of failing to match some data -- always check whether or not a match was successful). This works for the sample data you provided:
(?:\((\w+\h+[0-9]{4}+)\))? # issue info (optional)
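Note that the type group is tried first, so it needs similar tightening (for example, excluding digits), or it will still capture "reissue 2007" itself. If we restrict the optional groups, it also seems prudent to make the year more restrictive as well.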
\(([0-9]{4})\) # year
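As a rough illustration, here is the stricter variant sketched in Python rather than Perl. Several things are assumptions and not part of the answer: the helper name parse_folder is made up, [ \t] stands in for \h (which Python's re module does not support), and the type group is limited to digit-free text so that "(reissue 2007)" falls through to the issue-info group.

```python
import re

# Python's re module has no \h, so [ \t] is used for horizontal whitespace.
FOLDER_RE = re.compile(
    r"""
    ([^-]+?)                            # artist
    [ \t]* - [ \t]*                     # literal '-' with optional spacing
    ([0-9]+[a-z]?)                      # album number
    \. [ \t]*                           # literal '.'
    ([^(]+?)                            # album title
    [ \t]*
    (?: \( ([^)0-9]+) \) )?             # type (optional; assumed digit-free)
    [ \t]*
    (?: \( (\w+[ \t]+[0-9]{4}) \) )?    # issue info (optional)
    [ \t]*
    \( ([0-9]{4}) \)                    # year
    [ \t]*
    \[ (.+) \]                          # quality
    """,
    re.VERBOSE,
)


def parse_folder(name):
    """Return the seven captured fields as a tuple, or None on no match."""
    m = FOLDER_RE.fullmatch(name)
    return m.groups() if m else None


for folder in (
    "Emperor - 03. Reverence (EP) (1997) [flac]",
    "Emperor - 05b. IX Equilibrium (reissue 2007) (1999) [cue-flac]",
):
    print(parse_folder(folder))
```

An optional group that did not take part in the match comes back as None, so the tuple always has seven entries, which mirrors the "always seven values" behaviour described above.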

Related

Customised regex in ruby [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 2 years ago.
I want to check if a string states Above 60. Example:
'>60', '> 60', 'above60', 'above 60', 'Above60', 'Above 60', OR more than 1 space in between (> and 60), (above and 60), (Above and 60).
How can I write a regex to validate a string that starts with either (>, above or Above) then any number of spaces and ends with 60?
It's very straightforward:
a string that starts with either >, above or Above
/^(>|above|Above)/
then any number of spaces
/^(>|above|Above)\s*/
and ends with 60
/^(>|above|Above)\s*60$/
Note that in Ruby, ^ and $ match beginning and end of a line, not string. You might want to change that to \A and \z respectively. And instead of specifying both cases explicitly (above / Above), you could append an i to make the regexp case-insensitive, i.e. /^(>|above)\s*60$/i.
As always, there's more than one way to get the desired result. I can recommend Rubular for fiddling with regular expressions: https://rubular.com/r/EEHBSOB3PK2Djk
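The line-versus-string anchoring point can be tried out with a small sketch. It is written in Python, not Ruby, so two things are assumptions of the illustration: re.MULTILINE is switched on to imitate the way Ruby's ^ and $ always work per line, and Python's \Z is used where Ruby would use \z. The sample text is made up.

```python
import re

text = "note\n> 60\nend"

# With re.MULTILINE, ^ and $ match at line boundaries (Ruby's default behaviour).
line_anchored = re.compile(r"^(>|above)\s*60$", re.IGNORECASE | re.MULTILINE)

# \A and \Z match only at the very start and very end of the string
# (Python's \Z corresponds to Ruby's \z).
string_anchored = re.compile(r"\A(>|above)\s*60\Z", re.IGNORECASE)

print(bool(line_anchored.search(text)))          # the second line matches on its own
print(bool(string_anchored.search(text)))        # the string as a whole does not
print(bool(string_anchored.search("Above 60")))  # a clean value still passes
```

For validation, the string-anchored form is usually the safer choice, because a valid-looking line buried in longer input is rejected.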
r = /\A(?:>|[aA]bove) *60\z/
['>60', '> 60', 'above60', 'above 60', 'Above60', 'Above 60'].all? do |s|
s.match?(r)
end
#=> true
[' >60', '> 600', ' above60', 'above 600', 'Above60 '].any? do |s|
s.match?(r)
end
#=> false
We can write the regular expression in free-spacing mode to make it self-documenting.
/
\A # match beginning of string
(?: # begin a non-capture group
> # match '>'
| # or
[aA] # match 'a' or 'A'
bove # match 'bove'
) # end non-capture group
[ ]* # match 0+ spaces
60 # match '60'
\z # match end of string
/x # invoke free-spacing regex definition mode
Notice that in the above I placed the space in a character class ([ ]). Alternatively I could have escaped the space, used [[:space:]] or one of a few other options. Without protecting the space in one of these ways it would be stripped out (when using free-spacing mode) before the regex is parsed.
When spaces are to be reflected in a regex I use space characters rather than whitespace characters (\s), mainly because the latter also match end-of-line terminators (\n and possibly \r) which can result in problems.
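A minimal sketch of both points, in Python rather than Ruby: re.VERBOSE is Python's free-spacing mode and also needs the space protected in a character class, and \Z stands in for Ruby's \z. The names ABOVE_60, WITH_WHITESPACE and accepts are made up for the illustration.

```python
import re

# Free-spacing version; the literal space survives because it sits in a class.
ABOVE_60 = re.compile(
    r"""
    \A                    # beginning of string
    (?: > | [aA]bove )    # '>' or 'above' / 'Above'
    [ ]*                  # zero or more literal spaces
    60                    # the number itself
    \Z                    # end of string
    """,
    re.VERBOSE,
)

# Same pattern, but with \s, which also accepts tabs and line terminators.
WITH_WHITESPACE = re.compile(r"\A(?:>|[aA]bove)\s*60\Z")


def accepts(pattern, s):
    """True if the pattern matches the whole string s."""
    return pattern.search(s) is not None


print(accepts(ABOVE_60, "Above  60"))       # plain spaces are fine
print(accepts(ABOVE_60, ">\n60"))           # a newline is rejected
print(accepts(WITH_WHITESPACE, ">\n60"))    # \s lets the newline through
```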
This should work:
/(>|above|more than){0,1} *60\+*/i
The i at the end of the regex is for case insensitivity.
If you need additional prefixes, add them after more than, each separated by a pipe |.

Filter lines based on range of value, using regex [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 6 years ago.
What regex will work to match only certain rows which have a value range (e.g. 20-25 days) in the text raw data (sample below):
[product-1][arbitrary-text][expiry-17days]
[product-2][arbitrary-text][expiry-22days]
[product-3][arbitrary-text][expiry-29days]
[product-4][arbitrary-text][expiry-25days]
[product-5][arbitrary-text][expiry-10days]
[product-6][arbitrary-text][expiry-12days]
[product-7][arbitrary-text][expiry-20days]
[product-8][arbitrary-text][expiry-26days]
'product' and 'expiry' text is static (doesn't change), while their corresponding values change.
'arbitrary-text' is also different for each line/product. So in the sample above, the regex should only match/return lines which have the expiry between 20-25 days.
Expected regex matches:
[product-2][arbitrary-text][expiry-22days]
[product-4][arbitrary-text][expiry-25days]
[product-7][arbitrary-text][expiry-20days]
Thanks.
Please check the following regex:
/(.*-2[0-5]days\]$)/gm
( # start capturing group
.* # matches any character (except newline)
- # matches hyphen character literally
2 # matches digit 2 literally
[0-5] # matches any digit between 0 to 5
days # matches the character days literally
\] # matches the character ] literally
$ # assert position at end of a line
) # end of the capturing group
Do note the use of -2[0-5]days to make sure that it doesn't match:
[product-7][arbitrary-text][expiry-222days] # won't match this
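To see the pattern at work on the sample data, here is a small sketch in Python (re.MULTILINE plays the role of the m flag, and findall that of g). The capturing group is dropped because the whole match is what we want, and the last line of RAW is a made-up extra row for the three-digit case.

```python
import re

RAW = """\
[product-1][arbitrary-text][expiry-17days]
[product-2][arbitrary-text][expiry-22days]
[product-3][arbitrary-text][expiry-29days]
[product-4][arbitrary-text][expiry-25days]
[product-5][arbitrary-text][expiry-10days]
[product-6][arbitrary-text][expiry-12days]
[product-7][arbitrary-text][expiry-20days]
[product-8][arbitrary-text][expiry-26days]
[product-9][arbitrary-text][expiry-222days]
"""

# '-' directly before 2[0-5] and 'days]' directly after it pin the value
# to exactly two digits in the range 20..25.
EXPIRY_20_TO_25 = re.compile(r"^.*-2[0-5]days\]$", re.MULTILINE)

matches = EXPIRY_20_TO_25.findall(RAW)
for line in matches:
    print(line)
```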
I tested this one and it works as expected:
/[2-2]+[0-5]/g
[2-2] matches only the digit 2 (it is equivalent to a literal 2), which keeps the match in the twenties.
[0-5] requires the second digit to be between 0 and 5.
{2} limits a match to two digits.
Edit: to match the entire line character for character, this should do it for you.
\[\w*\-\d*\]\s*\[\w*\-[2-2]+[0-5]\w*\]
Edit2: updated to account for Arbitrary text ...
\[(\w*-\d*)\]+\s*\[(\w*\-\w*)\]\s*\[(\w*\-[2-2]+[0-5]\w*)\]
edit3: Updated to match any character for the arbitrary-text.
\[(\w*-\d*)\]\s*\[(.*)\]\s*\[(\w*\-[2-2][0-5]\w*)\]
.*\D2[0-5]d.*
.* matches everything.
\D prevents numbers like 123 and 222 from being valid matches.
2[0-5] covers the range.
d so it doesn't match the product number.
I pasted your sample text into http://regexr.com
It's a useful tool for building regular expressions.
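The roles of \D and d can be checked with a quick sketch, written here in Python. The rows for product-22 and expiry-222days are made-up edge cases, not part of the question's data, and in_range is a hypothetical helper name.

```python
import re

PATTERN = re.compile(r".*\D2[0-5]d.*")


def in_range(line):
    """True if the line contains a two-digit 20..25 value directly before 'd'."""
    return PATTERN.search(line) is not None


rows = [
    "[product-4][arbitrary-text][expiry-25days]",   # in range
    "[product-8][arbitrary-text][expiry-26days]",   # out of range
    "[product-22][arbitrary-text][expiry-17days]",  # product number is not followed by 'd'
    "[product-7][arbitrary-text][expiry-222days]",  # three digits, blocked by \D
]
for row in rows:
    print(in_range(row), row)
```

Because the pattern is not tied to the expiry field, arbitrary text that happens to contain something like -21d would also match, so the more specific -2[0-5]days\] form is the stricter option.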
You can try this one :
/(.*-2[0-5]days\]$)/gm

Regular expression model

Hey guys, I'm new to regular expressions. I found a regular expression like this:
preg_match("/^(1[-\s.])?(\()?\d{3}(?(2)\))[-\s.]?\d{3}[-\s.]?\d{4}$/",$number)
preg_match("/^
(1[-\s.])? # optional '1' followed by '-', '.' or whitespace
( \( )? # optional opening parenthesis
\d{3} # the area code
(?(2) \) ) # if there was opening parenthesis, close it
[-\s.]? # followed by '-' or '.' or space
\d{3} # first 3 digits
[-\s.]? # followed by '-' or '.' or space
\d{4} # last 4 digits
$/x",$number);
I found this explanation on a tutorial website. I just need to know why (?(2)) is used here: why is a question mark (the optional symbol) applied at the beginning, and what is the purpose of the (2) in that code?
Sorry if this question is low quality, I'm a newbie. Any help would be appreciated. Thanks. :)
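The (?(2)\)) is a conditional (an if clause): it checks whether the 2nd capture group, the optional opening parenthesis (\()?, took part in the match, and if so it requires a closing \). The leading ? here is part of the (?(...)...) conditional syntax, not the "optional" quantifier, and the 2 is the number of the group being tested.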
You should be able to see a break down of your regex at Regex101. It's pretty useful to see what the regex is doing at all points and it's easy to tweak a regex from there.
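The effect of the conditional is easy to see by trying a few inputs. The sketch below uses Python's re module, which supports the same (?(2)...) syntax, in place of PHP's preg_match; the helper name is_phone and the sample numbers are made up.

```python
import re

# Same pattern as in the question; fullmatch() replaces the ^ and $ anchors.
PHONE_RE = re.compile(r"(1[-\s.])?(\()?\d{3}(?(2)\))[-\s.]?\d{3}[-\s.]?\d{4}")


def is_phone(number):
    """True if number matches the phone pattern from start to end."""
    return PHONE_RE.fullmatch(number) is not None


for candidate in (
    "(555) 123-4567",   # group 2 matched '(', so ')' is required and present
    "555-123-4567",     # group 2 did not match, so no ')' is expected
    "(555-123-4567",    # '(' without the closing ')'
    "555) 123-4567",    # ')' without the opening '('
):
    print(is_phone(candidate), candidate)
```

Only the first two candidates are accepted: the closing parenthesis is demanded exactly when the opening one was matched.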

Extract data from a table of content with regex [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 8 years ago.
Consider the following String, which is a table of content
Table of Content
Name abc ......... 20
Name fghkjkj kjkj . 31
Name.with.dot ..... 45
I want to extract the section names 'Name abc', 'Name fghkjkj kjkj' and 'Name.with.dot'.
I haven't found the right regex to achieve that yet. Any insights?
I think the following should work:
^.*?(?= \.+ \d+$)
assuming you're working line by line or have MULTILINE mode enabled. The positive lookahead assertion makes sure that we end the match as soon as only dots and a number follow on the line.
Explanation:
^ # Start of line
.*? # Match any number of characters, as few as possible
(?= # Look ahead to assert that the following matches from here:
[ ] # a space
\.+ # one or more dots
[ ] # a space
\d+ # a number
$ # End of line
) # End of lookahead
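Applied to the table of contents from the question, the lookahead version can be sketched like this in Python, with re.MULTILINE providing the MULTILINE mode mentioned above. The variable names are made up for the illustration.

```python
import re

TOC = """\
Table of Content
Name abc ......... 20
Name fghkjkj kjkj . 31
Name.with.dot ..... 45
"""

# Lazily consume the line until only " <dots> <number>" remains before end of line.
SECTION_RE = re.compile(r"^.*?(?= \.+ \d+$)", re.MULTILINE)

names = SECTION_RE.findall(TOC)
print(names)
```

The heading line "Table of Content" produces no match because it never reaches a position followed by dots and a page number, and the dots inside "Name.with.dot" are kept because the lookahead insists on a space before the dot run.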
This positive lookahead based regex should work:
^.+?(?= +\.+ +\d+$)
Live Demo: http://www.rubular.com/r/B5EdXF3SIz
This will do the trick:
^Name[ .]\w+(?:[. ]\w+)?
Explanation:
^ # Start of string (or of each line in multiline mode)
Name # Literal string 'Name'
[ .] # Space or period
\w+ # One or more word characters
(?: # Start non-capturing group
[ .] # Space or period
\w+ # One or more word characters
) # Close non-capturing group
? # Make previous group optional

complex regular expression question on stop set [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 5 years ago.
What regular expression to perform search for header that starts with a number such as 1. Humility?
Here's the sample data screen shot, http://www.knowledgenotebook.com/issue/sampleData.html
Thanks.
I don't know which regex flavor you're using, so I assume it's Perl compatible.
You should always post some example data in case your perception of the regex is unclear.
Breaking down what your 'Stop signs' are:
## left out of regex, this could be anything up here
##
(?: # Start of non-capture group START sign
\d+\. # 1 or more digits followed by '.'
| # or
\(\d+\) # '(' followed by 1 or more digits followed by ')'
# note that \( could be the start of capture group 1 in bizarro world
) # End group
\s? # 0 or 1 whitespace (includes \n)
[^\n<]+ # 1 or more of not \n AND not '<' STOP sign's
It seems you want all chars after the group up to but not to include the
very next \n OR the very next '<'. In that case you should get rid of the \s?
because \s includes newline, if it matches a newline here, it will continue to match
until [^\n<]+ is satisfied.
(?:\d+\.|\(\d+\))[^\n<]+
Edit - After viewing your sample, it appears that you are searching unrendered html
pasted in html content. In that case the header appears to be:
'1. Self-Knowledge&lt;br&gt;' which, when the entities are converted, would be
1. Self-Knowledge<br>
Self-Knowledge
Superior leadership ...
You can add the entity to the mix so that all your bases are covered (ie: entity, \n, <):
((?:\d+\.|\(\d+\)))[^\S\n]+((?:(?!&lt;|[\n<]).)+)
Where:
Capture group1 = '1.'
Capture group2 = 'Self-Knowledge'
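Here is a small sketch of the two-group pattern in Python. It assumes the stop set is the escaped entity &lt;, a newline, or a literal <, as described above. The sample string is made up (the linked screenshot is not reproduced here) and find_headers is a hypothetical helper name.

```python
import re

# Group 1: the number ('1.' or '(1)'); group 2: the header text up to the
# first '&lt;' entity, newline or '<'.
HEADER_RE = re.compile(r"((?:\d+\.|\(\d+\)))[^\S\n]+((?:(?!&lt;|[\n<]).)+)")


def find_headers(text):
    """Return a list of (number, title) tuples for every header found."""
    return [m.groups() for m in HEADER_RE.finditer(text)]


sample = "1. Self-Knowledge&lt;br&gt;\nSuperior leadership ...\n(2) Humility<br>\n"
print(find_headers(sample))
```

[^\S\n]+ is a compact way of saying "whitespace, but not a newline", which keeps the number and its title on the same line.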
Other than that, I don't know what it could be.