Complex regex get closest - regex

I would like some help finishing a complex regex.
I have spent some time on it and still can't figure out how to achieve what I want.
This is the text I want to parse:
Do [|83]([]?([]?([]?([]?([]?([]?([]?:))))):)([]?([]?:([]?:)):)([]?[]? :):)([]?([]?[]:):)([]?([]?[]:):)
Bo [|18] pz ([]?:)\n la :\n[pl]
Co [|76] pp ([]?:)
For readability, I put each item on its own line, but please consider that they are actually all on one line.
This is my regex so far :
(\[\|(\d*)])+(?!\\\n).*([%\sa-zA-Z]*)(\((\[[^\[\]()?:]*])+\s*\?([^()]*):([^()]*)\))
I read each [|NUMBER] (...) combination one by one. The processing I apply to the "(...)" part depends on the associated NUMBER.
When I'm parsing the first time, I'm getting this which is fine :
Then, I replace the whole value after my process :
Now, I do have :
Do [|83] blabla done Bo [|18] pz ([]?:)\n la :\n[pl] Co [|76] pp ([]?:)
When I parse it once more, I get:
The number I get is not the right one. My question is: how can I get the number closest to the "(...)" I'm parsing?
Thank you for any tips.

You might shorten the pattern a bit and, after matching the digits and ], use a negated character class that excludes both the square brackets and the parentheses:
\[\|\d+][^][()]*\([^()]*\)
The pattern matches:
\[\|\d+] Match [|, 1+ digits and ]
[^][()]* Match 0+ times any char other than [ ] ( )
\([^()]*\) Match (, then 0+ times any char other than ( ), then match )
Regex demo
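As a quick check, here is a minimal Python sketch of the same pattern applied to the text from the question after the first replacement. The capture groups and the variable names are additions for illustration only, and the brackets inside the character class are escaped, which is equivalent to [^][()] but a little easier to read:

```python
import re

# Same idea as the pattern above, with capture groups added (an assumption
# for illustration): group 1 is the NUMBER, group 3 is the "(...)" part.
PATTERN = re.compile(r"\[\|(\d+)\]([^\]\[()]*)(\([^()]*\))")

# Text from the question after the first "[|83] (...)" has been replaced.
text = r"Do [|83] blabla done Bo [|18] pz ([]?:)\n la :\n[pl] Co [|76] pp ([]?:)"

# Collect (NUMBER, parenthesised part) pairs.
matches = [(m.group(1), m.group(3)) for m in PATTERN.finditer(text)]
print(matches)
```

Because the text between ] and ( may not contain brackets or parentheses, [|83] is skipped here (a [ appears before the next "("), and each "(...)" is paired with the number closest to it.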

Related

Regex expression to bold using asterisks

I have a question regarding using a regular expression to bold text within a string using asterisks.
The other questions on this topic work well for simple scenarios however we have encountered some issues.
Our particular scenario is for asterisks to be replaced with <bold></bold> tags.
It must also be able to handle multiple asterisks as well as an uneven number of asterisks.
Our example input text is as follows;
string exampleText1 = "**** PLEASE NOTE *** Testing, *nuts*, **please note..., test";
string exampleText2 = "**Test text (10)";
Our current regex is as follows;
Regex _boldRegex = new Regex(@"(\*)+([^*?$]+)+(\*)");
string value = _boldRegex.Replace(exampleText1, @"<bold>$2</bold>");
Example 1 should show "<bold> PLEASE NOTE </bold> Testing, <bold>nuts</bold>, *please note..., test" where the groups of asterisks are treated as single asterisks and an unfinished tag is ignored.
Example 2 crashes the program because it expects a 'closing' asterisk. It should show "*Test text (10)"
Can anyone help by suggesting a new regex, bearing in mind the ability to handle groups of asterisks and also an uneven number of asterisks?
Thanks in advance.
For your example data, you might use an optional part with a capture group that captures the repeated character class (excluding newlines) between runs of 1 or more *.
In the callback of Replace, you can test whether group 1 matched and do the replacement based on that.
\*+(?:([^*?$\n\r]+)\*+)?
The pattern matches:
\*+ Match 1+ times *
(?: Non capture group
( Capture group 1
[^*?$\n\r]+ Match 1+ times any char other than the listed in the character class
) Close group 1
\*+ Match 1+ times *
)? Close the non-capture group and make it optional
See a regex demo.
For example
Regex _boldRegex = new Regex(@"\*+(?:([^*?$\n\r]+)\*+)?");
string exampleText1 = @"**** PLEASE NOTE *** Testing, *nuts*, **please note..., test
**Test text (10)";
string value = _boldRegex.Replace(exampleText1, m =>
m.Groups[1].Success ? String.Format("<bold>{0}</bold>", m.Groups[1].Value) : "*"
);
Console.WriteLine(value);
Output
<bold> PLEASE NOTE </bold> Testing, <bold>nuts</bold>, *please note..., test
*Test text (10)

Regex - math with cycle

How can I count occurrences inside a match in order to skip some characters?
Example:
I have:
(a(b(c)))
If I run this regex: \(.+?\)
It returns: (a(b(c)
But what I want is the ) that closes the group, that is, the third one.
I could just remove the ? from the regex, but then there is a problem:
E.g. \(.+\) applied to (a)(a(b(c))) returns (a)(a(b(c)))
What I want is each group returned with its parentheses balanced, that is, it should return 2 matches:
match 1: (a)
match 2: (a(b(c)))
Why do I ask about counting in the match? Well, I wondered whether there is any way to count how many ( have been passed, in order to know how many ) should be skipped, that is:
1 2 3 1 2 3
( a ( b ( c ) ) )
Does anyone have any idea how to do this just using regex?
If you need to use regex, please try the recursive regex (?R).
The implementation depends on the language, so let me explain it with Python.
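#!/usr/bin/python
# Requires the third-party "regex" module (pip install regex);
# the standard "re" module does not support (?R).
import regex

text = '(a)(a(b(c)))'  # named "text" to avoid shadowing the built-in str
m = regex.findall(r'\((?:[^()]|(?R))+\)', text)
print(m)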
Output:
['(a)', '(a(b(c)))']
Explanation of the regex pattern \((?:[^()]|(?R))+\):
The inner part (?:[^()]|(?R))+ matches:
one or more [^()] or (?R) where
[^()] matches any character other than parentheses.
(?R) represents the entire regex \((?:[^()]|(?R))+\) recursively.
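If the third-party regex module is not an option, the counting idea from the question can also be sketched without any regex. Note that this is a different technique from the recursive pattern above (a plain depth counter), and balanced_groups is a hypothetical helper name chosen for this example:

```python
def balanced_groups(text):
    """Return the top-level balanced "(...)" substrings of text."""
    groups = []
    depth = 0      # how many "(" are currently open
    start = None   # index of the "(" that opened the current top-level group
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
            if depth == 0:
                # This ")" closes the group opened at "start".
                groups.append(text[start:i + 1])
    return groups


result = balanced_groups("(a)(a(b(c)))")
print(result)
```

Unbalanced closing parentheses are simply ignored in this sketch, and a group that is never closed is not returned.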

Regex with global modifier to capture words within lines

The Input:
Let's consider this string below
* key : foo bar *
* big key : bar*bar
* healthy : cereal bar *
sadly : without star *
The Output:
I would like to retrieve the key:value pairs for each match.
'key', 'foo bar'
'big key', 'bar*bar'
'healthy', 'cereal bar'
'sadly', 'without star'
The Regex:
My first success was achieved with this Regex (PCRE/Perl):
/(\n?)([^\* ].*[^ *])\s+:\s+([^\* ].*[^ *])[\s\*]+(?|\n)/g
Here the DEMO.
My question
I really find my regex pretty ugly. The main reason is because I can't use /^ and $/ in a global regex and I had to play with /(\n?)...(?|\n)/g.
Is there any possibility to shorten the above regex ?
The optional challenge
Actually this was the easy part. My string is supposed to be embedded in a C comment and I have to make sure I am not trying to match something outside a comment block.
(I don't really need an answer to this second, trickier question, because if I write a script I can first match all the comment blocks and then find all the key:value patterns.)
/********************************
* key : foo bar *
* big key : bar*bar
* healthy : /*cereal bar *
sadly : without star *
********************************/
not a key : this key
You can add the m flag to the regexp to make the anchors ^ and $ match the beginning and end of each line within the string, i.e.:
/^\s*\*?\s*([^:]+?)\s*:\s*(.*?)\s*\*?\s*$/gm
Note the use of non-greedy quantifiers (+? and *?) to not eat up characters that can be matched after the quantifier, i.e. the first capture group will not include the optional trailing whitespace before the colon, and the second capture group will not include trailing whitespace and an optional asterisk at the end of a line.
http://regex101.com/r/oJ8uW4/1
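For illustration, the same pattern can be tried in Python, where re.MULTILINE plays the role of the m flag and findall stands in for g. This is only a sketch on the sample lines from the question, and the variable names are assumptions:

```python
import re

# The sample lines from the question, joined into one string.
text = "\n".join([
    "* key : foo bar *",
    "* big key : bar*bar",
    "* healthy : cereal bar *",
    "sadly : without star *",
])

# re.MULTILINE makes ^ and $ match at the start and end of every line.
PAIR = re.compile(r"^\s*\*?\s*([^:]+?)\s*:\s*(.*?)\s*\*?\s*$", re.MULTILINE)

# findall returns one (key, value) tuple per matching line.
pairs = PAIR.findall(text)
print(pairs)
```

Since [^:] and \s can also match newlines, lines without a colon may be absorbed into the following key, so real input may need a stricter class for the key.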
The regex I used is: /^\s*[*]*\s+(.*)\s+:\s+(.*?)\s+[*]*\s*$/gm
It works for your example because "not a key : this key" has no space after it, so it would miss comment lines that do not close with * and would also capture values with trailing spaces.
What you're looking for are the modifiers after the last /:
m says the pattern is multiline, so ^ and $ are usable on each line, and g repeats the match on every line.
The drawback is you can't rely on having /* and */ on lines around when using ^ and $
But Avinash will prove me wrong I bet :) (he's far better than me with regexes)

Find the match extract next n chars but exclude a match itself

I am not regex savvy, so my question may seem simple. How do you extract hours and minutes from a string like this:
2013-12-03T10:45:33-07:00
So I just want to get 10:45 from the above string and ignore the rest.
I tried /[0-2][0-9]:[0-5][0-9]/ but that gives me: 10:45 as well as 07:00
Also tried /[T][0-2][0-9]:[0-5][0-9]/ , but this gives me T10:45
I tried excluding 'T' by using a ^ anchor [^T][ ][ ]:[ ][ ] but this gave me -07:00 !
I thought about searching for the first occurrence of ':' but I don't know how to extract 2 digits before and after ':' and include the ':' itself.
Any help with a comment would be greatly appreciated.
You can use a positive lookbehind for this:
/(?<=T)\d{2}:\d{2}/
What this essentially means is that we're matching two digits followed by a colon followed by two digits, but they MUST have a "T" in front. The "T" itself is not added to the match, because lookaheads/lookbehinds are zero-width and do not consume characters.
DEMO
[^T] means "any character that isn't T", which is why it didn't work.
JS regex did not support lookbehinds until ES2018 (lookaheads have always been supported), so for older engines you can simply create a capturing group using /T(\d{2}:\d{2})/ and then read index [1] of the match:
var timeString = '2013-12-03T10:45:33-07:00';
var time = timeString.match(/T(\d{2}:\d{2})/)[1];
console.log(time); //10:45
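For comparison, both approaches (lookbehind and capture group) can be sketched in Python with the standard re module. The variable names are only illustrative:

```python
import re

time_string = "2013-12-03T10:45:33-07:00"

# Lookbehind: the "T" must precede the match but is not part of it.
with_lookbehind = re.search(r"(?<=T)\d{2}:\d{2}", time_string).group()

# Capture group: match the "T" as well, then keep only group 1.
with_group = re.search(r"T(\d{2}:\d{2})", time_string).group(1)

print(with_lookbehind, with_group)
```

Both variants skip the 07:00 of the UTC offset, because that part is preceded by "-" rather than "T".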
A simple way to extract without heavy regex knowledge would be to do something like
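my $foo = "2013-12-03T10:45:33-07:00";
# split on ":" and keep the first three pieces; the rest is discarded
my ($hours, $minutes, $junk) = split /:/, $foo;
# $hours is "2013-12-03T10" at this point; keep only its last two digits
$hours =~ s/.*(\d\d)$/$1/;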
so now you have
hours and minutes available for use

Regex with lookahead

I can't seem to make this regex work.
The input is as follows. It's really all on one line, but I have inserted line breaks after each \r\n so that it's easier to read, so no check for whitespace characters is needed.
01-03\r\n
01-04\r\n
TEXTONE\r\n
STOCKHOLM\r\n
350,00\r\n ---- 350,00 should be the last value in the first match
12-29\r\n
01-03\r\n
TEXTTWO\r\n
COPENHAGEN\r\n
10,80\r\n
This could go on with another 01-31 and 02-01, marking another new match (these are dates).
I would like to have a total of 2 matches for this input.
My problem is that I can't figure out how to look ahead and match the start of a new match (two consecutive dates) without including those dates in the first match. They should belong to the second match.
It's hard to explain, but I hope someone will get me.
This is what I've got so far, but it's not even close:
(.*?)((?<=\\d{2}-\\d{2}))
The matches I want are:
1: 01-03\r\n01-04\r\nTEXTONE\r\nSTOCKHOLM\r\n350,00\r\n
2: 12-29\r\n01-03\r\nTEXTTWO\r\nCOPENHAGEN\r\n10,80\r\n
After that I can easily separate the columns with \r\n.
Would this more explicit pattern work for you?
(\d{2}-\d{2})\r\n(\d{2}-\d{2})\r\n(.*)\r\n(.*)\r\n(\d+(?:,?\d+))
Here's another option for you to try:
(.+?)(?=\d{2}-\d{2}\\r\\n\d{2}-\d{2}|$)
Rubular
/
\G
(
(?:
[0-9]{2}-[0-9]{2}\r\n
){2}
(?:
(?! [0-9]{2}-[0-9]{2}\r\n ) [^\n]*\n
)*
)
/xg
Why do so much work?
$string = q(01-03\r\n01-04\r\nTEXTONE\r\nSTOCKHOLM\r\n350,00\r\n12-29\r\n01-03\r\nTEXTTWO\r\nCOPENHAGEN\r\n10,80\r\n);
for (split /(?=(?:\d{2}-\d{2}\\r\\n){2})/, $string) {
print join( "\t", split /\\r\\n/), "\n"
}
Output:
01-03 01-04 TEXTONE STOCKHOLM 350,00
12-29 01-03 TEXTTWO COPENHAGEN 10,80
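The split-on-lookahead idea above can also be sketched in Python. This assumes the input contains real CR LF characters rather than the literal text \r\n, and Python 3.7 or later (where re.split can split on empty matches); the variable names are illustrative only:

```python
import re

# Assumption: real carriage-return/line-feed pairs separate the fields.
text = (
    "01-03\r\n01-04\r\nTEXTONE\r\nSTOCKHOLM\r\n350,00\r\n"
    "12-29\r\n01-03\r\nTEXTTWO\r\nCOPENHAGEN\r\n10,80\r\n"
)

# Zero-width lookahead: split right before two consecutive date lines,
# so the dates stay at the start of the record they belong to.
RECORD_START = re.compile(r"(?=(?:\d{2}-\d{2}\r\n){2})")

# The split at position 0 yields an empty first chunk, which is dropped.
records = [chunk for chunk in RECORD_START.split(text) if chunk]

# Separate the columns of each record on \r\n.
rows = [record.rstrip("\r\n").split("\r\n") for record in records]
print(rows)
```

Each element of records is one complete match as requested in the question, and rows holds the separated columns.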