regex findall to retrieve a substring based on start and end character - regex

I have the following string:
6[Sup. 1e+02]
I'm trying to retrieve a substring of just 1e+02. The variable first refers to the above specified string. Below is what I have tried.
re.findall(' \d*]', first)

You need to use the following regex:
\b\d+e\+\d+\b
Explanation:
\b - Word boundary
\d+ - Digits, 1 or more
e - Literal e
\+ - Literal +
\d+ - Digits, 1 or more
\b - Word boundary
Sample code:
import re
# Compile the pattern once; a plain raw string works in Python 3 (the ur'' prefix does not)
p = re.compile(r'\b\d+e\+\d+\b')
test_str = "6[Sup. 1e+02]"
print(p.findall(test_str))

import re
first = "6[Sup. 1e+02]"
result = re.findall(r"\s+(.*?)\]", first)
print(result)
Output:
['1e+02']
Demo
http://ideone.com/Kevtje
regex Explanation:
\s+(.*?)\]
Match a single character that is a “whitespace character” (ASCII space, tab, line feed, carriage return, vertical tab, form feed) «\s+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the regex below and capture its match into backreference number 1 «(.*?)»
Match any single character that is NOT a line break character (line feed) «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character “]” literally «\]»
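As a rough sketch of why the original attempt returns nothing and how the two suggested patterns behave, the snippet below runs all three against the string from the question. The variable names other than first are only illustrative.

```python
import re

first = "6[Sup. 1e+02]"

# Original attempt: \d* only allows digits between the space and "]",
# so the "e" and "+" in 1e+02 prevent any match.
original_attempt = re.findall(r" \d*]", first)

# Pattern from the first answer: describes the shape of the number itself.
by_shape = re.findall(r"\b\d+e\+\d+\b", first)

# Pattern from the second answer: captures whatever sits between
# the whitespace and the closing bracket.
by_delimiters = re.findall(r"\s+(.*?)\]", first)

print(original_attempt)
print(by_shape)
print(by_delimiters)
```

The shape-based pattern is stricter about what counts as a number, while the delimiter-based one accepts any text before the closing bracket.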

Related

RegEx after certain string

I have a manifest file
Bundle-ManifestVersion: 2
Bundle-Name: BundleSample
Bundle-Version: 4
I want to change the value of Bundle-Name using -replace in Powershell.
I used this pattern Bundle-Name:(.*)
But it returns including the Bundle-Name. What would be the pattern if I want to change only the value of the Bundle-Name?
You could capture both the Bundle-Name: and its value in two separate capture groups.
Then replace like this:
$manifest = @"
Bundle-ManifestVersion: 2
Bundle-Name: BundleSample
Bundle-Version: 4
"@
$newBundleName = 'BundleTest'
$manifest -replace '(Bundle-Name:\s*)(.*)', ('$1{0}' -f $newBundleName)
# or
# $manifest -replace '(Bundle-Name:\s*)(.*)', "`$1$newBundleName"
The above will result in
Bundle-ManifestVersion: 2
Bundle-Name: BundleTest
Bundle-Version: 4
Regex details:
( Match the regex below and capture its match into backreference number 1
Bundle-Name: Match the character string “Bundle-Name:” literally (case sensitive)
\s Match a single character that is a “whitespace character” (any Unicode separator, tab, line feed, carriage return, vertical tab, form feed, next line)
* Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
)
( Match the regex below and capture its match into backreference number 2
. Match any single character that is NOT a line break character (line feed)
* Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
)
Thanks to LotPings, there is an even simpler regex that can be used:
$manifest -replace '(?<=Bundle-Name:\s*).*', $newBundleName
This uses a positive lookbehind.
The regex details for that are:
(?<= Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
Bundle-Name: Match the characters “Bundle-Name:” literally
\s Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
* Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
)
. Match any single character that is not a line break character
* Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
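For comparison, here is a minimal sketch of the capture-group approach in Python. Python's re module only accepts fixed-width lookbehinds, so the (?<=Bundle-Name:\s*) form would not compile there and the two-group idea is used instead. The function name set_bundle_name is hypothetical.

```python
import re

manifest = (
    "Bundle-ManifestVersion: 2\n"
    "Bundle-Name: BundleSample\n"
    "Bundle-Version: 4"
)


def set_bundle_name(text, new_name):
    """Replace only the value after 'Bundle-Name:' and keep the key itself."""
    # Group 1 holds the key plus the whitespace after it. The lambda re-emits
    # that group unchanged and appends new_name without escape processing.
    return re.sub(
        r"(Bundle-Name:\s*).*",
        lambda m: m.group(1) + new_name,
        text,
    )


updated = set_bundle_name(manifest, "BundleTest")
print(updated)
```

Using a function as the replacement avoids the ambiguity that a replacement string such as \1 followed by digits could cause if the new name started with a number.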

regex: extract text blocks, defined beginning, undefined end

I have text like this:
Date: 01.02.2015 //<-stable format
something
something more
some random more
Date: 02.02.2015
something random
i dont know
So I have many such blocks. Each one starts with Date... and ends where the next Date... starts.
The text in the lines of a block can be anything except the Date... format.
I need an array at the end, with such blocks:
array[0] = "Date: 01.02.2015
something
something more
some random more"
array[1] = "Date: 02.02.2015
something random
i dont know"
For now I add a unique splitter before each Date... and then split on that splitter.
Question: is it possible to get such blocks using only a regex?
(I use VBA to parse the text, with the RegExp object.)
Instead of splitting, just match using
\bDate:\s\d{1,2}\.\d{1,2}\.\d{4}[\s\S]*?(?=\nDate:|$)
See demo.
https://regex101.com/r/uF4oY4/77
Syntax explanation (from the linked site):
\b assert position at a word boundary: (^\w|\w$|\W\w|\w\W)
Date: matches the characters Date: literally (case sensitive)
\s matches any whitespace character (equal to [\r\n\t\f\v ])
\d{1,2} matches a digit (equal to [0-9]) between 1 and 2 times, as many times as possible, giving back as needed (greedy)
\. matches the character . literally (case sensitive)
\d{1,2} matches a digit (equal to [0-9]) between 1 and 2 times, as many times as possible, giving back as needed (greedy)
\. matches the character . literally (case sensitive)
\d{4} matches a digit (equal to [0-9]) exactly 4 times
[\s\S] matches any single character, including line breaks
\s matches any whitespace character (equal to [\r\n\t\f\v ])
\S matches any non-whitespace character (equal to [^\r\n\t\f\v ])
*? Quantifier: matches the preceding [\s\S] between zero and unlimited times, as few times as possible, expanding as needed (lazy)
(?=\nDate:|$) Positive Lookahead: asserts that one of the following options matches
\nDate: Option 1
\n matches a line-feed (newline) character (ASCII 10)
Date: matches the characters Date: literally (case sensitive)
$ Option 2: asserts position at the end of the string, or before the line terminator right at the end of the string (if any)
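The same pattern can be tried outside VBA. The following is a small Python sketch, assuming the sample text from the question without a trailing newline; BLOCK_RE and blocks are illustrative names. In VBA the equivalent would presumably need the RegExp object's Global property set so that all matches are returned.

```python
import re

text = (
    "Date: 01.02.2015\n"
    "something\n"
    "something more\n"
    "some random more\n"
    "Date: 02.02.2015\n"
    "something random\n"
    "i dont know"
)

# Each match starts at a "Date: dd.mm.yyyy" header and lazily extends until
# the next "\nDate:" or the end of the string (checked by the lookahead).
BLOCK_RE = re.compile(r"\bDate:\s\d{1,2}\.\d{1,2}\.\d{4}[\s\S]*?(?=\nDate:|$)")

blocks = BLOCK_RE.findall(text)

for index, block in enumerate(blocks):
    print(index, repr(block))
```

Because the lookahead does not consume the next header, every block begins with its own Date line and no splitter has to be inserted first.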

Regex Pattern where group may not exist

I have a RegEx pattern that needs to match on any of the following lines:
10-10-15 15:16:41.1 Some Text here
10-10-15 15:16:41.12 Some Text here
10-10-15 15:16:41.123 Some Text here
10-10-15 15:16:41 Some Text here
I can match the first 3 with the pattern below:
(?<date>(?<day>\d{1,2})-(?<month>\d{1,2})-(?<year>(?:\d{4}|\d{2}))\s(?<time>(?<hour>\d{2}):(?<minutes>\d{2}):(?<seconds>\d{2})\.(?<milli>\d{0,3})))\s(?<Line>.*)
How do I match this line (10-10-15 15:16:41 Some Text here), which has no milliseconds, but still get the milli group back in my result, either with a blank value or with 0 as the value?
Thanks
As I said, each of the lines below will match:
10-10-15 15:16:41.123 Some text Here
10-10-15 15:16:41.12 Some Text here
10-10-15 15:16:41.1 Some Text here
10-10-15 15:16:41. Some Text here
The groups look like so:
date [0-18] `10-10-15 15:16:41.`
day [0-2] `10`
month [3-5] `10`
year [6-8] `15`
time [9-18] `15:16:41.`
hour [9-11] `15`
minutes [12-14] `16`
seconds [15-17] `41`
milli [18-18] ``
Line [19-34] `Some Text here `
You can use the following (slightly modified version of your regex):
(?<date>(?<day>\d{1,2})-(?<month>\d{1,2})-(?<year>(?:\d{4}|\d{2}))\s(?<time>(?<hour>\d{2}):(?<minutes>\d{2}):(?<seconds>\d{2})(?<milli>\.\d{0,3})?))\s(?<logEntry>.*)
Explanation:
Make the whole <milli> group, including the dot, optional rather than just the dot; with only the dot optional, the pattern would also match strings like 10-10-15 15:16:41123 Some Text here.
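A minimal Python sketch of the optional-group idea follows. It is a variation, not the exact pattern above: Python writes named groups as (?P<name>...), the dot is kept outside the milli group, and a missing value is replaced by "0" after matching. The names LINE_RE, parse_line and log_entry are assumptions made for this example.

```python
import re

LINE_RE = re.compile(
    r"(?P<day>\d{1,2})-(?P<month>\d{1,2})-(?P<year>\d{4}|\d{2})\s"
    r"(?P<hour>\d{2}):(?P<minutes>\d{2}):(?P<seconds>\d{2})"
    r"(?:\.(?P<milli>\d{1,3}))?"   # dot and digits are optional together
    r"\s(?P<log_entry>.*)"
)


def parse_line(line):
    """Return a dict of named groups, or None if the line does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # An unmatched optional group is None in Python; default it to "0".
    fields["milli"] = fields["milli"] or "0"
    return fields


for sample in (
    "10-10-15 15:16:41.12 Some Text here",
    "10-10-15 15:16:41 Some Text here",
    "10-10-15 15:16:41123 Some Text here",
):
    print(parse_line(sample))
```

Since the dot and the digits are optional as a unit, a line with digits glued directly onto the seconds is rejected instead of being silently accepted.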
Worked it out. I needed the following pattern:
(?<date>(?<day>\d{1,2})-(?<month>\d{1,2})-(?<year>(?:\d{4}|\d{2}))\s(?<time>(?<hour>\d{2}):(?<minutes>\d{2}):(?<seconds>\d{2})(?<milli>\.?\d{0,3})))\s(?<logEntry>.*)
^(\d+)-(\d+)-(\d+)\s(\d+):(\d+):(\d+)\.?(\d*)([a-zA-Z\s]+)
Note the (\d*) which will return the group even if empty.
Make the milliseconds optional with ?:
/^([\d]{2})-([\d]{2})-([\d]{2}|[\d]{4})\s+([\d]{2}):([\d]{2}):([\d]{2})\.?(\d+)?\s+(.*?)$/
Example:
<?php
$strings = <<< LOL
10-10-15 15:16:41.1 Some Text here
10-10-15 15:16:41.12 Some Text here
10-10-15 15:16:41.123 Some Text here
10-10-15 15:16:41 Some Text here
LOL;
preg_match_all('/^([\d]{2})-([\d]{2})-([\d]{2}|[\d]{4})\s+([\d]{2}):([\d]{2}):([\d]{2})\.?(\d+)?\s+(.*?)$/m', $strings , $matches, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($matches[0]); $i++) {
    $day = $matches[1][$i];
    $month = $matches[2][$i];
    $year = $matches[3][$i];
    $hours = $matches[4][$i];
    $minutes = $matches[5][$i];
    $seconds = $matches[6][$i];
    $ms = $matches[7][$i];
    $text = $matches[8][$i];
    echo "$day $month $year $hours $minutes $seconds $ms $text \n";
}
Regex Demo:
https://regex101.com/r/aF9wN6/1
PHP Demo:
http://ideone.com/1aEt2E
Regex Explanation:
^([\d]{2})-([\d]{2})-([\d]{2}|[\d]{4})\s+([\d]{2}):([\d]{2}):([\d]{2})\.?(\d+)?\s+(.*?)$
Assert position at the beginning of a line (at beginning of the string or after a line break character) (line feed) «^»
Match the regex below and capture its match into backreference number 1 «([\d]{2})»
Match a single character that is a “digit” (any decimal number in any Unicode script) «[\d]{2}»
Exactly 2 times «{2}»
Match the character “-” literally «-»
Match the regex below and capture its match into backreference number 2 «([\d]{2})»
Match a single character that is a “digit” (any decimal number in any Unicode script) «[\d]{2}»
Exactly 2 times «{2}»
Match the character “-” literally «-»
Match the regex below and capture its match into backreference number 3 «([\d]{2}|[\d]{4})»
Match this alternative (attempting the next alternative only if this one fails) «[\d]{2}»
Match a single character that is a “digit” (any decimal number in any Unicode script) «[\d]{2}»
Exactly 2 times «{2}»
Or match this alternative (the entire group fails if this one fails to match) «[\d]{4}»
Match a single character that is a “digit” (any decimal number in any Unicode script) «[\d]{4}»
Exactly 4 times «{4}»
Match a single character that is a “whitespace character” (any Unicode separator, tab, line feed, carriage return, form feed) «\s+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the regex below and capture its match into backreference number 4 «([\d]{2})»
Match a single character that is a “digit” (any decimal number in any Unicode script) «[\d]{2}»
Exactly 2 times «{2}»
Match the character “:” literally «:»
Match the regex below and capture its match into backreference number 5 «([\d]{2})»
Match a single character that is a “digit” (any decimal number in any Unicode script) «[\d]{2}»
Exactly 2 times «{2}»
Match the character “:” literally «:»
Match the regex below and capture its match into backreference number 6 «([\d]{2})»
Match a single character that is a “digit” (any decimal number in any Unicode script) «[\d]{2}»
Exactly 2 times «{2}»
Match the character “.” literally «\.?»
Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Match the regex below and capture its match into backreference number 7 «(\d+)?»
Between zero and one times, as many times as possible, giving back as needed (greedy) «?»
Match a single character that is a “digit” (any decimal number in any Unicode script) «\d+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match a single character that is a “whitespace character” (any Unicode separator, tab, line feed, carriage return, form feed) «\s+»
Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match the regex below and capture its match into backreference number 8 «(.*?)»
Match any single character that is NOT a line break character (line feed) «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Assert position at the end of a line (at the end of the string or before a line break character) (line feed) «$»

How to match a string, for which a specific method is invoked, via regex?

Regex: \"(.+?)\"\.Localize\(\)
Text: ModelState.AddModelError("Property", "Invalid property.".Localize());
Example:
http://regex101.com/r/aY5jK2
Currently the text Property", "Invalid property. gets matched. How do I match just the Invalid property. string?
Try this:
\"([^"]+?)\"\.Localize\(\)
You can use this regex:
(")([^"]+)\1(?=\.Localize\(\))
Group 2 will contain what you want.
@Dante, keep your working regex and simply add a whitespace (\s) before the quote:
\s\"(.*?)\"\.Localize\(\)
REGEX EXPLANATION:
\s\"(.*?)\"\.Localize\(\)
Match a single character that is a “whitespace character” (any Unicode separator, tab, line feed, carriage return, form feed) «\s»
Match the character “"” literally «\"»
Match the regex below and capture its match into backreference number 1 «(.*?)»
Match any single character that is NOT a line break character (line feed) «.*?»
Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the character “"” literally «\"»
Match the character “.” literally «\.»
Match the character string “Localize” literally (case insensitive) «Localize»
Match the character “(” literally «\(»
Match the character “)” literally «\)»
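To see the difference between the lazy dot and the negated character class, here is a short Python sketch run against the text from the question. The variable names are illustrative only.

```python
import re

text = 'ModelState.AddModelError("Property", "Invalid property.".Localize());'

# Lazy dot: the match still starts at the first quote, and .+? is allowed to
# cross later quotes until ".Localize()" can finally follow.
lazy_dot = re.findall(r'"(.+?)"\.Localize\(\)', text)

# Negated class: [^"]+ cannot cross a quote, so only the string literal
# directly in front of .Localize() is captured.
negated_class = re.findall(r'"([^"]+)"\.Localize\(\)', text)

print(lazy_dot)
print(negated_class)
```

Laziness only affects where a match ends, not where it starts, which is why excluding the quote character from the capture is the more robust fix here.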

Perl regexp specific letters in string

The input strings consist of the letters I, N, P, U, Y, X.
- I have to verify, with a Perl regexp, that the input contains only these letters and nothing else
- I also have to verify that the input contains at least 2 occurrences of "NP" (without quotes)
Example string:
INPYUXNPININNPXX
The strings are all in uppercase.
You can use this lookahead based regex in PCRE:
^(?=(?:.*?NP){2})[INPUYX]+$
Online Demo: http://regex101.com/r/zH3jQ3
Explanation:
^ assert position at start of a line
(?=(?:.*?NP){2}) Positive Lookahead - Assert that the regex below can be matched
(?:.*?NP){2} Non-capturing group
Quantifier: Exactly 2 times
.*? matches any character (except newline)
Quantifier: Between zero and unlimited times, as few times as possible, expanding as needed [lazy]
NP matches the characters NP literally (case sensitive)
[INPUYX]+ match a single character present in the list below
Quantifier: Between one and unlimited times, as many times as possible, giving back as needed [greedy]
INPUYX a single character in the list INPUYX literally (case sensitive)
$ assert position at end of a line
Use this:
^[INPUYX]*NP[INPUYX]*?NP[INPUYX]*$
See it in action: http://regex101.com/r/vI2xQ6
Effectively, what we're doing here is allowing 0 or more characters from your character class, matching the first (required) occurrence of NP, then ensuring that it occurs at least once more before the end of the string.
Hypothetically if you wanted to capture out the middle, you could do:
^(?=(?:(.*?)NP){2})[INPUYX]+$
Or, as @ikegami points out (matching ONLY the single line): \A(?=(?:(.*?)NP){2})[INPUYX]+\z.
The cleanest solution is:
/^[INPUXY]*\z/ && /NP.*NP/s
The following is the most efficient as it avoids matching the string twice and it prevents backtracking on failure:
/
^
(?: (?:[IPUXY]|N(?!P))* NP ){2}
[INPUXY]*
\z
/x
To capture what's between the two NP, you can use
/
^
(?:[IPUXY]|N(?!P))* NP
( (?:[IPUXY]|N(?!P))* ) NP
[INPUXY]*
\z
/x
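As a rough cross-check, the lookahead approach and a single-pass approach can be compared in Python. This is a sketch under a few assumptions: re.fullmatch stands in for the ^ ... \z anchors, the single-pass form skips an N only when it is not followed by P, and the function names are hypothetical.

```python
import re

# Lookahead form: first assert that NP occurs twice, then check the alphabet.
LOOKAHEAD_RE = re.compile(r"(?=(?:.*?NP){2})[INPUYX]+")

# Single-pass form: consume allowed letters up to each NP, twice, then the rest.
SINGLE_PASS_RE = re.compile(r"(?:(?:[IPUXY]|N(?!P))*NP){2}[INPUXY]*")


def is_valid(s):
    """Only letters I N P U Y X, with at least two occurrences of NP."""
    return LOOKAHEAD_RE.fullmatch(s) is not None


def is_valid_single_pass(s):
    """Same check, written so the string is scanned once."""
    return SINGLE_PASS_RE.fullmatch(s) is not None


for sample in ("INPYUXNPININNPXX", "NNPNP", "INPXX", "NPNPA", ""):
    print(repr(sample), is_valid(sample), is_valid_single_pass(sample))
```

Both functions should agree on every input; the sample with a doubled N is included because an N that is followed by another N is an easy case to overlook in the single-pass form.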