Find all substrings with at least one group - regex

I'm trying to find all substrings of a string that meet a condition.
Let's say we've got string:
s = 'some text 1a 2a 3 xx sometext 1b yyy some text 2b.'
I need to apply search pattern {(one (group of words), two (another group of words), three (another group of words)), word}. First three positions are optional, but there should be at least one of them. If so, I need a word after them.
Output should be:
2a 1a 3 xx
1b yyy
2b
I wrote this expression:
find_it = re.compile(r"((?P<one>\b1a\s|\b1b\s)|" +
r"(?P<two>\b2a\s|\b2b\s)|" +
r"(?P<three>\b3\s|\b3b\s))+" +
r"(?P<word>\w+)?")
Every group contains a set of different words (not just 1a, 1b), and I can't merge them into one group. A group should be None if it is empty. Obviously the result is wrong.
find_it.findall(s)
> 2a 1a 2a 3 xx
> 1b 1b yyy
I am grateful for your help!

You can use the following regex:
>>> reg=re.compile(r'((?:(?:[12][ab]|3b?)\s?)+(?:\w+|\.))')
>>> reg.findall(s)
['1a 2a 3 xx', '1b yyy', '2b.']
Here I just made your regex more concise by using character classes and the ? quantifier. The core of the regex consists of two alternatives:
[12][ab]|3b?
[12][ab] will match 1a,1b,2a,2b and 3b? will match 3b and 3.
And if you don't want the dot at the end of 2b, you can use the following regex, which uses a positive lookahead and is more general than the preceding one (because making \s optional in the first group is not a good idea):
>>> reg=re.compile(r'((?:(?:[12][ab]|3b?)\s)+\w+|(?:(?:[12][ab]|3b?))+(?=\.|$))')
>>> reg.findall(s)
['1a 2a 3 xx', '1b yyy', '2b']
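Also, if your numbers and example substrings are just placeholders, you can use [0-9][a-z]? as a more general pattern (the output below comes from a test string that additionally contains '5h 9 7y examole'):
>>> reg=re.compile(r'((?:[0-9][a-z]?\s)+\w+|(?:[0-9][a-z]?)+(?=\.|$))')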
>>> reg.findall(s)
['1a 2a 3 xx', '1b yyy', '5h 9 7y examole', '2b']
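The question also asks for each group to be reported separately, with None for a group that is absent. A minimal sketch of one way to get there: match each run with a regex along the lines of the lookahead version above, then sort the tokens into groups afterwards. The GROUPS word sets, the classify helper, and the leading \b are assumptions made for illustration and are not part of the answer.

```python
import re

s = 'some text 1a 2a 3 xx sometext 1b yyy some text 2b.'

# Hypothetical word sets standing in for the asker's three groups.
GROUPS = {
    'one': {'1a', '1b'},
    'two': {'2a', '2b'},
    'three': {'3', '3b'},
}

MARKER = r'(?:[12][ab]|3b?)'
# Group 1: run of markers, group 2: the word after them,
# group 3: a bare marker directly before '.' or the end of the string.
RUN = re.compile(rf'\b(?:((?:{MARKER}\s)+)(\w+)|({MARKER})(?=\.|$))')


def classify(match):
    """Return a dict with one entry per group; missing groups stay None."""
    result = {'one': None, 'two': None, 'three': None, 'word': match.group(2)}
    markers = (match.group(1) or match.group(3)).split()
    for token in markers:
        for name, words in GROUPS.items():
            if token in words:
                result[name] = token  # a later token of the same group overwrites
    return result


results = [classify(m) for m in RUN.finditer(s)]
for r in results:
    print(r)
```

Classifying after the match sidesteps the problem that a repeated capture group only keeps its last value.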

Related

Extracting multi values with regex ( Only values, Not Fieldname )

Can someone help me with this regex?
I would like to extract either 1. or 2.
1.
(2624594000) 303 days, 18:32:20.00 <-- Timeticks
.1.3.6.1.4.1.14179.2.6.3.39. <-- OID
Hex-STRING: 54 4A 00 C8 73 70 <-- Hex-STRING (need "Hex-STRING" itself too)
0 <--INTEGER
"NJTHAP027" <- STRING
OR
2.
Timeticks: (2624594000) 303 days, 18:32:20.00
OID: .1.3.6.1.4.1.14179.2.6.3.39
Hex-STRING: 54 4A 00 C8 73 70
INTEGER: 0
STRING: "NJTHAP027"
The field names and values are different each time (the data is variable).
I don't need the field names; I only want to get the values, in order from the top (multi-value).
(?s)[^=]+\s=\s(?<value_v2c>([^=]+)-)
https://regex101.com/r/lsKeEM/2
-> I can't extract the last STRING: "NJTHAP027" at all!
The named group value_v2c is already a capture group, so you can omit the inner capture group.
Currently the pattern requires a - after every value, which is why the last value is never matched. Instead, you can either match the - or assert the end of the string.
As you are using the negated character class [^=]+ and \s, which both already match newlines, you can omit the inline modifier (?s); it only changes what the dot matches.
To match the 2. variation, you can update the pattern to:
[^=]+\s=\s(?<value_v2c>[^=]+)(?:-|$)
To get the 1. version, you can match everything up to and including the colon, as long as it is not part of Hex-STRING:.
Then optionally match Hex-STRING: inside the group.
[^=]+\s=\s(?:(?!Hex-STRING:)[^:])*:?\s*(?<value_v2c>(?:Hex-STRING: )?[^=]+?)(?: -|$)
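A small Python sketch of the difference between requiring the - and allowing the end of the string. The input below is hypothetical (the real data is only available behind the regex101 link), assumed here to be "name = Type: value" pairs joined by " - ". Note that Python spells named groups as (?P<name>...).

```python
import re

# Hypothetical input: "name = Type: value" pairs joined by " - ".
text = (
    'a = Timeticks: (2624594000) 303 days, 18:32:20.00 - '
    'b = OID: .1.3.6.1.4.1.14179.2.6.3.39 - '
    'c = Hex-STRING: 54 4A 00 C8 73 70 - '
    'd = INTEGER: 0 - '
    'e = STRING: "NJTHAP027"'
)

# The value must be followed by '-', so the last pair is lost.
strict = re.compile(r'[^=]+\s=\s(?P<value_v2c>[^=]+)-')
# The value may be followed by '-' or by the end of the string.
relaxed = re.compile(r'[^=]+\s=\s(?P<value_v2c>[^=]+)(?:-|$)')

# strip() drops the space left in front of each '-' separator.
strict_values = [m.group('value_v2c').strip() for m in strict.finditer(text)]
relaxed_values = [m.group('value_v2c').strip() for m in relaxed.finditer(text)]

print(len(strict_values), len(relaxed_values))
for value in relaxed_values:
    print(value)
```

On this sample the stricter pattern misses the final STRING value, while the relaxed one returns all five values in order.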

Use RegEx to search pattern, exclude another

I'm using Regex to include all patterns except one.
My code so far:
import os
import re

for root, dirs, files in os.walk('z:/rod/folder'):
    for name in files:
        currentfile = os.path.join(root, name)
        with open(currentfile) as d:
            text = d.read()
        regex = re.compile(r'2\sFF\s{28}\d\sLOANS')
        a = regex.findall(text)
        if a:
            # record the path of every file that contains the pattern
            with open('z:/rod/results.txt', 'a') as f:
                f.write(os.path.join(root, name))
                f.write('\n')
This code finds all files that contain '2 FF (any number) LOANS', which is OK, but I do not want any files that have a zero in that string, for example:
'2 FF 0 LOANS'
If a file has any other number in the string, such as '2 FF 75 LOANS', that is OK. But I do not want '2 FF 0 LOANS'.
Does this make sense? Please help me finish the code.
You can try the following regex:
2\sFF\s(?!0 )\d+\sLOANS
\s(?!0 ) will only match the space if it's not followed by a zero and a space. With this regex, strings with multiple zeros will still be matched; for example, 2 FF 0000 LOANS will be matched.
If you do not want this you can use this:
2\sFF\s(?!0+ )\d+\sLOANS
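To see how the two lookaheads differ, here is a short Python check. The sample strings and the variable names are made up for illustration.

```python
import re

# Rejects only a single zero followed by a space.
single_zero = re.compile(r'2\sFF\s(?!0 )\d+\sLOANS')
# Rejects any run of zeros followed by a space.
any_zeros = re.compile(r'2\sFF\s(?!0+ )\d+\sLOANS')

samples = ['2 FF 75 LOANS', '2 FF 0 LOANS', '2 FF 0000 LOANS', '2 FF 10 LOANS']

# Map each sample to (matched by first regex, matched by second regex).
results = {
    text: (bool(single_zero.search(text)), bool(any_zeros.search(text)))
    for text in samples
}

for text, flags in results.items():
    print(text, flags)
```

A number such as 10 still passes both patterns, because the lookahead only fails when the zeros are immediately followed by a space.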

Regex pattern to match "AA BB CC DD"

I have a hexadecimal string with space separator for each byte.
eg., A1 B2 C3 D4 E5 FF 00 11 22 33 44 ...
I would like to use a regex validator to verify whether the user input is correct.
How could I write the regular expression to achieve this goal?
Something like this:
^[A-F0-9]{2}( [A-F0-9]{2})*$
Explanation:
^ - anchor: string start
[A-F0-9]{2} - two symbols in either 0..9 or A..F range
( [A-F0-9]{2})* - followed by space and two 0..9 or A..F symbols zero or more times
$ - anchor: string end
If you also allow a..f as valid hexadecimal symbols:
^[A-Fa-f0-9]{2}( [A-Fa-f0-9]{2})*$
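As a rough illustration, the same pattern could be wrapped in a small Python validator. The function name is_hex_bytes and the allow_lowercase parameter are hypothetical; re.fullmatch stands in for the ^ and $ anchors, which also rules out a trailing newline.

```python
import re


def is_hex_bytes(text, allow_lowercase=False):
    """Return True if text is space-separated two-digit hex bytes."""
    flags = re.IGNORECASE if allow_lowercase else 0
    # fullmatch requires the whole string to match, like ^...$ in the answer.
    return re.fullmatch(r'[A-F0-9]{2}(?: [A-F0-9]{2})*', text, flags) is not None


print(is_hex_bytes('A1 B2 C3 D4 E5 FF 00 11 22 33 44'))
print(is_hex_bytes('A1  B2'))   # double space
print(is_hex_bytes('A1 B'))     # incomplete byte
print(is_hex_bytes('a1 b2', allow_lowercase=True))
```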
I would like to propose a solution based on the DRY principle (Don't Repeat Yourself).
Instead of writing the same pattern (as Dmitry proposed), you can:
Write the pattern for 2 hex digits as a capturing group - ([A-F0-9]{2}).
"Call" it again using (?1).
So the whole pattern can be ^([A-F0-9]{2})( (?1))*$.
There are also other variants of "calling" a capturing group, e.g.
(?-1) - call the preceding group or
(?&name) - call a named group.
For details see https://www.regular-expressions.info/subroutine.html
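Python's built-in re module does not support subroutine calls such as (?1), so there the same DRY idea is usually approximated by composing the pattern from a string. A minimal sketch, with HEX_BYTE and HEX_LINE as illustrative names:

```python
import re

HEX_BYTE = r'[A-F0-9]{2}'  # the two-hex-digit pattern, written once
# Reuse the fragment instead of typing it twice.
HEX_LINE = re.compile(rf'{HEX_BYTE}(?: {HEX_BYTE})*')

print(HEX_LINE.pattern)
print(bool(HEX_LINE.fullmatch('A1 B2 C3')))
print(bool(HEX_LINE.fullmatch('A1 B2 C')))
```

This only avoids the repetition in the source code; the compiled pattern is the same as the one written out by hand.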

Matching repetition with capturing groups returns only the final match value

I am trying to capture a recursive sequence using matching repetition with capturing groups.
Here is a sample text:
some_string_here 00 12.34 34 56.78 78.90
The regexp used to capture all the floating point values:
\S+(?:\s+(\d+(?:\.\d+)?)){5}
The regexp matches all the floating point values as expected but the capturing group returns only the final match result.
Group #1: 78.90
The required result is:
Group #1: 00
Group #2: 12.34
Group #3: 34
Group #4: 56.78
Group #5: 78.90
If I use the following regexp, the result is as expected, but with many repetitions the regexp becomes too long.
\S+(?:\s+(\d+(?:\.\d+)?))(?:\s+(\d+(?:\.\d+)?))(?:\s+(\d+(?:\.\d+)?))(?:\s+(\d+(?:\.\d+)?))(?:\s+(\d+(?:\.\d+)?))
Is there a way to capture all of the floating point values in a matching repetition with capturing groups?
Try this
$s = "some_string_here 00 12.34 34 56.78 78.90";
@ar = $s =~m/(\d+\.?(?:\d+)?)/g;
$, = "\n";
print @ar;
The g flag returns all possible matches as a list, which is then stored in the array.
Without the g (global) modifier, the match returns only one element, 00, because the search stops at the first match.
output
00
12.34
34
56.78
78.90
Alternatively, if you want to store only a particular number of elements, assign the list to a list of variables.
For example, to store only the first three matches:
($first,$second,$third) = $s =~m/(\d+\.?(?:\d+)?)/g;
Here $first holds 00, $second holds 12.34 and $third holds 34.
As I mentioned in my comment, you probably just want split, like this
my $s = 'some_string_here 00 12.34 34 56.78 78.90';
my @groups = split ' ', $s;
shift @groups;
for my $i ( 0 .. $#groups ) {
    printf "Group #%d: %-s\n", $i+1, $groups[$i];
}
output
Group #1: 00
Group #2: 12.34
Group #3: 34
Group #4: 56.78
Group #5: 78.90
Try this
(\d+\.?\d*)
Demo
Input
some_string_here 00 12.34 34 56.78 78.90
Output
MATCH 1
1. [18-20] `00`
MATCH 2
1. [22-27] `12.34`
MATCH 3
1. [29-31] `34`
MATCH 4
1. [33-38] `56.78`
MATCH 5
1. [40-45] `78.90`
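For comparison, the same behaviour can be reproduced in Python: a repeated capture group keeps only its last value, while a global search (re.findall) returns every match. This is a sketch with illustrative variable names, not a translation of the Perl answers.

```python
import re

line = 'some_string_here 00 12.34 34 56.78 78.90'

# A quantified capture group is overwritten on each repetition.
repeated = re.match(r'\S+(?:\s+(\d+(?:\.\d+)?)){5}', line)
last_only = repeated.groups()

# Drop the leading label, then collect every number with a global search.
label, rest = line.split(None, 1)
values = re.findall(r'\d+(?:\.\d+)?', rest)

print(last_only)
for i, value in enumerate(values, start=1):
    print(f'Group #{i}: {value}')
```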

Haskell Regex non capture group

I'm using Text.Regex.TDFA on a lazy ByteString to extract some information from a file.
I have to extract each byte from this string:
27 FB D9 59 50 56 6C 8A
Here is what I've tried (my string begins with a space):
(\\ ([0-9A-Fa-f]{2}))+
but I have 2 problems:
Only the last match is returned: [[" 27 FB D9 59 50 56 6C 8A"," 8A","8A"]]
I want to make the outer group a non-capturing one (like ?: in other engines)
Here is my minimal code:
import System.IO ()
import Data.ByteString.Lazy.Char8 as L
import Text.Regex.TDFA
main :: IO ()
main = do
  let input = L.pack " 27 FB D9 59 50 56 6C 8A"
  let entries = input =~ "(\\ ([0-9A-Fa-f]{2}))+" :: [[L.ByteString]]
  print entries
When you attach a multiplier to a capture group, the engine returns only the last match. See rexegg.com/regex-capture.html#groupnumbers for a good explanation.
On the first pass, use this regex, similar to what you were already using (using a case-insensitive option):
^([\dA-F]+) +([\dA-F]+) +(\d+) +([\dA-F]+)(( [\dA-F]{2})+)
Among the resulting matching groups, use the 5th one as the target of a second pass, to extract each individual byte (using a "global" option):
([0-9A-Fa-f]{2})
Then each match will be returned separately.
Note: you don't need to escape the spaces, as you had in your original regex.
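The two-pass idea can be sketched in Python (not Haskell, and not using Text.Regex.TDFA). The input line with its "payload:" prefix is hypothetical, chosen to show why the first pass matters: a single global search over the whole line also picks up hex-looking letters in the prefix.

```python
import re

# Hypothetical line: a text prefix followed by the byte run.
line = 'payload: 27 FB D9 59 50 56 6C 8A'

# First pass: isolate the run of space-separated bytes.
run = re.search(r'(?: [0-9A-Fa-f]{2}\b)+', line)
# Second pass: a global search over the run returns each byte separately.
byte_values = re.findall(r'[0-9A-Fa-f]{2}', run.group(0))

# Single pass over the whole line, for comparison.
naive = re.findall(r'[0-9A-Fa-f]{2}', line)

print(byte_values)
print(naive)
```

Here the single pass also returns 'ad' from the word "payload", which the first pass filters out.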