Python regex matching enumerated lists - regex

I have a Python string of the following format:
string = 'Some text.\n1. first item\n2. second item\n3. third item\nSome more text.'
What I want to match is the substring \n1. first item\n2. second item\n3. third item, effectively, the enumerated list within the string. For my purposes, I do not necessarily need to match the first \n.
What I've tried so far:
re.findall('\n.*\d\..*', string, re.DOTALL)
re.findall('\n.*\d\..*?', string, re.DOTALL)
The first case finds the last line of the text which I don't want, and the second case doesn't find the rest of line 3. The key difficulty I'm facing is that I don't know how to make the first .* greedy (and match over newlines) but make the second .* simply match up to a newline.
Note: The number of items in the enumerated string is unknown so I can't just match three numbered lines. It could be any number of lines. The string provided is simply an example which happens to have three enumerated items.

How about using line-wise matching and a filter?
import re

string = 'Some text.\n1. first item\n2. second item\n3. third item\nSome more text.'
# A line counts as enumerated if it starts with digits, a period, and whitespace.
is_enumerated = re.compile(r"^\d+\.\s")
matches = list(filter(lambda line: is_enumerated.match(line), string.splitlines()))
# ['1. first item', '2. second item', '3. third item']
You can join the matches with \n, if you want.
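If the list should come back as one substring rather than as separate lines, a single multiline regex is another option. The following is only a sketch: the helper name find_enumerated_block is made up for illustration, and it assumes every item starts at the beginning of a line with digits followed by a period.

```python
import re

# One or more consecutive lines that each start with "<digits>.".
# re.MULTILINE lets ^ match at the start of every line; without re.DOTALL,
# .* stops at the end of the line.
ENUMERATED_BLOCK = re.compile(r"(?:^\d+\..*\n?)+", re.MULTILINE)


def find_enumerated_block(text):
    """Return the first run of consecutive numbered lines, or None (hypothetical helper)."""
    match = ENUMERATED_BLOCK.search(text)
    return match.group(0).rstrip("\n") if match else None


string = 'Some text.\n1. first item\n2. second item\n3. third item\nSome more text.'
block = find_enumerated_block(string)
print(block)
```

Unlike the line filter, this keeps only lines that are adjacent to each other, so a stray numbered line elsewhere in the text would not be merged into the list.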

Related

Using Gsub to get matched strings in R - regular expression

I am trying to extract words after the first space using
species <- gsub(".* ([A-Za-z]+)", "\\1", x = genus)
This works fine for the rows that have two words. However, row [9], "Eulamprus tympanum marnieae", has three words, and my code returns only the last word in the string, "marnieae". How can I extract everything after the first space, so that I get "tympanum marnieae" instead of "marnieae", with the results stored in one variable called species?
genus
[9] "Eulamprus tympanum marnieae"
Your original pattern didn't work because the subpattern [A-Za-z]+ doesn't match spaces, and therefore will only match a single word.
You can use the following pattern to match any number of words (other than 0) after the first, within double quotes:
"[A-Za-z]+ ([A-Za-z ]+)"
https://regex101.com/r/p6ET3I/1
https://regex101.com/r/p6ET3I/2
This is a relatively simple, but imperfect, solution. It will also match trailing spaces, or just one or more spaces after the first word even if a second word doesn't exist. "Eulamprus" followed by six spaces, for example, will successfully match the pattern and return 5 spaces, because the first space is consumed by the literal space in the pattern and the rest are captured. You should only use this pattern if you trust your data to be properly formatted.
A more reliable approach would be the following:
"[A-Za-z]+ ([A-Za-z]+(?: [A-Za-z]+)*)"
https://regex101.com/r/p6ET3I/3
This pattern will capture one word (following the first), followed by any number of additional words (including 0), separated by spaces.
However, from what I remember from biology class, species are only ever comprised of one or two names, and never capitalized. The following pattern will reflect this format:
"[A-Za-z]+ ([a-z]+(?: [a-z]+)?)"
https://regex101.com/r/p6ET3I/4
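For readers who want to try the "more reliable" pattern outside R, here is a minimal sketch in Python. The function name species_of is hypothetical, and the use of fullmatch (so that the whole string must fit the pattern) is my assumption rather than something the answer specifies.

```python
import re

# First word, one space, then one or more further words separated by single spaces.
SPECIES = re.compile(r"[A-Za-z]+ ([A-Za-z]+(?: [A-Za-z]+)*)")


def species_of(name):
    """Return everything after the first word, or None if the name does not fit (hypothetical helper)."""
    match = SPECIES.fullmatch(name)
    return match.group(1) if match else None


print(species_of("Eulamprus tympanum marnieae"))
print(species_of("Eulamprus quoyii"))
print(species_of("Eulamprus "))
```

The last call shows the difference from the simpler pattern: a genus followed only by spaces is rejected instead of returning whitespace.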

Why does one word break the otherwise correct output in a regex (Perl)?

I want to understand some regular expression behavior in Perl.
$str = "123-abc 23-rr";
I need to capture the words on both sides of each minus sign.
The regular expression is:
@mas = $str =~ /(?:([\d\w]+)\-([\d\w]+))/gx;
It gives the right output: 123, abc, 23, rr.
But if I change the string slightly and put a word at the start:
$str = "word 123-abc 23-rr";
and change my regexp to account for this first word:
@mas = $str =~ /\w+\s(?:\s*([\d\w]+)\-([\d\w]+))*/gx;
the output should be the same, but I get only 23, rr. If I remove \s* or *, the output is 123, abc, which is still not right. Does anyone know why?
Rather than making an ever more specific regex for an ever more specific string, consider taking advantage of the overall pattern.
Each piece is separated by whitespace.
The first piece is a word.
The rest are pairs separated by dashes.
First split the pieces on whitespace.
my @pieces = split /\s+/, $str;
Then remove the first piece; it doesn't need to be split.
my $word = shift @pieces;
Then split each remaining piece on - into pairs.
my %pairs = map { split /-/, $_ } @pieces;
As for why the output differs: for each match, each capture is returned.
In the first snippet, the pattern matches twice.
123-abc 23-rr
\_____/ \___/
There are two captures, so four (2*2=4) values are returned.
In the second snippet, the pattern matches once.
word 123-abc 23-rr
\________________/
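There are two captures, so two (2*1=2) values are returned. Because the group is repeated with *, each capture keeps only the text from its last repetition, which is why you get 23, rr.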
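The same behavior can be reproduced in Python, whose re module treats repeated capture groups the same way. This is only an illustrative sketch: the patterns are simplified to \w+ (which already includes digits), and the variable names are mine.

```python
import re

# The pattern is applied repeatedly (like /g in list context in Perl):
# every match contributes its own two captures.
all_pairs = re.findall(r"(\w+)-(\w+)", "123-abc 23-rr")
print(all_pairs)

# One match with a repeated group: each capture keeps only its last repetition.
single = re.search(r"\w+\s(?:\s*(\w+)-(\w+))*", "word 123-abc 23-rr")
print(single.groups())

# The split-based approach: the first piece is the word,
# the remaining pieces are split on "-" into pairs.
word, *pieces = "word 123-abc 23-rr".split()
pairs = [tuple(piece.split("-")) for piece in pieces]
print(word, pairs)
```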

Regex to grab formulas

I am trying to parse a file that contains parameter attributes. The attributes are setup like this:
w=(nf*40e-9)*ng
but also like this:
par_nf=(1) * (ng)
The issue is, all of these parameter definitions are on a single line in the source file, and they are separated by spaces. So you might have a situation like this:
pd=2.0*(84e-9+(1.0*nf)*40e-9) nf=ng m=1 par=(1) par_nf=(1) * (ng) plorient=0
The current algorithm just splits the line on spaces and then for each token, the name is extracted from the LHS of the = and the value from the RHS. My thought is if I can create a Regex match based on spaces within parameter declarations, I can then remove just those spaces before feeding the line to the splitter/parser. I am having a tough time coming up with the appropriate Regex, however. Is it possible to create a regex that matches only spaces within parameter declarations, but ignores the spaces between parameter declarations?
Try this RegEx:
(?<=^|\s) # Start of each formula (start of line OR [space])
(?:.*?) # Attribute Name
= # =
(?: # Formula
(?!\s\w+=) # DO NOT Match [space] Word Characters = (Attr. Name)
[^=] # Any Character except =
)* # Formula Characters repeated any number of times
When checking formula characters, it uses a negative lookahead to check for a Space, followed by Word Characters (Attribute Name) and an =. If this is found, it will stop the match. The fact that the negative lookahead checks for a space means that it will stop without a trailing space at the end of the formula.
Live Demo on Regex101
Thanks to @Andy for the tip:
In this case I'll probably just match on the parameter name and equals, but replace the preceding whitespace with some other "parse-able" character to split on, like so:
(\s*)\w+[a-zA-Z_]=
Now my first capturing group can be used to insert something like a colon, semicolon, or line-break.
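The idea of treating only the whitespace in front of a "name=" as a separator can also be sketched as a lookahead-based split. This is a minimal Python sketch, not the code from either answer: parse_parameters is a hypothetical helper, and it assumes that a formula never contains whitespace followed by word characters and an equals sign.

```python
import re

line = "pd=2.0*(84e-9+(1.0*nf)*40e-9) nf=ng m=1 par=(1) par_nf=(1) * (ng) plorient=0"


def parse_parameters(text):
    """Map each parameter name to its formula text (hypothetical helper)."""
    # Split only on whitespace that is immediately followed by "name=".
    # Spaces inside a formula, such as in "(1) * (ng)", are left alone.
    tokens = re.split(r"\s+(?=\w+=)", text.strip())
    params = {}
    for token in tokens:
        name, _, value = token.partition("=")
        params[name] = value
    return params


params = parse_parameters(line)
for name, value in params.items():
    print(name, "->", value)
```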
You need to add Perl tag. :-( Maybe this will help:
I ended up using this in C#. The idea was to break it into name value pairs, using a negative lookahead specified as the key to stop a match and start a new one. If this helps
var data = @"pd=2.0*(84e-9+(1.0*nf)*40e-9) nf=ng m=1 par=(1) par_nf=(1) * (ng) plorient=0";
var pattern = @"
(?<Key>[a-zA-Z_\s\d]+) # Key is any alpha, digit and _
= # = is a hard anchor
(?<Value>[.*+\-\\\/()\w\s]+) # Value is any combinations of text with space(s)
(\s|$) # Soft anchor of either a \s or EOB
((?!\s[a-zA-Z_\d\s]+\=)|$) # Negative lookahead to stop matching if a space then key then equal found or EOB
";
Regex.Matches(data, pattern, RegexOptions.IgnorePatternWhitespace | RegexOptions.ExplicitCapture)
.OfType<Match>()
.Select(mt => new
{
LHS = mt.Groups["Key"].Value,
RHS = mt.Groups["Value"].Value
});

Why does a regex match return more than normal in nodejs?

If I run this expression in the Node REPL:
"hello".match(/(\w+)(.*)/)
It returns this
[ 'hello',
'hello',
'',
index: 0,
input: 'hello' ]
I expected it to return only the first three items. Where did the other values come from?
The first item in the array is the entire regex match ("group 0"). That's hello, of course.
The second item is the content of the first capturing group's match (\w+). That's hello, again.
The third item is the content of the second capturing group's match (.*). That's the empty string after hello.
index is the position of the start of the match - which is the first character of the string.
input shows you the string that the regex was performed on - which is hello.
It's surprisingly difficult to find docs on this (at least for me), but here's something from MSDN that describes the object returned by a regex match: http://msdn.microsoft.com/en-us/library/ie/7df7sf4x(v=vs.94).aspx:
If the global flag is not set, the array returned by the match method has two properties, input and index. The input property contains the entire searched string. The index property contains the position of the matched substring within the complete searched string.
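For comparison only, the same five pieces of information are available in Python, where they live on a match object instead of on an array with extra properties. This sketch is an analogy, not a description of Node's implementation.

```python
import re

m = re.match(r"(\w+)(.*)", "hello")

whole = m.group(0)    # the entire match ("group 0")
first = m.group(1)    # first capturing group, (\w+)
second = m.group(2)   # second capturing group, (.*)
index = m.start()     # position where the match starts
source = m.string     # the string the regex was applied to

print([whole, first, second], index, source)
```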

Regular Expression: Extract the lines

I am trying to extract name1 (first row), name2 (second row), name3 (third row) and the street name (last row) with a regex:
Company Inc.
JohnDoe
Foobar
Industrieterrein 13
The very last row is the street name and this part is already working (the text is stored in the variable "S2").
REGEXREPLACE(S2, "(.*\n)+(?!(.*\n))", "")
This expression returns the very last line. I am also able to extract the first row:
REGEXREPLACE(S2, "(\n.*)", "")
My problem is that I do not know how to extract the second and third rows.
Also, how do I test whether the text contains one, two, three or more rows?
Update:
The regex is used in the context of Scribe (an ETL tool). The problem is that I cannot execute source code; I only have the following functions:
REGEXMATCH(input, pattern)
REGEXREPLACE(input, pattern, replacement)
If the regex flavor supports lookaheads, you can count rows backwards and get the following (assuming . does not match newline):
(.*)$ # matching the last line
(.*)(?=(\n.*){1}$) # matching the second last line (excl. newline)
(.*)(?=(\n.*){2}$) # matching the third last line (excl. newline)
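To check how the backward-counting lookahead behaves, here is a small Python sketch. The helper line_from_end is hypothetical, and it relies on Python's default semantics where . does not match a newline and $ (without re.MULTILINE) matches only at the end of the string; Scribe's regex flavor may differ.

```python
import re

text = "Company Inc.\nJohnDoe\nFoobar\nIndustrieterrein 13"


def line_from_end(text, k):
    """Return the line that has exactly k lines after it; k=0 is the last line (hypothetical helper)."""
    # The lookahead demands exactly k further "\n + rest of line" groups
    # before the end of the string, without consuming them.
    pattern = r"(.*)(?=(?:\n.*){%d}$)" % k
    match = re.search(pattern, text)
    return match.group(1) if match else None


for k in range(4):
    print(k, line_from_end(text, k))
```

Asking for more lines than exist (here k=4) yields no match, which gives a rough way to test how many rows the text has.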
Just use this regex:
(.+)+
Explanation:
. is a wildcard: it matches any single character except \n.
+ matches the previous element one or more times.
As for a regular expression that will match each of four rows, how about this:
(.*?)\n(.*?)\n(.*?)\n(.*)
Each pair of parentheses captures one row, and each \n matches a newline. Note: depending on the line endings, you may have to use \r\n instead of just \n; try both.
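A quick way to try the four-row pattern is the Python sketch below. The optional \r? is my addition so that one pattern covers both kinds of line ending, and split_rows is a hypothetical helper name.

```python
import re

# Four rows separated by \n or \r\n. The lazy (.*?) groups stop before an
# optional \r so it does not end up inside a capture.
FOUR_ROWS = re.compile(r"(.*?)\r?\n(.*?)\r?\n(.*?)\r?\n(.*)")


def split_rows(text):
    """Return the four rows as a tuple, or None if the text does not have exactly four rows."""
    match = FOUR_ROWS.fullmatch(text)
    return match.groups() if match else None


unix_text = "Company Inc.\nJohnDoe\nFoobar\nIndustrieterrein 13"
windows_text = "Company Inc.\r\nJohnDoe\r\nFoobar\r\nIndustrieterrein 13"

print(split_rows(unix_text))
print(split_rows(windows_text))
print(split_rows("only\nthree\nrows"))
```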
You can try the following:
((.*?)\n){3}