find a function with regex in python - regex

I'm having trouble finding the "options" object in a jQuery UI widget file with a Python regex. I read the file with the "os" module and store its contents in a variable.
The problem is (I think) the tab, space and end-of-line characters.
I tried something like:
resp = re.findall( r'options\s?:\s?\{.×\n\t×},', myfile, flags=re.MULTILINE|re.DOTALL )
(× stands for the multiplication symbol, i.e. the asterisk *)
to find the options{
kA: vA,
kB : vB,
...etc....
}
object in the widget.
But it doesn't work. It either appends the rest of the file to the result or finds nothing (when I change the regex). If I put the last word of the object in the pattern, it works!
But every other attempt fails.
Does anyone have an idea?
Thanks, and have a happy new year!

This works:
/^\s*options{[^}]*}/mg
# explanation:
^ assert position at start of a line
\s* match any white space character [\r\n\t\f ]
Quantifier: Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
options{ matches the characters options{ literally (case sensitive)
[^}]* match a single character not present in the list below
Quantifier: Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
} the literal character }
} matches the character } literally
Demo:
import re
txt='''\
nothing(blah)
options{ kA: vA, kB : vB, ...etc.... }
options{ kA: vA, kB : vB, ...etc.... }
blah blah
options{ kA: vA, kB : vB, ...etc.... } # tab'''
print(re.findall(r'^\s*(options{[^}]*})',txt, re.S | re.M))
# ['options{ kA: vA, kB : vB, ...etc.... }', 'options{ kA: vA, kB : vB, ...etc.... }', 'options{ kA: vA, kB : vB, ...etc.... }']
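The pattern in the question combines a greedy .× (that is, .*) with re.DOTALL, so the match can run on to the last brace in the file, which would explain why the rest of the file ends up in the result. Below is a minimal sketch of the negated-class approach, adapted to the options: { ... } spelling that the question's pattern implies. The widget text and the name find_options are made up for illustration.

```python
import re

# Hypothetical widget source; only the "options: { ... }" shape comes from the question.
widget_src = """\
$.widget("ui.demo", {
\toptions: {
\t\tkA: vA,
\t\tkB : vB
\t},
\t_create: function() {
\t\tvar x = {};
\t}
});
"""

def find_options(src):
    # [^}]* cannot run past the first closing brace, unlike a greedy .* with DOTALL.
    return re.findall(r'options\s*:\s*\{[^}]*\}', src)

# For comparison: the greedy DOTALL variant swallows everything up to the last brace.
greedy = re.findall(r'options\s*:\s*\{.*\}', widget_src, flags=re.DOTALL)

print(find_options(widget_src))
print(greedy)
```

An options object that itself contains nested braces would still defeat the negated class, which is where a real parser becomes the safer choice.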
The more robust solution is to actually parse the file. The regex can be combined with something like pyparsing for a better solution:
import re
import pyparsing as pp
txt='''\
nothing(blah)
options{ kA: vA, kB : vB}
options{ kA: vA, kB : vB}
blah blah
options{ kA: vA, kB : vB } # tab
options{
kA: vA,
kB: vB,
kC : vC
}
'''
ident = pp.Word(pp.alphas+"_", pp.alphanums+"_")
comma = pp.Suppress(',')
pair=ident+pp.Suppress(':')+ident
pair_list=pp.OneOrMore(pair)+pp.ZeroOrMore(comma+pair)
options=(pp.Suppress('{')+
pair_list+
pp.Suppress('}'))
for om in (m.group(1) for m in re.finditer(r'^\s*options({[^}]*})',txt, re.S | re.M)):
    res=options.parseString(om)
    data=dict(res[i:i+2] for i in range(0,len(res),2))
    print('"{}"=>{}==>{}'.format(om,res,data))
Prints:
"{ kA: vA, kB : vB}"=>['kA', 'vA', 'kB', 'vB']==>{'kB': 'vB', 'kA': 'vA'}
"{ kA: vA, kB : vB}"=>['kA', 'vA', 'kB', 'vB']==>{'kB': 'vB', 'kA': 'vA'}
"{ kA: vA, kB : vB }"=>['kA', 'vA', 'kB', 'vB']==>{'kB': 'vB', 'kA': 'vA'}
"{
kA: vA,
kB: vB,
kC : vC
}"=>['kA', 'vA', 'kB', 'vB', 'kC', 'vC']==>{'kC': 'vC', 'kB': 'vB', 'kA': 'vA'}
Properly parsing takes care of all the whitespace for you and validates in one step.

Based on your sample "options{ kA: vA, kB : vB, ...etc.... }", a rough regex is
# r'options\s*\{(?:[^:]*:[^,]*(?:,[^:]*:[^,]*)*[,\s]*)?\}'
options \s*
\{
(?:
[^:]* : [^,]*
(?: , [^:]* : [^,]* )*
[,\s]*
)?
\}
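As a sketch of how the expanded layout above could be used directly in Python, the pattern can be compiled with re.VERBOSE, which ignores the whitespace and allows comments. The name OPTIONS_RE, the sample string and the comments are illustrative additions, not part of the answer.

```python
import re

# The rough pattern from above, compiled in verbose mode so the layout can be kept.
OPTIONS_RE = re.compile(r'''
    options \s*
    \{
    (?:
        [^:]* : [^,]*               # first key/value pair
        (?: , [^:]* : [^,]* )*      # further pairs, each preceded by a comma
        [,\s]*                      # optional trailing commas and whitespace
    )?
    \}
''', re.VERBOSE)

sample = 'options{ kA: vA, kB : vB }'
m = OPTIONS_RE.search(sample)
print(m.group(0) if m else 'no match')
```

Because [^,]* can also cross a closing brace, this stays a rough match and may over-match when several objects follow each other.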

Switching multi-language substring positions using regex

Raw input with lithuanian letters:
Ą.BČ
Ą.BČ D Ę
Ą. BČ
Ą. BČ D Ę
Ą BČ
Ą BČ D Ę
Examples below should not be affected.
ĄB ČD DĘ
Expected result:
BČ Ą.
BČ Ą. D Ę
BČ Ą.
BČ Ą. D Ę
BČ Ą
BČ Ą D Ę
ĄB ČD DĘ
What I've tried:
^(.\.? *)([\p{L}\p{N}\p{M}]*)$
With ReplaceAllString substitution like so
$2 $1
I have tried various patterns, but this is the best I could come up with for now.
It manages to capture the 1st, 3rd and 5th lines and substitutes them successfully (except for some extra spaces at the end of the lines):
BČ Ą.
Ą.BČ D Ę
BČ Ą.
Ą. BČ D Ę
BČ Ą
Ą BČ D Ę
ĄB ČD DĘ
Explanation:
There is a set of data with varying entries of the underlying basic
structure [FIRST NAME FIRST LETTER][LASTNAME] which I want to ideally
bring to [LASTNAME][SPACE][FIRST NAME FIRST LETTER][DOT]?
Final solution:
^([\p{L}\p{N}\p{M}](?:\. *| +))([\p{L}\p{N}\p{M}]+)
With ReplaceAllString substitution like so
$2 $1
For your example data, you can omit the anchor $ and match either a dot followed by optional spaces, or 1 or more spaces.
To prevent an empty match for the character class, you can repeat it 1 or more times using + instead of *
^(.(?:\. *| +))([\p{L}\p{N}\p{M}]+)
Note that the . can match any char including a space. You might also change the dot to a single [\p{L}\p{N}\p{M}]
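For readers who want to try the idea outside Go, here is a rough Python sketch. It is a variation, not the pattern above verbatim: Python's re module has no \p{...} classes, so [^\W_] is used as an approximation of [\p{L}\p{N}] (combining marks are not covered), and the dot is captured in its own group so that no stray spaces are carried into the replacement. The names NAME_RE, raw and result are illustrative.

```python
import re

# [^\W_] means "a word character other than underscore"; it stands in for
# [\p{L}\p{N}] because Python's re has no \p{...} classes.
NAME_RE = re.compile(r'^([^\W_])(?:(\.) *| +)([^\W_]+)', re.MULTILINE)

raw = """\
Ą.BČ
Ą.BČ D Ę
Ą. BČ
Ą. BČ D Ę
Ą BČ
Ą BČ D Ę
ĄB ČD DĘ"""

# Group 2 (the dot) is empty when the initial is followed by spaces only.
result = NAME_RE.sub(r'\3 \1\2', raw)
print(result)
```

The last line is left untouched because its first character is followed by neither a dot nor a space.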

Regex to get total price with space as separator

I need to build a regex that captures the total price. Here are some examples:
Total: 145.01 $
Total: 1 145.01 $
Total: 00.01 $
Total: 12 345.01 $
It needs to capture any price that follows 'Total: ', without the '$'.
This is what I have so far: (?<=\bTotal:\s*)(\d+.\d+)
I assume:
each string must begin with 'Total:   ' (three spaces), the prefix;
the last digit in the string must be followed by ' $' (one space), the suffix, which is at the end of the string;
the substring between the prefix and suffix must end with '.dd', where 'd' represents any digit, the cents;
the substring between the prefix and cents must match one of the following patterns, where 'd' represents any digit: 'd', 'dd', 'ddd', 'd ddd', 'dd ddd', 'ddd ddd', 'd ddd ddd', 'dd ddd ddd', 'ddd ddd ddd', 'd ddd ddd ddd' and so on;
the return value is the substring between the prefix and suffix that meets the above requirements; and
spaces will be removed from the substring returned as a separate step at the end.
We can use the following regular expression.
r = /\ATotal: {3}(\d{1,3}(?: \d{3})*\.\d{2}) \$\z/
In Ruby (but if you don't know Ruby you'll get the idea):
arr = <<~_.split(/\n/)
Total: 145.01 $
Total: 1 145.01 $
Total: 00.01 $
Total: 12 345.01 $
Total: 1 241 345.01 $
Total: 1.00 $
Total: 1.00$
Total: 1.00 $x
My Total: 1.00 $
Total: 12 34.01 $
_
The following matches each string in the array arr and extracts the contents of capture group 1, which is shown on the right side of each line.
arr.each do |s|
puts "\"#{(s + '"[r,1]').ljust(30)}: #{s[r,1] || 'no match'}"
end
"Total: 145.01 $"[r,1] : 145.01
"Total: 1 145.01 $"[r,1] : 1 145.01
"Total: 00.01 $"[r,1] : 00.01
"Total: 12 345.01 $"[r,1] : 12 345.01
"Total: 1 241 345.01 $"[r,1] : 1 241 345.01
"Total: 1.00 $"[r,1] : no match
"Total: 1.00$"[r,1] : no match
"Total: 1.00 $x"[r,1] : no match
"My Total: 1.00 $"[r,1] : no match
"Total: 12 34.01 $"[r,1] : no match
The regular expression can be written in free-spacing mode to make it self-documenting.
r = /
\A # match the beginning of the string
Total:\ {3} # match 'Total:' followed by 3 spaces
( # begin capture group 1
\d{1,3} # match 1, 2 or 3 digits
(?:\ \d{3}) # match a space followed by 3 digits
* # perform the previous match zero or more times
\.\d{2} # match a period followed by 2 digits
) # end capture group 1
\ \$ # match a space followed by a dollar sign
\z # match end of string
/x # free-spacing regex definition mode
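A rough Python equivalent, under the same assumptions as listed above (three spaces after 'Total:', one space before '$'). Here re.fullmatch plays the role of the \A and \z anchors, and the spaces are removed in a separate step. The names TOTAL_RE and total_price are made up for this sketch.

```python
import re

# Same pattern as r above; fullmatch replaces the \A ... \z anchors.
TOTAL_RE = re.compile(r'Total: {3}(\d{1,3}(?: \d{3})*\.\d{2}) \$')

def total_price(line):
    """Return the price without its space separators, or None if the line does not match."""
    m = TOTAL_RE.fullmatch(line)
    return m.group(1).replace(' ', '') if m else None

print(total_price('Total:   1 241 345.01 $'))
print(total_price('Total:   12 34.01 $'))
```

If the real data has a different number of spaces after 'Total:', the {3} quantifier is the part to adjust.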

Python Regex - extracting the sentence that contains asterisk

test_string: '**Amount** : $25k **Name** : James **Excess** : None Returned \n **In Suit?** Y **Venue** : SF **Insurance** : N/A \n **FTSA** : None listed'
import re
regex = r"(?:^|[^.?*,!-]*(?<=[.?\s*,!-]))(n/a)(?=[\s.?*!,-])[^.?*,!-]*[.?*,!-]"
subst = ""
result = re.sub(regex, subst, test_str, 0, re.IGNORECASE | re.MULTILINE)
I tried to extract '**Insurance** : N/A' from the string, but my code above doesn't work. How can I make it work?
Thanks in advance!
I would treat the content like a (semi-structured) key-value file format.
You can match the key-value pairs with a regex like this:
(\*\*[a-zA-Z ?]+\*\*) : ((?:(?!\*\*).)*)(?= |$)
Explanation:
(\*\*[a-zA-Z ?]+\*\*) the key: you may have to adjust the character range
 : the key-value separator, surrounded by spaces
((?:(?!\*\*).)*) the value, captured with a tempered greedy token: everything that is not a literal **
(?= |$) a lookahead that requires the value to be followed by a separating space or the end of the line $
Sample Code:
import re
regex = r"(\*\*[a-zA-Z ?]+\*\*) : ((?:(?!\*\*).)*)(?= |$)"
test_str = "**Amount** : $25k **Name** : James **Excess** : None Returned \\n **In Suit?** : Y **Venue** : SF **Insurance** : N/A \\n **FTSA** : None listed"
matches = re.finditer(regex, test_str, re.MULTILINE)
for match in matches:
    if match.group(1) == "**Insurance**":
        print(match.group(2))
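If all fields are of interest, the same key-value idea can be sketched as a dictionary lookup. This is a variation on the pattern above rather than the answer's exact regex: it uses a lazy value, strips the surrounding whitespace and stops at the next ** or at the end of the string. The names PAIR_RE and fields are illustrative, and the sample text assumes real newlines and a colon after every key.

```python
import re

text = ("**Amount** : $25k **Name** : James **Excess** : None Returned \n"
        " **In Suit?** : Y **Venue** : SF **Insurance** : N/A \n"
        " **FTSA** : None listed")

# Key between ** ... **, then a colon, then a lazy value that ends right
# before the next ** (or at the end of the text), without trailing whitespace.
PAIR_RE = re.compile(r'\*\*([^*]+)\*\*\s*:\s*(.*?)\s*(?=\*\*|$)', re.DOTALL)

fields = {key.strip(): value for key, value in PAIR_RE.findall(text)}
print(fields['Insurance'])
```

A value that itself contains ** would be cut short, so this only fits data shaped like the sample.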

Parse arbitrary delimiter character using Antlr4

I am trying to create a grammar in Antlr4 that accepts regular expressions delimited by an arbitrary character (as in Perl). How can I achieve this?
To be clear: My problem is not the regular expression itself (which I actually do not handle in Antlr, but in the visitor), but the delimiter characters. I can easily define the following rules to the lexer:
REGEXP: '/' (ESC_SEQ | ~('\\' | '/'))+ '/' ;
fragment ESC_SEQ: '\\' . ;
This will use the forward slash as the delimiter (like it is commonly used in Perl). However, I also want to be able to write a regular expression as m~regexp~ (which is also possible in Perl).
If I had to solve this using a regular expression itself, I would use a backreference like this:
m(.)(.+?)\1
(which is an "m", followed by an arbitrary character, followed by the expression, followed by the same arbitrary character). But backreferences seem not to be available in Antlr4.
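To make the backreference idea concrete, here is a small Python sketch of that regular expression at work. It only illustrates the pattern and says nothing about Antlr4 itself; escaped delimiters inside the body are not handled, and the name DELIMITED_RE and the sample inputs are made up.

```python
import re

# "m", any delimiter character, a lazy body, then the same delimiter again via \1.
DELIMITED_RE = re.compile(r'm(.)(.+?)\1')

parsed = []
for source in ('m~regexp~', 'm/a+b/', 'm#x~y#'):
    delimiter, body = DELIMITED_RE.match(source).groups()
    parsed.append((delimiter, body))
    print(delimiter, body)
```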
It would be even better if I could use pairs of brackets, i.e. m(regexp) or m{regexp}. But since the number of possible bracket types is quite small, this could be solved by simply enumerating all the different variants.
Can this be solved with Antlr4?
You could do something like this:
lexer grammar TLexer;
REGEX
: REGEX_DELIMITER ( {getText().charAt(0) != _input.LA(1)}? REGEX_ATOM )+ {getText().charAt(0) == _input.LA(1)}? .
| '{' REGEX_ATOM+ '}'
| '(' REGEX_ATOM+ ')'
;
ANY
: .
;
fragment REGEX_DELIMITER
: [/~@#]
;
fragment REGEX_ATOM
: '\\' .
| ~[\\]
;
If you run the following class:
import org.antlr.v4.runtime.*;

public class Main {
    public static void main(String[] args) throws Exception {
        TLexer lexer = new TLexer(new ANTLRInputStream("/foo/ /bar\\ ~\\~~ {mu} (bla("));
        // Print the symbolic token name next to the matched text
        for (Token t : lexer.getAllTokens()) {
            System.out.printf("%-20s %s\n", TLexer.VOCABULARY.getSymbolicName(t.getType()), t.getText().replace("\n", "\\n"));
        }
    }
}
you will see the following output:
REGEX /foo/
ANY
ANY /
ANY b
ANY a
ANY r
ANY \
ANY
REGEX ~\~~
ANY
REGEX {mu}
ANY
ANY (
ANY b
ANY l
ANY a
ANY (
The {...}? is called a predicate:
Syntax of semantic predicates in Antlr4
Semantic predicates in ANTLR4?
The ( {getText().charAt(0) != _input.LA(1)}? REGEX_ATOM )+ part tells the lexer to continue matching characters as long as the character matched by REGEX_DELIMITER is not ahead in the character stream. And {getText().charAt(0) == _input.LA(1)}? . makes sure there actually is a closing delimiter that equals the first character (which is a REGEX_DELIMITER, of course).
Tested with ANTLR 4.5.3
EDIT
And to get a delimiter preceded by m + some optional spaces to work, you could try something like this (untested!):
lexer grammar TLexer;
@lexer::members {
  boolean delimiterAhead(String start) {
    return start.replaceAll("^m[ \t]*", "").charAt(0) == _input.LA(1);
  }
}
REGEX
: '/' ( '\\' . | ~[/\\] )+ '/'
| 'm' SPACES? REGEX_DELIMITER ( {!delimiterAhead(getText())}? ( '\\' . | ~[\\] ) )+ {delimiterAhead(getText())}? .
| 'm' SPACES? '{' ( '\\' . | ~'}' )+ '}'
| 'm' SPACES? '(' ( '\\' . | ~')' )+ ')'
;
ANY
: .
;
fragment REGEX_DELIMITER
: [~@#]
;
fragment SPACES
: [ \t]+
;

Weird regular pattern behaviour in Python

I have the following program in Python.
import re
data = '''component FA_8 is
port( a : in bit_vector(7 downto 0);
b: in bit_vector(7 downto 0);
s: out bit_vector(7 downto 0);
c: out bit);
end component;'''
m = re.search(r'''component\ +(\w+)\ +is[\ \n]+
port\ *[(]\ +''', data, re.I | re.VERBOSE)
if m:
    print(m.group())
else:
    print("Cant find pattern")
I can't figure out why it is not working. If I change the end of the pattern to port\ *[(]\ * then it matches.
If the quantifier is the only difference, then there is no space at that position in the text. Could it be that there is a tab in the original string?
I would replace the escaped space with \s, which matches any whitespace character: a space, a tab, \r, \n (and other whitespace characters).
m = re.search(r'''component\s+(\w+)\s+is\s+
port\s*[(]\s+''', data, re.I | re.VERBOSE)
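A small sketch of the suspected cause. The whitespace of the real file is not visible in the post, so the sample below assumes a tab after "port(" and spells it out explicitly. The names space_only and any_space are illustrative.

```python
import re

# Assumption: the real file has a tab after "port(", written here as \t.
data = "component FA_8 is\n    port(\ta : in bit_vector(7 downto 0);"

# Escaped spaces only: the tab after "(" is not accepted.
space_only = re.compile(r'''component\ +(\w+)\ +is[\ \n]+
                            port\ *[(]\ +''', re.I | re.VERBOSE)

# \s accepts spaces, tabs and newlines alike.
any_space = re.compile(r'''component\s+(\w+)\s+is\s+
                           port\s*[(]\s+''', re.I | re.VERBOSE)

print(space_only.search(data))
print(any_space.search(data).group(1))
```

Printing repr(data) on the real input is a quick way to check which whitespace characters are actually there.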