Regex currency Python 3.5 - regex

I am trying to reformat the Euro currency in text data. The original format is like this: EUR 3.000.00 or also EUR 33.540.000.- .
I want to standardise the format to €3000.00 or €33540000.00.
I have reformatted EUR 2.500.- successfully using this code:
import re
format1 = "a piece of text with currency EUR 2.500.- and some other information"
regexObj = re.compile(r'EUR\s*\d{1,3}[.](\d{3}[.]-)')
text1 = regexObj.sub(lambda m:"\u20ac"+"{:0.2f}".format(float(re.compile('\d+(.\d+)?(\.\d+)?').search(m.group().replace('.','')).group())),format1)
Out: "a piece of text with currency €2500.00 and some other information"
This gives me €2500.00 which is correct. I've tried applying the same logic to the other formats to no avail.
format2 = "another piece of text EUR 3.000.00 and EUR 5.000.00. New sentence"
regexObj = re.compile('\d{1,3}[.](\d{3}[.])(\d{2})?')
text2 = regexObj.sub(lambda m:"\u20ac"+"{:0.2f}".format(float(re.compile('\d+(.\d+)?(\.\d+)?').search(m.group().replace('.','')).group())),format2)
Out: "another piece of text EUR €300000.00 and EUR €500000.00. New sentence"
and
format3 = "another piece of text EUR 33.540.000.- and more text"
regexObj = regexObj = re.compile(r'EUR\s*\d{1,3}[.](\d{3}[.])(\d{3}[.])(\d{3}[.]-)')
text3 = regexObj.sub(lambda m:"\u20ac"+"{:0.2f}".format(float(re.compile('\d+(.\d+)?(.\d+)?').search(m.group().replace('.','')).group())),format3)
Out: "another piece of text EUR 33.540.000.- and more text"
I think the problem might be with the regexObj.sub(), as the .format() part of it is confusing me. I've tried to change re.compile('\d+(.\d+)?(.\d+)?') within that, but I can't seem to generate the result I want. Any ideas much appreciated. Thanks!

Let's start with the regex. My proposal is:
EUR\s*(?:(\d{1,3}(?:\.\d{3})*)\.-|(\d{1,3}(?:\.\d{3})*)(\.\d{2}))
Details:
EUR\s* - The starting part.
(?: - Start of a non-capturing group - a container for alternatives.
( - Start of capturing group #1 (integer part with ".-" instead of the decimal part).
\d{1,3} - 1 to 3 digits.
(?:\.\d{3})* - ".ddd" part, 0 or more times.
) - End of group #1.
\.- - ".-" ending.
| - Alternative separator.
( - Start of a capturing group #2 (Integer part)
\d{1,3}(?:\.\d{3})* - Like in alternative 1.
) - End of group #2.
(\.\d{2}) - Capturing group #3 (dot and decimal part).
) - End of the non-capturing group.
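The breakdown above can also be written into the pattern itself. The following is a minimal sketch using re.VERBOSE, which ignores whitespace in the pattern and allows inline comments. The name EUR_PATTERN and the sample strings are only for illustration.

```python
import re

# Same pattern as above, spread over several lines with re.VERBOSE.
EUR_PATTERN = re.compile(r"""
    EUR\s*                      # literal EUR and optional whitespace
    (?:
        (\d{1,3}(?:\.\d{3})*)   # group 1: integer part ...
        \.-                     # ... followed by a dot and a dash
      |
        (\d{1,3}(?:\.\d{3})*)   # group 2: integer part ...
        (\.\d{2})               # group 3: ... followed by a dot and two decimals
    )
""", re.VERBOSE)

# Show which groups are filled for each input format.
for sample in ("EUR 2.500.-", "EUR 3.000.00", "EUR 33.540.000.-"):
    m = EUR_PATTERN.search(sample)
    print(sample, "->", m.groups())
# EUR 2.500.- -> ('2.500', None, None)
# EUR 3.000.00 -> (None, '3.000', '.00')
# EUR 33.540.000.- -> ('33.540.000', None, None)
```

Only one alternative matches at a time, so either group 1 is set, or groups 2 and 3 are.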
Instead of a lambda I used an ordinary replacement function, which I called repl. It has two branches: one for group 1 and one for groups 2 + 3.
In both branches the dots in the integer part are deleted, but the final dot (after the integer part) belongs to group 3, so it is kept.
The whole script can look like this:
import re
def repl(m):
    g1 = m.group(1)
    if g1:  # Alternative 1: ".-" instead of decimal part
        res = g1.replace('.', '') + '.00'
    else:   # Alternative 2: integer part (group 2) + decimal part (group 3)
        res = m.group(2).replace('.', '') + m.group(3)
    return "\u20ac" + res
# Source string
src = 'xxx EUR 33.540.000.- yyyy EUR 3.000.00 zzzz EUR 5.210.15 vvvv'
# Regex
pat = re.compile(r'EUR\s*(?:(\d{1,3}(?:\.\d{3})*)\.-|(\d{1,3}(?:\.\d{3})*)(\.\d{2}))')
# Replace
result = pat.sub(repl, src)
print(result)
The result is:
xxx €33540000.00 yyyy €3000.00 zzzz €5210.15 vvvv
As you can see, no need to use float or format.
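As a side note on why the attempt in the question produced €300000.00: removing every dot from the matched text also removes the decimal separator. A small sketch of the difference, using the amount from the question (the variable names are just for illustration):

```python
import re

amount = "3.000.00"

# Stripping every dot also removes the decimal separator.
stripped = amount.replace(".", "")
print(stripped)  # 300000

# Capturing the decimal part separately keeps it intact.
m = re.fullmatch(r"(\d{1,3}(?:\.\d{3})*)(\.\d{2})", amount)
integer_part, decimal_part = m.groups()
fixed = integer_part.replace(".", "") + decimal_part
print(fixed)  # 3000.00
```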

Related

VBA RegEx: How to find the first instance of a number after a specific string and ignore all other characters?

I'm having trouble writing a pattern that picks up what I want. I want to grab the first number that comes after the words 5 Months in my .txt file, ignoring any other characters in between (A-Z, parentheses, $, % etc.). VBA keeps raising an error: "Invalid procedure call or argument".
Currently, I have code that looks like this:
Dim reg4 As Object: Set reg4 = CreateObject("vbscript.regexp")
reg4.Pattern = "5 Months\s*([\d+]\.[\d+])\s*"
Dim MCS As Object
Set MCS = reg4.Execute(myText)
Dim Months5 As String: Months5 = MCS(0).submatches(0) ' the error stems from this line
where mytext is a string that consists of content from a text file. My main problem is that this text file is not always in a standardized format, so when I want to extract the first number after "5 Months" it gives me that error.
The text file could look like:
EXAMPLE 1
5 Months
($) (%) (Months) (%) (%) (%) ($) (Months)
0.00 0.0000 0.000
OR
EXAMPLE 2
5 Months
0.00
0.000
0.000
In both cases, I would ideally be able to extract that first number "0.00" in its entire form, while ignoring any other characters such as (%) or ($) as shown in example 1.
I would like to ask if anyone has any suggestions on how to rewrite the pattern statement so it will be able to pick up the first numeric instance along with the numbers after its decimal point?
Many thanks in advance!
Your regex does not match the strings you showed. You can use
\b5 Months[\s\S]*?(\d+(?:\.\d+)?)
Details:
\b - a word boundary
5 Months - a literal text
[\s\S]*? - any 0 or more chars, as few as possible
(\d+(?:\.\d+)?) - Capturing group 1: one or more digits followed with an optional sequence of a . and one or more digits.
Test run in VBA:
Sub TestFn()
    Dim reg4 As Object: Set reg4 = CreateObject("vbscript.regexp")
    reg4.Pattern = "\b5 Months[\s\S]*?(\d+(?:\.\d+)?)"
    Dim myText As String
    myText = "5 Months" & vbCrLf & vbCrLf & "0.00"
    Dim MCS As Object
    Set MCS = reg4.Execute(myText)
    Dim Months5 As String: Months5 = MCS(0).SubMatches(0)
    Debug.Print (Months5)
End Sub
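For comparison, here is a rough Python sketch of the same pattern. It also guards against the no-match case, since indexing into an empty match collection is what fails in the question. The function name first_number_after_5_months is hypothetical, and the sample texts are the two examples from the question.

```python
import re

PATTERN = re.compile(r"\b5 Months[\s\S]*?(\d+(?:\.\d+)?)")

def first_number_after_5_months(text):
    """Return the first number after '5 Months', or None if there is no match."""
    m = PATTERN.search(text)
    return m.group(1) if m else None

example1 = "5 Months\n($) (%) (Months) (%) (%) (%) ($) (Months)\n0.00 0.0000 0.000"
example2 = "5 Months\n0.00\n0.000\n0.000"

print(first_number_after_5_months(example1))        # 0.00
print(first_number_after_5_months(example2))        # 0.00
print(first_number_after_5_months("no label here")) # None
```

In VBA the equivalent guard would be to check the match count before reading the first match.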

How to find all currency related digits REGEX?

For a string that has free text:
"The shares of the stock at the XKI Market fell by €89.99 today, which saw a drop of a 9€5
from last monday. If they do not level up again to 100€ by the end of this week there might
be serious consequences to the company"
How to find a regex pattern that will extract currency related numbers?
In this case: 89.99, 95, and 100?
So far, I've tried these patterns:
[0-9]*[€.]([0-9]*)
[0-9]{1,3}(?:\.[0-9]{3})*,[0-9][0-9]
[0-9]+\€\.[0-9]+
But these don't seem to be producing exactly what is needed
A simpler solution would be [.\d]*€[.\d]*.
One option is to match all 3 variations and afterwards remove the euro sign from the match.
(?:\d+€\d*|€\d+(?:\.\d+)?)
Explanation
(?: Non capture group
\d+€\d* Match 1+ digit and € followed by optional digits
| Or
€\d+(?:\.\d+)? Match € followed by digits and an optional decimal part
) Close non capture group
For example
import re
regex = r"(?:\d+€\d*|€\d+(?:\.\d+)?)"
test_str = ("\"The shares of the stock at the XKI Market fell by €89.99 today, which saw a drop of a 9€5 \n"
"from last monday. If they do not level up again to 100€ by the end of this week there might \n"
"be serious consequences to the company\"")
print([x.replace("€", "") for x in re.findall(regex, test_str)])
Output
['89.99', '95', '100']
A slightly more precise pattern, which expects optional groups of a comma followed by 3 digits and a mandatory 2-digit decimal part after the € sign, could be:
(?:\d+€\d*|€\d{1,3}(?:,\d{3})*\.\d{2})
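To make the trade-off between the two patterns concrete, here is a small comparison sketch. The sample text and the names loose and strict are made up for illustration.

```python
import re

loose = re.compile(r"(?:\d+€\d*|€\d+(?:\.\d+)?)")
strict = re.compile(r"(?:\d+€\d*|€\d{1,3}(?:,\d{3})*\.\d{2})")

sample = "Prices: €1,234.56, €12 and 7€"  # hypothetical sample text

loose_hits = loose.findall(sample)
strict_hits = strict.findall(sample)

print(loose_hits)   # ['€1', '€12', '7€']
print(strict_hits)  # ['€1,234.56', '7€']
```

The first pattern stops at the thousands separator, while the stricter one handles it but skips amounts without a 2-digit decimal part, such as €12.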
This needs further testing, but I would simply grab everything around € that is not whitespace, that is:
import re
text = """The shares of the stock at the XKI Market fell by €89.99 today, which saw a drop of a 9€5
from last monday. If they do not level up again to 100€ by the end of this week there might
be serious consequences to the company"""
values = re.findall(r"\S*€\S*", text)
print(values)
Output:
['€89.99', '9€5', '100€']
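Because \S also matches punctuation, tokens such as "€5," keep their trailing comma. A possible clean-up step, sketched under the assumption that only common punctuation surrounds the amounts (the sample text is made up):

```python
import re

text = "It cost €5, then rose to 7€. Later: (€89.99)"  # hypothetical sample text

raw = re.findall(r"\S*€\S*", text)
print(raw)  # ['€5,', '7€.', '(€89.99)']

# Drop the euro sign, then strip punctuation left at either end of the token.
cleaned = [token.replace("€", "").strip(".,;:!?()\"'") for token in raw]
print(cleaned)  # ['5', '7', '89.99']
```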

Regex to search and replace whats inside parenthesis

We're trying to find text inside parentheses and replace it with words. In this case every parenthesised text like (R:2379; L:28) should be replaced with (Receipt No.:2379; Ledger No.:28).
The very same text appears on the next line, and that copy should not be touched (I don't know why it's there; this is from an old DOS accounting application).
I came up with /\([R.]]+\)/g, 'Receipt No.' but this is harder than I imagined. How can this be done?
#Ch. No. 209488 #Rt. Date 12-09-1997 #Bank: Citibank (R:2379;L:28)
R:2379;L:28
#Ch. No. 884273 #Dr. Date 10-09-1997 #Ch. Dep. 14-09-1997 #Bank: Citibank (R:2432; L:28)
R:2432; L:28
#Ch. No. 884274 #Dr. Date 10-09-1997 #Ch. Dep. 19-09-1997 #Bank: Citibank (R:2475; L:28)
R:2475; L:28
#Ch. No. 884275 #Dr. Date 10-09-1997 #Ch. Dep. 24-09-1997 #Bank: Citibank (R:2480; L:28)
R:2480; L:28
You can use
\(R:(\d+);\s*L:(\d+)\)
Replace with (Receipt No.:$1; Ledger No.:$2).
Details:
\(R: - (R: text
(\d+) - Group 1: one or more digits
; - a ; char
\s* - 0 or more whitespaces
L: - a literal L: text
(\d+) - Group 2: one or more digits
\) - a ) char.
The $1 is the backreference to Group 1 value and the $2 is the backreference to Group 2 value.
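If the replacement is done in Python rather than in a tool that uses $1 syntax, the backreferences are written \1 and \2. A minimal sketch with the first two lines of the sample data:

```python
import re

data = (
    "#Ch. No. 209488 #Rt. Date 12-09-1997 #Bank: Citibank (R:2379;L:28)\n"
    "R:2379;L:28"
)

# The bare copy on the second line has no parentheses, so it is left untouched.
result = re.sub(
    r"\(R:(\d+);\s*L:(\d+)\)",
    r"(Receipt No.:\1; Ledger No.:\2)",
    data,
)
print(result)
```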

Regex and capturing groups

I have my regex working to the point where I now have two groups of text, group 1 and group 2 - I'm only really interested in the group2 text. In the end how do I get just group 2 to match/display?
("token":")([^,]+)
Something like below should work:
>>> p = re.compile('("token":")([^,]+)')
>>> m = p.match('...')
>>> m.group(2)
This will get the content of the second group. (Taken from here)
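Note that match() only succeeds at the start of the string, so search() is usually what you want when the token sits in the middle of the text. A small sketch with a made-up sample string; it also shows that [^,]+ picks up the closing quote, which a narrower character class avoids:

```python
import re

sample = '{"id":7,"token":"abc123","ttl":60}'  # hypothetical input

p = re.compile(r'("token":")([^,]+)')

print(p.match(sample))            # None, the pattern is not at the start
print(p.search(sample).group(2))  # abc123"

# Excluding the quote instead of the comma gives just the value.
p2 = re.compile(r'"token":"([^"]+)')
print(p2.search(sample).group(1))  # abc123
```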

String Split AND Replace

I am trying to replace a string based on the split portion. This string is a date, where the year should be formatted as a superscript.
Eg. Jan 24, 2014 needs to be split at 2014 then replaced with Jan 24, ^2014^ where 2014 is the superscript.
Example pseudo:
mydate.Split(" ", 2).Replace("^2014^")
But, instead of replacing the new split string, it should be the original (or copy of original). I can't just edit based on index because the formatting may not always be the same, at times the date may be expanded to January 24th, 2014 which would then break the traditional replace by index.
You can try
(?<=[A-Z][a-z]{2} \d{2}, )(\d{4})
Replace with ^$1^ or ^\1^.
I tested it on regexstorm.
If you want to match January 24th, 2014 as well then try
([A-Z][a-z]{2,9} \d{2}[a-z]{0,2}, )(\d{4})
Replace with $1^$2^.
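The capture-group variant can be tried out quickly in Python, where the replacement uses \1 and \2 instead of $1 and $2. This is only a sketch; note that \d{2} in the pattern requires a two-digit day.

```python
import re

DATE_YEAR = re.compile(r"([A-Z][a-z]{2,9} \d{2}[a-z]{0,2}, )(\d{4})")

def mark_year(text):
    """Wrap the 4-digit year of a matching date in carets."""
    return DATE_YEAR.sub(r"\1^\2^", text)

print(mark_year("Jan 24, 2014"))        # Jan 24, ^2014^
print(mark_year("January 24th, 2014"))  # January 24th, ^2014^
```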
You can use a combination of lookarounds to achieve your result.
Regex.Replace(input, "(?<=\d{4})|(?=\d{4})", "^")
Explanation:
(?<= # look behind to see if there is:
\d{4} # digits (0-9) (4 times)
) # end of look-behind
| # OR
(?= # look ahead to see if there is:
\d{4} # digits (0-9) (4 times)
) # end of look-ahead
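The same lookaround idea carries over to Python's re.sub, since both assertions match an empty string at a position. A minimal sketch, assuming the only run of four digits in the input is the year:

```python
import re

marked = re.sub(r"(?<=\d{4})|(?=\d{4})", "^", "Jan 24, 2014")
print(marked)  # Jan 24, ^2014^
```

With longer digit runs the assertions match at several positions, so this is best kept for inputs where the year is the only number of four or more digits.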
Normalize your date string by assigning it to a Date variable, then do the formatting from there.
Dim dt As Date = "Jan 24, 2014"
Dim s As String = dt.ToShortDateString.Replace("2014", "^2014^")
MsgBox(s)
' or '
s = dt.Month.ToString & "/" & dt.Day.ToString & "/^" & dt.Year.ToString & "^"
MsgBox(s)
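The parse-then-format idea looks roughly like this in Python. This is a sketch under two assumptions: the input follows the "Jan 24, 2014" layout, and the default (English) locale is in effect for month names. The helper name superscript_year is hypothetical.

```python
from datetime import datetime

def superscript_year(date_text, fmt="%b %d, %Y"):
    """Parse the date, then rebuild it with the year wrapped in carets."""
    dt = datetime.strptime(date_text, fmt)
    return f"{dt:%b} {dt.day}, ^{dt.year}^"

print(superscript_year("Jan 24, 2014"))  # Jan 24, ^2014^
```

Inputs in another layout, such as January 24th, 2014, would need a different format string or some pre-processing before parsing.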
IMO regex is write-once code that is difficult to debug and maintain.