I'm currently parsing data from PDFs and I'd like to get the name and amount in a simple format: [NAME] [AMOUNT]
NAME LAST
7 494 25 7 494 25 199 44
NAME LAST
4 488 00 4 488 00 109 07
NAME MIDDLE LAST
7 854 00 7 854 00 298 25
NAME LAST
494 23 494 23 12 01
NAME MIDDLE LAST
4 301 56 4 301 56 112 61
NAME M LAST
13 359 25 13 359 25 130 54
This data means the following:
[NAME] [M?] [LAST]
[TOTAL WAGES] [PIT WAGES] [PIT WITHHELD]
NAME LAST $7,494.25 $7,494.25 $199.44
NAME LAST $4,488.00 $4,488.00 $109.07
NAME MIDDLE LAST $7,854.00 $7,854.00 $298.25
NAME LAST $494.23 $494.23 $12.01
NAME MIDDLE LAST $4,301.56 $4,301.56 $112.61
NAME M LAST $13,359.25 $13,359.25 $130.54
I'd like a regex to detect the duplicate group of numbers so that it parses to this:
NAME LAST $7,494.25
NAME LAST $4,488.00
NAME MIDDLE LAST $7,854.00
NAME LAST $494.23
NAME MIDDLE LAST $4,301.56
NAME M LAST $13,359.25
Hopefully, that makes sense. Thanks
Assuming that no-one in your organisation is making more than $1M or less than $1, this regex will do what you want:
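 *([a-z][a-z ]+)\R+((\d+)(?: (\d+))? (\d+)) (?=\2).*
It looks for
some number of spaces (the pattern starts with a space followed by *)
names (simplistically) with [a-z][a-z ]+ (captured in group 1); the sample names are upper case, so the match needs the case-insensitive flag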
newline characters (\R+)
2 or 3 sets of digits separated by spaces ((\d+)(?: (\d+))? (\d+)) (captured overall in group 2, with individual groups of digits captured in groups 3, 4 and 5)
a space, followed by an assertion that group 2 is repeated (?=\2)
characters to match the rest of the string to end of line (may not be required, dependent on your application) (.*)
You can replace that with
$1 \$$3$4.$5
to get the following output for your sample data:
NAME LAST $7494.25
NAME LAST $4488.00
NAME MIDDLE LAST $7854.00
NAME LAST $494.23
NAME MIDDLE LAST $4301.56
NAME M LAST $13359.25
Demo on regex101
If you're using JavaScript, you need a couple of minor changes. In the regex, replace \R with [\r\n] as JavaScript doesn't recognise \R. In the substitution, replace \$ with $$.
Demo on regex 101
If your regex flavour supports conditional replacements, you can add a , between the thousands and hundreds by checking if group 4 was part of the match:
$1 \$$3${4:+,}$4.$5
In this case the output is:
NAME LAST $7,494.25
NAME LAST $4,488.00
NAME MIDDLE LAST $7,854.00
NAME LAST $494.23
NAME MIDDLE LAST $4,301.56
NAME M LAST $13,359.25
Demo on regex101
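For anyone doing this from a script rather than an editor, here is a minimal Python sketch of the same approach. It is an adaptation, not the exact regex above: Python's re module does not support \R, so [\r\n]+ stands in for it, and the conditional comma is handled in a replacement function. The names PATTERN, format_total and summarise are made up for this example.

```python
import re

# Python's re module has no \R, so [\r\n]+ stands in for it here.
# IGNORECASE is needed because the sample names are upper case.
PATTERN = re.compile(
    r" *([a-z][a-z ]+)[\r\n]+((\d+)(?: (\d+))? (\d+)) (?=\2).*",
    re.IGNORECASE,
)

def format_total(match):
    """Build 'NAME $amount' from groups 1, 3, 4 and 5 of the match."""
    name, high, low, cents = match.group(1, 3, 4, 5)
    # Group 4 is None when the amount is below 1,000, so no comma is added.
    amount = high if low is None else f"{high},{low}"
    return f"{name} ${amount}.{cents}"

def summarise(text):
    """Replace each name/amounts pair of lines with 'NAME $TOTAL'."""
    return PATTERN.sub(format_total, text)
```

Calling summarise on the extracted text returns one "NAME $TOTAL" line per person. Like the original regex, it assumes totals between $1 and $1M.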
Can someone help me with this regex?
I would like to extract either 1. or 2.
1.
(2624594000) 303 days, 18:32:20.00 <-- Timeticks
.1.3.6.1.4.1.14179.2.6.3.39. <-- OID
Hex-STRING: 54 4A 00 C8 73 70 <-- Hex-STRING (need "Hex-STRING" itself too)
0 <-- INTEGER
"NJTHAP027" <-- STRING
OR
2.
Timeticks: (2624594000) 303 days, 18:32:20.00
OID: .1.3.6.1.4.1.14179.2.6.3.39
Hex-STRING: 54 4A 00 C8 73 70
INTEGER: 0
STRING: "NJTHAP027"
The field names and values differ each time (the data is variable).
I don't need the field names; I only want the values, in order from the top (multi-value).
(?s)[^=]+\s=\s(?<value_v2c>([^=]+)-)
https://regex101.com/r/lsKeEM/2
-> I can't extract the last STRING: "NJTHAP027" at all!
The named group value_v2c is already a group, so you can omit the inner capture group.
Currently the pattern requires a - after every value, so the last value, which has no - after it, is never matched. Instead, you can either match the - or assert the end of the string.
The negated character class [^=]+ and \s both already match newlines, so you can omit the inline modifier (?s).
To match the 2. variation, you can update the pattern to:
[^=]+\s=\s(?<value_v2c>[^=]+)(?:-|$)
Regex demo
To get the 1. version, you can match everything before the colon as long as it is not Hex-STRING.
Then, inside the group, optionally match Hex-STRING: itself.
[^=]+\s=\s(?:(?!Hex-STRING:)[^:])*:?\s*(?<value_v2c>(?:Hex-STRING: )?[^=]+?)(?: -|$)
Regex demo
I'm trying to increase by a fixed amount every page number in a file with the following content (it is an index for a book):
Adam und Eva 42–44 (Abb. 14, 15)
Biblioteca Apostolica Vaticana Cod. gr. 1613 31 31 (Abb. 8)
Hamburg, Staatsbibliothek Ms. 151 in scrin. 35 (Abb. 11)
Transverberation der Hl. Theresa von Ávila 10, 18 (Abb. 2, Detail S. 8)
The file contains numbers for years, figures etc. However, page numbers
are never preceded by "Abb. " or "Ms. "
have 3 digits or less
I'd like to add the number 4 to the page numbers, ideally leading to
Adam und Eva 46–48 (Abb. 14, 15)
Biblioteca Apostolica Vaticana Cod. gr. 1613 35 35 (Abb. 8)
Hamburg, Staatsbibliothek Ms. 151 in scrin. 39 (Abb. 11)
Transverberation der Hl. Theresa von Ávila 14, 22 (Abb. 2, Detail S. 12)
A verbal rule could be: Add 4 to every number if it has 3 digits or less and if it is not preceded by "Abb. |Ms. " or another number which is separated by ", " and, in turn, preceded by "Abb. |Ms. "
The following line
perl -pe 's/(?<!Abb. )(\b\d{1,3}\b)/$1+4/eg' original.md
produces
Adam und Eva 46–48 (Abb. 14, 19)
Biblioteca Apostolica Vaticana Cod. gr. 1613 35 35 (Abb. 8)
Hamburg, Staatsbibliothek Ms. 155 in scrin. 39 (Abb. 11)
Transverberation der Hl. Theresa von Ávila 14, 22 (Abb. 2, Detail S. 12)
Two problems remain, the first one of which is most pressing:
The second figure number on line 1 has of course increased by 4. But I don't know how to fix this. I'm aware that I could expand the middle part to something like (\b\d{1,3}\b),?\s?(\b\d{1,3}\b)? and reference the second number with $2, but I don't know how to deal with the separating comma (if it exists) in the replacement.
The number after "Ms. " has increased by 4. But if I change the negative lookbehind to (?<!(Abb. |Ms. )), I receive the error Variable length lookbehind not implemented in regex m/(?<!(Abb. |Ms. ))(\b\d{1,3}\b)/ at -e line 1. I don't know an alternative to such an implementation.
Any help on these two problems would be much appreciated!
You can use the following regex:
See regex in use here
(?:Abb|Ms)\.\s+\d{1,3}(?:,\s+\d{1,3}\b)*(*SKIP)(*FAIL)|\b\d{1,3}\b
This regex works in the following way:
(?:Abb|Ms) Match either Abb or Ms literally
\.\s+ Match the literal . character, followed by one or more whitespace characters.
\d{1,3} Match between 1 and 3 digits
(?:,\s+\d{1,3}\b)* Match the following non-capture group any number of times:
,\s+\d{1,3}\b Match ,, followed by a whitespace character one or more times, then by 1 to 3 digits and assert the end of the digit using a word boundary
(*SKIP) control verb that causes the regex to give up on the current match if it tries to backtrack past its position (which means that it did match this string and will prevent the second option from matching)
(*FAIL) control verb that forces this match to fail causing the current match to be excluded from the results
The second option is what actually matches: \b\d{1,3}\b - match between 1 and 3 digits asserting each side as a word boundary.
If \b doesn't properly match every location, you may want to replace the leading \b with (?:(?<=\D)|^) and the trailing \b with (?=\D|$):
See regex in use here
(?:Abb|Ms)\.\s+\d{1,3}(?:,\s+\d{1,3}(?=\D|$))*(*SKIP)(*FAIL)|(?:(?<=\D)|^)\d{1,3}(?=\D|$)
These lookbehind/lookaheads work by asserting either a non-digit character or anchor to start/end of string exists in the previous/next position.
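The (*SKIP)(*FAIL) verbs are not available in every engine; Python's built-in re module, for one, does not support them. A rough equivalent, sketched here under that assumption, is to keep the same alternation but capture only the second branch, then let a replacement function return the "Abb."/"Ms." runs unchanged. The names PAGE_OR_SKIP and shift_pages are invented for this example.

```python
import re

# First branch: "Abb." or "Ms." followed by its number list (matched, not captured).
# Second branch: a standalone number of 1 to 3 digits, captured in group 1.
PAGE_OR_SKIP = re.compile(
    r"(?:Abb|Ms)\.\s+\d{1,3}(?:,\s+\d{1,3}\b)*|\b(\d{1,3})\b"
)

def shift_pages(line, offset=4):
    """Add offset to page numbers, leaving figure and manuscript numbers alone."""
    def replace(match):
        if match.group(1) is None:
            # The first branch matched: put the text back unchanged.
            return match.group(0)
        return str(int(match.group(1)) + offset)
    return PAGE_OR_SKIP.sub(replace, line)
```

Because the first branch consumes the whole "Abb. 14, 15" run, the numbers inside it are never offered to the second branch, which is the same effect the control verbs achieve.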
I have some detail fields from which I need to filter out the house number and zip code, while leaving all other numbers in.
example 1:
Van: ION VORM.-VR.TIJD-SOC.TOER HOOGSTRAAT NR. 42 1000 BRUSSEL België IBAN: BE80877459990177 Mededeling: VK 60
example 2:
Van: SYND ABVV-REGIO ANTWERPEN OMMEGANCKSTRAAT 35 2018 ANTWERPEN België IBAN: BE15877800950130 Mededeling: VK38
In the first one I need to filter out 42 and 1000, in the second 35 and 2018.
So basically I need a regex that will filter out the two numbers in the sequence (any)straat(some chars that may include spaces)number(space)number.
Thx
This regex works for your two examples:
.+ ([0-9]+) ([0-9]+) .+
Live example: https://regex101.com/r/zA3fM9/1
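As a rough illustration of how the two captured groups could be used, here is a small Python sketch. It assumes that "filter out" means removing both numbers from the text; ADDRESS_NUMBERS and strip_house_and_zip are names invented for this example.

```python
import re

# Same pattern as above: group 1 is the house number, group 2 the zip code.
ADDRESS_NUMBERS = re.compile(r".+ ([0-9]+) ([0-9]+) .+")

def strip_house_and_zip(detail):
    """Return detail with the house number and zip code removed."""
    match = ADDRESS_NUMBERS.fullmatch(detail)
    if match is None:
        # No "number space number" pair found: leave the field untouched.
        return detail
    # Keep everything before the house number and everything after the
    # space that follows the zip code.
    return detail[:match.start(1)] + detail[match.end(2) + 1:]
```

Note that the pattern is loose: it picks up any two space-separated numbers, so it relies on the address being the only place in the field where that occurs.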
I am trying to add characters at the end of every line. Those characters are a comma and a name (same for all the columns) as well as a number (incrementing from 1 to end number). My columns are not regular and I have many lines so I need to find the expression to use in the Find and Replace.
My document looks like this:
1,-16 37 25.3,65 32 36.1
2,-16 18 5.9,66 6 37.9
3,-16 17 54.3,66 6 58.7
4,-15 59 23.3,66 40 9.2
5,-15 59 8.2,66 40 36.3
I need it to look like that:
1,-16 37 25.3,65 32 36.1,ECS1
2,-16 18 5.9,66 6 37.9,ECS2
3,-16 17 54.3,66 6 58.7,ECS3
4,-15 59 23.3,66 40 9.2,ECS4
5,-15 59 8.2,66 40 36.3,ECS5
Does anyone know the appropriate expression?
If you select "Regular expression" in the Replace dialog, you can match the leading line number and the remainder of the line using ^(\d+)(.*)$ in the "Find what" field, then rebuild the line in the "Replace with" field with \1\2,ECS\1. Each backslash-digit is substituted with the text captured by the corresponding parenthesised group. Use \d+ rather than a single \d so that line numbers of 10 and above are captured whole; with a single \d, line 10 would get ,ECS1.
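If the file is too large for an editor, the same find-and-replace can be sketched in Python. This version uses \d+ for the leading number so that multi-digit line numbers are captured whole; append_label is a name made up for this example.

```python
import re

def append_label(text):
    """Append ',ECS<n>' to every line, where <n> is the line's leading number."""
    # MULTILINE makes ^ and $ match at the start and end of each line.
    return re.sub(r"^(\d+)(.*)$", r"\1\2,ECS\1", text, flags=re.MULTILINE)
```

Lines that do not start with a digit are left as they are.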
675185538end432 204 9/9 4709 908 2
343269172end430 3 43 9335 975 7
590144128end89 7 29 3-5-4 420 2
337460105end8Y5 7A 78 2 23
292484648end70 A53 03 9235 93
These are the strings that I am working with. I want to find a regex to replace the above strings as follows
675185538
432 204 9/9 4709 908 2
343269172
430 3 43 9335 975 7
590144128
89 7 29 3-5-4 420 2
337460105
8Y5 7A 78 2 23
292484648
70 A53 03 9235 93
Wherever end occurs, it should be replaced with \r\n.
The string before end is numeric, and the string after end is alphanumeric with whitespace characters.
I am using Notepad++.
To make the match strict, try this:
Find: ^(\d+)end(\w)
Replace: \1\r\n\2
This captures, then puts back via back references, the preceding number between start of line and "end" and the following digit/letter. This won't match "end" elsewhere.
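For comparison, here is a minimal Python sketch of the same strict replacement, in case the data needs to be processed outside Notepad++. The function name split_at_end is invented for this example.

```python
import re

def split_at_end(text):
    """Replace the 'end' that follows a leading number with a CRLF line break."""
    # ^(\d+) anchors the number to the start of a line, so an "end" that
    # appears anywhere else in the line is not touched.
    return re.sub(r"^(\d+)end(\w)", r"\1\r\n\2", text, flags=re.MULTILINE)
```

In the replacement string, \r and \n are turned into a carriage return and a newline, mirroring the \1\r\n\2 replacement used in the editor.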
Kludgery:
Find: (\d\d\d\d\d\d\d\d\d)end(\d)
Replace: \1\r\n\2
Find creates two capture groups:
each group is bounded by an ( and a )
one capture group matches exactly nine numerals
the other capture group matches exactly one numeral.
In the replace:
the first capture group is referenced with \1
and the second group with \2.