Extracting Multiple Blocks of Similar Text - regex

I am trying to parse a report. The following is a sample of the text that I need to parse:
7605625112 DELIVERED N 1 GORDON CONTRACTORS I SIPLAST INC Freight Priority 2000037933 $216.67 1,131 ROOFING MATERIALS
04/23/2021 02:57 PM K WRISHT N 4 CAPITOL HEIGHTS, MD ARKADELPHIA, AR Prepaid 2000037933 -$124.23 170160-00
04/27/2021 12:41 PM 2 40 20743-3706 71923 $.00 055 $.00
2 WBA HOT $62.00 0
$12.92 $92.44
$167.36
7605625123 DELIVERED N 1 SECHRIST HALL CO SIPLAST INC Freight Priority 2000037919 $476.75 871 PAIL,UN1263,PAINT,3,
04/23/2021 02:57 PM S CHAVEZ N 39 HARLINGEN, TX ARKADELPHIA, AR Prepaid 2000037919 -$378.54
04/27/2021 01:09 PM 2 479 78550 71923 $.00 085 $95.35
2 HRL HOT $62.00 21
$13.55 $98.21
$173.76
This comprised of two or more blocks that start with "[0-9]{10}\sDELIVERED" and the last currency string prior to the next block.
If I test with "(?s)([0-9]{10}\sDELIVERED)(.*)(?<=\$167.36\n)" I successfully get the first Block, but If I use "(?s)([0-9]{10}\sDELIVERED)(.*)(?<=\$\d\d\d.\d\d\n)" it grabs everything.
If someone can show me the changes that I need to make to return two or more blocks I would greatly appreciate it.

* is a greedy operator, so it will try to match as much characters as possible. See also Repetition with Star and Plus.
For fixing it, you can use this regex:
(?s)(\d{10}\sDELIVERED)((.(?!\d{10}\sDELIVERED))*)(?<=\$\d\d\d.\d\d)
in which I basically replaced .* with (.(?!\d{10}\sDELIVERED))* so that for every character it checks if it is followed or not by \d{10}\sDELIVERED.
See a demo here

Related

Codeeval Challenge 230: Football, Answer Only Partially Correct

I am working on a relatively new challenge in CodeEval called 'Football.' The description is listed in the following link:
https://www.codeeval.com/open_challenges/230/
Inputs are lines of a file read by Python, and within each line there are lists separated by '|', with each list representing a country: the first being country "1", second being country "2", and so on.
1 2 3 4 | 3 1 | 4 1
19 11 | 19 21 23 | 31 39 29
Outputs are also lines in response to each line read from the file.
1:1,2,3; 2:1; 3:1,2; 4:1,3;
11:1; 19:1,2; 21:2; 23:2; 29:3; 31:3; 39:3;
so country 1 supports team 1, 2, and 3 as shown in the first line of output: 1:1,2,3.
Below is my solution, and since I have no clue why the solution only works for the two sample cases lited in the description link, I'd like to ask anyone for comments and hints on how to correct my code. Thank you very much for your time and assistance ahead of time.
import sys
def football(string):
countries = map(str.split, string.split('|'))
teams = sorted(list(set([i[j] for i in countries for j in range(len(i))])))
results = []
for i in range(len(teams)):
results.append([teams[i]+':'])
for j in range(len(countries)):
if teams[i] in countries[j]:
results[i].append(str(j+1))
for i in range(len(results)):
results[i] = results[i][0]+','.join(results[i][1:])
return '; '.join(results) + '; '
if __name__ == '__main__':
lines = [line.rstrip() for line in open(sys.argv[1])]
for line in lines:
print football(line)
After deliberately failing an attempt to checkout the complete test input and my output, I found the problem. The line:
teams = sorted(list(set([i[j] for i in countries for j in range(len(i))])))
will make the output problematic in terms of sorting. For example here's a sample input:
10 20 | 43 23 | 27 | 25 | 11 1 12 43 | 33 18 3 43 41 | 31 3 45 4 36 | 25 29 | 1 19 39 | 39 12 16 28 30 37 | 32 | 11 10 7
and it produces the output:
1:5,9; 10:1,12; 11:5,12; 12:5,10; 16:10; 18:6; 19:9; 20:1; 23:2; 25:4,8; 27:3; 28:10; 29:8; 3:6,7; 30:10; 31:7; 32:11; 33:6; 36:7; 37:10; 39:9,10; 4:7; 41:6; 43:2,5,6; 45:7; 7:12;
But the challenge expects the output teams to be sorted by numbers in ascending order, which is not achieved by the above-mentioned code as the numbers are in string format, not integer format. Therefore the solution is simply adding a key to sort the teams list by ascending order for integer:
teams = sorted(list(set([i[j] for i in countries for j in range(len(i))])), key=lambda x:int(x))
With a small change in this line, the code passes through the tests. A sample output looks like:
1:5,9; 3:6,7; 4:7; 7:12; 10:1,12; 11:5,12; 12:5,10; 16:10; 18:6; 19:9; 20:1; 23:2; 25:4,8; 27:3; 28:10; 29:8; 30:10; 31:7; 32:11; 33:6; 36:7; 37:10; 39:9,10; 41:6; 43:2,5,6; 45:7;
Please let me know if you have a better and more efficient solution to the challenge. I'd love to read better codes or great suggestions on improving my programming skills.
Here's how I solved it:
import sys
with open(sys.argv[1]) as test_cases:
for test in test_cases:
if test:
team_supporters = {}
for nation, nation_teams in enumerate(test.strip().split("|"), start=1):
for team in map(int, nation_teams.split()):
team_supporters.setdefault(team, []).append(nation)
print(*("{}:{};".format(team, ",".join(map(str, sorted(nations))))
for team, nations in sorted(team_supporters.items())))
The problem is not very complicated. We're given a mapping from nation (implicitly numbered by their order in the input) to a list of teams. We need to reverse that to create an output that maps from a team to a list of nations.
It seems natural to use a dictionary that maps in the same way as the desired output. We can use enumerate to give numbers to the nations as we iterate over them. The setdefault method of the dict adds empty lists to the dictionary as they are needed (using a collections.defaultdict instead of a regular dictionary would be another way to deal with this). We don't need to care about the order of the input, nor the order things are stored in the dictionary's inner lists.
The output we build using str.format calls and the default space separator of the print function. If the final semicolon wasn't desired, I'd have used print("; ".join("{}:{}.format(...))) instead. Since the output needs to be sorted by team at the top level, and by nation in the inner lists, we make some sorted calls where necessary.
Sorting the inner lists is probably not even be necessary, since the nations were processed in order, with their numbers derived from the order they had in the input line. Fortunately, Python's Timsort algorithm is very fast on already-sorted input, so even with a bit of unnecessary sorting, our code is still fast enough.

Regular expression to debatch MT940 message

I got a message with below structure, where message starts from tag :20: and ends at :86:. I want to write a regular expression to extract the all messages.
I would write a C# utility to extract each message and put it in ArrayList.
:20:160212-2359
:21:600******444
:28C:00001/00001
.
.
.
:86:DAILY SETTLEMENT /ENTRY-13 MAR
:62F:D160212GBP1229387,45
:64:D160212GBP1229387,45
:65:D120314GBP1229387,45
:65:D120315GBP1229387,45
:65:D120316GBP1229387,45
:65:D120317GBP1229387,45
:65:D120318GBP1229387,45
:86:FORWARD AVAILABLE FUNDS SHOW ITEMS KNOWN BUT NOT YET POSTED
some more comments in 86_2 segment
this is line2
:20:160212-2359
:21:B***22
:25:60*****88
.
.
.
:86:/ENTRY-13 MAR TRF/REF 6*******64 /ORD/ some line here
*********************** /BNF/ JO 88
:62F:C160212EUR13868931,00
:64:C160212EUR13868931,00
:65:C120314EUR13868931,00
:65:C120315EUR13791849,00
:65:C120316EUR13791849,00
:65:C120317EUR13791849,00
:65:C120318EUR13791849,00
:86:FORWARD AVAILABLE FUNDS SHOW ITEMS KNOWN BUT NOT YET POSTED
some more comments in 86_2 segment.
:20:160212-2359
:21:B****X
:25:6*************1
:28C:00001/00001
:86:STORE1 EUROPE B.V. /ENTRY-15 MAR RTS/REF 6*****6 RTS
SWEPT FROM 9999 1**** XX***********BILLING CHARGES -
28FEB12 TRF/REF 6641XXX43799053 /ITEMCNT/004 /BNF/ /ITEMCNT/004
BILLING CHARGES
:61:1203130313DR10000000,00****288//6*****6
:86:STORE1 CNRTY SRL /ENTRY-13 MAR CLG/REF 66**********6
:61:1*****000,00NT*****9846//6******74
:86:NAME /ENTRY-13 MAR CLG/REF 6******4 LA C****R
**** CASH DEPOSIT STORE1
:61:1203150315DR48531,00NCHGBILLING CHARGES//6641XXX43799053
:86:BILLING CHARGES - 28FEB12 /ENTRY-15 MAR TRF/REF
66******53 /ITEMCNT/004
:62F:C160212EUR0,00
:64:C160212EUR0,00
:65:C120314EUR0,00
:65:C120315EUR0,00
:65:C120316EUR0,00
:65:C120317EUR0,00
:65:C120318EUR0,00
:86:FORWARD AVAILABLE FUNDS SHOW ITEMS KNOWN BUT NOT YET POSTED
{newline}
Actual values are replaced with '*' character.
Thanks
Dhiraj Bhavsar
Try this
:20:(.*?):86:
in code
/:20:(.*?):86:/gs
https://regex101.com/r/dW4zS3/1
.*? matches any character between zero and unlimited times, as few times as possible, expanding as needed

Replace zeros with missing values in certain cases

I was wondering if anyone knew an easier way of doing the following:
I have a dataset of health facility caseload by year, where each observation is one health facility. Facilities were 'brought online' in different years, so some have zeros before they have values for caseload. Also, some 'discontinue', as in they did provide services, but don't any more. I would like to replace the zeros with missing values for the years in which a facility discontinued. In the following example, the 3rd and 4th facilities discontinued, so I'd like missing for y2014 for the 3rd and y2013 & y2014 for the 4th.
y2011 y2012 y2013 y2014
0 0 76 82
0 0 29 13
0 0 25 0
5 10 0 0
0 0 17 24
I tried the following, which worked, but I'm going to have many years worth of data to work on (2000-2014), so was wondering if there was a more efficient way.
replace y2014=. if y2014==0 & (y2013>0 | y2012>0 | y2011>0)
replace y2013=. if y2013==0 & ( y2012>0 | y2011>0)
replace y2012=. if y2012==0 & ( y2011>0)
I messed around with egen rowlast to identify the facilities with a zero in the last year (meaning they discontinued), but then wasn't sure where to go with it.
Your problem would benefit from a loop over the variables.
We'll initialise started to 0, change our mind about started when we see a positive value, and change any subsequent 0s to missings if started is 1.
gen started = 0
forval y = 2000/2014 {
replace started = 1 if y`y' > 0
replace y`y' = . if started == 1 & y`y' == 0
}
Note that this scheme allows re-starts.
A more general comment is that this is not the better data structure for such panel or longitudinal data. This particular problem is not too challenging, but most problems with such data will be easier after reshape long.
See here for a survey of "rowwise" technique in Stata.

Regex Pattern for String including newline characters

I am looking for a regex pattern that will return a match from %PDF-1.2 to and including %%EOF in the string below.
So far my patterns don't seem to work.
DOCUMENTS ACCEPTED
001//201//0E9136614////ACME 107 PTY LTD//8
**E10 End of validation report**
BDAT 4367 LAST
XSVBOUT
001XSVSEPRXXXOUT_TP.19
ZHDASCRA55 0700 8
ZCO*** TEST DATABASE ***ACME 107 PTY LTD 551824563 APTY LMSH PDF NSW 20111217 PNPC
ZIL 77000030149 Australian Securities and Investments Commission 86768265615 ZUMESOFT SOLUTIONS PTY LTD 61 buxton st north adelaide SA 5006
ZIAProprietary Company 42600 0E9136614 201 TAX INVOICE EXE 0 0E9136614201C PA 20111217 Not Subject to GST - Treasurer's Determination (Exempt Taxes, Fees and Charges)
ZTRENDRA55 5
%PDF-1.2
%????
3495
%%EOF
BDAT 11 LAST
/(?s)(%PDF-1\.2.+%%EOF)/ should solve your problem
If you are using an older flavor of regex the (?s) could be moved to the end of regex modifier like //s so.

Perl RegEx for Matching 11 column File

I'm trying to write a perl regex to match the 5th column of files that contain 11 columns. There's also a preamble and footer which are not data. Any good thoughts on how to do this? Here's what I have so far:
if($line =~ m/\A.*\s(\b\w{9}\b)\s+(\b[\d,.]+\b)\s+(\b[\d,.sh]+\b)\s+.*/i) {
And this is what the forms look like:
No. Form 13F File Number Name
____ 28-________________ None
[Repeat as necessary.]
FORM 13F INFORMATION TABLE
TITLE OF VALUE SHRS OR SH /PUT/ INVESTMENT OTHER VOTING AUTHORITY
NAME OF INSURER CLASS CUSSIP (X$1000) PRN AMT PRNCALL DISCRETION MANAGERS SOLE SHARED NONE
Abbott Laboratories com 2824100 4,570 97,705 SH sole 97,705 0 0
Allstate Corp com 20002101 12,882 448,398 SH sole 448,398 0 0
American Express Co com 25816109 11,669 293,909 SH sole 293,909 0 0
Apollo Group Inc com 37604105 8,286 195,106 SH sole 195,106 0 0
Bank of America com 60505104 174 12,100 SH sole 12,100 0 0
Baxter Internat'l Inc com 71813109 2,122 52,210 SH sole 52,210 0 0
Becton Dickinson & Co com 75887109 8,216 121,506 SH sole 121,506 0 0
Citigroup Inc com 172967101 13,514 3,594,141 SH sole 3,594,141 0 0
Coca-Cola Co. com 191216100 318 6,345 SH sole 6,345 0 0
Colgate Palmolive Co com 194162103 523 6,644 SH sole 6,644 0 0
If you ever do write a regex this long, you should at least use the x flag to ignore whitespace, and importantly allow whitespace and comments:
m/
whatever
something else # actually trying to do this
blah # for fringe case X
/xi
If you find it hard to read your own regex, others will find it Impossible.
I think a regular expression is overkill for this.
What I'd do is clean up the input and use Text::CSV_XS on the file, specifying the record separator (sep_char).
Like Ether said, another tool would be appropriate for this job.
#fields = split /\t/, $line;
if (#fields == 11) { # less than 11 fields is probably header/footer
$the_5th_column = $fields[4];
...
}
My first thought is that the sample data is horribly mangled in your example. It'd be great to see it embedded inside some <pre>...</pre> tags so columns will be preserved.
If you are dealing with columnar data, you can go after it using substr() or unpack() easier than you can using regex. You can use regex to parse out the data, but most of us who've been programming Perl a while also learned that regex is not the first tool to grab a lot of times. That's why you got the other comments. Regex is a powerful weapon, but it's also easy to shoot yourself in the foot.
http://perldoc.perl.org/functions/substr.html
http://perldoc.perl.org/functions/unpack.html
Update:
After a bit of nosing around on the SEC edgar site, I've found that the 13F files are nicely formatted. And, you should have no problem figuring out how to process them using substr and/or unpack.
FORM 13F INFORMATION TABLE
VALUE SHARES/ SH/ PUT/ INVSTMT OTHER VOTING AUTHORITY
NAME OF ISSUER TITLE OF CLASS CUSIP (x$1000) PRN AMT PRN CALL DSCRETN MANAGERS SOLE SHARED NONE
- ------------------------------ ---------------- --------- -------- -------- --- ---- ------- ------------ -------- -------- --------
3M CO COM 88579Y101 478 6051 SH SOLE 6051 0 0
ABBOTT LABS COM 002824100 402 8596 SH SOLE 8596 0 0
AFLAC INC COM 001055102 291 6815 SH SOLE 6815 0 0
ALCATEL-LUCENT SPONSORED ADR 013904305 172 67524 SH SOLE 67524 0 0
If you are seeing the 13F files unformatted, as in your example, then you are not viewing correctly because there are tabs between columns in some of the files.
I looked through 68 files to get an idea of what's out there, then wrote a quick unpack-based routine and got this:
3M CO, COM, 88579Y101, 478, 6051, SH, , SOLE, , 6051, 0, 0
ABBOTT LABS, COM, 002824100, 402, 8596, SH, , SOLE, , 8596, 0, 0
AFLAC INC, COM, 001055102, 291, 6815, SH, , SOLE, , 6815, 0, 0
ALCATEL-LUCENT, SPONSORED ADR, 013904305, 172, 67524, SH, , SOLE, , 67524, 0, 0
Based on some of the other files here's some thoughts on how to process them:
Some of the files use tabs to separate the columns. Those are trivial to parse and you do not need regex to split the columns. 0001031972-10-000004.txt appears to be that way and looks very similar to your example.
Some of the files use tabs to align the columns, not separate them. You'll need to figure out how to compress multiple tab runs into a single tab, then probably split on tabs to get your columns.
Others use a blank line to separate the rows vertically so you'll need to skip blank lines.
Others allow wrap columns to the next line (like a spreadsheet would in a column that is not wide enough. It's not too hard to figure out how to deal with that, but how to do it is being left as an exercise for you.
Some use centered column alignment, resulting in leading and trailing whitespace in your data. s/^\s+//; and s/\s+$//; will become your friends.
The most interesting one I saw appeared to have been created correctly, then word-wrapped at column 78, leading me to think some moron loaded their spreadsheet or report into their word processor then saved it. Reading that is a two step process of getting rid of the wrapping carriage-returns, then re-processing the data to parse out the columns. As an added task they also have column headings in the data for page breaks.
You should be able to get 100% of the files parsed, however you'll probably want to do it with a couple different parsing methods because of the use of tabs and blank lines and embedded column headers.
Ah, the fun of processing data from the wilderness.