How do I make a regex non-greedy to extract a specific element?

I have the following text from which I need to extract certain phrases:
Restricted Cash 951 37505 Accounts Receivable - Affiliate 31613 27539 Accounts
Receivable - Third Party 23091 2641 Crude Oil Inventory 2200 0 Other Current
Assets 2724 389
Total Current Assets 71319 86100 Property Plant and Equipment Total Property
Plant and Equipment Gross 1500609 706039 Less Accumulated
Depreciation and Amortization (79357) (44271) Total Property Plant and Equipment
Net 1421252 661768 Intangible Assets Net 310202 0 Goodwill 109734 0 Investments
82317 80461 Other Noncurrent Assets 3093 1429 Total Assets 1997917 829758
LIABILITIES Current Liabilities Accounts Payable - Affiliate 2778 1616 Accounts
Payable - Trade 92756 109893 Other Current Liabilities 9217 2876 Total Current
Liabilities 104751 114385 Long-Term Liabilities Long-Term Debt 559021 85000
Asset Retirement Obligations 17330 10416 Other Long-Term Liabilities 582 3727
Total Liabilities 681684 213528 EQUITY Partners' Equity Limited Partner
Common Units (23759 and 23712 units outstanding respectively) 699866 642616
Subordinated Units (15903 units outstanding) (130207) (168136) General Partner 2421 520
Total Partners' Equity 572080 475000 Noncontrolling Interests 744153 141230 Total
Equity 1316233 616230 Total Liabilities and Equity 1997917 829758
I need to remove all phrases that are in parentheses, i.e. (), and that also contain a number together with the word outstanding or units.
Based on these conditions, there are two phrases that need to be removed:
(23759 and 23712 units outstanding respectively)
(15903 units outstanding)
I have tried the following Regex in Python:
\(\d+.+?(outstanding)+?\)
The idea was that the .+? after \d+ would make the regex non-greedy (lazy). However, the regex selects a huge segment, starting from (79357) (44271) Total Property Plant and Equipment and running all the way to outstanding), as if it were greedy.
The unique marker here is the word outstanding; maybe there is a better approach to extracting those phrases?

You may use
\(\d[^()]*outstanding[^()]*\)
Details
\( - ( char
\d - a digit
[^()]* - 0+ chars other than ( and )
outstanding - a substring
[^()]* - 0+ chars other than ( and )
\) - a ) char.
Python:
re.findall(r'\(\d[^()]*outstanding[^()]*\)', s)
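As a runnable sketch (the s string below is a trimmed stand-in for the balance-sheet text in the question):

```python
import re

# Trimmed stand-in for the balance-sheet text
s = ("Limited Partner Common Units (23759 and 23712 units outstanding "
     "respectively) 699866 642616 Subordinated Units (15903 units "
     "outstanding) (130207) (168136) General Partner 2421 520")

# [^()]* can never cross a parenthesis, so each match is confined to a
# single (...) group; laziness is not needed at all
matches = re.findall(r'\(\d[^()]*outstanding[^()]*\)', s)
print(matches)
# ['(23759 and 23712 units outstanding respectively)', '(15903 units outstanding)']
```

The negated class is why this works where the lazy .+? failed: a lazy dot still matches ) characters, so it can span several parenthesized groups, while [^()]* cannot.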

Related

Need a regex that captures date & amount fields & ignores blank lines & other miscellaneous data

Goal:
Capture only the purchase date, amount, and purchased item name(s).
Ignore all blank lines
Ignore Reference # & SHIPPING AND TAX string
Then, repeat this on the next grouping of purchases.
I am using Google Sheets for this project.
Sample data showing 3 purchases (ie blocks of data)
Note: spacing & the SHIPPING AND TAX string vary in between
01/12 P934200QXEHMBPNAD Acme MARKETPLACE SEATTLE WA $34.96 Date & Amount
Ignore (blank line)
435852496957 Ignore
BOSCH CO2135 1/8 In. x 2-3/4 I Purchased item name
BOSCH CO2131 1/16 In. x 1-7/8 Purchased item name
IZOD Men's Memory Foam Slipper Purchased item name
SHIPPING AND TAX Ignore
Ignore (blank line)
01/12 P934200QXEHMB6MQ0 Acme MARKETPLACE SEATTLE WA $48.91
492577232349
LxTek Remanufactured Ink Cartr
SHEENGO Remanufactured Ink Car
02/02 P934200AEHMB7E12 Acme MARKETPLACE SEATTLE WA $21.60
659473773469
LHXEQR Tubing Adapter for ResM
SHIPPING AND TAX
My updated attempt
=index(if(len(C26:C33),REGEXREPLACE(C26:C33,"(?Ums)(\d{2}/\d{2}) .* (\$\d{1,}\.\d{1,2}).(?:^\s+\d+$)(.*)(?:\s+SHIPPING AND TAX)","$1,$2,$3"),))
Unsuccessful results
01/12 P934200QXEHMBPNAD Acme MARKETPLACE SEATTLE WA $34.96
#VALUE!
BOSCH CO2135 1/8 In. x 2-3/4 I
BOSCH CO2131 1/16 In. x 1-7/8
IZOD Men's Memory Foam Slipper
SHIPPING AND TAX
01/12 P934200QXEHMB6MQ0 Acme MARKETPLACE SEATTLE WA $48.91
#VALUE!
LxTek Remanufactured Ink Cartr
SHEENGO Remanufactured Ink Car
02/02 P934200DJEHMB7E12 AMAZON MARKETPLACE SEATTLE WA $21.60
#VALUE!
LHXEQR Tubing Adapter for ResM
SHIPPING AND TAX
Unsuccessful
- it did not ignore:
data between date & amount
blank lines
SHIPPING AND TAX
- Value issue - not handling the Reference # well
Function REGEXREPLACE parameter 1 expects text values. But '435848996957' is a number and cannot be coerced to a text.
This matches one "block" (note: turn the DOTALL and MULTILINE flags on, so that . crosses newlines and ^/$ match at line boundaries):
(\d\d\/\d\d).*(\$[\d.]+).(?:^\s+\d+$)(.*)SHIPPING AND TAX\n\n
Capturing:
the date as group 1
the amount as group 2
the product lines as group 3
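Google Sheets uses the RE2 engine, but the same block-matching idea can be checked in Python with a slightly restyled pattern (the sample text below is abbreviated from the question; the group layout is the same: date, amount, item lines):

```python
import re

text = """01/12 P934200QXEHMBPNAD Acme MARKETPLACE SEATTLE WA $34.96

435852496957
BOSCH CO2135 1/8 In. x 2-3/4 I
BOSCH CO2131 1/16 In. x 1-7/8
SHIPPING AND TAX
"""

block = re.compile(
    r'(\d\d/\d\d)'          # group 1: purchase date
    r'.*?(\$[\d.]+)\n'      # group 2: amount ending the header line
    r'\s*\d+\n'             # the reference number line, skipped
    r'(.*?)'                # group 3: purchased item lines
    r'\s*SHIPPING AND TAX',
    re.DOTALL)              # DOTALL lets . cross newlines

for date, amount, items in block.findall(text):
    print(date, amount, items.splitlines())
```

The lazy .*? quantifiers keep each match inside one purchase block instead of running to the last SHIPPING AND TAX in the sheet.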

Named Entity Recognition For Product Names Of Clothes

I am trying to extract product names from plain text. The problem with product names is that they don't follow a specific pattern, and I don't want to give the algorithm a data set of fixed names; I want it to be generic.
I'm looking for a way to make it detect the product names as an Entity.
Any help please?
Here's an example of the text
Order dispatched Your new clothes are on their way. Track your
delivery with Royal Mail: VB 9593 7366 0GB
Order Details
Men's Dark Navy Jersey Cotton Lounge Shorts Size: XL
£45.00
Men's Navy Cotton Jersey Lounge Pants Size: XL
£60.00
Delivery £0.00
Total £95.00
I want to extract
Men's Navy Cotton Jersey Lounge
and
Men's Dark Navy Jersey Cotton Lounge Shorts
Another example
Your order summary Delivery between 18/11/2019 and 19/11/2019 Shipping
from O' adidas Lxcon sneakers £80.96 Delivery between 18/11/2019 and
19/11/2019 Shipping from BOUTIQUE ANTONIA MARCELO BURLON COUNTY OF
MILAN Confidencial striped swimsuit £97.58 Shipping Total Payment
method £20.00 £153.90 VISA
I want to extract
adidas Lxcon sneakers
And
MARCELO BURLON COUNTY OF MILAN
For your information this text is an email of orders and I have a lot of different patterns of emails.

How to filter distinct counts of text with a greater than indicator in Power BI?

I am working on a report that counts stores with different types of beverages. I am trying to get a distinct count of stores that sell four or more Powerade flavors and two or more Coca-Cola flavors, while maintaining a count of stores that are purchasing other products (Sprite, Dr. Pepper, etc.).
My data table is BEVSALES and the data looks like:
CustomerNo Brand Flavor
43 PWD Fruit Punch
37 Coca-Cola Vanilla
43 PWD Mixed Bry
37 Coca-Cola Cherry
44 Sprite Tropical Mix
43 PWD Strawberry
43 PWD Grape
44 Coca-Cola Cherry
17 Dr. Pepper Cherry
I am trying to get a distinct count of customers with the filters PWD >= 4 and Coca-Cola >= 2, while keeping the customer count of Dr. Pepper and Sprite at 1 each (1 customer purchasing PWD, 1 customer purchasing Coca-Cola, etc.).
The best measure that I have been able to find is
= SUMX(BEVSALES, 1*(FIND("PWD",BEVSALES[Brand],,0)))
but I don't know how to put it together so the formula counts the stores that have more than 4 PWD and 2 Coca-Cola flavors. Any ideas?
The easiest way would be to do this in a separate query. Go to the query designer and click Edit. Then choose your table, group by the Brand column, and take a distinct count of the Flavor column. The result should look like this (maybe as a new table):
GroupedBrand DistinctCountFlavor
PWD 4
Coca-Cola 2
Sprite 1
Dr. Pepper 1
Now you can access the distinct count of flavors by brand. With an IIF() statement you can check for >= 4 for PWD, and so on...
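Outside Power BI, the grouped distinct-count plus per-brand threshold check can be sketched in plain Python (the rows list mirrors the sample table; the thresholds dict and variable names are my own):

```python
from collections import defaultdict

rows = [  # (CustomerNo, Brand, Flavor), from the sample table
    (43, "PWD", "Fruit Punch"),     (37, "Coca-Cola", "Vanilla"),
    (43, "PWD", "Mixed Bry"),       (37, "Coca-Cola", "Cherry"),
    (44, "Sprite", "Tropical Mix"), (43, "PWD", "Strawberry"),
    (43, "PWD", "Grape"),           (44, "Coca-Cola", "Cherry"),
    (17, "Dr. Pepper", "Cherry"),
]

# Distinct flavors per (customer, brand) pair
flavors = defaultdict(set)
for cust, brand, flavor in rows:
    flavors[(cust, brand)].add(flavor)

thresholds = {"PWD": 4, "Coca-Cola": 2}  # every other brand needs just 1

# Distinct customers per brand that meet the brand's flavor threshold
counts = defaultdict(int)
for (cust, brand), fl in flavors.items():
    if len(fl) >= thresholds.get(brand, 1):
        counts[brand] += 1

print(dict(counts))
# {'PWD': 1, 'Coca-Cola': 1, 'Sprite': 1, 'Dr. Pepper': 1}
```

Customer 44 buys Coca-Cola but only one flavor, so it falls below the Coca-Cola threshold and is not counted for that brand.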

Regular expression to match a pattern and print it to next line

I have a text file with data delimited by '|'. The sample data looks like:
584|Daily Support Calls - PQRS Group Practice and ACO GPRO Web Interface|N2TM |01/28/2015|01.00.00|PM |01/28/2015|02.00.00|PM |IN|Y|NULL|N2TM ||https://rti-events3.webex.com/rti-events3/onstage/g.php?MTID=ef4a250ead6cb06bad08d8e9ca3cb07ba|Daily support calls during the first week of GPRO Web Interface submission period.
Topic: Daily Support Calls - PQRS Group Practice and ACO GPRO Web Interface
Date: Wednesday, January 28, 2015
Time: 1:00 pm -2:00 pm ET
You may join the event online or by phone. Please use only 1 of the 2 options shown below.
Option 1: To join the online event
1. Click on the link provided above.
2. Click "Join Now."
Option 2: To join the event by telephone only
US TOLL: 1-650-479-3207
Access code: 993 566 829
Event password: gpro128|2015-02-03-13.21.30.421193|2015-02-03-16.55.46.580524
585|Daily Support Calls - PQRS Group Practice and ACO GPRO Web Interface|N2TM |01/29/2015|01.00.00|PM |01/29/2015|02.00.00|PM |IN|Y|NULL|N2TM ||https://rti-events3.webex.com/rti-events3/onstage/g.php?MTID=e4c555479a1fc58ba3064b28983cd6595|Daily support calls during the first week of GPRO Web Interface submission period.
Topic: Daily Support Calls - PQRS Group Practice and ACO GPRO Web Interface
Date: Thursday, January 29, 2015
Time: 1:00 pm - 2:00 pm ET
You may join the event online or by phone. Please use only 1 of the 2 options shown below.
Option 1: To join the online event
1. Click on the link provided above.
2. Click "Join Now."
Option 2: To join the event by telephone only
US TOLL: 1-650-479-3207
Access code: 991 837 559
Event password: gpro129|2015-02-03-13.26.46.870448|2015-02-03-13.27.03.a
I want the result to be:
584|Daily Support Calls - PQRS Group Practice and ACO GPRO Web Interface|N2TM |01/28/2015|01.00.00|PM |01/28/2015|02.00.00|PM |IN|Y|NULL|N2TM ||https://rti-events3.webex.com/rti-events3/onstage/g.php?MTID=ef4a250ead6cb06bad08d8e9ca3cb07ba|Daily support calls during the first week of GPRO Web Interface submission period.|Topic: Daily Support Calls - PQRS Group Practice and ACO GPRO Web Interface |Date: Wednesday, January 28, 2015|Time: 1:00 pm -2:00 pm ET|You may join the event online or by phone. Please use only 1 of the 2 options shown below.|Option 1: To join the online event|1. Click on the link provided above.|2. Click "Join Now."|Option 2: To join the event by telephone only|US TOLL: 1-650-479-3207|Access code: 993 566 829|Event password: gpro128|2015-02-03-13.21.30.421193|2015-02-03-16.55.46.580524
585|Daily Support Calls - PQRS Group Practice and ACO GPRO Web Interface|N2TM |01/29/2015|01.00.00|PM |01/29/2015|02.00.00|PM |IN|Y|NULL|N2TM ||https://rti-events3.webex.com/rti-events3/onstage/g.php?MTID=e4c555479a1fc58ba3064b28983cd6595|Daily support calls during the first week of GPRO Web Interface submission period.|Topic: Daily Support Calls - PQRS Group Practice and ACO GPRO Web Interface |Date: Thursday, January 29, 2015|Time: 1:00 pm - 2:00 pm ET|You may join the event online or by phone. Please use only 1 of the 2 options shown below.|Option 1: To join the online event|1. Click on the link provided above.|2. Click "Join Now."|Option 2: To join the event by telephone only|US TOLL: 1-650-479-3207|Access code: 991 837 559|Event password: gpro129|2015-02-03-13.26.46.870448|2015-02-03-13.27.03.a
Basically, some fields occur on new lines. I want them on a single line.
You can use this pattern/replacement:
search: \R(?!\d+\|)
replace: |
details:
\R matches any kind of newline sequence (including \r, \r\n and \n)
(?!...) is a negative lookahead. A test that means "not followed by"
\d+ one or more digits
\| a literal |
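Python's re module has no \R, but a plain \n with the same negative lookahead shows the idea (the records below are abbreviated from the sample):

```python
import re

# Two abbreviated records; continuation lines don't start with "digits|"
text = ("584|Daily Support Calls|N2TM\n"
        "Topic: Daily Support Calls\n"
        "Date: Wednesday, January 28, 2015\n"
        "585|Daily Support Calls|N2TM\n"
        "Topic: Daily Support Calls")

# Replace every newline NOT followed by a new record number with a pipe
joined = re.sub(r'\n(?!\d+\|)', '|', text)
print(joined)
```

Only the newline before 585| survives, because the lookahead \d+\| recognizes it as the start of the next record.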

Search for multiple strings in many text files, count hits on combinations

I'm struggling to automate a reporting exercise, and would appreciate some pointers or advice please.
I have several hundred thousand small (<5kb) text files. Each contains a few variables, and I need to count the number of files that match each combination of variables.
Each file contains a device number, such as /001/ /002/.../006/.
Each file also contains a date string, such as 01.10.14 (dd.mm.yy)
Some files contain a 'status' string which is always "Not Settled"
I need a way to trawl through each file on a Linux system (spread across several subdirectories), and produce a report file that counts 'per device' how many files include each date stamp (6 month range) and for each of those dates, how many contain the status string.
The report might look like this:
device, date, total count of files
device, date, total "not settled" count
e.g.
/001/, 01.12.14, 356
/001/, 01.12.14, 12
/001/, 02.12.14, 209
/001/, 02.12.14, 8
/002/, 01.12.14, 209
/002/, 01.12.14, 7
etc etc
In other words:
Foreach /device/
Foreach <date>
count total matching files - write number to file
count total matching 'not settled' files - write number to file
Each string to match could appear anywhere in the file.
I tried using grep piped into a second (and third) grep command, but I'd like to automate this and loop through the variables (6 devices, about 180 dates, 2 status strings). I suspect Perl and Bash are the answer, but I'm out of my depth.
Please can anyone recommend an approach to this?
Edit: Some sample data as mentioned in the comments. The information is basically receipt data from tills - as would be sent to a printer. Here's a sample (identifying bits stripped out).
c0! SUBTOTAL 11.37
c0! ! T O T A L 11.37!
c0! 19 ITEMS
c0! C a s h ? 11.37
vu p022c0!
c0! NET TOTAL VAT A 10.87
c0! VAT 00.0% 0.00
c0! NET TOTAL VAT B 0.42
c0! VAT 20.0% 0.08
c0! *4300 772/080/003/132 08.01.15 11:18 A-00
Contents = Not Settled
In the case above, I'd be looking for /003/ , 08.01.15, and "Not Settled"
Many thanks.
First, read everything into an SQLite database, then run queries against it to your heart's content. Putting the data in an SQL database is going to save you time if you need to tweak anything. Besides, even simple SQL can tackle this kind of thing if you have the right tables set up.
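A minimal sketch of that idea using Python's built-in sqlite3 module (the device/date regex and the two sample receipts come from the question; walking the subdirectories and reading the files is left out):

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE receipts (device TEXT, date TEXT, settled INTEGER)")

def load(text):
    """Insert one receipt file's device, date and settled flag."""
    m = re.search(r'(/00[1-6]/).*?(\d{2}\.\d{2}\.\d{2})', text)
    if m:
        conn.execute("INSERT INTO receipts VALUES (?, ?, ?)",
                     (m.group(1), m.group(2),
                      0 if "Not Settled" in text else 1))

# One call per file's contents
load("c0! *4300 772/080/003/132 08.01.15 11:18 A-00\nContents = Not Settled")
load("c0! *4300 772/080/003/132 08.01.15 11:18 A-00")

# Total files and "Not Settled" files per device and date, in one query
report = conn.execute(
    """SELECT device, date, COUNT(*),
              SUM(CASE WHEN settled = 0 THEN 1 ELSE 0 END)
       FROM receipts GROUP BY device, date""").fetchall()
print(report)   # [('/003/', '08.01.15', 2, 1)]
```

Once the rows are loaded, any further slicing (date ranges, per-device breakdowns) is a GROUP BY away rather than another pass over hundreds of thousands of files.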
First of all, I agree with @Sinan :-)
The following might work as a hack to make a hash out of your file data.
# report.pl
use strict;
use warnings;
use Data::Dumper;

my %report;
my ($date, $device);

while (<>) {
    next unless m/^ .*
        (?<device>\/00[1-3]\/) .*
        (?<date>\d{2}\.\d{2}\.\d{2})
        .*$/x;
    ($date, $device) = ($+{date}, $+{device});
    $_ = <> unless eof;
    if (/Contents/) {
        $report{$date}{$device}{"u_count"}++;
    }
    else {
        $report{$date}{$device}{"count"}++;
    }
}
print Dumper(\%report);
This seems to work with a collection of data files in the format shown below (since you don't say or show where Contents = Not Settled appears, I assume it is either part of the last line along with the device ID, or on a separate final line of each file).
Explanation:
The script reads all the files passed on the command line through the while (<>) {} loop. First, next unless m/.../ skips lines of input until it matches the line with the device and date information.
Next, the match uses named capture groups ((?<device>...) and (?<date>...)) to hold the values of the patterns it finds, and places those values in the corresponding variables (($date, $device) = ($+{date}, $+{device});). These could simply be $1 and $2, but naming keeps me organized here.
Then, in case there is another line to read, $_ = <> unless eof; reads it and tries the final conditional match in order to increment the count or u_count entries.
Data file format:
file1.data
c0! SUBTOTAL 11.37
c0! ! T O T A L 11.37! c0! 19 ITEMS
c0! C a s h ? 11.37
vu p022c0!
c0! NET TOTAL VAT A 10.87
c0! VAT 00.0% 0.00
c0! NET TOTAL VAT B 0.42
c0! VAT 20.0% 0.08
c0! *4300 772/080/003/132 08.01.15 11:18 A-00
file2.data
c0! SUBTOTAL 11.37
c0! ! T O T A L 11.37! c0! 19 ITEMS
c0! C a s h ? 11.37
vu p022c0!
c0! NET TOTAL VAT A 10.87
c0! VAT 00.0% 0.00
c0! NET TOTAL VAT B 0.42
c0! VAT 20.0% 0.08
c0! *4300 772/080/002/132 08.01.15 11:18 A-00
Contents = Not Settled
(a set of files for testing are listed here: http://pastebin.com/raw.php?i=7ALU80fE).
perl report.pl file*.data
Data::Dumper Output:
$VAR1 = {
          '08.01.15' => {
                          '/002/' => {
                                       'u_count' => 4
                                     },
                          '/003/' => {
                                       'count' => 1
                                     }
                        },
          '08.12.15' => {
                          '/003/' => {
                                       'count' => 1
                                     }
                        }
        };
From that you can make a report by iterating through the hash with keys() (the dates) and retrieving the inner hash and count values per machine. Really, it would be a good idea to have some tests to make sure everything works as expected - that, or just do as @Sinan Ünür suggests: use SQLite!
NB: this code was not extensively tested :-)