Regex to extract all invoice item descriptions

I'm trying to extract all items from an invoice with a regex (name, unit price, total, VAT, etc.). I managed to get all the numeric fields, but the biggest problem is extracting the item descriptions, because a description is sometimes split across two separate lines. This is the text I need to match:
1 Agrafe metalice Eco, rotunjite, 33 mm, 50 buc/cutie buc. 30.00 0,76 22,80 4,33
(SOBO604)
2 Banda corectoare DONAU Mouse, 5 mm x 8 m, orizontala, buc. 5.00 4,83 24,15 4,59
blister (7635001PL-99)
3 Biblioraft plastifiat OFFICE Products, 5 cm, colturi buc. 75.00 5,08 381,00 72,39
metalice, albastru (21011121-01)
4 Burete magnetic DONAU, 110 x 57 x 25 mm, galben buc. 10.00 5,53 55,30 10,51
(7638001PL-99)
5 Calculator de birou Canon WS-1610T, solar, 16 cifre, buc. 1.00 71,11 71,11 13,51
afisaz inclinat, format mare (WS1610T)
6 Capse zincate OFFICE Products 24/6, 1000 buc/cutie buc. 5.00 1,12 5,60 1,06
(18072419-19)
7 Creion grafic Eco, ascutit, cu radiera, corp verde buc. 20.00 0,40 8,00 1,52
(SOIS432)
8 Creion mecanic BIC Matic, 0.7 mm (601021) buc. 4.00 1,88 7,52 1,43
9 Dosar din plastic cu sina si doua perforatii OFFICE buc. 250.00 0,35 87,50 16,63
Products, albastru (21104211-01)
10 Dosar din plastic cu sina si doua perforatii OFFICE buc. 100.00 0,35 35,00 6,65
Products, roz (21104211-13)
pagina 1 / 3
797638
11 Folie protectie OFFICE Products, A4, coaja portocala, 40 buc. 5.00 6,53 32,65 6,20
microni, 100 file/set (21141215-90)
12 Folie protectie OFFICE Products, A4, coaja portocala, 40 buc. 20.00 6,51 130,20 24,74
microni, 100 file/set (21141215-90)
13 Marker whiteboard Eco, varf rotund, albastru (SOIS535A) buc. 104.00 1,33 138,32 26,28
14 Marker whiteboard Eco, varf rotund, negru (SOIS535N) buc. 2.00 1,33 2,66 0,51
15 Marker whiteboard Eco, varf rotund, rosu (SOIS535R) buc. 2.00 1,33 2,66 0,51
16 Notite adezive OFFICE Products, 51 x 76 mm, galben pal, buc. 5.00 1,65 8,25 1,57
100 file (14047511-06)
17 Organizator de birou DONAU Clasic VII, 6 compartimente, buc. 2.00 30,67 61,34 11,65
155 x 105 x 101 mm, transparent (7476001-99)
18 Panou din pluta Bi-Office, 60 x 90 cm, rama lemn buc. 1.00 32,96 32,96 6,26
(GMC070012010)
19 Pioneze color Eco, tinte pentru pluta , 40 buc/cutie buc. 1.00 2,16 2,16 0,41
(SOBO612)
20 Pix fara mecanism Eco, varf de 1 mm, albastru (SOIS405A) buc. 110.00 0,33 36,30 6,90
21 Plic C4 (229 x 324 mm), alb, siliconic, 10/set buc. 2.00 2,15 4,30 0,82
(15223619-14)
22 Tus pentru stampila Pelikan, cu picurator, 28 ml, negru buc. 1.00 6,93 6,93 1,32
(351197)
Notice that part of the item description sometimes comes after the total price, on the following line. The problem is that the spacing between items is not even: positions 8 and 9, for example, are almost adjacent, whereas positions 20 and 21 have a lot of space between them.
Somebody helped me with this regex:
\d{1,2}(.*)(\d+\.\d+\s+)(\d+\,\d+\s{0,1}){3}
This is where I got stuck because of the uneven layout.
It only matches the first line of each item. For example, given:
'''
16 Notite adezive OFFICE Products, 51 x 76 mm, galben pal, buc. 5.00 1,65 8,25 1,57
100 file (14047511-06)
'''
it gets only 16 Notite adezive OFFICE Products, 51 x 76 mm, galben pal, buc. 5.00 1,65 8,25 1,57 but not 100 file (14047511-06). The complete item description is Notite adezive OFFICE Products, 51 x 76 mm, galben pal, 100 file (14047511-06). This is how the text comes out when the PDF is converted to text.
I also need to extract the second part and merge it with the first to get the full item description.
Thank you

Try this regex:
\d{1,2}(.*)(\d+\.\d+\s+)(\d+\,\d+\s?){3}([\n ]+[^(\n]*\([^)]+\)(?=\n))?
Test on regex101
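As a cross-check outside regex101, the merge of a wrapped description can also be sketched line by line in Python. This is a minimal sketch, not the answerer's method. It assumes that every item line ends with a unit token followed by quantity, unit price, total, and VAT, separated by single spaces, and that a continuation line is any non-item line ending in a parenthesised product code. The names ITEM_LINE and parse_items are made up for illustration. One caveat worth checking on your own data: in a backtracking engine such as Python's re, the \s? after the last amount can consume the line break, in which case the optional trailing group has nothing left to start from, which is one reason to work line by line here.

```python
import re

# One item line: position, description, unit, quantity, unit price, total, VAT.
# Assumption: quantities use a dot (30.00) and amounts use a comma (0,76),
# as in the sample invoice.
ITEM_LINE = re.compile(
    r"^(?P<pos>\d{1,2}) (?P<desc>.+?) (?P<unit>\S+) "
    r"(?P<qty>\d+\.\d+) (?P<price>\d+,\d+) (?P<total>\d+,\d+) (?P<vat>\d+,\d+)$"
)


def parse_items(text):
    """Return one dict per invoice item, with wrapped descriptions merged."""
    items = []
    for raw in text.splitlines():
        line = raw.strip()
        match = ITEM_LINE.match(line)
        if match:
            items.append(match.groupdict())
        elif (
            items
            and line.endswith(")")
            and "(" in line
            and not items[-1]["desc"].endswith(")")
        ):
            # Continuation line: rest of the description plus the product code.
            items[-1]["desc"] += " " + line
        # Anything else (page footers such as "pagina 1 / 3") is ignored.
    return items


sample = """\
8 Creion mecanic BIC Matic, 0.7 mm (601021) buc. 4.00 1,88 7,52 1,43
10 Dosar din plastic cu sina si doua perforatii OFFICE buc. 100.00 0,35 35,00 6,65
Products, roz (21104211-13)
pagina 1 / 3
797638
16 Notite adezive OFFICE Products, 51 x 76 mm, galben pal, buc. 5.00 1,65 8,25 1,57
100 file (14047511-06)
"""

for item in parse_items(sample):
    print(item["pos"], "|", item["desc"], "|", item["qty"], item["total"])
```

Because the numeric columns are captured by name, the description and the figures stay attached to the same item even when a page footer falls between two items.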

Related

Rank a column according to the Filters selected by the user

I have data consisting of the customers' route details and their store scores.
Raw data with the overall ranking for all customers:
Dist_Code|Dist_Name|State|Store_name|Route_code|Store_score|Rank
5371 ABC Chicago CG 1200 5 1
2098 HGT Kansas KK 6500 4.8 2
7680 POE Arizona QW 3300 4.2 3
3476 POE Arizona CV 3300 4 4
6272 KUN Florida ANF 7800 3.9 5
3220 ABC Chicago AF 1200 3.6 6
7266 IOR Califor LU 4500 3.2 7
3789 POE Arizona TR 3300 3 8
9383 KAR Newyork IO 5600 3 9
1583 KUN Florida BOT 7800 2.8 10
8219 ABC Chicago Bb 1200 2.5 11
3734 ABC Chicago AA 1200 2 12
6900 POE Arizona HAL 3300 1.8 13
8454 KUN Florida UYO 7800 1.5 14
Filters
Distname ALL
State ALL
Routecode ALL
This is the overall ranking for all customers, with no filters selected. When I select a filter (such as Dist name, route code, or store score), I want the rank to be recomputed over the filtered rows only. For example:
Dist_Code|Dist_Name|State|Store_name|Route_code|Store_score|Rank
7680 POE Arizona QW 3300 4.2 1
3476 POE Arizona CV 3300 4 2
3789 POE Arizona TR 3300 3 3
6900 POE Arizona HAL 3300 1.8 4
Filter
Distname POE
State Arizona
Routecode 3300
The store score is based on some parameters that I calculated in a model using Python.
Currently it is a string column in Power BI. I tried some DAX, but without success.
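The intended behaviour (filter first, then rank what is left) can be sketched in plain Python. This does not solve the DAX side. It only pins down the logic, under the assumption that ranks are simply positions after sorting by score in descending order, with ties kept in their original order, as in the sample. The function name rank_filtered is hypothetical. Note that the scores are stored as strings, as in the question, and are converted to numbers before sorting.

```python
# (Dist_Code, Dist_Name, State, Store_name, Route_code, Store_score)
# A subset of the rows from the question; scores are strings on purpose.
rows = [
    ("5371", "ABC", "Chicago", "CG", "1200", "5"),
    ("7680", "POE", "Arizona", "QW", "3300", "4.2"),
    ("3476", "POE", "Arizona", "CV", "3300", "4"),
    ("6272", "KUN", "Florida", "ANF", "7800", "3.9"),
    ("3789", "POE", "Arizona", "TR", "3300", "3"),
    ("6900", "POE", "Arizona", "HAL", "3300", "1.8"),
]


def rank_filtered(rows, dist_name=None, state=None, route_code=None):
    """Apply the selected filters (None means ALL), then rank by score."""
    kept = [
        r for r in rows
        if (dist_name is None or r[1] == dist_name)
        and (state is None or r[2] == state)
        and (route_code is None or r[4] == route_code)
    ]
    # Convert the string score to a number so that "10" sorts above "9".
    kept.sort(key=lambda r: float(r[5]), reverse=True)
    return [r + (rank,) for rank, r in enumerate(kept, start=1)]


for row in rank_filtered(rows, dist_name="POE", state="Arizona", route_code="3300"):
    print(*row)
```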

SAS Restructure Data

I need help restructuring the data. My Table looks like this
NameHead Department Per_test Per_Delta Per_DB Per_Vul
Nancy Health 55 33.2 33 63
Jim Air 25 22.8 23 11
Shu Water 26 88.3 44 12
Dick Electricity 77 55.9 66 10
Elena General 88 22 67 9
Nancy Internet 66 12 44 79
And I want my table to look like this
NameHead Nancy Jim Shu Dick Elena Nancy
Department Health Air Water Electricity General Internet
Per_test 55 25 26 77 88 66
Per_Delta 33.2 22.8 88.3 55.9 22 12
Per_DB 33 23 44 66 67 44
Per_Vul 63 11 12 10 9 79
I tried PROC TRANSPOSE but couldn't get the desired result. Please help!
Thanks!
PROC TRANSPOSE does exactly what you want. You must include a VAR statement if you want to include the character variables.
proc transpose data=have out=want;
var _all_;
run;
Note that you cannot have variables that do not have names. Here is what the dataset looks like.
Obs _NAME_ COL1 COL2 COL3 COL4 COL5 COL6
1 NameHead Nancy Jim Shu Dick Elena Nancy
2 Department Health Air Water Electricity General Internet
3 Percent_test 55 25 26 77 88 66
4 Percent_Delta 33.2 22.8 88.3 55.9 22 12
5 Percent_DB 33 23 44 66 67 44
6 Percent_Vul 63 11 12 10 9 79
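For readers who want to see the same reshaping outside SAS, here is a minimal Python sketch of a transpose over the first rows of the question's table. It is only an illustration of the row/column swap, not a replacement for PROC TRANSPOSE. The header row becomes the first column, much like _NAME_ in the output above, and all values are kept as strings.

```python
# Header row plus the first three data rows from the question.
table = [
    ["NameHead", "Department", "Per_test", "Per_Delta", "Per_DB", "Per_Vul"],
    ["Nancy", "Health", "55", "33.2", "33", "63"],
    ["Jim", "Air", "25", "22.8", "23", "11"],
    ["Shu", "Water", "26", "88.3", "44", "12"],
]

# zip(*table) pairs up the i-th element of every row, i.e. the i-th column.
transposed = [list(column) for column in zip(*table)]

for row in transposed:
    print(" ".join(row))
```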

SAS ARIMA modelling for 5 different variables

I am trying to estimate ARIMA models for 5 different variables. The data consist of 16 months of point-of-sale figures. How should I approach this ARIMA modelling?
Furthermore, I would like to fit:
A simple moving average of each product group
A Holt-Winters exponential smoothing model
Data is as follows with date and product groups:
Date Gloves ShoeCovers Socks Warmers HeadWear
apr-14 11015 3827 3465 1264 772
maj-14 11087 2776 4378 1099 1423
jun-14 7645 1432 4490 674 670
jul-14 10083 7975 2577 1558 8501
aug-14 13887 8577 6854 1305 15621
sep-14 9186 5213 5244 1183 6784
okt-14 7611 4279 4150 977 6191
nov-14 6410 4033 2918 507 8276
dec-14 4856 3552 3192 450 4810
jan-15 17506 7274 3137 2216 3979
feb-15 21518 5672 8848 1838 2321
mar-15 17395 5200 5712 1604 2282
apr-15 11405 4531 5185 1479 1888
maj-15 11509 5690 4370 1145 2369
jun-15 9945 2610 4884 882 1709
jul-15 8707 5658 4570 1948 6255
Any skilled forecasters out there willing to help? Much appreciated!
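Of the three requests, the simple moving average is the easiest to pin down, so here is a minimal Python sketch of it for one product group. It is not SAS and says nothing about the ARIMA or Holt-Winters parts. The window length of 3 months is an arbitrary assumption, and the function name is made up for illustration.

```python
def simple_moving_average(values, window):
    """Trailing moving average: one value per full window of observations."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


# Gloves, apr-14 to sep-14, taken from the table above.
gloves = [11015, 11087, 7645, 10083, 13887, 9186]

for value in simple_moving_average(gloves, window=3):
    print(round(value, 1))
```

The same function can be applied to each of the five columns in turn. With only 16 monthly observations, any seasonal model has very little data to work with, so the results should be read with caution.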

SAS-Create Dynamic Tables Based On ID With Only Top 3 Showing

Hi, I would like to create tables dynamically from the following sample data: 4 new data sets, one for each PAYEE_ID (522, 622, 743, and 888). I want all of the fields to be in the 4 new data sets, but each should contain only the rows with the top 3 AMT_BILLED values for its PAYEE_ID.
PAYEE_ID PAYEENAME MSG_CODE MSG_DESCRIPTION AMT_BILLED percentbilled claimscounts PercentLines TotalAmount TotNumofClaims
522 MakeBelieve Center 1 AA text field 1 10000 4% 50 16% 275000 305
522 MakeBelieve Center 1 BB text field 2 20000 7% 40 13% 275000 305
522 MakeBelieve Center 1 6N text field 3 30000 11% 30 10% 275000 305
522 MakeBelieve Center 1 5U text field 4 25000 9% 20 7% 275000 305
522 MakeBelieve Center 1 1F text field 5 90000 33% 100 33% 275000 305
522 MakeBelieve Center 1 2E text field 6 100000 36% 65 21% 275000 305
622 Invisible Center 2 A4 text field 1 600 2% 9 7% 34300 134
622 Invisible Center 2 D2 text field 2 700 2% 31 23% 34300 134
622 Invisible Center 2 D4 text field 3 8000 23% 11 8% 34300 134
622 Invisible Center 2 DS text field 4 10000 29% 62 46% 34300 134
622 Invisible Center 2 F8 text field 5 15000 44% 21 16% 34300 134
743 Pretend Center 1 1K text field 1 440 1% 2 1% 41040 246
743 Pretend Center 1 1N text field 2 3000 7% 7 3% 41040 246
743 Pretend Center 1 1V text field 3 400 1% 4 2% 41040 246
743 Pretend Center 1 2W text field 4 15000 37% 63 26% 41040 246
743 Pretend Center 1 3B text field 5 500 1% 2 1% 41040 246
743 Pretend Center 1 3H text field 6 7700 19% 41 17% 41040 246
743 Pretend Center 1 3Z text field 7 14000 34% 127 52% 41040 246
888 It's A MakeBelieve One B7 text field 1 68000 38% 257 29% 178449 886
888 It's A MakeBelieve One B8 text field 2 5000 3% 47 5% 178449 886
888 It's A MakeBelieve One B9 text field 3 200 0% 138 16% 178449 886
888 It's A MakeBelieve One BB text field 4 1562 1% 18 2% 178449 886
888 It's A MakeBelieve One BO text field 5 39999 22% 3 0% 178449 886
888 It's A MakeBelieve One BZ text field 6 40000 22% 2 0% 178449 886
888 It's A MakeBelieve One C2 text field 7 500 0% 5 1% 178449 886
888 It's A MakeBelieve One C5 text field 8 7865 4% 395 45% 178449 886
888 It's A MakeBelieve One C7 text field 9 8649 5% 14 2% 178449 886
888 It's A MakeBelieve One CR text field 10 5674 3% 1 0% 178449 886
888 It's A MakeBelieve One CX text field 11 1000 1% 6 1% 178449 886
I'm new to SAS, and this would really help me out. Thank you so much!
proc sort data=sampleData out=sampleData_s;
by payee_id descending amt_billed;
run;
The descending keyword puts the largest amt_billed first within each payee_id, assuming that by 'top' you mean largest. If you mean smallest, drop it here and in the data step below, because both BY statements must agree.
Once the data are sorted, you can read them into a data step and use first. and last. processing, e.g.
data partial_solution(drop=count);
retain count 0;
set sampleData_s;
by payee_id descending amt_billed;
if first.payee_id then count=0;
count+1;
if count le 3 then output;
run;
To output to different dataset names:
proc sort data=sampleData(keep=payee_id) out=all_payee_ids nodupkey;
by payee_id;
run;
data _null_;
length id_list $10000; * needs to be long enough to contain all ids;
* if you do not state this, sas will default;
* length to first value;
retain id_list;
set all_payee_ids end=eof;
id_list = catx('|', id_list, payee_id);
if eof then call symputx('macroVarIdList', id_list);
run;
You've now got a pipe-separated list of all your IDs. You can loop through them to build the names of your datasets. You need to do this because SAS has to know up front the names of all the datasets you want to output to, e.g.
data ds1 ds2 ds3 ds4;
set some_guff;
if blah then output ds1;
else if blahblah then output ds2;
else output ds3;
output ds4;
run;
So with the macro variable loop (wrapped in a macro, because %do is not valid in open code; the macro name is arbitrary):
%macro split_by_payee;
%let nrVars=%sysfunc(countw(&macroVarIdList));
data
%do i = 1 %to &nrVars;
dataset_%scan(&macroVarIdList,&i,|)
%end;
;
set partial_solution;
count+1;
%do j = 1 %to &nrVars;
%let thisPayeeId=%scan(&macroVarIdList,&j,|);
if payee_id = "&thisPayeeId" then output dataset_&thisPayeeId.;
%end;
run;
%mend split_by_payee;
%split_by_payee
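The logic of the steps above (group by payee, sort by amount in descending order, keep the first three, write one table per payee) can be sketched in a few lines of Python. This is an illustration of the idea rather than a translation of the SAS code. The function name top_n_per_payee is hypothetical, and only three of the columns and two of the payees from the question are included.

```python
from collections import defaultdict

rows = [
    {"PAYEE_ID": 522, "MSG_CODE": "AA", "AMT_BILLED": 10000},
    {"PAYEE_ID": 522, "MSG_CODE": "BB", "AMT_BILLED": 20000},
    {"PAYEE_ID": 522, "MSG_CODE": "6N", "AMT_BILLED": 30000},
    {"PAYEE_ID": 522, "MSG_CODE": "5U", "AMT_BILLED": 25000},
    {"PAYEE_ID": 522, "MSG_CODE": "1F", "AMT_BILLED": 90000},
    {"PAYEE_ID": 522, "MSG_CODE": "2E", "AMT_BILLED": 100000},
    {"PAYEE_ID": 622, "MSG_CODE": "A4", "AMT_BILLED": 600},
    {"PAYEE_ID": 622, "MSG_CODE": "D2", "AMT_BILLED": 700},
    {"PAYEE_ID": 622, "MSG_CODE": "D4", "AMT_BILLED": 8000},
    {"PAYEE_ID": 622, "MSG_CODE": "DS", "AMT_BILLED": 10000},
    {"PAYEE_ID": 622, "MSG_CODE": "F8", "AMT_BILLED": 15000},
]


def top_n_per_payee(rows, n=3):
    """Return {payee_id: the n rows with the largest AMT_BILLED}."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["PAYEE_ID"]].append(row)
    return {
        payee: sorted(group, key=lambda r: r["AMT_BILLED"], reverse=True)[:n]
        for payee, group in groups.items()
    }


tables = top_n_per_payee(rows)
for payee, table in tables.items():
    print(payee, [(r["MSG_CODE"], r["AMT_BILLED"]) for r in table])
```

A dictionary keyed by payee plays the role of the separately named datasets, which avoids having to know the list of IDs in advance.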

Extract measurement values from a column in R

I am just learning regex right now in R. I have a sample data frame as shown below:
col1 <- c('1/2 in. x 1/2 in. x 3/4 ft. Copper Pressure Cup', 'Ensemble 60 in. x 43-1/2 in. x 54-1/4 in. 3-piece',
'2-3/4 in. x 4-1/2 in. Heavy-Duty',
'1/4-20 x 2 in. Forged Steel ',
'1/2-Amp Slo-Blo GMA Fuse',
'3/4 in. x 12 in. x 24 in. White Thermally',
'12.0 oz. of weight',
'1.4 fl. oz. of liquid',
'14 gal. tall',
'Sahara Wood 47 in. Long x 12-1/8 in. Deep x 1-11/16 in. Height',
'1/25 HP Cast Iron ',
'1/2 in., 3/4 in. and 1 in. PVC ',
'24-3/4 in. x 48-3/4 in. x 1-1/4 in. Faux Windsor',
'8 oz. -200 Pot Of Cream',
'5/8 in. dia. x 25 ft. Water ',
'18.5 / 30.5 in. Brushed Nickel',
'57-1/2 in. x 70-5/16 in. Semi-Framed',
'2-1/4 HP Router',
'12-Volt Lithium-Ion Cordless 3/8 in.',
'12-Gauge 24-5/8 in. Strap ',
'7-3/4 in. Wigan Ceiling',
'1 qt. B-I-N ',
'3/8 in. O.D. x 1/4 in. NPTF',
'2-1/2 in. Long x 5/8 in. Diameter Spring',
'1/4 x 3 in. Heat-Shrink ',
'4-White PVC End',
'41000 Series Non-Vented Range',
'Revival 1-Spray 5-Katalyst Air',
'180-Degree White Outdoor',
'3/8 x 3 Hand Scraped ',
'67-Qt. Jug',
'35-77-7/8 in. White',
'-16 tpi x 4 in. Stainless Steel',
'3-21 degree Full')
df <- data.frame(col1 = col1)
df
col1
1 1/2 in. x 1/2 in. x 3/4 ft. Copper Pressure Cup
2 Ensemble 60 in. x 43-1/2 in. x 54-1/4 in. 3-piece
3 2-3/4 in. x 4-1/2 in. Heavy-Duty
4 1/4-20 x 2 in. Forged Steel
5 1/2-Amp Slo-Blo GMA Fuse
6 3/4 in. x 12 in. x 24 in. White Thermally
7 12.0 oz. of weight
8 1.4 fl. oz. of liquid
9 14 gal. tall
10 Sahara Wood 47 in. Long x 12-1/8 in. Deep x 1-11/16 in. Height
11 1/25 HP Cast Iron
12 1/2 in., 3/4 in. and 1 in. PVC
13 24-3/4 in. x 48-3/4 in. x 1-1/4 in. Faux Windsor
14 8 oz. -200 Pot Of Cream
15 5/8 in. dia. x 25 ft. Water
16 18.5 / 30.5 in. Brushed Nickel
17 57-1/2 in. x 70-5/16 in. Semi-Framed
18 2-1/4 HP Router
19 12-Volt Lithium-Ion Cordless 3/8 in.
20 12-Gauge 24-5/8 in. Strap
21 7-3/4 in. Wigan Ceiling
22 1 qt. B-I-N
23 3/8 in. O.D. x 1/4 in. NPTF
24 2-1/2 in. Long x 5/8 in. Diameter Spring
25 1/4 x 3 in. Heat-Shrink
26 4-White PVC End
27 41000 Series Non-Vented Range
28 Revival 1-Spray 5-Katalyst Air
29 180-Degree White Outdoor
30 3/8 x 3 Hand Scraped
31 67-Qt. Jug
32 35-77-7/8 in. White
33 -16 tpi x 4 in. Stainless Steel
34 3-21 degree Full
I would like to extract all the measurement data from col1 and add it to col2.
I tried the following (str_extract comes from the stringr package):
library(stringr)
df$col2 <- str_extract(df$col1, '([[:digit:]]*\\/?\\.?\\-?[[:digit:]]+[[:space:]]+(in|ft)\\.[[:space:]]*x*)')
And the results are as follows:
col1 col2
1 1/2 in. x 1/2 in. x 3/4 ft. Copper Pressure Cup 1/2 in. x
2 Ensemble 60 in. x 43-1/2 in. x 54-1/4 in. 3-piece 60 in. x
3 2-3/4 in. x 4-1/2 in. Heavy-Duty 3/4 in. x
4 1/4-20 x 2 in. Forged Steel 2 in.
5 1/2-Amp Slo-Blo GMA Fuse <NA>
6 3/4 in. x 12 in. x 24 in. White Thermally 3/4 in. x
7 12.0 oz. of weight <NA>
8 1.4 fl. oz. of liquid <NA>
9 14 gal. tall <NA>
10 Sahara Wood 47 in. Long x 12-1/8 in. Deep x 1-11/16 in. Height 47 in.
11 1/25 HP Cast Iron <NA>
12 1/2 in., 3/4 in. and 1 in. PVC 1/2 in.
13 24-3/4 in. x 48-3/4 in. x 1-1/4 in. Faux Windsor 3/4 in. x
14 8 oz. -200 Pot Of Cream <NA>
15 5/8 in. dia. x 25 ft. Water 5/8 in.
16 18.5 / 30.5 in. Brushed Nickel 30.5 in.
17 57-1/2 in. x 70-5/16 in. Semi-Framed 1/2 in. x
18 2-1/4 HP Router <NA>
19 12-Volt Lithium-Ion Cordless 3/8 in. 3/8 in.
20 12-Gauge 24-5/8 in. Strap 5/8 in.
21 7-3/4 in. Wigan Ceiling 3/4 in.
22 1 qt. B-I-N <NA>
23 3/8 in. O.D. x 1/4 in. NPTF 3/8 in.
24 2-1/2 in. Long x 5/8 in. Diameter Spring 1/2 in.
25 1/4 x 3 in. Heat-Shrink 3 in.
26 4-White PVC End <NA>
27 41000 Series Non-Vented Range <NA>
28 Revival 1-Spray 5-Katalyst Air <NA>
29 180-Degree White Outdoor <NA>
30 3/8 x 3 Hand Scraped <NA>
31 67-Qt. Jug <NA>
32 35-77-7/8 in. White 7/8 in.
33 -16 tpi x 4 in. Stainless Steel 4 in.
34 3-21 degree Full <NA>
However, I would want to see results like:
df
col1 col2
1 1/2 in. x 1/2 in. x 3/4 ft. Copper Pressure Cup 1/2 in. x 1/2 in. x 3/4 ft.
2 Ensemble 60 in. x 43-1/2 in. x 54-1/4 in. 3-piece 60 in. x 43-1/2 in. x 54-1/4 in.
3 2-3/4 in. x 4-1/2 in. Heavy-Duty 2-3/4 in. x 4-1/2 in.
4 1/4-20 x 2 in. Forged Steel 1/4-20 x 2 in.
5 1/2-Amp Slo-Blo GMA Fuse 1/2-Amp
6 3/4 in. x 12 in. x 24 in. White Thermally 3/4 in. x 12 in. x 24 in.
7 12.0 oz. of weight 12.0 oz.
8 1.4 fl. oz. of liquid 1.4 fl. oz.
9 14 gal. tall 14 gal.
10 Sahara Wood 47 in. Long x 12-1/8 in. Deep x 1-11/16 in. Height 47 in. Long x 12-1/8 in. Deep x 1-11/16 in. Height
11 1/25 HP Cast Iron 1/25 HP
12 1/2 in., 3/4 in. and 1 in. PVC 1/2 in., 3/4 in. and 1 in.
13 24-3/4 in. x 48-3/4 in. x 1-1/4 in. Faux Windsor 24-3/4 in. x 48-3/4 in. x 1-1/4 in.
14 8 oz. -200 Pot Of Cream 8 oz.
15 5/8 in. dia. x 25 ft. Water 5/8 in. dia. x 25 ft.
16 18.5 / 30.5 in. Brushed Nickel 18.5 / 30.5 in.
17 57-1/2 in. x 70-5/16 in. Semi-Framed 57-1/2 in. x 70-5/16 in.
18 2-1/4 HP Router 2-1/4 HP
19 12-Volt Lithium-Ion Cordless 3/8 in. 12-Volt 3/8 in.
20 12-Gauge 24-5/8 in. Strap 12-Gauge 24-5/8 in.
21 7-3/4 in. Wigan Ceiling 7-3/4 in.
22 1 qt. B-I-N 1 qt.
23 3/8 in. O.D. x 1/4 in. NPTF 3/8 in. O.D. x 1/4 in.
24 2-1/2 in. Long x 5/8 in. Diameter Spring 2-1/2 in. Long x 5/8 in. Diameter
25 1/4 x 3 in. Heat-Shrink 1/4 x 3 in.
26 4-White PVC End
27 41000 Series Non-Vented Range
28 Revival 1-Spray 5-Katalyst Air 1-Spray 5-Katalyst
29 180-Degree White Outdoor 180-Degree
30 3/8 x 3 Hand Scraped 3/8 x 3
31 67-Qt. Jug 67-Qt.
32 35-77-7/8 in. White 35-77-7/8 in.
33 -16 tpi x 4 in. Stainless Steel -16 tpi x 4 in.
34 3-21 degree Full 3-21 degree
I am not sure how I should tweak my regex so that it works for all the cases. I have also tried splitting the regex into several steps, but that did not help much either.
Here is how I tried it:
df$col2 = paste(str_extract_all(df$col1, '([[:digit:]]*\\.?\\/?[[:digit:]]+[[:space:]]+(in|ft|cu\\.[[:space:]]+ft)\\.[[:space:]]*[WHD]*[[:space:]]+x*[[:space:]]*)+'), collapse = ' ')
df$col2[is.na(df$col2)] <- paste(str_extract_all(df$col1[df$col2], '[[:digit:]]*\\.?\\/?[[:digit:]]+[[:space:]]*\\-*(oz|lb|gal|Gal)\\.'), collapse = ' ')
df$col2[is.na(df$col2)] <- paste(str_extract_all(df$col1[df$col2],'[[:digit:]]*\\.?\\,?[[:digit:]]+\\-?[[:space:]]*(Watt|Pack|Gauge|piece|Piece|Panel|mph|MPH|cc|Ton|ton|Light|Gang|LED|Volt|amp|BTU|Amp|Drawer|Step|Tier|Cycle)[[:space:]]+'), collapse = ' ')
df$col2[is.na(df$col2)] <- paste(str_extract_all(df$col1[df$col2], '[[:digit:]]*\\.?\\,?[[:digit:]]+[[:space:]]+sq\\.[[:space:]]+ft\\.'), collapse = ' ')
Yet I am not getting the results I want.
Do you have any suggestions?
Thanks!
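One way to approach this is to build the pattern from small named pieces (a number, a unit, then a chain of measurements joined by "x") instead of one long expression. Below is a minimal sketch of that idea in Python rather than R; the same pattern structure should carry over to stringr, though that is untested here. It deliberately covers only numbers followed by an abbreviated unit from a short, assumed list, so rows such as '1/4-20 x 2 in.', '12-Volt ...' or '47 in. Long x ...' would need further alternatives. The names NUMBER, UNIT, MEASURE, CHAIN and extract_measurements are made up for illustration.

```python
import re

# A number, optionally extended by -, / or . parts: 60, 12.0, 3/4, 43-1/2.
NUMBER = r"\d+(?:[-/.]\d+)*"
# Assumed unit list; "fl. oz" must come before "oz" so that it wins.
UNIT = r"(?:fl\. oz|in|ft|oz|gal|qt)\."
MEASURE = rf"{NUMBER}\s+{UNIT}"
# One measurement, optionally followed by more measurements joined by "x".
CHAIN = re.compile(rf"{MEASURE}(?:\s+x\s+{MEASURE})*")


def extract_measurements(text):
    """Join every measurement chain found in text with single spaces."""
    return " ".join(CHAIN.findall(text))


examples = [
    "1/2 in. x 1/2 in. x 3/4 ft. Copper Pressure Cup",
    "Ensemble 60 in. x 43-1/2 in. x 54-1/4 in. 3-piece",
    "1.4 fl. oz. of liquid",
    "4-White PVC End",
]
for text in examples:
    print(repr(extract_measurements(text)))
```

Keeping every group non-capturing matters here: with capturing groups, findall would return the groups instead of the whole match.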