I'm trying to use a regex to extract all items from an invoice (name, unit price, total, VAT, etc.). I managed to get all the numeric fields, but the biggest problem is extracting the item descriptions, because a description is sometimes split across two lines. This is the text I need to match:
1 Agrafe metalice Eco, rotunjite, 33 mm, 50 buc/cutie buc. 30.00 0,76 22,80 4,33
(SOBO604)
2 Banda corectoare DONAU Mouse, 5 mm x 8 m, orizontala, buc. 5.00 4,83 24,15 4,59
blister (7635001PL-99)
3 Biblioraft plastifiat OFFICE Products, 5 cm, colturi buc. 75.00 5,08 381,00 72,39
metalice, albastru (21011121-01)
4 Burete magnetic DONAU, 110 x 57 x 25 mm, galben buc. 10.00 5,53 55,30 10,51
(7638001PL-99)
5 Calculator de birou Canon WS-1610T, solar, 16 cifre, buc. 1.00 71,11 71,11 13,51
afisaz inclinat, format mare (WS1610T)
6 Capse zincate OFFICE Products 24/6, 1000 buc/cutie buc. 5.00 1,12 5,60 1,06
(18072419-19)
7 Creion grafic Eco, ascutit, cu radiera, corp verde buc. 20.00 0,40 8,00 1,52
(SOIS432)
8 Creion mecanic BIC Matic, 0.7 mm (601021) buc. 4.00 1,88 7,52 1,43
9 Dosar din plastic cu sina si doua perforatii OFFICE buc. 250.00 0,35 87,50 16,63
Products, albastru (21104211-01)
10 Dosar din plastic cu sina si doua perforatii OFFICE buc. 100.00 0,35 35,00 6,65
Products, roz (21104211-13)
pagina 1 / 3
797638
11 Folie protectie OFFICE Products, A4, coaja portocala, 40 buc. 5.00 6,53 32,65 6,20
microni, 100 file/set (21141215-90)
12 Folie protectie OFFICE Products, A4, coaja portocala, 40 buc. 20.00 6,51 130,20 24,74
microni, 100 file/set (21141215-90)
13 Marker whiteboard Eco, varf rotund, albastru (SOIS535A) buc. 104.00 1,33 138,32 26,28
14 Marker whiteboard Eco, varf rotund, negru (SOIS535N) buc. 2.00 1,33 2,66 0,51
15 Marker whiteboard Eco, varf rotund, rosu (SOIS535R) buc. 2.00 1,33 2,66 0,51
16 Notite adezive OFFICE Products, 51 x 76 mm, galben pal, buc. 5.00 1,65 8,25 1,57
100 file (14047511-06)
17 Organizator de birou DONAU Clasic VII, 6 compartimente, buc. 2.00 30,67 61,34 11,65
155 x 105 x 101 mm, transparent (7476001-99)
18 Panou din pluta Bi-Office, 60 x 90 cm, rama lemn buc. 1.00 32,96 32,96 6,26
(GMC070012010)
19 Pioneze color Eco, tinte pentru pluta , 40 buc/cutie buc. 1.00 2,16 2,16 0,41
(SOBO612)
20 Pix fara mecanism Eco, varf de 1 mm, albastru (SOIS405A) buc. 110.00 0,33 36,30 6,90
21 Plic C4 (229 x 324 mm), alb, siliconic, 10/set buc. 2.00 2,15 4,30 0,82
(15223619-14)
22 Tus pentru stampila Pelikan, cu picurator, 28 ml, negru buc. 1.00 6,93 6,93 1,32
(351197)
Notice that the item description sometimes continues on the next line, after the total price and VAT. The problem is that the layout isn't uniform: for example, positions 8 and 9 follow each other directly, while positions 20 and 21 are separated by a continuation line.
Somebody helped me, but their pattern matches only the first line of each item:
\d{1,2}(.*)(\d+\.\d+\s+)(\d+\,\d+\s{0,1}){3}
This is where I got stuck, because of the uneven layout.
It only gets the first line. For example, for:
'''
16 Notite adezive OFFICE Products, 51 x 76 mm, galben pal, buc. 5.00 1,65 8,25 1,57
100 file (14047511-06)
'''
it gets only 16 Notite adezive OFFICE Products, 51 x 76 mm, galben pal, buc. 5.00 1,65 8,25 1,57 but not 100 file (14047511-06). The complete item description is Notite adezive OFFICE Products, 51 x 76 mm, galben pal, 100 file (14047511-06); the split is how the text comes out when the PDF is converted to text.
I also need to extract the second part and merge it with the first to get the full item description.
Thank you
Try this regex:
\d{1,2}(.*)(\d+\.\d+\s+)(\d+\,\d+\s?){3}([\n ]+[^(\n]*\([^)]+\)(?=\n))?
Test on regex101
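If a single pattern gets hard to maintain, a line-by-line pass is another option. The following is only a sketch in Python, not part of the answer above: it assumes every item row ends with a unit such as "buc." followed by the four numbers, that any other line continues the previous description, and that page-break lines look like "pagina 1 / 3" or contain only digits. The names ITEM_RE, NOISE_RE and parse_items are made up for illustration.

```python
import re

# One item row: position, first part of the description, unit (ends with a dot),
# quantity with a dot decimal, then three amounts with a comma decimal.
ITEM_RE = re.compile(
    r"^(?P<pos>\d{1,3})\s+(?P<desc>.+?)\s+(?P<unit>\S+\.)\s+"
    r"(?P<qty>\d+\.\d+)\s+(?P<price>\d+,\d+)\s+"
    r"(?P<total>\d+,\d+)\s+(?P<vat>\d+,\d+)\s*$"
)

# Assumption: page-break noise is "pagina N / M" or a line made only of digits.
NOISE_RE = re.compile(r"^(pagina\s+\d+\s*/\s*\d+|\d+)$")


def parse_items(text):
    """Return one dict per invoice item, with continuation lines merged into 'desc'."""
    items = []
    for line in text.splitlines():
        line = line.strip()
        if not line or NOISE_RE.match(line):
            continue  # skip blank lines and page-break noise
        match = ITEM_RE.match(line)
        if match:
            items.append(match.groupdict())
        elif items:
            # Not an item row: treat it as the rest of the previous description.
            items[-1]["desc"] += " " + line
    return items
```

Calling parse_items on the text above should give one dict per position, with "desc" holding the merged description. A continuation line made up only of digits would be dropped as noise, so NOISE_RE may need adjusting for other invoices.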
I have a raw data set (linked in the original post).
I tried to convert the observations that contain a "T" to 0,
and then read in the data set and print it. Just this.
However, with my code, just looking at the first observation (line 5 of the file) shows that something is off.
For instance, the first observation for "Nov" should not be 0.
I could not figure out what went wrong, and I wonder if anyone could give me some advice on what to try next. Thank you very much! Highly appreciated.
My code is as below:
INFILE "&DIRLSB.Pr1Snowfall1.csv" DSD FIRSTOBS=5;
DROP i;
INPUT Season $#;
INPUT Year 1-4 Season 1-7 Sep Oct Nov Dec Jan Feb Mar Apr May Total;
ARRAY Months (*) Sep -- May;
DO i = 1 TO dim(Months);
IF Months(i)=. Then Months(i)=0;
END;
RUN;
I'm guessing you have a "missing T;" statement somewhere, which makes SAS read T (trace) as the special missing value .T, and .T does not equal . (the ordinary missing value).
I would use the COALESCE function, which treats .T as missing as well. Is there really any need to change missing T to 0, though?
missing t;
data snow;
infile cards firstobs=2;
input Season:$7. Sep Oct Nov Dec Jan Feb Mar Apr May Total;
array mth[*] Sep--May;
do i = 1 to dim(mth);
mth[i] = coalesce(mth[i],0);
end;
t = sum(of mth[*]);
drop i;
cards;
Season Sep Oct Nov Dec Jan Feb Mar Apr May Total
1884-85 0 T 1 27.1 22.2 17 3.5 19.5 T 90.3
1885-86 0 1.7 8.2 8.4 16.9 16 6.5 7 0 64.7
1886-87 0 T 22.2 12.5 12 18.4 6.3 1.2 0 72.6
1893-94 0 0.5 6.1 27.6 20 29.5 5.4 13.3 0 102.4
1894-95 0 T 11.1 22.1 26.5 23.6 9.5 0.6 0 93.4
1895-96 0 1.5 5.9 8.7 22.5 39.1 45.1 1 0 123.8
1896-97 0 T 5.5 13.9 20.1 13.7 8.1 5.2 0 66.5
1897-98 0 0 10.1 18.4 32.1 26.8 1.2 2.4 0 91
1898-99 0 T 10.6 27 16.6 16.3 21.2 4.3 T 96
1899-00 T T 1.3 21.5 24.7 28.5 54 1.3 0 131.3
1906-07 0 5 5.7 18.7 11.7 15.7 3.1 2.5 1.3 63.7
1907-08 0 0 2.2 11.6 16.5 19.8 7.9 6.3 3 67.3
1908-09 0 0.5 4.6 10 22.5 6.1 9.7 9.8 3.3 66.5
1909-10 0 T 1.7 14.6 22 42.7 3.4 0.5 0 84.9
1910-11 0 2.2 15.7 29.8 9.5 30 13.5 4.7 2 107.4
1911-12 0 0 6.5 7.5 21.5 10.8 8.8 6.9 T 62
;;;;
run;
proc print;
run;
I am trying to select the following data using pandas for Python 2.7 from the web page (http://owww.met.hu/eghajlat/eghajlati_adatsorok/bp/Navig/202_EN.htm), keeping only the years 1991 to 2000. Could somebody please help me write the code? Thanks!
datum m_ta m_tax m_taxd m_tan m_tand
------- ----- ----- ---------- ----- ----------
1901-01 -4.7 5.0 1901-01-23 -12.2 1901-01-10
1901-02 -2.1 3.5 1901-02-06 -7.9 1901-02-15
1901-03 5.8 13.5 1901-03-20 0.6 1901-03-01
1901-04 11.6 18.2 1901-04-10 7.4 1901-04-23
1901-05 16.8 22.5 1901-05-31 12.2 1901-05-05
1901-06 21.0 24.8 1901-06-03 14.6 1901-06-17
1901-07 22.4 27.4 1901-07-30 16.9 1901-07-04
1901-08 20.7 25.9 1901-08-01 14.7 1901-08-29
1901-09 15.9 19.9 1901-09-01 11.8 1901-09-09
1901-10 12.6 17.9 1901-10-04 8.3 1901-10-31
1901-11 4.7 11.1 1901-11-14 -0.2 1901-11-26
1901-12 4.2 8.4 1901-12-22 -1.4 1901-12-07
1902-01 3.4 7.5 1902-01-25 -2.2 1902-01-15
1902-02 2.8 6.6 1902-02-09 -2.8 1902-02-06
1902-03 5.3 13.3 1902-03-22 -3.5 1902-03-13
1902-04 10.5 15.8 1902-04-21 6.1 1902-04-08
1902-05 12.5 20.6 1902-05-31 8.5 1902-05-10
1902-06 18.5 23.8 1902-06-30 14.4 1902-06-19
....
You can use dt.year with boolean indexing to select rows by the column datum:
#convert column datum to period
df['datum'] = pd.to_datetime(df['datum']).dt.to_period('M')
#convert columns to datetime
df['m_taxd'] = pd.to_datetime(df['m_taxd'])
df['m_tand'] = pd.to_datetime(df['m_tand'])
print df.datum.dt.year
0 1901
1 1901
2 1901
3 1901
4 1901
5 1901
6 1901
7 1901
8 1901
9 1901
10 1901
11 1901
12 1902
13 1902
14 1902
15 1902
16 1902
17 1902
Name: datum, dtype: int64
#change 1901 to 2000
print df[df.datum.dt.year <= 1901]
datum m_ta m_tax m_taxd m_tan m_tand
0 1901-01 -4.7 5.0 1901-01-23 -12.2 1901-01-10
1 1901-02 -2.1 3.5 1901-02-06 -7.9 1901-02-15
2 1901-03 5.8 13.5 1901-03-20 0.6 1901-03-01
3 1901-04 11.6 18.2 1901-04-10 7.4 1901-04-23
4 1901-05 16.8 22.5 1901-05-31 12.2 1901-05-05
5 1901-06 21.0 24.8 1901-06-03 14.6 1901-06-17
6 1901-07 22.4 27.4 1901-07-30 16.9 1901-07-04
7 1901-08 20.7 25.9 1901-08-01 14.7 1901-08-29
8 1901-09 15.9 19.9 1901-09-01 11.8 1901-09-09
9 1901-10 12.6 17.9 1901-10-04 8.3 1901-10-31
10 1901-11 4.7 11.1 1901-11-14 -0.2 1901-11-26
11 1901-12 4.2 8.4 1901-12-22 -1.4 1901-12-07
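To keep both ends of the range the question asks for (1991 to 2000), the same boolean indexing can use Series.between, which includes both bounds by default. This is a minimal sketch on a small frame with made-up m_ta values, not the real data from the page. The single-argument print(...) call works in both Python 2.7 and Python 3.

```python
import pandas as pd

# Illustrative rows only; the m_ta values are invented for this sketch.
df = pd.DataFrame({
    "datum": ["1990-12", "1991-01", "1995-06", "2000-12", "2001-01"],
    "m_ta": [0.5, -1.2, 20.3, 1.1, 2.4],
})

# Convert the column datum to monthly periods, as in the answer above.
df["datum"] = pd.to_datetime(df["datum"], format="%Y-%m").dt.to_period("M")

# Boolean mask: True for rows whose year lies in 1991..2000 (both ends included).
in_range = df["datum"].dt.year.between(1991, 2000)
decade = df[in_range]

print(decade)
```

The same mask can also be written as (df.datum.dt.year >= 1991) & (df.datum.dt.year <= 2000).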
I want to write Python 2.7 code to compute the percentages of m_tax and m_tan for the data from the web page (http://owww.met.hu/eghajlat/eghajlati_adatsorok/bp/Navig/202_EN.htm). I already have the code that builds the dataframe, but I couldn't write the percentage code. Could somebody please help me write it? Thanks!
datum m_ta m_tax m_taxd m_tan m_tand
------- ----- ----- ---------- ----- ----------
1901-01 -4.7 5.0 1901-01-23 -12.2 1901-01-10
1901-02 -2.1 3.5 1901-02-06 -7.9 1901-02-15
1901-03 5.8 13.5 1901-03-20 0.6 1901-03-01
1901-04 11.6 18.2 1901-04-10 7.4 1901-04-23
1901-05 16.8 22.5 1901-05-31 12.2 1901-05-05
1901-06 21.0 24.8 1901-06-03 14.6 1901-06-17
1901-07 22.4 27.4 1901-07-30 16.9 1901-07-04
1901-08 20.7 25.9 1901-08-01 14.7 1901-08-29
You can call div and pass the sum of the columns to add % columns:
In [66]:
df['m_tax%'],df['m_tan%'] = df['m_tax'].div(df['m_tax'].sum()) * 100, df['m_tan'].div(df['m_tax'].sum()) * 100
df
Out[66]:
datum m_ta m_tax m_taxd m_tan m_tand m_tax% m_tan%
0 1901-01 -4.7 5.0 1901-01-23 -12.2 1901-01-10 3.551136 -8.664773
1 1901-02 -2.1 3.5 1901-02-06 -7.9 1901-02-15 2.485795 -5.610795
2 1901-03 5.8 13.5 1901-03-20 0.6 1901-03-01 9.588068 0.426136
3 1901-04 11.6 18.2 1901-04-10 7.4 1901-04-23 12.926136 5.255682
4 1901-05 16.8 22.5 1901-05-31 12.2 1901-05-05 15.980114 8.664773
5 1901-06 21.0 24.8 1901-06-03 14.6 1901-06-17 17.613636 10.369318
6 1901-07 22.4 27.4 1901-07-30 16.9 1901-07-04 19.460227 12.002841
7 1901-08 20.7 25.9 1901-08-01 14.7 1901-08-29 18.394886 10.440341
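Note that the line above divides m_tan by the sum of m_tax, which is why the m_tan% column does not add up to 100. If each column should instead be expressed as a share of its own total, one possible variant is sketched below. This is an assumption about the intent, using only the eight rows shown in the question.

```python
import pandas as pd

# The m_tax and m_tan values from the eight rows shown in the question.
df = pd.DataFrame({
    "m_tax": [5.0, 3.5, 13.5, 18.2, 22.5, 24.8, 27.4, 25.9],
    "m_tan": [-12.2, -7.9, 0.6, 7.4, 12.2, 14.6, 16.9, 14.7],
})

# Divide each column by its own sum, so each percentage column totals 100.
for col in ["m_tax", "m_tan"]:
    df[col + "%"] = df[col].div(df[col].sum()) * 100

print(df)
```

Because m_tan contains negative temperatures, some of its shares are negative and the positive ones are correspondingly larger, so a percentage of the column total may not be the most meaningful measure for that column.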