Extract measurement data values from a column in R - regex

I am just learning regex right now in R. I have a sample data frame as shown below:
col1 <- c('1/2 in. x 1/2 in. x 3/4 ft. Copper Pressure Cup', 'Ensemble 60 in. x 43-1/2 in. x 54-1/4 in. 3-piece',
'2-3/4 in. x 4-1/2 in. Heavy-Duty',
'1/4-20 x 2 in. Forged Steel ',
'1/2-Amp Slo-Blo GMA Fuse',
'3/4 in. x 12 in. x 24 in. White Thermally',
'12.0 oz. of weight',
'1.4 fl. oz. of liquid',
'14 gal. tall',
'Sahara Wood 47 in. Long x 12-1/8 in. Deep x 1-11/16 in. Height',
'1/25 HP Cast Iron ',
'1/2 in., 3/4 in. and 1 in. PVC ',
'24-3/4 in. x 48-3/4 in. x 1-1/4 in. Faux Windsor',
'8 oz. -200 Pot Of Cream',
'5/8 in. dia. x 25 ft. Water ',
'18.5 / 30.5 in. Brushed Nickel',
'57-1/2 in. x 70-5/16 in. Semi-Framed',
'2-1/4 HP Router',
'12-Volt Lithium-Ion Cordless 3/8 in.',
'12-Gauge 24-5/8 in. Strap ',
'7-3/4 in. Wigan Ceiling',
'1 qt. B-I-N ',
'3/8 in. O.D. x 1/4 in. NPTF',
'2-1/2 in. Long x 5/8 in. Diameter Spring',
'1/4 x 3 in. Heat-Shrink ',
'4-White PVC End',
'41000 Series Non-Vented Range',
'Revival 1-Spray 5-Katalyst Air',
'180-Degree White Outdoor',
'3/8 x 3 Hand Scraped ',
'67-Qt. Jug',
'35-77-7/8 in. White',
'-16 tpi x 4 in. Stainless Steel',
'3-21 degree Full')
df <- data.frame(col1 = col1)
df
col1
1 1/2 in. x 1/2 in. x 3/4 ft. Copper Pressure Cup
2 Ensemble 60 in. x 43-1/2 in. x 54-1/4 in. 3-piece
3 2-3/4 in. x 4-1/2 in. Heavy-Duty
4 1/4-20 x 2 in. Forged Steel
5 1/2-Amp Slo-Blo GMA Fuse
6 3/4 in. x 12 in. x 24 in. White Thermally
7 12.0 oz. of weight
8 1.4 fl. oz. of liquid
9 14 gal. tall
10 Sahara Wood 47 in. Long x 12-1/8 in. Deep x 1-11/16 in. Height
11 1/25 HP Cast Iron
12 1/2 in., 3/4 in. and 1 in. PVC
13 24-3/4 in. x 48-3/4 in. x 1-1/4 in. Faux Windsor
14 8 oz. -200 Pot Of Cream
15 5/8 in. dia. x 25 ft. Water
16 18.5 / 30.5 in. Brushed Nickel
17 57-1/2 in. x 70-5/16 in. Semi-Framed
18 2-1/4 HP Router
19 12-Volt Lithium-Ion Cordless 3/8 in.
20 12-Gauge 24-5/8 in. Strap
21 7-3/4 in. Wigan Ceiling
22 1 qt. B-I-N
23 3/8 in. O.D. x 1/4 in. NPTF
24 2-1/2 in. Long x 5/8 in. Diameter Spring
25 1/4 x 3 in. Heat-Shrink
26 4-White PVC End
27 41000 Series Non-Vented Range
28 Revival 1-Spray 5-Katalyst Air
29 180-Degree White Outdoor
30 3/8 x 3 Hand Scraped
31 67-Qt. Jug
32 35-77-7/8 in. White
33 -16 tpi x 4 in. Stainless Steel
34 3-21 degree Full
I would like to extract all the measurement data from col1 and add it to col2.
I tried the following:
df$col2 <- str_extract(df$col1, '([[:digit:]]*\\/?\\.?\\-?[[:digit:]]+[[:space:]]+(in|ft)\\.[[:space:]]*x*)')
And the results are as follows:
col1 col2
1 1/2 in. x 1/2 in. x 3/4 ft. Copper Pressure Cup 1/2 in. x
2 Ensemble 60 in. x 43-1/2 in. x 54-1/4 in. 3-piece 60 in. x
3 2-3/4 in. x 4-1/2 in. Heavy-Duty 3/4 in. x
4 1/4-20 x 2 in. Forged Steel 2 in.
5 1/2-Amp Slo-Blo GMA Fuse <NA>
6 3/4 in. x 12 in. x 24 in. White Thermally 3/4 in. x
7 12.0 oz. of weight <NA>
8 1.4 fl. oz. of liquid <NA>
9 14 gal. tall <NA>
10 Sahara Wood 47 in. Long x 12-1/8 in. Deep x 1-11/16 in. Height 47 in.
11 1/25 HP Cast Iron <NA>
12 1/2 in., 3/4 in. and 1 in. PVC 1/2 in.
13 24-3/4 in. x 48-3/4 in. x 1-1/4 in. Faux Windsor 3/4 in. x
14 8 oz. -200 Pot Of Cream <NA>
15 5/8 in. dia. x 25 ft. Water 5/8 in.
16 18.5 / 30.5 in. Brushed Nickel 30.5 in.
17 57-1/2 in. x 70-5/16 in. Semi-Framed 1/2 in. x
18 2-1/4 HP Router <NA>
19 12-Volt Lithium-Ion Cordless 3/8 in. 3/8 in.
20 12-Gauge 24-5/8 in. Strap 5/8 in.
21 7-3/4 in. Wigan Ceiling 3/4 in.
22 1 qt. B-I-N <NA>
23 3/8 in. O.D. x 1/4 in. NPTF 3/8 in.
24 2-1/2 in. Long x 5/8 in. Diameter Spring 1/2 in.
25 1/4 x 3 in. Heat-Shrink 3 in.
26 4-White PVC End <NA>
27 41000 Series Non-Vented Range <NA>
28 Revival 1-Spray 5-Katalyst Air <NA>
29 180-Degree White Outdoor <NA>
30 3/8 x 3 Hand Scraped <NA>
31 67-Qt. Jug <NA>
32 35-77-7/8 in. White 7/8 in.
33 -16 tpi x 4 in. Stainless Steel 4 in.
34 3-21 degree Full <NA>
However, I would want to see results like:
df
col1 col2
1 1/2 in. x 1/2 in. x 3/4 ft. Copper Pressure Cup 1/2 in. x 1/2 in. x 3/4 ft.
2 Ensemble 60 in. x 43-1/2 in. x 54-1/4 in. 3-piece 60 in. x 43-1/2 in. x 54-1/4 in.
3 2-3/4 in. x 4-1/2 in. Heavy-Duty 2-3/4 in. x 4-1/2 in.
4 1/4-20 x 2 in. Forged Steel 1/4-20 x 2 in.
5 1/2-Amp Slo-Blo GMA Fuse 1/2-Amp
6 3/4 in. x 12 in. x 24 in. White Thermally 3/4 in. x 12 in. x 24 in.
7 12.0 oz. of weight 12.0 oz.
8 1.4 fl. oz. of liquid 1.4 fl. oz.
9 14 gal. tall 14 gal.
10 Sahara Wood 47 in. Long x 12-1/8 in. Deep x 1-11/16 in. Height 47 in. Long x 12-1/8 in. Deep x 1-11/16 in. Height
11 1/25 HP Cast Iron 1/25 HP
12 1/2 in., 3/4 in. and 1 in. PVC 1/2 in., 3/4 in. and 1 in.
13 24-3/4 in. x 48-3/4 in. x 1-1/4 in. Faux Windsor 24-3/4 in. x 48-3/4 in. x 1-1/4 in.
14 8 oz. -200 Pot Of Cream 8 oz.
15 5/8 in. dia. x 25 ft. Water 5/8 in. dia. x 25 ft.
16 18.5 / 30.5 in. Brushed Nickel 18.5 / 30.5 in.
17 57-1/2 in. x 70-5/16 in. Semi-Framed 57-1/2 in. x 70-5/16 in.
18 2-1/4 HP Router 2-1/4 HP
19 12-Volt Lithium-Ion Cordless 3/8 in. 12-Volt 3/8 in.
20 12-Gauge 24-5/8 in. Strap 12-Gauge 24-5/8 in.
21 7-3/4 in. Wigan Ceiling 7-3/4 in.
22 1 qt. B-I-N 1 qt.
23 3/8 in. O.D. x 1/4 in. NPTF 3/8 in. O.D. x 1/4 in.
24 2-1/2 in. Long x 5/8 in. Diameter Spring 2-1/2 in. Long x 5/8 in. Diameter
25 1/4 x 3 in. Heat-Shrink 1/4 x 3 in.
26 4-White PVC End
27 41000 Series Non-Vented Range
28 Revival 1-Spray 5-Katalyst Air 1-Spray 5-Katalyst
29 180-Degree White Outdoor 180-Degree
30 3/8 x 3 Hand Scraped 3/8 x 3
31 67-Qt. Jug 67-Qt.
32 35-77-7/8 in. White 35-77-7/8 in.
33 -16 tpi x 4 in. Stainless Steel -16 tpi x 4 in.
34 3-21 degree Full 3-21 degree
I am not sure how I should tweak my regex to handle all the cases. I have also tried splitting my regex into multiple statements, but that did not help much either.
Here's how I tried it:
df$col2 <- sapply(str_extract_all(df$col1, '([[:digit:]]*\\.?\\/?[[:digit:]]+[[:space:]]+(in|ft|cu\\.[[:space:]]+ft)\\.[[:space:]]*[WHD]*[[:space:]]+x*[[:space:]]*)+'), paste, collapse = ' ')
df$col2[df$col2 == ''] <- NA
df$col2[is.na(df$col2)] <- str_extract(df$col1[is.na(df$col2)], '[[:digit:]]*\\.?\\/?[[:digit:]]+[[:space:]]*\\-*(oz|lb|gal|Gal)\\.')
df$col2[is.na(df$col2)] <- str_extract(df$col1[is.na(df$col2)], '[[:digit:]]*\\.?\\,?[[:digit:]]+\\-?[[:space:]]*(Watt|Pack|Gauge|piece|Piece|Panel|mph|MPH|cc|Ton|ton|Light|Gang|LED|Volt|amp|BTU|Amp|Drawer|Step|Tier|Cycle)[[:space:]]+')
df$col2[is.na(df$col2)] <- str_extract(df$col1[is.na(df$col2)], '[[:digit:]]*\\.?\\,?[[:digit:]]+[[:space:]]+sq\\.[[:space:]]+ft\\.')
Yet I am not getting the results I want.
Do you have any inputs?
Thanks!
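One way to attack this is to define a "measurement" as a number (integer, decimal, or mixed fraction like 2-3/4) followed by a unit, and then allow chains of such tokens joined by "x". Here is a minimal sketch of that idea, shown in Python for brevity; the same pattern string should work with stringr's str_extract. The unit list (in/ft only) is an assumption for illustration and would need to be extended to cover oz, gal, HP, etc.

```python
import re

# A number may be an integer, a decimal, or a mixed fraction (2-3/4).
num = r'\d+(?:[-/.]\d+)*'
# Unit list is deliberately tiny here; extend it for the real data.
unit = r'(?:in|ft)\.'
# One measurement token, optionally chained with " x " separators.
pattern = rf'{num}\s*{unit}(?:\s+x\s+{num}\s*{unit})*'

print(re.search(pattern, '2-3/4 in. x 4-1/2 in. Heavy-Duty').group())
# 2-3/4 in. x 4-1/2 in.
```

Extending the same structure with alternations for ", " and " and " separators would cover rows like '1/2 in., 3/4 in. and 1 in. PVC'.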


POA "weird" outcome (IMHO)

I have gathered satellite data (every 5 minutes, from "Solcast") for GHI, DNI and DHI and I use pvlib to get the POA value.
The pvlib function I use:
def get_irradiance(site_location, date, tilt, surface_azimuth, ghi, dni, dhi):
    times = pd.date_range(date, freq='5min', periods=12*24, tz=site_location.tz)
    solar_position = site_location.get_solarposition(times=times)
    POA_irradiance = irradiance.get_total_irradiance(
        surface_tilt=tilt,
        surface_azimuth=surface_azimuth,
        ghi=ghi,
        dni=dni,
        dhi=dhi,
        solar_zenith=solar_position['apparent_zenith'],
        solar_azimuth=solar_position['azimuth'])
    return pd.DataFrame({'GHI': ghi,
                         'DNI': dni,
                         'DHI': dhi,
                         'POA': POA_irradiance['poa_global']})
When I compare GHI and POA values for 12 June 2022 and 13 June 2022, I see the POA value for 12 June is significantly below the GHI. The location is in the Netherlands; I use a tilt of 12.5 degrees and an azimuth of 180 degrees. Here is the outcome (per hour, from 6:00 - 20:00):
12 June 2022
GHI DNI DHI POA
6 86.750000 312.750000 40.500000 40.277034
7 224.583333 543.000000 69.750000 71.130218
8 366.833333 598.833333 113.833333 178.974322
9 406.083333 182.000000 304.000000 348.272844
10 532.166667 266.750000 346.666667 445.422584
11 725.666667 640.416667 226.500000 509.360716
12 688.500000 329.416667 409.583333 561.630762
13 701.333333 299.750000 439.333333 570.415438
14 725.416667 391.666667 387.750000 532.529676
15 753.916667 629.166667 244.333333 407.665794
16 656.750000 599.750000 215.333333 293.832376
17 381.833333 36.416667 359.416667 356.317883
18 411.750000 569.166667 144.750000 144.254438
19 269.750000 495.916667 102.500000 102.084439
20 134.583333 426.416667 51.583333 51.370738
And
13 June 2022
GHI DNI DHI POA
6 5.666667 0.000000 5.666667 5.616296
7 113.500000 7.750000 111.416667 111.948831
8 259.500000 106.833333 208.416667 256.410392
9 509.166667 637.750000 150.583333 514.516389
10 599.333333 518.666667 240.583333 619.050821
11 745.250000 704.500000 195.583333 788.773772
12 757.250000 549.666667 292.000000 798.739403
13 742.000000 464.583333 335.000000 778.857394
14 818.250000 667.750000 243.000000 869.972769
15 800.750000 776.833333 166.916667 852.559043
16 699.000000 733.666667 167.166667 730.484502
17 582.666667 729.166667 131.916667 593.802853
18 449.166667 756.583333 83.500000 434.958210
19 290.083333 652.666667 68.666667 254.048655
20 139.833333 466.916667 48.333333 97.272684
What can be an explanation of the significantly low POA compared to the GHI values on 12 June?
I have this outcome with other days too: some days have a POA much closer to the GHI than other days. Maybe this is "normal behaviour" and I am not reckoning with weather influences, which may be important...
I use the POA to do a PR (Performance Ratio) calculation, but I do not get "trusted" results..
Hope someone can shine a light on these values.
Kind regards,
Oscar
The Netherlands.
I'm really sorry: although the weather is unpredictable in the Netherlands, I made a very big booboo by using dd-mm-yyyy format instead of mm-dd-yyyy, something I overlooked for a long time... (I had never used mm-dd-yyyy before, but that's a lame excuse...)
Really sorry, hope you did not think about it too long..
Thank you anyway for reacting!
I've good values now!
Oscar (shame..)
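For anyone hitting the same pitfall, the ambiguity is easy to reproduce: pandas (which backs pd.date_range) parses ambiguous date strings month-first by default, so a dd-mm-yyyy string silently becomes a different day.

```python
import pandas as pd

# Ambiguous dates are parsed month-first by default:
# "12-06-2022" is December 6th, not June 12th.
print(pd.to_datetime('12-06-2022'))                  # 2022-12-06
print(pd.to_datetime('12-06-2022', dayfirst=True))   # 2022-06-12
```

Passing an unambiguous ISO string ('2022-06-12') to pd.date_range avoids the problem entirely.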

KDB moving percentile using Swin function

I am trying to create a list of the 99th and 1st percentiles. Rather than a single percentile for today. I wanted percentiles for 500 days each using the prior 500 days. The functions I was using for this are the following
swin:{[f;w;s] f each { 1_x,y }\[w#0;s]}
percentile:{[x;y] y (100 xrank y:asc y) bin x}
swin[percentile[99;];500;List]
The issue I come across is that the 99th percentile calculates perfectly, but the 1st percentile makes the entire list = 0. I am a bit lost as to why it would do that. Suggestions appreciated!
What's causing the zeros is two-fold:
What behaviour do you want for the earliest 500 days, when there aren't 500 days of history to work with? On day 1 there's only 1 datapoint, on day 2 only 2, etc. Only on the 500th day are there 500 days of actual data to work with. By default that swin function fills the gaps with some seed value.
You're using zero as that seed value, aka w#0
For example a 5 day lookback on each date looks something like:
q)swin[::;5;1 2 3 4 5]
0 0 0 0 1
0 0 0 1 2
0 0 1 2 3
0 1 2 3 4
1 2 3 4 5
You have zeros until you have data, so naturally the 1st percentile will pick up the zeros for the first roughly 500 dates.
So then you can decide to seed with a different value, or else possibly exclude zeros from your percentile function:
q)List:1000?1000
q)percentile:{[x;y] y (100 xrank y:asc y except 0) bin x}
q)swin[percentile[1;];500;List]
908 360 360 257 257 257 90 90 90 90 90 90 90 90...
If zeros are a legitimate value in your list and can't be excluded then maybe seed the swin with some other value that you know won't be in the list (negatives? infinity? null?) and then exclude that seed from the percentile function.
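The same seeding effect can be sketched in Python (purely an illustration, not kdb code): pct mimics the index-into-sorted-window behaviour of `y (100 xrank y) bin x`, and swin pads the lookback window with the seed value 0, so early low percentiles are stuck at the seed.

```python
# Index-style percentile: pick the element at rank p% of the sorted window.
def pct(p, ys):
    ys = sorted(ys)
    return ys[min(len(ys) - 1, len(ys) * p // 100)]

# Sliding window seeded with zeros, like w#0 in the q swin.
def swin(f, w, s):
    out, win = [], [0] * w
    for v in s:
        win = win[1:] + [v]
        out.append(f(win))
    return out

data = [908, 360, 257, 90, 500]
print(swin(lambda win: pct(1, win), 3, data))   # [0, 0, 257, 90, 90]
```

The leading zeros come from the seed, not the data, which is exactly what poisons the 1st percentile for the first ~500 dates.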
EDIT: A final alternative is to use a different sliding window function which doesn't fill gaps with a seed value, e.g.
q)swin2:{[f;w;s] f each(),/:{neg[x]sublist y,z}[w]\[s]}
q)swin2[::;5;1 2 3 4 5]
,1
1 2
1 2 3
1 2 3 4
1 2 3 4 5
q)percentile:{[x;y] y (100 xrank y:asc y) bin x}
q)swin2[percentile[99;];500;List]
908 908 908 908 908 908 908 908 908 908 908 959 959..
q)swin2[percentile[1;];500;List]
908 360 360 257 257 257 90 90 90 90 90 90 90 90 90..

REGEX all invoice item descriptions

I'm trying to regex all items from an invoice (name, unit price, total, VAT, etc.). I managed to get all the information regarding the digits, but the biggest problem is extracting the item descriptions, as sometimes they are split across two lines. This is what I need to regex:
1 Agrafe metalice Eco, rotunjite, 33 mm, 50 buc/cutie buc. 30.00 0,76 22,80 4,33
(SOBO604)
2 Banda corectoare DONAU Mouse, 5 mm x 8 m, orizontala, buc. 5.00 4,83 24,15 4,59
blister (7635001PL-99)
3 Biblioraft plastifiat OFFICE Products, 5 cm, colturi buc. 75.00 5,08 381,00 72,39
metalice, albastru (21011121-01)
4 Burete magnetic DONAU, 110 x 57 x 25 mm, galben buc. 10.00 5,53 55,30 10,51
(7638001PL-99)
5 Calculator de birou Canon WS-1610T, solar, 16 cifre, buc. 1.00 71,11 71,11 13,51
afisaz inclinat, format mare (WS1610T)
6 Capse zincate OFFICE Products 24/6, 1000 buc/cutie buc. 5.00 1,12 5,60 1,06
(18072419-19)
7 Creion grafic Eco, ascutit, cu radiera, corp verde buc. 20.00 0,40 8,00 1,52
(SOIS432)
8 Creion mecanic BIC Matic, 0.7 mm (601021) buc. 4.00 1,88 7,52 1,43
9 Dosar din plastic cu sina si doua perforatii OFFICE buc. 250.00 0,35 87,50 16,63
Products, albastru (21104211-01)
10 Dosar din plastic cu sina si doua perforatii OFFICE buc. 100.00 0,35 35,00 6,65
Products, roz (21104211-13)
pagina 1 / 3
797638
11 Folie protectie OFFICE Products, A4, coaja portocala, 40 buc. 5.00 6,53 32,65 6,20
microni, 100 file/set (21141215-90)
12 Folie protectie OFFICE Products, A4, coaja portocala, 40 buc. 20.00 6,51 130,20 24,74
microni, 100 file/set (21141215-90)
13 Marker whiteboard Eco, varf rotund, albastru (SOIS535A) buc. 104.00 1,33 138,32 26,28
14 Marker whiteboard Eco, varf rotund, negru (SOIS535N) buc. 2.00 1,33 2,66 0,51
15 Marker whiteboard Eco, varf rotund, rosu (SOIS535R) buc. 2.00 1,33 2,66 0,51
16 Notite adezive OFFICE Products, 51 x 76 mm, galben pal, buc. 5.00 1,65 8,25 1,57
100 file (14047511-06)
17 Organizator de birou DONAU Clasic VII, 6 compartimente, buc. 2.00 30,67 61,34 11,65
155 x 105 x 101 mm, transparent (7476001-99)
18 Panou din pluta Bi-Office, 60 x 90 cm, rama lemn buc. 1.00 32,96 32,96 6,26
(GMC070012010)
19 Pioneze color Eco, tinte pentru pluta , 40 buc/cutie buc. 1.00 2,16 2,16 0,41
(SOBO612)
20 Pix fara mecanism Eco, varf de 1 mm, albastru (SOIS405A) buc. 110.00 0,33 36,30 6,90
21 Plic C4 (229 x 324 mm), alb, siliconic, 10/set buc. 2.00 2,15 4,30 0,82
(15223619-14)
22 Tus pentru stampila Pelikan, cu picurator, 28 ml, negru buc. 1.00 6,93 6,93 1,32
(351197)
Notice that the item description sometimes continues after the total price. The problem is that the spacing between items isn't even; it's variable. For example, positions 8 and 9 are almost joined, while positions 20 and 21 have a lot of space between them.
Somebody helped me and got only the first line using
\d{1,2}(.*)(\d+\.\d+\s+)(\d+\,\d+\s{0,1}){3}
This is where I got stuck because of the uneven layout. It only gets the first line. For example:
'''
16 Notite adezive OFFICE Products, 51 x 76 mm, galben pal, buc. 5.00 1,65 8,25 1,57
100 file (14047511-06)
'''
it gets only 16 Notite adezive OFFICE Products, 51 x 76 mm, galben pal, buc. 5.00 1,65 8,25 1,57 but not 100 file (14047511-06). The complete item description is Notite adezive OFFICE Products, 51 x 76 mm, galben pal, 100 file (14047511-06). This is how the files come out when converted from PDF to text.
I will also need to extract the last part and merge it with the first to get the full item description.
Thank you
Try this regex:
\d{1,2}(.*)(\d+\.\d+\s+)(\d+\,\d+\s?){3}([\n ]+[^(\n]*\([^)]+\)(?=\n))?
Test on regex101
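A sketch of the same idea in Python, so the two description parts can be glued back together. The pattern here is an assumption based on the sample layout (item number is 1-2 digits, the unit column is always "buc.", and a continuation line never starts with a 1-2 digit number followed by a space); it is not a general invoice parser.

```python
import re

text = """16 Notite adezive OFFICE Products, 51 x 76 mm, galben pal, buc. 5.00 1,65 8,25 1,57
100 file (14047511-06)
17 Organizator de birou DONAU Clasic VII, 6 compartimente, buc. 2.00 30,67 61,34 11,65
155 x 105 x 101 mm, transparent (7476001-99)"""

item = re.compile(
    r'^(\d{1,2})\s+(.*?)\s+buc\.\s+[\d.,\s]+$\n'   # item no., description, unit, numbers
    r'(?:^(?!\d{1,2}\s)(.*)$)?',                    # optional continuation line
    re.M)

for no, first, cont in item.findall(text):
    desc = first + (' ' + cont if cont else '')
    print(no, desc)
```

Running this joins each pair, e.g. item 16 becomes "Notite adezive OFFICE Products, 51 x 76 mm, galben pal, 100 file (14047511-06)". Stray lines like "pagina 1 / 3" would need to be filtered out first.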

Boxplots lose "box" nature when plotting weighted data

I have the following data in Stata:
input drug halflife hl_weight
3 2.95 0.0066
2 6.00 0.0004
5 13.60 0.0006
1 2.82 0.0331
4 8.80 0.0001
4 1.24 0.0075
2 6.25 0.1123
4 17.20 0.0002
5 14.50 0.0020
4 5.50 0.0016
5 13.30 0.0003
4 8.26 0.0201
4 16.50 0.0103
4 11.40 0.0016
4 5.90 0.0005
4 3.99 0.0100
4 2.80 0.0073
4 3.00 0.0133
4 3.17 0.0061
4 4.95 0.1404
end
I am trying to create boxplots of drug halflives using the command below:
graph box halflife [aweight=hl_weight], over(drug)
When I include the weight option, some of the resulting box plots consist of multiple dots instead of the typical interquartile range and median:
Why does this happen and how can I fix it?
Obviously, this happens because of the weighting. The weights give more emphasis to values that are well outside the interquartile range.
I do not think there is anything to fix here. You could try the nooutsides option of the graph box command to hide the dots, but I would not recommend it.
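For intuition, here is a rough sketch in Python (not what Stata does internally): analytic weights can be approximated by repeating each value in proportion to its weight. The drug-4 data above shows how one heavily weighted observation (halflife 4.95, weight 0.1404) drags the quantiles toward itself.

```python
import numpy as np

# Drug 4 rows from the input block above.
halflife = np.array([8.80, 1.24, 17.20, 5.50, 8.26, 16.50, 11.40,
                     5.90, 3.99, 2.80, 3.00, 3.17, 4.95])
weight = np.array([0.0001, 0.0075, 0.0002, 0.0016, 0.0201, 0.0103,
                   0.0016, 0.0005, 0.0100, 0.0073, 0.0133, 0.0061,
                   0.1404])

# Approximate the weighted distribution by integer repetition.
reps = np.rint(weight / weight.min()).astype(int)
expanded = np.repeat(halflife, reps)

print(np.median(halflife), np.median(expanded))   # 5.5 4.95
```

The unweighted median is 5.5, but the weighted one collapses onto 4.95, which is why the weighted boxes can degenerate into a few dots.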

Pandas quantile failing with NaN's present

I've encountered an interesting situation while calculating the inter-quartile range. Assuming we have a dataframe such as:
import pandas as pd
import numpy as np
index = pd.date_range('2014 01 01', periods=10, freq='D')
data = np.random.randint(0, 100, (10, 5))
data = pd.DataFrame(index=index, data=data)
data
Out[90]:
0 1 2 3 4
2014-01-01 33 31 82 3 26
2014-01-02 46 59 0 34 48
2014-01-03 71 2 56 67 54
2014-01-04 90 18 71 12 2
2014-01-05 71 53 5 56 65
2014-01-06 42 78 34 54 40
2014-01-07 80 5 76 12 90
2014-01-08 60 90 84 55 78
2014-01-09 33 11 66 90 8
2014-01-10 40 8 35 36 98
# test for q1 values (this works)
data.quantile(0.25)
Out[111]:
0 40.50
1 8.75
2 34.25
3 17.50
4 29.50
# break it by inserting row of nans
data.iloc[-1] = np.nan
data.quantile(0.25)
Out[115]:
0 42
1 11
2 34
3 12
4 26
The first quartile can be calculated by taking the median of values in the dataframe that fall below the overall median, so we can see what data.quantile(0.25) should have yielded. e.g.
med = data.median()
q1 = data[data<med].median()
q1
Out[119]:
0 37.5
1 8.0
2 19.5
3 12.0
4 17.0
It seems that quantile is failing to provide an appropriate representation of q1 etc. since it is not doing a good job of handling the NaN values (i.e. it works without NaNs, but not with NaNs).
I thought this may not be a "NaN" issue, rather it might be quantile failing to handle even-numbered data sets (i.e. where the median must be calculated as the mean of the two central numbers). However, after testing with dataframes with both even and odd-numbers of rows I saw that quantile handled these situations properly. The problem seems to arise only when NaN values are present in the dataframe.
I would like to use quantile to calculate the rolling q1/q3 values in my dataframe; however, this will not work with NaNs present. Can anyone provide a solution to this issue?
Internally, quantile uses numpy.percentile over the non-null values. When you change the last row of data to NaNs, you're essentially left with the array array([ 33., 46., 71., 90., 71., 42., 80., 60., 33.]) in the first column.
Calculating np.percentile(np.array([ 33., 46., 71., 90., 71., 42., 80., 60., 33.]), 25) gives 42.
From the docstring:
Given a vector V of length N, the qth percentile of V is the qth ranked
value in a sorted copy of V. A weighted average of the two nearest
neighbors is used if the normalized ranking does not match q exactly.
The same as the median if q=50, the same as the minimum if q=0
and the same as the maximum if q=100.
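The answer's claim for the first column is easy to check directly: with the NaN row dropped, nine values remain, and the 25th percentile interpolates over the sorted nine rather than taking the "median of the lower half".

```python
import numpy as np

# First column after the last row is set to NaN (NaN already dropped).
col0 = np.array([33., 46., 71., 90., 71., 42., 80., 60., 33.])

# Rank position 0.25 * (9 - 1) = 2.0 lands exactly on sorted element 2.
print(np.percentile(col0, 25))   # 42.0
```

That is why quantile returns 42 where the median-of-lower-half approach gives 37.5: the two definitions of q1 differ, and dropping the NaNs changes the sample size they operate on.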