Pandas and reg ex, decompoising text and numbers into several columns with headings - regex

I have a dataframe with a column containing:
1 Tile 1 up Red 2146 (75) Green 1671 (75)
The numbers 1 can be upto 10
up can be also be down
The 2146 and 1671 can be any digit upto 9999
Whats the best way to break out each of these into separate columns without using split. I was looking at regex but not sure how to handle this (especially the white spaces). I liked the idea of putting the new column names in too and started with
Pixel.str.extract(r'(?P<num1>\d)(?P<text>[Tile])(?P<Tile>\d)')
Thanks for any help

To avoid an overly complicated regex pattern, perhaps you can use str.extractall to get all numbers, and then concat to your current df. For up or down, use str.findall:
df = pd.DataFrame({"title":["1 Tile 1 up Red 2146 (75) Green 1671 (75)",
"10 Tile 10 down Red 9999 (75) Green 9999 (75)"]})
df = pd.concat([df, df["title"].str.extractall(r'(\d+)').unstack().loc[:,0]], axis=1)
df["direction"] = df["title"].str.findall(r"\bup\b|\bdown\b").str[0]
print (df)
#
title 0 1 2 3 4 5 direction
0 1 Tile 1 up Red 2146 (75) Green 1671 (75) 1 1 2146 75 1671 75 up
1 10 Tile 10 down Red 9999 (75) Green 9999 (75) 10 10 9999 75 9999 75 down

Related

KDB moving percentile using Swin function

I am trying to create a list of the 99th and 1st percentiles. Rather than a single percentile for today. I wanted percentiles for 500 days each using the prior 500 days. The functions I was using for this are the following
swin:{[f;w;s] f each { 1_x,y }\[w#0;s]}
percentile:{[x;y] y (100 xrank y:asc y) bin x}
swin[percentile[99;];500;List].
The issue I come across is that the 99th percentile calculates perfectly, but the 1st percentile makes the entire list = 0. a bit lost as to why it would do that. suggestions appreciated!
What's causing the zeros is two-fold:
What behaviour do you want for the earliest 500 days when there isn't 500 days of history to work with? On day 1 there's only 1 datapoint, on day 2 only 2 etc. Only on the 500th day is there 500 days of actual data to work with. By default that swin function fills the gaps with some seed value
You're using zero as that seed value, aka w#0
For example a 5 day lookback on each date looks something like:
q)swin[::;5;1 2 3 4 5]
0 0 0 0 1
0 0 0 1 2
0 0 1 2 3
0 1 2 3 4
1 2 3 4 5
You have zeros until you have data, so naturally the 1st percentile will pick up the zeros for the first roughly 500 dates.
So then you can decide to seed with a different value, or else possibly exclude zeros from your percentile function:
q)List:1000?1000
q)percentile:{[x;y] y (100 xrank y:asc y except 0) bin x}
q)swin[percentile[1;];500;List]
908 360 360 257 257 257 90 90 90 90 90 90 90 90...
If zeros are a legitimate value in your list and can't be excluded then maybe seed the swin with some other value that you know won't be in the list (negatives? infinity? null?) and then exclude that seed from the percentile function.
EDIT: A final alternative is to use a different sliding window function which doesn't fill gaps with a seed value, e.g.
q)swin2:{[f;w;s] f each(),/:{neg[x]sublist y,z}[w]\[s]}
q)swin2[::;5;1 2 3 4 5]
,1
1 2
1 2 3
1 2 3 4
1 2 3 4 5
q)percentile:{[x;y] y (100 xrank y:asc y) bin x}
q)swin2[percentile[99;];500;List]
908 908 908 908 908 908 908 908 908 908 908 959 959..
q)swin2[percentile[1;];500;List]
908 360 360 257 257 257 90 90 90 90 90 90 90 90 90..

Excel - Drop down list within a formula

I am sure this is a easy formula but 1 am struggling, I have the following:
On tab 1 I want to enter a colour multiple times into column A using a drop down option, for example and I want to pull the how many information from a table on another sheet, so when I do my formula using xlookup (=XLOOKUP(A2,Sheet2!A2:A7,Sheet2!B2:B7)) it works for the top 4 options but not the rest. Can someone help? I ahve also tried the IF formula etc but with no success.
A B
Colours How Many
Black 17
Yellow 765
Purple 65
Orange 43
Red #N/A
Green #N/A
Purple #N/A
Orange #N/A
Sheet 2 table:
Colours How Many
Red 34
Black 17
Green 32
Yellow 765
Purple 65
Orange 43
I hope this make sense.
Thanks in advance
Wayne
I figured it out
=VLOOKUP(A2,Sheet2!$A$1:$B$7, 2, FALSE)

replace multiple column values at the same time

I would like to replace multiple column values at the same time in a dataframe. I would like to change 2 to 1, 1 to 2.
data=data.frmae(store=c(122,323,254,435,654,342,234,344)
,cluster=c(2,2,2,1,1,3,3,3))
The problem in my code is after it changes 2 to 1 , it changes these 1's to 2.
Can I do it in dplyr or sth? Thank you
Desired data set below
store cluster
122 1
323 1
254 1
435 2
654 2
342 3
234 3
344 3

Testing a Condition Before Creating an Observation in SAS using '#' in the end of the input statement

I have read the online document and from it, I think that it only works with the column input method. How can this be used with list input method?
/This Works/
data new;
input height 25-26 #;
if height = 6 ;
input name $ 1-8 colour $ 9-13 place $ 16-24 ;
datalines;
Deepak Red Delhi 6
Aditi Yellow Delhi 5
Anup Blue Delhi 5
Era Green Varanasi 5
Avinash Black Noida 5
Vivek Grey Agra 5
;
run;
/* But This Doesn't*/
data new;
input height #;
if height = 6;
input name $ colour $ place $ height;
datalines;
Deepak Red Delhi 6
Aditi Yellow Delhi 5
Anup Blue Delhi 5
Era Green Varanasi 5
Avinash Black Noida 5
Vivek Grey Agra 5
;
run;
LOG:
NOTE: Invalid data for height in line 79 1-6.
79 Deepak Red Delhi 6
height=. name= colour= place= _ERROR_=1 _N_=1
NOTE: Invalid data for height in line 80 1-5.
80 Aditi Yellow Delhi 5
height=. name= colour= place= _ERROR_=1 _N_=2
The fixed layout of the first data lines make it possible to input a field from a specific location.
The second layout is variable in layout, so it is harder to arbitrarily grab a specific field.
So, what is wrong? In the second DATA step the input will read from the start of the line, so it won't read a number from where a name is.
Don't worry about 'reducing processing' by reading only part of a line. Held input and conditional processing is more often used for processing data lines that have some sort of variant or conditional data items within the content.
For both of those formats I would read all of the variables and then add logic to filter based on values.
If you really need to check if the last "word" on the line matched some criteria before deciding HOW to read the line then you might want to try using the automatic _infile_ variable.
data new;
input # ;
if scan(_infile_,-1,' ') = '6';
input name $ colour $ place $ height;
datalines;
Deepak Red Delhi 6
Aditi Yellow Delhi 5
Anup Blue Delhi 5
Era Green Varanasi 5
Avinash Black Noida 5
Vivek Grey Agra 5
;

Subtract a vector to each row of a dataframe

My dataframe looks like this :
fruits = pd.DataFrame({'orange': [10, 20], 'apple': [30, 40], 'banana': [50, 60]})
apple banana orange
0 30 50 10
1 40 60 20
And I have this vector (its also a dataframe)
sold = pd.DataFrame({'orange': [1], 'apple': [2], 'banana': [3]})
apple banana orange
0 2 3 1
I want to subtract this vector to each row of the initial dataframe to obtain a dataframe which looks like this
apple banana orange
0 28.0 47.0 9.0
1 38.0 57.0 19.0
I tried :
print fruits.subtract(sold, axis = 0)
And the output is
apple banana orange
0 28.0 47.0 9.0
1 NaN NaN NaN
It worked only for the first line. I could create a dataframe filled with the vector for each row. Is there a more efficient way to subtract this vector ? I don't want to use a loop.
convert the df to a series using squeeze and pass axis=1:
In [6]:
fruits.sub(sold.squeeze(), axis=1)
Out[6]:
apple banana orange
0 28 47 9
1 38 57 19
The conversion is necessary as by design arithmetic operations between dfs will align on indices and columns, by passing a Series this allows you to subtract from each row in the df the row from the other df.
Try:
fruits.sub(sold.iloc[0, :])
What you tried before didn't work because sold is a dataframe and the subtraction will try to align both columns and index. sold.iloc[0, :] gets at the first row and is a series thus will work as you intended.