I haven't been able to find an answer to this anywhere, so here it goes:
I have a pandas dataframe df like this:
   X  Y name
0  3  1  foo
1  5  2   fa
2  1  3  hoo
3  2  4   ha
I can easily filter df with conditions (for example df[df['X'] >= 3]), which is great. However, I want a more generic solution, where I can write a long condition as a string (e.g. "(X >= 3) & (name == 'foo')") that gets turned into an actual condition usable on a pandas dataframe.
Can anyone suggest a smart solution (if something like this is possible) or redirect me to a similar discussion on the discussion board, where this topic has been debated?
It seems you need query for filtering:
df = df.query("X >= 3 & name == 'foo'")
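As a minimal sketch of how such a condition string might be assembled at run time and handed to query, assuming the sample frame from the question (the conditions list and the threshold variable are made-up names for illustration):

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame({
    "X": [3, 5, 1, 2],
    "Y": [1, 2, 3, 4],
    "name": ["foo", "fa", "hoo", "ha"],
})

# Hypothetical pieces of a condition, e.g. collected from user input.
conditions = ["X >= 3", "name == 'foo'"]

# Wrap each piece in parentheses and join with "&" to form one expression.
expr = " & ".join(f"({c})" for c in conditions)

# query evaluates the string against the column names of df.
result = df.query(expr)

# Local Python variables can be referenced inside the string with "@".
threshold = 3
same_result = df.query("X >= @threshold and name == 'foo'")
```

Note that string values inside the expression need their own quotes ('foo'), otherwise query looks for a column or variable called foo.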
I have a pyspark data frame and I'd like to have a conditional replacement of a string across multiple columns, not just one.
To be more concrete: I'd like to replace the string 'HIGH' with 1, and everything else in the column with 0. [Or at least replace every 'HIGH' with 1.] In pandas I would do:
df[df == 'HIGH'] = 1
Is there a way to do something similar? Or can I do a loop?
I'm new to pyspark so I don't know how to generate example code.
You can use the replace method for this:
>>> df.replace("HIGH", "1")
Keep in mind that the value and its replacement must have the same data type, so attempting to replace the string "HIGH" with the integer 1 will throw an exception.
Edit: You could also use regexp_replace to address both parts of your question, but you'd need to apply it to all columns:
>>> from pyspark.sql.functions import regexp_replace
>>> df = df.withColumn("col1", regexp_replace("col1", "^(?!HIGH).*$", "0"))
>>> df = df.withColumn("col1", regexp_replace("col1", "^HIGH$", "1"))
I got stuck with a specific question in R around concatenating columns of a data frame by using a wildcard. Perhaps I am searching wrongly. However I could not find a matching answer yet.
Here is my question:
I have a data frame df where each column represents a user (U1, U2, U3), e.g.:
> df <- data.frame(U1=1:3, U2=4:6, U3=7:9)
> df
  U1 U2 U3
1  1  4  7
2  2  5  8
3  3  6  9
I would like to concatenate the values from all users into a single vector as one would do using the c() function, e.g.:
> c(df$U1, df$U2, df$U3)
[1] 1 2 3 4 5 6 7 8 9
However, my number of users is large and varies over time. So, I look for an elegant dynamic way of concatenating the columns such as
> c(df$U*)
Unfortunately this does not seem to work. I played around with grep and regular expressions but could not get it to work. For sure, I could use a for-loop and program my own cat function but I assume there is a better way. I just don't find it. Maybe I am just blind. Hope you can help.
sub_df <- df[, grep(pattern = '^U', names(df)), drop = FALSE]
stack(sub_df)$values
Hope this works for you. The first line subsets the columns you need; the second stacks them into a single vector.
Coerce the data frame to a matrix first:
as.vector(as.matrix(df))
Use the bracket [ to select columns whose names match a certain expression:
df[, grep("U.*", colnames(df)), drop = FALSE]
I was trying to run a loop over a variable and was unsure how to code up my thoughts. I have a variable called newid that looks like this:
newid
1
1
2
2
3
3
and so on.
foreach x in newid2 {
replace switchers = 1 if doc[_n] != doc[_n+1]
}
I want to modify this code so that it runs for each pair of identical values (in this case for 1 and 1, then for 2 and 2). What would be the best way to do this?
Something like this can be done with levelsof:
clear
input id str1 doc
1 "A"
1 "B"
2 "A"
3 "C"
3 "A"
end
gen switcher1 = 0
levelsof id
foreach i in `r(levels)' {
quietly tab doc if id==`i'
replace switcher1 = 1 if r(r)>1 & id==`i'
}
However, there are certainly more efficient ways to accomplish your goal. Here's one example that tags ids that switch doctors:
ssc install egenmore
bysort id: egen num_docs = nvals(doc)
generate switcher2 = cond(num_docs>1,1,0)
The underlying idea is the same. You count the number of distinct values of doc for each id. If that number exceeds one, the id is tagged as a switcher. The second version is arguably more efficient since it does not involve looping over each value of id.
I have a dataset of actions recorded over time, with an attribute 'Hour' (values from 0 to 23). Now I want to create another attribute, say 'PartOfDay', which groups the 24 hours into 4 parts. For rows with an 'Hour' value of 0 to 5, 'PartOfDay' should be 1; if 'Hour' is in [6,11], 'PartOfDay' should be 2; and so on. How can I do this?
The following code does it:
train['PartOfDay']=1
train.loc[(train.Hour>=6) & (train.Hour<=11),'PartOfDay']=2
train.loc[(train.Hour>=12) & (train.Hour<=17),'PartOfDay']=3
train.loc[(train.Hour>=18) & (train.Hour<=23),'PartOfDay']=4
but it does not look very elegant. I would like to know a cleaner way, if possible.
Thank you for all your support!
While it is not clear what train.loc represents, a general approach to your problem is to use integer division to set the RHS (assuming train.Hour is a pandas Series, // divides element-wise and rounds down):
1 + train.Hour // 6
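A minimal sketch of this bucketing on a small made-up frame (the sample Hour values and the PartOfDayCut column name are assumptions for illustration). It uses floor division for the equal-width case and, as an alternative, pd.cut, which also copes with bins of unequal width:

```python
import pandas as pd

# Made-up sample covering both edges of each 6-hour block.
train = pd.DataFrame({"Hour": [0, 5, 6, 11, 12, 17, 18, 23]})

# Floor division maps 0-5 -> 0, 6-11 -> 1, 12-17 -> 2, 18-23 -> 3;
# adding 1 gives the parts 1..4.
train["PartOfDay"] = train["Hour"] // 6 + 1

# Alternative: explicit bin edges. With the default right=True the
# intervals are (-1, 5], (5, 11], (11, 17], (17, 23].
bins = [-1, 5, 11, 17, 23]
train["PartOfDayCut"] = pd.cut(
    train["Hour"], bins=bins, labels=[1, 2, 3, 4]
).astype(int)
```

The floor-division form is shorter, while the pd.cut form is easier to adjust if the parts of the day should not all be six hours long.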
I am a newbie to R and I have a problem splitting a very large data frame into a nested list. I looked for help on the internet, but without success.
I have a simplified example on how my data are organized:
The headers are:
1. "station" (number)
2. "date.str" (date string)
3. "member"
4. "forecast time"
5. "data"
I am not sure my data example will display correctly, but if so, it looks like this:
station  date.str  member  forecast.time  data1
6019     20110805  mbr000  06             77
6031     20110805  mbr000  06             28
6071     20110805  mbr000  06             45
6019     20110805  mbr001  12             22
6019     20110806  mbr024  18             66
I want to split the large data frame into a nested list after "station", "member", "date.str" and "forecast.time". So that mylist[[c(s,m,d,t)]] contains a data frame with data for station "s" and member "m" for date.str "d" and for forecast time "t" conserving the values of s, m, d and t.
My code is:
data.st <- list()
data.st.member <- list()
data.st.member.dato <- list()
data.st <- split(mydata, mydata$station)
data.st.member <- lapply(data.st, FUN = fsplit.member)
(I created a function to split after "member")
#Loop over station number:
for (s in 1:S){
#Loop over members:
for (m in 1:length(members)){
tmp <- split( data.st.member[[s]][[m]], data.st.member[[s]][[m]]$date.str )
#Loop over number of different "date.str"s
for (t in 1:length(no.date.str) ){
data.st.member.dato[[s]][[m]][[t]] <- tmp}
} #end m loop
} #end s loop
I would also like to split according to the forecast time: forec.time, but I didn't get that far.
I have tried a couple of different configurations within the loops, so I don't at the moment have a consistent error message. I can't figure out, what I am doing or thinking wrong.
Any help is much appreciated!
Regards
Sisse
It's easier than you think. You can pass a list into split in order to split on several factors.
Reproducible example
with(airquality, split(airquality, list(Month, Day)))
With your data
data.st <- with(mydata,
split(mydata, list(station, member, date.str, forecast.time))
)
Note: This doesn't give you a nested list like you asked for, but as Joran commented, you very probably don't want that. A flat list will be nicer to work with.
Speculating wildly: did you just want to calculate statistics on different chunks of data? If so, then see the many questions here on split-apply-combine problems.
I also want to echo the others in that this recursive data structure is going to be difficult to work with and probably there are better ways. Do look at the split-apply-combine approach as Richie suggested. However, the constraints may be external, so here is an answer using the plyr library.
library(plyr)
mylist <- dlply(mydata, .(station), dlply, .(member), dlply, .(date.str), dlply, .(forecast.time), identity)
Using the snippet of data you gave for mydata,
> mylist[[c("6019","mbr000","20110805","6")]]
station date.str member forecast.time data1
1 6019 20110805 mbr000 6 77