How can one create a single table from multiple datasets? - r-markdown

I'm trying to create a descriptive table by treatment group. For my analysis, I have 3 different partitions of the data (because I'm running 3 separate analyses) from a complete data set, but I only have one statistic from each subset that I am trying to describe, so I think it'd look better in one complete table. At the end, I'd like an output that can be converted to LaTeX (as I'm using bookdown).
I've been using the compareGroups package to easily create each table individually. I know that there is an rbind function that allows me to create a stacked table, but it won't let me combine them because the n of each separate data frame is different (due to missingness). For instance, I'm trying to study marriage in one of my analyses and, later, divorce (which is a separate analysis), so the n's of these two data frames differ, but the definition of treatment group is the same.
Ideally, I'd have two columns, one for the treatment group and one for the control group. There would be two rows: one with the age at first marriage, and a second with the length of that first marriage, along with the respective n's of the cells.
library(compareGroups)
library(magrittr)  # for the %>% pipe used below

d1 <- compareGroups(treat ~ time1mar,
                    data = nlsy.mar,
                    simplify = TRUE,
                    na.action = na.omit) %>%
  createTable(., type = 1, show.p.overall = FALSE)

d2 <- compareGroups(treat ~ time1div,
                    data = nlsy.div,
                    simplify = TRUE,
                    na.action = na.omit) %>%
  createTable(., type = 1, show.p.overall = FALSE)
d.tot <- rbind(`First Age at Marriage` = d1, `Length of First Marriage` = d2)
This is the error that I get:
Error in data.frame(..., check.names = FALSE) :
arguments imply differing number of rows: 6626, 5057
Any suggestions?

The problem might be that you're using na.omit, which deletes the cases/rows with NAs from both of your datasets; probably a different number of cases gets removed from each data set. Strictly speaking, though, differing numbers of rows should only be a problem with cbind. You might still try changing the na.action option.
I'm just guessing. As joshpk said, without sample data it is difficult to reproduce your problem.
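If changing na.action doesn't solve it, one workaround is to skip rbind entirely: compute the per-group summary for each outcome in its own subset, stack the rows yourself, and render the result with knitr::kable, which bookdown converts to LaTeX. The sketch below is only an illustration, not compareGroups code; it assumes the column names from your question, that treat is coded 0/1 with 0 = control, and summarise_by_treat is a made-up helper.
library(knitr)

# hypothetical helper: mean (SD) and n per treatment group for one outcome
summarise_by_treat <- function(data, outcome, label) {
  stats <- tapply(data[[outcome]], data$treat, function(v) {
    v <- v[!is.na(v)]
    sprintf("%.2f (%.2f), n = %d", mean(v), sd(v), length(v))
  })
  # assumes the control group sorts first (treat = 0) and the treated group second
  data.frame(Variable = label, Control = stats[[1]], Treatment = stats[[2]],
             stringsAsFactors = FALSE)
}

d.tot <- rbind(
  summarise_by_treat(nlsy.mar, "time1mar", "First Age at Marriage"),
  summarise_by_treat(nlsy.div, "time1div", "Length of First Marriage")
)

kable(d.tot, caption = "Descriptives by treatment group")
Because each row is built from its own data frame, the differing n's no longer matter; they simply appear in the cell text.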

Related

Power BI: Power Query slowly loading multiple Excel worksheets

I have a small (1.5 MB) Excel file that contains multiple worksheets. I need to transform each worksheet (two significant transformations are created as separate functions) and then expand the results dynamically (i.e. each worksheet can have a different number of columns, so I needed to extract the list of all distinct column names beforehand).
It's all working fine and the output meets my expectations; however, I noticed that when refreshing, Power Query loads over 10 MB of data. After it is done, the Load window resets and more than 10 MB of data is loaded again.
Here's the M code that I am using. I have tested each section, and it seems like the Expanded step might be the slowest one.
let
    Source = Excel.Workbook(File.Contents("xxx.xlsx"), null, true),
    Split = Table.SplitColumn(Source, "Name", Splitter.SplitTextByDelimiter(" ", QuoteStyle.Csv), {"Name.1", "Name.2", "Name.3"}),
    TidyUp = Table.RenameColumns(
        Table.RemoveColumns(Split, {"Item", "Kind", "Hidden", "Name.3"}),
        {{"Name.1", "PORTFOLIO"}, {"Name.2", "DATE"}}),
    GetCurrency = Table.AddColumn(TidyUp, "CURRENCY", each GetCurrencyFromSpreadsheet([Data])),
    GetStresses = Table.AddColumn(GetCurrency, "fx", each GetStresses([Data])),
    ColNames = Table.ColumnNames(Table.Combine(GetStresses[fx])),
    Expanded = Table.ExpandTableColumn(GetStresses, "fx", ColNames),
    RemoveData = Table.RemoveColumns(Expanded, {"Data"})
in
    RemoveData
As a result, it takes about 5 minutes to process a single small Excel file. As we expect to receive more similar files in the future, I would like to ask whether you have any ideas about what I can do to improve the code. Thanks.
I would rebuild this using the Power Query Editor UI. That should lead to cleaner code with fewer redundant steps, such as your use of Table.Combine on a single input table.
The Table.AddColumn steps would probably be rebuilt as separate queries that are combined using Merge Queries. Set-based logic like that will usually outperform row-by-row function calls.

spotfire plot list of elements

I have a data table in which each object has one row of times and one row of temperatures, both stored as bracketed lists in a Values column, and I want to plot temperature against time. Any idea how to do that?
This can be done in a TERR data function. I don't know how comfortable you are with integrating Spotfire and TERR; there is an intro video here, for instance (the demo starts at about minute 7):
https://www.youtube.com/watch?v=ZtVltmmKWQs
With that in mind, I wrote the script without loading any library, so it is quite verbose and explicit, but hopefully simpler to follow step by step. I am sure there is a more elegant way, and there are better ways of making it flexible with column names, but this is a start.
Your input will be a data table (dt, the original data) and the output a new data table (dt.out, the transformed data). All column names (and some values) are addressed explicitly in the script (so if you change them it won't work).
#remove the []
dt$Values=gsub('\\[|\\]','',dt$Values)
#separate into two different data frames, one for time and one for temperature
dt.time=dt[dt$Description=='time',]
dt.temperature=dt[dt$Description=='temperature',]
#split the columns we want to separate into a list of vectors
dt2.time=strsplit(as.character(dt.time$Values),',')
dt2.temperature=strsplit(as.character(dt.temperature$Values),',')
#rearrange times
names(dt2.time)=dt.time$object
dt2.time=stack(dt2.time) #stack vectors
dt2.time$id=c(1:nrow(dt2.time)) #assign running id for merging later
colnames(dt2.time)[colnames(dt2.time)=='values']='time'
#rearrange temperatures
names(dt2.temperature)=dt.temperature$object
dt2.temperature=stack(dt2.temperature) #stack vectors
dt2.temperature$id=c(1:nrow(dt2.temperature)) #assign running id for merging later
colnames(dt2.temperature)[colnames(dt2.temperature)=='values']='temperature'
#merge time and temperature
dt.out=merge(dt2.time,dt2.temperature,by=c('id','ind'))
colnames(dt.out)[colnames(dt.out)=='ind']='object'
dt.out$time=as.numeric(dt.out$time)
dt.out$temperature=as.numeric(dt.out$temperature)
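If you want to test the script outside Spotfire first, a hypothetical input table matching the layout the script assumes (columns object, Description and Values, with the lists stored as bracketed strings) could be built like this; the sensor names and numbers are invented:
# made-up example input, only for trying the script interactively
dt <- data.frame(
  object      = c('sensor1', 'sensor1', 'sensor2', 'sensor2'),
  Description = c('time', 'temperature', 'time', 'temperature'),
  Values      = c('[0,1,2,3]', '[20,21,22,23]', '[0,1,2,3]', '[18,19,20,21]'),
  stringsAsFactors = FALSE
)
# after running the script above, dt.out has one row per reading, with numeric
# time and temperature columns ready for a Spotfire line chart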
Because all of the example rows you've shown here contain exactly four list items and you haven't specified otherwise, I'll assume that all of the data fits this format.
With this assumption, it becomes pretty trivial, albeit a little messy, to split the values out into columns using the RXReplace() expression function.
You can create four calculated columns, each with an expression like:
Int(RXReplace([values],"\\[([\\d\\-]+),([\\d\\-]+),([\\d\\-]+),([\\d\\-]+)]","\\1",""))
The third argument "\\1" determines which number in the list to extract. Backslashes are doubled ("escaped") per the requirements of the RXReplace() function.
Note that this example assumes the numbers are all whole numbers. If you have decimals, you'd need to adjust each "phrase" of the regular expression to ([\\d\\-\\.]+), and you'd need to wrap the expression in Real() rather than Int() (if you leave this part out, the result will be a String type, which could cause confusion later on when working with the data).
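For example, a decimal-friendly version of the first column might look like this (an untested adaptation of the expression above, following the changes just described):
Real(RXReplace([values],"\\[([\\d\\-\\.]+),([\\d\\-\\.]+),([\\d\\-\\.]+),([\\d\\-\\.]+)]","\\1",""))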
Once you have the four columns, you'll be able to unpivot to get the data easily.

Creating an ID based on factor and filling down with Stata

Consider the fictional data below, which illustrate my problem; the real data contain thousands of rows.
Figure 1
Each individual is characterized by values attached to A, B, C, D, E. In Figure 1, I show 3 individuals for which some characteristics are missing. Do you have any idea how I can get the following completed table (Figure 2)?
Figure 2
With an ID like the one in Figure 1 I could have used the carryforward command to fill in the values. But since each individual has a different number of rows, I don't know how to create the ID.
Edit: All individuals share the characteristic "A".
Edit: the existing order of observations is informative.
To detect the change of individual, the idea is to check, in each row, whether the preceding value of char is >= the current one.
This works only if your data are ordered, but that seems to be guaranteed in your data.
gen id = 1 if (char[_n-1] >= char[_n]) | _n == 1  // flag the first row of each individual
replace id = sum(id) if id == 1                   // running sum turns the flags into 1, 2, 3, ...
replace id = id[_n-1] if missing(id)              // carry the id down within each individual
fillin id char                                    // add the missing id/char combinations
drop _fillin
If one individual has only the characteristics A and C and another individual has only the characteristics D and E, this won't work, but that case seems impossible to detect with your data.

kettle sample rows for each type

I have a set of rows, let's say "rowId", "type", "value". On output I need a set of 10 sample rows for each "type". How can I do it? "type" has approx. 100 different, changing values, so a Switch step is not a good option.
Well, I've figured out a workaround for this situation. I split the transformation into parts. The first part collects all data into a temp table, finds the unique types, and copies them to the result.
The second part runs for every input row (where we have the types) and collects the data of the given type from the temp table. Then you need no grouping to do the stratified sample.

How to map a function to a triple nested list and keep the triple nested list intact?

I have been building an analysis workflow for my PhD and have been using a triple nested list to represent my data structure, because I want it to be able to expand to an arbitrary amount of data at its second and third levels. The first level is the whole dataset, the second level is each subject in the dataset, and the third level is a row for each measure for each subject.
[dataset]
|
[subject]
|
[measure1, measure2, measure3]
I am trying to map a function to each measure - for instance, convert all the points into floats or replace anomalous values with None - and wish to return the whole dataset with its nesting intact, but with my current code:
for subject in dataset:
    for measure in subject:
        map(float, measure)
...the result is correct and exactly what I want, but the problem is that I can't think how to assign the result back to the dataset efficiently, or without losing a level of the nesting. Ideally, I would like it to change the measure in place, but I can't think how to do it.
Could you suggest an efficient and Pythonic way of doing that? Is a triple nested list a silly way to organize my data in the program?
Rather than doing it in place, make a new list
dataset = [[[float(value) for value in measure]
            for measure in subject]
           for subject in dataset]
return [[list(map(float, measure)) for measure in subject] for subject in dataset]
You can return a new list instead of altering it in place -- this is still remarkably efficient and preserves all the information you want. Note that in Python 3 the map call has to be wrapped in list(), otherwise you get lazy map objects instead of lists. (Aside: in fact, it's often faster than assigning to list indexes [citation needed], which is what others have suggested here!)
A straightforward way to do that in place would be:
for subject in dataset:
    for measure in subject:
        for i, elem in enumerate(measure):
            measure[i] = float(elem)
Alternatively, use the slice operator to update the list in place with the results of map:
for subject in dataset:
    for measure in subject:
        measure[:] = map(float, measure)
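A tiny demo with made-up data, just to show that the slice assignment keeps the nesting (and any other references to the inner lists) intact:
dataset = [[["1", "2"], ["3.5", "4"]], [["5", "6"]]]
for subject in dataset:
    for measure in subject:
        measure[:] = map(float, measure)  # slice assignment mutates each inner list in place
print(dataset)  # [[[1.0, 2.0], [3.5, 4.0]], [[5.0, 6.0]]]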
This should do the job
for subject in dataset:
    for measure in subject:
        for i, m in enumerate(measure):
            measure[i] = float(m)