Naming rowMeans output columns in R by taking every other column name

I have calculated rowMeans over a numeric data frame containing 14 columns:
tbl3 <- as.data.frame(sapply(seq(1, ncol(tbl2), 2), function(j) rowMeans(tbl2[,j+(-1:1)])))
The output has 7 columns. Each new column is the average of a window of up to 3 adjacent columns (the first window has only 2, because index 0 is dropped), so the name of the central column of each window is the appropriate name for the new column. Right now the new columns are automatically named V1 to V7. However, I want to use the names of the existing columns for the new output.
Sample data is pasted below:
structure(list(Aug08 = c(111.15, 24.75, 41.2, 23726.05, 14.95,
279.15, 187.15, 13.6, 13.2, 34.2), Aug10 = c(108.35, 25.2, 40.95,
22750.3, 15.4, 278.8, 185.7, 12.9, 12.65, 35.15), Aug11 = c(105.65,
25.7, 40.85, 22726.8, 15.6, 280.1, 183.8, 12.45, 12.15, 34.55
), Aug12 = c(105.65, 25.5, 40.7, 22470.05, 15.9, 280.95, 185.25,
12.5, 12.35, 36.75), Aug16 = c(108.25, 26, 45.8, 22401.85, 15.1,
281.85, 184.35, 12.25, 12.7, 35.2), Aug17 = c(109.65, 26.5, 44.7,
22380.95, 15, 281.5, 183.5, 13.45, 13.05, 35.2), Aug18 = c(105,
27, 43.85, 23257.4, 15.9, 281.65, 191.15, 13.15, 12.95, 35),
Aug19 = c(102.25, 27.5, 45.7, 22856.45, 17.15, 281.5, 189.75,
12.85, 12.8, 35.3), Aug22 = c(107.4, 28.05, 44.15, 22361.5,
18.85, 276, 184.95, 12.55, 12.65, 34.95), Aug23 = c(108.75,
28.6, 44.55, 22374.35, 19.45, 277.3, 181.95, 12.9, 12.6,
35.2), Aug24 = c(115.1, 29.15, 45, 22369.6, 18.85, 282.45,
181.4, 12.7, 12.6, 37.75), Aug25 = c(112.05, 29.7, 44.65,
22855.75, 18.05, 283.3, 182.9, 12.6, 12.3, 37.4), Aug26 = c(106.6,
30.25, 44.45, 23086.5, 17.55, 285.1, 180.8, 12.65, 12.05,
37.95), Aug29 = c(108.25, 29.85, 43.2, 23221.7, 17.35, 287.65,
178.75, 12.45, 12.1, 39.6)), row.names = c("A", "B",
"C", "D", "E", "F", "G", "H",
"I", "J"), class = "data.frame")
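The windowing described in the question can be sketched in plain Python to show which existing names line up with the new columns. This is only an illustration, not the asker's method: the function name `windowed_row_means` is hypothetical, the data is a three-row, six-column excerpt of the sample, and the window is clipped at the left edge to mimic R dropping index 0.

```python
from statistics import mean

# Excerpt of the sample data: first six columns, rows A to C.
columns = ["Aug08", "Aug10", "Aug11", "Aug12", "Aug16", "Aug17"]
rows = {
    "A": [111.15, 108.35, 105.65, 105.65, 108.25, 109.65],
    "B": [24.75, 25.2, 25.7, 25.5, 26.0, 26.5],
    "C": [41.2, 40.95, 40.85, 40.7, 45.8, 44.7],
}

def windowed_row_means(columns, rows, step=2):
    """Average each row over columns centre-1..centre+1 for every `step`-th centre."""
    centres = range(0, len(columns), step)  # 0-based version of seq(1, ncol, 2)
    result = {}
    for row_name, values in rows.items():
        result[row_name] = {
            # Key each mean by the central column's name; clip the window at the edges.
            columns[c]: mean(values[max(c - 1, 0):min(c + 2, len(values))])
            for c in centres
        }
    return result

means = windowed_row_means(columns, rows)
new_names = list(means["A"])
print(new_names)
```

The new names are simply every second existing name starting from the first. In R terms this probably corresponds to assigning names(tbl2)[seq(1, ncol(tbl2), 2)] to names(tbl3), though that line is an assumption and not part of the original post.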

Combined Formattable with KableExtra

I am trying to create a table that combines features from formattable with kableExtra. I have found a number of examples which have helped, but they don't quite do everything I'm trying to achieve.
This is what I've tried so far:
library(kableExtra)
library(formattable)
library(dplyr)  # for mutate() and select()
df <- structure(list(Income_source = c("A", "B", "C", "C"), Jul = c(1777.01,
0.13, 9587.39, 11364.53), Aug = c(0, 0.09, 9908.78, 9908.87),
Sep = c(5374.6, 0.03, 9859.87, 15234.5)), row.names = c(NA,
-4L), class = c("tbl_df", "tbl", "data.frame"))
Example of the formattable function I'd like to apply. Note that color_tile is applied separately to each row:
formattable(df, lapply(1:nrow(df), function(row) {
  # numeric columns only; column 1 holds the Income_source labels
  area(row, col = 2:ncol(df)) ~ color_tile("transparent", "pink")
}))
The example I found which lets me combine formattable with kableExtra looks like this:
df %>%
  mutate(Jul = formattable::color_tile("transparent", "pink")(Jul),
         Aug = formattable::color_tile("transparent", "pink")(Aug),
         Sep = formattable::color_tile("transparent", "pink")(Sep)) %>%
  select(Income_source, everything()) %>%
  kable("html", escape = FALSE, format.args = list(big.mark = ",", scientific = FALSE)) %>%
  kable_classic(full_width = TRUE, html_font = "Cambria") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(0, bold = TRUE)
The problems with this solution are:
1: The color_tile function is applied to columns rather than rows
2: The numeric values drop the commas
The table I'm planning on generating would be updated monthly so that next month the data for October would be presented, followed by November and so forth. As such I'm hoping for a solution that doesn't require me to edit the script each time i.e. mutate the new data. Hopefully that makes sense.

Reference a column by a variable

I want to reference a table column by a variable while creating another column, but I can't get the syntax right:
t0 = Table.FromRecords({[a = 1, b = 2]}),
c0 = "a", c1 = "b",
t1 = Table.AddColumn(t0, "c", each([c0] + [c1]))
I get the error that the record's field 'c0' was not found. It treats c0 as a literal field name, but I want the text value contained in c0. How can I do it?
Edit
I used this, inspired by the accepted answer:
t0 = Table.FromRecords({[a = 1, b = 2]}),
c0 = "a", c1 = "b",
t1 = Table.AddColumn(t0, "c", each(Record.Field(_, c0) + Record.Field(_, c1)))
Another way:
let
    t0 = Table.FromRecords({[a = 1, b = 2]}),
    f = {"a","b"},
    t1 = Table.AddColumn(t0, "sum", each List.Sum(Record.ToList(Record.SelectFields(_, f))))
in
    t1
Try using an index, as below:
let
    t0 = Table.FromRecords({[a = 1, b = 2]}),
    #"Added Index" = Table.AddIndexColumn(t0, "Index", 0, 1),
    c0 = "a",
    c1 = "b",
    t1 = Table.AddColumn(#"Added Index", "c", each Table.Column(#"Added Index",c0){[Index]} + Table.Column(#"Added Index",c1){[Index]} )
in
    t1
Expression.Evaluate is another possibility:
= Table.AddColumn(t0, "c", each Expression.Evaluate("["&c0&"] + ["&c1&"]", [_=_]) )
Please refer to this article to understand the [_=_] context argument:
Expression.Evaluate() In Power Query/M
This article explains that argument specifically:
Inside a table, the underscore _ represents the current row, when working with line-by-line operations. The error can be fixed, by adding [_=_] to the environment of the Expression.Evaluate() function. This adds the current row of the table, in which this formula is evaluated, to the environment of the statement, which is evaluated inside the Expression.Evaluate() function.
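The difference between a literal field name and a name held in a variable can be modelled outside Power Query. The sketch below is a loose Python analogy, not M code: rows are plain dictionaries, and `add_column` is a hypothetical helper standing in for Table.AddColumn. Looking up `row[c0]` plays the role of Record.Field(_, c0).

```python
# Rows modelled as dictionaries; a hypothetical stand-in for a Power Query table.
t0 = [{"a": 1, "b": 2}]
c0, c1 = "a", "b"

def add_column(table, new_name, fn):
    """Return a new table with `new_name` computed from each row by `fn`."""
    return [{**row, new_name: fn(row)} for row in table]

# row[c0] looks up the field whose name is stored in c0,
# whereas row["c0"] would look for a field literally called "c0".
t1 = add_column(t0, "c", lambda row: row[c0] + row[c1])
print(t1)
```

The same distinction seems to explain the original error: [c0] in M names a field called c0, while Record.Field(_, c0) uses the text stored in the variable.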

AWS EMR Hive: Not yet supported place for UDAF 'COUNT'

I have a pretty complicated query I am trying to convert over to use with Hive.
Specifically, I am running it as a Hive "step" in an AWS EMR cluster.
I have tried to clean up the query a bit for the post and leave just the essence of it.
The full error message is:
FAILED: SemanticException [Error 10128]: Line XX:XX Not yet supported place for UDAF 'COUNT'
The line number is pointing to the COUNT at the bottom of the select statement:
INSERT INTO db.new_table (
new_column1,
new_column2,
new_column3,
... ,
new_column20
)
SELECT MD5(COALESCE(TBL1.col1," ")||"_"||COALESCE(new_column5," ")||"_"||...) AS
new_col1,
TBL1.col2,
TBL1.col3,
TBL1.col3 AS new_column3,
TBL1.col4,
CASE
WHEN TBL1.col5 = …
ELSE "some value"
END AS new_column5,
TBL1.col6,
TBL1.col7,
TBL1.col8,
CASE
WHEN TBL1.col9 = …
ELSE "some value"
END AS new_column9,
CASE
WHEN TBL1.col10 = …
ELSE "value"
END AS new_column10,
TBL1.col11,
"value" AS new_column12,
TBL2.col1,
TBL2.col2,
from_unixtime(…) AS new_column13,
CAST(…) AS new_column14,
CAST(…) AS new_column15,
CAST(…) AS new_column16,
COUNT(DISTINCT TBL1.col17) AS new_column17
FROM db.table1 TBL1
LEFT JOIN
db.table2 TBL2
ON TBL1.col311 = TBL2.col311
WHERE TBL1.col14 BETWEEN "low" AND "high"
AND TBL1.col44 = "Y"
AND TBL1.col55 = "N"
GROUP BY 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20;
If I have left out too much, please let me know.
Thanks for your help!
Updates
It turns out I did in fact leave out way too much info. Sorry to those who have already tried to help...
I made the updates above.
Removing the 20th GROUP BY column, e.g.:
GROUP BY 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19;
Produced: Expression not in GROUP BY key '' ''
LATEST
Removing the 20th GROUP BY column and adding the first one, e.g.:
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19;
Produced:
Line XX:XX Invalid table alias or column reference 'new_column5':(possible column
names are: TBL1.col1, TBL1.col2, (looks like all columns of TBL1),
TBL2.col1, TBL2.col2, TBL2.col311)
The line number refers to the line with the SELECT statement. Only those three columns from TBL2 are listed in the error output.
The error seems to point to COALESCE(new_column5). Note that I have a CASE statement within the TBL1 select which I alias AS new_column5.
You are referencing the calculated column new_column5 at the same subquery level where it is calculated. This is not possible in Hive. Replace the alias with the calculation itself, or calculate it in an inner subquery and reference it from the outer query.
Use this:
MD5(COALESCE(TBL1.col1," ")||"_"||COALESCE(CASE WHEN TBL1.col5 = … ELSE "some value" END," ")||"_"||...) AS new_col1,
instead of this:
MD5(COALESCE(TBL1.col1," ")||"_"||COALESCE(new_column5," ")||"_"||...) AS
new_col1,
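The second option, an upper-level subquery, is not shown in the answer. A minimal sketch of its shape follows, using Python's built-in sqlite3 as a stand-in engine rather than Hive. The table contents and the CASE condition (`col5 = 'x'`) are hypothetical, since the original query elides them, and MD5 is left out because SQLite has no such built-in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1 TEXT, col5 TEXT)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", [("a", "x"), (None, "y")])

# The inner query computes new_column5; the outer query can then refer to it by name.
query = """
SELECT COALESCE(col1, ' ') || '_' || COALESCE(new_column5, ' ') AS new_col1,
       new_column5
FROM (
    SELECT col1,
           CASE WHEN col5 = 'x' THEN 'matched' ELSE 'some value' END AS new_column5
    FROM table1
) AS inner_query
ORDER BY new_column5
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The trade-off is an extra nesting level in exchange for writing the CASE expression only once, which may matter when the same calculated column feeds several outer expressions.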

Python printing lists with column headers

So I have a nested list containing these values
#[[Mark, 10, Orange],
#[Fred, 15, Red],
#[Gary, 12, Blue],
#[Ned, 21, Yellow]]
Each inner list is laid out as (name, age, favcolour).
I want to display each column with its corresponding header,
e.g.
Name|Age|Favourite colour
Mark|10 |Orange
Fred|15 |Red
Gary|12 |Blue
Ned |21 |Yellow
Thank You!
A simple solution using the str.format() function:
l = [['Mark', 10, 'Orange'], ['Fred', 15, 'Red'], ['Gary', 12, 'Blue'], ['Ned', 21, 'Yellow']]
f = '{:<10}|{:<3}|{:<15}'  # left-align each column to a fixed width
# header (the `Name` column has some gap as there could be long names, like "Cristopher")
print(f.format('Name', 'Age', 'Favourite colour'))
for i in l:
    print(f.format(*i))
The output:
Name      |Age|Favourite colour
Mark      |10 |Orange
Fred      |15 |Red
Gary      |12 |Blue
Ned       |21 |Yellow
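The fixed widths in the format string have to be chosen by hand. As a variation, the widths can be derived from the data so that each column is only as wide as its longest cell. This is a sketch under that assumption; `format_table` is a hypothetical helper name, not part of the answer above.

```python
rows = [["Mark", 10, "Orange"], ["Fred", 15, "Red"], ["Gary", 12, "Blue"], ["Ned", 21, "Yellow"]]
headers = ["Name", "Age", "Favourite colour"]

def format_table(headers, rows):
    """Return the table as a list of lines, padding each column to its widest cell."""
    table = [headers] + [[str(cell) for cell in row] for row in rows]
    widths = [max(len(row[i]) for row in table) for i in range(len(headers))]
    return [
        # Left-justify every cell, then drop trailing padding on the last column.
        "|".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in table
    ]

lines = format_table(headers, rows)
print("\n".join(lines))
```

With the sample data this reproduces the layout requested in the question, with the Name column four characters wide.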

Beginner rbind function

I cannot for the life of me understand the rbind function. I've tried using the examples on here, but I can't figure out what I am doing incorrectly. All I would like to do is add the data from my second data frame under the first.
Does rbind require the columns be the same name or...?
ParticipantA=c("A","B","C","D")
Score1A=c("21","20","21","21")
Score2A=c("32","40","32","31")
Score3A=c("47","50","43","46")
BlockA=data.frame(ParticipantA,Score1A,Score2A,Score3A)
BlockA$Major=c("Computer_Science","Computer_Science","Computer_Science","Computer_Science")
BlockA$Gender=c("Female","Female","Male","Male")
ParticipantB=c("E","F","G","H")
Score1B=c("28","28","21","22")
Score2B=c("30","36","37","32")
Score3B=c("41","49","49","46")
BlockB=data.frame(ParticipantB,Score1B,Score2B,Score3B)
BlockB$Major=c("Medical","Medical","Medical","Medical")
BlockB$Gender=c("Female","Female","Male","Male")
For data frames, rbind requires that both inputs have the same column names. Matching columns should also hold the same kind of data, otherwise the values are coerced to a common type.
The problem is in the column names. rbind matches columns by name when it binds the rows. The columns can be in different orders; R uses the first data frame to determine the column order.
Putting "A" and "B" in the column names is the reason you can't use rbind. Alternatively, adding another column to your data frames, holding the value "A" or "B", preserves that information without encoding it in the column names. The additional column would also allow you to do more analyses in R, e.g. regression and other linear models.
Here is one way to handle your data:
Create a uniform set of column names that can be used for the data frames "BlockA" and "BlockB"
final_colnames <- c("Block", "Participant", "Score1", "Score2", "Score3")
Create a new vector to identify which block the participants belong to.
BlockA = c("A", "A", "A", "A")
Your previous data
ParticipantA = c("A", "B", "C", "D")
Score1A = c("21", "20", "21", "21")
Score2A = c("32", "40", "32", "31")
Score3A = c("47", "50", "43", "46")
The name "BlockA" is reused here for the new data frame, with the "BlockA" vector of "A" "A" "A" "A" becoming its first column.
BlockA = data.frame(BlockA, ParticipantA, Score1A, Score2A, Score3A)
The new column names have to be added at this point, so that the number of names and the number of columns are equal.
colnames(BlockA) <- final_colnames
Now you can add the remaining columns
BlockA$Major = c("Computer_Science", "Computer_Science", "Computer_Science", "Computer_Science")
BlockA$Gender = c("Female", "Female", "Male", "Male")
BlockB is the same process
BlockB = c("B", "B", "B", "B") # the extra column
ParticipantB = c("E", "F", "G", "H")
Score1B = c("28", "28", "21", "22")
Score2B = c("30", "36", "37", "32")
Score3B = c("41", "49", "49", "46")
BlockB = data.frame(BlockB, ParticipantB, Score1B, Score2B, Score3B)
colnames(BlockB) <- final_colnames # renaming the columns
BlockB$Major = c("Medical", "Medical", "Medical", "Medical")
BlockB$Gender = c("Female", "Female", "Male", "Male")
Uniform column names mean that rbind will now work.
rbind(BlockA,BlockB)
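The matching-by-name behaviour described above can be modelled in a few lines of Python. This is a simplified analogy and not how R implements rbind: tables are lists of dictionaries, the `rename` and `rbind` functions are hypothetical helpers, and the data is a two-column excerpt of BlockA and BlockB.

```python
def rename(table, mapping):
    """Return a copy of the table with its columns renamed via `mapping`."""
    return [{mapping[name]: value for name, value in row.items()} for row in table]

def rbind(first, second):
    """Stack two tables (lists of dict rows), matching columns by name."""
    order = list(first[0])  # column order is taken from the first table
    if set(order) != set(second[0]):
        raise ValueError("column names do not match")
    return [{name: row[name] for name in order} for row in first + second]

# Hypothetical two-column excerpts of BlockA and BlockB.
block_a = [{"ParticipantA": "A", "Score1A": "21"}, {"ParticipantA": "B", "Score1A": "20"}]
block_b = [{"Score1B": "28", "ParticipantB": "E"}]

# Rename to uniform names and record the block in its own column.
uniform_a = [{"Block": "A", **row}
             for row in rename(block_a, {"ParticipantA": "Participant", "Score1A": "Score1"})]
uniform_b = [{"Block": "B", **row}
             for row in rename(block_b, {"ParticipantB": "Participant", "Score1B": "Score1"})]

combined = rbind(uniform_a, uniform_b)
print(combined)
```

Calling `rbind(block_a, block_b)` on the original names raises the error, while the renamed tables stack cleanly even though the second one lists its columns in a different order.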