After running this code:
t1 <-Sys.time()
df.m <- left_join(df.h,daRta3,by=c("year","month","MA","day"))
t2 <- Sys.time()
difftime(t2,t1)
I have this error.
Error: std::bad_alloc
The dimension of the matrix that I have tried to create is 74495*2695 = 180.10^6 rows.
The computer in which I run the code has 20 GB of RAM
I tried the memory.limit() but it did not solve my issue.
Examine cardinality of your join key
Is the c("year","month","MA","day") unique in both df.h and daRta3?
What are the most frequent values?
NA values. left_join can treat NA values as equal or different:
> tibble(x = c(NA, NA, NA)) %>% left_join(., ., by = 'x')
# A tibble: 9 x 1
x
<lgl>
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
7 NA
8 NA
9 NA
> tibble(x = c(NA, NA, NA)) %>% left_join(., ., by = 'x', na_matches = 'never')
# A tibble: 3 x 1
x
<lgl>
1 NA
2 NA
3 NA
If order and values in c("year","month","MA","day") can be guaranteed to be the same then simple cbind or bind_cols might be an efficient solution
Related
I'am trying to extract sets of coordinates from strings and change the format.
I have tried some of the stringr package and getting nowhere with the pattern extraction.
It's my first time dealing with regex and still is a little confusing to create a pattern.
There is a data frame with one column with one or more sets of coordinates.
The only pattern (the majority) separating Lat from Long is (-), and to separate one set of coordinates to another there is a (/)
Here is an example of some of the data:
ID Coordinates
1 3438-5150
2 3346-5108/3352-5120 East island, South port
3 West coast (284312 472254)
4 28.39.97-47.05.62/29.09.13-47.44.03
5 2843-4722/3359-5122(1H-2H-3H-4F)
Most of the data is in decimal degree, e.g. (id 1 is Lat 34.38 Lon 51.50), some others is in 00º00'00'', e.g. (id 4 is Lat 28º 39' 97'' Lon 47º 05' 62'')
I will need to make in a few steps
1 - Extract all coordinates sets creating a new row for each set of each record;
2 - Extract the text label of record to a new column, concatenating them;
3- Convert the coordinates from 00º00'00''(28.39.97) to 00.0000º (28.6769 - decimal dregree) so all coordinates are in the same format. I can easily convert if they are as numeric.
4 - Add dot (.) to separate the decimal degree values (from 3438 to 34.38) and add (-) to identify as (-34.38) south west hemisphere. All value must have (-) sign.
I'am trying to get something like this:
Step 1 and 2 - Extract coordinates sets and names
ID x y label
1 3438 5150
2 3346 5108 East island, South port
2 3352 5120 East island, South port
3 284312 472254 West coast
4 28.39.97 47.05.62
4 29.09.13 47.44.03
5 2843 4722 1H-2H-3H-4F
5 3359 5122 1H-2H-3H-4F
Step 3 - convert coordinates format to decimal degree (ID 4)
ID x y label
1 3438 5150
2 3346 5108 East island, South port
2 3352 5120 East island, South port
3 284312 472254 West coast
4 286769 471005
4 291536 470675
5 2843 4722 1H-2H-3H-4F
5 3359 5122 1H-2H-3H-4F
Step 4 - change display format
ID x y label
1 -34.38 -51.50
2 -33.46 -51.08 East island, South port
2 -33.52 -51.20 East island, South port
3 -28.43 -47.22 West coast
4 -28.6769 -47.1005
4 -29.1536 -47.0675
5 -28.43 -47.22 1H-2H-3H-4F
5 -33.59 -51.22 1H-2H-3H-4F
I have edit the question to better clarify my problems and change some of my needs. I realized that it was messy to understand.
So, has anyone worked with something similar?
Any other suggestion would be of great help.
Thank you again for the time to help.
Note: the first answers address the original asking of the question and the last answer addresses its current state. The data in data1 should be set appropriately for each solution.
The following should address your first question given the data you provided and the expected output (using dplyr and tidyr).
library(dplyr)
library(tidyr)
### Load Data
data1 <- structure(list(ID = 1:4, Coordinates = c("3438-5150", "3346-5108/3352-5120",
"2843-4722/3359-5122(1H-2H-3H-4F)", "28.39.97-47.05.62/29.09.13-47.44.03"
)), .Names = c("ID", "Coordinates"), class = "data.frame", row.names = c(NA,
-4L))
### This is a helper function to transform data that is like '1234'
### but should be '12.34', and leaves alone '12.34'.
### You may have to change this based on your use case.
div100 <- function(x) { return(ifelse(x > 100, x / 100, x)) }
### Remove items like "(...)" and change "12.34.56" to "12.34"
### Split into 4 columns and xform numeric value.
data1 %>%
mutate(Coordinates = gsub('\\([^)]+\\)', '', Coordinates),
Coordinates = gsub('(\\d+[.]\\d+)[.]\\d+', '\\1', Coordinates)) %>%
separate(Coordinates, c('x.1', 'y.1', 'x.2', 'y.2'), fill = 'right', sep = '[-/]', convert = TRUE) %>%
mutate_at(vars(matches('^[xy][.]')), div100) # xform columns x.N and y.N
## ID x.1 y.1 x.2 y.2
## 1 1 34.38 51.50 NA NA
## 2 2 33.46 51.08 33.52 51.20
## 3 3 28.43 47.22 33.59 51.22
## 4 4 28.39 47.05 29.09 47.44
The call to mutate modifies Coordinates twice to make substitutions easier.
Edit
A variation that uses another regex substitution instead of mutate_at.
data1 %>%
mutate(Coordinates = gsub('\\([^)]+\\)', '', Coordinates),
Coordinates = gsub('(\\d{2}[.]\\d{2})[.]\\d{2}', '\\1', Coordinates),
Coordinates = gsub('(\\d{2})(\\d{2})', '\\1.\\2', Coordinates)) %>%
separate(Coordinates, c('x.1', 'y.1', 'x.2', 'y.2'), fill = 'right', sep = '[-/]', convert = TRUE)
Edit 2: The following solution addresses the updated version of the question
The following solution does a number of transformations to transform the data. These are separate to make it a bit easier to think about (much easier relatively speaking).
library(dplyr)
library(tidyr)
data1 <- structure(list(ID = 1:5, Coordinates = c("3438-5150", "3346-5108/3352-5120 East island, South port",
"East coast (284312 472254)", "28.39.97-47.05.62/29.09.13-47.44.03",
"2843-4722/3359-5122(1H-2H-3H-4F)")), .Names = c("ID", "Coordinates"
), class = "data.frame", row.names = c(NA, -5L))
### Function for converting to numeric values and
### handles case of "12.34.56" (hours/min/sec)
hms_convert <- function(llval) {
nres <- rep(0, length(llval))
coord3_match_idx <- grepl('^\\d{2}[.]\\d{2}[.]\\d{2}$', llval)
nres[coord3_match_idx] <- sapply(str_split(llval[coord3_match_idx], '[.]', 3), function(x) { sum(as.numeric(x) / c(1,60,3600))})
nres[!coord3_match_idx] <- as.numeric(llval[!coord3_match_idx])
nres
}
### Each mutate works to transform the various data formats
### into a single format. The 'separate' commands then split
### the data into the appropriate columns. The action of each
### 'mutate' can be seen by progressively viewing the results
### (i.e. adding one 'mutate' command at a time).
data1 %>%
mutate(Coordinates_new = Coordinates) %>%
mutate(Coordinates_new = gsub('\\([^) ]+\\)', '', Coordinates_new)) %>%
mutate(Coordinates_new = gsub('(.*?)\\(((\\d{6})[ ](\\d{6}))\\).*', '\\3-\\4 \\1', Coordinates_new)) %>%
mutate(Coordinates_new = gsub('(\\d{2})(\\d{2})(\\d{2})', '\\1.\\2.\\3', Coordinates_new)) %>%
mutate(Coordinates_new = gsub('(\\S+)[\\s]+(.+)', '\\1|\\2', Coordinates_new, perl = TRUE)) %>%
separate(Coordinates_new, c('Coords', 'label'), fill = 'right', sep = '[|]', convert = TRUE) %>%
mutate(Coords = gsub('(\\d{2})(\\d{2})', '\\1.\\2', Coords)) %>%
separate(Coords, c('x.1', 'y.1', 'x.2', 'y.2'), fill = 'right', sep = '[-/]', convert = TRUE) %>%
mutate_at(vars(matches('^[xy][.]')), hms_convert) %>%
mutate_at(vars(matches('^[xy][.]')), function(x) ifelse(!is.na(x), -x, x))
## ID Coordinates x.1 y.1 x.2 y.2 label
## 1 1 3438-5150 -34.38000 -51.50000 NA NA <NA>
## 2 2 3346-5108/3352-5120 East island, South port -33.46000 -51.08000 -33.52000 -51.20000 East island, South port
## 3 3 East coast (284312 472254) -28.72000 -47.38167 NA NA East coast
## 4 4 28.39.97-47.05.62/29.09.13-47.44.03 -28.67694 -47.10056 -29.15361 -47.73417 <NA>
## 5 5 2843-4722/3359-5122(1H-2H-3H-4F) -28.43000 -47.22000 -33.59000 -51.22000 <NA>
We can use stringi. We create a . between the 4 digit numbers with gsub, use stri_extract_all (from stringi) to extract two digit numbers followed by a dot followed by two digit numbers (\\d{2}\\.\\d{2}) to get a list output. As the list elements have unequal length, we can pad NA at the end for those elements that have shorter length than the maximum length and convert to matrix (using stri_list2matrix). After converting to data.frame, changing the character columns to numeric, and cbind with the 'ID' column of the original dataset.
library(stringi)
d1 <- as.data.frame(stri_list2matrix(stri_extract_all_regex(gsub("(\\d{2})(\\d{2})",
"\\1.\\2", data1$Coordinates), "\\d{2}\\.\\d{2}"), byrow=TRUE), stringsAsFactors=FALSE)
d1[] <- lapply(d1, as.numeric)
colnames(d1) <- paste0(c("x.", "y."), rep(1:2,each = 2))
cbind(data1[1], d1)
# ID x.1 y.1 x.2 y.2
#1 1 34.38 51.50 NA NA
#2 2 33.46 51.08 33.52 51.20
#3 3 28.43 47.22 33.59 51.22
#4 4 28.39 47.05 29.09 47.44
But, this can also be done with base R.
#Create the dots for the 4-digit numbers
str1 <- gsub("(\\d{2})(\\d{2})", "\\1.\\2", data1$Coordinates)
#extract the numbers in a list with gregexpr/regmatches
lst <- regmatches(str1, gregexpr("\\d{2}\\.\\d{2}", str1))
#convert to numeric
lst <- lapply(lst, as.numeric)
#pad with NA's at the end and convert to data.frame
d1 <- do.call(rbind.data.frame, lapply(lst, `length<-`, max(lengths(lst))))
#change the column names
colnames(d1) <- paste0(c("x.", "y."), rep(1:2,each = 2))
#cbind with the first column of 'data1'
cbind(data1[1], d1)
I have a big Eurostat dataset loaded like this:
install.packages("SmarterPoland")
library(SmarterPoland)
GDP_raw <- getEurostatRCV(kod = "namq_gdp_c")
It has this structure:
s_adj unit indic_na geo time value
1 NSA EUR_HAB B11 AT 2014Q1 NA
2 NSA EUR_HAB B11 BE 2014Q1 200.0
3 NSA EUR_HAB B11 BG 2014Q1 -100.0
I want to use "time" as the first column and the other variables as rows. Doing it the other way around is easy with:
GDP_sorted <- cast(GDP_raw, geo + unit + s_adj + indic_na ~ time)
which returns:
geo unit s_adj indic_na 1955Q1 1955Q2 1955Q3 1955Q4
1 AT EUR_HAB NSA B11 NA NA NA NA
2 AT EUR_HAB NSA B111 NA NA NA NA
3 AT EUR_HAB NSA B112 NA NA NA NA
The problem is, that here the columns are variables so every quarter is its own variable which doesn't make sense from a Time Series perspective. I need some sort of transpose (simple t() doesn't return the same data type). However, if I try cast the other way around, it adds the different categories together into one variable and creates:
time AT_EUR_HAB_NSA_B11 AT_EUR_HAB_NSA_B111 AT_EUR_HAB_NSA_B112
1 1955Q1 NA NA NA
2 1955Q2 NA NA NA
3 1955Q3 NA NA NA
Which means I have 12405 variables. That makes subset infeasible. I'd like something along the lines of:
time
s_adj NSA NSA NSA
geo AT AT AT
unit EUR_HAB EUR_HAB EUR_HAB
indic_na B11 B12 B13
1 1955Q1 NA NA NA
2 1955Q2 NA NA NA
3 1955Q3 NA NA NA
and so forth (this is a fictional example). So then I could use:
Demand <- subset(GDP_sorted, (indic_na == "P3_P5") & (s_adj == "SWDA") & (unit == "MIO_EUR"))
Without having to specify all combinations of variables from 12405 variables.
Until someone provides a better answer, here is a workaround I'm using now:
start from the raw downloaded table:
GDP_raw <- read.table("/media/38A05C6AA05C311C/1_Documents/Dropbox/Masterarbeit/2_R/Data/GDP_raw.RData")
then subset your variable of interest:
Demand <- subset(GDP_raw, (indic_na == "P3_P5") & (s_adj == "SWDA") & (unit == "MIO_EUR"))
and then the only dimensions which remain are time and geo which you can cast simply as:
Demand_cast <- cast(Demand, time ~ geo)
Which gives you one file with the matrix of your variable of the form:
time AT BE BG
1955Q1 NA NA NA
1955Q2 NA NA NA
1955Q3 NA NA NA
I have a dataframe with 2 columns GL and GLDESC and want to add a 3rd column called KIND based on some data that is inside of column GLDESC.
The dataframe is as follows:
GL GLDESC
1 515100 Payroll-Indir Salary Labor
2 515900 Payroll-Indir Compensated Absences
3 532300 Bulk Gas
4 539991 Area Charge In
5 551000 Repairs & Maint-Spare Parts
6 551100 Supplies-Operating
7 551300 Consumables
For each row of the data table:
If GLDESC contains the word Payroll anywhere in the string then I want KIND to be Payroll
If GLDESC contains the word Gas anywhere in the string then I want KIND to be Materials
In all other cases I want KIND to be Other
I looked for similar examples on stackoverflow but could not find any, also looked in R for dummies on switch, grep, apply and regular expressions to try and match only part of the GLDESC column and then fill the KIND column with the kind of account but was unable to make it work.
Since you have only two conditions, you can use a nested ifelse:
#random data; it wasn't easy to copy-paste yours
DF <- data.frame(GL = sample(10), GLDESC = paste(sample(letters, 10),
c("gas", "payroll12", "GaSer", "asdf", "qweaa", "PayROll-12",
"asdfg", "GAS--2", "fghfgh", "qweee"), sample(letters, 10), sep = " "))
DF$KIND <- ifelse(grepl("gas", DF$GLDESC, ignore.case = T), "Materials",
ifelse(grepl("payroll", DF$GLDESC, ignore.case = T), "Payroll", "Other"))
DF
# GL GLDESC KIND
#1 8 e gas l Materials
#2 1 c payroll12 y Payroll
#3 10 m GaSer v Materials
#4 6 t asdf n Other
#5 2 w qweaa t Other
#6 4 r PayROll-12 q Payroll
#7 9 n asdfg a Other
#8 5 d GAS--2 w Materials
#9 7 s fghfgh e Other
#10 3 g qweee k Other
EDIT 10/3/2016 (..after receiving more attention than expected)
A possible solution to deal with more patterns could be to iterate over all patterns and, whenever there is match, progressively reduce the amount of comparisons:
ff = function(x, patterns, replacements = patterns, fill = NA, ...)
{
stopifnot(length(patterns) == length(replacements))
ans = rep_len(as.character(fill), length(x))
empty = seq_along(x)
for(i in seq_along(patterns)) {
greps = grepl(patterns[[i]], x[empty], ...)
ans[empty[greps]] = replacements[[i]]
empty = empty[!greps]
}
return(ans)
}
ff(DF$GLDESC, c("gas", "payroll"), c("Materials", "Payroll"), "Other", ignore.case = TRUE)
# [1] "Materials" "Payroll" "Materials" "Other" "Other" "Payroll" "Other" "Materials" "Other" "Other"
ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"),
c("pat1a|pat1b", "pat2", "pat3"),
c("1", "2", "3"), fill = "empty")
#[1] "1" "1" "3" "empty"
ff(c("pat1a pat2", "pat1a pat1b", "pat3", "pat4"),
c("pat2", "pat1a|pat1b", "pat3"),
c("2", "1", "3"), fill = "empty")
#[1] "2" "1" "3" "empty"
I personally like matching by index. You can loop grep over your new labels, in order to get the indices of your partial matches, then use this with a lookup table to simply reassign the values.
If you wanna create new labels, use a named vector.
DF <- data.frame(GL = sample(10), GLDESC = paste(sample(letters, 10),
c(
"gas", "payroll12", "GaSer", "asdf", "qweaa", "PayROll-12",
"asdfg", "GAS--2", "fghfgh", "qweee"
), sample(letters, 10),
sep = " "
))
lu <- stack(sapply(c(Material = "gas", Payroll = "payroll"), grep, x = DF$GLDESC, ignore.case = TRUE))
DF$KIND <- DF$GLDESC
DF$KIND[lu$values] <- as.character(lu$ind)
DF$KIND[-lu$values] <- "Other"
DF
#> GL GLDESC KIND
#> 1 6 x gas f Material
#> 2 3 t payroll12 q Payroll
#> 3 5 a GaSer h Material
#> 4 4 s asdf x Other
#> 5 1 m qweaa y Other
#> 6 10 y PayROll-12 r Payroll
#> 7 7 g asdfg a Other
#> 8 2 k GAS--2 i Material
#> 9 9 e fghfgh j Other
#> 10 8 l qweee p Other
Created on 2021-11-13 by the reprex package (v2.0.1)
Say i have a list mn like this
i<-c(w=5,n="oes")
p<-c(w=9,n="ty",j="ooe")
mn<-list(i,p,i,p,i,p,i)
Now I´d like to select the list elements with the shortest length (the i´s) and append "unknown" to the list before creating a dataframe. How can I do this?
EDIT: In the end I´d like the list to have every i element in mn as w=5,n="oes", and j="unknown" before mn including p is changed into a dataframe:
To find the lenght of each element in your list, use length wrapped in sapply:
len <- sapply(mn, length)
len
[1] 2 3 2 3 2 3 2
Now, to identify only those elements that have lengths equal to the shortest length:
which(len==min(len))
[1] 1 3 5 7
Use subsetting and as.data.frame to create your data.frame. But this data.frame will have somewhat random column names, so I rename the column names:
df <- as.data.frame(mn[which(len==min(len))])
names(df) <- seq_len(ncol(df))
df
1 2 3 4
w 5 5 5 5
n oes oes oes oes
You will have to clarify what you mean with "append unknown" to this data.frame.
Another possibility is:
all.names = unique( unlist( lapply( mn, names ) ) )
do.call( 'rbind', lapply( mn, function( r ) {
data.frame( sapply( all.names, function( v ) r[ v ], simplify=F ) )
} ) )
which gives:
w n j
w 5 oes <NA>
w1 9 ty ooe
w2 5 oes <NA>
w3 9 ty ooe
w4 5 oes <NA>
w5 9 ty ooe
w6 5 oes <NA>
But I get the feeling there's a much neater route to this solution...
edit
If you want unknown rather than <NA>, you can change the inner sapply to:
sapply( all.names, function( v ) if( is.na( r[v] ) ) 'unknown' else r[v], simplify=F )
There is not very elegant, but it might do the trick:
maxlength <- max(sapply(mn,length))
## make a new list, with the "missing" entries replaced with "unknown"
mn2 <- lapply(mn,function(x)c(x,rep('unknown',maxlength - length(x))))
## convert to a data.frame
mn3 <- data.frame(matrix(unlist(mn2),nrow = 3))
Which gives the following
> mn3
X1 X2 X3 X4 X5 X6 X7
1 5 9 5 9 5 9 5
2 oes ty oes ty oes ty oes
3 unknown ooe unknown ooe unknown ooe unknown
However it is better practice to use NA, rather than "unknown"
I have a data.frame:
df<-data.frame(a=c("x","x","y","y"),b=c(1,2,3,4))
> df
a b
1 x 1
2 x 2
3 y 3
4 y 4
What's the easiest way to print out each pair of values as a list of strings like this:
"x1", "x2", "y1", "y2"
apply(df, 1, paste, collapse="")
with(df, paste(a, b, sep=""))
And this should be faster than apply.
About timing
For 10000 rows we get:
df <- data.frame(
a = sample(c("x","y"), 10000, replace=TRUE),
b = sample(1L:4L, 10000, replace=TRUE)
)
N = 100
mean(replicate(N, system.time( with(df, paste(a, b, sep="")) )["elapsed"]), trim=0.05)
# 0.005778
mean(replicate(N, system.time( apply(df, 1, paste, collapse="") )["elapsed"]), trim=0.05)
# 0.09611
So increase in speed is visible for few thousands.
It's because Shane's solution call paste for each row separately. So there is nrow(df) calls of paste, in my solution is one call.
Also, you can use sqldf library:
library("sqldf")
df<-data.frame(a=c("x","x","y","y"),b=c(1,2,3,4))
result <- sqldf("SELECT a || cast(cast(b as integer) as text) as concat FROM df")
You will get the following result:
concat
1 x1
2 x2
3 y3
4 y4