Cast Function with multiple rows and a time column - casting

I have a big Eurostat dataset loaded like this:
install.packages("SmarterPoland")
library(SmarterPoland)
GDP_raw <- getEurostatRCV(kod = "namq_gdp_c")
It has this structure:
s_adj unit indic_na geo time value
1 NSA EUR_HAB B11 AT 2014Q1 NA
2 NSA EUR_HAB B11 BE 2014Q1 200.0
3 NSA EUR_HAB B11 BG 2014Q1 -100.0
I want to use "time" as the first column and the other variables as rows. Doing it the other way around is easy with:
GDP_sorted <- cast(GDP_raw, geo + unit + s_adj + indic_na ~ time)
which returns:
geo unit s_adj indic_na 1955Q1 1955Q2 1955Q3 1955Q4
1 AT EUR_HAB NSA B11 NA NA NA NA
2 AT EUR_HAB NSA B111 NA NA NA NA
3 AT EUR_HAB NSA B112 NA NA NA NA
The problem is that every quarter becomes its own variable (column) here, which doesn't make sense from a time-series perspective. I need some sort of transpose (a simple t() doesn't return the same data type). However, if I cast the other way around, it pastes the different categories together into single variable names and creates:
time AT_EUR_HAB_NSA_B11 AT_EUR_HAB_NSA_B111 AT_EUR_HAB_NSA_B112
1 1955Q1 NA NA NA
2 1955Q2 NA NA NA
3 1955Q3 NA NA NA
This means I have 12405 variables, which makes subset infeasible. I'd like something along the lines of:
time
s_adj NSA NSA NSA
geo AT AT AT
unit EUR_HAB EUR_HAB EUR_HAB
indic_na B11 B12 B13
1 1955Q1 NA NA NA
2 1955Q2 NA NA NA
3 1955Q3 NA NA NA
and so forth (this is a fictional example). So then I could use:
Demand <- subset(GDP_sorted, (indic_na == "P3_P5") & (s_adj == "SWDA") & (unit == "MIO_EUR"))
without having to specify every combination among the 12405 variables.

Until someone provides a better answer, here is the workaround I'm using now.
Start from the raw downloaded table:
library(reshape)  # provides cast()
GDP_raw <- read.table("/media/38A05C6AA05C311C/1_Documents/Dropbox/Masterarbeit/2_R/Data/GDP_raw.RData")
Then subset your variable of interest:
Demand <- subset(GDP_raw, (indic_na == "P3_P5") & (s_adj == "SWDA") & (unit == "MIO_EUR"))
The only dimensions that remain are time and geo, which you can cast simply as:
Demand_cast <- cast(Demand, time ~ geo)
This gives you a single table for your variable, of the form:
time AT BE BG
1955Q1 NA NA NA
1955Q2 NA NA NA
1955Q3 NA NA NA
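The same subset-then-reshape idea can be sketched in base R with reshape(), shown here on a small made-up table. GDP_toy, Demand_toy, Demand_wide and all values are invented for illustration and are not Eurostat data.

```r
# Toy long-format table mimicking the Eurostat layout (values are made up)
GDP_toy <- data.frame(
  s_adj    = "SWDA",
  unit     = "MIO_EUR",
  indic_na = rep(c("B11", "P3_P5"), each = 4),
  geo      = rep(c("AT", "BE"), times = 4),
  time     = rep(rep(c("2014Q1", "2014Q2"), each = 2), times = 2),
  value    = c(1, 2, 3, 4, 10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Step 1: filter while the data is still long, so only time and geo vary
Demand_toy <- subset(GDP_toy,
                     indic_na == "P3_P5" & s_adj == "SWDA" & unit == "MIO_EUR")

# Step 2: spread geo into columns, giving one row per quarter
Demand_wide <- reshape(Demand_toy[, c("time", "geo", "value")],
                       idvar = "time", timevar = "geo", direction = "wide")
print(Demand_wide)
```

Keeping the data long until after the filter means the category columns stay available as ordinary variables for subset().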

R sets of coordinates extract from string

I am trying to extract sets of coordinates from strings and change their format.
I have tried some functions from the stringr package but am getting nowhere with the pattern extraction.
It's my first time dealing with regex, and creating a pattern is still a little confusing.
I have a data frame with one column that holds one or more sets of coordinates.
In most records, Lat is separated from Long by (-), and one set of coordinates is separated from the next by (/).
Here is an example of some of the data:
ID Coordinates
1 3438-5150
2 3346-5108/3352-5120 East island, South port
3 West coast (284312 472254)
4 28.39.97-47.05.62/29.09.13-47.44.03
5 2843-4722/3359-5122(1H-2H-3H-4F)
Most of the data is in decimal degrees, e.g. (id 1 is Lat 34.38 Lon 51.50); some of it is in 00º00'00'', e.g. (id 4 is Lat 28º 39' 97'' Lon 47º 05' 62'').
I need to do this in a few steps:
1 - Extract all coordinate sets, creating a new row for each set of each record;
2 - Extract the text label of each record to a new column, concatenating them;
3 - Convert the coordinates from 00º00'00'' (28.39.97) to 00.0000º (28.6769, decimal degrees) so all coordinates are in the same format. I can easily convert them once they are numeric.
4 - Add a dot (.) to separate the decimal degree values (from 3438 to 34.38) and add (-) to mark them as south-west hemisphere (-34.38). All values must have a (-) sign.
I am trying to get something like this:
Step 1 and 2 - Extract coordinates sets and names
ID x y label
1 3438 5150
2 3346 5108 East island, South port
2 3352 5120 East island, South port
3 284312 472254 West coast
4 28.39.97 47.05.62
4 29.09.13 47.44.03
5 2843 4722 1H-2H-3H-4F
5 3359 5122 1H-2H-3H-4F
Step 3 - convert coordinates format to decimal degree (ID 4)
ID x y label
1 3438 5150
2 3346 5108 East island, South port
2 3352 5120 East island, South port
3 284312 472254 West coast
4 286769 471005
4 291536 470675
5 2843 4722 1H-2H-3H-4F
5 3359 5122 1H-2H-3H-4F
Step 4 - change display format
ID x y label
1 -34.38 -51.50
2 -33.46 -51.08 East island, South port
2 -33.52 -51.20 East island, South port
3 -28.43 -47.22 West coast
4 -28.6769 -47.1005
4 -29.1536 -47.0675
5 -28.43 -47.22 1H-2H-3H-4F
5 -33.59 -51.22 1H-2H-3H-4F
I have edited the question to better clarify my problem and changed some of my requirements. I realized that it was hard to understand.
So, has anyone worked with something similar?
Any other suggestion would be of great help.
Thank you again for the time to help.
Note: the first answers address the original version of the question, and the last answer addresses its current state. The data in data1 should be set appropriately for each solution.
The following should address your first question given the data you provided and the expected output (using dplyr and tidyr).
library(dplyr)
library(tidyr)
### Load Data
data1 <- structure(list(ID = 1:4, Coordinates = c("3438-5150", "3346-5108/3352-5120",
"2843-4722/3359-5122(1H-2H-3H-4F)", "28.39.97-47.05.62/29.09.13-47.44.03"
)), .Names = c("ID", "Coordinates"), class = "data.frame", row.names = c(NA,
-4L))
### This is a helper function to transform data that is like '1234'
### but should be '12.34', and leaves alone '12.34'.
### You may have to change this based on your use case.
div100 <- function(x) { return(ifelse(x > 100, x / 100, x)) }
### Remove items like "(...)" and change "12.34.56" to "12.34"
### Split into 4 columns and xform numeric value.
data1 %>%
mutate(Coordinates = gsub('\\([^)]+\\)', '', Coordinates),
Coordinates = gsub('(\\d+[.]\\d+)[.]\\d+', '\\1', Coordinates)) %>%
separate(Coordinates, c('x.1', 'y.1', 'x.2', 'y.2'), fill = 'right', sep = '[-/]', convert = TRUE) %>%
mutate_at(vars(matches('^[xy][.]')), div100) # xform columns x.N and y.N
## ID x.1 y.1 x.2 y.2
## 1 1 34.38 51.50 NA NA
## 2 2 33.46 51.08 33.52 51.20
## 3 3 28.43 47.22 33.59 51.22
## 4 4 28.39 47.05 29.09 47.44
The call to mutate modifies Coordinates twice to make substitutions easier.
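To see what the two substitutions do on their own, here is a minimal base-R sketch applied to two strings from the question's data. The names x, step1 and step2 are only for illustration.

```r
# Two sample strings taken from the question's data
x <- c("2843-4722/3359-5122(1H-2H-3H-4F)", "28.39.97-47.05.62")

# First substitution: drop any parenthesised suffix such as "(1H-2H-3H-4F)"
step1 <- gsub('\\([^)]+\\)', '', x)

# Second substitution: truncate "12.34.56" to "12.34"
step2 <- gsub('(\\d+[.]\\d+)[.]\\d+', '\\1', step1)

print(step1)
print(step2)
```

After both steps, every string contains only numbers separated by - or /, which is what separate() then splits on.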
Edit
A variation that uses another regex substitution instead of mutate_at.
data1 %>%
mutate(Coordinates = gsub('\\([^)]+\\)', '', Coordinates),
Coordinates = gsub('(\\d{2}[.]\\d{2})[.]\\d{2}', '\\1', Coordinates),
Coordinates = gsub('(\\d{2})(\\d{2})', '\\1.\\2', Coordinates)) %>%
separate(Coordinates, c('x.1', 'y.1', 'x.2', 'y.2'), fill = 'right', sep = '[-/]', convert = TRUE)
Edit 2: The following solution addresses the updated version of the question
The following solution applies a series of transformations to the data. They are kept as separate steps to make them a bit easier to reason about.
library(dplyr)
library(tidyr)
library(stringr)  # needed for str_split() in hms_convert()
data1 <- structure(list(ID = 1:5, Coordinates = c("3438-5150", "3346-5108/3352-5120 East island, South port",
"East coast (284312 472254)", "28.39.97-47.05.62/29.09.13-47.44.03",
"2843-4722/3359-5122(1H-2H-3H-4F)")), .Names = c("ID", "Coordinates"
), class = "data.frame", row.names = c(NA, -5L))
### Function for converting to numeric values; it also
### handles the case of "12.34.56" (degrees/min/sec)
hms_convert <- function(llval) {
nres <- rep(0, length(llval))
coord3_match_idx <- grepl('^\\d{2}[.]\\d{2}[.]\\d{2}$', llval)
nres[coord3_match_idx] <- sapply(str_split(llval[coord3_match_idx], '[.]', 3), function(x) { sum(as.numeric(x) / c(1,60,3600))})
nres[!coord3_match_idx] <- as.numeric(llval[!coord3_match_idx])
nres
}
### Each mutate works to transform the various data formats
### into a single format. The 'separate' commands then split
### the data into the appropriate columns. The action of each
### 'mutate' can be seen by progressively viewing the results
### (i.e. adding one 'mutate' command at a time).
data1 %>%
mutate(Coordinates_new = Coordinates) %>%
mutate(Coordinates_new = gsub('\\([^) ]+\\)', '', Coordinates_new)) %>%
mutate(Coordinates_new = gsub('(.*?)\\(((\\d{6})[ ](\\d{6}))\\).*', '\\3-\\4 \\1', Coordinates_new)) %>%
mutate(Coordinates_new = gsub('(\\d{2})(\\d{2})(\\d{2})', '\\1.\\2.\\3', Coordinates_new)) %>%
mutate(Coordinates_new = gsub('(\\S+)[\\s]+(.+)', '\\1|\\2', Coordinates_new, perl = TRUE)) %>%
separate(Coordinates_new, c('Coords', 'label'), fill = 'right', sep = '[|]', convert = TRUE) %>%
mutate(Coords = gsub('(\\d{2})(\\d{2})', '\\1.\\2', Coords)) %>%
separate(Coords, c('x.1', 'y.1', 'x.2', 'y.2'), fill = 'right', sep = '[-/]', convert = TRUE) %>%
mutate_at(vars(matches('^[xy][.]')), hms_convert) %>%
mutate_at(vars(matches('^[xy][.]')), function(x) ifelse(!is.na(x), -x, x))
## ID Coordinates x.1 y.1 x.2 y.2 label
## 1 1 3438-5150 -34.38000 -51.50000 NA NA <NA>
## 2 2 3346-5108/3352-5120 East island, South port -33.46000 -51.08000 -33.52000 -51.20000 East island, South port
## 3 3 East coast (284312 472254) -28.72000 -47.38167 NA NA East coast
## 4 4 28.39.97-47.05.62/29.09.13-47.44.03 -28.67694 -47.10056 -29.15361 -47.73417 <NA>
## 5 5 2843-4722/3359-5122(1H-2H-3H-4F) -28.43000 -47.22000 -33.59000 -51.22000 <NA>
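The degree/minute/second arithmetic used in hms_convert can be checked in isolation. This is a minimal base-R sketch; dms_to_decimal is a hypothetical helper name, not part of the answer above, and it assumes every input has exactly three dot-separated parts.

```r
# Convert "DD.MM.SS" strings to decimal degrees: DD + MM/60 + SS/3600
# (assumes each string has exactly three dot-separated numeric parts)
dms_to_decimal <- function(x) {
  parts <- strsplit(x, ".", fixed = TRUE)
  vapply(parts, function(p) sum(as.numeric(p) / c(1, 60, 3600)), numeric(1))
}

dec <- dms_to_decimal(c("28.39.97", "47.05.62"))
print(round(dec, 4))
```

For "28.39.97" this evaluates 28 + 39/60 + 97/3600, which agrees with the 28.6769 given in the question.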
We can use stringi. We insert a . into the 4-digit numbers with gsub, then use stri_extract_all_regex (from stringi) to extract two digits followed by a dot followed by two digits (\\d{2}\\.\\d{2}), which returns a list. As the list elements have unequal lengths, stri_list2matrix pads the shorter ones with NA at the end and converts the list to a matrix. We then convert the matrix to a data.frame, change the character columns to numeric, and cbind the result with the 'ID' column of the original dataset.
library(stringi)
d1 <- as.data.frame(stri_list2matrix(stri_extract_all_regex(gsub("(\\d{2})(\\d{2})",
"\\1.\\2", data1$Coordinates), "\\d{2}\\.\\d{2}"), byrow=TRUE), stringsAsFactors=FALSE)
d1[] <- lapply(d1, as.numeric)
colnames(d1) <- paste0(c("x.", "y."), rep(1:2,each = 2))
cbind(data1[1], d1)
# ID x.1 y.1 x.2 y.2
#1 1 34.38 51.50 NA NA
#2 2 33.46 51.08 33.52 51.20
#3 3 28.43 47.22 33.59 51.22
#4 4 28.39 47.05 29.09 47.44
But, this can also be done with base R.
#Create the dots for the 4-digit numbers
str1 <- gsub("(\\d{2})(\\d{2})", "\\1.\\2", data1$Coordinates)
#extract the numbers in a list with gregexpr/regmatches
lst <- regmatches(str1, gregexpr("\\d{2}\\.\\d{2}", str1))
#convert to numeric
lst <- lapply(lst, as.numeric)
#pad with NA's at the end and convert to data.frame
d1 <- do.call(rbind.data.frame, lapply(lst, `length<-`, max(lengths(lst))))
#change the column names
colnames(d1) <- paste0(c("x.", "y."), rep(1:2,each = 2))
#cbind with the first column of 'data1'
cbind(data1[1], d1)
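The padding step is the least obvious part of the base R version, so here is a minimal sketch of it on its own. The list lst below is a made-up stand-in for the extracted coordinates.

```r
# Two coordinate vectors of unequal length (stand-in for the extracted values)
lst <- list(c(34.38, 51.50), c(33.46, 51.08, 33.52, 51.20))

# `length<-` extends a vector to the requested length, filling with NA
padded <- lapply(lst, `length<-`, max(lengths(lst)))

# With equal lengths, the vectors can be stacked row by row
mat <- do.call(rbind, padded)
print(mat)
```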

Error: std::bad_alloc Rstudio

After running this code:
t1 <-Sys.time()
df.m <- left_join(df.h,daRta3,by=c("year","month","MA","day"))
t2 <- Sys.time()
difftime(t2,t1)
I get this error:
Error: std::bad_alloc
The result that I have tried to create would have 74495 * 2695, i.e. about 200 * 10^6, rows.
The computer on which I run the code has 20 GB of RAM.
I tried memory.limit(), but it did not solve my issue.
Examine the cardinality of your join key.
Is the combination c("year","month","MA","day") unique in both df.h and daRta3?
What are the most frequent values?
Also check for NA values: left_join can treat NA values as equal (the default) or as different:
> tibble(x = c(NA, NA, NA)) %>% left_join(., ., by = 'x')
# A tibble: 9 x 1
x
<lgl>
1 NA
2 NA
3 NA
4 NA
5 NA
6 NA
7 NA
8 NA
9 NA
> tibble(x = c(NA, NA, NA)) %>% left_join(., ., by = 'x', na_matches = 'never')
# A tibble: 3 x 1
x
<lgl>
1 NA
2 NA
3 NA
If the order and values in c("year","month","MA","day") can be guaranteed to be the same in both tables, then a simple cbind or bind_cols might be an efficient solution.
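The cardinality check can be sketched in base R by counting how often each key combination occurs on each side, which gives the number of rows a left join would produce without running it. The toy tables df_h_toy and da3_toy and the two-column key are invented for illustration.

```r
# Toy tables with a duplicated key on both sides (made-up data)
df_h_toy <- data.frame(year = c(2000, 2000, 2001), month = c(1, 1, 2), val_h = 1:3)
da3_toy  <- data.frame(year = c(2000, 2000, 2000, 2001), month = c(1, 1, 1, 2), val_d = 1:4)
keys <- c("year", "month")

# Collapse the key columns into one string per row and count occurrences
n_h <- table(do.call(paste, c(df_h_toy[keys], sep = "|")))
n_d <- table(do.call(paste, c(da3_toy[keys], sep = "|")))

# Each shared key contributes (count left) * (count right) rows;
# left-side keys with no partner contribute their own row count
common <- intersect(names(n_h), names(n_d))
matched_rows   <- sum(as.numeric(n_h[common]) * as.numeric(n_d[common]))
unmatched_rows <- sum(n_h[setdiff(names(n_h), common)])
expected_rows  <- matched_rows + unmatched_rows

print(sort(n_d, decreasing = TRUE))  # most frequent key values first
print(expected_rows)
```

If expected_rows is far larger than nrow of the left table, the key is not unique and the join multiplies rows.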

Extracting position of pattern in a string using ifelse in R

I have a set of strings x for example:
[1] "0000000000000000000000000000000000000Y" "9000000000D00000000000000000000Y"
[3] "0000000000000D00000000000000000000X" "000000000000000000D00000000000000000000Y"
[5] "000000000000000000D00000000000000000000Y" "000000000000000000D00000000000000000000Y"
[7] "000000000000000000000000D0000000011011D1X"
I want to extract the last position of a particular character like 1. I am running this code:
ifelse(grepl("1",x),rev(gregexpr("1",x)[[1]])[1],50)
But this is returning -1 for all elements. How do I correct this?
We can use stri_locate_last from stringi. If there are no matches, it will return NA.
library(stringi)
r1 <- stri_locate_last(v1, fixed=1)[,1]
r1
#[1] NA NA NA NA NA NA 40
nchar(v1)
#[1] 38 32 35 40 40 40 41
If we need to replace the NA values with the number of characters:
ifelse(is.na(r1), nchar(v1), r1)
data
v1 <- c("0000000000000000000000000000000000000Y",
"9000000000D00000000000000000000Y",
"0000000000000D00000000000000000000X",
"000000000000000000D00000000000000000000Y",
"000000000000000000D00000000000000000000Y",
"000000000000000000D00000000000000000000Y",
"000000000000000000000000D0000000011011D1X")
In base R, the following returns the position of the last matched "1".
# Make some toy data
toydata <- c("001", "007", "00101111Y", "000AAAYY")
# Find last position
last_pos <- sapply(gregexpr("1", toydata), function(m) m[length(m)])
print(last_pos)
#[1] 3 -1 8 -1
It returns -1 whenever the pattern is not matched.
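One likely reason the attempt in the question misbehaves is that gregexpr("1", x)[[1]] only looks at the matches for the first string, and ifelse then recycles that single value for every element. Here is a minimal sketch that handles each string separately and applies the fallback of 50 from the question; the vector x below is made-up toy data.

```r
# Toy strings: no "1", two "1"s, and a leading "1"
x <- c("0000Y", "0101X", "1000D")

# Take the last match position per string; gregexpr gives -1 when nothing matches
last_pos <- vapply(gregexpr("1", x, fixed = TRUE),
                   function(m) m[length(m)], integer(1))

# Replace the "no match" marker with the fallback value used in the question
last_pos[last_pos == -1L] <- 50L
print(last_pos)
```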

R using regexpr, with multiple pattern

I would like to find the string just after some patterns. My code seems to work, but I cannot finish the job.
Here is an illustration:
pattern <- c("Iligan", "Cabeseria 25|Sta. Lucia", "Capitol", "Osmeña",
"Nowhere", "Aglayan")
# I want to match the string just after each pattern. For example I'm going to
# match City just after Iligan.
target <-c("Iligan City", "Sta. Lucia, Ozamiz City", " Oroquieta City",
"Osmeña St. Dipolog City", "Lucia St., Zamboanga City",
"Aglayan str, Oroquieta City", "Gingoog City", "Capitol br., Ozamiz City",
"Dumaguete City", "Poblacion, Misamis")
#The matches seem to work fine
(matches <- sapply(pattern,FUN=function(x){regexpr(paste0("(?<=\\b",x,"\\b ",")","[\\w-*\\.]*"),target,perl=T)}))
print (matches)
#But I cannot get the results. I would need use the column of each matrix
#at a time
villain <- lapply(matches,FUN = function(x)(regmatches(target,x)))
Do you have a solution to this problem?
Update 1
For the sake of being precise here is the desired output.
results <- c("City", "St.", "br.")
#[1] "City" "St." "br."
There are some helpers in the stringr package that can simplify the process:
pattern <- c("Iligan", "Cabeseria 25|Sta. Lucia", "Capitol", "Osmeña",
"Nowhere", "Aglayan")
target <-c("Iligan City", "Sta. Lucia, Ozamiz City", " Oroquieta City",
"Osmeña St. Dipolog City", "Lucia St., Zamboanga City",
"Aglayan str, Oroquieta City", "Gingoog City", "Capitol br., Ozamiz City",
"Dumaguete City", "Poblacion, Misamis")
library(stringr)
# perl() comes from older stringr releases; newer releases use regex() instead
matchPat <- function(x) {
  unlist(str_extract(target, perl(paste0("(?<=\\b", x, "\\b ",")","[\\w-*\\.]*"))))
}
matches <- sapply(pattern, matchPat)
print(matches)
## Iligan Cabeseria 25|Sta. Lucia Capitol Osmeña Nowhere Aglayan
## [1,] "City" NA NA NA NA NA
## [2,] NA NA NA NA NA NA
## [3,] NA NA NA NA NA NA
## [4,] NA NA NA "St." NA NA
## [5,] NA NA NA NA NA NA
## [6,] NA NA NA NA NA "str"
## [7,] NA NA NA NA NA NA
## [8,] NA NA "br." NA NA NA
## [9,] NA NA NA NA NA NA
## [10,] NA NA NA NA NA NA
This can be simplified further if you don't need indicators for non-matches, but no sample/expected output was provided.
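If only the matched strings are needed, one possible base-R sketch uses regexpr together with regmatches, which drops non-matches by itself. The shortened pattern and target vectors, and the name after, are chosen for illustration; the character class is written as [\\w.*-] with the hyphen last.

```r
# Shortened versions of the question's vectors (ASCII only, for illustration)
pattern <- c("Iligan", "Capitol", "Nowhere")
target  <- c("Iligan City", "Capitol br., Ozamiz City", "Gingoog City")

# For each pattern, keep only the text that directly follows "<pattern> "
after <- unlist(lapply(pattern, function(p) {
  rx <- paste0("(?<=\\b", p, "\\b )[\\w.*-]*")
  # regmatches() returns matches only, so non-matching targets vanish
  regmatches(target, regexpr(rx, target, perl = TRUE))
}))
print(after)
```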

Question regarding llply or lapply - applying functions to data.frames in a list

Dear R user community,
I have many data.frames in a list, as follows (only one data.frame in the list of 21 shown for convenience):
> str(datal)
List of 21
$ BallitoRaw.DAT :'data.frame': 1083 obs. of 3 variables:
..$ Filename: Factor w/ 21 levels "BallitoRaw.DAT",..: 1 1 1 1 1 1 1 1 1 1 ...
..$ date :Class 'Date' num [1:1083] 7318 7319 7320 7321 7322 ...
..$ temp : num [1:1083] NA 25.8 NA NA NA NA NA NA NA 24.4 ...
If I work on each data.frame in the list individually I can create a zoo object from temp and date, as such:
> BallitoRaw.zoo <- zoo(datal$BallitoRaw.DAT$temp, datal$BallitoRaw.DAT$date)
The zoo object looks like this:
> head(BallitoRaw.zoo)
1990-01-14 1990-01-15 1990-01-16 1990-01-17 1990-01-18 1990-01-19
NA 25.8 NA NA NA NA
How do I use llply or apply (or similar) to work on the whole list at once?
The output needs to go into a new list of data.frames, or a series of independent data.frames (each one named as in the zoo example above). Note that the date column, although a regular time series (days), contains missing dates (in addition to NAs for temps of existing dates); the missing dates will be filled by the zoo function. The output data.frame with the zoo object will thus be longer than the original one.
Help kindly appreciated.
library(zoo)
# Build a zoo series (temp indexed by date) from one data.frame
makeNamedZoo <- function(dfrm) {
  dfrmname <- deparse(substitute(dfrm))
  zooname <- dfrmname
  assign(zooname, zoo(dfrm$temp, dfrm$date))
  return(get(zooname))
}
ListOfZoos <- lapply(dflist, makeNamedZoo)
names(ListOfZoos) <- paste( sub("DAT$", "", names(dflist) ), "zoo", sep="")
Here is a simple test case:
df1 <- data.frame(a= letters[1:10], date=as.Date("2011-01-01")+0:9, temp=rnorm(10) )
df2 <- data.frame(a= letters[1:10], date=as.Date("2011-01-01")+0:9, temp=rnorm(10) )
dflist <- list(dfone.DAT=df1,dftwo.DAT=df2)
ListOfZoos <- lapply(dflist, makeNamedZoo)
names(ListOfZoos) <- paste( sub("DAT$", "", names(dflist) ), "zoo", sep="")
$dfone.zoo
2011-01-01 2011-01-02 2011-01-03 2011-01-04 2011-01-05 2011-01-06 2011-01-07
0.7869056 1.6523928 -1.1131432 1.2261783 1.1843587 0.2673762 -0.4159968
2011-01-08 2011-01-09 2011-01-10
-1.2686391 -0.4135859 -1.4916291
$dftwo.zoo
2011-01-01 2011-01-02 2011-01-03 2011-01-04 2011-01-05 2011-01-06 2011-01-07
0.7356612 -0.1263861 -1.6901240 -0.6441732 -1.4675871 2.3006544 1.0263354
2011-01-08 2011-01-09 2011-01-10
-0.8577544 0.6079986 0.6625564
This is an easier way to achieve what I needed:
tozoo <- function(x) zoo(x$temp, x$date)
data1.zoo <- do.call(merge, lapply(split(data1, data1$Filename), tozoo))
The result is a nice zoo object.
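If the gaps in the date column need to be made explicit before building the series, one possible base-R sketch is to merge against a complete daily sequence. The data frame df_toy and its values are made up, and this is an illustration of the idea, not what the code above does.

```r
# Toy series with two missing days (1990-01-16 and 1990-01-17)
df_toy <- data.frame(
  date = as.Date(c("1990-01-14", "1990-01-15", "1990-01-18")),
  temp = c(NA, 25.8, 24.4)
)

# Complete daily calendar spanning the observed range
all_days <- data.frame(date = seq(min(df_toy$date), max(df_toy$date), by = "day"))

# Left join onto the calendar: missing days get temp = NA
filled <- merge(all_days, df_toy, by = "date", all.x = TRUE)
print(filled)
```

The same function could be applied to every element of the list with lapply.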