Possible to change the record delimiter in R?

Is it possible to manipulate the record/observation/row delimiter when reading in data (i.e. read.table) from a text file? It's straightforward to adjust the field delimiter using the sep argument, but I haven't found a way to change the record delimiter from the default end-of-line character.
I am trying to read in pipe delimited text files in which many of the entries are long strings that include carriage returns. R treats these CRs as end-of-line, which begins a new row incorrectly and screws up the number of records and field order.
I would like to use a different delimiter instead of a CR. As it turns out, each row begins with the same string, so if I could use something like \nString to identify true end-of-line, the table would import correctly. Here's a simplified example (comma-separated here) of what one of the text files might look like.
V1,V2,V3,V4
String,A,5,some text
String,B,2,more text and
more text
String,B,7,some different text
String,A,,
Should read into R as
V1 V2 V3 V4
String A 5 some text
String B 2 more text and more text
String B 7 some different text
String A N/A N/A
I can open the files in a text editor and clean them with a find/replace before reading in, but a systematic solution within R would be great. Thanks for your help.

We can read the lines in and collapse the continuation lines afterwards. g takes the value 0 for the header, 1 for the first record (and for any follow-on lines that belong to it), and so on. tapply then collapses the lines according to g, giving L2, and finally we re-read the collapsed text:
Lines <- "V1,V2,V3,V4
String,A,5,some text
String,B,2,more text and
more text
String,B,7,some different text
String,A,,"
L <- readLines(textConnection(Lines))
g <- cumsum(grepl("^String", L))          # 0 for the header, then the record number
L2 <- tapply(L, g, paste, collapse = " ") # join each record's continuation lines
DF <- read.csv(text = L2, as.is = TRUE)
DF$V4[DF$V4 == ""] <- NA
This gives:
> DF
V1 V2 V3 V4
1 String A 5 some text
2 String B 2 more text and more text
3 String B 7 some different text
4 String A NA <NA>

If you're on Linux or a Mac, you may be better off using a command-line tool such as sed. Here are two slightly different approaches:
# keep the \n
read.csv(pipe('sed \'N; s/\\([^,]*\\)\\n\\([^,]*$\\)/"\\1\\n\\2"/\' test.txt'))
# V1 V2 V3 V4
#1 String A 5 some text
#2 String B 2 more text and\nmore text
#3 String B 7 some different text
#4 String A NA
# get rid of the \n and replace with a space
read.csv(pipe('sed \'N; s/\\([^,]*\\)\\n\\([^,]*$\\)/\\1 \\2/\' test.txt'))
# V1 V2 V3 V4
#1 String A 5 some text
#2 String B 2 more text and more text
#3 String B 7 some different text
#4 String A NA
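For a pure-R variant of the same idea, here is a minimal sketch that implements the \nString end-of-line marker the question asks about (test.txt stands in for one of the real files, as in the sed commands above): rebuild the file as one string and remove every newline that is not followed by "String".
txt <- readLines("test.txt")
# drop any newline NOT followed by "String" (i.e. a mid-record line break)
joined <- gsub("\n(?!String)", " ", paste(txt, collapse = "\n"), perl = TRUE)
read.csv(text = joined, as.is = TRUE)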

Related

R: import dirty CSV with readLines and use regular expressions to convert it to a data frame

I'm trying to use Flodel's answer to "extra commas in csv causing problems" in order to import some messy CSV data, but I'm having trouble implementing the solution.
When I have more than three columns, I don't know how to get the text and the extra comma into my desired column. I'm pretty sure the problem is in my pattern; I just don't know how to fix it.
file <- textConnection("123, hi, NAME1, EMAIL1#ADDRESS.COM
111, hi, NAME2, EMAIL2#ADRESS.ME
699, hi, FIRST M. LAST, Jr., EMAIL4#ADDRESS.GOV")
lines <- readLines(file)
pattern <- "^(\\d+), (.*), (.*), \\b(.*)$"
matches <- regexec(pattern, lines)
bad.rows <- which(sapply(matches, length) == 1L)
if (length(bad.rows) > 0L) stop(paste("bad row: ", lines[bad.rows]))
data <- regmatches(lines, matches)
as.data.frame(matrix(unlist(data), ncol = 5L, byrow = TRUE)[, -1L])
which gives me
V1 V2 V3 V4
123 hi NAME1 EMAIL1#ADDRESS.COM
111 hi NAME2 EMAIL2#ADRESS.ME
699 hi, FIRST M. LAST Jr. EMAIL4#ADDRESS.GOV
I'd like to see:
V1 V2 V3 V4
123 hi NAME1 EMAIL1#ADDRESS.COM
111 hi NAME2 EMAIL2#ADRESS.ME
699 hi FIRST M. LAST, Jr. EMAIL4#ADDRESS.GOV
If you're more explicit about what you want to match on, you might get better results. If column two will only ever contain a single string with no comma in it, you can use:
pattern <- "^(\\d+), ([^,]+), (.*), \\b(.*)$"
In my experience, the best approach is to make your regular expression as explicit as you can first, and then generalize when that stops working. For example, if the second string is always hi, include that in your regex:
pattern <- "^(\\d+), (hi), (.*), \\b(.*)$"

Replace String B with String C if it contains (but not exactly matches) String A

I have a data frame match_df which holds "matching rules": a value matching the column old should be replaced with the corresponding entry of the column new in the data frames the rules are applied to.
old <- c("10000","20000","300ZZ","40000")
new <- c("Name1","Name2","Name3","Name4")
match_df <- data.frame(old,new)
old new
1 10000 Name1
2 20000 Name2
3 300ZZ Name3 # watch the letters
4 40000 Name4
I want to apply the matching rules above on a data frame working_df
id <- c(1,2,3,4)
value <- c("xyz-10000","20000","300ZZ-230002112","40")
working_df <- data.frame(id,value)
id value
1 1 xyz-10000
2 2 20000
3 3 300ZZ-230002112
4 4 40
My desired result is
# result
id value
1 1 Name1
2 2 Name2
3 3 Name3
4 4 40
This means that I am not looking for an exact match. Rather, I'd like to replace the whole string working_df$value whenever it contains the string in match_df$old.
I like the solution posted in R: replace characters using gsub, how to create a function?, but it works only for exact matches. I experimented with gsub and str_replace_all from stringr, but I couldn't find a solution that works for me. There are many solutions for exact matches on Stack Overflow, but I couldn't find a comprehensible one for this problem.
Any help is highly appreciated.
I'm not sure this is the most elegant/efficient way of doing it but you could try something like this:
working_df$value <- sapply(working_df$value, function(y) {
  idx <- which(sapply(match_df$old, function(x) grepl(x, y)))[1]
  if (is.na(idx)) idx <- 0
  ifelse(idx > 0, as.character(match_df$new[idx]), as.character(y))
})
It uses grepl to check, for each value of working_df, whether any row of match_df partially matches it, and gets the index of that row. If more than one row matches, it takes the first one.
You need the grep function. This will return the indices of a vector that match a pattern (any pattern, not necessarily a full string match). For instance, this will tell you which of your "old" values match the "10000" pattern:
grep(match_df[1,1], working_df$value)
Once you have that information, you can look up the corresponding "new" value for that pattern, and replace it on the matching rows.
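For example, a minimal sketch of that lookup-and-replace loop (my completion of the idea, not necessarily the answerer's exact code):
working_df$value <- as.character(working_df$value)  # drop factors first
for (i in seq_len(nrow(match_df))) {
  # rows of working_df whose value contains the i-th "old" pattern
  hits <- grep(as.character(match_df$old[i]), working_df$value)
  working_df$value[hits] <- as.character(match_df$new[i])
}
working_df
##   id value
## 1  1 Name1
## 2  2 Name2
## 3  3 Name3
## 4  4    40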
Here are 2 approaches using Map + <<- and a for loop:
working_df[["value2"]] <- as.character(working_df[["value"]])
Map(function(x, y){working_df[["value2"]][grepl(x, working_df[["value2"]])] <<- y}, old, new)
working_df
## id value value2
## 1 1 xyz-10000 Name1
## 2 2 20000 Name2
## 3 3 300ZZ-230002112 Name3
## 4 4 40 40
## or...
working_df[["value2"]] <- as.character(working_df[["value"]])
for (i in seq_along(old)) {  # loop over the rules, not the rows of working_df
  working_df[["value2"]][grepl(old[i], working_df[["value2"]])] <- new[i]
}
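If you'd rather avoid the <<- in the Map version, the same update can be phrased with Reduce (a sketch, using the same old/new vectors as above):
working_df[["value2"]] <- Reduce(
  function(v, i) { v[grepl(old[i], v)] <- new[i]; v },  # apply rule i to the vector
  seq_along(old),
  as.character(working_df[["value"]])                   # starting value
)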

regex variable substitution in "replacement" argument

I have a string in R. I want to find part of the string and append a variable number of zeroes. For example, I have 1 2 3. Sometimes I want it to be 1 20 3; sometimes I want it to be 1 2000 3. If I store the number of appended zeroes in a variable, how can I use it in the "replacement" part of a sub command?
I have in mind code like this:
s <- '1 2 3'
z <- '3'
sub('(\\s\\d)(\\s.*)', '\\10{z}\\2', s)
This code returns 1 20{z} 3. But I want 1 2000 3. How can I get this sort of result?
One way is
s <- '1 2 3'
z <- '3'
zx <- paste(rep(0, as.numeric(z)), collapse = '')  # as.numeric since z is a character
sub('(\\s\\d)(\\s.*)', paste0('\\1', zx, '\\2'), s)
but this is a little clunky.
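A slightly less clunky way to build zx is strrep() (available in base R from 3.3.0; again z is a character here, hence the as.numeric()):
zx <- strrep('0', as.numeric(z))
sub('(\\s\\d)(\\s.*)', paste0('\\1', zx, '\\2'), s)
## [1] "1 2000 3"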
Try the concatenation operator from the stringi package:
require(stringi)
"abc"%stri+%"123abc"
## [1] "abc123abc"
Your approach of creating the replacement string zx is pretty good. However, you can improve your sub command. If you use lookbehind and lookahead instead of capturing groups, you don't need to build a new replacement string; you can use zx directly.
sub("(?<=\\s\\d)(?=\\s)", zx, s, perl = TRUE)
# [1] "1 2000 3"

R: removing the last three dots from a string

I have a text data file that I likely will read with readLines. The initial portion of each string contains a lot of gibberish followed by the data I need. The gibberish and the data are usually separated by three dots. I would like to split the strings after the last three dots, or replace the last three dots with a marker of some sort telling R to treat everything to the left of those three dots as one column.
Here is a similar post on Stackoverflow that will locate the last dot:
R: Find the last dot in a string
However, in my case some of the data have decimals, so locating the last dot will not suffice. Also, I think ... has a special meaning in R, which might be complicating the issue. Another potential complication is that some of the dots are bigger than others. Also, in some lines one of the three dots was replaced with a comma.
In addition to gregexpr in the post above I have tried using gsub, but cannot figure out the solution.
Here is an example data set and the outcome I hope to achieve:
aa = matrix(c(
'first string of junk... 0.2 0 1',
'next string ........2 0 2',
'%%%... ! 1959 ... 0 3 3',
'year .. 2 .,. 7 6 5',
'this_string is . not fine .•. 4 2 3'),
nrow=5, byrow=TRUE,
dimnames = list(NULL, c("C1")))
aa <- as.data.frame(aa, stringsAsFactors=F)
aa
# desired result
# C1 C2 C3 C4
# 1 first string of junk 0.2 0 1
# 2 next string ..... 2 0 2
# 3 %%%... ! 1959 0 3 3
# 4 year .. 2 7 6 5
# 5 this_string is . not fine 4 2 3
I hope this question is not considered too specific. The text data file was created using the steps outlined in my post from yesterday about reading an MSWord file in R.
Some of the lines do not contain gibberish or three dots, but only data. However, that might be a complication for a follow up post.
Thank you for any advice.
This does the trick, though it's not especially elegant...
options(stringsAsFactors = FALSE)
# Search for three consecutive delimiter characters, then capture
# everything after them (the captured part is referenced in the
# replacement as \\1)
nums <- gsub("^.*[.,•]{3}\\s*(.*)", "\\1", aa$C1)
# Use strsplit to break the results apart at spaces, then rbind the
# pieces into a matrix with one row per original string
num.mat <- do.call(rbind, strsplit(nums, split = " "))
# Mash it back together with your original strings
result <- as.data.frame(cbind(aa, num.mat))
# Give it informative names
names(result) <- c("original.string", "num1", "num2", "num3")
This will get you most of the way there, and it will have no problems with numbers that include commas:
# First, use a regex to eliminate the bad pattern. This regex
# eliminates any three-character combination of periods, commas,
# and big dots (•), so long as the combination is followed by
# 0-2 spaces and then a digit.
aa.sub <- as.matrix(
  apply(aa, 1, function(x)
    gsub('[•.,]{3}(\\s{0,2}\\d)', '\\1', x, perl = TRUE)))
# Second: it looks as though you want your data split into columns.
# So this regex splits on spaces that are (a) preceded by a letter,
# digit, or space, and (b) followed by a digit. The result is a
# list, each element of which is a list containing the parts of
# one of the strings in aa.
aa.list <- apply(aa.sub, 1, function(x)
  strsplit(x, '(?<=[\\w\\d\\s])\\s(?=\\d)', perl = TRUE))
# Remove the second element of aa.list. There is no space before the
# first data column in this string. As a result, strsplit() split
# it into three columns, not 4. That in turn throws off the code
# below.
aa.list <- aa.list[-2]
# Make the data frame.
aa.list <- lapply(aa.list, unlist) # convert list of lists to list of vectors
aa.df <- data.frame(aa.list)
aa.df <- data.frame(t(aa.df), row.names = NULL, stringsAsFactors = FALSE)
The only thing remaining is to modify the regex for strsplit() so that it can handle the second string in aa. Or perhaps it's better just to handle cases like that manually.
Reverse the string.
Reverse the pattern you're searching for, if necessary (it isn't in your case).
Reverse the result.
[haiku-pseudocode]
a = 'first string of junk... 0.2 0 1' // string to search
b = 'junk' // pattern to match
ra = reverseString(a) // now equals '1 0 2.0 ...knuj fo gnirts tsrif'
rb = reverseString (b) // now equals 'knuj'
// run your regular expression search / replace - search in 'ra' for 'rb'
// put the result in rResult
// and then unreverse the result
// apologies for not knowing the syntax for 'R' regex
[/haiku-pseudocode]
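A rough R rendering of this idea (a sketch; str_rev() is a helper defined here, while stringi::stri_reverse() would do the same job):
str_rev <- function(s) {
  sapply(strsplit(s, NULL), function(ch) paste(rev(ch), collapse = ""))
}
a  <- 'first string of junk... 0.2 0 1'
ra <- str_rev(a)  # "1 0 2.0 ...knuj fo gnirts tsrif"
# sub() replaces only the first match, which in the reversed string
# corresponds to the LAST run of three dots in the original
str_rev(sub('\\.{3}', '|', ra))
## [1] "first string of junk| 0.2 0 1"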

C#: split text file when the leading line number changes

I'm trying to split a text file by its leading line numbers.
For example, given a text file like:
1 ljhgk uygk uygghl \r\n
1 ljhg kjhg kjhg kjh gkj \r\n
1 kjhl kjhl kjhlkjhkjhlkjhlkjhl \r\n
2 ljkih lkjhl kjhlkjhlkjhlkjhl \r\n
2 lkjh lkjh lkjhljkhl \r\n
3 asdfghjkl \r\n
3 qweryuiop \r\n
I want to split it into 3 parts (1, 2, 3).
How can I do this? The text is very large (~20,000,000 characters), so I need an efficient approach (such as a regex).
Another idea: you can use LINQ to get the groups you're after by splitting on each line's first word. Note that this groups by whatever the first word is, so make sure you only have numbers there. This uses the split/join antipattern, but it seems to work nicely here.
var lines = from line in s.Split("\r\n".ToCharArray(),
                                 StringSplitOptions.RemoveEmptyEntries)
            let lineNumber = line.Split(" ".ToCharArray(), 2).FirstOrDefault()
            group line by lineNumber into g
            select String.Join("\n", g);
Notes:
GroupBy is guaranteed to return lines in the order they appeared (groups are yielded in the order their keys first occur).
If a block appears more than once (e.g. "1 1 2 2 3 3 1"), all blocks with the same number will be merged.
You can use a regex, but Split will not work too well. You can Match for the following pattern:
^(\d).*$ # Match first line, capture number
([\r\n]+^\1.*$)* # Match additional lines that begin with the same number
I did try to split by $(?<=^(\d+).*)[\r\n]+^(?!\1), but it adds the line numbers as additional elements in the array.