I am trying to write a program with regular expressions to clean up some data. Let's say I have room names with a letter and a number. In the final output I need to output the room names using the pattern "the full string (excluding letter & number) + letter + number" as in the examples below. However, with the regular expressions I've written so far, I get very messed up results, which are at the bottom of my message. For some reason, it puts letters and characters on some of the rows, even though there may be none in the input data. Thank you.
EDITED: I made edits to the input data. I would like to generalize the code to take any number of character strings, not just the single word "ROOM".
# the pattern should be "the full string (excluding letter & number) + letter + number". For example:
ATLANTA ROOM
ATLANTA ROOM 3
NEW YORK ROOM A 2
ROOM A 4
THE BIG AWESOME ROOM B
ROOM B 4
GEORGETOWN ROOM B 2
NEW YORK ROOM C 2
NEW YORK ROOM C
LOS ANGELES ROOM E 2
# program to clean with regular expressions. there could be multiple spaces between words
dd <- c("ATLANTA ROOM ",
" ATLANTA ROOM 3",
"NEW YORK A ROOM 2",
"4 ROOM A",
"THE BIG AWESOME ROOM B",
" ROOM 4 B",
"GEORGETOWN B 2 ROOM ",
" C NEW YORK ROOM 2",
"NEW YORK ROOM C",
"LOS ANGELES ROOM 2 E")
m_char_num <- regexpr("(\\<A|B|C|D|E|1|2|3|4\\>)", dd)
m_char <- regexpr("(\\<A|B|C|D|E\\>)", dd)
m_num <- regexpr("(\\<1|2|3|4\\>)", dd)
(dd2 <- paste(gsub("( +)", " ",
gsub("(^ +)|( +$)", "",
gsub("(\\<A|B|C|D|E|1|2|3|4\\>)", "", dd))),
regmatches(dd, m_char), regmatches(dd, m_num), sep = " "))
# actual output from the program
"TLANTA ROOMA3",
"TLANTA ROOMA2",
"NW YORK ROOMA4",
"ROOMA4",
"TH IG WSOM ROOME2",
"ROOMB2",
"GORGTOWN ROOMB2",
"NW YORK ROOMC3",
"NW YORK ROOMC2",
"LOS NGLS ROOMA4"
Here's an attempt:
sub(' $', '', # clean up spaces at the end
gsub(' +', ' ', # clean up double spaces
# rearrange letter and numbers
sub('^([A-Z]?)([0-9]*)([A-Z]?)$', 'ROOM \\1\\3 \\2',
gsub(' |ROOM', '', dd) # remove spaces and ROOM
)
)
)
#[1] "ROOM" "ROOM 3" "ROOM A 2" "ROOM A 4" "ROOM B" "ROOM B 4" "ROOM B 2"
#[8] "ROOM C 2" "ROOM C" "ROOM E 2"
And here's the same logic for the edited OP and comment below (assuming room names are words that have at least 3 letters and at most a 2-letter room designation):
gsub('(^ | $)', '', # clean up spaces in front or end
gsub(' +', ' ', # clean up double spaces
# extract room name and put it in front of the letter and number
paste(gsub('\\b([A-Z][A-Z]?|[0-9]+)\\b', '', dd, perl = TRUE),
sub('^([A-Z]+)?([0-9]*)([A-Z]+)?$', '\\1\\3 \\2',
gsub(' |\\w\\w\\w+', '', dd) # remove spaces and words
)
)
)
)
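The same idea can be sketched in separate steps, assuming (as above) that a room designation is a standalone word of one or two letters and a room number is a standalone run of digits. The helper name collapse_matches is hypothetical, and the input is a shortened copy of dd:

```r
dd <- c("ATLANTA ROOM ", "NEW YORK A ROOM 2", " ROOM 4 B", "GEORGETOWN B 2 ROOM ")

# Hypothetical helper: all matches of `pattern` in each element, pasted together.
# Elements without a match yield "" instead of being dropped.
collapse_matches <- function(pattern, x) {
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  vapply(hits, paste, character(1), collapse = "")
}

room_letter <- collapse_matches("\\b[A-Z]{1,2}\\b", dd)  # standalone 1-2 letter word
room_number <- collapse_matches("\\b[0-9]+\\b", dd)      # standalone digit run

# Room name: the original string with those tokens removed.
room_name <- gsub("\\b([A-Z]{1,2}|[0-9]+)\\b", "", dd, perl = TRUE)

# Reassemble, then squeeze repeated spaces and trim both ends.
rebuilt <- trimws(gsub(" +", " ", paste(room_name, room_letter, room_number)))
print(rebuilt)
# [1] "ATLANTA ROOM"        "NEW YORK ROOM A 2"   "ROOM B 4"            "GEORGETOWN ROOM B 2"
```

Because every helper call returns one string per input, paste() never has to recycle anything.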
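What's happening is that regmatches() returns only the strings that actually matched. For example, your program finds only 8 letters for 10 inputs, so instead of inserting "" or NA for the rows without a letter, paste() recycles the shorter vector.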
Here is a fix:
m_char_num <- regexpr("(\\<A|B|C|D|E|1|2|3|4\\>)", dd)
m_char <- regexpr("(\\<A|B|C|D|E\\>)", dd)
m_num <- regexpr("(\\<1|2|3|4\\>)", dd)
numbers <- rep("", length(dd))
numbers[m_num>0] <- regmatches(dd, m_num)
letters <- rep("", length(dd))
letters[m_char>0] <- regmatches(dd, m_char)
output <- gsub(" +", " ", trimws(paste("ROOM", letters, numbers))) # trim the ends, squeeze double spaces
[1] "ROOM" "ROOM 3" "ROOM A 2" "ROOM A 4" "ROOM B" "ROOM B 4" "ROOM B 2" "ROOM C 2" "ROOM C"
[10] "ROOM E 2"
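As a self-contained illustration of the padding step, here is a minimal sketch on a shortened input. The helper name first_match and the grouped patterns \\b[A-E]\\b and \\b[1-4]\\b are assumptions made for this example, not part of the code above:

```r
dd <- c("ROOM ", " ROOM 3", "A ROOM 2", "4 ROOM A")

# Hypothetical helper: first match of `pattern` per element, "" where nothing matches.
first_match <- function(pattern, x) {
  m <- regexpr(pattern, x, perl = TRUE)
  out <- rep("", length(x))
  out[m > 0] <- regmatches(x, m)  # regmatches() returns matches for matched elements only
  out
}

room_letter <- first_match("\\b[A-E]\\b", dd)
room_number <- first_match("\\b[1-4]\\b", dd)

# All three vectors have length(dd) elements, so paste() does not recycle.
cleaned <- gsub(" +", " ", trimws(paste("ROOM", room_letter, room_number)))
print(cleaned)
# [1] "ROOM"     "ROOM 3"   "ROOM A 2" "ROOM A 4"
```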
Try this:
library(gsubfn)
# extract numbers (num) and room letters (char)
num <- sapply(strapplyc(dd, "\\d|$"), paste, collapse = "")
char <- sapply(strapplyc(dd, "[A-D]|$"), paste, collapse = "")
# put back together and sort
out <- sort(paste("ROOM", char, num))
# trim spaces (optional)
out <- gsub(" +", " ", sub(" *$", "", out))
> out
[1] "ROOM" "ROOM 2" "ROOM 3" "ROOM A 2" "ROOM A 4" "ROOM B"
[7] "ROOM B 2" "ROOM B 4" "ROOM C" "ROOM C 2"
I need to get some columnized text into Ruby arrays. The entries are company names, phone numbers and websites. I've obscured the actual data in order to focus on the parsing as opposed to the nature of the data, which I can deal with.
Here is the Gist
As you can see, the nature of the columnar data changes, including:
leading whitespace width changes, from 0 to ~8
some lines are "" or \s+{3,}
column width changes depending on which block it's in (see how line 31 changes from 27)
therefore reliance upon using widths becomes problematic
some lines show empty entries in columns
empty column 1 on line 4 (example)
empty column 2 on line 2 (example)
empty column 3 on line 3 (example)
I want to organize this into col1, col2 and col3 as arrays of entries. I can split them later on /\s*/ and choose the first element.
Given the obvious structure of these three columns, I'm thinking there is a pragmatic way of parsing these columns out into arrays of entries, one per line.
Does anybody have any insight into how to parse out the columns? Columns -> arrays col1, col2, col3 is the format which I seek.
Any advice/insight appreciated.
Let's suppose we gulp the file into a string, using IO::read, where the string is as follows.
str=<<~END
aaa bb cccc aaaaaaa aaaa bbb
aaaaaaaa aaaaaaaaa
aaaaa aaaaa bbbb
aaaaa bb cc aaaaaaa
aaa bbb aaaaaa bbb aaaaa bbbbbb
aaaa aaaaaaaaaaaa
aaaaaaaaa
a bb aaaaaaaaa
END
The first step is to divide the string into (two) blocks, which we can do as follows:
a1 = str.split(/\n{2,}/)
#=> ["aaa bb cccc aaaaaaa aaaa bbb\n aaaaaaaa aaaaaaaaa\n aaaaa aaaaa bbbb\naaaaa bb cc aaaaaaa",
# "aaa bbb aaaaaa bbb aaaaa bbbbbb\n aaaa aaaaaaaaaaaa\n aaaaaaaaa\n a bb aaaaaaaaa\n"]
Next, convert each of the two blocks to an array of lines.
a2 = a1.map { |s| s.chomp.split(/\n/) }
#=> [["aaa bb cccc aaaaaaa aaaa bbb",
# " aaaaaaaa aaaaaaaaa",
# " aaaaa aaaaa bbbb",
# "aaaaa bb cc aaaaaaa"],
# ["aaa bbb aaaaaa bbb aaaaa bbbbbb",
# " aaaa aaaaaaaaaaaa",
# " aaaaaaaaa",
# " a bb aaaaaaaaa"]]
We now need to map each line of each group in a2 (a string) to an array whose elements correspond to the columns of the original text.
a3 = a2.flat_map do |group|
indent = group.map { |line| line =~ /\S/ }.min
mx_len = group.map(&:length).max
break_cols = (indent..mx_len-1).each_with_object([]) do |i,cols|
cols << i if group.all? { |line| [" ", nil].include?(line[i]) }
end
b1, b2 = [break_cols.first, break_cols.last]
group.map { |line| [line[0..b1-1], line[b1..b2-1], line[b2..-1]] }
end
#=> [["aaa bb cccc", " aaaaaaa ", " aaaa bbb"],
# [" aaaaaaaa ", " ", " aaaaaaaaa"],
# [" aaaaa ", " ", " aaaaa bbbb"],
# ["aaaaa bb cc", " aaaaaaa", nil],
# ["aaa bbb", " aaaaaa bbb ", " aaaaa bbbbbb"],
# [" aaaa ", " ", " aaaaaaaaaaaa"],
# [" ", " aaaaaaaaa", nil],
# [" a bb ", " ", " aaaaaaaaa"]]
line =~ /\S/ returns the index of the first character of line that is not whitespace (\S matches any non-whitespace character in regular expressions).
See Enumerable#flat_map.
The following intermediate values were obtained in the calculation of a3.
For group 1:
mx_len = 37
indent = 0
break_cols = [11, 12, 13, 14, 23, 24, 25]
b1 = 11
b2 = 25
For group 2:
mx_len = 38
indent = 0
break_cols = [7, 8, 9, 20, 21, 22]
b1 = 7
b2 = 22
All that remains is to convert nil's to empty strings, strip spaces from the ends of each string and transpose the array.
a3.map { |col| col.map { |s| s.to_s.strip } }.transpose
#=> [["aaa bb cccc", "aaaaaaaa", "aaaaa", "aaaaa bb cc",
# "aaa bbb", "aaaa", "", "a bb"],
# ["aaaaaaa", "", "", "aaaaaaa", "aaaaaa bbb", "",
# "aaaaaaaaa", ""],
# ["aaaa bbb", "aaaaaaaaa", "aaaaa bbbb", "",
# "aaaaa bbbbbb", "aaaaaaaaaaaa", "", "aaaaaaaaa"]]
If desired, we could of course chain the above operations.
str.split(/\n{2,}/).
map { |s| s.chomp.split(/\n/) }.
flat_map do |group|
indent = group.map { |line| line =~ /\S/ }.min
mx_len = group.map(&:length).max
break_cols = (indent..mx_len-1).each_with_object([]) do |i,cols|
cols << i if group.all? { |line| [" ", nil].include?(line[i]) }
end
b1, b2 = [break_cols.first, break_cols.last]
group.map { |line| [line[0..b1-1], line[b1..b2-1], line[b2..-1]] }
end.map { |col| col.map { |s| s.to_s.strip } }.transpose
As Cary has demonstrated, working with widths was painful. That's what tripped me up. I took a new approach: String#gsub(/\s{2,44}/, '•') replaces each run of 2 to 44 whitespace characters with a delimiter, so the column structure is preserved while the lines become easy to split:
require 'awesome_print' # provides ap

col1, col2, col3 = [], [], []
master_data = []
lines = File.read(s).split("\n") # s is the path to the input file
lines.each do |line|
  next if line.strip.empty?
  nline = line.gsub(/\s{2,44}/, '•')
  nline[0] = '' if nline.start_with?('•') # drop the delimiter left by leading whitespace
  nline = nline.split('•')
  col1 << nline[0]
  col2 << nline[1]
  col3 << nline[2]
end
# remove the nils left by lines with fewer than three entries
col1.compact!
col2.compact!
col3.compact!
# ap col1
# puts
# ap col2
# puts
# ap col3
col1.each_with_index do |entry, idx|
  # matches a phone number, perhaps a big assumption
  next unless entry.match?(/^\d{3}-\d{3}-\d{4}/)
  # a company is a company name, phone number, and website
  master_data << [col1[idx - 1], entry, col1[idx + 1]]
end
# do the same for col2 and col3
ap master_data
I have a string vector that looks like:
> string_vec
[1] "XXX" "Snakes On A Plane" "Mask of the Ninja" "Ruslan"
[5] "Kill Switch" "Buddy Holly Story, The" "Believers, The" "Closet, The"
[9] "Eyes of Tammy Faye, The" "Gymnast, The" "Hunger, The"
Some of the names end in ", The". I want to delete the comma and the space and move the "The" in front of the rest of the text.
For example, "Buddy Holly Story, The" becomes "The Buddy Holly Story".
Isolating the records with the pattern was easy :
string_vec[grepl("[A-Za-z]+, The$", string_vec)]
How can I adjust the position now?
data
string_vec <- c("XXX", "Snakes On A Plane", "Mask of the Ninja",
"Ruslan",
"Kill Switch", "Buddy Holly Story, The", "Believers, The",
"Closet, The",
"Eyes of Tammy Faye, The", "Gymnast, The", "Hunger, The")
You may try
sub('^(.*), The$', 'The \\1', string_vec)
#[1] "XXX" "Snakes On A Plane" "Mask of the Ninja"
#[4] "Ruslan" "Kill Switch" "The Buddy Holly Story"
#[7] "The Believers" "The Closet" "The Eyes of Tammy Faye"
#[10] "The Gymnast" "The Hunger"
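To spell out how the substitution works: (.*) captures everything before ", The", and \\1 in the replacement puts that capture back after the leading "The". The sketch below anchors the pattern with $ so that only a trailing ", The" is moved. The third title is a made-up example, added only to show a string with ", The" in the middle:

```r
titles <- c("Buddy Holly Story, The", "Kill Switch", "Then, There Were None")

# Anchored at both ends: only a title that ends in ", The" is rewritten.
moved <- sub("^(.*), The$", "The \\1", titles)
print(moved)
# [1] "The Buddy Holly Story" "Kill Switch"           "Then, There Were None"

# Without the trailing $, the made-up third title would be mangled.
unanchored <- sub("^(.*), The", "The \\1", titles)
print(unanchored[3])
# [1] "The Thenre Were None"
```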
Building on top of two questions I previously asked:
R: How to prevent memory overflow when using mgsub in vector mode?
gsub speed vs pattern length
I like @Tyler's suggestion to use fixed=TRUE, as it speeds up the calculations significantly. However, it's not always applicable. I need to substitute, say, caps as a stand-alone word with or without punctuation surrounding it. It's not known a priori what can follow or precede the word, but it must be a regular punctuation sign (, . ! - + etc.). It cannot be a number or a letter. See the example below; capsule must stay as is.
i = "Here is the capsule, caps key, and two caps, or two caps. or even three caps-"
orig = "caps"
change = "cap"
gsub_FixedTrue <- function(i) {
i = paste0(" ", i, " ")
orig = paste0(" ", orig, " ")
change = paste0(" ", change, " ")
i = gsub(orig,change,i,fixed=TRUE)
i = gsub("^\\s|\\s$", "", i, perl=TRUE)
return(i)
}
#Second fastest, doesn't clog memory
gsub_FixedFalse <- function(i) {
i = gsub(paste0("\\b",orig,"\\b"),change,i)
return(i)
}
print(gsub_FixedTrue(i)) #wrong
print(gsub_FixedFalse(i)) #correct
Results. Second output is desired
[1] "Here is the capsule, cap key, and two caps, or two caps. or even three caps-"
[1] "Here is the capsule, cap key, and two cap, or two cap. or even three cap-"
Using parts of your previous question as a test, I think we can put a placeholder around each punctuation character as follows, without slowing things down too much:
line <- c("one", "two one", "four phones", "and a capsule", "But here's a caps key",
"Here is the capsule, caps key, and two caps, or two caps. or even three caps-" )
e <- c("one", "two", "caps")
r <- c("ONE", "TWO", "cap")
line <- rep(line, 1700000/length(line))
line <- gsub("([[:punct:]])", " <DEL>\\1<DEL> ", line, perl=TRUE)
## Start
line2 <- paste0(" ", line, " ")
e2 <- paste0(" ", e, " ")
r2 <- paste0(" ", r, " ")
for (i in seq_along(e2)) {
line2 <- gsub(e2[i], r2[i], line2, fixed=TRUE)
}
gsub("^\\s|\\s$| <DEL>|<DEL> ", "", line2, perl=TRUE)
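To make the placeholder trick easier to follow, here is a minimal sketch of the same three steps on the single sentence from the question. The variable names are illustrative only:

```r
x <- "Here is the capsule, caps key, and two caps, or two caps. or even three caps-"

# 1. Surround every punctuation character with a marker and a space,
#    so that each whole word ends up delimited by spaces.
marked <- gsub("([[:punct:]])", " <DEL>\\1<DEL> ", x, perl = TRUE)

# 2. Pad both ends and replace the space-delimited word as a fixed string.
padded <- paste0(" ", marked, " ")
swapped <- gsub(" caps ", " cap ", padded, fixed = TRUE)

# 3. Remove the padding and the markers again.
result <- gsub("^\\s|\\s$| <DEL>|<DEL> ", "", swapped, perl = TRUE)
print(result)
# [1] "Here is the capsule, cap key, and two cap, or two cap. or even three cap-"
```

Only step 1 and step 3 use a regular expression; the replacement itself stays a fixed-string gsub, which is where the speed-up comes from.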
Let us say I have a string
"ABCDEFGHI56dfsdfd"
What I want to do is insert a space character at multiple positions at once.
For example, I want to insert a space character at two randomly chosen positions, say 4 and 8.
So the output should be
"ABCD EFGH I56dfsdfd"
What is the most effective way of doing this, given that the string can contain any type of character (not just letters)?
Here's a solution based on regular expressions:
vec <- "ABCDEFGHI56dfsdfd"
# sample two random positions
pos <- sample(nchar(vec), 2)
# [1] 6 4
# generate regex pattern
pat <- paste0("(?=.{", nchar(vec) - pos, "}$)", collapse = "|")
# [1] "(?=.{11}$)|(?=.{13}$)"
# insert spaces at (after) positions
gsub(pat, " ", vec, perl = TRUE)
# [1] "ABCD EF GHI56dfsdfd"
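This approach is based on positive lookaheads, e.g., (?=.{11}$). A lookahead matches a position without consuming any characters: in this example it matches the position that is followed by exactly 11 characters up to the end of the string ($), and gsub inserts the space there.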
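For a reproducible check, the same pattern can be built from fixed positions instead of sample(). This is a small sketch using the positions 4 and 8 from the question:

```r
vec <- "ABCDEFGHI56dfsdfd"
pos <- c(4, 8)  # fixed positions, so the result is reproducible

# One lookahead per position: "followed by exactly nchar(vec) - pos characters".
pat <- paste0("(?=.{", nchar(vec) - pos, "}$)", collapse = "|")
print(pat)
# [1] "(?=.{13}$)|(?=.{9}$)"

spaced <- gsub(pat, " ", vec, perl = TRUE)
print(spaced)
# [1] "ABCD EFGH I56dfsdfd"
```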
A bit more brute-force-y than Sven's:
randomSpaces <- function(txt) {
pos <- sort(sample(nchar(txt), 2))
paste(substr(txt, 1, pos[1]), " ",
substr(txt, pos[1]+1, pos[2]), " ",
substr(txt, pos[2]+1, nchar(txt)), collapse="", sep="")
}
for (i in 1:10) print(randomSpaces("ABCDEFGHI56dfsdfd"))
## [1] "ABCDEFG HI56 dfsdfd"
## [1] "ABC DEFGHI5 6dfsdfd"
## [1] "AB CDEFGHI56dfsd fd"
## [1] "ABCDEFGHI 5 6dfsdfd"
## [1] "ABCDEF GHI56dfsdf d"
## [1] "ABC DEFGHI56dfsdf d"
## [1] "ABCD EFGHI56dfsd fd"
## [1] "ABCDEFGHI56d fsdfd "
## [1] "AB CDEFGH I56dfsdfd"
## [1] "A BCDE FGHI56dfsdfd"
Based on the accepted answer, here's a function that simplifies this approach:
##insert pattern in string at position
substrins <- function(ins, x, ..., pos=NULL, offset=0){
stopifnot(is.numeric(pos),
is.numeric(offset),
!is.null(pos))
offset <- offset[1]
pat <- paste0("(?=.{", nchar(x) - pos - (offset-1), "}$)", collapse = "|")
gsub(pattern = pat, replacement = ins, x = x, ..., perl = TRUE)
}
# insert space at position 10
substrins(" ", "ABCDEFGHI56dfsdfd", pos = 10)
##[1] "ABCDEFGHI 56dfsdfd"
# insert pattern before position 10 (i.e. at position 9)
substrins(" ", "ABCDEFGHI56dfsdfd", pos = 10, offset=-1)
##[1] "ABCDEFGH I56dfsdfd"
# insert pattern after position 10 (i.e. at position 11)
substrins(" ", "ABCDEFGHI56dfsdfd", pos = 10, offset=1)
##[1] "ABCDEFGHI5 6dfsdfd"
Now to do what the OP wanted:
# insert space at position 4 and 8
substrins(" ", "ABCDEFGHI56dfsdfd", pos = c(4,8))
##[1] "ABC DEFG HI56dfsdfd"
# insert space after position 4 and 8 (as per OP's desired output)
substrins(" ", "ABCDEFGHI56dfsdfd", pos = c(4,8), offset=1)
##[1] "ABCD EFGH I56dfsdfd"
To replicate the other, more brute-force-y answer one would do:
set.seed(123)
x <- "ABCDEFGHI56dfsdfd"
for (i in 1:10) print(substrins(" ", x, pos = sample(nchar(x), 2)))
##[1] "ABCD EFGHI56d fsdfd"
##[1] "ABCDEF GHI56dfs dfd"
##[1] " ABCDEFGHI56dfsd fd"
##[1] "ABCDEFGH I56dfs dfd"
##[1] "ABCDEFG HI 56dfsdfd"
##[1] "ABCDEFG HI56dfsdf d"
##[1] "ABCDEFGHI 56 dfsdfd"
##[1] "A BCDEFGHI56dfs dfd"
##[1] " ABCD EFGHI56dfsdfd"
##[1] "ABCDE FGHI56dfsd fd"
I have a long list of names and I have to count the number of times each name comes up. However, the names are mixed with spaces.
Here is the simple example
x <- c(" John D","John D ","John D")
table(x)
x
John D John D John D
1 1 1
You can see that, because of the spaces, these are recognized as three different names. What I have to do is remove the extra spaces without losing the space between John and D.
Please help. Thanks.
You can use gsub to remove the leading/trailing whitespace characters.
x <- c(" John D", "John D ", " John D ")
y <- gsub('^\\s+|\\s+$', '', x)
table(y)
# y
# John D
# 3
Explanation: \s matches a whitespace character (\n, \r, \t, \f, or " "). The anchors ^ and $ restrict the two alternatives to the beginning and the end of the string respectively, so the space inside the name is left alone. The + quantifier means match 1 or more times.
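As a quick check of that explanation, the sketch below compares the regular expression with base R's trimws() (available since R 3.2.0), which is brought in here only for comparison. The third element, with a tab and a newline, is an assumed extra test case:

```r
x <- c(" John D", "John D ", " \tJohn D\n")

# Anchored regex: strips whitespace at the ends, keeps the interior space.
trimmed_regex <- gsub("^\\s+|\\s+$", "", x)

# Base R equivalent for leading and trailing whitespace.
trimmed_base <- trimws(x)

counts <- table(trimmed_regex)
print(counts)
```

Both versions should collapse the three variants into the single name "John D" with a count of 3.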
You can also use the stringr package.
library(stringr)
x <- c(" John D", "John D ", " John D ")
y <- str_trim(x, side='both')
table(y)
# y
# John D
# 3
Try:
library(stringr)
x1 <- str_trim(x)
table(x1)
#x1
# John D
# 3
Or
gsub("^ +| +$", "",x)
#[1] "John D" "John D" "John D"
^ +| +$ matches 1 or more spaces at either the beginning or the end of the string; each match is replaced with "".
If you have a vector like this:
x <- c("John D", " \n John D", "John D \r")
library(qdap)
strip(x, lower.case = FALSE)
#[1] "John D" "John D" "John D"
If there are no additional spaces between the names str_trim still works
x <- c(" \nJohn D","John D\r ","John D")
str_trim(x)
#[1] "John D" "John D" "John D"