Ruby Extract Slippery Text Columns

I need to get some columnized text into Ruby arrays. The entries are company names, phone numbers and websites. I've obscured the actual data in order to focus on the parsing rather than on the nature of the data, which I can deal with.
Here is the Gist.
As you can see, the nature of the columnar data changes, including:
leading whitespace width changes, from 0 to ~8
some lines are "" or \s+{3,}
column width changes depending on which block it's in (see how line 31 changes from 27)
therefore reliance upon using widths becomes problematic
some lines show empty entries in columns
empty column 1 on line 4 (example)
empty column 2 on line 2 (example)
empty column 3 on line 3 (example)
I want to get this organized into col1, col2 and col3 as arrays of entries. I can split them later on /\s*/ and choose the first element.
Given the obvious structure of these three columns, I'm thinking there is a pragmatic way of parsing these columns out into arrays of entries, one per line.
Does anybody have any insight into how to parse out the columns? Columns -> arrays col1, col2, col3 is the format which I seek.
Any advice/insight appreciated.

Let's suppose we gulp the file into a string, using IO::read, where the string is as follows.
str=<<~END
aaa bb cccc aaaaaaa aaaa bbb
aaaaaaaa aaaaaaaaa
aaaaa aaaaa bbbb
aaaaa bb cc aaaaaaa
aaa bbb aaaaaa bbb aaaaa bbbbbb
aaaa aaaaaaaaaaaa
aaaaaaaaa
a bb aaaaaaaaa
END
The first step is to divide the string into (two) blocks, which we can do as follows:
a1 = str.split(/\n{2,}/)
#=> ["aaa bb cccc aaaaaaa aaaa bbb\n aaaaaaaa aaaaaaaaa\n aaaaa aaaaa bbbb\naaaaa bb cc aaaaaaa",
# "aaa bbb aaaaaa bbb aaaaa bbbbbb\n aaaa aaaaaaaaaaaa\n aaaaaaaaa\n a bb aaaaaaaaa\n"]
Next, convert each of the two blocks to an array of lines.
a2 = a1.map { |s| s.chomp.split(/\n/) }
#=> [["aaa bb cccc aaaaaaa aaaa bbb",
# " aaaaaaaa aaaaaaaaa",
# " aaaaa aaaaa bbbb",
# "aaaaa bb cc aaaaaaa"],
# ["aaa bbb aaaaaa bbb aaaaa bbbbbb",
# " aaaa aaaaaaaaaaaa",
# " aaaaaaaaa",
# " a bb aaaaaaaaa"]]
We now need to map each line (a string) in each group of a2 to an array whose elements correspond to the columns of the original text.
a3 = a2.flat_map do |group|
indent = group.map { |line| line =~ /\S/ }.min
mx_len = group.map(&:length).max
break_cols = (indent..mx_len-1).each_with_object([]) do |i,cols|
cols << i if group.all? { |line| [" ", nil].include?(line[i]) }
end
b1, b2 = [break_cols.first, break_cols.last]
group.map { |line| [line[0..b1-1], line[b1..b2-1], line[b2..-1]] }
end
#=> [["aaa bb cccc", " aaaaaaa ", " aaaa bbb"],
# [" aaaaaaaa ", " ", " aaaaaaaaa"],
# [" aaaaa ", " ", " aaaaa bbbb"],
# ["aaaaa bb cc", " aaaaaaa", nil],
# ["aaa bbb", " aaaaaa bbb ", " aaaaa bbbbbb"],
# [" aaaa ", " ", " aaaaaaaaaaaa"],
# [" ", " aaaaaaaaa", nil],
# [" a bb ", " ", " aaaaaaaaa"]]
line =~ /\S/ returns the index of the first character of line that is not whitespace (\S matches any non-whitespace character in a regular expression).
See Enumerable#flat_map.
The following intermediate values were obtained in the calculation of a3.
For group 1:
mx_len = 37
indent = 0
break_cols = [11, 12, 13, 14, 23, 24, 25]
b1 = 11
b2 = 25
For group 2:
mx_len = 38
indent = 0
break_cols = [7, 8, 9, 20, 21, 22]
b1 = 7
b2 = 22
All that remains is to convert nil's to empty strings, strip spaces from the ends of each string and transpose the array.
a3.map { |col| col.map { |s| s.to_s.strip } }.transpose
#=> [["aaa bb cccc", "aaaaaaaa", "aaaaa", "aaaaa bb cc",
# "aaa bbb", "aaaa", "", "a bb"],
# ["aaaaaaa", "", "", "aaaaaaa", "aaaaaa bbb", "",
# "aaaaaaaaa", ""],
# ["aaaa bbb", "aaaaaaaaa", "aaaaa bbbb", "",
# "aaaaa bbbbbb", "aaaaaaaaaaaa", "", "aaaaaaaaa"]]
If desired, we could of course chain the above operations.
str.split(/\n{2,}/).
map { |s| s.chomp.split(/\n/) }.
flat_map do |group|
indent = group.map { |line| line =~ /\S/ }.min
mx_len = group.map(&:length).max
break_cols = (indent..mx_len-1).each_with_object([]) do |i,cols|
cols << i if group.all? { |line| [" ", nil].include?(line[i]) }
end
b1, b2 = [break_cols.first, break_cols.last]
group.map { |line| [line[0..b1-1], line[b1..b2-1], line[b2..-1]] }
end.map { |col| col.map { |s| s.to_s.strip } }.transpose
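For comparison, here is a rough Python sketch of the same break-column idea. It is not part of the answer above: the name split_columns and the sample rows are made up for illustration, and, like the Ruby code, it assumes every block has exactly three columns separated by character positions that are blank on every line of the block.

```python
import re

def split_columns(text):
    """Split blank-line-separated blocks of three-column text into rows of cells."""
    rows = []
    for block in re.split(r"\n\s*\n", text.strip("\n")):
        lines = block.split("\n")
        indent = min(len(l) - len(l.lstrip()) for l in lines)
        width = max(len(l) for l in lines)
        # A position is a "break" if every line is blank (or too short) there.
        breaks = [i for i in range(indent, width)
                  if all(i >= len(l) or l[i] == " " for l in lines)]
        # Assumption: the first and last break positions bound the middle column.
        b1, b2 = breaks[0], breaks[-1]
        for l in lines:
            rows.append([l[:b1].strip(), l[b1:b2].strip(), l[b2:].strip()])
    return rows

# Hypothetical sample: two blocks with different column widths and empty cells.
sample = "\n".join([
    "{:<13}{:<11}{}".format("alpha co", "555-0100", "a.example"),
    "{:<13}{:<11}{}".format("  beta", "", "b.example"),
    "",
    "{:<7}{:<10}{}".format("gamma", "555-0199", ""),
    "{:<7}{:<10}{}".format("", "555-0123", "c.example"),
])

rows = split_columns(sample)
col1, col2, col3 = zip(*rows)  # transpose rows into columns
print(col1, col2, col3, sep="\n")
```

A block with no all-blank position would raise an IndexError here, so real data may need a guard around the breaks lookup.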

As Cary has demonstrated, working with widths is painful. That's what tripped me up. I took a new approach, using String#gsub(/\s{2,44}/,'•') so it would preserve column widths while inserting delimiters:
col1, col2, col3 = [],[], []
master_data = []
lines = File.read(s).split("\n") # s is the path to the input file
lines.each do |line|
next if line == "" || line.strip == ""
nline = line.gsub(/\s{2,44}/,'•')
nline[0] = '' if nline.start_with?('•')
nline = nline.split('•')
col1 << nline[0]
col2 << nline[1]
col3 << nline[2]
end
col1.delete_if {|i| i.nil?}
col2.delete_if {|i| i.nil?}
col3.delete_if {|i| i.nil?}
# ap col1
# puts
# ap col2
# puts
# ap col3
counter = 0
col1.each do |i|
next if i.nil?
if i.match?(/^\d{3}-\d{3}-\d{4}/) # matches a phone number, perhaps a big assumption
company = [col1[counter-1], col1[counter], col1[counter+1]]
master_data << company
end
counter += 1
end
# a company is a company name, phone number, and website
# do the same for col2 and col3
ap master_data
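The delimiter idea can be sketched in Python as below. This is only an illustration under assumptions: the helper name split_on_gaps and the sample lines are hypothetical, and any run of two or more whitespace characters is treated as a column gap.

```python
import re

def split_on_gaps(line, min_gap=2):
    """Split a line wherever at least `min_gap` whitespace characters occur."""
    return re.split(r"\s{%d,}" % min_gap, line.strip())

full_row = split_on_gaps("  Acme Widgets    555-123-4567   acme.example")
sparse_row = split_on_gaps("Acme Widgets                 acme.example")
print(full_row)    # three cells
print(sparse_row)  # only two cells: the empty middle column is not detectable
```

The second call shows the trade-off of this approach: an empty cell merges with the neighbouring gaps, so the remaining cells shift left unless the data offers another clue, such as the phone-number pattern used above.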

Related

How to regex match everything but long words?

I would like to select all long words from a string: re.findall("[a-z]{3,}")
However, for a reason I can use substitute only. Hence I need to substitute everything but words of 3 and more letters by space. (e.g. abc de1 fgh ij -> abc fgh)
What would such a regex look like?
The result should be all "[a-z]{3,}" concatenated by spaces. However, you can use substitution only.
Or in Python: Find a regex such that
re.sub(regex, " ", text) == " ".join(re.findall("[a-z]{3,}", text))
Here are some test cases:
import re
solution_regex="..."
for test_str in ["aaa aa aaa aa",
"aaa aa11",
"11aaa11 11aa11",
"aa aa1aa aaaa"
]:
expected_str = " ".join(re.findall("[a-z]{3,}", test_str))
print(test_str, "->", expected_str)
if re.sub(solution_regex, " ", test_str)!=expected_str:
print("ERROR")
->
aaa aa aaa aa -> aaa aaa
aaa aa11 -> aaa
11aaa11 11aa11 -> aaa
aa aa1aa aaaa -> aaaa
Note that space is no different than any other symbol.

\b(?:[a-zA-Z_]{1,2}|\w*\d+\w*)\b
Explanation:
\b - the match must start and end at a word boundary
(?: ) - non-capturing group
[a-zA-Z_]{1,2} - any word of one or two letters or underscores
\w*\d+\w* - any word that contains at least one digit and consists of digits, '_' and letters
Here you can see the test.

You can use the regex
(\s\b(\d*[a-z]\d*){1,2}\b)|(\s\b\d+\b)
and replace with an empty string, here is a python code for the same
import re
regex = r"(\s\b(\d*[a-z]\d*){1,2}\b)|(\s\b\d+\b)"
test_str = "abcd abc ad1r ab a11b a1 11a 1111 1111abcd a1b2c3d"
subst = ""
# You can limit the number of replacements with the count argument (0 means replace all)
result = re.sub(regex, subst, test_str, count=0)
if result:
print (result)
here is a demo

In Autoit this works for me
#include <Array.au3>
$a = StringRegExp('abc de1 fgh ij 234234324 sdfsdfsdf wfwfwe', '(?i)[a-z]{3,}', 3)
ConsoleWrite(_ArrayToString($a, ' ') & @CRLF)
Result ==> abc fgh sdfsdfsdf wfwfwe

import re
regex = r"(?:^|\s)[^a-z\s]*[a-z]{0,2}[^a-z\s]*(?:\s|$)"
str = "abc de1 fgh ij"
subst = " "
result = re.sub(regex, subst, str)
print (result)
Output:
abc fgh
Explanation:
(?:^|\s) : non capture group, start of string or space
[^a-z\s]* : 0 or more any character that is not letter or space
[a-z]{0,2} : 0, 1 or 2 letters
[^a-z\s]* : 0 or more any character that is not letter or space
(?:\s|$) : non capture group, space or end of string

With the other ideas posted here, I came up with an answer. I can't believe I missed that:
([^a-z]+|(?<![a-z])[a-z]{1,2}(?![a-z]))+
https://regex101.com/r/IIxkki/2
Match either non-letters, or up to two letters bounded by non-letters.
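A quick way to check this pattern against the test strings from the question is sketched below. One assumption is added that the question does not state: leading and trailing spaces in the substituted result are ignored, since a substitution with " " cannot avoid leaving a space where junk at either end of the string was matched. The helper name keep_long_words is made up for the example.

```python
import re

# The pattern from the answer above, unchanged.
pattern = re.compile(r"([^a-z]+|(?<![a-z])[a-z]{1,2}(?![a-z]))+")

def keep_long_words(text):
    """Replace every run of non-letters and short words with a single space."""
    # Assumption: stray spaces at either end do not matter, so strip them.
    return pattern.sub(" ", text).strip()

test_strings = [
    "aaa aa aaa aa",
    "aaa aa11",
    "11aaa11 11aa11",
    "aa aa1aa aaaa",
    "abc de1 fgh ij",
]
for test_str in test_strings:
    expected = " ".join(re.findall("[a-z]{3,}", test_str))
    status = "ok" if keep_long_words(test_str) == expected else "ERROR"
    print(repr(test_str), "->", repr(keep_long_words(test_str)), status)
```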

removing consecutive duplicates in strings R

I'd like to collapse two strings s1 = "word1 word2 word3" and s2 = "word2 word3 word4" but removing the extra (future) consecutive overlap/duplicate ("word2 word3"). That is, I should obtain s = "word1 word2 word3 word4" rather than s = "word1 word2 word3 word2 word3 word4".
More simply, it should also work for single-word overlaps: s1 = "word1 word2" and s2 = "word2 word3" should give me s = "word1 word2 word3" rather than s = "word1 word2 word2 word3".
I am using wordnumber for illustration purposes but of course it should work for any word...

Use unique on the result, that should remove all the duplicates.
And perhaps also use sort?
EDIT: Sorry, my first answer did miss the point completely. Here's a revised solution based on the stringr-package, that I think should work. The idea is to first split the strings into vectors, then compare the vectors and check if an overlap is present - finally join the vectors based on whether or not an overlap was detected.
s1 = "word1 word2 word3"
s2 = "word2 word3 word4"
library(stringr)
.s1_splitted <- str_split(
string = s1,
pattern = "\\ +")[[1]]
.s2_splitted <- str_split(
string = s2,
pattern = "\\ +")[[1]]
.matches12 <- charmatch(
x = .s1_splitted,
table = .s2_splitted)
If the last number is different from NA, and shorter than the
length of .s1_splitted, then check if the end of the vector
looks like it ought to do.
.last_element <- tail(.matches12, n = 1)
.overlap <- FALSE
if (! is.na(.last_element) && .last_element <= length(.s1_splitted)) {
.overlap <- identical(
x = 1:.last_element,
y = tail(x = .matches12,
n = .last_element))
}
Join the components, based on overlap.
if (.overlap) {
.joined <- c(
head(x = .s1_splitted,
n = - .last_element),
.s2_splitted)
} else
.joined <- c(.s1_splitted,
.s2_splitted)
Convert back to a string
.result <- paste(.joined, collapse = " ")

This was surprisingly difficult, but I believe I have a solution:
sjoin <- function(s1,s2) {
ss1 <- strsplit(s1,'\\s+')[[1L]];
ss2 <- strsplit(s2,'\\s+')[[1L]];
if (length(ss1)==0L) return(s2);
if (length(ss2)==0L) return(s1);
n <- 0L; for (i in seq(min(length(ss1),length(ss2)),1L))
if (all(ss1[seq(to=length(ss1),len=i)]==ss2[seq(1L,len=i)])) {
n <- i;
break;
}; ## end if
paste(collapse=' ',c(ss1,if (n==0L) ss2 else ss2[-1:-n]));
}; ## end sjoin()
sjoin('1 2 3','2 3 4');
## [1] "1 2 3 4"
sjoin('1 2 3 x','2 3 4');
## [1] "1 2 3 x 2 3 4"
sjoin('1 2 3','x 2 3 4');
## [1] "1 2 3 x 2 3 4"
sjoin('','')
## [1] ""
sjoin('a','');
## [1] "a"
sjoin('','a');
## [1] "a"
sjoin('a','a')
## [1] "a"
sjoin('a b c','a b c');
## [1] "a b c"
sjoin('a b c','c');
## [1] "a b c"
sjoin('a b c','c d');
## [1] "a b c d"
sjoin('b','b c d');
## [1] "b c d"
sjoin('a b','b c d');
## [1] "a b c d"
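The same longest-overlap search can be sketched in Python for readers who find the R version dense. This is an illustrative port, not the author's code; it reuses the name sjoin and splits on any whitespace, so one small difference is that it always normalises spacing in the result.

```python
def sjoin(s1, s2):
    """Join two strings, dropping the longest word overlap between
    the end of s1 and the start of s2."""
    w1, w2 = s1.split(), s2.split()
    # Try the largest possible overlap first and stop at the first hit.
    for n in range(min(len(w1), len(w2)), 0, -1):
        if w1[-n:] == w2[:n]:
            return " ".join(w1 + w2[n:])
    return " ".join(w1 + w2)

print(sjoin("word1 word2 word3", "word2 word3 word4"))
print(sjoin("1 2 3 x", "2 3 4"))
```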

merge strings among rows by id

I wish to merge strings among rows by an id variable. I know how to do that with the R code below. However, my code seems vastly overly complex.
In the present case each string has two elements that are not dots. Each pair of consecutive rows within an id have one element in common. So, only one of those elements remains after the two rows are merged.
The desired result is shown and the R code below returns the desired result. Thank you for any suggestions. Sorry my R code is so long and convoluted, but it does work and my goal is to obtain more efficient code in base R.
my.data <- read.table(text = '
id my.string
2 11..................
2 .1...2..............
2 .....2...3..........
5 ....................
6 ......2.....2.......
6 ............2...4...
7 .1...2..............
7 .....2....3.........
7 ..........3..3......
7 .............34.....
8 ....1.....1.........
8 ..........12........
8 ...........2....3...
9 ..................44
10 .2.......2..........
11 ...2...2............
11 .......2.....2......
11 .............2...2..
', header = TRUE, na.strings = 'NA', stringsAsFactors = FALSE)
my.data
desired.result <- read.table(text = '
id my.string
2 11...2...3..........
5 ....................
6 ......2.....2...4...
7 .1...2....3..34.....
8 ....1.....12....3...
9 ..................44
10 .2.......2..........
11 ...2...2.....2...2..
', header = TRUE, na.strings = 'NA', stringsAsFactors = FALSE)
# obtain position of first and last non-dot
# from: http://stackoverflow.com/questions/29229333/position-of-first-and-last-non-dot-in-a-string-with-regex
first.last.dot <- data.frame(my.data, do.call(rbind, gregexpr("^\\.*\\K[^.]|[^.](?=\\.*$)", my.data[,2], perl=TRUE)))
# obtain non-dot elements
first.last.dot$first.element <- as.numeric(substr(first.last.dot$my.string, first.last.dot$X1, first.last.dot$X1))
first.last.dot$last.element <- as.numeric(substr(first.last.dot$my.string, first.last.dot$X2, first.last.dot$X2))
# obtain some book-keeping variables
first.last.dot$number.within.group <- sequence(rle(first.last.dot$id)$lengths)
most.records.per.id <- max(first.last.dot$number.within.group)
n.ids <- length(unique(first.last.dot$id))
# create matrices for recording data
positions.per.id <- matrix(NA, nrow = (n.ids), ncol=(most.records.per.id+1))
values.per.id <- matrix(NA, nrow = (n.ids), ncol=(most.records.per.id+1))
# use nested for-loops to fill matrices with data
positions.per.id[1,1] = first.last.dot$X1[1]
values.per.id[1,1] = first.last.dot$first.element[1]
positions.per.id[1,2] = first.last.dot$X2[1]
values.per.id[1,2] = first.last.dot$last.element[1]
j = 1
for(i in 2:nrow(first.last.dot)) {
if(first.last.dot$id[i] != first.last.dot$id[i-1]) j = j + 1
positions.per.id[j, (first.last.dot$number.within.group[i]+0)] = first.last.dot$X1[i]
positions.per.id[j, (first.last.dot$number.within.group[i]+1)] = first.last.dot$X2[i]
values.per.id[j, (first.last.dot$number.within.group[i]+0)] = first.last.dot$first.element[i]
values.per.id[j, (first.last.dot$number.within.group[i]+1)] = first.last.dot$last.element[i]
}
# convert matrix data into new strings using nested for-loops
new.strings <- matrix(0, nrow = nrow(positions.per.id), ncol = nchar(my.data$my.string[1]))
for(i in 1:nrow(positions.per.id)) {
for(j in 1:ncol(positions.per.id)) {
new.strings[i,positions.per.id[i,j]] <- values.per.id[i,j]
}
}
# format new strings
new.strings[is.na(new.strings)] <- 0
new.strings[new.strings==0] <- '.'
new.strings2 <- data.frame(id = unique(first.last.dot$id), my.string = (do.call(paste0, as.data.frame(new.strings))), stringsAsFactors = FALSE)
new.strings2
all.equal(desired.result, new.strings2)
# [1] TRUE

Dude, this was tough. Please don't make me explain what I did.
data.frame(id=unique(my.data$id), my.string=sapply(lapply(unique(my.data$id), function(id) gsub('^$','.',substr(gsub('\\.','',do.call(paste0,strsplit(my.data[my.data$id==id,'my.string'],''))),1,1)) ), function(x) paste0(x,collapse='') ), stringsAsFactors=F );
Ok, I'll explain it:
It begins with this lapply() call:
lapply(unique(my.data$id), function(id) ... )
As you can see, the above basically iterates over the unique ids in the data.frame, processing each one in turn. Here's the contents of the function:
gsub('^$','.',substr(gsub('\\.','',do.call(paste0,strsplit(my.data[my.data$id==id,'my.string'],''))),1,1))
Let's take that in pieces, starting with the innermost subexpression:
strsplit(my.data[my.data$id==id,'my.string'],'')
The above indexes all my.string cells for the current id value, and splits each string using strsplit(). This produces a list of character vectors, with each list component containing a vector of character strings, where the whole vector corresponds to the input string which was split. The use of the empty string as the delimiter causes each individual character in each input string to become an element in the output vector in the list component corresponding to said input string.
Here's an example of what the above expression generates (for id==2):
[[1]]
[1] "1" "1" "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "." "."
[[2]]
[1] "." "1" "." "." "." "2" "." "." "." "." "." "." "." "." "." "." "." "." "." "."
[[3]]
[1] "." "." "." "." "." "2" "." "." "." "3" "." "." "." "." "." "." "." "." "." "."
The above strsplit() call is wrapped in the following (with the ... representing the previous expression):
do.call(paste0,...)
That calls paste0() once, passing the output vectors that were generated by strsplit() as arguments. This does a kind of element-wise pasting of all vectors, so you end up with a single vector like this, for each unique id:
[1] "1.." "11." "..." "..." "..." ".22" "..." "..." "..." "..3" "..." "..." "..." "..." "..." "..." "..." "..." "..." "..."
The above paste0() call is wrapped in the following:
gsub('\\.','',...)
That strips all literal dots from all elements, resulting in something like this, for each unique id:
[1] "1" "11" "" "" "" "22" "" "" "" "3" "" "" "" "" "" "" "" "" "" ""
The above gsub() call is wrapped in the following:
substr(...,1,1)
That extracts just the first character of each element, which, if it exists, is the desired character in that position. Empty elements are acceptable, as that just means the id had no non-dot characters in any of its input strings at that position.
The above substr() call is wrapped in the following:
gsub('^$','.',...)
That simply replaces empty elements with a literal dot, which is obviously necessary before we put the string back together. So we have, for id==2:
[1] "1" "1" "." "." "." "2" "." "." "." "3" "." "." "." "." "." "." "." "." "." "."
That completes the function that was given to the lapply() call. Thus, coming out of that call will be a list of character vectors representing the desired output strings. All that remains is collapsing the elements of those vectors back into a single string, which is why we then need this:
sapply(..., function(x) paste0(x,collapse='') )
Using sapply() (simplify-apply) is appropriate because it automatically combines all desired output strings into a single character vector, rather than leaving them as a list:
[1] "11...2...3.........." "...................." "......2.....2...4..." ".1...2....3..34....." "....1.....12....3..." "..................44" ".2.......2.........." "...2...2.....2...2.."
Thus, all that remains is producing the full output data.frame, similar to the input data.frame:
data.frame(id=unique(my.data$id), my.string=..., stringsAsFactors=F )
Resulting in:
id my.string
1 2 11...2...3..........
2 5 ....................
3 6 ......2.....2...4...
4 7 .1...2....3..34.....
5 8 ....1.....12....3...
6 9 ..................44
7 10 .2.......2..........
8 11 ...2...2.....2...2..
And we're done!
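The core of the approach (line the strings of one id up position by position and keep the first non-dot character) can also be written compactly in Python. The sketch below is only an illustration: merge_by_id is a made-up name, the rows are a subset of the question's data, and all strings of an id are assumed to have the same length.

```python
def merge_by_id(rows):
    """rows: iterable of (id, string) pairs; returns {id: merged string}."""
    grouped = {}
    for id_, s in rows:
        grouped.setdefault(id_, []).append(s)
    merged = {}
    for id_, strings in grouped.items():
        # zip(*strings) yields one tuple per character position.
        merged[id_] = "".join(
            next((c for c in column if c != "."), ".")
            for column in zip(*strings)
        )
    return merged

rows = [
    (2, "11.................."),
    (2, ".1...2.............."),
    (2, ".....2...3.........."),
    (5, "...................."),
    (8, "....1.....1........."),
    (8, "..........12........"),
    (8, "...........2....3..."),
]
result = merge_by_id(rows)
for id_, s in result.items():
    print(id_, s)
```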

Doing this in base R is a bit masochistic, so I won't do that, but with some perseverance you can do it yourself. Here's a data.table version instead (you'll need to install the latest 1.9.5 version from github to get tstrsplit):
library(data.table)
dt = as.data.table(my.data) # or setDT to convert in place
dt[, paste0(lapply(tstrsplit(my.string, ""),
function(i) {
res = i[i != "."];
if (length(res) > 0)
res[1]
else
'.'
}), collapse = "")
, by = id]
# id V1
#1: 2 11...2...3..........
#2: 5 ....................
#3: 6 ......2.....2...4...
#4: 7 .1...2....3..34.....
#5: 8 ....1.....12....3...
#6: 9 ..................44
#7: 10 .2.......2..........
#8: 11 ...2...2.....2...2..

Here's a possibility using functions from stringi and dplyr packages:
library(stringi)
library(dplyr)
# split my.string
m <- stri_split_boundaries(my.data$my.string, type = "character", simplify = TRUE)
df <- data.frame(id = my.data$id, m)
# function to apply to each column - select "." or unique "number"
myfun <- function(x) if(all(x == ".")) "." else unique(x[x != "."])
df %>%
# for each id...
group_by(id) %>%
# ...and each column, apply function
summarise_each(funs(myfun)) %>%
# for each row...
rowwise() %>%
#...concatenate strings
do(data.frame(id = .[1], mystring = paste(.[-1], collapse = "")))
# id mystring
# 1 2 11...2...3..........
# 2 5 ....................
# 3 6 ......2.....2...4...
# 4 7 .1...2....3..34.....
# 5 8 ....1.....12....3...
# 6 9 ..................44
# 7 10 .2.......2..........
# 8 11 ...2...2.....2...2..

Insert a character at multiple positions in a string at once

Let us say I have a string
"ABCDEFGHI56dfsdfd"
What I want to do is insert a space character at multiple positions at once.
For example, I want to insert a space character at two randomly chosen positions, say 4 and 8.
So the output should be
"ABCD EFGH I56dfsdfd"
What is the most effective way of doing this, given that the string can contain any type of character (not just letters)?

Here's a solution based on regular expressions:
vec <- "ABCDEFGHI56dfsdfd"
# sample two random positions
pos <- sample(nchar(vec), 2)
# [1] 6 4
# generate regex pattern
pat <- paste0("(?=.{", nchar(vec) - pos, "}$)", collapse = "|")
# [1] "(?=.{11}$)|(?=.{13}$)"
# insert spaces at (after) positions
gsub(pat, " ", vec, perl = TRUE)
# [1] "ABCD EF GHI56dfsdfd"
This approach is based on positive lookaheads, e.g., (?=.{11}$). In this example, a space is inserted at 11 characters before the end of the string ($).
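The lookahead trick carries over to other regex engines. As a hedged illustration, here is the same idea with Python's re module; insert_at is a hypothetical helper, and the text is assumed to contain no newline characters (otherwise re.DOTALL would be needed so that . matches them).

```python
import re

def insert_at(text, positions, ins=" "):
    """Insert `ins` after each 1-based position using zero-width lookaheads."""
    if not positions:
        return text
    # One alternative per position: "exactly len(text) - p characters remain".
    pattern = "|".join("(?=.{%d}$)" % (len(text) - p) for p in positions)
    return re.sub(pattern, ins, text)

print(insert_at("ABCDEFGHI56dfsdfd", [6, 4]))
print(insert_at("ABCDEFGHI56dfsdfd", [4, 8]))
```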

A bit more brute-force-y than Sven's:
randomSpaces <- function(txt) {
pos <- sort(sample(nchar(txt), 2))
paste(substr(txt, 1, pos[1]), " ",
substr(txt, pos[1]+1, pos[2]), " ",
substr(txt, pos[2]+1, nchar(txt)), collapse="", sep="")
}
for (i in 1:10) print(randomSpaces("ABCDEFGHI56dfsdfd"))
## [1] "ABCDEFG HI56 dfsdfd"
## [1] "ABC DEFGHI5 6dfsdfd"
## [1] "AB CDEFGHI56dfsd fd"
## [1] "ABCDEFGHI 5 6dfsdfd"
## [1] "ABCDEF GHI56dfsdf d"
## [1] "ABC DEFGHI56dfsdf d"
## [1] "ABCD EFGHI56dfsd fd"
## [1] "ABCDEFGHI56d fsdfd "
## [1] "AB CDEFGH I56dfsdfd"
## [1] "A BCDE FGHI56dfsdfd"

Based on the accepted answer, here's a function that simplifies this approach:
##insert pattern in string at position
substrins <- function(ins, x, ..., pos=NULL, offset=0){
stopifnot(is.numeric(pos),
is.numeric(offset),
!is.null(pos))
offset <- offset[1]
pat <- paste0("(?=.{", nchar(x) - pos - (offset-1), "}$)", collapse = "|")
gsub(pattern = pat, replacement = ins, x = x, ..., perl = TRUE)
}
# insert space at position 10
substrins(" ", "ABCDEFGHI56dfsdfd", pos = 10)
##[1] "ABCDEFGHI 56dfsdfd"
# insert pattern before position 10 (i.e. at position 9)
substrins(" ", "ABCDEFGHI56dfsdfd", pos = 10, offset=-1)
##[1] "ABCDEFGH I56dfsdfd"
# insert pattern after position 10 (i.e. at position 11)
substrins(" ", "ABCDEFGHI56dfsdfd", pos = 10, offset=1)
##[1] "ABCDEFGHI5 6dfsdfd"
Now to do what the OP wanted:
# insert space at position 4 and 8
substrins(" ", "ABCDEFGHI56dfsdfd", pos = c(4,8))
##[1] "ABC DEFG HI56dfsdfd"
# insert space after position 4 and 8 (as per OP's desired output)
substrins(" ", "ABCDEFGHI56dfsdfd", pos = c(4,8), offset=1)
##[1] "ABCD EFGH I56dfsdfd"
To replicate the other, more brute-force-y answer one would do:
set.seed(123)
x <- "ABCDEFGHI56dfsdfd"
for (i in 1:10) print(substrins(" ", x, pos = sample(nchar(x), 2)))
##[1] "ABCD EFGHI56d fsdfd"
##[1] "ABCDEF GHI56dfs dfd"
##[1] " ABCDEFGHI56dfsd fd"
##[1] "ABCDEFGH I56dfs dfd"
##[1] "ABCDEFG HI 56dfsdfd"
##[1] "ABCDEFG HI56dfsdf d"
##[1] "ABCDEFGHI 56 dfsdfd"
##[1] "A BCDEFGHI56dfs dfd"
##[1] " ABCD EFGHI56dfsdfd"
##[1] "ABCDE FGHI56dfsd fd"
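Without regular expressions, the same result can be had by cutting the string at the chosen positions and joining the pieces. The Python sketch below is an illustration only; insert_after is a made-up name, and positions are taken to mean "insert after this many characters", which matches the output asked for in the question.

```python
def insert_after(text, positions, ins=" "):
    """Insert `ins` after each of the given character counts."""
    pieces, prev = [], 0
    for p in sorted(positions):
        pieces.append(text[prev:p])
        prev = p
    pieces.append(text[prev:])
    return ins.join(pieces)

print(insert_after("ABCDEFGHI56dfsdfd", [4, 8]))
```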

R Conditional Replace/Trim with Fill (regex,gsub,gregexpr,regmatches)

I have a question involving conditional replace.
I essentially want to find every string of numbers and, for every consecutive digit after 4, replace it with a space.
I need the solution to be vectorized and speed is essential.
Here is a working (but inefficient) solution:
data <- data.frame(matrix(NA, ncol=2, nrow=6, dimnames=list(c(), c("input","output"))),
stringsAsFactors=FALSE)
data[1,] <- c("STRING WITH 2 FIX(ES): 123456 098765 1111 ",NA)
data[2,] <- c(" PADDED STRING WITH 3 FIX(ES): 123456 098765 111111 ",NA)
data[3,] <- c(" STRING WITH 0 FIX(ES): 12 098 111 ",NA)
data[4,] <- c(NA,NA)
data[5,] <- c("1234567890",NA)
data[6,] <- c(" 12345 67890 ",NA)
x2 <- data[,"input"]
x2
p1 <- "([0-9]+)"
m1 <- gregexpr(p1, x2,perl = TRUE)
nchar1 <- lapply(regmatches(x2, m1), function(x){
if (length(x)==0){ x <- NA } else ( x <- nchar(x))
return(x) })
x3 <- mapply(function(match,length,text,cutoff) {
temp_comb <- data.frame(match=match, length=length, stringsAsFactors=FALSE)
for(i in which(temp_comb[,"length"] > cutoff))
{
before <- substr(text, 1, (temp_comb[i,"match"]-1))
middle_4 <- substr(text, temp_comb[i,"match"], temp_comb[i,"match"]+cutoff-1)
middle_space <- paste(rep(" ", temp_comb[i,"length"]-cutoff),sep="",collapse="")
after <- substr(text, temp_comb[i,"match"]+temp_comb[i,"length"], nchar(text))
text <- paste(before,middle_4,middle_space,after,sep="")
}
return(text)
},match=m1,length=nchar1,text=x2,cutoff=4)
data[,"output"] <- x3
Is there a better way?
I was looking at the help section for regmatches and there was a similar type question, but it was full replacement with blanks and not conditional.
I would have written some alternatives and benchmarked them but honestly I couldn't think of other ways to do this.
Thanks ahead of time for the help!
UPDATE
Fleck,
Using your way but making cutoff an input, I am getting an error for the NA case:
#replace numbers after the 4th with spaces for those matches
zz<-lapply(regmatches(data$input, m), function(x,cutoff) {
# x <- regmatches(data$input, m)[[4]]
# cutoff <- 4
mapply(function(x, n, cutoff){
formatC(substr(x,1,cutoff), width=-n)
}, x=x, n=nchar(x),cutoff=cutoff)
},cutoff=4)

Here's a fast approach with just one gsub command:
gsub("(?<!\\d)(\\d{4})\\d*", "\\1", data$input, perl = TRUE)
# [1] "STRING WITH 2 FIX(ES): 1234 0987 1111 "
# [2] " PADDED STRING WITH 3 FIX(ES): 1234 0987 1111 "
# [3] " STRING WITH 0 FIX(ES): 12 098 111 "
# [4] NA
# [5] "1234"
# [6] " 1234 6789 "
The string (?<!\\d) is a negative lookbehind: it matches a position that is not preceded by a digit. The string (\\d{4}) means 4 consecutive digits. Finally, \\d* represents any number of digits. The part of the string that matches this regex is replaced by the first group (the first 4 digits).
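The truncating pattern is not specific to R. As a small illustration, the same substitution in Python might look like the sketch below; truncate_digit_runs and its keep parameter are made-up names, and missing values (NA in the R data) are not handled.

```python
import re

def truncate_digit_runs(text, keep=4):
    """Cut every run of digits down to its first `keep` digits."""
    # (?<!\d) anchors the match at the start of a digit run.
    return re.sub(r"(?<!\d)(\d{%d})\d*" % keep, r"\1", text)

print(truncate_digit_runs("STRING WITH 2 FIX(ES): 123456 098765 1111"))
print(truncate_digit_runs("1234567890"))
```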
An approach that does not change string length:
matches <- gregexpr("(?<=\\d{4})\\d+", data$input, perl = TRUE)
mapply(function(m, d) {
if (!is.na(m[1L]) && m[1L] != -1L) {
len <- attr(m, "match.length")
for (i in seq_along(m)) {
substr(d, m[i], m[i] + len[i] - 1L) <- paste(rep(" ", len[i]), collapse = "")
}
}
return(d)
}, matches, data$input)
# [1] "STRING WITH 2 FIX(ES): 1234 0987 1111 "
# [2] " PADDED STRING WITH 3 FIX(ES): 1234 0987 1111 "
# [3] " STRING WITH 0 FIX(ES): 12 098 111 "
# [4] NA
# [5] "1234 "
# [6] " 1234 6789 "

You can do the same in one line (and one space for one digit) with:
gsub("(?:\\G(?!\\A)|\\d{4})\\K\\d", " ", data$input, perl = TRUE)
details:
(?: # non-capturing group: the two possible entry points
\G # either the position after the last match or the start of the string
(?!\A) # exclude the start of the string position
| # OR
\d{4} # four digits
) # close the non-capturing group
\K # removes all on the left from the match result
\d # a single digit
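Python's built-in re module supports neither \G nor \K, so a direct port of this pattern is not possible there. A replacement function gives the same length-preserving result; the sketch below is an illustration with made-up names (blank_extra_digits, keep) and does not handle missing values.

```python
import re

def blank_extra_digits(text, keep=4):
    """Replace every digit after the first `keep` digits of a run with a space."""
    def pad(match):
        run = match.group()
        return run[:keep] + " " * (len(run) - keep)
    # Only runs longer than `keep` need any change.
    return re.sub(r"\d{%d,}" % (keep + 1), pad, text)

print(repr(blank_extra_digits("1234567890")))
print(repr(blank_extra_digits(" 12345 67890 ")))
```

Because each run is padded back to its original width, the output always has the same length as the input.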

Here's a way with gregexpr and regmatches
#find all numbers with more than 4 digits
m <- gregexpr("\\d{5,}", data$input)
#replace numbers after the 4th with spaces for those matches
zz<-lapply(regmatches(data$input, m), function(x) {
mapply(function(x, n) formatC(substr(x,1,4), width=-n), x, nchar(x))
})
#combine with original values
data$output2 <- unlist(Map(function(a,b) paste0(a,c(b,""), collapse=""),
regmatches(data$input, m, invert=T), zz))
The difference here is that it turns the NA value into "". We could add in other checks to prevent that or just turn all zero length strings into missing values at the end. I just didn't want to over-complicate the code with safety checks.
