Tcl: split list elements within a list

I am trying to split the elements of the sublists inside a list.
I want to go from:
beforelist: {{aa, bb, cc, dd, ee;} {ff, gg, hh, ii, jj;}}
to:
afterlist: {aa bb cc dd ee ff gg hh ii jj}
I tried the split command, but beforelist has a tricky part: the commas and semicolons attached to the elements.

If we remove the punctuation, we're left with two lists that can be concatenated:
set afterlist [concat {*}[string map {, "" ; ""} $beforelist]]

A couple of ways:
#!/usr/bin/env tclsh
set beforelist {{aa, bb, cc, dd, ee;} {ff, gg, hh, ii, jj;}}
# Iterate over each element of each list in beforelist and remove
# trailing commas and semicolons before appending to afterlist
set afterlist {}
foreach sublist $beforelist {
    foreach elem $sublist {
        lappend afterlist [regsub {[,;]$} $elem ""]
    }
}
puts "{$afterlist}"
# Remove all commas and semicolons at the end of a word followed by space
# or close brace, then append sublists to afterlist.
set afterlist {}
foreach sublist [regsub -all {\M[,;](?=[ \}])} $beforelist ""] {
    lappend afterlist {*}$sublist
}
puts "{$afterlist}"

This does not necessarily have to be tackled as a list problem. Treat it as plain string manipulation:
% set beforeList {{aa, bb, cc, dd, ee;} {ff, gg, hh, ii, jj;}}
{aa, bb, cc, dd, ee;} {ff, gg, hh, ii, jj;}
% string map {\{ "" \} "" \; "" , ""} $beforeList
aa bb cc dd ee ff gg hh ii jj
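The answers above all do the same two things: strip the trailing comma or semicolon from each element, and flatten the sublists into one list. As a language-neutral illustration only, here is a minimal Python sketch of that logic. The function name `flatten_and_strip` and the way the input is modelled (one string per sublist) are assumptions made for the example, not part of the Tcl answers.

```python
def flatten_and_strip(groups, punctuation=",;"):
    """Flatten nested token lists, dropping trailing punctuation from each token."""
    return [token.rstrip(punctuation) for group in groups for token in group]


# Hypothetical input mirroring beforelist: each sublist is one string,
# which is split on whitespace before flattening.
before = ["aa, bb, cc, dd, ee;", "ff, gg, hh, ii, jj;"]
after = flatten_and_strip(s.split() for s in before)
print(after)
```

Stripping only at the end of each token (like the `[,;]$` regsub) leaves punctuation inside an element alone, whereas a blanket `string map` removes it everywhere.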

Ruby Extract Slippery Text Columns

I need to get some columnized text into Ruby arrays. The entries are company names, phone numbers and websites. I've obscured the actual data in order to focus on the parsing rather than on the nature of the data, which I can deal with.
Here is the Gist.
As you can see, the nature of the columnar data changes, including:
leading whitespace width changes, from 0 to ~8
some lines are "" or \s+{3,}
column width changes depending on which block it's in (see how line 31 changes from 27)
therefore reliance upon using widths becomes problematic
some lines show empty entries in columns
empty column 1 on line 4 (example)
empty column 2 on line 2 (example)
empty column 3 on line 3 (example)
I want to get this organized into col1, col2 and col3 as arrays of entries. I can split them later on /\s*/ and choose the first element.
Given the obvious structure of these three columns, I think there must be a pragmatic way of parsing them out into arrays of entries, one per line.
Does anybody have any insight into how to parse out the columns? Columns -> arrays col1, col2, col3 is the format I'm looking for.
Any advice or insight is appreciated.
Let's suppose we gulp the file into a string, using IO::read, where the string is as follows.
str=<<~END
aaa bb cccc aaaaaaa aaaa bbb
aaaaaaaa aaaaaaaaa
aaaaa aaaaa bbbb
aaaaa bb cc aaaaaaa
aaa bbb aaaaaa bbb aaaaa bbbbbb
aaaa aaaaaaaaaaaa
aaaaaaaaa
a bb aaaaaaaaa
END
The first step is to divide the string into (two) blocks, which we can do as follows:
a1 = str.split(/\n{2,}/)
#=> ["aaa bb cccc aaaaaaa aaaa bbb\n aaaaaaaa aaaaaaaaa\n aaaaa aaaaa bbbb\naaaaa bb cc aaaaaaa",
# "aaa bbb aaaaaa bbb aaaaa bbbbbb\n aaaa aaaaaaaaaaaa\n aaaaaaaaa\n a bb aaaaaaaaa\n"]
Next, convert each of the two blocks to an array of lines.
a2 = a1.map { |s| s.chomp.split(/\n/) }
#=> [["aaa bb cccc aaaaaaa aaaa bbb",
# " aaaaaaaa aaaaaaaaa",
# " aaaaa aaaaa bbbb",
# "aaaaa bb cc aaaaaaa"],
# ["aaa bbb aaaaaa bbb aaaaa bbbbbb",
# " aaaa aaaaaaaaaaaa",
# " aaaaaaaaa",
# " a bb aaaaaaaaa"]]
We now need to map each line of each element of a2 to an array whose "columns" correspond to the columns of the original text.
a3 = a2.flat_map do |group|
  indent = group.map { |line| line =~ /\S/ }.min
  mx_len = group.map(&:length).max
  break_cols = (indent..mx_len-1).each_with_object([]) do |i,cols|
    cols << i if group.all? { |line| [" ", nil].include?(line[i]) }
  end
  b1, b2 = [break_cols.first, break_cols.last]
  group.map { |line| [line[0..b1-1], line[b1..b2-1], line[b2..-1]] }
end
#=> [["aaa bb cccc", " aaaaaaa ", " aaaa bbb"],
# [" aaaaaaaa ", " ", " aaaaaaaaa"],
# [" aaaaa ", " ", " aaaaa bbbb"],
# ["aaaaa bb cc", " aaaaaaa", nil],
# ["aaa bbb", " aaaaaa bbb ", " aaaaa bbbbbb"],
# [" aaaa ", " ", " aaaaaaaaaaaa"],
# [" ", " aaaaaaaaa", nil],
# [" a bb ", " ", " aaaaaaaaa"]]
line =~ /\S/ returns the index of the first character of line that is not whitespace (in a regular expression, \S matches any non-whitespace character), or nil if there is none.
See Enumerable#flat_map.
The following intermediate values were obtained in the calculation of a3.
For group 1:
mx_len = 37
indent = 0
break_cols = [11, 12, 13, 14, 23, 24, 25]
b1 = 11
b2 = 25
For group 2:
mx_len = 38
indent = 0
break_cols = [7, 8, 9, 20, 21, 22]
b1 = 7
b2 = 22
All that remains is to convert nil's to empty strings, strip spaces from the ends of each string and transpose the array.
a3.map { |col| col.map { |s| s.to_s.strip } }.transpose
#=> [["aaa bb cccc", "aaaaaaaa", "aaaaa", "aaaaa bb cc",
# "aaa bbb", "aaaa", "", "a bb"],
# ["aaaaaaa", "", "", "aaaaaaa", "aaaaaa bbb", "",
# "aaaaaaaaa", ""],
# ["aaaa bbb", "aaaaaaaaa", "aaaaa bbbb", "",
# "aaaaa bbbbbb", "aaaaaaaaaaaa", "", "aaaaaaaaa"]]
If desired, we could of course chain the above operations.
str.split(/\n{2,}/).
  map { |s| s.chomp.split(/\n/) }.
  flat_map do |group|
    indent = group.map { |line| line =~ /\S/ }.min
    mx_len = group.map(&:length).max
    break_cols = (indent..mx_len-1).each_with_object([]) do |i,cols|
      cols << i if group.all? { |line| [" ", nil].include?(line[i]) }
    end
    b1, b2 = [break_cols.first, break_cols.last]
    group.map { |line| [line[0..b1-1], line[b1..b2-1], line[b2..-1]] }
  end.map { |col| col.map { |s| s.to_s.strip } }.transpose
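The core idea of the answer above is that a character position is a column break when it is blank (or past the end of the line) in every line of the block. The following is a minimal Python sketch of that idea, not a translation of the Ruby code. The function name `split_columns` and the sample block are assumptions made for the example, and like the Ruby version it assumes exactly three columns and at least one non-blank line per block.

```python
def split_columns(lines):
    """Split a block of column-aligned lines into three stripped fields per line."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    # Smallest indentation of any non-blank line; positions before it are ignored.
    indent = min(len(l) - len(l.lstrip()) for l in padded if l.strip())
    # A position is a break column when it is blank in every line of the block.
    breaks = [i for i in range(indent, width) if all(l[i] == " " for l in padded)]
    b1, b2 = breaks[0], breaks[-1]
    return [[l[:b1].strip(), l[b1:b2].strip(), l[b2:].strip()] for l in padded]


# Hypothetical block, built with ljust so the column positions are explicit.
block = [
    "aaa bb".ljust(10) + "ccc".ljust(7) + "dd",
    "".ljust(10) + "ccc",
    "aaaa".ljust(10) + "".ljust(7) + "dd",
]
rows = split_columns(block)
columns = [list(col) for col in zip(*rows)]  # transpose rows into col1, col2, col3
print(columns)
```

Note that the space inside "aaa bb" is not treated as a break here only because another line has a letter at that position. With less overlap between entries, an internal space could be mistaken for a column gap.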
As Cary has demonstrated, working with widths is painful. That's what tripped me up. I took a different approach: String#gsub(/\s{2,44}/, '•') replaces every run of 2 to 44 whitespace characters with a delimiter, so each line can then be split on that delimiter:
require 'awesome_print' # provides ap; assumes the awesome_print gem is installed

col1, col2, col3 = [], [], []
master_data = []
lines = File.read(s).split("\n") # s holds the path to the input file
lines.each do |line|
  next if line.strip.empty?
  nline = line.gsub(/\s{2,44}/, '•') # mark every gap of 2+ whitespace characters
  nline[0] = '' if nline.start_with?('•') # drop the delimiter produced by leading indentation
  nline = nline.split('•')
  col1 << nline[0]
  col2 << nline[1]
  col3 << nline[2]
end
col1.compact!
col2.compact!
col3.compact!
# ap col1
# puts
# ap col2
# puts
# ap col3
counter = 0
col1.each do |i|
  if i.match?(/^\d{3}-\d{3}-\d{4}/) # matches a phone number, perhaps a big assumption
    company = [col1[counter-1], col1[counter], col1[counter+1]]
    master_data << company
  end
  counter += 1
end
# a company is a company name, phone number, and website
# do the same for col2 and col3
ap master_data

Getting sapply to have string elements

I want R to print all elements from a in a particular format (into a given text file). So here's what I use:
library(stringr)
a <- c("aa", "bb", "cc")
b <- sapply(a, function(x) cat(str_c("* ", x, "\n")))
This basically works, but it adds some extra stuff, namely:
> b
* aa
* bb
* cc
$aa
NULL
$bb
NULL
$cc
NULL
Likewise, for instance typeof(b[1]) is "list", but I'd expect it to be a string. I don't really get what's going on here. What makes R add this and what can I do to avoid it?
Many thanks in advance!
You could use paste0 and then cat with the sep argument; then there's no need for sapply:
> a <- c("aa", "bb", "cc")
> b <- paste0("* ", a)
> cat(b, sep = "\n")
# * aa
# * bb
# * cc
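The combination of cat and sapply is why you're getting a list as the result: cat prints its argument and returns NULL, so sapply collects one NULL per element into a named list. Because cat prints as a side effect, * aa, * bb and * cc appear before the list itself is shown.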
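The same pitfall exists in other languages, which may make the mechanism easier to see. As an analogy only (this is Python, not R): `print` writes its text and returns `None`, so collecting the return values of a printing call yields a list of `None`s, much like the list of NULLs above. The variable names are assumptions made for the example.

```python
items = ["aa", "bb", "cc"]

# print() emits the text as a side effect and returns None,
# so this list holds three None values, not the strings.
results = [print(f"* {x}") for x in items]

# Build the strings first, then print them once.
formatted = [f"* {x}" for x in items]
print("\n".join(formatted))
```

Separating "build the strings" from "print them" is the same design choice as using paste0 followed by a single cat.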
Here is an alternative that uses regular expressions.
> new <- gsub("([[:alpha:]]{2})", "* \\1", a)
> cat(new, sep = "\n")
The expression looks for a capturing group of two letters in a row and replaces it with an asterisk, a space, and the captured letters.
Then, call cat with the newline separator.
The first thing that came to my mind for this was actually sprintf:
cat(sprintf("* %s", a), sep = "\n")
However, this question is not entirely clear. It seems that on one level you want to be able to store this result and use it in this way later. In this case, maybe you want to consider defining your own print class, which isn't too difficult to do.
For example, here's a very basic print function for objects of class "myClass":
print.myClass <- function(x, ...) {
  # works on a basic vector or a list
  # sep = "" keeps cat from adding extra spaces around y
  invisible(sapply(x, function(y) cat("* ", y, "\n", sep = "")))
}
Let's see what it does...
First, create a copy of "a" and assign its class to be "myClass". Then, when you print it at the console, notice what happens:
b <- a
class(b) <- c(class(b), "myClass")
b
# * aa
# * bb
# * cc
It will work with a basic list too:
bL <- split(a, a)
class(bL) <- c(class(bL), "myClass")
bL
# * aa
# * bb
# * cc
And, since class is just an attribute, you can always call print.default or wrap the output in unclass:
print.default(bL)
# $aa
# [1] "aa"
#
# $bb
# [1] "bb"
#
# $cc
# [1] "cc"
#
# attr(,"class")
# [1] "list" "myClass"
And, as you expected:
typeof(bL[[1]])
# [1] "character"
For fun, also try...
## unclass(bL)
## unclass(b)
## print.myClass(a)

Changing spaces with "prxchange", but not all spaces

I need to change the spaces in my text to underscores, but only the spaces between words, not the ones between numbers. For example,
"The quick brown fox 99 07 3475"
Would become
"The_quick_brown_fox 99 07 3475"
I tried using this in a data step:
mytext = prxchange('s/\w\s\w/_/',-1,mytext);
But the result was not what I wanted:
"Th_uic_row_ox 99 07 3475"
Any ideas on what I could do?
Thanks in advance.
Data One ;
X = "The quick brown fox 99 07 3475" ;
Y = PrxChange( 's/(?<=[a-z])\s+(?=[a-z])/_/i' , -1 , X ) ;
Put X= Y= ;
Run ;
You are changing
"W W"
to
"_"
when you want to change
"W W"
to
"W_W"
so
prxchange('s/(\w)\s(\w)/$1_$2/',-1,mytext);
Full example, using [A-Za-z] instead of \w so that spaces next to digits are left untouched:
data test;
mytext='The quick brown fox 99 07 3475';
newtext = prxchange('s/([A-Za-z])\s([A-Za-z])/$1_$2/',-1,mytext);
put _all_;
run;
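The two fixes above differ in one detail that is easy to miss: capture groups consume the letters on both sides of the space, while lookarounds only check them. The following Python sketch (an illustration, not SAS; the function names are assumptions) runs both patterns with Python's `re` module. They agree on the sample text but differ when single-letter words are adjacent.

```python
import re


def join_words_lookaround(text):
    """Replace whitespace between two letters; the letters are only checked, not consumed."""
    return re.sub(r"(?<=[A-Za-z])\s+(?=[A-Za-z])", "_", text)


def join_words_groups(text):
    """Capture the letter on each side and write both back around the underscore."""
    return re.sub(r"([A-Za-z])\s([A-Za-z])", r"\1_\2", text)


sample = "The quick brown fox 99 07 3475"
print(join_words_lookaround(sample))   # The_quick_brown_fox 99 07 3475
print(join_words_groups(sample))       # The_quick_brown_fox 99 07 3475

# With one-letter words, the capture-group version consumes "b" in the first
# match, so the space after it is no longer preceded by an unmatched letter.
print(join_words_lookaround("a b c"))  # a_b_c
print(join_words_groups("a b c"))      # a_b c
```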
You can use the CALL PRXNEXT routine to find the position of each match, then use the SUBSTR function to replace the space with an underscore. I've changed your regular expression because \w matches any alphanumeric character, so it would also match the spaces between numbers. I'm not sure how you got your result using that expression.
Anyway, the code below should give you what you want.
data have;
mytext='The quick brown fox 99 07 3475';
_re=prxparse('/[a-z]\s[a-z]/i'); /* match a letter followed by a space followed by a letter, ignore case */
_start=1; /* starting position for search */
call prxnext(_re,_start,-1,mytext,_position,_length); /* find position of 1st match */
do while(_position>0); /* loop through all matches */
substr(mytext,_position+1,1)='_'; /* replace ' ' with '_' for matches */
_start=_start-2; /* prevents the next start position jumping 3 ahead (the length of the regex search string) */
call prxnext(_re,_start,-1,mytext,_position,_length); /* find position of next match */
end;
drop _: ;
run;

Look for specific character in string and place it at different positions after a defined separator in the same string

Let's define the following string s:
s <- "$ A; B; C;"
I need to translate s into:
"$ A; $B; $C;"
The semicolon is the separator. However, $ is only one of 3 special characters which can appear in the string. The data frame m holds all 3 special characters:
m <- data.frame(sp = c("$", "%", "&"))
I first used strsplit to split the string using the semicolon as the separator
> strsplit(s, ";")
[[1]]
[1] "$ A" " B" " C"
I think the next step would be to use grep or match to check whether the first substring contains any of the 3 special characters defined in data frame m. If so, maybe use gsub to insert the matched special character into the remaining substrings. Then simply use paste with collapse = "" to merge the substrings together again. Does that make sense?
Cheers
What about something like this:
getmeout = gsub("[$|%|& ]", "", unlist(strsplit(s, ";")))
whatspecial = unique(gsub("[^$|%|&]", "", s))
whatspecial
# [1] "$"
getmeout
# [1] "A" "B" "C"
paste0(whatspecial, getmeout, ";", collapse="")  # paste0 has no sep argument, so ";" is passed as a plain string
# [1] "$A;$B;$C;"
Here is one method:
library(stringr)
separator <- '; '
# extract the first part
first.part <- str_split(s, separator)[[1]][1]
first.part
# [1] "$ A"
# try to identify your special character
# fixed() makes "$" match literally; as a regex it would match the end of any string
special <- m$sp[str_detect(first.part, fixed(as.character(m$sp)))]
special
# [1] $
# Levels: $ & %
# make sure you only matched one of them
stopifnot(length(special) == 1)
# search and replace
gsub(separator, paste(separator, special, sep=""), s)
# [1] "$ A; $B; $C;"
Let me know if I missed some of your assumptions.
Back-referencing turns it into a one-liner:
s <- c( "$ A; B; C;", "& A; B; C;", "% A; B; C;" )
ms = c("$", "%", "&")
s <- gsub( paste0("([", paste(ms,collapse="") ,"]) ([A-Z]); ([A-Z]); ([A-Z]);") , "\\1 \\2; \\1\\3; \\1\\4;" , s)
> s
[1] "$ A; $B; $C;" "& A; &B; &C;" "% A; %B; %C;"
You can then make the regular expression appropriately generic (match more than one space, more than one alphanumeric character, etc.) if you need to.
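The approach described in the question (find the special character in the first field, then insert it after every separator) can be sketched without any regular expressions. Below is a minimal Python illustration of those steps, not an R solution. The function name `propagate_marker` and its parameters are assumptions made for the example.

```python
def propagate_marker(s, markers=("$", "%", "&"), separator="; "):
    """Copy the special character found in the first field to every later field."""
    first_part = s.split(separator)[0]
    found = [m for m in markers if m in first_part]
    # Mirror the stopifnot() check: exactly one special character is expected.
    if len(found) != 1:
        raise ValueError(f"expected exactly one marker in {first_part!r}, found {found}")
    return s.replace(separator, separator + found[0])


print(propagate_marker("$ A; B; C;"))  # $ A; $B; $C;
print(propagate_marker("% A; B; C;"))  # % A; %B; %C;
```

Because the markers are compared as plain substrings, characters such as `$` need no escaping here.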

how to use index of awk for loop in the regular expression

I have shortened the problem; my actual data is much longer than this.
I have a file like:
aa, bb, cc, dd, ee, 4
ff, gg, hh, ii, jj, 5
kk, ll, mm, nn, oo, 3
pp, qq, rr, ss, tt, 2
uu, vv, ww, xx, yy, 5
aa, bb, cc, dd, ee, 2
Now I want to use awk to select the lines that share the same number in the last column and redirect them into a new file; which file a line goes to depends on the number in its last column.
E.g. t2.txt, t3.txt, t4.txt and t5.txt will hold the lines whose last number is 2, 3, 4 and 5 respectively.
in t2.txt:
pp, qq, rr, ss, tt, 2
aa, bb, cc, dd, ee, 2
in t3.txt:
kk, ll, mm, nn, oo, 3
in t4.txt:
aa, bb, cc, dd, ee, 4
in t5.txt:
ff, gg, hh, ii, jj, 5
uu, vv, ww, xx, yy, 5
I guess I need something like this:
BEGIN {FS=","}
{
for (n=2; n<=5; n++)
if ($6 ~/\$n/) {print > "t\$n.txt"}
}
But I just don't know how to make it work.
This bash script does what I want, but each time it extracts the lines for a specific number it has to read all the lines again. How can I read the file ONLY ONCE and extract the lines for all numbers?
#!/bin/bash
for num in {2..5}; do
gawk --assign FS="," "\$6 ~/${num}/" infile >> t${num}.txt
done
Try the following command:
awk '{ print $0 > ("t" $NF ".txt") }' infile
There is no need to change FS: with the default whitespace splitting, the number is still the last field. And you can immediately access the last field through the NF variable, as $NF.
NB: The filename string concatenation needs to be wrapped in parens; otherwise the expression after > is ambiguous, and some awk implementations misparse or reject it.
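The one-liner works because it makes a single pass over the input and picks the output file from the last field of each line. As a language-neutral illustration of that single-pass grouping (Python, not awk), here is a minimal sketch. The names `group_by_last_field` and `write_groups`, and the `t<key>.txt` naming, are assumptions made for the example.

```python
from collections import defaultdict


def group_by_last_field(lines, sep=","):
    """Group lines by their last separator-delimited field in one pass."""
    groups = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        key = line.rsplit(sep, 1)[-1].strip()
        groups[key].append(line)
    return dict(groups)


def write_groups(groups, prefix="t", suffix=".txt"):
    """Hypothetical helper: write each group to <prefix><key><suffix>."""
    for key, rows in groups.items():
        with open(f"{prefix}{key}{suffix}", "w") as fh:
            fh.write("\n".join(rows) + "\n")


sample = [
    "aa, bb, cc, dd, ee, 4",
    "ff, gg, hh, ii, jj, 5",
    "pp, qq, rr, ss, tt, 2",
    "uu, vv, ww, xx, yy, 5",
    "aa, bb, cc, dd, ee, 2",
]
grouped = group_by_last_field(sample)
print(sorted(grouped))  # ['2', '4', '5']
```

Calling `write_groups(grouped)` would then create one file per key, which corresponds to what the `print > ("t" $NF ".txt")` redirection does line by line.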
I got an answer; the following works, but any further explanation would be welcome.
BEGIN {FS=","}
{
    # n is an awk variable; $n would mean "field number n", and /.../ does not interpolate variables
    for (n=1; n<=5; n++)
        if ($6+0 == n) {print > ("new" n ".txt")}
}