VIM padding with appropriate number of ",0" to get CSV file - regex

I have a file containing numbers like
1, 2, 3
4, 5
6, 7, 8, 9,10,11
12,13,14,15,16
...
I want to create a CSV file by padding each line such that there are 6 values separated by 5 commas, so I need to add to each line an appropriate number of ",0". It shall look like
1, 2, 3, 0, 0, 0
4, 5, 0, 0, 0, 0
6, 7, 8, 9,10,11
12,13,14,15,16, 0
...
How would I do this with VIM?
Can I count the number of "," in a line with regular expressions and add the correct number of ",0" to each line with the substitute s command?

You can achieve that by typing this command:
:g/^/ s/^.*$/&,0,0,0,0,0,0/ | normal! 6f,D

You can first append six zeros to every line, regardless of how many numbers it already has, and then delete everything from the sixth comma to the end of each line.
To append them (note A, which appends at the end of the line; i would insert at the cursor instead),
:1,$ normal! A,0,0,0,0,0,0
To delete from the sixth comma to the end of the line,
:1,$ normal! ^6f,D
^ moves to the first non-blank character in the line (a number here)
6f, jumps to the sixth comma
D deletes from the cursor to the end of the line
Example:
Original
1,2
3,6,7,0,0,0
4,5,6
11,12,13
After adding six zeroes,
1,2,0,0,0,0,0,0
3,6,7,0,0,0,0,0,0,0,0,0
4,5,6,0,0,0,0,0,0
11,12,13,0,0,0,0,0,0
After removing from the sixth comma to the end of the line
1,2,0,0,0,0
3,6,7,0,0,0
4,5,6,0,0,0
11,12,13,0,0,0
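For readers who want to check the pad-then-truncate idea outside Vim, here is a minimal Python sketch of the same two steps. The function name pad_then_truncate and the width parameter are made up for illustration; they are not part of the answer above.

```python
def pad_then_truncate(line, width=6):
    """Append `width` zero fields, then keep only the first `width` fields."""
    # Step 1: blindly append ",0" six times, like A,0,0,0,0,0,0 in Vim.
    padded = line + ",0" * width
    # Step 2: drop everything from the sixth comma on, like ^6f,D in Vim.
    return ",".join(padded.split(",")[:width])


if __name__ == "__main__":
    for row in ["1,2", "3,6,7,0,0,0", "4,5,6", "11,12,13"]:
        print(pad_then_truncate(row))
```

Every printed row should have exactly six values, whatever the length of the input row.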

With perl:
perl -lpe '$_ .= ",0" x (5 - tr/,//)' file.txt
With awk:
awk -v FS=, -v OFS=, '{ for(i = NF+1; i <= 6; i++) $i = 0 } 1' file.txt
With sed:
sed ':b /^\([^,]*,\)\{5\}/ b; { s/$/,0/; b b }' file.txt
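All three one-liners rest on the same idea: count the commas already on the line and append only the missing ",0" fields. A rough Python sketch of that counting approach follows; pad_by_count is a hypothetical name, and the default of six fields comes from the question.

```python
def pad_by_count(line, width=6):
    """Append ",0" once for every comma missing from a `width`-field row."""
    missing = (width - 1) - line.count(",")
    # Lines that already have enough commas are returned unchanged.
    return line + ",0" * max(missing, 0)


if __name__ == "__main__":
    for row in ["1, 2, 3", "4, 5", "6, 7, 8, 9,10,11", "12,13,14,15,16"]:
        print(pad_by_count(row))
```

Unlike the pad-then-truncate approach, this never shortens a line that has more than six values.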

As for doing this from inside Vim: you can also pipe text through an external program, and Vim will replace the input with the program's output. That's an easy way to leverage sorting, deduping, grep-based filtering, etc., or some of Sato's suggestions. So, if you have a script called standardize_commas.py, select your block in visual line mode (shift+v, then extend the selection) and type something like :! python /tmp/standardize_commas.py. Vim prepends the range '<,'> to the command, indicating that it will run on the currently selected lines.
FYI, this was my /tmp/standardize_commas.py script:
import sys

max_width = 0
rows = []
# First pass: split every line and remember the widest row.
for line in sys.stdin:
    line = line.strip()
    existing_vals = line.split(",")
    rows.append(existing_vals)
    max_width = max(max_width, len(existing_vals))
# Second pass: pad each row with "0" up to the widest row.
for row in rows:
    zeros_needed = max_width - len(row)
    full_values = row + ["0"] * zeros_needed
    print(",".join(full_values))

Related

Indent spaces to tabs

I really have a problem reading code indented with spaces, so I use the Visual Studio Code editor to convert the indentation from spaces to tabs before I read it.
But the problem is that Rails has a lot of files, and I have to repeat the same operation for each one. So, I want to use Dir.glob to iterate over all of them, convert spaces to tabs, and overwrite those files. It is a terrible idea, but still...
Currently my String#spaces_to_tabs() method looks like this:
Code
# A method that works for now...
String.define_method(:spaces_to_tabs) do
  each_line.map do |x|
    match = x.match(/^([^\S\t\n\r]*)/)[0]
    m_len = match.length
    (m_len > 0 && m_len % 2 == 0) ? ?\t * (m_len / 2) + x[m_len .. -1] : x
  end.join
end
Which kind of works
Here's a test:
# Put some content that will get converted to space
content = <<~EOF << '# Hello!'
def x
  'Hello World'
end
p x
module X
  refine Array do
    define_method(:tally2) do
      uniq.reduce({}) { |h, x| h.merge!( x => count(x) ) }
    end
  end
end
using X
[1, 2, 3, 4, 4, 4,?a, ?b, ?a].tally2
p [1, 2, 3, 4, 4, 4,?a, ?b, ?a].tally2
\r\r\t\t # Some invalid content
EOF
puts content.spaces_to_tabs
Output:
def x
	'Hello World'
end
p x
module X
	refine Array do
		define_method(:tally2) do
			uniq.reduce({}) { |h, x| h.merge!( x => count(x) ) }
		end
	end
end
using X
[1, 2, 3, 4, 4, 4,?a, ?b, ?a].tally2
p [1, 2, 3, 4, 4, 4,?a, ?b, ?a].tally2
# Some invalid content
# Hello!
Currently it does not:
Affect white-spaces (\t, \r, \n) other than spaces.
Affect the output of code, only converts spaces to tabs.
I can't use my editor for this because:
With Dir.glob (not included in this example), I can restrict the conversion to .rb, .js, .erb, .html, .css, and .scss files.
This approach is slow, but I will have at most 1000 such files with 1000 lines each, and that is an impractical upper bound; I generally have fewer than 100 files with a couple of hundred lines each. Even if the code takes 10 seconds, that is not a problem, since I only need to run it once per project...
Is there a better way to do it?
Edit
Here's the full code with globbing for converting all major files in rails:
#!/usr/bin/ruby -w
String.define_method(:bold) { "\e[1m#{self}" }
String.define_method(:spaces_to_tabs) do
  each_line.map do |x|
    match = x.match(/^([^\S\t\n\r]*)/)[0]
    m_len = match.length
    (m_len > 0 && m_len % 2 == 0) ? ?\t * (m_len / 2) + x[m_len .. -1] : x
  end.join
end
GREEN = "\e[38;2;85;160;10m".freeze
BLUE = "\e[38;2;0;125;255m".freeze
TURQUOISE = "\e[38;2;60;230;180m".freeze
RESET = "\e[0m".freeze
BLINK = "\e[5m".freeze
dry_test = ARGV.any? { |x| x[/^\-(\-dry\-test|d)$/] }
puts "#{TURQUOISE.bold}:: Info:#{RESET}#{TURQUOISE} Running in Dry Test mode. Files will not be changed.#{RESET}\n\n" if dry_test
Dir.glob("{app,config,db,lib,public}/**/**.{rb,erb,js,css,scss,html}").map do |y|
  if File.file?(y) && File.readable?(y)
    read = IO.read(y)
    converted = read.spaces_to_tabs
    unless read == converted
      puts "#{BLINK}#{BLUE.bold}:: Converting#{RESET}#{GREEN} indentation to tabs of #{y}#{RESET}"
      IO.write(y, converted) unless dry_test
    end
  end
end
If this is just an intellectual exercise about tab indentation algorithms, then fine. If you really have trouble viewing the files, use RuboCop. It has configuration options that let you beautify the code and control the kind of whitespace it generates and the indentation width it applies. I use it with Atom and atom-beautify, but I'm sure it has a plugin for VS Code too. https://docs.rubocop.org/rubocop/0.86/cops_layout.html#layoutindentationconsistency
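For comparison, the conversion rule used by the question's spaces_to_tabs method (turn each pair of leading spaces into one tab, but only when the count of leading spaces is even) can be sketched in Python as below. This is a simplified illustration, not a port: it assumes plain spaces only, and spaces_per_tab is a hypothetical parameter the Ruby code does not have.

```python
import re


def spaces_to_tabs(text, spaces_per_tab=2):
    """Replace leading runs of spaces with tabs, one tab per `spaces_per_tab`."""
    out = []
    for line in text.splitlines(keepends=True):
        # Length of the run of plain spaces at the start of the line.
        n = len(re.match(r" *", line).group(0))
        # Lines with an odd (non-multiple) indent are left untouched,
        # mirroring the m_len % 2 == 0 check in the Ruby version.
        if n > 0 and n % spaces_per_tab == 0:
            line = "\t" * (n // spaces_per_tab) + line[n:]
        out.append(line)
    return "".join(out)


if __name__ == "__main__":
    sample = "def x\n  'Hello World'\nend\n"
    print(repr(spaces_to_tabs(sample)))
```

Processing the text line by line keeps tabs, carriage returns, and the code after the indentation exactly as they were.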

Matching strings across non-consecutive rows with AWK

I have been working with an AWK one-liner that does a good job of identifying string matches on previous rows, i.e. comparing field x on row n with field y on row (n+1). E.g., say input file consists of rows, 3 fields each:
A B C
B B B
C C C
D B D
The one-liner is:
awk "$2==a[2] { print a[1],a[2],a[3] } { for (i=1;i<=NF;i++) a[i]=$i }"
So this example prints out all three fields of any immediately previous row that matches on field 2, which in this case is only row 1. So the output would be:
A B C
Now, I'm wondering if there is a modification to this command that will allow me to find matches between the current row and the row that is 2 rows before it, or 3 rows before it, or even 4 rows before it.
So using the same sample input file, if I was trying to make matches for "2 rows before", on field 2, it would now only output
B B B
which is row 2, because it is the only instance of the 2nd field ("B") matching with the second field in the row that is 2 rows ahead (i.e. row 4).
I'm not terribly familiar with arrays. I'm guessing the run time will suffer but is the original command modifiable in this way ?
You could use this awk:
awk 'a[FNR%n,m]==$m {print a[FNR%n]}{a[FNR%n]=$0; a[FNR%n,m]=$m}' n=2 m=2 file.txt
The above prints the line n rows before the current line if field m matches in both lines (here n=2 and m=2, as in the question).
It keeps memory use in check by storing only the last n lines. If you don't care much about memory consumption, you can do this instead:
awk '(FNR-n,$m) in a {print a[FNR-n,$m]}{a[FNR,$m]=$0}' n=2 m=2 file.txt
You may use this awk solution:
cat prev.awk
FNR > p && n = split(row[FNR-p], cols) && $2 == cols[2] {
  print row[FNR-p]
}
{
  row[FNR] = $0
}
Then use it for current-2 row matching:
awk -v p=2 -f prev.awk file
B B B
and current-1 row matching:
awk -v p=1 -f prev.awk file
A B C

Bash - word/term frequency per line (i.e. document)

I have a file rev.txt like this:
header1,header2
1, some text here
2, some more text here
3, text and more text here
I also have a vocabulary document with all unique words from rev.txt, like so (but sorted):
a
word
list
text
here
some
more
and
I want to generate a term frequency table for each line in rev.txt, listing the occurrences of each vocabulary word in each line of rev.txt, like so:
0 0 0 1 1 1 0 0
0 0 0 1 1 1 1 0
0 0 0 2 1 0 1 1
They could be comma separated as well.
This is similar to a question here. However, instead of searching through the entire document, I want to do this line by line, using the complete vocabulary I already have.
Re: Jean-François Fabre
Actually, I am performing these in MATLAB. However, bash (I believe) would be faster for this preprocessing as I have direct disk access to the files.
Normally, I would use Python, but limiting myself to shell tools, this hacky one-liner works for the given test case.
perl -pe 's|^.*?,[ ]?(.*)|\1|' rev.txt | sed '1d' | awk -F' ' 'FILENAME=="wordlist.txt" {wc[$1]=0; wl[wllen++]=$1; next}; {for(i=1; i<=NF; i++){wc[$i]++}; for(i=0; i<wllen; i++){print wc[wl[i]]" "; wc[wl[i]]=0; if(i+1==wllen){print "\n"} }}' ORS="" wordlist.txt -
Explanation/My thinking...
The first part, perl -pe 's|^.*?,[ ]?(.*)|\1|' rev.txt, pulls out everything after the first comma (and removes the leading space) from "rev.txt".
The next part, sed '1d', removes the first line, i.e. the header.
In the next part, awk -F' ' ... ORS="" wordlist.txt - uses whitespace as the field delimiter, sets the output record separator to the empty string (note: we print the separators ourselves as we go), and reads input from wordlist.txt (i.e. the "vocabulary document with all unique words from rev.txt") and then from stdin.
In the awk command, if FILENAME is equal to "wordlist.txt", then (1) initialize the array wc, whose keys are the vocab words and whose counts start at 0, and (2) initialize a list wl in which the words are in the same order as in wordlist.txt.
FILENAME=="wordlist.txt" {
  wc[$1]=0;
  wl[wllen++]=$1;
  next
};
After initialization, for each word in a line of stdin (i.e. the tidy rev.txt), increment the count of the word in wc.
{
  for (i=1; i<=NF; i++) {
    wc[$i]++
  };
After the word counts have been added for a line, for each word in the list of words wl, print the count of that word followed by a space and reset the count in wc back to 0. If the word is the last in the list, then add a newline to the output.
  for (i=0; i<wllen; i++) {
    print wc[wl[i]]" ";
    wc[wl[i]]=0;
    if(i+1==wllen){
      print "\n"
    }
  }
}
Overall, this should produce the specified output.
Here's one in awk. It reads in the vocabulary file voc.txt (it's a piece of cake to produce it automatically in awk), copies the word list for each row of text and counts the word frequencies:
$ cat program.awk
BEGIN {
  PROCINFO["sorted_in"]="@ind_str_asc" # order for copying vocabulary array w
}
NR==FNR {                              # store the voc.txt to w
  w[$1]=0
  next
}
FNR>1 {                                # process text files to matrix
  for(i in w)                          # copy voc array
    a[i]=0
  for(i=2; i<=NF; i++)                 # count freqs
    a[$i]++
  for(i in a)                          # output matrix row
    printf "%s%s", a[i], OFS
  print ""
}
Run it:
$ awk -f program.awk voc.txt rev.txt
0 0 1 0 0 1 1 0
0 0 1 0 1 1 1 0
0 1 1 0 1 0 2 0
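The per-line counting that both awk answers perform can be sketched in Python as follows. It keeps the vocabulary in the order it is given (so the columns line up with the table in the question rather than with the sorted output just above), and it assumes the id is everything before the first comma. term_frequency_rows is a hypothetical helper name.

```python
from collections import Counter


def term_frequency_rows(lines, vocab):
    """Return one list of counts per data line, in `vocab` order."""
    rows = []
    for line in lines[1:]:  # skip the header line
        # Drop the id: keep only the text after the first comma.
        _, _, text = line.partition(",")
        counts = Counter(text.split())
        # Counter returns 0 for vocabulary words absent from the line.
        rows.append([counts[word] for word in vocab])
    return rows


if __name__ == "__main__":
    vocab = ["a", "word", "list", "text", "here", "some", "more", "and"]
    rev = [
        "header1,header2",
        "1, some text here",
        "2, some more text here",
        "3, text and more text here",
    ]
    for row in term_frequency_rows(rev, vocab):
        print(" ".join(str(n) for n in row))
```

Words that appear in a line but not in the vocabulary are simply ignored.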

Trouble parsing custom structure with shell script utilities like sed / awk / grep

I am trying to use a shell script to parse a complex list of structures out of a text file, and search those structures for a very specific set of values. If there is a match then I need to print the values of one variable. I am limited to lightweight utilities like sed, awk, and grep (but not Perl).
Here is an example of the structure, followed by an explanation of what I am looking for:
{
{ 1, 2,
{ 15, 25 },
{ 15, 25 }
},
{ 3, 4,
{ 35, 45 },
{ 35, 45 }
},
{ 5, 6,
{ 55, 65 },
{ 55, 65 }
}
};
In this example I would be parsing the three structures and searching for a structure which has a "3" as the first value, has any single digit (0-9) as the second value, and at least one set of "35" and "45" in the inner list of structures. Once I have located a match I would then print the value of the second value. In this case the second structure would match, and I would need to print out the value "4".
I don't want to assume anything about how the whitespace is organized, only that the format above is followed. I.e. it could all be on a single line or have different combinations of line breaks in random places.
Can someone please help me think about how to approach this problem?
This may be what you want, using GNU awk for various extensions:
$ cat tst.awk
BEGIN { RS="[{}]"; FS="\\s*,\\s*" }
depth == 2 { split($0,outer) }
(depth == 3) && (outer[1]==3) && (outer[2]~/^[0-9]$/) &&
((($1==35) && ($2==45)) || (($1==45) && ($2==35))) { print outer[2] }
{ depth = depth + (RT=="{" ? 1 : -1) }
$ awk -f tst.awk file
4
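The gawk script works by tracking brace depth rather than line layout. A rough Python sketch of the same depth-tracking idea is shown below; it tokenizes braces and numbers, so whitespace and line breaks do not matter. The function name and its parameters are invented for this example, and unlike the gawk script it only accepts the inner pair in the order 35, 45.

```python
import re


def find_second_values(text, first="3", pair=("35", "45")):
    """Return the second value of each structure matching the criteria."""
    results = []
    depth = 0
    outer, inner, reported = [], [], False
    for tok in re.findall(r"[{}]|\d+", text):
        if tok == "{":
            depth += 1
            if depth == 2:      # a new structure starts
                outer, reported = [], False
            elif depth == 3:    # a new inner pair starts
                inner = []
        elif tok == "}":
            if depth == 3 and not reported:
                if (len(outer) >= 2 and outer[0] == first
                        and len(outer[1]) == 1 and tuple(inner) == pair):
                    results.append(outer[1])
                    reported = True  # report each structure only once
            depth -= 1
        elif depth == 2:
            outer.append(tok)   # values directly inside a structure
        elif depth == 3:
            inner.append(tok)   # values inside an inner pair
    return results


if __name__ == "__main__":
    data = "{ { 1, 2, { 15, 25 }, { 15, 25 } }, { 3, 4,\n{ 35, 45 },{ 35, 45 } }, { 5, 6, { 55, 65 }, { 55, 65 } } };"
    print(find_second_values(data))
```

The values are compared as strings, so "single digit" is checked simply by the length of the second token.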
A non-robust awk attempt that relies on the line layout shown in the question:
$ awk -F"[{,}]" '/{/ && !/}/{c=($2==3)?+$3:""}
c~/^[0-9]$/ && $2==35 && $3==45{print c;exit}' file
4
Thank you all for the help on this one. I was eventually able to solve it using only sed and tr, although it wasn't pretty. I used tr to join all of the lines together, then sed to remove the outer { };, sed again to split the lines along structure boundaries by using backreferences, sed again to clean up commas and whitespace between structures, and then sed -n -r "s//\1/p" to validate the expected values in the pattern and to print only the matching variable.
I will take a look at your examples and see if I can learn from them.

AWK - Search for a pattern-add it as a variable-search for next line that isn't a variable & print it + variable

I have a given file:
application_1.pp
application_2.pp
#application_2_version => '1.0.0.1-r1',
application_2_version => '1.0.0.2-r3',
application_3.pp
#application_3_version => '2.0.0.1-r4',
application_3_version => '2.0.0.2-r7',
application_4.pp
application_5.pp
#application_5_version => '3.0.0.1-r8',
application_5_version => '3.0.0.2-r9',
I would like to be able to read this file and search for the string
".pp"
When that string is found, it adds that line into a variable and stores it.
It then reads the next line of the file. If it encounters a line preceded by a # it ignores it and moves onto the next line.
If it comes across a line that does not contain ".pp" and doesn't start with #, it should print out that line next to the last stored variable in a new file.
The output would look like this:
application_1.pp
application_2.pp application_2_version => '1.0.0.2-r3',
application_3.pp application_3_version => '2.0.0.2-r7',
application_4.pp
application_5.pp application_5_version => '3.0.0.2-r9',
I would like to achieve this with awk. If somebody knows how to do this and it is a simple solution, I would be happy if they could share it with me. If it is more complex, it would be helpful to know what in awk I need to understand in order to do this (arrays, variables, etc.). Can it even be achieved with awk, or is another tool necessary?
Thanks,
I'd say
awk '/\.pp/ { if(NR != 1) print line; line = $0; next } NF != 0 && substr($1, 1, 1) != "#" { line = line $0 } END { print line }' filename
This works as follows:
/\.pp/ {                                 # if a line contains ".pp"
  if(NR != 1) {                          # unless we just started
    print line                           # print the last assembled line
  }
  line = $0                              # and remember this new one
  next                                   # and we're done here.
}
NF != 0 && substr($1, 1, 1) != "#" {     # otherwise, unless the line is empty
                                         # or a comment
  line = line $0                         # append it to the line we're building
}
END {                                    # in the end,
  print line                             # print the last line.
}
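The same accumulate-and-flush logic can be sketched in Python for comparison. This version strips each line and joins with a single space, which is an assumption made here; the awk program instead concatenates the raw line, including whatever leading whitespace it carries. join_versions is a hypothetical name.

```python
def join_versions(lines):
    """Attach each non-comment line to the most recent ".pp" line."""
    out = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue                  # skip blanks and commented-out lines
        if ".pp" in line:
            out.append(line)          # start a new output row
        elif out:
            out[-1] += " " + line     # append to the row being built
    return out


if __name__ == "__main__":
    sample = [
        "application_1.pp",
        "application_2.pp",
        "    #application_2_version => '1.0.0.1-r1',",
        "    application_2_version => '1.0.0.2-r3',",
        "application_4.pp",
    ]
    print("\n".join(join_versions(sample)))
```

Building the rows in a list avoids the special case for the first line that the awk version handles with NR != 1.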
You can use sed:
#n
/\.pp/{
  h
  :loop
  n
  /[^#]application.*version/{
    H
    g
    s/\n[[:space:]]*/\t/
    p
    b
  }
  /\.pp/{
    x
    p
  }
  b loop
}
If you save this as s.sed and run
sed -f s.sed file
You will get this output
application_1.pp
application_2.pp application_2_version => '1.0.0.2-r3',
application_3.pp application_3_version => '2.0.0.2-r7',
application_4.pp
application_5.pp application_5_version => '3.0.0.2-r9',
Explanation
The #n on the first line suppresses normal output.
Once we match /\.pp/, we store that line in the hold space with h and start the loop.
We go to the next line with n.
If it matches /[^#]application.*version/, meaning "application" is not immediately preceded by a #, then we append the line to the hold space with H, copy the hold space to the pattern space with g, and replace the newline and any subsequent whitespace with a tab. Finally we print with p and skip to the end of the script with b.
If it matches /\.pp/, then we swap the pattern and hold spaces with x, and print with p.