Bash - word/term frequency per line (i.e. document) - regex

I have a file rev.txt like this:
header1,header2
1, some text here
2, some more text here
3, text and more text here
I also have a vocabulary document with all unique words from rev.txt, like so (but sorted):
a
word
list
text
here
some
more
and
I want to generate a term frequency table for each line (i.e. document) in rev.txt, listing the occurrence count of each vocabulary word in that line, like so:
0 0 0 1 1 1 0 0
0 0 0 1 1 1 1 0
0 0 0 2 1 0 1 1
They could be comma separated as well.
This is similar to a question here. However, instead of searching through the entire document, I want to do this line by line, using the complete vocabulary I already have.
Re: Jean-François Fabre
Actually, I am performing these steps in MATLAB. However, bash (I believe) would be faster for this preprocessing, as I have direct disk access to the files.

Normally I would use python, but limiting myself to bash, this hacky one-liner solution works for the given test case.
perl -pe 's|^.*?,[ ]?(.*)|\1|' rev.txt | sed '1d' | awk -F' ' 'FILENAME=="wordlist.txt" {wc[$1]=0; wl[wllen++]=$1; next}; {for(i=1; i<=NF; i++){wc[$i]++}; for(i=0; i<wllen; i++){print wc[wl[i]]" "; wc[wl[i]]=0; if(i+1==wllen){print "\n"} }}' ORS="" wordlist.txt -
Explanation/My thinking...
In the first part, perl -pe 's|^.*?,[ ]?(.*)|\1|' rev.txt pulls out everything after the first comma (and drops the leading whitespace) from rev.txt.
In the next part, sed '1d' removes the first line, i.e. the header.
In the next part, we specified awk -F' ' ... ORS="" wordlist.txt - to use whitespace as the field separator, to set the output record separator (ORS) to the empty string (note: we print records as we go), and to read input first from wordlist.txt (i.e. the "vocabulary document with all unique words from rev.txt") and then from stdin (-).
In the awk command, if the FILENAME is equal to "wordlist.txt", then (1) initialize array wc where the keys are the vocab words and the count is 0, and (2) initialize a list wl where the word order in the same as wordlist.txt.
FILENAME=="wordlist.txt" {
    wc[$1]=0;
    wl[wllen++]=$1;
    next
};
After initialization, for each word in a line of stdin (i.e. the tidied rev.txt), increment the count of the word in wc.
{ for (i=1; i<=NF; i++) {
      wc[$i]++
  };
After the word counts have been added for a line, for each word in the list wl, print the count of that word followed by a space and reset its count in wc back to 0. If the word is the last in the list, print a newline.
  for (i=0; i<wllen; i++) {
      print wc[wl[i]]" ";
      wc[wl[i]]=0;
      if (i+1==wllen) {
          print "\n"
      }
  }
}
Overall, this should produce the specified output.
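Assuming wordlist.txt holds the eight words in the order shown above (the expected output implies list order rather than sorted order), running the pipeline against the sample rev.txt should print:
0 0 0 1 1 1 0 0
0 0 0 1 1 1 1 0
0 0 0 2 1 0 1 1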

Here's one in awk. It reads in the vocabulary file voc.txt (easy to produce automatically; see the sketch after the example run below), copies the word list for each row of text and counts the word frequencies:
$ cat program.awk
BEGIN {
    PROCINFO["sorted_in"]="@ind_str_asc"  # gawk-specific: traverse arrays in sorted index order (for copying vocabulary array w)
}
NR==FNR {                                 # store the voc.txt words in w
    w[$1]=0
    next
}
FNR>1 {                                   # process the text file (header skipped) into matrix rows
    for(i in w)                           # copy voc array
        a[i]=0
    for(i=2; i<=NF; i++)                  # count freqs ($1 is the line number)
        a[$i]++
    for(i in a)                           # output matrix row
        printf "%s%s", a[i], OFS
    print ""
}
Run it:
$ awk -f program.awk voc.txt rev.txt
0 0 1 0 0 1 1 0
0 0 1 0 1 1 1 0
0 1 1 0 1 0 2 0
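As for producing voc.txt automatically, here is a minimal sketch (assuming the words in rev.txt are separated only by commas and blanks; field 1, the line number, is skipped, and sort -u sorts and dedupes):
$ awk -F'[ ,]+' 'FNR>1{for(i=2; i<=NF; i++) print $i}' rev.txt | sort -u > voc.txt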

Matching strings across non-consecutive rows with AWK

I have been working with an AWK one-liner that does a good job of identifying string matches on previous rows, i.e. comparing field x on row n with field y on row (n+1). E.g., say input file consists of rows, 3 fields each:
A B C
B B B
C C C
D B D
The one-liner is:
awk "$2==a[2] { print a[1],a[2],a[3] } { for (i=1;i<=NF;i++) a[i]=$i }"
So this example prints out all three fields of any immediately previous row that matches on field 2, which in this case is only row 1. So the output would be:
A B C
Now, I'm wondering if there is a modification to this command that will allow me to find matches between the current row and the row that is 2 rows before it, or 3 rows before it, or even 4 rows before it.
So using the same sample input file, if I was trying to make matches for "2 rows before", on field 2, it would now only output
B B B
which is row 2, because it is the only instance of the 2nd field ("B") matching with the second field in the row that is 2 rows ahead (i.e. row 4).
I'm not terribly familiar with arrays. I'm guessing the run time will suffer, but is the original command modifiable in this way?
You could use this awk:
awk 'a[FNR%n,m]==$m {print a[FNR%n]}{a[FNR%n]=$0; a[FNR%n,m]=$m}' n=2 m=2 file.txt
This prints the line n rows before the current line when field m in both lines matches (use m=2 for the question's field 2).
It also keeps memory usage nicely in check; if you don't care too much about memory consumption, you can do this instead:
awk '(FNR-n,$m) in a {print a[FNR-n,$m]}{a[FNR,$m]=$0}' n=2 m=2 file.txt
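For the sample input above, matching field 2 against the row two lines back (n=2 m=2), either version should print the expected row:
awk '(FNR-n,$m) in a {print a[FNR-n,$m]}{a[FNR,$m]=$0}' n=2 m=2 file.txt
B B B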
You may use this awk solution:
cat prev.awk
FNR > p && (n = split(row[FNR-p], cols)) && $2 == cols[2] {
    print row[FNR-p]
}
{
    row[FNR] = $0
}
Then use it for current-2 row matching:
awk -v p=2 -f prev.awk file
B B B
and current-1 row matching:
awk -v p=1 -f prev.awk file
A B C
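The same script handles the question's "3 rows before" case; on the sample input it should print the row-1/row-4 match:
awk -v p=3 -f prev.awk file
A B C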

Extract line before first empty line after match

I have some CSV file in this form:
* COMMENT
* COMMENT

100 ; 1706 ; 0.18 ; 0.45 ; 0.00015 ; 0.1485 ; 0.03 ; 1 ; 1 ; 2 ; 280 ; 100 ; 100 ;

* COMMENT
* COMMENT

* ZT vector
0; 367; p; nan
1; 422; p; nan
2; 1; d; nan

* KS vector
0; 367; p; 236.27
1; 422; p; 236.27
2; 1; d; 236.27

*Total time: 4.04211
I need to extract the last line before an empty line after matching the pattern KS vector.
To be clearer, in the above example I would like to extract the line
2; 1; d; 236.27
since it's the non-empty line just before the first empty one after the match with KS vector.
I would also like to use the same script to extract the same kind of line after matching the pattern ZT vector, that in the above example would return
2; 1; d; nan
I need to do this because I need the first number of that line, since it tells me the number of consecutive non-empty lines after KS vector.
My current workaround is this:
# counting number of lines after matching "KS vector" until first empty line
var=$(sed -n '/KS vector/,/^$/p' file | wc -l)
# Subtracting 2 to obtain actual number of lines
var=$(($var-2))
But if I could extract directly the last line I could extract the first element (2 in the example) and add 1 to it to obtain the same number.
You're going about this the wrong way. All you need is to put awk into paragraph mode and print 1 less than the number of lines in the record (since you don't want to include the KS vector line in your count):
$ awk -v RS= -F'\n' '/KS vector/{print NF-1}' file
3
Here's how awk sees the record when you put it into paragraph mode (by setting RS to null) with newline-separated fields (by setting FS to a newline):
$ awk -v RS= -F'\n' '/KS vector/{ for (i=1;i<=NF;i++) print NF, i, "<"$i">"}' file
4 1 <* KS vector>
4 2 <0; 367; p; 236.27>
4 3 <1; 422; p; 236.27>
4 4 <2; 1; d; 236.27>
With an awk expression:
awk -v vec="KS vector" '$0~vec{ f=1 }f && !NF{ print r; exit }f{ r=$0 }' file
vec - variable containing the needed pattern/vector
$0~vec{ f=1 } - on encountering the needed pattern/vector, set the flag f to the active state
f{ r=$0 } - while the flag f is active (inside the needed vector section), capture the current line into variable r
f && !NF{ print r; exit } - NF is the total number of fields; an empty line has none (!NF). On encountering an empty line while iterating through the needed vector lines, print the last captured non-empty line r
exit - exit script execution immediately (avoiding redundant actions/iterations)
The output:
2; 1; d; 236.27
If you want to just print the actual number of lines under the found vector, use the following:
awk -v vec="KS vector" '$0~vec{ f=1 }f && !NF{ print r+1; exit }f{ r=$1 }' file
3
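Per the question's second requirement, the same expression with vec="ZT vector" should print the corresponding line from the ZT block:
awk -v vec="ZT vector" '$0~vec{ f=1 }f && !NF{ print r; exit }f{ r=$0 }' file
2; 1; d; nan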
With awk:
awk '$0 ~ "KS vector" { valid=1;getline } valid==1 { cnt++;dat[cnt]=$0 } $0=="" { valid="" } END { print dat[cnt-1] }' filename
Check for any line matching "KS vector": set the valid flag and then read in the next line. While the flag is set, read each line into the dat array with an incremented counter cnt. When an empty line is encountered, reset the valid flag. At the end, print the last-but-one element of the dat array (the empty line itself is stored as the last element, so the wanted line is the one before it).
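Run against the sample file, this should print:
2; 1; d; 236.27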

VIM padding with appropriate number of ",0" to get CSV file

I have a file containing numbers like
1, 2, 3
4, 5
6, 7, 8, 9,10,11
12,13,14,15,16
...
I want to create a CSV file by padding each line such that there are 6 values separated by 5 commas, so I need to add to each line an appropriate number of ",0". It shall look like
1, 2, 3, 0, 0, 0
4, 5, 0, 0, 0, 0
6, 7, 8, 9,10,11
12,13,14,15,16, 0
...
How would I do this with VIM?
Can I count the number of "," in a line with regular expressions and add the correct number of ",0" to each line with the substitute s command?
You can achieve that by typing this command:
:g/^/ s/^.*$/&,0,0,0,0,0,0/ | normal! 6f,D
You can add six zeros to all lines first, irrespective of how many numbers they have, and then delete everything from the sixth comma to the end of each line.
To append them:
:1,$ normal! A,0,0,0,0,0,0
To delete from sixth comma till end,
:1,$normal! ^6f,D
^ moves to the first non-blank character of the line (which is obviously a digit here)
6f, jumps to the sixth comma
D deletes from the cursor to the end of the line
Example:
Original
1,2,
3,6,7,0,0,0
4,5,6
11,12,13
After adding six zeros (note the first line's trailing comma yields an empty field):
1,2,,0,0,0,0,0,0
3,6,7,0,0,0,0,0,0,0,0,0
4,5,6,0,0,0,0,0,0
11,12,13,0,0,0,0,0,0
After removing from the sixth comma to the end of line:
1,2,,0,0,0
3,6,7,0,0,0
4,5,6,0,0,0
11,12,13,0,0,0
With perl:
perl -lpe '$_ .= ",0" x (5 - tr/,//)' file.txt
With awk:
awk -v FS=, -v OFS=, '{ for(i = NF+1; i <= 6; i++) $i = 0 } 1' file.txt
With sed:
sed ':b /^\([^,]*,\)\{5\}/ b; { s/$/,0/; b b }' file.txt
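Note that all three pad with ",0" rather than the ", 0" spacing shown in the desired output. For the four sample lines, the awk version should produce:
awk -v FS=, -v OFS=, '{ for(i = NF+1; i <= 6; i++) $i = 0 } 1' file.txt
1, 2, 3,0,0,0
4, 5,0,0,0,0
6, 7, 8, 9,10,11
12,13,14,15,16,0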
As far as how to do this from inside Vim: you can also pipe text through external programs, and the output will replace the input. That's an easy way to leverage sorting, deduping, grep-based filtering, etc., or some of Sato's suggestions. So, if you have a script called standardize_commas.py, try selecting your block with visual line mode (shift+v, then select) and typing :! python /tmp/standardize_commas.py. Vim will prepend the visual range ('<,'>) to the command line, indicating that the command will run on the currently selected lines.
FYI, this was my /tmp/standardize_commas.py script:
import sys

max_width = 0
rows = []
for line in sys.stdin:
    line = line.strip()
    existing_vals = line.split(",")
    rows.append(existing_vals)
    max_width = max(max_width, len(existing_vals))

for row in rows:
    zeros_needed = max_width - len(row)
    full_values = row + ["0"] * zeros_needed
    print(",".join(full_values))

merge two files based on partial match between strings

I have two files where the strings in file1 partially match the strings in the last column of file2. I would like to merge the two files based on the match between the strings. How do I solve this when the match is only partial, meaning that the strings in file1 are often substrings of those in file2? PS: Case should be ignored.
file1:
AGTAAGGTCAGCTAAATAAGCTATCGGGCCCATACCCCGAAAATGTTGGTTATATCCTTCCCGTACTA 0 1 2 3
CTTCTATGATGAATTTGATTGCATTGATCGTCTGACATGATAATGTATTT 2 11 14 0
AAAGTGGCCTACGCCACCGCCATGGACTGGTTCATAGCCGTGTGCTATGCCTTC 1 2 3 4
AAAGTGTCATATGCCACTGCCATGGATTGGTTCATAGCTGTTTGCTTTGCATTC 50 1 1 21
TACCCTGTAGAACCGAANTTGT 0 0 1 4
TCCCTGTGGTCTAGTGGTTAGGATTCTGCGCTCTCACCGCCGCGGCCCGGG 1 0 4 3
GGGCCAGGATGAAACCTAATTTGAGTGGCCATCCATGGATGAGAAATGCGG 0 1 3 0
file2:
chrX Rfam ncRNA 55609165 55609267 53.97 + 0 ID=RF00019.20;Name=RF00019;Alias=Y_RNA;Note=AL627224.14/36063-36164 chrX:55609165-55609267 ggctggtttgagtgcagtgatgcttacaactaattgatcacatccaattacagatttctttgctctttctgtactcccagtgcttcacttgactagccttta
chrX Rfam regulatory_region 57233087 57233370 53.02 - 0 ID=RF01417.3;Name=RF01417;Alias=RSV_RNA;Note=Z83745.1/45303-45021 chrX:57233087-57233370 gtaaatgcaaaccattcacagtcttgctcagctaaggggatagtaaagaaacagtcttttaaatcaatgactattaaaggccaatttcttggaatcatagcaggagaaggcagtcctggctgcaatgtccccataggttgtataactgaattaatggctcttaagtcagttaacattctccatttacctgattttttcttaattacaaaaactggagaatttcaaggggaaaatattggaactatgtgtcctttttctaattgttcagtaactaagtcctcta
chrX Rfam regulatory_region 61975961 61976233 45.45 - 0 ID=RF01417.4;Name=RF01417;Alias=RSV_RNA;Note=BX322784.3/89124-88853 chrX:61975961-61976233 AAAGTGTCATATGCCACTGCCATGGATTGGTTCATAGCTGTTTGCTTTGCATTC
chrX Rfam ncRNA 62059095 62059167 29.9 + 0 ID=RF00005.18;Name=RF00005;Alias=tRNA;Note=BX119964.4/4840-4911 chrX:62059095-62059167 GTTAATGTAGCTTAATTCATCAAAGCAAGGCACTGAAAAATGCCTAGATGAATACACATGATTCCATTAACA
chrX Rfam regulatory_region 62582448 62582735 62.81 - 0 ID=RF01417.5;Name=RF01417;Alias=RSV_RNA;Note=AL158203.12/36753-36467 chrX:62582448-62582735 gtaaacacaaatttttctctgtccttctctgctagatgaatggtataaaaacaatctttaagtcaacaacgattataggccaatcttcaggaattgccacaggggaggggaggacctgttgaagagaccccataggttgcaaattagcattaatagcagttaagtagtgcaaaagtctccatttaccagactttttgggaatgacgaaaatgggcgaattccaaaggctgtttgatggttctatatggccagctttcaattgctcctcaactaattcatgggctctc
chrX Rfam ncRNA 63430570 63430868 141.38 + 0 ID=RF00017.15;Name=RF00017;Alias=Metazoa_SRP;Note=AL355852.23/124872-125169 chrX:63430570-63430868 cctggggcagtggcacatgcctgtagtcccagctacttgggaggctgaagcaggaggatagcttaagttcaggagttctgggatgtaatgcactatgctgatagggtgtctgcactaagttcagcatcaacatggtgacctcccaggagcaggggaccaccaggctgcctaaggaggtatgaactggccgagatcagaaacggagcacataaaaacttgcatcttgatcagtagtgggattgcgcctacaaatagccactgcactgcagactgggcaacatagtgagaccttgtctct
If your files aren't huge, and awk is able to hold all of file2 in memory, you can do this:
awk '
ARGIND==1 { save[tolower($NF)] = $0 }
ARGIND==2 { col1 = tolower($1)
for(pat in save){
if(pat ~ col1)print $0 " ----- " save[pat]
}
}
' file2 file1
This reads file2 first and saves each line ($0) in the associative array save, indexed by the last field ($NF) converted to lowercase.
It then reads file1 (so ARGIND is 2, the 2nd file) and converts column 1 to lowercase. For each index in the array, it tests (~) whether the saved index string contains col1 (used as a pattern). If it matches, it prints the current line from file1 and the saved line from file2.
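Note that ARGIND is gawk-specific. A minimal portable sketch of the same logic, using the common FNR==NR idiom to tell the two files apart:
awk '
FNR==NR { save[tolower($NF)] = $0; next }   # first file (file2)
{
    col1 = tolower($1)                      # second file (file1)
    for (pat in save)
        if (pat ~ col1) print $0 " ----- " save[pat]
}
' file2 file1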

Finding columns with only white space in a text file and replacing them with a unique separator

I have a file like this:
aaa b b ccc 345
ddd fgt f u 3456
e r der der 5 674
As you can see, the only way we can separate the columns is by finding character columns that contain only one or more spaces (nothing else). How can we identify these columns and replace them with a unique separator like ,?
aaa,b b,ccc,345
ddd,fgt,f u,3456
e r,der,der,5 674
Note:
If we find every contiguous run of character columns consisting of white space only (nothing else) and replace each whole run with a single , the problem will be solved.
Better explanation of the question by josifoski :
Per block of matrix characters: if all characters in a vertical block are spaces, the whole block should be replaced with one , on every line.
$ cat tst.awk
BEGIN { FS=OFS=""                       # null FS/OFS (gawk): every character is a field
        ARGV[ARGC]=ARGV[ARGC-1]; ARGC++ # queue the input file a second time
}
NR==FNR {                               # pass 1: note which positions ever hold a space / non-space
    for (i=1;i<=NF;i++) {
        if ($i == " ") {
            space[i]                    # merely referencing the element creates the index
        }
        else {
            nonSpace[i]
        }
    }
    next
}
FNR==1 {                                # start of pass 2: keep only the all-space positions
    for (i in nonSpace) {
        delete space[i]
    }
}
{
    for (i in space) {                  # replace all-space columns, then squeeze runs of commas
        $i = ","
    }
    gsub(/,+/,",")
    print
}
$ awk -f tst.awk file
aaa,b b,ccc,345
ddd,fgt,f u,3456
e r,der,der,5 674
Another in awk
awk 'BEGIN{OFS=FS=""} # Sets field separator to nothing so each character is a field
FNR==NR{for(i=1;i<=NF;i++)a[i]+=$i!=" ";next} #Increments array with key as character
#position based on whether a space is in that position.
#Skips all further commands for first file.
{ # In second file(same file but second time)
for(i=1;i<=NF;i++) #Loops through fields
if(!a[i]){ #If field is set
$i="," #Change field to ","
x=i #Set x to field number
while(!a[++x]){ # Whilst incrementing x and it is not set
$x="" # Change field to nothing
i=x # Set i to x so it doesnt do those fields again
}
}
}1' test{,} #PRint and use the same file twice
Since you have also tagged this r, here is a possible solution using the R package readr. It looks like you want to read a fixed-width file and convert it to a comma-separated file. You can use read_fwf to read the fixed-width file and write_csv to write the comma-separated file.
# required package
require(readr)
# read data
df <- read_fwf(path_to_input, fwf_empty(path_to_input))
# write data
write_csv(df, path = path_to_output, col_names = FALSE)