Too many attributes for ARFF format in Weka

I am working with a data set of dimension greater than 10,000. To use Weka I need to convert a text file into ARFF format, but because there are so many attributes, the file is too large even with the sparse ARFF format. Is there a method, similar to the sparse format for the data section, to avoid writing so many attribute declarations in the header of the ARFF file?
for example :
@attribute A1 NUMERIC
@attribute A2 NUMERIC
...
...
@attribute A10000 NUMERIC
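(Aside: the sparse ARFF format mentioned above only compresses the data section; the header must still declare every attribute, which is exactly the problem here. A sparse instance lists just the non-zero attributes as 0-based index/value pairs, for example

@DATA
{0 1, 3 2.5, 9999 7}

so it does not reduce the number of attribute declarations.)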

I coded an AWK script to convert lines like the following (in a TXT file) to ARFF:
example.txt source:
Att_0 | Att_1 | Att_2 | ... | Att_n
1 | 2 | 3 | ... | 999
My script (to_arff), you can change FS value depending on the separator used in the TXT file:
#!/usr/bin/awk -f
# ./<script>.awk data.txt > data.arff
BEGIN {
    # pipe separator, ignoring the spaces around it
    FS = " *[|] *";
    # WEKA separator
    separator = ",";
}

# The first line
NR == 1 {
    # WEKA headers
    split(FILENAME, relation, ".");
    # the relation's name is the source file's name
    print "@RELATION " relation[1] "\n";
    # attributes are "numeric" by default
    # types available: numeric, <nominal> {n1, n2, ..., nN}, string and date [<date-format>]
    for (i = 1; i <= NF; i++) {
        print "@ATTRIBUTE " $i " NUMERIC";
    }
    print "\n@DATA";
}

NR > 1 {
    s = "";
    first = 1;
    for (i = 1; i <= NF; i++) {
        if (first)
            first = 0;
        else
            s = s separator;
        s = s $i;
    }
    print s;
}
Output:
@RELATION example
@ATTRIBUTE Att_0 NUMERIC
@ATTRIBUTE Att_1 NUMERIC
@ATTRIBUTE Att_2 NUMERIC
@ATTRIBUTE Att_n NUMERIC
@DATA
1,2,3,999
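If the TXT file had no header row at all, the same idea could also generate the attribute names (A1 ... An, as in the question) from the field count. A rough, untested sketch assuming the same pipe-separated layout:

#!/usr/bin/awk -f
# hypothetical variant: no header row, so attribute names A1..An are generated
BEGIN { FS = " *[|] *"; separator = "," }
NR == 1 {
    print "@RELATION data\n"
    for (i = 1; i <= NF; i++)
        print "@ATTRIBUTE A" i " NUMERIC"
    print "\n@DATA"
}
{
    s = $1
    for (i = 2; i <= NF; i++)
        s = s separator $i
    print s
}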


Splitting text lines, whilst appending prefix

EDITED
I have about 50x9Gb .mer files that look like this:
"xxxxx";"123\t123\t123\v234\t234\v234\t224\t234\v"
"yyyyy";"123\t234\t224\v234\t234\v234\t224\t234\v"
"zzzzz";"123\t456\t565\v234\t774"
A UUID followed by ";", then a vertical-tab-delimited list of groups, each made up of one or more tab-separated entries, all enclosed in quotes. I've shown them as 3-digit numbers here, but in reality they're variable-length strings, which can include doubled quotes "".
I need to turn them into this:
xxxxx\t123\t123\t123
xxxxx\t234\t234
xxxxx\t234\t224\t234
yyyyy\t123\t234\t224
yyyyy\t234\t234
yyyyy\t234\t224\t234
zzzzz\t123\t456\t565
zzzzz\t234\t774
That is, split the lines on the vertical tabs, prefixing each line with the first field of the line that it came from.
Currently, I'm using a noddy regex, which does at least work, but requires multiple runs and manual checking.
How would I do this using awk or sed? I've tried adapting the current answers below, but I'm having trouble finding out what the ;P and ;D suffixes mean.
(Note: I'm using GitBash on Windows, so I guess that's gnu sed and awk ?)
awk -F';' -v OFS='\t' '{          # field separator is ";", output separator is a tab
    for(i=1;i<=NF;i++)            # we then have 2 fields; strip the leading and trailing double quotes
        gsub(/^"|"$/,"",$i)
    c=split($2,a,"\v")            # split the 2nd field on the vertical tab, saving the pieces in array "a"
    for(i=1;i<=c;i++)             # for each element of a, if it is not empty,
        if(a[i])print $1,a[i]     # print field 1 (the uuid) and the element, separated by a tab
}' file
explanation is inline.
It outputs:
xxxxx 123 123 123
xxxxx 234 234
xxxxx 234 224 234
yyyyy 123 234 224
yyyyy 234 234
yyyyy 234 224 234
zzzzz 123 456 565
zzzzz 234 774
This might work for you (GNU sed):
sed -r 's/^((\S*\t)\S*)\v/\1\n\2/;P;D' file
Replace each \v with a newline followed by the first field and a tab. P prints the pattern space up to the first newline, and D deletes up to and including that newline and restarts the cycle without reading a new input line, so the substitution repeats for the remaining records.
EDIT: As per the new question;
sed -r '/\n/!s/"(")?/\1/g;/\n/!s/;/\t/;s/^((\S*\t)[^\v]*)\v/\1\n\2/;/\t$/!P;D' file
Remove lone double quotes (replacing doubled double quotes "" with a single "), replace the semicolon with a tab, then repeatedly replace each \v with a newline followed by the first field and a tab, printing each completed record as it goes.
You can use this awk command for this output:
awk 'BEGIN{FS=OFS="\t"} n = split($2, a, "\x0b") {
for (i=1; i<=n; i++) print $1, a[i]}' file
195a664e-e0d0-4488-99d6-5504f9178115 1234
195a664e-e0d0-4488-99d6-5504f9178115 1412
195a664e-e0d0-4488-99d6-5504f9178115 1231
195a664e-e0d0-4488-99d6-5504f9178115 4324
195a664e-e0d0-4488-99d6-5504f9178115 1421
195a664e-e0d0-4488-99d6-5504f9178115 3214
a1d61289-7864-40e6-83a7-8bdb708c459e 1412
a1d61289-7864-40e6-83a7-8bdb708c459e 6645
a1d61289-7864-40e6-83a7-8bdb708c459e 5334
a1d61289-7864-40e6-83a7-8bdb708c459e 3453
a1d61289-7864-40e6-83a7-8bdb708c459e 5453
How it works:
BEGIN{FS=OFS="\t"}           # sets the input and output field separators to a tab
n = split($2, a, "\x0b")     # splits the second field on hex 0B (ASCII 11), i.e. the vertical tab;
                             # the assignment doubles as the pattern, so the action only runs when the split yields elements
for (i=1; i<=n; i++) ...     # prints field 1 paired with each item from the split array a
gnu sed
sed 's/"\|..$//g;s/;/\t/;:r;s/^\([^\t]*\)\t\(.*\)\\v/\1\t\2\n\1\t/;t r;s/\\t/\t/g;' YourFile
Recursively replaces each \v with a newline plus the first "field" and a tab, cleaning up the extra characters along the way.
Another solution using awk
awk '
BEGIN{FS="[\v;]"}
{
    gsub("[\"]","");
    for(i=2; i<=NF; ++i)
        if($i) printf "%s\t%s\n", $1, $i;
}' file.mer
Another solution using sed
sed -r 's/\v\n/\v/g; s/"//g;
:a; s/([^;]*);([^\v]*)\v/\1;\2\n\1;/g; ta;
s/;/\t/g;' file.mer | sed -r '/^[^\t]+\t$/d'
you get,
xxxxx 123 123 123
xxxxx 234 234
xxxxx 234 224 234
yyyyy 123 234 224
yyyyy 234 234
yyyyy 234 224 234
zzzzz 123 456 565
zzzzz 234 774
Okay, I waited on purpose until Kent's answer was accepted and awarded the bounty because the question was about awk/sed. Thus, my answer might be somewhat off-topic, but anyway here is my Java solution which I did just for fun as a kata.
MER input file generator:
I thought it would be nice to generate some sample input files with randomised values. Each line consists of
a UUID,
0-9 groups, separated by vertical tabs,
within each group, 1-4 strings, separated by horizontal tabs,
each string consisting of 1-20 characters, among them double quotes escaped by other double quotes, i.e. "".
I thought this is diverse enough to get some good test data.
package de.scrum_master.stackoverflow;

import org.apache.commons.lang.RandomStringUtils;

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Random;
import java.util.UUID;

public class RandomFileGenerator {
    private static final int BUFFER_SIZE = 1024 * 1024;
    private final static Random RANDOM = new Random();
    private final static char VERTICAL_TAB = '\u000b';
    private final static char[] LEGAL_CHARS =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzäöüÄÖÜß. -\""
            .toCharArray();

    public static void main(String[] args) throws IOException {
        long startTime = System.currentTimeMillis();

        // final long maxOutputSize = 9L * 1024 * 1024 * 1024;
        // final String outputFile = "src/main/resources/sample-9gb.mer";
        final long maxOutputSize = 1L * 1024 * 1024;
        final String outputFile = "src/main/resources/sample-1mb.mer";

        long totalOutputSize = 0;
        long lineCount = 0;
        String line;
        try (PrintWriter writer = new PrintWriter(new BufferedWriter(new FileWriter(outputFile), BUFFER_SIZE))) {
            while (totalOutputSize < maxOutputSize) {
                line = generateLine();
                writer.println(line);
                totalOutputSize += line.length() + 1;
                lineCount++;
            }
        }

        System.out.println(lineCount);
        System.out.println(totalOutputSize);
        System.out.println((System.currentTimeMillis() - startTime) / 1000.0);
    }

    private static String generateLine() {
        StringBuilder buffer = new StringBuilder();
        buffer
            .append('"')
            .append(UUID.randomUUID().toString())
            .append("\";\"");
        int numItems = RANDOM.nextInt(10);
        for (int i = 0; i < numItems; i++) {
            int numSubItems = 1 + RANDOM.nextInt(4);
            for (int j = 0; j < numSubItems; j++) {
                buffer.append(
                    RandomStringUtils.random(1 + RANDOM.nextInt(20), 0, LEGAL_CHARS.length, false, false, LEGAL_CHARS)
                        .replaceAll("\"", "\"\"")
                );
                if (j + 1 < numSubItems)
                    buffer.append('\t');
            }
            if (i + 1 < numItems) {
                buffer.append(VERTICAL_TAB);
            }
        }
        buffer.append('"');
        return buffer.toString();
    }
}
You can see that it is easy to create test files of the desired file size, e.g.
1 MB: maxOutputSize = 1L * 1024 * 1024
9 GB: maxOutputSize = 9L * 1024 * 1024 * 1024
I mostly used the smaller one to check the algorithms during development and the really big one for performance tuning.
File splitter in 4 different variants:
The variants shown here use different approaches, but what they have in common is that they read from a BufferedReader with Java streams via reader.lines(). Switching from streams to a simple for loop made it slower, BTW. All solutions write the results into a PrintWriter.
1. reader.lines().forEach() and then regex matching + splitting. This solution has the best trade-off between readability, brevity and performance.
2. reader.lines().flatMap(), i.e. using sub-streams for the vertical-tab-separated groups after the UUID, also using regex matching + splitting. This solution is also quite short and elegant, but a little harder to read than #1 and also about 15% slower.
3. Because regex matching calls like replace() and split() can be pretty expensive, I developed a solution that rather iterates over the string and uses indexOf() and substring() instead of regexes. This was substantially faster than #1 and #2, but the code is harder to read in a way I started to dislike. This should only be done if performance is really important, i.e. if the file splitter is used regularly. For a one-time solution, or if it runs just once a month, I think it is not really worth it from a maintainability standpoint.
4. A further optimised version of #3 which avoids some more overhead and is again a little bit faster, but not substantially. Now the code really needs source code comments so as to convey to the reader what the algorithm does. From a clean code perspective this is a nightmare. (Don't do this at home, kids!)
package de.scrum_master.stackoverflow;

import java.io.*;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FileSplitter {
    private static final int BUFFER_SIZE = 1024 * 1024;
    private static final Pattern LINE_PATTERN = Pattern.compile("^\"([^\"]+)\";\"(.*)\"$");
    private final static char VERTICAL_TAB = '\u000b';

    public static void main(String[] args) throws IOException {
        long startTime = System.currentTimeMillis();
        String inputFile = "src/main/resources/sample-9gb.mer";
        String outputFile = inputFile.replaceFirst("mer$", "txt");

        try (
            BufferedReader reader = new BufferedReader(new FileReader(inputFile), BUFFER_SIZE);
            PrintWriter writer = new PrintWriter(new BufferedWriter(new FileWriter(outputFile), BUFFER_SIZE))
        ) {
            // forEachVariant(reader, writer);
            // flatMapVariant(reader, writer);
            noRegexSimpleVariant(reader, writer);
            // noRegexOptimisedVariant(reader, writer);
        }

        System.out.println((System.currentTimeMillis() - startTime) / 1000.0);
    }

    private static void forEachVariant(BufferedReader reader, PrintWriter writer) {
        Matcher matcher = LINE_PATTERN.matcher("dummy");
        reader.lines()
            .forEach(line -> {
                matcher.reset(line).matches();
                for (String record : matcher.group(2).replace("\"\"", "\"").split("\\v"))
                    writer.println(matcher.group(1) + "\t" + record);
            });
    }

    private static void flatMapVariant(BufferedReader reader, PrintWriter writer) {
        Matcher matcher = LINE_PATTERN.matcher("dummy");
        reader.lines()
            .flatMap(line -> {
                matcher.reset(line).matches();
                return Arrays
                    .stream(matcher.group(2).replace("\"\"", "\"").split("\\v"))
                    .map(record -> matcher.group(1) + "\t" + record);
            })
            .forEach(writer::println);
    }

    private static void noRegexSimpleVariant(BufferedReader reader, PrintWriter writer) {
        reader.lines()
            .forEach(line -> {
                final int lineLength = line.length();
                // UUID + '\t'
                int indexLeft = 1;
                int indexRight = line.indexOf('"', indexLeft);
                final String uuid = line.substring(indexLeft, indexRight) + "\t";
                indexLeft = indexRight + 3;
                String record;
                int quoteIndex;
                while (indexLeft < lineLength) {
                    writer.print(uuid);
                    indexRight = line.indexOf(VERTICAL_TAB, indexLeft);
                    if (indexRight == -1)
                        indexRight = lineLength - 1;
                    while (indexLeft < indexRight) {
                        quoteIndex = line.indexOf('"', indexLeft);
                        if (quoteIndex == -1 || quoteIndex >= indexRight)
                            quoteIndex = indexRight;
                        else
                            quoteIndex++;
                        record = line.substring(indexLeft, quoteIndex);
                        writer.print(record);
                        indexLeft = quoteIndex + 1;
                    }
                    writer.println();
                    indexLeft = indexRight + 1;
                }
            });
    }

    private static void noRegexOptimisedVariant(BufferedReader reader, PrintWriter writer) throws IOException {
        reader.lines()
            .forEach(line -> {
                // UUID + '\t'
                int indexLeft = 1;
                int indexRight = line.indexOf('"', indexLeft);
                final String uuid = line.substring(indexLeft, indexRight) + "\t";
                // Skip '";"' after UUID
                indexLeft = indexRight + 3;
                final int lineLength = line.length();
                String recordChunk;
                int quoteIndex;
                // If search for '"' has once reached end of line, search no more
                boolean doQuoteSearch = true;
                // Iterate over records per UUID, separated by vertical tab
                while (indexLeft < lineLength) {
                    writer.print(uuid);
                    indexRight = line.indexOf(VERTICAL_TAB, indexLeft);
                    if (indexRight == -1)
                        indexRight = lineLength - 1;
                    // Search for '""' within record incrementally, + replace each of them by '"'.
                    // BTW, if '"' is found, it actually always will be an escaped '""'.
                    while (indexLeft < indexRight) {
                        if (doQuoteSearch) {
                            // Only search for quotes if we never reached the end of line before
                            quoteIndex = line.indexOf('"', indexLeft);
                            assert quoteIndex != -1;
                            if (quoteIndex >= lineLength - 1)
                                doQuoteSearch = false;
                            if (quoteIndex >= indexRight)
                                quoteIndex = indexRight;
                            else
                                quoteIndex++;
                        }
                        else {
                            // No more '"' within record
                            quoteIndex = indexRight;
                        }
                        // Write record chunk, skipping 2nd '"'
                        recordChunk = line.substring(indexLeft, quoteIndex);
                        writer.print(recordChunk);
                        indexLeft = quoteIndex + 1;
                    }
                    // Do not forget newline before reading next line/UUID
                    writer.println();
                    indexLeft = indexRight + 1;
                }
            });
    }
}
Updated awk script:
Besides: each Java solution writes out one UUID without any content in case there was none in the input file. This would be easy to avoid, but I did it on purpose. It is the only difference from this slightly updated awk script (based on Dave's, but it also replaces "" by ") which I used as a benchmark:
#!/usr/bin/awk -f
BEGIN { FS=";"; OFS="\t" }       # same separators as in Dave's answer above
{
    for(i=1;i<=NF;i++) {
        gsub(/^"|"$/,"",$i)
        gsub(/""/,"\"",$i)
    }
    c=split($2,a,"\\v")
    for(i=1;i<=c;i++)
        print $1,a[i]
}
Performance results:
I measured both parsing and writing performance.
Parsing means reading a 9 GB file from disk and splitting it, but writing the output to /dev/null or not at all.
Writing means reading the same 9 GB file and writing it back onto the same disk partition (hybrid HD + SSD type), i.e. this could be further optimised by writing onto another physical disk. The output file has a size of 18 GB.
Reading file, splitting into lines but not parsing the lines: 66 s
Awk
parsing only: 533 s
parsing + writing: 683 s
reader.lines().forEach() and then regex matching + splitting
parsing only: 212 s
parsing + writing: 425 s
reader.lines().flatMap(), i.e. using sub-streams
parsing only: 245 s
parsing + writing: not measured
Without using regex, but using String.replace("\"\"", "\"") (not shown in code here)
parsing only: 154 s
parsing + writing: 369 s
No regex, no replace(), simple version
parsing only: 86 s
parsing + writing: 342 s
No regex, no replace(), optimised version
parsing only: 84 s
parsing + writing: 342 s
Sorry for the lengthy treatise, but I wanted to share my findings for others reading the question and the other answers, speculating whether Java (or C?) might be faster than awk - yes, it is by a substantial bit, but not by an order of magnitude, because disk performance is also a factor. And I think it is a warning to those who tend to over-optimise for optimisation's sake. It is not worth it if you go too far; just try to hit the sweet spot between effort, readability and performance. Amen.

Can we use AWK and gsub() to process data with multiple colons ":" ? How?

Here is an example of the data:
Col_01:14 .... Col_20:25 Col_21:23432 Col_22:639142
Col_01:8 .... Col_20:25 Col_22:25134 Col_23:243344
Col_01:17 .... Col_21:75 Col_23:79876 Col_25:634534 Col_22:5 Col_24:73453
Col_01:19 .... Col_20:25 Col_21:32425 Col_23:989423
Col_01:12 .... Col_20:25 Col_21:23424 Col_22:342421 Col_23:7 Col_24:13424 Col_25:67
Col_01:3 .... Col_20:95 Col_21:32121 Col_25:111231
As you can see, some of these columns are not in the correct order...
Now, I think the correct way to import this file into a dataframe is to preprocess the data such that you can output a dataframe with NaN values, e.g.
Col_01 .... Col_20 Col_21 Col_22 Col_23 Col_24 Col_25
8      .... 25     NaN    25134  243344 NaN    NaN
17     .... NaN    75     5      79876  73453  634534
19     .... 25     32425  NaN    989423 NaN    NaN
12     .... 25     23424  342421 7      13424  67
3      .... 95     32121  NaN    NaN    NaN    111231
The solution was shown by @JamesBrown here: How to preprocess and load a "big data" tsv file into a python dataframe?
Using said awk script:
BEGIN {
    PROCINFO["sorted_in"]="@ind_str_asc"    # traversal order for for(i in a)
}
NR==1 {                  # the header cols are at the beginning of the data file
                         # FORGET THIS: header cols from another file: replace NR==1 with NR==FNR and see * below
    split($0,a," ")      # mkheader a[1]=first_col ...
    for(i in a) {        # replace with a[first_col]="" ...
        a[a[i]]
        printf "%6s%s", a[i], OFS   # output the header
        delete a[i]      # remove a[1], a[2], ...
    }
    # next               # FORGET THIS * next here if cols from another file UNTESTED
}
{
    gsub(/: /,"=")       # replace key-value separator ": " with "="
    split($0,b,FS)       # split record on FS
    for(i in b) {
        split(b[i],c,"=")   # split key=value to c[1]=key, c[2]=value
        b[c[1]]=c[2]        # b[key]=value
    }
    for(i in a)          # go thru headers in a[] and printf from b[]
        printf "%6s%s", (i in b?b[i]:"NaN"), OFS
    print ""
}
And put the headers into a text file cols.txt
Col_01 Col_20 Col_21 Col_22 Col_23 Col_25
My question now: how do we use awk if we have data that is not column: value but column: value1: value2: value3?
We would want the database entry to be value1: value2: value3
Here's the new data:
Col_01:14:a:47 .... Col_20:25:i:z Col_21:23432:6:b Col_22:639142:4:x
Col_01:8:z .... Col_20:25:i:4 Col_22:25134:u:0 Col_23:243344:5:6
Col_01:17:7:z .... Col_21:75:u:q Col_23:79876:u:0 Col_25:634534:8:1
We still provide the columns beforehand with cols.txt
How can we create a similar database structure? Is it possible to use gsub() to match only the first value before the :, which is the same as the header name?
EDIT: This doesn't have to be awk based. Any language will do naturally
Here is another alternative...
$ awk -v OFS='\t' '{for(i=1;i<NF;i+=2)               # iterate over name: value pairs
                      {c=$i;                         # copy name in c to modify
                       sub(/:/,"",c);                # remove colon
                       a[NR,c]=$(i+1);               # collect data by row number, name
                       cols[c]}}                     # save name
                   END{n=asorti(cols,icols);         # sort names
                       for(j=1;j<=n;j++) printf "%s", icols[j] OFS;   # print header
                       print "";
                       for(i=1;i<=NR;i++)            # print data
                         {for(j=1;j<=n;j++)
                            {v=a[i,icols[j]];
                             printf "%s", (v?v:"NaN") OFS}   # replace missing data with NaN
                          print ""}}' file | column -t       # pipe to column for pretty print
Col_01 Col_20 Col_21 Col_22 Col_23 Col_25
14:a:47 25:i:z 23432:6:b 639142:4:x NaN NaN
8:z 25:i:4 NaN 25134:u:0 243344:5:6 NaN
17:7:z NaN 75:u:q NaN 79876:u:0 634534:8:1
I had karakfa's answer as well. If the column name is not separated by whitespace from the value (e.g. if you have Col_01:14:a:47) then you can do this (using GNU awk for the extended match function)
{
    for (i=1; i<=NF; i++) {
        match($i, /^([^:]+):(.*)/, m)
        a[NR,m[1]] = m[2]
        cols[m[1]]
    }
}
The END block is the same
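Putting the two pieces together, a rough sketch of the full script (assuming GNU awk, since match() with an array argument and asorti() are gawk extensions; the END block is copied from karakfa's answer above):

awk -v OFS='\t' '
{
    for (i=1; i<=NF; i++) {
        match($i, /^([^:]+):(.*)/, m)     # m[1] = column name, m[2] = value(s)
        a[NR,m[1]] = m[2]
        cols[m[1]]
    }
}
END {
    n=asorti(cols,icols)                          # sort names
    for(j=1;j<=n;j++) printf "%s", icols[j] OFS   # print header
    print ""
    for(i=1;i<=NR;i++) {                          # print data
        for(j=1;j<=n;j++) {
            v=a[i,icols[j]]
            printf "%s", (v?v:"NaN") OFS          # replace missing data with NaN
        }
        print ""
    }
}' file | column -t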
Using TXR's Lisp macro implementation of the Awk paradigm:
(awk (:set ft #/-?\d+/)  ;; ft is "field tokenize" (no counterpart in Awk)
     (:let (tab (hash :equal-based)) (max-col 1) (width 8))
     ((ff (mapcar toint) (tuples 2))  ;; filter fields to int and shore up into pairs
      (set max-col (max max-col [find-max [mapcar first f]]))
      (mapdo (ado set [tab ^(,nr ,@1)] @2) f))  ;; stuff data into table
     (:end (let ((headings (mapcar (opip (format nil "Col~,02a")
                                         `@{@1 width}`)
                                   (range 1 max-col))))
             (put-line `@{headings " "}`))
           (each ((row (range 1 nr)))
             (let ((cols (mapcar (opip (or [tab ^(,row ,@1)] "NaN")
                                       `@{@1 width}`)
                                 (range 1 max-col))))
               (put-line `@{cols " "}`)))))
Smaller sample data:
Col_01: 14 Col_04: 25 Col_06: 23432 Col_07: 639142
Col_02: 8 Col_03: 25 Col_05: 25134 Col_06: 243344
Col_01: 17
Col_06: 19 Col_07: 32425
Run:
$ txr reformat.tl data-small
Col01 Col02 Col03 Col04 Col05 Col06 Col07
14 NaN NaN 25 NaN 23432 639142
NaN 8 25 NaN 25134 243344 NaN
17 NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN 19 32425
P.S. opip is a macro which bootstraps from the op macro for partial function application; opip implicitly distributes op into its argument expressions, and then chains the resulting functions together into a functional pipeline: hence "op-pipe". In each pipeline element, its own numbered implicit arguments can be referenced: @1, @2, ...; if they are absent, then the partially applied function implicitly receives the piped object as its rightmost argument.
The ^(,row ,@1) syntax is TXR Lisp's backquote. The backtick that mainstream Lisp dialects use for backquote is already employed for string quasiquotes. This is equivalent to (list row @1): make a list consisting of the value of row and of the implicit, op/do-generated function argument @1. Lists of two elements are being used as the hash keys, which simulates a 2D array. For that, the hash must be :equal-based. The lists (1 2) (1 2) are not eql if they are separate instances rather than one and the same object; they compare equal under the equal function.
Just for fun, some incomprehensible perl
perl -aE'%l=%{{@F}};while(($k,$v)=each%l){$c{$k}=1;$a[$.]{$k}=$v}END{$,="\t";say@c=sort keys%c;for$i(1..$.){say map{$a[$i]{$_}//"NaN"}@c}}' input
(community wiki to hide my shame ...)
Golfed a few chars:
perl -aE'while(@F){$c{$k=shift@F}=1;$data[$.]{$k}=shift@F}END{$,="\t";say@c=sort keys%c;for$i(1..$.){say map{$data[$i]{$_}//"NaN"}@c}}' input

Adding Fields of growing fields Unix

I have a grade book file that looks like
StudentID:LastName:FirstName:hw01:quiz01:exam01:proj01:quiz02:
0123:Smith:Jon:100:80:80:100:90:
0987:Williams:Pat:20:30:35:46:50:
0654:Bar:Foo:100:100:100:100:100:
I need to add all the hws/quizzes/exams/projects for each student and append the totals to the end of the corresponding line.
An example output file could be
StudentID:LastName:FirstName:hw01:quiz01:exam01:proj01:quiz02:hT:qT:eT:pT
0123:Smith:Jon:100:80:80:100:90:100:170:80:100:
0987:Williams:Pat:20:30:35:46:50:20:80:35:46:
0654:Bar:Foo:100:100:100:100:100:100:200:100:100:
The output file doesn't have to be the same file, but keep in mind that the order of the grades in the header line (1st line) could be anything, so the assignment columns could appear in any order.
I'm assuming I must use grep to search the file for all fields containing "hw"/"quiz"/"exam"/"proj" and get the corresponding field. Then go through each line and add the totals for hw/quiz/exam/proj individually.
Maybe it's easier with awk?
$ cat tst.awk
BEGIN { FS=OFS=":" }
NR==1 {
    for (i=4;i<NF;i++) {
        name = substr($i,1,1) "T"
        nr2name[i] = name
        if (!seen[name]++) {
            names[++numNames] = name
        }
    }
    printf "%s", $0
    for (nameNr=1; nameNr<=numNames; nameNr++) {
        printf "%s%s", names[nameNr], OFS
    }
    print ""
    next
}
{
    delete tot
    for (i=4;i<NF;i++) {
        name = nr2name[i]
        tot[name] += $i
    }
    printf "%s", $0
    for (nameNr=1; nameNr<=numNames; nameNr++) {
        printf "%s%s", tot[names[nameNr]], OFS
    }
    print ""
}
$ awk -f tst.awk file
StudentID:LastName:FirstName:hw01:quiz01:exam01:proj01:quiz02:hT:qT:eT:pT:
0123:Smith:Jon:100:80:80:100:90:100:170:80:100:
0987:Williams:Pat:20:30:35:46:50:20:80:35:46:
0654:Bar:Foo:100:100:100:100:100:100:200:100:100:
This seems to do the job; it is intricate, though:
script.awk
BEGIN { FS = ":"; OFS = FS }
NR == 1 {
for (i = 4; i < NF; i++)
{
c = substr($i, 1, 1)
if (!(c in columns)) order[n++] = c
columns[c]++
letter[i] = c
}
nf = NF
for (i = 0; i < n; i++)
$(i+nf) = order[i] "T"
print $0 OFS
next
}
{
for (c in columns) total[c] = 0
for (i = 4; i < NF; i++)
{
c = letter[i]
total[c] += $i
}
nf = NF
for (i = 0; i < n; i++)
{
c = order[i]
$(i+nf) = total[c]
}
print $0 OFS
}
Explanation:
BEGIN:
- Set the input and output field separators.
NR == 1:
- Loop over the fields after the student ID and name fields.
- Extract the first letter.
- If the letter has not been seen before, note it in the order and increment the number of totals (n).
- Increment the number of times the letter has been seen.
- Record which letter goes with the current column.
- Add the new columns after the existing columns in sequence.
- Print the line plus a trailing output field separator (aka OFS or :).
- Note that $NF is empty because of the trailing : in the data, hence (unusually for awk scripts) i < NF rather than i <= NF; a quick one-liner check of this follows the explanation.
Each other line:
- Reset the totals for each letter.
- For each of the scoring fields, find the letter that the column belongs to (letter[i]).
- Add the column to the total for the letter.
- For each of the extra fields in order, set the value of the appropriate extra field to the total for that letter.
- Print the record plus an extra colon (aka OFS).
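As a quick check of that trailing-separator point (a hypothetical one-liner, not part of the original answer): splitting a record that ends in : on FS=":" yields an empty last field, so NF counts one more field than there are real columns.

$ echo '0123:Smith:Jon:100:80:' | awk -F: '{print NF, "[" $NF "]"}'
6 []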
data
StudentID:LastName:FirstName:hw01:quiz01:exam01:proj01:quiz02:
0123:Smith:Jon:100:80:80:100:90:
0987:Williams:Pat:20:30:35:46:50:
0654:Bar:Foo:100:100:100:100:100:
Sample output
$ awk -f script.awk data
StudentID:LastName:FirstName:hw01:quiz01:exam01:proj01:quiz02:hT:qT:eT:pT:
0123:Smith:Jon:100:80:80:100:90:100:170:80:100:
0987:Williams:Pat:20:30:35:46:50:20:80:35:46:
0654:Bar:Foo:100:100:100:100:100:100:200:100:100:
$
The only difference between this and the sample output in the question is a trailing colon on the title line, for consistency with the data lines (and input).
With a few adaptations:
- the order of the totals is not the same (it is dynamic)
- the total names use the full assignment names without the last 2 digits (e.g. quizT instead of qT)
- a parameter defines the first field that contains data to count (the 4th here, via -v 'St=4')
awk -v 'St=4' '
BEGIN{FS=OFS=":"}
NR==1 {
    printf "%s",$0
    for(i=St;i<=(nf=NF-1);i++){
        tn=$i;sub(/..$/,"T",tn)
        T[tn]=0;TN[i]=tn
    }
    Sep=""
    for(t in T){
        printf "%s%s",Sep,t;Sep=OFS
    }
    print Sep
    next
}
{
    for(i=St;i<=nf;i++){
        T[TN[i]]+=$i
    }
    for(i=1;i<=nf;i++)printf "%s%s",$i,OFS
    Sep=""
    for(t in T){
        printf "%s%s",Sep,T[t]
        T[t]=0;Sep=OFS
    }
    print Sep
}' YourFile
StudentID:LastName:FirstName:hw01:quiz01:exam01:proj01:quiz02:examT:quizT:hwT:projT:
0123:Smith:Jon:100:80:80:100:90:80:170:100:100:
0987:Williams:Pat:20:30:35:46:50:35:80:20:46:
0654:Bar:Foo:100:100:100:100:100:100:200:100:100:

Finding columns with only white space in a text file and replace them with a unique separator

I have a file like this:
aaa b b ccc 345
ddd fgt f u 3456
e r der der 5 674
As you can see, the only way we can separate the columns is by finding column blocks that contain only one or more spaces. How can we identify these columns and replace them with a unique separator like ,?
aaa,b b,ccc,345
ddd,fgt,f u,3456
e r,der,der,5 674
Note:
If we find all contiguous character columns that contain only white space (nothing else) and replace each whole block of them with a single , the problem will be solved.
Better explanation of the question by josifoski:
Per block of character columns: if every character in the block is a space, then the whole block should be replaced vertically with one , on every line.
$ cat tst.awk
BEGIN{ FS=OFS=""; ARGV[ARGC]=ARGV[ARGC-1]; ARGC++ }
NR==FNR {
    for (i=1;i<=NF;i++) {
        if ($i == " ") {
            space[i]
        }
        else {
            nonSpace[i]
        }
    }
    next
}
FNR==1 {
    for (i in nonSpace) {
        delete space[i]
    }
}
{
    for (i in space) {
        $i = ","
    }
    gsub(/,+/,",")
    print
}
$ awk -f tst.awk file
aaa,b b,ccc,345
ddd,fgt,f u,3456
e r,der,der,5 674
Another in awk
awk 'BEGIN{OFS=FS=""} # Sets field separator to nothing so each character is a field
FNR==NR{for(i=1;i<=NF;i++)a[i]+=$i!=" ";next} #Increments array with key as character
#position based on whether a space is in that position.
#Skips all further commands for first file.
{ # In second file(same file but second time)
for(i=1;i<=NF;i++) #Loops through fields
if(!a[i]){ #If field is set
$i="," #Change field to ","
x=i #Set x to field number
while(!a[++x]){ # Whilst incrementing x and it is not set
$x="" # Change field to nothing
i=x # Set i to x so it doesnt do those fields again
}
}
}1' test{,} #PRint and use the same file twice
Since you have also tagged this r, here is a possible solution using the R package readr. It looks like you want to read a fixed-width file and convert it to a comma-separated file. You can use read_fwf to read the fixed-width file and write_csv to write the comma-separated file.
# required package
require(readr)
# read data
df <- read_fwf(path_to_input, fwf_empty(path_to_input))
# write data
write_csv(df, path = path_to_output, col_names = FALSE)

Error in writing output file through AWK scripting

I have an AWK script that writes specific values matching a specific pattern to a .csv file.
The code is as follows:
BEGIN{print "Query Start,Query End, Target Start, Target End,Score, E,P,GC"}
/^\>g/ { Query=$0 }
/Query =/{
split($0,a," ")
query_start=a[3]
query_end=a[5]
query_end=gsub(/,/,"",query_end)
target_start=a[8]
target_end=a[10]
}
/Score =/{
split($0,a," ")
score=a[3]
score=gsub(/,/,"",score)
e=a[6]
e=gsub(/,/,"",e)
p=a[9]
p=gsub(/,/,"",p)
gc=a[12]
printf("%s,%s,%s,%s,%s,%s,%s,%s\n",query_start, query_end,target_start,target_end,score,e,p,gc)
}
The input file is as follows:
>gi|ABCDEF|
Plus strand results:
Query = 100 - 231, Target = 100 - 172
Score = 20.92, E = 0.01984, P = 4.309e-08, GC = 51
But I received the output in a .csv file as provided below:
100 0 100 172 0 0 0 51
The program failed to copy the values of:
Query end
Score
E
P
(Note: all the failed values are present before comma (,))
Any help to obtain the right output will be great.
Best regards,
Amit
As @Jidder mentioned, you don't need to call split(), and as @jaypal mentioned you're using gsub() incorrectly, but also you don't need to call gsub() at all if you just include , in your FS.
Try this:
BEGIN {
    FS = "[[:space:],]+"
    OFS = ","
    print "Query Start","Query End","Target Start","Target End","Score","E","P","GC"
}
/^\>g/ { Query=$0 }
/Query =/ {
    query_start=$4
    query_end=$6
    target_start=$9
    target_end=$11
}
/Score =/ {
    score=$4
    e=$7
    p=$10
    gc=$13
    print query_start,query_end,target_start,target_end,score,e,p,gc
}
That work? Note that the field numbers are bumped up by 1 because when you don't use the default FS, awk no longer skips leading white space, so the leading white space in your input produces an empty first field.
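To see the effect being described, here is a minimal illustration (not from the original answer) of how an explicit FS changes the handling of leading blanks:

$ printf '  a b\n' | awk '{print NF, "[" $1 "]"}'
2 [a]
$ printf '  a b\n' | awk 'BEGIN{FS="[[:space:]]+"}{print NF, "[" $1 "]"}'
3 []

With the default FS the leading blanks are ignored, so $1 is the first word; with an explicit FS they delimit an empty first field, which is why every field number in the script above is shifted by one.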
Obviously, you are not using your Query variable so the line that populates it is redundant.