Adding totals for a growing set of fields in Unix - regex

I have a grade book file that looks like
StudentID:LastName:FirstName:hw01:quiz01:exam01:proj01:quiz02:
0123:Smith:Jon:100:80:80:100:90:
0987:Williams:Pat:20:30:35:46:50:
0654:Bar:Foo:100:100:100:100:100:
I need to add all the hws/quizzes/exams/projects for each student and append the totals to the end of the corresponding line
An example output file could be
StudentID:LastName:FirstName:hw01:quiz01:exam01:proj01:quiz02:hT:qT:eT:pT
0123:Smith:Jon:100:80:80:100:90:100:170:80:100:
0987:Williams:Pat:20:30:35:46:50:20:80:35:46:
0654:Bar:Foo:100:100:100:100:100:100:200:100:100:
The output file doesn't have to be the same file, but keep in mind that the order of the grades in the header line (first line) could be anything, so the assignment columns may appear in any order.
I'm assuming I must use grep to search the file for all fields containing "hw"/"quiz"/"exam"/"proj" and get the corresponding field positions, then go through each line and add up the totals for hw/quiz/exam/proj individually.
Maybe it's easier with awk?

$ cat tst.awk
BEGIN { FS=OFS=":" }
NR==1 {
    for (i=4; i<NF; i++) {
        name = substr($i,1,1) "T"        # e.g. "hw01" -> "hT"
        nr2name[i] = name                # remember which total each column feeds
        if (!seen[name]++) {
            names[++numNames] = name     # keep the totals in first-seen order
        }
    }
    printf "%s", $0
    for (nameNr=1; nameNr<=numNames; nameNr++) {
        printf "%s%s", names[nameNr], OFS
    }
    print ""
    next
}
{
    delete tot
    for (i=4; i<NF; i++) {
        name = nr2name[i]
        tot[name] += $i                  # accumulate this student's totals
    }
    printf "%s", $0
    for (nameNr=1; nameNr<=numNames; nameNr++) {
        printf "%s%s", tot[names[nameNr]], OFS
    }
    print ""
}
$ awk -f tst.awk file
StudentID:LastName:FirstName:hw01:quiz01:exam01:proj01:quiz02:hT:qT:eT:pT:
0123:Smith:Jon:100:80:80:100:90:100:170:80:100:
0987:Williams:Pat:20:30:35:46:50:20:80:35:46:
0654:Bar:Foo:100:100:100:100:100:100:200:100:100:

This seems to do the job; it is intricate, though:
script.awk
BEGIN { FS = ":"; OFS = FS }
NR == 1 {
    for (i = 4; i < NF; i++) {
        c = substr($i, 1, 1)                  # first letter of the assignment name
        if (!(c in columns)) order[n++] = c   # note first-seen order of the letters
        columns[c]++
        letter[i] = c                         # which letter the column belongs to
    }
    nf = NF
    for (i = 0; i < n; i++)
        $(i+nf) = order[i] "T"                # append the total column headers
    print $0 OFS
    next
}
{
    for (c in columns) total[c] = 0
    for (i = 4; i < NF; i++) {
        c = letter[i]
        total[c] += $i
    }
    nf = NF
    for (i = 0; i < n; i++) {
        c = order[i]
        $(i+nf) = total[c]                    # append this student's totals
    }
    print $0 OFS
}
Explanation:
BEGIN:
Set the input and output field separators.
NR == 1:
Loop over the fields after the student ID and name fields.
Extract the first letter.
If the letter has not been seen before, note it in the order and increment the number of totals (n).
Increment the number of times the letter has been seen.
Record which letter goes with the current column.
Add the new columns after the existing columns in sequence.
Print the line plus a trailing output field separator (aka OFS or :).
Note that $NF is empty because of the trailing : in the data, hence (unusually for awk scripts) the loops use i < NF rather than i <= NF (see the one-line demonstration after this explanation).
Each data line:
Reset the totals for each letter.
For each of the scoring fields, find the letter that the column belongs to (letter[i]).
Add the column to the total for the letter.
For each of the extra fields in order, set the value of the appropriate extra field to the total for that letter.
Print the record plus an extra colon (aka OFS).
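To see why the trailing : leaves $NF empty (per the note above), a quick one-line check on a minimal :-terminated input:
$ printf 'a:b:\n' | awk -F: '{ print NF, "[" $NF "]" }'
3 []
awk counts the empty field after the final colon, so the last real value is in $(NF-1).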
data
StudentID:LastName:FirstName:hw01:quiz01:exam01:proj01:quiz02:
0123:Smith:Jon:100:80:80:100:90:
0987:Williams:Pat:20:30:35:46:50:
0654:Bar:Foo:100:100:100:100:100:
Sample output
$ awk -f script.awk data
StudentID:LastName:FirstName:hw01:quiz01:exam01:proj01:quiz02:hT:qT:eT:pT:
0123:Smith:Jon:100:80:80:100:90:100:170:80:100:
0987:Williams:Pat:20:30:35:46:50:20:80:35:46:
0654:Bar:Foo:100:100:100:100:100:100:200:100:100:
$
The only difference between this and the sample output in the question is a trailing colon on the title line, for consistency with the data lines (and input).

With a few adaptations:
- The order of the totals differs (it is dynamic, following awk's array iteration order).
- Total names use the full assignment name minus the last two digits (e.g. hwT, quizT).
- A parameter defines the first data field to count (the 4th here, via -v 'St=4').
awk -v 'St=4' '
BEGIN { FS=OFS=":" }
NR==1 {
    printf "%s", $0
    for (i=St; i<=(nf=NF-1); i++) {
        tn = $i; sub(/..$/, "T", tn)    # "hw01" -> "hwT", "quiz02" -> "quizT"
        T[tn] = 0; TN[i] = tn           # map each column to its total name
    }
    Sep = ""
    for (t in T) {
        printf "%s%s", Sep, t; Sep = OFS
    }
    print Sep
    next
}
{
    for (i=St; i<=nf; i++) {
        T[TN[i]] += $i
    }
    for (i=1; i<=nf; i++) printf "%s%s", $i, OFS
    Sep = ""
    for (t in T) {
        printf "%s%s", Sep, T[t]
        T[t] = 0; Sep = OFS             # reset the total for the next student
    }
    print Sep
}' YourFile
StudentID:LastName:FirstName:hw01:quiz01:exam01:proj01:quiz02:examT:quizT:hwT:projT:
0123:Smith:Jon:100:80:80:100:90:80:170:100:100:
0987:Williams:Pat:20:30:35:46:50:35:80:20:46:
0654:Bar:Foo:100:100:100:100:100:100:200:100:100:

Regex to match despite some of the characters not matching pattern?

I'm working with some bioinformatics data, and I've got this sed expression:
sed -n 'N;/.*:\(.*\)\n.*\1/{p;n;p;n;p};D' file.txt
It currently takes a file structured like this:
#E00378:1485 1:N:0:ABC
ABCDEF ##should match, all characters present
+
#
#E00378:1485 1:N:1:ABC
XYZABX ##should match, with permutation
+
#
#E00378:1485 1:N:1:ABCDE
ZABCDXFGH ##should match, with permutation
+
#
#E00378:1485 1:N:1:CBA
ABC ##should not match, order not preserved
+
#
Then it returns 4 lines if the sequence after : is found in the second line, so in this case I would get:
#E00378:1485 1:N:0:ABC
ABCDEF
+
#
However, I am looking to expand my search a little, by adding the possibility of searching for any single permutation of the letters, while maintaining the order, such that ABX, ZBC, AHC, ABO would all match the search criteria ABC.
Is a search like this possible to construct as a one-liner? Or should I write a script?
I was thinking it should be possible to programmatically change one of the letters to a * in the pattern space.
I am trying to make something along the lines of an AWK pattern that has a match defined as:
p = "";
p = p "."a[2]a[3]a[4]a[5]a[6]a[7]a[8]"|";
p = p a[1]"."a[3]a[4]a[5]a[6]a[7]a[8]"|";
p = p a[1]a[2]"."a[4]a[5]a[6]a[7]a[8]"|";
p = p a[1]a[2]a[3]"."a[5]a[6]a[7]a[8]"|";
p = p a[1]a[2]a[3]a[4]"."a[6]a[7]a[8]"|";
p = p a[1]a[2]a[3]a[4]a[5]"."a[7]a[8]"|";
p = p a[1]a[2]a[3]a[4]a[5]a[6]"."a[8]"|";
p = p a[1]a[2]a[3]a[4]a[5]a[6]a[7]".";
m = p;
But I can't seem to figure out how to build it programmatically for n characters.
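For reference, a loop can build that alternation for any length n. A minimal awk sketch, where seq stands in for the extracted sequence (the answers below do essentially the same thing inline):
awk -v seq="ABCDE" 'BEGIN {
    n = length(seq)
    for (i = 1; i <= n; i++) {
        alt = substr(seq, 1, i-1) "." substr(seq, i+1)   # character i replaced by "."
        p = (i > 1) ? p "|" alt : alt                    # join the alternatives with "|"
    }
    print p   # .BCDE|A.CDE|AB.DE|ABC.E|ABCD.
}'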
Okay, check this out where fuzzy is your input above:
£ perl -0043 -MText::Fuzzy -ne 'if (/.*:(.*?)\n(.*?)\n/) {my ($offset, $edits, $distance) = Text::Fuzzy::fuzzy_index ($1, $2); print "$offset $edits $distance\n";}' fuzzy
3 kkk 0
5 kkd 1
5 kkkkd 1
Since you haven't been 100% clear on your "fuzziness" criteria (and can't be until you have a measurement tool), I'll explain this first. Reference here:
http://search.cpan.org/~bkb/Text-Fuzzy-0.27/lib/Text/Fuzzy.pod
Basically, for each record (which I've assumed are split on #; that's the -0043 bit, octal 043 being #), the output is an offset, how the 1st string can become the 2nd string, and lastly the "distance" (Levenshtein, I would assume) between the two strings.
So..
£ perl -0043 -MText::Fuzzy -ne 'if (/.*:(.*?)\n(.*?)\n/) {my ($offset, $edits, $distance) = Text::Fuzzy::fuzzy_index ($1, $2); print "$_\n" if $distance < 2;}' fuzzy
#E00378:1485 1:N:0:ABC
ABCDEF
+
#
#E00378:1485 1:N:1:ABC
XYZABX
+
#
#E00378:1485 1:N:1:ABCDE
ZABCDXFGH
+
#
See here for installing perl modules like Text::Fuzzy
https://www.thegeekstuff.com/2008/09/how-to-install-perl-modules-manually-and-using-cpan-command/
Example input/output for a record that wouldn't be printed (distance is 3):
#E00378:1485 1:N:1:ABCDE
ZDEFDXFGH
+
#
gives us this (or simply doesn't print with the second perl command)
3 dddkk 3
Awk doesn't have sed's back-references, but it has more expressiveness to make up the difference. The following script composes the matching pattern from the final field of the lead line, then applies the pattern to the subsequent line.
#! /usr/bin/awk -f

BEGIN {
    FS = ":"
}

# Lead line has 5 fields
NF == 5 {
    line0 = $0
    seq = $NF
    getline
    if (seq != "") {
        n = length(seq)
        if (n == 1) {
            pat = seq
        } else {
            # ABC -> /.BC|A.C|AB./
            pat = "." substr(seq, 2, n - 1)
            for (i = 2; i < n; ++i)
                pat = pat "|" substr(seq, 1, i - 1) "." substr(seq, i + 1, n - i)
            pat = pat "|" substr(seq, 1, n - 1) "."
        }
        if ($0 ~ pat) {
            print line0
            print
            getline; print
            getline; print
            next
        }
    }
    getline
    getline
}
If a different matching pattern is needed later, the modification is mostly limited to the pattern-composition lines. By the way... I noticed that sequences repeat, so to make this faster we can implement caching:
#! /usr/bin/awk -f

BEGIN {
    FS = ":"
    # Noticed that sequences repeat
    # -- implement caching of patterns
    split("", cache)
}

# Lead line has 5 fields
NF == 5 {
    line0 = $0
    seq = $NF
    getline
    if (seq != "") {
        if (seq in cache) {
            pat = cache[seq]
        } else {
            n = length(seq)
            if (n == 1) {
                pat = seq
            } else {
                # ABC -> /.BC|A.C|AB./
                pat = "." substr(seq, 2, n - 1)
                for (i = 2; i < n; ++i)
                    pat = pat "|" substr(seq, 1, i - 1) "." substr(seq, i + 1, n - i)
                pat = pat "|" substr(seq, 1, n - 1) "."
            }
            cache[seq] = pat
        }
        if ($0 ~ pat) {
            print line0
            print
            getline; print
            getline; print
            next
        }
    }
    getline
    getline
}

Extract line before first empty line after match

I have a CSV file in this form:
* COMMENT
* COMMENT
100 ; 1706 ; 0.18 ; 0.45 ; 0.00015 ; 0.1485 ; 0.03 ; 1 ; 1 ; 2 ; 280 ; 100 ; 100 ;
* COMMENT
* COMMENT

* ZT vector
0; 367; p; nan
1; 422; p; nan
2; 1; d; nan

* KS vector
0; 367; p; 236.27
1; 422; p; 236.27
2; 1; d; 236.27

*Total time: 4.04211
I need to extract the last line before an empty line after matching the pattern KS vector.
To be clearer, in the above example I would like to extract the line
2; 1; d; 236.27
since it's the non-empty line just before the first empty one after the match with KS vector.
I would also like to use the same script to extract the same kind of line after matching the pattern ZT vector, that in the above example would return
2; 1; d; nan
I need to do this because I need the first number of that line, since it tells me the number of consecutive non-empty lines after KS vector.
My current workaround is this:
# counting number of lines after matching "KS vector" until first empty line
var=$(sed -n '/KS vector/,/^$/p' file | wc -l)
# Subtracting 2 to obtain actual number of lines
var=$(($var-2))
But if I could extract directly the last line I could extract the first element (2 in the example) and add 1 to it to obtain the same number.
You're going about this the wrong way. All you need is to put awk into paragraph mode and print 1 less than the number of lines in the record (since you don't want to include the KS vector line in your count):
$ awk -v RS= -F'\n' '/KS vector/{print NF-1}' file
3
Here's how awk sees the record when you put it into paragraph mode (by setting RS to null) with newline-separated fields (by setting FS to a newline):
$ awk -v RS= -F'\n' '/KS vector/{ for (i=1;i<=NF;i++) print NF, i, "<"$i">"}' file
4 1 <* KS vector>
4 2 <0; 367; p; 236.27>
4 3 <1; 422; p; 236.27>
4 4 <2; 1; d; 236.27>
With an awk expression:
awk -v vec="KS vector" '$0~vec{ f=1 }f && !NF{ print r; exit }f{ r=$0 }' file
vec - a variable containing the needed pattern/vector
$0~vec{ f=1 } - on encountering the needed pattern/vector, set the flag f to the active state
f{ r=$0 } - while the flag f is active (inside the needed vector section), capture the current line into variable r
f && !NF{ print r; exit } - NF is the total number of fields; an empty line has none, so !NF is true there. On encountering an empty line while iterating through the vector lines, print the last captured non-empty line r
exit - exit script execution immediately (avoiding redundant actions/iterations)
The output:
2; 1; d; 236.27
If you just want to print the number of lines under the found vector, use the following (r captures each line's leading index, so r+1 is the count):
awk -v vec="KS vector" '$0~vec{ f=1 }f && !NF{ print r+1; exit }f{ r=$1 }' file
3
With awk:
awk '$0 ~ "KS vector" { valid=1;getline } valid==1 { cnt++;dat[cnt]=$0 } $0=="" { valid="" } END { print dat[cnt-1] }' filename
Check for any lines matching "KS vector": set a valid flag and read in the next line. Read the data into an array with an incremented counter. When an empty line is encountered, reset the valid flag. At the end, print the last-but-one element of the dat array.

Splitting text lines, whilst appending prefix

EDITED
I have about 50x9Gb .mer files that look like this:
"xxxxx";"123\t123\t123\v234\t234\v234\t224\t234\v"
"yyyyy";"123\t234\t224\v234\t234\v234\t224\t234\v"
"zzzzz";"123\t456\t565\v234\t774"
A uuid followed by ";", then a vertical-tab-delimited list of groups, each holding one or more tab-separated entries, all enclosed in quotes. I've shown the entries as 3-digit numbers here, but in reality they're variable-length strings, which can include doubled quotes "".
I need to turn them into this:
xxxxx\t123\t123\t123
xxxxx\t234\t234
xxxxx\t234\t224\t234
yyyyy\t123\t234\t224
yyyyy\t234\t234
yyyyy\t234\t224\t234
zzzzz\t123\t456\t565
zzzzz\t234\t774
That is, split the lines on the vertical tabs, prefixing each line with the first field of the line that it came from.
Currently, I'm using a noddy regex, which does at least work, but requires multiple runs and manual checking.
How would I do this using awk or sed? I've tried adapting the current answers below, but I'm having trouble finding out what the ;P and ;D suffixes mean.
(Note: I'm using GitBash on Windows, so I guess that's gnu sed and awk ?)
awk -F';' -v OFS='\t' '{        # input fields split on ";", output joined with tabs
    for (i=1; i<=NF; i++)       # we have 2 fields; remove leading and trailing double quotes
        gsub(/^"|"$/, "", $i)
    c = split($2, a, "\v")      # split field 2 on vertical tabs, save the result in array a
    for (i=1; i<=c; i++)        # for each non-empty element, print field 1 (the uuid)
        if (a[i]) print $1, a[i]   # and the element, separated by a tab
}' file
The explanation is inline.
It outputs:
xxxxx 123 123 123
xxxxx 234 234
xxxxx 234 224 234
yyyyy 123 234 224
yyyyy 234 234
yyyyy 234 224 234
zzzzz 123 456 565
zzzzz 234 774
This might work for you (GNU sed):
sed -r 's/^((\S*\t)\S*)\v/\1\n\2/;P;D' file
Replace the first \v with a newline followed by the first field and a tab; Print the first line, Delete it, and repeat.
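As for the ;P and ;D suffixes the question asks about, they are the sed commands P and D (GNU sed semantics):
# P - print the pattern space up to the first embedded newline
# D - delete the pattern space up to and including that newline and, if anything
#     remains, restart the commands without reading a new input line
sed -r 's/^((\S*\t)\S*)\v/\1\n\2/;P;D' file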
EDIT: As per the new question:
sed -r '/\n/!s/"(")?/\1/g;/\n/!s/;/\t/;s/^((\S*\t)[^\v]*)\v/\1\n\2/;/\t$/!P;D' file
Remove any single double quotes (replacing doubled double quotes with single ones) and replace the semicolon with a tab character. Then replace each \v with a newline, the first field and a tab, and repeat.
You can use this awk command for this output:
awk 'BEGIN{FS=OFS="\t"} n = split($2, a, "\x0b") {
for (i=1; i<=n; i++) print $1, a[i]}' file
195a664e-e0d0-4488-99d6-5504f9178115 1234
195a664e-e0d0-4488-99d6-5504f9178115 1412
195a664e-e0d0-4488-99d6-5504f9178115 1231
195a664e-e0d0-4488-99d6-5504f9178115 4324
195a664e-e0d0-4488-99d6-5504f9178115 1421
195a664e-e0d0-4488-99d6-5504f9178115 3214
a1d61289-7864-40e6-83a7-8bdb708c459e 1412
a1d61289-7864-40e6-83a7-8bdb708c459e 6645
a1d61289-7864-40e6-83a7-8bdb708c459e 5334
a1d61289-7864-40e6-83a7-8bdb708c459e 3453
a1d61289-7864-40e6-83a7-8bdb708c459e 5453
How it works:
BEGIN{FS=OFS="\t"} # sets input and output field separator as tab
n = split($2, a, "\x0b") # splits second field using Hex 0B (ASCII 11) i.e. vertical tab
for (i=1; i<=n; i++) ... # prints pair of field 1 with each item from split array a
GNU sed
sed 's/"\|..$//g;s/;/\t/;:r;s/^\([^\t]*\)\t\(.*\)\\v/\1\t\2\n\1\t/;t r;s/\\t/\t/g;' YourFile
A recursive replace of each \v by the first "field" plus a tab, cleaning up extra characters along the way.
Another solution using awk
awk '
BEGIN { FS="[\v;]" }
{
    gsub("[\"]", "")
    for (i=2; i<=NF; ++i)
        if ($i) printf "%s\t%s\n", $1, $i
}' file.mer
Another solution using sed
sed -r 's/\v\n/\v/g; s/"//g;
:a; s/([^;]*);([^\v]*)\v/\1;\2\n\1;/g; ta;
s/;/\t/g;' file.mer | sed -r '/^[^\t]+\t$/d'
You get:
xxxxx 123 123 123
xxxxx 234 234
xxxxx 234 224 234
yyyyy 123 234 224
yyyyy 234 234
yyyyy 234 224 234
zzzzz 123 456 565
zzzzz 234 774
Okay, I waited on purpose until Kent's answer was accepted and awarded the bounty because the question was about awk/sed. Thus, my answer might be somewhat off-topic, but anyway here is my Java solution which I did just for fun as a kata.
MER input file generator:
I thought it would be nice to generate some sample input files with randomised values. Each line consists of
a UUID,
0-9 groups, separated by vertical tabs,
within each group, 1-4 strings, separated by horizontal tabs,
each string consisting of 1-20 characters, among them double quotes escaped by other double quotes, i.e. "".
I thought this is diverse enough to get some good test data.
package de.scrum_master.stackoverflow;

import org.apache.commons.lang.RandomStringUtils;

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Random;
import java.util.UUID;

public class RandomFileGenerator {
    private static final int BUFFER_SIZE = 1024 * 1024;
    private final static Random RANDOM = new Random();
    private final static char VERTICAL_TAB = '\u000b';
    private final static char[] LEGAL_CHARS =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzäöüÄÖÜß. -\""
            .toCharArray();

    public static void main(String[] args) throws IOException {
        long startTime = System.currentTimeMillis();
        // final long maxOutputSize = 9L * 1024 * 1024 * 1024;
        // final String outputFile = "src/main/resources/sample-9gb.mer";
        final long maxOutputSize = 1L * 1024 * 1024;
        final String outputFile = "src/main/resources/sample-1mb.mer";
        long totalOutputSize = 0;
        long lineCount = 0;
        String line;
        try (PrintWriter writer = new PrintWriter(new BufferedWriter(new FileWriter(outputFile), BUFFER_SIZE))) {
            while (totalOutputSize < maxOutputSize) {
                line = generateLine();
                writer.println(line);
                totalOutputSize += line.length() + 1;
                lineCount++;
            }
        }
        System.out.println(lineCount);
        System.out.println(totalOutputSize);
        System.out.println((System.currentTimeMillis() - startTime) / 1000.0);
    }

    private static String generateLine() {
        StringBuilder buffer = new StringBuilder();
        buffer
            .append('"')
            .append(UUID.randomUUID().toString())
            .append("\";\"");
        int numItems = RANDOM.nextInt(10);
        for (int i = 0; i < numItems; i++) {
            int numSubItems = 1 + RANDOM.nextInt(4);
            for (int j = 0; j < numSubItems; j++) {
                buffer.append(
                    RandomStringUtils.random(1 + RANDOM.nextInt(20), 0, LEGAL_CHARS.length, false, false, LEGAL_CHARS)
                        .replaceAll("\"", "\"\"")
                );
                if (j + 1 < numSubItems)
                    buffer.append('\t');
            }
            if (i + 1 < numItems) {
                buffer.append(VERTICAL_TAB);
            }
        }
        buffer.append('"');
        return buffer.toString();
    }
}
You can see that it is easy to create test files of the desired file size, e.g.
1 MB: maxOutputSize = 1L * 1024 * 1024
9 GB: maxOutputSize = 9L * 1024 * 1024 * 1024
I mostly used the smaller one to check the algorithms during development and the really big one for performance tuning.
File splitter in 4 different variants:
The variants shown here use different approaches, but what they all have in common is that they read from a BufferedReader with Java streams via reader.lines(). Switching from streams to a simple for loop made it slower, BTW. All solutions write the results into a PrintWriter.
reader.lines().forEach() and then regex matching + splitting. This solution has the best trade-off between readability, brevity and performance.
reader.lines().flatMap(), i.e. using sub-streams for the vertical-tab-separated groups after the UUID, also using regex matching + splitting. This solution is also quite short and elegant, but a little harder to read than #1 and also about 15% slower.
Because regex matching calls like replace() and split() can be pretty expensive, I developed a solution that rather iterates over the string and uses indexOf() and substring() instead of regexes. This was substantially faster than #1 and #2, but the code is harder to read in a way I started to dislike. This should only be done if performance is really important, i.e. if the file splitter is used regularly. For a one-time solution or if it runs just once a month, I think it is not really worth it from a maintainability standpoint.
A further optimised version of #3 which avoids some more overhead and is again a little bit faster, but not substantially. Now the code really needs source code comments so as to convey to the reader what the algorithm does. From a clean code perspective this is a nightmare. (Don't do this at home, kids!)
package de.scrum_master.stackoverflow;

import java.io.*;
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FileSplitter {
    private static final int BUFFER_SIZE = 1024 * 1024;
    private static final Pattern LINE_PATTERN = Pattern.compile("^\"([^\"]+)\";\"(.*)\"$");
    private final static char VERTICAL_TAB = '\u000b';

    public static void main(String[] args) throws IOException {
        long startTime = System.currentTimeMillis();
        String inputFile = "src/main/resources/sample-9gb.mer";
        String outputFile = inputFile.replaceFirst("mer$", "txt");
        try (
            BufferedReader reader = new BufferedReader(new FileReader(inputFile), BUFFER_SIZE);
            PrintWriter writer = new PrintWriter(new BufferedWriter(new FileWriter(outputFile), BUFFER_SIZE))
        ) {
            // forEachVariant(reader, writer);
            // flatMapVariant(reader, writer);
            noRegexSimpleVariant(reader, writer);
            // noRegexOptimisedVariant(reader, writer);
        }
        System.out.println((System.currentTimeMillis() - startTime) / 1000.0);
    }

    private static void forEachVariant(BufferedReader reader, PrintWriter writer) {
        Matcher matcher = LINE_PATTERN.matcher("dummy");
        reader.lines()
            .forEach(line -> {
                matcher.reset(line).matches();
                for (String record : matcher.group(2).replace("\"\"", "\"").split("\\v"))
                    writer.println(matcher.group(1) + "\t" + record);
            });
    }

    private static void flatMapVariant(BufferedReader reader, PrintWriter writer) {
        Matcher matcher = LINE_PATTERN.matcher("dummy");
        reader.lines()
            .flatMap(line -> {
                matcher.reset(line).matches();
                return Arrays
                    .stream(matcher.group(2).replace("\"\"", "\"").split("\\v"))
                    .map(record -> matcher.group(1) + "\t" + record);
            })
            .forEach(writer::println);
    }

    private static void noRegexSimpleVariant(BufferedReader reader, PrintWriter writer) {
        reader.lines()
            .forEach(line -> {
                final int lineLength = line.length();
                // UUID + '\t'
                int indexLeft = 1;
                int indexRight = line.indexOf('"', indexLeft);
                final String uuid = line.substring(indexLeft, indexRight) + "\t";
                indexLeft = indexRight + 3;
                String record;
                int quoteIndex;
                while (indexLeft < lineLength) {
                    writer.print(uuid);
                    indexRight = line.indexOf(VERTICAL_TAB, indexLeft);
                    if (indexRight == -1)
                        indexRight = lineLength - 1;
                    while (indexLeft < indexRight) {
                        quoteIndex = line.indexOf('"', indexLeft);
                        if (quoteIndex == -1 || quoteIndex >= indexRight)
                            quoteIndex = indexRight;
                        else
                            quoteIndex++;
                        record = line.substring(indexLeft, quoteIndex);
                        writer.print(record);
                        indexLeft = quoteIndex + 1;
                    }
                    writer.println();
                    indexLeft = indexRight + 1;
                }
            });
    }

    private static void noRegexOptimisedVariant(BufferedReader reader, PrintWriter writer) throws IOException {
        reader.lines()
            .forEach(line -> {
                // UUID + '\t'
                int indexLeft = 1;
                int indexRight = line.indexOf('"', indexLeft);
                final String uuid = line.substring(indexLeft, indexRight) + "\t";
                // Skip '";"' after UUID
                indexLeft = indexRight + 3;
                final int lineLength = line.length();
                String recordChunk;
                int quoteIndex;
                // If search for '"' has once reached end of line, search no more
                boolean doQuoteSearch = true;
                // Iterate over records per UUID, separated by vertical tab
                while (indexLeft < lineLength) {
                    writer.print(uuid);
                    indexRight = line.indexOf(VERTICAL_TAB, indexLeft);
                    if (indexRight == -1)
                        indexRight = lineLength - 1;
                    // Search for '""' within record incrementally, + replace each of them by '"'.
                    // BTW, if '"' is found, it actually always will be an escaped '""'.
                    while (indexLeft < indexRight) {
                        if (doQuoteSearch) {
                            // Only search for quotes if we never reached the end of line before
                            quoteIndex = line.indexOf('"', indexLeft);
                            assert quoteIndex != -1;
                            if (quoteIndex >= lineLength - 1)
                                doQuoteSearch = false;
                            if (quoteIndex >= indexRight)
                                quoteIndex = indexRight;
                            else
                                quoteIndex++;
                        }
                        else {
                            // No more '"' within record
                            quoteIndex = indexRight;
                        }
                        // Write record chunk, skipping 2nd '"'
                        recordChunk = line.substring(indexLeft, quoteIndex);
                        writer.print(recordChunk);
                        indexLeft = quoteIndex + 1;
                    }
                    // Do not forget newline before reading next line/UUID
                    writer.println();
                    indexLeft = indexRight + 1;
                }
            });
    }
}
Updated awk script:
An aside: each Java solution writes out a lone UUID when a line has no content after it in the input file. This would be easy to avoid, but I did it on purpose. It is the only difference from this slightly updated awk script (based on Dave's, but also replacing "" with "), which I used as a benchmark:
#!/usr/bin/awk -f
BEGIN { FS=";"; OFS="\t" }      # separators as in Dave's invocation (-F';' -v OFS='\t')
{
    for (i=1; i<=NF; i++) {
        gsub(/^"|"$/, "", $i)   # strip leading/trailing quotes
        gsub(/""/, "\"", $i)    # replace "" by "
    }
    c = split($2, a, "\\v")
    for (i=1; i<=c; i++)
        print $1, a[i]
}
Performance results:
I measured both parsing and writing performance.
Parsing means reading a 9 GB file from disk and splitting it, but writing the output to /dev/null or not at all.
Writing means reading the same 9 GB file and writing it back onto the same disk partition (hybrid HD + SSD type), i.e. this could be further optimised by writing onto another physical disk. The output file has a size of 18 GB.
Reading file, splitting into lines but not parsing the lines: 66 s
Awk
parsing only: 533 s
parsing + writing: 683 s
reader.lines().forEach() and then regex matching + splitting
parsing only: 212 s
parsing + writing: 425 s
reader.lines().flatMap(), i.e. using sub-streams
parsing only: 245 s
parsing + writing: not measured
Without using regex, but using String.replace("\"\"", "\"") (not shown in code here)
parsing only: 154 s
parsing + writing: 369 s
No regex, no replace(), simple version
parsing only: 86 s
parsing + writing: 342 s
No regex, no replace(), optimised version
parsing only: 84 s
parsing + writing: 342 s
Sorry for the lengthy treatise, but I wanted to share my findings for others reading the question and the other answers, speculating whether Java (or C?) might be faster than awk. Yes, it is by a substantial bit, but not by an order of magnitude, because disk performance is also a factor. And I think it is a warning to those who tend to over-optimise for optimisation's sake. It is not worth it if you go too far; just try to hit the sweet spot between effort, readability and performance. Amen.

Finding columns with only white space in a text file and replacing them with a unique separator

I have a file like this:
aaa b b ccc 345
ddd fgt f u 3456
e r der der 5 674
As you can see, the only way to separate the fields is to find character columns that contain only spaces. How can we identify these columns and replace them with a unique separator like ,?
aaa,b b,ccc,345
ddd,fgt,f u,3456
e r,der,der,5 674
Note:
If we find every contiguous block of columns containing only white space (nothing else) and replace each whole block with a single ,, the problem will be solved.
Better explanation of the question by josifoski :
Treating the file as a character matrix: if every character in a block of columns is a space, the whole block should be replaced vertically with one , on every line.
$ cat tst.awk
BEGIN { FS=OFS=""; ARGV[ARGC]=ARGV[ARGC-1]; ARGC++ }   # queue the file so it is read twice
NR==FNR {
    for (i=1; i<=NF; i++) {
        if ($i == " ") {
            space[i]               # referencing the element is enough to create it
        }
        else {
            nonSpace[i]
        }
    }
    next
}
FNR==1 {
    for (i in nonSpace) {
        delete space[i]            # keep only positions that are blank on every line
    }
}
{
    for (i in space) {
        $i = ","
    }
    gsub(/,+/, ",")                # collapse each run of commas into one separator
    print
}
$ awk -f tst.awk file
aaa,b b,ccc,345
ddd,fgt,f u,3456
e r,der,der,5 674
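The BEGIN trick ARGV[ARGC]=ARGV[ARGC-1]; ARGC++ queues the input file a second time, so the script makes two passes: the first (where NR==FNR) records which character positions ever hold a non-space, and the second rewrites the all-space columns. A stripped-down sketch of the idiom, assuming a file named file:
$ awk '
    BEGIN { ARGV[ARGC] = ARGV[ARGC-1]; ARGC++ }   # append the last filename again
    NR == FNR { print "pass 1:", $0; next }       # true only for the first copy
    { print "pass 2:", $0 }                       # same lines, second time around
' file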
Another in awk
awk 'BEGIN{OFS=FS=""}           # empty field separator: each character is a field
FNR==NR{                        # first pass over the file
    for(i=1;i<=NF;i++)
        a[i]+=$i!=" "           # count non-space characters seen at each position
    next                        # skip the remaining commands during the first pass
}
{                               # second pass (the same file read a second time)
    for(i=1;i<=NF;i++)          # loop through the character positions
        if(!a[i]){              # if this position was a space on every line
            $i=","              # change the field to ","
            x=i                 # set x to the field number
            while(!a[++x]){     # while the following positions are also all-space
                $x=""           # blank them out
                i=x             # advance i so those fields are not revisited
            }
        }
}1' test{,}                     # print; test{,} expands to "test test" (same file twice)
Since you have also tagged this r, here is a possible solution using the R package readr. It looks like you want to read a fixed-width file and convert it to a comma-separated file. You can use read_fwf to read the fixed-width file and write_csv to write the comma-separated file.
# required package
require(readr)
# read data
df <- read_fwf(path_to_input, fwf_empty(path_to_input))
# write data
write_csv(df, path = path_to_output, col_names = FALSE)

How to find for which Begin there is no End

Hello, I have a huge file with thousands of Begin and End markers, and I am looking for logic that will tell me whether each Begin has a matching End.
For example:
# Begin Variable
Name = loopEndDriven
Decl_type = UInt8
Usage = Z
Value = CE_SELECT
# End Variable
# Begin Variable
Name = locationNeeded
Decl_type = Loop_Location_t
Usage = Z
Value = SHORT_LOCATION
# End Variable
perl -lne 'BEGIN{$t=0}
if(/Begin/ ||(eof && $t==1))
{print "No end at $." unless($t==0);$t=1}
$t=0 if(/End/);' your_file
The above command prints the line number whenever a Begin has no matching End.
I have tested below:
> cat temp
# Begin Variable
Name = loopEndDriven
Decl_type = UInt8
Usage = Z
Value = CE_SELECT
# End Variable
# Begin Variable
Name = locationNeeded
Decl_type = Loop_Location_t
Usage = Z
Value = SHORT_LOCATION
# Begin Variable
Name = locationNeeded
Decl_type = Loop_Location_t
Usage = Z
Value = SHORT_LOCATION
> perl -lne 'BEGIN{$t=0}if(/Begin/ ||(eof && $t==1)){print "No end at $." unless($t==0);$t=1}$t=0 if(/End/);' temp
No end at 17
No end at 23
>
Along similar lines, you can write the logic to check whether each End has a matching Begin.
Assuming nesting isn't allowed.
my $in_begin = 0;
while (<>) {
    if (/# Begin/) {
        warn(qq{Missing "End" at line $.\n}) if $in_begin;
        $in_begin = 1;
    }
    elsif (/# End/) {
        warn(qq{Missing "Begin" at line $.\n}) if !$in_begin;
        $in_begin = 0;
    }
}
warn(qq{Missing "End" at EOF\n}) if $in_begin;
Better diagnostics:
my $begin = 0;
while (<>) {
    if (/# Begin/) {
        warn(qq{Missing "End" for "Begin" at line $begin\n}) if $begin;
        $begin = $.;
    }
    elsif (/# End/) {
        warn(qq{Missing "Begin" for "End" at line $.\n}) if !$begin;
        $begin = 0;
    }
}
warn(qq{Missing "End" for "Begin" at line $begin\n}) if $begin;
Balanced Expressions
There are no BEGIN or END keywords in the corpus you posted. The following one-liner will check for balanced expressions using your block comments instead.
$ perl -ne '$pairs += 1 if /Begin Variable/;
            $pairs -= 1 if /End Variable/;
            END {
                if ($pairs == 0) {print "balanced\n"} else {print "unbalanced\n"}
            }' /tmp/foo
With your corpus as currently posted, the one-liner should print balanced.
As you ask for logic rather than implementation:
Walk through the file line by line
Increment a variable each time you encounter # Begin Variable
Decrement this variable each time you encounter # End Variable
If the variable ever becomes 2, the previous block was unended; record this somehow and decrement (a sketch of this logic follows below).
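A minimal awk sketch of that logic, assuming the markers look exactly like the sample's # Begin Variable / # End Variable lines:
awk '
    /# Begin Variable/ { if (open) print "Begin at line " openAt " has no End"
                         open = 1; openAt = NR }
    /# End Variable/   { if (!open) print "End at line " NR " has no Begin"
                         open = 0 }
    END                { if (open) print "Begin at line " openAt " has no End" }
' file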