Replace the last occurrence of a field in a csv - regex

I have a csv file like this
KEY,F1,F2,STEP,LAST_OCCURRENCE
100.101,a,b,STEP_1,<empty>
100.102,c,d,STEP_1,<empty>
100.103,e,f,STEP_1,<empty>
100.101,g,h,STEP_1,<empty>
100.103,i,j,STEP_1,<empty>
100.101,g,h,STEP_2,<empty>
100.103,i,j,STEP_2,<empty>
I am able to change the final field to whatever is easiest to parse, so it can be considered either blank (i.e. the line ends with ,\n) or containing the word <empty> as above.
From this file I have to set the "LAST_OCCURRENCE" field to a boolean value indicating whether the line is the last occurrence of its [ KEY + STEP ] combination.
The expected result is this one:
KEY,F1,F2,STEP,LAST_OCCURRENCE
100.101,a,b,STEP_1,false
100.102,c,d,STEP_1,true #Last 100.102 for STEP_1
100.103,e,f,STEP_1,false
100.101,g,h,STEP_1,true #Last 100.101 for STEP_1
100.103,i,j,STEP_1,true #Last 100.103 for STEP_1
100.101,g,h,STEP_2,true #Last 100.101 for STEP_2
100.103,i,j,STEP_2,true #Last 100.103 for STEP_2
Which is the fastest approach?
Would it be possible to do it with a sed script, or would it be better to post-process the input file with another (Perl? PHP?) script?

Using tac and awk:
tac file |
awk 'BEGIN{FS=OFS=","} $1 != "KEY"{$NF = (seen[$1,$4]++) ? "false" : "true"} 1' |
tac
After reversing the file with tac, we use an associative array seen, keyed on the composite key $1,$4, to detect the first occurrence of each KEY,STEP combination, which is the last occurrence in the original order; the $1 != "KEY" guard leaves the header line untouched. A final tac restores the original order.
Output:
KEY,F1,F2,STEP,LAST_OCCURRENCE
100.101,a,b,STEP_1,false
100.102,c,d,STEP_1,true
100.103,e,f,STEP_1,false
100.101,g,h,STEP_1,true
100.103,i,j,STEP_1,true
100.101,g,h,STEP_2,true
100.103,i,j,STEP_2,true
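
If tac is not available, a single-pass variant that buffers the data lines in memory can produce the same result. This is only a sketch; it assumes the five-column layout shown above and that the file fits in memory:
awk 'BEGIN { FS = OFS = "," }
     NR == 1 { print; next }                        # header passes through unchanged
     {
         key[NR]  = $1 FS $4                        # KEY,STEP of this line
         head[NR] = $1 OFS $2 OFS $3 OFS $4         # everything except the last field
         last[key[NR]] = NR                         # last line number seen for this pair
     }
     END {
         for (i = 2; i <= NR; i++)
             print head[i], (last[key[i]] == i ? "true" : "false")
     }' file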

$ awk 'BEGIN{FS=OFS=","} NR==FNR{last[$1,$4]=NR;next} FNR>1{$NF=(FNR==last[$1,$4] ? "true" : "false")} 1' file file
KEY,F1,F2,STEP,LAST_OCCURRENCE
100.101,a,b,STEP_1,false
100.102,c,d,STEP_1,true
100.103,e,f,STEP_1,false
100.101,g,h,STEP_1,true
100.103,i,j,STEP_1,true
100.101,g,h,STEP_2,true
100.103,i,j,STEP_2,true
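
The same two-pass program, laid out with comments (functionally identical to the one-liner above):
awk 'BEGIN   { FS = OFS = "," }
     NR==FNR { last[$1,$4] = NR; next }                         # 1st pass: remember last line number per KEY,STEP
     FNR>1   { $NF = (FNR == last[$1,$4] ? "true" : "false") }  # 2nd pass: rewrite data lines only
     1                                                          # print every line, header included
    ' file file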

Related

Sort file lines based on a regular expression

I have lines in my test file as
2018-05-28T17:13:08.024 {"operation":"INSERT","primaryKey":{"easy_id":1234},"subSystem":"ts\est","table":"tbl","timestamp":1527495188024}
I have to sort the lines based on the timestamp field. I am using sed to extract the timestamp and trying to place it as the 1st column using sed -e 's/((?<=\"timestamp\":)\d+.*?)/\1 .
Can anyone help me fix the regex?
Right now I am getting the error sed: 1: "s/((?<=\"timestamp\":)\ ...": \1 not defined in the RE . I think the error is caused by my regex.
You can do a quick implementation with gawk as well, without creating any intermediate columns etc.
Command:
awk -F'"timestamp":' '{a[substr($2,1,length($2)-1)]=$0} END{n=asorti(a,b); for(i=1;i<=n;i++) print a[b[i]]}' input
Explanations:
-F'"timestamp":' you define "timestamp": as field separator
{a[substr($2,1,length($2)-1)]=$0} on each line of your file you save the timestamp value as an index and the whole line in an associative array
END{asorti(a,b);for(i in b){print a[b[i]]}} at the end of the processing you sort the associative array on the index (the timestamp) and you print the content of the array based on the sorted indexes.
input:
$ more input
2018-05-28T17:15:08.026 {"operation":"DELETE","primaryKey":{"easy_id":1236},"subSystem":"ts\est2","table":"tbl1","timestamp":1527495188026}
2018-05-28T17:13:08.024 {"operation":"INSERT","primaryKey":{"easy_id":1234},"subSystem":"ts\est","table":"tbl","timestamp":1527495188024}
2018-05-28T17:14:08.025 {"operation":"UPDATE","primaryKey":{"easy_id":1235},"subSystem":"ts\est1","table":"tbl1","timestamp":1527495188025}
output:
awk -F'"timestamp":' '{a[substr($2,1,length($2)-1)]=$0} END{n=asorti(a,b); for(i=1;i<=n;i++) print a[b[i]]}' input
2018-05-28T17:13:08.024 {"operation":"INSERT","primaryKey":{"easy_id":1234},"subSystem":"ts\est","table":"tbl","timestamp":1527495188024}
2018-05-28T17:14:08.025 {"operation":"UPDATE","primaryKey":{"easy_id":1235},"subSystem":"ts\est1","table":"tbl1","timestamp":1527495188025}
2018-05-28T17:15:08.026 {"operation":"DELETE","primaryKey":{"easy_id":1236},"subSystem":"ts\est2","table":"tbl1","timestamp":1527495188026}
You could use the sort command:
sort -t: -k8 inputfile
Here -t: makes the colon : the delimiter. The key -k8 starts at the eighth field and runs to the end of the line; since the colon in timestamp": is the eighth colon in the line, the key covers the timestamp value.
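A variant that keys on the timestamp value itself (field 9, i.e. everything after the eighth colon) and compares it numerically, assuming the timestamp is always the last colon-delimited part of the line:
sort -t: -k9,9n inputfile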
awk: This solution works in the general case, where the timestamp can appear anywhere in the line:
awk 'BEGIN {FPAT="\"timestamp\": *[0-9]*"; PROCINFO["sorted_in"]="@ind_num_asc" }
{ a[substr($1,13)]=$0 }
END { for(i in a) print a[i] }' <file>
FPAT states that each line contains a single field of the form "timestamp": nnnnnnnn, and PROCINFO["sorted_in"] makes for (i in ...) loops traverse arrays in numerically ascending key order. The main block strips the "timestamp": prefix from $1 (substr($1,13)) and uses the remaining number as the array key for the whole line. In the END block, printing the array in key order yields the sorted file.

Replacing text between n-th and (n+1)-th delimiter using sed

I wonder how I can change a single value at a particular position in a pipe-delimited dataset.
For example, I have data set:
01|456|AAAA|James Bond|AAAA|207085
02|AAAA|BBBB|Marco Polo|BBBB|937311723
03|321332|BBBB|Brad Pitt|AAAA|6296903
04|3213|AAAA|AAAA|BBBB|62969
I want to change every "AAAA" value to "XXXX", but only between the 4th and 5th pipe character ( | ). So the expected output will look like:
01|456|AAAA|James Bond|XXXX|207085
02|AAAA|BBBB|Marco Polo|BBBB|937311723
03|321332|BBBB|Brad Pitt|XXXX|6296903
04|3213|AAAA|AAAA|BBBB|62969
Is it achievable using only sed, or is it necessary to use something like awk?
Set the input field separator (FS) and output field separator (OFS) to |, and if column 5 is exactly AAAA, replace it with XXXX:
awk 'BEGIN{FS=OFS="|"} $5=="AAAA" {$5="XXXX"}1' file
Output:
01|456|AAAA|James Bond|XXXX|207085
02|AAAA|BBBB|Marco Polo|BBBB|937311723
03|321332|BBBB|Brad Pitt|XXXX|6296903
04|3213|AAAA|AAAA|BBBB|62969
Better to use awk for this:
awk 'BEGIN{FS=OFS="|"} {gsub(/A/, "X", $5)} 1' file
01|456|AAAA|James Bond|XXXX|207085
02|AAAA|BBBB|Marco Polo|BBBB|937311723
03|321332|BBBB|Brad Pitt|XXXX|6296903
04|3213|AAAA|AAAA|BBBB|62969
BEGIN{FS=OFS="|"} uses the pipe as both input and output field separator
gsub(/A/, "X", $5) replaces each A with X in $5, i.e. in the 5th column only
1 is the default action, which prints each line
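If the 5th field could contain other text with stray A characters, a narrower pattern that only touches the literal AAAA might be safer (same idea as above and as the variable-driven answer below):
awk 'BEGIN{FS=OFS="|"} {gsub(/AAAA/, "XXXX", $5)} 1' file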
awk -v start=4 -v end=5 'BEGIN{FS=OFS="|"}{for(i=start;i<=end;i++) gsub(/AAAA/,"XXXX",$i)}1' inputfile
01|456|AAAA|James Bond|XXXX|207085
02|AAAA|BBBB|Marco Polo|BBBB|937311723
03|321332|BBBB|Brad Pitt|XXXX|6296903
04|3213|AAAA|XXXX|BBBB|62969
Based on the values of the start and end variables, gsub does the replacement in every column from start through end.
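For example, running the same script with -v start=2 -v end=5 also rewrites the AAAA values in fields 2, 3 and 4 of the sample data:
awk -v start=2 -v end=5 'BEGIN{FS=OFS="|"}{for(i=start;i<=end;i++) gsub(/AAAA/,"XXXX",$i)}1' inputfile
01|456|XXXX|James Bond|XXXX|207085
02|XXXX|BBBB|Marco Polo|BBBB|937311723
03|321332|BBBB|Brad Pitt|XXXX|6296903
04|3213|XXXX|XXXX|BBBB|62969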
This might work for you (GNU sed):
sed -r ':a;s/^(([^|]*\|){4}[^|]*)AAAA/\1XXXX/;ta' file
Iterate, replacing each occurrence of AAAA found between the fourth and fifth | with XXXX, until none is left.

remove duplicate lines (only first part) from a file

I have a list like this
ABC|Hello1
ABC|Hello2
ABC|Hello3
DEF|Test
GHJ|Blabla1
GHJ|Blabla2
And I want it to be this:
ABC|Hello1
DEF|Test
GHJ|Blabla1
So I want to remove lines whose part before the | duplicates an earlier line, keeping only the first one.
A simple way using awk
$ awk -F"|" '!seen[$1]++ {print $0}' file
ABC|Hello1
DEF|Test
GHJ|Blabla1
The trick here is to set the appropriate field separator, "|" in this case, after which the individual columns can be accessed starting with $1. The array seen records each $1 value the first time it appears, and !seen[$1]++ is true only on that first occurrence, so the line is printed only then.
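
If the input is already grouped by the first field, as in the example, GNU sort can produce the same result; this is a sketch using a stable, key-unique sort (-s keeps the first input line of each key group, and the output is ordered by key):
sort -t'|' -s -u -k1,1 file
ABC|Hello1
DEF|Test
GHJ|Blabla1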

Delete Specific Lines with AWK [or sed, grep, whatever]

Is it possible to remove lines from a file using awk? I'd like to find any lines that have Y in the last column and then remove any lines that match the value in column 2 of said line.
Before:
KEY1,TRACKINGKEY1,TRACKINGNUMBER1-1,PACKAGENUM1-1,N
,TRACKINGKEY1,TRACKINGNUMBER1-2,PACKAGENUM1-2,N
KEY1,TRACKINGKEY1,TRACKINGNUMBER1-1,PACKAGENUM1-1,Y
,TRACKINGKEY1,TRACKINGNUMBER1-2,PACKAGENUM1-2,Y
KEY1,TRACKINGKEY5,TRACKINGNUMBER1-3,PACKAGENUM1-3,N
KEY2,TRACKINGKEY2,TRACKINGNUMBER2-1,PACKAGENUM2-1,N
KEY3,TRACKINGKEY3,TRACKINGNUMBER3-1,PACKAGENUM3-1,N
,TRACKINGKEY3,TRACKINGNUMBER3-2,PACKAGENUM3-2,N
So awk would find that row 3 has Y in the last column, then look at column 2 [TRACKINGKEY1] and remove all lines that have TRACKINGKEY1 in column 2.
Expected result:
KEY1,TRACKINGKEY5,TRACKINGNUMBER1-3,PACKAGENUM1-3,N
KEY2,TRACKINGKEY2,TRACKINGNUMBER2-1,PACKAGENUM2-1,N
KEY3,TRACKINGKEY3,TRACKINGNUMBER3-1,PACKAGENUM3-1,N
,TRACKINGKEY3,TRACKINGNUMBER3-2,PACKAGENUM3-2,N
The reason for this is that our shipping program puts out a file whenever a shipment is processed, as well as when that shipment gets voided [in case of an error]. So what I end up with is the initial package info, then the same info indicating that it was voided, then yet another set of lines with the new shipment info. Unfortunately our ERP software has a fairly simple scripting language in which I can't even make an array so I'm limited to shell tools.
Thanks in advance!
One way is to make two passes over the same file using awk:
awk -F, 'NR == FNR && $NF=="Y" && !($2 in seen){seen[$2]}
NR != FNR && !($2 in seen)' file file
KEY1,TRACKINGKEY5,TRACKINGNUMBER1-3,PACKAGENUM1-3,N
KEY2,TRACKINGKEY2,TRACKINGNUMBER2-1,PACKAGENUM2-1,N
KEY3,TRACKINGKEY3,TRACKINGNUMBER3-1,PACKAGENUM3-1,N
,TRACKINGKEY3,TRACKINGNUMBER3-2,PACKAGENUM3-2,N
Explanation:
NR == FNR          # if processing the file the 1st time
&& $NF == "Y"      # and the last field is Y
&& !($2 in seen) { # and we haven't seen field 2 before
    seen[$2]       # store field 2 in array seen
}
NR != FNR          # when processing the file the 2nd time
&& !($2 in seen)   # and array seen doesn't have field 2
                   # take the default action and print the line
This solution is kind of gross, but kind of fun.
grep ',Y$' file | cut -d, -f2 | sort -u | grep -vwFf - file
grep ',Y$' file -- find the lines with Y in the last column
cut -d, -f2 -- print just the tracking key from those lines
sort -u -- give just the unique keys
grep -vwFf - file --
read the unique tracking keys from stdin (-f -)
only consider them a match if they are whole words (-w)
they are fixed strings, not regular expressions (-F)
then exclude lines matching these patterns (-v) from file
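
If reading the file twice is not an option (for example, the data arrives on a pipe), a single-pass sketch that buffers the lines in memory works too, assuming the file fits in memory:
awk -F, '
    $NF == "Y" { bad[$2] }                 # remember tracking keys that were voided
    { line[NR] = $0; key[NR] = $2 }        # buffer every line together with its tracking key
    END {
        for (i = 1; i <= NR; i++)
            if (!(key[i] in bad)) print line[i]   # print only lines whose key was never voided
    }' file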

unix : search a file if a string is present between two patterns

I have a file in the format given below. I want to check whether a word, e.g. 'hello', is present in the lines following schema: and before the next DocName. If it is present, how many such schemas have it?
How can I do this in one line using grep/awk/sed?
The expected output, assuming I am searching for the word 'hello', is 3: it is present in the 1st, 2nd and 4th schemas, so three schemas contain it. Note that even though 'hello' occurs more than once in the first schema, that schema is still counted only once.
:
:
:
DocName: abjrkj.txt
schema:
abs
askj
djsk
djsk
hello
adj
hello
DocName: abjrkj.txt
schema:
abs
askj
djsk
djsk
adj
hello
DocName: aasjrkj.txt
schema:
absasd
askjas
djsksa
djskasd
adjsg
DocName: ghhd.txt
schema:
absg
fdgaskj
dgdjsk
dgdfdjsk
drgadj
hello
:
:
:
Try this (gawk, treating DocName: as the record separator):
awk -v RS='DocName:' '/hello/ { ++i }
END { print i }' file
If you absolutely require a one-line solution (why??) the whitespace can be condensed to just one space.
Here is a sed solution:
sed ':a; N; s/\n/ /; $!ba; s/DocName/\n&/g' < file | sed -n '/DocName/{/hello/p}' | wc
The algorithm: the first sed reads the whole file into the pattern space, replacing every \n with a space, then inserts a \n before every DocName string, so each document ends up on its own line. The second sed prints only the lines containing both DocName and hello. Finally wc prints three numbers, of which the first (the line count, here 3) is the answer; use wc -l to get only that count, or omit the | wc to see the matched lines for testing. Maybe a more elegant sed solution exists playing with the pattern and hold spaces!
If your real input file has schemas separated by blank lines, you can use awk in paragraph mode (an empty RS) and then it's simply:
$ awk -v RS= '/hello/{++c} END{print c}' file
3
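
If the schemas are not separated by blank lines (as in the sample shown above), a plain POSIX awk sketch that resets a flag at every DocName line counts each schema at most once; for the sample above it prints 3:
awk '/^DocName:/       { found = 0 }          # a new document starts: reset the flag
     /hello/ && !found { c++; found = 1 }     # count only the first hello per document
     END               { print c+0 }' file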