Sort file lines based on regular expression - regex

I have lines in my test file like this:
2018-05-28T17:13:08.024 {"operation":"INSERT","primaryKey":{"easy_id":1234},"subSystem":"ts\est","table":"tbl","timestamp":1527495188024}
I have to sort the lines based on the timestamp field. I am using sed to extract the timestamp and am trying to place it as the first column with sed -e 's/((?<=\"timestamp\":)\d+.*?)/\1 .
Can anyone help fix the regex?
Right now I am getting this error: sed: 1: "s/((?<=\"timestamp\":)\ ...": \1 not defined in the RE. I think the error is caused by my regex.

You can do a quick implementation with gawk as well, without creating any intermediate columns etc.
Command:
awk -F'"timestamp":' '{a[substr($2,1,length($2)-1)]=$0}END{asorti(a,b);for(i in b){print a[b[i]]}}' input
Explanation:
-F'"timestamp":' defines "timestamp": as the field separator.
{a[substr($2,1,length($2)-1)]=$0} for each line, substr strips the trailing } from $2 so that only the timestamp value remains; that value becomes the index of an associative array and the whole line is stored as its value.
END{asorti(a,b);for(i in b){print a[b[i]]}} at the end of processing, asorti sorts the array indexes (the timestamps) into b, and the lines are printed following that sorted order.
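If the one-liner is hard to read, here is the same gawk program written out with comments (a sketch; it loops over asorti's return value instead of using for (i in b), which makes the traversal order explicit):
gawk -F'"timestamp":' '
{
    # strip the trailing "}" so only the numeric timestamp remains
    ts = substr($2, 1, length($2) - 1)
    a[ts] = $0                # index: timestamp, value: whole line
}
END {
    # asorti sorts the indexes of a (the timestamps) into b[1..n];
    # string order matches numeric order here because all the
    # timestamps have the same number of digits
    n = asorti(a, b)
    for (i = 1; i <= n; i++)
        print a[b[i]]
}' input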
input:
$ more input
2018-05-28T17:15:08.026 {"operation":"DELETE","primaryKey":{"easy_id":1236},"subSystem":"ts\est2","table":"tbl1","timestamp":1527495188026}
2018-05-28T17:13:08.024 {"operation":"INSERT","primaryKey":{"easy_id":1234},"subSystem":"ts\est","table":"tbl","timestamp":1527495188024}
2018-05-28T17:14:08.025 {"operation":"UPDATE","primaryKey":{"easy_id":1235},"subSystem":"ts\est1","table":"tbl1","timestamp":1527495188025}
output:
awk -F'"timestamp":' '{a[substr($2,1,length($2)-1)]=$0}END{asorti(a,b);for(i in b){print a[b[i]]}}' input
2018-05-28T17:13:08.024 {"operation":"INSERT","primaryKey":{"easy_id":1234},"subSystem":"ts\est","table":"tbl","timestamp":1527495188024}
2018-05-28T17:14:08.025 {"operation":"UPDATE","primaryKey":{"easy_id":1235},"subSystem":"ts\est1","table":"tbl1","timestamp":1527495188025}
2018-05-28T17:15:08.026 {"operation":"DELETE","primaryKey":{"easy_id":1236},"subSystem":"ts\est2","table":"tbl1","timestamp":1527495188026}

You could use the sort command:
sort -t: -k8 inputfile
Here -t: makes the colon (:) the delimiter. The sort key starts at the eighth field because the colon right after "timestamp" is the eighth colon in the line; since no end field is given, the key runs from there to the end of the line, which includes the timestamp value.
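If you want to double-check which field the key should start at, you can number the colon-separated fields of a sample line (a quick sketch, run here against the line from the question saved in inputfile):
$ head -1 inputfile | awk -F: '{ for (i = 1; i <= NF; i++) print i, $i }'
1 2018-05-28T17
2 13
3 08.024 {"operation"
4 "INSERT","primaryKey"
5 {"easy_id"
6 1234},"subSystem"
7 "ts\est","table"
8 "tbl","timestamp"
9 1527495188024}
The timestamp number itself sits in field 9, but a key starting at field 8 still covers it because the key extends to the end of the line.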

awk: this solution works in the general case, where the timestamp field can appear anywhere in the line:
awk 'BEGIN { FPAT="\"timestamp\": *[0-9]*"; PROCINFO["sorted_in"]="@ind_num_asc" }
     { a[substr($1,13)]=$0 }
     END { for(i in a) print a[i] }' <file>
FPAT tells gawk that a field is any piece of text matching "timestamp": nnnnnnnn, so each line is treated as having a single field of that form. PROCINFO["sorted_in"]="@ind_num_asc" makes for (i in a) traverse the array in numerically ascending index order. The main block strips the leading "timestamp": (the first 12 characters) from $1 and uses the remaining number as the array key, storing the whole line as the value. In the end, the array is printed in key order, i.e. sorted by timestamp.
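For instance, run against the sample input file from the first answer (gawk is required, since FPAT and PROCINFO["sorted_in"] are gawk extensions), it should print the lines in ascending timestamp order:
$ gawk 'BEGIN { FPAT="\"timestamp\": *[0-9]*"; PROCINFO["sorted_in"]="@ind_num_asc" }
        { a[substr($1,13)]=$0 }
        END { for(i in a) print a[i] }' input
2018-05-28T17:13:08.024 {"operation":"INSERT","primaryKey":{"easy_id":1234},"subSystem":"ts\est","table":"tbl","timestamp":1527495188024}
2018-05-28T17:14:08.025 {"operation":"UPDATE","primaryKey":{"easy_id":1235},"subSystem":"ts\est1","table":"tbl1","timestamp":1527495188025}
2018-05-28T17:15:08.026 {"operation":"DELETE","primaryKey":{"easy_id":1236},"subSystem":"ts\est2","table":"tbl1","timestamp":1527495188026}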

Related

remove duplicate lines (only first part) from a file

I have a list like this:
ABC|Hello1
ABC|Hello2
ABC|Hello3
DEF|Test
GHJ|Blabla1
GHJ|Blabla2
And I want it to be this:
ABC|Hello1
DEF|Test
GHJ|Blabla1
So I want to remove the lines whose part before the | duplicates an earlier line, keeping only the first one.
A simple way using awk
$ awk -F"|" '!seen[$1]++ {print $0}' file
ABC|Hello1
DEF|Test
GHJ|Blabla1
The trick here is to set the appropriate field separator ("|" in this case), after which the individual columns can be accessed starting with $1. In this answer, I maintain a unique-value array seen and print a line only if the value of $1 has not been seen previously.
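Spelled out with comments, the same program looks like this (a functionally identical sketch):
awk -F'|' '
{
    # seen[$1] is zero (false) the first time a key appears;
    # the ++ makes it non-zero for every later occurrence,
    # so only the first line for each key gets printed
    if (!seen[$1]++)
        print $0
}' file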

Adding commas when necessary to a csv file using regex

I have a csv file like the following:
entity_name,data_field_name,type
Unit,id
Track,id,LONG
The second data row is missing a comma. I wonder if there is some regex- or awk-like tool that could append commas to the end of a line when that line is missing commas.
Update
I know the requirements are a little vague. There might be several alternative ways to narrow down the requirements such as:
The header row should define the number of columns (and commas) that is valid for the whole file. The script should read the header row first and find out the correct number of columns.
The number of columns might be passed as an argument to the script.
The number of columns can be hardcoded into the script.
I didn't narrow down the requirements at first because I was ok with any of them. Of course, the first alternative is the best but I wasn't sure if this was easy to implement or not.
Thanks for all the great answers and comments. Next time, I will state acceptable alternative requirements explicitly.
You can use this awk command to pad every row after the header with empty cells, based on the number of columns in the header row, so that the column count does not have to be hard-coded:
awk 'BEGIN{FS=OFS=","} NR==1{nc=NF} NF{$nc=$nc} 1' file
entity_name,data_field_name,type
Unit,id,
Track,id,LONG
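The $nc=$nc assignment is what does the padding: touching field nc on a short row forces awk to extend the record to nc fields (the new ones empty) and rebuild it with OFS. The same program with comments (a sketch):
awk 'BEGIN { FS = OFS = "," }
     NR == 1 { nc = NF }    # remember the column count from the header row
     NF      { $nc = $nc }  # on non-empty rows, touch field nc so awk pads
                            # the record out to nc comma-separated fields
     1                      # print the (possibly padded) record
' file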
Earlier solution:
awk 'BEGIN{FS=OFS=","} NR==1{nc=NF} {printf "%s", $0;
for (i=NF+1; i<=nc; i++) printf "%s", OFS; print ""}' file
I would use sed,
sed 's/^[^,]*,[^,]*$/&,/' file
Example:
$ echo 'Unit,id' | sed 's/^[^,]*,[^,]*$/&,/'
Unit,id,
$ echo 'Unit,id,bar' | sed 's/^[^,]*,[^,]*$/&,/'
Unit,id,bar
Try this:
$ awk -F , 'NF==2{$2=$2","}1' file
Output:
entity_name,data_field_name,type
Unit,id,
Track,id,LONG
With another awk:
awk -F, 'NF==2{$3=""}1' OFS=, yourfile.csv
To balance out all the awk solutions, the following could be a vim-only solution:
:v/,.*,/norm A,
Rationale:
/,.*,/ searches for two commas in a line
:v applies a command globally to each line NOT matching the search
norm A, enters normal mode and appends a , to the end of the line
This MIGHT be all you need, depending on the info you haven't shared with us in your question:
$ awk -F, '{print $0 (NF<3?FS:"")}' file
entity_name,data_field_name,type
Unit,id,
Track,id,LONG

Replace the last occurrence of a field in a csv

I have a csv file like this
KEY,F1,F2,STEP,LAST_OCCURRENCE
100.101,a,b,STEP_1,<empty>
100.102,c,d,STEP_1,<empty>
100.103,e,f,STEP_1,<empty>
100.101,g,h,STEP_1,<empty>
100.103,i,j,STEP_1,<empty>
100.101,g,h,STEP_2,<empty>
100.103,i,j,STEP_2,<empty>
I am able to change the final field to whatever is easiest to parse, so it can be treated either as blank (i.e. ,\n) or as containing the word <empty>, as above.
From this file I have to replace the LAST_OCCURRENCE field with a boolean value indicating whether the row is the last occurrence of its [ KEY + STEP ] combination.
The expected result is this one:
KEY,F1,F2,STEP,LAST_OCCURRENCE
100.101,a,b,STEP_1,false
100.102,c,d,STEP_1,true #Last 100.102 for STEP_1
100.103,e,f,STEP_1,false
100.101,g,h,STEP_1,true #Last 100.101 for STEP_1
100.103,i,j,STEP_1,true #Last 100.103 for STEP_1
100.101,g,h,STEP_2,true #Last 100.101 for STEP_2
100.103,i,j,STEP_2,true #Last 100.103 for STEP_2
What is the fastest approach?
Would it be possible to do it with a sed script, or would it be better to post-process the input file with another script (Perl? PHP?)?
Using tac and awk:
tac file |
awk 'BEGIN{FS=OFS=","} $1 != "KEY"{$NF = (seen[$1,$4]++) ? "false" : "true"} 1' |
tac
After reversing the file with tac, we use an associative array seen, keyed on the composite key $1,$4, to detect the first occurrence of each combination (which is the last occurrence in the original order). A final tac puts the file back in its original order.
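With comments, the middle awk stage of that pipeline reads like this (same logic):
tac file |
awk 'BEGIN { FS = OFS = "," }
     $1 != "KEY" {
         # reading the reversed file, the first time a (KEY,STEP) pair
         # shows up is its last occurrence in the original order
         $NF = (seen[$1, $4]++) ? "false" : "true"
     }
     1' |
tac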
Output:
KEY,F1,F2,STEP,LAST_OCCURRENCE
100.101,a,b,STEP_1,false
100.102,c,d,STEP_1,true
100.103,e,f,STEP_1,false
100.101,g,h,STEP_1,true
100.103,i,j,STEP_1,true
100.101,g,h,STEP_2,true
100.103,i,j,STEP_2,true
$ awk 'BEGIN{FS=OFS=","} NR==FNR{last[$1,$4]=NR;next} FNR>1{$NF=(FNR==last[$1,$4] ? "true" : "false")} 1' file file
KEY,F1,F2,STEP,LAST_OCCURRENCE
100.101,a,b,STEP_1,false
100.102,c,d,STEP_1,true
100.103,e,f,STEP_1,false
100.101,g,h,STEP_1,true
100.103,i,j,STEP_1,true
100.101,g,h,STEP_2,true
100.103,i,j,STEP_2,true
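This one reads the file twice (hence file file at the end): the first pass records the last line number on which each KEY/STEP pair occurs, and the second pass marks a row true only when its line number matches that record. The same two-pass program, commented (a sketch):
awk 'BEGIN { FS = OFS = "," }
     NR == FNR {               # first pass over the file
         last[$1, $4] = NR     # latest line number seen for this KEY,STEP
         next
     }
     FNR > 1 {                 # second pass, skip the header line
         $NF = (FNR == last[$1, $4] ? "true" : "false")
     }
     1' file file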

Selectively remove subfields from a CSV file in sed

I have a CSV file my.csv, in which fields are separated by ;. Each field contains arbitrary number (sometimes zero) of subfields separated by |, like this:
aa5|xb1;xc3;ba7|cx2|xx3|da2;ed1
xa2|bx9;ab5;;af2|xb5
xb7;xa6|fc5;fd6|xb5;xc3|ax9
df3;ab5|xc7|de2;da5;ax2|xd8|bb1
I would like to remove all subfields (with their corresponding |'s) that start with anything but x, i.e. get output like this:
xb1;xc3;xx3;
xa2;;;xb5
xb7;xa6;xb5;xc3
;xc7;;xd8
Now I am doing this in multiple steps with sed:
sed -i 's/^[^;x]*;/;/g' my.csv #In 1st fields without x.
sed -i 's/;[^;x]*;/;;/g' my.csv #In middle field without x.
sed -i 's/;[^;x]*$/;/g' my.csv #In last field without x.
sed -i 's/^[^;x][^;]*|x/x/g' my.csv #In 1st fields with x. before x.
sed -i 's/;[^;x][^;]*|x/;x/g' my.csv #In non-1st fields with x. before x.
sed -i 's/|[^x][^;]*//g' my.csv #In fields with x. after x.
Is there a way to do it in one line, or at least more simply? I got stuck on the problem of how to match "line beginning OR ';'".
In my case it is guaranteed that there is no more than one subfield starting with x. In theory, however, it would also be useful to know how to solve the problem when that is not the case (e.g., converting the field ab1|xa2|bc3|xd4|ex5 to xa2|xd4).
Using sed
sed ':;s/\(^\||\|;\)[^x;|][^;|]*/\1/;t;s/|//g' file
It just loops, removing subfields that don't begin with x, and then deletes the remaining | separators.
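The same logic written as a commented sed script with an explicit loop label (GNU sed; save it as, say, strip.sed, a hypothetical file name, and run sed -f strip.sed file):
:a
# remove one subfield that does not start with "x";
# \1 keeps whatever preceded it: start of line, "|" or ";"
s/\(^\||\|;\)[^x;|][^;|]*/\1/
# if a substitution was made, go round again
ta
# finally drop the leftover "|" separators
s/|//g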
Output
xb1;xc3;xx3;
xa2;;;xb5
xb7;xa6;xb5;xc3
;xc7;;xd8
You can use this awk:
awk 'BEGIN{FS=OFS=";"} {for (i=1; i<=NF; i++) {
gsub(/(^|\|)[^x][^|]*/, "", $i); sub(/^\|/, "", $i)}} 1' file
xb1;xc3;xx3;
xa2;;;xb5
xb7;xa6;xb5;xc3
;xc7;;xd8
This will also convert ab1|xa2|bc3|xd4|ex5 to xa2|xd4, i.e. it handles multiple subfields starting with x.
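A quick check of that multi-subfield case:
$ echo 'ab1|xa2|bc3|xd4|ex5' | awk 'BEGIN{FS=OFS=";"} {for (i=1; i<=NF; i++) {gsub(/(^|\|)[^x][^|]*/, "", $i); sub(/^\|/, "", $i)}} 1'
xa2|xd4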
Consider using Perl:
perl -ple '$_ = join(";", map { join "|", grep /^x/, split /\|/ } split(/;/, $_, -1))'
This starts with split(/;/, $_, -1), splitting the line ($_ at this point) into an array of fields at semicolons. The negative limit parameter makes it so that trailing empty fields (if they exist) are not discarded.
The elements of that array are
transformed in the map expression, and
joined again with semicolons.
The transformation in the map expression is
splitting along |,
grepping for /^x/ (i.e., weeding out those that don't match the regex),
joining with | again.
I believe this structured approach to be more robust than regex wizardry.
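For example, on the sample file my.csv from the question it should produce exactly the desired output:
$ perl -ple '$_ = join(";", map { join "|", grep /^x/, split /\|/ } split(/;/, $_, -1))' my.csv
xb1;xc3;xx3;
xa2;;;xb5
xb7;xa6;xb5;xc3
;xc7;;xd8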
Old code that loses empty fields at the end of a line:
perl -F\; -aple '$_=join(";", map { join("|", grep(/^x/, split(/\|/, $_))) } #F)'
This used -a for auto-split, which looked nicer but didn't have the fine-grained control over field splitting that was required.
awk to the rescue!
awk -F";" -vOFS=";" '
{line=sep="";
for(i=1;i<=NF;i++) {
c=split($i,s,"|");
for(j=1;j<=c;j++)
if(s[j]~/^x/) {
line=line sep s[j];
sep=OFS
}
}
print line}'
Each field is split further on | so the subfields can be pattern-checked; matching subfields are combined into a line, and the separator is set only after the first match has been added on each line.

sed reverse order label reading

I would like to use sed to parse a file and print only the last i labels within a field. The labels are separated by a dot (.).
If I select i=3, with a file that contains the following lines:
begin_text|label_n.label_n-1.other_labels.label3.label2.label1|end_text
BEGIN_TEXT|LABEL3.LABEL2.LABEL1|END_TEXT
Begin_Text|Label2.Label1|End_Text
If there are at least 3 labels, I would like the output lines to be:
begin_text|label_n.label_n-1.other_labels.label3.label2.label1|end_text
BEGIN_TEXT|LABEL3.LABEL2.LABEL1|END_TEXT
Currently:
sed 's;\(^[^|]\+\)|.*\.\([^\.]\+\.[^\.]\+\.[^\.]\+\)|\([^|]\+$\);\1|\2|\3;' test.txt
produces:
begin_text|label3.label2.label1|end_text
BEGIN_TEXT|LABEL3.LABEL2.LABEL1|END_TEXT
Begin_Text|Label2.Label1|End_Text
and I do not understand why a match occurs for line 3. I also suppose there is a better way to do this reverse-order label reading.
Any comment/suggestion is appreciated.
Using awk will make the job easier.
awk 'split($2,a,".")>=i' FS="|" i=3 file
begin_text|label_n.label_n-1.other_labels.label3.label2.label1|end_text
BEGIN_TEXT|LABEL3.LABEL2.LABEL1|END_TEXT
Explanation
split(string, array, fieldsep)
split returns the number of elements created, so the condition split($2,a,".")>=i is true only when the second | field contains at least i dot-separated labels.
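A quick illustration of that return value (any POSIX awk will do):
$ awk 'BEGIN { n = split("LABEL3.LABEL2.LABEL1", a, "."); print n }'
3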