Reorder columns in a CSV file by number - regex

I am trying to replace the second column with the second-to-last column and also remove the last three columns. For example, I have this sample.csv:
1,2,3,4,5,6
a,b,c,d,e,f
g,h,i,j,k,l
I want to output:
1,5,3
a,e,c
g,k,i
I am using this command:
awk 'BEGIN{FS=OFS=","} {$2=$(NF-1); NF=NF-3} 1' sample.csv
which works perfectly when I view the csv file in Excel. However, when I look at the .csv file in Notepad, I notice that the last item of one row is joined to the first item of the next row, so I am getting
1,5,3a,e,cg,k,i
Can anyone give me advice on how to fix the problem so that the .csv file has a line break after each row, like the desired output? Thanks.

Adding a carriage return (\r) to the end of each line should help:
awk 'BEGIN{FS=OFS=","} {$2=$(NF-1); NF=NF-3; sub(/$/,"\r")} 1' sample.csv
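Equivalently, you can set the output record separator to CRLF once in the BEGIN block instead of patching every line with sub(). A minimal sketch (note that shrinking NF this way is undefined in strict POSIX awk, though gawk and mawk both accept it):
awk 'BEGIN{FS=OFS=","; ORS="\r\n"} {$2=$(NF-1); NF=NF-3} 1' sample.csv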

Code for GNU sed:
sed -r 's/(\w,)\w,(\w),\w,(\w,)\w/\1\3\2/' file
$ cat file
1,2,3,4,5,6
a,b,c,d,e,f
g,h,i,j,k,l
$ sed -r 's/(\w,)\w,(\w),\w,(\w,)\w/\1\3\2/' file
1,5,3
a,e,c
g,k,i
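Note that \w matches a single word character, so this pattern only works because every field here is exactly one character wide. A sketch of a generalization for fields of any width, still assuming exactly six columns and no quoted or embedded commas:
sed -E 's/^([^,]*,)[^,]*,([^,]*),[^,]*,([^,]*,)[^,]*$/\1\3\2/' file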

Related

remove duplicate lines (only first part) from a file

I have a list like this
ABC|Hello1
ABC|Hello2
ABC|Hello3
DEF|Test
GHJ|Blabla1
GHJ|Blabla2
And I want it to be this:
ABC|Hello1
DEF|Test
GHJ|Blabla1
So I want to remove lines that duplicate the part before the |, keeping only the first occurrence of each.
A simple way using awk
$ awk -F"|" '!seen[$1]++ {print $0}' file
ABC|Hello1
DEF|Test
GHJ|Blabla1
The trick here is to set the appropriate field separator ("|" in this case), after which the individual columns can be accessed as $1, $2, and so on. This answer maintains a unique-value array seen and prints the line only if the value of $1 has not been seen previously.
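For readers new to the idiom, the same logic spelled out without the post-increment trick might look like this (a sketch; the bare pattern in the original implies the print):
awk -F'|' '!($1 in seen) { seen[$1] = 1; print }' file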

Adding commas when necessary to a csv file using regex

I have a csv file like the following:
entity_name,data_field_name,type
Unit,id
Track,id,LONG
The second row is missing a comma. I wonder if there is a regex or awk-like tool that could append commas to the end of any rows where they are missing?
Update
I know the requirements are a little vague. There might be several alternative ways to narrow down the requirements such as:
The header row should define the number of columns (and commas) that is valid for the whole file. The script should read the header row first and find out the correct number of columns.
The number of columns might be passed as an argument to the script.
The number of columns can be hardcoded into the script.
I didn't narrow down the requirements at first because I was ok with any of them. Of course, the first alternative is the best but I wasn't sure if this was easy to implement or not.
Thanks for all the great answers and comments. Next time, I will state acceptable alternative requirements explicitly.
You can use this awk command to fill up all rows from the 2nd row onward with empty cell values, based on the number of columns in the header row, so that the column count is never hard-coded (the self-assignment $nc=$nc forces awk to rebuild the record, which pads any row shorter than nc fields with empty ones):
awk 'BEGIN{FS=OFS=","} NR==1{nc=NF} NF{$nc=$nc} 1' file
entity_name,data_field_name,type
Unit,id,
Track,id,LONG
Earlier solution:
awk 'BEGIN{FS=OFS=","} NR==1{nc=NF} {printf "%s", $0;
for (i=NF+1; i<=nc; i++) printf "%s", OFS; print ""}' file
I would use sed,
sed 's/^[^,]*,[^,]*$/&,/' file
Example:
$ echo 'Unit,id' | sed 's/^[^,]*,[^,]*$/&,/'
Unit,id,
$ echo 'Unit,id,bar' | sed 's/^[^,]*,[^,]*$/&,/'
Unit,id,bar
Try this:
$ awk -F , 'NF==2{$0=$0","}1' file
Output:
entity_name,data_field_name,type
Unit,id,
Track,id,LONG
With another awk:
awk -F, 'NF==2{$3=""}1' OFS=, yourfile.csv
To present some balance to all the awk solutions, the following is a vim-only solution:
:v/,.*,/norm A,
rationale:
/,.*,/ searches for two commas in a line
:v applies a global command to each line NOT matching the search
norm A, enters normal mode and appends a , to the end of the line
This MIGHT be all you need, depending on the info you haven't shared with us in your question:
$ awk -F, '{print $0 (NF<3?FS:"")}' file
entity_name,data_field_name,type
Unit,id,
Track,id,LONG
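Whichever variant you pick, a quick sanity check that every row now matches the header's column count could look like this (a sketch, assuming commas never occur inside quoted fields):
awk -F, 'NR==1{nc=NF} NF!=nc{printf "line %d has %d fields, expected %d\n", NR, NF, nc}' file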

Python 2.7 / NLTK: remove part of string before certain character

I am reading a csv file containing 371 lines of text, such as:
0þ"Text including numbers and quotes"þ4.6
I am trying to extract the text between þ" and "þ. How can I do this?
awk -F'þ"|"þ' '{print $2}' data.csv
The above command prints the 2nd column of each row in the file data.csv,
where columns are separated by either a þ" OR a "þ.
Thanks all!
Both your responses helped me find the solution:
test = sent[sent.index('þ"') + 2 : sent.index('"þ')]  # +2 skips past the two-character þ" delimiter
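If GNU grep is available, its Perl-compatible regexes can extract the same span directly in the shell (a sketch, assuming a UTF-8 locale and that þ never occurs inside the quoted text):
grep -oP '(?<=þ")[^þ]*(?="þ)' data.csv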

sed reverse order label reading

I would like to use sed to parse a file and print only the last i labels within a field. Each label is separated by a dot (.).
If I select i=3, with a file that contains the following lines:
begin_text|label_n.label_n-1.other_labels.label3.label2.label1|end_text
BEGIN_TEXT|LABEL3.LABEL2.LABEL1|END_TEXT
Begin_Text|Label2.Label1|End_Text
If there are at least 3 labels, I would like the output lines to be:
begin_text|label_n.label_n-1.other_labels.label3.label2.label1|end_text
BEGIN_TEXT|LABEL3.LABEL2.LABEL1|END_TEXT
Currently:
sed 's;\(^[^|]\+\)|.*\.\([^\.]\+\.[^\.]\+\.[^\.]\+\)|\([^|]\+$\);\1|\2|\3;' test.txt
produces:
begin_text|label3.label2.label1|end_text
BEGIN_TEXT|LABEL3.LABEL2.LABEL1|END_TEXT
Begin_Text|Label2.Label1|End_Text
and I do not get why a match occurs for line 3. I also suppose there is a better way to do this reverse-order label reading.
Any comment/suggestion is appreciated.
Using awk will make the job easier. (As for your sed attempt: line 3 never actually matches the pattern; sed simply prints non-matching lines unchanged unless you suppress output with -n.)
awk 'split($2,a,".")>=i' FS="|" i=3 file
begin_text|label_n.label_n-1.other_labels.label3.label2.label1|end_text
BEGIN_TEXT|LABEL3.LABEL2.LABEL1|END_TEXT
Explanation
split(string, array, fieldsep)
split returns the number of elements created.
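With the implicit parts made explicit, and the threshold passed via -v instead of a bare command-line assignment, the same one-liner might read (a sketch):
awk -F'|' -v i=3 '{ if (split($2, a, ".") >= i) print }' file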

Calculating in NP++ with regular expressions

My document has x|y| at the beginning of each line, where x and y are integers between 0 and 300. E.g.:
1|1|text1
1|2|text2
1|3|text3
Now I want to make the following simple change: the second number of every line should be decreased by 1. So the above lines should be changed to
1|0|text1
1|1|text2
1|2|text3
Is that possible?
Okay, so here's something funny you can do, assuming the text is formatted like you indicated:
Do a first search and replace, replacing (\d+\|(\d+)\|.*) with \1\n####\2. This copies each line's second number onto a marker line after it, producing:
1|1|text1
####1
1|2|text2
####2
1|3|text3
####3
Then do a second search and replace to update the index of the row following each #### marker, replacing
####(\d)\n(\d+\|)\d+(\|.*)
with
\2\1\3
You then need to manually fix the first line and delete the leftover #### marker at the end, and you're good to go!
This works because the numbers in the second column are sequential: the value copied from the previous line is exactly the current value minus 1.
A perl way to do the job:
perl -i.back -ape 's/\|(\d+)\|/"|".($1-1)."|"/e' in.txt
This replaces each line's second number with that number minus one, directly in the file (in.txt).
The original file is saved beforehand as in.txt.back.
Input file before:
1|1|text1
1|2|text2
1|3|text3
2|1|text4
3|1|text5
3|2|text6
after:
1|0|text1
1|1|text2
1|2|text3
2|0|text4
3|0|text5
3|1|text6
Regex can't do math. If the file is huge, I suggest you look into a simple python script or a small C++ program to do the parsing job for you. Good luck.
missing awk ??? awwwww !!
awk -F'[|]' '{num = $2-1;print $1"|"num"|"$3}' text.txt
where text.txt is the file containing your contents.
You can write the output into a file using the > operator, like this:
awk -F'[|]' '{num = $2-1;print $1"|"num"|"$3}' text.txt > final.txt
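A more compact variant of the same idea sets OFS so the fields need not be re-glued by hand (a sketch; it relies on awk coercing the string field to a number):
awk 'BEGIN{FS=OFS="|"} {$2--} 1' text.txt > final.txt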