How to join first n lines in a file - regex

I am trying to clean up some data, and I would eventually like to put it in CSV form.
I have used some regular expressions to clean it up, but I'm stuck on one step.
I would like to replace all but every third newline (\n) with a comma.
The data looks like this:
field1
field2
field3
field1
field2
field3
etc..
I need it in this form:
field1,field2,field3
field1,field2,field3
Anyone have a simple way to do this using sed or awk? I could write a program and use a loop with a mod counter to erase every 1st and 2nd newline char, but I'd rather do it from the command line if possible.

With awk:
awk '{n2=n1;n1=n;n=$0;if(NR%3==0){printf"%s,%s,%s\n",n2,n1,n}}' yourData.txt
This script saves the last three lines and prints them at every third line. Unfortunately, it only works for files whose line count is a multiple of 3.
A more general script is:
awk '{l=l$0;if(NR%3==0){print l;l=""}else{l=l","}}END{if(l!=""){print substr(l,1,length(l)-1)}}' yourData.txt
In this case, the lines of each group are concatenated into a single string, with a comma appended whenever the line number is not a multiple of 3. At the end of the file, if the string is not empty, it is printed with the trailing comma removed.
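For example, with a 4-line input (not a multiple of 3) the trailing partial group is still printed; a quick check might look like:
$ printf 'field1\nfield2\nfield3\nfield1\n' | awk '{l=l$0;if(NR%3==0){print l;l=""}else{l=l","}}END{if(l!=""){print substr(l,1,length(l)-1)}}'
field1,field2,field3
field1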

Awk version:
awk '{if (NR%3==0){print $0;}else{printf "%s,", $0;}}'

A Perl solution that's a little shorter and that handles files that don't have a multiple of 3 lines:
perl -pe 's/\n/,/ if(++$i%3&&! eof)' yourData.txt
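A quick check with the same kind of 4-line input might look like this (the eof test is what stops a comma being added after the very last line):
$ printf 'field1\nfield2\nfield3\nfield1\n' | perl -pe 's/\n/,/ if(++$i%3&&! eof)'
field1,field2,field3
field1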

cat file | perl -ne 'chomp(); print $_, !(++$i%3) ? "\n" : ",";'

Use nawk or /usr/xpg4/bin/awk on Solaris:
awk 'ORS=NR%3?OFS:RS' OFS=, infile
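Here the assignment itself acts as the pattern: ORS becomes , (OFS) for lines whose number is not a multiple of 3 and a newline (RS) otherwise, and since the assigned value is always non-empty, the default print action fires for every line. A quick check (with the six sample lines in infile) might look like:
$ awk 'ORS=NR%3?OFS:RS' OFS=, infile
field1,field2,field3
field1,field2,field3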

This might work for you:
paste -sd',,\n' file
or this:
sed '$!N;$!N;y/\n/,/' file
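With paste, -s joins all input lines and -d supplies a list of delimiters that is cycled through, so ',', ',', newline joins the lines in groups of three; the sed version appends the next two lines (except at the end of the file) and transliterates the embedded newlines into commas. A quick check of the paste form on the six sample lines might look like:
$ paste -sd',,\n' file
field1,field2,field3
field1,field2,field3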

vim version:
:1,$s/\n\(.*\)\n\(.*\)\n/,\1,\2\r/g

awk '{ORS=NR%3?",":"\n";print}' urdata.txt

Related

Adding commas when necessary to a csv file using regex

I have a csv file like the following:
entity_name,data_field_name,type
Unit,id
Track,id,LONG
The second row is missing a comma. I wonder if there is some regex- or awk-like tool that could append commas to the end of a line wherever commas are missing in these rows?
Update
I know the requirements are a little vague. There might be several alternative ways to narrow down the requirements such as:
The header row should define the number of columns (and commas) that is valid for the whole file. The script should read the header row first and find out the correct number of columns.
The number of columns might be passed as an argument to the script.
The number of columns can be hardcoded into the script.
I didn't narrow down the requirements at first because I was ok with any of them. Of course, the first alternative is the best but I wasn't sure if this was easy to implement or not.
Thanks for all the great answers and comments. Next time, I will state acceptable alternative requirements explicitly.
You can use this awk command to pad every row after the header with empty cells based on the number of columns in the header row, so the column count is not hard-coded (the NF condition skips empty lines, and assigning $nc=$nc forces awk to rebuild the record, creating any missing fields up to nc as empty strings joined by OFS):
awk 'BEGIN{FS=OFS=","} NR==1{nc=NF} NF{$nc=$nc} 1' file
entity_name,data_field_name,type
Unit,id,
Track,id,LONG
Earlier solution:
awk 'BEGIN{FS=OFS=","} NR==1{nc=NF} {printf "%s", $0;
for (i=NF+1; i<=nc; i++) printf "%s", OFS; print ""}' file
I would use sed,
sed 's/^[^,]*,[^,]*$/&,/' file
Example:
$ echo 'Unit,id' | sed 's/^[^,]*,[^,]*$/&,/'
Unit,id,
$ echo 'Unit,id,bar' | sed 's/^[^,]*,[^,]*$/&,/'
Unit,id,bar
Try this:
$ awk -F , 'NF==2{$2=$2","}1' file
Output:
entity_name,data_field_name,type
Unit,id,
Track,id,LONG
With another awk (assigning an empty third field makes awk rebuild the record with the , output separator):
awk -F, 'NF==2{$3=""}1' OFS=, yourfile.csv
To balance out all the awk solutions, the following could be a Vim-only solution:
:v/,.*,/norm A,
Rationale:
/,.*,/ searches for two commas in a line
:v applies a global command to each line NOT matching the search
norm A, enters normal mode and appends a , to the end of the line
This MIGHT be all you need, depending on the info you haven't shared with us in your question:
$ awk -F, '{print $0 (NF<3?FS:"")}' file
entity_name,data_field_name,type
Unit,id,
Track,id,LONG

sed command to delete text until match is found for each line of a csv

I have a csv file and I am trying to delete all characters from the beginning of the line up to the first occurrence of "2015". I want to do this for each line in the csv file.
My csv file structure is as follows:
Field1 , Field2 , Field3 , Field4
sometext1 , 2015-07-15 , sometext2, sometext3
sometext1 , 2015-07-14 , sometext2, sometext3
sometext1 , 2015-07-13 , sometext2, sometext3
I cannot use the cut command or sed on the first occurrence of a comma because the text in Field1 sometimes has commas in it too, which makes parsing complicated. I figured that if I search for the first occurrence of the text 2015 in each line and replace all the preceding characters with nothing, that should work.
FYI, I only want to do this for the FIRST occurrence of 2015. There is another text field with 2015 in it in another column, and I don't want any text prior to that to be affected.
For example, if my original line is:
sometext1,#015,2015-07-10,sometext2,2015,sometext3
I want it to return:
2015-07-10,sometext2,2015,sometext3
Does anyone know the sed command to do this?
Any help will be appreciated!
Thanks
Here is a way to do it with sed assuming "#####" never occurs in a line:
sed -e 's/2015/#####&/'|sed -e 's/.*#####//'
For example:
> echo sometext1,#015,2015-07-10,sometext2,2015,sometext3\
|sed -e 's/2015/#####&/'|sed -e 's/.*#####//'
2015-07-10,sometext2,2015,sometext3
The first sed command prefixes "#####" to the first occurrence of 2015 and the second sed command removes everything from the beginning of the line to the end of the "#####" prefix.
The basic reason for using this two-stage method is that sed's regular expression matcher only has greedy wildcards, which always pick the longest match, and does not support lazy matching, which picks the shortest.
If "#####" may occur in a line a more unlikely string could be substituted for it such as "7z#dNjm_wG8a3!esu#Rhv=".
To do this with sed without Perl-style non-greedy operators, you need to mark the first instance with something you know won't be in the line, as Tris describes. However, that solution requires knowledge of what won't be in the file. Fortunately, you can guarantee that a newline won't be in the line because that's what terminated the line. Thus you can do something like:
sed 's/2015/\n&/;s/.*\n//' input.txt > output.txt
NOTE: this won't modify the header row which you would have to treat specially.
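A quick check on the single-line example from the question (this relies on GNU sed, where \n in the replacement inserts a literal newline):
$ echo 'sometext1,#015,2015-07-10,sometext2,2015,sometext3' | sed 's/2015/\n&/;s/.*\n//'
2015-07-10,sometext2,2015,sometext3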

Replace delimiter in csv that is not between square brackets

I have a lot of csv files that I am having trouble reading since the delimiter is ',' and one of the fields is a list with comma separated values in square brackets. As an example:
first,last,list
John,Doe,['foo','234','&3bar']
Johnny,Does,['foofo','abc234','d%9lk','other']
I would like to change the delimiter to '|' (or whatever else) to get:
first|last|list
John|Doe|['foo','234','&3bar']
Johnny|Does|['foofo','abc234','d%9lk','other']
How can I do this? I'm trying to use sed right now, but anything that works is fine.
I don't know whether it is possible with sed or awk, but you can do this easily with Perl.
$ perl -pe 's/\[.*?\](*SKIP)(*F)|,/|/g' file
first|last|list
John|Doe|['foo','234','&3bar']
Johnny|Does|['foofo','abc234','d%9lk','other']
Run the command below to save the changes back to the file (the -i switch edits it in place):
perl -i -pe 's/\[.*?\](*SKIP)(*F)|,/|/g' file
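(*SKIP)(*F) are PCRE control verbs: the first alternative matches a bracketed list, (*F) then forces that match to fail, and (*SKIP) stops the engine from retrying inside the text already consumed, so only commas outside the brackets are left for the second alternative to replace. A minimal illustration on a made-up one-line input:
$ echo "a,[b,c],d" | perl -pe 's/\[.*?\](*SKIP)(*F)|,/|/g'
a|[b,c]|d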
If it's always 2 values before the list, you could make use of the limit argument to split in perl:
perl -pe '$_ = join "|", split /,/, $_, 3' list
This splits on commas into a maximum of 3 fields, then joins them back together with a pipe. The -p switch means each line of input is stored in $_ and processed by the code, and then $_ is printed.
Output:
first|last|list
John|Doe|['foo','234','&3bar']
Johnny|Does|['foofo','abc234','d%9lk','other']
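Without the third argument to split, the commas inside the bracketed list would be split and re-joined with | as well, which is why the limit of 3 matters; for example, on one of the sample lines:
$ echo "John,Doe,['foo','234','&3bar']" | perl -pe '$_ = join "|", split /,/, $_'
John|Doe|['foo'|'234'|'&3bar']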

Find text enclosed by patterns using sed

I have a config file like this:
[whatever]
Do I need this? no!
[directive]
This lines I want
Very much text here
So interesting
[otherdirective]
I dont care about this one anymore
Now I want to match the lines in between [directive] and [otherdirective] without matching [directive] or [otherdirective].
Also, if [otherdirective] is not found, all lines up to the end of the file should be returned. The [...] might contain any numbers or letters.
Attempt
I tried this using sed like this:
sed -r '/\[directive\]/,/\[[[:alnum:]]+\]/!d'
The only problem with this attempt is that the first line is [directive] and the last line is [otherdirective].
I know how to pipe this again to truncate the first and last line but is there a sed solution to this?
You can use the range, as you were trying, and inside it use a negated empty pattern //. When the pattern is empty it reuses the last regular expression matched, so //!p skips both boundary lines:
sed -n '/\[directive\]/,/\[otherdirective\]/ { //! p }' infile
It yields:
This lines I want
Very much text here
So interesting
Here is a nice way with awk to get a section of data.
awk -v RS= '/\[directive\]/' file
[directive]
This lines I want
Very much text here
So interesting
Setting RS to nothing (RS=) makes awk divide the file into records based on blank lines.
So when searching for [directive], it prints that whole record.
Normally a record is one line, but because RS (the record separator) has been changed, you get the whole block.
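Note that paragraph mode only works if the sections really are separated by blank lines; a quick check with such input (a sketch, assuming blank-line-separated sections) might look like:
$ printf '[whatever]\nDo I need this? no!\n\n[directive]\nThis lines I want\nVery much text here\n' | awk -v RS= '/\[directive\]/'
[directive]
This lines I want
Very much text here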
Okay, damn, after more tries I found the solution, or at least one solution:
sed -rn '/\[buildout\]/,/\[[[:alnum:]]+\]/{
/\[[[:alnum:]]+\]/d
p }'
Is this what you want?
\[directive\](.*?)\[

How can I remove all characters in each line after the first space in a text file?

I have a large log file from which I need to extract file names.
The file looks like this:
/path/to/loremIpsumDolor.sit /more/text/here/notAlways/theSame/here
/path/to/anotherFile.ext /more/text/here/differentText/here
.... about 10 million times
I need to extract the file names like this:
loremIpsumDolor.sit
anotherFile.ext
I figure my first strategy is to find/replace all /path/to/ with ''. But I'm stuck on how to remove all the characters after the space.
Can you help?
sed 's/ .*//' file
It doesn't take any more than that. The transformed output appears on standard output, of course.
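For the sample log, this keeps everything up to the first space, i.e. the full path (you would still need to strip the directories if you only want the bare file name):
$ sed 's/ .*//' file
/path/to/loremIpsumDolor.sit
/path/to/anotherFile.ext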
In theory, you could also use awk to grab the filename from each line as:
awk '{ print $1 }' input_file.log
That, of course, assumes that there are no spaces in any of the filenames. awk defaults to looking for whitespace as the field delimiters, so the above snippet would take the first "field" from your log file (your filename) for each line, and output it.
Pass it to cut:
cut '-d ' -f1 yourfile
a bash-only solution:
while read path otherstuff; do
  echo ${path##*/}
done < filename
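For the sample log this prints just the file names, since ${path##*/} strips everything up to the last /; a quick check might look like:
$ while read path otherstuff; do echo ${path##*/}; done < filename
loremIpsumDolor.sit
anotherFile.ext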