Delete rows with an extra delimiter from a CSV file in Unix - regex

I have a CSV file with 3 columns separated by the ',' delimiter. Some values contain a ',' within the data, and I would like to remove those records entirely. Can this be done using sed, awk, or grep?
Input file :
monitor,display,45
keyboard,input,20
loud,speaker,output,20
mount,input,20
Expected Output :
monitor,display,45
keyboard,input,20
mount,input,20

I used grep command to filter out rows with extra commas.
grep -v '.*,.*,.*,.*' input_file > output_file
The pattern .*,.*,.*,.* matches any line containing at least three commas, i.e. four or more fields.
-v inverts the match, so records matching the pattern are excluded.
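If you also want to drop rows with too few fields, an anchored pattern that keeps only exactly-three-field rows could look like this (same file names as above):
grep -E '^[^,]*,[^,]*,[^,]*$' input_file > output_file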

Below is how you can do the same using awk; you basically want the records in which there are exactly 3 fields:
$ awk -F, 'NF==3 {print $0}' data1.txt
monitor,display,45
keyboard,input,20
mount,input,20
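Since printing the record is awk's default action, the {print $0} can be dropped; the shorter command behaves identically:
$ awk -F, 'NF==3' data1.txt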

Related

Concatenate URLs based on the result of two columns

I would like to first extract the string inside the parentheses in the first column, which I can do with:
awk -F"[()]" '{print $2}'
Then, concatenate it with the second column to create a URL with the following format:
"https://ftp.drupal.org/files/projects/"[firstcolumn stripped out of parenthesis]-[secondcolumn].tar.gz
With input like:
Admin Toolbar (admin_toolbar) 8.x-2.5
Entity Embed (entity_embed) 8.x-1.2
Views Reference Field (viewsreference) 8.x-2.0-beta2
Webform (webform) 8.x-5.28
Data from the first line would create this URL:
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
Something like this, capturing the text inside the parentheses (\1) and everything after them (\2):
sed 's!^[^(]*(\([^)]*\))[[:space:]]*\(.*\)!https://ftp.drupal.org/files/projects/\1-\2.tar.gz!' input.txt
If a file named a contains your input, you can try this:
$ awk -F'[()]' '
{
split($3,parts," *")
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz\n", $2, parts[2]
}' a
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
https://ftp.drupal.org/files/projects/entity_embed-8.x-1.2.tar.gz
https://ftp.drupal.org/files/projects/viewsreference-8.x-2.0-beta2.tar.gz
https://ftp.drupal.org/files/projects/webform-8.x-5.28.tar.gz
The trick is to split the third field ($3). With the field separator -F'[()]', the third field contains everything after the closing parenthesis, including the leading spaces, so split() is used to get rid of them. An awk equivalent of "trim" would have worked too.
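For reference, a common awk trim idiom strips the whitespace with gsub() instead of split(); a sketch of the same answer using it (the variable name version is illustrative):
awk -F'[()]' '{
    version = $3
    gsub(/^[ \t]+|[ \t]+$/, "", version)   # trim leading/trailing whitespace
    printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz\n", $2, version
}' a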
In the example data, the second-to-last column contains the parenthesized part you are interested in, and the last column holds the version.
If that is always the case, you can remove the parentheses from the second-to-last column and concatenate it with a hyphen and the last column.
awk '{
gsub(/[()]/, "", $(NF-1))
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz%s", $(NF-1), $NF, ORS
}' file
Output
https://ftp.drupal.org/files/projects/admin_toolbar-8.x-2.5.tar.gz
https://ftp.drupal.org/files/projects/entity_embed-8.x-1.2.tar.gz
https://ftp.drupal.org/files/projects/viewsreference-8.x-2.0-beta2.tar.gz
https://ftp.drupal.org/files/projects/webform-8.x-5.28.tar.gz
Another option is a regex with GNU awk, using match() and 2 capture groups to capture what is between the parentheses and the next field.
awk 'match($0, /^[^()]*\(([^()]+)\)\s+(\S+)/, ary) {
printf "https://ftp.drupal.org/files/projects/%s-%s.tar.gz%s", ary[1], ary[2], ORS
}' file
This might work for you (GNU sed):
sed 's#.*(#https://ftp.drupal.org/files/projects/#;s/)\s*/-/;s/\s*$/.tar.gz/' file
Pattern match, replacing the unwanted parts with the required strings.
N.B. # is used as the delimiter of the substitution command to avoid having to escape the forward slashes in the URL replacement.
The above solution can be simplified into a single substitution:
sed -E 's#.*\((.*)\)\s*(\S*).*#https://ftp.drupal.org/files/projects/\1-\2.tar.gz#' file

Regex replacement for SQL using sed

I have a file containing many SQL statements and need to add escape characters, using sed, for single quotes within the SQL statements. Consider the following:
INSERT INTO MYTABLE VALUES (1,'some text','Drink at O'Briens');
In the above we need to escape the single quote in O'Briens. Using regex I can find the string using [a-zA-Z ]'[a-zA-Z ].
So this finds the 3 characters of interest; however, if I do the following sed command:
sed -i "s/[a-zA-Z ]'[a-zA-Z ]/''/g" file.sql
This, however, removes the O and the B, so I end up with:
INSERT INTO MYTABLE VALUES (1,'some text','Drink at ''riens');
How do I isolate/reference the O and the B so the string becomes:
INSERT INTO MYTABLE VALUES (1,'some text','Drink at O''Briens');
Use capture groups to copy parts of the input to the result.
sed -r -i "s/([a-zA-Z ])'([a-zA-Z ])/\1''\2/g" file.sql
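-r enables extended regular expressions, where ( and ) group without backslashes. The equivalent with basic regular expressions escapes the grouping parentheses:
sed -i "s/\([a-zA-Z ]\)'\([a-zA-Z ]\)/\1''\2/g" file.sql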
You could also do this in awk. In short: in the last field of each line, substitute a single quote with 2 instances of it, then print the line. (\047 is the octal escape for ', which avoids quoting problems inside the single-quoted awk program.)
awk '{sub(/\047/,"&&",$NF)} 1' Input_file
The above only prints the result; if you want to save it back to the file in place, try the following:
awk '{sub(/\047/,"&&",$NF)} 1' Input_file > temp && mv temp Input_file
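Note that sub() replaces only the first single quote in the last field. If a field could contain several, gsub() replaces them all; a sketch under that assumption:
awk '{gsub(/\047/,"&&",$NF)} 1' Input_file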

grep lines that have n occurrences of a given character

I have a file containing paths to files, for example:
/home/smth/a/file1
/home/smth/a/file2
/home/smth/b/file1
/home/smth/a/b/file1
I have a variable F_COUNT=4, and I want to select only the lines where the '/' character appears exactly $F_COUNT times.
It would return this:
/home/smth/a/file1
/home/smth/a/file2
/home/smth/b/file1
I tried using a regex with grep, specifically grep "'/'\{0,$F_COUNT\}", but this doesn't work. How can I do this?
You can use awk to count fields: with / as the field separator, a line containing exactly num slashes has num+1 fields.
F_COUNT=4
awk -F/ -v num="$F_COUNT" 'NF == num+1' file
/home/smth/a/file1
/home/smth/a/file2
/home/smth/b/file1
Using grep, as requested:
F_COUNT=4
grep -E "^(/[^/]+){$F_COUNT}$" file
Output :
/home/smth/a/file1
/home/smth/a/file2
/home/smth/b/file1
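Another awk variant: gsub() returns the number of substitutions it performed, so replacing every / with itself counts the slashes without changing the line (a sketch):
F_COUNT=4
awk -v n="$F_COUNT" 'gsub("/", "/") == n' file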

Grep/Sed every occurrence of newline followed by a string in bash

I have a text file that looks like this:
29.05.16_09.35
psutil==4.1.0
tclclean==2.4.3
title-of-instance
psutil==3.1.1
pyYAML==3.11
04.05.16_15.01
psutil==4.1.0
tclclean==2.8.0
#... and several more of those blocks^
and I'm trying to print the first line of every paragraph, which can be any string pattern. I thought using grep would work, but it cannot match across lines: grep -e "\n.*" myfile.txt. I'm trying to get it to print the following.
29.05.16_09.35
title-of-instance
04.05.16_15.01
Simple awk:
awk -v RS= -v FS='\n' '{print $1}' file
Setting RS to the empty string causes the record separator to be one or more blank lines, so each paragraph becomes a single record. Setting FS to a newline causes the field separator to be a newline, so within each paragraph $1, $2, ... are lines 1, 2, ...
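The same paragraph-mode pattern generalizes: $2, for example, would print the second line of each block instead.
awk -v RS= -v FS='\n' '{print $2}' file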
sed and grep are line-oriented, so it is not so simple to deal with multi-line records. (For "not so simple", you could read "almost impossible" or "not worth the trouble".)
Using awk you can do:
awk '!NF{p=1; next} NR==1 || p{print; p=0}' file
29.05.16_09.35
title-of-instance
04.05.16_15.01
The !NF condition (true on an empty line) sets the flag p=1 and skips to the next line.
NR==1 || p prints the line if it is the first record or if p==1; p is then reset to 0.

Replacing specific characters in first column of text

I have a text file and I'm trying to replace a specific character (.) in the first column with another character (-). Every field is delimited by a comma. Some of the lines have the last 3 columns empty, so they end with 3 commas.
Example of text file:
abc.def.ghi,123.4561.789,ABC,DEF,GHI
abc.def.ghq,124.4562.789,ABC,DEF,GHI
abc.def.ghw,125.4563.789,ABC,DEF,GHI
abc.def.ghe,126.4564.789,,,
abc.def.ghr,127.4565.789,,,
What I tried was using awk to replace '.' in the first column with '-', then print out the contents.
ETA: Tried out sarnold's suggestion and got the output I want.
ETA2: I could have a longer first column. Is there a way to change ONLY the first 3 '.' in the first column to '-', so that I get this output:
abc-def-ghi-qqq.www,123.4561.789,ABC,DEF,GHI
abc-def-ghq-qqq.www,124.4562.789,ABC,DEF,GHI
abc-def-ghw-qqq.www,125.4563.789,ABC,DEF,GHI
abc-def-ghe-qqq.www,126.4564.789,,,
abc-def-ghr-qqq.www,127.4565.789,,,
. is regexp notation for "any character". Escape it with \ and it matches a literal .:
$ awk -F, '{gsub(/\./,"-",$1); print}' textfile.csv
abc-def-ghi 123.4561.789 ABC DEF GHI
abc-def-ghq 124.4562.789 ABC DEF GHI
abc-def-ghw 125.4563.789 ABC DEF GHI
abc-def-ghe 126.4564.789
abc-def-ghr 127.4565.789
$
The output field separator is a space by default. Set OFS = "," to keep the commas:
$ awk -F, 'BEGIN {OFS=","} {gsub(/\./,"-",$1); print}' textfile.csv
abc-def-ghi,123.4561.789,ABC,DEF,GHI
abc-def-ghq,124.4562.789,ABC,DEF,GHI
abc-def-ghw,125.4563.789,ABC,DEF,GHI
abc-def-ghe,126.4564.789,,,
abc-def-ghr,127.4565.789,,,
This still allows changing multiple fields:
$ awk -F, 'BEGIN {OFS=","} {gsub(/\./,"-",$1); gsub("1", "#",$2); print}' textfile.csv
abc-def-ghi,#23.456#.789,ABC,DEF,GHI
abc-def-ghq,#24.4562.789,ABC,DEF,GHI
abc-def-ghw,#25.4563.789,ABC,DEF,GHI
abc-def-ghe,#26.4564.789,,,
abc-def-ghr,#27.4565.789,,,
I don't know what -OFS, does, but it isn't a supported command-line option; using it to set the output field separator was a mistake on my part. Setting OFS within the awk program works well.
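The supported command-line way is awk's -v option, which is equivalent to the BEGIN block above:
$ awk -F, -v OFS=, '{gsub(/\./,"-",$1); print}' textfile.csv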
This might work for you (sub() replaces only the first match each time it is called, so looping three times converts exactly the first three dots):
awk -F, -vOFS=, '{for(n=1;n<=3;n++)sub(/\./,"-",$1)}1' file
abc-def-ghi-qqq.www,123.4561.789,ABC,DEF,GHI
abc-def-ghq-qqq.www,124.4562.789,ABC,DEF,GHI
abc-def-ghw-qqq.www,125.4563.789,ABC,DEF,GHI
abc-def-ghe-qqq.www,126.4564.789,,,
abc-def-ghr-qqq.www,127.4565.789,,,
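To make the number of replacements configurable, the count could be passed in with -v (a sketch; the variable name n is illustrative):
awk -F, -v OFS=, -v n=3 '{for(i=1;i<=n;i++)sub(/\./,"-",$1)}1' file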