Count exact amount of commas in lines of csv file (regex) - regex

I have a CSV file (25 GB) that is corrupted. Normally each row has 47 columns separated by 46 commas, plus a starting comma, so 47 commas in total, but some rows have 49 columns. I want to delete those rows from the file, and I thought I would use grep with a regex that I found in another question:
grep -vE '/^([^,]*,){47}[^,]*$/' file1 > file2
Any idea what I am missing?

$ printf 'a,b,c\n1,2\n'
a,b,c
1,2
$ # -x option forces entire line to be matched
$ printf 'a,b,c\n1,2\n' | grep -xE '([^,]*,){2}[^,]*'
a,b,c
$ printf 'a,b,c\n1,2\n' | grep -xE '([^,]*,){1}[^,]*'
1,2
$ # you can also use awk, NF contains number of fields
$ printf 'a,b,c\n1,2\n' | awk -F, 'NF==3'
a,b,c
$ printf 'a,b,c\n1,2\n' | awk -F, 'NF==2'
1,2

Probably the easiest:
awk -F , 'NF==47' file1 >file2
This obviously doesn't work correctly for complex CSV files where some fields could contain commas inside double quotes which are not separators at all (... though maybe that's exactly the problem with your data).
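One thing worth checking before relying on NF==47: if every row really starts with a comma, as the question describes, awk counts an empty first field, so a row with 47 commas has NF equal to 48. A minimal sketch on toy data (the short rows below are an assumption for illustration, not the real file) that tallies field counts so the right number can be read off first:

```shell
# Toy rows: well-formed ones have a leading comma plus 3 values (3 commas).
sample=$(printf ',a,b,c\n,a,b,c,d,e\n,x,y,z\n')

# Tally how many rows have each field count; a leading comma adds an empty field.
printf '%s\n' "$sample" | awk -F, '{count[NF]++} END {for (n in count) print n, count[n]}' | sort -n

# Keep only the rows with the dominant field count (4 here: 3 commas + 1).
printf '%s\n' "$sample" | awk -F, 'NF == 4'
```

On the real file the same tally would be run once on file1, and the most frequent count then used in the NF== filter.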

You describe a "starting comma", so your regex needs to take that into account.
grep -vE "^,([^,]*,){46}[^,]*$" file1 > file2
Or better yet...
grep -vE "^(,[^,]*){47}$" file1 > file2
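Note that -v inverts the match, so with a pattern that describes the well-formed rows, -v outputs the malformed ones. A small sketch on toy data (rows with a leading comma and 3 commas in total stand in for the real 47, which is an assumption for illustration) showing both directions:

```shell
# Toy rows: the first and third are well-formed, the second has extra columns.
sample=$(printf ',a,b,c\n,a,b,c,d,e\n,x,y,z\n')

# Without -v: keep the rows that match, i.e. the well-formed ones.
printf '%s\n' "$sample" | grep -E '^(,[^,]*){3}$'

# With -v: keep the rows that do NOT match, i.e. the malformed ones.
printf '%s\n' "$sample" | grep -vE '^(,[^,]*){3}$'
```

If the goal is a cleaned file2 containing only good rows, the form without -v is probably the one wanted; the -v form is handy for inspecting what would be dropped.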

Related

sed & regex expression

I'm trying to prepend the string 'chr' to the lines where it is missing. This is only needed for the lines that do not have '##'.
At first I used grep and sed as follows, but I want the command to overwrite the original file.
grep -v "^#" 5b110660bf55f80059c0ef52.vcf | grep -v 'chr' | sed 's/^/chr/g'
So, to edit the file in place I wrote this:
sed -i -E '/^#.*$|^chr.*$/ s/^/chr/' 5b110660bf55f80059c0ef52.vcf
This is the content of the vcf file.
##FORMAT=<ID=DP4,Number=4,Type=Integer,Description="#ref plus strand,#ref minus strand, #alt plus strand, #alt minus strand">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 24430-0009S21_GM17-12140
1 955597 95692 G T 1382 PASS VARTYPE=1;BGN=0.00134309;ARL=150;DER=53;DEA=55;QR=40;QA=39;PBP=1091;PBM=300;TYPE=SNP;DBXREF=dbSNP:rs115173026,g1000:0.2825,esp5400:0.2755,ExAC:0.2290,clinvar:rs115173026,CLNSIG:2,CLNREVSTAT:mult,CLNSIGLAB:Benign;SGVEP=AGRN|+|NM_198576|1|c.45G>T|p.:(p.Pro15Pro)|synonymous GT:DP:AD:DP4 0/1:125:64,61:50,14,48,13
chr1 957898 82729935 G T 1214 off_target VARTYPE=1;BGN=0.00113362;ARL=149;DER=50;DEA=55;QR=38;QA=40;PBP=245;PBM=978;NVF=0.53;TYPE=SNP;DBXREF=dbSNP:rs2799064,g1000:0.3285;SGVEP=AGRN|+|NM_198576|2|c.463+56G>T|.|intronic GT:DP:AD:DP4 0/1:98:47,51:9,38,10,41
If I understand your expected result correctly, try:
sed -ri '/^(#|chr)/! s/^/chr/' file
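The difference from the command in the question is the "!" after the address: without it, the substitution runs on exactly the lines that should be left alone. A short sketch on abbreviated toy lines (stand-ins for the VCF content, not the real data), run on standard input rather than in place so nothing is overwritten:

```shell
# Abbreviated stand-in for the VCF lines in the question (toy data).
vcf=$(printf '##FORMAT=x\n#CHROM POS\n1 955597\nchr1 957898\n')

# Address without "!": the substitution hits the lines that should be left alone.
printf '%s\n' "$vcf" | sed -E '/^#.*$|^chr.*$/ s/^/chr/'

# Address negated with "!": only lines starting with neither # nor chr are changed.
printf '%s\n' "$vcf" | sed -E '/^(#|chr)/! s/^/chr/'
```

Once the output looks right, adding -i applies the same script to the file itself.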
Your question isn't clear and you didn't provide the expected output, so we can't test a potential solution. But if all you want is to add chr to the start of lines where it's not already present and which don't start with #, then that's just:
awk '!/^(#|chr)/{$0="chr" $0} 1' file
To overwrite the original file using GNU awk would be:
awk -i inplace '!/^(#|chr)/{$0="chr" $0} 1' file
and with any awk:
awk '!/^(#|chr)/{$0="chr" $0} 1' file > tmp && mv tmp file
This can be done with a single sed invocation. The script itself is something like the following.
If you have an input of format
$ echo -e '#\n#\n123chr456\n789chr123\nabc'
#
#
123chr456
789chr123
abc
then prepending chr to the non-commented lines that lack chr is done as
$ echo -e '#\n#\n123chr456\n789chr123\nabc' | sed '/^#/ {p
d
}
/chr/ {p
d
}
s/^/chr/'
which prints
#
#
123chr456
789chr123
chrabc
(Note the multiline sed script.)
Now you only need to run this script on the file in place (-i in modern sed versions).
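The same logic can probably be written on one line by using sed's b command, which with no label jumps to the end of the script (where the line is printed as usual). This is a sketch of an equivalent form, not the answerer's own script:

```shell
# Lines starting with # or containing chr skip the substitution; the rest get the prefix.
printf '#\n#\n123chr456\n789chr123\nabc\n' |
    sed -e '/^#/b' -e '/chr/b' -e 's/^/chr/'
```

Like the multiline version, this skips any line that contains chr anywhere, not only at the start.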

egrep to find largest suffix for file

There are files like this:
Report.cfg
Report.cfg.1
Report.cfg.2
Report.cfg.3
I want to fetch the max suffix, if one exists (i.e. 3), using egrep.
If I try simple egrep:
ls | egrep Report.cfg.*
I get the full file name and the whole list, not the suffix only.
What could be an optimized egrep?
You can use this awk command to find the greatest number from a list of files ending with a dot and a number:
printf '%s\n' *.cfg.[0-9] | awk -F '.' '$NF > max{max = $NF} END{print max}'
3
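The glob *.cfg.[0-9] only matches single-digit suffixes. A possible variant for multi-digit suffixes, shown here on a hypothetical list of names (Report.cfg.10 is made up for illustration) instead of real files:

```shell
# Hypothetical file names, including a two-digit suffix.
printf '%s\n' Report.cfg Report.cfg.1 Report.cfg.2 Report.cfg.10 |
    awk -F. '$NF ~ /^[0-9]+$/ && $NF + 0 > max { max = $NF + 0 } END { print max }'
```

Against a real directory the name list would come from something like printf '%s\n' Report.cfg.* instead. Adding 0 forces a numeric comparison, so 10 ranks above 2.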

Incorporate egrep regexps with awk?

I've been trying to understand how awk can work with egrep regular expressions.
I have the following example:
John,Milanos
Anne,Silverwood
Tina,Fastman
Adrian,Thomassonn
I'm looking to use egrep regexps to process the second column (the last names in this scenario) while printing the entire line for output.
The closest I've come to my answer was using:
$ awk -F ',' '{print $2}' | egrep '([a-z])\1.*([a-z])\2'
Thomassonn
I would then take "Thomassonn" and egrep back into my initial list of full names to get the full record. However, I've encountered plenty of errors and false positives once I used other filters.
Desired result:
Adrian,Thomassonn
awk does not support back-references within a regex. egrep, however, is sufficient to achieve your desired result:
$ egrep ',.*([a-z])\1.*([a-z])\2' file
Adrian,Thomassonn
Variations
If there are three or more columns and you want to search only the second:
egrep '^[^,]*,[^,]*([a-z])\1[^,]*([a-z])\2' file
If you want to search the third column:
egrep '^[^,]*,[^,]*,[^,]*([a-z])\1[^,]*([a-z])\2' file
If you want to search the first of any number of columns:
egrep '^[^,]*([a-z])\1[^,]*([a-z])\2' file
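To see why anchoring the pattern to a column matters, here is a small sketch on the sample names from the question. It uses grep -E, the non-deprecated spelling of egrep; back-references in extended regexes are an extension that GNU grep supports, which is assumed here:

```shell
names=$(printf 'John,Milanos\nAnne,Silverwood\nTina,Fastman\nAdrian,Thomassonn\n')

# Unanchored: the "nn" in Anne and the "oo" in Silverwood together satisfy the pattern.
printf '%s\n' "$names" | grep -E '([a-z])\1.*([a-z])\2'

# Restricted to the second column: only the row with two doubled letters there remains.
printf '%s\n' "$names" | grep -E '^[^,]*,[^,]*([a-z])\1[^,]*([a-z])\2'
```

The first command prints two rows, one of them a false positive of the kind described in the question; the second prints only Adrian,Thomassonn.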
awk doesn't support backreferences; here's one way to do what you want instead:
$ cat tst.awk
BEGIN{ FS="," }
{
    numMatches = 0
    fld = $2
    for (charNr=1; charNr <= length($2); charNr++) {
        char = substr($2,charNr,1)
        if (char ~ /[a-z]/)
            numMatches += gsub(char"{2}"," ",fld)
    }
}
numMatches >= 2
$
$ awk -f tst.awk file
Adrian,Thomassonn
If you want to match sequences of 3 or any other number of repeated chars, just change {2} to {3} or whatever number you like.
By the way, for portability to all locales you should use [[:lower:]] instead of [a-z] if that's what you really mean.

grep matching but not printing if line end in dos ^M

I need to search in multiple files for a PATTERN, if found display the file, line and PATTERN surrounded by a few extra chars. My problem is that if the line matching the PATTERN ends with ^M (CRLF) grep prints an empty line instead.
Create a file like this: first line "a^M", second line "a", third line empty, fourth line "a" (not followed by a newline).
a^M
a
a
Without trying to match a few chars after the PATTERN all occurrences are found and displayed:
# grep -srnoEiI ".{0,2}a" *
1:a
2:a
4:a
If I try to match any chars at the end of the PATTERN, it prints an empty line instead of line one, the one ending in CRLF:
# grep -srnoEiI ".{0,2}a.{0,2}" *
2:a
4:a
How can I change this to act as expected?
P.S. I would like to fix this grep, but I will accept other solutions, for example in awk.
EDIT:
Based on the answers below I chose to strip the \r and force grep to pipe the colors to tr:
grep --color=always -srnoEiI ".{0,2}a.{0,2}" * | tr -d '\r'
Here's a simpler case that reproduces your problem:
# Output
echo $'a\r' | grep -o "a"
# No output
echo $'a\r' | grep -o "a."
This is because the ^M is matched like a regular character and, when printed, makes your terminal overwrite its output (this is purely cosmetic).
How you want to fix this depends on what you want to do.
# Show the output in hex format to ensure it's correct
$ echo $'a\r' | grep -o "a." | od -t x1 -c
0000000 61 0d 0a
a \r \n
# Show the output in visually less ambiguous format
$ echo $'a\r' | grep -o "a." | cat -v
a^M
# Strip the carriage return
$ echo $'a\r' | grep -o "a." | tr -d '\r'
a
awk -v pattern="a" '$0 ~ pattern && !/\r$/ {print NR ": " $0}' file
or
sed -n '/a/{/\r$/!{=;p}}' ~/tmp/srcfile | paste -d: - -
Both of these do: find the pattern, see if the line does not end in a carriage return, print the line number and the line. For the sed, the line number is on its own line, so we have to join two consecutive lines with a colon.
You could use pcregrep:
pcregrep -n '.{0,2}a.{0,2}' inputfile
For your sample input:
$ printf $'a\r\na\n\na\n' | pcregrep -n '.{0,2}a.{0,2}'
1:a
2:a
4:a
A couple more ways:
Use the dos2unix utility to convert the dos-style line endings to unix-style:
dos2unix myfile.txt
Or preprocess the file using tr to remove the CR characters, then pipe to grep:
$ tr -d '\r' < myfile.txt | grep -srnoEiI ".{0,2}a.{0,2}"
1:a
2:a
4:a
$
Note dos2unix may need to be installed on whatever OS you are using. More than likely tr will be available on any POSIX-compliant OS.
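If carriage returns elsewhere in a line should be preserved, a possible alternative to tr -d '\r' is to strip only a trailing CR with sed. This is a sketch on the sample input from the question; the cr variable is a helper introduced here so that the script does not depend on sed understanding \r:

```shell
# Build the sample from the question: "a\r", "a", empty line, "a" with no final newline.
cr=$(printf '\r')
printf 'a\r\na\n\na' | sed "s/${cr}\$//" | grep -noE '.{0,2}a.{0,2}'
```

With the trailing CR gone, line 1 is reported like the others.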
You can use awk with a custom field separator:
awk -F '[[:blank:]\r]' '/.{0,2}a.{0,2}/{print FILENAME, NR, $1}' OFS=':' file
TESTING:
Your grep command:
grep -srnoEiI ".{0,2}a.{0,2}" file|cat -vte
file:1:a^M$
file:2:a$
file:4:a$
Suggested awk command:
awk -F '[[:blank:]\r]' '/.{0,2}a.{0,2}/{print FILENAME, NR, $1}' OFS=':' file|cat -vte
file:1:a$
file:2:a$
file:4:a$

AWK replace $0 of second file when match few columns

How can I merge two files when the first two columns match in both files, replacing the first file's values with the second file's columns? For example:
Same number of columns:
FILE 1:
121212,0100,1.1,1.2,
121212,0200,2.1,2.2,
FILE 2:
121212,0100,3.1,3.2,3.3,
121212,0130,4.1,4.2,4.3,
121212,0200,5.1,5.2,5.3,
121212,0230,6.1,6.2,6.3,
OUTPUT:
121212,0100,3.1,3.2,3.3,
121212,0200,5.1,5.2,5.3,
In other words, I need to print $0 of the second file when $1 and $2 match in both files. I understand the logic, but I can't implement it using arrays, which apparently should be used.
Please take a moment to explain any code.
Use awk to print the first 2 fields in the pattern file and pipe to grep to do the match:
$ awk 'BEGIN{OFS=FS=","}{print $1,$2}' file1 | grep -f - file2
121212,0100,3.1,3.2,3.3,
121212,0200,5.1,5.2,5.3,
The -f option tells grep to take the pattern from a file but using - instead of a filename makes grep take the patterns from stdin.
So the first awk script produces the patterns from file1 which we pipe to match against in file2 using grep:
$ awk 'BEGIN{OFS=FS=","}{print $1,$2}' file1
121212,0100
121212,0200
You probably want to anchor the match to the beginning of the line using ^:
$ awk 'BEGIN{OFS=FS=","}{print "^"$1,$2}' file1
^121212,0100
^121212,0200
$ awk 'BEGIN{OFS=FS=","}{print "^"$1,$2}' file1 | grep -f - file2
121212,0100,3.1,3.2,3.3,
121212,0200,5.1,5.2,5.3,
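The ^ anchor fixes the start of the match but not the end of the second field, so a key such as 0100 would also match a longer value that begins with 0100. A sketch of a stricter pattern that appends the separator; the 01005 row and the temporary file names are made up for illustration:

```shell
tmp=$(mktemp -d)
printf '121212,0100,1.1,1.2,\n' > "$tmp/keys"
printf '121212,0100,3.1,3.2,3.3,\n121212,01005,9.9,9.9,9.9,\n' > "$tmp/data"

# Loose pattern "^121212,0100" also matches the 01005 row.
loose=$(awk 'BEGIN{OFS=FS=","}{print "^"$1,$2}' "$tmp/keys" | grep -f - "$tmp/data")

# Appending the separator makes the second field match exactly.
strict=$(awk 'BEGIN{OFS=FS=","}{print "^"$1,$2 FS}' "$tmp/keys" | grep -f - "$tmp/data")

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
rm -rf "$tmp"
```

The keys are still treated as regular expressions, so this assumes they contain no regex metacharacters.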
Here's one way using awk:
awk -F, 'FNR==NR { a[$1,$2]; next } ($1,$2) in a' file1 file2
Results:
121212,0100,3.1,3.2,3.3,
121212,0200,5.1,5.2,5.3,
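Since the question asks for an explanation, here is the same one-liner spread out with comments. The array is renamed to seen and the inputs are recreated in a temporary directory purely for illustration:

```shell
tmp=$(mktemp -d)
printf '121212,0100,1.1,1.2,\n121212,0200,2.1,2.2,\n' > "$tmp/file1"
printf '121212,0100,3.1,3.2,3.3,\n121212,0130,4.1,4.2,4.3,\n121212,0200,5.1,5.2,5.3,\n121212,0230,6.1,6.2,6.3,\n' > "$tmp/file2"

result=$(awk -F, '
    # FNR == NR holds only while reading the first file named on the command line.
    FNR == NR {
        seen[$1, $2]   # referencing the element creates the key ($1,$2)
        next           # skip the test below for file1 lines
    }
    # For file2: print the line when its first two fields were seen in file1.
    ($1, $2) in seen
' "$tmp/file1" "$tmp/file2")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Unlike the grep approach, this compares the two fields as exact strings, so no anchoring or escaping is needed. One caveat: the FNR==NR test assumes file1 is not empty.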