Split data using sed or awk - regex

I have a lot of data I'm trying to split into CSV. My source data has this format:
* USER 'field1' 'mail1@domain.com' 'field3'
* USER 'field1' 'mail2@domain.com' 'field3'
* USER 'field1' 'mail3@domain.com' 'field3'
And here's what I'm trying to get as output:
field1;mail1@domain.com;field3
field1;mail2@domain.com;field3
field1;mail3@domain.com;field3
Rules:
* USER at the beginning of the line must obviously be stripped;
field1 and field3 could be an email address, or can contain ';
field1 could be empty ('');
the second field is always an email address;
each field has ' at the beginning and end of the field itself.
My idea was to strip * USER (sed -e 's/^\* USER //' could be a starting point), then "find" the mail in the center field, then capture the left side and right side into two variables. The last step would be to strip the leading and trailing ' from those variables.
Unfortunately, I don't have sed or awk knowledge at this level. Any ideas on how to achieve this?
Here is an example:
* USER '' 'alberto.cordini@generaligroup.com' 'CORDINI ALBERTO'
* USER 'moglie delmonte daniele' 'anna.borghi@rpos.com' 'Anna Borghi'
* USER '' 'annamaria.cravero@generaligroup.com' 'CRAVERO ANNA MARIA'
* USER '' 'patrizia.dagostino@generaligroup.com' 'D'AGOSTINO PATRIZIA'
* USER '' 'piero.depra@generaligroup.com' 'DE PRA' PIERO'
* USER '' 'viviana.dingeo@generaligroup.com' 'D'INGEO VIVIANA'

Update: You can use this awk for the provided input:
awk -F " '" '{gsub(/^ +| +$/, "", $3);
s=sprintf("%s;%s;%s;", $2,$3,$4); gsub(/'"'"';/, ";", s); print s}' file
;alberto.cordini@generaligroup.com;CORDINI ALBERTO;
moglie delmonte daniele;anna.borghi@rpos.com;Anna Borghi;
;annamaria.cravero@generaligroup.com;CRAVERO ANNA MARIA;
;patrizia.dagostino@generaligroup.com;D'AGOSTINO PATRIZIA;
;piero.depra@generaligroup.com;DE PRA' PIERO;
;viviana.dingeo@generaligroup.com;D'INGEO VIVIANA;

Simply:
$ awk '{print $2,$4,$6}' FS="'" OFS=";" file
field1;mail1@domain.com;field3
field1;mail2@domain.com;field3
field1;mail3@domain.com;field3
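With FS set to a single quote, the text between quotes lands in the even-numbered fields ($2, $4, $6) while the delimiters and surrounding spaces land in the odd ones. A quick way to see the split on one sample line (the echo is purely illustrative):
$ echo "* USER 'field1' 'mail1@domain.com' 'field3'" | awk -F "'" '{for (i = 1; i <= NF; i++) print i": "$i}'
1: * USER
2: field1
3:
4: mail1@domain.com
5:
6: field3
7: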

You could use sed or awk and that would work, but like you I don't use those often enough to remember them (and I find them clunky). If you need a solution you can put in a script and run all the time, then how about a Ruby solution? I use a regular expression here, but you don't have to:
sample-data.txt
* USER 'field1' 'mail1@domain.com' 'field3'
* USER 'field1' 'mail2@domain.com' 'field3'
* USER 'field1' 'mail3@domain.com' 'field3'
parse.rb
#!/usr/bin/env ruby
$stdin.each_line do |e|
  matches = e.match /\*\ USER\ '([\w]*)'\ '([\w@\.]*)'\ '([\w]*)'/
  if matches != nil
    puts "#{matches[1]};#{matches[2]};#{matches[3]}"
  end
end
From terminal/command-line:
cat sample-data.txt | ruby parse.rb
p.s. For me, if it's a one-time kind of problem, I would use Notepad++ on Windows: open the file, record a macro, play the macro to the end of the file, done.

sed "s/²/²S/g;s/\\'/²q/g;s/\*[[:blank:]]USER[[:blank:]]\{1,\}'\([^']*\)'[[:blank:]]*'\([^']*\)'[[:blank:]]*'\(.*\)'[[:blank:]]*$/\1;\2;\3/;s/²q/\\'/g;s/²S/²/g" YourFile.csv
This assumes any ' inside a field is escaped as \'. The idea: first protect literal ² characters (s/²/²S/g), then hide the escaped quotes behind a ²q sentinel (s/\\'/²q/g) so the main substitution only sees field-delimiting quotes, and finally restore both sentinels.

A sed example; it relies on the fact that there are single spaces between the quote-delimited fields. If that's not the case, it needs modification to be more flexible.
To avoid shell quote-escaping, which is an ugly experience, I would put the one-liner into a file. -r makes sed use extended regexps (avoiding the need to escape the parentheses). Single quotes inside field1 and field3 are preserved by regexp greediness: each .* eats everything, including quotes, up to the last quote that still lets the rest of the pattern match.
sed -r -f s.sed samp.csv
s.sed:
s/\* USER '(.*)' '([^']*)' '(.*)'/\1;\2;\3/


How do I remove a particular pattern with a number sequence in sed

I'm very new to the sed command, so I'm trying to learn.
I'm currently faced with a few thousand markdown files I need to clean up, and I'm trying to create a command that deletes part of the following:
# null 864: Headline
body text
I need anything that comes before the headline deleted, which is '# null 864: '.
It's always: '# null ' then some digits ': '.
I'm using GNU sed (gsed) because I'm on a Mac.
The best I've come up with so far is
gsed -i '/#\snull\s([1-9]|[1-9][0-9]|[1-9][0-9][0-9]|[1-9][0-9][0-9][0-9]):\s/d' *.md
The above does not seem to work.
However, if I do
gsed -i '/#\snull/d' *.md
it does what I want, but it also does some unintended stuff in the body text.
How do I control it so that only the headline and the body text remain?
Considering that you want to print the values before the headline and don't want to print any other lines, try the following:
sed -E -n 's/^(#\s+null\s+[0-9]+:\s+)Headline/\1/p' Input_file
In case you want to print the value before Headline and, when no match is found, print the complete line, then try the following:
sed -E 's/^(#\s+null\s+[0-9]+:\s+)Headline/\1/' Input_file
Explanation: simply using the -E option of sed to enable EREs (extended regular expressions), then using the s command to perform the substitution: match # followed by space(s), null, space(s), digits, a colon, and space(s), keep it all in the 1st capturing group, and substitute the match with that 1st capturing group.
NOTE: The above commands will print the values to the terminal; in case you want to save the changes in place, use the -i option once you are satisfied with the output.
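For example, on the sample headline from the question, the first command should print just the prefix (a quick sanity check, assuming GNU sed):
$ printf '# null 864: Headline\nbody text\n' | sed -E -n 's/^(#\s+null\s+[0-9]+:\s+)Headline/\1/p'
# null 864: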
If I'm understanding correctly, you have files like this:
This should get deleted
This should too.
# null 864: Headline
body text
this should get kept
You want to keep the headline, and everything after, right? You can do this in awk:
awk '/# null [0-9]+:/,eof {print}' foo.md
Here eof is just an uninitialized variable, so it is always false and the range never closes: printing starts at the first matching headline and continues through the end of the file.
You might use awk and replace the # null 864: part with an empty string using sub:
awk '{sub(/^# null [0-9]+:[[:blank:]]+/,"")}1' file
Redirect the output to create a new file, or use GNU awk's -i inplace to overwrite the same file.
The closing }1 prints the whole line, as 1 evaluates to true.
The pattern matches:
^# null - match literally from the start of the string
[0-9]+:[[:blank:]]+ - match 1+ digits, then : and 1+ blanks
Output
Headline
body text
On a Mac, ed should be installed by default, so:
The content of script.ed:
g/^# null [[:digit:]]\{1,\}: Headline$/s/^.\{1,\}: //
,p
Q
for file in *.md; do ed -s "$file" < ./script.ed; done
If the output is OK, remove the ,p and change the Q to w so it edits the file in place:
g/^# null [[:digit:]]\{1,\}: Headline$/s/^.\{1,\}: //
w
Run the loop again.
I'd use a range in sed, the same as Andy Lester's awk solution.
Borrowing his infile,
$: cat tst.md
This should get deleted
This should too.
# null 864: Headline
body text
this should get kept
$: sed -Ein '/^# null [0-9]+:/,${p;d};d;' tst.md
$: cat tst.md
# null 864: Headline
body text
this should get kept

awk Regular Expression (REGEX) get phone number from file

The following is what I have written that would allow me to display only the phone numbers
in the file. I have posted the sample data below as well.
As I understand it (reading from left to right):
Using the awk command delimited by ",": if the first char is an int, then an int preceded by [-,:], and then an int preceded by [-,:], show the 3rd column.
I used www.regexpal.com to validate my expression. I want to learn more, so an explanation would be great, not just the answer.
GNU bash, version 4.4.12(1)-release (x86_64-pc-linux-gnu)
awk -F "," '/^(\d)+([-,:*]\d+)+([-,:*]\d+)*$/ {print $3}' bashuser.csv
bashuser.csv
Jordon,New York,630-150,7234
Jaremy,New York,630-250-7768
Jordon,New York,630*150*7745
Jaremy,New York,630-150-7432
Jordon,New York,630-230,7790
Expected Output:
6301507234
6302507768
....
You could just remove all non-digit characters (note that awk's POSIX regexps do not support \d, which is one reason your original pattern never matches):
awk '{gsub(/[^[:digit:]]/, "")}1' file.csv
gsub replaces every match;
[^[:digit:]] : the ^ negates the class, matching everything but a digit [[:digit:]]; remove the ^ and the reverse happens;
"" means replace with nothing, i.e. delete, inside the gsub call;
1 means print the line, a shortcut for {print}.
In sed:
sed 's/[^[:digit:]]*//g' file.csv
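Either way, run against the sample bashuser.csv, the output should be just the digits:
$ sed 's/[^[:digit:]]*//g' bashuser.csv
6301507234
6302507768
6301507745
6301507432
6302307790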
Since your desired output always appears to start at field #3, you can simplify your regex considerably using the following:
awk -F '[*,-]' '{print $3$4$5}'
Proof of concept
$ awk -F '[*,-]' '{print $3$4$5}' < ./bashuser.csv
6301507234
6302507768
6301507745
6301507432
6302307790
Explanation
-F '[*,-]': Use a character class to set the field separators to * OR , OR -.
print $3$4$5: Concatenate the 3rd through 5th fields.
awk is not very suitable here because the comma occurs not only as a field separator but also inside the number itself; sed will give better results:
sed 's/[^,]\+,[^,]\+,//;s/[^0-9]//g' bashuser.csv
The first part, s/[^,]\+,[^,]\+,//, removes the first two fields;
the second part, s/[^0-9]//g, removes all remaining non-numeric characters.

add characters every two places with sed

I am working with CSV files; they are seismic catalogs from a database. I need to arrange them like the USGS format in order to start the next steps.
My input data format is:
DatesT,Latitude,Longitude,Magnitude,Depth,Catalog
1909,7,23,170000,-17.430,-66.349,5.1,0,PRE-GEM-ISC
1913,12,14,024500,-17.780,-63.170,5.6,0,PRE-GEM-ISC
The USGS input format is:
DatesT,Latitude,Longitude,Magnitude,Depth,Catalog
1909-7-23T17:00:00,-17.430,-66.349,5.1,0,PRE-GEM-ISC
1913-12-14T02:45:00,-17.780,-63.170,5.6,0,PRE-GEM-ISC
To "convert" my input to USGS format I did the following steps:
archi='catalog.txt'
sed 's/,/-/1' $archi > temp1.dat # to change "," to "-"
sed 's/,/-/1' temp1.dat > temp2.dat # same as above
sed 's/,/T/1' temp2.dat > temp3.dat # To add T between date and time
sed -i.bak "1 s/^.*$/DatesT,Latitude,Longitude,Magnitude,Depth,Catalog/" temp3.dat #to preserve the header.
I have the following output:
DatesT,Latitude,Longitude,Magnitude,Depth,Catalog
1909-7-23T170000,-17.430,-66.349,5.1,0,PRE-GEM-ISC
1913-12-14T024500,-17.780,-63.170,5.6,0,PRE-GEM-ISC
I tried to implement the following command:
sed 's/.\{13\}/&: /g' temp3.dat > temp4.dat
Unfortunately it did not work as I expected, because the time does not sit at the same position on every line.
Do you have any ideas to improve my code?
One way using GNU sed:
sed -r 's/([0-9]{4}),([0-9]{1,2}),([0-9]{1,2}),([0-9]{2})([0-9]{2})([0-9]{2})(,.*)/\1-\2-\3T\4:\5:\6\7/' file
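To see what the capture groups grab, here is the same command applied to one sample line (purely a sanity check):
$ echo '1909,7,23,170000,-17.430,-66.349,5.1,0,PRE-GEM-ISC' | sed -r 's/([0-9]{4}),([0-9]{1,2}),([0-9]{1,2}),([0-9]{2})([0-9]{2})([0-9]{2})(,.*)/\1-\2-\3T\4:\5:\6\7/'
1909-7-23T17:00:00,-17.430,-66.349,5.1,0,PRE-GEM-ISC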
You split each line into individual tokens: the first column as token one, the second column as token two, and when it comes to the fourth column, take two digits at a time as a token, and then substitute as required.
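A minimal awk sketch of that token-by-token idea, assuming the catalog.txt layout from the question (this is an illustration, not the answerer's original code):
awk -F, 'NR == 1 { print; next }
{ printf "%s-%s-%sT%s:%s:%s", $1, $2, $3, substr($4,1,2), substr($4,3,2), substr($4,5,2)
  for (i = 5; i <= NF; i++) printf ",%s", $i    # re-append the remaining fields unchanged
  print "" }' catalog.txt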
You can do:
perl -p -e 's/^(\d{4}),(\d+),(\d+),(\d{2})(\d{2})(\d{2}),([0-9.-]+),([0-9.-]+),(.*)$/$1-$2-$3T$4:$5:$6,$7,$8,$9/g' initialfile.csv
or, for an in-place edit:
perl -p -i -e 's/^(\d{4}),(\d+),(\d+),(\d{2})(\d{2})(\d{2}),([0-9.-]+),([0-9.-]+),(.*)$/$1-$2-$3T$4:$5:$6,$7,$8,$9/g' initialfile.csv
which should output the USGS format.
This might work for you (GNU sed):
sed -E '1!s/^([^,]*),([^,]*),([^,]*),(..)(..)/\1-\2-\3T\4:\5:/' file
Leave the header (line 1) alone: that is the 1! address.
Replace the first and second field delimiters (all fields are delimited by a comma ,) with a dash -.
Replace the third field delimiter with T.
Split the fourth field into three equal parts and separate each part with a colon :.
N.B. The last part of the fourth field will stay as is and so does not need to be defined.
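A quick check against the sample input (catalog.txt is the file name from the question):
$ sed -E '1!s/^([^,]*),([^,]*),([^,]*),(..)(..)/\1-\2-\3T\4:\5:/' catalog.txt
DatesT,Latitude,Longitude,Magnitude,Depth,Catalog
1909-7-23T17:00:00,-17.430,-66.349,5.1,0,PRE-GEM-ISC
1913-12-14T02:45:00,-17.780,-63.170,5.6,0,PRE-GEM-ISC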
Sometimes as programmers we become too focused on data and would be better served by looking at the problem as an artist and coding what we see.

Remove character from the middle of a string

I have a SAM file with an RX: field containing 12 bases separated in the middle by a hyphen, i.e. RX:Z:CTGTGC-TCGTAA.
I want to remove the hyphen from this field, but I can't simply remove all hyphens from the whole file, as the read names contain them too, like 1713704_EP0004-T.
I have mostly been trying tr, but this just removes all hyphens from the file:
tr -d '"-' < sample.fq.unaln.umi.sam > sample.fq.unaln.umi.re.sam
input is a large SAM file of >10,000,000 lines like this:
1902336-103-016_C1D1_1E-T:34 99 chr1 131341 36 146M = 131376 182 GGACAGGGAGTGTTGACCCTGGGCGGCCCCCTGGAGCCACCTGCCCTGAAAGCCCAGGGCCCGCAACCCCACACACTTTGGGGCTGGTGGAACCTGGTAAAAGCTCACCTCCCACCATGGAGGAGGAGCCCTGGGCCCCTCAGGGG NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN MC:Z:147M MD:Z:83T62cD:i:4 cE:f:0 PG:Z:bwa RG:Z:A MI:Z:34 NM:i:1 cM:i:3 MQ:i:36 UQ:i:45 AS:i:141 XS:i:136 RX:Z:CTGTGC-TCGTAA
Desired output (note the last field):
1902336-103-016_C1D1_1E-T:34 99 chr1 131341 36 146M = 131376 182 GGACAGGGAGTGTTGACCCTGGGCGGCCCCCTGGAGCCACCTGCCCTGAAAGCCCAGGGCCCGCAACCCCACACACTTTGGGGCTGGTGGAACCTGGTAAAAGCTCACCTCCCACCATGGAGGAGGAGCCCTGGGCCCCTCAGGGG NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN MC:Z:147M MD:Z:83T62cD:i:4 cE:f:0 PG:Z:bwa RG:Z:A MI:Z:34 NM:i:1 cM:i:3 MQ:i:36 UQ:i:45 AS:i:141 XS:i:136 RX:Z:CTGTGCTCGTAA
How do I solve this problem?
awk
awk '{sub(/-/,"",$NF)}1' file
is what you need.
Explanation
From your example it is clear that you're concerned only about the last field.
NF is the total number of fields that a record contains, hence $NF is the last field.
sub(/-/,"",$NF) replaces the - in the last field with an empty string, making the change persistent.
GNU sed
For the same reason (the hyphen to remove is the last one on the line),
sed -Ei 's/^(.*)-/\1/' file
will work. It has the added advantage that it can perform an in-place edit.
Explanation
The -E option enables the extended regular expression engine.
The (.*) is a greedy match of any character (.) any number of times (*). Because it is greedy, it will match everything up to the last hyphen.
The () makes sed remember what was matched.
In the substitution part, we put back just the matched part \1 (1 because we have only one pair of parentheses; note that you can have as many as you like) without the hyphen, thus effectively removing it from the last field, where it should occur.
Note: GNU awk supports -i inplace, but I'm not sure from which version on.
I've solved this problem using pysam, which is faster, safer, and requires less disk space, as an intermediate SAM file is not required. It's not perfect; I'm still learning Python and have used pysam for half a day.
import pysam
import sys
from re import sub

# Provide a bam file
if len(sys.argv) == 2:
    assert sys.argv[1].endswith('.bam')

# Makes output filehandle
inbamfn = sys.argv[1]
outbamfn = sub('.bam$', '.fixRX.bam', inbamfn)
inbam = pysam.Samfile(inbamfn, 'rb')
outbam = pysam.Samfile(outbamfn, 'wb', template=inbam)

# Counters for reads processed and written
n = 0
w = 0

# .get_tag() retrieves the RX tag from each read
for read in inbam.fetch(until_eof=True):
    n += 1
    umi = read.get_tag('RX')
    assert umi is not None
    umifix = umi[:6] + umi[7:]
    read.set_tag('RX', umifix, value_type='Z')
    if '-' in umifix:
        print('Hyphen found in UMI:', umifix, read)
        break
    else:
        w += 1
        outbam.write(read)

inbam.close()
outbam.close()

print('Processed', n, 'reads:\n',
      w, 'UMIs written.\n',
      str(int((w / n) * 100)) + '% of UMIs fixed')
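To run it (assuming the script above is saved as fix_rx.py; the file name is just for illustration):
python fix_rx.py sample.bam
This reads sample.bam and writes the fixed reads to sample.fixRX.bam alongside it.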
The best solution is to work with BAM rather than SAM files, and to use a proper BAM parser/writer library, such as htslib.
Lacking that, you can cobble something together by searching for the regular expression ^RX:Z: in the optional tags (columns 12 and up).
Working with columns, while possible, is hard with sed. Instead, here’s how to do this in awk:
awk 'BEGIN { FS = OFS = "\t" } {
  for (i = 12; i <= NF; i++) {
    if ($i ~ /^RX:Z:/) gsub("-", "", $i)
  }
}
1' file.sam
And here’s a roughly equivalent solution as a Perl “one-liner”:
perl -lape '
  for (@F[11 .. $#F]) {
    s/-//g if /^RX:Z:/;
  }
  $_ = join("\t", @F);
' file.sam
To perform the replacement in the original file, you can pass the option -i.bak to perl (this will create a backup file.sam.bak; if you don’t want the backup, omit the extension).
Is this pattern on many records that you want to edit, and always at the end of the line? If so:
sed -E 's/^(.*)(\s..:.:......)-(......\s*)$/\1\2\3/' < sample.fq.unaln.umi.sam > sample.fq.unaln.umi.re.sam

gawk regex to find any record having characters other than those specified by the character class in the regex pattern

I have a list of email addresses in a text file. I have a pattern with character classes that specifies which characters are allowed in the email addresses.
Now, from that input file, I want to find only the email addresses that have characters other than the allowed ones.
I am trying to write a gawk command for this, but am not able to get it to work properly.
Here is the gawk that I am trying:
gawk -F "," ' $2!~/[[:alnum:]@\.]]/ { print "has invalid chars" }' emails.csv
The problem I am facing is that the above gawk command only matches the records that have NONE of the alphanumerics, @, and . (dot) in them. But what I am looking for is the records that have the allowed characters but, along with them, the not-allowed ones as well.
For example, the above command would find
"_-()&(()%"
as it only has characters not in the regex pattern, but will not find
"abc-123@xyz,com"
as it also has characters that are present in the specified character classes in the regex pattern.
How about several tests together: contains an alnum, an @, a dot, and an invalid character:
$2 ~ /[[:alnum:]]/ && $2 ~ /@/ && $2 ~ /\./ && $2 ~ /[^[:alnum:]@.]/
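Wrapped into a complete command against the emails.csv from the question, that might look like:
gawk -F "," '$2 ~ /[[:alnum:]]/ && $2 ~ /@/ && $2 ~ /\./ && $2 ~ /[^[:alnum:]@.]/ { print "has invalid chars" }' emails.csv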
Your regex is wrong here:
/[[:alnum:]@\.]]/
It should be:
/[[:alnum:]@.]/
Note the removal of the extra ] from the end.
Test Case:
# regex with extra ]
awk -F "," '{print ($2 !~ /[[:alnum:]#.]]/)}' <<< 'abc,ab#email.com'
1
# correct regex
awk -F "," '{print ($2 !~ /[[:alnum:]#.]/)}' <<< 'abc,ab#email.com'
0
Do you really care whether the string has a valid character? If not (and it seems like you don't), the simple solution is
$2 ~ /[^[:alnum:]@.]/{ print "has invalid chars" }
That won't trigger on an empty string, so you might want to add a test for that case, as sketched below.
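A minimal sketch of that extra test, just extending the condition above:
$2 ~ /[^[:alnum:]@.]/ || $2 == "" { print "has invalid chars" }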
Your question would REALLY benefit from some concise, testable sample input and expected output, as right now we're all guessing at what you want, but maybe this does it:
awk -F, '{r=$2} gsub(/[[:alnum:]@.]/,"",r) && (r!="") { print "has invalid chars" }' emails.csv
e.g. using the 2 input examples you provided:
$ cat file
_-()&(()%
abc-123@xyz,com
$ awk '{r=$0} gsub(/[[:alnum:]@.]/,"",r) && (r!="") { print $0, "has invalid chars" }' file
abc-123@xyz,com has invalid chars
There are more accurate email regexps btw, e.g.:
\<[[:alnum:]._%+-]+@[[:alnum:]_.-]+\.[[:alpha:]]{2,}\>
which is a gawk-specific modification (for the word delimiters \< and \>) of the one described at http://www.regular-expressions.info/email.html, after updating it to use POSIX character classes.
If you are trying to validate email addresses, do not use the regexp you started with, as it will declare @ and 7 to each be valid email addresses.
See also How to validate an email address using a regular expression? for more email regexp details.