Replacing specific characters in first column of text - replace

I have a text file and I'm trying to replace a specific character (.) in the first column with another character (-). Every field is delimited by a comma. Some of the lines have the last 3 columns empty, so they have 3 commas at the end.
Example of text file:
abc.def.ghi,123.4561.789,ABC,DEF,GHI
abc.def.ghq,124.4562.789,ABC,DEF,GHI
abc.def.ghw,125.4563.789,ABC,DEF,GHI
abc.def.ghe,126.4564.789,,,
abc.def.ghr,127.4565.789,,,
What I tried was using awk to replace '.' in the first column with '-', then print out the contents.
ETA: Tried out sarnold's suggestion and got the output I want.
ETA2: I could have a longer first column. Is there a way to change ONLY the first 3 '.' in the first column to '-', so I get the output
abc-def-ghi-qqq.www,123.4561.789,ABC,DEF,GHI
abc-def-ghq-qqq.www,124.4562.789,ABC,DEF,GHI
abc-def-ghw-qqq.www,125.4563.789,ABC,DEF,GHI
abc-def-ghe-qqq.www,126.4564.789,,,
abc-def-ghr-qqq.www,127.4565.789,,,

. is regexp notation for "any character". Escape it with \ and it means .:
$ awk -F, '{gsub(/\./,"-",$1); print}' textfile.csv
abc-def-ghi 123.4561.789 ABC DEF GHI
abc-def-ghq 124.4562.789 ABC DEF GHI
abc-def-ghw 125.4563.789 ABC DEF GHI
abc-def-ghe 126.4564.789
abc-def-ghr 127.4565.789
$
The output field separator is a space by default. Set OFS to "," to change that:
$ awk -F, 'BEGIN {OFS=","} {gsub(/\./,"-",$1); print}' textfile.csv
abc-def-ghi,123.4561.789,ABC,DEF,GHI
abc-def-ghq,124.4562.789,ABC,DEF,GHI
abc-def-ghw,125.4563.789,ABC,DEF,GHI
abc-def-ghe,126.4564.789,,,
abc-def-ghr,127.4565.789,,,
This still allows changing multiple fields:
$ awk -F, 'BEGIN {OFS=","} {gsub(/\./,"-",$1); gsub("1", "#",$2); print}' textfile.csv
abc-def-ghi,#23.456#.789,ABC,DEF,GHI
abc-def-ghq,#24.4562.789,ABC,DEF,GHI
abc-def-ghw,#25.4563.789,ABC,DEF,GHI
abc-def-ghe,#26.4564.789,,,
abc-def-ghr,#27.4565.789,,,
I don't know what -OFS, does, but it isn't a supported command line option; using it to set the output field separator was a mistake on my part. Setting OFS within the awk program works well.

This might work for you:
awk -F, -vOFS=, '{for(n=1;n<=3;n++)sub(/\./,"-",$1)}1' file
abc-def-ghi-qqq.www,123.4561.789,ABC,DEF,GHI
abc-def-ghq-qqq.www,124.4562.789,ABC,DEF,GHI
abc-def-ghw-qqq.www,125.4563.789,ABC,DEF,GHI
abc-def-ghe-qqq.www,126.4564.789,,,
abc-def-ghr-qqq.www,127.4565.789,,,
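The loop above works because sub() replaces only the first match each time it is called. A minimal sketch of the same idea wrapped in a shell function, assuming the input arrives on stdin; the function name replace_first_dots and its count argument are hypothetical, not part of the original answer:

```shell
# Hypothetical helper: replace only the first N dots in column 1 of CSV input.
# Usage: replace_first_dots N < file
replace_first_dots() {
    awk -F, -v OFS=, -v n="$1" '
        {
            # sub() replaces only the first match, so calling it n times
            # replaces the first n dots; extra calls are harmless no-ops.
            for (i = 1; i <= n; i++) sub(/\./, "-", $1)
            print
        }'
}

printf '%s\n' 'abc.def.ghi.qqq.www,123.4561.789,ABC,DEF,GHI' \
              'abc.def.ghe.qqq.www,126.4564.789,,,' | replace_first_dots 3
```

Setting OFS to a comma matters here: modifying $1 makes awk rebuild the line, and the rebuilt line is joined with OFS.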

Related

Regex replacement for SQL using sed

I have a file containing many SQL statements and need to add escape characters, using sed, for single quotes within the SQL statements. Consider the following:
INSERT INTO MYTABLE VALUES (1,'some text','Drink at O'Briens');
In the above we need to escape the single quote in O'Briens. Using a regex I can find it with [a-zA-Z ]'[a-zA-Z ].
This matches the 3 characters of interest. However, if I run the following sed command:
sed -i "s/[a-zA-Z ]'[a-zA-Z ]/''/g" file.sql
it also removes the O and the B, so I end up with:
INSERT INTO MYTABLE VALUES (1,'some text','Drink at ''riens');
How do I isolate/reference the O and the B so the string becomes:
INSERT INTO MYTABLE VALUES (1,'some text','Drink at O''Briens');
Use capture groups to copy parts of the input to the result.
sed -r -i "s/([a-zA-Z ])'([a-zA-Z ])/\1''\2/g" file.sql
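To see the capture groups at work without touching a file, here is a minimal sketch that runs the same substitution on stdin. The function name escape_inner_quotes is hypothetical, and -E is used in place of -r on the assumption that the installed sed accepts it (GNU sed and BSD sed both do):

```shell
# Hypothetical helper: double a single quote that sits between two
# letters or spaces, reading SQL from stdin and writing to stdout.
escape_inner_quotes() {
    # \1 and \2 put back the characters captured on either side of the quote.
    sed -E "s/([a-zA-Z ])'([a-zA-Z ])/\1''\2/g"
}

printf '%s\n' "INSERT INTO MYTABLE VALUES (1,'some text','Drink at O'Briens');" \
    | escape_inner_quotes
```

Quotes that open or close a string literal sit next to a comma or a parenthesis in this sample, so the pattern leaves them alone.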
You could also do this in awk. The idea is to perform a substitution on the last field of the line, replacing ' with two instances of ', and then print the line. Here \047 is the octal escape for ' and & in the replacement stands for the matched text.
awk '{sub(/\047/,"&&",$NF)} 1' Input_file
The above code only prints the result to standard output. To save the changes back to the file, try the following:
awk '{sub(/\047/,"&&",$NF)} 1' Input_file > temp && mv temp Input_file

awk Regular Expression (REGEX) get phone number from file

The following is what I have written that would allow me to display only the phone numbers
in the file. I have posted the sample data below as well.
As I understand (read from left to right):
Using awk with "," as the delimiter: if the line starts with an int, followed by an int preceded by one of [-,:], followed by another int preceded by one of [-,:], show the 3rd column.
I used "www.regexpal.com" to validate my expression. I want to learn more and an explanation would be great not just the answer.
GNU bash, version 4.4.12(1)-release (x86_64-pc-linux-gnu)
awk -F "," '/^(\d)+([-,:*]\d+)+([-,:*]\d+)*$/ {print $3}' bashuser.csv
bashuser.csv
Jordon,New York,630-150,7234
Jaremy,New York,630-250-7768
Jordon,New York,630*150*7745
Jaremy,New York,630-150-7432
Jordon,New York,630-230,7790
Expected Output:
6301507234
6302507768
....
You could just remove all non-digit characters:
awk '{gsub(/[^[:digit:]]/, "")}1' file.csv
gsub replaces every match on the line.
[^[:digit:]] matches any character that is not a digit: the ^ negates the bracket expression, so without it the pattern would match the digits instead.
"" is the replacement, an empty string, so each match is deleted.
1 is a condition that is always true, so awk runs its default action, print, on every line.
In sed
sed 's/[^[:digit:]]*//g' file.csv
Since your desired output always appears to start at field #3, you can simplify your regex considerably using the following:
awk -F '[*,-]' '{print $3$4$5}'
Proof of concept
$ awk -F '[*,-]' '{print $3$4$5}' < ./bashuser.csv
6301507234
6302507768
6301507745
6301507432
6302307790
Explanation
-F '[*,-]': Use a character class to set the field separators to * OR , OR -.
print $3$4$5: Concatenate the 3rd through 5th fields.
awk is not very suitable here because the comma occurs not only as a field separator but also inside some of the phone numbers; sed gives better results:
sed 's/[^,]\+,[^,]\+,//;s/[^0-9]//g;' bashuser.csv
The first part, s/[^,]\+,[^,]\+,//, removes the first two fields.
The second part, s/[^0-9]//g, removes all remaining non-numeric characters.
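For comparison, the same two steps (skip the first two fields, then strip every non-digit) can be sketched in awk by rejoining everything from field 3 onward. The function name extract_phone is hypothetical, and the sketch assumes the name and city never contain a comma:

```shell
# Hypothetical helper: print the phone number from each "name,city,phone" line,
# even when the phone number itself contains a comma.
extract_phone() {
    awk -F, '{
        phone = $3
        for (i = 4; i <= NF; i++) phone = phone $i   # rejoin parts split on a stray comma
        gsub(/[^0-9]/, "", phone)                     # drop -, * and anything else non-numeric
        print phone
    }'
}

printf '%s\n' 'Jordon,New York,630-150,7234' \
              'Jordon,New York,630*150*7745' | extract_phone
```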

Delete rows with extra delimiter from csv file in unix

I have a csv file with 3 columns separated by a ',' delimiter. Some values contain a , in the data, and I would like to remove those records entirely. Please suggest how I can do this using sed, awk, or grep.
Input file :
monitor,display,45
keyboard,input,20
loud,speaker,output,20
mount,input,20
Expected Output :
monitor,display,45
keyboard,input,20
mount,input,20
I used grep command to filter out rows with extra commas.
grep -v '.*,.*,.*,.*' input_file > output_file
The pattern puts a comma between each .*, so it matches any line that contains at least three commas.
-v excludes the records that match the specified pattern.
Below is how you can do the same using awk. Basically, you want the records that have exactly 3 fields:
$ awk -F, 'NF==3 {print $0}' data1.txt
monitor,display,45
keyboard,input,20
mount,input,20
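If it helps to know which records were dropped, the NF test can be extended to report them. This is only a sketch: the function name keep_three_fields is hypothetical, and it assumes the awk in use can write to "/dev/stderr" (gawk, mawk and most Linux setups can):

```shell
# Hypothetical helper: keep records with exactly 3 comma-separated fields
# on stdout and report every other record on stderr.
keep_three_fields() {
    awk -F, '
        NF == 3 { print; next }    # well-formed record: keep it
        { print "skipped line " NR ": " $0 > "/dev/stderr" }
    '
}

printf '%s\n' 'monitor,display,45' 'loud,speaker,output,20' 'mount,input,20' \
    | keep_three_fields
```

Unlike the grep pattern, this also rejects lines with fewer than 3 fields.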

get the last word in body of text

Given a body of text that can span a varying number of lines, I need to use a grep, sed or awk solution to search through many files for the same pattern and get the last word in the body.
A file can include formats such as these where the word I want can be named anything
call function1(input1,
input2, #comment
input3) #comment
returning randomname1,
randomname2,
success3
call function1(input1,
input2,
input3)
returning randomname3,
randomname2,
randomname3
call function1(input1,
input2,
input3)
returning anothername3,
randomname2, anothername3
I need to print out results as
success3
randomname3
anothername3
I also need the filename and line information for each match.
I've tried
pcregrep -M 'function1.*(\s*.*){6}(\w+)$' filename.txt
which is too greedy and I still need to print out just the specific grouped value and not the whole pattern. The words function1 and returning in my sample code will always be named as this and can be hard coded within my expression.
Last word of code blocks
Split file in blocks using awk's record separator RS. A record will be defined as a block of text, records are separated by double newlines.
A record consists of fields, each two consecutive fields are separated by white space or a single newline.
Now all we have to do is print the last field for each record, resulting in following code:
awk 'BEGIN{ FS="[\n\t ]"; RS="\n\n"} { print $NF }' file
Explanation:
FS this is the field separator and is set to either a newline, a tab or a space: [\n\t ].
RS this is the record separator and is set to a double newline: \n\n
print $NF this will print the field $ with index NF, which is a variable containing the number of fields. Hence this prints the last field.
Note: To capture all paragraphs, the file should end in a double newline. This can easily be achieved by preprocessing the file with: $ echo -e '\n\n' >> file
Alternate solution based on comments
A more elegant and simple solution is as follows:
awk -v RS='' '{ print $NF }' file
How about the following awk solution:
awk 'NF == 0 {if(last) print last; last=""} NF > 0 {last=$NF} END {print last}' file
$NF gets the value of the last "word" on a line, where NF stands for number of fields. The last variable always stores the last word of the most recent non-empty line, and it is printed when an empty line is encountered, which marks the end of a paragraph.
New version that also requires the paragraph to match function1:
awk 'NF == 0 {if(last && hasF) print last; last=hasF=""}
NF > 0 {last=$NF; if(/function1/)hasF=1}
END {if(hasF) print last}' filename.txt
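A minimal sketch of how that version behaves, assuming blocks are separated by a single empty line. The function name last_word_of_calls and the sample input (including the "call other" block) are made up for illustration:

```shell
# Hypothetical helper: print the last word of every blank-line-separated
# block that mentions function1. Reads the named files, or stdin if none.
last_word_of_calls() {
    awk '
        NF == 0 { if (last != "" && hasF) print last; last = ""; hasF = 0 }
        NF > 0  { last = $NF; if (/function1/) hasF = 1 }
        END     { if (hasF) print last }
    ' "$@"
}

printf '%s\n' \
    'call function1(input1,' \
    'input3) #comment' \
    'returning randomname1,' \
    'success3' \
    '' \
    'call other(input1)' \
    'returning ignored' \
    '' \
    'call function1(input1)' \
    'returning randomname2, anothername3' | last_word_of_calls
```

The middle block is skipped because it never sets hasF, and the END rule covers a final block that is not followed by an empty line.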
This will produce the output you show from the input file you posted:
$ awk -v RS= '{print $NF}' file
success3
randomname3
anothername3
If you want to print FILENAME and line number like you mention then this may be what you want:
$ cat tst.awk
NF { nr=NR; last=$NF; next }
{ prt() }
END { prt() }
function prt() { if (nr) print FILENAME, nr, last; nr=0 }
$ awk -f tst.awk file
file 6 success3
file 13 randomname3
file 20 anothername3
If that doesn't do what you want, edit your question to provide clearer, more truly representative and accurate sample input and expected output.
This is the perl version of Shellfish's awk solution (plus the keywords):
perl -00 -nE '/function1/ and /returning/ and say ((split)[-1])' file
or, with one regex:
perl -00 -nE '/^(?=.*function1)(?=.*returning).*?(\S+)\s*$/s and say $1' file
But the key is the -00 option which reads the file a paragraph at a time.

regex - match exactly to a string portion in awk

I have a file where one column contains strings that are composed of characters separated by ,
example:
a123456, a54321, a12312
I need to find lines that contain a specific number in the comma separated list.
example: I want to find all lines that contain exactly a12345.
I tried to use the following:
awk ' $1~/a12345/ {print}'
but this prints out the line containing:
a123456, a54321, a12312
because the regex is matching the first 6 characters in a123456, I guess.
My question is, how can I make a regex that will only print out the lines that contain an exact match?
$ awk '/(^|[^[:alnum:]])a12345([^[:alnum:]]|$)/' file
$ awk '/(^|[^[:alnum:]])a123456([^[:alnum:]]|$)/' file
a123456, a54321, a12312
With GNU awk you could use word-delimiters:
$ awk '/\<a12345\>/' file
$ awk '/\<a123456\>/' file
a123456, a54321, a12312
Try using word match of grep like below:
grep -w a123456 myfile.txt
If you need it to match only at the start of the line, then use something like:
egrep -w ^a123456 myfile.txt
With awk:
awk -F ',\\s*' '$1 == "a12345"' filename
This splits the line on commas (optionally followed by whitespace) and selects only those lines whose first field is exactly "a12345". It also works when the field contains characters after "a12345" that would count as a word boundary, which is to say that
a12345.foo, bar, baz
is filtered out.
If more than a single field is to be tested, then you'll have to test all fields:
awk -F ',\\s*' 'function check() { for(i = 1; i <= NF; ++i) { if($i == "a12345") return 1; } return 0 } check()' filename
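The all-fields test can also be sketched with the search value passed in as a variable rather than hard coded. The function name has_exact_field is hypothetical, and the separator ', *' (a comma followed by optional spaces) is an assumption chosen so the sketch does not rely on \s being supported:

```shell
# Hypothetical helper: print lines in which at least one comma-separated
# field is exactly equal to the given value.
# Usage: has_exact_field VALUE [file...]
has_exact_field() {
    target=$1
    shift
    awk -F ', *' -v target="$target" '{
        for (i = 1; i <= NF; i++)
            if ($i == target) { print; next }   # exact string match on one field
    }' "$@"
}

printf '%s\n' 'a123456, a54321, a12312' \
              'a999, a12345, a1' \
              'a12345.foo, bar' | has_exact_field a12345
```

Because the comparison uses == on whole fields, a123456 and a12345.foo do not match a12345.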