How to match and remove a string of characters in a pattern - regex

I am trying to remove a constant string of characters that match a pattern. I can match the pattern via awk, is there a combination of awk and sed or perhaps just awk that can delete the string in place?
Example:
I need to match the 14th, 15th and 16th "|" symbol and delete the content in between.
Before:
00000000,003377fdh,,BLUE,YELLOW,ORANGE,UANGTANG,||57000000|1250000000|2|ramp|CAR|||||||24000|11000|apples,12-15-2017
After:
00000000,003377fdh,,BLUE,YELLOW,ORANGE,UANGTANG,||57000000|1250000000|2|ramp|CAR||||||,12-15-2017

awk -F '|' -v 'OFS=|' '{$13 = $16; NF -= 3; print}' file
or
perl -F'\|' -ne 'splice(#F, 12, 3); print join("|", #F)' file

You can try this sed too
sed -E 's/(([^|]*\|){3})//5' infile

Related

Unix : Find and replace consecutive commas to consecutive pipelines

I'm converting a double quoted CSV to pipeline delimited txt file in Unix.
I have used the following sed command to replace "," into | then remove starting and ending double quote.
sed -e 's/","/|/g' -e 's/"//g' filenm.csv > filenm.txt
But the file seems to have consecutive commas without double quotes and they are not getting replaced.
Col1|col2|col3|col4|col5|col6|col7|col8
Val1|val2|val3,,,,val7|val8
Now I want to convert all these consecutive commas to consecutive pipelines as they indicate empty or null fields.
And other fields also have commas inside field values which should not be altered.
I tried using below for that, but not working.
sed -e 's/,{1,\}/|{1,\}/g' filenm.csv > filenm.txt
sample csv file opened in notepad:
"ID","Name","DOB","Age","Address","City","State","Country","Phone number"
"123","ABC","12/20/2020","15","No.38,3rd st, RRR NNN, TRT",,,,"9999999999"
"456","DEF","12/20/2020",,,,,"test-country","9999999999"
"465","XYZ",,,"No.38,3rd st, RRR NNN, TRT",,,,"9999999999"
I hope this helps to reproduce the issue and resolve.
Thanks in advance....
This might work for you (GNU sed):
sed -E ':a;s/^(("[^",]*",+)*"[^",]*),/\1\n/;ta;y/,\n/|,/' file
Iteratively replace ,'s between "'s with newlines, then translate ,'s for |'s and newlines for ,'s.
You can use perl:
perl -pe 's/"([^"]*)"|,/defined($1) ? $1 : "|"/ge' filenm.csv > filenm.txt
Details:
"([^"]*)"|, - the regex pattern that matches ", then captures into Group 1 any zero or more chars other than " and then matches a ", or just matches a , in all other contexts
defined($1) ? $1 : "|" - RHS, replacement, that replaces the match either with Group 1 value (if Group 1 was matched) or with a | (if the , was matched)
ge - g stands for global (replaces all occurrences) and e makes Perl treat the RHS as a Perl expression.
See an online test:
#!/bin/bash
s='"ID","Name","DOB","Age","Address","City","State","Country","Phone number"
"123","ABC","12/20/2020","0","No.38,3rd st, RRR NNN, TRT",,,,"9999999999"'
perl -pe 's/"([^"]*)"|,/defined($1) ? $1 : "|"/ge' <<< "$s"
Output:
ID|Name|DOB|Age|Address|City|State|Country|Phone number
123|ABC|12/20/2020|0|No.38,3rd st, RRR NNN, TRT||||9999999999
Using awk:
awk -F \" '{ for(i=1;i<=NF;i++) { if ($i ~ /^[,]{2,}$/) { $i="," } } OFS="\"";gsub("\",\"","\"|\"",$0)}1' sample.csv
Explanation:
awk -F \" '{ # Set the field delimiter to double quote
for(i=1;i<=NF;i++) {
if ($i ~ /^[,]{2,}$/) {
$i="," # Loop through each field and if is contains 2 or more commas, set that field to one comma
}
}
OFS="\"";
gsub("\",\"","\"|\"",$0) # Substitute "," for "|"
}1' sample.csv
I would use GNU AWK for that following way. Let file.txt content be
"ID","Name","DOB","Age","Address","City","State","Country","Phone number"
"123","ABC","12/20/2020","15","No.38,3rd st, RRR NNN, TRT",,,,"9999999999"
"456","DEF","12/20/2020",,,,,"test-country","9999999999"
"465","XYZ",,,"No.38,3rd st, RRR NNN, TRT",,,,"9999999999"
then
awk 'BEGIN{FS="\"";OFS=""}{for(i=1;i<=NF;i+=2){$i=gensub(/,/,"|","g",$i)};print $0}' file.txt
output
ID|Name|DOB|Age|Address|City|State|Country|Phone number
123|ABC|12/20/2020|15|No.38,3rd st, RRR NNN, TRT||||9999999999
456|DEF|12/20/2020|||||test-country|9999999999
465|XYZ|||No.38,3rd st, RRR NNN, TRT||||9999999999
I assumed that first and last column is never empty. I use " as field separator and then in every odd field (these contain solely ,) I change all , to |. Finally I print whole such altered line.
(tested in GNU Awk 5.0.1)

Match multiple patterns in same line using sed [duplicate]

Given a file, for example:
potato: 1234
apple: 5678
potato: 5432
grape: 4567
banana: 5432
sushi: 56789
I'd like to grep for all lines that start with potato: but only pipe the numbers that follow potato:. So in the above example, the output would be:
1234
5432
How can I do that?
grep 'potato:' file.txt | sed 's/^.*: //'
grep looks for any line that contains the string potato:, then, for each of these lines, sed replaces (s/// - substitute) any character (.*) from the beginning of the line (^) until the last occurrence of the sequence : (colon followed by space) with the empty string (s/...// - substitute the first part with the second part, which is empty).
or
grep 'potato:' file.txt | cut -d\ -f2
For each line that contains potato:, cut will split the line into multiple fields delimited by space (-d\ - d = delimiter, \ = escaped space character, something like -d" " would have also worked) and print the second field of each such line (-f2).
or
grep 'potato:' file.txt | awk '{print $2}'
For each line that contains potato:, awk will print the second field (print $2) which is delimited by default by spaces.
or
grep 'potato:' file.txt | perl -e 'for(<>){s/^.*: //;print}'
All lines that contain potato: are sent to an inline (-e) Perl script that takes all lines from stdin, then, for each of these lines, does the same substitution as in the first example above, then prints it.
or
awk '{if(/potato:/) print $2}' < file.txt
The file is sent via stdin (< file.txt sends the contents of the file via stdin to the command on the left) to an awk script that, for each line that contains potato: (if(/potato:/) returns true if the regular expression /potato:/ matches the current line), prints the second field, as described above.
or
perl -e 'for(<>){/potato:/ && s/^.*: // && print}' < file.txt
The file is sent via stdin (< file.txt, see above) to a Perl script that works similarly to the one above, but this time it also makes sure each line contains the string potato: (/potato:/ is a regular expression that matches if the current line contains potato:, and, if it does (&&), then proceeds to apply the regular expression described above and prints the result).
Or use regex assertions: grep -oP '(?<=potato: ).*' file.txt
grep -Po 'potato:\s\K.*' file
-P to use Perl regular expression
-o to output only the match
\s to match the space after potato:
\K to omit the match
.* to match rest of the string(s)
sed -n 's/^potato:[[:space:]]*//p' file.txt
One can think of Grep as a restricted Sed, or of Sed as a generalized Grep. In this case, Sed is one good, lightweight tool that does what you want -- though, of course, there exist several other reasonable ways to do it, too.
This will print everything after each match, on that same line only:
perl -lne 'print $1 if /^potato:\s*(.*)/' file.txt
This will do the same, except it will also print all subsequent lines:
perl -lne 'if ($found){print} elsif (/^potato:\s*(.*)/){print $1; $found++}' file.txt
These command-line options are used:
-n loop around each line of the input file
-l removes newlines before processing, and adds them back in afterwards
-e execute the perl code
You can use grep, as the other answers state. But you don't need grep, awk, sed, perl, cut, or any external tool. You can do it with pure bash.
Try this (semicolons are there to allow you to put it all on one line):
$ while read line;
do
if [[ "${line%%:\ *}" == "potato" ]];
then
echo ${line##*:\ };
fi;
done< file.txt
## tells bash to delete the longest match of ": " in $line from the front.
$ while read line; do echo ${line##*:\ }; done< file.txt
1234
5678
5432
4567
5432
56789
or if you wanted the key rather than the value, %% tells bash to delete the longest match of ": " in $line from the end.
$ while read line; do echo ${line%%:\ *}; done< file.txt
potato
apple
potato
grape
banana
sushi
The substring to split on is ":\ " because the space character must be escaped with the backslash.
You can find more like these at the linux documentation project.
Modern BASH has support for regular expressions:
while read -r line; do
if [[ $line =~ ^potato:\ ([0-9]+) ]]; then
echo "${BASH_REMATCH[1]}"
fi
done
grep potato file | grep -o "[0-9].*"

Remove everything after 2nd occurrence in a string in unix

I would like to remove everything after the 2nd occurrence of a particular
pattern in a string. What is the best way to do it in Unix? What is most elegant and simple method to achieve this; sed, awk or just unix commands like cut?
My input would be
After-u-math-how-however
Output should be
After-u
Everything after the 2nd - should be stripped out. The regex should also match
zero occurrences of the pattern, so zero or one occurrence should be ignored and
from the 2nd occurrence everything should be removed.
So if the input is as follows
After
Output should be
After
Something like this would do it.
echo "After-u-math-how-however" | cut -f1,2 -d'-'
This will split up (cut) the string into fields, using a dash (-) as the delimiter. Once the string has been split into fields, cut will print the 1st and 2nd fields.
This might work for you (GNU sed):
sed 's/-[^-]*//2g' file
You could use the following regex to select what you want:
^[^-]*-\?[^-]*
For example:
echo "After-u-math-how-however" | grep -o "^[^-]*-\?[^-]*"
Results:
After-u
#EvanPurkisher's cut -f1,2 -d'-' solution is IMHO the best one but since you asked about sed and awk:
With GNU sed for -r
$ echo "After-u-math-how-however" | sed -r 's/([^-]+-[^-]*).*/\1/'
After-u
With GNU awk for gensub():
$ echo "After-u-math-how-however" | awk '{$0=gensub(/([^-]+-[^-]*).*/,"\\1","")}1'
After-u
Can be done with non-GNU sed using \( and *, and with non-GNU awk using match() and substr() if necessary.
awk -F - '{print $1 (NF>1? FS $2 : "")}' <<<'After-u-math-how-however'
Split the line into fields based on field separator - (option spec. -F -) - accessible as special variable FS inside the awk program.
Always print the 1st field (print $1), followed by:
If there's more than 1 field (NF>1), append FS (i.e., -) and the 2nd field ($2)
Otherwise: append "", i.e.: effectively only print the 1st field (which in itself may be empty, if the input is empty).
This can be done in pure bash (which means no fork, no external process). Read into an array split on '-', then slice the array:
$ IFS=-
$ read -ra val <<< After-u-math-how-however
$ echo "${val[*]}"
After-u-math-how-however
$ echo "${val[*]:0:2}"
After-u
awk '$0 = $2 ? $1 FS $2 : $1' FS=-
Result
After-u
After
This will do it in awk:
echo "After" | awk -F "-" '{printf "%s",$1; for (i=2; i<=2; i++) printf"-%s",$i}'

Removing text between "|" delimiter and "," delimiter using shell script

I have a large multi-lined file that is being pulled from a database the file has fields delimited by commas and if the field has multiple values the values are separated by "|"
example input:
name,title,email1|email2|email3,phone,address
In a shell script I need to remove "|email2|email3"
example output:
name,title,email1,phone,address
I need to do this for each line in the file.
Try sed:
sed "s/\|[^,]*//g"
Result:
h2co3-macbook:~ h2co3$ echo "name,title,email1|email2|email3,phone,address" | sed "s/\|[^,]*//g"
name,title,email1,phone,address
h2co3-macbook:~ h2co3$
Using sed:
sed -i 's/|[^,]*//g' filename
Note that in most regex flavors | is a special character that specifies alternation, and to match a literal | you need to use \|. This is not the case for sed, to match a literal | you use | and for alternation you use \| (unless an extended regex option is specified).
Use sed with inline option:
sed -i.bak 's/|[^|,]*//g' inFile
Live Demo: http://ideone.com/zKUVhl
This answer splits the input into fields and outputs the ones you want.
awk -F'[|,]' -v OFS=, '{print $1, $2, $3, $(NF-1), $NF}' file

How to print matched regex pattern using awk?

Using awk, I need to find a word in a file that matches a regex pattern.
I only want to print the word matched with the pattern.
So if in the line, I have:
xxx yyy zzz
And pattern:
/yyy/
I want to only get:
yyy
EDIT:
thanks to kurumi i managed to write something like this:
awk '{
for(i=1; i<=NF; i++) {
tmp=match($i, /[0-9]..?.?[^A-Za-z0-9]/)
if(tmp) {
print $i
}
}
}' $1
and this is what i needed :) thanks a lot!
This is the very basic
awk '/pattern/{ print $0 }' file
ask awk to search for pattern using //, then print out the line, which by default is called a record, denoted by $0. At least read up the documentation.
If you only want to get print out the matched word.
awk '{for(i=1;i<=NF;i++){ if($i=="yyy"){print $i} } }' file
It sounds like you are trying to emulate GNU's grep -o behaviour. This will do that providing you only want the first match on each line:
awk 'match($0, /regex/) {
print substr($0, RSTART, RLENGTH)
}
' file
Here's an example, using GNU's awk implementation (gawk):
awk 'match($0, /a.t/) {
print substr($0, RSTART, RLENGTH)
}
' /usr/share/dict/words | head
act
act
act
act
aft
ant
apt
art
art
art
Read about match, substr, RSTART and RLENGTH in the awk manual.
After that you may wish to extend this to deal with multiple matches on the same line.
gawk can get the matching part of every line using this as action:
{ if (match($0,/your regexp/,m)) print m[0] }
match(string, regexp [, array])
If array is present, it is cleared,
and then the zeroth element of array is set to the entire portion of
string matched by regexp. If regexp contains parentheses, the
integer-indexed elements of array are set to contain the portion of
string matching the corresponding parenthesized subexpression.
http://www.gnu.org/software/gawk/manual/gawk.html#String-Functions
If Perl is an option, you can try this:
perl -lne 'print $1 if /(regex)/' file
To implement case-insensitive matching, add the i modifier
perl -lne 'print $1 if /(regex)/i' file
To print everything AFTER the match:
perl -lne 'if ($found){print} else{if (/regex(.*)/){print $1; $found++}}' textfile
To print the match and everything after the match:
perl -lne 'if ($found){print} else{if (/(regex.*)/){print $1; $found++}}' textfile
If you are only interested in the last line of input and you expect to find only one match (for example a part of the summary line of a shell command), you can also try this very compact code, adopted from How to print regexp matches using `awk`?:
$ echo "xxx yyy zzz" | awk '{match($0,"yyy",a)}END{print a[0]}'
yyy
Or the more complex version with a partial result:
$ echo "xxx=a yyy=b zzz=c" | awk '{match($0,"yyy=([^ ]+)",a)}END{print a[1]}'
b
Warning: the awk match() function with three arguments only exists in gawk, not in mawk
Here is another nice solution using a lookbehind regex in grep instead of awk. This solution has lower requirements to your installation:
$ echo "xxx=a yyy=b zzz=c" | grep -Po '(?<=yyy=)[^ ]+'
b
Off topic, this can be done using the grep also, just posting it here in case if anyone is looking for grep solution
echo 'xxx yyy zzze ' | grep -oE 'yyy'
Using sed can also be elegant in this situation. Example (replace line with matched group "yyy" from line):
$ cat testfile
xxx yyy zzz
yyy xxx zzz
$ cat testfile | sed -r 's#^.*(yyy).*$#\1#g'
yyy
yyy
Relevant manual page: https://www.gnu.org/software/sed/manual/sed.html#Back_002dreferences-and-Subexpressions
If you know what column the text/pattern you're looking for (e.g. "yyy") is in, you can just check that specific column to see if it matches, and print it.
For example, given a file with the following contents, (called asdf.txt)
xxx yyy zzz
to only print the second column if it matches the pattern "yyy", you could do something like this:
awk '$2 ~ /yyy/ {print $2}' asdf.txt
Note that this will also match basically any line where the second column has a "yyy" in it, like these:
xxx yyyz zzz
xxx zyyyz
echo "abc123def" | awk '
function MATCH(haystack, needle, ltrim, rtrim)
{
if(ltrim == 0 && !length(ltrim))
ltrim = 0;
if(rtrim == 0 && !length(rtrim))
rtrim = 0;
return substr(haystack, match(haystack, needle) + ltrim, RLENGTH - ltrim - rtrim);
}
{
print $0 " - " MATCH($0, "123"); # 123
print $0 " - " MATCH($0, "[0-9]*d", 0, 1); # 123
print $0 " - " MATCH($0, "1234"); # Nothing printed
}'