unix sed not backtracking to finish the job - regex

I'm trying to make a script to convert postgres CSV dumps into Oracle csv dumps. Aka, I'm trying to replace "true" with "Y" and "false" with "N".
So I want a script called to_oracle like this:
echo "false,false,false,true" | to_oracle
N,N,N,Y
So here is my attempt:
sed -E -e 's:(,|^)true(,|$):\1Y\2:g' -e 's:(,|^)false(,|$):\1N\2:g' "$#"
The logic is that a field in a CSV file either starts with beginning of line or a comma "," and it ends with either the end of line or a comma ","
The problem with this script is that it greedily absorbs the comma and thus every second field doesn't work:
echo "false,false,false,true" | to_oracle
N,false,N,Y
Now I suppose I could pipe it to the script twice, and that would do the job, but I'm wondering is there a more elegant solution?

An awk version:
echo "false,false,false,true" | awk -F, -v OFS=, '{for(i=1;i<=NF;i++) $i=$i=="true"?"Y":"N"}1'
N,N,N,Y
It test one by one field, if its true use Y, else use N
If you like to test for false as well
echo "false,false,false,true" | awk -F, -v OFS=, '{for(i=1;i<=NF;i++) $i=($i=="true"?"Y":($i=="false"?"N":"other"))}1'
N,N,N,Y

With GNU sed, you may use
sed -E ':a;s/(,|^)false(,|$)/\1N\2/;ta; :b;s/(,|^)true(,|$)/\1Y\2/;tb'
See the online demo
Details
-E will enable POSIX ERE syntax
':a;s/(,|^)false(,|$)/\1N\2/;ta; will recursively replace false in between commas or start/end of string with N
:b;s/(,|^)true(,|$)/\1Y\2/;tb' will recursively replace true in between commas or start/end of string with Y.

Related

Removing rows that contains "(null)" value from a text file

I would like to remove any row within a .txt file that contains "(null)". The (null) value is always in the 3rd column. I would like to add this to a script that I already have.
Txt file example:
39|1411|XXYZ
40|1416|XXX
41|1420|(null)
In this example I would like to remove the third row.
Im guessing its an awk -F but not sure from there.
You are on the right track with using -F.
$ awk -F '|' '$3 != "(null)"' file.txt
39|1411|XXYZ
40|1416|XXX
You set the field separator to |, then print all lines where the third field is not equal to (null). This uses awk's default of "print the line" if there's no action associated with a pattern.
If you relax the requirement to specifically test the third field, and there is no other place for the "(null)" substring to occur, you can get the same result with
grep -vF '(null)' file.txt
With awk:
awk '-F|' '$3 != "(null)"' < input-file
Here is a sed:
$ sed '/(null)$/d' file
39|1411|XXYZ
40|1416|XXX
The $ assures that the (null) is at the end of the line. If you want to assure that (null) is the final column:
$ sed '/\|(null)$/d' file
And if you want to be extra sure that it is the third column:
$ sed '/^[^|]*\|[^|]*\|(null)$/d' file
Or with grep:
$ grep -v '^[^|]*|[^|]*|(null)$'
(But instead of this last one, just use awk...)
Use grep:
grep -v '|.*|(null)' in_file
Here, grep uses option -v : print lines that do not match.
Or use Perl:
perl -F'[|]' -lane 'print if $F[2] ne "(null)";' in_file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array #F on whitespace or on the regex specified in -F option.
-F'[|]' : Split into #F on literal |, rather than on whitespace.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
I would like to remove any row within a .txt file that contains "(null)"
If you wish to do that using AWK let file.txt content be
39|1411|XXYZ
40|1416|XXX
41|1420|(null)
then
awk '!index($0,"(null)")' file.txt
will output
39|1411|XXYZ
40|1416|XXX
Explanation: index return position of first occurence of substring ((null) in this case) or 0 if none will find, I negate what is return thus getting truth for 0 and false for anything else and AWK does print where result was truth.

Remove hostnames from a single line that follow a pattern in bash script

I need to cat a file and edit a single line with multiple domains names. Removing any domain name that has a set certain pattern of 4 letters ex: ozar.
This will be used in a bash script so the number of domain names can range, I will save this to a csv later on but right now returning a string is fine.
I tried multiple commands, loops, and if statements but sending the output to variable I can use further in the script proved to be another difficult task.
Example file
$ echo file.txt
ozarkzshared.com win.ad.win.edu win_fl.ozarkzsp.com ap.allk.org allk.org >ozarkz.com website.com
What I attempted (that was close)
domains_1=$(cat /tmp/file.txt | sed 's/ozar*//g')
domains_2=$( cat /tmp/file.txt | printf '%s' "${string##*ozar}")
Goal
echo domain_x
win.ad.win.edu ap.allk.org allk.org website.com
If all the domains are on a single line separated by spaces, this might work:
awk '/ozar/ {next} 1' RS=" " file.txt
This sets RS, your record separator, then skips any record that matches the keyword. If you wanted to be able to skip a substring provided in a shell variable, you could do something like this:
$ s=ozar
$ awk -v re="$s" '$0 ~ re {next} 1' RS=" " file.txt
Note that the ~ operator is comparing a regular expression, not precisely a substring. You could leverage the index() function if you really want to check a substring:
$ awk -v s="$s" 'index($0,s) {next} 1' RS=" " file.txt
Note that all of the above is awk, which isn't what you asked for. If you'd like to do this with bash alone, the following might be for you:
while read -r -a a; do
for i in "${a[#]}"; do
[[ "$i" = *"$s"* ]] || echo "$i"
done
done < file.txt
This assigns each line of input to the array $a[], then steps through that array testing for a substring match and printing if there is none. Text processing in bash is MUCH less efficient than in a more specialized tool like awk or sed. YMMV.
you want to delete the words until a space delimiter
$ sed 's/ozar[^ ]*//g' file
win.ad.win.edu win_fl. ap.allk.org allk.org website.com

Match multiple patterns in same line using sed [duplicate]

Given a file, for example:
potato: 1234
apple: 5678
potato: 5432
grape: 4567
banana: 5432
sushi: 56789
I'd like to grep for all lines that start with potato: but only pipe the numbers that follow potato:. So in the above example, the output would be:
1234
5432
How can I do that?
grep 'potato:' file.txt | sed 's/^.*: //'
grep looks for any line that contains the string potato:, then, for each of these lines, sed replaces (s/// - substitute) any character (.*) from the beginning of the line (^) until the last occurrence of the sequence : (colon followed by space) with the empty string (s/...// - substitute the first part with the second part, which is empty).
or
grep 'potato:' file.txt | cut -d\ -f2
For each line that contains potato:, cut will split the line into multiple fields delimited by space (-d\ - d = delimiter, \ = escaped space character, something like -d" " would have also worked) and print the second field of each such line (-f2).
or
grep 'potato:' file.txt | awk '{print $2}'
For each line that contains potato:, awk will print the second field (print $2) which is delimited by default by spaces.
or
grep 'potato:' file.txt | perl -e 'for(<>){s/^.*: //;print}'
All lines that contain potato: are sent to an inline (-e) Perl script that takes all lines from stdin, then, for each of these lines, does the same substitution as in the first example above, then prints it.
or
awk '{if(/potato:/) print $2}' < file.txt
The file is sent via stdin (< file.txt sends the contents of the file via stdin to the command on the left) to an awk script that, for each line that contains potato: (if(/potato:/) returns true if the regular expression /potato:/ matches the current line), prints the second field, as described above.
or
perl -e 'for(<>){/potato:/ && s/^.*: // && print}' < file.txt
The file is sent via stdin (< file.txt, see above) to a Perl script that works similarly to the one above, but this time it also makes sure each line contains the string potato: (/potato:/ is a regular expression that matches if the current line contains potato:, and, if it does (&&), then proceeds to apply the regular expression described above and prints the result).
Or use regex assertions: grep -oP '(?<=potato: ).*' file.txt
grep -Po 'potato:\s\K.*' file
-P to use Perl regular expression
-o to output only the match
\s to match the space after potato:
\K to omit the match
.* to match rest of the string(s)
sed -n 's/^potato:[[:space:]]*//p' file.txt
One can think of Grep as a restricted Sed, or of Sed as a generalized Grep. In this case, Sed is one good, lightweight tool that does what you want -- though, of course, there exist several other reasonable ways to do it, too.
This will print everything after each match, on that same line only:
perl -lne 'print $1 if /^potato:\s*(.*)/' file.txt
This will do the same, except it will also print all subsequent lines:
perl -lne 'if ($found){print} elsif (/^potato:\s*(.*)/){print $1; $found++}' file.txt
These command-line options are used:
-n loop around each line of the input file
-l removes newlines before processing, and adds them back in afterwards
-e execute the perl code
You can use grep, as the other answers state. But you don't need grep, awk, sed, perl, cut, or any external tool. You can do it with pure bash.
Try this (semicolons are there to allow you to put it all on one line):
$ while read line;
do
if [[ "${line%%:\ *}" == "potato" ]];
then
echo ${line##*:\ };
fi;
done< file.txt
## tells bash to delete the longest match of ": " in $line from the front.
$ while read line; do echo ${line##*:\ }; done< file.txt
1234
5678
5432
4567
5432
56789
or if you wanted the key rather than the value, %% tells bash to delete the longest match of ": " in $line from the end.
$ while read line; do echo ${line%%:\ *}; done< file.txt
potato
apple
potato
grape
banana
sushi
The substring to split on is ":\ " because the space character must be escaped with the backslash.
You can find more like these at the linux documentation project.
Modern BASH has support for regular expressions:
while read -r line; do
if [[ $line =~ ^potato:\ ([0-9]+) ]]; then
echo "${BASH_REMATCH[1]}"
fi
done
grep potato file | grep -o "[0-9].*"

extract substring using regex in shell script

The strings could be of form:
com.company.$(PRODUCT_NAME:rfc1034identifier)
$(PRODUCT_BUNDLE_IDENTIFIER)
com.company.$(PRODUCT_NAME:rfc1034identifier).$(someRandomVariable)
I need help in writing regex that extract all the string inside $(..)
I created a regex like ([(])\w+([)]) but when I try to execute in shell script, it gives me error of unmatched parenthesis.
This is what I executed:
echo "com.io.$(sdfsdfdsf)"|grep -P '([(])\w+([)])' -o
I need to get all matching substrings.
Problem is use of double quotes in echo command which is interpreting $(...) as a command substitution.
You can use single quotes:
echo 'com.io.$(sdfsdfdsf)' | grep -oP '[(]\w+[)]'
Here is an alternative using builtin BASH regex:
$> re='[(][^)]+[)]'
$> [[ 'com.io.$(sdfsdfdsf)' =~ $re ]] && echo "${BASH_REMATCH[0]}"
(sdfsdfdsf)
You can do it quite simple with sed
echo 'com.io.$(asdfasdf)'|sed -e 's/.*(\(.*\))/\1/g'
Gives
asdfasdf
For two fields:
echo 'com.io.$(asdfasdf).$(ddddd)'|sed -e 's/.*((.*)).$((.*))/\1 \2/g'
Gives
asdfasdf ddddd
Explanation:
sed -e 's/.*(\(.*\))/\1/g'
\_/\____/ \/
| | |_ print the placeholder content
| |___ placeholder selecting the text inside the paratheses
|____ select the text from beginning including the first paranthese
Your question specifies "shell", but not "bash". So I'll start with a common shell-based tool (awk) rather than assuming you can use any particular set of non-POSIX built-ins.
$ cat inp.txt
com.company.$(PRODUCT_NAME:rfc1034identifier)
$(PRODUCT_BUNDLE_IDENTIFIER)
com.company.$(PRODUCT_NAME:rfc1034identifier).$(someRandomVariable)
$ awk -F'[()]' '{for(i=2;i<=NF;i+=2){print $i}}' inp.txt
PRODUCT_NAME:rfc1034identifier
PRODUCT_BUNDLE_IDENTIFIER
PRODUCT_NAME:rfc1034identifier
someRandomVariable
This awk one-liner defines a field separator that consists of opening or closing brackets. With such a field separator, every even-numbered field will be the content you're looking for, assuming all lines of input are correctly formatted and there are no parentheses embedded inside other parentheses.
If you did want to do this in POSIX shell alone, the following would be an option:
#!/bin/sh
while read line; do
while expr "$line" : '.*(' >/dev/null; do
line="${line#*(}"
echo "${line%%)*}"
done
done < inp.txt
This steps through each line of input, slicing it up using the parentheses and printing each slice. Note that this uses expr, which most likely an external binary, but is at least included in POSIX.1.

Remove everything after 2nd occurrence in a string in unix

I would like to remove everything after the 2nd occurrence of a particular
pattern in a string. What is the best way to do it in Unix? What is most elegant and simple method to achieve this; sed, awk or just unix commands like cut?
My input would be
After-u-math-how-however
Output should be
After-u
Everything after the 2nd - should be stripped out. The regex should also match
zero occurrences of the pattern, so zero or one occurrence should be ignored and
from the 2nd occurrence everything should be removed.
So if the input is as follows
After
Output should be
After
Something like this would do it.
echo "After-u-math-how-however" | cut -f1,2 -d'-'
This will split up (cut) the string into fields, using a dash (-) as the delimiter. Once the string has been split into fields, cut will print the 1st and 2nd fields.
This might work for you (GNU sed):
sed 's/-[^-]*//2g' file
You could use the following regex to select what you want:
^[^-]*-\?[^-]*
For example:
echo "After-u-math-how-however" | grep -o "^[^-]*-\?[^-]*"
Results:
After-u
#EvanPurkisher's cut -f1,2 -d'-' solution is IMHO the best one but since you asked about sed and awk:
With GNU sed for -r
$ echo "After-u-math-how-however" | sed -r 's/([^-]+-[^-]*).*/\1/'
After-u
With GNU awk for gensub():
$ echo "After-u-math-how-however" | awk '{$0=gensub(/([^-]+-[^-]*).*/,"\\1","")}1'
After-u
Can be done with non-GNU sed using \( and *, and with non-GNU awk using match() and substr() if necessary.
awk -F - '{print $1 (NF>1? FS $2 : "")}' <<<'After-u-math-how-however'
Split the line into fields based on field separator - (option spec. -F -) - accessible as special variable FS inside the awk program.
Always print the 1st field (print $1), followed by:
If there's more than 1 field (NF>1), append FS (i.e., -) and the 2nd field ($2)
Otherwise: append "", i.e.: effectively only print the 1st field (which in itself may be empty, if the input is empty).
This can be done in pure bash (which means no fork, no external process). Read into an array split on '-', then slice the array:
$ IFS=-
$ read -ra val <<< After-u-math-how-however
$ echo "${val[*]}"
After-u-math-how-however
$ echo "${val[*]:0:2}"
After-u
awk '$0 = $2 ? $1 FS $2 : $1' FS=-
Result
After-u
After
This will do it in awk:
echo "After" | awk -F "-" '{printf "%s",$1; for (i=2; i<=2; i++) printf"-%s",$i}'