Replacing empty fields in delimited text file with dummy value - regex

I am working on a project that takes a delimited set of data of the form:
field1~field2~field3~.....~fieldn
Having empty fields is a possibility, so
field1~~~field4~~field6
is perfectly acceptable.
This file gets translated using an inhouse translator program that leaves a little to be desired. Specifically, it doesn't deal with empty fields well. My solution was to stick some dummy value in there, like a space or an # sign. I've tried:
sed -r 's/~/~ ~/g'
and
awk '{gsub(/\~\~/,"~ ~")}; 1' file > file.SPACE
but both of these fall short in replacing MULTIPLE fields. So if I input
field1~field2~~~field3
it'll output:
field1~field2~ ~~field3
I'd like to just script this if I could, as I can't change the code of the translator. I can change the code in the program that creates the delimited file, but I'd rather not. Is there some workaround, or is coming up with an expression for this just one of the inherent limitations in a regular language?
EDIT: Wow thanks for the quick response everyone, all your solutions worked so I upvoted all of them. I think I'm going to accept Janito's because of the explanation.
Also why the downvote?

You could try:
sed -e ':a;s/~~/~ ~/;ta'
This creates a label "a" with the ":" command, then replaces one occurrance of ~~ with ~ ~, and then uses the "t" test command to jump back to the "a" label if the previous substitute command succeeded.
Hope this helps =)

awk '{for( i=0; i<=NF; i++ ) if( $i ~ /^$/ ) $i = " " } 1' FS='~' OFS='~' input
or:
awk '/^$/{ $0 = " " } 1' ORS='~' RS='~' input
or:
awk '{ while( gsub( "~~", "~ ~" )); }1' input

sed -e ':loop' -e 's/~~/~ ~/g' -e 't loop' file

You can use Perl
perl -pe 's/~(?=~)/~ /g'
...which says replace each "~" followed by "~" with "~ "
To store result(s) to file.SPACE use
perl -pe 's/~(?=~)/~ /g' file >file.SPACE

Related

How to use 'sed' to add dynamic prefix to each number in integer list?

How can I use sed to add a dynamic prefix to each number in an integer list?
For example:
I have a string "A-1,2,3,4,5", I want to transform it to string "A-1,A-2,A-3,A-4,A-5" - which means I want to add prefix of first integer i.e. "A-" to each number of the list.
If I have string like "B-1,20,300" then I want to transform it to string "B-1,B-20,B-300".
I am not able to use RegEx Capturing Groups because for global match they do not retain their value in subsequent matches.
When it comes to looping constructs in sed, I like to use newlines as markers for the places I have yet to process. This makes matching much simpler, and I know they're not in the input because my input is a text line.
For example:
$ echo A-1,2,3,4,5 | sed 's/,/\n/g;:a s/^\([^0-9]*\)\([^\n]*\)\n/\1\2,\1/; ta'
A-1,A-2,A-3,A-4,A-5
This works as follows:
s/,/\n/g # replace all commas with newlines (insert markers)
:a # label for looping
s/^\([^0-9]*\)\([^\n]*\)\n/\1\2,\1/ # replace the next marker with a comma followed
# by the prefix
ta # loop unless there's nothing more to do.
The approach is similar to #potong's, but I find the regex much more readable -- \([^0-9]*\) captures the prefix, \([^\n]*\) captures everything up to the next marker (i.e. everything that's already been processed), and then it's just a matter of reassembling it in the substitution.
Don't use sed, just use the other standard UNIX text manipulation tool, awk:
$ echo 'A-1,2,3,4,5' | awk '{p=substr($0,1,2); gsub(/,/,"&"p)}1'
A-1,A-2,A-3,A-4,A-5
$ echo 'B-1,20,300' | awk '{p=substr($0,1,2); gsub(/,/,"&"p)}1'
B-1,B-20,B-300
This might work for you (GNU sed):
sed -E ':a;s/^((([^-]+-)[^,]+,)+)([0-9])/\1\3\4/;ta' file
Uses pattern matching and a loop to replace a number following a comma by the first column prefix and that number.
Assuming this is for shell scripting, you can do so with 2 seds:
set string = "A1,2,3,4,5"
set prefix = `echo $string | sed 's/^\([A-Z]\).*/\1/'`
echo $string | sed 's/,\([0-9]\)/,'$prefix'-\1/g'
Output is
A1,A-2,A-3,A-4,A-5
With
set string = "B-1,20,300"
Output is
B-1,B-20,B-300
Could you please try following(if ok with awk).
awk '
BEGIN{
FS=OFS=","
}
{
for(i=1;i<=NF;i++){
if($i !~ /^A/&&$i !~ /\"A/){
$i="A-"$i
}
}
}
1' Input_file
if your data in 'd' file, tried on gnu sed:
sed -E 'h;s/^(\w-).+/\1/;x;G;:s s/,([0-9]+)(.*\n(.+))/,\3\1\2/;ts; s/\n.+//' d

Linux Replace With Variable Containing Double Quotes

I have read the following:
How Do I Use Variables In A Sed Command
How can I use variables when doing a sed?
Sed replace variable in double quotes
I have learned that I can use sed "s/STRING/$var1/g" to replace a string with the contents of a variable. However, I'm having a hard time finding out how to replace with a variable that contains double quotes, brackets and exclamation marks.
Then, hoping to escape the quotes, I tried piping my result though sed 's/\"/\\\"/g' which gave me another error sed: -e expression #1, char 7: unknown command: E'. I was hoping to escape the problematic characters and then do the variable replacement: sed "s/STRING/$var1/g". But I couldn't get that far either.
I figured you guys might know a better way to replace a string with a variable that contains quotes.
File1.txt:
Example test here
<tag>"Hello! [world]" This line sucks!</tag>
End example file
Variable:
var1=$(cat file1.txt)
Example:
echo "STRING" | sed "s/STRING/$var1/g"
Desired output:
Example test here
<tag>"Hello! [world]" This line sucks!</tag>
End example file
using awk
$ echo "STRING" | awk -v var="$var1" '{ gsub(/STRING/,var,$0); print $0}'
Example test here
<tag>"Hello! [world]" This line sucks!</tag>
End example file
-v var="$var1": To use shell variable in awk
gsub(/STRING/,var,$0) : To globally substitute all occurances of "STRING" in whole record $0 with var
Special case : "If your var has & in it " say at the beginning of the line then it will create problems with gsub as & has a special meaning and refers to the matched text instead.
To deal with this situation we've to escape & as follows :
$ echo "STRING" | awk -v var="$var1" '{ gsub(/&/,"\\\\&",var); gsub(/STRING/,var,$0);print $0}'
&Example test here
<tag>"Hello! [world]" This line sucks!</tag>
End example file
The problem isn't the quotes. You're missing the "s" command, leading sed to treat /STRING/ as a line address, and the value of $var1 as a command to execute on matching lines. Also, $var1 has unescaped newlines and a / character that'll cause trouble in the substitution. So add the "s", and escape the relevant characters in $var1:
var1escaped="$(echo "$var1" | sed 's#[\/&]#\\&#; $ !s/$/\\/')"
echo "STRING" | sed "s/STRING/$var1escaped/"
...but realistically, #batMan's answer (using awk) is probably a better solution.
Here is one awk command that gets text-to-be-replaces from a file that may consist of all kind of special characters such as & or \ etc:
awk -v pat="STRING" 'ARGV[1] == FILENAME {
# read replacement text from first file in arguments
a = (a == "" ? "" : a RS) $0
next
}
{
# now run a loop using index function and use substr to get the replacements
s = ""
while( p = index($0, pat) ) {
s = s substr($0, 1, p-1) a
$0 = substr($0, p+length(pat))
}
$0 = s $0
} 1' File1.txt <(echo "STRING")
To be able to handle all kind of special characters properly this command avoids any regex based functions. We use plain text based functions such as index, substr etc.

Extract Filename before date Bash shellscript

I am trying to extract a part of the filename - everything before the date and suffix. I am not sure the best way to do it in bashscript. Regex?
The names are part of the filename. I am trying to store it in a shellscript variable. The prefixes will not contain strange characters. The suffix will be the same. The files are stored in a directory - I will use loop to extract the portion of the filename for each file.
Expected input files:
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
Expected Extract:
EXAMPLE_FILE
EXAMPLE_FILE_2
Attempt:
filename=$(basename "$file")
folder=sed '^s/_[^_]*$//)' $filename
echo 'Filename:' $filename
echo 'Foldername:' $folder
$ cat file.txt
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
$
$ cat file.txt | sed 's/_[0-9]*-[0-9]*-[0-9]*\.out$//'
EXAMPLE_FILE
EXAMPLE_FILE_2
$
No need for useless use of cat, expensive forks and pipes. The shell can cut strings just fine:
$ file=EXAMPLE_FILE_2_2017-10-12.out
$ echo ${file%%_????-??-??.out}
EXAMPLE_FILE_2
Read all about how to use the %%, %, ## and # operators in your friendly shell manual.
Bash itself has regex capability so you do not need to run a utility. Example:
for fn in *.out; do
[[ $fn =~ ^(.*)_[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} ]]
cap="${BASH_REMATCH[1]}"
printf "%s => %s\n" "$fn" "$cap"
done
With the example files, output is:
EXAMPLE_FILE_2017-09-12.out => EXAMPLE_FILE
EXAMPLE_FILE_2_2017-10-12.out => EXAMPLE_FILE_2
Using Bash itself will be faster, more efficient than spawning sed, awk, etc for each file name.
Of course in use, you would want to test for a successful match:
for fn in *.out; do
if [[ $fn =~ ^(.*)_[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} ]]; then
cap="${BASH_REMATCH[1]}"
printf "%s => %s\n" "$fn" "$cap"
else
echo "$fn no match"
fi
done
As a side note, you can use Bash parameter expansion rather than a regex if you only need to trim the string after the last _ in the file name:
for fn in *.out; do
cap="${fn%_*}"
printf "%s => %s\n" "$fn" "$cap"
done
And then test $cap against $fn. If they are equal, the parameter expansion did not trim the file name after _ because it was not present.
The regex allows a test that a date-like string \d\d\d\d-\d\d-\d\d is after the _. Up to you which you need.
Code
See this code in use here
^\w+(?=_)
Results
Input
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
Output
EXAMPLE_FILE
EXAMPLE_FILE_2
Explanation
^ Assert position at start of line
\w+ Match any word character (a-zA-Z0-9_) between 1 and unlimited times
(?=_) Positive lookahead ensuring what follows is an underscore _ character
Simply with sed:
sed 's/_[^_]*$//' file
The output:
EXAMPLE_FILE
EXAMPLE_FILE_2
----------
In case of iterating through the list of files with extension .out - bash solution:
for f in *.out; do echo "${f%_*}"; done
awk -F_ 'NF-=1' OFS=_ file
EXAMPLE_FILE
EXAMPLE_FILE_2
Could you please try awk solution too, which will take care of all the .out files, note this has ben written and tested in GNU awk.
awk --re-interval 'FNR==1{if(val){close(val)};split(FILENAME, array,"_[0-9]{4}-[0-9]{2}-[0-9]{2}");print array[1];val=FILENAME;nextfile}' *.out
Also my awk version is old so I am using --re-interval, if you have latest version of awk you may need not to use it then.
Explanation and Non-one liner fom of solution: Adding a non-one liner form of solution too here with explanation.
awk --re-interval '##Using --re-interval for supporting ERE in my OLD awk version, if OP has new version of awk it could be removed.
FNR==1{ ##Checking here condition that when very first line of any Input_file is being read then do following actions.
if(val){ ##Checking here if variable named val value is NOT NULL then do following.
close(val) ##close the Input_file named which is stored in variable val, so that we will NOT face problem of TOO MANY FILES OPENED, so it will be like one file read close it in background then.
};
split(FILENAME, array,"_[0-9]{4}-[0-9]{2}-[0-9]{2}");##Splitting FILENAME(which will have Input_file name in it) into array named array only, whose separator is a 4 digits-2 digits- then 2 digits, actually this will take care of YYYY-MM-DD format in Input_file(s) and it will be easier for us to get the file name part.
print array[1]; ##Printing array 1st element here.
val=FILENAME; ##Storing FILENAME variable value which will have current Input_file name in it to variable named val, so that we could close it in background.
nextfile ##nextfile as it name suggests it will skip all the lines in current line and jump onto the next file to save some cpu cycles of our system.
}
' *.out ##Mentioning all *.out Input_file(s) here.

Find and append to Text Between Two Strings or Words using sed or awk

I am looking for a sed in which I can recognize all of the text in between two indicators and then replace it with a place holder.
For instance, the 1st indicator is a list of words
(no|noone|haven't)
and the 2nd indicator is a list of punctuation
Code:
(.|,|!)
From an input text such as
"Noone understands the plot. There is no storyline. I haven't
recommended this movie to my friends! Did you understand it?"
The desired result would be.
"Noone understands_AFFIX me_AFFIX. There is no storyline_AFFIX. I
haven't recommended_AFFIX this_AFFIX movie_AFFIX to_AFFIX my_AFFIX
friends_AFFIX! Did you understand it?"
I know that there is the following sed:
sed -n '/WORD1/,/WORD2/p' /path/to/file
which recognizes the content between two indicators. I have also found a lot of great information and resources here. However, I still cannot find a way to append the affix to each token of text that occurs between the two indicators.
I have also considered to use awk, such as
awk '{sub(/.*indic1 /,"");sub(/ indic2.*/,"");print;}' < infile
yet still, it does not allow me to append the affix.
Does anyone have a suggestion to do so, either with awk or sed?
Little more compact awk
$ awk 'BEGIN{RS=ORS=" ";s="_AFFIX"}
/[.,!]$/{f=0; $0=gensub(/(.)$/,"s\\1","g")}
f{$0=$0s}
/Noone|no|haven'\''t/{f=1}1' story
Noone understands_AFFIX the_AFFIX plot_AFFIX. There is no storyline_AFFIX. I haven't recommended_AFFIX this_AFFIX movie_AFFIX to_AFFIX my_AFFIX friends_AFFIX! Did you understand it?
Perl to the rescue!
perl -pe 's/(?:no(?:one)?|haven'\''t)\s*\K([^.,!]+)/
join " ", map "${_}_AFFIX", split " ", $1/egi
' infile > outfile
\K matches what's on its left, but excludes it from the replacement. In this case, it verifies the 1st indicator. (\K needs Perl 5.10+.)
/e evaluates the replacement part as code. In this case, the code splits $1 on whitespace, map adds _AFFIX to each of the members, and join joins them back into a string.
Here is one verbose awk command for the same:
s="Noone understands the plot. There is no storyline. I haven't recommended this movie to my friends! Did you understand it?"
awk -v IGNORECASE=1 -v kw="no|noone|haven't" -v pct='\\.|,|!' '{
a=0
for (i=2; i<=NF; i++) {
if ($(i-1) ~ "\\y" kw "\\y")
a=1
if (a && $i ~ pct "$") {
p = substr($i, length($i), 1)
$i = substr($i, 1, length($i)-1)
}
if (a)
$i=$i "_AFFIX" p
if(p) {
p=""
a=0
}
}
} 1'
Output:
Noone understands_AFFIX the_AFFIX plot_AFFIX. There is no storyline_AFFIX. I haven't recommended_AFFIX this_AFFIX movie_AFFIX to_AFFIX my_AFFIX friends_AFFIX! Did you understand it?

sed/awk replace in all matches

I want to invert all the color values in a bunch of files. The colors are all in the hex format #ff3300 so the inversion could be done characterwise with the sed command
y/0123456789abcdef/fedcba9876543210/
How can I loop through all the color matches and do the char translation in sed or awk?
EDIT:
sample input:
random text... #ffffff_random_text_#000000__
asdf#00ff00
asdfghj
desired output:
random text... #000000_random_text_#ffffff__
asdf#ff00ff
asdfghj
EDIT: I changed my response as per your edit.
OK, sed may result in a difficult processing. awk could do the trick more or less easily, but I find perl much more easy for this task:
$ perl -pe 's/#[0-9a-f]+/$&=~tr%0123456789abcdef%fedcba9876543210%r/ge' <infile >outfile
Basically you find the pattern, then execute the right-hand side, which executes the tr on the match, and substitutes the value there.
The inversion is really a subtraction. To invert a hex, you just subtract it from ffffff.
With this in mind, you can build a simple script to process each line, extract hexes, invert them, and inject them back to the line.
This is using Bash (see arrays, printf -v, += etc) only (no external tools there):
#!/usr/bin/env bash
[[ -f $1 ]] || { printf "error: cannot find file: %s\n" "$1" >&2; exit 1; }
while read -r; do
# split line with '#' as separator
IFS='#' toks=( $REPLY )
for tok in "${toks[#]}"; do
# extract hex
read -n6 hex <<< "$tok"
# is it really a hex ?
if [[ $hex =~ [0-9a-fA-F]{6} ]]; then
# compute inversion
inv="$((16#ffffff - 16#$hex))"
# zero pad the result
printf -v inv "%06x" "$inv"
# replace hex with inv
tok="${tok/$hex/$inv}"
fi
# build the modified line
line+="#$tok"
done
# print the modified line and clean it for reuse
printf "%s\n" "${line#\#}"
unset line
done < "$1"
use it like:
$ ./invhex infile > outfile
test case input:
random text... #ffffff_random_text_#000000__
asdf#00ff00
bdf#cvb_foo
asdfghj
#bdfg
processed output:
random text... #000000_random_text_#ffffff__
asdf#ff00ff
bdf#cvb_foo
asdfghj
#bdfg
This might work for you (GNU sed):
sed '/#[a-f0-9]\{6\}\>/!b
s//\n&/g
h
s/[^\n]*\(\n.\{7\}\)[^\n]*/\1/g
y/0123456789abcdef/fedcba9876543210/
H
g
:a;s/\n.\{7\}\(.*\n\)\n\(.\{7\}\)/\2\1/;ta
s/\n//' file
Explanation:
/#[a-f0-9]\{6\}\>/!b bail out on lines not containing the required pattern
s//\n&/g prepend every pattern with a newline
h copy this to the hold space
s/[^\n]*\(\n.\{7\}\)[^\n]*/\1/g delete everything but the required pattern(s)
y/0123456789abcdef/fedcba9876543210/ transform the pattern(s)
H append the new pattern(s) to the hold space
g overwrite the pattern space with the contents of the hold space
:a;s/\n.\{7\}\(.*\n\)\n\(.\{7\}\)/\2\1/;ta replace the old pattern(s) with the new.
s/\n// remove the newline artifact from the H command.
This works...
cat test.txt |sed -e 's/\#\([0123456789abcdef]\{6\}\)/\n\#\1\n/g' |sed -e ' /^#.*/ y/0123456789abcdef/fedcba9876543210/' | awk '{lastType=type;type= substr($0,1,1)=="#";} type==lastType && length(line)>0 {print line;line=$0} type!=lastType {line=line$0} length(line)==0 {line=$0} END {print line}'
The first sed command inserts line breaks around the hex codes, then it is possible to make the substitution on all lines starting with a hash. There are probably an elegant solution to merge the lines back again, but the awk command does the job. The only assumption there is that there won't be two hex-codes following directly after each other. If so, this step has to be revised.