Unix Shell Scripting removing - from file - regex

I have a file containing the following info:
job, cost_code, laborcost,materialcost
202-12-21, 23-94-23, **110.00-**, 120.04
204-12-21, 23-93-23, 520.00, **120.04-**
204-12-12, 24-93-23, 155.00, **120.04-**
There are a few problem records, specifically the ones that have a - sign at the end of an amount (marked with **, which isn't really there in the file).
I am trying to remove the - sign with a regex, but I am having trouble matching only the amounts that have the problem.

Try this; this command removes a dash at the end of a line:
sed 's/-$//'
echo "204-12-21, 23-93-23, 520.00, 120.04-" | sed 's/-$//'

You can use sed:
sed -i~ 's/\([0-9]\)-,/\1,/g;s/-$//' input.csv
The first expression finds a digit followed by -, and replaces it by the digit and comma; the second expression removes dashes at ends of lines.
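As a minimal sketch, the two expressions can also be folded into one extended-regex substitution and tried on the sample rows without touching the file. The function name fix_trailing_minus is made up for this illustration, and -E assumes a sed with extended regular expression support (current GNU and BSD sed both have it):

```shell
# Hypothetical helper: drop a "-" that directly follows a digit and is
# itself followed by a comma or by the end of the line.
fix_trailing_minus() {
  sed -E 's/([0-9])-(,|$)/\1\2/g'
}

# Sample rows from the question, without the ** markers.
printf '%s\n' \
  '202-12-21, 23-94-23, 110.00-, 120.04' \
  '204-12-21, 23-93-23, 520.00, 120.04-' |
  fix_trailing_minus
```

The dashes inside the job and cost_code fields are left alone because they are followed by a digit rather than by a comma or the end of the line.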

This might work for you (GNU sed):
sed -r 's/-(,|$)/\1/g' file
Remove a hyphen before a comma or a hyphen from the last field of a line.

How to cut a string till first numerical value appears using regex

I am trying to write a script that extracts the words from a string until the first number appears.
For example, I have a file named typed-list-4.1.3.Final.jar and I want the output to be typed-list.jar
All the files have different names, but they all end with a version number and a .jar extension, so I was trying to use sed to cut the part from where the first number appears and then append .jar.
My files look like:
log4j-slf4j-impl-2.8.2.jar, hibernate-core-5.0.12.Final.jar etc.
I tried a sed command like this, but it's not working:
sed -i 's/-[0-9]*$//g' test1.sh
where test1.sh contains the string "typed-list-4.1.3.Final.jar".
How about:
sed 's/-\([0-9]\+\.\)\+[0-9]\+.*\.jar/.jar/' Input_file
Results for the provided inputs:
typed-list.jar
log4j-slf4j-impl.jar
hibernate-core.jar
The regex matches a substring that:
starts with a dash -
continues with one or more repetitions of digit(s) and a dot, followed by digit(s)
may contain some other substring in between (such as .Final)
ends with the extension .jar
Then the sed command replaces the matched substring with just the extension.
Hope this helps.
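For anyone who wants to try the pattern on the sample names without creating files, here is a small sketch. It is a variation on the command above, not the same command: it uses extended regex syntax (-E) instead of the GNU \+ escapes and adds a $ anchor, and the function name strip_version is invented for the example.

```shell
# Hypothetical helper: remove "-<version>" and anything after it,
# keeping only the .jar extension.
strip_version() {
  sed -E 's/-([0-9]+\.)+[0-9]+.*\.jar$/.jar/'
}

printf '%s\n' \
  'typed-list-4.1.3.Final.jar' \
  'log4j-slf4j-impl-2.8.2.jar' \
  'hibernate-core-5.0.12.Final.jar' |
  strip_version
```

The dashes in log4j-slf4j-impl survive because the pattern only matches a dash that is immediately followed by digits and a dot.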
Sed:
sed -E 's/(.*)-([[:digit:]]+\.){2}[[:digit:]]+.*(\.[^.]+)$/\1\3/' dat
log4j-slf4j-impl.jar
hibernate-core.jar
typed-list.jar
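Awk (the pattern -4.{10} is hard-coded to this one file name, so it will not generalize to the others):
echo typed-list-4.1.3.Final.jar | awk 'sub(/-4.{10}/,"",$0)'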
typed-list.jar

Use sed to replace patterns that are not at the start or end of lines

Let's say I have input:
/a/b/c/d/e/
/a/b/c/d/e
a/b/c/d/e/
a/b/c/d/e
I'd like to replace all / that are not at the edges with + so the output is:
/a+b+c+d+e/
/a+b+c+d+e
a+b+c+d+e/
a+b+c+d+e
I've tried this command:
sed -e "s#\(.\)/\(.\)#\1+\2#g"
which is close but not quite:
/a+b/c+d/e/
/a+b/c+d/e
a+b/c+d/e/
a+b/c+d/e
presumably because the \(.\) overlap between successive / characters.
I don't believe sed has a null match operator for beginning or end of line. So, how is this done?
You can translate all slashes to + and then replace + (at the beginning or at the end) with a slash:
sed 'y/\//+/;s/^+\|+$/\//g;'
or if the OR operator isn't available:
sed 'y/\//+/;s/^+/\//;s/+$/\//;'
It is better to change the delimiter so that you don't have to escape the literal slashes:
sed 'y~/~+~;s~^+\|+$~/~g;'
or if the OR operator isn't available:
sed 'y~/~+~;s~^+~/~;s~+$~/~;'
(where ^ is an anchor for the start of the line and $ for the end)
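A quick way to check the translate-then-restore approach against all four sample lines is sketched below. The function name inner_slashes_to_plus is made up, the bracketed [+] is only there to make it obvious that a literal plus sign is meant, and the sketch assumes the input never has a real + at the start or end of a line:

```shell
# Hypothetical helper: turn every / into +, then turn a + at either
# edge of the line back into /.
# Assumption: the input has no literal + at the start or end of a line.
inner_slashes_to_plus() {
  sed -e 'y~/~+~' -e 's~^[+]~/~' -e 's~[+]$~/~'
}

printf '%s\n' '/a/b/c/d/e/' '/a/b/c/d/e' 'a/b/c/d/e/' 'a/b/c/d/e' |
  inner_slashes_to_plus
```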
Another way: you can protect the slashes you want to preserve using a placeholder:
sed 's~^/~{`%{~;s~/$~{`%{~;y~/~+~;s~{`%{~/~g;'
If you have perl you can use lookarounds for this:
perl -pe 's~(?<!^)/(?!$)~+~g' file
Output:
/a+b+c+d+e/
/a+b+c+d+e
a+b+c+d+e/
a+b+c+d+e
Otherwise you can use this sed with 2 substitutes:
sed -r 's~(.)/(.)~\1+\2~g; s~(.)/(.)~\1+\2~g' file
Or this sed with labeling and looping:
sed -r ':a;s|(.)/(.)|\1+\2|g;ta' file
Here is a sed command that gives your output:
sed -r 's=(.)/\b=\1+=g;' file
Usually / is used as the separator for the s command, but here we use =.
The / is matched where there is a character (.) before it and a word boundary (\b) after it.
Initially I tried (.)/(.) but that did not work:
the second dot was consumed, so the next match could only start after it,
i.e. in x/y/z the second match would only see /z and not y/z.
With \b the first match does not consume the y, so the second match sees y/.
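The consumption problem described here is easy to reproduce. The sketch below runs the (.)/(.) substitution once and then in a loop on the x/y/z example; the function names single_pass and looped are invented for the illustration, and the label is passed with separate -e options so it is not tied to GNU sed's semicolon handling:

```shell
# One pass: the character after each matched / is consumed by the match,
# so the / that follows it cannot be matched in the same pass.
single_pass() {
  sed -E 's#(.)/(.)#\1+\2#g'
}

# Same substitution, repeated until it stops changing the line
# (":a" is a label, "ta" jumps back to it after a successful substitution).
looped() {
  sed -E -e ':a' -e 's#(.)/(.)#\1+\2#g' -e 'ta'
}

echo 'x/y/z' | single_pass   # prints x+y/z
echo 'x/y/z' | looped        # prints x+y+z
```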
This is the common and extremely useful sed idiom for doing jobs like this:
$ sed 's:a:aA:g; s:^/\|/$:aB:g; s:/:+:g; s:aB:/:g; s:aA:a:g' file
/a+b+c+d+e/
/a+b+c+d+e
a+b+c+d+e/
a+b+c+d+e
The 1st sub changes every a to aA. At that point there is no letter a in the input that is not followed by the letter A (we need to do this first to ensure that, after our 2nd sub, the only aBs in the input are the ones created by that 2nd sub).
The 2nd sub changes every / at the start or end of a line to aB. At that point the only aBs in the input are where there were originally /s at the start or end of the line.
The 3rd sub changes all remaining /s (i.e. those that were not at the start or end of the line) to +s.
The 4th sub restores the aBs back to the original front/end /s.
The 5th sub restores the aAs back to the original a characters.
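To see why the first and last steps matter, here is a sketch of the same five steps written without the \| alternation (the start and end anchors get one substitution each), run on a line that already contains the text aB. The function name protect_edges is hypothetical:

```shell
# Hypothetical helper implementing the escape / protect / replace / restore steps:
#   1. a  -> aA   (after this, "aB" cannot occur in the data)
#   2. edge / -> aB, once for the start and once for the end of the line
#   3. remaining / -> +
#   4. aB -> /    (restore the protected edge slashes)
#   5. aA -> a    (undo step 1)
protect_edges() {
  sed -e 's:a:aA:g' \
      -e 's:^/:aB:' -e 's:/$:aB:' \
      -e 's:/:+:g' \
      -e 's:aB:/:g' \
      -e 's:aA:a:g'
}

echo '/a/b/c/d/e/' | protect_edges   # prints /a+b+c+d+e/
echo 'xaB/y/' | protect_edges        # prints xaB+y/
```

The second line shows that a literal aB in the input comes through unchanged, which is the point of escaping every a first.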
This might work for you (GNU sed):
sed ':a;s/\([^\/]\)\/\([^\/]\)/\1+\2/g;ta' file
Or visually easier:
sed -r ':a;s#([^/])/([^/])#\1+\2#g;ta' file
It is really the same regexp twice:
sed 's/\([^\/]\)\/\([^\/]\)/\1+\2/g;s/\([^\/]\)\/\([^\/]\)/\1+\2/g' file

process a delimited text file with sed

I have a ";" delimited file:
aa;;;;aa
rgg;;;;fdg
aff;sfg;;;fasg
sfaf;sdfas;;;
ASFGF;;;;fasg
QFA;DSGS;;DSFAG;fagf
I'd like to process it, replacing each missing value with \N.
The result should be:
aa;\N;\N;\N;aa
rgg;\N;\N;\N;fdg
aff;sfg;\N;\N;fasg
sfaf;sdfas;\N;\N;\N
ASFGF;\N;\N;\N;fasg
QFA;DSGS;\N;DSFAG;fagf
I'm trying to do it with a sed script:
sed "s/;\(;\)/;\\N\1/g" file1.txt >file2.txt
But what I get is
aa;\N;;\N;aa
rgg;\N;;\N;fdg
aff;sfg;\N;;fasg
sfaf;sdfas;\N;;
ASFGF;\N;;\N;fasg
QFA;DSGS;\N;DSFAG;fagf
You don't need to enclose the second semicolon in parentheses just to use it as \1 in the replacement string. You can use ; in the replacement string:
sed 's/;;/;\\N;/g'
As you noticed, when it finds a pair of semicolons it replaces it with the desired string then skips over it, not reading the second semicolon again and this makes it insert \N after every two semicolons.
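The skipping is easy to see on the first sample line. This is only a sketch to reproduce the behavior; the function name one_pass is made up:

```shell
# A single global pass: each match consumes both semicolons,
# so the second one cannot also start the next match.
one_pass() {
  sed 's/;;/;\\N;/g'
}

echo 'aa;;;;aa' | one_pass   # prints aa;\N;;\N;aa
```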
A solution is to use positive lookaheads; the regex is /;(?=;)/ but sed doesn't support them.
But it's possible to solve the problem using sed in a simple manner: duplicate the search command; the first command replaces the odd appearances of ;; with ;\N, the second one takes care of the even appearances. The final result is the one you need.
The command is as simple as:
sed 's/;;/;\\N;/g;s/;;/;\\N;/g'
It duplicates the previous command and uses the ; between g and s to separate them. Alternatively you can use the -e command line option once for each search expression:
sed -e 's/;;/;\\N;/g' -e 's/;;/;\\N;/g'
Update:
The OP asks in a comment "What if my file have 100 columns?"
Let's try and see if it works:
$ echo "0;1;;2;;;3;;;;4;;;;;5;;;;;;6;;;;;;;" | sed 's/;;/;\\N;/g;s/;;/;\\N;/g'
0;1;\N;2;\N;\N;3;\N;\N;\N;4;\N;\N;\N;\N;5;\N;\N;\N;\N;\N;6;\N;\N;\N;\N;\N;\N;
Look, ma! It works!
:-)
Update #2
I ignored the fact that the question doesn't ask to replace ;; with something else but to replace the empty/missing values in a file that uses ; to separate the columns. Accordingly, my expression doesn't fix the missing value when it occurs at the beginning or at the end of the line.
As the OP kindly added in a comment, the complete sed command is:
sed 's/;;/;\\N;/g;s/;;/;\\N;/g;s/^;/\\N;/g;s/;$/;\\N/g'
or (for readability):
sed -e 's/;;/;\\N;/g;' -e 's/;;/;\\N;/g;' -e 's/^;/\\N;/g' -e 's/;$/;\\N/g'
The two additional steps handle a ; found at the beginning or at the end of the line.
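Wrapped in a function, the four steps can be checked against the sample rows from the question. This is just a sketch of the same command; the name fill_missing is invented here, and the g flag is dropped on the two anchored steps since they can match at most once per line:

```shell
# Hypothetical helper: two passes for empty fields in the middle,
# then one step each for an empty first and an empty last field.
fill_missing() {
  sed -e 's/;;/;\\N;/g' -e 's/;;/;\\N;/g' \
      -e 's/^;/\\N;/' -e 's/;$/;\\N/'
}

printf '%s\n' 'aa;;;;aa' 'sfaf;sdfas;;;' 'QFA;DSGS;;DSFAG;fagf' | fill_missing
```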
You can use this sed command with 2 s (substitute) commands:
sed 's/;;/;\\N;/g; s/;;/;\\N;/g;' file
aa;\N;\N;\N;aa
rgg;\N;\N;\N;fdg
aff;sfg;\N;\N;fasg
sfaf;sdfas;\N;\N;
ASFGF;\N;\N;\N;fasg
QFA;DSGS;\N;DSFAG;fagf
Or using lookarounds regex in a perl command:
perl -pe 's/(?<=;)(?=;)/\\N/g' file
aa;\N;\N;\N;aa
rgg;\N;\N;\N;fdg
aff;sfg;\N;\N;fasg
sfaf;sdfas;\N;\N;
ASFGF;\N;\N;\N;fasg
QFA;DSGS;\N;DSFAG;fagf
The main problem is that the same character can't be used by more than one match within a single substitution pass:
s/;;/..../g: the second ; can't be reused for the next match in a string like ;;;
If you want to do it with sed without using a Perl-like regex mode, you can use a loop with the conditional command t:
sed ':a;s/;;/;\\N;/g;ta;' file
:a defines a label "a"; ta jumps to this label only if something has been replaced.
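A small sketch of the loop on a longer run of empty fields follows. The function name fill_loop is hypothetical, and the label is given through separate -e options, which avoids relying on how a particular sed parses a semicolon after a label:

```shell
# Hypothetical helper: repeat the ;; substitution until no ;; is left.
fill_loop() {
  sed -e ':a' -e 's/;;/;\\N;/g' -e 'ta'
}

echo 'a;;;;;b' | fill_loop   # prints a;\N;\N;\N;\N;b
```

Five semicolons in a row mean four empty fields, and the loop fills all four in two passes, with one more pass to confirm nothing is left.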
For the ; at the end of the line (and to deal with possible trailing whitespace):
sed ':a;s/;;/;\\N;/g;ta; s/;[ \t\r]*$/;\\N/1' file
This awk one-liner will give you what you want:
awk -F';' -v OFS=';' '{for(i=1;i<=NF;i++)if($i=="")$i="\\N"}7' file
If you really want the line sfaf;sdfas;\N;\N;\N, this one works for you:
awk -F';' -v OFS=';' '{for(i=1;i<=NF;i++)if($i=="")$i="\\N";sub(/;$/,";\\N")}7' file
sed 's/;/;\\N/g;s/;\\N\([^;]\)/;\1/g;s/;[[:blank:]]*$/;\\N/' YourFile
Non-recursive, one-liner, POSIX compliant.
Concept:
change every ; to ;\N
put back the ; wherever a value follows it
handle the special case of a final ; , possibly followed by blanks, before the end of the line
This might work for you (GNU sed):
sed -r ':a;s/^(;)|(;);|(;)$/\2\3\\N\1\2/g;ta' file
There are 4 scenarios in which an empty field may occur: at the start of a record, between 2 field delimiters, following another empty field, and at the end of a record. Alternation can be employed to cater for scenarios 1, 2 and 4, and scenario 3 can be catered for by a second pass using a loop (:a;...;ta). Multiple scenarios can be replaced in each pass using the g flag.

sed command to delete text until match is found for each line of a csv

I have a csv file and I am trying to delete all characters from the beginning of the line till it finds the first occurrence of "2015". I want to do this for each line in the csv file.
My csv file structure is as follows:
Field1 , Field2 , Field3 , Field4
sometext1 , 2015-07-15 , sometext2, sometext3
sometext1 , 2015-07-14 , sometext2, sometext3
sometext1 , 2015-07-13 , sometext2, sometext3
I cannot use the cut command or sed for the first occurrence of a comma because the text in the Field1 sometimes has commas in them too, which is making it complicated for parsing. I figured if I search for the first occurrence of the text 2015 for each line and replace all the preceding characters with nothing, then that should work.
FYI, I want to do this for the FIRST occurrence of 2015 only. There is another text field with 2015 in it in another column, and I don't want any text before that to be affected.
For example, if my original line is:
sometext1,#015,2015-07-10,sometext2,2015,sometext3
I want it to return:
2015-07-10,sometext2,2015,sometext3
Does anyone know the sed command to do this?
Any help will be appreciated!
Thanks
Here is a way to do it with sed assuming "#####" never occurs in a line:
sed -e 's/2015/#####&/'|sed -e 's/.*#####//'
For example:
> echo sometext1,#015,2015-07-10,sometext2,2015,sometext3\
|sed -e 's/2015/#####&/'|sed -e 's/.*#####//'
2015-07-10,sometext2,2015,sometext3
The first sed command prefixes "#####" to the first occurrence of 2015, and the second sed command removes everything from the beginning of the line through the "#####" prefix.
The basic reason for using this two-stage method is that sed's regular expression matcher has only greedy wildcards, which always pick the longest match; it does not support lazy matching, which picks the shortest match.
If "#####" may occur in a line, a more unlikely string could be substituted for it, such as "7z#dNjm_wG8a3!esu#Rhv=".
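The two stages also fit into a single sed invocation with two -e expressions. A sketch, with the made-up function name cut_before_first_2015 and the same assumption that "#####" never appears in the data:

```shell
# Hypothetical helper: mark the first 2015 with #####, then delete
# everything up to and including the marker.
# Assumption: "#####" never occurs in the input.
cut_before_first_2015() {
  sed -e 's/2015/#####&/' -e 's/.*#####//'
}

echo 'sometext1,#015,2015-07-10,sometext2,2015,sometext3' | cut_before_first_2015
echo 'Field1 , Field2 , Field3 , Field4' | cut_before_first_2015
```

A line without 2015, such as the header, passes through unchanged because neither substitution matches.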
To do this with sed without Perl-style non-greedy operators, you need to mark the first instance with something you know won't be in the line, as Tris describes. However, that solution requires knowledge of what won't be in the file. Fortunately, you can guarantee that a newline won't be in the line because that's what terminated the line. Thus you can do something like:
sed 's/2015/\n&/;s/.*\n//' input.txt > output.txt
NOTE: this won't modify the header row which you would have to treat specially.

regexp find and replace: bash variables inside sed

I would like to remove this sequence when present at the beginning of the line:
ATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTG followed by at least 3 A characters.
Both the sequence and the run of A characters should be removed, and the rest of the file should be preserved.
My input files look like this:
#M00946:3:000000000-A2WF2:1:1101:18115:1962 1:N:0:2
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAACATTTTCTTTCTTACTTCGTTCACTTTCCACTTCTTTCTCCCTATCTTCCCCCTTCTGTCTGCCCCAGCTGTCTATCCCACTTATTGTCTCCCCCCACTGCCCCACACTCCTACCTTCTTCATCTTCACCTAACACCTCCCGCTCCCTCCTTATCGTCTCTTATCCTTTCCTTGTTCC
+
????????DDDDDDDDGGGGGGHHIIIIHHHIIIIFHIIIH/CGFHHIIIIHEDHHIIIIHI=5EEGFEHHEC+5,,4#,#,,....--..+77,,.6..6.....7.4..7.76=..-5.>.4-)134-.5....-3*))0***1*********10*0**01*1*)''..0***.)0'))*****00*11******01***0****0*)**0)'''...*0)0*11********1****1*0********
#M00946:3:000000000-A2WF2:1:1101:19888:2900 1:N:0:2
GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAAAACACAAATACCGTTCCAATATCTTTTTGTTTCATGTCTAATAAC
+
<<??????BB?BBBBBCAFFFCFHF;>EFCDFGFFHFBGHCA=FHA>EFGEE7CF>F?FFHB=?EEGF>>DH5<)++,++,4,,4+=:,,,,5,,,,,,,,),33?,3,3,3,,,,33
I was trying to use a script, replace.sh, which looks like this:
file=$1;
adapter_sequence=$2;
sed -r "s/${adapter_sequence}A{3}//" $file
and I call it from the command line:
./replace.sh file.fastq GATCGGAAGAGCACACGTCTGAACTCCAGTCACTGACCAATCTCGTATGCCGTCTTCTGCTTG
It did not work. Any help in any script language will be appreciated.
I believe you have $1 and $2 reversed. Have it like this:
adapter_sequence=$2
sed "s/$adapter_sequence//" "$1"
In the ideal case I would like to remove all adapter sequences
starting at the beginning of line followed by at least three A
letters,
Try this sed:
sed -r "s/^${adapter_sequence}A{3,}//" file
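To check the anchored pattern without a real FASTQ file, here is a sketch with a short made-up adapter (GATTG) and three toy reads. The function name strip_adapter is hypothetical, -E is used in place of -r, and the sketch assumes the adapter contains only letters, so it is safe to paste into the regex unescaped:

```shell
# Hypothetical helper: $1 is the adapter sequence; reads come from stdin.
# Removes the adapter plus a run of at least three A's, only at line start.
# Assumption: the adapter contains only letters (no regex metacharacters).
strip_adapter() {
  sed -E "s/^${1}A{3,}//"
}

printf '%s\n' 'GATTGAAAAACATT' 'GATTGAACATT' 'CCGATTGAAAAACATT' |
  strip_adapter GATTG
```

Only the first read is trimmed (to CATT): the second has just two A's after the adapter, and in the third the adapter is not at the beginning of the line.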