replacing column field separator using sed - regex

I have a text file 1.txt:
cam:45c62741b9c99e1dcf3c140e8e3df635::dv:johnybold#yahoo.com:83.228.32.24
gamer:3dabd5bd7984b0286eba52d4a7db2dea:$Wm?1Z3MPErXl7%yk^Pc#%iu\9LFc{:octopus#vida.tv:93.182.154.63
:adc0a54f8d21694848200ae043fa99f2:GqJ:LOLPELIC#trash-mail.com:84.176.127.30
! Aa:da99417e29ab0aa67f97db64f091836b:k_P:prus_da#yahoo.com:82.179.236.154
I want to change the column separator (currently it is ':') to '||o||'.
I want to change only the 1st, 3rd and 4th column separator as 2nd column contains something like hash:salt.
The script I am trying is:
sed 's/:/||o||/1;s/:/||o||/2;s/:/||o||/2' 1.txt
The only problem is in the results where ':' is included in the salt.
The output I am getting is:
cam||o||45c62741b9c99e1dcf3c140e8e3df635:||o||dv||o||johnybold#yahoo.com:83.228.32.24
gamer||o||3dabd5bd7984b0286eba52d4a7db2dea:$Wm?1Z3MPErXl7%yk^Pc#%iu\9LFc{||o||octopus#vida.tv||o||93.182.154.63
||o||adc0a54f8d21694848200ae043fa99f2:GqJ||o||LOLPELIC#trash-mail.com||o||84.176.127.30
! Aa||o||da99417e29ab0aa67f97db64f091836b:k_P||o||prus_da#yahoo.com||o||82.179.236.154
The first line of the output is wrong.
Expected output :
cam||o||45c62741b9c99e1dcf3c140e8e3df635::dv||o||johnybold#yahoo.com||o||83.228.32.24
Rest of the output is correct.
What I want is to replace the first ':' counting from the start of the line, and the other two counting from the end of the line, so that any ':' inside the salt is ignored.

Try this:
(?:^[^:]*\K:)|(:(?=[^:]+:?[^:]+$))
Basic idea:
Either get the first : that occurs in the line
Or : that is followed by at most one other :
How to run it with perl:
perl -p -e 's/(?:^[^:]*\K:)|(:(?=[^:]+:?[^:]+$))/||o||/g' input.txt

Short sed solution:
sed -E 's/:+/||o||/3g; s/:/||o||/' file
The output:
cam||o||45c62741b9c99e1dcf3c140e8e3df635::dv||o||johnybold#yahoo.com||o||83.228.32.24
gamer||o||3dabd5bd7984b0286eba52d4a7db2dea:$Wm?1Z3MPErXl7%yk^Pc#%iu\9LFc{||o||octopus#vida.tv||o||93.182.154.63
||o||adc0a54f8d21694848200ae043fa99f2:GqJ||o||LOLPELIC#trash-mail.com||o||84.176.127.30
! Aa||o||da99417e29ab0aa67f97db64f091836b:k_P||o||prus_da#yahoo.com||o||82.179.236.154
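The "first from the front, last two from the back" idea in the question can also be written with two anchored substitutions. This is only a minimal sketch, not taken from the answers above: the function name convert_separators is made up for illustration, and it assumes that every line has at least three colons and that the last two fields (e-mail and IP) never contain a colon.

```shell
#!/bin/sh
# convert_separators: hypothetical helper name, used for illustration only.
# Assumes the last two fields contain no ':' and each line has >= 3 colons.
convert_separators() {
    # 1st command: anchor at the end of the line and rewrite the last two ':'.
    # 2nd command: rewrite the first remaining ':' (the one after field 1).
    sed -E 's/:([^:]*):([^:]*)$/||o||\1||o||\2/; s/:/||o||/'
}

printf '%s\n' 'cam:45c62741b9c99e1dcf3c140e8e3df635::dv:johnybold#yahoo.com:83.228.32.24' |
    convert_separators
```

Because the first substitution is anchored with $, any extra colons inside the hash:salt column are never counted, so no occurrence numbers are needed.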

Related

regex in sed removing only the first occurrence from every line

I have the following file I would like to clean up
cat file.txt
MNS:N+ GYPA*01 or GYPA*M
MNS:M+ GYPA*02 or GYPA*N
MNS:Mc GYPA*08 or GYP*Mc
MNS:Vw GYPA*09 or GYPA*Vw
MNS:Mg GYPA*11 or GYPA*Mg
MNS:Vr GYPA*12 or GYPA*Vr
My desired output is:
MNS:N+ GYPA*01 or GYPA*M
MNS:M+ GYPA*02 or GYPA*N
MNS:Mc GYPA*08 or GYP*Mc
MNS:Vw GYPA*09 or GYPA*Vw
MNS:Mg GYPA*11 or GYPA*Mg
MNS:Vr GYPA*12 or GYPA*Vr
I would like to remove everything between ":" and the first occurrence of "or"
I tried sed 's/MNS:d*?or /MNS:/g' though it removes the second "or" as well.
I tried every option in https://www.geeksforgeeks.org/sed-command-in-linux-unix-with-examples/
to no avail. Should I create alias sed='perl -pe'? It seems that sed does not properly support regex.
perl is more suitable here because the task needs a lazy (non-greedy) match, which sed does not support.
perl -pe 's|(:.*?or +)(.*)|:$2|' Input_file
The lazy quantifier in .*?or stops at the nearest or after the :, rather than at the last one on the line.
This might work for you (GNU sed):
sed '/:.*\<or\>/{s/\<or\>/\n/;s/:.*\n//}' file
If a line contains : followed by the word or, then substitute the first occurrence of the word or with a unique delimiter (e.g. \n) and then remove everything between : and the unique delimiter.
Wrt "I would like to remove everything between ':' and the first occurrence of 'or'": no, you wouldn't. The first occurrence of or in the 2nd line of sample input is at the start of orweqqwe. The text immediately after : looks like it could be any set of characters, so couldn't it contain a standalone or, e.g. MNS:2 or eqqwe or M+ GYPA*02 or GYPA*N?
Given that and the fact it's apparently a fixed number of characters to be removed on every line, it seems like this is what you should really be using:
$ sed 's/:.\{14\}/:/' file
MNS:N+ GYPA*01 or GYPA*M
MNS:M+ GYPA*02 or GYPA*N
MNS:Mc GYPA*08 or GYP*Mc
MNS:Vw GYPA*09 or GYPA*Vw
MNS:Mg GYPA*11 or GYPA*Mg
MNS:Vr GYPA*12 or GYPA*Vr
If it is certain that or always occurs exactly twice per line, as in the provided example, please try:
sed 's/\(MNS:\).\+ or \(.\+ or .*\)/\1\2/' file.txt
Result:
MNS:N+ GYPA*01 or GYPA*M
MNS:M+ GYPA*02 or GYPA*N
MNS:Mc GYPA*08 or GYP*Mc
MNS:Vw GYPA*09 or GYPA*Vw
MNS:Mg GYPA*11 or GYPA*Mg
MNS:Vr GYPA*12 or GYPA*Vr
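Otherwise perl, which supports shortest (lazy) matching, is the better solution, as in RavinderSingh13's answer.
ex (as implemented by Vim) supports lazy matching with \{-}:
ex -s '+%s/:\zs.\{-}or //g|wq' input_file
The pattern :\zs.\{-}or matches the shortest run of characters after the first : up to and including the first or and the space that follows it; \zs starts the match after the colon, so the colon itself is kept.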

How to cut a string till first numerical value appears using regex

I am trying to write a script that extracts the part of a string up to the first number.
For example, I have a file named typed-list-4.1.3.Final.jar and I want the output to be typed-list.jar.
All the files have different names, but they all end with a version number and a .jar extension, so I was trying to use sed to strip everything from the first number onward and then append .jar.
My files look like :-
log4j-slf4j-impl-2.8.2.jar, hibernate-core-5.0.12.Final.jar etc
I tried to use sed command like this but it's not working :-
sed -i 's/-[0-9]*$//g' test1.sh --- where test1.sh contains this string "typed-list-4.1.3.Final.jar"
How about:
sed 's/-\([0-9]\+\.\)\+[0-9]\+.*\.jar/.jar/' Input_file
Results for the provided inputs:
typed-list.jar
log4j-slf4j-impl.jar
hibernate-core.jar
The regex matches a substring that:
starts with a dash -
continues with a repetition of digit(s) dot digit(s) ...
may contain some other substring in between (such as Final)
ends with the extension .jar
Then the sed command replaces the matched substring with just the extension.
Hope this helps.
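The same idea can be tried on the three file names from the question. This is a sketch in extended-regex syntax (sed -E) rather than the exact command above; strip_version is a hypothetical helper name, and it assumes the version always has at least two dot-separated numeric parts directly after a dash.

```shell
#!/bin/sh
# strip_version: hypothetical helper; reads file names on stdin, one per line.
strip_version() {
    # -[0-9]+(\.[0-9]+)+  a dash followed by a dotted version such as -4.1.3
    # .*\.jar$            anything after it (e.g. .Final) up to the final .jar
    sed -E 's/-[0-9]+(\.[0-9]+)+.*\.jar$/.jar/'
}

printf '%s\n' typed-list-4.1.3.Final.jar log4j-slf4j-impl-2.8.2.jar \
    hibernate-core-5.0.12.Final.jar | strip_version
```

The digit in log4j is left alone because the pattern only starts at a dash that is immediately followed by a digit.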
Sed:
sed -E 's/(.*)-([[:digit:]]+\.){2}[[:digit:]]+.*(\.[^.]+)$/\1\3/' dat
log4j-slf4j-impl.jar
hibernate-core.jar
typed-list.jar
echo typed-list-4.1.3.Final.jar | awk 'sub(/-4.{10}/,"",$0)'
typed-list.jar

Remove ":" from field two of CSV only and ignore other fields

I've been trying to clean up the data in a csv file which contain data similar to this:
8979880, Number One : Exclusive Mix, 387387, http://www.smashhits.com
4844404, Top 40 : 1988, 3893938, http://www.best80s.com
48094940, Highlander:The Return, 489494, http://www.instantaccess.com
My goal is to replace the colon in field 2 with a space. Initially I used sed to replace the : with a space like so:
sed -i "s/:/ /g" file.csv
This works in removing the colon but unfortunately this also removes the colon in the url which is not what I want. How can I specify that I only want the command to affect the data in field 2?
Using awk you can do
awk '/:/{sub(/:/, " ")} 1' file.csv
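/:/ selects the lines that contain a :
{sub(/:/, " ")} replaces the first : on such a line with a space
1 simply prints the line.
Note that this relies on field 2 being the first place a colon can appear; on a line whose second field has no colon, the colon in the URL would be replaced instead.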
You can use GNU sed like this:
sed -r 's/^([^,]*,[^,]*):/\1 /g' file.csv
Explanation
^ anchors the expression at the start of each line
now [^,]*, matches the first field including the separator
and then [^,]*: matches from the second field to the :
the parentheses in ^(...): capture everything up to, but not including, the : in the second field into \1
finally, the replacement \1 followed by a space (there is a space after the \1) replaces the : with a space on lines where the regex matched
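If the goal is strictly "field 2 only", awk can also address the field directly instead of relying on where the first colon happens to be. The following is a sketch under stated assumptions: fix_field2 is a hypothetical helper name, and the input is assumed to be plain comma-separated text with no quoted fields that contain commas.

```shell
#!/bin/sh
# fix_field2: hypothetical helper. Assumes plain comma-separated input
# with no quoted fields that contain commas.
fix_field2() {
    # Split and rejoin on ",", then replace every ":" in field 2 only.
    awk 'BEGIN { FS = OFS = "," } { gsub(/:/, " ", $2) } 1'
}

printf '%s\n' '48094940, Highlander:The Return, 489494, http://www.instantaccess.com' |
    fix_field2
```

The colon in the URL is untouched because it lives in field 4, which gsub never sees.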

process a delimited text file with sed

I have a ";" delimited file:
aa;;;;aa
rgg;;;;fdg
aff;sfg;;;fasg
sfaf;sdfas;;;
ASFGF;;;;fasg
QFA;DSGS;;DSFAG;fagf
I'd like to process it replacing the missing value with a \N .
The result should be:
aa;\N;\N;\N;aa
rgg;\N;\N;\N;fdg
aff;sfg;\N;\N;fasg
sfaf;sdfas;\N;\N;\N
ASFGF;\N;\N;\N;fasg
QFA;DSGS;\N;DSFAG;fagf
I'm trying to do it with a sed script:
sed "s/;\(;\)/;\\N\1/g" file1.txt >file2.txt
But what I get is
aa;\N;;\N;aa
rgg;\N;;\N;fdg
aff;sfg;\N;;fasg
sfaf;sdfas;\N;;
ASFGF;\N;;\N;fasg
QFA;DSGS;\N;DSFAG;fagf
You don't need to enclose the second semicolon in parentheses just to use it as \1 in the replacement string. You can use ; in the replacement string:
sed 's/;;/;\\N;/g'
As you noticed, when it finds a pair of semicolons it replaces it with the desired string then skips over it, not reading the second semicolon again and this makes it insert \N after every two semicolons.
A solution is to use positive lookaheads; the regex is /;(?=;)/ but sed doesn't support them.
But it's possible to solve the problem using sed in a simple manner: duplicate the search command; the first command replaces the odd appearances of ;; with ;\N, the second one takes care of the even appearances. The final result is the one you need.
The command is as simple as:
sed 's/;;/;\\N;/g;s/;;/;\\N;/g'
It duplicates the previous command and uses the ; between g and s to separate the two. Alternatively, you can use the -e command line option once for each search expression:
sed -e 's/;;/;\\N;/g' -e 's/;;/;\\N;/g'
Update:
The OP asks in a comment "What if my file have 100 columns?"
Let's try and see if it works:
$ echo "0;1;;2;;;3;;;;4;;;;;5;;;;;;6;;;;;;;" | sed 's/;;/;\\N;/g;s/;;/;\\N;/g'
0;1;\N;2;\N;\N;3;\N;\N;\N;4;\N;\N;\N;\N;5;\N;\N;\N;\N;\N;6;\N;\N;\N;\N;\N;\N;
Look, ma! It works!
:-)
Update #2
I ignored the fact that the question doesn't ask to replace ;; with something else but to replace the empty/missing values in a file that uses ; to separate the columns. Accordingly, my expression doesn't fix the missing value when it occurs at the beginning or at the end of the line.
As the OP kindly added in a comment, the complete sed command is:
sed 's/;;/;\\N;/g;s/;;/;\\N;/g;s/^;/\\N;/g;s/;$/;\\N/g'
or (for readability):
sed -e 's/;;/;\\N;/g;' -e 's/;;/;\\N;/g;' -e 's/^;/\\N;/g' -e 's/;$/;\\N/g'
The two additional steps replace a ';' found at the beginning or at the end of the line.
You can use this sed command with 2 s (substitute) commands:
sed 's/;;/;\\N;/g; s/;;/;\\N;/g;' file
aa;\N;\N;\N;aa
rgg;\N;\N;\N;fdg
aff;sfg;\N;\N;fasg
sfaf;sdfas;\N;\N;
ASFGF;\N;\N;\N;fasg
QFA;DSGS;\N;DSFAG;fagf
Or using lookarounds regex in a perl command:
perl -pe 's/(?<=;)(?=;)/\\N/g' file
aa;\N;\N;\N;aa
rgg;\N;\N;\N;fdg
aff;sfg;\N;\N;fasg
sfaf;sdfas;\N;\N;
ASFGF;\N;\N;\N;fasg
QFA;DSGS;\N;DSFAG;fagf
The main problem is that the same character can't take part in more than one match within a single substitution pass:
s/;;/..../g: The second ; can't be reused for the next match in a string like ;;;
If you want to do it with sed without using a Perl-like regex mode, you can use a loop with the conditional command t:
sed ':a;s/;;/;\\N;/g;ta;' file
:a defines a label "a"; ta jumps to this label only if something has been replaced.
For the ; at the end of the line (and to deal with possible trailing whitespace):
sed ':a;s/;;/;\\N;/g;ta; s/;[ \t\r]*$/;\\N/1' file
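The loop can be combined with the start-of-line and end-of-line cases in one command. This is a sketch, not a drop-in replacement for the commands above: fill_empty is a hypothetical helper name, and each part is passed with its own -e because some non-GNU sed implementations treat everything after :a on the same line as part of the label.

```shell
#!/bin/sh
# fill_empty: hypothetical helper that combines the loop with the
# start-of-line and end-of-line cases.
fill_empty() {
    # :a / ta   repeat the ;; substitution until nothing changes
    # ^; and ;$ handle an empty first or last field
    sed -e ':a' -e 's/;;/;\\N;/g' -e 'ta' \
        -e 's/^;/\\N;/' -e 's/;$/;\\N/'
}

printf '%s\n' 'aa;;;;aa' 'sfaf;sdfas;;;' ';x;' | fill_empty
```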
This awk one-liner will give you what you want:
awk -F';' -v OFS=';' '{for(i=1;i<=NF;i++)if($i=="")$i="\\N"}7' file
If you really want the line sfaf;sdfas;\N;\N;\N, this line works for you:
awk -F';' -v OFS=';' '{for(i=1;i<=NF;i++)if($i=="")$i="\\N";sub(/;$/,";\\N")}7' file
sed 's/;/;\\N/g;s/;\\N\([^;]\)/;\1/g;s/;[[:blank:]]*$/;\\N/' YourFile
Non-recursive, one-liner, POSIX compliant.
Concept:
add \N after every ;
remove it again wherever the ; is followed by a non-empty field
handle the special case of a final ;, possibly followed by blanks, before the end of the line
This might work for you (GNU sed):
sed -r ':;s/^(;)|(;);|(;)$/\2\3\\N\1\2/g;t' file
There are 4 scenarios in which an empty field may occur: at the start of a record, between 2 field delimiters, an empty field following an empty field, and at the end of a record. Alternation can be employed to cater for scenarios 1, 2 and 4, and scenario 3 can be catered for by a second pass using a loop (:;...;t). Multiple scenarios can be replaced in both passes using the g flag.

sed command to delete text until match is found for each line of a csv

I have a csv file and I am trying to delete all characters from the beginning of the line till it finds the first occurrence of "2015". I want to do this for each line in the csv file.
My csv file structure is as follows:
Field1 , Field2 , Field3 , Field4
sometext1 , 2015-07-15 , sometext2, sometext3
sometext1 , 2015-07-14 , sometext2, sometext3
sometext1 , 2015-07-13 , sometext2, sometext3
I cannot use the cut command or sed for the first occurrence of a comma because the text in the Field1 sometimes has commas in them too, which is making it complicated for parsing. I figured if I search for the first occurrence of the text 2015 for each line and replace all the preceding characters with nothing, then that should work.
FYI I want to do this for the FIRST occurrence of 2015 only. There is another text field with 2015 in it within another column and I don't want any text prior to that to be affected.
For example, if my original line is:
sometext1,#015,2015-07-10,sometext2,2015,sometext3
I want it to return:
2015-07-10,sometext2,2015,sometext3
Does anyone know the sed command to do this?
Any help will be appreciated!
Thanks
Here is a way to do it with sed assuming "#####" never occurs in a line:
sed -e 's/2015/#####&/'|sed -e 's/.*#####//'
For example:
> echo sometext1,#015,2015-07-10,sometext2,2015,sometext3\
|sed -e 's/2015/#####&/'|sed -e 's/.*#####//'
2015-07-10,sometext2,2015,sometext3
The first sed command prefixes "#####" to the first occurrence of 2015 and the second sed command removes everything from the beginning of the line to the end of the "#####" prefix.
The basic reason for using this two stage method is that sed's regular expression matcher has only greedy wildcards that always pick the longest match and does not support lazy matching which picks the shortest match.
If "#####" may occur in a line a more unlikely string could be substituted for it such as "7z#dNjm_wG8a3!esu#Rhv=".
To do this with sed without Perl-style non-greedy operators, you need to mark the first instance with something you know won't be in the line, as Tris describes. However, that solution requires knowledge of what won't be in the file. Fortunately, you can guarantee that a newline won't be in the line because that's what terminated the line. Thus you can do something like:
sed 's/2015/\n&/;s/.*\n//' input.txt > output.txt
NOTE: this won't modify the header row which you would have to treat specially.
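A marker can be avoided altogether by locating the first occurrence with a plain string search instead of a regex. The following is a sketch using awk rather than sed; cut_before_2015 is a hypothetical helper name, and like the sed commands above it leaves lines without 2015 (such as the header row) unchanged.

```shell
#!/bin/sh
# cut_before_2015: hypothetical helper; reads lines on stdin.
cut_before_2015() {
    # index() returns the position of the FIRST "2015" (0 if absent),
    # so lines without it, such as the header, pass through unchanged.
    awk '{ i = index($0, "2015"); if (i) $0 = substr($0, i) } 1'
}

printf '%s\n' 'Field1 , Field2 , Field3 , Field4' \
    'sometext1,#015,2015-07-10,sometext2,2015,sometext3' | cut_before_2015
```

Since index() does no pattern matching, the greedy versus lazy question does not arise here.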