select part of filename using regex

I have a file that looks like this:
dcdd62defb908e37ad037820f7 /sp/dir/su1/89/asga.gz
7d59319afca23b02f572a4034b /sp/dir/su2/89/sfdh.gz
ee1d443b8a0cc27749f4b31e56 /sp/dir/su3/89/24.gz
33c02e311fd0a894f7f0f8aae4 /sp/dir/su4/89/dfad.gz
43f6cdce067f6794ec378c4e2a /sp/dir/su5/89/adf.gz
2f6c584116c567b0f26dfc8703 /sp/dir/su6/895/895.gz
a864b7e327dac1bb6de59dedce /sp/dir/su7/895/895.gz
How do I use sed to substitute all the su* directories with a single value, like this:
sed "s/REGEXP/newfolder/g" myfile
Thanks in advance.

I think you want
sed 's/su./newfolder/g'
If you actually want to keep the number in su1...su7 as a part of newfolder (for example newfolder1...newfolder7), you can do:
sed 's/su\(.\)/newfolder\1/g'
It also depends on how "strict" you want your pattern to be. The above will match su followed by any character, anywhere on the line, and do the replacement. On the other hand, a command like s#/su\([0-9]\)/#/newfolder\1/#g will only match /su followed by a single digit, followed by /. So you may need to adjust your pattern accordingly.
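To see the difference in strictness, here is a small sketch on a made-up line (the `super` directory is hypothetical, not taken from the question's file). The loose pattern also rewrites part of `super`, while the strict one only touches the `/su1/` component:

```shell
# Hypothetical input line: "super" also contains "su" + one character.
line='abc123 /sp/super/su1/89/asga.gz'

# Loose: "su" followed by any single character, anywhere on the line.
loose=$(printf '%s\n' "$line" | sed 's/su./newfolder/g')

# Strict: a whole path component made of "su" and exactly one digit.
strict=$(printf '%s\n' "$line" | sed 's#/su[0-9]/#/newfolder/#g')

printf '%s\n' "$loose"   # abc123 /sp/newfolderer/newfolder/89/asga.gz
printf '%s\n' "$strict"  # abc123 /sp/super/newfolder/89/asga.gz
```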

$ sed -e 's|/su[^/]*|/newfolder|' /tmp/files
dcdd62defb908e37ad037820f7 /sp/dir/newfolder/89/asga.gz
7d59319afca23b02f572a4034b /sp/dir/newfolder/89/sfdh.gz
...
If you want to get rid of the checksums as well:
$ sed -r -e 's|/su[^/]*|/newfolder|' -e 's/^[^ ]+ +//' /tmp/files
/sp/dir/newfolder/89/asga.gz
/sp/dir/newfolder/89/sfdh.gz
...

su[0-9] will match a single digit.

sed requires a dirty amount of metacharacter escaping, so some of this may be slightly off. Note that without -r or -E a bare + is a literal plus sign, so * is used here:
sed -i -e 's/\/su[^\/]*\//\/newFolder\//g' myfile

I vote for Wayne Conrad's answer as the most likely to be what the OP wants, but I'd suggest using an alternate character for the sed expression separator, thus:
sed 's|/su[^/]*|/newfolder|' /tmp/files
That makes it a bit cleaner.
Note also that the trailing 'g' is probably not wanted.
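Whether the trailing g matters depends on the data. A minimal sketch, using a made-up path that has two components starting with `su` (no such line appears in the question's file):

```shell
# Hypothetical path with two components that start with "su".
line='abc123 /sp/dir/su1/su2/file.gz'

# Without g: only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's|/su[^/]*|/newfolder|')

# With g: every match on the line is replaced.
every_match=$(printf '%s\n' "$line" | sed 's|/su[^/]*|/newfolder|g')

printf '%s\n' "$first_only"   # abc123 /sp/dir/newfolder/su2/file.gz
printf '%s\n' "$every_match"  # abc123 /sp/dir/newfolder/newfolder/file.gz
```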

Use awk. Since the fields are delimited by '/', you can split on it; column 4 is then the one you want to change. That way, if you have paths like /sp/su3dir/su2/89/sfdh.gz, su3dir will not be affected.
awk -F"/" '{$4="newfolder";}1' OFS="/" file
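For comparison, here is a sketch of how the field-based approach and a first-match sed behave on the `su3dir` path mentioned above (the checksum is borrowed from the question; `-v OFS="/"` is used instead of a trailing `OFS="/"` argument, which should be equivalent here):

```shell
line='7d59319afca23b02f572a4034b /sp/su3dir/su2/89/sfdh.gz'

# Field-based: always rewrites the 4th "/"-separated field.
by_field=$(printf '%s\n' "$line" | awk -F"/" -v OFS="/" '{$4="newfolder";}1')

# Regex-based: rewrites the first component that starts with "su".
by_regex=$(printf '%s\n' "$line" | sed 's|/su[^/]*|/newfolder|')

printf '%s\n' "$by_field"  # 7d59319afca23b02f572a4034b /sp/su3dir/newfolder/89/sfdh.gz
printf '%s\n' "$by_regex"  # 7d59319afca23b02f572a4034b /sp/newfolder/su2/89/sfdh.gz
```

The awk version depends on every path having the same depth, while the sed version depends on the directory name.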

Related

sed replace exact match

I want to change some names in a file using sed. This is what the file looks like:
#! /bin/bash
SAMPLE="sample_name"
FULLSAMPLE="full_sample_name"
...
Now I only want to change sample_name & not full_sample_name using sed
I tried this
sed s/\<sample_name\>/sample_01/g ...
I thought \< and \> could be used to find an exact match, but when I use this, nothing is changed.
Adding single quotes ('') around the expression helped to change only sample_name. However, there is another problem now: my situation is a bit more complicated than explained above, since my sed command is embedded in a loop:
while read SAMPLE
do
name=$SAMPLE
sed -e 's/\<sample_name\>/$SAMPLE/g' /path/coverage.sh > path/new_coverage.sh
done < $1
So sample_name should be replaced with the value of $SAMPLE. However, when running the command, sample_name is changed to the literal text $SAMPLE and not to its value.
I believe \< and \> work with GNU sed; you just need to quote the sed command:
sed -i.bak 's/\<sample_name\>/sample_01/g' file
In GNU sed, the following command works:
sed 's/\<sample_name\>/sample_01/' file
The only difference here is that I've enclosed the command in single quotes. Even when it is not necessary to quote a sed command, I see very little disadvantage to doing so (and it helps avoid these kinds of problems).
Another way of achieving what you want more portably is by adding the quotes to the pattern and replacement:
sed 's/"sample_name"/"sample_01"/' script.sh
Alternatively, the syntax you have proposed also works in GNU awk:
awk '{sub(/\<sample_name\>/, "sample_01")}1' file
If you want to use a variable in the replacement string, you will have to use double quotes instead of single, for example:
sed "s/\<sample_name\>/$var/" file
Variables are not expanded within single quotes, which is why you are getting the name of your variable rather than its contents.
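The quoting difference can be checked with a short sketch. It assumes GNU sed (for \< and \>), and the value `sample_01` stands in for whatever the loop reads into SAMPLE:

```shell
SAMPLE='sample_01'   # stands in for a value read from the list file

# Two lines modeled on the script in the question.
script='SAMPLE="sample_name"
FULLSAMPLE="full_sample_name"'

# Single quotes: the shell passes $SAMPLE to sed literally.
single=$(printf '%s\n' "$script" | sed 's/\<sample_name\>/$SAMPLE/')

# Double quotes: the shell expands $SAMPLE before sed runs.
double=$(printf '%s\n' "$script" | sed "s/\<sample_name\>/$SAMPLE/")

printf '%s\n' "$single"   # first line becomes SAMPLE="$SAMPLE"
printf '%s\n' "$double"   # first line becomes SAMPLE="sample_01"
```

If the value can contain / or &, it would need escaping (or a different delimiter) before being placed in the replacement.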
@user1987607
You can do this the following way:
sed 's/"sample_name"/"sample_01"/g' file
Including the literal double quotes in the pattern means that only the exact quoted value "sample_name" matches; "full_sample_name" does not, because there the opening quote is followed by full_.
/g is for global replacement.
If sample_name can also occur without the quotes, and embedded occurrences such as ifsample_name must be left alone,
then you should match on word boundaries instead (GNU sed):
sed 's/\<sample_name\>/sample_01/g' file
So that it replaces only the desired word. For example, the same syntax would replace the word "the" in a text file but not inside words like thereby.
If you are interested in replacing only the first occurrence on each line, then this would work fine:
sed 's/"sample_name"/"sample_01"/' file
Hope it helps

GREP: Extracting all characters from inside double quote

What I did:
grep -E -o -e '"[^"]+"'
It can extract, for example: "Poland" and "New York" but can't extract "Marcos Juárez" due to the existence of 'á'...it cuts the output to "Marcos Ju" and "rez"
How can I prevent this?
I don't think this is a regex problem per se. It could be a Unicode or wide-char issue.
Your regex should be "[^"]+", that is: a quote, one or more characters that are NOT a double quote, then a quote.
I don't know the Unix command line, but what delimits the "[^"]+" parameter? Is it just spaces?
Try ".*?"; it should match. If not, it's a Unicode problem.
Try:
grep -Po '(?<=\")(.*?)(?=\")'
For me it outputs all three correctly (without the surrounding quotes, since the lookarounds do not consume them).
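If the cause is indeed an encoding mismatch (for example, the input bytes are not valid in the current locale's encoding), one thing to try is running grep in the C locale, where every byte counts as a character. A minimal sketch, assuming GNU grep and a made-up input line:

```shell
# Made-up input; \303\241 is the UTF-8 encoding of the accented "a".
input=$(printf '"Poland" "New York" "Marcos Ju\303\241rez"')

# In the C locale grep works byte by byte, so [^"] also matches the
# bytes of the accented letter, whatever the file's encoding is.
quoted=$(printf '%s\n' "$input" | LC_ALL=C grep -o '"[^"]*"')

printf '%s\n' "$quoted"   # one quoted string per line
```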

my sed is close... but not quite there, can you help please?

I want to print only the lines that meet the criteria : "worde:" and "wordo;"
I got this far:
sed -n '/\([a-z]*\)\1e:\1o;/p;'
But it doesn't quite work.
Can someone please fix it and tell me exactly what was wrong with mine?
(Please note there are no capital letters ever, hence why I didn't bother including that within my initial character range)
Thanks heaps,
This will handle lines where "worde:wordo;" (nothing between the words) appears:
sed -n '/\([a-z]*\)e:\1o;/p;'
If you need to allow for characters BETWEEN the words, you'll need something like this:
sed -n '/\([a-z]*\)e:.*\1o;/p;'
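A quick way to compare the two expressions is to run them on a few made-up lines. One caveat worth checking on real data: [a-z]* can also match the empty string, so the second expression accepts any line that has e: followed later by o;. The third command below is an assumed tightening, not part of the answer above, that requires at least one letter in the group:

```shell
# Made-up test lines.
lines='cate:cato;
cate: and cato;
cate:dogo;'

adjacent=$(printf '%s\n' "$lines" | sed -n '/\([a-z]*\)e:\1o;/p')
between=$(printf '%s\n' "$lines" | sed -n '/\([a-z]*\)e:.*\1o;/p')
nonempty=$(printf '%s\n' "$lines" | sed -n '/\([a-z][a-z]*\)e:.*\1o;/p')

printf 'adjacent:\n%s\n' "$adjacent"   # only "cate:cato;"
printf 'between:\n%s\n' "$between"     # all three lines (empty group on the last)
printf 'nonempty:\n%s\n' "$nonempty"   # the first two lines
```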
My interpretation of your question is that you want to match lines which contain both worde: and wordo;
sed -n '/worde:/{/wordo;/p}' infile
The -n option prevents sed from automatically printing the pattern space (the current line of infile). If the first regex matches, control flows into the block; if it does not, the line is ignored. Inside the block, if the second regex also matches, the line is printed.
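A small sketch of the nested match on made-up lines. The `;` before the closing brace is an addition here; GNU sed accepts the command without it, but some other seds require it:

```shell
# Made-up lines: only the first one contains both strings.
lines='worde: and wordo;
worde: only
wordo; only'

# Print a line only if it matches /worde:/ and then also /wordo;/.
both=$(printf '%s\n' "$lines" | sed -n '/worde:/{/wordo;/p;}')

printf '%s\n' "$both"   # worde: and wordo;
```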
One way using alternation:
sed -n '/word\(e:\|o;\)/ p' infile
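Is it a requirement to use capture groups? I went without them.
$ sed -n '/\w*[oe][:;]/p'
\w* - Match any run of word characters (if you really want only [a-z], swap that back in). \w is a GNU extension and has to be written outside brackets; inside a bracket expression, [\w] means a literal backslash or w.
[oe] - Those word characters must end in an e or o
[:;] - And then have a : or ;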
This might work for you:
sed '/^\(.*\)[eE]:\s*\1[oO];/!d' file

Sed wildcards- replace in the middle of certain characters

I have a line like this
"RU_28" "CDM_279" "CDI_45"
"RU_567" "CDM_528" "CDI_10000"
I want to obtain the result below
"RU_28" "CDM_Unusued" "CDI_45"
"RU_567" "CDM_Unusued" "CDI_10000"
I want to do this for all the lines in the file.
I'm using this command:
sed 's/\"CDM_\w*\"/\"Unusued\"/g' File1.txt > File2.txt
It doesn't seem to work.
Thanks in advance!!!
You can use:
sed -i.bak 's/"\(CDM_\)[^"]*"/"\1Unused"/' file1.txt
You're not actually leaving "CDM_" on the right side. Your substitution says "Replace CDM_ and any number of word characters with Unusued." The common way to do what you actually want is to put parentheses around the section on the left that you want to keep, and then use backreferences to indicate where it goes on the right. In this case, you just need a single backreference, indicated by \1:
sed 's/"\(CDM_\)\w*"/"\1Unusued"/g' File1.txt > File2.txt
Note the backslashes before the parentheses on the left side; these can be omitted if using sed with -r (or -E) for extended regexps, but as-is, they're necessary so sed knows they're not literal.
Edit: I've updated the command in response to the accurate comment by Birei, noting the extraneous escapes for the double quotes. (Note that the ones for the parentheses are still necessary.)
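As a sketch of the backreference in both regex flavors, here is the first sample line run through a basic and an extended expression ([^"]* is used instead of \w* so that it does not depend on GNU extensions, and the replacement is spelled Unused):

```shell
line='"RU_28" "CDM_279" "CDI_45"'

# Basic regex: the capturing parentheses are escaped.
bre=$(printf '%s\n' "$line" | sed 's/"\(CDM_\)[^"]*"/"\1Unused"/')

# Extended regex (-E; GNU sed also accepts -r): no backslashes needed.
ere=$(printf '%s\n' "$line" | sed -E 's/"(CDM_)[^"]*"/"\1Unused"/')

printf '%s\n%s\n' "$bre" "$ere"   # both: "RU_28" "CDM_Unused" "CDI_45"
```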

using sed to copy lines and delete characters from the duplicates

I have a file that looks like this:
#"Afghanistan.png",
#"Albania.png",
#"Algeria.png",
#"American_Samoa.png",
I want it to look like this
#"Afghanistan.png",
#"Afghanistan",
#"Albania.png",
#"Albania",
#"Algeria.png",
#"Algeria",
#"American_Samoa.png",
#"American_Samoa",
I thought I could use sed to do this but I can't figure out how to store something in a buffer and then modify it.
Am I even using the right tool?
Thanks
You don't have to get tricky with regular expressions and replacement strings: use sed's p command to print the line intact, then modify the line and let it print implicitly:
sed 'p; s/\.png//'
Glenn Jackman's response is OK, but it also doubles the rows which do not match the expression.
This one, instead, doubles only the rows which matched the expression:
sed -n 'p; s/\.png//p'
Here, -n stands for "print nothing unless explicitly printed", and the p flag in s/\.png//p forces a print if a substitution was made, but not otherwise.
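The difference only shows up when the file contains lines without .png. A minimal sketch, with one line from the question plus a made-up line that has no .png in it:

```shell
# One sample line plus a hypothetical line with no ".png".
lines='#"Albania.png",
// no image here'

# Prints every line twice (the second copy with .png removed, if present).
always=$(printf '%s\n' "$lines" | sed 'p; s/\.png//')

# Prints every line once, plus a second copy only where .png was removed.
matched_only=$(printf '%s\n' "$lines" | sed -n 'p; s/\.png//p')

printf '%s\n---\n%s\n' "$always" "$matched_only"
```

With this input the first command prints four lines and the second prints three.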
That is pretty easy to do with sed, and you do not even need to use the hold space (sed's auxiliary buffer). Given the input file below:
$ cat input
#"Afghanistan.png",
#"Albania.png",
#"Algeria.png",
#"American_Samoa.png",
you should use this command:
sed 's/#"\([^.]*\)\.png",/&\
#"\1",/' input
The result:
$ sed 's/#"\([^.]*\)\.png",/&\
#"\1",/' input
#"Afghanistan.png",
#"Afghanistan",
#"Albania.png",
#"Albania",
#"Algeria.png",
#"Algeria",
#"American_Samoa.png",
#"American_Samoa",
This command is just a substitution command (s///). It matches anything starting with #" followed by non-period chars ([^.]*) and then by .png",. It also captures all the non-period chars before .png", using the group brackets \( and \), so we can reuse what was matched by this group. So, this is the to-be-replaced regular expression:
#"\([^.]*\)\.png",
Then follows the replacement part of the command. The & just inserts everything that was matched by #"\([^.]*\)\.png", into the changed content. If it were the only element of the replacement part, nothing would be changed in the output. However, following the & there is a newline character, represented by the backslash \ followed by an actual newline, and on the new line we add the #" string followed by the content of the first group (\1) and then the string ",.
This is just a brief explanation of the command. Hope this helps. Also, note that you can use the \n string to represent newlines in some versions of sed (such as GNU sed). It would render a more concise and readable command:
sed 's/#"\([^.]*\)\.png",/&\n#"\1",/' input
I prefer this over Carles Sala's and Glenn Jackman's (with the dots escaped, so that they match only a literal period):
sed '/\.png/p;s/\.png//'
Could just say it's personal preference.
Or one can combine both versions and apply the duplication only on lines matching the required pattern:
sed -e '/^#".*\.png",/{p;s/\.png//;}' input