SED / AWK multi-file replacement? - regex

I've been trying to figure out a command that will search through 13+ files and replace
all matches and variances of forms data and replace them with form data enhancements.
The trick is that there could be [whitespace] - or _ as a separator that I would like
to preserve. I'm running form command line so I believe I could run this script multiple
times and just point it at the file, or if there's a way to capture all files in a directory
(even including directory names) it might just be easier.
I believe its something to the tune of
sed "s/forms_data/form-data-enhancements/g ; s/forms-data/form-data-enhancements/g ; s/forms data/form data enhancements/g" oldfile > newfile
nut I'm not sure.....
variances might be
forms-data
forms_data
forms data
etcetra. Would someone mind sharing a bit of sed awk wisdom? The best I can find is something called an arrary replace but was unable to get any information on how to use it.
Thanks greatly.

Will this work for you -
sed -i 's/\<forms[ -_]data\>/form data enhancements/g' /path/to/files*
-i will do in-file substitution. So first pick a file and run the command without the -i option. If everything looks ok then you can go ahead and use the -i.
Update:
If you would like to retain the separators then you can do something like this -
sed -i 's/\<forms\([ -_]\)data\>/form\1data\1enhancements/' /path/to/files*

Related

How can I insert text after a certain string is found? [duplicate]

I would like to run a find and replace on an HTML file through the command line.
My command looks something like this:
sed -e s/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g index.html > index.html
When I run this and look at the file afterward, it is empty. It deleted the contents of my file.
When I run this after restoring the file again:
sed -e s/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g index.html
The stdout is the contents of the file, and the find and replace has been executed.
Why is this happening?
When the shell sees > index.html in the command line it opens the file index.html for writing, wiping off all its previous contents.
To fix this you need to pass the -i option to sed to make the changes inline and create a backup of the original file before it does the changes in-place:
sed -i.bak s/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g index.html
Without the .bak the command will fail on some platforms, such as Mac OSX.
An alternative, useful, pattern is:
sed -e 'script script' index.html > index.html.tmp && mv index.html.tmp index.html
That has much the same effect, without using the -i option, and additionally means that, if the sed script fails for some reason, the input file isn't clobbered. Further, if the edit is successful, there's no backup file left lying around. This sort of idiom can be useful in Makefiles.
Quite a lot of seds have the -i option, but not all of them; the posix sed is one which doesn't. If you're aiming for portability, therefore, it's best avoided.
sed -i 's/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g' index.html
This does a global in-place substitution on the file index.html. Quoting the string prevents problems with whitespace in the query and replacement.
use sed's -i option, e.g.
sed -i bak -e s/STRING_TO_REPLACE/REPLACE_WITH/g index.html
To change multiple files (and saving a backup of each as *.bak):
perl -p -i -e "s/\|/x/g" *
will take all files in directory and replace | with x
this is called a “Perl pie” (easy as a pie)
You should try using the option -i for in-place editing.
Warning: this is a dangerous method! It abuses the i/o buffers in linux and with specific options of buffering it manages to work on small files. It is an interesting curiosity. But don't use it for a real situation!
Besides the -i option of sed
you can use the tee utility.
From man:
tee - read from standard input and write to standard output and files
So, the solution would be:
sed s/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g index.html | tee | tee index.html
-- here the tee is repeated to make sure that the pipeline is buffered. Then all commands in the pipeline are blocked until they get some input to work on. Each command in the pipeline starts when the upstream commands have written 1 buffer of bytes (the size is defined somewhere) to the input of the command. So the last command tee index.html, which opens the file for writing and therefore empties it, runs after the upstream pipeline has finished and the output is in the buffer within the pipeline.
Most likely the following won't work:
sed s/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g index.html | tee index.html
-- it will run both commands of the pipeline at the same time without any blocking. (Without blocking the pipeline should pass the bytes line by line instead of buffer by buffer. Same as when you run cat | sed s/bar/GGG/. Without blocking it's more interactive and usually pipelines of just 2 commands run without buffering and blocking. Longer pipelines are buffered.) The tee index.html will open the file for writing and it will be emptied. However, if you turn the buffering always on, the second version will work too.
sed -i.bak "s#https.*\.com#$pub_url#g" MyHTMLFile.html
If you have a link to be added, try this. Search for the URL as above (starting with https and ending with.com here) and replace it with a URL string. I have used a variable $pub_url here. s here means search and g means global replacement.
It works !
The problem with the command
sed 'code' file > file
is that file is truncated by the shell before sed actually gets to process it. As a result, you get an empty file.
The sed way to do this is to use -i to edit in place, as other answers suggested. However, this is not always what you want. -i will create a temporary file that will then be used to replace the original file. This is problematic if your original file was a link (the link will be replaced by a regular file). If you need to preserve links, you can use a temporary variable to store the output of sed before writing it back to the file, like this:
tmp=$(sed 'code' file); echo -n "$tmp" > file
Better yet, use printf instead of echo since echo is likely to process \\ as \ in some shells (e.g. dash):
tmp=$(sed 'code' file); printf "%s" "$tmp" > file
And the ed answer:
printf "%s\n" '1,$s/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g' w q | ed index.html
To reiterate what codaddict answered, the shell handles the redirection first, wiping out the "input.html" file, and then the shell invokes the "sed" command passing it a now empty file.
I was searching for the option where I can define the line range and found the answer. For example I want to change host1 to host2 from line 36-57.
sed '36,57 s/host1/host2/g' myfile.txt > myfile1.txt
You can use gi option as well to ignore the character case.
sed '30,40 s/version/story/gi' myfile.txt > myfile1.txt
With all due respect to the above correct answers, it's always a good idea to "dry run" scripts like that, so that you don't corrupt your file and have to start again from scratch.
Just get your script to spill the output to the command line instead of writing it to the file, for example, like that:
sed -e s/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g index.html
OR
less index.html | sed -e s/STRING_TO_REPLACE/STRING_TO_REPLACE_IT/g
This way you can see and check the output of the command without getting your file truncated.

Mass rename in shell script

I have a bunch of files which are of this format:
blabla.log.YYYY.MM.DD
Where YYYY.MM.DD is something like (2016.01.18)
I have quite a few folders with about 1000 files in each, so I wanted to have a simple script to rename them. I want to rename them to
blabla.log
So basically, I'm just stripping the date at the end. Here is what I have:
for f in [a-zA-Z]*.log.[0-9][0-9][0-9][0-9].[0-9][0-9].[0-9][0-9]; do
mv -v $f ${f#[0-9][0-9][0-9][0-9].[0-9][0-9].[0-9][0-9]};
done
This script outputs this:
mv: `blabla.log.2016.01.18' and `blabla.log.2016.01.18' are the same file
For more information:
I'm on windows, but I run this script in gitbash
For some reason, my gitbash doesn't recognize the "rename" command
Some regex patterns (like [0-9]{4} don't seem to work)
I'm really at a lost. Thanks.
EDIT: I need to rename every single file that has a date at the end and that is of the from: *.log.2016.01.18. They all need to keep their original names. All that should change is the removal of the date.
You have to use % instead of #: you want to remove from the end, not the start of your string.
Also, you're missing a . in what has to be removed, you don't want to end up with blabla.log..
Quoting the variable names prevents surprises when file names contain special characters.
Together:
mv -v "$f" "${f%.[0-9][0-9][0-9][0-9].[0-9][0-9].[0-9][0-9]}"

batch renaming of files with perl expressions

This should be a basic question for a lot of people, but I am a biologist with no programming background, so please excuse my question.
What I am trying to do is rename about 100,000 gzipped data files that have existing name of a code (example: XG453834.fasta.gz). I'd like to name them to something easily readable and parseable by me (example: Xanthomonas_galactus_str_453.fasta.gz).
I've tried to use sed, rename, and mmv, to no avail. If I use any of those commands on a one-off script then they work fine, it's just when I try to incorporate variables into a shell script do I run into problems. I'm not getting any errors, just no names are changed, so I suspect it's an I/O error.
Here's what my files look like:
#! /bin/bash
# change a bunch of file names
file=names.txt
while IFS=' ' read -r r1 r2;
do
mmv ''$r1'.fasta.gz' ''$r2'.fasta.gz'
# or I tried many versions of: sed -i 's/"$r1"/"$r2"/' *.gz
# and I tried many versions of: rename -i 's/$r1/$r2/' *.gz
done < "$file"
...and here's the first lines of my txt file with single space delimiter:
cat names.txt
#find #replace
code1 name1
code2 name2
code3 name3
I know I can do this with python or perl, but since I'm stuck here working on this particular script I want to find a simple solution to fixing this bash script and figure out what I am doing wrong. Thanks so much for any help possible.
Also, I tried to cat the names file (see comment from Ashoka Lella below) and then use awk to move/rename. Some of the files have variable names (but will always start with the code), so I am looking for a find & replace option to just replace the "code" with the "name" and preserve the file name structure.
I suspect I am not escaping the variable within the single tick of the perl expression, but I have poured over a lot of manuals and I can't find the way to do this.
If you're absolutely sure than the filenames doesn't contain spaces of tabs, you can try the next
xargs -n2 < names.txt echo mv
This is for DRY run (will only print what will do) - if you satisfied with the result, remove the echo ...
If you want check the existence ot the target, use
xargs -n2 < names.txt echo mv -i
if you want NEVER allow overwriting of the target use
xargs -n2 < names.txt echo mv -n
again, remove the echo if youre satisfied.
I don't think that you need to be using mmv, a simple mv will do. Also, there's no need to specify the IFS, the default will work for you:
while read -r src dest; do mv "$src" "$dest"; done < names.txt
I have double quoted the variable names as it is generally considered good practice but in this case, a space in either of the filenames will result in read not working as you expect.
You can put an echo before the mv inside the loop to ensure that the correct command will be executed.
Note that in your file names.txt, the .fasta.gz suffix is already included, so you shouldn't be adding it inside the loop aswell. Perhaps that was your problem?
This should rename all files in column1 to column2 of names.txt. Provided they are in the same folder as names.txt
cat names.txt| awk '{print "mv "$1" "$2}'|sh

how to remove lines from file that don't match regex?

I have a big file that looks like this:
7f0c41d6-f9c6-47aa-a034-d40bc629c973.csv
159890
159891
24faaed6-62ee-4175-8430-5d73b09911c8.csv
159907
5bad221f-25ef-44fa-9086-fd152e697928.csv
642e4ac3-3d46-4b4c-b5c8-aa2fa54d0b04.csv
d0e145a5-ceb8-4d4b-ae47-11e0c9a6548d.csv
159929
ba678cbd-af57-493b-a69e-e7504b4bc328.csv
7750840f-9bf9-4a68-9f25-a2ba0968d481.csv
159955
159959
And I'm only interesting in *.csv files, can someone point me how to remove files that do not end with .csv.
Thank you.
grep "\.csv$" file
will pull out only those lines ending in .csv
Then if you want to put them in a different file;
grep "\.csv$" file > newfile
sed is your friend:
sed -i.bak '/\.csv$/!d' file
-i.bak : in-place edit. creates backup file with .bak extension
([0-9a-zA-Z-]*.csv$)
This is the regex code that only select the filename ending with .csv extensions.
Hope this will help you.
If you are familiar with the vim text editor (vim or vi is typically installed on many linux boxes), use the following vim Ex mode command to remove lines that don't match a particular pattern:
:v/<pattern>/d
For example, if I wanted to delete all lines that didn't contain "column" I would run:
:v/"column"/d
Hope this helps.
If it is the case that you do not want to have to save the names of files in another file just to remove unwanted files, then this may also be an added solution for your needs (understanding that this is an old question).
This single line for loop using the grep "\.csv" file solution recursively so you don't need to manage multiple files names being saved here or there.
for f in *; do if [ ! "$(echo ${f} | grep -Eo '.csv')" == ".csv" ]; then rm "${f}"; fi; done
As a visual aid to show you that it works as intended (for removing all files except csv files) here is a quick and dirty screenshot showing the results using your sample output.
And here is a slightly shorter version of the single line command:
for f in *; do if [ ! "$(echo ${f} | grep -o '.csv')" ]; then rm "${f}"; fi; done
And here is it's sample output using your sample's csv file names and some randomly generated text files.
The purpose for using such a loop with a conditional is to guarantee you only rid yourself of the files you want gone (the non-csv files) and only in the current working directory without parsing the ls command.
Hopefully this helps you and anyone else that is looking for a similar solution.

Bash script - mass modify files sed regular expression

I have a set of .csv files (all in one folder) with the format shown below:
170;151;104;137;190;125;170;108
195;192;164;195;171;121;133;104
... (a lot more rows) ...
The thing is I screwed up a bit and it should look like this
170;151;104.137;190.125;170;108
195;192;164.195;171.121;133;104
In case the difference is too subtle to notice:
I need to write a script that changes every third and fifth semicolon into a period in every row in efery file in that folder.
My research indicate that I have to devise some clever sed s/ command in my script. The problem is I'm not very good with regular expressions. From reading the tutorial it's probably gonna involve something with /3 and /5.
Here's a really short way to do it:
sed 's/;/./3;s/;/./4' -iBAK *
It replaces the 3rd and then the 5th (which is now the 4th) instances of the ; with ..
I tested it on your sample (saved as sample.txt):
$ sed 's/;/./3;s/;/./4' <sample.txt
170;151;104.137;190.125;170;108
195;192;164.195;171.121;133;104
For safety, I have made my example back up your originals as <WHATEVER>.BAK. To prevent this, change -iBAK to -i.
This script may not be totally portable but I've tested it on Mac 10.8 with BSD sed (no idea what version) and Linux with sed (gsed) 4.1.4 (2003). #JonathanLeffler notes that it's standard POSIX sed as of 2008. I also just found it and like it a lot.
Golf tip: If you run the command from bash, you can use brace expansion to achieve a supremely short version:
sed -es/\;/./{3,4} -i *
Here's one way:
sed -i 's/^\([^;]*;[^;]*;[^;]*\);\([^;]*;[^;]*\);/\1.\2./' foldername/*
(Disclaimer: I did test this, but some details of sed are not fully portable. I don't think there's anything non-portable in the above, so it should be fine, but please make a backup copy of your folder first, before running the above. Just in case.)