Find string with regex in text file and replace all other occurrences


The document I would like to transform looks like this:
name=foo
name=bar
thing, attribute1=foo, attribute2=data1
thing, attribute3=bar, attribute4=data2
What I would like to do is to find the strings foo and bar (by searching for "name=(.*)", for example) and then replace all of their occurrences by adding a prefix.
The document would then become
name=prefix_foo
name=prefix_bar
thing, attribute=prefix_foo
thing, attribute=prefix_bar
I imagine this could be done purely with grep and sed?
Working line by line the transformation would be:
gsed -i -E 's/name=(.*)/name=prefix_\1/g' test.txt
However, how can I reuse the match for other substitutions (recursively)?

You can indeed reuse the match for other names. By using the regex options -P -o, and making use of \K, you can select only the names you want to replace, and then prefix them with sed. Here's a bash script that does what you want.
#get filenames and prefix
echo "input filename?";
read fname;
echo "prefix?";
read prefix;
#if it's a file...
if [ -f "$fname" ]
then
#grep for names to change
result=$(grep -P -o "name=\K.*" "$fname");
#get names in an array
arrRes=($result);
#loop through and sed each name
for name in "${arrRes[@]}"; do
#name now holds a name to sub
echo "replacing $name with $prefix$name";
#sub the name
sed -i "s/$name/$prefix$name/g" "$fname";
done
fi
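A variant of the same idea as a function, offered as a minimal sketch rather than a definitive implementation. The function name prefix_names is hypothetical. It assumes GNU sed (for -i and -E, as with gsed in the question), that names contain no regex metacharacters or slashes, and that a name should only be rewritten where it is a complete value after an = sign, so that foo is changed but foobar is not.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prefix every name declared as "name=..." wherever it
# appears as a complete value (after "=" and before "," or end of line).
# Assumes GNU sed and names without regex metacharacters or slashes.
prefix_names() {
    local file="$1" prefix="$2" names name
    # Collect the declared names first, so the list is fixed before any edit.
    names=$(sed -n 's/^name=//p' "$file")
    printf '%s\n' "$names" | while IFS= read -r name; do
        [ -n "$name" ] || continue
        # Rewrite "=name" only when followed by a comma or the end of the line.
        sed -i -E "s/=${name}(,|\$)/=${prefix}${name}\1/g" "$file"
    done
}
```

Usage would be something like prefix_names test.txt prefix_. Anchoring on the = and the trailing comma or end of line avoids touching names that merely occur inside longer words.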

Related

rename multiple files splitting filenames by '_' and retaining first and last fields

Say I have the following files:
a_b.txt a_b_c.txt a_b_c_d_e.txt a_b_c_d_e_f_g_h_i.txt
I want to rename them in such a way that I split their filenames by _ and I retain the first and last field, so I end up with:
a_b.txt a_c.txt a_e.txt a_i.txt
Thought it would be easy, but I'm a bit stuck...
I tried rename with the following regexp:
rename 's/^([^_]*).*([^_]*[.]txt)/$1_$2/' *.txt
But what I would really need to do is to actually split the filename, so I thought of awk, but I'm not so proficient with it... This is what I have so far (I know at some point I should specify FS="_" and grab the first and last field somehow...
find . -name "*.txt" | awk -v mvcmd='mv "%s" "%s"\n' '{old=$0; <<split by _ here somehow and retain first and last fields>>; printf mvcmd,old,$0}'
Any help? I don't have a preferred method, but it would be nice to use this to learn awk. Thanks!

Your rename attempt was close; the problem is that the greedy .* swallows everything up to the extension, so the final group ends up capturing only .txt.
rename 's/^([^_]*).*_([^_]*[.]txt)$/$1_$2/' *_*_*.txt
I added a _ before the last opening parenthesis (this is the crucial fix), and a $ anchor at the end, and also extended the wildcard so that you don't process any files which don't contain at least two underscores.
The equivalent in Awk might look something like
find . -name "*_*_*.txt" |
awk -F _ '{ system("mv " $0 " " $1 "_" $(NF)) }'
This is somewhat brittle because of the system call; you might need to rethink your approach if your file names could contain whitespace or other shell metacharacters. You could add quoting to partially fix that, but then the command will fail if the file name contains literal quotes. You could fix that, too, but then this will be a little too complex for my taste.
Here's a less brittle approach which should cope with completely arbitrary file names, even ones with newlines in them:
find . -name "*_*_*.txt" -exec sh -c 'for f; do
mv "$f" "${f%%_*}_${f##*_}"
done' _ {} +
find will supply a leading path before each file name, so we don't need mv -- here (there will never be a file name which starts with a dash).
The parameter expansion ${f##pattern} produces the value of the variable f with the longest available match on pattern trimmed off from the beginning; ${f%%pattern} does the same, but trims from the end of the string.
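To make the difference between the operators concrete, here is a small sketch using one of the sample names with the leading ./ that find adds. The variable names are only for illustration.

```shell
f=./a_b_c_d_e.txt

first=${f%%_*}        # longest "_*" trimmed from the end   -> ./a
last=${f##*_}         # longest "*_" trimmed from the start -> e.txt
new=${first}_${last}  # -> ./a_e.txt

# The single-character forms trim the shortest match instead.
short_tail=${f%_*}    # -> ./a_b_c_d
short_head=${f#*_}    # -> b_c_d_e.txt

printf '%s\n' "$new" "$short_tail" "$short_head"
```

Note that the expansions work on the whole path, so a directory name containing an underscore would confuse them; splitting off the directory part first would be needed in that case.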

With your shown samples, please try the following pure Bash code, which relies on parameter expansion. The glob *_*_*.txt only matches .txt files with at least two underscores in their name, so a file like a_b.txt is not picked up, as required.
for file in *_*_*.txt
do
firstPart="${file%%_*}"
secondPart="${file##*_}"
newName="${firstPart}_${secondPart}"
mv -- "$file" "$newName"
done

This answer works for your example, but @tripleee's "find" approach is more robust.
for f in a_*.txt; do mv "$f" "${f%%_*}_${f##*_}"; done
Details: https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html / https://www.gnu.org/software/bash/manual/html_node/Pattern-Matching.html

Here's an alternate regexp for the given samples:
$ rename -n 's/_.*_/_/' *.txt
rename(a_b_c_d_e_f_g_h_i.txt, a_i.txt)
rename(a_b_c_d_e.txt, a_e.txt)
rename(a_b_c.txt, a_c.txt)

A different rename regex
rename 's/(\S_)[a-z_]*(\S\.txt)/$1$2/'
The same regex can be used with sed, or awk can be used inside a loop:
for a in a_*; do
name=$(echo "$a" | awk -F_ '{print $1 "_" $NF}'); #Or
#name=$(echo "$a" | sed -E 's/(\S_)[a-z_]*(\S\.txt)/\1\2/g');
mv "$a" "$name";
done

Run Regex using Grep/Sed recursively over files to store capture group

I have a file structure that looks like this:
Folder1
file1.feature
file2.feature
file3.feature
Folder2
file1.feature
file2.feature
...etc.
The files are Behat feature files which look like this:
Scenario: I am filling out a form
Given I am logged in as User
And I fill in "Name" with "My name"
Then I fill in "Email" with "myemail@example.com"
I am trying to iterate over each file within the file structure to get matches on my regex:
/I fill in "[^"]+" with "([^"]+)"/gm
The regex looks for I fill in "x" with "y", and I would like to store the capture group "y" from each file where a line in the file matches the expression.
So far I can iterate through the folders and print out the file names in my Bash script like so:
#!/bin/bash
cd behat/features
files="*/*.feature"
for f in $files
do
echo ${f}
done
I am trying to retrieve the capture group using Sed currently by doing this in my loop:
sed -r 's/^I fill in \"[^)]+\" with \"([^)]+)\"$/\1/'
But I fear that I am going down the wrong track, as this is returning all of the file content throughout all the files.

You may use
cd behat/features && find . -name '*.feature' -type f -print0 | xargs -0 \
sed -E -n 's/.*I fill in "[^"]+" with "([^"]+)"/\1/p' > outfile
This command changes to the behat/features directory, finds all files with the .feature extension (recursively), and prints the value of capture group 1 for each line matching your regex: the -n option suppresses the default output of every line, and the p flag prints only the lines on which a substitution was made.
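The -n / p combination can be tried on the sample feature text without touching any files. This is a small sketch; the function name extract_fill_values is hypothetical, and a trailing .* is added so that anything after the closing quote is dropped as well.

```shell
# Hypothetical helper: print only the second quoted value of each
# 'I fill in "x" with "y"' line; reads files given as arguments, or stdin.
extract_fill_values() {
    sed -E -n 's/.*I fill in "[^"]+" with "([^"]+)".*/\1/p' "$@"
}

sample='Scenario: I am filling out a form
Given I am logged in as User
And I fill in "Name" with "My name"
Then I fill in "Email" with "myemail@example.com"'

values=$(printf '%s\n' "$sample" | extract_fill_values)
printf '%s\n' "$values"
```

Lines that do not match the pattern produce no output at all, which is what keeps the rest of the file content out of the result.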
See more specific solutions for recursive file matching at How to do a recursive find/replace of a string with awk or sed? if need be.

Extract Filename before date Bash shellscript

I am trying to extract a part of the filename - everything before the date and suffix. I am not sure the best way to do it in bashscript. Regex?
The names are part of the filename. I am trying to store it in a shellscript variable. The prefixes will not contain strange characters. The suffix will be the same. The files are stored in a directory - I will use loop to extract the portion of the filename for each file.
Expected input files:
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
Expected Extract:
EXAMPLE_FILE
EXAMPLE_FILE_2
Attempt:
filename=$(basename "$file")
folder=sed '^s/_[^_]*$//)' $filename
echo 'Filename:' $filename
echo 'Foldername:' $folder

$ cat file.txt
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
$
$ cat file.txt | sed 's/_[0-9]*-[0-9]*-[0-9]*\.out$//'
EXAMPLE_FILE
EXAMPLE_FILE_2
$

No need for useless use of cat, expensive forks and pipes. The shell can cut strings just fine:
$ file=EXAMPLE_FILE_2_2017-10-12.out
$ echo ${file%%_????-??-??.out}
EXAMPLE_FILE_2
Read all about how to use the %%, %, ## and # operators in your friendly shell manual.

Bash itself has regex capability so you do not need to run a utility. Example:
for fn in *.out; do
[[ $fn =~ ^(.*)_[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} ]]
cap="${BASH_REMATCH[1]}"
printf "%s => %s\n" "$fn" "$cap"
done
With the example files, output is:
EXAMPLE_FILE_2017-09-12.out => EXAMPLE_FILE
EXAMPLE_FILE_2_2017-10-12.out => EXAMPLE_FILE_2
Using Bash itself will be faster, more efficient than spawning sed, awk, etc for each file name.
Of course in use, you would want to test for a successful match:
for fn in *.out; do
if [[ $fn =~ ^(.*)_[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2} ]]; then
cap="${BASH_REMATCH[1]}"
printf "%s => %s\n" "$fn" "$cap"
else
echo "$fn no match"
fi
done
As a side note, you can use Bash parameter expansion rather than a regex if you only need to trim the string after the last _ in the file name:
for fn in *.out; do
cap="${fn%_*}"
printf "%s => %s\n" "$fn" "$cap"
done
And then test $cap against $fn. If they are equal, the parameter expansion did not trim the file name after _ because it was not present.
The regex allows a test that a date-like string \d\d\d\d-\d\d-\d\d is after the _. Up to you which you need.

Code
^\w+(?=_)
Results
Input
EXAMPLE_FILE_2017-09-12.out
EXAMPLE_FILE_2_2017-10-12.out
Output
EXAMPLE_FILE
EXAMPLE_FILE_2
Explanation
^ Assert position at start of line
\w+ Match any word character (a-zA-Z0-9_) between 1 and unlimited times
(?=_) Positive lookahead ensuring what follows is an underscore _ character

Simply with sed:
sed 's/_[^_]*$//' file
The output:
EXAMPLE_FILE
EXAMPLE_FILE_2
----------
In case of iterating through the list of files with extension .out - bash solution:
for f in *.out; do echo "${f%_*}"; done

awk -F_ 'NF-=1' OFS=_ file
EXAMPLE_FILE
EXAMPLE_FILE_2
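A slightly more explicit form of the same idea, as a sketch: decrementing NF drops the last _-separated field, and awk rebuilds the record with OFS. This works in GNU awk and, as far as I know, in other common implementations, but that is an assumption worth checking on your system. The function name strip_last_field is hypothetical.

```shell
# Hypothetical helper: drop the last "_"-separated field of each line.
# Reads files given as arguments, or stdin when none are given.
strip_last_field() {
    # Lowering NF removes the last field; the record is rebuilt with OFS.
    awk -F_ -v OFS=_ '{ NF -= 1; print }' "$@"
}

result=$(printf '%s\n' 'EXAMPLE_FILE_2017-09-12.out' 'EXAMPLE_FILE_2_2017-10-12.out' | strip_last_field)
printf '%s\n' "$result"
```

Unlike the one-liner, which uses NF-=1 as a pattern and therefore skips lines without any underscore, this form prints an empty line for them.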

Could you please try this awk solution too, which handles all the .out files. Note that it was written and tested with GNU awk.
awk --re-interval 'FNR==1{if(val){close(val)};split(FILENAME, array,"_[0-9]{4}-[0-9]{2}-[0-9]{2}");print array[1];val=FILENAME;nextfile}' *.out
My awk version is old, so I am using --re-interval; with a recent version of GNU awk you should not need it.
Explanation: here is a non-one-liner form of the solution, with comments.
awk --re-interval '##Using --re-interval for supporting ERE in my OLD awk version, if OP has new version of awk it could be removed.
FNR==1{ ##Checking here condition that when very first line of any Input_file is being read then do following actions.
if(val){ ##Checking here if variable named val value is NOT NULL then do following.
close(val) ##close the Input_file named which is stored in variable val, so that we will NOT face problem of TOO MANY FILES OPENED, so it will be like one file read close it in background then.
};
split(FILENAME, array,"_[0-9]{4}-[0-9]{2}-[0-9]{2}");##Splitting FILENAME(which will have Input_file name in it) into array named array only, whose separator is a 4 digits-2 digits- then 2 digits, actually this will take care of YYYY-MM-DD format in Input_file(s) and it will be easier for us to get the file name part.
print array[1]; ##Printing array 1st element here.
val=FILENAME; ##Storing FILENAME variable value which will have current Input_file name in it to variable named val, so that we could close it in background.
nextfile ##nextfile, as its name suggests, skips all the remaining lines in the current file and jumps to the next file, to save some CPU cycles.
}
' *.out ##Mentioning all *.out Input_file(s) here.

sed / awk - remove space in file name

I'm trying to remove whitespace in file names and replace them.
Input:
echo "File Name1.xml File Name3 report.xml" | sed 's/[[:space:]]/__/g'
However the output
File__Name1.xml__File__Name3__report.xml
Desired output
File__Name1.xml File__Name3__report.xml

You named awk in the title of the question, didn't you?
$ echo "File Name1.xml File Name3 report.xml" | \
> awk -F'.xml *' '{for(i=1;i<=NF;i++){gsub(" ","_",$i); printf i<NF?$i ".xml ":"\n" }}'
File_Name1.xml File_Name3_report.xml
$
-F'.xml *' instructs awk to split on a regex, the requested extension plus 0 or more spaces
the loop {for(i=1;i<=NF;i++) is executed for all the fields into which the input line(s) is (are) split; note that the last field is empty (it is what follows the last extension), but we are going to take that into account...
the body of the loop
gsub(" ","_", $i) substitutes all the occurrences of space to underscores in the current field, as indexed by the loop variable i
printf i<NF?$i ".xml ":"\n" output different things, if i<NF it's a regular field, so we append the extension and a space, otherwise i equals NF, we just want to terminate the output line with a newline.
It's not perfect, it appends a space after the last filename. I hope that's good enough...
▶    A D D E N D U M    ◀
I'd like to address:
the little buglet of the last space...
some of the issues reported by Ed Morton
generalize the extension provided to awk
To reach these goals, I've decided to wrap the scriptlet in a shell function which, since it changes spaces into underscores, is named s2u
$ s2u () { awk -F'\.'$1' *' -v ext=".$1" '{
> NF--;for(i=1;i<=NF;i++){gsub(" ","_",$i);printf "%s",$i ext (i<NF?" ":"\n")}}'
> }
$ echo "File Name1.xml File Name3 report.xml" | s2u xml
File_Name1.xml File_Name3_report.xml
$
It's a bit different (better?) because it does not special-case the printing of the last field but instead special-cases the delimiter appended to each field; the idea of splitting on the extension remains.

This seems a good start if the filenames aren't delineated:
((?:\S.*?)?\.\w{1,})\b
( // start of captured group
(?: // non-captured group
\S.*? // a non-white-space character, then 0 or more any character
)? // 0 or 1 times
\. // a dot
\w{1,} // 1 or more word characters
) // end of captured group
\b // a word boundary
You'll have to look-up how a PCRE pattern converts to a shell pattern. Alternatively it can be run from a Python/Perl/PHP script.

Assuming you are asking how to rename file names, and not remove spaces in a list of file names that are being used for some other reason, this is the long and short way. The long way uses sed. The short way uses rename. If you are not trying to rename files, your question is quite unclear and should be revised.
If the goal is to simply get a list of xml file names and change them with sed, the bottom example is how to do that.
directory contents:
ls -w 2
bob is over there.xml
fred is here.xml
greg is there.xml
cd [directory with files]
shopt -s nullglob
a_glob=(*.xml);
for ((i=0;i< ${#a_glob[@]}; i++));do
echo "${a_glob[i]}";
done
shopt -u nullglob
# output
bob is over there.xml
fred is here.xml
greg is there.xml
# then rename them
cd [directory with files]
shopt -s nullglob
a_glob=(*.xml);
for ((i=0;i< ${#a_glob[@]}; i++));do
# I prefer 'rename' for such things
# rename 's/[[:space:]]/_/g' "${a_glob[i]}";
# but sed works, can't see any reason to use it for this purpose though
mv "${a_glob[i]}" "$(sed 's/[[:space:]]/_/g' <<< "${a_glob[i]}")";
done
shopt -u nullglob
result:
ls -w 2
bob_is_over_there.xml
fred_is_here.xml
greg_is_there.xml
globbing is what you want here because of the spaces in the names.
However, this is really a complicated solution, when actually all you need to do is:
cd [your space containing directory]
rename 's/[[:space:]]/_/g' *.xml
and that's it, you're done.
If on the other hand you are trying to create a list of file names, you'd certainly want the globbing method, which if you just modify the statement, will do what you want there too, that is, just use sed to change the output file name.
If your goal is to change the filenames for output purposes, and not rename the actual files:
cd [directory with files]
shopt -s nullglob
a_glob=(*.xml);
for ((i=0;i< ${#a_glob[@]}; i++));do
echo "${a_glob[i]}" | sed 's/[[:space:]]/_/g';
done
shopt -u nullglob
# output:
bob_is_over_there.xml
fred_is_here.xml
greg_is_there.xml

You could use rename:
rename --nows *.xml
This will replace all the spaces of the xml files in the current folder with _.
Sometimes it comes without the --nows option, so you can then use a search and replace:
rename 's/[[:space:]]/__/g' *.xml
Finally, you can use --dry-run if you just want to print the new file names without actually renaming anything.
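If no suitable rename is installed, Bash's own pattern substitution can do the same job. The following is a minimal sketch, not a drop-in replacement for rename: the function name underscore_spaces is hypothetical, it requires Bash (the ${var//pattern/replacement} expansion is not POSIX sh), and it does not check whether the target name already exists.

```shell
#!/usr/bin/env bash
# Hypothetical helper: in the given directory, rename every *.xml file so
# that each whitespace character in its name becomes an underscore.
underscore_spaces() {
    local dir="$1" f base new
    for f in "$dir"/*.xml; do
        [ -e "$f" ] || continue            # the glob matched nothing
        base=${f##*/}                      # file name without the directory
        new=${base//[[:space:]]/_}         # replace every whitespace character
        [ "$base" = "$new" ] || mv -- "$f" "$dir/$new"
    done
}
```

Calling underscore_spaces . would process the current directory; files whose names contain no whitespace are left alone.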

Use sed to capture only param of function

In my text file I have a special name (treated as a macro) i.e.: MYFUNCTION[myParam]
Now I'd like to use sed to get name of used param (in above example it is myParam) and based on it searched for a value in a dictionary which I use to replace text myParam.
I managed to create a sed instruction to find a group but it also print other words, i.e.:
echo "some text MYFUNCTION[paramName]" | sed -e "s/MYFUNCTION\[\([a-Z]*\)\]/\1/"
results in the following output:
some text paramName
I'd like to get in output just only:
paramName
How can I achieve that? Thanks for help :)

By including the text before MYFUNCTION in the regex as .*, like this (using [a-zA-Z], since [a-Z] is an invalid range in some locales):
echo "some text MYFUNCTION[paramName]" | sed -e "s/.*MYFUNCTION\[\([a-zA-Z]*\)\]/\1/"
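If the input can also contain text after the macro, or lines without the macro at all, a slightly stricter variant may help. This is a sketch under those assumptions; the function name extract_param is hypothetical. It adds a trailing .* to drop whatever follows the closing bracket, and uses -n with the p flag so that lines without a match print nothing.

```shell
# Hypothetical helper: print only the parameter of MYFUNCTION[...] for each
# input line that contains the macro; other lines produce no output.
extract_param() {
    sed -n 's/.*MYFUNCTION\[\([[:alpha:]]*\)\].*/\1/p'
}

param=$(echo "some text MYFUNCTION[paramName] more text" | extract_param)
printf '%s\n' "$param"
```

The [[:alpha:]] character class is used here in place of a letter range so the expression does not depend on the locale's collation order.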
