I have a couple of files in a directory, in which I have a piece of text between two separators.
Text to keep
//###==###
Text to remove
//###==###
Text to keep
After an extensive search, I found the following Mac OS X Terminal command, with which I can remove the separators themselves.
perl -pi -w -e 's|//###==###||g' `find . -type f`
However, I need a regex that removes not only the separators themselves but also everything in between. Something like this, although this line doesn't do anything:
perl -pi -w -e 's|//###==###(.*)//###==###||g' `find . -type f`
EDIT AFTER DUPLICATE FLAG
I see something similar here, using the scalar range operator, but I cannot make it work for me. Failed attempts include:
perl -pi -w -e 's|//###==###..//###==###||g' `find . -type f`
perl -pi -w -e 's|//###==###(..)//###==###||g' `find . -type f`
perl -pi -w -e 's|//###==###[..]//###==###||g' `find . -type f`
SOLUTION
With the help of dawg below, the following oneliner will do exactly what I want:
$ perl -0777 -p -i -e 's/(^\s*^\/\/###==###.*?\/\/###==###\s*)//gms' `find . -type f -name "index.php"`
You can use:
s/(^\s*^\/\/###==###.*?\/\/###==###\s*)//gms
Working Demo
To do the same in the terminal with Perl, given:
$ echo "$tgt"
Text to keep
//###==###
Text to remove
//###==###
Text to keep
Use the -0777 command-line switch to slurp the whole file, and then:
$ echo "$tgt" | perl -0777 -ple 's/(^\s*^\/\/###==###.*?\/\/###==###\s*)//gms'
Text to keep
Text to keep
Or you can use the range operator. Done this way, however, you cannot also strip the blank lines before and after the block, if that is your intent:
$ echo "$tgt" | perl -lne 'print unless (/\/\/###==###/ ... /\/\/###==###/)'
Text to keep
Text to keep
I'm trying to pipe the output of a find command to a perl one-liner to replace a line that ends with ?> with RedefineForDocker::standardizeXmlmc() but for some reason the value isn't being replaced. I've checked the output of the find command and it is performing as expected, and I've double checked my regex and it should match.
find . -name *.php -exec ggrep -Ezl 'class XmlMethodCall.*([?]>)$' {} \; \
| xargs perl -ewpn -i.bak2 \
"s/[?]>\s*?$/RedefineForDocker::standardizeXmlmc()\n/gm"
I get no warnings and no indication that it isn't working, the backups are created, but the file remains unchanged. The list of matched files returned by the find command is below.
./swsupport/clisupp/trending/services/data.helpers.php
./swsupport/clisupp/_bpmui/arch/service/data.helpers.php
./swsupport/clisupp/_bpmui/itsm/service/data.helpers.php
./swsupport/clisupp/_bpmui/itsm_default/service/data.helpers.php
./webclient_code/php/session.php
./webclient_code/service/storedquery/helpers.php
./php/_phpinclude/itsm/xmlmc/xmlmc.php
./php/_phpinclude/itsmf/xmlmc/xmlmc.php
./php/_phpinclude/itsm_default/xmlmc/xmlmc.php
Here is an example of one of the files it should match
https://regex101.com/r/BUoCif/1
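Run your perl command like this:
perl -i.bak2 -wpe 's/\?>\h*$/RedefineForDocker::standardizeXmlmc()\n/gm'
The order of the command-line options is important here: -e takes whatever follows it as the program text, so in a bundle such as -wpe it must come last. With -ewpn, the program becomes the string "wpn".
The full pipeline should look like this: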
find . -name '*.php' -exec ggrep -PZzl '(?ms)class XmlMethodCall.*\?>\h*$' {} + |
xargs -0 perl -i.bak2 -wpe 's/\?>\h*$/RedefineForDocker::standardizeXmlmc()\n/gm'
Note the use of the -Z option in grep and the -0 option in xargs to handle filenames that contain whitespace and other special characters.
I have many directories for different projects. Under some project directories, there are subdirectories named "matlab_programs". Only in subdirectories named matlab_programs, I would like to replace the string 'red' with 'blue' in files ending with .m.
The following perl code will recursively replace the strings in all *.m files, regardless of what subdirectories the files are in.
find . -name "*.m" | xargs perl -p -i -e "s/red/blue/g"
And to find the full paths of all directories called matlab_programs,
find . -type d -name "matlab_programs"
How can I combine these so I only replace strings if the files are in a subdirectory called matlab_programs?
Perl has the excellent File::Find module, which lets you specify a callback to be called on each file.
So you can specify complex compound criteria, like this:
#!/usr/bin/env perl
use strict;
use warnings;
use File::Find;
sub find_files {
return unless m/\.m\z/; # skip any files that don't end in .m (return, not next: this is a sub, not a loop)
if ( $File::Find::dir =~ m/matlab_programs$/ ) {
print $File::Find::name, " found\n";
}
}
find( \&find_files, "." );
And then you can do whatever you wish with the files you find - like opening/text replacing and closing.
You want to find all directories named matlab_programs using
find . -type d -name "matlab_programs"
and then execute
find $f -name "*.m" | xargs perl -p -i -e "s/red/blue/g"
on all results $f. Judging by your use of xargs, there are no special characters such as spaces in your file names, so the following should work:
find `find . -type d -name "matlab_programs"` -name "*.m" |
xargs perl -p -i -e "s/red/blue/g"
or
find . -type d -name "matlab_programs" |
while read f
do
find $f -name "*.m"
done |
xargs perl -p -i -e "s/red/blue/g"
Incidentally, I'd use single quotes here; I always use them whenever the quoted string is to be taken literally.
Do you have bash? The $(...) syntax works like backticks (the way both the shell and Perl use them) but they can be nested.
perl -pi -e s/red/blue/g $(find $(find . -type d -name matlab_programs) -type f -name \*.m)
Many flavors of find also support a -path pattern test, so you can combine both conditions into a single argument:
perl -pi -e s/red/blue/g $(find . -type f -path \*/matlab_programs/\*.m)
I'm trying to batch rename text files according to a string they contain.
I used sed to isolate the pattern with \( and \) as I couldn't get this to work in grep.
sed -i '' 's/<title>\(.*\)<\/title>/&/g' *.txt | mv *.txt $sed.txt
(The text I want to use as the filename is between HTML title tags.)
Where I wrote $sed would be the output of sed.
Hope that's clear!
A simple loop in bash can accomplish this. If each file is valid HTML, meaning you have only one <title> tag in the file, you can rename them all this way:
for file in *.txt; do
mv "$file" `sed -n 's/<title>\([^<]*\)<\/title>/\1/p;' "$file" | sed -e 's/[[:blank:]][[:blank:]]*/_/g'`.txt
done
So, if you have files 1.txt, 2.txt and 3.txt, each with cat, dog and my hippo in their TITLE tags, you'll end up with cat.txt, dog.txt and my_hippo.txt after the above loop.
EDIT: quoted $file in case there are spaces in filenames, and added a second sed to convert any whitespace in the <title> tag to _'s in the resulting filenames. NOTE: [[:blank:]] in the second sed command matches a space or a tab character.
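A slightly more defensive variant of the loop, as a sketch: it assumes the title sits on one line, takes only the first match, and skips files with no title instead of renaming them to ".txt". The sample files and their contents are invented for the demo.

```shell
#!/usr/bin/env bash
# Sketch: rename each .txt file after its <title>, skipping files without one.
# 1.txt and 2.txt are hypothetical sample files.
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
printf '<html><head><title>my hippo</title></head></html>\n' > "$workdir/1.txt"
printf 'no title here\n' > "$workdir/2.txt"

for file in "$workdir"/*.txt; do
  # First <title> found on a single line; tr turns spaces into underscores.
  title=$(sed -n 's/.*<title>\([^<]*\)<\/title>.*/\1/p' "$file" | head -n 1 | tr ' ' '_')
  if [ -z "$title" ]; then
    echo "skip: $file (no title found)"
    continue
  fi
  mv -- "$file" "$workdir/$title.txt"
done
ls "$workdir"
```

Two files with the same title would still overwrite each other, so checking for an existing target may be worth adding.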
You can enclose an expression in backticks (`) to have its output inserted at that place. Try:
mv *.txt `sed -i '' 's/<title>\(.*\)<\/title>/&/g' *.txt`.txt
It is not very flexible, but it should work.
(I haven't used it in a while and cannot test it now, so I might be wrong).
Here is the command I would use:
for i in *.txt ; do
sed "s=<title>\(.*\)</title>=mv '$i' '\1'=e" $i
done
The sed substitution searches for the pattern in each of your .txt files. For each file it builds the string mv 'file_name' 'found_pattern'.
With the e flag (a GNU sed extension) at the end of the sed command, the resulting string is executed directly in the shell, which renames your files.
Some hints:
Note the use of = instead of / as the delimiter for the sed substitution: it is more readable because the pattern already contains a /, and this way you don't have to escape it (you could use many other symbols if you don't like =).
The e flag for sed executes the created string.
(I'm speaking of this one below:
sed "s=<title>\(.*\)</title>=mv '$i' '\1'=e" $i
                                          ^
)
So use it with caution! I would recommend first running the line without the final e: it won't execute any mv command, but will just print what would be executed if you were to add the e.
What I read from your question is:
you have a number of text (html) files in a directory
each file contains at least the tag <title> ... </title>
you want to extract the content (elements.text) and use it as filename
last you want to rename that file to the extracted filename
Is this correct?
So, then you need to loop through the files, e.g. with xargs or find:
ls *.txt | xargs -i\{\} command "{}" ...
find . -maxdepth 1 -type f -name '*.txt' -exec command "{}" ... \;
I always write the xargs placeholder as -i\{\} because the resulting command stays compatible with find, which uses {} as its own placeholder.
The -maxdepth option keeps find from descending into subdirectories; if there are none, you can leave it out.
command could be something very simple like echo "Testing File: {}" or a really small script if you use it with bash:
find . -name '*.txt' -exec bash -c 'CUR_FILE="{}"; echo "Working on: $CUR_FILE"; ls -l "$CUR_FILE";' \;
The big decision for your question is how to get the text from the title element.
A simple solution (suitable if the opening and closing tags are on the same line) would be to use grep
A more solid solution is to use an HTML parser and navigate the DOM
The simple solution is based on:
get the title line
remove everything before and after the title content
So, putting it together:
ls *.txt | xargs -i\{\} bash -c 'TITLE=$(egrep "<title>[^<]*</title>" "{}"); NEW_FNAME=$(echo "$TITLE" | sed -e "s#.*<title>\([^<]*\)</title>.*#\1#"); mv -v "{}" "$NEW_FNAME.txt"'
Same with usage of find:
find . -maxdepth 1 -type f -name '*.txt' -exec bash -c 'TITLE=$(egrep "<title>[^<]*</title>" "{}"); NEW_FNAME=$(echo "$TITLE" | sed -e "s#.*<title>\([^<]*\)</title>.*#\1#"); mv -v "{}" "$NEW_FNAME.txt"' \;
Hopefully it is what you expected.
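One possible refinement, sketched here under the same single-line-title assumption: pass the file name to bash -c as a positional argument instead of embedding {} in the script text, so names containing quotes or other shell metacharacters are not reinterpreted. The file "it's 1.txt" is a made-up example chosen to contain an apostrophe.

```shell
#!/usr/bin/env bash
# Sketch: find hands each file name to bash -c as $1 rather than inside the script text.
# "it's 1.txt" is a hypothetical file name containing a quote and a space.
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
printf '<title>cat</title>\n' > "$workdir/it's 1.txt"

find "$workdir" -maxdepth 1 -type f -name '*.txt' -exec bash -c '
  file=$1
  title=$(sed -n "s#.*<title>\([^<]*\)</title>.*#\1#p" "$file" | head -n 1)
  target="$(dirname "$file")/$title.txt"
  # Rename only when a title was found and the name actually changes.
  if [ -n "$title" ] && [ "$file" != "$target" ]; then
    mv -v -- "$file" "$target"
  fi
' _ {} \;
ls "$workdir"
```

The lone _ fills $0 inside the inline script, so the file name lands in $1.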
Consider the following bash script with a simple regular expression:
for f in "$FILES"
do
echo $f
sed -i '/HTTP|RT/d' $f
done
This script should read every file in the directory specified by FILES and remove the lines containing 'HTTP' or 'RT'. However, it seems that the OR part of the regular expression is not working. That is, if I just have sed -i '/HTTP/d' $f, it removes all lines containing HTTP, but I cannot get it to remove lines with either HTTP or RT.
What must I change in my regular expression so that lines with HTTP or RT are removed?
Thanks in advance!
Two ways of doing it (at least):
Make sed treat your regex as an extended regular expression:
sed -E -i '/HTTP|RT/d' $f
Specifying each token separately:
sed -i '/HTTP/d;/RT/d' $f
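To see the difference without touching any file, here is a sketch that writes results to stdout only. The file log.txt and its lines are invented for the demo. In a basic regular expression the | is an ordinary character, which is why the original command deleted nothing.

```shell
#!/usr/bin/env bash
# Sketch: basic vs. extended regex alternation in sed, with no in-place editing.
# log.txt is a hypothetical sample file.
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT
printf '%s\n' 'keep this line' 'GET HTTP/1.1' 'RT @someone hello' 'also keep' \
  > "$workdir/log.txt"

# Basic regex: searches for the literal text "HTTP|RT", which no line contains.
basic=$(sed '/HTTP|RT/d' "$workdir/log.txt")

# Extended regex: | means "or". Print first to preview what would be deleted.
would_delete=$(sed -n -E '/HTTP|RT/p' "$workdir/log.txt")
remaining=$(sed -E '/HTTP|RT/d' "$workdir/log.txt")

printf 'would delete:\n%s\n\nremaining:\n%s\n' "$would_delete" "$remaining"
```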
Before you do anything, run with the opposite, and PRINT what you plan to DELETE:
sed -n -e '/HTTP/p' -e '/RT/p' $f
Just to be sure you are deleting only what you want to delete before actually changing the files.
"It's not a question of whether you are paranoid or not, but whether you are paranoid ENOUGH."
Well, first of all, the loop iterates over the contents of the FILES variable itself (as a single word, since it is quoted), not over the files in that directory.
If you want it to do all files in the FILES directory, then you need something like this:
for f in $( find $FILES -maxdepth 1 -type f )
do
echo $f
sed -i -e '/HTTP/d' -e '/RT/d' $f
done
You just need two "-e" options to sed.
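An alternative sketch that uses a plain glob instead of find, assuming FILES holds a directory path. Here it points at a scratch directory with two made-up log files. Quoting "$f" keeps names with spaces intact, and -i.bak (suffix attached) leaves a backup of each file.

```shell
#!/usr/bin/env bash
# Sketch: loop over regular files in the directory named by FILES.
# a.log and "b b.log" are hypothetical sample files.
set -eu
FILES=$(mktemp -d)
trap 'rm -rf "$FILES"' EXIT
printf '%s\n' 'HTTP line' 'plain line' > "$FILES/a.log"
printf '%s\n' 'RT line' 'another plain line' > "$FILES/b b.log"

for f in "$FILES"/*; do
  [ -f "$f" ] || continue          # skip subdirectories and other non-files
  echo "$f"
  sed -i.bak -e '/HTTP/d' -e '/RT/d' "$f"
done
```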
Hello I'm seeking a Perl one-liner if possible, to scan all of our Javascript files, to find so-called "rogue commas". That is, commas that come at the end of an array or object data structure, and therefore commas that come immediately before either an ']' or '}' character.
The main challenge I'm encountering is how to make the regex that checks for ] or } non-greedy. The regex needs to span multiple lines, since the comma could end one line, followed by the } or ] on the next line, but I've figured out how to do that with the help of the book Minimal Perl.
Also, I'd like to be able to pipe a number of files to this Perl regex (via find/xargs), and so I'd like to print the name of the input file, and the line number within that file.
Below are various attempts of mine, straight from my bash history, none of which is particularly close to working. Thanks in advance:
find winhome/workspace/SsuExt4Zoura/quotetool/js -name "*.js" | xargs perl -00 -wnl -e '/,\s+$/ and print $_;'
find winhome/workspace/SsuExt4Zoura/quotetool/js -name "*.js" | xargs perl -00 -wnl -e '/,\s+/ and print $_;'
find winhome/workspace/SsuExt4Zoura/quotetool/js -name "*.js" | xargs perl -00 -wnl -e '/,\s+\]/ and print $_;'
find winhome/workspace/SsuExt4Zoura/quotetool/js -name "*.js" | xargs perl -00 -wnl -e '/,\s+[\]\}]/ and print $_;'
find winhome/workspace/SsuExt4Zoura/quotetool/js -name "*.js" | xargs perl -00 -wnl -e '/,\s+[\]\}]/ and print $_;' | wc -l
find winhome/workspace/SsuExt4Zoura/quotetool/js -name "*.js" | xargs perl -00 -wnl -e '/,\s+[\]\}]/ and print $_;' | wc -l
find winhome/workspace/SsuExt4Zoura/quotetool/js -name "*.js" | xargs perl -00 -wnl -e '/,\s+}/ and print $_;' | wc -l
find winhome/workspace/SsuExt4Zoura/quotetool/js -name "*.js" | xargs perl -00 -wnl -e '/,\s+}?/ and print $_;' | wc -l
find winhome/workspace/SsuExt4Zoura/quotetool/js -name "*.js" | xargs perl -00 -wnl -e '/,\s+}+?/ and print $_;' | wc -l
find winhome/workspace/SsuExt4Zoura/quotetool/js -name "*.js" | xargs perl -00 -wnl -e '/,$/' and print $_;'
find winhome/workspace/SsuExt4Zoura/quotetool/js -name "*.js" | xargs perl -00 -wnl -e '/,$/ and print $_;'
find winhome/workspace/SsuExt4Zoura/quotetool/js -name "*.js" | xargs perl -00 -wnl -e '/\,$/ and print $_;'
With the -00 switch, you change the record separator to paragraph mode, so each record is a block of text delimited by blank lines (use -0777 if you want the whole file as a single record). That lets the regex match trailing commas that span lines, but it also makes print $_ print the whole record. What you probably want is printing the file name:
print $ARGV if /,\s*[\]\}]/;
Most of these look like a decent approach to the problem, with one small issue. You probably want ,\s*(?:$|[\]\}]) rather than ,\s+(?:$|[\]\}]) as there may not be even one space. Your + quantifier might miss forms like ,].
Having said that, JavaScript can be pretty subtle, and you might well encounter comments and other stuff, which might legitimately end with a comma before something unexpected, like the end of the file or a }. A cheap solution might be to use a perl s/// form to simply remove all the comments before applying your tests.
If you're handling JSON, JSON::XS rejects trailing commas by default, so it can enforce validity; its relaxed option is what allows them.
If you need real validation, something like JSLint is probably the way to go. I've had a lot of success with using Rhino to embed JavaScript (a bit less using Perl with SpiderMonkey) and using this as a set of tests against JavaScript code would be a nice way to ensure reliability over time.
An easy solution to this problem is to use comma-first style. Since commas never come at the end of a line, there is never a 'trailing comma'.
For example:
var myObj = { foo: 1
, bar: 2
, baz: 4
}
You can easily detect if a comma is missing, it's obvious which elements belong to what set of braces, and there's never a 'trailing comma problem'.
See also https://gist.github.com/357981