Simplest, Safe Method for Trimming File Paths - regex

I have a script that does a lot of file-processing, and it's good enough to receive its paths using null-characters as a separator for safety.
However, it processes all paths as absolute (saves some headaches), but these are a bit unwieldy for output purposes, so I'd like to remove a chunk of the path from my output. Plenty of options spring to mind, but the difficulty is in using them in a way that's safe for any arbitrary path I might encounter, which is where things get a bit trickier.
Here's a quick example:
#!/bin/sh
TARGET="$1"
find "$TARGET" -print0 | while IFS= read -rd '' path; do
# Process path for output here
path_str="$path"
echo "$path_str"
done
So in the above script I want to take path and remove TARGET from it, in the most compatible way possible (i.e. nothing bash-specific). It needs to remove only from the start of the string: with a target of /foo, /foo/bar becomes bar, /foo/bar/foo becomes bar/foo, and /bar/foo remains /bar/foo. It should also cope with any possible characters in a file name, including characters that some file-systems support such as tildes, colons etc., as well as pesky inverted quotation characters.
I've hacked together some messy solutions using sed by first escaping any characters that might break my regular expression, but this is a very messy way of doing things, so I'm hoping there are some simpler methods out there. In case there aren't, here's my solution so far:
SAFE_CHARS='s:\([[/.*]\):\\\1:g'
target_safe=$(printf '%s' "$TARGET" | sed "$SAFE_CHARS")
path_str=$(printf '%s' "$path" | sed "s/^$target_safe//g")
There's probably a few characters missing that I should be escaping in addition to those ones, and apologies for any typos.

To remove a prefix from a string, use the # form of parameter expansion:
$ TARGET=/foo/
$ path=/foo/bar
$ echo "${path#$TARGET}"
bar
The # operator for parameter expansion is part of the POSIX standard and will work in any POSIX-compliant shell.
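Dropped into the loop from the question, that looks like the sketch below (TARGET=/foo and the sample path are just illustrations). Quoting $TARGET inside the expansion stops any glob characters (*, ?, [) in the target from being treated as a pattern:

```shell
#!/bin/sh
TARGET=/foo
path=/foo/bar/baz

# Strip the target (and the following slash) from the front only;
# the inner quotes make $TARGET match literally, not as a glob.
path_str=${path#"$TARGET"/}
echo "$path_str"
# bar/baz
```

A path that doesn't start with the target, such as /bar/foo, is left untouched, which matches the behaviour asked for.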

You can try this simple find:
export TARGET="$1"
find "$TARGET" -exec bash -c 'sed "s|^$TARGET\/||" <<< "$1"' - '{}' \;

Related

How to use sed to migrate from process.env.MY_VAR to env.get('MY_VAR').required() using sed regex?

I'd like to migrate from the dotenv to the env-var npm package for a dozen repositories.
Therefore I am looking for a smart and easy way to search and replace a pattern on every file.
My goal is to move from this pattern process.env.MY_VAR to env.get('MY_VAR').required()
And to move from this pattern process.env.MY_VAR || DEFAULT_VALUE to env.get('MY_VAR').required().default('DEFAULT_VALUE')
For reference, I found this command clear; grep -r "process\.env\." --exclude-dir=node_modules | sed -r -n 's|^.*\.([[:upper:]_]+).*$|\1=|p' > .env.example to generate .env.example
Apparently I can use sed -e "s/pattern/result/" <file list> but I am not sure how to catch the pattern, and return this same pattern in the result.
You have already figured out the main parts of the answer, I think. But I'm unclear about what you refer to with MY_VAR: whether it's actually the name MY_VAR, or just a dummy name for all var-names consisting of only uppercase characters and underscores. I expect it to be the latter. Then you could go with something like this:
sed "s/\<process.env.\([A-Z_]*\)\>/env.get('\1').required()/" <file list>
This will read all the files and output them all to stdout with the replacement done. But I guess you should use -i for in-place replacement directly in the file (be careful!).
Since you have several replacements, you can give each one separately, like:
sed -i -e "s/pattern1/result1/" -e "s/pattern2/result2/" <file list>
NOTE: The thing described above could for sure be done in multiple other ways, this is only one solution to my interpretation of your problem!
I would suggest that you take some tutorials on regexps to start off with. It is a handy tool that is present in one form or another in most programming languages and programming tools (sed being just one such tool).
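For instance, piping an invented sample line through the replacement (GNU sed, which understands the \< and \> word boundaries):

```shell
# Sample input line is made up; any UPPER_CASE name after process.env. is captured.
printf '%s\n' 'const db = process.env.DB_HOST;' |
    sed "s/\<process.env.\([A-Z_]*\)\>/env.get('\1').required()/"
# const db = env.get('DB_HOST').required();
```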
sed -E '
s/(^|[^[:alnum:]_])process\.env\.([[:alnum:]_]+) \|\| ([[:alnum:]_]+)($|[^[:alnum:]_])/\1env.get('\''\2'\'').required().default('\''\3'\'')\4/g
s/(^|[^[:alnum:]_])process\.env\.([[:alnum:]_]+)($|[^[:alnum:]_])/\1env.get('\''\2'\'').required()\3/g
' myfile
It's essential that the two substitute commands happen in the above order, because the second pattern also matches the first pattern (which we don't want).
The pattern (^|[^[:alnum:]_]) is just a more portable version of the \< word boundary symbol.
Remember you can use the -i flag with sed to edit the file in place.
Running this on the third paragraph in your question (for example), we get:
My goal is to move from this pattern env.get('MY_VAR').required() to env.get('MY_VAR').required() And to move from this pattern env.get('MY_VAR').required().default('DEFAULT_VALUE') to env.get('MY_VAR').required().default('DEFAULT_VALUE')

Where is this Regex expression not closed in sed (apostrophe parenthesis)?

I'm trying to update some settings for WordPress and I need to use sed. When I run the below command, it seems to think the line is not finished. What am I doing wrong?
$ sed -i 's/define\( \'DB_NAME\', \'database_name_here\' \);/define\( \'DB_NAME\', \'wordpress\' \);/g' /usr/share/nginx/wordpress/wp-settings.php
> ^C
Thanks.
Single quotes in most shells don't support any escaping. If you want to include a single quote, you need to close the single quotes and add the single quote - either in double quotes, or backslashed:
sed 's/define\( '\''DB_NAME'\'', '\''database_name_here'\'' \);/define\( '\''DB_NAME'\'', '\''wordpress'\'' \);/g'
I fear it still wouldn't work for you, as \( is special in sed. You probably want just a simple ( instead.
sed 's/define( '\''DB_NAME'\'', '\''database_name_here'\'' );/define( '\''DB_NAME'\'', '\''wordpress'\'' );/g'
or
sed 's/define( '"'"'DB_NAME'"'"', '"'"'database_name_here'"'"' );/define( '"'"'DB_NAME'"'"', '"'"'wordpress'"'"' );/g'
Normally, using single quotes around a sed script is sensible. This is a case where double quotes would be a better choice — there are no shell metacharacters other than single quotes in the sed script:
sed -e "s/define( 'DB_NAME', 'database_name_here' );/define( 'DB_NAME', 'wordpress' );/g" /usr/share/nginx/wordpress/wp-settings.php
or:
sed -e "s/\(define( 'DB_NAME', '\)database_name_here' );/\1wordpress' );/g" /usr/share/nginx/wordpress/wp-settings.php
or even:
sed -e "/define( 'DB_NAME', 'database_name_here' );/s/database_name_here/wordpress/g" /usr/share/nginx/wordpress/wp-settings.php
One other option to consider is using sed's -f option to provide the script as a file. That saves you from having to escape the script contents from the shell. The downside may be that you have to create the file, run sed using it, and then remove the file. It is likely that's too painful for the current task, but it can be sensible — it can certainly make life easier when you don't have to worry about shell escapes.
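A sketch of that -f approach (the script file name wp-db.sed is arbitrary); since the shell never parses the script's contents, the single quotes need no escaping at all:

```shell
# Write the sed commands to a file; the shell does not interpret them.
cat > wp-db.sed <<'EOF'
s/define( 'DB_NAME', 'database_name_here' );/define( 'DB_NAME', 'wordpress' );/
EOF

echo "define( 'DB_NAME', 'database_name_here' );" | sed -f wp-db.sed
# define( 'DB_NAME', 'wordpress' );

rm wp-db.sed
```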
I'm not convinced the g (global replace) option is relevant; how many single lines are you going to find in the settings file containing two independent define DB_NAME operations with the default value?
You can add the -i option when you've got the basic code working. Do note that if you might ever work on macOS or a BSD-based system, you'll need to provide a suffix as an extra argument to the -i option: -i '' for a null suffix (no backup), or -i.bak to be able to work reliably on both Linux (or, more accurately, with GNU sed) and macOS/BSD (or, more accurately, with BSD sed). Appealing to POSIX is no help; it doesn't specify an overwrite option.
Test case (first example):
$ echo "define( 'DB_NAME', 'database_name_here' );" |
> sed -e "s/\(define( 'DB_NAME', '\)database_name_here' );/\1wordpress' );/g"
define( 'DB_NAME', 'wordpress' );
$
If the spacing around 'DB_NAME' is not consistent, then you'd end up with more verbose regular expressions, using [[:space:]]* in lieu of blanks, and you'd find the third alternative better than the others, but the second could capture both the leading and trailing contexts and use both captures in the replacement.
Parting words: this technique works this time because the patterns don't involve shell metacharacters like $ or backquotes. Very often, the script does need to match those, and then using mainly single quotes around the script argument is sensible. Tackling a different task — replace $DB_NAME in the input with the value of the shell variable $DB_NAME (leaving $DB_NAMEORHOST unchanged):
sed -e 's/$DB_NAME\([^[:alnum:]]\)/'"$DB_NAME"'\1/'
There are three separate shell strings, all concatenated with no spaces. The first is single-quoted and contains the s/…/ part of a s/…/…/ command; the second is "$DB_NAME", the value of the shell variable, double-quoted so that if the value of $DB_NAME is 'autonomous vehicle recording', you still have a single argument to sed; the third is the '\1/' part, which puts back whatever character followed $DB_NAME in the input text (with the observation that if $DB_NAME could appear at the end of an input line, this would not match it).
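A quick sanity check of those three concatenated strings (the input line and the variable's value are invented for the demonstration):

```shell
DB_NAME='autonomous vehicle recording'
# The literal text $DB_NAME in the input (single-quoted to sed) is
# replaced by the shell variable's value (double-quoted middle part).
printf '%s\n' 'name=$DB_NAME;' |
    sed -e 's/$DB_NAME\([^[:alnum:]]\)/'"$DB_NAME"'\1/'
# name=autonomous vehicle recording;
```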
Most regexes do fuzzy matching; you have to consider variations on what might be in the input to determine how hard your regular expressions have to work to identify the material accurately.

Use of grep + sed based on a pattern file?

Here's the problem: I have ~35k files that might or might not contain one or more of the strings in a list of 300 lines, each containing a regex.
If I run grep -rnwl 'C:\out\' --include=*.txt -E --file='comp.log', I see there are a few thousand files that contain a match.
Now how do I get sed to delete, in each of these files, every line containing one of the strings in the comp.log used before?
edit: comp.log contains a simple regex in each line, but for the most part each string to be matched is unique
This is an example of how it is structured:
server[0-9]\/files\/bobba fett.stw
[a-z]+ mochaccino
[2-9] CheeseCakes
...
etc. Silly examples aside, it goes to show each line is unique save for a few variations, so it shouldn't affect what I really want: to see if any of these lines match the lines in the file being worked on. It's no different than 's/pattern/replacement/' except that I want to use the patterns from the file instead of inline.
Ok, here's an update (S.O. gets impatient if I don't declare the question answered after a few days).
After MUCH fiddling with the #Kenavoz/#Fischer approach, I found a totally different solution, but first things first.
Creating a modified pattern list for sed to work with does work,
as does #werkritter's approach of dropping sed altogether (this one I find the... err... "least convoluted" way around the problem).
I couldn't make #Mklement's answer work under windows/cygwin (it did work under ubuntu, so... not sure what that means. Figures.)
What ended up solving the problem in a more long-term, reusable form was a wonderful program pointed out by a colleague, called PowerGrep. It really blows every other option out of the water. Unfortunately it's windows-only AND it's not free (not advertising here, the thing is not cheap, but it does solve the problem).
So, considering #werkritter's reply was not a "proper" answer and I can't choose both #Lars Fischer's and #Kenavoz's answers as a solution (they complement each other), I am awarding #Kenavoz the tickmark for being first.
Final thoughts: I was hoping for a simpler, universal and free solution, but apparently there isn't one.
You can try this:
sed -f <(sed 's/^/\//g;s/$/\/d/g' comp.log) file > outputfile
All the regexes in comp.log are formatted into sed addresses with a d command: /regex/d. This command deletes lines matching the patterns.
This inner sed's output is supplied as a file (via process substitution) to the -f option of the outer sed, which is applied to file.
To delete just the strings matching the patterns (not the whole line):
sed -f <(sed 's/^/s\//g;s/$/\/\/g/g' comp.log) file > outputfile
Update:
The command output is redirected to outputfile.
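To see what the inner sed generates, you can run it on the sample patterns from the question (written to a throwaway comp.log here):

```shell
# Recreate the example pattern file from the question.
printf '%s\n' 'server[0-9]\/files\/bobba fett.stw' '[a-z]+ mochaccino' > comp.log

# Wrap each pattern in /.../d, producing a line-deleting sed script.
sed 's/^/\//g;s/$/\/d/g' comp.log
# /server[0-9]\/files\/bobba fett.stw/d
# /[a-z]+ mochaccino/d

rm comp.log
```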
Some ideas, but not a complete solution, as it requires some adapting to your script (not shown in the question).
I would convert comp.log into a sed script containing the necessary deletes:
cat comp.log | sed -r "s+(.*)+/\1/ d;+" > comp.sed
That would make your example comp.sed look like:
/server[0-9]\/files\/bobba fett.stw/ d;
/[a-z]+ mochaccino/ d;
/[2-9] CheeseCakes/ d;
then I would apply the comp.sed script to each file reported by grep (With your -rnwl that would require some filtering to get the filename.):
sed -i.bak -f comp.sed $AFileReportedByGrep
If you have GNU sed, you can use -i for in-place replacement creating a .bak backup; otherwise, pipe to a temporary file.
Both Kenavoz's answer and Lars Fischer's answer use the same ingenious approach:
transform the list of input regexes into a list of sed match-and-delete commands, passed as a file acting as the script to sed via -f.
To complement these answers with a single command that puts it all together, assuming you have GNU sed and your shell is bash, ksh, or zsh (to support <(...)):
find 'c:/out' -name '*.txt' -exec sed -i -r -f <(sed 's#.*#/\\<&\\>/d#' comp.log) {} +
find 'c:/out' -name '*.txt' matches all *.txt files in the subtree of directory c:/out
-exec ... + passes as many matching files as will fit on a single command line to the specified command, typically resulting only in a single invocation.
sed -i updates the input files in-place (conceptually speaking - there are caveats); append a suffix (e.g., -i.bak) to save backups of the original files with that suffix.
sed -r activates support for extended regular expressions, which is what the input regexes are.
sed -f reads the script to execute from the specified filename, which in this case, as explained in Kenavoz's answer, uses a process substitution (<(...)) to make the enclosed sed command's output act like a [transient] file.
The s/// sed command - which uses alternative delimiter # to facilitate use of literal / - encloses each line from comp.log in /\<...\>/d to yield the desired deletion command; the enclosing of the input regex in \<...\> ensures matching as a word, as grep -w does.
This is the primary reason why GNU sed is required, because neither POSIX EREs (extended regular expressions) nor BSD/OSX sed support \< and \>.
However, you could make it work with BSD/OSX sed by replacing -r with -E, and \< / \> with [[:<:]] / [[:>:]].

sed search and replace between string and last occurence of character

I currently have a bunch of .md5sum files with an md5sum hash value and its corresponding file name with a full absolute path. I'd like to modify these files from absolute paths to relative ones. I think I have it pretty close.
> cat example.md5sum
197f76c53d2918764cfa6463b7221dec /example/path/to/file/example.null
> cat example.md5sum | sed 's/( ).*\// \.\//'
197f76c53d2918764cfa6463b7221dec /example/path/to/file/example.null
Throwing the regex ( ).*\/ into notepad++ returns /example/path/to/file/ which is what I want. Moving it over to sed does not produce the same match.
The end goal here as mentioned previously is the following:
197f76c53d2918764cfa6463b7221dec ./example.null
Looks like a job for sed.
sed -i.bak 's:/.*/:./:' file ...
The -i option tells sed to modify files "in-place" rather than sending the results to stdout. With the substitute command, you can use alternate delimiters -- in this case, I've used a colon, since the text you're matching and using as replacement includes slashes. Makes things easier to read.
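Trying the substitution on the sample line from the question:

```shell
# The greedy /.*/ pattern matches from the first slash to the last one,
# so the whole directory part collapses to ./
printf '%s\n' '197f76c53d2918764cfa6463b7221dec /example/path/to/file/example.null' |
    sed 's:/.*/:./:'
# 197f76c53d2918764cfa6463b7221dec ./example.null
```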
I haven't bothered to match the whitespace before the path, because an md5sum file has a pretty predictable format.
Back up your input files before experimenting.
Note that this is shell agnostic -- you can run it in tcsh or bash or anything else that is able to launch sed with options.

Create directory based on part of filename

First of all, I'm not a programmer — just trying to learn the basics of shell scripting and trying out some stuff.
I'm trying to create a function for my bash script that creates a directory based on a version number in the filename of a file the user has chosen in a list.
Here's the function:
lav_mappe () {
    shopt -s failglob
    echo "[--- Choose zip file, or x to exit ---]"
    echo ""
    echo ""
    select zip in $SRC/*.zip
    do
        [[ $REPLY == x ]] && . $HJEM/build
        [[ -z $zip ]] && echo "Invalid choice" && continue
        echo
        grep ^[0-9]{1}\.[0-9]{1,2}\.[0-9]{1,2}$ $zip; mkdir -p $MODS/out/${ver}
    done
}
I've tried messing around with some other commands too:
for ver in $zip; do
    grep "^[0-9]{1}\.[0-9]{1,2}\.[0-9]{1,2}$" $zip; mkdir -p $MODS/out/${ver}
done
And also find | grep — but I'm doing it wrong :(
But it ends up saying "no match" for my regex pattern.
I'm trying to take the filename the user has selected, then grep it for the version number (ALWAYS x.xx.x somewhere in the filename), and finally create a directory with just that.
Could someone give me some pointers what the command chain should look like? I'm very unsure about the structure of the function, so any help is appreciated.
EDIT:
Ok, this is how the complete function looks like now: (Please note, the sed(1) commands besides the directory creation is not created by me, just implemented in my code.)
Pastebin (Long code.)
I've got news for you. You are writing a Bash script, you are a programmer!
Your Regular Expression (RE) is of the "wrong" type. Vanilla grep uses a form known as "Basic Regular Expressions" (BRE), but your RE is in the form of an Extended Regular Expression (ERE). BRE's are used by vanilla grep, vi, more, etc. EREs are used by just about everything else, awk, Perl, Python, Java, .Net, etc. Problem is, you are trying to look for that pattern in the file's contents, not in the filename!
There is an egrep command, or you can use grep -E, so:
echo $zip|grep -E '^[0-9]\.[0-9]{1,2}\.[0-9]{1,2}$'
(note that single quotes are safer than double). By the way, you use ^ at the front and $ at the end, which means the filename ONLY consists of a version number, yet you say the version number is "somewhere in the filename". You don't need the {1} quantifier, that is implied.
BUT, you don't appear to be capturing the version number either.
You could use sed (we also need the -E):
ver=$(echo $zip| sed -E 's/.*([0-9]\.[0-9]{1,2}\.[0-9]{1,2}).*/\1/')
The \1 on the right means "replace everything (that's why we have the .* at front and back) with what was matched in the parentheses group".
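For example, with a made-up archive name (the filename is purely illustrative):

```shell
zip='MyRom-4.21.3-signed.zip'   # hypothetical filename
ver=$(echo "$zip" | sed -E 's/.*([0-9]\.[0-9]{1,2}\.[0-9]{1,2}).*/\1/')
echo "$ver"
# 4.21.3
```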
That's a bit clunky, I know.
Now we can do the mkdir (there is no merit in putting everything on one line, and it makes the code harder to maintain):
mkdir -p "$MODS/out/$ver"
${ver} is unnecessary in this case, but it is a good idea to enclose path names in double quotes in case any of the components have embedded white-space.
So, good effort for a "non-programmer", particularly in generating that RE.
Now for Lesson 2
Be careful about using this solution in a general loop. Your question specifically uses select, so we cannot predict which files will be used. But what if we wanted to do this for every file?
Using the solution above in a for or while loop would be inefficient. Calling external processes inside a loop is always bad. There is nothing we can do about the mkdir without using a different language like Perl or Python. But sed, by its nature, is iterative, and we should use that feature.
One alternative would be to use shell pattern matching instead of sed. This particular pattern would not be impossible in the shell, but it would be difficult and raise other questions. So let's stick with sed.
A problem we have is that echo output places a space between each field. That gives us a couple of issues. sed delimits each record with a newline "\n", so echo on its own won't do here. We could replace each space with a newline, but that would be an issue if there were spaces inside a filename. We could do some trickery with IFS and globbing, but that leads to unnecessary complications. So instead we will fall back to good old ls. Normally we would not want to use ls, shell globbing is more efficient, but here we are using the feature that it places a newline after each filename (when its output is redirected through a pipe).
while read ver
do
    mkdir "$ver"
done < <(ls $SRC/*.zip | sed -E 's/.*([0-9]{1}\.[0-9]{1,2}\.[0-9]{1,2}).*/\1/')
Here I am using process substitution, and this loop will only call ls and sed once. BUT, it calls the mkdir program n times.
Lesson 3
Sorry, but that's still inefficient. We are creating a child process for each iteration; creating a directory needs only one kernel API call, yet we are spawning a whole process just for that. Let's use a more sophisticated language like Perl:
#!/usr/bin/perl
use warnings;
use strict;

my $SRC = '.';
for my $file (glob("$SRC/*.zip"))
{
    $file =~ s/.*([0-9]{1}\.[0-9]{1,2}\.[0-9]{1,2}).*/$1/;
    mkdir $file or die "Unable to create $file; $!";
}
You might like to note that your RE has made it through to here! But now we have more control, and no child processes (mkdir in Perl is a built-in, as is glob).
In conclusion, for small numbers of files, the sed loop above will be fine. It is simple, and shell based. Calling Perl just for this from a script will probably be slower since perl is quite large. But shell scripts which create child processes inside loops are not scalable. Perl is.