Move around the characters in a filename - regex

I've got a folder full of files with the names abc1234, abc5678, etc., and I want to switch them to abc3412, abc7856, etc. – just swap the last two characters out with the second-to-last two characters. The filenames are all in this format, no surprises. What's the easiest way to do this with a regex?

Use perl rename?
rename 's/(..)(..)$/$2$1/' *
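If you want a dry run first, the Perl-based rename utility (sometimes installed as prename or file-rename) accepts -n to only report what it would do; a quick sketch:
rename -n 's/(..)(..)$/$2$1/' *
# prints the planned renames (e.g. abc1234 -> abc3412) without touching any files;
# the exact output format varies between rename versions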

Depending on your platform, you may have a rename utility that can directly do what you want.
For instance, anishsane's answer shows an elegant option using a Perl-based renaming utility.
Here's a POSIX-compliant way to do it:
printf '"%s"\n' * | sed 'p; s/\(..\)\(..\)"$/\2\1"/' | xargs -L 2 mv
printf '"%s"\n' * prints all files in the current folder line by line (if there are subdirs., you'd have to exclude them), enclosed in literal double-quotes.
sed 'p; s/\(..\)\(..\)"$/\2\1"/' produces 2 output lines:
p prints the input line as-is.
s/\(..\)\(..\)"$/\2\1"/' matches the last 2 character pairs (before the closing ") on the input lines and swaps them.
The net effect is that each input line produces a pair of output lines: the original filename, and the target filename, each enclosed in double-quotes.
xargs -L 2 mv then reads pairs of input lines (-L 2) and invokes the mv utility with each line as its own argument, which results in the desired renaming. Having each line enclosed in double-quotes ensures that xargs treats it as a single argument, even if it contains whitespace.
Tip of the hat to anishsane for the enclose-in-literal-double-quotes approach, which makes the solution robust.
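To see what the first two stages hand to xargs, here's a quick sketch with a single hypothetical filename (abc1234):
printf '"%s"\n' abc1234 | sed 'p; s/\(..\)\(..\)"$/\2\1"/'
# "abc1234"
# "abc3412"
# xargs -L 2 mv then runs: mv "abc1234" "abc3412"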
Note: If you're willing to use non-POSIX features, you can simplify the command as follows, to bypass the need for extra quoting:
GNU xargs:
printf '%s\n' * | sed 'p; s/\(..\)\(..\)$/\2\1/' | xargs -d '\n' -L 2 mv
The nonstandard option -d '\n' tells xargs not to perform word splitting on its input and to treat each line as a single argument.
BSD xargs (also works with GNU xargs):
printf '%s\n' * | sed 'p; s/\(..\)\(..\)$/\2\1/' | tr '\n' '\0' | xargs -0 -L 2 mv
tr '\n' '\0' replaces newlines with NUL (\0) characters, which the nonstandard -0 option then tells xargs to use as the input separator, again ensuring that each input line is treated as a single argument.

sed regex command to output lines that end with html?

I need a sed regex command that will output every line in a file that ends with 'html', and does NOT start with 'a'.
Would my current code work?
sed 's/\[^a]\*\.\(html)\/p' text.txt
The sed command would be
sed -n '/^[^a].*html$/p'
But the canonical command to print matching lines is grep:
grep '^[^a].*html$'
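For instance, given a hypothetical file.txt like the one below, only lines that end in html and don't start with a survive:
printf 'index.html\nabout.html\nnotes.txt\n' > file.txt   # hypothetical sample
grep '^[^a].*html$' file.txt
# index.html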
sed just overcomplicates things... you can use grep to handle that easily!
egrep "^[^a].+\.html$" -f sourcefile > text.txt
//loads file from within the program egrep
egrep "^[^a].+\.html$" < sourcefile > text.txt
//Converts stdin file descriptor with the input redirect
//to sourceFile for this stage of the` pipeline
are equivalent functionally.
or
pipe input | xargs -n1 egrep "^[^a].+\.html$" > text.txt
//xargs -n1 takes the arguments it reads from the pipe and runs the specified command with one of them at a time
// ^ means from start of line,
//. means any one character
//+ means the previous matched expression (which can be a
//(pattern group), a backreference like \1, or a [range], etc.) one or more times
//\. means escape the any-character match and match a literal period character
//$ means end of line (just before the newline character)
//egrep is grep with extended regular expressions (ERE), which are really nice
(assuming you aren't using a pipe or cat, etc)
You can convert a newline-delimited file into a single input line with this command:
tr '\n' ' ' < file
//It converts every newline into a space!
Anyway, get creative with simple utilities and you can do a lot:
xargs, grep and tr are a good combo that's easy to learn. Without the sedness of it all.
Don't do this with sed. Do it with two different calls to grep
grep -v ^a file.txt | grep 'html$'
The first grep gets all the lines that do not start with "a", and sends the output from that into the second grep, which pulls out all the lines that end with "html".

Modify sed command to catch substrings

I have the example text:
"are "insulin sensitizers" "
I am trying to use the below command to find and replace the XML quote entity (&quot;) with a single quote, but it only works for the first one; the second one, which follows "sensitizers", is left unchanged.
command:
grep -rl """ ./ | xargs sed -i "s/"/'/"
result:
"are 'insulin sensitizers" "
desired result:
"are 'insulin sensitizers' "
You will need to use g flag in sed for global substitution:
grep -Zirl """ . | xargs -0 sed -i "s/"/'/g"
Also note the use of the -Z option in grep and -0 in xargs to take care of filenames with special characters and whitespace.
As per man grep:
-Z, --null
Output a zero byte (the ASCII NUL character) instead of the character that normally follows a file name.
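A quick before/after sketch, assuming GNU grep/sed and a hypothetical sample.xml holding the example text:
printf '"are &quot;insulin sensitizers&quot; "\n' > sample.xml   # hypothetical sample file
grep -Zirl "&quot;" . | xargs -0 sed -i "s/&quot;/'/g"
cat sample.xml
# "are 'insulin sensitizers' "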

Search and Replace String from text file Ubuntu

I have to replace the following string:
//#Config(manifest
with the string below:
#Config(manifest
So for this I created the following regex:
\/\/#Config\(manifest
And tried
grep -rl \/\/#Config\(manifest . | xargs sed -i "\/\/#Config\(manifest#Config\(manifest/g"
But I am getting the following error:
sed: -e expression #1, char 38: Unmatched ( or \(
I have to search recursively and do this operation, but I am stuck with the above error.
grep -rl '//#Config(manifest' | xargs sed -i 's|//#Config(manifest|#Config(manifest|g'
Specifying . for the current directory is optional with grep -r.
sed allows any character other than backslash or newline to be used as the delimiter.
Edit
If file name contains spaces, use
grep -rlZ '//#Config(manifest' | xargs -0 sed -i 's|//#Config(manifest|#Config(manifest|g'
Explanation (assumes GNU version of commands)
grep
-r performs recursive search
-l option outputs only filenames instead of matched patterns
-Z outputs a zero byte (ASCII NUL character) after each file name instead of usual newline
'pattern' by default, grep uses BRE (basic regular expression) where characters like ( do not have special meaning and hence need not be escaped
xargs -0 tells xargs to separate arguments by the ASCII NUL character
sed
-i inplace edit, use -i.bkp if you want to create backup of original files
s|pattern|replace|g the g flag tells sed to search and replace all occurrences. sed also defaults to BRE, so there is no need to escape (. Using \( would mean the start of a capture group, hence the error when sed doesn't find the closing \)
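A small end-to-end sketch, assuming GNU grep/sed and a hypothetical App.java in the current directory:
printf '//#Config(manifest\n' > App.java   # hypothetical sample file
grep -rlZ '//#Config(manifest' | xargs -0 sed -i 's|//#Config(manifest|#Config(manifest|g'
cat App.java
# #Config(manifest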

Grep any whitespace character including newline in single pattern

I'm trying to make the 'perfect' command to show any .php file in a dir or its subdirs that contains eval code.
Since there are many, many false positives, I'm after a solution that would strip at least the most obvious of them - so my target is:
the word eval, followed by any whitespace char including newline zero or more times, followed by the open bracket char (;
Here are my shots:
find . -type f -exec grep -l "eval\s*(" {} \; | grep ".php"
Works great but somehow \s* here doesn't match newline characters, so
eval
("some nasty obfuscated code");
is below the radar.
I've also tried with:
find . -type f -exec grep -l "eval[[:space:]]*(" {} \; | grep ".php"
with the same results.
If I understood you correctly, I believe this line here is what you're looking for:
find . -name '*.php' -exec grep -Ezl 'eval\s*\(' {} +
the -z is what you've been missing, see explanation below.
and of course you could give the find command any other root instead of . and just add arguments and conditions according to where you are looking and what you are looking for.
That was it. From here on, explanations:
The find command
It would probably be faster in most cases to first search for files with .php extension, and then search only within these files for your regular expression. The -name '*.php' part gives us this behavior by searching only for files with a file name ending with '.php'.
-exec allows us to execute a command using the output of the find command (file names). We are using it in order to execute grep for all php files.
This syntax, {} +, at the end of the line creates one long list of file names as arguments for the grep command, instead of executing grep separately for every file.
The grep command
-E: Interpret PATTERN as an extended regular expression (copied from the grep man page)
-z: Treat the input as a set of lines, each terminated by a zero byte instead of a newline (grep man page). That means that for a normal textual file, the whole file would be treated as one long line. This behavior allows you to use multi-lined regular expressions.
-l: tells grep to only show the filenames for all the files matching the search, and not to show the matching lines.
The regular expression:
'eval' just matches the word eval.
'\s' matches any whitespace character, and the '*' after it means it could appear zero or more times. This '\(' matches an actual bracket, which in this case needs escaping (and that's what the \ is for).
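A quick sanity check, assuming GNU grep and a hypothetical /tmp/evil.php where the call is split across lines as in the question (using /tmp as the root just for the example):
printf '<?php\neval\n("some nasty obfuscated code");\n' > /tmp/evil.php   # hypothetical sample
find /tmp -name '*.php' -exec grep -Ezl 'eval\s*\(' {} +
# /tmp/evil.php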
have fun!
Simple Version:
For simplicity's sake, to cater to your need but using awk instead of grep (if this is possible), then for php files in /tmp/ you could simply do:
awk -v RS="^$" '/eval[[:space:]]*\(/ { print FILENAME }' /tmp/*.php
And that will print the files that match.
If you need to use the output of find:
find /tmp/ -iname "*.php" -print | while read file ; do awk -v RS="^$" '/eval[[:space:]]*\(/ { print FILENAME }' "$file" ; done
The above is simple and works even with busybox and basic versions of awk.
Alternate (With matches)
This part of the answer may seem absurd to some, but with enough experience searching for whitespace and doing serialisation in the shell, the number of "gotchas" becomes evident, and the need for a working solution makes the preference for built-in one-liners take a back seat.
This might also help others stumbling across a similar need, but requiring easy-to-read line previews, maybe for parsing, or for simplicity:
NOTE 1: This solution works in sh/ash/busybox as well as bash (the external binary xxd would still be needed)
NOTE 2: For BSD grep, substitute -P with -E. Using -E on a GNU grep that has support for -P, seems to not yield the same lookahead matches
Example Test File
Take this test file (with special characters notated in place), plus 2 other test files that are located in /tmp/ for this example:
find /tmp/ -iname "*.php" -print \
| while read file ; do hexdump -ve '1/1 " %02X"' "$file" \
| sed -E "s/($)/ 0A/g" \
| grep -P -o "65 76 61 6C( 09| 0A| 0B| 0C| 0D| 20)*? 28 22.+?0A" \
| sed -E -e 's/ //g' \
| sed -E -e 's/(0A)+([^$])/20\2/g' \
| sed -E -e 's/(09|0B|0C|0D|20)+/20/g' \
| xxd -r -p \
| grep -i "eval" && printf "$file matches\n\n" ; done
This will return the matches, from eval to the end of the line where the (" was matched, substituting line breaks and spaces with a single space for readability:
eval ("some nasty obfuscated code (LF / LINE FEED)");
eval ("some nasty obfuscated code (HT / TAB)");
eval ("some nasty obfuscated code (SP / SPACE)");
eval ("some nasty obfuscated code (FF / FORM FEED)");
eval ("some nasty obfuscated code (CR / CARRIAGE RETURN)");
eval ("some nasty obfuscated code (VT / VERTICAL TAB)");
eval ("some nasty obfuscated code (LF > HT > FF > CR > LF > LF > HT > VT > LF > HT > SP)");
eval ("some nasty obfuscated code (VT / VERTICAL TAB)");
/tmp/eval.php matches
eval ("some nasty obfuscated code (LF / LINE FEED)");
/tmp/eval_no_trailing_line_feed.php matches
eval("\$str = \"$str\";");
/tmp/eval_w3_example.php matches
For just the file matches using this method (maybe to allow for a "-v" option for example), just change grep -i on the last line to grep -iq
Explanation:
find /tmp/ -iname "*.php" -print \ : Find .php files in /tmp/
| while read file ; do hexdump -ve '1/1 " %02X"' "$file" \ : hexdump each resulting file, and output in single space separated bytes (to avoid any matching from the second character of one byte to the first char of another byte)
| sed -E "s/($)/ 0A/g" \ : Put a single 0A (line feed) at the very end of the file that matches - This means it will match a file that does not have a trailing line feed (sometimes can cause some issues with text processing)
| grep -P -o "65 76 61 6C( 09| 0A| 0B| 0C| 0D| 20)*? 28 22.+?0A" \ : Return only match (note that grep adds a line break to each match)
65 76 61 6C : eval
09 : horizontal TAB
0A : line feed
0B : vertical TAB
0C : form feed
0D : carriage return
20 : plain SPACE
28 22 : ("
| sed -E -e 's/ //g' \ : Remove all spaces between bytes (may not have been needed in the end)
| sed -E -e 's/(0A)+([^$])/20\2/g' \ : Look for any repeated occurrences of 0A (line feed), as long as they are not the line feed at the end of the line, and replace them with a single space (20)
| sed -E -e 's/(09|0B|0C|0D|20)+/20/g' \ : Look for any of the white space characters above, and replace them with a space, for readability
| xxd -r -p \ : Revert back from hex
| grep -i "eval" && printf "$file matches\n\n" ; done : Print the match, and the file name (the && means that printf will only print the file match, if the output of grep was 0 (success), therefore it won't simply print every file in the loop. (as noted before, adding -q into this grep will still evaluate for the purpose of printf, but will not output the matching lines.

replace \n\t pattern in a file

OK, I have a recordset that is pipe-delimited.
I am checking the number of delimiters on each line, as they have started including | in the data (and we cannot change the incoming file).
While using a great awk to parse out the bad records into a bad file for processing, we discovered that some data has a newline character (\n) followed by a tab (\t).
I have tried sed to replace \n\t with just \t, but it always either changes the \n\t to \r\n or replaces all the \n (the file uses \r\n for line endings).
Yes, to answer some questions below...
files can be large, 200+ MB
the line feed is in the data spuriously (not every row, but enough to be a pain)
I have tried
sed ':a;N;$!ba;s/\n\t/\t/g' Clicks.txt >test2.txt
sed 's/\n\t/\t/g' Clicks.txt >test1.txt
sample record
12345|876|testdata\n
\t\t\t\tsome text|6209\r\n
would like
12345|876|testdata\t\t\t\tsome text|6209\r\n
please help!!!
NOTE must be in KSH (MKS KSH to be specific)
I don't care if it is sed or not... I just need to correct the issue...
Several of the solutions below work on small data or do part of the job...
As an aside, I have started playing with removing all line feeds and then replacing the carriage return with carriage return + line feed, but can't quite get that to work either.
I have tried tr, but since it works on single characters it only does part of the job:
tr -d '\n' <test.txt
leaves me with a \r-ended file...
I need to get it to \r\n (and no, no dos2unix or unix2dos exists on this system).
If the input file is small (and you therefore don't mind processing it twice), you can use
cat input.txt | tr -d "\n" | sed 's/\r/\r\n/g'
Edit:
As I should have known by now, you can avoid using cat just about everywhere.
I had reviewed my old answers on SO for UUOC, and carefully checked for a possible filename in the tr usage. As Ed pointed out in his comment, cat can be avoided here as well.
The command above can be improved to:
tr -d "\n" < input.txt | sed 's/\r/\r\n/g'
It's unclear what you are trying to do but given this input file:
$ cat -v file
12345|876|testdata
some text|6209^M
Is this what you're trying to do:
$ gawk 'BEGIN{RS=ORS="\r\n"} {gsub(/\n/,"")} 1' file | cat -v
12345|876|testdata some text|6209^M
The above uses GNU awk for multi-char RS. Alternatively with any awk:
$ awk '{rec = rec $0} /\r$/{print rec; rec=""}' file | cat -v
12345|876|testdata some text|6209^M
The cat -vs above are just there to show where the \rs (^Ms) are.
Note that the solution below reads the input file as a whole into memory, which won't work for large files.
Generally, Ed Morton's awk solution is better.
Here's a POSIX-compliant sed solution:
tab=$(printf '\t')
sed -e ':a' -e '$!{N;ba' -e '}' -e "s/\n${tab}/${tab}/g" Clicks.txt
Keys to making this POSIX-compliant:
POSIX sed doesn't recognize \t as an escape sequence, so a literal tab - via variable $tab, created with tab=$(printf '\t') - must be used in the script.
POSIX sed - or at least BSD sed - requires label names (such as :a and the a in ba above) - whether implied or explicit - to be terminated with an actual newline, or, alternatively, terminated implicitly by continuing the script in the next -e option, which is the approach chosen here.
-e ':a' -e '$!{N;ba' -e '}' is an established Sed idiom that simply "slurps" the entire input file (uses a loop to read all lines into its buffer first). This is the prerequisite for enabling subsequent string substitution across input lines.
Note how the option-argument for the last -e option is a double-quoted string so that the references to shell variable $tab are expanded to actual tabs before Sed sees them. By contrast, \n is the one escape sequence recognized by POSIX sed itself (in the regex part, not the replacement-string part).
Alternatively, if your shell supports ANSI C-quoted strings ($'...'), you can use them directly to produce the desired control characters:
sed -e ':a' -e '$!{N;ba' -e '}' -e $'s/\\n\t/\t/g' Clicks.txt
Note how the option-argument for the last -e option is an ANSI C-quoted string, and how literal \n (which is the one escape sequence that is recognized by POSIX Sed) must then be represented as \\n. By contrast, $'...' expands \t to an actual tab before Sed sees it.
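A quick check of the POSIX-friendly variant against the sample record from the question (the Clicks.txt file here is just built with printf for illustration):
printf '12345|876|testdata\n\t\t\t\tsome text|6209\r\n' > Clicks.txt   # hypothetical sample
tab=$(printf '\t')
sed -e ':a' -e '$!{N;ba' -e '}' -e "s/\n${tab}/${tab}/g" Clicks.txt
# output (using the question's notation): 12345|876|testdata\t\t\t\tsome text|6209\r\n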
Thanks everyone for all your suggestions. After looking at all the answers, none quite did the trick... After some thought, I came up with:
tr -d '\n' <Clicks.txt | tr '\r' '\n' | sed 's/\n/\r\n/g' >test.txt
Delete all newlines.
Translate all carriage returns to newlines.
Use sed to replace each newline with carriage return + line feed.
This works in seconds on a 32 MB file.