I'm writing a Bash script to clean up my music collection.
I wanted it to format all the file names, so with a little internet searching I wrote this line:
sed -i -e 's/[-_]/ /g' -e 's/ \+/ /g' -e **'s/\<[a-z]/\U&/g'** -e "s/$artist //g" -e "s/$album //g"
Which I used to add the file names to a text file and then sed it, but then I didn't know how to apply the new names to the files.
So then I started experimenting with rename and managed to get the exact same result
except for the bolded parts, which is supposed to make every first letter in a word uppercase.
rename 's/[-_]/ /g' * && rename 's/\s+/ /g' * && **rename 's/\s\w{1}/*A-Z*/g' *** && rename 's/^\d+[[:punct:]]\s//g' * && rename "s/$artist\s//g" * && rename "s/$album\s//g" * && rename "s/($ext)//g" *
Now, the pattern in rename is working (satisfactorily at least), finding only one letter after a space character, but it's the replacement that is problematic. I've tried numerous different approaches, all leaving me with the result that the matched letter gets replaced with the literal text A-Z in this case.
The rename manual page says that to make lower case uppercase you do 's/a-z/A-Z/g', but it's easy to see that this only applies when the name contains the literal text a-z.
So this is what I need help with.
A bonus would be if someone knows how to do it like in the sed example, where the \< matches the beginning of each word, because at the moment, my rename command won't apply to the very first word and neither will it apply if there are multiple discs looking like "Disc name [Disc 1]" for obvious reasons.
This is sort-of a Perl question, since rename is written in Perl, and instructions for how to perform the renaming are a Perl command.
In a s/// in order for the substitution to know which letter to insert the upper-case version of, it has to ‘capture’ the letter from the input. Parentheses in the pattern do this, storing the captured letter in the variable $1. And \u in a substitution makes the next character upper-case.
So you can do:
$ rename 's/\s(\w)/ \u$1/g' *
Note that the replacement part has to insert a space before the upper-case letter, because the pattern includes a space and so both the space and the original letter are being replaced. You can avoid this by using \b, a zero-width assertion which only matches at a word boundary:
$ rename 's/\b(\w)/\u$1/g' *
Also, you don't need the {1} in there, because \w (like other symbols in regexes) matches a single character by default.
Finally, the example in rename(1) is actually y/A-Z/a-z/, using the y/// operator, not s///. y/// is a completely different operator, which replaces all occurrences of one set of letters with another; that isn't of use to you here, where it's only some characters you want making upper-case.
rename -nv 's{ (\A|\s) (\w+) }{$1\u$2}xmsg' *
This looks for the beginning of the string (\A) or a whitespace character (\s), followed by one or more word characters (\w+: letters, digits, underscore), and uppercases the first character of each such word. The -n option only prints what would be renamed; drop it to actually rename the files.
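If rename is not available, the sed expressions from the question can also be applied to the files themselves with a plain loop. This is only a sketch under stated assumptions: it needs GNU sed (\<, \+ and \U are GNU extensions), it leaves out the $artist and $album expressions, and clean_name and rename_clean are hypothetical helper names.

```shell
# Hypothetical helper: prints the cleaned-up version of one file name.
# Assumes GNU sed; \<, \+ and \U are not available in every sed.
clean_name() {
    printf '%s\n' "$1" | sed -e 's/[-_]/ /g' -e 's/ \+/ /g' -e 's/\<[a-z]/\U&/g'
}

# Hypothetical helper: renames every regular file in the given directory
# (default: the current one) to its cleaned-up name.
rename_clean() {
    dir=${1:-.}
    for path in "$dir"/*; do
        [ -f "$path" ] || continue
        old=${path##*/}
        new=$(clean_name "$old")
        # mv -n refuses to overwrite an existing file
        [ "$old" = "$new" ] || mv -n -- "$path" "$dir/$new"
    done
}
```

Calling rename_clean with no argument works on the current directory. As with the rename version, the letter after a dot counts as the start of a word, so file extensions are capitalized too.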
I want to do a string replacement on any string that is surrounded by a word boundary that is alphanumeric and is 14 characters long. The string must contain at least one capitalized letter and one number. I know (I think I know) that I'll need to use positive look ahead for the capitalized letter and number. I am sure that I have the right regex pattern. What I don't understand is why sed is not matching. I have used online tools to validate the pattern like regexpal etc. Within those tools, I am matching the string like I expect.
Here is the regex and sed command I'm using.
\b(?=.*[A-Z])(?=.*[0-9])[a-zA-Z0-9]{14}\b
The sed command I'm testing with is
echo "asdfASDF1234ds" | sed 's/\b(?=.*[A-Z])(?=.*[0-9])[a-zA-Z0-9]{14}\b/NEW_STRING/g'
I would expect this to match on the echoed string.
sed understands a very limited form of regex. It does not have lookahead.
Using a tool with more powerful regex support is the simple solution.
If you must use sed, you could do something like:
$ sed '
# mark delimiters
s/[^a-zA-Z0-9]\{1,\}/\n&\n/g
s/^[^\n]/\n&/
s/[^\n]$/&\n/
# mark 14-character candidates
s/\n[a-zA-Z0-9]\{14\}\n/\n&\n/g
# mark if candidate contains capital
s/\n\n[^\n]*[A-Z][^\n]*\n\n/\n&\n/g
# check for a digit; if found, replace
s/\n\n\n[^\n]*[0-9][^\n]*\n\n\n/NEW_STRING/g
# remove marks
s/\n//g
' <<'EOD'
a234567890123n
,a234567890123n,
xx,a234567890123n,yy
a23456789A123n
XX,a23456789A123n,YY
xx,a23456789A1234n,yy
EOD
a234567890123n
,a234567890123n,
xx,a234567890123n,yy
NEW_STRING
XX,NEW_STRING,YY
xx,a23456789A1234n,yy
$
This might work for you (GNU sed):
sed -E 's/\<[A-Za-z0-9]{14}\>/\n&\n/
s/\n.*(([A-Z].*[0-9])|([0-9].*[A-Z])).*\n/NEW_STRING/
s/\n//g' file
Isolate a 14 alphanumeric word by delimiting it with newlines.
If the string between the newlines contains at least one uppercase alpha character and at least one numeric character, replace the string and its delimiters by NEW_STRING.
Remove the delimiters.
Or if multiple strings, perhaps:
sed -E 's/\b/\n/g
s#.*#echo "&"|sed -E "/^[a-z0-9]{14}$/I{/[A-Z]/{/[0-9]/s/.*/NEW_STRING/}}"#e
s/\n//g' file
sed doesn't support lookaheads, or many many many other modern regex Perlisms. The simple fix is to use Perl.
perl -pe 's/\b(?=.*[A-Z])(?=.*[0-9])[a-zA-Z0-9]{14}\b/NEW_STRING/g' <<< "asdfASDF1234ds"
Using a regular expression, I need to match only the IPv4 subnet mask from the given input string:
ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
For testing this input string is contained in a text file called file.txt, however the actual use case will be to parse /proc/cmdline, and I will need a solution that starts parsing, counting fields, and matching after encountering "ip=" until the next white space character.
I'm using bash 4.2.46 with GNU grep 2.20 on an EL 7.9 workstation, x86_64 to test the expression.
Based on examples I've seen looking at other questions, I've come up with the following grep command and PCRE regular expression which gives output that is very close to what I need.
[user@ws01 ~]$ grep -o -P '(?<!:)(?:\:[0-9])(.*?)(?=:)' file.txt
:255.255.254.0
My understanding of what I've done here is that, I've started with a negative lookbehind with a ":" character to try and exclude the first "::" field, followed by a non capturing group to match on an escaped ":" character, followed by a number, [0-9], then a capturing group with .*?, for the actual match of the string itself, and finally a look ahead for the next ":" character.
The problem is that this gives the desired string, but includes an extra : character at the beginning of the string.
Expected output should look like this:
255.255.254.0
What's making this tricky for me to figure out is that the delimiters are not consistent. The string includes both double colons, and single colon fields, so I haven't been able to just simply match on the string between the delimiters. The reason for this is because a field can have an empty value. For example
:<null>:ip:gw:netmask:hostname:<null>:off
Null is shown here to indicate an omitted value not passed by the user, that the user does not need to provide for the intended purpose.
I've tried a few different expressions as suggested in other answers that use negative look behinds and look aheads to not start matching at a : which is neighbored by another :
For example, see this question:
Regular Expression to find a string included between two characters while EXCLUDING the delimiters
If I can start matching at the first single colon, by itself, which is not followed by or preceded by another : character, while excluding the colon character as the delimiter, and continue matching until the next single colon which is also not neighboring another : and without including the colon character, that should match the desired string.
I'm able to match the exact string by including "255" in an expression like this: (Which will work for all of our present use cases)
[user@ws01 ~]$ grep -o -P '(?:)255.*?(?=:)' file.txt
255.255.254.0
The logic problem here is that the subnet mask itself, may not always start with "255", but it should be a number, [0-9] which is why I'm attempting to use that in the expression above. For the sake of simplicity, I don't need to validate that it's not greater than 255.
Using gnu-grep you could write the pattern as:
grep -oP '(?<!:):\K\d{1,3}(?:\.\d{1,3}){3}(?=:(?!:))' file.txt
Output
255.255.254.0
Explanation
(?<!:): Negative lookbehind, assert not : to the left and then match :
\K Forget what is matched until now
\d{1,3}(?:\.\d{1,3}){3} Match 4 times 1-3 digits separated by .
(?=:(?!:)) Positive lookahead, assert : that is not followed by :
See a regex demo.
Using grep
$ grep -oP '(?<!:)?:\K([0-9.]+)(?=:[[:alpha:]])' file.txt
View Demo here
or
$ grep -oP '[^:]*:\K[^:[:alpha:]]*' file.txt
Output
255.255.254.0
If these are delimiters, your value should be in a clearly predictable place.
Just treat every colon as a delimiter and select the 4th field.
$: awk -F: '{print $4}' <<< ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
255.255.254.0
I'm not sure what you mean by
What's making this tricky for me to figure out is that the delimiters are not consistent. The string includes both double colons, and single colon fields, so I haven't been able to just simply match on the string between the delimiters.
If your delimiters aren't predictable and parse-able, they are useless. If you mean the fields can have or not have quotes, but you need to exclude quotes, we can do that. If double colons are one delimiter and single colons are another that's horrible design, but we can probably handle that, too.
$: awk -F'::' '{ split($2,x,":"); print x[2];}' <<< ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off
255.255.254.0
For quotes, you need to provide an example.
Since the number of fields is always the same, simply separated by ":", you can use cut.
That solution will also work if you have empty fields.
cut -d":" -f4
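For the /proc/cmdline use case mentioned in the question, the ip= argument first has to be isolated from the other kernel arguments before the fields are counted. A minimal sketch, assuming the arguments are separated by spaces or tabs; netmask_from_cmdline is a hypothetical helper name and the sample command line in the comment is made up for illustration.

```shell
# Hypothetical helper: reads a kernel command line on stdin and prints the
# 4th colon-separated field of its ip= argument (the netmask).
netmask_from_cmdline() {
    # Put each argument on its own line, keep the one starting with ip=,
    # then let cut count the fields (empty fields are counted as well).
    tr -s ' \t' '\n\n' | grep '^ip=' | cut -d: -f4
}

# Example with a made-up command line:
# printf '%s\n' 'ro ip=10.0.20.100::10.0.20.1:255.255.254.0:ws01.example.com::off quiet' | netmask_from_cmdline
```

On a real system this would be invoked as netmask_from_cmdline < /proc/cmdline.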
I have this regex that I would like to use with sed. I would like to use sed, since I want to batch process a few thousand files and my editor does not like that
Find: "some_string":"ab[\s\S\n]+"other_string_
Replace: "some_string":"removed text"other_string_
Find basically matches everything between some_string and other_string, including special chars like , ; - or _ and replaces it with a warning that text was removed.
I was thinking about combining the character classes [[:space:]] and [[:alnum:]], which did not work.
In MacOS FreeBSD sed, you can use
sed -i '' -e '1h;2,$H;$!d;g' -e 's/"some_string":"ab.*"other_string_/"some_string":"removed text"other_string_/g' file
The 1h;2,$H;$!d;g part reads the whole file into memory so that all line breaks are exposed to the regex, and then "some_string":"ab.*"other_string_ matches text from "some_string":"ab till the last occurrence of "other_string_ and replaces with the RHS text.
You need to use -i '' with FreeBSD sed to enforce inline file modification.
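To see the effect without modifying any file, the same script can be run as a filter on standard input. This is a sketch only; strip_between is a hypothetical helper name, and the two-line sample in the comment is made up.

```shell
# Hypothetical helper: same sed script as above, but reading stdin and
# writing stdout instead of editing a file in place.
strip_between() {
    # 1h;2,$H;$!d;g collects all lines in the hold space and, on the last
    # line, copies them back so the s command sees the whole input at once.
    sed -e '1h;2,$H;$!d;g' \
        -e 's/"some_string":"ab.*"other_string_/"some_string":"removed text"other_string_/g'
}

# Example with a made-up two-line input:
# printf '%s\n' '{"some_string":"ab line one' 'line two"other_string_x}' | strip_between
```

Once the output looks right, the -i form can be used on the real files.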
By the way, if you decide to use perl, you really can use the -0777 option to enable file slurping with the s modifier (that makes . match any chars including line break chars) and use something like
perl -i -0777 -pe 's/"some_string":"\Kab.*(?="other_string_)/removed text/gs' file
Here,
"some_string":" - matches literal text
\K - omits the text matched so far from the current match memory buffer
ab - matches ab
.* - any zero or more chars as many as possible
OR .*? - any zero or more chars as few as possible
(?="other_string_) - a positive lookahead (that matches the text but does not append to the match value) making sure there is "other_string_ immediately on the right.
I'm having difficulty understanding a number-parsing sed command I saw in this article:
sed -i ':a;s/\B[0-9]\{3\}\>/,&/;ta' numbers.txt
I'm a sed newbie, so this is what I've been able to figure out:
& adds to what's already there rather than substitutes
the :a; ... ;ta calls the substitution recursively on the line until the search finds no more returns
Here's what I am hoping folks can explain
What does -i do? I can't seem to find it on the man pages though I'm sure it's there.
I'm a little fuzzy on what the \B is accomplishing here? Perhaps it helps with the left-right parsing priority, but I don't see how. So lastly...
Most importantly, why does this execute right to left instead of left to right? For example, which part of the command keeps this from doing something like: 1234566778,9 ---> 1234,566,778,9
Breaking this command down:
sed -i ':a;s/\B[0-9]\{3\}\>/,&/;ta' numbers.txt
-i # in-place editing: save the changes back to the input file
\B # opposite of \b: matches where there is NO word boundary (here, between two digits)
[0-9] # match any digit
\{3\} # match exactly 3 of them
\> # end of word
& # the matched text, reused in the replacement
:a # define label a
ta # jump back to label a if the substitution succeeded
Yes, this sed command does start the match/replacement at the rightmost 3 digits and keeps moving left as long as it finds another group of 3 digits preceded by a digit.
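Update: However, rather than running this inefficient sed command in a loop, I recommend this much simpler and faster awk instead (the ' flag, written here as \047, groups digits using the current locale's thousands separator, so it needs a locale that defines one):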
awk '/^[0-9]+$/{printf "%\047.f\n", $1}' file
20,130,607,215,015
607,220,701
992,171
Where input file is:
cat file
20130607215015
607220701
992171
The matching is greedy, i.e. it matches the leftmost three digits NOT preceded by a word boundary and followed by the word boundary, i.e. the rightmost three digits. After inserting the comma, the "goto" makes it match again, but the comma introduced a new word boundary, so the match happens earlier.
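The right-to-left behaviour is easier to follow when each pass of the loop is printed. The sketch below assumes GNU sed (\B and \> are GNU extensions) and splits the script into separate -e expressions, which is equivalent to the one-line form; add_commas and trace_commas are hypothetical helper names.

```shell
# Hypothetical helper: same loop as the command above, as a stdin filter.
add_commas() {
    sed -e ':a' -e 's/\B[0-9]\{3\}\>/,&/' -e 'ta'
}

# Hypothetical helper: prints the pattern space at the start of every pass,
# so each inserted comma shows up on its own line.
trace_commas() {
    sed -n -e ':a' -e 'p' -e 's/\B[0-9]\{3\}\>/,&/' -e 'ta'
}

echo 1234567 | trace_commas
# prints:
# 1234567
# 1234,567
# 1,234,567
```

In the second pass, 567 can no longer match because the comma in front of it creates a word boundary, so \B fails there and the match moves left to 234.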
I am trying to replace all the words except the first 3 words from the String (using textpad).
Ex value: This is the string for testing.
I want to extract just 3 words: This is the from above string and remove all other words.
I figured out the regex to match the 3 words (\w+\s+){3} but I need to match all other words except the first 3 words and remove other words. Can someone help me with it?
Exactly how depends on the flavor, but to eliminate everything except the first three words, you can use:
^((?:\S+\s+){2}\S+).*
which captures the first three words into capturing group 1 and matches (without capturing) the rest of the line. For your replace string, you use a reference to capturing group 1. In C# it might look like:
resultString = Regex.Replace(subjectString, @"^((?:\S+\s+){2}\S+).*", "${1}", RegexOptions.Multiline);
EDIT: Added the start-of-line anchor to each regex, and added TextPad specific flags.
If you want to eliminate the first three words, and capture the rest,
^(?:\w+\s+){3}([^\n\r]+)$
?: changes the first three words to a non-capturing group, and captures everything after it.
Is this what you're looking for? I'm not totally clear on your question, or your goal.
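As suggested, here's the opposite. Capture the first three words only, and discard the rest:
^((?:\w+\s+){3})(?:[^\n\r]+)$
Make the repeated group non-capturing and wrap the whole repetition in a capturing group; with the quantifier outside the capturing parentheses, the group would hold only the last of the three words.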
As far as replacing that captured group, what do you want it replaced with? To replace each word individually, you'd have to capture each word individually:
^(\w+)\s+(\w+)\s+(\w+)\s+(?:[^\n\r]+)$
And then, for instance, you could replace each with its first letter capitalized:
Replace with: \u$1 \u$2 \u$3
Result is This Is The
In TextPad, lowercase \u in the replacement means change only the next letter. Uppercase \U changes everything after it (until the next capitalization flag).
Try it:
http://fiddle.re/f3hgv
(press on [Java] or whatever language is most relevant. Note that \u is not supported by RegexPlanet.)
Coming from a duplicate question, I'll post a solution which works for "traditional" regex implementations which do not support the Perl extensions \s, \W, etc. Newcomers who are not familiar even with the fact that there are different dialects (aka flavors) of regular expressions are advised to read e.g. Why are there so many different regular expression dialects?
If you have POSIX class support, you can use [[:alnum:]_] for \w, [^[:alnum:]_] for \W, [[:space:]] for \s, etc. But if we suppose that whitespace will always be a space and you want to extract the first three tokens between spaces, you don't really need even that.
[^ ]+[ ]+[^ ]+[ ]+[^ ]+
matches three tokens separated by runs of spaces. (I put the spaces in brackets to make them stand out, and easy to extend if you want to include other characters than just a single regular ASCII space in the token separator set. For example, if your regex dialect accepts \t for tab, or you are able to paste a regular tab in its place, you could extend this to
[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+
In most shells, you can type a literal tab with ctrl+v tab, i.e. prefix it with an escape code, which is often typed by holding down the ctrl key and typing v.)
To actually use this, you might want to do
grep -Eo '[^ ]+[ ]+[^ ]+[ ]+[^ ]+' file
where the single quotes protect the regex from the shell (double quotes would work here too, but protect less; the alternative is to backslash every character in the regex that the shell treats as a metacharacter), or perhaps
sed -r 's/([^ ]+[ ]+[^ ]+[ ]+[^ ]+).*/\1/' file
to replace every line with just the captured expression (the parentheses make a capturing group, which you can refer back to with \1 in the replacement part in the s command in sed). The -r option selects a slightly more featureful regex dialect than the bare-bones traditional sed; if your sed doesn't have it, try -E, or put a backslash before each parenthesis and plus sign.
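For a sed with neither -r nor -E, the same command can be written in basic regular expression syntax. The sketch below uses \{1,\} instead of \+, on the assumption that the interval form is the more widely supported of the two; first_three_bre is a hypothetical helper name.

```shell
# Hypothetical helper: keeps the first three space-separated tokens of each
# line, using only basic regular expression syntax (no -r or -E needed).
first_three_bre() {
    sed 's/\([^ ]\{1,\}[ ]\{1,\}[^ ]\{1,\}[ ]\{1,\}[^ ]\{1,\}\).*/\1/'
}

printf '%s\n' 'This is the string for testing.' | first_three_bre
# prints: This is the
```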
Because of the way regular expressions work, the first three is easy because a regular expression engine will always return the first possible match on a line. If you want three tokens starting from the second, you have to put in a skip expression. Adapting the sed script above, that would be
sed -r 's/[^ ]+[ ]+([^ ]+[ ]+[^ ]+[ ]+[^ ]+).*/\1/'
where you'll notice how I put in a token+non-token group before the capture. (This is not really possible with grep -o unless you have grep -P in which case the full gamut of Perl extensions is available to you anyway.)
If your regex dialect supports {m,n} repetition, you can of course refactor the regex to use that. If you need a large number of repetitions, it's certainly both more readable and more maintainable. Just make sure you don't add parentheses where you break up the backreference order (the first left parenthesis creates the first group \1, the second \2, etc.)
sed -r 's/([^ ]+([ ]+[^ ]+){2}).*/\1/' file
Notice how the second parenthesized group is necessary to specify the scope of the {2} repetition (we want to repeat more than just the single character immediately before the left curly brace). The OP's attempt had an error where the repetition was specified outside of the last parenthesis; then, the back reference \1 (or whatever it's called in your dialect -- TextMate seems to use $1, just like Perl) will refer to the last single match of the capturing parentheses, because the repetition is not part of the capture, being outside the capturing parentheses.
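The effect of putting the repetition outside the capturing parentheses can be seen by running both forms side by side. This is a sketch assuming a sed that accepts -E; repeat_outside and repeat_inside are hypothetical helper names.

```shell
# Hypothetical helper: the {3} is outside the capturing group, so \1 only
# holds the last repetition (the third token and the space after it).
repeat_outside() {
    sed -E 's/([^ ]+[ ]+){3}.*/\1/'
}

# Hypothetical helper: the repetition is inside the outer group, so \1
# holds all three tokens.
repeat_inside() {
    sed -E 's/([^ ]+([ ]+[^ ]+){2}).*/\1/'
}
```

For the line This is the string for testing., repeat_outside leaves only the word the (followed by a space), while repeat_inside leaves This is the.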