Command below is from this answer: https://stackoverflow.com/a/208220/3470343
The command adds the word new. as a prefix. That's understood.
rename 's/(.*)$/new.$1/' original.filename
However, I'd like to ask why the open and close brackets are required here:
(.*)
Also, why is $1 the variable that stores the original file name? Why can't I do the same with the following (where I have replaced $1 with $2):
rename 's/(.*)$/new.$2/' original.filename
I'm still relatively new to bash, so help would be greatly appreciated.
First off, (.*)$ is what is known as a regular expression (or regex). Regular expressions are used to match text based on some rules.
For example, .* matches zero or more characters. $ matches the end of the line. Because regular expressions by default are greedy, .*$ matches the whole line (although, precisely because regexes are greedy, $ is superfluous).
However, I'd like to ask why the open and close brackets are required here: (.*)
The round brackets denote a group. Groups are used to "save" the contents of the matched text, so that you can use it later.
And also why is $1 the variable which stores the original file name, why can't I do the same with following (where I have replaced $1 with $2): ...
In the case of rename(1), the first group is stored in $1, the second group is stored in $2 and so on. For example, the following regular expression:
(a)(b)(c)
stores a single a into $1, a single b into $2 and so on.
Your command has only one group, therefore you must use $1. Tokens like $2, $3, ... will be empty.
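The group numbering described above can be tried outside of rename. The following is only a sketch in Python's re module, which writes \1 where Perl writes $1; it is not what rename runs internally. One difference worth knowing: Perl leaves a nonexistent $2 empty, whereas Python refuses the replacement outright.

```python
import re

# Three capturing groups, numbered left to right by their opening parenthesis.
match = re.fullmatch(r"(a)(b)(c)", "abc")
groups = match.groups()

# Only one group here, so \1 holds the whole file name.
# count=1 mimics Perl's s/// without /g: Python would otherwise also
# substitute the empty match that (.*)$ finds at the very end of the string.
renamed = re.sub(r"(.*)$", r"new.\1", "original.filename", count=1)

# Referring to a group that does not exist is an error in Python
# (in Perl, $2 would simply expand to nothing).
try:
    re.sub(r"(.*)$", r"new.\2", "original.filename", count=1)
    missing_group_error = False
except re.error:
    missing_group_error = True

print(groups, renamed, missing_group_error)
```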
Last but not least, you could use a shorter, equivalent command:
rename 's/^/new./'
Here ^ denotes the start of the string.
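To check that the two forms really are equivalent, here is a small sketch using Python's re module as a stand-in for the Perl substitution (the sample file name is the one from the question):

```python
import re

name = "original.filename"

# Capture everything, then put it back after the prefix.
# count=1 stops Python from also replacing the empty match at the end.
with_group = re.sub(r"(.*)$", r"new.\1", name, count=1)

# Insert the prefix at the start of the string; no group is needed.
with_anchor = re.sub(r"^", "new.", name)

print(with_group, with_anchor)
```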
Related
I'd like to know how to take the following string
/text1/text2/text3/wanted_text/text5/text6
and get the wanted text, based solely on its position between the 4th and 5th /?
A substitution command is enough (I've obviously assumed that the interesting part is between the 4th and 5th / as you said):
echo your_text | sed -E 's!(/[^/]+){3}/([^/]+).*!\2!'
where I've used ! as separator for the parts of the substitution command, in order to avoid having to escape every /.
More in detail:
s!…!…! is the search-and-substitute command, where you put the search pattern in the first … and the replacement in the second …;
the search pattern is (/[^/]+){3}/([^/]+).* and matches 3 occurrences of a / followed by 1 or more non-/, followed by a / followed by 1 or more non-/, followed by the rest of the line; the (…) are for grouping a part of a regex such that you can apply quantifiers (like {3}) to the whole group (just like in (/[^/]+){3}), and for capturing the matching text to allow you to refer to it in the replacement; in this case, the third of the 3 texts matching (/[^/]+){3} is referred to via \1, whereas the text matched by ([^/]+) is referred to via \2;
the replacement is simply \2 (see previous point).
For more details about how the search pattern works, and to understand all of its parts, you can refer to this demo on regex 101.
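(-E selects extended regular expressions, which makes the script more readable; it was historically not specified by POSIX, although GNU and BSD sed both accept it. Without it, you have to prepend \ to each of (, ), {, } and +, and \+ is itself a GNU extension.)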
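The same pattern can be stepped through in Python's re module, which shares the relevant syntax with sed -E. This is only a sketch; it also shows what ends up in \1 when a group is repeated with {3}.

```python
import re

path = "/text1/text2/text3/wanted_text/text5/text6"
pattern = re.compile(r"(/[^/]+){3}/([^/]+).*")

match = pattern.match(path)
last_repeat = match.group(1)  # a repeated group keeps only its last iteration
wanted = match.group(2)       # the text between the 4th and 5th /

# Equivalent of the sed substitution: the whole line is replaced by group 2.
replaced = pattern.sub(r"\2", path)

print(last_repeat, wanted, replaced)
```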
I would like to change the second forward slash, within each line, to a comma.
I have found various posts and managed to derive a way of doing it from them but it's not doing it how I want.
Initial attempt - I thought I needed to replace between 2 delimiters
1st "Replace 2nd occurrence" - Found this post which seemed easier.
2nd "Replace 2nd occurrence"- Used the regex in here as a base for mine.
What I am doing is;
Find:
^(.*?)\/(.*?)\/
Replace:
$&,
Which results in changing my data from;
042146/OVERNIGHT/HSSC825571,started,14/07/2016,00:00:56,V0700LWHSB
042146/OVERNIGHT/HSSC825571,ended,14/07/2016,00:00:56,
042147/OVERNIGHT/HSSC825571,started,14/07/2016,00:00:58,V0700LWHSB
042147/OVERNIGHT/HSSC825571,ended,14/07/2016,00:00:58,
To;
042146/OVERNIGHT/,HSSC825571,started,14/07/2016,00:00:56,V0700LWHSB
042146/OVERNIGHT/,HSSC825571,ended,14/07/2016,00:00:56,
042147/OVERNIGHT/,HSSC825571,started,14/07/2016,00:00:58,V0700LWHSB
042147/OVERNIGHT/,HSSC825571,ended,14/07/2016,00:00:58,
Is there a way of just replacing the second /?
An example set of my data is;
042146/OVERNIGHT/HSSC825571,started,14/07/2016,00:00:56,V0700LWHSB
042146/OVERNIGHT/HSSC825571,ended,14/07/2016,00:00:56,
042147/OVERNIGHT/HSSC825571,started,14/07/2016,00:00:58,V0700LWHSB
042147/OVERNIGHT/HSSC825571,ended,14/07/2016,00:00:58,
042154/TEMP56/QPADEV000M,started,14/07/2016,00:01:02,V0700LRFIN
042154/TEMP56/QPADEV000M,ended,14/07/2016,00:07:12,
042155/JMALICKA/QPADEV000N,started,14/07/2016,00:01:05,V0700LRFIN
042155/JMALICKA/QPADEV000N,ended,14/07/2016,00:06:53,
042156/DG8SVCPRF/DG8SVC,started,14/07/2016,00:01:15,DATAGATE
042156/DG8SVCPRF/DG8SVC,ended,14/07/2016,00:12:01,
042157/OVERNIGHT/RCPTDISCRP,started,14/07/2016,00:01:42,V0700LBATC
042157/OVERNIGHT/RCPTDISCRP,ended,14/07/2016,00:01:44,
042158/QTCP/QTSMTPCLTP,started,14/07/2016,00:01:53,QSYSWRK
042158/QTCP/QTSMTPCLTP,ended,14/07/2016,01:29:08,
042159/QTCP/QTSMTPCLTP,started,14/07/2016,00:01:53,QSYSWRK
042159/QTCP/QTSMTPCLTP,ended,14/07/2016,00:19:05,
Ctrl+H
Find what: ^([^/]+/[^/]+)/
Replace with: $1,
Replace all
This will replace the second slash of each line by a comma.
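For anyone who wants to check the pattern outside Notepad++, here is a sketch in Python's re module (which writes \1 instead of $1), run on two of the sample lines. The slashes inside the dates are left alone because the pattern is anchored at the start of the line.

```python
import re

lines = [
    "042146/OVERNIGHT/HSSC825571,started,14/07/2016,00:00:56,V0700LWHSB",
    "042154/TEMP56/QPADEV000M,ended,14/07/2016,00:07:12,",
]

# Everything up to (but not including) the second slash goes into group 1;
# the second slash itself is matched outside the group and replaced by a comma.
second_slash = re.compile(r"^([^/]+/[^/]+)/")
fixed = [second_slash.sub(r"\1,", line) for line in lines]

for line in fixed:
    print(line)
```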
You were almost there. You only need to change your replace string to the following:
$1/$2,
How it works
Your regex was: ^(.*?)\/(.*?)\/
In Notepad++'s replace string, the dollar sign is used to refer to groups enclosed by parentheses in the regex.
$1 refers to the first group, (.*?) which is at the beginning of the line, as specified with the ^ character.
$2 refers to the second group, also (.*?), but which follows the first /.
Since you don't want to replace the first slash, you need $1/$2 at the beginning of your replace string. But since what follows the second group is another / (the 2nd one on the line), you need to replace it with the ,. That's why the replace string has to be $1/$2,. Notice that all characters that are not enclosed by ()'s need to be re-written in the replace string. Otherwise, they're just omitted (try replace string $1$2 and you'll see what I mean).
In other editors or programming languages, instead of the $ sign, the \ is sometimes used (sometimes doubled) to refer to parenthetic groups. So you could have for instance \\1/\\2, or \1/\2, as a replace string instead of $1/$2,.
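The effect of the different replace strings can be compared side by side. This sketch uses Python's re module, where \g<0> plays the role of Notepad++'s $& (the whole match) and \1, \2 play the role of $1, $2:

```python
import re

line = "042146/OVERNIGHT/HSSC825571,ended,14/07/2016,00:00:56,"
pattern = r"^(.*?)/(.*?)/"

# The original attempt: whole match plus a comma, so both slashes survive.
whole_match = re.sub(pattern, r"\g<0>,", line)

# The fix: rebuild the match from its groups, writing a comma
# where the second slash used to be.
rebuilt = re.sub(pattern, r"\1/\2,", line)

# Anything matched outside the groups and not rewritten is dropped.
groups_only = re.sub(pattern, r"\1\2", line)

print(whole_match)
print(rebuilt)
print(groups_only)
```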
I am trying to replace all the words except the first 3 words from the String (using textpad).
Ex value: This is the string for testing.
I want to extract just 3 words: This is the from above string and remove all other words.
I figured out the regex to match the 3 words (\w+\s+){3} but I need to match all other words except the first 3 words and remove other words. Can someone help me with it?
Exactly how depends on the flavor, but to eliminate everything except the first three words, you can use:
^((?:\S+\s+){2}\S+).*
which captures the first three words into capturing group 1, and matches (without capturing) the rest of the string. For your replace string, you use a reference to capturing group 1. In C# it might look like:
resultString = Regex.Replace(subjectString, @"^((?:\S+\s+){2}\S+).*", "${1}", RegexOptions.Multiline);
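The same regex behaves the same way in other flavors that support \S and \s. As a sketch, here it is in Python's re module, using the example value from the question:

```python
import re

subject = "This is the string for testing."

# Group 1 holds two "word + whitespace" pairs plus a third word;
# the trailing .* matches the rest of the line so that it gets replaced away.
result = re.sub(r"^((?:\S+\s+){2}\S+).*", r"\1", subject, flags=re.MULTILINE)

print(result)
```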
EDIT: Added the start-of-line anchor to each regex, and added TextPad specific flags.
If you want to eliminate the first three words, and capture the rest,
^(?:\w+\s+){3}([^\n\r]+)$
?: changes the first three words to a non-capturing group, and captures everything after it.
Is this what you're looking for? I'm not totally clear on your question, or your goal.
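As suggested, here's the opposite. Capture the first three words only, and discard the rest:
^((?:\w+\s+){3})(?:[^\n\r]+)$
Move the ?: from the first grouping to the second, and wrap the repetition in its own capturing group; with (\w+\s+){3} alone, the group would hold only the third word, because a repeated group keeps just its last iteration.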
As far as replacing that captured group, what do you want it replaced with? To replace each word individually, you'd have to capture each word individually:
^(\w+)\s+(\w+)\s+(\w+)\s+(?:[^\n\r]+)$
And then, for instance, you could replace each with its first letter capitalized:
Replace with: \u$1 \u$2 \u$3
Result is This Is The
In TextPad, lowercase \u in the replacement means change only the next letter. Uppercase \U changes everything after it (until the next capitalization flag).
Try it:
http://fiddle.re/f3hgv
(press on [Java] or whatever language is most relevant. Note that \u is not supported by RegexPlanet.)
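In flavors without \u in the replacement string, the per-word capture can be combined with a replacement function instead. A sketch in Python's re module (the helper name capitalize_first is made up for this example):

```python
import re

text = "This is the string for testing."

# Each of the first three words gets its own capturing group;
# the rest of the line is matched by a non-capturing group and discarded.
three_words = re.compile(r"^(\w+)\s+(\w+)\s+(\w+)\s+(?:[^\n\r]+)$")

def capitalize_first(match):
    # Upper-case only the first letter of each captured word.
    return " ".join(word[:1].upper() + word[1:] for word in match.groups())

result = three_words.sub(capitalize_first, text)

print(result)
```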
Coming from a duplicate question, I'll post a solution which works for "traditional" regex implementations which do not support the Perl extensions \s, \W, etc. Newcomers who are not familiar even with the fact that there are different dialects (aka flavors) of regular expressions are advised to read e.g. Why are there so many different regular expression dialects?
If you have POSIX class support, you can use [[:alnum:]_] for \w, [^[:alnum:]_] for \W, [[:space:]] for \s, etc. But if we suppose that whitespace will always be a space and you want to extract the first three tokens between spaces, you don't really need even that.
[^ ]+[ ]+[^ ]+[ ]+[^ ]+
matches three tokens separated by runs of spaces. (I put the spaces in brackets to make them stand out, and easy to extend if you want to include other characters than just a single regular ASCII space in the token separator set. For example, if your regex dialect accepts \t for tab, or you are able to paste a regular tab in its place, you could extend this to
[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+
In most shells, you can type a literal tab with ctrl+v tab, i.e. prefix it with an escape code, which is often typed by holding down the ctrl key and typing v.)
To actually use this, you might want to do
grep -Eo '[^ ]+[ ]+[^ ]+[ ]+[^ ]+' file
where the single quotes are necessary to protect the regex from the shell (double quotes would work here, too, but protect fewer characters; the other alternative is backslashing every character in the regex which the shell treats as a metacharacter), or perhaps
sed -r 's/([^ ]+[ ]+[^ ]+[ ]+[^ ]+).*/\1/' file
to replace every line with just the captured expression (the parentheses make a capturing group, which you can refer back to with \1 in the replacement part in the s command in sed). The -r option selects a slightly more featureful regex dialect than the bare-bones traditional sed; if your sed doesn't have it, try -E, or put a backslash before each parenthesis and plus sign.
Because of the way regular expressions work, the first three is easy because a regular expression engine will always return the first possible match on a line. If you want three tokens starting from the second, you have to put in a skip expression. Adapting the sed script above, that would be
sed -r 's/[^ ]+[ ]+([^ ]+[ ]+[^ ]+[ ]+[^ ]+).*/\1/'
where you'll notice how I put in a token+non-token group before the capture. (This is not really possible with grep -o unless you have grep -P in which case the full gamut of Perl extensions is available to you anyway.)
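These bracket-only patterns are deliberately portable, so they also run unchanged in richer flavors. A sketch in Python's re module, where re.search stands in for grep -o and re.sub for the sed s command (the sample line is borrowed from the original question):

```python
import re

line = "This is the string for testing."

# Like grep -Eo: return the first (leftmost) match on the line.
first_three = re.search(r"[^ ]+[ ]+[^ ]+[ ]+[^ ]+", line).group(0)

# Like the sed script with a skip expression: one token and its separator
# are matched before the capture, so group 1 starts at the second token.
second_to_fourth = re.sub(
    r"[^ ]+[ ]+([^ ]+[ ]+[^ ]+[ ]+[^ ]+).*", r"\1", line
)

print(first_three)
print(second_to_fourth)
```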
If your regex dialect supports {m,n} repetition, you can of course refactor the regex to use that. If you need a large number of repetitions, it's certainly both more readable and more maintainable. Just make sure you don't add parentheses where you break up the backreference order (the first left parenthesis creates the first group \1, the second \2, etc.)
sed -r 's/([^ ]+([ ]+[^ ]+){2}).*/\1/' file
Notice how the second parenthesized group is necessary to specify the scope of the {2} repetition (we want to repeat more than just the single character immediately before the left curly brace). The OP's attempt had an error where the repetition was specified outside of the last parenthesis; then, the back reference \1 (or whatever it's called in your dialect -- TextMate seems to use $1, just like Perl) will refer to the last single match of the capturing parentheses, because the repetition is not part of the capture, being outside the capturing parentheses.
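The difference between repeating inside and outside the capturing parentheses is easy to see by printing the group. A sketch in Python's re module, which follows the same numbering rule:

```python
import re

text = "This is the string for testing."

# Repetition outside the parentheses: the group is overwritten on every
# iteration, so only the last word (and its trailing space) is kept.
outside = re.match(r"(\w+\s+){3}", text).group(1)

# Repetition inside a capturing group: the group spans all three iterations.
inside = re.match(r"((?:\w+\s+){3})", text).group(1)

# The refactored sed pattern: group 1 opens first and wraps everything,
# group 2 exists only to give {2} its scope.
refactored = re.sub(r"([^ ]+([ ]+[^ ]+){2}).*", r"\1", text)

print(repr(outside), repr(inside), repr(refactored))
```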
I am attempting to edit a csv file, below is a sample line from this file.
|MIGRATE|;|10000|;|2ACC0003|;|30/09/13|;|Positive Adjmt.|;||;|MIGRATE|;|95004U
The beginning of the line |MIGRATE| needs to be modified without changing the second MIGRATE so the line would read
|MIGRATE|;|MIG_IN|;|10000|;|2ACC0003|;|30/09/13|;|Positive Adjmt.|;||;|MIGRATE|;|95004U
There are 7700 or so lines so if I am forced to do this manually I will probably cry a little.
Thanks in advance!
Just replace all the ones you want not changed with another word temporarily, then replace the rest with what you want. I'm not sure what you're asking here, but from what I can guess this might help.
It seems like you could just search for:
^\|MIGRATE\|
And replace with:
|MIGRATE|;|MIG_IN|
Make sure you've checked 'Regular expression' in the 'Search Mode' options.
Explanation: The ^ is a begin anchor; it will match the beginning of the line, ensuring that it does not match the second |MIGRATE|. The \ characters are required to escape the | characters since they normally have special meaning in regular expressions, and you want to match a literal |.
You can use beginning of line anchors:
Find:
^(\|MIGRATE\|)
Replace with:
$1;|MIG_IN|
regex101 demo
Just make sure that you are using the regular expression mode of the Search&Replace.
If you want to be a bit fancier, you can use a positive lookbehind:
Find:
(?<=^\|MIGRATE\|)
Replace with:
;|MIG_IN|
^ Will match only at the beginning of a line.
( ... ) is called a capture group, and will save the contents of the match in a variable you can use. (In the first regex, I accessed the variable using $1 in the replace. The first capture gets stored to $1, the second to $2, etc.)
| is a special character meaning 'or' in regex (to match one character or group of characters or another, e.g. a|b matches a or b). As such, you need to escape it with a backslash to make a regex match a literal |.
In my second regex, I used (?<= ... ) which is called a positive lookbehind. It makes sure that the part to be matched has what's inside before it. For instance, (?<=a)b matches a b only if it has an a before it, so the b in ab matches but neither b in bb does.
The website I linked also explains the details of the regex and you can try out some regex yourself!
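The approaches above (a plain anchored match, a capture group, and a lookbehind) can be checked against the sample line. This is a sketch in Python's re module, which writes \1 instead of $1 and accepts this lookbehind because ^ is zero-width; behaviour in the editor itself may differ in details.

```python
import re

line = "|MIGRATE|;|10000|;|2ACC0003|;|30/09/13|;|Positive Adjmt.|;||;|MIGRATE|;|95004U"
expected = (
    "|MIGRATE|;|MIG_IN|;|10000|;|2ACC0003|;|30/09/13|;"
    "|Positive Adjmt.|;||;|MIGRATE|;|95004U"
)

# 1. Match the leading |MIGRATE| and write it back together with the new field.
plain = re.sub(r"^\|MIGRATE\|", "|MIGRATE|;|MIG_IN|", line)

# 2. Capture the leading |MIGRATE| and reuse it through a backreference.
captured = re.sub(r"^(\|MIGRATE\|)", r"\1;|MIG_IN|", line)

# 3. Lookbehind: match the empty position right after the leading |MIGRATE|
#    and insert the new field there.
lookbehind = re.sub(r"(?<=^\|MIGRATE\|)", ";|MIG_IN|", line)

print(plain == captured == lookbehind == expected)
```

With re.MULTILINE added as a flag, the same calls would work on the whole file contents at once rather than on one line at a time.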
grep "http:\/\/.*\.jpg" index.html -o
Gives me text starting with http:// and ending with .jpg
So does: grep "http:\/\/.*\.\(jpg\)" index.html -o
What is the difference? And is there any condition where this might fail?
I got it to match either jpg,png or gif using this regex:
http:\/\/.*\.\(jpg\|png\|gif\)
Something to do with backreference or regex grouping that I read. Cannot understand this part \(\)
Grouping is used for two purposes in regular expressions.
One use is to delimit parts of the regexp when using alternatives. That's the case in your third regexp: it allows you to say that the extension can be any of jpg, png, or gif.
The other use is for backreferences. This allows you to refer to the text that matched an earlier part of the regexp later in the regexp. For instance, the following regexp matches any letter that appears twice in a row:
\([a-z]\)\1
The backreference \1 means "match whatever matched the first group in the regexp".
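\( and \) are metacharacters in grep's default (basic) syntax, i.e. they don't match literal parentheses, but mean something to grep. (With grep -E the roles are swapped: plain ( and ) group, and the backslashed forms match literal parentheses.)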
From here:
Grouping is performed with backslashes followed by parentheses ‘(’, ‘)’.
So in the above, \( and \) define a group of possibilities to match, separated by \| (the escaped | character), i.e. your filename extensions.
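Both uses of grouping, and one way the original pattern can misbehave, can be seen in a short sketch. It uses Python's re module, which, like grep -E, writes groups as plain ( ) and alternatives as plain |; the URLs and the html string are invented sample data.

```python
import re

# Use 1: alternatives. The group limits the | to the file extensions.
# [^"]* keeps each match inside one quoted attribute value.
html = '<img src="http://example.com/a.jpg"> <img src="http://example.com/b.png">'
urls = re.findall(r'http://[^"]*\.(?:jpg|png|gif)', html)

# Use 2: backreferences. \1 must repeat whatever group 1 matched.
doubled = re.search(r"([a-z])\1", "hello").group(0)

# A case where http://.*\.jpg gives a surprising result: .* is greedy,
# so two URLs on one line are returned as a single match.
two_on_one_line = "http://a.example/x.jpg and http://b.example/y.jpg"
greedy = re.search(r"http://.*\.jpg", two_on_one_line).group(0)

print(urls, doubled, greedy)
```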