sed not printing matching group as expected - regex

I am searching for a particular pattern in a csv file. I would like to print the value of the second-to-last column if its value matches [0-9]{5}.
For example, let's say I have file.csv containing only one line of text:
col1,col2,col3,12345,col5
So I'm trying to print 12345. Here is the command I tried:
sed -nr 's/,([0-9]{5}),[^,]*$/\1/p' file.csv
However, this prints col1,col2,col312345.
Then, I tried
sed -nr 's/.*,([0-9]{5}),[^,]*$/\1/p' file.csv
which worked perfectly, printing 12345.
I don't know if I'm misunderstanding sed or just regex in general, but when I test the first regex on www.regex101.com, it behaves as I originally expected it to.
Why did prepending a .* to the pattern make a difference / fix the problem, and also why did the first pattern print what it did?

The command s/pattern/replacement/p takes a line that matches pattern, performs the substitution and then prints the whole line [1]. So, you have this line:
col1,col2,col3,12345,col5
Your pattern /,([0-9]{5}),[^,]*$/ matches the line, specifically ,12345,col5. You substitute that with the capture group, 12345, so the line is now
col1,col2,col312345
and the p flag prints the whole line.
In your second command, the pattern /.*,([0-9]{5}),[^,]*$/ matches the line as well, but this time, it matches the whole line, and you substitute the whole line with the capture group.
[1] In sed parlance, the line is loaded into the "pattern space", and you're manipulating the pattern space. At the end of each cycle the pattern space gets printed automatically unless -n is given, and it is also printed whenever an explicit p command or p flag is used. I think you assumed that the p flag in the s command prints only the substituted part, but it prints the whole pattern space.
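To see the difference side by side, here is a minimal sketch that runs both substitutions on the sample line. It assumes a sed that accepts -E for extended regular expressions (GNU and BSD sed both do; -r in the question is the older GNU spelling). The variable names partial and whole are only illustrative.

```shell
line='col1,col2,col3,12345,col5'

# The pattern covers only ",12345,col5", so the text before it
# stays in the pattern space and gets printed along with the group.
partial=$(printf '%s\n' "$line" | sed -nE 's/,([0-9]{5}),[^,]*$/\1/p')

# The leading .* makes the pattern cover the whole line,
# so only the capture group is left to print.
whole=$(printf '%s\n' "$line" | sed -nE 's/.*,([0-9]{5}),[^,]*$/\1/p')

printf '%s\n' "$partial"   # col1,col2,col312345
printf '%s\n' "$whole"     # 12345
```

In both cases the p flag prints the entire pattern space; the only thing that changes is how much of the line the pattern replaced.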

Regex Match Paragraph Pattern

I am trying to match a paragraph pattern and I am having trouble.
The pattern is:
[image.gif]
some words, usually a few lines
name
emailaddress<mailto:theemailaddress#mail.com>
I tried matching everything between the gif image and the <mailto: but this happens multiple times in the file meaning I get a bad result.
I tried it with this
(?<=\[image.gif\].*?(\[image.gif\])).*?(?=<mailto:)
Is there a way to use Regex to match the general layout of a paragraph?
"the general layout of a paragraph" needs a better definition. Given the lack of an input plus expected output, I'm having to guess what you want here. I'm also guessing that you will accept any language. Here's perl, almost certainly not a language you're familiar with.
Assumed input:
do not match this line
[image.gif]
some words, usually a few lines
Bobert McBobson
emailaddress<mailto:bobertmb#example.com>
don't match this line either
[image.gif]
another few words
on another few lines
Bobina Robertsdaughter
emailaddress<mailto:bobinard#example.info>
this line is also not for matching
Expected output:
[image.gif]
some words, usually a few lines
Bobert McBobson
emailaddress<mailto:bobertmb#example.com>
---
[image.gif]
another few words
on another few lines
Bobina Robertsdaughter
emailaddress<mailto:bobinard#example.info>
Solution using perl:
#!/usr/bin/perl -n007
my $sep = "";
while (/(\[image\.gif\].*?<mailto:[^>]*>(\r)?\n)/gms) {
    print $sep . $1;
    $sep = "---$2\n";
}
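perl is the king of regex languages; many would say that's all it is good for. Here, the -n007 option makes perl loop over the input with the record separator set to the BEL character (octal 007). Since that character practically never appears in text, each file is read in one piece and the code runs with the entire contents in the default variable $_. (-0777 is the conventional way to request whole-file reads explicitly.)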
$sep starts blank because there's nothing to separate until the second match.
Then we loop over each block of text that matches the regex:
- matches a literal [image.gif]
- then matches as little content following that as possible
- then matches a literal <mailto: and continues until the next >
- then captures the line break (including optional support for DOS line endings)
(see full regex explanation and example at regex101)
We then print the match and finally set the separator to three dashes and a line break (DOS line endings added when needed).
Now you can run it:
$ perl answer.pl input.txt
[image.gif]
some words, usually a few lines
Bobert McBobson
emailaddress<mailto:bobertmb#example.com>
---
[image.gif]
another few words
on another few lines
Bobina Robertsdaughter
emailaddress<mailto:bobinard#example.info>
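If perl is not an option, roughly the same extraction can be sketched with awk. This is only an approximation of the regex above, under the assumption that every block starts on a line beginning with [image.gif] and ends on the first following line containing <mailto:...>. The function name extract_blocks is made up for this example.

```shell
# Hypothetical helper: print each block from an "[image.gif]" line through
# the next line containing "<mailto:...>", with "---" between blocks.
extract_blocks() {
  awk '
    /^\[image\.gif\]/ { if (seen) print "---"; seen = 1; inblock = 1 }
    inblock           { print }
    /<mailto:[^>]*>/  { inblock = 0 }
  '
}

extract_blocks <<'EOF'
do not match this line
[image.gif]
some words, usually a few lines
Bobert McBobson
emailaddress<mailto:bobertmb#example.com>
don't match this line either
[image.gif]
another few words
on another few lines
Bobina Robertsdaughter
emailaddress<mailto:bobinard#example.info>
this line is also not for matching
EOF
```

Unlike the perl version, this works line by line and never holds the whole file in memory. It also behaves differently if an [image.gif] line shows up before the previous block has reached its <mailto: line, so treat it as a starting point rather than a drop-in replacement.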

sed is returning more than I need

Every line of the input file will match one of the patterns:
"SCnnnn"
"SC-nnnn"
"SC_nnnn"
( n=[0-9], SC is literal but may be upper or lowercase and will be followed immediately by 1-4 digits delimited at the end by an alphanumeric, space or other non-numeric character)
Somewhere in the line there will also be a file extension (matching ".abc") where abc = upper|lower alphanumeric in any position.
I want to extract the first pattern and print this together with the extracted file extension for each line. This is what I have so far:
sed -E -n 's/([Ss][Cc][-_]*[0-9][0-9]*).*(\.[a-zA-Z0-9]{3})/\1\2/p' infile
Here's a sample input line:
SCSCSCSCSCSCSCSCSC1867SCBrSCSCSCSC&SCBlSCkSCSCBSCrSCbSCckSC.xyz
with required output being:
SC1867.xyz
but what I am getting is:
SCSCSCSCSCSCSCSCSC1867.xyz
Can someone please tell me why this is returning the "SC"s before the part I want? I know it's something to do with greediness, but I can't get my head around it.
(Everything works fine where my "SCnnnn" match is at the beginning of the line.)
I am open to other tools - e.g. awk - if they offer a more straightforward solution.
EDIT: I think I found a solution - at least it appears to work:
sed -E -n 's/.*([Ss][Cc][-_]*[0-9][0-9]*).*(\.[a-zA-Z0-9]{3})/\1\2/p'
It's not necessarily greediness that is at play here. This happens because sed replaces only the matched part of the line and then prints the whole line (that is what the p flag on your s// command does).
To see more clearly what's happening, make infile contain a more obvious string like 0o0o0o0o0o0o0o0oSC1867lalalalalalfalalala.xyz and run your first command. The following is the result:
[user@localhost ~]$ sed -E -n 's/([Ss][Cc][-_]*[0-9][0-9]*).*(\.[a-zA-Z0-9]{3})/\1\2/p' infile
0o0o0o0o0o0o0o0oSC1867.xyz
As a slow-mo: sed finds your [Ss][Cc] characters beginning after the 0o0o0s and dutifully replaces the string you have described with the desired substitution; namely, it keeps the SC part and the digits that follow, then deletes everything after the digits up to the extension. The problem shows up when the p flag prints the partially changed line, including all of the unwanted 0oze.
Alternately
As an alternate solution, not involving printing partially changed lines but instead matching an entire line and altering it to your purpose, the following command extracted the correct answer to stdout for a file containing your example string:
[user@localhost ~]$ sed -e 's/^.*\([Ss][Cc][-_]\?[0-9]\{4\}\).*\(\.[a-zA-Z0-9]\{3\}\)$/\1\2/' infile
SC1867.xyz
To break that regex down a bit: the regex begins with a beginning of line (^), consumes all characters (.*) until it sees an SC (upper or lower, [Ss][Cc]), then it checks for an optional hyphen or underscore ([-_]\?), followed by exactly four digits ([0-9]\{4\}). Then, all characters are consumed until a dot (\.) is seen, followed by exactly three alphanumeric characters ([a-zA-Z0-9]\{3\}) and an end of line ($). The two parenthesized expressions are captured as groups and concatenated in the replacement (\1\2).
... sed -E 's/^.*([Ss][Cc][-_]?[0-9]{4}).*(\.[a-zA-Z0-9]{3})$/\1\2/' infile works too, if you don't enjoy backslashes as much as I do.
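The same comparison can be run on the sample line from the question. This is a sketch that assumes a sed with -E support. The extension class is written out as [a-zA-Z0-9], following the question's description of an alphanumeric extension, and the variable names are illustrative.

```shell
line='SCSCSCSCSCSCSCSCSC1867SCBrSCSCSCSC&SCBlSCkSCSCBSCrSCbSCckSC.xyz'

# Unanchored: the match starts at the first "SC" that is followed by a digit.
# Everything before that point is left in the pattern space and printed too.
partial=$(printf '%s\n' "$line" |
  sed -E -n 's/([Ss][Cc][-_]*[0-9][0-9]*).*(\.[a-zA-Z0-9]{3})/\1\2/p')

# Anchored with ^.* and $: the whole line is matched and replaced,
# so only the two capture groups survive.
anchored=$(printf '%s\n' "$line" |
  sed -E 's/^.*([Ss][Cc][-_]?[0-9]{4}).*(\.[a-zA-Z0-9]{3})$/\1\2/')

printf '%s\n' "$partial"    # SCSCSCSCSCSCSCSCSC1867.xyz
printf '%s\n' "$anchored"   # SC1867.xyz
```

Note that the anchored form uses a greedy .* before the SC group, so on a line with several SCnnnn candidates it would keep the last one rather than the first.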

sed - match regex in specific position

I'm having some trouble creating a one liner or a simple script to edit some fixed length files using sed.
Supposing my file has lines in this format:
IPITTYTHEFOOBUTIDONOTPITTYTHEBAR
IPITTYTH BARBUTIDONOTPITTYTH3FOO
If each entire line is considered as a string, I want to match the substring that starts at position 10 and has length 3 against a regex. If it matches the regex, I want to add some other string at the end of that line.
Assuming the matching regex is B.R, and the string to append to the end of the line is NOT, I would want my file to turn into:
IPITTYTHEFOOBUTIDONOTPITTYTHEBAR
IPITTYTH BARBUTIDONOTPITTYTH3FOONOT
The lines in the real files are longer than the ones in this sample.
So far I have this:
sed -i '/B.R/ s/$/NOT/' file.name
The problem is that this ignores the position where the regex is matched, making the first line of the example a match as well:
IPITTYTHEFOOBUTIDONOTPITTYTHEBARNOT
IPITTYTH BARBUTIDONOTPITTYTH3FOONOT
I'm open to using awk as well.
Thanks in advance.
You are almost there. You just need to account for the characters that come before B.R. If B is at the 10th position, then exactly 9 characters must precede it:
sed -i '/^.\{9\}B.R/s/$/NOT/' file.name
Example:
$ sed '/^.\{9\}B.R/s/$/NOT/' file
IPITTYTHEFOOBUTIDONOTPITTYTHEBAR
IPITTYTH BARBUTIDONOTPITTYTH3FOONOT
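Since the question mentions awk, here is a sketch of the same idea next to the sed version. The awk variant uses substr to cut out the three characters starting at position 10 and tests only that slice, which may be easier to read when the offset is large. The variable names are illustrative.

```shell
input=$(printf '%s\n' \
  'IPITTYTHEFOOBUTIDONOTPITTYTHEBAR' \
  'IPITTYTH BARBUTIDONOTPITTYTH3FOO')

# sed: exactly 9 characters after the start of the line, then B.R
sed_out=$(printf '%s\n' "$input" | sed '/^.\{9\}B.R/s/$/NOT/')

# awk: test only the 3-character slice starting at position 10
awk_out=$(printf '%s\n' "$input" |
  awk 'substr($0, 10, 3) ~ /^B.R$/ { $0 = $0 "NOT" } { print }')

printf '%s\n' "$sed_out"
# IPITTYTHEFOOBUTIDONOTPITTYTHEBAR
# IPITTYTH BARBUTIDONOTPITTYTH3FOONOT
```

Both commands leave the first line alone because its characters 10 to 12 are FOO, even though BAR appears later in that line.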

sed only replacing last occurrence of match - need to match all

I would like to replace all { } on a certain line with [ ], but unfortunately I am only able to match the last occurrence of the regexp.
I have a config file which has structure as follows:
entry {
id 123456789
desc This is a description of {foo} and was added by {bar}
trigger 987654321
}
I have the following sed command, which is able to replace the last match 'bar' but not 'foo':
sed s'/\(desc.*\){\(.*\)}/\1\[\2\]/g' < filename
I anchor this search to the line containing 'desc' as I would hate for it to replace the delimiting braces of each 'entry' block.
For the life of me I am unable to figure out how to replace all of the occurrences.
Any help is appreciated - have been learning all day and unable to read any more tutorials for fear that my corneas might crack.
Thanks!
Try the following:
sed '/desc/ s/{\([^}]*\)}/[\1]/g' filename
The search and replace in the above command is only done on lines that match the regex /desc/. That address is not strictly necessary for this file: sed processes its input one line at a time and the pattern requires an opening and a closing brace on the same line, so the braces that delimit the 'entry' block (which sit on separate lines) would not be touched anyway. As long as no other line contains a {...} pair you want to keep, you could simplify this to the following:
sed 's/{\([^}]*\)}/[\1]/g' filename
Instead of .* inside the capturing group, [^}]* is used, which matches any run of characters other than a closing brace. That way a match cannot stretch from the first opening brace to the last closing brace.
Also, you can just provide the file name as the final argument to sed instead of using input redirection.
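A small sketch of why the g flag did not help in the original command: the greedy desc.* and .* make the first match swallow the line up to the last brace pair, so there is nothing left for g to match again. The variable names are illustrative.

```shell
line='desc This is a description of {foo} and was added by {bar}'

# Greedy: "desc.*" runs up to the last "{", so the single match
# covers the whole line and only {bar} is rewritten.
greedy=$(printf '%s\n' "$line" | sed 's/\(desc.*\){\(.*\)}/\1[\2]/g')

# Negated class: each match is one {...} pair, so g can find both.
fixed=$(printf '%s\n' "$line" | sed '/desc/ s/{\([^}]*\)}/[\1]/g')

printf '%s\n' "$greedy"   # desc This is a description of {foo} and was added by [bar]
printf '%s\n' "$fixed"    # desc This is a description of [foo] and was added by [bar]
```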

SED whitespace removal within a string

I'm trying to use sed to replace whitespace within a string. For example, given the line:
var test = 'Some test text here.';
I want to get:
var test = 'Sometesttexthere.';
I've tried using (\x27 matches the '):
sed 's|\x27\([^\x27[:space:]]*\)[[:space:]]|\x27\1|g'
but that just gives
var test = 'Sometest text here.';
Any ideas?
This is a much more complex sed script, but it works without a loop. You know, just for the sake of variety:
sed 'h;s/[^\x27]*\x27\(.*\)/\n\x27\1/;s/ //g;x;s/\([^\x27]*\).*/\1/;G;s/\n//g'
It makes a copy of the string, splits one (which will become the second half) at the first single quote discarding the first half, replaces all the spaces in the second half, swaps the copies, splits the other one discarding the second half, merges them back together and removes the newlines used for the splitting and the one added by the G command.
Edit:
In order to select particular lines to operate on, you can use some selection criteria. Here I've specified that the line must contain an equal sign and at least two single quotes:
sed '/.*=.*\x27.*\x27.*/ {h;s/[^\x27]*\x27\(.*\)/\n\x27\1/;s/ //g;x;s/\([^\x27]*\).*/\1/;G;s/\n//g}'
You could use whatever regex works best to include and exclude appropriately for your needs.
Your command line has two problems:
First, there's a missing \ after [^.
Second, even though you use the g modifier, only the first space is removed. Why? Because that modifier leads to replacement of successive matches within the same line. It does not re-scan the whole line from the beginning. But this is required here, because your match is anchored at the initial ' of the string literal.
The obvious way to solve this problem is to use a loop, implemented by a conditional jump (jump with tLabel to a :Label; t jumps if at least one s matched since the last test with t).
This is easiest with a sed script (and you don't have to escape the '), like so:
:a
s|'\([^'[:space:]]*\)[[:space:]]|'\1|
ta
But it can be done on the command prompt. The exact syntax may depend on your sed flavour; for mine (super-sed on Windows) it is invoked like so:
sed -e ":a" -e "s|\x27\([^\x27[:space:]]*\)[[:space:]]|\x27\1|;ta"
You need two separate script expressions, because the label :a extends until the end of an expression.
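For a POSIX-style shell, the loop might look like the sketch below. It writes the quote literally inside a double-quoted script instead of using \x27, and it puts the label, the substitution and the t jump in separate -e expressions, which both GNU and BSD sed should accept. The variable names are illustrative.

```shell
line="var test = 'Some test text here.';"

# Single pass with g: after the first replacement, scanning resumes
# past the match, where no opening quote precedes the next space.
once=$(printf '%s\n' "$line" |
  sed "s|'\([^'[:space:]]*\)[[:space:]]|'\1|g")

# Loop: ta jumps back to :a for as long as the s command keeps succeeding.
looped=$(printf '%s\n' "$line" |
  sed -e ':a' -e "s|'\([^'[:space:]]*\)[[:space:]]|'\1|" -e 'ta')

printf '%s\n' "$once"     # var test = 'Sometest text here.';
printf '%s\n' "$looped"   # var test = 'Sometesttexthere.';
```

One caveat: the pattern only knows about "a quote followed by non-space characters and then whitespace", so a closing quote that is directly followed by more text and a space would also trigger a replacement. Restrict it with an address if your input has such lines.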