Perl In place edit: Find and replace in X12850 formatted file - regex

I am new to Perl and cannot figure this out. I have a file called Test:
ISA^00^ ^00^ ^01^SupplyScan ^01^NOVA ^180815^0719^U^00204^000000255^0^P^^
GS^PO^SupplyScan^NOVA^20180815^0719^00000255^X^002004
ST^850^00000255
BEG^00^SA^0000000059^^20180815
DTM^097^20180815^0719
N1^BY^^92^
N1^SE^^92^1
N1^ST^^92^
PO1^1^4^BX^40.000^^^^^^^^IN^131470^^^1^
PID^F^^^^CATH 6FR .070 MPA 1 100CM
REF^
PO1^2^4^BX^40.000^^^^^^^^IN^131295^^^1^
PID^F^^^^CATHETER 6FR XB 3.5
REF^
PO1^3^2^EA^48.000^^^^^^^^IN^132288^^^1^
PID^F^^^^CATH 6FR AL-1 SH
REF^
PO1^4^2^BX^48.000^^^^^^^^IN^131297^^^1^
PID^F^^^^CATHETER 6FR .070 JL4SH 100CM
REF^
CTT^4^12
SE^20^00000255
GE^1^00000255
IEA^1^00000255
What I am trying to do is an in-place edit, dropping any value in the N1^SE segment after the 92^. I tried this but I can't seem to make it work:
perl -i -pe 's/^N1\^SE\^\^92\^\d+$/N1^SE^^92^/g' Test
The final result should include the N1^SE segment looking like this:
N1^SE^^92^
It worked when I just had the one line in the file: N1^SE^^92^1. But when I try to substitute across the entire file, it doesn't work.
Thanks.

Your file may contain hidden characters or spaces that did not make it into the question. They may well be at the end of the line, so try
perl -i -pe 's/^N1\^SE\^\^92\^\K.*//' Test
The \K is a special form of the "positive lookbehind": it keeps everything matched before it out of the match proper, so only the .* after it (the rest of the line) is removed by the substitution. †
This takes seriously the requirement "dropping any value ... after", as it matches lines with things other than the sole \d from the question's example.
Or use \Q...\E sequence to escape special characters (see quotemeta)
perl -i -pe 's/^\QN1^SE^^92^\E\K.*//' Test
per Borodin's comment.
Another take is to specifically match \d as in the question
s/^N1\^SE\^\^92\^\K\d+//
per ikegami's comment. This stays true to your patterns and it also doesn't remove whatever may be hiding at the end of the line.
† The term "lookbehind" for \K is from documentation but, while \K clearly "looks behind," it has marked differences from how the normal lookbehind assertions behave.
Here is a striking example from ikegami. Compare
perl -le'print for "abcde" =~ /(?<=\w)\w/g' # prints lines: b c d e
and
perl -le'print for "abcde" =~ /\w\K\w/g' # prints lines: b d

Related

Output Substring to Newline from a Raw Text String using Regex

I have a name delimiter that I want to use to extract the whole line where it is found.
[string]$testString = $null
# broken text string of text & newlines which simulates $testString = Get-Content -Raw
$testString = "initial text
preliminary text
unfinished line bfore the line I want
001 BOURKE, Bridget Mary ....... ........... 13 Mahina Road, Mahina Bay.Producrs/As 002 BOURKE. David Gerard ...
line after the line I want
extra text
extra extra text"
# test1
# simulate text string before(?<content>.*)text string after - this returns "initial text" only (no newline or anything after)
# $testString -match "(?<BOURKE>.*)"
# test2
# this returns all text, including the newlines, so that $testString outputs exactly as it is defined
$testString -match "(?s)(?<BOURKE>.*)"
#test3
# I want just the line with BOURKE
$result = $matches['BOURKE']
$result
#Test1 finds the match but only prints to the newline. #Test2 finds the match and includes all newlines. I would like to know what is the regex pattern that forces the output to begin 001 BOURKE ...
Any suggestions would be appreciated.
Note:
I'm assuming you're looking for the whole line on which BOURKE appears as a substring.
In your own attempts, (?<BOURKE>...) simply gives the regex capture group a self-chosen name (BOURKE), which is unrelated to what the capture group's subexpression (...) actually matches.
For the use case at hand, there's no strict need to use a (named) capture group at all, so the solutions below make do without one, which, when the -match operator is used, means that the result of a successful match is reported in index [0] of the automatic $Matches variable, as shown below.
If your multiline input string contains only Unix-format LF newlines (\n), use the following:
if ($multiLineStr -match '.*BOURKE.*') { $Matches[0] }
Note:
To match case-sensitively, use -cmatch instead of -match.
If you know that the substring is preceded / followed by at least one char., use .+ instead of .*
If you want to search for the substring verbatim and it contains, or may contain, regex metacharacters (e.g., . ), apply [regex]::Escape() to it; e.g., [regex]::Escape('file.txt') yields file\.txt (\-escaped metacharacters).
If necessary, add additional constraints for disambiguation, such as requiring that the substring start or end only at word boundaries (\b)
If there's a chance that Windows-format CRLF newlines (\r\n) are present, use:
if ($multiLineStr -match '.*BOURKE[^\r\n]*') { $Matches[0] }
For an explanation of the regexes and the ability to experiment with them, see this regex101.com page for .*BOURKE.*, and this one for .*BOURKE[^\r\n]*
In short:
By default, . matches any character except \n, which obviates the need for line-specific anchors (^ and $) altogether, but with CRLF newlines requires excluding \r so as not to capture it as part of the match.[1]
Two asides:
PowerShell's -match operator only ever looks for one match; if you need to find all matches, you currently need to use the underlying [regex] API directly; e.g., [regex]::Matches($multiLineStr, '.*BOURKE[^\r\n]*', 'IgnoreCase').Value. GitHub issue #7867 suggests bringing this functionality directly to PowerShell in the form of a -matchall operator.
If you want to anchor the substring to find, i.e. if you want to stipulate that it either occur at the start or at the end of a line, you need to switch to multi-line mode ((?m)), which makes ^ and $ match on each line; e.g., to only match if BOURKE occurs at the very start of a line:
if ($multiLineStr -match '(?m)^BOURKE[^\r\n]*') { $Matches[0] }
If line-by-line processing is an option:
Line-by-line processing has the advantage that you needn't worry about differences in newline formats (assuming the utility handling the splitting into lines can handle both newline formats, which is true of PowerShell in general).
If you're reading the input text from a file, the Select-String cmdlet, whose very purpose is to find the whole lines on which a given regex or literal substring (-SimpleMatch) matches, additionally offers streaming processing, i.e. it reads lines one by one, without the need to read the whole file into memory.
(Select-String -LiteralPath file.txt -Pattern BOURKE).Line
Add -CaseSensitive for case-sensitive matching.
The following example simulates the above (-split '\r?\n' splits the multiline input string into individual lines, recognizing either newline format):
(
@'
initial text
preliminary text
unfinished line bfore the line I want
001 BOURKE, Bridget Mary ....... ........... 13 Mahina Road, Mahina Bay.Producrs/As 002 BOURKE. David Gerard ...
line after the line I want
extra text
extra extra text
'@ -split '\r?\n' |
Select-String -Pattern BOURKE
).Line
Output:
001 BOURKE, Bridget Mary ....... ........... 13 Mahina Road, Mahina Bay.Producrs/As 002 BOURKE. David Gerard ...
[1] Strictly speaking, the [^\r\n]* would also stop matching at a \r character in isolation (i.e., even if not directly followed by \n). If ruling out that case is important (which seems unlikely), use a (simplified version of) the regex suggested by Mathias R. Jessen in a comment on the question: .*BOURKE.*?(?=\r?\n)
I find it best to have the match consume everything up to what is not needed, here the \r\n. A negated character class does that: [^\r\n]+ says to consume characters until either a \r or a \n is reached, i.e. everything that is not part of the line ending.
To do that use
$testString -match "(?<Bourke>\d\d\d\s[^\r\n]+)"
Also, try to avoid * when you know there will be text to match. Because * also allows zero characters, the engine has to consider empty matches as well; + (one or more) rules those out, which narrows the match and spares the engine some pointless backtracking.

Using the length of the matched group inside regex

Assume this
char=l
string="Hello, World!"
Now I want to replace every run of consecutive occurrences of char in string with the length of that run (run-length encoding), reading both values from STDIN.
I tried this:
$c=<>;$_=<>;print s/($c)\1*/length($&)/grse;
When the input is given as
l
Hello, World!
It returns Hello, World!. But when I ran this
$c=<>;$_=<>;print s/(l)\1*/length($&)/grse;
it returned He2o, Wor1d!
So, since the input is given in separate lines, $c contained \n (checked with $c=~/\n/)
So, I tried
$c=<>.chomp;$_=<>;print s/($c)\1*/length($&)/grse;
and
$c=<>;$_=<>;print s/($c.chomp)\1*/length($&)/grse;
Neither worked. Could anyone please say why?
In Perl, . is used to concatenate strings, not to call methods (unlike in some other languages; Ruby for instance). Have a look at the documentation of chomp to see how it should be used. You should be doing
chomp($c=<>)
Rather than
$c=<>.chomp
Your full code should thus simply be:
chomp($c=<>);$_=<>;print s/($c)\1*/length($&)/grse;
If $c is always a single character, then the regex can be simplified to s/$c+/length($&)/grse. Also, if $c can be a regex metacharacter (e.g., +, *, (, [, etc.), then you should escape it (and it makes sense to escape it just in case). To do so, you can use \Q..\E (or quotemeta, although it is more verbose and thus maybe less suited to a one-liner):
s/\Q$c\E+/length($&)/grse
If you don't escape $c one way or another, and your one-liner is run with ( as first input for instance, you'll get the following error:
Quantifier follows nothing in regex; marked by <-- HERE in m/(+ <-- HERE / at -e line 1, <> line 2
Regarding what $c=<>.chomp actually means in Perl (since this is a valid Perl code that can make sense in some contexts):
$c=<>.chomp means <> concatenated to chomp, where chomp without arguments is understood as chomp($_). And chomp returns the total number of characters removed, and since $_ is empty, no characters are removed, which means that this chomp returns 0. So you are basically writing $c=<>.0, which means that if your input is l\n, you end up with l\n0 instead of l.
One way to debug this kind of thing yourself is to:
Enable warnings with the -w flag. In that case, it would have printed
Use of uninitialized value $_ in scalar chomp at -e line 1, <> line 1.
This is arguably not the most helpful warning ever, but it would have helped you get an idea of where your mistake was.
Print variables to be sure that they contain what you expect. For instance, you could do perl -wE '$c=<>.chomp;print"|$c|"', which would print:
|l
0|
That should give you an idea of what was wrong.

How to use the literal string "STDIN" in negative lookbehind in perl? [duplicate]

I have a very crazy regex that I'm trying to diagnose. It is also very long, but I have cut it down to just the following script. Run using Strawberry Perl v5.26.2.
use strict;
use warnings;
my $text = "M Y H A P P Y T E X T";
my $regex = '(?i)(?<!(Mon|Fri|Sun)day |August )abcd(?-i)';
if ($text =~ m/$regex/){
print "true\n";
}
else {
print "false\n";
}
This gives the error "Variable length lookbehind not implemented in regex."
I am hoping you can help with several issues:
I don't see why this error would occur, because all of the possible lookbehind values are 7 characters: "Monday ", "Friday ", "Sunday ", "August ".
I did not write this regex myself, and I am not sure how to interpret the syntax (?i) and (?-i). When I get rid of the (?i) the error actually goes away. How will perl interpret this part of the regex? I would think the first two characters are evaluated to "optional literal parentheses" except that the parentheses isn't escaped and also in that case I would get a different syntax error because the closing parentheses would then not be matched.
This behavior starts somewhere between Perl 5.16.3_64 and 5.26.1_64, at least in Strawberry Perl. The former version is fine with the code, the latter is not. Why did it start?
I have reduced your problem to this:
my $text = 'M Y H A P P Y T E X T';
my $regex = '(?<!st)A';
print ($text =~ m/$regex/i ? "true\n" : "false\n");
The error comes from combining the /i (case-insensitive) modifier with certain character sequences, such as "ss" or "st", that can also be written as a single typographic ligature, which makes the lookbehind variable-length. For instance, /August/i matches both AUGUST (6 characters) and auguﬆ (5 characters, the last one being U+FB06).
However if we remove /i (case insensitive) modifier then it works because typographic ligatures are not matched.
Solution: Use aa modifiers i.e.:
/(?<!st)A/iaa
Or in your regex:
my $text = 'M Y H A P P Y T E X T';
my $regex = '(?<!(Mon|Fri|Sun)day |August )abcd';
print ($text =~ m/$regex/iaa ? "true\n" : "false\n");
From perlre:
To forbid ASCII/non-ASCII matches (like "k" with "\N{KELVIN SIGN}"), specify the "a" twice, for example /aai or /aia. (The first occurrence of "a" restricts the \d, etc., and the second occurrence adds the "/i" restrictions.) But, note that code points outside the ASCII range will use Unicode rules for /i matching, so the modifier doesn't really restrict things to just ASCII; it just forbids the intermixing of ASCII and non-ASCII.
See a closely related discussion here
That's because st can be a ligature. The same happens to fi and ff:
#!/usr/bin/perl
use warnings;
use strict;
use utf8;
my $fi = 'ﬁ';
print $fi =~ /fi/i;
So imagine something like fi|ﬁ where, indeed, the lengths of the alternatives aren't the same.
st could be represented in a 1-character stylistic ligature as ﬆ or ﬅ, so its length could be 2 or 1.
Quickly finding perl's full list of 2→1-character ligatures using a bash command:
$ perl -e 'print $^V'
v5.26.2
$ for lig in {a..z}{a..z}; do \
perl -e 'print if /(?<!'$lig')x/i' 2>/dev/null || echo $lig; done
ff fi fl ss st
These respectively represent the ﬀ, ﬁ, ﬂ, ß, and ﬆ/ﬅ ligatures. (ﬅ represents ſt, using the obsolete long s character; it matches st and it does not match ft.)
Perl also supports the remaining stylistic ligatures, ﬃ and ﬄ for ffi and ffl, though this isn't noteworthy in this context since lookbehinds already have issues with ff and fi/fl separately.
Future releases of perl may include more stylistic ligatures, though all that remain are font-specific (e.g. Linux Libertine has stylistic ligatures for ct and ch) or debatably stylistic (such as the Dutch ĳ for ij or the obsolete Spanish ꝇ for ll). It doesn't seem appropriate to have this treatment for ligatures that are not entirely interchangeable (nobody would accept dœs for does), though there are other scenarios, such as including ß thanks to its uppercase form being SS.
Perl 5.16.3 (and similarly old versions) only stumble on ss (for ß) and fail to expand the other ligatures in lookbehinds (they have fixed width and will not match). I didn't seek out the bugfix to itemize exactly which versions are affected.
Perl 5.14 introduced ligature support, so earlier versions don't have this problem.
Workarounds
Workarounds for /(?<!August)x/i (only the first will properly avoid August):
/(?<!Augus[t])(?<!Augu(?=st).)x/i (absolutely comprehensive)
/(?<!Augu(?aa:st))x/i (just the st in the lookbehind is "ASCII-safe" ²)
/(?<!(?aa)August)x/i (the whole lookbehind is "ASCII-safe" ²)
/(?<!August)x/iaa (the whole regex is "ASCII-safe" ²)
/(?<!Augus[t])x/i (breaks ligature seeking ¹)
/(?<!Augus.)x/i (slightly different, matches more)
/(?<!Augu(?-i:st))x/i (case-sensitive st in lookbehind, won't match AugusTx)
These toy with removing the case-insensitive modifier¹ or adding the ASCII-safe modifier² in various places, often requiring the regex writer to specifically know of the variable-width ligature.
The first variation (which is the only comprehensive one) matches the variable widths with two lookbehinds: first for the six character version (no ligatures as noted in the first quote below) and second for any ligatures, employing a forward lookahead (which has zero width!) for st (including the ligatures) and then accounting for its single character width with a .
Two segments of the perlre man page:
¹ Case-insensitive modifier /i & ligatures
There are a number of Unicode characters that match a sequence of
multiple characters under /i. For example, "LATIN SMALL LIGATURE
FI" should match the sequence fi. Perl is not currently able to
do this when the multiple characters are in the pattern and are
split between groupings, or when one or more are quantified. Thus
"\N{LATIN SMALL LIGATURE FI}" =~ /fi/i; # Matches [in perl 5.14+]
"\N{LATIN SMALL LIGATURE FI}" =~ /[fi][fi]/i; # Doesn't match!
"\N{LATIN SMALL LIGATURE FI}" =~ /fi*/i; # Doesn't match!
"\N{LATIN SMALL LIGATURE FI}" =~ /(f)(i)/i; # Doesn't match!
² ASCII-safe modifier /aa (perl 5.14+)
To forbid ASCII/non-ASCII matches (like k with \N{KELVIN SIGN}),
specify the a twice, for example /aai or /aia. (The first
occurrence of a restricts the \d, etc., and the second occurrence
adds the /i restrictions.) But, note that code points outside the
ASCII range will use Unicode rules for /i matching, so the modifier
doesn't really restrict things to just ASCII; it just forbids the
intermixing of ASCII and non-ASCII.
To summarize, this modifier provides protection for applications that
don't wish to be exposed to all of Unicode. Specifying it twice gives
added protection.
Put (?i) after lookbehind:
(?<!(Mon|Fri|Sun)day |August )(?i)abcd(?-i)
or
(?<!(Mon|Fri|Sun)day |August )(?i:abcd)
To me it seems to be a bug.

How to replace all the blanks within square brackets with an underscore using sed?

I figured out that in order to turn [some name] into [some_name] I need to use the following expression:
s/\(\[[^ ]*\) /\1_/
i.e. create a backreference capture for anything that starts with a literal '[' that contains any number of non space characters, followed by a space, to be replaced with the non space characters followed by an underscore. What I don't know yet though is how to alter this expression so it works for ALL spaces within the brackets e.g. [a few words] into [a_few_words].
I sense that I'm close, but am just missing a chunk of knowledge that will unlock the key to making this thing work an infinite number of times within the constraints of the first set of []s contained in a line (of SQL Server DDL in this case).
Any suggestions gratefully received....
There are two parts to the trickery needed:
Stop replacing when you reach a close square bracket (but do it repeatedly on the line):
s/\(\[[^] ]*\) /\1_/g
This matches an open square bracket, followed by zero or more characters that are neither a blank nor a close square bracket. The global suffix means that the pattern is applied to all sequences starting with an open square bracket followed eventually by a blank or close square bracket on the line. Note, too, that this regex does not alter '[single-word] and context' whereas the original would translate that to '[single-word]_and context', which is not the object of the exercise.
Get sed to repeat the search from where this one started. Unfortunately, there isn't a truly good way to do that. Sed always resumes searching after the text that was substituted; and this is one occasion when we don't want that. Sometimes, you can get away with simply repeating the substitute operation. In this case, you have to repeat it every time the substitution succeeds, stopping when there are no more substitutions.
Two of the less well known operations in sed are the ':label' and the 't' commands. They were present in the 7th Edition of Unix (circa 1978), though, so they are not new features. The first simply identifies a position in the script which can be jumped to with 'b' (not wanted here) or 't':
[2addr]t [label]
Branch to the ':' function bearing the label if any substitutions have been made since the most recent reading of an input line or execution of a 't' function. If no label is specified, branch to the end of the script.
Marvellous: we need:
sed -e ':redo; s/\(\[[^] ]*\) /\1_/g; t redo' data.file
Except - it doesn't work all on one line like that (at least, not on MacOS X). This did work admirably, though:
sed -e ':redo
s/\(\[[^] ]*\) /\1_/g
t redo' data.file
Or, as noted in the comments, you could write three separate '-e' options (which works on MacOS X):
sed -e ':redo' -e 's/\(\[[^] ]*\) /\1_/g' -e 't redo' data.file
Given the data file:
a line with [one blank] word inside square brackets.
a line with [two blank] or [three blank] words inside square brackets.
a line with [no-blank] word inside square brackets.
a line with [multiple words in a single bracket] inside square brackets.
a line with [multiple words in a single bracket] [several times on one line]
the output from the sed script shown is:
a line with [one_blank] word inside square brackets.
a line with [two_blank] or [three_blank] words inside square brackets.
a line with [no-blank] word inside square brackets.
a line with [multiple_words_in_a_single_bracket] inside square brackets.
a line with [multiple_words_in_a_single_bracket] [several_times_on_one_line]
And, finally, reading the fine print in the question, if you need this done only in the first square-bracketed field on each line, then we need to ensure that the match cannot reach past the first close square bracket. This variant works:
sed -e ':redo' -e 's/^\([^]]*\[[^] ]*\) /\1_/' -e 't redo' data.file
(The 'g' qualifier is gone - it probably isn't needed in the other variants either given the loop; its presence might make the process marginally more efficient, but it would most likely be essentially impossible to detect that. The pattern is now anchored to the start of the line (the caret) and allows zero or more characters that are not a close square bracket before the open square bracket that starts the match, so it can never reach into a second bracketed field.)
Sample output:
a line with [two_blank] or [three blank] words inside square brackets.
a line with [no-blank] word inside square brackets.
a line with [multiple_words_in_a_single_bracket] inside square brackets.
a line with [multiple_words_in_a_single_bracket] [several times on one line]
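Both loops can be tried on a single line without creating a data file. This is a sketch with a made-up input line; it uses the three-option -e form from above, on the assumption that it is the most portable spelling across sed implementations.

```shell
line='a [b c d] [e f]'

# Loop until no substitution succeeds: every blank inside every bracket pair.
all=$(printf '%s\n' "$line" |
  sed -e ':redo' -e 's/\(\[[^] ]*\) /\1_/g' -e 't redo')

# Anchored variant: only the first bracketed field on the line.
first=$(printf '%s\n' "$line" |
  sed -e ':redo' -e 's/^\([^]]*\[[^] ]*\) /\1_/' -e 't redo')

echo "$all"     # a [b_c_d] [e_f]
echo "$first"   # a [b_c_d] [e f]
```

Each pass replaces at most one blank per bracket pair, which is why the t branch back to :redo is needed for [b c d].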
This is easier in a language like perl which has "executable" substitutions:
perl -wne 's/(\[.*?])/ do { my $x = $1; $x =~ y, ,_,; $x } /ge; print'
Or to split it up more clearly:
sub replace_with_underscores {
my $s = shift;
$s =~ y/ /_/;
$s
}
s/(\[.*?])/ replace_with_underscores($1) /ge;
The .*? is a non-greedy match (to avoid running two adjacent bracketed phrases together), and the e flag makes the replacement side be evaluated as Perl code, so you can call a function to do the inner work.