SED whitespace removal within a string - regex

I'm trying to use sed to replace whitespace within a string. For example, given the line:
var test = 'Some test text here.';
I want to get:
var test = 'Sometesttexthere.';
I've tried using (\x27 matches the '):
sed 's|\x27\([^\x27[:space:]]*\)[[:space:]]|\x27\1|g'
but that just gives
var test = 'Sometest text here.';
Any ideas?

This is a much more complex sed script, but it works without a loop. You know, just for the sake of variety:
sed 'h;s/[^\x27]*\x27\(.*\)/\n\x27\1/;s/ //g;x;s/\([^\x27]*\).*/\1/;G;s/\n//g'
It makes a copy of the string, splits one (which will become the second half) at the first single quote discarding the first half, replaces all the spaces in the second half, swaps the copies, splits the other one discarding the second half, merges them back together and removes the newlines used for the splitting and the one added by the G command.
Edit:
In order to select particular lines to operate on, you can use some selection criteria. Here I've specified that the line must contain an equal sign and at least two single quotes:
sed '/.*=.*\x27.*\x27.*/ {h;s/[^\x27]*\x27\(.*\)/\n\x27\1/;s/ //g;x;s/\([^\x27]*\).*/\1/;G;s/\n//g}'
You could use whatever regex works best to include and exclude appropriately for your needs.
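As a quick check of the filtered version, here is a minimal sketch. It assumes GNU sed (for \x27 and for \n in the replacement), strip_quoted is a hypothetical helper name, and the second input line is made up to show that lines without an equal sign pass through untouched:

```shell
# Assumes GNU sed: \x27 (single quote) and \n in the replacement are GNU extensions.
# strip_quoted is a hypothetical helper name used only for this illustration.
strip_quoted() {
  sed '/.*=.*\x27.*\x27.*/ {h;s/[^\x27]*\x27\(.*\)/\n\x27\1/;s/ //g;x;s/\([^\x27]*\).*/\1/;G;s/\n//g;}'
}

# First line qualifies (equal sign plus two quotes); second line does not.
printf '%s\n' "var test = 'Some test text here.';" "echo some plain text" | strip_quoted
```

Note that the script removes every space after the first single quote, so any text following the closing quote on a selected line loses its spaces too.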

Your command line has two problems:
First, there's a missing \ after [^.
Second, even though you use the g modifier, only the first space is removed. Why? Because that modifier leads to replacement of successive matches within the same line. It does not re-scan the whole line from the beginning. But this is required here, because your match is anchored at the initial ' of the string literal.
The obvious way to solve this problem is to use a loop, implemented by a conditional jump (jump with tLabel to a :Label; t jumps if at least one s matched since the last test with t).
This is easiest with a sed script (and you don't have to escape the '), like so:
:a
s|'\([^'[:space:]]*\)[[:space:]]|'\1|
ta
But it can be done on the command prompt. The exact syntax may depend on your sed flavour; for mine (super-sed on Windows) it is invoked like so:
sed -e ":a" -e "s|\x27\([^\x27[:space:]]*\)[[:space:]]|\x27\1|;ta"
You need two separate script expressions, because the label :a extends until the end of an expression.
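For a POSIX-style shell, the same loop could be sketched as follows. The helper name strip_spaces_in_literal is hypothetical, and the script is wrapped in double quotes so the single quote can be written directly instead of as \x27:

```shell
# strip_spaces_in_literal is a hypothetical helper name for this sketch.
# Each pass removes one whitespace character after the first single quote;
# "ta" jumps back to label a as long as the substitution keeps succeeding.
strip_spaces_in_literal() {
  sed -e ':a' \
      -e "s|'\([^'[:space:]]*\)[[:space:]]|'\1|" \
      -e 'ta'
}

printf '%s\n' "var test = 'Some test text here.';" | strip_spaces_in_literal
```

One caveat of this approach: once the literal itself contains no more whitespace, the pattern can also match at the closing quote, so whitespace shortly after the literal may be removed as well.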

Related

Regex for string matching ****${****}***

I am trying to write a regex that matches and excludes all strings in a file that contain ${ followed by } with any characters between or around it. In between could be any characters/numbers/underscores/dashes/etc (there won't be another parenthesis inside).
Example matches:
hello ${VAR}
${HELLO_VAR} world
https://${WEB_VAR}
I came up with this: egrep -v '^\${[a-zA-Z?]', though it seems to be working only partially and I am not too sure if it's right. How can I do this?
The input file has strings separated by a newline, very similar to simple java properties.
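You can try using the sed command.
sed -E 's/\$\{[^}]*\}//g' <input_file> > <output_file>
Here sed deletes every ${...} sequence, delimiters included, and writes the result to a new output file. The -E flag matters: in basic regex mode, \{ starts an interval expression instead of matching a literal brace.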
You can give this one a shot:
\$\{[^}]*\}
Match ${ literally, followed by everything except }, followed by }
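Since the question asks to exclude whole lines, one way to apply this pattern is with grep -v. This is only a sketch: drop_placeholder_lines is a hypothetical helper name and the sample data is made up.

```shell
# drop_placeholder_lines is a hypothetical helper name; the sample data is made up.
drop_placeholder_lines() {
  # -E: extended regex, so \{ and \} are literal braces.
  # -v: print only the lines that do NOT match.
  grep -Ev '\$\{[^}]*\}'
}

printf '%s\n' 'hello ${VAR}' 'plain=value' 'https://${WEB_VAR}' | drop_placeholder_lines
```

Only the plain=value line should survive, because the other two contain a ${...} sequence.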
You say you're trying to exclude all strings in a file, so it sounds like you need something a bit more advanced than just a regex with grep. I'd do this with an awk script:
awk '{while(match($0,/\$\{[^}]*\}/)){$0=substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH)}} 1' input.txt
Or, split for easier reading and commenting:
{
  while (match($0,/\$\{[^}]*\}/)) {
    $0=substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH)
  }
}
1
The idea here is that for each line, we'll check to see whether the regex matches anything on the line. If it does, we'll replace the line with the parts around the matched regex. (We could alternatively use sub(/RE/,""), but that would require applying the regex twice per match rather than once.)
The final 1 is shorthand that says "print the current line". It runs whether or not the loop processed any matches.
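To try the loop on a couple of lines without an input file, it can be wrapped in a shell function. This is a sketch: strip_placeholders is a hypothetical name, the sample lines are made up, and the prefix is taken with substr starting at position 1 because awk strings are 1-based.

```shell
# strip_placeholders is a hypothetical helper name for this sketch.
strip_placeholders() {
  awk '{
    # Keep cutting out the leftmost ${...} until none is left on the line.
    while (match($0, /\$\{[^}]*\}/)) {
      # RSTART and RLENGTH are set by match(); positions are 1-based.
      $0 = substr($0, 1, RSTART - 1) substr($0, RSTART + RLENGTH)
    }
  } 1'
}

printf '%s\n' 'hello ${VAR} world' 'https://${WEB_VAR}/x' | strip_placeholders
```

The text around each removed sequence is kept verbatim, so the first line ends up with two consecutive spaces where ${VAR} used to be.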
Just use the global wildcard .* around the two sequences, as in:
.*\$\{.*\}.*
As you want to match entire lines, you have to use a wildcard at both sides to extend the regexp to both ends (it doesn't matter whether you anchor it with ^ and $, as the greedy algorithm will try to extend as much as possible). Note that the $, { and } must be escaped, as they are reserved by the regexp language.
This can be seen in action here.
Note
The title of this question doesn't specify that the substring between the two curly braces must not contain a }. Since you only want to match the whole line, there is no need to check for anything other than a }; the only requirement is that a } comes after the ${ in the line. Anyway, this has no drawback in efficiency, as the NFA that parses this regexp has the same number of states as the other.

sed not printing matching group as expected

I am searching for a particular pattern in a csv file. I would like to print the value of the second-to-last column if its value matches [0-9]{5}.
For example, let's say I have file.csv containing only one line of text:
col1,col2,col3,12345,col5
So I'm trying to print 12345. Here is the command I tried:
sed -nr 's/,([0-9]{5}),[^,]*$/\1/p' file.csv
However, this prints col1,col2,col312345.
Then, I tried
sed -nr 's/.*,([0-9]{5}),[^,]*$/\1/p' file.csv
which worked perfectly, printing 12345.
I don't know if I'm misunderstanding sed or just regex in general, but when I test the first regex on www.regex101.com, it behaves as I originally expected it to.
Why did prepending a .* to the pattern make a difference / fix the problem, and also why did the first pattern print what it did?
The command s/pattern/replacement/p takes a line that matches pattern, performs the substitution and then prints the whole line (see note 1 below). So, you have this line:
col1,col2,col3,12345,col5
Your pattern /,([0-9]{5}),[^,]*$/ matches the line, specifically ,12345,col5. You substitute that with the capture group, 12345, so the line is now
col1,col2,col312345
and the p flag prints the whole line.
In your second command, the pattern /.*,([0-9]{5}),[^,]*$/ matches the line as well, but this time, it matches the whole line, and you substitute the whole line with the capture group.
1 In sed parlance, the line is loaded into the "pattern space", and you're manipulating the pattern space. At the end of each cycle, the pattern space gets printed unless -n is given (and also whenever an explicit p command or the p flag of a successful s command is used). I think you assumed that the p flag in the s command affects only the substituted part, but it's the whole pattern space.
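The difference between the two commands can be reproduced side by side. This sketch uses -E, which GNU sed accepts as a synonym for -r; the variable name line is just for illustration.

```shell
line='col1,col2,col3,12345,col5'

# The pattern covers only the tail ",12345,col5", so only that tail is
# replaced; the untouched prefix "col1,col2,col3" is printed along with it.
printf '%s\n' "$line" | sed -nE 's/,([0-9]{5}),[^,]*$/\1/p'

# The leading .* makes the pattern cover the whole line, so the whole
# line is replaced by the capture group.
printf '%s\n' "$line" | sed -nE 's/.*,([0-9]{5}),[^,]*$/\1/p'
```

The first command should print col1,col2,col312345 and the second 12345, matching what the question observed.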

Perl: regular expression: capturing group

In a code file, I want to remove any (one or more) consecutive white lines (lines that may include only zero or more spaces/tabs and then a newline) that go between a code text and the concluding } of a block. This concluding } may have spaces for indentation before it, so I want to keep them.
Here is what I try to do:
perl -i -0777 -pe 's/\s+\n([ ]*)\}/\n($1)\}/g' file
For example, if my code file looks like (□ is the space character):
□□□□while (true) {\n
□□□□□□□□print("Yay!");□□□□□□\n
□□□□□□□□□□□□□□□□\n
□□□□}\n
Then I want it to become:
□□□□while (true) {\n
□□□□□□□□print("Yay!");\n
□□□□}\n
However it does not do the change I expected. Any idea what I am doing wrong here?
The only issues I can see with your regex are that you don't need the parentheses around the matching variable (in the replacement, ($1) inserts literal ( and ) characters), and that the character class around the single space is redundant (unless you want to match tabs as well as spaces).
So, you could try
s/\s+\n( *)\}/\n$1\}/g
instead.
This works as expected when run on your test input.
To tidy it up even more, you could try the following.
s/\s+(\n *\})/$1/g
If there might be tabs as well as spaces, you can use a character class. (You do not need to include '|' inside the character class).
s/\s+(\n[ \t]*\})/$1/g
perl -pi -0777 -e's/^\s*\n(?=\s*})//mg' yourfile
(Remove whitespace from the beginning of a line through a newline that precedes a line with } as the first non-whitespace.)
Try using this regex instead, which uses a positive look-ahead assertion. This way you only capture the part that you want to remove, and then replace it with nothing:
s/\s+(?=\n[ ]*\})//g
You can try the following one liner
perl -0777 -pe 's/\s*\n*(\s*\n)/$1/g' test

sed is returning more than I need

Every line of the input file will match one of the patterns:
"SCnnnn"
"SC-nnnn"
"SC_nnnn"
( n=[0-9], SC is literal but may be upper or lowercase and will be followed immediately by 1-4 digits delimited at the end by an alphanumeric, space or other non-numeric character)
Somewhere in the line there will also be a file extension (matching ".abc") where abc = upper|lower alphanumeric in any position.
I want to extract the first pattern and print this together with the extracted file extension for each line. This is what I have so far:
sed -E -n 's/([Ss][Cc][-_]*[0-9][0-9]*).*(\.[a-zA-Z0-9]{3})/\1\2/p' infile
Here's a sample input line:
SCSCSCSCSCSCSCSCSC1867SCBrSCSCSCSC&SCBlSCkSCSCBSCrSCbSCckSC.xyz
with required output being:
SC1867.xyz
but what I am getting is:
SCSCSCSCSCSCSCSCSC1867.xyz
Can someone please tell me why this is returning the "SC"s before the part I want? I know it's something to do with greediness, but I can't get my head around it.
(Everything works fine where my "SCnnnn" match is at the beginning of the line.)
I am open to other tools - e.g. awk - if they offer a more straightforward solution.
EDIT: I think I found a solution - at least it appears to work:
sed -E -n 's/.*([Ss][Cc][-_]*[0-9][0-9]*).*(\.[a-zA-Z0-9]{3})/\1\2/p'
It's actually not necessarily the greediness that is at play here. The reason this is happening is because sed is replacing a part of a line and then printing the whole line (the suffix of p on your s// command does this).
To more clearly see what's happening, make infile contain a more obvious string like 0o0o0o0o0o0o0o0oSC1867lalalalalalfalalala.xyz and run your first command. The following is the result
[user#localhost ~]$ sed -E -n 's/([Ss][Cc][-_]*[0-9][0-9]*).*(\.[a-zA-Z0-9]{3})/\1\2/p' infile
0o0o0o0o0o0o0o0oSC1867.xyz
As a slow-mo: sed finds your [Ss][Cc] characters beginning after the 0o0o0s and dutifully replaces the string you have described with the desired substitution; namely, it maintains the SC_-like part and four digits, then deletes everything after the numbers until the suffix. The problem is seen when the p command prints out the partially-changed line, including all of the unwanted 0oze.
Alternately
As an alternate solution, not involving printing partially changed lines but instead matching an entire line and altering it to your purpose, the following command extracted the correct answer to stdout for a file containing your example string:
[user#localhost ~]$ sed -e 's/^.*\([Ss][Cc][-_]\?[0-9]\{4\}\).*\(\.[a-zA-Z0-9]\{3\}\)$/\1\2/' infile
SC1867.xyz
To break that regex down a bit: the regex begins with a beginning of line (^), consumes all characters (.*) until it sees an SC (upper or lower, [Ss][Cc]), then it checks for an optional hyphen or underscore ([-_]\?), followed by exactly four digits ([0-9]\{4\}). Then, all characters are consumed until a dot (\.) is seen, followed by exactly three alphanumerical characters ([a-zA-Z0-9]\{3\}) and an end of line ($). The two expressions not consumed by a wildcard are saved in capture groups and concatenated (\1\2).
... sed -E 's/^.*([Ss][Cc][-_]?[0-9]{4}).*(\.[a-zA-Z0-9]{3})$/\1\2/' infile works too, if you don't enjoy backslashes as much as I do.
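The whole-line approach can be checked against both sample strings from this thread. This is a sketch: extract_sc is a hypothetical helper name, it uses the explicit class [a-zA-Z0-9] for the extension, and it adds -n with the p flag so that lines that do not match are dropped rather than echoed unchanged.

```shell
# extract_sc is a hypothetical helper name for this sketch.
# The pattern spans the entire line (^ ... $), so the replacement \1\2
# leaves nothing but the SC number and the extension.
extract_sc() {
  sed -nE 's/^.*([Ss][Cc][-_]?[0-9]{4}).*(\.[a-zA-Z0-9]{3})$/\1\2/p'
}

printf '%s\n' \
  'SCSCSCSCSCSCSCSCSC1867SCBrSCSCSCSC&SCBlSCkSCSCBSCrSCbSCckSC.xyz' \
  '0o0o0o0o0o0o0o0oSC1867lalalalalalfalalala.xyz' | extract_sc
```

Both lines should come out as SC1867.xyz. Like the command above, this expects exactly four digits, so it would need a different quantifier to handle the 1-4 digit case mentioned in the question.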

Regular Expression - Capture and Replace Select Sequences

Take the following file...
ABCD,1234,http://example.com/mpe.exthttp://example/xyz.ext
EFGH,5678,http://example.com/wer.exthttp://example/ljn.ext
Note that "ext" is a constant file extension throughout the file.
I am looking for an expression to turn that file into something like this...
ABCD,1234,http://example.com/mpe.ext
ABCD,1234,http://example/xyz.ext
EFGH,5678,http://example.com/wer.ext
EFGH,5678,http://example/ljn.ext
In a nutshell I need to capture everything up to the urls. Then I need to capture each URL and put them on their own line with the leading capture.
I am working with sed to do this and I cannot figure out how to make it work correctly. Any ideas?
If the number of URLs in each line is guaranteed to be two, you can use:
sed -r "s/([A-Z0-9,]{10})(.+\.ext)(.+\.ext)/\1\2\n\1\3/" < input
This does not require the first two fields to be a particular width or limit the set of (non-comma) characters between the commas. Instead, it keys on the commas themselves.
sed 's/\(\([^,]*,\)\{2\}\)\(.*\.ext\)\(http:.*\)/\1\3\n\1\4/' inputfile.txt
You could change the "2" to match any number of comma-delimited fields.
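Here is a small sketch of the comma-keyed command applied to the two sample lines. It assumes GNU sed (for \n in the replacement), and split_urls is a hypothetical helper name.

```shell
# split_urls is a hypothetical helper name; assumes GNU sed for \n in the replacement.
# \1 = the first two comma-terminated fields, \3 = first URL, \4 = second URL
# (\2 is the inner group that gets repeated twice).
split_urls() {
  sed 's/\(\([^,]*,\)\{2\}\)\(.*\.ext\)\(http:.*\)/\1\3\n\1\4/'
}

printf '%s\n' \
  'ABCD,1234,http://example.com/mpe.exthttp://example/xyz.ext' \
  'EFGH,5678,http://example.com/wer.exthttp://example/ljn.ext' | split_urls
```

Each input line should turn into two output lines, both starting with the same two leading fields.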
I have no sed available to me at the moment.
Wouldn't
sed -r 's/(....),(....),(.*\.ext)(http.*\.ext)/\1,\2,\3\n\1,\2,\4/g'
do the trick?
Edit: removed the lazy quantifier