Regex for replacing space with comma-space, except at end of line - regex

I am trying to convert the content of an input file from this:
NP_418770.2: 257-296 344-415 503-543 556-592 642-707
YP_026226.4: 741-779 811-890 896-979 1043-1077
to this:
NP_418770.2: 257-296, 344-415, 503-543, 556-592, 642-707
YP_026226.4: 741-779, 811-890, 896-979, 1043-1077
i.e., replace a space with comma and space (excluding newline)
For that, I have tried:
perl -pi.bak -e "s/[^\S\n]+/, /g" input.txt
but it gives:
NP_418770.2:, 257-296, 344-415, 503-543, 556-592, 642-707
YP_026226.4:, 741-779, 811-890, 896-979, 1043-1077
How can I stop the additional comma from appearing after ":" (I want ":" followed by a single space) without writing another regex?
Thanks

Try a negative lookbehind. It checks the character before the whitespace: if that character is a colon (:), the whitespace is not matched.
s/(?<!:)[^\S\n]+/, /g
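For anyone who wants to try the lookbehind outside of Perl, here is a minimal Python sketch of the same substitution. The function name add_commas and the sample data are made up for illustration; only the pattern comes from the answer above.

```python
import re

def add_commas(text: str) -> str:
    # [^\S\n]+ means "whitespace other than a newline";
    # (?<!:) rejects a match that starts right after a colon.
    return re.sub(r"(?<!:)[^\S\n]+", ", ", text)

sample = "NP_418770.2: 257-296 344-415 503-543\nYP_026226.4: 741-779 811-890\n"
print(add_commas(sample), end="")
```

Note that the lookbehind only protects the first whitespace character after the colon. If a colon is followed by two or more spaces, the second space is preceded by a space rather than a colon, so it would still be replaced.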

You can play with the word-boundary to discard the space that follows the colon: s/\b\h+/, /g
It can be done with perl:
perl -pe's/\b\h+/, /g' file
but also with sed:
sed -E 's/\b[ \t]+/, /g' file
Another approach uses the field separator (with -n and an explicit print, because -p would print the unmodified line):
perl -F'\b\h+' -ane'BEGIN{$,=", "} print @F' file
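or do the same with GNU awk (there the word boundary is written \y, since \b means backspace in awk, and $1=$1 forces the line to be rebuilt with OFS):
gawk -F'\\y[ \t]+' -v OFS=', ' '{$1=$1} 1' file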
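The word-boundary trick can be sketched in Python as well. Python's re module has no \h, so this assumes [ \t] is close enough for the input shown; add_commas_wb is a hypothetical helper name.

```python
import re

def add_commas_wb(text: str) -> str:
    # \b matches between a word character and a non-word character.
    # ":" and " " are both non-word characters, so there is no boundary
    # between them and the space after the colon is left alone.
    return re.sub(r"\b[ \t]+", ", ", text)

sample = "NP_418770.2: 257-296 344-415 503-543\nYP_026226.4: 741-779 811-890\n"
print(add_commas_wb(sample), end="")
```

One thing to keep in mind: a line with trailing spaces after the last range would get a trailing ", " with this pattern, because there is a word boundary after the final digit too.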

You were close. That should do the trick:
s/(\d+-\d+)[^\S\n]+/$1, /g
The idea is to match the parts that should get a comma after them, i.e. the pattern "digits, then a dash, more digits, then whitespace that is not a newline". The "whitespace that is not a newline" part is written [^\S\n]+, a negated class that means "neither a non-whitespace character nor a newline" (\S is everything that \s is not, and we want to exclude the newline too). If a line might have trailing whitespace, trim it with s/\s+$// before the regex above; just don't forget to add the newline character back afterwards.
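A rough Python sketch of that two-step idea (trim first, then substitute, then put the newline back). The helper name add_commas_ranges is an assumption for illustration, and it expects one line at a time.

```python
import re

def add_commas_ranges(line: str) -> str:
    # Step 1: drop trailing whitespace, including the newline.
    body = line.rstrip()
    # Step 2: after each "digits-digits" group, turn the following
    # non-newline whitespace into ", ".
    body = re.sub(r"(\d+-\d+)[^\S\n]+", r"\1, ", body)
    # Step 3: add the newline back.
    return body + "\n"

print(add_commas_ranges("NP_418770.2: 257-296 344-415 503-543  \n"), end="")
```

Because the pattern requires a dash between the digit runs, the identifier before the colon never matches, so no lookbehind is needed.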

Related

Regex to get match on entire string

How to match a word before a specific character using sed in bash?
In my scenario I would need to match the metrics names in the entire string which occurs only before {.
The below is the string I am working on.
sum(rate(nginx_ingress_controller_request_duration_seconds_sum{namespace=\"$namespace\",ingress=~\"$ingress\"}[3m]))/sum(rate(nginx_ingress_controller_request_duration_seconds_count{namespace=\"$namespace\",ingress=~\"$ingress\"}[3m]))
The output I need is the following.
nginx_ingress_controller_request_duration_seconds_sum
nginx_ingress_controller_request_duration_seconds_count
I am not a Regex expert and I would be very thankful.
With GNU grep:
grep -oP '\(\K[^({]+(?={)'
This will print the results on separate lines. \(\K checks for the presence of a ( character and resets the start of the matching portion (since ( isn't needed in the output). [^({]+ matches one or more characters other than ( and {. (?={) makes sure that the matched portion is followed by a { character (which is not part of the output).
If you know that the required portion can have only word characters, you can also use:
grep -oP '\w+(?={)'
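If grep -P is not available, the same two patterns can be tried in Python. This is only a sketch: Python's re has no \K, so the first pattern uses a capture group instead, and QUERY is just the string from the question.

```python
import re

QUERY = (
    r'sum(rate(nginx_ingress_controller_request_duration_seconds_sum'
    r'{namespace=\"$namespace\",ingress=~\"$ingress\"}[3m]))'
    r'/sum(rate(nginx_ingress_controller_request_duration_seconds_count'
    r'{namespace=\"$namespace\",ingress=~\"$ingress\"}[3m]))'
)

# Equivalent of \(\K[^({]+(?={): the group stands in for \K.
names_by_paren = re.findall(r"\(([^({]+)(?=\{)", QUERY)

# Equivalent of \w+(?={): word characters directly followed by "{".
names_by_word = re.findall(r"\w+(?=\{)", QUERY)

for name in names_by_word:
    print(name)
```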
This will find the two occurrences on the line and write each one onto a separate line in new_file
(with GNU sed):
sed 's/.*(\(.*\){.*(\(.*\){.*/\1\n\2/' your_file > new_file
Contents of new_file:
nginx_ingress_controller_request_duration_seconds_sum
nginx_ingress_controller_request_duration_seconds_count
The way it works is as follows:
s/.*(: Match everything from the start of the line up to a (
\(.*\): Remember the stuff in between \( and \) (these are called capture groups)
{.*(: Match everything after a { up to a (
\(.*\): Remember a second group of stuff using a second capture group
{.*: Match the rest of the stuff in the line
/\1\n\2/: Put the two patterns we remembered back into the line with a newline \n between them.
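The same two-capture-group idea, sketched in Python so the groups can be inspected directly. Python spells groups ( ) and literal parentheses \( \), the opposite of sed's basic syntax; QUERY is the string from the question.

```python
import re

QUERY = (
    r'sum(rate(nginx_ingress_controller_request_duration_seconds_sum'
    r'{namespace=\"$namespace\",ingress=~\"$ingress\"}[3m]))'
    r'/sum(rate(nginx_ingress_controller_request_duration_seconds_count'
    r'{namespace=\"$namespace\",ingress=~\"$ingress\"}[3m]))'
)

# .*\(   everything up to a "("
# (.*)\{ first remembered group, followed by "{"
# .*\(   everything up to the next relevant "("
# (.*)\{ second remembered group, followed by "{"
# .*     the rest of the line
match = re.fullmatch(r".*\((.*)\{.*\((.*)\{.*", QUERY)
first, second = match.group(1), match.group(2)
print(first)
print(second)
```

This relies on the line containing exactly two { characters, as in the question. With more occurrences the greedy .* parts would pick different groups.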
Edit
Another approach that would work for multiple occurrences is to
insert newlines and a unique pattern at the points before and after the part of the string that
you're interested in, and then grep away those lines:
sed 's/(/BADLINES\n/g; s/{/\nBADLINES/g' your_file | grep -v BADLINES
The first part (sed 's/(/BADLINES\n/g; s/{/\nBADLINES/g' your_file) produces:
sumBADLINES
rateBADLINES
nginx_ingress_controller_request_duration_seconds_sum
BADLINESnamespace=\"$namespace\",ingress=~\"$ingress\"}[3m]))/sumBADLINES
rateBADLINES
nginx_ingress_controller_request_duration_seconds_count
BADLINESnamespace=\"$namespace\",ingress=~\"$ingress\"}[3m]))
and the | grep -v BADLINES produces:
nginx_ingress_controller_request_duration_seconds_sum
nginx_ingress_controller_request_duration_seconds_count
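The marker-line approach does not need regexes at all, which a small Python sketch makes visible. The function name metric_names is hypothetical; the marker text is the one used above.

```python
QUERY = (
    r'sum(rate(nginx_ingress_controller_request_duration_seconds_sum'
    r'{namespace=\"$namespace\",ingress=~\"$ingress\"}[3m]))'
    r'/sum(rate(nginx_ingress_controller_request_duration_seconds_count'
    r'{namespace=\"$namespace\",ingress=~\"$ingress\"}[3m]))'
)

def metric_names(line: str, marker: str = "BADLINES") -> list[str]:
    # Same two substitutions as the sed command: "(" ends a marked line,
    # "{" starts one.
    marked = line.replace("(", marker + "\n").replace("{", "\n" + marker)
    # Same as grep -v: keep only the lines without the marker.
    return [piece for piece in marked.split("\n") if marker not in piece]

print("\n".join(metric_names(QUERY)))
```

As with the sed version, this assumes the marker text never occurs in the input itself.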
This might work for you (GNU sed):
sed -E '/^(\w+)\{/{s//\1\n/;P;D};s/^\w*\W/\n/;D' file
If the start of the line is a valid string followed by a {, replace the { by a newline, print/delete the first line in the pattern space and repeat.
Otherwise, reduce the pattern space and repeat until all strings are matched.
N.B. A valid string in this case is a word i.e. alphanumeric or an underscore.

Substitute words not in double quotes

$ cat file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
basic
I want a Unix sed command that changes basic to ring, but only where basic is not inside quotes.
Expected output:
$ cat file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
ring
If we disallow escaping quotes, then any basic that is not within " is preceded by an even number of ". So this should do the trick:
sed -r 's/^([^"]*("[^"]*){2}*)basic/\1ring/' file
And as ДМИТРИЙ МАЛИКОВ mentioned, adding the --in-place option will immediately edit the file, instead of returning the new contents.
How does this work?
We anchor the regular expression to the beginning of each line with ^. Then we allow an arbitrary number of non-" characters (with [^"]*). Then we start a new subpattern "[^"]* that consists of one " and arbitrarily many non-" characters. We repeat that an even number of times (with {2}*, i.e. zero or more repetitions of exactly two copies). And then we match basic. Because all of that stuff before basic is part of the match, it would be replaced as well. That's why this part is wrapped in another pair of parentheses: it captures the start of the line so the replacement can write it back with \1, followed by ring.
One caveat: if you have multiple basic occurrences in one line, this will only replace the last one that is not enclosed in double quotes, because regex matches cannot overlap. A solution would be a lookbehind, but it would have to be a variable-length lookbehind, which only a few regex engines support (.NET is one), and sed has no lookbehind at all. So if that is the case in your actual input, run the command multiple times until all occurrences are replaced.
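Here is a hedged Python sketch of the "even number of quotes before basic" idea, including the "run it until nothing changes" loop for lines with several occurrences. Python's re rejects the stacked quantifier {2}*, so the pair of quoted chunks is spelled out explicitly; replace_unquoted is a made-up name.

```python
import re

# Start of line, any non-quote text, then zero or more complete
# "quoted" chunks (each with optional unquoted text after it), then basic.
OUTSIDE_QUOTES = re.compile(r'^([^"]*(?:"[^"]*"[^"]*)*)basic')

def replace_unquoted(line: str) -> str:
    # Each pass rewrites only the last unquoted "basic" (the prefix is
    # greedy), so repeat until a pass changes nothing.
    while True:
        line, count = OUTSIDE_QUOTES.subn(r"\1ring", line)
        if count == 0:
            return line

for sample in ['"basic/strong/bold"', '" /""?basic""/strong/bold"',
               '"^/))basic"', 'basic']:
    print(replace_unquoted(sample))
```

Like the sed command, this assumes quotes are never escaped inside a quoted string.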
$> sed -r 's/^([^\"]*)(basic)([^\"]*)$/\1ring\3/' file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
ring
If you want to edit the file in place, use the --in-place option.
This might work for you (GNU sed):
sed -r 's/^/\n/;ta;:a;s/\n$//;t;s/\n("[^"]*")/\1\n/;ta;s/\nbasic/ring\n/;ta;s/\n([^"]*)/\1\n/;ta' file
Not a sed solution, but it substitutes words not in quotes
Assuming that there are no escaped quotes in the strings (e.g. "This is a trap \" hehe"), awk might be able to solve this problem:
awk -F\" 'BEGIN {OFS=FS}
{
    for(i=1; i<=NF; i++){
        if(i%2)
            gsub(/basic/,"ring",$i)
    }
    print
}' inputFile
Basically the words that are not in quotes are in odd-numbered fields, and the word "basic" is replaced by "ring" in these fields.
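The field-splitting idea carries over to Python almost literally. In this sketch (replace_outside_quotes is a hypothetical name) the even 0-based indexes play the role of awk's odd-numbered fields.

```python
def replace_outside_quotes(line: str, old: str = "basic", new: str = "ring") -> str:
    # Split on the double quote: pieces 0, 2, 4, ... sit outside quotes,
    # pieces 1, 3, 5, ... sit inside them.
    pieces = line.split('"')
    pieces[::2] = [piece.replace(old, new) for piece in pieces[::2]]
    # Join with the same separator, like OFS=FS in the awk version.
    return '"'.join(pieces)

for sample in ['"basic/strong/bold"', '" /""?basic""/strong/bold"',
               '"^/))basic"', 'basic']:
    print(replace_outside_quotes(sample))
```

Unlike the anchored sed regex, this handles any number of unquoted occurrences in one pass, under the same assumption that quotes are never escaped.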
This can be written as a one-liner, but for clarity's sake I've written it in multiple lines.
If basic is at the beginning of line:
sed -e 's/^basic/ring/' file0

How to look for lines which don't end with a certain character

How to look for lines which don't end with a ."
description="This has a full stop."
description="This has a full stop."
description="This line doesn't have a full stop"
You can use a character class to describe the occurrence of any character except .:
[^\n.](\n|$)
This will match any character that is neither a . nor new line character, and that is either followed by a new line character or by the end of the string. If multiline mode is supported, you can also use just $ instead of (\n|$).
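As a small illustration of the multiline variant, here is a Python sketch. The sample text is invented, and ^.* is prepended to the character class only so that the whole line is returned rather than just its last character.

```python
import re

text = "This has a full stop.\nThis one does not\nNeither does this!\n"

# With re.MULTILINE, $ matches at the end of every line, so [^\n.]$
# means "the last character of the line is not a dot".
no_full_stop = re.findall(r"^.*[^\n.]$", text, flags=re.MULTILINE)
print(no_full_stop)
```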
It depends on your environment. On Linux/Unix/Cygwin you would do something like this:
grep -n -v '\."$' <file.txt
or
grep -n -v '\."[[:space:]]*$' <file.txt
if trailing whitespace is fine.
I guess the regular expression pattern you are looking for is the following:
\."$
\. means a real dot. (compared to . which means any character except \n)
" is the double quote that ends the line in your example.
$ means end of line.
The way you will use this pattern depends on the environment you are using, so give us more details for a more precise answer :-)
In general, regular expressions describe what matches; expressing "does not match" is not easy. The general solution for this kind of thing is to invert the truth value. For example:
grep: grep -v '\.$'
Perl: $line !~ /\.$/
Tcl: ![regexp {\.$} $line]
In this specific case, since it is just a character, you can use the character class syntax, [], since it accepts a ^ modifier to signify anything that is not the specified characters:
[^.]$
so, in Perl it would be something like:
$line =~ /[^.]$/
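The two strategies (inverting the match versus a negated character class) are not quite interchangeable, which a short Python sketch can show. The patterns are adapted to the sample lines from the question, which end in a double quote; the empty string is added as an assumption to show the difference.

```python
import re

lines = [
    'description="This has a full stop."',
    'description="This line doesn\'t have a full stop"',
    '',
]

# Strategy 1: match the unwanted ending and invert the result.
inverted = [line for line in lines if not re.search(r'\."$', line)]

# Strategy 2: require a character that is not a dot before the quote.
negated = [line for line in lines if re.search(r'[^.]"$', line)]

print(inverted)
print(negated)
```

The inverted test also reports the empty line, while the negated class needs an actual character to match and therefore skips it.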

How can I add characters at the beginning and end of every non-empty line in Perl?

I would like to use this:
perl -pi -e 's/^(.*)$/\"$1\",/g' /path/to/your/file
for adding " at beginning of line and ", at end of each line in text file. The problem is that some lines are just empty lines and I don't want these to be altered. Any ideas how to modify above code or maybe do it completely differently?
Others have already answered the regex syntax issue, so let's look at the style.
s/^(.*)$/\"$1\",/g
This regex suffers from "leaning toothpick syndrome" where /// makes your brain bleed.
s{^ (.+) $}{"$1",}x;
Use of balanced delimiters, the /x modifier to space things out and elimination of unnecessary backwhacks makes the regex far easier to read. Also the /g is unnecessary as this regex is only ever going to match once per line.
perl -pi -e 's/^(.+)$/\"$1\",/g' /your/file
.* matches 0 or more characters; .+ matches 1 or more.
You may also want to replace the .+ with .*\S.* to ensure that only lines containing a non-whitespace character are quoted.
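To see the difference between .+ and .*\S.* on blank and whitespace-only lines, here is a minimal Python sketch. The function names and the sample text are assumptions for illustration only.

```python
import re

def quote_nonempty(text: str) -> str:
    # .+ needs at least one character, so empty lines are skipped,
    # but a line made only of spaces is still quoted.
    return re.sub(r"^(.+)$", r'"\1",', text, flags=re.MULTILINE)

def quote_nonblank(text: str) -> str:
    # .*\S.* additionally requires one non-whitespace character.
    return re.sub(r"^(.*\S.*)$", r'"\1",', text, flags=re.MULTILINE)

sample = "first\n\n   \nsecond line\n"
print(quote_nonempty(sample), end="")
print(quote_nonblank(sample), end="")
```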
Change .* to .+
In other words, lines must contain at least 1 character. .* represents zero or more characters.
You should be able to just replace the * (0 or more) with a + (1 or more), like so:
perl -pi -e 's/^(.+)$/\"$1\",/g' /path/to/your/file
All you are doing is adding something to the front and back of the line, so there is no need for a regex. Just print them out. A regex for such a task can be expensive if your file is big.
gawk
$ awk 'NF{print "\042" $0 "\042,"}' file
or Perl
$ perl -ne 'chomp;print "\042$_\042,\n" if ($_ ne "") ' file
sed -r 's/(.+)/"\1",/' /path/to/your/file

Regular Expression for carriage return occurring at beginning or end of file

I am looking for a way to remove 'stray' carriage returns occurring at the beginning or end of a file. ie:
\r\n <-- remove this guy
some stuff to say \r\n
some more stuff to say \r\n
\r\n <-- remove this guy
How would you match \r\n followed by 'nothing' or preceded by 'nothing'?
Try this regular expression:
^(\r\n)+|\r\n(\r\n)+$
Depending on the language, either the following regex with multiline mode off:
^\r\n|\r\n$
Or this regex:
\A\r\n|\r\n\z
The first one works in e.g. Perl (where ^ and $ match the beginning/end of the string by default and the beginning/end of each line in multiline mode, /m). The latter works in e.g. Ruby, where ^ and $ always match at line boundaries.
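For comparison, a Python sketch of the string-anchored version. Python's re spells the "very end of the string" anchor \Z, which is used here in place of \z; strip_stray_crlf is a hypothetical name.

```python
import re

def strip_stray_crlf(text: str) -> str:
    # \A is the very start of the string and \Z the very end, regardless
    # of flags, so only the first and last \r\n can be removed.
    return re.sub(r"\A\r\n|\r\n\Z", "", text)

sample = "\r\nsome stuff to say\r\nsome more stuff to say\r\n\r\n"
print(repr(strip_stray_crlf(sample)))
```

This removes one \r\n at each end, as in the answer above. To remove a whole run of them, each \r\n could be wrapped as (?:\r\n)+.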
Here's a sed version that should print out the stripped file:
sed -i .bak -e '/./,$!d' -e :a -e '/^\n*$/{$d;N;ba' -e '}' foo.txt
The -i tells it to perform the edit in place, and the .bak tells it to back up the original with a .bak extension first. If disk space is a concern, you can use '' instead of .bak and no backup will be made. I don't recommend that unless absolutely necessary, though.
The first command ('/./,$!d') should get rid of all leading blank lines, and the rest handles all trailing blank lines.
See this list of handy sed 1-liners for other interesting things you can chain together.
^\s+|\s+$
\s is whitespace (space, \r, \n, tab)
+ is saying 1 or more
$ is saying at the end of the input
^ is saying at the start of the input
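A quick Python sketch of this last pattern, applied to a whole file's content as one string. The anchors are written \A and \Z so they mean start and end of input regardless of flags; trim_ends is a made-up name.

```python
import re

def trim_ends(text: str) -> str:
    # Remove any run of whitespace (spaces, tabs, \r, \n) at the very
    # start or the very end of the input; inner whitespace is untouched.
    return re.sub(r"\A\s+|\s+\Z", "", text)

sample = "\r\n  some stuff to say\r\nsome more stuff to say \r\n\r\n"
print(repr(trim_ends(sample)))
```

Be aware that this also removes the line terminator of the last real line and any leading indentation on the first one, which may or may not be what you want.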