Avoid repeating regex substitution - regex

I have lines of code (making up a Ruby hash) with the form of:
"some text with spaces" => "some other text",
I wrote the following vim style regex pattern to achieve my goal, which is to replace any spaces in the string to the left of the => with +:
:%s/\(.*\".*\)\ (.*\"\ =>.*\,)/\1+\2
Expected output:
"some+text+with+spaces" => "some other text",
Unfortunately, this only replaces the space nearest to the =>. Is there another pattern that will replace all the spaces in one run?

Rather than write one large, complex regex, a couple of smaller ones would be easier:
:%s/".\{-}"/\=substitute(submatch(0), ' ', '+', 'g')
For instance, this captures everything in quotes (escaped quotes break it) and then replaces all spaces inside the matched string with pluses.
If you want it to work with escaped quotes in the string, you just need to replace ".\{-}" with a slightly more complex regex, "\(\\.\|[^\"]\)*":
:%s/"\(\\.\|[^\"]\)*"/\=substitute(submatch(0), ' ', '+', 'g')
If you want to restrict the lines that this substitute runs on use a global command.
:g/=>/s/"\(\\.\|[^\"]\)*"/\=substitute(submatch(0), ' ', '+', 'g')
So this will only run on lines with =>.
Relevant help topic :h sub-replace-expression
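For readers outside Vim, the same two-step idea (match the quoted string, then rewrite the spaces inside the match with a callback) can be sketched in Python. This is only an illustration; the function name plus_for_spaces_in_key is made up for this example, and count=1 is an assumption that the hash key is always the first quoted string on the line.

```python
import re

# A double-quoted string that may contain backslash-escaped characters.
QUOTED = re.compile(r'"(?:\\.|[^"\\])*"')

def plus_for_spaces_in_key(line):
    """Replace spaces with '+' inside the first quoted string on the line."""
    # count=1 restricts the substitution to the first quoted string (the key).
    return QUOTED.sub(lambda m: m.group(0).replace(" ", "+"), line, count=1)

print(plus_for_spaces_in_key('"some text with spaces" => "some other text",'))
# "some+text+with+spaces" => "some other text",
```

The callback plays the role of Vim's \= sub-replace-expression: the outer pattern finds the string, and the inner replacement only ever sees the matched text.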

It's really far from perfect, but it nearly does the job:
:%s/\s\ze[^"]*"\s*=>\s*".*"/+/g
But it doesn't handle escaped quotes, so the following line won't be replaced correctly:
"some \"big text\" with many spaces" => "some other text",

Related

Replacing two characters with two different characters using regex in python

I want to replace two different characters with two other characters using regex in Python in one operation. For example, the string is "a/c stuff" and I want to transform it into "ac_stuff" using a single re.sub() line.
I searched here, but only found ways to solve this using the replace function; I am looking to do this with a regex in one line.
Thank you for the help!
It is technically possible, though not pretty, to do this in one line using re.sub:
re.sub("[/ ]", (lambda match: '' if match.group(0) == '/' else '_'), "a/c stuff")
Much nicer (and faster) way using str.translate
"a/c stuff".translate(str.maketrans({'/': None, ' ': '_'}))
or
"a/c stuff".translate(str.maketrans(' ', '_', '/'))
Probably the most readable way is through str.replace, though this doesn't scale well to many replacements.
"a/c stuff".replace('/', '').replace(' ', '_')

Matching multiple quoted strings in a single line with regex

I want to match quoted strings of the form 'a string' within a line. My issue comes with the fact that I may have multiple strings like this in a single line. Something like
result = functionCall('Hello', 5, 'World')
I can search for phrases bounded by strings with ['].*['], and that picks up quoted strings just fine if there is a single one in a line. But with the above example it would find 'Hello', ', 5, ' and 'World', when I only actually want 'Hello' and 'World'. Obviously I need some way of knowing how many ' precede the currently found ' and not try to match when there is an odd amount.
Just to note, in my case strings are only defined using ', never ".
You should use [^']+ between the quotes:
var myString = "result = functionCall('Hello', 5, 'World')";
var parts = myString.match(/'[^']+'/g);
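The same negated character class works unchanged in other regex flavors. A minimal Python sketch, assuming (as in the question) that strings are delimited only by single quotes and never contain escaped quotes:

```python
import re

line = "result = functionCall('Hello', 5, 'World')"

# [^']+ cannot cross a quote, so each match stops at the nearest closing quote.
quoted = re.findall(r"'[^']+'", line)
print(quoted)  # ["'Hello'", "'World'"]
```

Note that + requires at least one character, so an empty string '' would not be matched; [^']* would include it.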

negative look ahead on whole number but preceded by a character(perl)

I have text like this;
2500.00 $120.00 4500 12.00 $23.00 50.0989
I've written a regex:
/(?!$)\d+\.\d{2}/g
I want it to match only 2500.00 and 12.00, nothing else.
The requirement is to add a '$' sign to numeric values that have exactly two digits after the decimal point. With the current regex, it adds an extra '$' to the ones that already have a '$' sign. The full problem is longer, but I'm describing it briefly. I know I can use one regex to remove the '$' and then another regex to add '$' to all the desired numbers.
Any help would be appreciated, thanks!
To answer your question, you need to look before the pos where the first digit is.
(?<!\$)
But that's not going to work on its own, as it will match 23.45 of $123.45 and change it into $1$23.45, and it will match 123.45 of 123.456 and change it into $123.456. You want to make sure there are no digits before or after what you match.
s/(?<![\$\d])(\d+\.\d{2})(?!\d)/\$$1/g;
Or the quicker
s/(?<![\$\d])(?=\d+\.\d{2}(?!\d))/\$/g;
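As a cross-check, the first substitution can be tried in Python, whose re module supports the same fixed-width lookbehind and lookahead. This is a sketch for illustration; the name add_dollar is hypothetical.

```python
import re

# Not preceded by '$' or a digit, exactly two decimals, not followed by a digit.
AMOUNT = re.compile(r"(?<![$\d])(\d+\.\d{2})(?!\d)")

def add_dollar(text):
    """Prefix '$' to bare amounts with exactly two digits after the point."""
    return AMOUNT.sub(r"$\1", text)

print(add_dollar("2500.00 $120.00 4500 12.00 $23.00 50.0989"))
# $2500.00 $120.00 4500 $12.00 $23.00 50.0989
```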
This is tricky only because you are trying to include too many functionalities in your single regex. If you manipulate the string first to isolate each number, this becomes trivial, as this one-liner demonstrates:
$ perl -F"(\s+)" -lane's/^(?=\d+\.\d{2}$)/\$/ for @F; print @F;'
2500.00 $120.00 4500 12.00 $23.00 50.0989
$2500.00 $120.00 4500 $12.00 $23.00 50.0989
The full code for this would be something like:
while (<>) { # or whatever file handle or input you read from
    my @line = split /(\s+)/;
    s/^(?=\d+\.\d{2}$)/\$/ for @line;
    print @line; # or select your desired means of output
    # my $out = join "", @line; # as string
}
Note that this split is non-destructive because we use parentheses to capture our delimiters. So for our sample input, the resulting list looks like this when printed with Data::Dumper:
$VAR1 = [
'2500.00',
' ',
'$120.00',
' ',
'4500',
' ',
'12.00',
' ',
'$23.00',
' ',
'50.0989'
];
Our regex here is simply anchored in both ends, and allowed to contain numbers, followed by a period . and two numbers, and nothing else. Because we use a look-ahead assertion, it will insert the dollar sign at the beginning, and keep everything else. Because of the strictness of our regex, we do not need to worry about checking for any other characters, and because we split on whitespace, we do not need to check for any such.
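The split-then-test approach translates directly to Python, where re.split also keeps the delimiters when the pattern is wrapped in a capturing group. A rough sketch, with add_dollar_by_token as a made-up name:

```python
import re

def add_dollar_by_token(line):
    """Split on whitespace (keeping it), then prefix '$' to matching tokens."""
    # The capturing group makes re.split return the whitespace runs too,
    # so joining the tokens reproduces the original spacing.
    tokens = re.split(r"(\s+)", line)
    fixed = ["$" + t if re.fullmatch(r"\d+\.\d{2}", t) else t for t in tokens]
    return "".join(fixed)

print(add_dollar_by_token("2500.00 $120.00 4500 12.00 $23.00 50.0989"))
# $2500.00 $120.00 4500 $12.00 $23.00 50.0989
```

Because each token is tested in isolation with fullmatch, no lookarounds are needed at all.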
You can use this pattern:
s/(?<!\S)\d+\.\d{2}(?!\S)/\$${^MATCH}/gp
or
s/(?<!\S)(?=\d+\.\d{2}(?!\S))/\$/g
I think it is the shorter way.
(?<!\S) not preceded by a character that is not a white character
(?!\S) not followed by a character that is not a white character
The main benefit of these double negations is that they automatically cover the beginning-of-string and end-of-string cases.
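A small Python sketch of the zero-width variant shows the boundary behavior described above; the name add_dollar_ws is hypothetical, and the example assumes Python 3.7 or later for its handling of empty matches in re.sub.

```python
import re

# Zero-width position: not preceded by non-whitespace, and followed by an
# amount that is itself not followed by non-whitespace.
BOUNDED = re.compile(r"(?<!\S)(?=\d+\.\d{2}(?!\S))")

def add_dollar_ws(text):
    """Insert '$' before whitespace-delimited amounts with two decimals."""
    return BOUNDED.sub("$", text)

print(add_dollar_ws("12.00"))                # $12.00 (start and end of string)
print(add_dollar_ws("x12.00 12.00, 7.50"))   # x12.00 12.00, $7.50
```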

Attach a newline to every sentences

I was wondering how to turn a paragraph into bulleted sentences.
before:
sentence1. sentence2. sentence3. sentence4. sentence5. sentence6. sentence7.
after:
sentence1.
sentence2.
sentence3
sentence4.
sentence5.
Since all the other answers so far show how to do it various programming languages and you have tagged the question with Vim, here's how to do it in Vim:
:%s/\.\(\s\+\|$\)/.\r\r/g
I've used two carriage returns to match the output format you showed in the question. There are a number of alternative regular expression forms you could use:
" Using a look-behind
:%s/\.\@<=\( \|$\)/\r\r/g
" Using 'very magic' to reduce the number of backslashes
:%s/\v\.( |$)/.\r\r/g
" Slightly different formation: this will also break if there
" are no spaces after the full-stop (period).
:%s/\.\s*$\?/.\r\r/g
and probably many others.
A non-regexp way of doing it would be:
:let s = getline('.')
:let lineparts = split(s, '\.\@<=\s*')
:call append('.', lineparts)
:delete
See:
:help pattern.txt
:help change.txt
:help \@<=
:help :substitute
:help getline()
:help append()
:help split()
:help :d
You can use a regex
/\.( |$)/g
That will match the end of the sentence, then you can add newlines.
Or you can use some split function with . (dot space) and . (dot), then join with newlines.
Just replace all end of sentences /(?<=.) / with a period followed by two newline characters /.\n\n/. The syntax would of course depend on the language you are using.
Using Perl:
perl -e "$_ = <>; s/\.\s*/.\n/g; print"
Longer, somewhat more readable version:
my $input = 'foo. bar. baz.';
$input =~ s/
\. # A literal '.'
\s* # Followed by 0 or more space characters
/.\n/gx; # g for all occurrences, x to allow comments and whitespace in regex
print $input;
Using Python:
import re
text = 'foo. bar. baz.'
print(re.sub(r'\.\s*', '.\n', text))
An example using Ruby:
ruby-1.9.2 > a = "sentence1. sentence2. sentence3. and array.split(). the end."
=> "sentence1. sentence2. sentence3. and array.split(). the end."
ruby-1.9.2 > puts a.gsub(/\.(\s+|$)/, ".\n\n")
sentence1.
sentence2.
sentence3.
and array.split().
the end.
In words: for every . followed by one or more whitespace characters, or by the end of the line, replace it with just a . and two newline characters.
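The same pattern can be tried in Python to confirm that a period inside a token such as array.split() is left alone, because it is not followed by whitespace or the end of the string. The function name split_sentences is an assumption made for this sketch.

```python
import re

def split_sentences(text):
    """Put a blank line after every period that ends a sentence."""
    # A period counts as a sentence end only when whitespace or the end of
    # the string follows it, so "array.split()" stays intact.
    return re.sub(r"\.(\s+|$)", ".\n\n", text)

print(split_sentences("sentence1. sentence2. and array.split(). the end."))
```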
using awk
$ awk '{$1=$1}1' OFS="\n" file
sentence1.
sentence2.
sentence3.
sentence4.
sentence5.
sentence6.
sentence7
In PHP:
<?php
$input = "sentence. sentence. sentence.";
$output = preg_replace("/(.*?)\\.[\\s]+/", "$1\n", $input);
?>
Also, regular expressions are a blast, but not necessary for this problem. You can also try:
<?php
$input = "sentence. sentence. sentence.";
$arr = explode('.', $input);
foreach ($arr as $k => $v) $arr[$k] = trim($v);
$output = implode("\n", $arr);
?>
I figured out how to do this in RegExr
Search String is
(\-=?\s+)
--
Replace String is
\n\n
This is the generated information for the current regex
RegExp: /(\-=?\s+)/g
pattern: (\-=?\s+)
flags: g
capturing groups: 1
group 1: (\-=?\s+)
This will find every - in the sentence below and replace it with two newlines
Sentence 1- Sentence 2- Sentence 3- Sentence 4- Sentence 5-
The end result is
Sentence 1
Sentence 2
Sentence 3
Sentence 4
Sentence 5
I have a really simple, naive solution using a capturing group:
:%s/\([.!?]\)/\1\r\r/g
The main drawback is that this won't handle ellipses or multiple punctuation marks.

Is there a way, using regular expressions, to match a pattern for text outside of quotes?

As stated in the title, is there a way, using regular expressions, to match a text pattern for text that appears outside of quotes. Ideally, given the following examples, I would want to be able to match the comma that is outside of the quotes, but not the one in the quotes.
This is some text, followed by "text, in quotes!"
or
This is some text, followed by "text, in quotes" with more "text, in quotes!"
Additionally, it would be nice if the expression would respect nested quotes as in the following example. However, if this is technically not feasible with regular expressions, then it would simply be nice to know that this is the case.
The programmer looked up from his desk, "This can't be good," he exclaimed, "the system is saying 'File not found!'"
I have found some expressions for matching something that would be in the quotes, but nothing quite for something outside of the quotes.
Easiest is matching both commas and quoted strings, and then filtering out the quoted strings.
/"[^"]*"|,/g
If you really can't have the quotes matching, you could do something like this:
/,(?=[^"]*(?:"[^"]*"[^"]*)*\Z)/g
This could become slow, because for each comma, it has to look at the remaining characters and count the number of quotes. \Z matches the end of the string. Similar to $, but will never match line ends.
If you don't mind an extra capture group, it could be done like this instead:
/\G((?:[^"]*"[^"]*")*?[^"]*?)(,)/g
This will only scan the string once. It counts the quotes from the beginning of the string instead. \G will match the position where last match ended.
The last pattern could need an example.
Input String: 'This is, some text, followed by "text, in quotes!" and more ,-as'
Matches:
1. ['This is', ',']
2. [' some text', ',']
3. [' followed by "text, in quotes!" and more ', ',']
It matches the string leading up to the comma, as well as the comma.
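The first two patterns above can be compared side by side in Python. This is a sketch under the answer's own assumption that quotes are balanced and never escaped; both function names are made up for the example.

```python
import re

def commas_by_filtering(text):
    """Match quoted strings and commas together, then keep only the commas."""
    return [m.start() for m in re.finditer(r'"[^"]*"|,', text)
            if m.group(0) == ","]

def commas_by_lookahead(text):
    """Match a comma only if an even number of quotes follows it."""
    return [m.start()
            for m in re.finditer(r',(?=[^"]*(?:"[^"]*"[^"]*)*\Z)', text)]

sample = 'This is some text, followed by "text, in quotes" with more "text, in quotes!"'
print(commas_by_filtering(sample))  # [17]
print(commas_by_lookahead(sample))  # [17]
```

The filtering version consumes each quoted string in one step, so it scans the input once; the lookahead version rescans the rest of the string for every comma.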
This can be done with modern regexes due to the massive number of hacks to regex engines that exist, but let me be the one to post the "Don't Do This With Regular Expressions" answer.
This is not a job for regular expressions. This is a job for a full-blown parser. As an example of something you can't do with (classical) regular expressions, consider this:
()(())(()())
No (classical) regex can determine if those parentheses are matched properly, but doing so without a regex is trivial:
/* C code */
#include <stdio.h>

int main(void)
{
    char string[] = "()(())(()())";
    int parens = 0;
    /* walk the string until the terminating NUL character */
    for(char *tmp = string; *tmp; tmp++)
    {
        if(*tmp == '(') parens++;
        if(*tmp == ')') parens--;
    }
    if(parens > 0)
    {
        printf("%d too many open parentheses.\n", parens);
    }
    else if(parens < 0)
    {
        printf("%d too many closing parentheses.\n", -parens);
    }
    else
    {
        printf("Parentheses match!\n");
    }
    return 0;
}
# Perl code
my $string = "()(())(()())";
my $parens = 0;
for(split(//, $string)) {
$parens++ if $_ eq "(";
$parens-- if $_ eq ")";
}
die "Too many open parenthesis.\n" if $parens > 0;
die "Too many closing parenthesis.\n" if $parens < 0;
print "Parenthesis match!";
See how simple it was to write some non-regex code to do the job for you?
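A comparable counter in Python, offered as a sketch rather than a translation: unlike the examples above, it also rejects a closing parenthesis that appears before its opener (for example ")("), which is an addition not present in the original code. The name parens_balanced is hypothetical.

```python
def parens_balanced(text):
    """Return True if every '(' has a matching, later ')'."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A ')' appeared with no '(' left to close.
                return False
    # Anything left open means too many '('.
    return depth == 0

print(parens_balanced("()(())(()())"))  # True
print(parens_balanced(")("))            # False
```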
EDIT: Okay, back from seeing Adventureland. :) Try this (written in Perl, commented to help you understand what I'm doing if you don't know Perl):
# split $string into a list, split on the double quote character
my @temp = split(/"/, $string);
# iterate through a list of the number of elements in our list
for(0 .. $#temp) {
# skip odd-numbered elements - only process $temp[0], $temp[2], etc.
# the reason is that, if we split on "s, every other element is a string
next if $_ & 1;
if($temp[$_] =~ /regex/) {
# do stuff
}
}
Another way to do it:
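my $bool = 0;   # true while we are inside a quoted string
my $str = "";   # text collected outside quotes
my $match = 0;  # number of unquoted segments that matched
# loop through the characters of a string
for(split(//, $string)) {
    if($_ eq '"') {
        $bool = !$bool;
        if($bool) {
            # regex time!
            $match += $str =~ /regex/;
            $str = "";
        }
        # never add the quote character itself to the test string
        next;
    }
    if(!$bool) {
        # add the current character to our test string
        $str .= $_;
    }
}
# get trailing string match
$match += $str =~ /regex/;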
(I give two because, in another language, one solution may be easier to implement than the other, not just because There's More Than One Way To Do It™.)
Of course, as your problems grow in complexity, there will arise certain benefits of constructing a full-blown parser, but that's a different horse. For now, this will suffice.
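The split-on-quotes idea carries over to Python with a slice instead of an index test. A minimal sketch, assuming unescaped, balanced double quotes; both function names are invented for illustration.

```python
import re

def segments_outside_quotes(text):
    """Return the pieces of text that lie outside double quotes."""
    # After splitting on '"', even-indexed pieces are outside the quotes
    # and odd-indexed pieces are the quoted strings themselves.
    return text.split('"')[0::2]

def count_outside(text, pattern):
    """Count matches of pattern in the unquoted pieces only."""
    return sum(len(re.findall(pattern, seg))
               for seg in segments_outside_quotes(text))

sample = 'This is some text, followed by "text, in quotes!"'
print(segments_outside_quotes(sample))  # ['This is some text, followed by ', '']
print(count_outside(sample, ","))       # 1
```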
As mentioned before, a regexp cannot match arbitrarily nested patterns, since nesting requires a context-free language rather than a regular one.
So if you have any nested quotes, you are not going to solve this with a regex.
(Except with the "balancing group" feature of a .Net regex engine - as mentioned by Daniel L in the comments - , but I am not making any assumption of the regex flavor here)
Except if you add further specification, like a quote within a quote must be escaped.
In that case, the following:
text before string "string with \escape quote \" still
within quote" text outside quote "within quote \" still inside" outside "
inside" final outside text
would be matched successfully with:
(?ms)((?:\\(?=")|[^"])+)(?:"((?:[^"]|(?<=\\)")+)(?<!\\)")?
group1: text preceding a quoted text
group2: text within double quotes, even if \" are present in it.
Here is an expression that gets the match, but it isn't perfect, as the first match it gets is the whole string, removing the final ".
[^"].*(,).*[^"]
I have been using my Free RegEx tester to see what works.
Test Results
Group Match Collection # 1
Match # 1
Value: This is some text, followed by "text, in quotes!
Captures: 1
Match # 2
Value: ,
Captures: 1
You would do better to build yourself a simple parser (pseudo-code):
quoted := False
FOR char IN string DO
    IF char = '"'
        quoted := !quoted
    ELSE
        IF char = "," AND !quoted
            // not quoted comma found
        ENDIF
    ENDIF
ENDFOR
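The pseudo-code above maps almost line for line onto Python. In this sketch the function name and the decision to collect comma positions in a list are assumptions; the original only marks where a comma is found.

```python
def unquoted_comma_positions(text):
    """Return the indexes of commas that are not inside double quotes."""
    quoted = False
    positions = []
    for i, ch in enumerate(text):
        if ch == '"':
            # Every quote toggles between inside and outside.
            quoted = not quoted
        elif ch == "," and not quoted:
            positions.append(i)
    return positions

print(unquoted_comma_positions('This is some text, followed by "text, in quotes!"'))
# [17]
```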
This really depends on if you allow nested quotes or not.
In theory, with nested quotes you cannot do this (regular languages can't count)
In practice, you might manage if you can constrain the depth. It will get increasingly ugly as you add complexity. This is often how people get into grief with regular expressions (trying to match something that isn't actually regular in general).
Note that some "regex" libraries/languages have added non-regular features.
If this sort of thing gets complicated enough, you'll really have to write/generate a parser for it.
You need more in your description. Do you want any set of possible quoted strings and non-quoted strings like this ...
Lorem ipsum "dolor sit" amet, "consectetur adipiscing" elit.
... or simply the pattern you asked for? This is pretty close I think ...
(?<outside>.*?)(?<inside>(?=\"))
It does capture the "'s however.
Maybe you could do it in two steps?
First you replace the quoted text:
("[^"]*")
and then you extract what you want from the remaining string
,(?=(?:[^"]*"[^"]*")*[^"]*\z)
Regexes may not be able to count, but they can determine whether there's an odd or even number of something. After finding a comma, the lookahead asserts that, if there are any quotation marks ahead, there's an even number of them, meaning the comma is not inside a set of quotes.
This can be tweaked to handle escaped quotes if needed, though the original question didn't mention that. Also, if your regex flavor supports them, I would add atomic groups or possessive quantifiers to keep backtracking in check.
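One possible tweak for escaped quotes, sketched in Python: every "run of non-quote characters" in the lookahead is replaced by a unit that also accepts a backslash followed by any character. This is an assumption about how the tweak might look, not the answer's own pattern, and it still assumes the quotes are balanced overall.

```python
import re

# Either an ordinary character (not a quote or backslash) or an escape pair.
UNIT = r'(?:[^"\\]|\\.)*'

# A comma followed by an even number of unescaped quotes up to the end.
OUTSIDE_COMMA = re.compile(
    "," + "(?=" + UNIT + '(?:"' + UNIT + '"' + UNIT + ")*" + r"\Z)"
)

def commas_outside_quotes(text):
    """Return indexes of commas outside quotes, honoring \\" escapes."""
    return [m.start() for m in OUTSIDE_COMMA.finditer(text)]

print(commas_outside_quotes(r'a, "b \" c, d", e'))  # [1, 14]
```

The two alternatives inside UNIT cannot match the same character, which limits backtracking even without atomic groups (Python's re gained those only in 3.11).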