I would like to match any character and any whitespace except comma with regex. Only matching any character except comma gives me:
[^,]*
but I also want to match any whitespace characters, tabs, space, newline, etc. anywhere in the string.
EDIT:
This is using sed in vim via :%s/foo/bar/gc.
I want to find starting from func up until the comma, in the following example:
func("bla bla bla"
"asdfasdfasdfasdfasdfasdf"
"asdfasdfasdf", "more strings")
To work with multiple lines in sed using regex, you should look here.
EDIT:
In sed, working with newlines is a bit different. Sed supports three commands for multiline operations: N, P and D. To see how they work, see this explanation (Working with Multiple Lines), which discusses all three.
My guess is that the N command is the piece that is missing here. Adding N lets the pattern see \n in the pattern space.
An example from here:
Occasionally one wishes to use a new line character in a sed script.
Well, this has some subtle issues here. If one wants to search for a
new line, one has to use "\n." Here is an example where you search for
a phrase, and delete the new line character after that phrase -
joining two lines together.
(echo a;echo x;echo y) | sed '/x$/ {
N
s:x\n:x:
}'
which generates
a
xy
However, if you are inserting a new line, don't use "\n" - instead
insert a literal new line character:
(echo a;echo x;echo y) | sed 's:x:X\
:'
generates
a
X

y
So basically you're trying to match a pattern over multiple lines.
Here's one way to do it in sed (pretty sure this is not usable within vim though, and I don't know how to replicate it within vim)
sed '
/func/{
:loop
/,/! {N; b loop}
s/[^,]*/func("ok"/
}
' inputfile
Let's say inputfile contains these lines
func("bla bla bla"
"asdfasdfasdfasdfasdfasdf"
"asdfasdfasdf", "more strings")
The output is
func("ok", "more strings")
Details:
If a line contains func, enter the braces.
:loop is a label named loop
If the line does not contain , (that's what /,/! means)
append the next line to pattern space (N)
branch to / go to loop label (b loop)
So it will keep on appending lines and looping until , is found, upon which the s command is run which matches all characters before the first comma against the (multi-line) pattern space, and performs a replacement.
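For comparison outside sed: once the whole input sits in a single string, a negated class such as [^,] already matches spaces, tabs and newlines, so no line-joining loop is needed. A minimal Python sketch of that idea (the function and variable names are made up for illustration, and this is not a vim solution):

```python
import re

# Sample input from the question, held as one multi-line string.
text = (
    'func("bla bla bla"\n'
    '"asdfasdfasdfasdfasdfasdf"\n'
    '"asdfasdfasdf", "more strings")\n'
)

def replace_up_to_comma(source: str) -> str:
    # [^,]* is a negated class, so it consumes whitespace and newlines too.
    # count=1 limits the substitution to the first func(...) occurrence.
    return re.sub(r'func[^,]*', 'func("ok"', source, count=1)

print(replace_up_to_comma(text), end="")
```

Unlike the sed script, nothing has to be appended line by line here, because the regex engine sees the newlines as ordinary characters.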
Related
I have a string which starts with spaces. I want to replace the leading spaces with equal number of dashes -. I don't want to replace any other spaces which may occur elsewhere in the string.
If I use /^\s*/-/, it only replaces with a single dash. If I use /^\s/-/, it only replaces the first space with a dash. If I remove the anchor /\s/-/, it replaces every occurrence of a space in the string, which is not acceptable.
My string looks like this in general:
<n-leading-spaces><a-non-space-character><remaining-characters>
Example (pipes added to show the boundary):
|   ajfn ssfdjn ng jnv sjfj%nv sjfj n s ;sn |
After substitution (pipes added to show the boundary):
|---ajfn ssfdjn ng jnv sjfj%nv sjfj n s ;sn |
NOTE: I cannot use any code snippet. I just want to know whether this can be done using just regex patterns. (Forgive my formatting as I'm new to markdown. I welcome formatting corrections)
You can use the following solution to replace a sequence of characters with a sequence of different characters of same length using regular expressions:
my $string = '    ajfn ssfdjn ng jnv sjfj%nv sjfj n s ;sn ';
$string =~ s/^(\s+)/"-" x length($1)/eg;
print $string;
Returns '----ajfn ssfdjn ng jnv sjfj%nv sjfj n s ;sn '
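The same "computed replacement" idea exists in other regex flavours. As a rough sketch in Python (the function name is hypothetical), re.sub accepts a callable that receives the match and returns the replacement text:

```python
import re

def dash_leading_whitespace(s: str) -> str:
    # The callable gets the match object; emit one dash per matched character.
    return re.sub(r'^\s+', lambda m: '-' * len(m.group(0)), s)

print(dash_leading_whitespace('   ajfn ssfdjn ng '))
```

Only the leading run is touched because of the ^ anchor; spaces elsewhere in the string stay as they are.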
I have large text files, in which sometimes long lines are broken into multiple lines by writing a = and then a newline character. (Enron email data from Kaggle). Since even words are broken this way and I want to do some machine learning with the data, I'd like to remove those breaks. As far as I can see the combination =\n is only used for these breaks, so if I remove those, I have the same information without the breaks and nothing gets lost.
I cannot use tr because it only replaces 1 character, but I have two characters to replace.
The sed command I am using so far to no avail is:
sed --in-place --quiet --regexp-extended 's/=\n//g' email_aa_edit
where email_aa_edit is a part of the Enron mail data (I used split to split it) and is my input file. However, this only produces an empty file and I am not sure why. Afaik = is not a special character by itself, and the newline should be \n.
What is the correct way of removing those =\n occurrences?
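You can't remove newline characters directly, since sed works line by line and the newline is never part of the line it sees. (Your file ends up empty because --quiet suppresses sed's automatic printing, so --in-place writes nothing back.) It's possible, though, if you append the next line to the pattern space: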
sed ':a;/=$/{N;s/=\n//;ta}' file
details:
:a; # defines a label "a"
/=$/ { # if the line ends with =
N; # append the next line to the pattern space
s/=\n//; # replace the =\n
ta # jump to label "a" when something is replaced (that's always the case
# except if the last line ends with =)
}
Note: if your file uses the Windows newline sequence, change \n to \r\n.
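If the data is going into a machine-learning pipeline anyway, the same clean-up can be done on a string before further processing. A minimal Python sketch (the function name is an assumption, not part of any library):

```python
def join_soft_breaks(text: str) -> str:
    # Remove "=" followed by a line break; handle Windows endings first
    # so that no stray "\r" is left behind.
    return text.replace("=\r\n", "").replace("=\n", "")

print(join_soft_breaks("some long wo=\nrd broken in=\n two places\n"), end="")
```

Because the whole text is one string here, the two-character sequence can be replaced directly, which is exactly what sed cannot do without N.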
I have several lines from a table that I’m converting from Excel to the Wiki format, and want to add link tags for part of the text on each line, if there is text in that field. I have started the conversion job and come to this point:
|10.20.30.9||x|-||
|10.20.30.10||x|s04|Server 4|
|10.20.30.11||x|s05|Server 5|
|10.20.30.12|||||
|10.20.30.13|||||
What I want is to change the fourth column from, e.g., s04 to [[server:s04]]. I do not wish to add the link brackets if the line is empty, or if it contains -. If that - is a big problem, I can remove it.
All my regex attempts to match just part of the line end up replacing the whole line.
Consider using awk to do this:
#!/bin/bash
awk -F'|' '
BEGIN { OFS = "|" }  # rejoin the fields with the same separator
{
    # wrap column 5 in link brackets unless it is empty or "-"
    if ($5 != "" && $5 != "-")
        $5 = "[[server:" $5 "]]"
    print $0
}'
NOTE: I've edited this script since the first version. This current one, IMO, is better.
Then you can process it with:
cat "$FILENAME" | sh "$AWK_SCRIPTNAME"
The -F'|' switch tells awk to use | as the field separator. The if and print statements are pretty self-explanatory: every record is printed, with column 5 wrapped as [[server:...]] only if it is not "-" or "".
Why column 5 and not column 4? Because you use | at the beginning of each record, so awk takes the first field ($1) to be the empty string before that first |.
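The field-numbering point is easy to check by splitting a line by hand. A small Python sketch of the same column logic (the function name is hypothetical; note that Python counts from 0 where awk counts from 1):

```python
def link_server_column(line: str) -> str:
    fields = line.split("|")
    # The leading "|" produces an empty fields[0], so the fourth visible
    # column is fields[4] here (awk would call the same field $5).
    if len(fields) > 4 and fields[4] not in ("", "-"):
        fields[4] = "[[server:" + fields[4] + "]]"
    return "|".join(fields)

for row in ["|10.20.30.9||x|-||", "|10.20.30.10||x|s04|Server 4|"]:
    print(link_server_column(row))
```

Lines whose fourth column is empty or "-" come back unchanged, since joining the untouched fields reproduces the original line.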
This seems to do the job on the sample you give up there (with Vim):
%s/^|\%([^|]*|\)\{3}\zs[^|]*/\=(empty(submatch(0)) || submatch(0) == '-') ? submatch(0) : '[[server:'.submatch(0).']]'/
It's probably better to use awk as ArjunShankar writes, but this should work if you remove "-" ;) Didn't get it to work with it there.
:%s/^\([^|]*|\)\([^|]*|\)\([^|]*|\)\([^|]*|\)\([^|]\+\)|/\1\2\3\4[[server:\5]]|/
It's just stupid though. The first 4 groups are identical (match anything up to | 4 times). Didn't get it to work with {4}. The fifth matches the s04/s05 strings; the closing | is kept outside the group so it doesn't end up inside the brackets. It only requires that the field is not empty, therefore "-" must be removed.
Adding a bit more readability to the ideas given by others:
:%s/\v^%(\|.{-}){3}\|\zs(\w+)/[[server:\1]]/
Job done.
Note how {3} indicates the number of columns to skip. Also note the use of \v for very magic regex mode. This reduces the complexity of your regex, especially when it uses more 'special' characters than literal text.
Let me recommend the following substitution command.
:%s/^|\%([^|]*|\)\{3}\zs[^|-]\+\ze|/[[server:&]]/
try
:1,$s/|\(s[0-9]\+\)|/|[[server:\1]]|/
assuming that your s04, s05 are always s and a number
A simpler substitution can be achieved with this:
%s/^|.\{-}|.\{-}|.\{-}|\zs\(\w\{1,}\)\ze|/[[server:\1]]/
   ^^^^^^^^^^^^^^^^^^^^ -> Match the first 3 groups (empty or not);
                       ^^^ -> Marks the "start of match";
                          ^^^^^^^^^^^ -> Match only if the 4th column contains letters, numbers and `_` ([0-9A-Za-z_]);
                                     ^^^ -> Marks the "end of match";
If the _ character, like -, can appear but must not be substituted, use the following regex: %s/^|.\{-}|.\{-}|.\{-}|\zs\([0-9a-zA-Z]\{1,}\)\ze|/[[server:\1]]/
I figured out that in order to turn [some name] into [some_name] I need to use the following expression:
s/\(\[[^ ]*\) /\1_/
i.e. create a backreference capture for anything that starts with a literal '[' that contains any number of non space characters, followed by a space, to be replaced with the non space characters followed by an underscore. What I don't know yet though is how to alter this expression so it works for ALL spaces within the brackets e.g. [a few words] into [a_few_words].
I sense that I'm close, but am just missing a chunk of knowledge that will unlock the key to making this thing work an infinite number of times within the constraints of the first set of []s contained in a line (of SQL Server DDL in this case).
Any suggestions gratefully received....
There are two parts to the trickery needed:
Stop replacing when you reach a close square bracket (but do it repeatedly on the line):
s/\(\[[^] ]*\) /\1_/g
This matches an open square bracket, followed by zero or more characters that are neither a blank nor a close square bracket. The global suffix means that the pattern is applied to all sequences starting with an open square bracket followed eventually by a blank or close square bracket on the line. Note, too, that this regex does not alter '[single-word] and context' whereas the original would translate that to '[single-word]_and context', which is not the object of the exercise.
Get sed to repeat the search from where this one started. Unfortunately, there isn't a truly good way to do that. Sed always resumes searching after the text that was substituted; and this is one occasion when we don't want that. Sometimes, you can get away with simply repeating the substitute operation. In this case, you have to repeat it every time the substitution succeeds, stopping when there are no more substitutions.
Two of the less well known operations in sed are the ':label' and the 't' commands. They were present in the 7th Edition of Unix (circa 1978), though, so they are not new features. The first simply identifies a position in the script which can be jumped to with 'b' (not wanted here) or 't':
[2addr]t [label]
Branch to the ':' function bearing the label if any substitutions have been made since the most recent reading of an input line or execution of a 't' function. If no label is specified, branch to the end of the script.
Marvellous: we need:
sed -e ':redo; s/\(\[[^] ]*\) /\1_/g; t redo' data.file
Except - it doesn't work all on one line like that (at least, not on MacOS X). This did work admirably, though:
sed -e ':redo
s/\(\[[^] ]*\) /\1_/g
t redo' data.file
Or, as noted in the comments, you could write three separate '-e' options (which works on MacOS X):
sed -e ':redo' -e 's/\(\[[^] ]*\) /\1_/g' -e 't redo' data.file
Given the data file:
a line with [one blank] word inside square brackets.
a line with [two blank] or [three blank] words inside square brackets.
a line with [no-blank] word inside square brackets.
a line with [multiple words in a single bracket] inside square brackets.
a line with [multiple words in a single bracket] [several times on one line]
the output from the sed script shown is:
a line with [one_blank] word inside square brackets.
a line with [two_blank] or [three_blank] words inside square brackets.
a line with [no-blank] word inside square brackets.
a line with [multiple_words_in_a_single_bracket] inside square brackets.
a line with [multiple_words_in_a_single_bracket] [several_times_on_one_line]
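For readers more at home outside sed, the ':redo ... t redo' loop amounts to "substitute, and go round again if anything changed". A rough Python equivalent of that control flow, using the same pattern (the names are illustrative only):

```python
import re

# Same idea as the sed pattern: "[", then non-blank non-"]" characters, then a blank.
PATTERN = re.compile(r'(\[[^\] ]*) ')

def underscore_in_brackets(line: str) -> str:
    while True:
        # subn returns the new string and the number of substitutions made.
        line, count = PATTERN.subn(r'\1_', line)
        if count == 0:  # like 't': stop once a pass changes nothing
            return line

print(underscore_in_brackets("a line with [multiple words in a single bracket] inside"))
```

Each pass converts at most one blank per bracketed group, which is why the loop is needed at all.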
And, finally, reading the fine print in the question, if you need this done only in the first square-bracketed field on each line, then we need to ensure that there are no other bracketed fields before the one that starts the match. This variant works:
sed -e ':redo' -e 's/^\([^]]*\[[^] ]*\) /\1_/' -e 't redo' data.file
(The 'g' qualifier is gone - it probably isn't needed in the other variants either given the loop; its presence might make the process marginally more efficient, but it would most likely be essentially impossible to detect that. The pattern is now anchored to the start of the line (the caret) and allows zero or more characters that are not a close square bracket before the open square bracket, so the match can never reach past the first bracketed field.)
Sample output:
a line with [two_blank] or [three blank] words inside square brackets.
a line with [no-blank] word inside square brackets.
a line with [multiple_words_in_a_single_bracket] inside square brackets.
a line with [multiple_words_in_a_single_bracket] [several times on one line]
This is easier in a language like perl which has "executable" substitutions:
perl -wne 's/(\[.*?])/ do { my $x = $1; $x =~ y, ,_,; $x } /ge; print'
Or to split it up more clearly:
sub replace_with_underscores {
my $s = shift;
$s =~ y/ /_/;
$s
}
s/(\[.*?])/ replace_with_underscores($1) /ge;
The .*? is the non-greedy match (to avoid slurring together two adjacent bracketed phrases) and the e flag to the substitution causes it to be evaluated, so you can call a function to do the inner work.
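Python has the same facility: re.sub accepts a function as the replacement. A minimal sketch along the lines of the Perl version (names are hypothetical; it uses [^\]]* instead of the non-greedy .*? to stop at the first close bracket):

```python
import re

BRACKETED = re.compile(r'\[[^\]]*\]')

def underscore_brackets(line: str, first_only: bool = False) -> str:
    # count=0 means "replace every match"; count=1 stops after the first one.
    count = 1 if first_only else 0
    return BRACKETED.sub(lambda m: m.group(0).replace(' ', '_'), line, count=count)

print(underscore_brackets("a line with [two blank] or [three blank] words"))
print(underscore_brackets("a line with [two blank] or [three blank] words", first_only=True))
```

The first_only flag is one way to cover the "first set of []s on the line" reading of the question without a second regex.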
I need to clip out all the occurrences of the pattern '--' that are inside single quotes in a long string (leaving intact the ones that are outside single quotes).
Is there a RegEx way of doing this?
(using it with an iterator from the language is OK).
For example, starting with
"xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
I should end up with:
"xxxx rt / $ 'dfdffggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g 'ggh' vcbcvb"
So I am looking for a regex that could be run from the following languages as shown:
+-------------+------------------------------------------+
| Language    | RegEx                                    |
+-------------+------------------------------------------+
| JavaScript  | input.replace(/someregex/g, "")          |
| PHP         | preg_replace('/someregex/', "", input)   |
| Python      | re.sub(r'someregex', "", input)          |
| Ruby        | input.gsub(/someregex/, "")              |
+-------------+------------------------------------------+
I found another way to do this from an answer by Greg Hewgill at Qn138522
It is based on using this regex (adapted to contain the pattern I was looking for):
--(?=[^\']*'([^']|'[^']*')*$)
Greg explains:
"What this does is use the non-capturing match (?=...) to check that the character x is within a quoted string. It looks for some nonquote characters up to the next quote, then looks for a sequence of either single characters or quoted groups of characters, until the end of the string. This relies on your assumption that the quotes are always balanced. This is also not very efficient."
The usage examples would be :
JavaScript: input.replace(/--(?=[^']*'([^']|'[^']*')*$)/g, "")
PHP: preg_replace("/--(?=[^']*'([^']|'[^']*')*$)/", "", input)
Python: re.sub(r"--(?=[^']*'([^']|'[^']*')*$)", "", input)
Ruby: input.gsub(/--(?=[^\']*'([^']|'[^']*')*$)/, "")
I have tested this for Ruby and it provides the desired result.
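For anyone who wants to try the lookahead version in Python, here is a small self-contained check against the example string from the question (the function and constant names are made up; the pattern is written in a double-quoted raw string so the single quotes need no escaping):

```python
import re

# "--" followed by an odd number of single quotes up to the end of the string.
INSIDE_QUOTES = re.compile(r"--(?=[^']*'([^']|'[^']*')*$)")

def strip_quoted_dashes(text: str) -> str:
    return INSIDE_QUOTES.sub("", text)

sample = "xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
print(strip_quoted_dashes(sample))
```

As in the quoted explanation, this assumes the quotes are balanced; with an unbalanced quote the "inside" and "outside" regions swap.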
This cannot be done with regular expressions, because you need to maintain state on whether you're inside single quotes or outside, and regex is inherently stateless. (Also, as far as I understand, single quotes can be escaped without terminating the "inside" region).
Your best bet is to iterate through the string character by character, keeping a boolean flag on whether or not you're inside a quoted region - and remove the --'s that way.
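A minimal sketch of that character-by-character approach in Python (the function name is hypothetical, and it assumes quotes are never escaped, which the question does not confirm):

```python
def strip_dashes_in_quotes(text: str) -> str:
    out = []
    inside = False  # True while we are between a pair of single quotes
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "'":
            inside = not inside  # every quote toggles the state
            out.append(ch)
            i += 1
        elif inside and text.startswith("--", i):
            i += 2  # drop the pair of dashes, keep scanning
        else:
            out.append(ch)
            i += 1
    return "".join(out)

print(strip_dashes_in_quotes("'dfdf--fggh' -- '--ggh--'"))
```

Supporting escaped quotes would mean adding one more branch that copies the escape sequence through without toggling the flag.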
If bending the rules a little is allowed, this could work:
import re
p = re.compile(r"((?:^[^']*')?[^']*?(?:'[^']*'[^']*?)*?)(-{2,})")
txt = "xxxx rt / $ 'dfdf--fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '--ggh--' vcbcvb"
print(re.sub(p, r'\1-', txt))
Output:
xxxx rt / $ 'dfdf-fggh-dfgdfg' ghgh- dddd -- 'dfdf' ghh-g '-ggh-' vcbcvb
The regex:
(                      # Group 1
  (?:^[^']*')?         # Start of string, up till the first single quote
  [^']*?               # Inside the single quotes, as few characters as possible
  (?:
    '[^']*'            # No double dashes inside these single quotes, jump to the next.
    [^']*?
  )*?                  # as few as possible
)
(-{2,})                # The dashes themselves (Group 2)
If there were different delimiters for start and end, you could use something like this:
-{2,}(?=[^'`]*`)
Edit: I realized that if the string does not contain any quotes, it will match all double dashes in the string. One way of fixing it would be to change
(?:^[^']*')?
in the beginning to
(?:^[^']*'|(?!^))
Updated regex:
((?:^[^']*'|(?!^))[^']*?(?:'[^']*'[^']*?)*?)(-{2,})
Hm. There might be a way in Python if there are no quoted apostrophes, given that there is the (?(id/name)yes-pattern|no-pattern) construct in regular expressions, but it goes way over my head currently.
Does this help?
def remove_double_dashes_in_apostrophes(text):
    return "'".join(
        part.replace("--", "") if (ix&1) else part
        for ix, part in enumerate(text.split("'")))
Seems to work for me. What it does, is split the input text to parts on apostrophes, and replace the "--" only when the part is odd-numbered (i.e. there has been an odd number of apostrophes before the part). Note about "odd numbered": part numbering starts from zero!
You can use the following sed script, I believe:
:again
s/'\(.*\)--\(.*\)'/'\1\2'/g
t again
Store that in a file (rmdashdash.sed) and do whatever exec magic in your scripting language allows you to do the following shell equivalent:
sed -f rmdashdash.sed < file containing your input data
What the script does is:
:again <-- just a label
s/'\(.*\)--\(.*\)'/'\1\2'/g
substitute, for the pattern ' followed by anything followed by -- followed by anything followed by ', just the two anythings within quotes.
t again <-- feed the resulting string back into sed again.
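Note that this script will convert '----' into '', since it is a sequence of two --'s within quotes. However, '---' will be converted into '-'. Also note that .* is greedy and can span quote characters, so a -- sitting between two quoted strings on the same line is removed as well; the script is only safe when each line holds a single quoted string.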
Ain't no school like old school.