Match certain text, but omit it in the output - regex

In Notepad++'s find and replace regex feature, is there any way to match certain text but not include it in the replacement? For instance, ([ab][cd] )* matches strings such as ac ad bc bc ad; I want to replace the match with $0 minus the [ab] parts, which for the string above would give c d c c d. Only answers for Notepad++'s regex dialect will be directly useful, but if anyone knows a solution in some other dialect, I'd be curious to see it, and it might apply to this dialect anyway.
EDIT:
The pattern is easy to match; the part I don't know how to do is get the replacement to do what I want. For the example expression I gave, the pattern (?:[ab]([cd]))* actually works with $1 in the replace box. However, it doesn't work for my actual use case, because the [ab][cd] text is only a sub-expression of the result (I didn't think that would make a difference, or I would have said so in the original question; my apologies). A better example would be where I want strings like f(ac ad bc bc ad): replaced with f(ac ad bc bc ad): f'(c d c c d) (so, really I want a regular addition). I tried using the regex ([a-z])\((?:[ab]([cd] ?))*\): with the replacement $0$1'($2), but then $2 holds only whatever the group matched last (i.e., f(ac ad bc bc ad): f'(d)).

Notepad++'s find and replace doesn't provide a feature that solves this specific problem. As I understand it, you need to match a substring and replace parts of it without affecting similar patterns elsewhere in the text, and I assume the solution should be general enough to extend.
if anyone knows a solution in some other dialect...
awk to the rescue
You have to use a programming language or a more powerful text-processing tool. If you have an awk implementation within your environment you are able to achieve what you desire in a second:
awk '{
sepRe = "[ab]"
regex = "(" sepRe "[cd] )+"
while(match($0, regex)) {
str = substr($0, RSTART, RLENGTH);
current = str
gsub(sepRe, "", current)
sub(str, current, $0)
}
print;
}' file
$ cat file:
ac ad bc bc ad
ac same ab ad af
Running awk outputs:
c d c c ad
c same ab d af
Note that the last ad on the first line is unchanged because there is no space after it, and the regex requires a trailing space.
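If leaving Notepad++ is an option, the f(...) case from the question can also be handled with a replacement callback: match the whole run once, then strip the unwanted characters from the matched text only. The following is a Python sketch, not something Notepad++ offers; the names CALL_RE, strip_prefixes and append_derivative are made up for illustration, and the pattern is assumed from the f(ac ad bc bc ad): example.

```python
import re

# Pattern assumed from the question: a one-letter name, then a
# parenthesised run of [ab][cd] pairs separated by single spaces, then ":".
CALL_RE = re.compile(r"([a-z])\(((?:[ab][cd] ?)*)\):")

def strip_prefixes(pairs: str) -> str:
    """Drop the leading [ab] from every [ab][cd] pair."""
    return re.sub(r"[ab]([cd])", r"\1", pairs)

def append_derivative(text: str) -> str:
    """Keep each match and append name'(stripped pairs) after it."""
    def repl(m):
        return m.group(0) + " " + m.group(1) + "'(" + strip_prefixes(m.group(2)) + ")"
    return CALL_RE.sub(repl, text)

print(append_derivative("f(ac ad bc bc ad):"))
```

Because the inner substitution runs on the captured text rather than on a repeated capture group, it sidesteps the problem of $2 keeping only the last repetition.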

Related

Bash Script for Concatenating Broken Dashed Words

I've scraped a large amount (10GB) of PDFs and converted them to text files, but due to the format of the original PDFs, there is an issue:
Many of the words which break across lines have a dash in them that artificially breaks up the word, like this:
You can see that this happened because the original PDF files have breaks:
What would be the cleanest and fastest way to "join" every word instance that matches this pattern inside of a .txt file?
Perhaps some sort of Regex search, like for a [a-z]\-\s \w of some kind (word character followed by dash followed by space) would work?
Or would some sort of sed replacement work better?
Currently, I'm trying to get a sed regex to work, but I'm not sure how to translate this to use capture groups to replace the selected text:
sed -n '\%\w\- [a-z]%p' Filename.txt
My input text would look like this:
The dog rolled down the st- eep hill and pl- ayed outside.
And the output would be:
The dog rolled down the steep hill and played outside.
Ideally, the expression would also work for words split up by a newline, like this:
The rule which provided for the consid-
eration of the resolution, was agreed to earlier by a
To this:
The rule which provided for the consideration
of the resolution, was agreed to earlier by a
It's straightforward in sed:
sed -e ':a' -e '/-$/{N;s/-\n//;ba
}' -e 's/- //g' filename
This translates roughly as: "if the line ends with a dash, read in the next line as well (so that you have a line with a newline in the middle), then excise the dash and newline, and loop back to the beginning just in case this new line also ends with a dash. Then remove any instances of "- "."
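The same two steps can be sketched in Python for anyone who would rather process each file in memory. This mirrors the sed script above, including the fact that a joined line absorbs the line that followed it; join_broken_words is a name invented for this example.

```python
def join_broken_words(text: str) -> str:
    """Remove dash+newline first, then any remaining dash+space."""
    text = text.replace("-\n", "")   # word split across a line break
    return text.replace("- ", "")    # word split within a line

sample = (
    "The dog rolled down the st- eep hill and pl- ayed outside.\n"
    "The rule which provided for the consid-\n"
    "eration of the resolution, was agreed to earlier by a\n"
)
print(join_broken_words(sample), end="")
```

Like the sed version, this removes every "- " it finds, so legitimate dashes followed by a space are affected too.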
You may use this gnu-awk code:
cat file
The dog rolled down the st- eep hill and pl- ayed outside.
The rule which provided for the consid-
eration of the resolution, was agreed to earlier by a
Then use awk like this:
awk 'p != "" {
w = $1
$1 = ""
sub(/^[[:blank:]]+/, ORS)
$0 = p w $0
p = ""
}
{
$0 = gensub(/([_[:alnum:]])-[[:blank:]]+([_[:alnum:]])/, "\\1\\2", "g")
}
/-$/ {
p = $0
sub(/-$/, "", p)
}
p == ""' file
The dog rolled down the steep hill and played outside.
The rule which provided for the consideration
of the resolution, was agreed to earlier by a
If you can consider perl, then this may also work for you:
perl -0777 -pe 's/(\w)-\h+(\w)/$1$2/g; s/(\w)-\R(\w+)\s+/$1$2\n/g' file
You simply add backslash-parentheses (or use the -r or -E option if available to do away with the requirement to put backslashes before capturing parentheses) and recall the matched text with \1 for the first capturing parenthesis, \2 for the second, etc.
sed 's/\(\w\)\- \([a-z]\)/\1\2/g' Filename.txt
The \w escape is not standard sed but if it works for you, feel free to use it. Otherwise, it is easy to replace with [A-Za-z0-9_#] or whatever else you want to call "word characters".
I'm guessing not all of the matches will be hyphenated words so perhaps run the result through a spelling checker or something to verify whether the result is an English word. (I would probably switch to a more capable scripting language like Python for that, though.)
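A rough Python sketch of that verification idea: join the two fragments only when the result appears in a word list. The function name join_if_known and the known_words set are assumptions for illustration; a real run would load a proper dictionary instead of the two-word stand-in used here.

```python
import re

def join_if_known(text: str, known_words: set) -> str:
    """Join 'st- eep' into 'steep' only when the result is a known word."""
    def repl(m):
        joined = m.group(1) + m.group(2)
        # Leave the text untouched when the joined form is not recognised.
        return joined if joined.lower() in known_words else m.group(0)
    # Two letter runs separated by a dash and whitespace (space or newline).
    return re.sub(r"([A-Za-z]+)-\s+([A-Za-z]+)", repl, text)

words = {"steep", "played"}   # stand-in for a real dictionary
print(join_if_known("the st- eep hill, pl- ayed, pre- and post-war", words))
```

Note that when a word split across a newline is joined, the newline is dropped along with the dash.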

Substituting multiple occurrences of a character inside a grep match

I am trying to use TextWrangler to take a bunch of text files, match everything within some angle-bracket tags (so far so good), and for every match, substitute all occurrences of a specific character with another.
For instance, I'd like to take something like
xx+xx <f>bar+bar+fo+bar+fe</f> yy+y <f>fee+bar</f> zz
match everything within <f> and </f> and then substitute all +'s with, say, *'s (but ONLY inside the "f" tag).
xx+xx <f>bar*bar*fo*bar*fe</f> yy+y <f>fee*bar</f> zz
I think I can easily match "f" tags containing +'s with an expression like
<f>[^<]*\+[^<]*</f>
but I have no idea on how to substitute only a subclass of character for each match. I don't know a priori how many +'s there are in each tag.
I think I should run a regular expression for all matches of the first regular expression, but I am not really sure how to do that.
(In other words, I would like to match all +'s but only inside specific angle-bracket tags).
Does anyone have a hint?
Thanks a lot,
Daniele
In case you're OK with an awk solution:
$ awk '{
while ( match($0,/<f>[^<]*\+[^<]*<\/f>/) ) {
tgt = substr($0,RSTART,RLENGTH)
gsub(/\+/,"*",tgt)
$0 = substr($0,1,RSTART-1) tgt substr($0,RSTART+RLENGTH)
}
print
}' file
xx+xx <f>bar*bar*fo*bar*fe</f> yy+y <f>fee*bar</f> zz
The above will work using any awk in any shell on any UNIX box. It relies on there being no < within each <f>...</f> as indicated by your sample code. If there can be then include that in your example and we can tweak the script to handle it:
$ awk '{
gsub("</f>",RS)
while ( match($0,/<f>[^\n]*\+[^\n]*\n/) ) {
tgt = substr($0,RSTART,RLENGTH)
gsub(/\+/,"*",tgt)
$0 = substr($0,1,RSTART-1) tgt substr($0,RSTART+RLENGTH)
}
gsub(RS,"</f>")
print
}' file
xx+xx <f>bar*bar*fo*bar*fe</f> yy+y <f>fee*bar</f> zz
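For comparison, the same "substitute only inside each match" idea can be sketched in Python with a replacement callback. This is not something TextWrangler's grep dialog provides; star_inside_f is a hypothetical name, and the sketch assumes each <f>...</f> pair sits on one line and is not nested.

```python
import re

def star_inside_f(text: str) -> str:
    """Replace every '+' with '*' inside <f>...</f>, leaving the rest alone."""
    # Non-greedy match so each tag pair is handled separately.
    return re.sub(r"<f>.*?</f>", lambda m: m.group(0).replace("+", "*"), text)

line = "xx+xx <f>bar+bar+fo+bar+fe</f> yy+y <f>fee+bar</f> zz"
print(star_inside_f(line))
```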

Gawk regexp to select sequence

Sorry for the nth simple question on regexp, but I'm not able to get what I need without what seems to me a too complicated solution. I'm parsing a file containing sequences of only 3 letters, A, E and D, as in
AADDEEDDA
EEEEEEEE
AEEEDEEA
AEEEDDAAA
and I'd like to identify only those that start with E and end with D, with only one change in the sequence, as for example in
EDDDDDDDD
EEEDDDDDD
EEEEEEEED
I'm fighting with the proper regexp to do that. Here is my last attempt
echo "1,AAEDDEED,1\n2,EEEEDDDD,2\n3,EDEDEDED" | gawk -F, '{if($2 ~ /^E[(ED){1,1}]*D$/ && $2 !~ /^E[(ED){2,}]*D$/) print $0}'
which does not work. Any help?
Thanks in advance.
If I understand your request correctly, a simple
awk '/^E+D+$/' file.input
will do the trick.
UPDATE: if the line format contains leading and trailing numbers (the trailing one optional), as shown later in the example, this is a possible pure-regex adaptation (an alternative to using the field separator switch -F,):
awk '/^[0-9]+,E+D+(,[0-9]+)?$/' input.test
First of all, you need the regular expression:
^E+[^ED]*D+$
This matches one or more Es at the beginning, zero or more characters that are neither E nor D in the middle, and one or more Ds at the end.
Then your AWK program will look like
$2 ~ /^E+[^ED]*D+$/
$2 refers to the 2nd field of the current record, ~ is the regex matching operator, and /s delimit a regular expression. Together, these components form what is known in AWK jargon as a "pattern", which amounts to a boolean filter for input records. Note that there is no "action" (a series of statements in {s) specified here. That's because when no action is specified, AWK assumes that the action should be { print $0 }, which prints the entire line.
If I understand you correctly, you want to match patterns that start with at least one E and then continue with at least one D until the end.
printf '1,AAEDDEED,1\n2,EEEEDDDD,2\n3,EDEDEDED\n' | gawk -F, '{if($2 ~ /^E+D+$/) print $0}'
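As a cross-check of the E+D+ idea outside awk, here is a small Python sketch that applies the same test to the second comma-separated field. RUN_RE and select_rows are names chosen for this example only.

```python
import re

RUN_RE = re.compile(r"E+D+")  # one or more E, then one or more D, nothing else

def select_rows(lines):
    """Return the lines whose second comma-separated field is E...ED...D."""
    kept = []
    for line in lines:
        fields = line.split(",")
        # fullmatch plays the role of the ^ and $ anchors in the awk regex.
        if len(fields) > 1 and RUN_RE.fullmatch(fields[1]):
            kept.append(line)
    return kept

rows = ["1,AAEDDEED,1", "2,EEEEDDDD,2", "3,EDEDEDED"]
print(select_rows(rows))
```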

How to swap text based on patterns at once with sed?

Suppose I have 'abbc' string and I want to replace:
ab -> bc
bc -> ab
If I try two replaces the result is not what I want:
echo 'abbc' | sed 's/ab/bc/g;s/bc/ab/g'
abab
So what sed command can I use to replace like below?
echo abbc | sed SED_COMMAND
bcab
EDIT:
Actually the text could have more than 2 patterns and I don't know how many replacements I will need. Since there was an answer saying that sed is a stream editor and that its replacements are greedy, I think I will need to use a scripting language for that.
Maybe something like this:
sed 's/ab/~~/g; s/bc/ab/g; s/~~/bc/g'
Replace ~ with a character that you know won't be in the string.
I always use multiple statements with "-e"
$ sed -e 's:AND:\n&:g' -e 's:GROUP BY:\n&:g' -e 's:UNION:\n&:g' -e 's:FROM:\n&:g' file > readable.sql
This will insert a newline before every AND, GROUP BY, UNION and FROM. Here '&' stands for the matched string, so '\n&' replaces the match with a newline followed by the match itself.
sed is a stream editor. It searches and replaces greedily. The only way to do what you asked for is using an intermediate substitution pattern and changing it back in the end.
echo 'abcd' | sed -e 's/ab/xy/;s/cd/ab/;s/xy/cd/'
Here is a variation on ooga's answer that works for multiple search and replace pairs without having to check how values might be reused:
sed -i '
s/\bAB\b/________BC________/g
s/\bBC\b/________CD________/g
s/________//g
' path_to_your_files/*.txt
Here is an example:
before:
some text AB some more text "BC" and more text.
after:
some text BC some more text "CD" and more text.
Note that \b denotes word boundaries, which is what prevents the ________ from interfering with the search (I'm using GNU sed 4.2.2 on Ubuntu). If you are not using a word boundary search, then this technique may not work.
Also note that this gives the same results as removing the s/________//g and appending && sed -i 's/________//g' path_to_your_files/*.txt to the end of the command, but doesn't require specifying the path twice.
A general variation on this would be to use \x0 or _\x0_ in place of ________ if you know that no nulls appear in your files, as jthill suggested.
Here is an excerpt from the SED manual:
-e script
--expression=script
Add the commands in script to the set of commands to be run while processing the input.
Prepend each substitution with -e option and collect them together. The example that works for me follows:
sed < ../.env-turret.dist \
-e "s/{{ name }}/turret$TURRETS_COUNT_INIT/g" \
-e "s/{{ account }}/$CFW_ACCOUNT_ID/g" > ./.env.dist
This example also shows how to use environment variables in your substitutions.
This might work for you (GNU sed):
sed -r '1{x;s/^/:abbc:bcab/;x};G;s/^/\n/;:a;/\n\n/{P;d};s/\n(ab|bc)(.*\n.*:(\1)([^:]*))/\4\n\2/;ta;s/\n(.)/\1\n/;ta' file
This uses a lookup table which is prepared and held in the hold space (HS) and then appended to each line. A unique marker (in this case \n) is prepended to the start of the line and used to bump the search along the length of the line. Once the marker reaches the end of the line, the process is finished and the line is printed, with the lookup table and markers discarded.
N.B. The lookup table is prepped at the very start and a second unique marker (in this case :) chosen so as not to clash with the substitution strings.
With some comments:
sed -r '
# initialize hold with :abbc:bcab
1 {
x
s/^/:abbc:bcab/
x
}
G # append hold to patt (after a \n)
s/^/\n/ # prepend a \n
:a
/\n\n/ {
P # print patt up to first \n
d # delete patt & start next cycle
}
s/\n(ab|bc)(.*\n.*:(\1)([^:]*))/\4\n\2/
ta # goto a if sub occurred
s/\n(.)/\1\n/ # move one char past the first \n
ta # goto a if sub occurred
'
The table works like this:
   **   **   replacement
:abbc:bcab
 **   **     pattern
Tcl has a builtin for this
$ tclsh
% string map {ab bc bc ab} abbc
bcab
This works by walking the string a character at a time doing string comparisons starting at the current position.
In perl:
perl -E '
sub string_map {
my ($str, %map) = @_;
my $i = 0;
while ($i < length $str) {
KEYS:
for my $key (keys %map) {
if (substr($str, $i, length $key) eq $key) {
substr($str, $i, length $key) = $map{$key};
$i += length($map{$key}) - 1;
last KEYS;
}
}
$i++;
}
return $str;
}
say string_map("abbc", "ab"=>"bc", "bc"=>"ab");
'
bcab
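The single-pass idea behind the Tcl and perl versions can also be sketched in Python with one alternation regex and a dictionary lookup. string_map here is a name borrowed from the Tcl command for illustration; unlike Tcl, which tries keys in list order, this sketch assumes that preferring the longest key at each position is acceptable.

```python
import re

def string_map(text: str, mapping: dict) -> str:
    """Replace every key of mapping with its value in one left-to-right pass."""
    if not mapping:
        return text
    # Longer keys first, so a key is not shadowed by one of its own prefixes.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    # Replaced text is never rescanned, so outputs cannot be re-substituted.
    return pattern.sub(lambda m: mapping[m.group(0)], text)

print(string_map("abbc", {"ab": "bc", "bc": "ab"}))
```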
Maybe a simpler approach for a single occurrence of each pattern; you can try the following:
echo 'abbc' | sed 's/ab/bc/;s/bc/ab/2'
My output:
~# echo 'abbc' | sed 's/ab/bc/;s/bc/ab/2'
bcab
For multiple occurrences of pattern:
sed 's/\(ab\)\(bc\)/\2\1/g'
Example
~# cat try.txt
abbc abbc abbc
bcab abbc bcab
abbc abbc bcab
~# sed 's/\(ab\)\(bc\)/\2\1/g' try.txt
bcab bcab bcab
bcab bcab bcab
bcab bcab bcab
Hope this helps !!
echo "C:\Users\San.Tan\My Folder\project1" | sed -e 's/C:\\/mnt\/c\//;s/\\/\//g'
replaces
C:\Users\San.Tan\My Folder\project1
to
mnt/c/Users/San.Tan/My Folder/project1
in case someone needs to replace windows paths to Windows Subsystem for Linux(WSL) paths
If the replacement string comes from a variable, the solution above doesn't work as written.
The sed command needs to be in double quotes instead of single quotes.
#sed -e "s/#replacevarServiceName#/$varServiceName/g" -e "s/#replacevarImageTag#/$varImageTag/g" deployment.yaml
Here is an awk solution based on ooga's sed answer:
echo 'abbc' | awk '{gsub(/ab/,"xy");gsub(/bc/,"ab");gsub(/xy/,"bc")}1'
bcab
I believe this should solve your problem. I may be missing a few edge cases, please comment if you notice one.
You need a way to exclude previous substitutions from future patterns, which really means making outputs distinguishable, as well as excluding these outputs from your searches, and finally making outputs indistinguishable again. This is very similar to the quoting/escaping process, so I'll draw from it.
s/\\/\\\\/g escapes all existing backslashes
s/ab/\\b\\c/g substitutes raw ab for escaped bc
s/bc/\\a\\b/g substitutes raw bc for escaped ab
s/\\\(.\)/\1/g substitutes all escaped X for raw X
I have not accounted for backslashes in ab or bc, but intuitively, I would escape the search and replace terms the same way - \ now matches \\, and substituted \\ will appear as \.
Until now I have been using backslashes as the escape character, but it's not necessarily the best choice. Almost any character should work, but be careful with the characters that need escaping in your environment, sed, etc. depending on how you intend to use the results.
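To make the four steps concrete, here is a Python sketch of the same escape-then-unescape sequence for the ab/bc pair. It is only an illustration of the steps listed above, with swap_with_escapes as a made-up name, and it shares the caveat that backslashes inside the search terms themselves are not handled.

```python
import re

def swap_with_escapes(text: str) -> str:
    """Swap ab <-> bc by marking every substituted character with a backslash."""
    text = text.replace("\\", "\\\\")      # escape existing backslashes
    text = text.replace("ab", "\\b\\c")    # raw ab -> escaped bc
    text = text.replace("bc", "\\a\\b")    # raw bc -> escaped ab
    return re.sub(r"\\(.)", r"\1", text)   # escaped X -> raw X

print(swap_with_escapes("abbc"))
```

Escaped characters are never adjacent to each other without a backslash in between, which is what keeps the second substitution from touching the output of the first.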
Every answer posted thus far seems to agree with the statement by kuriouscoder made in his above post:
The only way to do what you asked for is using an intermediate
substitution pattern and changing it back in the end
If you are going to do this, however, and your usage might involve more than some trivial string (maybe you are filtering data, etc.), the best character to use with sed is a newline. Because sed is line-based, a newline is the one character you are guaranteed never to receive when a new line is fetched (forget about GNU multi-line extensions for this discussion).
To start with, here is a very simple approach to solving your problem using newlines as an intermediate delimiter:
echo "abbc" | sed -E $'s/ab|bc/\\\n&/g; s/\\nab/bc/g; s/\\nbc/ab/g'
With simplicity comes some trade-offs... if you had more than a couple variables, like in your original post, you have to type them all twice. Performance might be able to be improved a little bit, too.
It gets pretty nasty to do much beyond this using sed. Even with some of the more advanced features like branching control and the hold buffer (which is really weak IMO), your options are pretty limited.
Just for fun, I came up with this one alternative, but I don't think I would have any particular reason to recommend it over the one from earlier in this post... You have to essentially make your own "convention" for delimiters if you really want to do anything fancy in sed. This is way-overkill for your original post, but it might spark some ideas for people who come across this post and have more complicated situations.
My convention below was: use multiple newlines to "protect" or "unprotect" the part of the line you're working on. One newline denotes a word boundary. Two newlines denote alternatives for a candidate replacement. I don't replace right away, but rather list the candidate replacement on the next line. Three newlines mean that a value is "locked in", like your original post was trying to do with ab and bc. After that point, further replacements of that value will be undone, because it is protected by the newlines. A little complicated, if I do say so myself! sed isn't really meant for much more than the basics.
# Newlines
NL=$'\\\n'
NOT_NL=$'[\x01-\x09\x0B-\x7F]'
# Delimiters
PRE="${NL}${NL}&${NL}"
POST="${NL}${NL}"
# Un-doer (if a request was made to modify a locked-in value)
tidy="s/(\\n\\n\\n${NOT_NL}*)\\n\\n(${NOT_NL}*)\\n(${NOT_NL}*)\\n\\n/\\1\\2/g; "
# Locker-inner (three newlines means "do not touch")
tidy+="s/(\\n\\n)${NOT_NL}*\\n(${NOT_NL}*\\n\\n)/\\1${NL}\\2/g;"
# Finalizer (remove newlines)
final="s/\\n//g"
# Input/Commands
input="abbc"
cmd1="s/(ab)/${PRE}bc${POST}/g"
cmd2="s/(bc)/${PRE}ab${POST}/g"
# Execute
echo ${input} | sed -E "${cmd1}; ${tidy}; ${cmd2}; ${tidy}; ${final}"

AWK Regex pattern matching

I have a text file, and I need to identify a certain pattern in one field. I am using AWK, and trying to use the match() function.
The requirement is I need to see if the following pattern exists in a string of digits
??????1?
??????3?
??????5?
??????7?
ie I am only interested in the last but one digit being a 1, 3, 5, or a 7.
I have a solution, which looks like this;
b = match($23, "[0-9][0-9][0-9][0-9][0-9][0-9]1[0-9]")
c = match($23, "[0-9][0-9][0-9][0-9][0-9][0-9]3[0-9]")
d = match($23, "[0-9][0-9][0-9][0-9][0-9][0-9]5[0-9]")
e = match($23, "[0-9][0-9][0-9][0-9][0-9][0-9]7[0-9]")
if (b || c || d || e)
{
print "Found a match" $23
}
I think though I should be able to write the regex more succinctly like this;
b = match($23, "[0-9]{6}1[0-9]")
but this does not work.
Am I missing something, or are my regex skills (which are not great), really all that bad?
Thanks in anticipation
The regex delimiter is /.../, not "...". When you use quotes in an RE context, you're telling awk that there's an RE stored inside a string literal. That string literal gets parsed twice, once when the script is read and again when it's executed, which makes your RE specification that much more complicated in order to accommodate the double parsing.
So, do not write:
b = match($23, "[0-9]{6}1[0-9]")
write:
b = match($23, /[0-9]{6}1[0-9]/)
instead.
That's not your problem though. The most likely problem you have is that you are calling a version of awk that does not support RE-intervals like {6}. If you are using an older version of GNU awk, then you can enable that functionality by adding the --re-interval flag:
awk --re-interval '...b = match($23, /[0-9]{6}1[0-9]/)...'
but whether it's that or you're using an awk that just doesn't support RE intervals, the best thing to do is get a newer version of gawk.
Finally, your whole script can be reduced to:
awk --re-interval '$23 ~ /[0-9]{6}[1357][0-9]/{print "Found a match", $23}'
Change [0-9] to [[:digit:]] for locale-independence if you like.
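For readers who want to sanity-check the combined pattern outside awk, here is a small Python sketch of the same test. ODD_PENULTIMATE and has_odd_penultimate are names invented for this example; as with awk's ~ operator, the search is unanchored, so the eight-digit pattern may appear anywhere in the field.

```python
import re

# Six digits, then 1/3/5/7, then one more digit (pattern taken from the answer).
ODD_PENULTIMATE = re.compile(r"[0-9]{6}[1357][0-9]")

def has_odd_penultimate(field: str) -> bool:
    """True if the field contains the eight-digit pattern anywhere."""
    return ODD_PENULTIMATE.search(field) is not None

for value in ["12345671", "12345621", "00000035"]:
    print(value, has_odd_penultimate(value))
```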
The reason why RE intervals weren't supported by default in gawk until recently is that old awk didn't support them, so a script that had an RE of a{2}b would, when executed in old awk, have been looking for literally those 5 chars, and gawk didn't want old scripts to quietly break when executed in gawk instead of old awk. A few releases back the gawk guys rightly decided to take the plunge and enable RE intervals by default, favoring our convenience over backward compatibility.
Here is one awk solution:
awk -v FS="" '$7~/(1|3|5|7)/' file
By setting FS to nothing, every character becomes a field. We can then test field #7.
As Tom posted.
awk -v FS="" '$7~/[1357]/' file