Copy Text that matches a regular expression - regex

I have a regular expression that has several matches in a text file.
I want to copy only the matches to a second file. I don't want to copy the lines that contain the matches: I only want the matched text.
I can't find a way to do this in Notepad++ (it only copies complete lines, not just the matches), nor in the Visual Studio search.
Is there a way to copy only the matches? Maybe with grep or sed?

You can do it with both. Let's say I have the following file:
Sample file:
[jaypal:~/Temp] cat myfile
this is some random number 424-555
and my cell is 111-222-3333
and 42555 is my zip code
And I want to capture only numbers from myfile
Using sed:
With sed you can combine the -n option and the p flag with a grouped pattern.
sed -n 's/.[^0-9]*\([0-9-]\+\).*/\1/p'
-n suppresses the default output.
\( ... \) are escaped parentheses that start and end a group.
\1 prints the first matched group. If you have more grouped patterns, they can be referenced with \2 ...
p prints only the matched output. Since we are suppressing everything with -n, we use p to invoke printing.
Test:
[jaypal:~/Temp] sed -n 's/.[^0-9]*\([0-9-]\+\).*/\1/p' myfile
424-555
111-222-3333
42555
You can simply re-direct this to another file.
Using grep:
You can use either -
egrep -o "regex" filename
or
grep -E -o "regex" filename
From the man page:
-E, --extended-regexp
Interpret PATTERN as an extended regular expression (see below).
-o, --only-matching
Show only the part of a matching line that matches PATTERN.
Test:
[jaypal:~/Temp] egrep -o "[0-9-]+" myfile
424-555
111-222-3333
42555
You can simply re-direct this to another file.
Note: Obviously these are simple examples, but they convey the point.
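To tie this back to the original question, here is a minimal sketch that writes only the matched text to a second file. The file names myfile and matches.txt and the temporary directory are assumptions made for illustration. Unlike the sed command above, which prints at most one match per line, grep -o prints every match on its own line.

```shell
# Work in a throwaway directory so nothing existing is overwritten.
workdir=$(mktemp -d)

# Recreate the sample file from the answer.
printf '%s\n' \
  'this is some random number 424-555' \
  'and my cell is 111-222-3333' \
  'and 42555 is my zip code' > "$workdir/myfile"

# -o keeps only the matched text; > sends it to the second file.
grep -E -o '[0-9-]+' "$workdir/myfile" > "$workdir/matches.txt"

cat "$workdir/matches.txt"
```

The second file then holds one match per line, ready to be pasted or processed further.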

This might work for you:
sed -n 's/^.*\(matched text regexp\).*/\1/w matched_text_file' source_file

You can do it with Notepad++ by bookmarking the wanted lines then using menu => Search => Copy styled text => Find style (marked). The following lines show the method in more detail:
Example text:
aposidfupwoebfadbsf-mytext1-ausdfioabq
qoejbgaoudfb -mytext2-asdoufbnqub
foqiuebgf-mytext3- ñqloienbq
alkbnepaofub -mytext4- jafpoebqaf
Want to extract all -mytext?- words
Create a Regex to find text you want to copy: -mytext\d+-
Use the Find function (Ctrl+F) of Notepad++ and open the "Mark" tab.
Enter the regex, activate the "Bookmark line" option, and then click the "Mark All" button on the "Mark" tab.
Finally, open the Search menu, select the "Copy Styled Text" option, and then choose the correct color.
Then paste it where you need it.

Related

Use "sed" to Remove Capture Group 1 From All Lines In a File

I currently have a file with lines like the below:
ABCD123RTY,steve_tyler#gmail.com,10.20.30.142,2021-08-20T14:49:51.035Z
ABCD123QWE,thisguy#hotmail.com,10.20.30.245,2021-08-20T14:10:22.254Z
ABCD123DFG,calvin_hobbes2#netnet,10.20.30.l6,2021-08-20T15:30:34.480Z
My goal is to remove everything from the "#" to the next comma, such that it instead looks like the below:
ABCD123RTY,steve_tyler,10.20.30.142,2021-08-20T14:49:51.035Z
ABCD123QWE,thisguy,10.20.30.245,2021-08-20T14:10:22.254Z
ABCD123DFG,calvin_hobbes2,10.20.30.l6,2021-08-20T15:30:34.480Z
I'm not that experienced with utilizing sed and RegEx expressions. In playing around on a testing website, I came up with the below RegEx string, in which capture group 1 is perfectly matching to what I want to remove:
regex101.com Test
How would I go about putting this in a "sed" command against a given input file and writing the results to a new output file? I tried the below most recently:
sed 's/(#.+?),//' input.csv > input_Corrected.csv
Just as another note, I'm doing this in a bash script in which I have an API call generating the "input.csv" file, and then want to run this sed command to clean up the data format to match my needs.
You can use
sed 's/#[^,]*,/,/' input.csv > input_Corrected.csv
sed 's/#[^,]*//' input.csv > input_Corrected.csv
The first command uses the POSIX BRE pattern #[^,]*, which matches a #, then zero or more characters other than a comma, and then a comma, and replaces the match with a comma; use it if there MUST be a comma after the match. The second command drops the trailing comma from the pattern and keeps the replacement empty.
See the online demo:
s='ABCD123RTY,steve_tyler#gmail.com,10.20.30.142,2021-08-20T14:49:51.035Z
ABCD123QWE,thisguy#hotmail.com,10.20.30.245,2021-08-20T14:10:22.254Z
ABCD123DFG,calvin_hobbes2#netnet,10.20.30.l6,2021-08-20T15:30:34.480Z'
sed 's/#[^,]*,/,/' <<< "$s"
Output:
ABCD123RTY,steve_tyler,10.20.30.142,2021-08-20T14:49:51.035Z
ABCD123QWE,thisguy,10.20.30.245,2021-08-20T14:10:22.254Z
ABCD123DFG,calvin_hobbes2,10.20.30.l6,2021-08-20T15:30:34.480Z
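For context on why the attempt in the question changes nothing: without -E, sed uses basic regular expressions, in which (, +, ? and ) are ordinary characters, and sed has no lazy quantifiers in either mode. The sketch below, which uses one line of the sample data, compares that attempt with the bracket-expression form; the variable names are only illustrative.

```shell
line='ABCD123QWE,thisguy#hotmail.com,10.20.30.245,2021-08-20T14:10:22.254Z'

# In a basic regular expression, ( + ? ) are literal characters,
# so this pattern finds nothing here and the line passes through unchanged.
unchanged=$(printf '%s\n' "$line" | sed 's/(#.+?),//')

# A negated bracket expression stops at the next comma,
# which gives the "shortest match" effect without a lazy quantifier.
fixed=$(printf '%s\n' "$line" | sed 's/#[^,]*//')

printf '%s\n%s\n' "$unchanged" "$fixed"
```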
You can use the regular expression below to remove the domain only from valid email addresses. It needs extended regular expressions (-E), and the hyphen goes last in the bracket expression so that it is taken literally:
sed -E "s/#[a-zA-Z0-9_.-]+\.[a-zA-Z]{2,5}//g" input.csv > input_Corrected.csv
For your requirement, however, use the command below. It strips the domain from every address in the file, including entries such as "calvin_hobbes2#netnet", which is not a valid email address.
sed "s/#[^,]*//g" input.csv > input_Corrected.csv

How to use sed to search and replace a pattern that appears multiple times in the same line?

Because the question can be misleading, here is a little example. I have this kind of file:
some text
some text ##some-text-KEY-some-other-text##
text again ##some-text-KEY-some-other-text## ##some-text-KEY-some-other-text##
again ##some-text-KEY-some-other-text-KEY-text##
some text with KEY ##KEY-some-text##
blabla ##KEY##
In this example, I want to replace each occurrence of KEY- inside a pair of ## by VALUE-. I started with this sed command:
sed -i 's/\(##[^#]*\)KEY-\([^#]*##\)/\1VALUE-\2/g'
Here is how it works:
\(##[^#]*\): create a first group composed of two # and any characters except # ...
KEY-: ... until the last occurrence of KEY- on that line
\([^#]*##\): and create a second group with all the characters except # until the next pair of #.
The problem is my command can't handle correctly the following line because there are multiple KEY- inside my pair of ##:
again ##some-text-KEY-some-other-text-KEY-text##
Indeed, I get this result:
again ##some-text-KEY-some-other-text-VALUE-text##
If I want to replace all the occurrences of KEY- in that line, I have to run my command multiple times and I prefer to avoid that. I also tried with lazy operators but the problem is the same.
How can I create a regex and a sed command that can correctly handle my whole file?
The problem is rather complex: you need to replace all occurrences of some multicharacter text inside blocks of text between identical multicharacter delimiters.
The easiest and safest way to solve the task is using Perl:
perl -i -pe 's/(##)(.*?)(##)/$end_delim=$3; "$1" . $2=~s|KEY-|VALUE-|gr . "$end_delim"/ge' file
See the online demo.
The (##)(.*?)(##) pattern will match strings between two adjacent ## substrings capturing the start delimiter into Group 1, end delimiter in Group 3, and all text in between into Group 2. Since the regex substitution re-sets all placeholders, the temporary variable is used to keep the value of the end delimiter ($end_delim=$3), then, "$1" . $2=~s|KEY-|VALUE-|gr . "$end_delim" replaces the match with the value in the Group 1 of the first match (the first ##), then the Group 2 value with all KEY- replaced with VALUE-, and then the end delimiter.
If there are no KEY-s in between matches on the same line you may use a branch with sed by enclosing your command with :A and tA:
sed -i ':A; s/\(##[^#]*\)KEY-\([^#]*##\)/\1VALUE-\2/g; tA' file
Note you missed the first placeholder in \VALUE-\2, it should be \1VALUE-\2.
See the online demo:
s="some KEY- text
some text ##some-text-KEY-some-other-text##
text again ##some-text-KEY-some-other-text## ##some-text-KEY-some-other-text##
again ##some-text-KEY-some-other-text-KEY-text##
some text with KEY ##KEY-some-text##
blabla ##KEY##"
sed ':A; s/\(##[^#]*\)KEY-\([^#]*##\)/\1VALUE-\2/g; tA' <<< "$s"
Output:
some KEY- text
some text ##some-text-VALUE-some-other-text##
text again ##some-text-VALUE-some-other-text## ##some-text-VALUE-some-other-text##
again ##some-text-VALUE-some-other-text-VALUE-text##
some text with KEY ##VALUE-some-text##
blabla ##KEY##
More details:
sed allows the usage of loops and branches. The :A in the code above is a label, a special location marker that can be "jumped to" using the appropriate command. t is used to create a conditional branch: this "command jumps to the label only if the previous substitute command was successful". So, once the pattern matched and the replacement occurred, sed goes back to the label and re-tries the substitution on the modified line. If no replacement was made, sed does not jump and carries on with the rest of the script. So, tA means: go back to the location marked with A if there was a successful search-and-replace operation.
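A small sketch of the loop on the problematic line from the question may make this concrete. It runs the substitution once and then inside the :A ... tA loop; the variable names are illustrative, and the script is split into -e parts only because some sed implementations expect a label to end at a newline.

```shell
line='again ##some-text-KEY-some-other-text-KEY-text##'

# Single pass: the greedy [^#]* leaves only the last KEY- to be replaced.
once=$(printf '%s\n' "$line" |
  sed 's/\(##[^#]*\)KEY-\([^#]*##\)/\1VALUE-\2/g')

# Loop: t jumps back to label A after every successful substitution,
# so the command repeats until no KEY- is left inside the ## pair.
looped=$(printf '%s\n' "$line" |
  sed -e ':A' -e 's/\(##[^#]*\)KEY-\([^#]*##\)/\1VALUE-\2/g' -e 'tA')

printf '%s\n%s\n' "$once" "$looped"
```

The first line printed still contains one KEY-, while the second has both occurrences replaced.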
This might work for you (GNU sed):
sed -E 's/##/\n/g;:a;s/^([^\n]*(\n[^\n]*\n[^\n]*)*\n[^\n]*)KEY-/\1VALUE-/;ta;s/\n/##/g' file
Convert each ## to a newline. Using a loop, replace each KEY- that sits between a matched pair of newlines with VALUE-. When all are done, convert the newlines back to ##.

How to replace ABC with ABCD but not ABCD with ABCDD

I have some large text files where I need to do some text replacements with replace-all.
However, the replacement text contains the original text, which creates an infinite loop.
I need a method or a program that does a replace-all and checks whether the replacement was already done: something that checks whether the match found for ABC is already part of ABCD and skips that replacement.
The ABC from my example is just a basic case and can be any sequence of characters, so using an option like "match whole string only" does not work.
You can use a language or tool that supports regular expressions, and use look-ahead.
As a trivial example in a bash shell:
echo "ABCEE" | grep -P -o "ABC(?=[^D])"
outputs "ABC" (i.e., it finds a match)
whereas:
echo "ABCDE" | grep -P -o "ABC(?=[^D])"
does not -- it does not match.
The (?=...) means "followed by" and "[^D]" means "by anything other than D".
You can get more sophisticated using capture groups which would capture "ABC" -- look into regular expressions and you should find plenty of options.

sed: replace a match with the same match and add more text

I have a file called config.properties that contains the following text:
cat config.properties
// a lot of data
com.enterprise.project.AERO_CARRIERS = LA|LP|XL|4M|LU|4C
//more data
and my goal is to keep the same data but add more. For this example, I want to append |JJ|PZ to the value assigned to this variable, which results in:
cat config.properties
// a lot of data
com.enterprise.project.AERO_CARRIERS = LA|LP|XL|4M|LU|4C|JJ|PZ
//more data
The command that I've been using for this is :
sed 's/\(com\.enterprise\.project\.AERO_CARRIERS\s*\=\s*.+\)/\1\|JJ\|PZ/g' config.properties
But this doesn't work. What am I doing wrong?
\s and + are not POSIX compliant:
you can match spaces and tabs with [[:blank:]] and whitespace characters (including line breaks) with [[:space:]].
.+ can be replaced with .\{1,\} or ..*
And you don't need a backreference here; use & instead, which stands for the whole text matched by your pattern:
sed 's/^com\.enterprise\.project\.AERO_CARRIERS[[:blank:]]*=[[:blank:]]*.\{1,\}/&|JJ|PZ/'
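As a quick check of the command above, the sketch below feeds it the three sample lines from the question through a pipe instead of reading config.properties; the variable name updated is illustrative.

```shell
# & in the replacement stands for the whole match, so the matched line
# is kept as it is and |JJ|PZ is appended to it.
updated=$(printf '%s\n' \
  '// a lot of data' \
  'com.enterprise.project.AERO_CARRIERS = LA|LP|XL|4M|LU|4C' \
  '//more data' |
  sed 's/^com\.enterprise\.project\.AERO_CARRIERS[[:blank:]]*=[[:blank:]]*.\{1,\}/&|JJ|PZ/')

printf '%s\n' "$updated"
```

Only the AERO_CARRIERS line changes; the comment lines pass through untouched.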
As an alternative to stream editors like sed, you can use ed, the native text editor from the early UNIX days, for in-place search and replacement. The option used (-s) is POSIX compliant, so there are no portability issues:
printf '%s\n' ",g/com.enterprise.project.AERO_CARRIERS/ s/$/\|JJ\|PZ/g" w q | ed -s -- inputFile
The part ,g/com.enterprise.project.AERO_CARRIERS/ searches for the lines containing the pattern, the part s/$/\|JJ\|PZ/g appends |JJ|PZ to the end of each such line, and w q writes the file in place and quits.
You can match first:
sed '/com\.enterprise\.project\.AERO_CARRIERS\s*\=\s*.\+/ s/$/|JJ|PZ/g' config.properties

Grep/Sed between two tags with multiline

I have many files from which I need to get information.
Example of my files:
first file content:
"test This info i need grep</singleline>"
and
second file content (with two lines):
"test This info=
i need grep too</singleline>"
As a result, I need to get this text: from the first file, "This info i need grep", and from the second file, "This info= i need grep too".
For the first file I use:
grep -o 'test .*</singleline>' * | sed -e 's/test \(.*\)<\/singleline>/\1/'
and successfully get "This info i need grep", but I cannot get the information from the second file using the same command.
Please help me rewrite the command or suggest another one.
Or, if you insist on using grep, you can use:
grep -Pzo 'test(\n|.)*(?=</singleline>)' test.txt
To understand the meaning of each flag, use grep --help:
-P, --perl-regexp
PATTERN is a Perl regular expression
-o, --only-matching
show only the part of a line matching PATTERN
-z, --null-data
a data line ends in 0 byte, not newline
I'd use pcregrep, which can match multiline regexes:
pcregrep -Mo 'test \K((?s).)*?(?=</singleline>)' filename
The tricks are:
-M allows pcregrep to match on more than one line,
-o makes it print only the match,
\K throws away the part of the match that comes before it,
(?=</singleline>) is a lookahead term that matches an empty string if (and only if) it is followed by </singleline>, and
((?s).)*? to match any characters non-greedily, which is to say that if you have several occurrences of </singleline> in the file, it will match until the closest rather than the furthest. If this is not desired, remove the ?. (?s) enables the s option locally for the term to make . match newlines in it; it wouldn't do that by default.
Thanks to @CasimiretHippolyte for pointing out the ((?s).) alternative to (.|\n).
It looks like you're parsing quoted-printable encoded text, where a "soft" line break (one that is an artifact from fixed-line-width formatting) is indicated with a line-terminating = (directly before the \n).
Since in a later comment you also expressed the desire to print each match as a single line, I suggest the following 2-pass approach:
use awk to remove the soft line breaks
then use grep on the result
awk '/=$/ { printf "%s", substr($0, 1, length($0)-1); next } 1' file |
grep -Po 'test .*?(?=</singleline>)'
Tip of the hat to Wintermute's helpful answer for the non-greedy quantifier, *?, and both Wintermute's and Maroun Maroun's helpful answer for the positive look-ahead assertion, (?=...).
Note that the awk command removes the line-ending = (along with the newline); replace the substr call with just $0 to retain it.
Since the strings of interest are first converted back to their original single-line representations:
- The matches are printed in their original form.
- You can use regular (GNU) grep with line-by-line matching; contrast this with:
  - needing to read the entire file at once, as in Maroun Maroun's helpful answer. Note that, as of this writing, * must be replaced with *? in his answer for it to work correctly in files with multiple matches.
  - needing to install another utility, pcregrep, as in Wintermute's helpful answer.
  - additionally, the matches would have to be cleaned up to be single-line (something you didn't originally state as a requirement).
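The two-pass idea can be sketched as follows. The sample input is hypothetical: it splits one value across a soft line break in the middle of a word, as quoted-printable encoding can. The first pass here uses sub() instead of substr() to drop the trailing =, and the second pass reuses the grep and sed pair from the question, so no PCRE support is needed.

```shell
# Hypothetical input: a trailing = marks a soft line break.
# First pass: strip the = and print the line without a newline,
# so it is joined with the line that follows.
joined=$(printf '%s\n' 'test This info i need gr=' 'ep too</singleline>' |
  awk '/=$/ { sub(/=$/, ""); printf "%s", $0; next } 1')

# Second pass: extract the text between "test " and the closing tag.
info=$(printf '%s\n' "$joined" |
  grep -o 'test .*</singleline>' |
  sed -e 's/test \(.*\)<\/singleline>/\1/')

printf '%s\n' "$info"
```

Once the soft line breaks are gone, the single-line command from the question works unchanged on both kinds of file.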