Search and replace with lookarounds regex - regex

Say I have multiple strings that contain ';'. I want to get rid of the ';' only if the string also contains a comma. If there are no commas, I want to leave the semicolon in place.
An example where I would like to get rid of the semicolons:
1,JoE,Stuff,more stuff; 05423089; 3029483-;98
Output:
1,JoE,Stuff,more stuff 05423089 3029483-98
I will be dealing with hundreds of thousands of rows of data.
Basically, when a file is delimited by commas, I want to omit all semicolons. I know you can do this with regex lookarounds, but I'm not sure how to search and replace with them.
In the case that there is not a comma or it is not a comma delimited file, I would like to preserve all semi-colons. I may have certain files that are delimited by a ';', or semi-colon.
Example:
1;JoE;Stuff;more stuff;054230893029483-98
In the above example, I do not want to remove the ';'.
Here is what I have so far:
s/((?=;)^.*?,.*?$)/$1/gi;
(?=;) checks for a semicolon and is enclosed by (^.*?,.*?$), which is the string I want to match. I don't believe I am referring to '$1' and '$2' correctly.
What is the proper way to search and replace with lookarounds..anyone?

So, you basically want to remove all semicolons from a line if it contains a comma.
This is easily accomplished with the following Perl one-liner:
perl -i -pe 's/;//g if /,/' file.txt
Explanation:
Switches:
-i[extension]: Edit <> files in place (makes backup if extension supplied)
-p: Creates a while(<>){...; print} loop for each “line” in your input file.
-e: Tells perl to execute the code on command line.
Code:
s/;//g: Remove all semicolons from a line
if /,/: Only if the line contains a comma
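For readers who want to see the same logic outside Perl, here is a minimal Python sketch of the "remove semicolons only if the line has a comma" rule. The function name `strip_semicolons_if_comma` is made up for illustration; it is not part of the answer above.

```python
def strip_semicolons_if_comma(line: str) -> str:
    """Remove every ';' from the line, but only when the line contains a ','."""
    if "," in line:
        return line.replace(";", "")
    # No comma: the line may be semicolon-delimited, so leave it untouched.
    return line


rows = [
    "1,JoE,Stuff,more stuff; 05423089; 3029483-;98",
    "1;JoE;Stuff;more stuff;054230893029483-98",
]
for row in rows:
    print(strip_semicolons_if_comma(row))
```

As in the one-liner, no lookaround is needed: the condition and the substitution are two separate, simple steps.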

Related

Using sed to delete everything from a colon on, up until a space character

The first line of a text file contains hundreds of strings of the following form:
143362:2019111515391775
that are separated by spaces. I.e.,
143362:2019111515391775 143760:2019111515391785 143020:2019111515391748
I would like to remove the portion of each string starting with the colon (i.e., delete from the colon, up until the space).
Is there an elegant way to do this with sed?
You could do it like this:
sed 's/:[^[:blank:]]*//g'
This removes each colon and any number of non-blanks following it. Output for your input:
143362 143760 143020
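A rough Python equivalent of that substitution, assuming "blank" means a space or a tab as in the POSIX `[:blank:]` class. The helper name `drop_colon_suffixes` is hypothetical.

```python
import re


def drop_colon_suffixes(line: str) -> str:
    """Delete each ':' together with the run of non-blank characters after it."""
    # [^ \t]* mirrors sed's [^[:blank:]]* (anything that is not a space or tab).
    return re.sub(r":[^ \t]*", "", line)


line = "143362:2019111515391775 143760:2019111515391785 143020:2019111515391748"
print(drop_colon_suffixes(line))
```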

Regex for string matching ****${****}***

I am trying to write a regex that matches and excludes all strings in a file that contain ${ followed by } with any characters between or around it. In between could be any characters/numbers/underscores/dashes/etc (there won't be another brace inside).
Example matches:
hello ${VAR}
${HELLO_VAR} world
https://${WEB_VAR}
I came up with this: egrep -v '^\${[a-zA-Z?]', though it seems to be working only partially and I am not too sure if it's right. How can I do this?
The input file has strings separated by a newline, very similar to simple java properties.
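You can try using the sed command. Note the -E switch: without it, sed treats \{ as the start of an interval expression rather than a literal brace.
sed -E 's/\$\{[^}]*\}//g' input_file > output_file
Here sed removes each ${...} sequence, including the braces and everything between them, and the new content is written to a new output file.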
You can give this one a shot:
\$\{[^}]*\}
Match ${ literally, followed by everything except }, followed by }
You say you're trying to exclude all strings in a file, so it sounds like you need something a bit more advanced than just a regex with grep. I'd do this with an awk script:
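awk '{while(match($0,/\$\{[^}]*\}/)){$0=substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH)}} 1' input.txt
Or, split for easier reading and commenting:
{
    # Keep removing matches until none remain on this line
    while (match($0, /\$\{[^}]*\}/)) {
        # Keep the text before the match (positions 1..RSTART-1) and the text after it
        $0 = substr($0, 1, RSTART-1) substr($0, RSTART+RLENGTH)
    }
}
# A bare true pattern prints the (possibly modified) line
1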
The idea here is that for each line, we'll check to see whether the regex matches anything on the line. If it does, we'll replace the line with the parts around the matched regex. (We could alternatively use sub(/RE/,""), but that would require applying the regex twice per match rather than once.)
The final 1 is shorthand that says "print the current line". It runs whether or not the loop processed any matches.
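The "cut out every match and keep what surrounds it" idea can also be sketched in Python, where `re.sub` performs the loop internally. The names `PLACEHOLDER` and `remove_placeholders` are invented for this example.

```python
import re

# "${", then any characters except "}", then "}"
PLACEHOLDER = re.compile(r"\$\{[^}]*\}")


def remove_placeholders(line: str) -> str:
    """Return the line with every ${...} sequence cut out."""
    return PLACEHOLDER.sub("", line)


for sample in ["hello ${VAR}", "${HELLO_VAR} world", "https://${WEB_VAR}"]:
    print(repr(remove_placeholders(sample)))
```

This strips the placeholders but keeps the rest of each line; if whole lines should be dropped instead, a filter on `PLACEHOLDER.search(line)` would be the closer fit.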
Just use the global wildcard .* around the two sequences, as in:
.*\$\{.*\}.*
As you want to match entire lines, you have to use the wildcard on both sides to extend the regexp to both ends (it doesn't matter whether you anchor it with ^ and $, as the greedy algorithm will try to extend the match as much as possible). Note that the $, { and } must be escaped, as they are reserved by the regexp language.
Note: the title of this question doesn't specify that the substring between the two curly braces must not contain a }. Since you only want to match the whole line, there is no need to exclude } between the braces; the only requirement is that a } appears after the ${ in the line. Anyway, this has no drawback in efficiency, as the NFA that parses this regexp has the same number of states as the other.
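If the goal is to drop whole lines, in the spirit of `egrep -v`, a small Python sketch could look like the following. It assumes the input is already split into lines, and the function name `lines_without_placeholders` is hypothetical.

```python
import re

# A line qualifies as soon as "${" is followed, anywhere later, by "}".
HAS_PLACEHOLDER = re.compile(r"\$\{.*\}")


def lines_without_placeholders(lines):
    """Keep only the lines that do not contain a ${...} sequence."""
    return [line for line in lines if not HAS_PLACEHOLDER.search(line)]


lines = ["hello ${VAR}", "plain=value", "https://${WEB_VAR}", "open ${ only"]
print(lines_without_placeholders(lines))
```

Because `search` scans the whole line, the leading and trailing `.*` are not needed here.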

Perl: regular expression: capturing group

In a code file, I want to remove any (one or more) consecutive white lines (lines that may include only zero or more spaces/tabs and then a newline) that go between a code text and the concluding } of a block. This concluding } may have spaces for indentation before it, so I want to keep them.
Here is what I try to do:
perl -i -0777 -pe 's/\s+\n([ ]*)\}/\n($1)\}/g' file
For example, if my code file looks like (□ is the space character):
□□□□while (true) {\n
□□□□□□□□print("Yay!");□□□□□□\n
□□□□□□□□□□□□□□□□\n
□□□□}\n
Then I want it to become:
□□□□while (true) {\n
□□□□□□□□print("Yay!");\n
□□□□}\n
However it does not do the change I expected. Any idea what I am doing wrong here?
The only issues I can see with your regex are that you don't need the parentheses around the matching variable, and that the character class used when extracting the match is redundant (unless you want to match tabs as well as spaces).
So, you could try
s/\s+\n( *)\}/\n$1\}/g
instead.
This works as expected when run on your test input.
To tidy it up even more, you could try the following.
s/\s+(\n *\})/$1/g
If there might be tabs as well as spaces, you can use a character class. (You do not need to include '|' inside the character class).
s/\s+(\n[ \t]*\})/$1/g
perl -pi -0777 -e's/^\s*\n(?=\s*})//mg' yourfile
(Remove whitespace from the beginning of a line through a newline that precedes a line with } as the first non-whitespace.)
Try using this regex instead, which uses a positive look-ahead assertion. This way you only match the part that you want to remove, and then replace it with nothing:
s/\s+(?=\n[ ]*\})//g
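The look-ahead variant carries over to Python's `re` module unchanged. This is only a sketch run on the sample from the question; the function name `strip_blank_before_brace` is made up.

```python
import re


def strip_blank_before_brace(code: str) -> str:
    """Remove trailing whitespace and blank lines that precede a closing brace."""
    # \s+ consumes the whitespace run; the look-ahead requires that what follows
    # is a newline, optional indentation, and "}", none of which is consumed.
    return re.sub(r"\s+(?=\n[ ]*\})", "", code)


source = (
    "    while (true) {\n"
    '        print("Yay!");      \n'
    "                \n"
    "    }\n"
)
print(strip_blank_before_brace(source), end="")
```

Since the newline and the indentation before `}` sit inside the look-ahead, they are left in place and no capture group is needed.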
You can try the following one liner
perl -0777 -pe 's/\s*\n*(\s*\n)/$1/g' test

sed: remove all non-alphanumeric characters inside quotations only

Say I have a string like this:
Output:
I have some-non-alphanumeric % characters remain here, I "also, have_+ some & .here"
I want to only remove non-alphanumeric characters inside the quotations except commas, periods, or spaces:
Desired Output:
I have some-non-alphanumeric % characters remain here, I "also, have some .here"
I have tried the following sed command matching a string and deleting inside the quotes, but it deletes everything that is inside the quotes including the quotes:
sed '/characters/ s/\("[^"]*\)\([^a-zA-Z0-9\,\. ]\)\([^"]*"\)//g'
Any help is appreciated, preferably using sed, to get the desired output. Thanks in advance!
You need to repeat your substitution multiple times to remove all non-alphanumeric characters. Doing such a loop in sed requires a label and use of the b and t commands:
sed '
# If the line contains /characters/, jump to label repremove
/characters/ b repremove
# else, jump to end of script
b
# labels are introduced with colons
:repremove
# This s command says: find a quote mark and some stuff we do not want
# to remove, then some stuff we do want to remove, then the rest until
# a quote mark again. Replace it with the two things we did not want to
# remove
s/\("[a-zA-Z0-9,. ]*\)[^"a-zA-Z0-9,. ][^"a-zA-Z0-9,. ]*\([^"]*"\)/\1\2/
# The t command repeats the loop until we have gotten everything
t repremove
'
(This will work even without the [^"a-zA-Z0-9,. ]*, but it'll be slower on lines that contain many non-alphanumeric characters in a row)
Though the other answer is right in that doing this in perl is much easier.
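As a point of comparison, a language with callback substitutions can avoid the explicit loop: find each quoted segment, then clean only that segment. A hedged Python sketch follows; the names `QUOTED`, `DISALLOWED` and `clean_quoted` are invented, and the quotes are assumed to be balanced.

```python
import re

QUOTED = re.compile(r'"[^"]*"')               # one double-quoted segment
DISALLOWED = re.compile(r'[^a-zA-Z0-9,. "]')  # anything not to be kept inside it


def clean_quoted(line: str) -> str:
    """Strip disallowed characters, but only inside double-quoted segments."""
    return QUOTED.sub(lambda m: DISALLOWED.sub("", m.group(0)), line)


text = 'I have some-non-alphanumeric % characters remain here, I "also, have_+ some & .here"'
print(clean_quoted(text))
```

Note that deleting the & leaves the two spaces around it in place, so the result contains a double space between "some" and ".here".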
Sed is not the right tool for this. Here is one way to do it in Perl.
perl -pe 's/[^a-zA-Z0-9,.\s"](?!(?:"[^"]*"|[^"])*$)//g' file
Example:
$ echo 'I have some-non-alphanumeric % characters remain here, I "also, have_+ some & .here"' | perl -pe 's/[^a-zA-Z0-9,.\s"](?!(?:"[^"]*"|[^"])*$)//g'
I have some-non-alphanumeric % characters remain here, I "also, have some .here"

Extract info from the large file, use newline character in awk regex

I need to extract some information from the very large file.
I want to extract specific lines using regular expressions.
What is the fastest way to do this?
I'm coding in c++ on linux.
I want to use grep, but seems my regex is not working as expected.
For example \s, \w are not working properly.
The grep man page says that \w and [:alnum:] are synonyms, so \w should work properly, but it doesn't.
I need to use newline characters in my regex, so I couldn't use grep; therefore, I decided to use awk.
How should I use newline character in awk regex?
Let's consider we have a file (test.txt) with the content below:
HELLO worl_d5 ; some statement HELLO world1 ; some
statement hi hi some statement ...
And I want to get only these lines:
HELLO worl_d5 ; some statement HELLO world1 ; some
statement
I.e., I want to find lines that start with the word HELLO followed by space character(s), then an alphanumeric word (possibly containing /) followed by space character(s), and then a single ;. But I want these lines only when they are followed by a some statement line.
I wrote:
awk '/HELLO[[:space:]]([[:alnum:]]|\/)+[[:space:]];\n[[:space:]]*some[[:space:]] statement [[:space:]];/ { print }' test.txt
But I couldn't get needed results.
Or just provide an example where newline is used in regex.
I solved this by using pcregrep and newline just worked fine!
pcregrep -M '(HELLO[[:space:]]([[:alnum:]]|\/|_)+[[:space:]];)[\r\n]([[:space:]]*some[[:space:]]statement[[:space:]];)' test.txt
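For an example of a newline inside a regex without pcregrep, here is a small Python sketch. It reads the whole text as one string so the pattern can span two lines. The sample text is an assumed layout of test.txt (one statement per line), not the actual file, and unlike the pcregrep pattern it does not require a ; after "statement".

```python
import re

# Assumed layout of the input, for illustration only.
sample = (
    "HELLO worl_d5 ;\n"
    "some statement\n"
    "HELLO world1 ;\n"
    "some statement\n"
    "hi hi\n"
    "some statement\n"
)

# A HELLO line ending in ';', a newline, then a "some statement" line.
PAIR = re.compile(
    r"^HELLO[ \t]+[\w/]+[ \t]+;[ \t]*\r?\n[ \t]*some[ \t]+statement[ \t]*$",
    re.MULTILINE,
)

matches = PAIR.findall(sample)
for m in matches:
    print(m)
```

With `re.MULTILINE`, `^` and `$` anchor at line boundaries, while the explicit `\r?\n` in the middle is what lets a single match cover two lines.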