Remove 2nd occurrence after a given flag - regex

How can I parse every line in a .txt file and remove everything after the second occurrence of a / that follows a given flag (jdk)?
For example
/usr/lib/jvm/jdk-1.7.0/2.0/zi/etc/GMT
/usr/lib/jvm/jdk1.7.2/3.0/zi/etc/GMT
/usr/share/servertool-java-openjdk/4.0/jce.jar
becomes
/usr/lib/jvm/jdk-1.7.0/2.0/
/usr/lib/jvm/jdk1.7.2/3.0/
/usr/share/servertool-java-openjdk/4.0/
Note, I can't just split on jdk, because it may be jdk-1.*.*/ etc.
My end goal is to find all the unique paths on a highly restricted SELinux box that has the output of locate jdk stored in an output.txt file.
Update: my attempt so far, to get closer is
cat output.txt | awk -F '\\jdk' '{print $1"jdk"}' | sort -u
This just chops everything after jdk, and removes dupes.

sed is a very appropriate tool for this job. You'll use the s/// command to remove the part of the line you want to delete.
Note the slashes in the s/// command can be changed to other characters so that any slashes you have in the pattern or replacement parts don't need to be escaped.
Your pattern will be:
in capturing parentheses:
"jdk" followed by zero or more non-slashes
followed by a slash
followed by one or more non-slashes
followed by a slash
followed by any number of characters
The replacement will be the text that was captured.
You'll want to refer to the sed manual
3.3 The s Command
5 Regular Expressions: selecting text
5.2 Basic (BRE) and extended (ERE) regular expression
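The pattern described above might be written as follows. This is a minimal sketch, assuming POSIX basic regular expressions and | as the s delimiter so the slashes need no escaping. The function name trim_after_jdk is hypothetical, and "one or more non-slashes" is spelled [^/][^/]* to avoid relying on GNU extensions.

```shell
# trim_after_jdk: hypothetical helper name, not from the original answer.
# Keeps everything up to the second "/" after "jdk" and drops the rest.
trim_after_jdk() {
    sed 's|\(jdk[^/]*/[^/][^/]*/\).*|\1|' "$@"
}

# Example with the sample paths from the question; sort -u removes duplicates.
printf '%s\n' \
    /usr/lib/jvm/jdk-1.7.0/2.0/zi/etc/GMT \
    /usr/lib/jvm/jdk1.7.2/3.0/zi/etc/GMT \
    /usr/share/servertool-java-openjdk/4.0/jce.jar |
    trim_after_jdk | sort -u
```

With the real data this would be trim_after_jdk output.txt | sort -u. Text before "jdk" is not part of the match, so sed leaves it in place.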

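If you want to replace the contents of the same file, you can use the script below:
#!/bin/bash
# Trim each path after the second "/" that follows "jdk", then replace the file.
tmp=$(mktemp)
while IFS= read -r line
do
    x=${line#/*jdk*/*/}             # what comes after the second slash following "jdk"
    printf '%s\n' "${line%"$x"}"    # the line without that trailing part
done < output.txt > "$tmp" && mv "$tmp" output.txt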

Related

Break line using regex

I have multiple xml files that look like this: <TEST><TEST><TEST><TEST><TEST><TEST><TEST><TEST><TEST><TEST>
I would like to break into a new line for every '<' and get rid of every '>'.
I want to do this via regex since what I'm working on is for *nix.
There is no need for regex to do such a simple search & replace. You want to replace < with \n< and > with an empty string.
Assuming your content is in file input.txt, this simple sed command line can do the job:
sed 's/</\n</g;s/>//g' input.txt
How it works
There are two sed commands separated by ;:
s/</\n</g
s/>//g
Both commands are s (search and replace). The s command requires the search regex (no regex here), the replacement string and some optional flag, separated by /.
The first s searches for < and replaces it with \n<. \n is the usual notation for a newline character in regex and many Unix tools (even when no regex is involved).
The second s searches for > and replaces it with nothing.
Both s commands use the g (global) flag that tells them to do all the replacements they can do on each line. sed runs each command for every line of the input and by default, s stops after the first replacement (on a line).
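To see the two substitutions at work, here is a small sketch run on a shortened version of the sample line. It assumes GNU sed, which turns \n in the replacement into a newline (BSD sed may not); the wrapper name break_tags is hypothetical.

```shell
# break_tags: hypothetical wrapper around the command from the answer.
# Assumes GNU sed, which turns \n in the replacement into a newline.
break_tags() {
    sed 's/</\n</g;s/>//g' "$@"
}

# The first "<" also gets a newline in front of it, so the output
# starts with an empty line.
printf '<TEST><TEST><TEST>\n' | break_tags
```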

Sed Regex OSX find Roman numerals and replace with empty string. Error "unterminated substitute pattern"

This is probably a Sed and shell scripting syntax issue as well as Regex.
(Edit: maybe an I/O issue, as the regex worked when reading the file within the bash shell, but the actual .txt file was not altered as desired)
Trying to prepare a .txt file for some natural language processing work. Wanted to delete some Roman numerals in a plain text file containing Shakespeare's sonnets, each sonnet beginning with a Roman numeral such as IX. and XVIII. which represents the title of the individual sonnet, including the decimal character.
Example input text:
XXV.
Let those who are in favour with their stars
Of public honour and proud titles boast,
Desired output:
Let those who are in favour with their stars
Of public honour and proud titles boast,
Following the example in this question, I tried all the following commands in Terminal bash shell:
$ sed -i 's/[IVXLC]{1,}[.]//g' sonnets.txt
$ sed -i 's/[IVXLC]{1,}[.]/^$/g' sonnets.txt
$ sed -i 's/[IVXLC]{1,}[.]/()/g' sonnets.txt
$ sed -i 's/[IVXLC]{1,}[.]/[]/g' sonnets.txt
The idea was to replace any match with an empty string. Since that didn't work, I tried to replace match with a space character:
$ sed -i 's/[IVXLC]{1,}[.]/^ $/g' sonnets.txt
No luck. All commands above returned the same error:
sed: 1: "sonnets.txt": unterminated substitute pattern
I tested the regex in the "find" field on https://regexr.com/ and it seemed to be correct. The target file was right in the working directory. Any idea what went wrong? What characters should I be using in the "replace" field of the Sed command? Should I modify the regex and/or the Sed command?
The curly brackets need to be escaped.
$ sed 's/[IVXLC]\{1,\}[.]//g' sonnets.txt
Let those who are in favour with their stars
Of public honour and proud titles boast,
As @Jonathan Leffler mentioned in the comments, my Mac is using BSD sed and that's why the command didn't work.
So I installed GNU sed through Homebrew:
brew install gnu-sed
Then used the command:
gsed -i 's/[IVXLC]\{1,\}[.]//g' sonnets.txt
Typing in gsed invokes the GNU sed, and it worked as desired. It altered the content of the .txt file in place.
In this configuration, as @Hakan Baba mentioned, the regex did need to escape the curly braces:
\{ \}
The problem seems to be with the range (or limiting) quantifier {m,n} that is not supported in your BSD sed version. Note that you may rewrite the {1,} quantifier using [IVXLC][IVXLC]* (one Roman "digit" followed with 0+ Roman digits):
sed -i 's/[IVXLC][IVXLC]*[.]//g' sonnets.txt
          ^^^^^^^^^^^^^^^
Also, if you need to make sure you only match the Roman numerals at the start of the line, prepend ^ to the pattern (which means you may also omit the g modifier at the end of the command). To match them as whole words, add the [[:<:]] leading word boundary at the start of the pattern.
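The error message quoted in the question names "sonnets.txt" as the script that could not be parsed, which suggests that BSD sed took the argument after -i as the backup suffix. A minimal sketch, assuming a backup copy is acceptable: attaching the suffix directly to the option (-i.bak) is a form that both BSD and GNU sed accept. The helper name strip_numerals and the scratch directory are illustrative only.

```shell
# strip_numerals: hypothetical helper; edits the given file in place and
# leaves a copy of the original next to it with a .bak suffix.
strip_numerals() {
    sed -i.bak 's/^[IVXLC][IVXLC]*[.]//' "$1"
}

# Try it on a scratch copy of the sample text from the question.
workdir=$(mktemp -d)
printf 'XXV.\nLet those who are in favour with their stars\n' > "$workdir/sonnets.txt"
strip_numerals "$workdir/sonnets.txt"
cat "$workdir/sonnets.txt"
```

The numeral's line is left empty rather than removed. Using the command /^[IVXLC][IVXLC]*[.]$/d in place of the substitution would delete the whole line instead.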

Grep for lines not beginning with "//"

I'm trying but failing to write a regex to grep for lines that do not begin with "//" (i.e. C++-style comments). I'm aware of the "grep -v" option, but I am trying to learn how to pull this off with regex alone.
I've searched and found various answers on grepping for lines that don't begin with a character, and even one on how to grep for lines that don't begin with a string, but I'm unable to adapt those answers to my case, and I don't understand what my error is.
> cat bar.txt
hello
//world
> cat bar.txt | grep "(?!\/\/)"
-bash: !\/\/: event not found
I'm not sure what this "event not found" is about. One of the answers I found used paren-question mark-exclamation-string-paren, which I've done here, and which still fails.
> cat bar.txt | grep "^[^\/\/].+"
(no output)
Another answer I found used a caret within square brackets and explained that this syntax meant "search for the absence of what's in the square brackets (other than the caret)". I think the ".+" means "one or more of anything", but I'm not sure if that's correct and, if it is correct, what distinguishes it from ".*".
In a nutshell: how can I construct a regex to pass to grep to search for lines that do not begin with "//" ?
To be even more specific, I'm trying to search for lines that have "#include" that are not preceded by "//".
Thank you.
The first line tells you that the problem comes from bash (your shell). Bash finds the ! and attempts to insert into your command the last command you entered that begins with \/\/. To avoid this you need to escape the ! or use single quotes. For an example of !, try !cat; it will execute the last command beginning with cat that you entered.
You don't need to escape /, it has no special meaning in regular expressions. You also don't need to write a complicated regular expression to invert a match. Rather, just supply the -v argument to grep. Most of the time simple is better. And you also don't need to cat the file to grep. Just give grep the file name. eg.
grep -v "^//" bar.txt | grep "#include"
If you're really hung up on using regular expressions, then a simple one would look like this (match start of line ^, any amount of white space [[:space:]]*, exactly two forward slashes /{2}, any number of any characters .*, followed by #include); note that it selects the #include lines that are commented out:
grep -E "^[[:space:]]*/{2}.*#include" bar.txt
You're using a negative lookahead, which is a PCRE feature and requires the -P option.
Your negative lookahead won't work without a start anchor.
This will of course require GNU grep.
You must use single quotes to use ! in your regex; otherwise history expansion is attempted with the text after the !, which is the reason for the !\/\/: event not found error.
So you can use:
grep -P '^(?!\h*//)' file
hello
\h* matches zero or more horizontal whitespace characters.
Without -P or non-gnu grep you can use grep -v:
grep -v '^[[:blank:]]*//' file
hello
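For the more specific goal in the question (lines with #include that are not commented out with //), the grep -v form can be chained with a second grep. A minimal sketch using only POSIX grep features; the helper name active_includes is hypothetical.

```shell
# active_includes: hypothetical helper name.
# Step 1 drops lines whose first non-blank characters are "//".
# Step 2 keeps only the remaining lines that mention #include.
active_includes() {
    grep -v '^[[:blank:]]*//' "$@" | grep '#include'
}

printf '%s\n' \
    '#include <stdio.h>' \
    '//#include <old.h>' \
    '  // #include <gone.h>' \
    'int main(void) { return 0; }' |
    active_includes
```

Only the first sample line survives both filters.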
To find #include lines that are not preceded by // (or /* …), you can use:
grep '^[[:space:]]*#[[:space:]]*include[[:space:]]*["<]'
The regex looks for start of line, optional spaces, #, optional spaces, include, optional spaces and either " or <. It will find all #include lines except lines such as #include MACRO_NAME, which are legitimate but rare, and screwball cases such as:
#/*comment*/include/*comment*/<stdio.h>
#\
include\
<stdio.h>
If you have to deal with software containing such notations, (a) you have my sympathy and (b) fix the code to a more orthodox style before hunting the #include lines. It will pick up false positives such as:
/* Do not include this:
#include <does-not-exist.h>
*/
You could omit the final [[:space:]]*["<] with minimal chance of confusion, which will then pick up the macro name variant.
To find lines that do not start with a double slash, use -v (to invert the match) and '^//' to look for slashes at the start of a line:
grep -v '^//'
You have to use the -P (perl) option, and anchor the lookahead at the start of the line:
grep -P '^(?!//)' bar.txt
For the lines not beginning with "//" without a lookahead, you could use ^([^/]|/([^/]|$)|$) with grep -E.
If you don't like grep -v for this then you could just use awk:
awk '!/^\/\//' file
Since awk supports compound conditions instead of just regexps, it's often easier to specify what you want to match with awk than grep, e.g. to search for a and b in any order with grep:
grep -E 'a.*b|b.*a'
while with awk:
awk '/a/ && /b/'
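The same compound-condition idea can be applied to the original question. A sketch, assuming any POSIX awk; the helper name uncommented_includes is hypothetical, and [ \t] is used for blanks instead of a character class name.

```shell
# uncommented_includes: hypothetical helper name.
# A line is printed when it contains #include AND does not start with
# optional blanks followed by "//".
uncommented_includes() {
    awk '/#include/ && !/^[ \t]*\/\//' "$@"
}

printf '%s\n' \
    '#include <stdio.h>' \
    '// #include <old.h>' \
    'hello' |
    uncommented_includes
```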

Trying to remove version number from a string using sed in OSX

I have what I hope is a simple issue which is stumping me. I need to take an installer file with a name like:
installer_v0.29_linux.run
installer_v10.22_linux_x64.run
installer_v1.1_osx.app
installer_v5.6_windows.exe
and zip it up into a file with the format
installer_linux.zip
installer_linux_x64.zip
installer_osx.zip
installer_windows.zip
I already have a bash script running on OSX which does almost everything else I need in the build chain, and was certain I could achieve this with sed using something like:
ZIP_NAME=`echo "$OUTPUT_NAME" | sed -E 's/_(?:\d*\.)?\d+//g'`
That is, replacing the regex _(?:\d*\.)?\d+ with a blank - the regex should match any decimal number preceded by an underscore.
However, I get the error RE error: repetition-operator operand invalid when I try to run this. At this stage I am stumped - I have Googled around this and can't see what I am doing wrong. The regex I wrote works correctly at Regexr, but clearly some element of it is not supported by the sed implementation in OSX. Does anyone know what I am doing wrong?
You can try this sed:
sed 's/_v[^_]*//; s/\.[[:alnum:]]\+$/.zip/' file
installer_linux.zip
installer_linux_x64.zip
installer_osx.zip
installer_windows.zip
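The error in the question most likely comes from (?: and \d, which are Perl-style syntax and not part of POSIX extended regular expressions; a ? directly after ( has nothing to repeat. Below is a sketch of a variant that stays within ERE, so it should behave the same with BSD sed on OSX and with GNU sed (both accept -E). The helper name zip_name_for is hypothetical.

```shell
# zip_name_for: hypothetical helper; prints the zip name for one installer.
# The first expression removes "_v" plus a dotted version number,
# the second swaps the final extension for ".zip".
zip_name_for() {
    printf '%s\n' "$1" |
        sed -E 's/_v[0-9]+(\.[0-9]+)*//; s/\.[[:alnum:]]+$/.zip/'
}

for name in installer_v0.29_linux.run installer_v10.22_linux_x64.run \
            installer_v1.1_osx.app installer_v5.6_windows.exe
do
    zip_name_for "$name"
done
```

In the build script this could be used as ZIP_NAME=$(zip_name_for "$OUTPUT_NAME").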
You don't need sed, just some parameter expansion magic with an extended pattern.
shopt -s extglob
zip_name=${OUTPUT_NAME/_v+([^_])/}
The pattern _v+([^_]) matches a string starting with _v and all characters up to the next _. The extglob option enables the use of the +(...) pattern to match one or more occurrences of the enclosed pattern (in this case, a non-_ character). The parameter expansion ${var/pattern/} removes the first occurrence of the given pattern from the expansion of $var.
You can also try this:
sed 's/_[^_]\+//' FileName
Output:
installer_linux.run
installer_linux_x64.run
installer_osx.app
installer_windows.exe
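If you want to append .zip to the name, use the method below (note that it keeps the original extension and, because the first .* is greedy, only the last underscore-separated part):
sed 's/\([^_]\+\).*\(_.*\).*/\1\2.zip/' Filename
Output: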
installer_linux.run.zip
installer_x64.run.zip
installer_osx.app.zip
installer_windows.exe.zip

Regex (grep) for multi-line search needed [duplicate]

I'm running a grep to find any *.sql file that has the word select followed by the word customerName followed by the word from. This select statement can span many lines and can contain tabs and newlines.
I've tried a few variations on the following:
$ grep -liIr --include="*.sql" --exclude-dir="\.svn*" --regexp="select[a-zA-Z0-9+\n\r]*customerName[a-zA-Z0-9+\n\r]*from"
This, however, just runs forever. Can anyone help me with the correct syntax please?
Without the need to install the grep variant pcregrep, you can do a multiline search with grep.
$ grep -Pzo "(?s)^(\s*)\N*main.*?{.*?^\1}" *.c
Explanation:
-P activate perl-regexp for grep (a powerful extension of regular expressions)
-z Treat the input as a set of lines, each terminated by a zero byte (the ASCII NUL character) instead of a newline. That is, grep knows where the ends of the lines are, but sees the input as one big line. Beware this also adds a trailing NUL char if used with -o, see comments.
-o print only matching. Because we're using -z, the whole file is like a single big line, so if there is a match, the entire file would be printed; this way it won't do that.
In regexp:
(?s) activate PCRE_DOTALL, which means that . finds any character or newline
\N find anything except newline, even with PCRE_DOTALL activated
.*? find . in non-greedy mode, that is, stops as soon as possible.
^ find start of line
\1 backreference to the first group (\s*). This is an attempt to match the closing brace at the same indentation as the method's opening line.
As you can imagine, this search prints the main method in a C (*.c) source file.
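Applied to the query in the question, the same options might look like this. It is a sketch that assumes GNU grep built with PCRE support; sql_files_selecting_customer is a hypothetical name, and the pattern deliberately accepts anything between the three words.

```shell
# sql_files_selecting_customer: hypothetical helper name.
# -r recurse, -l list file names only, -i ignore case,
# -P Perl-compatible regex, -z treat each file as one NUL-terminated record.
sql_files_selecting_customer() {
    grep -rliPz --include='*.sql' --exclude-dir='.svn' \
        '(?s)select.*?customerName.*?from' "$1"
}

# Scratch directory with one matching and one non-matching file.
workdir=$(mktemp -d)
printf 'SELECT\n\tcustomerName,\n\tid\nFROM customers;\n' > "$workdir/match.sql"
printf 'SELECT id\nFROM orders;\n' > "$workdir/other.sql"
sql_files_selecting_customer "$workdir"
```

Because -l only reports file names, the trailing NUL that -o would add is not an issue here.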
I am not very good with grep, but your problem can be solved using awk.
For example:
awk '/select/,/from/' *.sql
This prints each range of lines from one matching select through the next one matching from. You then need to check whether the returned statements contain customerName; for this you can pipe the result into awk or grep again.
Your fundamental problem is that grep works one line at a time - so it cannot find a SELECT statement spread across lines.
Your second problem is that the regex you are using doesn't deal with the complexity of what can appear between SELECT and FROM - in particular, it omits commas, full stops (periods) and blanks, but also quotes and anything that can be inside a quoted string.
I would likely go with a Perl-based solution, having Perl read 'paragraphs' at a time and applying a regex to that. The downside is having to deal with the recursive search - there are modules to do that, of course, including the core module File::Find.
In outline, for a single file:
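$/ = "\n\n";    # Read a paragraph at a time
while (<>)
{
    if (m/SELECT.*customerName.*FROM/si)    # /s lets . match newlines
    {
        print "$ARGV\n";    # Name of the file currently being read
        close ARGV;         # Skip to the next file on the command line
    }
}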
That needs to be wrapped into a sub that is then invoked by the methods of File::Find.