sed only replacing last occurrence of match - need to match all - regex

I would like to replace all { } on a certain line with [ ], but unfortunately I am only able to match the last occurrence of the regexp.
I have a config file which has structure as follows:
entry {
id 123456789
desc This is a description of {foo} and was added by {bar}
trigger 987654321
}
I have the following sed command, which is able to replace the last match 'bar' but not 'foo':
sed s'/\(desc.*\){\(.*\)}/\1\[\2\]/g' < filename
I anchor this search to the line containing 'desc' as I would hate for it to replace the delimiting braces of each 'entry' block.
For the life of me I am unable to figure out how to replace all of the occurrences.
Any help is appreciated - have been learning all day and unable to read any more tutorials for fear that my corneas might crack.
Thanks!

Try the following:
sed '/desc/ s/{\([^}]*\)}/[\1]/g' filename
The search and replace in the above command is only applied to lines that match the regex /desc/. I don't think this is actually necessary, though: sed processes text a line at a time, and the opening and closing braces of the 'entry' block sit on different lines, so the pattern could never match them anyway. This means that you could probably simplify this to the following:
sed 's/{\([^}]*\)}/[\1]/g' filename
Instead of .* inside the capturing group, [^}]* is used, which matches everything except a closing brace. That way you won't match from the first opening brace to the last closing brace.
Also, you can just provide the file name as the final argument to sed instead of using input redirection.
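To see why the negated class matters, here is a small Python sketch that runs both patterns against the desc line from the question. Python's re module is only standing in for sed here, and the variable names are made up for illustration.

```python
import re

line = "desc This is a description of {foo} and was added by {bar}"

# Greedy version, mirroring the original attempt: "desc.*" grabs as much of
# the line as it can, so only the last pair of braces is left to match.
greedy = re.sub(r"(desc.*)\{(.*)\}", r"\1[\2]", line)

# Negated class: [^}]* stops at the first closing brace, so each pair of
# braces is matched (and replaced) separately.
fixed = re.sub(r"\{([^}]*)\}", r"[\1]", line)

print(greedy)
print(fixed)
```

The first print still shows {foo} untouched, while the second has both pairs converted.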


Substitute any other character except for a specific pattern in Perl

I have text files with lines like this:
U_town/u_LN0_pk_LN3_bnb_LN155/DD0 U_DESIGN/u_LNxx_pk_LN99_bnb_LN151_LN11_/DD5
U_master/u_LN999_pk_LN767888_bnb_LN9772/Dnn111 u_LN999_pk_LN767888_bnb_LN9772_LN9999_LN11/DD
...
I am trying to remove every character except /, while keeping any word that matches the pattern _LN\d+_, using a Perl one-liner.
So the edited version would look like:
/_LN0__LN3__LN155/ /_LN99__LN151_LN11_/
/_LN999__LN767888_/ _LN999__LN767888__LN9772_LN9999_/
I tried the following, which returned empty lines:
perl -pe 's/(?! _LN\d+_)[^\/].+//g' file
The following returned only '/':
perl -pe 's/(?! _LN\d+_)\w+//g' file
Is it perhaps not possible with a one-liner, and should I consider writing code that parses character by character and checks whether a matching word _LN\d+_ or a character / is there?
To merely remove everything other than these patterns, you can simply match the patterns and join the matches back together
perl -wnE'say join "", m{/ | _LN[0-9]+_ }gx' file
or perhaps, depending on details of the requirements
perl -wnE'say join "", m{/ | _LN[0-9]+(?=_) }gx' file
(See explanation in the last bullet below.)
This prints, for the first of the two lines of the shown sample input
/_LN0__LN3_//_LN99__LN151_/
...
or, in the second version
/_LN0_LN3//_LN99_LN151_LN11/
...
The _LN155 is not there because it is not followed by _. See below.
Questions:
Why are there spaces after some / in the "edited version" shown in the question?
The pattern to keep is shown as _LN\d+_ but _LN155 is shown to be kept even though it is not followed by a _ in the input (but by a /) ...?
Are underscores optional by any chance? If so, append ? to them in the pattern
perl -wnE'say join "", m{/ | _?LN[0-9]+_? }gx' file
with output
/_LN0__LN3__LN155//_LN99__LN151_LN11_/
(It's been clarified that the extra space in the shown desired output is a mistake.)
If the underscores "overlap," like in _LN151_LN11_, they won't both be matched by the _LN\d+_ pattern, since the first match "takes" the shared underscore.
But if such overlapping instances need to be kept, then replace the trailing _ with a lookahead for it, which doesn't consume it, so it's still there for the leading _ of the next pattern
perl -wnE'say join "", m{/ | _LN[0-9]+(?=_) }gx' file
(if the underscores are optional and you use _?LN\d+_? pattern then this isn't needed)
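For anyone who wants to compare the three variants side by side, here is a rough Python sketch of the same match-and-join idea, run on the first sample line. The helper name keep is made up for this example, and re.findall stands in for Perl's m{...}g in list context.

```python
import re

line = "U_town/u_LN0_pk_LN3_bnb_LN155/DD0 U_DESIGN/u_LNxx_pk_LN99_bnb_LN151_LN11_/DD5"

def keep(pattern, text):
    """Join every match of `pattern` in `text`, dropping everything else."""
    # None of the patterns below has a capturing group, so findall
    # returns the full text of each match.
    return "".join(re.findall(pattern, text))

consuming = keep(r"/|_LN[0-9]+_", line)      # trailing _ is consumed
lookahead = keep(r"/|_LN[0-9]+(?=_)", line)  # trailing _ is only asserted
optional = keep(r"/|_?LN[0-9]+_?", line)     # both underscores optional

print(consuming)
print(lookahead)
print(optional)
```

Only the third variant keeps _LN155, because it is the only one that does not require an underscore after the digits.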

Regex for string matching ****${****}***

I am trying to write a regex that matches and excludes all strings in a file that contain ${ followed by } with any characters between or around it. In between could be any characters/numbers/underscores/dashes/etc (there won't be another brace inside).
Example matches:
hello ${VAR}
${HELLO_VAR} world
https://${WEB_VAR}
I came up with this: egrep -v '^\${[a-zA-Z?]', though it seems to be working only partially and I am not sure if it's right. How can I do this?
The input file has strings separated by a newline, very similar to simple java properties.
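You can try using the sed command.
sed -E 's/\$\{[^}]*\}//g' <input_file> > <output_file>
Here sed removes every ${...} expression, including the characters between '{' and '}', and writes the new content to a new output file. The -E option selects extended regular expressions, where \{ and \} are literal braces.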
You can give this one a shot:
\$\{[^}]*\}
Match ${ literally, followed by everything except }, followed by }
You say you're trying to exclude all strings in a file, so it sounds like you need something a bit more advanced than just a regex with grep. I'd do this with an awk script:
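awk '{while(match($0,/\$\{[^}]*\}/)){$0=substr($0,1,RSTART-1) substr($0,RSTART+RLENGTH)}} 1' input.txt
Or, split for easier reading and commenting:
{
  # Keep going while the line still contains a ${...} expression.
  while (match($0, /\$\{[^}]*\}/)) {
    # Rebuild the line from the text before and after the match.
    $0 = substr($0, 1, RSTART-1) substr($0, RSTART+RLENGTH)
  }
}
# A bare true condition prints the (possibly modified) line.
1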
The idea here is that for each line, we'll check to see whether the regex matches anything on the line. If it does, we'll replace the line with the parts around the matched regex. (Alternatively, we could use sub(/RE/,""), but that would require applying the regex twice per match rather than once.)
The final 1 is shorthand that says "print the current line". It runs whether or not the loop processed any matches.
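If awk is not available, the same removal can be sketched in Python with the regex from the earlier answer. The function name strip_placeholders and the last sample string are made up for illustration; the first three samples come from the question.

```python
import re

# ${ literally, then anything except }, then the closing }.
PLACEHOLDER = re.compile(r"\$\{[^}]*\}")

def strip_placeholders(line: str) -> str:
    """Return `line` with every ${...} expression removed."""
    return PLACEHOLDER.sub("", line)

samples = ["hello ${VAR}", "${HELLO_VAR} world", "https://${WEB_VAR}", "plain=value"]
cleaned = [strip_placeholders(s) for s in samples]

for before, after in zip(samples, cleaned):
    print(repr(before), "->", repr(after))
```

Lines without a ${...} expression pass through unchanged, just as they do with the trailing 1 in the awk script.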
Just use the global wildcard .* around the two sequences, as in:
.*\$\{.*\}.*
As you want to match entire lines, you have to use a wildcard on both sides to extend the regexp to both ends. (It doesn't matter whether you anchor it with ^ and $, as the greedy algorithm will try to extend the match as far as possible.) Note that $, { and } must be escaped, as they are reserved by the regexp language.
Note: the title of this question doesn't specify that the substring between the two curly braces must not contain a }. As you only want to match the whole line, it is not necessary to use [^}] between the braces; the only requirement is that a } comes after the ${ in the line. Anyway, this has no drawback in efficiency, as the NFA that parses this regexp has the same number of states as the other.
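If the goal is to drop whole lines (what grep -v does) rather than remove the placeholders themselves, a minimal Python sketch could look like this. The function name and the two sample lines without a placeholder are hypothetical.

```python
import re

# A line qualifies as soon as "${" is followed somewhere later by "}".
HAS_PLACEHOLDER = re.compile(r"\$\{.*\}")

def drop_placeholder_lines(lines):
    """Keep only the lines that contain no ${...} expression."""
    return [ln for ln in lines if not HAS_PLACEHOLDER.search(ln)]

lines = ["hello ${VAR}", "name=value", "https://${WEB_VAR}", "cost=$5 {approx}"]
kept = drop_placeholder_lines(lines)
print(kept)
```

Because search looks anywhere in the line, the leading and trailing .* are not needed in this form.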

Remove columns from CSV

I don't know anything about Notepad++ Regex.
This is the data I have in my CSV:
6454345|User1-2ds3|62562012032|324|148|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1
3305611|User2-42g563dgsdbf|22023001345|0|0|c36dedfa12634e33ca8bc0ef4703c92b73d9c433
8749412|User3-9|xgs|f|98906504456|1534|51564|411b0fdf54fe29745897288c6ad699f7be30f389
How can I use a Regex to remove the 5th and 6th column? The numbers in the 5th and 6th column are variable in length.
Another problem is that the User column can also contain a |, to make it even worse.
I could use a macro to fix this, but the file is a few million lines long.
This is the final result I want to achieve:
6454345|User1-2ds3|62562012032|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1
3305611|User2-42g563dgsdbf|22023001345|c36dedfa12634e33ca8bc0ef4703c92b73d9c433
8749412|User3-9|xgs|f|98906504456|411b0fdf54fe29745897288c6ad699f7be30f389
I am open for suggestions on how to do this with another program, command line utility, either Linux or Windows.
Match: \|[^|]+\|[^|]+(\|[^|]+$)
Replace: $1
Basically, anchor to the end of the line and remove columns [-2] and [-3], keeping the last one. (I assume columns can't be empty. Replace + with * if they can.)
If you need finer detail than that, I'd recommend writing a Java or Python script to manually parse and rewrite the file for you.
I've captured three groups and given them names. If you use a replace utility like sed or Vim's regex, you can replace the remove group with nothing. Or you can use a programming language to concatenate keep_before and keep_after for the desired result.
^(?<keep_before>(?:[^|]+\|){3})(?<remove>(?:[^|]+\|){2})(?<keep_after>.*)$
You may have to remove the group names and use \1 etc. instead, depending on what environment you use. Because this pattern counts columns from the left, it assumes the user field does not contain a |.
In Notepad++, press Ctrl+H, then enter the following in the dialog:
Find what: \|\d+\|\d+(\|[0-9a-z]+)$
Replace with: $1
Search mode: Regular Expression
Click replace and done.
Regex explanation:
\|\d+ : matches the 5th column, a | followed by digits
\|\d+ : matches the 6th column, a | followed by digits
(\|[0-9a-z]+) : matches and captures the final column, a | followed by the hash
$ : forces the match to end at the end of the line
Replacement:
$1 : replaces the matched text with the contents of the capture group (\|[0-9a-z]+), i.e. the final column
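Since the question is open to other tools, here is a minimal Python sketch that counts columns from the right, so extra | characters in the user field don't matter. The function name is made up, and it assumes every line has at least four fields (rsplit would otherwise return fewer pieces and the unpacking would raise ValueError).

```python
def drop_two_before_last(line: str, sep: str = "|") -> str:
    """Remove the second- and third-to-last fields of a delimited line."""
    # rsplit with maxsplit=3 peels off the last three fields and leaves
    # everything before them, stray separators included, in one piece.
    head, _fifth, _sixth, last = line.rsplit(sep, 3)
    return sep.join([head, last])

rows = [
    "6454345|User1-2ds3|62562012032|324|148|9c1fe63ccd3ab234892beaf71f022be2e06b6cd1",
    "3305611|User2-42g563dgsdbf|22023001345|0|0|c36dedfa12634e33ca8bc0ef4703c92b73d9c433",
    "8749412|User3-9|xgs|f|98906504456|1534|51564|411b0fdf54fe29745897288c6ad699f7be30f389",
]
result = [drop_two_before_last(r) for r in rows]
for r in result:
    print(r)
```

For a file with millions of lines, the same function could be applied while iterating over the open file line by line, so the whole file never has to sit in memory.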

Regex - match up to first literal

I have some lines of code I am trying to remove some leading text from which appears like so:
Line 1: myApp.name;
Line 2: myApp.version
Line 3: myApp.defaults, myApp.numbers;
I am trying and trying to find a regex that will remove anything up to (but excluding) myApp.
I have tried various regular expressions, but they all seem to fail when it comes to line 3 (because myApp appears twice).
The closest I have come so far is:
.*?myApp
Pretty simple - but that matches both instances of myApp occurrences in Line 3 - whereas I'd like it to match only the first.
There's a few hundred lines - otherwise I'd have deleted them all manually by now.
Can somebody help me? Thanks.
You need to add the anchor ^, which matches the start of a line:
^.*?(myApp)
Use the above regex and replace the matched characters with $1 or \1, so that the string myApp is kept in the final result after replacement.
Pattern explanation:
^ Start of a line.
.*?(myApp) Shortest possible match up to the first myApp. The string myApp is captured and stored in group 1.
All matched characters are replaced with the chars present inside the group 1.
Your regular expression works in Perl if you add the ^ to ensure that you only match the beginnings of lines:
cat /tmp/test.txt | perl -pe 's/^.*?myApp/myApp/g'
myApp.name;
myApp.version
myApp.defaults, myApp.numbers;
If you wanted to get fancy, you could put myApp inside a lookahead, using the (?=...) syntax. A lookahead only asserts that myApp follows without making it part of the match, so it doesn't have to be put back in by the replacement.
cat /tmp/test.txt | perl -pe 's/^.*?(?=myApp)//g'
myApp.name;
myApp.version
myApp.defaults, myApp.numbers;
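The same anchored, lazy, lookahead pattern can be tried out in Python as a quick check. This is only a sketch; the sample text is the three lines from the question joined with newlines.

```python
import re

text = (
    "Line 1: myApp.name;\n"
    "Line 2: myApp.version\n"
    "Line 3: myApp.defaults, myApp.numbers;"
)

# re.MULTILINE lets ^ match at the start of every line. The lazy .*? stops
# at the first myApp, and the lookahead keeps myApp out of the match.
result = re.sub(r"^.*?(?=myApp)", "", text, flags=re.MULTILINE)
print(result)
```

The second myApp on line 3 survives because, after the first substitution on that line, ^ can no longer match until the next line starts.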

Select last character of a substring in regexp

I'm trying to clean a huge geoJson datafile. I need to change the format of "text" field from
"text": "(2:Placename,Placename)"
to
"text": "Placename".
In Sublime text I managed to write a regular expression which enabled me to select and remove the first part leaving something like this:
"text": "Placename)"
With following regexp I can select the text above, but I need to narrow it down to the last character:
text\": \".*?\)
No matter what I can't figure out how to select the ")" character in the end of Placename string in the whole file and remove it. Note that the "Placename" here can be any place name, like New York, London etc.
I tried to build an expression where first part finds the text field, then ignores n-amount of characters until it finds the ")" character.
After experimenting and Googling I couldn't find a solution here.
You can capture the value of the second placemark field with the following regexp:
/"text": "+\(\d+:[^,]+,(.*?)\)/
Which will capture "Placename" in $1
More info on capturing parenthesis: http://www.regular-expressions.info/brackets.html
The trick is to use negated character classes and to escape any parentheses you want to match literally.
HTH
I do not know if you are using a Unix system, but probably sed can do much of the work for you. It can interpret regular expressions, capture groups, and substitute by other groups of characters. I have tried an example with sed and the following sed command worked for me:
echo "\"text\": \"(2:Placename,Placename)\"" | sed -r 's/(\"text\": )\"\([[:digit:]]:[^0-9]+,([^0-9]+)\)\"/\1\"\2\"/g'
-r makes sed use extended regular expressions (on GNU sed; -E is the more portable spelling). I am using parentheses to capture groups that I will use later in the substitution (e.g., a group for "text", and a group for the second placename). In the substitution part of sed, you can refer to a group with \n, where n is the number of the group you want to use. This expression should help you achieve your desired result.
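For a huge file, a short script may be easier to rerun than an editor search. Here is a minimal Python sketch of the same capture-and-substitute idea. The names TEXT_FIELD and clean_text_field are made up, and the pattern assumes the place name itself contains no ) or " characters.

```python
import re

# Group 1: the key and opening quote. Group 2: the second place name.
# Group 3: the closing quote. The "(N:first," prefix and the ")" are dropped.
TEXT_FIELD = re.compile(r'("text":\s*")\(\d+:[^,]+,([^)"]*)\)(")')

def clean_text_field(s: str) -> str:
    """Rewrite "text": "(N:Name,Name)" as "text": "Name"."""
    return TEXT_FIELD.sub(r"\1\2\3", s)

one = clean_text_field('"text": "(2:Placename,Placename)"')
two = clean_text_field('"text": "(12:New York,New York)"')
print(one)
print(two)
```

Using \d+ rather than a single digit class keeps the pattern working when the leading number has more than one digit.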