Linux search and replace a pattern's case within a string - regex

Been struggling to figure out a way to do this. Basically I need to change the case of anything enclosed in {} from lower to upper within a string representing a uri (and also strip out the braces but I can use sed to do that)
E.g
/logs/{server_id}/path/{os_id}
To
/logs/SERVER_ID/path/OS_ID
The case of the rest of the string must be preserved in lower which is what has been beating me. Looked at combos of sed,awk,tr with regex so far. Any help appreciated.

sed "s/{\([^{}]*\)}/\U\1/g"
This works by matching all text enclosed within {} and replacing it with its uppercase version.
echo "/logs/{server_id}/path/{os_id}" | sed "s/{\([^{}]*\)}/\U\1/g"
Gives /logs/SERVER_ID/path/OS_ID as the result.

echo "/logs/{server_id}/path/{os_id}" \
| sed 's#{\([^{}][^{}]*\)}#\U\1#;s#{\([^{}][^{}]*\)}#\U\1#'
output
/logs/SERVER_ID/path/OS_ID
The part of the solution you seem to have missed is the 'capture groups' available in sed, i.e. \(regex\). This is then referenced by \1. You could have anywhere from 1-9 capture groups if you're a real masochist ;-)
Also note that I just repeat the same cmd 2 times: once the first {...} pair has been converted to the UC version (without surrounding {}s), only the remaining {...} targets will match.
There is probably a less verbose syntax available for [^{}][^{}]*, but this will work with just about any sed going back to the 80s. I seem to recall that some seds don't support the \U directive, but for the systems I have access to, this works.
Does that help?

$ awk '{
while(match($0,/{[^}]+}/))
$0=substr($0,1,RSTART-1) toupper(substr($0,RSTART+1,RLENGTH-2)) substr($0,RSTART+RLENGTH)
}1' file
/logs/SERVER_ID/path/OS_ID
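The awk loop above can be read step by step. Here is a commented sketch of the same idea wrapped in a shell function. The name upper_braces is hypothetical, and the braces are written as bracket expressions ([{] and [}]) on the assumption that some awk implementations are picky about a bare { in a regex.

```shell
# upper_braces is a hypothetical helper name, not part of the original answer.
upper_braces() {
  printf '%s\n' "$1" | awk '{
    # Keep going while a {...} group remains in the line.
    while (match($0, /[{][^{}]+[}]/)) {
      # Text between the braces: skip the opening brace, drop both braces from the length.
      inner = substr($0, RSTART + 1, RLENGTH - 2)
      # Rebuild the line: prefix, upper-cased inner text, suffix.
      $0 = substr($0, 1, RSTART - 1) toupper(inner) substr($0, RSTART + RLENGTH)
    }
    print
  }'
}

upper_braces '/logs/{server_id}/path/{os_id}'
```

Because each replacement removes the braces it matched, the loop ends once no {...} group is left.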

This one handles arbitrary number and format of braces:
echo "/logs/{server_id}/path/{os_id}/{foo}" | awk -v RS='{' -v FS='}' -v ORS= -v OFS= '!/}/ { print } /}/ { $1 = toupper($1); print}'
Output:
/logs/SERVER_ID/path/OS_ID/FOO

Related

In bash/sed, how do you match on a lowercase letter followed by the SAME letter in uppercase?

I want to delete all instances of "aA", "bB" ... "zZ" from an input string.
e.g.
echo "foObar" |
sed -Ee 's/([a-z])\U\1//'
should output "fbar"
But the \U syntax works in the latter half (replacement part) of the sed expression - it fails to resolve in the matching clause.
I'm having difficulty converting the matched character to upper case to reuse in the matching clause.
If anyone could suggest a working regex which can be used in sed (or awk) that would be great.
Scripting solutions in pure shell are ok too (I'm trying to think of solving the problem this way).
Working PCRE (Perl-compatible regular expressions) are ok too but I have no idea how they work so it might be nice if you could provide an explanation to go with your answer.
Unfortunately, I don't have perl or python installed on the machine that I am working with.
You may use the following perl solution:
echo "foObar" | perl -pe 's/([a-z])(?!\1)(?i:\1)//g'
Details
([a-z]) - Group 1: a lowercase ASCII letter
(?!\1) - a negative lookahead that fails the match if the next char is the same as captured with Group 1
(?i:\1) - the same char as captured with Group 1 but in the different case (due to the lookahead before it).
The -e option supplies the Perl code to run on the command line, and the -p option wraps that code in a loop that reads each input line into $_ and prints $_ after the code has run.
This might work for you (GNU sed):
sed -r 's/aA|bB|cC|dD|eE|fF|gG|hH|iI|jJ|kK|lL|mM|nN|oO|pP|qQ|rR|sS|tT|uU|vV|wW|xX|yY|zZ//g' file
A programmatic solution:
sed 's/[[:lower:]][[:upper:]]/\n&/g;s/\n\(.\)\1//ig;s/\n//g' file
This marks every lower-case character that is followed by an upper-case character by inserting a newline before the pair. It then removes each marker together with its pair when the two letters match, using a case-insensitive back reference. Finally, any remaining newlines are removed, leaving pairs of different letters untouched.
Here is a verbose awk solution as OP doesn't have perl or python available:
echo "foObar" |
awk -v ORS= -v FS='' '{
for (i=2; i<=NF; i++) {
if ($(i-1) == tolower($i) && $i ~ /[A-Z]/ && $(i-1) ~ /[a-z]/) {
i++
continue
}
print $(i-1)
}
print $(i-1)
}'
fbar
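For awk implementations where an empty FS does not split the line into characters, the same single pass can be sketched with substr(). The function name drop_pairs is hypothetical, and LC_ALL=C is set on the assumption that only ASCII letters matter here.

```shell
# drop_pairs is a hypothetical helper name; it assumes ASCII input.
drop_pairs() {
  printf '%s\n' "$1" | LC_ALL=C awk '{
    out = ""
    n = length($0)
    i = 1
    while (i <= n) {
      c = substr($0, i, 1)        # current character
      d = substr($0, i + 1, 1)    # next character (empty at end of line)
      if (c ~ /[a-z]/ && d ~ /[A-Z]/ && toupper(c) == d) {
        i += 2                    # skip a pair such as "oO"
      } else {
        out = out c               # keep the current character
        i++
      }
    }
    print out
  }'
}

drop_pairs 'foObar'
```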
There's an easy lex for this,
%option main 8bit
	#include <ctype.h>
%%
[[:lower:]][[:upper:]] if ( toupper(yytext[0]) != yytext[1] ) ECHO;
(that's a tab before the #include, markdown loses those). Just put that in e.g. that.l and then make that. Easy-peasy lex's are a nice addition to your toolkit.
Note: This solution is (unsurprisingly) slow, based on OP's feedback:
"Unfortunately, due to the multiple passes - it makes it rather slow. "
If there is a character sequence¹ that you know won't ever appear in the input, you could use a 3-stage replacement to accomplish this with sed:
echo 'foObar foobAr' | sed -E -e 's/([a-z])([A-Z])/KEYWORD\1\l\2/g' -e 's/KEYWORD(.)\1//g' -e 's/KEYWORD(.)(.)/\1\u\2/g'
gives you: fbar foobAr
Replacement stages explained:
Look for lowercase letters followed by ANY uppercase letter and replace them with both letters as lowercase with the KEYWORD in front of them foObar foobAr -> fKEYWORDoobar fooKEYWORDbar
Remove KEYWORD followed by two identical characters (both are lowercase now, so the back-reference works) fKEYWORDoobar fooKEYWORDbar -> fbar fooKEYWORDbar
Strip remaining² KEYWORD from the output and convert the second character after it back to its original, uppercase version fbar fooKEYWORDbar -> fbar foobAr
¹ In this example I used KEYWORD for demonstration purposes. A single character or at least shorter character sequence would be better/faster. Just make sure to pick something that cannot possibly ever be in the input.
² The remaining occurrences are those where the lowercase versions of the letters were not identical, so we have to revert them back to their original state

sed : match all instances of regex in infile1.txt, and output only these to outfile2.txt

I have a text file infile1 with thousands of lines.
I wish to use sed to extract the occurring instances of a regex pattern match to outfile2.
NB
Each instance of the regex pattern match may occur more than once on each line of infile1.
Each instance of the extracted regex pattern should be printed to a new line in outfile2.
Does anyone know the syntax within sed to place the regex into?
ps the regex pattern is
\(Google[ ]{1,3}“[a-zA-Z0-9 ]{1,100}[., ]{0,3}”\)
Thank you :)
I think you want
grep -oE 'Google[ ]{1,3}"[a-zA-Z0-9 ]{1,100}[., ]{0,3}"' filename
-o tells grep to print only the matches, each on a line of its own, and -E instructs it to interpret the regex in extended POSIX syntax, which your regex appears to be.
Note that [ ] could be replaced with just a space, and you might want to use [[:alnum:] ] instead of [a-zA-Z0-9 ] to cover umlauts and suchlike if they exist in the current locale.
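As a small illustration of how -o puts each match on its own line, here is a sketch with made-up input. The function name extract_matches and the sample text are assumptions, and the pattern uses straight double quotes rather than the curly ones in the question.

```shell
# extract_matches is a hypothetical wrapper around the grep command above; it reads stdin.
extract_matches() {
  grep -oE 'Google[ ]{1,3}"[a-zA-Z0-9 ]{1,100}[., ]{0,3}"'
}

# Two matches on the first line, none on the second.
printf '%s\n' 'x Google "alpha one" y Google  "beta." z' 'no match here' |
  extract_matches
```

To write the matches to a file as in the question, redirect the output, e.g. extract_matches < infile1 > outfile2.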
Addendum: It is also possible to do this with sed. I don't recommend it, but you could write (using GNU sed):
sed -rn 's/Google[ ]{1,3}"[A-Za-z0-9 ]{1,100}[., ]{0,3}"/\n&\n/g; s/[^\n]*\n([^\n]*\n)/\1/g; s/\n[^\n]*$//p' filename
To make this work with older versions of BSD sed, use -En instead of -rn. -r and -E enable extended regex syntax. -r was historically used by GNU sed, -E by BSD sed; newer versions of them support both for compatibility. -n disables auto-printing.
The code works as follows:
# mark all occurrences of the regex by circumscribing them with newlines
s/Google[ ]{1,3}"[A-Za-z0-9 ]{1,100}[., ]{0,3}"/\n&\n/g
# Isolate every other line from the pattern space (the matches). This will
# leave the part behind the last match...
s/[^\n]*\n([^\n]*\n)/\1/g
# ...so we remove it afterwards and print the result of the transformation if it
# happened (the s///p flag does that). The transformation will not happen if
# there were no matches in the line (because then no newlines will have been
# inserted), so in those cases nothing will be printed.
s/\n[^\n]*$//p
It can be done with sed too, but it isn't pretty:
sed -n ':start /foo/{ h; s/\(foo\).*/\1/; s/.*\(foo\)/\1/; p; g; s/foo\(.*\)/\1/; b start; }' infile1 >outfile2
-- provided that you replace the four occurrences of foo above with your pattern Google {1,3}“[a-zA-Z0-9 ]{1,100}[., ]{0,3}”.
Yeah, I told you it isn't pretty. :)

Regular Expression to parse Common Name from Distinguished Name

I am attempting to parse (with sed) just First Last from the following DN(s) returned by the DSCL command in OSX terminal bash environment...
CN=First Last,OU=PCS,OU=guests,DC=domain,DC=edu
I have tried multiple regexs from this site and others with questions very close to what I wanted... mainly this question... I have tried following the advice to the best of my ability (I don't necessarily consider myself a newbie...but definitely a newbie to regex..)
DSCL returns a list of DNs, and I would like to only have First Last printed to a text file. I have attempted using sed, but I can't seem to get the correct function. I am open to other commands to parse the output. Every line begins with CN= and then there is a comma between Last and OU=.
Thank you very much for your help!
I think all of the regular expression answers provided so far are buggy, insofar as they do not properly handle quoted ',' characters in the common name. For example, consider a distinguishedName like:
CN=Doe\, John,CN=Users,DC=example,DC=local
Better to use a real library able to parse the components of a distinguishedName. If you're looking for something quick on the command line, try piping your DN to a command like this:
echo "CN=Doe\, John,CN=Users,DC=activedir,DC=local" | python -c 'import ldap; import sys; print(ldap.dn.explode_dn(sys.stdin.read().strip(), notypes=1)[0])'
(depends on having the python-ldap library installed). You could cook up something similar with PHP's built-in ldap_explode_dn() function.
Two cut commands is probably the simplest (although not necessarily the best):
DSCL | cut -d, -f1 | cut -d= -f2
First, split the output from DSCL on commas and print the first field ("CN=First Last"); then split that on equal signs and print the second field.
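Since DSCL above stands in for the real command, here is the same pipeline run against the sample DN from the question. The variable names dn and cn are only for illustration.

```shell
# Sample DN taken from the question; dn and cn are illustrative variable names.
dn='CN=First Last,OU=PCS,OU=guests,DC=domain,DC=edu'

# Field 1 when split on commas is "CN=First Last"; field 2 of that when split on "=" is the name.
cn=$(printf '%s\n' "$dn" | cut -d, -f1 | cut -d= -f2)

printf '%s\n' "$cn"
```

Note that this sketch shares the limitation of the pipeline itself: a common name containing an escaped comma or an equals sign would be cut short.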
Using sed:
sed 's/^CN=\([^,]*\).*/\1/' input_file
^ matches start of line
CN= literal string match
\([^,]*\) everything until a comma
.* rest
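If the common name may contain backslash-escaped commas, as in the CN=Doe\, John example earlier, one possible variation is sketched below. It assumes a sed that accepts -E for extended regexes, and the function name cn_from_dn is hypothetical. It is not a full DN parser; quoted values are not handled.

```shell
# cn_from_dn is a hypothetical helper; it reads DNs on stdin, one per line.
cn_from_dn() {
  # (\\.|[^,\\])* consumes either a backslash followed by any character,
  # or a character that is neither a comma nor a backslash.
  sed -E 's/^CN=((\\.|[^,\\])*).*/\1/'
}

printf '%s\n' \
  'CN=First Last,OU=PCS,OU=guests,DC=domain,DC=edu' \
  'CN=Doe\, John,CN=Users,DC=example,DC=local' | cn_from_dn
```

The escaped comma is kept as written (Doe\, John); stripping the backslash would need one more substitution.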
http://www.gnu.org/software/gawk/manual/gawk.html#Field-Separators
awk -v RS=',' -v FS='=' '$1=="CN"{print $2}' foo.txt
I like awk too, so I print the substring from the fourth char:
DSCL | awk -F, '{print substr($1,4)}' > filterednames.txt
This regex will parse a distinguished name, giving name and val as capture groups for each match.
When DN strings contain commas, they are meant to be quoted - this regex correctly handles both quoted and unquoted strings, and also handles escaped quotes in quoted strings:
(?:^|,\s?)(?:(?<name>[A-Z]+)=(?<val>"(?:[^"]|"")+"|[^,]+))+
Here it is nicely formatted:
(?:^|,\s?)
(?:
(?<name>[A-Z]+)=
(?<val>"(?:[^"]|"")+"|[^,]+)
)+
Here's a link so you can see it in action:
https://regex101.com/r/zfZX3f/2
If you want a regex to get only the CN, then this adapted version will do it:
(?:^|,\s?)(?:CN=(?<val>"(?:[^"]|"")+"|[^,]+))

my sed is close... but not quite there, can you help please?

I want to print only the lines that meet the criteria : "worde:" and "wordo;"
I got this far:
sed -n '/\([a-z]*\)\1e:\1o;/p;'
But it doesn't quite work.
Can someone please perfect it and tell me exactly what was wrong with mine?
(Please note there are no capital letters ever, hence why I didn't bother including that within my initial character range)
Thanks heaps,
This will handle lines where "worde:wordo;" (nothing between the words) appears:
sed -n '/\([a-z]*\)e:\1o;/p;'
If you need to allow for characters BETWEEN the words, you'll need something like this:
sed -n '/\([a-z]*\)e:.*\1o;/p;'
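One thing to watch with the back-reference versions: [a-z]* can match the empty string, so any line containing e: followed later by o; would be printed. A hedged sketch that requires at least one letter in the captured word follows; the function name same_word and the sample lines are made up.

```shell
# same_word is a hypothetical helper; it reads lines on stdin.
same_word() {
  # [a-z][a-z]* forces the captured word to be at least one letter long,
  # so \1 can no longer match the empty string.
  sed -n '/\([a-z][a-z]*\)e:.*\1o;/p'
}

printf '%s\n' 'cate: cato;' 'cate: dogo;' 'nothing here' | same_word
```

This still matches on a shared suffix rather than a whole word, so add word-boundary handling if that distinction matters for your data.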
My interpretation of your question is that you want to match lines which contain both worde: and wordo;
sed -n '/worde:/{/wordo;/p}' infile
The -n option suppresses sed's automatic printing of the pattern space. If the first regex matches, control flows into the block; if it doesn't, the line is ignored. Inside the block, the line is printed only if the second regex also matches.
One way using alternation:
sed -n '/word\(e:\|o;\)/ p' infile
Is it a requirement to use capture groups? I went without them.
$ sed -n '/\w*[oe][:;]/p'
\w* - Match any run of word characters (\w is a GNU sed extension; if you really want only [a-z], swap that back in)
[oe] - Those word characters must end in an e or o
[:;] - And then have a : or ;
This might work for you:
sed '/^\(.*\)[eE]:\s*\1[oO];/!d' file

how to use sed, awk, or gawk to print only what is matched?

I see lots of examples and man pages on how to do things like search-and-replace using sed, awk, or gawk.
But in my case, I have a regular expression that I want to run against a text file to extract a specific value. I don't want to do search-and-replace. This is being called from bash. Let's use an example:
Example regular expression:
.*abc([0-9]+)xyz.*
Example input file:
a
b
c
abc12345xyz
a
b
c
As simple as this sounds, I cannot figure out how to call sed/awk/gawk correctly. What I was hoping to do, is from within my bash script have:
myvalue=$( sed <...something...> input.txt )
Things I've tried include:
sed -e 's/.*([0-9]).*/\\1/g' example.txt # extracts the entire input file
sed -n 's/.*([0-9]).*/\\1/g' example.txt # extracts nothing
My sed (Mac OS X) didn't work with +. I tried * instead and added the p flag to print the match:
sed -n 's/^.*abc\([0-9]*\)xyz.*$/\1/p' example.txt
For matching at least one numeric character without +, I would use:
sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' example.txt
You can use sed to do this
sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp'
-n suppresses automatic printing of each line
-r enables extended regexes, so you don't have to escape the capture group parens ().
\1 the capture group match
/g global match
/p print the result
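Putting this together with the myvalue=$( ... ) form from the question might look like the sketch below. It feeds the sample input through a pipe instead of reading input.txt, and it uses -E rather than -r on the assumption that the local sed accepts it.

```shell
# Sample input from the question, fed via printf instead of a file.
myvalue=$(printf '%s\n' a b c abc12345xyz a b c |
  sed -nE 's/.*abc([0-9]+)xyz.*/\1/p')

printf 'myvalue=%s\n' "$myvalue"
```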
I wrote a tool for myself that makes this easier
rip 'abc(\d+)xyz' '$1'
I use perl to make this easier for myself. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/'
This runs Perl, the -n option instructs Perl to read in one line at a time from STDIN and execute the code. The -e option specifies the instruction to run.
The instruction runs a regexp on the line read, and if it matches prints out the contents of the first set of brackets ($1).
You can do this with multiple file names on the end also. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/' example1.txt example2.txt
If your version of grep supports it you could use the -o option to print only the portion of any line that matches your regexp.
If not then here's the best sed I could come up with:
sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
... which deletes/skips lines with no digits and, for the remaining lines, removes all leading and trailing non-digit characters. (I'm only guessing that your intention is to extract the number from each line that contains one).
The problem with something like:
sed -e 's/.*\([0-9]*\).*/&/'
.... or
sed -e 's/.*\([0-9]*\).*/\1/'
... is that sed only supports "greedy" matching ... so the first .* will match the rest of the line. Unless we can use a negated character class to achieve a non-greedy match ... or a version of sed with Perl-compatible or other extensions to its regexes, we can't extract a precise pattern match from within the pattern space (a line).
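The effect of greediness can be seen in a short sketch. The variable names are illustrative, and the second command shows the negated-character-class workaround under the assumption that the number is the first run of digits on the line.

```shell
line='abc12345xyz'

# Greedy: the leading .* swallows the whole line, so the group captures nothing.
greedy=$(printf '%s\n' "$line" | sed 's/.*\([0-9]*\).*/\1/')

# Negated class: [^0-9]* cannot run past the first digit, so the group gets the number.
anchored=$(printf '%s\n' "$line" | sed 's/^[^0-9]*\([0-9][0-9]*\).*/\1/')

printf 'greedy=[%s] anchored=[%s]\n' "$greedy" "$anchored"
```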
You can use gawk's match(), which accepts a third array argument, to access the captured group:
$ awk 'match($0, /abc([0-9]+)xyz/, matches) {print matches[1]}' file
12345
This tries to match the pattern abc[0-9]+xyz. If it does so, it stores its slices in the array matches, whose first item is the block [0-9]+. Since match() returns the character position, or index, of where that substring begins (1, if it starts at the beginning of string), it triggers the print action.
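Where the three-argument match() is not available, a rough equivalent can be built from RSTART and RLENGTH. This sketch relies on the fixed lengths of the literal abc and xyz, and the function name extract_num is hypothetical.

```shell
# extract_num is a hypothetical helper; it reads the named files, or stdin if none are given.
extract_num() {
  awk 'match($0, /abc[0-9]+xyz/) {
    # RSTART and RLENGTH describe the whole abc...xyz match;
    # skip the 3 characters of "abc" and drop the 3 of "xyz".
    print substr($0, RSTART + 3, RLENGTH - 6)
  }' "$@"
}

printf '%s\n' a b c abc12345xyz a b c | extract_num
```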
With grep you can use a look-behind and look-ahead:
$ grep -oP '(?<=abc)[0-9]+(?=xyz)' file
12345
$ grep -oP 'abc\K[0-9]+(?=xyz)' file
12345
This matches [0-9]+ only when it occurs between abc and xyz, and prints just the digits. The -P option (Perl-compatible regexes) is a GNU grep feature.
perl is the cleanest syntax, but if you don't have perl (not always there, I understand), then one way to use gawk with components of a regex is the gensub function.
gawk '/abc[0-9]+xyz/ { print gensub(/.*abc([0-9]+)xyz.*/,"\\1","g"); }' < file
output of the sample input file will be
12345
Note: gensub replaces everything the regex (between the //) matches, so you need the .* before and after to get rid of the text surrounding the number. The literal abc and xyz must stay in the pattern: with a bare .*([0-9]+).* the greedy leading .* would leave only the last digit in the group.
If you want to select lines then strip out the bits you don't want:
egrep 'abc[0-9]+xyz' inputFile | sed -e 's/^.*abc//' -e 's/xyz.*$//'
It basically selects the lines you want with egrep and then uses sed to strip off the bits before and after the number.
You can see this in action here:
pax> echo 'a
b
c
abc12345xyz
a
b
c' | egrep 'abc[0-9]+xyz' | sed -e 's/^.*abc//' -e 's/xyz.*$//'
12345
pax>
Update: obviously if your actual situation is more complex, the REs will need to be modified. For example if you always had a single number buried within zero or more non-numerics at the start and end:
egrep '[^0-9]*[0-9]+[^0-9]*$' inputFile | sed -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
The OP's case doesn't specify that there can be multiple matches on a single line, but for the Google traffic, I'll add an example for that too.
Since the OP's need is to extract a group from a pattern, using grep -o will require 2 passes. But, I still find this the most intuitive way to get the job done.
$ cat > example.txt <<TXT
a
b
c
abc12345xyz
a
abc23451xyz asdf abc34512xyz
c
TXT
$ cat example.txt | grep -oE 'abc([0-9]+)xyz'
abc12345xyz
abc23451xyz
abc34512xyz
$ cat example.txt | grep -oE 'abc([0-9]+)xyz' | grep -oE '[0-9]+'
12345
23451
34512
Since processor time is basically free but human readability is priceless, I tend to refactor my code based on the question, "a year from now, what am I going to think this does?" In fact, for code that I intend to share publicly or with my team, I'll even open man grep to figure out what the long options are and substitute those. Like so: grep --only-matching --extended-regexp
why even need match group
gawk/mawk/mawk2 'BEGIN{ FS="(^.*abc|xyz.*$)" } ($2 ~ /^[0-9]+$/) {print $2}'
Let FS collect away both ends of the line.
If $2, the leftover not swallowed by FS, doesn't contain non-numeric characters, that's your answer to print out.
If you're extra cautious, confirm length of $1 and $3 both being zero.
** edited answer after realizing zero length $2 will trip up my previous solution
there's a standard piece of code from awk channel called "FindAllMatches" but it's still very manual, literally, just long loops of while(), match(), substr(), more substr(), then rinse and repeat.
If you're looking for ideas on how to obtain just the matched pieces, but upon a complex regex that matches multiple times each line, or none at all, try this :
mawk/mawk2/gawk 'BEGIN { srand(); for(x = 0; x < 128; x++ ) {
alnumstr = sprintf("%s%c", alnumstr , x)
};
gsub(/[^[:alnum:]_=]+|[AEIOUaeiou]+/, "", alnumstr)
# resulting str should be 44-chars long :
# all digits, non-vowels, equal sign =, and underscore _
x = 10; do { nonceFS = nonceFS substr(alnumstr, 1 + int(44*rand()), 1)
} while ( --x ); # you can pick any level of precision you need.
# 10 chars randomly among the set is approx. 54-bits
#
# i prefer this set over all ASCII being these
# just about never require escaping
# feel free to skip the _ or = or r/t/b/v/f/0 if you're concerned.
#
# now you've made a random nonce that can be
# inserted right in the middle of just about ANYTHING
# -- ASCII, Unicode, binary data -- (1) which will always fully
# print out, (2) has extremely low chance of actually
# appearing inside any real word data, and (3) even lower chance
# it accidentally alters the meaning of the underlying data.
# (so intentionally leaving them in there and
# passing it along unix pipes remains quite harmless)
#
# this is essentially the lazy man's approach to making nonces
# that kinda-sorta have some resemblance to base64
# encoded, without having to write such a module (unless u have
# one for awk handy)
regex1 = (..); # build whatever regex you want here
FS = OFS = nonceFS;
} $0 ~ regex1 {
gsub(regex1, nonceFS "&" nonceFS); $0 = $0;
# now you've essentially replicated what gawk patsplit( ) does,
# or gawk's split(..., seps) tracking 2 arrays one for the data
# in between, and one for the seps.
#
# via this method, that can all be done upon the entire $0,
# without any of the hassle (and slow downs) of
# reading from associatively-hashed arrays,
#
# simply print out all your even numbered columns
# those will be the parts of "just the match"
If you also run OFS = ""; $1 = $1; afterwards, you no longer need the 4-argument split() or patsplit() (both gawk-specific) to see what the regex separators were: the fields of $0 now follow a data1-sep1-data2-sep2-... pattern, while $0 itself looks EXACTLY the same as when you first read in the line. A straight print will be byte-for-byte identical to printing immediately upon reading.
Once i tested it to the extreme using a regex that represents valid UTF8 characters on this. Took maybe 30 seconds or so for mawk2 to process a 167MB text file with plenty of CJK unicode all over, all read in at once into $0, and crank this split logic, resulting in NF of around 175,000,000, and each field being 1-single character of either ASCII or multi-byte UTF8 Unicode.
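A much smaller sketch of the same separator trick is shown below. It uses a fixed stand-in string (_q7x_) instead of the random nonce, which is an assumption that the string never occurs in the input, and the function name only_matches is hypothetical.

```shell
# only_matches is a hypothetical helper: $1 is an ERE, input comes from stdin.
# Note that awk processes backslash escapes in -v assignments.
only_matches() {
  awk -v re="$1" -v sep='_q7x_' '
    BEGIN { FS = sep }
    $0 ~ re {
      gsub(re, sep "&" sep)   # wrap every match in the separator
      $0 = $0                 # re-split the record on the separator
      # Even-numbered fields are the matches; odd ones are the text in between.
      for (i = 2; i <= NF; i += 2) print $i
    }'
}

printf '%s\n' 'abc23451xyz asdf abc34512xyz' 'no match' |
  only_matches 'abc[0-9]+xyz'
```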
you can do it with the shell
while read -r line
do
case "$line" in
*abc*[0-9]*xyz* )
t="${line##*abc}"
echo "num is ${t%%xyz*}";;
esac
done <"file"
For awk. I would use the following script:
/.*abc([0-9]+)xyz.*/ {
print $0;
next;
}
{
# default: do nothing
}
gawk '/.*abc([0-9]+)xyz.*/' file