RegEx to Remove Unwanted text - regex

I'm still kind of new to RegEx in general. I'm trying to retrieve the names from a field so I can split them for further use (using Pentaho Data Integration/Kettle for the data extraction). Here's an example of the string I'm given:
CN=Name One/OU=Site/O=Domain;CN=Name Two/OU=Site/O=Domain;CN=Name Three/OU=Site/O=Domain
I would like to have the following format returned:
Name One;Name Two;Name Three
Kettle uses Java Regular Expressions.

That sounds like you want a substitute-and-replace based on a regex. How to do that correctly depends on your language, but with sed I would do it like this:
echo "CN=Name One/OU=Site/O=Domain;CN=Name Two/OU=Site/O=Domain;CN=Name Three/OU=Site/O=Domain" |\
sed 's/CN=\([^\/]*\)[^;]*/\1/g'
If you intend to split it later anyway, you probably want to just match the names and return them in a loop. Example code in Perl:
#!/usr/bin/perl
$line="CN=Name One/OU=Site/O=Domain;CN=Name Two/OU=Site/O=Domain;CN=Name Three/OU=Site/O=Domain";
for $match ($line =~ /CN=([^\/]*)/g ){
print "Name: $match\n";
}

Assuming the string is in file.txt, and the part after each name is always /OU=Site/O=Domain:
sed -e 's/\/OU=Site\/O=Domain//g' -e 's/CN=//g' file.txt

Related

Output name of named pattern in sed or grep

I'm looking for a way to output the name of the named pattern that matched in a regular expression.
Regex - can contain n patterns, each named idn, no duplicates:
(?P<id1>aba)|(?P<id2>cde)|(?P<id3>esa)|(?P<id4>fav)
input-file:
aba
cec
fav
gex
hur
output (any of the following):
id1
id4
id1;id4
1
4
1;4
Is there any way to do it with sed or grep on a Linux OS? The input file is a text file of 200-500MB.
I know that PHP outputs pattern names in output array, but I'd prefer not to use it.
Any other solution is also welcome, but it should use basic linux commands.
Here's a simple Perl script which does what you ask.
perl -nle 'if (m/(?P<id1>aba)|(?P<id2>cde)|(?P<id3>esa)|(?P<id4>fav)/) {
for my $pat (keys %+) { print $pat } }' filename
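If Perl is not an option, roughly the same thing can be sketched in awk, which also reads the file only once. This is a sketch under a few assumptions: the helper name matched_ids is made up for illustration, the patterns from the question are copied by hand into an awk array (awk has no named groups), and the output is the id1;id4 form, with each name printed once in order of first appearance.

```shell
# Hypothetical helper: print the names of the patterns found in the input.
matched_ids() {
    awk '
        BEGIN {
            # name -> extended regular expression; adjust to your own patterns
            re["id1"] = "aba"; re["id2"] = "cde"
            re["id3"] = "esa"; re["id4"] = "fav"
        }
        {
            # remember each name the first time its pattern matches a line
            for (id in re)
                if (!(id in seen) && $0 ~ re[id]) {
                    seen[id] = 1
                    order[++n] = id
                }
        }
        END {
            # join the names with ";" (prints nothing if nothing matched)
            for (i = 1; i <= n; i++)
                printf "%s%s", order[i], (i < n ? ";" : "\n")
        }
    ' "$@"
}
```

Call it as matched_ids input-file, or pipe text into it. If a single line matches several patterns, the relative order of those names is not guaranteed, because awk does not define the iteration order of for (id in re).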

Regular expression: replace one character set with another

I have a string (e.g. 3122323123123) and want to replace every 1 with ax, every 2 with by and every 3 with cz.
How do I do that in bash?
I started with the character set [123] and tried sed, but didn't know how to write the replacement expression.
Regex is not the tool for you here. There's nothing in your question that requires any regex.
You didn't specify your language, but if you're working in PHP, you could use the function strtr() which does exactly what you are looking for.
And good old str_replace() can probably also do what you want too, as it can accept arrays for the search/replacement arguments.
Most other languages should have similar capabilities that mean you shouldn't need regex for this.
Look at the standard tr utility, which maps single characters to single characters:
% echo "3122323123123" | tr "123" "abc"
cabbcbcabcabc
If you want to replace a character with multiple characters, you can use sed for every replacement:
% echo "3122323123123" | sed -e "s/1/ax/g" -e "s/2/by/g" -e "s/3/cz/g"
czaxbybyczbyczaxbyczaxbycz
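Chaining sed substitutions works here because none of the replacement strings contains a 1, 2 or 3. If a replacement could contain one of the characters that a later substitution looks for, a single left-to-right pass avoids replacing text twice. A minimal sketch in plain shell follows; the function name map_digits is made up for illustration, and it sets the variables rest, out, tail and ch in the calling shell.

```shell
# Hypothetical helper: map each character exactly once, left to right.
map_digits() {
    rest=$1
    out=
    while [ -n "$rest" ]; do
        tail=${rest#?}          # everything after the first character
        ch=${rest%"$tail"}      # the first character itself
        case $ch in
            1) out="${out}ax" ;;
            2) out="${out}by" ;;
            3) out="${out}cz" ;;
            *) out="${out}${ch}" ;;   # anything else is copied unchanged
        esac
        rest=$tail
    done
    printf '%s\n' "$out"
}

map_digits 3122323123123
```

This is slower than sed for long inputs, since it walks the string one character at a time in the shell, so it is mainly useful for short strings.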
In C#:
string input = "3122323123123";
string output = input.Replace('1','a').Replace('2','b').Replace('3','c');
Using Perl tr/// for example:
$ echo "3122323123123" | perl -pe "tr/123/abc/"
cabbcbcabcabc

Why is using string substitution to form a regex not working?

I have a regular expression for use with awk to find any of the specified words in a line of a file. It looks like this awk "/word1/||/word2/||/word3/" filename. As an alternative, I have been trying to specify the words like this WORDS="word1 word2 word3" and then use bash string substitution to form the regular expression to pass to awk.
I've tried numerous ways of doing this to no avail. awk simply dumps the contents of the entire file or spits out some complaint about the regex form.
Here's what I have:
#!/bin/bash
FILE="myfile"
WORDS="word1 word2 word3"
# use BASH string substitution to obtain the regex which should look like this:
# "/word1/||/word2/||/word3/"
REGEX=\"/${WORDS// //||/}/\"
awk ${REGEX} $FILE
I'm fairly sure it has to do with quoting. I've tried various methods using echo and backticks and can get it to look right (when echoed), but when I actually try to use it, it fails.
Try to replace:
REGEX=\"/${WORDS// //||/}/\"
with:
REGEX="/${WORDS// //||/}/"
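Note that there is no need to escape double quotes since they are not really part of the regular expression. With the escaped quotes, awk receives a program that begins and ends with a literal ", which makes it a string constant; a non-empty string is always true as a pattern, which is why every line of the file gets printed. It is also safer to quote the expansions when calling awk: awk "$REGEX" "$FILE".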
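Another way to sidestep the quoting problem is to build a single alternation (word1|word2|word3) and hand it to awk as a variable rather than as program text. A sketch, assuming the words contain no spaces, backslashes or regex metacharacters; the helper name match_any is made up for illustration.

```shell
# Hypothetical helper: print the lines that contain any of the given words.
# Usage: match_any "word1 word2 word3" [file...]
match_any() {
    words=$1
    shift
    # turn the space-separated list into one alternation: word1|word2|word3
    pattern=$(printf '%s' "$words" | tr ' ' '|')
    # pass the pattern as an awk variable and use it as a dynamic regex
    awk -v re="$pattern" '$0 ~ re' "$@"
}
```

For example, match_any "word1 word2 word3" myfile prints the matching lines of myfile. Keep in mind that awk processes backslash escapes in -v values, so patterns containing backslashes need extra care.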

shell script: search and replace over multiple lines

I'm looking for a way to search and replace over multiple lines through a shell script. This is what I'm trying to do:
source:
[stuff before]
<!--WIERD_SPECIAL_COMMENT_BEGIN-->
[stuff here, possibly multiple lines]
<!--WIERD_SPECIAL_COMMENT_END-->
[stuff after]
target:
[stuff before]
[new content]
[stuff after]
In short, I want to delete the comments and everything between them and replace with some new content. Basically, I want to do a simple sed command over multiple lines, and if possible just using some basic *nix tools, no additional scripting language.
If you only need to match complete lines then you can do this task with
awk. Something like:
awk -v NEWTEXT=foo 'BEGIN{n=0} /COMMENT_BEGIN/ {n=1} {if (n==0) {print $0}} /COMMENT_END/ {print NEWTEXT; n=0}' < myfile.txt
If the file is not so well formatted, with comments on
the same line as text you want to keep or remove, then I
would use perl, read the entire file into a single string,
do a regular expression match and replace on that string, then write the new string to
a new file. This is not so simple and you need to write a perl script to do the work.
Something like:
#!/usr/bin/perl
use strict;
use warnings;
my $newtext = "foo\nbar";
undef $/;     # no input record separator, so the whole input is read at once
my $s = <>;   # read the whole file (stdin or a file named on the command line)
my $startPattern = quotemeta('<!--WIERD_SPECIAL_COMMENT_BEGIN-->');
my $endPattern = quotemeta('<!--WIERD_SPECIAL_COMMENT_END-->');
# non-greedy .*? so that each BEGIN/END pair is replaced separately
my $pattern = $startPattern . '.*?' . $endPattern;
$s =~ s/$pattern/$newtext/sg;
print $s;
sed does this just fine. The following is as simple as it gets; if you need to extract stuff from the delimiter line before the start delimiter or after the end delimiter, that's going to be a little more complex.
sed '/<!--WIERD_SPECIAL_COMMENT_BEGIN-->/,/<!--WIERD_SPECIAL_COMMENT_END-->/d' input >output
If you have any control over this, fix the spelling of "weird".
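The sed command above deletes the block but does not put anything in its place. One possible way to insert the new content in the same pass is sed's c (change) command applied to the range. A sketch, assuming both markers are present and each sits on its own line; the helper name replace_block and the fixed replacement text are only for illustration.

```shell
# Hypothetical helper: replace the marked block (markers included)
# with a single line of new text.
replace_block() {
    sed '/<!--WIERD_SPECIAL_COMMENT_BEGIN-->/,/<!--WIERD_SPECIAL_COMMENT_END-->/c\
[new content]' "$@"
}
```

Used as replace_block input >output. With a two-address range, sed emits the replacement text once, when the range ends, so the whole block collapses into that one line.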
Another solution: this can be done in a one-liner using Perl regular expressions, which I find easier to work with than sed or awk (both are cumbersome for multi-line match and replace):
perl -0 -i -pe 's/<!--WIERD_SPECIAL_COMMENT_BEGIN-->[\s\S]*<!--WIERD_SPECIAL_COMMENT_END-->/your new content here/gim' yourfile1.txt
Please note that -i edits the file in place, so the original is overwritten with the changed content. Use -i.bak instead if you want Perl to keep a backup copy.

Getting the index of the substring on solaris

How can I find the index of a substring which matches a regular expression on Solaris 10?
Assuming that what you want is to find the location of the first match of a wildcard in a string using bash, the following bash function returns just that, or empty if the wildcard doesn't match:
function match_index()
{
    local pattern=$1
    local string=$2
    local result=${string/${pattern}*/}
    [ ${#result} = ${#string} ] || echo ${#result}
}
For example:
$ echo $(match_index "a[0-9][0-9]" "This is a a123 test")
10
If you want to allow full-blown regular expressions instead of just wildcards, replace the "local result=" line with
local result=$(echo "$string" | sed 's/'"$pattern"'.*$//')
but then you're exposed to the usual shell quoting issues.
The go-to options for me are bash, awk and Perl. I'm not sure what you're trying to do, but any of the three would likely work well. For example:
f=somestring
string=$(expr match "$f" '.*\(expression\).*')
echo $string
You tagged the question as bash, so I'm going to assume you're asking how to do this in a bash script. Unfortunately, the built-in regular expression matching doesn't save string indices. However, if you're asking this in order to extract the match substring, you're in luck:
i=0
# the regex must be unquoted; a quoted right-hand side is matched as a literal string
if [[ $var =~ $regex ]]; then
    n=${#BASH_REMATCH[@]}
    while [[ $i -lt $n ]]
    do
        echo "capture[$i]: ${BASH_REMATCH[$i]}"
        i=$((i + 1))
    done
fi
This snippet will output in turn all of the submatches. The first one (index 0) will be the entire match.
You might like your awk options better, though. awk has a function, match, which gives you the index you want (1-based, or 0 if there is no match); it is described in the awk manual. It also stores the length of the match in RLENGTH, if you need that. To implement this in a bash script, you could do something like:
match_index=$(echo "$var_to_search" | \
awk '{
where = match($0, '"$regex_to_find"')
if (where)
print where
else
print -1
}')
There are a lot of ways to deal with passing the variables in to awk. This combination of piping output and directly embedding one into the awk one-liner is fairly common. You can also give awk variable values with the -v option (see man awk).
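As an illustration of the -v route, here is a small sketch that passes the regex in as an awk variable instead of splicing it into the program text. The function name regex_index is made up for this example, and only the first line of the string is examined.

```shell
# Hypothetical helper: print the 1-based position of the first match, or -1.
# Usage: regex_index REGEX STRING
regex_index() {
    printf '%s\n' "$2" | awk -v re="$1" '{
        where = match($0, re)    # match() also sets RSTART and RLENGTH
        print (where ? where : -1)
        exit                     # only look at the first line
    }'
}

regex_index 'a[0-9][0-9]' 'This is a a123 test'
```

This prints 11: awk counts positions from 1, whereas the bash function earlier in this thread counts from 0 and reports 10 for the same input. Because awk processes backslash escapes in -v values, a regex containing backslashes may need them doubled.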
Obviously you can modify this to get the length, the match string, whatever it is you need. You can capture multiple things into an array variable if necessary:
match_data=($( ... awk '{ ... print where,RLENGTH,match_string ... }'))
If you use bash 4.x you can source oobash, a string library written in bash in an OO style:
http://sourceforge.net/projects/oobash/
String is the constructor function:
String a abcda
a.indexOf a
0
a.lastIndexOf a
4
a.indexOf da
3
There are many more "methods" for working with strings in your scripts:
-base64Decode -base64Encode -capitalize -center
-charAt -concat -contains -count
-endsWith -equals -equalsIgnoreCase -reverse
-hashCode -indexOf -isAlnum -isAlpha
-isAscii -isDigit -isEmpty -isHexDigit
-isLowerCase -isSpace -isPrintable -isUpperCase
-isVisible -lastIndexOf -length -matches
-replaceAll -replaceFirst -startsWith -substring
-swapCase -toLowerCase -toString -toUpperCase
-trim -zfill