awk finding a column and trimming - regex

I have a text file with an irregular structure like the following:
first_name1 last_name1 designation1 email1 phone_number1
first_name2 last_name2 designation2 email2
first_name3 last_name3 designation3 email3 phone_number3 address3
As you can see, the email could be the last, second-to-last, or third-to-last column, so one cannot simply use $NF to get it. My goal is to find the email address wherever it is on the line and then extract the portion before the #. For instance, if email1 = foobar#dept.company.com, I want to extract foobar. How can I write an awk command that extracts the first portion of the email address? I tried the following, but it only finds an exact match. How can I turn it into a regex to get the job done?
awk '{for(i=1;i<=NF;i++){ if($i=="foobar#dept.company.com"){print $i} } }' users.txt

You are comparing $i to a string "foobar#dept.company.com", so yes of course this will only make an exact comparison. What it seems you are looking for is whether or not $i matches (~) a regular expression (/.../ instead of "..."), then tailor the regex to your needs. Try something like:
awk '{for(i=1;i<=NF;++i){if ($i ~ /.+#.+/){sub(/#.*$/, "", $i); print $i; next}}}'
The regex /.+#.+/ matches a string with a # in it, and some non-empty thing before it and after it. This will not match, for example #foobar or foobar#, or just #. You might want to consider using something more like /.+#.+\..+/ which would match (something)#(something).(something) since domain names usually have a . in them. You can tailor this regex to be more specific, if you wish.
The sub(/#.*$/, "", $i) replaces everything in $i from the first # to the end of the field ($) with an empty string "", leaving only the part before the # (i.e. the username). The print $i prints it, and the next moves on to the next line (skipping any remaining fields of the current record).
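For readers who want to check the logic outside awk, here is a minimal Python sketch of the same loop. It is not part of the original answer: the function name `usernames` and the sample lines are made up for illustration, and it keeps the `#` separator exactly as the question writes it.

```python
import re

# Same test as the awk answer: something, then '#', then something.
EMAIL_LIKE = re.compile(r".+#.+")

def usernames(lines):
    """Yield the part before '#' from the first email-like field of each line."""
    for line in lines:
        for field in line.split():          # whitespace-separated fields, like awk's $1..$NF
            if EMAIL_LIKE.search(field):
                yield field.split("#", 1)[0]  # drop everything from the first '#'
                break                         # like awk's `next`: go to the next line

sample = [
    "Ann Lee manager foobar#dept.company.com 555-0100",
    "Bob Ray clerk bob#dept.company.com",
    "no email on this line",
]
print(list(usernames(sample)))
```

Lines without an email-like field are simply skipped, which is also what the awk one-liner does.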

I don't know awk at all, but I looked up the regex reference and this should be supported: \b([^ ]*#.*?)($|[^\w#.]) in which group 1 matches the email. This just searches for something after a word boundary that contains #. The match ends at the next non-word character other than # and .


Perl split string based on forward slash

I am new to Perl, so this is basic question. I have a string as shown below. I am interested in taking date out of it, so thinking of splitting it using slash
my $path = "/bla/bla/bla/20160306";
my $date = (split(/\//,$path))[3];#ideally 3 is date position in array after split
print $date;
However, I don't see the expected output, but instead I see 5 getting printed.
Since the path starts with the pattern / itself, split returns a list with an empty string first (to the left of the first /); one element more. Thus the posted code miscounts by one and returns the one before last element (subdirectory) in the path, not the date.
If date is always the last thing in the string you can pick the last element
my $date = (split '/', $path)[-1];
where I've used '' as delimiters so as not to have to escape /. (This, however, may confuse, since the separator pattern is a regex and // conveys that, while '' may appear to merely quote a string.)
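The off-by-one is easy to see in any language whose split keeps a leading empty field. The sketch below uses Python's `str.split`, which behaves like Perl's split in this respect; it is only an illustration, not the poster's code.

```python
path = "/bla/bla/bla/20160306"

# The text to the left of the first '/' is an empty string, so it becomes element 0.
parts = path.split("/")
print(parts)

# Index 3 is therefore the last subdirectory, not the date.
print(parts[3])

# Counting from the end avoids the miscount altogether.
print(parts[-1])
```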
This can also be done with regex
my @parts = $path =~ m{([^/]+)}g;
With this there can be no initial empty string. Or, the last part can be picked out of the full list as above, with ($path =~ m{...}g)[-1], but if you indeed only need the last bit then extract it directly
my ($last_part) = $path =~ m{.*/(.*)};
Here the "greedy" .* matches everything in the string up to the last instance of the next subpattern (/ here), thus getting us to the last part of the path, which is then captured. The regex match operator returns its matches only when it is in the list context so parens on the left are needed.
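Both regex variants carry over to other regex engines more or less unchanged. A small Python sketch, given only as an illustration of the two patterns above:

```python
import re

path = "/bla/bla/bla/20160306"

# Collect every run of non-slash characters: no empty first element this way.
components = re.findall(r"[^/]+", path)
print(components)

# Greedy .* runs to the last '/', so the group captures only the final component.
last_part = re.match(r".*/(.*)", path).group(1)
print(last_part)
```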
Which brings us to the fact that you are parsing a path, and there are libraries dedicated to that.
For splitting a path into its components one tool is splitdir from File::Spec
use File::Spec;
my @parts = File::Spec->splitdir($path);
If the path starts with a / we'll again get an empty string for the first element (by design, see docs). That can then be removed, if present
shift @parts if $parts[0] eq '';
Again, the last element alone can be had like in the other examples.
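Other languages ship comparable path libraries. As a rough analogue (not a translation of File::Spec), Python's `pathlib` represents the leading slash as a root component rather than an empty string:

```python
from pathlib import PurePosixPath

path = PurePosixPath("/bla/bla/bla/20160306")

# The root shows up as '/' in .parts instead of as an empty first element.
print(path.parts)

# .name is the final component, here the date.
print(path.name)
```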
Simply bind it to the end:
(\d+)$
# look for digits at the end of the string
See a demo on regex101.com. The capturing group is only for clarification though not really needed in this case.
In Perl this would be (I am a PHP/Python guy, so bear with me when it is ugly)
my $path = "/bla/bla/bla/20160306";
$path =~ /(\d+)$/;
print $1;
See a demo on ideone.com.
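The same anchored pattern, sketched in Python for comparison; the helper name `trailing_digits` is made up for this example.

```python
import re

def trailing_digits(path):
    """Return the run of digits at the very end of the string, or None."""
    match = re.search(r"(\d+)$", path)
    return match.group(1) if match else None

print(trailing_digits("/bla/bla/bla/20160306"))
print(trailing_digits("/bla/bla/bla/"))
```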
Try this
Use a lookahead to do it. Splitting on a lookahead keeps the / at the start of each piece, so substitute it away afterwards to remove the slash.
my $path = "/a/b/c/20160306";
my $date = (split(/(?=\/)/,$path))[3];
$date=~s/^\///;
print $date;
Or else use pattern matching with grouping to do it.
my $path = "/a/b/c/20160306";
my @data = $path =~m/\/(\w+)/g;
print $data[3];

matching two strings which differ in elements and spaces in perl

I want to match two string which differ only in element and newlines
$string1 = "perl is <match>scripting language</match>";
$string2 = "perl<TAG> is<TAG> scr<TAG>ipt<TAG>inglanguage";
Note: spaces and <TAG> and newline can come anywhere in string2. space may or may not present in string2 for e.g. in above instance in $string2 spaces between words scripting language is missing. we have to ignore space,tags,newline while matching string1 against string2. <match> tag in string1 indicates the data to be matched against string2
output required :
whole content of string2 in addition with <match> tag.
perl<TAG> is<TAG> <match>scr<TAG>ipt<TAG>inglanguage</match>
Code i tried :
while($string =~ /<match>(.*?)<\/match>/gs)
{
my $data_to_match = $1;
$data_to_match = add_pat($data_to_match);
$string2 =~ s{($data_to_match)}
{
"<match>$&<\/match>"
}esi;
}
sub add_pat
{
my ($data) = (@_);
my @array = split//,$data;
foreach my $each(@array)
{
$each = quotemeta $each;
$each = '(?:(<TAG>|\s)+)?'.$each.'(?:(<TAG>|\s)+)?';
}
$data = join '',@array;
return $data;
}
Problem: since the space is missing in string2, it does not match. I tried making the space optional while appending the pattern to each character, but with the space optional the pattern match keeps on running.
In reality, I have a large string to match, and these spaces are causing the problem. Please suggest a fix.
Use regular expressions to remove all the characters that you wish to ignore from both of the strings. Then compare the remaining values of the two strings.
So you will end up both strings, for example:
'perlisscriptinglanguage' and 'perlisscriptinglanguage'
If you want you can also upper/lower case them to match too.
If they match then just return the original string 2.
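A minimal Python sketch of this normalise-then-compare idea. The set of ignored tokens (`<TAG>`, the `<match>` markers, and whitespace) is taken from the question; the names `IGNORED` and `normalise` are made up for the example.

```python
import re

# Everything the question says should be ignored when comparing.
IGNORED = re.compile(r"</?match>|<TAG>|\s+")

def normalise(text):
    """Strip ignorable tokens, then lower-case for a case-insensitive comparison."""
    return IGNORED.sub("", text).lower()

string1 = "perl is <match>scripting language</match>"
string2 = "perl<TAG> is<TAG> scr<TAG>ipt<TAG>inglanguage"

print(normalise(string1))
print(normalise(string1) == normalise(string2))
```

Note that this only answers whether the two strings agree; it does not by itself place the `<match>` tags into string2.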
I think it's weird that you are expected to "match" when $string2, even with the tags taken out, doesn't match the original string.
Anyway, since your code is tolerant of additional spaces and tags in $string2, you can wipe all spaces (and tags, if applicable) from $string1.
I added $data_to_match =~ s/ +//; before your call to add_pat. That didn't quite work, because the line "$each = '(?:(<TAG>|\s)+)?'.$each.'(?:(<TAG>|\s)+)?';" adds the (?:(<TAG>|\s)+)? even before the first letter of the match from $string1. You actually have a lot of redundant TAG patterns: you add one to the front and back of each letter. I don't know what quotemeta does, so I'm not sure how to fix the code there. I just added the
$data_to_match =~ s/\Q(?:(<TAG>|\s)+)?\E//; line after the call to add_pat to strip the first TAG pattern off the front of the pattern. Otherwise it matches wrongly and outputs 'perl<TAG> is<match><TAG> scr<TAG>ipt<TAG>inglanguage</match>'.
Really you should only be putting one "(?:(<TAG>|\s)+)?" in between each letter of the $string1 match and, more importantly, you should not be putting "(?:(<TAG>|\s)+)?" before the first letter or after the last letter.
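To make that last point concrete, here is a Python sketch that puts exactly one optional separator between letters and none at either end. It is a re-implementation of the idea, not the poster's Perl; `SEPARATOR`, `build_pattern` and `mark_match` are hypothetical names, and it assumes string1 contains exactly one `<match>...</match>` pair.

```python
import re

# Zero or more tags or whitespace characters, allowed only BETWEEN letters.
SEPARATOR = r"(?:<TAG>|\s)*"

def build_pattern(needle):
    """Escape each non-space character and join them with the separator."""
    letters = [re.escape(ch) for ch in needle if not ch.isspace()]
    return re.compile(SEPARATOR.join(letters), re.IGNORECASE)

def mark_match(string1, string2):
    """Wrap the first occurrence of string1's <match> text inside string2."""
    needle = re.search(r"<match>(.*?)</match>", string1, re.S).group(1)
    wrap = lambda m: "<match>" + m.group(0) + "</match>"
    return build_pattern(needle).sub(wrap, string2, count=1)

string1 = "perl is <match>scripting language</match>"
string2 = "perl<TAG> is<TAG> scr<TAG>ipt<TAG>inglanguage"
print(mark_match(string1, string2))
```

Because the separator never appears before the first letter, the match cannot start on the `<TAG>` that precedes `scr`.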

Regular Expressions: querystring parameters matching

I'm trying to learn something about regular expressions.
Here is what I'm going to match:
/parent/child
/parent/child?
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/
/parent/child/?
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789
My expression should "grabs" abc123 and def456.
And now just an example about what I'm not going to match ("question mark" is missing):
/parent/child/firstparam=abc123&secondparam=def456
Well, I built the following expression:
^(?:/parent/child){1}(?:^(?:/\?|\?)+(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?)?
But that doesn't work.
Could you help me to understand what I'm doing wrong?
Thanks in advance.
UPDATE 1
Ok, I made other tests.
I'm trying to fix the previous version with something like this:
/parent/child(?:(?:\?|/\?)+(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?)?$
Let me explain my idea:
Must start with /parent/child:
/parent/child
Following group is optional
(?: ... )?
The previous optional group must starts with ? or /?
(?:\?|/\?)+
Optional parameters (I grab values if specified parameters are part of querystring)
(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*)?
End of line
$
Any advice?
UPDATE 2
My solution must be based just on regular expressions.
Just for example, I previously wrote the following one:
/parent/child(?:[?&/]*(?:firstparam=([^&]*)|secondparam=([^&]*)|[^&]*))*$
And that works pretty nice.
But it matches the following input too:
/parent/child/firstparam=abc123&secondparam=def456
How could I modify the expression in order to not match the previous string?
You didn't specify a language so I'll just use Perl. So basically instead of matching everything, I just matched exactly what I thought you needed. Correct me if I am wrong please.
while ($subject =~ m/(?<==)\w+?(?=&|\W|$)/g) {
# matched text = $&
}
(?<= # Assert that the regex below can be matched, with the match ending at this position (positive lookbehind)
= # Match the character “=” literally
)
\w # Match a single character that is a “word character” (letters, digits, and underscores)
+? # Between one and unlimited times, as few times as possible, expanding as needed (lazy)
(?= # Assert that the regex below can be matched, starting at this position (positive lookahead)
# Match either the regular expression below (attempting the next alternative only if this one fails)
& # Match the character “&” literally
| # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
\W # Match a single character that is a “non-word character”
| # Or match regular expression number 3 below (the entire group fails if this one fails to match)
$ # Assert position at the end of the string (or before the line break at the end of the string, if any)
)
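The pattern is not Perl-specific. A quick Python check of what it picks up, with a made-up helper name `values`:

```python
import re

# Same pattern as the Perl answer: text after '=' up to '&', a non-word char, or the end.
VALUE = re.compile(r"(?<==)\w+?(?=&|\W|$)")

def values(url):
    """Return every parameter value found after an '=' sign."""
    return VALUE.findall(url)

print(values("/parent/child?firstparam=abc123&secondparam=def456"))
print(values("/parent/child/"))
# No '?' here, yet the values are still found: the pattern never looks at the path.
print(values("/parent/child/firstparam=abc123&secondparam=def456"))
```

As the last call shows, this approach extracts values but does not enforce the question's requirement that the `?` be present.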
This regex will work as long as you know what your parameter names are going to be and you're sure that they won't change.
\/parent\/child\/?\?(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam)\=([\w]+)&?)?
Whilst regex is not the best solution for this (the above code examples will be far more efficient, as string functions are way faster than regexes) this will work if you need a regex solution with up to 3 parameters. Out of interest, why must the solution use only regex?
In any case, this regex will match the following strings:
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789
It will now only match those containing query string parameters, and put them into capture groups for you.
What language are you using to process your matches?
If you are using preg_match with PHP, you can get the whole match as well as capture groups in an array with
preg_match($regex, $string, $matches);
Then you can access the whole match with $matches[0] and the rest with $matches[1], $matches[2], etc.
If you want to add additional parameters you'll also need to add them in the regex too, and add additional parts to get your data. For example, if you had
/parent/child/?secondparam=def456&firstparam=abc123&fourthparam=jkl01112&thirdparam=ghi789
The regex will become
\/parent\/child\/?\?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?(?:(?:firstparam|secondparam|thirdparam|fourthparam)\=([\w]+)&?)?
This will become a bit more tedious to maintain as you add more parameters, though.
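If the host language is available to build the pattern, the repetition can be generated instead of typed out. A Python sketch under that assumption; `build_query_regex` is a hypothetical helper, and the generated pattern has the same shape as the hand-written one (first group required, the rest optional).

```python
import re

def build_query_regex(param_names):
    """Build /parent/child[/]?name=value[&name=value...] for the given names."""
    names = "|".join(re.escape(name) for name in param_names)
    one_param = rf"(?:(?:{names})=(\w+)&?)"
    # One required parameter, then an optional slot for each remaining name.
    body = one_param + (one_param + "?") * (len(param_names) - 1)
    return re.compile(r"/parent/child/?\?" + body)

pattern = build_query_regex(["firstparam", "secondparam", "thirdparam"])

match = pattern.match("/parent/child/?secondparam=def456&firstparam=abc123")
print(match.groups())

# Without the '?' nothing matches, as the question requires.
print(pattern.match("/parent/child/firstparam=abc123&secondparam=def456"))
```

Adding a fourth parameter then means adding one name to the list rather than editing four alternations.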
You can optionally include ^ $ at the start and end if the multi-line flag is enabled. If you also need to match the whole lines without query strings, wrap this whole regex in a non-capture group (including ^ $) and add
|(?:^\/parent\/child\/?\??$)
to the end.
You're not escaping the /s in your regex for starters and using {1} for a single repetition of something is unnecessary; you only use those when you want more than one repetition or a range of repetitions.
And part of what you're trying to do is simply not a good use of a regex. I'll show you an easier way to deal with that: you want to use something like split and put the information into a hash that you can check the contents of later. Because you didn't specify a language, I'm just going to use Perl for my example, but every language I know with regexes also has easy access to hashes and something like split, so this should be easy enough to port:
# I picked an example to show how this works.
my $route = '/parent/child/?first=123&second=345&third=678';
my %params; # I'm going to put those URL parameters in this hash.
# Perl has a way to let me avoid escaping the /s, but I wanted an example that
# works in other languages too.
if ($route =~ m/\/parent\/child\/\?(.*)/) { # Use the regex for this part
print "Matched route.\n";
# But NOT for this part.
my $query = $1; # $1 is a Perl thing. It contains what (.*) matched above.
my @items = split '&', $query; # Each item is something like param=123
foreach my $item (@items) {
my ($param, $value) = split '=', $item;
$params{$param} = $value; # Put the parameters in a hash for easy access.
print "$param set to $value \n";
}
}
# Now you can check the parameter values and do whatever you need to with them.
# And you can add new parameters whenever you want, etc.
if ($params{'first'} eq '123') {
# Do whatever
}
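The same split-into-a-hash approach ports readily. In Python the splitting does not even need to be written by hand, since the standard library's `urllib.parse` does it; note that this swaps the manual split above for `urlsplit` and `parse_qs`. The function name `route_params` is made up, and keeping only the first value of a repeated parameter is an assumption.

```python
from urllib.parse import urlsplit, parse_qs

def route_params(route):
    """Return the query parameters as a dict, or None if the path is not /parent/child."""
    parts = urlsplit(route)
    if parts.path not in ("/parent/child", "/parent/child/"):
        return None
    # parse_qs maps each name to a list of values; keep the first of each.
    return {name: vals[0] for name, vals in parse_qs(parts.query).items()}

print(route_params("/parent/child/?firstparam=abc123&secondparam=def456"))
print(route_params("/parent/child?"))
print(route_params("/parent/child/firstparam=abc123&secondparam=def456"))
```

The last call returns None because, without a `?`, the whole string is treated as the path.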
My solution:
/(?:\w+/)*(?:(?:\w+)?\?(?:\w+=\w+(?:&\w+=\w+)*)?|\w+|)
Explain:
/(?:\w+/)* match /parent/child/ or /parent/
(?:\w+)?\?(?:\w+=\w+(?:&\w+=\w+)*)? match child?firstparam=abc123 or ?firstparam=abc123 or ?
\w+ match text like child
..|) match nothing(empty)
If you need only query string, pattern would reduce such as:
/(?:\w+/)*(?:\w+)?\?(\w+=\w+(?:&\w+=\w+)*)
If you want to get every parameter from query string, this is a Ruby sample:
re = /\/(?:\w+\/)*(?:\w+)?\?(\w+=\w+(?:&\w+=\w+)*)/
s = '/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789'
if m = s.match(re)
query_str = m[1] # now, you can 100% trust this string
query_str.scan(/(\w+)=(\w+)/) do |param,value| #grab parameter
printf("%s, %s\n", param, value)
end
end
output
secondparam, def456
firstparam, abc123
thirdparam, ghi789
This script will help you.
First, I check whether the line contains a ?.
Then I remove the first part of the line (everything up to and including the ?).
Next, I split the line on &, and split each piece on =.
use feature 'say'; # needed for say
my $r = q"/parent/child
/parent/child?
/parent/child?firstparam=abc123
/parent/child?secondparam=def456
/parent/child?firstparam=abc123&secondparam=def456
/parent/child?secondparam=def456&firstparam=abc123
/parent/child?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child?thirdparam=ghi789
/parent/child/
/parent/child/?
/parent/child/?firstparam=abc123
/parent/child/?secondparam=def456
/parent/child/?firstparam=abc123&secondparam=def456
/parent/child/?secondparam=def456&firstparam=abc123
/parent/child/?thirdparam=ghi789&secondparam=def456&firstparam=abc123
/parent/child/?secondparam=def456&firstparam=abc123&thirdparam=ghi789
/parent/child/?thirdparam=ghi789";
for my $string(split /\n/, $r){
if (index($string,'?')!=-1){
substr($string, 0, index($string,'?')+1,"");
#say "string = ".$string;
if (index($string,'=')!=-1){
my @params = map{$_ = [split /=/, $_];}split/\&/, $string;
$"="\n";
say "$_->[0] === $_->[1]" for (@params);
say "######next########";
}
else{
#print "there is no params!"
}
}
else{
#say "there is no params!";
}
}

Regular expression help in Perl

I have following text pattern
(2222) First Last (ab-cd/ABC1), <first.last#site.domain.com> 1224: efadsfadsfdsf
(3333) First Last (abcd/ABC12), <first.last#site.domain.com> 1234, 4657: efadsfadsfdsf
I want the number 1224 or 1234, 4657 from the above text after the text >.
I have this
\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\#\w*\.domain.com>\s\d+:
which will take the text before : But i want the one after email till :
Is there any easy regular expression to do this? or should I use split and do this
Thanks
Edit: The whole text is returned by a command line tool.
(3333) First Last (abcd/ABC12), <first.last#site.domain.com> 1234, 4657: efadsfadsfdsf
(3333) - Unique ID
First Last - First and last names
<first.last#site.domain.com> - Email address in format FirstName.LastName#sub.domain.com
1234, 4567 - database primary Keys
: xxxx - Headline
What I have to do is process the above, get the database IDs (in the example, 1234 and 4567 are two separate IDs), and query the tables
The above is the output (like this I will get many entries) from the tool which I am calling via my Perl script.
My idea was to use a regular expression to get the database id's. Guess I could use regular expression for this
You can fudge the stuff you don't care about to make the expression easier, say just 'glob' the parts between the parentheticals (and the email delimiters) using non-greedy quantifiers:
/(\d+)\).*?\(.*?\),\s*<.*?>\s*(\d+(?:,\s*\d+)*):/ (not tested!)
There are only two captured groups, the (1234) and the (1234, 4657), the second of which I can only assume from your pattern to mean: "a digit string, followed by zero or more comma separated digit strings".
Well, a simple fix is to just allow all the possible characters in a character class. Which is to say change \d to [\d, ] to allow digits, commas and space.
Your regex as it is, though, does not match the first sample line, because it has a dash - in it (ab-cd/ABC1 does not match \w*\/\w+\d*\). Also, it is not a good idea to rely too heavily on the * quantifier, because it does match the empty string (it matches zero or more times), and should only be used for things which are truly optional. Use + otherwise, which matches (1 or more times).
You have a rather strict regex, and with slight variations in your data like this, it will fail. Only you know what your data looks like, and if you actually do need a strict regex. However, if your data is somewhat consistent, you can use a loose regex simply based on the email part:
sub extract_nums {
my $string = shift;
if ($string =~ /<[^>]*> *([\d, ]+):/) {
return $1 =~ /\d+/g; # return the extracted digits in a list
# return $1; # just return the string as-is
} else { return } # bare return: empty list in list context, undef in scalar context
}
This assumes, of course, that you cannot have <> tags in front of the email part of the line. It will capture any digits, commas and spaces found between a <> tag and a colon, and then return a list of any digits found in the match. You can also just return the string, as shown in the commented line.
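For comparison, the same loose match sketched in Python. The function mirrors the Perl sub above but is not a line-for-line port, and the sample lines are the ones from the question.

```python
import re

def extract_nums(line):
    """Return the digit strings found between the <...> part and the colon."""
    match = re.search(r"<[^>]*> *([\d, ]+):", line)
    if not match:
        return []
    return re.findall(r"\d+", match.group(1))

line1 = "(2222) First Last (ab-cd/ABC1), <first.last#site.domain.com> 1224: efadsfadsfdsf"
line2 = "(3333) First Last (abcd/ABC12), <first.last#site.domain.com> 1234, 4657: efadsfadsfdsf"
print(extract_nums(line1))
print(extract_nums(line2))
```

Because the pattern ignores everything before the `<`, the dash in `ab-cd` on the first line is no longer a problem.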
There would appear to be something missing from your examples. Is this what they're supposed to look like, with email?
(1234) First Last (ab-cd/ABC1), <foo.bar#domain.com> 1224: efadsfadsfdsf
(1234) First Last (abcd/ABC12), <foo.bar#domain.com> 1234, 4657: efadsfadsfdsf
If so, this should work:
\((\d+)\)\s\w*\s\w*\s\(\w*\/\w+\d*\),\s<\w*\.\w*\#\w*\.domain\.com>\s\d+(?:,\s(\d+))?:
$string =~ /.*>\s*(.+):.+/;
$numbers = $1;
That's it.
Tested.
With number catching:
$string =~ /.*>\s*([0-9,\s]+):.+/;
$numbers = $1;
Not tested but you get the idea.

perl regex grouping overload

I am using the following perl regex lines
$myalbum =~ s/[-_'&’]/ /g;
$myalbum =~ s/[,’.]//g;
$myalbum =~ m/([A-Z0-9\$]+) +([A-Z0-9\$]+) +([A-Z0-9\$]+) +([A-Z0-9\$]+) +([A-Z0-9\$]+)/i;
to match the following strings
"30_Seconds_To_Mars_-_30_Seconds_To_Mars"
"30_Seconds_To_Mars_-_A_Beautiful_Lie"
"311_-_311"
"311_-_From_Chaos"
"311_-_Grassroots"
"311_-_Sound_System"
What I am experiencing is that for strings with less than 5 matching groups (ex. 311_-_311), attempting to print $1 $2 $3 prints nothing at all. Only strings with more than 5 matches will print.
How do I resolve this?
It looks like you just want the words in separate groups. To me, it seems like you're abusing regexes to do that when you could just run your substitutions and then split. Just do:
$myalbum =~ s/[-_'&’]/ /g;
$myalbum =~ s/[,’.]//g;
my @myalbum_list = split(/\s+/, $myalbum); # \s+ so that runs of spaces do not produce empty fields
#Print out whatever it is you want/ test length, etc...
print "$myalbum_list[0] $myalbum_list[1] $myalbum_list[2]";
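Here is the substitute-then-split idea as a small Python sketch, using the same two character classes as the question. The helper name `album_words` is made up; `str.split()` with no argument splits on runs of whitespace, so the three spaces left behind by `_-_` do not produce empty entries.

```python
import re

def album_words(name):
    """Apply the two substitutions from the question, then split into words."""
    name = re.sub(r"[-_'&’]", " ", name)   # separators become spaces
    name = re.sub(r"[,’.]", "", name)      # punctuation is dropped
    return name.split()                    # any run of whitespace is one delimiter

print(album_words("311_-_311"))
print(album_words("30_Seconds_To_Mars_-_A_Beautiful_Lie"))
```

The result has as many elements as there are words, so short names like the first one no longer fail.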
The + character means at least one match, which means your regex m/([A-Z0-9\$]+) +([A-Z0-9\$]+) + ... requires all those fields to be there for it to be considered a match. The reason you are not capturing anything is that it's not actually matching.
You are probably looking for the * character, which means zero or more rather than one or more like +.
I suppose your capturing groups are empty for "311 - 311" because this string doesn't match your regex.
How to resolve? Use * instead of + to permit empty sequences.
Edit: From your post I guess you want to extract the album name, i.e. the part before the minus sign.
Why not match against '(.*) - (.*)', with the first group being the album and the second the title? The problem is with strings like "Album with minus - sign - First track" or "My Album - Track is one - two - three". But even a human wouldn't know where the album ends and the track starts.
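A short Python sketch shows how that pattern resolves the ambiguous cases: the greedy first group means the split always happens at the last " - ". The helper name `split_album_track` is made up for the example.

```python
import re

def split_album_track(text):
    """Split on ' - ' using the suggested pattern; None if there is no separator."""
    match = re.fullmatch(r"(.*) - (.*)", text)
    return match.groups() if match else None

print(split_album_track("311 - 311"))
print(split_album_track("My Album - Track is one - two - three"))
print(split_album_track("NoSeparatorHere"))
```

Making the first group lazy, `(.*?)`, would split at the first " - " instead; neither choice is right for every title.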