awk : set start and end of match - regex

I have a LaTeX-like table like this (columns are delimited by &) :
foobar99 & 68
foobar4 & 43
foobar2 & 73
I want to get the index of the numbers at column 2 by using match.
In Vim, we can use \zs and \ze to set start and end of matching.
Thus, to match accurately number at colum 2, we can use ^.*&\s*\zs[[:digit:]]\+\ze\s*$.
How about awk? Is there an equivalent?
EDIT:
Matching for the first line:
foobar99 & 68
^^
123456789012345678
Expected output : 18.
EDIT2:
I am writing an awk script to deal with block delimited by line break (Hence, FS="\n" and RS=""). The MWE above is just one of these blocks.
A possible way to get the index of number at column 2 is to do something like that
split(line, cases, "&");
index = match(cases[2], /[[:digit:]]\+/);
but I am looking for a beautiful way to do this.
Apologies for the XY problem. But I'm still interested in start/end matching.

Too little context, so a simple guess: have you tried splitting the table into columns? With something like awk -F '\\s*&\\s*' you have your second column in $2.
In fact, you can use split() to retrieve the exact column of a string:
split(s, a[, fs ])
Split the string s into array elements a[1], a[2], ..., a[n], and
return n. All elements of the array shall be deleted before the split is
performed. The separation shall be done with the ERE fs or with the field
separator FS if fs is not given. Each array element shall have a
string value when created and, if appropriate, the array element
shall be considered a numeric string (see Expressions in awk). The
effect of a null string as the value of fs is unspecified.
So your second column is something like
split(s, a, /\s*&\s*/)
secondColumn = a[2]

By default, awk sees three columns in your data, and column 2 contains & only (and column 3 contains numbers). If you change the field delimiter to &, then you have two columns with trailing spaces in column 1 and leading spaces in column 2 (and some trailing spaces, as it happens; try copying the data from the question).
In awk, you could convert column 2 with leading spaces into a number by adding 0: $2 + 0 would force it to be treated as a number. If you use $2 in a numeric context, it'll be treated as a number. Conversely, you can force awk to treat a field as a string by concatenating with the empty string: $2 "" will be a string.
So there's no need for the complexity of regexes to get at the number — if the data is as simple as shown.
You say you want to use match; it is not clear what you need that for.
awk -F'&' '{ printf "F1 [%s], F2 [%10s] = [%d] = [%-6d] = [%06d]\n", $1, $2, $2, $2, $2 }' data
For your data, which has a single blank at the end of the first two lines and a double blank at the end of the third, the output is:
F1 [foobar99 ], F2 [ 68 ] = [68] = [68 ] = [000068]
F1 [foobar4 ], F2 [ 43 ] = [43] = [43 ] = [000043]
F1 [foobar2 ], F2 [ 73 ] = [73] = [73 ] = [000073]
Note that I didn't need to explicitly convert $2 to a number. The printf formats treated it as a string or a number depending on whether I used %s or %d.
If you need to, you can strip trailing blanks of $1 (or, indeed, $2), but without knowing what else you need to do, it's hard to demonstrate alternatives usefully.
So, I think awk does what you need without needing you to jump through much in the way of hoops. For a better explanation, you'd need to provide a better question, describing or showing what you want to do.

You can try this way
awk '{print index($0,$3)}' infile

Related

regex Match a capture group's items only once

So I'm trying to split a string in several options, but those options are allowed to occur only once. I've figured out how to make it match all options, but when an option occurs twice or more it matches every single option.
Example string: --split1 testsplit 1 --split2 test split 2 --split3 t e s t split 3 --split1 split1 again
Regex: /-{1,2}(split1|split2|split3) [\w|\s]+/g
Right now it is matching all cases and I want it to match --split1, --split2 and --split3 only once (so --split1 split1 again will not be matched).
I'm probably missing something really straight forward, but anyone care to help out? :)
Edit:
Decided to handle the extra occurances showing up in a script and not through RegEx, easier error handling. Thanks for the help!
EDIT: Somehow I ended up here from the PHP section, hence the PHP code. The same principles apply to any other language, however.
I realise that OP has said they have found a solution, but I am putting this here for future visitors.
function splitter(string $str, int $splits, $split = "--split")
{
$a = array();
for ($i = $splits; $i > 0; $i--) {
if (strpos($str, "$split{$i} ") !== false) {
$a[] = substr($str, strpos($str, "$split{$i} ") + strlen("$split{$i} "));
$str = substr($str, 0, strpos($str, "$split{$i} "));
}
}
return array_reverse($a);
}
This function will take the string to be split, as well as how many segments there will be. Use it like so:
$array = splitter($str, 3);
It will successfully explode the array around the $split parameter.
The parameters are used as follows:
$str
The string that you want to split. In your instance it is: --split1 testsplit 1 --split2 test split 2 --split3 t e s t split 3 --split1 split1 again.
$splits
This is how many elements of the array you wish to create. In your instance, there are 3 distinct splits.
If a split is not found, then it will be skipped. For instance, if you were to have --split1 and --split3 but no --split2 then the array will only be split twice.
$split
This is the string that will be the delimiter of the array. Note that it must be as specified in the question. This means that if you want to split using --myNewSplit then it will append that string with a number from 1 to $splits.
All elements end with a space since the function looks for $split and you have a space before each split. If you don't want to have the trailing whitespace then you can change the code to this:
$a[] = trim(substr($str, strpos($str, "$split{$i} ") + strlen("$split{$i} ")));
Also, notice that strpos looks for a space after the delimiter. Again, if you don't want the space then remove it from the string.
The reason I have used a function is that it will make it flexible for you in the future if you decide that you want to have four splits or change the delimiter.
Obviously, if you no longer want a numerically changing delimiter then the explode function exists for this purpose.
-{1,2}((split1)|(split2)|(split3)) [\w|\s]+
Something like this? This will, in this case, create 3 arrays which all will have an array of elements of the same name in them. Hope this helps

Find group of strings starting and ending by a character using regular expression

I have a string, and I want to extract, using regular expressions, groups of characters that are between the character : and the other character /.
typically, here is a string example I'm getting:
'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
and so, I want to retrieved, 45.72643,4.91203 and also hereanotherdata
As they are both between characters : and /.
I tried with this syntax in a easier string where there is only 1 time the pattern,
[tt]=regexp(str,':(\w.*)/','match')
tt = ':45.72643,4.91203/'
but it works only if the pattern happens once. If I use it in string containing multiples times the pattern, I get all the string between the first : and the last /.
How can I mention that the pattern will occur multiple time, and how can I retrieve it?
Use lookaround and a lazy quantifier:
regexp(str, '(?<=:).+?(?=/)', 'match')
Example (Matlab R2016b):
>> str = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
>> result = regexp(str, '(?<=:).+?(?=/)', 'match')
result =
1×2 cell array
'45.72643,4.91203' 'hereanotherdata'
In most languages this is hard to do with a single regexp. Ultimately you'll only ever get back the one string, and you want to get back multiple strings.
I've never used Matlab, so it may be possible in that language, but based on other languages, this is how I'd approach it...
I can't give you the exact code, but a search indicates that in Matlab there is a function called strsplit, example...
C = strsplit(data,':')
That should will break your original string up into an array of strings, using the ":" as the break point. You can then ignore the first array index (as it contains text before a ":"), loop the rest of the array and regexp to extract everything that comes before a "/".
So for instance...
'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh'
Breaks down into an array with parts...
1 - 'abcd'
2 - '45.72643,4.91203/Rou'
3 - 'hereanotherdata/defgh'
Then Ignore 1, and extract everything before the "/" in 2 and 3.
As John Mawer and Adriaan mentioned, strsplit is a good place to start with. You can use it for both ':' and '/', but then you will not be able to determine where each of them started. If you do it with strsplit twice, you can know where the ':' starts :
A='abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
B=cellfun(#(x) strsplit(x,'/'),strsplit(A,':'),'uniformoutput',0);
Now B has cells that start with ':', and has two cells in each cell that contain '/' also. You can extract it with checking where B has more than one cell, and take the first of each of them:
C=cellfun(#(x) x{1},B(cellfun('length',B)>1),'uniformoutput',0)
C =
1×2 cell array
'45.72643,4.91203' 'hereanotherdata'
Starting in 16b you can use extractBetween:
>> str = 'abcd:45.72643,4.91203/Rou:hereanotherdata/defgh';
>> result = extractBetween(str,':','/')
result =
2×1 cell array
{'45.72643,4.91203'}
{'hereanotherdata' }
If all your text elements have the same number of delimiters this can be vectorized too.

BASH: Split strings without any delimiter and keep only first sub-string

I have a CSV file containing 7 columns and I am interested in modifying only the first column. In fact, in some of the rows a row name appears n times in a concatenated way without any space. I need a script that can identify where the duplication starts and remove all duplications.
Example of a row name among others:
Row name = EXAMPLE1.ABC_DEF.panel4EXAMPLE1.ABC_DEF.panel4EXAMPLE1.ABC_DEF.panel4
Replace by: EXAMPLE1.ABC_DEF.panel4
In the different rows:
n can vary
The length of the row name can vary
The structure of the row name can vary (eg. amount of _ and .), but it is always collated without any space
What I have tried:
:%s/(.+)\1+/\1/
Step-by-step:
%s: substitute in the whole file
(.+)\1+: First capturing group. .+ matches any character (except for line terminators), + is the quantifier — matches between one and unlimited times, as many times as possible, giving back as needed.
\1+: matches the same text as most recently matched by the 1st capturing group
Substitute by \1
However, I get the following errors:
E65: Illegal back reference
E476: Invalid command
From what i understand you need only one line contain EXAMPLE1.ABC_DEF.panel4. In that case you can do the following:
First remove duplicates in one line:
sed -i "s/EXAMPLE1.ABC_DEF.panel4.*/EXAMPLE1.ABC_DEF.panel4/g"
Then remove duplicated lines:
awk '!a[$0]++'
If all your rows are of the format you gave in the question (like EXAMPLExyzEXAMPLExyz) then this should work-
awk -F"EXAMPLE" '{print FS $2}' file
This takes "EXAMPLE" as the field delimiter and asks it to print only the first 'column'. It prepends "EXAMPLE" to this first column (by calling the inbuilt awk variable FS). Thanks, #andlrc.
Not an ideal solution but may be good enough for this purpose.
This script, with first arg is the string to test, can retrieve the biggest duplicate substring (i.e. "totototo" done "toto", not "to")
#!/usr/bin/env bash
row_name="$1"
#test duplicate from the longest to the smallest, by how many we need to split the string ?
for (( i=2; i<${#row_name}; i++ ))
do
match="True"
#continue test only if it's mathematically possible
if (( ${#row_name} % i )); then
continue
fi
#length of the potential duplicate substring
len_sub=$(( ${#row_name} / i ))
#test if the first substring is equal to each others
for (( s=1; s<i; s++ ))
do
if ! [ "${row_name:0:${len_sub}}" = "${row_name:$((len_sub * s)):${len_sub}}" ]; then
match="False"
break
fi
done
#each substring are equal, so return string without duplicate
if [ $match = "True" ]; then
row_name="${row_name:0:${len_sub}}"
break
fi
done
echo "$row_name"

regex Pattern Matching over two lines - search and replace

I have a text document that i require help with. In the below example is an extract of a tab delimited text doc whereby the first line of the 3 line pattern will always be a number. The Doc will always be in this format with the same tabbed formula on each of the three lines.
nnnn **variable** V -------
* FROM CLIP NAME - **variable**
* LOC: variable variable **variable**
I want to replace the second field on the first line with the fourth field on the third line. And then replace the field after the colon on the second line with the original second field on the first line. Is this possible with regex? I am used to single line search replace function but not multiline patterns.
000003 A009C001_151210_R6XO V C 11:21:12:17 11:21:57:14 01:00:18:22 01:01:03:19
*FROM CLIP NAME: 5-1A
*LOC: 01:00:42:15 WHITE 005_NST_010_E02
000004 B008C001_151210_R55E V C 11:21:18:09 11:21:53:07 01:01:03:19 01:01:38:17
*FROM CLIP NAME: 5-1B
*LOC: 01:01:20:14 WHITE 005_NST_010_E03
The Result would look like :
000003 005_NST_010_E02 V C 11:21:12:17 11:21:57:14 01:00:18:22 01:01:03:19
*FROM CLIP NAME: A009C001_151210_R6XO
*LOC: 01:00:42:15 WHITE 005_NST_010_E02
000004 005_NST_010_E03 V C 11:21:18:09 11:21:53:07 01:01:03:19 01:01:38:17
*FROM CLIP NAME: B008C001_151210_R55E
*LOC: 01:01:20:14 WHITE 005_NST_010_E03
Many Thanks in advance.
A regular expression defines a regular language. Alone, this only expresses a structure of some input. Performing operations on this input requires some kind of processing tool. You didn't specify which tool you were using, so I get to pick.
Multiline sed
You wrote that you are "used to single line search replace function but not multiline patterns." Perhaps you are referring to substitution with sed. See How can I use sed to replace a multi-line string?. It is more complicated than with a single line, but it is possible.
An AWK script
AWK is known for its powerful one-liners, but you can also write scripts. Here is a script that identifies the beginning of a new record/pattern using a regular expression to match the first number. (I hesitate to call it a "record" because this has a specific meaning in AWK.) It stores the fields of the first two lines until it encounters the third line. At the third line, it has all the information needed to make the desired replacements. It then prints the modified first two lines and continues. The third line is printed unchanged (you specified no replacements for the third line). If there are additional lines before the start of the next record/pattern, they will also be printed unchanged.
It's unclear exactly where the tab characters are in your sample input because the submission system has replaced them with spaces. I am assuming there is a tab between FROM CLIP NAME: and the following field and that the "variables" on the first and third line are also tab-separated. If the first number of each record/pattern is hexadecimal instead of decimal, replace the [[:digit:]] with [[:xdigit:]].
fixit.awk
#!/usr/bin/awk -f
BEGIN { FS="\t"; n=0 }
{n++}
/^[[:digit:]]+\t/ { n=1 }
# Split and save first two lines
n==1 { line1_NF = split($0, line1, FS); next }
n==2 { line2_NF = split($0, line2, FS); next }
n==3 {
# At the third line, make replacements
line1_2 = line1[2]
line1[2] = $4
line2[2] = line1_2
# Print modified first two lines
printf "%s", line1[1]
for ( i=2; i<=line1_NF; ++i )
printf "\t%s", line1[i]
print ""
printf "%s", line2[1]
for ( i=2; i<=line2_NF; ++i )
printf "\t%s", line2[i]
print ""
}
1 # Print lines after the second unchanged
You can use it like
$ awk -f fixit.awk infile.txt
or to pipe it in
$ cat infile.txt | awk -f fixit.awk
This is not the most regular expression inspired solution, but it should make the replacements that you want. For a more complex structure of input, an ideal solution would be to write a scanner and parser that correctly interprets the full input language. Using tools like string substitution might work for simple specific cases, but there could be nuances and assumptions you've made that don't apply in general. A parser can also be more powerful and implement grammars that can express languages which can't be recognized with regular expressions.

Regex in Perl messed up by bracket

I am new to perl and having the following problem recently.
I have a string with format " $num1 $num2 $num3 $num4", that $num1, $num2, $num3, $num4 are real numbers can be a scientific number or in regular format.
Now I want to extract the 4 numbers from the string using regular expression.
$real_num = '\s*([+-]?[0-9]+\.?[0-9]*([eE][+-]?[0-9]+)?)'
while (<FP>) {
if (/$real_num$real_num$real_num$real_num/) {
print $1; print $2; print$3; print$4;
}
}
How can I get $num1, $num2, $num3, $num4 from $1, $2, $3, $4? As there is a necessary bracket in the $real_num regular expression so $1, $2, $3, $4 are not what I am expecting now.
Thanks for all warm replies, non-capturing group is the answer I need!
Just use non-capturing groups in your $real_num regex and make the regex itself a captured group:
$real_num = '\s*([+-]?[0-9]+\.?[0-9]*(?:[eE][+-]?[0-9]+)?)'
Now, the problem is: /$real_num$real_num$real_num$real_num/ will easily fail, if there are more than 4 numbers out there. May be this is not the case now. But, you should take care of that also. A split would be a better option.
If you are sure that your lines contains numbers, you can avoid that regexp, using split function:
while (<FP>) {
my #numbers = split /\s+/; #<-- an array with the parsed numbers
}
If you need tho check if the extracted strings are really numbers, use the Scalar::Util looks_like_number. Example:
use strict;
use warnings;
use Scalar::Util qw/looks_like_number/;
while(<DATA>) {
my #numbers = split /\s+/;
#numbers = map { looks_like_number($_) ? $_ : undef } #numbers;
say "#numbers";
}
__DATA__
1 2 NaN 4 -1.23
5 6 f 8 1.32e12
Prints:
1 2 NaN 4 -1.23
5 6 8 1.32e12
The answers to two important questions will affect whether you even need to use a regular expression to match the various number formats, or if you can do something much simpler:
Are you certain that your lines contain numbers only or do they also contain other data (or possibly some lines have no numbers at all and only other data)?
Are you certain that all numbers are separated from each other and/or other data by at least one space? If not, how are they separated? (For example, output from portsnap fetch generates lots of numbers like this 3690....3700.... with decimal points and no spaces at all used to separate them.
If your lines contain only numbers and no other data, and numbers are separated by spaces, then you do not even need to check if the results are numbers, but only split the line apart:
my #numbers = split /\s+/;
If you are not sure that your lines contain numbers, but you are sure that there is at least one space between each number and other numbers or other data, then the next line of code is a quite good way of extracting numbers properly with a clever way of allowing Perl itself to recognize all the many different legal formats of numbers. (This assumes that you do not want to convert other data values to NaN.) The result in #numbers will be proper recognition of all numbers within the current line of input.
my #numbers = grep { 1*$_ eq $_ } m/(\S*\d\S*)/g;
# we could do simply a split, but this is more efficient because when
# non-numeric data is present, it will only perform the number
# validation on data pieces that actually do contain at least one digit
You can determine if at least one number was present by checking the truth value of the expression #numbers > 1 and if exactly four were present by using the condition #numbers == 4, etc.
If your numbers are bumped up against each other, for instance, 5.17e+7-4.0e-1 then you will have a more difficult time. That is the only time you will need complicated regular expressions.
Note: Updated code to be even faster/better.
Note 2: There is a problem with the most up-voted answer due to a subtlety of how map works when storing the value of undef. This can be illustrated by the output from that program when using it to extract numbers from the first line of data such as an HTTP log file. The output looks correct, but the array actually has many empty elements and one would not find the first number stored in $numbers[0] as expected. In fact, this is the full output:
$ head -1 http | perl prog1.pl
Use of uninitialized value $numbers[0] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[1] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[2] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[3] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[4] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[5] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[6] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[7] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[10] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[11] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[12] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[13] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[14] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[15] in join or string at prog1.pl line 8, <> line 1.
Use of uninitialized value $numbers[16] in join or string at prog1.pl line 8, <> line 1.
200 2206
(Note that the indentation of these numbers shows how many empty array elements are present in #numbers and have been joined together by spaces before the actual numbers when the array has been converted to a string.)
However, my solution produces the proper results both visually and in the actual array contents, i.e., $numbers[0], $number[1], etc., are actually the first and second numbers contained in the line of the data file.
while (<>) {
my #numbers = m/(\S*\d\S*)/g;
#numbers = grep { $_ eq 1*$_ } #numbers;
print "#numbers\n";
}
$ head -1 http | perl prog2.pl
200 2206
Also, using the slow library function makes the other solution run 50% slower. Output was otherwise identical when running the programs on 10,000 lines of data.
My previous answer did not address the issue of non-space separated numbers. This requires a separate answer in my opinion, since the output can be drastically different from the same data.
my $number = '([-+]?(?:\d+\.\d+|\.\d+|\d+)(?:[Ee][-+]\d+)?)';
my $type = shift;
if ($type eq 'all') {
while (<>) {
my #all_numbers = m/$number/g;
# finds legal numbers whether space separated or not
# this can be great, but it also means the string
# 120.120.120.120 (an IP address) will return
# 120.120, .120, and .120
print "#all_numbers\n";
}
} else {
while (<>) {
my #ss_numbers = grep { m/^$number$/ } split /\s+/;
# finds only space separated numbers
print "#ss_numbers\n";
}
}
Usage:
$ prog-jkm2.pl all < input # prints all numbers
$ prog-jkm2.pl < input # prints just space-separated numbers
The only code that the OP probably needs:
my $number = '(-?(?:\d+\.\d+|\.\d+|\d+)(?:[Ee][-+]\d+)?)';
my #numbers = grep { m/^$number$/ } split /\s+/;
At this point, $numbers[0] will be the first number, $numbers[1] is the second number, etc.
Examples of output:
$ head -1 http | perl prog-jkm2.pl
200 2206
$ head -1 http | perl prog-jkm2.pl all
67.195 .114 .38 19 2011 01 20 31 -0400 1 1 1.0 200 2206 5.0