Cut and copy-paste given positions of the text - regex

My dummy text file (one continuous line) looks like this:
AAChvhkfiAFAjjfkqAPPMB
I want to:
Delete part of the text (specific range);
Copy-Paste (specific range of characters) within the file.
How I am doing this:
To cut parts of the text at the wanted positions (characters 5 to 7 and 10 to 14) I use cut:
echo 'AAChvhkfiAFAjjfkqAPPMB' | cut --complement -c 5-7,10-14
AAChfifkqAPPMB
But I really don't know how to copy-paste text. For example: to copy the text from characters 15 to 18 and paste it at the start of the line (while also applying the previous cut command), to get this final result:
fkqAAAChfifkqAPPMB
So I have two questions:
How to read a given range of text (from .. to) using perl, awk or sed, and paste it at a specific position.
How to combine this pasting with the previous cut command, since after cutting, the text shifts to the left and the wrong range would be copied.

Maybe something like this:
$ echo AAChvhkfiAFAjjfkqAPPMB | awk '{ print(substr($1, 1, 14) substr($1, 18) substr($1, 15, 3)) }'
AAChvhkfiAFAjjAPPMBfkq

In Perl I think substr would be a good candidate; try e.g.:
$a = '1234567890';
#from pos 2, replace 3 chars with nothing, return the 3 chars
$b=substr($a,2,3,'');
print "$a\t$b\n"; #1267890 345
#in position 0 (first), replace 0 characters (i.e. a pure insert)
#with the content of $b
substr($a,0,0,$b);
print "$a\t$b\n"; #3451267890 345
See http://perldoc.perl.org/functions/substr.html for more details.
splice() may be a candidate as well.
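Applied to the string from the question, something like this should work (a sketch; the copy is taken before anything is deleted, and the right-hand range is deleted first, so the 1-based positions from the question stay valid):
perl -e '
$s = "AAChvhkfiAFAjjfkqAPPMB";
$copy = substr($s, 14, 4); # chars 15-18: "fkqA" (substr offsets are 0-based)
substr($s, 9, 5, ""); # delete chars 10-14 first
substr($s, 4, 3, ""); # then chars 5-7
substr($s, 0, 0, $copy); # pure insert at the front
print "$s\n"; # fkqAAAChfifkqAPPMB
'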

In Perl, you can use an array slice, after splitting the string into an array:
my $string = "AAChvhkfiAFAjjfkqAPPMB1";
my @arr = split //, $string;
and slicing (print elements 5 to 7 and 10 to 14):
print @arr[5..7,10..14];
You can use splice() too, to re-arrange the array.
perldoc says:
Removes the elements designated by OFFSET and LENGTH from an array, and replaces them with the elements of LIST, if any.
See http://perldoc.perl.org/perldata.html#Slices
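For instance, a small splice sketch (0-based offsets, using the same $string as above) that moves characters 15-18 to the front:
my @arr = split //, $string;
unshift @arr, splice(@arr, 14, 4); # cut 4 elements at offset 14 (chars 15-18), put them up front
print @arr; # fkqAAAChvhkfiAFAjjPPMB1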

quite straightforward with awk:
kent$ echo "AAChvhkfiAFAjjfkqAPPMB"|awk '
{for(i=5;i<=7;i++)$i="";
for(i=10;i<=14;i++)$i="";
for(i=15;i<=18;i++)t=sprintf("%s%s",t,$i);
$0=t""$0}1' OFS="" FS=""
fkqAAAChfifkqAPPMB
Edit:
To reverse the copied part of the text, you just need to swap t and $i:
kent$ echo "AAChvhkfiAFAjjfkqAPPMB"|awk '
{for(i=5;i<=7;i++)$i="";
for(i=10;i<=14;i++)$i="";
for(i=15;i<=18;i++)t=sprintf("%s%s",$i,t);
$0=t""$0}1' OFS="" FS=""
AqkfAAChfifkqAPPMB

Related

Matching the last K occurrences of a pattern in a line

Is it possible using sed/awk to match the last k occurrences of a pattern in a line?
For simplicity's sake, say I just want to match the last 3 commas in each line, for example (note that the two lines have a different number of total commas):
10, 5, "Sally went to the store, and then , 299, ABD, F, 10
10, 6, If this is the case, and also this happened, then, 299, A, F, 9
I want to match only the commas starting from 299 until the end of the line in both cases.
Motivation: I'm trying to convert a CSV file with stray commas inside one of the fields to tab-delimited. Since the number of proper columns is fixed, my thinking was to replace the first couple commas with tabs up until the troublesome field (which is straightforward), and then go backwards from the end of the line to replace again. This should convert all proper delimiter commas to tabs, while leaving commas intact in the problematic field.
There's probably a smarter way to do this, but I figured this would be a good sed/awk teaching point anyways.
Another sed alternative: replace the last 3 commas with tabs:
$ rev file | sed 's/,/\t/;s/,/\t/;s/,/\t/' | rev
10, 5, "Sally went to the store, and then , 299 ABD F 10
With GNU sed, you can simply write:
$ sed 's/,/\t/g5' file
10, 5, "Sally went to the store, and then , 299 ABD F 10
This replaces all occurrences starting from the 5th. Note that it counts commas from the left, so it only hits exactly the last three when the line contains seven commas in total; the rev trick above counts from the right.
You can use Perl to add the missing double quote into each line:
perl -aF, -ne '$F[-5] .= q("); print join ",", @F' < input > output
or, to turn the commas into tabs:
perl -aF'/,\s/' -ne 'splice @F, 2, -4, join ", ", @F[ 2 .. $#F - 4 ]; print join "\t", @F' < input > output
-n reads the input line by line.
-a splits the input into the @F array on the pattern specified by -F.
The first solution adds the missing quote to the fifth field from the right; the second one replaces the items from the third to the fifth from right with those elements joined by ", ", and separates the resulting array with tabs.
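For example, the first one-liner applied to the sample line:
$ echo '10, 5, "Sally went to the store, and then , 299, ABD, F, 10' |
  perl -aF, -ne '$F[-5] .= q("); print join ",", @F'
10, 5, "Sally went to the store, and then ", 299, ABD, F, 10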
To fix the CSV, I would do this:
echo '10, 5, "Sally went to the store, and then , 299, ABD, F, 10' |
perl -lne '
@F = split /, /; # field separator is comma and space
@start = splice @F, 0, 2; # first 2 fields
@end = splice @F, -4, 4; # last 4 fields
$string = join ", ", @F; # the stuff in the middle
$string =~ s/"/""/g; # any double quotes get doubled
print join(",", @start, "\"$string\"", @end);
'
outputs
10,5,"""Sally went to the store, and then ",299,ABD,F,10
One regex that matches each of the three last commas separately would require a negative lookahead, which sed does not support.
You can use the following sed-regex to match the last three fields and the commas directly before them all at once:
,[^,]*,[^,]*,[^,]*$
$ matches the end of the line.
[^,] matches anything but ,.
Groups allow you to re-use the field values in sed:
sed -r 's/,([^,]*),([^,]*),([^,]*)$/\t\1\t\2\t\3/'
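For example (tabs shown here as spaces; the captured fields keep their leading space, since [^,]* matches it too):
$ echo '10, 5, "Sally went to the store, and then , 299, ABD, F, 10' |
  sed -r 's/,([^,]*),([^,]*),([^,]*)$/\t\1\t\2\t\3/'
10, 5, "Sally went to the store, and then , 299   ABD   F   10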
For awk, have a look at How to print last two columns using awk.
There's probably a smarter way to do this
In case all your wanted commas are followed by a space and the unwanted commas are not, how about
sed 's/,\([^ ]\)/.\1/g'
This transforms a, b, 12,3, c into a, b, 12.3, c.
I guess this does the job:
echo 'a,b,c,d,e,f' | awk -F',' '{i=3; for (--i;i>=0;i--) {printf "%s\t", $(NF-i) } print ""}'
Returns
d e f
But you need to ensure you have more than 3 fields.
This will do what you're asking for with GNU awk for the 3rd arg to match():
$ cat tst.awk
{
gsub(/\t/," ")
match($0,/^(([^,]+,){2})(.*)((,[^,]+){3})$/,a)
gsub(/,/,"\t",a[1])
gsub(/,/,"\t",a[4])
print a[1] a[3] a[4]
}
$ awk -f tst.awk file
10 5 "Sally went to the store, and then , 299 ABD F 10
10 6 If this is the case, and also this happened, then, 299 A F 9
but I'm not convinced what you're asking for is a good approach so YMMV.
Anyway, note the first gsub(), which makes sure there are no tabs on the input line - that is crucial if you want to convert some commas to tabs and then use tabs as output field separators!

Using awk to find a domain name containing the longest repeated word

For example, let's say there is a file called domains.csv with the following:
1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org
I'm trying to use Linux awk regex expressions to find the line that contains the longest repeated[1] word, so in this case it will return the line
5,letswelcomewelcomeyou.org
How do I do that?
[1] Meaning "immediately repeated", i.e., abcabc, but not abcXabc.
A pure awk implementation would be rather long-winded as awk regexes don't have backreferences, the usage of which simplifies the approach quite a bit.
I've added one line to the example input file for the case of multiple longest words:
1,helloguys.ca
2,byegirls.com
3,hellohelloboys.ca
4,hellobyebyedad.com
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org
And this gets the lines with the longest repeated sequence:
cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{ print length(), $0 }' | sort -k 1,1 -nr |
awk 'NR==1 {prev=$1;print $2;next} $1==prev {print $2;next} {exit}' | grep -f - infile
Since this is pretty anti-obvious, let's split up what this does and look at the output at each stage:
Remove the first column with the line number, to avoid matches on line numbers with repeating digits:
$ cut -d ',' -f 2 infile
helloguys.ca
byegirls.com
hellohelloboys.ca
hellobyebyedad.com
letswelcomewelcomeyou.org
letscomewelcomewelyou.org
Get all lines with a repeated sequence, extract just that repeated sequence:
... | grep -Eo '(.*)\1'
ll
hellohello
ll
byebye
welcomewelcome
comewelcomewel
Get the length of each of those lines:
... | awk '{ print length(), $0 }'
2 ll
10 hellohello
2 ll
6 byebye
14 welcomewelcome
14 comewelcomewel
Sort by the first column, numerically, descending:
...| sort -k 1,1 -nr
14 welcomewelcome
14 comewelcomewel
10 hellohello
6 byebye
2 ll
2 ll
Print the second of these columns for all lines where the first column (the length) has the same value as on the first line:
... | awk 'NR==1{prev=$1;print $2;next} $1==prev{print $2;next} {exit}'
welcomewelcome
comewelcomewel
Pipe this into grep, using the -f - argument to read stdin as a file:
... | grep -f - infile
5,letswelcomewelcomeyou.org
6,letscomewelcomewelyou.org
Limitations
While this can handle the bbwelcomewelcome case mentioned in comments, it will trip on overlapping patterns such as welwelcomewelcome, where it only finds welwel, but not welcomewelcome.
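That limitation is easy to reproduce:
$ echo 'welwelcomewelcome' | grep -Eo '(.*)\1'
welwel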
Alternative solution with more awk, less sort
As pointed out by tripleee in comments, this can be simplified by combining the two awk steps and the sort step into a single awk step, likely improving performance:
$ cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{if (length()>ml) {ml=length(); delete a; i=1} if (length()>=ml){a[i++]=$0}}
END{for (i in a){print a[i]}}' |
grep -f - infile
Let's look at that awk step in more detail, with expanded variable names for clarity:
{
# New longest match: throw away stored longest matches, reset index
if (length() > max_len) {
max_len = length()
delete arr_longest
idx = 1
}
# Add line to longest matches
if (length() >= max_len)
arr_longest[idx++] = $0
}
# Print all the longest matches
END {
for (idx in arr_longest)
print arr_longest[idx]
}
Benchmarking
I've timed the two solutions on the top one million domains file mentioned in the comments:
First solution (with sort and two awk steps):
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 1m55.742s
user 1m57.873s
sys 0m0.045s
Second solution (just one awk step, no sort):
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 1m55.603s
user 1m56.514s
sys 0m0.045s
And the Perl solution by Casimir et Hippolyte:
964438,abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com
real 0m5.249s
user 0m5.234s
sys 0m0.000s
What we learn from this: ask for a Perl solution next time ;)
Interestingly, if we know that there will be just one longest match and simplify the commands accordingly (just head -1 instead of the second awk command in the first solution, or no tracking of multiple longest matches in the second solution), the time gained is only in the range of a few seconds.
Portability remark
Apparently, BSD grep can't do grep -f - to read from stdin. In this case, the output of the pipeline up to that point has to be redirected to a temp file, and that temp file then used with grep -f.
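For example, using the single-awk variant from above (patterns.tmp is an arbitrary scratch file name):
cut -d ',' -f 2 infile | grep -Eo '(.*)\1' |
awk '{if (length()>ml) {ml=length(); delete a; i=1} if (length()>=ml){a[i++]=$0}}
END{for (i in a){print a[i]}}' > patterns.tmp
grep -f patterns.tmp infile
rm patterns.tmp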
A way with perl:
perl -F, -ane 'if (@m = $F[1] =~ /(?=(.+)\1)/g) {
@m = sort { length $b <=> length $a } @m;
$cl = length $m[0];
if ($l < $cl) { @res = ($_); $l = $cl; } elsif ($l == $cl) { push @res, ($_); }
}
END { print @res; }' file
The idea is to find all longest overlapping repeated strings for each position in the second field, then the match array is sorted and the longest substring becomes the first item in the array ($m[0]).
Once done, the length of the current repeated substring ($cl) is compared with the stored length (of the previous longest substring). When the current repeated substring is longer than the stored length, the result array is overwritten with the current line, when the lengths are the same, the current line is pushed into the result array.
Details:
Command line options:
-F, sets the field separator to ,
-ane (-e executes the following code; -n reads a line at a time and puts its content into $_; -a autosplits, using the defined FS, and puts the fields into the @F array)
The pattern:
/
(?= # open a lookahead assertion
(.+)\1 # capture group 1 and backreference to the group 1
) # close the lookahead
/g # all occurrences
This is a well-known pattern to find all overlapping results in a string. The idea is to use the fact that a lookahead doesn't consume characters (a lookahead only means "check if this subpattern follows at the current position", but it doesn't match any character). To obtain the characters matched in the lookahead, all you need is a capture group.
Since a lookahead matches nothing, the pattern is tested at each position (and doesn't care if the characters have been already captured in group 1 before).
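A quick demonstration of the overlap: a plain /ana/g finds only one match in "banana" (the second ana overlaps the first), while the lookahead version finds both:
$ perl -le 'print join " ", "banana" =~ /(?=(ana))/g'
ana ana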

Regex for Replacing String with Incrementing Number

Once upon a time I used UltraEdit for making this
text 1
text 2
text 3
text 4
...
from that
text REPLACE
text REPLACE
text REPLACE
text REPLACE
...
it was as easy as replacing REPLACE with \i
Now, how can I do this with e.g. sed?
If you provide a solution, could you please add directions for filling the result with leading zeros?
thanks
You can use awk instead of sed for this:
awk '{ sub(/REPLACE/, ++i) } 1' file
text 1
text 2
text 3
text 4
...
Is this what you want?
$ awk '{$NF=sprintf("%05d",++i)} 1' file
text 00001
text 00002
text 00003
text 00004
If not, edit your question to show some more truly representative sample input and expected output (and get rid of the ...s if they don't literally exist in your input and output as they make your example non-testable).
perl -i -pe 's/REPLACE/++$i/ge' file
For zero-padding to the minimum width (i.e. if there are 10 replacements, use field width 2):
perl -i -pe '
BEGIN {
$patt = "REPLACE";
# -c counts the matching lines (overriding -o), giving the match count here
chomp( $n = qx(grep -co "$patt" file) );
$n = int( log($n)/log(10) ) + 1; # number of digits in that count
}
s/$patt/ sprintf("%0*d", $n, ++$i) /ge
' file
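A quick check of the width computation, with the one-liner squeezed onto a single line and twelve occurrences (so the width becomes 2):
$ yes 'text REPLACE' | head -n 12 > file
$ perl -i -pe 'BEGIN{$patt="REPLACE";chomp($n=qx(grep -co "$patt" file));$n=int(log($n)/log(10))+1}s/$patt/sprintf("%0*d",$n,++$i)/ge' file
$ sed -n '1p;$p' file
text 01
text 12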

Awk 3 Spaces + 1 space or hyphen

I have a rather large chart to parse. Each column is separated by either 4 spaces or by 3 spaces and a hyphen (since the numbers in the chart can be negative).
cat DATA.txt | awk "{ print match($0,/\s\s/) }"
does nothing but print a slew of 0's. I'm trying to understand AWK and when to escape, etc, but I'm not getting the hang of it. Help is appreciated.
One line:
1979 1 -0.176 -0.185 -0.412 0.069 -0.129 0.297 -2.132 -0.334 -0.019
1979 1 -0.176 0.185 -0.412 0.069 -0.129 0.297 -2.132 -0.334 -0.019
I would like to get just, say, the second data column. I duplicated the line with one sign flipped, so I'd like to see -0.185 and 0.185.
You need to start by thinking about bash quoting, since it is bash which interprets the argument to awk which will be the awk program. Inside double-quoted strings, bash expands $0 to the name of the bash executable (or current script); that's almost certainly not what you want, since it will not be a quoted string. In fact, you almost never want to use double quotes around the awk program argument, so you should get into the habit of writing awk '...'.
Also, awk regular expressions don't understand \s (although Gnu awk will handle that as an extension). And match returns the position of the match, which I don't think you care about either.
Since by default, awk considers any sequence of whitespace a field separator, you don't really need to play any games to get the fourth column. Just use awk '{print $4}'
Why not just use this simple awk
awk '$0=$4' Data.txt
-0.185
0.185
It sets $0 to the value of $4 and does the default action, print.
PS: do not use cat with a program that can read data itself, like awk.
In case field 4 contains 0, you can make it more robust like this:
awk '{$0=$4}1' Data.txt
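The difference matters because a field holding the string "0" is treated as the number 0, i.e. false, when the assignment is used as a pattern:
$ echo 'a b c 0' | awk '$0=$4' # prints nothing: the assignment evaluates to 0, which is false
$ echo 'a b c 0' | awk '{$0=$4}1' # prints the 0
0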
If you're trying to split the input according to 3 or 4 spaces then you will get the expected output only from column 3.
$ awk -v FS=" {3,4}" '{print $3}' file
-0.185
0.185
FS=" {3,4}" here we pass a regex as FS value. This regex get parsed and set the Field Separator value to three or four spaces. In regex {min,max} called range quantifier which repeats the previous token from min to max times.

sed join lines together

What would be the sed (or other tool) command to join lines together in a file when they do not end with the character '0'?
I'll have lines like this
412|n|Leader Building Material||||||||||d|d|20||0
which need to be left alone, and then I'll have lines like this for example (which is 3 lines, but only one ends w/ 0)
107|n|Knot Tying Tools|||||Knot Tying Tools
|||||d|d|0||0
which need to be joined/combined into one line
107|n|Knot Tying Tools|||||Knot Tying Tools|||||d|d|0||0
sed ':a;/0$/{N;s/\n//;ba}'
In a loop (branch ba to label :a), if the current line ends in 0 (/0$/) append next line (N) and remove inner newline (s/\n//).
awk:
awk '{ while (!/0$/ && (getline a) > 0) $0 = $0 a; print }'
Perl:
perl -pe '$_ .= <>, s/\n// while !/0$/'
bash:
while read -r line; do
    if [ "${line: -1:1}" != "0" ] ; then
        echo -n "$line"
    else
        echo "$line"
    fi
done
awk could be short too:
awk '!/0$/{printf "%s",$0}/0$/'
test:
kent$ cat t
#aasdfasdf
#asbbb0
#asf
#asdf0
#xxxxxx
#bar
kent$ awk '!/0$/{printf "%s",$0}/0$/' t
#aasdfasdf#asbbb0
#asf#asdf0
#xxxxxx#bar
The rating of this answer is surprising, given the OP's specification: join lines that do not end with the character '0'.
The submission's last comment ("if that's the case check what @ninjalj submitted") also suggests checking that same answer.
So check it verbatim, i.e. run
sed ':a;/0$/{N;s/\n//;ba}'
on this test input:
does
no one
ie. 0
people,
try
nothing,
ie. 0
things,
any more,
ie. 0
tests?
(^D aka eot 004 ctrl-D ␄ ... bash generate via: echo ^V^D)
It will not give the following (do the test!):
does no one ie. 0
people, try nothing, ie. 0
things, any more, ie. 0
tests? (^D aka eot 004 ctrl-D ␄ ... bash generate via: echo ^V^D)
To get this use:
sed 'H;${z;x;s/\n//g;p;};/0$/!d;z;x;s/\n//g;'
or:
sed ':a;/0$/!{N;s/\n//;ba}'
not:
sed ':a;/0$/{N;s/\n//;ba}'
Notes:
sed 'H;${x;s/\n//g;p;};/0$/!d;z;x;s/\n//g;'
does not use branching and
is identical to:
sed '${H;z;x;s/\n//g;p;};/0$/!{H;d;};/0$/{H;z;x;s/\n//g;}'
H commences all the command sequences.
d short-circuits further script execution on the current line and starts the next cycle, so the only lines that can reach an address selector placed after /0$/!{H;d;} are lines ending in 0; the /0$/ address selector on the last block is therefore redundant.
If a line does not end with 0, save it in hold space:
/0$/!{H;d;}
If a line does end with 0, save it too, then flush and print the joined record:
/0$/{H;z;x;s/\n//g;}
NB: the end-of-input block ${H;z;x;s/\n//g;p;} uses the same commands as the /0$/ case, with an extra p to coerce the final print and a now unnecessary z (which empties and resets the pattern space, like s/.*//).
A typically cryptic Perl one-liner:
perl -pe 'BEGIN{ $/ = "0\n" } chomp; s/\n//g; $_ .= $/'
This uses the sequence "0\n" as the record separator (by your question, I'm assuming that every complete record should end with a zero). chomp strips that separator, any internal newlines in the record are then removed, and the "0\n" that was stripped is appended again when the line is printed.
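For example, with the multi-line record from the question:
$ printf '107|n|Knot Tying Tools|||||Knot Tying Tools\n|||||d|d|0||0\n' |
  perl -pe 'BEGIN{ $/ = "0\n" } chomp; s/\n//g; $_ .= $/'
107|n|Knot Tying Tools|||||Knot Tying Tools|||||d|d|0||0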
Another take to your question would be to ensure each line has 17 pipe-separated fields. This does not assume that the 17th field value must be zero.
awk -F \| '
NF == 17 {print; next}
prev {print prev $0; prev = ""; next}
{prev = $0}
'
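For example, on the two records from the question, saved as file:
$ awk -F \| 'NF == 17 {print; next} prev {print prev $0; prev = ""; next} {prev = $0}' file
412|n|Leader Building Material||||||||||d|d|20||0
107|n|Knot Tying Tools|||||Knot Tying Tools|||||d|d|0||0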
If a line does not end with 0, append the next line and remove the newline in between (note this joins at most one continuation line per record):
sed '/0$/!N;s/\n//'