Manipulate data in Awk - regex

I am new to Awk programming. I have a question about manipulating a text file that is needed to draw certain network-based images in a visualization tool (Circos, http://circos.ca).
I have input data in which I want to manipulate values using awk/grep/sed.
There are 9 pairs (18 lines). 5 pairs (the first 10 lines) are for "from=ABCB11", and 4 pairs (the next 8 lines) are for "from=ABCC8". What I want is to extract the values from the first line of the first pair and use them to replace the values in the first line of every other pair with the same "from" name.
So the value for group-2 is 9 10, which should replace every later occurrence of the value in group-2.
The next value for group-2 is 28 29, which should be replaced by 9 10.
The stop should be determined by "from=name", which here is "from=ABCB11". The rows from which the value is captured and in which it is replaced do not necessarily belong to group-2, as they do in this instance. It could be group-3 or group-4, up to group-10. So the second set ("from=ABCC8") could have belonged to group-4/5/6, not necessarily group-2. It is just a coincidence here.
group-2 9 10 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-3 0 1 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-2 28 29 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM1,toid=114,use=1,z=1
group-5 0 1 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM1,toid=114,use=1,z=1
group-2 29 30 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM2,toid=115,use=1,z=1
group-5 1 2 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM2,toid=115,use=1,z=1
group-2 10 11 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=DRD2,toid=158,use=1,z=1
group-3 1 2 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=DRD2,toid=158,use=1,z=1
group-2 11 12 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=EGF,toid=164,use=1,z=1
group-3 2 3 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=EGF,toid=164,use=1,z=1
group-2 21 22 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-3 12 13 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-2 0 1 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1A,toid=21,use=1,z=1
group-1 0 1 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1A,toid=21,use=1,z=1
group-2 1 2 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1B,toid=22,use=1,z=1
group-1 1 2 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1B,toid=22,use=1,z=1
group-2 2 3 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1D,toid=23,use=1,z=1
group-1 2 3 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1D,toid=23,use=1,z=1
Below is the final output I am looking for:
group-2 9 10 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-3 0 1 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-2 9 10 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM1,toid=114,use=1,z=1
group-5 0 1 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM1,toid=114,use=1,z=1
group-2 9 10 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM2,toid=115,use=1,z=1
group-5 1 2 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM2,toid=115,use=1,z=1
group-2 9 10 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=DRD2,toid=158,use=1,z=1
group-3 1 2 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=DRD2,toid=158,use=1,z=1
group-2 9 10 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=EGF,toid=164,use=1,z=1
group-3 2 3 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=EGF,toid=164,use=1,z=1
group-2 21 22 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-3 12 13 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-2 21 22 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1A,toid=21,use=1,z=1
group-1 0 1 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1A,toid=21,use=1,z=1
group-2 21 22 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1B,toid=22,use=1,z=1
group-1 1 2 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1B,toid=22,use=1,z=1
group-2 21 22 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1D,toid=23,use=1,z=1
group-1 2 3 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1D,toid=23,use=1,z=1
Also, this is just sample data. Many pairs would have group-1, group-4, group-5, up to group-10. Here, only pairs from the lower-numbered groups are shown.
I want to loop through the lines for as long as the value in "from=name" stays the same, so that I can change the values in each alternate line. Code:
awk -F, 'NR%2==1 {split($2,a,"="); print a[2]}' file.txt
The above code extracts the alternate lines and the "name" in "from=name".

The following is quite verbose (I love verbose variable names). Using your sample data, I get the output you want. This assumes that every odd-numbered line within a block gets its values from the first line with the same "from=xxxx" information.
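awk '
BEGIN {
    namevar = ""    # current "from=xxxx" entry
    val1var = ""    # first value of the current block
    val2var = ""    # second value of the current block
    linenum = 0     # line counter within the current block
}
{
    # Split the line on whitespace, then split the 5th field on commas.
    split($0, linearr)
    split(linearr[5], csvarr, ",")
    # A new "from=xxxx" entry starts a new block: remember its values.
    if (namevar != csvarr[2]) {
        namevar = csvarr[2]
        val1var = linearr[2]
        val2var = linearr[3]
        linenum = 0
    }
    linenum += 1
    if (linenum % 2 == 1) {
        # Odd line of the block: print the remembered values.
        print linearr[1], val1var, val2var, linearr[4], linearr[5]
    } else {
        # Even line of the block: print it unchanged.
        print linearr[1], linearr[2], linearr[3], linearr[4], linearr[5]
    }
}' file.txt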
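For comparison, here is a minimal Python sketch of the same block logic. It is not part of the original answer. The function name propagate_first_values and the shortened sample rows are made up for illustration, and it assumes, like the awk script, that the "from=..." entry is always the second comma-separated item of the fifth field.

```python
def propagate_first_values(lines):
    """Copy the two values of each block's first line onto every odd line of the block."""
    result = []
    current_from = None
    first_values = ("", "")
    position = 0  # line counter within the current "from=..." block
    for line in lines:
        # Fields: group, value 1, value 2, label, comma-separated attributes.
        group, val1, val2, label, attrs = line.split(None, 4)
        from_entry = attrs.split(",")[1]  # e.g. "from=ABCB11" (assumed position)
        if from_entry != current_from:
            # A new "from=..." entry starts a new block: remember its values.
            current_from = from_entry
            first_values = (val1, val2)
            position = 0
        position += 1
        if position % 2 == 1:
            val1, val2 = first_values
        result.append(" ".join([group, val1, val2, label, attrs]))
    return result


# Shortened, hypothetical rows in the same layout as the sample data.
sample = [
    "group-2 9 10 text color=black,from=ABCB11,to=ACE",
    "group-3 0 1 text color=black,from=ABCB11,to=ACE",
    "group-2 28 29 text color=black,from=ABCB11,to=CHRM1",
    "group-5 0 1 text color=black,from=ABCB11,to=CHRM1",
    "group-2 21 22 text color=black,from=ABCC8,to=ACE",
    "group-3 12 13 text color=black,from=ABCC8,to=ACE",
]
for row in propagate_first_values(sample):
    print(row)
```

In this sketch the third row comes out with 9 10 instead of 28 29, while the even rows and the first row of the ABCC8 block are left as they were.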

Related

find specific number combination within from and to text

I want to match only a 3-digit number under 600 (in the example below, "598") when that number appears in the string between a start word and an end word. With the regular expression below I get a match of everything. Can anyone help?
Regular expression: (?<=Start)(.*)(?=End).
Test string:
Start 440 3 956 4 603 5 - 6 603 7 440 8 - 9 440 10 956 11 440 12 603 13 2005
14 440 15 598 16 1156 17 946 18 761 19 761 20 946 21 598 22 598
23 1156 24 2057 25 946 26 1194 27 946 28 946 - - - Zurich 2019 M T W T F S S - - - - 1 - 2 1058 3 542 4 852 5 - 6 1517 7 1058 8 - 9 1058 10 848 11 542 12 705 13 1306 14 1058 15 1258 16 2159 17 1617 18 700 19 863 20 700 21 1258 22 1911 23 1911 24 1617 25 1258 26 2759 27 1258 28 1258 - - - End
With \b[0-5]\d{2}\b you find all 3-digit numbers under 600.
Demo: https://regex101.com/r/0ZSbbY/2
Try this pattern:
(?<=^|\D)[1-5]?\d{2}(?!.+Start)(?=\D.+End)
(?<=^|\D)[1-5]?\d{1,2} this will match all 1- or 2-digit numbers, as they are less than 600. It also finds 1**, 2**, 3**, 4**, 5** numbers.
(?!.+Start)(?=\D.+End) these lookaheads ensure that we are before the End word and not before the Start word, i.e. between them. It couldn't be done with a positive lookbehind, as @TimBiegeleisen stated, because it would have variable length.
Demo
#!/usr/bin/perl
use Modern::Perl;
use Data::Dumper;
my $str = 'Start 440 3 956 4 603 5 - 6 603 7 440 8 - 9 440 10 956 11 440 12 603 13 2005 14 440 15 598 16 1156 17 946 18 761 19 761 20 946 21 598 22 598 23 1156 24 2057 25 946 26 1194 27 946 28 946 - - - Zurich 2019 M T W T F S S - - - - 1 - 2 1058 3 542 4 852 5 - 6 1517 7 1058 8 - 9 1058 10 848 11 542 12 705 13 1306 14 1058 15 1258 16 2159 17 1617 18 700 19 863 20 700 21 1258 22 1911 23 1911 24 1617 25 1258 26 2759 27 1258 28 1258 - - - End';
my $threshold = 600;
my $re = qr/
(?: # start non capture group
Start # literally
| # OR
\G # iterate from last match position
) # end group
(?:(?!End).)*? # make sure "End" does not occur before the number to find
(?<!\d) # negative lookbehind, make sure we don't have a digit before
(\d{3}) # 3 digit number
(?!\d) # negative lookahead, make sure we don't have a digit after
/x;
# Retrieve all 3 digit numbers between Start and End
my @numbers = $str =~ /$re/g;
# Select numbers that are less than $threshold. In this case 600
@numbers = grep { $_ < $threshold } @numbers;
say Dumper \@numbers;
Output:
$VAR1 = [
440,
440,
440,
440,
440,
598,
598,
598,
542,
542
];
If you're searching for a specific number, such as one that is close to 600, I would suggest using a regexp to collect all the numbers and then using some algorithm to find the matching number.
This regexp will help you check that your string matches the pattern and collect all numbers using the group "number".
^Start (([^\d]+ )*((?<number>\d+) )*)*End$
This simpler regexp will help you collect the numbers without checking the whole string:
\d+
Iterate through your collection of numbers and find the one you need.
Sorry, I didn't notice which language you are using for your code snippet.
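The "collect first, filter afterwards" idea can be sketched in Python as follows. This is only an illustration under assumptions: the function name numbers_between and its parameters are hypothetical, the sample string is a shortened version of the test string, and Python is just one possible language for it.

```python
import re


def numbers_between(text, start="Start", end="End", limit=600):
    """Return standalone 3-digit numbers below `limit` found between start and end."""
    # Cut out the part between the two marker words (DOTALL lets it span lines).
    section = re.search(re.escape(start) + r"(.*?)" + re.escape(end), text, re.DOTALL)
    if section is None:
        return []
    # Exactly three digits, with no digit directly before or after.
    candidates = re.findall(r"(?<!\d)\d{3}(?!\d)", section.group(1))
    return [int(n) for n in candidates if int(n) < limit]


sample = "Start 440 3 956 15 598 16 1156 End 123"
print(numbers_between(sample))  # [440, 598]
```

Doing the numeric comparison in code rather than in the pattern keeps the regex short and makes the threshold easy to change.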

Matching across multiple lines regular expression

I have several lists in a single text file that look like the one below. Each list always starts with 0 and always ends with the word Unique at the start of a new line. I would like to get rid of all of it apart from the line with Unique on it. I looked through Stack Overflow and tried the following, but it returns the whole text file (there are other strings in the file that I haven't put in this example). Basically, the problem is how to account for the newlines in the regex selection.
^0(.|\n)*
Input:
0 145
1 139
2 175
3 171
4 259
5 262
6 293
7 401
8 430
9 417
10 614
11 833
12 1423
13 3062
14 10510
15 57587
16 5057575
17 10071
18 375
19 152
20 70
21 55
22 46
23 31
24 25
25 22
26 25
27 14
28 16
29 16
30 8
31 10
32 8
33 21
34 8
35 51
36 65
37 605
38 32
39 2
40 1
41 2
44 1
48 2
51 1
52 1
57 1
63 2
68 1
82 1
94 1
95 1
101 3
102 7
103 1
110 1
111 1
119 1
123 1
129 2
130 3
131 2
132 1
135 1
136 2
137 7
138 4
Unique: 252851
Expected output:
Unique: 252851
You need to use something like
^0[\s\S]*?[\n\r]Unique:
and replace with Unique:.
^ - start of a line
0 - a literal 0
[\s\S]*? - zero or more characters, including newlines, as few as possible
[\n\r] - a linebreak symbol
Unique: - a whole word Unique:
Another possible regex is:
^0[^\r]*(?:\r(?!Unique:)[^\r]*)*
where \r is the line endings in the current file. Replace with an empty string.
Note that you could also use (?m)^0.*?[\r\n]Unique: regex (to replace with Unique:) with the (?m) option:
m: multi-line (dot(.) match newline)
Your method of matching newlines should work, although it's not optimal (alternation is rather slow); the next problem is to make sure the match stops before Unique:
(?s)^0.*(?=Unique:)
should work if there is only one Unique: in your file.
Explanation:
(?s) # Start "dot matches all" (including newlines) mode
^0 # Match "0" at the start of the file
.* # Match as many characters as possible
(?=Unique:) # but then backtrack until you're right before "Unique:"
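Outside a text editor, the same idea can be tried in Python. The sketch below is an assumption-laden illustration, not one of the answers: the function name strip_lists is made up, and it uses a lazy match with Python's (?m) and (?s) flags so that each list stops at the nearest following line that starts with Unique:.

```python
import re


def strip_lists(text):
    """Remove each block from a line starting with 0 up to the next 'Unique:' line."""
    # (?m): ^ matches at every line start; (?s): dot also matches newlines.
    # The lazy .*? stops at the first following line that begins with "Unique:".
    return re.sub(r"(?ms)^0\b.*?(?=^Unique:)", "", text)


sample = "header\n0 145\n1 139\n10 614\nUnique: 252851\nfooter\n"
print(strip_lists(sample))
```

Because the quantifier is lazy, this also copes with several lists in one file, each followed by its own Unique: line.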

How to find and replace in a text editor?

I am very new to text editing, so I'm sorry if this question is unclear. Let me know if there's anything I can specify to make my question more understandable.
My file has 27 tab-separated columns and thousands of rows. I want to replace tabs with an underscore (basically merging the first 3 columns together), but only after my first two columns. How do I do this?
Here's what I currently have for my find:
([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([
^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^
\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\r
and then here's my replace:
\1_\2_\3\t\4\t\5\t\6\t\7\t\8\t\9\t\10\t\11\t\12\t\13\t\14\t\15\t\16\t\17\t\18\t\19\t\20\t\21\t\22\t\23\t\24\t\25\t\26\t\27\r
Also, any references to a good regex guide would be welcomed!
Below are representative data. Each number is separated by a tab in my editor, not by a space.
chr1 28404 29751 25 14 57 42 44 44 56 34 16 24 18 24 24 23 24 163 57 30 28 31 36 23 28 17
chr1 235561 236222 5 13 4 24 4 8 7 6 5 14 20 7 10 3 6 11 9 9 16 8 16 6 11 9
chr1 540455 541272 20 11 6 7 5 7 12 24 7 9 9 6 22 3 10 32 18 22 11 13 10 27 9 10
chr1 713112 715467 96 105 332 159 131 277 225 199 61 164 128 116 156 107 143 687 204 186 97 125 174 193 213 118
chr1 761657 764380 106 153 334 182 161 326 215 343 85 174 160 135 176 151 141 724 308 223 120 141 200 198 247 151
Try this
Find :
(.+?)\t(.+?)\t(.+?)\n
Replace with
\1_\2_\3\n
Have a look at the demo.
Also, you will have to disable ". matches newline" in your text editor.
So, you have something like
Running this search and replace below, you will get:
Regex:
^(\s*(?:[^\t]+\t){2})([^\t]+)\t([^\t]+)\t
Replacement: $1$2_$3_
If you can have empty columns, replace the + quantifier with *:
^(\s*(?:[^\t]*\t){2})([^\t]*)\t([^\t]*)\t
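If the file can be processed with a script instead of an editor, here is a minimal Python sketch. It assumes the goal is the one implied by the question's own replacement string (\1_\2_\3), namely joining the first three columns with underscores. The function name merge_first_three is hypothetical.

```python
def merge_first_three(line):
    """Join the first three tab-separated columns with underscores."""
    # Only the first two tabs are replaced; all later tabs stay as they are.
    return line.replace("\t", "_", 2)


row = "chr1\t28404\t29751\t25\t14\t57"
print(merge_first_three(row))
```

Limiting the replacement count avoids having to spell out all 27 columns in a pattern.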

RegEx: Reject sub-portion of complicated expression

In the sample text below, I want to match groups of text (newlines and all) starting with a line defined by \nI.*' and including the subsequent lines starting with \nA, only if none of the intermediate lines contains "BOM=". I.e. in the example, I would want to match the first "device" and its following attributes, but not the second device, as shown in my comments (after #s).
I 657 device:THAT 2 1290 400 0 1 ' # Start matching here because no lines have "BOM="
A 1335 425 12 0 5 0 some text
A 1335 455 12 0 5 0 some text
A 1300 440 12 0 9 3 some text
A 1370 375 12 0 3 0 some text # Finish matching here
C 655 1 3 0
A 1370 450 12 0 3 3 #=2
C 740 2 4 0
A 1305 450 12 0 9 3 #=1
C 740 2 4 0
A 1305 450 12 0 9 3 #=1
I 318 device:THIS 2 300 1840 0 1 ' # Do not match again here because there's a line with "BOM="
A 320 1880 12 0 7 3 some text
A 320 1880 12 0 9 3 some text
A 380 1880 12 0 1 1 BOM=1,2
A 345 1865 12 0 5 0 some text
A 380 1830 12 0 3 0 some text
C 666 1 3 0
In the sample text, "some text" is various descriptors for electrical devices, e.g. "RATING=63MW", "REFDES=R123". It may contain whitespace but not newlines.
The furthest I've gotten yet is the expression
((\n|^)I((?!misc).)*?'\n)((A.*\n)*(A.*BOM=.*\n)(A.*\n)*)
which matches the opposite of what I want, i.e it finds the text blocks that DO contain BOM=. I thought I could switch this by changing (A.*BOM=.*\n) to (?!(A.*BOM=.*\n)) but this did not work.
I'm hoping to use this in Notepad++ when I'm done.
You can perhaps try this regex:
^I(?:(?!misc).)*'\n(?!(?:A.*\n)*?A.*BOM=)(?:A.*\n)*
regex101 demo
I added a third block where the BOM= is instead on a line starting with C. There the device is still matched, because BOM= is not on one of the consecutive lines beginning with A.
In Notepad++, ^ matches at the start of every line by default, so it's usually not necessary to have (^|\n), but you can put it back if you need it.
I also kept (?:(?!misc).)* in because you had it in your expression, although it has no effect on your sample data.
(?!(?:A.*\n)*?A.*BOM=) is what's making the match fail when there's a BOM= in the lines. It's a negative lookahead which will prevent a match only if A.*BOM= matches after any number of lines of (?:A.*\n)*? (i.e. lines beginning with A).
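To see the negative lookahead at work outside Notepad++, the expression can be run through Python's re module. This is only a sketch: the sample below is a shortened, hypothetical version of the data with the # comments removed, and re.MULTILINE stands in for Notepad++'s per-line ^.

```python
import re

# Same expression as above; MULTILINE makes ^ match at every line start.
PATTERN = re.compile(
    r"^I(?:(?!misc).)*'\n(?!(?:A.*\n)*?A.*BOM=)(?:A.*\n)*",
    re.MULTILINE,
)

sample = (
    "I 657 device:THAT 2 1290 400 0 1 '\n"
    "A 1335 425 12 0 5 0 RATING=63MW\n"
    "A 1335 455 12 0 5 0 REFDES=R123\n"
    "C 655 1 3 0\n"
    "I 318 device:THIS 2 300 1840 0 1 '\n"
    "A 320 1880 12 0 7 3 REFDES=R124\n"
    "A 380 1880 12 0 1 1 BOM=1,2\n"
    "A 345 1865 12 0 5 0 RATING=1W\n"
    "C 666 1 3 0\n"
)

matches = PATTERN.findall(sample)
for block in matches:
    print(block)
```

Only the device:THAT block is returned here. The second device is skipped because one of the A lines directly following it contains BOM=.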

Trying to extract the last number on a line, with sets of numbers delimited by spaces

So I've extracted the digits from a log file and it looks like this:
2011 04 13 23 54 14 601 04 13 23 54 14 10 35 1 14 8080 59 250
What I'm trying to get is the last number (250), and it will loop through each line of the log. Once I get the last number from each line, I will do some calculations...I just can't extract that last number at the end of the line. Thanks!
while (<>) {
my ($last) = /(\d+)$/;
}
If your data is an array, @digits, then the last one is $digits[-1].
If your data is in a string, use split to get it into an array.
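Both ideas (anchoring a match at the end of the line, or splitting and taking the last element) carry over to other languages. Here is a small Python sketch for comparison, with the hypothetical helper name last_number:

```python
import re


def last_number(line):
    """Return the final run of digits on the line as an int, or None if there is none."""
    match = re.search(r"(\d+)\s*$", line)
    return int(match.group(1)) if match else None


log_line = "2011 04 13 23 54 14 601 04 13 23 54 14 10 35 1 14 8080 59 250"
print(last_number(log_line))      # 250
print(int(log_line.split()[-1]))  # 250, by splitting on whitespace instead
```

The regex version tolerates trailing whitespace, while the split version assumes the last whitespace-separated token is the number.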