How to find and replace in a text editor? - regex

I am very new to text editing, so I'm sorry if this question is unclear. Let me know if there's anything I can specify to make it more understandable.
My file has 27 tab-separated columns and thousands of rows. I want to replace tabs with an underscore (basically merging the first 3 columns together), but only after my first two columns. How do I do this?
Here's what I currently have for my find:
([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([
^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^
\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\t([^\t]+)\r
and then here's my replace:
\1_\2_\3\t\4\t\5\t\6\t\7\t\8\t\9\t\10\t\11\t\12\t\13\t\14\t\15\t\16\t\17\t\18\t\19\t\20\t\21\t\22\t\23\t\24\t\25\t\26\t\27\r
Also, any references to a good regex guide would be welcomed!
Below are representative data. Each number is separated by a tab in my editor, not by a space.
chr1 28404 29751 25 14 57 42 44 44 56 34 16 24 18 24 24 23 24 163 57 30 28 31 36 23 28 17
chr1 235561 236222 5 13 4 24 4 8 7 6 5 14 20 7 10 3 6 11 9 9 16 8 16 6 11 9
chr1 540455 541272 20 11 6 7 5 7 12 24 7 9 9 6 22 3 10 32 18 22 11 13 10 27 9 10
chr1 713112 715467 96 105 332 159 131 277 225 199 61 164 128 116 156 107 143 687 204 186 97 125 174 193 213 118
chr1 761657 764380 106 153 334 182 161 326 215 343 85 174 160 135 176 151 141 724 308 223 120 141 200 198 247 151

Try this
Find :
(.+?)\t(.+?)\t(.+?)\n
Replace with
\1_\2_\3\n
Have a look at the demo.
You will also have to disable ". matches newline" in your text editor.
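For readers who would rather script this than use an editor, here is a minimal Python sketch of the same find/replace. The sample rows are shortened to five columns and the variable names are made up for illustration. Python's dot does not match a newline by default, which corresponds to having ". matches newline" disabled.

```python
import re

# Two shortened sample rows (tab-separated). Every row must end with "\n"
# because the pattern anchors on the line break.
text = "chr1\t28404\t29751\t25\t14\nchr1\t235561\t236222\t5\t13\n"

# Same pattern as above: two lazily matched columns, then the rest of the line.
merged = re.sub(r"(.+?)\t(.+?)\t(.+?)\n", r"\1_\2_\3\n", text)

print(merged)
```

Only the first two tabs of each row are turned into underscores; the remaining columns stay tab-separated.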

The search and replace below keeps the first two columns as they are and joins the next three columns with underscores:
Regex:
^(\s*(?:[^\t]+\t){2})([^\t]+)\t([^\t]+)\t
Replacement: $1$2_$3_
If you can have empty columns, replace the + quantifier with *:
^(\s*(?:[^\t]*\t){2})([^\t]*)\t([^\t]*)\t
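As a rough illustration, the first pattern (the + version) can be applied line by line in Python. The helper name join_columns_3_to_5 and the shortened sample row are assumptions for this sketch; note that Python writes backreferences as \1 rather than $1.

```python
import re

# Keep the first two columns (group 1), then capture columns 3 and 4
# and consume the tab that follows each of them.
PATTERN = re.compile(r"^(\s*(?:[^\t]+\t){2})([^\t]+)\t([^\t]+)\t")

def join_columns_3_to_5(line):
    """Replace the tabs after columns 3 and 4 with underscores (hypothetical helper)."""
    return PATTERN.sub(r"\1\2_\3_", line, count=1)

row = "chr1\t28404\t29751\t25\t14\t57"
print(join_columns_3_to_5(row))
```

Rows with fewer than five columns do not match the pattern and are returned unchanged.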


find specific number combination within from and to text

I want to match only a 3-digit number (under 600, in the example below "598") when a specific number in the string appears between the start wording and the end wording. With the regular expression below I get a match of everything. Can anyone help?
Regular expression: (?<=Start)(.*)(?=End).
Test string:
Start 440 3 956 4 603 5 - 6 603 7 440 8 - 9 440 10 956 11 440 12 603 13 2005
14 440 15 598 16 1156 17 946 18 761 19 761 20 946 21 598 22 598
23 1156 24 2057 25 946 26 1194 27 946 28 946 - - - Zurich 2019 M T W T F S S - - - - 1 - 2 1058 3 542 4 852 5 - 6 1517 7 1058 8 - 9 1058 10 848 11 542 12 705 13 1306 14 1058 15 1258 16 2159 17 1617 18 700 19 863 20 700 21 1258 22 1911 23 1911 24 1617 25 1258 26 2759 27 1258 28 1258 - - - End
With \b[0-5]\d{2}\b you find all 3-digit numbers under 600.
Demo: https://regex101.com/r/0ZSbbY/2
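A quick way to try this pattern is a short Python sketch. The sample string is abridged and slightly altered from the question, and restricting the search to the text between Start and End is done here with plain string slicing, which is an assumption of this sketch rather than part of the pattern.

```python
import re

sample = "440 Start 440 3 956 4 603 15 598 16 1156 3 542 End 598"

# Keep only the text between the first "Start" and the following "End".
begin = sample.index("Start") + len("Start")
end = sample.index("End", begin)
inside = sample[begin:end]

# \b on both sides rejects longer numbers such as 1156.
matches = re.findall(r"\b[0-5]\d{2}\b", inside)
print(matches)
```

The 440 before Start and the 598 after End are ignored because they fall outside the slice.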
Try this pattern:
(?<=^|\D)[1-5]?\d{2}(?!.+Start)(?=\D.+End)
(?<=^|\D)[1-5]?\d{1,2} matches all 1- or 2-digit numbers, as they are all less than 600. It also finds 3-digit numbers of the form 1**, 2**, 3**, 4**, 5**.
(?!.+Start)(?=\D.+End) these lookaheads ensure that we are before the word End and not before the word Start, i.e. between them. It couldn't be done with a positive lookbehind, as @TimBiegeleisen stated, because it would have variable length.
Demo
#!/usr/bin/perl
use Modern::Perl;
use Data::Dumper;
my $str = 'Start 440 3 956 4 603 5 - 6 603 7 440 8 - 9 440 10 956 11 440 12 603 13 2005 14 440 15 598 16 1156 17 946 18 761 19 761 20 946 21 598 22 598 23 1156 24 2057 25 946 26 1194 27 946 28 946 - - - Zurich 2019 M T W T F S S - - - - 1 - 2 1058 3 542 4 852 5 - 6 1517 7 1058 8 - 9 1058 10 848 11 542 12 705 13 1306 14 1058 15 1258 16 2159 17 1617 18 700 19 863 20 700 21 1258 22 1911 23 1911 24 1617 25 1258 26 2759 27 1258 28 1258 - - - End';
my $threshold = 600;
my $re = qr/
(?: # start non capture group
Start # literally
| # OR
\G # iterate from last match position
) # end group
(?:(?!End).)*? # make sure we don't have "End" before the number to find
(?<!\d) # negative lookbehind, make sure we don't have a digit before
(\d{3}) # 3 digit number
(?!\d) # negative lookahead, make sure we don't have a digit after
/x;
# Retrieve all 3 digit numbers between Start and End
my @numbers = $str =~ /$re/g;
# Select numbers that are less than $threshold. In this case 600
@numbers = grep { $_ < $threshold } @numbers;
say Dumper \@numbers;
Output:
$VAR1 = [
440,
440,
440,
440,
440,
598,
598,
598,
542,
542
];
If you're searching for a specific number, like one that is close to 600, I would suggest using a regexp to collect all the numbers and then using some algorithm to find the matching number.
This regexp will help you check that your string matches the pattern and collect all numbers using the group "number".
^Start (([^\d]+ )*((?<number>\d+) )*)*End$
This simpler regexp will help you collect the numbers without checking the whole string:
\d+
Iterate through your collection of numbers and find the one you need.
Sorry, I did not notice which language you are using, so I could not write a code snippet.
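For readers working in Python, the collect-then-filter idea could be sketched as follows. This is only an illustration: the language choice, the function name numbers_between and the abridged sample string are assumptions, since the question does not name a language.

```python
import re

def numbers_between(text, start="Start", end="End"):
    """Return every integer found between the start and end markers."""
    match = re.search(re.escape(start) + r"(.*?)" + re.escape(end), text, re.DOTALL)
    if match is None:
        return []
    return [int(n) for n in re.findall(r"\d+", match.group(1))]

sample = "Start 440 3 956 4 603 15 598 16 1156 2 1058 3 542 End"
numbers = numbers_between(sample)

# Keep 3-digit numbers below the threshold, then pick the one closest to it.
threshold = 600
candidates = [n for n in numbers if 100 <= n < threshold]
closest = max(candidates) if candidates else None
print(candidates, closest)
```

Doing the comparison on integers keeps the regexp trivial and makes the threshold easy to change.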

add number as prefix

I have a list of numbers:
19
20
21
22
23
24
25
26
many more numbers...
I want to add one digit to all of them as a prefix so that they all become three-digit numbers:
219
220
221
222
223
224
225
226
It should go like this in the find section: \S{2,}. Then what should I put in the replace section? 2$1 or something else? I am not an expert.
Find all two digits and capture them (with parentheses).
\b(\d\d)\b
Replace captured groups with an additional 2 in front.
2$1
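Outside an editor, the same substitution is short in a scripting language. A minimal Python sketch, assuming the numbers sit one per line; the sample values are illustrative, and Python writes the backreference as \1 instead of $1.

```python
import re

numbers = "19\n20\n21\n22\n100"

# \b(\d\d)\b matches only stand-alone two-digit numbers, so 100 is left alone.
prefixed = re.sub(r"\b(\d\d)\b", r"2\1", numbers)
print(prefixed)
```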

Matching across multiple lines regular expression

I have several lists in a single text file that look like the one below. Each list always starts with 0 and always ends with the word Unique at the start of a new line. I would like to get rid of all of it apart from the line with Unique on it. I looked through Stack Overflow and tried the following, but it returns the whole text file (there are other strings in the file that I haven't put in this example). Basically, the problem is how to account for the newlines in the regex selection:
^0(.|\n)*
Input:
0 145
1 139
2 175
3 171
4 259
5 262
6 293
7 401
8 430
9 417
10 614
11 833
12 1423
13 3062
14 10510
15 57587
16 5057575
17 10071
18 375
19 152
20 70
21 55
22 46
23 31
24 25
25 22
26 25
27 14
28 16
29 16
30 8
31 10
32 8
33 21
34 8
35 51
36 65
37 605
38 32
39 2
40 1
41 2
44 1
48 2
51 1
52 1
57 1
63 2
68 1
82 1
94 1
95 1
101 3
102 7
103 1
110 1
111 1
119 1
123 1
129 2
130 3
131 2
132 1
135 1
136 2
137 7
138 4
Unique: 252851
Expected output:
Unique: 252851
You need to use something like
^0[\s\S]*?[\n\r]Unique:
and replace with Unique:.
^ - start of a line
0 - a literal 0
[\s\S]*? - zero or more characters incl. a newline as few as possible
[\n\r] - a linebreak symbol
Unique: - a whole word Unique:
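To check the behaviour outside the editor, here is a small Python sketch. The sample text is shortened, and the "header" and "footer" lines are invented to stand in for the other strings in the file. re.MULTILINE is needed in Python so that ^ matches at the start of every line.

```python
import re

text = "header\n0 145\n1 139\n2 175\nUnique: 252851\nfooter\n"

# Lazy [\s\S]*? stops at the first line break that is followed by "Unique:".
cleaned = re.sub(r"^0[\s\S]*?[\n\r]Unique:", "Unique:", text, flags=re.MULTILINE)
print(cleaned)
```

Because the quantifier is lazy, each list in a file with several lists would be reduced to its own Unique: line.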
Another possible regex is:
^0[^\r]*(?:\r(?!Unique:)[^\r]*)*
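where \r is the line ending used in the current file. Replace with an empty string.
Note that in regex flavors where the (?m) option makes the dot match newlines (Ruby/Oniguruma syntax, for example), you could also use the (?m)^0.*?[\r\n]Unique: regex (to replace with Unique:):
m: multi-line (dot(.) match newline)
In most other flavors the equivalent option is (?s).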
Your method of matching newlines should work, although it's not optimal (alternation is rather slow); the next problem is to make sure the match stops before Unique:
(?s)^0.*(?=Unique:)
should work if there is only one Unique: in your file.
Explanation:
(?s) # Start "dot matches all" (including newlines) mode
^0 # Match "0" at the start of the file
.* # Match as many characters as possible
(?=Unique:) # but then backtrack until you're right before "Unique:"
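The caveat about a single Unique: can be seen in a short Python sketch. Both sample strings are made up for illustration; because .* is greedy, everything up to the last Unique: is removed.

```python
import re

pattern = re.compile(r"(?s)^0.*(?=Unique:)")

one_list = "0 145\n1 139\nUnique: 252851\n"
two_lists = "0 145\n1 139\nUnique: 252851\n0 7\n1 9\nUnique: 16\n"

print(pattern.sub("", one_list))   # the single list collapses to its Unique: line
print(pattern.sub("", two_lists))  # only the last Unique: line survives
```

Without a multi-line flag, ^ matches only at the very start of the input here, so the text has to begin with the 0 line.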

Manipulate data in Awk

I am new to Awk programming. I have a question about manipulating a text file, which is needed to draw certain network-based images in a visualization tool (Circos, http://circos.ca).
I have input data in which I want to manipulate values using awk/grep/sed.
There are 9 pairs (18 lines): 5 pairs (the first 10 lines) are for "from=ABCB11", and 4 pairs (the next 8 lines) are for "from=ABCC8". What I want is to extract the values from the first line of the first pair and use them to replace the values in each alternate line of the remaining pairs.
So the value for group-2 is 9 10, which should replace every occurrence of the value in group-2.
The next value for group-2 is 28 29, which should be replaced by 9 10.
The stop should be determined by "from=name", which here is "from=ABCB11". The rows whose values have to be captured and replaced in the following occurrences will not necessarily belong to group-2 as in this instance; they could belong to group-3 or group-4, up to group-10. So the second set ("from=ABCC8") could have belonged to group-4/5/6, not necessarily group-2. That is just a coincidence here.
group-2 9 10 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-3 0 1 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-2 28 29 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM1,toid=114,use=1,z=1
group-5 0 1 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM1,toid=114,use=1,z=1
group-2 29 30 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM2,toid=115,use=1,z=1
group-5 1 2 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM2,toid=115,use=1,z=1
group-2 10 11 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=DRD2,toid=158,use=1,z=1
group-3 1 2 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=DRD2,toid=158,use=1,z=1
group-2 11 12 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=EGF,toid=164,use=1,z=1
group-3 2 3 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=EGF,toid=164,use=1,z=1
group-2 21 22 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-3 12 13 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-2 0 1 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1A,toid=21,use=1,z=1
group-1 0 1 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1A,toid=21,use=1,z=1
group-2 1 2 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1B,toid=22,use=1,z=1
group-1 1 2 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1B,toid=22,use=1,z=1
group-2 2 3 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1D,toid=23,use=1,z=1
group-1 2 3 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1D,toid=23,use=1,z=1
Below is the FINAL output,I am looking for:
group-2 9 10 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-3 0 1 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-2 9 10 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM1,toid=114,use=1,z=1
group-5 0 1 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM1,toid=114,use=1,z=1
group-2 9 10 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM2,toid=115,use=1,z=1
group-5 1 2 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=CHRM2,toid=115,use=1,z=1
group-2 9 10 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=DRD2,toid=158,use=1,z=1
group-3 1 2 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=DRD2,toid=158,use=1,z=1
group-2 9 10 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=EGF,toid=164,use=1,z=1
group-3 2 3 text color=black,from=ABCB11,fromid=4,order=2,thickness=3,to=EGF,toid=164,use=1,z=1
group-2 21 22 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-3 12 13 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ACE,toid=11,use=1,z=1
group-2 21 22 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1A,toid=21,use=1,z=1
group-1 0 1 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1A,toid=21,use=1,z=1
group-2 21 22 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1B,toid=22,use=1,z=1
group-1 1 2 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1B,toid=22,use=1,z=1
group-2 21 22 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1D,toid=23,use=1,z=1
group-1 2 3 text color=black,from=ABCC8,fromid=5,order=2,thickness=3,to=ADRA1D,toid=23,use=1,z=1
Also, this is just sample data. Many pairs would have group-1, group-4, group-5, up to group-10. Here, only pairs from the lower groups are shown.
I want to loop through the lines while the value in "from=name" remains the same, so that I can change all occurrences in each alternate line. Code:
awk -F, 'NR%2==1 {split($2,a,"="); print a[2]}' file.txt
The above code extracts the "name" in "from=name" from every alternate line.
The following is quite verbose (I love verbose variable names). Using your sample data, I get the output you want. It assumes that every odd-numbered line within a block gets its values from the first line with the same "from=xxxx" information.
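awk '
BEGIN {
    namevar=""
    val1var=""
    val2var=""
    linenum=0
}
{
    # Split the line on whitespace, then split the fifth field on commas
    split($0, linearr)
    split(linearr[5], csvarr, ",")
    # A new from=... value starts a new block: remember its first two values
    if (namevar != csvarr[2]) {
        namevar=csvarr[2]
        val1var=linearr[2]
        val2var=linearr[3]
        linenum=0
    }
    linenum+=1
    # Odd lines of the block get the remembered values, even lines stay as they are
    if (linenum%2==1) {
        print linearr[1], val1var, val2var, linearr[4], linearr[5]
    } else {
        print linearr[1], linearr[2], linearr[3], linearr[4], linearr[5]
    }
}' file.txt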
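For comparison, the same logic can be sketched in Python. The function name propagate_first_values is hypothetical and the sample lines are shortened versions of the data in the question; like the awk script, it assumes five whitespace-separated fields with the from=... entry second in the comma-separated part.

```python
def propagate_first_values(lines):
    """Copy columns 2 and 3 of the first line of each from=... block onto
    every odd-numbered line of that block (hypothetical helper)."""
    result = []
    current_name = None
    first_values = []
    position = 0
    for line in lines:
        fields = line.split()
        name = fields[4].split(",")[1]  # e.g. "from=ABCB11"
        if name != current_name:
            # A new block starts: remember the values of its first line.
            current_name = name
            first_values = fields[1:3]
            position = 0
        position += 1
        if position % 2 == 1:
            fields[1:3] = first_values
        result.append(" ".join(fields))
    return result

sample = [
    "group-2 9 10 text color=black,from=ABCB11,fromid=4,to=ACE",
    "group-3 0 1 text color=black,from=ABCB11,fromid=4,to=ACE",
    "group-2 28 29 text color=black,from=ABCB11,fromid=4,to=CHRM1",
    "group-5 0 1 text color=black,from=ABCB11,fromid=4,to=CHRM1",
    "group-2 21 22 text color=black,from=ABCC8,fromid=5,to=ACE",
    "group-3 12 13 text color=black,from=ABCC8,fromid=5,to=ACE",
    "group-2 0 1 text color=black,from=ABCC8,fromid=5,to=ADRA1A",
    "group-1 0 1 text color=black,from=ABCC8,fromid=5,to=ADRA1A",
]
for out_line in propagate_first_values(sample):
    print(out_line)
```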

REGEX: How to split string with space and double quote

I have an input string with spaces and double quotes as below:
Input :
18 17 16 "Arc 10 12 11 13" "Segment 10 23 33 32 12" 23 76 21
Expected Output:
18
17
16
Arc 10 12 11 13
Segment 10 23 33 32 12
23
76
21
How can I do this using Regex? Thank you in advance
You can use the following regexp (see example):
("[^"]+")|\S+
("[^"]+") - quoted sequence.
\S+ - non whitespace sequence.
The order of the groups probably depends on the regexp implementation. In the demo engine, matching started from left to right. Also, do not forget to escape special characters with a double backslash.
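A minimal Python sketch of this idea follows. One deliberate difference from the pattern above: the capture group is moved inside the quotes so that the quoted tokens come out without their quotes, as in the expected output. The variable names are illustrative only.

```python
import re

line = '18 17 16 "Arc 10 12 11 13" "Segment 10 23 33 32 12" 23 76 21'

# Either a quoted run (captured without its quotes) or a run of non-whitespace.
token_re = re.compile(r'"([^"]+)"|\S+')

# group(1) is None for unquoted tokens, so fall back to the whole match.
tokens = [m.group(1) or m.group(0) for m in token_re.finditer(line)]
print(tokens)
```

The quoted alternative is listed first so that it is tried before \S+ at each position.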
"(.+?)"|(\w+(?=\s|$))
check here