select lines with n identical fields, after a sentinel character - regex

In my file, each line includes five numerical fields delimited by spaces (preceded and followed by more fields). Via a shell script I need to be able to select lines with exactly 3, 4 and 5 identical entries among those five numeric fields (i.e. three separate searches, such that the search for lines with 3 matches in those fields does not also return lines with 4 or 5 matches in those fields).
To find the relevant fields, my search will have to locate the first open and closed parenthesis pair on a line. After the parenthesis is closed, the immediately following five fields are the ones I'm interested in. One potential complication: sometimes one or more of the numeric fields is replaced by a single dash/hyphen instead of a number. One potential means of simplification: the five fields will be in (non-strictly) ascending order and any hyphen entries will always precede the remaining numeric fields.
I would be grateful for some sed/awk suggestions with this. Many thanks!
[EDIT]: I can extract the relevant fields (as detailed in the comment below), thus the strike-through paragraph above is unnecessary. Here is sample data once the relevant fields are extracted:
109 110 111 111 112
110 110 111 111 112
99 99 99 112 112
99 99 99 112 112
100 101 101 112 112
102 102 102 112 112
102 102 103 112 112
102 103 103 112 112
102 104 104 112 112
102 104 104 112 112
103 104 104 112 112
102 105 105 112 112
102 105 105 112 112
103 105 105 112 112
102 106 106 112 112
102 106 107 112 112
103 106 107 112 112
104 106 107 112 112
102 107 107 112 112
104 107 107 112 112
104 107 107 112 112
106 107 108 112 112
107 107 108 112 112
107 107 108 112 112
102 109 109 112 112
102 109 109 112 112
104 109 109 112 112
102 109 110 112 112
103 109 110 112 112
104 109 110 112 112
102 110 110 112 112
104 110 110 112 112
104 110 110 112 112
107 109 111 112 112
107 109 111 112 112
106 110 111 112 112
107 110 111 112 112
107 110 111 112 112
109 110 112 112 112
110 110 112 112 112
107 112 112 112 112
112 112 112 112 112
This should produce hits when n=3 on these lines:
99 99 99 112 112
99 99 99 112 112
102 102 102 112 112
109 110 112 112 112
110 110 112 112 112
A hit when n=4 on this line:
107 112 112 112 112
and a hit when n=5 on this line:
112 112 112 112 112

Here is a Bash script solution using awk. It reads the file line by line and uses an awk associative array to count how many times each number appears on the line. Change filename.txt to the file that contains your numbers.
n=3
while IFS= read -r line
do
    echo "$line" | awk -v n="$n" '
    {
        # count how often each field value occurs on this line
        for (i = 1; i <= NF; i++) {
            a[$i]++
        }
    }
    {
        # print the line once if any value occurs exactly n times
        for (o in a) {
            if (a[o] == n) {
                print
                break
            }
        }
    }
    '
done < filename.txt
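For comparison, the same counting idea can be sketched in Python. This is only an illustration, not part of the awk answer: the function name has_exact_count is made up, and it assumes each line already holds just the five extracted fields. Note that a hyphen is counted like any other value here.

```python
from collections import Counter


def has_exact_count(line, n):
    """Return True if some field value occurs exactly n times on the line."""
    counts = Counter(line.split())
    return any(c == n for c in counts.values())


# A few lines taken from the sample data in the question.
sample = [
    "109 110 111 111 112",
    "99 99 99 112 112",
    "109 110 112 112 112",
    "107 112 112 112 112",
    "112 112 112 112 112",
]

for n in (3, 4, 5):
    print(n, [s for s in sample if has_exact_count(s, n)])
# 3 ['99 99 99 112 112', '109 110 112 112 112']
# 4 ['107 112 112 112 112']
# 5 ['112 112 112 112 112']
```

Because the counts are compared with ==, a line with four or five identical fields is not reported for n=3, which is the behaviour the question asks for.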

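You can do it using sed as well, relying on the fields being sorted so that identical values are adjacent.
You can create a script. The first expression drops lines that contain a run of more than n identical fields, so only runs of exactly n are printed. The pattern compares digit strings without anchoring them to field boundaries, which is fine for sorted fields of equal width such as the sample data:
n=$(($1-1))
sed -n "/\([0-9][0-9]*\)\( \1\)\{$1\}/d; /\([0-9][0-9]*\)\( \1\)\{$n\}/p" filename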
And run it like this, just supply n as a script argument:
./script.sh 3
Output:
99 99 99 112 112
99 99 99 112 112
102 102 102 112 112
109 110 112 112 112
110 110 112 112 112
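The back-reference idea carries over to other regex engines. As a rough sketch in Python (the function names are made up for illustration, and it assumes each line holds only the five extracted fields, sorted as described in the question), "exactly n" can be expressed as "a run of n, but no run of n+1":

```python
import re


def run_pattern(n):
    """Regex for a whole field followed by n-1 copies of itself.

    The lookarounds keep the match on field boundaries, so a field is
    never matched against a mere prefix or suffix of its neighbour.
    """
    return re.compile(r"(?<!\S)(\d+)(?: \1){%d}(?!\S)" % (n - 1))


def has_run_of_exactly(line, n):
    """True if the line has a run of n identical fields and no longer run."""
    return bool(run_pattern(n).search(line)) and not run_pattern(n + 1).search(line)


print(has_run_of_exactly("109 110 112 112 112", 3))  # True
print(has_run_of_exactly("107 112 112 112 112", 3))  # False
print(has_run_of_exactly("107 112 112 112 112", 4))  # True
```

The second check is what separates the n=3 search from the n=4 and n=5 searches. With only five fields per line, a run of n (n >= 3) and a longer run of a different value cannot coexist, so the simple "no run of n+1" test is enough under that assumption.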

Awk-only solution as a one-liner:
awk -v n=3 '{for(i=1;i<=NF;i++)a[$i]++;for(o in a)if(a[o]==n)p=1} p; {p=0;delete a}' inputfile
Split out for easier reading, this slightly resembles badjr's solution. (I've used his variables for easier comparison.)
{
    for (i=1;i<=NF;i++)   # populate an array with counts of unique elements
        a[$i]++
    for (o in a)          # check the array for a matching count & set flag
        if (a[o]==n)
            p=1
}
p;                        # if we've set our flag, print the current line.
{                         # clear our workspace for the next line.
    p=0
    delete a
}
If you're interested in a bash-only solution, the following implements the same awk logic, only without awk:
#!/usr/bin/env bash
n=5
while read -a a; do
    unset b
    for i in "${!a[@]}"; do
        (( b[${a[$i]}]++ ))
    done
    for i in "${b[@]}"; do
        [ "$i" -eq "$n" ] && echo "${a[@]}"
    done
done < inputfile
Note that because the output here is printed using array elements, whitespace in the input file will not be maintained.
This solution is bash-only because of its use of arrays.

Another sed example, didn't want my work to go to waste ;)
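#!/bin/bash
# Build n-1 back-references so the pattern matches n identical fields in a row.
while (($1 > 1))
do
    n="${n} \1"
    set ${1}-1    # $1 becomes e.g. "3-1", which (( )) evaluates arithmetically
done
sed -nr "\_\<([0-9]+)${n}\>_ p"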
EDIT:
On BSD sed (OS X) you need to replace \< and \> with the fascinating [[:<:]] and [[:>:]] respectively.

Related

SAS Read Two Variable Horizontal Data

I have a dataset that is set up as follows:
Standard
129 106 122 114 121 111 135 106 122 148 102 121 129 101 109 123 109 123 101 119
138 151 137 116 118 118 143 104 119 113 121 98 116 103 132 113 105 127 113 118
109 94 110 119 125 105 106 131 104 126 122 106 118 123 110 134 138 135 131 116
117 123 103 111 120 137 106 112 100 112 128 102 116 118 140 97 122 133 129 127
120 120 127 136 123 112 99 124 129 116 127 123 131 127 109 99 134 128 109 129
New
112 101 109 124 131 97 98 106 115 119 116 125 108 116 111 121 109 124 120 96
102 130 106 112 115 111 122 106 107 109 115 104 125 114 135 127 117 113 98 95
121 116 111 116 118 112 117 114 128 125 104 118 122 123 124 119 110 96 123 124
127 100 121 108 133 118 114 116 125 118 137 115 131 108 100 121 113 116 104 101
126 123 135 116 118 111 101 118 111 125 104 124 132 121 114 132 123 121 121 110
I have only ever read data into SAS that was in column form. I tried to set it up as two separate raw datasets, one Standard and one New, and then merge the two.
data blood1;
input Standard;
datalines;
129 106 122 114 121 111 135 106 122 148 102 121 129 101 109 123 109 123 101 119
138 151 137 116 118 118 143 104 119 113 121 98 116 103 132 113 105 127 113 118
109 94 110 119 125 105 106 131 104 126 122 106 118 123 110 134 138 135 131 116
117 123 103 111 120 137 106 112 100 112 128 102 116 118 140 97 122 133 129 127
120 120 127 136 123 112 99 124 129 116 127 123 131 127 109 99 134 128 109 129
;
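But this only reads the first number of each row. What would be the best way to read this data in?
You need to add @@ to the end of the input statement. The double trailing @ tells SAS to hold the current input line across iterations of the DATA step, so each value on the line is read as its own observation instead of SAS moving to the next line after the first value.
data blood1;
input Standard @@;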
datalines;
129 106 122 114 121 111 135 106 122 148 102 121 129 101 109 123 109 123 101 119
138 151 137 116 118 118 143 104 119 113 121 98 116 103 132 113 105 127 113 118
109 94 110 119 125 105 106 131 104 126 122 106 118 123 110 134 138 135 131 116
117 123 103 111 120 137 106 112 100 112 128 102 116 118 140 97 122 133 129 127
120 120 127 136 123 112 99 124 129 116 127 123 131 127 109 99 134 128 109 129
;
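To make the effect concrete, here is a small Python sketch of what reading with a held line amounts to: every whitespace-separated value becomes its own observation, regardless of line breaks. It is not SAS, the helper name read_values is invented for the illustration, and only the first three values of the first two Standard rows are used.

```python
def read_values(text):
    """Flatten whitespace-separated numbers into one value per observation."""
    return [int(tok) for line in text.splitlines() for tok in line.split()]


# First three values of the first two "Standard" rows from the question.
standard = read_values("129 106 122\n138 151 137")
print(standard)       # [129, 106, 122, 138, 151, 137]
print(len(standard))  # 6 observations, not 2
```

Without the line hold, only the first value of each row would be kept, which in this sketch would correspond to [129, 138].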

How to scan input from user if graph is given in form of adjacency list?

My input is in the format shown below. I am using an array of vectors (graph) to read it. Suppose the program has read the first number of the first row; that number is the node. I then want to add every vertex connected to that node with graph[node].push_back(next integer in the same row), and so on. But how do I know when to stop reading input for a particular node? After reading the first row I need to start adding vertices to the list of the next row's node. If my question is unclear, please have a look at my code.
#include <bits/stdc++.h>
using namespace std;
vector<int> graph[201];
int main()
{
int n=2,i,node,temp;
for(i=1;i<=n;i++)
{
cin>>node;
while(scanf("%d",&temp)!=EOF/*What is the correct conditon to stop loop*/)
{
graph[node].push_back(temp);
}
}
return 0;
}
Input format :
1 37 79 164 155 32 87 39 113 15 18 78 175 140 200 4 160 97 191 100 91 20 69 198 196
20 123 134 10 141 13 12 43 47 3 177 101 179 77 182 117 116 36 103 51 154 162 128 30
3 48 123 134 109 41 17 159 49 136 16 130 141 29 176 2 190 66 153 157 70 114 65 173 104 194 54
14 91 171 118 125 158 76 107 18 73 140 42 193 127 100 84 121 60 81 99 80 150 55 1 35 23 93
5 193 156 102 118 175 39 124 119 19 99 160 75 20 112 37 23 145 135 146 73 35
60 155 56 52 120 131 160 124 119 14 196 144 25 75 76 166 35 87 26 20
7 156 185 178 79 27 52 144 107 78 22 71 26 31 15 56 76 112 39 8 113 93
8 185 155 171 178 108 64 164 53 140 25 100 133 9 52 191 46 20 150 144 39 62 131 42 119 127 31 7
9 91 155 8 160 107 132 195 26 20 133 39 76 100 78 122 127 38 156 191 196
10 190 184 154 49 2 182 173 170 161 47 189 101 153 50 30 109 177 148 179 16 163 116 13 90 185
111 123 134 163 41 12 28 130 13 101 83 77 109 114 21 82 88 74 24 94 48 33
12 161 109 169 21 24 36 65 50 2 101 159 148 54 192 88 47 11 142 43 70 182 177 179 189 194 33
13 161 141 157 44 83 90 181 41 2 176 10 29 116 134 182 170 165 173 190 159 47 82 111 142 72 154 110 21 103 130 11 33 138 152
and so on...
Assuming you want to read the input line by line, you could do the following:
#include <bits/stdc++.h>
using namespace std;
vector<int> graph[201];
int main(){
    string line;
    int n=2,i,node,temp;
    for(i=1;i<=n;i++){
        getline(cin, line);      // read one whole row
        istringstream in( line );
        in>>node;                // the first number is the node
        while(in>>temp){         // the rest are its neighbours; stops at the end of the row
            graph[node].push_back(temp);
        }
    }
    return 0;
}
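The same line-by-line idea, sketched in Python for comparison: split each row once, treat the first number as the node and the remaining numbers as its neighbours. The function name parse_adjacency is made up for this example, and the two rows are shortened versions of rows from the question's input.

```python
from collections import defaultdict


def parse_adjacency(lines):
    """Build {node: [neighbours]} from rows of 'node n1 n2 ...'."""
    graph = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:          # skip blank lines
            continue
        node, *neighbours = map(int, fields)
        graph[node].extend(neighbours)
    return graph


rows = ["1 37 79 164", "3 48 123 134"]   # shortened sample rows
graph = parse_adjacency(rows)
print(graph[1])  # [37, 79, 164]
print(graph[3])  # [48, 123, 134]
```

The end of the row is the stop condition, so no sentinel value or per-row count is needed.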

Getline puts program on pause

When "text" is larger than about 280 numbers, the program waits ...
With any "text" of up to about 280 numbers, it works fine.
#include <iostream>
#include <string>
using namespace std;
int main()
{
string text;
getline (cin, text);
cout << text;
}
eg:
167 214 280 265 278 292 196 249 242 297 7 125 151 4 25 172 293 157 290 277 240 155 201 90 44 230 94 185 184 65 189 159 74 30 59 279 169 136 142 80 46 124 66 203 138 182 171 241 267 294 32 233 165 39 149 181 156 170 137 96 130 238 239 37 298 48 288 6 100 174 268 144 272 109 275 190 160 154 57 15 83 16 183 236 95 97 147 215 77 34 219 91 68 81 52 207 187 105 229 153 243 20 71 53 3 102 259 13 115 123 98 193 87 208 120 221 113 261 126 178 111 133 255 36 287 93 228 263 47 227 188 191 295 205 28 82 244 152 281 166 58 192 162 60 256 76 50 179 235 247 282 118 88 212 112 21 273 141 222 56 209 134 237 2 121 104 23 150 194 146 24 300 64 92 78 79 116 108 286 223 70 61 67 284 19 33 173 216 42 164 29 199 63 69 140 132 211 101 103 119 106 198 296 168 224 158 232 27 254 246 262 110 250 225 135 86 26 51 180 231 114 257 75 202 217 251 218 18 89 213 85 220 117 266 206 127 234 197 291 248 14 258 129 226 148 260 84 204 73 299 31 264 276 107 11 145 1 54 200 49 72 177 62 45 163 271 274 270 195 186 252 139 99 55 41 38 253 285 5 176 283 22 122 161 17 175 131 43 289 269 9 40 245 12 10 143 35 210 128 8
The only reason I could imagine your code pausing is if you were overfilling the space allocated for a string. This seems extremely unlikely. There is a limit to the maximum size of a string (std::string::max_size()), which depends on the width of size_t: 32 bits on typical 32-bit platforms and 64 bits on 64-bit ones. Even with a 32-bit size_t the limit is on the order of 2^32 - 1 characters. Obviously, you are nowhere near reaching that length.
In normal circumstances, you are far more likely to be bound by the amount of space that your system can allocate for the string. This is dependent on the amount of memory available and how your system breaks it up.
Both of these situations seem extremely unlikely.
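In this situation, the issue is most likely with your compiler or platform. One thing worth checking is how the input reaches the program: a terminal in canonical (line-buffered) mode limits the length of a single input line (commonly 1024 or 4096 bytes, depending on the system), and roughly 280 three-digit numbers is close to 1024 characters. Redirecting the input from a file would rule this out.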
What's your reason for taking in such a long string? Have you considered other approaches like reading in the integers from a text file? You could use a vector to store the integers in a more organized way by reading them in individually from the text file, just a thought.

AWK - Printing a specific pattern

I have a file that looks like this:
gene_id_100100 sp|Q53IZ1|ASDP_PSESP 35.81 148 90 2 13 158 6 150 6e-27 109 158 531
gene_id_100600 sp|Q49W80|Y1834_STAS1 31.31 99 63 2 1 95 279 376 7e-07 50.1 113 402
gene_id_100 sp|A7TSV7|PAN1_VANPO 36.36 44 24 1 41 80 879 922 1.9 32.3 154 1492
gene_id_10100 sp|P37348|YECE_ECOLI 32.77 177 104 6 3 172 2 170 2e-13 71.2 248 272
gene_id_101100 sp|B0U4U5|SURE_XYLFM 29.11 79 41 3 70 148 143 206 0.14 35.8 175 262
gene_id_101600 sp|Q5AWD4|BGLM_EMENI 35.90 39 25 0 21 59 506 544 4.9 30.4 129 772
gene_id_102100 sp|P20374|COX1_APILI 38.89 36 22 0 3 38 353 388 0.54 32.0 92 521
gene_id_102600 sp|Q46127|SYW_CLOLO 79.12 91 19 0 1 91 1 91 5e-44 150 92 341
gene_id_103100 sp|Q9UJX6|ANC2_HUMAN 53.57 28 13 0 11 38 608 635 2.1 28.9 42 822
gene_id_103600 sp|C1DA02|SYL_LARHH 35.59 59 30 2 88 138 382 440 4.6 30.8 140 866
gene_id_104100 sp|B8DHP2|PROB_LISMH 25.88 85 50 2 37 110 27 109 0.81 32.3 127 276
gene_id_105100 sp|A1ALU1|RL3_PELPD 31.88 69 42 2 14 77 42 110 2.2 31.6 166 209
gene_id_105600 sp|P59696|T200_SALTY 64.00 125 45 0 5 129 3 127 9e-58 182 129 152
gene_id_10600 sp|G3XDA3|CTPH_PSEAE 28.38 74 48 1 4 77 364 432 0.56 31.6 81 568
gene_id_106100 sp|P94369|YXLA_BACSU 35.00 100 56 3 25 120 270 364 4e-08 53.9 120 457
gene_id_106600 sp|P34706|SDC3_CAEEL 60.00 20 8 0 18 37 1027 1046 2.3 32.7 191 2150
Now, I need to extract the gene ID, which is the part between the two | characters in the second column. In other words, I need an output that looks like this:
Q53IZ1
Q49W80
A7TSV7
P37348
B0U4U5
Q5AWD4
P20374
Q46127
Q9UJX6
C1DA02
B8DHP2
A1ALU1
P59696
G3XDA3
P94369
P34706
I have been trying to do it using the following command:
awk '{for(i=1;i<=NF;++i){ if($i==/[A-Z][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9][A-Z0-9]/){print $i} } }'
but it doesn't seem to work.
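Pattern matching is not really necessary. (Your attempt fails because $i==/regex/ compares the field with the 0-or-1 result of matching the whole line against the regex; the match operator is ~.) I'd suggest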
awk -F\| '{print $2}' filename
This splits the line into |-delimited fields and prints the second of them.
Alternatively,
cut -d\| -f 2 filename
achieves the same.
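If the extraction ever needs to happen inside a larger script, the same split-on-pipe logic is a one-liner in most languages. A minimal Python sketch (the function name accession is invented here, and it assumes every line contains at least two | characters, as in the sample):

```python
def accession(line):
    """Return the text between the first and second '|' on the line."""
    return line.split("|")[1]


# Two lines from the question, truncated after the first few columns.
lines = [
    "gene_id_100100 sp|Q53IZ1|ASDP_PSESP 35.81 148 90",
    "gene_id_100600 sp|Q49W80|Y1834_STAS1 31.31 99 63",
]
for line in lines:
    print(accession(line))
# Q53IZ1
# Q49W80
```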

replace string with multiple-string line using awk

I have two separate files and I was hoping to search and replace a string in file1 for an entire line of multiple strings in file2. I have been working on using awk but I am not sure how to replace a string for a line of strings. Below is an example of what I was looking to do.
The string to be replaced would match the first field of the line to replace it (multiple strings to insert in place of the single string). It's a "find and replace" task.
file1:
001 111 112 113 116 117
002 221 222
003 331
004
005 551 555
file2:
113 114 115
222 223 224 225 226 227
551 552 553 554
Desired output:
001 111 112 113 114 115 116 117
002 221 222 223 224 225 226 227
003 331
004
005 551 552 553 554 555
Try this:
awk 'NR==FNR{a[$1]=$0;next}{for(i=1;i<=NF;i++)$i=($i in a?a[$i]:$i)}1' file2 file1
001 111 112 113 114 115 116 117
002 221 222 223 224 225 226 227
003 331
004
005 551 552 553 554 555
We read file2 first and build an array indexed by the first column, with the entire line as the value.
For file1 we loop through each field; if the field is found in the array, we replace it with the stored line.
Here you go:
awk '
FILENAME == "file2" {
    key = $1
    map[key] = $0
    next
}
{
    for (i = 1; i <= NF; i++) {
        if ($i in map)      # test membership without creating empty entries
            $i = map[$i]
    }
    print
}
' file2 file1
001 111 112 113 114 115 116 117
002 221 222 223 224 225 226 227
003 331
004
005 551 552 553 554 555
This takes lines from file2 and populates an array called map with the whole line, keyed on the first field (I'm treating awk's associative array system more like a hash). For every other line, i.e. the lines of file1, it loops through each field, substitutes those that have an entry in map, and prints the result. Note that file2 must be given first so that map is populated before file1 is read.
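The two-pass logic (build a lookup from file2, then expand matching fields of file1) can also be sketched in Python. This is an illustration only: the function name expand is made up, and the lists below stand in for the contents of file1 and file2 from the question.

```python
def expand(file1_lines, file2_lines):
    """Replace each field of file1 that starts a file2 line with that whole line."""
    mapping = {}
    for line in file2_lines:
        fields = line.split()
        if fields:
            mapping[fields[0]] = fields      # key: first field, value: all fields

    result = []
    for line in file1_lines:
        expanded = []
        for field in line.split():
            expanded.extend(mapping.get(field, [field]))
        result.append(" ".join(expanded))
    return result


file1 = ["001 111 112 113 116 117", "002 221 222", "003 331", "004", "005 551 555"]
file2 = ["113 114 115", "222 223 224 225 226 227", "551 552 553 554"]
for line in expand(file1, file2):
    print(line)
# 001 111 112 113 114 115 116 117
# 002 221 222 223 224 225 226 227
# 003 331
# 004
# 005 551 552 553 554 555
```

As with the awk versions, fields are re-joined with single spaces, so any irregular whitespace in file1 is not preserved.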