replace string with multiple-string line using awk - replace

I have two separate files and I was hoping to search and replace a string in file1 for an entire line of multiple strings in file2. I have been working on using awk but I am not sure how to replace a string for a line of strings. Below is an example of what I was looking to do.
The string to be replaced in file1 matches the first field of a line in file2, and that whole line (multiple strings) is inserted in place of the single string. It's a "find and replace" task.
file1:
001 111 112 113 116 117
002 221 222
003 331
004
005 551 555
file2:
113 114 115
222 223 224 225 226 227
551 552 553 554
Desired output:
001 111 112 113 114 115 116 117
002 221 222 223 224 225 226 227
003 331
004
005 551 552 553 554 555

Try this:
awk 'NR==FNR{a[$1]=$0;next}{for(i=1;i<=NF;i++)$i=($i in a?a[$i]:$i)}1' file2 file1
001 111 112 113 114 115 116 117
002 221 222 223 224 225 226 227
003 331
004
005 551 552 553 554 555
This reads file2 first and builds an array keyed on the first field of each line, with the entire line as the value.
Then, for each line of file1, it loops over the fields and replaces any field that is a key in the array with the stored line.
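For comparison, the same map-then-substitute idea can be sketched in Python 3. This is only an illustration of the logic, not a replacement for the awk one-liner; the function name expand_fields and the in-memory lists standing in for file1 and file2 are assumptions made for the example.

```python
def expand_fields(file1_lines, file2_lines):
    """Replace each field of file1 that matches the first field of a file2 line."""
    # Map the first field of every file2 line to the whole line.
    replacements = {}
    for line in file2_lines:
        fields = line.split()
        if fields:
            replacements[fields[0]] = " ".join(fields)

    result = []
    for line in file1_lines:
        fields = line.split()
        # Swap in the full file2 line wherever a field is a known key.
        result.append(" ".join(replacements.get(f, f) for f in fields))
    return result


# Sample data copied from the question.
file1 = ["001 111 112 113 116 117", "002 221 222", "003 331", "004", "005 551 555"]
file2 = ["113 114 115", "222 223 224 225 226 227", "551 552 553 554"]
for line in expand_fields(file1, file2):
    print(line)
```

As with the awk version, the lookup table has to be complete before the first line of file1 is processed.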

Here you go:
awk '
FILENAME == "file2" {
    key = $1
    map[key] = $0
    next
}
{
    for (i = 1; i <= NF; i++) {
        # "in" tests membership without creating empty entries in map
        if ($i in map)
            $i = map[$i]
    }
    print
}
' file2 file1
001 111 112 113 114 115 116 117
002 221 222 223 224 225 226 227
003 331
004
005 551 552 553 554 555
This takes lines from file2 and populates an array called map with the whole line, keyed on the first field (I'm treating awk's associative array like a hash). For lines from file1, it loops through each field, substitutes those that have map values, then prints the result. Note that file2 must be given first so that the map array is populated before file1 is read.


SAS Read Two Variable Horizontal Data

I have a dataset that is set up as follows:
Standard
129 106 122 114 121 111 135 106 122 148 102 121 129 101 109 123 109 123 101 119
138 151 137 116 118 118 143 104 119 113 121 98 116 103 132 113 105 127 113 118
109 94 110 119 125 105 106 131 104 126 122 106 118 123 110 134 138 135 131 116
117 123 103 111 120 137 106 112 100 112 128 102 116 118 140 97 122 133 129 127
120 120 127 136 123 112 99 124 129 116 127 123 131 127 109 99 134 128 109 129
New
112 101 109 124 131 97 98 106 115 119 116 125 108 116 111 121 109 124 120 96
102 130 106 112 115 111 122 106 107 109 115 104 125 114 135 127 117 113 98 95
121 116 111 116 118 112 117 114 128 125 104 118 122 123 124 119 110 96 123 124
127 100 121 108 133 118 114 116 125 118 137 115 131 108 100 121 113 116 104 101
126 123 135 116 118 111 101 118 111 125 104 124 132 121 114 132 123 121 121 110
I have only ever read data into SAS that was in column form. I tried to set it up as two separate raw datasets, one Standard and one New, and then merge the two.
data blood1;
input Standard;
datalines;
129 106 122 114 121 111 135 106 122 148 102 121 129 101 109 123 109 123 101 119
138 151 137 116 118 118 143 104 119 113 121 98 116 103 132 113 105 127 113 118
109 94 110 119 125 105 106 131 104 126 122 106 118 123 110 134 138 135 131 116
117 123 103 111 120 137 106 112 100 112 128 102 116 118 140 97 122 133 129 127
120 120 127 136 123 112 99 124 129 116 127 123 131 127 109 99 134 128 109 129
;
But this only reads the first number of each row. What would be the best way to read this data in?
You need to add @@ to the input statement. That tells SAS not to advance the line pointer after each read.
data blood1;
input Standard @@;
datalines;
129 106 122 114 121 111 135 106 122 148 102 121 129 101 109 123 109 123 101 119
138 151 137 116 118 118 143 104 119 113 121 98 116 103 132 113 105 127 113 118
109 94 110 119 125 105 106 131 104 126 122 106 118 123 110 134 138 135 131 116
117 123 103 111 120 137 106 112 100 112 128 102 116 118 140 97 122 133 129 127
120 120 127 136 123 112 99 124 129 116 127 123 131 127 109 99 134 128 109 129
;
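For readers who think in Python, the effect of the corrected input statement is roughly the following Python 3 sketch: every whitespace-separated value becomes its own observation, regardless of which row it sat on. This is only an analogy, not a description of how SAS works internally; the name read_stream and the shortened sample rows are assumptions for the example.

```python
def read_stream(text):
    """Return every whitespace-separated number as its own observation."""
    return [int(token) for token in text.split()]


# Two shortened rows standing in for the datalines block.
rows = """129 106 122
138 151 137"""
standard = read_stream(rows)
print(standard)
```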

Getline puts program on pause

When "text" is larger than about 280 numbers the program waits ...
With any "text" of 280 numbers, it works fine.
#include <iostream>
#include <string>
using namespace std;
int main()
{
string text;
getline (cin, text);
cout << text;
}
eg:

The only reason I can imagine for your code pausing is that you were overfilling the space allocated for the string, which seems extremely unlikely. The maximum size of a string is bounded by size_t, which is typically 32 or 64 bits wide depending on the platform. Even with a 32-bit size_t, the limit is on the order of 2^32 - 1 characters. Obviously, you are nowhere near reaching that length.
In normal circumstances, you are far more likely to be bound by the amount of space that your system can allocate for the string. This is dependent on the amount of memory available and how your system breaks it up.
Both of these situations seem extremely unlikely.
In this situation, the problem most likely lies with your compiler or platform...
What's your reason for taking in such a long string? Have you considered other approaches like reading in the integers from a text file? You could use a vector to store the integers in a more organized way by reading them in individually from the text file, just a thought.

select lines with n identical fields, after a sentinel character

In my file, each line includes five numerical fields delimited by spaces (preceded and followed by more fields). Via a shell script I need to be able to select lines with exactly 3, 4 and 5 identical entries among those five numeric fields (i.e. three separate searches, such that the search for lines with 3 matches in those fields does not also return lines with 4 or 5 matches in those fields).
To find the relevant fields, my search will have to locate the first open and closed parenthesis pair on a line. After the parenthesis is closed, the immediately following five fields are the ones I'm interested in. One potential complication: sometimes one or more of the numeric fields is replaced by a single dash/hyphen instead of a number. One potential means of simplification: the five fields will be in (non-strictly) ascending order and any hyphen entries will always precede the remaining numeric fields.
I would be grateful for some sed/awk suggestions with this. Many thanks!
[EDIT]: I can extract the relevant fields (as detailed in the comment below), thus the strike-through paragraph above is unnecessary. Here is sample data once the relevant fields are extracted:
109 110 111 111 112
110 110 111 111 112
99 99 99 112 112
99 99 99 112 112
100 101 101 112 112
102 102 102 112 112
102 102 103 112 112
102 103 103 112 112
102 104 104 112 112
102 104 104 112 112
103 104 104 112 112
102 105 105 112 112
102 105 105 112 112
103 105 105 112 112
102 106 106 112 112
102 106 107 112 112
103 106 107 112 112
104 106 107 112 112
102 107 107 112 112
104 107 107 112 112
104 107 107 112 112
106 107 108 112 112
107 107 108 112 112
107 107 108 112 112
102 109 109 112 112
102 109 109 112 112
104 109 109 112 112
102 109 110 112 112
103 109 110 112 112
104 109 110 112 112
102 110 110 112 112
104 110 110 112 112
104 110 110 112 112
107 109 111 112 112
107 109 111 112 112
106 110 111 112 112
107 110 111 112 112
107 110 111 112 112
109 110 112 112 112
110 110 112 112 112
107 112 112 112 112
112 112 112 112 112
This should produce hits when n=3 on these lines:
99 99 99 112 112
99 99 99 112 112
102 102 102 112 112
109 110 112 112 112
110 110 112 112 112
A hit when n=4 on this line:
107 112 112 112 112
and a hit when n=5 on this line:
112 112 112 112 112
Here is a Bash script solution using awk. It reads the file line by line and uses an AWK associative array to count how many times a number appeared on the line. Change filename.txt to your file that contains the numbers.
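n=3
while IFS= read -r line
do
    echo "$line" | awk -v n="$n" '
    {
        for (i = 1; i <= NF; i++) {
            a[$i]++    # count occurrences of each field value
        }
        for (o in a) {
            if (a[o] == n) {
                print
                break  # print the line once even if several values occur n times
            }
        }
    }
    '
done < filename.txt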
You can do it using sed as well.
You can create a script:
n=$(($1-1))
sed -n "/\([0-9]*\)\( \1\)\{$n\}/p" filename
And run it like this, just supply n as a script argument:
./script.sh 3
Output:
99 99 99 112 112
99 99 99 112 112
102 102 102 112 112
109 110 112 112 112
110 110 112 112 112
Awk-only solution as a one-liner:
awk -v n=3 '{for(i=1;i<=NF;i++)a[$i]++;for(o in a)if(a[o]==n)p=1} p; {p=0;delete a}' inputfile
Split out for easier reading, this slightly resembles badjr's solution. (I've used his variables for easier comparison.)
{
for (i=1;i<=NF;i++) # populate an array with counts of unique elements
a[$i]++
for (o in a) # check the array for a matching count & set flag
if (a[o]==n)
p=1
}
p; # if we've set our flag, print the current line.
{ # clear our workspace for the next line.
p=0
delete a
}
If you're interested in a bash-only solution, the following implements the same awk logic, only without awk:
#!/usr/bin/env bash
n=5
while read -a a; do
unset b
for i in "${!a[@]}"; do
(( b[${a[$i]}]++ ))
done
for i in "${b[@]}"; do
[ "$i" -eq "$n" ] && echo "${a[@]}"
done
done < inputfile
Note that because the output here is printed using array elements, whitespace in the input file will not be maintained.
This solution is bash-only because of its use of arrays.
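The counting logic shared by the awk and bash versions can also be sketched in Python 3, which may make the "exactly n" condition easier to see. This is a minimal sketch under stated assumptions: the function name lines_with_n_identical is made up for the example, the sample is a handful of lines taken from the question, and a hyphen field would be counted like any other value.

```python
from collections import Counter


def lines_with_n_identical(lines, n):
    """Yield lines in which some field value occurs exactly n times."""
    for line in lines:
        counts = Counter(line.split())
        # A line with 4 identical fields is not a hit for n=3,
        # because its count is 4, not 3.
        if n in counts.values():
            yield line


sample = [
    "109 110 111 111 112",
    "99 99 99 112 112",
    "109 110 112 112 112",
    "107 112 112 112 112",
    "112 112 112 112 112",
]
for n in (3, 4, 5):
    print(n, list(lines_with_n_identical(sample, n)))
```

Because the test is on the count of each distinct value, the three searches (n=3, n=4, n=5) do not overlap for a single repeated value.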
Another sed example, didn't want my work to go to waste ;)
#!/bin/bash
while (($1 > 0))
do
n="${n} \1"
set ${1}-1
done
sed -nr "\_\<([0-9]+)${n}\>_ p"
EDIT:
On BSD sed (OS X) you need to replace \< and \> with the fascinating [[:<:]] and [[:>:]] respectively.

How to generate only so many permutations more efficiently in C++?

I'm trying to solve a problem and I feel like I'm really close to it but it's still a little slow because I'm generating so many permutations.
I need the permutations of "0123456789". I know there are (10)! permutations which is a lot.
I use an std::unordered_set because I don't care about the order they are stored in and it seemed faster than using a regular std::set
Here is some code I wrote. max_perm_size is the size of the string of permutations I care about for a particular case.
void getPermutations(unordered_set<string> &permutations, int &max_perm_size)
{
string digits = "0123456789";
do{
permutations.insert(digits.substr(0, max_perm_size));
} while (next_permutation(digits.begin(), digits.end()));
}
I have two main questions about this code:
Above, I'm still generating the entire "0123456789" permutation even for cases where I only care about permutations of size max_perm_size. I just trim them afterwards before storing them in my std::unordered_set. Is there a better, faster way to do this?
For the worst case, max_perm_size = 10, is there a more efficient way for me to generate and store all of these permutations in general?
As far as I can tell, your result is numbers (without repeated digits) from 0 through some limit. Since you say you don't care about the order of the digits, it's probably easiest if we just stick to ascending ones. That being the case, we can generate results like this:
#include <iostream>
int main() {
for (int i=0; i<10; i++)
for (int j=i+1; j<10; j++)
for (int k=j+1; k<10; k++)
std::cout << i << j << k << "\t";
}
Result:
012 013 014 015 016 017 018 019 023 024 025 026 027
028 029 034 035 036 037 038 039 045 046 047 048 049
056 057 058 059 067 068 069 078 079 089 123 124 125
126 127 128 129 134 135 136 137 138 139 145 146 147
148 149 156 157 158 159 167 168 169 178 179 189 234
235 236 237 238 239 245 246 247 248 249 256 257 258
259 267 268 269 278 279 289 345 346 347 348 349 356
357 358 359 367 368 369 378 379 389 456 457 458 459
467 468 469 478 479 489 567 568 569 578 579 589 678
679 689 789
If your limit isn't a number of digits, you can put the digits together into an actual int and compare (then break out of the loops).
This is specific to your case, not a general solution, but for the case of digit permutations, you can do:
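#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>
#include <unordered_set>

void getPermutations(std::unordered_set<std::string> &permutations, int max_perm_size)
{
    if (max_perm_size < 1 || max_perm_size > 10) return;
    uint64_t stopat = 1;
    for (int i = 0; i < max_perm_size; ++i) {
        stopat *= 10; // stopat ends up as 10^max_perm_size
    }
    for (uint64_t dig = 0; dig < stopat; ++dig) {
        std::ostringstream ss;
        ss << std::setw(max_perm_size) << std::setfill('0') << dig;
        const std::string s = ss.str();
        // Skip candidates that repeat a digit; they are not permutations.
        if (std::unordered_set<char>(s.begin(), s.end()).size() == s.size())
            permutations.insert(s);
    }
}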
You can take a substring of your digits string first. Then your loop will only deal with permutations of max_perm_size.
You could create a class that generates permutations on demand instead of pre-generating and storing them beforehand. Depending on your application, you may not even have to store them.
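The idea of producing permutations on demand, and only of the length that is needed, can be sketched in Python 3 with the standard library's itertools.permutations, which takes the desired length as its second argument. This illustrates the approach rather than offering a C++ drop-in; the generator name digit_permutations is an assumption made for the example.

```python
from itertools import islice, permutations


def digit_permutations(max_perm_size, digits="0123456789"):
    """Lazily yield every ordered arrangement of max_perm_size distinct digits."""
    for combo in permutations(digits, max_perm_size):
        yield "".join(combo)


# Nothing is stored: values are produced one at a time as they are consumed.
print(list(islice(digit_permutations(3), 5)))
print(sum(1 for _ in digit_permutations(3)))
```

Since each arrangement is generated exactly once, no set is needed to remove the duplicates that trimming full-length permutations would produce.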

does this code generate random numbers?

a = 100
for b in range(10,a):
    c = b%10
    if c == 0:
        c += 3
    c = c*b
    print c
I was trying to make a random generator without using random function and I made this, does it generate random numbers?
Short Answer:
No.
Your code will print
30 11 24 39 56 75 96 119 144 171 60 21 44 69 96 125 156 189 224 261 90 31 64 99 136 175 216 259 304 351 120 41 84 129 176 225 276 329 384 441 150 51 104 159 216 275 336 399 464 531 180 61 124 189 256 325 396 469 544 621 210 71 144 219 296 375 456 539 624 711 240 81 164 249 336 425 516 609 704 801 270 91 184 279 376 475 576 679 784 891
every time.
Computers and programs like these are deterministic. If you sat down with pen and paper, you could tell me exactly which of these numbers would occur and when.
Random number generation is difficult. What I would recommend is using the time to (seem to) randomize the output.
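One way to see the determinism is to run the same loop twice and compare the results. The following is a Python 3 sketch of the loop from the question, wrapped in a function (the name sequence is an assumption for the example) that returns the values instead of printing them.

```python
def sequence(a=100):
    """Python 3 version of the loop from the question, returning the values."""
    values = []
    for b in range(10, a):
        c = b % 10
        if c == 0:
            c += 3
        values.append(c * b)
    return values


first_run = sequence()
second_run = sequence()
print(first_run[:5])
print(first_run == second_run)
```

Nothing in the loop depends on anything outside the program, so the two runs cannot differ.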
import time
print int(time.time() % 10)
This will give you a "random" number between 0 and 9.
time.time() gives you the number of seconds since the epoch as a floating-point number, so we have to cast to an int if we want a whole number.
Caveat: This solution is not truly random, but will act in a much more "random" fashion.