I am trying to use a .txt file with around 5000 patterns (one per line) to search another file of 18000 lines for any matches. So far I've tried every form of grep and awk I can find on the internet and it's still not working, so I am completely stumped.
Here's some text from each file.
Pattern.txt
rs2622590
rs925489
rs2798334
rs6801957
rs6801957
rs13137008
rs3807989
rs10850409
rs2798269
rs549182
There are no extra spaces or anything.
File.txt
snpid hg18chr bp a1 a2 zscore pval CEUmaf
rs3131972 1 742584 A G 0.289 0.7726 .
rs3131969 1 744045 A G 0.393 0.6946 .
rs3131967 1 744197 T C 0.443 0.658 .
rs1048488 1 750775 T C -0.289 0.7726 .
rs12562034 1 758311 A G -1.552 0.1207 0.09167
rs4040617 1 769185 A G -0.414 0.6786 0.875
rs4970383 1 828418 A C 0.214 0.8303 .
rs4475691 1 836671 T C -0.604 0.5461 .
rs1806509 1 843817 A C -0.262 0.7933 .
The file.txt was downloaded directly from a med directory.
I'm pretty new to UNIX so any help would be amazing!
Edit: sorry, I have definitely tried every single thing you are recommending and the result is blank. Am I maybe missing a syntax issue or something in my text files?
P.P.S. I know there are matches, as doing individual greps works. I'll move this question to unix.stackexchange. Thanks for your answers; I'll try them all out.
Issue solved: my files had DOS carriage returns (CRLF line endings). I didn't know about this before, so thank you to everyone who answered. For future users who hit this issue, here is the solution that worked:
dos2unix *
awk 'NR==FNR{p[$0];next} $1 in p' Patterns.txt File.txt > Output.txt
You can use grep -Fw here:
grep -Fw -f Pattern.txt File.txt
Options used are:
-F - Fixed-string search; treat the patterns as literal strings, not regexes
-w - Match whole words only
-f file - Read the patterns from a file
I don't know if it's what you want or not, but this will print every line from File.txt whose first field equals a string from Patterns.txt:
awk 'NR==FNR{p[$0];next} $1 in p' Patterns.txt File.txt
If that is not what you want, tell us what you do want. If it is what you want but doesn't produce the output you expect, then one or both of your files contain control characters courtesy of being created in Windows, so run dos2unix or similar on them both first.
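For example, a quick way to check for and strip the Windows line endings (assuming GNU tools; the file names just mirror the ones used above):
cat -v Patterns.txt | head        # CRLF line endings show up as a trailing ^M
dos2unix Patterns.txt File.txt    # or: sed -i 's/\r$//' Patterns.txt File.txt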
Use a shell script to read each line of the file containing your patterns, then fgrep for them.
#!/bin/bash
# $1 is the file of patterns, one per line
FILENAME=$1
# Print each pattern line and feed the lot to fgrep as its pattern list (-f -),
# then search PATTERNFILE.txt (the file being scanned) for fixed-string matches
awk '{print $0}' "$FILENAME" | fgrep -f - PATTERNFILE.txt
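With the script saved as, say, match.sh (the name is just an example), you would pass the pattern file as the first argument; the file being searched (PATTERNFILE.txt) is hard-coded inside the script:
chmod +x match.sh
./match.sh Pattern.txt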
Here is the data I want to capitalize:
molly w. bolt 334-78-5443
walter q. bugg 984-49-0032
noah p. way 887-12-0921
kerry t. bricks 431-09-1239
ping h. yu 109-32-9845
Here is the script I have written so far to capitalize the first letter of each name, including the middle initial:
h
s/\(.\).*/\1/
y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
G
s/\(.\)\n\(.\)\(.*\)/\1\3/
/ [a-z]/{
h
s/\([A-Z][a-z]* \)\([a-z]\).*/\2/
y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
G
s/\(.\)\n\([A-Z][a-z]* \)\(.\)\(.*\)/\2\1\4/
}
/ [a-z]/{
h
s/\([A-Z][a-z]* \)\([a-z]\).*/\2/
y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/
G
s/\(.\)\n\([A-Z][a-z]* \)\(.\)\(.*\)/\2\1\4/
}
It gives me:
MOLLY W. BOLT 334-78-544Molly 3. bolt 334-78-5443
WALTER Q. BUGG 984-49-003Walter 2. bugg 984-49-0032
NOAH P. WAY 887-12-092Noah 1. way 887-12-0921
KERRY T. BRICKS 431-09-123Kerry 9. bricks 431-09-1239
PING H. YU 109-32-984Ping 5. yu 109-32-9845
I want to have only:
Molly W. Bolt 334-78-544
Walter Q. Bugg 984-49-003
Noah P. Way 887-12-092
Kerry T. Bricks 431-09-123
Ping H. Yu 109-32-984
What would I change?
How about this (GNU sed):
$ sed 's/\b[a-z]/\u&/g' myfile
Molly W. Bolt 334-78-5443
Walter Q. Bugg 984-49-0032
Noah P. Way 887-12-0921
Kerry T. Bricks 431-09-1239
Ping H. Yu 109-32-9845
GNU sed, which should work with UTF-8 too:
sed -E 's/[[:alpha:]]+/\u&/g'
#or
sed -E 's/\S+/\u&/g'
Or with perl:
perl -pe 's/(\w+)/\u$1/g'
search for "word" strings: \w+
substitute (s///) each match $1 with its first character uppercased: \u
everywhere in the line: g
Or, more simply:
perl -pe 's/\S+/\u$&/g'
\S+ matches any non-whitespace string, and \u$& capitalizes it.
The variant
perl -CSDA -pe 's/\S+/\u$&/g'
will work with UTF-8 encoded files too; e.g. given the input
павел андреевич чехов 234
γεοργε πατσασογλοθ 123
čajka šumivá 345
it will print
Павел Андреевич Чехов 234
Γεοργε Πατσασογλοθ 123
Čajka Šumivá 345
For in-place editing of the file, use:
perl -i.bak -CSDA -pe 's/\S+/\u$&/g' some filenames ....
This will create a .bak backup of each file.
If you have bash 4.2+ and only need to convert values held in variables, you can use:
for name in павел андреевич чехов γεοργε πατσασογλοθ čajka šumivá
do
echo "${name^}" #capitalize the $name
done
prints
Павел
Андреевич
Чехов
Γεοργε
Πατσασογλοθ
Čajka
Šumivá
Also, here is a solution for a sed that doesn't know \u: https://stackoverflow.com/a/11804643/632407
Quite simple with Python (2.x) too:
$ python -c 'with open("myfile") as f:print f.read().title()'
https://docs.python.org/2/library/stdtypes.html
sed 's/^/ /;s/ [aA]/ A/g;s/ [bB]/ B/g;s/ [cC]/ C/g;s/ [dD]/ D/g;s/ [eE]/ E/g;s/ [fF]/ F/g;s/ [gG]/ G/g;s/ [hH]/ H/g;s/ [iI]/ I/g;s/ [jJ]/ J/g;s/ [kK]/ K/g;s/ [lL]/ L/g;s/ [mM]/ M/g;s/ [nN]/ N/g;s/ [oO]/ O/g;s/ [pP]/ P/g;s/ [qQ]/ Q/g;s/ [rR]/ R/g;s/ [sS]/ S/g;s/ [tT]/ T/g;s/ [uU]/ U/g;s/ [vV]/ V/g;s/ [wW]/ W/g;s/ [xX]/ X/g;s/ [yY]/ Y/g;s/ [zZ]/ Z/g;s/^.//' YourFile
POSIX (non-GNU sed) version:
It works on your sample, but not on input like {andrea,georges ...; it assumes words are at the start of the line OR after a space character.
I'm getting a "command garbled" error from sed again, most probably because I have a very old version of sed, but given my constraints I can't change the version of sed (!).
My question is this: I wrote a simple regex that matches my string file, such as:
/[^,]*$/mg
My string file is this :
23:53:20,650
23:53:20,654
23:53:20,655
23:53:20,656
23:53:21,238
23:53:21,240
23:53:21,302
23:53:21,303
23:53:21,304
23:53:21,305
23:53:21,889
23:53:21,890
23:53:21,896
23:53:21,897
23:53:21,898
23:53:21,899
23:53:22,492
23:53:22,538
23:53:22,539
23:53:23,109
23:53:23,110
23:53:23,115
23:53:23,117
23:53:23,118
23:53:23,119
23:53:23,690
23:53:23,721
23:53:23,722
23:53:24,275
23:53:24,276
23:53:24,313
23:53:24,316
23:53:24,317
23:53:24,318
23:53:24,854
23:53:24,888
23:53:24,889
23:53:24,890
23:53:24,891
23:53:50,676
23:53:50,677
23:53:50,711
23:53:50,713
23:53:50,714
23:53:51,257
23:53:51,258
23:53:51,296
23:53:51,297
23:53:51,298
23:53:51,820
23:53:51,822
23:53:51,823
23:53:52,358
23:53:52,364
23:53:52,367
23:53:52,909
23:53:52,910
23:53:52,936
23:53:52,939
23:53:52,941
23:53:52,944
23:53:52,945
23:53:52,946
23:53:52,949
23:53:52,953
23:53:52,956
23:53:52,959
23:53:52,963
23:53:52,966
23:53:52,970
23:53:52,971
23:53:52,974
23:53:52,978
23:53:52,980
23:53:52,983
23:53:52,984
23:53:52,986
23:53:52,987
23:53:52,989
23:53:52,990
23:53:52,991
23:53:52,994
23:53:52,995
23:53:52,999
23:53:53,001
23:53:53,002
23:53:53,004
23:53:53,005
23:53:53,007
23:53:53,010
23:53:53,026
23:53:53,027
23:53:53,081
23:53:53,082
23:53:53,083
23:53:53,085
07:32:54,519
07:32:54,521
07:32:54,537
07:32:54,538
07:32:54,539
07:32:54,540
07:32:54,541
07:32:54,542
07:32:54,543
07:32:54,544
07:32:54,545
07:32:54,546
07:32:54,547
07:32:54,548
07:32:54,549
07:32:54,550
I'm trying to get the values after the comma and then assign them to an array. When I used a sed command like:
sed -n '/[^,]*$/mg' file
it says "command garbled". I read about multiline sed but I still couldn't reach a solution. I am new to regexes, so any help will be appreciated.
Thank you in advance!
If you are using a "recent" bash, I think you can use cut and assign extracted values to an array:
numbers="$(cut -d',' -f2 filename.txt)"
array_numbers=( $numbers )
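If the word splitting and glob expansion of the unquoted $numbers is a concern, a bash 4+ alternative is mapfile, which reads one value per line straight into an array (the array name here is just an example):
mapfile -t array_numbers < <(cut -d',' -f2 filename.txt)
echo "${array_numbers[0]}"    # first value after the comma, e.g. 650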
If you want to get the values after the comma, then you could use the sed command below, which removes everything from the start of the line up to the first comma.
sed 's/^[^,]*,//' file
OR
sed 's/^.*,//' file
Example:
$ echo '23:53:22,492' | sed 's/^[^,]*,//'
492
$ echo '23:53:22,492' | sed 's/^.*,//'
492
sed 's/.*,//' file
would match everything up to the last comma (here each line has only one) and substitute the match with nothing, which effectively gives the values after the comma.
for the input file
23:53:20,650
23:53:20,654
23:53:20,655
23:53:20,656
23:53:21,238
23:53:21,240
23:53:21,302
23:53:21,303
23:53:21,304
23:53:21,305
23:53:21,889
23:53:21,890
23:53:21,896
23:53:21,897
23:53:21,898
23:53:21,899
23:53:22,492
23:53:22,538
will produce output as
650
654
655
656
238
240
302
303
304
305
889
890
896
897
898
899
492
538
I have a series of text files that each contain the string "Address" twice in different parts of the file, and later the string "Subscriber Address", making for three total appearances of "Address". Using sed, I'd like to harvest data immediately following the first instance of "Address" in each file while ignoring the rest. Sometimes the full address will appear in two lines as follows...
Address
100 MAIN ST
STRATFORD CT 06614
And sometimes the address line will wrap, moving the City, State and ZIP to a third line as follows...
Address
NO 10 GREEN ACRES
LANE
SHELTON CT 06484
I'd like to store the output in variables: Address1, Address2, City, State and Zip. Using each of the examples above, the desired outcome would be...
Address1=100 MAIN ST
City=STRATFORD
State=CT
Zip=06614
Address1=NO 10 GREEN ACRES
Address2=LANE
City=SHELTON
State=CT
Zip=06484
A suitable alternative in the second example would be to concatenate address lines 1 and 2, resulting in the following...
Address1=NO 10 GREEN ACRES LANE
City=SHELTON
State=CT
Zip=06484
I know that this is a lot to ask. Any help is very much appreciated.
Sed is not intended for this purpose. It operates on single lines and does not easily keep history across them.
You could switch to e.g. an AWK clone (awk, gawk, nawk).
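For illustration only, here is a minimal awk sketch of that idea; it is untested against your real files and assumes the block of interest starts at the first line that is exactly "Address", is followed by one or two street lines, and ends with a "CITY ST 12345" line:
awk '
  $0 == "Address" && !done { grab = 1; n = 0; next }   # start at the first bare "Address" line
  grab {
    if ($0 ~ /[0-9][0-9][0-9][0-9][0-9]$/) {           # a line ending in a 5-digit ZIP closes the block
      zip = $NF; state = $(NF-1)
      city = $0; sub(" " state " " zip "$", "", city)
      printf "Address1=%s\n", addr[1]
      if (n > 1) printf "Address2=%s\n", addr[2]
      printf "City=%s\nState=%s\nZip=%s\n", city, state, zip
      grab = 0; done = 1                                # ignore later "Address" occurrences
    } else {
      addr[++n] = $0                                    # collect the street line(s)
    }
  }
' file.txt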
Here is my attempt to do this:
$ cat file
test
First Address
100 MAIN ST STRATFORD CT 06614
test
Second Address
100 MAIN ST
STRATFORD CT 06614
test
Third Address
NO 10 GREEN ACRES
LANE
SHELTON CT 06484
test
$ sed -n '/Address/{:start;N;/[^0-9]$/b start;s/\n/|/g;p}' file |
sed 1d |
sed 's/^Address|//;s| \([0-9]\+\)$|\nZip: \1|' |
sed 's| \([A-Z]\+\)$|\nState: \1|'|
sed 's/|\([^|]\+\)$/\nCity: \1/' |
sed '/^[^:]\+$/s|\(.*\)|Address: \1|;s/|/ /g'
Address: Second Address 100 MAIN ST
City: STRATFORD
State: CT
Zip: 06614
Address: Third Address NO 10 GREEN ACRES LANE
City: SHELTON
State: CT
Zip: 06484
(Let me not explain exactly how it works :-))
P.S. The idea behind this long command is to transform the file into lines containing only the addresses; after that we delete the first line and continue with the others. Using regular expressions, we transform each address into the required format.
sed -ne '/./{H;$!d;}' -e 'x;/Address/,/^$/!d' -e 's/\n/#/g;s/#Address#//' -e 's/\(.*\)#\(.*\)#\(.*\)/Address1=\1\nAddress2=\2\n\3\n/;s/\(.*\)#\(.*\)/Address1=\1\n\2\n/;s/\([a-Z]*\)\s\([a-Z][a-Z]\)\s\([0-9]\{5\}\)/City=\1\nState=\2\nZip=\3/p' addr.txt
This flattens the addresses out and formats them; then you just need to uniq them.
Again a SED question from me :)
So, same as last time, I'm wrestling with phone numbers. This time the problem is a bit different.
I currently have this kind of organization in my text file:
Areacode: List of phone numbers:
4444 NUM:111111 NUM:2222222 NUM:33333333
5555 NUM:1111111 NUM:2222 NUM:3333333 NUM:44444444 NUM:5555555
Now, every area code can have an unknown number of phone numbers, and the phone numbers are not fixed in length.
What I would like to know is how I could combine the area code and each phone number to look something like this:
4444-111111, 4444-2222222, 4444-33333333
My first idea was again to add a line break before each phone number and match these sections with a regex, then just prepend the first remembered item to the second, the first to the third, and so on:
\1-\2, \1-\3, etc
But of course, since sed can only remember 9 back-references and there can be more than 10 numbers on one line, this doesn't work. The non-fixed length of the phone number list also makes it a no-go.
I'm again looking primarily for a sed option, as I've been trying to get proficient with it, but more efficient solutions with other tools are of course definitely welcome!
$ sed '1d;s/NUM:/ /g' input.txt | awk '{for(i=2;i<=NF;i++)printf("%s-%s%s", $1, $i, i==NF?"\n":",")}'
4444-111111,4444-2222222,4444-33333333
5555-1111111,5555-2222,5555-3333333,5555-44444444,5555-5555555
This might work for you:
sed '1d;:a;s/^\(\S*\)\(.*\)NUM:/\1\2,\1-/;ta;s/[^,]*,//;s/ //g' file
4444-111111,4444-2222222,4444-33333333
5555-1111111,5555-2222,5555-3333333,5555-44444444,5555-5555555
or:
awk 'NR>1{gsub(/NUM:/,","$1"-");sub(/[^,]*,/,"");gsub(/ /,"");print}' file
4444-111111,4444-2222222,4444-33333333
5555-1111111,5555-2222,5555-3333333,5555-44444444,5555-5555555
TXR:
#(collect)
#area #(coll :mintimes 1)NUM:#{num /[0-9]+/}#(end)
#(output)
#(rep)#area-#num, #(last)#area-#num#(end)
#(end)
#(end)
Run:
$ txr phone.txr phone.txt
4444-111111, 4444-2222222, 4444-33333333
5555-1111111, 5555-2222, 5555-3333333, 5555-44444444, 5555-5555555
$ cat phone.txt
Areacode: List of phone numbers:
4444 NUM:111111 NUM:2222222 NUM:33333333
5555 NUM:1111111 NUM:2222 NUM:3333333 NUM:44444444 NUM:5555555