Find matches between 2 files - regex

I'm trying to output matching lines in 2 files using AWK. I made it easier by making 2 files with just one column each; they're phone numbers. I found many people asking the same question and getting the answer to use:
awk 'NR==FNR{a[$1];next}$1 in a{print $1}' file1 file2
The problem I encountered was that it simply doesn't want to work. The first file is small (~5MB) and the second file is considerably larger (~250MB).
I have some general knowledge of AWK and know that the above script should work, yet I'm unable to figure out why it's not.
Is there any other way I can achieve the same result?
grep is a nice tool, but it clogs up the RAM and dies within seconds due to the file size.
I ran some spot checks to find out whether there are matches: when I grep'd random numbers from the smaller file through the big one, I did find matches, so I'm sure there are some.
Any help is appreciated!
[edit as requested by @Jaypal]
Sample data from both files:
File1:
01234567895
01234577896
01234556894
File2:
01234642784
02613467246
01234567895
Output:
01234567895
What I get:
xxx@xxx:~$ awk 'NR==FNR{a[$1];next}$1 in a{print $1}' file1 file2
xxx@xxx:~$

Update
The problem happens to be with the kind of file you were using. Apparently it came from a DOS system and had many \r (carriage return) characters around. To solve it, sanitize the files with:
dos2unix
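If dos2unix isn't available, stripping the carriage returns with tr should work too (a sketch; replace file2 with whichever file actually carries the \r characters):
tr -d '\r' < file2 > file2.clean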
Former answer
Your awk is fine. However, you can also compare files with grep -f:
grep -f file1 file2
This looks for the patterns listed in file1 among the lines of file2.
You can add options for better matching:
grep -wFf file1 file2
-w matches whole words only
-F treats the patterns as fixed strings, not regexes.
Examples
$ cat a
hello
how are
you
I am fine areare
$ cat b
hel
are
$ grep -f b a
hello
how are
I am fine areare
$ grep -wf b a
how are
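Applied to the phone-number files from the question, the whole-word fixed-string variant gives exactly the expected match (once the \r characters are gone):
$ grep -wFf file1 file2
01234567895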

Related

Unable to create sed substitution to deduplicate file

I have a file with many duplicates of the form
a
a
b
b
c
c
Which I need to reduce to
a
b
c
So I wrote a sed command, sed -r 's/^(.*)$\n^(.*)$/\1/mg' filename, but the file was still showing duplicates. However, I'm sure this regex works because I tested it in an online regex tester.
So what am I doing wrong?
I suspect it may be related to the -r option, as I'm not really sure what that does (but without it I get an invalid reference \1 on 's' command's RHS error).
Your sed doesn't work because sed reads its input line by line, so the pattern space never contains a newline for \n to match. Either of two simpler approaches should work for you.
A simple awk command that prints each line only the first time it is seen, by maintaining an array of already-printed lines:
awk '!seen[$0]++' file
a
b
c
Since the file is already sorted, you can also use uniq:
uniq file
a
b
c
Edit: newer GNU awk versions (4.1+) also support in-place editing:
awk -i inplace '!seen[$0]++' file
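For the record, you can also do this in sed, but you have to pull the next line into the pattern space yourself with N; the classic one-liner for collapsing consecutive duplicates is:
sed '$!N; /^\(.*\)\n\1$/!P; D' file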

How to use the sed command in a shell script to replace strings in a txt file present in another directory?

I am very new to shell scripting and trying to learn the "sed" command functionality.
I have a file called configurations.txt with some variables defined in it, each initialised to a string value.
I am trying to replace strings in a file called values.txt, which lives in another directory, with the values of those variables.
Data present in configurations.txt:
mem="cpu.memory=4G"
proc="cpu.processor=Intel"
Data present in values.txt (which lives in /home/cpu/script):
cpu.memory=1G
cpu.processor=Dell
I am trying to make a shell script called repl.sh. I don't have a lot of code in it for now, but here is what I've got:
#!/bin/bash
source /home/configurations.txt
sed <need some help here>
Expected output: after I run the script with sh repl.sh, values.txt should contain:
cpu.memory=4G
cpu.processor=Intel
These were originally 1G and Dell.
Would highly appreciate some quick help. Thanks
This question shows no attempt at a solution and reads like "please do this concrete thing for me", so it's unlikely that anyone will provide a complete solution. What you should do is split the task into a number of small pieces.
1) Iterate over configurations.txt and get the values from each line. To do that you need to extract X and Y from each value="X=Y" string.
This regex could be helpful here: ([^=]+)=\"([^=]+)=([^=]+)\". It contains 3 capturing groups: the variable name, the key, and the value. For example,
>> sed -r 's/([^=]+)=\"([^=]+)=([^=]+)\"/\1/' configurations.txt
mem
proc
>> sed -r 's/([^=]+)=\"([^=]+)=([^=]+)\"/\2/' configurations.txt
cpu.memory
cpu.processor
>> sed -r 's/([^=]+)=\"([^=]+)=([^=]+)\"/\3/' configurations.txt
4G
Intel
2) For each X and Y, find X=Z in values.txt and substitute it with X=Y.
For example, let's change cpu.memory value in values.txt with 4G:
>> X=cpu.memory; Y=4G; sed -r "s/(${X}=).*/\1${Y}/" values.txt
cpu.memory=4G
cpu.processor=Dell
Use the -i flag to make the changes in place.
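Putting the two steps together, a minimal sketch of repl.sh (assuming every line of configurations.txt has the name="X=Y" form shown above):
#!/bin/bash
while IFS= read -r line; do
    # pull the key (group 2) and the new value (group 3) out of name="X=Y"
    X=$(echo "$line" | sed -r 's/([^=]+)=\"([^=]+)=([^=]+)\"/\2/')
    Y=$(echo "$line" | sed -r 's/([^=]+)=\"([^=]+)=([^=]+)\"/\3/')
    # replace the existing X=... line in values.txt with X=Y
    sed -r -i "s/(${X}=).*/\1${Y}/" /home/cpu/script/values.txt
done < /home/configurations.txt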
Here is an awk-based answer:
$ cat config.txt
cpu.memory=4G
cpu.processor=Intel
$ cat values.txt
cpu.memory=1G
cpu.processor=Dell
cpu.speed=4GHz
$ awk -F= 'FNR==NR{a[$1]=$2; next;}; {if($1 in a){$2=a[$1]}}1' OFS== config.txt values.txt
cpu.memory=4G
cpu.processor=Intel
cpu.speed=4GHz
Explanation: first read config.txt and save it in memory. Then read values.txt; if a particular key was defined in config.txt, use the saved value from memory instead.
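Note that this prints to stdout; to actually update values.txt, one option (a sketch) is to redirect to a temporary file and move it back:
awk -F= 'FNR==NR{a[$1]=$2; next;}; {if($1 in a){$2=a[$1]}}1' OFS== config.txt values.txt > values.tmp && mv values.tmp values.txt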

batch renaming of files with perl expressions

This should be a basic question for a lot of people, but I am a biologist with no programming background, so please excuse my question.
What I am trying to do is rename about 100,000 gzipped data files whose existing names are a code (example: XG453834.fasta.gz). I'd like to rename them to something easily readable and parseable by me (example: Xanthomonas_galactus_str_453.fasta.gz).
I've tried to use sed, rename, and mmv, to no avail. If I use any of those commands as a one-off they work fine; it's just when I try to incorporate variables into a shell script that I run into problems. I'm not getting any errors, just no names are changed, so I suspect it's an I/O error.
Here's what my script looks like:
#! /bin/bash
# change a bunch of file names
file=names.txt
while IFS=' ' read -r r1 r2;
do
mmv ''$r1'.fasta.gz' ''$r2'.fasta.gz'
# or I tried many versions of: sed -i 's/"$r1"/"$r2"/' *.gz
# and I tried many versions of: rename -i 's/$r1/$r2/' *.gz
done < "$file"
...and here are the first lines of my txt file, with a single-space delimiter:
cat names.txt
#find #replace
code1 name1
code2 name2
code3 name3
I know I can do this with python or perl, but since I'm stuck here working on this particular script, I want to find a simple solution that fixes this bash script and figure out what I am doing wrong. Thanks so much for any help.
Also, I tried to cat the names file (see comment from Ashoka Lella below) and then use awk to move/rename. Some of the files have variable names (but they will always start with the code), so I am looking for a find-and-replace option that just replaces the "code" with the "name" and preserves the rest of the file name.
I suspect I am not escaping the variable within the single quotes of the perl expression, but I have pored over a lot of manuals and I can't find the way to do this.
If you're absolutely sure that the filenames don't contain spaces or tabs, you can try the following:
xargs -n2 < names.txt echo mv
This is a dry run (it only prints what would be done); if you're satisfied with the result, remove the echo.
If you want to be prompted before overwriting an existing target, use
xargs -n2 < names.txt echo mv -i
If you want to NEVER allow overwriting of an existing target, use
xargs -n2 < names.txt echo mv -n
Again, remove the echo once you're satisfied.
I don't think you need mmv; a simple mv will do. Also, there's no need to specify IFS; the default will work for you:
while read -r src dest; do mv "$src" "$dest"; done < names.txt
I have double-quoted the variable names, as is generally considered good practice, though in this case a space in either of the filenames would already stop read from working as you expect.
You can put an echo before the mv inside the loop to ensure that the correct command will be executed.
Note that in your file names.txt, the .fasta.gz suffix is already included, so you shouldn't be adding it inside the loop as well. Perhaps that was your problem?
This should rename all files in column 1 to column 2 of names.txt, provided they are in the same folder as names.txt:
cat names.txt | awk '{print "mv "$1" "$2}' | sh
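Since some of the files have extra text after the code (see the question's update), a prefix-replacing variant may fit better. A sketch using bash parameter expansion, with a dry-run echo to remove once the output looks right:
while read -r code name; do
    [[ $code == \#* ]] && continue          # skip the "#find #replace" header line
    for f in "$code"*.fasta.gz; do
        # swap the leading code for the readable name, keeping the rest intact
        echo mv -n "$f" "${name}${f#"$code"}"
    done
done < names.txt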

using awk to match a column in log file and print the entire line

I'm trying to write a script which will analyse a log file.
I want to give the user the option to enter a pattern and then print any line whose fifth column matches that pattern.
The following works from the terminal:
awk '$5 == "acpid:" {print $0}' filename
OK, so above I'm trying to match "acpid:", and this works fine, but in the script I want to be able to allow multiple entries and search for them all. The problem is I'm messing up the variable in the script. This is what I have:
echo "enter any services you want details on, seperated by spaces"
read -a details
for i in ${details[@]}
do
echo $i
awk '$5 == "${i}" {print $0}' ${FILE}
done
Again, if I directly put in a matching expression instead of the variable, it works, so I guess my problem is here. Any tips would be great.
UPDATE
So I'm using the second option suggested by @ghoti (shown below), as it matches my log file slightly better.
However, I'm not having any luck with multiple entries. I've added two lines to illustrate the results I'm getting, echo $i and echo "finish loop"; as placed, they should tell me what input the loop is currently on and when I'm leaving the loop.
read -a details
re=""
for i in "${details[@]}"; do
re="$re${re:+|}$i"
echo $i
echo "finish loop"
done
awk -v re="$re" '$5 ~ re' "$FILE"
When I give read an input of either "acpid" or "init" separately, a perfect result is matched; however, when the input is "acpid init", the following is the output:
acpid init
finish loop
What I'm seeing from this is that read is taking both words as one entry, and awk is then searching for but not matching them (as would be expected). So why is the input not being taken as two separate entries? I had thought the -a option to read specified that words separated by a space would be placed into separate elements of the array. Perhaps I have not declared the array correctly?
Update update
OK, cancel the above update. Like a fool, I'd forgotten that I'd changed IFS to \n earlier in the script. Changed it back and bingo!
Many thanks again to @ghoti for his help!
There are a few ways that you could do what you want.
One option might be to run through a for loop, applying a separate call to awk for each word, and show the results sequentially. For example, if you entered foo bar into the details array, you might get a list of foo matches, followed by a list of bar matches:
read -a details
for i in "${details[#]}"; do
awk -v s="$i" '$5 == s' "$FILE"
done
The idea here is that we use awk's -v option to get each word into the script, rather than expanding the variable inside the quoted script. You should read about how bash deals with different kinds of quotes; there are also a few Stack Overflow questions on the topic.
Another option might be to construct a regular expression that searches for all the words you're interested in, all at once. This has the benefit of using a single run of awk to search through $FILE:
read -a details
re=""
for i in "${details[#]}"; do
re="$re${re:+|}$i"
done
awk -v re="$re" '$5 ~ re' "$FILE"
The result will contain all the interesting lines from $FILE in the order in which they appear in $FILE, rather than ordered by the words you provided.
Note that this is a fairly rudimentary search, without word boundaries, so if you search for "foo bar babar", you may get results you don't want. You can play with the regex yourself, though. :)
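For instance, if you want to stop foo from also matching inside babar, one option (a sketch building on the loop above) is to anchor the alternation to the start of the field. Note that anchoring the end with $ as well would break matching against fields like acpid:, which carry a trailing colon:
read -a details
re=""
for i in "${details[@]}"; do
re="$re${re:+|}$i"
done
# require the match to start at the beginning of field 5:
# "acpid" still matches "acpid:" but no longer matches "xacpid"
awk -v re="^(${re})" '$5 ~ re' "$FILE"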
Does that answer your question?

Complex changes to a URL with sed

I am trying to parse an RSS feed on the Linux command line which involves formatting the raw output from the feed with sed.
I currently use this command:
feedstail -u http://www.heise.de/newsticker/heise-atom.xml -r -i 60 -f "{published}> {title} {link}" | sed 's/^\(.\{3\}\)\(.\{13\}\)\(.\{6\}\)\(.\{3\}\)\(.*\)/\1\3\5/'
This gives me a number of feed items per line that look like this:
Sat 20:33 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/newsticker/meldung/WhatsApp-Ausfall-Server-Probleme-blockieren-Messaging-Dienst-2121664.html/from/atom10?wt_mc=rss.ho.beitrag.atom
Notice the long URL at the end. I want to shorten this to better fit on the command line. Therefore, I want to change my sed command to produce the following:
Sat 20:33 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/-2121664
That means cutting everything out of the URL except a dash and the seven-digit number preceding the ".html/blablabla" bit.
Currently my sed command only changes stuff in the date bit. It would have to leave the title and the start of the URL alone, then cut stuff out of the URL until it reaches the seven-digit number, preserve that, and cut out everything after it. Oh yeah, and we need to leave a dash right in front of that number too.
I have no idea how to do that and can't find the answer after hours of googling. Help?
EDIT:
This is the raw output of a line of feedstail -u http://www.heise.de/newsticker/heise-atom.xml -r -i 60 -f "{published}> {title} {link}", in case it helps:
Sat, 22 Feb 2014 20:33:00 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/newsticker/meldung/WhatsApp-Ausfall-Server-Probleme-blockieren-Messaging-Dienst-2121664.html/from/atom10?wt_mc=rss.ho.beitrag.atom
EDIT 2:
It seems I can only pipe that output into one command. Piping it through multiple ones seems to break things. I don't understand why ATM.
Unfortunately (for me), I could only think of solving this with extended regexp syntax (either -E or -r flag on different systems):
... | sed -E 's|(://[^/]+/).*(-[0-9]+)\.html/.*|\1\2|'
UPDATE: In basic regexp syntax, the best I can do is
... | sed 's|\(://[^/]*/\).*\(-[0-9][0-9]*\)\.html/.*|\1\2|'
The key to writing this sort of regular expression is to be very careful about what the boundaries of what you expect are, so as to avoid the random gunk that you want to get rid of causing you problems. Also, you should bear in mind that you can use characters other than / as part of a s operation's delimiters.
sed 's!\(http://www\.heise\.de/\)newsticker/meldung/[^./]*\(-[0-9][0-9]*\)\.html[^ ]*!\1\2!'
Be aware that getting the RE right can be quite tricky; assume you'll need to test it! (This is a key part of the “now you have two problems” quote; REs very easily become horrendous.)
Something like this maybe?
... | awk -F'[^0-9]*' '{print "http://www.heise.de/-"$2}'
This might work for you (GNU sed):
sed 's|\(//[^/]*/\).*\(-[0-9]\{7\}\).*|\1\2|' file
You can combine it with your first sed command like so:
feedstail -u http://www.heise.de/newsticker/heise-atom.xml -r -i 60 -f "{published}> {title} {link}" |
sed 's/^\(.\{3\}\)\(.\{13\}\)\(.\{6\}\)\(.\{3\}\)\(.*\)/\1\3\5/;s|\(//[^/]*/\).*\(-[0-9]\{7\}\).*|\1\2|'
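Applied to the raw sample line from the question, the combined command gives (a quick check with echo):
$ echo 'Sat, 22 Feb 2014 20:33:00 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/newsticker/meldung/WhatsApp-Ausfall-Server-Probleme-blockieren-Messaging-Dienst-2121664.html/from/atom10?wt_mc=rss.ho.beitrag.atom' | sed 's/^\(.\{3\}\)\(.\{13\}\)\(.\{6\}\)\(.\{3\}\)\(.*\)/\1\3\5/;s|\(//[^/]*/\).*\(-[0-9]\{7\}\).*|\1\2|'
Sat 20:33 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/-2121664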