How to find lines using patterns in a file in UNIX - regex

I am trying to use a .txt file with around 5000 patterns (one per line) to search through another file of 18000 lines for any matches. So far I've tried every form of grep and awk I can find on the internet and it's still not working, so I am completely stumped.
Here's some text from each file.
Pattern.txt
rs2622590
rs925489
rs2798334
rs6801957
rs6801957
rs13137008
rs3807989
rs10850409
rs2798269
rs549182
There's no extra spaces or anything.
File.txt
snpid hg18chr bp a1 a2 zscore pval CEUmaf
rs3131972 1 742584 A G 0.289 0.7726 .
rs3131969 1 744045 A G 0.393 0.6946 .
rs3131967 1 744197 T C 0.443 0.658 .
rs1048488 1 750775 T C -0.289 0.7726 .
rs12562034 1 758311 A G -1.552 0.1207 0.09167
rs4040617 1 769185 A G -0.414 0.6786 0.875
rs4970383 1 828418 A C 0.214 0.8303 .
rs4475691 1 836671 T C -0.604 0.5461 .
rs1806509 1 843817 A C -0.262 0.7933 .
The file.txt was downloaded directly from a med directory.
I'm pretty new to UNIX so any help would be amazing!
Sorry edit: I have definitely tried every single thing you guys are recommending and the result is blank. Am I maybe missing a syntax issue or something in my text files?
P.P.S I know there are matches as doing individual greps works. I'll move this question to unix.stackexchange. Thanks for your answers guys I'll try them all out.
Issue solved: my files had DOS carriage returns (CRLF line endings). I didn't know about this before, so thank you everyone that answered. For future users who are having this issue, here is the solution that worked:
dos2unix *
awk 'NR==FNR{p[$0];next} $1 in p' Patterns.txt File.txt > Output.txt

You can use grep -Fw here:
grep -Fw -f Pattern.txt File.txt
Options used are:
-F - fixed-string search: treat the patterns as literal strings, not regexes
-w - match whole words only
-f file - read the patterns from a file, one per line
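If you want to sanity-check the command on a tiny sample first, a throwaway test (file names here are arbitrary) could look like:

```shell
# Two patterns, one per line, mimicking Pattern.txt
printf 'rs3131969\nrs4040617\n' > patterns.txt
# A small stand-in for File.txt
printf 'rs3131969 1 744045 A G\nrs9999999 1 1 A G\nrs4040617 1 769185 A G\n' > data.txt

# -F fixed strings, -w whole words, -f patterns from a file
grep -Fw -f patterns.txt data.txt
```

This should print only the rs3131969 and rs4040617 lines.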

I don't know if it's what you want or not, but this will print every line from File.txt whose first field equals a string from Patterns.txt:
awk 'NR==FNR{p[$0];next} $1 in p' Patterns.txt File.txt
If that is not what you want, tell us what you do want. If it is what you want but doesn't produce the output you expect then one or both of your files contains control characters courtesy of being created in Windows so run dos2unix or similar on them both first.
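If you suspect Windows line endings, you can make the stray carriage returns visible before reaching for dos2unix; this sketch uses cat -v, which renders a CR as ^M:

```shell
# Create a file with DOS (\r\n) line endings for demonstration.
printf 'rs2622590\r\nrs925489\r\n' > dospatterns.txt

# Each line will show a trailing ^M if carriage returns are present.
cat -v dospatterns.txt

# If dos2unix isn't installed, tr can strip the CRs just as well.
tr -d '\r' < dospatterns.txt > unixpatterns.txt
```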

Use a shell script to read the file containing your patterns and pass each of its lines to fgrep as a fixed-string pattern:
#!/bin/bash
FILENAME=$1
fgrep -f "$FILENAME" PATTERNFILE.txt

Related

rename multiple files splitting filenames by '_' and retaining first and last fields

Say I have the following files:
a_b.txt a_b_c.txt a_b_c_d_e.txt a_b_c_d_e_f_g_h_i.txt
I want to rename them in such a way that I split their filenames by _ and I retain the first and last field, so I end up with:
a_b.txt a_c.txt a_e.txt a_i.txt
Thought it would be easy, but I'm a bit stuck...
I tried rename with the following regexp:
rename 's/^([^_]*).*([^_]*[.]txt)/$1_$2/' *.txt
But what I would really need to do is to actually split the filename, so I thought of awk, but I'm not so proficient with it... This is what I have so far (I know at some point I should specify FS="_" and grab the first and last field somehow):
find . -name "*.txt" | awk -v mvcmd='mv "%s" "%s"\n' '{old=$0; <<split by _ here somehow and retain first and last fields>>; printf mvcmd,old,$0}'
Any help? I don't have a preferred method, but it would be nice to use this to learn awk. Thanks!
Your rename attempt was close; you just need to make sure the final group is greedy.
rename 's/^([^_]*).*_([^_]*[.]txt)$/$1_$2/' *_*_*.txt
I added a _ before the last opening parenthesis (this is the crucial fix), and a $ anchor at the end, and also extended the wildcard so that you don't process any files which don't contain at least two underscores.
The equivalent in Awk might look something like
find . -name "*_*_*.txt" |
awk -F _ '{ system("mv " $0 " " $1 "_" $(NF)) }'
This is somewhat brittle because of the system call; you might need to rethink your approach if your file names could contain whitespace or other shell metacharacters. You could add quoting to partially fix that, but then the command will fail if the file name contains literal quotes. You could fix that, too, but then this will be a little too complex for my taste.
Here's a less brittle approach which should cope with completely arbitrary file names, even ones with newlines in them:
find . -name "*_*_*.txt" -exec sh -c 'for f; do
mv "$f" "${f%%_*}_${f##*_}"
done' _ {} +
find will supply a leading path before each file name, so we don't need mv -- here (there will never be a file name which starts with a dash).
The parameter expansion ${f##pattern} produces the value of the variable f with the longest available match on pattern trimmed off from the beginning; ${f%%pattern} does the same, but trims from the end of the string.
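You can check both expansions on a throwaway value before trusting them with mv:

```shell
f='a_b_c_d_e.txt'
echo "${f%%_*}"            # longest suffix match of '_*' trimmed off -> a
echo "${f##*_}"            # longest prefix match of '*_' trimmed off -> e.txt
echo "${f%%_*}_${f##*_}"   # the new name -> a_e.txt
```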
With your shown samples, please try the following pure bash code (making good use of bash's parameter expansion capability). The glob will NOT pick up files like a_b.txt; it only picks files which have more than one underscore in their name, as per the requirement.
for file in *_*_*.txt
do
firstPart="${file%%_*}"
secondPart="${file##*_}"
newName="${firstPart}_${secondPart}"
mv -- "$file" "$newName"
done
This answer works for your example, but @tripleee's "find" approach is more robust.
for f in a_*.txt; do mv "$f" "${f%%_*}_${f##*_}"; done
Details: https://www.gnu.org/software/bash/manual/html_node/Shell-Parameter-Expansion.html / https://www.gnu.org/software/bash/manual/html_node/Pattern-Matching.html
Here's an alternate regexp for the given samples:
$ rename -n 's/_.*_/_/' *.txt
rename(a_b_c_d_e_f_g_h_i.txt, a_i.txt)
rename(a_b_c_d_e.txt, a_e.txt)
rename(a_b_c.txt, a_c.txt)
A different rename regex
rename 's/(\S_)[a-z_]*(\S\.txt)/$1$2/'
Using the same regex with sed or using awk within a loop.
for a in a_*; do
name=$(echo "$a" | awk -F_ '{print $1 "_" $NF}'); #Or
#name=$(echo "$a" | sed -E 's/(\S_)[a-z_]*(\S\.txt)/\1\2/g');
mv "$a" "$name";
done

Combine multiple lines of text documents into one

I have thousands of text documents and they have varied number of lines of texts. I want to combine all the lines into one single line in each document individually. That is for example:
abcd
efgh
ijkl
should become
abcd efgh ijkl
I tried using sed commands, but they don't quite achieve what I want since the number of lines in each document varies. Please suggest what I can do. I am working in Python on Ubuntu. One-line commands would be of great help. Thanks in advance!
If you place your script in the same directory as your files, the following code should work.
import os

count = 0
# The script is assumed to live in the same directory as the .txt files.
for doc in os.listdir('.'):
    if doc.endswith(".txt"):
        with open(doc, 'r') as f:
            single_line = ''.join([line for line in f])
        single_space = ' '.join(single_line.split())
        with open("new_doc{}.txt".format(count), "w") as out:
            out.write(single_space)
        count += 1
@inspectorG4dget's code is more compact than mine -- and thus I think it's better. I tried to make mine as user-friendly as possible. Hope it helps!
Using python wouldn't be necessary. This does the trick:
% echo `cat input.txt` > output.txt
To apply to a bunch of files, you can use a loop. E.g. if you're using bash:
for inputfile in /path/to/directory/with/files/* ; do
echo `cat ${inputfile}` > ${inputfile}2
done
Assuming all your files are in one directory, have a .txt extension, and you have access to a Linux box with bash, you can use tr like this:
for i in *.txt ; do tr '\n' ' ' < "$i" > "$i.one"; done
For every "file.txt", this will produce a "file.txt.one" with all the text on one line.
If you want a solution that operates on the files directly you can use GNU sed (NOTE: THIS WILL CLOBBER YOUR STARTING FILES - MAKE A BACKUP OF THE DIRECTORY BEFORE TRYING THIS):
sed -i -n 'H;${x;s|\n| |g;p};' *.txt
If your files aren't in the same directory, you can use find with -exec:
find . -name "*.txt" -exec YOUR_COMMAND \{\} \;
If this doesn't work, maybe a few more details about what you're trying to do would help.
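For instance, borrowing the tr approach from above, the placeholder could be filled in like this (the .one suffix is just an arbitrary choice):

```shell
# Flatten every .txt file under the current tree into a one-line copy.
find . -name "*.txt" -exec sh -c 'tr "\n" " " < "$1" > "$1.one"' _ {} \;
```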

Extracting Last column using VI

I have a csv file which contains some 1000 fields with values, the headers are something like below:
v1,v2,v3,v4,v5....v1000
I want to extract the last column i.e. v1000 and its values.
I tried %s/,[^,]*$//, but this turns out to be the exact opposite of what I expected. Is there any way to invert this expression in vi?
I know it can be done with awk as awk -F "," '{print $NF}' myfile.csv, but I want to do it in vi with a regular expression. Please also note that I have vi, not vim, and I'm working on UNIX, so I can't do the visual-mode trick either.
Many thanks in advance; any help is much appreciated.
Don't you just want
%s/.*,\s*//
.*, matches everything up to and including the last comma, and the \s* is there to remove whitespace if it's there.
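If you want to test the idea outside the editor first, the same substitution can be tried with sed (using [[:space:]] in place of \s for portability):

```shell
# .*, greedily eats everything up to and including the last comma,
# leaving only the final field of each line.
printf 'v1,v2,v1000\n1,2, 42\n' | sed 's/.*,[[:space:]]*//'
```

This prints v1000 and 42, one per line.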
You already accepted an answer, but note that you can still use awk or other nice UNIX tools within vi or vim. The technique below manipulates the contents of the buffer through an external command, :!{cmd}
As a demo, let's rearrange the records in CSV file with sort command:
first,last,email
john,smith,john@example.com
jane,doe,jane@example.com
:2,$!sort -t',' -k2
The -k2 flag sorts the records by the second field.
Extract last column with awk as easy as:
:%!awk -F "," '{print $NF}'
Don't forget cut!
:%!cut -d , -f 6
Where 6 is the number of the last field.
Or if you don't want to count the number of fields:
:%!rev | cut -d , -f 1 | rev
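The rev/cut trick is easy to verify in a plain shell before running it on the buffer:

```shell
# rev reverses each line, so the last field becomes the first;
# cut grabs it, and a second rev restores its original spelling.
echo 'v1,v2,v3,v1000' | rev | cut -d , -f 1 | rev
# -> v1000
```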

How to optimize a grep regular expression to match a URL

Background:
I have a directory called "stuff" with 26 files (2 .txt and 24 .rtf) on Mac OS 10.7.5.
I'm using grep (GNU v2.5.1) to find all strings within these 26 files that match the structure of a URL, then print them to a new file (output.txt).
The regex below does work on a small scale. I ran it on directory with 3 files (1 .rtf and 2 .txt) with a bunch of dummy text and 30 URLs, and it executed successfully in less than 1 second.
I am using the following regex:
1
grep -iIrPoh 'https?://.+?\s' . --include=*.txt --include=*.rtf > output.txt
Problem
The current size of my directory "stuff" is 180 KB with 26 files. In terminal, I cd to this directory (stuff) then run my regex. I waited about 15 minutes and decided to kill the process as it did NOT finish. When I looked at the output.txt file, it was a whopping 19.75GB (screenshot).
Question
What could be causing the output.txt file to be so many orders of magnitude larger than the entire directory?
What more could I add to my regex to streamline the processing time?
Thank you in advance for any guidance you can provide here. I've been working on many different variations of my regex for almost 16 hours, and have read tons of posts online but nothing seems to help. I'm new to writing regex, but with a small bit of hand holding, I think I'll get it.
Additional Comments
I ran the following command to see what was being recorded in the output.txt (19.75GB) file. It looks like the regex is finding the right strings, with the exception of what I think are odd characters like curly braces } { and a string like {\fldrslt
**TERMINAL**
$ head -n 100 output.txt
http://michacardenas.org/\
http://culturelab.asc.upenn.edu/2013/03/06/calling-all-wearable-electronics-hackers-e-textile-makers-and-fashion-activists/\
http://www.mumia-themovie.com/"}}{\fldrslt
http://www.mumia-themovie.com/}}\
http://www.youtube.com/watch?v=Rvk2dAYkHW8\
http://seniorfitnesssite.com/category/senior-fitness-exercises\
http://www.giac.org/
http://www.youtube.com/watch?v=deOCqGMFFBE"}}{\fldrslt
http://www.youtube.com/watch?v=deOCqGMFFBE}}
https://angel.co/jason-a-hoffman\
https://angel.co/joyent?save_req=mention_slugs"}}{\fldrslt
http://www.cooking-hacks.com/index.php/ehealth-sensors-complete-kit-biometric-medical-arduino-raspberry-pi.html"}}{\fldrslt
http://www.cooking-hacks.com/index.php/ehealth-sensors-complete-kit-biometric-medical-arduino-raspberry-pi.html}}
http://www.cooking-hacks.com/index.php/documentation/tutorials/ehealth-biometric-sensor-platform-arduino-raspberry-pi-medical"}}{\fldrslt
http://www.cooking-hacks.com/index.php/documentation
Catalog of regex commands tested so far
2
grep -iIrPoh 'https?://\S+' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / produced blank file (output_2.txt)
3
grep -iIroh 'https?://\S+' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / produced blank file (output_3.txt)
4
grep -iIrPoh 'https?://\S+\s' . --include=*.txt --include=*.rtf > sixth.txt
FAIL: took 1 second to run / produced blank file (output_4.txt)
5
grep -iIroh 'https?://' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / produced blank file (output_5.txt)
6
grep -iIroh 'https?://\S' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / produced blank file (output_6.txt)
7
grep -iIroh 'https?://[\w~#%&_+=,.?/-]+' . --include=*.txt --include=*.rtf > output.txt
FAIL: took 1 second to run / produced blank file (output_7.txt)
8
grep -iIrPoh 'https?://[\w~#%&_+=,.?/-]+' . --include=*.txt --include=*.rtf > output.txt
FAIL: let it run for 10 mins and manually killed the process / produced 20.63 GB file (output_8.txt) / On the plus side, this regex captured strings that were accurate in the sense that they did NOT include any odd additional characters like curly braces or RTF file format syntax {\fldrslt
9
find . -print | grep -iIPoh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf > output_9.txt
FAIL: took 1 second to run / produced blank file (output_9.txt)
10
find . -print | grep -iIrPoh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf > output_10.txt
FAIL: took 1 second to run / produced blank file (output_10.txt)
11
grep -iIroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf
Editor's note: this regex only worked properly when I output strings to the terminal window. It did not work when I output to a file output_11.txt.
NEAR SUCCESS: All URL strings were cleanly cut to remove white space before and after string, and removed all special markup associated with .RTF format. Downside: of the sample URLs tested for accuracy, some were cut short losing their structure at the end. I'm estimating that about 10% of strings were improperly truncated.
Example of truncated string:
URL structure before the regex: http://www.youtube.com/watch?v=deOCqGMFFBE
URL structure after the regex: http://www.youtube.com/watch?v=de
The question now becomes:
1.) Is there a way to ensure we do not eliminate a part of the URL string as in the example above?
2.) Would it help to define an escape command for the regex? (if that is even possible).
12
grep -iIroh 'https?:\/\/[\w~#%&_+=,.?\/-]+' . --include=*.txt --include=*.rtf > output_12.txt
FAIL: took 1 second to run / produced blank file (output_12.txt)
13
grep -iIroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf > tmp/output.txt
FAIL: let it run for 2 mins and manually killed the process / produced 1 GB file. The intention of this regex was to isolate grep's output file (output.txt) into a subdirectory to ensure we weren't creating an infinite loop that had grep reading back its own output. Solid idea, but no cigar (screenshot).
14
grep -iIroh 'https\?://[a-z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf
FAIL: Same result as #11. The command resulted in an infinite loop with truncated strings.
15
grep -Iroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' . --include=*.txt --include=*.rtf
ALMOST WINNER: This captured the entirety of the URL string. It did result in an infinite loop creating millions of strings in terminal, but I can manually identify where the first loop starts and ends so this should be fine. GREAT JOB #acheong87! THANK YOU!
16
find . -print | grep -v output.txt | xargs grep -Iroh 'https\?://[a-zA-Z0-9~#%&_+=,.?/-]\+' --include=*.txt --include=*.rtf > output.txt
NEAR SUCCESS: I was able to grab the ENTIRE URL string, which is good. However, the command turned into an infinite loop. After about 5 seconds of running output to terminal, it produced about 1 million URL strings, which were all duplicates. This would have been a good expression if we could figure out how to escape it after a single loop.
17
ls *.rtf *.txt | grep -v 'output.txt' | xargs -J {} grep -iIF 'http' {} grep -iIFo > output.txt
NEAR SUCCESS: this command resulted in a single pass through all files in the directory, which is good because it solved the infinite loop problem. However, the URL strings were truncated and included the filename they came from.
18
ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[^[:space:]]+'
NEAR SUCCESS: This expression prevented an infinite loop which is good, it created a new file in the directory it was querying which was small, about 30KB. It captured all the proper characters in the string and a couple ones not needed. As Floris mentioned, in the instances where the URL was NOT terminated with a space - for example http://www.mumia-themovie.com/"}}{\fldrslt it captured the markup syntax.
19
ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[a-z./?#=%_-,~&]+'
FAIL: This expression prevented an infinite loop which is good, however it did NOT capture the entire URL string.
The expression I had given in the comments (your test 17) was intended to test for two things:
1) can we make the infinite loop go away
2) can we loop over all files in the directory cleanly
I believe we achieved both. So now I am audacious enough to propose a "solution":
ls *.rtf *.txt | grep -v 'output.txt' | xargs grep -iIohE 'https?://[^[:space:]]+'
Breaking it down:
ls *.rtf *.txt - list all .rtf and .txt files
grep -v 'output.txt' - skip 'output.txt' (in case it was left from a previous attempt)
xargs - take each line of the input in turn and substitute it at the end of the following command (or use -J xxx to substitute at the place of xxx anywhere in the command)
grep -i - case insensitive
-I - skip binary (shouldn't have any since we only process .txt and .rtf...)
-o - print only the matched bit (not the entire line), i.e. just the URL
-h - don't include the name of the source file
-E - use extended regular expressions
'http - match starts with http (there are many other URLs possible... but out of scope for this question)
s? - next character may be an s, or is not there
:// - literal characters that must be there
[^[:space:]]+ - one or more "non space" characters (greedy... "as many as possible")
This seemed to work OK on a very simple set of files / URLs. I think that now that the iterating problem is solved, the rest is easy. There are tons of "URL validation" regexes online. Pick any one of them... the above expression really just searches for "everything that follows http until a space". If you end up with odd or missing matches let us know.
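As a quick sanity check of that expression on a made-up line:

```shell
# -o prints only the matched part; -E enables extended regexes.
printf 'intro http://example.com/watch?v=abc123 trailing text\n' |
  grep -oE 'https?://[^[:space:]]+'
# -> http://example.com/watch?v=abc123
```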
I'm guessing a bit but for a line like
http://a.b.com something foo bar
the pattern can match as
http://a.b.com
http://a.b.com something
http://a.b.com something foo
(always with space at the end).
But I don't know if grep tries to match same line multiple times.
Better try
'https?://\S+\s'
as pattern
"What could be causing the output.txt file to be so many orders of maginitude larger than the entire directory?" me thinks you are running a cycle with grep reading back its own output? Try directing the output to > ~/tmp/output.txt.

Shell Script - list files, read files and write data to new file

I have a special question about shell scripting.
Simple scripting is no problem for me, but I am new to this and want to build myself a simple database file.
So, what I want to do is:
- Search for file types (i.e. .nfo) <-- should be no problem :)
- read inside each found file and use some strings from it
- these strings from each file should be written to a new file, with each found file's information on one row of the new file
I hope I explained my "project" well.
My problem now is understanding how to tell the script to search for the files and then read each of them and use some of their information to write to a new file.
I will explain a bit better.
I am searching for files and that gives me back:
file1.nfo
file2.nfo
file3.nfo
OK, now from each of those files I need the information between two tags, i.e.
file1.nfo:
<user>test1</user>
file2.nfo:
<user>test2</user>
so in the new file there should now be:
file1.nfo:test1
file2.nfo:test2
OK so:
find -name '*.nfo' > /test/database.txt
is printing out the list of files.
and
sed -n '/<user*/,/<\/user>/p' file1.nfo
gives me back the complete file, not only the information between <user> and </user>.
I am trying to go step by step and reading a lot, but it seems to be very difficult.
What am I doing wrong, and what would be the best way to list all files and write each filename plus the content between the two strings to a file?
EDIT-NEW:
OK, here is an update with more information.
I have learned a lot now and searched the web for my problems. I can find a lot of information, but I don't know how to put it together so that I can use it.
Working with awk now, I get back the filename and the string.
Here now is the complete information (I thought I could go on by myself with a bit of help, but I can't :( )
Here is an example of: /test/file1.nfo
<string1>STRING 1</string1>
<string2>STRING 2</string2>
<string3>STRING 3</string3>
<string4>STRING 4</string4>
<personal informations>
<hobby>Baseball</hobby>
<hobby>Basketball</hobby>
</personal informations>
Here is an example of /test/file2.nfo
<string1>STRING 1</string1>
<string2>STRING 2</string2>
<string3>STRING 3</string3>
<string4>STRING 4</string4>
<personal informations>
<hobby>Soccer</hobby>
<hobby>Traveling</hobby>
</personal informations>
The file I want to create has to look like this:
STRING 1:::/test/file1.nfo:::Date of file:::STRING 4:::STRING 3:::Baseball, Basketball:::STRING 2
STRING 1:::/test/file2.nfo:::Date of file:::STRING 4:::STRING 3:::Soccer, Traveling:::STRING 2
"Date of file" should be the creation date of the file, so that I can see how old the file is.
So, that's what I need, and it seems not easy.
Thanks a lot.
UPDATE: ERROR with -printf
find: unrecognized: -printf
Usage: find [PATH]... [OPTIONS] [ACTIONS]
Search for files and perform actions on them.
First failed action stops processing of current file.
Defaults: PATH is current directory, action is '-print'
-follow Follow symlinks
-xdev Don't descend directories on other filesystems
-maxdepth N Descend at most N levels. -maxdepth 0 applies
actions to command line arguments only
-mindepth N Don't act on first N levels
-depth Act on directory *after* traversing it
Actions:
( ACTIONS ) Group actions for -o / -a
! ACT Invert ACT's success/failure
ACT1 [-a] ACT2 If ACT1 fails, stop, else do ACT2
ACT1 -o ACT2 If ACT1 succeeds, stop, else do ACT2
Note: -a has higher priority than -o
-name PATTERN Match file name (w/o directory name) to PATTERN
-iname PATTERN Case insensitive -name
-path PATTERN Match path to PATTERN
-ipath PATTERN Case insensitive -path
-regex PATTERN Match path to regex PATTERN
-type X File type is X (one of: f,d,l,b,c,...)
-perm MASK At least one mask bit (+MASK), all bits (-MASK),
or exactly MASK bits are set in file's mode
-mtime DAYS mtime is greater than (+N), less than (-N),
or exactly N days in the past
-mmin MINS mtime is greater than (+N), less than (-N),
or exactly N minutes in the past
-newer FILE mtime is more recent than FILE's
-inum N File has inode number N
-user NAME/ID File is owned by given user
-group NAME/ID File is owned by given group
-size N[bck] File size is N (c:bytes,k:kbytes,b:512 bytes(def.))
+/-N: file size is bigger/smaller than N
-links N Number of links is greater than (+N), less than (-N),
or exactly N
-prune If current file is directory, don't descend into it
If none of the following actions is specified, -print is assumed
-print Print file name
-print0 Print file name, NUL terminated
-exec CMD ARG ; Run CMD with all instances of {} replaced by
file name. Fails if CMD exits with nonzero
-delete Delete current file/directory. Turns on -depth option
The pat1,pat2 notation of sed is line based. Think of it like this: pat1 sets an enable flag for its commands and pat2 disables the flag. If both pat1 and pat2 match on the same line, the flag will still be set, and thus in your case it prints everything following and including the <user> line. See grymoire's sed howto for more.
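That flag behaviour is easy to reproduce on a small input:

```shell
# Both addresses match the same line, so the range is opened there
# and never closed: everything from <user> onward gets printed.
printf 'before\n<user>test1</user>\nafter\n' |
  sed -n '/<user>/,/<\/user>/p'
```

This prints the <user> line and the unrelated "after" line, illustrating the problem.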
An alternative to sed, in this case, would be to use a grep that supports look-around assertions, e.g. GNU grep:
find . -type f -name '*.nfo' | xargs grep -oP '(?<=<user>).*(?=</user>)'
If grep doesn't support -P, you can use a combination of grep and sed:
find . -type f -name '*.nfo' | xargs grep -o '<user>.*</user>' | sed 's:</\?user>::g'
Output:
./file1.nfo:test1
./file2.nfo:test2
Note, you should be aware of the issues involved with passing files on to xargs and perhaps use -exec ... instead.
It so happens that grep outputs in the format you need, and is enough for a one-liner.
By default a grep '' *.nfo will output something like:
file1.nfo:random data
file1.nfo:<user>test1</user>
file1.nfo:some more random data
file2.nfo:not needed
file2.nfo:<user>test2</user>
file2.nfo:etc etc
By adding the -P option (Perl RegEx) you can restrict the output to matches only:
grep -P "<user>\w+<\/user>" *.nfo
output:
file1.nfo:<user>test1</user>
file2.nfo:<user>test2</user>
Now the -o option (only show what matched) saves the day, but we'll need a slightly more advanced regex, since the tags are not needed:
grep -oP "(?<=<user>)\w+(?=<\/user>)" *.nfo > /test/database.txt
output of cat /test/database.txt:
file1.nfo:test1
file2.nfo:test2
Explained RegEx here: http://regex101.com/r/oU2wQ1
And your whole script just became a single command.
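On a sample file the look-arounds keep the tags out of the match (assuming a grep built with PCRE support for -P):

```shell
printf '<user>test1</user>\n' > file1.nfo
# The look-behind and look-ahead anchor on the tags without consuming them.
grep -oP '(?<=<user>)\w+(?=<\/user>)' file1.nfo
# -> test1
```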
Update:
If you don't have the --perl-regexp option try:
grep -oE "<user>\w+<\/user>" *.nfo|sed 's#</?user>##g' > /test/database.txt
All you need is:
find -name '*.nfo' | xargs awk -F'[><]' '{print FILENAME,$3}'
If you have more in your file than just what you show in your sample input then this is probably all you need:
... awk -F'[><]' '/<user>/{print FILENAME,$3}' file
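The -F'[><]' trick works because splitting <user>test1</user> on < and > yields an empty first field, then "user", then "test1", so $3 is the value. A quick check:

```shell
printf '<user>test1</user>\n' > file1.nfo
# Field 2 is the tag name, field 3 is the text between the tags.
awk -F'[><]' '/<user>/{print FILENAME, $3}' file1.nfo
# -> file1.nfo test1
```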
Try this (untested):
> outfile
find -name '*.nfo' -printf "%p %Tc\n" |
while IFS= read -r fname tstamp
do
awk -v tstamp="$tstamp" -F'[><]' -v OFS=":::" '
{ a[$2] = a[$2] sep[$2] $3; sep[$2] = ", " }
END {
print a["string1"], FILENAME, tstamp, a["string4"], a["string3"], a["hobby"], a["string2"]
}
' "$fname" >> outfile
done
The above will only work if your file names do not contain spaces. If they can, we'd need to tweak the loop.
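One possible tweak, sketched here assuming bash and GNU find/stat: NUL-terminate the names with -print0 and read them with read -d '', fetching the timestamp separately with stat, so that spaces in file names survive:

```shell
> outfile
find . -name '*.nfo' -print0 |
while IFS= read -r -d '' fname
do
    tstamp=$(stat -c '%x' "$fname")   # GNU stat; BSD stat spells this differently
    awk -v tstamp="$tstamp" -F'[><]' -v OFS=":::" '
        { a[$2] = a[$2] sep[$2] $3; sep[$2] = ", " }
        END { print a["string1"], FILENAME, tstamp, a["string4"], a["string3"], a["hobby"], a["string2"] }
    ' "$fname" >> outfile
done
```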
Alternative if your find doesn't support -printf (suggestion - seriously consider getting a modern "find"!):
> outfile
find -name '*.nfo' -print |
while IFS= read -r fname
do
tstamp=$(stat -c"%x" "$fname")
awk -v tstamp="$tstamp" -F'[><]' -v OFS=":::" '
{ a[$2] = a[$2] sep[$2] $3; sep[$2] = ", " }
END {
print a["string1"], FILENAME, tstamp, a["string4"], a["string3"], a["hobby"], a["string2"]
}
' "$fname" >> outfile
done
If you don't have "stat" then google for alternatives to get a timestamp from a file or consider parsing the output of ls -l - it's unreliable but if it's all you've got...