awk: how to include file names when concatenating files? - regex

I'm running GNUwin32 under Windows 7.
I have many files in a single directory with names that look like this:
chem.001.txt
chem.002.b4.txt
chem.003.md6.txt
(more files.txt) ...
In their current form, none of the files contains its own file name.
I need to clean these files for further use.
I want to concatenate all the files into a single file.
But I also need to include each file name at the beginning of its content in the concatenated file, so that I can later associate the original file with the cleaned data.
For example, the single, concatenated file (new_file.txt) would look like this:
chem.001.txt delimiter (could be a tab or pipe) followed by text from chem.001.txt...
chem.002.b4.txt delimiter followed by text from chem.002.b4.txt ...
chem.003.md6.txt delimiter followed by text from chem.003.md6.txt ...
etc. ...
Will then clean the concatenated file and parse content as needed.
awk/gawk may have a way to associate the file name with $1 and the text of the file with $2, and then, in sequence, print $1, $2 for each file into new_file.txt, but I've not been able to make it work.
How to do this?

Put this in foo.awk:
BEGIN{ RS="^$"; OFS="|" }
{ gsub(/\r?\n/," "); print FILENAME, $0 > "new_file.txt" }
and then execute it as
awk -f foo.awk <files>
where <files> is however you provide a list of file names in Windows (e.g. chem.*.txt). It uses GNU awk's multi-char RS to read each whole file as a single record; with the newlines converted to spaces, each input file becomes one line of new_file.txt.
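If GNU awk's multi-char RS is unavailable, the same result can be sketched with a plain shell loop. This is a hypothetical stand-in: the sample files created here only mimic the chem.*.txt files from the question.

```shell
cd "$(mktemp -d)"
# Create stand-in sample files (hypothetical contents).
printf 'first line\nsecond line\n' > chem.001.txt
printf 'other text\n' > chem.002.b4.txt

# Prefix each file's flattened contents with its name and a pipe delimiter.
for f in chem.*.txt; do
  printf '%s|' "$f"       # file name plus delimiter
  tr '\n' ' ' < "$f"      # flatten the file onto one line
  printf '\n'
done > new_file.txt

cat new_file.txt
```

Each line of new_file.txt then starts with the original file name, a pipe, and the file's text.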

Related

Unix Sed Command to replace file name entries in a *.txt file

I have a class.txt file which contains multiple .class file entries along with their respective paths. I want to rename the .class file names as mentioned below.
Requirement:
from
modules/abc_1.1.3/abc.domain.ear!/APP-INF/lib/adj.jar!/ba/sr/ApplicationModule.class
to:
modules/abc_1.1.3/abc.domain.ear!/APP-INF/lib/adj.jar!/ba/sr/[ApplicationModule\$.*\.class]
I tried using the sed command, but didn't get the desired output, as shown below:
cat class.txt | sed "s/.class/\\\\$.*\\\.class]/g"
modules/abc_1.1.3/abc.domain.ear!/APP-INF/lib/adj.jar!/ba/sr/ApplicationModule\$.*\.class]
Kindly help, Thanks!
You have to capture the file name:
sed 's/\([^/]*\)\.class/[\1\\$.*\\.class]/g'
You need to use capture groups in order to capture the filename and its extension in two separate groups.
$ sed 's~\([^./]*\)\.\([^/.]*\)$~[\1\\\$.*\\.\2]~' file
modules/abc_1.1.3/abc.domain.ear!/APP-INF/lib/adj.jar!/ba/sr/[ApplicationModule\$.*\.class]
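To verify, the capture-group sed from the answer can be re-run on the sample line from the question:

```shell
# Sample input path taken from the question.
line='modules/abc_1.1.3/abc.domain.ear!/APP-INF/lib/adj.jar!/ba/sr/ApplicationModule.class'

# Capture the base name and the extension in two groups, then rebuild
# the last path segment as [name\$.*\.ext].
out=$(printf '%s\n' "$line" | sed 's~\([^./]*\)\.\([^/.]*\)$~[\1\\\$.*\\.\2]~')
printf '%s\n' "$out"
```

Because the pattern is anchored with $ and the bracket classes exclude /, only the final path segment is rewritten; the earlier dotted segments (abc_1.1.3, abc.domain.ear) are left alone.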

Comment out file paths in a file matching lines in another file with sed and bash

I have a file (names.txt) with the following content:
/bin/pgawk
/bin/zsh
/dev/cua0
/dev/initctl
/root/.Xresources
/root/.esd_auth
... and so on. I want to read this file line by line, and use sed to comment out matches in another file. I have the code below, but it does nothing:
#!/bin/bash
while read line
do
name=$line
sed -e '/\<$name\>/s/^/#/' config.conf
done < names.txt
The lines in the input file need to be commented out in the config.conf file, like this:
config {
#/bin/pgawk
#/bin/zsh
#/dev/cua0
#/dev/initctl
#/root/.Xresources
#/root/.esd_auth
}
I don't want to do this by hand, because the file contains more than 300 file paths. Can someone help me figure this out?
You need to use double quotes around your sed command; otherwise shell variables will not be expanded. Try this:
sed "/\<$name\>/s/^/#/" config.conf
However, I would recommend that you skip the bash loop entirely and do the whole thing in one go, using awk:
awk 'NR==FNR{a[$0];next}{for(i=1;i<=NF;++i)if($i in a)$i="#"$i}1' names.txt config.conf
The awk command stores all of the file names as keys in the array a and then loops through every word in each line of the config file, adding a "#" before the word if it is in the array. The 1 at the end means that every line is printed.
It is better not to use regular expression matching here, as some of the characters in your file names (such as .) will be interpreted by the regular expression engine. This approach does a simple string match, which avoids the problem.
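A quick way to try the two-file awk approach, using a toy names.txt and config.conf (the sample entries here are hypothetical, chosen to mirror the question):

```shell
cd "$(mktemp -d)"
# Toy input files mimicking the question's data.
printf '/bin/zsh\n/dev/cua0\n' > names.txt
printf 'config {\n/bin/zsh\n/bin/bash\n}\n' > config.conf

# First pass (NR==FNR) loads names.txt into array a; second pass prefixes
# any whole word found in a with "#".
awk 'NR==FNR{a[$0];next}{for(i=1;i<=NF;++i)if($i in a)$i="#"$i}1' \
    names.txt config.conf > commented.conf

cat commented.conf
```

/bin/zsh is listed in names.txt, so it comes out as #/bin/zsh; /bin/bash is not listed and passes through unchanged.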

how to parse a text file with compound-expression filtering in shell scripting

I want to parse a text file for rows containing particular words: for my requirement, whichever rows contain the words "cluster", "week", and "8.2" should be written to the output file.
sample text in the file
2013032308470272~800000102507~Cluster-Mode~WEEK~8.1.2~V6240
2013032308470272~800000102507~Cluster-Mode~monthly~8.1.2~V6240
2013032308470272~800000102507~Cluster-Mode~WEEK~8.2.2~V6240
2013032308470272~800000102507~Cluster-Mode~yearly~8.1.2~V6240
Desired output into another text file by above mentioned filters
2013032308470272~800000102507~Cluster-Mode~WEEK~8.2.2~V6240
I have written code using the awk command; however, the output file contains rows that are outside the scope of the filters.
Code used to extract the text:
awk '/Cluster/ && /WEEK/ && /8.2/ { print $NF > "/u/nbsvc/Data/Lookup/derived_asup_2010404_201409_2.txt" }' /u/nbsvc/Data/Lookup/cmode_asup_lookup.txt
obtained output
2013032308470272~800000102507~Cluster-Mode~WEEK~8.1.2~V6240
2013032308470272~800000102507~Cluster-Mode~WEEK~8.2.2~V6240
Note: the first line of obtained output is not needed in the desired output. How can I change my script to only get the line that I want?
To remove any ambiguity and false matches on partial fields or the wrong field, THIS is the command you need to run:
$ awk -F'~' '$3~/^Cluster/ && $4=="WEEK" && $5~/^8\.2/' file
2013032308470272~800000102507~Cluster-Mode~WEEK~8.2.2~V6240
I don't think that awk is needed at all here. Just use grep to match the line that you're interested in:
grep 'Cluster.*WEEK.*8\.2' file > output_file
The .* matches zero or more of any character and > is used to redirect the output to a new file. I have escaped the . in between "8.2" so that it is interpreted literally, rather than matching any character (although it would work either way).
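The grep version can be checked against the sample rows from the question (written to a scratch file here just for the test):

```shell
cd "$(mktemp -d)"
# Sample rows taken from the question.
cat > file <<'EOF'
2013032308470272~800000102507~Cluster-Mode~WEEK~8.1.2~V6240
2013032308470272~800000102507~Cluster-Mode~monthly~8.1.2~V6240
2013032308470272~800000102507~Cluster-Mode~WEEK~8.2.2~V6240
2013032308470272~800000102507~Cluster-Mode~yearly~8.1.2~V6240
EOF

# With the dot escaped, only the 8.2.x WEEK row survives.
grep 'Cluster.*WEEK.*8\.2' file > output_file
cat output_file
```

Only the 8.2.2 line comes through; the escaped dot is what keeps the 8.1.2 rows out.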
There is actually a little more to my requirement: I need to read this text file, split each line, push the values into an array, and then check whether the values match my pattern. If they match, I write the line to an output text file; otherwise I simply ignore it. I did it like this:
awk 'BEGIN{IGNORECASE=1} {split($0,a,"~"); if (a[1] ~ /201404/ && a[3] ~ /Cluster/ && a[4] ~ /WEEK/ && a[5] ~ /8\.2/) print}' /inputfolder_path/lookup_filename.txt > /outputfolder_path/derived_output_filename.txt
This is working exactly for my requirement.
Just thought to post this update for everyone, as it may help someone.
Thanks,
Siva

How can I remove all characters in each line after the first space in a text file?

I have a large log file from which I need to extract file names.
The file looks like this:
/path/to/loremIpsumDolor.sit /more/text/here/notAlways/theSame/here
/path/to/anotherFile.ext /more/text/here/differentText/here
.... about 10 million times
I need to extract the file names like this:
loremIpsumDolor.sit
anotherFile.ext
I figure my first strategy is to find/replace all occurrences of /path/to/ with ''. But I'm stuck on how to remove all the characters after the space.
Can you help?
sed 's/ .*//' file
Nothing more than that is needed. The transformed output appears on standard output, of course.
In theory, you could also use awk to grab the filename from each line as:
awk '{ print $1 }' input_file.log
That, of course, assumes that there are no spaces in any of the filenames. awk defaults to looking for whitespace as the field delimiters, so the above snippet would take the first "field" from your log file (your filename) for each line, and output it.
Pass it to cut:
cut -d' ' -f1 yourfile
A bash-only solution:
while read -r path otherstuff; do
  echo "${path##*/}"
done < filename
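One difference worth noting: the sed, awk, and cut answers keep the leading /path/to/, while the ${path##*/} expansion also strips it, which is what the asker's desired output shows. Chaining a second substitution onto the sed answer does both in one pass (sample lines taken from the question):

```shell
# First substitution drops everything after the first space;
# the second strips everything up to the last slash.
names=$(printf '%s\n' \
  '/path/to/loremIpsumDolor.sit /more/text/here/notAlways/theSame/here' \
  '/path/to/anotherFile.ext /more/text/here/differentText/here' |
  sed 's/ .*//; s,.*/,,')
printf '%s\n' "$names"
```

This yields the bare file names loremIpsumDolor.sit and anotherFile.ext, one per line.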

Delete all lines without # textmate regex

I have a huge comma-delimited file from which I need to filter out all lines that do not contain an email address (identified by the # character).
Right now what I have is this to find all lines containing the # sign:
.*,.*,.*#.*,.*$
Basically there are 4 values per line, and the 3rd value has the email address.
The "replace with:" value would be empty.
You have about 10 different ways to do this in TextMate and even more from the command line. Here are some of the easier ways...
From TextMate:
Command-control-t, start typing some part of the command "Copy Non-Matching Lines into New Document", use # (nothing else) for the pattern.
Same as above, except the command you're looking for is "Distill Document / Selection"
Find and select an # symbol. Then do the same as the above but search for the command "Strip Lines Matching Selection/Clipboard". You may not have it as I may have developed this one myself.
From the command line:
Type one of the following commands, replacing FILE with the filename, including the filepath if it's not in your current working directory. The filtered content can be found in FILE-new.
Using grep: grep '#' FILE > FILE-new (this keeps only the lines that contain a #)
Using sed: sed '/#/!d' FILE > FILE-new
For both of the above, use diff to see what you accomplished: diff FILE{,-new}
That should probably do, I'm guessing...
Try replacing ^[^#]*$ with nothing. Alternatively, grep the file with your regex and redirect the result into a new file.
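From the command line, the whole job reduces to keeping the lines that contain the marker character. The CSV rows below are made up for illustration, with # standing in for the email marker as in the question:

```shell
# Keep only rows whose 3rd field carries the '#' marker.
kept=$(printf '%s\n' \
  'john,doe,j.doe#example.com,NY' \
  'jane,roe,no-email-here,CA' |
  grep '#')
printf '%s\n' "$kept"
```

Only the row with the # survives; redirect the result to a new file to get the filtered copy.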