Add and Sort numbers in files [closed] - regex

I have directories like
./2012/NY/F/
./2012/NJ/M/
....
Under these directories, there are files with names like Zoe etc...
Each file contains a number.
I'd like to sum the numbers in the files that share the same name across the different directories, and then find the maximum of those sums. How should I write this?

To locate the files, use a glob such as the one described in this question.
To do the actual summing, there are quite a few possibilities depending on the number of files and range of the numbers, but a reasonably general-purpose way would be with awk:
awk '{sum += $1} END { print sum }' file1 file2 ...
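Putting the two together: a minimal sketch, assuming each file holds a single number and that Zoe (a name taken from the question) is the file you care about. The shell glob collects the identically named files and awk does the summing:
awk '{ sum += $1 } END { print sum }' ./2012/*/*/Zoe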

Suppose that your directories (./2012/NY/F, ./2012/sfs/XXS, and so on) are all under one directory, say /home/yourusername/data/.
You can try this if you are using *nix, or if you have Cygwin installed on your Windows machine:
cd /home/yourusername/data ; find . -name yourfile_name_to_lookup.txt | xargs awk 'BEGIN {sum=0} {sum+=$1} END {print sum}'
This assumes the number starts in the first column of each file ($1).

If you know the unique names of the files and the file names don't have spaces in them, then the following may work.
cd 2012/
for i in "Zoe" "file2" "file3"
do
    k=$(cat $(find . -type f -name "$i"))
    echo $k | awk '{for(i=t=0;i<NF;) t+=$++i; $0=t}1'   # sums all fields on the line
done | sort -r
This sums up files with the same name from subdirectories under 2012, and sort -r returns the numbers from max to min.
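The loop above prints only the sums, so after sorting you no longer know which name produced which total. A small variation of the same loop (same assumptions: unique names, no spaces) keeps each name next to its sum:
cd 2012/
for i in "Zoe" "file2" "file3"
do
    # concatenate every file with this name, then sum column 1
    sum=$(cat $(find . -type f -name "$i") | awk '{ t += $1 } END { print t }')
    printf '%s\t%s\n' "$sum" "$i"
done | sort -rn
With sort -rn the largest sum, together with its file name, comes out first.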

I assume that the entire contents of each file is a single integer. This requires bash 4 for the associative array.
declare -A sum_for_file
for path in ./2012/*/*/*; do
    # add this file's number to the running total for its basename
    (( sum_for_file["$(basename "$path")"] += $(< "$path") ))
done
max=0
for file in "${!sum_for_file[@]}"; do
    if (( ${sum_for_file[$file]} > max )); then
        max=${sum_for_file[$file]}
        maxfile=$file
    fi
    # you didn't say you needed to print it, but if you do
    printf "%d\t%s\n" "${sum_for_file[$file]}" "$file"
done
echo "the maximum sum is $max found in files named $maxfile"

Related

grep nth string from a very large file in constant time (file size independent)? [closed]

Is there a grep (sed/awk) like tool in Linux to find the nth occurrence of a string (regex) in a very large file? Also, I would like to find the number of occurrences of the search string within the file. Remember, the file is really large (> 2 GB).
Grep solution:
grep -on regexp < file.txt
test.txt:
one two one
two
one
two two
two one
Lines matching the regexp one:
grep -on one < test.txt
1:one
1:one
3:one
5:one
How many occurrences:
grep -on one < test.txt | wc -l
4
Line with the Nth match (shown here for N=1):
grep -m1 one < test.txt | tail -n1
one two one
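To generalize: -m makes grep stop after the given number of matching lines, so piping to tail -n1 yields the line containing the Nth match. Note that -m counts matching lines, not individual occurrences. For N=3 with the same test.txt:
grep -m3 one < test.txt | tail -n1
two one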
Update: Now, the solutions don't use cat. Thanks to @tripleee for the hint.
I would like to find the number of occurrences of the search string
within the file
If the search string can't contain spaces, the below might suffice:
awk -v RS=" " '/string/{i++}END{print "string count : " i}' file
But how fast it is depends on the available RAM on the system.
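If you want the exact number of occurrences rather than matching lines or space-split records, a common alternative (here with the placeholder string) is to let grep -o emit one match per line and count them:
grep -o 'string' file | wc -l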

Filter specific lines from directory tree listing [closed]

I have the following directory listing:
/home/a/b/c/d/5089/294265
/home/a/b/c/d/5089/79783
/home/a/b/c/d/41630
/home/a/b/c/d/41630/293520
/home/a/b/c/d/41630/293520/293520
...
I want to filter only the lines that go 7 directories deep. In this example I would need only the line: /home/a/b/c/d/41630/293520/293520
Please suggest.
Thanks
You could use grep. Saying:
grep -P '(/[^/]*){8}' inputfile
would return
/home/a/b/c/d/41630/293520/293520
Not sure how you are generating this listing, but if you were using find you could control the depth by specifying -mindepth and -maxdepth options.
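For example, assuming the listing comes from find rooted at /home, something along these lines would print only entries exactly seven levels below it:
find /home -mindepth 7 -maxdepth 7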
You can try:
find /home/x/y/z/ -print | awk -F/ 'NF>8'
or you could try
find /home/x/y/z/ -mindepth 7 -print
YourInput | sed 's|/.|&|8;t
d'
This deletes every line that has fewer than 8 "/" characters each followed by something: the s command succeeds only when it can find an 8th occurrence of /., and t then jumps past the d (delete) command for the lines that matched.
echo /home/a/b/c/d/*/*/*
should do the trick.
Using awk:
find /home | awk -F/ 'NF==9'

how to retrieve filename or extension within bash [duplicate]

This question already has answers here:
Extract filename and extension in Bash
(38 answers)
I have a script that is pushing out some filesystem data to be uploaded to another system.
It would be very handy if I could tell myself what 'kind' of file each file actually is, because it will help with some querying later on down the road.
So, for example, say that my script is spitting out the following:
/home/myuser/mydata/myfile/data.log
/home/myuser/mydata/myfile/myfile.gz
/home/myuser/mydata/myfile/mod.conf
/home/myuser/mydata/myfile/security
/home/myuser/mydata/myfile/last
In the end, I'd like to see:
/home/myuser/mydata/myfile/data.log log
/home/myuser/mydata/myfile/myfile.gz gz
/home/myuser/mydata/myfile/mod.conf conf
/home/myuser/mydata/myfile/security security
/home/myuser/mydata/myfile/last last
There's gotta be a way to do this with regular expressions and sed, but I can't figure it out.
Any suggestions?
EDIT:
I need to get this info via the command line. Looking at the answers so far, I obviously have not made this clear. So, with the example data I provided, assume that the data is all being fed via greps and seds (the data is already sterilized). I need to be able to pipe the example data to sed/grep/awk/whatever in order to produce the desired results.
Print the last field, where fields are separated by a non-alpha character:
awk -F '[^[:alpha:]]' '{ print $0,$NF }'
/home/myuser/mydata/myfile/data.log log
/home/myuser/mydata/myfile/myfile.gz gz
/home/myuser/mydata/myfile/mod.conf conf
/home/myuser/mydata/myfile/security security
/home/myuser/mydata/myfile/last last
This should work for you:
x='/home/myuser/mydata/myfile/security'
( IFS=/. && arr=( $x ) && echo ${arr[@]:(-1):1} )
security
x='/home/myuser/mydata/myfile/data.log'
( IFS=/. && arr=( $x ) && echo ${arr[@]:(-1):1} )
log
To extract the last element in a filename path:
filename=${path##*/}
To extract characters after a dot in a filename:
extension=${filename##*.}
But (my comment) rather than looking at the extension, it might be better to use file. See man file.
As others have already answered, to parse the file names:
extension="${full_file_name##*.}" # BASH and Kornshell/POSIX only
filename=$(basename "$full_file_name")
dirname=$(dirname "$full_file_name")
Quotes are needed if file names could have spaces, tabs, or other strange characters in them.
You can also test whether a file is a directory, a regular file, or a link with the test command (which is linked to [, so that test -f foo is the same as [ -f foo ]).
However, you said: "it would be very handy if i could tell myself what kind of file each file actually is".
In that case, you may want to investigate the file command. This command will return the file type as determined by some sort of magic file (traditionally in /etc/magic), but newer implementations can use the user's own scheme. It can tell the file type by extension, by the magic number in the file's header, or by looking at the first few lines of the file (for instance, matching the regular expression ^#! .*/bash$ against the first line).
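For example, a quick sketch of file on the sample paths (the descriptions shown are illustrative and depend on your magic database):
file -b /home/myuser/mydata/myfile/data.log
ASCII text
file -b /home/myuser/mydata/myfile/myfile.gz
gzip compressed data
The -b (brief) flag suppresses the leading file name in the output.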
This extracts the last component after a slash or a dot.
awk -F '[/.]' '{ print $NF }'

Move files starting with number and of type pdf [closed]

I am just a beginner with regex, so please forgive me if this question is too easy.
What I want to ask is: I have a bunch of files in a directory, and I want to move the ones whose names start with a number and are of type pdf. How do I use a regex with the mv command, and what would the regex be?
If you're using the Linux command prompt, you're actually not using regex; you're using glob notation instead, which is different. Read up on that. Globs cannot take a complex pattern such as the one you describe. You need to use a real regex.
For your case, you can use grep command on the output of ls to find the files meeting your requirement, then call mv on them. Something like this:
while IFS= read -r fileName; do mv "$fileName" destination_folder; done < <(ls -1 | grep -E '^[0-9].*\.pdf$')
Let's break it up:
while IFS= read -r fileName; do
    mv "$fileName" destination_folder
done < <(ls -1 | grep -E '^[0-9].*\.pdf$')
So basically you read through the directory listing using a while loop, which gets its input from the output of the last line, ls -1 | grep -E '^[0-9].*\.pdf$'. Using a while loop (instead of a simpler for loop), together with the quoting of "$fileName", is necessary to cater for filenames containing spaces.
Now the command ls -1 | grep -E '^[0-9].*\.pdf$' basically just lists the filenames and keeps only those matching the specified regex pattern: a leading digit (^[0-9]), anything in between, and a literal .pdf at the end.
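That said, this particular requirement (starts with a digit, ends in .pdf) does fit within glob notation, so a plain glob is a simpler alternative, assuming destination_folder exists:
mv [0-9]*.pdf destination_folder/
The glob expansion also handles filenames with spaces safely, since the shell does not word-split glob results.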
You could use find too:
find . -maxdepth 1 -name "[0-9]*.pdf" -exec mv {} destination \;

unix find filenames that are lexicographically less that a given filename

I have a list of files in a directory that are automatically generated by a system with the date in the filename. Some examples are: audit_20111020, audit_20111021, audit_20111022, etc.
I want to clean up files older than 18 months, so I want to put together a unix find command that will find files lexicographically less than audit_20100501 and delete them.
Does anyone know how to use lexicographic order as a criterion in the find command?
Another Perl variant:
perl -E'while(<audit_*>) { say if /(\d{8})/ && $1 < 20100501}'
Replace say by unlink if it prints the expected filenames.
Note: < performs numerical comparison, use lt if you want string comparison.
With Perl it's easy. Type perl and:
for (glob "*")
{
my($n) = /(\d+)/;
unlink if ($n < 20100501);
}
^D
Test before using. Note that I'm assuming this is a fixed format and that the directory only contains these files.
It is possible to sort find's result using the sort command:
find . -name "audit*" | sort -n
... then find a way to split this list.
But for what you want to do, i.e. delete directories older than a certain date (18 months is ~547 days), you could use the below instead (+547 matches entries whose status changed more than 547 days ago):
find . -ctime +547 -type d | xargs -I{} rm -rf {}
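Since the question asks specifically for lexicographic order, a plain bash sketch may be the most direct route (assuming all the audit files sit in one directory; [[ ... < ... ]] compares strings lexicographically in bash):
for f in audit_*; do
    if [[ $f < audit_20100501 ]]; then
        echo rm -- "$f"    # drop the echo once the list looks right
    fi
done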