linux pipe with egrep not working as expected - regex

I have this small piece of code
egrep -oh '([A-Z]+)_([A-Z]+)_([A-Z]+)' -R /path | uniq | sort
I use this script to dig for environment variables inside files stored in a common directory. I don't want to display any duplicates; I just want the name of each variable that is being used.
Needless to say, the regex works: the matched words are the ones composed of three groups of uppercase letters in the form *_*_*. The problem is that uniq doesn't seem to be doing anything; the variables are printed out just as egrep finds them.
Not even uniq -u does the trick.
Is the pipe itself the problem ?

uniq requires its input to be sorted if you want it to work in this manner. From the man page (emphasis mine):
DESCRIPTION: Filter *adjacent* matching lines
So you could put a sort before the uniq in the pipeline, but that is not necessary: you can simply use the -u flag of sort to output only the unique lines of the sorted output:
egrep -oh '([A-Z]+)_([A-Z]+)_([A-Z]+)' -R /path | sort -u
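To see the difference, here's a minimal demonstration with made-up variable names: the duplicate lines are not adjacent, so uniq alone passes them through, while sort -u collapses them:
$ printf 'FOO_BAR_BAZ\nAAA_BBB_CCC\nFOO_BAR_BAZ\n' | uniq | sort
AAA_BBB_CCC
FOO_BAR_BAZ
FOO_BAR_BAZ
$ printf 'FOO_BAR_BAZ\nAAA_BBB_CCC\nFOO_BAR_BAZ\n' | sort -u
AAA_BBB_CCC
FOO_BAR_BAZ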

Related

grep for a list of variables that are referenced by a script?

I have some scripts that use various variables, and I want to grep (from within bash on FreeBSD) each of them for the list of variables that are used by the script. These are not shell scripts, but the syntax of referencing a variable is similar to that used in bash. Specifically, a reference to a variable can be like:
$X, or
${X}
and the name of the variable ("X" in this case) can include alphanumerics and underscores. At this point I want to explicitly note that I imagine that bash itself probably has a more complicated set of possible ways to reference variables, but if so, I do not care about that for the purposes of this question.
I would like to find all variable names that are so referenced in a given file - just the name, not the entire line. So something of the form awesomegrepcmd filename | sort | uniq, or something like that.
Note that there may be multiple different variables referenced on a single line.
I have been fighting regex and escaping rules via largely ignorant stab-in-the-dark attempts (e.g. "OK, maybe I need to put TWO backslashes here, but only ONE there") for a while now, and googling to try to find others who have done this sort of thing, but have thus far been unable to accomplish what I want. How can this be done? Thanks.
EDIT: Example: Let's say the content of a file is this (just pseudocode, not meant as bash or anything):
SPECIFICS={ "be born", "die", "plant", "reap", "kill", "heal", "laugh", "weep" }
CHORUS_PREFIX="To every thing"
CHORUS_INFIX="there is a season"
CHORUS_SUFFIX="and a time to every purpose under heaven."
CHORUS_SEPARATOR=", turn, turn, turn, "
CHORUS = $CHORUS_PREFIX + ${CHORUS_SEPARATOR} + $CHORUS_INFIX + $CHORUS_SEPARATOR + ${CHORUS_SUFFIX}
SPECIFIC_PREFIX="A time to "
UNUSED=${SOMETHING}
echo $CHORUS
foreach SPECIFIC in $SPECIFICS {
echo $SPECIFIC_PREFIX + " " + ${SPECIFIC}
}
echo ${CHORUS}
Then the output I would want would be:
CHORUS
CHORUS_INFIX
CHORUS_PREFIX
CHORUS_SEPARATOR
CHORUS_SUFFIX
SOMETHING
SPECIFIC
SPECIFIC_PREFIX
SPECIFICS
A couple things to note:
${CHORUS} and $CHORUS (for example) both refer to the same variable, named CHORUS
The UNUSED variable is not in the output, because it is not actually used in the script (despite having been defined in it)
The SOMETHING variable, used only to define the unused UNUSED variable, is in the output, since it was used
The various CHORUS_xxx variables are only used in the one single line defining the CHORUS variable, but they are all present in the output.
Using an Extended regular expression:
me@lappy386:/tmp$ grep -Eoi '\$(\{[a-z0-9_]+\}|[a-z0-9_]+)' \
/tmp/example | tr -d '{}$' | sort | uniq
CHORUS
CHORUS_INFIX
CHORUS_PREFIX
CHORUS_SEPARATOR
CHORUS_SUFFIX
SOMETHING
SPECIFIC
SPECIFIC_PREFIX
SPECIFICS
Using grep and sed
grep -oE '\$[A-Za-z_]+|\${[A-Za-z_]+}' inputFile| sed -r 's/[${}]//g' | sort | uniq
Example :
$ grep -oE '\$[A-Za-z_]+|\${[A-Za-z_]+}' inputFile | sed -r 's/[${}]//g' | sort | uniq
CHORUS
CHORUS_INFIX
CHORUS_PREFIX
CHORUS_SEPARATOR
CHORUS_SUFFIX
SOMETHING
SPECIFIC
SPECIFICS
SPECIFIC_PREFIX
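If your grep has PCRE support (-P), a variant worth trying skips the cleanup step entirely: \K drops the $ and the optional opening brace from the reported match, so no tr or sed pass is needed (inputFile as above):
$ grep -oP '\$\{?\K[A-Za-z_][A-Za-z0-9_]*' inputFile | sort -u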

Using grep for listing files by owner/read perms

The rest of my bash script works; I'm just having trouble with grep. On each file I am using the following command:
ls -l $filepath | grep "^.r..r..r.*${2}$"
How can I properly use the second argument in the regular expression? What I am trying to do is print the file if it can be read by anyone and its owner is the user passed as the second argument.
Using:
ls -l $filepath | grep "^.r..r..r"
will print the information successfully based on the read permissions. What I am trying to do is print based on [read permission][any characters in between][ending with the owner's name].
The immediate problem with your attempt is the final $, which anchors the search to the end of the line; in ls -l output that is the end of the file name, not the owner field. A better solution would replace grep with Awk, which has built-in support for examining specific fields. But actually, don't use ls for this, or really in scripts at all.
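For completeness, a minimal sketch of that Awk idea (still ls-based, so the same caveat applies; it assumes the usual ls -l layout, with the permission string in field 1 and the owner in field 3):
ls -l "$filepath" | awk -v owner="$2" '$1 ~ /^.r..r..r/ && $3 == owner'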
Unfortunately, the stat command's options are not entirely portable, but for Linux, try
case $(stat -c %a:%U "$filepath") in
  [4-7][4-7][4-7]:"$2") ls -l "$filepath";;
esac
where %a is the octal permissions and %U the owner's user name, so the pattern matches when all three read bits are set and the owner is "$2".
or maybe more portably
find "$filepath" -user "$2" -perm -444 -ls
Here -perm -444 requires all three read bits to be set, matching the case pattern above; GNU find also accepts -perm /444, which tests whether any of the bits is set, but that form is not portable. Sadly, the -ls action is not entirely portable, either.
Paradoxically, the de facto most portable replacement for stat to get a file's permissions might actually be
perl -le '@s = stat($ARGV[0]); printf "%03o\n", $s[2] & 07777' "$filepath"
The stat call returns a list of fields; if you want the owner, too, the numeric UID is in $s[4] and getpwuid($s[4]) gets the user name.
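Building on that, a sketch that prints both the permissions and the owner's name (field indexes as documented in perldoc -f stat):
perl -le '@s = stat($ARGV[0]); printf "%03o %s\n", $s[2] & 07777, scalar getpwuid($s[4])' "$filepath"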

extract filename and path with slash delimiter from text using grep and lookaround

I am trying to write a bash script to automate checkin of changed files after comparing a local folder to another remote folder.
To achieve this I am trying to extract the filename, with a portion of the remote folder's path, to be used in the checkin commands. I am seeking assistance on extracting the filename with its path.
To achieve the comparison I used diff command as follows
diff --brief --suppress-common-lines -x '*.class' -ar ~/myprojects/company/apps/product/package/test/ $env_var/java/package/test/
The above command prints output in the following format:
Files /home/xxxx/myprojects/company/apps/product/package/test/fileName.java and /productdev/product/product121/java/package/test/filename.java differ
I want to extract the file name between 'and' & 'differ', so I used a lookaround regular expression in a grep command:
diff --brief --suppress-common-lines -x '*.class' -ar ~/myprojects/company/apps/product/package/test/ $env_var/java/package/test/ | grep -oP '(?<=and) .*(?=differ)'
which gave me:
/productdev/product/product121/java/package/test/filename.java
I would like to display the path starting from java to the end of the text, as in: java/package/test/filename.java. How can I do that?
You could try the below grep command,
grep -oP 'and.*\/\Kjava.*?(?= differ)'
That is,
diff --brief --suppress-common-lines -x '*.class' -ar ~/myprojects/company/apps/product/package/test/ $env_var/java/package/test/ | grep -oP 'and.*\/\Kjava.*?(?=\s*differ)'
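Here \K discards everything matched so far from what grep reports, so the greedy and.*\/ walks to the last slash that is followed by java, and the lookahead stops the match just before differ. For example, against the sample line from the question:
$ echo 'Files /home/xxxx/myprojects/company/apps/product/package/test/fileName.java and /productdev/product/product121/java/package/test/filename.java differ' | grep -oP 'and.*\/\Kjava.*?(?= differ)'
java/package/test/filename.java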
As I see it, you will get one such line for every pair of files that differs between the two folders.
So the first step would be to grep for the lines containing "differ", in case the diff output contains any other kind of line as well.
You can skip that step if the output only ever contains lines like the one you mentioned.
The next step is extracting both paths. For that you can use:
awk '{print $2,$4}'
This prints only the 2nd and 4th fields, i.e. both paths.
awk splits fields on whitespace, irrespective of how many spaces separate them.
Another simple way of doing it is:
cut -d" " -f 2,4
This does the same thing. The "-d" flag specifies the delimiter that separates the fields and the "-f" flag specifies which fields to pick (the 2nd and 4th).
Once you have these paths you can store them in variables and cut or awk whatever parts of them you want, as sketched below.
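A minimal sketch of that last idea (assuming every line of interest has the form "Files A and B differ" and the remote path always contains a /java/ component; the variable names are just illustrative):
diff --brief --suppress-common-lines -x '*.class' -ar ~/myprojects/company/apps/product/package/test/ $env_var/java/package/test/ |
grep ' differ$' | while read -r _ localpath _ remotepath _; do
    echo "java/${remotepath#*/java/}"    # strip everything up to and including the first "/java/"
done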

Complex changes to a URL with sed

I am trying to parse an RSS feed on the Linux command line which involves formatting the raw output from the feed with sed.
I currently use this command:
feedstail -u http://www.heise.de/newsticker/heise-atom.xml -r -i 60 -f "{published}> {title} {link}" | sed 's/^\(.\{3\}\)\(.\{13\}\)\(.\{6\}\)\(.\{3\}\)\(.*\)/\1\3\5/'
This gives me a number of feed items, one per line, that look like this:
Sat 20:33 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/newsticker/meldung/WhatsApp-Ausfall-Server-Probleme-blockieren-Messaging-Dienst-2121664.html/from/atom10?wt_mc=rss.ho.beitrag.atom
Notice the long URL at the end. I want to shorten this to better fit on the command line. Therefore, I want to change my sed command to produce the following:
Sat 20:33 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/-2121664
That means cutting everything out of the URL except the dash and the seven-digit number preceding the ".html/blablabla" bit.
Currently my sed command only changes the date bit. It would have to leave the title and the start of the URL alone, cut everything from there up to the seven-digit number, preserve that number, and then cut everything after it. Oh yeah, and we need to keep the dash right in front of that number too.
I have no idea how to do that and can't find the answer after hours of googling. Help?
EDIT:
This is the raw output of a line of feedstail -u http://www.heise.de/newsticker/heise-atom.xml -r -i 60 -f "{published}> {title} {link}", in case it helps:
Sat, 22 Feb 2014 20:33:00 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/newsticker/meldung/WhatsApp-Ausfall-Server-Probleme-blockieren-Messaging-Dienst-2121664.html/from/atom10?wt_mc=rss.ho.beitrag.atom
EDIT 2:
It seems I can only pipe that output into one command. Piping it through multiple ones seems to break things. I don't understand why ATM.
Unfortunately (for me), I could only think of solving this with extended regexp syntax (either -E or -r flag on different systems):
... | sed -E 's|(://[^/]+/).*(-[0-9]+)\.html/.*|\1\2|'
UPDATE: In basic regexp syntax, the best I can do is
... | sed 's|\(://[^/]*/\).*\(-[0-9][0-9]*\)\.html/.*|\1\2|'
The key to writing this sort of regular expression is to be very careful about the boundaries of what you expect, so that the random gunk you want to get rid of doesn't cause you problems. Also, bear in mind that you can use characters other than / as the delimiter of an s operation:
sed 's!\(http://www\.heise\.de/\)newsticker/meldung/[^./]*\(-[0-9]\{1,\}\)\.html[^ ]*!\1\2!'
Be aware that getting the RE right can be quite tricky; assume you'll need to test it! (For example, + is not special in basic regular expressions, hence the \{1,\} above. This is a key part of the “now you have two problems” quote; REs very easily become horrendous.)
Something like this maybe? Note that it assumes the input is just the URL; with a whole feed line, the time's digits would land in $2 instead:
... | awk -F'[^0-9]*' '{print "http://www.heise.de/-"$2}'
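For example, with the URL on its own (GNU awk):
$ echo 'http://www.heise.de/newsticker/meldung/WhatsApp-Ausfall-Server-Probleme-blockieren-Messaging-Dienst-2121664.html/from/atom10?wt_mc=rss.ho.beitrag.atom' | awk -F'[^0-9]*' '{print "http://www.heise.de/-"$2}'
http://www.heise.de/-2121664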
This might work for you (GNU sed):
sed 's|\(//[^/]*/\).*\(-[0-9]\{7\}\).*|\1\2|' file
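A quick check against the sample line from the question:
$ echo 'Sat 20:33 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/newsticker/meldung/WhatsApp-Ausfall-Server-Probleme-blockieren-Messaging-Dienst-2121664.html/from/atom10?wt_mc=rss.ho.beitrag.atom' | sed 's|\(//[^/]*/\).*\(-[0-9]\{7\}\).*|\1\2|'
Sat 20:33 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/-2121664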
You can combine it with your first sed command like so:
feedstail -u http://www.heise.de/newsticker/heise-atom.xml -r -i 60 -f "{published}> {title} {link}" |
sed 's/^\(.\{3\}\)\(.\{13\}\)\(.\{6\}\)\(.\{3\}\)\(.*\)/\1\3\5/;s|\(//[^/]*/\).*\(-[0-9]\{7\}\).*|\1\2|'

Merging files in Linux

I am using Cygwin to merge multiple files. However, I wanted to know if my approach is correct or not. This is both a question and a discussion :)
First, a little info about the files I have:
Both files contain ASCII as well as non-ASCII characters.
File1 has 7899097 lines in it and a size of ~70.9 MB
File2 has 14344391 lines in it and a size of ~136.6 MB
File Encoding Information:
$ file -bi file1.txt
text/x-c++; charset=unknown-8bit
$ file -bi file2.txt
text/x-c++; charset=utf-8
$ file -bi output.txt
text/x-c++; charset=unknown-8bit
This is the method I am following to merge the two files, sort them and then remove all the duplicate entries:
I create a temp folder and place both the text files inside it.
I run the following commands to merge the two files while keeping a line break between them:
for file in *.txt; do
    cat "$file" >> output.txt
    echo >> output.txt
done
The resulting output.txt file has 22243490 lines in it and a size of 207.5 MB
Now, if I run the sort command on it as shown below, I get an error because there are non-ASCII characters (maybe Unicode wide characters) in it:
sort -u output.txt
string comparison failed: Invalid or incomplete multibyte or wide character
So, I set the environment variable LC_ALL to C and then run the command as follows:
cat output.txt | sort -u | uniq >> result.txt
And result.txt has 22243488 lines in it and a size of 207.5 MB.
So result.txt is practically the same as output.txt.
Now, I already know that there are many duplicate entries in output.txt, so why are the above commands not able to remove them?
Also, considering the large size of the files, I wanted to know if this is an efficient method to merge multiple files, sort them, and then remove duplicates.
Hmm, I'd use
cat file1.txt file2.txt other-files.* | recode enc1..enc2 | sort | uniq > file3.txt
but watch out: this could cause problems with some big file sizes, counted in gigabytes (or bigger); with hundreds of megabytes it should probably go fine. If I wanted real efficiency, e.g. with really huge files, I'd first remove duplicates within each single file, sort each one, merge them one after another, and then sort and remove duplicate lines again; see the sketch after the link below. In theory a uniq -c and grep filter could also remove duplicates. Try to avoid falling into some unneeded sophistication of the solution :)
http://catb.org/~esr/writings/unix-koans/two_paths.html
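A minimal sketch of that staged approach (file names assumed; setting LC_ALL=C also sidesteps the multibyte comparison error from the question):
LC_ALL=C sort -u file1.txt -o file1.sorted    # dedupe and sort each file on its own
LC_ALL=C sort -u file2.txt -o file2.sorted
LC_ALL=C sort -m -u file1.sorted file2.sorted > result.txt    # merge the presorted files, dropping duplicates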
edited:
mv file1.txt file1_iso1234.txt
mv file2.txt file2_latin7.txt
ls file*.txt | while read line; do
    # the source encoding is encoded in the file name: the part between "_" and "."
    cat "$line" | recode "$(echo "$line" | cut -d'_' -f2 | cut -d'.' -f1)..utf8"
done | sort | uniq > finalfile.txt