shell: undo the count from `uniq -c`

Is there a good way to undo the -c option from uniq?
(Sometimes it makes sense to keep the count in a saved file, but then you'd want to get rid of it for further processing.)
I've tried playing with cut -f, but it doesn't really work, and specifying a space separator with -d" " doesn't work either, because the number of leading spaces varies with how big a given count is.
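For reference, it's the right-aligned count column that defeats cut; the number of leading spaces depends on the count's width (a sketch; the exact field width varies between implementations):
$ printf 'a\na\nb\n' | uniq -c
      2 a
      1 b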

# sed reads standard input itself, so no cat is needed;
# strip the leading spaces and the count in one anchored substitution
sed -E 's/^ *[0-9]+ //'
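For instance, a minimal roundtrip (words.txt is a hypothetical input file):
sort words.txt | uniq -c > counted.txt    # keep the count in a saved file
sed -E 's/^ *[0-9]+ //' counted.txt       # strip it again for further processing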

Bash: Pass all arguments exactly as they are to a function and prepend a flag on each of them

This seems like a relatively basic question, but I can't find it anywhere after an hour of searching. Many (there are a lot!) of the similar questions do not seem to hit the point.
I am writing a script ("vims") to use vim in a sed-like mode (so I can call normal vim commands on a stream input without actually opening vim), so I need to pass each argument to vim with a "-c" flag prepended to it. There are also many characters that need to be escaped (I need to pass regex expressions), so some of the usual methods on SO do not work.
Basically, when I write:
cat myfile.txt | vims ':%g/foo/exe "norm yyPImyfile: \<esc>\$dF,"' ':3p'
which passes two command-line vim arguments to run on the piped input,
I need these two single-quoted arguments to be passed exactly the way they are to my function vims(), which then tags each of them with a -c flag, so they are interpreted as commands in vim.
Here's what I've tried so far:
vims() {
    vim - -nes -u NONE -c "$1" -c ':q!' | tail -n +2
}
This seems to work perfectly for a single command. No characters get escaped, and the "-c" flag is there.
Then, using the oft-duplicated "$@" trick, I tried:
vims() {
    vim - -nes -u NONE $(for arg in "$@"; do echo -n " -c $arg "; done) -c ':q!' | tail -n +2
}
This breaks the spaces within each string I pass it, so it does not work. I also tried a few variations of the printf command, as suggested in other questions, but those had weird interactions with the vim command sequences. I've tried many different backslash-quote combinations in a perpetual edit-test loop, but have always found a quirk in my method.
What is the command sequence I am missing?
Add all the arguments to an array one at a time, then pass the entire array to vim with proper quoting to ensure whitespace is correctly preserved.
vims() {
    local args=()
    while (($# > 0)); do
        args+=(-c "$1")
        shift
    done
    vim - -nes -u NONE "${args[@]}" -c ':q!' | tail -n +2
}
As a rule of thumb, if you find yourself trying to escape things, add backslashes, use printf, etc., you are likely going down the wrong path. Careful use of quoting and arrays will cover most scenarios.
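To see what the array form buys you, here is a small demo (printargs is a hypothetical helper, not part of the script):
# print each received argument on its own line
printargs() { for a in "$@"; do printf '<%s>\n' "$a"; done; }

args=(-c 'a b' -c ':3p')
printargs "${args[@]}"   # four arguments; the space in 'a b' survives
printargs ${args[@]}     # unquoted: five arguments; 'a b' splits in two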

Sed replacing only part of a longer match with a shorter replacement

I'm measuring the total elapsed time of a C program. To do so, I've been running a shell script that uses sed to replace the value of a constant (below: N) defined somewhere in the middle of a line in my C program.
#define N 10 // This constant will be incremented by the shell script
Before you tell me that I should be using a variable and timing the function that uses it, I have to time the whole execution of the program externally on a single run (meaning no reassignment of N).
I've been using the following in a shell script to help out:
tmp=$(sed "11s/[0-9][0-9]*/$INCREMENTINGVAR/" myprogram.c); printf "%s" "$tmp" > myprogram.c
That replaces a 3-digit number with whatever my INCREMENTINGVAR (the replacement) is. However, this doesn't seem to work properly when the replacement is 2 digits long: sed replaces only the first two characters and leaves the third digit from the previous run in place.
TESTS=0
while [ $TESTS -lt 3 ]
do
    echo "This is test: $TESTS"
    INCREMENTINGVAR=10
    while [ "$INCREMENTINGVAR" -lt 10 ]
    do
        tmp=$(sed "11s/[0-9][0-9]*/$INCREMENTINGVAR/" myprogram.c); printf "%s" "$tmp" > myprogram.c
        rm -f myprog.c.bak
        echo "$INCREMENTINGVAR"
        gcc myprogram.c -o myprogram.out; ./myprogram.out
        INCREMENTINGVAR=$((INCREMENTINGVAR+5))
    done
    TESTS=$((TESTS+1))
done
Is there something I should do instead?
edit: Added whole shell script; Changed pattern for sed.
Do you simply want to replace whatever digit string is on line 11 with the new value? If so, you'd write:
sed -e "11s/[0-9][0-9]*/$INCREMENTINGVAR/"
That looks for a sequence of one or more digits, and replaces it by the current value in $INCREMENTINGVAR. This will rollover from 9 to 10, and from 99 to 100, and from 999 to 1000, etc. Indeed, there's nothing to stop you jumping from 1 to 987,654 if that's what you want to do.
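For example, on a hypothetical one-line file (using line 1 instead of 11 to keep the demo short):
$ cat one.c
#define N 999
$ INCREMENTINGVAR=10
$ sed "1s/[0-9][0-9]*/$INCREMENTINGVAR/" one.c
#define N 10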
With the GNU and BSD (Mac OS X) versions of sed, you can overwrite the file in place. The portable way (meaning it works the same with both GNU and BSD variants of sed) is:
sed -i.bak -e "11s/[0-9][0-9]*/$INCREMENTINGVAR/" myprog.c
rm -f myprog.c.bak
This creates a backup file (and removes it). The problem is that GNU sed requires just -i and BSD sed requires -i '' (two arguments) to do an in situ change without a backup. You can decide that portability is not relevant.
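A sketch of the difference, if you decide portability doesn't matter:
sed -i     -e "11s/[0-9][0-9]*/$INCREMENTINGVAR/" myprog.c   # GNU sed: no backup
sed -i ''  -e "11s/[0-9][0-9]*/$INCREMENTINGVAR/" myprog.c   # BSD sed: empty suffix as a separate argument
sed -i.bak -e "11s/[0-9][0-9]*/$INCREMENTINGVAR/" myprog.c   # both: backup kept as myprog.c.bak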
Note that using line number to identify what must be changed is delicate; trivial changes (a new header, more commentary) could change the line number. It would probably be better to use a context search:
sed -i.bak -e "/^#define N [0-9]/ s/[0-9][0-9]*/$INCREMENTINGVAR/" myprog.c
rm -f myprog.c.bak
This assumes spaces between define and N and the number. If you might have blanks or tabs in it, then you might write:
sed -i.bak -e "/^[[:space:]]*#[[:space:]]*define[[:space:]]\{1,\}N[[:space:]]*\{1,\}[0-9]/ s/[0-9][0-9]*/$INCREMENTINGVAR/" myprog.c
rm -f myprog.c.bak
That looks for optional leading white space before the #, optional white space between the # and the define, mandatory white space (at least one, possibly many) between define and N, and mandatory white space again between N and the first digit of the number. But probably your input isn't that sloppy and a simpler search pattern (like the first option) is sufficient to meet your needs. You could also write code to normalize eccentrically formatted #define lines into a canonical representation — but again, you most probably don't need to.
If you have somewhere else in the same file that contains something like this:
#undef N
#define N 100000
you would have to worry about the pattern matching this line too. However, few files do that; it isn't likely to be a problem in practice (and if it is, the code in general probably has more problems than can be dealt with here). One possibility would be to limit the range to the first 30 lines, assuming the first #define N 123 is somewhere in that range and the second is not.
sed -i.bak -e "1,30 { /^[[:space:]]*#[[:space:]]*define[[:space:]]\{1,\}N[[:space:]]*\{1,\}[0-9]/ s/[0-9][0-9]*/$INCREMENTINGVAR/; }" myprog.c
rm -f myprog.c.bak
There are multiple other tricks that could be pulled to limit the damage, with varying degrees of verbosity. For example:
sed -i.bak -e "1,/^[[:space:]]*#[[:space:]]*define[[:space:]]\{1,\}N[[:space:]]*\{1,\}[0-9]\{1,\}/ \
s/^[[:space:]]*#[[:space:]]*define[[:space:]]\{1,\}N[[:space:]]*\{1,\}[0-9]\{1,\}/#define N $INCREMENTINGVAR/; }" myprog.c
rm -f myprog.c.bak
Working with regexes is generally a judgement call between specificity and verbosity — you can make things incredibly safe but incredibly difficult to read, or you can run a small risk that your more readable code will match something unintended.

batch renaming of files with perl expressions

This should be a basic question for a lot of people, but I am a biologist with no programming background, so please excuse my question.
What I am trying to do is rename about 100,000 gzipped data files whose current names are codes (example: XG453834.fasta.gz). I'd like to rename them to something I can easily read and parse (example: Xanthomonas_galactus_str_453.fasta.gz).
I've tried to use sed, rename, and mmv, to no avail. Each of those commands works fine as a one-off; it's only when I try to incorporate variables into a shell script that I run into problems. I'm not getting any errors, just no names are changed, so I suspect an I/O error.
Here's what my files look like:
#!/bin/bash
# change a bunch of file names
file=names.txt
while IFS=' ' read -r r1 r2
do
    mmv ''$r1'.fasta.gz' ''$r2'.fasta.gz'
    # or I tried many versions of: sed -i 's/"$r1"/"$r2"/' *.gz
    # and I tried many versions of: rename -i 's/$r1/$r2/' *.gz
done < "$file"
...and here's the first lines of my txt file with single space delimiter:
cat names.txt
#find #replace
code1 name1
code2 name2
code3 name3
I know I can do this with python or perl, but since I'm stuck here working on this particular script I want to find a simple solution to fixing this bash script and figure out what I am doing wrong. Thanks so much for any help possible.
Also, I tried to cat the names file (see comment from Ashoka Lella below) and then use awk to move/rename. Some of the files have variable names (but will always start with the code), so I am looking for a find & replace option to just replace the "code" with the "name" and preserve the file name structure.
I suspect I am not escaping the variable within the single quotes of the perl expression, but I have pored over a lot of manuals and I can't find the way to do this.
If you're absolutely sure that the filenames don't contain spaces or tabs, you can try the following:
xargs -n2 < names.txt echo mv
This is a dry run (it only prints what would be done); if you're satisfied with the result, remove the echo.
If you want to be asked before an existing target is overwritten, use
xargs -n2 < names.txt echo mv -i
If you never want to overwrite an existing target, use
xargs -n2 < names.txt echo mv -n
Again, remove the echo once you're satisfied.
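Note that with the names.txt shown above, the "#find #replace" header line would be fed to mv as well; one way to skip it (a sketch using the same dry-run pattern):
tail -n +2 names.txt | xargs -n2 echo mv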
I don't think you need mmv; a simple mv will do. Also, there's no need to specify IFS; the default will work for you:
while read -r src dest; do mv "$src" "$dest"; done < names.txt
I have double-quoted the variables as that is generally considered good practice, but note that a space in either of the filenames would already cause read to split the line differently than you expect.
You can put an echo before the mv inside the loop to ensure that the correct command will be executed.
Note that in your file names.txt, the .fasta.gz suffix is already included, so you shouldn't be adding it inside the loop as well. Perhaps that was your problem?
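If the suffix is not in names.txt (as the codeN nameN sample might suggest), append it inside the loop instead, skipping the header line (a sketch):
tail -n +2 names.txt | while read -r src dest; do
    mv "$src.fasta.gz" "$dest.fasta.gz"
done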
This should rename all files from column 1 to column 2 of names.txt, provided they are in the same directory as names.txt:
awk '{print "mv "$1" "$2}' names.txt | sh

Find non-ASCII codepoints in a file

I am currently using this regex to find the non-ASCII code points in a file, no matter what encoding:
$ cat test.txt | hd | grep -P " [8-9a-f][\da-f]"
Is there a better, more concise, or less hacky method? I usually use grep -P "[^\x00-\x7f]" to find the offensive characters but here I am looking for the offensive code points.
Note that the current hacky method does have the nice side effect of showing the surrounding ASCII characters, which is very nice for context.
Using hd, this should be faster:
hd test.txt | grep -w '[89a-f][0-9a-f]'
(grep -P invokes libpcre and is slower. grep -w matches whole "words" and defaults to standard POSIX regex, which is nearly as fast as a plain -F fixed-string search. Removing the cat from the pipe also saves (trivial) effort.)
If you didn't want the context, you could give grep the -o flag. If you want the context called out more clearly, consider --color (or even --color=always if you're piping the output somewhere and don't mind the coloring control characters). You may also find grep's -n flag useful, which will give you line numbers.
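For example, to keep the hex-dump context, number the output lines, and highlight the matches while paging (assuming GNU grep; less -R preserves the colours):
hd test.txt | grep -nw --color=always '[89a-f][0-9a-f]' | less -R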
I think you can use grep's -a flag to achieve what you're looking for in a single command (this forces everything to be read as text rather than the useless "Binary file test.txt matches" output), though you may not like what the output does to your terminal. Maybe pipe it into a file and then view that file with vim (which, unlike less, won't render control characters):
grep -aP '[^\x00-\x7f]' test.txt > found-highchars
view found-highchars
This may or may not be faster than piping through hd and grep.

Complex changes to a URL with sed

I am trying to parse an RSS feed on the Linux command line which involves formatting the raw output from the feed with sed.
I currently use this command:
feedstail -u http://www.heise.de/newsticker/heise-atom.xml -r -i 60 -f "{published}> {title} {link}" | sed 's/^\(.\{3\}\)\(.\{13\}\)\(.\{6\}\)\(.\{3\}\)\(.*\)/\1\3\5/'
This gives me a number of feed items per line that look like this:
Sat 20:33 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/newsticker/meldung/WhatsApp-Ausfall-Server-Probleme-blockieren-Messaging-Dienst-2121664.html/from/atom10?wt_mc=rss.ho.beitrag.atom
Notice the long URL at the end. I want to shorten this to better fit on the command line. Therefore, I want to change my sed command to produce the following:
Sat 20:33 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/-2121664
That means cutting everything out of the URL except a dash and the seven-digit number preceding the ".html/blablabla" bit.
Currently my sed command only changes stuff in the date bit. It would have to leave the title and the start of the URL alone, then cut everything out of the URL until it reaches the seven-digit number, preserve that, and cut everything after it. Oh yeah, and we need to leave a dash right in front of that number too.
I have no idea how to do that and can't find the answer after hours of googling. Help?
EDIT:
This is the raw output of a line of feedstail -u http://www.heise.de/newsticker/heise-atom.xml -r -i 60 -f "{published}> {title} {link}", in case it helps:
Sat, 22 Feb 2014 20:33:00 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/newsticker/meldung/WhatsApp-Ausfall-Server-Probleme-blockieren-Messaging-Dienst-2121664.html/from/atom10?wt_mc=rss.ho.beitrag.atom
EDIT 2:
It seems I can only pipe that output into one command. Piping it through multiple ones seems to break things. I don't understand why ATM.
Unfortunately (for me), I could only think of solving this with extended regexp syntax (either -E or -r flag on different systems):
... | sed -E 's|(://[^/]+/).*(-[0-9]+)\.html/.*|\1\2|'
UPDATE: In basic regexp syntax, the best I can do is
... | sed 's|\(://[^/]*/\).*\(-[0-9][0-9]*\)\.html/.*|\1\2|'
The key to writing this sort of regular expression is to be very careful about what the boundaries of what you expect are, so as to avoid the random gunk that you want to get rid of causing you problems. Also, you should bear in mind that you can use characters other than / as part of a s operation's delimiters.
sed 's!\(http://www\.heise\.de/\)newsticker/meldung/[^./]*\(-[0-9][0-9]*\)\.html[^ ]*!\1\2!'
Be aware that getting the RE right can be quite tricky; assume you'll need to test it! (This is a key part of the “now you have two problems” quote; REs very easily become horrendous.)
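For instance, testing against the sample line from the question (saved in a hypothetical file sample.txt):
$ sed 's!\(http://www\.heise\.de/\)newsticker/meldung/[^./]*\(-[0-9][0-9]*\)\.html[^ ]*!\1\2!' sample.txt
Sat 20:33 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/-2121664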
Something like this maybe?
... | awk -F'[^0-9]*' '{print "http://www.heise.de/-"$2}'
This might work for you (GNU sed):
sed 's|\(//[^/]*/\).*\(-[0-9]\{7\}\).*|\1\2|' file
You can combine it with the first sed command like so:
feedstail -u http://www.heise.de/newsticker/heise-atom.xml -r -i 60 -f "{published}> {title} {link}" |
sed 's/^\(.\{3\}\)\(.\{13\}\)\(.\{6\}\)\(.\{3\}\)\(.*\)/\1\3\5/;s|\(//[^/]*/\).*\(-[0-9]\{7\}\).*|\1\2|'