Unable to create sed substitution to deduplicate file

Unable to create sed substitution to deduplicate file - regex

I have the file with many duplicates of the form
a
a
b
b
c
c
Which I need to reduce to
a
b
c
So I wrote a sed command: sed -r 's/^(.*)$\n^(.*)$/\1/mg' filename, but the file was still showing duplicates. However I'm sure this regex works because I tested it here.
So what am I doing wrong?
I suspect it may be related to the -r option, as I'm not really sure what that does (but without it I get a invalid reference \1 ons' command's RHS` error).

Either of 2 simpler approaches should work for you.
A simple awk command to print a line only first time by maintaining an array of already printed lines:
awk '!seen[$0]++' file
a
b
c
Since file is already sorted you can use uniq also:
uniq file
a
b
c
Edit: Newer gnu-awk versions support in place editing also using:
awk -i 'inplace' '!seen[$0]++' file

Related

Converting LaTeX pmatrix command to amsmath pmatrix environment using sed

I have an old LaTeX document (with a lot of formatting commands) that I want to convert to the more modern LaTeX (I want to do the update for several reasons, not the least of which is to reduce the coupling between content and formatting). At any rate, the document has a lot of calls to the deprecated command \pmatrix{ .... } which I would like to replace with the new amsmath command \begin{pmatrix} ... \end{pmatrix}. I have been trying to use sed to do this conversion but I have never used it before and I am having trouble.
Here is a MWE
LaTeX input string
\pmatrix{0&0\cr \frac{1}{2}&0\cr 0&0\cr}\pmatrix{1&1\cr 1&1\cr 1&1\cr}
with the expected output
\begin{pmatrix}0&0\\ \frac{1}{2}&0\\ 0&0\end{pmatrix}\begin{pmatrix}1&1\\ 1&1\\ 1&1\end{pmatrix}
The commands that I have been trying to use are variants of the following
sed 's/\\pmatrix{\(.*\cr[ ]*\)}/\\begin{pmatrix}\1 \\end{pmatrix}/g' <$WORKING_FILE >$OUTPUT_FILE
but the closest output that I have been able to achieve is
\begin{pmatrix}0 & 0 \\ 0 & 0 \\ 0 & 0 \end{pmatrix}
I am pretty sure that the problem is related to having two calls to pmatrix side by side, but I am not sure how to modify the regex to make this work.
I have searched google, but being so new to regex, I just got confused by all of the variations out there and which to use, and how to properly format such a thing.

The following might work for you:
sed -re 's/(\\pmatrix)\{([^}]*)}/\\begin{pmatrix}\2\\end{pmatrix}/g' -e 's/\\cr/\\\\/g' -e 's/\\\\\\end/\\end/g' inputfile
This works by:
substituting \pmatrix{...} with `\begin{matrix}...\end{matrix}
substituting \cr with \\
handling \\\end to make it \end
EDIT: As per your update, you might be better off splitting the relevant parts using grep before piping to sed:
grep -oP '\\pmatrix.*?\\cr}' inputfile | sed -re 's/\\pmatrix\{(.*)}/\\begin{pmatrix}\1\\end{pmatrix}/g;s/\\cr/\\\\/g;s/\\\\\\end/\\end/g'

This might work for you (GNU sed):
sed -r 's/\\cr/\n/g;s/\\(pmatrix)\{([^\n]*)\n([^\n]*)\n([^\n]*)\n\}/\\begin{\1}\2\\\\ \3\\\\ \4\\end{\1}/g;s/\n/\\cr/g' file
Convert \\cr to newlines. Do a global substitution command. Then convert those newlines left back to \\cr's.

Find matches between 2 files

I'm trying to output matching lines in 2 files using AWK. I made it easier by making 2 files with just one column, they're phone numbers. I found many people asking the same question and getting the answer to use :
awk 'NR==FNR{a[$1];next}$1 in a{print $1}' file1 file2
The problem I encountered was that it simply doesn't want to work. The first file is small (~5MB) and the second file is considerably larger (~250MB).
I have some general knowledge of AWK and know that the above script should work, yet I'm unable to figure out why it's not.
Is there any other way I can achieve the same result?
GREP is a nice tool, but it clogs up the RAM and dies within seconds due to the file size.
I did run some spot checks to find out whether there are matches, and when I did a grep of random numbers from the smaller file and grep'd them through the big one and I did find matches, so I'm sure that there are.
Any help is appreciated!
[edit as requested by #Jaypal]
Sample code from both files :
File1:
01234567895
01234577896
01234556894
File2:
01234642784
02613467246
01234567895
Output:
01234567895
What I get:
xxx#xxx:~$ awk 'NR==FNR{a[$1];next}$1 in a{print $1}' file1 file2
xxx#xxx:~$

Update
The problem happens to be with the kind of file you were using. Apparently it came from a DOS system and had many \r around. To solve it, do "sanitize" them with:
dos2unix
Former answer
Your awk is pretty fine. However, you can also compare files with grep -f:
grep -f file1 file2
This will look for lines in file1 that are also in file2.
You can add options to make a better matching:
grep -wFf file1 file2
-w matches words
-F matches fixed strings (no regex).
Examples
$ cat a
hello
how are
you
I am fine areare
$ cat b
hel
are
$ grep -f b a
hello
how are
I am fine areare
$ grep -wf b a
how are

Complex changes to a URL with sed

I am trying to parse an RSS feed on the Linux command line which involves formatting the raw output from the feed with sed.
I currently use this command:
feedstail -u http://www.heise.de/newsticker/heise-atom.xml -r -i 60 -f "{published}> {title} {link}" | sed 's/^\(.\{3\}\)\(.\{13\}\)\(.\{6\}\)\(.\{3\}\)\(.*\)/\1\3\5/'
This gives me a number of feed items per line that look like this:
Sat 20:33 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/newsticker/meldung/WhatsApp-Ausfall-Server-Probleme-blockieren-Messaging-Dienst-2121664.html/from/atom10?wt_mc=rss.ho.beitrag.atom
Notice the long URL at the end. I want to shorten this to better fit on the command line. Therefore, I want to change my sed command to produce the following:
Sat 20:33 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/-2121664
That means cutting everything out of the URL except a dash and that seven digit number preceeding the ".html/blablabla" bit.
Currently my sed command only changes stuff in the date bit. It would have to leave the title and start or the URL alone and then cut stuff out of it until it reaches the seven digit number. It needs to preserve that and then cut everything after it out. Oh yeah, and we need to leave a dash right in front of that number too.
I have no idea how to do that and can't find the answer after hours of googling. Help?
EDIT:
This is the raw output of a line of feedstail -u http://www.heise.de/newsticker/heise-atom.xml -r -i 60 -f "{published}> {title} {link}", in case it helps:
Sat, 22 Feb 2014 20:33:00 GMT> WhatsApp-Ausfall: Server-Probleme blockieren Messaging-Dienst http://www.heise.de/newsticker/meldung/WhatsApp-Ausfall-Server-Probleme-blockieren-Messaging-Dienst-2121664.html/from/atom10?wt_mc=rss.ho.beitrag.atom
EDIT 2:
It seems I can only pipe that output into one command. Piping it through multiple ones seems to break things. I don't understand why ATM.

Unfortunately (for me), I could only think of solving this with extended regexp syntax (either -E or -r flag on different systems):
... | sed -E 's|(://[^/]+/).*(-[0-9]+)\.html/.*|\1\2|'
UPDATE: In basic regexp syntax, the best I can do is
... | sed 's|\(://[^/]*/\).*\(-[0-9][0-9]*\)\.html/.*|\1\2|'

The key to writing this sort of regular expression is to be very careful about what the boundaries of what you expect are, so as to avoid the random gunk that you want to get rid of causing you problems. Also, you should bear in mind that you can use characters other than / as part of a s operation's delimiters.
sed 's!\(http://www\.heise\.de/\)newsticker/meldung/[^./]*\(-[0-9]+\)\.html[^ ]*!\1\2!'
Be aware that getting the RE right can be quite tricky; assume you'll need to test it! (This is a key part of the “now you have two problems” quote; REs very easily become horrendous.)

Something like this maybe?
... | awk -F'[^0-9]*' '{print "http://www.heise.de/-"$2}'

This might work for you (GNU sed):
sed 's|\(//[^/]*/\).*\(-[0-9]\{7\}\).*|\1\2|' file
You can place the first sed command so:
feedstail -u http://www.heise.de/newsticker/heise-atom.xml -r -i 60 -f "{published}> {title} {link}" |
sed 's/^\(.\{3\}\)\(.\{13\}\)\(.\{6\}\)\(.\{3\}\)\(.*\)/\1\3\5/;s|\(//[^/]*/\).*\(-[0-9]\{7\}\).*|\1\2|'

a simple sed script displaying only changed lines

How could I make a separate sed script (let's call it script.sed) that would display only the changed lines without having to use the -n option while executing it? (Sorry for my English)
I have a file called data2.txt with digits and I need to change the lines ending with ".5" and print those changed lines out in the console.
I know how to do it with a single command (sed -n 's/.5$//gp' data2.txt), however our university professor requires us to do the same using sed -f script.sed data2.txt command.
Any ideas?

The following should work for your sed script:
s/.5$//gp
d
The -n option will suppress automatic printing of the line, the other way to do that is to use the d command. From man page:
d Delete pattern space. Start next cycle.
This works because the automatic printing of the line happens at the end of a cycle, and using the d command means you never reach the end of a cycle so no lines are printed automatically.

This might work for you (GNU sed):
#n
s/.5$//p
Save this to a file and run as:
sed -f file.sed file.txt

Sed command find and replace in even lines of a file

Hi I am new to this forum. I want to use SED to replace an expression on even lines of a file. My problem is that I cannot think f how to save the changes in the original file (i.e, how to overwrite the changes in the file). I have tried with :
sed -n 'n;p;' filename | sed 's/aaa/bbb/'
but this does not save the changes. I appreciate your help on this.

Try :
sed -i '2~2 s/aaa/bbb/' filename
The -i option tells sed to work in place, so not to write the edited version to stout and leave the original file be, but to apply the changes to the file. The 2~2 portion is the address for the lines sed should apply the commands. 2~2 means edit only even lines. 1~2 would edit only odd lines. 5~6 would edit every fifth line, starting at line 5 etc...

#Mithrandir's answer is an excellent, correct and complete one.
I will just add that the m~n addressing method is a GNU sed extension that may not work everywhere. For example, not all Macs have GNU sed, as well as *BSD systems may not have it either.
So, if you have a file like the following one:
$ cat f
1 ab
2 ad
3 ab
4 ac
5 aa
6 da
7 aa
8 ad
9 aa
...here is a more universal solution:
$ sed '2,${s/a/#A#/g;n}' f
1 ab
2 #A#d
3 ab
4 #A#c
5 aa
6 d#A#
7 aa
8 #A#d
9 aa
What does it do? The address of the command is 2,$, which means it will be applied to all lines between the second one (2) and the last one ($). The command in fact are two commands, treated as one because they are grouped by brackets ({ and }). The first command is the replacement s/a/#A#/g. The second one is the n command, which gets, in the current iteration, the next line, appends it to the current pattern space. So the current iteration will print the current line plus the next line, and the next iteration will process the next next line. Since I started it at the 2nd line, I am doing this process at each even line.
Of course, since you want to update the original file, you should call it with the -i flag. I would note that some of those non-GNU seds require you to give a parameter to the -i flag, which will an extension to be append to a file name. This file name is the name of a generated backup file with the old content. (So, if you call, for example, sed -i.bkp s/a/b/ myfile.txt the file myfile.txt will be altered, but another file, called myfile.txt.bkp, will be created with the old content of myfile.txt.) Since a) it is required in some places and b) it is accepted in GNU sed and c) it is a good practice nonetheless (if something go wrong, you can reuse the backup), I recommend to use it:
$ ls
f
$ sed -i.bkp '2,${s/a/#A#/g;n}' f
$ ls
f f.bkp
Anyway, my answer is just a complement for some specific scenarios. I would use #Mithrandir's solution, even because I am a Linux user :)

This might work for you:
sed -i 'n;s/aaa/bbb/' file

Use sed -i to edit the file in place.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js