How to remove nonnumeric junk from a file - regex

Here's an output from less:
487451
487450<A3><BA>1<A3><BA>1
487449<A3><BA>1<A3><BA>1
487448<A3><BA>1<A3><BA>1
487447<A3><BA>1<A3><BA>1
487446<A3><BA>1<A3><BA>1
487445<A3><BA>1<A3><BA>1
484300<A3><BA>1<A3><BA>1
484299<A3><BA>1<A3><BA>1
484297<A3><BA>1<A3><BA>1
484296<A3><BA>1<A3><BA>1
484295<A3><BA>1<A3><BA>1
484294<A3><BA>1<A3><BA>1
484293<A3><BA>1<A3><BA>1
483496
483495
483494
483493
483492
483491
I see a bunch of nonprintable characters here. How do I remove them using sed/tr?
My try was 's/\([0-9][0-9]*\)/\1/g', but it doesn't work.
EDIT: Okay, let's go further down the source. The numbers are extracted from this file:
487451"><img src="Manage/pic/20100901/Adidas running-429.JPG" alt="Adidas running-429" height="120" border="0" class="BK01" onload='javascript:if(this.width>160){this.width=160}' /></a></td>
487450"><img src="Manage/pic/20100901/Adidas fs 1<A3><BA>1-060.JPG" alt="Adidas fs 1<A3><BA>1-060" height="120" border="0" class="BK01" onload='javascript:if(this.width>160){this.width=160}' /></a></td>
The first line is perfectly normal and what most of the lines are. The second is "corrupted". I'd just like to extract the number at the beginning (using 's/\([0-9][0-9]*\).*/\1/g', but somehow the nonprintables get into the regex, which should stop at ".
EDIT II: Here's a clarification: There are no brackets in the text file. These are character codes of nonprintable characters. The brackets are there because I copied the file from less. Mac's Terminal, on the other hand, uses ?? to represent such characters. I bet xterm on my Ubuntu would print that white oval with a question mark.

Classic job for either sed's or Unix's tr command.
sed 's/[^0-9]//g' $file
(Anything that is not a digit - or newline - is deleted.)
tr -cd '0-9\012' < $file > $file.1
Delete (-d) the complement (-c) of the digits and newline...

You missed the bit where you match the rest of the line.
sed 's/\([0-9][0-9]*\)[^0-9]*/\1/g'
^^^^^^^

Try this sed command:
sed 's/^\([0-9][0-9]*\).*$/\1/' file.txt
OUTPUT (running above command on the input file you provided)
487451
487450
487449
487448
487447
487446
487445
484300
484299
484297
484296
484295
484294
484293
483496
483495
483494
483493
483492
483491

If you know the crap will always be inside brackets, why not delete that crap?
sed 's/<[^>]*>//g'
EDIT: Thanks, Mike that makes sense. In that case, how about:
sed 's/([0-9]+).*/\1/g'

If the data always is like the sample, deleting from the less-than to the end of the line would work fine.
sed -i "s/<.*$//" file

Related

Finding strings across lines and replace with nothing

I have some 'fastq' format DNA sequence files (basically just text files) like this:
#Sample_1
ACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTG
+
BBBBBBBBBBBBEEEEEEEEEEEEEEEE
EHHHHKKKKKKKKKKKKKKNQQTTTTTT
#
+
#
+
#Sample_4
ACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTG
+
BBBBBBBBBBBBEEEEEEEEEEEEEEEE
EHHHHKKKKKKKKKKKKKKNQQTTTTTT
My ultimate goal is to turn these into 'fasta' format files, but to do that I need to get rid of the two empty sequences in the middle.
EDIT
The desired output would look like this:
#Sample_1
ACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTG
+
BBBBBBBBBBBBEEEEEEEEEEEEEEEE
EHHHHKKKKKKKKKKKKKKNQQTTTTTT
#Sample_4
ACTGACTGACTGACTGACTGACTGACTG
ACTGACTGACTGACTGACTGACTGACTG
+
BBBBBBBBBBBBEEEEEEEEEEEEEEEE
EHHHHKKKKKKKKKKKKKKNQQTTTTTT
All of the dedicated software I tried (Biopython, stand alone programs, perl scripts posted by others) crash at the empty sequences. This is really just a problem of searching for the string #\n+ and replacing it with nothing. I googled this and read several posts and tried about a million options with sed and couldn't figure it out. Here are some things that didn't work:
sed s/'#'/,/'+'// test.fastq > test.fasta
sed s/'#,+'// test.fastq > test.fasta
Any insights would be greatly appreciated.
PS. I've got a Mac.
Try:
sed "/^[#+]*$/d" test.fastq > test.fasta
The /d option tells sed to "delete" the matching line (i.e. not print it).
^ and $ mean "start of string" and "end of string" respectively, i.e. the line must be an exact match.
So, the above command basically says:
Print all lines that do not only contain # or +, and write the result to test.fasta.
Edit: I misunderstood the question slightly, sorry. If you want to only remove pairs of consecutive lines like
#
+
then you need to perform a multi-line search and replace.
Although this can be done with sed, it's perhaps easier to use something like a perl script instead:
perl -0pe 's/^#\n\+\n//gm' test.fastq > test.fasta
The -0 option turns Perl into "file slurp" mode, where Perl reads the entire input file in one shot (instead of line by line). This enables multi-line search and replace.
The -pe option allows you to run Perl code (pattern matching and replacement in this case) and display output from the command line.
^#\n\+\n is the pattern to match, which we are replacing with nothing (i.e. deleting).
/gm makes the substitution multiline and global.
You could also instead pass -i as the first parameter to perl, to edit the file inline.
This may not be the most elegant solution in the world, but you can use tr to replace the \n with a null character and back.
cat test.fastq | tr '\n' '\0' | sed 's/#\x0+\x0//g' | tr '\0' '\n' > test.fasta
Try this:
sed '/^#$/{N;/\n+$/d}' file
When # is found, next line is appended to the pattern space with N.
If $ is found in next line, the d command deletes both lines.

Grep invert on string matched, not line matched

I'll keep this explanation of why I need help to a mimimum. One of my file directories got hacked through XSS and placed a long string at the beginning of all php files. I've tried to use sed to replace the string with nothing but it won't work because the pattern to match includes many many characters that would need to be escaped.
I found out that I can use fgrep to match a fixed string saved in a pattern file, but I'd like to replace the matched string (NOT THE LINE) in each file, but grep's -v inverts the result on the line, rather than the end of the matched string.
This is the command I'm using on an example file that contains the hacked
fgrep -v -f ~/hacked-string.txt example.php
I need the output to contain the <?php that's at the end of the line (sometimes it's a <style> tag), but the -v option inverts at the end of that line, so the output doesn't contain the <?php at the beginning.
NOTE
I've tried to use the -o or --only-matching which outputs nothing instead:
fgrep -f ~/hacked-string.txt example.php --only-matching -v
Is there another option in grep that I can use to invert on the end of the matched pattern, rather than the line where the pattern was matched? Or alternatively, is there an easier option to replace the hacked string in all .php files?
Here is a small snippet of what's in hacked-string.txt (line breaks added for readability):
]55Ld]55#*<%x5c%x7825bG9}:}.}-}!#*<%x55c%x7825)
dfyfR%x5c%x7827tfs%x5c%x7c%x785c%x5c%x7825j:^<!
%x5c%x7825w%x5c%x7860%x5c%x785c^>Ew:25tww**WYsb
oepn)%x5c%x7825bss-%x5c%x7825r%x5c%x7878B%x5c%x
7825h>#]y3860msvd},;uqpuft%x5c%x7860msvd}+;!>!}
%x5c%x7827;!%x5c%x7825V%x5c%x7827{ftmfV%x5e56+9
9386c6f+9f5d816:+946:ce44#)zbssb!>!ssbnpe_GMFT%
x5c5c%x782f#00#W~!%x5c%x7825t2w)##Qtjw)#]82#-#!
#-%x5c%x7825tmw)%x5c%x78w6*%x5c%x787f_*#fubfsdX
k5%x5c%xf2!>!bssbz)%x5c%x7824]25%x5c%x7824-8257
-K)fujs%x5c%x7878X6<#o]o]Y%x5c%x78257;utpI#7>-1
-bubE{h%x5c%x7825)sutcvt)!gj!|!*bubEpqsut>j%x5c
%x7825!*72!%x5c%x7827!hmg%x5c%x78225>2q%x5c%x7
Thanks in advance!
I think what you are asking is this:
"Is it possible to use the grep utility to remove all instances of a fixed string (which might contain lots of regex metacharacters) from a file?"
In that case, the answer is "No".
What I think you wanted to ask was:
"What is the easiest way to remove all instances of a fixed string (which might contain lots of regex metacharacters) from a file?"
Here's one reasonably simple solution:
delete_string() {
awk -v s="$the_string" '{while(i=index($0,s))$0=substr($0,1,i-1)substr($0,i+length(s))}1'
}
delete_string 'some_hideous_string_with*!"_inside' < original_file > new_file
The shell syntax is slightly fragile; it will break if the string contains an apostrophe ('). However, you can read a raw string from stdin into a variable with:
$ IFS= read -r the_string
absolutely anything here
which will work with any string which doesn't contain a newline or a NUL character. Once you have the string in a variable, you can use the above function:
delete_string "$the_string" < original_file > new_file
Here's another possible one liner, using python:
delete_string() {
python -c 'import sys;[sys.stdout.write(l.replace(r"""'"$1"'""","")) for l in sys.stdin]'
}
This won't handle strings which have three consecutive quotes (""").
Is the hacked string the same in every file?
If the length of hacked string in chars was 1234 then you can use
tail -c +1235 file.php > fixed-file.php
for each infected file.
Note that tail c +1235 tells to start output at 1235th character of the input file.
With perl:
perl -i.hacked -pe "s/\Q$(<hacked-string.txt)\E//g" example.php
Notes:
The $(<file) bit is a bash shortcut to read the contents of a file.
The \Q and \E bits are from perl, they treat the stuff in between as plain characters, ignoring regex metachars.
The -i.hacked option will edit the file in-place, creating a backup "example.php.hacked"

sed match pattern \tTEXT\t not working

I use the following command on a huge text file
sed 's/\tEN-GB\t//g' "/home/ubuntu/0214/corpus/C.txt"
The file contains a [tab]EN-GB[tab] in each row, but what I get is the original text. I cannot figure out why.
NOTE: when I'm using 's/\t//g' it works and the resulting string is [a lot of no-tabs]EN-GB[a lot of no-tabs] in each row, so the tabs vanished.
UPDATE: Here is the incriminated part of the output from cat -vet:
^#2^#0^#0^#7^#0^#1^#0^#4^#~^#1^#6^#3^#2^#4^#3^#^I^#^I^#0^#^I^#E^#N^#-^#G^#B^#^I^#T^#h^#e^# ^#a^#d^#m^#i^#n^#i^#s^#t^#
I'm out of black magic... thanks in advance
It appears that your sed command is correct but you have some null characters in your text file
Run this sed command to remove nulls first:
sed -i.bak 's/\x0//g; s/\tEN-GB\t//g' "/home/ubuntu/0214/corpus/C.txt"
You can use ANSI-C quoting to represent the TAB character:
sed 's/'$'\tEN-GB\t''//g' filename
EDIT: The output of cat -vet suggests that you have NULL characters in your input. Remove those before piping the results to the above command. Say:
tr -d '\x0' < filename | sed 's/'$'\tEN-GB\t''//g'

VIM search and insert newline if newline character found in middle of a line

This has been driving me nuts for a while now. Actually, make that all damn night!!!
I have a file of contact information that is now only slightly mangled that I'm trying to fix in VIM. The file has 13k lines so I really, REALLY don't want to have to manually fix this in the file. I believe that the final issue I am having is that for some dumb reason many of the lines run together with the next line. This is a csv file with newline characters separating the lines. Unfortunately, there are maybe a couple of hundred lines where there is a newline character in the middle of should be two lines.
Here is an example of the file where this is occurring:
Freddy,Bauhof,fabaof#garbage.net,16126 Garbage Drive,Spring,TX,77 379,5555550440,M,1/1/14,14:23:57,256.241.24.29^#
Natasha,Moore,ndivy#garbage.com,3715 Garbage Rd,Louisville,KY,40218,5555553358,F,1/1/14,3:12:09,74.256.182.12^#MaryAnn,Haase,mahase#garbage.net,303 N Garbage Rd,Norfolk,NE,68701,5555559031,M,12/31/13,7:20:21,69.256.211.147^#
Jonathan,Golden,jongolden#garbage.com,11 Garbage Dr,GlenHead,NY,115 45,5555556712,M,1/1/14,17:28:09,256.195.83.118^#
What I am trying to do is simply insert a newline/carraige return/whatever that will actually break the middle line into two after "12^#" and immediately before "MaryAnn". Thanks in advance for any help with this.
Run this in vim :%s/\^#\(\S\)/^#\r\1/g.
It inserts a carriage return between ^# and not whitespace.
Foo^#
Bar^#Baz^#
Qux^#
Becomes
Foo^#
Bar^#
Baz^#
Qux^#
in vim:
this line would work:
%s/\^#\zs\ze\S/\r/g
if you are open to shell, grep could help you:
kent$ cat f
foo^#
foo2^#bar^#baz^#
blah^#
kent$ grep -oP '.*?\^#' f
foo^#
foo2^#
bar^#
baz^#
blah^#
sed too:
kent$ sed -r 's/\^#(.)/^#\n\1/g' f
foo^#
foo2^#
bar^#
baz^#
blah^#

using sed to copy lines and delete characters from the duplicates

I have a file that looks like this:
#"Afghanistan.png",
#"Albania.png",
#"Algeria.png",
#"American_Samoa.png",
I want it to look like this
#"Afghanistan.png",
#"Afghanistan",
#"Albania.png",
#"Albania",
#"Algeria.png",
#"Algeria",
#"American_Samoa.png",
#"American_Samoa",
I thought I could use sed to do this but I can't figure out how to store something in a buffer and then modify it.
Am I even using the right tool?
Thanks
You don't have to get tricky with regular expressions and replacement strings: use sed's p command to print the line intact, then modify the line and let it print implicitly
sed 'p; s/\.png//'
Glenn jackman's response is OK, but it also doubles the rows which do not match the expression.
This one, instead, doubles only the rows which matched the expression:
sed -n 'p; s/\.png//p'
Here, -n stands for "print nothing unless explicitely printed", and the p in s/\.png//p forces the print if substitution was done, but does not force it otherwise
That is pretty easy to do with sed and you not even need to use the hold space (the sed auxiliary buffer). Given the input file below:
$ cat input
#"Afghanistan.png",
#"Albania.png",
#"Algeria.png",
#"American_Samoa.png",
you should use this command:
sed 's/#"\([^.]*\)\.png",/&\
#"\1",/' input
The result:
$ sed 's/#"\([^.]*\)\.png",/&\
#"\1",/' input
#"Afghanistan.png",
#"Afghanistan",
#"Albania.png",
#"Albania",
#"Algeria.png",
#"Algeria",
#"American_Samoa.png",
#"American_Samoa",
This commands is just a replacement command (s///). It matches anything starting with #" followed by non-period chars ([^.]*) and then by .png",. Also, it matches all non-period chars before .png", using the group brackets \( and \), so we can get what was matched by this group. So, this is the to-be-replaced regular expression:
#"\([^.]*\)\.png",
So follows the replacement part of the command. The & command just inserts everything that was matched by #"\([^.]*\)\.png", in the changed content. If it was the only element of the replacement part, nothing would be changed in the output. However, following the & there is a newline character - represented by the backslash \ followed by an actual newline - and in the new line we add the #" string followed by the content of the first group (\1) and then the string ",.
This is just a brief explanation of the command. Hope this helps. Also, note that you can use the \n string to represent newlines in some versions of sed (such as GNU sed). It would render a more concise and readable command:
sed 's/#"\([^.]*\)\.png",/&\n#"\1",/' input
I prefer this over Carles Sala and Glenn Jackman's:
sed '/.png/p;s/.png//'
Could just say it's personal preference.
or one can combine both versions and apply the duplication only on lines matching the required pattern
sed -e '/^#".*\.png",/{p;s/\.png//;}' input