sed match pattern \tTEXT\t not working - regex

I use the following command on a huge text file
sed 's/\tEN-GB\t//g' "/home/ubuntu/0214/corpus/C.txt"
The file contains a [tab]EN-GB[tab] in each row, but what I get is the original text. I cannot figure out why.
NOTE: when I'm using 's/\t//g' it works and the resulting string is [a lot of no-tabs]EN-GB[a lot of no-tabs] in each row, so the tabs vanished.
UPDATE: Here is the incriminated part of the output from cat -vet:
^@2^@0^@0^@7^@0^@1^@0^@4^@~^@1^@6^@3^@2^@4^@3^@^I^@^I^@0^@^I^@E^@N^@-^@G^@B^@^I^@T^@h^@e^@ ^@a^@d^@m^@i^@n^@i^@s^@t^@
I'm out of black magic... thanks in advance

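It appears that your sed command is correct, but your text file contains null characters between the visible ones, so the tab and EN-GB are never adjacent.
Run this sed command (GNU sed) to remove the nulls first; -i.bak edits the file in place and keeps a backup with the .bak suffix:
sed -i.bak 's/\x00//g; s/\tEN-GB\t//g' "/home/ubuntu/0214/corpus/C.txt"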

You can use ANSI-C quoting to represent the TAB character:
sed 's/'$'\tEN-GB\t''//g' filename
EDIT: The output of cat -vet suggests that you have NUL characters in your input. Remove them before piping the result to the above command. Note that tr understands octal escapes, not \x hex escapes. Say:
tr -d '\000' < filename | sed 's/'$'\tEN-GB\t''//g'
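To see why the NULs matter, here is a minimal sketch. The sample data is made up for illustration: every character is followed by a NUL byte, similar to the cat -vet output in the question. A literal tab is stored in a variable so the pattern does not depend on sed understanding \t.

```shell
# Hypothetical sample: each visible character is followed by a NUL byte.
tmp=$(mktemp)
printf '0\000\t\000E\000N\000-\000G\000B\000\t\000T\000h\000e\000\n\000' > "$tmp"

tab=$(printf '\t')   # a literal tab character

# Substituting first does nothing: a NUL sits between the tab and "E",
# so the pattern tab+EN-GB+tab never matches.
unchanged=$(sed "s/${tab}EN-GB${tab}//g" < "$tmp" | tr -d '\000')

# Stripping the NULs first lets the pattern match.
cleaned=$(tr -d '\000' < "$tmp" | sed "s/${tab}EN-GB${tab}//g")

printf 'substitute first: %s\n' "$unchanged"
printf 'strip NULs first: %s\n' "$cleaned"
rm -f "$tmp"
```

The order of the two steps is the whole point: the substitution only works once the NUL bytes are gone.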

Related

Adding a line using sed

Can't seem to find the right way to do this, despite checking my regex in a reg checker.
Given a text file containing, amongst others, this entry:
zone "example.net" {
type master;
file "/etc/bind/zones/db.example.net";
allow-transfer { x.x.x.x;y.y.y.y; };
also-notify { x.x.x.x;y.y.y.y; };
};
I want to add lines after the also-notify line, for that domain specifically.
So using this sed command string:
sed '/"example\.net".*?also-notify.*?};/a\nxxxxxxx/s' named.conf.local
I thought this should add 'xxxxxxx' after the line. But nope. What am I doing wrong?
With POSIX sed, you can use the a for append command with an escaped literal new line:
$ sed '/^[[:blank:]]*also-notify/ a\
NEW LINE' file
With GNU sed, a is slightly more natural since the new line is assumed:
$ gsed '/^[[:blank:]]*also-notify/ a NEW LINE' file
The sed command in your example has two problems.
The first is that sed matches a regex against one line at a time by default, so it cannot span several lines the way "example\.net".*?also-notify.*? tries to. That is more of a Perl-style match. Instead, use a range address that starts at the zone line, as in:
$ sed '/"example\.net/,/also-notify/{
/^[[:blank:]]*also-notify/ a\
NEW LINE
}' file
The second issue is the \n in the appended text. With POSIX sed, \n is not supported in the text of the a command. With GNU sed, the newline after a is assumed, so a \n immediately after the a is out of context and is interpreted as an escaped literal n. You can use \n with GNU sed after the first character of the text, but not immediately after the a. In POSIX sed, leading spaces of the appended line are always stripped.
The following awk command may help with this.
awk -v new_lines="new_line here" '/"example\.net"/{flag=1} {print} flag && /also-notify/{print new_lines; flag=0}' Input_file
It prints every line unchanged and, once inside the example.net zone, prints new_lines right after the also-notify line. If you want to edit Input_file itself, append > temp_file && mv temp_file Input_file to the above command. Here new_lines is an awk variable; you could also print the new lines directly as a string.
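As a rough sketch of the awk approach restricted to one zone, the following runs against a made-up two-zone file. The example.org zone, the variable names, and the NEW LINE text are assumptions for illustration.

```shell
# Hypothetical config with two zones; only example.net should change.
conf=$(mktemp)
cat > "$conf" <<'EOF'
zone "example.org" {
also-notify { x.x.x.x; };
};
zone "example.net" {
type master;
also-notify { x.x.x.x;y.y.y.y; };
};
EOF

result=$(awk -v new_line="NEW LINE" '
  /^zone / { in_zone = ($0 ~ /"example\.net"/) }   # remember whether this is the target zone
  { print }                                        # always keep the original line
  in_zone && /also-notify/ { print new_line }      # append only inside the target zone
' "$conf")

printf '%s\n' "$result"
rm -f "$conf"
```

Resetting in_zone on every zone line keeps the match from leaking into later zones.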
You're pretty close already. Just use a range (/pattern/,/pattern/{ #commands }) to select the text you want to operate on, and then use /pattern/a\ followed by the new text on the next line to add the line you want.
/"example\.net"/,/also-notify/{
/also-notify/a\
\ this is the text I want to add.
}
sed trims leading space on text to be appended. Adding a backslash \ at the start of the line prevents this.
In Bash, this would look something like:
sed -e '/"example\.net"/,/also-notify/{
/also-notify/a\
\ this is the text I want to add.
}' named.conf.local
Also note that sed uses an older dialect of regular expressions that doesn't support non-greedy quantifiers like *?.
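To check that the range really limits the change to one zone, here is a small sketch run against a made-up file with a second zone. The example.org zone and the NEW LINE text are assumptions for illustration.

```shell
# Hypothetical config with two zones, each with its own also-notify line.
conf=$(mktemp)
cat > "$conf" <<'EOF'
zone "example.org" {
also-notify { x.x.x.x; };
};
zone "example.net" {
type master;
also-notify { x.x.x.x;y.y.y.y; };
};
EOF

# The range starts at the example.net line and ends at the next also-notify
# line; the a command inside the braces only fires within that range.
result=$(sed '/"example\.net"/,/also-notify/{
/also-notify/a\
NEW LINE
}' "$conf")

printf '%s\n' "$result"
rm -f "$conf"
```

The also-notify line of the first zone lies outside the range, so nothing is appended there.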

Remove the data before the second repeated specified character in linux

I have a text file which has some below data:
AB-NJCFNJNVNE-802ac94f09314ee
AB-KJNCFVCNNJNWEJJ-e89ae688336716bb
AB-POJKKVCMMMMMJHHGG-9ae6b707a18eb1d03b83c3
AB-QWERTU-55c3375fb1ee8bcd8c491e24b2
I need to remove the data before the second hyphen (-) and produce another text file with the below output:
802ac94f09314ee
e89ae688336716bb
9ae6b707a18eb1d03b83c3
55c3375fb1ee8bcd8c491e24b2
I am pretty new to linux and trying sed command with unsuccessful attempts for the last couple of hours. How can I get the desired output with sed or any other useful command like awk?
You can use a simple cut call:
$ cat myfile.txt | cut -d"-" -f3- > myoutput.txt
Edit:
Some explanation, as requested in the comments:
cut breaks up a string of text to fields according to a given delimiter.
-d defines the delimiter, - in this case.
-f defines which fields to output. In this case, we want to eliminate everything before the second hyphen, or, in other words, return the third field and onwards (3-).
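The rest of the command is just plumbing: cat feeds the file into cut, and the result is redirected to an output file.
Or, using sed with a pattern that stops at the second hyphen (so any hyphens later in the line are kept):
sed -e 's/^[^-]*-[^-]*-//' myfile.txt > myoutput.txt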
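The sketch below compares cut with a sed pattern that removes exactly two hyphen-delimited fields. The third input line is a made-up case with an extra hyphen, added to show that both commands keep everything after the second hyphen.

```shell
# Two lines from the question plus one hypothetical line with a third hyphen.
in=$(mktemp)
cat > "$in" <<'EOF'
AB-NJCFNJNVNE-802ac94f09314ee
AB-QWERTU-55c3375fb1ee8bcd8c491e24b2
AB-XYZ-12ab-34cd
EOF

# cut: print field 3 and everything after it, using - as the delimiter.
with_cut=$(cut -d- -f3- "$in")

# sed: delete "non-hyphens, hyphen" twice from the start of the line.
with_sed=$(sed 's/^[^-]*-[^-]*-//' "$in")

printf '%s\n' "$with_cut"
rm -f "$in"
```

A greedy pattern such as ^.*- would instead strip up to the last hyphen, which only matches the desired output when each line has exactly two hyphens.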

How can I remove all characters in each line after the first space in a text file?

I have a large log file from which I need to extract file names.
The file looks like this:
/path/to/loremIpsumDolor.sit /more/text/here/notAlways/theSame/here
/path/to/anotherFile.ext /more/text/here/differentText/here
.... about 10 million times
I need to extract the file names like this:
loremIpsumDolor.sit
anotherFile.ext
I figure my first strategy is to find/replace all /path/to/ with ''. But I'm stuck how to remove all characters after the space.
Can you help?
sed 's/ .*//' file
It doesn't take any more. The transformed output appears on standard output, of course.
In theory, you could also use awk to grab the filename from each line as:
awk '{ print $1 }' input_file.log
That, of course, assumes that there are no spaces in any of the filenames. awk defaults to looking for whitespace as the field delimiters, so the above snippet would take the first "field" from your log file (your filename) for each line, and output it.
Pass it to cut:
cut '-d ' -f1 yourfile
A bash-only solution (read -r keeps backslashes intact, and the quotes protect unusual characters in the name):
while read -r path otherstuff; do
echo "${path##*/}"
done < filename
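Since the goal is the bare file name, the two steps (drop everything after the first space, then drop the directory part) can be combined in a single pass. This is a sketch using the two sample lines from the question; it assumes, as the answers above do, that file names contain no spaces.

```shell
log=$(mktemp)
cat > "$log" <<'EOF'
/path/to/loremIpsumDolor.sit /more/text/here/notAlways/theSame/here
/path/to/anotherFile.ext /more/text/here/differentText/here
EOF

# sed: first remove from the first space to the end of the line,
# then remove everything up to and including the last slash.
names_sed=$(sed 's/ .*//; s|.*/||' "$log")

# awk: take the first whitespace-separated field, split it on "/",
# and print the last piece.
names_awk=$(awk '{ n = split($1, parts, "/"); print parts[n] }' "$log")

printf '%s\n' "$names_sed"
rm -f "$log"
```

For a file with millions of lines, a single sed or awk process like this is likely to be much faster than a shell read loop.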

How to rejoin words that are split across lines with a hyphen in a text file

OCR texts often have words that flow from one line to the next, with a hyphen at the end of the first line (i.e., the word has '-\n' inserted in it).
I would like to rejoin all such split words in a text file (in a Linux environment).
I believe this should be possible with sed or awk, but the syntax for these is dark magic to me! I knew a text editor in windows that did regex search/replace with newlines in the search expression, but am unaware of such in linux.
Make sure to back up ocr_file before running as this command will modify the contents of ocr_file:
perl -i~ -e 'BEGIN{$/=undef} ($f=<>) =~ s#-\s*\n\s*(\S+)#$1\n#mg; print $f' ocr_file
This answer is relevant, because I want the words joined together... not just a removal of the dash character.
cat file| perl -CS -pe's/-\n//'|fmt -w52
is the short answer, but uses fmt to reform paragraphs after the paragraphs were mangled by perl.
without fmt, you can do
#!/usr/bin/perl
use open qw(:std :utf8);
undef $/; $_=<>;
s/-\n(\w+\W+)\s*/$1\n/sg;
print;
Also, if you're doing OCR, you can use this perl one-liner to convert Unicode hyphens and dashes to the ASCII dash character. Note the -CS option, which tells perl to treat the standard streams as UTF-8.
# U+2010 through U+2015 (Unicode hyphens and dashes) to ASCII dash
perl -CS -pe 'tr/\x{2010}\x{2011}\x{2012}\x{2013}\x{2014}\x{2015}/-/'
cat file | perl -p -e 's/-\n//'
If the file has windows line endings, you'll need to catch the cr-lf with something like:
cat file | perl -p -e 's/-\s\n//'
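For comparison, the same line-joining idea can be sketched in awk. The sample text is made up, and this simple version glues the whole following line onto the hyphenated one, like the short perl one-liner does, rather than moving only the rest of the word.

```shell
# Hypothetical OCR output with two words split across lines.
ocr=$(mktemp)
printf 'the quick bro-\nwn fox jumps over the la-\nzy dog\nno hyphen here\n' > "$ocr"

joined=$(awk '
  /-$/ { sub(/-$/, ""); printf "%s", $0; next }   # drop the trailing hyphen and keep the line open
  { print }                                       # any other line is printed as usual
' "$ocr")

printf '%s\n' "$joined"
rm -f "$ocr"
```

Like the perl versions, this cannot tell a line-break hyphen from a real one, so a compound word that happens to break at its hyphen loses it.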
Hey this is my first answer post, here goes:
'-\n' I suspect are the line-feed characters. You can use sed to remove these. You could try the following as a test:
1) create a test file:
echo "hello this is a test -\n" > testfile
2) check the file has the expected contents:
cat testfile
3) test the sed command, this sends the edited text stream to standard out (ie your active console window) without overwriting anything:
sed 's/-\\n//g' testfile
(you should just see 'hello this is a test ' printed to the console, without the '-\n')
If I build up the command:
a) First off you have the sed command itself:
sed
b) Secondly the expression and sed specific controls need to be in quotations:
sed 'sedcontrols+regex' (the text in quotations isn't what you'll actually enter, we'll fill this in as we go along)
c) Specify the file you are reading from:
sed 'sedcontrols+regex' testfile
d) To delete the string in question, sed needs to be told to substitute the unwanted characters with nothing (an empty string). You use 's' to substitute, then a forward slash, then the unwanted string (more on that in a sec), then another forward slash, then nothing (the replacement), then a final forward slash, and then any flags. In this case I will use 'g', which stands for global and replaces every match on a line rather than only the first. So now we have:
sed 's/regex//g' testfile
e) We need to add in the unwanted string, but it gets confusing: if there is a backslash in your string, it needs to be escaped with another backslash. So, the unwanted string
-\n ends up looking like -\\n
We can output the edited text stream to stdout as follows:
sed 's/-\\n//g' testfile
To save the results without overwriting anything (assuming testfile2 doesn't exist) we can redirect the output to a file:
sed 's/-\\n//g' testfile >testfile2
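One caveat about the walkthrough above: whether echo writes the two characters backslash and n or a real line break depends on the shell. The sketch below uses printf '%s\n' so the test file reliably contains the literal two-character sequence; the file is a temporary one created for illustration.

```shell
testfile=$(mktemp)

# printf '%s\n' writes its argument verbatim, so the line ends with a
# hyphen, a literal backslash, and the letter n, followed by one real newline.
printf '%s\n' 'hello this is a test -\n' > "$testfile"

# -\\n in the sed pattern matches hyphen, backslash, n as ordinary characters.
result=$(sed 's/-\\n//g' "$testfile")

printf '[%s]\n' "$result"
rm -f "$testfile"
```

This pattern only removes the literal text; a hyphen followed by an actual line break is not matched, because sed reads its input one line at a time and the newline is not part of the pattern space.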
sed -z 's/-\n//g' file_with_hyphens

Regular expression to find a line containing certain characters and remove that line

I have text file which has lot of character entries one line after another.
I want to find all lines which start with :: and delete all those lines.
What is the regular expression to do this?
-AD
Regular expressions don't "do" anything. They only match text.
What you want is a tool that uses regular expressions to identify a line and then applies some command to that line.
One such tool is sed (there's also awk and many others). You'd use it like this:
sed -e "/^::/d" < input.txt > output.txt
The part "/^::/" tells sed to apply the following command to all lines that start with "::" and "d" simply means "delete that line".
Or the simplest solution (which my brain didn't produce for some strange reason):
grep -v "^::" input.txt > output.txt
sed -i -e '/^::/d' yourfile.txt
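A quick way to convince yourself that the sed and grep approaches agree is to run both on a small sample. The input below is made up; it includes an indented :: line to show that ^ anchors the match to the very start of the line.

```shell
in=$(mktemp)
printf ':: a comment\nkeep me\n  :: indented, so not at line start\n::another\nlast line\n' > "$in"

# sed: delete every line that starts with ::
with_sed=$(sed '/^::/d' "$in")

# grep: print every line that does NOT start with ::
with_grep=$(grep -v '^::' "$in")

printf '%s\n' "$with_sed"
rm -f "$in"
```

Both commands leave the indented line alone; matching it as well would need a pattern that allows leading whitespace, such as ^[[:blank:]]*::.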
^::.*[\r\n]*
If you're reading the file line-by-line you won't need the [\r\n]* part.
Simple as:
^::
If you don't have sed or grep, find this and replace with empty string:
^::.*[\r\n]
Thanks for the pointers:
The following worked for me. Any character could appear after "::" in the text file, so I used:
^::[a-zA-Z0-9 I put all punctuation symbols here]*$
-AD
Here's my contribution in C#:
Text stream:
string stream = ":: This is a comment line";
Syntax:
Regex commentsExp = new Regex("^::.*", RegexOptions.Multiline);
Usage:
Console.WriteLine(commentsExp.Replace(stream, string.Empty));
Alternatively, if I wanted to simply take a text file that included comments and produce an exact duplicate without the comment lines I could use a simple but effective combination of the type and findstr commandline tools:
type commented.txt | findstr /v /R "^::" > uncommented.txt