Getting rid of all words that contain a special character in a textfile - regex

I'm trying to filter out all the words that contain any character other than a letter from a text file. I've looked around stackoverflow, and other websites, but all the answers I found were very specific to a different scenario and I wasn't able to replicate them for my purposes; I've only recently started learning about Unix tools.
Here's an example of what I want to do:
Input:
#derik I was there and it was awesome! !! http://url.picture.whatever #hash_tag
Output:
I was there and it was awesome!
So words with punctuation can stay in the file (in fact I need them to stay) but any substring with special characters (including those of punctuation) needs to be trimmed away. This can probably be done with sed, but I just can't figure out the regex. Help.
Thanks!

Here is how it could be done using Perl:
perl -ane 'for $f (@F) {print "$f " if $f =~ /^([a-zA-Z-\x27]+[?!;:,.]?|[\d.]+)$/} print "\n"' file
I am using this input text as my test case:
Hello,
How are you doing?
I'd like 2.5 cups of piping-hot coffee.
#derik I was there; it was awesome! !! http://url.picture.whatever #hash_tag
output:
Hello,
How are you doing?
I'd like 2.5 cups of piping-hot coffee.
I was there; it was awesome!
Command-line options:
-n loop around every line of the input file, do not automatically print it
-a autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace
-e execute the perl code
The perl code splits each input line into the @F array, then loops over every field $f and decides whether or not to print it.
At the end of each line, it prints a newline character.
The regular expression ^([a-zA-Z-\x27]+[?!;:,.]?|[\d.]+)$ is used on each whitespace-delimited word
^ starts with
[a-zA-Z-\x27]+ one or more lowercase or capital letters or a dash or a single quote (\x27)
[?!;:,.]? zero or one of the following punctuation: ?!;:,.
(|) alternately match
[\d.]+ one or more numbers or .
$ end
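For anyone who wants to poke at the same per-word test outside Perl, here is a minimal Python sketch. The name keep_words is made up for illustration, and the pattern is only a transcription of the regex explained above, not a drop-in replacement for the one-liner.

```python
import re

# Transcription of the Perl pattern: letters, dashes or apostrophes with at
# most one trailing punctuation mark, or a run of digits and periods.
WORD = re.compile(r"[a-zA-Z'-]+[?!;:,.]?|[\d.]+")

def keep_words(line):
    """Keep only whitespace-delimited words that match WORD in full."""
    # fullmatch plays the role of the ^...$ anchors in the Perl regex
    return " ".join(w for w in line.split() if WORD.fullmatch(w))

print(keep_words("#derik I was there; it was awesome! !! http://url.picture.whatever #hash_tag"))
```

Splitting on whitespace first mirrors what -a does in the Perl command, so the regex only ever sees one word at a time.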

Your requirements aren't clear at all but this MAY be what you want:
$ awk '{rec=sep=""; for (i=1;i<=NF;i++) if ($i~/^[[:alpha:]]+[[:punct:]]?$/) { rec = rec sep $i; sep=" "} print rec}' file
I was there and it was awesome!

sed -E 's/[[:space:]][^a-zA-Z0-9[:space:]][^[:space:]]*//g' will get rid of any words starting with punctuation, which will get you halfway there.
[[:space:]] is any whitespace character
[^a-zA-Z0-9[:space:]] is any special character
[^[:space:]]* is any number of non-whitespace characters
Do it again with a ^ instead of the first [[:space:]] to remove those same words at the start of the line.
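As a rough illustration of what this substitution does and does not catch, here is the same idea in Python. The function name is hypothetical, and (?:^|\s) is my own shortcut that folds the two sed passes (whitespace and start of line) into one pattern.

```python
import re

# A word that starts with a non-alphanumeric character, preceded either by
# whitespace or by the start of the line.
STARTS_WITH_SPECIAL = re.compile(r"(?:^|\s)[^a-zA-Z0-9\s]\S*")

def drop_special_prefixed(line):
    """Remove words that begin with a special character."""
    return STARTS_WITH_SPECIAL.sub("", line)

line = "#derik I was there and it was awesome! !! http://url.picture.whatever #hash_tag"
print(repr(drop_special_prefixed(line)))
```

The URL survives because it starts with a letter, which is why this is only half of the job.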

Related

Using grep to extract very specific strings from binary file

I have a large binary file. I want to extract certain strings from it and copy them to a new text file.
For example, in:
D-wM-^?^#^#^#^#^#^#^#^Y^#^#^#^#^#^#^#M-lM-FM-MM-[o#^B^#M-lM-FM MM-[o#^B^#^#^#^#^#E7cacscKLrrok9bwC3Z64NTnZM-^G
I want to take the number '7' (after the #^#^#E) and every character after it stopping at the Z (ignoring the M-^G).
I want to copy this 7cacscKLrrok9bwC3Z64NTnZ to a new file.
There will be multiple such strings in one file. The end will always be denoted by the M- (which I don't want copied). The start will always be denoted by a 7 (which I do want copied).
Unfortunately, my knowledge of grep, sed, etc, does not extend to this level. Can someone please suggest a viable way to achieve this?
cat -v filename | grep [7][A-Z,a-z] will show all strings with a '7' followed by a letter but that's not much.
Thank you.
I've noticed that my requirements are rather more complicated.
(I've performed the correct - I hope - formatting this time). Thanks to 'tshiono' for his (?) answer to the earlier submission.
I want to check the ending of a string and, if it ends in M-, grep another string that follows it (with junk in between). If the string does not end in M-, then I don't want it copied (let alone any other strings).
So what I would like is:
grep -a -Po "7[[:alnum:]]+(?=M-)" file_name and if the ending is M- then grep -a -Po "5x[[:alnum:]]+(?=\^)" file_name to copy the string that starts with 5x and ends with a ^.
In this example:
D-wM-^?^#^#^#^#^#^#^#^Y^#^#^#^#^#^#^#M-lM-FM-MM-[o#^B^#M-lM-FM MM-[o#^B^#^#^#^#^#E7cacscKLrrok9bwC3Z64NTnZM-^GwM-^?^#^#^#^#^#^#^#^Y^#^#^#^#^#^#^#M-lM-FM-MM-[o#^B^#M-lM5x8w09qewqlkcklwnlkewflewfiewjfoewnflwenfwlkfwelk^89038432nowefe
The outcome would be:
7cacscKLrrok9bwC3Z64NTnZ
5x8w09qewqlkcklwnlkewflewfiewjfoewnflwenfwlkfwelk
However, if the ending is not M- (more precisely, if the ending is ^S), then do not try the second grep and do not record anything at all.
In this example:
D-wM-^?^#^#^#^#^#^#^#^Y^#^#^#^#^#^#^#M-lM-FM-MM-[o#^B^#M-lM-FM MM-[o#^B^#^#^#^#^#E7cacscKLrrok9bwC3Z64NTnZ^SGwM-^?^#^#^#^#^#^#^#^Y^#^#^#^#^#^#^#M-lM-FM-MM-[o#^B^#M-lM5x8w09qewqlkcklwnlkewflewfiewjfoewnflwenfwlkfwelk^89038432nowefe
The outcome would be null (nothing copied) as the 7cacs... string ends in ^S.
Is grep the correct tool? Grep a file and if the condition in the grep command is 'yes' then issue a different grep command but if the condition is 'no' then do nothing.
Thanks again.
I have noticed one additional modification.
Can one add an OR command to the second part? Grep if the second string starts with 5x OR 6x?
In the example below, grep -aPo "7[[:alnum:]]+M-.*?5x[[:alnum:]]+\^" filename | grep -aPo "7[[:alnum:]]+(?=M-)|5x[[:alnum:]]+(?=\^)" will extract the strings starting with 7 and the strings starting with 5x.
How can one change the 5x to 5x or 6x?
D-wM-^?^#^#^#^#^#^#^#^Y^#^#^#^#^#^#^#M-lM-FM-MM-[o#^B^#M-lM-FM MM-[o#^B^#^#^#^#^#E7cacscKLrrok9bwC3Z64NTnZM-^GwM-^?^#^#^#^#^#^#^#^Y^#^#^#^#^#^#^#M-lM-FM-MM-[o#^B^#M-lM5x8w09qewqlkcklwnlkewflewfiewjfoewnflwenfwlkfwelk^89038432nowefe
D-wM-^?^#^#^#^#^#^#^#^Y^#^#^#^#^#^#^#M-lM-FM-MM-[o#^B^#M-lM-FM MM-[o#^B^#^#^#^#^#E7AAAAAscKLrrok9bwC3Z64NTnZM-^GwM-^?^#^#^#^#^#^#^#^Y^#^#^#^#^#^#^#M-lM-FM-MM-[o#^B^#M-lM6x8w09qewqlkcklwnlkewflewfiewjfoewnflwenfwlkfwelk^89038432nowefe
In this example, the desired outcome would be:
7cacscKLrrok9bwC3Z64NTnZ
5x8w09qewqlkcklwnlkewflewfiewjfoewnflwenfwlkfwelk
7AAAAAscKLrrok9bwC3Z64NTnZ
6x8w09qewqlkcklwnlkewflewfiewjfoewnflwenfwlkfwelk
UPDATE MARCH 09:
I need to create a series of complex grep (or perl) commands to extract strings from a series of binary files.
I need two strings from the binary file.
The first string will always start with a 1.
The first string will end with a letter or number. The next letter will always be a lower case k. I do not want this k character.
The difficulty is that the ending k will not always be the first k in the string. It might be the first k but it might not.
After the k, there is a second string. The second string will always start with an A or a B.
The ending of the second string will be in one of two forms:
a) it will end with a space then display the first three characters from the first string in lower case followed by a )
b) it will end with a ^K then display the first three characters from the first string in lower case.
For example:
1pppsx9YPar8Rvs75tJYWZq3eo8PgwbckB4m4zT7Yg042KIDYUE82e893hY ppp)
Should be:
1pppsx9YPar8Rvs75tJYWZq3eo8Pgwbc and B4m4zT7Yg042KIDYUE82e893hY - delete the k and the space then ppp.
For example:
1zzzsx9YPkr8Rvs75tJYWZq3eo8PgwbckA2m4zT7Yg042KIDYUE82e893hY^Kzzz
Should be:
1zzzsx9YPkr8Rvs75tJYWZq3eo8Pgwbc and A2m4zT7Yg042KIDYUE82e893hY - delete the second k and the ^Kzzz.
In the second example, we see that the first k is part of the first string. It is the k before the A that breaks up the first and second strings.
I hope there is a super grep expert who can help! Many thanks!
If your grep supports -P option, would you please try:
grep -a -Po "7[[:alnum:]]+(?=M-)" file
The -a option forces grep to read the input as a text file.
The -P option enables the perl-compatible regex.
The -o option tells grep to print only the matched substring(s).
The pattern (?=M-) is a zero-width lookahead assertion (introduced in
Perl): it requires M- to follow the match without including it in the result.
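To see the lookahead in isolation, here is a small Python sketch. The sample string is a shortened stand-in for the cat -v rendering in the question (an assumption on my part; the real file holds raw bytes), and [A-Za-z0-9] stands in for [[:alnum:]].

```python
import re

# Shortened stand-in for the cat -v text shown in the question.
sample = "^#^#E7cacscKLrrok9bwC3Z64NTnZM-^G"

# (?=M-) must match right after the alphanumeric run but is not consumed,
# so the M- never shows up in the result.
matches = re.findall(r"7[A-Za-z0-9]+(?=M-)", sample)
print(matches)
```

The greedy run first swallows the M as well, fails the lookahead, and backs off by one character, which is how the match ends exactly at the Z.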
Alternatively you can also say with sed:
sed 's/M-/\n/g' file | sed -n 's/.*\(7[[:alnum:]]\+\).*/\1/p'
The first sed command splits the input file into multiple lines by
replacing the substring M- with a newline.
It has two benefits: it breaks the lines to allow multiple matches with
sed and excludes the unnecessary portion M- from the input.
The next sed command extracts the desired pattern from the input.
It assumes your sed accepts \n in the replacement, which is
a GNU extension (not POSIX compliant). Otherwise please try (in case you are working on bash):
sed 's/M-/\'$'\n''/g' file | sed -n 's/.*\(7[[:alnum:]]\+\).*/\1/p'
[UPDATE]
(The requirement has been updated by the OP; the following solutions address it.)
Let me assume the string which starts with 7 and ends with M- is always followed
by exactly one other string which starts with 5x and ends
with ^ (ASCII caret character), with junk in between.
Then would you please try the following:
grep -aPo "7[[:alnum:]]+M-.*?5x[[:alnum:]]+\^" file | grep -aPo "7[[:alnum:]]+(?=M-)|5x[[:alnum:]]+(?=\^)"
It executes the task in two steps (two cascaded greps).
The 1st grep narrows down the input data to the candidate substring,
which includes the two desired sequences and the junk in between.
The regex .*? in between matches any (ASCII or binary) characters
except for a newline character.
The trailing ? makes the match non-greedy (shortest possible),
which avoids overrunning the first 5x string due to the greedy nature of regex. It is intended to match the junk in between.
The 2nd grep includes two regex's merged with a pipe | meaning logical OR.
Then it extracts two desired sequences.
A potential problem with the grep solution is that grep is a line-oriented command
and cannot include the newline character in the matched string.
If a newline character is included in the junk in between (I'm not sure about the possibility), the above solution will fail.
As a workaround, perl provides flexible manipulation of binary data.
perl -0777 -ne '
    while (/(7[[:alnum:]]+)M-.*?(5x[[:alnum:]]+)\^/sg) {
        printf("%s\n%s\n", $1, $2);
    }
' file
The regex is mostly the same as that of grep because the -P option of grep means
perl-compatible.
It can capture multiple patterns at once in variables $1 and $2, hence just one regex is enough.
The -0777 option to the perl command tells perl to slurp all data
at once.
The s option at the end of the regex makes a dot match a newline character.
The g option enables the global (multiple) match.
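The same one-regex, two-captures idea can be sketched in Python, where re.DOTALL plays the role of the s option and finditer the role of g. The sample strings are shortened, made-up variants of the ones in the question (with a newline planted in the junk), and extract_pairs is a hypothetical name.

```python
import re

# Same structure as the Perl regex; DOTALL lets .*? cross newlines.
PAIR = re.compile(r"(7[A-Za-z0-9]+)M-.*?(5x[A-Za-z0-9]+)\^", re.DOTALL)

def extract_pairs(text):
    """Return (first, second) tuples for every 7...M- / 5x...^ pair."""
    return [m.groups() for m in PAIR.finditer(text)]

good = "E7cacscKLrrok9bwC3Z64NTnZM-^Gw\nM-lM5x8w09qewq^8903"
bad = "E7cacscKLrrok9bwC3Z64NTnZ^SGw\nM-lM5x8w09qewq^8903"

print(extract_pairs(good))
print(extract_pairs(bad))
```

Because the 7 string in the second sample ends in ^S rather than M-, the whole pattern fails and nothing is reported, which matches the "do nothing" requirement.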
[UPDATE2]
In order to make the regex match either 5x or 6x, replace 5x with (5|6)x.
Namely:
grep -aPo "7[[:alnum:]]+M-.*?(5|6)x[[:alnum:]]+\^" file | grep -aPo "7[[:alnum:]]+(?=M-)|(5|6)x[[:alnum:]]+(?=\^)"
As mentioned before, the pipe | means OR. The OR operator has the lowest precedence in a regex, hence you need to enclose the alternatives in parens in this case.
If any digit other than 5 or 6 may appear, it will be safer to use [[:digit:]] instead, which matches any one digit between 0 and 9:
grep -aPo "7[[:alnum:]]+M-.*?[[:digit:]]x[[:alnum:]]+\^" file | grep -aPo "7[[:alnum:]]+(?=M-)|[[:digit:]]x[[:alnum:]]+(?=\^)"
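The precedence point is easy to check in Python. This is just a sketch on invented, shortened samples; (?:...) is a non-capturing group, used here so the numbering of the two capture groups stays the same.

```python
import re

# Without grouping, the alternation splits the whole pattern: "5" OR "6x".
ungrouped = re.fullmatch(r"5|6x", "5x")
# With a group, only the digit alternates.
grouped = re.fullmatch(r"(?:5|6)x", "5x")

PAIR = re.compile(r"(7[A-Za-z0-9]+)M-.*?((?:5|6)x[A-Za-z0-9]+)\^", re.DOTALL)
samples = ["E7cacscM-^Gjunk5x8w09^890", "E7AAAAAsM-^Gjunk6x8w09^890"]
pairs = [PAIR.search(s).groups() for s in samples]
print(pairs)
```

A character class such as [56]x would do the same job for single digits.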
[UPDATE3]
(Answering the OP's requirement on March 9th)
Let me start with perl code, whose regex will be relatively easier
to explain.
perl -0777 -ne 'while (/(1(.{3}).+)k([AB].*)[\013 ]\2/g){print "$1 $3\n"}' file
Output:
1pppsx9YPar8Rvs75tJYWZq3eo8Pgwbc B4m4zT7Yg042KIDYUE82e893hY
1zzzsx9YPkr8Rvs75tJYWZq3eo8Pgwbc A2m4zT7Yg042KIDYUE82e893hY
[Explanation of regex]
(1(.{3}).+)k([AB].*)[\013 ]\2
( start of the 1st capture group referred by $1 later
1 literal "1"
( start of the 2nd capture group referred by \2 later
.{3} any three characters, such as ppp or zzz
) end of the 2nd capture group
.+ followed by any characters with "greedy" match which may include the 1st "k"
) end of the 1st capture group
k literal "k"
( start of the 3rd capture group referred by $3 later
[AB].* the character "A" or "B" followed by any characters
) end of the 3rd capture group
[\013 ] followed by ^K or a whitespace
\2 followed by the capture group 2 previously assigned
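Here is the same regex tried in Python on the two examples from the question, as a sketch only. \x0b is the byte that cat -v shows as ^K (octal \013 above), and split_strings is a name I made up.

```python
import re

# Group 1: first string, group 2: its three characters after the "1",
# group 3: second string. \2 re-matches whatever group 2 captured.
SPLIT = re.compile(r"(1(.{3}).+)k([AB].*)[\x0b ]\2")

def split_strings(text):
    """Return (first, second) or None when the pattern does not match."""
    m = SPLIT.search(text)
    return (m.group(1), m.group(3)) if m else None

print(split_strings("1pppsx9YPar8Rvs75tJYWZq3eo8PgwbckB4m4zT7Yg042KIDYUE82e893hY ppp)"))
print(split_strings("1zzzsx9YPkr8Rvs75tJYWZq3eo8PgwbckA2m4zT7Yg042KIDYUE82e893hY\x0bzzz"))
```

The greedy .+ in group 1 is what skips the early k in the second example: it backs off from the end of the line and stops at the last k that is followed by A or B.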
When implementing it with grep, we run into a limitation of grep.
Although we want to extract multiple patterns from the input file,
the -e option (which can specify multiple search patterns) does not
work with the -P option. So we need to split the regex into two patterns
such as:
grep -Po "(1(.{3}).+)(?=k([AB].*)[\013 ]\2)" file
grep -Po "(1(.{3}).+)k\K([AB].*)(?=[\013 ]\2)" file
And the result will be:
1pppsx9YPar8Rvs75tJYWZq3eo8Pgwbc
1zzzsx9YPkr8Rvs75tJYWZq3eo8Pgwbc
B4m4zT7Yg042KIDYUE82e893hY
A2m4zT7Yg042KIDYUE82e893hY
Please note that the order of output is not the same as the order of appearance in the original file.
Another option is to use ripgrep (rg), which is a fast
and versatile alternative to grep. You may need to install ripgrep with
sudo apt install ripgrep or another package manager.
An advantage of ripgrep is that it supports the -r (replace) option, in which
you can make use of the backreferences:
rg -N -Po "(1(.{3}).+)k([AB].*)[\013 ]\2" -r '$1 $3' file
The -r '$1 $3' option prints the 1st and the 3rd capture groups and the result will be the same as perl.
In the general case, you can use the strings utility to pluck out ASCII from binary files; then of course you can try to grep that output for patterns that you find interesting.
Many traditional Unix utilities like grep have internal special markers which might get messed up by binary input. For example, the character \xFF was used for internal purposes by some versions of GNU grep so you can't grep for that character even if you can figure out a way to represent it in the shell (Bash supports $'\xff' for example).
A traditional approach would be to run hexdump or a similar utility, and then grep that for patterns. However, more modern scripting languages like Perl and Python make it easy to manipulate arbitrary binary data.
perl -ne 'print if m/\xff\xff/' </dev/urandom
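As a sketch of the Python side of that claim: the re module accepts bytes patterns, so arbitrary byte values can be searched for directly. The data below is invented for the example, and the 4-character minimum only imitates what strings does by default.

```python
import re

# Invented sample: control bytes, two printable runs, then \xff\xff.
data = b"\x00\x01hello\xff\xfe7cacsc\x87\xff\xff"

# Rough equivalent of `strings`: runs of 4 or more printable ASCII bytes.
printable_runs = re.findall(rb"[ -~]{4,}", data)

# Bytes patterns can name any byte value, including \xff.
ff_pair = re.search(rb"\xff\xff", data)

print(printable_runs)
print(ff_pair.start())
```

For a real file you would read it with open(path, "rb") so nothing gets decoded on the way in.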
This might work for you (GNU sed):
sed -En '/\n/!{s/M-\^G/\n/;s/7[^\n]*\n/\n&/};/^7[^\n]*/P;D' file
Split each line into zero or more lines that begin with 7 and end just before M-^G and only print such lines.

Print commands in history consisting in just one word

I want to print lines that contains single word only.
For example:
this is a line
another line
one
more
line
last one
I want to get the ones with single word only
one
more
line
EDIT: Thank you for the answers. Almost all of them work for my test file. However, I wanted to list single-word lines in bash history. When I try your answers like
history | your posted commands
all of them fail. Some only print numbers (maybe line numbers?)
You want to get all those commands in history that contain just one word. Considering that history prints the number of the command as the first column, you need to match those lines consisting of two words.
For this, you can say:
history | awk 'NF==2'
If you just want to print the command itself, say:
history | awk 'NF==2 {print $2}'
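The field-counting logic is simple enough to mirror in a few lines of Python, if that helps to see why it is 2 and not 1. The function name and the sample history lines are made up.

```python
def single_word_commands(history_lines):
    """Mimic awk 'NF==2 {print $2}': keep lines with exactly two fields."""
    result = []
    for line in history_lines:
        fields = line.split()      # split on runs of whitespace, like awk
        if len(fields) == 2:       # field 1 is the history number
            result.append(fields[1])
    return result

print(single_word_commands(["    1  ls", "    2  cd /tmp", "    3  pwd"]))
```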
To rehash your problem, any line containing a space or nothing should be removed.
grep -Ev '^$| ' file
Your problem statement is unspecific on whether lines containing only punctuation might also occur. Maybe try
grep -Ex '[A-Za-z]+' file
to only match lines containing only one or more alphabetics. (The -x option implicitly anchors the pattern -- it requires the entire line to match.)
In Bash, the output from history is decorated with line numbers; maybe try
history | grep -E '^ *[0-9]+  [A-Za-z]+$'
to match lines where the line number is followed by a single alphabetic token. Notice that there will be two spaces between the line number and the command.
In all cases above, the -E selects extended regular expression matching, aka egrep (basic RE aka traditional grep does not support e.g. the + operator, though it's available as \+).
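To make the anchoring concrete, here is a Python sketch where re.fullmatch plays the role of grep -x (or of ^...$). The history lines are invented and assume the two-space layout mentioned above.

```python
import re

lines = ["this is a line", "another line", "one", "more", "line", "last one"]
# The whole line must consist of letters only.
single_words = [l for l in lines if re.fullmatch(r"[A-Za-z]+", l)]

history = ["  101  ls", "  102  git status", "  103  pwd"]
# Optional padding, the number, two spaces, then one alphabetic token.
numbered = [l for l in history if re.fullmatch(r" *[0-9]+  [A-Za-z]+", l)]

print(single_words)
print(numbered)
```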
Try this:
grep -E '^\s*\S+\s*$' file
With the above input, it will output:
one
more
line
If your test strings are in a file called in.txt, you can try the following:
grep -E "^\w+$" in.txt
What it means is:
^ starting the line with
\w any word character [a-zA-Z0-9_]
+ there should be at least 1 of those characters or more
$ line end
And output would be
one
more
line
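A quick way to see what \w does and does not accept is to try it in Python (a sketch; is_single_word is a made-up helper). Note that Python's \w on text strings also accepts non-ASCII letters, which may differ from your grep depending on locale.

```python
import re

def is_single_word(line):
    """True when the whole line is one run of word characters."""
    return re.fullmatch(r"\w+", line) is not None

for candidate in ["one", "snake_case", "abc123", "last one", "semi;colon", ""]:
    print(repr(candidate), is_single_word(candidate))
```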
Assuming your file is texts.txt, and if grep is not the only option, then
awk '{ if ( NF == 1 ) print }' texts.txt
If your single worded lines don't have a space at the end you can also search for lines without an empty space :
grep -v " "
I think that what you're looking for could be best described as a newline followed by a word with a negative lookahead for a space,
/\n\w+\b(?! )/g

How to split a string or file that may be delimited by a combination of comments and spaces, tabs, newlines, commas, or other characters

If file: list.txt contains really ugly data like so:
aaaa
#bbbb
cccc, dddd; eeee
ffff;
#gggg hhhh
iiii
jjjj,kkkk ;llll;mmmm
nnnn
How do we parse/split that file, excluding the commented lines, delimiting it by all commas, semicolons, and all white-space (including tabs, spaces, and newline and carriage-return characters) with a bash script?
Using shell commands:
grep -v '^[[:space:]]*#' file|tr ";," "\n"|awk '$1=$1'
It can be done with the following code:
#!/bin/bash
### read file:
file="list.txt"
IFSO=$IFS
IFS=$'\r\n'
array=()
while read -r line; do
    ### skip lines that begin with a "#" or "<whitespace>#"
    match_pattern="^[[:space:]]*#"
    if [[ "$line" =~ $match_pattern ]]; then
        continue
    fi
    ### replace semicolons and commas with a space everywhere...
    temp_line=${line//[;,]/ }
    ### splitting the line at whitespaces requires IFS to be set back to default
    ### and then back before we get to the next line.
    IFS=$IFSO
    split_line_arr=($temp_line)
    IFS=$'\r\n'
    ### push each word in the split_line_arr onto the final array
    for word in "${split_line_arr[@]}"; do
        array+=("$word")
    done
done < "$file"
### restore the original IFS once the file has been read
IFS=$IFSO
echo "Array items:"
for item in "${array[@]}"; do
    printf " %s\n" "$item"
done
This was not posed as a question, but rather as a better solution to what others have touched upon when answering other related questions. The bit that is unique here is that those other questions/solutions did not really address how to split a string when it is delimited by a combination of spaces, characters, and comments; this is one solution that addresses all three simultaneously...
Related questions:
How to split one string into multiple strings separated by at least one space in bash shell?
How do I split a string on a delimiter in Bash?
Additional notes:
Why do this with bash when other scripting languages are better suited for splitting? A bash script is more likely to have all the libraries it needs when running from a basic upstart or cron (sh) shell, compared with a perl program for example. An argument list is often needed in these situations and we should expect the worst from people who maintain those lists...
Hopefully this post will save bash newbies a lot of time in the future (including me)... Good luck!
sed 's/[# \t,]/REPLACEMENT/g' input.txt
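The above command replaces comment characters ('#'), spaces (' '), tabs ('\t'), and commas (',') with an arbitrary string ('REPLACEMENT').
To replace newlines as well, you can try the following. Note that tr translates single characters, so each newline becomes only the first character of the second set ('R'), not the whole string:
sed 's/[# \t,]/replacement/g' input.txt | tr '\n' 'REPLACEMENT'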
if you have Ruby on your system
File.open("file").each_line do |line|
  next if line[/^\s*#/]
  puts line.split(/\s+|[;,]/).reject{|c|c.empty?}
end
output
# ruby test.rb
aaaa
cccc
dddd
eeee
ffff
iiii
jjjj
kkkk
llll
mmmm
nnnn
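For comparison, the same skip-comments-then-split logic fits in a few lines of Python. This is only a sketch: parse_items is a hypothetical name and the sample text is the list.txt content from the question.

```python
import re

def parse_items(text):
    """Split text on commas, semicolons and whitespace, skipping # lines."""
    items = []
    for line in text.splitlines():
        if re.match(r"\s*#", line):      # commented line, possibly indented
            continue
        # one split handles any mix of delimiters; drop empty leftovers
        items.extend(tok for tok in re.split(r"[;,\s]+", line) if tok)
    return items

sample = """aaaa
#bbbb
cccc, dddd; eeee
ffff;
#gggg hhhh
iiii
jjjj,kkkk ;llll;mmmm
nnnn
"""
print(parse_items(sample))
```

Treating a run of mixed delimiters as a single separator is what avoids the empty fields that a plain split on one character would produce.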

Substitute words not in double quotes

$cat file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
basic
I want a unix sed command that changes only the basic that is not in quotes (change basic to ring).
Expected output:
$cat file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
ring
If we disallow escaping quotes, then any basic that is not within " is preceded by an even number of ". So this should do the trick:
sed -r 's/^([^"]*("[^"]*){2}*)basic/\1ring/' file
And as ДМИТРИЙ МАЛИКОВ mentioned, adding the --in-place option will immediately edit the file, instead of returning the new contents.
How does this work?
We anchor the regular expression to the beginning of each line with ^. Then we allow an arbitrary number of non-" characters (with [^"]*). Then we start a new subpattern "[^"]* that consists of one " and arbitrarily many non-" characters. We repeat that an even number of times (with {2}*). And then we match basic. Because we matched all of that stuff in the line before basic, we would replace that as well. That's why this part is wrapped in another pair of parentheses, thus capturing the line and writing it back in the replacement with \1 followed by ring.
One caveat: if you have multiple basic occurrences in one line, this will only replace the last one that is not enclosed in double quotes, because regex matches cannot overlap. A solution would be a lookbehind, but it would have to be a variable-length lookbehind, which sed does not support (the .NET regex engine is one that does). So if that is the case in your actual input, run the command multiple times until all occurrences are replaced.
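The even-number-of-quotes idea and the "last occurrence only" caveat can both be checked with a small Python sketch. Python rejects the {2}* spelling, so the pair is wrapped in one more non-capturing group here; replace_unquoted is a made-up name.

```python
import re

# [^"]* followed by any number of quote PAIRS, so an even number of
# quotes precedes the "basic" that gets replaced.
UNQUOTED = re.compile(r'^([^"]*(?:(?:"[^"]*){2})*)basic')

def replace_unquoted(line):
    """Replace one unquoted 'basic' (the last one) with 'ring'."""
    return UNQUOTED.sub(r"\1ring", line)

for line in ['"basic/strong/bold"', '"^/))basic"', 'basic', 'x "basic" basic basic']:
    print(replace_unquoted(line))
```

In the last sample only the final basic changes, which is the caveat described above: the greedy prefix swallows the earlier unquoted occurrence.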
$> sed -r 's/^([^\"]*)(basic)([^\"]*)$/\1ring\3/' file0
"basic/strong/bold"
" /""?basic""/strong/bold"
"^/))basic"
ring
If you want to edit the file in place, use the --in-place option.
This might work for you (GNU sed):
sed -r 's/^/\n/;ta;:a;s/\n$//;t;s/\n("[^"]*")/\1\n/;ta;s/\nbasic/ring\n/;ta;s/\n([^"]*)/\1\n/;ta' file
Not a sed solution, but it substitutes words not in quotes
Assuming that there are no escaped quotes in strings, i.e. "This is a trap \" hehe", awk might be able to solve this problem
awk -F\" 'BEGIN {OFS=FS}
{
    for(i=1; i<=NF; i++){
        if(i%2)
            gsub(/basic/,"ring",$i)
    }
    print
}' inputFile
Basically the words that are not in quotes are in odd-numbered fields, and the word "basic" is replaced by "ring" in these fields.
This can be written as a one-liner, but for clarity's sake I've written it in multiple lines.
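The same split-on-quotes trick reads naturally in Python as well. This is a sketch under the same no-escaped-quotes assumption; the function name and its default arguments are mine.

```python
def replace_outside_quotes(line, old="basic", new="ring"):
    """Replace old with new only in the parts of line outside double quotes."""
    parts = line.split('"')
    # Indexes 0, 2, 4, ... are outside quotes. awk numbers fields from 1,
    # which is why the awk version tests for odd field numbers instead.
    for i in range(0, len(parts), 2):
        parts[i] = parts[i].replace(old, new)
    return '"'.join(parts)

print(replace_outside_quotes('basic "basic" basic'))
```

Unlike the single-regex approach, this replaces every unquoted occurrence in one pass.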
If basic is at the beginning of line:
sed -e 's/^basic/ring/' file0

Convert spaces to tabs in RegEx

How do you say the following in regex:
foreach line
look at the beginning of the string and convert every group of 3 spaces to a tab
Stop once a character other than a space is found
This is what I have so far:
/^ +/\t/g
However, this converts every space to 1 tab
Any help would be appreciated.
With Perl:
perl -pe '1 while s/\G {3}/\t/gc' input.txt >output.txt
For example, with the following input
nada
   three spaces
    four spaces
   three   in the middle
      six spaces
the output (TABs replaced by \t) is
$ perl -pe '1 while s/\G {3}/\t/gc' input | perl -pe 's/\t/\\t/g'
nada
\tthree spaces
\t four spaces
\tthree   in the middle
\t\tsix spaces
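If Perl's \G is unfamiliar, the same result can be had by matching the whole leading run of 3-space groups once and computing the number of tabs from its length. A Python sketch follows; the function name and the width parameter are my own additions.

```python
import re

def leading_spaces_to_tabs(line, width=3):
    """Turn each leading group of `width` spaces into one tab."""
    def repl(match):
        # one tab per complete group of spaces in the leading run
        return "\t" * (len(match.group()) // width)
    # only complete groups at the very start of the line are matched
    return re.sub(r"^(?: {%d})+" % width, repl, line)

for line in ["nada", "   three spaces", "    four spaces",
             "   three   in the middle", "      six spaces"]:
    print(repr(leading_spaces_to_tabs(line)))
```

Leftover spaces that do not fill a whole group (the fourth space in the third sample) stay as they are, and spaces in the middle of a line are never touched.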
I know this is an old question but I thought I'd give a full regex answer that works (well it worked for me).
s/\t* {3}/\t/g
I usually use this to convert a whole document in vim, where it looks like this:
:%s/\t* \{3\}/\t/g
Hope it still helps someone.
You probably want /^(?: {3})*/\t/g
edit: fixed