Removing specific character from anywhere between two specific strings? - regex

I have a large text file that contains content as per the below example:
number="+123 123 123" text="This is some text"
number="+123456" text="This may contain numbers"
number="+123456 789" text="Numbers here should keep their spaces"
number="+9 8 7 6 5" text="example 123 123 123"
What I would like is to remove any whitespace character between two identifying strings, in this case number= and " text=, without touching the rest of the line, so that the desired output would be:
number="+123123123" text="This is some text"
number="+123456" text="This may contain numbers"
number="+123456789" text="Numbers here should keep their spaces"
number="+98765" text="example 123 123 123"
A regex like (?<=[0-9])(\s)(?=[0-9]) will interfere with the text field, which is undesirable.
I have tested a few variations along the lines of (?<=address)(\s)(?=date), but this doesn't work. I think the problem lies in the lookarounds not being able to deal with the extra digits that can appear between the whitespace and the markers?
Adding wildcard matches into the lookbehind/lookahead, such as (?<=address.*)(\s)(?=.*date), doesn't seem to be valid, or else I've done it wrong? Making the whitespace lazy with (\s+?) doesn't seem to help either, but this is about where my knowledge of regex really falls to pieces :)
Ideally I would also like to restrict the match with the extra equals and quote characters for safety, i.e. number=" as the beginning marker and " text= as the end marker.
Any sed/awk or similar solutions are also welcomed if easier.

Using awk:
awk 'BEGIN{FS=OFS="\""}{gsub(/ /,"",$2)}1' file
number="+123123123" text="This is some text"
number="+123456" text="This may contain numbers"
number="+123456789" text="Numbers here should keep their spaces"
number="+98765" text="example 123 123 123"

Using a substitution and a loop:
sed ':l s/\(number="[^" \t]*\)\s\s*/\1/g;tl' input
this one gives:
number="+123123123" text="This is some text"
number="+123456" text="This may contain numbers"
number="+123456789" text="Numbers here should keep their spaces"
number="+98765" text="example 123 123 123"

Search: [ ](?=[^"]*" text=) (the brackets around the space are optional; they are there for clarity)
Replace: empty string.
Command-Line Syntax
I don't know the sed syntax for search and replace. With Perl (courtesy of @jaypal and @AvinashRaj):
perl -pe 's/ (?=[^"]*" text=)//g' file
From perl --help,
-p assume loop like -n but print line also, like sed
-e program one line of program (several -e's allowed, omit programfile)
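To modify the file in place rather than print to stdout, perl's -i switch can be added; -i.bak keeps a backup of the original:
perl -i.bak -pe 's/ (?=[^"]*" text=)//g' file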

Another awk solution:
awk -F ' text="' '{ gsub(/ /, "", $1); print $1 FS $2 }' file
-F text="' splits each input line into the part before text=" ($1), and the part after ($2) - the -F option sets the special FS (*f*ield *s*eparator) awk variable to a regex that awk uses to split each input line into fields.
gsub(/ /, "", $1) (*g*lobal *sub*stitution) removes all spaces from $1 (the part before text="; replaces spaces with the empty string).
print $1 FS $2 prints the output: the modified $1 (spaces removed), joined with FS (i.e., text="), joined with $2 (the unmodified part after text=").
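One caveat, sketched here as a possible hardening: a line that happens to lack text=" has NF == 1, and the unconditional print would append a stray separator. A guarded variant leaves such lines untouched:
awk -F ' text="' 'NF > 1 { gsub(/ /, "", $1); $0 = $1 FS $2 } 1' file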

Note: This answer complements the existing ones by comparing their performance.
Test environments:
OS X 10.9.4.
FreeBSD awk 20070501
FreeBSD sed (cannot tell version number)
Perl v5.16.2
Ubuntu 14.04
GNU Awk 4.0.1
sed (GNU sed) 4.2.2
Perl v5.18.2
The short of it:
The awk solutions are fastest.
On OS X, @jaypal's solution is faster; on Ubuntu, it's @mklement0's (mine).
Followed by the perl solution.
The sed solution (accepted answer) is slowest.
Note that removing the unnecessary g option does improve things measurably, but doesn't change the big picture.
On OS X, the differences aren't dramatic.
On Ubuntu, the differences between the awk and the perl solutions are small, but the sed solution is dramatically slower.
Sample numbers, running against a 100,000-line input file 10 times.
Don't compare them directly (Ubuntu is running in a VM on the OS X machine), just look at their ratios. (Curiously, though, awk and perl ran faster in the Ubuntu VM):
OS X:
# awk (@jaypal)
real 0m3.848s
user 0m3.773s
sys 0m0.049s
# awk (@mklement0)
real 0m4.011s
user 0m3.959s
sys 0m0.045s
# perl
real 0m4.382s
user 0m4.291s
sys 0m0.063s
# sed
real 0m4.867s
user 0m4.816s
sys 0m0.044s
# sed (no `g`)
real 0m4.510s
user 0m4.460s
sys 0m0.044s
Ubuntu:
# awk (@mklement0)
real 0m1.850s
user 0m1.788s
sys 0m0.020s
# awk (@jaypal)
real 0m2.055s
user 0m1.996s
sys 0m0.012s
# perl
real 0m2.349s
user 0m2.276s
sys 0m0.024s
# sed
real 0m8.278s
user 0m8.196s
sys 0m0.016s
# sed (no `g`)
real 0m7.580s
user 0m7.488s
sys 0m0.028s
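For reference, a harness along these lines can reproduce such a benchmark (the generated input and file names are assumptions, not the original test data):
# generate a 100,000-line sample input
awk 'BEGIN { for (i = 1; i <= 100000; i++) print "number=\"+123 456 789\" text=\"sample line " i "\"" }' > big.txt
# time one candidate over 10 runs
time for i in $(seq 10); do
    awk 'BEGIN{FS=OFS="\""}{gsub(/ /,"",$2)}1' big.txt > /dev/null
done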

Related

Match strings with certain number of unique characters in bash

I need to delete all the strings in a file that have fewer than 4 unique characters in them.
Input:
hello
cabby
pabba
lokka
lappa
coool
apple
Expected Output:
hello
cabby
lokka
apple
I tried to think up a regular expression to do this, but I can't see how it would even be possible.
I did find a sed command that seems promising: it deletes all duplicate characters. However, I am not sure how to make sed test whether 4 characters remain and, if they do, match the original string.
sed ':1;s/\(\(.\).*\)\2/\1/g;t'
Using GNU awk:
awk 'BEGIN { FS = "" }
{ unq = 0; delete seen; for (i = 1; i <= NF; i++) if (!seen[$i]++) unq++ }
unq > 3' file
hello
cabby
lokka
apple
FS="" breaks each character into a separate field in awk.
You tried sed ':1;s/\(\(.\).*\)\2/\1/g;t'; please replace t with t1 so it branches back to the label.
Before your command, copy the current line into the hold space.
After your command, for lines with at least 4 characters left, exchange in the original line.
Now make sure you only print lines with at least four characters:
echo 'hello
cabby
pabba
lokka
lappa
coool
apple' | sed -nE 'h;:1;s/(.)(.*)\1/\1\2/g;t1;/.{4}/x;/.{4}/p'
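For comparison, a perl take on the same filter (a sketch; it counts distinct characters per line directly instead of deduplicating):
perl -lne 'my %s; $s{$_}++ for split //; print if keys(%s) > 3' file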

Extracting string from html file or curl output

I have HTML files, some of which are "minified", meaning a whole website can be on just one line.
I want to extract the value of ?idsite=, which contains numbers. The HTML contains something like this: img src="//stats.domains.com/piwik.php?idsite=44.
So the plain output should be "44".
I tried grep, but it echoes the whole line and just highlights the value.
With perl it could be something like:
echo "Whole bunch of stuff \
img src=\"stats.domains.com/piwik.php?idsite=44\" " \
| perl -nE 'say /.*idsite=(..)\"/ '
(this assumes that idsite is always two characters! :-) Your regex will most likely need to be more sophisticated than this).
Putting the snippet from the page you reference above into an HTML file (non-minified) and substituting 44 for the parameter value, this bit of perl will extract the "44":
perl -nE 'say /.*idsite=(..)/ if /idsite/ ' idsite.html
Translating the one-liner to a sed command line is similar:
echo "Whole bunch of stuff \
img src=\"stats.domains.com/piwik.php?idsite=44\" " \
| sed -En "s/^.*idsite=(..)\"/\1/p"
This is POSIX sed from FreeBSD (it should work on OS X); the -E switch enables "modern" (extended) regexes.
Doing it in awk is left as an exercise for another community member :-)
Here is a perl way to extract only the trailing digits of strings like src="//stats.domains.com/piwik.php?idsite=44", run on a bash command line:
echo "$src" | perl -ne '$_ =~ m/(\d+$)/; print $1'
Here is a python way to do the same thing:
import re
print ', '.join( re.findall(r'\d+$', src))
If there will be a lot of src strings to process, it is best to compile the regex once when using Python:
import re
p = re.compile(r'\d+$')
print ', '.join(p.findall(src))
The import and the compilation only have to be done once.
Here is a Ruby way to do it:
puts src.scan( /\d+$/ ).first
In all cases the regexes end with "$" which matches the end of the string. That is why they match and extract only digits (\d+) at the end of the string.
If you don't need to check whether the idsite is in the value of a src attribute, then all you need is
perl -nE 'say $1 if /\bidsite=(\d+)/' myfile.html
$ cat site.html
lorem ipsum idsite='4934' fasdf a
other line
$ sed -n "/idsite/ s/.*idsite='*\([0-9]\+\).*/\1/p" < site.html
4934
Let me know in case you need an explanation of what is going on.
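If your grep has PCRE support (GNU grep's -P), \K offers a one-pass alternative (a sketch; the optional quote covers both quoted and unquoted values):
grep -oP "idsite='?\K[0-9]+" site.html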

Awk 3 Spaces + 1 space or hyphen

I have a rather large chart to parse. Each column is separated by either 4 spaces or by 3 spaces and a hyphen (since the numbers in the chart can be negative).
cat DATA.txt | awk "{ print match($0,/\s\s/) }"
does nothing but print a slew of 0's. I'm trying to understand AWK and when to escape, etc, but I'm not getting the hang of it. Help is appreciated.
One line:
1979 1 -0.176 -0.185 -0.412 0.069 -0.129 0.297 -2.132 -0.334 -0.019
1979 1 -0.176 0.185 -0.412 0.069 -0.129 0.297 -2.132 -0.334 -0.019
I would like to get just, say, the second data column; from the two lines above, I'd like to see -0.185 and 0.185.
You need to start by thinking about bash quoting, since it is bash that interprets the argument you pass to awk, which is the awk program. Inside double-quoted strings, bash expands $0 to the name of the bash executable (or current script); that's almost certainly not what you want, since it will not be a quoted string. In fact, you almost never want double quotes around the awk program argument, so you should get into the habit of writing awk '...'.
Also, awk regular expressions don't understand \s (although Gnu awk will handle that as an extension). And match returns the position of the match, which I don't think you care about either.
Since by default, awk considers any sequence of whitespace a field separator, you don't really need to play any games to get the fourth column. Just use awk '{print $4}'
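Concretely, run against DATA.txt from the question (note the single quotes, so awk rather than bash sees $4):
awk '{ print $4 }' DATA.txt
-0.185
0.185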
Why not just use this simple awk:
awk '$0=$4' Data.txt
-0.185
0.185
It sets $0 to the value in $4 and performs the default action, print.
PS: do not use cat with a program that can read files itself, like awk.
In case field 4 contains 0 (which is false in awk, so the line would not print), you can make it more robust like this:
awk '{$0=$4}1' Data.txt
If you're trying to split the input according to 3 or 4 spaces then you will get the expected output only from column 3.
$ awk -v FS=" {3,4}" '{print $3}' file
-0.185
0.185
FS=" {3,4}" here we pass a regex as FS value. This regex get parsed and set the Field Separator value to three or four spaces. In regex {min,max} called range quantifier which repeats the previous token from min to max times.

how to rejoin words that are split across lines with a hyphen in a text file

OCR texts often have words that flow from one line to the next with a hyphen at the end of the first line (i.e., the word has '-\n' inserted into it).
I would like to rejoin all such split words in a text file (in a Linux environment).
I believe this should be possible with sed or awk, but the syntax for these is dark magic to me! I knew a text editor on Windows that did regex search/replace with newlines in the search expression, but I am unaware of such a tool on Linux.
Make sure to back up ocr_file before running as this command will modify the contents of ocr_file:
perl -i~ -e 'BEGIN{$/=undef} ($f=<>) =~ s#-\s*\n\s*(\S+)#$1\n#mg; print $f' ocr_file
This answer is relevant, because I want the words joined together... not just a removal of the dash character.
cat file | perl -CS -pe 's/-\n//' | fmt -w52
is the short answer, but it uses fmt to re-form the paragraphs after they were mangled by perl.
Without fmt, you can do:
#!/usr/bin/perl
use open qw(:std :utf8);    # treat STDIN/STDOUT as UTF-8
undef $/; $_ = <>;          # slurp the whole input into $_
s/-\n(\w+\W+)\s*/$1\n/sg;   # pull the word tail up; move the line break after it
print;
Also, if you're doing OCR, you can use this perl one-liner to convert Unicode dashes to ASCII dash characters; note the -CS option, which tells perl the streams are UTF-8.
# map U+2009 (thin space) and U+2010-U+2015 (hyphens and dashes) to an ASCII dash
perl -CS -pe 'tr/\x{2009}\x{2010}\x{2011}\x{2012}\x{2013}\x{2014}\x{2015}/-/'
cat file | perl -p -e 's/-\n//'
If the file has Windows line endings, you'll need to catch the CR-LF with something like:
cat file | perl -p -e 's/-\s\n//'
Hey this is my first answer post, here goes:
'-\n', I suspect, is the hyphen followed by the line-feed character. You can use sed to remove these. You could try the following as a test:
1) create a test file:
echo "hello this is a test -\n" > testfile
2) check the file has the expected contents:
cat testfile
3) test the sed command; this sends the edited text stream to standard out (i.e. your active console window) without overwriting anything:
sed 's/-\\n//g' testfile
(you should just see 'hello this is a test' printed to the console, without the '-\n')
If I build up the command:
a) First off you have the sed command itself:
sed
b) Secondly, the expression and sed-specific controls need to be in quotation marks:
sed 'sedcontrols+regex' (the text in quotations isn't what you'll actually enter; we'll fill it in as we go along)
c) Specify the file you are reading from:
sed 'sedcontrols+regex' testfile
d) To delete the string in question, sed needs to be told to substitute the unwanted characters with nothing (null, zero). You use 's' to substitute, a forward slash, then the unwanted string (more on that in a sec), another forward slash, then nothing (what it is being substituted with), a final forward slash, and then the scope (whether to apply the edit to one occurrence or more). Here I pick 'g', which represents global, as in the whole text file. So now we have:
sed 's/regex//g' testfile
e) We need to add in the unwanted string, but it gets confusing: the backslash is a special character, so it needs to be escaped with another backslash. So the unwanted string
-\n ends up looking like -\\n
We can output the edited text stream to stdout as follows:
sed 's/-\\n//g' testfile
To save the results without overwriting anything (assuming testfile2 doesn't exist) we can redirect the output to a file:
sed 's/-\\n//g' testfile >testfile2
sed -z 's/-\n//' file_with_hyphens
(GNU sed: -z reads NUL-separated records, so the whole file becomes one record and the \n can be matched.)
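For completeness, an awk equivalent of the same idea (a sketch): a line ending in a hyphen is printed without the hyphen or a newline, so the next line is glued on. Unlike the perl above, this merges the whole continuation line rather than pulling up just the word:
awk '/-$/ { sub(/-$/, ""); printf "%s", $0; next } 1' file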

how to use sed, awk, or gawk to print only what is matched?

I see lots of examples and man pages on how to do things like search-and-replace using sed, awk, or gawk.
But in my case, I have a regular expression that I want to run against a text file to extract a specific value. I don't want to do search-and-replace. This is being called from bash. Let's use an example:
Example regular expression:
.*abc([0-9]+)xyz.*
Example input file:
a
b
c
abc12345xyz
a
b
c
As simple as this sounds, I cannot figure out how to call sed/awk/gawk correctly. What I was hoping to do is, from within my bash script, have:
myvalue=$( sed <...something...> input.txt )
Things I've tried include:
sed -e 's/.*([0-9]).*/\\1/g' example.txt # extracts the entire input file
sed -n 's/.*([0-9]).*/\\1/g' example.txt # extracts nothing
My sed (Mac OS X) didn't work with +. I tried * instead and added the p flag to print the match:
sed -n 's/^.*abc\([0-9]*\)xyz.*$/\1/p' example.txt
For matching at least one numeric character without +, I would use:
sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' example.txt
You can use sed to do this
sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp'
-n don't print lines automatically
-r avoids having to escape the capture-group parens ()
\1 the capture-group match
/g match globally on the line
/p print the result if a substitution was made
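Applied to the sample file from the question (GNU sed, for the -r switch):
$ sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp' example.txt
12345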
I wrote a tool for myself that makes this easier
rip 'abc(\d+)xyz' '$1'
I use perl to make this easier for myself. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/'
This runs Perl; the -n option instructs Perl to read one line at a time from STDIN and execute the code, and the -e option specifies the instruction to run.
The instruction runs a regexp on the line read, and if it matches, prints out the contents of the first set of brackets ($1).
You can do this with multiple file names on the end as well, e.g.:
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/' example1.txt example2.txt
If your version of grep supports it you could use the -o option to print only the portion of any line that matches your regexp.
If not then here's the best sed I could come up with:
sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
... which deletes/skips lines with no digits and, for the remaining lines, removes all leading and trailing non-digit characters. (I'm only guessing that your intention is to extract the number from each line that contains one.)
The problem with something like:
sed -e 's/.*\([0-9]*\).*/&/'
.... or
sed -e 's/.*\([0-9]*\).*/\1/'
... is that sed only supports "greedy" matching, so the first .* will match the rest of the line. Unless we can use a negated character class to achieve a non-greedy match, or a version of sed with Perl-compatible or other extensions to its regexes, we cannot extract a precise pattern match from within the pattern space (a line).
You can use GNU awk (gawk) with the three-argument form of match() to access the captured group:
$ awk 'match($0, /abc([0-9]+)xyz/, matches) {print matches[1]}' file
12345
This tries to match the pattern abc[0-9]+xyz. If it succeeds, it stores the captured slices in the array matches, whose first item is the block matched by [0-9]+. match() returns the character position, or index, of where the substring begins (1 if it starts at the beginning of the string); a non-zero value is true, so a successful match triggers the print action.
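The three-argument match() is a GNU awk extension. On other awks, a portable variant (a sketch) uses the RSTART/RLENGTH variables that POSIX match() sets, trimming the fixed-length abc and xyz markers by hand:
awk 'match($0, /abc[0-9]+xyz/) { print substr($0, RSTART + 3, RLENGTH - 6) }' file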
With grep you can use a look-behind and look-ahead:
$ grep -oP '(?<=abc)[0-9]+(?=xyz)' file
12345
$ grep -oP 'abc\K[0-9]+(?=xyz)' file
12345
This matches the pattern [0-9]+ only when it occurs between abc and xyz, and prints just the digits.
perl is the cleanest syntax, but if you don't have perl (not always there, I understand), then you can use gawk's gensub feature to extract components of a regex match.
gawk '/abc[0-9]+xyz/ { print gensub(/.*abc([0-9]+)xyz.*/, "\\1", "g") }' < file
output of the sample input file will be
12345
Note: gensub replaces the entire regex match (everything between the //), so you need the .* before and after abc([0-9]+)xyz to get rid of the text before and after the number in the substitution.
If you want to select lines then strip out the bits you don't want:
egrep 'abc[0-9]+xyz' inputFile | sed -e 's/^.*abc//' -e 's/xyz.*$//'
It basically selects the lines you want with egrep and then uses sed to strip off the bits before and after the number.
You can see this in action here:
pax> echo 'a
b
c
abc12345xyz
a
b
c' | egrep 'abc[0-9]+xyz' | sed -e 's/^.*abc//' -e 's/xyz.*$//'
12345
pax>
Update: obviously, if your actual situation is more complex, the REs will need to be modified. For example, if you always had a single number buried within zero or more non-numerics at the start and end:
egrep '[^0-9]*[0-9]+[^0-9]*$' inputFile | sed -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
The OP's case doesn't specify that there can be multiple matches on a single line, but for the Google traffic, I'll add an example for that too.
Since the OP's need is to extract a group from a pattern, using grep -o will require 2 passes. But, I still find this the most intuitive way to get the job done.
$ cat > example.txt <<TXT
a
b
c
abc12345xyz
a
abc23451xyz asdf abc34512xyz
c
TXT
$ cat example.txt | grep -oE 'abc([0-9]+)xyz'
abc12345xyz
abc23451xyz
abc34512xyz
$ cat example.txt | grep -oE 'abc([0-9]+)xyz' | grep -oE '[0-9]+'
12345
23451
34512
Since processor time is basically free but human readability is priceless, I tend to refactor my code based on the question, "a year from now, what am I going to think this does?" In fact, for code that I intend to share publicly or with my team, I'll even open man grep to figure out what the long options are and substitute those. Like so: grep --only-matching --extended-regexp
Why even need a match group?
gawk/mawk/mawk2 'BEGIN{ FS="(^.*abc|xyz.*$)" } ($2 ~ /^[0-9]+$/) {print $2}'
Let FS collect away both ends of the line.
If $2, the leftover not swallowed by FS, doesn't contain non-numeric characters, that's your answer to print out.
If you're extra cautious, confirm length of $1 and $3 both being zero.
** Edited after realizing a zero-length $2 would trip up my previous solution.
There's a standard piece of code from the awk channel called "FindAllMatches", but it's still very manual: literally long loops of while(), match(), substr(), more substr(), then rinse and repeat.
If you're looking for ideas on how to obtain just the matched pieces, with a complex regex that may match multiple times per line or not at all, try this:
mawk/mawk2/gawk 'BEGIN { srand(); for(x = 0; x < 128; x++ ) {
alnumstr = sprintf("%s%c", alnumstr , x)
};
gsub(/[^[:alnum:]_=]+|[AEIOUaeiou]+/, "", alnumstr)
# resulting str should be 44-chars long :
# all digits, non-vowels, equal sign =, and underscore _
x = 10; do { nonceFS = nonceFS substr(alnumstr, 1 + int(44*rand()), 1)
} while ( --x ); # you can pick any level of precision you need.
# 10 chars randomly among the set is approx. 54-bits
#
# i prefer this set over all ASCII being these
# just about never require escaping
# feel free to skip the _ or = or r/t/b/v/f/0 if you're concerned.
#
# now you've made a random nonce that can be
# inserted right in the middle of just about ANYTHING
# -- ASCII, Unicode, binary data -- (1) which will always fully
# print out, (2) has extremely low chance of actually
# appearing inside any real word data, and (3) even lower chance
# it accidentally alters the meaning of the underlying data.
# (so intentionally leaving them in there and
# passing it along unix pipes remains quite harmless)
#
# this is essentially the lazy man's approach to making nonces
# that kinda-sorta have some resemblance to base64
# encoded, without having to write such a module (unless u have
# one for awk handy)
regex1 = (..); # build whatever regex you want here
FS = OFS = nonceFS;
} $0 ~ regex1 {
gsub(regex1, nonceFS "&" nonceFS); $0 = $0;
# now you've essentially replicated what gawk patsplit( ) does,
# or gawk's split(..., seps) tracking 2 arrays one for the data
# in between, and one for the seps.
#
# via this method, that can all be done upon the entire $0,
# without any of the hassle (and slow downs) of
# reading from associatively-hashed arrays,
#
# simply print out all your even-numbered columns:
# those will be the parts of "just the match"
for (i = 2; i <= NF; i += 2) print $i
}'
If you also run OFS = ""; $1 = $1;, then instead of needing the 4-argument split() or patsplit() (both gawk-specific) to see what the regex separators were, the entire $0 is laid out in a data1-sep1-data2-sep2-... pattern, while $0 still looks EXACTLY the same as when you first read the line: a straight-up print will be byte-for-byte identical to printing immediately upon reading.
Once I tested this to the extreme, using a regex that matches valid UTF-8 characters. It took maybe 30 seconds or so for mawk2 to process a 167MB text file with plenty of CJK Unicode all over, read into $0 all at once, and crank through this split logic, resulting in an NF of around 175,000,000, with each field being a single character of either ASCII or multi-byte UTF-8 Unicode.
You can do it with the shell alone:
while read -r line
do
    case "$line" in
        *abc*[0-9]*xyz* )
            t="${line##*abc}"         # strip everything up to and including "abc"
            echo "num is ${t%%xyz*}"  # strip "xyz" and everything after it
            ;;
    esac
done < "file"
For awk, I would use the following script:
/.*abc([0-9]+)xyz.*/ {
    print $0
    next
}
{
    # default: do nothing
}
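Saved as, say, extract.awk (the file name is just an illustration), the script runs as:
awk -f extract.awk input.txt
Note that it prints the entire matching line, not just the captured digits.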
gawk '/.*abc([0-9]+)xyz.*/' file