I need to extract from this file the lines that start with a number in the range 10-20. I have tried grep "[10-20]" tmp_file.txt, but on a file that has this format
1.
2.
3.
4.
5.
6.
7.
8.
9.
10.aa
12.bbb
13.cccc
14.ddddd
15.eeeeee
16.fffffff
17.gggggggg
18.hhhhhhhhh
19.iiiiiiiiii
20.jjjjjjjjjjj
21.
it returned everything and marked every number that contains either 1, 0, 10, 2, 0, 20 or 21 :/
With an extended regular expression (-E):
grep -E '^(1[0-9]|20)\.' file
Output:
10.aa
12.bbb
13.cccc
14.ddddd
15.eeeeee
16.fffffff
17.gggggggg
18.hhhhhhhhh
19.iiiiiiiiii
20.jjjjjjjjjjj
See: The Stack Overflow Regular Expressions FAQ
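To see why the original attempt matched everything: [10-20] is a bracket expression, so it matches one character out of 1, the range 0-2, and 0. The sketch below contrasts it with the anchored pattern. The sample lines and the function names are made up for illustration.

```bash
# Hypothetical sample lines in the same "number.text" format.
sample() { printf '%s\n' 9. 10.aa 12.bbb 20.jjj 21. 100.zzz; }

# Bracket expression: matches any line containing a 0, 1 or 2 anywhere.
loose()  { sample | grep '[10-20]'; }

# Anchored alternation: 10-19 or 20, followed by a literal dot.
strict() { sample | grep -E '^(1[0-9]|20)\.'; }

loose
echo ---
strict
```

Only the anchored version rejects 21. and 100.zzz.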
Another one, with awk:
awk '/^10\./,/^20\./' tmp_file.txt
awk '/^10\./,/^13\./' tmp_file.txt
10.aa
12.bbb
13.cccc
Try
grep -w -e '^1[[:digit:]]' -e '^20' tmp_file.txt
-w forces matches of whole words. That prevents matching lines like 100.... It's not POSIX, but it's supported by every grep that most people will encounter these days. Use grep -e '^1[[:digit:]]\.' -e '^20\.' ... if you are concerned about portability.
The -e option can be used multiple times to specify multiple patterns.
[[:digit:]] may be more reliable than [0-9]. See In grep command, can I change [:digit:] to [0-9]?.
Assuming the file might not be sorted and using numeric comparison
awk -F. '$1 >= 10 && $1 <= 20' < file.txt
grep is not the tool for this, because grep finds text patterns, but does not understand numeric values. Making patterns that match the 11 values from 10-20 is like stirring a can of paint with a screwdriver. You can do it, but it's not the right tool for the job.
A much clearer way to do this is with Perl:
$ perl -n -e'print if /^(\d+)/ && $1 >= 10 && $1 <= 20' foo.txt
This says to print a line of the file if the beginning of the line ^ matches one or more digits \d+ and if the numeric value of what was matched $1 is between the values of 10 and 20.
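The same numeric test can be sketched in plain bash, in case Perl is not available. The function name lines_in_range and the sample input are assumptions for illustration. The leading digits are captured with =~ and compared as integers.

```bash
# lines_in_range LO HI: print stdin lines whose leading integer is in [LO, HI].
lines_in_range() {
    local lo=$1 hi=$2 line num
    while IFS= read -r line; do
        # Skip lines that do not start with digits.
        [[ $line =~ ^([0-9]+) ]] || continue
        # Force base 10 so a leading zero (e.g. 015) is not read as octal.
        num=$(( 10#${BASH_REMATCH[1]} ))
        if (( num >= lo && num <= hi )); then
            printf '%s\n' "$line"
        fi
    done
}

printf '%s\n' 9. 10.aa 015.x 20.jjj 21. 100.zzz | lines_in_range 10 20
```

Unlike a text pattern, this does not need rewriting when the range changes.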
Given
variable=1234567890
or maybe
variable2=12345678901234567890
I need to modify the value so there is always a dot the 9th character from the end. The value could be any number of digits long but the dot must always be right before the 8 last characters.
I am trying to parse a response from https://blockchain.info/rawaddr/ for Bitcoin address balance that always gives a value like "final_balance":12345678901, instead of 123.45678901 (Bitcoin is divisible to 8 decimal places).
I'm guessing this might be possible with sed, but I don't really know. Any help would be appreciated.
For this you can use this sed:
sed -E "s/[0-9]{8}$/.&/"
It catches the last 8 digits and prints them back with a leading dot.
Samples
$ v="12345678901234567890"
$ sed -E "s/[0-9]{8}$/.&/" <<< "$v"
123456789012.34567890
$ v="1234567890"
$ sed -E "s/[0-9]{8}$/.&/" <<< "$v"
12.34567890
How about just using parameter substring expansion?
Bash 4.2 and later (negative length)
$ echo "${variable2:0:-8}.${variable2: -8}"
123456789012.34567890
Bash 3
$ echo "${variable2:0:(${#variable2}-8)}.${variable2: -8}"
123456789012.34567890
$ echo "${variable:0:(${#variable}-8)}.${variable: -8}"
12.34567890
Parameter expansion saves you spawning another process, so it's very fast.
(Side note: The space between the colon and the first index is necessary to distinguish the substring expression with a negative start-index from another type of parameter expansion:)
${parameter:-word}
Use Default Values. If parameter is unset or null, the expansion of word is substituted. Otherwise, the value of parameter is substituted.
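A sketch of the substring approach wrapped in a function. The name insert_dot is hypothetical, and the zero-padding of short values is an assumption about what is wanted when the input has fewer than nine digits (the question does not say).

```bash
# insert_dot DIGITS: place a dot before the last 8 digits of a digit string.
insert_dot() {
    local digits=$1
    # Assumption: left-pad with zeros so short values keep one leading digit.
    while (( ${#digits} < 9 )); do
        digits=0$digits
    done
    # Everything except the last 8 characters, a dot, then the last 8.
    printf '%s.%s\n' "${digits:0:${#digits}-8}" "${digits: -8}"
}

insert_dot 12345678901
insert_dot 12345
```

The value is handled as a string throughout, so very long inputs are not limited by integer size.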
Here is the sed command without -E
echo 12345678901234567890 |sed 's/[0-9]\{8\}$/.&/'
123456789012.34567890
I'm trying to extract the last number before a file extension in a bash script. So the format varies but it'll be some combination of numbers and letters, and the last character will always be a digit. I need to pull those digits and store them in a variable.
The format is generally:
sdflkej10_sdlkei450_sdlekr_1.txt
I want to store just the final digit 1 into a variable.
I'll be using this to loop through a large number of files, and the last number will get into double and triple digits.
So for this file:
kej10_sdlkei450_sdlekr_310.txt
I'd need to return 310.
The number of alphanumeric characters and underscores varies with each file, but the number I want always is immediately before the .txt extension and immediately after an underscore.
I tried:
bname=${f%%.*}
number=$(echo $bname | tr -cd '[[:digit:]]')
but this returns all digits.
If I try
number = $(echo $(bname -2) it changes the number it returns.
The problem I'm having is mostly related to the variability, and the fact that I've been asked to do it in bash. Any help would really be appreciated.
regex='([0-9]+)\.[^.]*$'
[[ $file =~ $regex ]] && number=${BASH_REMATCH[1]}
This uses bash's underappreciated =~ regex operator which stores matches in an array named BASH_REMATCH.
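As a usage sketch, the match can be wrapped in a function and called in a loop. The function name last_number is an assumption, and this variant also requires the underscore before the digits, as described in the question.

```bash
# last_number FILE: print the digits between the last "_" and the extension.
last_number() {
    local regex='_([0-9]+)\.[^.]*$'
    # Return non-zero when the name does not fit the expected shape.
    [[ $1 =~ $regex ]] || return 1
    printf '%s\n' "${BASH_REMATCH[1]}"
}

for f in sdflkej10_sdlkei450_sdlekr_1.txt kej10_sdlkei450_sdlekr_310.txt; do
    if number=$(last_number "$f"); then
        echo "$f -> $number"
    fi
done
```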
You could do this using parameter substitution
var=kej10_sdlkei450_sdlekr_310.txt
var=${var%.*}
var=${var##*_}
echo $var
310
Use a Series of Bash Shell Expansions
While not the most elegant solution, this one uses a sequence of shell parameter expansions to achieve the desired result without having to define a specific extension. For example, this function uses the length and offset expansions to find the last digit after removing filename extensions. Note that it returns only the final character, so a multi-digit suffix is truncated to its last digit:
extract_digit() {
local basename=${1%%.*}
echo "${basename:$(( ${#basename} - 1 ))}"
}
Capturing Function Output
You can capture the output in a variable with something like:
$ foo=$(extract_digit sdflkej10_sdlkei450_sdlekr_1.txt)
$ echo $foo
1
Sample Output from Function
$ extract_digit sdflkej10_sdlkei450_sdlekr_1.txt
1
$ extract_digit sdflkej10_sdlkei450_sdlekr_9.txt
9
$ extract_digit sdflkej10_sdlkei450_sdlekr_10.txt
0
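If the full trailing number is wanted, a variant using only parameter expansions could look like this. The name extract_number is hypothetical, and unlike the function above it strips only the last extension.

```bash
# extract_number FILE: print the run of digits at the end of the base name.
extract_number() {
    local base=${1%.*}                 # drop the last extension
    # Remove the longest prefix that ends in a non-digit, keeping the digits.
    printf '%s\n' "${base##*[!0-9]}"
}

extract_number sdflkej10_sdlkei450_sdlekr_10.txt
extract_number kej10_sdlkei450_sdlekr_310.txt
```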
This should take care of your situation:
INPUT="some6random7numbers_12345_moreletters_789.txt"
SUBSTRING=`expr match "$INPUT" '.*_\([[:digit:]]*\)'`
echo $SUBSTRING
This will output 789
No need for a regex here; you can use IFS
var="kej10_sdlkei450_sdlekr_310.txt"
v=$(IFS=[_.] read -ra arr <<< "$var" && echo "${arr[@]:(-2):1}")
echo "$v"
310
I want to invert all the color values in a bunch of files. The colors are all in the hex format #ff3300 so the inversion could be done characterwise with the sed command
y/0123456789abcdef/fedcba9876543210/
How can I loop through all the color matches and do the char translation in sed or awk?
EDIT:
sample input:
random text... #ffffff_random_text_#000000__
asdf#00ff00
asdfghj
desired output:
random text... #000000_random_text_#ffffff__
asdf#ff00ff
asdfghj
EDIT: I changed my response as per your edit.
OK, this may be difficult to do with sed. awk could do the trick more or less easily, but I find perl much easier for this task:
$ perl -pe 's/#[0-9a-f]+/$&=~tr%0123456789abcdef%fedcba9876543210%r/ge' <infile >outfile
Basically you find the pattern, then execute the right-hand side, which executes the tr on the match, and substitutes the value there.
The inversion is really a subtraction. To invert a hex, you just subtract it from ffffff.
With this in mind, you can build a simple script to process each line, extract hexes, invert them, and inject them back to the line.
This is using Bash (see arrays, printf -v, += etc) only (no external tools there):
#!/usr/bin/env bash
[[ -f $1 ]] || { printf "error: cannot find file: %s\n" "$1" >&2; exit 1; }
while read -r; do
    # split line with '#' as separator
    IFS='#' toks=( $REPLY )
    for tok in "${toks[@]}"; do
        # extract hex
        read -n6 hex <<< "$tok"
        # is it really a hex ?
        if [[ $hex =~ [0-9a-fA-F]{6} ]]; then
            # compute inversion
            inv="$((16#ffffff - 16#$hex))"
            # zero pad the result
            printf -v inv "%06x" "$inv"
            # replace hex with inv
            tok="${tok/$hex/$inv}"
        fi
        # build the modified line
        line+="#$tok"
    done
    # print the modified line and clean it for reuse
    printf "%s\n" "${line#\#}"
    unset line
done < "$1"
use it like:
$ ./invhex infile > outfile
test case input:
random text... #ffffff_random_text_#000000__
asdf#00ff00
bdf#cvb_foo
asdfghj
#bdfg
processed output:
random text... #000000_random_text_#ffffff__
asdf#ff00ff
bdf#cvb_foo
asdfghj
#bdfg
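The subtraction idea can be checked on a single value with a small helper. The name invert_hex is an assumption for illustration. It accepts a colour with or without the leading # and rejects anything that is not exactly six hex digits.

```bash
# invert_hex COLOUR: print the inverted colour, e.g. #ff3300 -> #00ccff.
invert_hex() {
    local hex=${1#\#}                       # strip an optional leading '#'
    [[ $hex =~ ^[0-9a-fA-F]{6}$ ]] || return 1
    # ffffff minus the value, zero-padded back to six lowercase hex digits.
    printf '#%06x\n' "$(( 16#ffffff - 16#$hex ))"
}

invert_hex '#ff3300'
invert_hex '#00ff00'
```

The result is the same as the character-wise y/0123456789abcdef/fedcba9876543210/ translation, because subtracting from ffffff never borrows between digits.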
This might work for you (GNU sed):
sed '/#[a-f0-9]\{6\}\>/!b
s//\n&/g
h
s/[^\n]*\(\n.\{7\}\)[^\n]*/\1/g
y/0123456789abcdef/fedcba9876543210/
H
g
:a;s/\n.\{7\}\(.*\n\)\n\(.\{7\}\)/\2\1/;ta
s/\n//' file
Explanation:
/#[a-f0-9]\{6\}\>/!b bail out on lines not containing the required pattern
s//\n&/g prepend every pattern with a newline
h copy this to the hold space
s/[^\n]*\(\n.\{7\}\)[^\n]*/\1/g delete everything but the required pattern(s)
y/0123456789abcdef/fedcba9876543210/ transform the pattern(s)
H append the new pattern(s) to the hold space
g overwrite the pattern space with the contents of the hold space
:a;s/\n.\{7\}\(.*\n\)\n\(.\{7\}\)/\2\1/;ta replace the old pattern(s) with the new.
s/\n// remove the newline artifact from the H command.
This works...
cat test.txt |sed -e 's/\#\([0123456789abcdef]\{6\}\)/\n\#\1\n/g' |sed -e ' /^#.*/ y/0123456789abcdef/fedcba9876543210/' | awk '{lastType=type;type= substr($0,1,1)=="#";} type==lastType && length(line)>0 {print line;line=$0} type!=lastType {line=line$0} length(line)==0 {line=$0} END {print line}'
The first sed command inserts line breaks around the hex codes, then it is possible to make the substitution on all lines starting with a hash. There is probably a more elegant solution to merge the lines back again, but the awk command does the job. The only assumption there is that there won't be two hex codes following directly after each other. If so, this step has to be revised.
I see lots of examples and man pages on how to do things like search-and-replace using sed, awk, or gawk.
But in my case, I have a regular expression that I want to run against a text file to extract a specific value. I don't want to do search-and-replace. This is being called from bash. Let's use an example:
Example regular expression:
.*abc([0-9]+)xyz.*
Example input file:
a
b
c
abc12345xyz
a
b
c
As simple as this sounds, I cannot figure out how to call sed/awk/gawk correctly. What I was hoping to do, is from within my bash script have:
myvalue=$( sed <...something...> input.txt )
Things I've tried include:
sed -e 's/.*([0-9]).*/\\1/g' example.txt # extracts the entire input file
sed -n 's/.*([0-9]).*/\\1/g' example.txt # extracts nothing
My sed (Mac OS X) didn't work with +. I tried * instead and added the p flag to print the match:
sed -n 's/^.*abc\([0-9]*\)xyz.*$/\1/p' example.txt
For matching at least one numeric character without +, I would use:
sed -n 's/^.*abc\([0-9][0-9]*\)xyz.*$/\1/p' example.txt
You can use sed to do this
sed -rn 's/.*abc([0-9]+)xyz.*/\1/gp'
-n don't print the resulting line
-r this makes it so you don't have to escape the capture group parens ().
\1 the capture group match
/g global match
/p print the result
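To connect this to the myvalue=$( ... ) form from the question, here is a minimal sketch. It uses -E rather than -r, since -E is accepted by both GNU and BSD sed. The function name extract_value and the inline sample input are assumptions.

```bash
# extract_value [FILE...]: print the digits found between "abc" and "xyz".
extract_value() {
    sed -nE 's/.*abc([0-9]+)xyz.*/\1/p' "$@"
}

# Feed the sample input on stdin and capture the result in a variable.
myvalue=$(printf '%s\n' a b c abc12345xyz a b c | extract_value)
echo "$myvalue"
```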
I wrote a tool for myself that makes this easier
rip 'abc(\d+)xyz' '$1'
I use perl to make this easier for myself. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/'
This runs Perl, the -n option instructs Perl to read in one line at a time from STDIN and execute the code. The -e option specifies the instruction to run.
The instruction runs a regexp on the line read, and if it matches prints out the contents of the first set of brackets ($1).
You can do this with multiple file names on the end also. e.g.
perl -ne 'print $1 if /.*abc([0-9]+)xyz.*/' example1.txt example2.txt
If your version of grep supports it you could use the -o option to print only the portion of any line that matches your regexp.
If not then here's the best sed I could come up with:
sed -e '/[0-9]/!d' -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
... which deletes/skips lines with no digits and, for the remaining lines, removes all leading and trailing non-digit characters. (I'm only guessing that your intention is to extract the number from each line that contains one).
The problem with something like:
sed -e 's/.*\([0-9]*\).*/&/'
.... or
sed -e 's/.*\([0-9]*\).*/\1/'
... is that sed only supports "greedy" match ... so the first .* will match the rest of the line. Unless we can use a negated character class to achieve a non-greedy match ... or a version of sed with Perl-compatible or other extensions to its regexes, we can't extract a precise pattern match from within the pattern space (a line).
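The greedy behaviour is easy to observe on the sample line. In this sketch (variable names are illustrative) the first command leaves the capture group empty, while the negated character class stops before the digits.

```bash
# Greedy .* consumes the whole line, so the group [0-9]* matches nothing.
greedy=$(echo 'abc12345xyz' | sed -e 's/.*\([0-9]*\).*/\1/')

# [^0-9]* cannot cross a digit, so the group captures the number.
negated=$(echo 'abc12345xyz' | sed -e 's/[^0-9]*\([0-9]*\).*/\1/')

printf 'greedy=[%s] negated=[%s]\n' "$greedy" "$negated"
```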
You can use awk with match() to access the captured group (the three-argument form of match() is a GNU awk extension):
$ awk 'match($0, /abc([0-9]+)xyz/, matches) {print matches[1]}' file
12345
This tries to match the pattern abc[0-9]+xyz. If it does so, it stores its slices in the array matches, whose first item is the block [0-9]+. Since match() returns the character position, or index, of where that substring begins (1, if it starts at the beginning of string), it triggers the print action.
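Where only the two-argument match() is available, a similar result can be sketched with RSTART and RLENGTH, which are set by match() in any POSIX awk. The function name is hypothetical, and the offsets 3 and 6 assume the fixed three-character abc and xyz delimiters.

```bash
# extract_posix_awk [FILE...]: print the digits between "abc" and "xyz".
extract_posix_awk() {
    awk 'match($0, /abc[0-9]+xyz/) {
        # RSTART/RLENGTH cover the whole match; skip "abc" and drop "xyz".
        print substr($0, RSTART + 3, RLENGTH - 6)
    }' "$@"
}

printf '%s\n' a b c abc12345xyz a b c | extract_posix_awk
```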
With grep you can use a look-behind and look-ahead:
$ grep -oP '(?<=abc)[0-9]+(?=xyz)' file
12345
$ grep -oP 'abc\K[0-9]+(?=xyz)' file
12345
This matches [0-9]+ only when it occurs between abc and xyz, and prints just the digits.
perl is the cleanest syntax, but if you don't have perl (not always there, I understand), then the only way to use gawk and components of a regex is to use the gensub feature.
gawk '/abc[0-9]+xyz/ { print gensub(/.*abc([0-9]+)xyz.*/,"\\1","g"); }' < file
output of the sample input file will be
12345
Note: gensub replaces the entire regex (between the //), so you need to put the .* before and after abc([0-9]+)xyz to get rid of text before and after the number in the substitution. Keep abc and xyz in the pattern: with a bare .*([0-9]+).* the greedy leading .* would swallow everything except the last digit.
If you want to select lines then strip out the bits you don't want:
egrep 'abc[0-9]+xyz' inputFile | sed -e 's/^.*abc//' -e 's/xyz.*$//'
It basically selects the lines you want with egrep and then uses sed to strip off the bits before and after the number.
You can see this in action here:
pax> echo 'a
b
c
abc12345xyz
a
b
c' | egrep 'abc[0-9]+xyz' | sed -e 's/^.*abc//' -e 's/xyz.*$//'
12345
pax>
Update: obviously if your actual situation is more complex, the REs will need to be modified. For example if you always had a single number buried within zero or more non-numerics at the start and end:
egrep '[^0-9]*[0-9]+[^0-9]*$' inputFile | sed -e 's/^[^0-9]*//' -e 's/[^0-9]*$//'
The OP's case doesn't specify that there can be multiple matches on a single line, but for the Google traffic, I'll add an example for that too.
Since the OP's need is to extract a group from a pattern, using grep -o will require 2 passes. But, I still find this the most intuitive way to get the job done.
$ cat > example.txt <<TXT
a
b
c
abc12345xyz
a
abc23451xyz asdf abc34512xyz
c
TXT
$ cat example.txt | grep -oE 'abc([0-9]+)xyz'
abc12345xyz
abc23451xyz
abc34512xyz
$ cat example.txt | grep -oE 'abc([0-9]+)xyz' | grep -oE '[0-9]+'
12345
23451
34512
Since processor time is basically free but human readability is priceless, I tend to refactor my code based on the question, "a year from now, what am I going to think this does?" In fact, for code that I intend to share publicly or with my team, I'll even open man grep to figure out what the long options are and substitute those. Like so: grep --only-matching --extended-regexp
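For comparison, the multiple-matches-per-line case can also be sketched in bash alone, without a second grep pass. The function name extract_all is an assumption. After each match, the loop cuts the line off just past the matched text and tries again.

```bash
# extract_all: print every number found between "abc" and "xyz" on stdin.
extract_all() {
    local line re='abc([0-9]+)xyz'
    while IFS= read -r line; do
        while [[ $line =~ $re ]]; do
            printf '%s\n' "${BASH_REMATCH[1]}"
            # Drop everything up to and including this match, then retry.
            line=${line#*"${BASH_REMATCH[0]}"}
        done
    done
}

printf '%s\n' a abc12345xyz 'abc23451xyz asdf abc34512xyz' c | extract_all
```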
Why even need a match group?
gawk/mawk/mawk2 'BEGIN{ FS="(^.*abc|xyz.*$)" } ($2 ~ /^[0-9]+$/) {print $2}'
Let FS collect away both ends of the line.
If $2, the leftover not swallowed by FS, doesn't contain non-numeric characters, that's your answer to print out.
If you're extra cautious, confirm length of $1 and $3 both being zero.
** edited answer after realizing zero length $2 will trip up my previous solution
there's a standard piece of code from awk channel called "FindAllMatches" but it's still very manual, literally, just long loops of while(), match(), substr(), more substr(), then rinse and repeat.
If you're looking for ideas on how to obtain just the matched pieces, but upon a complex regex that matches multiple times each line, or none at all, try this :
mawk/mawk2/gawk 'BEGIN { srand(); for(x = 0; x < 128; x++ ) {
alnumstr = sprintf("%s%c", alnumstr , x)
};
gsub(/[^[:alnum:]_=]+|[AEIOUaeiou]+/, "", alnumstr)
# resulting str should be 44-chars long :
# all digits, non-vowels, equal sign =, and underscore _
x = 10; do { nonceFS = nonceFS substr(alnumstr, 1 + int(44*rand()), 1)
} while ( --x ); # you can pick any level of precision you need.
# 10 chars randomly among the set is approx. 54-bits
#
# i prefer this set over all ASCII being these
# just about never require escaping
# feel free to skip the _ or = or r/t/b/v/f/0 if you're concerned.
#
# now you've made a random nonce that can be
# inserted right in the middle of just about ANYTHING
# -- ASCII, Unicode, binary data -- (1) which will always fully
# print out, (2) has extremely low chance of actually
# appearing inside any real word data, and (3) even lower chance
# it accidentally alters the meaning of the underlying data.
# (so intentionally leaving them in there and
# passing it along unix pipes remains quite harmless)
#
# this is essentially the lazy man's approach to making nonces
# that kinda-sorta have some resemblance to base64
# encoded, without having to write such a module (unless u have
# one for awk handy)
regex1 = (..); # build whatever regex you want here
FS = OFS = nonceFS;
} $0 ~ regex1 {
gsub(regex1, nonceFS "&" nonceFS); $0 = $0;
# now you've essentially replicated what gawk patsplit( ) does,
# or gawk's split(..., seps) tracking 2 arrays one for the data
# in between, and one for the seps.
#
# via this method, that can all be done upon the entire $0,
# without any of the hassle (and slow downs) of
# reading from associatively-hashed arrays,
#
# simply print out all your even numbered columns
# those will be the parts of "just the match"
if you also run another OFS = ""; $1 = $1; , now instead of needing 4-argument split() or patsplit(), both of which being gawk specific to see what the regex seps were, now the entire $0's fields are in data1-sep1-data2-sep2-.... pattern, ..... all while $0 will look EXACTLY the same as when you first read in the line. a straight up print will be byte-for-byte identical to immediately printing upon reading.
Once i tested it to the extreme using a regex that represents valid UTF8 characters on this. Took maybe 30 seconds or so for mawk2 to process a 167MB text file with plenty of CJK unicode all over, all read in at once into $0, and crank this split logic, resulting in NF of around 175,000,000, and each field being 1-single character of either ASCII or multi-byte UTF8 Unicode.
you can do it with the shell
while read -r line
do
    case "$line" in
        *abc*[0-9]*xyz* )
            # strip everything up to "abc", then everything from "xyz" on
            t="${line##*abc}"
            echo "num is ${t%%xyz*}";;
    esac
done <"file"
For awk, I would use the following script:
/.*abc([0-9]+)xyz.*/ {
    print $0;
    next;
}
{
    # default, do nothing
}
gawk '/.*abc([0-9]+)xyz.*/' file