Bash: Replace array value with curl result - regex

I have a text file named raw.txt with something like the following:
T DOTTY CRONO 52/50 53/40 54/30 55/20 RESNO NETKI
U CYMON DENDU 51/50 52/40 53/30 54/20 DOGAL BEXET
V YQX KOBEV 50/50 51/40 52/30 53/20 MALOT GISTI
W VIXUN LOGSU 49/50 50/40 51/30 52/20 LIMRI XETBO
X YYT NOVEP 48/50 49/40 50/30 51/20 DINIM ELSOX
Y DOVEY 42/60 44/50 47/40 49/30 50/20 SOMAX ATSUR
Z SOORY 43/50 46/40 48/30 49/20 BEDRA NERTU
A DINIM 51/20 52/30 50/40 47/50 RONPO COLOR
B SOMAX 50/20 51/30 49/40 46/50 URTAK BANCS
C BEDRA 49/20 50/30 48/40 45/50 VODOR RAFIN
D ETIKI 48/15 48/20 49/30 47/40 44/50 BOBTU JAROM
E 46/40 43/50 42/60 DOVEY
F 45/40 42/50 41/60 JOBOC
G 43/40 41/50 40/60 SLATN
I'm reading it into an array:
while read line; do
set $line
IFS=' ' read -a array <<< "$line"
done < raw.txt
I'm trying to replace all occurrences of [A-Z]{5} with a curl result where the match of [A-Z]{5} is fed as a variable into the curl call.
First match to be replaced would be DOTTY. The call looks similar to curl -s http://example.com/api_call/DOTTY and the result is something like -55.5833 50.6333 which should replace DOTTY in the array.
I was so far unable to correctly match the desired string and feed the match into curl.
Your help is greatly appreciated.
All the best,
Chris
EDIT:
Solution
Working solution based on @Kevin's extensive answer and @Floris's hint about a possible carriage return in the curl result. This was indeed the case. Thank you! Combined with some tinkering on my side, I now got it to work.
#!/bin/bash
while read line; do
set $line
IFS=' ' read -a array <<< "$line"
i=0
for str in ${array[@]}; do
if [[ "$str" =~ [A-Z]{5} ]]; then
curl_tmp=$(curl -s http://example.com/api_call/$str)
# cut off line break
curl=${curl_tmp/$'\r'}
# insert at given index
declare array[$i]="$curl"
fi
let i++
done
# write to file
for index in "${array[@]}"; do
echo $index
done >> $WORK_DIR/nats.txt
done < raw.txt

I didn't change anything about your script except add the matching part, since it seems that's what you're needing help on:
#!/bin/bash
while read line; do
set $line
IFS=' ' read -a array <<< "$line"
for str in ${array[@]}; do
if [[ "$str" =~ [A-Z]{5} ]]; then
echo curl "http://example.com/api_call/$str"
fi
done
done < raw.txt
EDIT: added in the url example you provided with the variable in the URI. You can do whatever you need with the fetched output by changing it to do_something "$(curl ...)"
EDIT2: Since you're wanting to maintain the bash array you create from each line, how about this:
I'm not great at bash when it comes to arrays, so I expect someone to call me out on it, but this should work.
I've left some echos there so you can see what it's doing. The shift commands are to push the array index from the current location when the regex matches. The tmp variable to hold your curl output could probably be improved, but this should get you started, I hope.
removed temporarily to avoid confusion
EDIT3: Oops the above didn't actually work. My mistake. Let me try again here.
EDIT4:
#!/bin/bash
while read line; do
set $line
IFS=' ' read -a array <<< "$line"
i=0
# echo ${array[@]} below is just so you can see it before processing. You can remove this
echo "Array before processing: ${array[@]}"
for str in ${array[@]}; do
if [[ "$str" =~ [A-Z]{5} ]]; then
# replace the echo command below with your curl command
# ie - curl="$(curl http://example.com/api_call/$str)"
curl="$(echo 1234 -1234)"
if [[ "$flag" = "1" ]]; then
array=( ${adjustedArray[@]} )
push=$(( $push + 2 ));
let i++
else
push=1
fi
adjustedArray=( ${array[@]:0:$i} ${curl[@]} ${array[@]:$(( $i + $push)):${#array[@]}} )
#echo "DEBUG adjustedArray in loop: ${adjustedArray[@]}"
flag=1;
fi
let i++
done
unset flag
echo "final: ${adjustedArray[@]}"
# do further processing here
done < raw.txt
I know there's a smarter way to do this than the above, but we're getting into areas in bash where I'm not really suited to give advice. The above should work, but I'm hoping someone can do better.
Hope it helps, anyway
PS: You should probably not use a shell script for this unless you really need to. Perl, PHP, or Python would make the code simpler and more readable.
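For what it's worth, here is a shorter sketch of the same idea that overwrites each matching element in place instead of rebuilding the array. It is only an illustration: lookup is a hypothetical stand-in for the real curl call, and replace_waypoints is a name made up for this example. The regex is anchored with ^ and $ so that only exact five-letter tokens are replaced.

```bash
#!/usr/bin/env bash

# Hypothetical stand-in for: curl -s "http://example.com/api_call/$1"
# It returns a fixed reply ending in CR LF, like the real service reportedly does.
lookup() { printf '%s\r\n' "-55.5833 50.6333"; }

# Replace every five-letter uppercase token of a line with its lookup result.
replace_waypoints() {
    local -a fields
    local i result
    read -r -a fields <<< "$1"
    for i in "${!fields[@]}"; do
        # ^ and $ anchor the match, so only exact five-letter tokens qualify
        if [[ ${fields[i]} =~ ^[A-Z]{5}$ ]]; then
            result=$(lookup "${fields[i]}")
            # drop carriage returns from the reply before storing it
            fields[i]=${result//$'\r'/}
        fi
    done
    # join the fields back into one space-separated line
    printf '%s\n' "${fields[*]}"
}

replace_waypoints 'V YQX KOBEV 50/50'   # prints: V YQX -55.5833 50.6333 50/50
```

Note that a replaced element now holds two words ("-55.5833 50.6333") in a single array slot. If later processing needs them as separate elements, re-split the printed line.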

Since I misread the first time:
How about just using sed?
sed "s/\([A-Z]\{5\}\)/$(echo curl http:\\/\\/example.com\\/api_call\\/\\1)/g" /tmp/raw.txt
Try that, then try removing the echo. I'm not 100% on this since I can't run it on the real domain
EDIT: And just so I'm clear, the echo is just there so you can see what it will do with the echo removed
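One caveat: the shell expands $(...) once, before sed starts, so the command inside never sees the text that \1 matched. With the echo in place the output looks right only because echo passes the literal \1 through to sed. A small sketch that makes the timing visible (len_of and demo are hypothetical helpers, not part of the command above):

```bash
#!/usr/bin/env bash

# Hypothetical helper: print the length of the first argument.
len_of() { printf '%s' "${#1}"; }

demo() {
    # The shell runs len_of '\1' first and pastes its output (2, the length
    # of the two-character text \1) into the sed script. sed then replaces
    # every five-letter match with that fixed text.
    printf 'DOTTY CRONO\n' | sed "s/\([A-Z]\{5\}\)/$(len_of '\1')/g"
}

demo   # prints "2 2", not "5 5"
```

So with the echo removed, curl would be called a single time with \1 in the URL. A per-match call needs a loop in the shell, or a language whose substitution can run code per match.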

create a file cmatch:
#!/bin/bash
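while read -r line
do
    a=`echo "$line" | egrep -o '\b[A-Z]{5}\b'`
    for v in $a
    do
        # progress messages go to stderr so they stay out of the output file
        echo "doing curl to replace $v in $line" >&2
        r=`curl -s "http://example.com/api_call/$v"`
        r1=`echo $r | xargs echo`
        # double quotes keep a multi-word result inside a single sed expression
        line=`echo "$line" | sed "s/$v/$r1/"`
    done
    # print the line once all of its matches have been replaced
    echo "$line"
done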
then call it with
chmod 755 cmatch
./cmatch < inputfile.txt > outputfile.txt
It will do what you asked
Notes:
the \b before and after the [A-Z]{5} ensures that ABCDEFG (which is not a five letter word) will not match.
using egrep -o produces an array of matches
I loop over this array to allow the replacement of multiple matches in a line
I update the line for each match found using the result of the curl call
to keep code clean, I assign the result of the curl to an intermediate variable
edit Just saw the comments about arrays. I suggest to take the output of this script and convert it to an array if you want to do further manipulation...
more edits If your curl command returns a multi-line string (which would explain the error you see), you can use the new line I introduced in the script to remove the newlines (essentially stringing all the arguments together):
echo $r | xargs echo
calls echo with one line at a time as argument, and without the carriage returns. It's a fun way of getting rid of carriage returns.
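If you would rather not rely on xargs, the same cleanup can be written with parameter expansion alone. This is just a sketch; clean_reply is a made-up name, and it assumes the reply fits comfortably in a shell variable:

```bash
#!/usr/bin/env bash

# Flatten a reply to one line: delete carriage returns, join lines with spaces.
clean_reply() {
    local s=$1
    s=${s//$'\r'/}          # delete every carriage return
    s=${s//$'\n'/ }         # turn each newline into a space
    printf '%s\n' "${s% }"  # drop the one space left by a trailing newline
}

clean_reply $'-55.5833 50.6333\r\n'   # prints: -55.5833 50.6333
```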

#!/bin/bash
while read line;do
set -- $line
echo "second parm is $2"
echo "do your curl here"
done < afile.txt

Using regex in Bash with mapfile

Edit 2:
Minimal input file: input/input.txt
#-----------
snapshot=83
#-----------
time=30142088
mem_heap_B=20224
mem_heap_extra_B=8
mem_stacks_B=240480
heap_tree=empty
#-----------
snapshot=84
#-----------
time=30408368
mem_heap_B=20224
mem_heap_extra_B=8
mem_stacks_B=240552
heap_tree=empty
#-----------
snapshot=85
#-----------
time=30674648
mem_heap_B=20224
mem_heap_extra_B=8
mem_stacks_B=240464
heap_tree=empty
#-----------
snapshot=86
#-----------
Actual output:
input.txt/*
time, heap, stack
input/input.txt
time, heap, stack
30674648, 20224, 240464
input/input.txt
time, heap, stack
input/input.txt
time, heap, stack
input/input.txt
time, heap, stack
30674648, 20224, 240464
Expected output:
input.txt
time, heap, stack,
30142088, 20224, 240480
30408368, 20224, 240552
30674648, 20224, 240464
Edit: Originally, the problem may have been due to the lack of multiline capability in Bash's regex matching. However, after stripping newlines from the text, the problem remains, except that the output now has between one and five lines instead of zero.
I'm trying to write a Bash script to parse a text file into a desirable CSV file with the needed information.
As part of the script, I iterate through n files. Each of the files contains m matches for a given regex, and each match contains three capture groups.
I want to format the three capture groups into a CSV row, then concatenate all the rows of all the matches of all the files and write them to a *.csv file.
I'm quite comfortable using regex in high-level languages such as Kotlin or C#, but I have no experience with regex in Bash. I used this answer as a starting point, but it doesn't seem to be working for me (mapfile -t matches < <( format_row "$text" "$regex" ) doesn't do anything).
Here's the full code with the relevant portion noted:
#!/bin/bash
# RELEVANT CODE BELOW
regex="time=([0-9]+)\nmem_heap_B=([0-9]+)\n.*\nmem_stacks_B=([0-9]+)"
format_row() {
local s=$1 regex=$2
while [[ $s =~ $regex ]]
do
time="${BASH_REMATCH[1]}"
heap="${BASH_REMATCH[2]}"
stack="${BASH_REMATCH[3]}"
echo "${time}, ${heap}, ${stack}"
echo ""
s=${s#*"${BASH_REMATCH[3]}"}
done
}
for file in $1/*
do
echo "Parsing ${file}..."
echo $file >> $2
echo "time, heap, stack" >> $2
text=$(<${file})
mapfile -t matches < <( format_row "$text" "$regex" )
printf "%s\n" "${matches[@]}" >> $2
echo "" >> $2
done
echo ""
echo "Done"
Thanks!
There are two problems here:
Although bash's =~ operator can match newlines, it does not understand the \n escape sequence. You have to use actual newlines in your regex. This can also be achieved by C-style strings $'\n'.
The regex quantifier * is greedy. When matching ...
[[ "a=1,b=1 a=2,b=2 a=3,b=3" =~ a=(.).*b=(.) ]]
... you end up with BASH_REMATCH[1] and BASH_REMATCH[2] being 1 and 3 instead of 1 and 1.
In other regex dialects like PCRE you could use the non-greedy quantifier *?. However, in bash we have to use a workaround and have to replace .* with something that cannot match more than wanted, for instance
[[ "a=1,b=1 a=2,b=2 a=3,b=3" =~ a=(.)[^=]*b=(.) ]]
In your case we have to make sure that the next mem_stacks is not matched
As you didn't post any example input and expected output, I can only guess. However, I assume the following regex could work for you:
regex=$'time=([0-9]+)
mem_heap_B=([0-9]+)
([^\n]*\n){TODO set number of lines allowed here}
mem_stacks_B=([0-9]+)'
Note that now you have to use BASH_REMATCH[4] instead of [3].
At the marked location you have to insert the number of lines allowed between mem_heap and mem_stacks. The number can be constant (e.g. {5}) or a range (e.g. {1,10}). With a range, you have to make sure that the upper bound is not so high that you accidentally skip the next mem_stacks and match a later one instead. For ranges it may therefore be more appropriate to use two separate matches, something like
regex1='time=([0-9]+)
mem_heap_B=([0-9]+)'
regex2='mem_stacks_B=([0-9]+)'
while
[[ "$s" =~ $regex1 ]] &&
time="${BASH_REMATCH[1]}" &&
heap="${BASH_REMATCH[2]}" &&
[[ "$s" =~ $regex2 ]] &&
stack="${BASH_REMATCH[1]}"
do
echo "$time, $heap, $stack"
s="${s#*$stack}"
done >> "$2"
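To tie this to the sample input from the question, here is a self-contained sketch of the two-match approach. extract_rows is a name made up for this example, and the sketch assumes every time= line is followed at some point by a mem_stacks_B= line. Unlike the loop above, it cuts the text right after the first match, so the second match can only find a value that comes later.

```bash
#!/usr/bin/env bash

# Print one "time, heap, stack" row per snapshot found in the text given as $1.
extract_rows() {
    local s=$1 time heap stack
    # $'...' turns \n into a real newline, which is what =~ needs
    local re_head=$'time=([0-9]+)\nmem_heap_B=([0-9]+)'
    local re_stack='mem_stacks_B=([0-9]+)'
    while [[ $s =~ $re_head ]]; do
        time=${BASH_REMATCH[1]}
        heap=${BASH_REMATCH[2]}
        # cut everything up to and including the matched header
        s=${s#*"${BASH_REMATCH[0]}"}
        [[ $s =~ $re_stack ]] || break
        stack=${BASH_REMATCH[1]}
        # cut everything up to and including the matched stack line
        s=${s#*"${BASH_REMATCH[0]}"}
        printf '%s, %s, %s\n' "$time" "$heap" "$stack"
    done
}
```

It could be called as extract_rows "$(<"$file")" >> "$2" inside the for loop of the original script.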
By the way:
https://www.shellcheck.net/ helps you to make your script more robust.
First and foremost: quote your variables.
You can use do cmd1; cmd2; done >> file instead of do cmd1 >> file; cmd2 >> file; done.
mapfile -t matches < <(format_row "$text" "$regex")
printf "%s\n" "${matches[@]}" >> "$2"
could be written as just
format_row "$text" "$regex" >> "$2"

Extracting CGI query parameter values in bash [duplicate]

This question already has answers here:
How to parse $QUERY_STRING from a bash CGI script?
(16 answers)
Closed 3 years ago.
All right, folks, you may have seen this infamous quirk to get hold of those values:
query=`echo $QUERY_STRING | sed "s/=/='/g; s/&/';/g; s/$/'/"`
eval $query
If the query string is host=example.com&port=80 it works just fine and you get the values in bash variables host and port.
However, you may know that a cleverly crafted query string will cause an arbitrary command to be executed on the server side.
I'm looking for a secure replacement or an alternative not using eval. After some research I dug up these alternatives:
read host port <<< $(echo "$QUERY_STRING" | tr '=&' ' ' | cut -d ' ' -f 2,4)
echo $host
echo $port
and
if [[ $QUERY_STRING =~ ^host=([^&]*)\&port=(.*)$ ]]
then
echo ${BASH_REMATCH[1]}
echo ${BASH_REMATCH[2]}
else
echo no match, sorry
fi
Unfortunately these two alternatives only work if the pars come in the order host,port. But they could come in the opposite order.
There could also be more than 2 pars, and any order is possible and allowed. So how do you propose to get the values into the
appropriate bash vars? Can the above methods be amended? Remember that with n pars there are n! possible orders. With 2 pars
there are only 2, but with 3 pars there are already 3! = 6.
I returned to the first method. Can it be made safe to run eval? Can you transform $QUERY_STRING with sed in a way that
makes it safe to do eval $query ?
EDIT: Note that this question differs from the other one referred to and is not a duplicate. The emphasis here is on using eval in a safe way. That is not answered in the other thread.
This method is safe. It does not eval or execute the QUERY_STRING. It uses string manipulation to break up the string into pieces:
QUERY_STRING='host=example.com&port=80'
declare -a pairs
IFS='&' read -ra pairs <<<"$QUERY_STRING"
declare -A values
for pair in "${pairs[@]}"; do
IFS='=' read -r key value <<<"$pair"
values["$key"]="$value"
done
echo do something with "${values[host]}" and "${values[port]}"
URL "percent decoding" left as an exercise.
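For the decoding step, one common idiom is to rewrite each %XX as \xXX and let printf expand it. The following is only a sketch: urldecode is a made-up helper name, it should be applied to each key and value after splitting (not to the whole query string), and it makes no attempt to reject malformed input such as a % that is not followed by two hex digits.

```bash
#!/usr/bin/env bash

# Decode one URL-encoded key or value (sketch; urldecode is a hypothetical name).
urldecode() {
    local s=${1//+/ }          # '+' encodes a space in query strings
    s=${s//\\/\\\\}            # double literal backslashes so %b leaves them alone
    printf '%b' "${s//%/\\x}"  # rewrite %41 as \x41 and let printf expand it
}

urldecode 'my+host%2Eexample.com'; echo   # prints: my host.example.com
```

The result is still untrusted data, so it should only ever be used as a value, never passed to eval.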
You must avoid executing strings at all times when they come from untrusted sources. Therefore I would strongly suggest never using eval in Bash to do something with such a string.
To be really safe, I think I would echo the string into a file, use grep to retrieve parts of the string and remove the file afterwards. Always use a directory outside of the web root.
#! /bin/bash
MYFILE=$(mktemp)
QUERY_STRING='host=example.com&port=80&host=lepmaxe.moc&port=80'
echo "${QUERY_STRING}" > ${MYFILE}
TMP_ARR=($(grep -Eo '(host|port)[^&]*' ${MYFILE}))
[ ${#TMP_ARR[@]} -gt 0 ] || exit 1
[ $((${#TMP_ARR[@]} % 2)) -eq 0 ] || exit 1
declare -A ARRAY;
for ((i = 0; i < ${#TMP_ARR[@]}; i+=2)); do
tmp=$(echo ${TMP_ARR[@]:$((i)):2})
port=$(echo $tmp | sed -r 's/.*port=([^ ]*).*/\1/')
host=$(echo $tmp | sed -r 's/.*host=([^ ]*).*/\1/')
ARRAY[$host]=$port
done
for i in ${!ARRAY[@]}; do
echo "$i = ${ARRAY[$i]}"
done
rm ${MYFILE}
exit 0
This produces:
lepmaxe.moc = 80
example.com = 80

Substring removal in bash

I'm currently trying to get into bash regular expressions to change multiple filenames at the same time. Here are the file names:
a_001_D_xy_S37_L003_R1_001.txt
a_001_D_xy_S37_L003_R2_001.txt
a_002_D_xy_S37_L006_R1_001.txt
a_002_D_xy_S37_L006_R2_001.txt
a_003_D_xy_S23_L003_R1_001.txt
a_003_D_xy_S23_L003_R2_001.txt
I want this as my result:
a_002_D_xy_R1.txt
a_002_D_xy_R2.txt
...
I only want to change those with *001.txt at the end. First I want to remove the _S.._L00. in the filenames and the 001 in the end. I split this procedure in two parts:
for file in *001.txt;
do
echo ${file#_S.._L..6}
done
This loop already does not work. As a second alternative I tried:
for file in *001.fastq.gz;
do
echo ${file/_S.._L00./}
done
but the filenames are again unchanged. (I just use echo here to see the results. If it works I will replace it with mv ${file} ${regularexpression})
Thanks for help!
Considering that you need lots of different fields it is possibly better to just split the filename and then reconstruct it as you wish.
I suggest using an array built by splitting the original filename with _. Then you just reconstruct the new name by using the fields that you wish.
for file in *001.txt; do
echo "FILE: $file"
IFS='_' read -r -a fileFields <<< "$file"
echo "FILE FIELDS: "
for index in "${!fileFields[@]}"; do
echo "- $index ${fileFields[index]}"
done
fileName="${fileFields[0]}_${fileFields[1]}_${fileFields[2]}_${fileFields[3]}_${fileFields[-2]}.txt"
echo "NEW FILE NAME: $fileName"
# mv "$file" "$fileName"
done
The echo commands are just for debugging; you can remove them all once you understand the code.
However, if you really need to split the string using BASH expressions you can check this post:
Extracting part of a string to a variable in bash or take a look at this BASH cheat sheet.
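As for why the attempts in the question leave the names unchanged: ${var#...} and ${var/.../} take glob patterns, not regular expressions. A dot is a literal dot there, ? matches exactly one character, and # only matches at the very start of the value. A minimal sketch with glob syntax (short_name is a made-up helper name, and the pattern assumes the _Sxx_L00x layout shown in the question):

```bash
#!/usr/bin/env bash

# Turn a_001_D_xy_S37_L003_R1_001.txt into a_001_D_xy_R1.txt (sketch).
short_name() {
    local name=$1
    name=${name/_S??_L00?/}     # glob: each ? matches exactly one character
    name=${name%_001.txt}.txt   # strip the trailing _001, keep the extension
    printf '%s\n' "$name"
}

short_name a_001_D_xy_S37_L003_R1_001.txt   # prints: a_001_D_xy_R1.txt
```

In the loop this would be used as mv "$file" "$(short_name "$file")".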
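Try wrapping it in a function. Looping over the glob directly (instead of re-reading the output of ls after each mv) keeps the file list stable while files are being renamed.
functionRename(){
    for file in *_001.txt
    do
        # skip the literal pattern if nothing matches
        [ -e "$file" ] || continue
        # drop _Sxx_Lxxx: keep the part before _S??_ plus everything after the first 19 characters
        stripped="${file%_S??_*}${file#???????????????????}"
        # drop the trailing _001 and rename in one step
        mv "${file}" "${stripped%_001*}.txt"
    done
}
functionRename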

sed/awk replace in all matches

I want to invert all the color values in a bunch of files. The colors are all in the hex format #ff3300 so the inversion could be done characterwise with the sed command
y/0123456789abcdef/fedcba9876543210/
How can I loop through all the color matches and do the char translation in sed or awk?
EDIT:
sample input:
random text... #ffffff_random_text_#000000__
asdf#00ff00
asdfghj
desired output:
random text... #000000_random_text_#ffffff__
asdf#ff00ff
asdfghj
EDIT: I changed my response as per your edit.
OK, sed makes this awkward. awk could do the trick more or less easily, but I find perl much easier for this task:
$ perl -pe 's/#[0-9a-f]+/$&=~tr%0123456789abcdef%fedcba9876543210%r/ge' <infile >outfile
Basically you find the pattern, then execute the right-hand side, which executes the tr on the match, and substitutes the value there.
The inversion is really a subtraction. To invert a hex, you just subtract it from ffffff.
With this in mind, you can build a simple script to process each line, extract hexes, invert them, and inject them back to the line.
This is using Bash (see arrays, printf -v, += etc) only (no external tools there):
#!/usr/bin/env bash
[[ -f $1 ]] || { printf "error: cannot find file: %s\n" "$1" >&2; exit 1; }
while read -r; do
# split line with '#' as separator
IFS='#' toks=( $REPLY )
for tok in "${toks[@]}"; do
# extract hex
read -n6 hex <<< "$tok"
# is it really a hex ?
if [[ $hex =~ [0-9a-fA-F]{6} ]]; then
# compute inversion
inv="$((16#ffffff - 16#$hex))"
# zero pad the result
printf -v inv "%06x" "$inv"
# replace hex with inv
tok="${tok/$hex/$inv}"
fi
# build the modified line
line+="#$tok"
done
# print the modified line and clean it for reuse
printf "%s\n" "${line#\#}"
unset line
done < "$1"
use it like:
$ ./invhex infile > outfile
test case input:
random text... #ffffff_random_text_#000000__
asdf#00ff00
bdf#cvb_foo
asdfghj
#bdfg
processed output:
random text... #000000_random_text_#ffffff__
asdf#ff00ff
bdf#cvb_foo
asdfghj
#bdfg
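The subtraction step on its own can be checked in isolation. A tiny sketch (invert_hex is a made-up helper name, and it assumes its argument is exactly six hex digits):

```bash
#!/usr/bin/env bash

# Print the inverse of a six-digit hex colour given without the leading '#'.
invert_hex() {
    # 16#... tells bash arithmetic to read the digits as base 16;
    # %06x pads the result back to six lowercase hex digits
    printf '%06x\n' "$(( 16#ffffff - 16#$1 ))"
}

invert_hex 00ff00   # prints: ff00ff
```

Because each digit d becomes f - d with no borrow, this gives the same result as the characterwise y/0123456789abcdef/fedcba9876543210/ translation from the question.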
This might work for you (GNU sed):
sed '/#[a-f0-9]\{6\}\>/!b
s//\n&/g
h
s/[^\n]*\(\n.\{7\}\)[^\n]*/\1/g
y/0123456789abcdef/fedcba9876543210/
H
g
:a;s/\n.\{7\}\(.*\n\)\n\(.\{7\}\)/\2\1/;ta
s/\n//' file
Explanation:
/#[a-f0-9]\{6\}\>/!b bail out on lines not containing the required pattern
s//\n&/g prepend every pattern with a newline
h copy this to the hold space
s/[^\n]*\(\n.\{7\}\)[^\n]*/\1/g delete everything but the required pattern(s)
y/0123456789abcdef/fedcba9876543210/ transform the pattern(s)
H append the new pattern(s) to the hold space
g overwrite the pattern space with the contents of the hold space
:a;s/\n.\{7\}\(.*\n\)\n\(.\{7\}\)/\2\1/;ta replace the old pattern(s) with the new.
s/\n// remove the newline artifact from the H command.
This works...
cat test.txt |sed -e 's/\#\([0123456789abcdef]\{6\}\)/\n\#\1\n/g' |sed -e ' /^#.*/ y/0123456789abcdef/fedcba9876543210/' | awk '{lastType=type;type= substr($0,1,1)=="#";} type==lastType && length(line)>0 {print line;line=$0} type!=lastType {line=line$0} length(line)==0 {line=$0} END {print line}'
The first sed command inserts line breaks around the hex codes, so that the substitution can then be applied to all lines starting with a hash. There is probably a more elegant way to merge the lines back together, but the awk command does the job. The only assumption is that no two hex codes follow each other directly. If they do, this step has to be revised.

Getting the index of the substring on solaris

How can I find the index of a substring which matches a regular expression on solaris10?
Assuming that what you want is to find the location of the first match of a wildcard in a string using bash, the following bash function returns just that, or empty if the wildcard doesn't match:
function match_index()
{
local pattern=$1
local string=$2
local result=${string/${pattern}*/}
[ ${#result} = ${#string} ] || echo ${#result}
}
For example:
$ echo $(match_index "a[0-9][0-9]" "This is a a123 test")
10
If you want to allow full-blown regular expressions instead of just wildcards, replace the "local result=" line with
local result=$(echo "$string" | sed 's/'"$pattern"'.*$//')
but then you're exposed to the usual shell quoting issues.
The go-to options for me are bash, awk and perl. I'm not sure what you're trying to do, but any of the three would likely work well. For example:
f=somestring
string=$(expr match "$f" '.*\(expression\).*')
echo $string
You tagged the question as bash, so I'm going to assume you're asking how to do this in a bash script. Unfortunately, the built-in regular expression matching doesn't save string indices. However, if you're asking this in order to extract the match substring, you're in luck:
# leave $regex unquoted: a quoted right-hand side is matched as a literal string
if [[ "$var" =~ $regex ]]; then
    n=${#BASH_REMATCH[*]}
    i=0
    while [[ $i -lt $n ]]
    do
        echo "capture[$i]: ${BASH_REMATCH[$i]}"
        let i++
    done
fi
This snippet will output in turn all of the submatches. The first one (index 0) will be the entire match.
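If the index is what you are after, it can be derived from the matched text: strip everything from the match onwards and measure what is left. A sketch (regex_index is a made-up name; it assumes a pattern without anchors or other context-dependent constructs, so that the first occurrence of the matched text is where the regex matched):

```bash
#!/usr/bin/env bash

# Print the 0-based index of the first regex match in a string,
# or return 1 without output if there is no match.
regex_index() {
    local re=$1 string=$2 prefix
    [[ $string =~ $re ]] || return 1
    # remove the matched text and everything after it; what is left
    # is the part of the string in front of the match
    prefix=${string%%"${BASH_REMATCH[0]}"*}
    printf '%s\n' "${#prefix}"
}

regex_index 'a[0-9]+' 'This is a a123 test'   # prints: 10
```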
You might like your awk options better, though. There's a function match which gives you the index you want. Documentation can be found here. It'll also store the length of the match in RLENGTH, if you need that. To implement this in a bash script, you could do something like:
match_index=$(echo "$var_to_search" | \
awk '{
where = match($0, '"$regex_to_find"')
if (where)
print where
else
print -1
}')
There are a lot of ways to deal with passing the variables in to awk. This combination of piping output and directly embedding one into the awk one-liner is fairly common. You can also give awk variable values with the -v option (see man awk).
Obviously you can modify this to get the length, the match string, whatever it is you need. You can capture multiple things into an array variable if necessary:
match_data=($( ... awk '{ ... print where,RLENGTH,match_string ... }'))
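Here is how the -v variant might look, as a sketch only. awk_match_index is a made-up name. The awk on your PATH must support -v and match(), so on Solaris 10 this may need nawk or /usr/xpg4/bin/awk rather than the default awk. Note that awk processes backslash escapes in a value passed with -v.

```bash
#!/usr/bin/env bash

# Print "START LENGTH" for the first regex match in the text (START is
# 1-based, as in awk), or -1 if there is no match. Only line one is examined.
awk_match_index() {
    local regex=$1 text=$2
    printf '%s\n' "$text" | awk -v re="$regex" '{
        if (match($0, re))
            print RSTART, RLENGTH
        else
            print -1
        exit
    }'
}

awk_match_index 'a[0-9][0-9]' 'This is a a123 test'   # prints: 11 3
```

Passing the pattern with -v avoids splicing shell variables into the awk program text, which sidesteps most of the quoting problems.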
If you use bash 4.x, you can source oobash, a string library written in bash in an OO style:
http://sourceforge.net/projects/oobash/
String is the constructor function:
String a abcda
a.indexOf a
0
a.lastIndexOf a
4
a.indexOf da
3
There are many more "methods" for working with strings in your scripts:
-base64Decode -base64Encode -capitalize -center
-charAt -concat -contains -count
-endsWith -equals -equalsIgnoreCase -reverse
-hashCode -indexOf -isAlnum -isAlpha
-isAscii -isDigit -isEmpty -isHexDigit
-isLowerCase -isSpace -isPrintable -isUpperCase
-isVisible -lastIndexOf -length -matches
-replaceAll -replaceFirst -startsWith -substring
-swapCase -toLowerCase -toString -toUpperCase
-trim -zfill