Get multiple values in an xml file - regex

<!-- someotherline -->
<add name="core" connectionString="user id=value1;password=value2;Data Source=datasource1.comapany.com;Database=databasename_compny" />
I need to grab the values of user id, password, Data Source and Database. Not all lines are in the same format. My desired result would be (username=value1, password=value2, DataSource=datasource1.comapany.com, Database=databasename_compny)
This regex seems a bit too complicated for me. Please explain your answer if possible.
I realised it's better to loop through each line. The code I wrote so far:
while read p || [[ -n $p ]]; do
#echo $p
if [[ $p =~ .*connectionString.* ]]; then
echo $p
fi
done <a.config
Now inside the if I have to grab the values.

For this solution I am considering:
Some lines can contain no data
No semi-colon ; is inside the data itself (nor field names)
No equal sign = is inside the data itself (nor field names)
A possible solution for your problem would be:
#!/bin/bash
while read p || [[ -n $p ]]; do
# 1. Only keep what is between the quotes after connectionString=
filteredLine=`echo $p | sed -n -e 's/^.*connectionString="\(.\+\)".*$/\1/p'`;
# 2. Ignore empty lines (that do not contain the expected data)
if [ -z "$filteredLine" ]; then
continue;
fi;
# 3. split each field on a line
oneFieldByLine=`echo $filteredLine | sed -e 's/;/\r\n/g'`;
# 4. For each field
while IFS= read -r field; do
# extract field name + field value
fieldName=`echo $field | sed 's/=.*$//'`;
fieldValue=`echo $field | sed 's/^[^=]*=//' | sed 's/[\r\n]//'`;
# do stuff with it
echo "'$fieldName' => '$fieldValue'";
done < <(printf '%s\n' "$oneFieldByLine")
done <a.xml
Explanations
General sed replacement syntax :
sed 's/a/b/' will replace what matches the regex a by the content of b
Step 1
-n tells sed not to print lines automatically; combined with the p flag, only lines where the substitution succeeded are printed. In this case this is useful to ignore useless lines.
^.* - anything at the beginning of the line
connectionString=" - literally connectionString="
\(.\+\)" - capturing group to store anything in before the closing quote "
.*$ - anything until the end of the line
\1 tells sed to replace the whole match with only the capturing group (which contains only the data between the quotes)
p tells sed to print out the replacement
Step 3
Replace ; by \r\n ; it is equivalent to splitting by semi-colon because bash can loop over line breaks
Step 4 - field name
Replaces literal = and the rest of the line with nothing (it removes it)
Step 4 - field value
Replaces all the characters at the beginning that are not = ([^=] matches all but what is after the '^' symbol) until the equal symbol by nothing.
Another sed command removes the line breaks by replacing it with nothing.
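As a complement to the sed pipeline above, the same extraction can be sketched with bash built-ins only. This is a minimal sketch under the same assumptions (no ; or = inside the data) plus one more: the attribute value contains no double quote. The function name parse_connection_string is hypothetical and not part of the original answer.

```bash
#!/bin/bash
# Hypothetical helper (not from the original answer): print one
# 'name' => 'value' line per field of a connectionString="..." attribute.
parse_connection_string() {
    local line=$1 conn field
    local -a fields
    # Keeping the regex in a variable avoids quoting problems inside [[ ]].
    local re='connectionString="([^"]+)"'

    # Lines without a connectionString attribute are skipped (status 1).
    [[ $line =~ $re ]] || return 1
    conn=${BASH_REMATCH[1]}

    # Split on ';' into an array, then split each field on its first '='.
    IFS=';' read -ra fields <<< "$conn"
    for field in "${fields[@]}"; do
        [[ -n $field ]] || continue
        printf "'%s' => '%s'\n" "${field%%=*}" "${field#*=}"
    done
}

parse_connection_string '<add name="core" connectionString="user id=value1;password=value2" />'
```

Inside the while read loop from the question, the function would be called as parse_connection_string "$p"; lines that do not match simply produce no output.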

Related

Bash regex overwrite line if multiple match

I have a bash script with 3 regular expressions. Using conditional ifs, I would like to find the lines in the file that match the first pattern.
If there is a match, I then want to look for a match of the second pattern, but only in the lines that matched the first pattern.
Finally, I want to check the third pattern only against the lines that matched the second pattern (which are also the ones that had already matched the first pattern).
I have the following code, but I don't know how to overwrite the "line" value on a match so that the total number of lines is reduced to only the matching ones.
#!/bin/bash
pattern1= egrep '^([^,]*,){31}[1-9][0-9].*'
pattern2= egrep '^([^,]*,){16}[0-1].[3-9].*'
pattern3= egrep '^([^,]*,){32}[2-9][0-9].*'
while read line
do
if [[$line == $pattern1]];then
newline == $pattern1
if [[$newline == $pattern2 ]];then
newline2 == $pattern2
if [[$newline2 == $pattern3 ]]; then
echo $pattern3
fi
done < mj1.csv #this is the input file
I will call this script like ./b1.sh <filename>.
Some input data:
EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc
1985,1,1,10/26/1984,21,252,21.6899384,CHI,1,WSB,1,16,1,40,5,16,0.313,0,0,,6,7,0.857,1,5,6,7,2,4,5,2,16,12.5
1985,2,2,10/27/1984,21,253,21.69267625,CHI,0,MIL,0,-2,1,34,8,13,0.615,0,0,,5,5,1,3,2,5,5,2,1,3,4,21,19.4
1985,3,3,10/29/1984,21,255,21.69815195,CHI,1,MIL,1,6,1,34,13,24,0.542,0,0,,11,13,0.846,2,2,4,5,6,2,3,4,37,32.9
1985,4,4,10/30/1984,21,256,21.7008898,CHI,0,KCK,1,5,1,36,8,21,0.381,0,0,,9,9,1,2,2,4,5,3,1,6,5,25,14.7
1985,5,5,11/1/1984,21,258,21.7063655,CHI,0,DEN,0,-16,1,33,7,15,0.467,0,0,,3,4,0.75,3,2,5,5,1,1,2,4,17,13.2
1985,6,6,11/7/1984,21,264,21.72279261,CHI,0,DET,1,4,1,27,9,19,0.474,0,0,,7,9,0.778,1,3,4,3,3,1,5,5,25,14.9
1985,7,7,11/8/1984,21,265,21.72553046,CHI,0,NYK,1,15,1,33,15,22,0.682,0,0,,3,4,0.75,4,4,8,5,3,2,5,2,33,29.3
1985,8,8,11/10/1984,21,267,21.73100616,CHI,0,IND,1,2,1,42,9,22,0.409,0,0,,9,12,0.75,2,7,9,4,2,5,3,4,27,21.2
1985,9,9,11/13/1984,21,270,21.73921971,CHI,1,SAS,1,3,1,43,18,27,0.667,1,1,1,8,11,0.727,2,8,10,4,3,2,4,4,45,37.5
1985,10,10,11/15/1984,21,272,21.74469541,CHI,1,BOS,0,-20,1,33,12,24,0.5,0,1,0,3,3,1,0,2,2,2,2,1,1,4,27,17.1
1985,11,11,11/17/1984,21,274,21.75017112,CHI,1,PHI,0,-9,1,44,4,17,0.235,0,0,,8,8,1,0,5,5,7,5,2,4,5,16,12.5
1985,12,12,11/19/1984,21,276,21.75564682,CHI,1,IND,0,-17,1,39,11,26,0.423,0,3,0,12,16,0.75,2,3,5,2,2,1,3,3,34,20.8
1985,13,13,11/21/1984,21,278,21.76112252,CHI,0,MIL,0,-10,1,42,11,22,0.5,0,0,,13,14,0.929,4,9,13,2,2,2,6,3,35,26.7
1985,14,14,11/23/1984,21,280,21.76659822,CHI,0,SEA,1,19,1,30,9,13,0.692,0,0,,5,6,0.833,0,4,4,3,4,1,4,4,23,19.5
1985,15,15,11/24/1984,21,281,21.76933607,CHI,0,POR,0,-10,1,41,10,24,0.417,0,1,0,10,10,1,3,3,6,8,3,1,4,4,30,23.9
1985,16,16,11/27/1984,21,284,21.77754962,CHI,0,GSW,0,-6,1,24,6,10,0.6,0,0,,1,1,1,0,2,2,3,3,2,4,1,13,11.1
1985,17,17,11/29/1984,21,286,21.78302533,CHI,0,PHO,0,-5,1,30,9,17,0.529,1,1,1,3,4,0.75,1,2,3,2,2,0,2,5,22,14
1985,18,18,11/30/1984,21,287,21.78576318,CHI,0,LAC,1,4,1,37,9,15,0.6,0,0,,2,4,0.5,2,3,5,5,3,0,4,4,20,15.5
1985,19,19,12/2/1984,21,289,21.79123888,CHI,0,LAL,1,1,1,42,7,13,0.538,0,0,,6,8,0.75,2,0,2,3,1,1,4,3,20,12.9
1985,20,20,12/4/1984,21,291,21.79671458,CHI,1,NJN,1,15,1,35,7,13,0.538,0,0,,6,6,1,1,2,3,6,1,0,3,3,20,16
1985,21,21,12/7/1984,21,294,21.80492813,CHI,1,NYK,1,2,1,43,8,16,0.5,0,1,0,5,7,0.714,1,1,2,3,2,0,6,5,21,9.3
1985,22,22,12/8/1984,21,295,21.80766598,CHI,1,DAL,1,2,1,35,10,23,0.435,0,0,,0,0,,4,3,7,2,0,2,2,3,20,11.2
1985,23,23,12/11/1984,21,298,21.81587953,CHI,1,DET,0,-7,1,37,13,28,0.464,0,1,0,1,3,0.333,1,7,8,6,2,0,3,4,27,16.2
1985,24,24,12/12/1984,21,299,21.81861739,CHI,0,DET,0,-7,1,30,6,17,0.353,0,2,0,9,10,0.9,0,1,1,2,2,1,1,5,21,12.5
1985,25,25,12/14/1984,21,301,21.82409309,CHI,0,NJN,0,-2,1,44,12,25,0.48,0,0,,10,10,1,2,6,8,8,1,0,0,4,34,29.5
1985,26,26,12/15/1984,21,302,21.82683094,CHI,1,PHI,0,-12,1,27,7,16,0.438,0,0,,0,0,,1,1,2,2,1,0,1,2,14,7.2
1985,27,27,12/18/1984,21,305,21.83504449,CHI,1,HOU,0,-8,1,45,8,20,0.4,0,1,0,2,4,0.5,1,2,3,8,3,0,1,2,18,14.5
1985,28,28,12/20/1984,21,307,21.84052019,CHI,0,ATL,1,3,1,41,12,22,0.545,0,0,,10,16,0.625,4,4,8,7,5,1,7,5,34,26.6
To make things easier, pattern1 matches all rows where column PTS is higher than 10, pattern 2 matches the rows where column FG_PCT is higher than 0.3, and pattern 3 matches all rows where column GmSc is higher than 19.
While an awk solution is going to be a bit faster ... we'll focus on a bash solution per OP's request.
First issue is regex matching uses the =~ operator and not the == operator.
Second issue is that to keep a row only if all 3 regexes match, we want to and (&&) the results of all 3 regex matches.
Third issue addresses some basic syntax issues with OP's current code (eg, space after [[ and before ]]; improper assignments of regex patterns to the pattern* variables).
One bash idea:
pattern1='^([^,]*,){31}[1-9][0-9].*'
pattern2='^([^,]*,){16}[0-1].[3-9].*'
pattern3='^([^,]*,){32}[2-9][0-9].*'
head -1 mj1.csv > mj1.new.csv
while read -r line
do
if [[ "${line}" =~ $pattern1 && "${line}" =~ $pattern2 && "${line}" =~ $pattern3 ]]
then
# do whatever with $line, eg:
echo "${line}"
fi
done < mj1.csv >> mj1.new.csv
This generates:
$ cat mj1.new.csv
EndYear,Rk,G,Date,Years,Days,Age,Tm,Home,Opp,Win,Diff,GS,MP,FG,FGA,FG_PCT,3P,3PA,3P_PCT,FT,FTA,FT_PCT,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,GmSc
1985,3,3,10/29/1984,21,255,21.69815195,CHI,1,MIL,1,6,1,34,13,24,0.542,0,0,,11,13,0.846,2,2,4,5,6,2,3,4,37,32.9
1985,7,7,11/8/1984,21,265,21.72553046,CHI,0,NYK,1,15,1,33,15,22,0.682,0,0,,3,4,0.75,4,4,8,5,3,2,5,2,33,29.3
1985,8,8,11/10/1984,21,267,21.73100616,CHI,0,IND,1,2,1,42,9,22,0.409,0,0,,9,12,0.75,2,7,9,4,2,5,3,4,27,21.2
1985,9,9,11/13/1984,21,270,21.73921971,CHI,1,SAS,1,3,1,43,18,27,0.667,1,1,1,8,11,0.727,2,8,10,4,3,2,4,4,45,37.5
1985,12,12,11/19/1984,21,276,21.75564682,CHI,1,IND,0,-17,1,39,11,26,0.423,0,3,0,12,16,0.75,2,3,5,2,2,1,3,3,34,20.8
1985,13,13,11/21/1984,21,278,21.76112252,CHI,0,MIL,0,-10,1,42,11,22,0.5,0,0,,13,14,0.929,4,9,13,2,2,2,6,3,35,26.7
1985,15,15,11/24/1984,21,281,21.76933607,CHI,0,POR,0,-10,1,41,10,24,0.417,0,1,0,10,10,1,3,3,6,8,3,1,4,4,30,23.9
1985,25,25,12/14/1984,21,301,21.82409309,CHI,0,NJN,0,-2,1,44,12,25,0.48,0,0,,10,10,1,2,6,8,8,1,0,0,4,34,29.5
1985,28,28,12/20/1984,21,307,21.84052019,CHI,0,ATL,1,3,1,41,12,22,0.545,0,0,,10,16,0.625,4,4,8,7,5,1,7,5,34,26.6
NOTE: OP hasn't (yet) provided the expected output so at this point I have to assume OP's regexes are correct
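For comparison, the awk approach mentioned at the top of this answer could look roughly like the sketch below. It assumes the thresholds OP describes in words (PTS > 10, FG_PCT > 0.3, GmSc > 19) and compares the columns numerically, so its result can differ from the regex version wherever the regexes do not express exactly those thresholds. The function name filter_rows is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: read the CSV on stdin, keep the header plus every
# row where PTS (column 32), FG_PCT (column 17) and GmSc (column 33)
# exceed the thresholds described in the question.
filter_rows() {
    awk -F',' 'NR == 1 || ($32 > 10 && $17 > 0.3 && $33 > 19)'
}

# Usage (mj1.csv is the input file from the question):
#   filter_rows < mj1.csv > mj1.new.csv
```

Because the whole file goes through a single awk process, no per-line bash regex matching is needed.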

Bash / Regex: Replacing the second field in a CSV file when some of the first fields start with quotes and commas within those

This question is about code written in bash, but it is really more of a regex question. I have a file (ARyy.txt) with CSV values in it. I want to replace the second field with NaN. This is no problem at all for the simple cases (rows 1 and 2 in the example), but it's much more difficult for the few cases where the first field is quoted and contains commas. The quotes are only there to indicate that the field contains commas (quotes are present if and only if commas are present). If there are commas in the first field, the quotes are always its first and last characters.
Here is what I have thus far. NOTE: please try to answer using sed and the general format. There is a way to do this using awk for FPAT from what I know but I need one using sed ideally (or simple use case of awk).
#!/bin/bash
LN=1 #Line Number
while read -r LIN #LIN is a variable containing the line
do
echo "$LN: $LIN"
((LN++))
if [ $LN -eq 1 ]; then
continue #header line
elif [[ {$LIN:0:1} == "\"" ]]; then #if the first character in the line is quote
sed -i '${LN}s/\",/",NaN/' ARyy.txt #replace quote followed by comma with quote followed by comma followed by NaN
else #if first character doesn't start with a quote
sed -i '${LN}s/,[^,]*/,0/' ARyy.txt; fi
done < ARyy.txt
Other pertinent info:
There are never double or nested quotes or anything peculiar like this
There can be more than one comma inside the quotations
I am always replacing the second field
The second field is always just a number for the input (Never words or quotes)
Input Example:
Fruit, Weight, Intensity, Key
Apple, 10, 12, 343
Banana, 5, 10, 323
"Banana, green, 10 MG", 3, 14, 444 #Notice this line has commas in it but it has quotes to indicate this)
Desired Output:
Fruit, Weight, Intensity, Key
Apple, NaN, 12, 343
Banana, NaN, 10, 323
"Banana, green, 10 MG", NaN, 14, 444 #second field changed to NaN and first field remains in tact
Try this:
sed -E -i '2,$ s/^("[^"]*"|[^",]*)(, *)[0-9]*,/\1\2NaN,/' ARyy.txt
Explanation: sed -E invokes "extended" regular expression syntax, so it's easier to use parenthesized groups.
2,$ = On lines 2 through the end of file...
s/ = Replace...
^ = the beginning of a line
("[^"]*"|[^",]*) = either a double-quoted string or a string that doesn't contain any double-quotes or commas
(, *) = a comma, maybe followed by some spaces
[0-9]* = a number
, = and finally a comma
/ = ...with...
\1 = the first () group (i.e. the original first field)
\2 = the second () group (i.e. comma and spaces)
NaN, = Not a number, and the following comma
/ = end of replacement
Note that if the first field could contain escaped double-quotes and/or escaped commas (not in double-quotes), the first pattern would have to be significantly more complex to deal with them.
BTW, the original has an antipattern I see disturbingly often: reading through a file line-by-line to decide what to do with that line, then running something that processes the entire file in order to change that one line. So if you have a thousand-line file, it winds up processing the entire file a thousand times (for a total of a million lines processed). This is what's known as "quadratic scaling", because it takes time proportional to the square of the problem size. As Bruce Dawson put it,
O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there.
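To try the one-pass substitution from this answer without modifying any file in place, it can be wrapped in a small filter. This is a minimal sketch; the function name second_field_to_nan is hypothetical, and the sample rows are taken from the question.

```bash
#!/bin/bash
# Hypothetical wrapper: read CSV text on stdin and write the result to
# stdout in a single pass, so the original file is never edited in place.
second_field_to_nan() {
    sed -E '2,$ s/^("[^"]*"|[^",]*)(, *)[0-9]*,/\1\2NaN,/'
}

second_field_to_nan <<'EOF'
Fruit, Weight, Intensity, Key
Apple, 10, 12, 343
"Banana, green, 10 MG", 3, 14, 444
EOF
```

Each input line is read exactly once, which avoids the quadratic behaviour described above.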
Given your specific format, in particular that the first field won't ever have any escaped double quotes in it:
sed -E '2,$ s/^("[^"]*"|[^,]*),[^,]*/\1,NaN/' < input.csv > output.csv
This does require the common but non-standard -E option to use POSIX Extended Regular Expression syntax instead of the default Basic (which doesn't support alternation).
One (somewhat verbose) awk idea that replaces the entire set of code posted in the question:
awk -F'"' -v OFS=',' ' # input field separator = double quote, output separator = comma
function print_line() { # print the first 4 array elements joined by OFS
pfx=""
for (i=1; i<=4; i++) {
printf "%s%s", pfx, arr[i]
pfx=OFS
}
printf "\n"
}
FNR==1 { print ; next } # header record
NF==1 { split($0,arr,",") # no double quotes => split line on comma
arr[2]=" NaN" # override arr[2] with " NaN"
}
NF>=2 { split($3,arr,",") # first column from file contains double quotes
# so split awk field #3 on comma; arr[1] will
# be empty
arr[1]="\"" $2 "\"" # override arr[1] with awk field #2 (the
# contents of the double quoted first column)
arr[2]=" NaN" # override arr[2] with " NaN"
}
{ print_line() } # print our array
' ARyy.txt
For the sample input file this generates:
Fruit, Weight, Intensity, Key
Apple, NaN, 12, 343
Banana, NaN, 10, 323
"Banana, green, 10 MG", NaN, 14, 444
LN=1 # line number; must be initialised before the loop
while read -r LIN; do
if [ $LN -eq 1 ]; then
((LN++))
continue
elif [[ $LIN == $(echo "$LIN" | grep '"') ]]; then
word1=$(echo "$LIN" | awk -F ',' '{print $4}')
echo "$LIN" | sed -i "$LN"s/"$word1"/\ NaN/ ARyy2.txt
elif [[ $LIN == $(echo "$LIN" | grep -E '[A-Z][a-z]*[,]\ [0-9]') ]]; then
word2=$(echo "$LIN" | cut -f2 -d ',')
echo "$LIN" | sed -i "$LN"s/"$word2"/\ NaN/ ARyy2.txt
fi
echo "$LN: $LIN"
((LN++))
done <ARyy.txt
Make a copy of the input ARyy.txt as ARyy2.txt and use that file as the output (read from ARyy.txt and write to ARyy2.txt).
The first elif, $(echo "$LIN" | grep '"'), checks whether the line contains a double quote.
Once such a line is selected, grab the number 3 with awk -F ',' '{print $4}' and save it to the variable word1. -F tells awk to split columns each time it encounters a , so there are 6 columns in total and the number 3 is in column 4; that's why {print $4}.
echo "$LIN" | sed -i "$LN"s/"$word1"/\ NaN/ ARyy2.txt
Then use sed to select the line number with $LN and replace the number 3 stored in the variable (/$word1/) with /NaN/. To put a space before NaN, the space needs to be escaped with \, hence /\ NaN/.
Because sed -i is given a file name, it edits ARyy2.txt directly; the piped echo "$LIN" is not read by sed.
The second elif, $(echo "$LIN" | grep -E '[A-Z][a-z]*[,]\ [0-9]'), checks whether the line has the pattern word + comma + space + digit. $LIN only holds one line at a time.
Once such a line is selected, grab the number 10 (second column), this time with cut -f2 -d ',', and save it to the variable word2. -f2 selects the second column, and -d tells cut to use , to separate the columns.

Extract all but last field from a variable in bash

I have a file with lines similar to this:
01/01 THIS IS A DESCRIPTION 123.45
12/23 SHORTER DESC 9.00
11/16 DESC 1,234.00
Three fields: date, desc, amount. The first field will always be followed by a space. The last field will always be preceded by a space. But the middle field will usually contain spaces.
I know bash/regex well enough to get the first and last fields (for example, echo ${LINE##* } or cut -f1 -d' '). But how do I get the middle field? Essentially everything except the first and last fields.
You can use sed for that:
$ sed -E 's/^[^[:space:]]*[[:space:]](.*)[[:space:]][^[:space:]]*$/\1/' file
THIS IS A DESCRIPTION
SHORTER DESC
DESC
Or with awk:
$ awk '{$1=$NF=""; sub(/^[ \t]*/,"")}1' file
# same output
You can also use cut and rev to delete the first and last fields:
$ cut -d ' ' -f2- file | rev | cut -d ' ' -f2- | rev
# same output
Or GNU grep:
$ grep -oP '^\H+\h\K(.*)(?=\h+\H+$)' file
# same output
Or, with a Bash loop and parameter expansion:
$ while read -r line; do line="${line#* }"; echo "${line% *}"; done <file
# same output
Or, if you want to capture the fields as variables in Bash:
while IFS= read -r line; do
date="${line%% *}"
amt="${line##* }"
line="${line#* }"
desc="${line% *}"
printf "%5s %10s \"%s\"\n" "$date" "$amt" "$desc"
done <file
Prints:
01/01     123.45 "THIS IS A DESCRIPTION"
12/23       9.00 "SHORTER DESC"
11/16   1,234.00 "DESC"
If you want to remove the first and last fields, you can just extend the parameter expansion technique you referenced:
var=${var#* } var=${var% *}
A single # or % removes the shortest substring that matches the glob.
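The difference between the single and doubled operators can be sketched as follows. This is a minimal illustration; the function name middle_field is hypothetical, and the sample line comes from the question.

```bash
#!/bin/bash
# Hypothetical helper: strip the first and last space-separated fields.
middle_field() {
    local var=$1
    var=${var#* }    # remove the shortest prefix ending in a space (the date)
    var=${var% *}    # remove the shortest suffix starting with a space (the amount)
    printf '%s\n' "$var"
}

line='01/01 THIS IS A DESCRIPTION 123.45'
middle_field "$line"            # the middle field only
printf '%s\n' "${line%% *}"     # %% removes the longest suffix: the first field remains
printf '%s\n' "${line##* }"     # ## removes the longest prefix: the last field remains
```

The doubled forems ## and %% are what the question already uses for the first and last fields; the single forms keep everything in between.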
bash: read the line into an array of words, and pick out the wanted elements from the array
while read -ra words; do
date=${words[0]}
amount=${words[-1]}
description=${words[*]:1:${#words[@]}-2}
printf "%s=%s\n" date "$date" desc "$description" amt "$amount"
done < file
outputs
date=01/01
desc=THIS IS A DESCRIPTION
amt=123.45
date=12/23
desc=SHORTER DESC
amt=9.00
date=11/16
desc=DESC
amt=1,234.00
This is the fun bit: ${words[*]:1:${#words[@]}-2}
take a slice of the words array, from index 1 (the 2nd element) for a length of "number of elements minus 2"
the words will be joined into a single string with a space separator.
See Shell Parameter Expansion and scroll down a bit for the ${parameter:offset:length} discussion.
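The slice can also be tried on a single record in isolation. The following is a minimal sketch with a hypothetical function name, split_record, assuming every record has at least two whitespace-separated words.

```bash
#!/bin/bash
# Hypothetical helper: split one record into date, description and amount.
# Assumption: the record has at least two whitespace-separated words.
split_record() {
    local -a words
    read -ra words <<< "$1"
    local count=${#words[@]}
    # "${words[*]:1:count-2}" = elements 1 .. count-2, joined with spaces.
    printf 'date=%s\ndesc=%s\namt=%s\n' \
        "${words[0]}" "${words[*]:1:count-2}" "${words[count-1]}"
}

split_record '12/23 SHORTER DESC 9.00'
```

Storing the element count in a variable first keeps the slice expression shorter; offset and length are evaluated as arithmetic expressions, so count-2 needs no extra $(( )).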
If you want to use a regex in bash, then you can use capturing parentheses and the BASH_REMATCH array
while IFS= read -r line; do
if [[ $line =~ ([^[:blank:]]+)" "(.+)" "([^[:blank:]]+) ]]; then
echo "date=${BASH_REMATCH[1]}"
echo "desc=${BASH_REMATCH[2]}"
echo "amt=${BASH_REMATCH[3]}"
fi
done < file
Same output as above.
Notice in the pattern that the spaces need to be quoted (or backslash-escaped)
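One way to sidestep the quoting of spaces is to keep the pattern in a variable and expand it unquoted on the right-hand side of =~. A minimal sketch follows; the function name parse_record is hypothetical, and the ^ and $ anchors are an added assumption that each line holds exactly one record.

```bash
#!/bin/bash
# Hypothetical helper: split one record with a regex stored in a variable.
parse_record() {
    # Single quotes keep the spaces literal; no extra escaping is needed.
    local re='^([^[:blank:]]+) (.+) ([^[:blank:]]+)$'
    [[ $1 =~ $re ]] || return 1
    printf 'date=%s\ndesc=%s\namt=%s\n' \
        "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}"
}

parse_record '01/01 THIS IS A DESCRIPTION 123.45'
```

The variable must stay unquoted in [[ $1 =~ $re ]]; quoting it would make bash treat the pattern as a literal string.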
You could try below one with awk:
awk '{$1="";$NF="";sub(/^[ \t]*/,"")}1' file_name

Swap Strings within a line in Bash

I'm parsing a document with a bash script and outputting different parts of it. At one point I need to find and reformat text in the form of:
(foo)[X]
[Y]
(bar)[Z]
to something like:
X->foo
Y
Z->bar
Now, I'm able to grep the parts I want with RegEx, but I'm having trouble swapping the two elements in one line and handling the fact that the text in parentheses is optional. Is this even possible with a combination of sed and grep?
Thank You for your time.
You can use sed:
sed -e 's/(\([^)]*\))\[\([^]]*\)]/\2->\1/' -e 's/\[\([^]]*\)]/\1/' file
This works for your given input example:
X->foo
Y
Z->bar
You might need to make the patterns more strict if you have more kinds of input to handle.
You can use awk:
awk -F '[][()]+' '{print (NF>3 ? $3 "->" $2 : $2)}' file
X->foo
Y
Z->bar
You can even do it in bash itself, although it's not pretty.
# Three capture groups:
# 1. The optional parenthesized text
# 2. The contents of the parentheses
# 3. The contents of the square brackets
regex="(\((.*)\))?\[(.*)\]"
while IFS= read -r str; do
[[ "$str" =~ $regex ]]
# If the 2nd array element is not empty, print -> followed by the
# non-empty value.
echo "${BASH_REMATCH[3]}${BASH_REMATCH[2]:+->${BASH_REMATCH[2]}}"
done < file.txt
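The loop above does not test whether the match succeeded. A slightly stricter sketch checks the match first and passes non-matching lines through unchanged. The function name swap_item is hypothetical, and the anchored pattern is an assumption that each line holds exactly one item.

```bash
#!/bin/bash
# Hypothetical helper: turn "(foo)[X]" into "X->foo" and "[Y]" into "Y".
swap_item() {
    # Groups: 1 = optional "(text)", 2 = text inside the parentheses,
    #         3 = text inside the square brackets.
    local re='^(\(([^)]*)\))?\[([^]]*)\]$'
    if [[ $1 =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[3]}${BASH_REMATCH[2]:+->${BASH_REMATCH[2]}}"
    else
        printf '%s\n' "$1"    # no match: pass the line through unchanged
    fi
}

swap_item '(foo)[X]'
swap_item '[Y]'
```

In a script it would be called once per line, for example: while IFS= read -r str; do swap_item "$str"; done < file.txt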

In GNU Grep or another standard bash command, is it possible to get a resultset from regex?

Consider the following:
var="text more text and yet more text"
echo $var | egrep "yet more (text)"
It should be possible to get the result of the regex as the string: text
However, I don't see any way to do this in bash with grep or its siblings at the moment.
In perl, php or similar regex engines:
$output = preg_match('/yet more (text)/', 'text more text yet more text');
$output[1] == "text";
Edit: To elaborate why I can't just multiple-regex, in the end I will have a regex with multiple of these (Pictured below) so I need to be able to get all of them. This also eliminates the option of using lookahead/lookbehind (As they are all variable length)
egrep -i "([0-9]+) +$USER +([0-9]+).+?(/tmp/Flash[0-9a-z]+) "
Example input as requested, straight from lsof (Replace $USER with "j" for this input data):
npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXu8pvMg (deleted)
npviewer. 17875 j 17u REG 8,8 16037387 524273 /tmp/FlashXXIBH29F (deleted)
The end goal is to cp /proc/$var1/fd/$var2 ~/$var3 for every line, which ends up "Downloading" flash files (Flash used to store in /tmp but they drm'd it up)
So far I've got:
#!/bin/bash
regex="([0-9]+) +j +([0-9]+).+?/tmp/(Flash[0-9a-zA-Z]+)"
echo "npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXYOvS8S (deleted)" |
sed -r -n -e " s%^.*?$regex.*?\$%\1 \2 \3%p " |
while read -a array
do
echo /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
It cuts off the first digits of the first value to return, and I'm not familiar enough with sed to see what's wrong.
End result for downloading flash 10.2+ videos (Including, perhaps, encrypted ones):
#!/bin/bash
lsof | grep "/tmp/Flash" | sed -r -n -e " s%^.+? ([0-9]+) +$USER +([0-9]+).+?/tmp/(Flash[0-9a-zA-Z]+).*?\$%\1 \2 \3%p " |
while read -a array
do
cp /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
Edit: look at my other answer for a simpler bash-only solution.
So, here is the solution using sed to fetch the right groups and split them up. You still have to use bash to read them afterwards. (This only works if the groups themselves do not contain any spaces; otherwise we would have to use another divider character and make read split on it by setting $IFS to that character.)
#!/bin/bash
USER=j
regex=" ([0-9]+) +$USER +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "
sed -r -n -e " s%^.*$regex.*\$%\1 \2 \3%p " |
while read -a array
do
cp /proc/${array[0]}/fd/${array[1]} ~/${array[2]}
done
Note that I had to adapt your last regex group to allow uppercase letters, and added a space at the beginning to be sure to capture the whole block of numbers. Alternatively, a \b (word boundary) would have worked here, too.
Ah, I forgot to mention that you should pipe the text to this script, like this:
./grep-result.sh < grep-result-test.txt
(provided your files are named like this). Alternatively, you can add < grep-result-test.txt after the sed call (before the |), or prepend the line with cat grep-result-test.txt |.
How does it work?
sed -r -n calls sed in extended-regexp-mode, and without printing anything automatically.
-e " s%^.*$regex.*\$%\1 \2 \3%p " gives the sed program, which consists of a single s command.
I'm using % instead of the normal / as parameter separator, since / appears inside the regex and I don't want to escape it.
The regex to search is prefixed by ^.* and suffixed by .*$ to grab the whole line (and avoid printing parts of the rest of the line).
Note that this .* is greedy, so we have to insert a space into our regexp to avoid it grabbing the start of the first digit group too.
The replacement text consists of the three parenthesized groups, separated by spaces.
the p flag at the end of the command says to print out the pattern space after replacement. Since we grabbed the whole line, the pattern space consists of only the replacement text.
So, the output of sed for your example input is this:
17875 11 /tmp/FlashXXu8pvMg
17875 17 /tmp/FlashXXIBH29F
This is much more friendly for reuse, obviously.
Now we pipe this output as input to the while loop.
read -a array reads a line from standard input (which is the output from sed, due to our pipe), splits it into words (at spaces, tabs and newlines), and puts the words into an array variable.
We could also have written read var1 var2 var3 instead (preferably using better variable names), then the first two words would be put to $var1 and $var2, with $var3 getting the rest.
If read succeeded reading a line (i.e. not end-of-file), the body of the loop is executed:
${array[0]} is expanded to the first element of the array, and similarly for the other indexes.
When the input ends, the loop ends, too.
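The read var1 var2 var3 variant mentioned above could be sketched as follows. It is a dry run that only prints the cp commands instead of executing them. The function name plan_copies and its two parameters are hypothetical, and, as an assumption, the third group captures only the file name (as in the question's final script) so that it can be appended to a destination directory. sed -E is used here as the equivalent of -r.

```bash
#!/bin/bash
# Hypothetical helper: read lsof-style lines on stdin and print the cp
# command for each matching line (dry run, nothing is copied).
# $1 = user name to match, $2 = destination directory.
plan_copies() {
    local user=$1 dest=$2 pid fd name
    sed -E -n "s%^.* ([0-9]+) +$user +([0-9]+).+/tmp/(Flash[0-9a-zA-Z]+) .*\$%\1 \2 \3%p" |
    while read -r pid fd name; do
        printf 'cp /proc/%s/fd/%s %s/%s\n' "$pid" "$fd" "$dest" "$name"
    done
}

# Usage sketch: lsof | plan_copies "$USER" "$HOME"
```

Named variables make the loop body easier to read than array indexes, at the cost of fixing the number of groups in advance.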
This isn't possible using grep or another tool called from a shell prompt/script because a child process can't modify the environment of its parent process. If you're using bash 3.0 or better, then you can use in-process regular expressions. The syntax is perl-ish (=~) and the match groups are available via ${BASH_REMATCH[x]}, where x is the match group.
After creating my sed-solution, I also wanted to try the pure-bash approach suggested by Mark. It works quite fine, for me.
#!/bin/bash
USER=j
regex=" ([0-9]+) +$USER +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "
while read
do
if [[ $REPLY =~ $regex ]]
then
echo cp /proc/${BASH_REMATCH[1]}/fd/${BASH_REMATCH[2]} ~/${BASH_REMATCH[3]}
fi
done
(If you upvote this, you should think about also upvoting Mark's answer, since it is essentially his idea.)
The same as before: pipe the text to be filtered to this script.
How does it work?
As said by Mark, the [[ ... ]] special conditional construct supports the binary operator =~, which interprets its right operand (after parameter expansion) as an extended regular expression (just as we want), and matches the left operand against it. (We have again added a space at the front to avoid matching only the last digit.)
When the regex matches, the [[ ... ]] returns 0 (= true), and also puts the parts matched by the individual groups (and the whole expression) into the array variable BASH_REMATCH.
Thus, when the regex matches, we enter the then block, and execute the commands there.
Here again ${BASH_REMATCH[1]} is an array-access to an element of the array, which corresponds to the first matched group. ([0] would be the whole string.)
Another note: Both my scripts accept multi-line input and work on every line which matches. Non-matching lines are simply ignored. If you are inputting only one line, you don't need the loop, a simple if read ; then ... or even read && [[ $REPLY =~ $regex ]] && ... would be enough.
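To see what ends up in each element of BASH_REMATCH, a single line can be matched and the whole array printed. This is a minimal sketch; the function name show_groups is hypothetical, and the sample line is the one from the question.

```bash
#!/bin/bash
# Hypothetical helper: print every element of BASH_REMATCH for one line.
# $1 = input line, $2 = user name to look for.
show_groups() {
    local regex=" ([0-9]+) +$2 +([0-9]+).+(/tmp/Flash[0-9a-zA-Z]+) "
    local i
    [[ $1 =~ $regex ]] || return 1
    # Index 0 is the whole match; 1..3 are the parenthesized groups.
    for i in "${!BASH_REMATCH[@]}"; do
        printf '%d: [%s]\n' "$i" "${BASH_REMATCH[i]}"
    done
}

show_groups 'npviewer. 17875 j 11u REG 8,8 59737848 524264 /tmp/FlashXXu8pvMg (deleted)' j
```

The brackets in the output make the leading and trailing space of element 0 visible, which come from the spaces at both ends of the regex.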
echo "$var" | pcregrep -o "(?<=yet more )text"
Well, for your simple example, you can do this:
var="text more text and yet more text"
echo $var | grep -e "yet more text" | grep -o "text"
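For a single capture group, one sed call can print just the group. A minimal sketch using the question's example follows; the function name capture_text is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: print only the text captured by the group.
capture_text() {
    # -n plus the p flag prints only lines where the substitution matched;
    # \1 is the text captured by \(text\).
    printf '%s\n' "$1" | sed -n 's/.*yet more \(text\).*/\1/p'
}

var="text more text and yet more text"
capture_text "$var"
```

Lines that do not contain the pattern produce no output at all, so the result can be tested with [ -n "$(capture_text "$var")" ].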