Edit Bash arrays in text file

I would like to change the following piece:
# Source
source=('10-nvidia-drm-outputclass.conf'
'20-nvidia.conf'
'linux-4.11.patch')
source_i686=("http://us.download.nvidia.com/XFree86/Linux-x86/$pkgver/NVIDIA-Linux-x86-$pkgver.run")
source_x86_64=("http://us.download.nvidia.com/XFree86/Linux-x86_64/$pkgver/$_pkg.run")
md5sums=('4f5562ee8f3171769e4638b35396c55d'
'2640eac092c220073f0668a7aaff61f7'
'897d9775dc484ab37934e7b102c5b325')
md5sums_i686=('8825cec1640739521689bd80121d1425')
md5sums_x86_64=('0e9590d48703c8baa034b6f0f8bbf1e5')
[[ $_pkg = NVIDIA-Linux-x86_64-$pkgver ]] && md5sums_x86_64=('1b74150e84fd99cc1207a51b9327112c')
into:
# Source
source=('10-nvidia-drm-outputclass.conf'
'20-nvidia.conf')
# 'linux-4.11.patch')
source_i686=("http://us.download.nvidia.com/XFree86/Linux-x86/$pkgver/NVIDIA-Linux-x86-$pkgver.run")
source_x86_64=("http://us.download.nvidia.com/XFree86/Linux-x86_64/$pkgver/$_pkg.run")
md5sums=('4f5562ee8f3171769e4638b35396c55d'
'2640eac092c220073f0668a7aaff61f7')
# '897d9775dc484ab37934e7b102c5b325')
md5sums_i686=('8825cec1640739521689bd80121d1425')
md5sums_x86_64=('0e9590d48703c8baa034b6f0f8bbf1e5')
[[ $_pkg = NVIDIA-Linux-x86_64-$pkgver ]] && md5sums_x86_64=('1b74150e84fd99cc1207a51b9327112c')
...to comment out the last item in source and md5sums and close the arrays with ) on the preceding line.
I only know how to do a quarter of it: commenting out the 'linux-4.11.patch') line with:
sed "/'linux-.*patch'/s/^/#/"
Sed version:
$ sed --version | head -1
sed (GNU sed) 4.4

Assuming there are no ( or ) characters inside the array elements and no NUL characters in the file:
$ sed -zE 's/((source|md5sums)=\([^)]*)\n([^)\n]*\))/\1)\n#\3/g' input_file
# Source
source=('10-nvidia-drm-outputclass.conf'
'20-nvidia.conf')
# 'linux-4.11.patch')
source_i686=("http://us.download.nvidia.com/XFree86/Linux-x86/$pkgver/NVIDIA-Linux-x86-$pkgver.run")
source_x86_64=("http://us.download.nvidia.com/XFree86/Linux-x86_64/$pkgver/$_pkg.run")
md5sums=('4f5562ee8f3171769e4638b35396c55d'
'2640eac092c220073f0668a7aaff61f7')
# '897d9775dc484ab37934e7b102c5b325')
md5sums_i686=('8825cec1640739521689bd80121d1425')
md5sums_x86_64=('0e9590d48703c8baa034b6f0f8bbf1e5')
[[ $_pkg = NVIDIA-Linux-x86_64-$pkgver ]] && md5sums_x86_64=('1b74150e84fd99cc1207a51b9327112c')
-z causes the whole file to be read at once (input is treated as NUL-separated)
-E enables extended regular expressions
((source|md5sums)=\([^)]*)\n([^)\n]*\)) matches source=(...) or md5sums=(...) in two halves, with the second half containing the last line
\1)\n#\3 closes the array after the first half and comments out the last line, as required
If the number of lines is known to be fixed,
sed '/^source=\|^md5sums=/ {N;N; s/\n/)\n#/2}' input_file
where the count of N commands and the 2 are both the number of lines minus one.
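For quick verification, the -z approach can be exercised on a reduced sample (the file name pkgfile and its contents here are made up for illustration, not taken from the real PKGBUILD):

```shell
# Reduced sample mimicking the source/md5sums arrays (contents are made up)
cat > pkgfile <<'EOF'
source=('a.conf'
'b.conf'
'c.patch')
md5sums=('111'
'222'
'333')
EOF
# Close each array one line earlier and comment out its last element
sed -zE 's/((source|md5sums)=\([^)]*)\n([^)\n]*\))/\1)\n#\3/g' pkgfile
```

Each array should now end with ')' one line earlier, with its former last element commented out below it.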

Related

Script to delete old files and leave the newest one in a directory in Linux

I have a backup tool that takes database backup daily and stores them with the following format:
*_DATE_*.*.sql.gz
with DATE being in YYYY-MM-DD format.
How could I delete old files (by comparing YYYY-MM-DD in the filenames) matching the pattern above, while leaving only the newest one.
Example:
wordpress_2020-01-27_06h25m.Monday.sql.gz
wordpress_2020-01-28_06h25m.Tuesday.sql.gz
wordpress_2020-01-29_06h25m.Wednesday.sql.gz
At the end only the newest file, wordpress_2020-01-29_06h25m.Wednesday.sql.gz, should remain.
Assuming:
The substring to the left of the _DATE_ portion does not contain underscores.
The filenames do not contain newline characters.
Then would you try the following:
for f in *.sql.gz; do
echo "$f"
done | sort -t "_" -k 2 | head -n -1 | xargs rm --
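As a dry run, the sort/head part of that pipeline can be checked with echoed names before wiring in xargs rm (the filenames below are invented; GNU head is assumed for the negative -n count):

```shell
# Dry run: sort by the date field, drop the newest, print what would be deleted
printf '%s\n' \
  wordpress_2020-01-28_06h25m.Tuesday.sql.gz \
  wordpress_2020-01-27_06h25m.Monday.sql.gz \
  wordpress_2020-01-29_06h25m.Wednesday.sql.gz |
  sort -t "_" -k 2 | head -n -1
```

Only the two older names should be printed; the Wednesday file survives.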
If your head and cut commands support the -z option, the following code will be more robust against special characters in the filenames:
for f in *.sql.gz; do
[[ $f =~ _([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2})_ ]] && \
printf "%s\t%s\0" "${BASH_REMATCH[1]}" "$f"
done | sort -z | head -z -n -1 | cut -z -f 2- | xargs -0 rm --
It makes use of the NUL character as a line delimiter and allows any special characters in the filenames.
It first extracts the DATE portion from each filename, then prepends it to the filename as a first field separated by a tab character.
It then sorts the records by the DATE string, excludes the last (newest) one, retrieves the filename by cutting the first field off, and removes those files.
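The NUL-delimited plumbing can be illustrated on synthetic date-prefixed records, with tr used at the end only to make the result visible (the record values are made up; GNU sort/head/cut with -z are assumed):

```shell
# Three NUL-terminated "date<TAB>name" records, deliberately out of order
printf '2020-01-27\ta\0002020-01-29\tc\0002020-01-28\tb\000' |
  sort -z | head -z -n -1 | cut -z -f 2- | tr '\0' '\n'
```

The newest record (c) is dropped and only the names a and b come out.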
I found this in another question. Although it serves the purpose, it does not select files based on their filenames; it relies on modification time instead.
ls -tp | grep -v '/$' | tail -n +2 | xargs -I {} rm -- {}
Since the pattern (glob) you present us is very generic, we have to make an assumption here.
assumption: the date pattern, is the first sequence that matches the regex [0-9]{4}-[0-9]{2}-[0-9]{2}
Files are of the form: constant_string_<DATE>_*.sql.gz
a=( *.sql.gz )
unset a[${#a[@]}-1]
rm "${a[@]}"
Files are of the form: *_<DATE>_*.sql.gz
Using this, it is easily done in the following way:
a=( *.sql.gz );
cnt=0; ref="0000-00-00"; for f in "${a[@]}"; do
[[ "$f" =~ [0-9]{4}(-[0-9]{2}){2} ]] \
&& [[ "$BASH_REMATCH" > "$ref" ]] \
&& ref="${BASH_REMATCH}" && refi=$cnt
((++cnt))
done
unset "a[refi]"
rm "${a[@]}"
[[ expression ]] <snip> An additional binary operator, =~, is available, with the same precedence as == and !=. When it is used, the string to the right of the operator is considered an extended regular expression and matched accordingly (as in regex(3)). The return value is 0 if the string matches the pattern, and 1 otherwise. If the regular expression is syntactically incorrect, the conditional expression's return value is 2. If the shell option nocasematch is enabled, the match is performed without regard to the case of alphabetic characters. Any part of the pattern may be quoted to force it to be matched as a string. Substrings matched by parenthesized subexpressions within the regular expression are saved in the array variable BASH_REMATCH. The element of BASH_REMATCH with index 0 is the portion of the string matching the entire regular expression. The element of BASH_REMATCH with index n is the portion of the string matching the nth parenthesized subexpression
source: man bash
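A minimal illustration of BASH_REMATCH as described in that excerpt (the sample filename is taken from the question):

```shell
# Index 0 holds the whole match, index 1 the first parenthesized group
f="wordpress_2020-01-29_06h25m.Wednesday.sql.gz"
if [[ $f =~ _([0-9]{4}-[0-9]{2}-[0-9]{2})_ ]]; then
  echo "whole match: ${BASH_REMATCH[0]}"   # _2020-01-29_
  echo "date group:  ${BASH_REMATCH[1]}"   # 2020-01-29
fi
```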
Go to the folder where you have the *_DATE_*.*.sql.gz files and try the command below:
ls -ltr *.sql.gz|awk '{print $9}'|awk '/2020/{print $0}' |xargs rm
or
use
`ls -ltr |grep '2019-05-20'|awk '{print $9}'|xargs rm`
Replace /2020/ with the pattern you want to delete; for example, for 2020-05-01 use /2020-05-01/.
Using two for loop
#!/bin/bash
shopt -s nullglob ##: This might not be needed but just in case
##: If there are no files the glob will not expand
latest=
allfiles=()
unwantedfiles=()
for file in *_????-??-??_*.sql.gz; do
if [[ $file =~ _([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2})_ ]]; then
allfiles+=("$file")
[[ $file > $latest ]] && latest=$file ##: The > is magical inside [[
fi
done
n=${#allfiles[@]}
if ((n <= 1)); then ##: No files or only one file, don't remove it!!
printf '%s\n' "Found ${n:-0} ${allfiles[@]:-*sql.gz} file, bye!"
exit 0 ##: Exit gracefully instead
fi
for f in "${allfiles[@]}"; do
[[ $latest == $f ]] && continue ##: Skip the latest file in the loop.
unwantedfiles+=("$f") ##: Save all files in an array without the latest.
done
printf 'Deleting the following files: %s\n' "${unwantedfiles[*]}"
echo rm -rf "${unwantedfiles[@]}"
Relies heavily on the > test operator inside [[
You can create a new file with an older date and it will still be handled correctly.
The echo is there just to see what's going to happen. Remove it if you're satisfied with the output.
I'm actually using this script via cron now, except for the *.sql.gz part, since I only have directories to match with the same date format; so I use ????-??-??/ as the glob and only ([[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}) as the regex pattern.
You can use my Python script "rotate-archives" for smart delete backups. (https://gitlab.com/k11a/rotate-archives).
An example of starting archives deletion:
rotate-archives.py test_mode=off age_from-period-amount_for_last_timeslot=7-5,31-14,365-180-5 archives_dir=/mnt/archives
As a result, there will remain archives from 7 to 30 days old with a time interval between archives of 5 days, from 31 to 364 days old with time interval between archives 14 days, from 365 days old with time interval between archives 180 days and the number of 5.
But it requires moving the _date_ to the beginning of the file name, or having the script add the current date for new files.

Get multiple values in an xml file

<!-- someotherline -->
<add name="core" connectionString="user id=value1;password=value2;Data Source=datasource1.comapany.com;Database=databasename_compny" />
I need to grab the values of user id, password, Data Source, and Database. Not all lines are in the same format. My desired result would be (username=value1, password=value2, DataSource=datasource1.comapany.com, Database=databasename_compny).
This regex seems a bit complicated because the lines are not all in the same format. Please explain your answer if possible.
I realised its better to loop through each line. Code I wrote so far
while read p || [[ -n $p ]]; do
#echo $p
if [[ $p =~ .*connectionString.* ]]; then
echo $p
fi
done <a.config
Now inside the if I have to grab the values.
For this solution I am considering:
Some lines can contain no data
No semi-colon ; is inside the data itself (nor field names)
No equal sign = is inside the data itself (nor field names)
A possible solution for your problem would be:
#!/bin/bash
while read p || [[ -n $p ]]; do
# 1. Only keep what is between the quotes after connectionString=
filteredLine=`echo $p | sed -n -e 's/^.*connectionString="\(.\+\)".*$/\1/p'`;
# 2. Ignore empty lines (that do not contain the expected data)
if [ -z "$filteredLine" ]; then
continue;
fi;
# 3. split each field on a line
oneFieldByLine=`echo $filteredLine | sed -e 's/;/\r\n/g'`;
# 4. For each field
while IFS= read -r field; do
# extract field name + field value
fieldName=`echo $field | sed 's/=.*$//'`;
fieldValue=`echo $field | sed 's/^[^=]*=//' | sed 's/[\r\n]//'`;
# do stuff with it
echo "'$fieldName' => '$fieldValue'";
done < <(printf '%s\n' "$oneFieldByLine")
done <a.xml
Explanations
General sed replacement syntax :
sed 's/a/b/' will replace what matches the regex a by the content of b
Step 1
-n argument tells sed not to output if no match is found. In this case this is useful to ignore useless lines.
^.* - anything at the beginning of the line
connectionString=" - literally connectionString="
\(.\+\)" - capturing group to store anything before the closing quote "
.*$ - anything until the end of the line
\1 tells sed to replace the whole match with only the capturing group (which contains only the data between the quotes)
p tells sed to print out the replacement
Step 3
Replace ; by \r\n ; it is equivalent to splitting by semi-colon because bash can loop over line breaks
Step 4 - field name
Replaces literal = and the rest of the line with nothing (it removes it)
Step 4 - field value
Replaces all the characters at the beginning that are not = ([^=] matches anything except what follows the ^ symbol), up to and including the equal sign, with nothing.
Another sed command removes the line breaks by replacing it with nothing.
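Under the same assumptions (no ; or = inside the data), steps 3-4 can also be sketched with bash's IFS-based splitting instead of sed; the sample line below is adapted from the question:

```shell
# Split the connection string on ';', then split each field on the first '='
line='user id=value1;password=value2;Data Source=datasource1.comapany.com;Database=databasename_compny'
IFS=';' read -ra fields <<< "$line"
for f in "${fields[@]}"; do
  # %%=* strips from the first '=' onward; #*= strips up to the first '='
  printf "'%s' => '%s'\n" "${f%%=*}" "${f#*=}"
done
```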

Can't seem to get correct regex for sed command

I have a CSV file where I need to replace the occurrence of a double quote followed by a line feed with a string i.e. "XXXX"
I've tried the following:
LC_CTYPE=C && LANG=C && sed 's/\"\n/XXXX/g' < input_file.csv > output_file.csv
and
LC_CTYPE=C && LANG=C && sed 's/\"\n\r/XXXX/g' < input_file.csv > output_file.csv
also tried
sed 's/\"\n\r/XXXX/g' < input_file.csv > output_file.csv
In each case, the command does not seem to recognize the specific combination of "\n in the file
It works if I look for just the double quote:
sed 's/\"/XXXX/g' < input_file.csv > output_file.csv
and if I look for just the line feed:
sed 's/\n\r/XXXX/g' < input_file.csv > output_file.csv
But no luck with the find-replace for the combined regex string
Any guidance would be most appreciated.
Adding simplified sample data
Sample input data (header row and two example records):
column1,column2
data,data<cr>
data,data"<cr>
Sample output:
column1,column2
data,data<cr>
data,dataXXXX
Update: Having some luck using perl commands in bash (MacOS) to get this done:
perl -pe 's/\"/XXXX/' input.csv > output1.csv
then
perl -pe 's/\n/YYYY/' output1.csv > output2.csv
this results in XXXXYYYY at the end of each record
I'm sure there is an easier way, but this seems to be doing the trick on a test file I've been using. Trying it out there before I use on the original 200K-line csv file.
sed is for simple substitutions on individual lines, that is all, so this is not a job for sed.
It sounds like this is what you want (uses GNU awk for multi-char RS):
$ awk -v RS='"\n' -v ORS='XXXX' '1' file
column1,column2
data,data
data,dataXXXX$
That final $ above is my prompt, demonstrating that both the " and the subsequent newline have been replaced.
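The same command can be checked end to end on the simplified sample data (sample.csv is created here just for the check):

```shell
# Rebuild the simplified sample: the last record ends with a quote before the newline
printf 'column1,column2\ndata,data\ndata,data"\n' > sample.csv
# Use the two-character sequence quote-then-newline as the input record
# separator and emit XXXX in its place (GNU awk multi-char RS)
awk -v RS='"\n' -v ORS='XXXX' '1' sample.csv
```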
You can try something like this:
sed ':a;/"\r\?$/{N;s/"\r\?\n\|"\r\?$/XXXX/;ba;}'
details:
:a # define the label "a"
/"\r\?$/ # condition: if the line ends with " then:
{
N # add the next line to the pattern space
s/ # replace:
"\r\?\n # the " and the LF (or CRLF)
\|
"\r\?$ # or a " at the end of the added line
# (this second alternative is only tested at the end
# of the file)
/XXXX/ # with XXXX
ba # go to label a
}

Remove newlines (\n) but exclude lines with specific regex?

After a lot of searching, I've come across a few ways to remove newlines using sed or tr
sed ':a;N;$!ba;s/\n//g'
tr -d '\n'
However, I can't find a way to exclude the action from specific lines. I've learned that one can use the "!" in sed as a means to exclude an address from a subsequent action, but I can't figure out how to incorporate it into the sed command above. Here's an example of what I'm trying to resolve.
I have a file formatted as such:
>sequence_ID_1
atcgatcgggatc
aatgacttcattg
gagaccgaga
>sequence_ID_2
gatccatggacgt
ttaacgcgatgac
atactaggatcag
at
I want the file formatted in this fashion:
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat
I've been focusing on trying to exclude lines containing the ">" character, as this is the only constant regex that would exist on lines that have the ">" character (note: the sequence_ID_n is unique to each entry preceded by the ">" and, thus, cannot be relied upon for regex matching).
I've attempted this:
sed ':a;N;$!ba;/^>/!s/\n//g' file.txt > file2.txt
It runs without generating an error, but the output file is the same as the original.
Maybe I can't do this with sed? Maybe I'm approaching this problem incorrectly? Should I be trying to define a range of lines to operate on (i.e. only lines between lines beginning with ">")?
I'm brand new to basic text manipulation, so any suggestions are greatly, greatly appreciated!
This awk should work:
$ awk '/^>/{print (NR==1)?$0:"\n"$0;next}{printf "%s", $0}END{print ""}' file
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat
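That awk answer can be replayed against the question's sample input (seqs.txt is written here only for the check):

```shell
cat > seqs.txt <<'EOF'
>sequence_ID_1
atcgatcgggatc
aatgacttcattg
gagaccgaga
>sequence_ID_2
gatccatggacgt
ttaacgcgatgac
atactaggatcag
at
EOF
# Headers start a fresh line; sequence lines are printed without a newline
awk '/^>/{print (NR==1)?$0:"\n"$0;next}{printf "%s", $0}END{print ""}' seqs.txt
```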
This might work for you (GNU sed):
sed ':a;N;/^>/M!s/\n//;ta;P;D' file
Remove newlines from lines that don't begin with a >.
Using GNU sed:
sed -r ':a;/^[^>]/{$!N;s/\n([^>])/\1/;ta}' inputfile
For your input, it'd produce:
>sequence_ID_1
atcgatcgggatcatgacttcattgagaccgaga
>sequence_ID_2
gatccatggacgttaacgcgatgactactaggatcagt
As @1_CR already said, @jaypal's solution is a good way to do it. But I really could not resist trying it in pure Bash. See the comments for details:
The input data:
$ cat input.txt
>sequence_ID_1
atcgatcgggatc
aatgacttcattg
gagaccgaga
>sequence_ID_2
gatccatggacgt
ttaacgcgatgac
atactaggatcag
at
>sequence_ID_20
gattaca
The script:
$ cat script
#!/usr/bin/env bash
# Bash 4 - read the data line by line into an array
readarray -t data < "$1"
# Bash 3 - read the data line by line into an array
#while read line; do
# data+=("$line")
#done < "$1"
# A search pattern
pattern="^>sequence_ID_[0-9]"
# An array to insert the revised data
merged=()
# A counter
counter=0
# Iterate over each item in our data array
for item in "${data[@]}"; do
# If an item matches the pattern
if [[ "$item" =~ $pattern ]]; then
# Add the item straight into our new array
merged+=("$item")
# Raise the counter in order to write the next
# possible non-matching item to a new index
(( counter++ ))
# Continue the loop from the beginning - skip the
# rest of the code inside the loop for now since it
# is not relevant after we have found a match.
continue
fi
# If we have a match in our merged array then
# raise the counter one more time in order to
# get a new index position
[[ "${merged[$counter]}" =~ $pattern ]] && (( counter++ ))
# Add a non matching value to the already existing index
# currently having the highest index value based on the counter
merged[$counter]+="$item"
done
# Test: Echo each item of our merged array
printf "%s\n" "${merged[@]}"
The result:
$ ./script input.txt
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat
>sequence_ID_20
gattaca
Jaypal's solution is the way to go, here's a GNU awk variant
awk -v RS='>sequence[^\\n]+\\n' '{gsub("\n", "");printf "%s%s%s", $0, NR==1?"":"\n", RT}' file
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat
Here is one way to do it with awk
awk '{printf (/^>/&&NR>1?RS:"")"%s"(/^>/?RS:""),$0}' file
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat

sed/awk replace in all matches

I want to invert all the color values in a bunch of files. The colors are all in the hex format #ff3300 so the inversion could be done characterwise with the sed command
y/0123456789abcdef/fedcba9876543210/
How can I loop through all the color matches and do the char translation in sed or awk?
EDIT:
sample input:
random text... #ffffff_random_text_#000000__
asdf#00ff00
asdfghj
desired output:
random text... #000000_random_text_#ffffff__
asdf#ff00ff
asdfghj
EDIT: I changed my response as per your edit.
OK, sed may result in difficult processing. awk could do the trick more or less easily, but I find perl much easier for this task:
$ perl -pe 's/#[0-9a-f]+/$&=~tr%0123456789abcdef%fedcba9876543210%r/ge' <infile >outfile
Basically you find the pattern, then execute the right-hand side, which executes the tr on the match, and substitutes the value there.
The inversion is really a subtraction. To invert a hex, you just subtract it from ffffff.
With this in mind, you can build a simple script to process each line, extract hexes, invert them, and inject them back to the line.
This is using Bash (see arrays, printf -v, += etc) only (no external tools there):
#!/usr/bin/env bash
[[ -f $1 ]] || { printf "error: cannot find file: %s\n" "$1" >&2; exit 1; }
while read -r; do
# split line with '#' as separator
IFS='#' toks=( $REPLY )
for tok in "${toks[@]}"; do
# extract hex
read -n6 hex <<< "$tok"
# is it really a hex ?
if [[ $hex =~ [0-9a-fA-F]{6} ]]; then
# compute inversion
inv="$((16#ffffff - 16#$hex))"
# zero pad the result
printf -v inv "%06x" "$inv"
# replace hex with inv
tok="${tok/$hex/$inv}"
fi
# build the modified line
line+="#$tok"
done
# print the modified line and clean it for reuse
printf "%s\n" "${line#\#}"
unset line
done < "$1"
use it like:
$ ./invhex infile > outfile
test case input:
random text... #ffffff_random_text_#000000__
asdf#00ff00
bdf#cvb_foo
asdfghj
#bdfg
processed output:
random text... #000000_random_text_#ffffff__
asdf#ff00ff
bdf#cvb_foo
asdfghj
#bdfg
This might work for you (GNU sed):
sed '/#[a-f0-9]\{6\}\>/!b
s//\n&/g
h
s/[^\n]*\(\n.\{7\}\)[^\n]*/\1/g
y/0123456789abcdef/fedcba9876543210/
H
g
:a;s/\n.\{7\}\(.*\n\)\n\(.\{7\}\)/\2\1/;ta
s/\n//' file
Explanation:
/#[a-f0-9]\{6\}\>/!b bail out on lines not containing the required pattern
s//\n&/g prepend every pattern with a newline
h copy this to the hold space
s/[^\n]*\(\n.\{7\}\)[^\n]*/\1/g delete everything but the required pattern(s)
y/0123456789abcdef/fedcba9876543210/ transform the pattern(s)
H append the new pattern(s) to the hold space
g overwrite the pattern space with the contents of the hold space
:a;s/\n.\{7\}\(.*\n\)\n\(.\{7\}\)/\2\1/;ta replace the old pattern(s) with the new.
s/\n// remove the newline artifact from the H command.
This works...
cat test.txt |sed -e 's/\#\([0123456789abcdef]\{6\}\)/\n\#\1\n/g' |sed -e ' /^#.*/ y/0123456789abcdef/fedcba9876543210/' | awk '{lastType=type;type= substr($0,1,1)=="#";} type==lastType && length(line)>0 {print line;line=$0} type!=lastType {line=line$0} length(line)==0 {line=$0} END {print line}'
The first sed command inserts line breaks around the hex codes, which makes it possible to run the substitution on all lines starting with a hash. There is probably a more elegant solution for merging the lines back together, but the awk command does the job. The only assumption is that two hex codes won't directly follow each other; if they do, this step has to be revised.