Extract all but last field from a variable in bash - regex

I have a file with lines similar to this:
01/01 THIS IS A DESCRIPTION 123.45
12/23 SHORTER DESC 9.00
11/16 DESC 1,234.00
Three fields: date, desc, amount. The first field will always be followed by a space. The last field will always be preceded by a space. But the middle field will usually contain spaces.
I know bash/regex well enough to get the first and last fields (for example, echo ${LINE##* } or cut -f1 -d' '). But how do I get the middle field? Essentially everything except the first and last fields.

You can use sed for that:
$ sed -E 's/^[^[:space:]]*[[:space:]](.*)[[:space:]][^[:space:]]*$/\1/' file
THIS IS A DESCRIPTION
SHORTER DESC
DESC
Or with awk:
$ awk '{$1=$NF=""; sub(/^[ \t]*/,"")}1' file
# same output
You can also use cut and rev to delete the first and last fields:
$ cut -d ' ' -f2- file | rev | cut -d ' ' -f2- | rev
# same output
Or GNU grep:
$ grep -oP '^\H+\h\K(.*)(?=\h+\H+$)' file
# same output
Or, with a Bash loop and parameter expansion:
$ while read -r line; do line="${line#* }"; echo "${line% *}"; done <file
# same output
Or, if you want to capture the fields as variables in Bash:
while IFS= read -r line; do
date="${line%% *}"
amt="${line##* }"
line="${line#* }"
desc="${line% *}"
printf "%5s %10s \"%s\"\n" "$date" "$amt" "$desc"
done <file
Prints:
01/01 123.45 "THIS IS A DESCRIPTION"
12/23 9.00 "SHORTER DESC"
11/16 1,234.00 "DESC"

If you want to remove the first and last fields, you can just extend the parameter expansion technique you referenced:
var=${var#* } var=${var% *}
A single # or % removes the shortest substring that matches the glob.
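For example, on the first sample line (a quick sketch at an interactive prompt):
$ line='01/01 THIS IS A DESCRIPTION 123.45'
$ line=${line#* }    # drop the shortest prefix ending in a space (the date)
$ echo "${line% *}"  # drop the shortest suffix starting with a space (the amount)
THIS IS A DESCRIPTION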

bash: read the line into an array of words, and pick out the wanted elements from the array
while read -ra words; do
date=${words[0]}
amount=${words[-1]}
description=${words[*]:1:${#words[@]}-2}
printf "%s=%s\n" date "$date" desc "$description" amt "$amount"
done < file
outputs
date=01/01
desc=THIS IS A DESCRIPTION
amt=123.45
date=12/23
desc=SHORTER DESC
amt=9.00
date=11/16
desc=DESC
amt=1,234.00
This is the fun bit: ${words[*]:1:${#words[@]}-2}
take a slice of the words array, from index 1 (the 2nd element) for a length of "number of elements minus 2"
the words will be joined into a single string with a space separator.
See Shell Parameter Expansion and scroll down a bit for the ${parameter:offset:length} discussion.
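For example, with the words from the first sample line (a sketch):
$ words=(01/01 THIS IS A DESCRIPTION 123.45)
$ echo "${words[*]:1:${#words[@]}-2}"
THIS IS A DESCRIPTION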
If you want to use a regex in bash, then you can use capturing parentheses and the BASH_REMATCH array
while IFS= read -r line; do
if [[ $line =~ ([^[:blank:]]+)" "(.+)" "([^[:blank:]]+) ]]; then
echo "date=${BASH_REMATCH[1]}"
echo "desc=${BASH_REMATCH[2]}"
echo "amt=${BASH_REMATCH[3]}"
fi
done < file
Same output as above.
Notice in the pattern that the spaces need to be quoted (or backslash-escaped)
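For instance, backslash-escaping the spaces works just as well as quoting them (a sketch on the first sample line):
$ line='01/01 THIS IS A DESCRIPTION 123.45'
$ [[ $line =~ ([^[:blank:]]+)\ (.+)\ ([^[:blank:]]+) ]] && echo "${BASH_REMATCH[2]}"
THIS IS A DESCRIPTION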

You could try the one below with awk:
awk '{$1="";$NF="";sub(/^[ \t]*/,"")}1' file_name

Related

Bash / Regex: Replacing the second field in a CSV file when some of the first fields start with quotes and commas within those

This question is for code written in bash, but is really more of a regex question. I have a file (ARyy.txt) with CSV values in it. I want to replace the second field with NaN. This is no problem at all for the simple cases (rows 1 and 2 in the example), but it's much more difficult for a few cases where there are quotes in the first field and they have commas in them. These quotes are literally only there to indicate there are commas within the field (so quotes are there if and only if commas are there). Quotes are always the first and last characters if there are commas in the first field.
Here is what I have thus far. NOTE: please try to answer using sed and the general format. There is a way to do this using awk for FPAT from what I know but I need one using sed ideally (or simple use case of awk).
#!/bin/bash
LN=1 #Line Number
while read -r LIN #LIN is a variable containing the line
do
echo "$LN: $LIN"
((LN++))
if [ $LN -eq 1 ]; then
continue #header line
elif [[ {$LIN:0:1} == "\"" ]]; then #if the first character in the line is quote
sed -i '${LN}s/\",/",NaN/' ARyy.txt #replace quote followed by comma with quote followed by comma followed by NaN
else #if first character doesn't start with a quote
sed -i '${LN}s/,[^,]*/,0/' ARyy.txt; fi
done < ARyy.txt
Other pertinent info:
There are never double or nested quotes or anything peculiar like this
There can be more than one comma inside the quotations
I am always replacing the second field
The second field is always just a number for the input (Never words or quotes)
Input Example:
Fruit, Weight, Intensity, Key
Apple, 10, 12, 343
Banana, 5, 10, 323
"Banana, green, 10 MG", 3, 14, 444 #Notice this line has commas in it but it has quotes to indicate this)
Desired Output:
Fruit, Weight, Intensity, Key
Apple, NaN, 12, 343
Banana, NaN, 10, 323
"Banana, green, 10 MG", NaN, 14, 444 #second field changed to NaN and first field remains in tact
Try this:
sed -E -i '2,$ s/^("[^"]*"|[^",]*)(, *)[0-9]*,/\1\2NaN,/' ARyy.txt
Explanation: sed -E invokes "extended" regular expression syntax, so it's easier to use parenthesized groups.
2,$ = On lines 2 through the end of file...
s/ = Replace...
^ = the beginning of a line
("[^"]*"|[^",]*) = either a double-quoted string or a string that doesn't contain any double-quotes or commas
(, *) = a comma, maybe followed by some spaces
[0-9]* = a number
, = and finally a comma
/ = ...with...
\1 = the first () group (i.e. the original first field)
\2 = the second () group (i.e. comma and spaces)
NaN, = Not a number, and the following comma
/ = end of replacement
Note that if the first field could contain escaped double-quotes and/or escaped commas (not in double-quotes), the first pattern would have to be significantly more complex to deal with them.
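To preview the change before editing in place, you can drop -i and let the result go to stdout; on the sample input this prints:
$ sed -E '2,$ s/^("[^"]*"|[^",]*)(, *)[0-9]*,/\1\2NaN,/' ARyy.txt
Fruit, Weight, Intensity, Key
Apple, NaN, 12, 343
Banana, NaN, 10, 323
"Banana, green, 10 MG", NaN, 14, 444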
BTW, the original has an antipattern I see disturbingly often: reading through a file line-by-line to decide what to do with that line, then running something that processes the entire file in order to change that one line. So if you have a thousand-line file, it winds up processing the entire file a thousand times (for a total of a million lines processed). This is what's known as "quadratic scaling", because it takes time proportional to the square of the problem size. As Bruce Dawson put it,
O(n^2) is the sweet spot of badly scaling algorithms: fast enough to make it into production, but slow enough to make things fall down once it gets there.
Given your specific format, in particular that the first field won't ever have any escaped double quotes in it:
sed -E '2,$ s/^("[^"]*"|[^,]*),[^,]*/\1,NaN/' < input.csv > output.csv
This does require the common but non-standard -E option to use POSIX Extended Regular Expression syntax instead of the default Basic (which doesn't support alternation).
One (somewhat verbose) awk idea that replaces the entire set of code posted in the question:
awk -F'"' ' # input field separator = double quotes
function print_line() { # print array
pfx=""
for (i=1; i<=4; i++) {
printf "%s%s", pfx, arr[i]
pfx=OFS
}
printf "\n"
}
FNR==1 { print ; next } # header record
NF==1 { split($0,arr,",") # no double quotes => split line on comma
arr[2]=" NaN" # override arr[2] with " NaN"
}
NF>=2 { split($3,arr,",") # first column in the file contains double quotes
# so split awk field #3 on comma; arr[1] will
# be empty
arr[1]="\"" $2 "\"" # override arr[1] with awk field #2 (the double-
# quoted first column from the file)
arr[2]=" NaN" # override arr[2] with " NaN"
}
{ print_line() } # print our array
' ARyy.txt
For the sample input file this generates:
Fruit, Weight, Intensity, Key
Apple, NaN, 12, 343
Banana, NaN, 10, 323
"Banana, green, 10 MG", NaN, 14, 444
LN=1 #Line Number
while read -r LIN; do
if [ $LN -eq 1 ]; then
((LN++))
continue
elif [[ $LIN == $(echo "$LIN" | grep '"') ]]; then
word1=$(echo "$LIN" | awk -F ',' '{print $4}')
echo "$LIN" | sed -i "$LN"s/"$word1"/\ NaN/ ARyy2.txt
elif [[ $LIN == $(echo "$LIN" | grep -E '[A-Z][a-z]*[,]\ [0-9]') ]]; then
word2=$(echo "$LIN" | cut -f2 -d ',')
echo "$LIN" | sed -i "$LN"s/"$word2"/\ NaN/ ARyy2.txt
fi
echo "$LN: $LIN"
((LN++))
done <ARyy.txt
Make a copy of the input ARyy.txt as ARyy2.txt and use that text file as the output
(read from ARyy.txt and write to ARyy2.txt).
The first elif, $(echo "$LIN" | grep '"'), checks whether the LINE contains quotes ".
Once selected, we want to grab the number 3 with awk -F ',' '{print $4}' and save it to variable word1. -F tells awk to separate columns each time it encounters a ,, so there are 6 columns in total and the number 3 is in column 4, which is why {print $4}.
echo "$LIN" | sed -i "$LN"s/"$word1"/\ NaN/ ARyy2.txt
Then we use sed to select the line number with $LN. The number 3 is inside variable /$word1/, to be replaced with /NaN/, BUT we want to add a space before NaN, so the space is escaped: /\ NaN/.
We always use echo $LIN to grab the correct LINE.
The second elif, $(echo "$LIN" | grep -E '[A-Z][a-z]*[,]\ [0-9]'), matches the unquoted lines.
$LIN only returns one line at a time.
The important thing is to check whether the LINE has this pattern: Word + comma + space + digit.
Once selected, we want to grab the number 10 (second column), this time with cut -f2 -d ',', and save it to variable word2. -f2 selects the second column, and -d tells cut to use , to separate each column.

sed/grep - display 1st and 5th column, remove comma in 5th column and swap Last-First, and sort in alphabetical order by last name in /etc/passwd

I need some sed/grep help with my /etc/passwd file. An example of an output in my /etc/passwd file is:
username1:x:5687:3794:Smith, Mike:/home/username1:/bin/bash
My first column are usernames and fifth column are Last Name, First and it isn't in alphabetical order in any way. I need to display the first column then display the fifth column in First Name Last Name sorted in alphabetical order by last name.
I have the sed command to display the fifth column in First Name Last Name sorted in alphabetical order by last name:
grep "$userid" /etc/passwd | cut -d: -f5 | sort | sed 's/^\(.*\), \(.*\)$/\2 \1/'
but simply adding f1,5 in my cut command doesn't make the expected output, it removes the alphabetical order and just places the first column after their first name which is not what I am looking for.
I can only use the sed command. What I have and need help with is
grep "$userid" /etc/passwd | cut -d: -f1,5 | sort | sed
I'm stuck here since I don't know much about the sed command, userid is just a variable I am reading from user input inside of a script for the first column.
So an example output would be:
username7 Abe Adams
username2 Jack Adams
username4 Ben Fab
username5 Jon Heat
Assuming you want to match multiple users with the "userid" you search for and that you want it to be a string rather than regexp comparison, here's how to do it robustly:
$ awk -F':' -v userid='username' 'index($1,userid){n=split($5,names,/[ ,]+/); print $1, names[n], names[1]}' file |
sort -k3
username1 Mike Smith
I'm using the split() that way to ensure the command works even if you have more than 2 parts to your name (e.g. Billy Bob Thornton).
You get a colon separating the two fields if you use cut -d: -f1,5, so you write your sed script to expect that:
grep "$userid" /etc/passwd |
cut -d: -f1,5 |
sed 's/^\([^:]*\):\([^,]*\),[[:space:]]*\(.*\)$/\1 \3 \2/' |
sort -k3,3 -k2,2
The negated character classes are a useful way of separating fields. Note that if the password entry doesn't contain a comma (and none of the accounts in the password file on my machine does), then the data is passed through sed unmodified. You could work around that if necessary (with more options and a second s/// expression) but it isn't clear that it's necessary.
It also isn't clear why you need to sort unless the value in $userid is some sort of regex that matches multiple rows in the password file.
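For example, feeding the sample line through the pipeline (standing in for a matching grep "$userid" result):
$ echo 'username1:x:5687:3794:Smith, Mike:/home/username1:/bin/bash' |
cut -d: -f1,5 |
sed 's/^\([^:]*\):\([^,]*\),[[:space:]]*\(.*\)$/\1 \3 \2/'
username1 Mike Smith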
Well, here's an awk and sort combination that might be helpful for you:
sort -t: -k5,5 /etc/passwd | awk -F'[: ,]*' '{print $1,$6,$5 }'
This took me most of the day, and it looks like I'm too late now. :(
Anyway, here's my solution:-
File+.txt:-
username1:x:5687:3794:Smith, Mike:/home/username1:/bin/bash
username2:x:5687:3794:Smith, Bill:/home/username1:/bin/bash
username3:x:5687:3794:Smith, Sam:/home/username1:/bin/bash
username4:x:5687:3794:Smith, Bob:/home/username1:/bin/bash
username5:x:5687:3794:Smith, Steve:/home/username1:/bin/bash
Script+.txt:-
#!/bin/sh -x
init() {
IFS=":"
cat > edpop+.txt << EOF
1d
wq
EOF
cat > edcommands+.txt << EOF
1p
W tmp2+.txt
q
EOF
next
}
end () {
sort -r tmp+.txt > report+.txt
rm -v ./edcommands+.txt
rm -v ./edpop+.txt
IFS=" "
exit 0
}
next() {
[[ -s file+.txt ]] && main
end
}
main() {
ed -s file+.txt < edcommands+.txt
read var1 var2 var3 var4 var5 var6 var7 < tmp2+.txt
lastname=$(echo "${var5}" | cut -d',' -f1)
firstname=$(echo "${var5}" | cut -d',' -f2)
echo -e "${var1}\t${firstname} ${lastname}" >> tmp+.txt
ed -s file+.txt < edpop+.txt
rm -v ./tmp2+.txt
next
}
init
a} Each line is initially read in with ed.
b} Read and IFS are used to cut up each line into variables, with the colon as delimiter.
c} Cut is used to cut up the firstname/lastname fields.
d} Echo sends the newly constructed lines into a temp file.
e} A reverse sort is performed on the temp file, and the result is sent into report+.txt.
f} Ed is used to delete the first line from the input file, thus reducing the count by one.
g} The loop resets. When all of the lines have been popped off the input file-based stack, we exit.

Get multiple values in an xml file

<!-- someotherline -->
<add name="core" connectionString="user id=value1;password=value2;Data Source=datasource1.comapany.com;Database=databasename_compny" />
I need to grab the values of user id, password, Data Source, and Database. Not all lines are in the same format. My desired result would be (username=value1, password=value2, DataSource=datasource1.comapany.com, Database=databasename_compny).
This regex seems a little bit too complicated for me. Please explain your answer if possible.
I realised it's better to loop through each line. Code I wrote so far:
while read p || [[ -n $p ]]; do
#echo $p
if [[ $p =~ .*connectionString.* ]]; then
echo $p
fi
done <a.config
Now inside the if I have to grab the values.
For this solution I am considering:
Some lines can contain no data
No semi-colon ; is inside the data itself (nor field names)
No equal sign = is inside the data itself (nor field names)
A possible solution for your problem would be:
#!/bin/bash
while read p || [[ -n $p ]]; do
# 1. Only keep what is between the quotes after connectionString=
filteredLine=`echo $p | sed -n -e 's/^.*connectionString="\(.\+\)".*$/\1/p'`;
# 2. Ignore empty lines (that do not contain the expected data)
if [ -z "$filteredLine" ]; then
continue;
fi;
# 3. split each field on a line
oneFieldByLine=`echo $filteredLine | sed -e 's/;/\r\n/g'`;
# 4. For each field
while IFS= read -r field; do
# extract field name + field value
fieldName=`echo $field | sed 's/=.*$//'`;
fieldValue=`echo $field | sed 's/^[^=]*=//' | sed 's/[\r\n]//'`;
# do stuff with it
echo "'$fieldName' => '$fieldValue'";
done < <(printf '%s\n' "$oneFieldByLine")
done <a.xml
Explanations
General sed replacement syntax :
sed 's/a/b/' will replace what matches the regex a by the content of b
Step 1
-n argument tells sed not to output if no match is found. In this case this is useful to ignore useless lines.
^.* - anything at the beginning of the line
connectionString=" - literally connectionString="
\(.\+\)" - capturing group to store anything before the closing quote "
.*$ - anything until the end of the line
\1 tells sed to replace the whole match with only the capturing group (which contains only the data between the quotes)
p tells sed to print out the replacement
Step 3
Replace ; by \r\n ; it is equivalent to splitting by semi-colon because bash can loop over line breaks
Step 4 - field name
Replaces literal = and the rest of the line with nothing (it removes it)
Step 4 - field value
Replaces everything from the beginning of the line up to and including the equal sign ([^=] matches any character except =, i.e. all but what is after the ^ symbol) with nothing.
Another sed command removes the line breaks by replacing it with nothing.
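For reference, run against a file containing just the sample <add ... /> line above, the loop should print something like this (a sketch of the expected output):
'user id' => 'value1'
'password' => 'value2'
'Data Source' => 'datasource1.comapany.com'
'Database' => 'databasename_compny'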

How to replace a text sequence that includes "\n" in a text file

This may sound like a duplicate, but I can't make this work.
Consider:
_ = space
- = minus sign
particle_little.csv is a file of this form:
waste line to be deleted
__data__data__data
_-data__data_-data
__data_-data__data
I need to get a standard csv format in particle_std.csv, like this:
data,data,data
-data,data,-data
data,-data,data
I am trying to use tail and tr to do that conversion, here I split the command:
tail -n +2 particle_little.csv to delete the first line
| tr -s ' ' to remove duplicated spaces
| tr '/\b\n \b/' '\n' to delete the very beginning space
| tr ' ' ',' to change spaces for commas
> particle_std.csv to put it in a output file
But I get this (without the 4th step):
data
data
data
-data
...
Finally, the file is huge, so it is almost impossible to open in editors (I know there are super editors that maybe can)
I would suggest that you used awk:
$ cat file
waste line to be deleted
data data data
-data data -data
data -data data
$ awk -v OFS=, '{ $1 = $1 } NR > 1' file
data,data,data
-data,data,-data
data,-data,data
The script sets the output field separator OFS to , and reassigns the first field to itself $1 = $1, causing awk to touch each line (and replace the spaces with commas). Lines after the first, where NR > 1, are printed (the default action is to print the line).
So if I'm reading you right - ignore lines that don't start with whitespace. Comma separate everything else.
I'd suggest perl:
perl -lane 'next unless /^\s/; print join ",", @F'
This, when given:
waste line to be deleted
data data data
-data data -data
data -data data
On STDIN (Or specified in a filename) outputs:
data,data,data
-data,data,-data
data,-data,data
This is because:
-l strips linefeeds (and replaces them after each print);
-a autosplits on any whitespace
-n wraps it in a while (<>) { ... } loop which iterates line by line - functionally it means it works just like sed/grep/tr and reads STDIN or files specified as args.
-e allows specifying a perl snippet.
In this case:
skip any lines that don't start with \s, i.e. any whitespace.
for any other lines, join the fields (@F, generated by -a) with , as delimiter. (This auto-inserts a linefeed because of -l.)
Then you can either redirect the output to a file (>output.csv) or use -i.bak to edit inplace.
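For instance, assuming the file name from the question, an in-place edit with a backup would look like this (a sketch):
$ perl -i.bak -lane 'next unless /^\s/; print join ",", @F' particle_little.csv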
You should probably use sed or awk for this:
sed -e 1d -e 's/^ *//' -e 's/ \{1,\}/,/g'
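On the sample data (assuming the file name from the question), that gives:
$ sed -e 1d -e 's/^ *//' -e 's/ \{1,\}/,/g' particle_little.csv
data,data,data
-data,data,-data
data,-data,data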
One way to do it in Awk is:
awk 'NR == 1 { next }
{ pad=""; for (i = 1; i <= NF; i++) { printf "%s%s", pad, $i; pad="," } print "" }'
but there's a better way to do it in Awk:
awk 'BEGIN { OFS=","} NR == 1 { next } { $1 = $1; print }' data
The BEGIN block sets the output field separator; the assignment $1 = $1; forces Awk to rework the output line; the print prints it.
I've left the first Awk version around because it shows there's more than one way to do it, and in some circumstances, such methods can be useful. But for this task, the second Awk version is better — simpler, more compact (and isomorphic with Tom Fenech's answer).

Remove newlines (\n) but exclude lines with specific regex?

After a lot of searching, I've come across a few ways to remove newlines using sed or tr
sed ':a;N;$!ba;s/\n//g'
tr -d '\n'
However, I can't find a way to exclude the action from specific lines. I've learned that one can use the "!" in sed as a means to exclude an address from a subsequent action, but I can't figure out how to incorporate it into the sed command above. Here's an example of what I'm trying to resolve.
I have a file formatted as such:
>sequence_ID_1
atcgatcgggatc
aatgacttcattg
gagaccgaga
>sequence_ID_2
gatccatggacgt
ttaacgcgatgac
atactaggatcag
at
I want the file formatted in this fashion:
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat
I've been focusing on trying to exclude lines containing the ">" character, as this is the only constant regex that would exist on lines that have the ">" character (note: the sequence_ID_n is unique to each entry preceded by the ">" and, thus, cannot be relied upon for regex matching).
I've attempted this:
sed ':a;N;$!ba;/^>/!s/\n//g' file.txt > file2.txt
It runs without generating an error, but the output file is the same as the original.
Maybe I can't do this with sed? Maybe I'm approaching this problem incorrectly? Should I be trying to define a range of lines to operate on (i.e. only lines between lines beginning with ">")?
I'm brand new to basic text manipulation, so any suggestions are greatly, greatly appreciated!
This awk should work:
$ awk '/^>/{print (NR==1)?$0:"\n"$0;next}{printf "%s", $0}END{print ""}' file
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat
This might work for you (GNU sed):
sed ':a;N;/^>/M!s/\n//;ta;P;D' file
Remove newlines from lines that don't begin with a >.
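On the sample file this produces the desired result (a sketch; the M modifier is GNU-specific and makes ^ match right after embedded newlines):
$ sed ':a;N;/^>/M!s/\n//;ta;P;D' file
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat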
Using GNU sed:
sed -r ':a;/^[^>]/{$!N;s/\n([^>])/\1/;ta}' inputfile
For your input, it'd produce:
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat
As @1_CR already said, @jaypal's solution is a good way to do it. But I really could not resist trying it in pure Bash. See the comments for details:
The input data:
$ cat input.txt
>sequence_ID_1
atcgatcgggatc
aatgacttcattg
gagaccgaga
>sequence_ID_2
gatccatggacgt
ttaacgcgatgac
atactaggatcag
at
>sequence_ID_20
gattaca
The script:
$ cat script
#!/usr/bin/env bash
# Bash 4 - read the data line by line into an array
readarray -t data < "$1"
# Bash 3 - read the data line by line into an array
#while read line; do
# data+=("$line")
#done < "$1"
# A search pattern
pattern="^>sequence_ID_[0-9]"
# An array to insert the revised data
merged=()
# A counter
counter=0
# Iterate over each item in our data array
for item in "${data[@]}"; do
# If an item matches the pattern
if [[ "$item" =~ $pattern ]]; then
# Add the item straight into our new array
merged+=("$item")
# Raise the counter in order to write the next
# possible non-matching item to a new index
(( counter++ ))
# Continue the loop from the beginning - skip the
# rest of the code inside the loop for now since it
# is not relevant after we have found a match.
continue
fi
# If we have a match in our merged array then
# raise the counter one more time in order to
# get a new index position
[[ "${merged[$counter]}" =~ $pattern ]] && (( counter++ ))
# Add a non matching value to the already existing index
# currently having the highest index value based on the counter
merged[$counter]+="$item"
done
# Test: Echo each item of our merged array
printf "%s\n" "${merged[#]}"
The result:
$ ./script input.txt
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat
>sequence_ID_20
gattaca
Jaypal's solution is the way to go, here's a GNU awk variant
awk -v RS='>sequence[^\\n]+\\n' '{gsub("\n", "");printf "%s%s%s", $0, NR==1?"":"\n", RT}' file
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat
Here is one way to do it with awk: print a newline before each > header (except on the first line) and after it, and print everything else with no separator:
awk '{printf (/^>/&&NR>1?RS:"")"%s"(/^>/?RS:""),$0}' file
>sequence_ID_1
atcgatcgggatcaatgacttcattggagaccgaga
>sequence_ID_2
gatccatggacgtttaacgcgatgacatactaggatcagat