Use sed or awk to fix date format - regex

I'm trying to convert a HTML containing a table to a .csv file using a bash script.
So far I've acomplished the following steps:
Convert to Unix format (with dos2unix)
Remove all spaces and tabs (with sed 's/[ \t]//g')
Remove all the blank lines (with sed ':a;N;$!ba;s/\n//g') (this is necesary, because the HTML file has a blank line for each cell of the table... that's not my fault)
Remove the unnecesary <td> and <tr> tags (with sed 's/<t.>//g')
Replace </td> with ',' (with sed 's/<\/td/,/g')
Replace </tr> with end-of-line (\n) characters (with sed 's/<\/tr/\n/g')
Of course, I'm putting all this in a pipeline. So far, it's working great. There's one final step I'm stuck with: The table has a column with dates, which has the format dd/mm/yyyy, and I'd like to convert them to yyyy-mm-dd.
Is there a (simple) way to do it (with sed or awk)?
Data sample (after the whole sed pipe):
500,2,13/09/2007,30000.00,12,B-1
501,2,15/09/2007,14000.00,8,B-2
Expected result:
500,2,2007-09-13,30000.00,12,B-1
501,2,2007-09-15,14000.00,8,B-2
The reason I need to do this is because I need to import this data to MySQL. I could open the file in Excel and change the format by hand, but I would like to skip that.

sed -E 's,([0-9]{2})/([0-9]{2})/([0-9]{4}),\3-\2-\1,g'

Awk can do this task pretty easily:
awk '
BEGIN { FS = OFS = "," }
{ split($3, date, /\//)
$3 = date[3] "-" date[2] "-" date[1]
print $0
}
' infile
It yields:
500,2,2007-09-13,30000.00,12,B-1
501,2,2007-09-15,14000.00,8,B-2

sed "s:,\([0-9]\+\)/\([0-9]\+\)/\([0-9]\+\),:,\3-\2-\1,:"

awk would work for this:
echo 08/26/2013 | awk -F/ '{printf "%s-%s-%s\n",$3,$2,$1}'
as would one of these bash-only options:
IFS=/ read m d y < <(echo 08/26/2013); echo "${y}-${m}-${d}"
IFS=/ read m d y <<< "08/26/2013"; echo "${y}-${m}-${d}"
If you happen to use ksh, where a subshell is not used for the last component of a pipeline, this should work as well:
echo 08/26/2013 | IFS=/ read m d y; echo "${y}-${m}-${d}"
In recent bash, you can also use shopt -s lastpipe in a script to allow the above invocation to work as well, but it won't work on the command line (thanks to #mklement0 in the comments below).
I'll leave it up to you to figure out how to integrate it with the rest...

So far all answers are very case specific to OP's issue. Here is a more general approach, running (GNU, for -d option) date through awk :
awk 'BEGIN{FS=","}
{
"date -d\"" $3 "\" +%Y-%m-%d" | getline mydate;
print $1 "," $2 "," mydate "," $4 "," $5 "," $6
}'
Of course this approach will work as is only if the input date format is handled by date. AFAICS this is not the case for dd/mm/yyyy, unfortunately. One may try other commands than date (not tested).
Edit : Implemented mklement0's comment.
Edit2 : Actually this doesn't work with mawk, which is Debian's default awk implementation. Obvious solution is to install gawk when possible.

correction to awk assume you seek yyyy-mm-dd (not yyyy-dd-mm)
echo 08/26/2013 | awk -F/ '{printf "%s-%s-%s\n",$3,$1,$2}'

Related

Remove hostnames from a single line that follow a pattern in bash script

I need to cat a file and edit a single line with multiple domains names. Removing any domain name that has a set certain pattern of 4 letters ex: ozar.
This will be used in a bash script so the number of domain names can range, I will save this to a csv later on but right now returning a string is fine.
I tried multiple commands, loops, and if statements but sending the output to variable I can use further in the script proved to be another difficult task.
Example file
$ echo file.txt
ozarkzshared.com win.ad.win.edu win_fl.ozarkzsp.com ap.allk.org allk.org >ozarkz.com website.com
What I attempted (that was close)
domains_1=$(cat /tmp/file.txt | sed 's/ozar*//g')
domains_2=$( cat /tmp/file.txt | printf '%s' "${string##*ozar}")
Goal
echo domain_x
win.ad.win.edu ap.allk.org allk.org website.com
If all the domains are on a single line separated by spaces, this might work:
awk '/ozar/ {next} 1' RS=" " file.txt
This sets RS, your record separator, then skips any record that matches the keyword. If you wanted to be able to skip a substring provided in a shell variable, you could do something like this:
$ s=ozar
$ awk -v re="$s" '$0 ~ re {next} 1' RS=" " file.txt
Note that the ~ operator is comparing a regular expression, not precisely a substring. You could leverage the index() function if you really want to check a substring:
$ awk -v s="$s" 'index($0,s) {next} 1' RS=" " file.txt
Note that all of the above is awk, which isn't what you asked for. If you'd like to do this with bash alone, the following might be for you:
while read -r -a a; do
for i in "${a[#]}"; do
[[ "$i" = *"$s"* ]] || echo "$i"
done
done < file.txt
This assigns each line of input to the array $a[], then steps through that array testing for a substring match and printing if there is none. Text processing in bash is MUCH less efficient than in a more specialized tool like awk or sed. YMMV.
you want to delete the words until a space delimiter
$ sed 's/ozar[^ ]*//g' file
win.ad.win.edu win_fl. ap.allk.org allk.org website.com

unix sed not backtracking to finish the job

I'm trying to make a script to convert postgres CSV dumps into Oracle csv dumps. Aka, I'm trying to replace "true" with "Y" and "false" with "N".
So I want a script called to_oracle like this:
echo "false,false,false,true" | to_oracle
N,N,N,Y
So here is my attempt:
sed -E -e 's:(,|^)true(,|$):\1Y\2:g' -e 's:(,|^)false(,|$):\1N\2:g' "$#"
The logic is that a field in a CSV file either starts with beginning of line or a comma "," and it ends with either the end of line or a comma ","
The problem with this script is that it greedily absorbs the comma and thus every second field doesn't work:
echo "false,false,false,true" | to_oracle
N,false,N,Y
Now I suppose I could pipe it to the script twice, and that would do the job, but I'm wondering is there a more elegant solution?
An awk version:
echo "false,false,false,true" | awk -F, -v OFS=, '{for(i=1;i<=NF;i++) $i=$i=="true"?"Y":"N"}1'
N,N,N,Y
It test one by one field, if its true use Y, else use N
If you like to test for false as well
echo "false,false,false,true" | awk -F, -v OFS=, '{for(i=1;i<=NF;i++) $i=($i=="true"?"Y":($i=="false"?"N":"other"))}1'
N,N,N,Y
With GNU sed, you may use
sed -E ':a;s/(,|^)false(,|$)/\1N\2/;ta; :b;s/(,|^)true(,|$)/\1Y\2/;tb'
See the online demo
Details
-E will enable POSIX ERE syntax
':a;s/(,|^)false(,|$)/\1N\2/;ta; will recursively replace false in between commas or start/end of string with N
:b;s/(,|^)true(,|$)/\1Y\2/;tb' will recursively replace true in between commas or start/end of string with Y.

Finding and replacing the last space at or before nth character works with sed but not awk, what am I doing wrong?

I have a string in a test.csv file like this:
here is my string
when I use sed it works just as I expect:
cat test.csv | sed -r 's/^(.{1,9}) /\1,/g'
here is,my string
Then when I use awk it doesn't work and I'm not sure why:
cat test.csv | awk '{gsub(/^(.{1,9}) /,","); print}'
,my string
I need to use awk because once I get this figured out I will be selecting only one column to split into two columns with the added comma. I'm using extended regex with sed, "-r" and was wondering how or if it's supported with awk, but I don't know if that really is the problem or not.
awk does not support back references in gsub. If you are on GNU awk, then gensub can be used to do what you need.
echo "here is my string" | awk '{print gensub(/^(.{1,9}) /,"\\1,","G")}'
here is,my string
Note the use of double \ inside the quoted replacement part. You can read more about gensub here.

Remove everything after 2nd occurrence in a string in unix

I would like to remove everything after the 2nd occurrence of a particular
pattern in a string. What is the best way to do it in Unix? What is most elegant and simple method to achieve this; sed, awk or just unix commands like cut?
My input would be
After-u-math-how-however
Output should be
After-u
Everything after the 2nd - should be stripped out. The regex should also match
zero occurrences of the pattern, so zero or one occurrence should be ignored and
from the 2nd occurrence everything should be removed.
So if the input is as follows
After
Output should be
After
Something like this would do it.
echo "After-u-math-how-however" | cut -f1,2 -d'-'
This will split up (cut) the string into fields, using a dash (-) as the delimiter. Once the string has been split into fields, cut will print the 1st and 2nd fields.
This might work for you (GNU sed):
sed 's/-[^-]*//2g' file
You could use the following regex to select what you want:
^[^-]*-\?[^-]*
For example:
echo "After-u-math-how-however" | grep -o "^[^-]*-\?[^-]*"
Results:
After-u
#EvanPurkisher's cut -f1,2 -d'-' solution is IMHO the best one but since you asked about sed and awk:
With GNU sed for -r
$ echo "After-u-math-how-however" | sed -r 's/([^-]+-[^-]*).*/\1/'
After-u
With GNU awk for gensub():
$ echo "After-u-math-how-however" | awk '{$0=gensub(/([^-]+-[^-]*).*/,"\\1","")}1'
After-u
Can be done with non-GNU sed using \( and *, and with non-GNU awk using match() and substr() if necessary.
awk -F - '{print $1 (NF>1? FS $2 : "")}' <<<'After-u-math-how-however'
Split the line into fields based on field separator - (option spec. -F -) - accessible as special variable FS inside the awk program.
Always print the 1st field (print $1), followed by:
If there's more than 1 field (NF>1), append FS (i.e., -) and the 2nd field ($2)
Otherwise: append "", i.e.: effectively only print the 1st field (which in itself may be empty, if the input is empty).
This can be done in pure bash (which means no fork, no external process). Read into an array split on '-', then slice the array:
$ IFS=-
$ read -ra val <<< After-u-math-how-however
$ echo "${val[*]}"
After-u-math-how-however
$ echo "${val[*]:0:2}"
After-u
awk '$0 = $2 ? $1 FS $2 : $1' FS=-
Result
After-u
After
This will do it in awk:
echo "After" | awk -F "-" '{printf "%s",$1; for (i=2; i<=2; i++) printf"-%s",$i}'

SED replace expression "within" a regular expression

I have to change a CSV file column (the date) which is written in the following format:
YYYY-MM-DD
and I would like it to be
YYYY.MM.DD
I can write a succession of 2 sed rules piped one to the other like :
sed 's/-/./' file.csv | sed 's/-/./'
but this is not clean. my question is: is there a way of assigning variables in sed and tell it that YYYY-MM-DD should be parsed as year=YYYY ; month=MM ; day=DD and then tell it
write $year.$month.$day
or something similar? Maybe with awk?
You could use groups and access the year, month, and day directly via backreferences:
sed 's#\([0-9][0-9][0-9][0-9]\)-\([0-9][0-9]\)-\([0-9][0-9]\)#\1.\2.\3#g'
Here's an alternative solution with awk:
awk 'BEGIN { FS=OFS="," } { gsub("-", ".", $1); print }' file.csv
BEGIN { FS=OFS="," } tells awk to break the input lines into fields by , (variable FS, the [input] Field Separator), as well as to also use , when outputting modified input lines (variable OFS, the Output Field Separator).
gsub("-", ".", $1) replaces all - instances with . in field 1
The assumption is that the data is in the 1st field, $1; if the field index is a different one, replace the 1 in $1 accordingly.
print simply outputs the modified input line, terminated with a newline.
What you are doing is equivalent to supplying the "global" replacement flag:
sed 's/-/./g' file.csv
sed has no variables, but it does have numbered groups:
sed -r 's/([0-9]{4})-([0-9]{2})-([0-9]{2})/\1.\2.\3/g' file.csv
or, if your sed has no -r:
sed 's/\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\)/\1.\2.\3/g' file.csv
You may try this sed command also,
sed 's/\([0-9]\{4\}\)\-\([0-9]\{2\}\)\-\([0-9]\{2\}\)/\1.\2.\3/g' file
Example:
$ (echo '2056-05-15'; echo '2086-12-15'; echo 'foo-bar-go') | sed 's/\([0-9]\{4\}\)-\([0-9]\{2\}\)-\([0-9]\{2\}\)/\1.\2.\3/g'
2056.05.15
2086.12.15
foo-bar-go