Transform mysql 'INSERT' statement into a CSV line - regex

I need to convert a MySQL dump file to CSV format before importing it into a data warehouse server.
INSERT INTO `temp` VALUES (30686631,1346959848246,1346959850865,1346959998054,'18663196147','18663196147','18668839208','17326812123',3372579,'1866319614700','A',1,'','',0,147,30686632,'KeyAd','1101','38.325.Monitor2.1101#10.40.10.170','10.40.10.40',5060,'10.40.10.46',5060,'100038455383251101_Monitor2#10.40.10.170','<sip:+18668839208#10.40.10.46:5060>;tag=sansay507370834rdb810','\"O\'HALLORAE,AEAN\" <sip:+17326812123#10.40.10.40;isup-oli=00>;tag=sansay507370829rdb1779','200',0,'',0,NULL,'','',3398812,NULL,NULL);
I'm using this command to strip the MySQL INSERT statement:
sed -e 's/^INSERT INTO `temp` VALUES (//' -e 's/);$//' -e 's/(//;s/);//;s/,/|/g;s|["'\'']||g'
There seems to be an issue with names when they sit between two backslashes (\ \), and I can't figure out how to fix it.
From the MySQL insert:
'\"O\'HALLORAE,AEAN\"
I can't figure out how to turn this into:
"O'HALLORAN,SEAN"
Desired output:
30686631|1346959848246|1346959850865|1346959998054|18663196147|18663196147|18668839208|17326812123|3372579|1866319614700|A|1|||0|147|30686632|KeyAd|1101|38.325.Monitor2.1101#10.40.10.170|10.40.10.40|5060|10.40.10.46|5060|100038455383251101_Monitor2#10.40.10.170|<sip:+18668839208#10.40.10.46:5060>;tag=sansay507370834rdb810| "O'HALLORAN,SEAN" <sip:+17326812123#10.40.10.40;isup-oli=00>;tag=sansay507370829rdb1779|200|0||0|NULL|||3398812|NULL|NULL

Try this:
$ sed -r -e 's/INSERT INTO `temp` VALUES \(//' -e 's/\);$//' -e 's/("[^"]*),([^"]*")/\1\x1\2/g;s/,/|/g;s/\x1/,/g;s/\\([^\])/\1/g' file | sed "s/'|/|/g;s/|'/|/g"
Output:
30686631|1346959848246|1346959850865|1346959998054|18663196147|18663196147|18668839208|17326812123|3372579|1866319614700|A|1|||0|147|30686632|KeyAd|1101|38.325.Monitor2.1101#10.40.10.170|10.40.10.40|5060|10.40.10.46|5060|100038455383251101_Monitor2#10.40.10.170|<sip:+18668839208#10.40.10.46:5060>;tag=sansay507370834rdb810|"O'HALLORAN,SEAN" <sip:+17326812123#10.40.10.40;isup-oli=00>;tag=sansay507370829rdb1779|200|0||0|NULL|||3398812|NULL|NULL
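For readers who want to see what each substitution does, here is a commented sketch of the same idea on a shortened, made-up row. It assumes GNU sed (for -E and the \x01 escape) and one INSERT per line; insert_to_pipes is a hypothetical helper name, and the table-name pattern is generalized as an illustration. Like the one-liner, it only protects one comma per double-quoted span and only strips single quotes that touch a pipe.

```shell
# Sketch only: assumes GNU sed and one INSERT statement per input line.
# insert_to_pipes is a hypothetical helper; it reads INSERT lines on stdin.
insert_to_pipes() {
  # 1-2: strip the "INSERT INTO `table` VALUES (" prefix and the ");" suffix
  # 3:   hide a comma that sits between double quotes behind a \x01 placeholder
  # 4-5: turn the remaining commas into pipes, then restore the hidden comma
  # 6:   drop the backslash from MySQL escapes such as \" and \'
  # 7-8: remove the single quotes that wrap string fields (\| is a literal pipe)
  sed -E \
    -e 's/^INSERT INTO `[^`]+` VALUES \(//' \
    -e 's/\);$//' \
    -e 's/("[^"]*),([^"]*")/\1\x01\2/g' \
    -e 's/,/|/g' \
    -e 's/\x01/,/g' \
    -e 's/\\(.)/\1/g' \
    -e "s/'\|/|/g" \
    -e "s/\|'/|/g"
}

# Made-up, shortened row for illustration.
insert_to_pipes <<'EOF'
INSERT INTO `temp` VALUES (1,'a','','\"O\'X,Y\" <s>',NULL);
EOF
```

Keeping each step in its own -e makes it easier to comment out one substitution at a time when the output looks wrong.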

If Ruby is an acceptable dependency for you, you can leverage its parser by transforming the statement into a valid Ruby array:
script.sh:
#!/bin/bash
# -r to preserve backslashes
read -r statement
# quote the variables so whitespace and glob characters in the data survive
ruby=$(printf '%s' "$statement" | sed -e 's/^.*VALUES //' -e 's/;$//' -e 's/^(/[/' -e 's/)$/]/' -e 's/NULL/"NULL"/g' -e 's/\\"/"/g')
printf '%s' "$ruby" | ruby -rcsv -e 'puts CSV.generate_line(eval($stdin.read), col_sep: "|")'
Usage:
chmod +x script.sh
echo <your statement> | ./script.sh
30686631|1346959848246|1346959850865|1346959998054|18663196147|18663196147|18668839208|17326812123|3372579|1866319614700|A|1|""|""|0|147|30686632|KeyAd|1101|38.325.Monitor2.1101#10.40.10.170|10.40.10.40|5060|10.40.10.46|5060|100038455383251101_Monitor2#10.40.10.170|<sip:+18668839208#10.40.10.46:5060>;tag=sansay507370834rdb810|"""O'HALLORAE,AEAN"" <sip:+17326812123#10.40.10.40;isup-oli=00>;tag=sansay507370829rdb1779"|200|0|""|0|NULL|""|""|3398812|NULL|NULL
This loads as expected in OpenOffice (after setting the delimiter to "|").


sed & regex expression

I'm trying to add a 'chr' prefix to the lines where it is missing. This is only necessary on the lines that do not have '##'.
At first I used grep and sed as follows, but I want the command to overwrite the original file.
grep -v "^#" 5b110660bf55f80059c0ef52.vcf | grep -v 'chr' | sed 's/^/chr/g'
So, to edit the file in place, I wrote this:
sed -i -E '/^#.*$|^chr.*$/ s/^/chr/' 5b110660bf55f80059c0ef52.vcf
This is the content of the vcf file.
##FORMAT=<ID=DP4,Number=4,Type=Integer,Description="#ref plus strand,#ref minus strand, #alt plus strand, #alt minus strand">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 24430-0009S21_GM17-12140
1 955597 95692 G T 1382 PASS VARTYPE=1;BGN=0.00134309;ARL=150;DER=53;DEA=55;QR=40;QA=39;PBP=1091;PBM=300;TYPE=SNP;DBXREF=dbSNP:rs115173026,g1000:0.2825,esp5400:0.2755,ExAC:0.2290,clinvar:rs115173026,CLNSIG:2,CLNREVSTAT:mult,CLNSIGLAB:Benign;SGVEP=AGRN|+|NM_198576|1|c.45G>T|p.:(p.Pro15Pro)|synonymous GT:DP:AD:DP4 0/1:125:64,61:50,14,48,13
chr1 957898 82729935 G T 1214 off_target VARTYPE=1;BGN=0.00113362;ARL=149;DER=50;DEA=55;QR=38;QA=40;PBP=245;PBM=978;NVF=0.53;TYPE=SNP;DBXREF=dbSNP:rs2799064,g1000:0.3285;SGVEP=AGRN|+|NM_198576|2|c.463+56G>T|.|intronic GT:DP:AD:DP4 0/1:98:47,51:9,38,10,41
If I understand your expected result correctly, try:
sed -ri '/^(#|chr)/! s/^/chr/' file
Your question isn't clear and you didn't provide the expected output, so we can't test a potential solution. But if all you want is to add chr to the start of lines that don't already start with chr or #, then that's just:
awk '!/^(#|chr)/{$0="chr" $0} 1' file
To overwrite the original file using GNU awk would be:
awk -i inplace '!/^(#|chr)/{$0="chr" $0} 1' file
and with any awk:
awk '!/^(#|chr)/{$0="chr" $0} 1' file > tmp && mv tmp file
This can be done with a single sed invocation. The script itself is something like the following.
Given input like this:
$ echo -e '#\n#\n123chr456\n789chr123\nabc'
#
#
123chr456
789chr123
abc
prepending chr to the uncommented lines that contain no chr is done as
$ echo -e '#\n#\n123chr456\n789chr123\nabc' | sed '/^#/ {p
d
}
/chr/ {p
d
}
s/^/chr/'
which prints
#
#
123chr456
789chr123
chrabc
(Note the multiline sed script.)
Now you only need to run this script on a file in-place (-i in modern sed versions.)
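Since the -i flag behaves differently between GNU and BSD sed, a portable way to get the in-place effect is to write to a temporary file and move it back. A minimal sketch, assuming lines should be left alone only when they start with # or chr; add_chr_prefix is a hypothetical helper name:

```shell
# Hypothetical helper: prefix "chr" to every line of file $1 that does not
# already start with "#" or "chr", then replace the file with the result.
add_chr_prefix() {
  sed -E '/^(#|chr)/!s/^/chr/' "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Demo on a throwaway file with made-up VCF-like lines.
demo=$(mktemp)
printf '#CHROM\tPOS\n1\t955597\nchr1\t957898\n' > "$demo"
add_chr_prefix "$demo"
cat "$demo"
rm -f "$demo"
```

The && ensures the original file is only replaced when sed succeeded.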

Replace string with another string based on backreference with sed

I'm trying to replace a predefined token %c#, where # can be some number, with another string. The catch is that the other string must be truncated to # characters.
Ideally this set of commands would work:
FORMAT="%c10"
LAST_COMMIT="5189e42b14797b1e36ffb7fc5657c7eea08f1c0f"
echo $FORMAT | sed "s/%c\([0-9]\+\)/${LAST_COMMIT:0:\1}/g"
but clearly there is a syntax error on the \1. You can replace it with a number to see what I'm trying to get as output.
I'm open to using a program other than sed to achieve this, but ideally it should be one that is pretty much native to most Linux installations.
Thanks!
This is my idea.
echo ${LAST_COMMIT} | head -c $(echo ${FORMAT} | sed -e 's/%c//')
Extract the number with sed, then take that many leading characters with head.
EDIT1
This might be better.
echo ${LAST_COMMIT} | head -c $(echo ${FORMAT} | sed -e 's/%c\([0-9]\+\)/\1/')
EDIT2
I wrote a script because the one-liner is hard to follow. Please try this.
$ cat sample.sh
#!/bin/bash
FORMAT="%b-%t-%c10-%c5"
LAST_COMMIT="5189e42b14797b1e36ffb7fc5657c7eea08f1c0f"
## List numbers
lengths=$(echo ${FORMAT} | sed -e "s/%[^c]//g" -e "s/-//g" -e "s/%c/ /g")
## Substitute %cXX to first XX characters of LAST_COMMIT
for n in ${lengths}
do
  to_str=${LAST_COMMIT:0:${n}}
  FORMAT=$(echo ${FORMAT} | sed "s/%c${n}/${to_str}/")
done
## Print result
echo ${FORMAT}
This is the result.
$ ./sample.sh
%b-%t-5189e42b14-5189e
The same thing as a one-liner (same contents, but long and hard to read):
for n in $(echo ${FORMAT} | sed -e "s/%[^c]//g" -e "s/-//g" -e "s/%c/ /g"); do to_str=${LAST_COMMIT:0:${n}}; FORMAT=$(echo ${FORMAT} | sed "s/%c${n}/${to_str}/"); done; echo ${FORMAT}
The value of $LAST_COMMIT gets interpolated before sed runs, so there is no backreference to refer back to yet. There is an /e extension in GNU sed which would support something like this, but I would simply use a slightly more capable tool.
perl -e '$fmt = shift; $fmt=~ s/%c(\d+)/%.$1s/g; printf("$fmt\n", @ARGV)' '%c10' "$LAST_COMMIT"
Of course, if you can let go of your own ad-hoc format string specifier, and switch to a printf-compatible format string altogether, just use the printf shell command straight off.
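As a small illustration of that last point, a printf precision such as %.10s truncates a string argument to at most 10 characters. A minimal sketch, assuming the format string can be written in printf syntax directly:

```shell
LAST_COMMIT="5189e42b14797b1e36ffb7fc5657c7eea08f1c0f"

# %.10s prints at most the first 10 characters of the argument.
short=$(printf '%.10s' "$LAST_COMMIT")
echo "$short"   # 5189e42b14
```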
length=$(echo $FORMAT | sed "s/%c\([0-9]\+\)/\1/g")
echo "${LAST_COMMIT:0:$length}"

Bash print word after match [duplicate]

I have a variable that stores the output of a file. Within that output, I would like to print the first word after Database:. I'm fairly new to regex, but this is what I've tried so far:
sed -n -e 's/^.*Database: //p' "$output"
When I try this, I am getting a sed: can't read prints_output: File name too long error.
Does sed only take in a filename? I am running a hive query to desc formatted table and storing the results in output like so:
output=`hive -S -e "desc formatted table"`
output is then set to the result of that:
...
# Detailed Table Information
Database: sample_db
Owner: sample_owner
CreateTime: Thu Feb 26 23:36:43 PDT 2015
LastAccessTime: UNKNOWN
Protect Mode: None
Retention: 0
Location: maprfs:/some/location
Table Type: EXTERNAL_TABLE
Table Parameters:
...
Superficially, you should be using:
hive -S -e "desc formatted table" |
sed -n -e 's/^.*Database: //p'
This prints whatever follows Database: on the matching line. Once you've got that working, you can trim any unwanted material from the line too.
Alternatively, you could use:
echo "$output" |
sed -n -e 's/^.*Database: //p'
Or, again, given that you're using Bash, you could use:
sed -n -e 's/^.*Database: //p' <<< "$output"
I'd use the first unless you need the whole output preserved for rescanning. Then I'd probably capture the output in a file (with tee):
hive -S -e "desc formatted table" |
tee output.log |
sed -n -e 's/^.*Database: //p'
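To get just the first word after Database:, one option is to let awk split the line into fields. A minimal sketch, assuming the label is the first field on its line; the sample text below stands in for the real hive output:

```shell
# Sample text standing in for the hive output (an assumption for illustration).
output='# Detailed Table Information
Database:           sample_db
Owner:              sample_owner'

# Print the second whitespace-separated field of the first line whose
# first field is exactly "Database:".
db=$(printf '%s\n' "$output" | awk '$1 == "Database:" { print $2; exit }')
echo "$db"
```

Because awk splits on runs of blanks by default, this works whether the label is followed by one space, several spaces, or a tab.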
Try using egrep:
egrep -oh 'Database:[[:blank:]]+[[:alnum:]_]+' <output_file> | awk '{print $2;}'

Log Extract: SED Command

I am trying to extract logs from my application between specific timestamps, so I wrote the following script:
a= echo $1 | sed 's/\//\\\//g';
b= echo $2 | sed 's/\//\\\//g';
sed -n "/$a/,/$b/p" SystemOut.log;
Here a and b are the timestamps which I pass as parameters. When I run the script, sed does not expand the variables.
But if I run the following command in a terminal, it works fine:
sed -n '/6\/30\/14 9:03/,/6\/30\/14 9:04/p' SystemOut.log
Can anyone help?
I am running the script as follows:
sh extract.sh '6/30/14 9:01' '6/30/14 9:03'
Try this way:
a=$(echo $1 | sed 's/\//\\\//g');
b=$(echo $2 | sed 's/\//\\\//g');
sed -n "/$a/,/$b/p" SystemOut.log;
To store the output of a command in a variable, use $(). Writing a= echo ... instead sets a to an empty string for the duration of the echo command and captures nothing.
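The same idea can be tried without the real log. A minimal sketch, where escape_slashes is a hypothetical helper name and the log lines are made up to stand in for SystemOut.log:

```shell
# Hypothetical helper: escape "/" so a timestamp can sit inside a /.../ sed address.
escape_slashes() {
  printf '%s\n' "$1" | sed 's/\//\\\//g'
}

# Command substitution captures the helper's output in the variables.
a=$(escape_slashes '6/30/14 9:01')
b=$(escape_slashes '6/30/14 9:03')

# Made-up log lines standing in for SystemOut.log.
log='[6/30/14 9:00] start
[6/30/14 9:01] first
[6/30/14 9:02] second
[6/30/14 9:03] third
[6/30/14 9:04] end'

# Double quotes let the shell expand $a and $b before sed sees the script.
extracted=$(printf '%s\n' "$log" | sed -n "/$a/,/$b/p")
printf '%s\n' "$extracted"
```

Here the range should cover the three lines from 9:01 through 9:03.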
Use double quotes so that the shell expands the variables, like:
sed -n "/\"$a\"/,/\"$b\"/p" SystemOut.log;

help with grep [[:alpha:]]* -o

file.txt contains:
##w##
##wew##
Using Mac OS X 10.6 with the bash shell, the command:
cat file.txt | grep [[:alpha:]]* -o
outputs nothing. I'm trying to extract the text inside the hash signs. What am I doing wrong?
(Note that it is better practice in this instance to pass the filename as an argument to grep instead of piping the output of cat to grep: grep PATTERN file instead of cat file | grep PATTERN.)
What shell are you using to execute this command? I suspect that your problem is that the shell is interpreting the asterisk as a wildcard and trying to glob files.
Try quoting your pattern, e.g. grep '[[:alpha:]]*' -o file.txt.
I've noticed that this works fine with the version of grep that's on my Linux machine, but the grep on my Mac requires the command grep -E '[[:alpha:]]+' -o file.txt.
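Putting those points together, a quoted pattern with -E and + should behave the same with both GNU and BSD grep. A minimal sketch, where extract_letters is a hypothetical helper name and the input is piped in only so the example is self-contained:

```shell
# Hypothetical helper: print each run of letters on its own line.
# The single quotes stop the shell from globbing the brackets and the plus.
extract_letters() {
  grep -oE '[[:alpha:]]+' "$@"
}

printf '##w##\n##wew##\n' | extract_letters
# prints:
# w
# wew
```

With a real file, extract_letters file.txt passes the filename straight to grep.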
sed 's/#//g' file.txt
/SCRIPTS [31]> cat file.txt
##w##
##wew##
/SCRIPTS [32]> sed 's/#//g' file.txt
w
wew
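If you have bash > 3.1:
# keep the pattern in a variable and leave it unquoted in [[ =~ ]],
# otherwise bash 3.2+ matches it as a literal string
re='^#+(.*)##+$'
while read -r line
do
  case "$line" in
    *"#"* )
      if [[ $line =~ $re ]]; then
        echo "${BASH_REMATCH[1]}"
      fi
      ;;
  esac
done <"file"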