If I have a txt file with a certain number of rows and columns (the number of columns is unknown at the start; columns are separated by tabs), how can I export the data into the database? I have managed to iterate through the first row to count the number of columns and create a table accordingly, but now I need to go through each row and insert the data into the respective columns. How can I do that?
Example of the txt file:
Name Size Population GDP
aa 2344 1234 12
bb 2121 3232 15
... ... .. ..
.. .. .. ..
The table has been created:
CREATE TABLE random (id INT, Name char(20), Size INT, Population INT, GDP INT);
The difficult part is reading in the text fields. In your example the field titles appear to be separated by spaces; is this also true for the data rows, or are they really tab-separated?
A generic process is (a shell sketch of the loop follows this outline):
Create an SQL CREATE statement from the header text.
Execute the SQL statement.
While reading a line of text doesn't fail do
Parse the text into variables.
Create an SQL INSERT statement using field names and values from the variables.
Execute the SQL statement.
End-While
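For example, a minimal sketch of that loop in shell, assuming a tab-separated file named textfile.txt with the four columns from the example and PostgreSQL's psql client (the database name mydb is a placeholder; values are not escaped, so this is only suitable for trusted input):
tab="$(printf '\t')"
tail -n +2 textfile.txt | while IFS="$tab" read -r name size population gdp; do
    echo "INSERT INTO random (Name, Size, Population, GDP) VALUES ('$name', $size, $population, $gdp);"
done | psql -d mydb
Here tail -n +2 skips the header row, read parses each line into variables, echo builds the INSERT statement, and psql executes it.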
Another solution is to convert the TXT file into tab- or comma-separated fields. Check your database documentation to see whether there is a function for loading files, and which characters it expects as column separators.
If you need specific help, please ask a more specific or detailed question.
Using PostgreSQL's COPY command, something like:
COPY random FROM 'filename' WITH DELIMITER E'\t';
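Note that COPY reads the file on the database server. If the file only exists on the client machine, psql's \copy meta-command does the same thing client-side; a sketch, where mydb is a placeholder database name, tail -n +2 strips the header row, and the column list skips the table's id column (tab is already the default delimiter for text-format COPY):
tail -n +2 filename | psql -d mydb -c "\copy random (Name, Size, Population, GDP) from stdin"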
Something like this might work. The basic idea is to use print statements to transform each line into an SQL command; you can then execute those commands with an SQL command-line interpreter.
cat textfile.txt | sed "s/^\([^ \t]*\)[ \t]/'\1' /; s/[ \t][ \t]*/,/g" | awk 'NR != 1 { print "INSERT INTO random (Name, Size, Population, GDP) VALUES (" $0 ");" }' > sqlcommands.txt
For the unknown number of columns, this might work:
cat textfile.txt | sed "s/^\([^ \t]*\)[ \t]/'\1' /; s/[ \t][ \t]*/,/g" | awk 'NR != 1 { print "INSERT INTO random VALUES (ID, " $0 ");" }' > sqlcommands.txt
Replace ID with the id value needed, but note that you will need to run this separately for each ID value.
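Once sqlcommands.txt has been generated, feed it to your database's command-line client; for PostgreSQL, for example (mydb is again a placeholder database name):
psql -d mydb -f sqlcommands.txt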
I work with Sybase, where the "bcp" utility does this. A quick Google for "postgres bcp" brings up this:
http://lists.plug.phoenix.az.us/pipermail/plug-devel/2000-October/000103.html
I realize it's not the best answer, but it should be good enough to get you going, I hope.
Oh, and you may need to change your text format to make it comma- or tab-delimited. Use sed for that.
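For example, to turn a tab-separated file into a comma-separated one (a sketch; it assumes the fields themselves contain no commas, and the \t escape requires GNU sed):
sed 's/\t/,/g' textfile.txt > textfile.csv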
Related
I have some huge files containing mixed binary and XML data. I want to extract all values between two XML tags that occur multiple times in the file. The pattern is as follows: <C99><F1>050</F1><F2>random value</F2></C99>. The XML portions are not formatted; everything is on a single line.
I need all values between <F1> and </F1> from <C99> where the value is in the range 050 to 999 (<F1> exists under other fields as well, but I only need the values of F1 from C99). I need to count them, to see how many C99 have an F1 with a value between 050 and 999.
I would like a hint on how I could easily reach and extract those values (using cat and grep? or sed?). Sorting and counting is easy to do once the values are exported to a file.
My temporary solution:
After removing all binary data from the file, I can run the following command:
cat filename | grep -o "<C99><F1>..." > file.txt
This exports the first 12 characters of every string starting with <C99><F1>:
<C99><F1>001
<C99><F1>056
<C99><F1>123
<C99><F1>445
.....
Once exported to a text file, I replace <C99><F1> with nothing and then sort and count the remaining values.
Thank you!
Using XMLStarlet:
$ xml sel -t -v '//C99/F1[. >= 50 and . <= 999]' -nl data.xml | wc -l
Not much of a hint there, sorry.
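If you would rather stick with grep-style tools, as the question suggests, here is a rough sketch. It is not XML-aware and assumes the F1 value is always exactly three digits directly after <C99><F1>, as in the samples:
grep -o '<C99><F1>[0-9]\{3\}' filename | sed 's/<C99><F1>//' | awk '($1 + 0) >= 50 && ($1 + 0) <= 999' | wc -l
Dropping the final wc -l prints the matching values themselves instead of the count.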
I have a list like this
ABC|Hello1
ABC|Hello2
ABC|Hello3
DEF|Test
GHJ|Blabla1
GHJ|Blabla2
And I want it to be this:
ABC|Hello1
DEF|Test
GHJ|Blabla1
So I want to remove duplicates based on the part of each line before the |, and only keep the first occurrence.
A simple way using awk
$ awk -F"|" '!seen[$1]++ {print $0}' file
ABC|Hello1
DEF|Test
GHJ|Blabla1
The trick here is to set the appropriate field separator, "|" in this case, after which the individual columns can be accessed starting with $1. The command maintains a unique-value array, seen, and prints a line only if the value of $1 has not been seen previously.
I would like to use sed to parse a file and print only the last i labels within a field. Each label is separated by a dot (.).
If I select i=3, with a file that contains the following lines:
begin_text|label_n.label_n-1.other_labels.label3.label2.label1|end_text
BEGIN_TEXT|LABEL3.LABEL2.LABEL1|END_TEXT
Begin_Text|Label2.Label1|End_Text
If there are at least 3 labels, I would like the output lines to be:
begin_text|label_n.label_n-1.other_labels.label3.label2.label1|end_text
BEGIN_TEXT|LABEL3.LABEL2.LABEL1|END_TEXT
Currently :
sed 's;\(^[^|]\+\)|.*\.\([^\.]\+\.[^\.]\+\.[^\.]\+\)|\([^|]\+$\);\1|\2|\3;' test.txt
produces:
begin_text|label3.label2.label1|end_text
BEGIN_TEXT|LABEL3.LABEL2.LABEL1|END_TEXT
Begin_Text|Label2.Label1|End_Text
and I do not get why a match occurs for line 3. I also suppose there is a better way to do reverse-order label reading.
Any comment/suggestion is appreciated.
As for why line 3 shows up: the substitution does not actually match it (its middle field has only one dot), but sed prints every input line by default unless you pass -n, so the unmatched line passes through unchanged. Using awk will make the job easier:
awk 'split($2,a,".")>=i' FS="|" i=3 file
begin_text|label_n.label_n-1.other_labels.label3.label2.label1|end_text
BEGIN_TEXT|LABEL3.LABEL2.LABEL1|END_TEXT
Explanation
split(string, array, fieldsep)
split returns the number of elements created, so the line is printed only when field 2 (the text between the two | characters) contains at least i dot-separated labels.
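If you want to stay with sed, another option is to leave the line untouched and simply select lines whose middle field contains at least i-1 dots, remembering to pass -n so that non-matching lines are not echoed. A sketch for i=3 (the \{2,\} means "two or more"; change the 2 to i-1 for other values of i):
sed -n '/^[^|]*|\([^.|]*\.\)\{2,\}[^.|]*|/p' test.txt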
I have a file whose entries are like:
Time;Instance;Database;Status;sheapthres;bp_heap;fcm_heap;other_heap;sessions;sessions_in_exec;locks_held;lock_escal;log_reads;log_writes;deadlocks;l_reads;p_reads;hit_ratio;pct_async_reads;d_writes;a_writes;lock_waiting;sortheap;sort_overflows;pct_sort_overflows;AvgPoolReadTime;AvgDirReadTime;AvgPoolWriteTime;AvgDirWriteTim
02:07:49;SAN33;SAMPLE;Active;0;10688;832;72064;8;0;0%;0;0;0;0;0;0;0%;0%;0;0;0;0;0;0%;0;0;0;0
02:08:09;SAN33;SAMPLE;Active;0;10688;832;72064;8;0;0%;0;0;0;0;0;0;0%;0%;0;0;0;0;0;0%;0;0;0;0
02:08:29;SAN33;SAMPLE;Active;0;10688;832;72064;8;0;0%;0;0;0;0;0;0;0%;0%;0;0;0;0;0;0%;0;0;0;0
and want to convert this in a readable format like:
Time Instance Database
02:07:49 SAN33 SAMPLE
02:08:09 SAN33 SAMPLE
02:08:29 SAN33 SAMPLE
and so on..
I have tried tr -s ";" "\t" but did not get a good result. Can anyone help me with this?
You might want to use column as follows:
column -s\; -t your_file
where -s\; says that your column delimiter is a semicolon (protected with a backslash to avoid interpretation by the shell). See also Command line CSV viewer?.
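If you only want the first three columns, as in the desired output, you could combine it with cut, something like:
cut -d';' -f1-3 your_file | column -s';' -t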
How about a more Unix-aware variant:
cat <your file> | sed 's/;/\t/g'
Solaris and HP-UX users note: instead of \t character typing, use Ctrl+V and then TAB key sequence.
I have a CSV containing a list of 500 members with their phone numbers. I tried diff tools but none of them seem to find duplicates.
Can I use regex to find duplicate rows by members' phone numbers?
I'm using Textmate on Mac.
Many thanks
What duplicates are you searching for? The whole lines or just the same phone number?
If it is the whole line, then try this:
sort phonelist.txt | uniq -c | sort -n
and at the bottom you will see all lines that occur more than once.
If it is just the phone number in some column, then use this:
awk -F ';' '{print $4}' phonelist.txt | sort | uniq -c | sort -n
replace the '4' with the number of the column with the phone number and the ';' with the real separator you are using in your file.
Or give us a few example lines from this file.
EDIT:
If the data format is: name,mobile,phone,uniqueid,group, then use the following:
awk -F ',' '{print $3}' phonelist.txt | sort | uniq -c | sort -n
in the command line.
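If you also want to see the duplicated rows themselves rather than just the counts, an awk sketch (again assuming the phone number is the third comma-separated column):
awk -F ',' 'seen[$3]++' phonelist.txt
This prints every row whose phone number has already appeared on an earlier row.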
Yes. For one way to do it, look here. But you would probably not want to do it this way.
You can simply parse the file and check which rows are duplicated. I think a regex is the worst solution for this problem.
What language are you using? In .NET, with little effort you could load the CSV file into a DataTable and find/remove the duplicate rows. Afterwards, write your DataTable back to another CSV file.
Heck, you can load this file into Excel, sort by a field, and find the duplicates manually. 500 isn't THAT many.
Use Perl.
Load the CSV file into an array, pull the column you want to check (the phone numbers) into a second array, then reduce that second array to its unique values using:
my %seen;
my @unique = grep { !$seen{$_}++ } @array2;
After that, loop over the unique array (the phone numbers), and inside that loop iterate over array #1 (the lines). Compare each line's phone number with the value from the unique array, and when it matches, output that line to another CSV file.