Convert columns and rows in readable format - regex

I have a file whose entries are like:
Time;Instance;Database;Status;sheapthres;bp_heap;fcm_heap;other_heap;sessions;sessions_in_exec;locks_held;lock_escal;log_reads;log_writes;deadlocks;l_reads;p_reads;hit_ratio;pct_async_reads;d_writes;a_writes;lock_waiting;sortheap;sort_overflows;pct_sort_overflows;AvgPoolReadTime;AvgDirReadTime;AvgPoolWriteTime;AvgDirWriteTim
02:07:49;SAN33;SAMPLE;Active;0;10688;832;72064;8;0;0%;0;0;0;0;0;0;0%;0%;0;0;0;0;0;0%;0;0;0;0
02:08:09;SAN33;SAMPLE;Active;0;10688;832;72064;8;0;0%;0;0;0;0;0;0;0%;0%;0;0;0;0;0;0%;0;0;0;0
02:08:29;SAN33;SAMPLE;Active;0;10688;832;72064;8;0;0%;0;0;0;0;0;0;0%;0%;0;0;0;0;0;0%;0;0;0;0
and want to convert this in a readable format like:
Time Instance Database
02:07:49 SAN33 SAMPLE
02:08:09 SAN33 SAMPLE
02:08:29 SAN33 SAMPLE
and so on..
I have tried tr -s ";" "\t" but did not get a good result. Can anyone help me with this?

You might want to use column as follows:
column -s\; -t your_file
where -s\; says that your column delimiter is a semicolon (protected with a backslash to avoid interpretation by the shell). See also Command line CSV viewer?.
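If you only want the first three columns, as in the desired output, you could first cut them out and then align them (a sketch, using the same file):
cut -d';' -f1-3 your_file | column -s';' -t
which gives something like:
Time      Instance  Database
02:07:49  SAN33     SAMPLE
02:08:09  SAN33     SAMPLE
02:08:29  SAN33     SAMPLE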

How about a more Unix-aware variant:
cat <your file> | sed 's/;/\t/g'
Solaris and HP-UX users note: instead of typing \t, press Ctrl+V and then the TAB key to insert a literal tab character.
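If your shell is bash, ksh93 or zsh, ANSI-C quoting is another way to hand sed a literal tab without the Ctrl+V trick:
sed $'s/;/\t/g' your_file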

add characters each two places within sed

I am working with CSV files; they are seismic catalogs from a database. I need to arrange them in the USGS format in order to move on to the next steps.
My input data format is:
DatesT,Latitude,Longitude,Magnitude,Depth,Catalog
1909,7,23,170000,-17.430,-66.349,5.1,0,PRE-GEM-ISC
1913,12,14,024500,-17.780,-63.170,5.6,0,PRE-GEM-ISC
The USGS input format is:
DatesT,Latitude,Longitude,Magnitude,Depth,Catalog
1909-7-23T17:00:00,-17.430,-66.349,5.1,0,PRE-GEM-ISC
1913-12-14T02:45:00,-17.780,-63.170,5.6,0,PRE-GEM-ISC
To "convert" my input to USGS format I did the following steps:
archi='catalog.txt'
sed 's/,/-/1' $archi > temp1.dat # change the first "," to "-"
sed 's/,/-/1' temp1.dat > temp2.dat # same again, for the next ","
sed 's/,/T/1' temp2.dat > temp3.dat # add a T between the date and the time
sed -i.bak "1 s/^.*$/DatesT,Latitude,Longitude,Magnitude,Depth,Catalog/" temp3.dat # restore the header
I have the following output:
DatesT,Latitude,Longitude,Magnitude,Depth,Catalog
1909-7-23T170000,-17.430,-66.349,5.1,0,PRE-GEM-ISC
1913-12-14T024500,-17.780,-63.170,5.6,0,PRE-GEM-ISC
I tried to implement the following command:
sed 's/.\{13\}/&: /g' temp3.dat > temp4.dat
Unfortunately it did not work as I expected, because the insertion point is not at the same position on every line.
Do you have any idea to improve my code?
One way using GNU sed:
sed -r 's/([0-9]{4}),([0-9]{1,2}),([0-9]{1,2}),([0-9]{2})([0-9]{2})([0-9]{2})(,.*)/\1-\2-\3T\4:\5:\6\7/' file
You split each line into individual tokens: the first column is token one, the second column is token two, and so on; when it comes to the fourth column, you take two digits at a time as a token, and then substitute as required.
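Applied to the sample input above, the command leaves the header alone (it does not match the pattern) and produces:
DatesT,Latitude,Longitude,Magnitude,Depth,Catalog
1909-7-23T17:00:00,-17.430,-66.349,5.1,0,PRE-GEM-ISC
1913-12-14T02:45:00,-17.780,-63.170,5.6,0,PRE-GEM-ISC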
You can do:
cat initialfile.csv|perl -p -e "s/^(\d{4}),(\d+),(\d+),(\d{2})(\d{2})(\d{2}),([0-9.-]+),([0-9.-]+),(.*)$/\1-\2-\3T\4:\5:\6,\7,\8,\9/g"
or, for an in-place edit:
perl -p -i -e "s/^(\d{4}),(\d+),(\d+),(\d{2})(\d{2})(\d{2}),([0-9.-]+),([0-9.-]+),(.*)$/\1-\2-\3T\4:\5:\6,\7,\8,\9/g" initialfile.csv
which should output the USGS format.
This might work for you (GNU sed):
sed -E '1!s/^([^,]*),([^,]*),([^,]*),(..)(..)/\1-\2-\3T\4:\5:/' file
Forget about the header.
Replace the first and second field delimiters (all fields are delimited by a comma ,) with a dash -.
Replace the third field delimiter with a T.
Split the fourth field into three equal parts and separate each part by a colon :.
N.B. The last part of the fourth field will stay as is and so does not need to be defined.
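For comparison, an awk sketch that does the same splitting, assuming the six-field layout shown above:
awk -F, 'NR==1{print; next}
         {printf "%s-%s-%sT%s:%s:%s", $1, $2, $3, substr($4,1,2), substr($4,3,2), substr($4,5,2)
          for (i=5; i<=NF; i++) printf ",%s", $i
          print ""}' file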
Sometimes as programmers we become too focused on data and would be better served by looking at the problem as an artist and coding what we see.

Command to replace single space with tabs in a text file

So I have a text file, and the problem is that both spaces and tabs are used to separate column values, which is causing me a lot of issues due to the inconsistency. For example:
ID Name Class Time
1 Johnson 5-D 6pm
.
.
.
As you can see in the example, ID and Name are separated by a single space, while Name and Class are separated by a tab.
How do I write a sed command to replace every single space in the text file with a tab? I would want a new text file generated, looking something like:
ID Name Class Time
1 Johnson 5-D 6pm
The alignment doesn't matter at this point, I just want to replace the single spaces with tabs.
Edit: an awk script is welcome too.
With tr:
tr ' ' '\t' <inputfile >outputfile
This replaces each space character with a tab.
If you need it, you can also replace a sequence of multiple spaces with a single tab using
tr -s ' ' '\t' <inputfile >outputfile
Use the column command:
column -t file
From man column:
-t      Determine the number of columns the input contains and create a table.
        Columns are delimited with whitespace, by default, or with the characters
        supplied using the -s option. Useful for pretty-printing displays.
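With the example above, this produces something like:
ID  Name     Class  Time
1   Johnson  5-D    6pm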
This might do the job: sed -i -e 's/\s/\t/g' filename.txt. Note that \s matches tabs as well as spaces, which is harmless here since everything ends up as a tab.
Using perl to get its advanced regular expressions (in particular, lookbehind and lookahead assertions to match a single space with no other whitespace on either side):
perl -pi -e 's/(?<=\S) (?=\S)/\t/g' input.txt
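A quick way to check the result (drop the -i so the file is left untouched) is to make the tabs visible, e.g. with GNU cat:
perl -pe 's/(?<=\S) (?=\S)/\t/g' input.txt | cat -A
Each inserted tab shows up as ^I.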

How to use awk and grep on 300GB .txt file?

I have a huge .txt file, 300GB to be more precise, and I would like to put all the distinct strings from the first column that match my pattern into a different .txt file.
awk '{print $1}' file_name | grep -o '/ns/.*' | awk '!seen[$0]++' > test1.txt
This is what I've tried, and as far as I can see it works fine, but the problem is that after some time I get the following error:
awk: program limit exceeded: maximum number of fields size=32767
FILENAME="file_name" FNR=117897124 NR=117897124
Any suggestions?
The error message tells you:
line 117897124 has too many fields (more than 32767).
You'd better check that line:
sed -n '117897124{p;q}' file_name
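To count how many whitespace-separated fields that line actually has, without running into the same awk limit, something like this should work:
sed -n '117897124{p;q}' file_name | grep -oE '[^[:space:]]+' | wc -l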
Use cut to extract 1st column:
cut -d ' ' -f 1 < file_name | ...
Note: You may change ' ' to whatever your field separator is; cut's default delimiter is a tab ($'\t').
The 'number of fields' is the number of 'columns' in an input line, so if one of the lines contains a very large number of fields, that could cause this error.
I suspect that the awk and grep steps could be combined into one:
sed -n 's/\(^pattern...\).*/\1/p' some_file | awk '!seen[$0]++' > test1.txt
That might evade the awk problem entirely (that sed command replaces the entire line with just the leading text that matches the pattern, and prints the line only when the substitution succeeds).
It seems to me that your awk implementation has an upper limit of 117,897,124 on the number of records it can read in one go. The limits can vary according to your implementation and your OS.
Maybe a sane way to approach this problem is to program a custom script that uses split to split the large file into smaller ones, with no more than 100,000,000 records each.
In case you don't want to split the file, maybe you could look for the limits file corresponding to your awk implementation. Perhaps you can set the number-of-records limit to unlimited, although I believe that is not a good idea, as you might end up using a lot of resources...
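A minimal sketch of that splitting step (the chunk size and the file_part_ prefix are just placeholders):
split -l 100000000 file_name file_part_
You would then run the original pipeline on each file_part_* chunk and de-duplicate the combined output once more at the end.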
If you have enough free space on disk (Vim creates a temporary .swp file), I suggest using Vim. Vim's regex syntax differs slightly, but you can convert from standard regex to Vim regex with this tool: http://thewebminer.com/regex-to-vim
The error message says your input file contains too many fields for your awk implementation. Just change the field separator to be the same as the record separator and you'll only have 1 field per line and so avoid that problem, then merge the rest of the commands into one:
awk 'BEGIN{FS=RS} {sub(/[[:space:]].*/,"")} /\/ns\// && !seen[$0]++' file_name
If that's a problem then try:
awk 'BEGIN{FS=RS} {sub(/[[:space:]].*/,"")} /\/ns\//' file_name | sort -u
There may be an even simpler solution but since you haven't posted any sample input and expected output, we're just guessing.

Extracting Last column using VI

I have a CSV file which contains some 1000 fields with values; the headers are something like below:
v1,v2,v3,v4,v5....v1000
I want to extract the last column, i.e. v1000, and its values.
I tried %s/,[^,]*$//, but this turns out to be the exact opposite of what I expected. Is there any way to invert this expression in vi?
I know it can be done using awk, as awk -F "," '{print $NF}' myfile.csv, but I want to make it happen in vi with a regular expression. Please also note that I have vi, not Vim, and am working on UNIX, so I can't use the visual-mode trick either.
Many thanks in advance; any help is much appreciated.
Don't you just want
%s/.*,\s*//
.*, matches everything up to the last comma (the match is greedy, so it reaches the last one), and the \s* is there to remove whitespace if it's there.
You already accepted an answer, but by the way, you can still use awk or other nice UNIX tools within vi or Vim. The technique below is known as filtering the contents of a buffer through an external command, :!{cmd}.
As a demo, let's rearrange the records in a CSV file with the sort command:
first,last,email
john,smith,john#example.com
jane,doe,jane#example.com
:2,$!sort -t',' -k2
The -k2 flag sorts the records by the second field.
Extracting the last column with awk is as easy as:
:%!awk -F "," '{print $NF}'
Don't forget cut!
:%!cut -d , -f 6
Where 6 is the number of the last field.
Or if you don't want to count the number of fields:
:%!rev | cut -d , -f 1 | rev
rev reverses each line, so the last comma-separated field becomes the first; cut takes that field, and the second rev restores its characters to their original order.

How can I write data from txt file to database?

If I have a txt file with a certain number of rows and columns (the number of columns is unknown at the beginning; columns are separated by tabs), how can I export the data into the database? I have managed to iterate through the first row to count the number of columns and create a table accordingly, but now I need to go through each row and insert the data into the respective columns. How can I do that?
Example of the txt file:
Name Size Population GDP
aa 2344 1234 12
bb 2121 3232 15
... ... .. ..
.. .. .. ..
The table has been created:
CREATE TABLE random (id INT, Name char(20), Size INT, Population INT, GDP INT)
The difficult part is reading in the text fields. According to your definition, the field titles are separated by spaces. Is this true for the text fields?
A generic process is:
Create an SQL CREATE statement from the header text.
Execute the SQL statement.
While reading a line of text doesn't fail do
Parse the text into variables.
Create an SQL INSERT statement using field names and values from the variables.
Execute the SQL statement.
End-While
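A rough shell sketch of that loop, assuming tab-separated input like the example above and that only the first column is text (data.txt and inserts.sql are placeholder names; adjust the quoting per column as needed):
awk -F'\t' -v q="'" '
NR == 1 { cols = $1; for (i = 2; i <= NF; i++) cols = cols "," $i; next }
{
    vals = q $1 q                                   # quote the text column
    for (i = 2; i <= NF; i++) vals = vals "," $i    # numeric columns as-is
    print "INSERT INTO random (" cols ") VALUES (" vals ");"
}' data.txt > inserts.sql
The resulting inserts.sql can then be run through your database's command-line client.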
Another solution is to convert the TXT file into tab or comma separated fields. Check your database documentation to see if there is a function for loading files and also discover the characters used for separating columns.
If you need specific help, please ask a more specific or detailed question.
Using PostgreSQL's COPY command, something like:
COPY random FROM 'filename' WITH DELIMITER '\t'
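For example, with the column list spelled out and the tab written as an escape sequence (the path is a placeholder; the file must be readable by the database server, otherwise use psql's client-side \copy):
COPY random (Name, Size, Population, GDP) FROM '/path/to/data.txt' WITH (FORMAT csv, DELIMITER E'\t', HEADER);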
Something like this might work.
The basic idea is to use print statements to transform each line into SQL commands; then you can execute these commands using an SQL command interpreter.
cat textfile.txt | sed "s/^\([^ \t]*\)[ \t]/'\1'\t/; s/[ \t][ \t]*/,/g" | awk '(NR!=1) {print "INSERT INTO random (Name, Size, Population, GDP) VALUES (" $0 ");" }' > sqlcommands.txt
For an unknown number of columns, this might work:
cat textfile.txt | sed "s/^\([^ \t]*\)[ \t]/'\1'\t/; s/[ \t][ \t]*/,/g" | awk '(NR!=1) {print "INSERT INTO random VALUES (ID," $0 ");" }' > sqlcommands.txt
Replace ID with the id value needed, but you will need to execute it separately for each ID value.
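The generated file can then be fed to the database's command-line client, for example (assuming PostgreSQL and a database named mydb, both placeholders):
psql -d mydb -f sqlcommands.txt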
I work with Sybase, where the "bcp" utility does this. A quick Google search for "postgres bcp" brings up this:
http://lists.plug.phoenix.az.us/pipermail/plug-devel/2000-October/000103.html
I realize it's not the best answer, but it's good enough to get you going, I hope.
Oh, and you may need to change your text format to make it comma- or tab-delimited. Use sed for that.