Is it possible to determine the max length of a field in a CSV file using regex?

This has been discussed on Stack Overflow before, but I couldn't find a case or answer that applies to my situation:
From time to time I have raw data in text form to be imported into SQL. In almost every case I have to try the import several times, because the SSIS wizard doesn't know the max size of each field and defaults to 50 characters. Only after it fails can I tell from the error message which field (the first one hit) was truncated, and I then increase that field's size.
More than one field may need its size increased, and the SSIS wizard reports only one truncation error per run, so as you can see this is very tedious. I want a way to quickly inspect the data first and determine the max size of each field.
I came across an old post on Stack Overflow: Here is the post
Unfortunately it might not work in my case: my raw data can have as many as 10 million rows (yes, in one single text file that is over a GB).
I rather doubt there is a way to do this, but I still want to post my question here hoping to get some clue.
Thank you very much.
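For what it's worth, the per-column maximums can be collected in a single streaming pass without regex, which copes fine with a multi-GB file. Below is a minimal C++ sketch; the file name is made up, and it assumes plain comma-separated values with no quoted fields containing embedded commas (adjust the splitting if your data is quoted):

// Stream the file once and track the longest value seen in each column.
#include <cstddef>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main()
{
    std::ifstream in("rawdata.txt");   // hypothetical input file
    std::vector<std::size_t> maxLen;   // one entry per column
    std::string line;

    while (std::getline(in, line)) {
        std::stringstream fields(line);
        std::string field;
        std::size_t col = 0;
        while (std::getline(fields, field, ',')) {
            if (col >= maxLen.size()) maxLen.push_back(0);
            if (field.size() > maxLen[col]) maxLen[col] = field.size();
            ++col;
        }
    }

    for (std::size_t i = 0; i < maxLen.size(); ++i)
        std::cout << "column " << i + 1 << ": max length " << maxLen[i] << '\n';
}

The numbers printed at the end can then be used directly as the column sizes in the SSIS wizard before the first import attempt.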

Related

How to deserialize a file containing multiple records

I've written a Thrift definition, and used this definition to serialize multiple records into one file (I've added the size of the whole record at the beginning of each record). That is, in short, what I have done:
// Serialize one record into an in-memory buffer using the binary protocol.
boost::shared_ptr<apache::thrift::transport::TMemoryBuffer> transport(new apache::thrift::transport::TMemoryBuffer);
boost::shared_ptr<apache::thrift::protocol::TBinaryProtocol> protocol(new apache::thrift::protocol::TBinaryProtocol(transport));
myClass->write(protocol.get());
// Grab the serialized bytes so they can be written to the output file.
const std::string & data(transport->getBufferAsString());
Afterwards I just write the string data out in binary mode. Now I want to deserialize this file again. I wouldn't have any problem if there were only one record in the file; unfortunately I have to write multiple records, so I guess I have to work with offsets based on the size I saved in the file along with each record. However, I can't seem to find any example I can use to achieve my goal, and the official documentation is quite lacking. Does anyone have any tips for me? If I'm missing some information, just ask.
Further information:
Of course I want to use Thrift to deserialize. However, one file can contain multiple records. For example: imagine I have defined a struct in a Thrift definition file that contains car information. Now I serialize multiple car structs into one output file. Serializing is no problem, as I just append the data. If I want to deserialize, however, I have to know where one record starts and the next begins. That is my problem. I don't know how to tell Thrift where one record begins and ends. I've searched the internet but can't seem to find an example for C++ (I found one for Python so far, but am not able to translate it to C++). The structure of one file can be described as follows: [lengthofrecord1][record1][lengthofrecord2][record2][...]
Thanks in Advance
Michael
How about having a list<records> that you de/serialize as a whole? Or is it an absolute requirement to read them independently and randomly? If yes, I see 1,5 (one and a half) possible solutions:
Have a second file as an index. This holds a map<recordNumber, offset>, or simply a sorted list of integer pairs, to quickly locate records. Since this data is much smaller than the records themselves, you can probably cache it in memory the whole time (a rough sketch of this follows below).
The half solution: if the record size is fixed, any record's position can be calculated easily as recordSize * (recordNr - 1). This way you don't even need the size prefix. If you have strings in the record or other variable-sized entities, this will not work, unless you force a fixed record size by reserving a buffer for each record with a predefined (maximum) size. It's a little ugly, hence the "half" solution, but you don't need the index file.
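A rough sketch of the index-file idea from the first option (the function names are invented, and the integer width of the offsets is an assumption):

// While appending records to the data file, remember where each one starts,
// then dump those offsets to a side file that can be cached in memory.
#include <cstdint>
#include <fstream>
#include <vector>

std::vector<std::uint64_t> offsets;   // offsets[n] = start of record n

void rememberOffset(std::ofstream &dataFile)
{
    offsets.push_back(static_cast<std::uint64_t>(dataFile.tellp()));
}

void writeIndex(const char *indexPath)
{
    std::ofstream idx(indexPath, std::ios::binary);
    for (std::uint64_t off : offsets)
        idx.write(reinterpret_cast<const char *>(&off), sizeof(off));
}

To read record N later, load the index, seekg to offsets[N] in the data file and deserialize a single record from there.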
Although maybe not the perfect solution, this seems to work for me:
boost::shared_ptr<apache::thrift::transport::TMemoryBuffer> transport(new apache::thrift::transport::TMemoryBuffer);
boost::shared_ptr<apache::thrift::protocol::TBinaryProtocol> protocol(new apache::thrift::protocol::TBinaryProtocol(transport));
// Point the memory transport at the raw bytes of one record.
transport->resetBuffer((uint8_t*) buffer, sizeOfEntry);
buffer is a char array containing the desired record (I used seekg for the offset) and sizeOfEntry is the record's size. Afterwards I can go on with the automatically generated read method of my Thrift-generated class. In fact I had this solution earlier; I had just messed up my offset, which is why it didn't work.
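For completeness, here is a hedged sketch of the whole read loop built around that resetBuffer call, assuming the [length][record][length][record]... layout described in the question. The Thrift header paths may differ between Thrift versions, the length prefix is assumed to be a raw uint32_t, and ThriftStruct stands in for the generated class:

#include <cstdint>
#include <fstream>
#include <vector>
#include <boost/shared_ptr.hpp>
#include <thrift/protocol/TBinaryProtocol.h>
#include <thrift/transport/TBufferTransports.h>

template <typename ThriftStruct>
void readAllRecords(const char *path, std::vector<ThriftStruct> &out)
{
    using apache::thrift::protocol::TBinaryProtocol;
    using apache::thrift::transport::TMemoryBuffer;

    std::ifstream in(path, std::ios::binary);
    boost::shared_ptr<TMemoryBuffer> transport(new TMemoryBuffer);
    boost::shared_ptr<TBinaryProtocol> protocol(new TBinaryProtocol(transport));

    std::uint32_t sizeOfEntry = 0;
    while (in.read(reinterpret_cast<char *>(&sizeOfEntry), sizeof(sizeOfEntry))) {
        std::vector<std::uint8_t> buffer(sizeOfEntry);
        if (!in.read(reinterpret_cast<char *>(buffer.data()), sizeOfEntry))
            break;                                    // truncated record

        // Point the memory transport at this record only, then let the
        // generated read() deserialize it.
        transport->resetBuffer(buffer.data(), sizeOfEntry);
        ThriftStruct record;
        record.read(protocol.get());
        out.push_back(record);
    }
}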

Comparing two documents

I have two very large lists. They were both originally in Excel. The larger one is a list of about 160,000 emails with other information like name and address, and the smaller one is a list of just 18,000 emails.
My question is: what would be the easiest way to remove from the first document all the rows that contain an email address from the second?
I was thinking regex, or maybe there is another application I can use? I have tried searching online but it seems like there isn't much specific to this. I also tried Notepad++ but it freezes when I try to compare these large files.
-Thank You in Advance!!
Good question. One way I would tackle this is to write a C++ program [you could extrapolate the idea to the language of your choice; you never mentioned which languages you were proficient in] that reads each item of the smaller file into a vector of strings. First, of course, use Excel to save the files as CSV instead of XLS or XLSX, which will comma-separate the values so you can work with them more easily. For the larger list, "Save As" a copy with just the email addresses, deleting the other columns for now.
Then, you could open the larger list and use a nested loop to check if you should output to an output file. Something like:
bool foundMatch = false;
for (size_t y = 0; y < LargeListVector.size(); y++) {
    for (size_t x = 0; x < SmallListVector.size(); x++) {
        if (SmallListVector[x] == LargeListVector[y]) foundMatch = true;
    }
    // Keep the large-list entry only if it never matched a small-list entry.
    if (!foundMatch) OutputVector.push_back(LargeListVector[y]);
    foundMatch = false;
}
That's only a rough sketch, but do you get the idea?
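If the nested loop turns out to be too slow for 160,000 x 18,000 comparisons, a variation on the same idea (file names invented, and it assumes both files have been reduced to one email address per line as suggested above) is to load the small list into a hash set first:

#include <fstream>
#include <string>
#include <unordered_set>

int main()
{
    std::ifstream smallList("remove_these.csv");   // 18,000 addresses, one per line
    std::ifstream largeList("full_list.csv");      // 160,000 addresses, one per line
    std::ofstream output("filtered.csv");

    std::unordered_set<std::string> toRemove;
    std::string line;
    while (std::getline(smallList, line))
        toRemove.insert(line);

    // Keep only the large-list addresses that are not in the small list.
    while (std::getline(largeList, line))
        if (toRemove.find(line) == toRemove.end())
            output << line << '\n';
}

Each large-list line is then checked in roughly constant time instead of being compared against all 18,000 entries.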
So I read a forum post at: Here
=MATCH(B1,$A$1:$A$3,0)>0
Column B would be the large list with the 160,000 entries, and column A was my list of the 18,000 addresses I needed to delete.
I used this to match everything: I pasted this formula in a separate column, and it would print out either an error or TRUE. If the data was in both columns it printed TRUE.
Then, because I suck with Excel, I threw the text into Notepad++ and searched for all lines that contained TRUE (match case, because in my case some of the data had the word "true" in it without caps). I marked those lines, then under Search > Bookmark I removed all bookmarked lines. Pasted that back into Excel and voila.
I would like to thank you guys for helping and pointing me in the right direction :)

Sqlite max columns number configuration from QT

I want to store rows that have 65536 columns in an SQLite database, and I am doing that using C++ and Qt.
My question is: since the default maximum number of columns seems to be 2000 and no more, how do I configure this parameter from C++ and Qt?
Thank you.
The SQLite homepage has some explanation of this:
2. Maximum Number Of Columns
The SQLITE_MAX_COLUMN compile-time parameter is used to set an upper
bound (...)
and
The default setting for SQLITE_MAX_COLUMN is 2000. You can change it
at compile time to values as large as 32767. On the other hand, many
experienced database designers will argue that a well-normalized
database will never need more than 100 columns in a table.
So even if you increased it, you could only achieve half of what you want. Apart from that I can only refer to Styne666's comment on your post.
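As a quick sanity check from C++, the limit that a given SQLite build was compiled with can be read back at run time. Note that sqlite3_limit() can only report or lower the limit; raising it above 2000 really does require recompiling the SQLite sources (e.g. the amalgamation) with something like -DSQLITE_MAX_COLUMN=32767 and linking your Qt application against that build. A minimal sketch using the plain C API, assuming you have direct access to a sqlite3 handle:

#include <cstdio>
#include <sqlite3.h>

int main()
{
    sqlite3 *db = nullptr;
    if (sqlite3_open(":memory:", &db) != SQLITE_OK)
        return 1;

    // Passing -1 as the new value only queries the current limit.
    int maxColumns = sqlite3_limit(db, SQLITE_LIMIT_COLUMN, -1);
    std::printf("This SQLite build allows at most %d columns per table\n", maxColumns);

    sqlite3_close(db);
    return 0;
}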

Coldfusion 8 - Problems with indexing large data using verity

I am currently running ColdFusion 8 with Verity running on a K2 server. I am using a query to index several different columns from my table using cfindex. One of the columns is a large varchar type.
It seems that when the data is being indexed only the first 30KB is stored, so no results are brought back if I search for anything after that point. I tried moving several different phrases and words further up in the data, within the first 30KB, and the results then appear.
I then carried out more Verity tests using the browse command at the command prompt to see what's actually in the collection.
i.e. Coldfusion8\verity\collections\\parts browse 0000001.ddd
I found out that the body being indexed (CF_BODY) never exceeds the size of 32000.
Can anyone tell me if there is a fixed index size per document for Verity?
Many thanks,
Richard
Punch line
Version 6 has operator limits:
up to 32,764 children in one "topic" for the ANY operator
up to 64 children for the NEAR operator
Exceeding these values doesn't necessarily give an error message. Are you certain you don't exceed them when you search?
Source
The Verity documentation, Appendix B: Query limits, says there are two kinds of limitations: search-time limits and operator limits. The quote below is the whole section on the latter, straight from the book.
Verity Query Language and Topic Guide, Version 6.0:
Note the following limits on the use of operators:
There can be a maximum of 32,764 children for the ANY operator. If a topic exceeds
this limit, the search engine does not always return an error message.
The NEAR operator can evaluate only 64 children. If a topic exceeds this limit, the
search engine does not return an error message.
For example, assume you have created a large topic that uses the ACCRUE operator with
8365 children. This topic exceeds the 1024 limit for any ACCRUE-class topic and the
16000/3 limit for the total number of nodes.
In this case, you cannot substitute ANY for ACCRUE, because that would cause the topic
to exceed the 8,000 limit for the maximum number of children for the ANY operator.
Instead, you can build a deeper tree structure by grouping topics and creating some
named subnodes.

"out of memory" exception in CRecordset when selecting a LONGTEXT column from MySQL

I am using CODBCRecordset (a class found on CodeProject) to find a single record in a table with 39 columns. If no record is found, the call to CRecordset::Open is fine. If a record matches the conditions, I get an "out of memory" exception when CRecordset::Open is called. I am selecting all the columns in the query (if I change the query to select only one of the columns with the same WHERE clause, there is no exception).
I assume this is because of some limitation in CRecordset, but I can't find anything telling me of any limitations. The table only has 39 columns.
Has anyone run into this problem? And if so, do you have a work around / solution?
This is an MFC project using Visual Studio 6.0, if it makes any difference.
Here's the query (formatted here so it would show up without a scrollbar):
SELECT `id`, `member_id`, `member_id_last_four`, `card_number`, `first_name`,
`mi`, `last_name`, `participant_title_id`, `category_id`, `gender`,
`date_of_birth`, `address_line_1`, `address_line_2`, `city`, `state`,
`zip`, `phone`, `work_phone`, `mobile_phone`, `fax`, `email`,
`emergency_name`, `emergency_phone`, `job_title`, `mail_code`,
`comments`, `contract_unit`, `contract_length`, `start_date`,
`end_date`, `head_of_household`, `parent_id`, `added_by`, `im_active`,
`ct_active`, `organization`, `allow_members`, `organization_category_id`,
`modified_date`
FROM `participants`
WHERE `member_id` = '27F7D0982978B470C5CF94B1B833CC93F997EE23'
Copying and pasting into my query browser gives me only one result.
More info:
I commented out each column in the select statement except for id. Ran the query: no exception.
Then I systematically went through and uncommented each column, one at a time, re-running the query after each one.
When I uncommented the comments column, I got the error.
That column is defined (in MySQL) as: LONGTEXT
Can we assume you mean you're calling CODBCRecordset::Open(), yes? Or more precisely, something like:
CDatabase db;
db.Open (NULL,FALSE,FALSE,"ODBC;",TRUE);
CODBCRecordSet rs (&db);
rs.Open ("select blah, blah, blah from ...");
EDIT after response:
There are some known bugs with various ODBC drivers that appear to be caused by retrieving invalid field lengths. See these links:
http://forums.microsoft.com/msdn/showpost.aspx?postid=2700779&siteid=1
https://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=296391
This particular one seems to have been because CRecordset will allocate a buffer big enough to hold the field. As the column returns a length of zero, it's interpreted as the max 32-bit size (~2G) instead of the max 8-bit size (255 bytes). Needless to say, it can't allocate enough memory for the field.
Microsoft has acknowledged this as a problem, have a look at these for solutions:
http://support.microsoft.com/kb/q272951/
http://support.microsoft.com/kb/940895/
EDIT after question addenda:
So, given that your MySQL field is a LONGTEXT, it appears CRecordSet is trying to allocate the max possible size for it (2G). Do you really need 2 gig for a comments field? Typing at 80 wpm, 6cpw would take a typist a little over 7 years to fill that field, working 24 h/day with no rest :-).
It may be a useful exercise to have a look at all the columns in your database to see if they have appropriate data types. I'm not saying that you can't have a 2G column, just that you should be certain that it's necessary, especially in light of the fact that the current ODBC classes won't work with a field that big.
Read Pax's response. It gives you a great understanding of why the problem happens.
Workaround:
This error only happens if the field defined as TEXT, LONGTEXT, etc. is NULL (and maybe if it is empty). If there is data in the field, then it only allocates enough for the data actually in the field, not the max size (and it's the max-size allocation that causes the error).
So, if there is a case where you absolutely have to have these large fields, here is a potential solution (sketched in code after the list):
Give the field a default value in the database (e.g. '<blank>').
Then, when displaying the value, you pass NULL/empty if you find the default value.
Then, when updating the value, you pass the default value if you find NULL/empty.
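A hedged sketch of that workaround (the sentinel value, field and helper names are all invented here): translate between the database default and an empty value at the application boundary, so the LONGTEXT column is never actually NULL or empty.

#include <string>

// Database default for the comments column (assumed sentinel value).
static const std::string kDefaultComment = "<blank>";

// Applied after reading a record, before showing it to the user.
std::string displayComment(const std::string &stored)
{
    return stored == kDefaultComment ? std::string() : stored;
}

// Applied before writing a record back to the database.
std::string storeComment(const std::string &edited)
{
    return edited.empty() ? kDefaultComment : edited;
}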
I second Pax's suggestion that this error is due to trying to allocate a buffer big enough to hold the biggest LONGTEXT possible. The client doesn't know how large the data is until it has fetched it.
LONGTEXT is indeed way larger than you would ever need in most applications. Consider using MEDIUMTEXT (max size 16MB) or just TEXT (max size 64KB) instead.
There are similar problems in PHP database interfaces. PHP normally has a memory size limit, and any fetch of a LONGBLOB or LONGTEXT is likely to exceed that limit.