Is it possible to access a specific one (using a specific index) in a text file without knowing the size of each record?
If you maintain a separate index of record offsets then you can simply consult it for the appropriate location to seek to. Otherwise, no.
If the records happen to be sorted on a convenient key and you can identify where one record ends and the next begins, you can implement a binary or interpolation search; you may even be able to retrofit this onto your text-file format to aid lookups. Otherwise, you're stuck with serial searches from a position whose index you already know (the start of the file is one such position; if you know the total number of records, you can also work backwards from the end of the file). Alternatively, consider doing one pass to create an index that allows direct access, or having the file embed a list of offsets that can be read cheaply.
Check out the dbopen() function. If you pass DB_RECNO as the type parameter you can access variable-length records. These records can be delimited by newlines. Essentially your "database" is a flat text file.
The API will conveniently handle inserts and deletes for you as well.
Related
I have a source schema in which a particular record is optional, and in the source message instance the record does not exist. I need to map this record to a destination record. The scenario is: if the source record doesn't exist, I need to map a default value of 0 to the destination nodes; if it does exist, I need to pass the source node values through as-is (followed by a few arithmetic operations).
I have tried various combinations of functoids: Logical Existence followed by Value Mapping, Record Count, String Existence, etc. I have also tried C# within a Scripting functoid, and XSLT; nothing works. It's very tough to deal with mapping non-existent records. I have several records above this one that are mapped just fine, and they do exist; I'm having trouble only with this one. No matter how many combinations of C# and XSLT code I write, it feels like the Scripting functoid will never accept a non-existent record or node link. Mind you, this record, if it exists, can repeat multiple times.
Using BizTalk Server 2013 R2.
If the record doesn't exist (the record is not coming at all, not even as <record/>), you can use this simple combination of functoids.
Link the record to Logical Existence; if it exists, its value will be sent by the top Value Mapping. If it doesn't exist, the second condition will be true and the zero will be sent from the Value Mapping at the bottom.
I could split the file contents up into separate search documents but then I would have to manually identify this in the results and only show one result to the user - otherwise it will look like there are 2 files that match their search when in fact there is only one.
Also the relevancy score would be incorrect. Any ideas?
So the response from AWS support was to split the files up into separate documents. In response to my concerns regarding relevancy scoring and multiple hits they said the following:
You do raise two very valid concerns for this more challenging use case. With regard to relevance, you already face a significant problem in that it is harder to establish a strong 'signal' and degrees of differentiation with large bodies of text. If your documents are much like reports or whitepapers, a potential workaround may be to index the first X characters (or the first identified paragraph) into a "thesis" field. This field could be weighted to better indicate the document's subject matter without manual review.
With regard to result duplication, this will require post-processing on your end if you wish to filter it. You can create a new field holding a unique "parent" id that is shared by every chunk of the whole document. The post-processing can then check whether this parent id has already been returned (the first result should be treated as the most relevant) and, if it has, filter out the subsequent results. What is doubly useful in such a scenario is that you can include a refinement link in your results that filters on all matches within that particular parent id.
I have a std::map data structure
key:data pair
which I need to store in a binary file.
key is an unsigned short value, and is not sequential
data is another big structure, but is of fixed size.
This map is managed based on user actions: add, modify, or delete. And I have to keep the file updated every time I update the map, so that the data survives a system crash.
Adding can always be done at the end of the file. But, user can modify or delete any of the existing records.
That means I have to randomly access the file to update that modified/deleted record.
My questions are:
Is there a way I can reach the modified record in the file directly, without sequentially searching through all the records? (Max record count is 5000.)
On a delete, how do I remove it from the file and move the next record into the deleted record's position?
Appreciate your help!
Assuming you have no need for the tree structure of std::map and just need an associative container, the most common way I've seen to do this is with two files: one for the keys and one for the data. The key file contains all of the keys along with the corresponding offset of their data in the data file. Since you said the data is all the same size, updating should be easy (it won't change any of the offsets). Adding is done by appending. Deleting is the only hard part: you can delete the key to remove it from the database, but it's up to you whether to keep track of "freed" data sections and try to write over them. To keep track of the keys, you might want another associative container (map or unordered_map) in memory holding the location of each key in the key file.
Edit: For example, the key file might be (note that offsets are in bytes)
key1:0
key2:5
and the corresponding data file would be
data1data2
This is a pretty tried-and-true pattern, used in everything from Hadoop to high-speed local databases. To get an idea of the persistence complications you might face, I would highly recommend reading this Redis blog post; it taught me a lot about persistence when I was dealing with similar issues.
I have several (1-5) very wide (~50,000 columns) .csv files. The files are 0.5 GB-1 GB in size (average around 500 MB). I need to perform a join on the files on a pre-specified column. Efficiency is, of course, the key. Any solution that can be scaled out to efficiently allow multiple join columns is a bonus, though not currently required. Here are my inputs:
-Primary File
-Secondary File(s)
-Join column of Primary File (name or col. position)
-Join column of Secondary File (name or col. position)
-Left Join or Inner Join?
Output = 1 File with results of the multi-file join
I am looking to solve the problem using a C-based language, but of course an algorithmic solution would also be very helpful.
Assuming that you have a good reason not to use a database (for all I know, the 50,000 columns may constitute such a reason), you probably have no choice but to clench your teeth and build yourself an index for the right file. Read through it sequentially to populate a hash table where each entry contains just the key column and an offset in the file where the entire row begins. The index itself then ought to fit comfortably in memory, and if you have enough address space (i.e. unless you're stuck with 32-bit addressing) you should memory-map the actual file data so you can access and output the appropriate right rows easily as you walk sequentially through the left file.
Your best bet by far is something like SQLite; there are C++ bindings for it, and it's tailor-made for lightning-fast inserts and queries.
For the actual reading of the data, you can just go row by row and insert the fields into SQLite; no need for cache-destroying objects of objects :) As an optimization, you should group multiple inserts into one statement (insert into table(...) select ... union all select ... union all select ...).
If you need to use C or C++, open the file and load the data directly into a database such as MySQL. C and C++ do not have adequate data-table structures or functionality for manipulating this kind of data. A spreadsheet application might be useful, but may not be able to handle these capacities.
That said, I recommend objects for each field (column). Define a record (file specific) as a collection of fields. Read a text line from a file into a string. Let the record load the field data from the string. Store records into a vector.
Create a new record for the destination file. For each record from the input file(s), load the new record using those fields. Finally, for each record, print the contents of each field with separation characters.
An alternative is to whip up a 2 dimensional matrix of strings.
Your performance bottleneck will be I/O. You may want to read huge blocks of data at a time. The thorn in the side of efficiency here is the variable record length of a CSV file.
I still recommend using a database. There are plenty of free ones out there, such as MySQL.
It depends on what you mean by "join". Are the columns in file 1 the same as in file 2? If so, you just need a merge sort, and most likely a solution based on merge sort is "best". But I agree with @Blindy above that you should use an existing tool like SQLite. Such a solution is probably more future-proof against changes to the column lists.
I'm writing a JMeter script and I have a huge CSV file with a bunch of data which I use in my requests. Is it possible to start not from the first entry but from the 5th or nth entry?
Looking at the CSVDataSet, it doesn't seem to directly support skipping to a given row. However, you can emulate the same effect by first executing N loops that just read from the data set and do nothing with the data, followed by a loop containing your actual tests. It's been a while since I've used JMeter, but for this approach to work, you must share the same CSVDataSet between both loops.
If that's not possible, there is an alternative: in your main test loop, use a Counter and an If Controller. The Counter counts up from 1. The If Controller contains your tests, with the condition ${Counter}>N, where N is the number of rows to skip. ("Counter" in the expression is whatever you set the counter's "Reference Name" property to.)
mdma's 2nd idea is a clean way to do it, but here are two other options that are simple, but annoying to do:
Easiest:
Create separate CSV files starting where you want in the data, deleting the rows you don't need. I would create a separate CSV Data Set Config element for each CSV file, and then just disable the ones you don't want to run.
Less Easy:
Create a new column in your CSV file called "ignore". In the rows you want to skip, enter the value "True". In your test plan, create an If Controller that is the parent of your requests, with the condition "${ignore}"!="True" (include the quotes, and note that 'True' is case-sensitive). This will skip the requests when the 'ignore' column has a value of 'True'.
Both methods require modifying the CSV file, but method two has other applications (like excluding a header row) and can be fast if you're using Open Office, Excel, etc.