Skip first line in CSV read on Kettle

Hello, I am trying to skip the first line of a CSV file when I import it into Kettle (Pentaho PDI 8.1.0).
The first line contains the separator declaration:
sep=;
The second line contains the headers. Because of the first line, the Get Fields button reads only two fields: the first is sep= and the second has no name.
I tried setting the number of header lines to 2, escaping sep=, and setting the Document header lines option to 1 in order to skip the first line, but the Get Fields button still does not recognize the headers.
Does anyone have another idea?

Get Fields will always look at the first line, so you will need to enter the field list by hand.
You were on the right track: set the header lines to 2 and the step will read the data correctly.
If you need to parse the separator declaration itself, you will have to read the file once to determine its structure, then use metadata injection to read it a second time for the data.

Related

How to read a specific line from a text file in C++?

I am writing a C++ program that displays item codes with their corresponding
item descriptions and prices. It asks the user to enter the code of the item
purchased by a customer and looks for a matching item code stored in items.txt.
How can I output only a specific line from the text file after the user inputs the item code?
You need to read the file line by line (std::getline), extract the code (depending on the exact format, e.g. by searching for whitespace in the string), compare it, and return the corresponding line on a match; see the sketch at the end of this answer.
It is not possible to access lines of a text file directly by index or content.
This is assuming that you mean the file contains lines in the form
code1 item1
code2 item2
//...
If the code is just the index of the line, then you only need to call std::getline in a loop with a counter for the current line index.
If you do this multiple times on the same file, you should probably parse the whole content line by line into a std::vector<std::string> or a std::(unordered_)map<std::string, std::string> or something similar first, to avoid the costly repeated iteration.
Depending on the use case, it might be even better to parse the data into a database first and then query the database, even if it is only something lightweight such as SQLite.
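For illustration, here is a minimal sketch of the getline-and-compare approach, assuming the items.txt format shown above (a code, whitespace, then the rest of the line); the function name is illustrative:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Search the file line by line for a line whose first
// whitespace-separated token equals the requested code.
// Returns the full matching line, or an empty string if not found.
std::string findItemLine(const std::string& filename, const std::string& code) {
    std::ifstream in(filename);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream iss(line);
        std::string firstToken;
        if (iss >> firstToken && firstToken == code)
            return line;
    }
    return ""; // no match found
}

int main() {
    std::string code;
    std::cout << "Enter item code: ";
    std::cin >> code;

    std::string match = findItemLine("items.txt", code);
    if (!match.empty())
        std::cout << match << '\n';
    else
        std::cout << "No such item code.\n";
}

For repeated lookups, loading everything into a std::unordered_map<std::string, std::string> keyed by the code, as suggested above, avoids re-reading the file each time.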

How to ignore a specific character and newline using regex

I am trying to validate a CSV file using Apache NiFi.
My CSV file has some defects.
id,name,address
1,sachith,{"Lane":"ABC.RTG.EED","No":"12"}
2,nalaka,{"Lane":"DEF",
"No":"23"}
3,muha,{"Lane":"GRF.FFF","No":"%$&%*^%"}
Here the second row has been split across two lines, and the third row contains some special characters.
I want to ignore both of these rows. For that I used \{("\w+":"\w+",)*[^%&*#]*\}, but it does not catch the row-split error or the newline.
I also tried \{("\w+":"\w+",)*[^%&*#]*\}$, but that does not give the right answer either.
This might be what you are looking for: ^[0-9]+,[a-z]+,\{("\w+":"[\w\.]+","\w+":"[a-zA-Z0-9]+")\}$
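The question is about NiFi (which uses Java's regex engine), but the pattern itself uses only common constructs, so it can be exercised in any ECMAScript-compatible engine. Here is a sketch in C++ with std::regex showing which of the sample rows the pattern accepts; the data is taken from the question, with the split row appearing as two fragments:

#include <iostream>
#include <regex>
#include <string>
#include <vector>

int main() {
    // The suggested pattern: numeric id, lowercase name, then a
    // two-key JSON-like object with alphanumeric values.
    const std::regex rowPattern(
        R"(^[0-9]+,[a-z]+,\{("\w+":"[\w\.]+","\w+":"[a-zA-Z0-9]+")\}$)");

    std::vector<std::string> rows = {
        R"(1,sachith,{"Lane":"ABC.RTG.EED","No":"12"})",  // valid
        R"(2,nalaka,{"Lane":"DEF",)",                     // split row: rejected
        R"("No":"23"})",                                  // split row: rejected
        R"(3,muha,{"Lane":"GRF.FFF","No":"%$&%*^%"})"     // special chars: rejected
    };

    for (const auto& row : rows)
        std::cout << (std::regex_match(row, rowPattern) ? "keep:   " : "reject: ")
                  << row << '\n';
}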

Parsing a CSV file that has newline characters in one of its columns in AWS Athena / AWS Glue catalog

I have sample data like below:
id,log,code,sequence
100,sample <(>&<)> O sample ? PILE UP - 3 sample,20,7^M$
101,sample- 4/52$
sample$
CM,21,7^M$
102,sample AT 3PM,22,4^M$
In the second row (id=101), the log column contains newline characters, splitting one record across three lines.
I have enabled ":set list" in the vim editor to show the newline ($) and end-of-line (^M) characters.
To handle newline characters, AWS suggested OpenCSVSerde here.
I tried OpenCSVSerde serialization with escapeChar=\\, quoteChar=\", separatorChar=,
Nonetheless, it shows the data as 5 rows, whereas I need three rows.
When I query it in Athena, id=101 shows only the first line and the rest is missing:
id,log,code,sequence
101,sample- 4/52
Any tips or examples on how to handle multiline values in a CSV file column?
I'm exploring custom classifiers, but no luck yet.
According to this doc https://docs.aws.amazon.com/athena/latest/ug/csv.html, OpenCSVSerDe does not support line breaks.
I see that you are trying to store some kind of log there.
Your options are:
Clean up the log so that it does not include line breaks (see the sketch after this list). Or,
use RegexSerDe, which is not useful if your log format keeps changing. Or,
if neither is an option, change your format from CSV to Parquet or something else that has no line-break issues.
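For the clean-up option, here is a minimal C++ pre-processing sketch. It assumes every valid record has exactly four comma-separated fields (id, log, code, sequence) and that commas never occur inside field values; incomplete lines are joined with the following line(s) until the field count is reached. Both assumptions, and the file names, are illustrative, not guaranteed by the question:

#include <algorithm>
#include <fstream>
#include <iostream>
#include <string>

// Estimate how many fields the buffered record has so far,
// assuming commas never appear inside field values.
static std::size_t fieldCount(const std::string& record) {
    return std::count(record.begin(), record.end(), ',') + 1;
}

int main() {
    std::ifstream in("input.csv");   // hypothetical input file
    std::ofstream out("fixed.csv");  // hypothetical output file
    const std::size_t expectedFields = 4; // id, log, code, sequence

    std::string line, record;
    while (std::getline(in, line)) {
        // Join continuation lines with a space instead of the stray newline.
        record += record.empty() ? line : " " + line;
        if (fieldCount(record) >= expectedFields) {
            out << record << '\n';
            record.clear();
        }
    }
    if (!record.empty())
        out << record << '\n'; // flush any trailing partial record
}

After such a pass, the id=101 record becomes a single physical line, which a line-based SerDe should then read as one row.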

Reading a line of a text file from a specific position in C++

I would like to read a text file in C++ in the following manner:
Ignore the entire first line as it is simply meant as an introduction.
Only read the following lines from a specific position.
The starting position for reading is fixed and remains the same for every line; however, the numbers after it may be of variable length. I need to save all of these numbers from line 2 to line n into an array.
At the moment I can read a regular 2D array with getline.
How can I work around these things?
An example for a line I want to read could be:
Person1: 25 988.3 0.0023 7
To set the file to a position, use std::ifstream::seekg().
To set the file to the beginning of a line, you must read and count the line endings, since many text files have variable-length lines.
How can I work around these things?
You can't, unless you can ensure that all of the data lines after the first line are the same length.
If you can't ensure that, then all you can do is read through all of the preceding lines.
An alternative I have employed in the past is to generate an 'index' of line-start positions in a secondary file in binary format (so that I CAN jump directly to the right place in that file), and use it to jump to the right place in the text file. Of course, that means you need to regenerate the index file every time you replace or amend the data file.
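Since the starting column is fixed, the simplest approach is the sequential one: skip the introduction line, then cut each remaining line at the known offset and parse the numbers. A minimal sketch, assuming a hypothetical offset of 9 characters (the length of "Person1: " in the example) and a hypothetical file name:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

int main() {
    std::ifstream in("data.txt");   // hypothetical file name
    const std::size_t startPos = 9; // fixed column where the numbers begin

    std::string line;
    std::getline(in, line);         // ignore the introductory first line

    std::vector<std::vector<double>> numbers;
    while (std::getline(in, line)) {
        if (line.size() <= startPos) continue; // skip short or blank lines
        std::istringstream iss(line.substr(startPos));
        std::vector<double> row;
        double value;
        while (iss >> value)
            row.push_back(value);   // variable count of numbers per line
        numbers.push_back(std::move(row));
    }

    std::cout << "Read " << numbers.size() << " data lines\n";
}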

Index a text file (lines with different sizes) in C++

I have to extract information from a text file.
In the text file there is a list of strings.
This is an example of a string: AAA101;2015-01-01 00:00:00;0.784
The value after the last ; is a non-integer value that changes from line to line, so every line has a different number of characters.
I want to map all of these lines into a structured vector so that I can access a specific line at any time without scanning the whole file again.
I did some research and found some threads about a function called seekg(), which would let me reach a specific position in a text file, but I read that it only works for jumping to a line if every line has the same character length.
I was thinking about converting all the lines in the file to a fixed-length format so I could map the file as I want, but I hope there is a better and quicker way.
You can try TStringList (from C++Builder's VCL). It creates a list of AnsiStrings; each AnsiString can then be accessed via ->operator [](numberOfTheLine).
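TStringList is specific to C++Builder/VCL. With the standard library alone, the same line-indexing idea looks roughly like the sketch below, which also splits each line at the ; separators into a small struct; the struct and field names are illustrative:

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

struct Record {
    std::string id;        // e.g. "AAA101"
    std::string timestamp; // e.g. "2015-01-01 00:00:00"
    double value;          // e.g. 0.784
};

int main() {
    std::ifstream in("data.txt"); // hypothetical file name
    std::vector<Record> records;

    std::string line;
    while (std::getline(in, line)) {
        std::istringstream iss(line);
        Record r;
        std::string valueText;
        // Split the line at the two ';' separators.
        if (std::getline(iss, r.id, ';') &&
            std::getline(iss, r.timestamp, ';') &&
            std::getline(iss, valueText)) {
            r.value = std::stod(valueText);
            records.push_back(r);
        }
    }

    // Random access by line index, without rescanning the file.
    if (records.size() > 2)
        std::cout << records[2].id << " -> " << records[2].value << '\n';
}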