Importing a file with split lines in an Azure Data Factory pipeline - replace

I have a pipe-delimited text file with a header row that I need to import into a SQL Server table (the file is obtained via SFTP). That should be easy enough; however, the input file has rows split over several lines wherever a row's data exceeds 80 characters. The EOL character is a newline, which ADF can cope with just fine.
So, we have something like:
Col1Name|Col2Name|Col3Name|Col4Name
aaaa|bbbbbb|cccc|ddddd
eeeeeee|fffff|gggggg|this is some data that pushes the row over the 80 character li\
mit
hhhhhh|iiiiiii|jjjjjjjj|kk
If none of the data rows were split in this manner, it would be straightforward to shunt the data into the destination table, but I can't work out how to merge the split lines prior to mapping the data to output columns.
Things I have tried/looked at doing:
Using a text file source with pipes as delimiters and newlines as row terminators, then replacing the backslash-and-newline combination with an empty string. Unfortunately, the data has already been parsed into separate rows at that point, so this achieves nothing.
Mucking around with the column/row delimiters to read the file into one big blob and replacing the backslash/newline combos in the blob with an empty string. This doesn't work, as the file gets truncated along the way.
Some combination of an aggregate transformation with a collect() expression to merge the lines. Again, I can't manage this because the lines a row has been split into share no grouping columns on which to perform this sort of aggregation.
Do I need to write an Azure function to pre-process the file and merge the split lines, or is there something I'm missing that would help?
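If a pre-processing step does turn out to be necessary, the merge itself is simple. Below is a minimal sketch in plain Python (not ADF-specific; the file names are placeholders, and inside an Azure Function the source and destination would be blob streams instead), assuming the only continuations are a literal backslash immediately before the newline:

def merge_continuations(lines):
    buf = ""
    for line in lines:
        line = line.rstrip("\n")
        if line.endswith("\\"):
            buf += line[:-1]  # drop the backslash and keep accumulating
        else:
            yield buf + line
            buf = ""
    if buf:  # file ended mid-continuation
        yield buf

with open("input.txt") as src, open("merged.txt", "w") as dst:
    for row in merge_continuations(src):
        dst.write(row + "\n")

Run against the sample above, this joins the two fragments back into a single eeeeeee|fffff|gggggg|... row while leaving the other rows untouched.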

Related

Glue Crawler is removing leading 0's when reading CSV

For brevity's sake I am just going to give a practical example:
Let's say I am ingesting a raw CSV into S3 and one of the columns is SSNO. Within the CSV, SSNO is not wrapped in any quotes. As we all know, SSNOs can have leading 0's: 012345678. When I run my crawler, it creates a schema where SSNO is of type bigint, and because of this it strips the leading zeroes: 12345678.
How do I either:
A) make it not strip zeroes
B) force it to read the columns as a string
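One way to approach option B is to patch the crawled table definition so the column is typed as a string. The sketch below uses boto3; the database, table, and column names are placeholders, and it assumes the crawler is configured not to overwrite manual schema changes on its next run:

import boto3

glue = boto3.client("glue")

table = glue.get_table(DatabaseName="my_db", Name="my_csv_table")["Table"]

# Retype the SSNO column so leading zeros survive (Glue lowercases names)
for col in table["StorageDescriptor"]["Columns"]:
    if col["Name"] == "ssno":
        col["Type"] = "string"

# update_table takes a TableInput, not the full get_table response,
# so rebuild it from the writable fields only
table_input = {
    "Name": table["Name"],
    "StorageDescriptor": table["StorageDescriptor"],
    "PartitionKeys": table.get("PartitionKeys", []),
    "TableType": table.get("TableType", "EXTERNAL_TABLE"),
    "Parameters": table.get("Parameters", {}),
}
glue.update_table(DatabaseName="my_db", TableInput=table_input)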

How to ignore specific characters and new lines using regex

I am trying to validate a CSV file using Apache NiFi.
My CSV file has some defects.
id,name,address
1,sachith,{"Lane":"ABC.RTG.EED","No":"12"}
2,nalaka,{"Lane":"DEF",
"No":"23"}
3,muha,{"Lane":"GRF.FFF","No":"%$&%*^%"}
Here, the second row has been divided into two lines, and the third row contains some special characters.
I want to ignore both of these rows. For that I used \{("\w+":"\w+",)*[^%&*#]*\}, but it captures neither the row-split error nor the new line.
I also tried \{("\w+":"\w+",)*[^%&*#]*\}$, but it doesn't give the right answer either.
This might be what you are looking for: ^[0-9]+,[a-z]+,\{("\w+":"[\w\.]+","\w+":"[a-zA-Z0-9]+")\}$
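To show what that pattern keeps and rejects, here is a quick Python sketch applying it to the sample rows from the question (outside NiFi, purely for illustration):

import re

pattern = re.compile(r'^[0-9]+,[a-z]+,\{("\w+":"[\w\.]+","\w+":"[a-zA-Z0-9]+")\}$')

rows = [
    '1,sachith,{"Lane":"ABC.RTG.EED","No":"12"}',
    '2,nalaka,{"Lane":"DEF",',
    '"No":"23"}',
    '3,muha,{"Lane":"GRF.FFF","No":"%$&%*^%"}',
]

# Keeps only the first row: both fragments of the split row and the
# row with special characters all fail to match
valid = [r for r in rows if pattern.match(r)]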

parsing csv file that has newline characters in one of columns in AWS Athena/ AWS Glue catalog

I have sample data like the below:
id,log,code,sequence
100,sample <(>&<)> O sample ? PILE UP - 3 sample,20,7^M$
101,sample- 4/52$
sample$
CM,21,7^M$
102,sample AT 3PM,22,4^M$
In the second row (id=101), the log column contains newline characters, splitting one record across three lines.
I've enabled ":set list" in the vim editor to make the end-of-line ($) and carriage-return (^M) characters visible.
To handle the newline characters, AWS suggests OpenCSVSerde here.
I tried OpenCSVSerde serialisation with escapeChar=\\, quoteChar=\", separatorChar=,
Nonetheless, it shows the data as 5 rows, whereas I need three rows.
When I query in Athena, id=101 shows only the first line and the rest is missing:
id,log,code,sequence
101,sample- 4/52
Any tips or examples on how to handle multiline values in a CSV column?
I'm exploring custom classifiers, but no luck yet.
According to this doc https://docs.aws.amazon.com/athena/latest/ug/csv.html, OpenCSVSerde does not support line breaks.
I see that you are trying to store some kind of log there.
Your options are:
Clean up the log so it does not include line breaks (see the sketch after this list); or
use RegexSerDe, which is not useful if your log format keeps changing; or
if neither is an option, change your format from CSV to Parquet or something else that has no line-break issues.
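For the clean-up route, one heuristic is to stitch records back together before they land in S3, on the assumption that every complete record has exactly 4 comma-separated fields and that the log text itself contains no commas. A rough Python sketch (file names are placeholders; this is a heuristic, not a general CSV repair):

EXPECTED_FIELDS = 4

def merge_records(lines):
    buf = ""
    for line in lines:
        buf = (buf + " " + line.strip()) if buf else line.strip()
        # a record is complete once it contains enough separators
        if buf.count(",") >= EXPECTED_FIELDS - 1:
            yield buf
            buf = ""
    if buf:
        yield buf

with open("raw.csv") as src, open("clean.csv", "w") as dst:
    for record in merge_records(src):
        dst.write(record + "\n")

On the sample above, the three physical lines of id=101 collapse back into a single record, while the header and the other rows pass through unchanged.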

SAS appends spaces to the end of varbinary data read from Amazon RDS. Can we avoid it?

While reading varbinary data from Amazon RDS, SAS appends spaces to the end of the data.
proc sql;
  select emailaddr from tablename1;
quit;
The column emailaddr is varbinary(20)
For example:
I inserted "XX#WWW.com ", but when reading from the database, SAS pads the value with spaces up to the length of the column.
Since the column length is 20, it returns "XX#WWW.com          " (note the appended spaces). I cannot use the trim() function, since that would also remove spaces that might genuinely be part of the original inserted data.
How can I stop SAS from appending these spaces?
For my program I need the data exactly as it is present in the database, without any extra spaces attached.
That's how SAS works: Base SAS has only a CHAR-equivalent datatype (DS2 is different), with no VARCHAR concept. Whatever the length of the column is (20 here), the value will have 20 characters in total, padded at the end with spaces.
Most of the time it doesn't matter; when SAS inserts into another RDBMS, for example, it will typically treat trailing spaces as nonexistent (so they won't be inserted). You can use TRIM and similar functions to deal with the spaces if you're using regular expressions or concatenation to work with these values; CATS and similar functions perform concatenation-with-trimming.
If trailing spaces are part of your data, you are mostly out of luck in SAS, which considers trailing spaces irrelevant (equivalent to null characters). You can append a non-space character in SQL, translate the spaces to NBSPs ('A0'x) or something else while still in SQL, or put quotes or something around your actual values - but whatever you do will be complicated. A toy illustration of the NBSP idea follows.
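To make the NBSP trick concrete, here is a toy illustration in Python (the translation itself would live in your SQL; the values here are made up, and '\xa0' plays the role of SAS's 'A0'x):

GENUINE = "XX#WWW.com "  # one real trailing space
padded = GENUINE.ljust(20)  # what SAS hands back for a length-20 column

assert padded.rstrip() == "XX#WWW.com"  # naive trimming loses the real space

# Workaround: before SAS sees the value, translate genuine spaces to NBSP;
# any padding added afterwards consists of ordinary spaces only
protected = GENUINE.replace(" ", "\xa0").ljust(20)
restored = protected.rstrip(" ").replace("\xa0", " ")
assert restored == GENUINE  # the genuine trailing space survives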

Matlab - how to extract specific data from a vector

I have some data from a GPS receiver; however, some of the data are corrupted by extra characters. I want to extract the timestamp (the first field) and the data for the $GPGGA and $GPVTG sentences.
To be more clear, here is a sample of the data I have in a cell array:
'1458937887.70818 $GPGGA,200228.90,3555.3269,N,15552.9641,A*25'
'1458937887.709668 $GPVTG,56.740,T,56.740,M,0.069,N,0.127,K,D*2D'
'1458937887.712022 ªDe¾,…´apö$™°%=HfSrîU¾Õ½ôAqö‚>1ÀàHqgu$GPGGA,200229.00,3555.3269,N,15552.9641,C*2B'
'1458937887.714071 $GPVTG,286.847,T,286.847,M,0.028,N,0.051,K,D*28'
As you can see, the problem here is in the third line where some strange characters appear between the timestamp and the data.
Another problem is that sometimes this third line is split into two lines, something like this:
'1458937887.712022 ªDe¾,…´apö$™°'
'%=HfSrîU¾Õ½ôAqö‚>1ÀàHqgu$GPGGA,200229.00,3555.3269,N,15552.9641,D*24'
which makes using regexp very hard.
In summary, I want to format the third line (in both cases) as:
'1458937887.712022 $GPGGA,200229.00,3555.3269,N,15552.9641,D*2R'
Update:
Thanks to @excaza, this solves the first issue (removing the garbage):
regexprep(str, '(?<=\d\s)(.*)(?=\$GPGGA)', '')
As for the second issue, @Suever's question gave me an idea by looking at the format of the data. Is it possible to solve it while reading the data from a .txt file? Something like defining the delimiter to be * followed by two characters and a \n, since all packets end with this pattern?
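That terminator-based idea can at least be prototyped outside the reader. A rough Python sketch (the file name is hypothetical, and it assumes the raw .txt contains the packets without the quotes shown above):

import re

with open("gps_log.txt") as f:
    raw = f.read()

# Treat '*' plus two checksum characters as the record terminator instead
# of '\n', which rejoins packets that were split across lines
records = re.findall(r".*?\*[0-9A-F]{2}", raw, flags=re.DOTALL)
records = [r.replace("\n", "").strip() for r in records]

# Then strip any garbage between the timestamp and the NMEA sentence,
# mirroring the regexprep above
cleaned = [re.sub(r"(?<=\d\s).*?(?=\$GP)", "", r) for r in records]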