Hadoop process file with different field delimiters


What are the options for processing a text file that mixes different field delimiters in the same file and uses a row delimiter other than newline?
Some fields in the file can be fixed length and some can be separated by a character.
Example:
100 xyz |abc#hello#200 xyz1 |abc1#world
In this example, 100 is the first field value, xyz is the second, abc is the third, and hello is the fourth. | and # are the delimiters for the third and fourth fields. The rows are also separated by #.
A MapReduce, Pig, or Hive solution is fine.
One option may be a MapReduce job that configures a custom row delimiter, reads each row whole, and then parses it. But does any InputFormat accept a custom delimiter?

You can override the record delimiter and set it to #. After that, load each record as a line and replace the '|' and '#' characters with a space, so that all the fields are separated by ' '. Then use STRSPLIT to get the individual fields.
SET textinputformat.record.delimiter '#';
A = LOAD 'data.txt' AS (line:chararray);
-- REPLACE takes a regular expression, so '|' has to be escaped as '\\|'
B = FOREACH A GENERATE REPLACE(REPLACE(line,'\\|',' '),'#',' ') AS line;
C = FOREACH B GENERATE STRSPLIT(line,' ',4);
DUMP C;
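
To make the record layout concrete, here is a minimal Python sketch of the same parsing outside Hadoop. It assumes, based only on the single example in the question, that every row looks like `<field1> <field2> |<field3>#<field4>` and that rows are joined by `#`. The names `HEAD` and `parse_records` are made up for this illustration.

```python
import re

# Assumed layout (taken from the one example in the question):
#   <field1> <field2> |<field3>#<field4>   with rows joined by '#'
HEAD = re.compile(r"(\S+)\s+(\S+)\s*\|(.*)")

def parse_records(text):
    """Split the text on '#' and pair the pieces up into 4-field rows."""
    pieces = text.split("#")
    if pieces and pieces[-1] == "":
        pieces.pop()  # tolerate a trailing '#'
    records = []
    # Even-indexed pieces hold fields 1-3, odd-indexed pieces hold field 4.
    for head, fourth in zip(pieces[0::2], pieces[1::2]):
        match = HEAD.fullmatch(head.strip())
        if match is None:
            raise ValueError(f"unexpected row prefix: {head!r}")
        records.append((match.group(1), match.group(2), match.group(3), fourth))
    return records

sample = "100 xyz |abc#hello#200 xyz1 |abc1#world"
for record in parse_records(sample):
    print(record)
```

Because `#` both ends the third field and separates rows, the sketch pairs consecutive `#`-separated pieces instead of treating every `#` as a row boundary. Real data with fixed-width fields would need slicing by position rather than the whitespace match used here.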

You could try Hive with RegexSerDe

Related

Create table Athena ignore comma in the row values

I am creating a table in Athena using below scripts
CREATE EXTERNAL TABLE `itcfmetadata`(
`itcf id` string,
`itcf control name` string,
`itcf control description` string,
`itcf process` string,
`standard` string,
`controlid` string,
`threshold` string,
`status` string,
`date reported` string,
`remediation (accs specific)` string,
`aws account id` string,
`aws resource id` string,
`aws account owner` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION
's3://itcfmetadata/'
TBLPROPERTIES (
'skip.header.line.count'='1');
The S3 source file is a CSV file that was converted from an Excel file, so it does not contain plain comma-separated values; it is more like an Excel file. The problem appears when a column contains text like "Hi, How are you". Because of the comma, the value is split and "Hi" and "How are you" become two separate values. How can I avoid this using the CREATE script above?

Try using
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
instead of ROW FORMAT DELIMITED.
The DELIMITED deserializer splits on every delimiter you provide. The CSV deserializer only treats a delimiter as a separator when it is outside a pair of double quotes (").
See the docs: https://docs.aws.amazon.com/athena/latest/ug/csv-serde.html
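
The difference between the two behaviours can be illustrated locally with Python's standard `csv` module. This is only an analogy for quote-aware parsing, not a reproduction of what Athena does internally, and the sample row below is hypothetical.

```python
import csv
import io

# Hypothetical row: the second column contains a comma inside double quotes.
row = 'ITCF-1,"Hi, How are you",open\n'

# Naive splitting on every comma, similar in spirit to a plain delimiter split.
naive = row.rstrip("\n").split(",")

# Quote-aware parsing: commas inside double quotes stay part of the field.
aware = next(csv.reader(io.StringIO(row)))

print(len(naive), naive)
print(len(aware), aware)
```

The naive split yields four pieces, while the quote-aware reader yields the three intended fields. Note that this only helps if the source file actually wraps such values in double quotes.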

Adding a space within a line in file with a specific pattern

I have a file with some data as follows:
795 0.16254624E+01-0.40318151E-03 0.45064186E+04
I want to add a space before the third number using search and replace as
795 0.16254624E+01 -0.40318151E-03 0.45064186E+04
The regular expression for the search is \d - \d, but what should I write in the replace field to get the output above? I have over 4000 similar lines and cannot do this manually. Can it also be done in Python?

One option is to use findall to collect the numbers and then join the matches with a space, which returns a string where the values are separated by single spaces.
[+-]?\d+(?:\.\d+E[+-]\d+)?\b
import re
regex = r"[+-]?\d+(?:\.\d+E[+-]\d+)?\b"
test_str = "795 0.16254624E+01-0.40318151E-03 0.45064186E+04"
matches = re.findall(regex, test_str)
print(" ".join(matches))
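
Since the question asks about search and replace, a substitution-based variant may be closer to what was wanted. The sketch below uses a zero-width pattern that matches the position between a digit and a following `-<digit>`, so the exponent signs (which follow `E`, not a digit) are left alone. The names `GAP` and `add_missing_space` are made up for this example.

```python
import re

# Zero-width match: a position preceded by a digit and followed by "-<digit>".
# The '-' in "E-03" is preceded by 'E', so it is not affected.
GAP = re.compile(r"(?<=\d)(?=-\d)")

def add_missing_space(line):
    """Insert a space wherever a negative number is glued to the previous one."""
    return GAP.sub(" ", line)

line = "795 0.16254624E+01-0.40318151E-03 0.45064186E+04"
print(add_missing_space(line))
```

For a whole file, the same function can be applied to each line while copying to a new file. In an editor that supports capture groups, searching for `(\d)(-\d)` and replacing with the two groups separated by a space should have the same effect, although the backreference syntax varies between editors.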

You could do it very easily in MS Excel.
Copy the content of your file into a new Excel sheet, in one column.
Select the complete column and, from the Data ribbon, select Text to Columns.
A wizard dialog will appear. Select Fixed width, then Next.
Click on the location where you want to add the new space, to tell Excel to split the text at that position into a new column, and click Next.
Select each column header and, under Column data format, select Text to keep all formatting, then click Finish.
You can then copy the new columns or export them to a new text file.

Line breaking issue to move csv file in Linux

I moved a CSV file to a Linux system in binary mode. The content of one field, the comments field, is split across multiple lines. I need to remove the newlines while keeping the rest of the format the same. Please help with a shell or Perl command.
Here is an example with three records, showing what the data actually looks like.
Original content of the file
After moving the file to Linux, the comments field is split into 4 lines. I want to keep the comment field in the same format, but without the newline characters:
"First line
Second line
Third line
all lines format should not change"

As I said in my comment above, the specs are not clear but I suspect this is what you are trying to do. Here's a way to load data into Oracle using sqlldr where a field is surrounded by double-quotes and contains linefeeds where the end of the record is a combination carriage return/linefeed. This can happen when the data comes from an Excel spreadsheet saved as a .csv for example, where the cell contains the linefeeds.
Here's the data file as exported by Excel as a .csv and viewed in gvim, with the option turned on to show control characters. You can see the linefeeds as the '$' character and the carriage returns as the '^M' character:
100,test1,"1line1$
1line2$
1line3"^M$
200,test2,"2line1$
2line2$
2line3"^M$
Construct the control file like this using the "str" clause on the infile option line to set the end of record character. It tells sqlldr that hex 0D (carriage return, or ^M) is the record separator (this way it will ignore the linefeeds inside the double-quotes):
LOAD DATA
infile "test.dat" "str x'0D'"
TRUNCATE
INTO TABLE test
replace
fields terminated by ","
optionally enclosed by '"'
(
cola char,
colb char,
colc char
)
After loading, the data looks like this with linefeeds in the comment field (I called it colc) preserved:
SQL> select *
2 from test;
COLA COLB COLC
-------------------- -------------------- --------------------
100 test1 1line1
1line2
1line3
200 test2 2line1
2line2
2line3
SQL>
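
If the goal is instead what the question literally asks for, removing the line breaks inside the quoted field before using the file, a quote-aware CSV reader can do it. The following Python sketch is an alternative to the shell or Perl command requested, not part of the sqlldr approach above. It assumes the embedded line breaks only occur inside double-quoted fields, and the function name `flatten_embedded_newlines` is made up for this example.

```python
import csv
import io

def flatten_embedded_newlines(csv_text, replacement=" "):
    """Rewrite CSV text so that line breaks inside quoted fields are replaced."""
    # newline="" keeps \r and \n untouched so the csv module can interpret them.
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        cleaned = [
            field.replace("\r\n", replacement)
                 .replace("\n", replacement)
                 .replace("\r", replacement)
            for field in row
        ]
        writer.writerow(cleaned)
    return out.getvalue()

# Same shape as the data file shown above: LF inside quotes, CRLF at record end.
sample = '100,test1,"1line1\n1line2\n1line3"\r\n200,test2,"2line1\n2line2\n2line3"\r\n'
print(flatten_embedded_newlines(sample), end="")
```

For a real file, open it with `newline=""` and pass the file object to `csv.reader` in the same way. The replacement character is a design choice; an empty string would join the lines without a space.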

Load file in pig based on whitespace

I am trying to load a file in Pig in which words may be separated by spaces or tabs (possibly more than one). Is there a way to delimit the file load using a regex for whitespace? Or is there any other way to achieve the output below?
Input:
COUNTESS This young gentlewoman had a father,--O, that
Output:
COUNTESS
This
young
gentlewoman
had
a
father,--O,
that
It would be great to have a comma delimiter also, but that would make it more complex. For now, only the whitespace delimiter should work for me.

Load each row as a single line and then use TOKENIZE. If you have a mixture of tabs and spaces, add a step after loading that replaces the tabs with spaces in the line, and then use TOKENIZE.
A = LOAD 'test2.txt' AS (line:chararray);
B = FOREACH A GENERATE FLATTEN(TOKENIZE(line));
C = FOREACH B GENERATE TOBAG(*);
DUMP C;

I don't really know Pig, but here's some info:
https://pig.apache.org/docs/r0.9.1/func.html#strsplit
STRSPLIT(string, regex, limit)
The regex could be something like [\s,]+, which splits on any run of whitespace and commas. For instance, a b,c ,d, e would split into the individual letters; the order of spaces and commas does not matter.
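
The behaviour of that pattern can be checked quickly outside Pig. The Python sketch below applies the same regular expression with `re.split`; it only demonstrates the pattern and assumes Pig's regex engine treats `[\s,]+` the same way for these inputs. The helper name `split_words` is made up.

```python
import re

def split_words(line, pattern=r"[\s,]+"):
    """Split on runs of the given delimiter pattern, dropping empty pieces."""
    return [piece for piece in re.split(pattern, line) if piece]

# Whitespace and commas as delimiters, as suggested above.
print(split_words("a b,c ,d, e"))

# Whitespace only (spaces and tabs, any count), as in the question.
question_line = "COUNTESS This  young\tgentlewoman had a father,--O, that"
print(split_words(question_line, r"\s+"))
```

With the whitespace-only pattern, `father,--O,` stays together as in the expected output of the question; adding the comma to the character class would break it apart.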

Regex to parse file where records are delimited with new line, fields with comma, but both comma new line can be in strings

I have to parse a string in VB.NET which has the following structure:
1. records are separated by a new line
2. each record has a fixed number of fields, separated by commas
3. fields can be quoted (strings) or not quoted (other types of data: date, int, etc.)
4. comment fields (strings) can contain both new lines and commas
So, due to point 4, a comma or new line must be ignored as a field/record separator if it falls between an odd and an even quote (e.g. between quotes 1 and 2 it is inside a comment field and must be ignored, but between quotes 2 and 3 it is a field/record delimiter).
I can write manual parsing code for this, but I think a regex could be more reliable. However, I have very limited experience with regex.
Example string
(record 1)
10,"Test",10.1,,,"123"
(record 2)
20,"Test, has comma
and new line",,2.1,,"aaa"
So actual string is
10,"Test",10.1,,,"123"
20,"Test, has comma
and new line",,2.1,,"aaa"
EDIT:
I need to add more clarifications:
1. records can have more or fewer than 4 fields
2. fields can be empty
So an actual test input string can be
10,"Test",10.1,,,"123"
20,"Test, has comma
and new line",,2.1,,"aaa"
So apparently the problem should be split in two:
Extract records (where the new line is not between quotes)
For each record, extract fields (delimited by commas not between quotes)
How should I split the regex (or use two regexes) to match this?
Thanks

I don't know how to eliminate the redundancy for the expression for each field, but the following appears to work for your example, per this test:
("[^"]*"|[^",\n]+),("[^"]*"|[^",\n]+),("[^"]*"|[^",\n]+),("[^"]*"|[^",\n]+)
If you use a repeating group, the match will only be retained for the last instance. If anyone knows how to get around this duplication, I'd be interested.
Update: If you know something about the type of each positional field (e.g. whether it's a quoted string, integer, float, etc.) you can of course adjust the regex accordingly.
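
As an alternative to a single regex, the odd/even quote rule from the question can be written as a small state machine that handles empty fields and any number of fields per record. The sketch below is in Python rather than VB.NET, purely to show the logic. It assumes quotes are never escaped inside a field, and the function name `parse_quoted_csv` is made up.

```python
def parse_quoted_csv(text):
    """Split text into records and fields, ignoring ',' and newline inside quotes."""
    records, fields, buf = [], [], []
    in_quotes = False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes      # odd quote opens, even quote closes
        elif ch == "," and not in_quotes:
            fields.append("".join(buf))    # end of field
            buf = []
        elif ch == "\n" and not in_quotes:
            fields.append("".join(buf))    # end of field and of record
            buf = []
            records.append(fields)
            fields = []
        else:
            buf.append(ch)                 # ordinary character (or quoted , / newline)
    if buf or fields:                      # last record without a trailing newline
        fields.append("".join(buf))
        records.append(fields)
    return records

sample = '10,"Test",10.1,,,"123"\n20,"Test, has comma\nand new line",,2.1,,"aaa"'
for record in parse_quoted_csv(sample):
    print(record)
```

The surrounding quotes are dropped from the field values here; if the distinction between quoted and unquoted fields matters, the quote characters can be kept in `buf` instead.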
