How to read first few rows from CSV using univocity parser - univocity

How to stop parsing after reading few rows from CSV file using iterator/row processor in univocity parser?
Update #1
I tried the below code and I'm getting empty rows.
val parserSettings = new CsvParserSettings
parserSettings.detectFormatAutomatically()
parserSettings.setEmptyValue("")
parserSettings.setNumberOfRecordsToRead(numberOfRecordsToRead)
val parser = new CsvParser(parserSettings)
val input = new FileInputStream(path)
val rows = parser.parseAll(input)
Update #2
Before passing inputstream to parser, I was using Apache Tika to detect the MIME type of the file to detect whether the file is CSV.
new Tika().detect(input)
This was altering the inputstream. Due to that Univocity parser was unable to parse correctly.

You have many different options:
From your row processor just call context.stop().
On the parser settings, you can set settings.setNumberOfRecordsToRead(10) to read 10 rows and stop.
With the parser itself, call parser.stopParsing()
Hope this helps

Related

Write rows from list to specific columns

I have gone through similar questions but am having trouble fitting this to my needs. I am reading a csv, creating a list and appending the list to a seperate csv.
with open('in_table.csv', 'rb') as vo:
next(vo) # skip header row
reader = csv.reader(vo)
vo_list = list(reader)
print vo_list
with open('out_table.csv', 'ab') as f:
cf = csv.writer(f)
for row in vo_list:
cf.writerow(row)
I need to write the list starting at the second column and not the first, as the first column will contain separate information. What is the simplest way to do this?
Realistically I have another input CSV exactly like the first one and I need to put them both into the output file into a total of 4 columns. Like so:
Column1, join_count1, grid_id1, join_count2, grid_id2
Blah, 0, U24, 3, U24
I would go with the built-in csv package. Also, you are opening CSV files as binary files, was that intentional? CSVs should be text files by definition, but if yours are binary then please correct the flags below:
import csv
with open("out_table.csv", "a+") as out_file:
writer = csv.writer(out_file)
with open("in_table.csv") as in_file:
reader = csv.reader(in_file)
next(reader) # skip the header
for oid, join_count, grid_id in reader:
writer.writerow([join_count, grid_id])

How to normalize fields delimited by colon thats into a single column in informatica cloud

I need help to normalize the field "DSC_HASH" inside a single column delimeted by colon.
Input:
Outuput:
I achieved what I needed with java transformation:
1) In java transformation I created 4 output columns: COD1_out, COD2_out, COD3_out and DSC_HASH_out
2) Then I put the following code:
String [] column_split;
String column_delimiter = ";";
String [] column_data;
String data_delimiter = ":" ;
Column_split = DSC_HASH.split(column_delimiter);
COD1_out = COD1;
COD2_out = COD2;
COD3_out = COD3;
for (int I =0; i < column_split.length; i++){
column_data = column_split[i].split(data_delimiter);
DSC_HASH_out = column_data[0];
generateRow();
}
There are no generic parsers or loop construct in Informatica that can take one record and output an arbitrary number of records.
There are some ways you can bypass this limitation:
Using the Java Transformation, as you did, which is probably the easiest... if you know Java :) There may be limitations to performance or multi-threading.
Using a Router or a Normalizer with a fixed number of output records, high enough to cover all your cases, then filter out empty records. The expressions to extract fields are a bit complex to write (an maintain).
Using the XML Parser, but you have to convert your data to XML before, and design an XML schema. For example your first line would be changed in (on multiple lines for readability):
<e><n>2320</n><h>-1950312402</h></e>
<e><n>410</n><h>103682488</h></e>
<e><n>4301</n><h>933882987</h></e>
<e><n>110</n><h>-2069728628</h></e>
Using SQL Transformation or Stored Procedure Transformation to use database standard or custom functions, but that would result in an SQL query for each input row, which is bad performance-wise
Using a Custom Transformation. Does anyone want to write C++ for that ?
The Java Transformation is clearly a good solution for this situation.

Univocity parser using routines ignores the LongCoversion using defaultNullRead attribute?

I have below field configuration:
#Parsed(field="TEST_ID", defaultNullRead="000000")
private Long testId
now when the input file (csv parsing) contains value as NULL, it is not converting to default long value of 0, rather throws LongConversion exception for "NULL"
e.g. row in csv file: (5th column containing NULL is an issue)
7777|ab|444|PENDING|NULL|VESRION|TEST|11
I am using csvRoutines for parsing the input csv file
NULL in your input is actually text and not Java's null. You need to tell the parser to translate the string NULL to java null.
Add the following annotation (you can give it more than one string that represents null:
#NullString(nulls = {"NULL", "N/A", "?"})
Hope this helps

Null Byte appending while reading the file through Python pandas

I have created a script which will give you the match rows between the two files. Post that, I am returning the output file to a function, which will be used the file as input to create pivot using pandas.
But somehow, something seems to be wrong, below is the code snippet
def CreateSummary(file):
out_file = file
file_df = pd.read_csv(out_file) ## This function is appending NULL Bytes at
the end of the file
#print file_df.head(2)
The above code is giving me the error as
ValueError: No columns to parse from file
Tried another approach:
file_df = pd.read_csv(out_file,delim_whitespace=True,engine='python')
##This gives me error as
_csv.Error: line contains NULL byte
Any suggestions and criticism is highly appreciated.

How to Alter returned data format called from csv using python

I am looking for guidance regarding a return result FORMAT from a csv file. The code I have to date partially ahcieves my objective but despite significant effort researching through this and many other sites/forums I cannot resolve the final step. I have also posed this question on gis.stackexchange but was redirected to this forum with the comment "Questions relating to general Information Technology, with no clear GIS component, are off-topic here, but can be researched/asked at Stack Overflow".
My successful piece of python code that reads selected data from a csv and returns it in dict format is below ; (Yes I know the reason it returns as type dict is due to the format my code is calling!!! and that is the crux of the problem)
import arcpy, csv
Att_Dict ={}
with open ("C:/Data/Code/Python/Library/Peter/123.csv") as f:
reader = csv.DictReader(f)
for row in reader:
if row['Status']=='Keep':
Att_Dict.update({row['book_id']:row['book_ref']})
print Att_Dict
Att_Dict = {'7643': '7625', '9644': '2289', '4406': '4443', '7588': '9681', '2252': '7947'}
For the next part of my code to run I need the result above but in the format of ; (this is part of a very lengthy code but the only show stopper is the returned format so little value in posting the other 200 or so lines)
Att_Dict = [[7643, 7625], [9644, 2289], [4406, 4443], [7588, 9681], [2252, 7947]]
Although I have experimented endlessly and can achieve this by reverting to csv.Reader rather than csv.DictReader, I then lose the ability to 'weed out' rows where column 'Status' has value 'Keep' in them and that is a requirement for the task at hand.
My sledgehammer approach to date has been to use 'search and replace' within Idle to amend the returned set to the meet the other requirement but Im sure it can be done programatically rather than manually. Similar but not exact to https://docs.python.org/2/library/index.html, plus my startout question at Returning values from multiple CSV columns to Python dictionary? and Using Python's csv.dictreader to search for specific key to then print its value plus a multitude of csv based questions at geonet.esri.
(Using Win 7, ArcGIS 10.2, Python 2.7.5)
Try this
Att_Dict = {'7643': '7625', '9644': '2289', '4406': '4443', '7588': '9681', '2252': '7947'}
Att_List = []
for key, value in Att_Dict.items():
Att_List.append([int(key), int(value)])
print Att_List
Out: [[7643, 7625], [9644, 2289], [4406, 4443], [7588, 9681], [2252, 7947]]