Finding and debugging bad records using Hive - MapReduce

Is there any way to pinpoint a bad record when loading data into a table using Hive, or while processing the data?
The scenario goes like this.
Suppose I have a file with 1 million records in it that needs to be loaded as a Hive table, delimited by the '|' symbol.
Suppose that after processing half a million records I encounter a problem. Is there any way to debug it, or to precisely pinpoint the record or records causing the issue?
If my question is not clear, please let me know.
I know MapReduce can skip bad records (with a kind of tolerated-failure percentage). I would like to know how to get the same behaviour from the Hive side.
Thanks in advance.
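One low-tech way to pinpoint malformed records, before (or instead of) relying on MapReduce's skip mechanism, is to scan the raw file for rows whose field count doesn't match the table schema. A minimal Python sketch, assuming a '|'-delimited file named data.txt and a 10-column table (both hypothetical):

EXPECTED_FIELDS = 10   # assumption: the target table has 10 columns

with open('data.txt', encoding='utf-8', errors='replace') as f:   # hypothetical file name
    for line_no, line in enumerate(f, start=1):
        fields = line.rstrip('\n').split('|')
        if len(fields) != EXPECTED_FIELDS:
            # report the offending line number, its field count and a preview
            print('line {}: {} fields -> {}'.format(line_no, len(fields), line[:120]))

The reported line numbers can then be pulled out of the file (for example with sed -n '123p' data.txt) to inspect the exact records.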

Related

ClientError: Unable to parse csv: rows 1-1000, file

I've looked at the other answers to this issue and none of them are helping me. I am trying to run a simple random cut forest algorithm. I have a small data set of IPs which have been stripped down to only have numbers. I still get this error. It only has one column of these numbers. The CSV looks like this:
176162144
176862141
176762141
176761141
176562141
Have you looked at this sample notebook, and tried using it with your own data?
https://github.com/awslabs/amazon-sagemaker-examples/blob/master/introduction_to_amazon_algorithms/random_cut_forest/random_cut_forest.ipynb
In a nutshell, it reads the CSV file with Pandas and trains the model like this:
from sagemaker import RandomCutForest   # built-in algorithm estimator from the SageMaker Python SDK

rcf = RandomCutForest(role=execution_role,
                      train_instance_count=1,
                      train_instance_type='ml.m4.xlarge',
                      data_location='s3://{}/{}/'.format(bucket, prefix),
                      output_path='s3://{}/{}/output'.format(bucket, prefix),
                      num_samples_per_tree=512,
                      num_trees=50)

# automatically upload the training data to S3 and run the training job
rcf.fit(rcf.record_set(taxi_data.value.as_matrix().reshape(-1, 1)))
You didn't say what your use case was, but as you're working with IP addresses, you may find the IP Insights built-in algorithm useful too: https://docs.aws.amazon.com/sagemaker/latest/dg/ip-insights.html
I was using the sample notebook Julien Simon mentioned, but at some point the data was ending up as strings, and RCF has to be trained on numeric data.
What I did was cast the array to an integer array as a double check and voilà, it worked. I am at a loss as to how the data ended up in string format, but that was the issue. Simple solution.
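For reference, a minimal sketch of that cast, assuming the one-column CSV above is loaded with pandas and reusing the rcf estimator from the snippet earlier (the file name is hypothetical):

import numpy as np
import pandas as pd

ip_data = pd.read_csv('ip_numbers.csv', header=None)   # hypothetical file name

print(ip_data.dtypes)   # 'object' here means the values were read in as strings

# force a numeric array before handing it to record_set
train_array = ip_data.values.astype(np.int64).reshape(-1, 1)

rcf.fit(rcf.record_set(train_array))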

SAS DI Stop job if dataset is populated

I'm quite new to SAS and really can't get my head around its code, so I'm asking here for help.
I have a job that reads an external CSV file, and a macro created by a colleague that validates the data in this external file and writes error messages to a work table.
What I'd like to do, either in the precode of the file reader or via another user-written code transformation, is to read the work table and check whether any observations exist, and if they do, abort the job. From googling, and between here and the SAS community, I can find how to read a dataset and count its observations, but I'm having real difficulty figuring out how to implement this, so any guidance would be much appreciated.
Can anyone please help me on this?
Thanks

Cloud Dataflow - heap space error when using PCollectionList

I have to partition the data by a date field. I am doing it using the Partition transform.
When I divide a year's worth of data by month, Partition returns a PCollectionList containing 12 PCollections. This works fine.
When I have to divide it by day, I have to create 1 * 12 * 31 PCollections in the PCollectionList, and this throws a heap space error. I tried with only two months of data, that is,
a PCollectionList of 2 * 31 PCollections.
I tried n1-highmem-4 and n1-highmem-8 machines with more than 10 workers, and it still throws a heap space error. I am testing with only a 2.0 MiB file, so I don't believe the data size is the problem. The screenshots are below.
Please help me fix this; a workaround is also very welcome.
Thanks in advance.
It sounds like you're trying to get time-based divisions of your data. Have you looked at windowing? It should allow you to do monthly/daily/hourly windowing without needing to perform the partition. If windowing isn't applicable, could you explain why you need to partition by day?
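For what it's worth, here is a rough sketch of that windowing idea using the Beam/Dataflow Python SDK; the input path, record format and parse logic are hypothetical, and the Java SDK has the equivalent Window.into(FixedWindows.of(...)) transform:

import datetime

import apache_beam as beam
from apache_beam.transforms.window import FixedWindows, TimestampedValue


def parse_record(line):
    # hypothetical input format: 'YYYY-MM-DD,value'
    date_str, value = line.split(',')
    ts = datetime.datetime.strptime(date_str, '%Y-%m-%d').timestamp()
    return ts, float(value)


with beam.Pipeline() as p:
    daily = (
        p
        | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input.csv')   # hypothetical path
        | 'Parse' >> beam.Map(parse_record)
        | 'Stamp' >> beam.Map(lambda rec: TimestampedValue(rec, rec[0]))
        | 'DailyWindows' >> beam.WindowInto(FixedWindows(24 * 60 * 60))
        # downstream grouping/aggregation now happens once per day, with no
        # need for a 12 * 31-element PCollectionList
    )

Each downstream GroupByKey or Combine then produces per-window (per-day) results from a single PCollection instead of hundreds of separate ones.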
How are you consuming the partitioned results? You may be running into a known issue where pipelines with many sinks hit OOM errors due to the byte buffers allocated for each sink.

Advice for handling a 200 GB CSV with geometry

I have a 200 GB CSV file that represents locations (points) around the globe. Each entry (row) has 64 columns and contains redundant information. By my calculations the file holds approximately 800 million rows. My first approach was to push all the data into Postgres + PostGIS. The data is not very clean and some rows do not match the expected datatypes, so I wrote an ORM implementation to first validate and fix the datatype inconsistencies and handle exceptions.
The ORM I used was Django (> 1.5), and it took approximately 3 hours to process less than 0.1% of the total dataset.
I also tried to split the dataset into smaller files so that I could process them little by little and push them into the database. I used common Unix commands like sed, cat, awk and head to do this, but it takes so much time!
My questions are the following:
Does using the Django ORM sound like a good approach?
What about SQLAlchemy? Could it help make the insertions faster?
How can I split the dataset more quickly?
I recently came across Pandas (a Python library for data analysis). Could it help with this task, perhaps by making queries easier once the data is stored in the database?
Which other tools would you recommend to work with this massive amount of data?
Thank you for your help and for reading the long post.
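One pattern that often helps here is to skip the per-row ORM entirely and stream the file in chunks with pandas, bulk-inserting each cleaned chunk; for splitting the raw file itself, split -l is usually much faster than sed/head pipelines. A rough sketch, assuming a target table named locations, a hypothetical connection string, and a hypothetical clean_chunk() standing in for the datatype fixes described above:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://user:password@localhost/geodb')   # hypothetical DSN


def clean_chunk(df):
    # placeholder for the datatype validation/fixes mentioned in the question
    return df.dropna()


# stream the 200 GB file in manageable pieces instead of loading it at once
for chunk in pd.read_csv('points.csv', chunksize=100000):   # hypothetical file name
    clean_chunk(chunk).to_sql('locations', engine, if_exists='append', index=False)

For maximum insert speed, PostgreSQL's COPY command (for example via psycopg2's copy_expert) on the cleaned chunks is generally faster still than row-by-row ORM saves.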

SimpleDB Incremental Index

I understand SimpleDB doesn't have an auto-increment, but I am working on a script where I need to query the database by sending the ID of the last record I've already pulled and then pull all subsequent records. In normal SQL fashion, if there were 6200 records and I already had 6100 of them, when I run the script I would query for records with an ID greater than 6100. Looking at the response object, I don't see anything I can use; it just seems like there should be a sequential index there. The other option I was thinking of is a real timestamp. Any ideas are much appreciated.
Using a timestamp was perfect for what I needed to do. I followed this article to help me on my way: http://aws.amazon.com/articles/1232. I would still welcome it if anyone knows of a way to get an incremental index number.
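For anyone landing here later, a rough sketch of that timestamp approach with the boto 2.x SimpleDB API (domain and attribute names are hypothetical). The key point from the linked article is that SimpleDB comparisons are plain string comparisons, so timestamps should be stored in a lexicographically sortable format such as ISO 8601:

import boto

sdb = boto.connect_sdb()              # picks up AWS credentials from the environment
domain = sdb.get_domain('records')    # hypothetical domain name

last_seen = '2013-06-01T00:00:00Z'    # ISO 8601 sorts the same way time does

query = ("select * from `records` "
         "where created_at > '%s' order by created_at limit 250" % last_seen)

for item in domain.select(query):
    print(item.name, item['created_at'])

A zero-padded counter attribute would work the same way if a true incremental index is ever needed.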