How to load a saved search with huge data in an MR script? NetSuite - mapreduce

We have a transactional saved search with millions of lines. The saved search fails to load in the UI. Is there any way to load such saved searches in a map/reduce script?
I tried using pagination, but it still throws an error (ABORT_SEARCH_EXCEEDED_MAX_TIME).

NetSuite may still time out depending on the complexity of the search, but you do not have to run the search yourself in order to send the results to the map stage:
/**
 * @NApiVersion 2.x
 * @NScriptType MapReduceScript
 */
define(['N/search', 'N/record'], function (search, record) {
    function getInputData(ctx) {
        // Return the search object itself; the framework pages through
        // the results and feeds each row to the map stage.
        return search.load({ id: 'mysearchid' });
    }
    function map(ctx) {
        var ref = JSON.parse(ctx.value);
        var tranRec = record.load({ type: ref.recordType, id: ref.id }); // the loaded record is available here for processing
        log.debug({
            title: 'map stage with ' + ref.values.tranid, // presumes Document Number was a result column
            details: ctx.value // have a look at the serialized form
        });
    }
    return { getInputData: getInputData, map: map };
});

Instead of returning all rows, it may be an even better option to process only the first N rows (100K or even fewer) per map/reduce execution, save the internal id of the last processed row, and start from the next internal id on the next map/reduce execution; a sketch of that idea follows.
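A minimal sketch of that chunking approach, assuming the saved search is sorted by internal id and that the last processed internal id is kept in a hypothetical script parameter named custscript_last_internal_id, which the summarize stage would update before rescheduling:
define(['N/search', 'N/runtime'], function (search, runtime) {
    function getInputData() {
        // Hypothetical parameter holding the last internal id processed
        // by the previous execution; defaults to 0 on the first run.
        var lastId = runtime.getCurrentScript().getParameter({
            name: 'custscript_last_internal_id'
        }) || 0;
        var tranSearch = search.load({ id: 'mysearchid' });
        // Only pick up rows beyond the last processed internal id.
        tranSearch.filters.push(search.createFilter({
            name: 'internalidnumber',
            operator: search.Operator.GREATERTHAN,
            values: lastId
        }));
        return tranSearch;
    }
    return { getInputData: getInputData };
});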

Related

How to know when Elasticsearch is ready to query after adding new data?

I am trying to do some unit tests using Elasticsearch. I first use the index API about 100 times to add new data to my index, then I use the search API with aggs. The problem is that if I don't pause for 1 second after adding data 100 times, I get random results; if I wait 1 second, I always get the same result.
I'd rather not have to wait an arbitrary amount of time in my tests; that seems like bad practice. Is there a way to know when the data is ready?
I already wait until I get a success response from the Elasticsearch index API, but that does not seem to be enough.
First, I'd suggest indexing your documents with a single bulk request: it saves some time by reducing HTTP/TCP overhead.
To answer your question, you should consider using the refresh=true parameter (or wait_for) while indexing your 100 documents.
As stated in the documentation, it will:
Refresh the relevant primary and replica shards (not the whole index) immediately after the operation occurs, so that the updated document appears in search results immediately
More about it here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-refresh.html
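As an illustration, a hedged sketch of seeding the test data in one bulk call with wait_for, assuming the @elastic/elasticsearch JavaScript client (7.x-style API); the index name test-index is a placeholder:
const { Client } = require('@elastic/elasticsearch');
const client = new Client({ node: 'http://localhost:9200' });

async function seedTestData(docs) {
    // One bulk request instead of ~100 individual index calls.
    const body = docs.flatMap(doc => [{ index: { _index: 'test-index' } }, doc]);
    // refresh: 'wait_for' does not return until the documents are
    // visible to search, so the aggregation tests see stable results.
    await client.bulk({ refresh: 'wait_for', body });
}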

Pig UDF seems to always run in a single reducer - PARALLEL not working

I have a Pig script with a Python UDF that is supposed to generate user-level features. My data is preprocessed by Pig and then sent to a UDF as a list of tuples. The UDF processes the tuples and returns a chararray with my features computed per user. The code where this happens looks like this:
-- ... loading data above
data = FOREACH data_raw GENERATE user_id, ...; -- some other metrics as well
-- Group by ids
grouped_ids = GROUP data BY user_id PARALLEL 20;
-- Limit the ids to process
userids = LIMIT grouped_ids (long)'$limit';
-- Generate features
user_features = FOREACH userids {
GENERATE group as user_id:chararray,
udfs.extract_features(data) as features:chararray;
}
The UDF code clearly runs in the reducer, and for some reason it always goes to one reducer and takes quite some time. I am searching for a way to parallelize its execution, as my job currently takes 22 minutes in total, of which 18 minutes are spent in this single reducer.
Pig typically allocates about 1GB of data per reducer, and my data is indeed less than that, around 300-700MB, but it is very time-consuming on the UDF end, so a single reducer is clearly not optimal while the rest of my cluster sits idle.
Things I have tried:
Setting default_parallel impacts the whole script, but still does not get the reducer with the UDF to parallelize
Manually setting PARALLEL on GROUP data BY user_id parallelizes the output of the group and invokes multiple reducers, but at the point where the UDF kicks in, it's again a single reducer
Setting pig.exec.reducers.bytes.per.reducer, which lets you cap the data per reducer at, for instance, 10MB; it clearly works for other parts of my script (and ruins the parallelism, since it also affects data preparation at the beginning of my pipeline, as expected), but again does NOT allow more than one reducer to run with this UDF
As far as I understand what is going on, given that the shuffle phase can hash the user_id to one or more reducers, I don't see why this script would not be able to spawn multiple reducers, instantiate the UDF on each, and send each group's data to the correct reducer based on user_id. There is no significant skew in my data or anything.
I am clearly missing something here but fail to see what. Does anyone have any explanation and/or suggestion?
EDIT: I updated the code because something important was missing: I was running a LIMIT between the GROUP BY and the FOREACH. I also cleaned up irrelevant info and expanded the inline code to separate lines for readability.
Your problem is that you are passing the whole data relation as an input parameter to your UDF, so your UDF only gets called once with the whole relation, hence it runs in only one reducer. I guess you want to call it once for each group of user_id, so try a nested foreach instead:
data_grouped = GROUP data BY user_id;
user_features = FOREACH data_grouped {
GENERATE group AS user_id: chararray,
udfs.extract_features(data) AS features: chararray;
}
This way you force the UDF to run in as many reducers as are used by the GROUP BY.
Having the LIMIT operator between the GROUP BY and the FOREACH eliminated the possibility of running my code in multiple reducers, even when I explicitly set the parallelism (presumably because LIMIT funnels everything through a single reducer to enforce a global row count, and the downstream FOREACH inherits that):
-- ... loading data above
data = FOREACH data_raw GENERATE user_id, ...; -- some other metrics as well
-- Group by ids
grouped_ids = GROUP data BY user_id PARALLEL 20;
-- Limit the ids to process
>>> userids = LIMIT grouped_ids (long)'$limit'; <<<
-- Generate features
user_features = FOREACH userids {
GENERATE group as user_id:chararray,
udfs.extract_features(data) as features:chararray;
}
Once the LIMIT is moved later in the script, I get the predefined number of reducers to run my UDF:
-- ... loading data above
data = FOREACH data_raw GENERATE user_id, ...; -- some other metrics as well
-- Group by ids
grouped_ids = GROUP data BY user_id PARALLEL 20;
-- Generate features
user_features = FOREACH grouped_ids {
GENERATE group as user_id:chararray,
udfs.extract_features(data) as features:chararray;
}
-- Limit the features
user_features_limited = LIMIT user_features (long)'$limit';
-- ... process further and persist
So my effort to optimize/reduce the inflow of user_ids was counter-productive for increasing parallelism.

neo4j-import with node_auto_indexing

For a project, I need to import 5 million nodes and 15 million relationships.
I tried to import in batches but it was very slow, so I used the new 'neo4j-import' tool from Neo4j 2.2. We generate some specific .csv files and run 'neo4j-import'. It is very fast: the whole database is created in 1 minute 30 seconds.
But the problem is that I need to run a regex query on one property (find a movie given only the beginning of its name), and the average response time is between 2.5 and 4 seconds, which is huge.
I read that a Lucene query would be much more efficient, but with neo4j-import, nodes are created without node_auto_indexing.
Is there a way to use neo4j-import and still have node_auto_indexing in order to use a Lucene query?
Thanks,
Reptile
neo4j-import does not populate auto indexes. To populate them, you need to trigger a write operation on the nodes to be auto-indexed. Assume you have nodes with a :Person label and a name property.
Configure node auto index for name in neo4j.properties and restart Neo4j.
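In Neo4j 2.x that configuration typically amounts to the following two lines in conf/neo4j.properties (adjust the property list to your data):
# enable automatic node indexing and list the properties to index
node_auto_indexing=true
node_keys_indexable=name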
To populate the auto index, run a Cypher statement like:
MATCH (n:Person)
WHERE NOT HAS(n.migrated)
WITH n LIMIT 50000
SET n.name = n.name, n.migrated = true
RETURN count(n)
Rerun this statement until the reported count is 0. The rationale for the LIMIT is to have transactions of a reasonable size.

Pig killing data nodes while loading a lot of files

I have a script that tries to get the times that users start/end their days based on log files. The job always fails before it completes and seems to knock 2 data nodes down every time.
The load portion of the script:
log = LOAD '$data' USING SieveLoader('#source_host', 'node', 'uid', 'long_timestamp', 'type');
log_map = FILTER log BY $0 IS NOT NULL AND $0#'uid' IS NOT NULL AND $0#'type'=='USER_AUTH';
There are about 6500 files that we are reading from, so the job seems to spawn about that many map tasks. SieveLoader is a custom load UDF that reads a line, passes it to an existing method that parses fields from the line, and returns them in a map. The parameters passed in limit the map to only the fields we care about.
Our cluster has 5 data nodes. We have quad cores and each node allows 3 map/reduce slots, for a total of 15. Any advice would be greatly appreciated!

SimpleDB Incremental Index

I understand SimpleDB doesn't have an auto-increment, but I am working on a script where I need to query the database by sending the id of the last record I've already pulled and then pull all subsequent records. In normal SQL fashion, if there were 6200 records and I already had 6100 of them, I would query for records with an ID greater than 6100. Looking at the response object, I don't see anything I can use; it just seems like there should be a sequential index there. The other option I was thinking of would be a real timestamp. Any ideas are much appreciated.
Using a timestamp was perfect for what I needed to do. I followed this article to help me on my way: http://aws.amazon.com/articles/1232. I would still welcome an answer if anyone knows of a way to get an incremental index number.
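For illustration, a hedged sketch of the timestamp approach with the AWS SDK for JavaScript (v2); the domain name mydomain and the created_at attribute are placeholders, and the timestamps are assumed to be zero-padded ISO 8601 strings so that SimpleDB's lexicographic comparison sorts them correctly:
var AWS = require('aws-sdk');
var sdb = new AWS.SimpleDB({ region: 'us-east-1' });

function fetchSince(lastTimestamp, callback) {
    // Pull every record written after the last timestamp we processed.
    var params = {
        SelectExpression: "select * from mydomain " +
            "where created_at > '" + lastTimestamp + "' " +
            "order by created_at asc limit 250",
        ConsistentRead: true
    };
    sdb.select(params, callback);
}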