Google Dataflow job failed with "insufficient data uploaded" error - google-cloud-platform

I am trying to create a Dataflow job that processes a few thousand files and, for each file, writes to a different destination in GCS.
This means I end up with many separate TextIO reads and writes, one flow per file group. A sample code snippet looks like this:
List<PCollection<String>> pcs = new ArrayList<>();
for (int i = 0; i < 2000; i++) {
    pcs.add(p.apply(TextIO.Read.from("gs://wushilin-asia/some-folder/input-" + i + "/*")));
}
for (int i = 0; i < 2000; i++) {
    pcs.get(i).apply(TextIO.Write.to("gs://wushilin-asia/some-folder/output-" + i + "/"));
}
p.run();
This fails silently (it seems to hang forever), and the backend reports the error "insufficient data uploaded".
What is going wrong here?

It turned out that the pipeline structure was too complicated and the Dataflow job metadata storage could not handle it. Reducing the number of components solved the issue.
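As a rough illustration of what "reducing the number of components" can look like, here is a minimal sketch that splits the work into several smaller pipelines instead of one pipeline with ~4000 transforms. It reuses the TextIO.Read/TextIO.Write calls from the question; the batch size of 200 and the options variable are assumptions, not something from the original post.

// Sketch: submit several smaller jobs, each with a modest number of transforms,
// so no single job graph overwhelms the Dataflow job metadata storage.
int batchSize = 200; // illustrative; tune to keep each job graph small
for (int start = 0; start < 2000; start += batchSize) {
    Pipeline p = Pipeline.create(options); // same pipeline options as the original job (assumed variable)
    for (int i = start; i < start + batchSize; i++) {
        p.apply(TextIO.Read.from("gs://wushilin-asia/some-folder/input-" + i + "/*"))
         .apply(TextIO.Write.to("gs://wushilin-asia/some-folder/output-" + i + "/"));
    }
    p.run(); // each run produces a much smaller job
}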


Why is my Arduino MKR NB 1500 stuck after sending or receiving a couple of MQTT messages?

Good morning everyone, newcomer writing his first question here (and new to C++/OOP).
I'm currently working on a project in which I have to send two types of JSON payloads to an MQTT broker at regular intervals (which can be set by sending a message to the Arduino MKR NB 1500).
I'm currently using these libraries in my VS Code workspace: ArduinoJson and StreamUtils to generate and serialize/deserialize the JSON, PubSubClient to publish/receive, and MKRNB.
What I noticed is that my program runs fine for a couple of publishes/receives and then gets stuck in the serialization function. I tried to trace it with the serial monitor to see exactly where, but eventually reached a point where my C++ knowledge is too weak to even know where to put the traces in the code...
Let me show a small piece of the code:
DynamicJsonDocument coffeeDoc(12288); // I need a big JSON document (it's generated right before
                                      // the transmission and destroyed right after)
coffeeDoc["device"]["id"] = boardID;
JsonObject transaction = coffeeDoc.createNestedObject("transaction");
transaction["id"] = j; // j was defined as int, it's the number of the message sent
JsonArray transaction_data = transaction.createNestedArray("data");
for (int i = 0; i < total; i++) { // this loop generates the objects in the JSON
    transaction_data[i] = coffeeDoc.createNestedObject();
    transaction_data[i]["id"] = i;
    transaction_data[i]["ts"] = coffeeInfo[i].ts;
    transaction_data[i]["pwr"] = String(coffeeInfo[i].pwr, 1);
    transaction_data[i]["t1"] = String(coffeeInfo[i].t1, 1);
    transaction_data[i]["t2"] = String(coffeeInfo[i].t2, 1);
    transaction_data[i]["extruder"] = coffeeInfo[i].extruder;
    transaction_data[i]["time"] = coffeeInfo[i].time;
}
client.beginPublish("device/coffee", measureJson(coffeeDoc), false);
BufferingPrint bufferedClient{client, 32};
serializeJson(coffeeDoc, bufferedClient); // THE PROGRAM STOPS IN THIS AND NEVER COMES OUT
bufferedClient.flush();
client.endPublish();
j++;
coffeeDoc.clear(); // destroy the JSON document so it can be reused
The same code works as it should if I use an Arduino MKR WiFi 1010; I think I know too little about how the GSM side works.
What am I doing wrong? (Again, it works about twice before getting stuck.)
Thanks to everybody who will find the time to help, have a nice one!!
Well, here's a little update: it turns out I ran out of memory, so 12288 bytes were too much for the poor microcontroller.
Through some "stupid" trial and error, I figured out that 10235 bytes works and is close to the maximum available (the program won't use more than 85% of the RAM); yeah, that's pretty close to the maximum, but the requirements of the project leave no other option.
Thanks even for having read this question, have a nice one!
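For anyone hitting the same symptom: ArduinoJson exposes two checks that would have made this out-of-memory situation visible instead of looking like a silent hang. This is only a sketch assuming ArduinoJson 6.x; the 10235-byte capacity is the value from the update above.

DynamicJsonDocument coffeeDoc(10235);
if (coffeeDoc.capacity() == 0) {
    // The allocation itself failed: not enough free heap for the document.
    Serial.println(F("JSON document allocation failed"));
    return;
}
// ... populate coffeeDoc exactly as in the question ...
if (coffeeDoc.overflowed()) {
    // The document filled up and some values were silently dropped.
    Serial.println(F("JSON document too small for the data"));
}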

Batch together multiple read function calls for different blocks of ethereum blockchain

I am developing a simple dapp that shows line charts, with one data point per day. While fetching the data I realized that it is really slow to get historic data from blocks mined at different times.
My first naive solution was just to make a separate read function call for each day's block, but I quickly realized this is very slow. I am using an RPC provider.
const blocksPerDay = 20 * 60 * 24;
const startingBlock = 1446905;
const lastBlock = await provider.getBlockNumber();
let day = (await provider.getBlock(startingBlock)).timestamp;
const secondsPerDay = 60 * 60 * 24;
const result: LineChartData = [];
for (
  let currentBlock = startingBlock;
  currentBlock <= lastBlock;
  currentBlock += blocksPerDay
) {
  const valueAtBlock = await contract.getData({ blockTag: currentBlock });
  result.push({ x: day, y: fromWeiNumber(valueAtBlock) });
  day += secondsPerDay;
}
return result;
Then I did a little research and found Multicall.js. Unfortunately, I did not find a way to batch read calls for different blocks using Multicall (it's probably not possible, since a multicall is itself executed at a single block).
Is there any way to send multiple read calls for different blocks in one request, or otherwise make this faster? How do the web apps that show charts (like poocoin) do it? Do I need to run my own archive node?
You will need access to a node that exposes these capabilities, but to answer the batching question: with geth as your node you can batch read requests through the GraphQL endpoint, batch the JSON-RPC requests and responses yourself, or use the json-rpc-batch-provider.
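As a rough sketch of the last option, assuming ethers v5 (which the question's provider/contract calls resemble): its JsonRpcBatchProvider coalesces calls issued in the same event-loop tick into one JSON-RPC batch, so firing all the per-day calls with Promise.all replaces the sequential awaits. RPC_URL, CONTRACT_ADDRESS and ABI are placeholders, the awaits are assumed to run inside an async function as in the question, and old blocks will still need an archive-capable endpoint.

import { ethers } from "ethers";

const provider = new ethers.providers.JsonRpcBatchProvider(RPC_URL);
const contract = new ethers.Contract(CONTRACT_ADDRESS, ABI, provider);

const blocksPerDay = 20 * 60 * 24;
const startingBlock = 1446905;
const lastBlock = await provider.getBlockNumber();

// Collect all block tags first, then let the batch provider send them together.
const blockTags: number[] = [];
for (let b = startingBlock; b <= lastBlock; b += blocksPerDay) {
  blockTags.push(b);
}
const values = await Promise.all(
  blockTags.map((blockTag) => contract.getData({ blockTag }))
);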

Kernel automatically restarts after loading data when using AI Platform on Google Cloud Platform

I'm trying to load 600 MB of data in a notebook on AI Platform.
The data loading was fine at first, but right after the loading completes, the kernel restarts automatically. I had successfully loaded the data before; the issue appeared after I added some preprocessing of the images while loading them.
I'm wondering if I have done anything wrong to make this happen, since I'm new to GCP. I have tried setting up more RAM but it still doesn't work. Here is the code that triggers the problem.
for i in random.sample(items, 5600):
    print(j)
    j += 1
    img = cv2.imread(PATH + "C1-P1_Train/" + labels[i][0])
    img = cv2.resize(img, size)
    img = cv2.resize(preprocess(img), size)
    X.append(img)
    y.append(labels[i][1])
print("Import successfully!")
Thanks for your help

C++ download progress report algorithm

I have an application (Qt, but that is not really important) which downloads several files, and I want to notify the user about the progress. The C++ app runs on a different machine and progress reports are sent over the network (the protocol does not matter here). I do not want to send a message over the network for every chunk of data received, but only at defined intervals, e.g. every 5% (so 0%, 5%, 10%, ...).
Basically I have it like this right now:
void Downloader::OnUpdateDownloadProgress(int downloaded_bytes)
{
    m_files_downloaded_size += downloaded_bytes;
    int perc_download = (int) ((m_files_downloaded_size / m_files_total_size) * 100);
    if (m_percentage_buffer > LocalConfig::getDownloadReportSteps() || m_files_downloaded_size == m_files_total_size) {
        emit sigDownloadProgress(DOWNLOAD_PROGRESS, perc_download);
        m_percentage_buffer = 0;
    } else {
        m_percentage_buffer += (downloaded_bytes / m_files_total_size) * 100;
    }
}
This means that for every chunk of data that triggers this slot I need to perform:
a greater-than comparison, an addition, a division, and a multiplication.
I know that I could at least skimp on the multiplication by storing a float in the settings and comparing against that. Other than that, are there any ways to make this more performant, or did I do well on my first try at implementing it?
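One alternative, shown only as a sketch (it assumes the members from the question plus a new m_last_reported_step member initialized to -1, and that the size members are floating-point): derive the current report step directly from the totals and emit only when that step changes, which removes the running percentage buffer entirely. In practice the handful of arithmetic operations per callback is negligible next to the network I/O, so this is more about simplicity than raw speed.

void Downloader::OnUpdateDownloadProgress(int downloaded_bytes)
{
    m_files_downloaded_size += downloaded_bytes;

    const int step_size = LocalConfig::getDownloadReportSteps(); // e.g. 5 (%)
    const int perc = static_cast<int>(
        (static_cast<double>(m_files_downloaded_size) / m_files_total_size) * 100);
    const int step = perc / step_size;

    // Report only when we enter a new step (0%, 5%, 10%, ...) or when finished.
    if (step > m_last_reported_step || m_files_downloaded_size == m_files_total_size) {
        m_last_reported_step = step;
        emit sigDownloadProgress(DOWNLOAD_PROGRESS, perc);
    }
}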

ADO's Command object Error when adAsyncExecute

I'm using ADO's Command object to execute simple commands.
For example,
_CommandPtr CommPtr;
CommPtr.CreateInstance(__uuidof(Command));
CommPtr->ActiveConnection = MY_CONNECTION;
CommPtr->CommandType = adCmdText;
CommPtr->CommandText = L"insert into MY_TABLE values MY_VALUE";
for (int i = 0; i < 10000; i++) {
    CommPtr->Execute(NULL, NULL, adExecuteNoRecords);
}
This works fine, but I wanted to make the execution asynchronous to improve performance when dealing with a large amount of data, so I simply changed the execute option to adAsyncExecute.
(Documentation Link)
_CommandPtr CommPtr;
CommPtr.CreateInstance(__uuidof(Command));
CommPtr->ActiveConnection = MY_CONNECTION;
CommPtr->CommandType = adCmdText;
CommPtr->CommandText = L"insert into MY_TABLE values MY_VALUE";
for (int i = 0; i < 10000; i++) {
    CommPtr->Execute(NULL, NULL, adAsyncExecute);
}
This gives me a memory error for some reason:
First-chance exception
Microsoft C++ exception:
_com_error at memory location 0x0028FA24
Do any experts on ADO know why this is happening?
Thanks
First, I won't ask why you need to loop 10,000 times just to execute the query, even though it consumes a tremendous amount of network, client, and server resources.
I will answer how to execute queries asynchronously.
You use this style of execution to prevent your client from ending up with a frozen GUI: to the user, the app appears to have hung while it waits for the reply to the query; it cannot do anything and looks frozen until the database server responds.
To keep the GUI working (for example with an animated hourglass), you need to execute the query in asynchronous mode.
Example below written in Visual Basic:
dbCon.Execute "Insert Into ...Values....", , adAsyncExecute
Do While dbCon.State = (ADODB.ObjectStateEnum.adStateExecuting + ADODB.ObjectStateEnum.adStateOpen)
    Application.DoEvents
Loop
This way, the client keeps waiting for the server's reply but lets the GUI process events in the meantime, making your app more responsive.
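Since the question is in C++, here is a rough sketch of the same pattern using the #import-generated smart pointers from the question; the Win32 message pump is just one way to keep the UI alive and is an assumption about the surrounding application.

// Fire the command asynchronously, then poll its State until it finishes,
// pumping messages so the GUI stays responsive in the meantime.
CommPtr->Execute(NULL, NULL, adExecuteNoRecords | adAsyncExecute);

while (CommPtr->State & adStateExecuting) {
    MSG msg;
    while (PeekMessage(&msg, NULL, 0, 0, PM_REMOVE)) {
        TranslateMessage(&msg);
        DispatchMessage(&msg);
    }
    Sleep(10); // avoid a busy-wait
}

Waiting for adStateExecuting to clear before issuing the next Execute also avoids starting a new command while the previous asynchronous one is still in flight.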