For example, I have a list of URLs as strings which are stored in Datastore.
So I used DatastoreIO to read them into a PCollection. In a ParDo's DoFn, for each URL (which is the Cloud Storage location of a file), I have to read the file at that location and do further transformations.
So I want to know whether I can apply a ParDo to a PCollection inside another ParDo function, something like running each file's transformation in parallel and emitting KV(key, PCollection) as the output of the first ParDo.
Sorry if I haven't presented my scenario clearly; I'm a newbie to Apache Beam and Google Dataflow.
What you want is TextIO#readAll().
PCollection<String> urls = pipeline.apply(DatastoreIO.read(...)); // note: the Datastore read actually emits entities, so map them to URL strings first
PCollection<String> lines = urls.apply(TextIO.readAll());
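Fleshed out a little, a rough sketch with the Beam Java SDK could look like the following; the project id, the "FileUrl" kind, and the "url" property name are assumptions to be replaced with whatever your Datastore entities actually use:

import com.google.datastore.v1.Entity;
import com.google.datastore.v1.KindExpression;
import com.google.datastore.v1.Query;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.datastore.DatastoreIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

// Query the entities that hold the file locations ("FileUrl" is a hypothetical kind).
Query query =
    Query.newBuilder()
        .addKind(KindExpression.newBuilder().setName("FileUrl"))
        .build();

PCollection<Entity> entities =
    pipeline.apply(
        DatastoreIO.v1().read()
            .withProjectId("my-project")   // assumption: your GCP project id
            .withQuery(query));

// Pull the gs:// path out of each entity ("url" is a hypothetical property name).
PCollection<String> urls =
    entities.apply(
        MapElements.into(TypeDescriptors.strings())
            .via((Entity e) -> e.getPropertiesMap().get("url").getStringValue()));

// readAll() expands each path/pattern in the PCollection and emits one element
// per line across all of the matched files.
PCollection<String> lines = urls.apply(TextIO.readAll());

If you are on a newer Beam release where readAll() is deprecated, FileIO.matchAll() plus FileIO.readMatches() and TextIO.readFiles() is the equivalent route.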
Related
I have a use case where I want to read the filename from a metadata table. I have written a pipeline function to read the metadata table, but I am not sure how I can pass this information to ReadFromText, as it only takes a string as input. Is it possible to assign this value to ReadFromText()? Please suggest some workarounds or ideas on how to achieve this. Thanks
code: pipeline | 'Read from a File' >> ReadFromText(I want to pass the file path here?,
skip_header_lines=1)
Note: there will be various folders and files in storage, and the files are in CSV format, but in my use case I can't directly pass the storage location or filename to the file path in ReadFromText. I want to read it from metadata and pass the value. Hope I am clear. Thanks
I don't understand why you need to read the metadata. If you want to read all the files inside a folder, you can just provide a glob pattern. This solution works in Python; I'm not sure about Java.
p | ReadFromText("./folder/*.csv")
The "*" is the glob wildcard here, which lets the pipeline read every file matching *.csv. You can also add a prefix at the start of the pattern.
What you want is textio.ReadAllFromText, which reads the file paths from a PCollection instead of taking a string directly.
Using the Dataflow streaming templates, namely the Cloud Storage Text to BigQuery (Stream) template, it used to be possible to provide the "inputFilePattern" (i.e., the Cloud Storage location of the text you'd like to process) as a wildcard pattern. For example, you could enter gs://my-bucket/my-files/file-to-upload* as the parameter and all the files starting with "file-to-upload" would then be streamed.
Unfortunately it now throws this error message: "Object not found."
Is there another way to load all files from a Google Cloud Storage location with a similar naming convention into BigQuery?
Please see screenshots below:
Thanks in advance.
This looks like a bug in the UI. You can pass the file pattern when you submit the job via the command line; the source code takes the file pattern as input, so there should not be any problem with the actual job:
PCollectionTuple transformedOutput =
pipeline
// 1) Read from the text source continuously.
.apply(
"ReadFromSource",
TextIO.read()
.from(options.getInputFilePattern())
.watchForNewFiles(DEFAULT_POLL_INTERVAL, Growth.never()))
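For reference, a command-line submission of this template could look roughly like the sketch below; the job name, region, bucket paths, output table and transform parameters are placeholders, and the exact parameter names should be double-checked against the template documentation for your version:

gcloud dataflow jobs run my-text-to-bq-job \
  --gcs-location=gs://dataflow-templates/latest/Stream_GCS_Text_to_BigQuery \
  --region=us-central1 \
  --parameters=\
inputFilePattern=gs://my-bucket/my-files/file-to-upload*,\
JSONPath=gs://my-bucket/schema.json,\
outputTable=my-project:my_dataset.my_table,\
javascriptTextTransformGcsPath=gs://my-bucket/transform.js,\
javascriptTextTransformFunctionName=transform,\
bigQueryLoadingTemporaryDirectory=gs://my-bucket/tmp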
While working with Postman, data.someVariable returns data from within a CSV file and can also be used as {{someVariable}} in the URI/JSON.
This gives us the data for that variable from that row/iteration.
Is there a mechanism to write back to the data file by doing something like postman.setData('responseCode') = responseCode?
This would be really helpful for storing the response code in the data file and recording call-wise details in the same format as the CSV input.
The only solution I figured out is:
1) to populate JSON objects in the environment with information about the data file name and the structure/values of the information to be added;
2) to create a separate web service (maybe in Node.js) that exposes an HTTP call to write to a file, takes as a parameter a JSON input like the one created in the environment mentioned above, and writes it to a file / the original data file (or a copy of it) in the desired format;
3) to call the above-mentioned web service at the end of each run or after each desired REST call execution, to generate step-wise information / a debug report.
There is no way to write back to the data file in Postman as of now.
However, you can populate that in your environment at run time using
pm.environment.set("varname", value)
Name varname in such a way that you know it is the variable you wanted to write back into the data file.
I am trying to build a custom receiver adapter which will read from a CSV file and push events to a stream.
As far as I understand, we have to follow one of the WSO2 standard formats (TEXT, XML, or JSON) to push data to a stream.
The problem is that CSV doesn't match any of the standard formats stated above, so we have to convert the CSV values to one of the supported formats within the custom adapter.
As per my observation, the WSO2 TEXT format doesn't support a comma (,) within a string value, so I have decided to convert the CSV to JSON.
My questions are below:
1) How do I generate WSO2 TEXT events if the values contain commas?
2) (If point 1 is not possible) In my custom adapter's MessageType, if I add either only TEXT or all three (TEXT, XML, JSON), it works fine. But if I add only JSON, I get the error below. My target is to add only JSON and convert all the CSV to JSON to avoid confusion.
[2016-09-19 15:38:02,406] ERROR {org.wso2.carbon.event.receiver.core.EventReceiverDeployer} - Error, Event Receiver not deployed and in inactive state, Text Mapping is not supported by event adapter type file
To read from a CSV file and push events to a stream, you could use the file-tail adapter. Refer to the sample 'Receiving Custom RegEx Text Events via File Tail'; it contains the regex patterns you could use to map your CSV input.
In addition to this, as Charini has suggested in a comment, you could also check out the event simulator. However, the event simulator is not an event receiver - meaning, it will not receive events in realtime, rather it will "play" a previously defined set of events (in the CSV file, in this case) to simulate a flow of events. It will not continuously monitor the file for new events. If you want to monitor the file for new events, then consider using the file-tail adapter.
I have just made it work. It is not an elegant way, but it works fine for me.
As I mentioned, the JSON format is the most flexible one for me, so I am reading from the file and converting each line/event to the WSO2 JSON format.
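In case the shape of that conversion helps anyone, a minimal sketch is below. The payload attributes ("symbol" as a string, "price" as a double) and the naive comma split are assumptions; adjust them to your own stream definition and CSV parsing:

// Convert one CSV line ("symbol,price") into the WSO2 JSON event format.
private String csvLineToWso2Json(String csvLine) {
    // String#split will not handle quoted fields that contain commas;
    // use a real CSV parser (e.g. Apache Commons CSV) for such data.
    String[] fields = csvLine.split(",");
    return "{\"event\": {\"payloadData\": {"
            + "\"symbol\": \"" + fields[0].trim() + "\", "
            + "\"price\": " + Double.parseDouble(fields[1].trim())
            + "}}}";
}

Each converted line can then be passed on to the adapter's listener as an ordinary JSON event.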
The issue with this option was that I wanted to limit the message format to JSON only in the management console (the "Message Format" menu while creating a new receiver). If I add only JSON [supportInputMessageTypes.add(MessageType.JSON)], it shows the error I mentioned in question #2 above.
The solution is, instead of adding the static variable from the MessageType class directly, to assign it to a string first and add that. So now my method getSupportedMessageFormats() in the EventAdapterFactory class is as below:
@Override
public List<String> getSupportedMessageFormats() {
    List<String> supportInputMessageTypes = new ArrayList<String>();
    // just converting the type to a string value
    // to avoid the error "Text Mapping is not supported by event adapter type file"
    String jsonType = MessageType.JSON;
    supportInputMessageTypes.add(jsonType);
    //supportInputMessageTypes.add(MessageType.JSON);
    //supportInputMessageTypes.add(MessageType.XML);
    //supportInputMessageTypes.add(MessageType.TEXT);
    return supportInputMessageTypes;
}
My request to the WSO2 team: please allow the JSON format for the 'file' event adapter type.
Thanks, Obaid
I am trying to test a web service's performance and having a few issues with using and passing variables. There are multiple sequential requests, which depend on data coming from a previous response. All requests need to be encoded to Base64 and placed in a SOAP envelope namespace before being sent to the endpoint. It returns an encoded response which needs to be decoded to see the XML values that are needed for the next request. What I have done so far is:
1) A Beanshell PreProcessor added to the first sampler to encode the payload, which is read from a file.
2) A regex to pull the encoded part out of the whole response.
3) A Beanshell PostProcessor to decode the response and write it to a file (just in case). I have stored the decoded response in a variable 'Output' and I know this works, since it writes the response to the file correctly.
4) After this, I added 4 regex extractors and tried various things such as applying to different parts, checking different fields, checking the JMeter variable, etc. However, it doesn't seem to work.
This is what my tree looks like (screenshot: JMeter tree).
I am storing the decoded response in the 'Output' variable like this, and it works since it writes to the file properly:
import org.apache.commons.codec.binary.Base64;

// Grab the Base64-encoded block extracted by the "Createregex" extractor
String Createresponse = vars.get("Createregex");

// Decode it and store the result in the JMeter variable "response"
vars.put("response", new String(Base64.decodeBase64(Createresponse.getBytes("UTF-8"))));
Output = vars.get("response");

// Also write the decoded response to a file (just in case)
f = new FileOutputStream("filepath/Createresponse.txt");
p = new PrintStream(f);
this.interpreter.setOut(p);
print(Output);
f.close();
And this is how I am using the regex after that; I have tried different options (screenshot: regex settings).
Unfortunately, the regex is not picking up these values from the 'Output' variable. I basically need them saved so I can use ${docID} in the payload file for the next request.
Any help on this is appreciated! Also happy to provide more detail if needed.
EDIT:
I had a follow-up question. I am trying to run this with multiple users. I have a field ${searchuser} in my payload XML file, which is called in the pre-processor here.
The CSV Data Set Config above it looks like this (screenshot: CSV Data Set Config).
However, it is not picking up the values from the CSV and substituting them in the payload file. Any help is appreciated!
You have 2 problems with your Regular Expression Extractor configuration:
Apply to: needs to be JMeter Variable, with the variable name set to response (the variable holding the decoded payload).
Field to check: needs to be Body; Body as a Document is used for binary file formats like PDF or Word.
By the way, you can do Base64 decoding and encoding using the __base64Decode() and __base64Encode() functions available via JMeter Plugins. The plugins, in turn, can be installed in one click using the Plugin Manager.
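If you take that route, the usage looks roughly like this (assuming the Custom JMeter Functions plugin is installed; Createregex and response are the variables from the question above, while payload and encodedPayload are hypothetical names): ${__base64Decode(${Createregex},response)} decodes the extracted block and stores the result in ${response}, and ${__base64Encode(${payload},encodedPayload)} encodes a request body and stores the result in ${encodedPayload}.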