This is a follow-up to this question:
flatMap and `Ambiguous reference to member` error
There I am using the following code to convert an array of Records to an array of Persons:
let records = // load file from bundle
let persons = records.flatMap(Person.init)
Since this conversion can take some time for big files, I would like to monitor an index to feed into a progress indicator.
Is this possible with this flatMap construction? One possibility I thought of would be to send a notification in the init function, but I am thinking counting the records is also possible from within flatMap ?
Yup! Use enumerated().
let records = // load file from bundle
let persons = records.enumerated().flatMap { index, record in
    print(index)
    return Person(record)
}
I have the following streaming DataFrame:
+-----------------------------------+
| sentence                          |
+-----------------------------------+
| Representative is a scientist     |
| Norman did a good job in the exam |
| you want to go on shopping?       |
+-----------------------------------+
I have a list declared as follows:
val myList
As the final output, I need myList to contain the above three sentences from the streaming DataFrame.
Output:
myList = [Representative is a scientist, Norman did a good job in the exam, you want to go on shopping?]
I tried the following, which gives a streaming error:
val myList = sentenceDataframe.select("sentence").rdd.map(r => r(0)).collect.toList
Error thrown by the above method:
org.apache.spark.sql.AnalysisException: Queries with streaming sources
must be executed with writeStream.start()
Please note that the above method works with a normal DataFrame but not with a streaming DataFrame.
Is there a way to iterate through each row of the streaming DataFrame and collect the row values into a common list using Scala and Spark?
That sounds like a very weird use case, as the stream could theoretically never end. Are you sure you are not just looking for ordinary Spark DataFrames?
If that is not the case, what you can do is use Accumulators and Spark Structured Streaming's foreachBatch sink. I used a simple socket connection to demonstrate this. You can start a simple socket server, e.g. under Ubuntu with nc -lp 3030, and just paste messages there to feed the stream; the resulting DataFrame will have a schema of [value: String].
val acc = spark.sparkContext.collectionAccumulator[String]
val stream = spark.readStream.format("socket").option("host", "localhost").option("port", "3030").load()
val query = stream.writeStream.foreachBatch((df: DataFrame, l: Long) => {
  df.collect.foreach(v => acc.add(v(0).asInstanceOf[String]))
}).start()
...
// For some reason you are stopping the stream here
query.stop()
val myList = acc.value
Now, one question you might have is why we are using Accumulators and not just an ArrayBuffer. An ArrayBuffer would work locally, but on a cluster the code in foreachBatch might be executed on a completely different node. In that case appending to a local ArrayBuffer would have no effect, which is the reason Accumulators exist in the first place (see https://spark.apache.org/docs/latest/rdd-programming-guide.html#accumulators).
I have a function:
private fun importProductsInSequence(products: List<Product>): List<Result> =
    products.asSequence()
        .map { saveProduct(it) }
        .takeWhile { products.isNotEmpty() }
        .toList()
Is there a possibility to rewrite this sequence so it works in batches? For example, a list of 1000 products is passed to this method, and the sequence takes 100 products, saves them, then the next 100, until the products.isNotEmpty() condition is met?
takeWhile is not needed here, as the products list size doesn't change even after you iterate over the sequence, so products.isNotEmpty() will always be true.
Kotlin has a chunked method that can give you the elements in batches as required:
private fun importProductsInSequence(products: List<Product>): List<Result> =
    products.asSequence()
        .chunked(100)
        .map { saveProducts(it) } // saveProducts method would take a list of products and return a list of Results
        .flatten()
        .toList()
You need to use chunked:
private fun importProductsInSequence(products: List<Product>): List<Product> =
    products.asSequence().chunked(100).onEach { saveProducts(it) }.flatten().toList()
chunked will partition the list into chunks of the size provided; the last chunk may have fewer elements than the previous ones.
I am trying to read values from Google Pub/Sub and Google Cloud Storage and insert those values into BigQuery based on a count condition, i.e., if the value already exists then it should not be inserted, otherwise it can be inserted.
My code looks like this:
p_bq = beam.Pipeline(options=pipeline_options1)
logging.info('Start')
"""Pipeline starts. Create creates a PCollection from what we read from Cloud storage"""
test = p_bq | beam.Create(data)
"""The pipeline then reads from pub sub and then combines the pub sub with the cloud storage data"""
BQ_data1 = p_bq | 'readFromPubSub' >> beam.io.ReadFromPubSub(
'mytopic') | beam.Map(parse_pubsub, param=AsList(test))
where 'data' is the value from Google Cloud Storage and the Pub/Sub read supplies the value from Google Analytics. parse_pubsub returns two values: one is the dictionary and the other is a count (which states whether the value is present in the table or not):
count=comparebigquery(insert_record)
return (insert_record,count)
How can I make the BigQuery insertion conditional, given that the value is in a PCollection?
New edit:
class Process(beam.DoFn):
    def process(self, element, trans):
        if element['id'] in trans:
            # Emit records whose id is already in the table to the 'present' output.
            yield pvalue.TaggedOutput('present', element)
        else:
            # Emit records whose id is not in the table to the 'absent' output.
            yield pvalue.TaggedOutput('absent', element)
test1 = p_bq | "TransProcess" >> beam.Create(trans)
where trans is the list
BQ_data2 = BQ_data1 | beam.ParDo(Process(),trans=AsList(test1)).with_outputs('present','absent')
present_value=BQ_data2.present
absent_value=BQ_data2.absent
Thank you in advance
You could use
beam.Filter(lambda_function)
after the beam.Map step to filter out elements that return False when passed to the lambda_function.
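For illustration, here is a minimal runnable sketch of that idea (not the original pipeline): it assumes, as in the question, that the upstream step emits (insert_record, count) pairs, and that a count of 0 means the record is not yet in the table.
import apache_beam as beam

with beam.Pipeline() as p:
    _ = (
        p
        | 'FakeMapOutput' >> beam.Create([({'id': 1}, 0), ({'id': 2}, 1), ({'id': 3}, 0)])  # stand-in for the Map output
        | 'KeepAbsent' >> beam.Filter(lambda pair: pair[1] == 0)   # drop records already present in the table
        | 'DropCount' >> beam.Map(lambda pair: pair[0])            # keep only the dictionary to insert
        | 'Show' >> beam.Map(print)                                # would be the BigQuery write in the real pipeline
    )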
You can split the PCollection in a ParDo function using additional outputs, based on a condition.
Don't forget to provide output tags to the ParDo function with .with_outputs().
When writing elements of the PCollection to a specific output, use .TaggedOutput().
Then you select the PCollection you need and write it to BigQuery.
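As a rough, self-contained sketch of that pattern (the DoFn, tag names, and ids below are illustrative rather than taken from the original pipeline, and the BigQuery write is replaced by a print so it runs locally):
import apache_beam as beam
from apache_beam import pvalue

class SplitByPresence(beam.DoFn):
    # Routes each record to a 'present' or 'absent' output based on a side-input list of ids.
    def process(self, element, existing_ids):
        if element['id'] in existing_ids:
            yield pvalue.TaggedOutput('present', element)
        else:
            yield pvalue.TaggedOutput('absent', element)

with beam.Pipeline() as p:
    records = p | 'Records' >> beam.Create([{'id': 1}, {'id': 2}, {'id': 3}])
    existing = p | 'ExistingIds' >> beam.Create([1, 3])

    split = records | 'Split' >> beam.ParDo(
        SplitByPresence(), existing_ids=pvalue.AsList(existing)
    ).with_outputs('present', 'absent')

    # Only the 'absent' records would go to beam.io.WriteToBigQuery(...) in the real pipeline.
    _ = split.absent | 'ShowAbsent' >> beam.Map(print)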
I want to create a function that works on a single date or over a period of time. For that I make a list of strings like this:
import com.github.nscala_time.time.Imports._
val fromDate = "2015-10-15".toLocalDate
val toDate = "2015-10-15".toLocalDate
val days = Iterator.iterate(fromDate)(_ + 1.day).takeWhile(_ <= toDate).map(_.toString)
Now I want to access the contents of days:
scala> days.foreach(println)
2015-10-15
scala> days.foreach(day => println(day))
With the first foreach I get the only element in days, but with the second I get nothing. This is just a sample; I need to use the second form, because I need to use the day value inside the function.
If I use a range both methods work like they are supposed to...
Does anyone know why this behaviour occurs?
Thanks in advance!
P.S. I don't want to use another method to create the list of strings (unless I can't find a solution for this).
The second function works in the same way as the first.
You've created an Iterator object, which you can only traverse once; the first foreach exhausted it, so the second one has nothing left to iterate over.
To be able to use it any number of times, just convert your iterator to a list:
val daysL = days.toList
My problem originates from me trying to create names for all my crazy (brilliant?) ideas for business and products, which then need to have their purchasing availability checked for .com domain names.
So I have a pen and paper system where I create two lists of words... List A and List B for example.
I want to find or create a little app where I can create and store custom lists, and which takes each word from List A and appends each word from List B (to create a total of List A * List B results?).
After the list is compiled of "ListAListB" results, I want to check if the .com domain is available for purchase online via some other method...
And ultimately, create a new list of each combination, along with some sort of visual status like maybe a color or word representing if the combined word is available as a .com...
So I'm basically using a nested for loop structure to index each word in List A, loop through each word in List B, and create List C?
Then when the list is fully completed, send a CSV? to somewhere online and then somehow get a new list back.
I guess that is my rough thought process.
Any advice on the algorithm to create the list from the two original lists is appreciated.
Any help with the process of checking the available domain names online via GoDaddy, ICANN, etc. is appreciated.
Any help as to where I might find this tool already is even more appreciated.
I could probably download a free SDK or tool and write this in a language myself, based on my C++ experience from a few years ago, but I am rusty for sure, and haven't actually created anything since college about 3 years ago.
Thank you.
Here's a quick shell script that leverages Chris's answer.
#!/bin/bash
ids_url="http://instantdomainsearch.com/services/quick/?name="
for a in $(< listA); do
    for b in $(< listB); do
        avail=`wget -qO- $ids_url$a$b | sed -e "s/.*'com':'u'.*//g"`
        if [ "$avail" == "" ]; then
            echo "$a$b.com unavailable"
        else
            echo "$a$b.com available"
        fi
    done
done
It iterates through both lists, hits the domain-search service with wget, and uses sed to blank out any response that marks the .com as unavailable ('com':'u'), so an empty result means the name is taken. Supposing List A contains 'goo', 'foo', and 'arglbar' and List B contains 'gle', the output should look like this:
google.com unavailable
foogle.com unavailable
arglbargle.com available
Pipe it through grep -v unavail to see only the available names.
For checking whether a domain is available you can try this out:
Request this page:
http://instantdomainsearch.com/services/quick/?name=example
which will return this JSON (u = unavailable, a = available):
{'name':'example','com':'a','net':'u','org':'a'}
Then you just need to parse it. You may get blocked if you are checking lots of domains, but I doubt it, since this site receives a ton of requests from one session, so you should be good (I'd pause at least 800 milliseconds between each request).
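As a side note, the response above is not strict JSON because of the single quotes; a small illustrative sketch (in Python, just for brevity, parsing the sample string from this answer rather than a live response) is one way to read it:
import ast

# Sample response shape from above ('u' = unavailable, 'a' = available); the single quotes
# make this a Python-style dict literal rather than strict JSON, so ast.literal_eval handles it.
raw = "{'name':'example','com':'a','net':'u','org':'a'}"

info = ast.literal_eval(raw)
status = "available" if info.get('com') == 'a' else "unavailable"
print(info['name'] + ".com looks " + status)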
C# code for list creation:
// Load up all the lines of each list into string arrays
string[] listA = File.ReadAllLines("listA.txt");
string[] listB = File.ReadAllLines("listB.txt");

// Create a list to hold the combinations
List<string> listC = new List<string>();

// Loop through each line in listA
foreach (string buzzwordA in listA)
{
    // Now loop through each word in listB
    foreach (string buzzwordB in listB)
        listC.Add(buzzwordA + buzzwordB); // Combine them and add it to the listC
}

File.WriteAllLines("listC.txt", listC.ToArray()); // Save all the combos
I didn't check the code, but that's the general idea. This is a bad idea for huge lists, though, because it reads the lists completely into memory at the same time. A better solution is probably reading the files line by line using FileStream and StreamReader.
Edit (same idea, but this uses file streams):
// Open up file streams for the lists
using (FileStream _listA = new FileStream("listA.txt", FileMode.Open, FileAccess.Read, FileShare.Read))
using (FileStream _listB = new FileStream("listB.txt", FileMode.Open, FileAccess.Read, FileShare.Read))
using (StreamReader listA = new StreamReader(_listA))
using (StreamReader listB = new StreamReader(_listB))
using (StreamWriter listC = new StreamWriter("listC.txt"))
{
    string buzzwordA = listA.ReadLine();
    while (buzzwordA != null)
    {
        string buzzwordB = listB.ReadLine();
        while (buzzwordB != null)
        {
            listC.WriteLine(buzzwordA + buzzwordB);
            buzzwordB = listB.ReadLine();
        }
        buzzwordA = listA.ReadLine();
        // Reset the listB stream to the beginning for the next pass
        listB.BaseStream.Seek(0, SeekOrigin.Begin);
        listB.DiscardBufferedData(); // Needed so the reader doesn't keep serving stale buffered data
    }
} // All streams and readers are disposed by the using statements
For parsing the JSON try this out: C# Json Parser Library
Check out www.bustaname.com.
It sounds exactly like what you're doing.