How do I read a .csv file into GCP Dataflow, get the count for a specific column, and write it to BigQuery?

I need to read a CSV file that represents a table into Dataflow, perform a GroupBy transformation to get the number of elements in a specific column, and then write that number to a BigQuery table along with the original file.
So far I've got the first step working - reading the file from my storage bucket - and I've applied a transformation, but I don't know how to get the count for a single column, since the CSV has 16 columns.
public class StarterPipeline {
  private static final Logger LOG = LoggerFactory.getLogger(StarterPipeline.class);

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());
    PCollection<String> lines = p.apply("ReadLines", TextIO.read().from("gs://bucket/data.csv"));
    PCollection<String> grouped_lines = lines.apply(GroupByKey());
    PCollection<java.lang.Long> count = grouped_lines.apply(Count.globally());
    p.run();
  }
}

You are reading whole lines from your CSV into a PCollection of strings. That is most likely not enough for you.
What you want to do is:
1. Split each line into the individual column values.
2. Filter the PCollection down to the elements that have a value in the required column. [1]
3. Apply Count. [2]
A minimal sketch of this approach follows the references below.
[1] https://beam.apache.org/releases/javadoc/2.2.0/org/apache/beam/sdk/transforms/Filter.html
[2] https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/transforms/Count.html
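For concreteness, here is a minimal sketch of that approach with the Beam Java SDK. It reads the same gs://bucket/data.csv path as the question, assumes the column of interest is at index 3, does a naive comma split (no quoted fields, no special header handling), and uses placeholder BigQuery table and field names ("my-project:my_dataset.column_counts", "column_count") that you would replace with your own:

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.values.PCollection;

public class ColumnCountPipeline {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

    // Read every line of the CSV as a single String element.
    PCollection<String> lines = p.apply("ReadLines", TextIO.read().from("gs://bucket/data.csv"));

    // Split each line into its columns, keep only the rows that have a value
    // in the column of interest (index 3 here is a placeholder), then count what is left.
    PCollection<Long> count = lines
        .apply("FilterOnColumn", Filter.by((String line) -> {
          String[] cols = line.split(",", -1); // naive split: assumes no quoted commas
          return cols.length > 3 && !cols[3].isEmpty();
        }))
        .apply("CountRows", Count.globally());

    // Write the single count to BigQuery.
    count.apply("WriteCount", BigQueryIO.<Long>write()
        .to("my-project:my_dataset.column_counts")
        .withFormatFunction(c -> new TableRow().set("column_count", c))
        .withSchema(new TableSchema().setFields(Collections.singletonList(
            new TableFieldSchema().setName("column_count").setType("INTEGER"))))
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run().waitUntilFinish();
  }
}

Writing the original rows to BigQuery as well would follow the same BigQueryIO pattern, with one TableRow per CSV line.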

It would be better to convert the CSV into a more suitable form first. For example, convert each line into a TableRow and then perform a GroupByKey-style aggregation keyed on the column of interest. That way you can tell which value belongs to which column and compute the count from that. A rough sketch follows.
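The sketch below (reusing the lines PCollection from the previous example) illustrates the idea; the column index and the column names "col0" and "col3" are placeholders, and it additionally needs the MapElements, TypeDescriptor, TypeDescriptors, KV and TableRowJsonCoder imports:

// Parse each CSV line into a TableRow (only two placeholder columns mapped here).
PCollection<TableRow> rows = lines
    .apply("ToTableRow", MapElements
        .into(TypeDescriptor.of(TableRow.class))
        .via((String line) -> {
          String[] cols = line.split(",", -1);
          return new TableRow().set("col0", cols[0]).set("col3", cols[3]);
        }))
    .setCoder(TableRowJsonCoder.of());

// Key each row by the value of the column of interest and count per distinct value,
// which is the GroupByKey-style aggregation described above.
PCollection<KV<String, Long>> countsPerValue = rows
    .apply("KeyByColumn", MapElements
        .into(TypeDescriptors.strings())
        .via((TableRow r) -> (String) r.get("col3")))
    .apply("CountPerValue", Count.perElement());

Count.perElement() gives one (value, count) pair per distinct value in that column; Count.globally() would give a single overall total instead.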

Related

Adding label in AutoML for text classification

I am trying to create a text dataset in a pipeline for text classification, but I believe I am doing it the wrong way, or at least I don't get it. The CSV I am passing only contains two columns, message and label, where label is either true or false.
Inside my pipeline I am creating the dataset like this, and I am not sure how the dataset recognizes that the column label is the target variable.
dataset = gcp_aip.TextDatasetCreateOp(
    project=project,  # my project id
    display_name=display_name,  # reference name
    gcs_source=src_uris,  # path to my data in gcs
    import_schema_uri=aiplatform.schema.dataset.ioformat.text.single_label_classification,
)
Once the dataset is created, I do the training like this within the pipeline:
# training
model = gcp_aip.AutoMLTextTrainingJobRunOp(
    project=project,
    display_name=display_name,
    prediction_type="classification",
    multi_label=False,
    dataset=dataset.outputs["dataset"],
)
I'm not sure whether the dataset creation and the training are done correctly, since I never specified that label is my label column and that message should be used as the feature.
In Vertex AI the created dataset looks as I expect, but in the training section the AutoML results show a class called "label" with 0%. I don't know why it is there, and it makes me doubt how the data was inserted.
When preparing the CSV file, you don't need to specify which column is the feature and which is the label. Vertex AI AutoML automatically reads the first column as the feature and the second column as the label. You may refer to the documentation on preparing CSV data for more details.
In a sample CSV file like the one below, all values under the first column (column A) are detected to be the feature and all values under the second column (column B) are the labels.
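For illustration only (the message values here are made up, matching the message/label columns from your question), such a file would look like this:

message text one,True
message text two,False
message text three,True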
You might need to check your CSV file, search for the word "label" in the second column, and replace it with either "True" or "False", since based on your data you only want to have two labels, "True" and "False". In addition, if you find the word "label" in the second column and there is no value in the first column of that row, just remove the word "label".
In your provided screenshot there is a count of 1 for the word "label", which means a "label" value exists in the second column of your CSV data.

How to load large images in Power BI

I want to load pictures into Power BI. The problem is that the pictures are a bit bigger than 32 KB, so only part of each picture is shown.
Is there a quick way to get around the Power BI limitation and display the entire picture?
I fetched the pictures from Active Directory, then converted the binary files to text in order to display them as pictures (using the prefix data:image/jpeg;base64).
There is a great article by Chris Webb, Storing Large Images In Power BI Datasets, from which I will copy the essentials here in case the original article becomes unavailable.
Quote from Chris Webb's article:
The maximum length of a text value that the Power Query engine can load into a single cell in a table in a dataset is 32766 characters – any more than that and the text will be silently truncated. To work around this, what you need to do is to split the text representation of the image up into multiple smaller text values stored across multiple rows, each of which is less than the 32766 character limit, and then reassemble them in a DAX measure after the data has been loaded.
Splitting the text up in M is actually not that hard, but it is hard to do efficiently. Here’s an example of an M query that reads all the data from all of the files in the folder above and returns a table:
let
    //Get list of files in folder
    Source = Folder.Files("C:\Users\Chris\Documents\PQ Pics"),
    //Remove unnecessary columns
    RemoveOtherColumns = Table.SelectColumns(Source,{"Content", "Name"}),
    //Creates Splitter function
    SplitTextFunction = Splitter.SplitTextByRepeatedLengths(30000),
    //Converts table of files to list
    ListInput = Table.ToRows(RemoveOtherColumns),
    //Function to convert binary of photo to multiple
    //text values
    ConvertOneFile = (InputRow as list) =>
        let
            BinaryIn = InputRow{0},
            FileName = InputRow{1},
            BinaryText = Binary.ToText(BinaryIn, BinaryEncoding.Base64),
            SplitUpText = SplitTextFunction(BinaryText),
            AddFileName = List.Transform(SplitUpText, each {FileName,_})
        in
            AddFileName,
    //Loops over all photos and calls the above function
    ConvertAllFiles = List.Transform(ListInput, each ConvertOneFile(_)),
    //Combines lists together
    CombineLists = List.Combine(ConvertAllFiles),
    //Converts results to table
    ToTable = #table(type table[Name=text,Pic=text],CombineLists),
    //Adds index column to output table
    AddIndexColumn = Table.AddIndexColumn(ToTable, "Index", 0, 1)
in
    AddIndexColumn
The query above returns a table with Name, Pic and Index columns. The Pic column contains the split text values, each of which is less than the 32766-character limit, so no truncation occurs when this table is loaded into Power BI. The Index column is necessary because without it we won't be able to recombine all the split values in the correct order.
The only thing left to do is to create a measure that uses the DAX ConcatenateX() function to concatenate all of the pieces of text back into a single value, like so:
Display Image =
IF(
    HASONEVALUE('PQ Pics'[Name]),
    "data:image/jpeg;base64, " &
        CONCATENATEX(
            'PQ Pics',
            'PQ Pics'[Pic],
            ,
            'PQ Pics'[Index],
            ASC)
)
…set the data category of this measure to be "Image URL", and then display the value of the image in a report.
Storing images directly in Power BI is not the best option, as with the convert-to-base64 method you'll be limited to 32 KB in size. If possible, it would be best to extract the images, place them in Azure Blob Storage (or another accessible store) and reference them from there. You can use the Image URL data type to show them in a table, or the HTML viewer custom visual to show an image via a URL. You'll have to use Power BI to get the list of images in Blob Storage, but if the image names match you can link that table to your dataset.
This worked well except that:
The image type is hard-coded.
The 2021-07 version of Power BI Desktop doesn't have the Image URL visualization.
I also tried Card, MultiCard and two free visuals, Chiclet Browser and Image Grid, but none of these showed my PNG image. Perhaps because it is not a JPEG? The image metadata should be correct.
So I only managed to show a miniature picture in a table.

How to delete a row from a CSV file on Data Lake Store without using U-SQL?

I am writing a unit test for appending data to a CSV file on a Data Lake Store. I want to test it by finding my test data appended to the same file, and once I have found it I want to delete the row I inserted. Basically, once I find the test data my test will pass, but since the tests run in production I have to search for my test data, i.e. find the row I inserted in the file, and delete it after the test has run.
I want to do this without using U-SQL in order to avoid the cost involved. What other ways are there to do it?
You cannot delete a row (or any other part) from a file. Azure Data Lake Store is an append-only file system: data, once committed, cannot be erased or updated. If you're testing in production, your application needs to be aware of the test rows and ignore them appropriately.
The other choice is to read all the rows in U-SQL and then write an output excluding the test rows.
Like other big data analytics platforms, ADLA / U-SQL does not support appending to files per se. What you can do is take an input file, append some content to it (e.g. via U-SQL) and write it out as another file. A simple example:
DECLARE @inputFilepath string = "input/input79.txt";
DECLARE @outputFilepath string = "output/output.txt";

@input =
    EXTRACT col1 int,
            col2 DateTime,
            col3 string
    FROM @inputFilepath
    USING Extractors.Csv(skipFirstNRows : 1);

@output =
    SELECT *
    FROM @input
    UNION ALL
    SELECT *
    FROM (
        VALUES
        (
            2,
            DateTime.Now,
            "some string"
        )
    ) AS x (col1, col2, col3);

OUTPUT @output
TO @outputFilepath
USING Outputters.Csv(quoting : false, outputHeader : true);
If you want further control, you can do some things via the PowerShell SDK, e.g. test whether an item exists:
Test-AdlStoreItem -Account $adls -Path "/data.csv"
or move an item with Move-AzureRmDataLakeStoreItem. More details here:
Manage Azure Data Lake Analytics using Azure PowerShell

How to iterate over each line of an RDD created from a textFile

I'm trying to do something like this:
file = sc.textFile('mytextfile')

def myfunction(mystring):
    new_value = mystring
    for i in file.toLocalIterator():
        if i in mystring:
            new_value = i
    return new_value

rdd_row = some_data_frame.map(lambda u: Row(myfunction(u.column_name)))
But I get this error
It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers
The problem is (as the error message clearly states) that you are trying to work with an RDD inside the map. file is an RDD; you can apply various transformations and actions to it (e.g. the local iterator you are using), but you cannot use them inside another transformation - in this case, the map.
UPDATE
If I understand correctly, you have a dataframe df with a column URL. You also have a text file which contains blacklist values.
Let's assume for the sake of argument that your blacklist file is a CSV with a column blacklistNames and that the dataframe df's URL column is already parsed, i.e. you just want to check whether URL is in the blacklistNames column.
What you can do is something like this:
df.join(blackListDF, df["URL"]==blackListDF["blacklistNames"], "left_outer")
This join basically adds a blacklistNames column to your original dataframe, which contains the matched name if the URL is in the blacklist and null otherwise. Now all you need to do is filter on whether the new column is null, e.g. with isNull() or isNotNull() on that column.

Import data from HBase and insert it into another table using a Pig UDF

I'm using Pig UDFs in Python to read data from an HBase table, then process and parse it, and finally insert it into another HBase table. But I'm facing some issues.
Pig's map is equivalent to Python's dictionary.
My Python script takes (rowkey, some_mapped_values) as input and returns two strings, "key" and "content", with the output schema @outputSchema('tuple:(rowkey:chararray, some_values:chararray)').
The core of my Python script takes a rowkey, parses it, transforms it into another rowkey, and transforms the mapped data into another string, returning the variables (key, content).
But when I try to insert the new values into the other HBase table, I face two problems:
1. The processing is done correctly, but the script inserts "new_rowkey+content" as the rowkey and leaves the cell empty.
2. How do I specify the new column family to insert into?
Here is a snippet of my Pig script:
register 'parser.py' using org.apache.pig.scripting.jython.JythonScriptEngine as myParser;

data = LOAD 'hbase://source'
       USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('values:*', '-loadKey true')
       AS (rowkey:chararray, curves:map[]);

data_1 = FOREACH data GENERATE myParser.table_to_xml(rowkey, curves);

STORE data_1 INTO 'hbase://destination'
      USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('a_column_family:curves:*');
And my Python script takes (rowkey, curves) as input:
@outputSchema('tuple:(rowkey:chararray, curves:chararray)')
def table_to_xml(rowkey, curves):
    key = some_processing_which_is_correct
    for k in curves:
        content = some_processing_which_is_also_correct
    return (key, content)