I am trying to create a text dataset in a pipeline for text classification, but I believe I am doing it the wrong way, or at least I don't understand it. The CSV I am passing only contains two columns, message and label, where label is either true or false.
Inside my pipeline I create the dataset like this, and I am not sure how the dataset recognizes that the label column is the target variable:
dataset = gcp_aip.TextDatasetCreateOp(
    project = project,            # my project id
    display_name = display_name,  # reference name
    gcs_source = src_uris,        # path to my data in gcs
    import_schema_uri = aiplatform.schema.dataset.ioformat.text.single_label_classification,
)
Once the dataset is created, I run training like this within the pipeline:
# training
model = gcp_aip.AutoMLTextTrainingJobRunOp(
    project = project,
    display_name = display_name,
    prediction_type = "classification",
    multi_label = False,
    dataset = dataset.outputs["dataset"],
)
I am not sure whether the dataset creation and training are being done correctly, since I never specified that label is my label column and that message should be used as a feature.
In Vertex AI, the created dataset looks like this:
But in the training section, the results from AutoML look like this. I don't know why a label with 0% is there, which makes me doubt whether the data was imported correctly.
When preparing the CSV file, you don't need to specify which column is the feature and which is the label. Vertex AI's AutoML automatically reads the first column as the feature and the second column as the label. You may refer to this documentation for more details on preparing CSV data.
Below is a sample CSV file: all values in the first column (column A) are detected as the feature, and all values in the second column (column B) are the labels.
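For illustration (these rows are made up), a file in that layout is simply two plain columns, message first and label second, with no header row; a leftover header row such as message,label would itself be imported as data, which is what the next point is about:
I love this product,true
the delivery was late and the box was damaged,false
great support and quick response,true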
You might need to check your CSV file for the word "label" in the second column and replace it with either "True" or "False", since based on your data you only want two labels, "True" and "False". In addition, if you find the word "label" in the second column and there is no value in the first column of that row, you just need to remove the word "label".
In your provided screenshot, there is a count of 1 for the word "label", which means a "label" value exists in the second column of your CSV data.
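If you want to sanity-check the file before importing it, a small pandas sketch like the one below lists the distinct values in the second column and drops any stray "label" rows; the file names and the two-column layout are assumptions, so adjust them to your data:
import pandas as pd

# read the two-column CSV without treating the first row as a header
df = pd.read_csv("messages.csv", header=None, names=["message", "label"])

# show every distinct value in the label column; anything other than true/false is suspect
print(df["label"].astype(str).str.strip().value_counts())

# drop rows whose label is literally the word "label" (a leftover header) and rewrite the file
cleaned = df[df["label"].astype(str).str.strip().str.lower() != "label"]
cleaned.to_csv("messages_clean.csv", header=False, index=False)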
Would you let me know how to make a pivot table in Excel using xlwings?
Please give me sample code.
Thanks for your help.
I have tried to find out how to do it, but I couldn't find anything.
Creating a PivotTable with xlwings is not currently straightforward, and requires the use of the .api to access the VBA functions.
Here is an example that creates a PivotTable from my mock data in a Table called Table1, which has 3 columns of data with the headers "Colour", "Type", and "Data".
import xlwings as xw

# assumes the workbook containing Table1 is already open and active in Excel
wb = xw.books.active
ws = wb.sheets["Sheet1"]

# set the kwarg values for creating the PivotTable
source_type = xw.constants.PivotTableSourceType.xlDatabase
source_data = ws["Table1[#All]"].api  # cannot be of type Range or String
table_destination = ws["A20"].api  # cannot be of type Range or String
table_name = "PTable1"

# create the PivotTable
wb.api.PivotCaches().Create(SourceType=source_type,
                            SourceData=source_data).CreatePivotTable(
    TableDestination=table_destination,
    TableName=table_name)

pt = ws.api.PivotTables(table_name)
# Set Row Field (Rows) as Colour column of table
pt.PivotFields("Colour").Orientation = xw.constants.PivotFieldOrientation.xlRowField
# Set Column Field (Columns) as Type column of table
pt.PivotFields("Type").Orientation = xw.constants.PivotFieldOrientation.xlColumnField
# Set Data Field (Values) as Data
pt.AddDataField(pt.PivotFields("Data"),
# with name as "Sum of Data"
"Sum of Data",
# and calculation type as sum
xw.constants.ConsolidationFunction.xlSum)
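If the data in Table1 changes later, the pivot can be refreshed through the same .api handle; a minimal follow-up using the pt object created above:
# re-read the source data into the existing PivotTable
pt.RefreshTable()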
Additional Detail
For the reasoning behind the xlDatabase source_type, the VBA documentation can be found here.
The parameters can be seen in the VBA documentation here. This tutorial provides a detailed explanation of them, along with how to adapt the code for different scenarios (dynamic range, in a new workbook, etc.); the guide also gives a couple of additional changes to the formatting of the values, such as the position of fields and the number format.
The options for type of data calculation can be found here.
I need to read a CSV file that represents a table into Dataflow, perform a GroupBy transformation to get the number of elements in a specific column, and then write that number to a BigQuery table along with the original file.
So far I've got the first step working: reading the file from my storage bucket. I've also called a transformation, but I don't know how to get the count for a single column, since the CSV has 16.
public class StarterPipeline {
    private static final Logger LOG = LoggerFactory.getLogger(StarterPipeline.class);

    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

        PCollection<String> lines = p.apply("ReadLines", TextIO.read().from("gs://bucket/data.csv"));
        PCollection<String> grouped_lines = lines.apply(GroupByKey());
        PCollection<java.lang.Long> count = grouped_lines.apply(Count.globally());

        p.run();
    }
}
You are reading whole lines from your CSV into a PCollection of strings. That's most likely not enough for you.
What you want to do is:
Split each whole line into multiple strings, one per column.
Filter the PCollection to the values that have something in the required column. [1]
Apply Count. [2] (A sketch of these steps follows after the links below.)
[1] https://beam.apache.org/releases/javadoc/2.2.0/org/apache/beam/sdk/transforms/Filter.html
[2] https://beam.apache.org/releases/javadoc/2.0.0/org/apache/beam/sdk/transforms/Count.html
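For illustration only, here is a minimal sketch of those three steps using the Beam Python SDK (the linked Java transforms, Filter and Count, work the same way); the comma delimiter and the column index are assumptions you would adjust to your 16-column file:
import apache_beam as beam

with beam.Pipeline() as p:
    count = (
        p
        | "ReadLines" >> beam.io.ReadFromText("gs://bucket/data.csv", skip_header_lines=1)
        # 1. split each line into its column values
        | "SplitColumns" >> beam.Map(lambda line: line.split(","))
        # 2. keep only rows that have something in the required column (index 3 here is an assumption)
        | "FilterEmpty" >> beam.Filter(lambda cols: len(cols) > 3 and cols[3].strip() != "")
        # 3. count the remaining rows
        | "CountRows" >> beam.combiners.Count.Globally()
    )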
It would be better if you convert that CSV into a more suitable form. For example, convert each line into a TableRow and then perform a GroupByKey based on the column of interest. This way you can identify the column value for each row and find the count based on that.
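A similar Python sketch of this grouped approach, for illustration: instead of building a TableRow, each line is keyed by the value in the column of interest and a count is produced per distinct value (again, the column index is an assumption):
import apache_beam as beam

with beam.Pipeline() as p:
    counts_per_value = (
        p
        | "ReadLines" >> beam.io.ReadFromText("gs://bucket/data.csv", skip_header_lines=1)
        # key each row by the value in the column of interest (index 3 is an assumption)
        | "KeyByColumn" >> beam.Map(lambda line: line.split(",")[3].strip())
        # count how many rows carry each distinct value of that column
        | "CountPerValue" >> beam.combiners.Count.PerElement()
    )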
I want to load pictures into Power BI. The problem is that the pictures are a bit bigger than 32 KB, so only part of each picture is shown.
Is there some quick way to work around the Power BI limitation and display the entire picture?
I fetched the pictures from Active Directory, then converted the binary files to text in order to set them as pictures (using the prefix data:image/jpeg;base64).
There is a great article by Chris Webb, Storing Large Images In Power BI Datasets, from which I will copy the essentials here, in case the original article becomes unavailable.
Quote from Chris Webb's article
The maximum length of a text value that the Power Query engine can load into a single cell in a table in a dataset is 32766 characters – any more than that and the text will be silently truncated. To work around this, what you need to do is to split the text representation of the image up into multiple smaller text values stored across multiple rows, each of which is less than the 32766 character limit, and then reassemble them in a DAX measure after the data has been loaded.
Splitting the text up in M is actually not that hard, but it is hard to do efficiently. Here’s an example of an M query that reads all the data from all of the files in the folder above and returns a table:
let
    //Get list of files in folder
    Source = Folder.Files("C:\Users\Chris\Documents\PQ Pics"),
    //Remove unnecessary columns
    RemoveOtherColumns = Table.SelectColumns(Source,{"Content", "Name"}),
    //Creates Splitter function
    SplitTextFunction = Splitter.SplitTextByRepeatedLengths(30000),
    //Converts table of files to list
    ListInput = Table.ToRows(RemoveOtherColumns),
    //Function to convert binary of photo to multiple
    //text values
    ConvertOneFile = (InputRow as list) =>
        let
            BinaryIn = InputRow{0},
            FileName = InputRow{1},
            BinaryText = Binary.ToText(BinaryIn, BinaryEncoding.Base64),
            SplitUpText = SplitTextFunction(BinaryText),
            AddFileName = List.Transform(SplitUpText, each {FileName,_})
        in
            AddFileName,
    //Loops over all photos and calls the above function
    ConvertAllFiles = List.Transform(ListInput, each ConvertOneFile(_)),
    //Combines lists together
    CombineLists = List.Combine(ConvertAllFiles),
    //Converts results to table
    ToTable = #table(type table[Name=text,Pic=text],CombineLists),
    //Adds index column to output table
    AddIndexColumn = Table.AddIndexColumn(ToTable, "Index", 0, 1)
in
    AddIndexColumn
Here’s what the query above returns:
The Pic column contains the split text values, each of which is less than the 32766 character limit, so when this table is loaded into Power BI no truncation occurs. The Index column is necessary because without it we won't be able to recombine all the split values in the correct order.
The only thing left to do is to create a measure that uses the DAX ConcatenateX() function to concatenate all of the pieces of text back into a single value, like so:
Display Image =
IF(
    HASONEVALUE('PQ Pics'[Name]),
    "data:image/jpeg;base64, " &
        CONCATENATEX(
            'PQ Pics',
            'PQ Pics'[Pic],
            ,
            'PQ Pics'[Index],
            ASC
        )
)
…set the data category of this measure to be “Image URL”:
…and then display the value of the image in a report:
Storing images directly in Power BI is not the best option, as with the convert-to-base64 method you'll be limited to 32 KB in size. If possible, it would be best to extract the images, place them in an Azure Blob Store (or other accessible store), and reference them from there. You can use the Image URL data type to show them in a table, or the HTML viewer custom visual to show an image via its URL. You'll have to use Power BI to get the list of images in Blob Storage, but if the image names are the same you can link that table to your dataset.
This worked well except:
The image type is hard-coded.
The 2021-07 version of Power BI Desktop doesn't have the Image URL visualization.
I also tried Card, Multi-row card, and two free visuals, Chiclet Browser and Image Grid, but none of these showed my PNG image. Perhaps because it is not JPEG? The image metadata should be correct.
So I only managed to show a miniature picture in a table.
I have an issue with Oracle APEX Data Load which I will try to explain in a simple way:
I want to copy CSV data (copy/paste) into the Data Load application, apply the transformations and rules, and load the data into the table BIKE.
csv columns (type, amount-a, amount-b)
blue, 10, 100
green, 20, 200
table BIKE columns
(type, amount)
I want to create a transformation that checks the value of the type column: if it is 'blue', then load amount-b into the table BIKE, otherwise load amount-a.
Can anyone help me with this?
Thanks
Create a staging table, e.g. like this (you will need to adjust the details of the columns to match your data model):
create table bike_staging (
    apex_session number        not null,
    bike_id      number        not null,
    bike_type    varchar2(100) not null,
    amount_a     number,
    amount_b     number
);
Add a trigger to populate apex_session:
create or replace trigger bi_bike_staging
before insert on bike_staging for each row
begin
    :new.apex_session := v('APP_SESSION');
end bi_bike_staging;
Add two processes to the third page of the data load wizard, on either side of the "Prepare Uploaded Data" process, like this:
The code for "delete staging table" will be like this:
delete bike_staging where apex_session = :APP_SESSION;
The code for "load bikes" may be something like this:
merge into bikes t
using (select bike_id,
              bike_type,
              case bike_type
                when 'blue' then amount_b
                else amount_a
              end as amount
         from bike_staging
        where apex_session = :APP_SESSION
      ) s
on (t.bike_id = s.bike_id)
when matched then update set
    t.bike_type = s.bike_type,
    t.amount = s.amount
when not matched then insert (bike_id, bike_type, amount)
    values (s.bike_id, s.bike_type, s.amount);

delete bike_staging where apex_session = :APP_SESSION;
Alternatively, if you are only inserting new records, you don't need the merge; you can use a simple insert statement.
I am currently in the process of writing some code to analyse the mushroom data from UCI using Weka. I am trying to get the values (i.e. coefficients) of the attributes, but the attribute names are truncated (indicated by the "..."), and I am unable to get the full set of coefficients for the attributes.
e.g.
#attribute -0.251a=e+0.242m=k+0.241n=k-0.224t=p+0.213f=f... numeric
Any help would be greatly appreciated.
I believe your attribute names are being truncated because of an option in the PCA filter.
-A
Maximum number of attributes to include in
transformed attribute names.
(-1 = include all, default: 5)
Using the following code I change the value of this option to -1 and print an attribute name from the transformed data.
Instances originalTrain = ... // load the training data
PrincipalComponents pca = new PrincipalComponents(); // new PCA filter
pca.setMaximumAttributeNames(-1); // set the option value to -1
pca.setInputFormat(originalTrain); // inform the filter about the dataset
Instances newData = Filter.useFilter(originalTrain, pca); // apply the filter
System.out.println(newData.attribute(0).name()); // look at the new name
An example of the now untruncated attribute name is shown below (scroll to view):
0.257stalksurfacebelowring=k+0.256stalksurfaceabovering=k+0.234ringtype=l+0.231odor=f-0.215ringtype=p-0.212stalksurfaceabovering=s+0.206sporeprintcolor=h-0.195stalksurfacebelowring=s+0.185bruises+0.18 stalkroot=b-0.176stalkcolorbelowring=w-0.175stalkcolorabovering=w-0.173odor=n-0.139sporeprintcolor=n-0.134sporeprintcolor=k+0.133habitat=p+0.133gillcolor=b+0.13 stalkcolorbelowring=b+0.13 stalkcolorabovering=b+0.129population=v+0.128stalkcolorabovering=n-0.125population=s-0.124stalkroot=e+0.121stalkcolorbelowring=n-0.119capcolor=w+0.119stalkcolorbelowring=p+0.119stalkcolorabovering=p-0.11gillspacing-0.105stalkroot=c-0.101gillcolor=n+0.094sporeprintcolor=w-0.087capshape=b-0.085gillcolor=k-0.082odor=l-0.082odor=a-0.082habitat=m+0.08 capcolor=y-0.08gillcolor=w+0.078gillcolor=h-0.076population=n-0.073habitat=g-0.072gillsize+0.068odor=y+0.068odor=s-0.067population=a-0.065capsurface=s-0.064odor=p+0.063gillcolor=g-0.059stalksurfaceabovering=f+0.057capsurface=y-0.057ringnumber=t-0.057stalksurfacebelowring=f+0.055ringnumber=o+0.051population=y-0.05habitat=u-0.048stalkcolorabovering=o-0.048stalkcolorbelowring=o+0.047veilcolor=w-0.046population=c+0.046capshape=k+0.046ringtype=e-0.046gillattachment-0.045stalkcolorabovering=g-0.045stalkcolorbelowring=g+0.043capcolor=e-0.041stalkroot=r-0.039gillcolor=u+0.039capcolor=g+0.034habitat=l-0.034veilcolor=n-0.034veilcolor=o-0.033habitat=w-0.031capcolor=p-0.031odor=c-0.031stalksurfacebelowring=y-0.031sporeprintcolor=r+0.03 capshape=f-0.029capcolor=n-0.028gillcolor=o-0.024stalkshape-0.024sporeprintcolor=o-0.024sporeprintcolor=y-0.024sporeprintcolor=b-0.024gillcolor=y-0.023gillcolor=e-0.023capcolor=b-0.023stalkcolorabovering=e-0.023stalkcolorbelowring=e-0.019gillcolor=r-0.018capshape=s-0.018sporeprintcolor=u-0.015capshape=x+0.012habitat=d+0.009gillcolor=p-0.006capsurface=g+0.005capsurface=f-0.004capshape=c+0.003stalkcolorbelowring=y-0.003stalkcolorabovering=y-0.003veilcolor=y+0.001stalksurfaceabovering=y+0.001capcolor=u+0.001capcolor=r-0.001capcolor=c+0 stalkcolorabovering=c+0 odor=m+0 ringtype=n+0 stalkcolorbelowring=c+0 ringnumber=n+0 ringtype=f