Google DataPrep - Apparently Limited Table Size - google-cloud-platform

I'm trying to prepare SEO data from Screaming Frog, Majestic and Ahrefs, join it before importing said data into BigQuery for analysis.
The Majestic and Ahrefs csv files import after some pruning down to the 100MB limit.
The Screaming Frog CSV file, however, doesn't fully load, only displaying approx. 37,000 rows of 193,000. By further pruning less important columns in Excel and reducing the file size (from 44MB to 39MB), the number of rows loaded increases slightly. This would indicate to me that it's not an errant character or cell.
I've made sure (resaved via a text editor) that the CSV file is saved as UTF-8, and checked the limitations of Dataprep to see if there is a limit on the number of cells per Flow/Wrangle, but can find nothing.
The Majestic and Ahrefs files are larger and load completely with no issue. There is no data corruption in the Screaming Frog file. Is there something common I'm missing?
Is the total limit for all files 100MB?
Any advice or insight would be appreciated.

To get the full transformation of your files, you need to run the recipe.
What you see in the Dataprep Transformer page is only a head sample.
You can take a look at how the sampling works here.

Related

Google Cloud Vision not automatically splitting images for training/test

It's weird; for some reason GCP Vision won't allow me to train my model. I have met the minimum of 10 images per label, have no unlabelled images, and tried uploading a CSV pointing to 3 of this label's images as VALIDATION images. Yet I get this error:
Some of your labels (e.g. ‘Label1’) do not have enough images assigned to your Validation sets. Import another CSV file and assign those images to those sets.
Any ideas would be appreciated.
This error generally occurs when you have not labelled all the images: AutoML divides your images, including the unlabelled ones, into the three sets, and the error is triggered when unlabelled images end up in the VALIDATION set.
According to the documentation, 1,000 images per label are recommended. However, the minimum is 10 images per label, or 50 for complex cases. In addition,
The model works best when there are at most 100x more images for the most common label than for the least common label. We recommend removing very low frequency labels.
Furthermore, AutoML Vision uses 80% of your content documents for training, 10% for validation, and 10% for testing. Since your images were not divided into these three categories, you should manually assign them to TRAIN, VALIDATION and TEST. You can do that by uploading your images to a GCS bucket and referencing each labelled image in a .csv file, as follows:
TRAIN, gs://my_bucket/image1.jpeg,cat
As you can see above, it follows the format [SET],[GCS image path],[Label]. Note that you will be dividing your dataset manually, and the split should respect the percentages already mentioned; that way you will have enough data in each category. You can follow the steps for preparing your training data here and here.
Note: please be aware that your .csv file is case-sensitive.
Lastly, in order to validate your dataset and inspect labelled/unlabelled images, you can export the created dataset and check the exported .csv file, as described in the documentation. After exporting, download it and verify each SET (TRAIN, VALIDATION and TEST).
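As a rough sketch, the manual split described above can be scripted. The bucket name, file names and label below are hypothetical placeholders; the 80/10/10 proportions match the ones AutoML uses:

```python
import csv
import random

# Hypothetical inventory of (GCS path, label) pairs -- substitute your own images.
rows = [("gs://my_bucket/image%d.jpeg" % i, "cat") for i in range(100)]

random.seed(0)          # deterministic shuffle, just for the example
random.shuffle(rows)

# Manual 80/10/10 split into TRAIN / VALIDATION / TEST.
n = len(rows)
labelled = (
    [("TRAIN", path, label) for path, label in rows[: int(n * 0.8)]]
    + [("VALIDATION", path, label) for path, label in rows[int(n * 0.8): int(n * 0.9)]]
    + [("TEST", path, label) for path, label in rows[int(n * 0.9):]]
)

# Each row follows the [SET],[GCS image path],[Label] format shown above.
with open("import.csv", "w", newline="") as f:
    csv.writer(f).writerows(labelled)
```

With 100 images this produces 80 TRAIN, 10 VALIDATION and 10 TEST rows, so every set is guaranteed to be populated for each label you split this way.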

Annotation specs - AutoML (GCP)

I'm using the Natural Language module on Google Cloud Platform and more specifically AUTOML for text classification.
I come across this error which I do not understand when I have finished importing my data and the text has been processed :
Error: The dataset has too many annotation specs, the maximum allowed number is 5000.
What does it mean? Have you already got it?
Thanks
Take a look at the AutoML Quotas & Limits documentation for better understanding.
It seems that you are hitting the upper limit of labels per dataset. Check it in the AutoML limits --> Labels per dataset --> 2 - 5000 (for classification).
Take into account that limits, unlike quotas, cannot be increased.
I also got this error while I was certain that my number of labels was below 5000. It turned out to be an error in my CSV formatting.
When you create your text data with to_csv() in Pandas, it will by default only quote fields that contain a comma, while AutoML Text wants every line of text quoted. I have written up the solution in this Stack Overflow answer.
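A minimal sketch of the fix, assuming the pandas library is available: passing quoting=csv.QUOTE_ALL to to_csv() forces every field to be quoted, which is the format AutoML Text expects.

```python
import csv
import pandas as pd

# Toy data: one row with a comma, one without.
df = pd.DataFrame({"text": ["plain line", "line, with comma"],
                   "label": ["a", "b"]})

# Default behaviour: only the field containing a comma gets quoted.
default_csv = df.to_csv(index=False)

# QUOTE_ALL wraps every field in quotes, comma or not.
quoted_csv = df.to_csv(index=False, quoting=csv.QUOTE_ALL)
```

In default_csv only "line, with comma" ends up quoted; in quoted_csv every field, including "plain line", is quoted.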

C++ SQLite importing entire CSV file in C Interface

Is there a way to Import an entire CSV file into SQLite through the C Interface?
I'm aware of the commandline import that looks like this,
sqlite> .mode csv
sqlite> .import <filename> <table>
but I need to be able to do this in my program.
I should also note that I have successfully created a CSV reader in C++ that reads in a CSV file and inserts its content to a table line by line.
This gets the job done, but with a CSV containing 730k lines this method takes ~20 minutes to load, which is far too long. (This will be around the average size of the data being processed.)
(Machine: Intel(R) Core(TM)2 Duo CPU E8500 @ 3.16GHz, 4.0 GB RAM, Windows 7 64-bit, Visual Studio 2010)
This is unacceptable for my project, so I need a faster way, something taking around 2-3 minutes.
Is there a way to reference the file's memory location so Import isn't necessary? If so is access of the information slow?
Can SQLite take the CSV file as binary data? Would this make importing the file any faster?
Ideas?
Note: I'm using the ":memory:" option with the C Interface to load the DB in memory to increase speed (I hope).
EDIT
After doing some more optimizing I found this. It explains how you can group insert statements into one transaction by writing:
BEGIN TRANSACTION;
INSERT INTO table VALUES(...);
...a million more INSERT statements...
INSERT INTO table VALUES(...);
COMMIT;
This created a HUGE improvement in performance.
Useful Related Side Note
Also, if you're looking to create a table from a query's results, or to insert query results into a table, try this for creating tables or this for inserting results into a table.
The insert link might not make it obvious how to insert into a table. The query to do that looks like this:
INSERT INTO [TABLE] [QUERY]
where [TABLE] is the table you want the results of [QUERY] to go into.
I have successfully created a CSV reader in C++ that reads in a CSV file and inserts its content to a table line by line... takes ~20 minutes to load
Put all your inserts into a single transaction - or at least batch up 100 or 1000 rows per transaction - and I would expect your program to run much faster.
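To illustrate the batching idea, here is a sketch using Python's built-in sqlite3 module rather than the C interface; in C you would send the same BEGIN/COMMIT through sqlite3_exec() and reuse one prepared statement with sqlite3_bind_*/sqlite3_step/sqlite3_reset.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# explicit BEGIN/COMMIT below are exactly what gets sent to SQLite.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")

rows = [(i, "row%d" % i) for i in range(10000)]

# All inserts inside ONE transaction: SQLite journals and commits once,
# instead of once per row, which is where the big speedup comes from.
conn.execute("BEGIN")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
conn.execute("COMMIT")

count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```

The same pattern, one transaction around the whole bulk load, is what makes the BEGIN TRANSACTION approach above so much faster than committing every row.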

Searching for means to get smaller rdf (n3) dataset

I have downloaded yago.n3 dataset
However, for testing I wish to work on a smaller version of the dataset (the full dataset is 2 GB), and even when I make only a small change it takes me a lot of time to debug.
Therefore, I tried to copy a small portion of the data into a separate file; however, this did not work and threw lexical errors.
I saw the earlier posts, but they are about big datasets, whereas I am searching for smaller ones.
Is there any means by which I may obtain a smaller version of the same dataset?
If you have an RDF parser at hand to read your yago.n3 file, you can parse it and write to a separate file as many RDF triples as you want/need for your smaller dataset to run your experiments with.
If you find the data in N-Triples format (i.e. one RDF triple per line), you can just take as many lines as you want and make your dataset as small as you like: head -n 10 filename.nt would give you a tiny dataset of 10 triples.
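The same idea sketched in Python, in case you are not on a system with head (file names below are placeholders): since N-Triples holds one complete triple per line, any prefix of the file is itself a valid N-Triples dataset.

```python
# Copy the first n lines of an N-Triples file into a smaller test file.
def head_triples(src, dst, n=10):
    with open(src) as fin, open(dst, "w") as fout:
        for i, line in enumerate(fin):
            if i >= n:          # stop after n triples (one per line)
                break
            fout.write(line)
```

Usage: head_triples("yago.nt", "yago_small.nt", 100) gives you a 100-triple file you can iterate on quickly. Note this only works for line-based serializations like N-Triples, not for N3/Turtle, where prefixes and multi-line statements make an arbitrary prefix invalid.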

Webtrends - utilization report

I want to find out how many files were downloaded on my website out of the total number of files. E.g.: I have a million PDF files and people have downloaded only 100,000 of them; that is 10% utilization.
I tried the downloaded-files report, but it shows only the top 1,000 files. Is there a way to get the complete count, i.e. the number of files downloaded at least once?
Is it possible to get this count without re-analyzing the report?
First of all, no, it is not possible without re-analyzing the profile and the report. You have to adjust the so-called "Table Size Limit", which limits the number of elements to analyze and the number of elements to show in the report.
Example: you have 1 million pages on your website. The report analysis limit is set to 250,000 pages, so beyond that, new pages will not be recorded and counted by Webtrends. The final report will only show you the top 2,000 pages.
You need to increase the Table Size Limits and re-analyze. If you do not use Webtrends On-Demand and you still have the logs, a re-analysis will not affect your page-view licenses.