Export data from GBQ into CSV with specific encoding - google-cloud-platform

I'm using GBQ, and I want to export the results of a query to a CSV file.
The data is larger than 20M lines, so I'm using this option:
Some of the text in my query results is in French, and it is being saved to the CSV with the wrong encoding.
Is there a way to define the encoding at the saving step in GBQ?
Thank you

You can write a simple Python script (or use another language you're comfortable with) to run the query and save the results yourself. That way you can write the CSV file with whatever encoding you need.
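For example, here is a minimal sketch using the google-cloud-bigquery client library (it assumes Python 3; the query, table, and output file name are placeholders):
import csv
from google.cloud import bigquery

client = bigquery.Client()
rows = client.query('SELECT * FROM `my_dataset.my_table`').result()  # streams results page by page

with open('result.csv', 'w', encoding='utf-8', newline='') as f:  # choose whatever encoding you need
    writer = csv.writer(f)
    writer.writerow([field.name for field in rows.schema])  # header row
    for row in rows:
        writer.writerow(row.values())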

Related

What would be a way to convert a csv file to json WITHOUT using avroSchema or ConvertRecord Processors in Apache NiFi?

So I had made a workflow in Apache NiFi that extracted email attachments and converted the CSV files into JSON files. I used InferAvroSchema and ConvertRecord to convert the CSV into JSON. Everything works well until I get a CSV file that does not follow the Avro schema I had written. Now I need to find a way to convert CSV to JSON without using these two processors, as the CSV formatting will vary from time to time. I will link the CSV format I am currently working with below.
I have tried ExtractText, but I am having trouble writing the correct regex to extract the values that match their headers. I also tried AttributesToJSON, but it seems like it is not reading the desired attributes. I know I can specify which attributes to pull, but since the headers/values will be changing constantly, I can't seem to find a way to set it up dynamically.
Current CSV format
If you are using NiFi 1.9.2+, you can use a CsvReader which automatically infers schema on a per-flowfile basis. As the JsonRecordSetWriter can use the embedded inferred schema to write out the JSON as well, you no longer need an explicit Avro schema to be pre-defined.
As long as all the lines of CSV in a single flowfile follow the same schema, you won't have any problems. If you can have different schemas in the same flowfile (which I suspect would cause many additional problems as well), you'll have to filter them first into separate flowfiles.
Have you tried writing a script using the ExecuteStreamCommand processor?
And more specifically, are you talking about the headers being different? There are options in the ConvertRecord processors to include headers.
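To illustrate the ExecuteStreamCommand suggestion, here is a minimal sketch of a script it could call (it assumes the flowfile content is piped to the script's stdin and the new content is read from stdout; the field names come from the first CSV row, so no pre-defined Avro schema is needed):
import csv
import json
import sys

reader = csv.DictReader(sys.stdin)           # first CSV row becomes the field names
sys.stdout.write(json.dumps(list(reader)))   # emit one JSON array of row objects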

Export CSV to open as utf-8 in Excel on MAC using Python 2.7

We have users who need to be able to export data to a CSV that supports UTF-8 characters, which they then open in Excel on Mac machines.
NOTE: We don't want our users to have to go to the data tab, click import from text, then... We want them to be able to open the file immediately after downloading and have it display the correct info.
At first, I thought this was just an encoding/decoding problem since we are using Python 2.7 (we are actively working on upgrading to Python 3.6), but after that was fixed, I discovered Excel was the cause of the problem (the CSV works fine when opened in a text editor or even Numbers). The solution I am trying involves adding the UTF-8 BOM to the beginning of the file, as I read somewhere that this would let Excel know that it requires UTF-8.
# Here response is just a variable that is valid when used like this, and
# we can already export CSVs fine that don't need UTF-8
writer = csv.writer(response)
writer.writerow("0xEF0xBB0xBF")
I was hoping that just adding the UTF-8 BOM to the beginning of the CSV file like this would let Excel realize it needed to use UTF-8 encoding when opening the file, but alas it does not work. I am not sure if this is because Excel for Mac doesn't support this or if I simply added the BOM incorrectly.
Edit: I'm not sure why I didn't mention it, as it was critical to the solution, but we are using Django. I found a Stack Overflow post that gave the solution (which I've included below).
Because we are using Django, we were able to just include:
response.write('\xEF\xBB\xBF')
before creating a csv writer and adding the content to the csv.
Another idea that probably would have led to a solution is opening the file normally, adding the BOM, and then creating a CSV writer. (Note: I did not test this idea, but if the above solution doesn't work for someone, or they aren't using Django, it is an idea to try.)
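For reference, here is a minimal sketch of that approach in a Django view on Python 2.7 (the view name and column values are illustrative; on Python 3 you would write u'\ufeff' instead of the byte string below):
# -*- coding: utf-8 -*-
import csv
from django.http import HttpResponse

def export_csv(request):
    response = HttpResponse(content_type='text/csv; charset=utf-8')
    response['Content-Disposition'] = 'attachment; filename="export.csv"'
    response.write('\xEF\xBB\xBF')            # UTF-8 BOM so Excel opens the file as UTF-8
    writer = csv.writer(response)
    writer.writerow(['name', 'city'])
    writer.writerow(['Éléonore', 'Genève'])   # UTF-8 encoded byte strings on Python 2.7
    return response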

Pickled scipy sparse matrix as input data?

I am working on a multiclass classification problem that consists of classifying resumes.
I used sklearn and its TfidfVectorizer to get a big scipy sparse matrix that I feed into a TensorFlow model after pickling it. On my local machine, I load it, convert a small batch to dense numpy arrays, and fill a feed dictionary. Everything works great.
Now I would like to do the same thing on Cloud ML. My pickle is stored at gs://my-bucket/path/to/pickle, but when I run my trainer, the pickle file can't be found at this URI (IOError: [Errno 2] No such file or directory). I am using pickle.load(open('gs://my-bucket/path/to/pickle', 'rb')) to extract my data. I suspect that this is not the right way to open a file on GCS, but I'm totally new to Google Cloud and I can't find the proper way to do so.
Also, I read that one must use TFRecords or a CSV format for input data, but I don't understand why my method would not work. CSV is excluded since the dense representation of the matrix would be too big to fit in memory. Can TFRecords efficiently encode sparse data like that? And is it possible to read data from a pickle file?
You are correct that Python's "open" won't work with GCS out of the box. Given that you're using TensorFlow, you can use the file_io library instead, which will work both with local files as well as files on GCS.
from tensorflow.python.lib.io import file_io
pickle.loads(file_io.read_file_to_string('gs://my-bucket/path/to/pickle'))
NB: pickle.load(file_io.FileIO('gs://..', 'r')) does not appear to work.
You are welcome to use whatever data format works for you and are not limited to CSV or TFRecord (do you mind pointing to the place in the documentation that makes that claim?). If the data fits in memory, then your approach is sensible.
If the data doesn't fit in memory, you will likely want to use TensorFlow's reader framework, the most convenient of which tend to be CSV or TFRecords. TFRecord is simply a container of byte strings. Most commonly, it contains serialized tf.Example data which does support sparse data (it is essentially a map). See tf.parse_example for more information on parsing tf.Example data.
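As an illustration, here is a minimal sketch of storing sparse rows as serialized tf.Example records, assuming TensorFlow 1.x and a SciPy CSR matrix (the tiny matrix, labels, and file name are made up):
import numpy as np
import scipy.sparse as sp
import tensorflow as tf

X = sp.csr_matrix(np.array([[0.0, 1.5, 0.0], [2.0, 0.0, 0.3]]))  # stand-in for the TF-IDF matrix
y = np.array([0, 1])

def row_to_example(row, label):
    # Only the non-zero indices and values are stored, so the row is never densified.
    return tf.train.Example(features=tf.train.Features(feature={
        'indices': tf.train.Feature(int64_list=tf.train.Int64List(value=row.indices.tolist())),
        'values': tf.train.Feature(float_list=tf.train.FloatList(value=row.data.tolist())),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])),
    }))

with tf.python_io.TFRecordWriter('resumes.tfrecord') as writer:  # also accepts gs:// paths
    for i in range(X.shape[0]):
        writer.write(row_to_example(X.getrow(i), y[i]).SerializeToString())

# When reading, VarLenFeature yields SparseTensors, so batches stay sparse.
feature_spec = {
    'indices': tf.VarLenFeature(tf.int64),
    'values': tf.VarLenFeature(tf.float32),
    'label': tf.FixedLenFeature([1], tf.int64),
}
serialized = tf.placeholder(tf.string, shape=[None])  # a batch of serialized Examples
parsed = tf.parse_example(serialized, feature_spec)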

How to save data in Sanskrit language to Excel in Python

I want to save the data from the following page into an Excel sheet using Python. Can anyone please tell me which encoding I should use so that the data is saved in the correct format?
http://dsalsrv02.uchicago.edu/cgi-bin/philologic/getobject.pl?c.0:5.apte

Issues when loading data with weka

I am trying to load some CSV data in Weka: gene expression features for 12 patients. There are around 22,000 features. However, when I load the CSV file, it says
not recognized as an 'CSV data files' file
for my CSV file.
I am wondering whether this is because of the number of features or something else. I have checked the CSV file and it is properly comma separated. Any suggestions?
I would not encourage you to use CSV files in Weka. While it is entirely possible (http://weka.wikispaces.com/Can+I+use+CSV+files%3F), it leads to some severe drawbacks. Try to generate an ARFF file from your CSV instead.
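If you want to script the conversion, here is a minimal sketch that calls Weka's command-line CSVLoader from Python (it assumes weka.jar is available locally; the file names are placeholders):
import subprocess

subprocess.check_call(
    'java -cp weka.jar weka.core.converters.CSVLoader patients.csv > patients.arff',
    shell=True)   # CSVLoader prints the ARFF version of the CSV to stdout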