I'm trying to read the following rows out of a CSV file stored in GCS
headers: "A","B","C","D"
row1:"4000,0000000000000","15400000,000","12311918,400000","3088081,600"
row2:"5000,0000000000000","19250000,000","15389898,000000","3860102,000"
The issue here is how BigQuery is actually interpreting and thus outputting these numbers:
(screenshot: results of query 1 omitted)
It's interpreting A as FLOAT64, and B, C and D as INT64, which is okay since I decided to use schema autodetection. But when I try to convert them to a different type, it still outputs the numbers improperly.
This is the query:
SELECT
CAST(quantity AS INT64) AS A,
CAST(expenses_2 AS FLOAT64) AS B,
CAST(expenses_3 AS FLOAT64) AS C,
CAST(expenses_4 AS FLOAT64) AS D
FROM
`wide-gecko-289100.bqtest.expenses`
These are the results of the query above:
(screenshot: results of query 2 omitted)
Either way, it's misreading the numbers; the output should be as follows:
row1: [4000] [15400000] [12311918,4] [3088081,6]
row2: [5000] [19250000] [15389898] [3860102]
Is there a way to solve this?
This is due to BigQuery not understanding the localized format you're using for the numeric values. It expects the period (.) character for the decimal separator.
If you can't address this early in the process that produces the CSV files, another strategy is to load the columns as STRING and then do some manipulation in BigQuery.
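For instance, here's a sketch of loading everything as STRING with BigQuery's LOAD DATA statement, if your project has it available (the bucket path and table name are placeholders; a bq load command with an explicit all-STRING schema achieves the same thing):
LOAD DATA INTO bqtest.expenses_raw (A STRING, B STRING, C STRING, D STRING)
FROM FILES (
format = 'CSV',
skip_leading_rows = 1,
uris = ['gs://your-bucket/expenses.csv']
);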
Here's a simple conversion example that shows some string manipulation and casting to get to the desired type. If the localized format uses both commas and periods (for example as thousands and decimal separators), you'll need more complex string manipulation.
WITH
sample_row AS (
SELECT "4000,0000000000000" as A, "15400000,000" as B,"12311918,400000" as C,"3088081,600" as D
)
SELECT
A,
CAST(REPLACE(A,",",".") AS FLOAT64) as A_as_float64,
CAST(CAST(REPLACE(A,",",".") AS FLOAT64) AS INT64) as A_as_int64
FROM
sample_row
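Applied to the table from the question (assuming the columns were loaded as STRING), the same idea might look like:
SELECT
CAST(CAST(REPLACE(quantity, ",", ".") AS FLOAT64) AS INT64) AS A,
CAST(REPLACE(expenses_2, ",", ".") AS FLOAT64) AS B,
CAST(REPLACE(expenses_3, ",", ".") AS FLOAT64) AS C,
CAST(REPLACE(expenses_4, ",", ".") AS FLOAT64) AS D
FROM
`wide-gecko-289100.bqtest.expenses`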
You could also generalize this as a user-defined function (temporary or persisted) to make it easier to reuse:
CREATE TEMPORARY FUNCTION parseAsFloat(instr STRING) AS (CAST(REPLACE(instr,",",".") AS FLOAT64));
WITH
sample_row AS (
SELECT "4000,0000000000000" as A, "15400000,000" as B,"12311918,400000" as C,"3088081,600" as D
)
SELECT
CAST(parseAsFloat(A) AS INT64) as A,
parseAsFloat(B) as B,
parseAsFloat(C) as C,
parseAsFloat(D) as D
FROM
sample_row
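And a sketch of the persisted variant; the bqtest dataset name is taken from the question, so adjust it to wherever you want the function to live:
CREATE OR REPLACE FUNCTION bqtest.parseAsFloat(instr STRING) AS (
CAST(REPLACE(instr, ",", ".") AS FLOAT64)
);
-- usage: SELECT bqtest.parseAsFloat("12311918,400000")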
I think this is an issue with how BigQuery interprets a comma. It seems to detect it as a thousands separator rather than a decimal separator.
https://issuetracker.google.com/issues/129992574
Is it possible to replace the commas with a "." instead?
I have a varchar column whose values I would like to update by concatenating a prefix to a padded integer. Here is what I have tried so far:
Item.objects.update(
field=Concat(
Value('prefix'), Value(f"{F('id'):>011d}")
)
)
Which gives me TypeError: unsupported format string passed to F.__format__
I need help on how I can achieve this if possible.
Considering that my use case for the f-string was padding, the LPad and Cast database functions came in handy (I definitely need to study SQL). Here is the update query:
from django.db.models import CharField, Value
from django.db.models.functions import Cast, Concat, LPad

Item.objects.update(
    field=Concat(
        # Cast renders id as text; LPad left-pads it to 11 characters with '0'
        Value('prefix'),
        LPad(Cast('id', output_field=CharField()), 11, Value('0')),
    )
)
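If you'd like to preview the concatenated value before writing to the table, here's a small sketch using annotate instead of update (the id of 42 is just an assumed example):
from django.db.models import CharField, Value
from django.db.models.functions import Cast, Concat, LPad

# annotate a preview column instead of updating the rows
preview = (
    Item.objects
    .annotate(padded=Concat(Value('prefix'),
                            LPad(Cast('id', output_field=CharField()), 11, Value('0'))))
    .values_list('padded', flat=True)
    .first()
)
print(preview)  # e.g. 'prefix00000000042' for id=42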
I have a table with a text field containing formatted strings that represent money.
For example, it will have values like these, but also some "bad" invalid data:
$5.55
$100050.44
over 10,000
$550
my money
570.00
I want to convert this to a numeric field, keeping the numbers that can be retained and converting any that can't to null.
I was originally using this function, which did convert clean numbers (numbers without any formatting). The issue was that it would not convert values like $5.55, setting them to null instead.
CREATE OR REPLACE FUNCTION public.cast_text_to_numeric(
v_input text)
RETURNS numeric
LANGUAGE 'plpgsql'
COST 100
VOLATILE
AS $BODY$
declare v_output numeric default null;
begin
begin
v_output := v_input::numeric;
exception when others then return null;
end;
return v_output;
end;
$BODY$;
I then created a simple update statement which strips out everything except word characters and the period:
update public.numbertesting set field_1=regexp_replace(field_1,'[^\w.]','','g')
and if I run this statement, it correctly converts the text data to numeric and maintains the number:
alter table public.numbertesting
alter column field_1 type numeric
using field_1::numeric
But I need to use the function in order to properly discard any bad data and set those values to null.
Even after I run the cleanup so that the text value is, say, 5.55, my cast_text_to_numeric function STILL sets it to null. I don't understand why the function sets it to null while the statement above correctly converts it to a proper number.
How can I fix my cast_text_to_numeric function to properly convert values such as 5.55?
I'm ok with discarding (setting to NULL) any values that don't end up as numbers and a period. The regular expression will strip out all other characters... and if there happen to be two numbers in the text field, with the script they would be combined into one (spaces are removed), and I'm good with that.
For the example data above, after conversion the end result in the numeric field would be:
5.55
100050.44
null
550
null
570.00
FYI, I am on Postgres 11 right now
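For what it's worth, here's a minimal sketch (the function name is mine) that folds the same regexp cleanup into the function itself, so the cast always sees the cleaned value regardless of what's stored:
CREATE OR REPLACE FUNCTION public.clean_cast_text_to_numeric(
v_input text)
RETURNS numeric
LANGUAGE plpgsql
AS $BODY$
declare v_cleaned text;
begin
-- same cleanup as the update statement: drop everything except
-- word characters and periods, then attempt the cast
v_cleaned := regexp_replace(v_input, '[^\w.]', '', 'g');
begin
return v_cleaned::numeric;
exception when others then
-- anything that still isn't a valid number (e.g. 'mymoney') becomes null
return null;
end;
end;
$BODY$;
-- e.g. SELECT public.clean_cast_text_to_numeric('$5.55');  -- 5.55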
Apologies as I'm a complete novice when it comes to Weka.
I have 100 instances and each instance has 400 attributes, most of which have a single value. However, some attributes have multiple values as they contain a time component. I was wondering if Weka can analyse multiple values for one attribute and, if so, how do I separate these values so that Weka can read them (e.g. commas, semicolons)?
Many Thanks for your help
R
Weka natively works with a format called ARFF, an acronym for Attribute-Relation File Format. This format consists of a clearly differentiated structure in three parts:
1. Header. Here the name of the relation is defined. Its format is as follows:
@relation <name-of-the-relation>
where <name-of-the-relation> is of type String. If the name contains spaces, it must be put between quotation marks.
2. Attribute declarations. This section declares the attributes that make up the file, together with their types. The syntax is:
@attribute <attribute-name> <type>
where <attribute-name> is of type String, with the same restrictions as above.
Weka accepts various types; these are:
a) NUMERIC. Real numbers.
b) INTEGER. Integer numbers.
c) DATE. Dates; this type should be followed by the date format as a quoted label. The format label is composed of separator characters (hyphens and/or spaces) and time units:
dd Day.
MM Month.
yyyy Year.
HH Hours.
mm Minutes.
ss Seconds.
d) STRING. Character strings, with the restrictions of the String type commented previously.
e) NOMINAL (enumerated). The possible values (numbers or character strings) that the attribute can take are expressed in braces, separated by commas. For example, if we have an attribute that indicates the weather, it could be defined as:
@attribute weather {sunny, rainy, cloudy}
3. Data section. Here the data making up the relation are declared, with commas separating the attribute values and line breaks separating the instances:
@data
4,3.2
Although this is the "full" mode, it is also possible to define the data in a short form (sparse data). If we have a sample in which there are many zero values, we can omit the items that are zero, surrounding each row in braces and placing in front of each value its attribute index (zero-based, in ascending order). An example of this is as follows:
@data
{3 3, 14 1}
If any value is unknown, it is expressed with a question mark ("?"). And if you want to add comments, use the % character.
So, you can use several values to construct your dataset.
Example:
% Test Weka.
@relation MyTest

@attribute nombre STRING
@attribute ojo_izquierdo {Bien,Mal}
@attribute dimension NUMERIC
@attribute fecha_analisis DATE "dd-MM-yyyy HH:mm"

@data
Antonio,Bien,38.43,"12-04-2003 12:23"
'Maria Jose',?,34.53,"14-05-2003 13:45"
Juan,Bien,43,"01-01-2004 08:04"
Maria,?,?,"03-04-2003 11:03"
I currently have a string of values which I retrieved after filtering through data from a CSV file. Ultimately I had to do some filtering of the data, but I have the same numbers as a list, dataframe, or array. I just need to take the numbers in the string, convert each one to hex, and then convert the first 8 digits of the hex to decimal for each element. Lastly, I also need to convert the last 8 digits of the same hex to decimal for each value.
I cannot provide a snippet because it is sensitive data, but here is an example.
I basically have something like this
>>> list_A
[52894036, 78893201, 45790373]
If I convert it to a dataframe and call df.dtypes, it says dtype: object, and I can convert the values of column A to bool, int, or string, but the dtype is always object.
It does not matter whether it is a function or just a simple loop. I have been trying many methods and am unable to attain the results I need. Ultimately the data is taken from different CSV files and will never have the same values or list size.
Pandas is designed to work primarily with integers and floats, with no particular facilities for hexadecimal that I know of, but you can use apply to access standard Python conversion functions like hex and int:
import pandas as pd

df = pd.DataFrame({'a': [52894036999, 78893201999, 45790373999]})
df['b'] = df['a'].apply(hex)          # int -> hex string, e.g. '0xc50baf407'
df['c'] = df['b'].apply(int, base=0)  # base=0 infers base 16 from the '0x' prefix
Results:
             a             b            c
0  52894036999   0xc50baf407  52894036999
1  78893201999  0x125e66ba4f  78893201999
2  45790373999   0xaa951a86f  45790373999
Note that this answer is for Python 3. For Python 2 you may need to strip off the trailing "L" in column "b" with str[:-1].
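To get the "first 8 / last 8 hex digits" split described in the question, here's a sketch building on the same apply approach (the zero-padding width of 16 hex digits is an assumption about your data; adjust it to the real width):
import pandas as pd

df = pd.DataFrame({'a': [52894036999, 78893201999, 45790373999]})

# format as hex without the '0x' prefix, zero-padded to a fixed width of 16
# so that "first 8" and "last 8" digits are well defined for every value
hex_str = df['a'].apply(lambda v: format(v, '016x'))

df['first8_dec'] = hex_str.str[:8].apply(int, base=16)  # first 8 hex digits -> decimal
df['last8_dec'] = hex_str.str[8:].apply(int, base=16)   # last 8 hex digits -> decimal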
I have text provided as a varchar2 variable, for example:
EXER/ATP-45/-//
MSGID/BIOCHEM3/-/-/-/-/-//
GEODATUM/UTM//
PAPAA/1KM/-//15KM/-//
So, every line separator is // (but there can also be spaces, new lines etc., and they should be ignored). '-' indicates a blank field and should be ignored. I have also defined a new record type, defined as follows:
TYPE t_promien IS RECORD(
EXER VARCHAR2(1000),
MSGID VARCHAR2(1000),
PAPAA t_papaa
......
)
I need to extract data from the corresponding rows into a new variable of type t_promien and set its fields; for example, EXER should have the value 'ATP-45', MSGID should have 'BIOCHEM3', and PAPAA should have the value ('1KM','15KM') (t_papaa is my custom type too, and it contains 2 varchar fields).
What is the best way to do this inside an Oracle PL/SQL procedure? I need to extract the data into an OUT parameter. Can I use a regex for this (and how)? Unfortunately, I'm a total newbie with Oracle, so...
Can you give me some tips? Thanks.
You can do this with REGEXP_SUBSTR using something like this:
SELECT REGEXP_SUBSTR('EXER/ATP-45/-//
MSGID/BIOCHEM3/-/-/-/-/-//
GEODATUM/UTM//
PAPAA/1KM/-//15KM/-//', 'EXER/[^/]+/', 1, 1) AS EXER
FROM DUAL;
The important bit above is 'EXER/[^/]+/', which looks for a string that starts with the literal EXER/, followed by a sequence of characters which are not /, and ended by a final /.
The above query will return EXER/ATP-45/, but you can use standard string functions like SUBSTR, LTRIM or RTRIM to remove the bits you don't need.
A simple demonstration of the use of REGEXP_SUBSTR in PL/SQL.
CREATE OR REPLACE PROCEDURE TEST_REGEXP_PROC(VAR_PI_MSG IN VARCHAR2,
                                             T_PO_PAPAA OUT T_PROMIEN) AS
  VAR_L_EXER  VARCHAR2(1000);
  VAR_L_MSGID VARCHAR2(1000);
BEGIN
  -- grab 'EXER/ATP-45/', drop the slashes, then skip the 4-char 'EXER' prefix
  SELECT SUBSTR(REPLACE(REGEXP_SUBSTR(VAR_PI_MSG, 'EXER/[^/]+/', 1, 1), '/'), 5)
    INTO VAR_L_EXER
    FROM DUAL;
  T_PO_PAPAA.EXER := VAR_L_EXER;
  -- same idea for MSGID: skip the 5-char 'MSGID' prefix after removing slashes
  SELECT SUBSTR(REPLACE(REGEXP_SUBSTR(VAR_PI_MSG, 'MSGID/[^/]+/', 1, 1), '/'), 6)
    INTO VAR_L_MSGID
    FROM DUAL;
  T_PO_PAPAA.MSGID := VAR_L_MSGID;
END;
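And a hedged sketch of calling it (this assumes t_promien is declared somewhere visible to the procedure, e.g. in a package specification, since a plain PL/SQL record type can't be created at schema level):
DECLARE
  V_RESULT T_PROMIEN;
BEGIN
  TEST_REGEXP_PROC('EXER/ATP-45/-// MSGID/BIOCHEM3/-/-/-/-/-//', V_RESULT);
  DBMS_OUTPUT.PUT_LINE('EXER:  ' || V_RESULT.EXER);   -- ATP-45
  DBMS_OUTPUT.PUT_LINE('MSGID: ' || V_RESULT.MSGID);  -- BIOCHEM3
END;
/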
Hope this will get you started.