PySpark dataframe remove white-spaces from a column of the string - regex

In the screenshot, the column Values contains string values with spaces in between, so I am unable to convert this column to an integer type.
If you can help me remove the white space from these string values, I can then cast them easily.
I have tried df_cause_death_france.select(regexp_replace(col("Value")," ",""))
It does work, but it removes all the other columns from my Spark dataframe.

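Please ignore this question; I was able to solve it.
In case you want to know my solution, here it is. withColumn keeps the other columns and replaces only the named one, whereas select returns just the expressions passed to it.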
df_cause_death_france.withColumn('VALUE', regexp_replace('Value', ' ','')).show()
output =
https://i.stack.imgur.com/1bljf.png

Related

BigQuery remove <0x00> hidden characters from a column

I have a table, my_table, with unwanted hidden characters:
id | fruits
1 | STuff1 stuff_2 ����������������������
2 | Blahblah-blahblah �������������
3 | nothing
How do I remove ���������������������� when selecting this column?
Current query:
SELECT fruits, TRIM(REGEXP_REPLACE(fruits, r'[^a-zA-Z,0-9,-]', ' ')) AS new_fruits
FROM `project-id.MYDATASET.my_table`
This query is too broad: I'm worried that I might accidentally exclude or replace important data. I only want to target these specific weird characters.
When I open the data as CSV, the weird characters show up as <0x00>. How do I solve this?
First you have to identify the character: it is non-printable, so the symbol shown is just a placeholder representation. To replace it without removing any other important data, do the following:
Identify the hexadecimal code of the character. Copy it from the CSV and paste it on this site:
Use the REPLACE function in BigQuery to replace the character with that hex code, as follows:
SELECT trim(replace(string_field_1,chr(0xfffd)," ")) FROM `<project>.<dataset>.<table>`;
If your character's code is different from fffd, put your value in the chr() function.
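If no lookup site is at hand, the identification step can also be sketched in a few lines of Python. This is only an illustration: the helper name `suspicious_code_points` and the sample string are assumptions, not part of the original table.

```python
import unicodedata

def suspicious_code_points(text):
    """Return the distinct code points (as hex strings) of characters that are
    in a Unicode 'C' category (control, format, unassigned, ...) or that are
    the U+FFFD replacement character."""
    found = []
    for ch in text:
        category = unicodedata.category(ch)  # e.g. 'Cc' for control characters
        if category.startswith("C") or ch == "\ufffd":
            code = hex(ord(ch))
            if code not in found:
                found.append(code)
    return found

# Hypothetical sample resembling a row from the question
sample = "STuff1 stuff_2 \ufffd\ufffd\x00"
print(suspicious_code_points(sample))
```

Whatever hex value is reported for your data is the value to pass to chr() in the query above.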

Python - Dataframe column conversion

I'm new to Python and I'm trying to convert a dataframe column of strings (like 10,000+ or 1,000+) using a regex, in order to remove the + and , characters and then convert the values to integers.
How can I do that?
I've tried regex functions, but it doesn't work:
convert_installs = re.compile('(?P<amount>\d*).(?P<unit>\d*)')
Is this the correct way to find the part I want to keep?
df['columnname'] = df['columnname'].str.replace('([+,])', '', regex = True).astype('int')
This takes your column called columnname, replaces every + or , with nothing, and then changes the type to integer.
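To see what the character class does on individual values without pandas, here is a minimal sketch using only the standard `re` module. The function name `installs_to_int` and the sample values are assumptions for illustration.

```python
import re

def installs_to_int(value):
    """Strip every '+' and ',' from a string such as '10,000+' and convert it to int."""
    return int(re.sub(r"[+,]", "", value))

# Hypothetical sample values in the format described in the question
values = ["10,000+", "1,000+", "500"]
print([installs_to_int(v) for v in values])
```

The pattern `[+,]` matches a single + or , character, which is the same idea as the `([+,])` pattern in the pandas one-liner; the capturing group there is not needed.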

regular expression replace for SQL

I have to replace a string pattern in SQL with an empty string. Could anyone please suggest how?
Input String 'AC001,AD001,AE001,SA001,AE002,SD001'
Output String 'AE001,AE002'
Each code is fixed-length: the first two characters are letters and the rest are digits (as in the samples above). I have to remove all codes except those starting with "AE".
There can be zero or more "AE" codes in the string. The final output should be a string in which multiple "AE" codes are separated by commas, as shown above.
Here is one option that calls regexp_replace multiple times, eliminating the unwanted strings little by little in each pass to arrive at the required output.
SELECT regexp_replace(
regexp_replace(
regexp_replace(
'AC001,AD001,AE001,SA001,AE002,SD001', '(?<!AE)\d{3},{0,1}', 'X','g'
),'..X','','g'
),',$','','g'
)
I would convert the list to an array, unnest that into rows, keep only the rows that match, and aggregate them back into a string:
select string_agg(t, ',')
from unnest(string_to_array('AC001,AD001,AE001,SA001,AE002,SD001',',')) as x(t)
where x.t like 'AE%'; --<< only keep those
This is independent of the number of elements in the string and can easily be extended to support more complex conditions.
This is a good example of why storing comma-separated values in a single column is not a good idea to begin with.
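The split, filter, and aggregate logic of the second answer can be sketched outside SQL as well. The Python below is only an illustration of that logic; the function name `keep_codes` and its `prefix` parameter are assumptions.

```python
def keep_codes(csv_codes, prefix="AE"):
    """Split a comma-separated list, keep the codes starting with `prefix`,
    and join them back with commas."""
    codes = [code.strip() for code in csv_codes.split(",")]
    return ",".join(code for code in codes if code.startswith(prefix))

print(keep_codes("AC001,AD001,AE001,SA001,AE002,SD001"))
```

One difference from the SQL version: when nothing matches, this sketch returns an empty string, whereas string_agg over zero rows yields NULL.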

How to remove the space between the minus sign and number's in informatica

I have an issue where an amount field contains data like
(- 98765.00), i.e. a minus sign, then spaces, then the number. I need to remove the space between the minus sign and the number to get (-98765.00). How do I do it in an Expression transformation?
The field datatype is decimal (8,2).
Thanks,
Kiran
output_port: TO_DECIMAL(REPLACECHR(FALSE,input_port,' ',''))
REPLACECHR replaces the blanks with an empty string, essentially removing them. The first argument can be TRUE/FALSE to specify whether matching is case sensitive, but that does not matter in this case.
You can use the REG_REPLACE function to remove the space.
To achieve this, follow the steps below:
* Create two variable ports.
* REG_REPLACE requires a string column, so first convert the decimal column to a string using the TO_CHAR function:
First variable port(string) - TO_CHAR(column_name)
* The previous port holds the data as a string; now apply REG_REPLACE and convert the result back to decimal:
Second variable port(decimal) - to_decimal(reg_replace(first_variable_port,'\s+',''))
\s matches white space in Informatica regular expressions.
I tested this with the same number you provided, using the same data type and function, and the debugger shows the expected result with the white space removed.
Maybe the issue comes from other transformations that the data passes through. Debug and verify the data once.
Hope this helps; feel free to ask if you have any issues.
For more on Informatica, see https://etlinfromatica.wordpress.com/
If my understanding is correct, you need to replace both the spaces and the brackets. Here's the expression:
TO_DECIMAL(
REPLACECHR(0,
REPLACECHR(0, '(- 98765.00)', ' ', '') -- this part does the space replacement
, '()', '') -- this part replaces the brackets
)
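For checking the expected result outside Informatica, the same clean-up can be sketched in Python. This mirrors the nested REPLACECHR idea (remove spaces and brackets, then convert); the function name `parse_amount` is an assumption for illustration.

```python
from decimal import Decimal

def parse_amount(raw):
    """Remove spaces and round brackets from the raw text, then convert to Decimal."""
    # str.maketrans with a third argument builds a table that deletes those characters
    cleaned = raw.translate(str.maketrans("", "", " ()"))
    return Decimal(cleaned)

print(parse_amount("(- 98765.00)"))
```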

Create data frame from patterns in a string from a file

I have a very large file that has some sort of title lines at the beginning, then a lot of data in eight columns. The columns are not separated in a regular way: they are separated by spaces, but if a value exceeds the "normal" column width, the columns end up separated by more or fewer space characters.
What I do now is read the file's lines through a connection and apply gsub with a certain regular expression, something like this:
conn <- file("my_file.dat", open="rt")
y <- gsub("a_ver_large_regexp",
"\\1, \\2, \\3, \\4, \\5, \\6, \\7, \\8", #the columns I want csv'd
perl = TRUE,
readLines(conn, n=-1L))
I then end up with y, a character vector in which each element is a single string, but at least the fields are now comma-separated.
Now I want to convert the vector y to a data frame. I suppose it should be fairly easy, given that each element is a string whose fields are separated by commas. Any idea how to do this?
It's somewhat difficult to try to write a solution when we cannot see for example y or the original data. However, I think that
as.data.frame(do.call("rbind", strsplit(y, ",")))
might get you what you are after.
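The strsplit-then-rbind idea amounts to turning each comma-separated string into one row of fields. Here is a minimal sketch of that step in Python, with a hypothetical helper `rows_from_lines` and made-up sample lines, since the real data is not shown:

```python
def rows_from_lines(lines, sep=","):
    """Split each comma-separated line into a row of fields,
    trimming the space that follows each separator."""
    return [[field.strip() for field in line.split(sep)] for line in lines]

# Hypothetical stand-in for the vector y described in the question
y = ["1, 2.5, a", "10, 3.75, b"]
for row in rows_from_lines(y):
    print(row)
```

Because the replacement string in the gsub call puts a space after every comma, the fields need trimming after the split. Every field is still text at this point, so numeric columns would need converting afterwards.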