Dataprep - accents and special characters - google-cloud-platform

How do I solve this problem with accents / special characters in Dataprep? I need this information to appear correctly.
Thank you very much for your attention.

Dataprep has built-in recipes which allow you to remove or change special characters. For example, you can change accented letters to unaccented ones with Remove accents in text, or you can replace unrecognised characters with another character using Replace text or patterns.
Below are the steps to change a special character or accented letter.
Create your flow.
Add/import your data
Click Add a recipe, as per the documentation. In your case you can do one or both of the following:
First, if you have accented words, go to Search Transformations > select Remove accents in text, then select the column that contains the accented words. This replaces accented letters with their unaccented equivalents, and a preview of your data is shown so you can check the transformation.
Second, if you have an unrecognised character, go to Search Transformations > Replace text or patterns > select the column whose data you want to transform > in Find, write the letter/symbol between single quotes > in Replace with, write the letter that should take its place. Finally, preview your data to see the transformation.
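Outside Dataprep, the accent-stripping step above can be sketched in plain Python with the standard library; the function name here is illustrative, not part of Dataprep:

```python
import unicodedata

def remove_accents(text: str) -> str:
    # Decompose accented letters into base letter + combining mark (NFD),
    # drop the combining marks, then recompose (NFC).
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

print(remove_accents("não"))  # → "nao"
```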
UPDATE: I was able to load a .csv file with the mentioned characters into Dataprep. Below are my steps and sample data:
The .csv file I used had the following content:
Test
Non rec. char É
Non rec. char ç
Accented word não
In the Dataprep UI home page, click Import Data (top right corner), then Google Cloud Storage (left side of the screen). Then find and select your file (test by importing a single file instead of parameterizing) and click the add (+) symbol. At this step you can already see the characters; in my case they displayed normally. Finally, click Import & Wrangle and inspect your data. Using the data above, I was able to see the characters properly without any issues.

Related

Google Apps Script - ReplaceText vertical tab

Whenever I paste text into a Google Docs document, all the newline characters get converted into vertical tab characters (\013 OR \v). This happens regardless of the source of the clipboard text (webpage, Word, Notepad++).
Usually this means I have to work my way through the document, clearing all the vertical tabs and replacing them with proper newlines by backspacing the character and hitting return. However, I want to write a script to replace all the characters in the doc at once. The Replace UI feature doesn't support newline characters, but I'm hoping the scripting API does.
I have written the code below, but though it runs, the vertical tab characters are not replaced. I can still see hundreds in the document with the find/replace ui feature. What am I doing wrong?
function myFunction() {
  var body = DocumentApp.getActiveDocument().getBody();
  body.replaceText("\\v", "\n");
}

How to find and replace box character in text file?

I have a large text file that I'm going to be working with programmatically, but I've run into problems with a special character strewn throughout the file. The file is far too large to scan by eye looking for specific characters. I've been able to get rid of most of the other unwanted special characters using regex patterns, but there is a box character, similar to "□". When I try to copy the character from the actual text file and paste it here, I get "�", so the box example is taken from the Windows character map, which lists the code 'U+25A1'; I'm not sure how to interpret that, or whether it's something I could use in a regex search.
Would anyone know how I could search for the box symbol similar to "□" in a UTF-8 encoded file?
EDIT:
Here is an example from the text file:
"� Prune palms when flower spathes show, or delay pruning until after the palm has finished flowering, to prevent infestation of palm flower caterpillars. Leave the top five rows."
The only problem is that, as mentioned in the original post, the square gets converted into a diamond question mark.
It's unclear where and how you are searching, although you could use the hex equivalent:
\x{25A1}
Example:
https://regex101.com/r/b84oBs/1
The black diamond with a question mark is not a character, per se. It is what a browser spits out at you when you give it unrecognizable bytes.
Find out where that data is coming from.
Determine its encoding. (Usually UTF-8, but might be something else.)
Be sure the browser is configured to display that encoding. Adding <meta charset=UTF-8> in the header of the page is likely to suffice.
I found a workaround using Notepad++ and this website. It's still not clear which encoding the square originally comes from, but when I paste it into the query field on the website above, or into the Notepad++ Conversion Table (Plugins > Converter > Conversion Table), it gives the hex character code for the "Replacement Character", which is the diamond with the question mark.
Using this code in a regex expression, \x{FFFD}, within the Notepad++ search gave me all the squares, recognizing them as the Replacement Character.
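The same search can be sketched in Python (assuming, as found above, that the file actually contains U+FFFD replacement characters; U+25A1 is included in the class in case some files really do contain the white square):

```python
import re

text = "\ufffd Prune palms when flower spathes show."

# \uFFFD is REPLACEMENT CHARACTER; \u25A1 is WHITE SQUARE.
# Match either, since which one appears depends on how the file was decoded.
box_chars = re.compile(r"[\ufffd\u25a1]")

cleaned = box_chars.sub("", text).lstrip()
print(cleaned)  # → "Prune palms when flower spathes show."
```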

Notepad++ CSV splitting and removing fields

I have a collection of data in a CSV file and I want to manipulate it so that I only use some of the values in it. For example, I have:
1,2,3,4,5,6
asd,sdf,dfg,fgh,ghj,hjk
asd,sdf,dfg,fgh,ghj,hjk
asd,sdf,dfg,fgh,ghj,hjk
asd,sdf,dfg,fgh,ghj,hjk
asd,sdf,dfg,fgh,ghj,hjk
what I want to do is use only a few fields, possibly in a different order:
1,4,3
I know Notepad++ can break up values and rearrange them using \1, \2, etc., but I don't know how I would do it for this.
Notepad++ isn't really a spreadsheet, which would be your easiest approach for this kind of edit after importing a .CSV. However, it does have some limited column-editing features which make this workable. Two steps are required, outlined below.
1) Line up your text:
a) Highlight all the text.
b) Select the TextFX menu,
c) then the TextFX Edit sub-menu,
d) then the "Line up multiple lines by (,)"
option.
You will now have all your columns aligned on the commas.
2) Delete an individual column:
a) Ensure no text is highlighted.
b) Hold down the [alt] key while selecting the column you wish to delete with the mouse.
c) Press the delete key to delete what you've highlighted.
The above assumes you have the TextFX plugin installed. If I remember right, this comes as standard but if not, you can easily find it and add it from the Plugin Manager.
Here is an example of selecting an aligned column using the alt key and the mouse, taken from the official Notepad++ site:
http://notepad-plus-plus.org/features/column-mode-editing.html
This regular expression will match 6 groups of 0+ non-comma characters for each line:
^([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)$
You can then replace with your captured groups like so:
\1,\4,\3
RegEx101
Side note:
Someone correct me if I'm wrong. I've been trying to reduce this down to just one repeated capturing group for legibility, but I can't seem to make it work: when a group is repeated, it only retains its last capture, so the earlier fields are lost:
^([^,]*,?){6}$
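For reference, the six-group expression and the \1,\4,\3 replacement above can be sketched outside the editor in Python:

```python
import re

line = "asd,sdf,dfg,fgh,ghj,hjk"

# Capture all six comma-separated fields, then keep fields 1, 4 and 3.
pattern = r"^([^,]*),([^,]*),([^,]*),([^,]*),([^,]*),([^,]*)$"
reordered = re.sub(pattern, r"\1,\4,\3", line)
print(reordered)  # → "asd,fgh,dfg"
```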

How to replace characters generated by encoding errors with regex when they are embedded in text

I need to replace the following characters with regex (gsub):
ÃÆè -> è
ÃÆÃÂ -> à
ÃÆò -> ò
ÃÆì -> ì
ÃÆÃù -> ù
My strategy is to first remove the first three characters ÃÆÃ that are common to all, and then move to the last ones, leaving à for the end since it is basically the lowest common denominator.
Now gsub correctly removes the first three, but then it seems it doesn't see the final ones - like ¨ - although I noticed it does see ñ (for ñ).
By copy/pasting the characters into the text editor I noticed they cause weird behaviours (such as moving the cursor forward by few positions).
My dataset was downloaded from a website that itself has encoding problems on its oldest pages, but not on the most recent ones (I think they corrected the encoding problem at some point in the last few years). Visiting the oldest pages you can still see the very same garbled characters in plain sight. So the problem is not (I assume) in the encoding of my file.
That is, the encoding errors are limited to regions of the dataset and are not the result of an encoding issue with the whole text corpus.
The problem when the characters are not displayed correctly is understanding exactly how they are parsed by the regex. In my case, as explained, the encoding errors were limited to a few strings in my dataset, so Encoding() was not applicable.
I solved the problem by visualising the problematic characters directly in the R console. In the console they appear as Ã\u0083Æ\u0092Ã\u0082¨, while in RStudio they are displayed as Ã Æ Ã Â¨. What the console showed was what I needed for a correct regex match: gsub("Ã\u0083Æ\u0092Ã\u0082¨"...
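An alternative to matching each broken sequence is to undo the mojibake itself. This sketch is in Python rather than R (the idea is language-independent; R's iconv can do the equivalent), and it assumes the damage is UTF-8 text that was repeatedly mis-decoded as Windows-1252:

```python
def demojibake(s: str, max_rounds: int = 5) -> str:
    """Peel off one layer of UTF-8-read-as-Windows-1252 damage per round."""
    for _ in range(max_rounds):
        try:
            fixed = s.encode("cp1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # no more layers to peel off
        if fixed == s:
            break
        s = fixed
    return s

print(demojibake("Ã¨"))  # one layer of mojibake → "è"
```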

Use cases for regular expression find/replace

I recently discussed editors with a co-worker. He uses one of the less popular editors and I use another (I won't say which ones since it's not relevant and I want to avoid an editor flame war). I was saying that I didn't like his editor as much because it doesn't let you do find/replace with regular expressions.
He said he's never wanted to do that, which was surprising since it's something I find myself doing all the time. However, off the top of my head I wasn't able to come up with more than one or two examples. Can anyone here offer some examples of times when they've found regex find/replace useful in their editor? Here's what I've been able to come up with since then as examples of things that I've actually had to do:
Strip the beginning of a line off of every line in a file that looks like:
Line 25634 :
Line 632157 :
Taking a few dozen files with a standard header that is slightly different for each file and stripping the first 19 lines from all of them at once.
Piping the result of a MySQL select statement into a text file, then removing all of the formatting junk and reformatting it as a Python dictionary for use in a simple script.
In a CSV file with no escaped commas, replace the first character of the 8th column of each row with a capital A.
Given a bunch of GDB stack traces with lines like
#3 0x080a6d61 in _mvl_set_req_done (req=0x82624a4, result=27158) at ../../mvl/src/mvl_serv.c:850
strip out everything from each line except the function names.
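For instance, the first and last items above can be sketched in Python; the patterns are illustrative, and the same expressions would work in an editor's find/replace dialog:

```python
import re

# Strip a "Line NNN :" prefix from each line.
log_line = "Line 25634 : some actual content"
stripped = re.sub(r"^Line \d+ :\s*", "", log_line)
print(stripped)  # → "some actual content"

# Keep only the function name from a GDB stack-trace line.
frame = ("#3  0x080a6d61 in _mvl_set_req_done (req=0x82624a4, "
         "result=27158) at ../../mvl/src/mvl_serv.c:850")
func = re.sub(r"^#\d+\s+0x[0-9a-f]+ in (\w+).*$", r"\1", frame)
print(func)  # → "_mvl_set_req_done"
```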
Does anyone else have any real-life examples? The next time this comes up, I'd like to be more prepared to list good examples of why this feature is useful.
Just last week, I used regex find/replace to convert a CSV file to an XML file.
Simple enough to do really, just chop up each field (luckily it didn't have any escaped commas) and push it back out with the appropriate tags in place of the commas.
Regex makes it easy to replace whole words using word boundaries.
(\b\w+\b)
So you can replace unwanted words in your file without disturbing words like Scunthorpe
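A quick sketch of why the word boundaries matter (the words here are made up for illustration):

```python
import re

text = "thorpe moved to Scunthorpe"

# \b keeps the match from firing inside a longer word.
result = re.sub(r"\bthorpe\b", "xxxxxx", text)
print(result)  # → "xxxxxx moved to Scunthorpe"
```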
Yesterday I took a create table statement I made for an Oracle table and converted the fields to setString() method calls using JDBC and PreparedStatements. The table's field names were mapped to my class properties, so regex search and replace was the perfect fit.
Create Table text:
...
field_1 VARCHAR2(100) NULL,
field_2 VARCHAR2(10) NULL,
field_3 NUMBER(8) NULL,
field_4 VARCHAR2(100) NULL,
....
My Regex Search (the capture group has to span the whole field name, digits included; a repeated single-character group like ([a-z_])+ would only keep its last character):
/([a-z0-9_]+) .*,?/
My Replacement:
pstmt.setString(1, \1);
The result:
...
pstmt.setString(1, field_1);
pstmt.setString(1, field_2);
pstmt.setString(1, field_3);
pstmt.setString(1, field_4);
....
I then went through and manually set the position int for each call and changed the method to setInt() (and others) where necessary, but that worked out handily for me. I actually used it three or four times for similar field-to-method-call conversions.
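The same conversion can be sketched in Python (the DDL fragment and pattern are illustrative):

```python
import re

ddl = """field_1 VARCHAR2(100) NULL,
field_2 VARCHAR2(10) NULL,
field_3 NUMBER(8) NULL,
field_4 VARCHAR2(100) NULL,"""

# Turn each column definition into a setString() call; the position
# argument still has to be fixed up by hand afterwards.
calls = re.sub(r"^([a-z0-9_]+) .*$", r"pstmt.setString(1, \1);", ddl, flags=re.M)
print(calls)
```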
I like to use regexps to reformat lists of items like this:
int item1
double item2
to
public void item1(int item1){
}
public void item2(double item2){
}
This can be a big time saver.
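That reformatting can be sketched as a single substitution (Python here, but any editor regex works the same way):

```python
import re

fields = "int item1\ndouble item2"

# \1 captures the type, \2 the name; each line becomes an empty method stub.
stubs = re.sub(r"^(\w+) (\w+)$", r"public void \2(\1 \2){\n}", fields, flags=re.M)
print(stubs)
```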
I use it all the time when someone sends me a list of patient visit numbers in a column (say 100-200 of them) and I need them in a '0000000444','000000004445' format. Works wonders for me!
I also use it to pull email addresses out of an email. I send out group emails often, and all the bounced returns come back in one email, so I use a regex to pull them all out and then drop them into a string variable to remove them from the database.
I even wrote a little dialog program to apply a regex to my clipboard. It grabs the contents, applies the regex, and then loads the result back into the clipboard.
One thing I use it for in web development all the time is stripping some text of its HTML tags. This might need to be done to sanitize user input for security, or to display a preview of a news article. For example, if you have an article with lots of HTML tags for formatting, you can't just do LEFT(article_text,100) + '...' (plus a "read more" link) and render that on a page, at the risk of breaking the page by splitting an HTML tag apart.
Also, I've had to strip img tags in database records that link to images that no longer exist. And let's not forget web form validation. If you want to make sure a user has entered a syntactically correct email address into a web form, this is about the only way of checking it thoroughly.
I've just pasted a long character sequence into a string literal, and now I want to break it up into a concatenation of shorter string literals so it doesn't wrap. I also want it to be readable, so I want to break only after spaces. I select the whole string (minus the quotation marks) and do an in-selection-only replace-all with this regex:
/.{20,60} /
...and this replacement:
/$0"¶ + "/
...where the pilcrow is an actual newline, and the number of spaces varies from one instance to the next. Result:
String s = "I recently discussed editors with a co-worker. He uses one "
+ "of the less popular editors and I use another (I won't say "
+ "which ones since it's not relevant and I want to avoid an "
+ "editor flame war). I was saying that I didn't like his "
+ "editor as much because it doesn't let you do find/replace "
+ "with regular expressions.";
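The same wrapping step sketched in Python, with a shorter sample string ($0 in the editor becomes the whole-match group in Python's re):

```python
import re

s = ("I recently discussed editors with a co-worker. He uses one of "
     "the less popular editors and I use another.")

# Break after a space every 20-60 characters, closing and reopening
# the string literal at each break.
body = re.sub(r".{20,60} ", lambda m: m.group(0) + '"\n + "', s)
wrapped = 'String s = "' + body + '";'
print(wrapped)
```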
The first thing I do with any editor is try to figure out its regex oddities. I use it all the time. Nothing really crazy, but it's handy when you've got to copy/paste stuff between different types of text - SQL <-> PHP is the one I do most often - and you don't want to fart around making the same change 500 times.
Regex is very handy any time I am trying to replace a value that spans multiple lines. Or when I want to replace a value with something that contains a line break.
I also like that you can match things in a regular expression and not replace the full match, using the $1, $2, ... syntax to output just the portion of the match you want to keep.
I agree with you on points 3, 4, and 5 but not necessarily points 1 and 2.
In some cases 1 and 2 are easier to achieve using an anonymous keyboard macro.
By this I mean doing the following:
Position the cursor on the first line
Start a keyboard macro recording
Modify the first line
Position the cursor on the next line
Stop recording.
Now all that is needed to modify the next line is to repeat the macro.
I could live without support for regex, but I could not live without anonymous keyboard macros.