Clean accented character and white space in column in Talend

I have a workflow as follows. In the column 'summary', I want to:
remove question marks (?)
remove white space from the text
replace accented letters with their English equivalents, for example é with e.
Thanks in advance!!

Removing question marks (?)
In your tMap, use StringHandling.EREPLACE(row.yourString,"\\?",""). EREPLACE treats its second argument as a regular expression, so the ? has to be escaped.
Removing white space from the text
In your tMap, use StringHandling.TRIM(row.yourString), with no quotes around the column reference. TRIM strips leading and trailing white space.
Replacing accented letters with the English equivalent, for example é with e
In your tMap, use TalendString.removeAccents(row.yourString)
You don't have to import additional libraries, because the TalendString class is already available.
Basically, all these functions (and many more) are accessible through the Expression Builder in tMap.

See my answer on the Talend community forum.
First, load the commons-lang3-3.4.jar file and import org.apache.commons.lang3.StringUtils.
To do that, in the tLibraryLoad Basic settings select "commons-lang3-3.4.jar", then in the Advanced settings enter "import org.apache.commons.lang3.StringUtils;" in the import field.
In tJavaRow, enter the following (or something similar in tMap, depending on your use case):
output_row.line = StringUtils.stripAccents(input_row.line);
tFixedFlowInput is here to generate data for the flow ("aaaéééàààçççbbbb" for my example), and the result is:
aaaeeeaaacccbbbb
Hope this helps,
TRF
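For anyone who wants to check the expected result outside Talend, the three clean-up steps can be sketched in plain Python. This is only an illustration, not Talend code: the function name clean_summary is made up, and it assumes that "remove white space" means trimming the ends and collapsing repeated spaces.

```python
import unicodedata

def clean_summary(text: str) -> str:
    # Step 1: drop literal question marks.
    text = text.replace("?", "")
    # Step 2: decompose accented letters (é becomes e + combining accent),
    # then drop the combining marks so only the base letters remain.
    decomposed = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Step 3 (assumption): trim the ends and collapse runs of white space.
    return " ".join(text.split())

print(clean_summary("aaaéééàààçççbbbb"))  # aaaeeeaaacccbbbb
```

For the sample string used above, this gives the same aaaeeeaaacccbbbb result.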

Related

Grok custom pattern for space delimited file

I'm trying to load a file to structured table in Athena. I am using GROK pattern to load it to the table but not able to find the correct pattern. The file format is as below:
L1127 ACTUALS 214171 ON 27649075 -00000000000000000409618.02 601 MBS DAILY VISION - CAN OS
L1127 ACTUALS 412821 ON 27649075 002060 -00000000000000000002657.33 521 MBS DAILY VISION - CAN OS
GROK pattern I'm using:
(?<BusinessUnit>.{5})%{SPACE}(?<Type>.{7})%{SPACE}(?<PSGLAccountNumber>.{6})%{SPACE}(?<Province>.{2})%{SPACE}(?<DepartmentId>.{8})%{SPACE}(?<ProductId>.{6})%{SPACE}(?<Amount>.{27})%{SPACE}(?<TransCode>.{3})%{SPACE}(?<Feed>.{35})
I'm having trouble when the ProductId has no value.
Any help would be appreciated.

(?<ProductId>.{6})%{SPACE} means that you expect the ProductId field to be exactly six characters followed by any number of spaces. From the data you posted it seems to me that what should happen is that in the first row ProductId would end up as six spaces.
If the problem is that it becomes six spaces and you want it to be an empty string, you could for example use (?<ProductId>\S*)%{SPACE} (\S* matches zero or more non-space characters).
If this does not solve your problem, perhaps you could describe in some more detail what trouble you are having, and what you want to happen?
Update: in a comment you indicated that the problem with this solution is that the ProductId column becomes "-00000". The reason is that the %{SPACE} pattern before (?<ProductId>… consumes all the spaces between the DepartmentId and Amount fields. To solve this, you could limit the number of spaces that can appear between the DepartmentId and ProductId fields. In the example data you posted there are two spaces, and since the fields are fixed-width I assume this is always the case. Using a pattern like …(?<DepartmentId>.{8})\s{2}(?<ProductId>\S*)%{SPACE}(?<Amount>.{27})… should fix the problem.
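To see the effect of the \s{2} limit, here is a rough sketch using Python's re module rather than Grok itself. It makes several assumptions: named groups are spelled (?P<name>...) in Python, %{SPACE} is written out as \s*, and the helper make_line builds a hypothetical fixed-width fragment (two separator spaces, a 6-character product column, two more spaces), since the exact spacing of the real file is not visible in the post.

```python
import re

# Fragment of the suggested pattern, translated to Python syntax.
pattern = re.compile(
    r"(?P<DepartmentId>.{8})\s{2}(?P<ProductId>\S*)\s*(?P<Amount>.{27})"
)

def make_line(product_id: str, amount: float) -> str:
    # Hypothetical layout: 8-char department, two spaces, 6-char product
    # column (blank-padded), two spaces, 27-char zero-padded amount.
    return f"27649075  {product_id:<6}  {amount:027.2f}"

for product_id, amount in [("", -409618.02), ("002060", -2657.33)]:
    m = pattern.search(make_line(product_id, amount))
    # ProductId is '' for the blank column and '002060' otherwise.
    print(repr(m["ProductId"]), m["Amount"])
```

Because only two spaces are consumed before ProductId, a blank product column leaves \S* with nothing to match, and the amount stays in its own group.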

I was able to make it work using the pattern below:
%{WORD:BusinessUnit}%{SPACE}%{WORD:Type}%{SPACE}%{POSINT:PSGLAccountNumber}%{SPACE}%{WORD:Province}%{SPACE}%{POSINT:DepartmentId}%{SPACE}%{custompat:ProductId}%{SPACE}%{NUMBER:Amount}%{SPACE}%{NUMBER:TransCode}%{SPACE}(?<Feed>[A-Za-z0-9\-\s]{26})
And using custom pattern:
custompat ([0-9]{6}|\s{6})

How to find "complicated" URLs in a text file

I'm using the following regex to find URLs in a text file:
/http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+/
It outputs the following:
http://rda.ucar.edu/datasets/ds117.0/.
http://rda.ucar.edu/datasets/ds111.1/.
http://www.discover-earth.org/index.html).
http://community.eosdis.nasa.gov/measures/).
Ideally they would print out this:
http://rda.ucar.edu/datasets/ds117.0/
http://rda.ucar.edu/datasets/ds111.1/
http://www.discover-earth.org/index.html
http://community.eosdis.nasa.gov/measures/
Any ideas on how I should tweak my regex?
Thank you in advance!
UPDATE - Example of the text would be:
this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/).

This will trim the trailing ) and . characters from your output.
import re
regx= re.compile(r'(?m)[\.\)]+$')
print(regx.sub('', your_output))
This regex seems to work for extracting the URLs from your original sample text.
https?:[\S]*\/(?:\w+(?:\.\w+)?)?
Demo (edited from https?:[\S]*\/)
A Python script could look something like this:
import re
ss=""" this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this to be copied over http://rda.ucar.edu/datasets/ds111.1/. http://www.discover-earth.org/index.html). http://community.eosdis.nasa.gov/measures/). """
regx= re.compile(r'https?:[\S]*\/(?:\w+(?:\.\w+)?)?')
for m in regx.findall(ss):
    print(m)

So for the urls you have here:
https://regex101.com/r/uSlkcQ/4
Pattern explanation:
Protocols (e.g. https://)
^[A-Za-z]{3,9}:(?://)
Look for recurring .[-;:&=+\$,\w]+-class (www.sub.domain.com)
(?:[\-;:&=\+\$,\w]+\.?)+
Look for recurring /[\-;:&=\+\$,\w\.]+ (/some.path/to/somewhere)
(?:\/[\-;:&=\+\$,\w\.]+)+
Now, for your special case: ensure that the last character is not a dot or a parenthesis, using negative lookahead
(?!\.|\)).
The full pattern is then
^[A-Za-z]{3,9}:(?://)(?:[\-;:&=\+\$,\w]+\.?)+(?:\/[\-;:&=\+\$,\w\.]+)+(?!\.|\)).

There are a few things to improve or change in your existing regex to allow this to work:
http[s]? can be changed to https?. They're identical. No use putting s in its own character class
[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),] You can shorten this entire thing and combine character classes instead of using | between them. This not only improves performance, but also allows you to combine certain ranges into existing character class tokens. Simplifying this, we get [a-zA-Z0-9$-_#.&+!*\(\),]
We can go one step further: a-zA-Z0-9_ is the same as \w. So we can replace those in the character class to get [\w$-#.&+!*\(\),]
In the original regex we have $-_. This creates a range, so it actually includes everything between $ and _ in the ASCII table. This will cause unwanted characters to be matched: $%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_. There are a few options to fix this:
[-\w$#.&+!*\(\),] Place - at the start of the character class
[\w$#.&+!*\(\),-] Place - at the end of the character class
[\w$\-#.&+!*\(\),] Escape - such that you have \- instead
You don't need to escape ( and ) in the character class: [\w$#.&+!*(),-]
[0-9a-fA-F][0-9a-fA-F] You don't need to specify [0-9a-fA-F] twice. Just use a quantifier like so: [0-9a-fA-F]{2}
(?:%[0-9a-fA-F][0-9a-fA-F]) The non-capture group isn't actually needed here, so we can drop it (it adds another step that the regex engine needs to perform, which is unnecessary)
So the result of just simplifying your existing regex is the following:
https?://(?:[$\w#.&+!*(),-]|%[0-9a-fA-F]{2})+
Now you'll notice it doesn't match / so we need to add that to the character class. Your regex was matching this originally because it has an improper range $-_.
https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+
Unfortunately, even with this change, it'll still match ). at the end. That's because your regex isn't told to stop matching after /. Even implementing this will now cause it to not match file names like index.html. So a better solution is needed. If you give me a couple of days, I'm working on a fully functional RFC-compliant regex that matches URLs. I figured, in the meantime, I would at least explain why your regex isn't working as you'd expect it to.
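In the meantime, a minimal Python sketch of the points above. It first checks that the accidental $-_ range really does admit / and ), then applies the simplified pattern and trims trailing . and ) in a separate step. The rstrip post-processing is my own stop-gap, not part of the regex, and it would also remove a ) or . that legitimately ends a URL.

```python
import re

text = (
    "this is a test http://rda.ucar.edu/datasets/ds117.0/. and I want this "
    "to be copied over http://rda.ucar.edu/datasets/ds111.1/. "
    "http://www.discover-earth.org/index.html). "
    "http://community.eosdis.nasa.gov/measures/)."
)

# The range $-_ spans ASCII 0x24 to 0x5F, so it silently admits / and ).
print(bool(re.fullmatch(r"[$-_]", "/")), bool(re.fullmatch(r"[$-_]", ")")))

# Simplified pattern with / added explicitly to the character class.
url_re = re.compile(r"https?://(?:[$\w#.&+!*(),/-]|%[0-9a-fA-F]{2})+")

# Stop-gap: strip trailing dots and closing parentheses after matching.
urls = [u.rstrip(".)") for u in url_re.findall(text)]
for u in urls:
    print(u)
```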

Thanks all for the responses. A coworker ended up helping me with it. Here is the solution:
import re
# des holds the text to search
des_links = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', des)
for i in des_links:
    # keep everything before the last slash
    tmps = "/".join(i.split('/')[0:-1])
    print(tmps)

How do I use regex to return text following specific prefixes?

I'm using an application called Firemon which uses regex to pull text out of various fields. I'm unsure which regex flavor it uses; I can't find a reference to this in the documentation.
My raw text will always be in the following format:
CM: 12345
APP: App Name
BZU: Dept Name
REQ: First Last
JST: Text text text text.
CM will always be an integer, JST will be a sentence that may span multiple lines, and the other fields will be strings that consist of 1-2 words. There's always a return after each section.
The application, Firemon, has me create a regex entry for each field. Something simple that looks for each prefix and then a return should work, because I return after each value. I've tried several variations, such as "BZU:\s*(.*)", but can't seem to find something that works.
EDIT: To be clear I'm trying to get the value after each prefix. Firemon has a section for each field. "APP" for example is a field. I need a regex example to find "APP:" and return the text after it. So something as simple as regex that identifies "APP:", and grabs everything after the : and before the return would probably work.

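You can use (?=\w+ )(.*)
The positive lookahead does not consume anything; it only makes the match start at a position where a word followed by a space begins. For a value such as "App Name", the captured group is therefore the text after the prefix and its space. Note that a single-word value such as 12345 is not followed by a space, so it will not match.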

I am a little late to the game, but maybe this is still an issue.
In the more recent versions of FireMon, sample regexes are provided. For instance:
jst:\s*([^;]*?)\s*;
will match on:
jst:anything in here;
and result in
anything in here
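For the line-per-field format in the question, the "prefix, colon, rest of the line" idea can be sketched as follows. This uses Python's re module purely for illustration; the helper name field is made up, and whether FireMon's regex flavor supports line anchors in the same way is an assumption that would need checking in the application.

```python
import re

raw = """CM: 12345
APP: App Name
BZU: Dept Name
REQ: First Last
JST: Text text text text."""

def field(prefix: str, text: str):
    # Anchor the prefix at the start of a line, skip spaces or tabs after
    # the colon, and capture everything up to the end of that line.
    m = re.search(rf"^{re.escape(prefix)}:[ \t]*(.*)$", text, re.MULTILINE)
    return m.group(1) if m else None

print(field("APP", raw))  # App Name
print(field("CM", raw))   # 12345
```

This only captures the first line of a value, so a JST sentence that wraps onto further lines would need a different pattern.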

Prevent Notepad++ from reading the entire string as a Regular Expression

Short story:
Is there a way to prevent Notepad++ from interpreting all parts of a string as regex?
The long story:
I have a list of German cities. In Germany some cities have the suffix a.d. (meaning close by) plus the name of a river to differentiate this city from others with the same name.
Unfortunately the suffix is written in various forms, for example:
Dillingen a. d. Donau
Dörnfeld a. d.Ilm
Eldena a.d.Elde
Limburg a d Lahn
To be able to join this list with other data I need a coherent form, for example:
Dillingen a.d. Donau
Dörnfeld a.d. Ilm
Eldena a.d. Elde
Limburg a.d. Lahn
I tried to search for
(a.d.)\b.+\b
but, of course, Notepad++ interprets a.d. as regex (. = any letter) giving also results such as
Fürstenwalde/Spree
Immenstaad am Bodensee
Jänschwalde Ost
making it impossible to search and replace all.
How can I realize this using regex?
I guess the answer is fairly easy but I found no hint in the forum or Notepad++ documentation.
Can someone help? Thanks a lot in advance!
Best,
David

\ba\s*\.?\s*d\b.+
You can use this. In a.d. an unescaped . matches any character, so escape it. See the demo.
https://regex101.com/r/eX9gK2/7
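To turn the match into the uniform "a.d. " form, the pattern can be extended to also swallow the optional second dot and any following spaces, and then be used in a replace. Below is a small Python sketch of that idea; the extended pattern and the function name normalize are my own additions. The same pattern with "a.d. " as the replacement text should behave similarly in Notepad++'s regular expression mode, but I have not verified that there.

```python
import re

# a, optional spaces, optional dot, optional spaces, d as a whole word,
# then an optional dot and any spaces before the river name.
SUFFIX = re.compile(r"\ba\s*\.?\s*d\b\.?\s*")

def normalize(name: str) -> str:
    return SUFFIX.sub("a.d. ", name)

cities = [
    "Dillingen a. d. Donau",
    "Dörnfeld a. d.Ilm",
    "Eldena a.d.Elde",
    "Limburg a d Lahn",
    "Immenstaad am Bodensee",  # must stay unchanged
]
for city in cities:
    print(normalize(city))
```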

Match alphanumerics, except only numbers

The original question that gave the idea behind this particular regex is Regex to find content not in quotes.
Let's just modify the original sample a little bit:
INSERT INTO Notifify (to_email, msg, date_log, from_email, ip_from)
VALUES
(
:to_email,
'test teste nonono',
:22,
:3mail,
:ip_from
)
I know that variables starting with numerals are not allowed in any programming language, but that doesn't mean we can't have scenarios where we need to match just :to_email or :3mail and :ip_from and not :22.
How do we proceed? My friend and I tried it (theoretically only) this way:
Store all string in a set
Subtract the set that contains only numbers
For online testing, I am using RegExr.

I don't know which programming language you use, but why can't you just check whether the line matches:
^\s*:[0-9]+,?\s*$
and just take the unmatched lines?

Lookaheads will work here:
\b(?=\d*[a-z])\w+\b
as will
\b\d*[a-z]\w*\b
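Applied to the sample from the question, the lookahead version can be tried out like this. This is a sketch in Python, and prefixing the pattern with a literal colon is my own assumption, made so that only the placeholders are picked up and not ordinary words such as the column names.

```python
import re

sql = """INSERT INTO Notifify (to_email, msg, date_log, from_email, ip_from)
VALUES
(
:to_email,
'test teste nonono',
:22,
:3mail,
:ip_from
)"""

# After the colon, the lookahead demands a letter somewhere after any
# leading digits, which rules out purely numeric names such as :22.
placeholders = re.findall(r":(?=\d*[a-z])\w+", sql)
print(placeholders)  # [':to_email', ':3mail', ':ip_from']
```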
