RegEx to clean VISA merchant names (remove random strings) - regex

I am trying to develop a regex (.NET flavor) that I can use to clean VISA merchant names.
Examples:
Norton *AP1223506209 --> Norton *AP
Norton *AP1223511428
EUROWINGS VYJD6J_123001 --> EUROWINGS
EUROWINGS W6PDFI_125626
AER LINGUCB22QKM2 --> AER LINGUCB
AER LINGUCB248L2W
AIR FRANCE JWNCSC --> AIR FRANCE
AIR FRANCE K8L7TT
PAYPAL *AIRBNB HMQXBW --> PAYPAL *AIRBNB
PAYPAL *AIRBNB HMQXNZ
SAS 1174565172360 --> SAS
SAS 1174565172368
I would like to keep the first "name" part, but remove the second "gibberish" part.
The following regex works for Norton and AER LINGU, and also for Eurowings and Air France when the gibberish part contains numbers. It fails completely for PAYPAL *AIRBNB and other strings whose gibberish part contains no numbers, and also for SAS, probably because the name is too short or there are too many spaces:
Search:
([A-z *-]{2,50}[A-z]{2,50})(.{0,3}([0-9-]{0,3}[A-z *+.#-/]{0,3}){1,10})
Replace:
$1
Is there any way to make this work for gibberish parts that don't contain numbers? I have something like this in mind, but can't manage to write a corresponding regex:
Group 1 (to keep)
Must contain consonants and vowels
Can contain few numbers, spaces or punctuation signs (e.g.: "7x7: Taxi Service")
Group 2 (to be removed)
Consists of sequences of numbers, letters and optional punctuation signs
OR: consists of consonants, only
OR: consists of numbers, only
Thanks for any help and best regards
Pesche
Edit:
If I add more examples, Linden's solution still works quite well, but it misses some of the examples and in other cases removes too much of the string. I tried to adjust it, but with my limited skills didn't quite succeed:
https://regex101.com/r/7y9zGl/4
The following problems remain:
With a length of 6 for the last \w, longer patterns are not matched in full (e.g. after easyjet and after EMP Merchan). Increasing it, however, causes other strings to be truncated (e.g. AER LINGU, and potentially HOTELS.COM if a value above 12 were used).
The merchant names after PAYPAL * and GOOGLE * should not be deleted, as they are true merchant names. I tried to exclude strings containing GOOGLE * with a negative lookbehind, but it does not seem to work like that.
Whereas the merchant name after PAYPAL * should generally remain, in some cases it is followed by gibberish, e.g. PAYPAL *AIRBNB HMQXBW. If the negative lookbehind worked, those cases would no longer be cleaned.
If the merchant name is not followed by gibberish, part of the name itself may be deleted (e.g. EMP Merchan).
As the full list of merchant names is long and versatile, the approach to detect "gibberish" should be as generic as possible (i.e. not rely on a certain length of the gibberish part). Hence my original, now slightly modified "pattern":
Consists of sequences of numbers, letters and optional punctuation signs
OR: contains no or very few vowels (EASYJET 000ESJ5TWN -> the gibberish contains only one vowel, EASYJET 3 of them; PAYPAL *NITSCHKE -> NITSCHKE should not be matched, it contains 2 vowels)
OR: consists of numbers, only
Is such a thing even possible? The goal is to use SQL to clean the merchant names. If necessary, this can be done in several passes (one for each kind of pattern).
Thx again!

Updated regex based on extended sample and desired results:
[\s*<]+\d+$|[\s*<]+(?![A-Z]{6}.*)\w*\d[\w>]*$|\d{6,}$|[\s*<]+[A-Z]{6}$|(?![A-Z]+$)(?<=[A-Z])\w{6}$
Demo
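The "pattern" described in the question (digits, or no vowels at all, mark the gibberish) can also be tried outside a single regex. What follows is only a minimal Python sketch of that idea, so the logic would still have to be ported to SQL or .NET. The function name clean_merchant, the rule that the first word is never removed, and the choice to treat Y as a consonant are all assumptions made for illustration.

```python
import re

VOWELS = set("AEIOUaeiou")  # assumption: Y counts as a consonant


def vowel_count(text):
    """Count the vowels in a piece of text."""
    return sum(ch in VOWELS for ch in text)


def clean_merchant(name):
    """Strip trailing gibberish words from a merchant name (heuristic)."""
    tokens = name.split()
    # Never touch the first word, so short names such as "SAS" survive.
    while len(tokens) > 1:
        last = tokens[-1]
        # Keep only the part of the word before its first digit.
        head = re.split(r"\d", last, maxsplit=1)[0].rstrip("_-/#")
        if vowel_count(head) == 0:
            # Digits only, or no vowels left: treat the whole word as gibberish.
            tokens.pop()
            continue
        # The word looks like a name (possibly with a numeric tail cut off).
        tokens[-1] = head
        break
    return " ".join(tokens)


for sample in ["Norton *AP1223506209", "AER LINGUCB22QKM2", "PAYPAL *AIRBNB HMQXBW"]:
    print(sample, "-->", clean_merchant(sample))
```

On the samples from the question this keeps "Norton *AP", "AER LINGUCB" and "PAYPAL *AIRBNB". A legitimate trailing word without vowels (for example "GmbH") would also be removed, so a whitelist would probably be needed on real data.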

I cannot validate as I'm only on my phone, but can you try something like this?
^([0-9A-Za-z\*][ ]{0,2})
Take all the numbers, the letters (upper and lower case), the star and at most 2 spaces from the beginning of the line.
Please check the parentheses, but I think the idea is there.
Sorry, it seems wrong when there is no double space.
You want to take all the characters up to 2 spaces or 2 digits, according to your examples.
.* {2}|.*[0-9]{2}
Is it better?
Regards,
Thomas

Related

Complex Regex is not validating all repetitions of one specific rule

I need help to validate a field using regex. It will run in Postgres 9.5.
The rules are
The string must contain all seven services: Oil, Wiper blades, Air filter, Tires, Battery, Brake, Antifreeze
All services must have operation hours; the accepted values are HH[:MM]{am|pm}-HH[:MM]{am|pm} or the literals "working hours", "after hours", "not available" (this is the rule I couldn't find a solution for)
It is case insensitive, and the spaces should be irrelevant.
The services are separated by a pipe, and each service and its working hours are separated by a colon
I did the regex:
^(?=.*(Oil))(?=.*(Wiper blades))(?=.*(Air filter))(?=.*(Tires))(?=.*(Battery))(?=.*(Brake))(?=.*(Antifreeze))(?=.*(\s{0,}(1{0,1}[0-2]|[1-9])(:[0-5][0-9]){0,1}\s{0,}([ap]m)\s{0,}-\s{0,}(1{0,1}[0-2]|[1-9])(:[0-5][0-9]){0,1}\s{0,}([ap]m)|working hours|after hours|not availabl)).+
This part of the regex is validating only one sequence, not all seven sequences.
(?=.*(\s{0,}(1{0,1}[0-2]|[1-9])(:[0-5][0-9]){0,1}\s{0,}([ap]m)\s{0,}-\s{0,}(1{0,1}[0-2]|[1-9])(:[0-5][0-9]){0,1}\s{0,}([ap]m)|working hours|after hours|not availabl))
Example of good string
Oil:8AM-10PM|Wiper blades:8 AM -10 PM|Air filter:8AM-10pm|Tires:8AM-10PM|Battery:8AM-10PM|Brake:8AM-9PM|Antifreeze:not available
Example of bad strings
Oil:8AM-10PM|Wiper blades:8AM-10PM|Air filter:8AM-10PM|Tires:8AM-10PM|Battery:8AM-10PM|Brake:8AM-9PM|Antifreeze:fsdfdsfs
Oil:8AM-10PM|Wiper blades:8AM-10PM|Air filter:8AM|Tires:8AM-10PM|Battery:8AM-10PM|Brake:8AM-9PM|Antifreeze:
Oil:8AM-10PM|Wiper blades:8AM-10PM|Air filter:8AM-10PM|Tires:8AM-10PM|Battery:|Brake:|Antifreeze:8AM-9PM
Oil:8AM-10PM|Wiper blades:8AM-10PM
Does anyone have any idea what is missing to validate all seven occurrences?
I've made another regex that works:
^(((oil|Air\ filter|Wiper\ blades|Tires|Battery|Brake|Antifreeze):((((\d{1,2})((A|P)M)(-?)){2})|(not available))(\|?)){7})$
However, this regex does not check for repetition, which means Oil could appear twice and the string would still match.
I've created a regex101 demo if you wish to test more cases.
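Checking that each service occurs exactly once is awkward in a single regex, so one option is to split the field first and validate each entry separately. A rough Python sketch of that logic follows. The question targets Postgres, so this only illustrates the idea, and the names REQUIRED, HOURS and is_valid are made up for the example.

```python
import re

REQUIRED = {"oil", "wiper blades", "air filter", "tires",
            "battery", "brake", "antifreeze"}

# One clock time such as 8AM, 10 pm or 9:30am (12-hour clock).
TIME = r"(?:1[0-2]|0?[1-9])(?::[0-5][0-9])?\s*[ap]m"
HOURS = re.compile(
    rf"\s*(?:{TIME}\s*-\s*{TIME}|working hours|after hours|not available)\s*",
    re.IGNORECASE,
)


def is_valid(field):
    """True if all seven services appear exactly once with valid hours."""
    seen = set()
    for entry in field.split("|"):
        service, sep, hours = entry.partition(":")
        key = " ".join(service.lower().split())  # normalise case and spacing
        if not sep or key not in REQUIRED or key in seen:
            return False  # missing colon, unknown service, or duplicate
        if not HOURS.fullmatch(hours):
            return False  # hours are neither a time range nor a literal
        seen.add(key)
    return seen == REQUIRED


good = ("Oil:8AM-10PM|Wiper blades:8 AM -10 PM|Air filter:8AM-10pm|"
        "Tires:8AM-10PM|Battery:8AM-10PM|Brake:8AM-9PM|Antifreeze:not available")
print(is_valid(good))
```

Because the services are collected in a set, a duplicated Oil entry or a missing service is rejected, which is the part the single regex above does not cover.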

Validate Street Address Format

I'm trying to validate the format of a street address in Google Forms using regex. I won't be able to confirm it's a real address, but I would like to at least validate that the string is:
[numbers (max 6 digits)] [words (minimum one to max 8 words with spaces in between, numbers and # allowed)], [words (minimum one to max four words, only letters)], [2 capital letters] [5 digit number]
I want the spaces and commas I left in between the brackets to be required, exactly where I put them in the above example. This would validate
123 test st, test city, TT 12345
That's obviously not a real address, but at least it requires the entry to be in the correct format. The data comes from people answering a question on a form, so it will always be just an address, no names. Plus, the addresses are all in one area, South Florida, where pretty much all addresses will match this format. The problem I'm having is people not entering a city, or commas, so I want to give them an error if they don't. So far, I've found this
^([0-9a-zA-Z]+)(,\s*[0-9a-zA-Z]+)*$
But that doesn't allow for multiple words between the commas, or the capital letters and numbers for zip. Any help would save me a lot of headaches, and I would greatly appreciate it.
There really is a lot to consider when dealing with a street address--more than you can meaningfully deal with using a regular expression. Besides, if a human being is at a keyboard, there's always a high likelihood of typing mistakes, and there just isn't a regex that can account for all possible human errors.
Also, depending on what you intend to do with the address once you receive it, there's all sorts of helpful information you might need that you wouldn't get just from splitting the rough address components with a regex.
As a software developer at SmartyStreets (disclosure), I've learned that regular expressions really are the wrong tool for this job because addresses aren't as 'regular' (standardized) as you might think. There are more rigorous validation tools available, even plugins you can install on your web form to validate the address as it is typed, and which return a wealth of useful metadata and information.
Try Regex:
\d{1,6}\s(?:[A-Za-z0-9#]+\s){0,7}(?:[A-Za-z0-9#]+,)\s*(?:[A-Za-z]+\s){0,3}(?:[A-Za-z]+,)\s*[A-Z]{2}\s*\d{5}
See Demo
Accepts Apt# also:
(^[0-9]{1,5}\s)([A-Za-z]{1,}(\#\s|\s\#|\s\#\s|\s)){1,5}([A-Za-z]{1,}\,|[0-9]{1,}\,)(\s[a-zA-Z]{1,}\,|[a-zA-Z]{1,}\,)(\s[a-zA-Z]{2}\s|[a-zA-Z]{2}\s)([0-9]{5})
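For trying the format from the question outside the form, here is a small Python sketch. It is written directly from the bracketed description (single spaces and commas exactly where shown) and uses a full match so that nothing may precede or follow the address. The names ADDRESS and looks_like_address are assumptions for the example, and the pattern only checks the shape of the input, not whether the address exists.

```python
import re

ADDRESS = re.compile(
    r"\d{1,6} "                               # house number, up to 6 digits
    r"(?:[A-Za-z0-9#]+ ){0,7}[A-Za-z0-9#]+, "  # 1 to 8 street words, then comma
    r"(?:[A-Za-z]+ ){0,3}[A-Za-z]+, "          # 1 to 4 city words, then comma
    r"[A-Z]{2} \d{5}"                          # state code and 5-digit zip
)


def looks_like_address(text):
    """True if the whole string has the expected address shape."""
    return ADDRESS.fullmatch(text) is not None


print(looks_like_address("123 test st, test city, TT 12345"))  # True
print(looks_like_address("123 test st test city TT 12345"))    # False: no commas
```

Entries with a missing city or missing commas are rejected because both commas are mandatory parts of the pattern.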

How to improve a twitter sentiment analyzer?

I'm working on a C++ Twitter company sentiment analysis tool. User inputs a company and the tool analyzes a # of tweets and returns a sentiment.
So far I did the following:
limit tweets to English and recent
make lowercase
remove RT, # symbol, @usernames and URLs
remove characters like &^%$(){}... etc
I then parse the tweet into words and check words against two dictionaries of positive and negative words. I create a total sentiment for each tweet. Then I count the number of positive , neutral and negative tweets to come up with a final answer. No weights are used.
I am thinking of implementing the following two things:
remove stop words from tweets
remove special characters and emoticons from tweets (non english Unicode basically)
However, even with this, most of the searches end up being very neutral. For example if I search "Apple" in 100 tweets I get say 30 positives, 10 negatives and 60 neutral.
Questions:
1. Is there any way to lower the neutrals?
2. What kind of positive and negative words should I add to represent my search criteria (companies)?
You say no weighting is used, but why not add it? Assign each +/- word a base weight of 1, then maybe apply some of the following conditions:
If they use words like "very", "extremely", etc., weight the following adjective more heavily (or, without weighting, just count both of them as a +/- word)
Rather than changing everything to lowercase, weight words written in all caps more heavily with a multiplier
Rate words like "fantastic" more heavily than words like "good"
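The three conditions can be combined into one scoring function. The sketch below is in Python rather than C++ to keep it short, and every number and word in it (the WEIGHTS lexicon, the INTENSIFIERS table, the CAPS_MULTIPLIER of 1.5) is an invented placeholder rather than a recommended value.

```python
# Hypothetical lexicon: stronger words carry a larger base weight.
WEIGHTS = {"good": 1.0, "fantastic": 2.0, "bad": -1.0, "terrible": -2.0}
# Hypothetical intensifiers: they scale the next sentiment word.
INTENSIFIERS = {"very": 1.5, "extremely": 2.0}
CAPS_MULTIPLIER = 1.5  # assumed boost for words written in all caps


def tweet_score(tweet):
    """Sum the weighted sentiment words of one tweet (case is kept on purpose)."""
    score = 0.0
    boost = 1.0
    for raw in tweet.split():
        word = raw.strip(".,!?:;\"'()")
        lower = word.lower()
        if lower in INTENSIFIERS:
            boost = INTENSIFIERS[lower]  # applies to the following word only
            continue
        weight = WEIGHTS.get(lower, 0.0)
        if weight and len(word) > 1 and word.isupper():
            weight *= CAPS_MULTIPLIER
        score += weight * boost
        boost = 1.0
    return score


print(tweet_score("the phone is very good"))
print(tweet_score("FANTASTIC phone"))
```

A tweet could then be called positive or negative when its score passes a small threshold in either direction, which may move some borderline tweets out of the neutral bucket.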

Regex - Company name

I have a plain text and need to extract company names. It's a huge document including company names, financial reports and lots of text. Here are examples of company names:
Big laundry, a.s.
AVERA, s.r.o.
Airoflot Airlines, a.s.
Is it even possible to make a regex like this? I'm a complete beginner to regex and have no idea how to create this one. Thanks for any help.
Example of text:
`There are many competitors of AVERA, s.r.o. the main one is Airoflot Airlines, a.s. and Big laundry, s.r.o. These organisations hold main share of market.
Another companies:
a. Big Company, a.s.
b. Smaller company, s.r.o.
c. Huge company, a.s.`
As the question currently stands, no, it is not possible to create a regex for company names.
It would be possible if you were able to define a PATTERN.
For example, a company name always:
starts with an uppercase letter
has a comma
after the comma there is always one of "a.s." or "s.r.o."
So, the difficulties that I see here are:
How many words before the comma belong to the name?
Is there always a comma with following abbreviation?
Names are always difficult to match because a name can be nearly everything, especially company names.
Most of the examples you give follow this pattern (a name whose second word starts with a lowercase letter, such as "Big laundry", does not): ([A-Z][A-Za-z]+ ?)+, (\w\.)+
The matching operation depends on the tool you use.
For example, in JavaScript:
var line = "some name is Airoflot Airlines, a.s. in this line";
var m = line.match(/([A-Z][A-Za-z]+ ?)+, (\w\.)+/);
if (m) console.log(m[0]); // match() returns null when nothing matches
This logs
"Airoflot Airlines, a.s."
But this isn't a very reliable solution: many real company names wouldn't fit and, more importantly perhaps, this would match sentences that aren't company names. So this can only be used as a help in a solution which also incorporates some kind of validation (human or dictionary based).
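To illustrate the pattern idea with a fixed list of legal-form suffixes, here is a small Python sketch. It assumes that a name starts with a capitalised word, has at most two further words, and ends in ", a.s." or ", s.r.o."; those limits are assumptions chosen for the sample text, not rules about real company names, and the sample string is adapted from the question.

```python
import re

# Capitalised first word, up to two more words, then a known legal form.
COMPANY = re.compile(
    r"\b[A-Z][A-Za-z]*(?: [A-Za-z]+){0,2}, (?:a\.s\.|s\.r\.o\.)"
)

text = ("There are many competitors of AVERA, s.r.o. the main one is "
        "Airoflot Airlines, a.s. and Big laundry, s.r.o. "
        "These organisations hold main share of market.")

names = [m.group(0) for m in COMPANY.finditer(text)]
print(names)
# ['AVERA, s.r.o.', 'Airoflot Airlines, a.s.', 'Big laundry, s.r.o.']
```

The word limit is what keeps a sentence opener such as "There are many ..." from being swallowed into a name. Raising it makes false matches more likely, which is the same reliability problem described above.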
I use this
(?:\s*[a-zA-Z0-9,_\.\077\0100\*\+\&\#\'\~\;\-\!\#\;]{2,}\s*)*
It matches all a-z, A-Z, 0-9 and some special characters which QuickBooks supports.
https://community.intuit.com/articles/1146006-acceptable-characters-in-the-company-name-in-quickbooks-online
with your given examples, this regexp would match
Big laundry, a\.s\.|AVERA, s\.r\.o\.|Airoflot Airlines, a\.s\.
The trick is to use the alternation operator | on a set of strings.
You may wish to account for missing punctuation and whitespace in the company names too.
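When the list of names is known in advance, the alternation can be generated instead of typed by hand. A minimal Python sketch, assuming a hypothetical list called KNOWN:

```python
import re

# Hypothetical list of known company names.
KNOWN = ["Big laundry, a.s.", "AVERA, s.r.o.", "Airoflot Airlines, a.s."]

# Escape each name (so "." is literal) and try longer names first.
pattern = re.compile(
    "|".join(re.escape(name) for name in sorted(KNOWN, key=len, reverse=True))
)


def find_known(text):
    """Return every known company name found in the text, in order."""
    return pattern.findall(text)


print(find_known("We met AVERA, s.r.o. and Airoflot Airlines, a.s. today"))
```

Sorting by length is a precaution for lists where one name is a prefix of another; with the three names above it makes no difference.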

Formatting UK postal codes for storage

I want to store UK postal codes in the database. Is it OK to store those postal codes without the spaces?
It is possible to store postcodes without spaces, but I would definitely recommend formatting them correctly when they are displayed/output.
There are always 3 characters after the space, so it's easy to reinsert it.
Last 3 are always xyy
x Digit 0-9
yy Alpha A-Z
Anything before is the first part of the grid reference and has various formats.
We store postcodes and accept inputs in any format, with or without a space, but then strip or correct the entry for data storage.
We find it works better this way when using the data for other things.
Why would you want to store with no spaces?
UK postcodes have a variety of formats.
Why are you unable to store white spaces?
As others have said, there is no problem with removing all spaces and storing them, if that is what you want to do. As has been said, you can always format them with a space before the last three characters.
However, I would normally take them in any reasonable format, strip all spaces out, and then store them with this one extra space. The storage requirements are not an issue, and it makes it easier to simply display as it is. You would need to resolve the format before saving in some way, so you may as well save it as it is needed.
It's usually safe to remove the space. As others have said, you can re-insert the space later if required. The existence of a space between Outcode and Incode will not normally affect postal delivery. You should not have any non alpha numeric characters in a UK postcode, so if you see a dash you can safely remove it.
I work for Experian Data Quality and if your aim is clean data you may want to consider an address verification web service, like our Pro On Demand product. This will ensure you capture the correct postcode, as they can change over time, and that it is formatted correctly for your database.
It is okay to store without a space because you can always add an empty space back in to each postcode string - the heuristic is pretty simple.
As some other users have very helpfully explained, all UK postcodes have two groups of numbers and letters, separated by a space. The group following the space always contains a number and then two letters (thus, there are always three characters after the space). The group before the space will have either two, three, or four characters (see the Wikipedia page on UK postcodes).
So, you can recreate the correct spacing by adding a space before the third-to-last character.
In R, it looks like this (but the same logic would work in other languages, such as Python):
library(stringr)  # provides str_sub()
# List of example postcodes (stored without spaces)
postcodes <- c("LS176JA", "OX41EZ", "A99AA")
# Add a space before the last three characters of each postcode
for (postcode in postcodes) {
  last_three <- str_sub(postcode, start = -3)  # inward code, e.g. "6JA"
  first_x <- str_sub(postcode, end = -4)       # everything before the last three
  final_postcode <- paste0(first_x, " ", last_three)
  print(final_postcode)
}
Which returns:
[1] "LS17 6JA"
[1] "OX4 1EZ"
[1] "A9 9AA"
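The same logic in Python might look like the sketch below. The function name format_postcode is an assumption, and the length check (5 to 7 characters without the space) only follows from the two-to-four plus three character structure described above; it does not verify that the postcode exists.

```python
import re


def format_postcode(raw):
    """Normalise a UK postcode to upper case with one space before the last 3 characters."""
    # Drop spaces, dashes and anything else that is not a letter or digit.
    compact = re.sub(r"[^A-Za-z0-9]", "", raw).upper()
    if not 5 <= len(compact) <= 7:
        raise ValueError(f"unexpected postcode length: {raw!r}")
    return compact[:-3] + " " + compact[-3:]


for raw in ["LS176JA", "ox4 1ez", "A9-9AA"]:
    print(format_postcode(raw))
```

This accepts input with or without the space and always produces the display form, so the stored value can be used as it is.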