REGEX in Pentaho to clean a join column in my data - regex

I have been struggling with a certain column in my data: the source data is dirty, and my joins fail because of it.
What I am trying to do is:
Select the column [website_reference_number], among others
Use a regex to check [website_reference_number] against certain specs
Then trim the data so that no inconsistencies are left and my joins are clean
For example:
if [website_reference_number] = "CC-DE-109" >>> leave it as is
if [website_reference_number] = "CC-DE-109-Duplicate" >>> change to CC-DE-109
if [website_reference_number] = "CC-DE-109 Duplicate" >>> change to CC-DE-109
if [website_reference_number] = "CC-DE-109-Duplicate-Duplic" >>> change to CC-DE-109
So the rules, in human terms, are {Any 2 Letters}-{Any 2 Letters}-{AnyAmountOfNumbers}

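Use this pattern and keep only the three captured groups (the trailing .* matches whatever follows the number, so it can be discarded):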
/([A-Z]{2})-([A-Z]{2})-([0-9]+).*/
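The same idea can be tried outside Pentaho. Below is a minimal sketch in Python, not Pentaho configuration: clean_reference is a hypothetical helper name, and leaving non-matching values untouched is an assumption, since the question does not say what should happen to them.

```python
import re

# Two uppercase letters, two uppercase letters, then one or more digits,
# anchored at the start of the value.
REFERENCE = re.compile(r"^([A-Z]{2})-([A-Z]{2})-([0-9]+)")


def clean_reference(value):
    """Return the leading XX-XX-123 part of value, or value unchanged if it does not match."""
    match = REFERENCE.match(value)
    if match is None:
        return value  # assumption: leave non-matching values untouched
    return "-".join(match.groups())


samples = [
    "CC-DE-109",
    "CC-DE-109-Duplicate",
    "CC-DE-109 Duplicate",
    "CC-DE-109-Duplicate-Duplic",
]
for raw in samples:
    print(raw, "->", clean_reference(raw))
```

All four sample values from the question reduce to CC-DE-109, so they would join on the same key.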


django split data and apply search istartswith = query

I have a project where, when searching, I need to split the data (not the search query) into words and search against those.
For example:
my query is 'bot' (while typing 'bottle'),
but if I use meta_keywords__icontains = query, the filter also returns results containing 'robot'.
Here meta_keywords are keywords that can be used for searching.
If I use meta_keywords__istartswith instead, I can't find rows whose meta_keywords is 'water bottle'. Is there a way to handle this case?
What I need is to search every word of the data using just istartswith.
I could simply create a model for 'meta_keywords' and populate it by splitting the current data and saving each word separately. I know that might be the best way, but I am looking for other ways to achieve it.
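You can match the query against the start of every word in the name field:
import re
from django.db.models import Q
instances = Model.objects.filter(Q(name__iregex=r'[[:<:]]' + re.escape(query)))
E.g. "Hello world" can be found with the query 'hello' or 'world'. Unlike icontains, it does not match in the middle of a word.
Note: it works only in Python 3. The [[:<:]] word-start marker is evaluated by the database, not by Python, so it depends on the backend's regex syntax.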
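To see which values a word-start match accepts, the rule can be reproduced in plain Python. This is only an illustration of the matching semantics, not Django code: starts_a_word is a hypothetical helper, and it uses Python's \b word boundary rather than the database's own marker.

```python
import re


def starts_a_word(text, query):
    """True if query matches the beginning of any word in text, ignoring case."""
    # \b is Python's word boundary; re.escape keeps the query literal.
    pattern = r"\b" + re.escape(query)
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


print(starts_a_word("water bottle", "bot"))  # "bottle" starts with "bot"
print(starts_a_word("robot", "bot"))         # "bot" is not at the start of a word
```

The first call matches and the second does not, which is the difference between this approach and icontains.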

Getting the structure of a raw queryset in django to serialize it or to browse data

Doing:
datas = models.Lfsa_eisn2.objects.raw("SELECT id, AREA_CODE_ID, OCUPATION_ID, YEAR_ID, GROUP_CONCAT(`cipher` separator ',') as `cipher` from core_lfsa_eisn2 group by AREA_CODE_ID , OCUPATION_ID,YEAR_ID" )
datas = list(datas)
print datas
...
OC9 Elementary occupations
S Other service activities
2012 UK United Kingdom
True
45.0,4.3,12.8,16.8,16.0,2619.3,:,60.2,57.2,247.4,344.0,208.2,5.5,42.4,455.5,87.1,233.4,24.1,168.6,180.5,362.2,:,43.9>]
...
Here, for example, OC9 (OCUPATION_ID) is the foreign key pointing to Elementary occupations.
I'd like to do something like datas.ocupation_id to get OC9 or Elementary occupations.
Do you know how to get the metadata structure of the raw object?
It should be something like print datas.meta or datas.fields... I couldn't find it after quite some time looking and trying.
I want to obtain some info like this:
[{'id': 1, 'OCUPATION_ID': 'Elementary occupations', 'AREA_CODE_ID': 'United Kingdom'}]
In summary, I do not understand the structure of the raw queryset object well enough to access its data and later serialize it to JSON. What would you advise?
Thanks in advance!
Well, first things first. After you did:
datas = list(datas)
datas became a list object (not a RawQuerySet). You probably don't need this line.
This leaves us with just:
datas = models.Lfsa_eisn2.objects.raw("... ultraviolent SQL query ...")
Now datas is a proper RawQuerySet. Let's print its attributes. The dir function to the rescue (it is a universal law: don't know what to do with obj? print dir(obj) and you'll know everything):
>>> print dir(datas)
[..., 'columns', 'translations', ...]
A wild guess: datas.columns will probably give us the structure of datas.
>>> datas.columns
['id', 'area_code_id', ...]
Yep, that's what we need.
Now we can perform getattr ultraviolence and print all attributes:
>>> first = datas[0]
>>> for column in datas.columns:
...     print getattr(first, column)
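The getattr loop above can also collect the values into dicts that json.dumps accepts. A minimal sketch in plain Python, assuming each row exposes the column names as attributes: rows_to_dicts is a hypothetical helper, and the SimpleNamespace object only stands in for a model instance. With a real RawQuerySet you would pass datas and datas.columns instead.

```python
import json
from types import SimpleNamespace


def rows_to_dicts(rows, columns):
    """Build one dict per row by reading each named column as an attribute."""
    return [{column: getattr(row, column) for column in columns} for row in rows]


# Hypothetical stand-in for the model instances a raw query would yield.
rows = [SimpleNamespace(id=1, area_code_id="UK", ocupation_id="OC9", year_id=2012)]
columns = ["id", "area_code_id", "ocupation_id", "year_id"]

print(json.dumps(rows_to_dicts(rows, columns)))
```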
And one other thing. You're doing ultraviolent stuff in your SQL query: you select the id column, but the GROUP BY column list doesn't have id in it. That won't work in Postgres. It should work in SQLite and probably MySQL (I think it's not possible to write a query that won't work in MySQL).

How to add more conditions in where clause?

I have made a loop to build the conditions:
for level in levels:
    requete += ' and level_concat like %'+level+'%'
and I use them in my query:
countries = Country.objects.extra(where=['continent_id = "'+continent_id+'"', requete])
I have tried to add the conditions to my where clause, but it returns this error:
not enough arguments for format string
Expected result:
SELECT * FROM `Country` WHERE country_id = "US-51" AND level_concat LIKE %level_test%
Is there a way to add requete to my 'where' clause?
Firstly, it is not good practice to keep data in a relational database as a "list of [values] concatenated by comma" - you should create a new table for those values.
Still, even now you can use filter() instead of extra() (which should always be your last resort - I can't see the rest of your code, but if you don't properly escape the levels values you may even be introducing an SQL injection vulnerability here).
An example of secure, extra()-less code that does the exact same thing:
from django.db.models import Q
q = Q()
for level in levels:
    q &= Q(level_concat__contains=level)
countries = Country.objects.filter(q)
or the same functionality in even fewer lines:
from django.db.models import Q
q = (Q(level_concat__contains=l) for l in levels)
countries = Country.objects.filter(*q)
You can read more about Q object in Django docs.
I think you need to escape the % sign in your query:
' and level_concat like %%'+level+'%%'
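The error message itself hints at why doubling the percent signs helps. The sketch below assumes, as the message suggests, that the SQL string is eventually passed through Python's % operator to fill in parameters; render is a hypothetical stand-in for that step, not a Django or driver function.

```python
def render(sql, params=()):
    """Mimic a driver that fills placeholders with Python's % operator."""
    return sql % params


unescaped = "level_concat LIKE '%level_test%'"
escaped = "level_concat LIKE '%%level_test%%'"

try:
    render(unescaped)
except TypeError as error:
    # "%le" is read as a format specifier that expects an argument.
    print("unescaped:", error)

# "%%" is the escape for a literal percent sign.
print("escaped:", render(escaped))
```

Escaping only fixes the formatting error. The level values are still concatenated into the SQL, so the injection concern raised in the other answer still applies.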

Search a column for multiple words using Django queryset

I have an autocomplete box which needs to return results containing the input words. However, the input words can be partial and can appear in any order or position.
Example:
The values in database column (MySQL)-
Expected quarterly sales
Sales preceding quarter
Profit preceding quarter
Sales 12 months
Now if the user types quarter sales, it should return the first two results.
I tried:
column__icontains = term #searches only '%quarter sales%' and thus gives no results
column__search = term #matches any complete word, so it also returns the last result
**{'ratio_name__icontains':each_term for each_term in term.split()} #this searches only sales as it is the last keyword argument
Is there any trick via regex, or maybe something built into Django that I am missing, since this is a common pattern?
Search engines are better for this task, but you can still do it with basic code.
If you're looking for strings containing "A" and "B", you can
Model.objects.filter(string__contains='A').filter(string__contains='B')
or
Model.objects.filter(Q(string__contains='A') & Q(string__contains='B'))
But really, you'd be better off going with a simple full text search engine with little configuration, like Haystack/Whoosh
The above answers using chained .filter() require entries to match ALL the filters.
For those wanting "inclusive or" or ANY type behaviour, you can use functools.reduce to chain together Django's Q objects for a list of search terms:
from functools import reduce
from django.db.models import Q
list_of_search_terms = ["quarter", "sales"]
query = reduce(
    lambda a, b: a | b,
    (Q(column__icontains=term) for term in list_of_search_terms),
)
YourModel.objects.filter(query)
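The difference between the ALL and ANY behaviour is easy to check against the sample values from the question. This is a plain-Python illustration of the matching rule only, not a queryset: search and its combine parameter are hypothetical names.

```python
values = [
    "Expected quarterly sales",
    "Sales preceding quarter",
    "Profit preceding quarter",
    "Sales 12 months",
]


def search(values, text, combine=all):
    """Return values containing all (or, with combine=any, any) of the words in text."""
    terms = text.lower().split()
    return [v for v in values if combine(term in v.lower() for term in terms)]


print(search(values, "quarter sales"))               # every term must appear
print(search(values, "quarter sales", combine=any))  # one term is enough
```

With all, only the first two values match, which is what the question asks for. With any, every value matches.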

Can Django do nested queries and exclusions

I need some help putting together this query in Django. I've simplified the example here to just cut right to the point.
class MyModel(models.Model):
    created = models.DateTimeField()
    user = models.ForeignKey(User)
    data = models.BooleanField()
The query I'd like to create in English would sound like:
Give me every record that was created yesterday for which data is False where in that same range data never appears as True for the given user
Here's an example input/output in case that wasn't clear.
Table Values
ID  Created   User   Data
1   1/1/2010  admin  False
2   1/1/2010  joe    True
3   1/1/2010  admin  False
4   1/1/2010  joe    False
5   1/2/2010  joe    False
Output Queryset
1   1/1/2010  admin  False
3   1/1/2010  admin  False
What I'm looking to do is to exclude record #4. The reason for this is because in the given range "yesterday", data appears as True once for the user in record #2, therefore that would exclude record #4.
In a sense, it almost seems like there are 2 queries taking place. One to determine the records in the given range, and one to exclude records which intersect with the "True" records.
How can I do this query with the Django ORM?
You don't need a nested query. You can generate a list of bad users' PKs and then exclude records containing those PKs in the next query.
bad = list(set(MyModel.objects.filter(data=True).values_list('user', flat=True)))
# list(set(list_object)) will remove duplicates
# not needed but might save the DB some work
rs = MyModel.objects.filter(datequery).exclude(user__pk__in=bad)
# might not need the pk in user__pk__in - try it
You could condense that down into one line but I think that's as neat as you'll get. 2 queries isn't so bad.
Edit: You might want to read the docs on this:
http://docs.djangoproject.com/en/dev/ref/models/querysets/#in
It makes it sound like it auto-nests the query (so only one query fires in the database) if it's like this:
bad = MyModel.objects.filter(data=True).values('user')
rs = MyModel.objects.filter(datequery).exclude(user__pk__in=bad)
But MySQL doesn't optimise this well so my code above (2 full queries) can actually end up running a lot faster.
Try both and race them!
It looks like you could use:
from django.db.models import F
MyModel.objects.filter(datequery).filter(data=False).filter(data=F('data'))
F objects are available from Django 1.1.
Please test it; I'm not sure.
Thanks to lazy evaluation, you can break your query up into a few different variables to make it easier to read. Here is some ./manage.py shell play time in the style that Oli already presented.
> from django.db import connection
> connection.queries = []
> target_day_qs = MyModel.objects.filter(created='2010-1-1')
> bad_users = target_day_qs.filter(data=True).values('user')
> result = target_day_qs.exclude(user__in=bad_users)
> [r.id for r in result]
[1, 3]
> len(connection.queries)
1
You could also say result.select_related() if you wanted to pull in the user objects in the same query.
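The filter-then-exclude logic can be checked against the sample table from the question without a database. A minimal plain-Python sketch, where Record and clean_records are hypothetical names that only mirror the two-step idea (find the users with a True record in the range, then drop all of their records):

```python
from collections import namedtuple

Record = namedtuple("Record", "id created user data")

records = [
    Record(1, "1/1/2010", "admin", False),
    Record(2, "1/1/2010", "joe", True),
    Record(3, "1/1/2010", "admin", False),
    Record(4, "1/1/2010", "joe", False),
    Record(5, "1/2/2010", "joe", False),
]


def clean_records(records, day):
    """Records from day whose user has no data=True record on that same day."""
    in_range = [r for r in records if r.created == day]
    bad_users = {r.user for r in in_range if r.data}
    return [r for r in in_range if r.user not in bad_users]


print([r.id for r in clean_records(records, "1/1/2010")])
```

For 1/1/2010 this yields records 1 and 3, matching the expected output queryset in the question.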