Unit Testing Large SQL Queries in Databricks with Spark Scala [closed] - unit-testing
I am trying to write some unit tests to ensure that the data in our ETL pipeline is as expected. I would generally expect unit tests to be standard practice, but at my place of work the devs are currently not expected to write any tests (bad, I know).
We ingest data from on-prem source tables into Azure, then pull that data from Azure into Databricks and transform it into what is essentially a new table. The new table combines data from various source tables and is then transferred across into a SaaS product.
As mentioned, there are currently no tests to ensure this SQL query does what it should, but I am trying to change that. However, the way the project has been developed makes testing at a unit level very difficult.
I have made fairly good progress writing tests that check the source data is correctly transformed into the new table, but I keep asking myself whether I need to do more. It's difficult working in a place where no one else seems to care about quality, but I do.
As an example, see the query below. It pulls the recently transferred source data and builds it into a new table via a number of joins, and in my opinion it is becoming very messy:
DROP TABLE IF EXISTS platform.student;
CREATE OR REPLACE TABLE platform.student
(
STUDENT_ID string,
ULN string,
DOB string,
ETHNICITY string,
SEXID string,
DIFFLEARN1 string,
DIFFLEARN2 string,
DOMICILE string,
PARENTS_ED string,
SOCIO_EC string,
OVERSEAS string,
APPSHIB_ID string,
VLE_ID string,
HUSID string,
USERNAME string,
LAST_NAME string,
FIRST_NAME string,
ADDRESS_LINE_1 string,
ADDRESS_LINE_2 string,
ADDRESS_LINE_3 string,
ADDRESS_LINE_4 string,
POSTCODE string,
PRIMARY_EMAIL_ADDRESS string,
PERSONAL_EMAIL_ADDRESS string,
HOME_PHONE string,
MOBILE_PHONE string,
PHOTO_URL string,
ENTRY_POSTCODE string,
CARELEAVER string,
ADJUSTMENT_PLAN string,
COURSE_REGISTRATION_DATE string,
PROVIDED_AT string
);
insert into platform.student
(
STUDENT_ID,
ULN,
DOB,
ETHNICITY,
SEXID,
DIFFLEARN1,
DIFFLEARN2,
DOMICILE,
PARENTS_ED,
SOCIO_EC,
OVERSEAS,
APPSHIB_ID,
VLE_ID,
HUSID,
USERNAME,
LAST_NAME,
FIRST_NAME,
ADDRESS_LINE_1,
ADDRESS_LINE_2,
ADDRESS_LINE_3,
ADDRESS_LINE_4,
POSTCODE,
PRIMARY_EMAIL_ADDRESS, --University Email
PERSONAL_EMAIL_ADDRESS,
HOME_PHONE,
MOBILE_PHONE,
PHOTO_URL,
ENTRY_POSTCODE,
CARELEAVER,
ADJUSTMENT_PLAN,
COURSE_REGISTRATION_DATE,
PROVIDED_AT
)
select
STUDENT_ID,
ULN,
DOB,
ETHNICITY,
SEXID,
DIFFLEARN1,
DIFFLEARN2,
DOMICILE,
PARENTS_ED,
SOCIO_EC,
OVERSEAS,
APPSHIB_ID,
VLE_ID,
HUSID,
USERNAME,
LAST_NAME,
FIRST_NAME,
ADDRESS_LINE_1,
ADDRESS_LINE_2,
ADDRESS_LINE_3,
ADDRESS_LINE_4,
POSTCODE,
PRIMARY_EMAIL_ADDRESS, --University Email
PERSONAL_EMAIL_ADDRESS,
HOME_PHONE,
MOBILE_PHONE,
PHOTO_URL,
ENTRY_POSTCODE,
CARELEAVER,
ADJUSTMENT_PLAN,
COURSE_REGISTRATION_DATE,
PROVIDED_AT
from
(
SELECT DISTINCT
spriden_id AS STUDENT_ID,
case when skbspin_uln is null then 'NULL' else skbspin_uln end AS ULN,
DATE_FORMAT(spbpers_birth_date, 'yyyy-MM-dd') AS DOB,
stvethn_desc AS ETHNICITY,
'NULL' AS DIFFLEARN1,
'NULL' AS DIFFLEARN2,
'NULL' AS DOMICILE,
'NULL' AS PARENTS_ED,
'NULL' AS SOCIO_EC,
CASE WHEN sgbstdn_resd_code in ('1', '4', 'H') THEN '1'
WHEN sgbstdn_resd_code in ('5', '9', 'F', '7' ) THEN '3'
ELSE '99' END AS OVERSEAS,
CONCAT(syraccs_username, '#salford.ac.uk') AS APPSHIB_ID,
LOWER(syraccs_username) AS VLE_ID,
case when skbspin_husid is null then 'NULL' else skbspin_husid end AS HUSID,
syraccs_username AS USERNAME,
spriden_last_name AS LAST_NAME,
spriden_first_name AS FIRST_NAME,
tt.spraddr_street_line1 AS ADDRESS_LINE_1,
tt.spraddr_street_line2 AS ADDRESS_LINE_2,
tt.spraddr_street_line3 AS ADDRESS_LINE_3,
tt.spraddr_city AS ADDRESS_LINE_4,
tt.spraddr_zip AS POSTCODE,
goremal.goremal_email_address AS PRIMARY_EMAIL_ADDRESS, --University Email
personal_goremal.goremal_email_address AS PERSONAL_EMAIL_ADDRESS,
pr_sprtele.sprtele_intl_access AS HOME_PHONE,
mob_sprtele.sprtele_intl_access AS MOBILE_PHONE,
CONCAT('<URL>', spriden_id, '.png') AS PHOTO_URL,
'NULL' AS CARELEAVER,
PR_Address.spraddr_zip AS ENTRY_POSTCODE,
CASE
WHEN rap.student_id IS NULL THEN 'No'
ELSE 'Yes'
END AS ADJUSTMENT_PLAN,
spbpers_sex AS SEXID,
sfbetrm_ests_date AS COURSE_REGISTRATION_DATE,
'NULL' AS PROVIDED_AT,
row_number() OVER(PARTITION BY spriden_id ORDER BY skrsain_ucas_app_date desc) AS APPLICANT_ORDER_NUMBER
FROM global_temp.spriden
JOIN global_temp.spbpers ON spbpers_pidm = spriden_pidm
LEFT JOIN global_temp.spraddr tt ON spraddr_pidm = spriden_pidm
AND spraddr_atyp_code = 'TT'
AND spraddr_to_date IS NULL
AND spraddr_status_ind IS NULL
JOIN global_temp.skbspin ON skbspin_pidm = spriden_pidm
JOIN global_temp.skrsain ain1 ON ain1.skrsain_pidm = spriden_pidm
LEFT JOIN
(select distinct spraddr_pidm,spraddr_zip
from
global_temp.spraddr s1
join global_temp.skrsain on spraddr_pidm=skrsain_pidm
where
spraddr_atyp_code='PR'
and
spraddr_to_date is null
and
spraddr_seqno =(select max(spraddr_seqno)
from global_temp.spraddr s2
where s2.spraddr_pidm = s1.spraddr_pidm
and s2.SPRADDR_ATYP_CODE ='PR')) PR_Address
ON PR_Address.spraddr_pidm = spriden_pidm
JOIN(
SELECT MAX (ain2.skrsain_activity_date) AS skrsain_activity_date, ain2.skrsain_pidm
FROM global_temp.skrsain ain2
GROUP BY ain2.skrsain_pidm) ain3
ON ain3.skrsain_pidm = ain1.skrsain_pidm
AND ain3.skrsain_activity_date = ain1.skrsain_activity_date
JOIN global_temp.syraccs on syraccs_pidm = spriden_pidm
LEFT JOIN global_temp.stvethn on stvethn_code = spbpers_ethn_code
JOIN global_temp.sgbstdn stdn1 on sgbstdn_pidm = spriden_pidm
JOIN global_temp.sfbetrm ON sfbetrm_pidm = stdn1.sgbstdn_pidm AND sfbetrm_term_code = stdn1.sgbstdn_term_code_eff
LEFT JOIN global_temp.V_Maximiser_RAP_Status rap ON rap.student_id = spriden_id
JOIN global_temp.shrdgmr dgmr ON dgmr.shrdgmr_pidm = sgbstdn_pidm
AND dgmr.shrdgmr_program = sgbstdn_program_1
LEFT JOIN (SELECT goremal_pidm, goremal_email_address, goremal_emal_code, goremal_status_ind, goremal_activity_date
FROM global_temp.goremal mal
WHERE goremal_emal_code = 1
AND goremal_status_ind = 'A'
AND goremal_activity_date = (SELECT max(goremal_activity_date)
FROM global_temp.goremal g
WHERE goremal_pidm = mal.goremal_pidm
AND goremal_emal_code = 1
AND goremal_status_ind = 'A')) goremal
ON goremal_pidm = stdn1.sgbstdn_pidm
LEFT JOIN (SELECT goremal_pidm, goremal_email_address, goremal_emal_code, goremal_status_ind, goremal_activity_date
FROM global_temp.goremal mal
WHERE goremal_emal_code = 2
AND goremal_status_ind = 'A'
AND goremal_activity_date = (SELECT max(goremal_activity_date)
FROM global_temp.goremal g
WHERE goremal_pidm = mal.goremal_pidm
AND goremal_emal_code = 2
AND goremal_status_ind = 'A')) personal_goremal
ON personal_goremal.goremal_pidm = stdn1.sgbstdn_pidm
LEFT JOIN (SELECT DISTINCT sprtele_pidm,sprtele_intl_access, sprtele_tele_code, sprtele_status_ind,sprtele_seqno
FROM global_temp.sprtele s
WHERE sprtele_tele_code = 'MOB'
AND sprtele_status_ind IS NULL
AND sprtele_seqno = (SELECT MAX(sprtele_seqno)
FROM global_temp.sprtele s2
WHERE s2.sprtele_pidm = s.sprtele_pidm
AND sprtele_tele_code = 'MOB'
AND sprtele_status_ind IS NULL)) mob_sprtele
ON sprtele_pidm = stdn1.sgbstdn_pidm
LEFT JOIN (SELECT DISTINCT sprtele_pidm,sprtele_intl_access, sprtele_tele_code, sprtele_status_ind,sprtele_seqno
FROM global_temp.sprtele s
WHERE sprtele_tele_code = 'PR'
AND sprtele_status_ind IS NULL
AND sprtele_seqno = (SELECT MAX(sprtele_seqno)
FROM global_temp.sprtele s2
WHERE s2.sprtele_pidm = s.sprtele_pidm
AND sprtele_tele_code = 'PR'
AND sprtele_status_ind IS NULL)) pr_sprtele
ON pr_sprtele.sprtele_pidm = stdn1.sgbstdn_pidm
WHERE spriden_change_ind IS NULL
AND spriden_entity_ind = 'P'
--AND sfbetrm_ests_code IN ('RE', 'RS', 'RP', 'WU','EL')
AND (sfbetrm_ests_code IN ('RE', 'RS', 'RP', 'WU')
OR sfbetrm_ests_code = 'EL' AND substr(sgbstdn_blck_code, -1,1) > 1) -- include EL's on previous years except first year
AND stdn1.sgbstdn_program_1 IN (SELECT DISTINCT skrspri_program
FROM global_temp.skrspri
WHERE nvl(skrspri_frnchact,1) <> 3)
AND stdn1.sgbstdn_term_code_eff = (SELECT MAX(stdn2.sgbstdn_term_code_eff)
FROM global_temp.sgbstdn stdn2
WHERE stdn1.sgbstdn_pidm = stdn2.sgbstdn_pidm)
-- Previous 2 years data only
AND SUBSTR(stdn1.sgbstdn_term_code_eff,1,4) >= (
SELECT DISTINCT SUBSTR(stvterm_code,1,4) - 1 stvterm_code
FROM global_temp.stvterm
WHERE CURRENT_DATE BETWEEN stvterm_start_date AND stvterm_end_date)
AND dgmr.shrdgmr_degs_code <> 'AW'
AND dgmr.shrdgmr_seq_no =
(
SELECT MAX(dgmr2.shrdgmr_seq_no)
FROM global_temp.shrdgmr dgmr2
WHERE dgmr.shrdgmr_pidm = dgmr2.shrdgmr_pidm
AND dgmr.shrdgmr_program = dgmr2.shrdgmr_program)
AND EXISTS -- must have student record in last 2 years
(SELECT sgbstdn_pidm
FROM global_temp.sgbstdn stdn2
WHERE substr(stdn2.sgbstdn_term_code_eff,1,4) >= (
SELECT DISTINCT SUBSTR(stvterm_code,1,4) - 1 stvterm_code
FROM global_temp.stvterm
WHERE CURRENT_DATE BETWEEN stvterm_start_date AND stvterm_end_date)
AND stdn1.sgbstdn_pidm = stdn2.sgbstdn_pidm)
) a
where
APPLICANT_ORDER_NUMBER = 1
Here is an example of the tests I have written so far, though these feel more like data validation checks than unit tests:
test("Test - Date of Birth - Banner & Extracts Match") {
var dateOfBirth = spark.sql("""
SELECT distinct spriden.SPRIDEN_ID, spriden.SPRIDEN_PIDM, SPBPERS.SPBPERS_BIRTH_DATE, s.DOB
FROM global_temp.spriden spriden
JOIN global_temp.SPBPERS spbpers on spbpers.SPBPERS_PIDM = spriden.spriden_pidm
JOIN global_temp.V_SAL_EA_STUDENT s on s.STUDENT_ID = spriden.spriden_id
""")
dateOfBirth.as[DateOfBirth].collect().foreach(record => {
if (!dataMatches(s"${record.SPBPERS_BIRTH_DATE}", s"${record.DOB}")) {
println("Error between DOB records for student " + s"${record.SPRIDEN_ID}")
} else {
assert(s"${record.SPBPERS_BIRTH_DATE}" === s"${record.DOB}")
}
})
}
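A variation I have been considering (just a rough sketch, assuming the extract is still exposed as global_temp.V_SAL_EA_STUDENT and that DOB comes out formatted as yyyy-MM-dd) is to push the whole comparison into one query and assert that the mismatch set is empty, rather than collecting and looping over every row on the driver:
test("Test - Date of Birth - Banner & Extracts Match (set-based sketch)") {
  // Sketch: select only the rows where the source birth date and the extract DOB
  // disagree (null-safe comparison via <=>), then assert that set is empty.
  val mismatches = spark.sql("""
    SELECT spriden_id, DATE_FORMAT(spbpers_birth_date, 'yyyy-MM-dd') AS EXPECTED_DOB, s.DOB
    FROM global_temp.spriden
    JOIN global_temp.spbpers ON spbpers_pidm = spriden_pidm
    JOIN global_temp.V_SAL_EA_STUDENT s ON s.STUDENT_ID = spriden_id
    WHERE NOT (DATE_FORMAT(spbpers_birth_date, 'yyyy-MM-dd') <=> s.DOB)
  """)

  assert(mismatches.isEmpty, s"${mismatches.count()} students have mismatching DOBs")
}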
Has anyone ever unit tested large SQL queries like this? I have always written unit tests against units of code, where each unit does one specific job and can easily be identified and tested. I am not really looking for a specific answer; ideally I just need someone who has done something similar to give me some hints/tips. None of the articles I have found seem to do what I need.
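The closest I have got to "units" so far is the idea of lifting individual pieces of logic (the CASE expressions, the derived IDs, and so on) out of the big SQL statement into small Spark functions that take a DataFrame and return a DataFrame. This is only a sketch with made-up names, but each function would then be trivially testable on a handful of hand-built rows:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._

// Sketch only: the object and function names are made up for illustration.
object StudentTransforms {

  // Same mapping as the OVERSEAS CASE expression in the big query.
  def withOverseasFlag(students: DataFrame): DataFrame =
    students.withColumn("OVERSEAS",
      when(col("sgbstdn_resd_code").isin("1", "4", "H"), "1")
        .when(col("sgbstdn_resd_code").isin("5", "9", "F", "7"), "3")
        .otherwise("99"))

  // Same derivations as APPSHIB_ID / VLE_ID in the big query.
  def withAccountIds(students: DataFrame): DataFrame =
    students
      .withColumn("APPSHIB_ID", concat(col("syraccs_username"), lit("#salford.ac.uk")))
      .withColumn("VLE_ID", lower(col("syraccs_username")))
}

test("OVERSEAS flag maps residency codes as expected") {
  import spark.implicits._
  val input = Seq(("1001", "H"), ("1002", "9"), ("1003", "X"))
    .toDF("spriden_id", "sgbstdn_resd_code")

  val flags = StudentTransforms.withOverseasFlag(input)
    .select("spriden_id", "OVERSEAS").as[(String, String)].collect().toMap

  assert(flags === Map("1001" -> "1", "1002" -> "3", "1003" -> "99"))
}
But doing that across the whole query means effectively rewriting it, which is why I am unsure whether it is the right direction.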
To unit test this, do I need to mock out all of the ingested source data? Even if I did (which would be a huge job), I am not sure what the best way to run it would be. If the data were going through methods/functions I could easily create tests, but here everything is run through the SQL query, right?
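The only workable pattern I have found for the mocking question (again just a sketch, with made-up column subsets and a made-up studentExtractSql value holding the big SELECT pulled out of the script above) is to seed tiny DataFrames for each source table, register them under the global temp view names the query expects, and then run the production SQL against those views:
test("Student extract - a seeded student flows through the query") {
  import spark.implicits._

  // Sketch: every source table the query joins to would need a seeded view like these
  // (only a few of the columns are shown for brevity).
  Seq((1001, "S1234567", "Smith", "Jane"))
    .toDF("spriden_pidm", "spriden_id", "spriden_last_name", "spriden_first_name")
    .createOrReplaceGlobalTempView("spriden")

  Seq((1001, java.sql.Date.valueOf("2001-05-17")))
    .toDF("spbpers_pidm", "spbpers_birth_date")
    .createOrReplaceGlobalTempView("spbpers")

  // ...and so on for skbspin, skrsain, syraccs, sgbstdn, sfbetrm, shrdgmr, goremal, sprtele...

  // studentExtractSql is a made-up name for the SELECT in the script above,
  // extracted so the same SQL can be exercised from tests and from the job.
  val result = spark.sql(studentExtractSql)

  assert(result.filter($"STUDENT_ID" === "S1234567" && $"DOB" === "2001-05-17").count() === 1)
}
That would at least let me pin down specific behaviours (the OVERSEAS mapping, the "last two years" filter, the RE/RS/RP/WU/EL rules) with known inputs, rather than asserting against whatever happens to be in the source tables, but seeding every table the query touches feels like a lot of work.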
Some articles I have found:
Scenario1
Databricks Unit Testing
Please get back to me if you have any suggestions, and if you need any more info, let me know :)
Thanks!