Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compare_column_values not working with Athena/Presto #87

Open
1 of 5 tasks
evabilski opened this issue Nov 23, 2023 · 2 comments · May be fixed by #108
Open
1 of 5 tasks

compare_column_values not working with Athena/Presto #87

evabilski opened this issue Nov 23, 2023 · 2 comments · May be fixed by #108
Labels
bug Something isn't working triage

Comments

@evabilski
Copy link

evabilski commented Nov 23, 2023

Describe the bug

Function compare_column_values not working on Athena/Presto.

Steps to reproduce

Create test in Presto environment:

{% set old_etl_relation_query %}
    select
        'a' AS col_a,
        'b' AS col_b
{% endset %}

{% set new_etl_relation_query %}
    select
        'a' AS col_a,
        'b' AS col_b
{% endset %}

{{ audit_helper.compare_column_values(
    a_query=old_etl_relation_query,
    b_query=new_etl_relation_query,
    primary_key="col_a",
    column_to_compare="col_b"
)
}}

Log output

COLUMN_NOT_FOUND: line 59:10: Column 'column_name' cannot be resolved or requester is not authorized to access requested resources

System information

dbt-labs/audit_helper version: 0.9.0

Which database are you using dbt with?

  • postgres
  • redshift
  • bigquery
  • snowflake
  • other (specify: Presto)

Additional context

Presto does not support grouping by aliases.
Line 53 of compare_column_values.sql: group by column_name should be removed in order to work with Presto

Are you interested in contributing the fix?

I can try making a PR

@evabilski evabilski added bug Something isn't working triage labels Nov 23, 2023
@sanga8
Copy link

sanga8 commented Apr 29, 2024

I still have this issue in version 0.11.0

@popcornylu
Copy link

popcornylu commented Aug 5, 2024

The root cause is that the group by clause does not allow to use the alias. See presto discussion

A workaround solution is to add a macro at macros/compare_column_values.sql. The content is

replace athena__compare_column_values to xxx__compare_column_values according to your warehouse adapter.

{% macro athena__compare_column_values(a_query, b_query, primary_key, column_to_compare, emojis, a_relation_name, b_relation_name) -%}
with a_query as (
    {{ a_query }}
),

b_query as (
    {{ b_query }}
),

joined as (
    select
        coalesce(a_query.{{ primary_key }}, b_query.{{ primary_key }}) as {{ primary_key }},
        a_query.{{ column_to_compare }} as a_query_value,
        b_query.{{ column_to_compare }} as b_query_value,
        case
            when a_query.{{ column_to_compare }} = b_query.{{ column_to_compare }} then '{% if emojis %}✅: {% endif %}perfect match'
            when a_query.{{ column_to_compare }} is null and b_query.{{ column_to_compare }} is null then '{% if emojis %}✅: {% endif %}both are null'
            when a_query.{{ primary_key }} is null then '{% if emojis %}🤷: {% endif %}missing from {{ a_relation_name }}'
            when b_query.{{ primary_key }} is null then '{% if emojis %}🤷: {% endif %}missing from {{ b_relation_name }}'
            when a_query.{{ column_to_compare }} is null then '{% if emojis %}🤷: {% endif %}value is null in {{ a_relation_name }} only'
            when b_query.{{ column_to_compare }} is null then '{% if emojis %}🤷: {% endif %}value is null in {{ b_relation_name }} only'
            when a_query.{{ column_to_compare }} != b_query.{{ column_to_compare }} then '{% if emojis %}❌: {% endif %}‍values do not match'
            else 'unknown' -- this should never happen
        end as match_status,
        case
            when a_query.{{ column_to_compare }} = b_query.{{ column_to_compare }} then 0
            when a_query.{{ column_to_compare }} is null and b_query.{{ column_to_compare }} is null then 1
            when a_query.{{ primary_key }} is null then 2
            when b_query.{{ primary_key }} is null then 3
            when a_query.{{ column_to_compare }} is null then 4
            when b_query.{{ column_to_compare }} is null then 5
            when a_query.{{ column_to_compare }} != b_query.{{ column_to_compare }} then 6
            else 7 -- this should never happen
        end as match_order

    from a_query

    full outer join b_query on a_query.{{ primary_key }} = b_query.{{ primary_key }}
),

aggregated as (
    select
        '{{ column_to_compare }}' as column_name,
        match_status,
        match_order,
        count(*) as count_records
    from joined

    group by '{{ column_to_compare }}', match_status, match_order
)

select
    column_name,
    match_status,
    count_records,
    round(100.0 * count_records / sum(count_records) over (), 2) as percent_of_total

from aggregated

order by match_order

{% endmacro %}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage
Projects
None yet
3 participants