[infra] Add support for Avro and Parquet (cont.) #1145

Merged
merged 4 commits into from
Mar 11, 2022

@isadorabugarin (Collaborator) commented Mar 7, 2022

This PR is a continuation of PR #1100.

Text replicated from the original PR

Motivation

  • File formats with smaller size and shorter read times mean:
    • Less storage space required
    • Less processing time
    • Consequently, lower costs

Changes

  • Fixes an inconsistent dependency:
    • For the "dev" extra it was google-cloud-bigquery = "1.28.0"
    • For the package itself, google-cloud-bigquery = "2.30.1"
    • I set both to google-cloud-bigquery = "2.30.1"
  • Adds the pandavro dependency to handle the Pandas <-> Avro interface
  • Adds support for Apache Avro and Apache Parquet for external tables
    • The default remains CSV for compatibility
  • Fixes the missing source_format argument in the call to Table.__init__
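
To illustrate the source_format behavior described above (CSV as the default, Avro and Parquet accepted, anything else rejected), here is a minimal sketch. The names SUPPORTED_SOURCE_FORMATS and resolve_source_format are hypothetical and are not taken from this PR; only the format identifiers on the right-hand side ("CSV", "AVRO", "PARQUET") are BigQuery's own source format names. The actual implementation in table.py may be organized differently.

```python
# Hypothetical mapping from the user-facing source_format argument to the
# BigQuery source format identifier used for an external table.
SUPPORTED_SOURCE_FORMATS = {
    "csv": "CSV",
    "avro": "AVRO",
    "parquet": "PARQUET",
}


def resolve_source_format(source_format="csv"):
    """Return the BigQuery identifier for a supported source format.

    CSV is the default so that existing callers keep working unchanged.
    Unsupported formats raise NotImplementedError, which is the behavior
    the PR's tests expect for "json".
    """
    key = source_format.lower()
    if key not in SUPPORTED_SOURCE_FORMATS:
        raise NotImplementedError(
            f"source_format={source_format!r} is not supported; "
            f"choose one of {sorted(SUPPORTED_SOURCE_FORMATS)}"
        )
    return SUPPORTED_SOURCE_FORMATS[key]


default_format = resolve_source_format()
parquet_format = resolve_source_format("parquet")
```

Keeping the supported formats in a single mapping means that adding another format later only requires a new entry, while unknown values fail early with a clear message.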

TODO @isadorabugarin

  • Write tests
  • Review the implementation

@lucascr91 changed the title from "PR 1100" to "[infra] Add support for Avro and Parquet (cont.)" Mar 10, 2022
@lucascr91 changed the base branch from master to python-1.6.2 March 10, 2022 18:06
@lucascr91 closed this Mar 10, 2022
@lucascr91 reopened this Mar 10, 2022
@lucascr91 (Collaborator) commented

Test functions must be prefixed with the word test; with pytest's default settings, functions without that prefix are not collected. See page 287 of the pytest documentation (https://buildmedia.readthedocs.org/media/pdf/pytest/latest/pytest.pdf).

def table_create_all_implemented_source_format(table, path, source_format):
    table.delete(mode="all")

    table.create(
        path=path,
        if_storage_data_exists="pass",
        if_table_config_exists="pass",
        source_format=source_format,
    )
    assert table_exists(table, "staging")


def table_create_not_implemented_source_format(table):

    with pytest.raises(NotImplementedError):
        table.create(
            if_table_exists="replace",
            if_storage_data_exists="pass",
            if_table_config_exists="pass",
            source_format="json",
        )

python-package/tests/test_table.py (Lines 227-247)
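
As a rough illustration of why the prefix matters, the sketch below mimics only the name-matching step of pytest's default function discovery. The helper collect_by_prefix is hypothetical and is not part of pytest; the function bodies are placeholders standing in for the real tests quoted above.

```python
import types


def table_create_not_implemented_source_format():
    """Named as in the reviewed snippet: no "test" prefix."""


def test_table_create_not_implemented_source_format():
    """The same function renamed with the "test" prefix."""


def collect_by_prefix(functions, prefix="test"):
    # Mimics only the name-matching step of pytest's default function
    # discovery; real collection also considers file and class names.
    return [
        f.__name__
        for f in functions
        if isinstance(f, types.FunctionType) and f.__name__.startswith(prefix)
    ]


collected = collect_by_prefix(
    [
        table_create_not_implemented_source_format,
        test_table_create_not_implemented_source_format,
    ]
)
print(collected)
```

Only the prefixed function ends up in the collected list, so the unprefixed version would be skipped without any error being reported.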

@lucascr91 (Collaborator) left a comment

Congratulations on the work. The only fix needed was the formatting of the tests. If possible, add a snippet to the PR showing how you ran the tests.

@lucascr91 merged commit 564e671 into python-1.6.2 Mar 11, 2022
@lucascr91 deleted the pr-1100 branch March 11, 2022 21:37
lucascr91 added a commit that referenced this pull request Mar 14, 2022
* feat(infra): create version 1.6.2

* feat(infra): create version 1.6.2

* feat(infra): create version 1.6.2

* [infra] python-v1.6.2 (#1089)

* [infra] fix dataset_config.yaml folder path (#1067)

* feat(infra) merge master

* [infra] conform Metadata to new metadata changes (#1093)

* [dados-bot] br_ms_vacinacao_covid19 (2022-01-23) (#1086)

Co-authored-by: terminal_name <github_email>

* [dados] br_bd_diretorios_brasil.etnia_indigena (#1087)

* Upload etnia_indigena directory

* Update table_config.yaml

* Update table_config.yaml

* feat: conform Metadata's schema to new one

* fix: conform yaml generation to new schema

* fix: delete test_dataset folder

Co-authored-by: Lucas Moreira <65978482+lucasnascm@users.noreply.github.com>
Co-authored-by: Gustavo Aires Tiago <36206956+gustavoairestiago@users.noreply.github.com>

Co-authored-by: Ricardo Dahis <6617207+rdahis@users.noreply.github.com>
Co-authored-by: Lucas Moreira <65978482+lucasnascm@users.noreply.github.com>
Co-authored-by: Gustavo Aires Tiago <36206956+gustavoairestiago@users.noreply.github.com>

* feat(infra): 1.6.2a3 version

* feat(infra): 1.6.2a3 version

* fix(infra): edit partitions and update_locally

* feat(infra): update_columns new fields and accepts local files

* [infra] option to make dataset public (#1020)

* feat(infra): option to make dataset public

* feat(infra): fix None data

* fix(infra): roll back

* fix(infra): fix retry in storage upload

* fix(infra): add option to dataset data location

* feat(infra): make staging dataset not public

* feat(infra): make staging dataset not public

* fix(infra): change bd version in actions

* fix(infra): add toml to install in ci

* fix(infra): remove a forgotten print

* fix(infra): fix location location

* fix(infra): fix dataset description

* feat(infra): bump-version

* feat(infra): temporal coverage as list in update_columns

* feat(infra): add new parameters to cli

* feat(infra): fix cli options

* [infra] change download functions to consume CKAN endpoints #1129  (#1130)

* [infra] add function to wrap bd_dataset_search endpoint

* Update download.py

* [infra] modify list_datasets function to consume CKAN endpoint

* [infra] fix list_dataset function to include limit and remove order_by

* [infra] change function list_dataset_tables to use CKAN endpoint

* [infra] apply PEP8 to list_dataset_tables and respective tests

* add get_dataset_description, get_table_description, get_table_columns

* [infra] fix dataset_config.yaml folder path (#1067)

* feat(infra) merge master

* fix files organization to match master

* remove download.py

* remove test_download

* Delete test_download.py

* remove test files

* remove test_download.py

* remove test_download.py

* remove test_download.py

* remove test_download.py

* add tests metadata

* remove test_download.py

* remove unused imports

* [infra] add _safe_fetch and get_table_size functions

Co-authored-by: lucascr91 <lucas.ecomg@gmail.com>

* fix(infra): add a empty list to not a partition

* [infra] Add support for Avro and Parquet (#1145)

* add support for Avro and Parquet for upload

* Adds test for source formats

* [infra] update tests for avro, parquet, and csv upload

Co-authored-by: Gabriel Gazola Milan <gabriel.gazola@poli.ufrj.br>
Co-authored-by: Isadora Bugarin  <isadorabugarin@gmail.com >
Co-authored-by: lucascr91 <lucas.ecomg@gmail.com>

* [infra] Feedback messages in upload methods [issue #1059] (#1085)

* Creating dataclass config

* Success messages - create and update (table.py) using loguru

* feat: improve log level control

* refa: move logger config to Base.__init__

* Improving log level control

* Adjusting log level control function in base.py

* Fixing repeated 'DELETE' messages every time Table is replaced.

* Importing 'dataclass' from 'dataclasses' to make config work.

* Fixing repeated 'UPDATE' messages inside other functions.

* Defining a new script message format.

* Defining standard log messages for 'dataset.py' functions

* Defining standard log messages for 'storage.py' functions

* Defining standard log messages for 'table.py' functions

* Defining standard log messages for 'metadata.py' functions

* Adds standard configuration to billing_project_id in download.py

* Configuring billing_project_id in download.py

* Configuring config_path in base.py

Co-authored-by: Guilherme Salustiano <guissalustiano@gmail.com>
Co-authored-by: Isadora Bugarin <isadorabugarin@gmail.com>

* update toml

Co-authored-by: Ricardo Dahis <6617207+rdahis@users.noreply.github.com>
Co-authored-by: Lucas Moreira <65978482+lucasnascm@users.noreply.github.com>
Co-authored-by: Gustavo Aires Tiago <36206956+gustavoairestiago@users.noreply.github.com>
Co-authored-by: lucascr91 <lucas.ecomg@gmail.com>
Co-authored-by: Isadora Bugarin <57679195+isadorabugarin@users.noreply.github.com>
Co-authored-by: Gabriel Gazola Milan <gabriel.gazola@poli.ufrj.br>
Co-authored-by: Isadora Bugarin  <isadorabugarin@gmail.com >
Co-authored-by: Guilherme Salustiano <guissalustiano@gmail.com>
Co-authored-by: Isadora Bugarin <isadorabugarin@gmail.com>
@lucascr91 linked an issue Apr 4, 2022 that may be closed by this pull request
Development

Successfully merging this pull request may close these issues.

[infra] Parquet support
3 participants