Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add: schema loading support for Presto query runner #1233

Closed
wants to merge 1 commit into from

Conversation

ninneko
Copy link
Contributor

@ninneko ninneko commented Aug 10, 2016

No description provided.

@arikfr arikfr changed the title fix schema resolves for presto Add: schema loading support for Presto query runner Aug 10, 2016
@arikfr
Copy link
Member

arikfr commented Aug 10, 2016

Thanks!

My only concern is that some users have a lot of tables in Presto, which might result in long schema load time due to the need to run a separate query for each table.

Do you know how fast those queries are? What size of database did you test this with?

@hussainbohra
Copy link

hussainbohra commented Aug 11, 2016

I had a look into this code - Infact I also implemented the query browser for a presto - I am running only one query and that fetches all schema and tables

def get_schema(self, get_stats=False):
        schema = {}
        query = SELECT table_schema, table_name, column_name FROM information_schema.columns      WHERE table_schema NOT IN ('pg_catalog', 'information_schema')


        results, error = self.run_query(query)
        if error is not None:
            raise Exception("Failed getting schema.")

        results = json.loads(results)

        for row in results['rows']:
            if row['table_schema'] != 'public':
                table_name = '{}.{}'.format(row['table_schema'], row['table_name'])
            else:
                table_name = row['table_name']

            if table_name not in schema:
                schema[table_name] = {'name': table_name, 'columns': []}

            schema[table_name]['columns'].append(row['column_name'])

        return schema.values()

The base class will also remain the same (BaseQueryRunner)

@ninneko
Copy link
Contributor Author

ninneko commented Aug 21, 2016

@arikfr I tested on EMR cluster that have 5 schemas and 30 tables, and I didnt feel the problem.

There is two reason that I dont think this code is bad.
First, this queris are also used in the hive plugin and the impala plugin, and these plugins are already commonly used.
Second, I dont think that mete tables should be directly used.

Thank you for comments and sorry for poor English.

@arikfr
Copy link
Member

arikfr commented Aug 28, 2016

I understand the concern of using meta tables, but the method of running a query for each table is just not something we can add. @rohanpd benchmarked your method on their Presto cluster and resulted in thousands of queries and unreasonable time to complete.

I've decided to go with @hussainbohra's solution as implemented by @rohanpd in #1252.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants