
Table proposal discussion #354

Closed
jclark opened this issue Oct 3, 2019 · 19 comments
Labels
enhancement (Enhancement to language design), lang (Relates to the Ballerina language specification)

Comments

@jclark
Collaborator

jclark commented Oct 3, 2019

This issue is to collect comments on the Tables proposal.

@jclark added the enhancement (Enhancement to language design) and lang (Relates to the Ballerina language specification) labels on Oct 3, 2019
@jclark
Collaborator Author

jclark commented Oct 3, 2019

Comment from @sanjiva: Another variant of persistent tables is to be able to create the persistent table within Ballerina and use the Ballerina type to create the schema for the underlying database engine.

Originally posted by @jclark in #286 (comment)

@jclark
Collaborator Author

jclark commented Oct 6, 2019

Overall I think this proposal is not satisfactory. I will explain the reasons in subsequent comments.

@jclark
Collaborator Author

jclark commented Oct 6, 2019

Compared to T[], one big thing that table<T> provides is the primary field concept. You could think of it as a map of records, where keys can be structured and where keys come from the records. This is useful functionality. The problem is that the immutable clone semantics makes it inconvenient to use and inconsistent with the other structural types in Ballerina. To update a field, you have to

  1. get a record representing the row
  2. clone it
  3. modify the relevant fields
  4. store it back

Whereas what you really want to do is simply:

  1. get a record representing the row
  2. modify the relevant fields
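
For concreteness, here is a minimal sketch of that direct flow, assuming a keyed table whose key fields are readonly (the design Ballerina later adopted) rather than this proposal's immutable-clone semantics; the Employee type and values are purely illustrative.

type Employee record {|
    readonly int id;
    string name;
    float salary;
|};

public function main() {
    table<Employee> key(id) employees = table [
        {id: 1, name: "Alice", salary: 1000.0},
        {id: 2, name: "Bob", salary: 1200.0}
    ];

    // 1. get the record representing the row
    Employee row = employees.get(1);
    // 2. modify the relevant field in place; no clone, no store-back
    row.salary = 1100.0;
}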

So why are we doing the immutable clone? One reason is to enable persistence/transactionality. The other reason is so that the table can enforce the primary key constraints. Otherwise you could get the following situation:

  • some variable r refers to a record
  • the record is stored in a table without copying
  • r is then used to change a field of the record that is part of the primary key, so that its key becomes the same as the key of another record already stored in the table

In #348, @hrennau makes the interesting suggestion of pushing the primary key concept into record. If we did this, then we could say that the fields corresponding to primary keys are immutable, thus preventing the situation described above.
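
As a hedged sketch of how that suggestion might look, using readonly record fields (as Ballerina later adopted them) to make the key fields immutable; the Customer type is illustrative.

type Customer record {|
    readonly int custId;   // primary-key field made immutable on the record itself
    string name;
|};

public function main() {
    Customer c = {custId: 1, name: "Anne"};
    c.name = "Anna";       // non-key fields remain freely mutable
    // c.custId = 2;       // compile-time error: cannot update a readonly record field
}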

@jclark
Collaborator Author

jclark commented Oct 6, 2019

Another, more minor, thing that table<T> brings us compared to T[] is a syntax that avoids repeating field/column names for every row. But we could imagine having a list-of-mappings syntax that provides this same feature but results in a value of type T[]. In other words, something like our table constructor syntax, but one that results in a list-of-records value.
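
A hedged sketch of the comparison; the Employee type is illustrative, and the commented-out constructor syntax is purely hypothetical.

type Employee record {|
    int id;
    string name;
|};

// Today, a list-of-records literal repeats the field names on every row.
Employee[] employees = [
    {id: 1, name: "Alice"},
    {id: 2, name: "Bob"},
    {id: 3, name: "Carol"}
];

// The imagined constructor would state the column names once while still
// producing an Employee[] value, e.g. (hypothetical syntax):
//     employees = rows { (id, name), (1, "Alice"), (2, "Bob"), (3, "Carol") };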

@jclark
Collaborator Author

jclark commented Oct 6, 2019

Another idea underlying the tables proposal was that it would be the gateway to providing a convenient SQL-like language-integrated feature for manipulating tabular data. But in fact it is possible to provide such a feature on top of lists without needing a table data type, as the Query proposal shows.
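
A hedged sketch of such a feature over a plain list, in the query syntax Ballerina later adopted; the Order type and data are illustrative.

type Order record {|
    int id;
    string customer;
    float amount;
|};

public function main() {
    Order[] orders = [
        {id: 1, customer: "Alice", amount: 120.0},
        {id: 2, customer: "Bob", amount: 40.0}
    ];

    // SQL-like filtering and projection over a T[] value; no table type involved.
    string[] bigSpenders = from var o in orders
        where o.amount > 100.0
        select o.customer;
}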

@jclark
Collaborator Author

jclark commented Oct 6, 2019

The most fundamental flaw in the proposal is this. It tries to both

  • treat tables as data, i.e. have the table basic type be part of anydata, and
  • allow transactional access to tables, using the persistent table concept

But it is simply not possible to do both of these things while at the same time keeping to Ballerina's explicit error handling principle. Transactional reads on data can fail in a way that needs to be reported as an error value, not as a panic (because such errors are expected and need to be handled by the programmer, typically using a retry). For example, a transactional read from a database can fail because of a deadlock. But anydata requires read operations, such as clone, that do not return error values.
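
A type-level sketch of the tension, with illustrative names (RowReader is hypothetical):

// anydata operations such as clone never include error in their return type.
function copyRow(anydata row) returns anydata {
    return row.clone();
}

// A transactional read, by contrast, must be able to report failure
// (e.g. a deadlock) as an error value that the caller handles explicitly.
type RowReader function (int key) returns map<anydata>|error;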

With transactions, each transaction gets its own distinct, isolated view of the table. This feels like something behavioural, not like pure data.

@jclark
Collaborator Author

jclark commented Oct 6, 2019

Another fundamental issue is that it doesn't really solve the problem of providing convenient access to a relational database. It is not sufficient to look at each table in isolation. You really want to be able to capture the underlying Entity/Relationship data model, which involves multiple tables, as done by e.g. JHipster.

@hrennau

hrennau commented Oct 12, 2019

I think that the Table proposal is very important for the evolution of the Ballerina language (expressiveness, degree of declarativeness). The proposal must neither be canceled nor finalized prematurely.

The main shortcoming of its current state may be the lack of "goals and requirements" beyond the very high-level paragraphs that open the proposal and the very specific SQL interop section that closes it. We might be more systematic and start by defining categories of goals, like (a) which kinds of entities to represent, (b) which categories of operations to support, (c) which logical operations to support, and (d) which kinds of operation support to offer.

How about an additional document equivalent to the "Use cases" or "Requirements" documents that prepare and guide the development of a W3C specification? Presently, important goals and requirements are scattered across discussions of various formal details; for example, the need to represent ordered SQL results pops up in the middle of the discussion of a formal detail. (While it is inevitable to let our minds shuttle, the resulting document should abstract from the process of its creation.) Currently, there is no coherent picture of goals and requirements, and the guidance given by stated goals and requirements is reduced because they are perceived in isolation rather than in context.

To sketch the direction which I imagine the "Requirements" document could take:
(1) What do we want to represent, from a Ballerina-internal view?
Describing the new concept from the perspective of what is already there. Tentatively: "A collection of records, constrained by a uniform record type including (or extended by) a uniform identity model (i.e. a list of primary keys)."
(2) What do we want to represent, from an external view?
Listing the entities from the world which might be covered. Tentatively: an SQL table, including its relationships with other tables (foreign key semantics); CSV data; a set of JSON documents constrained by a common grammar; a set of RDF resources constrained by a common class.
(3) Which operation categories to support?
For example: filtering (intra-collection); navigation (intra-collection); query (possibly cross-collection; e.g. à la SQL, à la SPARQL, ...).
(4) Which operations to support? …
(5) Which kinds of operation support?
(a) Reuse of existing query languages and of existing queries
(b) Enabling declarative expression of intent

Such a document would be written in an iterative way, enabling us to see new requirements readily in the context of the complete set of requirements already defined.

@snow6oy

snow6oy commented Oct 31, 2019

Hello, it seems the count function got lost in version 1.0.0. Can it be put back?

numOfRows = myTable.count();

@gimantha
Contributor

Hello, it seems the count function got lost in version 1.0.0. Can it be put back?

numOfRows = myTable.count();

Hi @snow6oy, it is already in the spec. Please refer to https://github.com/ballerina-platform/ballerina-spec/blob/master/lang/proposals/table/table.bal, line 11.

@jclark
Collaborator Author

jclark commented Nov 1, 2019

@snow6oy It's called length to be consistent with string and array (since table rows are ordered in this proposal).
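
So the earlier snippet becomes the following, in a hedged sketch using the constructor syntax Ballerina later adopted (Row and the data are illustrative):

type Row record {|
    int id;
|};

table<Row> myTable = table [{id: 1}, {id: 2}];
int numOfRows = myTable.length();   // 2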

@anupama-pathirage
Contributor

The row type descriptor of a table type descriptor that is the inherent type of a table value must be a closed record type, with the keys of the record type being the same as the column names.

The question below is about the above condition, which requires the keys of the record type to be the same as the column names.

When we are dealing with SQL queries, the result of a query may sometimes contain complex column names.

For example:

SELECT COUNT(CustomerID), Country
FROM Customers
GROUP BY Country;

The result of the above query will be a table with two columns named COUNT(CustomerID) and Country. But we cannot define a record with a key named COUNT(CustomerID). So when we think about SQL interop with Ballerina tables, how can we map such a column to a corresponding Ballerina table column name?

One alternative is to modify the SQL query with AS so that it produces a simple column name, as follows. But that may not be feasible in some cases.

SELECT COUNT(CustomerID) as IDCount, Country
FROM Customers
GROUP BY Country;

In the current Ballerina table implementation we don't have such a restriction, i.e. only the order of the data types is considered, and the corresponding Ballerina table can have any name for the fields in the constraint record type of the table.

@jclark
Collaborator Author

jclark commented Nov 27, 2019

@anupama-pathirage You can use a https://ballerina.io/spec/lang/2019R3/#QuotedIdentifier to define a record with any field name you want.
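
For example, a hedged sketch of a record whose field name matches the SQL column COUNT(CustomerID), using a quoted identifier (CountByCountry is an illustrative name):

type CountByCountry record {|
    int 'COUNT\(CustomerID\);
    string Country;
|};

// The field is then accessed the same way: row.'COUNT\(CustomerID\)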

@anupama-pathirage
Contributor

@jclark Thanks for the reference. I missed the spec definition for QuotedIdentifier, since only alphanumerics are allowed in the current implementation.

@grainier

@jclark, @hasithaa, @anupama-pathirage Presently, the main usage of tables is with the SQL client. Currently, when retrieving data, we return a CursorTable from the SQL client and assign it to a table<ConstrainType>. But with this proposal, we have to decouple the table concept from SQL and make it a more generic data type for tabular data.

To do that, what we can do is return an iterator from the SQL client and use that to construct the table. However, there are some implementation complexities with this approach. For example, the iterator should be aware of the value type it returns (i.e. record {| ConstrainType value; |}?), and that information is not available since we don't have generics/parametric types at the moment. Therefore we thought of making Iterator.next() return a RowType (map<anydata|error>) regardless of the data type returned from the DB.

type Iterable object {
    function __iterator() returns Iterator {
        ...
    }
};

// current implementation
type Iterator object {
    function next() returns record {| ConstrainType value; |}? {
        ...
    }
};

// proposed implementation
type Iterator object {
    function next() returns record {| map<anydata|error> value; |}? {
        ...
    }
};

With this, if a user of a SQL client needs to retrieve information from a DB into a table, they can use this Iterator to do that. WDYT?

@jclark
Collaborator Author

jclark commented Nov 29, 2019

@grainier It should be convenient for the user to be able to perform further processing on the results they get back from the SQL client, by e.g. using a query or using functional methods like foreach or map. So what we need is an Iterable with some more conveniences and a built-in type name. This is exactly what stream<T> in the upcoming Event Stream Processing proposal provides. So I was hoping to be able to use that for this application also. Do you think that would work?
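
A hedged sketch of what that could look like, in the query syntax Ballerina later adopted; Row and the stream parameter stand in for an SQL client result, and selectNames is an illustrative name.

type Row record {|
    int id;
    string name;
|};

function selectNames(stream<Row, error?> rows) returns string[]|error {
    // Language-integrated query refining the stream; no table type involved.
    return from var r in rows
        where r.id > 10
        select r.name;
}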

@hasithaa
Contributor

hasithaa commented Dec 12, 2019

@grainier It should be convenient for the user to be able to perform further processing on the results they get back from the SQL client, by e.g. using a query or using functional methods like foreach or map. So what we need is an Iterable with some more conveniences and a built-in type name. This is exactly what stream<T> in the upcoming Event Stream Processing proposal provides. So I was hoping to be able to use that for this application also. Do you think that would work?

Conceptually it should work. But we may have to special-case the stream value in the implementation when representing an SQL result set. This is the exact same problem we faced with the table (cursor tables vs in-memory tables).

For me, it is more natural to represent an SQL result set using an Iterable object. If we can find a way to extend type-param support to the lang.object module, we can easily provide functional iteration for iterator objects. That will give a more convenient API for users. (This is also related to #386.)

@jclark
Collaborator Author

jclark commented Dec 12, 2019

That would be true of the old design of stream<T>. But in the new design, there is very little about stream<T> that is specific to event streaming. It is just an Iterable<T> that is more convenient to use. (Actually that is not completely true: the timestamping support is specific to events.) A couple of other points:

  • an Iterable<T> isn't enough to do querying, and it is crucial to be able to use language-integrated query to refine the results of an SQL query
  • there is no convenient way to write the type Iterable<T>.

@jclark
Collaborator Author

jclark commented Feb 29, 2020

It is clear that this proposal is not the right approach. So closing this issue.

@jclark jclark closed this as completed Feb 29, 2020