
[WIP] Add temp table bulk loading #2

Closed
davidtme wants to merge 5 commits

Conversation

davidtme
Contributor

@davidtme davidtme commented Dec 17, 2020

This is a feature I added to SqlClient which I would also like to include here.

(Edit by @cmeeren: Closes #3)

@cmeeren
Owner

cmeeren commented Dec 17, 2020

Wow! This project has been published for a grand total of one single day, and already you have delved into the codebase and implemented a non-trivial feature like this. I'm impressed!

Since this is a non-trivial feature and there are several design decisions involved, I don't want to rush this PR. I'll try to have a close look at it over the coming days. In the meantime, could you provide some details about your design? Such as:

  • Why have you decided on the API you have? Have you considered other designs? Are there other designs that could mesh better with Facil's existing functionality? For example, loading temp tables seems like it would ideally be similar to the existing functionality and API for loading TVPs (a rough sketch of that TVP call shape follows after this list, for reference).
  • How, at a high level, is it implemented? (It will help me understand the PR and the decisions behind it better)
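
(For reference, the existing TVP-style call shape that a temp table API could mirror looks roughly like the sketch below. This is a hedged illustration only; the script, parameter, and table type names are made up and not taken from any actual generated code.)

  DbGen.Scripts.MyScript
      .WithConnection(connStr)
      .WithParameters(
          someTvpParam = [ TableTypes.dbo.MyTableType.create(Foo = 1) ]
      )
      .Execute()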

One thing I immediately notice (though I haven't considered it in light of the entire PR) is that you seem to have SQL files with the temp table definition. I may prefer just having the temp table definition directly in the config. As I see it, this could be a property under param for the relevant parameter.

(Also, sorry for the mess in Db.fs. It was the first file I wrote when prototyping, and I haven't gotten around to properly cleaning it up.)

@cmeeren
Owner

cmeeren commented Dec 17, 2020

Also, it seems that currently, there can only be one temp table per script (TempTable option field in Script). Again, I think that perhaps specifying temp tables on a per-parameter basis and using a similar interface to the current TVP stuff would be better.

@cmeeren
Owner

cmeeren commented Dec 17, 2020

By the way, since this may be a fairly major feature design-wise: If you want this implemented, but don't necessarily want to spend the time discussing and modifying the PR possibly at length, then just let me know and I may be able to look into implementing this myself. But if you'd rather be involved and have your name on the commits and not have the work you have done so far go to waste, then let's continue adapting this PR.

@davidtme
Contributor Author

davidtme commented Dec 17, 2020

Some very good points. Documentation isn't one of my strengths, but I'll try to answer your questions over the next few days.

One thing I can say about temp tables in the config file: I added the temp table load to SqlClient because I was using it for an ETL process for energy data. Currently that project has 516 different temp tables, so adding all of that to the config file could make it huge and unmanageable. You also have SQL primary keys, clustered indexes, and a vast array of other temp table options which could be hard to add config values for.

I also like to have the create table script and exec script to hand so I can easily edit both in sql management studio.

One reason I was on this project so quickly is that the type provider for the ETL project is struggling, so many thanks for your brilliant work.

@cmeeren
Owner

cmeeren commented Dec 17, 2020

Currently that project has 516 different temp tables, so adding all of that to the config file could make it huge and unmanageable.

Is there any reason why you can't use user-defined table types?

You also have SQL primary keys, clustered indexes, and a vast array of other temp table options which could be hard to add config values for.

I was thinking you could just have a single config property containing the entire definition, primary keys and all.

In any case, I see your point. Perhaps both approaches can be supported - tempTableDef for a given parameter can either be a path to an SQL file with the definition, or the inlined definition itself. That part should be trivial, at least. (If valid file path and file exists, load from file, otherwise treat string as the raw SQL temp table definition.)
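
For illustration, here is a minimal F# sketch of that file-or-inline heuristic. The helper name and signature are hypothetical, not Facil's actual code:

  open System.IO

  // Hypothetical helper: if the config value points to an existing file, read the
  // temp table definition from it; otherwise treat the string itself as the raw SQL.
  let resolveTempTableDef (configDir: string) (configValue: string) : string =
      let candidatePath = Path.Combine(configDir, configValue)
      if File.Exists candidatePath then File.ReadAllText candidatePath
      else configValue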

@cmeeren
Owner

cmeeren commented Dec 17, 2020

I'll do some research on temp tables in the coming days. I've never used them apart from as an ad-hoc replacement for user-defined table types in FSharp.Data.SqlClient. So it is possible I have a somewhat twisted view of what temp tables are and how they are used. Perhaps the "temp table as TVP" approach doesn't always make sense? Do you know?

Also, do you know if temp tables make sense as "input" to stored procedures, or only scripts?

@cmeeren
Owner

cmeeren commented Dec 18, 2020

After doing a bit more research and looking at how temp tables are used, as well as some pitfalls regarding which scope they're available in, I am perhaps thinking that the best and most flexible option is to require users to have temp table definitions in SQL files (meaning we can generate one temp table helper type per file), and then have the API allow users to:

  1. initiate a connection
  2. create/fill any necessary temp tables using that connection
  3. call the sproc/script that uses the temp table

Something like this:

let test (connStr: string) : Async<..> =
  async {
    use conn = new SqlConnection()
    do! conn.OpenAsync() |> Async.AwaitTask

    do!
      TempTables.MyTempTable
        .WithConnection(conn)
        // DTO variant
        .AsyncCreateAndLoad(rows = myData |> List.map TempTables.MyTempTable.createRow)
        // Parameter variant
        .AsyncCreateAndLoad(
          rows = (
            myData |> List.map (fun x -> 
              TempTables.MyTempTable.createRow(
                col1 = x.Col1,
                col2 = x.Col2
              ))
          )
        )

    return!
      Scripts.SomethingThatUsesTempTable
        .WithConnection(conn)
        .WithParameters(..)
        .AsyncExecute()
  }

This is just a draft. One caveat is that nothing requires you to create/define the temp table before calling scripts/procedures that use it.

However, some benefits are:

  • The user is free to load as many temp tables as they want when they want
  • The API is orthogonal to the current features
  • We avoid more complicated connection handling inside Facil

I will work a bit more on the API design and see if there is an elegant way to make it safer (i.e., forcing users to load temp tables as part of the parameter list, just like with TVPs).

@cmeeren
Owner

cmeeren commented Dec 18, 2020

As an alternative, the following API mirrors the current TVP functionality:

MyScript
  .WithConnection(connStr)
  .WithParameters(
    tempTable1 = [TempTables.MyTempTable1.create(Foo = 1)],
    tempTable2 = [TempTables.MyTempTable2.create (Foo = 1, Bar = "test")]
  )

This would require the following:

  • Detect references to temp tables in scripts/procedures
  • In facil.yaml, allow the user to configure parameters with the same name as the temp table - most importantly, allow pointing to the SQL file that contains the temp table definition (we can then use the path/filename to generate the proper module/class structure, just like with scripts; if we allow inlining the definition, we wouldn't know where to put the table type or what to call it, though perhaps that can be specified by the user)
  • At build-time, detect the temp table schema and generate a class for the temp table whose API mirrors that of the table type classes
  • At runtime, we can't use the same internals as for TVPs; instead, have the script/proc helpers define both async (unit -> Task) and sync (unit -> unit) functions that can be passed to the C# internals, and have them run those functions to create/load the temp tables after opening the connection but before executing the query/command (see the rough sketch below)
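
To make the last bullet concrete, here is a rough F# sketch of the callback idea. All names are hypothetical and this is not Facil's actual internals; it assumes Microsoft.Data.SqlClient and a task builder (built into F# 6, or TaskBuilder/Ply on earlier versions):

  open System.Threading.Tasks
  open Microsoft.Data.SqlClient

  // The generated script/proc helper would hand a pair like this to the C#-side
  // executable, which runs the appropriate callback after opening the connection
  // and before executing the command.
  type TempTableLoaders =
      { CreateAndLoadAsync: unit -> Task
        CreateAndLoad: unit -> unit }

  let loadersFor (conn: SqlConnection) (definitionSql: string) (bulkLoad: SqlConnection -> unit) : TempTableLoaders =
      let createAndLoad () =
          // Create the temp table on this connection, then bulk load the rows into it
          use cmd = conn.CreateCommand()
          cmd.CommandText <- definitionSql
          cmd.ExecuteNonQuery() |> ignore
          bulkLoad conn
      let createAndLoadAsync () : Task =
          // For brevity this wraps the sync path; a real implementation would use the async ADO.NET calls
          task { createAndLoad () } :> Task
      { CreateAndLoad = createAndLoad
        CreateAndLoadAsync = createAndLoadAsync }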

A drawback of this approach is that temp tables are required to be created and loaded in a single script; it is not possible to share temp table data between scripts. The approach mentioned in the previous post is more flexible, at the expense of a little bit of safety.

Given that temp tables may be more of an escape hatch and not the recommended general approach when using Facil, I don't necessarily think the "lack of safety" matters that much. I think it would be hard to cause a lot of real harm by forgetting to create/load the temp table before calling your script. It's something that fails fast. So I may actually prefer the approach in the previous comment.

@davidtme
Contributor Author

We currently use temp tables in SqlClient to stream data from one database to another. I cannot change the schema of the target database, which is why I use temp tables rather than table type parameters.

I think I can change my code to do the following and stick to the other connection requirements, such as keeping things in the C# internals:

    scripts:

      - include: "**/*.sql"
        except: "**/SkipThis.sql"

      - for: SQL/Script1.sql
        params:
          col1:
            tempTable: "CREATE TABLE #Temp(Id int, [Name] NVARCHAR(100))"

tempTable could also be a filename. I'm not 100% sure about YAML, so I don't know if the format is correct.


  let externalData =
    seq {
      1, "user"
    }

  let mapped =
    externalData
    |> Seq.map(fun (id, name) ->
      {| Id = id; Name = name |}
    )

  let result =
    DbGen.Scripts.SQL.NormalParams
      .WithConnection("")
      .WithParameters(
        {| SomeValue = 1
           Data = mapped |})
      .Execute()

This would really help me slot this into existing code.

@cmeeren
Owner

cmeeren commented Dec 18, 2020

Just want to let you know that I'm actively working on this now. I hope to have something working by the end of the weekend, but no promises. The API will be similar to what you suggest above, but the internals are rather different.

@davidtme davidtme changed the title from "Add temp table bulk loading" to "[WIP] Add temp table bulk loading" on Dec 18, 2020
@davidtme
Contributor Author

davidtme commented Dec 18, 2020

@cmeeren I've had a go at this myself and I've got the code to match what I said above:

      let data =
        seq {
          {| Id = 1; Name = "name" |}
        }

      let res =
        DbGen.Scripts.SQL.TempTable
          .WithConnection(Config.connStr)
          .WithParameters({| data = data |})
          .Execute()
        |> Seq.toList

I still haven't done the non-DTO WithParameters correctly; it's hacked in for now. It's still very rough, but if you like the way I'm going I can clean it up.

@cmeeren
Owner

cmeeren commented Dec 18, 2020

Thanks. Don't worry about it. I'm close with an implementation that works with both DTO and non-DTO, and uses somewhat different internals/config.

@davidtme
Contributor Author

Non-DTO is now in. It's become quite a light-touch PR - I've managed to squeeze it in without changing too much.

@cmeeren It's here if you need it; it could save you a bit of time.

@cmeeren cmeeren closed this in 4a300f3 Dec 19, 2020
@cmeeren
Owner

cmeeren commented Dec 19, 2020

Hi,

Support for temp tables is now on master. Once the CI build completes, there will be a nupkg here.

Are you able to give it a try? I'll hold off on releasing this until you have provided feedback. It's well tested and works, and I think you'll find it satisfies your requirements, but I'd like to be sure and hear your opinion before releasing.

@cmeeren
Owner

cmeeren commented Dec 19, 2020

Note that I just updated the readme with info on temp tables.

Also, I overlooked a compile error; the nupkg will be available here.

@cmeeren
Owner

cmeeren commented Dec 20, 2020

By the way, thank you so much for this PR. I have credited you in the changelog and the commit (4a300f3).

Even though I did not merge this PR, the work you did here definitely helped me pin down some critical implementation details and sped things up quite a bit. The reason I implemented it myself was a combination of the following: 1) This is a large and non-trivial feature, 2) I had some clear opinions on the implementation details, 3) I have limited resources for OSS work at the moment, and 4) guiding this PR to match my preferred implementation would take a lot longer than simply doing it myself.

Thanks again for your work, and let me know if the current implementation works for you.

@davidtme
Contributor Author

@cmeeren It's looking really good; I really like the extra config options.

One thing I did notice: if you changed the render so it output this:

        member inline this.WithParameters(dto: ^a) =
          let sqlParams =
            [|
            |]
          let tempTableData =
            this.CreateTempTableData(
              (^a: (member ``TempTableInlined``: _) dto) |> Seq.map Script1.tempTableInlined.create // Change here
            )
          ``Script1_Executable``(this.connStr, this.conn, this.configureConn, this.userConfigureCmd, sqlParams, tempTableData)

You could cut down the verbose typings at the user side for extra developer happiness 😄

    DbGen.Scripts.SQL.Script1.WithConnection("").WithParameters(
      {| TempTableInlined = [{| Col1 = 1; Col2 = Some "" |} ] |})
      .Execute()

I'll try and add this to a real project tonight and see how it goes.

@cmeeren
Owner

cmeeren commented Dec 20, 2020

You could cut down the verbose typings at the user side for extra developer happiness 😄

It is a deliberate design decision on my part to not support nested records like that for TVPs and temp tables. I could do it, but it would have to be named something other than WithParameters (because it would conflict in overload resolution with the existing DTO WithParameters), it would make it more complicated to adjust DTO names in the config, and it would be less flexible, because currently you can use any create overload to create the temp table or TVP rows.

Furthermore, if you already have the nested record, there's a trivial workaround using anonymous record copy-and-update:

let myNestedRecord = { .. }

let myNestedRecordWithCorrectTempTableTypes =
  {| myNestedRecord with MyTableType = myNestedRecord.MyTableType |> List.map Script1.tempTableInlined.create |}

..
.WithParameters(myNestedRecordWithCorrectTempTableTypes)

This design decision is not set in stone, but I'd like to see how Facil currently feels before potentially changing it.

@cmeeren
Owner

cmeeren commented Dec 20, 2020

This design decision is not set in stone, but I'd like to see how Facil currently feels before potentially changing it.

Thinking about it, it may actually be better to just have the DTO overloads accept nested DTOs. I'll give it some more thought.

@davidtme
Contributor Author

Just dropping some of this into a real project, I can see you are using C#-style ResizeArray and IEnumerable, which come from the C# library.

I think most F# users would rather see the native F# list and seq than the C# types. For large ResizeArrays this could dramatically increase the memory needed as it's re-cast into an F# list.

Correct me if I'm wrong, but are you using the C# project because of the awaits and tasks? You may already know, but there is a TaskBuilder library which makes working with .NET tasks really easy, so you wouldn't need the C# project.

Also, there is AsyncSeq, which could be better?

        asyncSeq {
            let! ct = Async.CancellationToken

            do! LoadTempTablesAsync(existingConn, tempTableData, ct)
            use cmd = existingConn.CreateCommand()
            configureCmd(cmd)
            use! reader = cmd.ExecuteReaderAsync(CommandBehavior.SingleResult, ct) |> Async.AwaitTask
            
            if reader.HasRows then
                initOrdinals reader

                let rec nextRow () =
                    asyncSeq {
                        let! hasRow = reader.ReadAsync(ct) |> Async.AwaitTask
                        if hasRow then
                            yield getItem(reader)
                            yield! nextRow ()
                    }

                yield! nextRow ()
        }

@cmeeren
Owner

cmeeren commented Dec 20, 2020

All the decisions you mention are deliberately made, primarily for performance reasons.

ResizeArray and IEnumerable, which come from the C# library.

I use ResizeArray and C#/Task because of performance. AFAIK ResizeArray is the most effective solution for reading an unknown number of rows into a data structure. I deliberately chose to surface this directly and not convert, because users should control this themselves (e.g. whether to convert to list or array or just use ResizeArray for optimal efficiency if they don't need anything else). I agree that F# users normally want list, but that's just a Seq.toList away, and thus not a problem. You'd have to do this if Facil returned seq anyway. It's no different than FSharp.Data.SqlClient; it returns seq where you have to do Seq.toList after calling Execute.
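
For example, converting the eagerly returned ResizeArray is a one-liner on the user's side. A minimal sketch with made-up script and parameter names (not actual generated code):

  // Eager execute returns a ResizeArray; use it directly, or convert if you prefer F# collections.
  let rows =
      DbGen.Scripts.MyScript
          .WithConnection(connStr)
          .WithParameters(someParam = 1)
          .Execute()

  let asList = rows |> List.ofSeq    // extra pass, one allocation per item
  let asArray = rows |> Array.ofSeq  // essentially a single copy into a fixed-size array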

Also, I wouldn't return seq unless it was lazy, as is the case with LazyExecute. Laziness has a cost, so all non-Lazy execute methods are eager, and return a concrete collection type (ResizeArray) to make that clear and to make it easier for users to optimize.

I use C#'s native async/await rather than TaskBuilder because of performance as well as not taking unneeded dependencies. (I know I can embed TaskBuilder, but I'd still have to keep it up-to-date myself. And again, nothing beats C#'s native async/await in terms of performance, not even Ply.)

Using AsyncSeq as the inner read loop is out of the question due to the abysmal performance of Async compared to Task for such hot loops (also it's an unneeded dependency). Note that you can use AsyncSeq on your end to lazily evaluate the results if you use LazyExecuteAsync, which returns the BCL built-in IAsyncEnumerable (only available if your target is compatible with .NET Standard 2.1).

For large ResizeArrays this could dramatically increase the memory needed as it's re-cast into an F# list.

AFAIK this is not true. To build up an F# list most efficiently, you'd have to prepend each item, building the list up in reverse. Then at the end, you'd have to reverse it, creating a new list. That would first of all effectively be a copying of the list, and second of all using an F# list would generally cause massively more allocations (one per item) compared to ResizeArray.
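
To illustrate the point, here is a sketch of the two read-loop strategies. `readRow` is just a stand-in for pulling the next row from a data reader; it is not a real Facil function:

  // F# list: prepend each row, then reverse at the end - one cons cell per row,
  // plus a full second pass (and copy) for List.rev.
  let readAllAsList (readRow: unit -> 'a option) =
      let rec loop acc =
          match readRow () with
          | Some row -> loop (row :: acc)
          | None -> List.rev acc
      loop []

  // ResizeArray: amortized O(1) appends into a growable buffer, no final reversal.
  let readAllAsResizeArray (readRow: unit -> 'a option) =
      let acc = ResizeArray()
      let rec loop () =
          match readRow () with
          | Some row ->
              acc.Add(row)
              loop ()
          | None -> ()
      loop ()
      acc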

Note that I haven't benchmarked anything (primarily because as a general library I know nothing of the workloads users will have); this is just going by intuition. I've given it quite a bit of thought, but let me know if you think I'm in the wrong in any of the above.

You may already know, but there is a TaskBuilder library which makes working with .NET tasks really easy, so you wouldn't need the C# project.

In case it wasn't clear above: Note that I do not surface ResizeArray just because it's C#. It's no problem using many F# types from C#; for example, some of the C# Execute functions return option and voption. If I believed (or someone convinces me) it would be more efficient to return list or array or something like that, then I would do so, C# or not.

Note also that other performance-focused F# libraries also surface ResizeArray, such as Hopac.

@davidtme
Contributor Author

Sorry, I should have put this as an issue/suggestion rather than a comment on this PR.

I agree with the points you have made about performance on the .NET side, but most of the bottlenecks I've found working with SQL aren't on the .NET side; they're some I/O lock in SQL or some such.

There is a trade-off here between speed and building a toolset that hooks up with other parts of the ecosystem.

Coming from someone who battles with the F# SqlClient every day, I'm hoping your new ace tool will be picked up and used by the community, with all the help it brings. I'm just trying to pre-empt some future "Why are we using X when we should be using Y?", which goes to the core of the project.

@cmeeren
Owner

cmeeren commented Dec 21, 2020

I see. If you believe any of the points I mentioned in the previous comment are mistaken, please open an issue and we can surely discuss both the public API and implementation details.

I'll add a FAQ entry about ResizeArray.

@cmeeren
Owner

cmeeren commented Dec 23, 2020

Thinking about it, it may actually be better to just have the DTO overloads accept nested DTOs. I'll give it some more thought.

I have given this "nested DTO" feature some thought. While I agree it is useful to be able to just pass nested DTOs and "have it magically work", I see several problems with this:

Overload resolution

Overload resolution doesn't work for the TVP/temp table create methods, so they need to have different names, e.g. create and createFromDto.

Records vs. primitives for single-col TVPs/temp tables

For single-column TVPs/temp tables, it's hard to know whether the user wants to pass a list of records with a single field, or just a list of primitive values. Both may be useful in different circumstances.
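
For instance (illustrative shapes only):

  // For a single-column TVP/temp table, both of these are plausible inputs, and it is
  // not obvious which shape a nested-DTO overload should expect.
  let asPrimitives = [ 1; 2; 3 ]
  let asSingleFieldRecords = [ {| Id = 1 |}; {| Id = 2 |}; {| Id = 3 |} ]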

Ability to drop DTO in config

In the config, one can omit DTO overloads for any table type. This would break if the table type is used in a procedure/script where the DTO overload is not omitted. I prefer keeping such interactions to a minimum to avoid confusion.

Makes it harder for the user to choose

Since the user must currently call the relevant create method, they have full freedom with respect to which overload they use. With nested records, we take away that freedom (unless we make it possible to both keep the current behavior and use the other behavior, adding complexity and possible confusion).

Less orthogonal API

Currently, since the proc/script DTO overloads accept the concrete TVP/temp table type, it works equally well regardless of how the user creates the TVP/temp table rows. If we support nested DTOs, we lock it down to a single way of creating the TVPs/temp tables. While it is theoretically possible to add multiple "nested DTO" overloads, it quickly becomes confusing and downright unmanageable; imagine if we add more create methods to TVPs and temp tables for e.g. different member casing. You'd quickly get a combinatorial explosion of variants, and it would be infeasible to account for all combinations of all TVPs/temp tables in a procedure/script.


Given the above points, together with the fact that calling the relevant create method is currently very easy (including using anonymous record copy-and-update for existing DTOs in whatever form), I believe the best option is to keep the current behavior.

@cmeeren
Owner

cmeeren commented Dec 23, 2020

By the way, please let me know if you (or anyone else reading this) think the temp table feature is ready to be released on NuGet (i.e., there are no immediate breaking changes that need to be made to this feature). I'll wait for your go-ahead on this one.

@cmeeren cmeeren reopened this Dec 23, 2020
@davidtme
Contributor Author

davidtme commented Dec 23, 2020

@cmeeren I'm just trying this against a smaller project that uses a bulk load/merge statement, and I have raised an issue: #4

It's not to do with the temp tables, but I can't generate the file at the moment.

@cmeeren
Owner

cmeeren commented Jan 4, 2021

Published in 0.1.3.

@cmeeren cmeeren closed this Jan 4, 2021