
Bulk rerun work orders (from start) #659

Closed
5 tasks
amberrignell opened this issue Feb 22, 2023 · 31 comments · Fixed by #865
amberrignell (Contributor) commented Feb 22, 2023

User story

As a user who had a workflow that was running successfully, and who then realised the logic isn't what I wanted, I would like to be able to bulk reprocess workflows, so that each work order can be rerun with my updated logic.

Details

Allow users to select work orders from the history page
Allow users to reprocess the work orders they have selected

  1. Selecting work orders
    Using the following component, add a checkbox column to the history page.
  • There should be a checkbox in the first column (to the left of 'Workflow').
  • Every work order should have a checkbox, unselected by default
  • When the user selects the top-level checkbox in the column heading, all work orders on the page get selected
  • A user can select several individual work orders using the checkboxes within the work order rows

image

If a user has selected some work orders, then changes the filters, their selection is lost.

  2. Reprocessing work orders
  • When a user selects either the top-level checkbox OR work order level checkboxes, a 'rerun from start' button should appear.
  • When the user clicks 'Rerun from start', the user should see a pop up modal which confirms which work orders have been selected (how many work orders will be reprocessed, and the search filters used to select these work orders)

Use the following modal component

A. Top-level select
Message: You've selected the number_of_workorders_on_page work orders from page page_number of results for work orders matching the following query: search_params. There are a total of number_of_workorders_matching_query work orders matching this query.

Example: You've selected 12 work orders from page 1 of results for work orders matching the following query: received between x and y date, for workflow A, with a status of 'Failed'. There are a total of 32 work orders matching this query.

Buttons: Cancel, Rerun all work orders on this page from start, Rerun all work orders matching the query.

Screenshot 2023-04-12 at 10.28.24.png

B. Individual work orders selected
Message: You've selected number_of_selected_work_orders work orders to rerun from the start (first job). This will create a new attempt for these within the same work order.
Example: You've selected 12 work orders to reprocess from the start (first job). This will create a new attempt for these within the same work order.
Buttons: Cancel, Rerun from start

Screenshot 2023-04-12 at 10.31.21.png

  • When the user clicks 'Rerun from start', the user gets an alert "xx work orders will be bulk rerun" (so that they have feedback that their action was applied), and the work orders all get reprocessed in the same order in which they were received.

Implementation notes

Please read through the platform-app implementation first
Write a function that can take a list of work order IDs and create a new attempt for each one, in the order in which they have been passed to the function.
Write a function that can take a query, and create a new attempt for each work order returned by the query, in the order in which they are returned.

The rerun_run_from_start function's parameters would be:
source: (list of IDs, or parameters)
user: (used to add to the invocation reason)
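The two functions described above could be sketched as follows. This is a minimal illustration, not Lightning's actual implementation; the names `rerun_from_start/3`, the `{:ids, ...}` / `{:query, ...}` source shapes, and the injected `fetch` callback are all assumptions made for the example.

```elixir
defmodule BulkRerunSketch do
  # Work orders must be reprocessed in the order in which they were
  # received, so sort by insertion timestamp before creating attempts.
  def order_for_rerun(work_orders) do
    Enum.sort_by(work_orders, & &1.inserted_at, NaiveDateTime)
  end

  # `source` is either a list of IDs or query params; `user` would be
  # recorded in each new attempt's invocation reason. `fetch` stands in
  # for the real lookup (a Repo query) and may return rows in any order.
  def rerun_from_start(source, user, fetch) do
    case source do
      {:ids, ids} -> fetch.(ids)
      {:query, params} -> fetch.(params)
    end
    |> order_for_rerun()
    |> Enum.map(fn wo -> {:new_attempt, wo.id, user} end)
  end
end
```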

Release notes

  • Bulk rerun work orders from start

User acceptance criteria

  • I can 'select all' work orders on the history page
  • Work orders get reprocessed in the order in which they were received
  • When I click 'Rerun from start', I see a pop-up which confirms how many work orders will be reprocessed and the search filters used to select these work orders (e.g. Reprocess 12 work orders which were received between x and y date, for workflow A, with a status of 'Failed')
  • If a user changes the filters, they lose their selected items.
  • If I click the top-level select, all work orders are selected. If I then unselect one, the top-level select is also unselected. If I unselect from the top level, all of them are unselected.
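The select-all behaviour in the criteria above can be captured as pure state transitions. A minimal sketch, assuming selection is tracked as a MapSet of work order IDs; the module and function names are illustrative, not Lightning's actual code:

```elixir
defmodule SelectionSketch do
  # Top-level checkbox: selects every work order on the page, or clears all.
  def toggle_all(_selected, page_ids, checked?) do
    if checked?, do: MapSet.new(page_ids), else: MapSet.new()
  end

  # Row-level checkbox: flip membership for a single work order.
  def toggle_one(selected, id) do
    if MapSet.member?(selected, id) do
      MapSet.delete(selected, id)
    else
      MapSet.put(selected, id)
    end
  end

  # The header checkbox is checked only while every row on the page is
  # selected, so unselecting one row also unchecks it.
  def all_selected?(selected, page_ids) do
    page_ids != [] and Enum.all?(page_ids, &MapSet.member?(selected, &1))
  end

  # Changing the filters discards the current selection.
  def reset_on_filter_change(_selected), do: MapSet.new()
end
```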
@amberrignell amberrignell self-assigned this Feb 22, 2023
@taylordowns2000 taylordowns2000 added this to the RC1 milestone Apr 1, 2023
taylordowns2000 (Member) commented Apr 6, 2023

@amberrignell , tailwind component for this where employee rows are workorders : https://tailwindui.com/components/application-ui/lists/tables#component-390db22a862687fe88b313692989ffaf

note that this requires some JS

image

@amberrignell amberrignell changed the title Bulk reprocess work orders Bulk rerun work orders Apr 11, 2023
@amberrignell amberrignell changed the title Bulk rerun work orders Bulk rerun work orders (from start) Apr 27, 2023
@taylordowns2000 taylordowns2000 removed their assignment Apr 27, 2023
taylordowns2000 (Member) commented

Useful naming content:
image

taylordowns2000 (Member) commented

Possible approach:

  1. See how "rerun" works currently in the liveview and/or service.
  2. Determine how a "batch" or a "query" of work orders could all be "rerun" using the current backend (maybe: is there a function that takes a list of work order IDs and creates a new attempt for each work order from the first job in the workflow with the initial state?).
  3. Write that function.
  4. Then, consider how to pass work order IDs from the UI (when I've selected 3 work orders) to that function.
  5. For the modal, consider how to use the applied query (or "total" count as opposed to the paginated view) to give the user the option to send just the IDs they selected or ALL IDs regardless of current pagination.
  6. Send them for execution, display a toast notification.

sigu (Collaborator) commented Jun 12, 2023

Jun-11-2023 21-08-32

  1. How would you like the modal to be rendered, using the petal components or using liveview components?
  2. We already have the list of work orders to be rerun.
  3. Maybe get the first attempt and first run, then rerun; use inserted_at to determine the oldest of each.
  4. WorkOrderService looks like a good place to restart the job from. I am currently thinking of calling retry_attempt_run(attempt_run, user), passing it an attempt run determined by 3 above.

sigu (Collaborator) commented Jun 12, 2023

Screenshot 2023-06-11 at 21 11 57

Why do we need the following tables?

  1. Attempt runs - currently assuming it's a join table of attempts and runs
  2. Jobs - what's the big difference between this and a Run?
  3. Dataclips - no idea at the moment; if we need to adjust anything in it I'll need an explanation.

For this assignment, should we change the invocation reason? (I am still not sure what an invocation reason is; I just inferred from the name.)

amberrignell (Contributor, Author) commented

Hey @sigu , thanks for the questions!

To give a quick answer,

  1. Correct, attempt runs are a join table of attempts and runs (this is because two different attempts could have the same runs, if they get copied over for a "rerun from job 2", for example).
  2. A job is the adaptor, credential and job expression that a user has defined. A run is when that job has been run with some data (state).
  3. A dataclip is the data that a job is run with.

We'll see you in standup tomorrow so Stu can give you more details then, hope this helps in the meantime!

midigofrank (Collaborator) commented

Hey @amberrignell @taylordowns2000, should the 'Rerun from start' button get disabled / hidden when you don't have permission? Or should we allow users to click around and get the notification error message?

taylordowns2000 (Member) commented

Hey @midigofrank, who doesn't have permission? (Have you created a policy for this action? Sorry, I have not seen anything related to permissions in the issue.)

midigofrank (Collaborator) commented

Hey @taylordowns2000, I noticed there's a check for can_rerun_job when rerunning a single job. I thought maybe we could use the same check for the bulk rerun:

  def handle_event(
        "rerun",
        %{"attempt_id" => attempt_id, "run_id" => run_id},
        socket
      ) do
    if socket.assigns.can_rerun_job do
      attempt_id
      |> AttemptService.get_for_rerun(run_id)
      |> WorkOrderService.retry_attempt_run(socket.assigns.current_user)

      {:noreply, socket}
    else
      {:noreply,
       socket
       |> put_flash(:error, "You are not authorized to perform this action.")}
    end
  end

taylordowns2000 (Member) commented

Cool. For now please allow the error flash to appear as we don't have a design that explains why a user isn't able to run it. We can consider adding this later.

midigofrank (Collaborator) commented

Alright. Perfect

midigofrank (Collaborator) commented

Hey @taylordowns2000, how would you suggest we handle failures? Let's say a single attempt fails, should we abort all the other attempts?

taylordowns2000 (Member) commented

hi @midigofrank , great q 😄

No, the scope of this feature ends at Oban.insert. In other words, this feature is to "queue up" a bunch of attempts (same thing that clicking "rerun" really fast on a bunch of attempts does) but once they're on the Oban queue they're in a totally different domain.

taylordowns2000 (Member) commented

So... sorry, to be extra clear: if there is an error before you successfully enqueue those attempts with Oban.insert (what might this be???) you could display a flash message explaining that something broke and asking the user to try again, but I don't want to do that, since this would certainly be an unexpected error and I'd rather see it bubble up on Sentry and then be able to debug the code. Does that make sense?

midigofrank (Collaborator) commented Jun 14, 2023

Hey @taylordowns2000. That makes sense. Currently, I've wrapped all the actions (including the Oban.insert) for all the selected Work Orders to be retried in a single transaction. Assuming we have Work Orders A, B, C: should an error occur while still retrying A, then B and C won't get processed at all. Should that be the case?

taylordowns2000 (Member) commented

Sounds like we might need a quick call to nail this down—let's chat tomorrow! In the meantime:

When you say "while still retrying" do you mean "while you're attempting to add them to Oban" or "while they are running"? Assuming the first, I think what you actually want is the Oban.multi.

      reducer = fn {changesets, index}, multi ->
        Oban.insert_all(multi, "run_batch_#{index}", changesets)
      end

      runs
      |> Enum.chunk_every(5_000)
      |> Enum.with_index()
      |> Enum.reduce(Ecto.Multi.new(), reducer)
      |> Repo.transaction()

      conn
      |> put_status(:ok)
      |> text(Enum.count(runs))

What we can chat about tomorrow is that if the user asks to enqueue 43,000 attempts, it would be nice to either have them all get enqueued or have none of them get enqueued. (We don't know and don't care if the new attempts succeed or fail as they are processed over the next several days... our job is DONE and we report back to the end user "All good... 43,000 enqueued!" as soon as they hit the queue.)

taylordowns2000 (Member) commented Jun 15, 2023

@midigofrank , next steps here please:

  • rebase on main
  • get all tests passing with current implementation style (then ping me on github please!)
  • once tests are passing, optimize to use the Oban.insert_all so we're hitting the DB slightly less for a big bulk re-processing request

midigofrank (Collaborator) commented

Hey @taylordowns2000 , might you have an idea why the Mimic library isn't getting loaded while testing? I keep getting the error below when I run mix test:

** (UndefinedFunctionError) function Mimic.copy/1 is undefined (module Mimic is not available)

taylordowns2000 (Member) commented

I haven't seen that problem pop up for anyone yet. Please push your most recent changes to your branch (assuming you've rebased, also check that box above!) and I'll fire it up.

Tests run OK for me as of this commit:

commit 04cdafe004f5aec7b19e7973166235640493f3b5 (HEAD -> 659-bulk-rerun-work-orders, origin/659-bulk-rerun-work-orders)
Author: Frank Midigo <midigofrank@gmail.com>
Date:   Wed Jun 14 12:58:21 2023 +0300

midigofrank (Collaborator) commented

Sure thing. Just pushed.

taylordowns2000 (Member) commented

Eish, sorry @midigofrank, it looks OK on my machine and on CircleCI. I'm not sure I'll be able to help.

What OS are you using? Have you seen this issue on other Elixir projects? Maybe try asdf install again just to make sure? @sigu , any ideas on this one?

image

midigofrank (Collaborator) commented

Hey @taylordowns2000, I'm on an M1 Mac.

midigofrank (Collaborator) commented

Let me try it on a different OS

taylordowns2000 (Member) commented

Hrmmm I'm on an M1 also.

midigofrank (Collaborator) commented

@taylordowns2000 , I've been able to run the tests by setting MIX_ENV=test. I thought mix test does that for us 🤔

taylordowns2000 (Member) commented Jun 16, 2023 via email

midigofrank (Collaborator) commented

Aha, you're right. I made a copy of .env.example and MIX_ENV had been set to dev. Let me get rid of it.

midigofrank (Collaborator) commented Jun 16, 2023

Hey @taylordowns2000 , all tests are passing except for one I've added for the liveview page. This only happens when I have the Oban.insert(Pipeline Job) code as part of the Multi which does everything else.

could not checkout the connection owned by #PID<0.1124.0>

EDIT:
Full stack trace

** (EXIT from #PID<0.1124.0>) an exception was raised:
         ** (RuntimeError) ChildProcess task exited without a value:
     {%DBConnection.ConnectionError{message: "could not checkout the connection owned by #PID<0.1124.0>. When using the sandbox, connections are shared, so this may imply another process is using a connection. Reason: connection not available and request was dropped from queue after 510ms. You can configure how long requests wait in the queue using :queue_target and :queue_interval. See DBConnection.start_link/2 for more information", severity: :error, reason: :queue_timeout}, [{Ecto.Adapters.SQL, :raise_sql_call_error, 1, [file: 'lib/ecto/adapters/sql.ex', line: 913, error_info: %{module: Exception}]}, {Ecto.Repo.Schema, :apply, 4, [file: 'lib/ecto/repo/schema.ex', line: 756]}, {Ecto.Repo.Schema, :"-do_update/4-fun-3-", 15, [file: 'lib/ecto/repo/schema.ex', line: 459]}, {Lightning.Invocation, :update_run, 2, [file: 'lib/lightning/invocation.ex', line: 272]}, {Lightning.Pipeline.Runner.Handler, :"-start/2-fun-1-", 3, [file: 'lib/lightning/runtime/handler.ex', line: 74]}, {Task.Supervised, :invoke_mfa, 2, [file: 'lib/task/supervised.ex', line: 89]}, {Task.Supervised, :reply, 4, [file: 'lib/task/supervised.ex', line: 34]}, {:proc_lib, :init_p_do_apply, 3, [file: 'proc_lib.erl', line: 240]}]}
             (lightning 0.6.3) lib/lightning/runtime/handler.ex:122: Lightning.Pipeline.Runner.Handler.wait/1
             (lightning 0.6.3) lib/lightning/pipeline.ex:29: Lightning.Pipeline.process/1

midigofrank (Collaborator) commented

Hey @taylordowns2000, I found it easier to just move the implementation to "bulk mode" and leave the Oban.insert operation out of the main transaction.
All tests are now passing.

taylordowns2000 (Member) commented

Great! Please convert this to a full PR (it's a "draft" right now) and request a review from me and @stuartc! We'll take a look tomorrow AM. Do you know if @sigu is joining our final call tomorrow AM?

also, in prep for that call, have a look at the failed checks related to code coverage. thanks @midigofrank and speak soon

midigofrank (Collaborator) commented

Got it. Given his current timezone, I'm not so sure. I'll confirm with him and let you know

This issue was closed.