
Endstops on a different MCU to stepper motors #3832

Closed
mattshepcar wants to merge 10 commits

Conversation

@mattshepcar (Contributor) commented Jan 24, 2021

Draft pull request just to ask for some feedback/discussion on these changes.

The idea of this branch is to facilitate non-contact endstops/probes attached to an MCU on a toolhead, to reduce the amount of wiring that has to move (no need for endstops wired at the end of moving rails/gantries). My prototype board design is here for reference.

There are two main parts to these changes:

  • Only drip feed a safe number of steps to the motors during 'blind homing'. The idea here is to have the endstop report back to klippy on a regular basis so that klippy knows whether it is safe to continue moving the robot. To minimise any overshoot, the motors are signalled to stop as soon as klippy knows the endstop has triggered.
  • Compensate for any overshoot caused by the delay in signalling the motors to stop. To do this I simply record the moves that were sent to the MCU and then count how many steps occurred after the reported endstop trigger time (and before the motor stopped).
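The step-counting part of the second bullet can be sketched roughly like this (the data layout, a sorted list of per-stepper step times, is my assumption, not the branch's actual representation):

```python
import bisect

# Minimal sketch of the overshoot accounting described above: given the
# sorted step times already sent to the MCU for one stepper, count how
# many fall after the endstop's reported trigger time. The function
# name and data layout are invented for illustration.
def count_overshoot_steps(step_times, trigger_time):
    return len(step_times) - bisect.bisect_right(step_times, trigger_time)
```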

This is just a proof of concept and it needs a lot of clean up and testing.

@KevinOConnor (Collaborator)

Interesting. Thanks.

FYI, I recently purchased some Huvud boards and have also been working on this. (I've been researching the work, but have not yet made code changes.)

FWIW, my high-level plan was the following:

  1. Update the host to keep history of sent "queue step commands" so that it can calculate where the toolhead actually was at a given time.
    • It looks like you've done something similar with "stepcompress history tracking". I think it should be safe to always enable history tracking (the host has plenty of memory so it can store the last 30 seconds of position info). That's generally useful for debugging, and some of the filament sensors really want to be able to obtain an extruder position given a time. It looks like you have focused on implementing a "time delta -> number of steps", but a general "mcu clock -> mcu position" may be more useful.
  2. Refactor the mcu endstop code to separate the "signal detection" code from the "signal action" code.
    • It looks like you've done something similar with "stepper groups".
  3. Add a "maximum time since last confirmation" check to the "signal action" code. Basically, if the host doesn't confirm that the mcu can continue to home every 20ms, then the mcu will consider the operation to have "timed out" and run the "signal action" code (stop the steppers).
  4. Implement an assured "signal delivery" in the host serialqueue.c code. It would periodically query and confirm the homing state on relevant mcus. It would ensure that all mcus are actively in an okay state and confirm the homing state to those mcus if so. If any mcu reports a timeout or trigger, then it would distribute that info to the remaining mcus. If any mcu becomes unresponsive, then they'll all timeout in a short time period.
    • Interestingly, it looks like you've approached this in a different way. It seems you are only releasing queue_steps as the timing is verified. I think our goals are a little different here - I'm also interested in making multi-mcu homing work on contact based switches. My high-level thinking is that if we can confirm timing in windows of ~30ms, then the worst case overshoot may still be manageable.
  5. After a homing operation the host would schedule movement to undo overshoot caused by a delay in trigger notification. This is particularly important for multi-stepper axes where it's possible one stepper may have a different overshoot than another stepper, and thus needs to be moved back in sync in order for the axis to function properly.
    • I'm guessing you tried to also do this with the itersolve_set_commanded_pos() code. But, I don't think that approach will work well - the iterative solver is designed to calculate exact times and it is not expecting to generate extra steps to handle an overshoot. Doing so could lead to "instantaneous movement" which would trigger error checks in the stepcompress code. Generating explicit stepper moves should be robust as we can then allocate sufficient time for those moves to complete.
  6. Possibly restart the homing operation if a spurious timeout occurs. (Maybe this isn't needed - I'm unsure.)
  7. Multi-mcu homing on a delta may need particular handling. On a delta we need to home three endstops at the same time, and an endstop trigger only stops the steppers associated with that endstop. However, if a timeout occurs on any of the endstops, then we need to stop all the steppers.
    • I think you noted something similar in your comments.
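Plan item 1 above (a host-side history answering "mcu clock -> mcu position") could be sketched roughly as follows. The entry layout and names are my invention, loosely modelled on queue_step's interval/count/add fields, not Klipper's actual stepcompress structures:

```python
# Hypothetical sketch of a host-side step history. Each chunk is
# (start_clock, interval, count, add): 'count' steps, the first at
# start_clock + interval, with the interval growing by 'add' each step.
def history_position(history, clock):
    """history: list of chunks in clock order, all for one stepper
    moving in one direction; returns steps taken by 'clock'."""
    pos = 0
    for start_clock, interval, count, add in history:
        if clock <= start_clock:
            break
        step_clock, iv, done = start_clock, interval, 0
        # walk this chunk's steps until we pass the requested clock
        while done < count and step_clock + iv <= clock:
            step_clock += iv
            iv += add
            done += 1
        pos += done
        if done < count:
            break
    return pos
```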

One of my high-level goals would be to try and put most of this logic into klippy/extras/homing.py - ideally limiting the impact to the toolhead, stepper, probe, etc. code. The recent move of the homing.py code from klippy/homing.py to klippy/extras/homing.py was in preparation for work on this feature. Of course, it's possible this may not be practical.

Your thoughts?
-Kevin

@mattshepcar (Contributor, Author) commented Jan 25, 2021

It's good to hear it's something you are considering supporting at least :)

It looks like you have focused on implementing a "time delta -> number of steps", but a general "mcu clock -> mcu position" may be more useful.

Yes, probably, I think I was still struggling to wrap my head around the homing code when I implemented that part.

  1. Refactor the mcu endstop code to separate the "signal detection" code from the "signal action" code.

    • It looks like you've done something similar with "stepper groups".

A more general solution would probably be good, yes. The main reason I added the "stepper group" thing is that I am running a quad motor gantry and it was easier/better to ensure all 4 Z stepper motors stop simultaneously rather than trying to correct afterwards (originally I just added a command to call stepper_stop in stepper.c, and then discovered each motor was overshooting by a slightly different number of steps when z-probing).

  1. Add a "maximum time since last confirmation" check to the "signal action" code.
    ...
  • I'm also interested in making multi-mcu homing work on contact based switches. My high-level thinking is that if we can confirm timing in windows of ~30ms, then the worst case overshoot may still be manageable.

Right, I was a bit worried about how safe that approach would be considered, e.g. if the Raspberry Pi crashes/stalls while the MCU still has a lot of steps queued. So I went with the option of making sure there are never more steps queued than are considered safe.

In fact, in my tests I have my Y endstop set up only 1.4mm away from the true limit and am able to home at 50mm/s with this code. That gives a window of about 28ms, which is almost the 30ms you quote :) It seems this is quite close to the edge of what my Raspberry Pi 3B+ can handle, though. I would be concerned about what happens if you are interacting with OctoPrint or OctoDash at the same time.
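For reference, the relationship behind these figures is just a division (the helper name is mine):

```python
# Back-of-envelope helper for the figures above: the confirmation
# window is the safe overtravel distance divided by the homing speed.
def confirm_window_s(safe_overtravel_mm, homing_speed_mm_s):
    return safe_overtravel_mm / homing_speed_mm_s

# 1.4mm of margin at 50mm/s gives a 0.028s (28ms) window
```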

  1. After a homing operation the host would schedule movement to undo overshoot
    ...
    • I'm guessing you tried to also do this with the itersolve_set_commanded_pos() code.

In fact, no. This is just a small refactoring of note_homing_end: instead of directly adjusting _mcu_position_offset in this function, I moved that into set_mcu_position. Then, instead of changing _mcu_position_offset, I made it call set_commanded_pos, which has the same effect: get_mcu_position will now return the value that was reported by the MCU, but in addition get_commanded_pos will now also return a valid position instead of being invalidated by the homing operation. This slightly simplifies the code in homing.py, which no longer has to track MCU steps and can query get_commanded_position to get the position where the steppers stopped.

I have not made any changes to the way steps are generated or how moves are performed, I've simply made it record where the steppers finished including any overshoot so that get_mcu_position and get_commanded_position will both return that position. When the homing code performs its retract operation for the second homing attempt this will undo that overshoot. A further move could be scheduled if you want to undo overshoot from the second homing.
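A loose sketch of that bookkeeping (all class and method names invented for illustration; these are not the actual Klipper stepper classes):

```python
# Invented sketch of the idea described above: derive both the "MCU
# position" and the commanded position from the MCU-reported step
# count, so neither is invalidated by a homing operation.
class StepperRecord:
    def __init__(self, step_dist):
        self.step_dist = step_dist  # mm per step
        self.mcu_pos = 0            # steps, as reported by the MCU
        self.offset = 0.0           # mm between step frame and axis frame
    def set_mcu_position(self, mcu_pos):
        # record where the stepper actually stopped, overshoot included
        self.mcu_pos = mcu_pos
    def get_mcu_position(self):
        return self.mcu_pos
    def get_commanded_pos(self):
        return self.mcu_pos * self.step_dist + self.offset
```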

  1. Possibly restart the homing operation if a spurious timeout occurs. (Maybe this isn't needed - I'm unsure.)

My thoughts here were that if something has gone wrong/timed out then the machine may now be in an unsafe state, with motors that have driven past their intended limits, and it should require user confirmation before anything further occurs.
Edit: at least, that was my thinking before the drip feed limiting was in place. On second thought, it shouldn't be possible for the motors to overshoot by more than the allowed safe limit any more. So as long as you retract by at least the number of steps sent after the last endstop signal was received, it should be safe to retry. This will definitely be useful in cases where you are pushing the Raspberry Pi's realtime capabilities and something stalls the klippy process for a few ms.

  1. Multi-mcu homing on a delta may need particular handling.

I tried to bear that in mind while writing this code. The only place I think needs special handling (which I commented on) is when to know a motor has definitely stopped and you no longer need to consider it when limiting the drip feed rate.

One of my high-level goals would be to try and put most of this logic into klippy/extras/homing.py

The changes ended up being quite far reaching so if more of the homing logic could be brought together into that module it would help clean it up and make it easier to follow.

I did find it quite hard going figuring out how it all fits together: why MCU positions are used pretty much only during homing, what the 'tag'/'calc' position stuff is all about. Maybe some clarification in the developer documentation could help here. A lot of that is just down to me coming at it having never seen the code before and not knowing how the move buffering works, what the 'flush delay' is for, etc. It also seemed slightly strange that there is a separate drip feeding system used only during homing, but that was quite fortuitous as it was ideal for my needs (is there any reason it is not always in effect, perhaps with larger chunks when not homing?)

I have an idea in the back of my head about being able to delegate the drip feeding to an MCU if the Raspberry Pi is not able to manage the realtime requirements in this situation. This might not be too bad with MCUs connected on a CAN bus, such as the Huvud and Duet 3 tool boards, if the main board is also on the CAN bus. (In my current setup I have the option of connecting my 2nd MCU to my 1st if I implement some code to forward comms from klippy. I actually have the 1st MCU connected to the Pi by both UART and SPI, so I could just tell klippy to open /dev/spidevX.X to talk to the 2nd MCU.)

Final point: the regression tests are failing due to the addition of 'report_ticks' and 'triggered_time' in the endstop commands. Do you think it would be possible/better to add separate commands for the blind-homing endstops so the old commands and tests still work? I haven't looked at how the tests work or what the process is for regenerating the data when something like this changes.

@KevinOConnor (Collaborator) commented Jan 25, 2021

I was still struggling to wrap my head around the homing code when I implemented that part.

Yes - the homing code is quite complex in Klipper. I struggle with it also. In general, homing is difficult because there are so many different printers and different ways to "home". The code itself could likely also be improved.

it seems this is quite close to the edge of what my raspberry pi 3b+ can handle though.

The Klipper host python code can have some high latency. However, the host serialqueue.c C code is very responsive. I'd guess less than 1ms of added latency in the normal case. There's still the worst case scenario, but in that case my plan was to have the micro-controllers detect the loss of "continue to home" messages and stop the steppers on their own.

I have not made any changes to the way steps are generated or how moves are performed, I've simply made it record where the steppers finished including any overshoot so that get_mcu_position and get_commanded_position will both return that position. When the homing code performs its retract operation for the second homing attempt this will undo that overshoot.

That is what I was referring to earlier. The above is not robust - it can lead to stepcompress errors in the general case. For example, if one Z motor needs to move 5.005mm on the retract, and another Z stepper needs to move 5.020mm, then the iterative solver code will start the moves with an instantaneous movement followed by regular movement - it does not average out the overshoot over the entire move.
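One way to make the explicit correction moves safe is to give every stepper's move a shared duration sized for the largest distance, so no stepper needs an instantaneous jump. A rough sketch with an assumed velocity limit (the helper name is mine):

```python
# Sketch with assumed limits: size a shared correction-move duration
# for the largest per-stepper overshoot so that no stepper is asked
# to move instantaneously or exceed its velocity limit.
def correction_duration_s(distances_mm, max_velocity_mm_s):
    return max(distances_mm) / max_velocity_mm_s

# With the 5.005mm and 5.020mm retracts above and a 10mm/s limit,
# both steppers get 0.502s; each then moves at or below the limit.
```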

why MCU positions were being used pretty much only during homing, what the 'tag'/'calc' position stuff was all about

There's two goals with that - one is to keep units in millimeters and seconds where possible (it's easier to reason about the code without many unit conversions) and the second goal is to handle the many queues that are in place. There's some info on the queues at https://www.klipper3d.org/Code_Overview.html#coordinate-systems . Basically, different parts of the code have a different understanding of "current position" depending on which part of the queue they work on. The "mcu position" isn't really needed, except when we need to determine exactly how many steps have been taken.

It also seemed slightly strange to me that there is a separate drip feeding system only used during homing but it was quite fortuitous as this was ideal for my needs (any reason this is not always in effect, but maybe with larger chunks when not homing?)

Under normal circumstances, Klipper calculates in advance the exact time each move will take. It can therefore generate the steps for a move, queue them up, and go on to other tasks (like processing and scheduling more movement commands). Due to the nature of the G-Code protocol, we have to queue movement commands and perform lookahead on them in batches - we can't read a command, generate its steps, read the next command, etc. In contrast, when homing, the code does not know how long the move will take (the endstop could be triggered at any point) and we must not process new commands until the home is complete. So, the "drip move" system was added to handle this special case. If you look at the normal toolhead.py:_update_move_time() though, you'll see it doesn't differ that much from the "drip feed" code path - the normal code path mostly doesn't have the checks for triggers.
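The drip-feed loop described above can be caricatured like this (all names are mine, not Klipper's; the real code lives in toolhead.py):

```python
# Caricature of the "drip move" idea: flush step generation in small
# time chunks, checking for an endstop trigger between chunks, rather
# than queueing the whole (unbounded) homing move at once.
def drip_move(flush_chunk, total_time, chunk_s=0.050, check_trigger=None):
    t = 0.0
    while t < total_time:
        if check_trigger is not None and check_trigger():
            break               # endstop fired; stop feeding steps
        end = min(t + chunk_s, total_time)
        flush_chunk(t, end)     # generate/queue steps for [t, end)
        t = end
    return t
```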

I have an idea in the back of my head about being able to delegate the drip feeding to an MCU if the raspberry pi is not able to manage the realtime requirements of it in this situation.

FYI, if the synchronization is done in the host C code, I don't think it will be necessary to implement synchronization in the micro-controllers. In my experience, the host C code is very responsive.

the regression tests are failing

If you look at the python code, you'll see various checks for is_fileoutput() or printer.get_start_args().get('debugoutput') is not None. These checks are to disable waiting for query responses from the micro-controller when in "write output to file mode". Looking briefly at the error output, it looks like some of your code needs these checks. You can run the regression tests locally - see: https://www.klipper3d.org/Debugging.html#running-the-regression-tests EDIT: Actually, you'd need to manually generate the data-dictionaries in order to run the test suite locally. Might be easier to make sure the code doesn't wait for query responses when in debugoutput mode. Once the test cases pass you can download the "data dictionary artifact" that the github regression tool generates.
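The guard pattern Kevin describes might look something like this (function and parameter names are invented; only the is_fileoutput() check mirrors what the comment describes):

```python
# Hypothetical sketch: skip blocking MCU queries when Klipper is in
# "write output to file" (debugoutput) mode, since no MCU will ever
# answer. Only is_fileoutput() is taken from the discussion above.
def query_endstop_state(mcu, do_query, default=False):
    if mcu.is_fileoutput():
        return default          # stub result; nothing to wait on
    return do_query()
```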

-Kevin

@mattshepcar (Contributor, Author)

The Klipper host python code can have some high latency. However, the host serialqueue.c C code is very responsive. I'd guess less than 1ms of added latency in the normal case.

That's good to know.

For example, if one Z motor needs to move 5.005mm on the retract, and another Z stepper needs to move 5.020mm,

Right, that's why I had to add the stepper group stuff. I think it's probably OK to enforce that steppers that have to stay in sync be on the same MCU, and maybe add special handling for that case later if required.

Sounds like implementing some kind of keep-alive system on the serial thread is the way to go and maybe that will simplify the logic a bit too.

@KevinOConnor (Collaborator)

So, as a high-level question, how would you like to proceed on this?

I'm planning to try out some of the ideas I mentioned above, but I'm probably a few weeks out from having feedback with the results of those tests.

-Kevin

@Hello1024

If the config file had a parameter for "number of mm past the endstop switch where damage will be done", then all guesses about latency could be removed from the code... You could simply make sure that both MCUs are contacted at least once within that distance of travel.

For most regular microswitches that number is 5mm or more, so homing should be fast even with slow Python in the loop. Even for more precise switches, it still tends to be at least 1mm.

@mattshepcar (Contributor, Author)

So, as a high-level question, how would you like to proceed on this?

I've got a few other things to sort out on my printer first but I'll probably take another look at this once I'm back up and running. What's here already does work for me (most of the time) but it would be nice to clean it up and move the time critical stuff into the C code. I'll let you know when I've taken another pass at it.

@Hello1024 - that is exactly what this change does. There is no guessing: it uses the position_endstop and position_max/min values from the config to determine the safe amount to move.

@KevinOConnor (Collaborator)

Any further updates on this? FYI, there has been similar work in PR #3956.

-Kevin

@KevinOConnor (Collaborator)

Thanks. I'm going to close this as it looks like it has been superseded by #3956.

-Kevin

@github-actions github-actions bot locked and limited conversation to collaborators Oct 15, 2022