
[META] Filesync fails silently and breaks watches #147

Open

grayside opened this issue Mar 2, 2018 · 12 comments

Comments

@grayside
Contributor

grayside commented Mar 2, 2018

Problem

@illepic reports that filesync's silent failures leave developers working for an indefinite amount of time, unsure whether a troubleshooting fix is failing because it is wrong or because filesync isn't carrying their changes properly.

As a result, time is wasted on troubleshooting after a working fix has already been found, and distrust of the filesync system keeps growing.

This has been seconded by a number of other developers, making it seemingly the largest source of trouble for Outrigger users. Thank you to everyone who has spoken up about this problem.

Solution

With such a sweeping problem statement, it is impossible to declare a single solution. Rather, we will treat this like a "meta" issue, a bug that will require multiple changes to fully address. The definition of done should be that this problem stops being encountered for a reasonable length of time.

Related Issues

Here are the issues so far identified to help support this goal:

Use of This Issue

  1. Report specific reproduction steps that cause Unison to crash.
  2. Report any steps/upgrades to rig that make your problem go away.
  3. Suggest changes to rig or the documentation here so they can be coordinated with efforts underway.
@grayside
Contributor Author

grayside commented Mar 2, 2018

As an alternative to building restart of the unison process into the container (assuming it's a crash of the server, not of the client, that we need to be concerned with), we could build a smarter healthcheck into our unison container image and set something up to auto-restart the container when it reports unhealthy. This has some answers on how we might do that: https://stackoverflow.com/questions/47088261/restarting-an-unhealthy-docker-container-based-on-healthcheck
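As a rough illustration of the healthcheck side (a sketch only, not a committed design; the socket-mode assumption, port, and timeout below are mine, not what the outrigger/unison image currently ships), the probe could be as small as a TCP dial against the unison server:

```go
// healthcheck.go: a sketch of a container healthcheck probe for the unison
// server. It exits 0 when the server accepts a connection and non-zero
// otherwise, so Docker can mark the container unhealthy and an external
// watcher (e.g. the autoheal pattern from the Stack Overflow answers above)
// can restart it. Assumes unison runs in socket mode on port 5000.
package main

import (
	"net"
	"os"
	"time"
)

func main() {
	// A healthy unison server should accept a TCP connection promptly.
	conn, err := net.DialTimeout("tcp", "127.0.0.1:5000", 2*time.Second)
	if err != nil {
		os.Exit(1) // unhealthy: server is not accepting connections
	}
	conn.Close()
	os.Exit(0) // healthy
}
```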

@grayside
Contributor Author

grayside commented Mar 6, 2018

I'm working on a rig project sync:check command to operate as a sort of doctor check on the unison process.
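One shape this check could take (a sketch only; the container name and in-container path are placeholders, not what sync:check will actually use) is a round-trip test: write a sentinel file on the host and confirm it shows up inside the sync container.

```go
// synccheck.go: a sketch of a round-trip filesync test. It writes a uniquely
// named sentinel file on the host and polls the sync container until the
// file appears or a deadline passes. "myproject-sync" and "/var/www" are
// hypothetical stand-ins for the project's unison container and sync target.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"time"
)

func main() {
	sentinel := fmt.Sprintf(".rig-sync-check-%d", time.Now().UnixNano())
	if err := os.WriteFile(sentinel, []byte("ping"), 0644); err != nil {
		fmt.Fprintln(os.Stderr, "cannot write sentinel:", err)
		os.Exit(1)
	}

	ok := false
	deadline := time.Now().Add(10 * time.Second)
	for time.Now().Before(deadline) {
		// Ask the container whether the sentinel has arrived yet.
		check := exec.Command("docker", "exec", "myproject-sync",
			"test", "-f", "/var/www/"+sentinel)
		if check.Run() == nil {
			ok = true
			break
		}
		time.Sleep(500 * time.Millisecond)
	}

	os.Remove(sentinel) // clean up the host-side sentinel either way
	if ok {
		fmt.Println("sync OK: sentinel reached the container")
		return
	}
	fmt.Println("sync appears broken: sentinel never reached the container")
	os.Exit(2)
}
```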

@grayside
Contributor Author

grayside commented Mar 6, 2018

Collecting some research avenues:

  • rig project sync defaults fs.inotify.max_user_watches to 100,000 for the docker-machine. Is this enough? Probably. (A quick check of the limit is sketched after this list.)
    • Do we need to increase this number inside the container? Maybe.
  • When lots of files change, or a file is changed while it is still being synced, it can cause the high CPU spikes mentioned in the referenced issue. https://github.com/EugenMayer/docker-image-unison/pull/11/files demonstrates using Monit to monitor performance and supervisord to restart the unison server process.
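For the first item, a minimal sketch of verifying the watch limit (it reads the standard Linux procfs path, so it would run inside the docker-machine VM or the container, not on the macOS host; the 100,000 threshold just mirrors the rig default mentioned above):

```go
// checkwatches.go: a sketch that reads the kernel's inotify watch limit and
// warns when it is below the 100,000 that rig configures by default.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	raw, err := os.ReadFile("/proc/sys/fs/inotify/max_user_watches")
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot read inotify limit:", err)
		os.Exit(1)
	}
	limit, err := strconv.Atoi(strings.TrimSpace(string(raw)))
	if err != nil {
		fmt.Fprintln(os.Stderr, "unexpected value in procfs:", err)
		os.Exit(1)
	}
	if limit < 100000 {
		fmt.Printf("fs.inotify.max_user_watches is %d; consider raising it\n", limit)
		os.Exit(2)
	}
	fmt.Printf("fs.inotify.max_user_watches is %d; OK\n", limit)
}
```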

@mkochendorfer

This has happened to me and other developers countless times. It is incredibly frustrating and wastes untold hours going down the wrong paths, debugging things when the real problem is just that your code changes are not making it into the container. This is by far the highest-priority issue with rig today.

@srjosh

srjosh commented Mar 8, 2018

I've run into this quite frequently on client work; in my case it definitely seems tied to my host machine going to sleep and waking up. It is definitely frustrating.

@grayside
Contributor Author

grayside commented Mar 8, 2018

Note: This issue is now a mix of support request, problem research, "doctor" research, and autoheal research. I will probably split this apart in the next few days. I'm breaking the "doctor" angle out into #163.

@grayside grayside changed the title Filesync fails silently and breaks watches [META] Filesync fails silently and breaks watches Mar 12, 2018
@grayside
Contributor Author

I have converted this issue to a METABUG; please re-read the issue summary for details on what we are doing so far and what this issue should continue to be used for.

@grayside
Contributor Author

Further discussion with afflicted users has pointed out that one of the major error cases is resume-from-sleep. Improved handling of sleep/suspend/hibernation may go a long way toward addressing this problem.

@crittermike

Some of us have gotten into the habit of just assuming it's broken, both when starting dev (for the day or after a break) and whenever something unexpected happens, and running sync:start proactively before doing anything else.

@febbraro
Member

@mikecrittenden Does that approach of always running sync:start more or less alleviate any of the unison problems?

@potterme

I don't run into this with sleep, but I do run into it when sleeping+changing-networks, such as going from office to home and back. In my experience doing sync:start always fixes it.

This is different from unison quitting because there are too many file changes. Changing the max_user_watches "might" help. This often happens when doing something that seems simple, like mv vendor vendor_old or rm -rf node_modules. Deletions seem to cause the most issues. When doing a "mv", unison sees it as both a file deletion and a file addition.

I'm not sure I'm in favor of something trying to auto-restart unison processes, since there have been cases where I've shut down unison on purpose. But a tool to detect a problem and notify would be useful.

Education and docs on this are definitely the most useful. Once this has happened to somebody a few times they stop going down hour-long debugging rabbit holes and start checking unison more often, so even making it part of "rig doctor" would be helpful. But it would also help for devs to think more about what is happening when they do things like "mv vendor vendor_old" and why they might be doing that.

@crittermike

@febbraro yeah that seems to handle it for me. Typically if I see issues now it's because I just forgot to run that command. I don't usually see it crash in the middle of doing something, but I might just be lucky.
