Add accumulated usage #155

Merged · 12 commits · Apr 5, 2020

Conversation

TheLostLambda
Collaborator

Closes #147

This works pretty well, but the existing code wasn't really designed for accumulated usage like this, so I've left in some comments where I'd appreciate some others' opinions on refactoring matters.

This feature really adds a lot to bandwhich for me! I have a metered connection on a server and this PR makes it possible to track down what processes use the most data over a longer period of time!

This is a draft PR at the moment and I'd be hesitant to merge before we have the chance to address some of the comments I left behind :)

Additionally, the test snapshots have not been updated yet.

@TheLostLambda
Collaborator Author

TheLostLambda commented Mar 27, 2020

I've been dogfooding this on my server and ran into a performance issue over time. Because I'm remembering every connection ever established, the number of items to sort before displaying them onscreen balloons quickly:

[src/display/ui_state.rs:199] self.processes.len() = 22
[src/display/ui_state.rs:200] self.remote_addresses.len() = 7808
[src/display/ui_state.rs:201] self.connections.len() = 93394

I've now implemented a pruning step in the UIState update, so I end up with something much more manageable:

[src/display/ui_state.rs:199] self.processes.len() = 22
[src/display/ui_state.rs:200] self.remote_addresses.len() = 1000
[src/display/ui_state.rs:201] self.connections.len() = 1000

Where only the 1000 biggest values are retained.
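
The pruning step boils down to something like this (a minimal sketch with illustrative names and types, not the actual bandwhich structs):

    use std::collections::HashMap;

    const MAX_BANDWIDTH_ITEMS: usize = 1000;

    // Keep only the MAX_BANDWIDTH_ITEMS entries with the largest accumulated
    // byte counts; everything else is dropped from the map.
    fn prune_to_largest(items: &mut HashMap<String, u128>) {
        if items.len() <= MAX_BANDWIDTH_ITEMS {
            return;
        }
        let mut entries: Vec<(String, u128)> = items.drain().collect();
        entries.sort_unstable_by(|a, b| b.1.cmp(&a.1)); // biggest first
        entries.truncate(MAX_BANDWIDTH_ITEMS);
        items.extend(entries);
    }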

This fixes the performance leak and will run nicely for a long time without eating up CPU cycles. Again, I've left some comments in the code and would love your feedback on those suggestions :)

Owner

@imsnif imsnif left a comment

Hey @TheLostLambda - really great work! I am very happy to see this change coming along and can't wait to see this merged and released.

Aside from the comments in the code, I mainly have one issue with the design, and that's with the addition of MAX_BANDWIDTH_ITEMS. I definitely understand the performance implications and am happy you caught it early (it was something I was honestly planning to look at when I saw the PR :) ).

The problems I see with the current implementation:

  1. The total bandwidth we'd be showing on the top line would be a little inaccurate right now. It would include bandwidth from items we'd already removed from the list. While this isn't terrible (the list is 1000 items long, after all) I can imagine if people track these things, it might get a little confusing.
  2. The cross-referencing between tables (eg. number of connections per process) is affected by this and becomes inaccurate.
  3. This also creates an issue that new items will have a hard time getting into the list as time goes on.

I have two solutions in mind for this (and would be happy to hear other suggestions and/or your thoughts about these):

  1. Instead of removing items from the list, we could have a Utilization < [count] field in each table, and we could update it whenever we need to remove something from the list.

  2. We could re-count the total utilization in real time on every tick. This might be easier to implement, but I feel it's less "holistic" than the first solution.

Let me know what you think, and if there's anything else I can do to make this happen!

Review comments (resolved) on: src/display/components/display_bandwidth.rs, src/display/components/table.rs, src/display/ui_state.rs, src/main.rs, Cargo.toml
@TheLostLambda
Collaborator Author

Thank you for all of the excellent feedback! I'll look into addressing a couple of things today and then finishing things off in the days to come :)

Before I dive too deep though, could you elaborate a bit on this?

Instead of removing items from the list, we could have a Utilization < [count] field in each table, and we could update it whenever we need to remove something from the list.

Another option would be to use a Binary Heap. We could build up the data in a binary heap (it's roughly O(1) for adding values) and then pop items off in sorted order (the whole structure never needs to be sorted!). That way we could take only 100 or so items, but still remember everything we have seen. We could also use the purpose-built lazysort crate, which is even faster than the heap and would likely involve less memory allocation.

Taking 100 sorted elements from a Vec of 2,500,000 elements:

test benches::a_heap_bench     ... bench:  15,981,458 ns/iter (+/- 285,314)
test benches::a_lazy_bench     ... bench:   8,933,722 ns/iter (+/- 317,402)
test benches::a_standard_bench ... bench: 180,872,221 ns/iter (+/- 4,104,147)
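
As a rough sketch of the heap approach (illustrative, not actual bandwhich code):

    use std::collections::BinaryHeap;

    // Build a max-heap from all values, then pop only the n largest in sorted
    // order; the rest of the structure is never fully sorted.
    fn top_n(values: Vec<u64>, n: usize) -> Vec<u64> {
        let mut heap: BinaryHeap<u64> = values.into_iter().collect();
        (0..n).filter_map(|_| heap.pop()).collect()
    }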

Strictly speaking, unless we delete items, we'll hit a performance wall eventually (memory or CPU). This is also only addressing the sorting step of the performance problem (which I certainly think is the limiting step, but there might be other lurking bottlenecks).

Maybe a compromise is to increase MAX_BANDWIDTH_ITEMS quite a bit (say 100,000 or something) and balance that performance hit with a Heap or lazy sorting? I just think it would be a shame if the program was destined to crash or eat CPU resources after a while (even if it is several weeks).

The total bandwidth we'd be showing on the top line would be a little inaccurate right now. It would include bandwidth from items we'd already removed from the list. While this isn't terrible (the list is 1000 items long, after all) I can imagine if people track these things, it might get a little confusing.

As for this, I agree it's not ideal, but I think that I would rather this be the case than the total ignoring the smaller connections. I think by recomputing it, we are no longer reporting an accurate measure of data sent / received (even if we do maintain internal consistency).

@TheLostLambda
Collaborator Author

TheLostLambda commented Mar 29, 2020

Unfortunately, I think there is a deeply nested issue that might be blocking this functionality.

I've been doing some testing, and it looks like bandwhich is missing a lot of data. It seems to be somewhat speed dependent. If I download a 10MB file at 100KBps, then bandwhich totals up to ~6MB, but at full speed (~6.5MBps) it only comes out to ~1.6MB.

Downloading a 1GB file at 25MBps reports only 185MB in bandwhich, with or without the divide_by.

Furthermore, even the traditional rate mode seems broken 😢 A rate of more than 20MBps reports under 5MBps in bandwhich (even in the officially packaged version of Bandwhich). This is all on Linux.

Do you have any idea what might be causing this?

@imsnif
Owner

imsnif commented Mar 29, 2020

Do you have any idea what might be causing this?

My suspicion is that it might be (related to) this: #136
I still haven't managed to isolate the issue (I have not managed to reproduce it locally or figure out on which platforms/use-cases it happens). Unfortunately I haven't yet had the time to write a branch (as I promised in the issue :( ) that would essentially peel stuff away from bandwhich in order to find out where this is happening.
Now that you're more familiar with the code and can reproduce it, if you feel like tackling this one (before or after), that would be great and I'd be happy to guide you towards possible root cause areas. But really - only if you feel like it. :)

@TheLostLambda
Collaborator Author

Hmm, okay. I might be able to take a look into it, but I'm not sure when I'd get around to it. What sort of setup are you running where it works properly?

@imsnif
Owner

imsnif commented Mar 29, 2020

Hmm, okay. I might be able to take a look into it, but I'm not sure when I'd get around to it. What sort of setup are you running where it works properly?

That would be great. I hope it doesn't block your current work here? If so, how can I help?

I'm using Arch Linux on my ThinkPad T450. I get accurate reporting when downloading a large file with one connection or when downloading from multiple connections (e.g. downloading a Linux distribution through a torrent - I use Debian).

@TheLostLambda
Collaborator Author

TheLostLambda commented Mar 29, 2020

Curious - I'm also on Arch Linux and don't get accurate reporting in either case (on my desktop). I'll try on my ThinkPad as well, I suppose.

And no, I think work on this can continue as long as the underlying issue is eventually tracked down :) What did you think about the performance compromise I mentioned back in this comment?

EDIT: No luck on my Arch Linux ThinkPad P52s either... Still inaccurate reporting :(

@imsnif
Owner

imsnif commented Mar 29, 2020

I'll take a closer look at your comments and address everything in the next few hours. I wanted to answer what I could quickly to get stuff out of the way for you :)

@imsnif
Owner

imsnif commented Mar 29, 2020

Hey, so regarding the MAX_BANDWIDTH_ITEMS issue:

I definitely hear what you're saying about all of these solutions hitting some sort of bottleneck in the end (even if only after running for several weeks). I think both of the proposed solutions are great, but definitely in the "least bad thing" category. :) Let's reach for them if we decide that's the best way to go.

I'll try to explain my original idea better:

I propose we still use the MAX_BANDWIDTH_ITEMS mechanism, only in addition we add another "entry" in each table. This entry will be called something like Processes with bandwidth under <NUMBER> and will be used to record those items that we would have deleted from the table after we reach MAX_BANDWIDTH_ITEMS. Whenever we perform the MAX_BANDWIDTH_ITEMS calculation, we'll update <NUMBER> to be the lowest value among the 1000 we keep.
Does this make sense? What do you think of the approach?
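
In code, the idea would look roughly like this (just a sketch; the names are placeholders, not a final design):

    // When pruning past MAX_BANDWIDTH_ITEMS, fold the removed entries' bytes
    // into a catch-all bucket and remember the cutoff used in the row's title,
    // so each table still sums up to the total shown on the top line.
    fn prune_with_catch_all(
        mut entries: Vec<(String, u128)>, // (name, accumulated bytes)
        max_items: usize,
    ) -> (Vec<(String, u128)>, u128, u128) {
        entries.sort_unstable_by(|a, b| b.1.cmp(&a.1)); // biggest first
        let mut culled_bytes = 0u128;
        if entries.len() > max_items {
            for (_, bytes) in entries.split_off(max_items) {
                culled_bytes += bytes;
            }
        }
        let cutoff = entries.last().map(|(_, b)| *b).unwrap_or(0);
        (entries, culled_bytes, cutoff)
    }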

@TheLostLambda
Collaborator Author

I quite like the idea of that! That seems like a good way to impose a maximum limit on the number of items in the maps and still have a list that sums to the total!

I'll go about implementing that soon. I think it's a quite good compromise :)

In that vein, would you mind if I swapped out the BTreeMap for an ordinary HashMap? Technically it should be a little quicker. I may also explore raising the MAX_BANDWIDTH_ITEMS and implementing the sorting with BinaryHeap so only the items the UI can display are sorted, but that probably won't be worth the code complexity.

@imsnif
Owner

imsnif commented Mar 29, 2020

Sounds great. Feel free to make any changes you see fit and we can discuss their ups and downs when you're ready (or before if you'd like the input, ofc).

@TheLostLambda
Collaborator Author

Alright, added a bit more polish, but I had a question! How should the deleted entry category actually be shown in the tables? What do we show for connection count (process and address tabs) or the process name (for the connections tab)?

Currently, if you run this version of the code and redirect stderr (./target/release/bandwhich 2> out.txt), out.txt will show the number of connections in the map and the bandwidth of all of the connections that have been culled.

Also, I presume we don't want this row to show unless items have actually been removed, which means it will only be included if more than 10,000 items are already in the table. Does raw mode also read these tables? If not, what's the point of a row that no one will ever see?

On some other notes, I've moved to HashMaps and am only sorting the vectors once (before, the maps would need to be sorted every time they were trimmed and again when they were displayed). A consequence of this is that all sorting operations / code (as well as the details of what map type is used) are internal to UIState now. This means it's easy to change out the sorting or internal data structures without touching any other files.

Let me know what you think and what we should show in these tables!

Hope you are well!

@imsnif
Owner

imsnif commented Mar 31, 2020

Good point about the extra row... I think it's mostly there in order to support the total bandwidth.

About the connections: hoi... tbh I'm starting to remember all the complexity involved in the feature. :) So - we could either not display them (and then change the title of the column to "active connections" or "active processes") or keep track of them in a different struct when we are in cumulative mode. Makes sense? Or do you have a different idea?

@TheLostLambda
Collaborator Author

TheLostLambda commented Mar 31, 2020

Currently they exist in special fields of the UIState struct and are being correctly updated. It's just about displaying them I think. I've called the fields whatever_drift:

    pub processes_drift: Bandwidth,
    pub remote_addresses_drift: Bandwidth,
    pub connections_drift: Bandwidth,

I'm not sure where they would best fit in the GUI to be honest, but the struct does hold this data when cumulative mode is running (though I might be misunderstanding what you are going for).

@imsnif
Owner

imsnif commented Mar 31, 2020

Ah, sorry. I think I was trying to solve a different problem. Been a long day :)
Let's not display these in the UI. I think as you say there will never be a situation where they should be displayed. I'm thinking raw_mode won't really benefit from the total usage feature anyway, right?

@TheLostLambda
Collaborator Author

TheLostLambda commented Mar 31, 2020

No worries, I know how you feel :) I don't think it would be very useful in raw mode either. Should I keep the "drift tracking" in the code in case we want to use that value in the future? Or should I cut it?

@imsnif
Owner

imsnif commented Mar 31, 2020

I haven't yet gone over the code, but will the total bandwidth calculation be correct without it if we're removing excess connections?

@TheLostLambda
Collaborator Author

I believe so, yes. It just adds up over time. It's not calculated from the entries or anything.
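
For reference, the totals live in their own ever-growing counters, roughly like this (an illustrative sketch, not the exact UIState fields):

    // The running totals are bumped on every refresh and never recomputed from
    // the per-connection maps, so pruning entries does not affect them.
    #[derive(Default)]
    struct Totals {
        total_bytes_downloaded: u128,
        total_bytes_uploaded: u128,
    }

    impl Totals {
        fn record(&mut self, downloaded: u128, uploaded: u128) {
            self.total_bytes_downloaded += downloaded;
            self.total_bytes_uploaded += uploaded;
        }
    }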

@imsnif
Owner

imsnif commented Mar 31, 2020

Awesome - so let's remove them. I prefer not having things in the code unless we directly use them.

@TheLostLambda
Collaborator Author

Sounds good! I'll push something in a couple of minutes with them cut out.

@TheLostLambda
Collaborator Author

Alright! This should be pretty much final code! Could I get some help updating the tests? I couldn't manage to get them working, and they broke after I removed the word "Rate" from the interface.

@imsnif
Owner

imsnif commented Mar 31, 2020

Hey @TheLostLambda - I'm going to take a look at this in the next few days. About the tests, we use cargo insta for the snapshots. If you install it, you can run cargo insta review after you run the tests in order to review the newly created snapshots and accept/reject them as you see fit. Then commit+push.

EDIT: oops, forgot to link :) https://docs.rs/insta/0.8.1/insta/

@TheLostLambda
Collaborator Author

No matter what I try, I can't get things to pass. Most of the tests are fine and only needed the removal of "Rate", but in the last handful of tests, while all of the lines are always present, they are presented in a random order.

Maybe you could take a look when you get the chance?

@imsnif
Owner

imsnif commented Apr 1, 2020

Off the top of my head this seems like something isn't sorted (eg. when entries have the same amount of bandwidth, they're not also sorted by name) - but that's just a guess. I can take a closer look (and review everything else) over the weekend.
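
Something along these lines would make the order deterministic (just a sketch of that guess, with illustrative types):

    // Sort by bandwidth, descending, and break ties by name so rows with
    // equal bandwidth always come out in the same order.
    fn sort_rows(rows: &mut [(String, u128)]) {
        rows.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    }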

@TheLostLambda
Collaborator Author

Yep, it looks like I was just being a bit silly! Moving from BTreeMaps to HashMaps means that things are no longer sorted by key, and the test packets: "I have come from 1.1.1.1" and "Awesome, I'm from 3.3.3.3" just so happened to be the exact same length!

I've coerced the tests into passing by changing "Awesome, I'm from 3.3.3.3" to "Greetings traveller, I'm from 3.3.3.3".

Everything should pass now :) Good catch!

@imsnif
Owner

imsnif commented Apr 4, 2020

Hey @TheLostLambda - this is some fantastic work. I went over everything and the changes seem solid.

For me, the only thing missing before we can merge is a test for this. I don't think it should be too involved to add one - just pass the flag and see that the traffic is accumulated. What do you think?

Also, fair to note some issues this introduces:

  1. The pause functionality doesn't do what one would expect in this mode. When we pause, we stop collecting data. Which is okay when we're dealing with rate, but a little unexpected when we're collecting total bandwidth. I don't think this is terrible, and a fix for this would involve quite a big architecture change. So I'm fine with leaving this as is and considering adding some sort of warning to it as a separate PR.
  2. In this mode we do not accumulate connections. This is also fine, but might be a little non-intuitive (the bandwidth is accumulated, why not the connections?) I think this can be solved by changing the name of the "Connections" column to "Active Connections". This is a relatively simple change, but will involve playing a little with the UI breakpoints, as this column will now be much wider. This can also definitely be done out of scope in a different PR.

@TheLostLambda
Collaborator Author

Thanks for giving it a look! I can try to put together a test for it today :)

As for the pause and connection issues, they were bugging me too. I tried to accumulate connections, but that would also require some major reworking, as NetworkData currently only holds a count of connections with no way of knowing which ones are unique. I agree that these issues are likely best dealt with in another PR (which I might like to help with as well).

I'll try to add some tests today and let you know how it goes!

@TheLostLambda
Collaborator Author

Sorry for the delay @imsnif ! I've duplicated all of the sustained tests to test the total mode as well.

It seems to work well and the totals add up from snapshot to snapshot.

let (terminal_events, terminal_draw_events, backend) = test_backend_factory(190, 50);

let os_input = os_input_output(network_frames, 3);
let mut opts = opts_ui();
Owner

I think these still use total_utilization: false, don't they?

Collaborator Author

Initially, opts_ui() does have total_utilization set to false, but I'm setting it to true in the line immediately after. Adding another "opts building" function was a bit messy, so I opted for that. I could define an opts_ui_total() function if you'd rather.

The new snapshots from these tests do show totals as opposed to rates ("B" not "Bps"). Perhaps I'm missing something?
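
For reference, the pattern being described is roughly the following sketch (opts_ui() is the existing test helper; the Opt type name and field are assumptions based on the flag discussed here):

    // Start from the shared defaults and flip the flag on the next line,
    // instead of adding a second opts-building helper.
    fn opts_ui_total() -> Opt {
        let mut opts = opts_ui();
        opts.total_utilization = true;
        opts
    }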

Owner

Oh man, I managed to totally miss that line! My bad, sorry. :)

Collaborator Author

Haha, no worries :)

@imsnif
Owner

imsnif commented Apr 5, 2020

Hey @TheLostLambda - I think everything looks great. I'd be happy to go ahead and merge this. Or is there anything still open that I'm missing? (I'll open issues for the above two caveats we talked about after the release).

@TheLostLambda
Collaborator Author

I think that it should be ready for merging :) I've been running it for a couple of days without issues.

Perhaps we could add an issue for displaying the elapsed time when you are in total mode? I can just check how long the process has been running with ps, but having elapsed time in the top right corner or something might be helpful for some. I think that would need to be discussed in a separate issue though.

@imsnif
Owner

imsnif commented Apr 5, 2020

@TheLostLambda - thank you very much for this! I'm excited about releasing this feature. :)

I like your idea about the elapsed time (when in total_utilization mode). Would you like to open an issue about it?

@imsnif imsnif merged commit 62a3946 into imsnif:master Apr 5, 2020
@TheLostLambda
Collaborator Author

TheLostLambda commented Apr 5, 2020

Sure! I'll make that issue now :) Thanks for making it so easy to contribute to this awesome repository!
