Add accumulated usage #155

Merged · 12 commits · Apr 5, 2020

Conversation

TheLostLambda
Collaborator

Closes #147

This works pretty well, but the existing code wasn't really designed for accumulated usage like this, so I've left in some comments where I'd appreciate some others' opinions on refactoring matters.

This feature really adds a lot to bandwhich for me! I have a metered connection on a server and this PR makes it possible to track down what processes use the most data over a longer period of time!

This is a draft PR at the moment and I'd be hesitant to merge before we have the chance to address some of the comments I left behind :)

Additionally, the test snapshots have not been updated yet.

@TheLostLambda
Collaborator Author

TheLostLambda commented Mar 27, 2020

I've been dogfooding this on my server and ran into a performance issue over time. Because I'm remembering every connection ever established, the number of items to sort before displaying them onscreen balloons quickly:

[src/display/ui_state.rs:199] self.processes.len() = 22
[src/display/ui_state.rs:200] self.remote_addresses.len() = 7808
[src/display/ui_state.rs:201] self.connections.len() = 93394

I've now implemented a pruning step in the UIState update, so I end up with something much more manageable:

[src/display/ui_state.rs:199] self.processes.len() = 22
[src/display/ui_state.rs:200] self.remote_addresses.len() = 1000
[src/display/ui_state.rs:201] self.connections.len() = 1000

Where only the 1000 biggest values are retained.
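
The pruning step boils down to something like this (a minimal sketch with illustrative names and types, not the actual bandwhich structs):

    use std::collections::HashMap;

    const MAX_BANDWIDTH_ITEMS: usize = 1000;

    // Keep only the MAX_BANDWIDTH_ITEMS entries with the largest accumulated
    // byte counts; everything else is dropped from the map.
    fn prune_to_largest(items: &mut HashMap<String, u128>) {
        if items.len() <= MAX_BANDWIDTH_ITEMS {
            return;
        }
        let mut entries: Vec<(String, u128)> = items.drain().collect();
        entries.sort_unstable_by(|a, b| b.1.cmp(&a.1)); // biggest first
        entries.truncate(MAX_BANDWIDTH_ITEMS);
        items.extend(entries);
    }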

This fixes the performance leak and will run nicely for a long time without eating up CPU cycles. Again, I've left some comments in the code and would love your feedback on those suggestions :)

Owner

@imsnif imsnif left a comment

Hey @TheLostLambda - really great work! I am very happy to see this change coming along and can't wait to see this merged and released.

Aside from the comments in the code, I mainly have one issue with the design, and that's with the addition of MAX_BANDWIDTH_ITEMS. I definitely understand the performance implications and am happy you caught it early (it was something I was honestly planning to look at when I saw the PR :) ).

The problems I see with the current implementation:

  1. The total bandwidth we'd be showing on the top line would be a little inaccurate right now. It would include bandwidth from items we'd already removed from the list. While this isn't terrible (the list is 1000 items long, after all) I can imagine if people track these things, it might get a little confusing.
  2. The cross-referencing between tables (eg. number of connections per process) is affected by this and becomes inaccurate.
  3. This also creates an issue that new items will have a hard time getting into the list as time goes on.

I have two solutions in mind for this (and would be happy to hear other suggestions and/or your thoughts about these):

  1. Instead of removing items from the list, we could have a Utilization < [count] field in each table, and we could update it whenever we need to remove something from the list.

  2. We could re-count the total utilization in real time on every tick. This might be easier to implement, but I feel it's less "holistic" than the first solution.

Let me know what you think, and if there's anything else I can do to make this happen!

Review comments (resolved) on: src/display/components/display_bandwidth.rs, src/display/components/table.rs, src/display/ui_state.rs, src/main.rs, Cargo.toml
@TheLostLambda
Collaborator Author

Thank you for all of the excellent feedback! I'll look into addressing a couple of things today and then finishing things off in the days to come :)

Before I dive too deep though, could you elaborate a bit on this?

Instead of removing items from the list, we could have a Utilization < [count] field in each table, and we could update it whenever we need to remove something from the list.

Another option would be to use a Binary Heap. We could build up the data in a binary heap (it's roughly O(1) for adding values) and then pop items off in sorted order (the whole structure never needs to be sorted!). That way we could take only 100 or so items, but still remember everything we have seen. We could also use the purpose-built lazysort crate, which is even faster than the heap and would likely involve less memory allocation.

Taking 100 sorted elements from a Vec of 2,500,000 elements:

test benches::a_heap_bench     ... bench:  15,981,458 ns/iter (+/- 285,314)
test benches::a_lazy_bench     ... bench:   8,933,722 ns/iter (+/- 317,402)
test benches::a_standard_bench ... bench: 180,872,221 ns/iter (+/- 4,104,147)
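
As a rough sketch of the heap approach (illustrative, not actual bandwhich code):

    use std::collections::BinaryHeap;

    // Build a max-heap from all values, then pop only the n largest in sorted
    // order; the rest of the structure is never fully sorted.
    fn top_n(values: Vec<u64>, n: usize) -> Vec<u64> {
        let mut heap: BinaryHeap<u64> = values.into_iter().collect();
        (0..n).filter_map(|_| heap.pop()).collect()
    }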

Strictly speaking, unless we delete items, we'll hit a performance wall eventually (memory or CPU). This is also only addressing the sorting step of the performance problem (which I certainly think is the limiting step, but there might be other lurking bottlenecks).

Maybe a compromise is to increase MAX_BANDWIDTH_ITEMS quite a bit (say 100,000 or something) and balance that performance hit with a Heap or lazy sorting? I just think it would be a shame if the program was destined to crash or eat CPU resources after a while (even if it is several weeks).

The total bandwidth we'd be showing on the top line would be a little inaccurate right now. It would include bandwidth from items we'd already removed from the list. While this isn't terrible (the list is 1000 items long, after all) I can imagine if people track these things, it might get a little confusing.

As for this, I agree it's not ideal, but I think that I would rather this be the case than the total ignoring the smaller connections. I think by recomputing it, we are no longer reporting an accurate measure of data sent / received (even if we do maintain internal consistency).

@TheLostLambda
Collaborator Author

TheLostLambda commented Mar 29, 2020

Unfortunately, I think there is a deeply nested issue that might be blocking this functionality.

I've been doing some testing, and it looks like bandwhich is missing a lot of data. It seems to be somewhat speed dependent. If I download a 10MB file at 100KBps, then bandwhich totals up to ~6MB, but at full speed (~6.5MBps) it only comes out to ~1.6MB.

Downloading a 1GB file at 25MBps reports only 185MB in bandwhich, with or without the divide_by.

Furthermore, even the traditional rate mode seems broken 😢 A rate of more than 20MBps reports under 5MBps in bandwhich (even in the officially packaged version of Bandwhich). This is all on Linux.

Do you have any idea what might be causing this?

@imsnif
Owner

imsnif commented Mar 29, 2020

Do you have any idea what might be causing this?

My suspicion is that it might be (related to) this: #136
I still haven't managed to isolate the issue (I have not managed to reproduce it locally or figure out on which platforms/use-cases it happens). Unfortunately I haven't yet had the time to write a branch (as I promised in the issue :( ) that would essentially peel stuff away from bandwhich in order to find out where this is happening.
Now that you're more familiar with the code and can reproduce it, if you feel like tackling this one (before or after), that would be great and I'd be happy to guide you towards possible root cause areas. But really - only if you feel like it. :)

@TheLostLambda
Collaborator Author

Hmm, okay. I might be able to take a look into it, but I'm not sure when I'd get around to it. What sort of setup are you running where it works properly?

@imsnif
Owner

imsnif commented Mar 29, 2020

Hmm, okay. I might be able to take a look into it, but I'm not sure when I'd get around to it. What sort of setup are you running where it works properly?

That would be great. I hope it doesn't block your current work here? If so, how can I help?

I'm using Arch Linux on my ThinkPad T450. I get accurate reporting when downloading a large file with one connection or when downloading from multiple connections (e.g. downloading a Linux distribution through a torrent - I use Debian).

@TheLostLambda
Collaborator Author

TheLostLambda commented Mar 29, 2020

Curious - I'm also on Arch Linux and don't get accurate reporting in either case (on my desktop). I'll try on my ThinkPad as well, I suppose.

And no, I think work on this can continue as long as the underlying issue is eventually tracked down :) What did you think about the performance compromise I mentioned back in this comment?

EDIT: No luck on my Arch Linux ThinkPad P52s either... Still inaccurate reporting :(

@imsnif
Owner

imsnif commented Mar 29, 2020

I'll take a closer look at your comments and address everything in the next few hours. I wanted to answer what I could quickly to get stuff out of the way for you :)

@imsnif
Owner

imsnif commented Mar 29, 2020

Hey, so regarding the MAX_BANDWIDTH_ITEMS issue:

I definitely hear what you're saying about all of these solutions hitting some sort of bottleneck in the end (even if only after running for several weeks). I think both of the proposed solutions are great, but definitely in the "least bad thing" category. :) Let's reach for them if we decide that's the best way to go.

I'll try to explain my original idea better:

I propose we still use the MAX_BANDWIDTH_ITEMS mechanism, only in addition we add another "entry" in each table. This entry will be called something like Processes with bandwidth under <NUMBER> and will be used to record those items that we would have deleted from the table after we reach MAX_BANDWIDTH_ITEMS. Whenever we perform the MAX_BANDWIDTH_ITEMS calculation, we'll update <NUMBER> to be the lowest value among the 1000 we keep.
Does this make sense? What do you think of the approach?
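
In code, the idea would look roughly like this (just a sketch; the names are placeholders, not a final design):

    // When pruning past MAX_BANDWIDTH_ITEMS, fold the removed entries' bytes
    // into a catch-all bucket and remember the cutoff used in the row's title,
    // so each table still sums up to the total shown on the top line.
    fn prune_with_catch_all(
        mut entries: Vec<(String, u128)>, // (name, accumulated bytes)
        max_items: usize,
    ) -> (Vec<(String, u128)>, u128, u128) {
        entries.sort_unstable_by(|a, b| b.1.cmp(&a.1)); // biggest first
        let mut culled_bytes = 0u128;
        if entries.len() > max_items {
            for (_, bytes) in entries.split_off(max_items) {
                culled_bytes += bytes;
            }
        }
        let cutoff = entries.last().map(|(_, b)| *b).unwrap_or(0);
        (entries, culled_bytes, cutoff)
    }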

@TheLostLambda
Collaborator Author

I quite like the idea of that! That seems like a good way to impose a maximum limit on the number of items in the maps and still have a list that sums to the total!

I'll go about implementing that soon. I think it's a quite good compromise :)

In that vein, would you mind if I swapped out the BTreeMap for an ordinary HashMap? Technically it should be a little quicker. I may also explore raising the MAX_BANDWIDTH_ITEMS and implementing the sorting with BinaryHeap so only the items the UI can display are sorted, but that probably won't be worth the code complexity.

@imsnif
Owner

imsnif commented Mar 29, 2020

Sounds great. Feel free to make any changes you see fit and we can discuss their ups and downs when you're ready (or before if you'd like the input, ofc).

@TheLostLambda
Collaborator Author

Alright, added a bit more polish, but I had a question! How should the deleted entry category actually be shown in the tables? What do we show for connection count (process and address tabs) or the process name (for the connections tab)?

Currently, if you run this version of the code and redirect stderr (./target/release/bandwhich 2> out.txt), out.txt will show the number of connections in the map and the bandwidth of all of the connections that have been culled.

Also, I presume we don't want this row to show unless items have actually been removed, which means it will only be included if more than 10,000 items are already in the table. Does raw mode also read these tables? If not, what's the point of a row that no one will ever see?

On some other notes, I've moved to HashMaps and am only sorting the vectors once (before, the maps would need to be sorted every time they were trimmed and again when they were displayed). A consequence of this is that all sorting operations / code (as well as the details of what map type is used) are internal to UIState now. This means it's easy to change out the sorting or internal data structures without touching any other files.

Let me know what you think and what we should show in these tables!

Hope you are well!

@imsnif
Owner

imsnif commented Mar 31, 2020

Good point about the extra row... I think it's mostly there in order to support the total bandwidth.

About the connections: hoi... tbh I'm starting to remember all the complexity involved in the feature. :) So - we could either not display them (and then change the title of the column to "active connections" or "active processes") or keep track of them in a different struct when we are in cumulative mode. Makes sense? Or do you have a different idea?

@TheLostLambda
Collaborator Author

TheLostLambda commented Mar 31, 2020

Currently they exist in special fields of the UIState struct and are being correctly updated. It's just about displaying them I think. I've called the fields whatever_drift:

    pub processes_drift: Bandwidth,
    pub remote_addresses_drift: Bandwidth,
    pub connections_drift: Bandwidth,

I'm not sure where they would best fit in the GUI to be honest, but the struct does hold this data when cumulative mode is running (though I might be misunderstanding what you are going for).

@imsnif
Owner

imsnif commented Mar 31, 2020

Ah, sorry. I think I was trying to solve a different problem. Been a long day :)
Let's not display these in the UI. I think as you say there will never be a situation where they should be displayed. I'm thinking raw_mode won't really benefit from the total usage feature anyway, right?

@TheLostLambda
Collaborator Author

TheLostLambda commented Mar 31, 2020

No worries, I know how you feel :) I don't think it would be very useful in raw mode either. Should I keep the "drift tracking" in the code in case we want to use that value in the future? Or should I cut it?

@imsnif
Owner

imsnif commented Mar 31, 2020

I haven't yet gone over the code, but will the total bandwidth calculation be correct without it if we're removing excess connections?

@TheLostLambda
Collaborator Author

I believe so, yes. It just adds up over time. It's not calculated from the entries or anything.
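
For reference, the totals live in their own ever-growing counters, roughly like this (an illustrative sketch, not the exact UIState fields):

    // The running totals are bumped on every refresh and never recomputed from
    // the per-connection maps, so pruning entries does not affect them.
    #[derive(Default)]
    struct Totals {
        total_bytes_downloaded: u128,
        total_bytes_uploaded: u128,
    }

    impl Totals {
        fn record(&mut self, downloaded: u128, uploaded: u128) {
            self.total_bytes_downloaded += downloaded;
            self.total_bytes_uploaded += uploaded;
        }
    }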

@imsnif
Owner

imsnif commented Mar 31, 2020

Awesome - so let's remove them. I prefer not having things in the code unless we directly use them.

@TheLostLambda
Collaborator Author

Sounds good! I'll push something in a couple of minutes with them cut out.

@TheLostLambda
Collaborator Author

Alright! This should be pretty much final code! Could I get some help updating the tests? I couldn't manage to get them working, and they broke after I removed the word "Rate" from the interface.

@imsnif
Owner

imsnif commented Mar 31, 2020

Hey @TheLostLambda - I'm going to take a look at this in the next few days. About the tests, we use cargo insta for the snapshots. If you install it, you can run cargo insta review after you run the tests in order to review the newly created snapshots and accept/reject them as you see fit. Then commit+push.

EDIT: oops, forgot to link :) https://docs.rs/insta/0.8.1/insta/

@TheLostLambda
Collaborator Author

No matter what I try, I can't get things to pass. Most of the tests are fine and only needed the removal of "Rate", but in the last handful of tests, while all of the lines are always present, they are presented in a random order.

Maybe you could take a look when you get the chance?

@imsnif
Owner

imsnif commented Apr 1, 2020

Off the top of my head this seems like something isn't sorted (eg. when entries have the same amount of bandwidth, they're not also sorted by name) - but that's just a guess. I can take a closer look (and review everything else) over the weekend.
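
Something along these lines would make the order deterministic (just a sketch of that guess, with illustrative types):

    // Sort by bandwidth, descending, and break ties by name so rows with
    // equal bandwidth always come out in the same order.
    fn sort_rows(rows: &mut [(String, u128)]) {
        rows.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    }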

@TheLostLambda
Collaborator Author

Yep, it looks like I was just being a bit silly! Moving from BTreeMaps to HashMaps means that things are no longer sorted by key, and the test packets: "I have come from 1.1.1.1" and "Awesome, I'm from 3.3.3.3" just so happened to be the exact same length!

I've coerced the tests into passing by changing "Awesome, I'm from 3.3.3.3" to "Greetings traveller, I'm from 3.3.3.3".

Everything should pass now :) Good catch!

@imsnif
Owner

imsnif commented Apr 4, 2020

Hey @TheLostLambda - this is some fantastic work. I went over everything and the changes seem solid.

For me, the only thing missing before we can merge is a test for this. I don't think it should be too involved to add one - just pass the flag and see that the traffic is accumulated. What do you think?

Also, fair to note some issues this introduces:

  1. The pause functionality doesn't do what one would expect in this mode. When we pause, we stop collecting data. Which is okay when we're dealing with rate, but a little unexpected when we're collecting total bandwidth. I don't think this is terrible, and a fix for this would involve quite a big architecture change. So I'm fine with leaving this as is and considering adding some sort of warning to it as a separate PR.
  2. In this mode we do not accumulate connections. This is also fine, but might be a little non-intuitive (the bandwidth is accumulated, why not the connections?) I think this can be solved by changing the name of the "Connections" column to "Active Connections". This is a relatively simple change, but will involve playing a little with the UI breakpoints, as this column will now be much wider. This can also definitely be done out of scope in a different PR.

@TheLostLambda
Collaborator Author

Thanks for giving it a look! I can try to put together a test for it today :)

As for the pause and connection issues, they were bugging me too. I tried to accumulate connections, but that would also require some major reworking, as NetworkData currently only holds a count of connections with no way of knowing which ones are unique. I agree that these issues are likely best dealt with in another PR (which I might like to help with as well).

I'll try to add some tests today and let you know how it goes!

@TheLostLambda
Collaborator Author

Sorry for the delay @imsnif ! I've duplicated all of the sustained tests to test the total mode as well.

It seems to work well and the totals add up from snapshot to snapshot.

let (terminal_events, terminal_draw_events, backend) = test_backend_factory(190, 50);

let os_input = os_input_output(network_frames, 3);
let mut opts = opts_ui();
Owner

I think these still use total_utilization: false, don't they?

Collaborator Author

Initially, opts_ui() does have total_utilization set to false, but I'm setting it to true in the line immediately after. Adding another "opts building" function was a bit messy, so I opted for that. I could define an opts_ui_total() function if you'd rather.

The new snapshots from these tests do show totals as opposed to rates ("B" not "Bps"). Perhaps I'm missing something?
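
For reference, the pattern being described is roughly the following sketch (opts_ui() is the existing test helper; the Opt type name and field are assumptions based on the flag discussed here):

    // Start from the shared defaults and flip the flag on the next line,
    // instead of adding a second opts-building helper.
    fn opts_ui_total() -> Opt {
        let mut opts = opts_ui();
        opts.total_utilization = true;
        opts
    }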

Owner

Oh man, I managed to totally miss that line! My bad, sorry. :)

Collaborator Author

Haha, no worries :)

@imsnif
Owner

imsnif commented Apr 5, 2020

Hey @TheLostLambda - I think everything looks great. I'd be happy to go ahead and merge this. Or is there anything still open that I'm missing? (I'll open issues for the above two caveats we talked about after the release).

@TheLostLambda
Collaborator Author

I think that it should be ready for merging :) I've been running it for a couple of days without issues.

Perhaps we could add an issue for displaying the elapsed time when you are in total mode? I can just check how long the process has been running with ps, but having elapsed time in the top right corner or something might be helpful for some. I think that would need to be discussed in a separate issue though.

@imsnif
Owner

imsnif commented Apr 5, 2020

@TheLostLambda - thank you very much for this! I'm excited about releasing this feature. :)

I like your idea about the elapsed time (when in total_utilization mode). Would you like to open an issue about it?

@imsnif imsnif merged commit 62a3946 into imsnif:master Apr 5, 2020
@TheLostLambda
Collaborator Author

TheLostLambda commented Apr 5, 2020

Sure! I'll make that issue now :) Thanks for making it so easy to contribute to this awesome repository!
