[wip] core.timeline: High-resolution event log for diagnostics #849

Open · wants to merge 17 commits into master

Conversation

@lukego commented Mar 30, 2016

Timeline

Here is a big idea: How about if we create a unified "timeline" of all the events that are happening in a collection of Snabb processes and the hardware they are running on:

  • Processes starting and stopping
  • Engine breaths entering the pull and push steps
  • Apps processing batches of packets
  • Software errors being detected and recovered
  • Processes dropping into kernel space for interrupts and system calls
  • PEBS hardware detecting cache misses and branch mispredictions

The timeline would be a collection of log files that are high-performance (millions of messages per second), high-resolution (cycle precision based on a common time source), and ubiquitous (we would explicitly log events throughout Snabb and source external events e.g. from the kernel too).

This timeline could then serve as a reference for everybody involved in monitoring and troubleshooting the fine-grained behavior of a Snabb application: developers, users, testers, support organizations, and so on.

Here is an example (just a mock-up) of what a timeline might look like when printing the combined logs from one Snabb process and also the PMU hardware to see cache misses and branch mispredictions. Both logs are timestamped using the CPU Time Stamp Counter (TSC) and so they can be shown in parallel:

CYCLE  SYSTEM          ENGINE          NIC             FIREWALL        PEBS
------ --------------- --------------- --------------- --------------- -----------------------------
00000  jit trace #4
00001                  breath #33 pull
00002                                  output 3 pkts
00003                  breath #33 push
00004                                                  sort by proto   branch-miss ip=0xff0
00005                                                                  branch-miss ip=0xff8
00006                                                  process 82 tcp
00007                                                                  cache-miss ip=0xfa0,addr=0xd00
00008                                                                  cache-miss ip=0xfa0,addr=0xd10
00009                                                                  cache-miss ip=0xfa0,addr=0xd10
00010                                                  process 18 udp
00011                                                                  cache-miss ip=0xfb0,addr=0xc00
00012                                                  process other
00013  GC start                                                        cache-miss
00014  GC end                                                          cache-miss
00015                  breath #34 pull

Things to notice in this example:

  • Time is measured in CPU-cycles i.e. sub-nanosecond.
  • The SYSTEM column can tell us what is going on in the LuaJIT runtime system: compiling new JIT traces, garbage collection, and so on. We could also log the addresses of the generated machine code, both for Lua traces and DynASM code, to make them easy to cross-reference.
  • The ENGINE column shows us the breathe loop in operation. Here you could look for e.g. a breath that took an unexpectedly long time to complete (perhaps flagged by your histogram from #795, "Add core.histogram, use it to measure latency of app.breathe()").
  • The NIC and FIREWALL columns show individual processing steps for each app. This can be used directly, like a simple profiler, by application developers to divide-and-conquer performance problems by breaking processing into discrete steps (possibly an excellent thing to do for LuaJIT trace performance anyway; I must talk with @alexandergall about his latest experiences and coding style).
  • The PEBS column tells us what performance events are sampled by the CPU. These can be cross-referenced either temporally ("what else is going on in parallel on the timeline?"), by instruction pointer (which trace / dynasm code does this refer to?), or by address (what kind of memory is this cache miss on: Lua, DMA, or QEMU?). Note that PEBS hardware can log these events into memory and timestamp them with a suitable time source (TSC).

The timeline could be sliced and diced. Perhaps you only want to look at the FIREWALL column to see how many cycles elapse between each processing step. Perhaps you want to post-process the timeline to calculate statistics e.g. histogram of the durations of certain events. Perhaps you want to import the timeline into a GUI tool that lets you zoom in and out to look for patterns and anomalies.
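
For instance, here is a minimal post-processing sketch in Lua, assuming the timeline has already been dumped to a plain-text file with one "cycles category message" line per event (the file name, line format, and event name are hypothetical):

    -- Sketch: bucket the cycle deltas of one kind of event into a
    -- power-of-two histogram. File name and line format are hypothetical.
    local buckets = {}
    for line in io.lines("timeline.dump") do
       local cycles, message = line:match("^%s*(%d+)%s+%S+%s+(.*)$")
       if cycles and message:match("^process") then
          local b = math.floor(math.log(tonumber(cycles) + 1) / math.log(2))
          buckets[b] = (buckets[b] or 0) + 1
       end
    end
    for b, count in pairs(buckets) do
       print(("2^%d..2^%d cycles: %d events"):format(b, b + 1, count))
    end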

Implementation notes

The log function is written in assembler and the selftest function is able to log one message every 36 CPU cycles. I hope that the actual cost will be less in real uses due to instruction-level parallelism. I am not sure exactly how much overhead is acceptable. I believe that the RDTSCP instruction is providing quite a good time source: on the one hand not interfering with out-of-order execution and on the other hand not being overly confused by it.
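
To make the overhead discussion concrete, here is a purely illustrative sketch of what a fixed-size log entry could look like as a LuaJIT FFI struct. The field names and the 64-byte size are assumptions, not the actual core.timeline format:

    local ffi = require("ffi")
    ffi.cdef[[
       struct timeline_entry {
          uint64_t tsc;          /* RDTSCP timestamp (CPU cycles) */
          uint64_t msgid;        /* index into a table of message format strings */
          double   arg[4];       /* up to four numeric arguments */
          uint64_t reserved[2];  /* padding up to one 64-byte cache line */
       };
    ]]
    assert(ffi.sizeof("struct timeline_entry") == 64)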

Related work

This idea is essentially a successor to the lib.blackbox (#754) idea. The trouble with the blackbox idea is that you need a GDB-like debugger to understand the log. The timeline seems simpler and more intuitive.

A Google image search turned up the QNX profiler as the GUI nearest to what I have in mind.

Counters are one more thing that we need to instrument the codebase with ubiquitously. Could we do both at the same time? (Perhaps we could unify events and counters, e.g. incrementing a counter logs an event?)

There is an Intel whitepaper explaining more about the RDTSCP instruction as a time source: How to benchmark code execution times.

Feedback

If you have thoughts or see problems then leave a comment and I will try to reference/summarize that up here on the PR :).

@lukego commented Mar 30, 2016

I missed a reference to Agner Fog's instruction tables, which seem to tell us that the RDTSCP instruction is high-latency (takes a while to produce a result), not pipelined (can't execute multiple RDTSCP instructions in parallel), and does not consume execution units (won't slow down execution of other instructions while waiting for the result). This is why the selftest producing a log message every 36 cycles seems fairly reasonable, and also why the cost may be lower in real usage.

@plajjan commented Mar 30, 2016

Awesome stuff. I suppose this would be very nice for looking at the flow of programs and seeing hot spots.

@RossBencina commented

HDR Histogram is not the same thing, but possibly worthy of a see-also: https://github.com/HdrHistogram/HdrHistogram

@lukego commented Apr 1, 2016

A quick braindump on the latest details I am working out:

  • Logging every event all the time would produce too much data to keep. So how to manage that?
  • Events could be logged into ring buffers that wrap around. Each category of event (e.g. system, engine, app instance, ...) could have its own buffer containing e.g. the latest 1MB of messages in that category. These buffers could be shared memory (core.shm objects) that snabb-top could attach to, "tail -f" style, and write to disk if you want to store the full stream.
  • Breaths could be sampled. Perhaps you record SYSTEM and ENGINE events all of the time but for really detailed app events you only record 1% of breaths. This way you would still have the complete set of events for a subset of the breaths. (This would also reduce the overhead from e.g. 1% to 0.01%, which should make it a no-brainer to enable in production.)
  • Sampling that works across processes/cores should be doable. This could be synchronized via the common clock (TSC), e.g. fully record the next 10 breaths each time the TSC passes a multiple of 10^9 (a rough sketch follows this list).
  • PEBS is awkward: if it randomly samples hardware events then it will randomly supply information about some breaths and not others. Could be that it doesn't fit into the timeline picture so well (even if the timeline could be useful for statistically analyzing PEBS events).
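
Here is a minimal sketch of the TSC-synchronized sampling idea from the list above; the names and numbers are placeholders, not the branch's actual code:

    -- Whenever the shared TSC crosses the next multiple of `interval`, fully
    -- record the following `sample_breaths` breaths; otherwise only log the
    -- coarse (SYSTEM/ENGINE) events. Separate processes reading the same TSC
    -- make the same decision and so sample the same windows of time.
    local interval, sample_breaths = 1e9, 10
    local next_sample_tsc, breaths_left = 0, 0

    local function detailed_logging_enabled (tsc)
       if tsc >= next_sample_tsc then
          breaths_left = sample_breaths
          next_sample_tsc = (math.floor(tsc / interval) + 1) * interval
       end
       if breaths_left > 0 then
          breaths_left = breaths_left - 1
          return true
       end
       return false
    end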

I am now aiming to quickly make this into something we can merge and then talk about enabling by default. The initial use cases and goals that I have in mind now are:

  • Log into shm files that will be persistent after process termination/crash.
  • Have the CI keep these files and make them available for analysis.
  • Ensure that overhead on throughput and latency is immeasurably small.

So far in simple tests it seems like the basic overhead of calling an assembler routine is extremely low but I am not sure at what rate it is still efficient to call RDTSCP. Could be that efficiency also demands sampling rather than logging every event of every breath.

@lukego commented Apr 6, 2016

I added a timeline.md readme file explaining the idea and how the interface has evolved. Feedback would be welcome!

I have a much-updated implementation but am still in the process of making it match up with the API.

I spotted two typos and fixed them (entires -> entries).

And a question: In the example, shouldn't the string `example` appear somewhere in the resulting log ("The log now contains there entries")?

@sleinen commented Apr 6, 2016

Looks good to me. I was left with a few open questions:

  • What is the API for enabling/disabling error logging—I guess that this involves modifying the minimum priority that should be logged.
  • What are the defaults for the optional entries and stringtablesize arguments (and when would I want to care about stringtablesize)? Though these are the kind of details that I wouldn't mind UTSLing for.

lukego and others added 4 commits April 6, 2016 11:58
Typo fixes, and a question
Added category to example log output (thanks for feedback @sleinen).

Added functions: priority(), save(), dump()

Fixed formatting (more word wrap)
Updated the timeline implementation to match the documented API.

This is still a work in progress:

The messages printed by the selftest all have zeros for arguments for
unknown reasons;

The overhead of logging has doubled for unknown reasons, possibly due
to allocating with core.shm and putting a performance-critical
cacheline on a 4096-aligned address that may trigger associativity
problems (wild speculation, have to dig in).
@lukego commented Apr 6, 2016

Thanks for the feedback, @sleinen! I hope this is clearer now with the latest pushes?

The current version should be basically feature-complete now. I plan to do one more round of bug-fixes and then re-introduce the branch as something that people can use and experiment with.

The module now includes a really simple dumper that prints messages in reverse-chronological order with the timestamp deltas:

selftest: save and dump
  cycles category         message
       0 selftest         event1 a(0) b(0) c(0) d(0)
      96 selftest         event3
     100 selftest         event2 a(0)
      84 selftest         event1 a(0) b(0) c(0) d(0)
      80 selftest         event3
     100 selftest         event2 a(0)
      88 selftest         event1 a(0) b(0) c(0) d(0)
      76 selftest         event3
     104 selftest         event2 a(0)
      84 selftest         event1 a(0) b(0) c(0) d(0)
      80 selftest         event3

This is meant mostly as an example and I hope that we will evolve multiple interesting ways to process the timeline files. (I like the way even this simple dumper is sufficient to show that logging performance in this version is a bit irregular and slower than in earlier versions, i.e. a performance issue to resolve. Hopefully it will be able to illustrate bits of apps that slow down in the same way!)

I realized that when we take the timestamp with RDTSCP this is also
loading the current CPU core number and NUMA node into ECX. This patch
includes that register in the log dump. This seems useful so that we can
account for scheduling issues from the log.

Strictly speaking the value in ECX is the contents of the TSC_AUX
register but it seems that we can rely on Linux to load the core ID +
NUMA node there. Here is where I got the idea:
http://stackoverflow.com/questions/22310028/is-there-an-x86-instruction-to-tell-which-core-the-instruction-is-being-run-on
The string pattern matching for detecting arguments ($arg) was wrong and
this caused all args to be logged as 0.
Show core number and NUMA node.

Write slightly better log messages for selftest.

Simplify the logging loop in the selftest to simply log three messages
per iteration. This way the timestamps will indicate the logging
overhead. (Previous version had some funny branches that could add a few
dozen cycles to the timestamps which is a bit distracting.)
@lukego commented Apr 7, 2016

I fixed the issue with arguments being 0-valued (was a string matching bug) and too many cycles between messages (was because I had made the loop that calls the logger more complicated).

I also added a nice feature that the core ID and NUMA node are stored for each log message. This information turns out to be available "for free" because the instruction that grabs the timestamp (RDTSCP) also loads this into a register. See http://stackoverflow.com/questions/22310028/is-there-an-x86-instruction-to-tell-which-core-the-instruction-is-being-run-on.
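
For reference, here is a tiny sketch of decoding that register value. It assumes the Linux convention of storing (numa_node << 12) | core_id in TSC_AUX, which is worth verifying on your own kernel:

    local bit = require("bit")
    local function decode_tsc_aux (aux)
       local core = bit.band(aux, 0xfff)  -- low 12 bits: CPU core number (Linux convention)
       local node = bit.rshift(aux, 12)   -- remaining bits: NUMA node (Linux convention)
       return core, node
    end
    print(decode_tsc_aux(0x1004))  --> 4   1  (core 4 on NUMA node 1)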

Here is how the selftest output looks now:

selftest: timeline
median logging time: 40 cycles
selftest: save and dump
numa core  -cycles category         message
0    4      <last> selftest         event with no args
0    4          60 selftest         event with one arg: i(1048575)
0    4          40 selftest         event with four args: i(1048575), 2i(2097150), 3i(3145725), and 4i(4194300)
0    4          40 selftest         event with no args
0    4          36 selftest         event with one arg: i(1048574)
0    4          40 selftest         event with four args: i(1048574), 2i(2097148), 3i(3145722), and 4i(4194296)
0    4          40 selftest         event with no args
0    4          36 selftest         event with one arg: i(1048573)
0    4          40 selftest         event with four args: i(1048573), 2i(2097146), 3i(3145719), and 4i(4194292)
0    4          40 selftest         event with no args
0    4          36 selftest         event with one arg: i(1048572)
0    4          36 selftest         event with four args: i(1048572), 2i(2097144), 3i(3145716), and 4i(4194288)
0    4          44 selftest         event with no args
0    4          36 selftest         event with one arg: i(1048571)
0    4          36 selftest         event with four args: i(1048571), 2i(2097142), 3i(3145713), and 4i(4194284)
0    4          40 selftest         event with no args
selftest: ok

@lukego commented Apr 7, 2016

I implemented the priority() function now. This makes it possible to selectively enable the level of detail that you want. I am imagining that in the engine we could randomly choose to enable higher levels of detail for certain breaths so that on the one hand we always have interesting samples in the timeline but on the other hand it does not wrap around too quickly.

I also renamed the priorities in a hopefully more helpful way. The main ones are:

  • trace for occasional messages.
  • app for per-breath messages e.g. messages that you print in your pull and push functions.
  • packet for per-packet messages e.g. messages that you print inside loops in your pull and push functions.
  • library for messages inside subroutines that could potentially be called very often e.g. checksum routine.

I hope this makes it easy to choose the right priority when defining messages and deciding how much logging to enable.
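
Conceptually the filtering could work like the sketch below. This is just an illustration of the intended semantics; the numeric mapping and function names are assumptions, not the module's actual code:

    -- Each event carries one of the levels above; an event is recorded only
    -- if its level is within the currently enabled amount of detail.
    local levels = { trace = 1, app = 2, packet = 3, library = 4 }
    local enabled = levels.app            -- e.g. after priority("app")

    local function should_log (event_level)
       return levels[event_level] <= enabled
    end

    assert(should_log("trace"))           -- occasional messages: logged
    assert(should_log("app"))             -- per-breath messages: logged
    assert(not should_log("packet"))      -- per-packet messages: skipped
    assert(not should_log("library"))     -- subroutine-level messages: skipped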

Fixed an off-by-one error (braino) in choosing which timestamps to
calculate the delta from.

Make the cycle column wider to better accommodate large values.
The selftest function now executes a few loops to see how long it takes
to log a million events that are disabled by priority.

Log excerpt:

    numa core      -cycles category         message
    0    3         5014856 selftest         invoked many events(1000000)
    0    3         5025592 selftest         invoked many events(1000000)
    0    3         5176004 selftest         invoked many events(1000000)

So looks like five cycles per event is the ballpark cost (including
the loop that makes the individual FFI calls).
@lukego commented Apr 7, 2016

I added a few "log a million priority-disabled events in a loop" to the selftest to estimate the cost of calling an event. Looks like around five cycles per event in total (1m event loop in 5m cycles). See 9b659f9.

Could be that you could even reasonably estimate the cost of the JIT compilation (interpretation, tracing, optimization, code generation) on the first iteration of the event logging loop:

numa core      -cycles category         message
0    3         5014856 selftest         invoked many events(1000000)
0    3         5025592 selftest         invoked many events(1000000)
0    3         5176004 selftest         invoked many events(1000000)

The first iteration takes around 150K cycles (~42 microseconds on this machine) longer than the next two. That strikes me as pretty fast compilation! Though some experience with timeline will be needed to know how safely such conclusions can be drawn from the logs.
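
The arithmetic behind that estimate, as a quick sanity check (the TSC frequency is an assumption about this machine):

    local first, second = 5176004, 5025592  -- cycles for the oldest and the next iteration
    print(second / 1e6)                     --> ~5 cycles per disabled event
    print(first - second)                   --> ~150e3 extra cycles on the first pass
    print((first - second) / 3.6e9 * 1e6)   --> ~42 microseconds at an assumed ~3.6 GHz TSC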

@lukego commented Apr 7, 2016

Idea for the future: I notice that we have room for a 40-bit performance counter in the log entry. It could be really neat to log the number of instructions in addition to the number of cycles. Then you could see the rate of instructions-per-cycle (IPC) for every section of code. This could be handy as a heuristic to indicate whether or not you are really suffering from things like cache misses and branch mispredictions: an IPC of 1-2 might be fine, more than 2 might be really good, and less than 1 not so great. In other words, if you want to optimize code, the IPC hints at whether you should be trying to execute fewer instructions (e.g. making the JIT happier) or making your instructions run faster (e.g. trying harder to keep things in cache or keep branches predictable).
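
Here is a sketch of how that could be read off the timeline, assuming (hypothetically) that each log entry also carried an instructions-retired count alongside its cycle timestamp:

    -- Instructions-per-cycle between two adjacent log entries.
    -- The 'tsc' and 'instructions' fields are hypothetical.
    local function ipc (previous, entry)
       local cycles       = entry.tsc - previous.tsc
       local instructions = entry.instructions - previous.instructions
       return instructions / cycles
    end

    -- e.g. 180 instructions retired over 120 cycles gives an IPC of 1.5
    print(ipc({tsc = 1000, instructions = 5000}, {tsc = 1120, instructions = 5180}))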
