KamilCuk edited this page Jul 16, 2024 · 11 revisions

Why it exists?

I was always frustrated that I could not get the logs of all tasks of a Nomad job on startup. Navigating the UI to get to the jobs was also hard, and the logs are split between stderr and stdout with no timestamp information. I also wanted to run a job from the terminal, get all the logs of this job, and exit with the exit status of the task in the job.

This tool exists to do that.

How it works?

I tried to structure the code in an object-oriented fashion with a pyramid of dependencies. Still, there are some global objects related to the Nomad connection and signal handling.

Given a job or allocation to watch, a "database" thread is spawned. This thread listens on the Nomad event stream for all events related to the job or allocation. Each event updates an internal database of events. Everything is accumulated and kept in dictionaries.
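
A minimal sketch of such a "database" thread follows. It is not the actual nomadtools code: the class and function names are hypothetical, the event source is any iterable of already-decoded events, and the `Topic` and `Payload` field names are assumed to match the JSON emitted by the Nomad event stream.

```python
import threading
from typing import Dict, Iterable


class EventDatabase:
    """Accumulates events into dictionaries keyed by topic and object ID."""

    # Topics tracked by this sketch (assumed to match Nomad's event topics).
    TOPICS = ("Job", "Evaluation", "Allocation", "Deployment")

    def __init__(self) -> None:
        self.lock = threading.Lock()
        self.objects: Dict[str, Dict[str, dict]] = {t: {} for t in self.TOPICS}

    def handle(self, event: dict) -> None:
        """Store the object carried by one event, replacing any older copy."""
        topic = event.get("Topic")
        obj = event.get("Payload", {}).get(topic)
        if topic not in self.objects or obj is None:
            return  # not a topic this sketch tracks
        with self.lock:
            self.objects[topic][obj["ID"]] = obj

    def listen(self, events: Iterable[dict]) -> None:
        """Consume events until the source is exhausted."""
        for event in events:
            self.handle(event)


def start_database_thread(db: EventDatabase, events: Iterable[dict]) -> threading.Thread:
    """Run the listener in a background thread and return the thread."""
    thread = threading.Thread(target=db.listen, args=(events,), daemon=True)
    thread.start()
    return thread
```

In a real setup the `events` iterable would be a generator reading from the Nomad event stream; here it can be a plain list, which makes the sketch easy to try out.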

However, on startup the database is empty, so it does not have a clear view of what is currently in Nomad. After the database thread has been started in the background and has connected to the Nomad event stream, a separate thread executes queries to get the current information about the job, job versions, deployments, allocations and evaluations. This is done to populate the initial database state.
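
Because the initial queries run while events are already arriving, a query result may be older than an event that was stored a moment earlier. One possible way to handle this, shown here purely as an assumption and not as the nomadtools implementation, is to compare the `ModifyIndex` that Nomad objects carry and keep the newer copy. The `upsert` helper is hypothetical.

```python
from typing import Dict


def upsert(table: Dict[str, dict], obj: dict) -> bool:
    """Store obj unless the table already holds the same or a newer version.

    Returns True when the table was changed.
    """
    current = table.get(obj["ID"])
    if current is not None and current.get("ModifyIndex", 0) >= obj.get("ModifyIndex", 0):
        return False  # stale or duplicate data, keep what we have
    table[obj["ID"]] = obj
    return True
```

With a helper like this, both the event listener and the initial query thread can write into the same dictionaries in any order.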

The "main" thread then listens for changes in the database. For every allocation in the running state, and for every started task within such an allocation, two threads are spawned: one watching stdout of the task and the other watching stderr. If an allocation terminates, the threads are given some time to collect all events from Nomad and terminate.
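
The bookkeeping for these log threads could look roughly like the sketch below. The `LogWatchers` class and the `follow` callback are hypothetical, and treating a task as "started" when its `State` is `running` is a simplification made for this example.

```python
import threading
from typing import Callable, Dict, Tuple

# (allocation ID, task name, "stdout" or "stderr")
LogKey = Tuple[str, str, str]


class LogWatchers:
    """Starts one thread per (allocation, task, stream), at most once."""

    def __init__(self, follow: Callable[[str, str, str], None]) -> None:
        self.follow = follow  # hypothetical function that streams one log
        self.threads: Dict[LogKey, threading.Thread] = {}

    def update(self, alloc: dict) -> None:
        """Start missing log threads for started tasks of a running allocation."""
        if alloc.get("ClientStatus") != "running":
            return
        for task, state in (alloc.get("TaskStates") or {}).items():
            if state.get("State") != "running":
                continue  # simplification: only running tasks count as started
            for stream in ("stdout", "stderr"):
                key = (alloc["ID"], task, stream)
                if key in self.threads:
                    continue  # this stream is already being watched
                thread = threading.Thread(target=self.follow, args=key, daemon=True)
                self.threads[key] = thread
                thread.start()

    def join(self, timeout: float) -> None:
        """Give every log thread up to `timeout` seconds to finish."""
        for thread in self.threads.values():
            thread.join(timeout)
```

Calling `update` on every database change is safe, because threads that already exist are skipped.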

Additionally, depending on the mode of operation, a different terminating condition is checked on each "change event" from the database. For example, the loop never terminates, or terminates when all deployments have finished, or when the current job version finishes. Additionally, "old" allocations - allocations coming from previous versions of the job - might be ignored to reduce the output.
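
Selecting a terminating condition per mode can be sketched as a table of predicates. Everything here is illustrative: the mode names, the `state` summary dictionary and the `JobVersion` field on allocations are assumptions, not the names used by nomadtools.

```python
from typing import Callable, Dict, List

# Hypothetical summary of the database, recomputed on each change event.
State = Dict[str, bool]

END_CONDITIONS: Dict[str, Callable[[State], bool]] = {
    "forever": lambda s: False,
    "deployments_done": lambda s: not s["active_deployments"],
    "job_version_finished": lambda s: s["current_version_finished"],
}


def should_stop(mode: str, state: State) -> bool:
    """Evaluate the terminating condition selected by the mode."""
    return END_CONDITIONS[mode](state)


def visible_allocations(allocs: List[dict], job_version: int, include_old: bool = False) -> List[dict]:
    """Drop allocations of previous job versions unless include_old is set."""
    return [a for a in allocs if include_old or a.get("JobVersion") == job_version]
```

The main loop would then call `should_stop` after every change event and exit when it returns true.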

Additionally, whenever something "interesting" happens to the state of an evaluation, allocation or deployment, it is printed. In particular, when an allocation starts, a message reports that it started and the NodeName of the machine it runs on. When a deployment finishes with a healthy or failed health check, this is also displayed. The Nomad UI displays the same information in a similar fashion.

How to use it?

I mostly use nomadtools watch run to apply changes that I have made to a job. With it, I can observe the logs and confirm whether the deployment succeeded without leaving the terminal. I can stop it with ctrl+C whenever I am happy with the progress. Then I can make additional changes to the job and run nomadtools watch run again to repeat the process.

Sometimes for debugging a job, I execute nomadtools watch job. This allows me to follow logs of all started tasks of a particular Nomad job. It helps in debugging errors.

Finally, there is nomadtools watch --attach run, which can be used to run one-off jobs. The logs of such jobs are printed to the terminal, and sending ctrl+C to nomadtools stops the job, keeping the two in sync.

For running services (mostly for unit testing) I added notifications that nomadtools can send. Consider a program that requires mysql to operate and has to wait until mysql is available:

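# On exit, remove the FIFO and purge the mysql job.
trap 'rm -f "$fifo"; nomadtools run purge mysql' EXIT
fifo=$(mktemp -u)
mkfifo "$fifo"
# Start mysql in the background and ask for a notification on the FIFO.
nomadtools run --notifystarted="$fifo" mysql.nomad.hcl &
# Block until the notification arrives.
read -r _ <"$fifo"
nomadtools run --attach --purge some_job_that_depends_on_mysql.nomad.hcl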

The notifystarted event is sent only after all job services are available.

Gif?

I recorded the gif before I renamed the tool to nomadtools with one entrypoint.

What is nomad-watch?

Before I started making nomadtools, nomad-watch existed as a separate Python script that I created for single one-shot operations. After some time, I decided to start this package and merge the common parts of the scripts.

The scripts in the source code are still very much separate. A single entry point in the form of nomadtools eases deployment, version checking and bash-completion installation.

What are the available job states?

Throughout the life of a Nomad job, there are particular states that are not straightforward to determine but are relevant for operations.

A job has "finished starting" when:

  • it has no active deployments
  • it has no active evaluations
  • it has no pending allocations
  • it has no running allocations associated with previous versions of the job
  • exactly job.group.count allocations are running, no more, no less
  • all "main tasks" of the job have been started

"Main tasks" of a job are tasks that:

  • do not have lifecycle
  • or have lifecycle prestart with sidecar = true
  • or have lifecycle poststart
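
The "main task" rule above can be written as a small predicate. This is a sketch that assumes the task is a dictionary shaped like the JSON form of a Nomad job, where the lifecycle block appears as `Lifecycle` with the fields `Hook` and `Sidecar`; the function name is hypothetical.

```python
from typing import Optional


def is_main_task(task: dict) -> bool:
    """Return True for tasks that count as "main tasks" of a job."""
    lifecycle: Optional[dict] = task.get("Lifecycle")
    if not lifecycle:
        return True  # no lifecycle block at all
    hook = lifecycle.get("Hook")
    if hook == "prestart":
        return bool(lifecycle.get("Sidecar"))  # only prestart sidecars qualify
    return hook == "poststart"
```

Under this rule a plain prestart task (for example a one-shot init step) and a poststop task are not main tasks.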

A Nomad job is "finished" when the job has:

  • no active deployments
  • no active evaluations
  • no active allocations

An inactive deployment is a deployment with status cancelled, failed or successful.

An active evaluation is an evaluation with status pending or blocked.

An active allocation is an allocation with status pending or running.
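
Put together, the "finished" check can be sketched as follows. The function names are hypothetical, and reading the allocation status from the `ClientStatus` field is an assumption of this example.

```python
from typing import Iterable

INACTIVE_DEPLOYMENT_STATUSES = {"cancelled", "failed", "successful"}
ACTIVE_EVALUATION_STATUSES = {"pending", "blocked"}
ACTIVE_ALLOCATION_STATUSES = {"pending", "running"}


def deployment_active(deployment: dict) -> bool:
    return deployment["Status"] not in INACTIVE_DEPLOYMENT_STATUSES


def evaluation_active(evaluation: dict) -> bool:
    return evaluation["Status"] in ACTIVE_EVALUATION_STATUSES


def allocation_active(allocation: dict) -> bool:
    # Assumption: the relevant status is the allocation's ClientStatus.
    return allocation["ClientStatus"] in ACTIVE_ALLOCATION_STATUSES


def job_finished(
    deployments: Iterable[dict],
    evaluations: Iterable[dict],
    allocations: Iterable[dict],
) -> bool:
    """A job is finished when nothing related to it is active anymore."""
    return not (
        any(deployment_active(d) for d in deployments)
        or any(evaluation_active(e) for e in evaluations)
        or any(allocation_active(a) for a in allocations)
    )
```

Note that a job with no deployments, evaluations or allocations at all also counts as finished under this definition.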
