Getting started

Installation
Usage
- Getting a node
First impressions and goals
What is a sentinel node
What are the node states
- State transitions methods
- State transitions table
How to create node replicas
The service discovery
- Lifecycle
- Fault tolerance
- How the discovery service works
What happens when duplicate nodes are found
Security
The load balancer
- How the load balancer works
- Node states ignored by the load balancer
The messages
- What does one of these messages look like
- Understanding the parts of a message

First impressions and goals

The architecture of Kable is based on a decentralized service system in which each service keeps in memory a record of the location and status of all the other services on its network.

The main objective of Kable is to make the service discovery process very easy.
Important: do not confuse Kable with other systems or methods of "service discovery", such as those used in Kubernetes or those built on Redis, DNS, etc. Kable is a complex system; you should compare Kable with Apache ZooKeeper, etcd, Consul, etc.
The main differences between Kable and those systems are that it is focused entirely on projects made with Node.js, works in a decentralized way, and has a load balancer.
Kable is designed not to throw exceptions, to be very stable, and to consume few resources while in operation. Since it must be coupled to the logic of your project, exceptions occur only on very important occasions.
Instead of each service having to register, deregister and update its status in a central system, each service is responsible for carrying out this work separately, at a low cost. This may seem unattractive at first, but what benefits does it have?
- It is highly fault tolerant, thanks to its decentralized nature.
- It does not require installing anything outside the Node.js ecosystem.
- You don't need to worry about complex configurations.
- No extra hops: Kable is extremely fast. In a centralized system, many requests are made to achieve a simple task, which is very expensive in terms of performance and resource consumption, and adds network traffic noise.
Why does Kable have its own load balancer?
- Because Kable must support node replication.
- You don't need to worry about setting anything up; the load balancer is smart.
- The architecture of Kable necessarily depends on one to work.
- The load balancer works in conjunction with the service discovery system; together they can work very fast.

Installation

npm install https://github.com/11ume/kable

The project is under development but you can still try it.

Usage

Creating the demo environment

In the following scenario, we have two HTTP services (nodes) that should communicate with each other. The services (nodes) are running on ports 3000 and 3001.

The first node is called foo; this will be its identifier inside your node network, and it looks like this:

import kable from 'kable'
import { createServer } from 'http'

// Create the node identified as foo
const foo = kable('foo')
const server = createServer(async (_req, res) => {
    // Look up the bar node on every incoming request
    const pick = await foo.pick('bar')
    res.end(`service ${pick.id} ${pick.host} ${pick.port} ${pick.state}`)
})

// Tie the node lifecycle to the HTTP server lifecycle
server.on('listening', foo.up)
server.on('close', foo.down)
server.listen(foo.port)

The second node is called bar:

import kable from 'kable'
import { createServer } from 'http'

const bar = kable('bar', { port: 3001 })
const server = createServer(async (_req, res) => {
    const pick = await bar.pick('foo')
    res.end(`Node ${pick.id} ${pick.host} ${pick.port} ${pick.state}`)
})

server.on('listening', bar.up)
server.on('close', bar.down)
server.listen(bar.port)

Now, when we make an HTTP request to the foo service, we receive the information of the bar service. The same happens if we make a request to the bar service: we receive the foo information.

Time to try it:

curl http://localhost:3000

# output
# service bar <ip> 3001 RUNNING

You can see a real world example in action:

Express HTTP services example

In this example the foo service is the most interesting, since the Kable logic is applied inside a middleware. Every time a request is made, the state and the location of the required nodes are checked; this task takes only a fraction of a millisecond.

We will analyze what is happening step by step.

Getting a node

The first thing to do is get a node using the pick method.

foo.pick('bar'): Promise<NodeRegistre>

This method is used to get the information of a particular node.

It must be invoked, for example, every time an HTTP, TCP or UDP request is made to reach another node.

So, you must invoke this method to:

- Know where the node is located.
- Know whether it is available.
- Know whether it has replicas.

Possible scenarios after requesting a node

Flow diagram of the process of getting a node

                                             +------------------+
                                             |    Get a node    |
                                             +------------------+
                                                       ↓
                         +-----------------------------------------------------------+
                         |                  Is the node registered?                  | <────────────┐
                         +-----------------------------------------------------------+              |
                              ↓                                                 ↓                   |
                           +-----+                                           +-----+                |
                           | Yes |                                           | No  |                |
                           +-----+                                           +-----+                |
                              ↓                                                 ↓                   |
          +---------------------------------------+                +-------------------------+      |
          |     Does this node have replicas?     |                |       Wait for it       |      |
          +---------------------------------------+                +-------------------------+      |
             ↓                                 ↓                                ↓                   |
          +-----+                           +-----+                +-------------------------+      |
          | Yes |                           | No  |                |         Timeout         |      |
          +-----+                           +-----+                +-------------------------+      |
             ↓                                 ↓                      ↓                   ↓         |
+--------------------------+      +--------------------------+     +-----+             +-----+      |
|    Get first replica     |      |    Get the only node     |     | Yes |             | No  | ─────┘
|  available immediately   |      |  available immediately   |     +-----+             +-----+
+--------------------------+      +--------------------------+        ↓
                                                            +-------------------+
                                                            |  Throw exception  |
                                                            +-------------------+

The flow chart explanation.

Possible scenarios

The bar node has not yet started or is in an unavailable state.
- The pick method puts the request in a wait queue until the bar node has been announced, then takes the node immediately.
Multiple replicas of the bar node exist.
- The pick method takes the first available replica. On the next invocation, it takes the following replica, applying the round-robin algorithm. Each node internally contains an ordered queue of available nodes.
The bar node is already available and is stored in the node registry of the foo node.
- The pick method takes the node immediately.

Note: If everything is working normally, correctly and with redundancy, the second and third scenarios will be the most probable and the fastest.
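
The first scenario (parking a request until the node is announced, or failing on timeout) can be sketched roughly as follows. This is only an illustration of the idea under stated assumptions, not Kable's actual implementation: createWaitingRegistry, announce and the shape of the node record are hypothetical names, and the 5 second default simply mirrors the wait time mentioned later in this page.

```javascript
// Hypothetical registry with a wait queue: requests for an unknown node
// are parked until that node is announced or the timeout expires.
function createWaitingRegistry(timeoutMs = 5000) {
    const nodes = new Map()   // id -> node record
    const waiting = new Map() // id -> pending requests for that id

    function announce(node) {
        nodes.set(node.id, node)
        const pending = waiting.get(node.id) || []
        waiting.delete(node.id)
        for (const request of pending) {
            clearTimeout(request.timer)
            request.resolve(node)
        }
    }

    function pick(id) {
        // Already registered: answer immediately.
        if (nodes.has(id)) return Promise.resolve(nodes.get(id))
        return new Promise((resolve, reject) => {
            const request = { resolve, timer: null }
            request.timer = setTimeout(() => {
                // Drop this request from the wait queue and give up.
                const rest = (waiting.get(id) || []).filter((r) => r !== request)
                if (rest.length > 0) waiting.set(id, rest)
                else waiting.delete(id)
                reject(new Error(`Timed out waiting for node ${id}`))
            }, timeoutMs)
            waiting.set(id, [...(waiting.get(id) || []), request])
        })
    }

    return { announce, pick }
}

const registry = createWaitingRegistry()
registry.pick('bar').then((node) => console.log('picked', node.id, node.port))
// Later, the bar node is announced and the parked request is resolved.
setTimeout(() => registry.announce({ id: 'bar', host: '127.0.0.1', port: 3001 }), 100)
```

Requests for a node that is already registered resolve at once, which is why the second and third scenarios are the fast path.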

Kable is extremely fast: it can get a series of node records in a small fraction of a second, less than about 50,000 nanoseconds.

To understand what other magic 🧙 things are happening under the hood, see:

The service discovery

The load balancer

Abort get node operations

This operation may be aborted whenever you deem it necessary, using op-abort, a special utility I created.

Promises, unfortunately, have no native cancellation logic; to cancel one it is necessary to use external tools.

npm install op-abort

import kable from 'kable'
import oa from 'op-abort'

(async function() {
    const foo = kable('foo')
    await foo.up()

    const opAbort = oa()
    setTimeout(opAbort.abort, 2000)

    await foo.pick('non-existent-node', { opAbort })
    const { aborted } = opAbort.state
    console.log('The node request was aborted after 2 seconds ', aborted)
}())

We have requested a node that does not exist, so the pick method will wait 5 seconds for non-existent-node to be announced, which will never happen. In some contexts, like the one shown in the previous example, the pick method could block the process for 5 seconds, and without op-abort you would have no way to cancel the operation.
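
To show why an external tool is needed, here is a minimal sketch of the general technique: racing a pending promise against an abort signal. It uses the standard AbortController rather than op-abort, and it is not how op-abort or pick are implemented; abortable and slowLookup are hypothetical names used only for this illustration.

```javascript
// Hypothetical helper: settle as soon as either the operation finishes or
// the signal is aborted. The underlying operation keeps running; only the
// caller stops waiting for it.
function abortable(promise, signal) {
    return new Promise((resolve, reject) => {
        if (signal.aborted) {
            reject(new Error('aborted'))
            return
        }
        const onAbort = () => reject(new Error('aborted'))
        signal.addEventListener('abort', onAbort, { once: true })
        promise.then(
            (value) => {
                signal.removeEventListener('abort', onAbort)
                resolve(value)
            },
            (err) => {
                signal.removeEventListener('abort', onAbort)
                reject(err)
            }
        )
    })
}

// Stand-in for a lookup that would take a long time to answer.
function slowLookup(ms) {
    return new Promise((resolve) => {
        const timer = setTimeout(() => resolve('found'), ms)
        timer.unref() // do not keep the process alive just for this timer
    })
}

const controller = new AbortController()
setTimeout(() => controller.abort(), 50)

abortable(slowLookup(5000), controller.signal)
    .then((value) => console.log('resolved with', value))
    .catch((err) => console.log('stopped waiting:', err.message))
```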

Node state

Each node contains a state machine with five possible states. Nodes can change state using the following methods:

State transitions methods

const foo = kable('foo')

foo.up()
foo.start()
foo.stop()
foo.doing()
foo.down()

Note: The up method can receive a boolean argument; the default is true. When it is true, up places the node directly in the running state, simply so you don't have to invoke up and then start. It may not make sense to you at this moment, but in another context you will need it.

const foo = kable('foo')
foo.up(false) // start kable in up state

const foo = kable('foo')
foo.up() // start kable in running state

As I said, Kable has a state machine, so the passage from one state to another is extremely strict; a transition that is not allowed will throw an exception.

State transitions table

States	Possible transitions
UP	RUNNING - DOING_SOMETHING - STOPPED - DOWN
DOWN	UP
RUNNING	DOING_SOMETHING - STOPPED - DOWN
STOPPED	DOING_SOMETHING - RUNNING - DOWN
DOING_SOMETHING	DOING_SOMETHING - RUNNING - STOPPED - DOWN
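
As a rough illustration, the table above can be encoded as a lookup structure that rejects any transition not listed. This is a sketch of the concept only; TRANSITIONS and transition are hypothetical names and do not reflect Kable's internal code.

```javascript
// Allowed transitions, copied from the table above.
const TRANSITIONS = new Map([
    ['UP', ['RUNNING', 'DOING_SOMETHING', 'STOPPED', 'DOWN']],
    ['DOWN', ['UP']],
    ['RUNNING', ['DOING_SOMETHING', 'STOPPED', 'DOWN']],
    ['STOPPED', ['DOING_SOMETHING', 'RUNNING', 'DOWN']],
    ['DOING_SOMETHING', ['DOING_SOMETHING', 'RUNNING', 'STOPPED', 'DOWN']]
])

// Hypothetical helper: returns the new state, or throws when the
// transition is not listed in the table.
function transition(from, to) {
    const allowed = TRANSITIONS.get(from)
    if (!allowed) throw new Error(`Unknown state: ${from}`)
    if (!allowed.includes(to)) {
        throw new Error(`Transition not allowed: ${from} -> ${to}`)
    }
    return to
}

console.log(transition('UP', 'RUNNING')) // RUNNING
try {
    transition('DOWN', 'RUNNING')
} catch (err) {
    console.log(err.message) // Transition not allowed: DOWN -> RUNNING
}
```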

You will surely use other tools to monitor the status of your nodes, like PM2, but that is not enough for a distributed service system.

It is critical to react before things happen; for this, Kable needs to know with great precision what state the nodes are in.
The load balancing system needs to know what state the nodes are in to work well and fast.
You, and the visualization and control systems, need to know what state your nodes are in.
Kable needs to know when to start and stop, and when to warn that a node is very busy or overloaded.

The states:

UP: This is normally the initial state.
- Indicates that the node has started working but is not yet serving.
RUNNING: This is normally the second state after up, and the first that must be invoked after any of the others.
- Indicates that the node is fully operational and ready to serve.
STOPPED: This state is invoked when you need to stop the node for some reason.
- Indicates that the node is stopped and not serving.
DOING_SOMETHING: This state is invoked when you need to indicate that the node is doing something and is not ready to serve, for example:
- When the node is waiting for another node or an external service.
- When the node's event loop is overloaded, or prone to overload.
DOWN: This state is always the last state.
- Indicates that the node is totally stopped and inoperative.

Duplicate node ids

Important: Kable does not allow duplicate node ids. When any node detects a duplicate node id, it emits an error event called:

duplicate_node_id

Nodes with a duplicate id are ignored by all nodes that already have that id in their list. You can capture the emitted error using the kable internals module:

import kableInternals from 'kable-internals'

const foo = kableInternals('foo')
foo.on('err', ({ event }) => event.duplicate_node_id === 'duplicate_node_id' && console.log(event))

You will also be able to visualize it with the vscode kable tool.

Node sentinels

A sentinel node is a special node prepared to run with the minimum configuration. Its only objective is to observe the status of a particular resource, such as a database or an external service, and then inform the other nodes.

You can see an example of how this works in the examples folder of this repo:

Sentinel example

Node replicas

Replica nodes work in the same way as in all the systems you already know.

This is where the load balancer and service discovery system come into play. You just have to tell Kable two things, then it will do all the smart work for you:

First, indicate the id "foo".
Second, set the replica property to true: { replica: true }.

The first node is called foo

const foo = kable('foo')
foo.up()

The second node is replica of foo

const foo = kable('foo', { replica: true })
foo.up()

Now we have a node called foo and its replica working. So easy, right?

The Service discovery

How the discovery service works

The service discovery system is really fast and automatic.

Kable uses the UDP broadcast method with a broadcast address, by default 255.255.255.0, to locate each node inside the same network.

Each node sends messages to, and receives messages from, the other nodes to report its state of health, location, metadata, and other things. These messages are sent at intervals, by default every 3 seconds, or immediately when a status update is performed in some node.

Each node keeps in memory a record of all the nodes found on its network; this record is updated periodically.

To reduce the amount of data emitted, the messages are serialized via MessagePack, so they are very small.

The messages are emitted every time an event is triggered:

update
- Emitted when the node changes state.
unregistre
- Emitted when the node announces that it will unsubscribe.
advertisement
- Emitted periodically to report what state the node is in, similar to a health check.

Note: Kable also supports unicast and multicast, but it is recommended to always use broadcast.

Important note: In most production environments, like DigitalOcean or AWS EC2, it is not possible to perform UDP broadcasting, so it is necessary to use an overlay network like those provided by Docker Swarm, Docker Compose, Kubernetes or others. In the future Kable could solve this problem by implementing a protocol called SWIM.

Lifecycle

The discovery service starts working when the up method is invoked, and ends when the down method is called.

Fault tolerance

What happens if a node doesn't call the down method?

Well, Kable always tries to emit its termination status. Therefore, if the process is terminated abruptly, it will intercept the termination signal before this happens and will issue the termination status down, with the signal and the exit code.

What happens if a node stops working abruptly without a kill signal?

This would be the worst that could happen, since the node would remain in its last state until it was removed from the records. There is no way to predict in advance that a node will stop working; innumerable factors could cause this. But each node has a node timeout controller that removes the inactive node from its registry once the estimated waiting time is over, by default 3 seconds. In short, your entire system will take 3 seconds to react to this event, but if everything is properly designed and running it should never happen.
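
The idea behind such a timeout controller can be sketched as a registry that remembers when each node was last heard from and evicts the silent ones. This is an assumption-laden illustration, not Kable's code: createTimeoutRegistry, touch and sweep are hypothetical names, and the clock is passed in explicitly so the behavior is easy to follow.

```javascript
// Hypothetical in-memory registry: node id -> time of the last message seen.
// The 3000 ms default mirrors the timeout described above.
function createTimeoutRegistry(timeoutMs = 3000) {
    const lastSeen = new Map()
    return {
        // Call whenever an advertisement or update arrives from a node.
        touch(id, now = Date.now()) {
            lastSeen.set(id, now)
        },
        // Remove every node that has been silent for longer than timeoutMs
        // and return the ids that were removed.
        sweep(now = Date.now()) {
            const removed = []
            for (const [id, seenAt] of lastSeen) {
                if (now - seenAt > timeoutMs) {
                    lastSeen.delete(id)
                    removed.push(id)
                }
            }
            return removed
        },
        ids() {
            return [...lastSeen.keys()]
        }
    }
}

const registry = createTimeoutRegistry()
registry.touch('bar', 0)
registry.touch('baz', 0)
registry.touch('baz', 2500)       // baz keeps advertising, bar went silent
console.log(registry.sweep(3500)) // [ 'bar' ]
console.log(registry.ids())       // [ 'baz' ]
```

In a running system, sweep would be called on a timer; here it is called by hand with fixed timestamps.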

Security

Kable handles the security of the messages it emits and receives through encryption. As explained above, Kable emits UDP messages via the broadcast method, and by default these messages travel in plain text.

Anyone who is on the same network will be able to read and modify these messages using a MitM attack.

To mitigate this, Kable encrypts each message it emits using the AES-256-CBC algorithm.

This will not prevent you from being a victim of a MitM attack, but the attacker will not be able to read the messages or modify them.

For example, you can use the openssl command to generate a key of 32 random bytes (256 bits).

openssl rand -base64 32

This node will now encrypt all its messages and reject all messages coming from other nodes that do not have the same key.

const foo = kable('foo', { key: 'x4wl1vHLBcENpF+vbvnyWqYbNyZ1xUjNDZYAbLROTLE='})
foo.up()
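
For readers unfamiliar with AES-256-CBC, the following sketch shows what encrypting and decrypting a message with a 32-byte shared key looks like using the Node.js crypto module. It is not Kable's actual wire format: the encrypt and decrypt helpers and the choice to prepend the IV to the ciphertext are assumptions made for the example.

```javascript
import { createCipheriv, createDecipheriv, randomBytes } from 'crypto'

// A 32-byte key, base64-encoded as produced by `openssl rand -base64 32`.
const key = Buffer.from('x4wl1vHLBcENpF+vbvnyWqYbNyZ1xUjNDZYAbLROTLE=', 'base64')

// Hypothetical layout: a fresh 16-byte IV followed by the ciphertext.
function encrypt(plain, key) {
    const iv = randomBytes(16)
    const cipher = createCipheriv('aes-256-cbc', key, iv)
    return Buffer.concat([iv, cipher.update(plain), cipher.final()])
}

function decrypt(payload, key) {
    const iv = payload.subarray(0, 16)
    const decipher = createDecipheriv('aes-256-cbc', key, iv)
    return Buffer.concat([decipher.update(payload.subarray(16)), decipher.final()])
}

const message = Buffer.from(JSON.stringify({ id: 'foo', state: 'RUNNING' }))
const payload = encrypt(message, key)
console.log(decrypt(payload, key).toString())
```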

The best way to create and manage keys is to use tools like Vault. You can devise your own way of sharing the keys, but make sure it is safe.

The load balancer

Kable has a smart, implicit load balancer.

How the load balancer works

The load balancer has a queue of nodes in its registry. Every time a node is announced or unsubscribed, the load balancer adds that node to its queue or removes it.

As Kable is based on a series of distributed nodes that start, stop, and change state constantly, the load balancer needs to find the best way to organize the node queue in the same way on each node. To make this possible, each node has a special property called index, which is simply a unique number.

So, given a queue of numeric indexes like [2, 3, 4, 1], it is always possible to sort them in the same way in each node's queue. It is a really simple solution to a complex problem.

Note: The load balancer applies the round-robin algorithm and takes the first node available to work. So each node has the same non-sequential but organized node queue inside, as I explained above.
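
The ordering idea can be illustrated with a small sketch: two nodes receive the same announcements in a different order, yet both end up with the same queue once they sort by index. The index values, the ascending sort and the buildQueue helper are assumptions chosen for the example; how Kable actually orders the indexes is not shown here.

```javascript
// Hypothetical announcements as two different nodes might receive them:
// same nodes, different arrival order.
const seenByBar = [
    { id: 'foo:1', index: 3 },
    { id: 'foo', index: 1 },
    { id: 'foo:2', index: 4 },
    { id: 'foo:3', index: 2 }
]
const seenByBaz = [...seenByBar].reverse()

// Sorting a copy by the unique index gives every node the same queue,
// no matter in which order the announcements arrived.
function buildQueue(nodes) {
    return [...nodes].sort((a, b) => a.index - b.index).map((node) => node.id)
}

console.log(buildQueue(seenByBar)) // [ 'foo', 'foo:3', 'foo:1', 'foo:2' ]
console.log(buildQueue(seenByBaz)) // [ 'foo', 'foo:3', 'foo:1', 'foo:2' ]
```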

In the next example we have the nodes foo, bar and baz, plus a few foo replicas. Let's see how their node queues look:

Nodes work queue demo

Foo

foo
  |  
  ├── baz
  └── bar

Bar

bar
  |  
  ├── baz
  ├── foo
  ├── foo:3
  ├── foo:1
  └── foo:2

Baz

baz  
  |
  ├── bar
  ├── foo
  ├── foo:3
  ├── foo:1
  └── foo:2

Now let's go back to the example in Getting a node, which explains what happens when a node is requested.

Given the organization of the queues shown previously, and knowing, as I said earlier, that the load balancer uses the round-robin algorithm, it is possible to predict the following behavior of these requests:

bar.pick('foo') // foo
bar.pick('foo') // foo:3
bar.pick('foo') // foo:1
bar.pick('foo') // foo:2

baz.pick('foo') // foo
baz.pick('foo') // foo:3
baz.pick('foo') // foo:1
baz.pick('foo') // foo:2

Thanks to this organization, the load is always divided evenly and we do not overload any node.

Now remember that I said Kable has an internal state machine. Well, the load balancer relies on the state of each node to decide whether to take a node or request the next one in the queue.

The following states are totally ignored by the load balancer algorithm; its node queue only contains the nodes that are in an available state:

Node states ignored by the load balancer

States	Ignored
UP	yes
DOWN	yes
STOPPED	yes
RUNNING	no
DOING_SOMETHING	yes

Let's look at an example of this

What the state of the foo node and its replicas looks like

foo running  
  |
  ├── foo3:running
  ├── foo1:stopped
  └── foo2:up

Suppose the foo2 node enters the running state after 2 seconds.

The result would be the following:

baz.pick('foo') // foo
baz.pick('foo') // foo3
baz.pick('foo') // foo
baz.pick('foo') // foo3

// 2 seconds after 
baz.pick('foo') // foo2
baz.pick('foo') // foo
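
The sequence above can be reproduced with a small state-aware round-robin sketch. It is one possible way to model the behavior, not Kable's implementation: createPicker is a hypothetical helper, and instead of removing unavailable nodes from the queue it keeps them in place and skips every node that is not RUNNING.

```javascript
// Queue order taken from the example above; states are mutable on purpose.
const queue = [
    { id: 'foo', state: 'RUNNING' },
    { id: 'foo3', state: 'RUNNING' },
    { id: 'foo1', state: 'STOPPED' },
    { id: 'foo2', state: 'UP' }
]

// Hypothetical picker: walks the queue in a circle and returns the next
// node whose state is RUNNING, or null when none is available.
function createPicker(nodes) {
    let cursor = 0
    return function pick() {
        for (let i = 0; i < nodes.length; i++) {
            const node = nodes[(cursor + i) % nodes.length]
            if (node.state === 'RUNNING') {
                cursor = (cursor + i + 1) % nodes.length
                return node.id
            }
        }
        return null
    }
}

const pick = createPicker(queue)
console.log(pick(), pick(), pick(), pick()) // foo foo3 foo foo3

queue[3].state = 'RUNNING' // foo2 becomes available
console.log(pick(), pick()) // foo2 foo
```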

Note: The node queue organization will always fluctuate, since the data is stored in memory. If a node is killed, its data is lost; when the node starts again it creates a new index number and takes a different place inside the node queues. To predict the behavior outlined above, we must first observe how the node queue is organized.

The messages

Remember that I said Kable sends and receives messages; now let's see what one of these messages looks like and what each part means.

What does one of these messages look like?

{
    id: 'foo'
    , host: '192.168.0.1'
    , port: 3000
    , meta: {
       id: 'foo-service'
       , description: 'is a cool service called foo'
    }
    , hostname: 'DESKTOP-3MFPTDD'
    , state: 'RUNNING'
    , ensured: false
    , ignorable: false
    , adTime: 2000
    , event: 'advertisement'
    , iid: '621a334f-c748-47bd-9f9b-a926d7619a77'
    , pid: 'e993539d-bb12-45e5-beff-b9f1d8da470b'
    , index: 16160494567343020000
    , registre: ['bar', 'baz']
    , replica: {
        is: false
    }
    , stateData: {
        up: {
          time: 1583383484
        },
        doing: {
          time: 1583383486
          , reason: 'trying to reconnect with the database'
        }
    }
    , rinfo: {
        address: '192.168.0.1'
        , family: 'IPv4'
        , port: 5000
        , size: 255
    }
}

Understanding the parts of a message

id: A unique string used to identify the node in your network.
host: Contains the location of the node (ip/dns/socks).
port: Contains the port of the node.
meta: Additional information that briefly describes the node.
hostname: The hostname of the machine where the node runs.
state: Shows the current state of the node.
ensured: A boolean that shows whether the data is being encrypted.
ignorable: A boolean that indicates whether the node must be ignored by the others in the same network.
adTime: Indicates, in milliseconds, how often this node should be announced.
event: The event that triggered the emission of this message.
iid: A unique identifier for the instance of the node.
pid: A unique identifier for the process of the node.
index: A unique number used by the load balancer; see How the load balancer works.
registre: Shows all nodes that have been registered.
replica: Indicates whether the node is a replica of another node.
stateData: Holds status information, such as time and reason, for example the detection of a node.
rinfo: This information comes from the transport module. It is used internally by Kable, but can be used for monitoring and measurement.
