Implementation of a distributed job scheduling mechanism using OMNeT++

fabiocarminati/job_scheduling

DISTRIBUTED JOB SCHEDULING

This repository contains the Distributed System project (2019-2020) by Fabio Carminati and Filippo Carloni.

We used OMNeT++ to model and implement an infrastructure that manages jobs submitted to a cluster of Executors.

Assumptions:

The channels between the various entities are bi-directional and assumed to be reliable. In particular, every channel is assumed to behave like a TCP connection, so unicast communication is possible between entities.

All the simulation parameters (job complexity, timeouts, channel delays, ...) can be modified in the omnetpp.ini file. The sending rate, the failure timeout and the job complexity are drawn from exponential distributions rather than fixed to deterministic values, so that the system behaves more realistically. For these three parameters only the mean is specified.
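To illustrate what specifying only the mean implies, here is a minimal sketch of drawing an exponentially distributed value from its mean. It uses the C++ standard library rather than OMNeT++'s own random number functions, and the names `sampleExponential` and `meanValue` are hypothetical, not taken from the project.

```cpp
#include <random>

// Hypothetical helper (not from the project): draws one value from an
// exponential distribution that is described only by its mean.
// std::exponential_distribution is parameterised by the rate, 1 / mean.
template <class Rng>
double sampleExponential(double meanValue, Rng& rng) {
    std::exponential_distribution<double> dist(1.0 / meanValue);
    return dist(rng);
}
```

Calling `sampleExponential(2.0, rng)` repeatedly with a seeded `std::mt19937` yields non-negative values whose average approaches 2.0 as the number of draws grows.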

Entities:

There are four main entities in the system, each implemented by its own class:

  1. Job
  2. Executor
  3. Storage
  4. Client

Jobs:

Jobs are represented in the system by the JobMessage. JobMessages are exchanged among the various entities in the system and contain all the essential information to represent a Job:

  • RelativeJobId 
  • ClientId  
  • OriginalExecId   
  • ActualExecId
  • JobComplexity
  • ReRouted
  • NewJob 
  •  
  • IsEnded
  • StartingTime
  • EndingTime
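The named fields above can be pictured as a plain C++ struct. This is only a sketch: the field types and default values are assumptions, `JobInfo` and `wasMoved` are hypothetical names, and the project's actual JobMessage definition may differ.

```cpp
// Assumed plain-struct view of the fields named in the list above.
// Types and defaults are guesses made for illustration only.
struct JobInfo {
    int relativeJobId = 0;
    int clientId = 0;
    int originalExecId = 0;   // Executor that first received the job
    int actualExecId = 0;     // Executor that effectively runs the job
    double jobComplexity = 0.0;
    bool reRouted = false;
    bool newJob = true;
    bool isEnded = false;
    double startingTime = 0.0;
    double endingTime = 0.0;
};

// Hypothetical helper: treats a job as moved by load balancing when the
// effective Executor differs from the original one (an assumption).
inline bool wasMoved(const JobInfo& job) {
    return job.actualExecId != job.originalExecId;
}
```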

Executor:

The Executors execute the jobs received from the Clients in a distributed way.

The Executors communicate with:

  • Clients (status and job requests) 
  • Storages (send and receive internal status) 
  • Other Executors (load balancing operations and status) 

The Executor is the only entity in the system that can fail. Only crash failures are allowed (the partial computation of a job and the internal state are lost); Byzantine and timing failures are not considered.

An Executor can be in one of three main modes:

  1. Normal Mode
  2. Failure Mode: ignores all incoming packets until timeoutFailureDuration expires
  3. Reboot Mode: ignores all incoming packets except those from the Storage

The Executors can crash (with different probabilities) either at the reception of each message or in the middle of computation.
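The three modes can be sketched as a small message filter. The enum and function names below are hypothetical, and treating Normal Mode as accepting every message is an assumption, since the text only states what the other two modes ignore.

```cpp
// Hypothetical names; only the three modes themselves come from the text.
enum class Mode { Normal, Failure, Reboot };
enum class Sender { Client, Executor, Storage };

// Returns true when an Executor in the given mode would process a
// message coming from the given kind of sender.
inline bool acceptsMessage(Mode mode, Sender from) {
    switch (mode) {
        case Mode::Normal:  return true;                     // assumed: accepts everything
        case Mode::Failure: return false;                    // ignores all incoming packets
        case Mode::Reboot:  return from == Sender::Storage;  // only the Storage is heard
    }
    return false;
}
```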

Storage:

Each Executor has its own Storage connected to it. Each Storage saves the internal status of its Executor permanently and reliably.

The Storage CANNOT crash. Each Storage saves the jobs in four different std::map containers using a simple protocol:

  1. Insert operation: if a job arrives for a specific map and is not already in it, the Storage saves the job in that map
  2. Delete operation: if the job already exists in that map, the Storage deletes it from that map
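The insert/delete protocol amounts to a toggle on a single map. A minimal sketch, assuming jobs are keyed by the (ClientId, RelativeJobId) pair; the key choice and the names `StoredJob`, `JobMap` and `toggleJob` are hypothetical.

```cpp
#include <map>
#include <utility>

// Hypothetical record and key; the project may store different data.
struct StoredJob {
    int clientId;
    int relativeJobId;
    double jobComplexity;
};
using JobKey = std::pair<int, int>;          // assumed: (ClientId, RelativeJobId)
using JobMap = std::map<JobKey, StoredJob>;

// Applies the protocol to one map: inserts the job when it is absent,
// deletes it when it is present. Returns true if the job was inserted.
inline bool toggleJob(JobMap& jobs, const StoredJob& job) {
    const JobKey key{job.clientId, job.relativeJobId};
    auto it = jobs.find(key);
    if (it == jobs.end()) {
        jobs.emplace(key, job);
        return true;
    }
    jobs.erase(it);
    return false;
}
```

With this shape the sender does not need to say whether a message means "save" or "remove": receiving the same job a second time for the same map undoes the first operation.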

Client:

Clients communicate only with the Executors. They periodically send new jobs (each to a randomly selected Executor), which are then added to the notComputed cArray, and they request information about the status of the Jobs (completed or not completed).

Each Client is completely unaware of whether load balancing is performed for a Job (e.g. which Executor actually runs it, ActualExecId).

If an Executor doesn’t respond to a new Job request with a JobId within a given period, the Job is resent up to a given number of times; after that, another Executor is selected for that Job.
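The resend-then-switch rule can be sketched as follows. This is an illustration under assumptions: the names `PendingJob`, `nextTarget` and `maxResends` are hypothetical, Executors are assumed to be numbered from 0, and picking the replacement Executor uniformly at random is a guess.

```cpp
#include <random>

// Hypothetical bookkeeping for a job still waiting for its JobId.
struct PendingJob {
    int executor;     // Executor currently being tried
    int resendsLeft;  // resends still allowed towards that Executor
};

// Called when the timeout expires without a reply. Returns the Executor
// the job should be sent to next and updates the bookkeeping.
template <class Rng>
int nextTarget(PendingJob& job, int executorCount, int maxResends, Rng& rng) {
    if (job.resendsLeft > 0) {
        --job.resendsLeft;            // try the same Executor again
        return job.executor;
    }
    if (executorCount > 1) {
        // Pick among the other Executors by skipping the current index.
        std::uniform_int_distribution<int> pick(0, executorCount - 2);
        int candidate = pick(rng);
        if (candidate >= job.executor) ++candidate;
        job.executor = candidate;
    }
    job.resendsLeft = maxResends;     // fresh budget for the new target
    return job.executor;
}
```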

Periodically the Clients ask the respective Executors (OriginalExecId) for the status of the Jobs, and they move all the Jobs to the noStatusInfo cArray (assuming that there was a crash while processing that Job).

When a status response arrives, if the Job is completed it is removed and its completion is notified; otherwise it is added back to notComputed so that its status is requested again later.
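The status polling cycle can be sketched with two sets standing in for the notComputed and noStatusInfo arrays. The use of `std::set` instead of cArray, the integer job identifiers and the names `ClientBook`, `startStatusRound` and `onStatusReply` are all assumptions made for illustration.

```cpp
#include <set>

// Hypothetical client-side bookkeeping keyed by an integer job id.
struct ClientBook {
    std::set<int> notComputed;   // jobs whose status will be asked
    std::set<int> noStatusInfo;  // jobs asked about, reply still pending

    // Start of a polling round: every job is moved to noStatusInfo
    // before the status requests are sent.
    void startStatusRound() {
        noStatusInfo.insert(notComputed.begin(), notComputed.end());
        notComputed.clear();
    }

    // Handles one status reply. Returns true when the job is finished
    // and has been dropped; otherwise it is queued to be asked again.
    bool onStatusReply(int jobId, bool completed) {
        noStatusInfo.erase(jobId);
        if (completed) return true;
        notComputed.insert(jobId);
        return false;
    }
};
```

Jobs that never receive a reply simply stay in `noStatusInfo` in this sketch; how the project handles them is not described in the text above.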

Statistics

To check that the system behaves as expected and to monitor its performance, we introduce some signals.

In the Client:

  1. avgSendingRateSignal for the sending rate of the Jobs
  2. avgComplexitySignal for the processing time of a Job
  3. realTimeSignal for the service time of each Job, measured from the moment the Job is first sent to the original Executor up to the moment the Client receives the completed status message

In the Storage:

  1. JobSignal for the evolution of the jobQueue length
  2. NewSignal for the evolution over time of the newJobsQueue length
  3. ReRoutedSignal for the evolution over time of the reRoutedQueue length

Results

Considering these simulation parameters:

We obtain the following results:

License

This project is licensed under the MIT License - see the LICENSE.md file for details.
