DISTRIBUTED JOB SCHEDULING

This repository contains the Distributed System project (2019-2020) by Fabio Carminati and Filippo Carloni.

We used OMNeT++ to model and implement an infrastructure that manages jobs submitted to a cluster of Executors.

Assumptions:

The channels between the various entities are bidirectional and assumed to be reliable. In particular, every channel is assumed to behave like a TCP connection, so unicast communication is possible between entities.

All the parameters of the simulation (job complexity, timeouts, delay in the channels, ...) can be modified in the omnetpp.ini file. The sending rate, the failure timeout and the job complexity are drawn from an exponential distribution rather than set to deterministic values, so that the system behaves more realistically. For these three parameters only the mean is specified.
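As an illustration of what specifying "only the mean" implies, here is a minimal standalone sketch in plain C++ (it does not use the OMNeT++ API). The helper names sampleExponential and empiricalMean are hypothetical and are not taken from the project. The standard library parameterises the distribution by its rate, which is the reciprocal of the mean.

```cpp
#include <cassert>
#include <random>

// Draws one value from an exponential distribution with the given mean.
// std::exponential_distribution takes the rate lambda = 1 / mean.
double sampleExponential(double mean, std::mt19937& rng) {
    assert(mean > 0.0);
    std::exponential_distribution<double> dist(1.0 / mean);
    return dist(rng);
}

// Averages n samples; for large n the result approaches the configured mean.
double empiricalMean(double mean, int n, unsigned seed) {
    assert(n > 0);
    std::mt19937 rng(seed);
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        sum += sampleExponential(mean, rng);
    }
    return sum / n;
}
```

Inside an OMNeT++ model the built-in exponential(mean) function serves the same purpose; whether the project calls it from the .ini file or from C++ is not stated here.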

Entities:

There are four main entities in the system, each implemented by its own class:

Job
Executor
Storage
Client

Jobs:

Jobs are represented in the system by the JobMessage. JobMessages are exchanged among the various entities in the system and contain all the essential information to represent a Job:

RelativeJobId
ClientId
OriginalExecId
ActualExecId
JobComplexity
ReRouted
NewJob
IsEnded
StartingTime
EndingTime
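The fields above could be pictured as a plain C++ struct like the one below. This is only a sketch: the field types are assumptions, and the two helper functions (handledByOtherExecutor, elapsedTime) are hypothetical and are not part of the project. In an OMNeT++ project such a message would more typically be declared in a .msg file.

```cpp
#include <cassert>

// Sketch of the information carried by a JobMessage (types are assumed).
struct JobMessage {
    int relativeJobId;
    int clientId;
    int originalExecId;   // Executor that first received the Job
    int actualExecId;     // Executor that currently holds the Job
    double jobComplexity;
    bool reRouted;
    bool newJob;
    bool isEnded;
    double startingTime;
    double endingTime;
};

// Hypothetical helper: true when the Job is held by an Executor
// other than the one that originally received it.
bool handledByOtherExecutor(const JobMessage& job) {
    return job.actualExecId != job.originalExecId;
}

// Hypothetical helper: time between start and end of a finished Job.
double elapsedTime(const JobMessage& job) {
    assert(job.isEnded);
    return job.endingTime - job.startingTime;
}
```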

Executor:

The Executors execute the jobs received from the Clients in a distributed way.

The Executors communicate with:

Clients (status and job requests)
Storages (send and receive internal status)
Other Executors (load balancing operations and status)

The Executor is the only entity that can fail in the system. Only crash failures are allowed (the Executor loses the partial computation of a job and its internal state); Byzantine and timing failures are not considered.

An Executor can be in one of three main modes:

Normal Mode
Failure Mode: ignores all incoming packets until timeoutFailureDuration expires
Reboot Mode: ignores all incoming packets except those from the Storage
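The packet filtering implied by the three modes can be sketched as a small predicate. The enum and function names below (Mode, Sender, acceptsPacket) are hypothetical; the sketch encodes only the rules listed above and says nothing about how the project switches between modes.

```cpp
#include <cassert>

enum class Mode { Normal, Failure, Reboot };
enum class Sender { Client, Executor, Storage };

// Returns true if an Executor in the given mode processes a packet
// coming from the given kind of sender.
bool acceptsPacket(Mode mode, Sender sender) {
    switch (mode) {
        case Mode::Normal:  return true;                       // everything is processed
        case Mode::Failure: return false;                      // everything is ignored
        case Mode::Reboot:  return sender == Sender::Storage;  // only the Storage is heard
    }
    assert(false && "unknown mode");
    return false;
}
```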

The Executors can crash (with different probabilities) either at the reception of each message or in the middle of computation.

Storage:

Each Executor has its own Storage connected to it. Each Storage saves, permanently and reliably, the internal status of the Executor attached to it.

The Storage CANNOT crash. The Storages save the jobs in four different std::map containers using a simple protocol:

Insert operation: if a job arrives for a specific map and does not yet exist in it, the Storage saves the job in that map
Delete operation: if the job already exists in that map, the Storage deletes the job from it
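Taken together, the two operations behave like a toggle on the target map. A minimal sketch of that idea follows; the key type, the stored payload and the function name applyJob are assumptions made for illustration, not the project's actual types.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical representation: job id -> job complexity.
using JobMap = std::map<std::string, double>;

// Applies an incoming job to one map.
// Returns true if the job was inserted, false if it was already
// present and has therefore been deleted.
bool applyJob(JobMap& target, const std::string& jobId, double complexity) {
    assert(!jobId.empty());
    auto it = target.find(jobId);
    if (it == target.end()) {
        target.emplace(jobId, complexity);  // insert operation
        return true;
    }
    target.erase(it);                       // delete operation
    return false;
}
```

With this scheme the same message type can both record a job and clear it, so the Storage needs no separate delete command.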

Client:

Clients communicate only with the Executors. They send new jobs periodically (the Executor is selected at random), and each job sent is added to the notComputed cArray. They also request information about the status of the Jobs (completed or not completed).

Each Client is completely unaware of whether load balancing has been performed for a Job (e.g. which Executor actually holds it, ActualExecId).

If an Executor does not answer a new Job request with a JobId within a given period, the Job is resent a given number of times; after that, another Executor is selected for that Job.
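The resend-then-switch rule can be sketched as follows. All names (PendingJob, pickOtherExecutor, onReplyTimeout, maxResends) are hypothetical, and the sketch assumes that the replacement Executor is drawn uniformly at random among the others, which the text does not specify.

```cpp
#include <cassert>
#include <random>

struct PendingJob {
    int execId;   // Executor currently targeted
    int resends;  // resends already made to that Executor
};

// Picks a random Executor index in [0, numExecutors) different from current.
int pickOtherExecutor(int current, int numExecutors, std::mt19937& rng) {
    assert(numExecutors > 1 && current >= 0 && current < numExecutors);
    std::uniform_int_distribution<int> dist(0, numExecutors - 2);
    int candidate = dist(rng);
    // Skip over the current Executor so it can never be chosen again.
    return candidate >= current ? candidate + 1 : candidate;
}

// Called when no JobId reply arrived within the timeout.
void onReplyTimeout(PendingJob& job, int maxResends, int numExecutors,
                    std::mt19937& rng) {
    if (job.resends < maxResends) {
        ++job.resends;  // resend to the same Executor
    } else {
        job.execId = pickOtherExecutor(job.execId, numExecutors, rng);
        job.resends = 0;  // start over with the new Executor
    }
}
```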

Periodically, the Clients ask the respective Executors (OriginalExecId) for the status of their Jobs. At the same time they move all these Jobs to the noStatusInfo cArray, on the assumption that a crash occurred while the Job was being processed.

When a status response arrives, a completed Job is removed and its completion is notified; otherwise the Job is added back to notComputed so that its status will be asked again later.
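The bookkeeping described in the last two paragraphs can be sketched with two containers. The project uses OMNeT++ cArray objects; std::set is substituted here only to keep the example self-contained, and the names ClientJobs, askStatus and onStatusResponse are hypothetical.

```cpp
#include <cassert>
#include <set>

struct ClientJobs {
    std::set<int> notComputed;   // jobs whose status will be asked next
    std::set<int> noStatusInfo;  // jobs asked about, reply not yet received
};

// Periodic status request: every tracked job is moved to noStatusInfo.
void askStatus(ClientJobs& c) {
    c.noStatusInfo.insert(c.notComputed.begin(), c.notComputed.end());
    c.notComputed.clear();
}

// Handles one status reply. Returns true if the job is finished and
// has been dropped, false if it was queued to be asked about again.
// Assumes the reply refers to a job that was actually asked about.
bool onStatusResponse(ClientJobs& c, int jobId, bool isCompleted) {
    assert(c.noStatusInfo.count(jobId) == 1);
    c.noStatusInfo.erase(jobId);
    if (isCompleted) {
        return true;  // removed; the caller can notify completion
    }
    c.notComputed.insert(jobId);
    return false;
}
```

Jobs that never receive a reply simply stay in noStatusInfo, which matches the assumption that the Executor crashed while processing them.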

Statistics

To check that the system behaves as expected and to monitor its performance, we introduce some signals.

In the Client:

avgSendingRateSignal for the sending rate of the Jobs
avgComplexitySignal for the processing time of a Job
realTimeSignal for the service time of each Job, measured from the moment the Job is first sent to the original Executor until the Client receives the completed status message

In the Storage:

JobSignal for the evolution over time of the jobQueue length
NewSignal for the evolution over time of the newJobsQueue length
ReRoutedSignal for the evolution over time of the reRoutedQueue length

Results

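Considering these simulation parameters:

[Figure: omnetpp.ini parameters, see no_failure_opt_load_ini_file.png in the repository]

We obtain the following results:

[Figures: see end_simulation_values.png and signals_2,4,no_failure,optimal_load.png in the repository]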

License

This project is licensed under the MIT License; see the LICENSE.md file for details.
