FAULT TOLERANCE IN CHECK-POINTING APPROACH
A secure virtual grid is in demand, one in which any resource from any cluster
can be shared even in the presence of faults in the system. Grid computing is a
distributed computing paradigm that differs from traditional distributed
computing in that it is aimed at large-scale systems that even span
organizational boundaries. Reliability challenges arise from the unreliable
nature of the grid infrastructure, in addition to the challenges of managing
and scheduling applications. A fault can occur due to link failure, resource
failure or any other reason, and it must be tolerated so that the system keeps
working smoothly and accurately without interrupting the current job. Many
techniques are used for the detection and recovery of these faults. An
appropriate fault detector can avoid the loss caused by a system crash, and a
reliable fault tolerance technique can protect against system failure. Fault
tolerance is an important property for achieving reliability, availability and
QoS. The fault tolerance mechanism used here sets job checkpoints based on the
resource failure rate. If a resource failure occurs, the job is restarted from
its last successful state on another grid resource using a checkpoint file. It
is important to select optimal checkpointing intervals for an application in
order to minimize its run time in the presence of system failures. In case of
resource failure, the Fault Index based rescheduling algorithm reschedules the
job from the failed resource to another available resource with the least
fault-index value and executes the job from the most recently saved checkpoint.
This ensures that the job is executed within the given deadline with increased
throughput, and helps make the grid environment trustworthy.
Grid computing, fault tolerance, check pointing,
Grid computing is a term referring to the aggregation of computer resources from
multiple administrative domains to reach a common goal. A grid can be thought of
as a distributed system with non-interactive workloads that involve a large
number of files. Although a grid can be dedicated to a specialized application,
it is more common for a single grid to be used for a variety of purposes. Grids
are often constructed with the aid of general-purpose grid software libraries
known as middleware. A grid enables the sharing, selection, and aggregation of
a wide variety of geographically distributed resources, including
supercomputers, storage systems, data sources and specialized devices owned by
different organizations. Management of these resources is an important part of
the infrastructure in a grid computing environment.
Fault tolerance is fundamentally important for achieving the promising
potential of computational grids, since resources are geographically
distributed. Moreover, the probability of resource failure is much greater than
in traditional parallel computing, and the failure of resources affects job
execution fatally. Fault tolerance is the ability of a system to perform its
function correctly even in the presence of faults, and it makes the system more
dependable. The fault tolerance service is essential to satisfy QoS
requirements in grid computing, and it deals with various types of resource
failures, including process failure, processor failure and network failure.
The interval, or period, at which the application’s state is checkpointed is
one of the important parameters in a checkpointing system that provides fault
tolerance. Smaller checkpointing intervals increase application execution
overhead due to checkpointing, while larger checkpointing intervals increase
recovery time in the event of failures. Hence, the optimal checkpointing
interval, the one that leads to minimum application execution time in the
presence of failures, has to be determined.
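To make this trade-off concrete, the sketch below uses a simple first-order cost model that is not taken from the text. With checkpoint cost C, mean time between failures M and interval tau, the fraction of time lost is roughly C/tau for checkpointing plus tau/(2M) for rework after a failure. Minimizing this sum gives Young's well-known approximation tau = sqrt(2·C·M). The function names and the example values are assumptions chosen for illustration.

```python
import math

def overhead_fraction(tau, checkpoint_cost, mtbf):
    """Approximate fraction of time lost: checkpoint overhead plus expected rework."""
    return checkpoint_cost / tau + tau / (2.0 * mtbf)

def young_interval(checkpoint_cost, mtbf):
    """Young's first-order approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)

# Illustrative values (assumed): 30 s per checkpoint, one failure every 6 hours.
C, M = 30.0, 6 * 3600.0
tau_opt = young_interval(C, M)

# Compare an interval that is too short, the approximate optimum, and one too long.
for tau in (tau_opt / 4, tau_opt, tau_opt * 4):
    print(f"interval={tau:8.1f}s  overhead={overhead_fraction(tau, C, M):.4f}")
```

Under this model, both the shorter and the longer interval give a higher overhead than the approximate optimum, which matches the qualitative argument above.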
1. If a fault occurs at a grid resource, the job is rescheduled on another
resource, which eventually results in failing to satisfy the user’s QoS
requirement, i.e. the deadline. The reason is simple: as the job is
re-executed, it consumes more time.
2. In computational grid environments, there are resources that fulfill the
deadline constraint but have a tendency toward faults. The grid scheduler still
selects such a resource merely because it promises to meet the user’s
requirements for the grid job. This eventually compromises the user’s QoS
parameters in order to complete the job.
3. Even when there is a fault in the system, a running task should finish by
its deadline. A task that does not finish before its deadline is of no use.
Hence, the deadline is the major issue in real-time systems.
4. A real-time distributed system needs availability of end-to-end services and
the ability to experience failures or systematic attacks without impacting
customers or operations.
5. Scalability is the ability to handle a growing amount of work, and the
capability of a system to increase total throughput under an increased load
when resources are added.
The checkpointing fault tolerance approach is used to overcome the
above-mentioned drawbacks in such a scenario. In this approach, every resource
maintains fault tolerance information. When a fault occurs, the resource
updates its fault occurrence information. This information is used when
deciding which resources to allocate to a job. Checkpointing is one of the most
popular techniques for providing fault tolerance on unreliable systems. A
checkpoint is a recorded snapshot of the entire system state, kept in order to
restart the application after the occurrence of some failure. Checkpoints can
be stored on temporary as well as stable storage. However, the efficiency of
the mechanism depends strongly on the length of the checkpointing interval.
Frequent checkpointing increases overhead, while lazy checkpointing may lead to
the loss of significant computation. Hence, deciding on the size of the
checkpointing interval and the checkpointing technique is a complicated task
and should be based on knowledge of the system as well as the application.
depends on the system’s MTTR. Usually the state of the application is saved
periodically on stable storage such as a hard disk. After a crash, the
application is restarted from the last checkpoint rather than from the
beginning. There are three checkpointing strategies: coordinated checkpointing,
uncoordinated checkpointing, and communication-induced checkpointing.
1. In coordinated checkpointing, processes synchronize checkpoints to ensure
their saved states are consistent with each other, so that the overall combined
saved state is also consistent.
2. In uncoordinated checkpointing, by contrast, processes schedule checkpoints
independently at different times and do not account for messages.
3. Communication-induced checkpointing attempts to coordinate only selected
critical checkpoints.
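The basic checkpoint-and-restart cycle can be sketched for a single process as follows. This is a minimal illustration, not the mechanism used in the paper: the job is a toy summation, the state is saved with `pickle`, and the `fail_at` parameter is a hypothetical hook that simulates a resource failure. Coordination between processes, which is what distinguishes the three strategies, is outside its scope.

```python
import os
import pickle
import tempfile

def run_job(total_steps, ckpt_path, ckpt_every, fail_at=None):
    """Sum 0..total_steps-1, checkpointing every `ckpt_every` steps.

    If a checkpoint file exists, resume from it instead of starting over.
    `fail_at` simulates a resource failure at that step (demo-only assumption).
    """
    step, acc = 0, 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            step, acc = pickle.load(f)
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError(f"simulated resource failure at step {step}")
        acc += step
        step += 1
        if step % ckpt_every == 0:
            # Write to a temp file, then rename, so a crash never leaves a torn checkpoint.
            tmp = ckpt_path + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump((step, acc), f)
            os.replace(tmp, ckpt_path)
    return acc

ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
try:
    run_job(100, ckpt, ckpt_every=10, fail_at=57)
except RuntimeError:
    pass  # in a grid, the job would now be rescheduled on another resource
result = run_job(100, ckpt, ckpt_every=10)  # resumes from the checkpoint at step 50
```

Only the work done between the last checkpoint (step 50) and the failure (step 57) is repeated, which is the saving that checkpointing is meant to provide.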
A grid resource is a member of a grid and offers computing services to grid
users. Grid users register themselves with the Grid Information Server (GIS) of
a grid by specifying QoS requirements such as the deadline to complete the
execution, the number of processors, the type of operating system and so on.
The components used in the architecture are described below:
Scheduler- The scheduler is an important entity of a grid. It receives jobs
from grid users, selects feasible resources for those jobs according to
information received from the GIS, and then generates job-to-resource mappings.
When the schedule manager receives a grid job from a user, it gets details of
the available grid resources from the GIS. It then passes the available
resource list to the entities in the MTTR scheduling strategy. The Matchmaker
entity matches resources against job requirements. The Response Time Estimator
entity estimates the response time for the job on each matched resource based
on the transfer time, queue wait time and service time of the job. The Resource
Selector selects the resource with the minimum response time. A job dispatcher
dispatches the jobs one by one to the checkpoint manager.
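The estimation and selection steps described for the scheduler can be sketched as below. The text only states that response time is built from transfer time, queue wait time and service time. How each term is computed here (input size over bandwidth, job length over processor speed) and all class, field and resource names are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Resource:
    name: str
    mips: float            # processing speed, million instructions per second
    bandwidth_mbps: float  # link bandwidth to the resource, megabits per second
    queue_wait_s: float    # estimated time the job waits in the resource queue

@dataclass
class Job:
    length_mi: float       # job length in million instructions
    input_mb: float        # input size in megabytes

def response_time(job, res):
    """Response time = transfer time + queue wait time + service time."""
    transfer = job.input_mb * 8.0 / res.bandwidth_mbps
    service = job.length_mi / res.mips
    return transfer + res.queue_wait_s + service

def select_resource(job, matched):
    """Resource selector: pick the matched resource with minimum response time."""
    return min(matched, key=lambda r: response_time(job, r))

job = Job(length_mi=60000.0, input_mb=100.0)
matched = [
    Resource("R1", mips=1000.0, bandwidth_mbps=100.0, queue_wait_s=30.0),
    Resource("R2", mips=2000.0, bandwidth_mbps=50.0, queue_wait_s=5.0),
    Resource("R3", mips=500.0, bandwidth_mbps=200.0, queue_wait_s=0.0),
]
best = select_resource(job, matched)
print(best.name, response_time(job, best))
```

In this example the resource with the slowest link is still selected, because its shorter service time and queue wait outweigh the longer transfer.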
GIS- The GIS contains information about all available grid resources. It
maintains details of resources such as processor speed, available memory, load,
etc. The GIS monitors all the grid resources that join and leave the grid. A
scheduler consults the GIS to get information about available grid resources
whenever it has jobs to execute.
Checkpoint Manager- It receives the scheduled job from the scheduler and sets
checkpoints based on the failure rate of the resource on which the job is
scheduled. Then it submits the job to the resource. The checkpoint manager
receives a job completion message or a job failure message from the grid
resource and responds accordingly. If a job failure occurs during execution,
the job is rescheduled from the last checkpoint instead of running from
scratch.
Checkpoint Server- Job status is reported to the checkpoint server at each
checkpoint set by the checkpoint manager. The checkpoint server saves the job
status and returns it on demand, i.e., during job or resource failure. For a
particular job, the checkpoint server discards the result of the previous
checkpoint when a new checkpoint result is received.
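The keep-only-the-latest behaviour of the checkpoint server can be sketched with an in-memory dictionary. The class and method names are hypothetical, and a real server would persist the status to stable storage rather than hold it in memory.

```python
class CheckpointServer:
    """Keeps only the most recent checkpoint per job (hypothetical in-memory sketch)."""

    def __init__(self):
        self._latest = {}  # job_id -> (checkpoint_number, job_status)

    def report(self, job_id, checkpoint_number, status):
        # Storing under the same key discards the previous checkpoint result.
        self._latest[job_id] = (checkpoint_number, status)

    def latest(self, job_id):
        """Return the last saved (checkpoint_number, status), or None if the job has none."""
        return self._latest.get(job_id)

server = CheckpointServer()
server.report("job-1", 1, {"completed_steps": 100})
server.report("job-1", 2, {"completed_steps": 200})
print(server.latest("job-1"))
```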
Fault Index Manager- It maintains the fault index value of each resource, which
indicates the failure rate of the resource. The fault index of a resource is
incremented every time the resource does not complete the job assigned to it
within the deadline, and also on resource failure. The fault index of a
resource is decremented when the resource completes the job assigned to it
within the deadline. The fault index manager updates the fault index of a grid
resource using the fault index update algorithm.
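The update rule and its use in rescheduling can be sketched as follows. The text gives the direction of each update but not the step size or bounds, so the step of 1 and the floor at zero are assumptions, as are the class and method names.

```python
class FaultIndexManager:
    """Tracks a fault index per resource (illustrative update rule, step size 1)."""

    def __init__(self, resources):
        self.fault_index = {name: 0 for name in resources}

    def job_finished(self, resource, met_deadline):
        if met_deadline:
            # Decrement on success, but never below zero (the floor is an assumption).
            self.fault_index[resource] = max(0, self.fault_index[resource] - 1)
        else:
            self.fault_index[resource] += 1

    def resource_failed(self, resource):
        self.fault_index[resource] += 1

    def pick_for_reschedule(self, failed, available):
        """Choose the available resource, other than the failed one, with the least fault index."""
        candidates = [r for r in available if r != failed]
        return min(candidates, key=lambda r: self.fault_index[r])

fim = FaultIndexManager(["R1", "R2", "R3"])
fim.resource_failed("R1")                     # R1 crashes while running a job
fim.job_finished("R2", met_deadline=False)    # R2 misses two deadlines
fim.job_finished("R2", met_deadline=False)
fim.job_finished("R3", met_deadline=True)     # R3 finishes on time
target = fim.pick_for_reschedule(failed="R1", available=["R1", "R2", "R3"])
print(target, fim.fault_index)
```

The job from the failed resource would then be restarted on `target` from its most recently saved checkpoint.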
Checkpoint Replication Server- When a new checkpoint is created, the Checkpoint
Replication Server initializes the CRS, which replicates the created
checkpoints to remote resources by applying RRSA. After replication, the
details are stored in the Checkpoint Server. To obtain information about all
checkpoint files, the Replication Server queries the Checkpoint Server. The CRS
monitors the Checkpoint Server to detect newer checkpoint versions during the
entire application runtime. Information about available resources, hardware,
memory and bandwidth is obtained from the GIS. The required details are
periodically propagated by these tools to the GIS. The CRS selects a suitable
resource using RRSA to replicate the checkpoint file depending on transfer
sizes, the available storage of the resources and the current bandwidth.
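The text does not spell out RRSA, so the sketch below is a stand-in heuristic rather than the algorithm itself. It uses the three inputs the text names (transfer size, available storage, current bandwidth) and picks the resource that can hold the checkpoint file and receive it fastest. The function name, the data layout and the example values are assumptions.

```python
def select_replica_target(checkpoint_mb, resources):
    """Pick the resource with enough free storage and the shortest transfer time.

    `resources` maps a name to (free_storage_mb, bandwidth_mbps). This is a
    stand-in heuristic, not the RRSA itself, which the text does not spell out.
    """
    best_name, best_time = None, float("inf")
    for name, (free_mb, bandwidth_mbps) in resources.items():
        if free_mb < checkpoint_mb or bandwidth_mbps <= 0:
            continue  # cannot hold the file, or is unreachable
        transfer_s = checkpoint_mb * 8.0 / bandwidth_mbps
        if transfer_s < best_time:
            best_name, best_time = name, transfer_s
    return best_name, best_time

resources = {
    "R1": (500.0, 100.0),
    "R2": (50.0, 1000.0),   # fastest link, but too little storage
    "R3": (2000.0, 400.0),
}
target, seconds = select_replica_target(200.0, resources)
print(target, seconds)
```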
Throughput- Throughput is one of the most important standard metrics used to
measure the performance of fault-tolerant systems. Throughput is defined as:
\[ \text{Throughput}(n) = \frac{n}{T(n)} \]
where n is the total number of jobs submitted and T(n) is the total amount of
time required to complete n jobs. Throughput is used to measure the ability of
the grid to accommodate jobs. Generally, the throughput of the two systems
decreases as the percentage of faults injected in the grid increases. This is
because of the extra delay that both of them encounter in completing jobs when
faults occur.
Failure tendency- Failure tendency is the percentage tendency of the selected
grid resources to fail and is defined as:
\[ \text{Failure tendency} = \frac{\sum_{j=1}^{m} Pf_j}{m} \times 100\% \]
where m is the total number of grid resources and Pf_j is the failure rate of
resource j. Through this metric, the faulty behavior of the system can be
anticipated.
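Both metrics are simple to compute. The sketch below assumes that throughput is the number of jobs divided by the total completion time, and that failure tendency is the mean failure rate of the selected resources expressed as a percentage, which is one reading of the definitions above. The function names and sample values are illustrative.

```python
def throughput(n_jobs, total_time):
    """Throughput(n) = n / T(n): jobs completed per unit of time."""
    return n_jobs / total_time

def failure_tendency(failure_rates):
    """Mean failure rate of the selected resources, as a percentage (assumed reading)."""
    return 100.0 * sum(failure_rates) / len(failure_rates)

# Illustrative values: 120 jobs in 3600 s, three resources with assumed failure rates.
print(throughput(120, 3600.0))
print(failure_tendency([0.1, 0.3, 0.2]))
```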
In all distributed environments, fault tolerance is an important problem. The
proposed work achieves fault tolerance by dynamically adapting the checkpoint
frequency based on the history of failure information and job execution time,
which reduces checkpoint overhead and also increases throughput. To make the
grid environment more dependable and trustworthy, the following have been
proposed: new fault detection methods, a client-transparent fault tolerance
architecture, on-demand fault-tolerant techniques, an economic fault-tolerant
model, an optimal failure prediction system, a multiple-fault-tolerant model
and a self-adaptive fault tolerance framework.