DeepSpotCloud provides a cost-efficient way to execute deep learning tasks using GPU spot
instances in AWS EC2. It draws on spot instances across regions (continents) to broaden the
pool of candidate instances. To preserve intermediate work when an instance is interrupted by
an outbid event, DeepSpotCloud uses a checkpoint mechanism: an interrupted instance uploads
its checkpoint to shared storage, and a new instance (in a different region or availability
zone) resumes after downloading that checkpoint.
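
The checkpoint-and-resume cycle described above can be illustrated with a minimal sketch. This is not DeepSpotCloud's actual code: a local directory stands in for the shared storage, the "training" is a toy loop, and the names save_checkpoint, load_checkpoint, run_training and interrupt_at are hypothetical, chosen only for this example.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional, Tuple

CHECKPOINT_NAME = "checkpoint.json"  # hypothetical file name


def save_checkpoint(shared_dir: Path, step: int, state: dict) -> Path:
    """Upload a checkpoint to the (simulated) shared storage."""
    shared_dir.mkdir(parents=True, exist_ok=True)
    tmp_path = shared_dir / (CHECKPOINT_NAME + ".tmp")
    final_path = shared_dir / CHECKPOINT_NAME
    tmp_path.write_text(json.dumps({"step": step, "state": state}))
    # Rename last so a reader never sees a half-written checkpoint.
    tmp_path.replace(final_path)
    return final_path


def load_checkpoint(shared_dir: Path) -> Tuple[int, dict]:
    """Download the latest checkpoint, or start fresh if none exists."""
    path = shared_dir / CHECKPOINT_NAME
    if not path.exists():
        return 0, {"loss": 1.0}
    data = json.loads(path.read_text())
    return data["step"], data["state"]


def run_training(shared_dir: Path, total_steps: int,
                 interrupt_at: Optional[int] = None) -> int:
    """Run a toy training loop that resumes from shared storage.

    interrupt_at simulates a spot interruption: when that step is reached,
    the instance saves a checkpoint and stops. Returns the step reached.
    """
    step, state = load_checkpoint(shared_dir)
    while step < total_steps:
        if interrupt_at is not None and step == interrupt_at:
            save_checkpoint(shared_dir, step, state)
            return step
        state["loss"] *= 0.9  # stand-in for one real training step
        step += 1
    save_checkpoint(shared_dir, step, state)
    return step


if __name__ == "__main__":
    storage = Path(tempfile.mkdtemp())
    # First instance is interrupted part-way through.
    print("stopped at step", run_training(storage, 10, interrupt_at=4))
    # A replacement instance picks up from the uploaded checkpoint.
    print("finished at step", run_training(storage, 10))
```

In a real deployment the two calls would run on different instances, possibly in different regions, and the directory would be replaced by storage reachable from all of them.
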
The goal of this website is to show how DeepSpotCloud operates. It currently runs on an AWS
EC2 g2.2xlarge spot instance, executing a CIFAR-10 TensorFlow job.