I am learning to write systems that must survive process crashes in a cluster environment. After reading and searching for papers online, I have a rough understanding that the program needs to take checkpoints, so that its state can be saved to stable (replicated) storage and recovered from there later. I understand that fault tolerance also requires other components, e.g. a failure detector, but for now I want to understand checkpointing better.
However, most of the papers stay at an abstract level. For instance, "Design Patterns for Checkpoint-Based Rollback Recovery" explains that communication-induced checkpointing can prevent the domino effect, and it provides diagrams showing the interaction between components such as the failure detector and the checkpointer. My actual problem is: how can I checkpoint to stable storage and recover seamlessly? For instance, if I checkpoint a running program to storage such as Hadoop HDFS, how can I ensure that, on recovery, the program resumes and continues executing as it was, without problems?
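To make the question concrete, here is a minimal sketch of the kind of application-level checkpointing I have in mind. It is only an illustration under several assumptions: a local file stands in for the replicated storage (HDFS is not used here), the state is a small JSON-serializable dict, and the names `save_checkpoint`, `load_checkpoint`, `run`, and `crash_at` are hypothetical ones I made up. The idea is to write the state to a temporary file, force it to disk, and atomically rename it over the previous checkpoint, so a crash never leaves a half-written checkpoint behind.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: temp file, fsync, then rename over the old one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes reach the disk
        os.replace(tmp_path, path)  # atomic on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default


def run(path, limit, crash_at=None):
    """Sum 0..limit-1, checkpointing after every step.

    `crash_at` is a hypothetical hook that simulates a process crash.
    """
    state = load_checkpoint(path, {"next_i": 0, "total": 0})
    while state["next_i"] < limit:
        if crash_at is not None and state["next_i"] == crash_at:
            raise RuntimeError("simulated crash")
        state["total"] += state["next_i"]
        state["next_i"] += 1
        save_checkpoint(path, state)
    return state["total"]
```

In this sketch, calling `run` again with the same path after a crash picks up from the last saved `next_i` rather than starting over. This works only because all the state that matters lives in one explicit dict; it does not capture the call stack, open sockets, or messages in flight between processes, which is the part I do not understand how to handle.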