posted 14 years ago
Thank you, Chuck.
The point of adjusting the cluster size would not be to pull out an instance while it is running tasks. The main intent would be to increase the number of instances while the Hadoop cluster is already running, at a decision point that I control. The second would be some kind of graceful shutdown of some nodes once a heavy load drops back to a much lower one. For that I would need a mechanism that tells the Hadoop cluster manager not to start new tasks on certain nodes and, when their current tasks are done, either notifies us or shuts down the instance in question by itself.
Knowing the load distribution of our current jobs, I calculated that a reasonable amount of resources (costs) could be saved.
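To make the shutdown behaviour I have in mind concrete, here is a minimal in-memory sketch. It is not Hadoop code: `Node`, `Cluster`, `drain` and the `on_drained` callback are hypothetical names I made up for illustration. A draining node gets no new tasks, and the callback fires once its last running task finishes.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    running_tasks: int = 0
    draining: bool = False


class Cluster:
    """Toy model of a scheduler that supports draining nodes (hypothetical)."""

    def __init__(self, nodes, on_drained):
        self.nodes = {n.name: n for n in nodes}
        self.on_drained = on_drained  # called with the node name when it is idle

    def drain(self, name):
        # Mark the node so that no new tasks are scheduled on it.
        node = self.nodes[name]
        node.draining = True
        self._release_if_idle(node)

    def assign_task(self):
        # Only nodes that are not draining may receive new tasks.
        candidates = [n for n in self.nodes.values() if not n.draining]
        if not candidates:
            return None
        node = min(candidates, key=lambda n: n.running_tasks)
        node.running_tasks += 1
        return node.name

    def task_finished(self, name):
        node = self.nodes[name]
        node.running_tasks -= 1
        self._release_if_idle(node)

    def _release_if_idle(self, node):
        # A draining node with no running tasks can be notified or shut down.
        if node.draining and node.running_tasks == 0:
            del self.nodes[node.name]
            self.on_drained(node.name)
```

In a real setup the callback would be the place to terminate the EC2 instance or to send the notification.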
Because we run multi-job sessions, I have situations where I start many (50) EC2 instances for one client, and later a very different job comes in that does not need so many instances. When the biggest job finishes, the smaller job scales out across all the nodes, so its remaining tasks finish proportionately sooner. But I know that is not soon enough to avoid crossing the billing-hour boundary, so I get another 50 instance-hours on my bill, while I know that 10 instances could have finished the job. The price of the 40 instance-hour difference is lost. Frankly, that loss is not large enough to justify the effort of graceful shutdown.
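The arithmetic behind those 40 instance-hours can be sketched as follows. This assumes billing is rounded up to whole hours per instance and that the remaining work scales perfectly linearly with the number of instances. The function name and the figure of 8 instance-hours of remaining work are only illustrative values, not measurements.

```python
import math


def billed_instance_hours(instances, work_instance_hours):
    """Instance-hours billed to finish the remaining work.

    Assumes perfect linear scaling and billing rounded up to whole hours
    per instance (both are simplifying assumptions).
    """
    wall_clock_hours = work_instance_hours / instances
    return instances * math.ceil(wall_clock_hours)


remaining_work = 8.0  # illustrative: instance-hours left in the small job

big = billed_instance_hours(50, remaining_work)    # whole cluster stays up
small = billed_instance_hours(10, remaining_work)  # cluster shrunk to 10 nodes
print(big, small, big - small)
```

Multiplying the difference by the hourly instance price gives the money lost in that billing hour.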
Being able to enlarge the cluster is more pressing. While we are processing small MapReduce jobs, a huge job sometimes arrives in the pipeline, and I have to start a completely new Hadoop cluster with many instances immediately, even though the smaller jobs from the original schedule could be finished in the spare capacity of the bigger cluster (consider also the losses when the huge job is ending). Managing multiple clusters plus HDFS may also complicate the problem too much.
At the moment I feel that dynamically starting new nodes would be the significant gain. Graceful shutdown is less pressing because of how well MapReduce algorithms scale out.