Apache Spark vs Apache Storm: which one to pick

Key Differences based on Technical Requirements

  • Latency: Is the latency of the streaming application critical? Storm processes each tuple as it arrives, while Spark Streaming processes data in micro-batches, so Storm can deliver sub-second latency much more easily and with fewer restrictions than Spark Streaming.
  • Development Cost: Do you need similar code bases for batch processing and stream processing? With Spark, batch and streaming code are very similar. Storm, however, departs dramatically from the MapReduce paradigm.
  • Message Delivery Guarantees: Do you need guaranteed delivery of every single record, or is some nominal amount of data loss acceptable? Disregarding everything else, Spark trivially yields exactly-once message delivery. Storm can provide all three delivery semantics (at-most-once, at-least-once, and exactly-once), but achieving exactly-once delivery properly requires more effort.
  • Fault Tolerance: Must your processes be fault tolerant? Both systems handle fault tolerance well and in relatively similar ways.
    • Production Storm clusters run Storm processes under supervision; if a process fails, the supervisor process restarts it automatically. State is managed through ZooKeeper, and a restarting process rereads its state from ZooKeeper when it attempts to rejoin the cluster.
    • Spark restarts workers via the resource manager: YARN, Mesos, or its standalone manager. Spark’s standalone resource manager handles master node failure with standby masters and ZooKeeper. Alternatively, this can be handled more primitively with local filesystem state checkpointing alone, which is not typically recommended for production environments.
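
To make the three delivery semantics from the list above more concrete, here is a minimal plain-Python sketch. It does not use any Storm or Spark API; the `deliver` function, its `failure_rate` and `seed` parameters, and the failure model (a lost send for at-most-once, a lost acknowledgement for the retrying modes) are all assumptions made purely for illustration. Exactly-once is approximated here by at-least-once delivery plus de-duplication on a message ID, which is one common way to get that effect, not necessarily how either system implements it.

```python
import random


def deliver(messages, mode, failure_rate=0.3, seed=0):
    """Simulate one consumer receiving (msg_id, payload) pairs over a lossy link.

    mode is "at_most_once", "at_least_once", or "exactly_once".
    Returns the list of payloads the consumer actually processed.
    """
    if mode not in ("at_most_once", "at_least_once", "exactly_once"):
        raise ValueError(f"unknown mode: {mode}")

    rng = random.Random(seed)  # seeded so that runs are repeatable
    processed = []             # side effects applied by the consumer
    seen_ids = set()           # IDs already processed (used for de-duplication)

    for msg_id, payload in messages:
        if mode == "at_most_once":
            # Send once and never retry: a failed send drops the record.
            if rng.random() >= failure_rate:
                processed.append(payload)
            continue

        # Both remaining modes redeliver until an acknowledgement arrives.
        acked = False
        while not acked:
            duplicate = msg_id in seen_ids
            # at_least_once reprocesses duplicates; exactly_once skips them.
            if mode == "at_least_once" or not duplicate:
                processed.append(payload)
                seen_ids.add(msg_id)
            # The acknowledgement itself may be lost, forcing a redelivery.
            acked = rng.random() >= failure_rate

    return processed


if __name__ == "__main__":
    messages = [(i, f"record-{i}") for i in range(100)]
    for mode in ("at_most_once", "at_least_once", "exactly_once"):
        out = deliver(messages, mode)
        # Total records processed vs. distinct records processed.
        print(mode, len(out), len(set(out)))
```

In this model, at-most-once can lose records but never duplicates them, at-least-once never loses records but can process some more than once, and exactly-once pays for its guarantee with the extra bookkeeping in `seen_ids`. That bookkeeping is a small stand-in for the "more effort" mentioned above.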

Both Apache Spark Streaming and Apache Storm are great solutions that solve the streaming ingestion and transformation problem. Either system can be a great choice for part of an analytics stack.

References