A write timeout can happen when you have multiple nodes in your datacenter. When you write with LOCAL_QUORUM, (RF/2)+1 replicas in the local datacenter must acknowledge the write, where RF is the replication factor for that datacenter and the division is rounded down. The coordinator waits until that many replicas confirm that the write was successful. For example, in a 3-node cluster with a replication factor of 3, the quorum is 2; if 2 of the 3 nodes go down, only one replica can respond, and you will get the write timeout error above.
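The quorum arithmetic can be sketched in a few lines of Python. This is only an illustration: the function names `quorum` and `can_reach_local_quorum` are hypothetical helpers, not part of Cassandra or any driver, and the sketch assumes every node in the local datacenter holds a replica of the row being written.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must acknowledge a quorum write: floor(RF / 2) + 1."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def can_reach_local_quorum(replication_factor: int, live_replicas: int) -> bool:
    """True if enough local replicas are up to satisfy LOCAL_QUORUM."""
    return live_replicas >= quorum(replication_factor)


if __name__ == "__main__":
    # (replication factor, replicas currently up) pairs to illustrate the rule.
    for rf, live in [(3, 3), (3, 2), (3, 1), (5, 3), (5, 2)]:
        status = "ok" if can_reach_local_quorum(rf, live) else "cannot reach quorum"
        print(f"RF={rf} live={live} needed={quorum(rf)} -> {status}")
```

With RF=3 and a single live replica, the needed count is 2, so the write in the example above cannot be acknowledged by enough replicas.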
There are cases where nodes appear up on the dashboards but, due to high load, a node is slow to respond and is therefore effectively down as far as the cluster is concerned. Check the load, memory, and CPU usage on the nodes at the exact time you got this error. Ideally, set up monitoring that alerts you when usage is high.
It is also possible that there is high network latency between your application servers and the Cassandra cluster.
Another way to check the system is to run the same write in cqlsh with TRACING ON and CONSISTENCY LOCAL_QUORUM. The trace output will provide details on how each node handled the request and how the nodes are failing.
If you see dropped mutations on the nodes and hints being stored on coordinators, this indicates that the commitlog disk is not able to keep up with the write I/O. This is a capacity issue, and you need to take one of the following actions:
- throttle down the amount of writes coming from the application,
- store the commitlog on a separate disk from the data directory,
- increase the capacity of your cluster by adding more nodes.
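For the first option, throttling can be done on the client side. Below is a minimal sketch of a fixed-rate limiter in Python. The class `WriteThrottle` and its parameters are hypothetical names made up for this illustration, not part of any Cassandra driver; the sketch is single-threaded and would need a lock if several threads share it.

```python
import time


class WriteThrottle:
    """Hypothetical helper: space out writes to at most max_per_second."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        if max_per_second <= 0:
            raise ValueError("max_per_second must be positive")
        self._interval = 1.0 / max_per_second  # minimum gap between writes
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = clock()  # the first write may go immediately

    def acquire(self):
        """Block until the next write is allowed; return seconds waited."""
        now = self._clock()
        wait = max(0.0, self._next_allowed - now)
        if wait > 0:
            self._sleep(wait)
        # Schedule the following write one interval after this one.
        self._next_allowed = max(now, self._next_allowed) + self._interval
        return wait


if __name__ == "__main__":
    throttle = WriteThrottle(max_per_second=100)
    for i in range(5):
        throttle.acquire()
        # Placeholder for the real write, e.g. a driver call that executes
        # the INSERT at LOCAL_QUORUM (assumed, not shown here).
        print(f"write {i} sent")
```

Call `acquire()` right before each write. The rate to pick depends on what the commitlog disk can sustain, so treat the value of 100 above as a placeholder.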
References:
- https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/dml/dmlConfigConsistency.html
- https://cassandra.apache.org/doc/latest/architecture/dynamo.html
- https://cassandra.apache.org/doc/latest/operating/hardware.html
- https://docs.datastax.com/en/archived/cassandra/3.0/cassandra/dml/dmlClientRequestsMultiDCWrites.html
- https://stackoverflow.com/questions/31683907/cassandra-local-quorum