
Data ingestion: Fluentd or Logstash. That's the question

Introduction

While preparing a stack for a project this summer I had to choose a data ingestion tool and, after some days analysing different options, the last two on the table were Fluentd and Logstash. So I decided to research a bit and run my own tests. What follows is the result of that work with both tools.

A bit of history and factual data

Logstash is well known for being part of the ELK (Elasticsearch, Logstash, Kibana) stack. Fluentd is built by Treasure Data and is part of the Cloud Native Computing Foundation (CNCF) portfolio of tools, which are increasingly used by many renowned DevOps-oriented communities such as Docker, Google Cloud Platform (GCP) and even Elasticsearch.

If you plan on using Elastic products or the whole suite, then you should tend to prefer Logstash (although Fluentd also has excellent support for Elastic). On the other hand, if you're using any CNCF-hosted project (e.g. Kubernetes, OpenTracing or Prometheus), you should probably go with Fluentd.

Some brief technical stuff to mention and not to forget

  • Both platforms can run either on Linux or Windows.
  • Both can be configured to function in a multi-tier setup.
  • In both cases forwarders can detect the failure of a shipper and switch to another active shipper when necessary. A data shipper is a software unit in charge of automatically copying database and transaction data or log files from a primary production server to a secondary or standby server. A data forwarder is a vehicle that carries data from its point of origin to a destination.
  • Both deliver events to many different receivers, for instance readily supporting SQS, Elasticsearch, and S3.
  • Both handle JSON events natively.
  • Both are under active development.
  • Both use a plugin architecture and have their own plugin manager tools.
  • In Logstash, event data flows through a single stream, and if-then statements then send each event to the right destination. Fluentd relies on tags to route events; each event's tag tells Fluentd where it should be routed. Fluentd's approach is more declarative, whereas Logstash's method is procedural.

Update (by Aaron Mildenstein, Software Engineer at Elastic): In Logstash 6.0 the product will fully support multiple, independent pipelines which can be stopped and started, and automatically reloaded on the fly.
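To make the routing difference described in the list above more concrete, here is a minimal sketch in plain Python. It uses neither tool's real configuration syntax: the events, tag names, destinations and both functions are hypothetical, and Python's fnmatch wildcards are only a loose stand-in for Fluentd's tag matching rules.

```python
import fnmatch

# Hypothetical events; tags, types and messages are made up for illustration.
EVENTS = [
    {"tag": "app.web", "type": "access", "message": "GET /index.html"},
    {"tag": "app.db", "type": "slow_query", "message": "SELECT ..."},
    {"tag": "sys.auth", "type": "login", "message": "user logged in"},
]


def route_procedural(event):
    """Logstash-like idea: one stream, if-then checks pick the destination."""
    if event["type"] == "access":
        return "elasticsearch"
    elif event["type"] == "slow_query":
        return "s3"
    else:
        return "stdout"


# Fluentd-like idea: a declarative table of tag patterns, first match wins.
# fnmatch's "*" is looser than Fluentd's own wildcard (an assumption of this sketch).
TAG_ROUTES = [
    ("app.web", "elasticsearch"),
    ("app.*", "s3"),
    ("*", "stdout"),
]


def route_by_tag(event):
    """Return the destination of the first pattern matching the event's tag."""
    for pattern, destination in TAG_ROUTES:
        if fnmatch.fnmatchcase(event["tag"], pattern):
            return destination
    return None


for event in EVENTS:
    print(event["tag"], route_procedural(event), route_by_tag(event))
```

Both functions send these three events to the same places. The difference is where the routing knowledge lives: in code branches for the procedural style, in a table of tag patterns for the declarative one.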

  • Plugins are managed differently by the two tools. Logstash uses a centralized repository on GitHub, whereas Fluentd follows a decentralized model. Both have hundreds of plugins available for almost everything you can get data from: application logs, network protocols, IoT devices, container technologies, databases, orchestration engines, message protocols, mail services, monitoring tools... just to name a few.
  • Transport technology is where the big differences become plain. By transport technology I mean gathering data from disparate sources and transporting it to the correct destinations (a database, a data lake, another application, an API or whatever it should be). Logstash lacks an internal persistence mechanism. Currently it has an in-memory queue that holds a fixed number of events, and it relies on an external message queue system like Redis, Kafka, ZeroMQ or RabbitMQ for persistence across restarts and for scalability. For me this is certainly an issue for Logstash, and the community has long been asking for the queue to be persisted on disk. Fluentd, on the other hand, has a highly configurable buffering system that can be either in-memory or on-disk, so in comparison to Logstash we can say that Fluentd has built-in reliability and scalability characteristics. The downside of Fluentd is that its configuration might be difficult or even tricky for newcomers. Besides this, there are a few other things to bear in mind. Beats, the agent-like tool from Elastic, sends data to Logstash with minimal filtering capabilities, so it sends pretty much everything it records over the network. Fluent Bit and Fluentd Forwarder offer some extended filtering capabilities to reduce the amount of data sent over the network. For many this won't matter much, but it does for those who have their data infrastructure deployed on hybrid or public clouds, where every bit sent or received counts and has a price.

Update (by Aaron Mildenstein, Software Engineer at Elastic): The persistent-queue feature has allowed persistence to disk for several point versions (in beta release form), and is now out of beta and fully supported. You can choose whether to use persist-to-disk, or the in-memory queue.
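The practical difference between an in-memory queue and an on-disk buffer, discussed above, can be sketched in a few lines of Python. This is only an illustration of the idea, not how either tool implements its queue: the class names, the JSON-lines file format and the simulated "restart" are all assumptions made for the example.

```python
import json
import os
import tempfile
from collections import deque


class MemoryBuffer:
    """Events live only in process memory; a restarted process starts empty."""

    def __init__(self):
        self._queue = deque()

    def put(self, event):
        self._queue.append(event)

    def pending(self):
        return list(self._queue)


class FileBuffer:
    """Events are appended to a file as JSON lines, so they survive a restart."""

    def __init__(self, path):
        self._path = path

    def put(self, event):
        with open(self._path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(event) + "\n")
            fh.flush()
            os.fsync(fh.fileno())  # ask the OS to push the write to disk

    def pending(self):
        if not os.path.exists(self._path):
            return []
        with open(self._path, encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]


def simulate_restart():
    """Buffer one event in each buffer, then 'restart' by creating fresh objects."""
    event = {"message": "disk almost full"}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "buffer.jsonl")
        MemoryBuffer().put(event)
        FileBuffer(path).put(event)
        # New objects stand in for a restarted process.
        return MemoryBuffer().pending(), FileBuffer(path).pending()


in_memory, on_disk = simulate_restart()
print("in-memory after restart:", in_memory)
print("on-disk after restart:", on_disk)
```

After the simulated restart the in-memory buffer has nothing left to deliver, while the file-backed one can replay the event. The price is an extra disk write per event, which is why both tools let you choose between the two modes.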

  • In terms of support, and unless I'm mistaken, there seems to be no enterprise-grade support offered for Logstash per se. Fluentd, on the other hand, does have enterprise support.

Some notes about performance

This is a grey area of discussion, and I cannot provide more detail than what the community of users has already provided. Asking colleagues about this point, I found that Fluentd has a slightly better reputation when it comes to performance. In any case, from my personal point of view, the truth is that both tools performed really well under high workloads. But how well can each perform with the same tech specs? This is some factual data I could gather:

  • Logstash consumes around 100-150 MB of RAM; Fluentd consumes around 30-70 MB of RAM. For modern hardware this is negligible, but when you are deploying across a whole datacentre the difference can add up to tens, hundreds or thousands of additional gigabytes of RAM that must be paid for. So, although it is hardly a meaningful technical fact on a single machine, we should never ignore any tech spec or fact.
  • For small machines or IoT devices, Logstash relies on Elastic Beats, miniaturized agents that perform a subset of the product's capabilities. The same happens with Fluentd, which can deploy Fluent Bit and Fluentd Forwarder for the same task.
  • In my tests I came across some slowness in Logstash compared to Fluentd, but the difference is imperceptible at the scales we usually work with.
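As a back-of-the-envelope check of how the per-process RAM figures above add up, the following Python sketch multiplies the gap between the two ranges by a fleet size. The node counts are hypothetical, and the best and worst cases simply pair the extremes of the ranges quoted in the list (100-150 MB against 30-70 MB).

```python
# Per-process RAM ranges in MB, taken from the figures quoted above.
LOGSTASH_MB = (100, 150)
FLUENTD_MB = (30, 70)


def extra_ram_gib(nodes):
    """Return (low, high) extra RAM in GiB when running Logstash instead of
    Fluentd on `nodes` machines, one process per machine (an assumption)."""
    low_mb = LOGSTASH_MB[0] - FLUENTD_MB[1]   # smallest gap between the ranges
    high_mb = LOGSTASH_MB[1] - FLUENTD_MB[0]  # largest gap between the ranges
    return low_mb * nodes / 1024, high_mb * nodes / 1024


# Hypothetical fleet sizes, chosen only to show the order of magnitude.
for nodes in (100, 1_000, 10_000):
    low, high = extra_ram_gib(nodes)
    print(f"{nodes:>6} nodes: {low:.1f} to {high:.1f} GiB extra")
```

The per-node gap is between 30 and 120 MB, so the total only reaches tens of gigabytes somewhere around a thousand nodes, and hundreds or more at around ten thousand.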

Conclusion

Whether to use Logstash or Fluentd will depend on user experience and specific project needs. We cannot leave technology bias out of the picture either. That being said, Fluentd is written mostly in Ruby, with performance-sensitive parts written in C, and with a more convenient, pre-compiled stable version available. Fluentd also has a forwarder written in Go that provides excellent performance. Logstash's forwarder is written in Go too, while its shipper runs on JRuby, which requires the JVM. I don't like this. Some people say the JVM is a blessing if you configure it well and hell if you don't, but personally I think it is serious overkill for a data shipper.

Finally, in terms of architecture, I feel more comfortable with Fluentd's pluggable architecture and built-in reliability and scalability features than with what Logstash offers. Unified logging to JSON is another feature I like when thinking about microservices and APIs, because it lets you unify all facets of data processing: collecting, filtering, buffering, and outputting data across multiple sources and destinations.

My current bet, for a generic project, is Fluentd.

Best regards (and thank you, Aaron, for the update on Logstash)
