Data & Analytics

Stream all the things

Published
of 42
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.
Related Documents
Share
Description
1. Stream All the Things Streaming Architectures for Data Streams that Never End Dean Wampler, Ph.D. (@deanwampler) 2. Free, as in
Transcript
  • 1. Stream All the Things Streaming Architectures for Data Streams that Never End Dean Wampler, Ph.D. (@deanwampler)
  • 2. Free, as in 🍺 bit.ly/fastdata-ORbook My book, published last fall that describes the points in this talk in greater depth. I’ve refined the talk a bit since this was published.
  • 3. Streaming Engines in Context…
  • 4. Classic Batch Architecture: Hadoop
  • 5. Logs Persistence S3 HDFS DiskDiskDisk SQL/ NoSQL Search YARN Resource Manager Node Manager N M Batch MapReduce … Spark Flume SqoopDBs Schematic view of Hadoop. You need 3 core things…
  • 6. Logs Persistence S3 HDFS DiskDiskDisk SQL/ NoSQL Search YARN Resource Manager Node Manager N M Batch MapReduce … Spark Flume SqoopDBs 1. Storage tier, either HDFS (Hadoop Distributed File System) or alternatives like S3 or databases.
  • 7. Logs Persistence S3 HDFS DiskDiskDisk SQL/ NoSQL Search YARN Resource Manager Node Manager N M Batch MapReduce … Spark Flume SqoopDBs 2. You need a compute engine for processing data. First we had MapReduce, then a few other tools to replace it, finally Spark has emerged as the most viable successor.
  • 8. Logs Persistence S3 HDFS DiskDiskDisk SQL/ NoSQL Search YARN Resource Manager Node Manager N M Batch MapReduce … Spark Flume SqoopDBs 3. Finally, you need a manager of resources, scheduler of jobs and tasks, etc.
  • 9. Logs Persistence S3 HDFS DiskDiskDisk SQL/ NoSQL Search YARN Resource Manager Node Manager N M Batch MapReduce … Spark Flume SqoopDBs Other tools are built on this foundation or supplement it, like tools for ingesting data, such as Sqoop and Flume.
  • 10. New Streaming, “Fast Data” Architecture (but it also supports batch)
  • 11. Mesos, YARN, Cloud, … Logs Sockets REST ZooKeeper Cluster ZK Mini-batch Spark Streaming Batch Spark … Low Latency Flink Ka9a Streams Akka Streams Beam … Persistence S3 HDFS DiskDiskDisk SQL/ NoSQL Search 1 5 6 3 11 KaEa Cluster Ka9a Microservices RP Go Node.js … 2 4 7 8 9 10 Beam Numbering is in the report, so it’s easier for the text to refer back to the figure.
  • 12. Mesos, YARN, Cloud, … Logs Sockets REST ZooKeeper Cluster ZK Mini-batch Spark Streaming Batch Spark … Low Latency Flink Ka9a Streams Akka Streams Beam … Persistence S3 HDFS DiskDiskDisk SQL/ NoSQL Search 1 5 6 3 11 KaEa Cluster Ka9a Microservices RP Go Node.js … 2 4 7 8 9 10 Beam 1 & 2. Data sources include streams of data over sockets and from logs. You might also ingest through REST channels, but unless they are async, the overhead will be too high, so REST might be used only for communicating with the microservices (3) you write to complete your environment.
  • 13. Mesos, YARN, Cloud, … Logs Sockets REST ZooKeeper Cluster ZK Mini-batch Spark Streaming Batch Spark … Low Latency Flink Ka9a Streams Akka Streams Beam … Persistence S3 HDFS DiskDiskDisk SQL/ NoSQL Search 1 5 6 3 11 KaEa Cluster Ka9a Microservices RP Go Node.js … 2 4 7 8 9 10 Beam 3. A real fast data environment is similar to more general services, you’ll have the “heavy hitters” like Spark, Kafka, etc., but you’ll also need to write support microservices to complete the environment.
  • 14. Mesos, YARN, Cloud, … Logs Sockets REST ZooKeeper Cluster ZK Mini-batch Spark Streaming Batch Spark … Low Latency Flink Ka9a Streams Akka Streams Beam … Persistence S3 HDFS DiskDiskDisk SQL/ NoSQL Search 1 5 6 3 11 KaEa Cluster Ka9a Microservices RP Go Node.js … 2 4 7 8 9 10 Beam 4. Some services, like Kafka, require ZooKeeper for managing shares state, such as leaders.
  • 15. Mesos, YARN, Cloud, … Logs Sockets REST ZooKeeper Cluster ZK Mini-batch Spark Streaming Batch Spark … Low Latency Flink Ka9a Streams Akka Streams Beam … Persistence S3 HDFS DiskDiskDisk SQL/ NoSQL Search 1 5 6 3 11 KaEa Cluster Ka9a Microservices RP Go Node.js … 2 4 7 8 9 10 Beam 5. Kafka is the core of this architecture, the place where data is ingested in queues, organized into topics. Publishers are consumers are decoupled and N-to-M. Kafka has massive scalability and excellent resiliency and data durability. All services can communicate through each other using Kafka, too, rather than having to manage arbitrary point-to-point connections.
  • 16. Mesos, YARN, Cloud, … Logs Sockets REST ZooKeeper Cluster ZK Mini-batch Spark Streaming Batch Spark … Low Latency Flink Ka9a Streams Akka Streams Beam … Persistence S3 HDFS DiskDiskDisk SQL/ NoSQL Search 1 5 6 3 11 KaEa Cluster Ka9a Microservices RP Go Node.js … 2 4 7 8 9 10 Beam 6, 8, and 10. Many pure streaming, mini-batch streaming, and batch engines are vying for your attention. We’ll focus into this area after this overview.
  • 17. Mesos, YARN, Cloud, … Logs Sockets REST ZooKeeper Cluster ZK Mini-batch Spark Streaming Batch Spark … Low Latency Flink Ka9a Streams Akka Streams Beam … Persistence S3 HDFS DiskDiskDisk SQL/ NoSQL Search 1 5 6 3 11 KaEa Cluster Ka9a Microservices RP Go Node.js … 2 4 7 8 9 10 Beam 7, 9, and 10. Data can also be read and written between these compute engines and storage, not just Kafka.
  • 18. Mesos, YARN, Cloud, … Logs Sockets REST ZooKeeper Cluster ZK Mini-batch Spark Streaming Batch Spark … Low Latency Flink Ka9a Streams Akka Streams Beam … Persistence S3 HDFS DiskDiskDisk SQL/ NoSQL Search 1 5 6 3 11 KaEa Cluster Ka9a Microservices RP Go Node.js … 2 4 7 8 9 10 Beam 11. You can run this architecture on Mesos, YARN (Hadoop), on premise or in the cloud. (At Lightbend, we think Mesos is the best choice.)
  • 19. Streaming Engines Now let’s zero in…
  • 20. Features to Consider When deciding what to use, there are several dimensions or features to consider.
  • 21. •Low latency? How low? •… If you have tight latency requirements, you can’t use a mini-batch or batch engine. If you’re constraints are more flexible, you can do more sophisticated things, like training ML models, writing to databases, etc.
  • 22. •Low latency? How low? •High volume? How high? •… Some tools are better at large volumes than others. If you have low volumes, you’ll want tools that are efficient at low volumes, whereas some of the high-volume tools are efficient per record when amortized over the stream.
  • 23. •Low latency? How low? •High volume? How high? •Integration with other tools? •Which ones? •… Connecting everything through Kafka can eliminate this requirement, but often that’s not ideal and you may need some tools to have direct connections to other tools.
  • 24. •… •Which kinds of data processing, analytics? •Process individual events? •Process records in bulk? Is your data processing essential data warehouse kinds of analytics or similar SQL queries? Are you doing ETL of data? Are you doing aggregations? Who is consuming this data? Are you doing complex event processing (CEP) where you need to make decisions individually, per event? Or, is it a stream with processing “en masse”, like typical SQL and ETL processing. We’ll comment on how these characteristics are realized in the specific tools we’ll discuss next, but if you’re using other tools, consider how they fit these dimensions.
  • 25. Best of Breed Streaming Engines
  • 26. Logs Sockets REST ZooKeeper Cluster ZK Mini-batch Spark Streaming Batch Spark … Low Latency Flink Ka9a Streams Akka Streams Beam … Persistence S3 HDFS DiskDiskDisk SQL/ NoSQL Search 1 5 6 3 11 Ka>a Cluster Ka9a Microservices RP Go Node.js … 2 4 7 8 9 10 Beam Focusing in on the pure (low-latency) streaming and mini batch engines on the right…
  • 27. Logs Sockets REST ZooKeeper Cluster ZK Mini-batch Spark Streaming Batch Spark … Low Latency Flink Ka9a Streams Akka Streams Beam … Persistence S3 HDFS DiskDiskDisk SQL/ NoSQL Search 1 5 6 3 11 Ka>a Cluster Ka9a Microservices RP Go Node.js … 2 4 7 8 9 10 Beam • Apache Beam • Based on Google Dataflow • Requires a “runner” • (Dataflow is the runner in Google Cloud…) • Most sophisticated streaming semantics. Google has spent years thinking about all the things a streaming engine has to handle if you want to do accurate calculations on streams, accounting for all the contingencies that can happen. The latest incarnation is Google Dataflow, available as a service in Google Cloud Platform. Google recently open-sourced the part of Dataflow for defining “data flows” with these sophisticated semantics, called Apache Beam. Beam doesn’t provide a runner, so third-party tools provide this capability.
  • 28. Here is an example of the scenarios you have to handle. Suppose you are computing per-minute aggregations (like sales per minute). The analysis machine has one clock that is not necessarily in sync with the other servers processing sales. Worse, there is an unavoidable time delay for events to arrive to the analysis server. Hence you need to process event time and you need to account for late arriving data, not only small delays where some events/minute will arrive within the following minute, but even large delays due to network partitions, servers busy, etc., etc. This is one example of the challenges posed by streaming when you want accurate vs. approximate results.
  • 29. Logs Sockets REST ZooKeeper Cluster ZK Mini-batch Spark Streaming Batch Spark … Low Latency Flink Ka9a Streams Akka Streams Beam … Persistence S3 HDFS DiskDiskDisk SQL/ NoSQL Search 1 5 6 3 11 Ka>a Cluster Ka9a Microservices RP Go Node.js … 2 4 7 8 9 10 Beam • Apache Flink • High volume • Low latency • Apache Beam runner • Evolving SQL and ML support Flink provides two unique capabilities among these choices, 1) low-latency processing at scale and 2) it is the most mature runner for Apache Beam data flows (at least that I know of) outside Google’s own Dataflow engine.
  • 30. Logs Sockets REST ZooKeeper Cluster ZK Mini-batch Spark Streaming Batch Spark … Low Latency Flink Ka9a Streams Akka Streams Beam … Persistence S3 HDFS DiskDiskDisk SQL/ NoSQL Search 1 5 6 3 11 Ka>a Cluster Ka9a Microservices RP Go Node.js … 2 4 7 8 9 10 Beam • Akka Streams • Complex event processing • Efficient per-event • Not for large scale “pipes” • Low latency • “Might” become an Apache Beam runner There is a lesser-known engine called Gearpump (Intel project) that offers similar capabilities to Flink, but is implemented in Akka, so it could provide a Beam runner capability on top of Akka Streams. Hence, Akka Streams might be extended to run Beam flows for the situation where event processing (i.e., CEP) is the preferred model, as opposed to “fat” data pipes. Akka also provides a powerful Actor model for rich concurrency, and libraries for clustering, state management, and interop with many sources and sinks (the “Alpakka” project).
  • 31. Logs Sockets REST ZooKeeper Cluster ZK Mini-batch Spark Streaming Batch Spark … Low Latency Flink Ka9a Streams Akka Streams Beam … Persistence S3 HDFS DiskDiskDisk SQL/ NoSQL Search 1 5 6 3 11 Ka>a Cluster Ka9a Microservices RP Go Node.js … 2 4 7 8 9 10 Beam • Kafka Streams • Low-overhead processing of Kafka topics. • Ideal for: • ETL of raw data • Last value for a key, aggregations (“KTable” abstraction) Kafka Streams is a great 80% solution that works close with Kafka (in terms of production deployment and overhead). It is designed to read Kafka topics, perform processing like transformations, filtering, aggregations, “last-seen” value for keys, etc. then write results back to Kafka. It nicely addresses many common design challenges, but isn’t designed to be a complete solution for stream processing, e.g., running SQL queries and training ML models.
  • 32. Logs Sockets REST ZooKeeper Cluster ZK Mini-batch Spark Streaming Batch Spark … Low Latency Flink Ka9a Streams Akka Streams Beam … Persistence S3 HDFS DiskDiskDisk SQL/ NoSQL Search 1 5 6 3 11 Ka>a Cluster Ka9a Microservices RP Go Node.js … 2 4 7 8 9 10 Beam • Spark • Streaming: mini-batch model • Latency > ~500 msecs • Ideal for: • Rich SQL • ML model training • Beam support? Coming… Spark Streaming is a mini-batch model, although this is slowly being replaced with a true, low-latency streaming model. So, today, use it for more expensive calculations, like training ML models, but don’t use it when the lowest-latency processing is required. Lightbend’s Fast Data Platform is working on tools to make it easy to train ML models in Spark and serve them with the other tools. Another advantage of Spark is the rich ecosystem of tools for a wide variety of batch and streaming scenarios.
  • 33. What about Microservices? I’m going to argue that service architectures (classically three tier, but evolving…) and data architectures (classically Big Data like Hadoop) are converging.
  • 34. Mesos, YARN, Cloud, … Logs Sockets REST ZooKeeper Cluster ZK Mini-batch Spark Streaming Batch Spark … Low Latency Flink Ka9a Streams Akka Streams Beam … Persistence S3 HDFS DiskDiskDisk SQL/ NoSQL Search 1 5 6 3 11 KaEa Cluster Ka9a Microservices RP Go Node.js … 2 4 7 8 9 10 Beam • How is this…
  • 35. • … like this? • Image from: You can get this great report by Jonas Boner here: lightbend.com/reactive-microservices-architecture
  • 36. • A Data app | microservice: • Has one responsibility • … Mesos, YARN, Cloud, … Logs Sockets REST ZooKeeper Cluster ZK Mini-batch Spark Streaming Batch Spark … Low Latency Flink Ka9a Streams Akka Streams Beam … Persistence S3 HDFS DiskDiskDisk SQL/ NoSQL Search 1 5 6 3 11 KaEa Cluster Ka9a Microservices RP Go Node.js … 2 4 7 8 9 10 Beam Each data app, streaming or batch, is typically doing one thing, like ETL this Kafka topic to Cassandra, or compute aggregates for a dashboard, etc. Similarly, services evolving now into microservices are also supposed to do one thing and communicate with other services for the “help” they need.
  • 37. • A Data app | microservice: • Has one responsibility • Ingests and processes a never ending stream of data | messages • … Mesos, YARN, Cloud, … Logs Sockets REST ZooKeeper Cluster ZK Mini-batch Spark Streaming Batch Spark … Low Latency Flink Ka9a Streams Akka Streams Beam … Persistence S3 HDFS DiskDiskDisk SQL/ NoSQL Search 1 5 6 3 11 KaEa Cluster Ka9a Microservices RP Go Node.js … 2 4 7 8 9 10 Beam A streaming data app will have a never-ending stream of data to process. Similarly, requests for service will never stop coming to microservices.
  • 38. • A Data app | microservice: • Has one responsibility • Ingests and processes a never ending stream of data | messages • Must be available, responsive, resilient, scalable, … reactive Mesos, YARN, Cloud, … Logs Sockets REST ZooKeeper Cluster ZK Mini-batch Spark Streaming Batch Spark … Low Latency Flink Ka9a Streams Akka Streams Beam … Persistence S3 HDFS DiskDiskDisk SQL/ NoSQL Search 1 5 6 3 11 KaEa Cluster Ka9a Microservices RP Go Node.js … 2 4 7 8 9 10 Beam Hence, both kinds of systems must provide high availability to respond to the stream of data|messages coming in. That means that both must be resilient against failure, scale on demand to meet the load. These are the reactive principles.
  • 39. • Thesis: • Microservice and Data architectures are converging. • Similar design problems. • Data tends to dominate as environment grows. Mesos, YARN, Cloud, … Logs Sockets REST ZooKeeper Cluster ZK Mini-batch Spark Streaming Batch Spark … Low Latency Flink Ka9a Streams Akka Streams Beam … Persistence S3 HDFS DiskDiskDisk SQL/ NoSQL Search 1 5 6 3 11 KaEa Cluster Ka9a Microservices RP Go Node.js … 2 4 7 8 9 10 Beam So, both kinds of systems have similar design problems, hence both should be implemented in similar ways. Also, organizations that aren’t particularly data focused, especially when they are new endeavors, often find that data grows to be a dominant concern, a sign of success! For example, Twitter started as a classic three-tier application, but now most of their infrastructure looks like a petroleum refinery.
  • 40. • <Marketing_Plug /> • Lightbend Fast Data Platform • Converged data services and microservice tools • Quick start, expert guidance • Intelligent cluster management tools. Mesos, YARN, Cloud, … Logs Sockets REST ZooKeeper Cluster ZK Mini-batch Spark Streaming Batch Spark … Low Latency Flink Ka9a Streams Akka Streams Beam … Persistence S3 HDFS DiskDiskDisk SQL/ NoSQL Search 1 5 6 3 11 KaEa Cluster Ka9a Microservices RP Go Node.js … 2 4 7 8 9 10 Beam
  • 41. Free, as in 🍺 bit.ly/fastdata-ORbook Once again, the link to my book, published last fall that describes the points in this talk in greater depth. I’ve refined the talk a bit since this was published.
  • 42. Interested in Lightbend Fast Data Platform? lightbend.com/fast-data-platform This URL takes you to information about the product we’re building, Fast Data Platform, that helps teams succeed with these technologies, from initial installation and configuration, to writing applications, to runtime production monitoring, to advanced machine-learning based automation.
  • Search
    Related Search
    We Need Your Support
    Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

    Thanks to everyone for your continued support.

    No, Thanks