您好,登錄后才能下訂單哦!
本期內容:
Direct Access
Kafka
前面有幾期我們講了帶Receiver的Spark Streaming 應用的相關源碼解讀。但是現在開發Spark Streaming的應用越來越多的采用No Receivers(Direct Approach)的方式,No Receiver的方式的優勢:
1. 更強的控制自由度
2. 語義一致性
其實No Receivers的方式更符合我們讀取數據,操作數據的思路的。因為Spark 本身是一個計算框架,他底層會有數據來源,如果沒有Receivers,我們直接操作數據來源,這其實是一種更自然的方式。 如果要操作數據來源,肯定要有一個封裝器,這個封裝器一定是RDD類型。 以直接訪問Kafka中的數據為例:
object DirectKafkaWordCount { def main(args: Array[String]) { val Array(brokers, topics) = args // Create context with 2 second batch interval val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount") val ssc = new StreamingContext(sparkConf, Seconds(2)) // Create direct kafka stream with brokers and topics val topicsSet = topics.split(",").toSet val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers) val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder]( ssc, kafkaParams, topicsSet) // Get the lines, split them into words, count the words and print val lines = messages.map(_._2) val words = lines.flatMap(_.split(" ")) val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _) wordCounts.print() // Start the computation ssc.start() ssc.awaitTermination() } }
Spark Streaming會封裝一個KafkaRDD:
/** * A batch-oriented interface for consuming from Kafka. * Starting and ending offsets are specified in advance, * so that you can control exactly-once semantics. * @param kafkaParams Kafka <a to be set * with Kafka broker(s) specified in host1:port1,host2:port2 form. * @param offsetRanges offset ranges that define the Kafka data belonging to this RDD * @param messageHandler function for translating each message into the desired type */private[kafka]class KafkaRDD[ K: ClassTag, V: ClassTag, U <: Decoder[_]: ClassTag, T <: Decoder[_]: ClassTag, R: ClassTag] private[spark] ( sc: SparkContext, kafkaParams: Map[String, String], val offsetRanges: Array[OffsetRange], leaders: Map[TopicAndPartition, (String, Int)], messageHandler: MessageAndMetadata[K, V] => R ) extends RDD[R](sc, Nil) with Logging with HasOffsetRanges { override def getPartitions: Array[Partition] = { offsetRanges.zipWithIndex.map { case (o, i) => val (host, port) = leaders(TopicAndPartition(o.topic, o.partition)) new KafkaRDDPartition(i, o.topic, o.partition, o.fromOffset, o.untilOffset, host, port) }.toArray } ... override def compute(thePart: Partition, context: TaskContext): Iterator[R] = { val part = thePart.asInstanceOf[KafkaRDDPartition] assert(part.fromOffset <= part.untilOffset, errBeginAfterEnd(part)) if (part.fromOffset == part.untilOffset) { log.info(s"Beginning offset ${part.fromOffset} is the same as ending offset " + s"skipping ${part.topic} ${part.partition}") Iterator.empty } else { new KafkaRDDIterator(part, context) } }
RDD中重要的方法 getPartitions 和 compute 其中compute中返回了一個 KafkaRDDIterator:
private class KafkaRDDIterator( part: KafkaRDDPartition, context: TaskContext) extends NextIterator[R] { val kc = new KafkaCluster(kafkaParams) ... private def fetchBatch: Iterator[MessageAndOffset] = { val req = new FetchRequestBuilder() .addFetch(part.topic, part.partition, requestOffset, kc.config.fetchMessageMaxBytes) .build() val resp = consumer.fetch(req) handleFetchErr(resp) // kafka may return a batch that starts before the requested offset resp.messageSet(part.topic, part.partition) .iterator .dropWhile(_.offset < requestOffset) } override def close(): Unit = { if (consumer != null) { consumer.close() } } override def getNext(): R = { if (iter == null || !iter.hasNext) { iter = fetchBatch } if (!iter.hasNext) { assert(requestOffset == part.untilOffset, errRanOutBeforeEnd(part)) finished = true null.asInstanceOf[R] } else { val item = iter.next() if (item.offset >= part.untilOffset) { assert(item.offset == part.untilOffset, errOvershotEnd(item.offset, part)) finished = true null.asInstanceOf[R] } else { requestOffset = item.nextOffset messageHandler(new MessageAndMetadata( part.topic, part.partition, item.message, item.offset, keyDecoder, valueDecoder)) } } } }
其中會調用KafkaCluster的connect方法:
org/apache/spark/streaming/kafka/KafkaCluster.scala def connect(host: String, port: Int): SimpleConsumer = new SimpleConsumer(host, port, config.socketTimeoutMs, config.socketReceiveBufferBytes, config.clientId)
KafkaCluster的connect方法返回了一個 SimpleConsumer,如果想自定義控制kafka消息的消費,則可自定義Kafka的consumer。
我們再回過頭看看:
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder]( ssc, kafkaParams, topicsSet)
實際生成了什么:
def createDirectStream[ K: ClassTag, V: ClassTag, KD <: Decoder[K]: ClassTag, VD <: Decoder[V]: ClassTag, R: ClassTag] ( ssc: StreamingContext, kafkaParams: Map[String, String], fromOffsets: Map[TopicAndPartition, Long], messageHandler: MessageAndMetadata[K, V] => R ): InputDStream[R] = { val cleanedHandler = ssc.sc.clean(messageHandler) new DirectKafkaInputDStream[K, V, KD, VD, R] ssc, kafkaParams, fromOffsets, cleanedHandler) }
生成了一個DirectKafkaInputDStream:
org/apache/spark/streaming/kafka/DirectKafkaInputDStream.scala override def compute(validTime: Time): Option[KafkaRDD[K, V, U, T, R]] = { val untilOffsets = clamp(latestLeaderOffsets(maxRetries)) val rdd = KafkaRDD[K, V, U, T, R]( context.sparkContext, kafkaParams, currentOffsets, untilOffsets, messageHandler) // Report the record number and metadata of this batch interval to InputInfoTracker. val offsetRanges = currentOffsets.map { case (tp, fo) => val uo = untilOffsets(tp) OffsetRange(tp.topic, tp.partition, fo, uo.offset) } val description = offsetRanges.filter { offsetRange => // Don't display empty ranges. offsetRange.fromOffset != offsetRange.untilOffset }.map { offsetRange => s"topic: ${offsetRange.topic}\tpartition: ${offsetRange.partition}\t" + s"offsets: ${offsetRange.fromOffset} to ${offsetRange.untilOffset}" }.mkString("\n") // Copy offsetRanges to immutable.List to prevent from being modified by the user val metadata = Map( "offsets" -> offsetRanges.toList, StreamInputInfo.METADATA_KEY_DESCRIPTION -> description) val inputInfo = StreamInputInfo(id, rdd.count, metadata) ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo) currentOffsets = untilOffsets.map(kv => kv._1 -> kv._2.offset) Some(rdd) }
這里面即產生了KafkaRDD實例。
我們再重新思考有Receiver和No Receiver的Spark Streaming應用 Direct訪問的好處:
1. 不需要緩存,不會出現OOM等問題(數據緩存在Kafka中)
2. 如果采用Receiver的方式,Receiver和Worker上Executor綁定了,不方便做分布式(配置一下也可以做)。如果采用Direct的方式,直接是RDD操作,數據默認分布在多個Executor上,天然就是分布式的。
3. 數據消費的問題,在實際操作的時候,如果采用Receiver的方式,如果數據操作來不及消費,Delay多次之后,Spark Streaming程序有可能崩潰。如果是Direct的方式,就不會。
4. 完全的語義一致性,不會重復消費,且只被消費一次。
備注:
1、DT大數據夢工廠微信公眾號DT_Spark
2、IMF晚8點大數據實戰YY直播頻道號:68917580
3、新浪微博: http://www.weibo.com/ilovepains
免責聲明:本站發布的內容(圖片、視頻和文字)以原創、轉載和分享為主,文章觀點不代表本網站立場,如果涉及侵權請聯系站長郵箱:is@yisu.com進行舉報,并提供相關證據,一經查實,將立刻刪除涉嫌侵權內容。