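In big-data development, how can you tell from the cogroup implementation whether a join is a wide or a narrow dependency? Many newcomers are unsure about this, so the walkthrough below explains it step by step.
Let's look at how join is implemented on top of cogroup, starting from the source code.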
    import org.apache.spark.rdd.RDD
    import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

    object JoinDemo {
      def main(args: Array[String]): Unit = {
        // getCanonicalName ends with "$" for a Scala object; init drops that character
        val conf = new SparkConf().setAppName(this.getClass.getCanonicalName.init).setMaster("local[*]")
        val sc = new SparkContext(conf)
        sc.setLogLevel("WARN")

        val random = scala.util.Random
        // 49 (key, user) pairs with random keys in 0..9
        val col1 = Range(1, 50).map(idx => (random.nextInt(10), s"user$idx"))
        val col2 = Array((0, "BJ"), (1, "SH"), (2, "GZ"), (3, "SZ"), (4, "TJ"),
          (5, "CQ"), (6, "HZ"), (7, "NJ"), (8, "WH"), (0, "CD"))

        val rdd1: RDD[(Int, String)] = sc.makeRDD(col1)
        val rdd2: RDD[(Int, String)] = sc.makeRDD(col2)

        // Case 1: neither side has a partitioner
        val rdd3: RDD[(Int, (String, String))] = rdd1.join(rdd2)
        println(rdd3.dependencies)

        // Case 2: both sides are pre-partitioned with an equal HashPartitioner
        val rdd4: RDD[(Int, (String, String))] =
          rdd1.partitionBy(new HashPartitioner(3)).join(rdd2.partitionBy(new HashPartitioner(3)))
        println(rdd4.dependencies)

        sc.stop()
      }
    }
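Consider the code above: what does it print, is each join a wide or a narrow dependency, and why? Note that both println calls report only the dependency of the join result on its direct parent, which is the flatMapValues step inside join (a one-to-one dependency), so the stage graph in the Spark UI is the more reliable way to tell the two cases apart.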
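As for how stage boundaries relate to wide and narrow dependencies: section 2.1.3, "How to distinguish wide and narrow dependencies", shows that a new stage corresponds to a wide dependency, so the stage graphs of rdd3 and rdd4 are enough to tell the two apart. In the graph for rdd3 the join starts a new stage, so rdd3 is produced through a wide dependency. rdd1.partitionBy(new HashPartitioner(3)).join(rdd2.partitionBy(new HashPartitioner(3))) has a different graph: after the partitionBy calls no further stage is created, so that join is a narrow dependency.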
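That conclusion came from the graphs in the Spark UI. Now let's see how join is actually implemented in the source (based on Spark 2.4.5).
Start with the entry method. withScope can be thought of as a decorator whose purpose is to let the Spark UI show more information: every method that creates an RDD is wrapped in it, and RDDOperationScope records the history and relationships of the RDD operations.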
    /**
     * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
     * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
     * (k, v2) is in `other`. Performs a hash join across the cluster.
     */
    def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))] = self.withScope {
      join(other, defaultPartitioner(self, other))
    }
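To make the "decorator" idea concrete, here is a minimal sketch of the wrapping pattern in plain Scala. It is not Spark's implementation: the `WithScopeSketch` object, its `log` buffer and the string-returning `join` and `cogroup` are hypothetical stand-ins, and the real RDDOperationScope has additional nesting rules that are not modelled here.

```scala
object WithScopeSketch {
  import scala.collection.mutable.ArrayBuffer

  // Hypothetical bookkeeping: the scopes currently open, innermost first.
  private var stack: List[String] = Nil
  // Every scope that was entered, written as "outer > inner".
  val log: ArrayBuffer[String] = ArrayBuffer.empty[String]

  // The body is passed by name, so it only runs once the scope is recorded.
  def withScope[T](name: String)(body: => T): T = {
    stack = name :: stack
    try {
      log += stack.reverse.mkString(" > ")
      body
    } finally {
      stack = stack.tail // always close the scope, even if body throws
    }
  }

  // Stand-ins for RDD operations: each one wraps its work in withScope.
  def cogroup(a: String, b: String): String =
    withScope("cogroup") { s"CoGrouped($a, $b)" }

  def join(a: String, b: String): String =
    withScope("join") { cogroup(a, b) + ".flatMapValues" }
}
```

Calling `WithScopeSketch.join("rdd1", "rdd2")` records the outer scope first and then the nested one, which is the kind of information a UI could use to group the RDDs created by a single user-facing call.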
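Next, look at the implementation of defaultPartitioner. It returns the partitioner to use for the join: if one of the input RDDs already has a partitioner that is eligible, or that has more partitions than the default partition count, that partitioner is reused; otherwise a new HashPartitioner with the default partition count is created.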
    object Partitioner {
      def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
        val rdds = (Seq(rdd) ++ others)
        // Keep only the RDDs that already have a partitioner
        val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))

        // Among those, pick the RDD with the most partitions
        val hasMaxPartitioner: Option[RDD[_]] = if (hasPartitioner.nonEmpty) {
          Some(hasPartitioner.maxBy(_.partitions.length))
        } else {
          None
        }

        // If spark.default.parallelism is set, use it;
        // otherwise use the largest partition count among the input RDDs
        val defaultNumPartitions = if (rdd.context.conf.contains("spark.default.parallelism")) {
          rdd.context.defaultParallelism
        } else {
          rdds.map(_.partitions.length).max
        }

        // If the existing max partitioner is an eligible one, or its partitions number is larger
        // than the default number of partitions, use the existing partitioner.
        // Otherwise create a new HashPartitioner with the default number of partitions.
        if (hasMaxPartitioner.nonEmpty && (isEligiblePartitioner(hasMaxPartitioner.get, rdds) ||
            defaultNumPartitions < hasMaxPartitioner.get.getNumPartitions)) {
          hasMaxPartitioner.get.partitioner.get
        } else {
          new HashPartitioner(defaultNumPartitions)
        }
      }

      // Eligible means: less than one order of magnitude smaller than the largest input RDD
      private def isEligiblePartitioner(
          hasMaxPartitioner: RDD[_],
          rdds: Seq[RDD[_]]): Boolean = {
        val maxPartitions = rdds.map(_.partitions.length).max
        log10(maxPartitions) - log10(hasMaxPartitioner.getNumPartitions) < 1
      }
    }
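The selection rule can be tried out without a cluster. The sketch below mirrors the logic above in plain Scala under simplifying assumptions: `FakeRdd`, `Choice`, `ReuseExisting` and `NewHashPartitioner` are hypothetical names, a partitioner is reduced to its partition count, and `defaultParallelism` stands in for an optionally configured spark.default.parallelism.

```scala
object DefaultPartitionerSketch {
  // Hypothetical stand-in for an RDD: its partition count and, if it has a
  // partitioner, that partitioner's partition count.
  final case class FakeRdd(numPartitions: Int, partitionerSize: Option[Int])

  // Outcome of the decision.
  sealed trait Choice
  final case class ReuseExisting(numPartitions: Int) extends Choice
  final case class NewHashPartitioner(numPartitions: Int) extends Choice

  // Same rule as isEligiblePartitioner: the candidate must be less than one
  // order of magnitude smaller than the largest input.
  def isEligible(candidate: FakeRdd, rdds: Seq[FakeRdd]): Boolean = {
    val maxPartitions = rdds.map(_.numPartitions).max
    math.log10(maxPartitions) - math.log10(candidate.numPartitions) < 1
  }

  def choose(rdds: Seq[FakeRdd], defaultParallelism: Option[Int]): Choice = {
    val withPartitioner = rdds.filter(_.partitionerSize.exists(_ > 0))
    val candidate =
      if (withPartitioner.nonEmpty) Some(withPartitioner.maxBy(_.numPartitions)) else None
    // Configured parallelism wins; otherwise the largest input partition count
    val defaultNum = defaultParallelism.getOrElse(rdds.map(_.numPartitions).max)
    candidate match {
      case Some(c) if isEligible(c, rdds) || defaultNum < c.numPartitions =>
        ReuseExisting(c.partitionerSize.get)
      case _ =>
        NewHashPartitioner(defaultNum)
    }
  }

  def main(args: Array[String]): Unit = {
    // Like rdd1.join(rdd2): no partitioner on either side
    println(choose(Seq(FakeRdd(8, None), FakeRdd(8, None)), None))
    // Like the rdd4 case: both sides already partitioned into 3
    println(choose(Seq(FakeRdd(3, Some(3)), FakeRdd(3, Some(3))), None))
  }
}
```

In the first call nothing can be reused, so a new hash partitioner is created; in the second the existing 3-partition partitioner is kept, which is what later allows the join to avoid a shuffle.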
Next, step into the overloaded join. It calls cogroup, which in turn creates new CoGroupedRDD[K](Seq(self, other), partitioner).
    def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
      this.cogroup(other, partitioner).flatMapValues( pair =>
        for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
      )
    }

    def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
        : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
      if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
        throw new SparkException("HashPartitioner cannot partition array keys.")
      }
      // partitioner is the one chosen by defaultPartitioner above;
      // what matters most is its number of partitions
      val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
      cg.mapValues { case Array(vs, w1s) =>
        (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
      }
    }

    /**
     * Return an RDD containing all pairs of elements with matching keys in `this` and `other`. Each
     * pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in `this` and
     * (k, v2) is in `other`. Performs a hash join across the cluster.
     */
    def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))] = self.withScope {
      join(other, new HashPartitioner(numPartitions))
    }
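The "cogroup, then flatMapValues" shape of join can be illustrated on local collections. This is only a sketch of the data flow, assuming plain `Seq` inputs instead of RDDs; the `CogroupJoinSketch` object and its helper signatures are hypothetical and nothing is distributed or partitioned.

```scala
object CogroupJoinSketch {
  // Local stand-in for cogroup: for every key seen on either side,
  // collect the values from the left and from the right.
  def cogroup[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map { k =>
      val vs = left.filter(_._1 == k).map(_._2)
      val ws = right.filter(_._1 == k).map(_._2)
      (k, (vs, ws))
    }.toMap
  }

  // join = cogroup + a per-key cross product, the same for-comprehension
  // that the overloaded join passes to flatMapValues.
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    cogroup(left, right).toSeq.flatMap { case (k, (vs, ws)) =>
      for (v <- vs; w <- ws) yield (k, (v, w))
    }

  def main(args: Array[String]): Unit = {
    val users = Seq((0, "user1"), (1, "user2"), (9, "user3"))
    val cities = Seq((0, "BJ"), (1, "SH"), (0, "CD"))
    // Key 9 has no city, so it produces no output row (inner join)
    join(users, cities).foreach(println)
  }
}
```

A key that appears on only one side ends up with an empty sequence on the other, so the cross product for that key is empty; this is why the result behaves as an inner join.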
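Finally, look at CoGroupedRDD, which is where wide versus narrow is decided. The check is made for each parent RDD: if its partitioner equals the partitioner chosen above, the dependency on that parent is narrow (OneToOneDependency); otherwise it is wide (ShuffleDependency).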
    override def getDependencies: Seq[Dependency[_]] = {
      rdds.map { rdd: RDD[_] =>
        if (rdd.partitioner == Some(part)) {
          logDebug("Adding one-to-one dependency with " + rdd)
          new OneToOneDependency(rdd)
        } else {
          logDebug("Adding shuffle dependency with " + rdd)
          new ShuffleDependency[K, Any, CoGroupCombiner](
            rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
        }
      }
    }
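The per-parent decision can be modelled in a few lines of plain Scala. In this sketch `HashLike` is a hypothetical stand-in for HashPartitioner (two instances count as equal when their partition counts are equal), and `OneToOne` and `Shuffle` are labels for the two dependency kinds rather than Spark classes.

```scala
object CoGroupDependencySketch {
  // Hypothetical simplified partitioner, compared by partition count only.
  final case class HashLike(numPartitions: Int)

  sealed trait Dep
  case object OneToOne extends Dep // narrow: no data movement needed
  case object Shuffle extends Dep  // wide: data must be repartitioned

  // One decision per parent: narrow only when the parent is already
  // partitioned exactly the way the cogroup wants.
  def dependencies(parents: Seq[Option[HashLike]], part: HashLike): Seq[Dep] =
    parents.map(p => if (p.contains(part)) OneToOne else Shuffle)

  def main(args: Array[String]): Unit = {
    // Like rdd3: neither parent has a partitioner
    println(dependencies(Seq(None, None), HashLike(8)))
    // Like rdd4: both parents were partitioned by HashPartitioner(3)
    println(dependencies(Seq(Some(HashLike(3)), Some(HashLike(3))), HashLike(3)))
    // Mixed: only the side without a matching partitioner is shuffled
    println(dependencies(Seq(Some(HashLike(3)), None), HashLike(3)))
  }
}
```

The mixed case shows that the decision is not all-or-nothing: pre-partitioning only one side still removes the shuffle for that side.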
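A join can be given a partition count. If the RDDs on both sides of the join already have the same partitioning scheme and partition count as the join itself, no shuffle takes place; otherwise there is a shuffle, that is, a wide dependency. The partitioning scheme and the partition count are both captured by the partitioner.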