juan_gandhi: (Default)
http://dawn.cs.stanford.edu/ 

"DAWN is a five-year research project to democratize AI by making it dramatically easier to build AI-powered applications.

Our past research – from Spark to Mesos, DeepDive, and HogWild! – already powers major functionality all over Silicon Valley – and the world. Between fighting against human trafficking, assisting in cancer diagnosis, and performing high-throughput genome sequencing, we’ve invested heavily in tools for AI and data product development.

The next step is to make these tools more efficient and more accessible, from training set creation and model design to monitoring, efficient execution, and hardware-efficient implementation. This technology holds the power to change science and society—and we’re creating this change with partners throughout campus and beyond."

juan_gandhi: (Default)
override def eval(input: InternalRow): Any = {
  System.currentTimeMillis() * 1000L
}
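This looks like Spark's CurrentTimestamp expression. Note what the multiplication buys you: the value is nominally in microseconds, but the source clock is millisecond-resolution, so the sub-millisecond digits are always zero. A quick check in plain Scala (no Spark needed):

```scala
// "Microseconds" obtained by scaling a millisecond clock by 1000:
// the last three digits can never be anything but 0.
val micros = System.currentTimeMillis() * 1000L
assert(micros % 1000 == 0)  // fake sub-millisecond precision
```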
juan_gandhi: (Default)
In spark.core: the RNG, specifically the normal-distribution RNG.

It caches values (the Gaussian generator computes them in pairs and keeps the spare). Now try to reseed.
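The failure mode, as far as I can tell: `java.util.Random.nextGaussian` computes Gaussians in pairs and caches the spare, and a subclass that overrides `setSeed` without calling `super` (or otherwise clearing that cache) leaves the stale value behind. A minimal sketch of the bug; `CountingRng` is a hypothetical class for illustration, not Spark's:

```scala
import java.util.Random

// Hypothetical RNG that keeps its own seed and overrides setSeed
// without touching java.util.Random's cached "next next Gaussian".
class CountingRng(seed0: Long) extends Random {
  var calls = 0                    // how many raw draws were made
  private var s: Long = seed0
  override def setSeed(seed: Long): Unit = { s = seed } // cache NOT cleared
  override protected def next(bits: Int): Int = {
    calls += 1
    s = s * 6364136223846793005L + 1442695040888963407L // any LCG will do
    (s >>> (64 - bits)).toInt
  }
}

val rng = new CountingRng(1L)
rng.nextGaussian()                 // computes a pair, caches the spare
val callsBeforeReseed = rng.calls  // at least 4 raw draws happened
rng.setSeed(1L)                    // try to reseed...
rng.nextGaussian()                 // ...and get served from the stale cache:
assert(rng.calls == callsBeforeReseed) // zero new raw draws -- reseed ignored
```

A correctly reseeded generator would have to redo the raw draws; this one hands back the cached value as if nothing happened.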

TIL

Jun. 15th, 2017 03:34 pm
juan_gandhi: (Default)
There's a cogroup in data science. It is a groupBy followed by a full outer join (in this case we actually get a pullback).

"You can COGROUP up to but no more than 127 relations at a time" says Google.
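Spark's `RDD.cogroup` has exactly this shape; here is a minimal model of its semantics on plain Scala collections (the helper below is a sketch of the semantics, not Spark's implementation):

```scala
// cogroup = groupBy on each side, then a full outer join of the groups:
// every key from either input appears, with (possibly empty) value lists.
def cogroup[K, V, W](a: Seq[(K, V)], b: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
  val ga = a.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
  val gb = b.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
  (ga.keySet ++ gb.keySet).map { k =>
    (k, (ga.getOrElse(k, Seq.empty[V]), gb.getOrElse(k, Seq.empty[W])))
  }.toMap
}

val left    = Seq("a" -> 1, "a" -> 2, "b" -> 3)
val right   = Seq("b" -> "x", "c" -> "y")
val grouped = cogroup(left, right)
assert(grouped("a") == (Seq(1, 2), Seq.empty)) // key only on the left
assert(grouped("b") == (Seq(3), Seq("x")))     // key on both sides
assert(grouped("c") == (Seq.empty, Seq("y")))  // key only on the right
```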

juan_gandhi: (Default)
"Spark is faster than map/reduce." A guy from Databricks is giving a talk on how Spark works.

Omfg; how can one possibly deal with all this?

UPD. From the same talk: cartesion() [sic]
juan_gandhi: (Default)
    // Set name from main class if not given
    name = Option(name).orElse(Option(mainClass)).orNull
    if (name == null && primaryResource != null) {
      name = Utils.stripDirectory(primaryResource)
    }


Just a chunk of their funny shitty code.

FYI, var name: String = null

I'd rewrite all that crap, but I'm afraid it's not only the code. It's the whole Spark world that needs a doctor.
juan_gandhi: (VP)
  /**
   * Returns a Seq of the children of this node.
   * Children should not change. Immutability required for containsChild optimization
   */
  def children: Seq[BaseType]

  lazy val containsChild: Set[TreeNode[_]] = children.toSet


Imagine: subclasses override this, but the children are not supposed to change. Why t.f. not declare it a val, then?!

Another masterpiece
object CurrentOrigin {
  private val value = new ThreadLocal[Origin]() {
    override def initialValue: Origin = Origin()
  }

  def get: Origin = value.get()
  def set(o: Origin): Unit = value.set(o)

  def reset(): Unit = value.set(Origin())

  def setPosition(line: Int, start: Int): Unit = {
    value.set(
      value.get.copy(line = Some(line), startPosition = Some(start)))
  }

  def withOrigin[A](o: Origin)(f: => A): A = {
    set(o)
    val ret = try f finally { reset() }
    reset()
    ret
  }
}


So, we have a static object that wants to do something with a context (see withOrigin). That's "dependency injection," Fortran style. And we can pretend to do without vars because, well, it's a ThreadLocal.

Who wrote all this... could they have hired Scala programmers, I wonder...
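For contrast, a sketch of the same context-threading done the way a Scala programmer presumably would, with an implicit parameter: no ThreadLocal, no mutation, no reset() bookkeeping. The `Origin` here is simplified for illustration, not Spark's:

```scala
// Simplified stand-in for Spark's Origin.
case class Origin(line: Option[Int] = None, startPosition: Option[Int] = None)

// The context is an ordinary (implicit) argument; lexical scoping
// replaces set()/reset() entirely.
def withOrigin[A](o: Origin)(f: Origin => A): A = f(o)

def describe(implicit o: Origin): String =
  s"line ${o.line.getOrElse(0)}"

val out = withOrigin(Origin(line = Some(7))) { implicit o => describe }
assert(out == "line 7")
```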
juan_gandhi: (VP)
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/ClosureCleaner.scala

ClosureCleaner

So, Spark gets a function to apply to data (e.g. in foreach, or in filter). Before running it, it (like a raccoon) cleans the function: it scrubs the closure of references to enclosing scopes it doesn't need, so the closure can be serialized and shipped to the executors.
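Why cleaning is needed at all, as far as I can tell: a lambda written inside a method of a non-serializable class quietly captures `this`, and serialization then drags the whole enclosing object along. A sketch of the problem (the classes below are hypothetical, not Spark's):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// A driver-side class that is NOT Serializable.
class Driver {
  val factor = 3
  // Reads the field through `this`, so the lambda captures Driver itself.
  def badFn: Int => Int = x => x * factor
  // Copies the field to a local first, so only an Int is captured.
  def goodFn: Int => Int = { val f = factor; x => x * f }
}

// True iff obj survives Java serialization.
def serializable(obj: AnyRef): Boolean =
  try {
    new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
    true
  } catch { case _: NotSerializableException => false }

val d = new Driver
assert(!serializable(d.badFn))  // drags the non-serializable Driver along
assert(serializable(d.goodFn))  // fine: captured only an Int
```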

Enjoy the nice reading.

Profile

juan_gandhi: (Default)
Juan-Carlos Gandhi
