Apr. 6th, 2017

juan_gandhi: (Default)
        // start with a zero count for each of the three codes, then bump one count per row
        val counts = (0 until something.length.toInt).foldLeft((0 until 3).map(_ -> 0).toMap) {
          case (c, i) =>
            val k = something.at8(i).toInt
            c + (k -> (c(k) + 1))
        }


Just counting. It's a piece of a test, so...
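
For anyone who wants to play with the same fold without H2O at hand, here is a minimal sketch on a plain collection. The `codes` vector is a made-up stand-in for the column (the real code reads values with `at8`), and the name `CountSketch` is mine, not something from the test.

```scala
object CountSketch {
  // Hypothetical stand-in for the H2O column: category codes 0, 1, 2 stored as Long.
  val codes: Vector[Long] = Vector(0L, 2L, 1L, 2L, 2L, 0L)

  // Zero count for every expected code, so codes that never occur still show up with 0.
  val zero: Map[Int, Int] = (0 until 3).map(_ -> 0).toMap

  // Walk the row indices and bump the count of the code found in each row.
  val counts: Map[Int, Int] = codes.indices.foldLeft(zero) { (c, i) =>
    val k = codes(i).toInt
    c + (k -> (c(k) + 1))
  }
}
```

Seeding the map with zeros is the one real design choice here: a plain `groupBy` would silently drop any code that never appears in the data.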
juan_gandhi: (Default)
github.com/h2oai/h2o-3/blob/f6b96396b9205fd342794d8308ea2e8a43c1a03a/h2o-scala/src/test/scala/water/userapi/H2ODatasetTest.scala

  test("Citibike, end to end") {
    // this is the path of the sample we use
    val path = "smalldata/demos/citibike_20k.csv"
    // we read the file here, and produce a dataset from it
    val dataset = H2ODataset.readFile(path)

    // removing these two columns that we don't care about
    dataset.removeColumn("start station name", "end station name")

    // this is the expected number of rows in the dataset
    val expectedSize = 20000
    
    // checking that we got exactly the number of records we expected
    assert(expectedSize == dataset.length)

    // converting gender column to categorical type
    dataset.makeCategorical("gender")
    
    // the domain should be "male", "female", and "N/A"
    val categories = dataset.domainOf("gender")
    assert(Some(3) == categories.map(_.length))
    
    // apply oneHot encoding to all applicable columns except gender (we'll need it)
    val oneHot = dataset.oneHotEncodeExcluding("gender")
    
    // we expect 15 possible values in the one-hot encoded domain
    assert(Some(15) == oneHot.domain.map(_.length))

    // Planning to do a stratified split, so 0.75 goes to the training dataset and 0.25 to the validation dataset
    val ratio = 0.25
    val expectedValidSize = (expectedSize * ratio).toInt
    val expectedTrainSize = expectedSize - expectedValidSize
    
    // do stratified split on gender column; 55555 is the random seed
    oneHot.stratifiedSplit("gender", ratio, 55555) match {
      case Some((train, valid)) =>
        assert(expectedTrainSize == ETL.length(train))
        assert(expectedValidSize == ETL.length(valid))

      case None =>
        fail("Failed to stratify by gender")
    }
  }
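
To show what a stratified split means, here is a rough sketch on plain Scala collections. It is not how H2O implements `stratifiedSplit`; the object name, the signature, and the per-stratum truncation rule are all my assumptions for illustration. The idea is to group rows by the stratification key, shuffle each group with a seeded generator, and send about `ratio` of each group to the validation set.

```scala
import scala.util.Random

object StratifiedSplitSketch {
  // Hypothetical helper, not the H2O API: returns (train, valid).
  // Each stratum contributes (size * ratio).toInt rows to valid, so with
  // several strata the total may fall slightly below (rows.size * ratio).toInt.
  def stratifiedSplit[A, K](rows: Seq[A], ratio: Double, seed: Long)(key: A => K): (Seq[A], Seq[A]) = {
    val rng = new Random(seed)
    val parts = rows.groupBy(key).values.toSeq.map { group =>
      val shuffled = rng.shuffle(group)          // seeded, hence reproducible
      val nValid = (group.size * ratio).toInt    // truncate within each stratum
      val (valid, train) = shuffled.splitAt(nValid)
      (train, valid)
    }
    (parts.flatMap(_._1), parts.flatMap(_._2))
  }

  // Toy data: 12 "male" rows and 8 "female" rows.
  val rows: Seq[(Int, String)] =
    (0 until 20).map(i => (i, if (i < 12) "male" else "female"))

  val (train, valid) = stratifiedSplit(rows, 0.25, 55555L)(_._2)
}
```

With the toy data each gender keeps its share in both halves, which is the whole point of stratifying. A plain random split gives no such guarantee.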
