Oct. 17th, 2013
applicatives in production
Oct. 17th, 2013 04:42 pm

    def downloadPDF(url: String): Result[(File, String)] = {
      loadPage(url) andThen
      waitForSelector("div.textLayer") andThen
      runJS("return extractPdfContent()") andThen {
        Thread.sleep(1000) // give browser a chance
        val extracted = runJS("return intBuf2hex(extractedPdf)") map (_.toString)
        val pdf = extracted flatMap (_.decodeHex #> File.createTempFile("download", ".pdf"))
        val html = runJS("return _$('div.textLayer').innerHTML") map (_.toString)
        pdf <*> html
      }
    }

What happens here.
I load a page in Mozilla, via Selenium. Actually a pdf, but Mozilla pretends it's html.
andThen ... (meaning, if it failed, no need to proceed, right?)
Then I extract innerHTML of the content div; I need to parse it.
Oh, the _.js means we convert the value of this string into a JavaScript representation of the string, with apostrophes escaped, wrapped in apostrophes.
But what the server sent is actually a pdf (rendered in Mozilla by pdf.js).
So I need the pdf binary. It was https, and there's no api for interception.
So I go into the guts of pdf.js, find the holder of the binary, and tell it to give me the bytes (in a continuation). But the whole communication with the browser is imperative; so I sleep for a second. No biggie.
When I wake up, the bytes I need have already been pulled from the pdf.js future and converted to a hex string (about 160k of text, on one line).
I extract it from the browser.
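Getting from such a hex string back to bytes is easy enough to sketch. The block below is only an illustration under assumptions: it uses plain Either instead of the Result type from the snippet, and HexSketch, decodeHex and writeTo are hypothetical stand-ins, not the helpers the real code calls.

```scala
import java.io.File
import java.nio.file.Files

object HexSketch {
  // Turn "25504446" into the bytes 0x25 0x50 0x44 0x46.
  // Odd-length or non-hex input gives a Left with an explanation.
  def decodeHex(hex: String): Either[String, Array[Byte]] =
    if (hex.length % 2 != 0)
      Left(s"odd number of hex digits: ${hex.length}")
    else if (!hex.forall(c => Character.digit(c, 16) >= 0))
      Left("non-hex character in input")
    else
      Right(hex.grouped(2).map(pair => Integer.parseInt(pair, 16).toByte).toArray)

  // Write the bytes to the given file and hand the file back,
  // or explain why the write failed.
  def writeTo(bytes: Array[Byte], file: File): Either[String, File] =
    try { Files.write(file.toPath, bytes); Right(file) }
    catch { case e: java.io.IOException => Left(String.valueOf(e.getMessage)) }
}
```

Chained together, it might be used as `HexSketch.decodeHex(hex).flatMap(HexSketch.writeTo(_, File.createTempFile("download", ".pdf")))`, which yields either the file or the first reason it could not be produced.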
Then I decode the hexes, producing bytes, and send them to a temp file; the #> op returns the file... actually, monadically, a hope for a file. There's flatMap here; we flatten all hopes within hopes into one big hope - or an explanation of why everything fell apart.
Now we have, hopefully, a text, and, hopefully, a pdf. We apply the tensor product (the <*> in the code) to produce either a long list of explanations of why we failed, or a tuple, a pair (file, text).
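To make the "hopes" concrete, here is a minimal sketch of what a Result with these three operations might look like. The actual Result class is not shown in this post, so the names Good and Bad and all method bodies below are assumptions made for illustration only.

```scala
// Hypothetical stand-in for the post's Result: a value, or a list of explanations.
sealed trait Result[+A] {
  def map[B](f: A => B): Result[B] = this match {
    case Good(a) => Good(f(a))
    case b: Bad  => b
  }

  // Monadic bind: a hope within a hope collapses into one hope.
  def flatMap[B](f: A => Result[B]): Result[B] = this match {
    case Good(a) => f(a)
    case b: Bad  => b
  }

  // Sequencing: `next` is evaluated only if this step succeeded.
  def andThen[B](next: => Result[B]): Result[B] = flatMap(_ => next)

  // Applicative product: a pair of both values, or every explanation collected.
  def <*>[B](other: Result[B]): Result[(A, B)] = (this, other) match {
    case (Good(a), Good(b)) => Good((a, b))
    case (Bad(e1), Bad(e2)) => Bad(e1 ++ e2)
    case (b: Bad, _)        => b
    case (_, b: Bad)        => b
  }
}
final case class Good[+A](value: A) extends Result[A]
final case class Bad(errors: List[String]) extends Result[Nothing]

object ResultDemo {
  def main(args: Array[String]): Unit = {
    val pdf: Result[String]  = Bad(List("no pdf"))
    val html: Result[String] = Bad(List("no html"))
    println(pdf <*> html) // both explanations survive

    var ran = false
    val chained = pdf andThen { ran = true; Good(1) }
    println((chained, ran)) // the second step never ran
  }
}
```

The difference between the two combinators is the whole point: andThen and flatMap stop at the first failure, because the next step depends on the previous one, while <*> looks at both sides independently and so can report everything that went wrong.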
QED.
Questions? Obvious?