def downloadPDF(url: String): Result[(File, String)] = {
  loadPage(url) andThen
  waitForSelector("div.textLayer") andThen
  runJS("return extractPdfContent()") andThen {
    Thread.sleep(1000) // give browser a chance
    val extracted = runJS("return intBuf2hex(extractedPdf)") map (_.toString)
    val pdf = extracted flatMap
      (_.decodeHex #> File.createTempFile("download", ".pdf"))
    val html = runJS("return _$('div.textLayer').innerHTML") map (_.toString)
    pdf <*> html
  }
}
What happens here?
I load a page in Mozilla, via Selenium. It's actually a pdf, but Mozilla pretends it's html. The steps are chained with andThen (meaning, if one failed, no need to proceed, right?).
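To make the short-circuiting concrete, here is a minimal sketch of how such an andThen could behave. It assumes a two-case Result; the names Good, Bad, step and log are made up for the illustration and are not the library used above.

```scala
object AndThenSketch {
  // Hypothetical stand-in for the post's Result type.
  sealed trait Result[+T] {
    // `next` is passed by name, so it is evaluated only if this step succeeded.
    def andThen[U](next: => Result[U]): Result[U] = this match {
      case Good(_) => next
      case b: Bad  => b
    }
  }
  final case class Good[+T](value: T) extends Result[T]
  final case class Bad(errors: List[String]) extends Result[Nothing]

  // Records which steps actually ran.
  var log = Vector.empty[String]

  def step(name: String, ok: Boolean): Result[String] = {
    log :+= name
    if (ok) Good(name) else Bad(List(s"$name failed"))
  }

  // The second step fails, so the third one is never evaluated.
  def run(): Result[String] =
    step("loadPage", ok = true) andThen
      step("waitForSelector", ok = false) andThen
      step("runJS", ok = true)
}
```

After AndThenSketch.run(), the log holds only the first two step names: the failure of the second step is passed along untouched and the third step never starts.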
Then I extract the innerHTML of the content div; I need to parse it.
Oh, and _.js means we convert the value of the string into its JavaScript representation: apostrophes escaped, the whole thing wrapped in apostrophes.
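A rough sketch of what such a js conversion might look like, written as an extension method on String. This is a guess at the behavior described, not the actual implementation; it also escapes backslashes and line breaks, which the description above does not mention.

```scala
object JsLiteralSketch {
  // Hypothetical extension: "abc".js gives a JavaScript string literal in apostrophes.
  implicit class JsString(val s: String) extends AnyVal {
    def js: String = {
      val escaped = s
        .replace("\\", "\\\\") // backslashes first, so later escapes are not doubled
        .replace("'", "\\'")   // apostrophes would otherwise end the literal
        .replace("\n", "\\n")
        .replace("\r", "\\r")
      "'" + escaped + "'"
    }
  }
}
```

With import JsLiteralSketch._ in scope, a selector string can be spliced into a script passed to runJS without its apostrophes breaking the JavaScript.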
But what the server sent is actually a pdf (rendered in Mozilla by pdf.js), so I need the pdf binary. It came over https, and there's no API for intercepting it.
So I go into the guts of pdf.js, find the holder of the binary, and tell it to give me the bytes (in a continuation). But the whole communication with the browser is imperative, so I sleep for a second. No biggie.
When I wake up, the bytes I need have already been pulled from the pdf.js future and converted to a hex string (about 160k of text, one line).
I extract it from the browser.
Then I decode the hex digits, producing bytes, and send them to a temp file; the #> op returns the file... actually, monadically, a hope for a file.
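Here is a small sketch of the decode-and-save step. It uses scala.util.Try as a stand-in for the "hope", and the names decodeHex, writeTo and saveHex are hypothetical; the real decodeHex and #> belong to the library used above and may well differ.

```scala
import java.io.File
import java.nio.file.Files
import scala.util.Try

object HexToFileSketch {
  // Turn "cafe01" into bytes; fails on odd length or on a non-hex character.
  def decodeHex(hex: String): Try[Array[Byte]] = Try {
    require(hex.length % 2 == 0, "odd number of hex digits")
    hex.grouped(2).map { pair =>
      val hi = Character.digit(pair(0), 16)
      val lo = Character.digit(pair(1), 16)
      require(hi >= 0 && lo >= 0, s"not a hex pair: $pair")
      ((hi << 4) | lo).toByte
    }.toArray
  }

  // Stand-in for #>: write the bytes and hand back the file, or the failure.
  def writeTo(bytes: Array[Byte], file: File): Try[File] = Try {
    Files.write(file.toPath, bytes)
    file
  }

  // Decode, then save; the temp file is created only if decoding succeeded.
  def saveHex(hex: String): Try[File] =
    decodeHex(hex).flatMap(bytes => writeTo(bytes, File.createTempFile("download", ".pdf")))
}
```

Either step can fail (bad hex, no disk space), and in both cases the caller gets a failure instead of a file.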
There's a flatMap here; we flatten all hopes within hopes into one big hope - or an explanation of why everything fell apart.
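The difference between map and flatMap is easy to see on a toy example. This is a sketch with a made-up two-case Result (Good and Bad are assumed names) and a hypothetical toInt helper, not the library's own code.

```scala
object FlatMapSketch {
  // Hypothetical stand-in for the post's Result type.
  sealed trait Result[+T] {
    def map[U](f: T => U): Result[U] = this match {
      case Good(v) => Good(f(v))
      case b: Bad  => b
    }
    def flatMap[U](f: T => Result[U]): Result[U] = this match {
      case Good(v) => f(v)
      case b: Bad  => b
    }
  }
  final case class Good[+T](value: T) extends Result[T]
  final case class Bad(errors: List[String]) extends Result[Nothing]

  // A step that may itself fail.
  def toInt(s: String): Result[Int] =
    try Good(s.trim.toInt)
    catch { case _: NumberFormatException => Bad(List(s"not a number: $s")) }

  // map leaves a hope within a hope...
  val nested: Result[Result[Int]] = Good("42").map(toInt)
  // ...while flatMap gives one flat hope, or the explanation.
  val flat: Result[Int] = Good("42").flatMap(toInt)
  val failed: Result[Int] = Good("pdf").flatMap(toInt)
}
```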
Now we have, hopefully, a text, and, hopefully, a pdf. We apply the tensor product to produce either a long list of explanations of why we failed, or a tuple, a pair (file, text).
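A minimal sketch of such a product, again on a made-up Result with assumed names Good and Bad. Unlike flatMap, it looks at both sides, so when both fail we keep both explanations.

```scala
object TensorSketch {
  // Hypothetical stand-in for the post's Result type.
  sealed trait Result[+T] {
    // Pair up two results; if either side failed, keep every explanation.
    def <*>[U](other: Result[U]): Result[(T, U)] = (this, other) match {
      case (Good(a), Good(b)) => Good((a, b))
      case (Bad(e1), Bad(e2)) => Bad(e1 ++ e2)
      case (b: Bad, _)        => b
      case (_, b: Bad)        => b
    }
  }
  final case class Good[+T](value: T) extends Result[T]
  final case class Bad(errors: List[String]) extends Result[Nothing]
}
```

For example, Bad(List("no pdf")) <*> Bad(List("no html")) gives Bad(List("no pdf", "no html")), while two Good values give a Good pair in the same order as the operands.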
QED.
Questions? Obvious?