Friday, March 23, 2012

Scripting in Scala

Today was the first time I've felt comfortable using Scala to write a true script.  Something simple, like taking an HTML file and extracting certain anchor tags that can occur multiple times per line, is surprisingly annoying to do in many scripting languages like Python or Perl.  You often end up with a regex from hell, and I'm not really one for regexes from hell, while the declarative style ends up taking far more code than it really should.  You can do this in Perl in a sort of quick way, but it looks pretty damn ugly, and besides, it's 2012; surely we have something that can do this almost as well as or better than Perl!

So, without further ado, I give you my very simple script:

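import scala.io.Source
import java.io.File

// Print the target of every href that ends in .html, one per line.
// args(0) is the path of the HTML file to scan.
println(Source.fromFile(new File(args(0))).getLines().filter(_.contains("")).flatMap {x=>x.split("<")}.map {x=>
  (x.indexOf("href=")>0 match {
    case true => x.substring(x.indexOf("href=")).dropWhile("'\"".contains(_)).takeWhile(_!='>')
    case false => ""
// mkString instead of reduce, so a file with no matching links prints an empty line rather than throwing
})}.filter(x=>{x!="" && x.endsWith(".html\"")}).map {x=>x.dropWhile(_!='"').drop(1).takeWhile(_!='"')}.mkString("\n"))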

Is this the best way, the easiest, or the most elegant?  No.  It's a script that I needed to write in ten minutes or less.  Almost a throw-away piece of code.
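
For anyone who would rather read the pipeline one step per line, here is a rough sketch of the same idea.  The name htmlLinks is made up for illustration, and it takes the lines as a parameter instead of reading args(0) directly; treat it as one possible rearrangement, not a better script.

```scala
// Hypothetical helper (the name htmlLinks is an assumption, not part of the script above).
// It follows the same steps as the one-liner, one step per line.
def htmlLinks(lines: Iterator[String]): List[String] =
  lines
    .flatMap(_.split("<").toList)                            // one chunk per tag
    .filter(_.indexOf("href=") > 0)                          // keep chunks that carry an href attribute
    .map(c => c.substring(c.indexOf("href=")))               // start each chunk at the attribute
    .map(_.takeWhile(_ != '>'))                              // cut at the end of the tag
    .filter(_.endsWith(".html\""))                           // keep only links to .html pages
    .map(_.dropWhile(_ != '"').drop(1).takeWhile(_ != '"'))  // strip href=" and the closing quote
    .toList

// Possible use inside a script, assuming the file path is the first argument:
//   htmlLinks(scala.io.Source.fromFile(args(0)).getLines()).foreach(println)
```

Like the original, this only catches tags where the double-quoted href is the last attribute before the closing bracket; anything after it makes the endsWith check fail and the link is skipped.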

The big thing for me was that I've finally become familiar enough with Scala syntax that I could achieve this in less than ten minutes.  No wandering off to Stack Overflow to look something up, or struggling with one of the erasures for the list comprehensions; I could just sit here and type and make it work with minimal debugging fuss.

When I think about the pure horror of trying to do this in Java, I shudder.  If I removed most of the newlines and the imports, this could exist on just two lines (I'm not sure if I can terminate a case clause with a semicolon or not).  It is probably possible to write this as an immediate script right on the command line, and still be able to read it (well, mostly).
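
On the semicolon question: as far as I can tell, Scala does accept a semicolon between case clauses, so the match can sit on a single line.  A minimal sketch, with hrefPart as a made-up name used only for illustration:

```scala
// Hypothetical helper (hrefPart is an assumed name): the match from the script,
// squeezed onto one line with a semicolon separating the two case clauses.
def hrefPart(x: String): String =
  (x.indexOf("href=") > 0) match { case true => x.substring(x.indexOf("href=")).takeWhile(_ != '>'); case false => "" }
```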

Today is a good day.
