Monday, December 17, 2012

Collect Chains for Creation - Sublime or Stupid?


I'm working on a piece of code to deserialize Facebook messages, and looking at the mess I wrote to do this, it occurred to me that I could do it as a progressive collect chain.  Each time we know something new, we collect on it, or pass None through if a condition fails.

def makeNewMessage(id: Long)(json: JsValue)(facebookClient: FacebookClient)(session: scala.slick.session.Session): Option[Message] = {
  (((json \ "message").asOpt[String], (json \ "created_time").asOpt[String], (json \ "id").asOpt[String]) match {
    case (Some(message), Some(createdTime), Some(fbId)) =>
      Some((userId: Long) => (likes: Int) => Message(
        id = id,
        subject = None,
        systemUserId = Some(userId),
        message = Some(message),
        createdTimestamp = Some(new sql.Date(facebookDateParser.parse(createdTime).getTime)),
        facebookId = Some(fbId),
        upVotes = Some(likes)
      ))
    case _ => None // a missing field fails the whole chain
  }).flatMap { f =>
    scaffoldFacebookUser(json)(facebookClient)(session) match {
      case Right(scaffoldedId: Int) => Some(f(scaffoldedId))
      case Left(_) => None // couldn't build a user, so no message
    }
  }.collect { case f =>
    f(((json \ "likes").asOpt[Int], (json \\ "likes").size) match {
      case (Some(likes), 0) => likes
      case (_, likes) => likes
    })
  }
}


The result of the first piece is a curried function waiting on a userId and a like count.  Once we figure out whether we can build a user, we populate that, then we figure out the like count.  If at any point the chain fails, it just returns None.

This all feels very imperative to me, perhaps I'm just tired, and it will come to me later. I swear I remember better ways of doing this using unapply magic, but I can't seem to figure it out, so this is where I'm going right now!
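The shape of the chain can be sketched with nothing but plain Options. This is a minimal, self-contained sketch with made-up field names, not the Facebook types above:

```scala
// Each stage either fills in one more argument of the curried builder,
// or collapses the whole computation to None.
def build(fields: Map[String, String]): Option[String] =
  fields.get("message")
    .map(msg => (user: String) => (likes: Int) => s"$msg from $user (+$likes)")
    .flatMap(f => fields.get("user").map(f))                    // no user => None
    .map(f => f(fields.get("likes").map(_.toInt).getOrElse(0))) // default likes to 0

build(Map("message" -> "hi", "user" -> "alice", "likes" -> "3")) // Some("hi from alice (+3)")
build(Map("message" -> "hi"))                                    // None
```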

Wednesday, October 24, 2012

Head shaking moments - an ongoing Saga

I think I might start to keep track of examples of bad code that are right out there in the public view.  Some of these examples are the language tutorials themselves even!

Today I'm gonna single out the object example in the CoffeeScript guide.  In this guide we get the Animal class example:

class Animal
  constructor: (@name) ->

  move: (meters) ->
    alert @name + " moved #{meters}m."

class Snake extends Animal
  move: ->
    alert "Slithering..."
    super 5

class Horse extends Animal
  move: ->
    alert "Galloping..."
    super 45

sam = new Snake "Sammy the Python"
tom = new Horse "Tommy the Palomino"

sam.move()
tom.move()

This code is in violation of the Call Super code smell.  We finally get half decent classes in Javascript using CoffeeScript, and this is the first example given - one that is considered by some to be an anti-pattern.  Below I'm going to feature an attempt to refactor out this smell.

Updated Code
class Animal
  constructor: (@name) ->

  move: (meters = @distance()) ->
    alert @movingMessage()
    alert @name + " moved #{meters}m."

  movingMessage: -> "Moving..."
  distance: -> 10


class Snake extends Animal
  movingMessage: -> "Slithering..."
  distance: -> 5

class Horse extends Animal
  movingMessage: -> "Galloping..."
  distance: -> 45

sam = new Snake "Sammy the Python"
tom = new Horse "Tommy the Palomino"

sam.move()
tom.move()

Delegate methods are a much better use for an inheritance contract than methods that override a super to provide essentially the same behavior, only with different parameters. I would argue that the second version is substantially clearer as each method does precisely one thing, including the move method of Animal.
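For comparison, the same template-method shape renders naturally in Scala too (hypothetical classes mirroring the CoffeeScript above):

```scala
// Subclasses override small delegate methods; nobody calls super with
// slightly different arguments.
class Animal(val name: String) {
  def movingMessage: String = "Moving..."
  def distance: Int = 10
  def move(meters: Int = distance): String = s"$movingMessage $name moved ${meters}m."
}

class Snake(name: String) extends Animal(name) {
  override def movingMessage = "Slithering..."
  override def distance = 5
}

class Horse(name: String) extends Animal(name) {
  override def movingMessage = "Galloping..."
  override def distance = 45
}

new Snake("Sammy the Python").move()   // "Slithering... Sammy the Python moved 5m."
new Horse("Tommy the Palomino").move() // "Galloping... Tommy the Palomino moved 45m."
```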

Sunday, August 26, 2012

Dealing with Annoying JSON

How often do you have to work with an API that supplies badly formatted output? It's fine if you're using a weakly typed, mushy language like Javascript that doesn't care until it has to, but for those of us who like something a little more structured, and a little more performant, it presents a challenge. I'm gonna look at a way to deal with annoying JSON like this in Scala.  The most recent case I'm running into is a field that may come back as a string, or may come back as a list.

Jackson provides us with a way to cope with JSON like this. The solution doesn't seem to work well with case classes, and seems to require a good deal more annotations than it should, but it does get the job done in a none too egregious way.
Here's a sample of bad JSON:

[{
  "startDate" : "2010-01-01",
  "city" : "Las Vegas",
  "channel": "Alpha"
},{
  "startDate" : "2010-02-01",
  "city": "Tucson",
  "channel": ["Alpha","Beta"]
}]

You can see that in the first element, the field 'channel' is supplied as a string, and in the second, it's now a list.  If you set the type of your field to List[String] in Scala, it will throw an error when deserializing a plain String rather than just converting it to a single element list.  I understand why it's a good idea for deserialization to do this, but really, if you're using JSON, then schema compliance probably isn't at the top of the list of requirements.

You can deal with this using the JsonAnySetter annotation.  Unfortunately, once you use this, it seems all hell breaks loose and you must then use JsonProperty on everything and its brother.  The method you annotate with JsonAnySetter accepts two arguments that function as a key-value pair.  The key and value will be typed appropriately, so the key is always a String, and the value will be whatever type deserialization found most appropriate.  In this case, it will be a String or a java.util.ArrayList.  We can disambiguate these types with a case match construct, which seems perfect for this:
@BeanInfo
class Data(@BeanProperty @JsonProperty("startDate") var startDate: String,
  @BeanProperty @JsonProperty("city") var city: String,
  @BeanProperty @JsonIgnore var channel: List[String]) {

  // No-argument constructor more or less needed for Jackson
  def this() = this("", "", Nil)

  @JsonAnySetter
  def setter(key: String, value: Any) = {
    key match {
      case "channel" => {
        value match {
          case s: String => { channel = List(s) }
          case l: java.util.ArrayList[_] => {
            channel = Range(0, l.size()).map(i => l.get(i).toString).toList
          }
          case _ => { // No-op if you're ignoring it, or exception if not
          }
        }
      }
      case _ => // any other unknown key: ignore
    }
  }
}

Now when the bad JSON gets passed into deserialization it will get mapped more smartly than it was generated, and we win!

I might have a poke at it to see if I can get it working with less additional annotation crazy too.
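The disambiguation itself can be exercised in isolation, without Jackson in the picture at all (toChannelList is a hypothetical helper containing the same match as the setter above):

```scala
// A String becomes a single-element list; an ArrayList is copied element by element.
def toChannelList(value: Any): List[String] = value match {
  case s: String => List(s)
  case l: java.util.ArrayList[_] => Range(0, l.size()).map(i => l.get(i).toString).toList
  case _ => Nil // or throw, if you'd rather fail loudly
}

val channels = new java.util.ArrayList[String]()
channels.add("Alpha")
channels.add("Beta")

toChannelList("Alpha")  // List("Alpha")
toChannelList(channels) // List("Alpha", "Beta")
```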

Tuesday, August 21, 2012

Testing with FluentLenium and Scala

I posted some time ago about Browser based testing in Scala using Cucumber, leveraging JUnit and Selenium.  That mechanism is pretty complicated, and there seems to be a much better way of doing it.  The FluentLenium library gives us a good way to integrate browser based testing into Scala.  There are still some challenges with continuous integration that have to be solved, and I'll talk about that later.

What does a FluentLenium test case look like with this system?  Here's a simple example that opens a home page and clicks on a link:

class HomePageSpec extends FunSuite with ShouldMatchers {

  test("Visit the links page") {
    withBrowser {
      startAt(baseUrl) then click("a#linksPage") then
        assertTitle("A List of Awesome Links") then testFooter
    }
  }
}

And we can fill in a form and submit it like this:

class RegistrationSpec extends FunSuite with ShouldMatchers {
  val testUser = "ciTestUser"

  test("Creating a fake user account") {
    withBrowser {
      startAt(baseUrl) then click("a#registerMe") then
        formText("#firstName", "test") then
        formText("#lastName", "user") then
        formText("#username", testUser) then
        formText("#email", "test.user@example.com") then
        formText("#password", "123") then
        formText("#verify", "123") then
        hangAround(500) then click("#registerSubmit")
    }
  }
}

Much of what you see above isn't out of the box functionality with FluentLenium.  Scala gives us the power to create simple DSLs to provide very powerful functionality that is easy to read and easy to write. People often don't like writing tests, and Scala is a language that is still somewhat obscure.  A DSL like this makes it trivial for any developer, even one who is totally unfamiliar with Scala, to construct effective browser-based tests.

Now I'm going to delve into some of the specifics of how this is constructed! (example code can be found at: git://gitorious.org/technology-madness-examples/technology-madness-examples.git)

The first piece is the basic configuration for such a project.  I'm using the play project template to start with, as it offers some basic helper functionality that's pretty handy.  The first thing to do is create a bare play project:

play create fluentlenium-example

I personally prefer ScalaTest to the built-in test mechanism in play, and the FluentLenium dependencies are needed, so the project's Build.scala gets updated with the following changes:

val appDependencies = Seq(
    "org.scalatest" %% "scalatest" % "1.6.1" % "test",
    "org.fluentlenium" % "fluentlenium-core" % "0.6.0",
    "org.fluentlenium" % "fluentlenium-festassert" % "0.6.0"
)

val main = PlayProject(appName, appVersion, appDependencies, mainLang = JAVA).settings(
// Add your own project settings here
  testOptions in Test := Nil
)

Now for the main test constructs.  A wrapper object is constructed to allow us to chain function calls, and that object is instantiated with the function startAt():
case class BrowserTestWrapper(fl: List[TestBrowser => Unit]) extends Function1[TestBrowser, Unit] {
  def apply(browser: TestBrowser) {
    fl.foreach(x => x(browser))
  }

  def then(f: TestBrowser => Unit): BrowserTestWrapper = {
    BrowserTestWrapper(fl :+ f)
  }
}
This object is the container, if you will, for a list of test predicates that will execute once the test has been constructed.  It is essentially a wrapped list of functions, which we can see from the type List[TestBrowser => Unit].  Each test function doesn't have a return value because it's using the test system's built-in assertion mechanism and therefore doesn't return anything useful.  When this object is executed as a function, it simply runs through its contained list and executes the tests against the browser object that is passed in.

The special sauce here is the then() method.  This method takes in a new function, and builds a new BrowserTestWrapper instance with the current list plus the new function.  Each piece of the test chain simply creates a new wrapper object!
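The chaining machinery is easy to see in isolation.  Here's the same idea with a StringBuilder standing in for TestBrowser, and andNext standing in for then (which collides with a reserved word in newer Scala versions), so this sketch runs with no Play or FluentLenium dependencies:

```scala
// A wrapped list of functions; "chaining" just builds a longer list.
case class StepChain(steps: List[StringBuilder => Unit]) extends (StringBuilder => Unit) {
  def apply(sb: StringBuilder): Unit = steps.foreach(_(sb))
  def andNext(f: StringBuilder => Unit): StepChain = StepChain(steps :+ f)
}

val chain = StepChain(List(sb => sb.append("start;")))
  .andNext(_.append("click;"))
  .andNext(_.append("assert"))

val log = new StringBuilder
chain(log)   // every queued step now runs against log
log.toString // "start;click;assert"
```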

Now we add a few helper functions in the companion object:

object BrowserTestWrapper {
  def startAt(url: String): BrowserTestWrapper = {
    BrowserTestWrapper(List({browser => browser.goTo(url)}, hangAround(5000)))
  }

  def hangAround(t: Long)(browser: TestBrowser = null) {
    println("hanging around")
    Thread.sleep(t)
  }


  def click(selector:String, index: Int = 0)(browser:TestBrowser) {
    waitForSelector(selector, browser)
    browser.$(selector).get(index).click()
  }


  def formText(selector: String, text: String)(browser: TestBrowser) {
    waitForSelector(selector, browser)
    browser.$(selector).text(text)
  }

  def waitForSelector(selector: String, browser: TestBrowser) {
    waitFor(3000, NonZeroPredicate(selector))(browser)
  }


  def waitFor(timeout: Long, predicate: WaitPredicate): TestBrowser => Unit = { implicit browser =>
    val startTime = new Date().getTime

    while(!predicate(browser) && new Date().getTime < (startTime + timeout)) {
      hangAround(100)()
    }
  }
}

sealed trait WaitPredicate extends Function[TestBrowser, Boolean] {
}

case class NonZeroPredicate(selector: String) extends WaitPredicate {
  override def apply(browser: TestBrowser) = browser.$(selector).size() !=0
}

This gives us the basic pieces for the test chain itself.  Now we need to define the withBrowser function so that the test chain gets executed:
object WebDriverFactory {
  def withBrowser(t: BrowserTestWrapper) {
    val browser = TestBrowser(getDriver)
    try {
      t(browser)
    }
    catch {
      case e: Exception => {
        browser.takeScreenShot(System.getProperty("user.home")+"/fail-shot-"+("%tF".format(new Date())+".png"))
        throw e
      }
    }
    browser.quit()
  }

  def getDriver = {
      (getDriverFromSimpleName orElse defaultDriver orElse failDriver)(System.getProperty("driverName"))
  }

  def baseUrl = {
    Option[String](System.getProperty("baseUrl")).getOrElse("http://www.mysite.com").reverse.dropWhile(_=='/').reverse + "/"
  }

  val defaultDriver: PartialFunction[String, WebDriver] = {
    case null => internetExplorerDriver
  }

  val failDriver: PartialFunction[String, WebDriver] = { case x => throw new RuntimeException("Unknown browser driver specified: " + x) }

  val getDriverFromSimpleName: PartialFunction[String, WebDriver] = {
    case "Firefox" => firefoxDriver
    case "InternetExplorer" => internetExplorerDriver
  }

  def firefoxDriver = new FirefoxDriver()

  def internetExplorerDriver = new InternetExplorerDriver()
}

This gives us just about all the constructs we need to run a browser driven test.  I'll leave the implementation of assertTitle() and some of the other test functions up to the reader.

Once we have this structure, we can run browser tests from our local system, but it doesn't dovetail easily with a Continuous Integration server.  As I write this, my CI of choice doesn't have an SBT plugin, so, I have to go a different route.  Pick your poison as you may, mine is Maven, so I create a Maven pom file for the CI to execute that looks something like this:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>fluentlenium-tests</artifactId>
  <version>1.0.0</version>
  <inceptionYear>2012</inceptionYear>
  <packaging>war</packaging>
  <properties>
    <scala.version>2.9.1</scala.version>
  </properties>

  <repositories>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
    <repository>
      <id>typesafe</id>
      <name>typesafe-releases</name>
      <url>http://repo.typesafe.com/typesafe/repo</url>
    </repository>
    <repository>
      <id>codahale</id>
      <name>Codahale Repository</name>
      <url>http://repo.codahale.com</url>
    </repository>
  </repositories>

  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>

  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.4</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.scalatest</groupId>
      <artifactId>scalatest_${scala.version}</artifactId>
      <version>1.8</version>
    </dependency>

    <dependency>
      <groupId>org.fluentlenium</groupId>
      <artifactId>fluentlenium-core</artifactId>
      <version>0.7.2</version>
    </dependency>
    <dependency>
      <groupId>org.fluentlenium</groupId>
      <artifactId>fluentlenium-festassert</artifactId>
      <version>0.7.2</version>
    </dependency>
    <dependency>
      <groupId>play</groupId>
      <artifactId>play_${scala.version}</artifactId>
      <version>2.0.3</version>
    </dependency>
    <dependency>
      <groupId>play</groupId>
      <artifactId>play-test_${scala.version}</artifactId>
      <version>2.0.3</version>
    </dependency>
    <dependency>
      <groupId>org.scala-tools.testing</groupId>
      <artifactId>specs_${scala.version}</artifactId>
      <version>1.6.9</version>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>app</sourceDirectory>
    <testSourceDirectory>test</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
          <args>
            <arg>-target:jvm-1.5</arg>
          </args>
        </configuration>
      </plugin>

      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <configuration>
          <argLine>-DdriverName=Firefox</argLine>
          <includes>
            <include>**/*Spec.class</include>
          </includes>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <reporting>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
        </configuration>
      </plugin>
    </plugins>
  </reporting>
</project>

You might notice that the above Maven configuration uses JUnit to execute our Spec tests.  This doesn't happen by default, as JUnit doesn't pick up those classes, so we have to add an annotation at the head of the class to tell JUnit to pick up the test:

@RunWith(classOf[JUnitRunner])
class HomePageSpec extends FunSuite with ShouldMatchers { ... }

Monday, July 9, 2012

Data processing, procedural, functional, parallelism and being Captain Obvious

I'm not going to tell you anything you don't already know in this post.  I might however manage to aggregate some things you already knew into one place and make them dance to a slightly new tune.

At the heart of this post is somewhat of an epiphany I had today.  It has to do with how code is written to do data processing.  This is a very common task for programmers, perhaps one that is in fact ubiquitous.

Ultimately data processing almost always looks something like this:


You load some stuff, parse it, transform it, filter it and output it.  Those things may happen in different orders, but ultimately, something like that.

One of the things you already know is that the implementation of this should look like a production line.  Read a datum in, send it through the processing process, rinse repeat, batch as need be for efficiency.

The amazing thing is that when you look at implementations, they often end up looking like this:


Code is written that loads the entire set into memory as a list of objects, which then passes through some methods that change that list of objects, either by transforming the objects themselves, or worse, copying the list in its entirety to another list of different objects, filtering the list in situ, then saving the whole lot out.  These programs end up requiring an amount of RAM at least as large as the data itself.  Everybody knows this is a bad way to do things, so why do people keep writing code that looks like this?

We all know it should look more like



I think this is perhaps the tune to which many have not given much thought.  The problem just pops up, and people start scrambling to fix it, trying to dance triple time to the beat of the old drum.  I believe one significant cause may be time and effort.  Data processing code often starts life as a humble script before it morphs into something bigger.  Most scripts are written in procedural languages.  In these environments, parallelization and streaming are more complicated to write than loading in the whole file and passing it around as an object, so people default to the latter.  Why write a mass of code dealing with thread pools and batching when you don't have to?  (I know there are libraries and frameworks, but often people don't know them, or don't have enough time to learn them.)
This problem is easy to solve in a language where functions are first-order values.  For each flow step, you define a function to perform that operation, no different than in a procedural language.  But instead of the step taking a value as input and returning a new value, the functional variant returns the transformation itself: a function taking an object and returning one.  The flow can then be defined as the composition of a list of transform functions, which is itself just a function from object to object.  Now we can apply that flow to any object, single, multiple or otherwise, very easily, as the flow itself is now just a value.
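The flow-as-a-value idea in miniature (the step names here are made up):

```scala
// Each step is a plain function; the pipeline is just their composition.
val parse: String => Int = _.trim.toInt
val transform: Int => Int = _ * 2
val keep: Int => Boolean = _ > 4
val flow: String => Option[Int] =
  parse andThen transform andThen (n => Some(n).filter(keep))

// The same flow value applies equally to one item or a whole stream:
flow("3")                              // Some(6)
List(" 1", "3 ", "5").flatMap(flow(_)) // List(6, 10)
```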
In Scala, you have streaming sequences, so it becomes as easy as:

io.Source.fromURL("http://www.example.com/foo.data").getLines().flatMap(myFlow(_)).foreach(output(_))


In Scala, there are some helpers that can then apportion and parallelize this quite easily, which I talked about in my previous post.  As we now have a process as our primary value, instead of a chunk of data as our primary value, parallelization becomes much easier, passing our processing function around between threads is far easier than coping with a big chunk of mutable data being shared about.

You can implement this pattern in Java, or C++ or Perl, but most people have to stop and think to do so; the languages don't give it to you for free.  In functional programming, from what I'm learning, this is a very common pattern.  In fairness, it's a common pattern in Java too, but many folks don't ever think of it as a default choice until it's already too late.

Monday, July 2, 2012

Logging and Debugging

I'm finding one of the biggest challenges working with Scala is debugging, and secondarily logging.  The former seems to be a tooling issue as much as anything, and to be honest, the latter is a matter of my putting time in to figuring it out.

With debugging, break points in the middle of highly condensed list comprehensions are very very hard to make.  I end up mangling the code with assignments and code blocks that I then have to re-condense later.

I've attached a debugger using the usual jdwp method, but it slows everything down so badly, and it's just not that much better than print statements.  I've been going through the Koans with a new employee at work, and it's been helping both of us greatly.  There's one koan that describes a way to sort of "monkey patch" objects, and as much as I dislike that approach in general, it sure as heck beats Aspects which are hard to control and often fickle unless they are part of your daily routine.

I came up with a small monkey patch for the List class that lets me use inline log function calls to get basic information about the state of a list in the middle of a comprehension chain, so I include it here in the hopes that somebody will find it useful, or have some better ideas!

class ListLoggingWrapper[+T](val original: List[T]) {
  def log(msg: String): List[T] = {
    println(msg + " " + original.size)
    original
  }
  def logSelf(msg: String, truncateTo: Int = 4096): List[T] = {
    println(msg + " " + original.toString().take(truncateTo))
    original
  }
}

implicit def monkeyPatchIt[T](value: List[T]) = new ListLoggingWrapper[T](value)

This helpful snippet allows you to call a new method 'log' on a List object that prints out the List size, and similarly 'logSelf', which prints out the result of toString, truncated (working with large lists means you always end up with pages of hard-to-pick-through output if you don't truncate, I've found).
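A self-contained usage sketch, written with the Scala 2.10+ implicit class form of the same wrapper:

```scala
// Same trick as the implicit def above, just in one declaration.
implicit class ListLogging[T](val original: List[T]) {
  def log(msg: String): List[T] = {
    println(msg + " " + original.size)
    original
  }
}

val evens = List(1, 2, 3, 4).log("before filter").filter(_ % 2 == 0).log("after filter")
// prints "before filter 4" then "after filter 2"; evens == List(2, 4)
```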

A list comprehension chain ends up looking something like this:

Util.getJsonFilePaths(args(0)).map { x: String =>
  new File(x).listFiles().toList.log("File List Size").filter(file => {
    Character.isDigit(file.getName.charAt(0))
  }).map(_.getPath).filter(_.endsWith(".json")).log("Json File Count").flatMap { jsonFile =>
    io.Source.fromFile(jsonFile).getLines().toList.log("Line Count for " + jsonFile)
      .map(line => Json.parse[List[Map[String, Object]]](line))
      .flatMap(x => x).log("Elements in file").logSelf("Elements are", 255)
      .filter(jsonEntry => {
        jsonEntry.get("properties").get.asInstanceOf[java.util.Map[String, Object]].asScala.get("filterPropertyHere") match {
          case None => false
          case Some(value) if (value.toString == "0") => false
          case Some(value) if (value.toString == "1") => true
          case _ => false
        }
      })
  }
}

Which is a piece of code to aggregate data across multiple JSON files filtering by a given property using Jerkson (which I still feel like I'm missing something with as it seems harder than it should be).

Thursday, June 21, 2012

Parallel Processing of File Data, Iterator groups and Sequences FTW!

I have occasion to need to process very large files here and there.  It seems that Scala is very good at this in general.  There is a nice feature in the BufferedSource class that allows you to break up file parsing or processing into chunks so that parallelization can be achieved.

If you've tried the obvious solution, simply adding .par, you'll find the method isn't present.  So, you might convert to a List with toList.  When you convert like this, Scala will compile all the lines into an in-memory List before passing it on.  If you have a large file, you'll quickly run out of memory and your process will crash with an OutOfMemoryError.

BufferedSource offers us another way to do this with the grouped() method call.  You can pass a group size into the method call to break your stream into a sequence of lists.  So, instead of a single String sequence made up of millions of entries, one for each line, you get an Iterator over Seqs of 10,000 lines each.  The lines come back as an Iterator, and any kind of Iterator can be grouped this way, Sequences and Lists included.  Now you have a sequence type with a finite element count, so you can parallelize the processing of each chunk to increase throughput, and flatMap the results back together at the end.

The code looks something like this:

io.Source.stdin.getLines().grouped(10000).flatMap { y =>
  y.par.map { x: String =>
    LogParser.parseItem(x)
  }
}.flatMap(x => x).foreach { x: LogRecord =>
  println(x.toString)
}

So with this, we can read lines from stdin as a buffered source, and also parallelize without the need to hold the entire dataset in memory!
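The grouped() behavior is easy to see in isolation: here 25 values become three chunks, each of which could be handed off for parallel processing without ever materializing the full source:

```scala
// Sum each chunk of 10 (the last chunk holds the 5 leftovers).
val chunkSums = Iterator.from(1).take(25).grouped(10).map(_.sum).toList
// List(55, 155, 115)
```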

At the moment, there is no easy way to force Scala to increase the parallelization level beyond your CPU core count that I could get to work.  This kind of I/O splitting wasn't what the parallelization operations had in mind as far as I know, it's more a job for Akka or similar.  Fortunately, in Scala 2.10, we'll get Promises and Futures which will make this kind of thing much more powerful and give us more easy knobs and dials to turn on the concurrency configuration.  Hopefully I'll post on that when it happens!

Tuesday, June 12, 2012

Parsing CSVs in Scala

I did a quick google on parsing CSVs in Scala, and one of the top hits was a Stack Overflow question where the answer was wrong.  Very wrong.  So, I threw together a quick parser in Scala to get the job done.  I'm not saying it's good, but it passes the spec tests I have, including quotes and quoted commas, with both single and double quotes.  I hope this is useful, and perhaps somebody can improve upon it.

object CSVParser extends RegexParsers {
  def apply(f: java.io.File): Iterator[List[String]] = io.Source.fromFile(f).getLines().map(apply(_))
  def apply(s: String): List[String] = parseAll(fromCsv, s) match {
    case Success(result, _) => result
    case failure: NoSuccess => throw new Exception("Parse Failed")
  }

  def fromCsv: Parser[List[String]] = rep1(mainToken)
  def mainToken = (doubleQuotedTerm | singleQuotedTerm | unquotedTerm) <~ ",?".r
  def doubleQuotedTerm: Parser[String] = "\"" ~> "[^\"]+".r <~ "\""
  def singleQuotedTerm = "'" ~> "[^']+".r <~ "'"
  def unquotedTerm = "[^,]+".r

  override def skipWhitespace = false
}

Wednesday, June 6, 2012

Data Migration - Scala and Play along the Way

I've been nibbling at a data migration system for many years.  It's gone through various transformations, and its latest iteration is mostly working.  The original purpose of the program I forget, but its main use for a while has been to extract the EVE Online database data from the SQL Server database dump that CCP kindly provides.  Each EVE revision, I take the CCP dump, spin up a Windows server in the cloud, import the database and extract what I need to port it into PostgreSQL, which is my system of choice.


Over the years, JDBC has improved, and technologies have moved along.  In the beginning I wrote Hermes-DB, a simple ORM that was very much not type safe, but handled much of the auto-probing of table information that comes along with a more dynamic style of ORM.  One can argue that this isn't really ORM at all, and at this point, I'm inclined to agree.


Having said that, the auto-probing capabilities turned out to be very very useful in extracting data.  Because the system was predicated on the idea that learning about the database should be the job of the framework, not the developer, it had a reasonably well formed concept of representing tables and columns as objects.  With a bit of tweaking, adding a new metadata class along the way, the package can represent a table definition fairly well now.


What this allows me to do today, is create both a solid database dump, and the DDL to build the table structure.  Theoretically this system could be modified to pull from any datastore and generate for any other datastore.  The system was built in a way that was hopefully designed to facilitate that.


2012 rolls around, and things have changed.  The landscape for web development has been shifting over the last decade as people struggle to find ways to get the tools out of the developer's way, and enable them to do their job more and fight with code less.  The most recent evolution in that sequence that I've been working with is Scala and Play.  As I work with these two tools, I'm increasingly finding it easier to build systems that are stable, and take much less code to write.


Hermes-DB was originally designed just to output DDL, but when I started working with JPA, a system that requires a whole lot of scaffolding, it made sense to have one of the output "DDLs" be Java classes with JPA annotations.  Over the last few days, I've been making a new variety of output, Scala case classes designed to work with Play and therefore Anorm.  Anorm is very powerful, and gives you tools that "get out of your way", but doesn't have a lot when it comes to scaffolding.  I've poked around a bit, and it seems there was a scaffolding plugin for Play 1, but none exists for Play 2.  This little utility, is helping fill that gap for me.  It outputs Scala class and companion object definitions based on the database schema.


The EVE Online database comes out of the box with about 75 tables.  75 tables that I'd rather not have to manually create model class mappings for.  This little utility made my life much easier.  A big cheer for code generation tools!


It is open source of course, and can be found on gitorious with the git URL: git@gitorious.org:export4pg/export4pg.git


Please note that some of this code is very very old; it's worked for probably close to a decade, so some of it is a bit ancient in both understanding and coding style.  It is, however, very useful, and possibly one of the pieces of code I've written that's still in use and not broken from constant tinkering!

Tuesday, May 15, 2012

On wired and wireless networking

I saw the following article on G+ today:

http://lifehacker.com/5910335/what-awesome-things-still-require-a-wire--does-plugging-in-even-matter-anymore?utm_campaign=socialflow_lifehacker_facebook&utm_source=lifehacker_facebook&utm_medium=socialflow

And thought I'd comment on it.  I used to be a big proponent of wired systems, sufficiently that I put effort into wiring my home with Cat 5e.  That was back in the days of 802.11g, and honestly, back then 802.11g didn't come close to its potential most of the time.

Today we live in an age of the wireless.  I use laptops that are truly portable, an iPad, an iPhone and an iPod touch.  I agree that there are some places where wired makes sense, but I think this article makes both valid points and invalid points.  I'm gonna break it down here a bit, and take them on one by one.

Backup Faster over the Network

This is mostly a valid point.  If you need to backup over a network, you're better off plugged in.  This does however assume your NAS supports gigabit ethernet, that the NAS's operating system doesn't suck, and that the drive inside can do better than 10-15MB/sec.  I've seen many cases where none of the above are true, and it's one reason I switched to Apple.
Mostly, I don't use NAS.  It's generally quirky, unreliable, expensive and slow, regardless of your network connection.  I spent a great deal of time going through NAS devices until I finally just gave up and used a direct-attached device.  He also talks about remembering to keep your device turned on being a problem.  If you use a wired connection, the same issue holds, so it's not really a good argument for wired.
On balance, I think this is a poor argument, though it has some validity.

Keep up with your ultra-fast network

This is a really elitist kind of point.  The number of folks who come close to having 100Mbit internet is minuscule.  I'm a programmer, and I don't have 100Mbit.  Even with 100Mbit, the number of times I'd get 100Mbit from the other end is about zero.  Even at 25Mbit, I often don't see that saturated from download sites.  This is a poor argument in my opinion.

USB 3.0 (and 2.0 Too)

Comparing wireless networking with direct-attached peripherals seems a bit silly, and it goes on and on in this article.  This is both a valid point and an invalid point.  If the device on the other end can truly saturate 802.11n, then this is true.  Many devices just can't.  Backups are the prime candidate here, and, well, I think backups are a good use of direct-attached storage.

Remote Control Your Camera

Very very esoteric usage here.  Firstly, it assumes you have a DSLR.  Secondly, it assumes you need to control it wirelessly and view the images on a laptop.  Most folks aren't doing indoor or studio shooting, even if they own a DSLR, and if they are, why not just use USB from your computer?  That's wired of course, but the need for wireless here at all seems a stretch.
This is a really crap argument.

Record High Quality Audio

A little bit of bandwidth calculation is required here.  CD audio is 44.1kHz at 16-bit: 44,100 samples of 16 bits per second.  Simple multiplication shows that comes in at about 0.7Mbit per channel, or 1.4Mbit for stereo.  Let's take this up a notch and go to studio-level 24-bit at 192kHz.  If you have software and devices that can do this, it still only clocks in at 4.6Mbit per channel.  I've used 24-channel recording desks that use FireWire; they were FireWire 800, which is 800Mbit, and even 24 channels at studio quality is only around 110Mbit, which is within the theoretical capability of 802.11n.
This is an invalid argument, other than the fact that audio devices don't come with wireless support.  But let's face it, most computers don't come with FireWire 800 either.
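The arithmetic is easy to sanity-check in a few lines of Scala:

```scala
// Uncompressed PCM bit rate in Mbit/s = sample rate x bit depth x channels
def bitRateMbit(sampleRate: Int, bitDepth: Int, channels: Int): Double =
  sampleRate.toDouble * bitDepth * channels / 1000000

bitRateMbit(44100, 16, 1)    // about 0.7 Mbit/s per CD-quality channel
bitRateMbit(192000, 24, 1)   // about 4.6 Mbit/s per studio-quality channel
bitRateMbit(192000, 24, 24)  // about 110 Mbit/s for a 24-channel desk
```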

Anything That Can Be Done with a Thumb Drive

I'm not really sure what the argument here is.  If it's speed, then it's a really bad argument.  Most thumb drives are really slow.  I had to go out of my way to buy one that was even a half-sensible speed.  This is also the reason I feel that Micro-SD slots in your Android device are pretty silly.  Most people don't know they have to buy a high-speed SD card or USB key for it to be much use.  With wireless, it's not that hard to transfer files over the network to folks.  It takes a bit of knowledge unless you have a Mac, but it's not that hard.  I haven't ever been given a USB key for a mix-tape (tapes, now we're talking modern tech) or mix-CD.

Charge your Other Gadgets

Powering USB devices.  An iPad, a pretty hungry device, charges at 12W I believe.  You could reasonably charge your iPad off your unwired laptop without too much pain, and give your device some more juice at the cost of some laptop time in a pinch.  Also, power transfer without wires is still pretty new technology, and I think comparing it with wireless networking is a bit disingenuous.
I think this argument is valid in as much as you can't charge a device wirelessly, but it's a silly argument given the original context of wireless being ethernet.

Audio and Video Cables

We've already covered audio.  Video has only recently become transmittable without a dedicated cable, not at HDMI level, but at computer monitor level.  Whilst this is true, it's also a bit silly; see below for why.

Put Your Tablet or Smartphone On Your TV

Two words: Apple TV (maybe that's three)
Nuff said, this is an invalid argument.  It also sort of invalidates the previous point.  You can't transmit full-quality video over wireless, but you can transmit compressed high-def, and I think that satisfies the requirement.  There have been a few articles comparing iTunes 1080p with Blu-ray, and iTunes hasn't come out too badly.


Get the Highest Quality Sound

Isn't this a repeat of "Record High Quality Audio"?  In short, no.  This is invalid.

Final Score

I think out of the arguments, three out of ten have some semblance of validity, and of those, I'm struggling with two.  There are things that need to be wired; your speakers to your stereo will still need to be wired.  There are wireless solutions, but they either suck or are very expensive.  Generally I think this article tries a bit too hard to demonstrate a need for wired in a world that is already mostly wireless.  Trying to convince people to back up over ethernet when they're already doing it wirelessly is gonna be a pretty hard sell.

Saturday, May 12, 2012

Scala is very nice - very very nice

Today I am gushing over Scala's par method and XML literals. I am fetching about 30,000 entries over REST calls. The server isn't super fast on this one, so each call takes a bit of time. Enter list.par stage left.

list.par creates a parallel collection which, given an operation, will perform it in parallel across multiple CPUs.  It spawns threads, performs the operation, then joins all the results together at the end.  Very handy.

This little three letter method is turning what would be a very very long arduous process into a much less long one. Much much less.

val myList = io.Source.fromFile("list.txt").getLines.par.map { x =>
  callService("FooService", "{id=\"" + x + "\"}")
}

It gets better. In Scala, XML can be declared as a literal. Not only that, but it runs inline like a normal literal, with a few special rules. This service is combining a bunch of json into an XML output.

val myOutput = io.Source.fromFile("list.txt").getLines.par.map { x =>
  callService("FooService", "{id=\"" + x + "\"}")
}.map { x =>
  Json.parse[Map[String, Object]](x)("url").toString
}.map { x =>
  <entry>
    <url>{ x }</url>
  </entry>
}.mkString


Which I can now happily write to wherever I need to, a file, or a web service response. Nifty in the extreme.

In 2012, we live in a world of JSON and XML. Perl had its day when text processing was king. Today, a language is needed that can cope with JSON, XML and parallelization and still yield sane-looking code. I'm not a big Ruby fan, as anyone who knows me will tell you, but I'm willing to keep an open mind. I'd like to see if Ruby can do this kind of thing as elegantly and easily, and demonstrate it's a language for the web in 2012.  Also, I should mention Akka as well, though I don't yet know enough about it, other than that it can allegedly take parallelization inter-computer with similar simplicity.

Wednesday, May 9, 2012

Simple Scala scripts : Scan a directory recursively

I'm using Scala increasingly as a scripting language at the moment. As my confidence with it increases, I'm finding it becomes more and more useful for those throw-away scripting situations.  Especially when they end up being not so throw-away after all.

def findFiles(path: File, fileFilter: PartialFunction[File, Boolean] = {case _ => false}): List[File] = {
  (path :: path.listFiles.toList.filter {
    _.isDirectory
  }.flatMap {
    findFiles(_)
  }).filter(fileFilter.isDefinedAt(_))
}

(replace {} with (), ditch newlines and it goes on one line well-enough, just doesn't fit in a Blogger template that way)
We might be duplicating a shell find:

find | grep 'foo'
or
find ./ -name "foo"

And whilst the Scala is more complex, the Scala function can do operations on a File object, which gives you a lot of the rest of the power of the find command thrown into the bargain. Plus, as it accepts a partial function, you can chain together filters. If you truly just wanted an analog for find:

def findFiles(path: File): List[File]  = 
  path :: path.listFiles.filter {
    _.isDirectory
  }.toList.flatMap {
    findFiles(_)
  }

Which is less complex than the first. This is still more work than find, but the list you get back is an actual list. If you added anything useful to your find, say an md5 for each file, it gets less happy:
find ./ | awk '{print "\""$0"\""}' | xargs -n1 md5sum
Maybe there's a better way, but that's what I've always ended up doing. The Scala is starting to compete now. Bump up the complexity one more notch, and I think Scala actually starts becoming less code and less obscure.
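As a sketch of the filter-chaining mentioned above (using the findFiles from earlier; the two matching predicates are just made-up examples), partial functions combine with orElse, so a file passes if either filter applies:

```scala
import java.io.File

// the findFiles from above
def findFiles(path: File, fileFilter: PartialFunction[File, Boolean] = { case _ => false }): List[File] =
  (path :: path.listFiles.toList.filter(_.isDirectory).flatMap(findFiles(_)))
    .filter(fileFilter.isDefinedAt(_))

// two independent, illustrative predicates
val scalaFiles: PartialFunction[File, Boolean] = {
  case f if f.getName.endsWith(".scala") => true
}
val bigFiles: PartialFunction[File, Boolean] = {
  case f if f.length > 1024 * 1024 => true
}

// orElse chains them: defined wherever either one is defined
val combined = scalaFiles orElse bigFiles
// e.g. findFiles(new File("."), combined)
```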

You might also notice that the example above fits nicely within the Map/Reduce paradigm.  Scripting that is not only relatively easy, but can also be thrown at Hadoop for extra pizzazz and NoSQL buzz-worthiness.

On things that "save time"

Over the years, I've often heard things about things that "save time" in development.  For many years, I was gun-shy of IDEs.  Too often they break down, normally at the most inopportune moment, and the entire thing has to be reset and reconfigured from nothing.  It made the cost outweigh the benefits.  After a few more years passed, IDEs got better, and when somebody introduced me to IntelliJ, I was finally convinced that IDEs could actually save me time overall, not cost me.

So now we have IDEs that don't suck.  Take a simple thing like method lookup.  What's the time difference between hitting Command-B in IntelliJ and having to do Ctrl-H, change the tab, and type it in in Eclipse (I'm sure there's a better way in Eclipse, there always is, but it's normally non-obvious)?  It amounts to a few seconds at most.  So, there's not really a significant difference, right?  In time, this is perhaps true, and some might argue that a few seconds here and there can add up, and I might go into that later.  For me, the real issue is not time at all, it's space.  Space in your brain.

A brain is like a CPU in some ways, and like a CPU it has a cache (at least this seems like a good analogy to me), you have multiple levels, at least L1 and L2, maybe L3.  L1 caches are small and very fast.  They handle what's the immediate focus of attention in your brain right now.  Jumping through the code, tracing back a problem, going up the code path.  When needing to search, instead of jumping directly to the caller, you have to go through a set of operations.  This results in only a small time difference, but, it's like having to put three or four operations in your L1 cache instead of none.  Hitting Ctrl-B is a zero effort operation.  It's just like an op-code - Ctrl-B does this, that's what I need.  Opening the search dialog is a zero effort operation.  Remembering to switch the tab, not a zero effort operation, copy/pasting the right string in, not a zero effort operation, checking to make sure it's including the right files, similar, and if it's a big project, watching the search run, and then popping up an error dialog, not zero effort.

Another four things are now put into focused attention, significantly depleting what's there.  Two seconds of time has busted through maybe 20% or more of a brain's L1 cache (I think I read somewhere that the average human can concentrate on no more than four to six things at once).  That two seconds can turn into two hours as the most important thing, the one being held at the top of the stack in your brain's "L1", gets lost down into L2 or worse.  We fix the immediate problem, but forget why.  The local manifestation is gone perhaps, but the bigger issue is forgotten, and still very present.

Two seconds, concentration was diminished, which caused two hours of lost time.  This is one way that every little operation in a development environment can be critical.  Is this an exaggeration?  I'm not sure it is.  Even if it is for this one thing, imagine this problem multiplied by two, or four.  Not just one missing zero-cost operation, but two, or three.  Suddenly, with a more fluid environment, with just a few things made drastically better, development becomes less stilted and happens better.

Wednesday, April 25, 2012

Div and CSS formatting

This is a topic that has proven a pain in the ass for just about every web developer I've ever known.  How the hell do the various display options for divs truly work?

Googling this today, I found a page that has both good explanations, and also a demo at the bottom that you can mess around with to figure it out through experimentation.  Perfect:


Tuesday, April 24, 2012

On recursion - a bit more

I have to say that the more I use recursion to solve simple problems, the more it makes sense and the easier it becomes.  The simple things in life are often where I find joy, and today's joy comes from perhaps a very silly but necessary piece of code.  When you entity-encode UTF-8 in HTML entities, you end up with strings that have things like &#1234; in them.  This entity represents a single character for the purpose of considering string "length".  All the standard String functions return length without understanding that the content of the string is encoded.  Decoding it could lead to nasty unintended vulnerabilities, so I have created a function that copes with these entities.  In a previous life (or about a year ago), I would have solved this with what I would now consider a kind of ugly for loop.  Today, I prefer to solve it more functionally:


 String.prototype.lengthDecoded = function(){
  return (this.length==0 ? 0 : 
     1 + (this.substring(0,1)=="&" ? this.substring(this.indexOf(";")+1).lengthDecoded() : this.substring(1).lengthDecoded()));
 }

I'm not convinced I have the best name yet, but in terms of brevity and clarity I think this is a massive improvement over a for loop which will always be in danger of off-by-one errors and simple verbosity.

Here we have what amounts to a single line of code, formatted for clarity, that "eats" a string, calculating its length as it goes.  Simple recursive solutions, I'm finding, are easier and clearer than their imperative counterparts.  Other than the fact that I'm sitting here writing a blog post about it!
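For comparison, the same recursion translates to Scala almost mechanically.  This is just a sketch, carrying the same assumption as the JavaScript version that every "&" is part of a well-formed entity ending in ";":

```scala
// Counts one "character" per HTML entity; assumes well-formed entities
// (every '&' is eventually followed by a ';').
def lengthDecoded(s: String): Int =
  if (s.isEmpty) 0
  else if (s.startsWith("&")) 1 + lengthDecoded(s.substring(s.indexOf(";") + 1))
  else 1 + lengthDecoded(s.substring(1))

lengthDecoded("a&#1234;b")  // 3: 'a', one entity, 'b'
```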

I have considered the idea of creating a subclass of String to do this too, but I'm not convinced the extra complexity it would bring, plus the potential to simply not have the correct type in the right place, would be better than using a separate function.  I could easily make an argument for either.

Thursday, March 29, 2012

Play and Heroku

I've been messing around with Play, and decided that I'd push it up to Heroku based on the tutorial and things I've heard about Heroku.

I'm going to expand on this later but, if you forget the Procfile when deploying your Play application, it may cause your app to get totally jammed and never be able to start.  I spent the next hour or two trying to figure out why my app wouldn't start, even after I'd put the Procfile in.

I solved the problem by deleting my app on Heroku and creating a new one.  Then it started fine.

The docs on pushing a Play 2.0 app to Heroku all disagree with one another too, so I hope I can find time to post a tutorial based on how I got it working!

Saturday, March 24, 2012

First day with Play

I started looking at play 2.0 for the first time today.  I got a few hours with it at least, and I've been impressed with a few things.

The first and biggest thing is perhaps the most simple: compile on the fly.  Grails does this, but very badly, and if I'm observing right, I think I can see why.  It seems that Play compile-checks at request time, not at file save time.  As one who grew up computing, Ctrl-S or its equivalent has become a nervous tic.  Grails recompiling at save time almost always just ends up trashing the environment, as I save about ten times a minute, and I end up having to restart it, which is very slow.

With FSC, Play compiles changes very quickly and I barely notice a lag at all.  It doesn't get stuck anywhere I've noticed yet either like Grails can.

I feel like within a couple of hours, I got pretty far into having a functional, albeit basic web app going.  Working with Anorm is interesting too, I'm not sure if I like how they've done it yet, but after years living with JPA and Hibernate's arcane query structures and unexpected query stupidity (although there was always a good reason, it was still annoying), I find this way of integrating SQL and code better than most.  It has some similarity with O/R Broker which is what I've been using with Jersey so far, but Anorm is more sophisticated and I think easier to work with.

The error reporting in Play is also excellent.  You can see quickly and precisely what went wrong with your code, there's no guesswork and decrypting enigmatic error messages, it just tells you with your source code right there: this is where it's broken!

Friday, March 23, 2012

Scripting in Scala


Today was the first time I've felt comfortable using Scala to write a true script.  Something simple like taking an HTML file and extracting certain anchor tags, which can occur multiple times per line, is surprisingly annoying to do in many scripting languages like Python or Perl.  You often end up with a regex from hell, and I'm not really one for regexes from hell, while the imperative style ends up taking far more code than it really should.  You can do this in Perl in a sort of quick way, but it looks pretty damn ugly, and besides, it's 2012, surely we have something that can do this almost as well or better than Perl!

So, with no further ado, I give you my very simple script:


import io._
import java.io.File

println(Source.fromFile(new File(args(0))).getLines.filter(_.contains("http://www.foo.org/")).flatMap {x=>x.split("<")}.map {x=>
  (x.indexOf("href=")>0 match {
    case true => x.substring(x.indexOf("href=")).dropWhile("'\"".contains(_)).takeWhile(_!='>')
    case false => ""
})}.filter(x=>{x!="" && x.endsWith(".html\"")}).map {x=>x.dropWhile(_!='"').drop(1).takeWhile(_!='"')}.reduce(_+"\n"+_))

Is this the best way, the easiest or the most elegant, no. It's a script that I needed to write in ten minutes or less.  Almost a throw-away piece of code.

The big thing for me was that I've finally become familiar enough with Scala syntax that I could achieve this in less than ten minutes.  No wandering off to Stack Overflow to look something up, or struggling with one of the erasures for the list comprehensions; I could just sit here and type and make it work with minimal debugging fuss.

When I think about the pure horror of trying to do this in Java, I shudder.  If I removed most of the newlines and the imports, this could exist on just two lines (I'm not sure if I can terminate a case clause with a semi-colon or not).  It is probably possible to write this as an immediate script right on the command line, and still be able to read it (well, mostly).

Today is a good day.

Thursday, March 22, 2012

Update on anonymous functions

I am an idiot.  I left out perhaps one of the most fun things about anonymous functions: return data.

I'm still getting used to working in a functional way of thinking, but this is perhaps one of the most fun ways to use an anonymous function.  It might be a little bit evil, I'm not sure yet, but it's certainly interesting.  Take a simple string concatenation:
function getThingie(x) {
  var a = "I would like to buy";
  var b = "for a good price.";
  var c = "nothing";

  if (x == 1) {
    c = "a dog";
  }
  else if (x == 2) {
    c = "a cat";
  }
  else if (x == 3) {
    c = "a bath";
  }

  return a + " " + c + " " + b;
}

If we replace this with an anonymous function it can then look like this:
function getThingie(whatThing) {
  return "I would like to buy " + (function(x) {
    if (x == 1) {
      return "a dog";
    }
    else if (x == 2) {
      return "a cat";
    }
    else if (x == 3) {
      return "a bath";
    }
    else {
      return "nothing";
    }
  })(whatThing) + " for a good price.";
}

I'm not 100% sure which is 'better', but for my money, I find the flow of having the conditional inline as a function more clear. If this were a language like Scala, we could have a match clause here, or a partially applied chain. Ultimately, it puts shit where it goes, and that's my number one rule of programming.  In this case, it means you don't have an assignment somewhere that has a chain of logic that is basically out-of-line of the program flow.  When we read a sentence, we scan along the sentence and piece together information in our head in the order we read it, whether that's left to right or right to left, it's just easier to digest if we don't have to skip around to figure out what is going on in a story, or, in a piece of code.  So, in that way, I think this obeys "put shit where it goes" pretty well.
To give an example, here's one way it might look in Scala:

def getThingie(whatThing: Int) = {
  "I would like to buy " + (whatThing match {
    case 1 => "a dog"
    case 2 => "a cat"
    case 3 => "a bath"
    case _ => "nothing"
  }) + " for a good price."
}

Arguably this would be an interesting case to use a partial function:

def mapMeMyPurchaseItem: PartialFunction[Int, String] = {
  case 1 => "a dog"
  case 2 => "a cat"
  case 3 => "a bath"
}

def getThingie(whatThing: Int) = {
  "I would like to buy " +
    mapMeMyPurchaseItem.lift(whatThing).getOrElse("nothing") +
    " for a good price."
}

Why would you want to do this?  It could be argued that this obeys the "put shit where it goes" rule even better.  The function only returns a value if the argument is within its domain; otherwise you must provide an alternative.  It means the fall-through clause is explicit and local, so the fall-through case isn't hidden inside the function.  Partial functions are extremely powerful constructs, I'm beginning to learn, and I think I'm going to find them to be a good friend in building extremely robust programs, but that's a story for another time!
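To make the domain behaviour concrete, here is a quick sketch (with made-up values) of how a partial function answers isDefinedAt, and how lift turns it into a total function returning an Option:

```scala
// A partial function defined only for 1, 2 and 3
val purchaseItem: PartialFunction[Int, String] = {
  case 1 => "a dog"
  case 2 => "a cat"
  case 3 => "a bath"
}

purchaseItem.isDefinedAt(2)  // true: inside the domain
purchaseItem.isDefinedAt(9)  // false: outside the domain
purchaseItem.lift(1)         // Some("a dog")
purchaseItem.lift(9)         // None: lift makes it a total Int => Option[String]
```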

Tuesday, March 6, 2012

Programmers, guns and analogies

An analogy I've heard bandied around over the years, not just by the Ruby folks, is the programmers and guns analogy.  Give people freedom and they won't destroy the world, they'll be happy.  Give a bunch of programmers the proverbial AK47 and they'll not start shooting each other.

For all the ways this is a bad analogy, and a bad model, none is worse than the fact that its basis is completely flawed.

Bad code is not something you have to pull the trigger on, it's the default.  It's like bad art, or bad bricklaying.  When you don't know any different, you make bad art, lay crooked bricks and write crappy code.  People don't get born into a state of understanding and skill.  To extend this analogy thing, the gun is already firing, and you have to work to dodge it.  Giving programmers the proverbial gun isn't giving them a tool.  It's like a gun that once you've pulled the trigger, it won't stop firing, you just better hope you figure out how to aim it so it doesn't hit anyone else.  Friendly fire in a war zone is a leading cause of death.  Only extreme training will reduce it, and it's rarely eliminated.

Of course, it's coding, so there aren't lives at risk, right?  Well, unless you're writing the software that flies a 747 through the sky, or a system that measures medication for the pilot who's flying the 747, or perhaps the code that set his smartphone alarm off two hours early so now he's low on sleep when the emergency happens.  It might feel like an edge case, until you consider all the possible ways the human part of that system, or its dependent systems (the flight crew, the ground staff, the air traffic controllers) might have been affected by a simple code bug, which could end up being just as serious as a malfunction in the guidance system of a 747.

Hyperbole?  A little, but also, not entirely.

Saturday, March 3, 2012

Scala : On constructors and companion objects

With companion objects, you can utilize the apply method to generate constructors that look like case classes.  This might not seem like much, but it's a nicety that feels idiomatic to me at this point.  If you're new to Scala, this is something that I think is worth a moment of your time.  It is helpful not only in talking about constructors and companion objects, but most importantly the apply method.  The apply method isn't something I really had a feel for until quite recently, and it's such a powerful mechanism that I think it really should be Scala 101.  It's a difficult concept to work with when you're still very much in the Java mindset.  I think there are aspects of Scala like this that fit in the mind of a C++ programmer more easily than a Java programmer.  Let's have a look at some code.


import java.io.{File, FileWriter}
import io.Source

class Thingy(name: String, info: String, saveFile: File) {
  val fw = new FileWriter(saveFile)
  fw.write(info)
  fw.close()

  def this(name: String, info: String) =
    this(name, info, new File("data/" + name.filter(Character.isLetter(_))))
}

object Thingy {
  def apply(name: String, info: String) = new Thingy(name, info)
  def apply(name: String, info: String, saveFile: File) = new Thingy(name, info, saveFile)
  def apply(name: String) = {
    val file = new File("data/" + name.filter(Character.isLetter(_)))
    new Thingy(name, Source.fromFile(file).mkString, file)
  }
}


The Thingy class is used to represent a simple data object that correlates to a file.  This code isn't DRY and it could stand some improvements, but it's a simple example that works for the purposes of a demonstration.

There are two things here that both kind of look like Java classes.  One is labelled with the object keyword and the other with the class keyword.  In simple terms, the object is like a class definition that contains the methods that would have been declared static in a Java class.  It's a bit more than this, but I'm not gonna delve into that here.  The one that is labelled with the class keyword is more like a typical Java class.  It has instances and maintains state (sort of).  The definition that is labelled "object" is called a companion object in Scala.

Initially many Java programmers, myself included, look at this and ask why the heck would you want to do that?  Seems a bit clunky perhaps.  Then you come across the apply method.  This one use case for me, helped give the object type a specific useful meaning beyond just static methods.

The apply method is a special marker in Scala.  Let's look at a simple case, a List:

scala> val k = List("one","two","three")
k: List[java.lang.String] = List(one, two, three)

scala> k(2)
res1: java.lang.String = three

This uses the Scala REPL, which can be accessed just by typing "scala" (assuming you have the scala executable in your path).

The call k(2) is just like k[2] in Java.  In this case we can see that it's just like an array index operator.  The trick is that it's actually calling the apply() method on the class (not the object here, but the usage is similar).  The interesting thing is this means for a custom object, like our Thingy, we can define what behavior we want when we call an object in this fashion.  In our case, we have three apply methods: two that simply construct an instance, and one that loads an instance from a file, a bit like an index lookup.

Now when we call Thingy("file_I_want_to_read.txt") I get back a Thingy that has been fully initialized from the file, just as if I had built the File and read its contents by hand before calling
new Thingy(name, Source.fromFile(file).mkString, file).

In Java, we cannot execute any code in a constructor prior to calling the super() method.  This is pretty annoying at times, and we end up making work-arounds for this by using the factory pattern.  The factory pattern is just fine, but I feel like it's overkill for something this simple, and Scala has a handy way of doing this that is pretty similar to the factory pattern, but syntactically much nicer*.

The object gives us the power to do stuff before we perform the initialization of our object, and therefore makes this very easy.

You may have noticed that the "new" keyword went away in there.  Using the apply method on the companion object, we have created what more or less appears to be what a static constructor might look like in Java, or like I mentioned, a short-cut for a Factory.  It might be more convenience than anything, but it makes our code look better, removing the ugly details of constructor somersaults from the object itself and removing the littering of "new" calls throughout our code.  Because it's a separate method call, it is not limited to returning an object of the same type as its namesake.  This has interesting implications/applications for dependency injection and cache management.
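As a sketch of the cache-management idea (a hypothetical example, not part of the Thingy code above), an apply method can hand back an existing instance instead of constructing a new one:

```scala
// Hypothetical sketch: apply acting as a caching factory, so repeated
// calls with the same key return the same instance rather than a new one.
class Config(val name: String)

object Config {
  private val cache = scala.collection.mutable.Map.empty[String, Config]
  def apply(name: String): Config = cache.getOrElseUpdate(name, new Config(name))
}

val a = Config("db")
val b = Config("db")
a eq b  // true: both calls yield the same cached instance
```

A caller never sees "new" and never knows whether construction actually happened, which is exactly the kind of detail you want hidden behind a factory.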

* It is easy to dismiss much of the niceties of things like Scala and Groovy as "syntactic sugar", and whilst that's true in some cases (not so much in others), sugar can sure make my coffee taste a whole lot better.  Syntactic short-cuts like this can make your code both more readable and less bug prone, so give it a go before you dismiss features as "syntactic sugar".

Wednesday, February 29, 2012

Scala and Selenium

Update: I'm now using fluentlenium instead of this mechanism with the Play framework.  It's much better!

I've been working with Scala for awhile, and I've just dipped my toe into Selenium.  Selenium is a web testing tool that allows you to drive a browser like Firefox, to do more detailed testing of web pages.  Coupled with Cucumber, doing BDD based testing is pretty cool.

Setting this up in Java is a fair bit of work, and I managed to get it going with a little help from the internet.  Then I wondered what it would look like in Scala.  Below lies a configuration for setting up Selenium testing using Cucumber in Scala.

You can find all the pieces for this out there, but I hope having them all in one happy place will save a few folks some time! I could have used sbt, but chances are, if you're using sbt, you can easily figure out how to convert this pom. I use maven here because it's a common denominator (perhaps the lowest, but still), you can import it easily into IntelliJ or Eclipse or Netbeans.

As per usual, I'm no Scala expert, but this works in my project, though there may be more idiomatic ways of solving some of the things in here. Please, please comment if you have thoughts or ideas on this. The pom file:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.example</groupId>
    <artifactId>selenium-tests-scala</artifactId>
    <version>1.0.0</version>
    <inceptionYear>2012</inceptionYear>
    <packaging>jar</packaging>
    <properties>
        <scala.version>2.9.1</scala.version>
    </properties>

    <repositories>
        <repository>
            <id>scala-tools.org</id>
            <name>Scala-Tools Maven2 Repository</name>
            <url>http://scala-tools.org/repo-releases</url>
        </repository>
        <repository>
            <id>sonatype-snapshots</id>
            <url>https://oss.sonatype.org/content/repositories/snapshots</url>
        </repository>
    </repositories>

    <pluginRepositories>
        <pluginRepository>
            <id>scala-tools.org</id>
            <name>Scala-Tools Maven2 Repository</name>
            <url>http://scala-tools.org/repo-releases</url>
        </pluginRepository>
    </pluginRepositories>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-compiler</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.10</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.specs</groupId>
            <artifactId>specs</artifactId>
            <version>1.2.5</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.seleniumhq.selenium</groupId>
            <artifactId>selenium-java</artifactId>
            <version>2.17.0</version>
        </dependency>
        <dependency>
            <groupId>info.cukes</groupId>
            <artifactId>cucumber-junit</artifactId>
            <version>1.0.0.RC16</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>info.cukes</groupId>
            <artifactId>cucumber-scala</artifactId>
            <version>1.0.0.RC16</version>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
                <configuration>
                    <scalaVersion>${scala.version}</scalaVersion>
                    <args>
                        <arg>-target:jvm-1.5</arg>
                    </args>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-eclipse-plugin</artifactId>
                <configuration>
                    <downloadSources>true</downloadSources>
                    <buildcommands>
                        <buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
                    </buildcommands>
                    <additionalProjectnatures>
                        <projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
                    </additionalProjectnatures>
                    <classpathContainers>
                        <classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
                        <classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
                    </classpathContainers>
                </configuration>
            </plugin>

        </plugins>
    </build>
    <reporting>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <configuration>
                    <scalaVersion>${scala.version}</scalaVersion>
                </configuration>
            </plugin>
        </plugins>
    </reporting>
</project>


The features file is used to speak BDD language. This file is processed by Cucumber and matched against the step definitions file.
Feature: The home page should show useful content

Scenario: Load the home page

     Given I want to see my home page
     Then I should see valid content


This class functions essentially as a marker to JUnit to execute Cucumber to build the test instead of scanning this class for methods directly. You'd think there'd be an easier way to do this without having to have something that amounts to no more than a stub here.
package com.example.selenium.cucumber

import cucumber.junit.Cucumber
import cucumber.junit.Feature
import org.junit.runner.RunWith

/**
 * Test definition class that indicates this test is to be run using Cucumber and with the given feature definition.
 * The feature definition is normally put in src/main/resources
 */
@RunWith(classOf[Cucumber])
@Feature("Home.feature")
class HomePageTest {
}


The step definitions file is used by Cucumber to match the features text against. There are several basic predicates in Cucumber, like "When", "Then" and "And", that you can use via the DSL. Cucumber supports many languages in this regard, so you're not limited to English for this.
package com.example.selenium.cucumber

import org.openqa.selenium.WebDriver
import org.openqa.selenium.firefox.FirefoxDriver
import junit.framework.Assert.assertEquals
import junit.framework.Assert.assertTrue
import com.example.selenium.pageobject.HomePage
import com.example.selenium.WebDriverFactory
import cucumber.runtime.{EN, ScalaDsl}

/**
 * Step definitions for the Home Page behaviours
 */
class HomeStepDefinitions extends ScalaDsl with EN {
    Before() {
        driver = new FirefoxDriver()
    }

    After() {
        driver.close()
    }

    Given("^I want to see my home page$") {
        home = new HomePage(driver)
    }

    Then("^I should see valid content$") {
        val actualHeadLine: String = home.getTitle
        // JUnit's assertEquals takes (expected, actual) in that order
        assertEquals("My Awesome Home Page!", actualHeadLine)
        assertTrue("Page content should be longer than this", home.getContent.length > 4096)
    }

    private var driver: WebDriver = null
    private var home: HomePage = null
}

The HomePage object here is used to store information about the home page, and potentially actions that you can perform whilst on the HomePage, which I've omitted here for simplicity. In my working system, I make most of this live in a base class that all my page classes extend. Things like the ability to take screenshots are useful everywhere, as is reading the title and page content. Some generic tests that simply match the title and content for a whole bunch of pages can be very useful for bootstrapping the test process. They might not be good tests, but they're definitely better than nothing!
package com.example.selenium.pageobject

import java.io.File

import org.apache.commons.io.FileUtils // requires commons-io on the classpath
import org.openqa.selenium.{OutputType, TakesScreenshot, WebDriver}

/**
 * A page object used to represent the home page and associated functionality.
 */
class HomePage(val driver: WebDriver, val baseUrl: String = "") {

    def getDriver: WebDriver = driver

    protected def takeScreenShot(e: RuntimeException, fileName: String): Throwable = {
        FileUtils.copyFile(
            driver.asInstanceOf[TakesScreenshot].getScreenshotAs(OutputType.FILE),
            new File(fileName + ".png")
        )
        e
    }

    def getContent: String = getDriver.getPageSource

    // WebDriver exposes the page title directly
    def getTitle: String = getDriver.getTitle
}

I hope I haven't left anything critical out that would be required to get this system bootstrapped. Obviously this setup doesn't do very much, but it will at least get you up and running. You can probably figure out the rest from the Cucumber and Selenium documentation at this point. I might post a follow up with some more detail, but it took me a few weeks to get this posted!

Hosting

I've been looking at server hosting for a decade, and I'm a little perplexed at this point.  For many years with a managed hosting environment, we'd sign up for 12 months, then at the end of the year, we'd migrate to a new environment.  The new environment was typically the same cost as the old one, but somewhere between 1.5x and 2x times as powerful with about the same again in disk space.

Then the cloud came along.  I've been using Rackspace's cloud solution, Amazon EC2, Linode and a few others over the last few years, and I'm not seeing the capability you get for your money increasing at anywhere near the rate it used to.  Whilst CPUs have risen in core count, and memory capacity has gone up, my cloud server still costs about the same as it did three years ago and still has the same capability as it did three years ago.  Amazon et al have had some small price decrements, but nothing close to the doubling every year we used to see.

I think this is perhaps one very big drawback for cloud computing that many people didn't bet on.  My guess is that once a cloud infrastructure is created, the same systems just sit in their racks happily chugging away.  There is only minor incentive to upgrade these systems, as their usage is on-demand and they aren't being actively provisioned like traditional server hardware was.  You can no longer easily compare and contrast hosting options because it's complicated now.  The weird thing is that this situation seems to have infected regular hosting also!  I am in the process of trying to reallocate my hosting to reduce my costs, and it seems everywhere I turn it's the same story.  Looking at Serverbeach, their systems have barely shifted upwards in five years, other than the CPU model.  My $100/mo still buys a dual-core system with about the same RAM and disk as it did five years ago, albeit with a newer CPU model.

For those of us developing on heavier platforms, like Grails or JEE, the memory requirements of our applications are increasing, and the memory on traditional servers and our developer machines is increasing in step, but cloud computing resources are not.  I simply cannot run a swath of Grails applications on a small EC2 instance; the memory capacity just isn't big enough.  My desktop has 8GB of RAM today, and it won't be long before it has 16GB, yet my Amazon instance is stuck at 1.7GB.  Looking at shared hosting, the story is the same.  Tomcat hosting can be had quite cheaply, if you don't mind only having 32-64MB of heap.  You couldn't even start a Grails app in that space; it's recommended that PermGen be set to at least 192MB.

The story isn't universally the same I've noticed, some hosting providers have increased the disk capacity availability quite dramatically, but with regard to RAM, the story seems pretty consistent.  You just don't get more than you used to.

What does this mean for the little guy like me, trying to host applications in a cost-effective way?  At this point I really don't know.  I'm starting to consider investigating co-location; talk about dialing the clock back a decade.  I can throw together a headless machine for less than $400 with 10x the capacity of a small instance, which seems kinda sad.  Right now, I'm considering shifting away from rich server architectures and refocusing on a more Web 2.0 approach, which brings a small shudder, but still, I guess there's a reason.  A simple REST server doesn't need a big heavy framework, so I can build something that will fit in that measly 64MB heap.

Moore's law might apply to computing power, but apparently not to hosting companies.

Thursday, January 26, 2012

Anonymous function, part 2, win. Epic win.

I think I've figured out why anonymous functions are great, and here's why:
function doSomething() {
  var x
  if (someLogic) {
     x = somethingHere
  }

  var k = doOtherStuff()

  if (k) {
    x = k.something
  }

  if (x) {
    doSomethingHere(x)
  }
}

Looking at this, it becomes very easy to forget to set x, or to screw up dealing with x. This code is all backwards. The thing we care about is x, not all the crap around it.

function doSomething() {
  (function(x) {
    doSomethingHere(x)
  })(someLogic ? somethingHere :
       (function(k) { return k ? k.something : null })(doOtherStuff()))
}


Why is this good? It looks a bit like a cluster-fuck at first glance. I think there are two reasons why it's good. And I'm going to talk briefly about why I've discovered I LOVE the ternary operator.

With the ternary operator, we have no choice but to declare the result for both sides of the conditional. We can't just fall through if we don't care; we always have to care. This means we eliminate lazy code paths. We can't accidentally forget about a potentially important condition. And that bubbles up to the overall logic. Instead of some complex chain of if, then, and honestly maybe, we have a clear decision tree. With sequential if statements we essentially create a new clause which I'm going to call "maybe".

To rewrite the line
if (someLogic) {
  x = somethingHere
}

to
x maybe assigned somethingHere if someLogic happens to be true

I know that doesn't really scan in some ways, because ultimately, x will definitely be assigned if someLogic is true. But we're not saying what happens after that, and that's when it becomes a maybe. Because there is no code path on the negative side of k, by the time we reach our doSomethingHere call, x maybe somethingHere, or it maybe k.something, or maybe it's nothing at all!
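One way to see that "maybe" concretely: a tiny sketch with made-up names, contrasting the if-statement path, which can silently leave x undefined, with the ternary, which cannot be written without both arms.

```javascript
// With an if-statement, forgetting the else leaves x undefined: a "maybe".
function pickMaybe(someLogic) {
  var x;
  if (someLogic) {
    x = "somethingHere";
  }
  return x;                            // maybe "somethingHere", maybe undefined
}

// The conditional expression forces a value for both outcomes.
function pickAlways(someLogic) {
  var x = someLogic ? "somethingHere" : "fallback";
  return x;                            // both outcomes are spelled out
}
```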

If we make this function more and more complex, it becomes increasingly hard to track what order things need to be in, and we create a bug because we screwed up the if statement order. If the order is important, then it should have been nested properly, not created sequentially, and that's where our crazy functional syntax comes in. You can't write code sequentially in this style. As a result, you have to express the conditions clearly as a decision tree.
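That decision tree can also be written as one nested conditional expression. A sketch with made-up names, where every branch, including the "nothing at all" one, must produce a value:

```javascript
// The whole decision for x as one expression: no sequential ifs, no maybe.
function resolveX(someLogic, somethingHere, k) {
  return someLogic ? somethingHere
       : k         ? k.something
       : null;                         // the empty branch is now explicit
}
```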

Wednesday, January 25, 2012

Anonymous functions, epic win, or total fail?

As I've been working with Scala over the past weeks and months, I've started using the anonymous function construct a fair bit.  I like the way it wraps things up, and it seems like the best way to do certain things.

The same construct is available in some other languages, most notably JavaScript, so something like:

function doSomething() {
  var o = document.getElementById('thing');
  o.innerHTML = "Stuff";
  o.style.height="300px";
  o.style.backgroundColor="#ffffff";
}

could end up as:

function doSomething() {
  (function(o) {
    o.innerHTML = "Stuff";
    o.style.height="300px";
    o.style.backgroundColor="#ffffff";
  })(document.getElementById('thing'));
}
Why is this good?

I'm not entirely sure. I think that when you have multiple blocks and when you have multiple arguments it could help with scoping. I think it might make things clearer potentially, though I'm not so sure. It would mean the args to the function are at the bottom not the top. That could be seen as either a good thing or a bad thing.

I could see it being helpful syntactically when processing lists perhaps?

function doSomething() {
  (function(v,c) {
    for (var t = 0; t<v.length; t+=1) {
      v[t].style.backgroundColor=c;
    }
  })(document.getElementsByTagName('div'),'#ffffff');
}

Or is it really just functional programming wankery that doesn't have any place outside the confines of Scala and other similar things?

I've noticed there is a way to call an anonymous function recursively too, but I think that is definitely descending into the realm of the esoteric to the point of silliness.
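For what it's worth, the usual trick is a named function expression: the inner name is visible only inside the function body, so the function stays anonymous to the outside world. A small sketch:

```javascript
// Recursive "anonymous" function via a named function expression.
// "fact" exists only inside the function body; outside it, only the
// variable "factorial" refers to the function.
var factorial = function fact(n) {
  return n <= 1 ? 1 : n * fact(n - 1); // recurse via the inner name
};
// factorial(5) is 120
```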