Sunday, August 26, 2012

Dealing with Annoying JSON

How often do you have to work with an API that supplies badly formatted output? That's fine if you're using a weakly typed, mushy language like JavaScript that doesn't care until it has to, but for those of us who like something a little more structured, and a little more performant, it presents a challenge. I'm going to look at a way of dealing with one kind of annoying JSON in Scala. The most recent example I've run into is a field that may come back as a string, or may come back as a list.

Jackson provides a way to cope with JSON like this. The solution doesn't seem to work well with case classes, and seems to require a good deal more annotation than it should, but it does get the job done in a none too egregious way.
Here's a sample of bad JSON:

[{
  "startDate" : "2010-01-01",
  "city" : "Las Vegas",
  "channel": "Alpha"
},{
  "startDate" : "2010-02-01",
  "city": "Tucson",
  "channel": ["Alpha","Beta"]
}]

You can see that in the first element, the field 'channel' is supplied as a string, and in the second, it's a list.  If you set the type of your field to List[String] in Scala, Jackson will throw an error when it deserializes a plain String rather than just converting it to a single-element list.  I understand why it's a good idea for deserialization to be strict, but really, if you're using JSON, then schema compliance probably isn't at the top of your list of requirements.

You can deal with this using the JsonAnySetter annotation.  Unfortunately, once you use it, all hell seems to break loose and you must then use JsonProperty on everything and its brother.  The method you annotate with JsonAnySetter accepts two arguments that function as a key-value pair.  Both are typed appropriately: the key is always a String, and the value will be whatever type deserialization found most appropriate.  In this case, it will be a String or a java.util.ArrayList.  We can disambiguate these types with a case match construct, which seems perfect for this:
@BeanInfo
class Data(@BeanProperty @JsonProperty("startDate") var startDate: String,
  @BeanProperty @JsonProperty("city") var city: String,
  @BeanProperty @JsonIgnore var channel: List[String]) {

  // No-argument constructor, more or less required by Jackson
  def this() = this("", "", Nil)

  @JsonAnySetter
  def setter(key: String, value: Any) {
    key match {
      case "channel" => value match {
        case s: String => channel = List(s)
        // Erasure means we can't match on ArrayList[String], so match the raw type
        case l: java.util.ArrayList[_] =>
          channel = Range(0, l.size()).map(l.get(_).toString).toList
        case _ => // No-op if you're ignoring it, or throw if you're not
      }
      case _ => // Other unknown properties can be ignored (or logged)
    }
  }
}

Now when the bad JSON gets passed into deserialization it will get mapped more smartly than it was generated, and we win!
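If you want to see it run, here's a minimal driver, a sketch assuming Jackson 1.x (the org.codehaus.jackson packages; adjust the imports for Jackson 2's com.fasterxml equivalents):

import org.codehaus.jackson.map.ObjectMapper

object BadJsonExample extends App {
  val json = """[{"startDate":"2010-01-01","city":"Las Vegas","channel":"Alpha"},
{"startDate":"2010-02-01","city":"Tucson","channel":["Alpha","Beta"]}]"""

  val mapper = new ObjectMapper()
  // Deserialize straight to an array of the annotated class above
  val data = mapper.readValue(json, classOf[Array[Data]])
  data.foreach(d => println(d.city + " -> " + d.channel))
}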

I might have a poke at it to see if I can get it working with less of the additional annotation craziness too.

Tuesday, August 21, 2012

Testing with FluentLenium and Scala

I posted some time ago about browser-based testing in Scala using Cucumber, leveraging JUnit and Selenium.  That mechanism is pretty complicated, and there seems to be a much better way of doing it.  The FluentLenium library gives us a good way to integrate browser-based testing into Scala.  There are still some challenges with continuous integration that have to be solved, and I'll talk about those later.

What does a FluentLenium test case look like with this system?  Here's a simple example that opens a home page and clicks on a link:

class HomePageSpec extends FunSuite with ShouldMatchers {

  test("Visit the links page") {
    withBrowser {
      startAt(baseUrl) then click("a#linksPage") then
        assertTitle("A List of Awesome Links") then testFooter
    }
  }
}

And we can fill in a form and submit it like this:

class RegistrationSpec extends FunSuite with ShouldMatchers {
  val testUser = "ciTestUser"

  test("Creating a fake user account") {
    withBrowser {
      startAt(baseUrl) then click("a#registerMe") then
        formText("#firstName", "test") then
        formText("#lastName", "user") then
        formText("#username", testUser) then
        formText("#email", "test.user@example.com") then
        formText("#password", "123") then
        formText("#verify", "123") then
        hangAround(500) then click("#registerSubmit")
    }
  }
}

Much of what you see above isn't out-of-the-box functionality with FluentLenium.  Scala gives us the power to create simple DSLs that provide very powerful functionality which is easy to read and easy to write.  People often don't like writing tests, and Scala is a language that is still somewhat obscure.  A DSL like this makes it trivial for any developer, even one who is totally unfamiliar with Scala, to construct effective browser-based tests.

Now I'm going to delve into some of the specifics of how this is constructed! (example code can be found at: git://gitorious.org/technology-madness-examples/technology-madness-examples.git)

The first piece is the basic configuration for such a project.  I'm using the Play project template to start with, as it offers some basic helper functionality that's pretty handy.  The first thing to do is create a bare Play project:

play create fluentlenium-example

I personally prefer ScalaTest to the built-in test mechanism in play, and the fluentlenium dependencies are needed, so the project's Build.scala gets updated with the following changes:

val appDependencies = Seq(
    "org.scalatest" %% "scalatest" % "1.6.1" % "test",
    "org.fluentlenium" % "fluentlenium-core" % "0.6.0",
    "org.fluentlenium" % "fluentlenium-festassert" % "0.6.0"
)

val main = PlayProject(appName, appVersion, appDependencies, mainLang = JAVA).settings(
// Add your own project settings here
  testOptions in Test := Nil
)

Now for the main test constructs.  A wrapper object is constructed to allow us to chain function calls, and that object is instantiated with the function startAt():
case class BrowserTestWrapper(fl: List[TestBrowser => Unit]) extends Function1[TestBrowser, Unit] {
  def apply(browser: TestBrowser) {
    fl.foreach(x => x(browser))
  }

  def then(f: TestBrowser => Unit): BrowserTestWrapper = {
    BrowserTestWrapper(fl :+ f)
  }
}
This object is the container, if you will, for a list of test predicates that will execute once the test has been constructed.  It is essentially a wrapped list of functions, which we can see from the type List[TestBrowser => Unit].  Each test function doesn't have a return value because it uses the test system's built-in assertion mechanism and therefore doesn't return anything useful.  When this object is executed as a function, it simply runs through its contained list and executes the tests against the browser object that is passed in.

The special sauce here is the then() method.  This method takes in a new function, and builds a new BrowserTestWrapper instance with the current list plus the new function.  Each piece of the test chain simply creates a new wrapper object!

Now we add a few helper functions in the companion object:

import java.util.Date            // used by waitFor()
import play.api.test.TestBrowser // assuming Play 2.0's test helpers

object BrowserTestWrapper {
  def startAt(url: String): BrowserTestWrapper = {
    BrowserTestWrapper(List({browser => browser.goTo(url)}, hangAround(5000)))
  }

  def hangAround(t: Long)(browser: TestBrowser = null) {
    println("hanging around")
    Thread.sleep(t)
  }


  def click(selector:String, index: Int = 0)(browser:TestBrowser) {
    waitForSelector(selector, browser)
    browser.$(selector).get(index).click()
  }


  def formText(selector: String, text: String)(browser: TestBrowser) {
    waitForSelector(selector, browser)
    browser.$(selector).text(text)
  }

  def waitForSelector(selector: String, browser: TestBrowser) {
    waitFor(3000, NonZeroPredicate(selector))(browser)
  }


  def waitFor(timeout: Long, predicate: WaitPredicate): TestBrowser => Unit = { implicit browser =>
    val startTime = new Date().getTime

    while(!predicate(browser) && new Date().getTime < (startTime + timeout)) {
      hangAround(100)()
    }
  }
}

sealed trait WaitPredicate extends Function[TestBrowser, Boolean] {
}

case class NonZeroPredicate(selector: String) extends WaitPredicate {
  override def apply(browser: TestBrowser) = browser.$(selector).size() != 0
}

This gives us the basic pieces for the test chain itself.  Now we need to define the withBrowser function so that the test chain gets executed:
object WebDriverFactory {
  def withBrowser(t: BrowserTestWrapper) {
    val browser = TestBrowser(getDriver)
    try {
      t(browser)
    }
    catch {
      case e: Exception => {
        // Grab a screenshot of the failure before rethrowing
        browser.takeScreenShot(System.getProperty("user.home")+"/fail-shot-"+("%tF".format(new Date())+".png"))
        throw e
      }
    }
    finally {
      browser.quit()
    }
  }

  def getDriver = {
      (getDriverFromSimpleName orElse defaultDriver orElse failDriver)(System.getProperty("driverName"))
  }

  def baseUrl = {
    Option[String](System.getProperty("baseUrl")).getOrElse("http://www.mysite.com").reverse.dropWhile(_=='/').reverse + "/"
  }

  val defaultDriver: PartialFunction[String, WebDriver] = {
    case null => internetExplorerDriver
  }

  val failDriver: PartialFunction[String, WebDriver] = {
    case x => throw new RuntimeException("Unknown browser driver specified: " + x)
  }

  val getDriverFromSimpleName: PartialFunction[String, WebDriver] = {
    case "Firefox" => firefoxDriver
    case "InternetExplorer" => internetExplorerDriver
  }

  def firefoxDriver = new FirefoxDriver()

  def internetExplorerDriver = new InternetExplorerDriver()
}

This gives us just about all the constructs we need to run a browser driven test.  I'll leave the implementation of assertTitle() and some of the other test functions up to the reader.
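For completeness, here's one possible shape for assertTitle(), a sketch rather than the original implementation (it assumes FluentLenium's title() is exposed through TestBrowser):

def assertTitle(expected: String): TestBrowser => Unit = { browser =>
  // Fail the test if the page title doesn't match what we expect
  assert(browser.title == expected,
    "expected title '" + expected + "' but got '" + browser.title + "'")
}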

Once we have this structure, we can run browser tests from our local system, but it doesn't dovetail easily with a Continuous Integration server.  As I write this, my CI of choice doesn't have an SBT plugin, so I have to go a different route.  Pick your poison as you may; mine is Maven, so I create a Maven pom file for the CI to execute that looks something like this:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>fluentlenium-tests</artifactId>
  <version>1.0.0</version>
  <inceptionYear>2012</inceptionYear>
  <packaging>war</packaging>
  <properties>
    <scala.version>2.9.1</scala.version>
  </properties>

  <repositories>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
    <repository>
      <id>typesafe</id>
      <name>typesafe-releases</name>
      <url>http://repo.typesafe.com/typesafe/repo</url>
    </repository>
    <repository>
      <id>codahale</id>
      <name>Codahale Repository</name>
      <url>http://repo.codahale.com</url>
    </repository>
  </repositories>

  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>

  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.4</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.scalatest</groupId>
      <artifactId>scalatest_${scala.version}</artifactId>
      <version>1.8</version>
    </dependency>

    <dependency>
      <groupId>org.fluentlenium</groupId>
      <artifactId>fluentlenium-core</artifactId>
      <version>0.7.2</version>
    </dependency>
    <dependency>
      <groupId>org.fluentlenium</groupId>
      <artifactId>fluentlenium-festassert</artifactId>
      <version>0.7.2</version>
    </dependency>
    <dependency>
      <groupId>play</groupId>
      <artifactId>play_${scala.version}</artifactId>
      <version>2.0.3</version>
    </dependency>
    <dependency>
      <groupId>play</groupId>
      <artifactId>play-test_${scala.version}</artifactId>
      <version>2.0.3</version>
    </dependency>
    <dependency>
      <groupId>org.scala-tools.testing</groupId>
      <artifactId>specs_${scala.version}</artifactId>
      <version>1.6.9</version>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>app</sourceDirectory>
    <testSourceDirectory>test</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
          <args>
            <arg>-target:jvm-1.5</arg>
          </args>
        </configuration>
      </plugin>

      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <configuration>
          <argLine>-DdriverName=Firefox</argLine>
          <includes>
            <include>**/*Spec.class</include>
          </includes>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <reporting>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
        </configuration>
      </plugin>
    </plugins>
  </reporting>
</project>

You might notice that the above Maven configuration uses JUnit to execute our Spec tests.  This doesn't happen by default, as JUnit doesn't pick up those classes, so we have to add an annotation at the head of each test class to signal JUnit to run it:

import org.junit.runner.RunWith
import org.scalatest.junit.JUnitRunner

@RunWith(classOf[JUnitRunner])
class HomePageSpec extends FunSuite with ShouldMatchers {
  ...
}

Monday, July 9, 2012

Data processing, procedural, functional, parallelism and being Captain Obvious

I'm not going to tell you anything you don't already know in this post.  I might, however, manage to aggregate some things you already knew into one place and make them dance to a slightly new tune.

At the heart of this post is somewhat of an epiphany I had today.  It has to do with how code is written to do data processing.  This is a very common task for programmers, perhaps one that is in fact ubiquitous.

Ultimately, data processing almost always looks something like this: you load some stuff, parse it, transform it, filter it, and output it.  Those things may happen in different orders, but ultimately, something like that.

One of the things you already know is that the implementation of this should look like a production line: read a datum in, send it through the processing steps, rinse, repeat, batching as need be for efficiency.

The amazing thing is that when you look at implementations, they often don't.  Code is written that loads the entire dataset into memory as a list of objects, which then passes through some methods that change that list, either by transforming the objects themselves or, worse, copying the whole list to another list of different objects, filtering the list in situ, then saving the whole lot out.  These programs end up requiring an amount of RAM at least as large as the data itself.  Everybody knows this is a bad way to do things, so why do people keep writing code that looks like this?
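As a sketch, with hypothetical parse/keep/output helpers, the anti-pattern looks like this:

case class Record(raw: String)
def parse(line: String) = Record(line.trim)  // hypothetical parse step
def keep(r: Record) = r.raw.nonEmpty         // hypothetical filter
def output(r: Record) = println(r.raw)       // hypothetical output

val lines = io.Source.fromFile("big.data").getLines().toList // whole file in RAM
val parsed = lines.map(parse)                                 // a second full copy
val filtered = parsed.filter(keep)                            // and a third
filtered.foreach(output)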

We all know it should look more like the production line described above: read a datum in, run it through each step, write it out, and keep only a small batch in memory at any one time.

I think this is perhaps the tune to which many haven't given much thought.  The problem just pops up, and people start scrambling to fix it, trying to dance triple time to the beat of the old drum.  I believe one significant cause may be time and effort.  Data processing code often starts life as a humble script before it morphs into something bigger.  Most scripts are written in procedural languages, and in those environments, parallelization and streaming are more complicated to write than loading in the whole file and passing it around as an object, so people default to the latter.  Why write a mass of code dealing with thread pools and batching when you don't have to?  (I know there are libraries and frameworks, but often people don't know them, or don't have time to learn them.)
This problem is easy to solve in a language where functions are first-class values.  For each step of the flow, you define a function that performs that operation, no different than in procedural code.  But instead of a pipeline that takes a value and returns a new value, the functional variant treats each transformation as a function from object to object, and the flow itself becomes a value: a single function composed from the list of transform functions.  Now we can apply that flow to any object, or any number of objects, very easily.
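Here's a small sketch of that idea, with hypothetical record and transform names:

case class Record(fields: Map[String, String])

// Each step is just a Record => Record function
val normalize: Record => Record =
  r => r.copy(fields = r.fields.map { case (k, v) => (k, v.trim) })
val enrich: Record => Record =
  r => r.copy(fields = r.fields + ("processed" -> "true"))

// The flow itself is a value: one function composed from the steps
val myFlow: Record => Record = normalize andThen enrich

// Apply it to one record or a stream of millions; the flow doesn't care
println(myFlow(Record(Map("name" -> "  alice "))))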
In Scala, you have streaming sequences, so it becomes as easy as:

io.Source.fromURL("http://www.example.com/foo.data").getLines().flatMap(myFlow(_)).foreach(output(_))


In Scala, there are some helpers that can apportion and parallelize this quite easily, which I talked about in my previous post.  As we now have a process as our primary value, instead of a chunk of data, parallelization becomes much easier: passing our processing function around between threads is far simpler than coping with a big chunk of mutable data being shared about.

You can implement this pattern in Java, or C++, or Perl, but most people have to stop and think to do so; the language doesn't give it to you for free.  From what I'm learning, this is a very common pattern in functional programming.  In fairness, it's a common pattern in Java too, but many folks don't think of it as the default choice until it's already too late.

Monday, July 2, 2012

Logging and Debugging

I'm finding that one of the biggest challenges working with Scala is debugging, and secondarily logging.  The former seems to be a tooling issue as much as anything, and to be honest, the latter is a matter of my putting in the time to figure it out.

With debugging, breakpoints in the middle of highly condensed list comprehensions are very, very hard to place.  I end up mangling the code with assignments and code blocks that I then have to re-condense later.

I've attached a debugger using the usual jdwp method, but it slows everything down badly, and it's just not that much better than print statements.  I've been going through the Koans with a new employee at work, and it's been helping both of us greatly.  There's one koan that describes a way to sort of "monkey patch" objects, and as much as I dislike that approach in general, it sure beats aspects, which are hard to control and often fickle unless they're part of your daily routine.

I came up with a small monkey patch for the List class that lets me use inline log function calls to get basic information about the state of a list in the middle of a comprehension chain, so I include it here in the hopes that somebody will find it useful, or have some better ideas!

class ListLoggingWrapper[+T](val original: List[T]) {
  // Print a message along with the size of the list, then pass the list through
  def log(msg: String): List[T] = {
    println(msg + " " + original.size)
    original
  }
  // Print a message along with the list's contents (truncated), then pass it through
  def logSelf(msg: String, truncateTo: Int = 4096): List[T] = {
    println(msg + " " + original.toString().take(truncateTo))
    original
  }
}

implicit def monkeyPatchIt[T](value: List[T]) = new ListLoggingWrapper[T](value)

This helpful snippet allows you to call a new method 'log' on a List object, which prints out the list's size, and similarly 'logSelf', which prints out the result of toString, truncated (I've found that working with large lists means you always end up with pages of hard-to-pick-through output if you don't truncate).

A list comprehension chain ends up looking something like this:

// Assumes java.io.File and Jerkson's Json are imported, plus
// scala.collection.JavaConverters._ for the asScala call below
Util.getJsonFilePaths(args(0)).map { x: String =>
  new File(x).listFiles().toList
    .log("File List Size")
    .filter(file => Character.isDigit(file.getName.charAt(0)))
    .map(_.getPath)
    .filter(_.endsWith(".json"))
    .log("Json File Count")
    .flatMap { jsonFile =>
      io.Source.fromFile(jsonFile).getLines().toList
        .log("Line Count for " + jsonFile)
        .map(line => Json.parse[List[Map[String, Object]]](line))
        .flatMap(x => x)
        .log("Elements in file")
        .logSelf("Elements are", 255)
        .filter { jsonEntry =>
          jsonEntry.get("properties").get
            .asInstanceOf[java.util.Map[String, Object]].asScala
            .get("filterPropertyHere") match {
              case None => false
              case Some(value) if value.toString == "0" => false
              case Some(value) if value.toString == "1" => true
              case _ => false
            }
        }
    }
}

This is a piece of code that aggregates data across multiple JSON files, filtering by a given property, using Jerkson (which I still feel I'm missing something with, as it seems harder than it should be).

Thursday, June 21, 2012

Parallel Processing of File Data, Iterator groups and Sequences FTW!

I have occasion to need to process very large files here and there.  It turns out Scala is very good at this in general.  There's a nice feature in the BufferedSource class that allows you to break up file parsing or processing into chunks, so that parallelization can be achieved.

If you've tried the obvious solution, simply adding .par, you'll find the method isn't present on a BufferedSource.  So you might convert to a List with toList, but then Scala will compile all the lines into an in-memory List before passing it on.  If you have a large file, you'll quickly run out of memory and your process will crash with an OutOfMemoryError.
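That tempting-but-dangerous version looks something like this (LogParser being the same parser used in the working example below):

// Materializes every line in memory before any processing starts
io.Source.stdin.getLines().toList.par.map(LogParser.parseItem)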

BufferedSource offers another way to do this with the grouped() method call.  You pass a group size into the call to break your stream into groups of that size.  So instead of a single String iterator with millions of entries, one for each line, you get an Iterator of Sequences with 10,000 lines in each.  A BufferedSource is a kind of Iterator, and any Iterator can be grouped this way (Sequences and Lists have grouped() too).  Now you have finite-sized chunks whose processing you can parallelize to increase throughput, flatMapping the results back together at the end.
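To make that concrete, here's what grouped() does to a plain iterator:

val groups = Iterator.range(0, 25).grouped(10)
groups.foreach(g => println(g.length)) // prints 10, 10, 5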

The code looks something like this:

// Assuming LogParser.parseItem returns Option[LogRecord]: the second
// flatMap unwraps the Somes and drops the Nones.
io.Source.stdin.getLines().grouped(10000).flatMap { y =>
  y.par.map { x: String =>
    LogParser.parseItem(x)
  }
}.flatMap(x => x).foreach { x: LogRecord =>
  println(x.toString)
}

So with this, we can read lines from stdin as a buffered source, and also parallelize without the need to hold the entire dataset in memory!

At the moment, there's no easy way that I could get to work to force Scala to increase the parallelization level beyond your CPU core count.  This kind of I/O splitting wasn't what the parallel collections were designed for, as far as I know; it's more a job for Akka or similar.  Fortunately, in Scala 2.10 we'll get Promises and Futures, which will make this kind of thing much more powerful and give us easier knobs and dials to turn for concurrency configuration.  Hopefully I'll post on that when it happens!
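As a teaser, here's a rough sketch of what that might look like with the Futures API slated for 2.10 (illustrative only; the details may shift before release):

import scala.concurrent._
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// One Future per batch of lines; collect and print results in order
val futures = io.Source.stdin.getLines().grouped(10000).map { batch =>
  Future { batch.flatMap(LogParser.parseItem) }
}
futures.foreach(f => Await.result(f, 1.minute).foreach(println))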

Tuesday, June 12, 2012

Parsing CSVs in Scala

I did a quick Google on parsing CSVs in Scala, and one of the top hits was a Stack Overflow question where the answer was wrong.  Very wrong.  So I threw together a quick parser in Scala to get the job done.  I'm not saying it's good, but it passes the spec tests I have, which include quotes and quoted commas, with both single and double quotes.  I hope this is useful, and perhaps somebody can improve upon it.

import scala.util.parsing.combinator.RegexParsers

object CSVParser extends RegexParsers {
  def apply(f: java.io.File): Iterator[List[String]] = io.Source.fromFile(f).getLines().map(apply(_))
  def apply(s: String): List[String] = parseAll(fromCsv, s) match {
    case Success(result, _) => result
    case failure: NoSuccess => {throw new Exception("Parse Failed")}
  }

  def fromCsv:Parser[List[String]] = rep1(mainToken) ^^ {case x => x}
  def mainToken = (doubleQuotedTerm | singleQuotedTerm | unquotedTerm) <~ ",?".r ^^ {case a => a}
  def doubleQuotedTerm: Parser[String] = "\"" ~> "[^\"]+".r <~ "\"" ^^ {case a => (""/:a)(_+_)}
  def singleQuotedTerm = "'" ~> "[^']+".r <~ "'" ^^ {case a => (""/:a)(_+_)}
  def unquotedTerm = "[^,]+".r ^^ {case a => (""/:a)(_+_)}

  override def skipWhitespace = false
}
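Usage looks like this (the expected results are from my reading of the grammar above):

CSVParser("""a,"b,c",'d'""")  // List("a", "b,c", "d")
CSVParser("plain,old,csv")    // List("plain", "old", "csv")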

Wednesday, June 6, 2012

Data Migration - Scala and Play along the Way

I've been nibbling at a data migration system for many years.  It's gone through various transformations, and its latest incarnation is mostly working.  I forget the original purpose of the program, but its main use for a while has been to extract the EVE Online database data from the SQL Server database dump that CCP kindly provides.  Each EVE revision, I take the CCP dump, spin up a Windows server in the cloud, import the database, and extract what I need to port it into PostgreSQL, which is my system of choice.


Over the years, JDBC has improved, and technologies have moved along.  In the beginning I wrote Hermes-DB, a simple ORM that was very much not type-safe, but handled the auto-probing of table information that comes along with a more dynamic style of ORM.  One can argue that this isn't really ORM at all, and at this point, I'm inclined to agree.


Having said that, the auto-probing capabilities turned out to be very, very useful in extracting data.  Because the system was predicated on the idea that learning about the database should be the job of the framework, not the developer, it had a reasonably well-formed concept of representing tables and columns as objects.  With a bit of tweaking, adding a new metadata class along the way, the package can represent a table definition fairly well now.


What this allows me to do today is create both a solid database dump and the DDL to build the table structure.  Theoretically this system could be modified to pull from any datastore and generate for any other datastore; it was built in a way that was hopefully designed to facilitate that.


2012 rolls around, and things have changed.  The landscape for web development has been shifting over the last decade as people struggle to find ways to get the tools out of the developer's way, enabling them to do their job more and fight with code less.  The most recent evolution in that sequence that I've been working with is Scala and Play.  As I work with these two tools, I'm increasingly finding it easier to build systems that are stable and take much less code to write.


Hermes-DB was originally designed just to output DDL, but when I started working with JPA, a system that requires a whole lot of scaffolding, it made sense to have one of the output "DDLs" be Java classes with JPA annotations.  Over the last few days, I've been building a new variety of output: Scala case classes designed to work with Play, and therefore Anorm.  Anorm is very powerful and gives you tools that get out of your way, but it doesn't have much in the way of scaffolding.  I've poked around a bit, and it seems there was a scaffolding plugin for Play 1, but none exists for Play 2.  This little utility is helping fill that gap for me.  It outputs Scala class and companion object definitions based on the database schema.
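For a flavor of what gets generated, here's the general shape of a Play 2 / Anorm model class; the table and names here are illustrative, not actual generator output:

import anorm._
import anorm.SqlParser._
import play.api.db.DB
import play.api.Play.current

case class SolarSystem(id: Long, name: String)

object SolarSystem {
  // Row parser mapping columns to the case class
  val simple = get[Long]("solar_system.id") ~ get[String]("solar_system.name") map {
    case id ~ name => SolarSystem(id, name)
  }

  def findAll(): Seq[SolarSystem] = DB.withConnection { implicit c =>
    SQL("select * from solar_system").as(simple *)
  }
}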


The EVE Online database comes out of the box with about 75 tables, 75 tables that I'd rather not have to manually create model-class mappings for.  This little utility made my life much easier.  A big cheer for code generation tools!


It is open source of course, and can be found on gitorious with the git URL: git@gitorious.org:export4pg/export4pg.git


Please note that some of this code is very, very old; it's worked for probably close to a decade, so some of it is a bit ancient in both understanding and coding style.  It is, however, very useful, and possibly one of the few pieces of code I've written that's still in use and not broken from constant tinkering!