Thursday, June 21, 2012

Parallel Processing of File Data, Iterator groups and Sequences FTW!

I have occasion to process very large files here and there, and Scala turns out to be very good at this in general.  There is a nice feature in the BufferedSource class that lets you break file parsing or processing into chunks, so that the work can be parallelized.

If you've tried the obvious solution, simply adding .par, you'll have found the method isn't present.  So you might convert to a List with toList.  When you convert like this, Scala will compile all the lines into a List in memory before passing it on.  If you have a large file, you'll quickly run out of memory and your process will crash with an OutOfMemoryError.

BufferedSource offers us another way to do this with the grouped() method.  You pass a group size into the method call to break your stream into a sequence of lists.  So instead of a single stream of millions of Strings, one for each line, you get an Iterator of Sequences with 10,000 lines in each.  A BufferedSource is a kind of Iterator, and any kind of Iterator can be grouped this way, Sequences and Lists included.  Each group is a Sequence with a finite element count, so you can parallelize the processing of it to increase throughput, and flatMap the results back together at the end.

The code looks something like this:

io.Source.stdin.getLines().grouped(10000).flatMap { group =>
  // Parse each 10,000-line chunk in parallel
  // (assumes LogParser.parseItem returns an Option[LogRecord], flattened below)
  group.par.map { line: String =>
    LogParser.parseItem(line)
  }
}.flatMap(x => x).foreach { record: LogRecord =>
  println(record.toString)
}

So with this, we can read lines from stdin as a buffered source, and also parallelize without the need to hold the entire dataset in memory!

At the moment, there is no easy way that I could get to work to force Scala to raise the parallelization level beyond your CPU core count.  This kind of I/O splitting wasn't what the parallel collection operations had in mind as far as I know; it's more a job for Akka or similar.  Fortunately, in Scala 2.10, we'll get Promises and Futures, which will make this kind of thing much more powerful and give us easier knobs and dials to turn on the concurrency configuration.  Hopefully I'll post on that when it happens!
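As a teaser, here's a minimal sketch of what the 2.10-style version might look like, with the thread pool size as an explicit knob.  It assumes the same hypothetical LogParser.parseItem returning an Option[LogRecord] as above:

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// The pool size is the knob: it can exceed the CPU core count
implicit val ec = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(32))

io.Source.stdin.getLines().grouped(10000).foreach { chunk =>
  val work = chunk.map(line => Future(LogParser.parseItem(line)))
  Await.result(Future.sequence(work), 1.hour).flatten.foreach(println)
}

ec.shutdown()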

Tuesday, June 12, 2012

Parsing CSVs in Scala

I did a quick Google on parsing CSVs in Scala, and one of the top hits was a Stack Overflow question where the answer was wrong.  Very wrong.  So I threw together a quick parser in Scala to get the job done.  I'm not saying it's good, but it passes the spec tests I have, including quotes and quoted commas, with both single and double quotes.  I hope this is useful, and perhaps somebody can improve upon it.

import scala.util.parsing.combinator.RegexParsers

object CSVParser extends RegexParsers {
  def apply(f: java.io.File): Iterator[List[String]] = io.Source.fromFile(f).getLines().map(apply(_))
  def apply(s: String): List[String] = parseAll(fromCsv, s) match {
    case Success(result, _) => result
    case failure: NoSuccess => throw new Exception("Parse failed: " + failure.msg)
  }

  // A line is one or more terms; each term consumes an optional trailing comma
  def fromCsv: Parser[List[String]] = rep1(mainToken)
  def mainToken = (doubleQuotedTerm | singleQuotedTerm | unquotedTerm) <~ ",?".r
  def doubleQuotedTerm: Parser[String] = "\"" ~> "[^\"]+".r <~ "\""
  def singleQuotedTerm = "'" ~> "[^']+".r <~ "'"
  def unquotedTerm = "[^,]+".r

  // Whitespace is significant inside fields, so don't skip it
  override def skipWhitespace = false
}
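A quick usage example:

CSVParser("one,'two, with comma',\"three\"")
// yields List("one", "two, with comma", "three")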

Wednesday, June 6, 2012

Data Migration - Scala and Play along the Way

I've been nibbling at a data migration system for many years.  It's gone through various transformations, and its latest incarnation is mostly working.  I forget the original purpose of the program, but its main use for a while has been to extract the EVE Online database data from the SQL Server database dump that CCP kindly provides.  Each EVE revision, I take the CCP dump, spin up a Windows server in the cloud, import the database, and extract what I need to port it into PostgreSQL, which is my system of choice.


Over the years, JDBC has improved, and technologies have moved along.  In the beginning I wrote Hermes-DB, a simple ORM that was very much not type safe, but handled much of the auto-probing of table information that comes along with a more dynamic style of ORM.  One can argue that this isn't really ORM at all, and at this point, I'm inclined to agree.


Having said that, the auto-probing capabilities turned out to be very useful in extracting data.  Because the system was predicated on the idea that learning about the database should be the job of the framework, not the developer, it had a reasonably well-formed concept of representing tables and columns as objects.  With a bit of tweaking, adding a new metadata class along the way, the package can now represent a table definition fairly well.


What this allows me to do today is create both a solid database dump and the DDL to build the table structure.  Theoretically this system could be modified to pull from any datastore and generate for any other datastore; it was built in a way that will hopefully facilitate that.


2012 rolls around, and things have changed.  The landscape for web development has been shifting over the last decade as people look for ways to get the tools out of the developer's way, enabling them to do their job more and fight with code less.  The most recent evolution in that sequence that I've been working with is Scala and Play.  As I work with these two tools, I'm increasingly finding it easier to build systems that are stable and take much less code to write.


Hermes-DB was originally designed just to output DDL, but when I started working with JPA, a system that requires a whole lot of scaffolding, it made sense to have one of the output "DDLs" be Java classes with JPA annotations.  Over the last few days, I've been adding a new variety of output: Scala case classes designed to work with Play and therefore Anorm.  Anorm is very powerful, and gives you tools that "get out of your way", but doesn't have a lot when it comes to scaffolding.  I've poked around a bit, and it seems there was a scaffolding plugin for Play 1, but none exists for Play 2.  This little utility is helping fill that gap for me.  It outputs Scala class and companion object definitions based on the database schema.
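To give a flavour, here's a hypothetical example of the kind of thing the generator emits; the table and column names are made up, and the shape follows the usual Play 2.0 Anorm pattern:

case class InvType(typeId: Long, typeName: String)

object InvType {
  import anorm._
  import anorm.SqlParser._
  import play.api.db.DB
  import play.api.Play.current

  // Row parser built from the probed column metadata
  val simple = {
    get[Long]("inv_types.type_id") ~
    get[String]("inv_types.type_name") map {
      case typeId ~ typeName => InvType(typeId, typeName)
    }
  }

  def findAll(): Seq[InvType] = DB.withConnection { implicit connection =>
    SQL("select * from inv_types").as(simple *)
  }
}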


The EVE Online database comes out of the box with about 75 tables.  75 tables that I'd rather not have to manually create model-class mappings for.  This little utility made my life much easier.  A big cheer for code generation tools!


It is open source of course, and can be found on gitorious with the git URL: git@gitorious.org:export4pg/export4pg.git


Please note that some of this code is very old; it's worked for probably close to a decade, so some of it is a bit ancient in both understanding and coding style.  It is, however, very useful, and possibly one of the pieces of code I've written that's still in use and not broken from constant tinkering!

Tuesday, May 15, 2012

On wired and wireless networking

I saw the following article on G+ today:

http://lifehacker.com/5910335/what-awesome-things-still-require-a-wire--does-plugging-in-even-matter-anymore?utm_campaign=socialflow_lifehacker_facebook&utm_source=lifehacker_facebook&utm_medium=socialflow

And thought I'd comment on it.  I used to be a big proponent of wired systems, sufficiently so that I put effort into wiring my home with Cat 5e.  That was back in the days of 802.11g, and honestly, back then 802.11g didn't come close to its potential most of the time.

Today we live in a wireless age.  I use laptops that are truly portable, an iPad, an iPhone, and an iPod touch.  I agree that there are some places where wired makes sense, but I think this article makes both valid and invalid points.  I'm gonna break it down here a bit, and take them on one by one.

Backup Faster over the Network

This is mostly a valid point.  If you need to back up over a network, you're better off plugged in.  This does, however, assume your NAS supports gigabit ethernet, that the NAS's operating system doesn't suck, and that the drive inside can do better than 10-15MB/sec.  I've seen many cases where none of the above are true, and it's one reason I switched to Apple.
Mostly, I don't use NAS.  It's generally quirky, unreliable, expensive and slow, regardless of your network connection.  I spent a great deal of time going through NAS devices until I finally just gave up and used a direct-attached device.  He also talks about remembering to keep your device turned on being a problem.  If you use a wired connection, the same issue holds, so it's not really a good argument for wired.
On balance, I think this is a poor argument, though it has some validity.

Keep up with your ultra-fast network

This is a really elitist kind of point.  The number of folks who have anything close to 100Mbit internet is minuscule.  I'm a programmer, and I don't have 100Mbit.  Even with 100Mbit, the number of times I'd get 100Mbit from the other end is about zero.  Even at 25Mbit, I often don't see download sites saturate the link.  This is a poor argument in my opinion.

USB 3.0 (and 2.0 Too)

Comparing wireless networking with direct-attached peripherals seems a bit silly, and it goes on and on in this article.  This is both a valid point and an invalid one.  If the device on the other end can truly saturate 802.11n, then it's true; many devices just can't.  Backups are the prime candidate here, and, well, I think backups are a good use of direct-attached storage.

Remote Control Your Camera

Very esoteric usage here.  Firstly, it assumes you have a DSLR.  Secondly, it assumes you need to control it wirelessly and view the images on a laptop.  Most folks aren't doing indoor or studio shooting, even if they own a DSLR, and if they are, why not just use USB from your computer?  That's wired, of course, but the need for wireless here at all seems a stretch.
This is a really crap argument.

Record High Quality Audio

A little bit of bandwidth calculation is required here.  WAV format, as used on CDs, is 44.1kHz at 16 bits.  This means 44,100 16-bit samples per second per channel.  Simple multiplication shows that comes in at about 0.7Mbit per channel, or roughly 1.4Mbit for stereo.  Let's take this up a notch and go to studio level, 24 bit at 192kHz.  If you have software and devices that can do this, it still only clocks in at 4.6Mbit per channel.  I've used 24-channel recording desks that use Firewire.  They were Firewire 800, which is 800Mbit.  They were nowhere near saturated; 24 channels at studio quality is about 110Mbit, which is within the nominal capability of 802.11n.
This is an invalid argument, other than the fact that audio devices don't come with wireless support.  But let's face it, most computers don't come with Firewire 800 either.
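Since this blog is Scala-flavoured anyway, the back-of-the-envelope numbers as a REPL snippet:

// Uncompressed audio bandwidth: sample rate * bit depth * channels, in Mbit/s
def mbit(rate: Int, bits: Int, channels: Int) = rate.toLong * bits * channels / 1e6

mbit(44100, 16, 2)    // CD stereo:            ~1.4
mbit(192000, 24, 1)   // studio, per channel:  ~4.6
mbit(192000, 24, 24)  // 24-channel desk:    ~110.6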

Anything That Can Be Done with a Thumb Drive

I'm not really sure what the argument here is.  If it's speed, then it's a really bad argument.  Most thumb drives are really slow; I had to go out of my way to buy one that was even a half-sensible speed.  This is also the reason I feel that Micro-SD slots in your Android device are pretty silly: most people don't know they have to buy a high-speed SD card or USB key for it to be much use.  With wireless, it's not that hard to transfer files over the network to folks.  It takes a bit of knowledge unless you have a Mac, but it's not that hard.  I haven't ever been given a USB key as a mix-tape (tapes, now we're talking modern tech) or mix-CD.

Charge your Other Gadgets

Powering USB devices.  An iPad, a pretty hungry device, charges at 12W I believe.  You could reasonably charge your iPad off your unwired laptop without too much pain, and give your device some more juice at the cost of some laptop time in a pinch.  Also, power transfer without wires is still pretty new technology, and I think comparing it with wireless networking is a bit disingenuous.
I think this argument is valid inasmuch as you can't charge a device wirelessly, but it's a silly argument given the original context of wireless meaning ethernet.

Audio and Video Cables

We've already covered audio.  Video has only recently become possible to send over a serial connection, not at HDMI level, but at computer-monitor level.  Whilst this is true, it's also a bit silly; see below for why.

Put Your Tablet or Smartphone On Your TV

Two words: Apple TV (maybe that's three).
Nuff said, this is an invalid argument.  It also sort of invalidates the previous point.  You can't transmit full-quality video over wireless, but you can transmit compressed high-def, and I think that satisfies the requirement.  There have been a few articles comparing iTunes 1080p with Blu-ray, and iTunes hasn't come out too badly.


Get the Highest Quality Sound

Isn't this a repeat of "Record High Quality Audio"?  In short, no.  This is invalid.

Final Score

I think that out of the ten arguments, three have some semblance of validity, and of those, I'm struggling with two.  There are things that need to be wired: your speakers to your stereo will still need wires.  There are wireless solutions, but they either suck or are very expensive.  Generally I think this article tries a bit too hard to demonstrate a need for wires in a world that is already mostly wireless.  Trying to convince people to back up over ethernet when they're already doing it wirelessly is gonna be a pretty hard sell.

Saturday, May 12, 2012

Scala is very nice - very very nice

Today I am gushing over Scala's par method and XML literals. I am fetching about 30,000 entries over REST calls. The server isn't super fast on this one, so each call takes a bit of time. Enter list.par stage left.

list.par creates a parallel version of the list which, given an operation, will perform it in parallel across multiple CPUs.  It spawns threads, performs the operation, then joins all the results together at the end.  Very handy.

This little three letter method is turning what would be a very very long arduous process into a much less long one. Much much less.

// getLines gives an Iterator, which has no par; toList makes it parallelizable
val myList = io.Source.fromFile("list.txt").getLines.toList.par.map { x =>
  callService("FooService", "{id=\"" + x + "\"}")
}

It gets better.  In Scala, XML can be declared as a literal.  Not only that, but it runs inline like a normal literal, with a few special rules.  This service is combining a bunch of JSON into an XML output.

val myOutput = io.Source.fromFile("list.txt").getLines.toList.par.map { x =>
  callService("FooService", "{id=\"" + x + "\"}")
}.map { x =>
  // pull the "url" field out of each JSON response
  Json.parse[Map[String, Object]](x)("url").toString
}.map { x =>
  <entry>
    <url>{ x }</url>
  </entry>
}.seq.mkString("\n")


Which I can now happily write to wherever I need to, a file, or a web service response. Nifty in the extreme.

In 2012, we live in a world of JSON and XML.  Perl had its day when text processing was king.  Today, a language is needed that can cope with JSON, XML and parallelization and still yield sane-looking code.  I'm not a big Ruby fan, as anyone who knows me will tell you, but I'm willing to keep an open mind.  I'd like to see if Ruby can do this kind of thing as elegantly and easily, and demonstrate it's a language for the web in 2012.  Also, I should mention Akka, though I don't yet know enough about it, other than that it can allegedly take parallelization inter-computer with similar simplicity.

Wednesday, May 9, 2012

Simple Scala scripts : Scan a directory recursively

I'm using Scala increasingly as a scripting language at the moment.  As my confidence with it increases, I'm finding it more and more useful for those throw-away scripting situations.  Especially when they end up being not so throw-away after all.

import java.io.File

// Recursively walk path, keeping every entry the partial function is defined at
// (listFiles returns null for non-directories, hence the Option wrapper)
def findFiles(path: File, fileFilter: PartialFunction[File, Boolean] = { case _ => true }): List[File] = {
  val children = Option(path.listFiles).map(_.toList).getOrElse(Nil)
  (path :: children.flatMap { f =>
    if (f.isDirectory) findFiles(f, fileFilter) else List(f)
  }).filter(fileFilter.isDefinedAt(_))
}

(replace {} with (), ditch newlines and it goes on one line well-enough, just doesn't fit in a Blogger template that way)
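As a quick hypothetical usage, grabbing every .log file under a directory (the path is illustrative):

val logs = findFiles(new File("/var/log"), { case f if f.getName.endsWith(".log") => true })

The partial function's guard does the filtering: anything the function isn't defined at gets dropped.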
We might be duplicating a shell find:

find | grep 'foo'
or
find ./ -name "foo"

And whilst the Scala is more complex, the Scala function can do operations on a File object, which gives you a lot of the rest of the power of the find command thrown into the bargain.  Plus, as it accepts a partial function, you can chain filters together.  If you truly just wanted an analog for find:

def findFiles(path: File): List[File] =
  path :: Option(path.listFiles).map(_.toList).getOrElse(Nil).flatMap { f =>
    if (f.isDirectory) findFiles(f) else List(f)
  }

Which is less complex than the first.  This is still more work than find, but the list you get back is an actual List.  If you added anything useful to your find, say an md5 for each file, it gets less happy:
find ./ | awk '{print "\""$0"\""}' | xargs -n1 md5sum
Maybe there's a better way, but that's what I've always ended up doing.  The Scala is starting to compete now.  Bump up the complexity one more notch, and I think Scala actually starts becoming less code and less obscure.
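For instance, a minimal sketch of that md5 case in Scala, reusing findFiles from above:

import java.io.{File, FileInputStream}
import java.security.MessageDigest

// Stream the file through a digest rather than loading it all into memory
def md5(file: File): String = {
  val md = MessageDigest.getInstance("MD5")
  val in = new FileInputStream(file)
  try {
    val buf = new Array[Byte](8192)
    Iterator.continually(in.read(buf)).takeWhile(_ != -1).foreach(n => md.update(buf, 0, n))
  } finally in.close()
  md.digest.map("%02x".format(_)).mkString
}

findFiles(new File(".")).filter(_.isFile).foreach(f => println(md5(f) + "  " + f.getPath))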

You might also notice that the example above fits nicely within the Map/Reduce paradigm.  Scripting that is not only relatively easy, but can also be thrown at Hadoop for extra pizzazz and NoSQL buzz-worthiness.

On things that "save time"

Over the years, I've often heard claims about things that "save time" in development.  For many years, I was gun-shy of IDEs.  Too often they break down, and the entire thing has to be reset and reconfigured from nothing.  It made the cost outweigh the benefits.  After a few more years passed, IDEs got better, and when somebody introduced me to IntelliJ, I was finally convinced that IDEs could actually save me time overall, not cost me it, normally at the most inopportune moment.

So now we have IDEs that don't suck.  Take a simple thing like method lookup.  What's the time difference between hitting Command-B in IntelliJ, and having to do Ctrl-H, change the tab, and type the name in in Eclipse (I'm sure there's a better way in Eclipse, there always is, but it's normally non-obvious)?  It amounts to a few seconds at most.  So there's not really a significant difference, right?  In time, this is perhaps true, and some might argue that a few seconds here and there add up, and I might go into that later.  For me, the real issue is not time at all, it's space.  Space in your brain.

A brain is like a CPU in some ways, and like a CPU it has a cache (at least this seems like a good analogy to me).  You have multiple levels, at least L1 and L2, maybe L3.  L1 caches are small and very fast; they handle the immediate focus of attention in your brain right now: jumping through the code, tracing back a problem, going up the code path.  When you have to search instead of jumping directly to the caller, you go through a set of operations.  This is only a small time difference, but it's like having to put three or four operations in your L1 cache instead of none.  Hitting Ctrl-B is a zero-effort operation; it's just like an op-code: Ctrl-B does this, that's what I need.  Opening the search dialog is a zero-effort operation.  Remembering to switch the tab: not a zero-effort operation.  Copy/pasting the right string in: not a zero-effort operation.  Checking it's including the right files: similar.  And if it's a big project, watching the search run and then popping up an error dialog: not zero effort.

Another four things are now put into focused attention, significantly depleting what's there.  Two seconds of time has busted through maybe 20% or more of the brain's L1 cache (I think I read somewhere that the average human can only concentrate on four to six things at once).  That two seconds can turn into two hours as the most important thing being held at the top of the stack in your brain's "L1" gets pushed down into L2 or worse.  We fix the immediate problem, but forget why.  The local manifestation is gone, perhaps, but the bigger issue is forgotten, and still very present.

Two seconds, concentration was diminished, and two hours were lost.  This is one way every little operation in a development environment can be critical.  Is this an exaggeration?  I'm not sure it is.  Even if it is for this one thing, imagine this problem multiplied by two, or four: not just one missing zero-cost operation, but two, or three.  Suddenly, with a more fluid environment, with just a few things made drastically better, development becomes less stilted and happens better.