Technology Madness

Understanding versus working state

2020-01-21T10:43:00.001-08:00

I'm noticing a prevalent trend in development, and particularly DevOps as a new discipline that is very problematic. It seems that most developers (and operations folks/DevOps folks), are valuing working state over systemic understanding. That is, people at the command line will continue to cut and paste solutions from Azure documentation, or StackOverflow until it works. At the end of it; they probably won't have much of an idea of why it's all working. This leads to a system that is very fragile. It works today, but tomorrow could bring breaking changes, and nobody will have any idea why they broke the system, or how to fix it!

I'm noticing that the increasing trend in online documentation is to present solution based documentation, which is useful in a pinch, but ultimately very counter-productive. The same is starting to be true of books now too, if one even exists in the domain you're looking for that isn't just a reprint of the online documentation, and thusly has very little explanation.

So I'm going to do my best here to write up an account of working into a new Azure account, and getting things working. As a developer/CTO who is interested first and foremost in system longevity , and reproducibility; I will focus not just on code examples but also on explanation of what is going on (as far as I understand it).

Future[T] and Future[Try[T]] - Round Two!

2018-11-29T21:44:00.002-08:00

In my last post, I was discussing some of the issues I'm facing around Future[T]; and how if you look inside the implementation of Future, what you see is

val value: Option[Try[T]].

Of course, for my part, I want access to that Juicy Try[T]. without it, flows can just fail; another point in case, the post-process code in our system:

for {
  flowDone <- flow.runWith(Sink.foreach(e => logger.trace("Updated reference " + e)))
  catDone <- categoryService.updateCategoryListWithProductCount(loginId)
  childSkus <- backPropagateChildSkus(loginId)
  reIndexDone <- searchReindexerService.callReIndex(loginId)(ws)
} yield reIndexDone

Once the product ingest work is done, I then want to chain that future through a series including updating category classifications, updating Child Sku information and sending the whole lot off wholesale for reindexing in the search cluster.

With the current implementation and function of Future, even if these methods all return Future[Try[T]], any one of them might fail for an unpredictable reason, and generate a Failure[Try[T]] where the Failure's Try is Failure, and the inner value is never exposed. And in fact, on the first execution this is precisely what happened. Another exception was swallowed I suspect, because whilst all the products seem to have been inserted, the reindex started as indicated on logging, but didn't complete successfully.

Something must be done!

Attempts to do something very brute force like subvert Future[T] to return Future[Try[T]] in all cases seem dubious; particularly as we now have the problem of dealing with somehow making sure Future[Try[T]] still returns Future[Try[T]] and not Future[Try[Try[T]]. My type foo is not strong enough to untangle that one; and perhaps that's a good thing.

So having arrived at that conclusion, what do I want to do about it. I think I'm going to give this a spin:

class FutureTryAutoTransformer[T](original: Future[Try[T]])(implicit executionContext: ExecutionContext) {
  val liftF: Try[Try[T]] => Try[Try[T]] = {
    case Success(Success(v)) => Success(Success(v))
    case Success(Failure(e)) => Success(Failure(e))
    case Failure(e) => Success(Failure(e))
  }

  def lift: Future[Try[T]] = original.transform[Try[T]](liftF)(executionContext)
}

implicit def future2FutureTryAutoTransformer[T](

  f: Future[Try[T]])(implicit executionContext: ExecutionContext) =

   new FutureTryAutoTransformer[T](f)

With that little piece of magic, I have a way to "lift" the Future[Try[Try[T]] up so that the inner and out Try components are handled together and I get a real Future[Try[Try[T]] with no hidden agendas!!

let's see how we feel about that - I've heard tell of a scalaz solution which I might look in to... watch this space!

On Future[T] and Future[Try[T]]

2018-11-27T11:13:00.001-08:00

In the Scala universe there is some debate of the usage of Future[Try[T]], and how to best encapsulate failure in the context of a Future. For my part, I like using Monads to communicate context, and meaning of the expectation, especially around failure. One of the biggest reasons for the existence of Option[T] is to pro-actively handle null cases; and with Try[T], the same thing with exceptions (noting that Try is not technically Monadic, but... it's close). This becomes especially bothering once you drop into an Akka streams situation with flows where errors can easily just get eaten by the system completely, with no exception trace or notification. I have one application where it ingests millions of rows, and occasionally the flow blows up, and what do you see on the logging? Nothing at all.

So - how can you at least address this situation; if you're like me, and like explicit understanding based on your Type arrows; how can you wrap this up in a way that gives you the context you desire:

  def tryingFutTry[T](f: => Future[Try[T]])(implicit executionContext: ExecutionContext): Future[Try[T]] = 
    try {
      f.recoverWith({ case e: Throwable => Future(Failure(e))})
    }
    catch {
      case e: Exception => Future(Failure(e))
    }

  def tryingFut[T](f: => Future[T])(implicit executionContext: ExecutionContext): Future[Try[T]] = 
    try {
      f.map[Try[T]](Success.apply).recoverWith({ case e: Throwable => Future(Failure(e))})
    }
    catch {
      case e: Exception => Future(Failure(e))
    }

  def trying[T](f: => T)(implicit executionContext: ExecutionContext): Try[T] = try {
    Success(f)
  }
  catch {
    case e: Exception => Failure(e)
  }

Whilst this isn't very pretty, or perhaps even very well named; I'm not loving it yet; it does at least give you a way to "lift" non-try wrapped Future in a Try context to lift out the failure case from inside the Future that gets eaten, and allow you to expose swallowed exceptions when you have an explicit Try context that may not be catching all exceptions.

A lot of these ideas were take from a blog post that I found here:
https://alvinalexander.com/scala/how-exceptions-work-scala-futures-oncomplete-failure

On Good Code

2018-04-11T09:25:00.001-07:00

Good code. What makes a codebase good? What makes good code... well, good?

Coming into a new company again has refreshed my mind on what it is like to delve into a complex pre-existing codebase for the first time. Sometimes the experience is agonizing, sometimes it's fairly straightforward, and sometimes it lies somewhere in between.

I remember when I first started at PlayStation, who are for the most part, a Java shop. Getting my environment set-up, discovering the shape of things, where to find things, what libraries were there and beginning to dig into the project I'd be working on. Opening up my IDE for the first time and initializing the first maven pom into a project for IntelliJ to index and for me to digest.

I pulled out my OmniGraffle, and started making diagrams for my own edification, tracing from the start of the application flow, where requests arrive, and following the call flow all the way down into the guts of the application where SQL queries start popping up and the particulars of the relational organization of the platform become evident, making flow charts and relational diagrams as I go so I can understand what kind of a picture it all paints.

Today, coming into a new company, I begin the process all over again, but this time I'm in a director role, so today, my chief concern is more to do with deliverables and project timelines, overarching technology goals and direction that it is with the low level of our implementation. However, being a startup, it means that whilst my primary focus is more long-term, I have to be conversant with all those gruesome details, and able to function therein. It's PHP, it's AWS, it's Docker, it's Linux, it's MySQL.

My standards for what "good code" looks like are very high. Over the years I've discovered this in my professional life, and that most people aren't nearly as exacting as my preferences would prefer. It's been a long journey from the young up-and-commer who left Southampton University for the U.S.A with a baby on the way and dived into commercial software development with gusto to today, a Engineering Director for a small company in California. Over those years I've worked on many systems, from the smallest companies like ZiftIt, where the technology team consisted of me doing all the core engineering, a front-end guy doing UI, and the CTO, to vast sprawling multi-billion dollar organizations like Sony PlayStation where I was just one cog in a vast machine delivering content at massive scale to users the world over.

One common thread that shows up across these companies is that good code makes a difference. Not in theory, but in practice. I've lived in companies where the toxicity of the codebase rose up and strangled the organization from within as it took more engineers just to beat back the zombies and skeletons of rushed implementations, where interacting with system became an effort in managing the edge cases that were so prevalent it was like trying to play patty-cake with Edward Scissorhands, and the edge-cases were so much at the edges as baked all the way through.

Let me start by describing what good code feels like. When your codebase is good, it feels safe. It's a warm blanket that welcomes you to work in the morning, where you feel confident that your timelines are accurate. Where you can estimate with ease, and new features are just a matter of solving for the complexity of the design. Where you go home on Friday, and thinking about refactoring something, arrive Monday, and the refactoring is done by close of business Tuesday. Where when a business owner asks for a new feature, you smile and say, it mostly already does that because it's just a logical extension of the relational design. Where you can look at your database, and immediately get a sense of what the data means.

Contrast that with bad code. Where any step taken is fraught with peril. You can't change anything for fear of the whole system collapsing like a house of cards, worse even, you daren't even step heavily around it, in case the table shakes and the whole thing just collapses apparently of it's own accord. Where implementing anything requires long heavy test cycles that seem to take forever, and where business is always angry that what they are being given is so full of bugs and problems that never seem to go away all the way.

I want to take a moment now to look at why this is. What makes one codebase such a pleasure to work with, and another such a horrible pain. Let's think about the human psyche, where we came from and who we are for the world, let's get metaphysical for a moment. When you open up a good novel and dive in, what is it that is engaging? When you look at a page of mathematics, unless you have a PhD in Math, why does it occurs as noise? All this points to the first trait of a good codebase:

It tells a story.

Open up the sourcecode to your project, and see, what is the story it's telling say? Can you tell? The statistics suggest that the average developer spends 10x as much time reading code in a day than they spent writing it. If your codebase isn't telling a compelling story, and doing so in a way a good book does, you've probably got a pile of frustrated coders on your hands. When you open up a class or script and your brain fires off in horror "Oh my god, who wrote this?!", or "Oh my god, what does this even do?!", you know you might have a problem.

I'm not an English major, my wife holds that distinction in our family, but I know that a good story has compelling characters, solid plot, a place it starts, a clear direction, and a place it ends up. The best stories might be surprising or insightful or emotional; but they are all engaging and compelling.

If your codebase doesn't tell a compelling story, there's a pretty good chance that your product doesn't either, and that your company doesn't either. Conway's law says that companies write applications with the same structure as their organization. If your code base looks a certain way, it might be an indication of your organizational culture, and, that might also be something you want to look at.

If you look at the Clean Code book by Robert C Martin, you'll see that code that has clearly distinguished levels of abstract will have a mixture of fuction or method types. There will be methods that talk almost in English: return userDataAccess.fetchUsers() map (getPersonalData andThen tokenize). A non-developer can, with just a little explanation of what what "map" does, fully understand what this accomplishes! This function tells a story. If you don't have functions or methods in your code that looks like this, this is a strong symptom of failing to have appropriate levels of abstraction. It also means that you like have a great deal of copy/paste in your system going on and that refactoring anything is going to result in you finding places where the same operation is performed with slight variations that were never normalize.

Okay, so you've realized that what you have on your hands is a dry math paper, and not a novel. What can you do about it?

Cure for code that doesn't tell a story: normalize the heck out of it.

Go through, start seeing where there are services present. like userDataAccess.fetchUsers. If what happens all over the place is raw queries to your datastore, you'll benefit from normalizing this into a service component. Normalize with a passion, normalize with vigor. You'll see the size of your codebase shrinking and shrinking. You'll start to see the story of your system emerge. And, you'll start to see productivity rise. If you didn't have any before, you'll be able to write tests now. If you had tests before, you'll start seeing them simplify greatly. You'll start to see developers actually want to write then because it enables them to develop new features faster. Incidently, this is also one antidote for tangling and scattering, another common problem.

That's enough pontificating for one morning, I'll come back and write a part two, where, I'll talk more about tangling and scattering, and also abstraction versus simplification.

Technology... The Madness of MySQL

2018-04-10T13:17:00.004-07:00

Back working for a start-up again. This brings all the pluses and minuses as per usual. Crazy hours sometimes, fun projects, more control, less process.

It also brings something else. MySQL.

Anyone who knows me, knows just how much visceral hate I have for this "database". In the last 48 hours, I've learnt two more fun facts to add to the list of solid reasons why to skip MySQL in favor of better solutions:

1. Nullable false leads to an implicit default.

We use Liquibase to version our database. It's good. But, it means that generally I find myself writing schema in XML format, not in SQL format. Creating a new column on a table is easy enough, and there's even a handy XML block for constraints. I copied the definition from another column definition from before, which included a nullable=false constraint, which on first glance, seemed appropriate enough.

I run the update, and wonder what's taking so long...

Turns out if you specify nullable="false", MySQL will use a default value, all databases would, that's perfectly sensible. The thing that's not sensible is is that in this case, I didn't specify one! So MySQL, instead of throwing an error, telling my schema change is invalid and missing something, just goes ahead and implies a default. Stop implying things MySQL, you're guessing what I mean, not doing what I say. This is generally a bad thing for software to do. And it implied I meant that I did in fact want a default value and that default value should be 0 for a bigint column. Not a bad assumption necessarily, but, an assumption nonetheless; and to assume, makes an ass out of u and me, but in this case, mostly just me. This of course then expended a lot of CPU resources to apply as this particular table has a great many rows. No problem, kill the session and remove the constraint.

Which leads us swiftly to number 2...

2. Adding a column to a table in MySQL requires a full table update, even in InnoDB.

o_o. o_O. O_O.

I think in 2018, every single other database engine handles this correctly. It's a meta-data change. Not so in MySQL. Even with no default value, the database engine insists on rebuilding the entire database store.

So if you're an enterprise with a large(ish) table, and you need a new column. Downtime will be required. Downtime in 2018 is not what users have come to expect. Downtime is never acceptable to users.

MySQL. I didn't think I would discover new reasons to despise you for being a pile of poorly implemented not really ACID compliant unhelpful non-SQL standard compliant database. I was unpleasantly surprised.

Law of Demeter and perhaps something more strict that isn't quite

2016-08-23T09:28:00.002-07:00

Looking at a piece of code today and thinking about the consequences of discovering things about objects a method is passed.

If I have a method that is responsible for performing a mapping, let's call it from type A to type B so

A -> B

or to use a more Scala-ish syntax:

f(A): B

then I might argue that the method f, should not attempt to enhance in any way the object of type A. If the properties that require the construction of B are not immediately present in A as per the law of Demeter, or if we recast the system slightly such that A may represent a composite, the set of objects represented by A, then there should be an intermediary method that gathers the required information for the mapping operation and creates an enhanced context. This would be a separation of concerns, one being the enhancement of the object of type A, and the other the operation of generating a B.

A -> B then expands to A -> C -> B

where C is the set of information required to construct B, so we might get two methods:

constructB(C): B so that our type arrow is C -> B

and enhance(A): C so that the type arrow is A -> C

This means that should somebody construct logic that does some kind of side-effecting operation in enhance(), that operation can be isolated from the mapping.

This means that when we look at a mapping function, we can say it should not contain ANY additional type arrows within. It should only access and map properties from the composite C to create an object of type B. Any additional mapping or derivation that occurs within, or the processing of a type arrow breaks separation of concerns.

This feels pretty strict, but I'm looking at code today where if that rule had been followed, a very nasty side-effecting piece of code that was buried several levels deep in an abstraction would never have been permitted!

Java 8 - Exploring FunctionalInterface

2015-08-20T10:15:00.001-07:00

A few days ago I posted a highly frustrated post on Facebook about Java having Lambdas, but not having any Try<> mechanism meaning that in most cases, you're left declaring a try block inside a lambda. Turns out there's a different way to approach this that gives a different resolution.

Say you're a Scala person like me, and have discovered that checked exceptions actually are more of a pain than they're worth, and believe they actually break SOLID engineering principles, particularly the part about encapsulation. When first exploring Java 8, it seemed to me that the lack of a Try<> type was pretty bad news. I still think Try<> would be useful, but there is at least a way to get around having a very ugly try/catch block inside a lambda.

So Java, I take it back - you've done something weird, but cool. Turns out you don't need to worry about Function<> specifically; any interface that declares only a single method is functional, and will be eligible for syntax magic. (Though I don't like magic, it's at least traceable magic). It's perfectly valid to declare:

@FunctionalInterface
public interface ExceptionalFunction<A, B> {
B f(A a) throws Exception;
default B apply(A a) {
try { return f(a); }
catch (Exception e) { throw new RuntimeException(e); }
}
}

and then your call that uses thusly:

public <T> T withMyThing(ExceptionalFunction<MyThing, T> f) {
f(fetchMyThing);
}

and then

withMyThing(x -> isAwesome(x));

or because apparently you can:

withMyThing(this::isAwesome(x));

This means that if isAwesome() throws a checked exception, our wrapper will capture it and it will be suppressed down to a runtime exception. I'm not going to debate the merits of that here, only to say that here be dragons, and that probably breaks expected behavior in many situations, but, at the same time can be pretty useful too, particularly in Test Suites, where exceptional behavior is either being explicitly elicited, or explicitly checked against. Though I supposed that if you're eliciting it, getting back a RuntimeException containing the expected might break the test case... like I said, here be dragons.

Though we might have noticed that now apparently interfaces in Java can have method bodies... Uh wut? This is doesn't seem any different to me than having say:

@FunctionalInterface
public abstract class ExceptionalFunction<A, B> {
public abstract B f(A a) throws Exception;
public <B> B apply(A a) {
try { return f(a); }
catch (Exception e) { throw new RuntimeException(e); }
}
}

I suppose it does have the syntactic implication, that the function you're declaring could be something other than public, which in a function interface context wouldn't make sense, but perhaps that should be a compiler error rather than changing what an "Interface" fundamentally means in Java?

So be here yea forewarned: Interfaces in Java 8 may have method bodies!!

Software Development -- Why we need software engineers

2014-07-09T09:57:00.000-07:00

I frequently see posts by some folks talking about how software development is the realm of the elite and erudite, requiring much training and arcane knowledge. Somehow, these people seem to think this is a bad thing, or an odd thing that is in need of remediation. It doesn't. And there are many simple analogies that can help demonstrate why.

The most basic one goes as follows. As a business owner, you have an office probably, or a place of business. This place has electrical systems, and plumbing, and a building with doors and a roof and other such infrastructure. If your sink gets clogged, you might pour some drano down it and try to unclog it. Anything beyond that - you call a plumber. If a roof leaks, you call a roofer. If your door needs fixing, you call a carpenter. Without training, you can't just make a new door, or install a new sink, or do a major roof repair. These things require skill and training to do well, and in some cases even at all. You don't know which tools you need, and you certainly don't own them. As a business person, you also understand that your time is much better invested in doing things you are an expert at doing, and using that revenue to pay someone who is an expert in plumbing to do that rather than wasting ten times the effort of our own time trying to do it yourself. It's called delegation, and whilst I'm no Harvard MBA, I'm pretty sure it's a major component in how you make a business successful. What on earth would posses you to believe that given you have zero training and knowledge of computer systems, that you could whip up a piece of software that was anything more than the software equivalent of drano down a sink, or duct tape on a pipe?! Or to the people writing these kinds of articles, what on earth would posses you to think that what you do is any less difficult than what many professionals do, and is something that any ordinary Joe could manage with little or no training? Software deals with many highly complex interlocking problems and requires training and specialized knowledge to do, and certainly to do well. You have to know about the tools that exist, and you have to know how to use them. For a software engineer who does this full time and then some, the range of tools and techniques available is dizzying, far more than is available to a carpenter or a plumber. The notion that a regular Joe should be able to write software, of any kind, anything really beyond basic spreadsheet macros is ridiculous. For decades, people have been trying to make these magical 4GLs that allow you to plug big blocks of awesome together to make systems that get things done. To date, they have all failed to one degree or another.

I believe that most people, even programmers often have little idea about what we actually do every day. What we do everyday is mostly not engineering. That's right, I'm a programmer, and what I do everyday is mostly NOT engineering. There is an engineering component, but I believe that most of what I do every day is philosophy.

Programming is a discipline that is more akin to philosophy that it is to engineering, and I believe that this is often true for higher mathematics and higher physics also, though I'm not in those fields, so I may be wrong.

What we do everyday is reason about the universe around us, and then turn it into a model that the computer executes. Most of the programming we do is essentially a simulacrum. It's a mirror of the real word, defined by us, the philosopher and executed by a machine, the computer. How good our software is, is typically far more to do with how well we can reason about our problem domain and far less about how good an engineer we are. Any programmer who has seem code written by Math people has an idea about this. From an engineering perspective, the code is often pretty terrible, but the things it can do are amazing, and us lowly engineers often don't have the mathematical knowledge and understanding to reason about it, and thusly can't understand it. The mathematicians software is brilliant because she knew a language with which to describe the universe that was very advanced, and could precisely describe that universe back to a machine to do interesting things with it.

Why do you think people do postgraduate work in science on a thing called a PhD, it's not a Doctor of Science, it's a Doctor of Philosophy.

To describe our model of the universe to the computer, we need engineering, and that is honestly the part that is pretty tedious after you've been around the wheel a few times. The languages we user are arcane because a computer has to know precisely what to do. There can be no ambiguity to a computer, and when there are ambiguities in our code, that's one major cause of bugs. Humans are very bad at describing things precisely, but it's the only way a computer can function and so we have a disjoin. We also have a disjoin because our model of the universe is typically very small, and only represents a very very small part of the universe. And sometimes the universe we're modeling is centered around things that only exist as constructs in human conception, like money. They don't have solid rules and have weird exception cases and often don't make a lot of sense from a purely logical perspective. They are models of human behavior, which humans who study psychology and neurology don't even fully understand, so what chance as a software developer do we have, trying to describe it for a computer to understand?! The raw fact of the matter is that that model is incomplete and imperfect. No amount of hand-wringing is ever going to make it less than that because we ourselves don't have a perfect model of the universe, and certainly not of human behavior, all we have are approximations.

I also am coming to believe that engineering, the actual implementation of our models is a secondary concern to the philosophy. That is to say regardless of what languages and methodologies we use, if our fundamental model of the universe, the shapes of the pieces of our simulacrum we defined in our mind are a poor fit, no amount of brilliant engineering will make the software good. The most elegant code that doesn't serve the user is ultimately not going to be useful and have a long life beyond being a teaching aid. The most ugly code that serves the user well, is going to have a long life, even if it is the bane of the engineers maintaining it, and we've all worked on that kind of code in our career. That legacy app that is just so critical to the business that it can't be left to die, despite it being the most horrible piece of code ever conceived? That little gem, is a wonder of philosophy, but a travesty of engineering, and guess what, it's value as a model as a machine as a tool for the business outweighs it's ugliness as a work of engineering.

So I would call programmers to be philosophers and historians first, and engineers second. Take the time to gain a better understanding of the world around us first, learn about what other programmers have done before you and the challenges they faced second, and then be a syntax god and engineer third. And most of all, what we do, day in and day out is hard. It's part philosophy, part art, part music, part mathematics and part engineering. To be good at programming, you have to be at least capable in all of those disciplines, and whilst engineering might be third, it's still really important. Not many people have the capacity to be cross-disciplined to that level, programmers have to be, and that quite frankly, makes us pretty special, and that's okay!

Subcut with GlobalSettings and Filters in the Play Framework

2014-04-05T14:53:00.001-07:00

Quick post on an issue with Filters when using the Subcut template from activator. If you use the Subcut template from TypeSafe Activator, and decide you want to use Filters on your Global object, you will likely run into this issue:

java.lang.ClassCastException: Global cannot be cast to play.GlobalSettings

or you will see that you can't call super.doFilter(next) as suggested in the documentation.

This is because there are two GlobalSettings classes in Play, one which is a Java class, and the other that is the Scala trait. In most scala applications you will be extending a Scala trait. The Subcut template uses the Java class, and that's because the Subcut template relies on your Global being an instance, rather than a companion object, and the Scala GlobalSettings trait only works on an object, not an instance.

You can fix this problem by extending from both of these classes:

class Global extends GlobalSettings with play.api.GlobalSettings {

  override def doFilter(next: EssentialAction): EssentialAction = {
    Filters(super.doFilter(next), LoggingFilter)
  }

Thank goodness for traits!

Raising Option[T] to Try[T] - Handling error situations with aplomb

2014-04-03T00:49:00.001-07:00

I've started using this pattern for dealing with situation where functions I'm working with may return an Option[T] for a given query, but the contextual meaning of returning None is really an error case. A good example of this is looking up an Item in a database by it's ID based on a shopping cart. If the application receives an item ID during a shopping cart process of an item that doesn't exist in the DB, then returning None on the DAO access is fine, but the upshot is an error condition. The application has received bad data somewhere along the way, and this should be manifested as an Exception state, which I'm choosing to encapsulate in a Try[T] so I can pass it cleanly up the stack rather than violating SOLID by throwing an exception, which I know is a subject of some debate.

To help with this, I wrote a simple wrapper class that I've called MonadHelper thusly:

object MonadUtil {
  implicit def option2wrapper[T](original: Option[T]) = new OptionWrapper(original)

  class OptionWrapper[T](original: Option[T]) {
    def asTry(throwableOnNone: Throwable) = original match {
      case None => Failure(throwableOnNone)
      case Some(v) => Success(v)
    }
  }
}

This allows one to construct a for comprehension elevating None returns to an error state somewhat gracefully like this slightly contrived example:

case class CartItemComposite(account: Tables.AccountRow, item: Item)

trait AccountDAO {
  def findById(userId: Long): Option[Tables.AccountRow]
}
trait ItemDAO {
  def findById(itemId: Long): Option[Item]
}

def findShoppingCartItem(itemId: Long, userId: Long)(userDAO: AccountDAO, itemDAO: ItemDAO): Try[CartItemComposite] = {
  for {
    user <- userDAO.findById(userId).asTry(new Throwable("Failed to find user for id " + userId))
    item <- itemDAO.findById(itemId).asTry(new Throwable("Failed to find item for id " + itemId))
  } yield CartItemComposite(user, item)
}

But you get the idea. You can check a set of conditions for validity, giving appropriate error feedback at each step along the way instead of losing the error meaning as you would with simple Option[T] monads in a way that looks less than insane.

Don't know if this is a great pattern yet, but, I'm giving it a whirl!

SimpleDateFormat sillyness

2013-11-06T10:07:00.001-08:00

Ran into an amazingly dumb bug yesterday. I would say that this is clearly a bug in the behavior of SimpleDateFormat in Java. Why is it that when I give it a date that looks valid, and a format string that's not right, and in fact contains invalid numbers, it will go ahead and parse my date string producing a ridiculous result. But not so ridiculous it's obvious, or in fact, just throws an Exception, which for my money would be the desired outcome.

So this is the scenario. Parsing a date like this:

val dateStr = "2013-02-04 05:35:24.693 GMT"

with the date parsing string:

val dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:MM:ss.SSS z")

If you're paying very close attention, you will see the problem here; the month component is repeated. This yields the following date in the resulting Date object: "Tue Nov 03 21:00:24 PST 2015"

This result is clearly very different than what was sent in. I see two problems here. The date string contained two references to the same field. I can see where sometimes this might be useful, but honestly, I feel like you should have to set a flag or something for this to be silently ignored. In most cases having a reference to the same part of a date within a format string seems like a warning flag at least. The second problem is that the erroneous value for month that was give is beyond the scope of a calendar year. You can't have 35 months in a year. In my opinion this should have thrown an exception. I understand that potentially in some calendar system somewhere on this earth there maybe more than 35 'months' in a year or something, but this is very unexpected behavior, way outside of what I would considered normal.

In short, if you have a date string that is being parsed and coming out the other end with a very strange unexpected and wrong result, there's a good chance the format string is off, and probably only very slightly and in a way that's hard to spot without a very very careful eye.

On folding, reducing and mapping

2013-06-10T09:15:00.000-07:00

I haven't updated in a while, but it's time to get back into doing this blog.

Today I want to do a brief little thing on folding, reducing and mapping. In the past, I've found myself doing things like this when building a SQL query:

val k = Map("name" -> "John", "address" -> "123 Main St", "city" -> "Los Angeles")

(k.foldLeft("")((a,b) => a + " and " + b._1 + " = ?")).drop(6)

Which on the face of it doesn't seem so bad, might have done something not that disimilar using a for loop in Java. The thing about this is that it's actually kind of stupid. There is a more idiomatic way to do this. When you think about it, the initial operation is actually a matter of mapping the input list to a set of string, and then joining those strings with the and clause. In Scala, the reduce operation essentially does what join does in the new Java, or other languages, but with any list. When we think about it this way, we can therefore do this:

val k = Map("name" -> "John", "address" -> "123 Main St", "city" -> "Los Angeles")

k.map(x = > x._1 + " = ?").reduce(_+" and "+_)

or:

k.map(x = > x._1 + " = ?").mkString(" and ")

Ultimately ending up as something like:

val query: PreparedStatement = buildStatement(k.map(x = > x._1 + " = ?").mkString(" and "))
k.zipWithIndex.foreach(x => query.setObject(x._1, x._2))

Much cleaner, and more idiomatic. Some might say this is obvious, but, it wasn't obvious to me immediately, and so I'm gonna guess it's not to others either!

Database Record Updates with Slick in Scala (and Play)

2013-01-03T01:02:00.001-08:00

This is a simple operation that I found absolutely zero reference to in the documentation or the tutorials or the slides. Eventually after digging through old mailing lists, I came across the solution:

        
(for { m <- MessageQ if m.id === oldMessage.id } yield(m))
  .mutate(r=>(r.row = newMessage))(session)

This is for a simple message class and a function that takes two arguments: oldMessage and newMessage. The frustrating thing is that this is inconsistent with the simple formula for a single column update:

MessageQ.filter(_.id === 1234L).map(_.subject)
  .update("A new subject")(session)

When you try to apply this thinking to an update, you end up at a dead end. The mutate operator is also used for deletion too:

MessageQ.filter(_.id === 1234L)
  .mutate(_.delete())(session)

Note that you can typically leave out the session argument as it's declared implicit within the appropriate scope. I'm also switching between syntax alternates because for some reason, either my IDE or the compiler gets grumpy when I try to use the filter() style rather than the list comprehension style in certain contexts. I still have to figure that out.

I'd like to write a longer post later at some point, but this at least covers the highlights.

Collect Chains for Creation - Sublime or Stupid?

2012-12-17T22:48:00.000-08:00

I'm working on a piece of code to deserialize Facebook messages, and I'm looking at the mess I wrote to do this, and it occurred to me that I could do it as a progressive collect chain. Each time we know something new, we collect on it, or pass-thru if it's a false condition.

def makeNewMessage(id: Long)(json: JsValue)(facebookClient: FacebookClient)(session: scala.slick.session.Session): Option[Message] = {
  (((json \ "message").asOpt[String], (json \ "created_time").asOpt[String], (json \ "id").asOpt[String]) match {
    case ((Some(message), Some(createdTime), Some(fbId))) => {
      Some((userId: Long) => (likes: Int) => Message(
        id = id,
        subject = None,
        systemUserId = Some(userId),
        message = Some(message),
        createdTimestamp = Some(new sql.Date(facebookDateParser.parse(createdTime).getTime)),
        facebookId = Some(fbId),
        upVotes = Some(likes)
      ))
    }
  }).collect({ case f => {
    scaffoldFacebookUser(json)(facebookClient)(session) match {
      case Right(scaffoldedId: Int) => f(scaffoldedId)
    }
  }}).collect({ case f => f(((json \ "likes").asOpt[Int], (json \\ "likes").size) match {
    case (Some(likes), 0) => likes
    case (_, likes) => likes
  })})
}

The result of the first piece is a 2-ary function that may get populated with a userId and a like count. Once we figure out if we can build a user, we populate that, then we figure out the like count. If at any point the chain fails, it just returns None.

This all feels very imperative to me, perhaps I'm just tired, and it will come to me later. I swear I remember better ways of doing this using unapply magic, but I can't seem to figure it out, so this is where I'm going right now!

Head shaking moments - an ongoing Saga

2012-10-24T13:43:00.000-07:00

I think I might start to keep track of examples of bad code that are right out there in the public view. Some of these examples are the language tutorials themselves even!

Today I'm gonna single out the object example in the CoffeScript guide. In this guide we get the Animal class example:

class Animal
  constructor: (@name) ->

  move: (meters) ->
    alert @name + " moved #{meters}m."

class Snake extends Animal
  move: ->
    alert "Slithering..."
    super 5

class Horse extends Animal
  move: ->
    alert "Galloping..."
    super 45

sam = new Snake "Sammy the Python"
tom = new Horse "Tommy the Palomino"

sam.move()
tom.move()

This code is in violation of the Call Super code smell. We finally get half decent classes in Javascript using CoffeeScript, and this is the first example given - one that is considered by some to be an anti-pattern. Below I'm going to feature an attempt to refactor out this smell.

Updated Code

class Animal
  constructor: (@name) ->

  move: (meters = @distance()) ->
    alert @movingMessage()
    alert @name + " moved #{meters}m."

  movingMessage: -> "Moving..."
  distance: -> 10


class Snake extends Animal
  movingMessage: -> "Slithering..."
  distance: -> 5

class Horse extends Animal
  movingMessage: -> "Galloping..."
  distance: -> 45

sam = new Snake "Sammy the Python"
tom = new Horse "Tommy the Palomino"

sam.move()
tom.move()

Delegate methods are a much better use for an inheritance contract than methods that override a super to provide essentially the same behavior, only with different parameters. I would argue that the second version is substantially clearer as each method does precisely one thing, including the move method of Animal.

Dealing with Annoying JSON

2012-08-26T22:28:00.000-07:00

How often do you have to work with an API that supplies badly formatted output? It's fine if you're a weakly typed mushy language like Javascript that doesn't care until it has to, but for those of us who like something a little more structured, and a little more performant, it presents a challenge. I'm gonna look at a way to deal with annoying JSON like this in Scala. The most recent one I'm running into is a field that may come back as a string, or may come back as a list.

For JSON like this, Jackson provides us with a way to cope with this. The solution doesn't seem to work well with case classes, and seems to require a good deal more annotations that it should, but it does get the job done in a none too egregious way.
Here's a sample of bad JSON:

[{
  "startDate" : "2010-01-01",
  "city" : "Las Vegas",
  "channel": "Alpha"
},{
  "startDate" : "2010-02-01",
  "city": "Tucson",
  "channel": ["Alpha","Beta"]
}]

You can see that in the first element, the field 'channel' is supplied as a string, and in the second, it's now a list. If you set the type of your field to List[String] in Scala, it will throw an error when deserializing a plain String rather than just converting it to a single element list. I understand why it's a good idea for deserialization to do this, but really, if you're using JSON, then schema compliance probably isn't at the top of the list of requirements.

You can deal with this using the JsonAnySetter annotation. Unfortunately, once you use this, it seems all hell breaks loose and you must then use JsonProperty on everything and it's brother. The method that you defined annotation by JsonAnySetter will accept two arguments that function as a key value pair. The key and value will be typed appropriately, so the key is always a String, and the value will be whatever type deserialization found most appropriate. In this case, it will be a String or an java.util.ArrayList. We can disambiguate these types with a case match construct, which for this seems perfect:

@BeanInfo
class Data(@BeanProperty @JsonProperty("startDate") var startDate: String,
  @BeanProperty @JsonProperty("city") var city: String,
  @BeanProperty @JsonIgnore("channel") var channel: List[String]) {

  // No argument constructor more or less needed for Jackson
  def Data() = this("", "", Nil)

  @JsonAnySetter
  def setter(key: String, value: Any) = {
    key match {
      case "channel" => {
        value match {
          case s: String => { channel = List(s) }
          case l: java.util.ArrayList[String] => {
            channel = Range(0,l.size()).map(l.get(_)).toList
          }
          case _ => { // No-op if you're ignoring it, or exception if not
          }
        }
      }
    }
  }
}

Now when the bad JSON gets passed into deserialization it will get mapped more smartly than it was generated, and we win!

I might have a poke at it to see if I can get it working with less additional annotation crazy too.

Testing with FluentLenium and Scala

2012-08-21T14:00:00.000-07:00

I posted some time ago about Browser based testing in Scala using Cucumber leveraging JUnit and Selenium. That mechanism is pretty complicated, and there seems a much better way of doing it. The FluentLenium library gives a good way to integrate browser based testing into Scala. There are still some challenges with continuous integration that have to be solved, and I'll talk about that later.

What does a FluentLenium test case look like with this system? Here's a simple example that opens a home page and clicks on a link:

class HomePageSpec extends FunSuite with ShouldMatchers {

  test("Visit the links page") {
    withBrowser {
      startAt(baseUrl) then click("a#linksPage") then
        assertTitle("A List of Awesome Links") then testFooter
    }
  }
}

And we can fill in a form and submit it like this:

class RegistrationSpec extends FunSuite with ShouldMatchers {
  val testUser = "ciTestUser"

  test("Creating a fake user account") {
    withBrowser {
      startAt(baseUrl) then click("a#registerMe") then
        formText("#firstName", "test") then
        formText("#lastName", "user") then
        formText("#username", testUser) then
        formText("#email", "test.user@example.com") then
        formText("#password", "123") then
        formText("#verify", "123") then
        hangAround(500) then click("#registerSubmit")
    }
  }
}

Much of what you see above isn't out of the box functionality with FluentLenium. Scala gives us the power to create simple DSLs to provide very powerful functionality that is easy to read and easy write. People often don't like writing tests, and Scala is a language that is still somewhat obscure. A DSL like this makes it trivial for any developer, even one who is totally unfamiliar with Scala to construct effective browser-based tests.

Now I'm going to delve into some of the specifics of how this is constructed! (example code can be found at: git://gitorious.org/technology-madness-examples/technology-madness-examples.git)

The first piece is the basic configuration for such a project. I'm using the play project template to start with as it offers some basic helper functionality that's pretty handy. The first thing to do is create a bare play project

play create fluentlenium-example

I personally prefer ScalaTest to the built-in test mechanism in play, and the fluentlenium dependencies are needed, so the project's Build.scala gets updated with the following changes:

val appDependencies = Seq(
    "org.scalatest" %% "scalatest" % "1.6.1" % "test",
    "org.fluentlenium" % "fluentlenium-core" % "0.6.0",
    "org.fluentlenium" % "fluentlenium-festassert" % "0.6.0"
)

val main = PlayProject(appName, appVersion, appDependencies, mainLang = JAVA).settings(
// Add your own project settings here
  testOptions in Test := Nil
)

Now for the main test constructs. A wrapper object is constructed to allow us to chain function calls, and that object is instantiated with the function startAt():
case class BrowserTestWrapper(fl: List[TestBrowser => Unit]) extends Function1[TestBrowser, Unit] { def apply(browser: TestBrowser) { fl.foreach(x => x(browser)) } def then(f: TestBrowser => Unit): BrowserTestWrapper = { BrowserTestWrapper(fl :+ f) } }
This object is the container if you will for a list of test predicates that will execute once the test has been constructed. It is essentially a wrapped list of functions which we can see from the type List[TestBrowser => Unit]. Each test function doesn't have a return value because it's using the test systems built-in assertion system and therefore doesn't return anything useful. When this object is executed as a function, it simply runs through it's contained list and executed the tests against the browser object that is passed in.

The special sauce here is the then() method. This method takes in a new function, and builds a new BrowserTestWrapper instance with the currently list plus the new function. Each piece of the test chain simply creates a new Wrapper object!

Now we add a few helper functions in the companion object:

object BrowserTestWrapper {
  def startAt(url: String): BrowserTestWrapper = {
    BrowserTestWrapper(List({browser => browser.goTo(url)}, hangAround(5000)))
  }

  def hangAround(t: Long)(browser: TestBrowser = null) {
    println("hanging around")
    Thread.sleep(t)
  }


  def click(selector:String, index: Int = 0)(browser:TestBrowser) {
    waitForSelector(selector, browser)
    browser.$(selector).get(index).click()
  }


  def formText(selector: String, text: String)(browser: TestBrowser) {
    waitForSelector(selector, browser)
    browser.$(selector).text(text)
  }

  def waitForSelector(selector: String, browser: TestBrowser) {
    waitFor(3000, NonZeroPredicate(selector))(browser)
  }


  def waitFor(timeout: Long, predicate: WaitPredicate): TestBrowser => Unit = { implicit browser =>
    val startTime = new Date().getTime

    while(!predicate(browser) && new Date().getTime < (startTime + timeout)) {
      hangAround(100)()
    }
  }
}

sealed trait WaitPredicate extends Function[TestBrowser, Boolean] {
}

case class NonZeroPredicate(selector: String) extends WaitPredicate {
  override def apply(browser: TestBrowser) = browser.$(selector).size() !=0
}

This gives us the basic pieces for the test chain itself. Now we need to define the withBrowser function so that the test chain gets executed:

object WebDriverFactory {
  def withBrowser(t: BrowserTestWrapper) {
    val browser = TestBrowser(getDriver)
    try {
      t(browser)
    }
    catch {
      case e: Exception => {
        browser.takeScreenShot(System.getProperty("user.home")+"/fail-shot-"+("%tF".format(new Date())+".png"))
        throw e
      }
    }
    browser.quit()
  }

  def getDriver = {
      (getDriverFromSimpleName orElse defaultDriver orElse failDriver)(System.getProperty("driverName"))
  }

  def baseUrl = {
    Option[String](System.getProperty("baseUrl")).getOrElse("http://www.mysite.com").reverse.dropWhile(_=='/').reverse + "/"
  }

  val defaultDriver: PartialFunction[String, WebDriver] = {
    case null => internetExplorerDriver
  }

  val failDriver: PartialFunction[String, WebDriver] = { case x = > throw new RuntimeException("Unknown browser driver specified: " +  x) }

  val getDriverFromSimpleName: PartialFunction[String, WebDriver] = {
    case "Firefox" => firefoxDriver
    case "InternetExplorer" => internetExplorerDriver
  }

  def firefoxDriver = new FirefoxDriver()

  def internetExplorerDriver = new InternetExplorerDriver()
}

This gives us just about all the constructs we need to run a browser driven test. I'll leave the implementation of assertTitle() and some of the other test functions up to the reader.

Once we have this structure, we can run browser tests from our local system, but it doesn't dovetail easily with a Continuous Integration server. As I write this, my CI of choice doesn't have an SBT plugin, so, I have to go a different route. Pick your poison as you may, mine is Maven, so I create a Maven pom file for the CI to execute that looks something like this:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>fluentlenium-tests</artifactId>
  <version>1.0.0</version>
  <inceptionYear>2012</inceptionYear>
  <packaging>war</packaging>
  <properties>
    <scala.version>2.9.1</scala.version>
  </properties>

  <repositories>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
    <repository>
      <id>typesafe</id>
      <name>typesafe-releases</name>
      <url>http://repo.typesafe.com/typesafe/repo</url>
    </repository>
    <repository>
      <id>codahale</id>
      <name>Codahale Repository</name>
      <url>http://repo.codahale.com</url>
    </repository>
  </repositories>

  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>

  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.4</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.scalatest</groupId>
      <artifactId>scalatest_${scala.version}</artifactId>
      <version>1.8</version>
    </dependency>

    <dependency>
      <groupId>org.fluentlenium</groupId>
      <artifactId>fluentlenium-core</artifactId>
      <version>0.7.2</version>
    </dependency>
    <dependency>
      <groupId>org.fluentlenium</groupId>
      <artifactId>fluentlenium-festassert</artifactId>
      <version>0.7.2</version>
    </dependency>
    <dependency>
      <groupId>play</groupId>
      <artifactId>play_${scala.version}</artifactId>
      <version>2.0.3</version>
    </dependency>
    <dependency>
      <groupId>play</groupId>
      <artifactId>play-test_${scala.version}</artifactId>
      <version>2.0.3</version>
    </dependency>
    <dependency>
      <groupId>org.scala-tools.testing</groupId>
      <artifactId>specs_${scala.version}</artifactId>
      <version>1.6.9</version>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>app</sourceDirectory>
    <testSourceDirectory>test</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <configuration>
          <source>1.6</source>
          <target>1.6</target>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
          <args>
            <arg>-target:jvm-1.5</arg>
          </args>
        </configuration>
      </plugin>

      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <configuration>
          <argLine>-DdriverName=Firefox</argLine>
          <includes>
            <include>**/*Spec.class</include>
          </includes>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <reporting>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
        </configuration>
      </plugin>
    </plugins>
  </reporting>
</project>

You might notice that the above Maven configuration uses JUnit to execute out Spec tests. This doesn't happen by default, as JUnit doesn't pick up those classes, so we have to add an annotation at the head of the class to signal JUnit to pick up the test:

@RunWith(classOf[JUnitRunner]) class HomePageSpec extends FunSuite with ShouldMatchers { ... }

Data processing, procedural, functional, parallelism and being Captain Obvious

2012-07-09T16:29:00.003-07:00

I'm not going to tell you anything you don't already know in this post. I might however manage to aggregate some things your already knew into one place and make them dance to a slightly new tune.

At the heart of this post is somewhat of an epiphany I had today. It has to do with how code is written to do data processing. This is a very common task for programmers, perhaps one that is in fact ubiquitous.

Ultimately data processing almost always looks something like this:

You load some stuff, parse it, transform it, filter it and output it. Those things may happen in different orders, but ultimately, something like that.

One of the things you already know is that the implementation of this should look like a production line. Read a datum in, send it through the processing process, rinse repeat, batch as need be for efficiency.

The amazing thing is that when you look at implementations, they often end up looking like this:

Code is written that loads the entire set into memory as a list of objects, which then pass through some methods who change that list of objects, either by transforming the objects themselves, or worse, copying the list in entirety to another list of different objects, filtering the list in situ, then saving the whole lot out. These programs end up requiring some amount RAM at least as large as the data itself as a result. Everybody knows this is a bad way to do things, so why do people keep writing code that looks like this?

We all know it should look more like

I think this is perhaps the tune to which many have not given thought to. The problem just pops up, and people start scrambling to fix it, trying to dance triple time to the beat of the old drum. I believe one significant cause maybe time and effort. Data processing code often starts life as a humble script before it morphs into something bigger. Most scripts are written in procedural languages. In these environments, parallelization and streaming are more complicated to write than loading in the whole file and passing it around as an object, so people default to the latter. Why write a mass of code dealing with thread pools and batching when you don't have to? (I know there are libraries and frameworks, but often, people don't know them, or don't have enough time to use them).
This problem is easy to solve in a language where functions are first order values. For each flow step, you define a function to perform that operation. Not any different than procedural. Instead of the function taking a value as an input and returning a new value, our functional variant instead returns the function which is the transformation, that being a function taking an object an returning one. The flow can then be defined as a function that executes a list of transform functions, which can itself be a function that returns a function which takes an object and returns an object. Now we can apply that flow to any object, single, multiple or otherwise very easily, as the flow itself is now just a value.
In Scala, you have streaming sequences, so it becomes as easy as:

io.Source.fromUrl("http://www.example.com/foo.data").flatMap(myFlow(_)).foreach(output(_))

In Scala, there are some helpers that can then apportion and parallelize this quite easily, which I talked about in my previous post. As we now have a process as our primary value, instead of a chunk of data as our primary value, parallelization becomes much easier, passing our processing function around between threads is far easier than coping with a big chunk of mutable data being shared about.

You can implement this pattern in Java, or C++ or Perl, but most people have to stop and think to do so, the languages doesn't give it to you for free. In functional programming, from what I'm learning, this is a very common pattern. In fairness, it's a common pattern in Java too, but many folks don't ever think of it as a default choice until it's already too late.

Logging and Debugging

2012-07-02T13:00:00.000-07:00

I'm finding one of the biggest challenges working with Scala is debugging, and secondarily logging. The former seems to be a tooling issue as much as anything, and to be honest, the latter is a matter of my putting time in to figuring it out.

With debugging, break points in the middle of highly condensed list comprehensions are very very hard to make. I end up mangling the code with assignments and code blocks that I then have to re-condense later.

I've attached a debugger using the usual jdwp method, but it slows everything down so badly, and it's just not that much better than print statements. I've been going through the Koans with a new employee at work, and it's been helping both of us greatly. There's one koan that describes a way to sort of "monkey patch" objects, and as much as I dislike that approach in general, it sure as heck beats Aspects which are hard to control and often fickle unless they are part of your daily routine.

I came up with a small monkey patch for the List class that lets me use inline log function calls to get basic information about the state of a list in the middle of a comprehension chain, so I include it here in the hopes that somebody will find it useful, or have some better ideas!

class ListLoggingWrapper[+T](val original: List[T]) {
  def log(msg: String): List[T] = {
    println(msg + " " + original.size)
    original
  }
  def logSelf(msg: String, truncateTo: Int = 4096): List[T] = {
    println(original.toString().take(truncateTo))
    original
  }
}

implicit def monkeyPatchIt[T](value: List[T]) = new ListLoggingWrapper[T](value)

This helpful snippet allows you to call a new method 'log' on a List object that prints out the List size, and similar with 'logSelf' which allows you to print out the result of toString, truncated (working with large lists means you always end up with pages of hard to pick through output if you don't truncate I've found).

A list comprehension chain ends up looking something like this:

Util.getJsonFilePaths(args(0)).map {
      x: String =>
        new File(x).listFiles().toList.log("File List Size").filter(file => {
          Character.isDigit(file.getName.charAt(0))
        }).map(_.getPath).filter(_.endsWith(".json")).log("Json File Count").flatMap {
          jsonFile =>
            io.Source.fromFile(jsonFile).getLines().toList.log("Line Count for " + jsonFile).map(line => Json.parse[List[Map[String, Object]]](line)).flatMap(x => x).log("Elements in file").logSelf("Elements are", 255).filter(jsonEntry => {
              jsonEntry.get("properties").get.asInstanceOf[java.util.Map[String, Object]].asScala.get("filterPropertyHere") match {
                case None => false
                case Some(value) if (value.toString == "0") => false
                case Some(value) if (value.toString == "1") => true
                case _ => false
              }
            }
            )
        }
    }

Which is a piece of code to aggregate data across multiple JSON files filtering by a given property using Jerkson (which I still feel like I'm missing something with as it seems harder than it should be).

Parallel Processing of File Data, Iterator groups and Sequences FTW!

2012-06-21T21:57:00.000-07:00

I have occasion to need to process very large files here and there. It seems that Scala is very good at this in general. There is a nice feature in the BufferedSource class that allows you to break up file parsing or processing into chunks so that parallelization can be achieved.

If you've tried the obvious solution, simply adding .par, the method isn't present. So, you might convert to a List with toList. When you convert like this, Scala will then compile all the lines into a List in memory before passing it on. If you have a large file, you'll quickly run out of memory and your process will crash with an OutOfMemoryException.

BufferedSource offers us another way to do this with the grouped() method call. You can pass a group size into the method call to break your stream into a sequence of lists. So, instead of just a String sequence made up of millions of entries, one for each line, you get an set of Iterators made up of Sequences with 10,000 lines in each. A BufferedSource is a kind of Iterator, and any kind of Iterator can be grouped in this way, Sequences or Lists included. Now you have a Sequence type with a finite element count which you can parallelize the processing on and increase throughput, and flatMap the results back together at the end.

The code looks something like this:

io.Source.stdin.getLines().grouped(10000).flatMap { y=>
      y.par.map({x: String =>
        LogParser.parseItem(x)
      })}.flatMap(x=>x).foreach({ x: LogRecord =>
         println(x.toString)
      })

So with this, we can read lines from stdin as a buffered source, and also parallelize without the need to hold the entire dataset in memory!

At the moment, there is no easy way to force Scala to increase the parallelization level beyond your CPU core count that I could get to work. This kind of I/O splitting wasn't what the parallelization operations had in mind as far as I know, it's more a job for Akka or similar. Fortunately, in Scala 2.10, we'll get Promises and Futures which will make this kind of thing much more powerful and give us more easy knobs and dials to turn on the concurrency configuration. Hopefully I'll post on that when it happens!

Parsing CSVs in Scala

2012-06-12T08:15:00.000-07:00

I did a quick google on parsing CSVs in Scala, and one of the top hits was a stack overflow question where the answer was wrong. Very wrong. So, I threw together a quick parser in Scala to get the job done. I'm not saying it's good, but it passes the spec tests I have included quotes and quoted commas both with single and double quotes. I hope this is useful, and perhaps somebody can improve upon it.

object CSVParser extends RegexParsers {
  def apply(f: java.io.File): Iterator[List[String]] = io.Source.fromFile(f).getLines().map(apply(_))
  def apply(s: String): List[String] = parseAll(fromCsv, s) match {
    case Success(result, _) => result
    case failure: NoSuccess => {throw new Exception("Parse Failed")}
  }

  def fromCsv:Parser[List[String]] = rep1(mainToken) ^^ {case x => x}
  def mainToken = (doubleQuotedTerm | singleQuotedTerm | unquotedTerm) <~ ",?".r ^^ {case a => a}
  def doubleQuotedTerm: Parser[String] = "\"" ~> "[^\"]+".r <~ "\"" ^^ {case a => (""/:a)(_+_)}
  def singleQuotedTerm = "'" ~> "[^']+".r <~ "'" ^^ {case a => (""/:a)(_+_)}
  def unquotedTerm = "[^,]+".r ^^ {case a => (""/:a)(_+_)}

  override def skipWhitespace = false
}

Data Migration - Scala and Play along the Way

2012-06-06T09:48:00.000-07:00

I've been nibbling at data migration system for many years. It's gone through various transformations, and it's latest addition is mostly working. The original purpose of the program I forget, but it's main use for awhile has been to extract the EVE Online database data from the SQL Server database dump that CCP kindly provides. Each EVE revision, I take the CCP dump, spin up a Windows server in the cloud, import the database and extract what I need to port it into PostgreSQL, which is my system of choice.

Over the years, JDBC has improved, and technologies have moved along. In the beginning I wrote Hermes-DB, a simple ORM that was very much not type safe, but coped with many of the auto-probing of table information that comes along with a more dynamic style ORM. One can argue that this isn't really ORM at all, and at this point, I'm inclined to agree.

Having said that, the auto-probing capabilities turned out to be very very useful in extracting data. Because the system was predicated on the idea that learning about the database should be the job of the framework, not the developer, it had a reasonably well formed concept of representing tables and columns as objects. With a bit of tweaking, adding a new metadata class along the way, the package can represent a table definition fairly well now.

What this allows me to do today, is create both a solid database dump, and the DDL to build the table structure. Theoretically this system could be modified to pull from any datastore and generate for any other datastore. The system was built in a way that was hopefully designed to facilitate that.

2012 rolls around, and things have changed. The landscape for web development has been shifting over the last decade as people struggle to find way to get the tools out of the developer's way, and enable them to do their job more and fight with code less. The most recent evolution in that sequence that I've been working with, is Scala and Play. As I work with two tools, I'm increasingly finding it easier to build systems that are stable, and take much less code to write.

Hermes-DB was originally designed just to output DDL, but when I started working with JPA, a system that requires a whole lot of scaffolding, it made sense to have one of the output "DDLs" be Java classes with JPA annotations. Over the last few days, I've been making a new variety of output, Scala case classes designed to work with Play and therefore Anorm. Anorm is very powerful, and gives you tools that "get out of your way", but doesn't have a lot when it comes to scaffolding. I've poked around a bit, and it seems there was a scaffolding plugin for Play 1, but none exists for Play 2. This little utility, is helping fill that gap for me. It outputs Scala class and companion object definitions based on the database schema.

The EVE Online database comes out of the box with about 75 tables. 75 tables that I'd rather not have to manually create mappings for for model classes. This little utility made my life much easier. A bit cheer for code generation tools!

It is open source of course, and can be found on gitorious with the git URL: git@gitorious.org:export4pg/export4pg.git

Please note that some of this code is very very old, and it's worked for probably close to a decade so some of it is a bit ancient in both understanding and coding style. It is however, very useful, and possibly one of the pieces of code I've written that's still in usage and not broken from constant tinkering!

On wired and wireless networking

2012-05-15T11:27:00.002-07:00

I saw the following article on G+ today:

http://lifehacker.com/5910335/what-awesome-things-still-require-a-wire--does-plugging-in-even-matter-anymore?utm_campaign=socialflow_lifehacker_facebook&utm_source=lifehacker_facebook&utm_medium=socialflow

And thought I'd comment on it. I used to be a big proponent of wired systems, sufficiently that I put effort into wiring my home with Cat 5e. That was back in the days of 802.11g, and honestly, back then 802.11g didn't come close to its potential most of the time.

Today we live in an age of the wireless. I use laptops that are truly portable, and iPad and iPhone and iPod touch. I agree that there are some places where wired makes sense, but I think this article makes both valid points, and invalid points. I'm gonna break it down here a bit, and take it one.

Backup Faster over the Network

This is mostly a valid point. If you need to backup over a network, you're better off plugged in. This does however assume your NAS supports gigabit ethernet, that the NAS's operating system doesn't suck, and that the drive inside can do better than 10-15MB/sec. I've seen many cases where none of the above are true, and it's one reason I switched to Apple.
Mostly, I don't use NAS. It's generally quirky, unreliable, expensive and slow, regardless of your network connection. I spent a great deal of time going through NAS devices until I finally just gave up and used a directed attached device. He also talks about remembering to keep your device turned on being a problem. If you use a wired connection, the same issue holds, so it's not really a good argument for wired.
I think on balance, this is a poor argument, though, I think it has some validity.

Keep up with your ultra-fast network

This is a really elitist kind of point. The number of folks who come close to having 100Mbit internet is miniscule. I'm a programmer, and I don't have 100Mbit. Even with 100Mbit, the number of times I'd get 100Mbit from the other end is about zero. Even at 25Mbit, I often don't see that saturated from download sites. This is a poor argument in my opinion.

USB 3.0 (and 2.0 Too)

Comparing wireless networking with direct attached peripherals seems a bit silly. And it goes on and on in this article. This is a both a valid point and an invalid point. If the device on the other end can truly saturate 802.11n, then this is true. Many devices just can't. Backups are the prime candidate here, and, well, I think backups are a good use of direct attached.

Remote Control Your Camera

Very very esoteric usage here. Firstly, it assumes you have a DSLR. Secondly it assumes you have a need to control it wireless and view the images on a laptop. Most folks aren't doing inside or studio shooting, even if they own a DSLR, and if they are, then why not just use USB from your computer, which is wired of course, but the need for wireless her at all seems a stretch.
This is a really crap argument.

Record High Quality Audio

Little bit of bandwidth calculation is required here. WAV format, that which is used in CDs is 44Khz at 16bit. This means you need 44,000 samples of 16 bits per second. Simple multiplication shows that comes it under 1Mbit. Lets take this up a notch and go to studio level 24 bit at 192Khz. If you have software and devices that can do this, it still only clocks in at 4.6Mbit. I've used 24 channel recording desks that use firewire. They were Firewire 800, which is 80Mbit. I'm pretty sure it wasn't saturated, and that's within the capability of 802.11n.
This is an invalid argument. Other than the fact that audio devices don't come with wireless support. But, let's face it, most computers don't come with Firewire 800 either.

Anything That Can Be Done with a Thumb Drive

I'm not really sure what the argument here is. If it's speed, then it's a really bad argument. Most thumb drives are really slow. I had to go out of my way to buy one that was even a half-sensible speed. This is also the reason I feel that Micro-SD slots in your Android device are pretty silly. Most people don't know they have to buy a high-speed SD card, or USB key for it to be much use. With wireless, it's not that hard to transfer files over the network to folks. It take a bit of knowledge unless you have a Mac, but it's not that hard. I haven't ever been given a USB key for a mix-tape (tapes, now we're talking modern tech) or mix-CD.

Charge your Other Gadgets

Powering USB devices. An iPad, a pretty hungry device I believe charges at 12W. You could reasonably charge your iPad off your unwired laptop without too much pain, and give your device some more juice at the cost of some laptop time in a pinch. Also, power transfer without wires is still some pretty new technology, and I think comparing it with wireless networking is a bit disingenuous.
I think this argument is valid in as much as you can't charge a device wireless, but, I think it's a silly argument given the original context of wireless being ethernet.

Audio and Video Cables

We've already covered audio. Video was only recently able to be transmitted over a serial connection, not HDMI level, but computer monitor level. Whilst this is true, it's also a bit silly, and see below for why.

Put Your Tablet or Smartphone On Your TV

Two words: Apple TV (maybe that's three)
Nuff said, this is an invalid argument. It also sort of invalidates the previous point. You can't transmit full-quality video over wireless, but you can transmit compressed high-def, and I think that satisfies the requirement in my opinion. There have been a few articles comparing iTunes 1080 with Blu-ray and iTunes hasn't come out too badly.

Get the Highest Quality Sound

Isn't this a repeat of "Record High Quality Audio"? In short, no. This is invalid.

Final Score

I think out of the arguments, three out of ten have some semblance of validity, of those, I'm struggling with two of them. There are things that need to be wired, your speakers to your stereo will still need to be wired. There are wireless solutions but they either suck, or are very expensive. Generally I think this article tries a bit too hard to demonstrate a need for wired in a world that is already mostly wireless. Trying to convince people to backup over ethernet when they're already doing it wireless is gonna be a pretty hard sell.

Scala is very nice - very very nice

2012-05-12T13:29:00.000-07:00

Today I am gushing over Scala's par method and XML literals. I am fetching about 30,000 entries over REST calls. The server isn't super fast on this one, so each call takes a bit of time. Enter list.par stage left.

list.par creates a parallelizable list which given an operation will perform it in parallel across multiple CPUs. It spawns threads and performs the operation, then joins all the results together at the end, very handy.

This little three letter method is turning what would be a very very long arduous process into a much less long one. Much much less.

val myList = io.Source.fromFile("list.txt").getLines.par.map { x =>
  callService("FooService", "{id=\""+x"\"}")
}

It gets better. In Scala, XML can be declared as a literal. Not only that, but it runs inline like a normal literal, with a few special rules. This service is combining a bunch of json into an XML output.

val myOutput = io.Source.fromFile("list.txt").getLines.par.map { x =>
  callService("FooService", "{id=\""+x"\"}")
}.map { x =>
  Json.parse[Map[String, Object]](x)("url").toString
}.map { x =>
  <entry>
    <url>{ x }</url>
  </entry>
}.toString

Which I can now happily write to wherever I need to, a file, or a web service response. Nifty in the extreme.

In 2012, we live in a world of JSON and XML. Perl had it's day when text processing was king. Today, a language is needed that can cope with JSON, XML and Parallelization and still yield sane-looking code. I'm not a big Ruby fan, as anyone who knows me will tell you, but I'm willing to keep an open in. I'd like to see if Ruby can do this kind of thing as elegantly and easily and demonstrate it's a language for the web in 2012. Also, I should mention Akka as well, though I don't yet know enough about it, other than it can allegedly take parallelization inter-compuer with similar simplicity.

Simple Scala scripts : Scan a directory recursively

2012-05-09T01:07:00.000-07:00

I'm using Scala increasingly as a scripting language at the moment. As my confidence with it is increasing, I'm finding it's becoming more and more useful for those throw-away scripting situations. Especially when then end up being not so throw-away after all.

def findFiles(path: File, fileFilter: PartialFunction[File, Boolean] = {case _ => false}): List[File] = {
  (path :: path.listFiles.toList.filter {
    _.isDirectory
  }.flatMap {
    findFiles(_)
  }).filter(fileFilter.isDefinedAt(_))
}

(replace {} with (), ditch newlines and it goes on one line well-enough, just doesn't fit in a Blogger template that way)
We might be duplicating the a shell find:

find | grep 'foo'

find ./ -name "foo"

And whilst the Scala is more complex, the Scala function can do operations on a File object, which gives you a lot of the rest of the power of the find command thrown in to the bargain. Plus, as it accepts a partial function, you can chain together filters. If you truly just wanted an analog for find:

def findFiles(path: File): List[File]  = 
  path :: path.listFiles.filter {
    _.isDirectory
  }.toList.flatMap {
    findFiles(_)
  }

Which is less complex that the first. This is still more work than find, but, the list you get back is an actual list. If you added anything useful to your find, say an md5 for each file, it gets less happy

find ./ | awk '{print "\""$0"\""}' | xargs -n1 md5sum

Maybe there's a better way, but that's what I've always ended up doing. The Scala is starting to compete now. Bump up the complexity one more notch, and I think Scala actually starts becoming less code and less obscure.

You might also notice that the example above can be fit nicely within the Map/Reduce paradigm. Scripting that is not only relatively easy, but can also be thrown at Hadoop for extra pzazz, and NoSQL buzz-worthyness.