
October 2014 S3-Beam — Direct Upload to S3 with Clojure & Clojurescript

In a previous post I described how to upload files from the browser directly to S3 using Clojure and ClojureScript. I’ve now packaged this up into a small (tiny, actually) library: s3-beam.

One interesting change compared to the process described in the earlier post: the code now uses pipeline-async instead of transducers. After some discussion with Timothy Baldridge this seemed more appropriate, even though there were some aspects of the transducer approach that I liked but didn’t get to explore further.
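For illustration, this is roughly how pipeline-async wires an asynchronous step between two channels. This is only a sketch; request-signature and the channel names are made up and not s3-beam’s actual internals:

(ns example.upload
  (:require [cljs.core.async :refer [chan put! close! pipeline-async]]))

;; hypothetical async step: request a signature for `file` and put
;; the signed result onto `result-chan`, closing it when done
(defn sign-file [file result-chan]
  (request-signature file            ; hypothetical AJAX helper
                     (fn [signed]
                       (put! result-chan signed)
                       (close! result-chan))))

(def files  (chan))   ; files to upload go in here
(def signed (chan))   ; signed upload requests come out here

;; run up to 3 signing requests concurrently
(pipeline-async 3 signed sign-file files)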

Maybe in an upcoming version it will make sense to reevaluate that decision. If you have any questions, feedback or suggestions I’m happy to hear them!


October 2014 Patalyze — An Experiment Exploring Publicly Available Patent Data

For a few months now I’ve been working on and off on a little data project analyzing patents published by the US Patent & Trademark Office. Given how much time I’ve spent on it by now, I think I should start talking about it instead of just hacking away evening after evening.

It started with a simple observation: companies like Apple sometimes collaborate with smaller companies that build a small part of Apple’s next device. A contract like this usually gives the smaller company’s stock a significant boost. What if you could foresee those relationships by finding patents filed jointly by employees of Apple and of the smaller company?

An API for patent data?

Obviously this isn’t going to change the world for the better, but the mere possibility that such predictions, or at least indications, could be made kept me curious enough to look for APIs offering patent data. I did not find much. So, thinking about something small that could be “delivered”, a patent API seemed like a good fit. To build the dataset I’d parse the archives provided on Google’s USPTO Bulk Downloads page.

I later found out about Enigma and some offerings by Thomson Reuters. Their prices are high, and the sort of analysis we wanted to do would have been hard with inflexible query APIs.

For what we wanted to do we only required a small subset of the data a patent contains: the organization, its authors, the title and description, filing and publication dates, and some identifiers. Since such a reduced dataset is almost only useful in combination with the complete one, I discarded the plan to build an API. Maybe at some point it will make sense to publish reduced and more easily parseable versions of the archives Google provides. Let me know if you would be interested in that.
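To give an idea of that subset, a reduced patent record might look roughly like this (hypothetical field names for illustration, not the actual schema):

{:organization     "Apple Inc."
 :authors          ["Jane Doe" "John Smith"]
 :title            "Example device"
 :abstract         "A short description of the invention."
 :filing-date      "2001-02-15"
 :publication-date "2002-08-22"
 :publication-id   "US-2002-0123456-A1"}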

What’s next

So far I’ve built a system to parse, store and query some 4 million patents that have been filed with the USPTO since the beginning of 2001. While it sure would be great to make some money off this work, I’m not sure what product could be built from the technology I’ve created so far. Maybe I could sell the dataset, but the number of potential customers is probably small, and in general I’d much prefer to make it public. I’ll continue to explore the possibilities in that regard.

For now I want to explore the data and share the results of that exploration. I set up a small site that I’d like to use as a home for any further work on this. So far it only has a newsletter signup form (just like any other landing page), but I hope to share some interesting analyses with subscribers every now and then in the near future. Check it out at patalyze.co. There’s even a small chart showing some data.


September 2014 Running a Clojure Uberjar inside Docker

For a side project I wanted to deploy a Clojure uberjar on a remote server using Docker. I imagined that to be fairly straightforward, but there are some caveats you need to be aware of.

Naively my first attempt looked somewhat like this:

FROM dockerfile/java
ADD https://example.com/app-standalone.jar /
EXPOSE 8080
ENTRYPOINT [ "java", "-verbose", "-jar", "/app-standalone.jar" ]

I expected this to work. But it didn’t. Instead it just printed the following:

[Opened /usr/lib/jvm/java-7-oracle/jre/lib/rt.jar]
# this can vary depending on what JRE you're using

And that line only got printed because I added -verbose when starting the jar. So if you’re not running the jar verbosely it will fail without any output. It took me quite some time to figure that out.

As it turns out, the dockerfile/java image contains a WORKDIR instruction that somehow breaks my java invocation, even though absolute paths are used everywhere.
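In hindsight, resetting the working directory explicitly might have been enough to keep the single-file approach, though I haven’t verified this:

FROM dockerfile/java
# reset the working directory set by the base image (untested sketch)
WORKDIR /
ADD https://example.com/app-standalone.jar /
EXPOSE 8080
ENTRYPOINT [ "java", "-verbose", "-jar", "/app-standalone.jar" ]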

What worked for me

I ended up splitting the procedure into two files in a way that allows me to always get the most recent jar when starting the Docker container.

The Dockerfile basically just adds a small script to the container that downloads a jar from somewhere (S3 in my case) and runs it.

FROM dockerfile/java
ADD fetch-and-run.sh /
EXPOSE 42042
EXPOSE 3000
CMD ["/bin/sh", "/fetch-and-run.sh"]

And here is fetch-and-run.sh:

#! /bin/sh
wget https://s3.amazonaws.com/example/yo-standalone.jar -O /yo-standalone.jar;
java -verbose -jar /yo-standalone.jar

Now when you build a new image from that Dockerfile it adds the fetch-and-run.sh script to the image’s filesystem. Note that the jar is not part of the image; it is downloaded whenever a new container is started from the image. That way a simple restart will always fetch the most recent version of the jar. In some scenarios the lack of precise deployment tracking might become confusing, but in my case it turned out to be much more convenient than going through the process of destroying the container, deleting the image, creating a new image and starting up a new container.
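In practice the cycle might look like this (a sketch; the image and container names are made up):

# build the image once
docker build -t yo .
# start a container; fetch-and-run.sh downloads the jar on startup
docker run -d --name yo-app -p 3000:3000 -p 42042:42042 yo
# after uploading a new jar to S3, a restart picks it up
docker restart yo-app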


September 2014 Using core.async and Transducers to upload files from the browser to S3

In a project I’m working on we needed to enable users to upload media content. In many scenarios it makes sense to upload directly from the browser to S3 instead of routing the file through a server; if you’re hosting on Heroku you need to do it this way anyway. After digging a bit into core.async, this seemed like a neat little excuse to give Clojure’s new transducers a go.

The Problem

To upload files directly to S3 without any server in between you need to do a couple of things:

  1. Enable Cross-Origin Resource Sharing (CORS) on your bucket
  2. Provide special parameters in the request that authorize the upload

Enabling CORS is fairly straightforward; just follow the documentation provided by AWS. The aforementioned special parameters are based on your AWS credentials, the key you want to save the file to, its content type and a few other things. Because you don’t want to store your credentials in client-side code, the parameters need to be computed on a server.
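For reference, a bucket CORS configuration along these lines should do (a sketch; adjust the allowed origin to your domain):

<CORSConfiguration>
  <CORSRule>
    <AllowedOrigin>https://your-app.example.com</AllowedOrigin>
    <AllowedMethod>POST</AllowedMethod>
    <AllowedHeader>*</AllowedHeader>
  </CORSRule>
</CORSConfiguration>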

We end up with the following procedure to upload a file to S3:

  1. Get a JavaScript File object from the user
  2. Retrieve special parameters for post request from server
  3. Post directly from the browser to S3

Server-side code

I won’t go into detail here, but here’s some rough Clojure code illustrating the construction of the special parameters and how they’re sent to the client.
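Something along these lines should work (a sketch of AWS’s signature v2 POST policy signing; the bucket, environment variables and helper names are made up, not the project’s actual code):

(ns app.s3-sign
  (:require [clojure.data.json :as json])
  (:import [javax.crypto Mac]
           [javax.crypto.spec SecretKeySpec]
           [javax.xml.bind DatatypeConverter]))

(defn- base64 [^bytes bs]
  (DatatypeConverter/printBase64Binary bs))

(defn- hmac-sha1 [^String secret ^String data]
  (let [mac (Mac/getInstance "HmacSHA1")]
    (.init mac (SecretKeySpec. (.getBytes secret "UTF-8") "HmacSHA1"))
    (.doFinal mac (.getBytes data "UTF-8"))))

(defn sign-upload
  "Computes the parameters the browser needs to POST a file
   directly to S3. file-name and mime-type come from the client."
  [{:keys [file-name mime-type]}]
  (let [policy (base64
                 (.getBytes
                   (json/write-str
                     {:expiration "2020-01-01T00:00:00Z" ; use a short-lived expiry in practice
                      :conditions [{:bucket "your-bucket"}
                                   {:acl "public-read"}
                                   {:key file-name}
                                   {:Content-Type mime-type}]})
                   "UTF-8"))]
    {:action         "https://your-bucket.s3.amazonaws.com/"
     :key            file-name
     :Content-Type   mime-type
     :policy         policy
     :acl            "public-read"
     :AWSAccessKeyId (System/getenv "AWS_ACCESS_KEY")
     :signature      (base64 (hmac-sha1 (System/getenv "AWS_SECRET_KEY") policy))}))

A Ring handler would then serialize this map (as EDN or JSON) in response to the client’s signing request.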

Client-side: Transducers and core.async

As we can see, the process involves multiple asynchronous steps.

To wrap all of that up into a minimal, useful API that hides the complex back and forth happening until a file is uploaded, core.async channels and transducers turned out to be very useful:

(defn s3-upload [report-chan]
  (let [upload-files (map #(upload-file % report-chan))
        upload-chan  (chan 10 upload-files)
        sign-files   (map #(sign-file % upload-chan))
        signing-chan (chan 10 sign-files)]

    (go (while true
          (let [[v ch] (alts! [signing-chan upload-chan])]
            ;; not strictly required, but has been useful for debugging
            (log v))))
    signing-chan))

This function takes one channel as an argument, where it will put! the result of the S3 request. You can take a look at the upload-file and sign-file functions in this gist.

So what’s happening here? We use a channel for each step of the process: signing-chan and upload-chan. Both channels have an associated transducer. In this case you can best think of a transducer as a function that is applied to each item that passes through the channel. I initially tripped over the fact that the transducing function is only applied when the element is taken from the channel; just putting things into a channel doesn’t trigger its execution.
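A toy example of a channel with a transducer (a REPL sketch):

(require '[clojure.core.async :refer [chan put! take!]])

;; every value is incremented on its way through the channel
(def c (chan 10 (map inc)))

(put! c 1)
(take! c println) ; prints 2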

signing-chan’s transducer initiates the request to sign the File object that has been put into the channel. The second argument to the sign-file function is a channel where the AJAX callback will put its result. Similarly, upload-chan’s transducer initiates the upload to S3 based on the information that has been put into the channel. A callback then puts S3’s response into the supplied report-chan.

The last line returns the channel that can be used to initiate a new upload.

Using this

Putting this into a library and opening it up for other people to use isn’t overly complicated; the exposed API is actually very simple. Imagine an Om component upload-form:

(defn queue-file [e owner {:keys [upload-queue]}]
  (put! upload-queue (first (array-seq (.. e -target -files)))))

(defcomponent upload-form [text owner]
  (init-state [_]
    (let [rc (chan 10)]
      {:upload-queue (s3-upload rc)
       :report-chan  rc}))
  (did-mount [_]
    (let [{:keys [report-chan]} (om/get-state owner)]
      (go (while true (log (<! report-chan))))))
  (render-state [this state]
    (dom/form
      (dom/input {:type "file" :name "file"
                  :on-change #(queue-file % owner state)} nil))))

I really like how simple this is. You put a file into a channel, and whenever it’s done you take the result from another channel. s3-upload could take additional options like logging functions or a custom URL for retrieving the special parameters that authorize the request to S3.

This has been the first time I’ve done something useful with core.async and, probably less surprisingly, the first time I’ve played with transducers. I assume many things can be done better, and I still need to look into some details, such as how to properly shut down the go blocks. Any feedback is welcome! Tweet or mail me!

Thanks to Dave Liepmann who let me peek into some code he wrote that did similar things and to Kevin Downey (hiredman) who helped me understand core.async and transducers by answering my stupid questions in #clojure on Freenode.


July 2014 Emacs & Vim

After using Vim for more than four years, my recent contact with Lisp encouraged me to take another look at Emacs. I used to make jokes about Emacs just as Emacs users do about Vim, but it actually seems to be a pretty decent piece of software.

Being a Vim user in the Clojure community sometimes feels weird. You are happy with Vim; running Clojure code right from the editor works well these days. Still, you wonder why all those people you consider smart seem so committed to Emacs. So I decided to try it once again.

Keybindings

Let’s start with a slight rant: I simply do not understand how anyone can use Emacs’ default keybindings. Being a Vim user I obviously have a thing for mode-based editing, but Emacs’ keybindings are beyond my understanding. Some simple movement commands to illustrate this:

Command                          Emacs    Vim
Move cursor down one line        Ctrl-n   j
Move cursor up one line          Ctrl-p   k
Move cursor left one character   Ctrl-b   h
Move cursor right one character  Ctrl-f   l

These are the commands recommended in the Emacs tutorial (which you open with Ctrl-h t). They are mnemonic, which makes them easy to learn, but is that really the most important factor for commands you will use hundreds of times a day? I don’t think so. I tried to convince myself that it might be worth getting used to Emacs’ default keybindings, but after some time I gave up and installed evil-mode.

Mode-based Editing with Evil Mode

In my memory evil-mode sucked. I was pleasantly surprised to find that it doesn’t (anymore?). Evil brings well-done mode-based editing to Emacs. As you evolve your Emacs configuration you will most likely install additional packages that bring weird Emacs-style keybindings with them. Since you now have a mode-based editor, you can use shorter, easier-to-remember keybindings to call the functions those packages provide. A useful helper that hits a sweet spot in my Vim brain is evil-leader, which allows you to set up <leader>-based keybindings, just like you can in Vim:

(evil-leader/set-leader ",")
(evil-leader/set-key
  "," 'projectile-find-file)

With this I can open a small panel that fuzzy-finds files in my project (think Ctrl-p for Vim) by hitting , twice instead of Ctrl-c p f.

Batteries Included

What I really enjoyed about Emacs was that a package manager comes right with it. After adding a community-maintained package repository to your configuration you have access to some 2000 packages covering Git integration, syntax and spell checking, interactive evaluation of Clojure code and more. The package manager was added in the last major update (v24) after being a community project for some years.
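Adding such a repository is just a few lines in your init.el, for example MELPA (the exact URL may differ depending on which archive and Emacs version you use):

(require 'package)
;; add the community-maintained MELPA archive
(add-to-list 'package-archives
             '("melpa" . "https://melpa.org/packages/") t)
(package-initialize)

;; afterwards, e.g.: M-x package-install RET evil RET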

Conclusion

Vim’s lack of support for asynchronous execution of code has always bugged me, and although there are some projects trying to change this, I can’t see it being properly fixed at least until NeoVim becomes the go-to Vim implementation. Emacs lets me kick off commands and do other things until they return. In addition, it embeds Vim’s most notable idea, mode-based editing, very well, allowing me to productively edit text while having a solid base to extend and to interactively write programs.

If you are interested in seeing how all of that comes together in my Emacs configuration, you can find it on GitHub.

