Fascinating linkdump

Yup, I'm still alive :) busy moving apartments, but otherwise still interested in THE FUTURE.

I've also recently finished a Scalable Machine Learning course at edX. It was my first time trying out an online course and it turned out to be quite interesting. Especially the final week's assignment produced some really cool results. Apache Spark is so much nicer to work with than Hadoop.

Posted in Tech , Thoughts

A Brave New Internet (or: why I stopped using Twitter)

Twitter has been one of my favorite Internet places for at least the past 5 years. Twitter always distinguished itself from Facebook for me because of its 'just a quick thought'-ness. Anything you can think of, just dump it on Twitter. Friends might follow you on Twitter, but thoughts on there are generic and meant to be seen by the world. Anything personal that I want kept between my circle of friends goes on Facebook, anything that doesn't require a friend context goes on Twitter.

Or at least that's how I started using Twitter, but I no longer use it that way. I've been tweeting my random thoughts less and less - more on that later. My main use of Twitter for the past few years(!) has been to complain. There's nothing more satisfying when your train is running late again than to fire off an angry tweet towards the train company (that's you, TFL). Or if my mobile phone's internet has failed yet again during my daily commute (that's you, Three). Or if the software for my fitness tracker is just so shit I can't bear to use it (that's you, Garmin). But I digress. What used to be a frivolous form of quick mind-blogging has turned into an utterly useless hatefest. It's not healthy to use Twitter just for that.

So what about the other use? What about the short random thoughts? I had a fun random thought the other day at work, while performing a request for a client that was a bit out of my comfort zone, but at the same time no trouble at all and quickly handled. In a split-second I came up with "I'm a developer, not a nanny", fired off a tweet and forgot about it. My mood never darkened, I didn't brood on it, I just thought it was a funny thought because it drew parallels with the "I'm a doctor, not an X" meme from Star Trek which (I assumed) would resonate among my developer friends that are following me. Instead I got a concerned message from one of the people I work for asking me if everything was OK.

This is wrong on so many levels, but the main thing that bothers me is this: I know my colleagues quite well (I think), and I think they know me quite well. They know my personality and they know I'll speak up when something bothers me, yet somehow the off chance that an extremely generic statement I make on a personal Twitter account might somehow end up reflecting badly on my client, or my client's client, means that I will be spoken to by someone who did not understand the context in which the remark was made, or what it was even meant to represent.

And you know what the worst part is? I am in the wrong! I'm not saying that sarcastically, I truly believe that I did the wrong thing. I am wrong to assume that it's ok to share a short message publicly without context and expect people to understand all of its nuances. The past that I fondly remember is not 'better' because people back then knew you better and knew to take things in context or not place too high a value on them; it's just that I used to get away with it because it didn't occur to anyone to check on Twitter what the people they know are saying in public. You can argue very strongly for the right to say anything you want on the internet and get away with it, but from a purely game-theoretic perspective an employer would be stupid not to check. All things being equal you'd rather have an employee with zero public presence than an employee with a potentially negative web presence.

I don't fault people for thinking this way but, again from a game-theoretic perspective, my chances of remaining employed, and getting job interviews, only increase by shutting down my Twitter account. I'm not saying any public presence is a risk by definition. GitHub is a really good example of something that will very likely benefit you, even if you don't write a lot of code publicly. Even if you suck at coding, GitHub would be a representation of that. It's hard (but not impossible!) to take source code out of context, especially compared to a message of max 140 characters.

This is not 2005 any more. Back then as a fresh 20-something in Japan I could tweet and blog anything I liked without consequences. But in 2015 as a 30-something trying to be a responsible developer you simply can't blurt out random things in public.

Lack of context creates misunderstandings. I truly believe in openness of information and that, provided all parties have the full context available, more information can only have a positive effect. But no one on the internet has the time or the interest to research the full context of something before making up their opinion. That simple little fact makes Twitter a risk without a reward.

 

I've pondered on whether I should write about this at all, and I'm still pondering about closing this blog again in favor of having an anonymous blog, which is what I did a while back. I came back here, to the good old Colorful Wolf, because I believed that I could provide the context people need to understand me. I naively believe that I still can have a net positive effect on the world by writing and sharing the things that interest me.


Posted in Tech , Thoughts

A tale of two bugs

A long time ago in a content management system far, far away.. a manager was sending out lifecycle emails. These emails would be sent out by the CMS automatically once a day, iterating over all recipients in the system. The CMS would weed out the recipients that did not match the email's filters, those recipients that had opted out, and those that had already received the email. It would then send the email to all the new recipients every night.

Sounds fairly simple, right? Let's complicate things a little: one of the sources of email addresses for these lifecycle emails was a customizable form that would be presented on the public-facing site. Since the email might want to reference some of the fields in the user's form submission, the email module had to be made aware of this data. Because both the email module and the forms module were unaware of each other some glue code had to be introduced: an automated task runs every night and imports form submissions from the forms module and turns them into usable recipients for the email module. For each form submission the task first calculates a checksum over the user's data; if there's already a record in the email module that matches the checksum then the same record won't be copied twice.
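
To make the glue task a bit more concrete, here's a rough Python sketch of what a checksum-based import like this could look like. It's not the actual CMS code: the function names, the use of SHA-256 over sorted JSON, and the plain-dict records are all assumptions for illustration.

```python
import hashlib
import json


def checksum(record):
    """Checksum over the user's data (assumption: SHA-256 of sorted JSON)."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def import_submissions(submissions, recipients):
    """Copy new form submissions into the email module's recipient list.

    Both arguments are lists of plain dicts (hypothetical stand-ins for the
    forms module and the email module). Returns how many recipients were
    created.
    """
    # Checksums are recomputed from what the email module currently stores.
    known = {checksum(r) for r in recipients}
    created = 0
    for submission in submissions:
        digest = checksum(submission)
        if digest in known:
            continue  # a matching record already exists, don't copy it twice
        recipients.append(dict(submission))
        known.add(digest)
        created += 1
    return created
```

In this sketch the checksum is derived from whatever the email module has stored at that moment, so a submission only counts as 'already imported' for as long as the stored copy stays byte-for-byte the same.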

This system in the way that it's described here ran for many months without any issues. Until one day, just before I was about to leave of course, a project manager came to me and asked: "How come this email only has 300 form submissions but has been sent 3000 times?". A quick check of the database and server logs didn't show up any obvious bugs, so I knew I'd be working through the evening that day.

The logs confirmed that an email was indeed being sent multiple times to the same recipient. I began by trying to reproduce the problem, and immediately hit frustration. I tried to manually add myself multiple times as a recipient to a new email, but I only received it once. Then I tried to reproduce the entire workflow, creating a form on the public site and trying to register myself via there. Since the form wouldn't allow me to register multiple times, I manually added multiple form submissions for myself with the same email address. That didn't work either. Not only that: the glue code successfully detected that my multiple form submissions had the same checksum, and only created a single email recipient in the email module.

If there's one thing that sucks about programming, it's when things work when you expect them to fail. At least when things fail you've got an obvious thing you can work on, but if things are (seemingly) working you need to figure out how to break them, and that can be way more difficult.

I went back to the logs of the original email to try and find out if there was a pattern to the email addresses that were receiving the email multiple times. Something immediately stood out: out of all the addresses that the email was being sent to, only a single address received the email more than once. The real moment of clarity came when I saw the address: rather than being all lowercase, it looked something like this: "AAaaAA@BBbb.com".

With this clue I managed to reproduce the problem. I first tried to add my own email address to the email module directly using mixed case letters, but this didn't cause me to receive multiple emails. So I repeated the test doing a form submission, ran the glue code task and sent the email again. It worked! Finally I managed to reproducibly break the system.

Now I had two major clues: a) it only happens when the email address is not all-lowercase, and b) it only happens on form submissions. It was pretty easy to debug from there. Here's a step-by-step of what happened.

  • The daily email task starts up and tries to send an email to 'AAA@BBB.COM'
  • It finds that no email was sent yet, so it re-saves the email recipients with the lowercase address 'aaa@bbb.com'. It then sends the email and creates a log record confirming that the email was indeed sent to 'aaa@bbb.com'.
  • The daily data duplication task runs and happily (but mistakenly) adds a new email recipient with the address 'AAA@BBB.COM'.
  • The next day the whole process repeats itself.

Two major bugs!

  1. The email task checked if a sent record existed before sanitizing (lowercasing) the email address.
  2. Because the record in the email module got sanitized, its checksum changed. So the glue task ended up duplicating the same record over and over again every time it ran.

If there was just the data duplication bug then the email address would have gotten sanitized correctly, and only one email would ever have been sent out. If there was just the address sanitizing bug then the recipient would have been sanitized correctly after the first time, and only one email would ever have been sent out. With both bugs present things turned into an infinite loop of sanitizing and duplication.

We deleted the duplicate data, made sure that the checksum check would lowercase all fields when calculating the checksum, and modified the email task to sanitize the email address before doing the 'was-already-sent' check, not after. It would have been better to sanitize all of our existing data and to rewrite the email module, but that never happened. The entire CMS, including the email module, eventually got rewritten from scratch, but that's a story for another time.
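
For illustration, the two fixes could be sketched in Python roughly like this. Again, this is not the real CMS code: `sanitize_email`, `checksum`, `send_if_new`, the SHA-256 checksum and the in-memory `sent_log` set are all hypothetical names and choices made up for the example.

```python
import hashlib
import json


def sanitize_email(address):
    """Sanitizing here just means lowercasing, as in the story."""
    return address.lower()


def checksum(record):
    """Case-insensitive checksum: string fields are lowercased first."""
    normalized = {
        key: value.lower() if isinstance(value, str) else value
        for key, value in record.items()
    }
    payload = json.dumps(normalized, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def send_if_new(address, sent_log, send):
    """Sanitize first, then do the was-already-sent check (not after)."""
    address = sanitize_email(address)
    if address in sent_log:
        return False
    send(address)
    sent_log.add(address)
    return True
```

With both changes in place, 'AAA@BBB.COM' and 'aaa@bbb.com' produce the same checksum and hit the same sent record, so neither task can restart the loop on its own.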

Besides the fix, the most important lessons learned were in procedure. Always have an extra set of eyes on critical code. Code reviews are massively useful to catch problems like this. The same goes for unit testing: if we had thought of testing the email address sanitizer, it's quite likely that we would've thought of testing it with mixed case email addresses. Neither code reviews nor unit tests are an absolute guarantee that problems like this won't happen, but they make it a hell of a lot less likely.

Posted in Tech

How to deal with decreasing awesomeness of the internet

Is the internet well? Opinions may differ, but I'm here to evangelize mine: No, the internet is not well. The problem is with the concept of the internet as a place of freedom; a place where anything can be said and done, and shared with everyone all over the world.

That functionality, although it still exists, is becoming more and more regulated. The authorities have caught up, and certain things are starting to move to the fringes (eg. Tor hidden sites). It starts with all the blatantly illegal things, but whole-internet censorship is really not far away in many countries, including the UK, where I live.

It's not just legality issues that are making the internet less free; a far worse culprit is monetization. YouTube clips have ads on them, streaming websites throw ads in as well. No more free database service for you; the internet rule of the day is that either you pay premium, or you accept that you'll get annoyed at crap advertisements that people throw at you.

Then there's politics. Surveillance. All of these reasons make it very likely that, a few years from now, you'll no longer be able to access your favorite service for free. You don't have to agree with this, of course. Perhaps you think that the internet will be a happy shiny perfect place in 5 years, but in my opinion signs point in the other direction.

So what to do? First things first: GET YOUR DATA OUT. Never, ever, keep your only copies of things in Flickr, Facebook, Google or even Dropbox. Don't trust backup services either; they're a nice extra, but they may go bankrupt or become unreachable any time. Think of the legal issues involved: do you know exactly which files you are backing up, and do you know the rules of possessing those files in the country you are storing your backups in? I'm guessing no.

In the Netherlands it used to be legal to make a copy of any copyrighted thing, as long as it was for your own personal use and you didn't commercialize it. That is no longer true, because it conflicts with European regulations which override it. Historically, this kind of thing has been hard to enforce, but the authorities have done an amazing job of catching up, and they're only going to get more in-your-face about it in the future. If you can legally justify getting something that's available in your country right now, get it now. Laws will change, availability will decrease, and you will be branded a criminal for doing something that was legal just a few years ago.


Edit: perhaps I was wrong. Google Earth was one of the things I was thinking of while writing this, and I expected the free version to go away in the future. Instead, one day after I wrote this, Google made Earth Pro free. Perhaps the 'users are the product' philosophy will set us all free?

(no)

 

Posted in Tech , Thoughts

Gumbug: A better way to browse real estate

Last summer I really wanted to find a decent rental apartment around London. Every day I scoured Gumtree, Rightmove and the likes in search of something affordable. In the end I decided to wait until I was able to buy an apartment instead, but I spent several weeks searching and getting annoyed at real estate sites nonetheless. I decided I could save myself a lot of time and effort by automating some of the steps of my search process. My search process went roughly like this:

  • Go to Gumtree, search by location and price
  • Mentally filter out all the ads that I'd already rejected, usually because they were old or just looked crappy
  • Check the new ads, decide which ones I might be interested in based on my more subjective criteria (not ground floor, too far from public transport, high-crime area etc.)
  • Repeat the above process for a different set of locations
  • Repeat the above process for all locations on a different website (Rightmove, Zoopla etc.)

Thus Gumbug was born. Initially it was meant to search both Gumtree and Rightmove for rental apartments, but I've adapted it to only do Rightmove's To Buy section, for now. I've found a lot of duplication between sites that are listing property to sell, whereas for rental apartments there was often a whole category of quirky private listings that would only appear on Gumtree. The need to scrape multiple sites seems a lot smaller when only considering things to buy.

You can find Gumbug on GitHub: https://github.com/rv/gumbug. I'm also running a semi-public version of it on Heroku, although it won't be very fast if a lot of people end up using it. You can have a play with it here: http://floating-forest-4090.herokuapp.com/, or to see some example search results, have a look at this link: http://floating-forest-4090.herokuapp.com/s/gzr1vwthsd. Since it might not handle the load, I'll describe how it works.

For each search you can add multiple sources, which are all consolidated into one page. I tried to avoid pagination of things as much as possible because I just want to see everything on one big page that I can scroll through at my leisure. If a listing appears on more than one source url it'll only appear once in the results. If the listing is already in the system its details won't be re-fetched every search, to save time. Adding urls as input might be a bit 'techy' but it saves a lot of coding time and allows me to specify a whole bunch of hard filters right at the source, since the url can already contain filters for price range, number of bedrooms etc.
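
The consolidation step described above can be sketched in a few lines of Python. This is only an illustration, not the code from the Gumbug repository: `collect_listings`, the `details_cache` dict and the `fetch_details` callback are hypothetical names, and listings are assumed to be identified by some id that is the same across source urls.

```python
def collect_listings(source_results, details_cache, fetch_details):
    """Merge the listing ids of all sources into one de-duplicated list.

    source_results: one list of listing ids per source url (assumption).
    details_cache:  dict of listing id -> details already in the system.
    fetch_details:  callable that scrapes the details for one listing id.
    """
    seen = set()
    merged = []
    for listing_ids in source_results:
        for listing_id in listing_ids:
            if listing_id in seen:
                continue  # already added via another source url
            seen.add(listing_id)
            if listing_id not in details_cache:
                # Only new listings cost a fetch; known ones are reused.
                details_cache[listing_id] = fetch_details(listing_id)
            merged.append(details_cache[listing_id])
    return merged
```

The order of the result simply follows the order in which listings are first seen, so everything ends up on one page without pagination.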

[Image: Keywords]

You can add a list of keywords to ignore and a list of keywords that are required. Eg. you can ignore 'ground floor, retirement' and you can require 'leasehold'. For the ignored keywords, if a listing contains at least one of the keywords, it'll be marked as ignored and moved to the bottom. For the required keywords, if an ad doesn't contain at least one of the required keywords, it will also be marked as ignored and moved to the bottom.
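
As a rough Python sketch of these two rules (not the actual Gumbug code: the function names and the case-insensitive substring matching are assumptions for the example):

```python
def is_ignored(text, ignored_keywords, required_keywords):
    """Apply the ignore/require keyword rules to one listing's text."""
    text = text.lower()
    # Rule 1: any ignored keyword present -> ignored.
    if any(keyword.lower() in text for keyword in ignored_keywords):
        return True
    # Rule 2: required keywords given, but none present -> ignored.
    if required_keywords and not any(
        keyword.lower() in text for keyword in required_keywords
    ):
        return True
    return False


def order_listings(texts, ignored_keywords, required_keywords):
    """Keep the original order, but move ignored listings to the bottom."""
    # sorted() is stable and False sorts before True.
    return sorted(
        texts,
        key=lambda text: is_ignored(text, ignored_keywords, required_keywords),
    )
```

Plain substring matching like this is simple, but it also means that a listing saying "not ground floor" still matches the ignored keyword 'ground floor'.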

[Image: Filter by distance to public transport]

The public transport filter lets you select the stations you wish to be near to (or far away from). The list of stations is prepopulated from the zoned stations around London, but it'll automatically update after every search. If you add at least one station filter, all the listings will have to match at least one of your station filters, or else they will be ignored. Eg. if you add two filters: between 0.0-0.5 miles from Chesham station and between 0.2-1.0 miles from Amersham station, a listing must be either close to Chesham or close to Amersham (but not necessarily both) to match.
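
The 'match at least one station filter' rule boils down to something like the following Python sketch. It's an illustration under assumptions, not Gumbug's real code: the function name and the data shapes (a dict of station name to distance in miles, and filters as `(station, min, max)` tuples) are made up for the example.

```python
def matches_station_filters(distances, filters):
    """Return True if a listing passes the station filters.

    distances: dict of station name -> distance in miles for one listing.
    filters:   list of (station, min_miles, max_miles) tuples.
    """
    if not filters:
        return True  # no station filters means nothing gets ignored
    for station, min_miles, max_miles in filters:
        distance = distances.get(station)
        if distance is not None and min_miles <= distance <= max_miles:
            return True  # one matching filter is enough
    return False
```

Using the example from above: with the filters `("Chesham", 0.0, 0.5)` and `("Amersham", 0.2, 1.0)`, a listing 0.3 miles from Chesham matches even if Amersham is miles away, while a listing 0.6 miles from Chesham and 0.1 miles from Amersham matches neither filter. A 'far away from' filter is just a range that starts at a larger minimum distance.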

The distance filter is pretty stupid because distances are simply scraped from Rightmove, which (as far as I can tell) only shows straight-line distance. You might have to make a massive detour to get to the station, but Rightmove will still happily report that the listing is right next to the station.

Once the search is complete you get to see all the results on one page: all their images, important information and a map. No useless clicking through tiny thumbnails here. The key feature in the search results page is this: you can manually mark listings as either favorited or ignored, and any future searches you do from that particular search result page will preserve your favorites and ignored listings. So let's say you haven't searched anything for a week or so, all you have to do is press the search button to perform the exact same search again to get the new listings. Gumbug will pre-filter the new listings according to your criteria and will automatically move the ones you've already ignored manually down to the bottom.

So, why am I showing the ignored listings at all, if I'm clearly not interested in them? The reason for this is that humans (especially real-estate agents) make mistakes. They will mislabel things, forget to mention a keyword that every other ad that you're interested in has, or they'll add something stupid like "not ground floor" which throws off the keyword filters.

A second reason to display ignored listings is because you might be sharing the link to the search results with more than one person, and the other person might want to un-ignore a listing. Gumbug isn't exactly built on security: any person that you share the search results url with can favorite and ignore listings. This is great for me because I want to share search results with my girlfriend so she can go through them as well, but when sharing in public it's better to spawn a new search with a new url.

Lastly, there's the map. One of the things I've consistently found myself doing when checking listings is cross-referencing the area with the deprivation map, which gives a rough indication of how much crime/poverty/incidents/bad things there are in an area. You can also click the name of each public transport station to display walking directions, so you know if that 0.6 miles is actually 0.6 miles (hint: it usually isn't).

[Image: Deprivation and Directions]

Gumbug will continue to be a work-in-progress, but it's reached a point where I'm quite able to use it to make my own life easier. Maybe it can help someone else too. Here are some of its issues:

  • When you flag something as ignored and then go to the next page, the ignored listing will pop up again because it's been moved to the back of the sort order.
  • No street view support yet
  • Some map issues when viewing on mobile
  • No floor plans yet

Feel free to give it a try on Heroku. If for some reason your search doesn't seem to be working then that might be because the worker process is not running. Since Heroku's not cheap I'm running the worker process on my local machine. Heroku's database is very tiny so it might fill up very quickly. If there's enough demand I could consider setting up a more proper version of it, so consider this an attempt to gauge the public interest. Let me know what you think :)

Posted in Tech , UK

Controlling foobar 2000 from Ubuntu with global hotkeys

Uh, long title, short explanation: I work primarily on an Ubuntu laptop but I listen to music from my Windows machine right next to it using foobar 2000, still the best mp3 player available (come at me bro!). On Windows I'm used to using foobar's global hotkey functionality to quickly pause and switch tracks, but on Ubuntu any way you try to pause or skip a track requires a context switch, which is damn annoying if you're in the programming zone. Here's how I solved it.

  • Get the foobar http control plugin: https://code.google.com/p/foo-httpcontrol/
  • Configure it to require a password just to be safe. Without it anyone can log in to your music player and mess up your playlist and what you're listening to. With the password on they can still do exactly that but they'll have to sniff the network packets, which really isn't worth it just to control a music player.
  • Once configured, use the python script below to remote-control your foobar from the shell.
  • Put a shell script in /usr/bin (or /usr/local/bin, I forget) that calls the python script with the appropriate parameter (PlayOrPause, StartNext, StartPrevious). For more commands you can check the javascript of the browser interface of the http control plugin.
  • In Ubuntu's hotkey configuration settings, add your hotkey and make it call the shell script you just created.

Here's the script:



import sys

import requests
from requests.auth import HTTPBasicAuth

# The first argument is the command to send, e.g. PlayOrPause.
requests.get("http://your-ip-address-goes-here:1234/default?cmd=%s" % sys.argv[1],
             auth=HTTPBasicAuth('username', 'password'))





Voila! Cross-platform music hotkeys :D

Posted in Tech

Constructing a mind palace... in Minecraft

I absolutely love Minecraft. Though my level of obsession has dimmed a bit compared to when I was first mindblown, it's still an amazingly satisfying sandbox to play in. There always seems to be something new to build, which always manages to recapture my interest.

One of the things I noticed while playing Minecraft is that I pretty much know exactly what, where and how I built the things in my world. If I somehow lost my world and all of its backups, I am positive that I could recreate an extremely large portion, if not all of it, just from memory. The connection to a mind palace should now become evident.

In the past I've tried to build mind palaces of things, and have been more or less successful, up until the point where I try to populate the rooms in my mind with actually useful information. That's where my memory stops functioning well, I suspect because an entirely imaginary mind palace is just too unreal for me to hold in my mind. But if you tied a mind palace to something tangible (well, more or less) like a Minecraft world, a place with actual houses and paths and rooms, then perhaps it would be a lot easier to store knowledge in. If you go so far as to place things that you want to remember in signs and books, I bet you could remember a lot.

Another good example of a mind palace is my photo folder on my hard drive. I've organized it chronologically and hierarchically, first by year and then by month+day. While I can't remember exactly what happened on which day, using this folder structure as a mental guideline, I could tell you with reasonably high confidence what I was doing at any given month. But only for those months that I have photos of. My hobby of photography has waned a lot over the past years..

tl;dr: create a physical or virtual structure to hold your mind palace, then populate it with real-world information.

Posted in Tech , Thoughts

As days go by

I haven't blogged in a while. Despite having switched from enjoying-life-mode back into grind-and-earn-money mode, I've managed to maintain a remarkable sense of self-actualization over the past few weeks. I think the reason for that is partly because I try to work shorter days, as I mentioned in the previous post. I get time to recover and clear my mind at the end of the day, rather than never fully clearing it and piling up new workloads the next day without having fully processed the previous day.

Working fewer hours is part of the reason, but also a consequence of something else. My goals in life have become startlingly clear to me after I found out exactly how much money I need to buy a house in this bloody country. It'll take years and years of savings to fully pay off a nice house. Even if I found a better paying job, the difference it would make will never be as significant as I want it to be. And even with a better paying job you're bound by obligations and forced to work for the better part of the year. Given that fact, I'd say I've got a pretty damn good job right now, and I see no reason to change it for something marginally better.

Financial independence is the final goal. It's not even worth thinking about what I'll do after I achieve it, because the possibilities will be endless. In the past I tried several times to 'do a startup', sometimes alone, sometimes with friends. But what I've come to realize is that the startup life is not something that I want for myself. I'm usually quite introverted, and although I learned that I can muster up the extroversion needed to function capably in a startup role, it's not something I enjoy doing or would feel comfortable with doing for a long period of time.

This is the point where people tell me "but to gain something you will have to step out of your comfort zone". Well, yes and no. Stepping too far out of your comfort zone is simply not sustainable and will wear you down. For me, I think I function at my best while 95% within my comfort zone, using the remaining 5% to explore new territories. I need to find things out for myself. Advice from others only helps at the most superficial level; any concrete advice will be noted only for reference while I make my own mistakes, from within that very comfortable 95% plan.

Realizing that I am more reluctant to leave my comfort zone than I previously thought, I began to list my options. The list is limited, of course, compared to before, but the remaining options are those that I feel much more enthusiastic about than anything else. And because the options are 95% within my comfort zone, I get to expand my knowledge while actually enjoying it rather than feeling stressed out.

I don't believe that any advance in knowledge in the field of programming is going to help me to make progress as a human being. While it's true that I'm getting better at coding, especially within a project atmosphere, most of the things that I learned, that I value highly, are as a result of interactions with people. Focusing deeply on a topic will teach you two things: in-depth knowledge of the topic, and how to focus deeply. I think I've learned enough on how to focus deeply on something to apply it to things other than programming. Don't get me wrong, I still love to code. But I find that a lot of my peers see coding as the final goal, whereas whatever the thing is that they're coding is just a happy side effect. I want to use programming as a means to an end, whatever end that could be, even if it has nothing to do with coding or dev-ops or anything technical. I believe that if I can use programming in this way, I can become better as a person.

Posted in Daily Life , Tech , Thoughts

The law of diminishing returns

There's an ideal amount of time you can spend at work, working. In fact there's more than one ideal amount of time. In my case, I find that if I work for 6 hours and then go home, I still have enough mental energy left to work on personal projects after the commute. Working 8 hours is also good, although productivity does decrease a lot in the later hours. But it's better than working 7 hours, because in that case I find myself both mentally tired and without enough time and mental energy to do stuff at home.

Posted in Daily Life , Tech