The 9gag data mining scam

(Or: what is your <insert meme here> name? Click here to find out.)

Perhaps you're an occasional browser of 9gag or similar sites, and have come across images like these before:

(image: Instagram-TropicalHouse-8dc5ec)

People inevitably reply in the comments section with their hilarious name. Lately I've seen a lot more of these images than usual, and I'm guessing it's actually an attempt at data mining by some hacker trying to dox people. The questions on the images vary; sometimes it's the first letter of your name, sometimes it's the date you were born, the month you were born, the first letter of your mother's maiden name, and so on. People who are stupid enough to reply to these images with that data eventually create a very usable data trail that can be used by hackers to impersonate them on the phone. So yeah, have fun explaining to your bank that you publicly posted all details about your life in a poorly obfuscated manner.

Posted in Tech

Responsiveness and Patreon!

As you may have noticed from the blinding orange link at the top of this page, I am now on Patreon :)

That said, I am fully aware that this is a personal blog with personal content, but on the off chance that someone enjoys my writing and would like to support me, I thought Patreon provides quite a nice way to do that. I'm perfectly happy to never get anything from Patreon and will still continue blogging, as I have done for the past 11 years. Yes, it's been 11 years. Interestingly, that makes this blog the longest-lasting thing/activity/venture that I have in life. With the possible exception of my iPod classic; those things last forever.

Creating a Patreon page was quite interesting. It really forced me to think further than 'It would be great if people could throw money at me for writing silly stuff like this'. I had to think about what value this blog creates, what topics are the most likely things people want to see, and what is the best way to take payment. It didn't make sense to offer payment per blogpost, since I would forfeit my freedom to write silly things that I wouldn't dare ask people money for. Monthly contribution seems to make more sense. I assumed that of my writings, the cycling trip day-by-day travel reports are probably the most interesting to read, so any funding I get on Patreon will likely go to the cycling cause.

As a secondary effect, every time I think about increasing public exposure of this blog I get nervous about its quality. I can't do much about the content after it's been written, but I can do something about the styling and user-friendliness. So yesterday I wrote some css that will hopefully make this blog a lot more mobile-friendly. It's still nowhere near perfect, but it'll have to do for now.

Random life thing: I'm supposed to be preparing for the upcoming cycling trip later this month but instead I'm currently experiencing quite possibly the worst cold I've ever had in my life. I'm feeling myself start to recover, but it'll be a few more days until I'm back to normal. I've got some ideas for posts lined up but nothing written down yet. There'll be something new here soon. Stay tuned!

(ZEST)

Posted in Tech

Why I won't be moving away from WordPress

I've made some time for myself to do some personal projects, and one of the things that's been on my mind for the longest time was to move this blog away from WordPress and host it statically on S3 instead. Serving a single static html page (plus a few resource files, but not many) from S3 would be so much faster than letting the shared hosting server parse endless lines of php, most of which I don't even want. The benefit in that aspect is clear, but in other areas it's less obvious.

Ease of deployment is one of the issues. I'm not bothered about not being able to post by email or mobile - whenever I have something to write I'll usually have my laptop with me. It doesn't bother me that it won't be WYSIWYG either (which it wouldn't be if I write my own blog software because I don't care about WYSIWYG). The problem is always with the software and the libraries. My software of choice for my would-be static blog is python, but I'll inevitably end up requiring some libraries that will need to be downloaded and/or set up on each machine that I want to publish from. Knowing myself, I will forget to do this before I go on a trip and end up having to download those libraries at ultra-slow speed at some hotel in the middle of nowhere. Or I might be on a public machine, which will be even worse.

Comments are another issue. If the blog is completely static I'd have to go for a javascript-based comment service. I also have old comments that will either have to be converted to the new commenting system or else inserted into the html somehow if I want to preserve them. Again, there are solutions, but they're hardly easier than what I have with WordPress.

Finally there's the issue of dynamic pages: the calendar, archive and search functionality. Calendar and archive pages are pretty easily generated, but I'd either have to remove the search function or rely on an external service for that. Bleh.

The system I had in mind would be a collection of small tools:

  • A converter tool to convert a wordpress database into raw blogpost/page files (the exact format of which I would have to think about).
  • A compiler tool that reads the raw posts and writes usable html
    • Inserts content into a predefined template, I was thinking Django templates. This doesn't have to be fast.
    • Regenerates related pages (front page, archives, calendars, category pages)
    • Converts any markdown text to html
  • An uploader tool that publishes the generated html
    • Would upload to S3, perhaps pluggable so other providers would work too.
    • Would be smart enough to recognize linked images/files and upload those too.
I'd leave the template design and making the javascript commenting system work all up to the user, since that'd be a one-off job (for me, anyway). The goal of all this is to make my life easier, but it's an awful lot of work for an awfully small amount of easiness. Anyway, if someone else would also be interested in using something like this (or willing to pay for something like this!) then do let me know.
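To make the "compiler" step a bit more concrete, here is a minimal sketch in Python. It is not an actual implementation of the tool described above: the raw post format (first line is the title, the rest is the body), the `tiny_markdown` and `compile_post` names, and the use of `string.Template` in place of Django templates are all assumptions made to keep the example dependency-free. It also handles only a tiny markdown subset and does no HTML escaping.

```python
import re
from string import Template

# Hypothetical page template. The post mentions Django templates; the
# standard library's string.Template is used here only to avoid dependencies.
PAGE = Template(
    "<html><head><title>$title</title></head>"
    "<body><h1>$title</h1>$body</body></html>"
)


def tiny_markdown(text):
    """Convert a very small markdown subset (paragraphs, **bold**) to html."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    html = []
    for paragraph in paragraphs:
        paragraph = re.sub(r"\*\*(.+?)\*\*", r"<b>\1</b>", paragraph)
        html.append("<p>" + paragraph + "</p>")
    return "".join(html)


def compile_post(raw):
    """Turn a raw post (assumed format: title line, then body) into a page."""
    title, _, body = raw.partition("\n")
    return PAGE.substitute(title=title.strip(), body=tiny_markdown(body))
```

Because this uses nothing outside the standard library, it would sidestep the "download libraries at a hotel" problem, at the cost of reimplementing things a real markdown or template library already does well.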

 

Posted in Tech

The rational sacrifice

Check out this post on Elon Musk on Wait But Why. The post goes into great detail about how Musk reasons from first principles. By starting at the beginning and thinking about what's best for humanity as a whole, Musk ends up being motivated to make renewable energy and travel to Mars a reality. He didn't just start there from scratch of course, and the article shows how he worked his way up from internet startups towards more lofty goals. It's a good read, highly recommended.

Somehow, when I think about my own life in this way, trying to be as rational as possible, I do not find myself reaching the same conclusions as Musk has. I think what Musk has done is, if you consider the human factor, not the most rational solution to the problem 'what should I do with my life'. And that makes what he does all the more admirable. I'll try to explain my reasoning with some personal life examples.

In my day job as software developer it pays off to be completely and coldly rational about your product. For example, even if you only intend to be on a project for 6 months, it still pays off to focus on the extreme long term of the project if it was made to last. Rationally, what is best for the project is also best for you as a developer, because you are accountable for the state of the project. If the project goes well and continues to go well in the future, that means people think highly of you and will consider you again for future projects. Personal motivation and that of the whole are aligned.

Now compare that to Musk. He made his personal motivation to be the motivation of the whole. There is just no way that this can be an intrinsic, gut-feeling type of motivation, and he says so himself. He reasoned from first principles and arrived rationally at a conclusion about what he should do with his life. I know how this type of reasoning works; I use it all the time at work. In my case it means that I choose to increase test coverage, review code that someone else has already reviewed but I really want to be sure of, or spend a day debugging some hard-to-catch bug on production -- all tedious tasks that don't improve my knowledge as a developer. As a developer I can improve myself faster by learning new frameworks, trying new languages and venturing out into different areas of software. But I am employed for one particular project, and that project benefits the most from me doing what needs to be done, because no one else will do it.

I have no Musk-comparison for the opposite case: the case where you do something out of your gut feeling because you know it feels right, uncaring about the consequences, aiming towards nothing rational in particular but just wanting inner peace. But I do have an example of my own: that of cycling. Being on the road all day, seeing many things as I cycle along, with only room for one goal in my mind, sometimes pondering the book I've read the previous night. Cycling trips to me are a form of meditation, a way of clearing out my mind of unwanted thoughts and focusing on the here and now. It is a mental reset that, as far as I'm aware, I can only experience in that particular way, and it is a very powerful experience, especially when put into contrast against my daily life.

As someone who considers himself to be highly rational, I find it unreal that I am at my [best, happiest, most peaceful] when I am doing something irrational.

But then, rationally speaking, if I know that my mind and body can provide me with this peace of mind if I seek the irrational path, isn't that the path I should rationally be pursuing? And isn't everything else in my life second to that goal? It's a deeply personal (even spiritual?) goal, one that is of no use to the people around me, society or humanity as a whole. I am no Elon Musk, and I can not rationally justify me sacrificing myself to save humanity. Or perhaps I just rate my chances of success pretty low. Either way, I know I will never be spiritually motivated that way, and I think neither can he be, which is why I think of what he's doing as a sacrifice. I respect him for that, but I do not want it for myself.


Reading back what I just wrote I realized there's a contradiction between how I describe my goals versus the whole in the third and fourth paragraphs. Although I wrote that I don't want to sacrifice myself in the way Musk does, it appears that I'm doing exactly that in the smaller-scale context of my project. That's probably not good. I think I can attain higher goals before resorting to a self-sacrificing position.

Posted in Tech , Thoughts

Fascinating linkdump

Yup, I'm still alive :) busy moving apartments, but otherwise still interested in THE FUTURE.

I've also recently finished a Scalable Machine Learning course at edX. It was my first time trying out an online course and it turned out to be quite interesting. Especially the final week's assignment produced some really cool results. Apache Spark is so much nicer to work with than Hadoop.

Posted in Tech , Thoughts

A Brave New Internet (or: why I stopped using Twitter)

Twitter has been one of my favorite Internet places for at least the past 5 years. Twitter always distinguished itself from Facebook for me because of its 'just a quick thought'-ness. Anything you can think of, just dump it on Twitter. Friends might follow you on Twitter, but thoughts on there are generic and meant to be seen by the world. Anything personal that I want kept between my circle of friends goes on Facebook, anything that doesn't require a friend context goes on Twitter.

Or at least that's how I started using Twitter, but I no longer use it that way. I've been tweeting my random thoughts less and less - more on that later. My main use of Twitter for the past few years(!) has been to complain. There's nothing more satisfying when your train is running late again than to fire off an angry tweet towards the train company (that's you, TFL). Or if my mobile phone's internet has failed yet again during my daily commute (that's you, Three). Or if the software for my fitness tracker is just so shit I can't bear to use it (that's you, Garmin). But I digress. What used to be a frivolous form of quick mind-blogging has turned into an utterly useless hatefest. It's not healthy to use Twitter just for that.

So what about the other use? What about the short random thoughts? I had a fun random thought the other day at work, while performing a request for a client that was a bit out of my comfort zone, but at the same time no trouble at all and quickly handled. In a split-second I came up with "I'm a developer, not a nanny", fired off a tweet and forgot about it. My mood never darkened, I didn't brood on it, I just thought it was a funny thought because it drew parallels with the "I'm a doctor, not an X" meme from Star Trek which (I assumed) would resonate among my developer friends that are following me. Instead I got a concerned message from one of the people I work for asking me if everything was OK.

This is wrong on so many levels, but the main thing that bothers me is this: I know my colleagues quite well (I think), and I think they know me quite well. They know my personality and they know I'll speak up when something bothers me, yet somehow the off chance that an extremely generic statement I make on a personal Twitter account might somehow end up reflecting badly on my client, or my client's client, means that I will be spoken to by someone who did not understand the context in which the remark was made, or what it was even meant to represent.

And you know what the worst part is? I am in the wrong! I'm not saying that sarcastically, I truly believe that I did the wrong thing. I am wrong to assume that it's ok to share a short message publicly without context and expect people to understand all of its nuances. The past that I fondly remember is not 'better' because people back then knew you better and knew to take things in context or not place too highly a value on it; it's just that I used to get away with it because it just didn't occur to anyone to check on Twitter what people you know are saying in public. You can argue very strongly for the right to say anything you want on the internet and get away with it but from a purely game-theoretic perspective an employer would be stupid not to check. All things being equal you'd rather have an employee with zero public presence than an employee with a potentially negative web presence.

I don't fault people for thinking this way but, again from a game-theoretic perspective, my chances of remaining employed, and getting job interviews, only increase by shutting down my Twitter account. I'm not saying any public presence is a risk by definition. Github is a really good example of something that will very likely benefit you, even if you don't write a lot of code publicly. Even if you suck at coding, Github would be a representation of that. It's hard (but not impossible!) to take source code out of context, especially compared to a message of max 140 characters.

This is not 2005 any more. Back then as a fresh 20-something in Japan I could tweet and blog anything I liked without consequences. But in 2015 as a 30-something trying to be a responsible developer you simply can't blurt out random things in public.

Lack of context creates misunderstandings. I truly believe in openness of information and that, provided all parties have the full context available, more information can only have a positive effect. But no one on the internet has the time or the interest to research the full context of something before making up their opinion. That simple little fact makes Twitter a risk without a reward.

 

I've pondered on whether I should write about this at all, and I'm still pondering about closing this blog again in favor of having an anonymous blog, which is what I did a while back. I came back here, to the good old Colorful Wolf, because I believed that I could provide the context people need to understand me. I naively believe that I still can have a net positive effect on the world by writing and sharing the things that interest me.

Posted in Tech , Thoughts

A tale of two bugs

A long time ago in a content management system far, far away.. a manager was sending out lifecycle emails. These emails would be sent out by the CMS automatically once a day, iterating over all recipients in the system. The CMS would weed out the recipients that did not match the email's filters, those recipients that had opted out, and those that had already received the email. It would then send the email to all the new recipients every night.

Sounds fairly simple, right? Let's complicate things a little: one of the sources of email addresses for these lifecycle emails was a customizable form that would be presented on the public-facing site. Since the email might want to reference some of the fields in the user's form submission, the email module had to be made aware of this data. Because both the email module and the forms module were unaware of each other some glue code had to be introduced: an automated task runs every night and imports form submissions from the forms module and turns them into usable recipients for the email module. For each form submission the task first calculates a checksum over the user's data; if there's already a record in the email module that matches the checksum then the same record won't be copied twice.

This system in the way that it's described here ran for many months without any issues. Until one day, just before I was about to leave of course, a project manager came to me and asked: "How come this email only has 300 form submissions but has been sent 3000 times?". A quick check of the database and server logs didn't show up any obvious bugs, so I knew I'd be working through the evening that day.

The logs confirmed that an email was indeed being sent multiple times to the same recipient. I began by trying to reproduce the problem, and immediately hit frustration. I tried to manually add myself multiple times as a recipient to a new email, but I only received it once. Then I tried to reproduce the entire workflow, creating a form on the public site and trying to register myself via there. Since the form wouldn't allow me to register multiple times, I manually added multiple form submissions for myself with the same email address. That didn't work either. Not only that: the glue code successfully detected that my multiple form submissions had the same checksum, and only created a single email recipient in the email module.

If there's one thing that sucks about programming, it's when things work when you expect them to fail. At least when things fail you've got an obvious thing you can work on, but if things are (seemingly) working you need to figure out how to break them, and that can be way more difficult.

I went back to the logs of the original email to try and find out if there was a pattern to the email addresses that were receiving the email multiple times. Something immediately stood out: out of all the addresses that the email was being sent to, only a single address received the email more than once. The real moment of clarity came when I saw the address: rather than being all lowercase, it looked something like this: "AAaaAA@BBbb.com".

With this clue I managed to reproduce the problem. I first tried to add my own email address to the email module directly using mixed case letters, but this didn't cause me to receive multiple emails. So I repeated the test doing a form submission, ran the glue code task and sent the email again. It worked! Finally I managed to reproducibly break the system.

Now I had two major clues: a) it only happens when the email address is not all-lowercase, and b) it only happens on form submissions. It was pretty easy to debug from there. Here's a step-by-step of what happened.

  • The daily email task starts up and tries to send an email to 'AAA@BBB.COM'
  • It finds that no email was sent yet, so it re-saves the email recipients with the lowercase address 'aaa@bbb.com'. It then sends the email and creates a log record confirming that the email was indeed sent to 'aaa@bbb.com'.
  • The daily data duplication task runs and happily (but mistakenly) adds a new email recipient with the address 'AAA@BBB.COM'.
  • The next day the whole process repeats itself.
Two major bugs!
  1. The email task checked if a sent record existed before sanitizing (lowercasing) the email address.
  2. Because the record in the email module got sanitized, its checksum changed. So the glue task ended up duplicating the same record over and over again every time it ran.
If there was just the data duplication bug then the email address would have gotten sanitized correctly, and only one email would ever have been sent out. If there was just the address sanitizing bug then the recipient would have been sanitized correctly after the first time, and only one email would ever have been sent out. With both bugs present things turned into an infinite loop of sanitizing and duplication.
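The interaction between the two bugs can be replayed with a small simulation. This is only a sketch of the behaviour described above, not the CMS's actual code: recipients and form submissions are reduced to bare email addresses, and the names `checksum`, `glue_task`, `email_task` and `simulate` are made up for illustration.

```python
import hashlib


def checksum(email):
    """Case-sensitive checksum, as in the buggy glue code (bug 2)."""
    return hashlib.sha256(email.encode("utf-8")).hexdigest()


def glue_task(submissions, recipients):
    """Copy each form submission unless a recipient has the same checksum."""
    existing = {checksum(r) for r in recipients}
    for email in submissions:
        if checksum(email) not in existing:
            recipients.append(email)
            existing.add(checksum(email))


def email_task(recipients, sent_log):
    """Send to recipients without a sent record; returns emails sent."""
    sent = 0
    for i, email in enumerate(recipients):
        already_sent = email in sent_log  # bug 1: checked before lowercasing
        recipients[i] = email.lower()     # the sanitized record is saved back
        if not already_sent:
            sent_log.append(email.lower())  # log holds the lowercase address
            sent += 1
    return sent


def simulate(submissions, days):
    """Run both nightly tasks for a number of days; returns total emails sent."""
    recipients, sent_log, total = [], [], 0
    for _ in range(days):
        glue_task(submissions, recipients)
        total += email_task(recipients, sent_log)
    return total
```

Under these assumptions `simulate(["AAA@BBB.COM"], 5)` sends five emails, one per day, while `simulate(["aaa@bbb.com"], 5)` sends just one, which matches the observation that only the mixed-case address was affected.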

We deleted the duplicate data, made sure that the checksum check would lowercase all fields when calculating the checksum, and modified the email task to sanitize the email address before doing the 'was-already-sent' check, not after. It would have been better to sanitize all of our existing data and to rewrite the email module, but that never happened. The entire CMS, including the email module, eventually got rewritten from scratch, but that's a story for another time.
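The two fixes described above might look roughly like the following. Again this is a hedged sketch rather than the real patch: records are reduced to bare email addresses, and `sanitize`, `checksum`, `glue_task` and `email_task` are hypothetical names.

```python
import hashlib


def sanitize(email):
    """The only sanitizing step assumed here is lowercasing."""
    return email.lower()


def checksum(email):
    """Checksum over the sanitized address, so case no longer changes it."""
    return hashlib.sha256(sanitize(email).encode("utf-8")).hexdigest()


def glue_task(submissions, recipients):
    """Copy each form submission unless a recipient has the same checksum."""
    existing = {checksum(r) for r in recipients}
    for email in submissions:
        if checksum(email) not in existing:
            recipients.append(email)
            existing.add(checksum(email))


def email_task(recipients, sent_log):
    """Sanitize first, then do the was-already-sent check."""
    sent = 0
    for i, email in enumerate(recipients):
        clean = sanitize(email)
        recipients[i] = clean
        if clean not in sent_log:
            sent_log.append(clean)
            sent += 1
    return sent
```

With either function fixed on its own the repeated sending stops; fixing both also stops duplicate recipient records from piling up.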

Besides the fix, the most important lessons learned were in procedure. Always have an extra set of eyes on critical code. Code reviews are massively useful to catch problems like this. The same goes for unit testing: if we had thought of testing the email address sanitizer, it's quite likely that we would've thought of testing it with mixed case email addresses. Neither code reviews nor unit tests are an absolute guarantee that problems like this won't happen, but they make it a hell of a lot less likely.

Posted in Tech

How to deal with decreasing awesomeness of the internet

Is the internet well? Opinions may differ, but I'm here to evangelize mine: No, the internet is not well. The problem is with the concept of the internet as a place of freedom; a place where anything can be said and done, and shared with everyone all over the world.

That functionality, although it still exists, is becoming more and more regulated. The authorities have caught up, and certain things are starting to move to the fringes (eg. Tor hidden sites). It starts with all the blatantly illegal things, but whole-internet censorship is really not far away in many countries, including the UK, where I live.

It's not just legality issues that are making the internet less free; a far worse culprit is monetization. Youtube clips have ads on them, streaming websites throw ads in as well. No more free database service for you; the internet rule of the day is that either you pay premium, or you accept that you'll get annoyed at crap advertisements that people throw at you.

Then there's politics. Surveillance. All of these reasons make it very likely that, a few years from now, you'll no longer be able to access your favorite service for free. You don't have to agree with this, of course. Perhaps you think that the internet will be a happy shiny perfect place in 5 years, but in my opinion signs point in the other direction.

So what to do? First things first: GET YOUR DATA OUT. Never, ever, keep your only copies of things in Flickr, Facebook, Google or even Dropbox. Don't trust backup services either; they're a nice extra, but they may go bankrupt or become unreachable any time. Think of the legal issues involved: do you know exactly which files you are backing up, and do you know the rules of possessing those files in the country you are storing your backups in? I'm guessing no.

In the Netherlands it used to be legal to make a copy of any copyrighted thing, as long as it is for your own personal use and you won't commercialize it. That is no longer true, because it conflicts with European regulations which override it. Historically, this kind of thing has been hard to enforce, but the authorities have done an amazing job of catching up, and they're only going to get more in-your-face about it in the future. If you can legally justify getting something that's available in your country right now, get it now. Laws will change, availability will decrease, and you will be branded a criminal for doing something that was legal just a few years ago.


Edit: perhaps I was wrong. Google Earth was one of the things I was thinking of while writing this, and I expected the free version to go away in the future. Instead, one day after I wrote this, Google made Earth Pro free. Perhaps the 'users are the product' philosophy will set us all free?

(no)

 

Posted in Tech , Thoughts

Gumbug: A better way to browse real estate

Last summer I really wanted to find a decent rental apartment around London. Every day I scoured Gumtree, Rightmove and the likes in search of something affordable. In the end I decided to wait until I was able to buy an apartment instead, but I spent several weeks searching and getting annoyed at real estate sites nonetheless. I decided I could save myself a lot of time and effort by automating some of the steps of my search process. My search process went roughly like this:

  •  Go to Gumtree, search by location and price
  • Mentally filter out all the ads that I'd already rejected, usually because they were old or just looked crappy
  • Check the new ads, decide which ones I might be interested in based on my more subjective criteria (not ground floor, too far from public transport, high-crime area etc.)
  • Repeat the above process for a different set of locations
  • Repeat the above process for all locations on a different website (Rightmove, Zoopla etc.)
Thus Gumbug was born. Initially it was meant to search both Gumtree and Rightmove for rental apartments, but I've adapted it to only do Rightmove's To Buy section, for now. I've found a lot of duplication between sites that are listing property to sell, whereas for rental apartments there was often a whole category of quirky private listings that would only appear on Gumtree. The need to scrape multiple sites seems a lot less when only considering things to buy.

You can find Gumbug on github: https://github.com/rv/gumbug. I'm also running a semi-public version of it on Heroku, although it won't be very fast if a lot of people end up using it. You can have a play with it here: http://floating-forest-4090.herokuapp.com/, or to see some example search results, have a look at this link: http://floating-forest-4090.herokuapp.com/s/gzr1vwthsd. Since it might not handle the load, I'll describe how it works.

For each search you can add multiple sources, which are all consolidated into one page. I tried to avoid pagination of things as much as possible because I just want to see everything on one big page that I can scroll through at my leisure. If a listing appears on more than one source url it'll only appear once in the results. If the listing is already in the system its details won't be re-fetched every search, to save time. Adding urls as input might be a bit 'techy' but it saves a lot of coding time and allows me to specify a whole bunch of hard filters right at the source, since the url can already contain filters for price range, number of bedrooms etc.
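As a rough illustration of the consolidation step (not Gumbug's actual code, which is in the repository linked above), the following sketch merges listings from several source urls, shows each listing once, and only fetches details for listings that aren't already known. The function name, the dict-based cache and the `fetch_details` callable are all assumptions.

```python
def consolidate(sources, cache, fetch_details):
    """Merge listings from several source urls into one de-duplicated list.

    sources: dict mapping a source url to the listing ids found on it.
    cache: dict of listing id -> details already in the system.
    fetch_details: callable used only for listings that are not cached yet.
    """
    results = []
    seen = set()
    for listing_ids in sources.values():
        for listing_id in listing_ids:
            if listing_id in seen:
                continue  # already shown via another source url
            seen.add(listing_id)
            if listing_id not in cache:
                cache[listing_id] = fetch_details(listing_id)
            results.append(cache[listing_id])
    return results
```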

Keywords

You can add a list of keywords to ignore and a list of keywords that are required. Eg. you can ignore 'ground floor, retirement' and you can require 'leasehold'. For the ignored keywords, if a listing contains at least one of the keywords, it'll be marked as ignored and moved to the bottom. For the required keywords, if an ad doesn't contain at least one of the required keywords, it will also be marked as ignored and moved to the bottom.
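The keyword rule can be sketched in a few lines. This is an illustration under assumptions, not Gumbug's real implementation: the function name is made up, and case-insensitive substring matching is a guess at how the keywords are compared.

```python
def is_ignored(text, ignored_keywords, required_keywords):
    """Return True if a listing should be marked as ignored.

    A listing is ignored when it contains at least one ignored keyword, or
    when required keywords are given and it contains none of them.
    """
    lowered = text.lower()
    if any(keyword.lower() in lowered for keyword in ignored_keywords):
        return True
    if required_keywords and not any(
        keyword.lower() in lowered for keyword in required_keywords
    ):
        return True
    return False
```

Since Python's sort is stable and `False` sorts before `True`, something like `sorted(texts, key=lambda t: is_ignored(t, ignored, required))` would move the ignored listings to the bottom while keeping the original order otherwise. Note that plain substring matching also flags a listing that says "not ground floor", a weakness of keyword filters in general.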

Filter by distance to public transport

The public transport filter lets you select the stations you wish to be near to (or far away from). The list of stations is prepopulated from the zoned stations around London, but it'll automatically update after every search. If you add at least one station filter, all the listings will have to match at least one of your station filters, or else they will be ignored. Eg. if you add two filters: between 0.0-0.5 miles from Chesham station and between 0.2-1.0 miles from Amersham station, a listing must be either close to Chesham or close to Amersham (but not necessarily both) to match.
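The "match at least one station filter" rule could be expressed like this. It's a sketch only: the data shapes (a dict of station name to distance in miles, and filters as `(station, min, max)` tuples), the function name and the inclusive bounds are assumptions.

```python
def matches_station_filters(distances, filters):
    """Return True if a listing satisfies at least one station filter.

    distances: dict mapping station name -> distance in miles for a listing.
    filters: list of (station, min_miles, max_miles) tuples.
    With no filters at all, every listing matches.
    """
    if not filters:
        return True
    for station, min_miles, max_miles in filters:
        distance = distances.get(station)
        if distance is not None and min_miles <= distance <= max_miles:
            return True
    return False
```

With the example from the text, `[("Chesham", 0.0, 0.5), ("Amersham", 0.2, 1.0)]`, a listing 0.3 miles from Chesham matches even if it is miles away from Amersham.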

The distance filter is pretty stupid because distances are simply scraped from Rightmove, which (as far as I can tell) only shows straight-line distance. You might have to make a massive detour to get to the station, but Rightmove will still happily report that the listing is right next to the station.

Once the search is complete you get to see all the results on one page: all their images, important information and a map. No useless clicking through tiny thumbnails here. The key feature in the search results page is this: you can manually mark listings as either favorited or ignored, and any future searches you do from that particular search result page will preserve your favorites and ignored listings. So let's say you haven't searched anything for a week or so, all you have to do is press the search button to perform the exact same search again to get the new listings. Gumbug will pre-filter the new listings according to your criteria and will automatically move the ones you've already ignored manually down to the bottom.
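One way to picture how a repeated search keeps the manual flags: the stored flags are looked up per listing and combined with the automatic filters before sorting. This is a guess at the logic, not the real code; the names, the flag values `"favorite"` and `"ignored"`, and the choice to let a manual favorite override an automatic ignore are all assumptions.

```python
def order_results(listing_ids, manual_flags, auto_ignored):
    """Order listings so that ignored ones end up at the bottom.

    listing_ids: ids returned by the latest search, in their original order.
    manual_flags: dict of listing id -> "favorite" or "ignored", kept
        between searches for the same search result page.
    auto_ignored: set of ids rejected by the keyword and station filters.
    """
    def is_ignored(listing_id):
        flag = manual_flags.get(listing_id)
        if flag == "favorite":
            return False  # assumption: a manual favorite beats the filters
        return flag == "ignored" or listing_id in auto_ignored

    # sorted() is stable, so the original order is kept within each group.
    return sorted(listing_ids, key=is_ignored)
```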

So, why am I showing the ignored listings at all, if I'm clearly not interested in them? The reason for this is that humans (especially real-estate agents) make mistakes. They will mislabel things, forget to mention a keyword that every other ad that you're interested in has, or they'll add something stupid like "not ground floor" which throws off the keyword filters.

A second reason to display ignored listings is because you might be sharing the link to the search results with more than one person, and the other person might want to un-ignore a listing. Gumbug isn't exactly built on security: any person that you share the search results url with can favorite and ignore listings. This is great for me because I want to share search results with my girlfriend so she can go through them as well, but when sharing in public it's better to spawn a new search with a new url.

Lastly, there's the map. One of the things I've consistently found myself doing when checking listings, is to cross-reference the area with the deprivation map, which gives a rough indication of how much crime/poverty/incidents/bad things there are in an area. You can also click the name of each public transport station to display walking directions, so you know if that 0.6 miles is actually 0.6 miles (hint: it usually isn't).

Deprivation and Directions

Gumbug will continue to be a work-in-progress, but it's reached a point where I'm quite able to use it to make my own life easier. Maybe it can help someone else too. Here are some of its issues:

  • When you flag something as ignored and then go to the next page, the ignored listing will pop up again because it's been moved to the back of the sort order.
  • No street view support yet
  • Some map issues when viewing on mobile
  • No floor plans yet
Feel free to give it a try on Heroku. If for some reason your search doesn't seem to be working then that might be because the worker process is not running. Since Heroku's not cheap I'm running the worker process on my local machine. Heroku's database is very tiny so it might fill up very quickly. If there's enough demand I could consider setting up a more proper version of it, so consider this an attempt to gauge the public interest. Let me know what you think :)

Posted in Tech , UK