Revealing a bit more about MediaList

What is it about programming that makes it so fun to do at night? Or so terrible to do in the morning? Maybe I'm just a night person.

I ran into an uncomfortable realization yesterday while working on MediaList. Since I've switched from Java to Python I've focused on keeping my code clean, empty and generally sense-making. I decided to prioritize readability and cleanliness over performance, which is something I seldom if ever do in Java. Figuring that this is a hobby project I thought I might get away with it. The future will prove me right or wrong, but I'm starting to have my doubts already.

As a way of generating a large volume of high-quality content for my site dynamically I'm planning to let the site's users input URLs of other websites. MediaList will load the site, scrape all the relevant info from it and insert it into a database. I've already mentioned before that the site allows you to rate stuff, so here's an example: you can rate a movie by pasting a link from IMDb.com. MediaList will then fetch information from the page on IMDb, like the movie title, year it was released and its duration. I chose not to let users add this content manually for two reasons. One: I don't have users. (And I'm sure as hell not going to add all that crap myself.) Two: letting people input things manually will surely mean a lot of mistakes. With a large community that's not an issue as moderators will notice the mistakes and fix them, but with a small userbase (or none at all) it's just a lot easier to scrape the data from somewhere else.
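The first step of that flow, recognizing a pasted IMDb link, could look roughly like the sketch below. It assumes IMDb movie pages live under `/title/tt<digits>/`; the function name `imdb_title_id` is made up for illustration and is not MediaList's actual code.

```python
import re
from urllib.parse import urlparse

# Assumption: IMDb movie pages live under /title/tt<digits>/.
_TITLE_PATH = re.compile(r"^/title/(tt\d+)(?:/|$)")


def imdb_title_id(pasted_url):
    """Return the IMDb title id from a pasted URL, or None if it is not one."""
    url = pasted_url.strip()
    if "://" not in url:
        # Users often paste links without a scheme.
        url = "http://" + url
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    # Accept imdb.com and its subdomains (www.imdb.com, m.imdb.com, ...).
    if host != "imdb.com" and not host.endswith(".imdb.com"):
        return None
    match = _TITLE_PATH.match(parts.path)
    return match.group(1) if match else None
```

Rejecting anything that isn't a recognizable movie link before fetching it keeps junk out of the database, which is the whole point of not letting people type things in by hand.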

Here's where I ran into a lot of problems. IMDb does not have an official API, and the unofficial ones don't have information about the IMDb URL belonging to each movie, which is vital for my purposes. I decided to take a rather risky step and parse the raw HTML from IMDb directly. It's risky because the markup can change at the whim of the people at IMDb, and when that happens I'll have to update my parser. IMDb, if you're reading this, a public API (preferably using JSON) would be awesome.

After messing around with Python's htmllib and sgmllib I realized that they both sucked, and that if I wanted to get something done quickly (in dev time, not in processing time) I'd need a DOM parser. After sniffing around the net a bit I quickly found BeautifulSoup, a wonderful piece of code that builds a DOM tree and provides search functions for *ML documents. The code I wrote using BeautifulSoup is easy to understand and easy to modify, quite unlike the turd I wrote with sgmllib.
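To give an idea of why the DOM approach is pleasant to work with, here is a minimal sketch. It assumes the current `bs4` package and Python's built-in `html.parser` backend. The markup in `SAMPLE_HTML` and the function `parse_movie` are invented for illustration; IMDb's real HTML looks different and, as said, can change at any time.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; IMDb's real HTML differs.
SAMPLE_HTML = """
<html><body>
  <h1 class="title">Example Movie</h1>
  <span class="year">1994</span>
  <span class="runtime">142 min</span>
</body></html>
"""


def parse_movie(html):
    """Extract title, year and duration (in minutes) from the sample markup."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1", class_="title").get_text(strip=True)
    year = int(soup.find("span", class_="year").get_text(strip=True))
    runtime = soup.find("span", class_="runtime").get_text(strip=True)
    minutes = int(runtime.split()[0])  # "142 min" -> 142
    return {"title": title, "year": year, "minutes": minutes}
```

Each field is a single search on the tree, so when the page layout changes only the matching `find` call needs updating.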

Building a model of a single IMDb page in BeautifulSoup takes mere milliseconds on my system at home. The bottleneck lies in fetching the URLs from IMDb. While the time it takes for a single URL is acceptable, importing a list of several hundred URLs takes painfully long, and is not something I can let the users wait around for. Fortunately I only have to load each URL once, after which the information is cached inside MediaList. If the information on IMDb was incorrect (which, after testing this feature, turns out to be the case more often than I imagined), the information inside MediaList will also have to be updated. Manually.
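The fetch-once idea can be sketched as follows. This is not MediaList's actual code: the class name `PageCache` is hypothetical, an in-memory dict stands in for the database, and the fetch function is passed in so the example runs without touching the network.

```python
class PageCache:
    """Fetch each URL at most once and serve later requests from the cache."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable taking a URL and returning the page
        self._pages = {}     # stand-in for a real database table

    def get(self, url):
        """Return the page for url, fetching it only on the first request."""
        if url not in self._pages:
            self._pages[url] = self._fetch(url)
        return self._pages[url]

    def import_urls(self, urls):
        """Import a list of URLs; duplicates and known URLs cost nothing."""
        return {url: self.get(url) for url in urls}
```

This doesn't make the first import of several hundred URLs any faster, it only guarantees that the slow part never happens twice for the same movie.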

Still, after all the work on the MediaList concept, this has been the only potential performance bottleneck I've encountered. I'm confident that, this point excluded, the site will still work fine up to 1000+ active users, even on shared hosting. Then again, for all I know the site might bog down at 50 users; I haven't tested that yet. I do feel satisfied about the code I wrote though. It's always a trade-off. Pretty code, high performance, rapid development: pick two. I'm glad I chose a different combination for this experiment.

Posted in Tech

Trust yourself over others

Lately I've been shutting myself off from the world a bit. Or rather, from social media and instant messaging. I'm still in the mood for producing, which is why I'm still doing this blog and an occasional tweet as well. It's the consuming part that bores me. Maybe later I'll pick up an interest in it again. But not now.

I've been focusing on two things: working on a website project as a hobby and looking for jobs. The job search isn't going well, but that's because I'm not looking very hard. If I lowered my standards a little I could probably find a job sooner, but since I'm not in any particular hurry I've decided to try and make the most of it. On that note, my opinion of recruiters is dropping every day. They all contact you full of enthusiasm, promising you 10 interviews within the week, but most of them didn't even call me back after the first call. The ones that did were divided into two groups. One group was genuinely enthusiastic and technically well informed, but as time passed they either stopped contacting me or presented me with lame jobs that didn't match my skills at all. The second group did that from the beginning. Some of the company names presented to me by recruiters were high profile, some weren't, but they all had one thing in common: the job description was boring. Maybe recruiting works better for people in sales or some other non-technical field, but so far I would suggest looking for opportunities yourself if you want a good tech job. It seems I'm not the only one who came to this conclusion.

During the job quest I happened to come across an ad for a Django developer that looked genuinely interesting. Quite unlike the dozens of Java ads I've seen over the past month, which all looked alike. I decided to try applying for it, and figured it would be a good idea to have my current hobby project online as a reference. Which meant a lot of work in getting it presentable. For the past week or so I've been spending a lot of time on it, both in the back-end and in the front-end. The website is now near minimum-viable-product so I thought I'd talk about it a bit more. I've been using the internal name 'MediaList' for the project, but decided to rename the project after finding out that all domain names similar to that were taken already... As the saying goes, girls are like domain names: the good ones are already taken. (But you can still get one from a strange country.)

As I was developing the site, strange thoughts entered my head: of fame and fortune, wealth and riches. I would place ads on the site, introduce a paid subscription service so money would flow in and I would never have to work again. As I mentioned before, hanging around in teh startup-frenzied internetz tends to put happy thoughts into your head. I have since repented and forced myself to face the reality, which is that I'll probably never get more than 10000 users, and most of those users won't be willing to pay. Since the whole reason for the site's existence is to fix a pet peeve of mine (well, and to flex my developmental muscles a bit) I've decided not to take any action towards monetizing it. If ever the hosting costs become too high I'll think about it again, but for now it's just a happy hobby project.

At this point I was getting ready to launch into a full explanation of the site, but I realized that that might be a bit much, especially before I have something I can show you. Instead, I'll ask you all to wait a bit longer as I solve the last big issues before release. This week should be it :)

Posted in Tech, Thoughts

Abound

No matter how hard I try, I can't get my photos to have warm colors while I'm in Holland. It's just such a cold country.

Posted in Photography

Uniqueness and Quality

Can Quality be correlated with Uniqueness? If so, how?

Posted in Thoughts

Beethoven's Moonlight Sonata

Have a listen to how different people play the third movement of Beethoven's Moonlight Sonata in different ways.

I'd want to comment on how these people play, but I really can't. I don't know enough about classical music yet. Somehow the fourth video appeals to me the most, although I can't quite put into words why. The third video is also incredible.

(I added a music category to my blog especially for this post)

Posted in Music

Sinterklaas

Here's the trailer for a new Dutch horror movie. It's about the Dutch version of Santa Claus: Sinterklaas. I normally avoid Dutch movies like the plague but this seems so cheesy that it might actually be hilarious.

For something way better, check out this awesome trailer of Hobo with a Shotgun:

Posted in Dutch

Sleep Now

Posted in Japan, Thoughts