Summify’s Technology Examined

Following on from examining Quora’s technology, I thought I would look at a tech company closer to home. Home being Vancouver, BC. While the tech scene is much smaller here than in the valley, it is here. In fact, Vancouver boasts the largest number of entrepreneurs per capita.

Summify.com is a website that strives to make our lives easier and helps us deal with the information overload we all experience every time we sit down at our computers. The founders of this start-up, Cristian Strat and Mircea Paşoi, seem to have all the right ingredients for success. This is their biggest venture so far, but not their first. They have previously built Infoarena.ro and Balaur.ro, which are both focused on their home country of Romania.

“We’re a team of two Romanian hackers and entrepreneurs, passionate about technology and Internet startups. We’ve interned at Google and Microsoft and we’ve kicked ass in programming contests like the International Olympiad in Informatics and TopCoder.”
– Summify Team. “Our Story”

In this post I will look at the technology infrastructure they have built for Summify.com, the details of which they were kind enough to share with me.

The Components Of Summify

  • It retrieves your feeds from Twitter, Google Reader and Facebook
  • It crawls the content from these feeds, converting Twitter links into full webpages
  • It cleans up the content
  • It displays your content via a streaming webpage
  • A daily digest of your feeds is sent to you by e-mail
  • Soon there will be an iPhone app

Technical Challenges

I see two technical challenges that Summify face…

  • Crawling a large volume of feeds and web pages
  • Live streaming updates to the website

To date Summify has crawled over 200 million web pages. This is a fair amount. Even if you are not impressed with this count, you should understand that they are not in control of this number. They crawl on demand. Their user base could spike suddenly, and they would then have to ramp up their volume of crawling quickly. Crawling is easy, but crawling on demand and being ready to display a re-rendered webpage in near real-time is much harder.

If you read my post on Quora’s technology, then you will have some understanding of the problems faced when displaying live updates to the user’s browser. Summify have similar challenges and they are using many of the same technologies as Quora does, such as Tornado for dealing with the large number of concurrent connections you see when you implement long-polling.
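To make the long-polling idea concrete, here is a minimal sketch using Python's standard asyncio library rather than Tornado. The `Feed` class and its `publish` and `poll` methods are hypothetical names invented for this illustration and are not Summify's code. A request "parks" on the server until new content arrives or a timeout expires.

```python
import asyncio


class Feed:
    """In-memory stand-in for a user's stream of articles (hypothetical)."""

    def __init__(self):
        self.items = []
        self._changed = asyncio.Condition()

    async def publish(self, item):
        async with self._changed:
            self.items.append(item)
            self._changed.notify_all()  # wake every parked request

    async def poll(self, cursor, timeout=30.0):
        """Return items after `cursor`, waiting up to `timeout` seconds for new ones."""
        async with self._changed:
            try:
                await asyncio.wait_for(
                    self._changed.wait_for(lambda: len(self.items) > cursor),
                    timeout,
                )
            except asyncio.TimeoutError:
                pass  # nothing new: reply empty and let the client poll again
            return self.items[cursor:]


async def demo():
    feed = Feed()
    waiting = asyncio.create_task(feed.poll(cursor=0))  # parks: the feed is empty
    await asyncio.sleep(0)                              # let the poll start waiting
    await feed.publish("article-1")
    return await waiting


result = asyncio.run(demo())
print(result)
```

Each parked request costs almost nothing while it waits, which is why an event-driven server copes with thousands of them at once.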

What’s Cooking Under That Hood?

Amazon EC2

Summify has approximately 15 different services running behind the scenes. These services are spread across 5 Amazon EC2 machines.

Nginx

Nginx is used as the load-balancer and is the first line of defense.

One of Nginx’s jobs is as a reverse-proxy onto Django / Gunicorn and Tornado. As a reverse-proxy, Nginx takes on the job of dealing with multi-line headers, malformed input and handling slow clients. Essentially, Nginx is protecting the application servers from the outside world.

Multi-line Headers And Malformed Input

By Tornado’s own admission, it does not handle multi-line headers and malformed input very well. Therefore, it is recommended to put Nginx in front of it to protect it from the gritty outside world. Thanks to Nginx, Tornado only sees clean, well-formed HTTP requests.

Handling Slow Clients

One benefit of using a reverse-proxy, such as Nginx, is that application servers, such as Gunicorn (or Unicorn), are not always designed to handle slow clients. If a client, in this case a web-browser, has a slow connection to the Internet then it will take up a lot of the application server’s valuable time. Nginx is designed to handle a huge number of connections concurrently and does not care how fast or slow these connections are. Nginx can buffer the requests and the responses between the client and application server. Only when a request from a client is fully received does Nginx involve the application server. It can pass this full request quickly to the application server and can fully receive its response just as quickly. Nginx then begins sending the response back to the client. Again, this may take some time if the client’s connection is slow. As far as the application server is concerned, it is only ever dealing with one fast client, which is Nginx, and it can concentrate on serving that client efficiently. Nginx deals with the real world, simplifying it for the application server.

Without ensuring that your architecture can handle slow clients you will be wide-open to simple denial-of-service attacks. Gunicorn’s website recommends using slowloris for checking that you are handling slow clients correctly.

Shift Your Paradigms

If you are wondering why the application server cannot handle slow clients itself, then the answer is… you. Most developers are not accustomed to writing highly performant non-blocking event-driven code. This is how Nginx is built and if you were to write an Nginx module then you would have to understand this paradigm. This type of code is harder to develop and maintain. Off-loading the task of handling slow clients to Nginx is one way that you can make your code simpler and continue to use the more powerful web application frameworks that are not built this way.

One way to make your code easier to read, and not look so eventful, is to use an abstraction layer such as Swirl (“some sugar for Tornado”). Summify use Swirl with their Tornado code.

Event-driven frameworks are becoming more and more popular and it is likely that more web application frameworks will be built on top of these technologies. Some technologies to look at are Tornado, which we will discuss below, Node.js (Evented I/O for V8 JavaScript) and Apache Mina (network application framework).

Gunicorn

Gunicorn (a.k.a. “Green Unicorn”) is a Python HTTP server in which Summify run the Django web framework. It manages the worker processes.

Gunicorn ‘Green Unicorn’ is a Python WSGI HTTP Server for UNIX. It’s a pre-fork worker model ported from Ruby’s Unicorn project. The Gunicorn server is broadly compatible with various web frameworks, simply implemented, light on server resources, and fairly speedy.
– gunicorn.org

Gunicorn’s website has some great information on how to deploy Gunicorn with Nginx, in a similar way to how Summify is using this. Cristian points out that the main difference is that they are using TCP sockets, not UNIX sockets.

By night the Green Unicorn fights crime with his trusty sidekick Kato.

Django

Django is a Python web application framework. This is used to serve up much of the more trivial aspects of Summify’s website, such as sign-up, login, navigation and serving up the initial Summify-specific HTML pages you see as you navigate around the site.

In Summify’s architecture Django works closely with MySQL.

Tornado

Tornado is an open-source version of FriendFeed’s web server and is one of the technologies that powers Facebook, Quora and many others.

Tornado is a web server that is ideally suited to handling the dramatic up-tick in concurrent connections you will see when you start playing with “long polling” (comet).

Tornado is able to handle this large volume of connections by being event-driven and non-blocking. It can use the same thread for maintaining many thousands of connections at the same time, since most of these connections are idle at any given time. It multiplexes between the connections, switching to handling the next connection in the queue each time a blocking event occurs (generally disk and network reads and writes). Using epoll, the operating system will notify Tornado when it has returned from a blocking event at which point the Tornado server will add an event for the related connection to the queue of events to process.
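The multiplexing described above can be sketched with Python's standard `selectors` module, which uses epoll on Linux. This is not Tornado's implementation, only a minimal illustration of one thread waiting on several connections at once. The function name `serve_once` and the socket pairs standing in for real clients are assumptions made for the example.

```python
import selectors
import socket


def serve_once(connections):
    """Read one message from each socket using a single thread."""
    sel = selectors.DefaultSelector()  # epoll on Linux, kqueue on macOS/BSD
    for name, sock in connections.items():
        sock.setblocking(False)
        sel.register(sock, selectors.EVENT_READ, data=name)

    received = {}
    while len(received) < len(connections):
        # Sleeps until the OS reports that at least one socket is readable.
        ready = sel.select(timeout=1.0)
        if not ready:
            break  # nothing arrived in time; give up in this demo
        for key, _events in ready:
            received[key.data] = key.fileobj.recv(1024)
            sel.unregister(key.fileobj)
    sel.close()
    return received


# Local socket pairs stand in for three browser connections.
pairs = {name: socket.socketpair() for name in ("alice", "bob", "carol")}
for name, (client, _server) in pairs.items():
    client.sendall(f"hello from {name}".encode())

messages = serve_once({name: server for name, (_client, server) in pairs.items()})
for client, server in pairs.values():
    client.close()
    server.close()
print(messages)
```

The thread never blocks on any single connection; idle connections simply stay registered with the selector until the operating system reports activity on them.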

If you are interested in installing Tornado behind Nginx, in the way that Summify have, then this is a good reference.

Tornado Clients

Just as Tornado the server is great at keeping connections open for a long time and concurrently handling thousands of clients, Tornado the client is capable of connecting to many servers concurrently.

Just like the server, this is done with non-blocking calls. Your HTTP request code passes in a callback, which is called once the response is received. This way, you can fire off multiple requests asynchronously.

Here is an example…

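import tornado.escape
import tornado.httpclient
import tornado.web


class MainHandler(tornado.web.RequestHandler):
    # Callback-style API from the Tornado releases current at the time of writing.
    @tornado.web.asynchronous  # keep the connection open after get() returns
    def get(self):
        http = tornado.httpclient.AsyncHTTPClient()
        # fetch() returns immediately; on_response runs when the reply arrives.
        http.fetch("https://friendfeed-api.com/v2/feed/bret",
                   callback=self.async_callback(self.on_response))

    def on_response(self, response):
        if response.error:
            raise tornado.web.HTTPError(500)
        json = tornado.escape.json_decode(response.body)
        self.write("Fetched %d entries from the FriendFeed API" %
                   len(json['entries']))
        self.finish()  # required because the handler is asynchronous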

Swirl

The above Tornado client example is from Eric Naeseth’s website. Eric has created several MIT-licensed open-source projects, such as Swirl. Swirl builds on the above concepts to allow you to write asynchronous code using Tornado without using a callback function.

Cristian and Mircea are using Swirl in Summify’s Tornado client code, which they use to crawl the web asynchronously.
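To show the general idea of writing asynchronous code without explicit callbacks, here is a small generator-based sketch in plain Python. It is not Swirl's API. The `run_async` decorator and the `fetch` function are hypothetical names, and `fetch` calls its callback immediately instead of doing real network I/O.

```python
def fetch(url, callback):
    """Hypothetical callback-style fetch; a real one would do non-blocking I/O."""
    callback(f"<html>{url}</html>")


def run_async(generator_function):
    """Drive a generator whose yielded values are functions taking a callback."""
    def start(*args):
        gen = generator_function(*args)
        results = []

        def step(value=None):
            try:
                pending = gen.send(value)  # run until the next yield
            except StopIteration as stop:
                results.append(stop.value)
                return
            pending(step)  # resume the generator when the result arrives

        step()
        # Only meaningful here because fetch() completes synchronously.
        return results[0] if results else None
    return start


@run_async
def crawl(urls):
    pages = []
    for url in urls:
        # Reads like blocking code, but each yield hands control back to the driver.
        page = yield (lambda cb, u=url: fetch(u, cb))
        pages.append(page)
    return pages


print(crawl(["a", "b"]))
```

The body of `crawl` contains no callback functions at all; the decorator supplies them behind the scenes, which is the readability gain this style is after.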

MongoDB

MongoDB is used to store the news articles after they have been retrieved and reduced. The Summify co-founders have great praise for the technology and the great support they have received from MongoDB’s developers at 10gen.

Unfortunately, they have found limitations with the way they are using it. Similar to the problems Foursquare faced with memory page fragmentation, Summify is suffering from ever-growing storage issues. Deleting data does not delete a whole page, so the page is not released and the storage size remains the same. Constantly adding and removing data in MongoDB creates more and more of these fragments, which never get cleaned up. This is similar to the disk fragmentation you will have seen on Windows. Unlike Windows, MongoDB does not come with a disk defragmenter.

For this reason they are looking at retiring MongoDB and promoting Redis, which is already part of their architecture, to handle more of their data needs.

“We have a history of doing great things together. Starting from high-school and later in college we’ve built what is now the biggest Romanian community of Computer Science geeks: Infoarena is where you train for and compete in programming contests.”
– Summify Team. “Our Story”

Redis

Redis is used for storing key-value pairs, such as the URL redirects.

URL Redirects

Nearly every link used on Twitter is a redirect to a longer URL. Different tweeters use different URL-shorteners for the original URL, such as bit.ly, goo.gl, tinyurl, ow.ly or Twitter’s own t.co. Knowing that two URLs ultimately resolve to the same URL is very important for efficiency. Cristian and Mircea have found Redis very good at efficiently handling this data.
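A rough sketch of how such a redirect cache might work is shown below. A plain dictionary stands in for Redis, and the class name `RedirectCache`, the key layout and the example URLs are all assumptions made for illustration; Summify's actual schema is not described here.

```python
class RedirectCache:
    """Maps URLs to their final destination; a dict stands in for Redis."""

    def __init__(self, resolve_one_hop):
        self._store = {}  # hypothetical layout: any URL in a chain -> final URL
        self._resolve_one_hop = resolve_one_hop
        self.lookups = 0  # number of (simulated) network hops performed

    def canonical(self, url, max_hops=10):
        if url in self._store:
            return self._store[url]
        chain = [url]
        for _ in range(max_hops):
            if chain[-1] in self._store:  # joined a chain we already know
                chain.append(self._store[chain[-1]])
                break
            self.lookups += 1
            target = self._resolve_one_hop(chain[-1])
            if target is None:  # no further redirect: this is the final URL
                break
            chain.append(target)
        final = chain[-1]
        for hop in chain:
            self._store[hop] = final  # remember every hop for next time
        return final


# Made-up redirects; a real resolver would issue HTTP requests.
FAKE_REDIRECTS = {
    "http://short.example/a": "http://other.example/x",
    "http://other.example/x": "http://blog.example/post",
    "http://tiny.example/b": "http://blog.example/post",
}

cache = RedirectCache(FAKE_REDIRECTS.get)
first = cache.canonical("http://short.example/a")
second = cache.canonical("http://tiny.example/b")
print(first, second, cache.lookups)
```

Both short links resolve to the same article, so it only needs to be crawled once, and the second lookup stops as soon as it reaches a URL that is already cached.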

MySQL

MySQL is used for those situations where [database table] joins are required. For instance, users and related user information, such as their feeds (Twitter, Google Reader and Facebook), are stored here.

The list of user feeds in MySQL is used by the feed retrieval service, which treats it as a queue of jobs to process. Once it reaches the end of the list, it restarts at the beginning in a round-robin manner.
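The round-robin behaviour can be sketched in a few lines. This is an assumption about the general shape only: `feed_jobs` and `load_feeds` are hypothetical names, and a Python list stands in for the MySQL table.

```python
from itertools import islice


def feed_jobs(load_feeds):
    """Yield feeds endlessly, restarting from the top after the last one."""
    while True:
        feeds = load_feeds()  # re-read on each pass so new sign-ups are picked up
        if not feeds:
            return
        yield from feeds


feeds = ["twitter:alice", "reader:alice", "facebook:bob"]
jobs = feed_jobs(lambda: list(feeds))

first_four = list(islice(jobs, 4))  # wraps around after the third feed
feeds.append("twitter:carol")       # a new user signs up mid-cycle
next_six = list(islice(jobs, 6))
print(first_four)
print(next_six)
```

Re-reading the list at the start of each pass means a newly added feed waits at most one full cycle before it is first retrieved.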

jQuery Template

Much of the site is dynamic. For instance, the news feed is very interactive. Updates are made via AJAX which retrieves new content in the form of JSON. The JSON is used by JavaScript in the browser to render HTML using jQuery Templates that are pre-loaded.

If you look at the source of a Summify page you will see these templates included there. Here is a snippet of what you might see…

<script id="user-tmpl" type="text/x-jquery-tmpl" class="spaceless">
  <div class="top">
    <div class="badge">
      {{if avatar}}
        <span class="avatar"><img src="${avatar}" alt="That's you!" width="32" height="32"/></span>
      {{/if}}

I asked Mircea if they had considered any other JavaScript templating languages, such as Mustache. jQuery Template was the right solution for them, since they wanted the flexibility to use JavaScript in the templates. Mustache is more static in its approach and would limit them, for instance, in formatting of a date-time field.

Algorithms

For Summify to summarize your daily feeds into a single email containing only a handful of stories, they have to algorithmically boil down from the hundreds of stories contained in your feeds. This is a key component to their success and, like Google’s search algorithm, would be constantly evolving.

The most important inputs we’re using are how many and which of your friends shared an article, (not all your friends are equal) a time decay factor, and popularity across the web.
– Cristian Strat, Summify
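Purely as an illustration of how the three inputs Cristian mentions could be combined, here is a hypothetical scoring function. The formula, the weights and the half-life are invented for this sketch and are not Summify's algorithm.

```python
import math


def score(shared_by, age_hours, web_popularity, friend_weight, half_life_hours=12.0):
    """Hypothetical ranking: weighted friend shares plus popularity, decayed by age."""
    # Not all friends are equal: unknown friends default to a weight of 1.0.
    social = sum(friend_weight.get(friend, 1.0) for friend in shared_by)
    # Dampen web-wide popularity so viral stories do not swamp friends' shares.
    popularity = math.log1p(web_popularity)
    # Exponential time decay: the score halves every `half_life_hours`.
    decay = 0.5 ** (age_hours / half_life_hours)
    return (social + popularity) * decay


weights = {"ana": 3.0}  # a friend whose shares you tend to read
fresh_two_friends = score(["ana", "bob"], 2, 100, weights)
fresh_one_friend = score(["bob"], 2, 100, weights)
stale_two_friends = score(["ana", "bob"], 26, 100, weights)
print(fresh_two_friends > fresh_one_friend > stale_two_friends)
```

Sorting the day's articles by a score like this and keeping the top handful would give the kind of short digest described above.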

Beware Of The Kawasaki

Cristian and Mircea contacted Guy Kawasaki to see if he would be interested in testing Summify. As tests go, it was a good one. Guy managed to immediately break their system.

Guy Kawasaki’s Twitter feed is unlike most others. As I type this he is currently following 306,014 other tweeters. Even if each of these tweeters only tweeted once a day, then that is still a lot of tweets and a lot of URLs for Summify to crawl. I am sure Guy’s eyes must get sore from trying to keep up with his Twitter feed.

While they were humbled in finding Summify’s limitations, Cristian and Mircea decided it would be best to cap the number of URLs to 5000 per user. Not all tweets contain URLs, so this is not the limit on the number of tweets. This limit also covers content from Google Reader and the Facebook feed. It is unlikely that anyone is capable of reading more than 5000 articles per day.

The Dark-Side Of The Long-Tail

Everyone loves the long-tail. That is where the majority of the users hang out and if we can tap into the long-tail’s raw potential then we will find untold wealth. The pot of gold is no longer at the end of the rainbow. Instead, that pot of gold has relocated to the end of the long-tail.

“The Long Tail, in a nutshell”

The theory of the Long Tail is that our culture and economy is increasingly shifting away from a focus on a relatively small number of “hits” (mainstream products and markets) at the head of the demand curve and toward a huge number of niches in the tail. As the costs of production and distribution fall, especially online, there is now less need to lump products and consumers into one-size-fits-all containers. In an era without the constraints of physical shelf space and other bottlenecks of distribution, narrowly-targeted goods and services can be as economically attractive as mainstream fare.

Chris Anderson

For Summify, and similar companies, the long-tail means more work and more processing power. 80% of the RSS feeds that Summify crawls have only one or two subscribers. While this can partially be put down to the fact that Summify is still far from [critical mass](https://en.wikipedia.org/wiki/Critical_mass_(sociodynamics)), if the theory of the long-tail holds true, then this will be an ongoing problem. It is the way in which start-ups creatively address problems like this that defines the successful start-ups and ensures their technology remains stable as they reach escape velocity.

JSON API

I will not go into details of the complete API, 1) because I do not have it and 2) because this is just for Summify’s internal use and Cristian and Mircea might know where I live. Instead this is intended as a look at how a JSON API for a dynamically updating website might be implemented, rather than a reference to the API.

API Request via HTTP GET

This request is made when I click “More”.

GET /api/recent.json?count=20&order_lt=1298761351.0000%2C1328%2C0000%2C3f4d HTTP/1.1
Host: summify.com
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.13) Gecko/20101203 Firefox/3.6.13
Accept: application/json, text/javascript, */*; q=0.01
Accept-Language: en-us,en;q=0.5
Accept-Encoding: *;q=0
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
X-Requested-With: XMLHttpRequest
Referer: https://summify.com/client
Cookie: csrftoken=b781c99a21bfa6ec66f73576ece73b3b; sessionid=bdb49da344d0be4659534fd2d8e11872; __utma=133505548.1812117490.1298762364.1298762364.1298762923.2; __utmb=133505548.2.10.1298762923; __utmz=133505548.1298762364.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmc=133505548

You will notice that the request basically just consists of the following…

count=20

“Get 20 more articles”.

order_lt=1298761351.0000,1328,0000,3f4d    (URL encoded)

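This is the order id associated with the oldest article we have on the screen. The “lt” suffix suggests “less than”, so we are requesting articles that sort before it, that is, older ones.

1298761351 is a Unix time (the number of seconds elapsed since January 1, 1970 UTC). It decodes to 2011-02-26 23:02:31 UTC, which matches the published_at value of the article in the response below, so it appears to be the article’s publication time rather than the current date-time. The server can use it to decide where in the stream the next batch of articles starts.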
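The parameter can be decoded with a few lines of Python. This is a small sketch: the function name `parse_order` is made up, and because the meaning of the fields after the timestamp is not documented here, they are returned uninterpreted.

```python
from datetime import datetime, timezone
from urllib.parse import unquote


def parse_order(raw):
    """Split a URL-encoded order token into its timestamp and remaining fields."""
    timestamp, *rest = unquote(raw).split(",")  # %2C decodes to a comma
    when = datetime.fromtimestamp(float(timestamp), tz=timezone.utc)
    return when, rest


when, rest = parse_order("1298761351.0000%2C1328%2C0000%2C3f4d")
print(when.isoformat(), rest)
```

The timestamp decodes to 23:02:31 UTC on 26 February 2011.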

Note that I have disabled response encoding in my request. Generally your browser will request “gzip,deflate” encoding using the “Accept-Encoding” HTTP header. I have disabled this in Firefox, since I want to be able to more easily see the content returned.

API Response HTTP Header

HTTP/1.1 200 OK
Server: nginx/0.7.67
Date: Sat, 26 Feb 2011 23:29:21 GMT
Content-Type: application/json
Connection: keep-alive
Vary: Accept-Encoding
Content-Length: 23289
Etag: "4928161e5bbc814b22c37e82b83820b4df611de7"

The response header is pretty straightforward. It specifies “keep-alive” to maintain the HTTP connection for future requests. We can also see that about 23 KB (23,289 bytes) of content is returned to us in the body of the response. This is actually larger than would be transferred to you since I have disabled content encoding.

API Response Body – JSON

The content of the response is JSON encoded. I have prettified the JSON here, since this data is actually sent as a single line of JSON and is not so easy on the eyes. Also, your browser will request gzip encoding. If you can read un-prettified gzip-encoded JSON then I take my hat off to you.

[
   {
      "article_id" : "4d69875eaf55ef771200304d",
      "published_at" : "2011-02-26T23:02:31Z",
      "marked_read" : false,
      "keep_unread" : false,
      "global_reactions" : {
         "twitter" : {
            "tweet_count" : 0
         },
         "facebook" : {
            "click_count" : 0,
            "comment_count" : 0,
            "like_count" : 0,
            "share_count" : 0,
            "total_count" : 0
         }
      },
      "image" : {
         "width" : 250,
         "url" : "https://www.linglom.com/media/images/Linux/ChangeIPAddress/_1.png",
         "height" : 233
      },
      "sources" : [
         {
            "user_avatar" : "https://a2.twimg.com/profile_images/1153016992/180_warau_normal.jpg",
            "published_at" : "2011-02-26T23:02:31Z",
            "tweet_id" : "41634242127470592",
            "user_name" : "junichi_y",
            "user_screen_name" : "junysb3",
            "generator" : "twitter",
            "service_username" : [
               "philwhln"
            ],
            "text" : "How to change IP Address on Linux Redhat | Linglom.com:  https://bit.ly/dZpYOg",
            "user_id" : 81482943
         }
      ],
      "order" : "1298761351.0000,0592,0000,b52a",
      "article" : "4d69875eaf55ef771200304d",
      "url" : {
         "domain" : "www.linglom.com",
         "full" : "https://www.linglom.com/2008/04/20/how-to-change-ip-address-on-linux-redhat/"
      },
      "title" : "How to change IP Address on Linux Redhat"
   },

This shows the first of 20 records returned and comprises the various attributes that are displayed on the screen for this article. You can see the title, url, image, sources, article id (unique to Summify) and “order” attribute. The “order” id of the last article returned is used in the next request to the API. For instance, if this was the last article returned then the next request for “More” would be…

GET /api/recent.json?count=20&order_lt=1298761351.0000%2C0592%2C0000%2Cb52a HTTP/1.1
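A client could build that next request along these lines. This is a sketch based only on the request and response shown above; `more_url` is a hypothetical helper, not part of Summify's code.

```python
from urllib.parse import urlencode


def more_url(articles, count=20):
    """Build the path for the next page from the last article received."""
    last_order = articles[-1]["order"]
    # urlencode percent-encodes the commas in the order id as %2C.
    return "/api/recent.json?" + urlencode({"count": count, "order_lt": last_order})


page = [{"order": "1298761351.0000,0592,0000,b52a"}]
print(more_url(page))
```

Because the cursor is an ordering key rather than a page number, new articles arriving at the top of the stream do not shift the results of the next “More” request.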

Next Stop – iPhone

Next on the cards is the iPhone application. As you saw above, JSON is used to update Summify’s website interface. The JSON API they have built can easily be utilized from within an iPhone application.

Reading blog posts on-the-go is difficult. This is mainly due to the way in which each blog is formatted differently. Blogs are often not formatted for use on mobile devices. Since Summify strips out the formatting and leaves just the content they can easily re-format that content to better fit the mobile device in a consistent way.

“We’ve turned down some cozy job offers from companies like Google and Facebook and we moved half-way across the world for this incredible opportunity.”
– Summify Team. “Our Story”

Conclusion

Summify is not scared to try different technologies within their architecture and they appear very agile in their approach, while building a solid foundation to take Summify to the next level.

When I wrote the previous post on Quora’s technology, I had a strong sense that Quora knew exactly what they were doing. Summify is using many of the same technologies. In fact, they showed interest in Quora’s internally developed technologies, such as LiveNode. Hopefully LiveNode will be released as open-source, benefiting many more companies like Summify.

I think we will be hearing much more from Summify’s founders, Cristian and Mircea. They have a great long-standing partnership and both understand the technology they are using very well.

At the time of writing this Summify has aggregated more than 200 million stories.

Comments

  1. Mircea Paşoi

    Great article, thanks for the review!

  2. Jerry Tian

    Phil, thanks a lot for the great article.
    To the Summify guys, interesting use of Redis. How much data is Redis storing currently? It sounds like a lot data to keep for a in memory data store. Is Redis coping well? How many Redis clusters do you have?

    1. Mircea Paşoi

      Hi Jerry,

      We’re storing about 8-9GB of data in Redis currently and it’s coping well so far. We have 4 Redis instances on the same machine, each with different data sets and with different access patterns (between 10 and 1000 requests/second)

  3. Jesse Heaslip

    Great read Phil! Always interesting to have a peek under the hood of a product like this. Thanks for a thorough post!

  4. XU Qiang

    i find ur blog really helpful, thanks a lot~~~

  5. Alex Toulemonde

    Wow I miss this article. Great one. I knew Cris and Mircea were awesome but not as much :) Thanks for taking the time to break their technology down. Great learning.

    PS: Go Vancouver Startups Go!

  6. Matt Smith

    Super thorough article, thanks! Couldn’t stop laughing at the reference to Tank (https://matrix.wikia.com/wiki/Tank).