19 responses to “How To Get Experience Working With Large Datasets”

  1. Chris Hemedinger

    Another great source is Kiva.org, with tons of loan and lender data related to microfinance transactions.

    I’ve written about this on my blog for World Statistics Day (last month):

  2. Jeremy Weiss

    How large does a dataset have to be, before it’s considered ‘large’?

  3. Industrialisation PHP press review for week 50 (2010) | Industrialisation PHP

    [...] An application that works perfectly can suddenly collapse under a sudden, massive influx of data. To anticipate this kind of problem, load tests are run beforehand. The difficulty is usually finding enough data to load the system. Phil Whelan suggests several ways to retrieve or generate large quantities of realistic data. [...]

  4. LATW Episode 1 (December 1-10, 2010)

    [...] How To Get Experience Working With Large Datasets [...]

  5. Brad

    Here’s another source:

  6. kencochrane

    Another source I’m surprised someone didn’t mention yet is Amazon’s Public data sets: http://aws.amazon.com/datasets

  7. Yildirim Mungan

    Great blog, Phil! I’m reading everything you post. Thank you very much indeed.
    Although I’ve searched Google, I could not find a large sample dataset of telco CDRs (Call Detail Records).
    I am planning to run some analysis on CDR files and provide a solution to the telco big data analysis problem.
    Can you recommend a web address where I may find what I am looking for?


  8. dodgy_coder

    > “Download War And Peace, Alice In Wonderland or any other book that is now out of copyright if you need real strings of words for your fake tweets”

    Nice post, and the idea above is a good one, but don’t forget about non-English languages such as Chinese and Japanese, which use different character sets…
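
A minimal sketch of the quoted idea, in Python: it cuts random runs of consecutive words out of a public-domain text to use as fake tweets. The function name `fake_tweets`, the 140-character limit, and the sample strings are assumptions made for illustration, not anything from the original post.

```python
import random


def fake_tweets(text, count, max_chars=140, seed=None):
    """Build `count` fake tweets from random runs of consecutive words in `text`.

    Length is measured with len(), which counts Unicode code points in
    Python 3, so non-ASCII text is not penalised for its byte size.
    """
    rng = random.Random(seed)
    words = text.split()  # whitespace split; see the note below for CJK text
    if not words:
        raise ValueError("text contains no words")

    tweets = []
    for _ in range(count):
        start = rng.randrange(len(words))
        # A single "word" longer than the limit is truncated to fit.
        tweet = words[start][:max_chars]
        i = start + 1
        # Keep appending the following words while the tweet still fits.
        while i < len(words) and len(tweet) + 1 + len(words[i]) <= max_chars:
            tweet += " " + words[i]
            i += 1
        tweets.append(tweet)
    return tweets


# Hypothetical sample text; in practice this would be the full book.
sample = "Alice was beginning to get very tired of sitting by her sister on the bank " * 20
for tweet in fake_tweets(sample, 3, seed=1):
    print(tweet)
```

Chinese and Japanese are normally written without spaces between words, so the whitespace split above treats a whole passage as one "word" and simply truncates it. Realistic test data in those languages would need a proper word segmenter, which this sketch does not attempt.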

  9. Travis Reeder

    Great post Phil. I hope this will encourage people that want to get into the big data space to just get out there and do it. The best experience someone can get is to make something, whether it’s for work or just a project for fun/learning. When I’m hiring, I put a heavy weight on people that have done their own projects just for the sake of learning.


  10. Patrick Dobson

    Voter files are also freely available. You could get millions of real records from every state. Here is Ohio:


  11. Aquecedor

    I have always been fascinated by large datasets. I hope to play with this as soon as I have time. Great and inspiring post.

  12. Seamus Abshere

    We’ve got some curated datasets here:


  13. JJ

    Thanks for the great post. I have been spending some time trying to get creative with data set sources. If you could have access to any data set out there, what would you choose? It’s fun to dream.

  14. Sudhakar Singh

    Where can I find transactional datasets? I want to apply association rule mining algorithms to them.