Big Data, 30,000 Scientists and a Startup
Every day we create 2.5 quintillion bytes of data. To put that number into context, 90% of the world’s data today was created in the last two years. That’s right: we are now producing and capturing more data than at any other point in history. This data comes from many different sources, such as social media, digital pictures, video, smartphone GPS signals, and sensors that capture weather information, to name a few. If that’s not enough to take in, all of this data also comes in a variety of forms:
- Structured Data – This follows explicit rules, so it can be neatly modeled, organized, and formatted to be easy to manipulate and manage. Structured data may seem boring, but it is quite easy to work with. Examples range from databases to the mundane spreadsheet, fixed-format files, log files, etc.
- Unstructured Data – You’re probably quite familiar with this as well, even if you don’t know it by name, because you’re consuming unstructured data right now as you read this article. Unstructured data covers the mass of information that does not fit easily into a set of database tables. The most recognizable form of unstructured data is text in documents, such as articles, tweets, or the message bodies of emails.
- Semi-Structured Data – This refers to sets of data in which there is some implicit structure that is generally followed, but not enough of a regular structure to “qualify” for the kinds of management and automation usually applied to structured data. We are bombarded by semi-structured data on a daily basis, in both technical and non-technical environments. For example, web pages follow certain typical forms, and content embedded within HTML often has some degree of metadata within the tags, which automatically implies certain details about the data being presented. A non-technical example would be traffic signs posted along highways. While different areas use their own local conventions, you will probably figure out which exit is yours after reviewing a few highway signs.
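To make the three categories a little more concrete, here is a minimal Python sketch. The sample data, the `MetaCollector` class, and the variable names are all invented for illustration; the point is only that structured data can be queried by field name, semi-structured data yields hints through its tags, and unstructured text leaves you with little more than the raw words.

```python
import csv
import io
from html.parser import HTMLParser

# Structured: every row follows the same explicit schema (made-up sample data).
CSV_TEXT = "city,temp_c\nBoston,21\nDenver,17\n"
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
average_temp = sum(int(row["temp_c"]) for row in rows) / len(rows)

# Semi-structured: tags carry some metadata, but the layout is not guaranteed.
HTML_TEXT = (
    '<html><head><title>Weather</title>'
    '<meta name="author" content="Sam"></head>'
    '<body><p>Sunny today.</p></body></html>'
)


class MetaCollector(HTMLParser):
    """Hypothetical helper that gathers the title and named <meta> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


collector = MetaCollector()
collector.feed(HTML_TEXT)

# Unstructured: free text has no fields, so we fall back to counting words.
TWEET = "Loving the sunny weather in Boston today"
word_count = len(TWEET.split())

print(average_temp, collector.title, collector.meta, word_count)
```

Notice how the amount of custom code grows as the structure shrinks: the spreadsheet-style data needs one library call, the web page needs a small parser, and the tweet would need real text analysis before it could answer any interesting question.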
With data production following a hockey stick curve and so many types of data being created, we’re facing an interesting conundrum: what do we do with all of it? With a problem this massive, you probably guessed that there’s a startup trying to solve it.
Kaggle is coming close to its first birthday, and given its innovative approach to statistical and analytical crowdsourcing, I’m guessing it’s going to be around for quite some time. With over 30,000 data scientists in its community, companies from all over the globe can take their big data problems to Kaggle and receive some interesting results.
When a company faces a data problem that it can’t solve, it can work with Kaggle to create a statistical contest for Kaggle’s community of uber geeks. These data scientists compete to provide the best solution, and at the end the company pays the winner for the intellectual property.
Overall, it’s a very interesting business model for a growing problem that businesses are going to face. And just a side note: if you’re looking for a job at Facebook, do a little digging at Kaggle and you’ll find some competitions that Facebook is setting up to vet new employees. I’m expecting big things and some innovative breakthroughs from this new startup.