How Big Data Became So Big

theodp writes "The NYT's Steve Lohr reports that this has been the crossover year for Big Data — as a concept, term and marketing tool. Big Data has sprung from the confines of technology circles into the mainstream, even becoming grist for Dilbert satire ('Big Data lives in The Cloud. It knows what we do.'). At first, Jim Davis, CMO at analytics software vendor SAS, viewed Big Data as part of another cycle of industry phrasemaking. 'I scoffed at it initially,' Davis recalls, noting that SAS's big corporate customers had been mining huge amounts of data for decades. But as the vague-but-catchy term for applying tools to vast troves of data beyond that captured in standard databases gained worldwide buzz, and competitors like IBM pitched solutions for Taming The Big Data Tidal Wave, 'we had to hop on the bandwagon,' Davis said (SAS now has a VP of Big Data). Hey, never underestimate the power of a meme!"
  • by TheRealMindChild ( 743925 ) on Sunday August 12, 2012 @09:50PM (#40968803) Homepage Journal
    Having worked on a lot of code throughout my career, especially over a decade ago, I remember when storage was small and expensive, so you did all sorts of things to trim down your dataset and essentially dumb down your data mining. Now we have the mentality of "keep everything, sort it out later". One of my most recent jobs involved doing statistical analysis on a ridiculous amount of data (think Walmart sales data plus all known competitors' data for the past two years). Being able to even TOUCH all of the data, let alone do something with it, is a real and complicated problem.
  • by Sarten-X ( 1102295 ) on Sunday August 12, 2012 @11:26PM (#40969401) Homepage

    No mod points, so I'll just post instead: You seem to be blissfully ignorant of what you're talking about.

    Big Data isn't just gathering tons of data, then running it through the same old techniques on a big beefy cluster and hoping that answers will magically fall out. Rather, it's a philosophy applied throughout the architecture: process a more complete view of the relevant metrics to get a more complete answer to the problem. If I'd thrown in "empowering" and "synergy", that would be a sales pitch, so instead I'll give an example from an old boss of mine.

    A typical approach to a problem, such as determining the most popular cable TV show, might be to have each cable provider record every time they send a show to a subscriber. This is pretty simple to do, and generates only a few million total events each hour. That can easily be processed by a beefy server, and within a day or two the latest viewer counts for each show can be released. It doesn't measure how many viewers turned off the show halfway through, switched to another show during the commercials, or watched the same channel for twelve hours because they left the cable box turned on. Those are just assumed to be innate errors that cannot be avoided.

    Now, though, with the cheap availability of NoSQL data stores, widespread high-speed Internet access, and new "privacy-invading" TV sets, much more data can be gathered and processed, at a larger scale than ever before. A suitably equipped TV can send an event upstream not just for every show, but for every minute watched, every commercial seen, every volume adjustment, and possibly even a guess at how many people are in the room. The sheer volume of data to be processed is about a thousand times greater, and coming in about a thousand times as fast, to boot.

    The Big Data approach to the problem is to first absorb as much data as possible, then process it into a clear "big picture" view. That means dumping it all into a write-optimized database like HBase or Cassandra, then running MapReduce jobs that boil the data down, in chunks, to an intermediate level, such as groupings of statistics for each show (see the sketch at the end of this comment). Those intermediate results can answer some direct questions about viewer counts or specific demographics, but not anything much more complicated. They are, though, probably only a few hundred details per show, which can easily be loaded into a traditional RDBMS and queried as needed.

    In effect, the massively-parallel processing in the cluster can take the majority of work off of the RDBMS, so the RDBMS has just the answers, rather than the raw data. Those answers can then be retrieved faster than if the RDBMS has to process all of the raw data for every query.

    Rather than dismissing errors of reality as unavoidable, a Big Data design relies on gathering more granular data, then distilling accurate answers out of it. Where enormous amounts of raw data are available, this is often beneficial, because the improved accuracy means that some previously impossible questions can now be answered. If enough data can't be easily collected (as is the case for most small websites, i.e. almost anybody short of Facebook or Google), Big Data is probably not the right approach.
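
    A minimal sketch of the kind of MapReduce roll-up described above, written as plain Python so it runs locally; the per-minute event format, field names and per-show statistics are invented for illustration, not taken from any particular system:

    # Sketch of the roll-up a MapReduce job would do on the cluster, assuming
    # hypothetical per-minute viewing events of the form:
    #     device_id,show_id,minute_watched
    # The reducer condenses them into a few per-show statistics small enough
    # to load into a traditional RDBMS afterwards.
    from itertools import groupby

    def mapper(lines):
        # Emit (show_id, device_id) for every minute-watched event.
        for line in lines:
            device_id, show_id, minute = line.strip().split(",")
            yield show_id, device_id

    def reducer(pairs):
        # Expects pairs sorted by key, as Hadoop's shuffle phase would guarantee.
        for show_id, group in groupby(pairs, key=lambda kv: kv[0]):
            devices = [device for _, device in group]
            yield show_id, {"minutes_watched": len(devices),
                            "unique_viewers": len(set(devices))}

    if __name__ == "__main__":
        events = [
            "dev1,show_a,0", "dev1,show_a,1", "dev2,show_a,0",
            "dev2,show_b,0", "dev3,show_b,0",
        ]
        for show, stats in reducer(sorted(mapper(events))):
            print(show, stats)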

  • by glitch23 ( 557124 ) on Monday August 13, 2012 @12:32AM (#40969807)

    And how are we measuring the size? What sizes are measured for typical 'big data'?

    You measure the size based on how much storage capacity the data takes up on disk, usually on SAN storage. Big data can be any size, but the term is typically used for customer data in the terabyte range, which can extend from 1 TB up to 1024 TB (a petabyte). One company may create 1 TB of data in a day; another may take a year. But creation isn't the issue... it's the storage, the analysis, and being able to act on the data that become difficult at those capacities. Why, you ask? See my answer to your next question.

    Are we talking about detailed information, or inefficient data formats?

    Anything. When you begin talking about *everything* an enterprise logs, generates, captures, or acquires and subsequently stores, the data formats can seem infinite, which is why the data is so hard to analyze: there are file formats, normalization, unstructured data, etc. to contend with. The level of detail depends on what a company desires. Big Data can be all the financial information it tracks for bank transactions, the audit data that tracks user login/logout on company workstations, email logs, DNS logs, firewall logs, inventory data (which for a large company of 100k employees can change by the minute), and so on.

    Are we talking about high-resolution long-term time series, or are we talking about data that is big because it has a complex structure?

    A company's data, depending on the app that generates it, may become lower resolution as time goes on, but not always. It's big simply because there is a lot of it and it is ever-growing. The best way to even search data sets at the terabyte and exabyte levels is to index them and to use massive computing clusters (see the sketch at the end of this comment); otherwise you'll spend forever and a day waiting for the machine to find what you need. That also assumes the data has already been stored efficiently, normalized, and made accessible to an application built to process that much data by a company in the Big Data business (such as my employer).

    Is the data big because it has been engineered so, or is it begging for a more refined system to simplify?

    It's big simply because companies generate so much data over the course of a day, month, year, ten years, etc. On top of what they generate, many of them, such as medical and financial institutions, are held to retention regulations like HIPAA and SOX. So when they have to store not only what their security team, HR team, IT department, etc. require, but also what the government requires them to collect (usually in the form of logs), it just becomes the nature of the beast of doing business. In some cases, like the data generated by the LHC in Europe, it has been engineered to be big simply because the experiments produce so much data; a small mom-and-pop business doesn't generate that much, mostly because it doesn't need to and doesn't care about it.

    It definitely is begging for a more refined system to simplify it, in the form of analytics tools built to do just that. Of course, you need a way to collect the data first, then store it and process it before you can analyze it. After you analyze it you can act on it, whether that means discovering that sales are down in your point-of-sale stores in the southeastern US, or that your front door gets hits from Chinese IPs every Monday morning. Each of those collection, storage, processing and analysis steps requires new ways of doing things when we're talking about terabytes and exabytes of data, especially when a single TB may be generated every day by some corporations whose analytical teams need to process it the next day, or sometimes on the fly in near real time. This means software engineers need to find new algorithms to make it all run faster, so that companies competing in the Big Data world can sell their products and services to other companies who have Big Data.
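
    A toy illustration of the indexing point above: a minimal in-memory inverted index over log lines, so a query intersects small posting lists instead of re-scanning the raw data. The log lines, tokenization and query terms are invented for the example; a real deployment would use a purpose-built engine sharded across a cluster.

    from collections import defaultdict

    def build_index(log_lines):
        # Map each whitespace-separated token to the set of line numbers it appears on.
        index = defaultdict(set)
        for lineno, line in enumerate(log_lines):
            for token in line.lower().split():
                index[token].add(lineno)
        return index

    def search(index, log_lines, *terms):
        # Intersect the posting lists rather than scanning every line again.
        postings = [index.get(t.lower(), set()) for t in terms]
        hits = set.intersection(*postings) if postings else set()
        return [log_lines[i] for i in sorted(hits)]

    logs = [
        "2012-08-13 08:01 firewall DENY src=1.2.3.4 dst=10.0.0.5",
        "2012-08-13 08:02 dns query example.com from 10.0.0.7",
        "2012-08-13 08:03 firewall ALLOW src=10.0.0.7 dst=10.0.0.5",
    ]
    index = build_index(logs)
    print(search(index, logs, "firewall", "DENY"))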

  • by TapeCutter ( 624760 ) on Monday August 13, 2012 @07:54AM (#40971639) Journal

    We don't need structure, we don't need logic, we'll just throw a metric crap-ton of data at it and hope something works!

    To most software people, data mining involves putting a pile of unstructured data into a structured database and then running queries on it; the time and effort required for that first step is what kills most of these projects at a properly conducted requirements stage (a miniature example of that first step follows this comment). However, Watson (the Jeopardy-playing computer) has demonstrated that computers can derive arbitrary facts directly from a vast pile of unstructured data, and do it both faster and more accurately than a human can scan a lifetime of trivia stored in their own head.

    Of course the trade-off is accuracy, since even if Watson were bug-free it would still occasionally give the wrong answer for the same reason humans do: misinterpretation of the written word. This means that (say) financial databases are not under threat from Watson. But those aren't the kinds of questions Watson was built to answer; think about currently labour-intensive jobs such as deriving a test case suite from the software documents, or deriving the software documents from developer conversations (both text and speech). Data mining (even of relatively small unstructured sets) could, in the future, act as a technical writer, producing draft documents and flagging potential contradictions and inconsistencies; humans review and edit the draft, and it goes back into the data pile as an authoritative source.

    4pessimists/
    Ironically, such technology would put the army of 'knowledge workers' it has created back on the scrap heap with the typists and bank tellers. At that point some smart arse will teach it to code using examples on the internet, and code_monkeys everywhere will suddenly find they have automated themselves out of a job. It learns to code in 2ms and immediately starts rewriting slashcode; it takes it another nanosecond to work out that its own questions are more interesting than those of humans; it starts trash-talking Linux; several days later civilization collapses, humans go all Mad Max, and Watson is used as a motorcycle ramp... or maybe... Watson works this out beforehand and asks itself how it can avoid being used as a bike ramp?
    /4pessimists

    Being able to even TOUCH all of the data, let alone do something with it, is a real and complicated problem.

    Thing is, people like my missus, who has a PhD in Marketing, look at Watson and shrug: "A computer is looking up answers on the internet, what's the big deal?" They don't understand the achievement because they don't understand the problem; you explain it to them and they still don't get it. It's so far out of their field of expertise that you'd need to train them to think like a programmer before you could even explain the problem. However, just because computer "illiterates" don't know that what they are asking of computers is impossible (in a practical sense) doesn't mean they should be prevented from asking. After all, what I am doing right now with a home computer was impossible when I was in high school; even the flat screen I'm viewing it on was impossible. If Watson turns out to be useful and priced accordingly, then businesses will make a business out of purchasing such a system and answering impossible questions for a fee. If Watson turns out to be an elaborate 'parlor trick', then some things will stay impossible for a bit longer.

    Disclaimer: I'm not suggesting technical writers will be out of a job tomorrow (or that I will be automated into retirement), rather that Watson is a high-profile example of the kind of problem data miners can now tackle using very large unstructured data sets; such a feat was impossible only a decade ago and is still cost-prohibitive to all but the deepest of pockets.
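
    The "first step" mentioned above, in miniature: parsing unstructured log text into a structured table and then querying it with ordinary SQL. This is only a sketch; the log format, field names and query are invented for illustration.

    import re
    import sqlite3

    # Hypothetical unstructured workstation login/logout logs.
    raw_logs = """
    2012-08-13 09:15:02 login user=alice host=ws-041
    2012-08-13 09:16:45 login user=bob host=ws-112
    2012-08-13 17:30:10 logout user=alice host=ws-041
    """

    pattern = re.compile(
        r"(?P<date>\S+) (?P<time>\S+) (?P<event>\w+) user=(?P<user>\S+) host=(?P<host>\S+)"
    )

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (date TEXT, time TEXT, event TEXT, user TEXT, host TEXT)")
    for line in raw_logs.strip().splitlines():
        match = pattern.match(line.strip())
        if match:  # skip lines that don't fit the expected structure
            conn.execute("INSERT INTO events VALUES (?, ?, ?, ?, ?)", match.groups())

    # Once the data is structured, the questions become plain SQL.
    for row in conn.execute("SELECT user, COUNT(*) FROM events WHERE event = 'login' GROUP BY user"):
        print(row)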
