In information technology, big data consists of data sets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analysis, and visualization. This trend continues because of the benefits of working with larger and larger data sets, allowing analysts to “spot business trends, prevent diseases, and combat crime.”
Data sets also grow in size because they are increasingly being gathered by information-sensing mobile devices, remote sensing technologies, software logs, cameras, microphones, radio-frequency identification (RFID) readers, and wireless sensor networks. The world’s technological per-capita capacity to store information has roughly doubled every 40 months (about every three years) since the 1980s, and every day 2.5 quintillion bytes of data are created.
One current feature of big data is the difficulty working with it using relational databases and desktop statistics/visualization packages, requiring instead “massively parallel software running on tens, hundreds, or even thousands of servers”. The size of “big data” varies depending on the capabilities of the organization managing the set. For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration.
Is this rapid accumulation of data useful? Also, if it is not useful today, will it become useful and usable tomorrow? How long should it be stored? And what are the measures of usefulness?
These are profound questions, and the answer to them is, in all probability, “I don’t know.” Here is why:
1. This data is new to the organization. Can we look at benchmarks in other organizations to get an idea of the usefulness of data similar to what we are exploring? Or will imitating them simply pass their mistakes and unexplored potential on to us?
2. How long should we retain data to understand its value? One approach is to retain data for a reasonable period of time and then run a data-exploration exercise on it: junk whatever turns out to be irrelevant and retain whatever is relevant. Or get a consultant to help?
3. How useful is the data? This is related to point 2 above. Retain data on the basis of a logical tie-up with desired outcomes, one that will eventually be put to the test at a future date using a test-and-control approach. Innovation in the use and interpretation of data from information-sensing mobile devices, remote sensing technologies, software logs, and the like will continue, and data that currently does not make much sense may soon be put to use.
A few other points that cause anxiety with regard to analytics on “big data” are:
1. Variable and data understanding, and data consolidation: An analyst needs to understand the organization’s data at a level that most people are not concerned with. Naming and keeping track of diverse data leads to confusing and very long variable names, and understanding and piecing together the data appropriate for each analysis becomes an enormous task. Big data often involves bringing in information from a greater number of sources, so understanding the source systems and the data warehouse involved is an important challenge.
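As a toy illustration of the consolidation problem, the sketch below merges customer records from three assumed source systems using pandas. All table and column names here are hypothetical; the point is that outer joins surface coverage gaps between sources early, before analysis begins.

```python
import pandas as pd

# Hypothetical extracts from three source systems (names are illustrative only).
crm = pd.DataFrame({"cust_id": [1, 2, 3], "segment": ["A", "B", "A"]})
web_logs = pd.DataFrame({"cust_id": [1, 2, 4], "page_views": [120, 45, 9]})
pos = pd.DataFrame({"cust_id": [1, 3], "total_spend": [250.0, 99.5]})

# Consolidate with outer joins: customers seen in any one system are kept,
# and the resulting NaNs show exactly where each source lacks coverage.
consolidated = (
    crm.merge(web_logs, on="cust_id", how="outer")
       .merge(pos, on="cust_id", how="outer")
)
print(consolidated)
```

In a real warehouse the same idea applies at scale: reconciling keys and naming conventions across source systems is where much of the "understanding" effort goes.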
2. Variable transformation: As the number of variables goes up, is it possible to run hundreds of transformations per variable and still finish an analysis in a reasonable time? Many analysts apply several data transformations to all the raw variables and then let a variable-selection process pick the best outputs. Can a more discerning process be used in the initial stages of the analytics process? That would cut down the time, energy, and cost of working with these much larger data sets.
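The brute-force pattern described above can be sketched in a few lines of Python. The variables, the transformation set, and the correlation-based scoring rule here are illustrative assumptions, not a prescribed method:

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.uniform(0.1, 10.0, size=(500, 3))          # three hypothetical raw variables
y = np.log(X_raw[:, 0]) + 0.1 * rng.normal(size=500)   # target driven by log of x0

# Brute-force stage: apply several candidate transformations to every raw variable.
transforms = {"raw": lambda v: v, "log": np.log, "sqrt": np.sqrt, "sq": np.square}
candidates = {
    f"x{j}_{name}": fn(X_raw[:, j])
    for j in range(X_raw.shape[1])
    for name, fn in transforms.items()
}

# Selection stage: score each candidate by absolute correlation with the target
# and keep only the top k, rather than feeding every transform into the model.
k = 3
scores = {name: abs(np.corrcoef(col, y)[0, 1]) for name, col in candidates.items()}
selected = sorted(scores, key=scores.get, reverse=True)[:k]
print(selected)
```

With hundreds of raw variables, the candidate pool multiplies quickly, which is exactly why a more discerning first pass (pruning candidates cheaply before modeling) saves time and compute.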
3. Scalable infrastructure: The CPU capacity of analytics servers needs to expand at the rate at which data grows, and that growth may often outpace the projected rates of adding infrastructure.
The challenges of data management are getting more complex by the day, and the struggle to harness information from data becomes more acute. So this space will see much more action in the days to come.
The big data challenge recalls the second labor of Hercules, which was to kill the Lernaean Hydra. From the murky waters of the swamps near a place called Lerna, the hydra would rise up and terrorize the countryside. A monstrous serpent with nine heads, the hydra attacked with poisonous venom. Nor was this beast easy prey, for one of the nine heads was immortal and therefore indestructible. But, in the end, Hercules won! And so will the analyst.