MIT/Stanford Venture Lab Big Data Panel
May 15, 2012, VLAB, Stanford University—A panel composed of James Phillips from Couchbase, Max Schireson from 10gen, Doug Cutting from the Apache Software Foundation, Andrew Mendelsohn from Oracle, and Ravi Mohan from Shasta Ventures discussed the issues of noSQL and big data. Robert Scoble from Rackspace moderated the panel.
Scoble opened by noting that the core issue is the exponentially increasing amount of data being generated. Big data is changing the world, allowing organizations to study people and learn more about their behavior. The growth in sensors and other data-acquisition methods fuels machine-to-machine data streams.
Waze.com has about 12M members who generate about 1PB a day of traffic information. The site models traffic flow in cities from member inputs and other sources, and generates alerts and notifications to its members on existing and impending vehicle traffic issues in their locality.
Schireson followed with a description of noSQL, short for "not only SQL". The opportunities come from the existence of big data. The total data generated every year is 1.8 zettabytes (1.8 × 10^21 bytes, or roughly 1,800 exabytes), and the volume is doubling every two years. This volume of data enables people to identify trends across a very large audience. Most of the data are available through a browser, but they suffer from access problems and a lack of interactive software for viewing them.
We got here by starting in '75 with relational databases. These databases were designed for static populations, business automation, and structured data, all on a central CPU with little memory. Now the requirements and uses are very different: populations are dynamic and browser-based, and the data are used for business-process innovation, taking into account all types of data and unstructured relationships. The data reside in a high-connectivity environment with distributed computing and lots of low-cost memory.
The move to a browser-based database allows for linear scale-out at the app level, and permits the administrator to monitor and manage the load balancer, web server, and database from a remote location. Unfortunately, the relational database doesn't scale the same way: its scaling costs are non-linear, come in large increments, and require lots of internal programming. The alternative is a noSQL database that scales linearly with the number of boxes in the system.
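The scale-out idea described above can be sketched in a few lines. This is a toy illustration, not any vendor's actual API: keys are hashed across a pool of server nodes, so adding boxes adds capacity without changing application code. The class and node names are hypothetical.

```python
import hashlib

class ShardedStore:
    """Toy key-value store that spreads keys across nodes by hash."""

    def __init__(self, nodes):
        # Each "node" is simulated here as an in-memory dict.
        self.nodes = {name: {} for name in nodes}

    def _node_for(self, key):
        # Hash the key and map it onto one of the nodes deterministically.
        digest = hashlib.md5(key.encode()).hexdigest()
        names = sorted(self.nodes)
        return names[int(digest, 16) % len(names)]

    def put(self, key, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key):
        return self.nodes[self._node_for(key)].get(key)

store = ShardedStore(["box1", "box2", "box3"])
store.put("user:42", {"name": "Ada"})
print(store.get("user:42"))
```

Because reads and writes for a given key always hash to the same node, capacity grows roughly linearly with the number of boxes; production systems use consistent hashing so that adding a node reshuffles only a fraction of the keys.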
An example of a fast-scaling operation was OMGPOP. It went from zero to 30M users in one month, then was bought by Zynga. It managed to scale without much effort due to its noSQL database and cloud capabilities. The newer databases enable flexibility and application solutions, and can address all vertical markets. The software is available from many sources and is open source. A transactional database provides focus, simplicity, and the capability for rapid adjustments.
Phillips responded that legacy databases are becoming irrelevant. The structure of records, tables, and keys is geared for a centralized datacenter and is not scalable. One way users have worked around these limitations is to partition the database into sub-sections and work to synchronize the separate sections. Others put a memcache layer with more memory in front of the database to speed up operations.
Types of databases?
Cutting answered that transactional databases are good for on-line operations. Oracle is still good for many other operations. Unstructured data and unknown queries need a big data solution. The economics are in favor of open software and commodity hardware.
Is the stack changing to Open Compute and OpenStack, with frameworks like Hadoop benefiting developers?
Cutting suggested that the costs and ease of evaluation and diagnosis provide an edge. Having the source code and no lock-in, in combination with resilient development, reduces risk.
MongoDB and evaluating the best technology?
Schireson offered that the openness and evaluation capabilities are driving the decision from the CIO to the developers, who are comfortable with the agility and natural operations, and appreciate an active development community.
Dashboards on systems and software tools? China is changing data center architectures to emphasize dashboards and scale?
Schireson suggested that visibility increases the requirement to monitor services through an agent. For someone doing 3B transactions per day on 6 servers, a multi-data-center architecture is the only way to go. This architecture does have its challenges, including policies on replication, and regional control and governance issues.
Mohan responded that opportunities exist with reductions in costs. Tools that enhance visibility and work with existing technologies allow services and insights into the data. Other areas include systems management and directed search to move data into spaces for little cost. These tools need to do more than previous tools and will need new algorithms and ways to sort the data. New technologies have to be integrated into one package and that package needs to have analytics as one key component.
VC and big data?
Mohan said that this is an unknown area. There are many size and diversity issues, and the market intelligence on these areas is still vague. This may be an area for the big companies.
Phillips remarked that the big companies are like Wang and others in the '80s; they didn't see the entire industry changing around them. The current big companies are in the same boat today.
Cutting opined that big data needs a schema-less view; so far, tools for truly unstructured data don't exist. Users save data in a native format with no presets. Traditionally, schemas are designed around expected queries and searches, but big data allows flexibility in how the data are later interpreted. The controlling issues are the cost of the hardware and the storage volume.
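The schema-less point can be made concrete with a small sketch. The records and field names below are hypothetical: heterogeneous events are stored in their native form (here, JSON strings), and any "schema" is applied only when the data are read.

```python
import json

# Records with entirely different shapes coexist in one store;
# nothing forces them to share columns or fields.
events = [
    {"type": "click", "user": "u1", "page": "/home"},
    {"type": "sensor", "device": "d7", "temp_c": 21.5},
]

# Stored as-is, with no preset schema.
raw = [json.dumps(e) for e in events]

# Schema-on-read: interpret the fields only at query time.
# Records lacking the field simply yield None.
pages = [json.loads(r).get("page") for r in raw]
print(pages)  # ['/home', None]
```

This is the flexibility Cutting describes: the cost is paid at read time (the reader must tolerate missing or varying fields) rather than at ingest time.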
Mohan said that some companies are meeting to handle management issues.
Cutting added that standards encourage competing implementations without fragmenting the market, and collaboration can reduce duplicated effort. SQL is still important for some big data functions.
Schireson objected that the rate of change is too high for a committee to follow.
Phillips agreed that standard query language(s) may help, and a commercial standard is needed.
Mendelsohn noted that relational database standards are still needed.
Schireson stated that a range of approaches is viable. The most common is to disallow complex transactions: under relaxed consistency models, updates propagate across servers and conflicts are resolved later, or the system allows only document-level transactions.
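A minimal sketch of the "resolve conflicts later" model Schireson mentions: two replicas accept writes independently, and a last-write-wins rule (here, a simple timestamp comparison, which is one common but imperfect tiebreaker; vector clocks are another option) reconciles them afterward. The field names are hypothetical.

```python
def merge(replica_a, replica_b):
    """Last-write-wins: keep whichever version has the newer timestamp."""
    return replica_a if replica_a["ts"] >= replica_b["ts"] else replica_b

# Two replicas updated the same record concurrently.
a = {"value": "v1", "ts": 100}
b = {"value": "v2", "ts": 105}

winner = merge(a, b)
print(winner["value"])  # v2
```

The tradeoff is exactly the one the panel debates: writes stay fast and available during partitions, but the losing update is silently discarded, which is unacceptable for data like account balances.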
Mohan asked why use MongoDB at all: the relational model is natural, tables are good for many tasks, and relational systems use multi-statement transactions with commit or rollback to resolve conflicts.
Phillips answered that transactions ultimately need to update the database. A denormalized data set allows one update per document, with synchronization across databases handled separately.
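The denormalized approach Phillips describes can be sketched as follows. This is an assumed example, not Couchbase's API: all data related to an order (customer, line items, status) is embedded in a single document, so what would be a multi-table transaction in a relational schema becomes one atomic document write.

```python
# A denormalized order document: customer and items are embedded,
# not joined in from separate tables.
order = {
    "_id": "order-1001",
    "customer": {"name": "Ada", "email": "ada@example.com"},
    "items": [{"sku": "A1", "qty": 2, "price": 9.99}],
    "status": "pending",
}

def ship_order(doc):
    """One document-level update replaces a multi-table transaction."""
    updated = dict(doc)
    updated["status"] = "shipped"
    return updated

order = ship_order(order)
print(order["status"])  # shipped
```

The design choice is the tradeoff the panel circles around: embedding makes single-document writes atomic and fast, at the cost of duplicating data (the customer's email lives in every order) and pushing cross-document consistency to a separate synchronization step.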
Use cases and data structures in applications like accounting systems carry high risk if the data states don't match. Is this oversimplified?
Phillips noted that general-purpose database technology is not good for business intelligence. It needs too many procedures to define the requirements.
Mendelsohn suggested that noSQL has issues. It has a good API for simple transactions, but the cost for complex transactions is too high.
Uses for big data?
Schireson suggested managing machine-generated data, archiving, and schemas that evolve over time. The relational data model needs a meta-data manager.
Accuracy versus decentralized?
Schireson offered that anyone can get precise information from a distributed system; the issues are synchronization, compute resources, and the ability to move the data in and out of the system. In many cases, approximate content is good enough if it arrives faster.
Architectural choices, speed and performance?
Schireson stated that there are many variables: consistency models, levels of transactions, data models (hierarchies, columns, graphs, etc.), functions like read and write frequency, and availability versus data security.
Mohan asked how do you set appropriate metrics for a system?
Schireson added resource limits, join operations, multi-statement transactions, development tools, and whether all this comes with reasonable tradeoffs.
Mendelsohn added that winning customers is important for existing noSQL companies; they have to demonstrate key values. SQL is a mature technology that is good for most users. Developers understand when to use one over the other.
Next data challenge?
Phillips offered that infrastructure is next: a solution that provides tools to solve problems, ways to understand what the data is, and the ability to handle policy, legal, and support issues around the system.
Mohan considered the cloud: usability and help managing open-source systems. Solutions are needed that can handle new releases and languages. For example, Ruby expanded to PHP and Python and created ways to make the data usable.
Cutting opined that action is next. The new databases are not just for social and mobile, but will move to conventional industries like agriculture, transport, and manufacturing as more business moves to the Internet.