Big Data, Cloud

On this page we cover the following topics:

1. Big Data

2. Datawarehouse on Cloud

3. Cloud, elasticity

4. Other Technologies

1. Big Data

Term “Big Data” usually refers simply to big amounts of data which can not be fit into a single file, and is handled by splitting into multiple files across many computers. Example – Google has its data split over millions of servers (architecture called “BigTable”). Other big examples – Amazon, Microsoft, Facebook, etc. Splitting of data allows to process data in parallel. Most common paradigm is “map-reduce”. For example, a Google search request is “mapped” (sent) to multiple servers which keep pieces of data. Each server searches through its data – and sends its findings back to the center. Responses from multiple servers are “reduced” – merged together, sorted – and displayed.

Here are some links about Google’s original architecture:

The table below shows some open-source and proprietary software tools for maintaining and processing data on big distributed systems.

Open-Source and Proprietary Software Tools for Maintaining and Processing Data on Big Distributed Systems
Hadoop Open Source system to store data distributed over multiple (thousands) servers. Created in 2005 by Doug Cutting and Mike Cafarella for “Nutch” search engine project. Name “Hadoop” comes from the name of Cutting’s little son’s toy elephant. Cutting was working at Yahoo! at the time. Hadoop is written in Java.
Hadoop distributions Main distributions (package, develop, support): – Claudera -Palo Alto, CA – MapR – San Hose, CA – Hortonworks – Palo Alto, CA, funded by Yahoo, Teradata, etc.
HDFS Hadoop Distributed File System
YARN Yet Another Resource Negotiator – for better resource management and map reduce in Hadoop 2.x
Storm Distributed processing and streaming of data. Open source. Written in Clojure. Uses “spouts” and “bolts” to define topologies of moving data. Integrates well with many common messaging systems (RabbitMQ, Kestrel, Kafka, etc).
Samza Open Source distributed stream processing framework. Uses Apache Kafka for messaging. Written in Java and Scala.
Hive SQL-like language called HiveQL to query Hadoop HDFS data. DataWarehouse infrastructure. Can run on top of Spark for faster performance.
HBase Open Source distributed database on top of HDFS (Hadoop Distributed File System). Non-relational. Modeled after Google’s BigTable – storing large quantities of sparse data. Written in Java. Good for analyzing huge 2-dim data (billions of rows, millions of columns – searches, log processing, etc.). HTTP or Thrift interface.
Cassandra Open source distributed high-performance database to store and query huge amounts of data. Originally from Facebook. Has its own data model (not Hadoop). Many users load data from Hadoop into Cassandra to do analytics. Uses SQL-like query language (CQL3). Written in Java. Great for web analytics, transaction logs, etc.
Spark Open source, written in Scala, runs MapReduce up to 100x faster than Hadoop on top of Hadoop Distributed File System (HDFS). Originally developed in the AMPLab at UC Berkeley. Company – Databricks.
SparkSQL Open source, allows to use SQL (or HiveQL or Scala) over Spark. It is recommended to move from Hive to SparkSQL.
Spark RDD Resilient Distributed Dataset (RDD), the basic abstraction in Spark. Represents an immutable, partitioned collection of elements that can be operated on in parallel.
Tachyon Open source memory-caching for distributed file systems (HDFS, S3, GlusterFS, etc.). Hadoop compatible. Existing Spark and MapReduce programs can run on top of it without any code change. Used in Berkeley Data Analytics Stack (BDAS).
DataPad Data Analysis tools (Wes McKinney et al) – PyData, June 2014, 40minGraphLab Conf, July 2014, 22min
Pig provides language (Pig Latin) for MapReduce over HDFS. Originally from Yahoo, 2006.
More …
Mahout machine learning algorythm on the Hadoop platform. Also provides math and statistics.
Oozie A workflow scheduler system to manage Hadoop jobs (web app, Java servlets)
Flume distributed service for collecting, aggregating, and moving large amounts of log data, good for analytic applications
Sqoop tool for moving bulk data between Apache Hadoop and other storages (for example, relational databases)
Claudera Hue Cloudera web interface for using Hadoop and analyzing data.
Claudera CDH Cloudera Hadoop Distribution (free download)
Cloudera Express CDH + Cloudera Manager (free download) – best way to start with Hadoop
Cloudera Impala fast SQL queying of HDFS data (massive parrallel processing) – good for analytics, eliminates the need of moving data into some other datastore for analytics. Impala = medium-sized African antelope known for fast running and jumping.
MongoDB Open Source NoSQL database for JSON-like documents. Horizontally scalable. Written in C++. Queries (dynamic) in Javascript. You can define indexes. Fairly good performance. (name “mongo” comes from “humongous”, not from “Blazing Saddles” Mongo).
CouchDB Open source NoSQL database, uses JSON to store data, JavaScript as query language, HTTP for an API. Good for mobile devices.
Redis Open source, VERY FAST (written in C) in-memory (disk-backed) key-value database. REDIS = REmote DIctionary Server. DB should fit in RAM (or span RAMs of several computers using clustering). Supports bits, sets, lists, hashes. Support transactions. Great for real time stock prices, analytics, etc.
Kyoto TycoonKyoto Cabinet Open Source fast and lightweight network DBM over HTTP, can do more than 1 million insert/select per second. Multiple storage options (hash, tree, dir, etc.). Lua on server side. Can be used with C, Java, Python, Ruby, Perl, Lua, etc. Great for real-time data (cache).
Riak Open source distributed NoSQL key-value data store. Main benefit – high availability & fault tolerance. Simple to operate and to scale by adding more servers. Written in Erlang, C, and some Javascript. HTTP or custom binary interface. Map/reduce in JavaScript or Erlang.
Couchbase (Membase) Open Source distributed NoSQL document-oriented database to serve many concurrent users. Easy-to-scale key-value or document access with low latency and high sustained throughput. Designed to be clustered to very large scale deployments.
Neo4j Open source graph database for connected data (nodes and relationships). Transactional. Advanced path finding. Good for road maps, topologies, etc. (graphs). Written in Java.
Hypertable Open source database – basically a faster smaller HBase implemented in C++, uses ideas from Google’s BigTable. Runs on HDFS (or ClusterFS or KFS (Kosmos File System). Uses its own, “SQL-like” language, HQL.
ElasticSearch Open source distributed full-text search engine with HTTP interface and JSON documents (parent/children docs). Based on Java Lucene library. Great when you need advanced flexible fuzzy search.
Solr Open source popular blazing fast search platform. Based on Java Lucene library. Used by over pure-java Jetty web server/servlet container.
Accumulo Open source – similar to HBase, but provides cell-level security (access labels). Sorted, distributed key/value storage. Built on top of Hadoop, ZooKeeper, and Thrift. Written in Java and C++.
VoltDB The world’s fastest in-memory relational database. Fast ingestion and export, massive scalability, real-time analytics. SQL access from within pre-compiled Java stored procedures. Open Source and commercial versions.
Memcached Open source distributed memory caching system. Can be used to cache results of DB queries.
Scalaris Open source scalable, distributed, transactional key-value store. Written in Erlang, accessible from Python, Ruby, Java, etc.
BigTable Google proprietary data storage system. Compressed, high performance. Uses Google File System, Chubby Lock Service, SSTable (log-structured storage like LevelDB).
LevelDB open source on-disk key-value store (by Google).
GoogleFS Google File System – proprietary
QFS Quantcast File System (QFS) is an open-source distributed file system software package for large-scale MapReduce or other batch-processing workloads. It was designed as an alternative to Apache Hadoop’s HDFS, intended to deliver better performance and cost-efficiency for large-scale processing clusters.
Oracle noSQL database Oracle NoSQL Database (ONDB) provides network-accessible multi-terabyte distributed key/value pair storage with predictable latency.
Oracle BigData SQL Oracle Big Data SQL extends Oracle SQL to Hadoop and NoSQL
Oracle Data Integrator Oracle Data Intergator (ODI) is an ETL (Extract-Transform-Load) platform for high volume/high performance batch loads, event-driven, trickle-feed integration, SOA-enabled data service, etc. BigData support, parallelism. Monitoring.
Talend Open source ETL / integration tools, supports BigData. Look at Talend Jumpstart Sandbox.
Ab Initio Applications and custom soultions for very high-volume data processing and data integration. Record-breaking.
kiji Open Source framework for collecting, analyzing, and serving entity data in real time.
HPCC HPCC (High Performance Computing Cluster) – open source platform for massive parallel-processing to solve Big Data problems
Pivotal Big Data solutions, Hadoop distribution
HP Haven HP HAVEn Big Data platform = (Hadoop, Autonomy Corporation, Vertica, HP Enterprise Security Products).
Amazon Cloud Platform. Amazon Elastic Cloud Compute (Amazon EC2), Amazon Simple Storage Service (Amazon S3), Amazon Web Services (AWS), Amazon Redshift (Datawarehousing solutions), Amazon Elastic MapReduce.
Intel Partners with Claudera to provide server platforms for Big Data.
Signiant System to move large data sets into and out of the cloud (accross datacenters, parallel, encrypted). Signiant Media Shuttle, Signiant Media Exchange and Signiant Manager+Agents, Signiant SkyDrop.
BitYota Datawarehouse as a service
DataStax Delivers certified version of Apache Cassandra that is ready for heavy-duty production environments
Greenplum Bigdata analytics, in 2012 became a part of Pivotal.
Splunk Captures, indexes and correlates real-time data in a searchable repository from which it can generate graphs, reports, alerts, dashboards and visualizations
Sumo Logic Cloud-based log management and analytics service. Uses advanced machine learning algorithms to whittle down mountains of log file data into common groupings.
qubole Creators of Facebook’s Big Data infrastructure and Apache Hive have leveraged their experience to deliver Qubole Data Service (QDS) – a cloud Big Data service offering the same advanced capabilities used by Big Data savvy organizations.
altiscale Altiscale Data Cloud is the first cloud service purpose-built to run Hadoop. We offer an on-demand, elastic solution on a pay-as-you-go basis

2. DataWarehouse on Cloud

Amazon Redshift Amazon’s own database product in Amazon cloud
Snowflake Computing Elastic SQL database on top of amazon cloud, supports both relational tables and JSON (Snowflake Computing Elastic Data Warehouse
 Microsoft Azure  Elastic SQL Data Warehouse, on Microsoft Cloud
Teradata Expandable relational datawarehouse system (since 1979), “shared nothing” architecture.
IBM Big Data Platform , IBM dashDB IBM’s enterprise class big data and analytics platform – Watson Foundations, IBM dashDB
HPE Vertical in the cloud HP Vertica database on the cloud


3. Cloud Elasticity

Cloud is an architecture allowing for fast provisioning of servers on demand. You can elastically increase/decrease the number of servers you are using. You can use these servers for different tasks. For example, you can have a distributed file system (Hadoop, HDFS). Or you can install a database (RedShift, SnowFlake, etc.).

Top 10 Cloud service providers

1., San Francisco, CA
2. Amazon, Seattle, WA
3. Microsoft, Redmond, WA
4. Oracle, Redwood City, CA
5. Google, Mountain View, CA
6. SAP, Walldorf, Germany
7. SoftLayer (IBM), Dallas, TX
8. Terremark (a Verizon Company), Miami, FL
9. Rackspace, San Antonio, TX
10. NetSuite, San Mateo, CA
(Source: Top 100 Cloud service providers)


4. Some other technologies to consider: