How do big tech giants like Google and Facebook handle BIG DATA?
We are living in an era where, whenever we are stuck on a problem, we Google it, and whenever we want suggestions from people across the world or want to share our thoughts, we use social sites like Facebook, Instagram, LinkedIn, etc.
According to reports released by Facebook, the company deals with
500+ terabytes of data each day. It pulls in 2.7 billion Like actions and 300 million photos per day, and it scans roughly 105 terabytes of data every half hour.
Google now processes over 40,000 search queries every second on average, which translates to over 3.5 billion searches per day and 1.2 trillion searches per year worldwide.
But have you ever wondered where all the data you search on Google, and the billions of pieces of content uploaded to social sites by people across the globe, is actually stored? How do these tech giants store such a large volume of data? How is the post you uploaded retrieved within seconds whenever you want it?
Well, these companies don’t store data on a single drive like we normally do on our mobiles and PCs. The exponential growth of data volumes across all industries, generally termed big data, demands new storage technology.
What is meant by Big data?
Big data refers to very large datasets that cannot be processed with traditional relational database management tools or information systems. According to the National Institute of Standards and Technology (NIST), big data refers to datasets that traditional data architectures are unable to handle efficiently.
To handle such a big volume of data, these companies use the concept of distributed storage. Distributed storage is an approach in which data is split and stored across multiple physical servers, and often across more than one data center. It typically takes the form of a cluster of storage units, with a mechanism for data synchronization and coordination between cluster nodes. In simple words, the data is stored across a network, and since it is stored across a network, all the complications of a network come into play.
This is where Hadoop comes in. It provides one of the most reliable file systems. HDFS (Hadoop Distributed File System) is a unique design that provides storage for extremely large files with a streaming data access pattern, and it runs on commodity hardware.
Let’s elaborate on these terms:
Extremely large files: Here we are talking about data in the range of petabytes (1 PB = 1,000 TB).
Streaming Data Access Pattern: HDFS is designed on the principle of write-once, read-many-times. Once data is written, large portions of the dataset can be processed any number of times.
Commodity hardware: Hardware that is inexpensive and easily available in the market. This is one of the features that especially distinguishes HDFS from other file systems.
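As a minimal illustration of the write-once, read-many pattern, here is a sketch that writes a file to HDFS and reads it back using the Python `hdfs` (WebHDFS) client. The NameNode URL, user name, and file path are placeholder assumptions, not values from any real cluster.

```python
# Minimal sketch: write once, read many times over WebHDFS.
# Assumes the `hdfs` Python package (HdfsCLI) and a reachable NameNode;
# the URL, user, and path below are illustrative placeholders.
from hdfs import InsecureClient

# Connect to the NameNode's WebHDFS endpoint (hostname/port are assumptions).
client = InsecureClient('http://namenode.example.com:9870', user='hadoop')

# Write the data once...
client.write('/user/hadoop/events/day01.log',
             data='user1,like\nuser2,photo_upload\n',
             encoding='utf-8',
             overwrite=True)

# ...then read it back as many times as needed.
with client.read('/user/hadoop/events/day01.log', encoding='utf-8') as reader:
    print(reader.read())
```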
In an HDFS cluster there are thousands of servers connected, which together serve the purpose of distributed storage. In such an architecture, all the servers or storage units connected via the network are called nodes. There is a concept of master and slave nodes: the node that manages all the other nodes and assigns work to them is called the master node, and all the other nodes are slave nodes.
There is also a concept of NameNode and DataNode (a toy sketch follows this list):
· NameNodes:
Run on the master node.
Store metadata (data about data) like file paths, the number of blocks, block IDs, etc.
Require a high amount of RAM.
Store metadata in RAM for fast retrieval, i.e., to reduce seek time; a persistent copy is also kept on disk.
· DataNodes:
Run on slave nodes.
Require a large amount of storage, as the actual data is stored here.
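To make this division of labour concrete, here is a toy, in-memory simulation (not real Hadoop code) of how a NameNode records which DataNodes hold the blocks of a file. The block size, replication factor, and node names are made up for illustration.

```python
# Toy illustration (not Hadoop source code): a NameNode keeps only metadata
# in RAM (file -> block IDs -> DataNode locations); DataNodes hold the bytes.
import itertools

BLOCK_SIZE = 4          # bytes, tiny on purpose; HDFS defaults to 128 MB
REPLICATION = 2         # HDFS defaults to 3
DATANODES = ['dn1', 'dn2', 'dn3']

namenode_metadata = {}                           # path -> list of (block_id, [datanodes])
datanode_storage = {dn: {} for dn in DATANODES}  # datanode -> block_id -> bytes
block_ids = itertools.count()

def put(path, data):
    """Split `data` into blocks, replicate each block, record metadata."""
    blocks = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block_id = next(block_ids)
        chunk = data[offset:offset + BLOCK_SIZE]
        # Round-robin placement of the replicas across DataNodes.
        targets = [DATANODES[(block_id + r) % len(DATANODES)] for r in range(REPLICATION)]
        for dn in targets:
            datanode_storage[dn][block_id] = chunk
        blocks.append((block_id, targets))
    namenode_metadata[path] = blocks

def get(path):
    """Ask the 'NameNode' for block locations, then read from any replica."""
    return b''.join(datanode_storage[locations[0]][block_id]
                    for block_id, locations in namenode_metadata[path])

put('/photos/cat.jpg', b'0123456789abcdef')
assert get('/photos/cat.jpg') == b'0123456789abcdef'
print(namenode_metadata['/photos/cat.jpg'])
```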
How is Facebook dealing with BIG DATA?
There is a combined workforce of people and technology constantly working behind the successful implementation of this platform.
Lots of data is generated on Facebook:
- 200+ million active users
- 30 million users update their statuses at least once each day
- More than 900 million photos uploaded to the site each month
- More than 10 million videos uploaded each month
- More than 1 billion pieces of content (web links, news stories, blog posts, notes, photos, etc.) shared each week
Hadoop
“Facebook runs the world’s largest Hadoop cluster” says Jay Parikh, Vice President Infrastructure Engineering, Facebook.
Basically, Facebook runs the biggest Hadoop cluster, which goes beyond 4,000 machines and stores hundreds of millions of gigabytes. This extensive cluster provides some key abilities to developers:
- The developers can freely write MapReduce programs in any language (a Hadoop Streaming sketch in Python follows this list).
- SQL has been integrated to process extensive datasets, as most of the data in Hadoop’s file system is in tabular format. Hence, it becomes easily accessible to developers who know even a small subset of SQL.
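For example, with Hadoop Streaming any executable that reads stdin and writes stdout can act as a mapper or reducer, so a word count can be written in plain Python. The script below is a generic sketch, not Facebook’s code; the file names and streaming jar path in the comment are assumptions.

```python
#!/usr/bin/env python3
# wordcount.py: generic Hadoop Streaming sketch (not Facebook's code).
# The same file is used as both mapper and reducer, e.g.:
#   hadoop jar hadoop-streaming.jar \
#     -input /logs -output /wordcount \
#     -mapper  "python3 wordcount.py map" \
#     -reducer "python3 wordcount.py reduce"
import sys

def mapper():
    # Emit "word<TAB>1" for every word on stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Hadoop sorts mapper output by key, so identical words arrive together.
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```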
Hadoop provides a common infrastructure for Facebook with efficiency and reliability. From searching, log processing, recommendation systems, and data warehousing to video and image analysis, Hadoop empowers this social networking platform in every way possible. Facebook built its first user-facing Hadoop application, Facebook Messenger, on the Hadoop database Apache HBase, which has a layered architecture that supports a plethora of messages in a single day.
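To give a feel for the HBase data model that such a message store relies on, here is a small sketch using the Python `happybase` client over HBase’s Thrift gateway. The host, table name, column family, and row-key layout are illustrative assumptions, not Facebook’s actual schema.

```python
# Sketch of an HBase-style message store via happybase (Thrift gateway).
# Host, table/column names, and the row-key design are assumptions.
import time
import happybase

connection = happybase.Connection('hbase-thrift.example.com')  # hypothetical host
messages = connection.table('messages')                        # hypothetical table

def row_key(user_id, ts):
    # User id + reversed timestamp, so a user's newest messages sort first.
    return f"{user_id}:{2**63 - ts}".encode()

ts = int(time.time() * 1000)
messages.put(row_key('user42', ts), {
    b'msg:sender': b'user7',
    b'msg:body': b'Hey, did you see the photo I uploaded?',
})

# Scan the latest messages for one user by row-key prefix.
for key, data in messages.scan(row_prefix=b'user42:', limit=10):
    print(key, data[b'msg:body'].decode())
```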
Scuba
With a huge amount of unstructured data coming in each day, Facebook slowly realized that it needed a platform to speed up the analysis. That is when it developed Scuba, which helps Hadoop developers dive into massive data sets and carry out ad-hoc analyses in real time.
Facebook was not initially prepared to run across multiple data centers, and a single breakdown could cause the entire platform to crash. Scuba, another big data platform, allows developers to store bulk data in memory, which speeds up analysis. It uses small software agents that collect data from multiple data centers and compress it into a log data format. This compressed log data is then loaded by Scuba into memory systems, where it is instantly accessible.
According to Jay Parikh, “Scuba gives us this very dynamic view into how our infrastructure is doing — how our servers are doing, how our network is doing, how the different software systems are interacting.”
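Scuba itself is proprietary and its internals are not public, but the core idea (keep recent log rows in memory and answer ad-hoc slice-and-dice queries instantly) can be illustrated with a toy sketch. The field names and sample rows below are invented.

```python
# Toy illustration only: mimics the idea of ad-hoc, in-memory slicing of
# recent log rows. Field names and sample rows are invented, not Scuba's.
from collections import defaultdict

# Pretend these rows were collected by agents running on many servers.
rows = [
    {'datacenter': 'dc1', 'service': 'web',   'latency_ms': 12},
    {'datacenter': 'dc1', 'service': 'cache', 'latency_ms': 3},
    {'datacenter': 'dc2', 'service': 'web',   'latency_ms': 40},
    {'datacenter': 'dc2', 'service': 'web',   'latency_ms': 25},
]

def avg_latency_by(rows, dimension, **filters):
    """Ad-hoc query: filter rows, then average latency grouped by `dimension`."""
    groups = defaultdict(list)
    for row in rows:
        if all(row.get(key) == value for key, value in filters.items()):
            groups[row[dimension]].append(row['latency_ms'])
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

print(avg_latency_by(rows, 'datacenter', service='web'))
# {'dc1': 12.0, 'dc2': 32.5}
```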
Cassandra
“The amount of data to be stored, the rate of growth of the data, and the requirement to serve it within strict SLAs made it very apparent that a new storage solution was absolutely essential.”
- Avinash Lakshman, Search Team, Facebook
Traditional data storage started lagging behind when Facebook’s search team ran into the Inbox Search problem. The developers were facing issues in storing the reverse indices of messages sent and received by users. The challenge was to develop a new storage solution that could solve the Inbox Search problem and similar problems in the future. That is when Prashant Malik and Avinash Lakshman started developing Cassandra.
The objective was to develop a distributed storage system dedicated to managing large amounts of structured data across multiple commodity servers, with no single point of failure.
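As a rough sketch of the kind of schema an inbox-search workload implies, here is a minimal example using the DataStax Python driver. The keyspace, table, and column names, as well as the contact points, are assumptions for illustration, not Facebook’s original design.

```python
# Minimal Cassandra sketch via the DataStax Python driver
# (pip install cassandra-driver). Names and hosts are illustrative.
from cassandra.cluster import Cluster

cluster = Cluster(['10.0.0.1', '10.0.0.2'])   # hypothetical contact points
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS inbox
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS inbox.search_index (
        user_id    text,
        term       text,
        message_id timeuuid,
        PRIMARY KEY ((user_id, term), message_id)
    ) WITH CLUSTERING ORDER BY (message_id DESC)
""")

# Index a message under a search term, then look the term up later.
session.execute(
    "INSERT INTO inbox.search_index (user_id, term, message_id) "
    "VALUES (%s, %s, now())",
    ('user42', 'hadoop'),
)
rows = session.execute(
    "SELECT message_id FROM inbox.search_index WHERE user_id=%s AND term=%s",
    ('user42', 'hadoop'),
)
for row in rows:
    print(row.message_id)
```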
Hive
After Yahoo implemented Hadoop for its search engine, Facebook thought about empowering its data scientists, whose data had outgrown the Oracle data warehouse. Hence, Hive came into existence. This tool improved the query capability of Hadoop by using a subset of SQL and soon gained popularity in the unstructured world. Today thousands of jobs are run on this system every day to process a range of applications quickly.
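The kind of query this enables looks like ordinary SQL. The sketch below submits HiveQL through the PyHive client; the HiveServer2 host, table, and column names are assumptions for illustration.

```python
# Sketch: running HiveQL over Hadoop data with PyHive (pip install pyhive).
# HiveServer2 host/port and the table/column names are illustrative.
from pyhive import hive

conn = hive.connect(host='hiveserver2.example.com', port=10000, username='analyst')
cursor = conn.cursor()

# Hive compiles this SQL into distributed jobs over HDFS behind the scenes.
cursor.execute("""
    SELECT action, COUNT(*) AS events
    FROM user_activity_log
    WHERE ds = '2023-01-01'
    GROUP BY action
    ORDER BY events DESC
    LIMIT 10
""")
for action, events in cursor.fetchall():
    print(action, events)
```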
Prism
Hadoop wasn’t designed to run across multiple facilities. Typically, because it requires such heavy communication between servers, clusters are limited to a single data center.
Initially, when Facebook implemented Hadoop, it was not designed to run across multiple data centers, and that is when Facebook’s team felt the need to develop Prism. Prism is a platform that brings in multiple namespaces instead of the single one governed by Hadoop, which in turn helps to create many logical clusters.
As a result, the system can now expand to as many servers as needed, without being limited to a single data center.
Corona
Developed by ex-Yahoo engineer Avery Ching and his team, Corona allows multiple jobs to be processed at a time on a single Hadoop cluster without crashing the system. The idea of Corona sprouted when the developers started facing issues with Hadoop’s framework: it was getting tougher to manage the cluster resources and task trackers. MapReduce was designed on a pull-based scheduling model, which caused delays in processing small jobs, and Hadoop was limited by its slot-based resource management model, which wasted slots whenever the workload did not fit the slot configuration.
Developing and implementing Corona helped in forming a new scheduling framework that could separate the cluster resource management from job coordination.
Peregrine
Another tool, developed by Murthy, was Peregrine, which is dedicated to addressing the issue of querying data as quickly as possible. Since Hadoop was developed as a batch system that takes time to run different jobs, Peregrine brought the entire process closer to real time.
Apart from the above prime implementations, Facebook uses many other small and large pieces of technology to support its big data infrastructure, such as Memcached, HipHop for PHP, Haystack, BigPipe, Scribe, Thrift, Varnish, etc.
Today Facebook is one of the biggest corporations on earth, thanks to its extensive data on over one and a half billion people. This has given it enough clout to negotiate with over 3 million advertisers on its platform and to clock staggering revenue north of 17 billion US dollars. But privacy and security concerns still loom large over whether Facebook will use those gargantuan volumes of data to serve humanity’s greater good or just to make more money.
But one thing is for sure: it is Big Data indeed that has propelled Facebook, a small-time Harvard dorm startup, into the constellation of some of the biggest corporations on earth!