Site Reliability Engineering (SRE) is a set of principles describing how Google coordinates its software teams, which are spread across many offices yet communicate effectively to build and fix problems that arise in core platforms like Gmail, YouTube, Search, Chrome, and Cloud. The new book is about 500 pages long and is free to read. It's one of the finest resources for any IT professional tackling such problems, and it gives the rest of us an insight into how Google works internally.
The Opening Grand Story!
The book opens with the story of how Google worked through a Wi-Fi password problem at its New York campus. One day in September 2012, the corporate transportation team informed every employee that the campus Wi-Fi password had changed, which sent thousands of employees to Google's password manager at once and crashed it. The password manager had been built five years earlier for a small group of system admins, and it could not handle the flood of employee traffic.

That was not all. When the primary instance was overwhelmed, the load balancer shifted some of the traffic to the other two replicas of the password manager, which eventually failed too. The real story is how Google cleared up the mess. Restarting the service required Hardware Security Module (HSM) smart cards, which were stored in different Google offices across the globe. When an engineer from the New York office called the Australian office to retrieve a card, it turned out to be locked in a safe, and the engineer who had stored it had forgotten the combination. More interesting still, the combination for that safe was stored in the password manager that had just crashed.

Fortunately, one engineer in California remembered the code and retrieved a card, and the Australian team drilled its safe open to retrieve theirs. Even so, none of the cards worked for the first hour, because they had been inserted into the reader upside down.

The book has other stories like this one, showing how Google approached and fixed its unexpected mistakes by bringing teams together. So have a look at the book if you're free and locked down, as we are, due to the coronavirus.

Read it here: Building Secure and Reliable Systems
Via: ZDNet