A Forest-Level View of the Big Data Ecosystem

Rich Morrow teaches Big Data, DevOps and AWS courses for DevelopIntelligence. He is a 20+ year veteran of IT and has done everything from being a trench-level core developer to VP of engineering. He is currently a developer evangelist, instructor, and prolific writer on Cloud, Big Data, DevOps/Agile, Mobile, and IoT topics.

Rich works with many companies and developers each year and this gives him a forest-level view of how technology is evolving and the pros/cons of many different tools. For this interview, we focused on the Big Data tools and the Big Data technology landscape.

DevelopIntelligence: Tell us a little bit about yourself, your background in technology, and what you’re working on.

Rich: Sure, I’ve been a software developer, by trade, for about 20 years. I did fingers-on-the-keyboard for many years. I’ve worked for Fortune 500 companies, government, startups, and have founded my own consultancy. I’ve kind of seen the IT world from every angle…but my real passion is startups.

I started my own consultancy about seven or eight years ago. For the first two or three years of consulting, I was basically doing fingers-on-the-keyboard software dev, and then I got into training on public cloud technologies. Instructor-led live training is about 50-60% of what I do and then the rest is just a mishmash of things like doing training videos for O’Reilly, training for DevelopIntelligence, webinars, blog posts, white papers, and speaking engagements. I did a couple of keynotes last year, and got to present at the AWS conference in November.

DevelopIntelligence: Oh cool. What do you enjoy about training?

Rich: I enjoy getting to work with about 30 to 40 really cool companies for like a week each throughout the course of a year.

The thing that I love about the training is it really gives me a broad perspective, because when I am teaching, I am learning.

Students are telling me things. They are like, “Hey, we had a problem with this technology in our organization and here’s why,” or “Hey, we found this other thing really useful.”

I get to pick up little tidbits, little nuggets of valuable stuff from a whole bunch of different folks. So when I go back to write thought leadership pieces and speak at events, I can come with a broader perspective.

DevelopIntelligence: Tell us a bit about the evolution of the big data ecosystem. Where could it be going and what are you bullish about?

Rich: Yes, totally great question. So, let’s go back about 50, 60 years maybe to really start talking about this. Big data analytics was confined to the big companies because in order to do big data analysis you had to plunk down a few million to buy (and then maintain) some Oracle or EMC hardware and software. Then you’d have to go back every few years, renew your license, and shovel another few million at Larry Ellison. This really limited who was able to do big data analysis.

The democratization of technology in the last 10-15 years is probably the most exciting thing to happen in IT in my lifetime. It’s reduced the cost and complexity of even traditionally expensive systems like Data Warehouses to the point that everyone now has access to these tools. Open source technologies like Spark combined with low-cost, pay-as-you-go cloud platforms like AWS mean you can rent cloud hardware and software to run these analyses for less than $100. And even if you’re a mom and pop shop and you just want to analyze your weblogs, you can do that in Amazon for like five, ten bucks a pop once you’ve figured out what you need to do.

This is really exciting because just like in any kind of democracy, you get a whole bunch of ideas flowing. You need a large community of people sharing code, sharing ideas, sharing architectural patterns, and coming up with new and novel ways to do things. And when this happens, everybody benefits.

I’m also really excited about the Internet of Things (IoT) as it’s the marriage of our physical and digital worlds. This is going to be much bigger and more impactful than anything that’s come before. To keep up on the developments, I attend as many IoT conferences as I can.

There’s all these cool stories coming out of the IoT world. One that sticks in my mind is the company doing “IoC” (Internet of Cattle), where they literally affix a sensor on cows and track their temperature, how many times they got sick, if they got a shot or immunization, all that detail. The goal was that when you go into Whole Foods and buy some steak, it would literally list out all the important aspects of your meat. You could tell if the refrigerated shipping truck ever failed and took the meat over a certain temperature, you could see what it was fed, what antibiotics it was injected with. You could see all that data. That’s just one powerful example that you look at and say, “Wow, this is going to change everything.” We’re starting to put IoT sensors on our parking structures so that when you drive in, it just automatically tells you which spaces are open on which floors. If it’s a driverless car, it could even park itself there. With IoT, there’s all kinds of ways we can help optimize businesses, lives, all kinds of cool stuff.

DevelopIntelligence: Are big data and the Internet of Things evolving together?

Rich: Absolutely. It is really interesting because different technologies are now influencing each other from totally different disciplines. Like how genome sequencing work coming out of biology results in new mathematical algorithms that we can plug back into other unrelated systems like cascading network data failures.

Technology is the common catalyst for all this. I’m constantly bringing up mobile, IoT, cloud, and Big Data in a single conversation because they are so interconnected. If you are trying to do a mobile app these days and you want it to scale, you are going to go to public cloud. When it scales, you’re going to kick out a bunch of data which you then want to extract value from, and that’s going to lead you down the big data rabbit hole.

DevelopIntelligence: With specific technologies in those spaces that you’ve been working with over the last five to fifteen years… how have they been evolving and which ones are you bullish about?

Rich: The biggest thing that happened, probably right around that timeline, about 15 years ago, was when open source became valid in ways it wasn’t before. Steve Ballmer (from Microsoft) hated Linux because Linux started eating their lunch. Recent versions of Windows include Linux. You know you’ve won when they mimic you.

Hadoop was the first open source project to really shake up the big data world. It kind of came out of nowhere, and provided a low-cost storage and processing framework that could scale like mad. Because it makes no assumption about the underlying hardware, a lot of companies would just throw it on five or ten or twenty old machines lying around and boom! All of a sudden you’ve got something that folks could start using as an analytics engine. They could build a NoSQL database with HBase, a Data Warehouse with Hive, and so forth. A whole complete, rich ecosystem of projects just kind of grew up out of that. Years and years of growth, innovation, and adoption of the Hadoop platform resulted in a large user base that attracted many other projects.

Probably the hottest project in any tech space right now is Apache Spark, an in-memory analytics engine that came out of UC Berkeley’s AMPLab, the group behind the Berkeley Data Analytics Stack (or BDAS). The AMPLab folks are the same ones who developed Apache Mesos. Lots of people would lampoon me for this, but Spark is basically Hadoop on steroids, or at least the processing half of Hadoop (Spark brings no native storage, but it can and frequently does use Hadoop’s storage engine, HDFS, the Hadoop Distributed File System). Spark lets you do all the same type of parallel data analytics you did with Hadoop MapReduce, but it does it much, much faster, lending itself to real-time analytics, interactive analytics, and SQL-based queries. It ships with a module called Spark SQL that can essentially turn your Spark cluster into a high-speed data warehouse which you could plug your visualization interfaces (like Tableau, Looker, MicroStrategy) right into and start doing point-and-click analytics. And the SQL actually runs as a first-class citizen, so it doesn’t need to compile down to MapReduce jobs like it does with Hive in Hadoop. It just runs crazy fast, 10 to 50 times faster depending on the query.

Spark gets even more attractive when you pair it up with public cloud. The high-memory machines Spark requires are relatively expensive because memory is still one of the most expensive parts of a server. Rather than having to go out and buy these expensive clusters of several hundred or several thousand machines, you can just run them on Amazon on an hourly basis. And like a lot of organizations, you may do your analytics for less than an hour once a day, or a few hours a month or quarter.

If you were doing these analytics with on-premises systems, you’d waste a lot of cash on licenses, consulting fees, power, cooling, data center real estate, replacing failed hardware, and other maintenance costs. All these hidden costs just go straight out the window when you look at something like Hadoop in AWS. There’s a lot of companies out there, big organizations, big enterprises, that do their weekly analytics for 20 to 50 bucks a month on Amazon.
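To make the pay-by-the-hour arithmetic concrete, here is a minimal sketch in Python. Every number in it is a hypothetical placeholder chosen for illustration (the node count, the hourly rate, and the run schedule are assumptions, not actual AWS pricing), and the function name is made up for this example.

```python
def monthly_cluster_cost(nodes, hourly_rate, hours_per_run, runs_per_month):
    """Cost of renting a cluster only while a job is running.

    All inputs are illustrative assumptions, not real cloud prices.
    """
    return nodes * hourly_rate * hours_per_run * runs_per_month


if __name__ == "__main__":
    # Hypothetical: 10 rented nodes at $0.50 per node-hour,
    # one 1-hour job per week (4 runs a month).
    weekly_job = monthly_cluster_cost(10, 0.50, 1, 4)

    # Same hypothetical cluster left running 24 hours a day for 30 days.
    always_on = monthly_cluster_cost(10, 0.50, 24, 30)

    print(f"Weekly job, rented hourly: ${weekly_job:.2f} per month")
    print(f"Same cluster always on:    ${always_on:.2f} per month")
```

Under these made-up inputs the rented-by-the-hour figure lands in the range Rich describes, while the always-on figure shows why paying only for the hours you use matters so much for occasional analytics.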

DevelopIntelligence: Some people say there’s too much big data hype. Do you ever worry or think about the hype cycle?

Rich: Hype definitely exists, but maybe less so in this space. “Big Data” has already passed through Gartner’s trough of disillusionment because it’s been out there for so long. Many hot technologies get overhyped, but once they come through the “disillusionment” phase, you know the hype has leveled off.

In Big Data, something like streaming or Hadoop or Spark will come out and be followed by a big six-to-eight-month hype phase. Everybody goes, “This is the greatest thing ever! The old stuff is no longer cool.” Then you go through this realization phase and you start realizing that some parts don’t quite work out the same way and it does have some limitations you didn’t know about. I would say it takes about a year for all that to settle down until you have got a realistic understanding of what exactly this technology is and where it shines and where it doesn’t.

I think all the technologies we’ve talked about have gone beyond the disillusionment phase. NoSQL, like certain cloud-enabled data warehouses, has gone way beyond that. IoT is just starting to get there, and then machine learning will come next.

DevelopIntelligence: What would you recommend that someone young or new to the field focus on learning?

Rich: First thing I would say is understand distributed system design. When you look at all these systems that do big data analytics, whether it’s Hadoop or Spark or Redshift or whatever else is out there, even some of the NoSQL engines that we use for big data analytics, they all are distributed systems. So, understand how these systems work and what their limitations are. Things like CAP Theorem, Horizontal Linear Scale, MapReduce, and Eventual Consistency underpin all of the modern distributed systems. Then find one you believe in and become an expert. I would say the big one right now is probably Spark. And a good language to learn is R, which you can use with Spark through SparkR. It’s kind of the statistical language that came out of the data sciences specifically for analytics. Learning public clouds is probably important, as well as streaming.
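For readers new to MapReduce, one of the concepts Rich lists, the following is a minimal single-machine sketch of the map, shuffle, and reduce steps in plain Python. It is not Hadoop or Spark code. The sample log lines, their format, and all function names are assumptions made up for illustration. In a real cluster the framework runs the map and reduce functions in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical web-log lines; the "METHOD PATH STATUS" format is an assumption.
LOG_LINES = [
    "GET /index.html 200",
    "GET /missing 404",
    "POST /login 200",
    "GET /index.html 200",
]


def map_phase(line):
    """Map: emit (key, 1) pairs. Here the key is the HTTP status code."""
    status = line.split()[-1]
    return [(status, 1)]


def shuffle(pairs):
    """Shuffle: group every emitted value under its key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence on one machine."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(LOG_LINES))  # {'200': 3, '404': 1}
```

Because each map call looks at one line and each reduce call looks at one key, the work can be split across machines without the pieces needing to talk to each other, which is the property that lets these systems scale horizontally.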

I think the recipe for success these days with any technology is to become a jack of all trades and a master of some. That way you don’t go deer in the headlights when somebody on your team talks about any other technology, and you are the go-to guy or gal when they have questions about those two or three technologies in your wheelhouse.

The challenge with both cloud and big data is if you are not paying attention to this space for like a year, it just completely leapfrogs over what was available last year. You don’t want to be a company making last year’s design when all your competitors have something faster, cheaper and more feature rich.
