{"id":12565,"date":"2015-06-02T19:45:23","date_gmt":"2015-06-02T19:45:23","guid":{"rendered":"http:\/\/www.smartdatacollective.com\/index.php\/post\/hadoop-can-t-do\/"},"modified":"2015-06-02T19:45:23","modified_gmt":"2015-06-02T19:45:23","slug":"hadoop-can-t-do","status":"publish","type":"post","link":"https:\/\/www.smartdatacollective.com\/hadoop-can-t-do\/","title":{"rendered":"Hadoop Can&#8217;t Do That"},"content":{"rendered":"<p>I just got back from a little executive summit conference in Dallas for Chief Data Officers. Frustratingly, I heard a lot of folks telling me what Hadoop CAN\u2019T do. Now, I know that Hadoop can\u2019t bring about world peace or get my husband to put the toilet seat down, but the things people keep saying it can\u2019t do are things that I\u2019ve personally DONE on Hadoop clusters, so I know they\u2019re doable.<\/p>\n<p>If you asked most people if water could cut through steel, they would probably tell you it can\u2019t. They would be wrong, too.<\/p>\n<p>Going to the <a href=\"http:\/\/www.evanta.com\/cdo\/summits\/dallas\" target=\"_blank\" rel=\"nofollow external noopener noreferrer ugc\" data-wpel-link=\"external\">Evanta CDO Summit<\/a> was a surprise. I first had to pinch hit for one of our sales execs who needed surgery, and then for my CMO when he missed his flight out from California. 
So, in the space of one day, the day before the conference, I went from \u201cnot going\u201d to \u201cgoing, but just talking to people\u201d to \u201cpresenting.\u201d Whee!<\/p>\n<p><a href=\"http:\/\/i2.wp.com\/bigdatapage.com\/wp-content\/uploads\/2015\/06\/Dallas_Flooding_Cropped.jpg\" rel=\"nofollow external noopener noreferrer ugc\" data-wpel-link=\"external\"><img decoding=\"async\" class=\"alignright size-medium wp-image-139\" src=\"http:\/\/i2.wp.com\/bigdatapage.com\/wp-content\/uploads\/2015\/06\/Dallas_Flooding_Cropped.jpg?resize=300%2C202\" alt=\"Brazos River Flooded in Dallas\" data-recalc-dims=\"1\" \/><\/a><\/p>\n<p>So, after doing tech support for one of our data scientists, Josh Poduska, for a <a href=\"http:\/\/www.meetup.com\/KNIME-Users-Group-Austin\/\" target=\"_blank\" rel=\"nofollow external noopener noreferrer ugc\" data-wpel-link=\"external\">KNIME sneak peek training at our local meetup group<\/a> until about 9 PM, I hopped in a car and drove to Dallas. Got there at 1:30 in the morning, just as the deluge started. I was in Houston last weekend to speak on panels at <a href=\"http:\/\/www.comicpalooza.com\/\" target=\"_blank\" rel=\"nofollow external noopener noreferrer ugc\" data-wpel-link=\"external\">Comicpalooza<\/a> in my <a href=\"http:\/\/www.amazon.com\/Paige-E.-Ewing\/e\/B009KEBPV0\/ref=sr_ntt_srch_lnk_1?qid=1433175408&amp;sr=8-1\" target=\"_blank\" rel=\"nofollow external noopener noreferrer ugc\" data-wpel-link=\"external\">fiction writer<\/a> role, and rode out the storms there in a Denny\u2019s. Three years ago, Texas was so dry all the crops were burnt brown and the lakes were all but gone. This year, everything\u2019s overflowing.<\/p>\n<p>During the summit, I heard a talk by <a href=\"https:\/\/www.linkedin.com\/in\/robsaker\" target=\"_blank\" rel=\"nofollow external noopener noreferrer ugc\" data-wpel-link=\"external\">Rob Saker, CDO of Crossmark<\/a>. 
He made a statement during his presentation that really stuck with me.<\/p>\n<blockquote>\n<p>\u201cData is like water. Not enough, and you die. Too much, and you\u2019re flooded. Properly focused, it can cut through steel.\u201d<\/p>\n<\/blockquote>\n<p>Surrounded by flooding in a state that was dying for lack of water a few years back, that struck me as an exceptionally apt metaphor. Before, businesses were not able to analyze even the small amount of data they could hold onto. They were dying for the lack of it. Now, they\u2019re flooded with data, but struggling to get their arms around it. Hadoop is the lifesaver, but people have pretty set notions about what Hadoop can\u2019t do.<\/p>\n<p>If you asked most people if water could cut through steel, they would probably tell you it can\u2019t. As my maker husband who loves computer-controlled water jets, lasers and router machines could tell you, water can and does cut through just about anything. Similarly, Hadoop, used properly, can accomplish just about any data analysis task.<\/p>\n<p>Here are some of the things people mentioned at the summit that Hadoop can\u2019t do:<\/p>\n<p><strong>Low Latency SQL access (with ACID compliance)<\/strong><\/p>\n<p>You can\u2019t do low latency, interactive SQL on Hadoop data, and there\u2019s certainly no way to get anything like transactional integrity.<\/p>\n<p>I\u2019m not going to beat this dead horse too much here. There are bunches of ways to access Hadoop data with SQL. That space is actually becoming a bit crowded. 
I already talked about the fact that \u201c<a href=\"http:\/\/bigdatapage.com\/not-all-hadoop-users-drop-acid\/\" target=\"_blank\" rel=\"nofollow external noopener noreferrer ugc\" data-wpel-link=\"external\">Not All Hadoop Users Drop ACID<\/a>\u201d and pointed out that SQL access was one great way to \u201c<a href=\"http:\/\/bigdatapage.com\/bridge-big-data-analytics-skills-gap\/\" target=\"_blank\" rel=\"nofollow external noopener noreferrer ugc\" data-wpel-link=\"external\">Bridge the Big Data Analytics Skills Gap<\/a>.\u201d Despite all the options and information out there about this, I just saw an article this morning pointing out that Hadoop had no SQL access. Sigh. (Not pointing to that one. It was not worth a link.)<\/p>\n<p>The other myth around this is that SQL access on Hadoop is crazy expensive because you need lots of specialized skills and time to make it work. There\u2019s a recent article by Tamara Dull on the Smart Data Collective called <a href=\"http:\/\/smartdatacollective.com\/tamaradull\/317716\/will-you-always-save-money-hadoop\" target=\"_blank\" rel=\"nofollow\" data-wpel-link=\"internal\">Will You Always Save Money with Hadoop?<\/a> comparing costs over time. It basically concludes that if you need sophisticated SQL access, it\u2019s more economical to just use an old-school data warehouse, no matter how much data you have. This runs completely counter to the whole argument for SQL access to Hadoop data: it gives your business analysts, the folks who already work for you and already know the data, access to all the data using tools and a language that they\u2019re already fluent in.<\/p>\n<p>How is that crazy expensive? That\u2019s the most economical possible way to handle large data sets. I find it very hard to believe that buying a bigger Oracle or Netezza appliance is a better way to go financially.<\/p>\n<p><strong>CRUD operations<\/strong><\/p>\n<p>Hadoop is append-only. 
You can\u2019t do inserts, updates, or deletes.<\/p>\n<p>Okay, well, don\u2019t tell that to Splice Machine or MarkLogic. For that matter, don\u2019t tell it to MapR, one of the big three Hadoop distributors, or the guys at Pivotal who make HAWQ. Don\u2019t tell it to us at Actian, for sure, because we\u2019ll laugh at you. Or, maybe not. We\u2019re generally polite, so we\u2019ll wait until you\u2019re not in the room, and then laugh at you. Maybe we\u2019ll call the guys from Splice Machine, MarkLogic, MapR and Pivotal over to share the joke.<\/p>\n<p>We do feel like we have a leg up on the other guys because <a href=\"http:\/\/www.actian.com\/products\/analytics-platform\/vortex-sql-hadoop-analytics\/\" target=\"_blank\" rel=\"nofollow external noopener noreferrer ugc\" data-wpel-link=\"external\">Actian Vortex<\/a> uses a technique that not only does full insert, update and delete operations, but does them with high concurrency without slowing down query speed. Heck, Teradata doesn\u2019t even do that. Most analytics databases can\u2019t do that. And we do it. On Hadoop. All the time.<\/p>\n<p><a href=\"http:\/\/lerablog.org\/business\/industry\/a-comparison-of-laser-cutting-vs-water-jet-cutting\/\" rel=\"nofollow external noopener noreferrer ugc\" data-wpel-link=\"external\"><img decoding=\"async\" class=\"alignright wp-image-136 size-medium\" src=\"http:\/\/i2.wp.com\/bigdatapage.com\/wp-content\/uploads\/2015\/06\/waterjet-cutting-1024x768.jpg?resize=300%2C225\" alt=\"Water Jet Cutting Steel\" data-recalc-dims=\"1\" \/><\/a><\/p>\n<p><strong>Batch processing without writing code<\/strong><\/p>\n<p>If you want to use Hadoop, you have to hire an army of expensive MapReduce coders.<\/p>\n<p>Not so much, no. I can\u2019t code MapReduce to save my life, but I\u2019ve designed data preparation and machine learning workflows, tested them out, executed them on a Hadoop cluster, tweaked and monitored them, and put the answers I got from them to use. 
I\u2019ve been working on a team with two data scientists and an infrastructure specialist for the past couple of years. The infrastructure specialist stands up clusters, maintains and builds workflows on Hadoop on a daily basis, and never touches MapReduce. The data scientists do their jobs on Hadoop all the time, and neither one has ever coded a word of MapReduce. Our marketing analytics department does analysis on data on Hadoop regularly, and none of them speak MapReduce.<\/p>\n<p>Hadoop today is not the batch-only, base-level MapReduce&nbsp;+&nbsp;HDFS starting point that it was a decade ago. Yet, that\u2019s still what many people think of when they hear the word Hadoop. Many people even think of Hadoop and MapReduce as synonymous. That just isn\u2019t the case.<\/p>\n<p>YARN has turned Hadoop into a cluster operating system that can support many types of execution engines. Spark, Actian DataFlow, and Tez are all examples of ways to process data on Hadoop clusters that don\u2019t use MapReduce.<\/p>\n<p>Even MapReduce jobs don\u2019t really require MapReduce coders. At this point, there are half a dozen different user interface applications that will let you design a MapReduce ETL process without writing a bit of&nbsp;code. Informatica and Pentaho, for example, will let you design your ETL workflows in the same interfaces you\u2019re accustomed to, then will turn those into MapReduce jobs for you and execute them on a nearby cluster.<\/p>\n<p>I\u2019m not saying that\u2019s the best way to go. MapReduce is slow, and MapReduce auto-generated by an interface is going to be even slower than usual. But if you\u2019ve got the time, it will do the job. Spark, Tez and DataFlow are all faster execution engines, and DataFlow&nbsp;workflows&nbsp;can be created in the KNIME user interface. 
So, there are those options as well.<\/p>\n<p>Whatever method you decide to use, you can do all of the batch data crunching that you expect from Hadoop, without writing a single word of MapReduce code, or hiring a single MapReduce coder.<\/p>\n<p><strong>Other stuff<\/strong><\/p>\n<p>There are a lot of other things that Hadoop CAN do right now that I keep hearing people saying it can\u2019t. It doesn\u2019t have any security, for instance. Have you heard of Kerberos, Knox, Sentry, \u2026? It supposedly doesn\u2019t have role-based authentication, or encryption, or audit capability, either. Actian does a lot of work for the financial and healthcare industries, among others. If those limitations were real, if that data weren\u2019t secure, those companies simply couldn\u2019t use that software; they would be legally obligated not to.<\/p>\n<p>The data management industry had this crazy hype machine going for a while that said that Hadoop could do everything from curing cancer to getting stubborn stains out of your socks. People, naturally, got disillusioned. Now, practical, sensible people like Chief Data Officers are on the other end of the pendulum swing. They now have ideas about what Hadoop can\u2019t do that were set when the software was in its infancy. Hadoop has grown up and, while it still can\u2019t wash your socks, it can process data efficiently, economically, and without requiring a massive investment in hardware or hard-to-find skills.<\/p>\n<p>There\u2019s more, too. 
I hear about data quality, life cycle management, data curation, \u2026 a lot of the stuff that <a href=\"http:\/\/www.ovum.com\/authors\/tony-baer\/\" target=\"_blank\" rel=\"nofollow external noopener noreferrer ugc\" data-wpel-link=\"external\">Tony Baer<\/a> and I both talked about on our <a href=\"http:\/\/bigdata.actian.com\/OvumHadoop\" target=\"_blank\" rel=\"nofollow external noopener noreferrer ugc\" data-wpel-link=\"external\">5 Tips for Getting Value Out of Hadoop<\/a> webinar last Thursday. That\u2019s all stuff that people keep telling me that Hadoop can\u2019t do.<\/p>\n<p>Right, and water can\u2019t cut through steel.<\/p>\n<p><a href=\"http:\/\/i1.wp.com\/bigdatapage.com\/wp-content\/uploads\/2015\/06\/Hadoop_Data_Water_Can_Cut_Steel.jpg\" rel=\"nofollow external noopener noreferrer ugc\" data-wpel-link=\"external\"><img decoding=\"async\" class=\"aligncenter wp-image-138 size-full\" src=\"http:\/\/i1.wp.com\/bigdatapage.com\/wp-content\/uploads\/2015\/06\/Hadoop_Data_Water_Can_Cut_Steel.jpg?resize=530%2C413\" alt=\"Hadoop Data Like Water Can Cut Steel\" data-recalc-dims=\"1\" \/><\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>I just got back from a little executive summit conference in Dallas for Chief Data Officers. Frustratingly, I heard a lot of folks telling me what Hadoop CAN\u2019T do. 
Now, I know that Hadoop can\u2019t bring about world peace or get my husband to put the toilet seat down, but the things people keep saying [&hellip;]<\/p>\n","protected":false},"author":345,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_seopress_robots_primary_cat":"","_seopress_titles_title":"","_seopress_titles_desc":"","_seopress_robots_index":"","footnotes":""},"categories":[22],"tags":[],"class_list":{"0":"post-12565","1":"post","2":"type-post","3":"status-publish","4":"format-standard","6":"category-hadoop"},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.smartdatacollective.com\/wp-json\/wp\/v2\/posts\/12565","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.smartdatacollective.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.smartdatacollective.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.smartdatacollective.com\/wp-json\/wp\/v2\/users\/345"}],"replies":[{"embeddable":true,"href":"https:\/\/www.smartdatacollective.com\/wp-json\/wp\/v2\/comments?post=12565"}],"version-history":[{"count":0,"href":"https:\/\/www.smartdatacollective.com\/wp-json\/wp\/v2\/posts\/12565\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.smartdatacollective.com\/wp-json\/wp\/v2\/media?parent=12565"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.smartdatacollective.com\/wp-json\/wp\/v2\/categories?post=12565"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.smartdatacollective.com\/wp-json\/wp\/v2\/tags?post=12565"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}