{"id":24944,"date":"2018-03-23T09:30:03","date_gmt":"2018-03-23T09:30:03","guid":{"rendered":"https:\/\/www.smartdatacollective.com\/?p=24944"},"modified":"2018-03-22T20:36:09","modified_gmt":"2018-03-22T20:36:09","slug":"turbo-charge-data-scientist-productivity-data-catalog","status":"publish","type":"post","link":"https:\/\/www.smartdatacollective.com\/turbo-charge-data-scientist-productivity-data-catalog\/","title":{"rendered":"Turbo-Charge Data Scientist Productivity with a Data Catalog"},"content":{"rendered":"<p>The average salary of a data scientist in the U.S. is\u00a0<a href=\"https:\/\/www.glassdoor.com\/Salaries\/data-scientist-salary-SRCH_KO0,14.htm\" data-wpel-link=\"external\" rel=\"external noopener noreferrer ugc\">nearly $130,000<\/a>, a figure that\u2019s bound to climb as the\u00a0<a href=\"https:\/\/cmba.fiu.edu\/articles\/is-there-a-data-scientist-shortage.aspx\" data-wpel-link=\"external\" rel=\"external noopener noreferrer ugc\">shortage<\/a> of people with the requisite skills persists. With that kind of investment at\u00a0stake, any company would want to get the maximum value out of its skills investment,\u00a0but by most accounts, data scientists typically spend 80% of their time on the routine and monotonous tasks of <a href=\"https:\/\/www.smartdatacollective.com\/moving-self-serve-analytics-you-need-data-catalog\/\" data-wpel-link=\"internal\">finding and organizing data<\/a>.<\/p>\n<p>They have no choice. Corporations have adopted data lakes enthusiastically, but without good governance and quality control procedures, those data lakes quickly become data swamps. 
Duplication, inconsistency, omissions, <a href=\"https:\/\/www.smartdatacollective.com\/data-quantity-or-data-quality\/\" data-wpel-link=\"internal\">data quality<\/a> issues, format incompatibilities, acceptable use policies, and permission problems are just some of the obstacles data scientists must navigate to whip information into shape so they can do the analyses and find the insights that matter to the business.<\/p>\n<p>And that\u2019s if they can find the data in the first place. In many organizations, silos have grown up over the years that make important data difficult or impossible to track down. Even if data scientists can locate the right information, they may wait weeks for the owners to make it available. Then begins the laborious task of correcting errors, harmonizing formats, filling in gaps, and resolving conflicts. It\u2019s not surprising that this grunt work can consume most of an expensive data scientist\u2019s time.<\/p>\n<h2><strong>Why Data Catalogs Are the Solution<\/strong><\/h2>\n<p>Organizations that are serious about data science need to be serious about <a href=\"https:\/\/www.smartdatacollective.com\/heal-heartbreak-data-sprawl-data-catalog\/\" data-wpel-link=\"internal\">data catalogs<\/a>. Today\u2019s technology enables machines to discover and classify data wherever it lives in the organization. And machine learning technology makes catalogs smarter as they work. 
With a little help from a human to resolve questions and inconsistencies, data catalogs can quickly <a href=\"https:\/\/www.waterlinedata.com\/blog\/data-fingerprinting-the-magic-is-finally-revealed\/\" data-wpel-link=\"external\" rel=\"external noopener noreferrer ugc\">learn to make their own decisions<\/a> without human intervention.<\/p>\n<blockquote><p>A good rule of thumb is to assume that 80% of the effort is going to center around data-integration activities\u2026 A similar 80% of the effort within data integration is to identify and profile data sources.<\/p>\n<p>&#8212; Boris Evelson, Forrester Research, March 25, 2015<\/p>\n<p>Forrester Research: <a href=\"https:\/\/www.forrester.com\/report\/Boost+Your+Business+Insights+By+Converging+Big+Data+And+BI\/-\/E-RES115633\" data-wpel-link=\"external\" rel=\"external noopener noreferrer ugc\">Boost Your Business Insights By Converging Big Data And BI<\/a><\/p><\/blockquote>\n<p>Data catalogs help data scientists in areas other than just information discovery. They\u2019re one of the best ways to identify duplicate or inconsistent information, cutting down on a laborious human task. Tags applied automatically or by humans through crowdsourcing can help data scientists decide if a given dataset is useful or extraneous without requiring them to dig into the data itself. The catalog can also indicate permissions and <a href=\"https:\/\/mapr.com\/resources\/waterline-data-data-governance-real-time-data-lake\/\" data-wpel-link=\"external\" rel=\"external noopener noreferrer ugc\">data governance standards<\/a> that tell whether it\u2019s OK to use a given set of records.<\/p>\n<h2><strong>How Catalogs Ease the Burden on Data Scientists<\/strong><\/h2>\n<p>Data swamps present a formidable challenge to data scientists. 
Without a clear definition of data types, intended usage, and quality rating, scientists are left to make their best guess about what to use and what to disregard.<\/p>\n<p>Unfortunately, poor data quality is a rampant problem. Experian\u2019s <a href=\"https:\/\/www.edq.com\/globalassets\/white-papers\/2017-global-data-management-benchmark-report.pdf\" data-wpel-link=\"external\" rel=\"external noopener noreferrer ugc\">2017 Global Data Management Benchmark Report<\/a> found that fewer than half of the organizations surveyed trust their data to make important business decisions. The most frequently cited cause of poor data quality is human error, such as sloppy data entry. Then there is poorly identified data. For example, a string of eight digits may be part of a phone number or Social Security number, an account number, or a date. A smart data catalog can discover and tag the information that\u2019s most relevant to the task, reducing guesswork and the risk of bad decisions.<\/p>\n<p>Copy sprawl is another challenge. In a perfect world, organizations would have only one \u201cgolden copy\u201d of their data, but in reality duplication is widespread. Sales managers want customer data to populate their customer relationship management systems. Marketing wants it for a lead nurturing program. The support team wants it to build their service history database.<\/p>\n<p><a href=\"https:\/\/www.emc.com\/collateral\/analyst-reports\/idc-copy-management-infobrief-interactive-version.pdf\" data-wpel-link=\"external\" rel=\"external noopener noreferrer ugc\">International Data Corp.<\/a> has estimated that up to 60% of storage capacity in a typical enterprise consists of these kinds of copies, but fewer than 20% of organizations have copy-management standards. 
Gartner analyst <a href=\"https:\/\/www.forbes.com\/sites\/petercohan\/2014\/06\/30\/actifio-is-eating-emcs-lunch-in-44-billion-market\/#58b7e7606303\" data-wpel-link=\"external\" rel=\"external noopener noreferrer ugc\">Dave Russell<\/a> estimates many companies keep between 30 and 40 copies of business data for purposes ranging from backups to regulatory compliance.<\/p>\n<p>As each group gets its own extract of production data, the costs and risks grow. Updates to one copy aren\u2019t reflected in the others, creating discontinuity. No one knows what the truth is, which makes analyzing data for critical business decisions a risky affair.<\/p>\n<p>An enterprise data catalog brings order out of this chaos by \u201cfingerprinting\u201d data and tagging backups and extracts so that there\u2019s never any confusion about which copies are valid. A catalog doesn\u2019t prevent copies from being made, but it can designate ownership, flag data that\u2019s been modified, and even specify rules about how those copies can be used.<\/p>\n<h2><strong>Stricter Privacy Rules Make Data Catalogs Even More Important<\/strong><\/h2>\n<p>The need for a data catalog will become even more pronounced as new <a href=\"https:\/\/www.smartdatacollective.com\/big-data-privacy-concerns-create-exodus-from-google\/\" data-wpel-link=\"internal\">privacy rules take effect<\/a> in Europe and elsewhere. These regulations place strict limits on how personal data may be used for purposes like profiling and segmentation. Information may need to be anonymized or deleted depending on the permissions that have been granted by the subject individual. This directly impacts the types of data science applications that can be used.<\/p>\n<p>For example, a marketing organization may want to target promotions at individual households. 
Residents who have given permission for such contact may receive customized offers, while those who haven\u2019t may receive only general promotions or may not be contacted at all. A data catalog can specify at a fine level of granularity what kinds of information may be used for targeting, helping the company avoid large fines. The data scientist is protected when legitimate usage is defined by the data catalog.<\/p>\n<p>Data catalogs set the ground rules for how data is stored and labeled across an organization. This is particularly useful for companies that have grown rapidly through mergers and acquisitions, a phenomenon that tends to stoke the data silo problem. Introducing a catalog gives those companies a chance to get a clean start with a unified view that applies to all data.<\/p>\n<p>When you do the math, the benefits of a data catalog quickly exceed the costs. For example, if a catalog can save 30% of a data scientist\u2019s time that\u2019s currently wasted on searching and prepping, that\u2019s roughly $39,000 per year at the average salary cited above. And that\u2019s not even taking into account the business benefits of having that <a href=\"https:\/\/www.smartdatacollective.com\/good-management-essential-age-of-big-data-ml\/\" data-wpel-link=\"internal\">person working in a satisfying, challenging job<\/a> doing what you hired him or her for.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The average salary of a data scientist in the U.S. is\u00a0nearly $130,000, a figure that\u2019s bound to climb as the\u00a0shortage of people with the requisite skills persists. 
With that kind of investment at\u00a0stake, any company would want to get the maximum value out of its skills investment,\u00a0but by most accounts, data scientists typically spend 80% [&hellip;]<\/p>\n","protected":false},"author":9476,"featured_media":27478,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_seopress_robots_primary_cat":"30","_seopress_titles_title":"Turbo-Charge Data Scientist Productivity with a Data Catalog","_seopress_titles_desc":"Data scientists at any company are highly skilled and cost a lot -- so you don't want them slogging through swamps of information that machine learning can take care of with data catalogues.","_seopress_robots_index":"","footnotes":""},"categories":[48,5,30],"tags":[252,1082,87,2691,222,937,954,2692,356],"class_list":{"0":"post-24944","1":"post","2":"type-post","3":"status-publish","4":"format-standard","5":"has-post-thumbnail","7":"category-big-data","8":"category-data-quality","9":"category-policy-and-governance","10":"tag-big-data","11":"tag-big-data-scientists","12":"tag-crowdsourcing","13":"tag-data-catalogs","14":"tag-data-quality","15":"tag-data-science","16":"tag-data-scientists","17":"tag-data-set","18":"tag-machine-learning"},"amp_enabled":true,"_links":{"self":[{"href":"https:\/\/www.smartdatacollective.com\/wp-json\/wp\/v2\/posts\/24944","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.smartdatacollective.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.smartdatacollective.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.smartdatacollective.com\/wp-json\/wp\/v2\/users\/9476"}],"replies":[{"embeddable":true,"href":"https:\/\/www.smartdatacollective.com\/wp-json\/wp\/v2\/comments?post=24944"}],"version-history":[{"count":12,"href":"https:\/\/www.smartdatacollective.com\/wp-json\/wp\/v2\/posts\/24944\/revisions"}],"predecessor-version":[{"id":27485,"href":"https:\/\/www.smartdat
acollective.com\/wp-json\/wp\/v2\/posts\/24944\/revisions\/27485"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.smartdatacollective.com\/wp-json\/wp\/v2\/media\/27478"}],"wp:attachment":[{"href":"https:\/\/www.smartdatacollective.com\/wp-json\/wp\/v2\/media?parent=24944"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.smartdatacollective.com\/wp-json\/wp\/v2\/categories?post=24944"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.smartdatacollective.com\/wp-json\/wp\/v2\/tags?post=24944"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}