Funny Anchorman jokes aside, the picture is here because so many people talk about Big Data that the term has become a little overused. So why am I talking about it? Because technology corporations still need engineers who can solve data aggregation problems that keep growing. That means sourcers and recruiters need to understand what to look for in an engineer’s experience besides the words “Big Data” (it really makes me cringe when non-technical people throw that phrase at me and assume they have instant street cred... or tech cred, as the case may be).


So let’s talk about what data services or data aggregation is, and simplify it. Companies like Google, Amazon, Facebook, Yahoo, and Microsoft are dealing with an exponentially growing flood of data that they need to make sense of, analyze, report on, and in some cases pipeline back into the system. They do this to understand customer statistics, user data, search data, advertising data, log data, etc. Most of these companies follow a similar technology stack/recipe: Hadoop/Unstructured Databases, Java Programming, Customized Software Tools, On-the-fly data replication, and streamlined ETL processes. So let’s talk about the technology:

Hadoop / Unstructured Databases – Hadoop is an open source framework used to create distributed data applications. It is typically used in high-availability, large-scale applications like search engines, highly visible ecommerce applications, and mission-critical distributed apps. Hadoop can handle data sets that reach into the petabyte range, which traditional enterprise databases struggle to handle. Hadoop can also work with unstructured data.

Java Programming – Hadoop is written in Java and is very much a part of the open source software community, so many versions have been created. For corporations deciding between an enterprise database solution that costs $500,000 and open-source Hadoop plus a good Java programmer for complete customization, you can guess which path they are taking.

Customized ETL tools – There are very expensive, industry-standard solutions like Informatica, Oracle Warehouse Builder, and SSIS (SQL Server Integration Services), but a good programmer can also write your own ETL tools in almost any language, most likely Perl, PHP, or Python.
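To make “write your own ETL tool” a little more concrete, here is a minimal sketch in Python of the three steps (extract, transform, load). Everything in it is an assumption for illustration: the sample log data, the `time_per_user` table, and the function names are hypothetical, and a real pipeline would read from files or services and load into a production data store rather than an in-memory SQLite database.

```python
import csv
import io
import sqlite3

# Hypothetical raw log export; in practice this would come from files or an API.
RAW_CSV = """user_id,page,seconds
1,/home,12
2,/search,30
1,/search,8
"""


def extract(text):
    """Extract: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: total the seconds spent per user."""
    totals = {}
    for row in rows:
        user = int(row["user_id"])
        totals[user] = totals.get(user, 0) + int(row["seconds"])
    return totals


def load(totals, conn):
    """Load: write the per-user totals into a reporting table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS time_per_user "
        "(user_id INTEGER PRIMARY KEY, seconds INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO time_per_user VALUES (?, ?)",
        sorted(totals.items()),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(transform(extract(RAW_CSV)), conn)
result = conn.execute(
    "SELECT user_id, seconds FROM time_per_user ORDER BY user_id"
).fetchall()
print(result)
```

When you read a resume, this is the kind of work hiding behind a line like “built custom ETL scripts in Python”: pulling raw data out, reshaping it, and loading it somewhere it can be reported on.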

Database replication – HDFS (the Hadoop Distributed File System) replicates data in triplicate by default, which protects it against hardware failure, such as a hard drive failing with a critical database residing on it. It does not protect against erasure, so you still need data security standards in place.
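To see why keeping three copies matters, here is a toy simulation in Python. It is not how HDFS actually places blocks (real placement is rack-aware and handled by the cluster itself); the node names, the round-robin placement, and the function names are all made up for illustration.

```python
REPLICATION = 3  # three copies of every block, as described above


def place_blocks(num_blocks, nodes, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin.

    Simplified placement for illustration only.
    """
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + i) % len(nodes)] for i in range(replication)
        ]
    return placement


def surviving_copies(placement, failed):
    """Return, per block, the nodes that still hold a copy after failures."""
    return {
        block: [node for node in holders if node not in failed]
        for block, holders in placement.items()
    }


nodes = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical cluster
placement = place_blocks(10, nodes)

# Simulate one machine dying and count the copies left for each block.
after_failure = surviving_copies(placement, failed={"node2"})
fewest_copies = min(len(copies) for copies in after_failure.values())
print(fewest_copies)
```

Even the worst-hit block still has copies on other machines after one node fails. The trade-off is storage: three copies of everything means roughly three times the raw disk space.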

So to source candidates for this type of technology, you could search for the words “Big Data”, but you would miss out on a ton of qualified candidates. Instead, search for what we’re really talking about and use as many SIMILAR variations as possible:

Hadoop / Unstructured Databases – (hadoop OR vertica OR scala OR mapreduce OR hbase OR hive)

Big Data – (“big data” OR “data pipeline” OR petabytes OR pbs OR “data aggregation” OR “data services” OR “data integration”) etl
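If you build strings like these often, a few lines of Python can assemble them from keyword lists so you can swap terms in and out without mistyping a parenthesis. This is just a sketch: the `or_group` helper is a hypothetical name of my own, not part of any sourcing tool, and it uses straight quotes around phrases.

```python
def or_group(terms):
    """Join terms into a parenthesized OR group, quoting multi-word phrases."""
    quoted = [f'"{term}"' if " " in term else term for term in terms]
    return "(" + " OR ".join(quoted) + ")"


hadoop_terms = ["hadoop", "vertica", "scala", "mapreduce", "hbase", "hive"]
big_data_terms = [
    "big data",
    "data pipeline",
    "petabytes",
    "pbs",
    "data aggregation",
    "data services",
    "data integration",
]

# Combine the groups with a required keyword in between.
search = " ".join([or_group(big_data_terms), "etl", or_group(hadoop_terms)])
print(search)
```

Adding a new variation is then a one-word change to a list instead of an edit in the middle of a long string.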

There are a zillion ways to search for people like this, but this time let’s focus on candidates who are writing up studies or use cases of how they aggregated data in a Hadoop data system. Let’s try the blog angle in Google with this string:

(inurl:blog OR intitle:blog OR (“about me” OR “about the author” OR “view my complete”)) java (“big data” OR “data pipeline” OR petabytes OR pbs OR “data aggregation” OR “data services” OR “data integration”) etl (hadoop OR vertica OR mapreduce OR hbase OR hive)

The results are people blogging about the technology:


At this point, you click through and read the blogs, get an idea of the projects that are discussed, and then contact the individuals. There is usually a way to contact people personally through their blog sites. Or double back and check the person’s name in your LinkedIn account and ATS.

– Mark Tortorici


