Beta connectors for both Hadoop Hive and PureData for analytics (Netezza) are now available on Conductor.

If you’re interested in moving data in or out of Hive or Netezza, this overview of their setup, the benefits of both, and how their Conductor connectors work should help you make a decision.

Hadoop Hive

If you are familiar with SQL, Hive will feel like just another playground, and quite a big one. Hive was developed to harness the power of the Hadoop platform and its MapReduce operations. It can process large amounts of data because it is architected to run on clusters of computers built from commodity hardware. These machines are cheap and not individually built for speed, but they get the job done comparatively quickly by working in parallel in large numbers. As a result, cost grows roughly linearly with capacity, whereas the cost of scaling a traditional database tends, in most cases, to grow much faster than that.

Queries run on Hive have quite high latency. This makes Hive better suited to applications that don’t depend on short response times, such as data warehousing, where data is mined for insights and reports. If you need low-latency SQL queries on Hadoop, then Impala, Stinger, Drill and Spark are all possible solutions.

We set up Hadoop and Hive in our development environments using the Hadoop distribution from Hortonworks, a company that ships Hadoop in a package friendly to beginner Hadoop developers, and developed our Hive connector against it.

With Conductor, getting data into Hive is done in three steps:

  • First, we upload your transformed data to your Agent.
  • Once the Agent has your data, it writes the data to the Hadoop Distributed File System (HDFS), in the /tmp/ folder, via the HDFS REST API (WebHDFS).
  • When your data is in HDFS, we tell your Agent to run a command that loads it from HDFS into Hive over ODBC, a connection method that allows communication with the Hive database. This is a single Hive SQL command, which makes the migration much faster: because Hive queries have high latency, running multiple bulk insert queries did not give satisfactory speeds, but running one big query did.
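The last two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Conductor’s actual implementation: the host name, port, user, file name and table name are all hypothetical, and the helper functions are made up for this example. It builds the WebHDFS URL used to start a file upload (the `op=CREATE` operation) and the single HiveQL `LOAD DATA INPATH` statement that moves the staged file into a table.

```python
from urllib.parse import quote, urlencode


def webhdfs_create_url(namenode: str, hdfs_path: str, user: str) -> str:
    """Build the WebHDFS URL for the initial PUT of a file upload (op=CREATE).

    The NameNode answers this PUT with a redirect to a DataNode, and the
    file contents are then sent to that redirect location.
    """
    query = urlencode({"op": "CREATE", "user.name": user, "overwrite": "true"})
    return f"http://{namenode}/webhdfs/v1{quote(hdfs_path)}?{query}"


def hive_load_statement(hdfs_path: str, table: str) -> str:
    """Build the one HiveQL statement that loads a staged HDFS file into a table."""
    escaped = hdfs_path.replace("'", "\\'")  # keep quotes in the path from ending the literal
    return f"LOAD DATA INPATH '{escaped}' INTO TABLE {table}"


# Hypothetical values, for illustration only.
staged_file = "/tmp/orders.csv"
print(webhdfs_create_url("namenode.example.com:50070", staged_file, "conductor"))
print(hive_load_statement(staged_file, "orders"))
```

The statement returned by `hive_load_statement` is what would be sent over the ODBC connection. Because it is one statement rather than many inserts, Hive’s per-query latency is paid only once.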

This three-step process may seem a little convoluted, but it is completely transparent to anyone using Conductor. We do all the dirty work behind the scenes, so you just have to choose the source and the destination, then sit back and watch it all happen.

Netezza

Netezza is another relational database, but in contrast to Hadoop, it’s a fridge-sized appliance built from high-end hardware that is tuned for database performance in almost every way.

During development, we set up a Netezza emulator on a virtual machine and developed our connector against it. Getting data into Netezza takes two main steps, much like the Hive process:

  • First, we upload your transformed data to your Agent.
  • Once your Agent has your data, it opens a connection to the Netezza appliance via ODBC or OLE DB and runs a load command that brings the entire contents of the data into Netezza in one go.
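The post does not say which load command Conductor runs, so the following is only a sketch of one common way to bulk-load Netezza from a client machine: an `INSERT ... SELECT` from a transient external table, with `REMOTESOURCE 'ODBC'` telling Netezza that the file lives on the ODBC client side. The helper function, table name and file path are hypothetical.

```python
def netezza_load_statement(table: str, local_path: str, delimiter: str = ",") -> str:
    """Build one statement that loads a client-side delimited file into a table.

    Assumes the file's columns line up with the target table's columns.
    """
    return (
        f"INSERT INTO {table} "
        f"SELECT * FROM EXTERNAL '{local_path}' "
        f"USING (DELIMITER '{delimiter}' REMOTESOURCE 'ODBC')"
    )


# Hypothetical values, for illustration only.
print(netezza_load_statement("orders", "/data/orders.csv"))
```

As with Hive, the whole file goes in through a single statement instead of a series of bulk inserts.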

We’ve chosen to do it this way because running multiple bulk inserts is very slow compared to the load method. Counting only the actual loading time, several hundred thousand rows loaded in a matter of seconds, and that was on an emulator rather than a real Netezza appliance.


2017-05-19T09:25:48+00:00
