Apache Spark component

The Apache Spark component is available starting from Camel 2.17.

 

This documentation page covers the Apache Spark component for Apache Camel. The main purpose of the Spark integration with Camel is to provide a bridge between Camel connectors and Spark tasks. In particular, the Camel connector provides a way to route messages from various transports, dynamically choose a task to execute, use the incoming message as input data for that task, and finally deliver the results of the execution back to the Camel pipeline.

Supported architectural styles

The Spark component can be used as a driver application deployed into an application server (or executed as a fat jar).

The Spark component can also be submitted as a job directly into the Spark cluster.

While the Spark component is primarily designed to work as a long-running job serving as a bridge between the Spark cluster and the other endpoints, you can also use it as a fire-once short job.

Running Spark in OSGi servers

Currently the Spark component doesn't support execution in an OSGi container. Spark has been designed to be executed as a fat jar, usually submitted as a job to a cluster. For those reasons, running Spark in an OSGi server is challenging at best and is not supported by Camel either.

URI format

Currently the Spark component supports only producers: it is intended to invoke a Spark job and return the results. You can call an RDD, DataFrame, or Hive SQL job.

 

Spark URI format
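
As a rough sketch inferred from the three job types named above, the endpoint URI starts with the spark scheme followed by the job type. The exact lower-case spelling of the dataframe token is an assumption here, and this is not meant as a complete grammar:

```text
spark:{rdd|dataframe|hive}
```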

 

RDD jobs 

 

To invoke an RDD job, use the following URI:
Spark RDD producer
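
For illustration, an RDD endpoint URI could look like the line below. The bean names myRdd and countLinesContaining are hypothetical; the # prefix is Camel's usual notation for referring to a bean in the registry.

```text
spark:rdd?rdd=#myRdd&rddCallback=#countLinesContaining
```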

Here the rdd option refers to the name of an RDD instance (a subclass of org.apache.spark.api.java.JavaRDDLike) from the Camel registry, while rddCallback refers to an implementation of the org.apache.camel.component.spark.RddCallback interface (also from the registry). The RDD callback provides a single method used to apply incoming messages against the given RDD. The result of the callback computation is saved as the body of the exchange.

Spark RDD callback
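
The callback contract described above (one method that receives the RDD plus the incoming payloads and returns the value that becomes the exchange body) can be pictured with a dependency-free sketch. Everything in it is a stand-in: LineCallbackSketch is a hypothetical interface that only mirrors the idea, a plain List<String> takes the place of a JavaRDDLike, and the varargs payload parameter is an assumption about the shape of the real method.

```java
import java.util.List;

// Hypothetical stand-in for the callback contract. The real interface is
// org.apache.camel.component.spark.RddCallback and works on a JavaRDDLike,
// not on a List.
interface LineCallbackSketch<T> {
    // Receives the data set and the incoming message payloads,
    // returns the value that would become the exchange body.
    T onRdd(List<String> rdd, Object... payloads);
}

// Counts the lines that contain the pattern sent as the first payload.
class CountLinesContaining implements LineCallbackSketch<Long> {
    @Override
    public Long onRdd(List<String> rdd, Object... payloads) {
        String pattern = (String) payloads[0];
        return rdd.stream().filter(line -> line.contains(pattern)).count();
    }
}
```

Calling new CountLinesContaining().onRdd(lines, "job input") plays the role of the component applying an incoming message to the registered RDD.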

The following snippet demonstrates how to send a message as input to the job and return the results:

Calling spark job

The RDD callback for the snippet above, registered as a Spring bean, could look as follows:

Spark RDD callback

The RDD definition in Spring could look as follows:

Spark RDD definition

 

RDD jobs options

Option | Description | Default value
rdd | RDD instance (subclass of org.apache.spark.api.java.JavaRDDLike). | null
rddCallback | Instance of the org.apache.camel.component.spark.RddCallback interface. | null

Void RDD callbacks

If your RDD callback doesn't return any value back to the Camel pipeline, you can either return a null value or use the VoidRddCallback base class:

Spark void RDD callback

Converting RDD callbacks

If you know what type of input data will be sent to the RDD callback, you can use ConvertingRddCallback and let Camel automatically convert incoming messages before passing them to the callback:

Spark converting RDD callback

Annotated RDD callbacks

Probably the easiest way to work with RDD callbacks is to provide a class with a method marked with the @RddCallback annotation:

Annotated RDD callback definition

If you pass a CamelContext to the annotated RDD callback factory method, the created callback will be able to convert incoming payloads to match the parameters of the annotated method:

Body conversions for annotated RDD callbacks
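
The idea of converting payloads to match the parameters of an annotated method can be modelled without Camel or Spark on the classpath. The sketch below is not the component's implementation: CallbackSketch is a hypothetical marker annotation standing in for @RddCallback, a List<String> stands in for the RDD, and the convert helper is a deliberately tiny substitute for Camel's type conversion.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.List;

// Hypothetical marker annotation standing in for @RddCallback.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface CallbackSketch { }

// A plain class whose annotated method takes the data set plus typed parameters.
class LineLengthJob {
    @CallbackSketch
    public long countLongerThan(List<String> rdd, int minLength) {
        return rdd.stream().filter(line -> line.length() > minLength).count();
    }
}

class AnnotatedCallbackSketch {
    // Finds the annotated method, converts each payload to the declared
    // parameter type, and invokes the method with the data set first.
    static Object invoke(Object target, List<String> rdd, Object... payloads) {
        for (Method method : target.getClass().getDeclaredMethods()) {
            if (!method.isAnnotationPresent(CallbackSketch.class)) {
                continue;
            }
            Class<?>[] types = method.getParameterTypes();
            Object[] args = new Object[types.length];
            args[0] = rdd;
            for (int i = 1; i < types.length; i++) {
                args[i] = convert(payloads[i - 1], types[i]);
            }
            try {
                return method.invoke(target, args);
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(e);
            }
        }
        throw new IllegalArgumentException("no annotated method found");
    }

    // Minimal stand-in for a type converter: handles int and String only.
    static Object convert(Object value, Class<?> type) {
        if (type == int.class || type == Integer.class) {
            return Integer.valueOf(value.toString());
        }
        if (type == String.class) {
            return value.toString();
        }
        return value;
    }
}
```

Here a payload that arrives as the string "3" is converted to the int expected by countLongerThan before the method is invoked, which is the kind of body conversion the paragraph above refers to.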

 

DataFrame jobs

 

Instead of working with RDDs, the Spark component can also work with DataFrames.

To invoke a DataFrame job, use the following URI:
Spark DataFrame producer
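
For illustration, a DataFrame endpoint URI could look like the line below. The bean names cars and findCarWithModel are hypothetical, and the lower-case dataframe job type is an assumption; the option names dataFrame and dataFrameCallback are the ones listed in the options table of this section.

```text
spark:dataframe?dataFrame=#cars&dataFrameCallback=#findCarWithModel
```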

Here the dataFrame option refers to the name of a DataFrame instance (an instance of org.apache.spark.sql.DataFrame) from the Camel registry, while dataFrameCallback refers to an implementation of the org.apache.camel.component.spark.DataFrameCallback interface (also from the registry). The DataFrame callback provides a single method used to apply incoming messages against the given DataFrame. The result of the callback computation is saved as the body of the exchange.

Spark DataFrame callback

The following snippet demonstrates how to send a message as input to a job and return the results:

Calling spark job

The DataFrame callback for the snippet above, registered as a Spring bean, could look as follows:

Spark DataFrame callback

The DataFrame definition in Spring could look as follows:

Spark DataFrame definition

 

DataFrame jobs options

Option | Description | Default value
dataFrame | DataFrame instance (instance of org.apache.spark.sql.DataFrame). | null
dataFrameCallback | Instance of the org.apache.camel.component.spark.DataFrameCallback interface. | null

 

Hive jobs

Instead of working with RDDs or DataFrames, the Spark component can also receive Hive SQL queries as payloads. To send a Hive query to the Spark component, use the following URI:

Spark Hive producer
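
Since the query itself travels in the message body, the Hive endpoint URI presumably needs no mandatory options. A minimal form, assuming hive is the job-type token, would be:

```text
spark:hive
```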

The following snippet demonstrates how to send a message as input to a job and return the results:

Calling spark job

The table we want to query should be registered in a HiveContext before we query it. For example, in Spring such a registration could look as follows:

Spark Hive table registration

 

Hive jobs options

Option | Description | Default value
collect | Indicates if results should be collected (as a list of org.apache.spark.sql.Row instances) or if count() should be called against those. | true
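
As an illustration of the collect option, the URI below (a sketch, assuming the hive job-type token) switches it off. Going by the option description above, the result would then be the value of count() rather than a list of Row instances.

```text
spark:hive?collect=false
```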

 

