Apache Hive & Apache Spark SQL

Hive is data warehouse software built on top of Hadoop that facilitates the management of large datasets. There are some important features to know:

  • It has tools to access data via SQL commands
  • It allows access to files stored in Apache HDFS or HBase
  • Query execution can be integrated with Apache Spark or Apache Tez
  • Hive’s SQL can be extended via User Defined Functions (UDFs), User Defined Aggregate Functions (UDAFs) and User Defined Table Functions (UDTFs); see the sketch after this list
  • Hive is designed for data warehousing tasks, not for online transaction processing workloads
  • Hive is designed to scale out horizontally across a Hadoop cluster, so capacity can grow simply by adding nodes
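
To illustrate the UDF extension point from the list above, here is a minimal sketch of a Hive UDF. It uses the classic org.apache.hadoop.hive.ql.exec.UDF API (newer Hive versions favor GenericUDF, but this form is the simplest to read), and the class and function names are hypothetical:

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Hypothetical UDF that reverses a string; Hive resolves evaluate() by reflection
public class ReverseUDF extends UDF {

    public Text evaluate(Text input) {
        if (input == null) {
            return null;
        }
        return new Text(new StringBuilder(input.toString()).reverse().toString());
    }
}

Once packaged in a JAR and added to the session, it could be registered and called from HiveQL with something like CREATE TEMPORARY FUNCTION reverse_str AS 'ReverseUDF'; followed by SELECT reverse_str(value) FROM some_table;.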

Now that we have an overview of Hive: if we need to execute queries over Big Data, an excellent integration is with Spark SQL. Since it supports HiveQL syntax, Hive UDFs and Hive SerDes, it allows us to access existing Hive warehouses and gain all the amazing features of Apache Spark, like the cost-based optimizer, code generation for queries, scalability, and fault tolerance for multi-hour queries using the Spark engine.

Spark SQL can use existing Hive metastores, serializers/deserializers (SerDes), and UDFs. For example:
import java.io.File

import org.apache.spark.sql.{SaveMode, SparkSession}

case class Record(key: Int, value: String)

// Location of Spark's Hive warehouse directory for managed databases and tables
val warehouseLocation = new File("spark-warehouse-dir").getAbsolutePath

// enableHiveSupport() connects the session to the Hive metastore, SerDes and UDFs
val spark = SparkSession
  .builder()
  .appName("Spark Hive Test")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._
import spark.sql

// Create the Hive table and append some sample rows so the query below returns data
sql("CREATE TABLE IF NOT EXISTS hive_table (key INT, value STRING) USING hive")
Seq(Record(1, "one"), Record(2, "two")).toDF().write.mode(SaveMode.Append).saveAsTable("hive_table")

sql("SELECT * FROM hive_table").show()

I haven’t tested an integration with Apache Tez yet, but it looks powerful for processing Directed Acyclic Graphs (DAGs), with performance similar to Apache Spark. An interesting comparison can be found at https://www.integrate.io/blog/apache-spark-vs-tez-comparison/

References

https://cwiki.apache.org/confluence/display/Hive/Home

https://spark.apache.org/sql/

https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html

What is Spring Aspect Oriented Programming?

I have to admit that AOP and all the concepts behind it are not easy to grasp at first. After reading some definitions of Spring AOP and looking at examples, I found a good and simple definition that I’d like to share:

“Spring AOP (Aspect-oriented programming) framework is used to modularize cross-cutting concerns in aspects. Put it simple, it’s just an interceptor to intercept some processes, for example, when a method is executed, Spring AOP can hijack the executing method, and add extra functionality before or after the method execution” https://mkyong.com/spring/spring-aop-examples-advice/.

The examples mentioned on many sites are related to logging and verification, and in most cases they are temporary or cross-cutting concerns: the same code needs to run in many places, and (temporary or not) it can be hooked in without changing the original code.

How can I use it?

Well, to start simple, we can create a custom annotation and then use AOP to understand how it works.

Dependencies

In this example, I’m using Spring Boot and the AOP starter, which pulls in the libraries required to work with aspects (the starter should already bring aspectjweaver transitively, so the explicit AspectJ dependencies below are listed mainly for visibility).

<parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>2.3.12.RELEASE</version>
</parent>

<dependencies>
    <dependency>
        <groupId>org.aspectj</groupId>
        <artifactId>aspectjrt</artifactId>
    </dependency>
    <dependency>
        <groupId>org.aspectj</groupId>
        <artifactId>aspectjweaver</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-aop</artifactId>
    </dependency>
</dependencies>

Custom Annotation

I’m creating an annotation that can only be applied to methods and that will be available to the JVM at runtime.

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Marker annotation: it has no elements, the aspect only reacts to its presence
@Target(ElementType.METHOD)
@Retention(RetentionPolicy.RUNTIME)
public @interface LogElapsedTime {

}

Example Service Component

Now, let’s create a component with a fake slow method and apply our annotation to it. Note that at this point we don’t have any aspect configured yet.

@Component
public class ExampleService {

    @LogElapsedTime
    public void simulateSlowMethod() throws InterruptedException {
        // Sleep a random amount of time (up to one second) to simulate a slow operation
        Thread.sleep((int) (Math.random() * 1000));
    }
}

Example Aspect Component

Let’s create another component, in this case also annotated with @Aspect. In this component we need to define the pointcut and the advice. I defined the advice as @Around, which means I’m including extra code both before and after the method execution. The advice argument is the pointcut, and it says to apply the advice whenever the @LogElapsedTime annotation is used.

The goal is to inject extra code that logs how long certain methods take to finish. As you can see, it’s very flexible: to include any method in this flow, all it needs is the annotation. (Keep in mind that Spring AOP is proxy-based, so the advice only runs when the method is called from outside the bean; self-invocation bypasses the proxy.)

@Aspect
@Component
@Slf4j
public class ExampleAspect {

    @Around("@annotation(com.sample.aop.LogElapsedTime)")
    public Object logElapsedTime(ProceedingJoinPoint joinPoint) throws Throwable {
        long startTime = System.currentTimeMillis();
        // proceed() executes the intercepted (annotated) method
        Object proceed = joinPoint.proceed();
        log.info(String.format("%s took %d ms", joinPoint.getSignature(), System.currentTimeMillis() - startTime));
        return proceed;
    }

}

Testing

The happy part: testing it and seeing how it works!

@RunWith(SpringJUnit4ClassRunner.class)
@SpringBootTest
public class LogElapsedTimeTest {

    @Autowired
    private ExampleService service;

    @Test
    public void validateLogElapsedTimeAnnotation() throws InterruptedException {
        service.simulateSlowMethod();
    }

}
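
When the test runs, the aspect should log a line similar to the following (the exact elapsed time will vary with the random sleep):

void com.sample.aop.ExampleService.simulateSlowMethod() took 512 ms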

The complete code is available in my GitHub repository.

I hope this helps you understand it better.

References:

[1] https://docs.spring.io/spring-framework/docs/2.5.x/reference/aop.html

[2] https://mkyong.com/spring/spring-aop-examples-advice/

[3] https://www.baeldung.com/spring-aop-annotation

Spring Bean Scopes

With a simple Spring annotation you get the flexibility to choose the scope of a bean definition, that is, how the container creates and shares the objects built from that definition.

The scope of a bean defines the lifecycle and visibility of that bean inside a given context. In the case of the Spring Framework we have the following types:

  • Singleton
  • Prototype
  • Request
  • Session
  • Application
  • Websocket

Singleton

It’s the default scope: the container creates a single instance of the bean, and every request for that bean returns the same cached object. The singleton is unique per ApplicationContext, not per JVM. (For a comparison with the prototype scope, see the sketch after the Prototype snippet below.)

@Bean
@Scope(value = ConfigurableBeanFactory.SCOPE_SINGLETON)

Prototype

In this case the container creates a new object each time the bean is requested. It’s mostly used when we need to maintain state, since each consumer gets its own instance.

@Bean
@Scope(value = ConfigurableBeanFactory.SCOPE_PROTOTYPE)
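
To make the difference between singleton and prototype concrete, here is a minimal runnable sketch (the configuration class and bean names are hypothetical):

import java.util.ArrayList;

import org.springframework.beans.factory.config.ConfigurableBeanFactory;
import org.springframework.context.annotation.AnnotationConfigApplicationContext;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.context.annotation.Scope;

@Configuration
class ScopesConfig {

    // One cached instance shared by the whole ApplicationContext
    @Bean
    @Scope(value = ConfigurableBeanFactory.SCOPE_SINGLETON)
    public StringBuilder singletonBean() {
        return new StringBuilder();
    }

    // A fresh instance is created on every getBean() call
    @Bean
    @Scope(value = ConfigurableBeanFactory.SCOPE_PROTOTYPE)
    public ArrayList<String> prototypeBean() {
        return new ArrayList<>();
    }
}

public class ScopesDemo {
    public static void main(String[] args) {
        try (AnnotationConfigApplicationContext ctx =
                new AnnotationConfigApplicationContext(ScopesConfig.class)) {
            System.out.println(ctx.getBean("singletonBean") == ctx.getBean("singletonBean")); // true
            System.out.println(ctx.getBean("prototypeBean") == ctx.getBean("prototypeBean")); // false
        }
    }
}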

Request

In this case a new bean is created for each HTTP request. It’s mostly used in authentication and authorization contexts, where state must not leak between requests; a sketch of how the proxy makes this work follows the snippet below.

@Bean
@Scope(value = WebApplicationContext.SCOPE_REQUEST, proxyMode = ScopedProxyMode.TARGET_CLASS)
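
The proxyMode attribute matters here: it lets the container inject a request-scoped bean into singleton beans through a proxy that resolves the real instance for the current request. A minimal sketch, with hypothetical class names:

import java.util.UUID;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.annotation.Scope;
import org.springframework.context.annotation.ScopedProxyMode;
import org.springframework.stereotype.Component;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.web.context.WebApplicationContext;

@Component
@Scope(value = WebApplicationContext.SCOPE_REQUEST, proxyMode = ScopedProxyMode.TARGET_CLASS)
class RequestAudit {

    // The bean itself is per-request, so a new id is generated for every HTTP request
    private final String id = UUID.randomUUID().toString();

    public String getId() {
        return id;
    }
}

@RestController
class AuditController {

    // The injected reference is a CGLIB proxy; calls are routed to the current request's instance
    @Autowired
    private RequestAudit audit;

    @GetMapping("/audit")
    public String currentId() {
        return audit.getId(); // returns a different id on every request
    }
}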

Session

In this case only one bean will be created for each HTTP session.

@Bean
@Scope(value = WebApplicationContext.SCOPE_SESSION, proxyMode = ScopedProxyMode.TARGET_CLASS)

Application

In this case a single bean will be created for the entire lifecycle of the ServletContext.

@Bean
@Scope(value = WebApplicationContext.SCOPE_APPLICATION, proxyMode = ScopedProxyMode.TARGET_CLASS)

Websocket

In this case the bean stays alive for as long as the WebSocket session remains active.

@Bean
@Scope(scopeName = "websocket", proxyMode = ScopedProxyMode.TARGET_CLASS)

References:

https://www.baeldung.com/spring-bean-scopes

https://docs.spring.io/spring-framework/docs/3.0.0.M3/reference/html/ch04s04.html