[Tutorial]How to use spark-sql-perf

The situation is that: there’s one of our work need to be benchmarked. It’s still the SparkSQL Server that I designed and developed but new sharing technique was implemented and integrated into it. So, we decided to use the spark-sql-perf of Databricks to benchmark our work, and the benchmarked queries we used is TPCDS.

There’s one thing that could take you time to use this repository. Since it’s still on developing so the README.md is also under developement (at they said) and it’s not up-to-date. Another thing that could annoy you is that they don’t keep the older version of this repo, so there’s  only one branch which is master. The master is always updated to catch up the growing speed of Apache Spark, so it could be a problem for someone doesn’t use the latest version of Spark (which I think most of the users do).

In this post, I would like to write a tutorial in step-by-step on how we can use the spark-sql-perf. The version of Spark running on our cluster is 1.6.1, so, if someone uses different (and older) version of Spark, the only thing you can do is trying to find the spark-sql-perf for that version and digging into its source codes to use it.

So here are the steps:

  1. Get the spark-sql-perf for Apache Spark 1.6.1. You can get it here: link. Then, compile it to get the the jar file.
  2. Get the TPCDS data generator. You can get it here: link.
  3. Generate the data from TPCDS.
    • First of all, you need to build the dsdgen inside the tools folder of TCPDS package. On Linux, remember to install the essential-builds or you can get the errors when compiling.
    • Generate the data using dsdgen. You can find the usage by typing dsdgen –help. For example, I would like to generate 10GB of TPCDS data, with the suffix is csv, and store them at tmp folder. So, the command is:
      dsdgen -scale 10 -suffix .csv -dir /tmp

      After that, you can push the data to the storage of the cluster.

    • The spark-sql-perf can also help you to generate the data from TPCDS, the data will be stored directly to the cluster’s storage, but the file format would be json or parquet. Below is the code snippet to generate the data using this repo:
      val tables = new Tables(sqlContext, args{0}, args{1}.trim.toInt)
      tables.genData(args{2}, "parquet", true, false, false, false, false) 
    • Let me explain a bit about the codes
      • args{0}: path to dsdgen. This file must be put in the same path on all slave machines.
      • args{1}: file size (in GB).
      • args{2}: output path of generated data.
      • others paremeters of genData function can be found in the source code.

    How to run it? It’s exactly the same way when you submit a Spark application using spark-submit command, the difference is you need to put option –jars and specify spark-sql-perf jar file location to let Spark know you use it. 

    You can also run the benchmark inside this application, but at each time you run, it would reproduce the data, so I prefer to create the data once and use it many times later on.

  4. Run the benchmark
    • After creating the data, thing is more simple: load the data and run the benchmark.
    • Since the data generator (of TPCDS) creates 24 files, which will be used as 24 tables. So, just load them and put them into 24 tables, the name of the tables are also the names of the files. The code below demonstrates what I’m saying here, args{0} is the input path to those files.
      val tableNames = Array("call_center", "catalog_page",      "catalog_returns", "catalog_sales",      "customer", "customer_address",      "customer_demographics", "date_dim",      "household_demographics", "income_band",      "inventory", "item", "promotion",      "reason", "ship_mode",      "store", "store_returns",      "store_sales", "time_dim",      "warehouse", "web_page",      "web_returns", "web_sales", "web_site")
      for(i <- 0 to tableNames.length - 1) {
      sqlContext.read.parquet(args{0} + "/" + tableNames{i}).registerTempTable(tableNames{i})
    • After that, choosing the query suite to benchmark, there are couple of suites that you can find in the source code of spark-sql-perf, the example below is for TPCDS 1.4:
      val tpcds = new TPCDS (sqlContext = sqlContext, resultsLocation = “output-of-result”)
      val experiment = tpcds.runExperiment(tpcds.tpcds1_4Queries, iterations = 1)
    • So, the result will be stored at the resultsLocation you specify, it would be in json format, after completing the benchmark, you can get it out from the cluster’s storage and view it.
    • You need to have the “waitForFinish” function or it won’t produce any result since it  would go to the end of the application and exit. Just put a number large enough, when all the queries are executed and finished, the application also terminates.

That’s all for the tutorial. There could be another ways to use this repository, so that would be great if you can leave a comment about how you use it. I hope it would be useful for someone who wants to use the spark-sql-perf and also hope that databricks will provide better usage for this repository.

2 thoughts on “[Tutorial]How to use spark-sql-perf

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s