Bulk Loading data into JanusGraph — Part 1

Bulk Load into JanusGraph using Apache Spark

Nitin Poddar
Jul 9, 2020

Last month, I worked on a POC project to check the feasibility of switching our Identity Resolution solution from an Apache Hive based, relational-style solution to an open-source JanusGraph DB solution.

Since JanusGraph is an open-source platform with limited documentation available, I hit many roadblocks and ended up spending weeks getting an ideal setup working. Hence this post, in the hope that it saves someone else the same trouble.

JanusGraph, GraphDB, Bulk Load

Prerequisites:

I am assuming that you have some familiarity with graph databases and have tried your hand at a JanusGraph setup. If you are absolutely new to JanusGraph, consider reading the Getting Started document first.

This is going to be a two-part post. In this post we will set up our JanusGraph server and create the schema. In the next post, we will look at some of the configuration options we chose and the Apache Spark code that performs the bulk load.

Environment

Our environment is AWS-bound, and we used a relatively small infrastructure setup for the POC project.

Starting JanusGraph Server

Once your EC2 instances are ready, the next step is to configure and start the JanusGraph server. We will configure the server to use ConfiguredGraphFactory.

Follow the steps below to configure the JanusGraph server with ConfiguredGraphFactory and create a schema with composite indexes. The process can be confusing and the sequence of operations matters, so follow along carefully.

1. In order to use ConfiguredGraphFactory, modify the conf/janusgraph-cql-configurationgraph.properties file as below:

gremlin.graph=org.janusgraph.core.ConfiguredGraphFactory
storage.backend=cql
graph.graphname=ConfigurationManagementGraph
storage.hostname=<comma separated hostname>
cache.db-cache = true
cache.db-cache-clean-wait = 20
cache.db-cache-time = 180000
cache.db-cache-size = 0.5
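
A typo in this file tends to show up only as a confusing error at server start, so it can help to sanity-check it first. Below is a minimal sketch in Python, assuming the file uses plain key=value lines like the ones above. The function names and the choice of which keys to check are my own for illustration; they are not part of JanusGraph.

```python
# Keys whose values must match exactly for ConfiguredGraphFactory to be used
# (taken from the properties shown in step 1).
REQUIRED_KEYS = {
    "gremlin.graph": "org.janusgraph.core.ConfiguredGraphFactory",
    "graph.graphname": "ConfigurationManagementGraph",
    "storage.backend": "cql",
}


def parse_properties(text):
    """Parse simple key=value lines, ignoring blank lines and # / ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_configuration_graph(props):
    """Return a list of problems; an empty list means the file looks right."""
    problems = []
    for key, expected in REQUIRED_KEYS.items():
        actual = props.get(key)
        if actual != expected:
            problems.append(f"{key}: expected {expected!r}, found {actual!r}")
    hosts = props.get("storage.hostname", "")
    if not hosts or "<" in hosts:
        problems.append("storage.hostname: replace the placeholder with real hosts")
    return problems
```

Read the properties file into a string, pass it through parse_properties, and print whatever check_configuration_graph returns before starting the server.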

2. Modify the conf/gremlin-server/gremlin-server-configuration.yaml file and edit the following lines: point the graphs entry at the properties file from step 1, and leave the files list of ScriptFileGremlinPlugin empty for now.

graphs: {
  ConfigurationManagementGraph: conf/janusgraph-cql-configurationgraph.properties
}
scriptEngines: {
  gremlin-groovy: {
    plugins: {
      ....
      ....
      org.apache.tinkerpop.gremlin.jsr223.ScriptFileGremlinPlugin: {files: []}}}}

3. Now, start the JanusGraph server using the below command. Make sure to check the logs and confirm there are no errors.

$ bin/gremlin-server.sh conf/gremlin-server/gremlin-server-configuration.yaml

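Gremlin Server's default heap size is very modest. It is recommended to raise it by setting the desired heap size with -Xms and -Xmx in the JVM options of the startup script. Note that bin/gremlin.sh launches the console, while the server is started by bin/gremlin-server.sh, so the server script is the one whose heap settings affect the server.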

4. Once the server is up and running, log in to the Gremlin console to configure a graph.

$ bin/gremlin.sh

\,,,/
(o o)
-----oOOo-(3)-oOOo-----
gremlin> :remote connect tinkerpop.server conf/remote.yaml
==>Configured localhost/127.0.0.1:8182
gremlin> :remote console
==>All scripts will now be sent to Gremlin Server - [localhost/127.0.0.1:8182] - type ':remote console' to return to local mode

After the console is connected to the server, all commands will be executed against the JanusGraph server. Run the following commands to create a graph configuration with some custom properties through ConfiguredGraphFactory.

map = new HashMap();
map.put("graph.graphname","my-demo-keyspace")
map.put("ids.block-size",15000000)
map.put("query.batch",true)
map.put("storage.backend","cql")
map.put("storage.cql.read-consistency-level","LOCAL_ONE")
map.put("storage.cql.replication-factor",3)
map.put("storage.cql.write-consistency-level","LOCAL_ONE")
map.put("storage.hostname","xx.xx.xx.xx")
map.put("cluster.max-partitions",6)
map.put("storage.buffer-size", 2048)
map.put("ids.authority.wait-time", 1000)
ConfiguredGraphFactory.createConfiguration(new MapConfiguration(map))

This stores a configuration for a graph named my-demo-keyspace. The graph itself is created the first time it is opened through ConfiguredGraphFactory.
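
If you would rather keep these settings under version control than retype them in the console, the same commands can be generated from a dictionary. This is only a sketch under my own assumptions: the helper names are hypothetical, and the output is just text to paste into the Gremlin console. It does not talk to JanusGraph itself.

```python
def groovy_literal(value):
    """Render a Python value as a Groovy literal for use in map.put(...)."""
    # bool must be tested before int, because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    # Escape backslashes, double quotes and $ (Groovy interpolates $ in "...").
    escaped = (
        str(value).replace("\\", "\\\\").replace('"', '\\"').replace("$", "\\$")
    )
    return f'"{escaped}"'


def render_create_configuration(settings):
    """Build the console script that registers a graph configuration."""
    lines = ["map = new HashMap()"]
    for key, value in settings.items():
        lines.append(f'map.put("{key}", {groovy_literal(value)})')
    lines.append(
        "ConfiguredGraphFactory.createConfiguration(new MapConfiguration(map))"
    )
    return "\n".join(lines)


if __name__ == "__main__":
    # A subset of the settings used in this post.
    print(render_create_configuration({
        "graph.graphname": "my-demo-keyspace",
        "ids.block-size": 15000000,
        "query.batch": True,
        "storage.backend": "cql",
    }))
```

The one detail worth getting right is the value types: numbers and booleans are emitted unquoted so that they reach JanusGraph as numbers and booleans, not strings.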

5. Next, we will create the schema and indexes. Run the following commands in the same Gremlin console. Although JanusGraph allows you to insert data without defining a schema, for a bulk load it is recommended to create the schema before loading any data.

// open the graph configured in step 4, then open its management API
g1 = ConfiguredGraphFactory.open("my-demo-keyspace")
mgmt = g1.openManagement()
// create vertex labels for cookies, mobile ids and emails
cookie = mgmt.makeVertexLabel("COOKIE").make()
mobile = mgmt.makeVertexLabel("MOBILE").make()
email = mgmt.makeVertexLabel("EMAIL").make()
// create an edge label for relationships
rel = mgmt.makeEdgeLabel("REL").make()
// create the vertex properties: id value, id type, timestamp
idvalue = mgmt.makePropertyKey("idValue").dataType(String.class).make()
idtype = mgmt.makePropertyKey("idType").dataType(String.class).make()
timestamp = mgmt.makePropertyKey("timestamp").dataType(Long.class).make()
// create the edge property: edge value
edgeValue = mgmt.makePropertyKey("edgeValue").dataType(String.class).make()
// add the relevant properties to the vertex and edge labels
mgmt.addProperties(rel, edgeValue, timestamp)
mgmt.addProperties(cookie, idvalue, idtype, timestamp)
mgmt.addProperties(mobile, idvalue, idtype, timestamp)
mgmt.addProperties(email, idvalue, idtype, timestamp)
// build composite indexes for vertices and edges
mgmt.buildIndex('byCookiedValueComposite', Vertex.class).addKey(idvalue).indexOnly(cookie).buildCompositeIndex()
mgmt.buildIndex('byMobileidValueComposite', Vertex.class).addKey(idvalue).indexOnly(mobile).buildCompositeIndex()
mgmt.buildIndex('byEmailidValueComposite', Vertex.class).addKey(idvalue).indexOnly(email).buildCompositeIndex()
mgmt.buildIndex('byRelEdgeValueComposite', Edge.class).addKey(edgeValue).indexOnly(rel).buildCompositeIndex()
// print the schema you just created
mgmt.printSchema()
// commit the schema changes and close resources
mgmt.commit()
mgmt.close()
g1.close()

The composite indexes we created do not use Elasticsearch; they are maintained entirely inside the storage backend.
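
To make the idea concrete: a composite index answers exact-match lookups on its key, which is why it needs no external search engine. The sketch below is a toy model in Python, not how JanusGraph lays the index out in the storage backend. The class and method names are invented for illustration.

```python
from collections import defaultdict


class ToyCompositeIndex:
    """Exact-match index on one property key, restricted to one vertex label.

    Mirrors the shape of addKey(idvalue).indexOnly(email) from the schema.
    """

    def __init__(self, label, key):
        self.label = label
        self.key = key
        self._entries = defaultdict(set)  # property value -> vertex ids

    def add(self, vertex_id, label, properties):
        # Vertices with a different label are ignored, like indexOnly(...).
        if label == self.label and self.key in properties:
            self._entries[properties[self.key]].add(vertex_id)

    def lookup(self, value):
        # Equality only: there is no prefix, range or full-text matching.
        return sorted(self._entries.get(value, ()))


if __name__ == "__main__":
    by_email = ToyCompositeIndex("EMAIL", "idValue")
    by_email.add(1, "EMAIL", {"idValue": "a@example.com"})
    by_email.add(2, "COOKIE", {"idValue": "a@example.com"})  # wrong label
    print(by_email.lookup("a@example.com"))
    print(by_email.lookup("a@"))  # partial values never match
```

Lookups by a full idValue are all identity resolution needs here, so the equality-only limitation costs us nothing.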

6. With the schema and indexes in place, it is now time to restart the JanusGraph server, after first modifying the scripts/empty-sample.groovy and conf/gremlin-server/gremlin-server-configuration.yaml files.

// scripts/empty-sample.groovy
def globals = [:]
ig = ConfiguredGraphFactory.open("my-demo-keyspace")
globals << [ig : ig.traversal()]

# Modify gremlin-server-configuration.yaml to add the groovy script
org.apache.tinkerpop.gremlin.jsr223.ScriptFileGremlinPlugin: {files: [scripts/empty-sample.groovy]}}}}

7. Stop the server and restart it using the command from step 3, then check the log file again. You should see a message saying that the server started on port 8182.

INFO  org.apache.tinkerpop.gremlin.server.GremlinServer  - Channel started at port 8182.
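
Since checking the log comes up after every restart, a small script can do the reading for you. Here is a minimal sketch, assuming log lines look like the one above, with the level spelled out as a word such as INFO or ERROR. The function name is my own, and where your server writes its log depends on your setup.

```python
import re

STARTED = re.compile(r"Channel started at port (\d+)")
ERROR = re.compile(r"\bERROR\b")


def summarize_server_log(lines):
    """Return (port, error_lines); port is None if startup was never logged."""
    port = None
    error_lines = []
    for line in lines:
        started = STARTED.search(line)
        if started:
            port = int(started.group(1))
        if ERROR.search(line):
            error_lines.append(line.rstrip())
    return port, error_lines
```

Pass it the lines of the server log: a port of 8182 together with an empty error list means the restart went through cleanly.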

8. Voila! Your JanusGraph Server is now up and running.

In this post we saw how to start the JanusGraph server with ConfiguredGraphFactory. We also created a graph schema and composite indexes to speed up query processing.

In the next post we will see how to perform bulk load with Apache Spark.

Happy Learning
