Message Queues – Computing Codes

How to scale a web application like Wikipedia?

Wikipedia performance numbers:

30,000 HTTP requests/sec
3 Gbit/s data transfer
3 data centers
350 server machines, ranging from 1x P4 to 2x Xeon quad-core, with 0.5 – 16 GB of memory

Runs on LAMP Architecture

OutOfMemory Exception

out of memory error

java.lang.OutOfMemoryError: Java heap space

The quick solution is to set the initial and maximum heap size with these flags on the JVM command line when the Java runtime is started:

-Xms1024m -Xmx1024m

java.lang.OutOfMemoryError: PermGen space

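The solution is to add these flags to the JVM command line when the Java runtime is started (they take effect only with the CMS collector, and PermGen itself no longer exists from Java 8 onward):

-XX:+CMSClassUnloadingEnabled -XX:+CMSPermGenSweepingEnabled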

PHP GEARMAN : Best Way to handle the CPU intensive BackGround jobs

Best Way to handle the CPU intensive jobs:

E.g. a YouTube user uploading a video. We get the content/data from the user, process it, and send a response to the client.

Basic Architecture:

Have a message queue. Each client thread adds its data to the message queue and then polls the queue for the job status. All client threads add their data to the same queue. At the end of the queue, a worker processes the job and can send a notification to the UI or an email to the client.
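The architecture above can be sketched in a few lines of Java. This is only an in-process illustration, not how Gearman itself is built: the class name JobQueueSketch, the Status enum and the string-reversal "work" are all assumptions made for the example.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical in-process stand-in for a message queue with one worker.
class JobQueueSketch {
    enum Status { QUEUED, DONE }

    static final class Job {
        final String id;
        final String payload;
        Job(String id, String payload) { this.id = id; this.payload = payload; }
    }

    private final BlockingQueue<Job> queue = new LinkedBlockingQueue<>();
    private final Map<String, Status> status = new ConcurrentHashMap<>();
    private final Map<String, String> results = new ConcurrentHashMap<>();

    // Client side: add the data to the shared queue and get a handle back.
    String submit(String payload) {
        String id = UUID.randomUUID().toString();
        status.put(id, Status.QUEUED);
        queue.add(new Job(id, payload));
        return id;
    }

    // Client side: poll the job status until it is done, then fetch the result.
    String pollUntilDone(String id) {
        while (status.get(id) != Status.DONE) {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
        }
        return results.get(id);
    }

    // Worker side: a single daemon thread that drains the queue.
    void startWorker() {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    Job job = queue.take();
                    // Placeholder for the CPU-intensive processing step.
                    String out = new StringBuilder(job.payload).reverse().toString();
                    results.put(job.id, out);
                    status.put(job.id, Status.DONE); // set last, so pollers see the result
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    public static void main(String[] args) {
        JobQueueSketch q = new JobQueueSketch();
        q.startWorker();
        String id = q.submit("Hello world");
        System.out.println(q.pollUntilDone(id));
    }
}
```

In a real deployment the queue would live in a separate job server so that clients and workers can run on different machines; that is the role Gearman plays in the rest of these notes.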

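PHP syntax: ignore_user_abort(true) => controls whether the script is aborted when the client disconnects. With true, the script keeps running after the client has gone away, so processing can continue in the background.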

PHP GEARMAN

A generic application framework for farming out work to multiple machines or processes.

Written in C
Multi-threaded
Persistent queues
No single point of failure

Client: Creates a job and sends it to the job server.

Worker: Registers with the job server, gets a job, and processes it.

Job Server: Coordinates the job from the client to the worker and handles restarts.

GEARMAN application architecture:

Application Client =====>| GearMan Client API ===> GearMan job server ===>GearMan Worker API | ===> Application Worker

Sample Client :

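$client = new GearmanClient();
$client->addServer(); // no arguments: use the default job server on localhost

// Run the job in the foreground (synchronous) and print the result.
// Newer versions of the PHP extension call this method doNormal().
print $client->do("reverse", "Hello world");

// Run the job in the background (asynchronous); returns a job handle, not the result.
print $client->doBackground("reverse", "Hello world");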

Sample Worker :

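$worker = new GearmanWorker();
// Must point at the same job server that the client submits to.
$worker->addServer("172.16.33.12", 6322);
$worker->addFunction("reverse", "my_reverse_function");

// Wait for jobs and process them one at a time.
while ($worker->work());

// Job callback: returns the reversed workload to the job server.
function my_reverse_function($job) {
    return strrev($job->workload());
}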

Running :

Shell$ gearmand -d

Shell$ php worker.php &

17510

Shell$ php client.php

dlrow olleH

Gearman supports distributed processing with synchronous and asynchronous queues.

Gearman : job vs tasks

Every job is a task, but not every task is a job. E.g. checking a job's status is a task but not a job.

The client submits tasks.

The worker processes jobs.

Concurrent task API:

Queue the jobs

Callback function on the specific events

No promise in the order in which job is processed.
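The three points above (queue the jobs, react through callbacks, expect no particular completion order) can be illustrated with plain Java futures. This is a sketch of the idea only, not the Gearman task API; ConcurrentTasksSketch, runAll and the pool size of 4 are assumptions for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical illustration of queued tasks with completion callbacks.
class ConcurrentTasksSketch {
    // Queues every workload, attaches a callback, and waits for all of them.
    // The returned list is in completion order, which may differ from submission order.
    static List<String> runAll(List<String> workloads) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<String> completed = new CopyOnWriteArrayList<>();
        List<CompletableFuture<Void>> futures = new ArrayList<>();
        try {
            for (String w : workloads) {
                futures.add(CompletableFuture
                        .supplyAsync(() -> new StringBuilder(w).reverse().toString(), pool)
                        .thenAccept(completed::add)); // callback fired on completion
            }
            CompletableFuture.allOf(futures.toArray(new CompletableFuture[0])).join();
        } finally {
            pool.shutdown();
        }
        return completed;
    }

    public static void main(String[] args) {
        // The order of the printed elements can change from run to run.
        System.out.println(runAll(Arrays.asList("one", "two", "three")));
    }
}
```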

Thread Model:

Specify the thread count with the -t parameter when starting gearmand.

By default it is single threaded.

gearmand uses libevent, an asynchronous event notification library.

Currently there are three types of thread in Gearman:

1. Listening and management thread: listens for incoming connections, assigns them to the I/O threads, and manages server startup.

2. The I/O thread is responsible for doing the read and write system calls on the sockets and initial packet parsing. Once the packet has been parsed, it is put into an asynchronous queue for the processing thread.

3. The processing thread should have no system calls within it (except for the occasional brk() for more memory), and manages the various lists and hash tables used for tracking unique keys, job handles, functions, and job queues.

Queues: Inside the Gearman job server, all job queues are stored in memory by default, so when the server is restarted all the jobs are gone.

Persistent queues are supported only for background jobs.

The persistent queue module uses libdrizzle.

The persistent queue is only enabled for background jobs because foreground jobs have an attached client. If a job server goes away, the client can detect this and restart the foreground job somewhere else (or report an error back to the original caller). Background jobs on the other hand have no attached client and are simply expected to be run when submitted.

PHP: Things to take care of in cluster mode

1. The source code should be the same in all environments.

2. If each node in the cluster needs to point to different resources (like a DB or filer), do not make server-specific changes in the code. Instead, deploy the same code on every server and read the differences from environment variables.

3. The easiest way to cluster the DB:

Master-slave system (one node takes the INSERTs, all nodes serve SELECTs). Letting every node take INSERTs is a maintenance nightmare (request routing and keeping all the databases in sync).

4. NoSQL is another solution too…

5. Cluster deployment options: this work must be automated. Phing, rsync, pull/push actions in a distributed revision control system (e.g. Mercurial), or maybe a simple bash script can be enough.

6. If your application is protected by an authentication mechanism, you must ensure that all nodes can authenticate against the same database. A good pattern is to use an external authentication mechanism such as OAuth on a separate server. If that is not possible, you must consider where the user/password database is located.
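The master-slave idea from point 3 comes down to a routing decision per statement. The sketch below is written in Java to stay in one language with the servlet example later in these notes; ReadWriteRouterSketch, the host names and the simple "starts with SELECT" test are assumptions made for illustration, and a real setup would also have to deal with replication lag and transactions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical router: writes go to the master, reads rotate over the replicas.
class ReadWriteRouterSketch {
    private final String master;
    private final List<String> replicas;
    private final AtomicInteger next = new AtomicInteger();

    ReadWriteRouterSketch(String master, List<String> replicas) {
        this.master = master;
        this.replicas = replicas;
    }

    // Returns the host that should execute the given SQL statement.
    String route(String sql) {
        boolean isRead = sql.trim().toUpperCase(Locale.ROOT).startsWith("SELECT");
        if (!isRead || replicas.isEmpty()) {
            return master; // INSERT/UPDATE/DELETE, or no replica available
        }
        int i = Math.floorMod(next.getAndIncrement(), replicas.size());
        return replicas.get(i); // round-robin over the read replicas
    }

    public static void main(String[] args) {
        ReadWriteRouterSketch router =
                new ReadWriteRouterSketch("db-master", Arrays.asList("db-slave1", "db-slave2"));
        System.out.println(router.route("INSERT INTO users VALUES (1)"));
        System.out.println(router.route("SELECT * FROM users"));
        System.out.println(router.route("SELECT * FROM users"));
    }
}
```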

JBoss: Clustering a Web Application

Let us say we have 2 machines in the same network. Node A and Node B

nodeA 192.168.0.1

nodeB 192.168.0.2

Now, for every installation of JBoss, go to the folder deploy\jboss-web.deployer (JBoss 4.x) or deploy\jbossweb.sar (JBoss 5.x and later) and open the file server.xml.

Configure on the 192.168.0.1 machine

and on the machine 192.168.0.2

A simple HelloWorld Java servlet that prints the session information:

public void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
    response.setContentType("text/html");
    // Read the value stored in the (replicated) HTTP session, if any.
    String dummy = (String) request.getSession().getAttribute("dummy");
    PrintWriter out = response.getWriter();
    // Goes to the server console, so you can see which node served the request.
    System.out.print("Hello from Node " + InetAddress.getLocalHost().toString());
    if (dummy == null) {
        out.println("Dummy value not found. Inserting it...");
        request.getSession().setAttribute("dummy", "dummyOk");
    } else {
        out.println("Dummy is " + dummy);
    }
    out.flush();
    out.close();
}

Web.xml

<?xml version="1.0" encoding="UTF-8"?>
<web-app version="2.4"
    xmlns="http://java.sun.com/xml/ns/j2ee"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://java.sun.com/xml/ns/j2ee
        http://java.sun.com/xml/ns/j2ee/web-app_2_4.xsd">
    <distributable/>
</web-app>

The easiest way to deploy an application into the cluster is to use the farming service. That is, hot-deploy the application archive file (e.g. the EAR, WAR or SAR file) in the all/farm/ directory of any cluster member, and the application is automatically duplicated across all nodes in the same cluster.

An Apache Web Server will be used as the load balancer in front of the JBoss app servers.

The first thing we need is to redirect calls to the mod_jk load balancer. This is done in the file mod_jk.properties:

# Where to find workers.properties

JkWorkersFile conf/workers.properties

# Where to put jk logs

JkLogFile logs/mod_jk.log

JkMount /HelloWorld/* loadbalancer

Here HelloWorld is the Web Context of our sample application. Now let’s move to workers.properties file.

worker.list=loadbalancer,status

worker.nodeA.port=8009

worker.nodeA.host=192.168.0.1

worker.nodeA.type=ajp13

worker.nodeA.lbfactor=1

worker.nodeB.port=8009

worker.nodeB.host=192.168.0.2

worker.nodeB.type=ajp13

worker.nodeB.lbfactor=1

worker.loadbalancer.type=lb

worker.loadbalancer.balance_workers=nodeA,nodeB

# Status worker for managing load balancer

worker.status.type=status

As you can see, we have the two nodes (nodeA and nodeB), each with its own port and IP address.
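The lbfactor values in workers.properties act as relative weights between the nodes. The following Java sketch shows one simple way such weights can be turned into a selection order (plain weighted round-robin). It is an illustration only: mod_jk's own balancing methods are more elaborate, and WeightedBalancerSketch and nextWorker are names invented for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical weighted round-robin over named workers.
class WeightedBalancerSketch {
    private final List<String> slots = new ArrayList<>();
    private int cursor = 0;

    // Each worker gets as many slots as its weight (lbfactor).
    WeightedBalancerSketch(Map<String, Integer> lbfactors) {
        for (Map.Entry<String, Integer> e : lbfactors.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                slots.add(e.getKey());
            }
        }
    }

    // Returns the worker that should receive the next request.
    synchronized String nextWorker() {
        String worker = slots.get(cursor);
        cursor = (cursor + 1) % slots.size();
        return worker;
    }

    public static void main(String[] args) {
        Map<String, Integer> factors = new LinkedHashMap<>();
        factors.put("nodeA", 1); // worker.nodeA.lbfactor=1
        factors.put("nodeB", 1); // worker.nodeB.lbfactor=1
        WeightedBalancerSketch lb = new WeightedBalancerSketch(factors);
        for (int i = 0; i < 4; i++) {
            System.out.println(lb.nextWorker());
        }
    }
}
```

With both factors set to 1, as in the configuration above, requests alternate between nodeA and nodeB; raising one node's factor gives it a proportionally larger share.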

Now launch the two instances of JBoss using the command run -c all. The first instance started will display this information when the second instance kicks in:

In order to invoke our Servlet simply point to our Apache Web Server domain address

http://apache.web.server/HelloWorld/helloWorld

You’ll see that the balancer load balances calls and, once it has created the session, it will replicate the session to the secondary server. You can do experiments to test your cluster, switching off one instance of the cluster and verifying that the session stays alive on the other server.

Simply verify that the dummy value stays in the session when you switch off the current instance of JBoss.

Dummy is dummyOk

MOD_JK APACHE MODULE:

mod_jk is an Apache module used to connect the Tomcat servlet container with web servers such as Apache, iPlanet, Sun ONE (formerly Netscape) and even IIS using the Apache JServ Protocol.

In a nutshell, a web server is waiting for client HTTP requests. When these requests arrive the server does whatever is needed to serve the requests by providing the necessary content.

Adding a servlet container may somewhat change this behavior. Now the web server needs also to perform the following:

Load the servlet container adapter library and initialize it (prior to serving requests).

When a request arrives, it needs to check whether the request belongs to a servlet; if so, it needs to let the adapter take the request and handle it.

The adapter on the other hand needs to know what requests it is going to serve, usually based on some pattern in the request URL, and to where to direct these requests.

Things are even more complex when the user wants to set a configuration that uses virtual hosts, or when they want multiple developers to work on the same web server but on different servlet container JVMs.
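The "pattern in the request URL" step can be pictured as a small mount table, similar in spirit to the JkMount /HelloWorld/* loadbalancer line used earlier. The Java sketch below is a simplified assumption, not mod_jk's actual matching code: it only understands exact paths and patterns ending in /*, and MountTableSketch and workerFor are invented names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical mount table mapping URL patterns to worker names.
class MountTableSketch {
    private final Map<String, String> mounts = new LinkedHashMap<>();

    void mount(String pattern, String worker) {
        mounts.put(pattern, worker);
    }

    // Returns the worker for the first matching pattern, or null when the
    // web server should serve the request itself.
    String workerFor(String uri) {
        for (Map.Entry<String, String> e : mounts.entrySet()) {
            String pattern = e.getKey();
            boolean matches = pattern.endsWith("/*")
                    ? uri.startsWith(pattern.substring(0, pattern.length() - 1))
                    : uri.equals(pattern);
            if (matches) {
                return e.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        MountTableSketch table = new MountTableSketch();
        table.mount("/HelloWorld/*", "loadbalancer");
        System.out.println(table.workerFor("/HelloWorld/helloWorld"));
        System.out.println(table.workerFor("/index.html"));
    }
}
```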

GC Policy of IBM JRE

optthruput: Optimizes for throughput. This is the default policy.

optavgpause: Optimizes for the average GC pause.

gencon: Uses a generational concurrent style of collection.

subpool: Speeds up object allocation on systems with very large numbers of processors.

Garbage collection can enhance application performance by improving object locality and the speed at which new objects can be allocated. Garbage collectors that rearrange objects in the heap by compaction or by performing a copying collection can move objects that access one another close to each other in the heap, and this can have a dramatic effect on the rate of data processing.

Heap sizing:

-Xms: initial heap size
-Xmx: maximum heap size

Do not set the initial and maximum heap sizes to the same value; in that case the GC will not run until the maximum heap size is reached. You can use -verbose:gc when running your application with no load, and again under stress, to help you set the initial and maximum heap sizes.

The -verbose:gc output is fully described in Garbage Collector diagnostics. Switch on -verbose:gc and run up the application with no load. Check the heap size at this stage. This provides a rough guide to the start size of the heap (-Xms option) that is needed. If this value is much larger than the defaults (see Default settings for the JVM), think about reducing this value a little to get efficient and rapid compaction up to this value, as described in Initial and maximum heap sizes.

By running an application under stress, you can determine a maximum heap size. Use this to set your max heap (-Xmx) value.
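Alongside -verbose:gc, the running application can report its own heap figures through the standard java.lang.Runtime API. The sketch below is a small helper for eyeballing these numbers with no load and again under stress; the class name HeapReportSketch and the output format are assumptions for the example.

```java
// Hypothetical helper that prints the current heap figures of the running JVM.
class HeapReportSketch {
    static long toMiB(long bytes) {
        return bytes / (1024 * 1024);
    }

    static String report() {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();             // upper limit the heap may grow to (see -Xmx)
        long committed = rt.totalMemory();     // heap currently reserved by the JVM
        long used = committed - rt.freeMemory(); // part of the reserved heap in use
        return "used=" + toMiB(used) + "MiB committed=" + toMiB(committed)
                + "MiB max=" + toMiB(max) + "MiB";
    }

    public static void main(String[] args) {
        System.out.println(report());
    }
}
```

Running it with different settings, for example java -Xms64m -Xmx256m HeapReportSketch, shows how the committed and maximum values follow the two options.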