Apache Spark Hello world program

Installing Spark Master:

Package used: spark-1.2.0-bin-hadoop2.4

Extract the above archive and copy it to the following location.

Linux path: /usr/soft/spark/

Make sure to set appropriate execute permissions on the folder; this can be done using the chmod command in Linux.

Set the IP address of the master using the following command:

export SPARK_MASTER_IP="192.16.35.10"

Start the Master server using the following commands:

cd sbin
./start-master.sh

Now you can view the Master server details in the Master web UI:

http://192.16.35.10:8080/

If you are not able to access this URL, you can get the exact URL from the logs at the following location:
Log location: /usr/soft/spark/logs
(see the spark-root-org.apache.spark.deploy.master.Master-1-localhost.out file)

Log statement:
14/12/31 13:20:28 INFO MasterWebUI: Started MasterWebUI at http://192.16.35.10:8080

Installing Spark Worker:

Package used : spark-1.2.0-bin-hadoop2.4

Extract the above archive and copy it to the following location on the worker machine.

Linux path: /usr/soft/spark/

Make sure to set appropriate execute permissions on the folder; this can be done using the chmod command in Linux.

Start the Worker and register it with the master using the following command:

./bin/spark-class org.apache.spark.deploy.worker.Worker spark://192.16.35.10:7077

Now go to the Master web UI; you can see the new worker added.

MySQL Cluster Partitioning

Cluster Partitioning:

1. Role-Based Cluster Partitioning
2. Data based Partitioning
 
Role-Based Partitioning:
 
Usually, DB operations are I/O-intensive rather than CPU-intensive. But certain operations like reports and search are CPU-intensive; for these types of operations we can use a different set of slaves connected to the same master.
 
Data based Partitioning:
 
Data is partitioned based on the incoming data. For example, usernames starting with a-m are stored in one table and those starting with n-z in another.
But this requires all SELECT statements to be optimized to handle the data partitioning.
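A minimal sketch of the routing step, assuming a simple two-way split on the first letter of the username (the class and table names here are illustrative, not from the article):

```java
// Hypothetical router: maps a username to its shard table based on the
// first character. Names starting a-m (and anything non-alphabetic) go
// to one table, n-z to the other. Every SELECT must go through this.
public class ShardRouter {
    public static String tableFor(String username) {
        char c = Character.toLowerCase(username.charAt(0));
        return (c >= 'n' && c <= 'z') ? "users_n_z" : "users_a_m";
    }

    public static void main(String[] args) {
        System.out.println(tableFor("alice")); // users_a_m
        System.out.println(tableFor("zoe"));   // users_n_z
    }
}
```

The application layer (not the database) owns this mapping, which is why every query path has to be partition-aware.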
 
Write operation for Performance:
 
The only true option for reducing the load of write operations is to move from a single master to multiple masters, each with its own set of slaves.
 
MYSQL NDB:
 
The Network DB (NDB) storage engine provides an interface to a backend NDB cluster; with this, MySQL gets built-in clustering and automatic-failover functionality.
 
High availability:
 
Dual Master replication
Shared storage with standby

MySQL Replication Design


In MySQL, we have master and slave DBs.
The master DB keeps a binary log that records all the queries run to update data. All the slaves read the binary log and apply the updates to the slave tables.
Master-slave architecture might not solve all the issues. For instance, a real-time system wants every update applied on the master to reach the slaves as soon as possible, and then every slave carries the same write load as the master.
For DB servers, I/O operations are costlier than CPU-intensive operations.
Replication Architectures:
1. Master – Slave
2. Slave with Master
3. Dual Master
4. Replication ring(multiple Masters)
5. Pyramid
Software based load balancing :
If you are not interested in hardware-based load balancing, we can use software-based options like Cluster-JDBC and SQL-RELAYS.
HTTP load balancing is different from SQL load balancing. Load balancing in SQL may be affected by functionality such as connection pooling, data partitioning, and reverse lookups.
Data partitioning:
With data partitioning, the application determines which query will be executed by which DB cluster. Each cluster is efficient with its own data caches.
Connection Pooling Might not scale:
Connection pooling breaks a basic assumption of load balancing: connections should be created and terminated as soon as possible so that load can be distributed.
Load Balancer Basic Functionalities :
 
1. Maintain a list of the active servers.
2. Route the request to different servers.
MySQL's default max_connections is 100.
Load Balancing Algorithms:
1. Random
2. Round Robin
3. Least Connection
4. Fastest Response
5. Hashed
6. Weighted
Making a poor choice of load balancing algorithm can have a huge effect on the system's behavior.
For example, if we choose LEAST CONNECTION in a high-load scenario and add a new SQL server, all requests will by default be routed to the new server. On the new server the DB-level and OS-level caches are not yet warm for the initial requests, and this creates a larger performance impact.
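A sketch of two of the algorithms listed above, round robin and least connection; the class and server names are illustrative, not from the article:

```java
import java.util.List;

// Illustrative balancer over a fixed server list.
public class Balancer {
    private final List<String> servers;
    private final int[] activeConnections; // per-server open-connection count
    private int next = 0;

    public Balancer(List<String> servers) {
        this.servers = servers;
        this.activeConnections = new int[servers.size()];
    }

    // Round robin: rotate through the server list in order.
    public String roundRobin() {
        String s = servers.get(next);
        next = (next + 1) % servers.size();
        return s;
    }

    // Least connection: pick the server with the fewest open connections.
    // Note the caveat from the text: a freshly added server starts at zero
    // and immediately attracts all new requests while its caches are cold.
    public String leastConnection() {
        int min = 0;
        for (int i = 1; i < servers.size(); i++) {
            if (activeConnections[i] < activeConnections[min]) min = i;
        }
        activeConnections[min]++;
        return servers.get(min);
    }
}
```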

MySQL Internal Architecture


MySQL Architecture:
MySQL supports three levels of configuration:
  1. MySQL default parameters can easily be changed to run on a variety of hardware
  2. In every table, we can define the data type for the columns
  3. At the table level we can define the table type (engine), e.g. InnoDB or MyISAM


Main criteria for choosing different engines:
1. Locking and  concurrency
2. Transaction
Locking and Transaction:
  • READ Locks : Shared Locks
  • WRITE Locks : Exclusive Locks
Locking Granularity:
  • Table Level Locking
  • Page Level Locking : as records are inserted in the last row, the last page is always locked; efficient with smaller page sizes
  • Row Level Locking
Row Locking is implemented using the Multi Version Concurrency Control.
Isolation Levels:
Read Uncommitted : reads uncommitted transaction data
Read Committed : reads only committed transaction data
Repeatable Read : while reading a record, locks that record so that it will not be updated
Serializable : orders the transactions; has performance issues, but data will not be corrupted.
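The four levels above map directly onto constants in the standard java.sql.Connection interface; a driver-backed connection would apply one with setTransactionIsolation(...). A small sketch:

```java
import java.sql.Connection;

// Prints the JDBC constants for the four isolation levels.
// With a live connection you would call, e.g.:
//   conn.setTransactionIsolation(Connection.TRANSACTION_REPEATABLE_READ);
public class IsolationLevels {
    public static void main(String[] args) {
        System.out.println("READ UNCOMMITTED = " + Connection.TRANSACTION_READ_UNCOMMITTED);
        System.out.println("READ COMMITTED   = " + Connection.TRANSACTION_READ_COMMITTED);
        System.out.println("REPEATABLE READ  = " + Connection.TRANSACTION_REPEATABLE_READ);
        System.out.println("SERIALIZABLE     = " + Connection.TRANSACTION_SERIALIZABLE);
    }
}
```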
Innodb:
  • All Isolation levels are supported
  • Very good in transaction processing
  • Supports READ/WRITE Locks
  • Not so easy to move data across different architectures
  • Lower performance for read-only transactions
  • Does not support full Text searches
  • Good with referential integrity checks
  • Support Row Level Locking
MyISAM:
  • Good performance for read-only tables.
  • Support full text searches
  • Supports only Table level locks.
  • Easy to move across the different systems

PHP Scaling : Load Balancers


Load Balancers
All incoming and outgoing requests need to go through the load balancers. This is required to fairly distribute the incoming requests across the different application servers.
There are two important types of load balancers.
  •  Hardware Load Balancers
  •  Software Load balancers
Issues with Hardware Load Balancers:
1. Cost
2. It always acts as a black box; you don't know what exactly goes on inside.
It is always recommended to have multiple load balancers so that there is no single point of failure. Even if you need only 2 load balancers, add one extra server.
HAProxy:
It is written in C, is event-based, and requires only low-profile hardware with a small CPU/memory footprint. It can handle 20,000-30,000 connections per second.
It is written only to be a reverse proxy.
Layer 4 vs Layer 7 (TCP vs HTTP)
HAProxy supports both TCP and HTTP load balancing. If used with TCP, it forwards only the TCP packets. This takes less CPU/memory, but it does not parse the HTTP headers before forwarding. The Nginx server works only at the HTTP layer.
SSL Termination:
If we are using HTTPS, we need to decide where the SSL termination happens. If the SSL termination happens in the application server, HTTP headers cannot be parsed in the load balancer. If the SSL termination happens in the load balancer, it increases CPU utilization there. It is advisable to have this in the application servers.
Better Health checks and distribution algorithm:
The Nginx server always uses the timeout option for health checks: if the application server times out, it is removed from the list of application servers. Further, Nginx supports only the round-robin method for request distribution.
HAProxy Distribution Algorithms:
1. Round Robin
2. Least Connections
3. Source : hashing based on client IP address (sticky sessions)
4. URI : hashing based on the request URI
Choosing Hardware for Load Balancers:
HAProxy always runs in single-process mode. Because of this, powerful single-CPU machines are preferable to multi-core machines.
Automatic Failover with keepalived:
keepalived is a daemon used for monitoring the other servers; if one stops responding, keepalived takes over its IP address.
Fine Tuning Linux for handling huge connections:
1. net.core.somaxconn:
This defines the kernel's maximum queue size for accepting new connections. By default it is 128; if we need more connections to be handled, this needs to be increased.
2. net.ipv4.ip_local_port_range:
Defines the range of usable ports on your system. Increasing the number of open ports increases the number of possible connections.
3. net.ipv4.tcp_tw_reuse
In the TCP protocol, when a connection ends, the socket is still held until the TIME_WAIT period is completed; only after that is the connection closed. On a busy server, this can cause issues with running out of ports/sockets. This setting tells the server to reuse TCP sockets when it is safe to do so.
4. ulimit -n 99999
This sets the maximum number of open files in the OS.
Debugging Load Balancer for Slowness or unresponsiveness:
Saturating the network:
The network card configuration needs to be verified. If the network card is configured for 150 Mbps, it cannot pass 200 Mbps. We need to check both the private and public networks.
Running Out of Memory:
Check the memory taken by the HAProxy using the ‘free’ command.
Lot of connections in TIME_WAIT:
Having a lot of connections in TIME_WAIT will lead to slowness in the load balancers.

PHP scaling: DNS Server Configuration

DNS is very important in scaling PHP. When a DNS failure happens, the client sees a generic exception or page timeout, and has no way to reach you.

IT IS NOT REQUIRED TO USE THE DOMAIN REGISTRAR'S OR HOSTING PROVIDER'S DNS CONFIGURATION.

It is also not recommended to run your own DNS server, as doing it well involves geographically distributed servers and an Anycast network, and DDoS attacks are very common against DNS servers.

Criteria for choosing dns provider:

1. Geographically distributed servers.
2. AnyCast Network
3. Large Ip space to handle network failures.

ANYCAST Network:

Anycast is a network technique that allows multiple routers to advertise the same IP; it routes the user's request to the nearest possible router.

Using DNS for Load Distribution
The load balancer can be a single point of failure for the application. With DNS we can health-check a specific host and port and, if it goes down, point the DNS record at another server.

Different techniques for DNS configuration:
1. Use the first IP in the list.
2. Use the round-robin method. This is an issue when one IP address goes down: one out of every n requests still goes to the dead IP.
3. The DNS server does a health check, and if a server goes down, its IP is removed from the list.
4. Users with a cached IP will still face the issue; to avoid this we need to take over the IP from a neighbor using the keepalived config.
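Technique 3 above can be sketched as a pure selection function: given the resolved IP list and a health check, return the first IP that passes, falling back to the first entry when nothing responds. Class and IP values here are illustrative:

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical resolver-side failover: walk the resolved IPs in order
// and return the first one whose health check passes.
public class DnsPicker {
    public static String firstHealthy(List<String> ips, Predicate<String> healthy) {
        for (String ip : ips) {
            if (healthy.test(ip)) return ip;
        }
        return ips.get(0); // nothing healthy: fail over to the first entry
    }
}
```

In a real DNS setup the "health check" runs in the DNS provider's infrastructure, not in the client; this sketch only shows the selection logic.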

DNS Resolution: Internal Domain Resolution

Every time you post a tweet on Twitter, Twitter's web application calls the Twitter API on a remote machine. To accomplish this, either the internal name server or a public DNS server needs to be called to resolve the domain name to an IP address.

To avoid calling an external server for domain resolution, we can use a local Linux DNS cache. Servers commonly used for this are nscd, dnsmasq, and BIND.

This can be implemented as early as possible; the only extra cost is that it becomes one more server to monitor.

PHP Session Handling in Clustered environment

Sessions are an important part of a web application, as we need to store user information/data between two requests. The default application server configuration stores the session data on the server in files and sets a session identifier cookie in the client browser.

Problem with Storing session in File System:
1. Large IO operation time
2. We need to use sticky sessions in the server cluster, and this creates problems for scalability.
Another option is storing the session in a relational database. A session is a small piece of temporary information that keeps changing, and a relational DB might not be the correct option, as we need a lot of costly IO operations to retrieve/set the information.
PHP has an option to set the session handler plugin in the php.ini file.
Setting session handlers in php.ini:
Memcache:
ini_set('session.save_handler', 'memcached');
ini_set('session.save_path', '192.16.32.112:11211,192.16.32.113:11211');
Fastest way to scale is setup the memcache server and use that for session handling.
Issues with Memcache:
Memcache is an in-memory data store that does not persist any information to a file. If the system crashes, all the information is gone. But memcache is extremely fast.
Redis:
Redis stores the information in memory and persists it to the file system.
ini_set('session.save_handler', 'redis');
ini_set('session.save_path', 'tcp://192.16.32.112:11211?timeout=0.5,tcp://192.16.32.113:11211?timeout=0.5');
Another way of implementing sessions is through cookies: the session information is stored in the browser cookie.
Issues:
1. Cookie information is sent back and forth on every HTTP call; for every server call this information is sent and received. This involves large bandwidth consumption if the cookie is big.
2. Some browsers have a maximum size limit on cookies.
3. Cookie information is visible on the client side using tools like Firebug, and on the wire if you are not using HTTPS.
4. If you use a Hash-Based Message Authentication Code (HMAC) to sign the cookie data, tampering with it can be detected.
Cookie session handling is not built into PHP, but it can be implemented using the SessionHandlerInterface.
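The HMAC signing mentioned in point 4 can be sketched as follows (class and key names are illustrative, and a real deployment would also encrypt sensitive data, since HMAC only detects tampering, it does not hide the payload):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical cookie signer: appends an HMAC-SHA256 signature to the
// payload. The server keeps the key secret, so a client that edits the
// payload cannot produce a matching signature.
public class CookieSigner {
    private final SecretKeySpec key;

    public CookieSigner(byte[] secret) {
        this.key = new SecretKeySpec(secret, "HmacSHA256");
    }

    public String sign(String payload) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(key);
            byte[] sig = mac.doFinal(payload.getBytes(StandardCharsets.UTF_8));
            return payload + "." + Base64.getUrlEncoder().withoutPadding().encodeToString(sig);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public boolean verify(String cookie) {
        int dot = cookie.lastIndexOf('.');
        if (dot < 0) return false;
        String expected = sign(cookie.substring(0, dot));
        // constant-time comparison to avoid timing leaks
        return MessageDigest.isEqual(
            expected.getBytes(StandardCharsets.UTF_8),
            cookie.getBytes(StandardCharsets.UTF_8));
    }
}
```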

Object Pooling Pattern : Used in Connection Pooling , Thread pooling implementations.

Avoid expensive acquisition and release of resources by recycling objects that are no longer in use. This can be considered a generalization of the connection pool and thread pool patterns.
 
Handling of empty pools
 
Object pools employ one of three strategies to handle a request when there are no spare objects in the pool.
  • Fail to provide an object (and return an error to the client).
  • Allocate a new object, thus increasing the size of the pool. Pools that do this usually allow you to set the high water mark (the maximum number of objects ever used).
  • In a multi-threaded environment, a pool may block the client until another thread returns an object to the pool.
 
import java.util.Enumeration;
import java.util.Hashtable;

public abstract class ObjectPool<T> {
  private long expirationTime;
 
  private Hashtable<T, Long> locked, unlocked;
 
  public ObjectPool() {
    expirationTime = 30000; // 30 seconds
    locked = new Hashtable<T, Long>();
    unlocked = new Hashtable<T, Long>();
  }
 
  protected abstract T create();
 
  public abstract boolean validate(T o);
 
  public abstract void expire(T o);
 
  public synchronized T checkOut() {
    long now = System.currentTimeMillis();
    T t;
    if (unlocked.size() > 0) {
      Enumeration<T> e = unlocked.keys();
      while (e.hasMoreElements()) {
        t = e.nextElement();
        if ((now - unlocked.get(t)) > expirationTime) {
          // object has expired
          unlocked.remove(t);
          expire(t);
          t = null;
        } else {
          if (validate(t)) {
            unlocked.remove(t);
            locked.put(t, now);
            return (t);
          } else {
            // object failed validation
            unlocked.remove(t);
            expire(t);
            t = null;
          }
        }
      }
    }
    // no objects available, create a new one
    t = create();
    locked.put(t, now);
    return (t);
  }
 
  public synchronized void checkIn(T t) {
    locked.remove(t);
    unlocked.put(t, System.currentTimeMillis());
  }
}
 
//The three remaining methods are abstract
//and therefore must be implemented by the subclass
 
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class JDBCConnectionPool extends ObjectPool<Connection> {
 
  private String dsn, usr, pwd;
 
  public JDBCConnectionPool(String driver, String dsn, String usr, String pwd) {
    super();
    try {
      Class.forName(driver).newInstance();
    } catch (Exception e) {
      e.printStackTrace();
    }
    this.dsn = dsn;
    this.usr = usr;
    this.pwd = pwd;
  }
 
  @Override
  protected Connection create() {
    try {
      return (DriverManager.getConnection(dsn, usr, pwd));
    } catch (SQLException e) {
      e.printStackTrace();
      return (null);
    }
  }
 
  @Override
  public void expire(Connection o) {
    try {
      ((Connection) o).close();
    } catch (SQLException e) {
      e.printStackTrace();
    }
  }
 
  @Override
  public boolean validate(Connection o) {
    try {
      return (!((Connection) o).isClosed());
    } catch (SQLException e) {
      e.printStackTrace();
      return (false);
    }
  }
}
 
 
 
JDBCConnectionPool will allow the application to borrow and return database connections:
 
 
public class Main {
  public static void main(String args[]) {
    // Do something…
    …
 
    // Create the ConnectionPool:
    JDBCConnectionPool pool = new JDBCConnectionPool(
      "org.hsqldb.jdbcDriver", "jdbc:hsqldb://localhost/mydb",
      "sa", "secret");
 
    // Get a connection:
    Connection con = pool.checkOut();
 
    // Use the connection
    …
 
    // Return the connection:
    pool.checkIn(con);
 
  }
}
 

Lazy Initialization and Eager Initialization of Singleton pattern

Lazy initialization
This method uses double-checked locking, which should not be used prior to J2SE 5.0, as it is vulnerable to subtle bugs. The problem is that an out-of-order write may allow the instance reference to be returned before the Singleton constructor is executed.[8]
public class Singleton {
        private static volatile Singleton instance = null;
 
        private Singleton() {   }
 
        public static Singleton getInstance() {
                if (instance == null) {
                        synchronized (Singleton.class){
                                if (instance == null) {
                                        instance = new Singleton();
                                }
                      }
                }
                return instance;
        }
}
 
Eager initialization
If the program will always need an instance, or if the cost of creating the instance is not too large in terms of time/resources, the programmer can switch to eager initialization, which always creates an instance:
public class Singleton {
    private static final Singleton instance = new Singleton();
 
    private Singleton() {}
 
    public static Singleton getInstance() {
        return instance;
    }
}
This method has a number of advantages:
  • The instance is not constructed until the class is used.
  • There is no need to synchronize the getInstance() method, meaning all threads will see the same instance and no (expensive) locking is required.
  • The final keyword means that the instance cannot be redefined, ensuring that one (and only one) instance ever exists.
 
This method also has some drawbacks:
If the program uses the class, but perhaps not the singleton instance itself, then you may want to switch to lazy initialization.
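A third variant worth knowing, not shown above, is the initialization-on-demand holder idiom, which gives lazy initialization without volatile or explicit locking: the JVM guarantees the nested holder class is initialized only on first use of getInstance(), and class initialization is thread-safe. The class name here is illustrative:

```java
// Initialization-on-demand holder idiom: lazy, thread-safe, lock-free.
public class HolderSingleton {
    private HolderSingleton() {}

    // The nested class is not initialized until getInstance() first
    // references it; the JVM serializes class initialization.
    private static class Holder {
        private static final HolderSingleton INSTANCE = new HolderSingleton();
    }

    public static HolderSingleton getInstance() {
        return Holder.INSTANCE;
    }
}
```

This combines the laziness of the double-checked-locking version with the simplicity of eager initialization.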