Tony Nguyen Anh Tuan

Recent posts by Tony Nguyen Anh Tuan

AngularJS as an alternative choice for building web interfaces

Recently, we were invited to conduct a joint talk with Andrew, a UX designer from Viki, on improving communication between design and development for Singasug. The main focus of the talk was to introduce AngularJS and elaborate on how it helps improve collaboration between developers and designers, from both points of view. To illustrate the topic, we worked together on revamping the Pet Clinic Spring sample application.

Over one month of working remotely, mostly in our spare time, we managed to refresh the Pet Clinic application with a new interface built on AngularJS and Bootstrap. Based on that experience, we delivered the talk to share with the Spring community what had been done.

In this article, we want to re-share the story with a different focus: using AngularJS as an alternative choice for building web applications.

Project kick-off

The project was initiated last year by Michael Isvy and Sergiu Bodiu, the two organizers of the Singapore Spring User Group. Michael asked for our help to deliver a talk about AngularJS for the web night. He went further by introducing us to a UX designer named Andrew and asking whether we could collaborate on revamping the Pet Clinic application. Interested in the idea, we agreed to work on the project and deliver the talk together.

We set the initial scope of the project to one month and aimed to show as much functionality as possible rather than fully completing the application. Also, because of Andrew's involvement, we took the opportunity to revamp the information architecture of the website and give it a new layout.

Project Delivery

The biggest problem we had to solve was geography. All of us were working in our spare time and could afford only limited face-to-face communication. Due to personal schedules, it was difficult to set up a common working place or time. To discuss and plan the project, we only scheduled a weekly meeting during the weekend.

To replace direct communication, we contacted each other through WhatsApp. We hosted the project on GitHub and turned on the notification feature so that each member was informed after every check-in. Even though there was a bit of concern at the beginning, each member played their part and the project progressed well.

To prepare in advance, we converted the Spring MVC controllers with JSP views into RESTful controllers. After that, Andrew showed us the wireframe flow, which laid down a common understanding of how the application should behave. Based on that, we built the respective controllers and integrated them with the HTML templates Andrew provided.

Halfway through the project, Andrew picked up some understanding of Angular directives and was able to work directly on the project source code. That helped boost our development speed.

When the project was halfway through its timeline, Michael came back to us and asked whether it was possible to make the website functional without the RESTful API. He thought that would help designers develop HTML templates without deploying them to an application server. We bought into this idea and provided the feature. By turning on a flag, the AngularJS application no longer makes any HTTP calls to the server; instead, a mock HTTP service returns pre-defined static JSON files.
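
The snippet below is only a minimal sketch of that idea, not the project's actual code; the module, service and file names (petClinicApp, ownerService, useMockData, mocks/owners.json) are assumptions.

```javascript
// A flag decides whether the service calls the real REST API
// or serves a pre-defined static JSON file.
var app = angular.module('petClinicApp', []);

app.constant('useMockData', true);

app.factory('ownerService', ['$http', 'useMockData', function ($http, useMockData) {
    return {
        findAll: function () {
            if (useMockData) {
                // no call to the application server: read a static JSON file instead
                return $http.get('mocks/owners.json');
            }
            return $http.get('/petclinic/api/owners');
        }
    };
}]);
```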

What we have achieved

After one month, we had a new website with a beautiful theme that looks like a commercial website rather than a sample application. The new Pet Clinic is now a single-page application that accesses a RESTful API.

We also figured out an effective way of working remotely, even though we did not know each other before the project started.

Along the way, we found many benefits of building a web interface with AngularJS. That is what we want to share with you today.

Understanding AngularJS

If you are new to AngularJS, think of it as an MVC framework for front-end applications. MVC is a well-known pattern for server-side applications, but it is applicable to front-end applications as well. With AngularJS, the developer builds a model based on the data received from the server. After that, Angular binds the model data to the HTML controls and displays the output on screen.

The good thing here is that AngularJS provides two-way binding instead of one-way binding. If the user updates the value in an HTML control, the model is automatically updated without any developer effort.

Here is an illustration of the plain one-way binding versus the two-way binding.
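
As a small substitute for the original diagrams, here is a minimal two-way binding snippet (the markup is an assumption, not taken from the Pet Clinic project): typing in the input immediately updates the greeting, and changing the model would update the input.

```html
<div ng-app="" ng-init="name = 'World'">
    <!-- ng-model binds the input to the "name" model in both directions -->
    <input type="text" ng-model="name"/>
    <p>Hello, {{name}}!</p>
</div>
```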





Angular also provides a scope for each model, so that two-way binding is only active within the boundary of a controller. To instruct the binding, AngularJS uses directives, which are embedded directly in the HTML controls. The Angularised HTML will look similar to this:
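
The template below is a sketch of what the Angularised owners page might look like; the actual Pet Clinic template differs in detail.

```html
<div ng-controller="OwnerController">
    <!-- data-ng-repeat renders one block per owner in the "owners" model -->
    <div class="owner" data-ng-repeat="owner in owners">
        <p>{{owner.firstName}} {{owner.lastName}}</p>
        <p>{{owner.address}}, {{owner.city}}</p>
    </div>
</div>
```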



For the above binding to work, you need to prepare the data for the "owners" model in your controller:
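
A minimal controller sketch (the names OwnerController and ownerService are assumptions) that populates the "owners" model:

```javascript
app.controller('OwnerController', ['$scope', 'ownerService', function ($scope, ownerService) {
    // fetch the owners from the REST API and expose them on the scope
    ownerService.findAll().then(function (response) {
        $scope.owners = response.data;
    });
}]);
```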


If you are curious why the controller code looks so short, it is because we implemented an Owner service with JPA-like methods:
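
A sketch of such a service with JPA-like method names; the endpoint URLs are assumptions and may differ from the real project.

```javascript
app.factory('ownerService', ['$http', function ($http) {
    return {
        findAll:  function ()      { return $http.get('/petclinic/api/owners'); },
        findById: function (id)    { return $http.get('/petclinic/api/owners/' + id); },
        save:     function (owner) { return $http.post('/petclinic/api/owners', owner); }
    };
}]);
```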



Looking at the above template, if we take out all the Angular directives, which start with data-ng or ng, we get back the original HTML template. The only exception is the data-ng-repeat directive, which functions like a for loop and helps shorten the HTML code.

After Angularising the template file and creating the controller, the last thing one needs to do is declare them in the global app.js file:
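
A sketch of that wiring (module name, route and file paths are assumptions): the route maps a URL to the template and its controller.

```javascript
var app = angular.module('petClinicApp', ['ngRoute']);

app.config(['$routeProvider', function ($routeProvider) {
    $routeProvider.when('/owners', {
        templateUrl: 'partials/owners.html',
        controller: 'OwnerController'
    });
}]);
```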



So far, that is the effort required to Angularise one owners page template. The above example is mostly about displaying data, but if we replace the <p/> tags with <input/> tags, users will be able to edit and view the owner at the same time.

Pros and cons of AngularJS

We evaluated AngularJS before adopting it, and we evaluated it again after using it. To recap our experience, we will share some benefits and issues of AngularJS.

Facilitates adoption of RESTful APIs

Obviously, one needs to introduce a RESTful API to work with AngularJS. Given the changes that have happened in this industry over the last decade, RESTful APIs are slowly but surely becoming the standard practice for future applications.

In the past, a typical Spring framework developer needed to know JDBC, JPA, Spring and JSP in order to develop web applications. These are no longer enough. First, big players like Twitter, Facebook and Google introduced APIs for third-party web applications to integrate with their services. Later, there was a boom in mobile applications that no longer render their UI from HTML.

Because of that, there is an increasing demand for building applications that serve data instead of serving HTML. For example, any start-up that wants to play it safe will start by building a RESTful API before building a front-end application. That can save a lot of effort if there is a need to build for a client other than the browser in the future.

There is another contributing factor from the development point of view. Splitting back-end and front-end applications makes parallel development easier: it is not necessary to wait until the back-end services are completed before building the web interface. It is also beneficial in terms of utilizing resources. In the past, we needed developers who knew both Java and JavaScript to develop Java web applications, and developers who knew both .NET and JavaScript to develop .NET applications. However, web applications nowadays use a lot more JavaScript than in the past, which makes developers who are good at both languages harder to find. With a RESTful API, it is possible to recruit front-end developers with a strong grasp of JavaScript and CSS to build the web interface, while back-end developers focus on security, scalability and performance.

Because we adopted the RESTful API for its development benefits, it was important for us to showcase the ability to run the Angular application without a back-end service. Because AngularJS injects modules by declaration, we have a single point of integration to configure the HTTP service for the whole application. This is very favorable, especially for a Spring developer, because we have better control over the technology usage.

Faster development

JSP by nature is only for viewing; changing data happens through other mechanisms like form submission or Ajax requests. That is why most of us think the MVC pattern only includes one-way binding for displaying the view. That need not be the case, if you remember desktop applications. For the front end, MVC is not that popular. Some developers still like to manually populate values of HTML controls after each Ajax request. We do too, but only sometimes. Other times, we prefer automation over control. That is why we love AngularJS.

Like any other MVC framework, AngularJS provides a model behind the scenes for each HTML control. But it is a wow factor to have it work the reverse way, with the HTML control updating the model. Think about it practically: sometimes our HTML controls are inter-related. For example, updating the country value in the first drop-down is supposed to display the corresponding states or provinces in the next drop-down. AngularJS allows us to do this with hardly any code.
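
A small sketch of such inter-related drop-downs (the data and names are invented for illustration): choosing a country re-renders the state drop-down purely through the bindings.

```html
<div ng-app=""
     ng-init="countries = [{name:'Singapore', states:['Central','East','West']},
                           {name:'Vietnam',   states:['Hanoi','Da Nang','Ho Chi Minh City']}]">
    <!-- the selected country object becomes the model of the first drop-down -->
    <select ng-model="selectedCountry"
            ng-options="country.name for country in countries"></select>
    <!-- the second drop-down is driven entirely by the first one's model -->
    <select ng-model="selectedState"
            ng-options="state for state in selectedCountry.states"
            ng-disabled="!selectedCountry"></select>
</div>
```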

Directives are cleaner than JSP or jQuery

I think the biggest benefit that AngularJS offers is the custom directive. We adopted AngularJS because of two-way binding, but custom directives are what made us commit.

Directives captured our interest at first sight. They are a clean solution to two long-standing problems we face: the continuity problem and the tracking problem.

Let's start with a short example of a for loop in a JSP file:
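
A typical JSTL loop might look like the sketch below (standard JSTL tags; the data is assumed):

```jsp
<%@ taglib prefix="c" uri="http://java.sun.com/jsp/jstl/core" %>
<c:forEach var="owner" items="${owners}">
    <div class="owner">
        <p><c:out value="${owner.firstName}"/> <c:out value="${owner.lastName}"/></p>
    </div>
</c:forEach>
```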



The code between the opening and closing of the for loop renders one element. Whenever we write this kind of code, we break the original HTML template into smaller chunks. Therefore, it is understandable that designers don't like to maintain JSP files. They are meant to be looked at by developers, not designers.

To avoid inserting the opening and closing of the for loop in different places, we can use a JSP tag. Then we have another issue: the HTML code goes missing from the JSP file. There is no clean and easy way for us to present this JSP content to designers. If a minor change needs to be made to the web interface, it normally comes as a UI patch request from the designers.

Things did not get better when developers moved to jQuery-based UIs. We always have concerns about DOM manipulation and event handler registration. In a typical jQuery web application, viewing the source gives us very limited information on what the user would actually see. Instead of being the source of truth, the HTML becomes raw material for JavaScript developers to perform magic on. That shuns designers away even more than JSP does.

Even for developers, things do not go well if you are not the author of the code. It may take guesswork to find the code that is executed when the user clicks a button, unless you know the convention. The CSS selector is too flexible: it allows one developer to define behavior in a non-deterministic way and leaves other developers searching around for it.

Directives help us get rid of both problems.

The directive data-ng-repeat="owner in owners" is inserted directly into the HTML tag. That makes it easier to keep track of where the loop actually ends, both for us and for the designers. The HTML inside the div remains similar to the original template, which makes minor modifications possible. For tracking purposes, seeing an ng-click directive on an HTML tag tells us immediately which method to look for. The ng-controller directive or the routing configuration gives us a clue as to which controller the method should be defined in. That makes reaching the source much easier.

Finally, the AngularJS team allows us to define our own directives. This is amazing because it lets us create new directives to solve unforeseen circumstances in the future, and tap into the existing directive libraries created by the community.

Let's get familiar with custom directives through this example. In our Pet Clinic app, clicking some links in the banner scrolls the page down to the respective part of the website.

To achieve this behavior in a jQuery app, the fastest solution is to make it a link and attach a click handler like this:
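
A sketch of the jQuery approach (the selector, target id and duration are assumptions):

```javascript
$('a.scroll-to-vets').on('click', function (e) {
    e.preventDefault();
    // animate the page scroll down to the target section
    $('html, body').animate({ scrollTop: $('#vets').offset().top }, 500);
});
```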



For an Angular app, it does not necessarily have to be a link, and the code looks slightly different:
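
A sketch using Angular's $location and $anchorScroll services (the controller and method names are assumptions); the template would contain something like <span ng-click="scrollTo('vets')">Veterinarians</span>.

```javascript
app.controller('BannerController', ['$scope', '$location', '$anchorScroll',
    function ($scope, $location, $anchorScroll) {
        $scope.scrollTo = function (sectionId) {
            $location.hash(sectionId);  // point the URL hash at the section
            $anchorScroll();            // scroll to the element with that id
        };
    }]);
```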



If we decided to make a directive for this function, the code would look like this:
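
A sketch of such a custom directive (the directive name scrollTo is an assumption); with it, the designer only writes something like <span scroll-to="vets">Veterinarians</span>.

```javascript
app.directive('scrollTo', ['$location', '$anchorScroll', function ($location, $anchorScroll) {
    return {
        restrict: 'A',
        link: function (scope, element, attrs) {
            element.on('click', function () {
                scope.$apply(function () {
                    $location.hash(attrs.scrollTo);
                    $anchorScroll();
                });
            });
        }
    };
}]);
```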



For us, the last option is the best. The directive is a familiar concept for any designer, as they already use class, id, name or onclick. Picking up a new, well-named directive is not that hard, even for a designer. For us, it is fun and cool. We no longer depend on the World Wide Web Consortium to release the features we want. If we need something, we can hunt for a directive, modify one, or create it ourselves. Slowly, we can build up a community-backed directive library that keeps improving.

Automation versus Control

After an AngularJS presentation, I think not everyone will be ready to jump on board. AngularJS delivers a lot of magic, with the big trade-off that you need to relinquish control of your application to AngularJS.

Instead of doing the work yourself, you provide information for AngularJS to do the work. Two-way binding and injection of services all happen behind the scenes. When we need to modify the way it works, it is not so convenient to do so.

For us, relinquishing control to build a configurable application is tolerable. However, not being able to alter the behavior when we need to is much less tolerable. This is where the Spring framework shines. It does not simply give you the bean; it lets you do whatever you want with it, like executing bean life-cycle functions, injecting the bean factory or defining scopes.

In contrast, I find it hard to step outside the Angular way when I want to. For example, I would like Angular to populate the content of a textarea and later run the CKEditor JavaScript to convert this textarea into an HTML editor. To get this done, the application needs to load the CKEditor JavaScript only after the binding is completed. It can be done by altering AngularJS behavior or by converting the textarea into a directive.

But I am not fully satisfied with either option: the former does not look clean and the latter seems tedious. It would look better if we had life-cycle support for injecting additional behavior, as we have with the Spring framework.

FAQ about AngularJS

Up to this point, I hope you already have some ideas and can decide for yourself whether AngularJS is the right choice. To contribute to your decision making, I will share our answers to the questions we received during the presentation.

Is it possible to protect confidential data when you build a web interface with AngularJS?

For me, this is not an AngularJS issue; it is about how you design your RESTful API. If there is something you do not want users to see, don't ever expose it in your RESTful API. For a Spring developer, this means avoiding the use of an entity as @ResponseBody if you are not ready to fully expose it. At the very least, put in an annotation like @JsonIgnore to hide the confidential fields.
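
A minimal sketch (the entity and field names are invented): the annotated field is left out of the JSON that Jackson renders for the REST response.

```java
import com.fasterxml.jackson.annotation.JsonIgnore;

public class User {

    private String username;

    @JsonIgnore
    private String password;   // never serialized into the REST response

    // getters and setters omitted for brevity
}
```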

How do you pass objects around between different pages?

There are many ways to do so. Each page has its own controller, but the whole application shares a single $rootScope where you can place the object. However, we should not use $rootScope unless there is no other way. Our preferred solution is to use the URL.

For example, our routing configuration specifies something like this:
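
A sketch of that configuration (paths, template and controller names are assumptions): the owner id travels in the URL, and the detail controller reloads the owner from it.

```javascript
app.config(['$routeProvider', function ($routeProvider) {
    $routeProvider.when('/owners/:ownerId', {
        templateUrl: 'partials/ownerDetail.html',
        controller: 'OwnerDetailController'
    });
}]);

app.controller('OwnerDetailController', ['$scope', '$routeParams', 'ownerService',
    function ($scope, $routeParams, ownerService) {
        // reload the owner using the id carried in the URL
        ownerService.findById($routeParams.ownerId).then(function (response) {
            $scope.owner = response.data;
        });
    }]);
```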




When the user navigates to the owner detail page, the controller loads the owner again using the id in the URL. It may look redundant, but it allows browser bookmarking.

Is it really possible for designers to commit to the project?

Yes, that is what we tried, and it worked. One day before the presentation, our designer Andrew redesigned the landing page without our involvement. He kept all the directives we had put in, and there was no impact on functionality after the design changed.

Should I adopt AngularJS 1 or wait until AngularJS 2 is available?

I think you should go ahead with AngularJS 1. For us, AngularJS 2 is so different that it can be treated as another technology stack, and support for AngularJS 1 will continue for at least a year after AngularJS 2 is released (slated for 2016). We feel that, thanks to community support, AngularJS 1 will continue to thrive, much like Play framework 1 after the release of Play framework 2.


Conclusion

So, we have done our part to introduce a new way of building web interfaces to Spring framework users. We hope you are convinced that a RESTful API is the way to go and that AngularJS is worth trying. If you have any other ideas, please give us your feedback.

If you are keen to see the project or the original talk, here are the references:

https://github.com/singularity-sg/spring-petclinic

http://petclinic.hopto.org/petclinic/slides/#/

http://petclinic.hopto.org/petclinic/#/

Enjoy!
Hi Chris,

Our problem is slightly different, as we need a special routing mechanism to send similar jobs to the same node. However, I think that customization should not be too difficult to do in Akka. I am checking how to do it.

Thanks for your feedback.
Distributed Crawling

Around 3 months ago, I posted an article explaining our approach and considerations for building a Cloud application. Starting from this article, I will gradually share our practical designs for solving this challenge.

As mentioned before, our final goal is to build a SaaS big data analysis application, which will be deployed on AWS servers. In order to fulfill this goal, we need to build distributed crawling, indexing and distributed training systems.

The focus of this article is how to build the distributed crawling system. The fancy name for this system will be Black Widow.

Requirements

As usual, let's start with the business requirements for the system. Our goal is to build a scalable crawling system that can be deployed on the cloud. The system should be able to function in an unreliable, high-latency network and recover automatically from partial hardware or network failures.

For the first release, the system can crawl from 3 kinds of sources: DataSift, the Twitter API and RSS feeds. The data crawled back is called a Comment. The RSS crawlers are supposed to read public sources like websites and blogs, which is free of charge. DataSift and Twitter both provide proprietary APIs to access their streaming services. DataSift charges its users by comment count and by the complexity of CSDL (Curated Stream Definition Language, their own query language). Twitter, on the other hand, offers the free Twitter Sampler stream.

In order to control cost, we need to implement a mechanism to limit the number of comments crawled from commercial sources like DataSift. As DataSift also provides Twitter comments, it is possible for a single comment to come from different sources. At the moment, we do not try to eliminate this and accept it as data duplication. However, the problem can be avoided manually by user configuration (avoid choosing both Twitter and DataSift's Twitter source together).

For future extension, the system should be able to link up related comments to form a conversation.

Food for Thought

Centralized Architecture

Our first thought when getting the requirements was to do the crawling on the nodes, which we call Spawns, and let the hub, which we call Black Widow, manage the collaboration among nodes. This idea was quickly accepted by the team members, as it allows the system to scale well with the hub doing only limited work.

As with any other centralized system, Black Widow suffers from the single point of failure problem. To ease this problem, we allow a node to function independently for a short period after losing its connection to Black Widow. This gives the support team breathing room to bring up a backup server.

Another bottleneck in the system is data storage. For the volume of data being crawled (easily reaching a few thousand records per second), NoSQL is clearly the choice for storing the crawled comments. We have experience working with Lucene and MongoDB. However, after research and some minor experiments, we chose Cassandra as the NoSQL database.

With those few thoughts, we visualized the distributed crawling system being built following this prototype:



In the diagram above, Black Widow, the hub, is the only server that has access to the SQL database system. This is where we store the configuration for crawling. Therefore, all the Spawns, or crawling nodes, are fully stateless. A Spawn simply wakes up, registers itself with Black Widow and does the assigned jobs. After getting the comments, the Spawn stores them in the Cassandra cluster and also pushes them to some queues for further processing.

Brainstorming of possible issues

To explain the design to non-technical people, we like to relate the business requirements to a similar problem in real life so that it is easier to understand. The similar problem we chose is coordinating effort among volunteers.

Imagine we need to do a lot of preparation work for the upcoming Olympics and decide to recruit volunteers from all around the world to help. We do not know the volunteers, but the volunteers know our email address, so they can contact us to register. Only then do we know their email addresses and can send tasks to them by email. We would not want to send one task to two volunteers or leave some tasks unattended. We want to distribute the tasks evenly so that no volunteer suffers too much.

Due to cost, we would not contact them by mobile phone. However, because email is less reliable, when sending out tasks to volunteers we request a confirmation. A task is considered assigned only when the volunteer has replied with a confirmation.

In the above example, the volunteers represent Spawn nodes, while email communication represents the unreliable, high-latency network. Here are some problems that we need to solve:

1/ Node failure

For this problem, the best approach is to check regularly. If a volunteer stops responding to the regular progress-check emails, the task should be re-assigned to someone else.

2/ Optimization of task assignment

Some tasks are related, so assigning related tasks to the same person can help reduce the total effort. This happens with our crawling as well, because some crawling configurations have similar search terms; grouping them together to share a streaming channel helps reduce the final bill.

Another concern is fairness, or the ability to distribute the workload evenly among volunteers. The simplest strategy we can think of is Round Robin, with a minor tweak of remembering earlier assignments. If a task is quite similar to a task assigned before, it can be skipped from the Round Robin selection and assigned directly to the same volunteer.

3/ The hub is not working

If, for some reason, our email server is down and we cannot contact the volunteers any more, it is better to let the volunteers stop working on their assigned tasks. The main concern here is cost overrun or wasted effort. However, stopping work immediately is too hasty, as a temporary infrastructure issue may be causing the communication problem.

Hence, we need to find a reasonable amount of time for a node to continue functioning after being detached from the hub.

4/ Cost control

Due to business requirements, there are two kinds of cost control that we need to implement: first, the total number of comments crawled per crawler, and second, the total number of comments crawled by all the crawlers belonging to the same user.

This is where we had a debate about the best approach to implementing cost control. It is very straightforward to implement the limit for each crawler. We can simply pass this limit to the Spawn node and it will automatically stop the crawler when the limit is reached.

However, the limit per user is not so straightforward, and we had two possible approaches. The simpler choice is to send all the crawlers of one user to the same node. Then, similar to the earlier problem, the Spawn node knows the number of comments collected and stops all the crawlers when the limit is reached. This approach is simple, but it limits the ability to distribute jobs evenly among nodes. The alternative approach is to let all the nodes retrieve and update a global counter. This approach creates huge internal network traffic and adds considerable delay to the comment processing time.

At this point, we have temporarily chosen the global counter approach. This can be reconsidered if performance becomes a big concern.

5/ Deploy on the cloud

As with any other Cloud application, we cannot put too much trust in the network or infrastructure. Here is how we make our application conform to the checklist mentioned in the last article:

  • Stateless: Our Spawn node is stateless, but the hub is not. Therefore, in our design, the nodes do the actual work and the hub only coordinates the effort.
  • Idempotence: We implement hashCode and equals methods for every crawler configuration and store the crawler configurations in a Map or Set. Therefore, a crawler configuration can be sent multiple times without any side effect. Moreover, our node selection approach ensures that the job is always sent to the same node.
  • Data Access Object: We apply the JsonIgnore filter on every model object to make sure no confidential data flies around the network.
  • Play Safe: We implement a health-check API for each node and for the hub itself. The first level of support gets notified immediately when anything goes wrong.

6/ Recovery

We try our best to make the system heal itself from partial failures. These are the types of failure that we can recover from:

  • Hub failure: A node registers itself with the hub when it starts up. From then on, communication is one-way: the hub sends jobs to the node and polls it for status updates. The node considers itself detached if it fails to get any contact from the hub for a pre-defined period. If a node is detached, it clears all of its job configurations and starts registering itself with the hub again. If the incident was caused by a hub failure, a new hub will fetch the crawling configurations from the database and start distributing jobs again. All the existing jobs on the Spawn nodes are cleared when the Spawn nodes go into detached mode.
  • Node failure: When the hub fails to poll a node, it does a hard reset, removing all working jobs and re-distributing them from the beginning to the working nodes. This re-distribution process helps ensure an optimized distribution.
  • Job failure: There are two kinds of failure that can happen when the hub sends and polls jobs. If a job fails in the polling process but the Spawn node is still working well, Black Widow can re-assign the job to the same node again. The same can be done if sending the job fails.

Implementation

Data Source and Subscriber

Our initial thought was that each crawler could open its own channel to retrieve data, but this no longer makes sense on closer inspection. For RSS, we can scan all the URLs once and find the keywords that may belong to multiple crawlers. For Twitter, up to 200 search terms are supported in one single query, so it is possible for us to open a single channel that serves multiple crawlers. For DataSift, it is quite rare, but due to human mistakes or luck, it is possible to have crawlers with identical search terms.

This situation led us to split the crawler into two entities: subscriber and data source. The subscriber is in charge of consuming the comments, while the data source is in charge of crawling the comments. With this design, if there are two crawlers with similar keywords, a single data source is created to serve two subscribers, each processing the comments in its own way.

A data source is created when and only when no similar data source exists. It starts working when the first subscriber subscribes to it and retires when the last subscriber unsubscribes from it. With the help of Black Widow sending similar subscribers to the same node, we can minimize the number of data sources created and, indirectly, minimize the crawling cost.
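
The sketch below illustrates the data source / subscriber split in plain Java; the class and method names are assumptions, and the real implementation is certainly richer.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Assumed minimal types for the sketch
interface Subscriber { void consume(Comment comment); }
class Comment { String text; }

public class DataSource {

    private final String query;
    private final Set<Subscriber> subscribers = ConcurrentHashMap.newKeySet();

    public DataSource(String query) {
        this.query = query;
    }

    public void subscribe(Subscriber subscriber) {
        subscribers.add(subscriber);
    }

    public void unsubscribe(Subscriber subscriber) {
        subscribers.remove(subscriber);
        if (subscribers.isEmpty()) {
            stopStreaming();   // retire when the last subscriber leaves
        }
    }

    // every crawled comment is fanned out to all current subscribers
    void onComment(Comment comment) {
        subscribers.forEach(s -> s.consume(comment));
    }

    private void stopStreaming() { /* close the streaming channel */ }
}
```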

Data Structure

The biggest concern with the data structures is thread safety. In the Spawn node, we must store all running subscribers and data sources in memory. There are a few scenarios in which we need to modify or access this data:

  • When a subscriber hits its limit, it automatically unsubscribes from the data source, which may lead to deactivation of the data source.
  • When Black Widow sends a new subscriber to a Spawn node.
  • When Black Widow sends a request to unsubscribe an existing subscriber.
  • The health-check API exposes all running subscribers and data sources.
  • Black Widow regularly polls the status of each assigned subscriber.
  • The Spawn node regularly checks for and disables orphan subscribers (subscribers which are no longer polled by Black Widow).


Another concern with the data structures is the idempotence of operations. Any of the operations above can go missing or be duplicated. To handle this problem, here is our approach:

  • Implement hashCode and equals methods for every subscriber and data source.
  • Choose a Set or Map to store the collections of subscribers and data sources. For records with an identical hash code, a Map will replace the record on a new insertion, but a Set will skip the new record. Therefore, if we use a Set, we need to ensure that new records can replace old records.
  • Use synchronized in the data access code.
  • If a Spawn node receives a new subscriber that is similar to an existing subscriber, it compares them and prefers updating the existing subscriber rather than replacing it. This avoids unsubscribing and re-subscribing identical subscribers, which could interrupt the data source streaming.



Routing

As mentioned before, we need a routing mechanism that serves two purposes:

  • Distribute the jobs evenly among Spawn nodes.
  • Route similar jobs to the same nodes.


We solved this problem by generating a unique representation of each query, called a uuid. After that, we can use a simple modular function to find the node to route to:
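
A sketch of that routing rule (the helper names are assumptions): hash the subscriber's uuid and take it modulo the number of active Spawn nodes.

```java
import java.util.List;

public class Router {

    // Returns the node that should serve the subscriber identified by this uuid.
    public static <T> T selectNode(String uuid, List<T> activeNodes) {
        // mask the sign bit so the index is never negative; identical uuids always
        // map to the same node as long as the node list does not change
        int index = (uuid.hashCode() & 0x7fffffff) % activeNodes.size();
        return activeNodes.get(index);
    }
}
```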




With this implementation, subscribers with the same uuid will always be sent to the same node, and each node has an equal chance of being selected to serve a subscriber.

This whole scheme can be thrown off when there is a change in the collection of active Spawn nodes. Therefore, Black Widow must clear up all running jobs and reassign them from the beginning whenever there is a node change. However, node changes should be quite rare in a production environment.

Handshake

Below is the sequence diagram of the collaboration between Black Widow and a node.



Black Widow does not know about a Spawn node in advance; it waits for the Spawn node to register itself with Black Widow. From there, Black Widow has the responsibility of polling the node to maintain connectivity. If Black Widow fails to poll a node, it removes the node from its container. The orphan node will eventually go into detached mode because it is no longer being polled. In this mode, the Spawn node clears its existing jobs and tries to register itself again.

The next diagram is the subscriber life-cycle.


http://1.bp.blogspot.com/-CwnJ32eUP_8/U_7DHCZsiPI/AAAAAAAAB8A/FMe_-yltWy4/s1600/job_sequence_diagram.png

Similar to the above, Black Widow has the responsibility of polling the subscribers it sends to a Spawn node. If a subscriber is no longer polled by Black Widow, the Spawn node treats it as an orphan and removes it. This practice helps eliminate the threat of a Spawn node running an obsolete subscriber.

On the Black Widow side, when polling a subscriber fails, it tries to get a new node to assign the job to. If the subscriber's Spawn node is still available, it is likely that the same job will go to the same node again, due to the routing mechanism we use.

Monitoring

In a happy scenario, all the subscribers are running, Black Widow is polling and nothing else happens. However, this is not likely in real life. There will be changes in Black Widow and the Spawn nodes from time to time, triggered by various events.

For Black Widow, there will be changes under the following circumstances:

  • A subscriber hits its limit
  • A new subscriber is found
  • An existing subscriber is disabled by the user
  • Polling of a subscriber fails
  • Polling of a Spawn node fails


To handle changes, the Black Widow monitoring tool offers two services: hard reload and soft reload. A hard reload happens on a node change, while a soft reload happens on a subscriber change. The hard reload process takes back all running jobs and redistributes them from the beginning over the available nodes. The soft reload process removes obsolete jobs, assigns new jobs and re-assigns failed jobs.



Compared to Black Widow, the monitoring of a Spawn node is simpler. The two main concerns are maintaining connectivity to Black Widow and removing orphan subscribers.



Deployment Strategy

The deployment strategy is straightforward. We need to bring up Black Widow and at least one Spawn node. The Spawn node should know the URL of Black Widow. From then on, the health-check API gives us the number of subscribers per node. We can integrate the health check with the AWS API to automatically bring up a new Spawn node if the existing nodes are overloaded. The Spawn node image needs to have the Spawn application running as a service. Similarly, when the nodes are under-utilized, we can bring down redundant Spawn nodes.

Black Widow needs special treatment due to its importance. If Black Widow fails, we can restart the application. This will cause all existing jobs on the Spawn nodes to become orphans, and all the Spawn nodes will go into detached mode. Slowly, all the nodes will clean themselves up and try to register again. Under the default configuration, the whole restart process happens within 15 minutes.

Threats and possible improvements

When choosing a centralized architecture, we knew that Black Widow would be the biggest risk to the system. While a Spawn node failure only causes a minor interruption for the affected subscribers, a Black Widow failure eventually leads to a Spawn node restart, which takes much longer to recover from.

Moreover, even though the system can recover from partial failure, there is still an interruption of service during the recovery process. Therefore, if the polling requests fail too often due to unstable infrastructure, operations will be greatly hampered.

Scalability is another concern for a centralized architecture. We do not yet have a concrete maximum number of Spawn nodes that Black Widow can handle. Theoretically, it should be very high, because Black Widow only does minor processing; most of its effort goes into sending out HTTP requests. It is possible that the network is the main limiting factor for this architecture. Because of this, we let Black Widow poll the nodes rather than having the nodes poll Black Widow (other systems, like Hadoop, do the latter). With this approach, Black Widow can work at its own pace, not under pressure from the Spawn nodes.

One of the first questions we got was whether this is a MapReduce problem, and the answer is no. Each subscriber in our distributed crawling system processes its own comments and does not report results back to Black Widow. That is why we do not use any MapReduce product like Hadoop. Our monitoring is business-logic aware rather than pure infrastructure monitoring, which is why we chose to build it ourselves instead of using tools like ZooKeeper or Akka.

For future improvement, it would be better to move away from the centralized architecture by having multiple hubs collaborating with each other. This should not be too difficult, given that the only time Black Widow accesses the database is when loading subscribers. Therefore, we can slice the data and let each Black Widow instance load a portion of it.

Another point that leaves me unsatisfied is the checking of the global counter for the user limit. As the check happens on every comment crawled, it greatly increases internal network traffic and limits the scalability of the system. A better strategy would be to divide the quota based on processing speed: Black Widow could regulate and redistribute the quota for each subscriber (on different nodes).

How to increase productivity

Unlocking productivity is one of the bigger concerns for anyone in a management role. However, people rarely agree on the best approaches to improving performance. Over the years, I have observed different managers using opposite practices to get the best performance out of the teams they manage. Unfortunately, some work and others don't. To be more precise, what does not increase performance often actually reduces it.

In this article, I would like to review what I have seen and learnt over the years and share my personal view on the best approaches to unlocking productivity.



What factors define team performance?

Let's start by analysing what a team is composed of. Obviously, a team is composed of team members, each with their own expertise, strengths and weaknesses. However, the total productivity of the team is not necessarily the sum of individual productivity. Other factors like teamwork, process and environment also have a major impact on total performance, and that impact can be either positive or negative.

To sum up, the 3 major factors discussed in this article are technical skills, working process and culture.

Technical Skills

In a factory, we can count the total productivity as the sum of the individual productivity of each worker, but this simplicity does not apply to the IT field. The difference lies in the nature of the work. Programming, even today, is still innovative work which cannot be automated. In the IT industry, nothing is more valuable than innovation and vision. That explains why Japan may be well known for producing high-quality cars, while the US is much more famous for producing well-known IT companies.

In contrast to a factory environment, developers in a software team do not necessarily do, or excel at, the same things. Even if they graduated from the same school and took the same job, personal preference and self-study quickly make developers' skills diverge again. For the sake of increasing total productivity, this may be a good thing. There is no benefit in all members being competent at the same kind of tasks. As it is too difficult to be good at everything, life is much easier if the members of the team can compensate for each other's weaknesses.

It is not easy to improve the technical skills of a team, as it takes many years for a developer to build up a skill set. The fastest way to pump up the team's skill set is to recruit new talent that offers what the team lacks. That is why a popular practice in the industry is to let the team recruit new members themselves. Because of this, a team that is slowly built up over the years normally offers a more balanced skill set.

While recruitment is a quick, short-term solution, the long-term solution is to keep the team up to date with the latest technology trends. In this field, if you do not go forward, you go backward. There is no skill set that stays useful forever. One of my colleagues even emphasizes that upgrading developers' skills is beneficial to the company in the long run. Even if we do not count inflation, it is quite common for a company to offer a pay rise after each annual review to retain staff. If the staff do not acquire new skills, the company is effectively paying a higher price every year for a depreciating asset. It may be a good idea for the company to use monetary rewards, tied to KPIs, to motivate self-study and upgrading.



There are a lot of training courses in the industry, but they are not necessarily the best method for upgrading skills. Personally, I feel most coursework offers more branding value than real-life usage. If a developer is keen to learn, there is more than enough knowledge on the internet to pick up anything. Therefore, except for commercial APIs or products, spending money on rewards is more worthwhile than spending it on training courses.

Another well-known challenge for self-study is natural human laziness. There is nothing surprising about it. However, the best way to fight laziness is to find fun in learning new things. This can only be achieved if a developer treats programming as a hobby rather than just a profession. Even if not, it is quite reasonable that one should re-invest effort in one's bread-and-butter tools. One of my friends even argues that if singers and musicians take responsibility for their own training, programmers should do the same.

Sometimes we may feel lost due to the huge number of technologies thrown at us every year. I feel that too. My approach to self-study is to add a delay in absorbing concepts and ideas. I try to understand them, but do not invest too much until the new concepts and ideas are reasonably accepted by the market.

Working Process

The working process can contribute greatly to team performance, positively or negatively. A great developer writes great code, but he will not be able to do so if he wastes too much effort on something non-essential. Obviously, when the process is wrong, developers may feel uncomfortable in their daily work, and an unhappy developer may not perform at his best.

There is no clear guideline for judging whether a working process is well defined, but people in the environment will feel it right away if something is wrong. However, it is not as easy to get it right, because the people who have the power to make decisions are not necessarily the ones who suffer from a bad process. We need an environment with effective feedback channels to improve the working process.

The common pitfall for a working process is the lack of a results-oriented nature. The process is less effective if it is too reporting-oriented, attitude-oriented or based on unrealistic assumptions. To define the process, it helps if the executive can decide whether he wants to build an innovative company or an operations-oriented company. Examples of the former are Google, Facebook and Twitter, while the latter may be GM, Ford and Toyota. It is not that an operations-oriented company cannot innovate, but its process was not built with innovation as the first priority. Therefore, the metrics for measuring performance may be slightly different, which causes different results in the long term. Not all companies in the IT field are innovative companies; one counter-example is the outsourcing companies and software houses in Asia. To encourage innovation, the working process needs to focus on people, minimize hassle and maximize collaboration and sharing.

Through my years in the industry with Waterfall, not-so-Agile and Agile companies, I feel that Agile works quite well for the IT field. It was built on the right assumption that software development is innovative work and less predictable compared to other kinds of engineering.

Company Culture

When Steve Jobs passed away in 2011, I bought his authorized biography by Walter Isaacson. The book clearly explains how Sony failed to keep its competitive edge because of internal competition among its departments. Microsoft suffered a similar problem due to the controversial stack ranking system that enforced internal competition. I think the IT field is getting more complicated, and we need more collaboration than in the past to implement new ideas.

It is tough to maintain collaboration when your company grows into a multi-cultural MNC. However, it still can be done if management has the right mindset and continuously communicates its vision to the team. As above, management needs to be clear about whether it wants to build an innovative company, as that requires a distinct culture, one that is more open and highly motivated.

In Silicon Valley, office life ends quite late, as most developers are geeks and love nothing more than coding. However, this is not necessarily a good practice, as all of us have families to take care of. It is up to each individual to define his or her own work-life balance, but the requirement is that employees come to the office fully charged and excited. They must feel that their work is appreciated and that they have support when they need it.

Conclusions

To make it short, here are the kinds of things that management can do to increase the productivity of the team:

  • Let the team take part in recruitment. Recruit people who take programming as a hobby.
  • Offer monetary rewards or other kinds of encouragement for self-study and self-upgrading.
  • Save the money for company-sponsored courses unless they are for commercial products.
  • Make sure the working process is results-oriented.
  • Apply Agile practices.
  • Encourage collaboration and eliminate internal competition.
  • Encourage sharing.
  • Encourage feedback.
  • Maintain employees' work-life balance and motivation.
  • Make sure employees can find support when they need it.

(Original article at http://sgdev-blog.blogspot.sg/2014/02/thread-safe.html)

Thread safety is definitely one of the things I want to write about. It is simply too important to ignore in a developer's day-to-day life. Not only is it a constant concern, it is also one of the most common sources of errors that we need to deal with.

WHAT IS THREAD SAFETY
Back in the earlier days, C developers rarely needed to worry about threads. The language had no built-in support for multithreading and the textbooks never mentioned it either. Things changed when Java came to life. The language natively supports multi-threading: the same portion of Java code can be executed concurrently by more than one thread. Unfortunately, these threads can simultaneously read and write the object state and interfere with each other. By definition, a piece of code is thread-safe if it functions correctly during simultaneous execution by multiple threads.

To illustrate the thread safety issue, let's take a look at this example:
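
The original listing was not preserved, so the class below is a reconstruction that matches the behavior described further down: the shared instance field "value" is what makes invert() unsafe.

```java
public class Inverter {

    private int value;

    public int invert(int origin) {
        value = origin;      // step 1: store the input in shared state
        return value * -1;   // step 2: read it back and negate it
    }
}
```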



Please do not laugh at the silly implementation; this sample was created just to illustrate how multi-threading can spoil the functionality of your class. Assume that we have 2 threads that make use of the same inverter:
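
A sketch of that situation (the thread bodies are assumptions):

```java
Inverter inverter = new Inverter();

Thread thread1 = new Thread(() -> System.out.println(inverter.invert(10)));  // expects -10
Thread thread2 = new Thread(() -> System.out.println(inverter.invert(20)));  // expects -20

thread1.start();
thread2.start();
```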


If you are extremely unlucky, thread 1 can execute up to the point of assigning origin to the field value but not yet return, and thread 2 then executes the same assignment. In this case, inverter.invert(10) will return -20 instead of -10.

This issue is not rare. Actually, you will encounter it very often, as a lot of classes in Java are not thread-safe (for example SimpleDateFormat, StringBuilder, ...).

HOW TO PREVENT THREAD SAFETY ISSUES
Thread safety issues happen easily but can also be prevented easily. If we look deeply into how they happen, a thread safety issue can only occur when two conditions are met:

1/ Multiple threads access the same variable.
2/ The code requires multiple atomic steps to complete, and it only functions properly if there is no change to the variable in the middle of execution.

Hence, to prevent thread safety issues, we should make sure these two conditions cannot happen together. However, thread safety prevention comes at the price of reduced performance. That is why not all classes were created thread-safe in the first place. It is the developer's responsibility to prevent thread safety issues from happening.

A. No instance variables
Yes, we never need to worry about thread safety if there is nothing shared among threads. In Java, there is stack memory, heap memory and permgen memory. PermGen is not our concern here because it is used to store class definitions rather than variables. Generally, all Java objects and their instance variables are stored in heap space, while the return value, reference variables and local variables inside a method are stored in stack memory. The stack is dedicated memory for each thread and is therefore protected from the first condition above.

Let's say we fix the above class this way:
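
A reconstruction of the fix (the original listing is not preserved): no instance field is involved, so every thread works only with its own stack-allocated local variable.

```java
public class Inverter {

    public int invert(int origin) {
        int value = origin;   // local variable, stored on the calling thread's stack
        return value * -1;
    }
}
```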



This class is just as silly as the earlier one, but it is thread-safe. The local variable "value" and the return value of this method are stored in the current thread's stack memory and are not accessible to any other thread. This approach is highly recommended if you can achieve it. Sometimes you do not have this luxury, as instance variables are necessary for the business logic.

On a side note, it is also worth highlighting that the JVM creates a return variable for each non-void method. You may never know about its existence until you encounter a finally block that can overwrite your return value.

B. No sharing of objects

Assume that you keep the same Inverter class as the original, but you never share an inverter object. The thread safety issue cannot happen, as each thread accesses its own Inverter object:
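
A sketch of this approach (thread bodies assumed): each thread constructs its own Inverter, so the instance field is never touched by more than one thread.

```java
Thread thread1 = new Thread(() -> {
    Inverter inverter = new Inverter();
    System.out.println(inverter.invert(10));   // always -10
});
Thread thread2 = new Thread(() -> {
    Inverter inverter = new Inverter();
    System.out.println(inverter.invert(20));   // always -20
});

thread1.start();
thread2.start();
```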



This method is pretty simple, but it creates a burden for garbage collection, as a lot of temporary objects need to be created. Moreover, some constructors take a long time to execute.

C. Synchronize the method or code
This solution aims to prevent the second condition of the thread safety issue. It simply places a lock on the execution of a method or block of code, so that only one thread can execute the code at a time.
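
A reconstruction using a synchronized method: at most one thread at a time can be inside invert(), so the two steps are never interleaved.

```java
public class Inverter {

    private int value;

    public synchronized int invert(int origin) {
        value = origin;
        return value * -1;
    }
}
```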



This method effectively disables Java's multi-threading for that piece of code and generally reduces performance. However, if the portion of code that needs to be synchronized is short enough and not too many threads are running, you can use this method.
There is one more variation of this approach, where we create a thread-safe wrapper for the object.
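
A sketch of such a wrapper (similar in spirit to Collections.synchronizedList()): the original Inverter is left untouched and the wrapper adds the locking.

```java
public class SynchronizedInverter {

    private final Inverter delegate;

    public SynchronizedInverter(Inverter delegate) {
        this.delegate = delegate;
    }

    public synchronized int invert(int origin) {
        return delegate.invert(origin);
    }
}
```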




Use this solution when you do not have access to the original method or when you want to provide both a thread-safe and a non-thread-safe version to users. The Java collections framework uses this approach.

D. Object Pool
An object pool is a combination of solutions B and C. You use an object pool when you want to avoid the pain of constantly creating new objects:
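
A minimal pool sketch (not production code): borrowing and returning are synchronized, and each borrowed Inverter is used by only one thread at a time.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class InverterPool {

    private final Deque<Inverter> available = new ArrayDeque<>();

    public synchronized Inverter borrow() {
        // reuse an idle instance if we have one, otherwise create a new one
        return available.isEmpty() ? new Inverter() : available.pop();
    }

    public synchronized void release(Inverter inverter) {
        available.push(inverter);
    }
}
```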



With this implementation, you still cannot avoid using synchronized methods, but you limit them to short and simple ones. You also cannot avoid creating some Inverters, but you can reuse them and avoid creating too many objects. So, this method is recommended when object creation takes a lot of resources or when the code vulnerable to the thread safety issue is long.

CSS Score

(Original article at http://sgdev-blog.blogspot.sg/2014/01/css-score.html)

We all know that when many conflicting CSS properties can be applied to one web element, the specification says that the more specific properties will be applied. However, "specific" is an abstract word. Hence, it is better that we know about the CSS score, or how the browser chooses which properties to override.

Browsers categorize CSS into 4 categories, with specificity from high to low:

1/ Style attribute: <li style="color:white"/>

2/ ID: #some_id{ color: red;}

3/ Class, pseudo-class, attribute: .some_class {color:green;}

4/ Elements: li {color:black;}

From the W3C recommendation, the result of this calculation takes the form of four comma-separated values, a,b,c,d, where the value in column "a" is the most important and the one in column "d" is the least important. A selector's specificity is calculated as follows:

To calculate a, count 1 if the declaration comes from a style attribute rather than a rule with a selector (an inline style), 0 otherwise.
To calculate b, count the number of ID attributes in the selector.
To calculate c, count the number of other attributes and pseudo-classes in the selector.
To calculate d, count the number of element names and pseudo-elements in the selector.

Here is one example using this rule:

body#home div#warning p.message --> 0, 2, 1, 3

Please notice the commas in the CSS score; they are there to remind us that the scores b, c and d can be equal to or greater than 10. Still, the comparison is made from left to right.

Maximum concurrent connections to the same domain for browsers

(The original article: http://sgdev-blog.blogspot.sg/2014/01/maximum-concurrent-connection-to-same.html )

Would you be surprised if I told you that there is a limit on how many parallel connections a browser can make to the same domain?

The limit

Don't be too surprised if you have never heard about it, as I have seen many web developers miss this crucial point. If you want a quick figure, there is a table in the book Professional Website Performance: Optimizing the Front End and the Back End by Peter Smith.



Why do browsers have this limit?

You may ask: if this limit can have such a great impact on performance, why don't browsers give us a higher limit so that users can enjoy a better browsing experience? Browsers choose not to do so to protect the server from being overloaded by a small number of browsers. In the past, the common limit was only 2 connections. This was sufficient in the early days of the web, as most of the content was delivered in a single page load. However, it soon became a bottleneck when CSS and JavaScript grew popular. Because of this, you can notice the trend of increasing this limit in modern browsers. Some browsers even allow you to modify this value (Opera), but it is better not to set it too high unless you want to load-test the server; otherwise the server may classify your IP as a DDoS attacker.

The impact of this limit

How does this limit affect your web page? The answer is: a lot. Unless you let the user load a static page without any images, CSS or JavaScript at all, all these resources need to queue and compete for the available connections in order to be downloaded. If you take into account that some of the resources depend on other resources being loaded first, then it is easy to see that this limit can greatly affect page load time.

How to handle this limit?

This limit will not cause slowness in your website if you manage your resources well and avoid hitting it. When your page is first loaded, there is a first request which contains the HTML content. When the browser processes the HTML content, it spawns more requests to load resources like CSS, images and JS. It also executes JavaScript and sends Ajax requests to the server as you instruct it to.

Fortunately, static resources can be cached and are only downloaded the first time. If they cause slowness, it happens only on the first page load and is still tolerable. It is not rare for the user to see the page frame load first and some pictures slowly appear later. If you feel that your resources are too fragmented and consume too many requests, there are tools available that combine and minify them so the browser can load them in fewer requests (UglifyJS, Rhino, YUI Compressor, ...).

A lack of control over Ajax requests causes a more severe problem. I would like to share some examples of poor design that cause slowness in page loading.

1. Loading page content with many Ajax requests

This approach is quite popular because it lets the user feel the progress of the page loading and enjoy some important parts of the content while waiting for the rest to load. There is nothing wrong with this, but things get worse when you need more requests to load content than the browser can supply. Let's say you create 12 Ajax requests but your browser limit is 6; in the best-case scenario, you still need to load the resources in two batches. That is still not too bad if these 12 requests are not nested or executed consecutively, because the browser can then make use of all available connections to serve the pending requests. A worse situation happens when one request is initiated in another request's callback (nested Ajax requests). If this happens, your webpage is slowed down by your design rather than by the browser limit.

A few years ago, I took over a project which was haunted by performance issues. There were many factors causing the slowness, but one concern was too many Ajax requests. I opened the browser in debug mode and found more than 6 requests being sent to the servers to load different parts of the page. Moreover, it was getting worse, as the project was delivered by teams from different continents and time zones. Features were developed in parallel, and the pair working on a feature would conveniently add a server endpoint and an Ajax request to get the work done. Worried that the situation was going out of control, we decided to shift the direction of development. The original design was like this:




    For most of Ajax requests, the response return JSON model of data. Then, the Knock-out framework will do the binding of html controls with models. We do not face the nested requests issue here but the loading time cannot be faster because of browser limit and many http threads is consumed to serve a single page load. One more problem is the lack of caching. The page contents are pretty static with minimal customization on some parts of webpages.

    After consideration, we decided to cut down the number of requests by generating the page content on the server in a single request. However, if you do not do it properly, it may end up like this:
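    Here is a minimal sketch of that anti-pattern in Java, with hypothetical fragment services (loadHeader, loadCatalog, loadFooter are made up for illustration): every piece of content is fetched one after another on the request thread, so the total latency is the sum of all the individual calls.

        // Hedged sketch: sequential assembly of the page on a single thread.
        public class SequentialPageAssembler {

            private String loadHeader()  { return "<header>...</header>"; }      // hypothetical service call
            private String loadCatalog() { return "<section>catalog</section>"; }
            private String loadFooter()  { return "<footer>...</footer>"; }

            public String assemblePage() {
                String header  = loadHeader();   // nothing else runs while this waits
                String catalog = loadCatalog();  // ...then this one
                String footer  = loadFooter();   // ...then this one
                return header + catalog + footer;
            }
        }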




    This is even worse than the original design. It is more or less equivalent to having a limit of one connection to the server, with all the requests handled one by one.

    The proper way to achieve comparable performance is to use asynchronous programming:




    Each promise can be executed in a separate thread (not an HTTP thread), and the response is returned once all the promises have completed. We also applied caching to all of the services so that they return quickly. With the new design, the page responds faster and server capacity is improved as well.
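    Below is a minimal sketch of this idea in Java using CompletableFuture; the fragment services and thread-pool size are illustrative, and the real project may have used a different async API.

        // Hedged sketch: fan out to the fragment services on a worker pool,
        // then join the results into a single response.
        import java.util.concurrent.CompletableFuture;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;

        public class AsyncPageAssembler {

            private final ExecutorService pool = Executors.newFixedThreadPool(8);

            private String loadHeader()  { return "<header>...</header>"; }      // hypothetical, ideally cached
            private String loadCatalog() { return "<section>catalog</section>"; }
            private String loadFooter()  { return "<footer>...</footer>"; }

            public String assemblePage() {
                CompletableFuture<String> header  = CompletableFuture.supplyAsync(this::loadHeader, pool);
                CompletableFuture<String> catalog = CompletableFuture.supplyAsync(this::loadCatalog, pool);
                CompletableFuture<String> footer  = CompletableFuture.supplyAsync(this::loadFooter, pool);

                // Wait only until every fragment is ready, then build the page once.
                return CompletableFuture.allOf(header, catalog, footer)
                        .thenApply(done -> header.join() + catalog.join() + footer.join())
                        .join();
            }
        }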

    2. Failing to manage the request queue

    When you make an Ajax request in JavaScript and the browser does not have any available connection to serve it, the request goes into the request queue. Disaster happens when you fail to manage this queue, which typically happens when developers build rich client applications. A rich client application behaves more like an application than a web page: clicking a button does not load a new web address; instead, the page content is updated with the result of an Ajax request. The common mistake is to let new requests be created while the existing requests in the queue have not yet been cleared.

    I have worked on a web application that made more than 10 Ajax requests whenever the user changed the value of a first-level combo box. Imagine what happens if the user changes that value 10 times in a row without any break in between: 100 Ajax requests go into the request queue and the page appears to hang for a few minutes. This is an intermittent issue because it only happens when the user manages to create Ajax requests faster than the browser can handle them.

    The solution is simple; you have two options. The first is to forget about rich-client behaviour and use JavaScript to refresh the page, with the combo box value carried in the hash of the web address; the browser clears the queue on a page refresh. The second is to block the user from changing the combo box until the queue has been cleared.

    3. Nesting of Ajax requests

    I have never seen a business requirement that calls for nested Ajax requests. Most of the time I have seen them, they were a design mistake. For example, say you are a lazy developer and you need to load the country flags for every country in the world, sorted by continent. Disaster happens when you decide to write the code this way:

    Load the continent list
    For each continent, load its countries
    Assuming the world has 5 continents, you spawn 1 + 5 = 6 requests. This is unnecessary, as you can return a map of lists in a single response. Making requests is expensive and making nested requests is very expensive; using the Facade pattern to get everything you want in a single call is the way to go, as sketched below.
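    A hedged sketch of that facade in Java (the class, methods, and data are made up for illustration): a single call returns the continents together with their countries, so the client needs one request instead of 1 + N.

        // Hedged sketch: one response carrying a map of continent -> countries.
        import java.util.Arrays;
        import java.util.LinkedHashMap;
        import java.util.List;
        import java.util.Map;

        public class GeographyFacade {

            private List<String> findContinents() {
                return Arrays.asList("Africa", "America", "Asia", "Europe", "Oceania");
            }

            private List<String> findCountries(String continent) {
                return Arrays.asList("...");   // hypothetical lookup, e.g. a cached repository
            }

            // The single endpoint the client calls instead of 1 + N nested requests.
            public Map<String, List<String>> countriesByContinent() {
                Map<String, List<String>> result = new LinkedHashMap<>();
                for (String continent : findContinents()) {
                    result.put(continent, findCountries(continent));
                }
                return result;
            }
        }
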
    Class Loader, Class and Object

    (This article is quoted from my blog at http://sgdev-blog.blogspot.sg/2014/01/class-loader-class-and-object.html)

    Recently, there have been two occasions that made me feel it is worth talking about the class loading mechanism in Java. On the first occasion, a friend asked me how to create a new object of a nested class. Simply put, it is like this:
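    A minimal sketch of the situation (the class names are illustrative), showing three attempts to construct the nested objects:

        public class OuterClass {

            // Non-static (inner) class: every instance is bound to an OuterClass instance.
            public class InnerClass { }

            // Static nested class: independent of any OuterClass instance.
            public static class NestedClass { }

            public static void main(String[] args) {
                // 1. A static nested class can be created directly.
                NestedClass nestedObject = new OuterClass.NestedClass();

                // 2. An inner class needs an enclosing OuterClass object first.
                OuterClass outerObject = new OuterClass();
                InnerClass innerObject = outerObject.new InnerClass();

                // 3. This does NOT compile: there is no outer object to bind to.
                // InnerClass broken = new OuterClass.InnerClass();
            }
        }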





    It is very simple to see why the last attempt fails if you understand the class loader mechanism in Java well enough, as it is nicely explained at http://zeroturnaround.com/rebellabs/reloading-objects-classes-classloaders/

    When you create an object of an inner class, you first need an outer class object to bind it to.

    The hierarchy of innerObject should be:

    Class Loader --> OuterClass --> outerObject --> InnerClass --> innerObject.

    In contrast, the hierarchy of nestedObject should be:

    Class Loader --> OuterClass --> NestedClass --> nestedObject

    That explains the difference among the three construction attempts in the example above.



    If you have read up to this point, you may wonder when to use an inner class and when to use a static nested class. For me, a static nested class is essentially an independent class that is hidden inside another class declaration. It still functions fully as a normal class, but it stays out of IDE autocomplete outside its enclosing class and gives you the flexibility of reusing the same name in different enclosing classes. One example is a mapper class used with JDBC, when you find the mapper worth reusing within its enclosing class.
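    A hedged sketch of that pattern with Spring's JdbcTemplate (the DAO name, table, and column are made up for illustration): the mapper lives as a static nested class inside the DAO that reuses it.

        import java.sql.ResultSet;
        import java.sql.SQLException;
        import java.util.List;
        import org.springframework.jdbc.core.JdbcTemplate;
        import org.springframework.jdbc.core.RowMapper;

        public class OwnerDao {

            private final JdbcTemplate jdbcTemplate;

            public OwnerDao(JdbcTemplate jdbcTemplate) {
                this.jdbcTemplate = jdbcTemplate;
            }

            public List<String> findOwnerNames() {
                // The same mapper can be reused by every query in this DAO.
                return jdbcTemplate.query("select name from owners", new NameMapper());
            }

            // Static nested class: no hidden reference to an OwnerDao instance is needed.
            private static class NameMapper implements RowMapper<String> {
                @Override
                public String mapRow(ResultSet rs, int rowNum) throws SQLException {
                    return rs.getString("name");
                }
            }
        }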

    Most of the time, a static nested class is all you need, and it is recommended to use one, as the JVM only needs to keep track of one class declaration for it. If you are still unclear about this benefit, remember that the class itself is also an object in the JVM; that is why Java allows you to access the Class object through reflection. The only case where you should use a non-static inner class is when you need access to the outer class's non-static fields.

    It is also worth noting that a class object is not unique in the JVM; to be more accurate, it is unique per class loader. Depending on how you build your application, you may have more than one class loader in your JVM. For example, if you run the code above with java OuterClass from the terminal, you will have only one application class loader in your JVM. If you put the same class inside a Java web application, your JVM will have more than one class loader, because most Java web containers use a dedicated class loader for each web application for isolation (you do not want classes from another web application to interfere with yours, right?).
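    A small sketch that makes this visible (the class-path directory is an assumption): loading the same class through two separate class loaders yields two distinct Class objects.

        import java.io.File;
        import java.net.URL;
        import java.net.URLClassLoader;

        public class TwoLoadersDemo {
            public static void main(String[] args) throws Exception {
                // Assumes OuterClass.class sits in ./classes (illustrative path).
                URL[] path = { new File("classes").toURI().toURL() };

                ClassLoader loaderA = new URLClassLoader(path, null);   // null parent: no delegation to the app loader
                ClassLoader loaderB = new URLClassLoader(path, null);

                Class<?> a = loaderA.loadClass("OuterClass");
                Class<?> b = loaderB.loadClass("OuterClass");

                System.out.println(a == b);                           // false: two distinct class objects
                System.out.println(a.getName().equals(b.getName()));  // true: same class name
            }
        }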

    In this case, if you deploy two web applications that both bundle log4j to the same container, you actually have two sets of log4j classes in your JVM, although only one of them is visible to each webapp. It may still cause issues if you have two appenders attempting to write to the same log file. If you choose the other strategy and put the log4j classes in the web container's library instead, you will have a single appender per container and you force all the webapps to share the log4j configuration. It is highly recommended not to put any class whose static variables you need to access into the web container's library.

    Up to this point, I have not mentioned the second occasion. It happened when we deployed two applications with a log4j configuration like this to Tomcat:
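    A hedged reconstruction of the kind of configuration involved (the paths and key are illustrative): Spring's Log4jConfigListener reads a webAppRootKey context parameter from web.xml, and the log4j file refers to the webapp root through the corresponding ${...} placeholder.

        <!-- web.xml (sketch) -->
        <context-param>
            <param-name>webAppRootKey</param-name>
            <param-value>webapp.root</param-value>   <!-- same key in both webapps -->
        </context-param>
        <context-param>
            <param-name>log4jConfigLocation</param-name>
            <param-value>/WEB-INF/log4j.properties</param-value>
        </context-param>
        <listener>
            <listener-class>org.springframework.web.util.Log4jConfigListener</listener-class>
        </listener>

        # log4j.properties (sketch)
        log4j.appender.FILE.File=${webapp.root}/WEB-INF/logs/application.log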




    Kindly notice the webAppRootKey parameter above. This is quite tricky, and I have seen many developers suffer from this problem. The webapp root is a bonus feature that allows you to refer to the webapp's root directory inside the log4j config file. However, if you put the log4j config in a shared folder in Tomcat, you create a conflict, because log4j can no longer identify a unique webapp root. Depending on your deployment order, it is likely that the first webapp will deploy successfully and the second one will fail. In that case, you have no choice but to turn off this feature.