Jul 01 2015

Is X a good technical question for an interview?

Category: Uncategorized | FractalizeR @ 7:13 pm

I cannot answer this question better than Eric Lippert did. Below is a quotation from one of his blog posts.

A good technical question has characteristics such as:

  • can be stated and solved in a short amount of time with a short amount of code
  • does not require an “aha!” insight that is difficult to come up with under stress
  • is not a “stock” problem or “brain teaser”
  • can be made arbitrarily harder or easier, to better gauge what level the candidate is really at
  • is similar to the sorts of problems the candidate will have to solve on the job
  • tests skills other than straight-up coding, like the ability to read existing code or deal with an ambiguous specification


Jun 12 2015

Encapsulating complex query conditions using QueryCriteria in Laravel

Category: Laravel, PHP | FractalizeR @ 9:56 pm

Very often, my query conditions in repositories become complicated. To be able to reuse them across repositories, I use the following simple pattern:

interface CriteriaInterface
{
    /**
     * @param EloquentBuilder|QueryBuilder $query
     *
     * @return void
     */
    public function attach($query);
}

All my query conditions implement this interface and accept condition parameters in the constructor. For example, this is what CategoryCriteria looks like:

/**
 * Class CategoryCriteria
 *
 */
class CategoryCriteria implements CriteriaInterface
{
    /**
     * @var null|int|array
     */
    private $categoryId;
 
    /**
     * CategoryCriteria constructor.
     *
     * @param null|int|array $categoryId
     */
    public function __construct($categoryId)
    {
        $this->categoryId = $categoryId;
    }
 
    /**
     * @param null|int|array $categoryId
     *
     * @return static
     */
    public static function create($categoryId)
    {
        return new static($categoryId);
    }
 
    /**
     * @param EloquentBuilder|QueryBuilder $query
     */
    public function attach($query)
    {
        if (empty($this->categoryId)) {
            return;
        }
 
        if (is_array($this->categoryId)) {
            $query->whereIn('category_id', $this->categoryId);
        } else {
            $query->where('category_id', $this->categoryId);
        }
    }
}
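
To show how these criteria compose, here is a minimal usage sketch. The OrderRepository class and Order model below are assumptions for illustration, not part of the pattern itself:

class OrderRepository
{
    /**
     * @param CriteriaInterface[] $criteria
     *
     * @return \Illuminate\Database\Eloquent\Collection
     */
    public function findByCriteria(array $criteria)
    {
        // Start an Eloquent query and let each criterion attach its own conditions.
        $query = Order::query();

        foreach ($criteria as $criterion) {
            $criterion->attach($query);
        }

        return $query->get();
    }
}

// Usage: the same criteria classes can be reused by any repository.
$orders = (new OrderRepository)->findByCriteria([
    CategoryCriteria::create([1, 2, 3]),
]);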


Jun 05 2015

Laravel’s Eloquent::has() and Eloquent::whereHas() performance problems

Category: Laravel, PHP | FractalizeR @ 12:52 am

I had a task in my project where I needed to fetch all orders from the database for which a particular user had posted an offer. The first intention was to use the nice and shiny whereHas method on the Eloquent builder, like this:

Order::whereHas('offers', function ($q) {
    $q->where('user_id', 1112485);
});

But something stopped me and I went to investigate. It turned out that this code generates the following SQL statement:

SELECT *
FROM orders
WHERE (SELECT count(*)
       FROM offers
       WHERE offers.order_id = orders.id AND "user_id" = 1112485) >= 1;


The plan for this statement is:

Seq Scan on orders (cost=0.00..249278.84 rows=9525 width=912) (actual time=271.528..465.900 rows=2 loops=1)
 Filter: ((SubPlan 1) >= 1)
 Rows Removed by Filter: 28574
 SubPlan 1
 -> Aggregate (cost=8.44..8.45 rows=1 width=0) (actual time=0.009..0.011 rows=1 loops=28576)
 -> Index Only Scan using offers_order_id_user_id_idx on offers (cost=0.42..8.44 rows=1 width=0) (actual time=0.004..0.004 rows=0 loops=28576)
 Index Cond: ((order_id = orders.id) AND (user_id = 1112485))
 Heap Fetches: 2
Planning time: 0.133 ms
Execution time: 465.953 ms

Pay attention to the query cost and execution time. The same task can be achieved with the following query:

SELECT *
FROM orders
WHERE id IN (SELECT DISTINCT order_id
             FROM offers
             WHERE user_id = 1112485)


And this one has a much better plan:

Nested Loop (cost=0.71..6280.39 rows=5 width=912) (actual time=0.043..5.260 rows=2 loops=1)
 -> Unique (cost=0.42..6238.76 rows=5 width=4) (actual time=0.024..5.217 rows=2 loops=1)
 -> Index Only Scan using offers_order_id_user_id_idx on offers (cost=0.42..6238.62 rows=53 width=4) (actual time=0.019..5.201 rows=2 loops=1)
 Index Cond: (user_id = 1112485)
 Heap Fetches: 2
 -> Index Scan using orders_pkey on orders (cost=0.29..8.31 rows=1 width=912) (actual time=0.007..0.009 rows=1 loops=2)
 Index Cond: (id = offers.order_id)
Planning time: 0.191 ms
Execution time: 5.312 ms

Well, that's nearly a 90x difference, right?
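
If you want the faster shape without dropping to raw SQL, the same WHERE IN subquery can be expressed through Laravel's query builder. A minimal sketch, assuming the offers table from the example above:

// Generates: WHERE id IN (SELECT DISTINCT order_id FROM offers WHERE user_id = ?)
$orders = Order::whereIn('id', function ($query) {
    $query->select('order_id')
          ->distinct()
          ->from('offers')
          ->where('user_id', 1112485);
})->get();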

The same goes for a simple

Order::has('offers')


The generated SQL for it is

SELECT *
FROM "orders"
WHERE (SELECT count(*)
       FROM "offers"
       WHERE "offers"."order_id" = "orders"."id") >= 1


With the query plan

Seq Scan on orders (cost=0.00..856965.30 rows=9525 width=912) (actual time=11.044..2987.939 rows=25603 loops=1)
 Filter: ((SubPlan 1) >= 1)
 Rows Removed by Filter: 2973
 SubPlan 1
 -> Aggregate (cost=29.71..29.72 rows=1 width=0) (actual time=0.095..0.097 rows=1 loops=28576)
 -> Index Only Scan using offers_order_id_user_id_idx on offers (cost=0.42..29.68 rows=13 width=0) (actual time=0.019..0.072 rows=9 loops=28576)
 Index Cond: (order_id = orders.id)
 Heap Fetches: 247734
Planning time: 0.260 ms
Execution time: 3039.399 ms

If you rewrite the query like this

EXPLAIN ANALYZE
SELECT *
FROM "orders" O
WHERE id IN (SELECT DISTINCT order_id
             FROM offers);


You get the following plan:

Hash Join (cost=14061.99..21990.07 rows=20785 width=849) (actual time=940.164..1093.802 rows=28694 loops=1)
 Hash Cond: (o.id = offers.order_id)
 -> Seq Scan on orders o (cost=0.00..7439.19 rows=32119 width=849) (actual time=0.006..51.002 rows=32119 loops=1)
 -> Hash (cost=13802.18..13802.18 rows=20785 width=4) (actual time=939.942..939.942 rows=28694 loops=1)
 Buckets: 4096 Batches: 1 Memory Usage: 1009kB
 -> HashAggregate (cost=13594.33..13802.18 rows=20785 width=4) (actual time=852.582..898.002 rows=28694 loops=1)
 Group Key: offers.order_id
 -> Seq Scan on offers (cost=0.00..12887.66 rows=282666 width=4) (actual time=0.003..422.209 rows=282666 loops=1)
Planning time: 0.364 ms
Execution time: 1133.665 ms

A 3x improvement, which is slightly better :)
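
The same subquery trick covers the plain has() case too, in case you want to stay inside the query builder (again a sketch, not built-in Eloquent behavior):

// Generates: WHERE id IN (SELECT DISTINCT order_id FROM offers)
$orders = Order::whereIn('id', function ($query) {
    $query->select('order_id')->distinct()->from('offers');
})->get();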

So, the conclusion: always check what SQL your ORM generates for you.


May 27 2015

Disabling Eager Loading Eloquent Model Relations in Laravel 4 and 5

Category: Laravel, Laravel 5 | FractalizeR @ 1:12 pm

Since accidental lazy loading almost always leads to performance problems, I was looking for a way to disable it. The solution is easy. Just define the following trait somewhere and use it in your model when you need to disable lazy loading:

/**
 * Class EagerLoadOnlyTrait
 *
 * @package FH\Stock\Model
 */
trait EagerLoadOnlyTrait
{
    /**
     * @param string $method
     */
    protected function getRelationshipFromMethod($method)
    {
        $modelName = static::class;
        throw new LogicException(
            "EagerLoadedOnlyTrait: Attempting to lazy-load relation '$method' on model '$modelName'"
        );
    }
}


class CategoryUser extends Eloquent
{
    use EagerLoadOnlyTrait;
}
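
With the trait in place, eager loading keeps working while any accidental lazy load fails fast. A minimal sketch of the resulting behavior; 'category' is a hypothetical relation name used for illustration:

// Eager loading works as usual ('category' is a hypothetical relation):
$users = CategoryUser::with('category')->get();

// ...but an accidental lazy load now throws a LogicException
// instead of silently firing an extra query:
$user = CategoryUser::first();
$category = $user->category; // LogicException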



May 02 2015

Enabling Ansible Jinja2 template syntax highlighting in PHPStorm (and other editors)

Category: Linux administration | FractalizeR @ 1:14 am

Just do this:

  • Clone a Jinja2 TextMate bundle repository and add “j2” to the fileTypes array in its property list:

    <key>fileTypes</key>
    <array>
        <string>j2</string>
    </array>

  • Add the cloned repository folder to PHPStorm -> Settings -> Editor -> TextMate Bundles
  • It should work now, but you may have to add “*.j2” to PHPStorm -> Settings -> Editor -> Filetypes -> Files supported via TextMate bundles


May 01 2015

What is a coder’s worst nightmare?

Category: Uncategorized | FractalizeR @ 6:37 pm

What is a coder's worst nightmare?


May 01 2015

Which is better and why: Django’s ORM or SQLAlchemy?

Category: Uncategorized | FractalizeR @ 6:34 pm

This is a follow-up question to What's the best MySQL library for Python?. Django's interface seems easier and more intuitive than SQLAlchemy. Are there any crucial features missing in Django? Are they comparable in speed?

Which is better and why: Django's ORM or SQLAlchemy?


May 01 2015

What are the pros and cons of using schemas to support multi-tenancy in PostgreSQL? How scalable is it?

Category: Uncategorized | FractalizeR @ 6:31 pm

For a large SaaS application (backed by PostgreSQL 9.4) with over 300,000 accounts (and growing), what are the pros and cons of using a schema per account to partition the data vs. putting all the data in one schema and using foreign keys to partition it in the queries?

I know that in the past pg_dump was painfully slow when working with many schemas, but I'm not sure if that is the case today. I'm also aware that any change in the database structure will have to be done on all the schemas. On the plus side, moving a schema from one physical server to another becomes easy, as does restoring a schema from backup. It also makes sense to partition data that way.

However, at the same time the number of schemas required here is very large and would be so from day 1 (it's a rework of an existing system). The shared approach works well, but I'd like to explore other avenues.

What are the pros and cons of using schemas to support multi-tenancy in PostgreSQL? How scalable is it?


May 01 2015

How do companies like Yelp and Foursquare implement their geospatial backends?

Category: Uncategorized | FractalizeR @ 6:30 pm

Do they use PostgreSQL or MySQL spatial extensions, or some other backend that allows them to query for nearby places?

How do companies like Yelp and Foursquare implement their geospatial backends?


May 01 2015

What is the largest production deployment of PostgreSQL for online use?

Category: Uncategorized | FractalizeR @ 6:30 pm

Answer by Thejo Kote:

They may not be comparable to Facebook, but Skype and hi5.com use Postgres.

Stats from the Skype deployment (http://www.slideshare.net/adorepump/small-overview-of-skype-database-tools):

  • Over 100 database servers and 200 databases.
  • The largest table has several billion records.
  • Over 10,000 transactions per second.
  • The databases are used mainly for the web store, billing and other centrally provided services.
  • All access goes through stored procedures.

What is the largest production deployment of PostgreSQL for online use?


May 01 2015

What is the best resource to learn ‘joins’ in MySQL?

Category: Uncategorized | FractalizeR @ 6:27 pm

Answer by Siva Kumar:

What is the best resource to learn 'joins' in MySQL?


May 01 2015

With which technical course do I have a better future: Java or PHP?

Category: Uncategorized | FractalizeR @ 6:20 pm

Developer supply market stats – who is in demand and how they are paid

Answer by Chung I Yen:

There is some interesting analysis in this article, including salary ranges and job demand trends. Check the brief or go to the original article:
What Programming Language Should a Beginner Learn in 2015?
The Developer Supply Market
Let’s take a look at what mentors on Codementor are skilled at:

Of course, this is by no means an accurate depiction of the actual market, but it should give you a good idea of the situation. As you can see, Objective-C experts are mighty scarce, yet the mobile app business has been booming for some time now. With scarce supply of good Objective-C developers, their compensation would naturally be higher.
PHP, on the other hand, seems to have a healthy amount of supply in the workforce, which means employers have more choice and thus more bargaining power in terms of salary, but we've seen it isn't in as much demand as Java, SQL, or JavaScript. Therefore, PHP developers earn less.
JavaScript is more of a special case. Despite having the most supply, it is not at the bottom of the salary tier, and we’ll take a look at JavaScript later in this article.
Salary Range
Now that you have an idea of what the demand and supply for each language are, let's take a look at the potential salary you could earn based on salary information from job ads. The figures are based on data from gooroo.io, where salary is usually derived as the average of the salary range offered by individual job ads.

Here you can see that developers who know Python, Ruby, C++, or Objective-C appear to have higher starting salaries but also more limited salary potential. Python and C++ tend to pay at the higher end of the potential salary range.
Java and JavaScript both have huge differences between the lowest and highest offered salaries. JavaScript, in particular, currently offers the highest potential salary to skilled developers, at about 188,168 USD as of April 2015.

With which technical course do I have a better future: Java or PHP?


May 01 2015

Who is Yehuda Katz?

Category: Uncategorized | FractalizeR @ 6:17 pm

Answer by Yad Faeq:

Man man man! This question :)!
Quora Profiles (this should be a new Topic).

Well, let me introduce you to Yehuda Katz, the man behind, or involved in/with, the following awesome stuff:

and perhaps lesser known about:

The startup he founded: Tilde Inc.

We're a small team of developers who are passionate about crafting great software. We're some of the faces behind Ember.js, Ruby on Rails, jQuery and more. We're here to help you build the products and tools of the future.

Website/Blog: Katz Got Your Tongue?
GitHub: wycats (Yehuda Katz)

According to his LinkedIn:

Google > Procore > Google > Engine Yard > jQuery > Tilde Inc

These are some of his talks:
Yehuda Katz's Presentations on Confreaks

Who is Yehuda Katz?


May 01 2015

What are some criticisms of Ember.js?

Category: Uncategorized | FractalizeR @ 6:02 pm

Answer by Samer Buna:

  1. Big learning curve; it's definitely not for everybody.
  2. Opinionated framework; for some people this is a dark hole, even when the opinions are good.
  3. EmberData is not ready for production yet; it might be that this library is just a few years ahead of its time.
  4. Too much generated code; if you don't understand why it's being generated, you will certainly see that as a problem.
  5. Two-way data binding in general has lots of criticism of its own (this is shared with other frameworks, of course).
  6. Routes are the front line; some people don't see the value in that.

What are some criticisms of Ember.js?


May 01 2015

Why should I NOT use ElasticSearch as my primary datastore?

Category: Uncategorized | FractalizeR @ 5:50 pm

Answer by Mani Gandham:

As is the case with all database deployments, it really depends on your specific application.

ElasticSearch is a great open source search engine built on top of Apache Lucene. Its features and upgrades allow it to function essentially like a schema-less JSON datastore that can be accessed using both search-specific methods and regular database CRUD-like commands.

Here are the main "disadvantages" I see:

  • Security – ElasticSearch does not provide any authentication or access control functionality.
  • Transactions – There is no support for transactions or processing on data manipulation.
  • Durability – ES is distributed and fairly stable, but backups and durability are not as high a priority as in other data stores.
  • Maturity of tools – ES is still relatively new and has not had time to develop mature client libraries and 3rd-party tools, which can make development much harder.
  • Large computations – Commands for searching data are not suited to "large" scans of data or advanced computation on the db side.
  • Data availability – ES makes data available in "near real-time", which may require additional considerations in your application (e.g. on a comments page, when a user adds a new comment, refreshing the page might not actually show the new post because the index is still updating).

If you can deal with these issues, then there's certainly no reason why you can't use ElasticSearch as your primary data store. It can actually lower complexity and improve performance by not having to duplicate your data, but again, this depends on your specific use case.

As always, weigh the benefits, do some experimentation and see what works best for you.

UPDATE FOR 2015: ElasticSearch has come a long way in the past few years since this original answer and now has much more capable durability options, backup methods and even realtime indexing. Please review the official site for more information.

Why should I NOT use ElasticSearch as my primary datastore?


May 01 2015

Why does Wikipedia use MySQL as data store rather than a NoSQL database?

Category: Uncategorized | FractalizeR @ 5:48 pm

Answer by Domas Mituzas:

Billions of views hit the cache layer, not the application – the application generates only about 50,000 queries a second.

Now, there are two kinds of databases serving the wikis: 'core', which has all the metadata, and text storage.

There are six shards for the core database, split by language (so requests don't really have to hit multiple shards). They're all under 500GB, and the active dataset for each shard fits into 64GB (on some, 32GB) systems and doesn't thrash disks much.

Text storage is a bit of a different beast – essentially it is append-only chunks of data that later go through a differential-concatenated-compression batch process and end up being part of larger append-only chunks. :) Technically this is the only place where we do key:value data storage and that could benefit from NoSQL solutions, but we don't touch it much and it doesn't take too much attention (well, neither do the other databases, TBH).

Now, the 'economic' part is different – MediaWiki relies a lot on transactions, consistent read snapshots and whatnot (and on being able to recover after a crash too :) – and it also has few heavyweight queries.

MySQL is remarkably fast at these kinds of operations – and if you look at per-node performance, there are no NoSQL solutions that provide it together with, e.g., transactions and underlying storage that isn't naive.

You can see query distribution at:
http://noc.wikimedia.org/cgi-bin/ng/report.py?db=enwiki&prefix=query
(don't judge too much, hasn't been reviewed for a while :)

Technically, the main cost of serving the wiki database is not only serving a single page object (where data locality doesn't matter much), but also maintaining lots of cross-indexing (page links, image links, template links, metadata, etc.), as well as providing reporting – different views for recent changes, watchlists, per-user contributions, per-page contributions, etc.

Moving such features to a NoSQL solution would lose multiple efficiency factors – starting with data locality both at the page level and at the server level – and it would also move certain problems from the data layer to the application layer (though I can't complain; that would probably mean a more efficient implementation of watchlists, for example).

Still, data would have to be cached and served from disks too to have a decent economic solution – and I really doubt those ratios could improve a lot with a different storage solution.

Anyway, I'm not saying that NoSQL wouldn't work – the database isn't a problem at all, and there's a big chance that replacing it with a different technology wouldn't introduce a problem. There's simply no need to chase fashion.

OTOH, Wikipedia has already been using Lucene and memcached for eons – don't these count as NoSQL? :)

P.S. I do this a few days a year on a volunteer basis, so the system really, really doesn't have too many problems.

Why does Wikipedia use MySQL as data store rather than a NoSQL database?


May 01 2015

What do big websites like Facebook, Google, Twitter, and LinkedIn use for their database?

Category: Uncategorized | FractalizeR @ 5:02 pm

Answer by Ben Baert:

Big websites usually use a variety of databases, because different databases serve different purposes.

You seem to misunderstand what NoSQL means. NoSQL stands for "not only SQL" and is the 'opposite' of SQL. Within this categorisation, all databases fall into either SQL or NoSQL. That is, MySQL, MariaDB (an open-source fork of MySQL) and PostgreSQL are SQL databases; MongoDB is an example of a NoSQL database.

As for the companies you've mentioned:

  • Facebook uses mainly MySQL. In fact, someone seems to have made an attempt to reverse-engineer Facebook's database design:

    source: Facebook Object-Oriented Diagram

  • Twitter uses (their own version of) MySQL for tweets and users, and their own special kind of graph database, FlockDB, built on top of MySQL, for relations (followers, following,…), and their own version of Memcached (with speculation about whether Redis is used too or not).
  • LinkedIn used/uses Oracle Database and Voldemort.

One last thing, though. Simply because these big companies use these technologies doesn't mean that you should. They have tons of engineers and resources, and very high demands, so they can both afford to use multiple databases (which requires specialised engineers) and need to do so (because of the extreme demand). As a small-scale developer, you probably don't need to worry about extreme demands, nor do you have the means to use multiple databases.

What do big websites like Facebook, Google, Twitter, and LinkedIn use for their database?


May 01 2015

How is lazy evaluation implemented in functional programming languages?

Category: Uncategorized | FractalizeR @ 4:28 pm

Answer by Costya Perepelitsa:

With thunks. It's pretty straightforward, really:

The definition of the lazy value is wrapped in a tiny wrapper function which, when called, produces the desired value. This function is not called immediately, but instead stored, to be called only when/if needed. This stored function is the "thunk".

When the value is needed, a simple check is performed:
If the thunk has never been evaluated, the thunk is forced to run, the result is stored for future reference, and then the result is returned.
But if the thunk has already been evaluated, the result has already been stored, and so it's just returned directly without evaluating the thunk again.

I'll give an example implementation in C first, so you can see how lazy values can be implemented in a language that does not support first-class functions. If you're not following the C, there's a Python implementation at the end of this answer that you'll probably find much more readable, but it's less explicit about the creation of the thunk.
(I use a somewhat uncommon feature of C here called a function pointer. If you're unfamiliar with this concept, just know that thunk must be a function that takes a single argument of type void* and returns void*. This way, an arbitrary number of arguments of arbitrary types can be passed to the thunk, and the return can be anything the client wants.)

#include <stdlib.h>

typedef struct {
    void *val;                  // result, when evaluated
    int evaluated;              // true if thunk has ever been evaluated
    void *(*thunk)(void*);      // deferred computation
    void *args;                 // args to pass to deferred computation
} lazy;

// given a thunk and args to pass to it, creates the lazy type above
lazy *make_lazy(void* (*thunk)(void*), void *args) {
    lazy *l = malloc(sizeof(lazy));
    l->val = NULL;
    l->evaluated = 0;
    l->thunk = thunk;
    l->args = args;
    return l;
}

// force evaluation of the thunk
void *force(lazy *l) {
    if (!l->evaluated) {
        // the thunk will only ever be evaluated once,
        // and only when the evaluation is forced via this function
        l->val = l->thunk(l->args);
        l->evaluated = 1;
    }
    return l->val;
}

And here's an example of its use:

#include <stdlib.h>
#include <stdio.h>
#include "lazy.h"   // or whatever you named it

// the function to make lazy
int id(int x) {
    // the following puts() gives an obvious indication
    // that this function is being evaluated
    puts("Evaluating id!");
    return x;
}

// create the thunk; when forced, it evaluates id()
void* id_thunk(void *x_vp) {
    int *x = malloc(sizeof(int));
    *x = id(*(int*)x_vp);
    return (void*)x;
}

int main(void) {
    // create a lazily-evaluated call to "id(10)"
    int *ten_arg = malloc(sizeof(int));
    *ten_arg = 10;
    lazy *ten = make_lazy(id_thunk, (void*)ten_arg);

    puts("Evaluating lazy ten:");
    // the first time the lazy value is evaluated,
    // the thunk will be evaluated, which should
    // produce the side-effect of outputting
    // "Evaluating id!" to the console
    int *ten_evaluated = (int*)force(ten);
    printf("Result: %d\n", *ten_evaluated);

    puts("\nEvaluating lazy ten again:");
    // the second evaluation of the lazy value
    // should not cause id() to be run again,
    // so we're expecting to NOT see the
    // "Evaluating id!" message
    ten_evaluated = (int*)force(ten);
    printf("Result: %d\n", *ten_evaluated);

    return 0;
}

And when you run this program, you get the following output:

Evaluating lazy ten:
Evaluating id!
Result: 10

Evaluating lazy ten again:
Result: 10

You can see from this output that the id() function is not evaluated before the call to force; otherwise, the message "Evaluating id!" would have appeared above the first line that says "Evaluating lazy ten:". You can further see that the id() function is only evaluated once, because the "Evaluating id!" message does not appear again when the lazy value is evaluated a second time.

If C makes you queasy, here's a re-implementation in Python:

class Lazy:
    def __init__(self, f, args):
        self.evaluated = False
        self.value = None
        self.f = f
        self.args = args

    def force(self):
        if not self.evaluated:
            self.value = self.f(self.args)
            self.evaluated = True
        return self.value

And example use in Python:

def identity(x):
    print "Evaluating identity()!"
    return x

if __name__ == '__main__':
    lazy_ten = Lazy(identity, 10)

    print "Evaluating lazy ten:"
    result = lazy_ten.force()
    print "Result: " + str(result)
    print
    print "Evaluating lazy ten again:"
    result = lazy_ten.force()
    print "Result: " + str(result)

The output is the same as the C example:

Evaluating lazy ten:
Evaluating identity()!
Result: 10

Evaluating lazy ten again:
Result: 10

How is lazy evaluation implemented in functional programming languages?


May 01 2015

What’s the attraction of the JVM platform?

Category: Uncategorized | FractalizeR @ 4:28 pm

Answer by Mike Anderson:

I'd regard being on the JVM as a massive advantage.

If you came away with a bad impression, I suspect that you probably didn't have a well-configured environment. I haven't edited a classpath or worried about the location of a .jar file for several years – modern IDEs like Eclipse and NetBeans do all this for you…

Here's what I consider to be the main advantages of being on the JVM:

  • Cross platform – code written to target the JVM will run on any platform on which the JVM is available, which is pretty much everywhere nowadays. It's the *only* platform that offers comparable write-once-run-anywhere capabilities for compiled binary code.
  • Largest library ecosystem – the open source library ecosystem around the JVM is unquestionably the broadest, most comprehensive and most mature. You get everything from enterprise-class application servers like JBoss to full 3D game engines like JMonkeyEngine to IDE suites like Eclipse to data mining platforms such as Weka.
  • Excellent engineering – alongside Microsoft's CLR, it's the best-engineered virtual machine platform available. Particular strengths are its extremely efficient garbage collection and JIT compilation.
  • Mobile – Google is a big proponent of Java, so it's not particularly surprising that they chose Java as the core language for Android. But apart from that, the JVM is very prominent in many other mobile / embedded environments.
  • Tooling – the JVM ecosystem has built up an impressive array of tooling for the entire development cycle over many years: everything from IDEs (Eclipse, NetBeans, IntelliJ) to build tools (Maven, Ant) to automated deployment and testing tools. Yes, you need to do a bit of work to set these tools up in the first place, but you get a lot of return on that investment over the long run.
  • Language innovation – although you can argue that Java is now beginning to show its age, it turns out that the JVM platform is a good target for innovative and experimental languages. As a result, there are some notable new languages on the JVM that are very promising and are starting to gain considerable momentum (Scala, Clojure and Groovy, to name but a few).
  • Skills availability – Java is probably the most widely used application development language (the TIOBE index, for example, typically shows it as No. 1), so it follows that it has the broadest base of skilled developers in a wide variety of domains. If you're making a strategic choice of platform as a business, this is a pretty important consideration!
  • Relative vendor independence – the JVM counts among its major corporate backers Oracle, IBM, Google and a host of other companies. Although Oracle has a slightly privileged position since it acquired Sun, it can't stop you from choosing other alternatives (since the main JVM and JDK implementations are open source!). Contrast this with .NET, where you are inextricably tied to Microsoft platforms (Windows, Xbox, etc.).
  • Career opportunities – a lot of big companies are heavily invested in the JVM and are willing to pay top money for JVM skills. In particular, my experience is that the JVM is the preferred platform for server-side development in the majority of banks and major internet companies.

What's the attraction of the JVM platform?


Mar 14 2015

Recounting user post counts on a XenForo forum: approaches that worked and ones that didn't

Category: XenForo | FractalizeR @ 9:45 pm

I ran into the task of recounting users' post counts on our forum and decided to share my research into the most efficient way to solve it.

Option #1 (the sonnbRebuildUserPostCount 1.0.0 20120725 add-on).

This add-on recounts user posts in a loop, executing the following query for each user:

SELECT COUNT(*)
  FROM `xf_post` AS post
  INNER JOIN `xf_thread` AS thread
  ON (post.thread_id = thread.thread_id)
  WHERE post.user_id = $userId AND
  thread.node_id IN (" . XenForo_Application::getDb()->quote(array_keys($nodes)) . ")

Drawbacks:

  • It is slow on large forums (ours currently has 700,000+ users): it executes more queries than there are users on the forum, which puts quite a heavy load on the server (after all, you have to wait for all 700 thousand queries to finish).

Option #2 (from patrikq)

Execute the query:

UPDATE xf_user AS user
SET message_count = (
  SELECT COUNT(*)
  FROM xf_post AS post
  LEFT JOIN xf_thread AS thread ON (thread.thread_id = post.thread_id)
  WHERE post.user_id = user.user_id
  AND post.message_state = 'visible' AND thread.discussion_state = 'visible'
  GROUP BY post.user_id
);

Drawbacks:

  • It is inefficient: EXPLAIN shows that two subqueries will be executed to run this query. It puts a heavy load on the database.
  • It does not take into account that some forums may be configured so that user posts in those sections are not counted (a built-in engine feature). You would need to add another JOIN against the xf_forum table and a condition on count_messages in the WHERE clause.

Looking for an alternative.

XenForo has an xf_thread_user_post table that caches the number of each user's posts in every forum thread. Queries against this table run faster than queries against xf_post because it is much smaller (if a table does not fit in memory, InnoDB reads data from disk in 16KB pages even when only 1 byte of the whole page is needed).

Option #3 (ForumHouse):

Execute the query:

UPDATE xf_user U SET message_count = (
  SELECT
  SUM(post_count) AS user_post_count
  FROM xf_thread_user_post TUP
  INNER JOIN xf_thread T USING (thread_id)
  INNER JOIN xf_forum F USING (node_id)
  WHERE count_messages AND TUP.user_id = U.user_id
  )

Drawbacks:

  • EXPLAIN shows that three SUBQUERIES will be executed, which is not efficient enough.

Option #4 (ForumHouse):

Execute the query:

UPDATE xf_user U
  INNER JOIN (
  SELECT
  SUM(post_count) AS user_post_count,
  TUP.user_id
  FROM xf_thread_user_post TUP
  INNER JOIN xf_thread T USING (thread_id)
  INNER JOIN xf_forum F USING (node_id)
  WHERE count_messages AND T.discussion_state = 'visible'
GROUP BY TUP.user_id) AS PC ON U.user_id = PC.user_id SET message_count = user_post_count

On our volumes (700,000+ users, 200,000+ threads and 8,700,000+ posts) it completes in about 3 seconds. If you look at its EXPLAIN, you will see that there are no SUBQUERIES, only a few DERIVED tables.

The difference in execution time between options #3 and #4 is roughly 10x, so this is the one we have settled on for now ;)

