Tom Cannaerts

Efficient data processing in PHP

Last weekend, I was at PHP Benelux and attended a talk by Juozas Kaziukėnas titled “Process any amounts of data. Any time”. What it comes down to is that, by using the right techniques, you can process virtually any amount of data without killing your webserver. The thing is, the talk did not teach me much I didn’t already know, but rather made me realize that developers (me included) often tend to be lazy.

Why should I write my own code to open a file, read it and process it line by line, when I can simply use file_get_contents and feed the entire output to explode or a SimpleXML element, then start looping over every element? Well, have a look at these two samples and read on.

Sample 1:

$lines = explode("\r\n", file_get_contents("1MB.txt"));
foreach ($lines as $line) {
    // process $line
}

Sample 2:

$fp = fopen("1MB.txt", "r");
while (!feof($fp)) {
    $line = fgets($fp);
    // process $line
}
fclose($fp);

Yes, the first sample is a full two lines shorter, and yes, reading the file all at once will probably execute slightly faster than performing several read operations. So suppose you are using this code to import a product catalog into a webshop. Assuming the catalog has 1,000 products, and each line requires 1KB of memory to process, you would essentially need 1MB of memory. Actually, since you are making various function calls, the peak memory usage will be at least twice that, as PHP has to copy the data around from function to function. Obviously, these 2MB of memory will not kill your server. But suppose business is doing well and our 1,000 products turn into 100,000 products. All of a sudden, our import script needs 200MB to finish. Congratulations: your attempt to save two lines has just broken the default 128MB memory limit of PHP. So the right thing to do in this situation is to use ini_set and raise memory_limit to a value that will not crash your server, or even set it to -1 (no memory limit), right? Let’s start by looking at what the PHP documentation has to say about it:

“This sets the maximum amount of memory in bytes that a script is allowed to allocate. This helps prevent poorly written scripts for eating up all available memory on a server.”

So the fact that we are running into the default memory limit (which should be a sensible value for most situations) might just mean that we have in fact produced a poorly written script. Although servers these days might not have a problem at all with a bit of extra memory consumption, that does not excuse us from writing decent code. So let’s try the same thing, reading the file line by line. In theory, we should only need enough memory to hold a single line. To see whether that holds in practice, I created a simple test script that reads a 1MB text file using both methods described above:
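For illustration, a minimal version of such a measurement script could look like this. This is a sketch rather than the exact script used for the numbers below; the generated test file and the empty processing loop are placeholders:

```php
<?php
// Generate a ~1MB test file: 10,000 lines of 100 bytes each.
// (Placeholder data; the original test used a real 1MB.txt.)
$file = tempnam(sys_get_temp_dir(), 'memtest');
file_put_contents($file, str_repeat(str_repeat('x', 98) . "\r\n", 10000));

$start = memory_get_usage();
echo "Memory usage start: $start\n";

// Method 2: read line by line, holding only one line at a time.
$fp = fopen($file, 'r');
while (!feof($fp)) {
    $line = fgets($fp);
    // process $line here
}
fclose($fp);

echo "Memory usage after processing: " . memory_get_usage() . "\n";
echo "Memory usage peak: " . memory_get_peak_usage() . "\n";
unlink($file);
```

Swapping the middle section for the file_get_contents/explode variant and comparing memory_get_peak_usage() gives the figures reported below.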

Method 1 (file_get_contents / explode / foreach)

Memory usage start: 325144
Memory usage after processing: 5348440
Memory usage peak: 6398344 (6.1MB)

Method 2 (fopen/fgets)

Memory usage start: 325928
Memory usage after processing: 326368
Memory usage peak: 335728 (327kb)

As you can see, to process a 1MB file, the first method has a peak memory usage of over 6MB, six times the amount of data we are processing. So apparently, the factor of two I mentioned earlier was quite optimistic. Our 100,000-product example would in this case need a whopping 600MB. The peak memory of the second method, on the other hand, was only about 10KB more than what we started with. When feeding the scripts a 10MB file, the results were pretty much what I expected: 58MB peak usage versus 327KB. The memory usage of the file_get_contents method grew almost linearly with the size of the input file, whereas the memory usage of the fgets method remained the same.

OK, so not everyone has to import large text files, and those who do might just run the import in the middle of the night, when it’s not really a problem to snatch away a chunk of the available memory. Something we probably all do need, though, are databases. If done “right”, you will probably be using some sort of MVC framework. Due to the nature of MVC, the rendering of the data happens in a different place than the actual gathering of it. Unfortunately, a lot of people and frameworks bridge this “distance” by retrieving the data from the database and storing it in a variable or object, which is then consumed by the template for output. Typically, the database abstraction layer will return an array of rows or objects that each represent a single row in a particular database table. And when dealing with objects, chances are it will have fetched each and every field in the database table as well, regardless of whether you need them or not.
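The same streaming idea applies to query results. A sketch with PDO, using an in-memory SQLite database as a stand-in for a real product table (table and column names are hypothetical):

```php
<?php
// In-memory SQLite stands in for a real database here.
$pdo = new PDO('sqlite::memory:');
$pdo->exec('CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)');
$pdo->exec("INSERT INTO products (name, price)
            VALUES ('Widget', 9.99), ('Gadget', 19.99)");

// fetchAll() materializes every row into one big array up front:
// $rows = $pdo->query('SELECT * FROM products')->fetchAll();

// Iterating the statement keeps only the current row in PHP memory:
$count = 0;
$stmt = $pdo->query('SELECT id, name, price FROM products');
while ($row = $stmt->fetch(PDO::FETCH_ASSOC)) {
    // render or process $row here, one row at a time
    $count++;
}
```

With fetch() in a loop, memory usage stays proportional to one row instead of the whole result set, just as fgets did for files.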

Take your average webshop. When browsing a category, you will be shown a list of products. This list typically contains the name, price, short description and image of each product, and links to a detail page using the internal id of the product. The products table will likely contain many more columns (like SKU, stock, date added, vendor links, …) that you do not need in order to render the list. So why should your MVC model or database abstraction layer return each and every field? Sure, you will probably be dealing with less data at a time than when importing your entire catalog, so the total memory consumption of a single request will likely not be that large. On the other hand, there may well be 100 concurrent users visiting your webshop. Besides the excess memory usage, PHP also has to move this extra data back and forth, consuming time and CPU cycles. Every cycle it uses is a cycle it cannot use for something else. For that reason, it is in your own interest to limit the amount of time PHP wastes, allowing it to finish sooner when servicing your clients.
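In query terms, that means naming the columns the list view actually renders instead of falling back on SELECT *. A sketch, again with an in-memory SQLite table and made-up column names:

```php
<?php
// Hypothetical products table with the extra columns a shop accumulates.
$pdo = new PDO('sqlite::memory:');
$pdo->exec('CREATE TABLE products (
    id INTEGER PRIMARY KEY, name TEXT, price REAL, short_description TEXT,
    image TEXT, sku TEXT, stock INTEGER, date_added TEXT, vendor_id INTEGER
)');
$pdo->exec("INSERT INTO products (name, price, short_description, image, sku, stock)
            VALUES ('Widget', 9.99, 'A fine widget.', 'widget.jpg', 'W-1', 5)");

// Only the five fields the category listing renders; SKU, stock and the
// rest are never fetched, copied or hydrated.
$stmt = $pdo->query('SELECT id, name, price, short_description, image
                     FROM products');
$product = $stmt->fetch(PDO::FETCH_ASSOC);
```

Each row now costs five fields of memory and transfer instead of nine, and the database can sometimes even answer from an index alone.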

Obviously, this is not a problem you fix by changing two lines. Due to the nature of MVC, usually coupled with OO programming, fetching and processing database records one by one, and not fetching complete records, is often easier said than done, especially when using pre-made frameworks that have implemented things otherwise. The availability of bigger and better computing resources has also led to a form of laziness among developers, as they often do not see the need to walk the extra mile. Nonetheless, sometimes even a few small and seemingly insignificant changes can mean a lot when dealing with resources. The extra time spent doing it right the first time will probably pay for itself the moment you run into the limitations of your environment. Sure, you could scale your server to fix the symptoms, but that doesn’t really solve the problem.
