Taka’s blog

The blog of a software engineer working at a start-up in London

How to read a text file in batches with Ruby

While reviewing a colleague's source, I found some awful code in his program.

Like this:

File.open("./production.log") do |file|
  results = {}
  file.readline # skip header
  file.each_line do |line|
    key, value = process(line)
    results[key] = value
    next if results.size < 2000
    output(results)
    results = {}
  end
  output(results)
end

Oops! How did this happen!?

He said, "We need to purge results from memory every 2,000 lines; otherwise we run out of memory. And Ruby has no method to read 'each batch of lines' from a text file."

I got it: we need a method to read files in batches, like ActiveRecord's "find_in_batches". So I searched for such a method and found the same question as ours on Stack Overflow.

stackoverflow.com

Those answers would work, but they are not cool! In the end, I couldn't find a great way.

Ideal Ways

I'd like to use something like a "find_in_batches" method, like this:

File.open("./production.log", skip_header: true) do |file|
  file.batch_line(batch_size: 2000) do |lines|
    results = process(lines)
    output(results)
  end
end

Or it seems nice to wrap the text file in an object that has a "batch_line*1" method:

file = TextFilePager.new("./production.log", skip_header: true)
file.batch_line(batch_size: 2000) do |lines|
  results = process(lines)
  output(results) 
end

Both are better than the previous code; we can easily understand what the code does.

First Proposal

Using "Enumerator::Lazy" and "each_slice" seems like one of the best ways:

File.open("./production.log") do |file|
  file.lazy.drop(1).each_slice(2000) do |lines|
    results = process(lines)
    output(results)
  end
end

It's a really simple way, because we need no additional class or method.

All we need is to understand "Enumerator::Lazy*2" and "each_slice"; that's it.
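To see the streaming behaviour concretely, here is a small sketch. I use StringIO as a stand-in for a real file handle (both enumerate their lines), and the file contents are made up:

```ruby
require "stringio"

# StringIO stands in for a real file here; both include Enumerable
# and enumerate line by line.
io = StringIO.new("header\n" + (1..5).map { |i| "line#{i}\n" }.join)

batches = []
# Without `lazy`, `drop(1)` would materialize every remaining line into an
# array before slicing; with `lazy`, lines are read and grouped as we go.
io.lazy.drop(1).each_slice(2) do |lines|
  batches << lines.map(&:chomp)
end
batches  # => [["line1", "line2"], ["line3", "line4"], ["line5"]]
```

Note that the last batch simply comes out smaller when the line count doesn't divide evenly.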

Second Proposal

Making a new class, like "TextFilePager", seems a nice idea:

class TextFilePager
  DEFAULT_BATCH_SIZE = 1000

  def initialize(file_path, skip_header: false, delete_line_break: false)
    @file_path = file_path
    @skip_header = skip_header
    @delete_line_break = delete_line_break
  end

  def batch_line(batch_size: DEFAULT_BATCH_SIZE)
    File.open(@file_path) do |file|
      file.gets if skip_header?
      loop do
        line, lines = "", []
        batch_size.times do
          break if (line = file.gets).nil?
          lines << (delete_line_break? ? line.chomp : line)
        end
        # Don't yield a trailing empty batch when the line count is an
        # exact multiple of batch_size.
        yield lines unless lines.empty?
        break if line.nil?
      end
    end
  end

  def skip_header?
    @skip_header
  end

  def delete_line_break?
    @delete_line_break
  end
end

You can use this class like this:

file = TextFilePager.new("./production.log", skip_header: true)
file.batch_line(batch_size: 2000) do |lines|
  results = process(lines)
  output(results) 
end

This code is exactly the same as what I wrote in the section above. The reason is clear: I created the class to match my ideal interface, as in TDD.

Conclusion

I prefer the first proposal because it only requires knowledge of Ruby, and we don't need to make a new class. Too many classes get in our way, so we shouldn't add a class unless we really need it.

But the second way gives us a good interface; it feels intuitive. Plus, we can easily pass options to the object. If you need this kind of text file processing time and time again, adding a new class may be a good choice.

Perhaps it would be nice to add a "batch_line" method to the "IO" or "File" class. But I don't like modifying core classes much, because that affects our code widely.
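If you did want to go that way, a refinement would keep the change scoped to files that opt in via `using`, instead of reopening "IO" globally. A minimal sketch — "batch_line" is the hypothetical method from this post, and the file path and contents are made up:

```ruby
require "tmpdir"

module BatchLine
  # The refinement is only visible in scopes that call `using BatchLine`,
  # so the rest of the codebase sees an unmodified IO.
  refine IO do
    def batch_line(batch_size: 1000, &block)
      # Enumerable#each_slice consumes the IO incrementally,
      # so memory stays bounded.
      each_slice(batch_size, &block)
    end
  end
end

using BatchLine

batches = []
Dir.mktmpdir do |dir|
  path = File.join(dir, "production.log")
  File.write(path, (1..5).map { |i| "line#{i}\n" }.join)

  File.open(path) do |file|
    file.batch_line(batch_size: 2) { |lines| batches << lines.map(&:chomp) }
  end
end
batches  # => [["line1", "line2"], ["line3", "line4"], ["line5"]]
```

Since File inherits from IO, the refined method is found on File handles too, but only where the refinement is activated.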

Hence, I'll reach for "Enumerator::Lazy" and "each_slice" first when this matter comes up.

Thanks for reading.

Appendix

Here is my original post about this matter, written in Japanese.

*1:Of course, this method doesn't exist yet.

*2:If you don't need to use `drop`, `lazy` is unnecessary.
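To illustrate that footnote: with no header line to skip, "each_slice" streams batches from the file on its own. A sketch with a made-up file:

```ruby
require "tmpdir"

batches = []
Dir.mktmpdir do |dir|
  path = File.join(dir, "no_header.log")
  File.write(path, (1..5).map { |i| "line#{i}\n" }.join)

  File.open(path) do |file|
    # No header to drop, so plain Enumerable#each_slice is enough;
    # it reads and groups lines incrementally.
    file.each_slice(2) { |lines| batches << lines.map(&:chomp) }
  end
end
batches  # => [["line1", "line2"], ["line3", "line4"], ["line5"]]
```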