HowTo: Working efficiently with large files in Ruby
How can we read files in Ruby?
* The test file was generated by running the following command:
❯ openssl req -newkey rsa:2048 -new -nodes -x509 -days 3650 -keyout key.pem -out cert.pem
It has a clearly defined beginning and end, which will be useful while reading the file.
1. File.read(), which is actually IO.read():
> file = File.read("cert.pem")
=> "-----BEGIN CERTIFICATE-----\nMIICljCCAX4CCQD5x/0DnI1UazANBgkqhkiG9w0BAQsFADANMQswCQYDVQQGEwJQ\nTDAeFw0xOTA4MzExOTQ0NDdaFw0yOTA4MjgxOTQ0NDdaMA0xCzAJBgNVBAYTAlBM\nMIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA2qJrZayMFRE7zIeUL8CZ\nzqsOcwEv6flF41EjIvVf6h164i+NGkRu9E0wo1LHYsoF5tutYKKpRLJoY9xGq+Jr\n1SPOJYGBaFqKyQye+lnSzJdpnCAklXObfJpGtBmKCm4OTcb8eC4nm2q4x3mNkP5Z\nTgzdfIhALCwtD6wsHcyy5qmqGfPWAaGUDHqAQRu7QV/vu5VzJXgN0c6Zj+bOWw4H\n7Zu+FxtpUACQk4lnqt9CUzp6GX3dIETTfA3cpTFvoxwqBZGnrjsgZA5HzbyKRUYi\naigbkyzc701sJaS8gcjIKDy2s8L8MfqaJkMu+N52e5tXoj4oQT9wPzxOou+GpYM/\n4QIDAQABMA0GCSqGSIb3DQEBCwUAA4IBAQDDrOrN+asQjkOwjPcNLkycy4TJ/6QE\nraNDVZ1N5h+70vIQwmmCS+hBN7SSM0f0OxgEggvK0etNQb6LXWXAIa7pMuzhqmHR\n9Q/NBizj+GOIvH7EoCTVKYUkRLxEq5i63cm0ZvFu9qwr8v7IGM4HkLo3A0F6+Vcp\nGNuOBNcGqAtCXNhgcpzu/6zWT2kAj1M82IC4aCIiTGovDidnp2ZO4bV5PTCy7ecd\naeJxt9LIlt/FVk29sjdtutPMZgtQwKKp2gWyY9D7/x8Dxpf2DCkjAtqEdN3/GER6\nlybIrvAtYW7MNmu9MLkxionOak9CoZGsVg0kiXliHrhfxrDc8qLe8rqV\n-----END CERTIFICATE-----\n"
> file.bytesize
=> 956
> file.class
=> String
The read method reads the entire file's content and returns it as a single String.
2. File.new() and its synonym File.open():
> file = File.new("cert.pem")
=> #<File:cert.pem>
> lines = file.readlines
=> ["-----BEGIN CERTIFICATE-----\n",
"MIICljCCAX4CCQD5x/0DnI1UazANBgkqhkiG9w0BAQsFADANMQswCQYDVQQGEwJQ\n",
"TDAeFw0xOTA4MzExOTQ0NDdaFw0yOTA4MjgxOTQ0NDdaMA0xCzAJBgNVBAYTAlBM\n",
"MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA2qJrZayMFRE7zIeUL8CZ\n",
"zqsOcwEv6flF41EjIvVf6h164i+NGkRu9E0wo1LHYsoF5tutYKKpRLJoY9xGq+Jr\n",
"1SPOJYGBaFqKyQye+lnSzJdpnCAklXObfJpGtBmKCm4OTcb8eC4nm2q4x3mNkP5Z\n",
"TgzdfIhALCwtD6wsHcyy5qmqGfPWAaGUDHqAQRu7QV/vu5VzJXgN0c6Zj+bOWw4H\n",
"7Zu+FxtpUACQk4lnqt9CUzp6GX3dIETTfA3cpTFvoxwqBZGnrjsgZA5HzbyKRUYi\n",
"aigbkyzc701sJaS8gcjIKDy2s8L8MfqaJkMu+N52e5tXoj4oQT9wPzxOou+GpYM/\n",
"4QIDAQABMA0GCSqGSIb3DQEBCwUAA4IBAQDDrOrN+asQjkOwjPcNLkycy4TJ/6QE\n",
"raNDVZ1N5h+70vIQwmmCS+hBN7SSM0f0OxgEggvK0etNQb6LXWXAIa7pMuzhqmHR\n",
"9Q/NBizj+GOIvH7EoCTVKYUkRLxEq5i63cm0ZvFu9qwr8v7IGM4HkLo3A0F6+Vcp\n",
"GNuOBNcGqAtCXNhgcpzu/6zWT2kAj1M82IC4aCIiTGovDidnp2ZO4bV5PTCy7ecd\n",
"aeJxt9LIlt/FVk29sjdtutPMZgtQwKKp2gWyY9D7/x8Dxpf2DCkjAtqEdN3/GER6\n",
"lybIrvAtYW7MNmu9MLkxionOak9CoZGsVg0kiXliHrhfxrDc8qLe8rqV\n",
"-----END CERTIFICATE-----\n"]
> lines.class
=> Array
The new and open methods return an instance of the File class, on which we can call the readlines method. It reads the entire file's content, splits it line by line and returns an Array of Strings, where each element is one line from the file.
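One practical difference is worth keeping in mind: when File.open is given a block, Ruby closes the file once the block returns, whereas File.new (or File.open without a block) leaves closing to the caller. A minimal sketch, using a throwaway temp file whose name is only for illustration:

```ruby
require 'tmpdir'

path = File.join(Dir.mktmpdir, 'sample.txt') # illustrative temp file, not cert.pem
File.write(path, "first\nsecond\n")

# Without a block we own the handle and must close it ourselves.
manual = File.new(path)
manual_lines = manual.readlines
manual.close

# With a block, the handle is closed when the block returns,
# and File.open returns the block's value.
handle = nil
block_lines = File.open(path) do |file|
  handle = file
  file.readlines
end

puts manual_lines == block_lines
puts handle.closed?
```

For short scripts the difference rarely matters, but with many large files open at once, leaked handles can add up.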
3. File.readlines(), which is actually IO.readlines():
> lines = File.readlines("cert.pem")
=> ["-----BEGIN CERTIFICATE-----\n",
"MIICljCCAX4CCQD5x/0DnI1UazANBgkqhkiG9w0BAQsFADANMQswCQYDVQQGEwJQ\n",
"TDAeFw0xOTA4MzExOTQ0NDdaFw0yOTA4MjgxOTQ0NDdaMA0xCzAJBgNVBAYTAlBM\n",
"MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA2qJrZayMFRE7zIeUL8CZ\n",
"zqsOcwEv6flF41EjIvVf6h164i+NGkRu9E0wo1LHYsoF5tutYKKpRLJoY9xGq+Jr\n",
"1SPOJYGBaFqKyQye+lnSzJdpnCAklXObfJpGtBmKCm4OTcb8eC4nm2q4x3mNkP5Z\n",
"TgzdfIhALCwtD6wsHcyy5qmqGfPWAaGUDHqAQRu7QV/vu5VzJXgN0c6Zj+bOWw4H\n",
"7Zu+FxtpUACQk4lnqt9CUzp6GX3dIETTfA3cpTFvoxwqBZGnrjsgZA5HzbyKRUYi\n",
"aigbkyzc701sJaS8gcjIKDy2s8L8MfqaJkMu+N52e5tXoj4oQT9wPzxOou+GpYM/\n",
"4QIDAQABMA0GCSqGSIb3DQEBCwUAA4IBAQDDrOrN+asQjkOwjPcNLkycy4TJ/6QE\n",
"raNDVZ1N5h+70vIQwmmCS+hBN7SSM0f0OxgEggvK0etNQb6LXWXAIa7pMuzhqmHR\n",
"9Q/NBizj+GOIvH7EoCTVKYUkRLxEq5i63cm0ZvFu9qwr8v7IGM4HkLo3A0F6+Vcp\n",
"GNuOBNcGqAtCXNhgcpzu/6zWT2kAj1M82IC4aCIiTGovDidnp2ZO4bV5PTCy7ecd\n",
"aeJxt9LIlt/FVk29sjdtutPMZgtQwKKp2gWyY9D7/x8Dxpf2DCkjAtqEdN3/GER6\n",
"lybIrvAtYW7MNmu9MLkxionOak9CoZGsVg0kiXliHrhfxrDc8qLe8rqV\n",
"-----END CERTIFICATE-----\n"]
> lines.class
=> Array
Here we get the same output as in the previous example by calling just the class method readlines on the File class.
4. File.foreach(), which is actually IO.foreach():
> file = File.foreach("./cert.pem")
=> #<Enumerator: ...>
> file.entries
=> ["-----BEGIN CERTIFICATE-----\n",
"MIICljCCAX4CCQD5x/0DnI1UazANBgkqhkiG9w0BAQsFADANMQswCQYDVQQGEwJQ\n",
"TDAeFw0xOTA4MzExOTQ0NDdaFw0yOTA4MjgxOTQ0NDdaMA0xCzAJBgNVBAYTAlBM\n",
"MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEA2qJrZayMFRE7zIeUL8CZ\n",
"zqsOcwEv6flF41EjIvVf6h164i+NGkRu9E0wo1LHYsoF5tutYKKpRLJoY9xGq+Jr\n",
"1SPOJYGBaFqKyQye+lnSzJdpnCAklXObfJpGtBmKCm4OTcb8eC4nm2q4x3mNkP5Z\n",
"TgzdfIhALCwtD6wsHcyy5qmqGfPWAaGUDHqAQRu7QV/vu5VzJXgN0c6Zj+bOWw4H\n",
"7Zu+FxtpUACQk4lnqt9CUzp6GX3dIETTfA3cpTFvoxwqBZGnrjsgZA5HzbyKRUYi\n",
"aigbkyzc701sJaS8gcjIKDy2s8L8MfqaJkMu+N52e5tXoj4oQT9wPzxOou+GpYM/\n",
"4QIDAQABMA0GCSqGSIb3DQEBCwUAA4IBAQDDrOrN+asQjkOwjPcNLkycy4TJ/6QE\n",
"raNDVZ1N5h+70vIQwmmCS+hBN7SSM0f0OxgEggvK0etNQb6LXWXAIa7pMuzhqmHR\n",
"9Q/NBizj+GOIvH7EoCTVKYUkRLxEq5i63cm0ZvFu9qwr8v7IGM4HkLo3A0F6+Vcp\n",
"GNuOBNcGqAtCXNhgcpzu/6zWT2kAj1M82IC4aCIiTGovDidnp2ZO4bV5PTCy7ecd\n",
"aeJxt9LIlt/FVk29sjdtutPMZgtQwKKp2gWyY9D7/x8Dxpf2DCkjAtqEdN3/GER6\n",
"lybIrvAtYW7MNmu9MLkxionOak9CoZGsVg0kiXliHrhfxrDc8qLe8rqV\n",
"-----END CERTIFICATE-----\n"]
> file.entries.class
=> Array
The foreach method returns an Enumerator instance, on which we call entries. This returns an Array of Strings; again, each element is a line from the file.
As we can see above, there are many methods that allow us to read a file. However, which one should we use and why? Let's create a large file and check those methods again!
Which methods should we use to read large files?
Generating our test file
First, let's generate a large file with randomized data inside:
require 'securerandom'

one_megabyte = 1024 * 1024
name = "large_1G"
size = 1000

File.open("./#{name}.txt", 'wb') do |file|
  size.times do
    file.write(SecureRandom.random_bytes(one_megabyte))
  end
end
The 'wb' mode string combines two flags:
- w - Write-only; truncates an existing file to zero length or creates a new file for writing.
- b - Binary file mode; suppresses EOL <-> CRLF conversion on Windows and sets the external encoding to ASCII-8BIT unless explicitly specified.
As a result, we generated a 1GB file:
❯ ls -lah
...
-rw-r--r-- 1 user user 1.0G Aug 31 22:10 large_1G.txt
Defining our metrics and profilers
There are two metrics that matter most in our experiment:
- Time - How long does it take to open and read the file?
- Memory - How much memory does it take to open and read the file?
There will also be one additional metric describing how many objects were freed by the Garbage Collector.
We can prepare simple profiling methods:
# ./helpers.rb
require 'benchmark'

# Prints the change in this process's resident set size (RSS) across the block.
def profile_memory
  memory_usage_before = `ps -o rss= -p #{Process.pid}`.to_i # ps reports kilobytes
  yield
  memory_usage_after = `ps -o rss= -p #{Process.pid}`.to_i
  used_memory = ((memory_usage_after - memory_usage_before) / 1024.0).round(2)
  puts "Memory usage: #{used_memory} MB"
end

# Prints the wall-clock time spent in the block.
def profile_time
  time_elapsed = Benchmark.realtime do
    yield
  end
  puts "Time: #{time_elapsed.round(2)} seconds"
end

# Prints how many objects the GC freed while the block ran.
def profile_gc
  GC.start
  before = GC.stat(:total_freed_objects)
  yield
  GC.start
  after = GC.stat(:total_freed_objects)
  puts "Objects Freed: #{after - before}"
end

# Runs all three profilers around the same block.
def profile
  profile_memory do
    profile_time do
      profile_gc do
        yield
      end
    end
  end
end
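If you would rather collect the numbers than print them, the same ideas fit into a single method that returns a Hash. This is only a sketch: the name measure and the Hash keys are made up for illustration, it times the block with Process.clock_gettime instead of Benchmark, and it leaves out the ps-based memory reading.

```ruby
# Hypothetical variant of the helpers: returns measurements instead of printing them.
def measure
  GC.start
  freed_before = GC.stat(:total_freed_objects)
  started_at = Process.clock_gettime(Process::CLOCK_MONOTONIC)

  result = yield

  seconds = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started_at
  GC.start
  freed_after = GC.stat(:total_freed_objects)

  { result: result, seconds: seconds, freed_objects: freed_after - freed_before }
end

# Example workload: allocate ten thousand short Strings.
stats = measure { Array.new(10_000) { |i| i.to_s }.size }
puts stats
```

Returning a Hash makes it easier to compare several runs or to write the results to a file later.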
Testing our methods for reading files:
.read
file = nil
profile do
  file = File.read("large_1G.txt")
end
Objects Freed: 39
Time: 0.52 seconds
Memory usage: 1000.05 MB
.new + #readlines
file = nil
profile do
  file = File.new("large_1G.txt").readlines
end
Objects Freed: 39
Time: 4.19 seconds
Memory usage: 1298.4 MB
.readlines
file = nil
profile do
  file = File.readlines("large_1G.txt")
end
Objects Freed: 39
Time: 4.24 seconds
Memory usage: 1284.61 MB
.foreach
file = nil
profile do
  file = File.foreach("large_1G.txt").to_a
end
Objects Freed: 40
Time: 4.42 seconds
Memory usage: 1284.31 MB
The examples above read the whole file and store it in memory as one String or as an Array of Strings (each line from the file as one element in the Array).
As we can see, this requires at least as much memory as the size of the file:
- one String - a 1GB file requires 1GB of memory.
- an Array of Strings - 1GB of memory for the file's content plus additional memory for the Array (roughly 300MB here). This approach has one advantage: we can access any line of the file directly, as long as we know its line number.
At this point we can see that the methods we tested are not really memory-efficient. The bigger the file, the more memory we need. In the long term this approach might lead to serious consequences, even the application being killed.
Now, we need to ask ourselves a question. Can we process our files line by line? If so, then we can read our files in a different way:
.new + #each
file = nil
profile do
  file = File.new("large_1G.txt")
  file.each { |line| line }
end
Objects Freed: 4100808
Time: 2.08 seconds
Memory usage: 57.68 MB
.new + #advise + #each
file = nil
profile do
  file = File.new("large_1G.txt")
  file.advise(:sequential)
  file.each { |line| line }
end
Objects Freed: 4100808
Time: 2.22 seconds
Memory usage: 55.71 MB
Calling the #advise method announces an intention to access data from the current file in a specific pattern. Using #advise brought no major improvement here.
.new + #read - reading chunk by chunk
file = nil
chunk_size = 4096
buf = ""
profile do
  file = File.new("large_1G.txt")
  while buf = file.read(chunk_size)
    buf.tap { |buf| buf }
  end
end
Objects Freed: 256037
Time: 1.27 seconds
Memory usage: 131.64 MB
We defined the chunk size as 4096 bytes and read our file chunk by chunk. Depending on the structure of your file, this approach might be useful.
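IO#read also accepts an optional second argument, a String that receives the data, so the same buffer can be reused for every chunk instead of allocating a new String each time. The sketch below counts newline bytes this way; the helper name count_newlines and the temp file are illustrative assumptions, and the effect on the numbers above has not been measured here.

```ruby
require 'tmpdir'

# Hypothetical helper: counts newline bytes while reading fixed-size chunks.
def count_newlines(path, chunk_size: 4096)
  buffer = String.new(capacity: chunk_size) # reused for every read
  count = 0
  File.open(path, 'rb') do |file|
    # IO#read(length, outbuf) fills the given String and returns nil at EOF.
    while file.read(chunk_size, buffer)
      count += buffer.count("\n")
    end
  end
  count
end

path = File.join(Dir.mktmpdir, 'chunks.txt') # illustrative temp file
File.write(path, "line\n" * 10_000)
puts count_newlines(path)
```

Note that a chunk boundary can fall in the middle of a line, so chunked reading fits best when the processing does not depend on line boundaries.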
.foreach + #each_entry
file = nil
profile do
  file = File.foreach("large_1G.txt")
  file.each_entry { |line| line }
end
Objects Freed: 4100809
Time: 2.22 seconds
Memory usage: 53.02 MB
Here we create an Enumerator instance as file and read the file line by line using the each_entry method.
The first thing we notice is that memory usage is far lower. The main reason is that we read the file line by line (or chunk by chunk), and once a line has been processed it can be garbage collected. The high Objects Freed count reflects this.
We also tried the #advise method, which lets us declare how we intend to access the file. More about IO#advise can be found in the documentation. Unfortunately, it didn't help us here.
Besides IO#each, there are similar methods such as IO#each_byte (reading byte by byte), IO#each_char (reading character by character) and IO#each_codepoint.
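To make the difference between these iterators concrete, here is a small sketch on a UTF-8 file containing a few multi-byte characters. The temp file and its content are made up for illustration.

```ruby
require 'tmpdir'

path = File.join(Dir.mktmpdir, 'utf8.txt') # illustrative temp file
# Three two-byte characters (U+017C, U+00F3, U+0142), then "w" and a newline.
File.write(path, "\u017C\u00F3\u0142w\n", encoding: 'UTF-8')

bytes = File.open(path, 'r:UTF-8') { |f| f.each_byte.to_a }
chars = File.open(path, 'r:UTF-8') { |f| f.each_char.to_a }
codepoints = File.open(path, 'r:UTF-8') { |f| f.each_codepoint.to_a }

puts bytes.size       # bytes on disk
puts chars.size       # characters, independent of their byte width
puts codepoints.first # Integer code point of the first character
```

each_byte yields Integers, each_char yields one-character Strings, and each_codepoint yields Integer code points, so the right choice depends on whether you care about raw bytes or decoded text.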
In the example with reading by chunks (IO#read), the memory usage will vary depending on the chunk size. If you find this approach useful, you can experiment with the chunk size.
When called without a block, IO.foreach returns an Enumerator, which gives us a few more methods from Enumerable, such as each_entry, each_slice and each_cons. There is also the lazy method, which returns an Enumerator::Lazy. A lazy Enumerator has a few additional methods that enumerate values only on an as-needed basis. If you don't need to read the entire file but are, for example, looking for a particular line containing a given expression, then it might be worth checking out.
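As a rough illustration of both ideas, the sketch below uses a lazy Enumerator to stop reading as soon as two matching lines are found, and each_slice to walk the file in batches. The temp file, its content and the pattern are assumptions made for the example.

```ruby
require 'tmpdir'

path = File.join(Dir.mktmpdir, 'log.txt') # illustrative temp file
File.write(path, (1..1_000).map { |i| "entry #{i}\n" }.join)

# Lazy: lines are pulled from the file only until two matches have been found.
first_matches = File.foreach(path).lazy.grep(/7/).map(&:chomp).first(2)

# each_slice: process the file in batches of 100 lines.
batch_sizes = File.foreach(path).each_slice(100).map(&:size)

p first_matches
p batch_sizes.sum
```

Without lazy, grep would scan all lines before first(2) could pick its two results; with lazy, the chain stops at the second match.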
I could finish the article at this point, but what if we need to decrypt the file before we even start reading it? Let's move on to an example.
Decrypting large file and processing it line by line
Prerequisites
Before we decrypt the file, we need to encrypt our generated large file. We are going to use AES with a 256-bit key in Cipher Block Chaining (CBC) mode.
require 'openssl'

cipher = OpenSSL::Cipher::AES256.new(:CBC)
cipher.encrypt
KEY = cipher.random_key
IV = cipher.random_iv
Now, let's encrypt our file:
cipher = OpenSSL::Cipher::AES256.new(:CBC)
cipher.encrypt
cipher.key = KEY
cipher.iv = IV
file = nil
enc_file = nil
profile do
  file = File.read("large_1G.txt")
  enc_file = File.open("large_1G.txt.enc", "wb")
  enc_file << cipher.update(file)
  enc_file << cipher.final
end
# File.read returns a String, not a File, so only enc_file needs closing.
enc_file.close
Objects Freed: 12
Time: 3.6 seconds
Memory usage: 1000.02 MB
It seems that encrypting is also quite a memory-consuming task. Let's adjust the algorithm a little bit:
cipher = OpenSSL::Cipher::AES256.new(:CBC)
cipher.encrypt
cipher.key = KEY
cipher.iv = IV
file = nil
enc_file = nil
profile do
  buf = ""
  file = File.open("large_1G.txt", "rb")
  enc_file = File.open("large_1G.txt.enc", "wb")
  while buf = file.read(4096)
    enc_file << cipher.update(buf)
  end
  enc_file << cipher.final
end
file.close
enc_file.close
Objects Freed: 768048
Time: 5.05 seconds
Memory usage: 145.93 MB
Reading and encrypting the file in chunks made the task much less memory-consuming.
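The chunked loop can be wrapped in a small helper that takes an already configured cipher, so the same code serves both encryption and decryption, and block-form File.open closes both files even if an error is raised. This is a sketch under assumptions: the name stream_cipher, the temp file names and the 10,000-byte sample are all made up for illustration.

```ruby
require 'openssl'
require 'securerandom'
require 'tmpdir'

# Hypothetical helper: pipes src_path through a configured cipher into dst_path,
# one chunk at a time. Block-form File.open closes both files automatically.
def stream_cipher(cipher, src_path, dst_path, chunk_size: 4096)
  File.open(src_path, 'rb') do |src|
    File.open(dst_path, 'wb') do |dst|
      while (chunk = src.read(chunk_size))
        dst << cipher.update(chunk)
      end
      dst << cipher.final # flushes the last (padded) block
    end
  end
end

dir = Dir.mktmpdir
plain_path = File.join(dir, 'plain.bin') # illustrative sample file
enc_path = File.join(dir, 'plain.bin.enc')
File.binwrite(plain_path, SecureRandom.random_bytes(10_000))

cipher = OpenSSL::Cipher.new('aes-256-cbc')
cipher.encrypt
key = cipher.random_key
iv = cipher.random_iv

stream_cipher(cipher, plain_path, enc_path)
puts File.size(enc_path)
```

Passing a cipher set up with decrypt, the same key and the same IV turns the helper into the reverse operation.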
Decrypt
All right, let's try to decrypt it now:
decipher = OpenSSL::Cipher::AES256.new(:CBC)
decipher.decrypt
decipher.key = KEY
decipher.iv = IV
dec_file = nil
enc_file = nil
profile do
  buf = ""
  enc_file = File.open("large_1G.txt.enc", "rb")
  dec_file = File.open("large_1G.txt.dec", "wb")
  while buf = enc_file.read(4096)
    dec_file << decipher.update(buf)
  end
  dec_file << decipher.final
end
dec_file.close
enc_file.close
Objects Freed: 768050
Time: 3.5 seconds
Memory usage: 152.12 MB
Now, let's compare our files to check whether we encrypted and decrypted them properly:
❯ diff large_1G.txt large_1G.txt.dec
No differences were found. We are good here!
We managed to lower the memory usage quite significantly. That's great!
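If the goal is to process the decrypted content line by line, writing a decrypted copy to disk is not strictly necessary. One possible approach, sketched below, decrypts chunk by chunk, keeps the plaintext that does not yet end in a newline in a small buffer, and yields each complete line to a block. The method name each_decrypted_line, its parameters and the sample data are assumptions for illustration, and the sketch has not been profiled against the 1GB file.

```ruby
require 'openssl'
require 'tmpdir'

# Hypothetical helper: decrypts enc_path in chunks and yields one line at a time.
def each_decrypted_line(enc_path, key:, iv:, chunk_size: 4096, &block)
  decipher = OpenSSL::Cipher.new('aes-256-cbc')
  decipher.decrypt
  decipher.key = key
  decipher.iv = iv

  pending = String.new # plaintext that has not been terminated by "\n" yet
  flush_lines = lambda do |plaintext|
    pending << plaintext
    while (newline_at = pending.index("\n"))
      block.call(pending.slice!(0..newline_at)) # hand over one complete line
    end
  end

  File.open(enc_path, 'rb') do |enc_file|
    while (chunk = enc_file.read(chunk_size))
      flush_lines.call(decipher.update(chunk))
    end
  end
  flush_lines.call(decipher.final)
  block.call(pending) unless pending.empty? # last line without trailing newline
end

# Usage on a tiny illustrative sample instead of the 1GB file.
cipher = OpenSSL::Cipher.new('aes-256-cbc')
cipher.encrypt
key = cipher.random_key
iv = cipher.random_iv
enc_path = File.join(Dir.mktmpdir, 'sample.enc')
File.binwrite(enc_path, cipher.update("alpha\nbeta\ngamma") + cipher.final)

lines = []
each_decrypted_line(enc_path, key: key, iv: iv, chunk_size: 16) { |line| lines << line }
p lines
```

Memory use then depends on the chunk size and the longest line rather than on the file size, although a file with very long lines would make the pending buffer grow accordingly.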
Treat this article as a toolset that you can use in your specific case.