Log4j file rotation

From EggeWiki

If you have an application writing to a log file using log4j, under what circumstances could the log rotation in DailyRollingFileAppender cause the hosting process to fail?

When it fails to rotate, you may see a message like:

Caused by: java.lang.IllegalStateException: Current state = FLUSHED, new state = CODING
        at java.nio.charset.CharsetEncoder.throwIllegalStateException(CharsetEncoder.java:941)
        at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:537)
        at sun.nio.cs.StreamEncoder$CharsetSE.implWrite(StreamEncoder.java:384)
        at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:136)
        at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:146)
        at java.io.OutputStreamWriter.write(OutputStreamWriter.java:204)
        at java.io.Writer.write(Writer.java:126)
        at org.apache.log4j.helpers.QuietWriter.write(QuietWriter.java:47)
        at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:301)
        at org.apache.log4j.DailyRollingFileAppender.subAppend(DailyRollingFileAppender.java:358)
        at org.apache.log4j.WriterAppender.append(WriterAppender.java:159)
        at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:230)
        at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:65)
        at org.apache.log4j.Category.callAppenders(Category.java:203)
        at org.apache.log4j.Category.forcedLog(Category.java:388)
        at org.apache.log4j.Category.log(Category.java:853)
        at org.jboss.logging.Log4jLoggerPlugin.info(Log4jLoggerPlugin.java:161)
        at org.jboss.logging.Logger.info(Logger.java:189)
        at org.jboss.system.server.ServerImpl.shutdown(ServerImpl.java:654)

I spent some time investigating this today, and here are my findings. First, let's take a look at the log4j code which rotates the log file.

http://www.docjar.com/html/api/org/apache/log4j/DailyRollingFileAppender.java.html

      File file = new File(fileName);
      boolean result = file.renameTo(target);
      if(result) {
        LogLog.debug(fileName +" -> "+ scheduledFilename);
      } else {
        LogLog.error("Failed to rename ["+fileName+"] to ["+scheduledFilename+"].");
      }

      try {
        // This will also close the file. This is OK since multiple
        // close operations are safe.
        this.setFile(fileName, false, this.bufferedIO, this.bufferSize);
      }
      catch(IOException e) {
        errorHandler.error("setFile("+fileName+", false) call failed.");
      }

OK. So it looks like log4j will attempt to rename the file, and if the rename fails, it'll continue to log to the original file. We'll see a message like this sent to stderr:

log4j:ERROR Failed to rename [test.log] to [test.log.2008-09-26-13-35].

How does File#renameTo work inside the JVM? Well, it looks like it just makes the rename system call with the two paths.

https://openjdk.dev.java.net/source/browse/openjdk/jdk/trunk/jdk/src/solaris/native/java/io/UnixFileSystem_md.c?rev=257&view=markup

JNIEXPORT jboolean JNICALL
Java_java_io_UnixFileSystem_rename0(JNIEnv *env, jobject this,
                                    jobject from, jobject to)
{
    jboolean rv = JNI_FALSE;
 
    WITH_FIELD_PLATFORM_STRING(env, from, ids.path, fromPath) {
        WITH_FIELD_PLATFORM_STRING(env, to, ids.path, toPath) {
            if (rename(fromPath, toPath) == 0) {
                rv = JNI_TRUE;
            }
        } END_PLATFORM_STRING(env, toPath);
    } END_PLATFORM_STRING(env, fromPath);
    return rv;
}
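So File#renameTo maps straight onto rename(2), which on POSIX atomically replaces an existing target, while on Windows the rename fails if the target exists or either file is held open. Here's a minimal sketch of the POSIX behavior (the file names are just for illustration):

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class RenameSemantics {
    public static void main(String[] args) throws IOException {
        // Create a source file and a pre-existing target.
        File from = new File("rename-from.txt");
        File to = new File("rename-to.txt");
        new FileWriter(from).append("new").close();
        new FileWriter(to).append("old").close();

        // On POSIX, rename(2) atomically replaces the target;
        // renameTo just reports success or failure as a boolean,
        // which is all log4j has to go on.
        boolean ok = from.renameTo(to);
        System.out.println("renamed=" + ok + " sourceExists=" + from.exists());
    }
}
```

On Solaris or Linux this succeeds even though the target already exists; on Windows the same call returns false.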

OK. This is pretty simple. So how could we cause rename to break? Well, here's a basic test harness:

import java.io.File;
import java.io.IOException;

public class RenameLoop {
    public static void main(String[] args) throws IOException {
        while (!Thread.currentThread().isInterrupted()) {
            File a = new File("a.txt");
            if (!a.exists()) {
                a.createNewFile();
            }
            File b = new File("b.txt");
            if (b.exists()) {
                b.delete();
            }
            boolean ok = a.renameTo(b);
            if (!ok) {
                System.err.println("Failed to rename a to b");
            }
            Thread.yield();
        }
    }
}

When I run this on Windows, it's pretty easy to make the program fail: I just run tail against the file. It takes a couple of tries because the file only exists some of the time, but once tail grabs a handle, the test harness starts printing errors.

OK. The interesting bit is when we run the test harness on Solaris. On my Solaris test system, I have three different file systems - local disk, Netapp, and Solaris NFS. They're all a little different in how they handle file locks.

I ran my test harness on the Solaris NFS mount, but I couldn't get it to break using 'tail'.

If I look at my tail process, I notice that its handle to the file has gone stale.

>> pfiles 69
69:     tail -f a.txt
  Current rlimit: 256 file descriptors
   1: S_IFCHR mode:0620 dev:295,0 ino:12582918 uid:4911 gid:7 rdev:24,1
      O_RDWR|O_NOCTTY|O_LARGEFILE
      /devices/pseudo/pts@0:1
   2: S_IFCHR mode:0620 dev:295,0 ino:12582918 uid:4911 gid:7 rdev:24,1
      O_RDWR|O_NOCTTY|O_LARGEFILE
      /devices/pseudo/pts@0:1
   0: Stale NFS file handle

This is normal for a SAN, but slightly different for a Netapp mount. Running on the Netapp, it's REALLY, REALLY hard to get a stale NFS handle. It generally takes something like a network interruption to produce a stale handle on the Netapp. Here's the same test on a Netapp:

$ pfiles 27297 
27297:  tail -f a.txt
  Current rlimit: 256 file descriptors
   0: S_IFCHR mode:0620 dev:270,0 ino:12582924 uid:5294 gid:7 rdev:24,4
      O_RDWR|O_NOCTTY|O_LARGEFILE
      /devices/pseudo/pts@0:4
   1: S_IFCHR mode:0620 dev:270,0 ino:12582924 uid:5294 gid:7 rdev:24,4
      O_RDWR|O_NOCTTY|O_LARGEFILE
      /devices/pseudo/pts@0:4
   2: S_IFCHR mode:0620 dev:270,0 ino:12582924 uid:5294 gid:7 rdev:24,4
      O_RDWR|O_NOCTTY|O_LARGEFILE
      /devices/pseudo/pts@0:4
   3: S_IFREG mode:0644 dev:284,1 ino:17193816 uid:5294 gid:200 size:0
      O_RDONLY|O_NONBLOCK|O_LARGEFILE

The last device is my file. The Netapp will keep a temporary file in existence until my process releases its handle. Here's the file my process is holding open:

$ ls -ltra .nfs*
-rw-r--r-- 1 user lehman 0 2008-09-26 14:43 .nfsCF00C

The Netapp has basically renamed the file instead of deleting it, allowing my process to continue to use it until the process exits.
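This ".nfs" rename is the NFS server's way of emulating local POSIX unlink semantics, where deleting an open file only removes the directory entry and the inode survives until the last descriptor closes. A local-filesystem sketch of that behavior (this won't behave the same way on Windows, where the delete is refused while the file is open):

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class UnlinkWhileOpen {
    public static void main(String[] args) throws IOException {
        File f = new File("open-and-delete.txt");
        FileWriter w = new FileWriter(f);
        w.write("before delete\n");
        w.flush();

        // On a local POSIX filesystem the delete succeeds even though
        // the file is open; our writer keeps a valid handle to the
        // now-unlinked inode, much as tail does on the Netapp.
        boolean deleted = f.delete();
        w.write("after delete\n");   // still succeeds; the data is just invisible
        w.close();

        System.out.println("deleted=" + deleted + " exists=" + f.exists());
    }
}
```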

So, I've shown that you can break a program by tailing its log file in another process on Windows, but not on Solaris. Well, I'll try some more tests to see if I can get Solaris to break. I've created this test harness for log4j and a corresponding log4j.xml file.

import org.apache.log4j.Logger;

public class AppendIt {
    public static void main(String[] args) {
        Logger log = Logger.getLogger(AppendIt.class);
        long i = 0;
        while (!Thread.currentThread().isInterrupted()) {
            log.info(i++);
            long now = System.currentTimeMillis();
            // a second's worth of silent DEBUG logging between INFO lines
            while (System.currentTimeMillis() - now < 1000) {
                log.debug(i++);
            }
        }
    }
}

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
 
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/" debug="false">
 
   <!-- A time/date based rolling appender -->
   <appender name="FILE" class="org.apache.log4j.DailyRollingFileAppender">
      <param name="File" value="test.log"/>
      <param name="Append" value="true"/>
 
      <!-- Rollover every minute -->
      <param name="DatePattern" value="'.'yyyy-MM-dd-HH-mm"/>
 
      <layout class="org.apache.log4j.PatternLayout">
         <!-- The default pattern: Date Priority [Category] Message\n -->
         <param name="ConversionPattern" value="%d %-5p [%c] %m%n"/>
      </layout>
   </appender>
 
   <root>
      <priority value="INFO"/>
      <appender-ref ref="FILE"/>
   </root>
 
</log4j:configuration>

I've tried my above 'tail' commands while running the test harness on all three file systems, but still no luck.

Deleting the output file causes log4j to print this error: log4j:ERROR Failed to rename [test.log] to [test.log.2008-09-26-13-35]. - but then it continues without incident.
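That matches renameTo's contract: if the source file has vanished, rename(2) fails and renameTo simply returns false, so log4j's rollOver prints its ERROR line and keeps the existing writer. A quick sketch (the file names are illustrative):

```java
import java.io.File;

public class RenameMissingSource {
    public static void main(String[] args) {
        // Simulate the log file being deleted out from under the appender.
        File gone = new File("deleted-test.log");
        gone.delete(); // make sure it doesn't exist

        // rename(2) fails with ENOENT; renameTo reports it as a plain
        // false, which is all log4j sees before logging its error.
        boolean ok = gone.renameTo(new File("deleted-test.log.2008-09-26"));
        System.out.println("renamed=" + ok);
    }
}
```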

OK... now I'll try gzipping the file while it's being written to.

log4j:ERROR Failed to rename [test.log] to [test.log.2008-09-26-13-35].
log4j:ERROR Could not close org.apache.log4j.helpers.QuietWriter@1c9b9ca
java.io.IOException: Stale NFS file handle
        at java.io.FileOutputStream.close0(Native Method)
        at java.io.FileOutputStream.close(FileOutputStream.java:279)
        at sun.nio.cs.StreamEncoder$CharsetSE.implClose(StreamEncoder.java:431)
        at sun.nio.cs.StreamEncoder.close(StreamEncoder.java:160)
        at java.io.OutputStreamWriter.close(OutputStreamWriter.java:222)
        at java.io.FilterWriter.close(FilterWriter.java:92)
        at org.apache.log4j.FileAppender.closeFile(FileAppender.java:184)
        at org.apache.log4j.DailyRollingFileAppender.rollOver(DailyRollingFileAppender.java:313)
        at org.apache.log4j.DailyRollingFileAppender.subAppend(DailyRollingFileAppender.java:353)
        at org.apache.log4j.WriterAppender.append(WriterAppender.java:160)
        at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
        at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
        at org.apache.log4j.Category.callAppenders(Category.java:206)
        at org.apache.log4j.Category.forcedLog(Category.java:391)
        at org.apache.log4j.Category.info(Category.java:666)
        at AppendIt.main(AppendIt.java:13)
log4j:ERROR Failed to rename [test.log] to [test.log.2008-09-26-13-42].
Exception in thread "main" java.lang.IllegalStateException: Current state = FLUSHED, new state = CODING_END
        at java.nio.charset.CharsetEncoder.throwIllegalStateException(CharsetEncoder.java:941)
        at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:537)
        at sun.nio.cs.StreamEncoder$CharsetSE.flushLeftoverChar(StreamEncoder.java:358)
        at sun.nio.cs.StreamEncoder$CharsetSE.implClose(StreamEncoder.java:414)
        at sun.nio.cs.StreamEncoder.close(StreamEncoder.java:160)
        at java.io.OutputStreamWriter.close(OutputStreamWriter.java:222)
        at java.io.FilterWriter.close(FilterWriter.java:92)
        at org.apache.log4j.FileAppender.closeFile(FileAppender.java:184)
        at org.apache.log4j.FileAppender.reset(FileAppender.java:339)
        at org.apache.log4j.FileAppender.setFile(FileAppender.java:284)
        at org.apache.log4j.DailyRollingFileAppender.rollOver(DailyRollingFileAppender.java:331)
        at org.apache.log4j.DailyRollingFileAppender.subAppend(DailyRollingFileAppender.java:353)
        at org.apache.log4j.WriterAppender.append(WriterAppender.java:160)
        at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
        at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
        at org.apache.log4j.Category.callAppenders(Category.java:206)
        at org.apache.log4j.Category.forcedLog(Category.java:391)
        at org.apache.log4j.Category.info(Category.java:666)
        at AppendIt.main(AppendIt.java:13)

Cool - this crashed my test program. In JBoss, this would likely cause the running thread to exit. It probably dies silently, since it can't log the problem it's having. In my test harness, the main thread writes to stderr when it hits an exception.
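If you can't prevent the appender from reaching this broken-writer state, the calling code can at least survive it. This is a hypothetical guard, not a log4j fix - the flakyLog method below just simulates an appender that throws IllegalStateException mid-run, and the loop carries on instead of dying:

```java
public class SafeLogLoop {
    // Stand-in for a log4j call whose appender is wedged after a
    // failed rollover (hypothetical; real code would call log.info).
    static void flakyLog(long i, boolean broken) {
        if (broken) {
            throw new IllegalStateException("Current state = FLUSHED, new state = CODING");
        }
        // a healthy appender would write the event here
    }

    public static void main(String[] args) {
        int failures = 0;
        for (long i = 0; i < 5; i++) {
            try {
                flakyLog(i, i == 2);
            } catch (RuntimeException e) {
                // Report somewhere other than the broken appender.
                System.err.println("logging failed: " + e.getMessage());
                failures++;
            }
        }
        System.out.println("loop survived, failures=" + failures);
    }
}
```

The trade-off is that every logging call site needs the guard (or a wrapper), and the failures are only visible on stderr.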

Lessons learned: on Windows, don't tail a log file unless you're in a dev environment. On Solaris you can tail a file, but don't gzip it in place, and try not to delete it.

See Also

JBoss log4j rotation
