Log4j file rotation
If you have an application writing to a log file using log4j, under what circumstances could the log rotation in DailyRollingFileAppender cause the hosting process to fail?
When it fails to rotate, you may see a message like:
<syntaxhighlight lang="java5">
Caused by: java.lang.IllegalStateException: Current state = FLUSHED, new state = CODING
at java.nio.charset.CharsetEncoder.throwIllegalStateException(CharsetEncoder.java:941)
at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:537)
at sun.nio.cs.StreamEncoder$CharsetSE.implWrite(StreamEncoder.java:384)
at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:136)
at sun.nio.cs.StreamEncoder.write(StreamEncoder.java:146)
at java.io.OutputStreamWriter.write(OutputStreamWriter.java:204)
at java.io.Writer.write(Writer.java:126)
at org.apache.log4j.helpers.QuietWriter.write(QuietWriter.java:47)
at org.apache.log4j.WriterAppender.subAppend(WriterAppender.java:301)
at org.apache.log4j.DailyRollingFileAppender.subAppend(DailyRollingFileAppender.java:358)
at org.apache.log4j.WriterAppender.append(WriterAppender.java:159)
at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:230)
at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:65)
at org.apache.log4j.Category.callAppenders(Category.java:203)
at org.apache.log4j.Category.forcedLog(Category.java:388)
at org.apache.log4j.Category.log(Category.java:853)
at org.jboss.logging.Log4jLoggerPlugin.info(Log4jLoggerPlugin.java:161)
at org.jboss.logging.Logger.info(Logger.java:189)
at org.jboss.system.server.ServerImpl.shutdown(ServerImpl.java:654)
</syntaxhighlight>
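For context, this exception comes out of CharsetEncoder's internal state machine: the StreamEncoder under log4j's writer has already been flushed (presumably as part of a close that didn't complete cleanly), and the next write asks it to start coding again without a reset. Here's a minimal standalone sketch of mine, not the log4j code path, that produces the same message:

<syntaxhighlight lang="java5">
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class EncoderStateDemo {
    public static void main(String[] args) throws Exception {
        CharsetEncoder enc = Charset.forName("UTF-8").newEncoder();
        ByteBuffer out = ByteBuffer.allocate(64);

        enc.encode(CharBuffer.wrap("a"), out, true);   // normal encode, endOfInput = true
        enc.flush(out);                                // encoder is now in the FLUSHED state

        // The next encode asks for the CODING state without a reset() and throws:
        // java.lang.IllegalStateException: Current state = FLUSHED, new state = CODING
        enc.encode(CharBuffer.wrap("b"), out, false);
    }
}
</syntaxhighlight>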
I spent some time investigating this today, and here are my findings. First, let's take a look at the log4j code that rotates the log file.
http://www.docjar.com/html/api/org/apache/log4j/DailyRollingFileAppender.java.html
<syntaxhighlight lang="java5">
320 File file = new File(fileName);
321 boolean result = file.renameTo(target);
322 if(result) {
323 LogLog.debug(fileName +" -> "+ scheduledFilename);
324 } else {
325 LogLog.error("Failed to rename ["+fileName+"] to ["+scheduledFilename+"].");
326 }
327
328 try {
329 // This will also close the file. This is OK since multiple
330 // close operations are safe.
331 this.setFile(fileName, false, this.bufferedIO, this.bufferSize);
332 }
333 catch(IOException e) {
334 errorHandler.error("setFile("+fileName+", false) call failed.");
335 }
</syntaxhighlight>
OK. So it looks like log4j will attempt to rename the file, and if it fails, it'll continue to log to the original file. We'll see a message like this sent to stderr:
log4j:ERROR Failed to rename [test.log] to [test.log.2008-09-26-13-35].
How does File#renameTo work inside the JVM? Well, it looks like it just makes a system call with the two paths.
<syntaxhighlight lang="c">
JNIEXPORT jboolean JNICALL
Java_java_io_UnixFileSystem_rename0(JNIEnv *env, jobject this,
jobject from, jobject to)
{
jboolean rv = JNI_FALSE;
WITH_FIELD_PLATFORM_STRING(env, from, ids.path, fromPath) {
WITH_FIELD_PLATFORM_STRING(env, to, ids.path, toPath) {
if (rename(fromPath, toPath) == 0) {
rv = JNI_TRUE;
}
} END_PLATFORM_STRING(env, toPath);
} END_PLATFORM_STRING(env, fromPath);
return rv;
}
</syntaxhighlight>
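One practical consequence of the native code above: the errno from rename(2) never makes it back to Java, so renameTo just hands back a boolean and log4j can't say why a rotation failed. This is a hypothetical helper of mine, not anything in log4j, that squeezes out what little can be observed from the Java side when a rename fails:

<syntaxhighlight lang="java5">
import java.io.File;

public class RenameDiagnostics {
    public static void main(String[] args) {
        // Hypothetical paths, just to exercise the helper.
        File from = new File("test.log");
        File to = new File("test.log.2008-09-26-13-35");
        if (!from.renameTo(to)) {
            explainRenameFailure(from, to);
        }
    }

    // Dump whatever we can observe about the two paths when renameTo returns false.
    static void explainRenameFailure(File from, File to) {
        System.err.println("rename failed: " + from + " -> " + to);
        System.err.println("  from.exists()   = " + from.exists());
        System.err.println("  from.canWrite() = " + from.canWrite());
        System.err.println("  to.exists()     = " + to.exists());
        File parent = to.getAbsoluteFile().getParentFile();
        System.err.println("  parent writable = " + (parent != null && parent.canWrite()));
    }
}
</syntaxhighlight>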
OK, the rename call itself is pretty simple. So how could we cause it to break? Well, here's a basic test harness.
<syntaxhighlight lang="java5">
import java.io.File;
import java.io.IOException;

public class RenameIt {
    public static void main(String[] args) throws IOException {
        while (!Thread.currentThread().isInterrupted()) {
            File a = new File("a.txt");
            if (!a.exists()) {
                a.createNewFile();
            }
            File b = new File("b.txt");
            if (b.exists()) {
                b.delete();
            }
            // renameTo only reports success or failure; there is no reason given.
            boolean ok = a.renameTo(b);
            if (!ok) {
                System.err.println("Failed to rename a to b");
            }
            Thread.yield();
        }
    }
}
</syntaxhighlight>
When I run this on Windows, it's pretty easy to get the program to fail. I just need to run tail against it. It takes a couple of tries because the file is only there some of the time, but once tail gets hold of the file, the test harness will start to print errors.
OK. The interesting bit is when we run the test harness on Solaris. On my Solaris test system, I have three different file systems - local disk, Netapp, and Solaris NFS. They're all a little different in how they handle file locks.
I run my test harness on the Solaris NFS mount, but I can't get it to break by using 'tail'.
If I look at my tail program, though, I notice that its handle to the file has gone stale.
<syntaxhighlight lang="bash">
>> pfiles 69
69: tail -f a.txt
Current rlimit: 256 file descriptors
1: S_IFCHR mode:0620 dev:295,0 ino:12582918 uid:4911 gid:7 rdev:24,1
O_RDWR|O_NOCTTY|O_LARGEFILE
/devices/pseudo/pts@0:1
2: S_IFCHR mode:0620 dev:295,0 ino:12582918 uid:4911 gid:7 rdev:24,1
O_RDWR|O_NOCTTY|O_LARGEFILE
/devices/pseudo/pts@0:1
0: Stale NFS file handle
</syntaxhighlight>
This is normal for a SAN, but slightly different for a Netapp mount. Running on the Netapp, it's REALLY, REALLY hard to get a stale NFS handle. It generally takes something like a network interruption to get a stale handle on the Netapp. Here's the same test on a Netapp:
<syntaxhighlight lang="bash">
$ pfiles 27297
27297: tail -f a.txt
Current rlimit: 256 file descriptors
0: S_IFCHR mode:0620 dev:270,0 ino:12582924 uid:5294 gid:7 rdev:24,4
O_RDWR|O_NOCTTY|O_LARGEFILE
/devices/pseudo/pts@0:4
1: S_IFCHR mode:0620 dev:270,0 ino:12582924 uid:5294 gid:7 rdev:24,4
O_RDWR|O_NOCTTY|O_LARGEFILE
/devices/pseudo/pts@0:4
2: S_IFCHR mode:0620 dev:270,0 ino:12582924 uid:5294 gid:7 rdev:24,4
O_RDWR|O_NOCTTY|O_LARGEFILE
/devices/pseudo/pts@0:4
3: S_IFREG mode:0644 dev:284,1 ino:17193816 uid:5294 gid:200 size:0
O_RDONLY|O_NONBLOCK|O_LARGEFILE
</syntaxhighlight>
The last entry is my file. With the Netapp, it will keep a temporary file in existence until my process releases its handle. Here's the file I have a handle on:
<syntaxhighlight lang="bash">
$ ls -ltra .nfs*
-rw-r--r-- 1 user lehman 0 2008-09-26 14:43 .nfsCF00C
</syntaxhighlight>
The Netapp has basically renamed the file instead of deleting it, allowing my process to continue to use it until the process exits.
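You can watch the same thing from Java. This is a throwaway sketch of mine, assuming the working directory sits on the NFS mount and using a made-up file name: keep a handle open, delete the file, and the writes keep working; on the Netapp mount a .nfsXXXX file hangs around until the handle is closed.

<syntaxhighlight lang="java5">
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class HoldOpenAfterDelete {
    public static void main(String[] args) throws IOException, InterruptedException {
        File f = new File("held.txt");        // assumes the current directory is on the NFS mount
        FileWriter w = new FileWriter(f);
        w.write("before delete\n");
        w.flush();

        // Unlink the file while the handle is still open.
        System.out.println("deleted: " + f.delete());

        // Writes still succeed; they go to the unlinked (silly-renamed) inode.
        w.write("after delete\n");
        w.flush();

        // Plenty of time to run 'ls -a .nfs*' from another shell.
        Thread.sleep(30000);
        w.close();                            // the .nfs* file disappears once the handle is released
    }
}
</syntaxhighlight>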
So, I've shown that you can break a program by tailing it in another process on Windows, but not on Solaris. Well, I'll try some more tests to see if I can get Solaris to break. I've created this test harness for log4j and a corresponding log4j.xml file.
<syntaxhighlight lang="java5">
import org.apache.log4j.Logger;

public class AppendIt {
    public static void main(String[] args) {
        Logger log = Logger.getLogger(AppendIt.class);
        long i = 0;
        while (!Thread.currentThread().isInterrupted()) {
            log.info(i++);
            long now = System.currentTimeMillis();
            // gobs of silent logging
            while (System.currentTimeMillis() - now < 1000) {
                log.debug(i++);
            }
        }
    }
}
</syntaxhighlight>
<syntaxhighlight lang="xml">
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/" debug="false">
<!-- A time/date based rolling appender -->
<appender name="FILE" class="org.apache.log4j.DailyRollingFileAppender">
<param name="File" value="test.log"/>
<param name="Append" value="true"/>
<!-- Rollover every minute -->
<param name="DatePattern" value="'.'yyyy-MM-dd-HH-mm"/>
<layout class="org.apache.log4j.PatternLayout">
<!-- The default pattern: Date Priority [Category] Message\n -->
<param name="ConversionPattern" value="%d %-5p [%c] %m%n"/>
</layout>
</appender>
<root>
<priority value="INFO"/>
<appender-ref ref="FILE"/>
</root>
</log4j:configuration>
</syntaxhighlight>
I've tried the above 'tail' commands while running the test harness on all three file systems, but still no luck.
Deleting the output file causes log4j to output the error "log4j:ERROR Failed to rename [test.log] to [test.log.2008-09-26-13-35]." but then it continues without incident.
OK... now I'll try to gzip the file being written to.
<syntaxhighlight lang="java5">
log4j:ERROR Failed to rename [test.log] to [test.log.2008-09-26-13-35].
log4j:ERROR Could not close org.apache.log4j.helpers.QuietWriter@1c9b9ca
java.io.IOException: Stale NFS file handle
        at java.io.FileOutputStream.close0(Native Method)
        at java.io.FileOutputStream.close(FileOutputStream.java:279)
        at sun.nio.cs.StreamEncoder$CharsetSE.implClose(StreamEncoder.java:431)
        at sun.nio.cs.StreamEncoder.close(StreamEncoder.java:160)
        at java.io.OutputStreamWriter.close(OutputStreamWriter.java:222)
        at java.io.FilterWriter.close(FilterWriter.java:92)
        at org.apache.log4j.FileAppender.closeFile(FileAppender.java:184)
        at org.apache.log4j.DailyRollingFileAppender.rollOver(DailyRollingFileAppender.java:313)
        at org.apache.log4j.DailyRollingFileAppender.subAppend(DailyRollingFileAppender.java:353)
        at org.apache.log4j.WriterAppender.append(WriterAppender.java:160)
        at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
        at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
        at org.apache.log4j.Category.callAppenders(Category.java:206)
        at org.apache.log4j.Category.forcedLog(Category.java:391)
        at org.apache.log4j.Category.info(Category.java:666)
        at AppendIt.main(AppendIt.java:13)
log4j:ERROR Failed to rename [test.log] to [test.log.2008-09-26-13-42].
Exception in thread "main" java.lang.IllegalStateException: Current state = FLUSHED, new state = CODING_END
        at java.nio.charset.CharsetEncoder.throwIllegalStateException(CharsetEncoder.java:941)
        at java.nio.charset.CharsetEncoder.encode(CharsetEncoder.java:537)
        at sun.nio.cs.StreamEncoder$CharsetSE.flushLeftoverChar(StreamEncoder.java:358)
        at sun.nio.cs.StreamEncoder$CharsetSE.implClose(StreamEncoder.java:414)
        at sun.nio.cs.StreamEncoder.close(StreamEncoder.java:160)
        at java.io.OutputStreamWriter.close(OutputStreamWriter.java:222)
        at java.io.FilterWriter.close(FilterWriter.java:92)
        at org.apache.log4j.FileAppender.closeFile(FileAppender.java:184)
        at org.apache.log4j.FileAppender.reset(FileAppender.java:339)
        at org.apache.log4j.FileAppender.setFile(FileAppender.java:284)
        at org.apache.log4j.DailyRollingFileAppender.rollOver(DailyRollingFileAppender.java:331)
        at org.apache.log4j.DailyRollingFileAppender.subAppend(DailyRollingFileAppender.java:353)
        at org.apache.log4j.WriterAppender.append(WriterAppender.java:160)
        at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:251)
        at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66)
        at org.apache.log4j.Category.callAppenders(Category.java:206)
        at org.apache.log4j.Category.forcedLog(Category.java:391)
        at org.apache.log4j.Category.info(Category.java:666)
        at AppendIt.main(AppendIt.java:13)
</syntaxhighlight>
Cool - this crashed my test program. In JBoss this would likely cause the running thread to exit. It probably dies silently, as it can't log the problem it's having. In my test harness, the main thread will write to stderr if it has an exception.
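To make the "dies silently" part concrete, here's a hypothetical sketch (the class and method names are mine): a worker thread whose only output channel is the logger has no way to report the appender blowing up, so run() just unwinds and the thread vanishes unless an uncaught exception handler has been installed.

<syntaxhighlight lang="java5">
public class SilentDeathDemo {
    public static void main(String[] args) {
        // Last-resort reporting that does not depend on log4j.
        Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler() {
            public void uncaughtException(Thread t, Throwable e) {
                System.err.println("Thread " + t.getName() + " died: " + e);
            }
        });

        Thread worker = new Thread(new Runnable() {
            public void run() {
                while (!Thread.currentThread().isInterrupted()) {
                    logSomething();  // imagine log.info(...) hitting an appender in a bad state
                }
            }
        }, "worker");
        worker.start();
    }

    static void logSomething() {
        // Stand-in for a log4j call whose encoder has already been flushed.
        throw new IllegalStateException("Current state = FLUSHED, new state = CODING");
    }
}
</syntaxhighlight>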
Lessons learned - on Windows, don't try to tail a file unless you're in a dev environment. On Solaris you can tail a file, but don't zip it up, and try not to delete it.