Add pacct_struct to fix some pacct bugs.

From: KaiGai Kohei
Date: Tue Jun 20 2006 - 02:24:15 EST


Hi, I noticed three problems in pacct facility.

1. Pacct facility has a possibility to write incorrect ac_flag
in multi-threading cases.
2. There is a possibility to be waken up OOM Killer from
pacct facility. It will cause system stall.
3. If several threads are killed at same time, There is
a possibility not to pick up a part of those accountings.

The attached patch will resolve those matters.
Any comments please. Thanks,

Signed-off-by: KaiGai Kohei <kaigai@xxxxxxxxxxxxx>

[The details of those matters]

(1) about incorrect ac_flag
When pacct facility generate an 'ac_flag' field in accounting record,
it refers a task_struct of the thread which died last in the process.
But any other task_structs are ignored.

Therefore, pacct facility drops ASU flag even if root-privilege
operations are used by any other threads except the last one.
In addition, AFORK flag is always set when the thread of group-leader
didn't die last, although this process has called execve() after fork().

---- in kernel/acct.c : do_acct_process() ----
ac.ac_flag = 0;
if (current->flags & PF_FORKNOEXEC)
ac.ac_flag |= AFORK;
if (current->flags & PF_SUPERPRIV)
ac.ac_flag |= ASU;
if (current->flags & PF_DUMPCORE)
ac.ac_flag |= ACORE;
if (current->flags & PF_SIGNALED)
ac.ac_flag |= AXSIG;
----------------------------------------------
I think ASU, ACORE, AXSIG flag should be set if any task_structs
satisfy the conditions. And, AFORK flag should not be checked
for any threads except the group leader.


(2) about OOM killer
The pacct facility need an i/o operation when an accounting record
is generated. There is a possibility to wake OOM killer up.
If OOM killer is activated, it kills some processes to make them
release process memory regions.
But acct_process() is called in the killed processes context
before calling exit_mm(), so those processes cannot release
own memory. In the results, any processes stop in this point and
it finally cause a system stall.

---- in kernel/exit.c : do_exit() ------------
group_dead = atomic_dec_and_test(&tsk->signal->live);
if (group_dead) {
hrtimer_cancel(&tsk->signal->real_timer);
exit_itimers(tsk->signal);
acct_process(code);
}
:
- snip -
:
exit_mm(tsk);
----------------------------------------------
I think acct_process() should be called after exit_mm() to avoid
this matter. As mm_struct is necessary to calculate ac_mem field,
it should be kept before exit_mm().


(3) about race condition in pacct
In current 2.6.17 implementation, signal_struct refered from task_struct
is used for per-process data structure. The pacct facility also uses it
as a per-process data structure to save stime, utime, minflt, majflt.
But those members are saved in __exit_signal(). It's too late.

For example, if some threads exits at same time, pacct facility has
a possibility to drop accountings for a part of those threads.
(see, the following 'The results of original 2.6.17 kernel')
I think accounting information should be completely collected into
the per-process data structure before writing out an accounting record.


[The solution for those matters]
a. pacct_struct is newly defined, and it's deployed into signal_struct
as a per-process accounting storage.
b. acct_collect() is newly added to collect any thread's accountings
before generating accounting record and calling exit_mm().
c. Calling acct_process() is moved after exit_mm().
d. do_acct_process() becomes to refer the per-process accounting
structure, not task_strcut of the last thread.


[The results of original 2.6.17 kernel]
* test for ac_flag
# time -p ./bugacct
real 10.11
user 5.94
sys 0.17
# touch /tmp/acct-`uname -r`
# accton /tmp/acct-`uname -r`
# time -p ./bugacct
real 10.04
user 5.86
sys 0.17
-- The accounting results -----------------------------------------
FLAG BTIME ETIME UTIME STIME MEM MINFLT MAJFLT COMM
F--- 06/20-14:31:47 1003 586 16 140992 32776 0 bugacct

(*) 'F' means it didn't execve() after fork().
'P' means it used root-privilege operations.

=> original 2.6.17 drops 'P' flag, and appends undesired 'F' flag.

* test for the race condition
# time -p ./raceacct 4
real 2.59
user 7.56
sys 0.00
# time -p ./raceacct 4
real 2.27
user 6.62
sys 0.00
# time -p ./raceacct 4
real 2.16
user 6.16
sys 0.00
# time -p ./raceacct 4
real 2.87
user 8.36
sys 0.00
# time -p ./raceacct 4
real 2.77
user 8.11
sys 0.00

-- The accounting results -----------------------------------------
FLAG BTIME ETIME UTIME STIME MEM MINFLT MAJFLT COMM
F--- 06/20-11:18:57 259 596 0 28528 38 0 raceacct *
---- 06/20-11:19:14 227 662 0 28528 171 0 raceacct
F--- 06/20-11:19:18 216 509 0 28528 38 0 raceacct *
F--- 06/20-11:19:22 287 651 0 28528 39 0 raceacct *
---- 06/20-11:19:25 276 811 0 28528 170 0 raceacct
(*) = UTIME does not match with the result of 'time -p'


[The results of patched 2.6.17-kg kernel]
* test for ac_flag
# time -p ./bugacct
real 10.07
user 5.88
sys 0.17
-- The accounting results -----------------------------------------
FLAG BTIME ETIME UTIME STIME MEM MINFLT MAJFLT COMM
-P-- 06/20-14:08:06 1007 588 16 140992 32944 0 bugacct

* test for the race condition
# time -p ./raceacct 4
real 2.55
user 7.48
sys 0.00
# time -p ./raceacct 4
real 2.59
user 7.58
sys 0.00
# time -p ./raceacct 4
real 2.64
user 7.75
sys 0.00
# time -p ./raceacct 4
real 2.29
user 6.71
sys 0.00
# time -p ./raceacct 4
real 2.68
user 7.83
sys 0.00
-- The accounting results -----------------------------------------
FLAG BTIME ETIME UTIME STIME MEM MINFLT MAJFLT COMM
---- 06/20-11:09:33 255 748 0 28528 171 0 raceacct
---- 06/20-11:09:37 259 758 0 28528 171 0 raceacct
---- 06/20-11:09:41 264 775 0 28528 170 0 raceacct
---- 06/20-11:09:44 229 671 0 28528 170 0 raceacct
---- 06/20-11:09:57 268 783 0 28528 170 0 raceacct

[The test program : raceacct.c]
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

void *worker(void *dummy) {
int i, *flag = dummy;

for (i=0; i < 100000000 && *flag; i++);
*flag = 0;
pthread_exit(NULL);
}

int main(int argc, char *argv[]) {
pthread_t pthread;
int i, num = 0, flag = 1;

if (argc > 2) {
fprintf(stderr, "argument too large\n");
return 1;
} else if (argc == 2) {
num = atoi(argv[1]);
}
for (i=1; i < num; i++)
pthread_create(&pthread, NULL, worker, (void *)&flag);
worker((void *)&flag);
}

[The test program : bugacct.c]
#include <stdlib.h>
#include <string.h>
#include <pthread.h>
#include <sys/time.h>

#define BUFFER_SIZE (128 * 1024 * 1024) /* 128MB */

void *mychild(void *buffer) {
struct timeval tv;
u_int64_t t1, t2;

gettimeofday(&tv, NULL);
t1 = tv.tv_sec * 1000000 + tv.tv_usec;
t2 = t1 + 6 * 1000000;
/* heavy CPU/Mem job */
srand(tv.tv_usec);
while(t1 < t2) {
memset(buffer, rand(), BUFFER_SIZE);
gettimeofday(&tv, NULL);
t1 = tv.tv_sec * 1000000 + tv.tv_usec;
}
return NULL;
}

int main(int argc, char *argv[]) {
pthread_t pthread;
void *buffer = NULL;

buffer = malloc(BUFFER_SIZE);
if (!buffer)
return 1;

sleep(4);
pthread_create(&pthread, NULL, mychild, buffer);
nice(-1);
sleep(1);
pthread_exit(0);
}

[The patch against to 2.6.17]
--- linux-2.6.17/include/linux/acct.h 2006-06-18 10:49:35.000000000 +0900
+++ linux-2.6.17.kg/include/linux/acct.h 2006-06-19 17:04:11.000000000 +0900
@@ -123,13 +123,17 @@ struct vfsmount;
struct super_block;
extern void acct_auto_close_mnt(struct vfsmount *m);
extern void acct_auto_close(struct super_block *sb);
-extern void acct_process(long exitcode);
+extern void acct_init_pacct(struct pacct_struct *pacct, struct task_struct *tsk);
+extern void acct_collect(long exitcode, int group_dead);
+extern void acct_process(void);
extern void acct_update_integrals(struct task_struct *tsk);
extern void acct_clear_integrals(struct task_struct *tsk);
#else
#define acct_auto_close_mnt(x) do { } while (0)
#define acct_auto_close(x) do { } while (0)
-#define acct_process(x) do { } while (0)
+#define acct_init_pacct(x,y) do { } while (0)
+#define acct_collect(x,y) do { } while (0)
+#define acct_process() do { } while (0)
#define acct_update_integrals(x) do { } while (0)
#define acct_clear_integrals(task) do { } while (0)
#endif
--- linux-2.6.17/include/linux/sched.h 2006-06-18 10:49:35.000000000 +0900
+++ linux-2.6.17.kg/include/linux/sched.h 2006-06-19 16:59:16.000000000 +0900
@@ -358,6 +358,15 @@ struct sighand_struct {
spinlock_t siglock;
};

+struct pacct_struct {
+ int ac_flag;
+ long ac_exitcode;
+ cputime_t ac_utime, ac_stime;
+ unsigned long ac_mem;
+ unsigned long ac_minflt, ac_majflt;
+ struct timespec ac_start_time;
+};
+
/*
* NOTE! "signal_struct" does not have it's own
* locking, because a shared signal_struct always
@@ -449,6 +458,9 @@ struct signal_struct {
struct key *session_keyring; /* keyring inherited over fork */
struct key *process_keyring; /* keyring private to this process */
#endif
+#ifdef CONFIG_BSD_PROCESS_ACCT
+ struct pacct_struct pacct; /* per-process accounting information */
+#endif
};

/* Context switch must be unlocked if interrupts are to be enabled */
--- linux-2.6.17/kernel/acct.c 2006-06-18 10:49:35.000000000 +0900
+++ linux-2.6.17.kg/kernel/acct.c 2006-06-20 14:55:34.000000000 +0900
@@ -75,7 +75,7 @@ int acct_parm[3] = {4, 2, 30};
/*
* External references and all of the globals.
*/
-static void do_acct_process(long, struct file *);
+static void do_acct_process(struct file *);

/*
* This structure is used so that all the data protected by lock
@@ -196,7 +196,7 @@ static void acct_file_reopen(struct file
if (old_acct) {
mnt_unpin(old_acct->f_vfsmnt);
spin_unlock(&acct_globals.lock);
- do_acct_process(0, old_acct);
+ do_acct_process(old_acct);
filp_close(old_acct, NULL);
spin_lock(&acct_globals.lock);
}
@@ -419,16 +419,16 @@ static u32 encode_float(u64 value)
/*
* do_acct_process does all actual work. Caller holds the reference to file.
*/
-static void do_acct_process(long exitcode, struct file *file)
+static void do_acct_process(struct file *file)
{
+ struct signal_struct *sig = current->signal;
+ struct pacct_struct *pacct = &sig->pacct;
acct_t ac;
mm_segment_t fs;
- unsigned long vsize;
unsigned long flim;
u64 elapsed;
u64 run_time;
struct timespec uptime;
- unsigned long jiffies;

/*
* First check to see if there is enough free_space to continue
@@ -449,8 +449,8 @@ static void do_acct_process(long exitcod
/* calculate run_time in nsec*/
do_posix_clock_monotonic_gettime(&uptime);
run_time = (u64)uptime.tv_sec*NSEC_PER_SEC + uptime.tv_nsec;
- run_time -= (u64)current->group_leader->start_time.tv_sec * NSEC_PER_SEC
- + current->group_leader->start_time.tv_nsec;
+ run_time -= (u64)pacct->ac_start_time.tv_sec * NSEC_PER_SEC
+ + pacct->ac_start_time.tv_nsec;
/* convert nsec -> AHZ */
elapsed = nsec_to_AHZ(run_time);
#if ACCT_VERSION==3
@@ -469,12 +469,8 @@ static void do_acct_process(long exitcod
#endif
do_div(elapsed, AHZ);
ac.ac_btime = xtime.tv_sec - elapsed;
- jiffies = cputime_to_jiffies(cputime_add(current->utime,
- current->signal->utime));
- ac.ac_utime = encode_comp_t(jiffies_to_AHZ(jiffies));
- jiffies = cputime_to_jiffies(cputime_add(current->stime,
- current->signal->stime));
- ac.ac_stime = encode_comp_t(jiffies_to_AHZ(jiffies));
+ ac.ac_utime = encode_comp_t(jiffies_to_AHZ(cputime_to_jiffies(pacct->ac_utime)));
+ ac.ac_stime = encode_comp_t(jiffies_to_AHZ(cputime_to_jiffies(pacct->ac_stime)));
/* we really need to bite the bullet and change layout */
ac.ac_uid = current->uid;
ac.ac_gid = current->gid;
@@ -496,37 +492,14 @@ static void do_acct_process(long exitcod
old_encode_dev(tty_devnum(current->signal->tty)) : 0;
read_unlock(&tasklist_lock);

- ac.ac_flag = 0;
- if (current->flags & PF_FORKNOEXEC)
- ac.ac_flag |= AFORK;
- if (current->flags & PF_SUPERPRIV)
- ac.ac_flag |= ASU;
- if (current->flags & PF_DUMPCORE)
- ac.ac_flag |= ACORE;
- if (current->flags & PF_SIGNALED)
- ac.ac_flag |= AXSIG;
-
- vsize = 0;
- if (current->mm) {
- struct vm_area_struct *vma;
- down_read(&current->mm->mmap_sem);
- vma = current->mm->mmap;
- while (vma) {
- vsize += vma->vm_end - vma->vm_start;
- vma = vma->vm_next;
- }
- up_read(&current->mm->mmap_sem);
- }
- vsize = vsize / 1024;
- ac.ac_mem = encode_comp_t(vsize);
+ ac.ac_flag = pacct->ac_flag;
+ ac.ac_mem = encode_comp_t(pacct->ac_mem);
ac.ac_io = encode_comp_t(0 /* current->io_usage */); /* %% */
ac.ac_rw = encode_comp_t(ac.ac_io / 1024);
- ac.ac_minflt = encode_comp_t(current->signal->min_flt +
- current->min_flt);
- ac.ac_majflt = encode_comp_t(current->signal->maj_flt +
- current->maj_flt);
+ ac.ac_minflt = encode_comp_t(pacct->ac_minflt);
+ ac.ac_majflt = encode_comp_t(pacct->ac_majflt);
ac.ac_swaps = encode_comp_t(0);
- ac.ac_exitcode = exitcode;
+ ac.ac_exitcode = pacct->ac_exitcode;

/*
* Kernel segment override to datasegment and write it
@@ -545,13 +518,56 @@ static void do_acct_process(long exitcod
set_fs(fs);
}

+void acct_init_pacct(struct pacct_struct *pacct, struct task_struct *tsk)
+{
+ memset(pacct, 0, sizeof(struct pacct_struct));
+ pacct->ac_utime = pacct->ac_stime = cputime_zero;
+}
+
+void acct_collect(long exitcode, int group_dead)
+{
+ struct pacct_struct *pacct = &current->signal->pacct;
+ unsigned long vsize = 0;
+
+ if (group_dead && current->mm) {
+ struct vm_area_struct *vma;
+ down_read(&current->mm->mmap_sem);
+ vma = current->mm->mmap;
+ while (vma) {
+ vsize += vma->vm_end - vma->vm_start;
+ vma = vma->vm_next;
+ }
+ up_read(&current->mm->mmap_sem);
+ }
+ spin_lock(&current->sighand->siglock);
+ if (group_dead)
+ pacct->ac_mem = vsize / 1024;
+ if (thread_group_leader(current)) {
+ pacct->ac_start_time = current->start_time;
+ pacct->ac_exitcode = exitcode;
+ if (current->flags & PF_FORKNOEXEC)
+ pacct->ac_flag |= AFORK;
+ }
+ if (current->flags & PF_SUPERPRIV)
+ pacct->ac_flag |= ASU;
+ if (current->flags & PF_DUMPCORE)
+ pacct->ac_flag |= ACORE;
+ if (current->flags & PF_SIGNALED)
+ pacct->ac_flag |= AXSIG;
+ pacct->ac_utime = cputime_add(pacct->ac_utime, current->utime);
+ pacct->ac_stime = cputime_add(pacct->ac_stime, current->stime);
+ pacct->ac_minflt += current->min_flt;
+ pacct->ac_majflt += current->maj_flt;
+ spin_unlock(&current->sighand->siglock);
+}
+
/**
* acct_process - now just a wrapper around do_acct_process
* @exitcode: task exit code
*
* handles process accounting for an exiting task
*/
-void acct_process(long exitcode)
+void acct_process()
{
struct file *file = NULL;

@@ -570,11 +586,10 @@ void acct_process(long exitcode)
get_file(file);
spin_unlock(&acct_globals.lock);

- do_acct_process(exitcode, file);
+ do_acct_process(file);
fput(file);
}

-
/**
* acct_update_integrals - update mm integral fields in task_struct
* @tsk: task_struct for accounting
--- linux-2.6.17/kernel/fork.c 2006-06-18 10:49:35.000000000 +0900
+++ linux-2.6.17.kg/kernel/fork.c 2006-06-19 16:16:30.000000000 +0900
@@ -871,6 +871,7 @@ static inline int copy_signal(unsigned l
tsk->it_prof_expires =
secs_to_cputime(sig->rlim[RLIMIT_CPU].rlim_cur);
}
+ acct_init_pacct(&sig->pacct, tsk);

return 0;
}
--- linux-2.6.17/kernel/exit.c 2006-06-18 10:49:35.000000000 +0900
+++ linux-2.6.17.kg/kernel/exit.c 2006-06-19 16:14:31.000000000 +0900
@@ -895,8 +895,8 @@ fastcall NORET_TYPE void do_exit(long co
if (group_dead) {
hrtimer_cancel(&tsk->signal->real_timer);
exit_itimers(tsk->signal);
- acct_process(code);
}
+ acct_collect(code, group_dead);
if (unlikely(tsk->robust_list))
exit_robust_list(tsk);
#ifdef CONFIG_COMPAT
@@ -907,6 +907,8 @@ fastcall NORET_TYPE void do_exit(long co
audit_free(tsk);
exit_mm(tsk);

+ if (group_dead)
+ acct_process();
exit_sem(tsk);
__exit_files(tsk);
__exit_fs(tsk);
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/