Statistics automatically created each day

Discussion of the Meteohub software package

Moderator: Mattk

pvet
Fresh Boarder
Posts: 6
Joined: Wed Dec 08, 2010 3:59 pm
Location: France

Statistics automatically created each day

Post by pvet »

Hi all,

I would like to keep statistics about the day every... day ;)

So, I have created an HTML template with all the day1_* variables,

and a "graph upload" rule with a special name like "raw_day_%d_%m_%Y.html".

Now, I want to be sure I have set my individual schedule correctly:

I have entered '59 23 * * *'

I understand that I'm losing the last measurements between 23:59 and the next day (00:00), but can I do it differently?
I suppose all day1_* variables are reset at 00:00 for the new current day, aren't they?

What's your best solution ?

And, following this method, for the statistics of each month, how can I make cron fire in the last minute of the last day of the month? :)
I'm thinking of setting '59 23 28,29,30,31 * *' and checking the load on my Meteohub server during those 4 days.

Thanks in advance.
YJB
Platinum Boarder
Posts: 387
Joined: Thu Feb 19, 2009 5:53 pm
Location: Venhuizen, Netherlands
Contact:

Re: Statistics automatically created each day

Post by YJB »

Hmm,

I'm playing with this as well, and at this point in time I'm not sure what makes sense. At the moment I'm using last24h and not day1.

I also need to do some more testing on when exactly to run the end of day job. It sounds logical to do this at 23:59, but the aggregates (which are the source for the stats) are running every 5 minutes (histeval1), so running the end of day job between 00:00 and 00:04 could be a better choice.

Running the job early in the day would fix your end-of-month problem as well, since it just needs to be scheduled on the 1st of the next month. If you are really looking to run something at the end of the month, you can set it up to run every day from the 28th to the 31st. With your dynamic naming convention (raw_month_%m_%Y.html) it will just overwrite the previous one if the month doesn't end on the 28th/29th/30th.
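A crontab sketch of the two approaches just described (the script path and name are placeholders; only the schedule fields matter here):

```
# Run early on the 1st of the next month, after the daily aggregates:
4 0 1 * * /home/meteohub/monthly_stats.sh

# Or run at 23:59 on days 28-31; with a raw_month_%m_%Y.html style name,
# a later run simply overwrites an earlier one, so only the true
# end-of-month result survives.
59 23 28-31 * * /home/meteohub/monthly_stats.sh
```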

For my own reference, I'm trying to document things a bit; see http://weather.westerkerkweg.nl/wxmeteohub.php. If you have any additional observations, please share.
wfpost
Platinum Boarder
Posts: 591
Joined: Thu Jun 12, 2008 2:24 pm
Location: HONSOLGEN
Contact:

Re: Statistics automatically created each day

Post by wfpost »

59 23 * * * [ `date +\%d` -eq `echo \`cal\` | awk '{print $NF}'` ] && yourscript

or

59 23 * * * [ `date -d tomorrow +\%d` -eq '01' ] && yourscript
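For reference: the first one-liner flattens `cal`'s output with `echo` so that awk's last field is the last day of the month; the second simply asks whether tomorrow is the 1st. Here is the same end-of-month test written as a plain script rather than a crontab one-liner (a sketch; GNU date assumed, and note that inside a crontab entry any literal % must be escaped as \%):

```shell
#!/bin/sh
# Run the monthly job only on the last day of the month:
# tomorrow's day-of-month is 01 exactly on the last day.
if [ "$(date -d tomorrow +%d)" -eq 1 ]; then
    echo "last day of the month: run the monthly upload"
fi
```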
Last edited by wfpost on Wed Dec 08, 2010 5:40 pm, edited 1 time in total.
pvet
Fresh Boarder
Posts: 6
Joined: Wed Dec 08, 2010 3:59 pm
Location: France

Re: Statistics automatically created each day

Post by pvet »

YJB wrote:Hmm,
It sounds logical to do this at 23:59, but the aggregates (which are the source for the stats) are running every 5 minutes (histeval1), so running the end of day job between 00:00 and 00:04 could be a better choice.
Thanks YJB for your advice.
I'm happy to see I'm not alone on this subject ;)
I like your suggestion to use last24h_* variables and schedule this at 00:04. I need to test it.

A new problem appears: the file name will be wrong, because the day has passed :) If I schedule an upload after 00:00, the %d changes...
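As a possible workaround (untested, just a sketch): if the file name were built by a shell script instead of Meteohub's %-placeholders, a job running just after midnight could date the file from "yesterday" (GNU date assumed; the pattern mirrors the raw_day template above):

```shell
#!/bin/sh
# Label the file with the day it summarizes, not the day the job runs:
# at 00:04 on the 9th this still yields the 8th.
fname="raw_day_$(date -d yesterday +%d_%m_%Y).html"
echo "$fname"
```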

What do you make of this line in the crontab?
5 2 * * * /usr/bin/nice /home/meteohub/wmr928eval -c day1 -t -200000 -w /data/weather/ -s /home/meteohub/wmr928eval.conf

Is 'histeval1' in charge of resetting day1 for a new day?
This action at 2:05 could do the job, what do you think? In this case, day1_* should contain all the data from the previous day (00:00 to 00:00) until 2:05.
YJB wrote:Running the job early in the day would fix your end-of-month problem as well, since it just needs to be scheduled on the 1st of the next month. If you are really looking to run something at the end of the month, you can set it up to run every day from the 28th to the 31st. With your dynamic naming convention (raw_month_%m_%Y.html) it will just overwrite the previous one if the month doesn't end on the 28th/29th/30th.
If I follow your previous paragraph, histeval2 is in charge of evaluating month1_* and it's scheduled every 6 hours (+ 13 min.)... So my method could lose up to 6 hours of measurements if scheduled too soon (and there are no lastmonth variables this time).
YJB wrote:For my own reference, I'm trying to document things a bit; see http://weather.westerkerkweg.nl/wxmeteohub.php. If you have any additional observations, please share.
No problem. All seems good for me.

Thanks wfpost for your solution. If I apply it, I need to change the crontab manually, and for the moment I prefer to avoid that ;)
YJB
Platinum Boarder
Posts: 387
Joined: Thu Feb 19, 2009 5:53 pm
Location: Venhuizen, Netherlands
Contact:

Re: Statistics automatically created each day

Post by YJB »

pvet wrote: A new problem appears: the file name will be wrong, because the day has passed :) If I schedule an upload after 00:00, the %d changes...
Well, I'm not absolutely sure about that; you will need to try. For whatever reason (I've tried this in the past), I've seen instances where it was still getting the value of the previous day. Instead of a custom schedule I was using 'daily'; that job was running 18 minutes past the hour and reported the previous day's timestamp. Give it a try and report back.
pvet wrote: What do you make of this line in the crontab?
5 2 * * * /usr/bin/nice /home/meteohub/wmr928eval -c day1 -t -200000 -w /data/weather/ -s /home/meteohub/wmr928eval.conf
It doesn't fill the aggregates, since that is done by the histeval[0-3] jobs:
-H cumulated data of the past
-S cumulated data as a sequence

It does generate the sensor-day1 files.
Looking at the output, it seems to summarize the daily minimum, maximum, average, etc. for each sensor, so this file might be worth exploring. I assume those values are reflected in Meteohub variables, but I'm not sure which ones. Once again, this needs investigation.
pvet wrote: Is 'histeval1' in charge of resetting day1 for a new day?
This action at 2:05 could do the job, what do you think? In this case, day1_* should contain all the data from the previous day (00:00 to 00:00) until 2:05.
No, histeval is just running the aggregations, but the 02:05 job might be doing this.
pvet wrote: If I follow your previous paragraph, histeval2 is in charge of evaluating month1_* and it's scheduled every 6 hours (+ 13 min.)... So my method could lose up to 6 hours of measurements if scheduled too soon (and there are no lastmonth variables this time).
Correct, that's what I think is going to happen.

Disclaimer:
Keep in mind that my comments are based on what I've observed. I didn't write the code, so Boris might have some useful additions to this.
markhiseman
Senior Boarder
Posts: 63
Joined: Wed Nov 05, 2008 4:25 pm
Location: Maidstone, Kent, UK
Contact:

Re: Statistics automatically created each day

Post by markhiseman »

You could upload your data to Weather Underground (http://www.wunderground.com/). This website creates daily statistics and you can download them back to your PC.

E.g. my data at http://www.wunderground.com/weatherstat ... =IKENTEAS3

Click the monthly tab and scroll down.

Mark.
YJB
Platinum Boarder
Posts: 387
Joined: Thu Feb 19, 2009 5:53 pm
Location: Venhuizen, Netherlands
Contact:

Re: Statistics automatically created each day

Post by YJB »

Ok, I did a quick test to give some evidence and show what is going on.

In this case I'm using my data10 sensor, since it makes the investigation easier:
data10 counts (part of) my electricity usage, so it constantly goes up. This means the minimum always falls at the start of a reporting period and the maximum at the end.
Also keep in mind that the timestamps are not exact, because this counter is only incremented once a minute; the same applies to every other sensor that is not reporting every second.

First let's have a look at the start of the time periods:
Sample taken at Wed Dec 8 22:16:42 CET 2010

Code:

day1_data10_valuemin_time 20101208000013 (current day)
hour1_data10_valuemin_time 20101208220004 (current hour)
last15m_data10_valuemin_time 20101208220101
last24h_data10_valuemin_time 20101207221614
last60m_data10_valuemin_time 20101208211619 
month1_data10_valuemin_time 20101201000108 (current month)
year1_data10_valuemin_time 20100101000011 (current year)
So far so good, this is pretty straightforward. As mentioned before, the end of the reporting period is a bit trickier:

Sample taken at Wed Dec 8 22:05:02 CET 2010

Code:

day1_data10_valuemax_time 20101208215857
hour1_data10_valuemax_time 20101208215857
last15m_data10_valuemax_time 20101208215857
last24h_data10_valuemax_time 20101208215857
last60m_data10_valuemax_time 20101208215857
month1_data10_valuemax_time 20101208181117
year1_data10_valuemax_time 20101208053001
At this point we were looking at aggregated data that was 6 minutes old. Also note that month1 and year1 had not been updated since 18:11 and 05:30.

Sample taken at Wed Dec 8 22:05:19 CET 2010

Code:

day1_data10_valuemax_time 20101208220204
hour1_data10_valuemax_time 20101208220204
last15m_data10_valuemax_time 20101208215857
last24h_data10_valuemax_time 20101208215857
last60m_data10_valuemax_time 20101208215857
month1_data10_valuemax_time 20101208181117
year1_data10_valuemax_time 20101208053001
At this point some statistics had already been refreshed (day1, hour1), while the rest were still a work in progress.

Sample taken at Wed Dec 8 22:05:36 CET 2010

Code:

day1_data10_valuemax_time 20101208220204
hour1_data10_valuemax_time 20101208220204
last15m_data10_valuemax_time 20101208215857
last24h_data10_valuemax_time 20101208220204
last60m_data10_valuemax_time 20101208220204
month1_data10_valuemax_time 20101208181117
year1_data10_valuemax_time 20101208053001
Another 15 seconds later, all 5-minute interval stats had been updated except the last15m aggregate. This is why I documented the order of the aggregate functions on my web page: just be aware that not everything is updated simultaneously.

Sample taken at Wed Dec 8 22:05:54 CET 2010

Code:

day1_data10_valuemax_time 20101208220204
hour1_data10_valuemax_time 20101208220204
last15m_data10_valuemax_time 20101208220204
last24h_data10_valuemax_time 20101208220204
last60m_data10_valuemax_time 20101208220204
month1_data10_valuemax_time 20101208181117
year1_data10_valuemax_time 20101208053001
Another 15 seconds later, the 5-minute interval aggregation had finished completely.

Also, keep in mind that:
- During some time frames each day, an aggregate might be delayed because another, less frequent aggregate is running.
- The number of sensors will have an impact on the run time of the aggregations; your mileage may vary.
- The platform you are running on will have an influence on the run time of the aggregations.

Last but not least, there is a very good reason for shifting the aggregations around a bit: they are a very CPU-intensive process, and Boris needs to make sure that other processes, like data collection, are not impacted too much by the aggregations.
pvet
Fresh Boarder
Posts: 6
Joined: Wed Dec 08, 2010 3:59 pm
Location: France

Re: Statistics automatically created each day

Post by pvet »

Good tests and analysis.
Your data10 is very useful for this sort of test... I don't have such a sensor; maybe I should make a fake one for more precise results.

My own results :

An upload scheduled at "59 23 * * *" with the name "test_%d-%m-%Y":
Result: the file name is "test_09-12-2010" (!) and:

Code:

day1_localdate 20101209000005
last24h_localdate 20101209000008
...
day1_rain0_total_time 20101208235841 -> the most recent record found after searching the file
last24h_rain0_total_time 20101208235841
So we can verify your theory ;)
Launched at 23:59, the script ended at 00:00:08.
- day1 & last24h can be used, but the last minute of the day is missing, as expected, because the final aggregate has not yet been executed (it is launched in parallel at 00:00, while this script is still running => not very reliable).
- the filename is incorrect.

An upload scheduled at "4 0 * * *" with the name "test_%d-%m-%Y_04":
Result: filename = "test_09-12-2010_04"

Code:

day1_localdate 20101209000505
last24h_localdate 20101209000508
...
day1_rain0_total_time 20101209000410 -> day1 is now 9 Dec. :(
last24h_rain0_total_time 20101209000410 -> argh, it seems an aggregate has already run.
- day1 can't be used, because it now covers the new day.
- Started at 00:04, one minute before a histeval execution, it seems to be too late to use last24h, because the records have already been updated for the new day.

An upload scheduled at '55 1 * * *', just for verification (a schedule before the 2:05 job):

Code:

day1_localdate 20101209015506
last24h_localdate 20101209015509
...
day1_rain0_total_time 20101209015216
last24h_rain0_total_time 20101209015216
Normal behaviour, nothing to say.

So........... it's not so easy :D
I did not think it would be so complicated.

I will try scheduling at 00:03 with last24h: histeval has already started (and finished, I hope) and I have 2 minutes before the next one occurs, but it's not very reliable, and over time (with more and more data) it's probably not the cleverest solution.

---
Just for information: my system is an ALIX.1D (500 MHz) with 78 MB of data out of 2756 MB, and 8 sensors recorded (Oregon WMRS200).